Datasette - PyBay 2018

Datasette - Simon Willison @simonw - PyBay 2018

Slides and notes from my presentation on Datasette given at PyBay 2018 on 19th August 2018.

Data Journalism

I designed Datasette based on my experence working as a data journalist at the Guardian newspaper in London.

Screenshot of the Guardian Data Blog

One of the projects I was involved with at the Guardian was the Guardian Data Blog.

Simon Rogers was the journalist most responsible for gathering the data used for infographics in the newspaper. We decided to start publishing the raw data on a blog.

Screenshot of a Google Spreadsheet

After some consideration of our options, we ended up chosing Google Spreadsheets as a publishing tool.

Screenshot of the Guardian Datastore Flickr Group

This worked really well: it was easy to publish data, and a community of readers grew up around the data that was being published who built their own analyses and visualizations on top of the data that was shared.

We even had a Flickr group where people shared their work.

A better way of publishing data?

But was Google Sheets really the best way to publish this kind of data? I wanted something better. Last year I realized that the pieces were now available to solve this problem in a much more elegant way.

Datasette - a tool for exploring and publishing data - datasette.io

GitHub: simonw/datasette

Documentation: datasette.readthedocs.io

Updates on my blog: simonwillison.net/tags/datasette/

FiveThirtyEight article: We Researched Hundreds Of Races. Here’s Who Democrats Are Nominating.

Let's start with an example. FiveThirtyEight publish a large number of high quality data journalism articles.

Recent example: We Researched Hundreds Of Races. Here’s Who Democrats Are Nominating.

Same FiveThirtyEight article - an arrow highlights the Get the data on GitHub link

And they have a link... Get the data on GitHub

A FiveThirtyEight CSV file on GitHub

The fivethirtyeight/data repository on GitHub is an incredible collection of over 400 CSV files containing the data behind their stories.

A screenshot of the corresponding table in Datasette

I have a daily Travis CI job which grabs their repository, converts the CSV files to SQLite and deploys them to a Datasette instance running at fivethirtyeight.datasettes.com.

Here's the data from that story.

Datasette's faceted browse interface

One of the most interesting features of Datasette is its ability to add faceted browse to any table.

Here I'm faceting the candidates by state, office_type, won_primary, primary_status and Race

Some handy features: Filtering and faceting, Custom SQL queries, JSON API to everything, Export table (or query results) as CSV

I also demonstrated the ability to apply fiters, export the data as JSON or CSV and execute custom SQL queries (the results of which can also be exported).

The secret sauce is SQLite. Small, fast, ubiquitous. A database is a single file. Doesn't scale well for writes... but who cares if your data is read-only? Ship your data and code in the same container!

SQLite is an incredibly embedded database library.

It runs basically everywhere. It's in my watch, on my phone, on my loptop, embedded in a huge variety of applications.

It has a bunch of interesting characteristics which Datasette takes full advantage of.

pip install datasette - Python 3 only

You can install the Datasette command-line utility from PyPI.

find ~/Library -iname '*.sqlite*'

I said SQLite was ubiquitous: let's prove it.

Run find ~/Library -iname '*.sqlite*' to see some of the SQLite databases that applications have created on your OS X laptop.

I get several hundred! Firefox, Chrome and Safari all use SQLite for bookmarks and history, and applications that use it include Evernote, 1Password and many others.

find ~/Library -iname '*.sqlite*' -type f -exec du -h {} + | sort -r -h

find ~/Library -iname '*.sqlite*' -type f -exec du -h {} + | sort -r -h

This command sorts those database files by the largest first. On my machine, the largest SQLite database is:

49M /Users/simonw/Library/Containers/com.apple.photoanalysisd/Data/Library/Caches/CLSBusinessCategoryCache.sqlite

CLSBusinessCategoryCache.sqlite?

I ran datasette against this file and showed how it allowed me to explore it via https://127.0.0.1:8001/

datasette /Users/simonw/Library/Containers/com.apple.photoanalysisd/Data/Library/Caches/CLSBusinessCategoryCache.sqlite

Since I had the datasette-cluster-map plugin installed I could run the custom SQL query select *, zlatitude as latitude, zlongitude as longitude from ZENTRY to see everywhere I have been in the past few years rendered on a map.

Creating SQLite databases

So that's how to explore an existing SQLite database... but how can we go about creating them?

csvs-to-sqlite

csvs-to-sqlite is a simple command-line tool I wrote which converts CSV files into a SQLite database (using Pandas under the hood).

But where can we get interesting CSV data?

Screenshot of opendatainception.io - a map of 2,600 open data portals

It turns out that the worldwide campaign for more open data and transparency in government has worked beyond our wildest dreams.

This website lists over 2,600 open data portals run by local governments around the world!

Screenshot of the San Francisco Data Portal

San Francisco's open data portal is at datasf.org. It has 469 datasets!

Here's a recent example (updated in May): it's a list of all of the buildings and facilities owned by the city.

csvs-to-sqlite City_Facilities.csv facilities.db

I exported the data as CSV and used csvs-to-sqlite to convert it into a SQLite database:

csvs-to-sqlite City_Facilities.csv facilities.db

Then I used Datasette to browse the file locally:

datasette facilities.db

datasette publish now facilities.db

How can we publish it to the internet?

The datasette publish subcommand (documented here) knows how to upload a database to a hosting platform and configure Datasette to run against that file.

It currently works with both Zeit Now and Heroku, and further publishers can be supported via a plugin hook.

datasette publish now facilities.db

This outputs the URL to the new deployment. A few minutes later the instance is available for anyone to browse or build against.

Datasette Publish - https://publish.datasettes.com/

What if you don't want to install any software?

Datasette Publish is a web application which lets you authenticate against your (free) Zeit Now hosting account, then upload CSVs and convert and deploy them using Datasette.

Crucially, the deployment happens to our own hosting account: I'm not responsible for hosting your data.

More about Datasette Publish on my blog.

Screenshot of the Datasette Publish interface

I used Datasette Publish to upload the City_Facilities.csv file and deploy it along with the datasette-cluster-map plugin.

Here's the demo I deployed: san-francisco-facilities.now.sh.

sqlite-utils

What if your data isn't a CSV?

Recently, I've been working on a new package called sqlite-utils (documentation here) which aims to make creating new SQLite databases as productive as possible.

Building a database of PyPI packages in Jupyter

It's designed to work well in Jupyter notebooks.

Screenshot of the notebook

I demonstrated a notebook which pulls the list of the top downloaded PyPI packages from this site, then pulls package details from PyPI's JSON API and creates a SQLite database combining data from the two sources.

Here's the notebook I used.

Datasette Plugins

Datasette supports plugins.

datasette-cluster-map plus screenshot

I've already demonstrated this plugin a couple of times.

It looks for any columns named latitude and longitude and, if it finds any, it renders those points on a map.

simonw/datasette-cluster-map

pip install datasette-cluster-map

If you install the plugin globally (or into a virtual environment) it will load any time you run Datasette.

datasette publish now mydb.db --install=datasette-cluster-map

You can pass the name of a plugin to the datasette publish --install option to cause it to be installed with that deployment.

The Datasette Publish web app allows you to deploy specific plugins by clicking their checkbox.

datasette-vega plus screenshot

Datasette Vega is a plugin that uses the excellent Vega visualization library to enable charts against any Datasette table or query.

What's next?

I have a whole load of plans for improvements to Datasette and the Datasette ecosystem.

Better many-to-many relationship support

The next feature I'm working on is better support for many-to-many relationships. You can track that here.

GIS with SpatiaLite

SpatiaLite is a powerful extension for SQLite that adds comprehensive support for GIS features such as polygon operations and spatial indexing.

I wrote more about it in Building a location to time zone API with SpatiaLite, OpenStreetMap and Datasette. I'm excited about using it to further extend Datasette's geospatial abilities.

SQLite 3.25 window functions

SQLite 3.25 is due out in September and will be adding support for window functions, a powerful set of tools for running complex analytical queries, especially against time series data.

I'm excited about exploring these further in Datasette. For the moment I have a demo running at pysqlite3-datasette.now.sh.

Datasette 1.0

I don't have a timeline for Datasette 1.0 yet, but it will represent the point at which I consider the Plugin mechanism to be stable enough that people can write plugins without fear of them breaking with future (non-2.0) updates.

So please, take a look at the plugin mechanism and let me know if it supports your needs!

The world is full of interesting data...
... let's publish it in the most useful way possible

The more people using Datasette and giving me feedback on how it can improve the better.

simonwillison.net - @simonw on Twitter

simonwillison.net

@simonw