Bikeshare data notes
Shop notes from working on my project about bikeshare data:
- The 'cool kid' formats are Zarr, DuckDB, and Parquet. Working with them has been kind of a struggle: you can use DuckDB to query its own database format, or to query data in Parquet files. The two options are sort of in contention - there are reasons to use Parquet with DuckDB and reasons not to. (The first sketch after this list shows both paths side by side.)
- Annoyingly, DuckDB really tailors its data ingestion to a situation in which you already have CSV or Parquet files as input. In this case, I have neither - I have to transform a bunch of nested JSON files into a tabular format. CSV isn't the worst format, but I fear what happens to my data types as they travel between places: it's a bit annoying. "Creating a file of data" really feels like a corner case for these tools… the documentation always includes some example of setting up a DataFrame with a hard-coded array of 10 numbers: what if I have some Python code that's transforming 100 million rows - can I just… build up a big array? So far the answer has been yes, but it feels very wrong, given that the one weird trick of all Python data libraries is to do the internal hard parts in Rust, C++, or Fortran. (The second sketch after this list is roughly the shape of what I've been doing.)
- The tools for working with this stuff overlap a lot: you can write a Parquet file with Polars, or pyarrow, or Pandas (the third snippet below is the same write in all three). I've been using Polars because it's the cool new thing. That could be a mistake, but it hasn't bitten me yet.
- Right in the middle of rebuilding on Parquet, the folks at Earthmover released Icechunk, their storage engine for multi-dimensional arrays. I've been tinkering with Xarray, too, as an option, and Icechunk seems like a natural fit for that path (there's a rough Xarray sketch after this list).
- Overall, I find it kind of frustrating how little detail this data science tooling gives about how access patterns work. Like, when I use ClickHouse or Postgres, there's pretty good documentation and third-party writing about how fast it is to query on one column or another, how indexes work, and so on. It's harder to find that kind of documentation for this stack: how do I save a Parquet file that's easy to partially read based on a range of one column? I know that some of my queries are slow in DuckDB, but how do I make them faster? It's a bit of a mystery - still learning there. (The last sketch below is my best guess so far.)
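
A minimal sketch of the two DuckDB paths from the first note - the table name, column, and file paths (`trips`, `duration_sec`, `trips/*.parquet`) are made up for illustration:

```python
import duckdb

# Path 1: import into DuckDB's own database file, then query the table.
con = duckdb.connect("trips.duckdb")
con.execute(
    "CREATE TABLE IF NOT EXISTS trips AS SELECT * FROM read_parquet('trips/*.parquet')"
)
print(con.execute("SELECT count(*) FROM trips WHERE duration_sec > 3600").fetchone())

# Path 2: skip the import and query the Parquet files in place.
print(
    duckdb.sql(
        "SELECT count(*) FROM read_parquet('trips/*.parquet') WHERE duration_sec > 3600"
    )
)
```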
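For the nested-JSON problem, this is roughly the shape of what's been working: flatten the JSON in plain Python, build a Polars DataFrame, and let DuckDB scan it in-process - no CSV in between. The paths and field names here (GBFS-ish station status keys) are stand-ins, not my actual schema:

```python
import glob
import json

import duckdb
import polars as pl


def flatten(path: str) -> list[dict]:
    """Pull the nested station records out of one JSON file into flat rows."""
    with open(path) as f:
        doc = json.load(f)
    return [
        {
            "station_id": s["station_id"],
            "bikes": s["num_bikes_available"],
            "docks": s["num_docks_available"],
            "ts": doc["last_updated"],
        }
        for s in doc["data"]["stations"]
    ]


rows: list[dict] = []
for path in glob.glob("station_status/*.json"):
    rows.extend(flatten(path))

# Dtypes get inferred once, here, instead of surviving a CSV round-trip.
df = pl.DataFrame(rows)
df.write_parquet("station_status.parquet")

# Or skip the file entirely: DuckDB's Python client can scan the Polars
# DataFrame by variable name (via Arrow), no intermediate file needed.
print(duckdb.sql("SELECT ts, sum(bikes) AS bikes FROM df GROUP BY ts ORDER BY ts"))
```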
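The overlap between the Parquet writers is pretty literal - the same tiny write in Polars, pyarrow, and Pandas, modulo the defaults (compression, row group size) each one picks for you:

```python
import pandas as pd
import polars as pl
import pyarrow as pa
import pyarrow.parquet as pq

records = {"station_id": ["a", "b"], "bikes": [3, 7]}

# Polars
pl.DataFrame(records).write_parquet("stations_polars.parquet")

# pyarrow
pq.write_table(pa.table(records), "stations_pyarrow.parquet")

# Pandas (typically delegating to pyarrow under the hood)
pd.DataFrame(records).to_parquet("stations_pandas.parquet")
```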
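And for the array-shaped version of the data, a rough sketch of the Xarray-to-Zarr path. The dimensions and variables are invented, and I've left Icechunk out of the snippet because I haven't pinned down its API yet - it would slot in as the store that `to_zarr` writes into:

```python
import numpy as np
import pandas as pd
import xarray as xr

# A made-up station x time grid of bike availability.
stations = ["a", "b", "c"]
times = pd.date_range("2024-06-01", periods=48, freq="h")
bikes = np.random.randint(0, 20, size=(len(stations), len(times)))

ds = xr.Dataset(
    {"bikes": (("station", "time"), bikes)},
    coords={"station": stations, "time": times},
)

# Chunk along time so a reader can grab a time range without the whole array.
ds.to_zarr(
    "station_status.zarr",
    mode="w",
    encoding={"bikes": {"chunks": (len(stations), 24)}},
)
```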
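Finally, the closest I've gotten to an answer on partial reads by a column range: sort by that column before writing, keep row groups smallish, and let the reader skip row groups using the min/max statistics in the Parquet footer. Column names and sizes here are made up, and I'm not sure this is the whole story:

```python
import duckdb
import polars as pl

df = pl.read_parquet("trips/*.parquet")

# Sort by the column I expect to filter on, and keep row groups small enough
# that a range predicate only touches a few of them.
df.sort("started_at").write_parquet("trips_by_start.parquet", row_group_size=100_000)

# DuckDB can use the per-row-group min/max statistics to skip most of the file
# for a range query; EXPLAIN ANALYZE shows how much work actually happened.
print(
    duckdb.sql(
        """
        EXPLAIN ANALYZE
        SELECT count(*)
        FROM read_parquet('trips_by_start.parquet')
        WHERE started_at BETWEEN '2024-06-01' AND '2024-06-07'
        """
    )
)
```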