Are new geospatial formats useful (to me?)
File this blog post under thinking in public. It gets in the weeds, way in the weeds. Maybe a few other folks are thinking about the same problem area. Or have a different perspective that’ll be useful to learn. Here goes.
I recently added support for FlatGeobuf, a really neat format that can store basically the same stuff as GeoJSON, but it produces much smaller files, includes a tree index so that you can search for geospatial data within a buffer, and supports random access by storing offsets to data in the file. I can take some tiny credit for a point in this evolution, having written the first specification of Geobuf, before it was totally reinvented and improved by Volodymyr Agafonkin.
So that’s a lot of words to say it’s a new, nice format that advertises a bunch of benefits. There are a whole bunch of these formats - Zarr, Parquet and GeoParquet, Arrow, Cloud Optimized GeoTIFFs. All claim some combination of a compact, or zero-copy, encoding, as well as efficient random access. Some have built-in indexes.
Buried the lede but here it is: do any of the benefits of these new formats translate into wins for any of my applications within my constraints? The answer may be yes, but I have a hunch that it’s closer to no.
Now, these formats are definitely useful to a lot of people, especially those using Python or low-level languages like C++ or Rust. Some of the more established options like Cloud Optimized GeoTIFFs have been a game-changer for people working with satellite data, and those people sing their praises all the time.
From the vantage point of JSON, for example, consider a minimal JSON file containing nothing but a single number (bare numbers are valid JSON). A JSON parser reads that text and allocates a brand-new value in memory. In a zero-copy format, your program would refer to that number itself, in the original data. From the Cap’n Proto documentation:
The Cap’n Proto encoding is appropriate both as a data interchange format and an in-memory representation, so once your structure is built, you can simply write the bytes straight out to disk!
The theoretical answer may be yes - sound off in the comments - but the practical answer, so far, really seems to be no. The demos I’ve seen of zero-copy formats in the browser usually require a lot of copying. So do the libraries for parsing those formats. The FlatGeobuf library, for example, gives you GeoJSON objects, which are fresh allocations rather than views into the original data structure.
There are, possibly, exceptions to this. If you load data into WebGL, I think you can do it straight from a fancy file format. And it seems like James Halliday, in the wildly-underrated peermaps project, has been working on a buffer format that you can just load in WebGL.
The next big benefit of these modern formats is random access, or you might call it “range request” access, because HTTP range requests are the tool that you’ll be using on the web.
For big datasets that you want to store statically as files, there’s a conundrum. Lots of little files can wreak havoc on systems - back when we were handling millions of map tiles at Mapbox on S3, the overhead in just per-request fees, not even bandwidth, was tremendous. See also filesystem limits. Filesystems do not like having millions of files in a folder.
Or you can bundle those files together in a wrapper - like we did with MBTiles, but we did it with SQLite. So now you need a server in between your storage and the user in order to open the big file and pick out the small bits of data that the user was interested in.
A better solution is to bundle the files together in a way that you can access individual bits of data without reading the whole file and without involving a server. Usually you accomplish this by putting an index at the beginning of the file, so a client can request the first, say, 4kb, find all the offsets in the file, and ask for a particular file 1,234 kilobytes in. Lots of file formats use this trick or something similar.
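Here’s a sketch of that index-at-the-front trick over an in-memory byte array. The two-offset layout is invented for illustration - real formats like FlatGeobuf have much richer headers - and a plain `slice` stands in for an HTTP Range request:

```javascript
function makeFile() {
  // Header: two 4-byte little-endian offsets, then two "records".
  const recA = new TextEncoder().encode("record-a");
  const recB = new TextEncoder().encode("record-b");
  const header = new Uint8Array(8);
  const hv = new DataView(header.buffer);
  hv.setUint32(0, 8, true); // record A starts at byte 8
  hv.setUint32(4, 8 + recA.length, true); // record B right after it
  const file = new Uint8Array(8 + recA.length + recB.length);
  file.set(header, 0);
  file.set(recA, 8);
  file.set(recB, 8 + recA.length);
  return file;
}

// Stand-in for an HTTP Range request: fetch only bytes [start, end).
const rangeRead = (file, start, end) => file.slice(start, end);

const file = makeFile();
// 1. Read just the header to learn the offsets.
const header = rangeRead(file, 0, 8);
const dv = new DataView(header.buffer, header.byteOffset, 8);
const startB = dv.getUint32(4, true);
// 2. Read only record B, never touching the rest of the file.
const recB = rangeRead(file, startB, file.length);
console.log(new TextDecoder().decode(recB)); // "record-b"
```

On the web, step 1 becomes a `fetch` with a `Range: bytes=0-7` header, and step 2 another ranged `fetch` - no server logic required, just static storage that honors Range requests.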
Fun fact: this trick, if you do it incorrectly, is incredibly dangerous! If your file reader doesn’t properly check that those file offsets actually fall within the file, it’ll read out-of-bounds memory. This is how the PlayStation Portable and iPhone got hacked.
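The fix is boring but essential: validate every offset and length before following it. A minimal sketch of the check a safe reader needs (names here are made up for illustration):

```javascript
// Refuse to follow an offset/length pair that escapes the file.
// Skipping this check is the bug class described above:
// attacker-controlled offsets pointing outside the data.
function readAt(file, offset, length) {
  if (!Number.isInteger(offset) || !Number.isInteger(length)) {
    throw new RangeError("offset and length must be integers");
  }
  if (offset < 0 || length < 0 || offset + length > file.length) {
    throw new RangeError(
      `range ${offset}..${offset + length} outside file of ${file.length} bytes`
    );
  }
  return file.slice(offset, offset + length);
}

const file = new Uint8Array(16);
readAt(file, 0, 8); // fine
// readAt(file, 12, 8); // throws: would read past the end
```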
Anyway, back to the topic: are random seeks that useful on the web? In some cases, yes. Brandon Liu’s brilliant PMTiles format is what MBTiles should have been. You can throw a big tile index on S3 and use it directly with Range requests, with a few minor caveats. The same isn’t as true with Cloud Optimized GeoTIFFs, mostly because they support multiple projections and aren’t guaranteed to have all the precomputed data you’d need for a web map.
So you usually put titiler or another server between the S3 bucket and the client. You can technically do without the server if you have a particularly well-formed file, like OpenLayers does in a demo, but not with every file.
So, a moderate win. A Cloud Optimized GeoTIFF tiling server is a lot simpler than a traditional map server, but it’s a lot more complicated than no server at all. No matter how simple people say Lambda functions are, I don’t believe them. They aren’t.
Compression & file size
Compression is a tricky thing because compression ratio is so closely tied to encoding/decoding performance. This tradeoff is real - check out codecs like Snappy, which aim for fast compression rather than the best possible compression ratio.
So, is it worth adopting a new format just to get smaller files? No, usually not, because brotli or gzip compression is standard on the web, and the JSON parsers in modern web engines are just really, really, really fast. I can remember multiple projects involving smart people trying to invent something better and faster than JSON and hitting a wall: it’s hard to beat the king.
Where a modern file format could come in handy in terms of size and performance is the WebWorker and WebAssembly boundary. Both provide really nice ways to transfer certain blessed objects, mostly ArrayBuffers, much faster than they transfer other data. If, say, Mapbox GL JS accepted FlatGeobuf as a file format and didn’t have to decode it and send it as GeoJSON, would this constitute a speedup? Possibly!
Finally, that “write heavy” bit. You can optimize a system for reads, writes, or balance the two. Most things that optimize for reads will make writes slower and vice versa. That’s just the facts.
Even, say, a traditional database like Postgres. If you add lots of indexes to your tables, you’ll have really fast queries, but really slow writes because each write updates all the indexes. Sure, you can sometimes defer the index rebuilds, but the work is still there.
A lot of these new formats are really read-oriented. Columnar formats like Apache Arrow or Parquet are extremely slow to update, because updating one “record” means updating every column it spans, and each column lives in a different region of memory. With row-oriented databases, the data for each row sticks together.
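A toy sketch of why that is - the same two records stored row-wise and column-wise, and what one update has to touch in each layout:

```javascript
// Row-oriented: each record is one contiguous object.
const rows = [
  { id: 0, name: "a", population: 10 },
  { id: 1, name: "b", population: 20 },
];

// Column-oriented: one array per field, records sliced across them.
const columns = {
  id: [0, 1],
  name: ["a", "b"],
  population: [10, 20],
};

// Row update: rewrite one object, in one place.
rows[1] = { id: 1, name: "b2", population: 25 };

// Columnar update: touch every column array the record spans.
columns.name[1] = "b2";
columns.population[1] = 25;
```

With three fields this is trivial; with dozens of columns stored compressed in separate byte ranges of a file, a single-record write fans out across the whole thing.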
And then remember those indexes that are so crucial for allowing random reads into these formats? They’re also the reason why most explicitly don’t support random writes, because a write might change the index and the offsets of all the data, wreaking havoc.
I wrote this in a fury, like I write everything, but not one of anger. These formats are awesome.
But I think this is a situation where I should look long and hard and figure out how they fit. Their usefulness for static data on servers, and their usefulness to data scientists doing analysis on read-only data, is enormous, and should be celebrated.
But it’s always tempting to read fast, efficient, new as broad attributes of software - to assume you can add a little of the new stuff and immediately get the effects, plus the shine of being on the bleeding edge. It doesn’t really work that way. Lots of applications make bad tradeoffs, like running really quickly after startup but having a long startup time… and restarting often (see also: cold starts). Or doing unnecessary compression that’s overridden later, or that takes longer to compress than it saves in transit. A lot of fancy optimizations do nothing, and some do worse than nothing.
So I’m not sure. I like these new formats and I’ll support them, but do they benefit a use case like Placemark? If you’re on the web, and have data that you expect to update pretty often, are there wins to be had with new geospatial formats? I’m not sure. Sound off in the comments.