Beyond indexes: How open table formats optimize query performance (jack-vanlightly.com)
95 points by jandrewrogers 55 days ago | 3 comments


A lot of upvotes, but no discussion :)

How would you approach migrating a ~5 TB OLTP database that fundamentally contains analytical time-series data? I’d think e.g. Apache Iceberg could be a better data store, and it would make writing much easier (almost just dump a Parquet file in there).

It’s exposed to the outside world via APIs


That 5 TB of data will probably be 300-400 GB in Parquet. Try to denormalise the data into a few datasets, or just one dataset if you can.
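A rough sketch of that denormalise-and-dump step using DuckDB and its Postgres extension (the connection string, table and column names below are made up for illustration):

    # Sketch: export a denormalised OLTP table to Parquet with DuckDB.
    # Connection string and table/column names are illustrative only.
    import duckdb

    con = duckdb.connect()
    con.sql("INSTALL postgres")
    con.sql("LOAD postgres")
    con.sql("ATTACH 'dbname=metrics host=localhost user=app' AS pg (TYPE postgres)")

    # Join the normalised tables once, sort by time so row-group stats
    # line up with typical time-range filters, and write compressed Parquet.
    con.sql("""
        COPY (
            SELECT r.ts, r.sensor_id, s.site, r.value
            FROM pg.readings r
            JOIN pg.sensors s USING (sensor_id)
            ORDER BY r.ts
        )
        TO 'readings.parquet' (FORMAT parquet, COMPRESSION zstd);
    """)

Sorting by the column you filter on most is what makes the row-group min/max stats useful later.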

DuckDB querying the data should be able to return results in milliseconds if the smaller columns are being used, and better still if the row-group stats can be used to answer queries.
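On the query side it looks something like this (file and column names assumed from the sketch above); DuckDB skips any row group whose min/max stats fall outside the WHERE clause, so a time-range query only touches a fraction of the file:

    # Sketch: a time-range query that benefits from row-group pruning.
    import duckdb

    result = duckdb.sql("""
        SELECT sensor_id, avg(value) AS avg_value
        FROM read_parquet('readings.parquet')
        WHERE ts >= TIMESTAMP '2024-06-01'
          AND ts <  TIMESTAMP '2024-06-02'
        GROUP BY sensor_id
    """).fetchall()

Running the same statement under EXPLAIN ANALYZE will show how many row groups were actually read.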

You can host those Parquet files on a local disk or S3. A local disk might be cheaper if this is exposed to the outside world, and it also gives you a price ceiling on hosting.

If you have a Parquet file with billions of records and row groups numbering in the thousands, then hosting on something like Cloudflare where there is a per-request charge could get a bit expensive if this is a popular dataset. At a minimum, DuckDB will look at the stats for each row group for any column involved in a query. It might be cheaper just to pay for 400 GB of storage with your hosting provider.
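One lever for that is the row-group size at write time: fewer, larger row groups mean fewer stat lookups and range requests per query, at the cost of coarser pruning. A sketch, reusing the file name from above (the 1,000,000-row figure is illustrative, not a recommendation):

    # Sketch: rewrite the file with larger row groups so a remote reader
    # issues fewer per-request reads. Figures and file names are assumptions.
    import duckdb

    duckdb.sql("""
        COPY (SELECT * FROM read_parquet('readings.parquet') ORDER BY ts)
        TO 'readings_big_groups.parquet'
        (FORMAT parquet, ROW_GROUP_SIZE 1000000, COMPRESSION zstd);
    """)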

There is a project to convert OSM to Parquet every week, and we had to look into some of these issues: https://github.com/osmus/layercake/issues/22


Great article.

Any other resources that provide a comprehensive treatment of open table formats?



