
Thanks for the analysis of their benchmark. I wanted to view the details for myself, but it required creating an account on their page.

> There is no superior "compression" technology

Isn't it feasible to employ special encoding for time series data? For example, to encode a series of timestamps like 1473333629, 1473333630, 1473333631 you could encode it as 1473333629, +1, +2 (where +1, +2 are encoded in one byte). And there are many cases of such metrics with adjacent values, like averages, counters.
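For the curious, the delta scheme described above can be sketched in a few lines of Python (a toy illustration, not any particular database's on-disk format):

```python
def delta_encode(timestamps):
    # Store the first timestamp in full, then only the differences.
    return [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]

def delta_decode(deltas):
    # Rebuild the original series by running a cumulative sum.
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

delta_encode([1473333629, 1473333630, 1473333631])  # → [1473333629, 1, 1]
```

The small deltas (+1, +1) can then each be packed into a single byte, whereas the raw timestamps need four bytes apiece.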



Yes, schemes like the delta encoding you described (and other fancy coding schemes such as bit packing, varints, RLE, or combinations thereof) are frequently employed in columnar storage formats and databases. Columnar storage is basically a generalization that allows one to apply these optimizations to all kinds of data (not just time series). One popular open-source implementation of columnar storage that I am not affiliated with is https://parquet.apache.org/.
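To make one of those schemes concrete, here is a minimal LEB128-style varint encoder in Python (a hypothetical sketch of the general technique, not Parquet's actual implementation):

```python
def varint_encode(n):
    # 7 payload bits per byte; the high bit flags "more bytes follow".
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit set
        else:
            out.append(byte)
            return bytes(out)

# A small delta like +1 fits in one byte, while the full
# 31-bit timestamp from the example above needs five bytes.
len(varint_encode(1))           # → 1
len(varint_encode(1473333629))  # → 5
```

Combined with delta encoding, this is why a column of adjacent timestamps or slowly moving counters compresses so well.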

(On the other hand, columnar storage also has a bunch of tradeoffs/downsides so it's not a superior choice for every db product.)

My point about there being no "superior compression technology" was specific to the linked benchmark. That is, the lack of this potential optimization in Cassandra does not appear to be the reason for the space blowup in the benchmark; rather, it's that they're duplicating the series ID for each sample.


A commercial DB that (also) does this is HP Vertica. They tout a 4:1 to 5:1 compression ratio on average; due to the nature of the data the firm I work for stores in it, we get quite a bit better than that. Delta encoding is just one of maybe five different schemes it can use for a given column.


Just so sad that Vertica is proprietary so we can't see how they did it! ;)

On a serious note: please check out EventQL [0] some time. It's very similar to Vertica in some ways and completely open source. It's a new project (beta) and not nearly as mature as Vertica yet, though (still a long way to go).

[0] https://eventql.io/


Facebook does this (and quite a few other tricks) for storing time-series data in Gorilla, their in-memory TSDB (paper: http://www.vldb.org/pvldb/vol8/p1816-teller.pdf), getting down to 1.37 bytes per sample.

Prometheus implemented the Gorilla bits (see https://prometheus.io/blog/2016/05/08/when-to-use-varbit-chu...) and reports getting down to 1.28 bytes per sample on some workloads, though at a cost of increased query latencies.
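The XOR trick at the heart of Gorilla's value compression can be sketched like this (a simplified illustration of the idea only; the real encoder also bit-packs the leading/trailing zero counts and uses delta-of-delta timestamps):

```python
import struct

def float_bits(x):
    # Reinterpret an IEEE-754 double as a 64-bit unsigned integer.
    return struct.unpack('>Q', struct.pack('>d', x))[0]

def xor_deltas(values):
    # Gorilla stores the first value verbatim, then the XOR of each
    # value's bit pattern with the previous one. Identical samples
    # XOR to 0 and can be stored as a single bit; slowly changing
    # values share sign, exponent, and leading mantissa bits, so the
    # XOR has long runs of leading/trailing zeros to elide.
    bits = [float_bits(v) for v in values]
    return [bits[0]] + [a ^ b for a, b in zip(bits, bits[1:])]

deltas = xor_deltas([12.0, 12.0, 12.0, 24.0])
# deltas[1] and deltas[2] are 0: repeated gauge readings are nearly free.
```

This is why unchanging or slowly drifting metrics compress to close to a bit per sample, while the quoted 1.37 B figure is an average over real workloads.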



