
Thanks for the analysis of their benchmark. I wanted to view the details for myself, but it required creating an account on their page.

> There is no superior "compression" technology

Isn't it feasible to employ special encoding for time series data? For example, to encode a series of timestamps like 1473333629, 1473333630, 1473333631 you could encode it as 1473333629, +1, +2 (where +1, +2 are encoded in one byte). And there are many cases of such metrics with adjacent values, like averages, counters.
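For the curious, the delta scheme described above can be sketched in a few lines of Python (a toy illustration, not any particular database's on-disk format):

```python
def delta_encode(timestamps):
    # Store the first timestamp in full, then only the differences.
    return [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]

def delta_decode(deltas):
    # Rebuild the original series by running a cumulative sum.
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

delta_encode([1473333629, 1473333630, 1473333631])  # → [1473333629, 1, 1]
```

The small deltas (+1, +1) can then each be packed into a single byte, whereas the raw timestamps need four bytes apiece.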



Yes, schemes like the delta encoding you described (and other fancy coding schemes such as bit packing, varints, RLE, or combinations thereof) are frequently employed in columnar storage formats and databases. Columnar storage is basically a generalization that allows one to apply these optimizations to all kinds of data (not just time series). One popular open-source implementation of columnar storage that I am not affiliated with is https://parquet.apache.org/.
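To make one of those schemes concrete, here is a minimal LEB128-style varint encoder in Python (a hypothetical sketch of the general technique, not Parquet's actual implementation):

```python
def varint_encode(n):
    # 7 payload bits per byte; the high bit flags "more bytes follow".
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit set
        else:
            out.append(byte)
            return bytes(out)

# A small delta like +1 fits in one byte, while the full
# 31-bit timestamp from the example above needs five bytes.
len(varint_encode(1))           # → 1
len(varint_encode(1473333629))  # → 5
```

Combined with delta encoding, this is why a column of adjacent timestamps or slowly moving counters compresses so well.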

(On the other hand, columnar storage also has a bunch of tradeoffs/downsides so it's not a superior choice for every db product.)

My point about there being no "superior compression technology" was specific to the linked benchmark. That is, the lack of this potential optimization in Cassandra does not appear to be the reason for the space blowup in the benchmark; rather, it's that they're duplicating the series ID for each sample.


A commercial DB that (also) does this is HP Vertica. They tout a 4:1 to 5:1 compression ratio on average; due to the nature of the data the firm I work for stores in it, we get quite a bit better than that. Delta encoding is just one of maybe five different schemes it can use for a given column.


Just so sad that Vertica is proprietary so we can't see how they did it! ;)

On a serious note: please check out EventQL [0] some time. It's very similar to Vertica in some ways and completely open source. It's a new project (beta) and not nearly as mature as Vertica yet, though (still a long way to go).

[0] https://eventql.io/


Facebook does this (and quite a few other tricks) for storing time-series data in Gorilla, their in-memory TSDB (paper: http://www.vldb.org/pvldb/vol8/p1816-teller.pdf), getting down to 1.37 bytes per sample.

Prometheus implemented the Gorilla bits (see https://prometheus.io/blog/2016/05/08/when-to-use-varbit-chu...) and reports getting down to 1.28 bytes per sample on some workloads, though at a cost of increased query latencies.
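The XOR trick at the heart of Gorilla's value compression can be sketched like this (a simplified illustration of the idea only; the real encoder also bit-packs the leading/trailing zero counts and uses delta-of-delta timestamps):

```python
import struct

def float_bits(x):
    # Reinterpret an IEEE-754 double as a 64-bit unsigned integer.
    return struct.unpack('>Q', struct.pack('>d', x))[0]

def xor_deltas(values):
    # Gorilla stores the first value verbatim, then the XOR of each
    # value's bit pattern with the previous one. Identical samples
    # XOR to 0 and can be stored as a single bit; slowly changing
    # values share sign, exponent, and leading mantissa bits, so the
    # XOR has long runs of leading/trailing zeros to elide.
    bits = [float_bits(v) for v in values]
    return [bits[0]] + [a ^ b for a, b in zip(bits, bits[1:])]

deltas = xor_deltas([12.0, 12.0, 12.0, 24.0])
# deltas[1] and deltas[2] are 0: repeated gauge readings are nearly free.
```

This is why unchanging or slowly drifting metrics compress to close to a bit per sample, while the quoted 1.37 B figure is an average over real workloads.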



