
I don't see a problem with them expressing the ratio as a decimal, since it becomes a simple multiplier of the original file size: 38GB x 0.3.

But it's downright misleading to show the vertical axis from something other than 0.0 to 1.0 when comparing ratios. They start it at 0.2. In reality, LZJB is saving 50% of the space whereas gzip saves 70%. But a naive glance at the graph implies gzip looks roughly 3 times smaller/better than LZJB.

Classic "How to Lie with Statistics" stuff.* I would have expected better from an "analytics" database.

* Not saying they intend to lie here but it's representative of the classic text https://en.wikipedia.org/wiki/How_to_Lie_with_Statistics
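To put numbers on the axis complaint, here is a rough sketch of how a truncated axis changes the apparent gap between two bars. The ratios 0.5 and 0.3 are inferred from the 50% and 70% savings quoted above, and the function names are made up for illustration.

```python
def drawn_height(value, axis_min=0.0):
    """Height of a bar as actually drawn when the axis starts at axis_min."""
    return value - axis_min


def apparent_advantage(larger, smaller, axis_min=0.0):
    """How many times taller the larger bar looks than the smaller one."""
    return drawn_height(larger, axis_min) / drawn_height(smaller, axis_min)


# Assumed compressed/uncompressed ratios, derived from the savings in the comment.
LZJB_RATIO = 0.5  # saves 50%
GZIP_RATIO = 0.3  # saves 70%

honest = apparent_advantage(LZJB_RATIO, GZIP_RATIO, axis_min=0.0)
truncated = apparent_advantage(LZJB_RATIO, GZIP_RATIO, axis_min=0.2)

print(f"axis from 0.0: LZJB bar is {honest:.2f}x the gzip bar")
print(f"axis from 0.2: LZJB bar is {truncated:.2f}x the gzip bar")
```

With the axis starting at 0.0 the LZJB bar is about 1.67 times the gzip bar; starting at 0.2 it is about 3 times, which is the "roughly 3 times" impression described above.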



Author here. Believe it or not, I originally had the compression ratio graph rotated 90 degrees, and had manually modified it to run from 0.00 to 1.00. Google Docs for some god-awful reason insists on starting at 0.2 by default. Anyway, when my colleagues reviewed a draft of this post they requested that I rotate the graph back, and in the process I forgot to reset the scale. Sorry for the confusion. It's fixed now. As for the definition of "compression ratio", I looked this up and went with the definition found here: http://en.wikipedia.org/wiki/Data_compression_ratio

I agree that it's kind of counterintuitive.


Perhaps "file size on disk" would be an unambiguous way to put it.


If you read in any other article something like the following: "Taking Product X as having a baseline compression ratio of 1, Product Y had a compression ratio of 0.5 and Product Z had a compression ratio of 0.3", I'm pretty sure 99.9999% of the HN population would interpret that as Products Y and Z having worse compression than X, not better. That's my point.


This academic-looking paper (first hit I tried from Wikipedia) gives the standard definition of "compression ratio" as compressed/uncompressed size (section 4.2), consistent with the linked article.

I'm pretty sure your impression of 99.9999% of the HN population is wrong.


Link?

OK, found this: http://en.wikipedia.org/wiki/Data_compression_ratio

Which includes this section on "Usage of the term": "There is some confusion about the term 'compression ratio', particularly outside academia and commerce. In particular, some authors use the term 'compression ratio' to mean 'space savings', even though the latter is not a ratio; and others use the term 'compression ratio' to mean its inverse, even though that equates higher compression ratio with lower compression."

So, my bad. However, in my practical workplace experience the above (in italics) has been the case, hence the confusion.


Simple rule: If it's under 1.0 or expressed as a percentage under 100%, it's a compression ratio. If it's over 1.0, it's a compression factor.

Otherwise, it's not compressed. :-)
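For anyone who wants the three terms side by side, here is a minimal sketch using this thread's convention (ratio = compressed / uncompressed). The function names are hypothetical, and the 38GB input with a 0.3 ratio is the example from the top comment.

```python
def compression_ratio(uncompressed, compressed):
    """Compressed size as a fraction of the original (under 1.0 when it shrinks)."""
    return compressed / uncompressed


def compression_factor(uncompressed, compressed):
    """Inverse of the ratio (over 1.0 when it shrinks)."""
    return uncompressed / compressed


def space_savings(uncompressed, compressed):
    """Fraction of the original size that was saved."""
    return 1.0 - compressed / uncompressed


original_gb = 38.0
compressed_gb = original_gb * 0.3  # the "simple multiplier" use of the ratio

print(f"size on disk: {compressed_gb:.1f} GB")
print(f"ratio:   {compression_ratio(original_gb, compressed_gb):.2f}")
print(f"factor:  {compression_factor(original_gb, compressed_gb):.2f}")
print(f"savings: {space_savings(original_gb, compressed_gb):.0%}")
```

All three numbers describe the same result, so labeling which one is being plotted removes most of the ambiguity argued about above.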



