Massive Speed Gains via Parallelized BZIP2 Compression

aphyr · on May 30, 2012

18.7 seconds for bzip2, and… wait for it… 3.5 seconds for pbzip2. That’s an increase of over 80%!

Er, not really. How about...

"pbzip2 reduced running time by 80%."

"pbzip2 took only 20% as long as bzip2 did."

"pbzip2 is five times faster."

SnowLprd · on May 30, 2012

Quite right. Clearly not enough coffee. Updated the article with better wording. Thanks!
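For reference, the arithmetic behind the three phrasings suggested above, as a quick sketch using the article's two timings (the variable names are only for illustration):

```python
# Timings quoted from the article, in seconds.
bzip2_seconds = 18.7
pbzip2_seconds = 3.5

# Fraction of the original running time that pbzip2 needed.
time_ratio = pbzip2_seconds / bzip2_seconds
# Fraction of the running time that was saved.
time_reduction = 1 - time_ratio
# How many times faster pbzip2 was.
speedup = bzip2_seconds / pbzip2_seconds

print(f"took {time_ratio:.1%} as long")                 # took 18.7% as long
print(f"reduced running time by {time_reduction:.1%}")  # reduced running time by 81.3%
print(f"{speedup:.1f}x faster")                         # 5.3x faster
```

All three numbers describe the same measurement; "an increase of over 80%" matches none of them.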

wmf · on May 30, 2012

BTW, bz2 is kinda over. Check out xz and the parallel version pxz.

vasi · on May 30, 2012

I've also got my version of parallel xz: https://github.com/vasi/pixz

It doesn't require use of large temporary files like pxz, and the xz files it produces can also be decompressed in parallel.

ak217 · on May 30, 2012

vasi, thanks very much for writing pixz, I've had a great experience working with it and it sped up my code very significantly.

The only thing I wish for is more documentation and inclusion in xz-utils (which I know is not up to you, but I'm hopeful).

vasi · on May 31, 2012

Glad to hear it's been of use :)

The author of xz is working on a parallel compression implementation of his own, which will hopefully be in future versions of xz.

LogicX · on May 30, 2012

Looks worth trying out --

Can I suggest adding installation packages for brew and Ubuntu (at least through Launchpad)?

SnowLprd · on May 31, 2012

That looks very promising. As others have pointed out here, pixz would be much more accessible if one could simply install it via "brew install pixz" or "aptitude install pixz".

vasi · on May 31, 2012

I welcome anybody who'd like to package it! I think it's already available in Arch Linux.

bacr · on May 30, 2012

I'm no expert in compression, so it's no surprise that I've never heard of xz. What are its strengths compared to gzip or bz2? Efficiency? Convention?

keeperofdakeys · on May 31, 2012

You've probably heard of LZMA; xz is merely an implementation of it (7-Zip also uses LZMA). Compared to bzip2, the most notable differences are significantly longer compression times and moderately faster decompression. This makes xz quite good for packaging data for distribution: although it takes a lot longer to compress, it decompresses faster than bzip2 and saves a significant amount of space.

shin_lao · on May 30, 2012

You've probably heard of 7-Zip and LZMA. xz uses the LZMA2 algorithm to achieve better compression ratios.

SnowLprd · on May 31, 2012

I disagree with your assessment. I'm aware of xz and use it when file size is paramount -- which, by the way, is almost never. Most of the time, I'm willing to give up a few megabytes in order to save time, and in my tests pbzip2 crushed xz by a factor of ten when it comes to speed.

Why not use parallelized versions of xz such as pxz or pixz? Because pre-built pxz/pixz packages are nearly non-existent. When that changes, I'll consider switching formats.

vasi · on May 31, 2012

Remember that compressors, including xz, support multiple compression levels. The default level for xz is 6, which is perhaps too far on the small-but-slow side. Levels 2 and lower tend to give similar compression levels to bzip2, and are considerably faster.
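As a rough sketch of picking a level, here is the same idea using Python's standard `lzma` and `bz2` modules rather than the command-line tools. The sample data is a made-up stand-in, so the sizes it prints say nothing about real files:

```python
import bz2
import lzma

# Hypothetical sample input: repetitive text standing in for a real file.
data = b"the quick brown fox jumps over the lazy dog\n" * 20000

# preset runs from 0 (fastest) to 9 (smallest); the default corresponds to 6.
fast_xz = lzma.compress(data, preset=2)
default_xz = lzma.compress(data)
# bz2's compresslevel runs from 1 to 9 and defaults to 9.
bzipped = bz2.compress(data)

for name, blob in [("xz -2", fast_xz), ("xz -6", default_xz), ("bzip2 -9", bzipped)]:
    print(f"{name:9} {len(blob):8d} bytes")

# Whatever the level, the data must round-trip unchanged.
assert lzma.decompress(fast_xz) == data
```

To compare speed as well, wrap each call in a timer and run it on data that resembles what you actually compress.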

Also, note that decompressing bzip2 is very slow; xz usually beats it by a factor of two or more.

SnowLprd · on May 31, 2012

I agree that the default level (6) for xz probably errs too much on favoring file size over speed. My tests with compression levels 1-2 do indeed show modestly improved size and speed performance relative to single-threaded bzip2.

The fact remains, however, that I can't seem to find a simple way to install a parallelized version of xz. Perhaps I'll post an issue in the GitHub issue tracker for pixz and see if we can't resolve that. :)

james4k · on May 30, 2012

Recently when compressing some SQL backups, I found xz to be very slow when compared to bz2 (at least with default compression settings).

Edit: And in this case, bz2 even gave better compression ratios if I recall correctly.

ralph · on May 30, 2012

If speed matters more than compression then consider lzop(1).

th0ma5 · on May 30, 2012

Since our move to multicore over faster processors, I'm sure we'll see a lot of this sort of thing, that is, people suddenly realizing that their code will be some multiple faster if they can find a way to do operations in parallel. I imagine that the compression itself might be slightly less optimal, however, since similar blocks that could be compressed together end up on different threads? I didn't dig into whether this is a concern with this project. The long and short of it, however, is that parallel is the reality. In theory one could arbitrarily split the file, and then compress each of the splits and get a speed up that is roughly comparable?

malkia · on May 30, 2012

"In theory one could arbitrarily split the file, and then compress each of the splits and get a speed up that is roughly comparable?" - It could, and that's what is done. But for LZ-type compressors a bigger dictionary is better (i.e. avoiding splits helps), which is one more reason why tarballs are compressed as a whole (.tar -> .tgz or .tbz), while .zip compresses each file individually.

It would be even better if the files were arranged so that files with similar content sit next to each other.

I'm not a compression expert, just an average user since the early days of what we used to call "solid" compression.
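A small sketch of the dictionary effect mentioned above, using Python's standard `lzma` module. The input is deliberately extreme (an assumption chosen to make the effect obvious): the same incompressible block repeated, so all the savings come from referring back to earlier data.

```python
import lzma
import random

# Hypothetical input: one 64 KiB block of random bytes, repeated 8 times.
rng = random.Random(0)
block = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
data = block * 8

# Compressed as one stream: every repeat can refer back to the first copy.
whole_size = len(lzma.compress(data))

# Split into independent pieces: each piece starts with an empty dictionary,
# so none of the repeats can be found.
pieces = [data[i:i + len(block)] for i in range(0, len(data), len(block))]
split_size = sum(len(lzma.compress(piece)) for piece in pieces)

print(f"single stream:        {whole_size} bytes")
print(f"8 independent pieces: {split_size} bytes")
```

Real files lose far less than this when split, but the direction is the same: independent pieces cannot share matches with each other.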

CJefferson · on May 30, 2012

bz2, out of all current compression methods, is particularly parallelisable, as it already splits files into 900k (or smaller) blocks and compresses each individually (well, runs BWT on each separately at least).

stcredzero · on May 30, 2012

>Since our move to multicore over faster processors, I'm sure we'll see a lot of this sort of thing, that is, people suddenly realizing that their code will be some multiple faster if they can find a way to do operations in parallel.

Reimplementation of things like compression algorithms seems very math/algorithm-heavy and thus amenable to functional programming. How about the Haskell/OCaml guys re-implement a bunch of Un^x style utilities for us?

th0ma5 · on May 31, 2012

In theory, GNU parallel can be used to automate this idea, much like xargs.

wtetzner · on May 30, 2012

>In theory one could arbitrarily split the file, and then compress each of the splits and get a speed up that is roughly comparable?

It doesn't seem unlikely that that's what they're doing, considering you can't pipe data to it on stdin.

sciurus · on May 30, 2012

For parallel gzip there's pigz (pronounced pig-zee).

http://www.zlib.net/pigz/

dguido · on May 31, 2012

Parallel gzip, in case anyone wanted it: http://zlib.net/pigz/

I've used it to great effect during incident response when I needed to search through hundreds of gigs of logs at a time.

malkia · on May 30, 2012

"The results: 18.7 seconds for bzip2, and… wait for it… 3.5 seconds for pbzip2. That’s an increase of over 80%!"

File cache effect? He should cold reboot first (not sure how you force the file cache out on OSX/linux, on Windows I do it with SysInternals RamMap) and try in different order.

It could still be faster, but he could really be measuring I/O that was done in the first case, and not in the second.
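One way to reduce the ordering bias, sketched in Python under some assumptions: the helper name `time_alternating` and the two stand-in workloads are made up for illustration, and this does not flush the OS file cache. It only makes sure neither candidate always runs first, and keeps the best time for each.

```python
import bz2
import os
import tempfile
import time

def time_alternating(candidates, rounds=4):
    """Run each callable once per round, flipping the order every round.

    Returns the best (minimum) wall-clock time per candidate, in seconds.
    """
    names = list(candidates)
    best = {name: float("inf") for name in names}
    for round_index in range(rounds):
        # Reverse the order on odd rounds so neither candidate always goes first.
        order = names if round_index % 2 == 0 else names[::-1]
        for name in order:
            start = time.perf_counter()
            candidates[name]()
            best[name] = min(best[name], time.perf_counter() - start)
    return best

# Hypothetical stand-ins for the two tools: both read the same file from disk,
# so whichever runs first in a single pass may pay the I/O cost for both.
def read_and_compress(path, level):
    with open(path, "rb") as handle:
        return bz2.compress(handle.read(), compresslevel=level)

with tempfile.NamedTemporaryFile(delete=False) as sample:
    sample.write(b"sample log line\n" * 50000)

results = time_alternating({
    "level 1": lambda: read_and_compress(sample.name, 1),
    "level 9": lambda: read_and_compress(sample.name, 9),
})
os.remove(sample.name)

for name, seconds in results.items():
    print(f"{name}: {seconds:.4f} s")
```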

It's also strange that .tar files are used, not .tar.bz2 or .tbz (if such an extension makes sense).

sciurus · on May 30, 2012

On Linux (as root), it's

    sync; echo 1 > /proc/sys/vm/drop_caches

http://linux-mm.org/Drop_Caches

minimax · on May 30, 2012

Awesome tip! Thanks!

malkia · on May 30, 2012

Yup. Thanks sciurus! Does anyone know the equivalent on OS X? (Or a multi-platform library that knows more about this stuff, Windows included?)

lilyball · on May 31, 2012

On OS X it's merely `purge`.

SnowLprd · on May 30, 2012

I'll do some additional testing to see if the results are affected by caching.

Not sure why you're surprised that I used .tar files for the compression testing. As I mentioned in the article, most of the time I'm creating bzipped tarballs from directories of files, so it made sense to use what is, for me, a common real-world use case. Your mention of tar.bz/.tbz makes me think there's some misunderstanding, since clearly I wouldn't want to test compression of already-compressed files. But perhaps it's I who am misunderstanding your suggestion. Please feel free to enlighten me. :)

malkia · on May 30, 2012

Got it!

mattst88 · on May 30, 2012

I used to use pbzip2 before I learned about lbzip2 (http://lacos.hu/)

lbzip2 is able to decompress single streams using multiple threads, which apparently pbzip2 cannot do. See the thread beginning with http://lists.debian.org/debian-mentors/2009/02/msg00098.html

juiceandjuice · on May 30, 2012

bzip2 has always been parallelizable. At one point a few years ago I was working on a compressed file format that included compressed block metadata, because bzip2 is most efficient when it gets about 900 kB to compress at a time. In effect, you split the file up into 900 kB chunks, compress them in parallel, and recombine them into one file at the end.
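A minimal sketch of that split/compress/recombine idea using Python's standard `bz2` module. This is not how pbzip2 is implemented; the function name `parallel_bz2` and the sample data are made up. Threads are used on the assumption that CPython's `bz2` releases the GIL while compressing; if that doesn't hold for your interpreter, a process pool would be the substitute.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 900 * 1000  # assumption: mirrors bzip2's largest block size (-9)

def parallel_bz2(data, workers=4):
    """Compress fixed-size chunks independently and concatenate the streams."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so the streams line up with the chunks.
        streams = list(pool.map(bz2.compress, chunks))
    return b"".join(streams)

# Hypothetical sample input of roughly 3.6 MB.
original = b"INSERT INTO logs VALUES (42, 'hello world');\n" * 80000
compressed = parallel_bz2(original)

# Concatenated bzip2 streams still decompress as one file.
assert bz2.decompress(compressed) == original
```

The output is several complete bzip2 streams back to back, which is why `bz2.decompress` (multi-stream aware since Python 3.3) can read it without knowing how it was produced.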

Inufu · on May 30, 2012

Is there a reason this is not the default?

reidrac · on May 30, 2012

I'm a GNU tar user (I believe that's the version in most Linux distributions, but I may be wrong), so I tend to use -z for gzip, -j for bzip2 and -J for xz.

That said, I guess that with the "alternatives" framework in Linux it would be reasonably easy (and transparent) to support the parallel version of each tool as a replacement for the regular one.

BrainInAJar · on May 30, 2012

Is there a pbzip2 that doesn't eat all your memory?

rorrr · on May 30, 2012

A GPU implementation would be cool.