That looks very promising. As others have pointed out here, pixz would be much more accessible if one could simply install it via "brew install pixz" or "aptitude install pixz".
I'm no expert in compression, so it's no surprise that I've never heard of xz. What are its strengths compared to gzip or bz2? Efficiency? Convention?
You've probably heard of LZMA; xz is merely an implementation of this (7-Zip also uses LZMA). Compared to bzip2, the most notable differences are significantly longer compression times and noticeably faster decompression. This makes xz quite good for packaging data for distribution: although it takes a lot longer to compress, it decompresses a lot faster than bzip2 and saves a significant amount of space.
I disagree with your assessment. I'm aware of xz and use it when file size is paramount -- which, by the way, is almost never. Most of the time, I'm willing to give up a few megabytes in order to save time, and in my tests pbzip2 crushed xz by a factor of ten when it comes to speed.
Why not use parallelized versions of xz such as pxz or pixz? Because pre-built pxz/pixz packages are nearly non-existent. When that changes, I'll consider switching formats.
Remember that compressors, including xz, support multiple compression levels. The default level for xz is 6, which is perhaps too far on the small-but-slow side. Levels 2 and lower tend to give compression ratios similar to bzip2's, and are considerably faster.
Also, note that decompressing bzip2 is very slow; xz usually beats it by a factor of two or more.
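If you want to check the level trade-off on your own data, here is a minimal sketch using Python's standard bz2 and lzma modules rather than the command-line tools. The sample data and the function name compare_levels are made up for illustration, and the sizes you get will depend entirely on what you feed it.

```python
import bz2
import lzma


def compare_levels(data: bytes) -> dict:
    """Return the compressed size in bytes for bzip2 -9 and a few xz presets."""
    sizes = {"bzip2 -9": len(bz2.compress(data, compresslevel=9))}
    for preset in (1, 2, 6):
        # preset corresponds to the numeric level passed to the xz tool
        sizes[f"xz -{preset}"] = len(lzma.compress(data, preset=preset))
    return sizes


if __name__ == "__main__":
    # Hypothetical sample: repetitive log-like text
    sample = b"".join(b"line %d of some repetitive log text\n" % i
                      for i in range(50_000))
    for name, size in compare_levels(sample).items():
        print(f"{name:>9}: {size} bytes")
```

Swap the sample for a real tarball read into memory to get numbers that mean something for your own use case.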
I agree that the default level (6) for xz probably errs too much on favoring file size over speed. My tests with compression levels 1-2 do indeed show modestly improved size and speed performance relative to single-threaded bzip2.
The fact remains, however, that I can't seem to find a simple way to install a parallelized version of xz. Perhaps I'll post an issue in the GitHub issue tracker for pixz and see if we can't resolve that. :)
Since our move to multicore over faster processors, I'm sure we'll see a lot of this sort of thing, that is, people suddenly realizing that their code will be some multiple faster if they can find a way to do operations in parallel. I imagine the compression itself might be slightly less optimal, though, since similar blocks that could have been compressed together end up on different threads? I didn't dig into whether that is a concern with this project. The long and short of it is that parallel is the reality. In theory one could arbitrarily split the file, and then compress each of the splits and get a speed up that is roughly comparable?
"In theory one could arbitrarily split the file, and then compress each of the splits and get a speed up that is roughly comparable?" - It could, and that's what is done. But an LZ-type compressor benefits from a bigger dictionary, so avoiding splits is for the better. That's one more reason to go .tar -> .tgz (or .tbz), whereas .zip compresses each file individually.
It would be even better if the files were arranged so that files with similar content come one after another.
I'm no compression expert, just an average user who has been around since the early days of what we used to call "solid" compression.
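To make the dictionary point concrete, here is a rough sketch that compresses the same data once as a whole and once as independent pieces, using Python's lzma module. The data is deliberately artificial (one random block repeated), so it exaggerates the effect; whole_vs_split is just an illustrative name, not anything from pixz.

```python
import lzma
import os


def whole_vs_split(data: bytes, chunk_size: int) -> tuple:
    """Return (size compressed as one stream, total size compressed per chunk)."""
    whole = len(lzma.compress(data))
    split = sum(
        len(lzma.compress(data[i:i + chunk_size]))
        for i in range(0, len(data), chunk_size)
    )
    return whole, split


if __name__ == "__main__":
    block = os.urandom(64 * 1024)   # incompressible on its own
    data = block * 16               # but highly redundant across blocks
    whole, split = whole_vs_split(data, chunk_size=64 * 1024)
    print(f"one stream: {whole} bytes, 16 separate streams: {split} bytes")
```

Compressed as one stream, the later copies can refer back to the first one. Compressed separately, each piece looks like pure noise. Real files are nowhere near this extreme, but it is the same reason a solid archive usually beats per-file compression.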
bz2, out of all current compression methods, is particularly parallelizable, as it already splits the input into 900k (or smaller) blocks and compresses each individually (well, runs BWT on each separately at least).
"Since our move to multicore over faster processors, I'm sure we'll see a lot of this sort of thing, that is, people suddenly realizing that their code will be some multiple faster if they can find a way to do operations in parallel."
Reimplementing things like compression algorithms seems very math/algorithm-heavy and thus amenable to functional programming. How about the Haskell/OCaml guys re-implement a bunch of Un*x-style utilities for us?
"The results: 18.7 seconds for bzip2, and… wait for it… 3.5 seconds for pbzip2. That’s an increase of over 80%!"
File cache effect? He should cold reboot first (not sure how you force the file cache out on OS X/Linux; on Windows I do it with Sysinternals RAMMap) and try in a different order.
It could still be faster, but he could really be measuring I/O that was done in the first case, and not in the second.
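One way to take I/O out of the picture is to read the file into memory first and interleave the runs, so neither tool always goes first. A minimal sketch with Python's built-in bz2 and lzma modules standing in for the command-line tools (time_alternating is a made-up helper name):

```python
import bz2
import lzma
import time


def time_alternating(compressors: dict, data: bytes, rounds: int = 3) -> dict:
    """Time each compressor `rounds` times, interleaved, keeping the best run.

    `data` is already in memory, so disk reads and the file cache
    do not influence the measurement.
    """
    best = {name: float("inf") for name in compressors}
    for _ in range(rounds):
        for name, compress in compressors.items():
            start = time.perf_counter()
            compress(data)
            best[name] = min(best[name], time.perf_counter() - start)
    return best


if __name__ == "__main__":
    payload = b"some fairly repetitive payload\n" * 100_000
    results = time_alternating({"bz2": bz2.compress, "xz": lzma.compress}, payload)
    for name, seconds in results.items():
        print(f"{name}: {seconds:.3f} s")
```

If you would rather benchmark the real binaries against files on disk, the cache can be flushed between runs: on Linux, running sync and then writing 3 to /proc/sys/vm/drop_caches as root does it, and macOS ships a purge command for the same purpose.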
It's also strange that .tar files are used, not .tar.bz2 or .tbz (if such an extension makes sense).
I'll do some additional testing to see if the results are affected by caching.
Not sure why you're surprised that I used .tar files for the compression testing. As I mentioned in the article, most of the time I'm creating bzipped tarballs from directories of files, so it made sense to use what is, for me, a common real-world use case. Your mention of tar.bz/.tbz makes me think there's some misunderstanding, since clearly I wouldn't want to test compression of already-compressed files. But perhaps it's I who am misunderstanding your suggestion. Please feel free to enlighten me. :)
bzip2 has always been parallelizable. At one point a few years ago I was working on a compressed file format that included compressed block metadata, because bzip2 is most efficient when it gets about 900 kB to compress at a time. In effect, you split the file up into 900 kB chunks, compress them in parallel, and recombine them into one file at the end.
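The split, compress and recombine idea can be sketched in a few lines of Python. This is not how pbzip2 is actually implemented, just the same idea under some assumptions: the 900,000-byte chunk size only approximates bzip2's largest block, threads are used on the understanding that CPython's bz2 module releases the GIL while compressing, and parallel_bz2 is a hypothetical name.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 900_000  # assumption: roughly bzip2's largest block size (-9)


def parallel_bz2(data: bytes, workers: int = 4) -> bytes:
    """Compress fixed-size chunks independently and concatenate the streams."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so the streams line up correctly
        return b"".join(pool.map(bz2.compress, chunks))


if __name__ == "__main__":
    payload = b"some fairly repetitive payload\n" * 200_000
    packed = parallel_bz2(payload)
    # Concatenated bzip2 streams decompress as a single stream
    assert bz2.decompress(packed) == payload
    print(f"{len(payload)} -> {len(packed)} bytes")
```

The output is simply several complete bzip2 streams back to back, which Python's bz2.decompress (3.3 and later) reads as one file.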
I'm a GNU tar user (I believe that's the version in most Linux distributions, but I may be wrong), so I tend to use -z for gzip, -j for bzip2 and -J for xz.
That said, I guess that with the "alternatives" framework in Linux it would be reasonably easy (and transparent) to support the parallel version of each tool as a replacement for the regular one.
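For anyone scripting this instead of typing tar flags, Python's tarfile module has mode suffixes that line up with those three options. A small sketch that builds the archive in memory (the MODES table and make_tarball are illustrative names of my own):

```python
import io
import tarfile

# tarfile mode suffixes corresponding to GNU tar's -z, -j and -J
MODES = {"-z": "w:gz", "-j": "w:bz2", "-J": "w:xz"}


def make_tarball(files: dict, flag: str) -> bytes:
    """Pack {name: content} into a compressed tarball held in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode=MODES[flag]) as tar:
        for name, content in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return buf.getvalue()


if __name__ == "__main__":
    files = {"a.txt": b"hello\n" * 1000, "b.txt": b"world\n" * 1000}
    for flag in MODES:
        print(flag, len(make_tarball(files, flag)), "bytes")
```

On the command line, GNU tar also accepts -I (--use-compress-program), so something like tar -I pbzip2 -cf archive.tar.bz2 dir should hand the compression step to the parallel tool without touching the alternatives setup.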
Er, not really. How about...
"pbzip2 reduced running time by 80%."
"pbzip2 took only 20% as long as bzip2 did."
"pbzip2 is five times faster."
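For what it's worth, the three phrasings fall straight out of the two timings quoted from the article. A quick sketch of the arithmetic (compare_timings is just an illustrative helper):

```python
def compare_timings(old: float, new: float) -> dict:
    """Express a speed-up three ways from two running times in seconds."""
    return {
        "reduction_pct": (old - new) / old * 100,  # time saved, as % of old
        "remaining_pct": new / old * 100,          # new time, as % of old
        "speedup": old / new,                      # "N times faster"
    }


if __name__ == "__main__":
    stats = compare_timings(18.7, 3.5)
    print(f"running time reduced by {stats['reduction_pct']:.1f}%")  # 81.3%
    print(f"took {stats['remaining_pct']:.1f}% as long")             # 18.7%
    print(f"{stats['speedup']:.2f} times faster")                    # 5.34
```

So "reduced by about 80%", "about 20% as long" and "about five times faster" are all fair roundings of the same measurement.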