zstd beats gzip handily, whether you measure speed at a given compression ratio or compression ratio for a given amount of wall-clock time. And that's without even using multiple threads, which zstd supports out of the box, whereas for gzip you'd need a separate tool like pigz. On top of that, zstd is also much faster to decompress.
In 2022, if you care about compression performance at all, there's no reason to even consider gzip unless you have a specific need for it (e.g. you have to deal with some other thing that can't handle zstd).
For anyone who wants to try this, zstd -T0 uses all your threads to compress, and https://github.com/facebook/zstd has a lot more detail. Brotli, https://github.com/google/brotli, is another modern format with some good features at high compression levels, and it also has Content-Encoding support in web browsers. You might also want to play with compression levels (brotli goes up to -q 11, zstd up to -19, plus zstd's --fast=N for extra speed); they don't map 1:1 to gzip's.
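A quick sketch of the flags (the file name and levels here are made up; zstd keeps the input file by default):

zstd -T0 -19 big.tar        # all cores, high compression level -> big.tar.zst
zstd -T0 --fast=3 big.tar   # trades ratio for a lot more speed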
One reason these modern compressors do better isn't any particular mistake made when DEFLATE was defined in the '90s; it's that the newer algorithms use a few MB of recently seen data as context instead of 32 KB, and do other things that would have been impractical in the '90s but are reasonable on modern hardware. The new algorithms also contain lots of smart ideas and have finely tuned implementations, but that core difference seems important to note.
So if you have many connections, you will spend gigabytes of RAM on compression alone? In the age of VMs and cloud computing, efficiency is important again.
When compressing, window size is a tunable option. I think the compression level sets it implicitly, but it can be set explicitly using brotli -w or windowLog in the advanced zstd API (https://facebook.github.io/zstd/zstd_manual.html).
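Both knobs are also exposed on the command line; a sketch with arbitrary sizes (note that big windows cost RAM on both the compressing and decompressing side):

zstd --long=30 -T0 big.tar    # 1 GiB match window; the decoder then needs zstd -d --long=30 too
brotli -w 24 -q 9 big.tar     # 2^24-byte (16 MiB) brotli window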
Decompressing many streams at once is more interesting. The decompression APIs do let you specify a max history-window size, but e.g. HTTP won't let you advertise a window size limit, so all you can do is error out if the response would take too much memory to decode. You can just not advertise support for new Content-Encodings at all if it's a concern.
Virtually every compression format utilizing LZ77 has a way to declare the window size used and reject any archive requiring too large a window. This is even the case with traditional DEFLATE.
Zstd is available in distros but not in browsers. Brotli is available in all modern web browsers as a Content-Encoding for HTTP resources sent to the browser. Very handy for wasm and things like that.
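For example, pre-compressing assets so the web server can hand them out with Content-Encoding: br (the file name is made up, and the server still has to be configured to pick the .br variant):

brotli -q 11 -o app.wasm.br app.wasm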
In distros, I would anecdotally say no (not sure about the latest versions). I frequently find it not installed on things like Ubuntu 18/20 and RHEL-based systems, whereas gzip is pretty much universally available on everything from embedded systems to Windows.
I’d like to throw in lz4 as a ridiculously fast, low-CPU alternative that still provides acceptable ratios for the presented use case.
Maybe obvious to many, but still worth mentioning: on Unix-like systems any external compressor can be used by piping to stdout, with no need to rely on built-in support. Especially relevant with BSD tar implementations. Example:
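# paths here are made up; any compressor on $PATH can be swapped in
tar -cf - /usr/local | zstd -T0 > local.tar.zst
zstd -dc local.tar.zst | tar -xf -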
Thanks for that information. I wasn't aware until reading into it from your suggestion here that zstd underpins the transparent compression in btrfs, which is something I've experimented with a bit without realizing zstd was part of it. Very useful!
Tar is the standard. It's pretty easy to produce and consume tar archives on Unix-like systems (especially in a streaming fashion, which comes in handy if your scratch storage is slow or tiny for some reason), and a lot of software will just work with it.
Other than that, tar is still frequently used for packing data to be recorded onto LTO tapes (even despite LTFS being a thing).
Tar is not a compression algorithm, it is a format for serializing files, so it covers a different problem space.
I've used it often to speed up transfers of many small files over SSH - scp always took a small amount of time before sending each file, and with many files, the wasted time blew up. Serializing all those files with tar and deserializing them on the other side turned out to be much faster, especially since tar works in a stream-like fashion. And it's convenient to type, too:
tar -c path/to/file1 path/to/file2 ... | ssh user@host tar -x
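If the link is slow, you can also slot a compressor into the pipe (this assumes zstd is installed on both ends):

tar -c path/to/dir | zstd -T0 | ssh user@host 'zstd -d | tar -x'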
Tar packs multiple things into one stream, so repetitions across those things can produce compression gains. Not sure about the other compression algorithms, but at least with gzip you won't get those gains if the repetitions live in separate, individually compressed files. So tar first, then gzip, has been the standard way of doing it, as sketched below.
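To make that concrete (the directory name is made up):

tar -cf - dir | gzip > dir.tar.gz   # one stream: repeated data across files compresses together
gzip -r dir                         # compresses each file separately, in place: no cross-file gains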
If you need or want to use tar and compression, you can consider pixz. It's compatible with xz, can do multithreaded compression and decompression, and can efficiently extract a single member from a .tar.xz archive without having to decompress the entire archive. xz/pixz is much more computationally expensive than zstd or gzip but also provides better compression ratios.
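Rough usage, going by the pixz README (archive and member names here are made up):

tar -Ipixz -cf backup.tpxz /usr/local              # indexed .tar.xz
pixz -x usr/local/bin/foo < backup.tpxz | tar -x   # extract a single member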
ZIP with LZMA compression is one possibility. Random seeks were always supported, and later versions also gained support for modern encryption, modern compression, and IIRC Unix permissions too.
ZIP with non-deflate compression is just silly. If you are not going to create de-facto-standard-compatible .zips then you might as well use a better archiving format.
Your choice of tools always depends on the use case. If you're only going to be creating these archives for internal use, or will be exchanging them/reading them on modern Unix-like systems, then there is no worry about the decompressor not supporting LZMA. If you need to exchange files with feature-poor systems, then yes - stick to the de-facto standard.
Besides, what "better" archiving format are you going to use? Tar clearly isn't one (the great... grandparent comment wouldn't be asking for alternatives if it were). "De facto standard" ZIP files have a poor compression ratio. 7z feels strange to use in a professional setting or for long-term storage. RAR is a closed-source abomination. Are there any other "better" archiving formats that have wider (de)compressor adoption than ZIP+LZMA?
Wow, that's really neat. I have never heard of pigz before despite being an avid Ubuntu user for a long time. Also, I see it supports a "-11" level, which uses the zopfli algorithm!
Is there a way to decompress in parallel? AFAIK pigz can use a few threads to separate decompression and I/O -- is there anything we can pass to tar to ungzip faster?
pzstd can decompress in parallel if the data was compressed using pzstd. Although zstd is so much faster for decompression that you rarely need multiple threads.
Gzip files are allowed to consist of multiple streams. Bgzip (block gzip) is one tool that reads and writes gzip-compatible files in parallel. The output can be decompressed in parallel by bgzip, accessed semi-randomly, or decompressed normally by gzip.
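bgzip ships with htslib; a sketch (the thread count and file name are arbitrary):

bgzip -@ 8 big.sql         # writes gzip-compatible big.sql.gz in parallel
bgzip -@ 8 -d big.sql.gz   # parallel decompression; plain gunzip also works on the file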
As an FYI, pigz can speed up decompression as well, but the performance boost is minimal: the decompression itself is still single-threaded, though it spawns extra threads for I/O and so forth. For compression the speedup is close to linear.
igzip (https://github.com/intel/isa-l) is much faster than gzip or pigz when it comes to decompression, 2-3x in my experience. There is also a Python module (isal) that provides a GzipFile-like wrapper class, for an easy speed-up of Python scripts that read gzipped files.
However, it only supports up to level 3 when compressing data, so it can't be used as a drop-in replacement for gzip. You also need to make sure to use the latest version if you are going to use it in the context of bioinformatics, since older versions choke on concatenated gzip files common in that field.
> However, it only supports up to level 3 when compressing data, so it can't be used as a drop-in replacement for gzip.
[EDIT: I misread a benchmark result [1] and wrongly concluded that igzip -3 is better than gzip -9; in reality it falls somewhere between -1 and -9.]
That benchmark seems to be comparing igzip with gzip -1, since the reported gzip output size is 42298786 bytes and not 36548933 bytes (from the earlier benchmarks). My own experience based mostly on FASTQ/text data is that igzip -3 gives results very similar to gzip -3.
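For anyone wanting to try igzip itself, the CLI is straightforward (the file name is made up):

igzip -3 reads.fastq       # level 3 is the best ratio igzip offers -> reads.fastq.gz
igzip -d reads.fastq.gz    # decompression speed is the main draw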
The whole scheme could use some improvements; tar.gz feels like the wrong tool here.
1. If you can, use a btrfs/zfs/LVM snapshot and skip the waiting completely. There's no point in the duplication.
2. If you do want the copies, use something that can do incremental backups instead. Borg, for example, can both compress and deduplicate, and on top of that it will only copy the changes rather than the whole directory (a minimal sketch follows this list).
3. (optional) If you're into running the latest custom-built software, you can save some time by using nixpkgs instead, which also ensures you don't risk collisions in /usr/local (it takes some upfront time investment, though).
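The minimal Borg sketch mentioned above, with the repository path and archive name made up:

borg init --encryption=repokey /backups/usrlocal
borg create --stats --compression zstd,9 /backups/usrlocal::pre-install-{now} /usr/local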
You know what I like about tar.gz? gzip is often a transparent compression layer in some tools, like, say, Athena. Tar is a simple concatenation of files plus headers.
Therefore, even though Athena/Hive/Presto don't handle tar, I can use tar (or grep) to search the files without having to decompress them.
In general sure, but not in this context. The author specifically says this is a short term copy done before installation, so it's more of a restore checkpoint than a proper backup. The use case is:
> With a handy /usr/local archive, if something goes crazy wrong during my new install, it’s easy to revert.
Yep! Thanks for the suggestion! Apparently -1 is the same as --fast. See comment below by geocrasher: "gzip --fast. Only 23% slower in speed, but 73% reduction"
Darkstar is a traditional name for computers running Slackware.
Shameless plug: My gorgeous Darkstar is an antique server on which I sell traditional shell accounts from which one can make "metal" VPSes from the command line. Occasionally somebody who knows 10X or 100X or 1,000X more than I do comes along and becomes a customer. If you are interested, please check metalvps.com. Also, if you like Xv6, please check metalvps.com/Xv6.
This wouldn't negate the point (if anything it would make the delta even larger), but you have to be careful when benchmarking tar with the "v" flag. The time to print ~400,000 lines to the terminal can be nontrivial! It may even be the majority of the time in the pure-tar case.
Like I said, this won't negate the point but amplify it; subtract an approximately constant time from both times and the multiplicative delta between them will grow.
(Some modern shells that focus on handling lots of text very quickly may be less impacted than some older-style shells.)
In the interests of not making another tiny post, I also endorse the use of parallel compression for the benchmark too, which will close the gap somewhat. You also want to consider the overall use case and the overall bandwidth you have to the final destination. If I were trying to upload something over my home network connection to some remote source, I can afford almost any parallel compression scheme there is because it's a pure win if I can end up sending fewer bytes. Compression is better-than-free in that case. On the flip side, if you don't care about disk usage it's hard to do very much compression at NVMe SSD speeds. (There are some things that may be able to do it, but you won't get much.)
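Concretely, the parallel variants of the benchmark would look something like this (paths and levels are arbitrary):

tar -cf - /usr/local | pigz -6 > local.tar.gz
tar -cf - /usr/local | zstd -T0 -3 > local.tar.zst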
Not sure if tar is this way, but I think some programs change their buffering behavior based on whether they detect a TTY, which can further change things.
> Before compiling, I use tar to make a short term backup of how things were before the compile and install.
So most of his tar files are mostly the same, containing many of the same files in the same order, and aligned to 512-byte blocks to boot. Thanks to this block-aligning property, tar files are well suited to block-level deduplicating filesystems, provided most of their contents are similar to other tar files. I have blogged about using this idea in the context of CI/CD [1], but it works just as well for backups. Sharing compression between tar files instead of compressing each tar file separately might be better anyway.
I posted this in a comment on the WordPress Hosting FB group, in a discussion about whether gzipping an SQL file for transport was worth it. I thought it was an interesting comparison.
This is on an NVMe disk on a 32 core server with a light load.
$ time wp db export test.sql
real 0m3.410s
user 0m2.568s
sys 0m0.431s
176M test.sql
$ time wp db export - | gzip > file.sql.gz
real 0m8.304s
user 0m10.010s
sys 0m0.439s
40M file.sql.gz
$ time wp db export - | gzip --fast > fast.sql.gz
real 0m4.196s
user 0m5.728s
sys 0m0.340s
49M fast.sql.gz
The clear winner is gzip --fast. Only 23% slower in speed, but 73% reduction
in file size, vs. default gzip's roughly 2.4x the export time for a 77% reduction.
A file that's about 18% smaller is not worth 2.4x the backup time.
That's true, it does. And on a very fast connection, it's cheaper to leave it uncompressed. And true, I didn't factor in decompression. The context was specifically the speed of backups rather than the speed of restores.
I'm somewhat surprised, disk being so many orders of magnitude slower than compute (and, to a lesser extent, memory), that there's such a wide gulf. My expectation would have been that the difference would be negligible, since tar writes roughly double the bytes (per the article) back to the disk.
One reason to use tar without gzip is that you can random access files within the tarball without unpacking it. tar + xz with --block-size also allows this.
> you can random access files within the tarball without unpacking it
Kinda sorta. You still need a linear scan first, remembering all the offsets, because the tar format does not have an index.
> tar + xz with --block-size also allows this.
This also requires a linear scan + unpack stage first, with memorizing offsets. Remembering where the XZ blocks start (and corresponding uncompressed offsets) allows a sufficiently smart implementation to pretend that it can seek in the compressed stream, while under the hood seeking to the closest offset it has memorized and uncompressing from there. This is also alluded to in the GitHub issue you linked to.
Manually tuning down the block size allows that to be done faster, at the cost of compression.
But you still need an unpack/scan phase first. And potentially unpacking and skipping adjacent data when "randomly" accessing a file.
Technically, there is nothing stopping an implementation from doing this with gzip/zlib frames as well.
You only have to scan to the start of the file you want to unpack. In many cases (we use this with Kubernetes CDI) the tarball contains only a few files with the disk image we want to access taking up the majority of the tarball, so in fact a full linear scan is not required and we're only unpacking the first few blocks.
Both have to do a linear scan first though. The implementations however can do the linear scan on-demand, i.e., they scan only as far as needed.
bzip2 works very well with this approach. xz only works with this approach when compressed with multiple blocks. The same is true for zstd.
For zstd, there also exists a seekable variant, which stores the block index at the end as metadata to avoid the linear scan. indexed_zstd offers random access to those files https://github.com/martinellimarco/indexed_zstd
I wrote pragzip and also combined all of the other random access compression backends in ratarmount to offer random access to TAR files that is magnitudes faster than archivemount: https://github.com/mxmlnkn/ratarmount
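Usage is basically mount-like (the archive name and mount point here are made up):

ratarmount backup.tar.gz /mnt/backup   # builds an index on first use; random access afterwards is cheap
ls /mnt/backup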
Interesting. Back in the 90s I found that compressing files before writing them actually improved performance quite significantly, because disk I/O was so much slower back then. How things change.
Being used to slow broadband links, I always compressed my tar files. You might have to wait half an hour like in the article, but you'd save an hour on the transfer. And then I used my first gigabit local area network.
It was even more amazing using a 10Gb SAN with SSDs. At that point the bottleneck seemed to be all the tiny files on disk (Python venvs, for instance).
It's been a few days, but I still had the original files in their original locations. So:
root@darkstar:/usr# date -u; time gzip < local-revert.tar > local-revert.tar.gz
Tue Oct 11 02:49:32 UTC 2022
real 24m35.214s
user 23m57.449s
sys 0m27.327s
root@darkstar:/usr#
From the linked article, for convenient reference, making the .tgz file with tar cvzf took about 28.2 minutes. Making an uncompressed .tar file took about 1.2 minutes. Just now, compressing with gzip took 24.6 minutes. It seems, in this situation, about 2.4 minutes might be saved by creating the tar file first and then compressing separately.
Would be curious about the OS, input/output size, and hardware/virtualization tech; the only reason I can think of for this would be tiny buffers with exorbitantly expensive context switches, like you'd see on some older virtualization or, e.g., puny escaped-a-VCR ARM chips.
I agree with this, even as a long-time zstd user both personally and professionally. The fact is nearly every app, widget, API, and even CPU can encode and decode gzip. Sometimes you need to make the prosaic choice.
Most things are faster than gzip, and many of those things have smaller outputs as well. Brotli at level 2 produces the same-sized archive with this input, in just 3.4s. Gzip even at level 1, with a 10% larger archive, still needs 6.5s. zstd, at level 1, produces the same size as gzip -1, in 1 second! zstd, cranked up to 9, makes a 20% smaller archive in 7 seconds, less than half the time of gzip -6.
tar also has an "-a" flag that just picks the right compressor based on the output file name: "tar -acf archive.tar.gz" and "tar -acf archive.tar.zst" will each do the right thing.
The time to make an archive depends primarily on two factors: processing power and IO speed. Depending on the scenario, you might have a fast CPU with a slow target device, especially if it's remote, in which case spending extra time compressing can be a win. In other cases the IO is plenty fast but you're CPU-limited, in which case little or no compression is what you want.
Was wondering if anyone has tried implementing a control loop that dynamically adjusts compression parameters for optimal throughput, making the best possible use of both processing power and IO?
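For what it's worth, zstd's --adapt flag is pretty much this idea: it raises or lowers the level depending on how fast the output is draining. A sketch (host and paths are made up):

tar -cf - /usr/local | zstd --adapt -T0 | ssh user@host 'cat > local.tar.zst'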
> Second, once we have done compiling a few times, compiling a program from its latest sources can be easier than figuring out how to install an often older version with our distribution’s package manager.
I'll add that the "older" versions of some utilities in a distribution's package managers often have patches with security fixes from newer versions.
The distribution's package manager will identify when there's a newer version, especially with critical security fixes, so I don't have to regularly do manual checks to see if something should be upgraded.
Newest isn't always better, and I'm usually happy to be conservative about common system utilities like tar and gzip.
The article talks about tar.gz for archival, which is scary considering that none of tar.gz, tar.xz, or tar.zst is a safe format. Try corrupting a few bytes in the middle and half the archive is lost, including files untouched by the corruption.
Repair tools only exist for tar.bz2 and tar.lz, but neither gets much use compared to gz/xz. The safest compression for tar is something that splits the compression into blocks, such as btrfs compression or LTO tape compression.
Did a bit of testing on how to get files from a remote endpoint unpacked on disk the fastest; the most interesting part was developing the method. There are some good ideas in this thread that I hadn't thought of testing, too.