How much faster is making a tar archive without gzip? (lowendbox.com)
96 points by edward on Oct 11, 2022 | 90 comments


zstd beats gzip handily, either in terms of speed for a given compression ratio or compression ratio for a given amount of wall clock time. And that's without even using multiple threads, which zstd supports out of the box but for gzip you would need to use a different tool like pigz. And on top of that zstd is also much faster to decompress.

In 2022, if you care about compression performance at all, there's no reason to even consider gzip unless you have a specific need for it (e.g. you have to deal with some other thing that can't handle zstd).


For anyone who wants to try this, zstd -T0 uses all your threads to compress, and https://github.com/facebook/zstd has a lot more description. Brotli, https://github.com/google/brotli, is another modern format with some good features for high compression levels, which also has Content-Encoding support in web browsers. You might also want to play with compression levels (brotli's -q 0 to 11, zstd's 1 to 19 plus --fast=n); they may not map 1:1 to gzip's.
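For example (a rough sketch; the levels, paths, and output names are just illustrations, and timings will vary with your data):

    # multi-threaded zstd straight from tar's stdout
    tar cf - /usr/local | zstd -T0 -19 -o local-revert.tar.zst
    # trade ratio for speed with zstd's negative ("fast") levels
    tar cf - /usr/local | zstd -T0 --fast=3 -o local-revert-fast.tar.zst
    # brotli at a high quality level
    tar cf - /usr/local | brotli -q 9 -o local-revert.tar.br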

One reason these modern compressors do better is not any particular mistake made defining DEFLATE in the 90s, but that new algos use a few MB of recently seen data as context instead of 32KB, and do other things that would have been impractical in the 90s but are reasonable with modern hardware. The new algorithms also contain lots of smart ideas and have fine-tuned implementations, but that core difference seems important to note.


So if you have many connections, you will spend gigabytes of RAM on compression alone? In the age of VM and cloud computing efficiency is important again.


If you are concerned about RAM usage, there are trade-offs available: a slightly lower compression ratio in exchange for more efficient resource usage.

As used by HAProxy's gzip filter, from Willy Tarreau, creator of HAProxy: https://github.com/wtarreau/libslz


When compressing, window size is a tunable option. I think the compression level sets it implicitly, but it can be set explicitly using brotli -w or windowLog in the advanced zstd API (https://facebook.github.io/zstd/zstd_manual.html).
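For example, on the command line (flags from memory; the exact limits depend on your builds, and the file names are placeholders):

    # brotli: explicit 2^24 = 16 MiB window via --lgwin
    brotli -q 11 --lgwin=24 -o big.br big
    # zstd: long-distance matching with a 2^31 window; the decompressor
    # must opt in to windows above its default limit
    zstd -19 --long=31 -o big.zst big
    zstd -d --long=31 big.zst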

Decompressing many streams at once is more interesting. The decompression APIs do let you specify a max history-window size, but e.g. HTTP won't let you advertise a window size limit, so all you can do is error out if the response would take too much memory to decode. You can just not advertise support for new Content-Encodings at all if it's a concern.


Virtually every compression format utilizing LZ77 has a way to declare the window size used and reject any archive requiring too large a window. This is even the case with traditional DEFLATE.


In the age of cloud computing, compression can easily be offloaded to another layer. CDNs can do brotli compression for you for example.


Is zstd readily available by default in distros and browsers and such?


Zstd is available in distros but not browsers. Brotli is available in all modern web browsers for sending HTTP resources to the browser. Very handy for wasm and things like that:

https://caniuse.com/brotli


Brotli is not enabled by default on Android's webview yet.


In distros, I would say no, anecdotally (not sure about the latest versions). I frequently find it not installed on things like Ubuntu 18/20 and RHEL-based systems, whereas gzip is pretty much universally available on everything from embedded to Windows.


If I remember correctly, it's in EPEL for CentOS and peer distributions.

Maybe RHEL v9 took it into the main repositories, would have to check.


I’d like to throw in lz4 as a ridiculously fast, low-CPU alternative that still provides acceptable ratios for the presented use case.

Maybe obvious for many but still worth mentioning is that on Unix-like systems any external compression can be used by piping to stdout, no need to rely on built-in support. Especially relevant with BSD tar implementations. Example:

    tar cfv - /path/to/files | lz4 > output.tar.lz4
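The reverse direction works the same way (a sketch; -c just forces lz4 to write to stdout, and the restore path is a placeholder):

    lz4 -d -c output.tar.lz4 | tar xvf - -C /path/to/restore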


Intel's new Sapphire Rapids Xeons will have special on-chip gzip IP: https://www.anandtech.com/show/17596/intel-demos-sapphire-ra...


That sounds about 30 years too late.


That sounds like Intel Quick Assist (QAT).

That's a not-greatly-documented technology that's included in some processors (including a bunch of Celerons, but mostly Xeons) and in PCIe cards.

They are used in firewalls, ssl terminators and storage systems, as they can provide Gbps of compression or cryptography.

See one of the older cards: https://www.intel.com/content/dam/www/public/us/en/documents...

Netgate sells an 8950-based card for its firewalls: https://shop.netgate.com/products/netgate-cpic-8955-cryptogr...

STH had a report this summer: https://www.servethehome.com/intel-quickassist-parts-and-car...

These cards are expensive when bought new, but you can probably buy some of them cheaply on eBay. Please don't drive the prices up ;)


They shall patent it quick before Fraunhofer or Dolby does it.


Thanks for that information. I wasn't aware until reading into it from your suggestion here that it's related to the compression in btrfs, which is something I've experimented with a bit but didn't realize had zstd as a part of it. Very useful!


Of course, if you care about (random access) performance you aren’t using that, either. I don’t really understand why anyone uses tar anymore, really.


Tar is the standard. It's pretty easy to produce and consume tar archives on Unix-like systems (especially in a streaming fashion, which comes in handy if your scratch storage is slow or tiny for some reason), and a lot of software will just work with it.

Other than that, tar is still frequently used for packing data to be recorded onto LTO tapes (even despite LTFS being a thing).


Tar is not a compression algorithm; it is a format for serializing files, so it covers a different use case.

I've used it often to speed up transfers of many small files over SSH - scp always took a small amount of time before sending each file, and with many files, the wasted time blew up. Serializing all those files with tar and deserializing them on the other side turned out to be much faster, especially since tar works in a stream-like fashion. And it's convenient to type, too:

    tar -c path/to/file1 path/to/file2 ... | ssh user@host tar -x
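A slightly more explicit variant, if you want to control the destination directory (assumes GNU tar on both ends; host and paths are placeholders):

    tar -cf - path/to/file1 path/to/file2 | ssh user@host 'tar -xf - -C /destination/dir'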


Tar packs multiple things into one thing, so that repetitions across those things can yield compression gains. Not sure how the other compression algorithms work, but at least when gzipping, you will not get that compression if the repetitions are in separately compressed files. So first tar, then gzip has been the standard way of doing it.


If you need or want to use tar and compression, you can consider pixz. It's compatible with xz, can do multithreaded compression and decompression, and can efficiently extract a single member from a .tar.xz archive without having to decompress the entire archive. xz/pixz is much more computationally expensive than zstd or gzip but also provides better compression ratios.
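Single-member extraction looks roughly like this (from memory of pixz's README; the member path and file names are placeholders, so check pixz(1) for the exact syntax):

    # compress a tar in parallel, writing an index
    pixz archive.tar archive.tpxz
    # pull one member out without decompressing the whole archive
    pixz -x path/inside/archive < archive.tpxz | tar xf -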


What would you use instead of tar?


ZIP with LZMA compression is one possibility. Random seeks were always supported, and later versions also gained support for modern encryption, modern compression, and IIRC Unix permissions too.


ZIP always had filename encoding issues, afair. At least in a “real-world zip”. Have they been resolved? (And does tar have them?)


If you use pax format all metadata is assumed to be UTF-8. So that would work with tar.


ZIP with non-deflate compression is just silly. If you are not going to create de facto standard compatible .zips, then you might as well use a better archiving format.


Your choice of tools always depends on the use case. If you're only going to be creating these archives for internal use, or will be exchanging them/reading them on modern Unix-like systems, then there is no worry about the decompressor not supporting LZMA. If you need to exchange files with feature-poor systems, then yes - stick to the de-facto standard.

Besides, what "better" archiving format are you going to use? Tar clearly isn't one (the great... grandparent comment wouldn't be asking for alternatives if it were). "Defacto standard" ZIP files have a poor compression ratio. 7z feels strange to use in a professional setting or for long-term storage. RAR is a closed-source abomination. Are there any other "better" archiving formats that have wider (de)compressor adoption than ZIP+LZMA?


The real trick is to use a parallelized implementation of the compression tool, viz:

    tar --use-compress-program /usr/bin/lbzip2 --create --file file.tar.bz2 /path/to/files
or if you must use .gz

    tar --use-compress-program /usr/bin/pigz --create --file file.tar.gz /path/to/files
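The same flag also works for extraction (GNU tar invokes the program with -d), e.g.:

    tar --use-compress-program /usr/bin/lbzip2 --extract --file file.tar.bz2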


Wow, that's really neat. I have never heard of pigz before despite being an avid Ubuntu user for a long time. Also, I see it supports the "-11" level which is using the zopfli algorithm!

Appreciate you sharing.


Is there a way to decompress in parallel? AFAIK pigz can use a few threads to separate decompression and I/O. Is there anything we can pass to tar to ungzip faster?


pzstd can decompress in parallel if the data was compressed using pzstd. Although zstd is so much faster for decompression that you rarely need multiple threads.

For xz, pixz can decompress in parallel.


Lzip can be decompressed in parallel if compressed with "tarlz".

https://www.nongnu.org/lzip/tarlz.html


I have been working on exactly that: pragzip. It decompresses in parallel in a similar manner to the prototype pugz by implementing a two-staged decompression. https://github.com/mxmlnkn/pragzip I also did a Show HN: https://news.ycombinator.com/item?id=32366959


Not without changing the structure of the gzip unfortunately.


Gzip files are allowed to consist of multiple streams. Bgzip (block gzip) is one tool that reads and writes gzip-compatible files in parallel. The output can be decompressed in parallel by bgzip, accessed semi-randomly, or decompressed normally by gzip.
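Basic usage looks something like this (bgzip ships with htslib; flags from memory, and the file name is a placeholder):

    bgzip -@ 8 dump.tar        # gzip-compatible dump.tar.gz, compressed with 8 threads
    bgzip -d -@ 8 dump.tar.gz  # parallel decompression; plain gunzip also works on the file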


As an FYI, pigz can speed up decompression as well, but the performance boost is minimal. The decompression itself still runs in a single thread; it just spawns other threads for I/O and so forth. For compression the speedup is quite linear.


igzip (https://github.com/intel/isa-l) is much faster than gzip or pigz when it comes to decompression, 2-3x in my experience. There is also a Python module (isal) that provides a GzipFile-like wrapper class, for an easy speed-up of Python scripts that read gzipped files.

However, it only supports up to level 3 when compressing data, so it can't be used as a drop-in replacement for gzip. You also need to make sure to use the latest version if you are going to use it in the context of bioinformatics, since older versions choke on concatenated gzip files common in that field.
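For reference, the CLI that ships with isa-l looks like this (flags from memory; the file name is a placeholder, so check igzip --help):

    igzip -d reads.fastq.gz    # fast decompression
    igzip -3 reads.fastq       # compress at the highest supported level (3)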


> However, it only supports up to level 3 when compressing data, so it can't be used as a drop-in replacement for gzip.

[EDIT: I misread a benchmark result [1] and wrongly concluded that igzip -3 is better than gzip -9; in reality it is somewhere between -1 and -9.]

[1] https://encode.su/threads/3398-Sequence-Compression-Benchmar...


That benchmark seems to be comparing igzip with gzip -1, since the reported gzip output size is 42298786 bytes and not 36548933 bytes (from the earlier benchmarks). My own experience based mostly on FASTQ/text data is that igzip -3 gives results very similar to gzip -3.


Oops! Yes, you are correct. I confused it with another reply with level 9 benchmarks; please disregard the parent comment.


lrzip


The whole scheme could use some improvements; tar.gz feels like the wrong tool here.

1. If you can, use btrfs/zfs/lvm snapshot and skip the waiting completely. There's no point in duplication.

2. If you do want the copies, use something that can do incremental backups instead. For example, Borg can both compress and deduplicate, and on top of that will only copy the changes, not the whole directory (see the sketch after this list).

3. (optional) If you're into running the latest software built from source, you can save some time by using nixpkgs instead, which will also ensure you don't risk collisions in /usr/local. (Takes some upfront time investment, though.)
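A minimal Borg sketch for point 2 (assuming Borg >= 1.1 with zstd support; the repository path and archive name are made up):

    borg init --encryption=repokey /backups/usr-local-repo
    borg create --compression zstd,3 --stats /backups/usr-local-repo::usr-local-{now} /usr/local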


You know what I like about tar.gz? gzip is often a transparent compression layer on some tools, like say Athena. Tar is a simple concatenation of files + header.

Therefore, even though Athena/Hive/Presto doesn't handle tar, I can use tar (or grep) to search the files without having to decompress them.


> There's no point in duplication.

Duplication is precisely the point of backups. Well, multiplication preferably, more than duplication.

The old chestnut is that 99% of restores are due to user error. There's still that 1% rump, though.


In general sure, but not in this context. The author specifically says this is a short term copy done before installation, so it's more of a restore checkpoint than a proper backup. The use case is:

> With a handy /usr/local archive, if something goes crazy wrong during my new install, it’s easy to revert.


I can't believe my article is here on HN.


Next time try gzip -1, it's much faster.


Yep! Thanks for the suggestion! Apparently -1 is the same as --fast. See comment below by geocrasher: "gzip --fast. Only 23% slower in speed, but 73% reduction"


Congrats! I did some writing on LEB a few years ago and it never made it to HN. So, well done :)


Thanks! Best wishes!


What is darkstar? I couldn't find a link.


Darkstar is a traditional name for computers running Slackware.

Shameless plug: My gorgeous Darkstar is an antique server on which I sell traditional shell accounts from which one can make "metal" VPSes from the command line. Occasionally somebody who knows 10X or 100X or 1,000X more than I do comes along and becomes a customer. If you are interested, please check metalvps.com. Also, if you like Xv6, please check metalvps.com/Xv6.


The name originally comes from a song by the Grateful Dead.

https://m.youtube.com/watch?v=AlWqitKLnfs

These are the lyrics:

Dark star crashes, pouring its light into ashes

Reason tatters, the forces tear loose from the axis

Searchlight casting for faults in the clouds of delusion

Shall we go, you and I while we can

Through the transitive nightfall of diamonds?

-

Mirror shatters in formless reflections of matter

Glass hand dissolving in ice, petal flowers revolving

Lady in velvet recedes in the nights of good-bye

Shall we go, you and I while we can

Through the transitive nightfall of diamonds?


The way I read the terminal prompts, it's the name of his computer.


Exactly!


"time tar cvzf local-revert.tgz local"

This would not negate the point (it would possibly make the delta even larger), but you have to be careful when benchmarking tar with the "v" flag. The time to print ~400,000 lines out to the terminal can be nontrivial! It may even be the majority of the time for the pure tar case.

Like I said, this won't negate the point but amplify it; subtract an approximately constant time from both times and the multiplicative delta between them will grow.

(Some modern shells that focus on handling lots of text very quickly may be less impacted than some older-style shells.)
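One way to take the terminal out of the measurement (reusing the article's command purely as an illustration):

    # redirect the listing so the terminal doesn't absorb part of the measured time
    time tar cvzf local-revert.tgz local > /dev/null
    # or simply drop the v flag for the benchmark
    time tar czf local-revert.tgz local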

In the interests of not making another tiny post, I also endorse using parallel compression for the benchmark, which will close the gap somewhat. You also want to consider the overall use case and the overall bandwidth you have to the final destination. If I were trying to upload something over my home network connection to some remote destination, I could afford almost any parallel compression scheme there is, because it's a pure win if I can end up sending fewer bytes. Compression is better-than-free in that case. On the flip side, if you don't care about disk usage, it's hard to do very much compression at NVMe SSD speeds. (There are some things that may be able to do it, but you won't get much.)


Not sure if tar is this way, but I think some programs change their buffering behavior based on whether they detect a TTY, which can further change things.


> Before compiling, I use tar to make a short term backup of how things were before the compile and install.

So most of his tar files are mostly the same, having many of the same files in the same order, and 512-byte block aligned to boot. Due to this block-aligning property of tar files, they are well suited to block-level deduplicating file systems, provided most of their contents are similar to other tar files. I have blogged about using this idea in the context of CI/CD [1], but the idea works just as well with backups. Sharing compression between tar files instead of compressing each tar file might be better anyway.

1: https://blog.djha.skin/blog/cicd-package-compression-could-b...


I posted this in a comment on the WordPress Hosting FB group in a discussion about whether gzipping a sql file for transport was worth it. I thought it an interesting comparison.

   This is on an NVMe disk on a 32 core server with a light load.
   $ time wp db export test.sql
   real 0m3.410s
   user 0m2.568s
   sys 0m0.431s
   176M test.sql
   $ time wp db export - | gzip > file.sql.gz
   real 0m8.304s
   user 0m10.010s
   sys 0m0.439s
   40M file.sql.gz
   $ time wp db export - | gzip --fast > fast.sql.gz
   real 0m4.196s
   user 0m5.728s
   sys 0m0.340s
   49M fast.sql.gz
   The clear winner is gzip --fast: only 23% slower, but a 73% reduction in
   file size, vs normal gzip taking about 2.4x as long for an 88% reduction.
   A 15% smaller file at 2.4x the backup time is not worth it.


> 15% smaller file at 240% increased backup time is not worth it.

Doesn't this depend on the bandwidth of the connection the file is being transferred over? You also have to factor in unzipping time.


That's true, it does. And on a very fast connection, it's cheaper to leave it uncompressed. And true, I didn't factor in decompression. The context was specifically the speed of backups rather than the speed of restores.


I'm somewhat surprised, disk being so many orders of magnitude slower than compute (and to a lesser extent, memory), that there's such a wide gulf. My expectation would have been that the difference would be negligible, as `tar` writes roughly double the bytes (per the article) back to the disk.


One reason to use tar without gzip is that you can random access files within the tarball without unpacking it. tar + xz with --block-size also allows this.

https://libguestfs.org/nbdkit-tar-filter.1.html https://libguestfs.org/nbdkit-xz-filter.1.html

Unfortunately zstd still does not:

https://github.com/facebook/zstd/issues/395#issuecomment-535...


> you can random access files within the tarball without unpacking it

Kinda sorta. You still need a linear scan first and remember all the offsets, because the tar format does not have an index.

> tar + xz with --block-size also allows this.

This also requires a linear scan + unpack stage first, with memorizing offsets. Remembering where the XZ blocks start (and corresponding uncompressed offsets) allows a sufficiently smart implementation to pretend that it can seek in the compressed stream, while under the hood seeking to the closest offset it has memorized and uncompressing from there. This is also alluded to in the GitHub issue you linked to.

Manually tuning down the block size allows that to be done faster, at the cost of compression.
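For example, something like this produces an xz stream split into independent blocks (the size is arbitrary; smaller blocks mean cheaper seeks but a slightly worse ratio):

    tar cf - /path/to/files | xz -T0 --block-size=16MiB > archive.tar.xz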

But you still need an unpack/scan phase first. And potentially unpacking and skipping adjacent data when "randomly" accessing a file.

Technically, there is nothing stopping an implementation from doing this with gzip/zlib frames as well.


You only have to scan to the start of the file you want to unpack. In many cases (we use this with Kubernetes CDI) the tarball contains only a few files with the disk image we want to access taking up the majority of the tarball, so in fact a full linear scan is not required and we're only unpacking the first few blocks.


Pragzip actually decompresses in parallel and also offers random access. I did a Show HN here: https://news.ycombinator.com/item?id=32366959

indexed_gzip https://github.com/pauldmccarthy/indexed_gzip can also do random access but is not parallel.

Both have to do a linear scan first though. The implementations however can do the linear scan on-demand, i.e., they scan only as far as needed.

bzip2 works very well with this approach. xz only works with this approach when compressed with multiple blocks. The same is true for zstd.

For zstd, there also exists a seekable variant, which stores the block index at the end as metadata to avoid the linear scan. indexed_zstd offers random access to those files https://github.com/martinellimarco/indexed_zstd

I wrote pragzip and also combined all of the other random access compression backends in ratarmount to offer random access to TAR files that is magnitudes faster than archivemount: https://github.com/mxmlnkn/ratarmount
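Basic ratarmount usage is something like this (a sketch; the archive name and mount point are placeholders, and the README covers the full set of options):

    pip install ratarmount
    ratarmount some-archive.tar.gz /mnt/archive   # FUSE mount with random access
    ls /mnt/archive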


Interesting. Back in the 90s I found that compressing files before writing them actually improved performance quite significantly, because disk I/O was so much slower back then. How things change.


Being used to slow broadband links, I always compressed my tar files. You might have to wait half an hour like in the article, but you'd save an hour on the transfer. And then I used my first gigabit local area network.

It was even more amazing using a 10Gb SAN with SSDs. At that point the bottleneck seemed to be all the tiny files on disk (Python venvs, for instance).


The last time I did a test like this, it was way faster to tar first without compression, and then run gzip on the tar file.


It's been a few days, but I still had the original files in their original locations. So:

  root@darkstar:/usr# date -u; time gzip < local-revert.tar > local-revert.tar.gz
  Tue Oct 11 02:49:32 UTC 2022

  real    24m35.214s
  user    23m57.449s
  sys     0m27.327s
  root@darkstar:/usr# 
From the linked article, for convenient reference, making the .tgz file with tar cvzf took about 28.2 minutes. Making an uncompressed .tar file took about 1.2 minutes. Just now, compressing with gzip took 24.6 minutes. It seems, in this situation, about 2.4 minutes might be saved by creating the tar file first and then compressing separately.


Would be curious about the OS, input/output size, hardware, and virtualization tech. The only reason I can think of for this would be tiny buffers with exorbitantly expensive context switches, like you'd see on some older virtualization or, e.g., puny escaped-a-VCR ARM chips.


/usr/bin as input

  tar czf: 18s (214MB)
  tar cf: 240ms (553MB)
  tar c | lz4 -c: 1s (305MB)
  tar c | zstd -c: 1.6s (214MB)
TL;DR gzip is and has been for a long time obsolete as both a format and as a tool.

  find | cpio: 1.2s
Bit disappointed in that last bit, honestly.


Did you run that more than once to ensure uniform cache hits? Otherwise only the first one will pay the price for the actual disk reads.


Gzip is fine for lots of tasks.

It's not the most efficient compressor, either by space or speed. But it has a million implementations that all work well together.

If you just need "good enough" compression with no fuss, gzip is still an excellent choice.


I agree with this, even as a long-time zstd user both personally and professionally. The fact is nearly every app, widget, API, and even CPU can encode and decode gzip. Sometimes you need to make the prosaic choice.


I feel like zstd, if nothing better shows up, will become a lot like gzip in the long run


For a long time the status quo was that while there were better compression engines (bzip2 comes to mind), none were faster than gzip.

I actually preferred gzip for a lot of tasks because it was faster to compress and decompress.

I saw a comment that zstd is actually faster than gzip so I guess that consideration no longer holds.


Most things are faster than gzip, and many of those things have smaller outputs as well. Brotli at level 2 produces the same-sized archive with this input, in just 3.4s. Gzip even at level 1, with a 10% larger archive, still needs 6.5s. zstd, at level 1, produces the same size as gzip -1, in 1 second! zstd, cranked up to 9, makes a 20% smaller archive in 7 seconds, less than half the time of gzip -6.


Recent GNU tar versions have a --zstd flag, FWIW.


It also has an "-a" flag that just runs the right compressor based on output file name. "tar -acf archive.tar.gz" and "tar -acf archive.tar.zst" will do the right thing in both cases.


Agreed. I’ve switched everything over to zstd and find using it actually speeds most tasks up (lower disk IO).


Sometimes you just want things to go in a hurry.

The time to make an archive depends primarily on two factors: processing power and IO speed. Depending on scenario one might have a fast CPU with a slow target device, especially if remote, in which case spending extra time compressing might be a win. In other cases the IO is plenty fast but you're limited by CPU, in which case little or no compression is what you want.

Was wondering if anyone had tried implementing a control loop that would dynamically adjust compression parameters to provide optimal compression speed, utilizing both processing power and IO as best as possible?
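If I remember correctly, zstd ships a flag along these lines that adapts the compression level to how fast the output is draining. A sketch, assuming a reasonably recent zstd, with the host and paths made up:

    tar cf - /path/to/files | zstd --adapt -T0 | ssh user@host 'cat > backup.tar.zst'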


> Second, once we have done compiling a few times, compiling a program from its latest sources can be easier than figuring out how to install an often older version with our distribution’s package manager.

This is nonsense.


I'll add that the "older" versions of some utilities in a distribution's package managers often have patches with security fixes from newer versions.

The distribution's package manager will identify when there's a newer version, especially with critical security fixes, so I don't have to regularly do manual checks to see if something should be upgraded.

Newest isn't always better, and I'm usually happy to be conservative about common system utilities like tar and gzip.


The article talks about tar.gz for archival, which is scary considering none of tar.gz, tar.xz, or tar.zst are safe formats. Corrupt a few bytes in the middle and half the archive is lost, even files untouched by the corruption.

Repair tools only exist for tar.bz2 and tar.lz, but neither gets much use compared to gz/xz. The safest compression for tar is something that splits the compression into blocks, such as btrfs compression or LTO tape compression.


Did a bit of testing on how to get files from a remote endpoint unpacked on disk the fastest; the most interesting part was developing the method. Some good ideas in this thread that I hadn't thought of testing, too.

https://nhoughto.github.io/blog/posts/2022/03/09/caching/


How about doing the tar first, then firing off the gzip as a background job afterwards?


You may try sqlite3 -A



