Archives

Categories

Comparing Compression

I just did a quick test of different compression options in Debian. The source file is a 1.1G MySQL dump file. The time is user CPU time on a i7-930 running under KVM, the compression programs may have different levels of optimisation for other CPU families.

Facebook people designed the zstd compression system (here’s a page giving an overview of it [1]). It has some interesting new features that can provide real differences at scale (like unusually large windows and pre-defined dictionaries), but I just tested the default mode and the -9 option for more compression. For the SQL file “zstd -9” provides significantly better compression than gzip while taking only slightly less CPU time than “gzip -9” while zstd with the default option (equivalent to “zstd -3”) gives much faster compression than “gzip -9” while also being slightly smaller. For this use case bzip2 is too slow for inline compression of a MySQL dump as the dump process locks tables and can hang clients. The lzma and xz compression algorithms provide significant benefits in size but the time taken is grossly disproportionate.

In a quick check of my collection of files compressed with gzip I was only able to fine 1 fild that got less compression with zstd with default options, and that file got better compression with “zstd -9”. So zstd seems to beat gzip everywhere by every measure.

The bzip2 compression seems to be obsolete, “zstd -9” is much faster and has slightly smaller output.

Both xz and lzma seem to offer a combination of compression and time taken that zstd can’t beat (for this file type at least). The ultra compression mode 22 gives 2% smaller output files but almost 28 minutes of CPU time for compression is a bit ridiculous. There is a threaded mode for zstd that could potentially allow a shorter wall clock time for “zstd --ultra -22” than lzma/xz while also giving better compression.

Compression Time Size
zstd 5.2s 130m
zstd -9 28.4s 114m
gzip -9 33.4s 141m
bzip2 -9 3m51 119m
lzma 6m20 97m
xz 6m36 97m
zstd -19 9m57 99m
zstd --ultra -22 27m46 95m

Conclusion

For distributions like Debian which have large archives of files that are compressed once and transferred a lot the “zstd --ultra -22” compression might be useful with multi-threaded compression. But given that Debian already has xz in use it might not be worth changing until faster CPUs with lots of cores become more commonly available. One could argue that for Debian it doesn’t make sense to change from xz as hard drives seem to be getting larger capacity (and also smaller physical size) faster than the Debian archive is growing. One possible reason for adopting zstd in a distribution like Debian is that there are more tuning options for things like memory use. It would be possible to have packages for an architecture like ARM that tends to have less RAM compressed in a way that decreases memory use on decompression.

For general compression such as compressing log files and making backups it seems that zstd is the clear winner. Even bzip2 is far too slow and in my tests zstd clearly beats gzip for every combination of compression and time taken. There may be some corner cases where gzip can compete on compression time due to CPU features, optimisation for CPUs, etc but I expect that in almost all cases zstd will win for compression size and time. As an aside I once noticed the 32bit of gzip compressing faster than the 64bit version on an Opteron system, the 32bit version had assembly optimisation and the 64bit version didn’t at that time.

To create a tar archive you can run “tar czf” or “tar cJf” to create an archive with gzip or xz compression. To create an archive with zstd compression you have to use “tar --zstd -cf”, that’s 7 extra characters to type. It’s likely that for most casual archive creation (EG for copying files around on a LAN or USB stick) saving 7 characters of typing is more of a benefit than saving a small amount of CPU time and storage space. It would be really good if tar got a single character option for zstd compression.

The external dictionary support in zstd would work really well with rsync for backups. Currently rsync only supports zlib, adding zstd support would be a good project for someone (unfortunately I don’t have enough spare time).

Now I will change my database backup scripts to use zstd.

Update:

The command “tar acvf a.zst filenames” will create a zstd compressed tar archive, the “a” option to GNU tar makes it autodetect the compression type from the file name. Thanks Enrico!

4 comments to Comparing Compression

  • Bas

    Have you also tested decompression times and memory use? I’d be interested to see how zstd would do in those benchmarks.

  • Bas: Surprisingly zstd has the fastest decompression, even faster then gzip.
    zstd: 1.8s
    gzip: 5.9s
    xz: 11.2s
    bzip2: 40.4s
    So for the case of distributions zstd might offer benefit for low CPU power platforms like some of the small ARM devices.

  • HI Russell,
    I agree with your analysis. We have done similar things for the TeX Live distribution which ships about 25000 tar containers with xz compression. Formerly we used zip, then tar.gz, now tar.xz, and this has the best trade off for us.

    OTOH, on the user computer, before an update is made, we make a backup of the respective package, and this is compressed with lz4, which we found has somehow an optimal compression/speed ratio (in particular considering that we have to do this also on Windows…). AFAIR lz4 and zstd use the (nearly the) same compression algorithm.

  • Torbjørn Thorsen

    This is probably well known, but using the “–single-transacation” argument to mysqldump will change the locking behavior for mysqldump such that many table locks are avoided.
    Using the argument for non-InnoDB tables has interesting behavior, so it’s not applicable for everyone.