I collected and benchmarked several libraries to help choose a lossless compression algorithm on the JVM (sources and raw numbers are on GitHub). Based on the results, my personal impression of each compression algorithm is as follows.
| Algorithm | Type | Applications that seem suitable |
|---|---|---|
| Zstandard | High compression ratio / lower processing speed | File-based data storage, data transfer between servers |
| lz4 > Snappy | Lower compression ratio / high processing speed | High-throughput work such as stream processing; phases that require low latency |
| bzip2 | Compression ratio is the highest priority | Distribution of huge data sets, storage of cold data |
| gzip | Portability is the highest priority | Data transfer and long-term storage |
In addition to the Java standard GZIP and Deflate (ZIP), I used libraries implementing the following algorithms. See the build.sbt at the end of the page for each library's Maven repository ID.
In the benchmarks below, "Block" is the speed when using the `byte[]`-to-`byte[]` API provided by each library, and "Stream" is the speed when using the Java standard `InputStream`/`OutputStream` API.
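To illustrate the difference between the two measurement styles, here is a minimal sketch using only the Java standard library (`Deflater` for the block style, `GZIPOutputStream` for the stream style). The class and method names are illustrative; the actual benchmarks use each library's own API, and the real code is in the GitHub repository linked above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.Deflater;
import java.util.zip.GZIPOutputStream;

public class CompressStyles {

    // Block style: compress a whole byte[] into another byte[] in memory.
    static byte[] compressBlock(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Stream style: write through a compressing OutputStream wrapper.
    static byte[] compressStream(byte[] input) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(input);
        }
        return out.toByteArray();
    }
}
```

The block style keeps the whole input and output in memory, while the stream style goes through wrapper objects and internal buffering, so the two measurements can legitimately differ.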
The first result uses the US Declaration of Independence as the sample binary. Since it is US-ASCII, all byte values fall within 0x00-0x7F and the entropy is low. Brotli achieved the highest compression ratio, and lz4 the highest processing speed.
The following is the result of using the full text of Natsume Soseki's "Kokoro" as the sample. The Japanese UTF-8 representation compresses better than the US-ASCII document because each character occupies 3 bytes and the byte patterns are redundant.
For text data, the contrast between the high-compression type and the high-speed type shows up clearly. Brotli appears to have a UTF-8 optimization mode, but this benchmark uses GENERAL mode so the results remain comparable with the binary samples.
The figure below shows the result for a sample binary: an `int[]` initialized with normally distributed random numbers $\lfloor \mathrm{Norm}(\mu=0,\ \sigma^2=100) \rfloor$ and serialized as a big-endian byte array. With $\sigma=10$, about 68% of the array elements fall within $\pm 10$ and 95% within $\pm 20$. In other words, in the resulting byte array the 3 high-order bytes of each 4-byte value are almost always `0x00` or `0xFF`, and only the least significant byte varies.
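A minimal sketch of how such a sample could be generated, assuming a scaled `nextGaussian()` and big-endian serialization via `ByteBuffer`; the class and method names are illustrative, and the original generation code is in the GitHub repository linked above.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Random;

public class IntSample {
    // Generate floor(Norm(mu=0, sigma^2=100)) values and serialize them
    // as a big-endian byte array (4 bytes per int).
    static byte[] generate(int length, long seed) {
        Random random = new Random(seed);
        ByteBuffer buffer = ByteBuffer.allocate(length * Integer.BYTES)
                                      .order(ByteOrder.BIG_ENDIAN);
        for (int i = 0; i < length; i++) {
            // nextGaussian() ~ Norm(0, 1); scale by sigma = 10, then floor.
            int value = (int) Math.floor(random.nextGaussian() * 10.0);
            buffer.putInt(value);
        }
        return buffer.array();
    }
}
```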
Such binaries compress well: the high-compression algorithms reduce the size by roughly 70-80%. The speed gap also narrows, so the advantage of the high-speed algorithms becomes smaller.
The figure below shows the result for a `double[]` sample whose values were generated from the standard normal distribution $\mathrm{Norm}(\mu=0,\ \sigma^2=1)$ and serialized as a byte array in IEEE 754 format. For this floating-point data, even the high-compression algorithms reduced the size by less than 5%, so little compression can be expected. Indeed, the entropy of the sample binary is high; as a byte array it looks as if each byte value were drawn from a uniform random distribution.
The negative value for lz4 indicates a slight increase in size.
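For reference, a minimal sketch of how such a floating-point sample could be produced, again with illustrative class and method names rather than the post's actual code.

```java
import java.nio.ByteBuffer;
import java.util.Random;

public class DoubleSample {
    // Generate standard-normal Norm(mu=0, sigma^2=1) values and serialize
    // them as IEEE 754 big-endian bytes (8 bytes per double).
    static byte[] generate(int length, long seed) {
        Random random = new Random(seed);
        ByteBuffer buffer = ByteBuffer.allocate(length * Double.BYTES);
        for (int i = 0; i < length; i++) {
            buffer.putDouble(random.nextGaussian()); // standard normal value
        }
        return buffer.array();
    }
}
```

Because the mantissa bits of a continuous random value are effectively noise, most bytes of each 8-byte value carry little redundancy for a compressor to exploit, which matches the poor compression observed above.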