Comparison of compression libraries for JavaVM

I collected and benchmarked several libraries to help select a lossless compression algorithm on the JavaVM (sources and raw numbers are on GitHub). Based on the results, my personal impression of each algorithm is as follows.

| algorithm | type | Applications that seem suitable |
|---|---|---|
| Zstandard | High compression ratio / low processing speed | File-based data storage and transfer between servers |
| lz4 > Snappy | Low compression ratio / high processing speed | High-throughput uses such as stream processing, and phases that require low latency |
| bzip2 ≈ Brotli | Compression ratio is the highest priority | Distribution of huge data and storage of cold data |
| gzip ≈ ZLib (ZIP) | Portability is the highest priority | Data transfer and long-term storage |

In addition to the Java standard GZIP and Deflate (ZIP), we used libraries that implement the following algorithms. See build.sbt at the end of the page for each library's Maven repository ID.

In the benchmarks below, Block is the result when using the byte[]-to-byte[] API provided by each library, and Stream is the result when going through the Java standard InputStream/OutputStream.
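As an illustration of the two call styles, here is a minimal sketch using only the Java standard library (the class and method names are my own, not the benchmark code): Deflater stands in for a byte[]-to-byte[] Block API, while GZIPOutputStream represents the Stream API. The third-party libraries in the benchmark expose analogous calls, though the exact class names differ per library.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.GZIPOutputStream;

public class ApiStyles {
    // "Block" style: compress one byte[] directly into another byte[].
    static byte[] compressBlock(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[8192];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // "Stream" style: wrap a standard OutputStream in a compressing stream.
    static byte[] compressStream(byte[] input) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(input);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "example payload".getBytes(StandardCharsets.UTF_8);
        System.out.println("block : " + compressBlock(data).length + " bytes");
        System.out.println("stream: " + compressStream(data).length + " bytes");
    }
}
```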

Text data compression

The first sample is the text of the US Declaration of Independence, used as the input binary for compression. Since it is US-ASCII, every byte value falls within 0x00-0x7F and the entropy is low. Brotli achieved the highest compression ratio and lz4 the highest processing speed.

(Figure: 20180322_us-ascii.png)

The next result uses the full text of Natsume Soseki's "Kokoro" as the sample. Its Japanese UTF-8 representation compresses better than the US-ASCII document because each character occupies 3 bytes and the byte patterns are redundant.

(Figure: 20180322_utf-8.png)

For text data, the difference between the high-compression and high-speed types shows clearly. Brotli appears to have a UTF-8 (text) optimization mode, but this benchmark used its GENERAL mode so that the results stay comparable with the binary samples.
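For reference, the compression ratio reported in this kind of comparison can be measured with a few lines of standard-library code. The sketch below uses only GZIP and short stand-in strings (not the actual benchmark corpus), and assumes Java 11+ for String.repeat:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class TextRatio {
    // Compression ratio = 1 - (compressed size / original size).
    static double gzipRatio(byte[] input) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(input);
        }
        return 1.0 - (double) out.size() / input.length;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stand-ins for the US-ASCII and Japanese UTF-8 samples.
        String ascii = "We hold these truths to be self-evident ... ".repeat(200);
        String japanese = "私はその人を常に先生と呼んでいた。".repeat(200);
        System.out.printf("US-ASCII : %.1f%%%n", gzipRatio(ascii.getBytes(StandardCharsets.US_ASCII)) * 100);
        System.out.printf("UTF-8    : %.1f%%%n", gzipRatio(japanese.getBytes(StandardCharsets.UTF_8)) * 100);
    }
}
```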

Numerical data compression

The figure below shows the result of compressing an int[] sample that was initialized with normally distributed random numbers $\lfloor \mathrm{Norm}(\mu=0, \sigma^2=100) \rfloor$ and serialized as a big-endian byte array. This means 68% of the array elements fall within $\pm 10$ and 95% within $\pm 20$; in other words, in the upper 3 bytes of each element (everything except the least significant byte), almost every value is 0x00 or 0xFF.
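As a rough illustration (my own sketch, not the benchmark's actual generator), such a sample binary could be produced by drawing Gaussian values with σ = 10, taking the floor, and writing each int in big-endian order:

```java
import java.nio.ByteBuffer;
import java.util.Random;

public class NormIntSample {
    // Build a big-endian byte[] of ints drawn from floor(Norm(mu=0, sigma^2=100)).
    static byte[] normIntSample(int count, long seed) {
        Random rand = new Random(seed);
        ByteBuffer buf = ByteBuffer.allocate(count * Integer.BYTES); // big-endian by default
        for (int i = 0; i < count; i++) {
            int value = (int) Math.floor(rand.nextGaussian() * 10.0); // sigma = 10
            buf.putInt(value);
        }
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] sample = normIntSample(1_000_000, 42L);
        // Small values mean the upper 3 bytes of each element are mostly 0x00 or 0xFF.
        System.out.println("sample size: " + sample.length + " bytes");
    }
}
```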

Such binaries compress well: the high-compression algorithms reduced the size by roughly 70-80%. The speed gap to the high-speed algorithms also narrows, so the advantage of the high-speed type is less pronounced here.

(Figure: 20180322_norn-int.png)

The figure below shows the result of compressing a double[] sample whose values were generated with standard normal random numbers $\mathrm{Norm}(\mu=0, \sigma^2=1)$ and serialized as a byte array in IEEE 754 format. For this floating-point data, even the high-compression algorithms reduced the size by less than 5%, so little compression benefit can be expected. Indeed, the entropy of the sample binary is high; as a byte array it looks much as if each byte value had been drawn from a uniform random distribution.

(Figure: 20180322_norn-double.png)

The negative value for lz4 indicates that its output was slightly larger than the input.
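To see why such floating-point data barely compresses, one can serialize standard-normal doubles in IEEE 754 form and estimate the Shannon entropy of the resulting byte values. The sketch below is my own illustration of that check, not the benchmark code:

```java
import java.nio.ByteBuffer;
import java.util.Random;

public class NormDoubleEntropy {
    public static void main(String[] args) {
        // Serialize doubles drawn from Norm(mu=0, sigma^2=1) as a big-endian IEEE 754 byte array.
        Random rand = new Random(42L);
        int count = 1_000_000;
        ByteBuffer buf = ByteBuffer.allocate(count * Double.BYTES);
        for (int i = 0; i < count; i++) {
            buf.putDouble(rand.nextGaussian());
        }
        byte[] sample = buf.array();

        // Estimate the Shannon entropy of the byte values (close to 8 bits/byte means nearly incompressible).
        long[] freq = new long[256];
        for (byte b : sample) {
            freq[b & 0xFF]++;
        }
        double entropy = 0.0;
        for (long f : freq) {
            if (f > 0) {
                double p = (double) f / sample.length;
                entropy -= p * Math.log(p) / Math.log(2);
            }
        }
        System.out.printf("entropy: %.2f bits per byte%n", entropy);
    }
}
```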
