I collected and benchmarked several libraries to help choose a lossless compression algorithm on the JVM (sources and raw numbers are on GitHub). Based on the results, my personal impression of each compression algorithm is as follows.
| Algorithm | Type | Applications that seem suitable |
|---|---|---|
| Zstandard | High compression ratio / lower processing speed | File-based data storage, data transfer between servers |
| lz4 > Snappy | Lower compression ratio / high processing speed | High-throughput work such as stream processing; phases that require low latency |
| bzip2 | Compression ratio is the highest priority | Distribution of huge data sets, storage of cold data |
| gzip | Portability is the highest priority | Data transfer and long-term storage |
In addition to the Java standard GZIP and Deflate (ZIP), I used libraries implementing the following algorithms. See the build.sbt at the end of the page for each library's Maven repository ID.
In the benchmarks below, "Block" is the speed when using the `byte[]`-to-`byte[]` API provided by each library, and "Stream" is the speed when using the Java standard `InputStream`/`OutputStream` API.
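To illustrate the difference between the two measurement styles, here is a minimal sketch using only the Java standard library (`Deflater` for the block style, `GZIPOutputStream` for the stream style). The class and method names are illustrative; the actual benchmarks use each library's own API, and the real code is in the GitHub repository linked above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.Deflater;
import java.util.zip.GZIPOutputStream;

public class CompressStyles {

    // Block style: compress a whole byte[] into another byte[] in memory.
    static byte[] compressBlock(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Stream style: write through a compressing OutputStream wrapper.
    static byte[] compressStream(byte[] input) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(input);
        }
        return out.toByteArray();
    }
}
```

The block style keeps the whole input and output in memory, while the stream style goes through wrapper objects and internal buffering, so the two measurements can legitimately differ.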
The first result uses the US Declaration of Independence as the sample binary. Since it is US-ASCII, all byte values fall within 0x00-0x7F and the entropy is low. Brotli achieved the highest compression ratio, and lz4 the highest processing speed.
The following is the result of using the full text of Natsume Soseki's "Kokoro" as the sample. The Japanese UTF-8 representation compresses better than the US-ASCII document because each character occupies 3 bytes and the byte patterns are redundant.
For text data, the contrast between the high-compression type and the high-speed type shows up clearly. Brotli appears to have a UTF-8 optimization mode, but this benchmark uses GENERAL mode so the results remain comparable with the binary samples.
The figure below shows the result for a sample binary: an `int[]` initialized with normally distributed random numbers $\lfloor \mathrm{Norm}(\mu=0,\ \sigma^2=100) \rfloor$ and serialized as a big-endian byte array. With $\sigma=10$, about 68% of the array elements fall within $\pm 10$ and 95% within $\pm 20$. In other words, in the resulting byte array the 3 high-order bytes of each 4-byte value are almost always `0x00` or `0xFF`, and only the least significant byte varies.
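A minimal sketch of how such a sample could be generated, assuming a scaled `nextGaussian()` and big-endian serialization via `ByteBuffer`; the class and method names are illustrative, and the original generation code is in the GitHub repository linked above.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Random;

public class IntSample {
    // Generate floor(Norm(mu=0, sigma^2=100)) values and serialize them
    // as a big-endian byte array (4 bytes per int).
    static byte[] generate(int length, long seed) {
        Random random = new Random(seed);
        ByteBuffer buffer = ByteBuffer.allocate(length * Integer.BYTES)
                                      .order(ByteOrder.BIG_ENDIAN);
        for (int i = 0; i < length; i++) {
            // nextGaussian() ~ Norm(0, 1); scale by sigma = 10, then floor.
            int value = (int) Math.floor(random.nextGaussian() * 10.0);
            buffer.putInt(value);
        }
        return buffer.array();
    }
}
```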
Such binaries compress well: the high-compression algorithms reduce the size by roughly 70-80%. The speed gap also narrows, so the advantage of the high-speed algorithms becomes smaller.
The figure below shows the result for a `double[]` sample whose values were generated from the standard normal distribution $\mathrm{Norm}(\mu=0,\ \sigma^2=1)$ and serialized as a byte array in IEEE 754 format. For this floating-point data, even the high-compression algorithms reduced the size by less than 5%, so little compression can be expected. Indeed, the entropy of the sample binary is high; as a byte array it looks as if each byte value were drawn from a uniform random distribution.
The negative value for lz4 indicates a slight increase in size.
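For reference, a minimal sketch of how such a floating-point sample could be produced, again with illustrative class and method names rather than the post's actual code.

```java
import java.nio.ByteBuffer;
import java.util.Random;

public class DoubleSample {
    // Generate standard-normal Norm(mu=0, sigma^2=1) values and serialize
    // them as IEEE 754 big-endian bytes (8 bytes per double).
    static byte[] generate(int length, long seed) {
        Random random = new Random(seed);
        ByteBuffer buffer = ByteBuffer.allocate(length * Double.BYTES);
        for (int i = 0; i < length; i++) {
            buffer.putDouble(random.nextGaussian()); // standard normal value
        }
        return buffer.array();
    }
}
```

Because the mantissa bits of a continuous random value are effectively noise, most bytes of each 8-byte value carry little redundancy for a compressor to exploit, which matches the poor compression observed above.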