Background

How do you compress your data backups? In this entry, I summarize the results of a brief verification of the XZ compression format, which is known for its high compression ratio.

The reason for this verification is that I was looking for a way to back up data to S3 on AWS. I already keep a double backup of a data server that holds business project information. As a disaster countermeasure, I decided to add a backup on a remote server. However, costs depend on the amount of data stored and transferred, so I ran this verification to see how much that cost could be reduced.

When I searched online, I found several articles that measured the compression ratio and the time required for compression, but I had the impression that the types of data they used were skewed and that many of them were closer to theoretical tests.

Therefore, I want to check how much benefit can be obtained with the types of data that are common in business.

Purpose of verification

Verify with data patterns that occur in actual business rather than with theoretical compression ratios.
Since the results depend on the performance of the machine, strict measurements are not required. The purpose is to get a rough guideline.
From the results, consider whether XZ is practical for real-world use.

Validation data

/tmp/compress-test
├ design:    1.8GB
├ logs:      8.8GB
└ wordpress: 50MB

First, I prepared the test data in this directory. The design data includes files such as PDF, AI, PSD, XD, and PNG. To make the set more realistic, I deliberately mixed several file types. It is meant to resemble the data that designers and directors produce.

For the log data, I prepared 8.8 GB of log files written daily by a web server. This is also a common pattern in server operation.

The last set is data containing source code. I prepared the WordPress package in its default state.

Method of verification

This time, I verify with the multi-threaded version of XZ to stay closer to practical use. XZ has a high compression rate, but compression takes a very long time. Compressing hundreds of GB with a single thread was impractical, so I verify with multithreading, although the results of course depend on the performance of the machine.

Verification machine

#	iMac (Retina 5K, 27-inch, Late 2015)
CPU	Core i7-6700K, 4 cores / 8 threads (4.0-4.2 GHz)
RAM	32GB DDR3 1867MHz
SSD	WD Black SN750 (read and write both 2,700 MB/s)

Since the CPU has 4 cores and 8 threads, this verification uses 8 threads. Read/write speed affects I/O, so keep in mind that the storage is an SSD. Also keep in mind that the memory is DDR3, which is slower than the current DDR4.

Verification environment

Install PIXZ

$ brew install pixz

Verification command

# Compression
$ tar -C [parent directory of the target] -cf - [target directory name] | pixz -9 > [output file path]

# Decompression
$ pixz -d -i [output file path] | tar xf -

If you specify an absolute path in the tar command, the archive will store that full path, so the "-C" option is used to archive relative to the parent directory instead.
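The effect of "-C" can be checked on a throwaway directory. The following is a minimal sketch using only tar; the temporary directory and the file name a.txt are placeholders made up for this demonstration and are not part of the test data.

```shell
# Build a small throwaway tree (hypothetical names, not the real test data)
work=$(mktemp -d)
mkdir -p "$work/parent/design"
echo "sample" > "$work/parent/design/a.txt"

# -C switches to the parent directory first, so only "design/..." is stored
tar -C "$work/parent" -cf "$work/design.tar" design

# List the member names recorded in the archive
members=$(tar -tf "$work/design.tar")
echo "$members"
```

The listing should show entries that begin with design/ rather than with the full temporary path.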

Conduct verification

Design data

Perhaps because it contains a variety of file formats, compression took a little over 2 minutes, which feels slow for 1.8 GB. The compressed file is about 34% of the original size, so the compression rate is 66%. Decompression was much faster than compression, which surprised me.

Log data

The log data is plain text only, but compression still took about 8 minutes. Compared with the design data, the time looks roughly proportional to the size, with the log data being slightly faster. The compressed data is about 5% of the original size, a compression rate of 95%. Decompression is also fast for the amount of data.

Source code data

Finally, the WordPress source data. Since it is small, compression finished in about 18 seconds. The compressed size is about 18% of the original, a compression rate of 82%. The rate is lower than for the log data, which I think is because the package includes some image data.

Verification results

Data type	Original size	Compression time	Compressed size	Compression rate	Decompression time
Design data	1.8GB	2 minutes 19 seconds	624MB	66%	6.2 seconds
Log data	8.8GB	8 minutes 11 seconds	480MB	95%	15.6 seconds
Source code	50MB	18.7 seconds	9.1MB	82%	1.7 seconds

Since this verification is only meant as a rough guideline, sizes are calculated in MB and times are rounded. Please note that these are not strict measurements.
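As a cross-check, the compression rate column can be recomputed from the sizes in the table. This is only a sketch: the function name compression_rate is made up here, and it assumes 1 GB = 1024 MB, which is a guess at how the original figures were converted.

```shell
# compression_rate ORIGINAL_MB COMPRESSED_MB
# Prints the size reduction as a whole percentage: (1 - compressed / original) * 100
compression_rate() {
  awk -v orig="$1" -v comp="$2" 'BEGIN { printf "%.0f\n", (1 - comp / orig) * 100 }'
}

# Sizes from the table, with GB converted at 1024 MB per GB (an assumption)
echo "Design data: $(compression_rate 1843.2 624)%"
echo "Log data:    $(compression_rate 9011.2 480)%"
echo "Source code: $(compression_rate 50 9.1)%"
```

Under that assumption the function gives 66%, 95%, and 82%, matching the table.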

Impressions

From the above results, I learned that the compression rate and time change depending on the type of data. One question remains, though. This time I created an archive with tar and then compressed it, so might the compression rate depend on tar at the archiving stage?

If so, it is possible that XZ compression itself does not vary with the type of data and that only the tar stage does. I think this still needs to be verified. If anyone is familiar with it, please let me know.
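One way to look into this question is to compare the size of a plain tar archive with the size of its input. As far as I know, tar without a compression flag only adds headers and padding and does not compress, so any size reduction would come from the XZ stage alone. A minimal sketch with a synthetic log file (the file name access.log and its contents are made up for illustration):

```shell
work=$(mktemp -d)

# Generate a repetitive, log-like text file of 1,000,000 bytes (50,000 lines of 20 bytes)
awk 'BEGIN { for (i = 0; i < 50000; i++) print "GET /index.html 200" }' > "$work/access.log"

# Archive it with plain tar (no z, j, or J flag, so no compression is requested)
tar -C "$work" -cf "$work/plain.tar" access.log

# Compare byte counts; tr strips the padding some wc implementations add
orig_bytes=$(wc -c < "$work/access.log" | tr -d ' ')
tar_bytes=$(wc -c < "$work/plain.tar" | tr -d ' ')
echo "original: $orig_bytes bytes, tar: $tar_bytes bytes"
```

If the tar file is no smaller than the original, the differences between data types in the results would have to come from XZ rather than from tar.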

Load on the machine

I ran the compression with 8 threads, and CPU usage during compression was 600-800%. Since every core was close to 100%, limiting the number of threads allocated to compression is essential for business use.
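A rough sketch of how the thread count could be limited. It assumes pixz's "-p" option (number of CPU cores to use) and lowers the process priority with nice. The helper half_cores, the choice of half the cores, and the output path /tmp/logs.tar.xz are illustrative assumptions, not settings I measured.

```shell
# half_cores TOTAL
# Prints half of TOTAL (rounded down), but never less than 1
half_cores() {
  n=$(( $1 / 2 ))
  if [ "$n" -lt 1 ]; then n=1; fi
  echo "$n"
}

# Number of logical CPUs currently online; falls back to 1 if getconf is unavailable
total=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
threads=$(half_cores "$total")
echo "using $threads of $total threads"

# Run only when pixz and the test data exist; /tmp/logs.tar.xz is a hypothetical output path
if command -v pixz >/dev/null 2>&1 && [ -d /tmp/compress-test/logs ]; then
  tar -C /tmp/compress-test -cf - logs | nice -n 19 pixz -p "$threads" -9 > /tmp/logs.tar.xz
fi
```

Fewer threads will lengthen the compression time, so the right value depends on how much CPU the other workloads on the machine need.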

Also, on a VPS, running at a high CPU usage rate for a long time may make you subject to CPU usage restrictions.

I have been contacted by "Sakura VPS" about this before.

On EC2, T instances may use up their CPU credits early, so I think it is better to consider a compression method that puts less load on the CPU.

Compression rate and time

The compression rates were 66% for the design data, 95% for the log data, and 82% for the source code, which is very satisfactory. In particular, design data often shrinks very little even when compressed, so this result suggests XZ is worth using for it.

The compression rate is good, but my impression is that it takes too long, considering that these are the times with 8 threads. It may be a little difficult in an environment where available resources are limited, but there seem to be various uses for it on a personal machine.
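To put the compression times in perspective, they can be converted to throughput. A rough sketch using the sizes and times from the results table, assuming 1 GB = 1024 MB (the function name throughput is made up here):

```shell
# throughput SIZE_MB SECONDS
# Prints the compression throughput in MB/s with one decimal place
throughput() {
  awk -v mb="$1" -v sec="$2" 'BEGIN { printf "%.1f\n", mb / sec }'
}

# 2 min 19 s = 139 s, 8 min 11 s = 491 s
echo "Design data: $(throughput 1843.2 139) MB/s"
echo "Log data:    $(throughput 9011.2 491) MB/s"
echo "Source code: $(throughput 50 18.7) MB/s"
```

On these figures the log data works out to roughly 18 MB/s and the design data to roughly 13 MB/s, so hundreds of GB would take hours even with 8 threads on this machine.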

The decompression time is reasonably fast for the amount of data, so it should be able to handle moderately urgent restores.

This was not a rigorous verification, but I hope it will be helpful for those who want a rough guideline.

Verify the compression rate and time of PIXZ used in practice