How do you compress your data backups? In this entry, I would like to summarize a brief verification result for the XZ compression format, which is known for its high compression ratio.
The reason for this test was that I was looking for a way to back up data to S3 on AWS. I keep business project data on a data server and maintain a second copy as a backup. As a disaster countermeasure, I decided to also back up to a remote server. However, costs depend on the amount of data stored and transferred, so I ran this test to see how much I could reduce them.
When I searched online, I found several articles comparing compression ratio and compression time, but I felt the data sets used were biased and many of the tests were closer to theoretical benchmarks.
So here I want to see how much benefit can be obtained with the kinds of data that are common in business.
/tmp/compress-test
├ design: 1.8GB
├ logs: 8.8GB
└ wordpress: 50MB
First, I prepared the test data in the directory above. The design data includes files such as PDF, AI, PSD, XD, and PNG. To make the structure more realistic, I deliberately mixed multiple file types; think of it as data handed over from designers and directors.
For the log data, I prepared 8.8 GB of log files that a web server outputs daily. This is also a common pattern in server operation.
Finally, there is data consisting of source code: a WordPress package in its default state.
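For reference, the directory sizes above can be checked with du. A minimal sketch, assuming the layout shown earlier (the output lines are illustrative, based on the sizes listed above):

# Show the size of each test directory
$ du -sh /tmp/compress-test/*
1.8G    /tmp/compress-test/design
8.8G    /tmp/compress-test/logs
 50M    /tmp/compress-test/wordpress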
This time I will test with a multi-threaded implementation of XZ to keep things closer to real-world use. XZ achieves a high compression ratio, but compression takes a very long time; compressing hundreds of GB with a single thread was impractical, so I test with multithreading. Of course, the results depend on the performance of the machine.
# | iMac (Retina 5K, 27-inch, Late 2015) |
---|---|
CPU | Core i7-6700K 4 cores 8 threads(4.0〜4.2GHz) |
RAM | 32GB DDR3 1867MHz |
SSD | WD Black SN750 (Read/Write both around 2,700MB/s) |
Since the CPU has 4 cores and 8 threads, this test uses 8 threads. Read/write speed affects I/O, so keep in mind that the storage is an SSD. Also, the memory is DDR3, so it is somewhat slower than current DDR4.
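If you want to match the thread count to your own machine, the number of logical CPUs can be checked like this (sysctl is the macOS way; on Linux you would use nproc instead):

# macOS: number of logical CPUs (this iMac reports 8)
$ sysctl -n hw.ncpu
# Linux equivalent
$ nproc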
$ brew install pixz
# Compression
$ tar -C <parent directory of the target> -cf - <directory to compress> | pixz -9 > <output file path>

# Decompression
$ pixz -d -i <output file path> | tar xf -
If you pass an absolute path to the tar command, the archive will store that absolute path, so the "-C" option is used to avoid this.
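As a concrete illustration, here is how I would compress one of the test directories and then send it to S3, since that is the purpose of this whole exercise. The file names and bucket name are hypothetical, and the aws CLI must be installed and configured separately:

# Compress the log directory from the test data
$ tar -C /tmp/compress-test -cf - logs | pixz -9 > /tmp/logs.tar.xz

# Restore it later
$ pixz -d -i /tmp/logs.tar.xz | tar xf -

# Upload the archive to S3 (hypothetical bucket name)
$ aws s3 cp /tmp/logs.tar.xz s3://my-backup-bucket/logs.tar.xz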
Data type | Original size | Compression time | Compressed size | Space saved | Decompression time |
---|---|---|---|---|---|
Design data | 1.8GB | 2 minutes 19 seconds | 624MB | 66% | 6.2 seconds |
Log data | 8.8GB | 8 minutes 11 seconds | 480MB | 95% | 15.6 seconds |
Source code | 50MB | 18.7 seconds | 9.1MB | 82% | 1.7 seconds |
Since this test is only meant as a rough guide, sizes are given in MB and times are rounded. Please note that this is not a rigorous benchmark.
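For reference, the "space saved" percentages are simply derived from the sizes before and after compression. A quick check for the design data, assuming 1.8GB is roughly 1,843MB:

# Space saved = (1 - compressed / original) * 100
$ echo "scale=3; (1 - 624 / 1843) * 100" | bc
66.200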
From the results above, I could see that both the compression ratio and the time vary with the type of data. One question remains, though: since I created an archive with tar first and then compressed it, could the results depend on how tar archives the data rather than on XZ itself?
If so, the differences might come from tar at archiving time rather than from the data type. I still need to verify this, so if anyone is familiar with it, please let me know.
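One way to check this would be to compare the size of the raw tar archive (before pixz) with the original directory, and then compress that archive on its own to isolate XZ's contribution. A minimal sketch, using the log directory as an example:

# Create the uncompressed tar archive and compare its size to the source
$ tar -C /tmp/compress-test -cf /tmp/logs.tar logs
$ du -sh /tmp/compress-test/logs /tmp/logs.tar

# Compress the plain archive by itself
$ pixz -9 < /tmp/logs.tar > /tmp/logs.tar.xz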
I ran the compression with 8 threads, and CPU usage during compression was 600-800%, meaning all cores were close to 100%. For business use, it is essential to limit the number of threads you allocate; see the example below.
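pixz lets you limit the number of worker threads with the -p option, which keeps some cores free for other workloads. A minimal example using 4 of the 8 threads (paths are the same hypothetical ones as above):

# Limit pixz to 4 threads so other processes keep some CPU headroom
$ tar -C /tmp/compress-test -cf - logs | pixz -9 -p 4 > /tmp/logs.tar.xz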
Also, on a VPS, running at a high CPU usage rate for a long time may get you throttled by the provider.
On EC2, burstable T instances may use up their CPU credits quickly, so it is worth considering a compression approach that puts less load on the CPU.
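One option worth testing is a lower compression preset. Like xz, pixz accepts levels -0 through -9, and lower presets use noticeably less CPU time at the cost of some compression ratio. A sketch, combining a lighter preset with a reduced thread count (values are examples, not tested here):

# Trade compression ratio for lower CPU load
$ tar -C /tmp/compress-test -cf - logs | pixz -3 -p 2 > /tmp/logs.tar.xz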
In terms of compression ratio, the results were very satisfying: 66% saved for the design data, 95% for the log data, and 82% for the source code. Design data in particular often does not shrink much when compressed, so this seems genuinely usable.
The compression ratio is good, but it takes a long time... and that is with 8 threads. It may be difficult in environments where available resources are limited, but for personal use there seem to be plenty of applications.
The decompression time is reasonably fast for the data size, so it should be able to handle moderately urgent restores.
This was not a rigorous test, but I hope it is helpful to anyone looking for a rough guideline.