Formats I frequently use when temporarily saving 2D array data in Python
- For compression ratio and speed: pickle with protocol=4
- If you repeatedly read or write only part of the data: save as parquet with pyarrow

These seem like good choices.
Environment: CPU: Xeon E5-2630 × 2, RAM: 128 GB, Windows 8 64-bit, Python 3.6
Tested with feature data for machine learning:
- pandas.DataFrame, 536 rows × 178886 columns, 0.77 GB
- pandas.DataFrame, 4803 rows × 178886 columns, 6.87 GB
pickle
Looking only at compression ratio and speed, pickle with protocol=3 or later is excellent, and protocol=4 is especially convenient because it also supports exporting objects of 4 GB or more. However, on Ubuntu with Python 3.6 or earlier, protocol=4 has problems and reading/writing does not work as-is. It works normally with Python 3.7 or later, so pickle seems like a good choice if you can secure that environment or if the data is small.
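As a minimal sketch of the protocol=4 usage described above (the file name and toy DataFrame are placeholders, not the benchmark data):

```python
import pickle
import pandas as pd

# Small stand-in for the feature DataFrame used in the benchmark
df = pd.DataFrame({"a": range(1000), "b": range(1000)})

# protocol=4 also supports serializing objects larger than 4 GB
with open("features.pkl", "wb") as f:
    pickle.dump(df, f, protocol=4)

with open("features.pkl", "rb") as f:
    restored = pickle.load(f)
```

On Python 3.8+ the default protocol is already 5, so passing `protocol=4` explicitly mainly matters on older interpreters.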
joblib
Compared to pickle, the compression ratio and read/write speed are somewhat middling, but it can read and write objects of 4 GB or more even on Python 3.6, so it may be a good option for those who cannot upgrade Python because of package constraints.
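A minimal sketch of the joblib approach (file name, DataFrame, and the `compress=3` level are illustrative placeholders):

```python
import joblib
import pandas as pd

# Small stand-in for the feature DataFrame used in the benchmark
df = pd.DataFrame({"a": range(1000), "b": range(1000)})

# compress takes 0-9; higher values trade write speed for a smaller file
joblib.dump(df, "features.joblib", compress=3)
restored = joblib.load("features.joblib")
```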
pyarrow => parquet
Its appeal is that you can read and write specified rows and columns while keeping roughly the same compression ratio and read/write speed as joblib with compress=0. Writing in particular is fast, so this seems like a good choice when reads and writes arrive at random.
The influence of the environment seems to be very large: when I repeated the experiment on another machine with a different OS, a difference that had been 20× shrank to only 4×. It is better to test in the environment where you will actually run the code.