This article compares the performance of Vaex, Dask, Pandas, and so on across data formats such as CSV, Parquet, and HDF5.
It is a loose, rough write-up done in the hope that it will be useful for my own work, so please forgive the lack of rigor in the details.
As a prerequisite, our project is standardized on AWS: the data infrastructure is unified on AWS and built around it rather than being multi-cloud (we do not, for example, use GCP's BigQuery just for the log side).
S3 is therefore the center of data storage, used through Kinesis Firehose, Athena, and so on. Besides Athena, S3 is also mounted and accessed in various other ways, such as through Pandas and Dask (and at our company it is mainly used through Pandas and the like rather than Athena).
When putting data into S3, you are free to choose among various formats (though even with BigQuery you can probably get close by querying data on Cloud Storage and the like). In practice you end up choosing CSV or Parquet (or keeping multiple formats side by side and using each where it fits).
Also, since log infrastructure involves enormous amounts of data, I have long wanted a clear picture of the disk sizes and read/write speeds of each format when compressed.
The difference in disk size per format affects not only the storage cost in S3 but also the scan cost in Athena and the like, which is billed on the compressed file size, so it cannot be ignored.
There are many file formats that can be used for logs, but before the verification, here are my prior impressions of each, given our usage:
- **CSV**: Usable in almost any environment (language, library, service, and so on). It is quite slow compared to the other formats, so it can feel tight with large data.
- **JSON**: Supported in as many environments as CSV, and because the data has a hierarchical structure with a key string attached to each value, it feels less prone to the column misalignment that can occur with CSV. However, most of our logs are tabular data and the file size is larger than CSV, so there is little merit in choosing it. That said, since most of the key names repeat, it shrinks considerably when compressed, so after compression the size is not that different from CSV.
- **Parquet**: My impression is that it is a little slower than HDF5 and the like, but considerably faster than CSV. Thanks to the columnar format, the compressed file size has the potential to become much smaller (perhaps the smallest of the formats handled this time). It is supported by many AWS services and by libraries such as Pandas, Dask, and Vaex. On the other hand, it can work against you when an individual file is small (a compressed CSV can end up smaller). When logs are saved every few minutes with Firehose and the like, there will be quite a few cases (tables) with small files, so there may be little merit unless such data-lake files are consolidated on, say, a daily basis. Reading and writing it with each library feels very simple.
- **HDF5**: It can be handled by each Python library, but the AWS services (Athena and so on) do not support it. It is often used with Vaex (many of the samples showcasing Vaex's speed use HDF5). Speed-wise it feels close to Pickle (slightly slower).
- **Pickle, npy, etc.**: Pickle (serialized binaries of Python objects) and npy (NumPy-format binaries) can be handled very fast in Python, but they become unusable across Python or NumPy version changes, so they are not realistic for long-term log storage (though very useful for temporary analysis), and this article does not cover them. Also, except for cases like NumPy-format memmaps, everything has to be loaded into memory at once (they basically cannot be handled directly by Vaex or Dask), and they are not supported in the AWS environment either.
TL;DR
Since the code and so on got quite long, I will start with the conclusions.
- Parquet did not have as much of a disk-size advantage over CSV as I expected. As far as this trial goes, it does not look like a win for disk cost unless there are quite a lot of rows.
- Since CSV does better up to the thousands-of-rows level, you need to be careful when small files are sent to the data lake every few minutes.
- Especially when using Kinesis Firehose and the like, Parquet conversion incurs extra AWS cost, so you end up paying more for both disk and conversion.
- Parquet seems to shine in cases like consolidating data-lake files into a single Parquet file per day, or for data extracted by ETL.
- Speed-wise, CSV is preferable at the level of a few hundred rows, but beyond that Parquet becomes considerably faster.
- For very small data Pandas is faster than Vaex, but Vaex becomes faster around the 10,000-row level, and in some cases even at a few thousand rows.
- At the scale of 100,000 rows or more, HDF5 with Vaex becomes considerably faster than Parquet. Up to that point, Parquet seems to have a clear advantage.
- Snappy compression for Parquet, which is also Pandas' default, seems to have very little impact on reads; it balances disk-size reduction and performance. As files grow, the difference from gzip compression becomes quite noticeable.
- Dask's results were not good in the range tried here. The small row counts and the single-file setup probably have a large effect. Vaex, on the other hand, seems to perform very well even with a single file.
- Parquet + Vaex performed better than I expected; given the balance of advantages such as ease of handling and file size, it may be worth actively considering it instead of CSV.
- Vaex + CSV is quite slow, so there seems to be little merit in that combination.
What is verified this time:
- File formats: uncompressed CSV, gzip-compressed CSV, uncompressed Parquet, Snappy-compressed Parquet, gzip-compressed Parquet, uncompressed HDF5, and zlib-compressed HDF5 (the compressed HDF5 is excluded for Vaex).
- Libraries: Pandas, Dask, and Vaex.
- Disk sizes are compared and plotted.
- Reads and calculations are compared and plotted.
- Row counts include small files, assuming cases where fine-grained data is sent to a data lake every few minutes. A single file is tried with 500, 2,000, 5,000, 10,000, 50,000, 100,000, 500,000, and 1,000,000 rows.
Reference articles:
- I checked Dask in Python (comfortably handle large datasets)
- I took a closer look at Python's Vaex, which is rumored to be fast
- Unexpectedly deep world of HDF (Introduction to Python / h5py)
- Handling data with many rows split across multiple files is skipped this time; if there is an opportunity, I will cover it in a separate article (in consideration of time and article length).
- Write speed is a low priority for my use case (writes happen far less often than reads), so it is barely touched on.
We will use a Python 3.8 Jupyter notebook in a Docker environment with the following settings. The OS of this image is Ubuntu 20.04.
FROM jupyter/datascience-notebook:python-3.8.6
RUN pip install vaex==2.6.1
RUN pip install jupyter-contrib-nbextensions==0.5.1
RUN jupyter contrib nbextension install --user
RUN jupyter nbextension enable hinterland/hinterland \
&& jupyter nbextension enable toc2/main
Pandas, Dask, and the rest come bundled with the image, so those versions are used without pinning them in the Dockerfile above. The specific versions can be checked from inside the notebook, for example as follows.
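A minimal check using pkg_resources (any equivalent way of printing the versions works just as well; the output is omitted here):
import pkg_resources
# Print the library versions bundled with the Docker image.
for package_name in ('pandas', 'dask', 'vaex'):
    print(package_name, pkg_resources.get_distribution(package_name).version)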
The host OS is Windows 10, and the machine has 32 GB of memory and an i7-8700 3.20 GHz CPU (6 cores / 12 threads).
Since S3 would be mounted and handled through the Python libraries, the verification really ought to be done against an actual S3 bucket, but to save time and effort it is done locally this time.
**The code below is quite long, so if you want to skip ahead to the actual comparison section, please jump there.**
First, let's compare with a small file of about 500 rows and 5 columns. Imagine a case where you saved a specific dataset in a short span (not very many rows) with Firehose or Fluentd, such as in a data lake.
I will write the process for preparing the data. Each of the five columns was set as follows.
- column_a: int -> A random value in the range of 0 to 4999999.
- column_b: int -> A random value in the range of 0 to 10000.
- column_c: int -> A random value in the range of 0 to 100.
- column_d: int -> One of the values 100, 300, 500, 1000, 3000, 10000. Imagine fixed values such as in-app purchase amounts in a smartphone app.
- column_e: str -> A 20-character string of lowercase and uppercase letters.
from string import ascii_letters
import random
import numpy as np
import pandas as pd
def make_random_ascii_str(char_num):
"""
Generates a random lowercase and uppercase string with the specified number of characters.
Parameters
----------
char_num : int
The number of characters to generate.
Returns
-------
random_str : str
The generated string.
"""
return ''.join(random.choices(population=ascii_letters, k=char_num))
def make_basic_pandas_df(row_num, char_num):
"""
Generate a Pandas dataframe for verifying the basic structure.
Parameters
----------
row_num : int
The number of rows in the data frame to generate.
char_num : int
The number of characters in the value to set in the string column.
Returns
-------
pandas_df : pd.DataFrame
The generated data frame. Each value is set in the following 5 columns.
- column_a : int ->A random value is set in the range of 0 to 4999999.
- column_b : int ->A random value is set in the range of 0 to 10000.
- column_c : int ->A random value is set in the range of 0 to 100.
- column_d : int -> One of the values 100, 300, 500, 1000, 3000, 10000 is set.
- column_e : str -> A string of lowercase and uppercase letters with the number
of characters specified by char_num.
"""
pandas_df = pd.DataFrame(
index=np.arange(0, row_num), columns=['column_a'])
row_num = len(pandas_df)
pandas_df['column_a'] = np.random.randint(0, 5000000, size=row_num)
pandas_df['column_b'] = np.random.randint(0, 10000, size=row_num)
pandas_df['column_c'] = np.random.randint(0, 100, size=row_num)
pandas_df['column_d'] = np.random.choice(a=[100, 300, 500, 1000, 3000, 10000], size=row_num)
pandas_df['column_e'] = [make_random_ascii_str(char_num=char_num) for _ in range(row_num)]
return pandas_df
A data frame of 500 rows is generated as shown below.
pandas_df = make_basic_pandas_df(row_num=500, char_num=20)
print(pandas_df.head())
column_a column_b column_c column_d column_e
0 217141 3701 90 300 qgnzZkIMxOIhiIdvWZNF
1 3123093 9938 60 3000 oMcRXqyeYLuWAqLknVmG
2 3547451 4437 55 300 wJMPyrJGaLciRPQiSiuC
3 1141404 2126 92 10000 pvQUQekUPVddWOyccdfD
4 641110 3734 52 1000 fesOBYqfgofLVNwTLGEc
For the comparison of disk size and read speed, the data is written out in each of the following formats. For HDF5, the internal data hierarchy may differ between libraries, so a separate file is written per library.
When saving, the index is not written, and where an encoding option exists, UTF-8 is specified.
- Uncompressed CSV
- gzip-compressed CSV
- Uncompressed Parquet
- Snappy-compressed Parquet
- gzip-compressed Parquet
- Uncompressed HDF5
- zlib-compressed HDF5
The exports here are mainly to set up the read comparison. CSV export is handled only by Pandas and Dask, since the Vaex version used in this article has no CSV export interface. Also, Vaex does not let you choose Parquet's compression method (Snappy only), so the Parquet files are exported with Pandas and Vaex simply reads that data.
As for HDF5, Vaex only supports uncompressed files, so the compressed variant is handled only on the Pandas and Dask side. For the compression level (0 to 9; the higher the value, the stronger the compression and the heavier the load), 4 is specified this time.
from datetime import datetime
import os
import vaex
import dask.dataframe as dd
class FP:
"""
A class that defines the file paths where each format is saved for the
disk-size and read-speed comparisons.
Attributes
----------
CSV_NO_COMPRESSION : str
Uncompressed CSV path.
CSV_GZIP : str
The gzip-compressed CSV path.
PARQUET_NO_COMPRESSION : str
Uncompressed Parquet path.
PARQUET_SNAPPY : str
Snappy Compressed Parquet path.
PARQUET_GZIP : str
The gzip-compressed Parquet path.
HDF5_NO_COMPRESSION_PANDAS : str
File path for uncompressed HDF5 Pandas.
HDF5_NO_COMPRESSION_DASK : str
File path for uncompressed HDF5 Dask.
HDF5_NO_COMPRESSION_VAEX : str
File path for uncompressed HDF5 Vaex.
HDF5_ZLIB_PANDAS : str
File path for HDF5 Pandas with zlib compression.
HDF5_ZLIB_DASK : str
File path for Dask in HDF5 with zlib compression.
"""
CSV_NO_COMPRESSION = 'csv_no_compression.csv'
CSV_GZIP = 'csv_gzip.csv.gz'
PARQUET_NO_COMPRESSION = 'parquet_no_compression.parquet'
PARQUET_SNAPPY = 'parquet_snappy.parquet'
PARQUET_GZIP = 'parquet_gzip.parquet'
HDF5_NO_COMPRESSION_PANDAS = 'hdf5_no_compression_pandas.hdf5'
HDF5_NO_COMPRESSION_DASK = 'hdf5_no_compression_dask.hdf5'
HDF5_NO_COMPRESSION_VAEX = 'hdf5_no_compression_vaex.hdf5'
HDF5_ZLIB_PANDAS = 'hdf5_zlib_pandas.hdf5'
HDF5_ZLIB_DASK = 'hdf5_zlib_dask.hdf5'
def remove_each_format_files():
"""
Delete each saved file.
"""
file_or_dir_names = os.listdir()
data_extensions = ('.csv', '.gz', '.parquet', '.hdf5')
for file_or_dir_name in file_or_dir_names:
if not os.path.isfile(file_or_dir_name):
continue
extension = f'.{file_or_dir_name.split(".")[-1]}'
if extension not in data_extensions:
continue
os.remove(file_or_dir_name)
def save_each_format(pandas_df):
"""
Save the file in each format.
Parameters
----------
pandas_df : pd.DataFrame
The data frame to be saved.
"""
remove_each_format_files()
dask_df = dd.from_pandas(data=pandas_df, npartitions=1)
vaex_df = vaex.from_pandas(df=pandas_df, copy_index=False)
print(datetime.now(), 'Start saving uncompressed CSV...')
pandas_df.to_csv(FP.CSV_NO_COMPRESSION, index=False, encoding='utf-8')
print(datetime.now(), 'Start saving gzip-compressed CSV...')
pandas_df.to_csv(FP.CSV_GZIP, index=False, encoding='utf-8', compression='gzip')
print(datetime.now(), 'Start saving uncompressed Parquet...')
pandas_df.to_parquet(FP.PARQUET_NO_COMPRESSION, compression=None, index=False)
print(datetime.now(), 'Start saving Snappy compressed Parquet...')
pandas_df.to_parquet(FP.PARQUET_SNAPPY, compression='snappy', index=False)
print(datetime.now(), 'Start saving gzip-compressed Parquet...')
pandas_df.to_parquet(FP.PARQUET_GZIP, compression='gzip', index=False)
print(datetime.now(), 'Start saving uncompressed Pandas HDF5...')
pandas_df.to_hdf(
FP.HDF5_NO_COMPRESSION_PANDAS, key='data', mode='w', complib=None, index=False)
print(datetime.now(), 'Start saving uncompressed Dask HDF5...')
dask_df.to_hdf(
FP.HDF5_NO_COMPRESSION_DASK, key='data', mode='w', complib=None, index=False)
print(datetime.now(), 'Start saving uncompressed Vaex HDF5...')
vaex_df.export_hdf5(FP.HDF5_NO_COMPRESSION_VAEX)
print(datetime.now(), 'Start saving zlib-compressed Pandas HDF5...')
pandas_df.to_hdf(
FP.HDF5_ZLIB_PANDAS, key='data', mode='w', complib='zlib',
complevel=4, index=False)
print(datetime.now(), 'Start saving zlib-compressed Dask HDF5...')
dask_df.to_hdf(
FP.HDF5_ZLIB_DASK, key='data', mode='w', complib='zlib',
complevel=4, index=False)
print(datetime.now(), 'The save process is complete.')
This article deals with several compression methods such as gzip, zlib, and Snappy, and I suspect some readers are not familiar with Snappy and the like (I was too, before touching Parquet), so a brief note.
zlib, the default compression library for Pandas' HDF5 output, is the library used to create the gzip and zip formats, so whether the compression method says gzip or zlib, you can basically regard the compression behavior as the same.
Reference: How are zlib, gzip and zip related? What do they have in common and how are they different?
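A minimal illustration of that relationship (just for intuition, not part of the benchmark): Python's gzip module is built on top of zlib, and both produce a DEFLATE stream, differing mainly in their headers, so the compressed sizes come out almost identical.
import gzip
import zlib
payload = b'example log line,123,456\n' * 1000
# Same DEFLATE algorithm at the same level; only the container differs
# (zlib header/checksum vs. gzip header/trailer).
print(len(zlib.compress(payload, level=4)))
print(len(gzip.compress(payload, compresslevel=4)))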
Snappy, on the other hand, is quite different from gzip and has the following characteristics:
- The file size is larger (the compression ratio is lower) than gzip and the like.
- Read/write speed is much faster than gzip and the like.
- The data inside a single file can be split for reading and writing. In other words, instead of holding all the data in memory at once, you can write while holding only a fixed number of rows in memory, which reduces the instantaneous load, and libraries such as Vaex can parallelize more easily even on a compressed file.
Next, a function to get the disk size of each saved format. The values are kept in a Pandas Series so they can be plotted later.
import os
def get_disk_size_info_sr():
"""
Get a series that stores the disk size of each saved-format file.
Returns
-------
sr : pd.Series
A series that stores disk-size information, with the following indexes set:
- csv_no_compression : int ->Uncompressed CSV file size.
- csv_gzip : int ->The file size of the gzip-compressed CSV.
- parquet_no_compression : int ->Uncompressed Parquet file size.
- parquet_snappy : int ->Snappy Compressed Parquet file size.
- parquet_gzip : int ->The gzip-compressed Parquet file size.
- hdf5_no_compression : int ->Uncompressed HDF5 file size (see Pandas files).
- hdf5_zlib : int ->HDF5 file size compressed with zlib (see Pandas files).
"""
sr = pd.Series(
data={
'csv_no_compression': os.path.getsize(FP.CSV_NO_COMPRESSION),
'csv_gzip': os.path.getsize(FP.CSV_GZIP),
'parquet_no_compression': os.path.getsize(FP.PARQUET_NO_COMPRESSION),
'parquet_snappy': os.path.getsize(FP.PARQUET_SNAPPY),
'parquet_gzip': os.path.getsize(FP.PARQUET_GZIP),
'hdf5_no_compression': os.path.getsize(FP.HDF5_NO_COMPRESSION_PANDAS),
'hdf5_zlib': os.path.getsize(FP.HDF5_ZLIB_PANDAS),
})
return sr
We will add a process for plotting the series that stores the file size data added in the previous section.
import matplotlib
from matplotlib.ticker import ScalarFormatter, FormatStrFormatter
import matplotlib.pyplot as plt
from enum import Enum
matplotlib.style.use('ggplot')
class PlotFileSizeUnit(Enum):
"""
An Enum class that defines the file-size units used for plotting.
Attributes
----------
BYTES : str
A constant for displaying B (bytes).
KILO_BYTES : str
Constant for displaying KB.
MEGA_BYTES : str
Constant for MB display.
"""
BYTES = 'Bytes'
KILO_BYTES = 'KB'
MEGA_BYTES = 'MB'
def plot_file_size_info(
sr, exclude_hdf5=False, unit=PlotFileSizeUnit.BYTES,
print_sr = True):
"""
Plot for file size comparison.
Parameters
----------
sr : pd.Series
A series that stores file size information. The following indexes are required.
- csv_no_compression : int
- csv_gzip : int
- parquet_no_compression : int
- parquet_snappy : int
- parquet_gzip : int
- hdf5_no_compression : int
- hdf5_zlib : int
exclude_hdf5 : bool, default False
Whether to exclude HDF5's from plotting.
unit : PlotFileSizeUnit
Specifying the unit of plot file size.
print_sr : bool, default True
Whether to output the contents of a series of plots.
"""
sr = sr.copy()
sr.sort_values(inplace=True, ascending=False)
if exclude_hdf5:
sr = sr[~sr.index.isin(['hdf5_no_compression', 'hdf5_zlib'])]
if unit == PlotFileSizeUnit.KILO_BYTES:
sr = sr / 1000
elif unit == PlotFileSizeUnit.MEGA_BYTES:
sr = sr / 1000 / 1000
if print_sr:
print(sr.sort_values(ascending=True))
title = f'File size ({unit.value})'
if exclude_hdf5:
title += ' - hdf5 excluded'
ax = sr.plot(kind='barh', figsize=(10, len(sr) // 2), title=title)
ax.xaxis.set_major_formatter(FormatStrFormatter('%d'))
plt.show()
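As a quick usage sketch (illustrative only; it just chains the helpers defined above), the size comparison for one data set looks like this:
# Write one data set in every format, then inspect and plot the sizes in KB.
pandas_df = make_basic_pandas_df(row_num=500, char_num=20)
save_each_format(pandas_df=pandas_df)
sr = get_disk_size_info_sr()
plot_file_size_info(sr=sr, unit=PlotFileSizeUnit.KILO_BYTES)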
Next, the code for measuring read speed. timeit is used for the time measurement, but using the magic command in many cells gets messy, so the measurement is computed inside a function without the magic command.
For Dask and Vaex, reading can be lazily evaluated depending on the format, so the read time alone cannot be measured. Therefore some light calculation is included, and the time until both the read and the calculation complete is what gets measured.
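As a small illustration of that lazy evaluation (a sketch only; it assumes the uncompressed CSV file has already been written), a Dask read returns immediately and the real work only happens at compute():
# Dask only builds a task graph here; nothing is actually read yet.
lazy_df = dd.read_csv(FP.CSV_NO_COMPRESSION)
lazy_sum = lazy_df['column_b'].sum()
# The file is read and the sum computed only when compute() is called,
# which is why the measurements below time "read + calculation" together.
print(lazy_sum.compute())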
The following patterns of processing are provided. Since Vaex is said to be particularly fast with strings, string operations are included as well. The second pattern involves a bit more computation.
- [Pattern 1]: Read -> slice to rows where column_a is 1,000,000 or more -> calculate the sum of column_b.
- [Pattern 2]: Read -> slice to rows where column_a is 3,000,000 or less -> slice to rows where column_b is 1,000 or more -> slice to rows where column_e starts with "a" -> GROUP BY column_d -> calculate the maximum of column_b for each group.
- [Pattern 3]: Read -> slice to rows where column_e contains the string "ab" -> calculate the number of rows after slicing.
First, the Pandas, Dask, and Vaex implementations of pattern 1 (slice to rows where column_a is 1,000,000 or more, then calculate the sum of column_b). The data-reading functions are added separately afterwards.
def calculate_pattern_1_with_pandas_df(pandas_df):
"""
Calculate the first pattern in a Pandas data frame.
Parameters
----------
pandas_df : pd.DataFrame
The target Pandas data frame.
Returns
-------
sum_val : int
Calculated total value.
"""
pandas_df = pandas_df[pandas_df['column_a'] >= 1_000_000]
sum_val = pandas_df['column_b'].sum()
return sum_val
def calculate_pattern_1_with_dask_df(dask_df):
"""
The calculation of the first pattern is done in the Dask data frame.
Parameters
----------
dask_df : dd.DataFrame
The data frame of the target Dask.
Returns
-------
sum_val : int
Calculated total value.
"""
dask_df = dask_df[dask_df['column_a'] >= 1_000_000]
sum_val = dask_df['column_b'].sum()
sum_val = sum_val.compute()
return sum_val
def calculate_pattern_1_with_vaex_df(vaex_df):
"""
The calculation of the first pattern is performed in the Vaex data frame.
Parameters
----------
vaex_df : vaex.dataframe.DataFrame
The target Vaex data frame.
Returns
-------
sum_val : int
Calculated total value.
"""
vaex_df = vaex_df[vaex_df['column_a'] >= 1_000_000]
sum_val = int(vaex_df.column_b.sum())
return sum_val
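Before measuring, a quick sanity check (illustrative only) that the three implementations agree on the same in-memory data:
# Generate a small frame once and feed it to each implementation.
check_df = make_basic_pandas_df(row_num=1000, char_num=20)
print(calculate_pattern_1_with_pandas_df(check_df))
print(calculate_pattern_1_with_dask_df(dd.from_pandas(data=check_df, npartitions=1)))
print(calculate_pattern_1_with_vaex_df(vaex.from_pandas(df=check_df, copy_index=False)))
# All three should print the same total.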
Next, the Pandas, Dask, and Vaex implementations of pattern 2 (slice to rows where column_a is 3,000,000 or less -> slice to rows where column_b is 1,000 or more -> slice to rows where column_e starts with "a" -> GROUP BY column_d -> calculate the maximum of column_b for each group).
def calculate_pattern_2_with_pandas_df(pandas_df):
"""
Calculate the second pattern in a Pandas data frame.
Parameters
----------
pandas_df : pd.DataFrame
The target Pandas data frame.
Returns
-------
max_sr : pd.Series
A series that stores the calculated maximum values.
"""
pandas_df = pandas_df[pandas_df['column_a'] <= 3_000_000]
pandas_df = pandas_df[pandas_df['column_b'] >= 1000]
pandas_df = pandas_df[pandas_df['column_e'].str.startswith('a')]
grouped = pandas_df.groupby(by='column_d')
max_df = grouped.max()
max_sr = max_df['column_b']
return max_sr
def calculate_pattern_2_with_dask_df(dask_df):
"""
The calculation of the second pattern is done in the Dask data frame.
Parameters
----------
dask_df : dd.DataFrame
The data frame of the target Dask.
Returns
-------
max_sr : pd.Series
A series that stores the calculated maximum values.
"""
dask_df = dask_df[dask_df['column_a'] <= 3_000_000]
dask_df = dask_df[dask_df['column_b'] >= 1000]
dask_df = dask_df[dask_df['column_e'].str.startswith('a')]
grouped = dask_df.groupby(by='column_d')
max_df = grouped.max()
max_sr = max_df['column_b']
max_sr = max_sr.compute()
return max_sr
def calculate_pattern_2_with_vaex_df(vaex_df):
"""
The calculation of the second pattern is performed in the Vaex data frame.
Parameters
----------
vaex_df : vaex.dataframe.DataFrame
The target Vaex data frame.
Returns
-------
max_sr : pd.Series
A series that stores the calculated maximum values.
"""
vaex_df = vaex_df[vaex_df['column_a'] <= 3_000_000]
vaex_df = vaex_df[vaex_df['column_b'] >= 1000]
vaex_df = vaex_df[vaex_df['column_e'].str.startswith('a')]
max_df = vaex_df.groupby(
by='column_d',
agg={
'column_b': vaex.agg.max,
})
max_df = max_df.to_pandas_df(column_names=['column_d', 'column_b'])
max_df.index = max_df['column_d']
max_sr = max_df['column_b']
return max_sr
Finally, the Pandas, Dask, and Vaex implementations of pattern 3 (slice to rows where column_e contains the string "ab", then count the remaining rows).
def calculate_pattern_3_with_pandas_df(pandas_df):
"""
Calculate the third pattern in a Pandas data frame.
Parameters
----------
pandas_df : pd.DataFrame
The target Pandas data frame.
Returns
-------
row_count : int
The number of lines in the calculation result.
"""
pandas_df = pandas_df[pandas_df['column_e'].str.contains('ab')]
row_count = len(pandas_df)
return row_count
def calculate_pattern_3_with_dask_df(dask_df):
"""
The calculation of the third pattern is performed in the Dask data frame.
Parameters
----------
dask_df : dd.DataFrame
The data frame of the target Dask.
Returns
-------
row_count : int
The number of lines in the calculation result.
"""
dask_df = dask_df[dask_df['column_e'].str.contains('ab')]
row_count = len(dask_df)
return row_count
def calculate_pattern_3_with_vaex_df(vaex_df):
"""
The calculation of the third pattern is performed in the Vaex data frame.
Parameters
----------
vaex_df : vaex.dataframe.DataFrame
The target Vaex data frame.
Returns
-------
row_count : int
The number of lines in the calculation result.
"""
vaex_df = vaex_df[vaex_df['column_e'].str.contains('ab')]
row_count = len(vaex_df)
return row_count
I will write the data reading process for each format.
def read_csv_no_compression_pandas_df():
"""
Read Pandas data frames from uncompressed CSV data.
Returns
-------
pandas_df : pd.DataFrame
The loaded Pandas data frame.
"""
pandas_df = pd.read_csv(FP.CSV_NO_COMPRESSION)
return pandas_df
def read_csv_no_compression_dask_df():
"""
Read Dask data frames from uncompressed CSV data.
Returns
-------
dask_df : dd.DataFrame
The loaded Dask data frame.
"""
dask_df = dd.read_csv(FP.CSV_NO_COMPRESSION)
return dask_df
def read_csv_no_compression_vaex_df():
"""
Read a Vaex data frame from uncompressed CSV data.
Returns
-------
vaex_df : vaex.dataframe.DataFrame
The loaded Vaex data frame.
"""
vaex_df = vaex.from_csv(FP.CSV_NO_COMPRESSION, copy_index=False)
return vaex_df
def read_csv_gzip_pandas_df():
"""
Read Pandas dataframes from gzip-compressed CSV data.
Returns
-------
pandas_df : pd.DataFrame
The loaded Pandas data frame.
"""
pandas_df = pd.read_csv(FP.CSV_GZIP, compression='gzip')
return pandas_df
def read_csv_gzip_dask_df():
"""
Read Dask data frames from gzip-compressed CSV data.
Returns
-------
dask_df : dd.DataFrame
The loaded Dask data frame.
"""
dask_df = dd.read_csv(
FP.CSV_GZIP, compression='gzip', blocksize=None)
return dask_df
def read_csv_gzip_vaex_df():
"""
Read a Vaex data frame from gzip-compressed CSV data.
Returns
-------
vaex_df : vaex.dataframe.DataFrame
The loaded Vaex data frame.
"""
vaex_df = vaex.from_csv(
FP.CSV_GZIP, compression='gzip', copy_index=False)
return vaex_df
def read_parquet_no_compression_pandas_df():
"""
Read Pandas data frames from uncompressed Parquet data.
Returns
-------
pandas_df : pd.DataFrame
The loaded Pandas data frame.
"""
pandas_df = pd.read_parquet(FP.PARQUET_NO_COMPRESSION)
return pandas_df
def read_parquet_no_compression_dask_df():
"""
Read a Dask data frame from uncompressed Parquet data.
Returns
-------
dask_df : dd.DataFrame
The loaded Dask data frame.
"""
dask_df = dd.read_parquet(FP.PARQUET_NO_COMPRESSION)
return dask_df
def read_parquet_no_compression_vaex_df():
"""
Read Vaex data frames from uncompressed Parquet data.
Returns
-------
vaex_df : vaex.dataframe.DataFrame
The loaded Vaex data frame.
"""
vaex_df = vaex.open(FP.PARQUET_NO_COMPRESSION, copy_index=False)
return vaex_df
def read_parquet_snappy_pandas_df():
"""
Read Pandas data frames from Snappy-compressed Parquet data.
Returns
-------
pandas_df : pd.DataFrame
The loaded Pandas data frame.
"""
pandas_df = pd.read_parquet(FP.PARQUET_SNAPPY)
return pandas_df
def read_parquet_snappy_dask_df():
"""
Read Dask dataframes from Snappy-compressed Parquet data
Returns
-------
dask_df : dd.DataFrame
The loaded Dask data frame.
"""
dask_df = dd.read_parquet(FP.PARQUET_SNAPPY)
return dask_df
def read_parquet_snappy_vaex_df():
"""
Read Vaex dataframes from Snappy-compressed Parquet data
Returns
-------
vaex_df : vaex.dataframe.DataFrame
The loaded Vaex data frame.
"""
vaex_df = vaex.open(FP.PARQUET_SNAPPY, copy_index=False)
return vaex_df
def read_parquet_gzip_pandas_df():
"""
Read Pandas dataframes from gzip-compressed Parquet data.
Returns
-------
pandas_df : pd.DataFrame
The loaded Pandas data frame.
"""
pandas_df = pd.read_parquet(FP.PARQUET_GZIP)
return pandas_df
def read_parquet_gzip_dask_df():
"""
Read Dask dataframes from gzip-compressed Parquet data
Returns
-------
dask_df : dd.DataFrame
The loaded Dask data frame.
"""
dask_df = dd.read_parquet(FP.PARQUET_GZIP)
return dask_df
def read_parquet_gzip_vaex_df():
"""
Read Vaex dataframes from gzip-compressed Parquet data
Returns
-------
vaex_df : vaex.dataframe.DataFrame
The loaded Vaex data frame.
"""
vaex_df = vaex.open(FP.PARQUET_GZIP, copy_index=False)
return vaex_df
def read_hdf5_no_compression_pandas_df():
"""
Read Pandas data frames from uncompressed HDF5 data.
Returns
-------
pandas_df : pd.DataFrame
The loaded Pandas data frame.
"""
pandas_df = pd.read_hdf(FP.HDF5_NO_COMPRESSION_PANDAS, key='data')
return pandas_df
def read_hdf5_no_compression_dask_df():
"""
Read Dask data frames from uncompressed HDF5 data.
Returns
-------
dask_df : dd.DataFrame
Loaded Dask dataframe
"""
dask_df = dd.read_hdf(FP.HDF5_NO_COMPRESSION_DASK, key='data')
return dask_df
def read_hdf5_no_compression_vaex_df():
"""
Read Vaex data frames from uncompressed HDF5 data.
Returns
-------
vaex_df : vaex.dataframe.DataFrame
The loaded Vaex data frame.
"""
vaex_df = vaex.open(FP.HDF5_NO_COMPRESSION_VAEX, copy_index=False)
return vaex_df
def read_hdf5_zlib_pandas_df():
"""
Read Pandas dataframes from zlib-compressed HDF5 data.
Returns
-------
pandas_df : pd.DataFrame
The loaded Pandas data frame.
"""
pandas_df = pd.read_hdf(FP.HDF5_ZLIB_PANDAS, key='data')
return pandas_df
def read_hdf5_zlib_dask_df():
"""
Read Dask data frames from zlib-compressed HDF5 data.
Returns
-------
dask_df : dd.DataFrame
Loaded Dask dataframe
"""
dask_df = dd.read_hdf(FP.HDF5_ZLIB_DASK, key='data')
return dask_df
Having written it all out, the read code is notably simple for Parquet (the code does not change with the compression type, it does not differ between libraries, and the same file can be read by all of them).
Now that the read functions and the aggregation functions are in place, let's write the code that combines them.
Writing each combination out one by one would get very long, so a class is prepared to handle the measurement.
from timeit import timeit
class ReadAndCalcRunner:
def __init__(self, label, pattern, read_func, calc_func):
"""
A class that holds and executes a combination of read processing and calculation processing.
Parameters
----------
label : str
A label for identifying the target combination.
Example: csv_no_compression_pandas
pattern : int
Target pattern (1 to 3).
read_func : Callable
A function that handles the read processing. It must take no arguments
and return a data frame.
calc_func : Callable
A function that handles the calculation processing. It must accept
a data frame as its first argument.
"""
self.label = label
self.pattern = pattern
self._read_func = read_func
self._calc_func = calc_func
def run(self, n, debug=False):
"""
Perform the read and calculation processing. After execution, the mean_seconds
attribute is set to the average execution time in seconds (float).
Parameters
----------
n : int
Number of executions. The larger the number, the more accurate the measured
time, but note that it takes longer to complete.
debug : bool, default False
Debug settings. If True is specified, the calculation result will be output.
"""
statement = 'df = self._read_func();'
if not debug:
statement += 'self._calc_func(df);'
else:
statement += 'result = self._calc_func(df); print(result)'
result_seconds = timeit(
stmt=statement,
number=n, globals=locals())
self.mean_seconds = result_seconds / n
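For example (an illustrative run only; it assumes the uncompressed CSV file already exists on disk), a single combination can be measured like this:
# Measure pattern 1 with uncompressed CSV + Pandas over 5 runs.
runner = ReadAndCalcRunner(
    label='csv_no_compression_pandas',
    pattern=1,
    read_func=read_csv_no_compression_pandas_df,
    calc_func=calculate_pattern_1_with_pandas_df)
runner.run(n=5)
print(runner.mean_seconds)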
Now let's define each combination using the class above.
Since every function is named according to a rule (using the format and library names), the combinations are registered by looping over them. The deep nesting bothers me a little, but this is throwaway code, so I will leave it as is.
from copy import deepcopy
import sys
this_module = sys.modules[__name__]
FORMATS = (
'csv_no_compression',
'csv_gzip',
'parquet_no_compression',
'parquet_snappy',
'parquet_gzip',
'hdf5_no_compression',
'hdf5_zlib',
)
LIBS = (
'pandas',
'dask',
'vaex',
)
PATTERNS = (1, 2, 3)
runners = []
# Vaex has no interface for compressed HDF5, so no read function is defined for that combination.
for format_str in FORMATS:
for lib_str in LIBS:
for pattern in PATTERNS:
label = f'{format_str}_{lib_str}'
read_func_name = f'read_{format_str}_{lib_str}_df'
if not hasattr(this_module, read_func_name):
print(
'The target read function was not found, so this combination is '
f'skipped: {read_func_name}')
continue
read_func = getattr(this_module, read_func_name)
calc_func_name = f'calculate_pattern_{pattern}_with_{lib_str}_df'
if not hasattr(this_module, calc_func_name):
print(
'The target calculation function was not found, so this combination is '
f'skipped: {calc_func_name}')
continue
calc_func = getattr(this_module, calc_func_name)
runners.append(
ReadAndCalcRunner(
label=label,
pattern=pattern,
read_func=read_func,
calc_func=calc_func)
)
This registers the 60 combinations (7 file formats x 3 libraries x 3 calculation patterns = 63, minus the 3 zlib-compressed HDF5 + Vaex cases that are not supported = 60).
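Just to be safe, the number of registered combinations can be checked (per the count above, 60 is expected):
# 7 formats x 3 libraries x 3 patterns = 63, minus the 3 unsupported
# zlib-compressed HDF5 + Vaex cases = 60 runners.
print(len(runners))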
Also, a function is added that returns a series holding the processing time of each combination for a given pattern.
def run_and_get_result_time_sr(runners, pattern, n):
"""
Run each measurement for the specified pattern and get a series that stores
the resulting times in seconds.
Parameters
----------
runners : list of ReadAndCalcRunner
A list of instances that hold definitions of execution processes.
pattern : int
The pattern to execute (1 to 3).
n : int
Number of executions. The larger the number, the more accurate the measured
time, but note that it takes longer to complete.
Returns
-------
sr : pd.Series
A series that stores the measurement results. The index is a concatenated
string of the format/library label and the pattern, and the value is the
time in seconds.
"""
runners = deepcopy(runners)
data_dict = {}
for runner in runners:
if runner.pattern != pattern:
continue
label = f'{runner.label}_{pattern}'
runner.run(n=n)
data_dict[label] = runner.mean_seconds
sr = pd.Series(data=data_dict)
return sr
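Used directly (an illustrative call that matches how run_overall uses it below; it assumes the files have already been saved), it looks like this:
# Measure every pattern-1 combination 5 times each and show the sorted averages.
time_sr = run_and_get_result_time_sr(runners=runners, pattern=1, n=5)
print(time_sr.sort_values())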
Next, a function that plots the per-pattern series of aggregated times obtained above.
def plot_time_info(time_sr, pattern, print_sr=True):
"""
Plot the processing time.
Parameters
----------
time_sr : pd.Series
A series that stores the processing time.
pattern : int
Target aggregation processing pattern (1 to 3).
print_sr : bool, default True
Whether to output and display a series of plot contents.
"""
sr = time_sr.copy()
sr.sort_values(inplace=True, ascending=False)
if print_sr:
print(sr.sort_values(ascending=True))
title = f'Read and calculation seconds (pattern: {pattern})'
ax = sr.plot(kind='barh', figsize=(10, len(sr) // 2), title=title)
ax.xaxis.set_major_formatter(FormatStrFormatter('%.3f'))
plt.show()
To keep the description simple, a function is added that runs all the steps (save, load, aggregate, visualize) at once.
def run_overall(row_num, n, unit, pattern, save_data=True):
"""
Run all of the processing (save, load, aggregate, visualize, etc.)
for the target pattern.
Parameters
----------
row_num : int
The number of rows of data of interest.
n : int
The number of read-and-calculate executions. The larger the number, the more
accurate the measured time, but note that it takes longer to complete.
unit : PlotFileSizeUnit
A unit of file size.
pattern : int
The target calculation pattern (1 to 3).
save_data : bool, default True
Whether to save the data. Specify False when reusing data already saved by
another pattern's run (the extra save processing and size plots are then
skipped).
"""
if save_data:
pandas_df = make_basic_pandas_df(row_num=row_num, char_num=20)
save_each_format(pandas_df=pandas_df)
disk_size_info_sr = get_disk_size_info_sr()
plot_file_size_info(sr=disk_size_info_sr, unit=unit)
plot_file_size_info(sr=disk_size_info_sr, unit=unit, exclude_hdf5=True)
time_sr = run_and_get_result_time_sr(
runners=runners, pattern=pattern, n=n)
plot_time_info(time_sr=time_sr, pattern=pattern)
It has taken longer than expected to get here, but we can finally move on to execution and visualization. This section covers the small data with 500 rows and 5 columns.
First, the save processing and pattern 1.
run_overall(
row_num=500, n=20, unit=PlotFileSizeUnit.KILO_BYTES,
pattern=1, save_data=True)
2021-01-11 10:50:52.504175 Start saving uncompressed CSV...
2021-01-11 10:50:52.506491 Start saving gzip-compressed CSV...
2021-01-11 10:50:52.509357 Start saving uncompressed Parquet...
2021-01-11 10:50:52.511121 Start saving Snappy compressed Parquet...
2021-01-11 10:50:52.512792 Start saving gzip-compressed Parquet...
2021-01-11 10:50:52.516839 Start saving uncompressed Pandas HDF5...
2021-01-11 10:50:52.525298 Start saving uncompressed Dask HDF5...
2021-01-11 10:50:52.530746 Start saving uncompressed Vaex HDF5...
2021-01-11 10:50:52.592227 Start saving zlib-compressed Pandas HDF5...
2021-01-11 10:50:52.602412 Start saving zlib-compressed Dask HDF5...
2021-01-11 10:50:52.608369 The save process is complete.
The save process is instant for any file of this size.
**Disk size:**
csv_gzip 13.698
parquet_gzip 17.913
csv_no_compression 20.671
parquet_snappy 22.427
parquet_no_compression 26.952
hdf5_zlib 1084.923
hdf5_no_compression 1091.984
The disk size of the HDF5 files is considerably larger than the other formats. This seems to come from the minimum overhead of HDF5 itself rather than from the data (so at this scale the size barely changes even when compression is specified).
**This shows that HDF5 is not suited to small data.** Given that it is a hierarchical format meant to hold a lot of data in one file, that is not surprising.
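As a quick way to see that fixed overhead (a rough check using the same kind of to_hdf call as above; the file names here are only for this check), even a 10-row frame produces an HDF5 file far larger than its CSV equivalent:
# Even a tiny frame yields a comparatively large HDF5 file, which points to
# per-file overhead rather than data volume.
tiny_df = make_basic_pandas_df(row_num=10, char_num=20)
tiny_df.to_hdf('tiny_overhead_check.hdf5', key='data', mode='w', complib=None)
tiny_df.to_csv('tiny_overhead_check.csv', index=False)
print(os.path.getsize('tiny_overhead_check.hdf5'))
print(os.path.getsize('tiny_overhead_check.csv'))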
The size plot is hard to read with HDF5 included, so let's look at it again excluding HDF5.
**Overall, CSV does better than Parquet.** Even with Snappy compression, Parquet loses to plain CSV.
Other articles and slides also say that Parquet itself is not suited to small files and should be avoided, and in such cases Parquet is indeed not a good choice from the disk-size point of view.
In other words, when small files are saved to the S3 data lake every few minutes with Firehose or Fluentd, choosing Parquet for performance or disk-size reasons **can actually backfire compared to CSV**. Note also that enabling Parquet conversion on Firehose adds extra AWS cost. It seems best to choose between CSV and Parquet based on the target data set and how finely the files are split (how often the data is saved).
**Pattern 1 processing time:**
Now let's look at the read-and-calculate times, starting with pattern 1 (a simple read, slice, and sum).
csv_no_compression_pandas_1 0.002111
csv_gzip_pandas_1 0.002147
parquet_no_compression_pandas_1 0.002332
parquet_gzip_pandas_1 0.002439
parquet_snappy_pandas_1 0.002447
parquet_snappy_vaex_1 0.003457
parquet_gzip_vaex_1 0.003632
parquet_no_compression_vaex_1 0.003653
hdf5_no_compression_vaex_1 0.005135
csv_no_compression_vaex_1 0.006282
csv_gzip_vaex_1 0.006520
hdf5_no_compression_pandas_1 0.006611
hdf5_zlib_pandas_1 0.007772
csv_gzip_dask_1 0.009453
csv_no_compression_dask_1 0.009640
parquet_gzip_dask_1 0.010723
parquet_no_compression_dask_1 0.010762
parquet_snappy_dask_1 0.010876
hdf5_zlib_dask_1 0.017010
hdf5_no_compression_dask_1 0.017790
You can see that the CSV + Pandas combination is fast. Conversely, Dask is slow, and the HDF5 + Dask combination in particular is slow.
Vaex is not as slow as Dask, but it is clearly slower than Pandas.
As Dask's documentation notes on its best-practices page, for small data that fits in memory, Dask's parallelization, task graphs, and lazy evaluation only add overhead, and that shows here.
Reference: Best Practices --Dask documentation
For data like this, just use Pandas.
Also, at this scale there is no big difference between Parquet and CSV in speed either, not just in disk size (Parquet does not get much faster).
**Pattern 2 processing time:**
Next, the calculation of pattern 2 (a slightly longer calculation that includes string operations).
run_overall(
row_num=500, n=20, unit=PlotFileSizeUnit.KILO_BYTES,
pattern=2, save_data=False)
csv_gzip_pandas_2 0.005518
parquet_snappy_pandas_2 0.005648
parquet_no_compression_pandas_2 0.005668
parquet_no_compression_vaex_2 0.005784
parquet_gzip_pandas_2 0.005845
csv_no_compression_pandas_2 0.005853
parquet_snappy_vaex_2 0.005925
parquet_gzip_vaex_2 0.006170
hdf5_no_compression_vaex_2 0.007094
csv_no_compression_vaex_2 0.008296
csv_gzip_vaex_2 0.008618
hdf5_no_compression_pandas_2 0.010095
hdf5_zlib_pandas_2 0.010720
csv_no_compression_dask_2 0.021852
csv_gzip_dask_2 0.022408
parquet_no_compression_dask_2 0.022504
parquet_snappy_dask_2 0.022828
parquet_gzip_dask_2 0.023470
hdf5_no_compression_dask_2 0.030001
hdf5_zlib_dask_2 0.030424
The advantage of CSV + Pandas remains, but with this much processing the Vaex + Parquet combination comes close to Pandas.
**Pattern 3 processing time:**
Next, the calculation of pattern 3 (a simple read, a string operation, and a row count).
run_overall(
row_num=500, n=20, unit=PlotFileSizeUnit.KILO_BYTES,
pattern=3, save_data=False)
csv_no_compression_pandas_3 0.002348
csv_gzip_pandas_3 0.002382
parquet_no_compression_pandas_3 0.002456
parquet_snappy_pandas_3 0.002553
parquet_no_compression_vaex_3 0.002748
parquet_gzip_pandas_3 0.002815
parquet_gzip_vaex_3 0.002947
parquet_snappy_vaex_3 0.003032
hdf5_no_compression_vaex_3 0.004627
csv_gzip_vaex_3 0.006253
csv_no_compression_vaex_3 0.006258
hdf5_no_compression_pandas_3 0.006701
hdf5_zlib_pandas_3 0.007467
csv_no_compression_dask_3 0.010006
parquet_no_compression_dask_3 0.011430
parquet_snappy_dask_3 0.011600
parquet_gzip_dask_3 0.011652
csv_gzip_dask_3 0.012087
hdf5_no_compression_dask_3 0.018044
hdf5_zlib_dask_3 0.018257
Here, Pandas with CSV gave excellent results. Even though Vaex is strong at string operations, when the amount of processing and data is this small, Pandas (with Parquet as well as CSV) generally seems to give better results.
Let's increase the data a little and try with 2000 lines.
run_overall(
row_num=2000, n=20, unit=PlotFileSizeUnit.KILO_BYTES,
pattern=1, save_data=True)
2021-01-11 10:56:19.629120 Start saving uncompressed CSV...
2021-01-11 10:56:19.635464 Start saving gzip-compressed CSV...
2021-01-11 10:56:19.646662 Start saving uncompressed Parquet...
2021-01-11 10:56:19.649773 Start saving Snappy compressed Parquet...
2021-01-11 10:56:19.652469 Start saving gzip-compressed Parquet...
2021-01-11 10:56:19.663905 Start saving uncompressed Pandas HDF5...
2021-01-11 10:56:19.672032 Start saving uncompressed Dask HDF5...
2021-01-11 10:56:19.678898 Start saving uncompressed Vaex HDF5...
2021-01-11 10:56:19.692107 Start saving zlib-compressed Pandas HDF5...
2021-01-11 10:56:19.702695 Start saving zlib-compressed Dask HDF5...
2021-01-11 10:56:19.711186 The save process is complete.
csv_gzip 53.358
parquet_gzip 59.419
parquet_snappy 77.077
csv_no_compression 82.433
parquet_no_compression 94.157
hdf5_zlib 1131.977
hdf5_no_compression 1186.488
First, the disk size: at this scale, compressed Parquet no longer loses to uncompressed CSV. Comparing the same compression format, however, CSV still comes out ahead.
parquet_snappy_pandas_1 0.002602
parquet_no_compression_pandas_1 0.002694
csv_no_compression_pandas_1 0.003224
parquet_gzip_pandas_1 0.003239
parquet_no_compression_vaex_1 0.003570
csv_gzip_pandas_1 0.003762
parquet_snappy_vaex_1 0.003798
parquet_gzip_vaex_1 0.004458
hdf5_no_compression_vaex_1 0.005106
hdf5_no_compression_pandas_1 0.006921
hdf5_zlib_pandas_1 0.008155
parquet_snappy_dask_1 0.011113
parquet_no_compression_dask_1 0.011172
csv_no_compression_dask_1 0.011765
parquet_gzip_dask_1 0.012178
csv_gzip_dask_1 0.012792
csv_no_compression_vaex_1 0.015389
csv_gzip_vaex_1 0.015665
hdf5_no_compression_dask_1 0.017919
hdf5_zlib_dask_1 0.018418
In pattern 1, the Pandas + Parquet combination is now faster than CSV. Also, Snappy-compressed Parquet seems to be about on par with the uncompressed case.
Next, let's look at pattern 2.
run_overall(
row_num=2000, n=20, unit=PlotFileSizeUnit.KILO_BYTES,
pattern=2, save_data=False)
parquet_snappy_vaex_2 0.006218
parquet_no_compression_vaex_2 0.006335
parquet_gzip_vaex_2 0.006521
parquet_no_compression_pandas_2 0.006702
parquet_snappy_pandas_2 0.006715
parquet_gzip_pandas_2 0.007106
csv_no_compression_pandas_2 0.007148
hdf5_no_compression_vaex_2 0.007222
csv_gzip_pandas_2 0.007696
hdf5_no_compression_pandas_2 0.011027
hdf5_zlib_pandas_2 0.011480
csv_no_compression_vaex_2 0.016811
csv_gzip_vaex_2 0.017303
parquet_snappy_dask_2 0.023846
csv_no_compression_dask_2 0.024046
parquet_gzip_dask_2 0.024549
parquet_no_compression_dask_2 0.024705
csv_gzip_dask_2 0.025683
hdf5_zlib_dask_2 0.031773
hdf5_no_compression_dask_2 0.031882
Vaex with Parquet or HDF5 now stands out at the top. The slowness of the CSV + Vaex combination is not surprising: the Vaex docstring states that CSV is read via Pandas, and it is often pointed out that Vaex does not get faster with CSV.
Next is pattern 3.
run_overall(
row_num=2000, n=20, unit=PlotFileSizeUnit.KILO_BYTES,
pattern=3, save_data=False)
parquet_snappy_vaex_3 0.003102
parquet_no_compression_vaex_3 0.003210
parquet_snappy_pandas_3 0.003282
parquet_no_compression_pandas_3 0.003468
parquet_gzip_vaex_3 0.003671
parquet_gzip_pandas_3 0.003808
csv_no_compression_pandas_3 0.003832
csv_gzip_pandas_3 0.004321
hdf5_no_compression_vaex_3 0.004826
hdf5_no_compression_pandas_3 0.007564
hdf5_zlib_pandas_3 0.008139
parquet_snappy_dask_3 0.012079
csv_no_compression_dask_3 0.012595
parquet_gzip_dask_3 0.012903
parquet_no_compression_dask_3 0.013000
csv_gzip_dask_3 0.013813
csv_no_compression_vaex_3 0.014868
csv_gzip_vaex_3 0.015253
hdf5_no_compression_dask_3 0.019515
hdf5_zlib_dask_3 0.019784
Even though this is also a simple calculation, Vaex stands out at the top, probably thanks to the string operation, although Pandas is close behind. Another notable point is that Parquet has become more prominent at the top.
Then try with 5000 lines.
run_overall(
row_num=5000, n=20, unit=PlotFileSizeUnit.KILO_BYTES,
pattern=1, save_data=True)
2021-01-11 11:00:51.745293 Start saving uncompressed CSV...
2021-01-11 11:00:51.759650 Start saving gzip-compressed CSV...
2021-01-11 11:00:51.785199 Start saving uncompressed Parquet...
2021-01-11 11:00:51.788939 Start saving Snappy compressed Parquet...
2021-01-11 11:00:51.792390 Start saving gzip-compressed Parquet...
2021-01-11 11:00:51.819260 Start saving uncompressed Pandas HDF5...
2021-01-11 11:00:51.827838 Start saving uncompressed Dask HDF5...
2021-01-11 11:00:51.836513 Start saving uncompressed Vaex HDF5...
2021-01-11 11:00:51.849884 Start saving zlib-compressed Pandas HDF5...
2021-01-11 11:00:51.862147 Start saving zlib-compressed Dask HDF5...
2021-01-11 11:00:51.875275 The save process is complete.
csv_gzip 132.612
parquet_gzip 142.983
parquet_snappy 186.526
csv_no_compression 206.247
parquet_no_compression 226.477
hdf5_zlib 1226.921
hdf5_no_compression 1375.496
At this row count, CSV still seems to beat Parquet on disk size when comparing the same compression format.
parquet_snappy_pandas_1 0.003580
parquet_no_compression_vaex_1 0.003808
parquet_snappy_vaex_1 0.003830
parquet_no_compression_pandas_1 0.003854
parquet_gzip_pandas_1 0.004526
parquet_gzip_vaex_1 0.005253
hdf5_no_compression_vaex_1 0.005374
csv_no_compression_pandas_1 0.005457
csv_gzip_pandas_1 0.006872
hdf5_no_compression_pandas_1 0.007761
hdf5_zlib_pandas_1 0.008421
parquet_no_compression_dask_1 0.012125
parquet_snappy_dask_1 0.012637
parquet_gzip_dask_1 0.013911
csv_no_compression_dask_1 0.017265
csv_gzip_dask_1 0.019980
hdf5_no_compression_dask_1 0.020944
hdf5_zlib_dask_1 0.022355
csv_no_compression_vaex_1 0.033170
csv_gzip_vaex_1 0.034185
For the speed of pattern 1, the top spots are combinations of Pandas + Parquet and Vaex + Parquet, while the bottom two are Vaex + CSV. Vaex really does seem to be weak with CSV.
Next is pattern 2.
run_overall(
row_num=5000, n=20, unit=PlotFileSizeUnit.KILO_BYTES,
pattern=2, save_data=False)
parquet_no_compression_vaex_2 0.006076
parquet_snappy_vaex_2 0.006594
hdf5_no_compression_vaex_2 0.007164
parquet_gzip_vaex_2 0.007509
parquet_snappy_pandas_2 0.008118
parquet_no_compression_pandas_2 0.008377
parquet_gzip_pandas_2 0.010168
csv_no_compression_pandas_2 0.010611
csv_gzip_pandas_2 0.011410
hdf5_no_compression_pandas_2 0.012331
hdf5_zlib_pandas_2 0.013154
parquet_snappy_dask_2 0.025350
parquet_no_compression_dask_2 0.025814
parquet_gzip_dask_2 0.027304
csv_no_compression_dask_2 0.031233
csv_gzip_dask_2 0.033282
csv_no_compression_vaex_2 0.034763
hdf5_no_compression_dask_2 0.035235
csv_gzip_vaex_2 0.035323
hdf5_zlib_dask_2 0.036155
Vaex with everything except CSV occupies the top ranks.
Next is pattern 3.
run_overall(
row_num=5000, n=20, unit=PlotFileSizeUnit.KILO_BYTES,
pattern=3, save_data=False)
parquet_no_compression_vaex_3 0.003913
parquet_snappy_vaex_3 0.003922
parquet_snappy_pandas_3 0.005204
parquet_no_compression_pandas_3 0.005261
parquet_gzip_vaex_3 0.005454
hdf5_no_compression_vaex_3 0.005745
parquet_gzip_pandas_3 0.006408
csv_no_compression_pandas_3 0.007710
csv_gzip_pandas_3 0.008557
hdf5_no_compression_pandas_3 0.009136
hdf5_zlib_pandas_3 0.010429
parquet_snappy_dask_3 0.013971
parquet_no_compression_dask_3 0.014509
parquet_gzip_dask_3 0.015864
csv_no_compression_dask_3 0.020176
csv_gzip_dask_3 0.022093
hdf5_no_compression_dask_3 0.022922
hdf5_zlib_dask_3 0.024209
csv_no_compression_vaex_3 0.032915
csv_gzip_vaex_3 0.034003
Again, the Parquet + Vaex combinations are fast. Next, let's try 10,000 rows.
run_overall(
row_num=10_000, n=20, unit=PlotFileSizeUnit.KILO_BYTES,
pattern=1, save_data=True)
2021-01-11 11:05:12.979342 Start saving uncompressed CSV...
2021-01-11 11:05:13.004653 Start saving gzip-compressed CSV...
2021-01-11 11:05:13.054621 Start saving uncompressed Parquet...
2021-01-11 11:05:13.060265 Start saving Snappy compressed Parquet...
2021-01-11 11:05:13.065405 Start saving gzip-compressed Parquet...
2021-01-11 11:05:13.119761 Start saving uncompressed Pandas HDF5...
2021-01-11 11:05:13.129005 Start saving uncompressed Dask HDF5...
2021-01-11 11:05:13.141110 Start saving uncompressed Vaex HDF5...
2021-01-11 11:05:13.155398 Start saving zlib-compressed Pandas HDF5...
2021-01-11 11:05:13.170231 Start saving zlib-compressed Dask HDF5...
2021-01-11 11:05:13.190747 The save process is complete.
csv_gzip 264.581
parquet_gzip 281.168
parquet_snappy 366.227
csv_no_compression 412.415
parquet_no_compression 439.441
hdf5_zlib 1385.012
hdf5_no_compression 1690.528
It is close, but Parquet still does not seem to beat gzip-compressed CSV. I expected Parquet to be more advantageous at this scale; perhaps the column of random strings is hurting it (Parquet would presumably do much better if the data consisted of columns with similar, repetitive values).
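One way to check that guess (a rough sketch; the file names here are only for this check) is to compare gzip-compressed sizes for the same 10,000-row frame with and without the random-string column:
# Compare gzip Parquet vs. gzip CSV, with and without the random-string column.
check_df = make_basic_pandas_df(row_num=10_000, char_num=20)
numeric_df = check_df.drop(columns=['column_e'])
for label, df in (('with_str', check_df), ('numeric_only', numeric_df)):
    df.to_parquet(f'check_{label}.parquet', compression='gzip', index=False)
    df.to_csv(f'check_{label}.csv.gz', index=False, compression='gzip')
    print(
        label,
        os.path.getsize(f'check_{label}.parquet'),
        os.path.getsize(f'check_{label}.csv.gz'))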
parquet_no_compression_vaex_1 0.004094
parquet_snappy_vaex_1 0.004115
parquet_snappy_pandas_1 0.004121
parquet_no_compression_pandas_1 0.004316
hdf5_no_compression_vaex_1 0.005096
parquet_gzip_pandas_1 0.006430
parquet_gzip_vaex_1 0.006769
csv_no_compression_pandas_1 0.008503
hdf5_no_compression_pandas_1 0.008716
hdf5_zlib_pandas_1 0.009722
csv_gzip_pandas_1 0.011210
parquet_snappy_dask_1 0.012788
parquet_no_compression_dask_1 0.013130
parquet_gzip_dask_1 0.016297
csv_no_compression_dask_1 0.021566
hdf5_no_compression_dask_1 0.024418
csv_gzip_dask_1 0.026689
hdf5_zlib_dask_1 0.027605
csv_no_compression_vaex_1 0.060481
csv_gzip_vaex_1 0.062800
In terms of speed for pattern 1, Parquet occupies the top and most of the CSV combinations are near the bottom.
Next is pattern 2.
run_overall(
row_num=10_000, n=20, unit=PlotFileSizeUnit.KILO_BYTES,
pattern=2, save_data=False)
parquet_no_compression_vaex_2 0.006648
parquet_snappy_vaex_2 0.006972
hdf5_no_compression_vaex_2 0.007346
parquet_gzip_vaex_2 0.009408
parquet_no_compression_pandas_2 0.010664
parquet_snappy_pandas_2 0.011815
parquet_gzip_pandas_2 0.013018
hdf5_no_compression_pandas_2 0.014761
csv_no_compression_pandas_2 0.015061
csv_gzip_pandas_2 0.016204
hdf5_zlib_pandas_2 0.017293
parquet_no_compression_dask_2 0.028120
parquet_snappy_dask_2 0.028369
parquet_gzip_dask_2 0.032319
csv_no_compression_dask_2 0.035587
csv_gzip_dask_2 0.040959
hdf5_no_compression_dask_2 0.041060
hdf5_zlib_dask_2 0.044385
csv_no_compression_vaex_2 0.062479
csv_gzip_vaex_2 0.065195
Once again, Vaex + Parquet comes out fast.
Next is pattern 3.
run_overall(
row_num=10_000, n=20, unit=PlotFileSizeUnit.KILO_BYTES,
pattern=3, save_data=False)
parquet_snappy_vaex_3 0.005270
parquet_no_compression_vaex_3 0.005497
hdf5_no_compression_vaex_3 0.006828
parquet_gzip_vaex_3 0.008125
parquet_no_compression_pandas_3 0.008216
parquet_snappy_pandas_3 0.008459
parquet_gzip_pandas_3 0.010360
hdf5_no_compression_pandas_3 0.011945
csv_no_compression_pandas_3 0.012204
csv_gzip_pandas_3 0.013290
hdf5_zlib_pandas_3 0.013831
parquet_snappy_dask_3 0.017223
parquet_no_compression_dask_3 0.017663
parquet_gzip_dask_3 0.020585
csv_no_compression_dask_3 0.026153
hdf5_no_compression_dask_3 0.029112
csv_gzip_dask_3 0.029944
hdf5_zlib_dask_3 0.031420
csv_no_compression_vaex_3 0.061439
csv_gzip_vaex_3 0.063572
It seems that the tendency is the same as when there were 5000 lines.
The row counts have gradually increased; next is a file with 50,000 rows. Since this starts to take a while, the number of executions (the argument n) is reduced from here on.
run_overall(
row_num=50_000, n=10, unit=PlotFileSizeUnit.KILO_BYTES,
pattern=1, save_data=True)
2021-01-11 11:08:51.857375 Start saving uncompressed CSV...
2021-01-11 11:08:51.984150 Start saving gzip-compressed CSV...
2021-01-11 11:08:52.236012 Start saving uncompressed Parquet...
2021-01-11 11:08:52.253600 Start saving Snappy compressed Parquet...
2021-01-11 11:08:52.272157 Start saving gzip-compressed Parquet...
2021-01-11 11:08:52.519038 Start saving uncompressed Pandas HDF5...
2021-01-11 11:08:52.535660 Start saving uncompressed Dask HDF5...
2021-01-11 11:08:52.578888 Start saving uncompressed Vaex HDF5...
2021-01-11 11:08:52.600329 Start saving zlib-compressed Pandas HDF5...
2021-01-11 11:08:52.639769 Start saving zlib-compressed Dask HDF5...
2021-01-11 11:08:52.721034 The save process is complete.
csv_gzip 1320.048
parquet_gzip 1343.917
parquet_snappy 1744.180
parquet_no_compression 2021.223
csv_no_compression 2062.447
hdf5_zlib 2649.801
hdf5_no_compression 4210.736
The HDF5 sizes have come much closer to the other formats, especially with zlib compression.
CSV and Parquet show almost the same numbers.
hdf5_no_compression_vaex_1 0.005502
parquet_no_compression_vaex_1 0.006683
parquet_snappy_vaex_1 0.007091
parquet_snappy_pandas_1 0.011606
parquet_no_compression_pandas_1 0.011925
hdf5_no_compression_pandas_1 0.014885
parquet_gzip_vaex_1 0.018865
hdf5_zlib_pandas_1 0.021611
parquet_snappy_dask_1 0.022349
parquet_gzip_pandas_1 0.022426
parquet_no_compression_dask_1 0.023482
csv_no_compression_pandas_1 0.030946
parquet_gzip_dask_1 0.039596
csv_gzip_pandas_1 0.041766
csv_no_compression_dask_1 0.045717
hdf5_no_compression_dask_1 0.056825
csv_gzip_dask_1 0.061265
hdf5_zlib_dask_1 0.068667
csv_no_compression_vaex_1 0.282166
csv_gzip_vaex_1 0.293648
For pattern 1 speed, Pandas occupied the top positions at 10,000 rows, but at 50,000 rows Vaex takes over. HDF5 has also risen to the top, where Parquet used to be.
The gap between HDF5 + Vaex at the top and CSV + Vaex at the bottom is becoming quite noticeable.
Next is pattern 2.
run_overall(
    row_num=50_000, n=10, unit=PlotFileSizeUnit.KILO_BYTES,
    pattern=2, save_data=False)
hdf5_no_compression_vaex_2 0.007800
parquet_no_compression_vaex_2 0.009540
parquet_snappy_vaex_2 0.010558
parquet_gzip_vaex_2 0.022170
parquet_snappy_pandas_2 0.026817
parquet_no_compression_pandas_2 0.028118
hdf5_no_compression_pandas_2 0.030404
hdf5_zlib_pandas_2 0.036779
parquet_gzip_pandas_2 0.039773
parquet_snappy_dask_2 0.047335
csv_no_compression_pandas_2 0.048013
parquet_no_compression_dask_2 0.049952
csv_gzip_pandas_2 0.060180
parquet_gzip_dask_2 0.065872
csv_no_compression_dask_2 0.070893
hdf5_no_compression_dask_2 0.084020
csv_gzip_dask_2 0.085232
hdf5_zlib_dask_2 0.093619
csv_no_compression_vaex_2 0.286849
csv_gzip_vaex_2 0.299390
The speed of Vaex + HDF5 is becoming quite noticeable; compared with the fastest Pandas case, there is a speed difference of about 3 times.
At this scale Vaex does best with HDF5, whereas Pandas and the others tend to get better results from Parquet than from HDF5. You can also see that, in every library, Snappy compression adds only a small overhead compared to the uncompressed case.
Although the HDF5 + Vaex combination is fastest, the Parquet + Vaex combination is also doing well.
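As an aside, the Snappy / gzip / uncompressed Parquet variants only differ in the compression argument at write time; a minimal sketch with Pandas (assuming the pyarrow backend) would look like this:

import pandas as pd

df = pd.DataFrame({'a': range(1000)})

# Snappy: small CPU overhead, moderate size reduction.
df.to_parquet('parquet_snappy.parquet', compression='snappy')

# gzip: smaller files, but noticeably slower to write and read.
df.to_parquet('parquet_gzip.parquet', compression='gzip')

# No compression at all.
df.to_parquet('parquet_no_compression.parquet', compression=None)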
Next is pattern 3.
run_overall(
    row_num=50_000, n=10, unit=PlotFileSizeUnit.KILO_BYTES,
    pattern=3, save_data=False)
hdf5_no_compression_vaex_3 0.014123
parquet_no_compression_vaex_3 0.014571
parquet_snappy_vaex_3 0.015824
parquet_gzip_vaex_3 0.027412
parquet_snappy_pandas_3 0.030680
parquet_no_compression_pandas_3 0.031236
hdf5_no_compression_pandas_3 0.034395
hdf5_zlib_pandas_3 0.040794
parquet_snappy_dask_3 0.042222
parquet_no_compression_dask_3 0.043285
parquet_gzip_pandas_3 0.043325
csv_no_compression_pandas_3 0.051075
parquet_gzip_dask_3 0.059660
csv_gzip_pandas_3 0.062562
csv_no_compression_dask_3 0.065778
hdf5_no_compression_dask_3 0.078580
csv_gzip_dask_3 0.080105
hdf5_zlib_dask_3 0.087821
csv_no_compression_vaex_3 0.291779
csv_gzip_vaex_3 0.302356
Again, Vaex is about twice as fast as Pandas' fastest case, and the rankings seem to be similar.
run_overall(
    row_num=100_000, n=10, unit=PlotFileSizeUnit.KILO_BYTES,
    pattern=1, save_data=True)
2021-01-11 11:13:13.365703 Start saving uncompressed CSV...
2021-01-11 11:13:13.620238 Start saving gzip-compressed CSV...
2021-01-11 11:13:14.130993 Start saving uncompressed Parquet...
2021-01-11 11:13:14.155798 Start saving Snappy-compressed Parquet...
2021-01-11 11:13:14.184444 Start saving gzip-compressed Parquet...
2021-01-11 11:13:14.657454 Start saving uncompressed Pandas HDF5...
2021-01-11 11:13:14.690242 Start saving uncompressed Dask HDF5...
2021-01-11 11:13:14.768885 Start saving uncompressed Vaex HDF5...
2021-01-11 11:13:14.797431 Start saving zlib-compressed Pandas HDF5...
2021-01-11 11:13:14.869557 Start saving zlib-compressed Dask HDF5...
2021-01-11 11:13:15.030618 The save process is complete.
parquet_gzip 2592.230
csv_gzip 2639.877
parquet_snappy 3365.400
parquet_no_compression 3878.444
csv_no_compression 4123.427
hdf5_zlib 4230.293
hdf5_no_compression 7361.000
In terms of disk size, Parquet finally outperforms CSV at 100,000 rows. Also, zlib-compressed HDF5 is now almost as small as uncompressed CSV.
hdf5_no_compression_vaex_1 0.006252
parquet_no_compression_vaex_1 0.009005
parquet_snappy_vaex_1 0.010455
hdf5_no_compression_pandas_1 0.023864
parquet_snappy_pandas_1 0.025336
parquet_no_compression_pandas_1 0.027367
parquet_gzip_vaex_1 0.032729
parquet_snappy_dask_1 0.037711
hdf5_zlib_pandas_1 0.038161
parquet_no_compression_dask_1 0.041333
parquet_gzip_pandas_1 0.047903
csv_no_compression_pandas_1 0.062740
parquet_gzip_dask_1 0.075457
csv_no_compression_dask_1 0.079175
csv_gzip_pandas_1 0.085268
hdf5_no_compression_dask_1 0.099241
csv_gzip_dask_1 0.104104
hdf5_zlib_dask_1 0.126630
csv_no_compression_vaex_1 0.562856
csv_gzip_vaex_1 0.583959
In pattern 1, a gap has opened between Vaex's HDF5 and Parquet results. For a simple computation that does not involve strings, the Vaex + HDF5 combination is about four times faster than Pandas' fastest case.
Next is pattern 2.
run_overall(
    row_num=100_000, n=10, unit=PlotFileSizeUnit.KILO_BYTES,
    pattern=2, save_data=False)
hdf5_no_compression_vaex_2 0.008375
parquet_no_compression_vaex_2 0.011477
parquet_snappy_vaex_2 0.014794
parquet_gzip_vaex_2 0.035837
hdf5_no_compression_pandas_2 0.053891
parquet_no_compression_pandas_2 0.054034
parquet_snappy_pandas_2 0.057317
hdf5_zlib_pandas_2 0.065070
parquet_gzip_pandas_2 0.078063
parquet_no_compression_dask_2 0.082266
parquet_snappy_dask_2 0.085195
csv_no_compression_pandas_2 0.091169
csv_gzip_pandas_2 0.116742
parquet_gzip_dask_2 0.117286
csv_no_compression_dask_2 0.117380
hdf5_no_compression_dask_2 0.140906
csv_gzip_dask_2 0.145802
hdf5_zlib_dask_2 0.160026
csv_no_compression_vaex_2 0.569166
csv_gzip_vaex_2 0.591414
The HDF5 + Vaex combination is about 6 times faster than Pandas + HDF5, and HDF5 is also starting to stand out as the fastest option within Pandas.
Meanwhile, the slowness of CSV + Vaex has become very noticeable...
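If CSV is the only format you receive, one workaround (an assumption on my part, not something measured in this benchmark) is to have Vaex convert the CSV to HDF5 once and read the HDF5 from then on:

import vaex

# On the first call this converts the CSV to an HDF5 file next to it
# (typically data.csv.hdf5) and reuses that file on subsequent calls.
df = vaex.from_csv('data.csv', convert=True)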
Next is pattern 3.
run_overall(
    row_num=100_000, n=10, unit=PlotFileSizeUnit.KILO_BYTES,
    pattern=3, save_data=False)
hdf5_no_compression_vaex_3 0.024104
parquet_no_compression_vaex_3 0.025932
parquet_snappy_vaex_3 0.028616
parquet_gzip_vaex_3 0.049881
hdf5_no_compression_pandas_3 0.066834
parquet_no_compression_pandas_3 0.066970
parquet_snappy_pandas_3 0.068455
hdf5_zlib_pandas_3 0.077481
parquet_no_compression_dask_3 0.084306
parquet_snappy_dask_3 0.084496
parquet_gzip_pandas_3 0.087239
csv_no_compression_pandas_3 0.104310
parquet_gzip_dask_3 0.114899
csv_no_compression_dask_3 0.117771
csv_gzip_pandas_3 0.120368
csv_gzip_dask_3 0.145227
hdf5_no_compression_dask_3 0.145364
hdf5_zlib_dask_3 0.162844
csv_no_compression_vaex_3 0.586798
csv_gzip_vaex_3 0.608084
Next I will try a file with 500,000 rows. Given the results so far, the slowness of Vaex + CSV is a real concern; if you use Vaex, you should choose Parquet or HDF5 rather than CSV. Since the runs now take a long time, I will drop the CSV + Vaex combination by adjusting the list of runners as follows.
# Exclude the CSV + Vaex runners, which are too slow to keep re-running.
new_runners = []
for runner in runners:
    if 'csv' in runner.label and 'vaex' in runner.label:
        continue
    new_runners.append(runner)
runners = new_runners
run_overall(
    row_num=500_000, n=10, unit=PlotFileSizeUnit.KILO_BYTES,
    pattern=1, save_data=True)
2021-01-11 11:23:47.660002 Start saving uncompressed CSV...
2021-01-11 11:23:48.923959 Start saving gzip-compressed CSV...
2021-01-11 11:23:51.454028 Start saving uncompressed Parquet...
2021-01-11 11:23:51.541757 Start saving Snappy-compressed Parquet...
2021-01-11 11:23:51.642734 Start saving gzip-compressed Parquet...
2021-01-11 11:23:53.860624 Start saving uncompressed Pandas HDF5...
2021-01-11 11:23:53.991259 Start saving uncompressed Dask HDF5...
2021-01-11 11:23:54.336325 Start saving uncompressed Vaex HDF5...
2021-01-11 11:23:54.416441 Start saving zlib-compressed Pandas HDF5...
2021-01-11 11:23:54.762570 Start saving zlib-compressed Dask HDF5...
2021-01-11 11:23:55.578732 The save process is complete.
parquet_gzip 11759.944
csv_gzip 13196.472
parquet_snappy 15549.102
hdf5_zlib 16642.935
parquet_no_compression 17962.955
csv_no_compression 20616.341
hdf5_no_compression 32563.056
By this point, Parquet's file sizes are clearly the smallest. zlib-compressed HDF5 is also getting down to roughly the size of Snappy-compressed Parquet.
hdf5_no_compression_vaex_1 0.009741
parquet_no_compression_vaex_1 0.030765
parquet_snappy_vaex_1 0.035542
hdf5_no_compression_pandas_1 0.102294
parquet_no_compression_pandas_1 0.137117
parquet_gzip_vaex_1 0.139221
hdf5_zlib_pandas_1 0.148760
parquet_snappy_pandas_1 0.150342
parquet_no_compression_dask_1 0.193046
parquet_snappy_dask_1 0.202919
parquet_gzip_pandas_1 0.248371
csv_no_compression_pandas_1 0.309617
csv_no_compression_dask_1 0.338050
parquet_gzip_dask_1 0.344163
csv_gzip_pandas_1 0.408416
hdf5_no_compression_dask_1 0.444608
csv_gzip_dask_1 0.458034
hdf5_zlib_dask_1 0.535816
For pattern 1 speed, the HDF5 + Vaex combination really stands out. It is roughly 3 to 4 times faster than Vaex with uncompressed or Snappy-compressed Parquet, and comparing gzip-compressed Parquet + Vaex with HDF5 + Vaex, the difference is about 14 times; the slowness of gzip is becoming conspicuous.
Compared with Pandas' HDF5, Vaex's HDF5 is about 10 times faster.
run_overall(
    row_num=500_000, n=10, unit=PlotFileSizeUnit.KILO_BYTES,
    pattern=2, save_data=False)
hdf5_no_compression_vaex_2 0.013669
parquet_no_compression_vaex_2 0.033931
parquet_snappy_vaex_2 0.039456
parquet_gzip_vaex_2 0.142346
hdf5_no_compression_pandas_2 0.240121
parquet_no_compression_pandas_2 0.276711
parquet_snappy_pandas_2 0.282710
hdf5_zlib_pandas_2 0.289592
parquet_no_compression_dask_2 0.346122
parquet_snappy_dask_2 0.353175
parquet_gzip_pandas_2 0.389155
csv_no_compression_pandas_2 0.444285
csv_no_compression_dask_2 0.493776
parquet_gzip_dask_2 0.496309
csv_gzip_pandas_2 0.554616
hdf5_no_compression_dask_2 0.589466
csv_gzip_dask_2 0.606128
hdf5_zlib_dask_2 0.682696
In pattern 2, Vaex's HDF5 is about 18 times faster than Pandas' fastest case. Very fast.
Dask still feels slow at this scale, while Vaex is getting quite fast.
Next is pattern 3.
run_overall(
    row_num=500_000, n=10, unit=PlotFileSizeUnit.KILO_BYTES,
    pattern=3, save_data=False)
hdf5_no_compression_vaex_3 0.095968
parquet_no_compression_vaex_3 0.116657
parquet_snappy_vaex_3 0.120759
parquet_gzip_vaex_3 0.228370
hdf5_no_compression_pandas_3 0.294173
hdf5_zlib_pandas_3 0.343163
parquet_no_compression_pandas_3 0.345838
parquet_snappy_pandas_3 0.352694
parquet_no_compression_dask_3 0.396586
parquet_snappy_dask_3 0.397329
parquet_gzip_pandas_3 0.453819
csv_no_compression_pandas_3 0.497882
parquet_gzip_dask_3 0.544685
csv_no_compression_dask_3 0.593265
hdf5_no_compression_dask_3 0.627107
csv_gzip_dask_3 0.646636
csv_gzip_pandas_3 0.650796
hdf5_zlib_dask_3 0.721067
Compared with pattern 2, the gap between Vaex's HDF5 and Parquet has narrowed. The difference from Pandas is about 3 times. Vaex still seems to shine when there is a lot of computation involved.
run_overall(
    row_num=1_000_000, n=10, unit=PlotFileSizeUnit.KILO_BYTES,
    pattern=1, save_data=True)
2021-01-11 12:11:13.260384 Start saving uncompressed CSV...
2021-01-11 12:11:15.783632 Start saving gzip-compressed CSV...
2021-01-11 12:11:20.860890 Start saving uncompressed Parquet...
2021-01-11 12:11:21.009845 Start saving Snappy-compressed Parquet...
2021-01-11 12:11:21.205758 Start saving gzip-compressed Parquet...
2021-01-11 12:11:25.590414 Start saving uncompressed Pandas HDF5...
2021-01-11 12:11:25.861853 Start saving uncompressed Dask HDF5...
2021-01-11 12:11:26.538407 Start saving uncompressed Vaex HDF5...
2021-01-11 12:11:26.681708 Start saving zlib-compressed Pandas HDF5...
2021-01-11 12:11:27.372079 Start saving zlib-compressed Dask HDF5...
2021-01-11 12:11:28.995101 The save process is complete.
At this point, saving the gzip-compressed CSV takes quite a long time. On the other hand, Snappy-compressed Parquet and uncompressed HDF5 written via Vaex feel fairly fast to write (Snappy Parquet would likely also be faster if written with Vaex).
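For reference, a minimal sketch of the kind of write calls involved (the article's actual save helpers are defined earlier; the DataFrame and file names here are placeholders):

import pandas as pd
import vaex

pandas_df = pd.DataFrame({'a': range(1_000_000)})
vaex_df = vaex.from_pandas(df=pandas_df, copy_index=False)

# Uncompressed HDF5 written by Vaex (fast to write and to re-open).
vaex_df.export_hdf5('data.hdf5')

# Snappy-compressed Parquet written by Pandas (pyarrow backend assumed).
pandas_df.to_parquet('data_snappy.parquet', compression='snappy')

# gzip-compressed CSV, the slow case mentioned above.
pandas_df.to_csv('data.csv.gz', index=False, compression='gzip')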
parquet_gzip 23126.870
csv_gzip 26390.647
parquet_snappy 30674.674
hdf5_zlib 32221.201
parquet_no_compression 35467.713
csv_no_compression 41232.579
hdf5_no_compression 64065.632
hdf5_no_compression_vaex_1 0.015352
parquet_no_compression_vaex_1 0.058410
parquet_snappy_vaex_1 0.066124
hdf5_no_compression_pandas_1 0.203620
parquet_gzip_vaex_1 0.276284
parquet_snappy_pandas_1 0.304232
hdf5_zlib_pandas_1 0.316301
parquet_no_compression_dask_1 0.369187
parquet_snappy_dask_1 0.374635
parquet_no_compression_pandas_1 0.396081
parquet_gzip_pandas_1 0.508149
csv_no_compression_pandas_1 0.592706
csv_no_compression_dask_1 0.611989
parquet_gzip_dask_1 0.658335
csv_gzip_pandas_1 0.816507
hdf5_no_compression_dask_1 0.850102
csv_gzip_dask_1 0.888696
hdf5_zlib_dask_1 1.185779
Even in pattern 1, the speed of Vaex with HDF5 stands out considerably. For the Parquet cases in 2nd and 3rd place, the presence or absence of Snappy compression does not significantly affect performance, although they are quite a bit slower than HDF5.
run_overall(
    row_num=1_000_000, n=10, unit=PlotFileSizeUnit.KILO_BYTES,
    pattern=2, save_data=False)
hdf5_no_compression_vaex_2 0.020531
parquet_no_compression_vaex_2 0.062161
parquet_snappy_vaex_2 0.069428
parquet_gzip_vaex_2 0.280575
hdf5_no_compression_pandas_2 0.472019
hdf5_zlib_pandas_2 0.563466
parquet_no_compression_pandas_2 0.670589
parquet_snappy_pandas_2 0.705254
parquet_snappy_dask_2 0.720435
parquet_gzip_pandas_2 0.748364
parquet_no_compression_dask_2 0.837436
parquet_gzip_dask_2 1.061136
csv_gzip_pandas_2 1.080294
hdf5_no_compression_dask_2 1.142565
csv_no_compression_pandas_2 1.180453
csv_no_compression_dask_2 1.219284
hdf5_zlib_dask_2 1.320128
csv_gzip_dask_2 1.794837
In pattern 2 the gap is even more pronounced: HDF5 + Vaex is roughly 23 times faster than Pandas' fastest case.
run_overall(
    row_num=1_000_000, n=10, unit=PlotFileSizeUnit.KILO_BYTES,
    pattern=3, save_data=False)
hdf5_no_compression_vaex_3 0.188722
parquet_no_compression_vaex_3 0.239557
parquet_snappy_vaex_3 0.314108
parquet_gzip_vaex_3 0.554921
parquet_no_compression_pandas_3 0.679682
parquet_snappy_pandas_3 0.699997
parquet_no_compression_dask_3 0.739946
hdf5_zlib_pandas_3 0.741525
hdf5_no_compression_pandas_3 0.748426
parquet_snappy_dask_3 0.756193
csv_no_compression_pandas_3 0.970395
csv_no_compression_dask_3 1.007364
parquet_gzip_pandas_3 1.105683
parquet_gzip_dask_3 1.250151
csv_gzip_pandas_3 1.267784
csv_gzip_dask_3 1.338443
hdf5_zlib_dask_3 1.423993
hdf5_no_compression_dask_3 1.655324
Next, I will jump up in data size and try 5 million rows.
run_overall(
    row_num=5_000_000, n=5, unit=PlotFileSizeUnit.KILO_BYTES,
    pattern=1, save_data=True)
This time, the following error occurred.
ValueError: array is of length 1048576, while the length of the DataFrame is 5000000
Apparently this is an issue with the chunk size handling of Vaex's Parquet support.
Reference:
It seems to be a relatively new issue, and it apparently is not resolved in the version used in this article. If you need to handle data exceeding roughly 1 million rows in a single file, it seems better to split it across multiple files or to consider a newer version of Vaex (for example, once Vaex 3 or 4 becomes stable).
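As a sketch of that file-splitting workaround (not something this benchmark does; the chunk size and file names are arbitrary), you could write one Parquet file per 1 million rows like this:

import numpy as np
import pandas as pd

row_num = 5_000_000
max_rows_per_file = 1_000_000
df = pd.DataFrame({'a': np.arange(row_num)})

# Write one Parquet file per chunk of at most max_rows_per_file rows.
n_files = int(np.ceil(row_num / max_rows_per_file))
for i in range(n_files):
    chunk = df.iloc[i * max_rows_per_file:(i + 1) * max_rows_per_file]
    chunk.to_parquet(f'chunk_{i}.parquet')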
I usually work with larger data, and since the performance advantage of Vaex and the like grows as the data size increases, it is a little disappointing that I could not run the verification on larger data this time. Since this article is a verification with a single file, I will stop at 1 million rows for now.
If I get the chance, I would like to verify huge, multi-file datasets in a separate article (for example, handling a large number of files of about 1 million rows each at the same time).
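When that multi-file verification happens, the reading side would presumably look something like the following sketch (both libraries accept glob patterns; the paths here are placeholders):

import dask.dataframe as dd
import vaex

# Dask treats the matched Parquet files as partitions of one DataFrame.
dask_df = dd.read_parquet('chunks/chunk_*.parquet')

# Vaex can open multiple HDF5 files at once and concatenate them.
vaex_df = vaex.open('chunks/chunk_*.hdf5')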