The other day, I did a simple performance comparison of Pandas, Dask, and Vaex with CSV, Parquet, and HDF5 using a single file.
However, since my day-to-day work often involves handling many files of time-series data, this time I will do a rough performance comparison across multiple files, going up to row counts where memory becomes quite tight.
TL;DR
- For the time-series-style data tried this time, there was not as much difference between Vaex + uncompressed HDF5 and Vaex + Snappy-compressed Parquet as I had expected. Considering ease of handling and file size, Snappy-compressed Parquet seems like a good choice.
- In the previous single-file comparison, HDF5 was faster than Parquet when a single file held far more rows, so HDF5 seems to shine when a large amount of data is packed into one file.
- Pandas and Dask do not seem to be very fast with HDF5 when time-series data is split across many files (Dask in particular is significantly slower).
- For time-series data with roughly 100,000 rows per file, as verified this time, there is no extreme difference between Vaex and Dask (although Vaex is about twice as fast). If anything, disk access in the environment used this time may also be a bottleneck.
- Even so, it is very comfortable and helpful that a calculation over nearly 200 million rows completes in about a minute and a half.
We will proceed with the following Docker image settings. The host is a Windows 10 laptop.
FROM jupyter/datascience-notebook:python-3.8.6
RUN pip install vaex==2.6.1
RUN pip install jupyter-contrib-nbextensions==0.5.1
RUN jupyter contrib nbextension install --user
RUN jupyter nbextension enable hinterland/hinterland \
&& jupyter nbextension enable toc2/main
The OS and library environment is as follows.
- Ubuntu 20.04
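As a quick sanity check, the installed versions can be printed from inside the notebook. This is a minimal sketch; `importlib.metadata` is part of the Python 3.8 standard library.

```python
from importlib.metadata import version  # standard library in Python 3.8+

# Print the installed versions of the three libraries being compared.
for package_name in ('pandas', 'dask', 'vaex'):
    print(package_name, version(package_name))
```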
In the previous article's comparison, the contenders were effectively uncompressed HDF5 and Snappy-compressed Parquet, so this time I will handle only those two formats and leave out CSV.
As for the libraries, Pandas is included only for row counts that comfortably fit in memory; once the number of rows becomes quite large, I will proceed with Dask and Vaex only.
Code from the previous performance comparison article will be used to some extent.
Assuming time-series data, we will prepare a data set of about 80,000 to 120,000 rows and 5 columns per day, covering 5 years from January 2016 through the end of December 2020.
The five columns are configured as follows.
- column_a: int -> a random value in the range 0 to 4999999.
- column_b: str -> a date-and-time string (example: 2020-12-31 15:10:20).
- column_c: int -> a random value in the range 0 to 100.
- column_d: int -> one of 100, 300, 500, 1000, 3000, or 10000.
- column_e: str -> a string of lowercase and uppercase letters with char_num (20) characters.
In addition, the hierarchical structure inside an HDF5 file differs between libraries, so a separate HDF5 file is saved for each library.
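As a side note, if you want to see how those internal layouts differ, a minimal sketch like the following can be used (assuming h5py is available in the environment; the example file names are hypothetical and match the file-name helpers defined below):

```python
import h5py

def print_hdf5_tree(file_path):
    """Print every group and dataset name inside the given HDF5 file."""
    with h5py.File(file_path, 'r') as f:
        f.visit(print)

# Hypothetical usage after the files have been saved:
# print_hdf5_tree('pandas_2016-01-01.hdf5')
# print_hdf5_tree('vaex_2016-01-01.hdf5')
```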
from string import ascii_letters
import random
from datetime import date, datetime
import numpy as np
import pandas as pd
import dask.dataframe as dd
from pandas.tseries.offsets import Day
import vaex
def make_random_ascii_str(char_num):
"""
Generates a random lowercase and uppercase string with the specified number of characters.
Parameters
----------
char_num : int
The number of characters to generate.
Returns
-------
random_str : str
The generated string.
"""
return ''.join(random.choices(population=ascii_letters, k=char_num))
def make_pandas_df(row_num, char_num, date_str):
"""
Generate a Pandas data frame for validation.
Parameters
----------
row_num : int
The number of rows in the data frame to generate.
    char_num : int
        The number of characters for the values set in the string column.
date_str : str
The string of the target date.
Returns
-------
pandas_df : pd.DataFrame
The generated data frame. Each value is set in the following 5 columns.
        - column_a : int -> A random value in the range 0 to 4999999.
        - column_b : str -> A date-and-time string (e.g., 2020-12-31 15:10:20).
        - column_c : int -> A random value in the range 0 to 100.
        - column_d : int -> One of 100, 300, 500, 1000, 3000, 10000.
        - column_e : str -> A string of lowercase and uppercase letters with char_num (20) characters.
"""
pandas_df = pd.DataFrame(
index=np.arange(0, row_num), columns=['column_a'])
row_num = len(pandas_df)
pandas_df['column_a'] = np.random.randint(0, 5000000, size=row_num)
random_hours = np.random.randint(low=0, high=24, size=row_num)
random_minutes = np.random.randint(low=0, high=60, size=row_num)
random_seconds = np.random.randint(low=0, high=60, size=row_num)
times = []
for i in range(row_num):
hour_str = str(random_hours[i]).zfill(2)
minute_str = str(random_minutes[i]).zfill(2)
second_str = str(random_seconds[i]).zfill(2)
time_str = f'{date_str} {hour_str}:{minute_str}:{second_str}'
times.append(time_str)
pandas_df['column_b'] = times
pandas_df['column_c'] = np.random.randint(0, 100, size=row_num)
pandas_df['column_d'] = np.random.choice(a=[100, 300, 500, 1000, 3000, 10000], size=row_num)
pandas_df['column_e'] = [make_random_ascii_str(char_num=20) for _ in range(row_num)]
pandas_df.sort_values(by='column_b', inplace=True)
return pandas_df
def get_pandas_hdf5_file_path(date_str):
"""
Get the path of the Pandas HDF5 file for the target date.
Parameters
----------
date_str : str
The character string of the target date.
Returns
-------
file_path : str
The generated file path.
"""
return f'pandas_{date_str}.hdf5'
def get_dask_hdf5_file_path(date_str):
"""
Get the path of the Dask HDF5 file for the target date.
Parameters
----------
date_str : str
The character string of the target date.
Returns
-------
file_path : str
The generated file path.
"""
return f'dask_{date_str}.hdf5'
def get_vaex_hdf5_file_path(date_str):
"""
Get the path of the Vaex HDF5 file for the target date.
Parameters
----------
date_str : str
The character string of the target date.
Returns
-------
file_path : str
The generated file path.
"""
return f'vaex_{date_str}.hdf5'
def get_parquet_file_path(date_str):
"""
Get the path of the Parquet file of the target date.
Parameters
----------
date_str : str
The character string of the target date.
Returns
-------
file_path : str
The generated file path.
"""
return f'{date_str}.parquet'
def save_data():
"""
Save each file of time series data for verification.
"""
current_date = date(2016, 1, 1)
last_date = date(2020, 12, 31)
while current_date <= last_date:
date_str = current_date.strftime('%Y-%m-%d')
row_num = random.randint(80_000, 120_000)
print(
datetime.now(), date_str, 'Start saving process. Number of lines:', row_num)
pandas_df = make_pandas_df(
row_num=row_num, char_num=20, date_str=date_str)
vaex_df = vaex.from_pandas(df=pandas_df, copy_index=False)
dask_df = dd.from_pandas(data=pandas_df, npartitions=1)
pandas_df.to_hdf(
path_or_buf=get_pandas_hdf5_file_path(date_str=date_str),
key='data', mode='w')
dask_df.to_hdf(
path_or_buf=get_dask_hdf5_file_path(date_str=date_str),
key='data', mode='w')
vaex_df.export_hdf5(
path=get_vaex_hdf5_file_path(date_str=date_str))
vaex_df.export_parquet(
path=get_parquet_file_path(date_str=date_str))
current_date += Day()
save_data()
2021-01-17 07:58:24.950967 2016-01-01 Start saving process. Number of lines: 83831
2021-01-17 07:58:26.994136 2016-01-02 Start saving process. Number of lines: 117457
2021-01-17 07:58:29.859080 2016-01-03 Start saving process. Number of lines: 101470
2021-01-17 07:58:32.381448 2016-01-04 Start saving process. Number of lines: 88966
...
Because Dask and Vaex evaluate lazily, a fair comparison cannot be made on reading alone. Instead, we measure how long it takes to complete reading plus some computation on the data.
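As a minimal illustration of that lazy behaviour (a hypothetical spot check, not part of the measurements), opening a file with Dask returns almost immediately, and the actual work only happens once a concrete result is requested:

```python
import dask.dataframe as dd

ddf = dd.read_parquet('2016-01-01.parquet')         # cheap: only builds a task graph
filtered = ddf[ddf['column_e'].str.contains('ab')]  # still cheap: nothing is computed yet
print(len(filtered))                                # expensive: the read and filter run here
```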
This time, we will set up the following two patterns.
- [Pattern 1]: Slice to only the rows where column_e contains the string "ab" -> calculate the number of rows after slicing.
- [Pattern 2]: Slice to the rows where column_a is 3,000,000 or less -> slice to only the rows where the column_e string starts with "a" -> GROUP BY the value of column_d -> calculate the maximum value of column_a for each group.
We will add the reading process for each of Pandas, Dask, and Vaex.
def read_pandas_df_from_hdf5(start_date, last_date):
"""
Reads Pandas dataframes in the specified date range from HDF5 files.
Parameters
----------
start_date : date
The start date of the date range.
last_date : date
The last day of the date range.
Returns
-------
df : pd.DataFrame
The loaded Pandas data frame.
"""
df_list = []
current_date = start_date
while current_date <= last_date:
date_str = current_date.strftime('%Y-%m-%d')
file_path = get_pandas_hdf5_file_path(date_str=date_str)
df = pd.read_hdf(path_or_buf=file_path, key='data')
df_list.append(df)
current_date += Day()
df = pd.concat(df_list, ignore_index=True, copy=False)
return df
def read_dask_df_from_hdf5(start_date, last_date):
"""
Read Dask dataframes in the specified date range from HDF5 files
Parameters
----------
start_date : date
The start date of the date range.
last_date : date
The last day of the date range.
Returns
-------
df : dd.DataFrame
The loaded Dask data frame.
"""
df_list = []
current_date = start_date
while current_date <= last_date:
date_str = current_date.strftime('%Y-%m-%d')
file_path = get_dask_hdf5_file_path(date_str=date_str)
df = dd.read_hdf(pattern=file_path, key='data')
df_list.append(df)
current_date += Day()
df = dd.concat(dfs=df_list)
return df
def read_vaex_df_from_hdf5(start_date, last_date):
"""
Read Vaex dataframes in a specified date range from an HDF5 file
Parameters
----------
start_date : date
The start date of the date range.
last_date : date
The last day of the date range.
Returns
-------
df : vaex.dataframe.DataFrame
The loaded Vaex data frame.
"""
file_paths = []
current_date = start_date
while current_date <= last_date:
date_str = current_date.strftime('%Y-%m-%d')
file_path = get_vaex_hdf5_file_path(date_str=date_str)
file_paths.append(file_path)
current_date += Day()
vaex_df = vaex.open_many(filenames=file_paths)
return vaex_df
def read_pandas_df_from_parquet(start_date, last_date):
"""
Read Pandas dataframes in the specified date range from Parquet.
Parameters
----------
start_date : date
The start date of the date range.
last_date : date
The last day of the date range.
Returns
-------
df : pd.DataFrame
The loaded Pandas data frame.
"""
df_list = []
current_date = start_date
while current_date <= last_date:
date_str = current_date.strftime('%Y-%m-%d')
file_path = get_parquet_file_path(date_str=date_str)
df = pd.read_parquet(path=file_path)
df_list.append(df)
current_date += Day()
df = pd.concat(df_list, ignore_index=True, copy=False)
return df
def read_dask_df_from_parquet(start_date, last_date):
"""
Read Dask dataframes in a specified date range from a Parquet file
Parameters
----------
start_date : date
The start date of the date range.
last_date : date
The last day of the date range.
Returns
-------
df : dd.DataFrame
The loaded Dask data frame.
"""
df_list = []
current_date = start_date
while current_date <= last_date:
date_str = current_date.strftime('%Y-%m-%d')
file_path = get_parquet_file_path(date_str=date_str)
df = dd.read_parquet(path=file_path)
df_list.append(df)
current_date += Day()
df = dd.concat(dfs=df_list)
return df
def read_vaex_df_from_parquet(start_date, last_date):
"""
Read Vaex dataframes in the specified date range from a Parquet file
Parameters
----------
start_date : date
The start date of the date range.
last_date : date
The last day of the date range.
Returns
-------
df : vaex.dataframe.DataFrame
The loaded Vaex data frame.
"""
file_paths = []
current_date = start_date
while current_date <= last_date:
date_str = current_date.strftime('%Y-%m-%d')
file_path = get_parquet_file_path(date_str=date_str)
file_paths.append(file_path)
current_date += Day()
vaex_df = vaex.open_many(filenames=file_paths)
return vaex_df
- Slice to only the rows where column_e contains the string "ab"
- Calculate the number of rows after slicing
We will add the process in each library.
def calculate_pattern_1_with_pandas_df(pandas_df):
"""
Calculate the first pattern in a Pandas data frame.
Parameters
----------
pandas_df : pd.DataFrame
The target Pandas data frame.
Returns
-------
row_count : int
The number of lines in the calculation result.
"""
pandas_df = pandas_df[pandas_df['column_e'].str.contains('ab')]
row_count = len(pandas_df)
return row_count
def calculate_pattern_1_with_dask_df(dask_df):
"""
The calculation of the first pattern is done in the Dask data frame.
Parameters
----------
dask_df : dd.DataFrame
The data frame of the target Dask.
Returns
-------
row_count : int
The number of lines in the calculation result.
"""
dask_df = dask_df[dask_df['column_e'].str.contains('ab')]
row_count = len(dask_df)
return row_count
def calculate_pattern_1_with_vaex_df(vaex_df):
"""
    The calculation of the first pattern is performed in the Vaex data frame.
Parameters
----------
vaex_df : vaex.dataframe.DataFrame
The target Vaex data frame.
Returns
-------
row_count : int
The number of lines in the calculation result.
"""
vaex_df = vaex_df[vaex_df['column_e'].str.contains('ab')]
row_count = len(vaex_df)
return row_count
- Slice to the rows where column_a is 3,000,000 or less
- Slice to only the rows where the column_e string starts with "a"
- GROUP BY the value of column_d
- Calculate the maximum value of column_a for each group
We will add the process in each library.
def calculate_pattern_2_with_pandas_df(pandas_df):
"""
Calculate the second pattern in a Pandas data frame.
Parameters
----------
pandas_df : pd.DataFrame
The target Pandas data frame.
Returns
-------
max_sr : pd.Series
        A series storing the calculated maximum value for each group.
"""
pandas_df = pandas_df[pandas_df['column_a'] <= 3_000_000]
pandas_df = pandas_df[pandas_df['column_e'].str.startswith('a')]
grouped = pandas_df.groupby(by='column_d')
max_df = grouped.max()
max_sr = max_df['column_a']
return max_sr
def calculate_pattern_2_with_dask_df(dask_df):
"""
The calculation of the second pattern is done in the Dask data frame.
Parameters
----------
dask_df : dd.DataFrame
The data frame of the target Dask.
Returns
-------
max_sr : pd.Series
        A series storing the calculated maximum value for each group.
"""
dask_df = dask_df[dask_df['column_a'] <= 3_000_000]
dask_df = dask_df[dask_df['column_e'].str.startswith('a')]
grouped = dask_df.groupby(by='column_d')
max_df = grouped.max()
max_sr = max_df['column_a']
max_sr = max_sr.compute()
return max_sr
def calculate_pattern_2_with_vaex_df(vaex_df):
"""
The calculation of the second pattern is performed in the Vaex data frame.
Parameters
----------
vaex_df : vaex.dataframe.DataFrame
The target Vaex data frame.
Returns
-------
max_sr : pd.Series
        A series storing the calculated maximum value for each group.
"""
vaex_df = vaex_df[vaex_df['column_a'] <= 3_000_000]
vaex_df = vaex_df[vaex_df['column_e'].str.startswith('a')]
max_df = vaex_df.groupby(
by='column_d',
agg={
'column_a': vaex.agg.max,
})
max_df = max_df.to_pandas_df(column_names=['column_a', 'column_d'])
max_df.index = max_df['column_d']
max_sr = max_df['column_a']
return max_sr
Now that the reading and aggregation processes have been added, we will write the code that combines them.
from timeit import timeit
from copy import deepcopy
import sys
class ReadAndCalcRunner:
def __init__(self, label, pattern, read_func, calc_func):
"""
A class that handles the setting and execution processing of the combination of reading and calculation processing.
Parameters
----------
label : str
A label for identifying the target combination.
Example: csv_no_compression_pandas
pattern : int
            The target pattern (1 or 2).
        read_func : Callable
            A function that handles the reading process. Its arguments are arbitrary,
            but it must be one that returns a data frame.
        calc_func : Callable
            A function that handles the calculation process. It must be one that
            accepts a data frame as its first argument.
"""
self.label = label
self.pattern = pattern
self._read_func = read_func
self._calc_func = calc_func
def run(self, n, start_date, last_date, debug=False):
"""
        Run the reading and calculation process. After execution, the mean_seconds
        attribute is set to the average execution time in seconds (float).
Parameters
----------
n : int
            The number of executions. The larger the number, the more accurate the
            measured time, but note that it will take longer to complete.
start_date : date
The start date of the date range.
last_date : date
The last day of the date range.
debug : bool, default False
Debug settings. If True is specified, the calculation result will be output.
"""
statement = 'df = self._read_func(start_date=start_date, last_date=last_date);'
if not debug:
statement += 'self._calc_func(df);'
else:
statement += 'result = self._calc_func(df); print(result)'
result_seconds = timeit(
stmt=statement,
number=n, globals=locals())
self.mean_seconds = result_seconds / n
this_module = sys.modules[__name__]
FORMATS = (
'parquet',
'hdf5',
)
LIBS = (
'pandas',
'dask',
'vaex',
)
PATTERNS = (1, 2)
runners = []
for format_str in FORMATS:
for lib_str in LIBS:
for pattern in PATTERNS:
label = f'{format_str}_{lib_str}'
read_func_name = f'read_{lib_str}_df_from_{format_str}'
read_func = getattr(this_module, read_func_name)
calc_func_name = f'calculate_pattern_{pattern}_with_{lib_str}_df'
calc_func = getattr(this_module, calc_func_name)
runners.append(
ReadAndCalcRunner(
label=label,
pattern=pattern,
read_func=read_func,
calc_func=calc_func)
)
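As a quick sanity check of the setup (a hypothetical spot check, not one of the measurements below), a single runner can be executed directly over a few days:

```python
# Run the first combination (Parquet + Pandas, pattern 1) once over three days
# and print the measured average time in seconds.
runner = runners[0]
print(runner.label, runner.pattern)
runner.run(n=1, start_date=date(2016, 1, 1), last_date=date(2016, 1, 3))
print(runner.mean_seconds)
```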
We will add functions for comparing processing times.
import matplotlib
from matplotlib.ticker import ScalarFormatter, FormatStrFormatter
import matplotlib.pyplot as plt
matplotlib.style.use('ggplot')
def plot_time_info(time_sr, pattern, print_sr=True):
"""
Plot the processing time.
Parameters
----------
time_sr : pd.Series
A series that stores the processing time.
    pattern : int
        The target aggregation pattern (1 or 2).
print_sr : bool, default True
Whether to output and display a series of plot contents.
"""
sr = time_sr.copy()
sr.sort_values(inplace=True, ascending=False)
if print_sr:
print(sr.sort_values(ascending=True))
    title = f'Read and calculation seconds (pattern: {pattern})'
ax = sr.plot(kind='barh', figsize=(10, len(sr) // 2), title=title)
ax.xaxis.set_major_formatter(FormatStrFormatter('%.3f'))
plt.show()
To simplify the notebook, we add functions that run each step (reading, aggregation, and visualization) at once.
def run_and_get_result_time_sr(
        runners, pattern, n, start_date, last_date, skip_pandas=False,
        skip_dask_hdf5=False, skip_dask_parquet=False):
"""
    Run each measurement process for the specified pattern and get a series
    storing the resulting number of seconds for each.
Parameters
----------
runners : list of ReadAndCalcRunner
A list of instances that hold definitions of execution processes.
pattern : int
        The pattern to execute (1 or 2).
    n : int
        The number of executions. The larger the number, the more accurate the
        measured time, but note that it will take longer to complete.
start_date : date
The start date of the date range.
last_date : date
The last day of the date range.
skip_pandas : bool
Whether to skip Pandas processing.
skip_dask_hdf5 : bool
Whether to skip processing of HDF5 format Dask.
skip_dask_parquet : bool
Whether to skip the processing of Dask in Parquet format.
Returns
-------
sr : pd.Series
        A series that stores the measurement results. The index is set to a string
        concatenating the format/library label and the pattern, and each value is
        the number of seconds.
"""
runners = deepcopy(runners)
data_dict = {}
for runner in runners:
if runner.pattern != pattern:
continue
if skip_pandas and 'pandas' in runner.label:
continue
if skip_dask_hdf5 and 'dask' in runner.label and 'hdf5' in runner.label:
continue
if skip_dask_parquet and 'dask' in runner.label and 'parquet' in runner.label:
continue
label = f'{runner.label}_{pattern}'
print(datetime.now(), label, 'Start processing...')
runner.run(n=n, start_date=start_date, last_date=last_date)
data_dict[label] = runner.mean_seconds
sr = pd.Series(data=data_dict)
return sr
def run_overall(
        n, pattern, start_date, last_date, skip_pandas=False,
        skip_dask_hdf5=False, skip_dask_parquet=False):
"""
    Run the whole set of processes (reading, aggregation, and visualization)
    for the target pattern.
Parameters
----------
n : int
        The number of reads and calculations performed. The larger the number, the
        more accurate the measured time, but note that it will take longer to complete.
pattern : int
        The target calculation pattern (1 or 2).
start_date : date
The start date of the date range.
last_date : date
The last day of the date range.
skip_pandas : bool
Whether to skip Pandas processing.
skip_dask_hdf5 : bool
Whether to skip processing of HDF5 format Dask.
skip_dask_parquet : bool
Whether to skip the processing of Dask in Parquet format.
"""
time_sr = run_and_get_result_time_sr(
runners=runners, pattern=pattern, n=n, start_date=start_date,
last_date=last_date, skip_pandas=skip_pandas,
skip_dask_hdf5=skip_dask_hdf5,
skip_dask_parquet=skip_dask_parquet)
    plot_time_info(time_sr=time_sr, pattern=pattern)
Now that everything is ready, we will measure each processing time under the following conditions. Pandas is included only up to the 3-month case; beyond that, only Dask and Vaex are run.
- 1 month (about 3 million rows)
- 3 months (about 9 million rows)
- 6 months (about 18 million rows)
- 1 year (about 36 million rows)
- 3 years (about 108 million rows)
- 5 years (about 180 million rows)
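For reference, here is a rough back-of-the-envelope check of those row counts, assuming an average of about 100,000 rows per daily file (the day counts per period are approximate):

```python
# Approximate total rows per period, at ~100,000 rows per daily file.
rows_per_day = 100_000
periods = {
    '1 month': 31,
    '3 months': 91,
    '6 months': 182,
    '1 year': 366,
    '3 years': 1096,
    '5 years': 1827,
}
for label, days in periods.items():
    print(label, f'{rows_per_day * days:,} rows')
```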
Pattern 1:
run_overall(
n=5, pattern=1,
start_date=date(2016, 1, 1),
last_date=date(2016, 1, 31),
skip_pandas=False)
hdf5_vaex_1 1.015516
parquet_vaex_1 1.028920
parquet_pandas_1 2.685143
parquet_dask_1 2.828006
hdf5_pandas_1 3.311069
hdf5_dask_1 7.616159
- Even though it is Snappy compressed, the processing time of Parquet + Vaex is quite close to that of uncompressed HDF5.
- Pandas and Dask seem to be slower with HDF5 (despite it being uncompressed). Dask's processing time in particular is concerning.
Pattern 2:
run_overall(
n=5, pattern=2,
start_date=date(2016, 1, 1),
last_date=date(2016, 1, 31),
skip_pandas=False)
hdf5_vaex_2 0.766808
parquet_vaex_2 0.848183
parquet_pandas_2 2.436566
parquet_dask_2 2.961728
hdf5_pandas_2 4.134251
hdf5_dask_2 8.657277
Pattern 1:
run_overall(
n=5, pattern=1,
start_date=date(2016, 1, 1),
last_date=date(2016, 3, 31),
skip_pandas=False)
hdf5_vaex_1 2.260799
parquet_vaex_1 2.649166
parquet_dask_1 8.578201
parquet_pandas_1 8.656629
hdf5_pandas_1 9.994132
hdf5_dask_1 22.766739
Pattern 2:
run_overall(
n=5, pattern=2,
start_date=date(2016, 1, 1),
last_date=date(2016, 3, 31),
skip_pandas=False)
hdf5_vaex_2 1.894970
parquet_vaex_2 2.529330
parquet_pandas_2 7.110901
hdf5_pandas_2 9.252198
parquet_dask_2 10.688318
hdf5_dask_2 23.362928
From here on, Pandas is skipped and only Dask and Vaex are run. Also, to keep the processing time manageable, the number of executions per pattern is reduced from 5 to 3.
In addition, since Dask + HDF5 is quite slow, the code was adjusted so that it can also be skipped to save processing time.
Pattern 1:
run_overall(
n=3, pattern=1,
start_date=date(2016, 1, 1),
last_date=date(2016, 6, 30),
skip_pandas=True,
skip_dask_hdf5=True)
hdf5_vaex_1 4.621812
parquet_vaex_1 5.633019
parquet_dask_1 17.827765
Pattern 2:
run_overall(
n=3, pattern=2,
start_date=date(2016, 1, 1),
last_date=date(2016, 6, 30),
skip_pandas=True,
skip_dask_hdf5=True)
hdf5_vaex_2 4.231214
parquet_vaex_2 5.312496
parquet_dask_2 17.153308
We will reduce the number of executions to 2.
Pattern 1:
run_overall(
n=2, pattern=1,
start_date=date(2016, 1, 1),
last_date=date(2016, 12, 31),
skip_pandas=True,
skip_dask_hdf5=True)
hdf5_vaex_1 9.618381
parquet_vaex_1 11.091080
parquet_dask_1 36.335810
Pattern 2:
run_overall(
n=2, pattern=2,
start_date=date(2016, 1, 1),
last_date=date(2016, 12, 31),
skip_pandas=True,
skip_dask_hdf5=True)
hdf5_vaex_2 9.132881
parquet_vaex_2 11.136143
parquet_dask_2 34.377085
The number of trials drops considerably, but from here on, due to processing time, each measurement is run only once.
Pattern 1:
run_overall(
n=1, pattern=1,
start_date=date(2016, 1, 1),
last_date=date(2018, 12, 31),
skip_pandas=True,
skip_dask_hdf5=True)
hdf5_vaex_1 40.676083
parquet_vaex_1 43.035784
parquet_dask_1 100.698389
Pattern 2:
run_overall(
n=1, pattern=2,
start_date=date(2016, 1, 1),
last_date=date(2018, 12, 31),
skip_pandas=True,
skip_dask_hdf5=True)
hdf5_vaex_2 40.061167
parquet_vaex_2 42.218093
parquet_dask_2 102.830116
Perhaps it is affected by running on Docker, but for some reason the Jupyter kernel crashes while running the Dask measurements, so for the 5-year data I will proceed with Vaex only.
Pattern 1:
run_overall(
n=1, pattern=1,
start_date=date(2016, 1, 1),
last_date=date(2020, 12, 31),
skip_pandas=True,
skip_dask_hdf5=True,
skip_dask_parquet=True)
parquet_vaex_1 95.578259
hdf5_vaex_1 99.258315
Since there was only one trial the result may have fluctuated, but here Parquet came out faster.
Pattern 2:
run_overall(
n=1, pattern=2,
start_date=date(2016, 1, 1),
last_date=date(2020, 12, 31),
skip_pandas=True,
skip_dask_hdf5=True,
skip_dask_parquet=True)
hdf5_vaex_2 78.231278
parquet_vaex_2 93.696743
Here, HDF5 is simply faster.