It used to take a long time to combine data frames, but I would like to introduce a good method. In my environment I succeeded more than 60 times faster! It can be used to combine and read CSV files of the same format.
Python: 3.7.6
pandas: 1.0.1
I used to read the CSV and just append it. It took ** 12 minutes 42 seconds ** to read and combine about 2500 CSVs (about 1000 lines each).
csvs = glob.glob('./data/csv/*.csv')
df = pd.DataFrame()
for csv in csvs:
df = df.append(pd.read_csv(csv))
I was able to complete reading and combining the same CSV in ** 11.6 seconds **. I was able to complete the same work at ** 65.6x speed **!
from itertools import chain
def fast_concat(dfs):
def fast_flatten(input_list):
return list(chain.from_iterable(input_list))
col_names = dfs[0].columns
df_dict = dict.fromkeys(col_names, [])
for col in col_names:
extracted = (d[col] for d in dfs)
df_dict[col] = fast_flatten(extracted)
df = pd.DataFrame.from_dict(df_dict)[col_names]
return df
dfs = []
for csv in csvs:
dfs += [pd.read_csv(csv)]
df = fast_concat(dfs)
In a word, to Dict https://qiita.com/siruku6/items/4846431198769b38bb41
This article is based on a post from here. The task of putting together a CSV was a common occurrence, so it's very helpful. You can use it just by creating a list of data frames, so it is highly versatile!
Recommended Posts