Combine multiple CSVs and data frames into one data frame at high speed [60x speed]

df.append () is slow. .. ..

It used to take a long time to combine data frames, but I would like to introduce a good method. In my environment I succeeded more than 60 times faster! It can be used to combine and read CSV files of the same format.

environment

Python: 3.7.6
pandas: 1.0.1

The method I was using

I used to read the CSV and just append it. It took ** 12 minutes 42 seconds ** to read and combine about 2500 CSVs (about 1000 lines each).

csvs = glob.glob('./data/csv/*.csv')
df = pd.DataFrame()
for csv in csvs:
    df = df.append(pd.read_csv(csv))

Fast way

I was able to complete reading and combining the same CSV in ** 11.6 seconds **. I was able to complete the same work at ** 65.6x speed **!

from itertools import chain

def fast_concat(dfs):
    def fast_flatten(input_list):
        return list(chain.from_iterable(input_list))
    
    col_names = dfs[0].columns
    df_dict = dict.fromkeys(col_names, [])
    for col in col_names:
        extracted = (d[col] for d in dfs)
        df_dict[col] = fast_flatten(extracted)

    df = pd.DataFrame.from_dict(df_dict)[col_names]
    return df

dfs = []
for csv in csvs:
    dfs += [pd.read_csv(csv)]
df = fast_concat(dfs)

Why fast

In a word, to Dict https://qiita.com/siruku6/items/4846431198769b38bb41

Finally

This article is based on a post from here. The task of putting together a CSV was a common occurrence, so it's very helpful. You can use it just by creating a list of data frames, so it is highly versatile!

Recommended Posts

Combine multiple CSVs and data frames into one data frame at high speed [60x speed]
[Python] Combine multiple Excel sheets into one
[BigQuery] Load a part of BQ data into pandas at high speed
Combine multiple python files into one python file
[Python & Unix] Combine multiple PDF files into one.
Combine multiple Excel files loaded using pandas into one
Perform implied volatility calculation at high speed (market data processing)