I needed to convert a large number of CSV files to Parquet. Because the CSV files had no header line with column names in the first place, I had to create a tool that performs these two steps:
- Add a header to each CSV file
- Convert the CSV files to Parquet
Assumptions
The column names added to the CSV header become the column names in the Parquet file.
If there is no header line and the file starts directly with data, the values in the first data row become the column names of the output Parquet file.
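The effect of a missing header line can be checked with pandas before any Parquet file is written. The following is a minimal sketch, assuming a two-column CSV held in memory. The sample data is made up for illustration, and header1 and header2 are the same placeholder names used in this post.

```python
import io

import pandas as pd

# Hypothetical two-row CSV with no header line
csv_text = "1,2\n3,4\n"

# Default behavior: the first data row is consumed as the column names
df_default = pd.read_csv(io.StringIO(csv_text))

# Supplying names explicitly keeps every row as data
df_named = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["header1", "header2"],
)

print(list(df_default.columns), len(df_default))
print(list(df_named.columns), len(df_named))
```

With the default read, only one data row remains because the first row was used as column names. With explicit names, both rows are kept.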
The step that adds the CSV header line could have been written in Python, but it was relatively easy to do in a shell script, so I wrote it in shell and called the script from Python.
qiita.py
import subprocess
# Run the shell script that adds the header line to each CSV file
cmd = './add_header.sh'
subprocess.call(cmd, shell=True)
Passing shell=True to subprocess.call runs the command through the shell, so you can call an external shell script.
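subprocess.call returns the exit status of the command, which qiita.py does not check. As a hedged sketch, the helper below (run_command is a hypothetical name, not part of the original script) uses subprocess.run with an argument list instead of shell=True and returns the exit status and captured output, so a failure can be detected.

```python
import subprocess
import sys


def run_command(args):
    """Run a command given as an argument list; return (exit status, stdout)."""
    result = subprocess.run(args, capture_output=True, text=True)
    return result.returncode, result.stdout


if __name__ == "__main__":
    # Stand-in command for illustration; in this post it would be ['./add_header.sh']
    status, output = run_command([sys.executable, "-c", "print('ok')"])
    print(status, output.strip())
```

A non-zero status means the command failed, in which case the conversion step should probably not run.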
add_header.sh
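#!/usr/bin/env bash
# Insert the header line before line 1 of every file directly under from_dir
for file in $(\find from_dir -maxdepth 1 -type f); do
  gsed -i '1iheader1,header2' "$file"
done
The "1i" in front of the header text is required: it tells gsed to insert the text before line 1.
gsed is GNU sed. Please install gnu-sed if the command is not available.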
■ Execution result
CSV file header
header1,header2
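As mentioned earlier, the header step could also be written in Python. The following is a minimal sketch of that alternative, not the method used in this post. add_header is a hypothetical helper, and it assumes each file is small enough to read into memory.

```python
from pathlib import Path


def add_header(directory, header="header1,header2"):
    """Prepend a header line to every regular file directly under directory."""
    changed = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue  # skip subdirectories, similar to find -maxdepth 1 -type f
        # Work on bytes so the existing content is written back unchanged
        body = path.read_bytes()
        path.write_bytes(header.encode("utf-8") + b"\n" + body)
        changed.append(path.name)
    return changed
```

Calling add_header("from_dir") would rewrite each file in place and return the names of the files it changed.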
I had to convert a large number of CSV files stored on S3 to Parquet. All of the files were downloaded locally first.
qiita2.py
import glob
import os
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

from_dir = './from_dir/'
to_dir = './to_dir/'

# Read all CSV files in from_dir
files = glob.glob(from_dir + "*")

# Convert one file at a time and store it in to_dir
for file in files:
    # Use only the file name, without the directory part, for the output
    file_name = os.path.basename(file)
    df = pd.read_csv(file)
    table = pa.Table.from_pandas(df)
    pq.write_table(table, to_dir + file_name + '.pq')
Each CSV file is read with pandas and written out with pyarrow. Converting to Parquet is easy with pyarrow.
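The output file name in qiita2.py is derived from the input path and gets '.pq' appended, so an input named data.csv becomes data.csv.pq. A minimal sketch of an alternative using pathlib follows. parquet_path is a hypothetical helper, and replacing the extension with .parquet instead of appending .pq is an assumption, not what the original script does.

```python
from pathlib import Path


def parquet_path(csv_file, to_dir):
    """Return the output path in to_dir for a given input CSV file."""
    # stem drops the directory part and the final extension, e.g. 'data' for 'data.csv'
    return Path(to_dir) / (Path(csv_file).stem + ".parquet")


print(parquet_path("./from_dir/data.csv", "./to_dir/"))
```

The result can be passed to pq.write_table after converting it with str().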