What I wanted to do

I needed to convert a large number of CSV files to Parquet. Because the CSV files had no header line with column names in the first place, I had to create a tool that performs two steps: (1) add a header to each CSV file, and (2) convert the CSV to Parquet.

Assumptions

The column names added to the CSV header become the column names in the Parquet file. If the header line is missing and the file starts directly with data, the values in the first data row are used as the column names of the output Parquet file.
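The behavior described above can be reproduced with a small in-memory sample. This is only a sketch: the two-row data is hypothetical, and the `header=None` / `names=` variant is shown as a pandas-side alternative for comparison, not as the approach this post takes.

```python
import io

import pandas as pd

# Hypothetical two-column sample with no header line.
raw = "a,1\nb,2\n"

# Default behavior: the first data row is consumed as the column names.
df_default = pd.read_csv(io.StringIO(raw))

# Supplying the names explicitly keeps every row as data.
df_named = pd.read_csv(
    io.StringIO(raw), header=None, names=["header1", "header2"]
)

print(list(df_default.columns), len(df_default))
print(list(df_named.columns), len(df_named))
```

In the first case one row of data is lost to the header, which is why the header line has to be in place before the conversion.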

Call Shell from Python

The step that adds the CSV header line could have been written in Python, but it was relatively easy to do in a shell script, so I wrote it as a shell script and called that file from Python.

`qiita.py`


import subprocess

# Run the shell script that adds the header line to each CSV file
cmd = './add_header.sh'
subprocess.call(cmd, shell=True)

Passing shell=True to subprocess lets you run an external shell script file.
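As a variation, `subprocess.run` with `check=True` raises an exception when the script exits with a non-zero status, so a failed header step does not go unnoticed before the conversion. A minimal sketch, where the helper name `run_checked` is a hypothetical one introduced only for illustration:

```python
import subprocess


def run_checked(args):
    """Run a command given as a list of arguments.

    Raises subprocess.CalledProcessError if the command exits non-zero;
    otherwise returns the exit status (always 0 in that case).
    """
    completed = subprocess.run(args, check=True)
    return completed.returncode
```

Assuming the script is executable, it could be called as `run_checked(['./add_header.sh'])`. Passing a list instead of a string also means `shell=True` is not needed.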

`add_header.sh`


#!/usr/bin/env bash
# Insert the header line at the top of every file directly under from_dir
for file in $(\find from_dir -maxdepth 1 -type f); do
    gsed -i '1iheader1,header2' "$file"
done

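The "1i" in the gsed expression is required: it tells gsed to insert the text that follows before line 1.

gsed is GNU sed. Please install gnu-sed beforehand.

■ Execution result: the first line of each CSV file is now header1,header2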
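As mentioned earlier, the header step could also have been written in Python. The following is a minimal sketch of that alternative, not the method used in this post. The function names `add_header` and `add_header_to_dir` are hypothetical, and the sketch assumes each file is small enough to read into memory at once.

```python
from pathlib import Path


def add_header(csv_path, header="header1,header2"):
    """Prepend a header line to one CSV file, rewriting it in place."""
    path = Path(csv_path)
    body = path.read_text()
    path.write_text(header + "\n" + body)


def add_header_to_dir(directory="from_dir", header="header1,header2"):
    """Prepend the header to every regular file directly under directory."""
    for entry in Path(directory).iterdir():
        if entry.is_file():
            add_header(entry, header)
```

Like the shell version, this does not check whether a header is already present, so running it twice adds the line twice.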

Convert CSV to Parquet

I had to convert a large number of CSV files stored on S3 to Parquet. All of the files had already been downloaded locally.

`qiita2.py`


import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import glob
import os

from_dir = './from_dir/'
to_dir = './to_dir/'

# Read all CSV files in from_dir
files = glob.glob(from_dir + "*")

# Convert one file at a time and store the result in to_dir
for file in files:
    file_name = os.path.basename(file)
    df = pd.read_csv(file)
    table = pa.Table.from_pandas(df)
    pq.write_table(table, to_dir + file_name + '.pq')

pandas reads the CSV file, and pyarrow makes it easy to write the result out as Parquet.
