A small story about outputting table data in CSV format at high speed


A while back, a colleague mentioned that outputting table data to CSV was taking a long time, so I gave them some advice.

I hadn't measured the time back then, so here is a note based on a simple verification of how fast each approach actually is.

Premise

- The contents of the table are output as-is.
- PostgreSQL 12.1

Prepared data

For the test data, we prepared 10 million rows like the following, differing only in id.

-[ RECORD 1 ]--------
id     | 1
data1  | aiueo
data2  | kakikukeko
data3  | sasshisuseso
data4  | tachitsuteto
data5  | naninuneno
data6  | hahihuheho
data7  | mamimumemo
data8  | yayuyo
data9  | rarirurero
data10 | wawon
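
The article doesn't show how these rows were loaded; below is a minimal setup sketch (an assumption, not from the original), using a table named demo with an integer id, a text column for each data field, and PostgreSQL's generate_series:

-- Hypothetical setup: 10 million rows that differ only in id.
create table demo (
    id     integer,
    data1  text, data2 text, data3 text, data4 text, data5 text,
    data6  text, data7 text, data8 text, data9 text, data10 text
);

insert into demo
select g,
       'aiueo', 'kakikukeko', 'sasshisuseso', 'tachitsuteto', 'naninuneno',
       'hahihuheho', 'mamimumemo', 'yayuyo', 'rarirurero', 'wawon'
from generate_series(1, 10000000) as g;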

Output method 1

Output method 1 fetches all the rows with a plain SELECT and converts them to CSV format in the program.

SQL

select * from demo

Output code

csv_out.py


import psycopg2
import time

def get_connection():
    return psycopg2.connect("host=localhost port=5432 dbname=sampledb user=postgres password=postgres")

path = "csv1.csv"
with get_connection() as conn:
    with conn.cursor() as cur:
        # Measure the query itself.
        start = time.time()
        cur.execute("select * from demo")
        exec_time = time.time() - start
        print(exec_time)

        # Measure the CSV conversion and file output.
        start = time.time()
        with open(path, "w") as f:
            i = 1
            for row in cur:
                # Join the columns with commas in Python.
                csv = ''
                for data in row:
                    csv = csv + str(data) + ','
                f.write(csv[:-1] + '\n')

                # Flush to disk every 1000 rows.
                if i % 1000 == 0:
                    f.flush()
                i += 1
        exec_time = time.time() - start
        print(exec_time)
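
As an aside, Python's standard csv module can do the per-row joining, and it also quotes values that contain commas or quotes, which the hand-rolled concatenation above does not. A minimal sketch of the same output using csv.writer (the file name csv1_alt.csv is made up):

import csv
import psycopg2

# Sketch: same query as csv_out.py, but the csv module joins and quotes.
conn = psycopg2.connect("host=localhost port=5432 dbname=sampledb user=postgres password=postgres")
with conn, conn.cursor() as cur:
    cur.execute("select * from demo")
    with open("csv1_alt.csv", "w", newline="") as f:
        csv.writer(f).writerows(cur)  # each row tuple becomes one CSV line
conn.close()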

Output method 2

Output method 2 converts the rows to CSV format in the SQL itself, by concatenating the columns.

SQL

select 
    id || ',' || 
    data1 || ',' || 
    data2 || ',' || 
    data3 || ',' || 
    data4 || ',' || 
    data5 || ',' || 
    data6 || ',' || 
    data7 || ',' || 
    data8 || ',' || 
    data9 || ',' || 
    data10 as data
from demo 

Output code

csv_out2.py


import psycopg2
import time

def get_connection():
    return psycopg2.connect("host=localhost port=5432 dbname=sampledb user=postgres password=postgres")

path = "csv2.csv"
with get_connection() as conn:
    with conn.cursor() as cur:
        # Measure the query itself; the columns are joined with || in SQL.
        start = time.time()
        cur.execute("select id || ',' || data1 || ',' || data2 || ',' || data3 || ',' || data4 || ',' || data5 || ',' || data6 || ',' || data7 || ',' || data8 || ',' || data9 || ',' || data10 as data from demo")
        exec_time = time.time() - start
        print(exec_time)

        # Measure the file output; each row is already a single CSV string.
        start = time.time()
        with open(path, "w") as f:
            i = 1
            for row in cur:
                for data in row:
                    f.write(data + '\n')
                # Flush to disk every 1000 rows.
                if i % 1000 == 0:
                    f.flush()
                i += 1
        exec_time = time.time() - start
        print(exec_time)
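
One caveat with || concatenation: in PostgreSQL, if any column is NULL, the entire concatenated string becomes NULL, so data would arrive as None and f.write(data + '\n') would raise a TypeError. The test data here has no NULLs, but a NULL-safe variant (an assumption, not part of the original) can wrap each column in coalesce:

-- NULL-safe variant: coalesce turns NULL into an empty string,
-- so every row keeps the same number of fields.
select id || ',' ||
       coalesce(data1,  '') || ',' ||
       coalesce(data2,  '') || ',' ||
       coalesce(data3,  '') || ',' ||
       coalesce(data4,  '') || ',' ||
       coalesce(data5,  '') || ',' ||
       coalesce(data6,  '') || ',' ||
       coalesce(data7,  '') || ',' ||
       coalesce(data8,  '') || ',' ||
       coalesce(data9,  '') || ',' ||
       coalesce(data10, '') as data
from demo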

Output result

The results are as follows.

Processing  | Output method 1 | Output method 2
SQL         | 11.68s          | 13.35s
File output | 56.95s          | 15.05s

Output method 2 was faster overall: its query step was slightly slower, but the file output step was almost four times faster.

Impressions

Personally, I expected it to be even faster; the gain may be limited here because the number of output columns is small. The more columns you output, the more effective this approach should be.

If you can connect to the DB server and write the CSV there, I think export commands such as COPY ... TO can produce the file even faster. However, in the maintenance and development of business systems, outputting table data as-is is a relatively common requirement, so I use this approach in those cases.
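
For reference, COPY can also be driven from the client side through psycopg2, so no shell access to the server is needed. A minimal sketch using cursor.copy_expert (the file name csv_copy.csv is made up); the server formats the CSV and streams it over the connection:

import psycopg2

# Sketch: stream the whole table as CSV with COPY ... TO STDOUT.
conn = psycopg2.connect("host=localhost port=5432 dbname=sampledb user=postgres password=postgres")
with conn, conn.cursor() as cur:
    with open("csv_copy.csv", "w") as f:
        cur.copy_expert("COPY demo TO STDOUT WITH (FORMAT csv)", f)
conn.close()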
