A while back, a colleague mentioned that exporting table data to CSV was taking a long time, so I offered some advice. I had never actually measured it myself, though, so this post records a quick verification of how much faster it can be.
- The contents of the table are output as-is
- PostgreSQL 12.1

For test data, we prepared 10 million rows of the following shape, differing only in `id`:
```
-[ RECORD 1 ]--------
id     | 1
data1  | aiueo
data2  | kakikukeko
data3  | sasshisuseso
data4  | tachitsuteto
data5  | naninuneno
data6  | hahihuheho
data7  | mamimumemo
data8  | yayuyo
data9  | rarirurero
data10 | wawon
```
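For reference, data of this shape can be generated with `generate_series`. A sketch of the setup I assume (the column types are guesses from the record above; the post does not show the actual DDL):

```sql
-- Hypothetical setup: 10 million rows of the shape shown above.
create table demo (
    id bigint primary key,
    data1 text, data2 text, data3 text, data4 text, data5 text,
    data6 text, data7 text, data8 text, data9 text, data10 text
);

insert into demo
select
    g,
    'aiueo', 'kakikukeko', 'sasshisuseso', 'tachitsuteto', 'naninuneno',
    'hahihuheho', 'mamimumemo', 'yayuyo', 'rarirurero', 'wawon'
from generate_series(1, 10000000) as g;
```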
Output method 1 fetches all the rows and converts them to CSV format in the program.
SQL
```sql
select * from demo
```
Output code
csv_out.py
```python
import psycopg2
import time

def get_connection():
    return psycopg2.connect("host=localhost port=5432 dbname=sampledb user=postgres password=postgres")

path = "csv1.csv"

with get_connection() as conn:
    with conn.cursor() as cur:
        start = time.time()
        cur.execute("select * from demo")
        exec_time = time.time() - start
        print(exec_time)

        start = time.time()
        with open(path, "w") as f:
            i = 1
            for row in cur:
                csv = ''
                for data in row:
                    csv = csv + str(data) + ','
                f.write(csv[:-1] + '\n')
                if i % 1000 == 0:
                    f.flush()
                i += 1
        exec_time = time.time() - start
        print(exec_time)
```
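As an aside, the row-formatting step in method 1 (building `csv` by repeated `+` concatenation and trimming the trailing comma) can be written with `str.join`, which avoids both the trailing-comma trim and the intermediate strings. A minimal sketch (the helper name `format_row` is mine, not from the original code):

```python
def format_row(row):
    # Join the column values with commas; str() covers non-string types
    # such as the integer id column (and renders None as 'None', matching
    # the original str(data) behavior).
    return ','.join(str(col) for col in row)

print(format_row((1, 'aiueo', 'kakikukeko')))  # -> 1,aiueo,kakikukeko
```

Like the original loop, this does no CSV quoting; if fields may contain commas or newlines, the standard `csv` module is the safer choice.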
Output method 2 does the conversion to CSV format in SQL, concatenating the columns into one string per row.
SQL
```sql
select
    id     || ',' ||
    data1  || ',' ||
    data2  || ',' ||
    data3  || ',' ||
    data4  || ',' ||
    data5  || ',' ||
    data6  || ',' ||
    data7  || ',' ||
    data8  || ',' ||
    data9  || ',' ||
    data10 as data
from demo
```
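One caveat on the `||` operator used here: in PostgreSQL, concatenating a NULL yields NULL, so a single NULL column turns the entire row into NULL. If the columns can be NULL, `concat_ws` ignores NULL arguments instead; a sketch:

```sql
select concat_ws(',', id,
                 data1, data2, data3, data4, data5,
                 data6, data7, data8, data9, data10) as data
from demo
```

Note that `concat_ws` drops the separator along with the NULL, so if fixed column positions matter, keep the `||` form and wrap nullable columns in `coalesce(colN, '')`.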
Output code
csv_out2.py
```python
import psycopg2
import time

def get_connection():
    return psycopg2.connect("host=localhost port=5432 dbname=sampledb user=postgres password=postgres")

path = "csv2.csv"

with get_connection() as conn:
    with conn.cursor() as cur:
        start = time.time()
        cur.execute("select id || ',' || data1 || ',' || data2 || ',' || data3 || ',' || data4 || ',' || data5 || ',' || data6 || ',' || data7 || ',' || data8 || ',' || data9 || ',' || data10 as data from demo")
        exec_time = time.time() - start
        print(exec_time)

        start = time.time()
        with open(path, "w") as f:
            i = 1
            for row in cur:
                for data in row:
                    f.write(data + '\n')
                if i % 1000 == 0:
                    f.flush()
                i += 1
        exec_time = time.time() - start
        print(exec_time)
```
The results are as follows.
| Processing | Output method 1 | Output method 2 |
|---|---|---|
| SQL execution | 11.68 s | 13.35 s |
| File output | 56.95 s | 15.05 s |
Output method 2 was faster overall: about 28.4 s in total against 68.6 s, roughly a 2.4× speedup.
Personally I expected an even bigger gap; the modest difference may be because the number of output columns is small. The more columns you output, the more effective this approach should be.
If you can connect to the DB server and simply need the table as CSV, a dedicated export command or PostgreSQL's `COPY TO` should be faster still. In the maintenance and development of business systems, however, dumping table data as-is from a program is a fairly common task, and this method is useful in those cases.
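For reference, a minimal sketch of the `COPY TO` route (the file path is an assumption; the server-side form requires superuser privileges or membership in the `pg_write_server_files` role):

```sql
-- Server-side export: the file is written on the database server itself.
copy demo to '/tmp/csv3.csv' with (format csv);
```

From a client machine, psql's `\copy demo to 'csv3.csv' with (format csv)` performs the same export but writes the file locally.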