A while ago, I posted an article titled "I want to perform summation processing of array elements at high speed with Google Apps Script". At the time I had no plans to use it elsewhere, so I created the library only for GAS. Recently, however, I have also been handling large arrays in Python, so I looked into ways of outputting array data to a csv file.
While researching, I came across Okadate's article, which says that both the csv module and the pandas module can be used to output a csv file. Since I handle a large amount of data, I was concerned about processing speed, so I decided to check it before putting anything into regular operation.
I therefore evaluated the csv output speed of the csv module and the pandas module. For reference, I also measured a standard method using the "+" operator and souwapy, a Python port of the GAS summation library.
I used the following modules to evaluate the speed of csv file output. The computer used for the measurements had a Core i5-3210M CPU, 8 GB of memory, and Windows 10 (x64) (v1607). The Python version was 3.5.2.
Module name | Remarks |
---|---|
csv | Included in the Python standard library |
pandas | Python data analysis module, version 0.19.0 |
souwapy | Self-made, version 1.1.1 |
standard algorithm | General method of adding array elements in order |
The script used for speed evaluation is as follows.
```python
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import time
import csv
import pandas as pd
import SOUWA


def measure_csv(ar):
    # csv module: write all rows with csv.writer
    start = time.time()
    with open('csvmod.csv', 'w') as f:
        writer = csv.writer(f, lineterminator='\n')
        writer.writerows(ar)
    Processing_time = time.time() - start
    print("Processing time = {0}".format(Processing_time) + " [s]")
    return


def measure_pandas(ar):
    # pandas module: build a DataFrame, then write it with to_csv
    start = time.time()
    df = pd.DataFrame(ar)
    df.to_csv('pandastest.csv', header=False, index=False)
    Processing_time = time.time() - start
    print("Processing time = {0}".format(Processing_time) + " [s]")
    return


def measure_souwapy(ar):
    # souwapy module: convert the array to csv text, then write it once
    start = time.time()
    s = SOUWA.sou()
    result = s.getcsvdata(ar, ",", "\n")
    with open('souwa.csv', 'w') as f:
        f.write(result)
    Processing_time = time.time() - start
    print("Processing time = {0}".format(Processing_time) + " [s]")
    return


def measure_standard(ar):
    # standard algorithm: append each row to one string with "+"
    start = time.time()
    result = ''
    for dat in ar:
        result += ",".join(dat) + "\n"
    with open('standard.csv', 'w') as f:
        f.write(result)
    Processing_time = time.time() - start
    print("Processing time = {0}".format(Processing_time) + " [s]")
    return


def MakeArray(row):
    # Each row: a 9-digit zero-padded number string followed by 'a' to 'e'
    theta = [0 for i in range(row)]
    for i in range(0, row):
        theta[i] = [str(i + 1).zfill(9), 'a', 'b', 'c', 'd', 'e']
    return theta


ar = MakeArray(10)
measure = 1  # 1: csv, 2: pandas, 3: souwapy, 4: standard algorithm
if measure == 1:
    measure_csv(ar)
elif measure == 2:
    measure_pandas(ar)
elif measure == 3:
    measure_souwapy(ar)
elif measure == 4:
    measure_standard(ar)
```
Each row of the test data is a six-element one-dimensional array: a nine-digit zero-padded numeric string followed by the letters a to e. This matches the data I actually want to write to a csv file in operation. The numbers are zero-padded so that every row has the same size, and each letter is a single character. With the number of rows set to 10, the data in the csv file is as follows.
000000001,a,b,c,d,e
000000002,a,b,c,d,e
000000003,a,b,c,d,e
000000004,a,b,c,d,e
000000005,a,b,c,d,e
000000006,a,b,c,d,e
000000007,a,b,c,d,e
000000008,a,b,c,d,e
000000009,a,b,c,d,e
000000010,a,b,c,d,e
"," is used as the delimiter and "\n" as the line feed code, for a total of 20 bytes per line. I also confirmed that the csv module, pandas module, souwapy module, and standard algorithm all produce the same data. The measured time covers everything from the array up to the completed output of the csv file.
The results are shown in the figure above. The horizontal axis is the number of array elements, and the vertical axis is the time required to complete the csv file output. Blue, red, orange, and green correspond to the standard algorithm, pandas module, csv module, and souwapy module, respectively. The results show that output gets faster in the order standard algorithm, pandas module, csv module, souwapy module, with the standard algorithm the slowest and the souwapy module the fastest. On average, the csv module was 1.4 times faster than the pandas module, the souwapy module was 2.3 times faster than the csv module, and the souwapy module was 3.1 times faster than the pandas module.
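To collect data points for a plot like this one, the measurement can be repeated over several array sizes. The following is only a minimal sketch for the csv module, not the script I used for the figure: the helper names `make_rows`, `time_csv_write`, and `sweep` are made up for illustration, it uses `time.perf_counter` instead of `time.time`, and it writes to a temporary directory.

```python
import csv
import os
import tempfile
import time


def make_rows(n):
    """Build n rows shaped like the test data (hypothetical helper)."""
    return [[str(i + 1).zfill(9), 'a', 'b', 'c', 'd', 'e'] for i in range(n)]


def time_csv_write(rows, path):
    """Return the seconds taken to write rows to path with the csv module."""
    start = time.perf_counter()
    with open(path, 'w', newline='') as f:
        csv.writer(f, lineterminator='\n').writerows(rows)
    return time.perf_counter() - start


def sweep(sizes, repeats=3):
    """Measure each size several times and keep the fastest run."""
    results = []
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'sweep.csv')
        for n in sizes:
            rows = make_rows(n)
            best = min(time_csv_write(rows, path) for _ in range(repeats))
            results.append((n, best))
    return results


results = sweep([1000, 10000, 100000])
for n, seconds in results:
    print("{0:>8} rows: {1:.4f} [s]".format(n, seconds))
```

Taking the minimum of a few runs is one common way to reduce noise from other processes; the other three methods could be swapped in for `time_csv_write` in the same way.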
Looking more closely, the processing time of the standard algorithm is proportional to the square of the number of elements. [It is known that, when arrays are added in order using the "+" operator, the total amount of data moved during processing grows in proportion to the square of the number of array elements](http://qiita.com/tanaike/items/17c88c69a0aa0b8b18d7). For each module, on the other hand, the processing time is linearly proportional to the number of elements. From this, it can be inferred that the csv module and pandas module apply some kind of optimization when converting to csv data. I tried to find out what algorithm csv and pandas use to convert an array to a csv file, but unfortunately I could not get to the bottom of it.
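The quadratic growth can be illustrated with a simple cost model. This sketch does not measure anything; it only counts characters under the assumption that every "+" copies the whole accumulated string into a new one, and compares that with writing each row once into a final buffer. The function names are made up for illustration, and the 20-character row length matches the test data. CPython can sometimes extend a string in place, so real timings may not follow this model exactly.

```python
def copied_chars_with_plus(n_rows, row_len=20):
    """Characters copied if every "+" rebuilds the whole accumulated string."""
    total = 0
    length = 0
    for _ in range(n_rows):
        length += row_len   # the new string holds everything so far plus one row
        total += length     # all of it is copied into a fresh string
    return total


def copied_chars_single_pass(n_rows, row_len=20):
    """Characters copied if each row is written once into one final buffer."""
    return n_rows * row_len


for n in (1000, 2000, 4000):
    print(n, copied_chars_with_plus(n), copied_chars_single_pass(n))
```

In this model, doubling the number of rows roughly quadruples the copying for the "+" approach (the total is `row_len * n * (n + 1) / 2`) but only doubles it for the single-pass approach.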
When the number of elements is small, there is no big difference in processing time between the modules. The difference appears as the number of elements increases. The souwapy module is fast because it uses an algorithm specialized for converting array data to csv data, but so far this is its only function. I think it is best combined with a module that has other, more advanced functions and used only for the final csv file output.
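As a rough sketch of that kind of division of labor, the data can be processed first and converted to csv text only at the very end. Everything here is an assumption for illustration: `to_csv_text` is a hypothetical stand-in built on `str.join`, not souwapy's algorithm, and the filter stands in for whatever processing another module would do. Unlike the csv module, this stand-in does no quoting, so it only suits data without delimiters or line feeds inside the fields.

```python
import os
import tempfile


def to_csv_text(rows, delimiter=',', newline='\n'):
    """Hypothetical stand-in for the final conversion step (not souwapy)."""
    return ''.join(delimiter.join(row) + newline for row in rows)


def export_filtered(rows, keep, path):
    """Process the data first (here just a filter), then convert and write once."""
    selected = [row for row in rows if keep(row)]
    # newline='' writes "\n" as is; without it, text mode on Windows
    # translates "\n" to "\r\n"
    with open(path, 'w', newline='') as f:
        f.write(to_csv_text(selected))
    return len(selected)


rows = [[str(i + 1).zfill(9), 'a', 'b', 'c', 'd', 'e'] for i in range(10)]
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, 'filtered.csv')
    written = export_filtered(rows, lambda row: int(row[0]) % 2 == 0, out_path)
    with open(out_path, newline='') as f:
        content = f.read()

print(written)
print(content, end='')
```

The point of the design is that the conversion function only needs a list of lists of strings, so it can sit behind any module that can hand its data over in that form.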
The souwapy module is a port of the GAS library. Since it seemed effective when the number of elements is large, I uploaded it to PyPI in case it is useful to others. The installation and usage are as follows.
So far, it can only sum arrays. I would like to add other functions as they become necessary.
- How to install

```
$ pip install souwapy
```

- How to use

```python
from souwapy import SOUWA
s = SOUWA.sou()
result = s.getcsvdata(array, ",", "\n")
```
Here `array` is the array to convert; change the delimiter and line feed code as needed. See below for details.