Continuing from the last story, I compared the performance of the numpy.array.T version I had written somewhat reluctantly against the zip version suggested in the comments.
The inputs are 1M.csv (1000 rows by 1000 columns) and 25M.csv (5000 rows by 5000 columns).
Method 1
import numpy

def csvt_1(fnin, fnout):
    # Read every row, build a 2-D numpy array of strings,
    # transpose it with .T, and write the result row by row.
    fin = open(fnin, "r")
    fout = open(fnout, "w")
    for line in numpy.array([s.strip('\n').split(',') for s in fin]).T:
        fout.write(",".join(line) + "\n")
    fin.close()
    fout.close()
Method 2
def csvt_2(fnin, fnout):
    # Build the same list of rows, but transpose it with zip(*rows),
    # which regroups the existing strings without copying them.
    fin = open(fnin, "r")
    fout = open(fnout, "w")
    for line in zip(*[s.strip('\n').split(',') for s in fin]):
        fout.write(','.join(line) + '\n')
    fin.close()
    fout.close()
Measurement results (%time / %run in IPython):

Method 1: 1M.csv about 500 ms, 25M.csv about 14 s
Method 2: 1M.csv about 250 ms, 25M.csv about 11 s
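For anyone who wants to reproduce the numbers outside IPython, here is a minimal timing sketch using timeit. It assumes csvt_1 and csvt_2 from above are defined in the same file; the output file name out.csv is arbitrary.

import timeit

# Minimal benchmark sketch: average over 5 runs per method on 1M.csv.
# Assumes csvt_1 and csvt_2 (defined above) are in scope.
for fn, label in [(csvt_1, "Method 1"), (csvt_2, "Method 2")]:
    t = timeit.timeit(lambda: fn("1M.csv", "out.csv"), number=5) / 5
    print(f"{label}: {t * 1000:.0f} ms per run")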
I measured several times and the results were consistent: zip wins. Why? My first guess was zip's lazy evaluation, but both versions build the full list of rows up front, so the difference more likely comes from numpy copying every field into a fixed-width unicode array (and converting back to Python strings when writing), while zip(*rows) just regroups references to the strings that already exist. Either way, the comparison produced a meaningful result.
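A quick way to see where numpy spends its time (my own illustration, not from the original post):

import numpy

rows = [["a", "bb"], ["ccc", "d"]]
arr = numpy.array(rows)
print(arr.dtype)         # <U3: every field was copied into a fixed-width unicode buffer
print(arr.T)             # the transpose itself is a cheap view; the copy above is the cost
print(list(zip(*rows)))  # [('a', 'ccc'), ('bb', 'd')]: tuples of the original string objects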
The data my colleague actually wanted to transpose is about 40 GB. That is far too large to hold in memory, so this approach can't be used there; I ended up writing a separate C# application to solve it.
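For what it's worth, a pure-Python workaround for files that exceed memory would be a block-wise transpose that makes several passes over the input, holding only a slice of columns at a time. This is only a sketch of the general idea, not the C# application mentioned above; the function name csvt_blocks and the block parameter are made up here.

def csvt_blocks(fnin, fnout, block=100):
    # Transpose without loading the whole file: each pass re-reads the
    # input and collects `block` columns, which become `block` output rows.
    # Memory use is bounded by block * number_of_rows fields.
    with open(fnin) as f:
        ncols = len(f.readline().rstrip("\n").split(","))
    with open(fnout, "w") as fout:
        for start in range(0, ncols, block):
            cols = [[] for _ in range(min(block, ncols - start))]
            with open(fnin) as fin:
                for line in fin:
                    fields = line.rstrip("\n").split(",")
                    for i, col in enumerate(cols):
                        col.append(fields[start + i])
            for col in cols:
                fout.write(",".join(col) + "\n")

Each pass re-reads the whole input, so for very wide files the number of passes (ncols / block) dominates the cost.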