Continuing from the last story, I compared the performance of the numpy.array.T version I had written somewhat reluctantly against the zip version suggested in the comments.
The inputs are 1M.csv (1000 rows by 1000 columns) and 25M.csv (5000 rows by 5000 columns).
Method 1
import numpy

def csvt_1(fnin, fnout):
    # Read every row, build a 2-D numpy array of strings,
    # transpose it with .T, and write the result row by row.
    fin = open(fnin, "r")
    fout = open(fnout, "w")
    for line in numpy.array([s.strip('\n').split(',') for s in fin]).T:
        fout.write(",".join(line) + "\n")
    fin.close()
    fout.close()
Method 2
def csvt_2(fnin, fnout):
    # Build the same list of rows, but transpose it with zip(*rows),
    # which regroups the existing strings without copying them.
    fin = open(fnin, "r")
    fout = open(fnout, "w")
    for line in zip(*[s.strip('\n').split(',') for s in fin]):
        fout.write(','.join(line) + '\n')
    fin.close()
    fout.close()
Measurement results (%time / %run in IPython):

Method 1: 1M.csv about 500 ms, 25M.csv about 14 s
Method 2: 1M.csv about 250 ms, 25M.csv about 11 s
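For anyone who wants to reproduce the numbers outside IPython, here is a minimal timing sketch using timeit. It assumes csvt_1 and csvt_2 from above are defined in the same file; the output file name out.csv is arbitrary.

import timeit

# Minimal benchmark sketch: average over 5 runs per method on 1M.csv.
# Assumes csvt_1 and csvt_2 (defined above) are in scope.
for fn, label in [(csvt_1, "Method 1"), (csvt_2, "Method 2")]:
    t = timeit.timeit(lambda: fn("1M.csv", "out.csv"), number=5) / 5
    print(f"{label}: {t * 1000:.0f} ms per run")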
I measured several times and the results were consistent: zip wins. Why? My first guess was zip's lazy evaluation, but both versions build the full list of rows up front, so the difference more likely comes from numpy copying every field into a fixed-width unicode array (and converting back to Python strings when writing), while zip(*rows) just regroups references to the strings that already exist. Either way, the comparison produced a meaningful result.
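A quick way to see where numpy spends its time (my own illustration, not from the original post):

import numpy

rows = [["a", "bb"], ["ccc", "d"]]
arr = numpy.array(rows)
print(arr.dtype)         # <U3: every field was copied into a fixed-width unicode buffer
print(arr.T)             # the transpose itself is a cheap view; the copy above is the cost
print(list(zip(*rows)))  # [('a', 'ccc'), ('bb', 'd')]: tuples of the original string objects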
The data my colleague actually wanted to transpose is about 40 GB. That is far too large to hold in memory, so this approach can't be used there; I ended up writing a separate C# application to solve it.
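For what it's worth, a pure-Python workaround for files that exceed memory would be a block-wise transpose that makes several passes over the input, holding only a slice of columns at a time. This is only a sketch of the general idea, not the C# application mentioned above; the function name csvt_blocks and the block parameter are made up here.

def csvt_blocks(fnin, fnout, block=100):
    # Transpose without loading the whole file: each pass re-reads the
    # input and collects `block` columns, which become `block` output rows.
    # Memory use is bounded by block * number_of_rows fields.
    with open(fnin) as f:
        ncols = len(f.readline().rstrip("\n").split(","))
    with open(fnout, "w") as fout:
        for start in range(0, ncols, block):
            cols = [[] for _ in range(min(block, ncols - start))]
            with open(fnin) as fin:
                for line in fin:
                    fields = line.rstrip("\n").split(",")
                    for i, col in enumerate(cols):
                        col.append(fields[start + i])
            for col in cols:
                fout.write(",".join(col) + "\n")

Each pass re-reads the whole input, so for very wide files the number of passes (ncols / block) dominates the cost.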