I've heard that speeches and skirts are better when they're short.
The same goes for data analysis: I want to run as many experiments as possible, so routine, repetitive work such as preprocessing should take as little time as possible.
I think profiling is useful in such cases.
Recently, I have been handling data on the order of several tens of gigabytes in a personal project.
Through that work I made some small discoveries about parallel processing, profiling, and so on, which I would like to share.
The first is a discovery I made while profiling with line_profiler.
Many people have written about line_profiler, so please check it out. It's a great project.
I can't show the data that was actually used, so I will proceed with sample data that has a similar structure.
In [1]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 3 columns):
key 100000 non-null int64
data1 100000 non-null int64
data2 100000 non-null int64
dtypes: int64(3)
memory usage: 3.1 MB
In [2]: df.head()
Out[2]:
key data1 data2
0 1800 4153 159
1 5568 6852 45
2 432 7598 418
3 4254 9412 931
4 3634 8204 872
The actual data is tens of millions of rows, but since this is sample data, it has been cut down to 100,000 rows.
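Since the original file can't be shared, here is a minimal sketch for generating a CSV with the same structure (the value ranges are made up; the path ./data/testdata.csv is just the one used in the code below):

import numpy as np
import pandas as pd

# Generate 100,000 rows with the same columns as the real data:
# key, data1 and data2, all int64. The value ranges are arbitrary.
rng = np.random.default_rng(0)
sample = pd.DataFrame({
    'key': rng.integers(0, 10000, size=100_000),
    'data1': rng.integers(0, 10000, size=100_000),
    'data2': rng.integers(0, 1000, size=100_000),
})
sample.to_csv('./data/testdata.csv', index=False)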
The aggregation is done with the code below (written somewhat redundantly so that it can be profiled line by line).
import pandas as pd

def proc1():
    chunker = pd.read_csv('./data/testdata.csv', chunksize=10000)
    li = []
    for df in chunker:
        # Change the column names
        df.rename(columns={'data1': 'value1', 'data2': 'value2'}, inplace=True)
        # Aggregate by key and take the total of value1
        li.append(df.groupby('key')['value1'].sum())
    g = pd.concat(li, axis=1)
    return g.sum(axis=1)
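For reference, calling it returns a Series indexed by key, holding the per-key totals of value1 summed across all chunks. A quick sanity check, assuming the sample CSV generated above exists:

result = proc1()
print(result.head())  # per-key totals of value1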
The file is split by chunk size and read piece by piece; value2 is not used in the aggregation.
This time, line_profiler is used in an IPython notebook.
%load_ext line_profiler
You can then access it with the %lprun magic command.
Let's use this to measure proc1 defined above.
In [3]: %load_ext line_profiler
In [4]: %lprun -f proc1 proc1()
Timer unit: 1e-06 s
Total time: 0.060401 s
File: <ipython-input-105-0457ade3b36e>
Function: proc1 at line 1
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def proc1():
2 1 1785 1785.0 3.0 chunker = pd.read_csv('./data/coltest.csv', chunksize=100000)
3
4 1 2 2.0 0.0 li = []
5 2 49155 24577.5 81.4 for df in chunker:
6 1 1932 1932.0 3.2 df.rename(columns={'data': 'value1', 'data2': 'value2'}, inplace=True)
7 1 4303 4303.0 7.1 li.append(df.groupby('key')['value1'].sum())
8
9 1 2723 2723.0 4.5 g = pd.concat(li, axis=1)
10 1 501 501.0 0.8 return g.sum(axis=1)
Until I ran line_profiler, my feeling was something like "I suppose reading the file in chunks is the slow part." Reading is indeed slow, but there are other parts that take a surprising amount of time.
a. The df.rename part takes about half as much % Time (percentage of the total) as the groupby aggregation itself.
-- Renaming the columns should not be done inside the loop.
-- In this case, if the columns need to be renamed at all, it should be done at read time with a read_csv option (such as names); see the sketch after these notes.
-- Arguably, the columns don't need to be renamed in the first place.
b. Since the value2 column is not used, it should not even be read; this can be done with read_csv's usecols option (also shown in the sketch below).
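As a rough illustration of both points, here is a minimal sketch of proc1 rewritten so that the renaming happens at read time and value2 is never parsed at all. It assumes the same sample file as above; names together with header=0 replaces the header row, and usecols restricts the columns that are read:

import pandas as pd

def proc2():
    # Rename the columns while reading (names + header=0 replaces the header row)
    # and skip value2 entirely with usecols, so the loop only does the aggregation.
    chunker = pd.read_csv(
        './data/testdata.csv',
        header=0,
        names=['key', 'value1', 'value2'],  # new names for key, data1, data2
        usecols=['key', 'value1'],          # value2 is never parsed
        chunksize=10000,
    )
    li = [chunk.groupby('key')['value1'].sum() for chunk in chunker]
    g = pd.concat(li, axis=1)
    return g.sum(axis=1)

Whether this actually saves time would of course need to be confirmed by running %lprun on it again.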
I had not expected the rename processing to take as long as it did.
I think line_profiler, which made these discoveries possible, is a very good tool.
Next, I would like to write up some working notes on parallel processing in IPython.