I've heard that speeches and skirts are better when they're short.
The same goes for data analysis: I want to run as many experiments as possible, so routine, repetitive work such as preprocessing should take as little time as possible.
I think profiling is useful in such cases.
Recently, I have been handling data on the order of several tens of gigabytes in a personal project.
Through that work I made some small discoveries about parallel processing, profiling, and so on, which I would like to share.
The first is a discovery I made while profiling with line_profiler.
Many people have written about line_profiler, so please check it out. It's a great project.
I can't show the data that was actually used, so I will proceed with sample data that has a similar structure.
In [1]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 3 columns):
key 100000 non-null int64
data1 100000 non-null int64
data2 100000 non-null int64
dtypes: int64(3)
memory usage: 3.1 MB
In [2]: df.head()
Out[2]:
key data1 data2
0 1800 4153 159
1 5568 6852 45
2 432 7598 418
3 4254 9412 931
4 3634 8204 872
The actual data is tens of millions of rows, but since this is sample data, it has been cut down to 100,000 rows.
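Since the original file can't be shared, here is a minimal sketch for generating a CSV with the same structure (the value ranges are made up; the path ./data/testdata.csv is just the one used in the code below):

import numpy as np
import pandas as pd

# Generate 100,000 rows with the same columns as the real data:
# key, data1 and data2, all int64. The value ranges are arbitrary.
rng = np.random.default_rng(0)
sample = pd.DataFrame({
    'key': rng.integers(0, 10000, size=100_000),
    'data1': rng.integers(0, 10000, size=100_000),
    'data2': rng.integers(0, 1000, size=100_000),
})
sample.to_csv('./data/testdata.csv', index=False)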
The aggregation is done with the code below (written somewhat redundantly so that it can be profiled line by line).
import pandas as pd

def proc1():
    chunker = pd.read_csv('./data/testdata.csv', chunksize=10000)
    li = []
    for df in chunker:
        # Change the column names
        df.rename(columns={'data1': 'value1', 'data2': 'value2'}, inplace=True)
        # Aggregate by key and take the total of value1
        li.append(df.groupby('key')['value1'].sum())
    g = pd.concat(li, axis=1)
    return g.sum(axis=1)
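For reference, calling it returns a Series indexed by key, holding the per-key totals of value1 summed across all chunks. A quick sanity check, assuming the sample CSV generated above exists:

result = proc1()
print(result.head())  # per-key totals of value1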
The file is split by chunk size and read piece by piece; value2 is not used in the aggregation.
This time, line_profiler is used in an IPython notebook.
%load_ext line_profiler
You can then access it with the %lprun magic command.
Let's use this to measure proc1 defined above.
In [3]: %load_ext line_profiler
In [4]: %lprun -f proc1 proc1()
Timer unit: 1e-06 s
Total time: 0.060401 s
File: <ipython-input-105-0457ade3b36e>
Function: proc1 at line 1
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def proc1():
2 1 1785 1785.0 3.0 chunker = pd.read_csv('./data/coltest.csv', chunksize=100000)
3
4 1 2 2.0 0.0 li = []
5 2 49155 24577.5 81.4 for df in chunker:
6 1 1932 1932.0 3.2 df.rename(columns={'data': 'value1', 'data2': 'value2'}, inplace=True)
7 1 4303 4303.0 7.1 li.append(df.groupby('key')['value1'].sum())
8
9 1 2723 2723.0 4.5 g = pd.concat(li, axis=1)
10 1 501 501.0 0.8 return g.sum(axis=1)
Until I ran line_profiler, my feeling was something like "I suppose reading the file in chunks is the slow part." Reading is indeed slow, but there are other parts that take a surprising amount of time.
a. The df.rename part takes about half as much % Time (percentage of the total) as the groupby aggregation itself.
-- Renaming the columns should not be done inside the loop.
-- In this case, if the columns need to be renamed at all, it should be done at read time with a read_csv option (such as names); see the sketch after these notes.
-- Arguably, the columns don't need to be renamed in the first place.
b. Since the value2 column is not used, it should not even be read; this can be done with read_csv's usecols option (also shown in the sketch below).
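As a rough illustration of both points, here is a minimal sketch of proc1 rewritten so that the renaming happens at read time and value2 is never parsed at all. It assumes the same sample file as above; names together with header=0 replaces the header row, and usecols restricts the columns that are read:

import pandas as pd

def proc2():
    # Rename the columns while reading (names + header=0 replaces the header row)
    # and skip value2 entirely with usecols, so the loop only does the aggregation.
    chunker = pd.read_csv(
        './data/testdata.csv',
        header=0,
        names=['key', 'value1', 'value2'],  # new names for key, data1, data2
        usecols=['key', 'value1'],          # value2 is never parsed
        chunksize=10000,
    )
    li = [chunk.groupby('key')['value1'].sum() for chunk in chunker]
    g = pd.concat(li, axis=1)
    return g.sum(axis=1)

Whether this actually saves time would of course need to be confirmed by running %lprun on it again.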
I had not expected the rename processing to take as long as it did.
I think line_profiler, which made these discoveries possible, is a very good tool.
Next, I would like to write up some working notes on parallel processing in IPython.