A story about struggling to loop over data for 3 million IDs

This is a continuation of yesterday's article.

■ Purpose

One month's worth of data is prepared for each of 3 million IDs. Each record holds one explanatory variable and one objective variable, so the table has three columns: ID, explanatory variable, and objective variable. The number of records is 3 million × 30 days ≈ 90 million.

For each of the 3 million IDs, I run a simple regression of the objective variable on the explanatory variable over the 30 days, and I want to store the correlation coefficient, slope, and p-value for each ID as the output.
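
As a minimal sketch of the assumed input (the DataFrame name df and the date column name "date_" match the code later in the article; the sizes and random values here are made up purely for illustration):

import numpy as np
import pandas as pd

# Toy stand-in for the real input: one row per ID per day, with columns ID, x, y
# (real data: 3 million IDs x 30 days ≈ 90 million rows)
n_ids, n_days = 1000, 30
df = pd.DataFrame({
    "ID": np.repeat(np.arange(n_ids), n_days),
    "date_": np.tile(pd.date_range("2020-01-01", periods=n_days), n_ids),
    "x": np.random.rand(n_ids * n_days),
    "y": np.random.rand(n_ids * n_days),
})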

■ Policy

Run the regression in a for loop over the 3 million IDs and store the results in a list. Finally, combine the list into a data frame.
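
A rough sketch of this policy, continuing from the toy df above (the boolean-mask extraction here is just a naive placeholder; the actual extraction and regression are discussed below):

results = []
for i in df["ID"].unique():
    sub = df[df["ID"] == i]            # pull out one ID's 30 days of data
    # ...fit the regression on sub["x"] / sub["y"] here (details in the Solution section)...
    results.append({"ID": i, "n_days": len(sub)})
result_df = pd.DataFrame(results)      # finally, combine the list of results into a data frame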

■ Environment

- EC2 instance (Ubuntu, r5d.4xlarge)
- JupyterLab 0.35.3

■ Challenges

Extracting each ID with df.loc[] (using ID as the index) and then processing it is slow even with a dask data frame (estimated to take about 2 months).

■ Solution

A table with the following shape

ID  x     y
01  x.xx  y.yy
01  x.xx  y.yy
...
01  x.xx  y.yy
02  x.xx  y.yy
02  x.xx  y.yy
...
02  x.xx  y.yy

Convert to the following form

ID monthly_x monthly_y
01 [x.xx, x.xx ,....] [y.yy, y.yy ,....]
02 [x.xx, x.xx ,....] [y.yy, y.yy ,....]
03 [x.xx, x.xx ,....] [y.yy, y.yy ,....]
04 [x.xx, x.xx ,....] [y.yy, y.yy ,....]
05 [x.xx, x.xx ,....] [y.yy, y.yy ,....]
06 [x.xx, x.xx ,....] [y.yy, y.yy ,....]

The conversion is done in the following two stages:

- **Pivot with pandas.pivot_table (so that a list comprehension can be used).** This step is relatively fast (**about 50 μs** per ID, **about 2.5 min** for 3 million IDs).

code1.py



import pandas as pd

# Assumes a column holding the date (here named "date_") has already been added to the original data frame df
pivot_df = pd.pivot_table(df, index="ID", columns="date_")

With this, the values of x and y are held horizontally for each date.
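
Continuing with the toy df above, a quick way to see what the pivot produced (the columns become a two-level MultiIndex of value name × date, which is what the list comprehension in the next step relies on):

# pivot_df from code1.py: columns are a two-level MultiIndex ("x"/"y", date)
print(pivot_df.columns.nlevels)   # 2
print(pivot_df["x"].shape)        # (number of IDs, number of days) -> the 30 daily x values per row
print(pivot_df["y"].shape)        # same for y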

- **Collect the monthly (30 days of) x and y values for each ID into a list and store them in separate columns (done with a list comprehension).**

code2.py



pivot_df["x"] = [list(pivot_df.x.loc[i,:]) for i in pivot_df.index]
pivot_df["y"] = [list(pivot_df.y.loc[i,:]) for i in pivot_df.index]

This step takes **about 0.2 ms** per ID, **about 10-15 min** for 3 million IDs (**list comprehensions are powerful**).

Regression is then performed on the converted table. For this regression, I use robust regression to account for outliers. (For robust regression, the article linked here is easy to understand.) (I also have a habit of deriving the coefficient of determination R^2 for robust regression, so I plan to summarize that when I have time.)

code3.py


import statsmodels.api as sm

# RLM(endog, exog): regress the objective variable y on the explanatory variable x (with an added constant)
pivot_df["model"] = [sm.RLM(y, sm.add_constant(x)) for x, y in zip(pivot_df.x, pivot_df.y)]
pivot_df["result"] = [m.fit() for m in pivot_df.model]

With the code above it takes **about 8.8 ms** per ID. If you skip saving the "model" column and call fit() in a one-liner, it is **about 7.9 ms** per ID. So far the total is **about 9 ms** per ID.
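
The Purpose section calls for storing the correlation coefficient, slope, and p-value per ID; that extraction step isn't shown in this article, but a minimal sketch along the same lines could look like this (params and pvalues are the standard statsmodels results attributes, with index 0 being the added constant; np.corrcoef is used here just as one way to get the correlation coefficient):

import numpy as np
import pandas as pd

# Hypothetical final step: one row per ID with the three target statistics
output = pd.DataFrame({
    "slope":   [r.params[1] for r in pivot_df.result],     # params[0] is the constant term
    "p_value": [r.pvalues[1] for r in pivot_df.result],
    "corr":    [np.corrcoef(x, y)[0, 1] for x, y in zip(pivot_df.x, pivot_df.y)],
}, index=pivot_df.index)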

In the previous article it took 1.7 s just to extract one ID, so even including the regression the processing time dropped to about 1/200, and compared with the time up to the regression preprocessing it dropped to about **1/10000**.

■ Conclusion

**List comprehensions are the strongest.** Honestly, a naive estimate put the processing time for the 3 million IDs at more than a year, but with a little ingenuity it could be run in about 8 hours.

(I would appreciate it if you could tell me about any other good methods.)

■ Future plans

Regarding the regression, I simply used statsmodels in this article, but (as far as I know) the coefficient of determination for robust regression cannot easily be derived from statsmodels or any other library, so some ingenuity is required. I'm vaguely thinking it would be nice to summarize this while investigating it (~~just thinking about it~~).
