A story about struggling to loop over data for 3 million IDs

This is a continuation of yesterday's article.

■ Purpose

One month's worth of data is prepared for each of 3 million IDs. Each record holds one explanatory variable and one objective variable, so the table has three columns: ID, explanatory variable, and objective variable. The number of records is 3 million × 30 days ≈ 90 million.

For each of the 3 million IDs, I run a simple regression of the objective variable on the explanatory variable over the 30 days, and I want to store the correlation coefficient, slope, and p-value for each ID as the output.
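
As a minimal sketch of the assumed input (the DataFrame name df and the date column name "date_" match the code later in the article; the sizes and random values here are made up purely for illustration):

import numpy as np
import pandas as pd

# Toy stand-in for the real input: one row per ID per day, with columns ID, x, y
# (real data: 3 million IDs x 30 days ≈ 90 million rows)
n_ids, n_days = 1000, 30
df = pd.DataFrame({
    "ID": np.repeat(np.arange(n_ids), n_days),
    "date_": np.tile(pd.date_range("2020-01-01", periods=n_days), n_ids),
    "x": np.random.rand(n_ids * n_days),
    "y": np.random.rand(n_ids * n_days),
})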

■ Policy

Run the regression in a for loop over the 3 million IDs and store the results in a list. Finally, combine the list into a data frame.
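
A rough sketch of this policy, continuing from the toy df above (the boolean-mask extraction here is just a naive placeholder; the actual extraction and regression are discussed below):

results = []
for i in df["ID"].unique():
    sub = df[df["ID"] == i]            # pull out one ID's 30 days of data
    # ...fit the regression on sub["x"] / sub["y"] here (details in the Solution section)...
    results.append({"ID": i, "n_days": len(sub)})
result_df = pd.DataFrame(results)      # finally, combine the list of results into a data frame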

■ Environment

- EC2 instance (Ubuntu, r5d.4xlarge)
- JupyterLab 0.35.3

■ Challenges

Extracting each ID with df.loc[] (using ID as the index) and then processing it is slow even with a dask data frame (estimated to take about 2 months).

■ Solution

A table with the following shape

ID  x     y
01  x.xx  y.yy
01  x.xx  y.yy
...
01  x.xx  y.yy
02  x.xx  y.yy
02  x.xx  y.yy
...
02  x.xx  y.yy

Convert to the following form

ID monthly_x monthly_y
01 [x.xx, x.xx ,....] [y.yy, y.yy ,....]
02 [x.xx, x.xx ,....] [y.yy, y.yy ,....]
03 [x.xx, x.xx ,....] [y.yy, y.yy ,....]
04 [x.xx, x.xx ,....] [y.yy, y.yy ,....]
05 [x.xx, x.xx ,....] [y.yy, y.yy ,....]
06 [x.xx, x.xx ,....] [y.yy, y.yy ,....]

The conversion is done in the following two stages:

- **Pivot with pandas.pivot_table (so that a list comprehension can be used).** This step is relatively fast (**about 50 μs** per ID, **about 2.5 min** for 3 million IDs).

code1.py



import pandas as pd

# Assumes a column holding the date (here named "date_") has already been added to the original data frame df
pivot_df = pd.pivot_table(df, index="ID", columns="date_")

With this, the values of x and y are held horizontally for each date.
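
Continuing with the toy df above, a quick way to see what the pivot produced (the columns become a two-level MultiIndex of value name × date, which is what the list comprehension in the next step relies on):

# pivot_df from code1.py: columns are a two-level MultiIndex ("x"/"y", date)
print(pivot_df.columns.nlevels)   # 2
print(pivot_df["x"].shape)        # (number of IDs, number of days) -> the 30 daily x values per row
print(pivot_df["y"].shape)        # same for y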

- **Collect the monthly (30 days of) x and y values for each ID into a list and store them in separate columns (done with a list comprehension).**

code2.py



pivot_df["x"] = [list(pivot_df.x.loc[i,:]) for i in pivot_df.index]
pivot_df["y"] = [list(pivot_df.y.loc[i,:]) for i in pivot_df.index]

This step takes **about 0.2 ms** per ID, **about 10-15 min** for 3 million IDs (**list comprehensions are powerful**).

Regression is then performed on the converted table. For this regression, I use robust regression to account for outliers. (For robust regression, the article linked here is easy to understand.) (I also have a habit of deriving the coefficient of determination R^2 for robust regression, so I plan to summarize that when I have time.)

code3.py


import statsmodels.api as sm

# RLM(endog, exog): regress the objective variable y on the explanatory variable x (with an added constant)
pivot_df["model"] = [sm.RLM(y, sm.add_constant(x)) for x, y in zip(pivot_df.x, pivot_df.y)]
pivot_df["result"] = [m.fit() for m in pivot_df.model]

With the code above it takes **about 8.8 ms** per ID. If you skip saving the "model" column and call fit() in a one-liner, it is **about 7.9 ms** per ID. So far the total is **about 9 ms** per ID.
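
The Purpose section calls for storing the correlation coefficient, slope, and p-value per ID; that extraction step isn't shown in this article, but a minimal sketch along the same lines could look like this (params and pvalues are the standard statsmodels results attributes, with index 0 being the added constant; np.corrcoef is used here just as one way to get the correlation coefficient):

import numpy as np
import pandas as pd

# Hypothetical final step: one row per ID with the three target statistics
output = pd.DataFrame({
    "slope":   [r.params[1] for r in pivot_df.result],     # params[0] is the constant term
    "p_value": [r.pvalues[1] for r in pivot_df.result],
    "corr":    [np.corrcoef(x, y)[0, 1] for x, y in zip(pivot_df.x, pivot_df.y)],
}, index=pivot_df.index)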

In the previous article it took 1.7 s just to extract one ID, so even including the regression the processing time dropped to about 1/200, and compared with the time up to the regression preprocessing it dropped to about **1/10000**.

■ Conclusion

**List comprehensions are the strongest.** Honestly, a naive estimate put the processing time for the 3 million IDs at more than a year, but with a little ingenuity it could be run in about 8 hours.

(I would appreciate it if you could tell me about any other good methods.)

■ Future plans

Regarding the regression, I simply used statsmodels in this article, but (as far as I know) the coefficient of determination for robust regression cannot easily be derived from statsmodels or any other library, so some ingenuity is required. I'm vaguely thinking it would be nice to summarize this while investigating it (~~just thinking about it~~).
