This is a continuation of yesterday's article.
One month's worth of data is prepared for each of 3 million IDs. Each record holds one explanatory variable and one objective variable, so the table has three columns: ID, explanatory variable, and objective variable. The number of records is 3 million IDs × 30 days ≈ 90 million.
For each of the 3 million IDs, a simple regression of the explanatory variable against the objective variable over the 30 days is performed, and the correlation coefficient, slope, and p-value for each ID are stored as the output.
The regression is run in a for loop over the 3 million IDs, the results are accumulated in lists, and the lists are finally combined into a data frame.
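For concreteness, here is a minimal sketch of how a dummy table of this shape could be generated (the column names `date_`, `x`, `y` and the random values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

n_ids, n_days = 3_000_000, 30

# Dummy data: one row per (ID, day), with explanatory x and objective y
df = pd.DataFrame({
    "ID": np.repeat(np.arange(n_ids), n_days),
    "date_": np.tile(np.arange(1, n_days + 1), n_ids),
    "x": np.random.rand(n_ids * n_days),
    "y": np.random.rand(n_ids * n_days),
})
```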
- EC2 instance: r5d.4xlarge (Ubuntu)
- JupyterLab 0.35.3
Extracting each ID's rows with df.loc[] (with ID as the index) and then processing them one by one is slow even with a Dask DataFrame (estimated to take about 2 months).
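For reference, a minimal sketch of the naive per-ID loop described above; the use of scipy.stats.linregress here is an assumption for illustration (the regression actually used later in this article is robust regression):

```python
import pandas as pd
from scipy.stats import linregress

# Naive approach: index by ID, then extract and regress one ID at a time.
# The per-ID df.loc[] extraction is the bottleneck at 3 million IDs.
df = df.set_index("ID")

results = []
for id_ in df.index.unique():
    sub = df.loc[id_]
    res = linregress(sub["x"], sub["y"])
    results.append((id_, res.rvalue, res.slope, res.pvalue))

result_df = pd.DataFrame(results, columns=["ID", "corr", "slope", "p_value"])
```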
The data is a table with the following shape:
ID | x | y |
---|---|---|
01 | x.xx | y.yy |
01 | x.xx | y.yy |
… | … | … |
01 | x.xx | y.yy |
02 | x.xx | y.yy |
02 | x.xx | y.yy |
… | … | … |
02 | x.xx | y.yy |
It is converted into the following form:
ID | monthly_x | monthly_y |
---|---|---|
01 | [x.xx, x.xx ,....] | [y.yy, y.yy ,....] |
02 | [x.xx, x.xx ,....] | [y.yy, y.yy ,....] |
03 | [x.xx, x.xx ,....] | [y.yy, y.yy ,....] |
04 | [x.xx, x.xx ,....] | [y.yy, y.yy ,....] |
05 | [x.xx, x.xx ,....] | [y.yy, y.yy ,....] |
06 | [x.xx, x.xx ,....] | [y.yy, y.yy ,....] |
The conversion consists of the following two stages:
- **Pivot with pandas.pivot_table (so that a list comprehension can be used).** This step is relatively fast (**about 50 μs** per ID, **about 2.5 min** for 3 million IDs).
code1.py
```python
import pandas as pd

# df already has a column holding the date (here named "date_").
# pivot_table's default aggfunc ("mean") is harmless here because each
# (ID, date_) pair occurs exactly once.
pivot_df = pd.pivot_table(df, index="ID", columns="date_")
```
With this, the x and y values are held horizontally, one column per date, for each ID.
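Concretely, the pivoted frame has a column MultiIndex of (variable, date) pairs, so pivot_df.x selects all 30 daily x columns at once (a sketch, assuming the date_ values 1-30 from the dummy data above):

```python
# Columns are now a MultiIndex such as:
#   [('x', 1), ('x', 2), ..., ('x', 30), ('y', 1), ..., ('y', 30)]
print(pivot_df.columns)

# pivot_df.x is a 3,000,000-row x 30-column sub-frame of daily x values
print(pivot_df.x.shape)
```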
- **Gather each ID's monthly (30-day) x and y values into lists and store them in single columns (done with a list comprehension).**
code2.py
```python
# For each ID, collect the 30 daily values into one list per variable
pivot_df["x"] = [list(pivot_df.x.loc[i, :]) for i in pivot_df.index]
pivot_df["y"] = [list(pivot_df.y.loc[i, :]) for i in pivot_df.index]
```
This step takes **about 0.2 ms** per ID, i.e. **about 10-15 min** for 3 million IDs (list comprehensions are powerful).
Regression is then performed on the converted table. Robust regression is used to account for outliers. (For robust regression, the explanation here is easy to follow.) (Deriving the coefficient of determination R² for robust regression has its quirks, so I plan to summarize it when I have time.)
code3.py
```python
import statsmodels.api as sm

# Robust linear regression (RLM) per ID: the objective variable y is
# regressed on the explanatory variable x (with a constant term added)
pivot_df["model"] = [sm.RLM(y, sm.add_constant(x)) for x, y in zip(pivot_df.x, pivot_df.y)]
pivot_df["result"] = [m.fit() for m in pivot_df.model]
```
With the code above, this takes **about 8.8 ms** per ID. If the "model" is not stored and fit() is chained as a one-liner, it is **about 7.9 ms** per ID. So far, the total comes to **about 9 ms** per ID.
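For reference, a sketch of the one-liner variant mentioned above, followed by one way to extract the outputs listed at the start (slope, p-value, correlation coefficient). The extraction code is an assumption, since the original post does not show it; note that RLM itself does not report a correlation coefficient, so the ordinary Pearson correlation is computed with NumPy:

```python
import numpy as np

# One-liner variant: fit directly without keeping the model objects
pivot_df["result"] = [
    sm.RLM(y, sm.add_constant(x)).fit() for x, y in zip(pivot_df.x, pivot_df.y)
]

# params[1]/pvalues[1] refer to the slope term (index 0 is the intercept
# added by sm.add_constant)
pivot_df["slope"] = [r.params[1] for r in pivot_df.result]
pivot_df["p_value"] = [r.pvalues[1] for r in pivot_df.result]
pivot_df["corr"] = [np.corrcoef(x, y)[0, 1] for x, y in zip(pivot_df.x, pivot_df.y)]
```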
In the previous article, merely extracting a single ID took 1.7 s. Even including the regression, the per-ID processing time is now roughly 1/200 of that, and counting only the preprocessing up to the regression it is roughly **1/10000**.
**List comprehensions are mighty.** Processing each of the 3 million IDs was initially estimated, absurdly, to take more than a year, but with a little ingenuity it could be run in about 8 hours.
(I would appreciate it if anyone could point me to other good methods.)
As for the regression, this article simply used StatsModels, but the coefficient of determination for robust regression cannot be obtained directly from StatsModels or, as far as I know, any other library, so some ingenuity is required. I am vaguely thinking it would be nice to investigate and write this up (~~just thinking about it~~).