The author is currently taking part in AI Quest, an AI human-resources development program sponsored by the Ministry of Economy, Trade and Industry, whose first-term competition ended at 8:00 today. The result, 18th place out of 295 participants, is nothing to be proud of, but the task was very practical, a demand forecast on realistic data, so I am writing down what I learned from the experience.
The competition provided about two years of daily sales data for each product and store of a company, and the task was to predict the following month's sales from it. The data was given in the following form.
Date Store ID Product ID Product price Number of units sold
0 2018-01-01 9 1000001 420 1.0
1 2018-01-01 6 1000001 420 1.0
2 2018-01-01 10 1000001 420 1.0
3 2018-01-01 0 1000017 250 1.0
Since we will handle the data with a pivot later, first parse the date column as datetime type.
df = pd.read_csv("data/sales_history.csv", parse_dates=["date"])
Reading it this way parses the date column as datetime. Displaying df.info() shows that it is indeed of datetime type, as below.
RangeIndex: 1119570 entries, 0 to 1119569
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 1119570 non-null datetime64[ns]
1 Store ID 1119570 non-null int64
2 Product ID 1119570 non-null int64
3 Product price 1119570 non-null int64
4 Number of units sold 1119570 non-null float64
dtypes: datetime64[ns](1),float64(1),int64(3)
memory usage: 42.7 MB
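The parse_dates behavior can be checked with a small in-memory sample. The CSV content below is a hypothetical miniature of the competition data, just to show that the "date" column comes back as datetime64[ns]:

```python
import io
import pandas as pd

# Hypothetical miniature of the sales CSV, to show that parse_dates
# turns the "date" column into datetime64[ns] at read time.
csv_text = """date,Store ID,Product ID,Product price,Number of units sold
2018-01-01,9,1000001,420,1.0
2018-01-01,6,1000001,420,1.0
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
print(df["date"].dtype)  # datetime64[ns]
```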
Next, I want to aggregate this by month, so I will create a column that matches the year and month.
df["Year month"] = df["date"].dt.strftime("%y%m")
Checking with df["Year month"].head() shows that the column is created as follows. Incidentally, %Y gives the 4-digit year, while %y gives only the last 2 digits.
0 1801
1 1801
2 1801
3 1801
4 1801
Name:Year month, dtype: object
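The difference between the two format codes can be seen on a single hypothetical date:

```python
import pandas as pd

# One hypothetical date, to compare the %Y and %y format codes.
d = pd.Series(pd.to_datetime(["2018-01-15"]))

print(d.dt.strftime("%Y%m").iloc[0])  # 201801 (4-digit year)
print(d.dt.strftime("%y%m").iloc[0])  # 1801   (last 2 digits)
```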
Now let's aggregate by month. The number of units sold should simply be summed, but for the price we want a representative value rather than a total (here the median, which is not affected by outliers), so we describe both together in agg.
df = df.groupby(["Year month", "Product ID", "Store ID"]).agg({"Number of units sold":"sum", "Product price":"median"}).reset_index()
The data is now summarized by month as follows. Without .reset_index(), the columns used for grouping would become a MultiIndex (a multi-level hierarchical index), which is awkward to handle, so we flatten it back into ordinary columns.
Year month Product ID Store ID Number of units sold Product price
0 1801 1000001 0 6.0 420
1 1801 1000001 1 2.0 325
2 1801 1000001 2 1.0 420
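The aggregation step can be reproduced on toy data. The values below are hypothetical, chosen so that both the sum and the median actually do something:

```python
import pandas as pd

# Toy daily-level data (hypothetical values): two rows share the same
# month/product/store, so "sum" and "median" each aggregate something.
df = pd.DataFrame({
    "Year month": ["1801", "1801", "1801"],
    "Product ID": [1000001, 1000001, 1000001],
    "Store ID": [0, 0, 1],
    "Number of units sold": [2.0, 4.0, 2.0],
    "Product price": [400, 440, 325],
})

# Sum the units sold, take the median price, then flatten the index.
monthly = (
    df.groupby(["Year month", "Product ID", "Store ID"])
      .agg({"Number of units sold": "sum", "Product price": "median"})
      .reset_index()
)
print(monthly)
```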
Now for the main subject. The catch this time is that a month with no sales is not recorded as 0; the rows are simply absent. If we build a prediction on the data as is, the information that sales were 0 is lost, so we want to fill months with no data with 0. We also want to add lag features (sales of the previous month, the month before that, and so on). A pivot table makes both of these easy.
df_pivot = df.pivot_table(index=["Product ID", "Store ID"], columns="Year month", values="Number of units sold")
Specifying the index, the columns, and the value to aggregate in this way produces a nicely shaped table.
Year month 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810\
Product ID Store ID
1000001 0 6.0 3.0 1.0 2.0 NaN 4.0 3.0 2.0 1.0 NaN
1 2.0 1.0 NaN 2.0 NaN 2.0 1.0 NaN NaN 1.0
2 1.0 NaN 1.0 NaN 1.0 2.0 3.0 4.0 2.0 1.0
As you can see, months with no data appear as missing values, so filling them with 0 expresses zero sales.
df_pivot = df_pivot.fillna(0)
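The pivot-then-fill step can be checked end to end on toy data. In this hypothetical example, store 1 has no row for month 1802, so the pivot produces a NaN there, which fillna(0) turns into zero sales:

```python
import pandas as pd

# Toy monthly data (hypothetical values): store 1 has no row for 1802.
df = pd.DataFrame({
    "Year month": ["1801", "1802", "1801"],
    "Product ID": [1000001, 1000001, 1000001],
    "Store ID": [0, 0, 1],
    "Number of units sold": [6.0, 3.0, 2.0],
})

# Pivot to one row per (product, store), one column per month; the
# missing (store 1, 1802) cell becomes NaN, then 0 after fillna.
df_pivot = df.pivot_table(index=["Product ID", "Store ID"],
                          columns="Year month",
                          values="Number of units sold")
df_pivot = df_pivot.fillna(0)
print(df_pivot)
```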
Next, the lag features. These are written with shift, the same method used to shift time series data by one step, except here we shift along the columns (axis=1).
sold_lag_1 = df_pivot.shift(1, axis=1)
This shifts everything one column to the right, as shown below.
Year month 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810\
Product ID Store ID
1000001 0 NaN 6.0 3.0 1.0 2.0 0.0 4.0 3.0 2.0 1.0
1 NaN 2.0 1.0 0.0 2.0 0.0 2.0 1.0 0.0 0.0
2 NaN 1.0 0.0 1.0 0.0 1.0 2.0 3.0 4.0 2.0
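On a minimal hypothetical frame, the effect of shifting along axis=1 is easy to see: each month's column receives the previous month's value, which is exactly the lag-1 feature.

```python
import pandas as pd

# Hypothetical pivot-style frame: one (product, store) row,
# monthly sales spread across the columns.
df_pivot = pd.DataFrame([[6.0, 3.0, 1.0]],
                        columns=["1801", "1802", "1803"])

# Shift one column to the right: month m now holds the sales of m-1,
# so the first month has no lag value (NaN).
sold_lag_1 = df_pivot.shift(1, axis=1)
print(sold_lag_1)
```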
In the same way, we can add lags from two and three months ago as features.
sold_lag_2 = df_pivot.shift(2, axis=1)
sold_lag_3 = df_pivot.shift(3, axis=1)
Finally, these need to be joined back onto the original data frame. For example, the 0-filled number of units sold for each month is processed as follows.
sold_lag_0 = df_pivot.stack().reset_index().rename(columns={0:"Filled with 0 units sold"})
First, .stack() undoes the pivot and expresses the data as a vertical series with a MultiIndex.
Product ID Store ID Year month
1000001 0 1801 6.0
1802 3.0
1803 1.0
1804 2.0
1806 4.0
Then .reset_index() flattens the MultiIndex, and renaming the columns as needed arranges the data so it can be joined. All that is left is to merge on the shared columns and we are done.
df_merge = pd.merge(df, sold_lag_0, how="right", on=['Year month', 'Store ID', 'Product ID'])
Specifying how="right" here performs a right outer join (a join based on the table on the right side), so every row of the 0-filled table is kept and the operation completes successfully.
Year month Product ID Store ID Number of units sold Product price Filled with 0 units sold
0 1801 1000001 0 6.0 420 6.0
1 1801 1000001 1 2.0 325 2.0
2 1801 1000001 2 1.0 420 1.0
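The whole stack-then-merge round trip can be sketched on toy frames. The values below are hypothetical: the monthly aggregate on the left has only month 1801, while the 0-filled pivot also covers 1802, so the right join brings in a row for the month with no sales:

```python
import pandas as pd

# Toy monthly aggregate (hypothetical values), covering only month 1801.
df = pd.DataFrame({
    "Year month": ["1801"],
    "Product ID": [1000001],
    "Store ID": [0],
    "Number of units sold": [6.0],
    "Product price": [420],
})

# Toy 0-filled pivot covering months 1801 and 1802.
df_pivot = pd.DataFrame(
    [[6.0, 0.0]],
    index=pd.MultiIndex.from_tuples([(1000001, 0)],
                                    names=["Product ID", "Store ID"]),
    columns=pd.Index(["1801", "1802"], name="Year month"),
)

# Stack back to long form, flatten the MultiIndex, rename the value column.
sold_lag_0 = (df_pivot.stack().reset_index()
              .rename(columns={0: "Filled with 0 units sold"}))

# Right outer join: every (month, product, store) row of the 0-filled
# table survives, including the month with no sales at all.
df_merge = pd.merge(df, sold_lag_0, how="right",
                    on=["Year month", "Store ID", "Product ID"])
print(df_merge)
```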
After that, the lag features can be joined in the same way, and missing prices can be filled with df_pivot.interpolate("nearest", limit_direction='both', axis=1), which interpolates each gap from the nearest observed values.
Furthermore, more information can be extracted by applying basic arithmetic operations (differences, ratios, and so on) to the lag features.
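As one hypothetical example of such arithmetic, the month-over-month difference between the current sales and the lag-1 column gives the model a direct growth signal (the column names below are illustrative, not the competition's):

```python
import pandas as pd

# Hypothetical merged frame: current 0-filled sales plus a lag-1 column.
df = pd.DataFrame({
    "Filled with 0 units sold": [6.0, 3.0, 1.0],
    "lag_1": [float("nan"), 6.0, 3.0],
})

# Difference from last month: negative means sales shrank.
df["diff_1"] = df["Filled with 0 units sold"] - df["lag_1"]
print(df["diff_1"].tolist())  # [nan, -3.0, -2.0]
```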
Based on what I learned this time, I will keep working to achieve a better result in the next competition.