If there had been a 27th place in my life, M5 would have saved me. It doesn't count as 27th place, though, because my submission format was wrong ... [Original song: Yorushika / August, Certain, Moonlight](https://music.apple.com/jp/album/%E5%85%AB%E6%9C%88-%E6%9F%90-%E6%9C%88%E6%98%8E%E3%81%8B%E3%82%8A/1455955689?i=1455955692)
We took part in the Kaggle M5 Forecasting - Accuracy competition, held from March to June 2020. I submitted a model equivalent to 27th place (top 0.4%), but I got the submission format wrong and ended up at the very bottom, so this post is a memorial for the solution. If you see me at the next competition, please be gentle with me. Now then, enjoy the story of how we fell into **hell**.
Kaggle M5 Forecasting - Accuracy is a Walmart (supermarket) sales forecasting competition: predict sales (unit sales) for each of 3,049 products over the next 28 days, given 1,913 days of past sales data. The targets are 10 stores across California, Texas, and Wisconsin.
Given data:
- Past sales (by ID, store, item, etc.)
- Price history
- Calendar (holidays, etc.)
The evaluation metric is basically RMSE (strictly speaking the slightly more technical WRMSSE, but I'll omit the details here).
Build one LGBM model per store and per forecast day (see the figure below).
Time-series prediction is said to be a poor fit for machine learning (reference), so at first I tried statistical models and an LSTM. For whatever reason, they didn't work well. So instead I built an LGBM model for each store and each forecast day, so that the dynamics governing each slice of the data would be as uniform as possible.
- Reason for choosing LGBM: I think models built specifically for time-series forecasting, such as statistical models and LSTMs, basically learn large-scale dynamics. But what we want to predict here is each individual product, i.e., small-scale dynamics. For example, even if you can predict the behaviour of Japanese people as a whole, it is hard to predict each individual (I think Japanese people like ramen, but I don't know whether *you* like ramen; I myself like dandan noodles). So I decided to rely on LGBM, which has high expressive power.
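As a concrete picture of this setup, here is a minimal sketch of what "one LGBM per store and per forecast day" could look like. The column names (`store_id`, `item_id`, `demand`), the feature list, and the parameters are placeholders for illustration, not our actual pipeline.

```python
# Minimal sketch of the "one LGBM per store x forecast day" idea.
# Column names and hyperparameters are illustrative assumptions only.
import lightgbm as lgb
import pandas as pd

def train_per_store_per_day(train_df: pd.DataFrame, feature_cols, horizons=range(1, 29)):
    """Train an independent LightGBM regressor for every (store, horizon) pair.

    Assumes rows are sorted by date within each item_id.
    """
    models = {}
    for store in train_df["store_id"].unique():
        store_df = train_df[train_df["store_id"] == store]
        for h in horizons:
            # Target shifted h days into the future: a "direct" (non-recursive) prediction.
            y = store_df.groupby("item_id")["demand"].shift(-h)
            mask = y.notna()
            dtrain = lgb.Dataset(store_df.loc[mask, feature_cols], label=y[mask])
            models[(store, h)] = lgb.train(
                {"objective": "tweedie", "metric": "rmse", "learning_rate": 0.05},
                dtrain,
                num_boost_round=300,
            )
    return models
```

With 10 stores and 28 forecast days, this comes to 280 small models.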
When I visualized each store's sales transitions by referring to an EDA notebook, I noticed that the movements were quite different from store to store.
In addition, when clustering with UMAP, FOODS showed almost the same distribution across stores, while HOBBIES showed clear regional differences. From these results we decided to separate the models, because what sells and what doesn't differs from store to store, and the dynamics governing them differ as well. California, Texas, and Wisconsin, where the stores are located, are geographically far apart, so it seemed reasonable that they would sell differently.
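For reference, here is a rough sketch of the kind of UMAP projection used to compare stores. The aggregation level (weekly sales per store within one department) and the column names are my assumptions, not the exact code we ran.

```python
# Rough sketch of the UMAP check described above.
# Aggregation level and column names are assumptions for illustration.
import umap
import pandas as pd

def embed_stores(sales_long: pd.DataFrame, dept: str) -> pd.DataFrame:
    """Project each store's weekly sales curve for one department into 2D with UMAP."""
    df = sales_long[sales_long["dept_id"] == dept]
    weekly = (
        df.groupby(["store_id", pd.Grouper(key="date", freq="W")])["demand"]
        .sum()
        .unstack(fill_value=0)  # rows: stores, columns: weeks
    )
    reducer = umap.UMAP(n_components=2, n_neighbors=5, random_state=0)
    return pd.DataFrame(reducer.fit_transform(weekly), index=weekly.index, columns=["x", "y"])

# e.g. compare embed_stores(sales, "FOODS_1") with embed_stores(sales, "HOBBIES_1")
```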
Whether a recursive model or a direct day-by-day prediction is better was an important point, as debated in the competition Discussion threads. We simply chose the safer path, the one less likely to fail badly. In fact, when we built a recursive model and looked at permutation importance, lag1 came out completely negative.
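A small sketch of how such a lag-1 check can be done is below. Here `model` is assumed to be an sklearn-style regressor (e.g. `lightgbm.LGBMRegressor`), and the feature name `lag_1` and the validation split are placeholders, not our exact setup.

```python
# Sketch of checking whether lag_1 actually helps a recursive model.
from sklearn.inspection import permutation_importance

def lag1_importance(model, X_valid, y_valid):
    """Mean drop in (negative) RMSE when the lag_1 column is shuffled."""
    result = permutation_importance(
        model, X_valid, y_valid,
        scoring="neg_root_mean_squared_error",
        n_repeats=5, random_state=0,
    )
    idx = list(X_valid.columns).index("lag_1")
    # A negative value means shuffling lag_1 made the score *better*,
    # i.e. the feature is hurting the model.
    return result.importances_mean[idx]
```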
- Basic features (moving averages, max/min over a given period, etc.)
- **Number of days since the product's price rose or fell**
- **The day sales were first recorded** — data from before this day was excluded from training.
- **Ordered TS** — essentially a leak-free target encoding for time-series data.
- **Maximum and minimum sales over a specific period** — added because I wanted to teach the model some notion of time.
- **Ratio of sales values 0 through 10 in past sales** — added to express the sales distribution at that point in time. Also, when sales are 0 the item may simply be out of stock (sales suddenly drop to 0), so I thought statistics around 0 were necessary.
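To make a few of these concrete, here is a hedged sketch covering a subset of the features above. The column names (`id`, `date`, `sell_price`, `demand`) are placeholders, not the exact ones from our pipeline.

```python
# Hedged sketches of some of the features above; column names are placeholders.
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.sort_values(["id", "date"]).copy()
    g = df.groupby("id", group_keys=False)

    # Days elapsed since the product's price last went up or down:
    # mark rows where the price changes, then count rows within each price regime.
    changed = g["sell_price"].transform(lambda s: s.ne(s.shift())).astype(int)
    regime = changed.groupby(df["id"]).cumsum()
    df["days_since_price_change"] = df.groupby([df["id"], regime]).cumcount()

    # "Ordered TS": leak-free target encoding using only sales strictly before each day.
    df["item_te"] = g["demand"].transform(lambda s: s.shift(1).expanding().mean())

    # Share of zero-sales days over the previous 28 days (one slice of the
    # "ratio of 0..10 sales" idea, meant to capture possible out-of-stock periods).
    df["zero_ratio_28"] = g["demand"].transform(
        lambda s: s.shift(1).rolling(28).apply(lambda w: (w == 0).mean())
    )
    return df
```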
- On top of these, we tried to add statistically meaningful features. My teammate's view: if you know what relates to purchasing motivation, you should be able to infer the sales. My view: add time-related features to give LGBM information it cannot know on its own. In practice we tried various things such as denoising, waveform-complexity measures, and features in Fourier space, but few of them turned out to be effective.
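For what it's worth, here is a small sketch of the kind of Fourier-space feature we experimented with (again, it did not end up helping much). The window length is an assumption for illustration.

```python
# Sketch of a Fourier-space feature: strength of the strongest periodic component
# in a trailing window of sales. Window length (90 days) is an assumption.
import numpy as np
import pandas as pd

def dominant_frequency_strength(sales: pd.Series, window: int = 90) -> pd.Series:
    """For each day, magnitude of the strongest non-DC frequency in the previous `window` days."""
    def strongest(w: np.ndarray) -> float:
        spectrum = np.abs(np.fft.rfft(w - w.mean()))
        return float(spectrum[1:].max()) if len(spectrum) > 1 else 0.0
    return sales.shift(1).rolling(window).apply(strongest, raw=True)
```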
For training we used the most recent three months of data. For validation, I was careful to adopt only features that improved CV robustly, not ones that improved it only in places. I did not count on the Darker magic notebook or the public leaderboard, because I suspected overfitting.
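As an illustration of the kind of robustness check this implies, here is a sketch of a simple time-based validation with 28-day windows. The number of folds and the exact fold boundaries are my assumptions, not the scheme we actually used.

```python
# Sketch of a simple time-based validation over the most recent data.
# Fold boundaries are illustrative assumptions.
import pandas as pd

def time_folds(df: pd.DataFrame, n_folds: int = 3, horizon: int = 28):
    """Yield (train_idx, valid_idx) pairs, each validation block being a later 28-day window."""
    last_day = df["date"].max()
    for k in range(n_folds, 0, -1):
        valid_end = last_day - pd.Timedelta(days=(k - 1) * horizon)
        valid_start = valid_end - pd.Timedelta(days=horizon - 1)
        train_idx = df.index[df["date"] < valid_start]
        valid_idx = df.index[(df["date"] >= valid_start) & (df["date"] <= valid_end)]
        yield train_idx, valid_idx
```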
Iwamo 8:49 PM — My brain has stopped working, it's almost funny
Teammate 9:10 PM — Can you double-check it?
Iwamo 9:10 PM — **Looks fine**
It's nothing special, but I'll post the score the original model would have earned.
I feel terrible for my teammates. Checking the submission only at the very end was the mistake ... This was actually my first Kaggle competition after studying for about half a year, and I think I learned a lot. I'm really disappointed, but I'll keep working hard.