Lately, how to prevent LightGBM (LGBM) from overfitting has been a hot topic for me.
In particular, I realized something about how to split time series data into train and validation sets.
Until now, I thought a random split was better even for time series data. My reasoning was simple: if you split at a single date-time threshold, the training data might cover only spring, summer, and autumn, so the model learns nothing about winter and may end up incomplete.
However, it turns out that a random split has its own problem. Depending on the granularity of the datetime, the training data can contain records from the minute immediately before a validation record, which is a form of leakage and makes it extremely easy to overfit. A toy illustration follows.
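To make the leakage concrete, here is a minimal sketch of a random split on minute-level timestamps; the data and names are made up for illustration.

```python
# Toy illustration: a random split scatters adjacent minutes across
# train and valid, so each validation point can have a training point
# from the minute right before it. All data here is hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split

times = pd.Series(pd.date_range("2024-01-01", periods=8, freq="min"))
train, valid = train_test_split(times, test_size=0.25, random_state=0)

# Several validation timestamps sit one minute after a training
# timestamp, which is effectively leakage for slowly changing signals.
print(sorted(train))
print(sorted(valid))
```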
My current best practice is to divide the year into four parts (spring, summer, autumn, winter) and train four models, each using a different season as the validation set. The final prediction is the average of the four models' predictions. A sketch of this scheme is below.
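Here is a minimal sketch of that seasonal 4-fold scheme with LightGBM; the DataFrame layout, column names ("timestamp", "y"), and hyperparameters are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
import lightgbm as lgb

# Month -> season index: 0=spring, 1=summer, 2=autumn, 3=winter
SEASON = {3: 0, 4: 0, 5: 0, 6: 1, 7: 1, 8: 1,
          9: 2, 10: 2, 11: 2, 12: 3, 1: 3, 2: 3}

def seasonal_cv_predict(df, feature_cols, test_df):
    # `df` is assumed to have a datetime column "timestamp" and target "y"
    seasons = df["timestamp"].dt.month.map(SEASON)
    preds = []
    for s in range(4):
        train = df[seasons != s]   # three seasons for training
        valid = df[seasons == s]   # the remaining season for validation
        model = lgb.LGBMRegressor(n_estimators=1000, learning_rate=0.05)
        model.fit(
            train[feature_cols], train["y"],
            eval_set=[(valid[feature_cols], valid["y"])],
            callbacks=[lgb.early_stopping(stopping_rounds=50)],
        )
        preds.append(model.predict(test_df[feature_cols]))
    # Average the four seasonal models' predictions
    return np.mean(preds, axis=0)
```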
====
I wrote this memo about two weeks ago. The following article describes exactly the same idea, so I'm sharing it!
http://tmitani-tky.hatenablog.com/entry/2018/12/19/001304
It seems scikit-learn also provides a splitter that validates the way I was hoping for:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html
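For reference, a quick example of TimeSeriesSplit: in each fold the model trains only on data that comes before the validation block, so nothing from the future leaks into training.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # toy data, already sorted by time
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, valid_idx in tscv.split(X):
    print("train:", train_idx, "valid:", valid_idx)
# train: [0 1 2 3]         valid: [4 5]
# train: [0 1 2 3 4 5]     valid: [6 7]
# train: [0 1 2 3 4 5 6 7] valid: [8 9]
```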