Lately, how to prevent LightGBM (LGBM) from overfitting has been a hot topic for me.
In particular, I realized something about how to split time series data into train and validation sets.
Until now, I thought a random split was better even for time series data. My reasoning was simple: if you split at a single date-time threshold, the training data might cover only spring, summer, and autumn, so the model learns nothing about winter and may end up incomplete.
However, it turns out that a random split has its own problem. Depending on the granularity of the datetime, the training data can contain records from the minute immediately before a validation record, which is a form of leakage and makes it extremely easy to overfit. A toy illustration follows.
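To make the leakage concrete, here is a minimal sketch of a random split on minute-level timestamps; the data and names are made up for illustration.

```python
# Toy illustration: a random split scatters adjacent minutes across
# train and valid, so each validation point can have a training point
# from the minute right before it. All data here is hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split

times = pd.Series(pd.date_range("2024-01-01", periods=8, freq="min"))
train, valid = train_test_split(times, test_size=0.25, random_state=0)

# Several validation timestamps sit one minute after a training
# timestamp, which is effectively leakage for slowly changing signals.
print(sorted(train))
print(sorted(valid))
```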
My current best practice is to divide the year into four parts (spring, summer, autumn, winter) and train four models, each using a different season as the validation set. The final prediction is the average of the four models' predictions. A sketch of this scheme is below.
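Here is a minimal sketch of that seasonal 4-fold scheme with LightGBM; the DataFrame layout, column names ("timestamp", "y"), and hyperparameters are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
import lightgbm as lgb

# Month -> season index: 0=spring, 1=summer, 2=autumn, 3=winter
SEASON = {3: 0, 4: 0, 5: 0, 6: 1, 7: 1, 8: 1,
          9: 2, 10: 2, 11: 2, 12: 3, 1: 3, 2: 3}

def seasonal_cv_predict(df, feature_cols, test_df):
    # `df` is assumed to have a datetime column "timestamp" and target "y"
    seasons = df["timestamp"].dt.month.map(SEASON)
    preds = []
    for s in range(4):
        train = df[seasons != s]   # three seasons for training
        valid = df[seasons == s]   # the remaining season for validation
        model = lgb.LGBMRegressor(n_estimators=1000, learning_rate=0.05)
        model.fit(
            train[feature_cols], train["y"],
            eval_set=[(valid[feature_cols], valid["y"])],
            callbacks=[lgb.early_stopping(stopping_rounds=50)],
        )
        preds.append(model.predict(test_df[feature_cols]))
    # Average the four seasonal models' predictions
    return np.mean(preds, axis=0)
```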
====
I wrote this memo about two weeks ago. The following article describes exactly the same idea, so I'm sharing it!
http://tmitani-tky.hatenablog.com/entry/2018/12/19/001304
It seems scikit-learn also provides a splitter that validates the way I was hoping for:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html
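For reference, a quick example of TimeSeriesSplit: in each fold the model trains only on data that comes before the validation block, so nothing from the future leaks into training.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # toy data, already sorted by time
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, valid_idx in tscv.split(X):
    print("train:", train_idx, "valid:", valid_idx)
# train: [0 1 2 3]         valid: [4 5]
# train: [0 1 2 3 4 5]     valid: [6 7]
# train: [0 1 2 3 4 5 6 7] valid: [8 9]
```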