I have previously written up methods for time series analysis and regression models, so please refer to those posts as well if you are interested.
The Python code is below.
#Import required libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from matplotlib import pylab as plt
%matplotlib inline
#Statistical model
import statsmodels.api as sm
# GBDT
from sklearn.ensemble import GradientBoostingRegressor
#Make the graph landscape
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 15, 6
# https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/AirPassengers.html
df = pd.read_csv('AirPassengers.csv')
#Convert to float type
df['#Passengers'] = df['#Passengers'].astype('float64')
df = df.rename(columns={'#Passengers': 'Passengers'})
#Make it a datetime type and index it
df.Month = pd.to_datetime(df.Month)
df = df.set_index("Month")
#Check the contents of the data
df.head()
Next, create a correlogram.
#Autocorrelation graph
fig, ax = plt.subplots(figsize=(12, 8))
sm.graphics.tsa.plot_acf(df["Passengers"], lags=30, ax=ax)
#Visualize partial autocorrelation
fig, ax = plt.subplots(figsize=(12, 8))
sm.graphics.tsa.plot_pacf(df["Passengers"], lags=20, ax=ax)
Looking at the partial autocorrelation plot for this data, you can see that there is a correlation at every 12-month lag. In other words, the series has seasonal periodic fluctuations.
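To back up this reading of the plot numerically, the partial autocorrelation coefficients can also be computed directly. This is an optional check I am adding here, using statsmodels' pacf function on the same series.
#Optional check: the partial autocorrelation coefficients behind the plot above
from statsmodels.tsa.stattools import pacf
pacf_values = pacf(df["Passengers"], nlags=20)
print(pacf_values[12])  #coefficient at the 12-month lag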
Next, create lag features for the past 12 months.
for i in range(1, 13):
    df['shift%s' % i] = df['Passengers'].shift(i)
pd.concat([df.head(13), df.tail(3)], axis=0, sort=False)
Next, create a first-order difference column, which is often used for time series data.
df['deriv1'] = df['shift1'].diff(1)
df[['Passengers', 'deriv1']].head()
Next, create a second-order difference column by taking the difference twice.
df['deriv2'] = df['shift1'].diff(1).diff(1)
df[['Passengers', 'deriv2']].head()
Finally, add rolling statistics over the past 12 months to the explanatory variables as well.
df['mean'] = df['shift1'].rolling(12).mean()
df['median'] = df['shift1'].rolling(12).median()
df['max'] = df['shift1'].rolling(12).max()
df['min'] = df['shift1'].rolling(12).min()
df[['Passengers', 'mean', 'median', 'max', 'min']][12:24]
From here, we will make predictions with GBDT (gradient boosted decision trees).
#Delete missing value data
df = df.dropna()
df.head()
x = df.drop('Passengers', axis=1)
y = df['Passengers']
#Create training data and evaluation data (keep chronological order so the test set is the most recent 20%)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, shuffle=False)
#Standardize data
sc = StandardScaler()
sc.fit(x_train) #Standardized with training data
x_train_std = sc.transform(x_train)
x_test_std = sc.transform(x_test)
#Model learning
GBDT = GradientBoostingRegressor()
GBDT.fit(x_train_std, y_train)
#Forecast
y_pred = GBDT.predict(x_test_std)
#Pad the training period with missing values so the predictions line up with the tail of the time index
y_ = np.concatenate([np.full(len(y_train), np.nan), y_pred])
y_ = pd.DataFrame(y_, index=df.index)
plt.figure(figsize=(10,5))
plt.plot(y, label='original')
plt.plot(y_, '--', label='predict')
plt.legend()
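The plot gives only a visual impression, so as an optional extra (not part of the original walkthrough) the held-out portion can also be scored with standard scikit-learn metrics:
#Optional: score the forecast on the held-out data
from sklearn.metrics import mean_absolute_error, mean_squared_error
print('MAE : %.2f' % mean_absolute_error(y_test, y_pred))
print('RMSE: %.2f' % np.sqrt(mean_squared_error(y_test, y_pred)))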
Thank you for reading to the end. In this post, I tried forecasting time series data with a regression model. When using a regression model this way, feature creation and selection are important; one way to check which features matter is sketched below.
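As a sketch of that idea (an optional addition on my part), the fitted model's feature_importances_ attribute shows how much each of the created explanatory variables contributed:
#Optional: inspect which explanatory variables the trained GBDT relies on
importances = pd.Series(GBDT.feature_importances_, index=x.columns)
importances.sort_values(ascending=False).plot.bar()
plt.title('GBDT feature importances')
plt.show()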
If you notice anything that should be corrected, I would appreciate it if you could let me know.