I have previously written up methods for time series analysis and regression models, so please refer to those posts as well if you are interested.
The Python code is below.
#Import required libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from matplotlib import pylab as plt
%matplotlib inline
#Statistical model
import statsmodels.api as sm
# GBDT
from sklearn.ensemble import GradientBoostingRegressor
#Make the graph landscape
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 15, 6
# https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/AirPassengers.html
df = pd.read_csv('AirPassengers.csv')
#Convert to float type
df['#Passengers'] = df['#Passengers'].astype('float64')
df = df.rename(columns={'#Passengers': 'Passengers'})
#Make it a datetime type and index it
df.Month = pd.to_datetime(df.Month)
df = df.set_index("Month")
#Check the contents of the data
df.head()
Next, create a correlogram.
#Autocorrelation graph
fig, ax = plt.subplots(figsize=(12, 8))
sm.graphics.tsa.plot_acf(df["Passengers"], lags=30, ax=ax)
#Visualize partial autocorrelation
fig, ax = plt.subplots(figsize=(12, 8))
sm.graphics.tsa.plot_pacf(df["Passengers"], lags=20, ax=ax)
Looking at the partial autocorrelation plot for this data, you can see that there is a correlation at every 12-month lag. In other words, the series has seasonal periodic fluctuations.
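To back up this reading of the plot numerically, the partial autocorrelation coefficients can also be computed directly. This is an optional check I am adding here, using statsmodels' pacf function on the same series.
#Optional check: the partial autocorrelation coefficients behind the plot above
from statsmodels.tsa.stattools import pacf
pacf_values = pacf(df["Passengers"], nlags=20)
print(pacf_values[12])  #coefficient at the 12-month lag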
Next, create lag features for the past 12 months.
for i in range(1, 13):
    df['shift%s' % i] = df['Passengers'].shift(i)
pd.concat([df.head(13), df.tail(3)], axis=0, sort=False)
Next, create a first-order difference column, which is often used for time series data.
df['deriv1'] = df['shift1'].diff(1)
df[['Passengers', 'deriv1']].head()
Next, create a second-order difference column by taking the difference twice.
df['deriv2'] = df['shift1'].diff(1).diff(1)
df[['Passengers', 'deriv2']].head()
Finally, add rolling statistics over the past 12 months to the explanatory variables as well.
df['mean'] = df['shift1'].rolling(12).mean()
df['median'] = df['shift1'].rolling(12).median()
df['max'] = df['shift1'].rolling(12).max()
df['min'] = df['shift1'].rolling(12).min()
df[['Passengers', 'mean', 'median', 'max', 'min']][12:24]
From here, we will make predictions with GBDT (gradient boosted decision trees).
#Delete missing value data
df = df.dropna()
df.head()
x = df.drop('Passengers', axis=1)
y = df['Passengers']
#Create training data and evaluation data (keep chronological order so the test set is the most recent 20%)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, shuffle=False)
#Standardize data
sc = StandardScaler()
sc.fit(x_train) #Standardized with training data
x_train_std = sc.transform(x_train)
x_test_std = sc.transform(x_test)
#Model learning
GBDT = GradientBoostingRegressor()
GBDT.fit(x_train_std, y_train)
#Forecast
y_pred = GBDT.predict(x_test_std)
#Pad the training period with missing values so the predictions line up with the tail of the time index
y_ = np.concatenate([np.full(len(y_train), np.nan), y_pred])
y_ = pd.DataFrame(y_, index=df.index)
plt.figure(figsize=(10,5))
plt.plot(y, label='original')
plt.plot(y_, '--', label='predict')
plt.legend()
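The plot gives only a visual impression, so as an optional extra (not part of the original walkthrough) the held-out portion can also be scored with standard scikit-learn metrics:
#Optional: score the forecast on the held-out data
from sklearn.metrics import mean_absolute_error, mean_squared_error
print('MAE : %.2f' % mean_absolute_error(y_test, y_pred))
print('RMSE: %.2f' % np.sqrt(mean_squared_error(y_test, y_pred)))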
Thank you for reading to the end. In this post, I tried forecasting time series data with a regression model. When using a regression model this way, feature creation and selection are important; one way to check which features matter is sketched below.
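As a sketch of that idea (an optional addition on my part), the fitted model's feature_importances_ attribute shows how much each of the created explanatory variables contributed:
#Optional: inspect which explanatory variables the trained GBDT relies on
importances = pd.Series(GBDT.feature_importances_, index=x.columns)
importances.sort_values(ascending=False).plot.bar()
plt.title('GBDT feature importances')
plt.show()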
If you notice anything that should be corrected, I would appreciate it if you could let me know.