The ARIMA model is a model that is influenced by previous values With an AR model AR (p) that correlates with the last p values A MA model MA (q) that is affected by the previous q values in a model affected by the previous error was synthesized. ARMA (p, q) was adapted to the difference series before the d point.
And What is the SARIMA model? ARIMA model for time series data with seasonal cycle It is a model that can be expanded.
The SARIMA model has (p, d, q) parameters in addition to It also has the parameters (sp, sd, sq, s).
sp :Seasonal autocorrelation
sd :Derivation of seasonality
sq :Seasonal moving average
It's called.
By the way, ARIMA(p,d,q)Of the model
・ P:This is called autocorrelation, and whether the model is predicted using the last p values.
・ D:It is called induction, and the fact that the d-th order difference was necessary to make the time series data steady.
・ Q:The moving average is that the model is affected by the last q values.
It represented.
The basic meanings of sp, sd, and sq do not change. However For sp, sd, sq, the current data is influenced by historical data that has passed one or more seasonal periods.
For example, in the case of data with seasonal fluctuations of 12-month cycle, the parameter of s represents the cycle. It becomes s = 12.
If sq = 1, data for exactly 12 months (1 cycle ago) If sq = 2, it means that the model is affected by the data 12 months ago and the data 24 months ago.
If it's hard to understand, simply replace q with sq.
SARIMA model parameters, (p, d, q) (sp, sd, sq, s) for Python There is no function that automatically makes it the most appropriate.
Therefore, the information criterion(BIC in this case(Bayesian Information Criterion) )By
You have to write a program to find out which value is most appropriate.
I won't go into the information criterion this time, For BIC, understand that the lower this value, the more appropriate the parameter value.
#Data in chronological order:DATA,Parameters s(period):Enter s to output the best parameters and their BIC.
def selectparameter(DATA,s):
p = d = q = range(0, 1)
pdq = list(itertools.product(p, d, q))
seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]
parameters = []
BICs = np.array([])
for param in pdq:
for param_seasonal in seasonal_pdq:
try:
mod = sm.tsa.statespace.SARIMAX(DATA,
order=param,
seasonal_order=param_seasonal)
results = mod.fit()
parameters.append([param, param_seasonal, results.bic])
BICs = np.append(y,results.bic)
except:
continue
return print(parameters[np.argmin(BICs)])
To elaborate on the contents of the function, each parameter is For 0,1 (this time, the upper limit of the parameter is set to 1 for simplicity) The BIC of the SARIMA model is calculated and the case where the BIC is the smallest is displayed.
However, regarding the parameter s, time series data and Let's examine it by visualizing the partial autocorrelation described below.
import warnings
import itertools
import pandas as pd
import numpy as np
import statsmodels.api as sm
#Read and organize data
sales_sparkling = pd.read_csv("./5060_tsa_data/monthly-australian-wine-sales-th-sparkling.csv")
index = pd.date_range("1980-01-31","1995-07-31",freq="M")
sales_sparkling.index=index
del sales_sparkling["Month"]
#The data is for one year
sales_sparkling = sales_sparkling[:12]
#Function definition
def selectparameter(DATA,s):
p = d = q = range(0, 2)
pdq = list(itertools.product(p, d, q))
seasonal_pdq = [(x[0], x[1], x[2], s) for x in list(itertools.product(p, d, q))]
parameters = []
BICs = np.array([])
for param in pdq:
for param_seasonal in seasonal_pdq:
try:
mod = sm.tsa.statespace.SARIMAX(DATA,
order=param,
seasonal_order=param_seasonal)
results = mod.fit()
parameters.append([param, param_seasonal, results.bic])
BICs = np.append(BICs,results.bic)
except:
continue
return parameters[np.argmin(BICs)]
#Put the cycle in the second argument
selectparameter(sales_sparkling,12)
Autocorrelation was the correlation with one's own past data. Another way to analyze time series data
The value of partial autocorrelation is important.
Speaking of k-th order autocorrelation, it represents the correlation between yt and yt−k. The k-order partial autocorrelation is the data between yt and yt−k, that is, The correlation is obtained by removing the influence of yt−k + 1 from yt−1.
I will explain in detail. Suppose you find the autocorrelation coefficient for a 7-day difference. However, if the data of one day correlates with the data of the previous day, 7 days ago → 6 days ago → 5 days ago ... It may be correlated through the data from 1 day ago to today.
The correlation obtained by removing the influence during this period is called partial autocorrelation.
Let's visualize the autocorrelation coefficient and the partial autocorrelation coefficient using Python. Generally, if the monthly data is seasonal, the cycle will be 12.
#The graph of autocorrelation coefficient is
sm.graphics.tsa.plot_acf(DATA)
#The graph of partial autocorrelation is
sm.graphics.tsa.plot_pacf(DATA)
#You can output with.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from pandas import datetime
#Read and organize data
sales_sparkling = pd.read_csv("./5060_tsa_data/monthly-australian-wine-sales-th-sparkling.csv")
index = pd.date_range("1980-01-31", "1995-07-31", freq="M")
sales_sparkling.index=index
del sales_sparkling["Month"]
#Visualization of autocorrelation / partial autocorrelation
fig=plt.figure(figsize=(12, 8))
#Outputs a graph of autocorrelation coefficient
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(sales_sparkling, lags=30, ax=ax1)
#Outputs a graph of partial autocorrelation coefficient
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(sales_sparkling, lags=30, ax=ax2)
plt.show()
Here, I will summarize the procedure for analyzing time series data using the SARIMA model.
Read data
Data organization
Data visualization
Understanding the data cycle (determining the parameter s)
Determining parameters other than s
Build a model
Forecasting with data and its visualization
It is done in the flow.
Finally, in this section, you will learn from model building to prediction. Once you know the optimal (p, d, q), (sp, sd, sq, s), it's time to build the model. This time, we are using data for 15 years (180 months).
#To build a model
sm.tsa.statespace.SARIMAX(DATA,order=(p, d, q),seasonal_order=(sp, sd, sq, s)).fit()
#Is used.
Let's build a SARIMA model of sparkling wine sales data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from pandas import datetime
#Read and organize data
sales_sparkling = pd.read_csv("./5060_tsa_data/monthly-australian-wine-sales-th-sparkling.csv")
index = pd.date_range("1980-01-31", "1995-07-31", freq="M")
sales_sparkling.index=index
del sales_sparkling["Month"]
#Model fit
SARIMA_sparkring_sales = sm.tsa.statespace.SARIMAX(sales_sparkling,order=(0, 0, 0),seasonal_order=(0, 1, 1, 12)).fit()
#Output the BIC of the constructed SARIMA model
print(SARIMA_sparkring_sales.bic)
Getting predictive data for the model is easy
Model name.predict("At the start of forecast","At the end of the forecast")Is used.
However, the "start of forecast" must be the time in the original time series data. For example, the sales data for the sparkling wine used this time must be before 1995-07-31.
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from pandas import datetime
#Read and organize data
sales_sparkling = pd.read_csv("./5060_tsa_data/monthly-australian-wine-sales-th-sparkling.csv")
print(sales_sparkling.head())
print(sales_sparkling.tail())
index = pd.date_range("1980-01-31", "1995-07-31", freq="M")
sales_sparkling.index=index
del sales_sparkling["Month"]
#Model fit
SARIMA_sparkring_sales = sm.tsa.statespace.SARIMAX(sales_sparkling,order=(0, 0, 0),seasonal_order=(0, 1, 1, 12)).fit()
#Substitute prediction data for pred
pred = SARIMA_sparkring_sales.predict("1994-07-31","1997-12-31")
#Visualization of pred data
plt.plot(pred)
plt.show()
Next, instead of outputting only forecast data Let's compare the original time series data and forecast data by outputting them at the same time.
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from pandas import datetime
import numpy as np
#Read and organize data
sales_sparkling = pd.read_csv("./5060_tsa_data/monthly-australian-wine-sales-th-sparkling.csv")
index = pd.date_range("1980-01-31", "1995-07-31", freq="M")
sales_sparkling.index=index
del sales_sparkling["Month"]
#Model fit
SARIMA_sparkring_sales = sm.tsa.statespace.SARIMAX(sales_sparkling,order=(0, 0, 0),seasonal_order=(0, 1, 1, 12)).fit()
#Substitute prediction data for pred
pred = SARIMA_sparkring_sales.predict("1994-7-31", "1997-12-31")
#Visualization of pred data and original time series data
plt.plot(sales_sparkling)
plt.plot(pred, color="r")
plt.show()