Time series analysis deals with time series data: data that changes over time. Values recorded at successive points in time, such as daily temperatures, stock prices, or monthly sales, can all be said to be time series data.
Time series analysis is used to forecast company sales, product sales, the number of visitors, and more, making it a very important analysis technique in business.
In this section we will learn how to analyze time series data using Python's StatsModels, finishing with the SARIMA model. The ultimate goal looks like this:
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
# Read and organize the data
sales_sparkling = pd.read_csv("./5060_tsa_data/monthly-australian-wine-sales-th-sparkling.csv")
index = pd.date_range("1980-01-31", "1995-07-31", freq="M")
sales_sparkling.index = index
del sales_sparkling["Month"]
# Fit the SARIMA model
SARIMA_sparkling_sales = sm.tsa.statespace.SARIMAX(
    sales_sparkling, order=(0, 0, 0), seasonal_order=(0, 1, 1, 12)).fit()
# Store the forecast in pred
pred = SARIMA_sparkling_sales.predict("1994-7-31", "1997-12-31")
# Visualize pred together with the original time series
plt.plot(sales_sparkling)
plt.plot(pred, color="r")
plt.show()
The first step in time series analysis is to visualize the data; plotting the series reveals many of its characteristics. Using Python's matplotlib, let's draw a line graph of the carbon dioxide concentration measured at the Mauna Loa Observatory in Hawaii.
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
# Load the data (using the StatsModels sample dataset)
co2_tsdata = sm.datasets.co2.load_pandas().data
# Handle missing values by forward-filling
co2_tsdata2 = co2_tsdata.ffill()
# Set the graph title
plt.title("Mauna Loa Weekly Atmospheric CO2 Data")
# Label the x-axis and y-axis
plt.xlabel("date")
plt.ylabel("CO2 Concentration ppmv")
# Plot the series, limiting the x-axis to 1995-2000 and the y-axis to 355-375
plt.plot(co2_tsdata2)
plt.xlim("1995", "2000")
plt.ylim(355, 375)
plt.show()
Time series data exhibits three patterns: (1) trend, (2) periodic fluctuation, and (3) irregular fluctuation.
A trend represents the long-term tendency of the data. Time series data whose values rise or fall over time is said to have a trend: if the values are rising, the trend is positive; if they are falling, the trend is negative.
In data with periodic fluctuation, the values rise and fall repeatedly over time. In particular, periodic fluctuation with a one-year cycle is called seasonal fluctuation.
Irregular fluctuation means that the values fluctuate independently of the passage of time.
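To make these three patterns concrete, here is a small illustrative sketch (synthetic data, not from the course dataset) that builds a series by adding a trend, a yearly seasonal component, and irregular noise:
import numpy as np
import matplotlib.pyplot as plt

# Synthetic illustration (not course data): combine the three patterns
t = np.arange(120)                           # 120 monthly time steps
trend = 0.5 * t                              # (1) positive trend
seasonal = 10 * np.sin(2 * np.pi * t / 12)   # (2) yearly periodic fluctuation
irregular = np.random.normal(0, 2, len(t))   # (3) irregular fluctuation
series = trend + seasonal + irregular

plt.plot(series)
plt.title("trend + seasonal + irregular")
plt.show()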
Modeling time series data means expressing the data in some mathematical form. The purpose of time series analysis is to build a model that can explain the various characteristics of the data, then use that model to make predictions and to analyze the relationships within the data. When we observe actual time series data, we find these three patterns combined.
The raw time series data, before any processing, is called the original series. The aim of time series analysis is to explore the nature of this original series: we clarify its various characteristics, build a model, and then use that model to predict the data and clarify relationships within it. However, time series analysis rarely works with the original series directly. In practice, we transform the time series into new series and analyze those to build a model.
From here on, we will actually create such series by processing the data.
Many time series have large fluctuations in value. A logarithmic transformation moderates such fluctuations. Let's actually log-transform some data. For the transformation we use NumPy's np.log(), which takes a single argument, such as a pandas DataFrame or Series: np.log(data).
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import numpy as np
# Load the data (using the StatsModels sample dataset)
macrodata = sm.datasets.macrodata.load_pandas().data
macrodata.index = pd.Index(sm.tsa.datetools.dates_from_range("1959q1", "2009q3"))
# Display US real GDP values before the log transformation
print(macrodata.realgdp.head())
# Log-transform the original series to obtain the logarithmic series
macrodata_realgdp_log = np.log(macrodata.realgdp)
# Display the values after the log transformation
print(macrodata_realgdp_log.head())
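Incidentally, the transformation is reversible: NumPy's np.exp() maps a log-transformed value back to the original scale. A quick sanity check:
import numpy as np

# exp() undoes log(), returning the original value
x = 2500.0
print(np.exp(np.log(x)))  # -> 2500.0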
When analyzing time series data, we often work with the difference between each value and the value at the previous time point. This operation is called taking the difference, and the resulting series is called the difference series. This transformation removes the trend of the original series (the trend being whether, viewed broadly, the series is rising, roughly level, or falling). As explained later, removing the trend makes differencing an important step toward turning the original series into a stationary process, one whose values, taken as a whole, do not change in character over time.
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import numpy as np
# Load the data (using the StatsModels sample dataset)
co2_tsdata = sm.datasets.co2.load_pandas().data
# Handle missing values by forward-filling
co2_tsdata2 = co2_tsdata.ffill()
# Take the first difference of the data
co2_tsdata2_diff = co2_tsdata2.diff()
plt.subplot(2, 1, 1)
plt.title("Mauna Loa Weekly Atmospheric CO2 Data")
plt.xlabel("date")
plt.ylabel("CO2 Concentration ppmv")
plt.plot(co2_tsdata2)
plt.subplot(2, 1, 2)
plt.title("Mauna Loa Weekly Atmospheric CO2 Data DIFF")
plt.xlabel("date")
plt.ylabel("CO2 Concentration ppmv DIFF")
plt.plot(co2_tsdata2_diff)
plt.subplots_adjust(wspace=0, hspace=1.0)
plt.show()
Let's look again at the carbon dioxide concentration data from the Mauna Loa Observatory. As noted earlier, periodic fluctuation with a one-year cycle is called seasonal fluctuation. In this line graph, however, the seasonal pattern gets in the way and makes the trend of the time series hard to see. To find the underlying trend in such seasonal data, the seasonal fluctuation is often removed from the original series. The resulting series, with the seasonal fluctuation removed, is called a seasonally adjusted series.
Using StatsModels' sm.tsa.seasonal_decompose(), the original series can be decomposed into a trend, seasonal fluctuation, and irregular fluctuation (residual). In the resulting plot:
Observed (first panel) is the original series
Trend (second) is the trend component
Seasonal (third) is the seasonal component
Residual (fourth) is the residual
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import numpy as np
# Load the data (using the StatsModels sample dataset)
co2_tsdata = sm.datasets.co2.load_pandas().data
# Handle missing values by forward-filling
co2_tsdata2 = co2_tsdata.ffill()
# Decompose the original series into trend, seasonal fluctuation, and residual, and plot
fig = sm.tsa.seasonal_decompose(co2_tsdata2, period=52).plot()
plt.show()
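Beyond plotting, the DecomposeResult object returned by sm.tsa.seasonal_decompose() exposes each component as an attribute, so the pieces can be used directly in further analysis. A short sketch, using the same data as above:
import statsmodels.api as sm

# Load and forward-fill the CO2 sample data, as above
co2_tsdata2 = sm.datasets.co2.load_pandas().data.ffill()

# Each component of the decomposition is an attribute of the result
res = sm.tsa.seasonal_decompose(co2_tsdata2, period=52)
print(res.trend.dropna().head())  # trend component (NaN at the edges)
print(res.seasonal.head())        # seasonal component
print(res.resid.dropna().head())  # residual component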
From here on we will use a little math. Let's focus on what each value means. In time series analysis, data is also described using basic statistics.
Suppose we observe data y_1, y_2, ..., y_T, one value per time point. The value for, say, the third day is written y_3.
The most basic statistic is the expected value, or mean, written E(y_t) = μ, which gives the average value of the time series. Here, E stands for the expected value.
Let's find the mean of the data.
The mean can be computed with np.mean().
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import numpy as np
# Load the data (using the StatsModels sample dataset)
co2_tsdata = sm.datasets.co2.load_pandas().data
# Handle missing values by forward-filling
co2_tsdata2 = co2_tsdata.ffill()
# Compute and display the mean of the data
print(np.mean(co2_tsdata2))
One statistic that shows how much time series data deviates from its expected value is the variance. Using the expected value, the variance is written Var(y_t) = E[(y_t − μ)^2]. The square root of the variance is called the standard deviation, written σ = √Var(y_t). In finance in particular, this standard deviation is called volatility, and it is an important measure of risk.
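As a quick illustration, NumPy's np.var() and np.std() compute the variance and the standard deviation directly; here we reuse the same CO2 sample data:
import numpy as np
import statsmodels.api as sm

# Load and forward-fill the CO2 sample data, as above
co2_tsdata2 = sm.datasets.co2.load_pandas().data.ffill()

print(np.var(co2_tsdata2))  # variance of the series
print(np.std(co2_tsdata2))  # standard deviation (volatility)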
Next, let's introduce statistics peculiar to time series data, starting with the autocovariance. Autocovariance is the covariance between different time points of the same time series. The autocovariance between points separated by k steps is called the k-th order autocovariance and is written γ_k = Cov(y_t, y_{t−k}) = E[(y_t − μ)(y_{t−k} − μ)]. Viewed as a function of k, it is called the autocovariance function.
To compare autocovariances between series of different scales, we normalize them; the normalized value is called the autocorrelation coefficient, written ρ_k = γ_k / γ_0. Like the autocovariance, the autocorrelation coefficient viewed as a function of k is called the autocorrelation function, and a graph of this autocorrelation function is called a correlogram.
Put simply, the autocorrelation coefficient expresses how similar the series is to its own past values.
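To ground these definitions, here is a minimal sketch, using a made-up toy array, that computes the k-th order autocovariance and the corresponding autocorrelation coefficient by hand:
import numpy as np

y = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 6.0])  # toy series
mu = np.mean(y)
n = len(y)

def autocov(y, k):
    # gamma_k = (1 / n) * sum of (y_t - mu) * (y_{t-k} - mu)
    return np.sum((y[k:] - mu) * (y[:n - k] - mu)) / n

gamma0 = autocov(y, 0)  # 0th-order autocovariance = variance
gamma1 = autocov(y, 1)  # 1st-order autocovariance
rho1 = gamma1 / gamma0  # 1st-order autocorrelation coefficient
print(gamma0, gamma1, rho1)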
Now let's compute and visualize the autocorrelation coefficient introduced above. The autocorrelation function (ACF) is computed with sm.tsa.stattools.acf(), where the first argument is the data and the second argument is nlags (optional, default 40). The correlogram is drawn with sm.graphics.tsa.plot_acf(), where the first argument is the data and the second argument is lags.
With daily data, for example, shifting the series by a lag of one lets you check the autocorrelation between consecutive days, i.e. how yesterday's value affects today's. The number of steps by which the data is shifted is called the lag. The autocorrelation coefficient at lag 0 is always 1: with no shift, each value is compared with itself, so the correlation is perfect. This serves as the reference point for judging how strong the other autocorrelations are.
As an example, let's compute the autocorrelation coefficients of the Mauna Loa carbon dioxide concentration data:
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import numpy as np
# Load the data (using the StatsModels sample dataset)
co2_tsdata = sm.datasets.co2.load_pandas().data
# Handle missing values by forward-filling
co2_tsdata2 = co2_tsdata.ffill()
# Compute the autocorrelation coefficients of the data
co2_tsdata2_acf = sm.tsa.stattools.acf(co2_tsdata2, nlags=40)
print(co2_tsdata2_acf)
# Draw the correlogram (graph of the ACF)
sm.graphics.tsa.plot_acf(co2_tsdata2, lags=40)
plt.show()