To analyze time series data, some preprocessing of the given data is needed.
We will learn how to handle time series data provided in CSV format.
First, let's read the data with pandas and display it. To load a CSV file with pandas, use pd.read_csv() with the argument
filepath_or_buffer="the path or URL of the file to read".
# To check the beginning of the loaded data
df.head(number of rows)
# To check the tail
df.tail(number of rows)
# With no argument, head() and tail() return 5 rows from the beginning and the end, respectively.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
# Read and organize the data
sales_sparkling = pd.read_csv("./5060_tsa_data/monthly-australian-wine-sales-th-sparkling.csv")
# Display the first 5 rows
print(sales_sparkling.head(5))
# Display the last 5 rows
print(sales_sparkling.tail(5))
When analyzing time series data, the data becomes easier to handle if the time information (the Month column in the previous example) is made the pandas index (the 0, 1, 2, ... shown on the far left in the previous example).
This processing follows these steps:
1. Create the index information with pd.date_range("start", "end", freq="interval").
2. Assign that information to the index of the original data.
3. Delete the "Month" column from the original data.
For example, to cover the period 2017/1/1 to 2018/1/1 at daily intervals, pass arguments like pd.date_range("2017-01-01", "2018-01-01", freq="D"), as in the sketch below.
Give freq the abbreviation of the interval you want (second ➡ S, minute ➡ min, hour ➡ H, day ➡ D, month end ➡ M).
Determine the start, the end, and the interval by checking df.head() and df.tail().
Also, for monthly data such as this sparkling wine dataset, it is easier to handle if you set each date to the end of the month yourself.
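A quick sketch of the daily index described above (the variable name daily_index is just for illustration):
import pandas as pd

# Daily interval from 2017/1/1 to 2018/1/1 (both ends included)
daily_index = pd.date_range("2017-01-01", "2018-01-01", freq="D")
print(len(daily_index))                 # 366 timestamps
print(daily_index[0], daily_index[-1])  # 2017-01-01 00:00:00 2018-01-01 00:00:00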
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
# Read and organize the data
sales_sparkling = pd.read_csv("./5060_tsa_data/monthly-australian-wine-sales-th-sparkling.csv")
# Create the index data (month-end dates)
index = pd.date_range("1980-01-31", "1995-07-31", freq="M")
# Assign the index data
sales_sparkling.index = index
# Delete the "Month" column
del sales_sparkling["Month"]
# Display the data
print(sales_sparkling.head())
Now let's display the data as a line graph.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
# Read and organize the data
sales_sparkling = pd.read_csv("./5060_tsa_data/monthly-australian-wine-sales-th-sparkling.csv")
# Create the index data (month-end dates)
index = pd.date_range("1980-01-31", "1995-07-31", freq="M")
# Assign the index data
sales_sparkling.index = index
# Delete the "Month" column
del sales_sparkling["Month"]
# Show the data as a line graph
# Set the title of the graph
plt.title("monthly-australian-wine-sales-th-sparkling")
# Name the x-axis and y-axis
plt.xlabel("date")
plt.ylabel("sales")
# Plot the data
plt.plot(sales_sparkling)
plt.show()
The trend and the seasonal fluctuations seen in this graph can be considered the two factors that keep the series from being stationary.
Regarding the trend, recall that one condition of stationarity is that the expected value is constant. If the data has an upward trend, its expected value rises over time, so the series cannot be called stationary.
Regarding seasonal fluctuations, recall the other condition of stationarity: that the autocovariance (that is, the degree of dispersion of the data values) stays constant over time. Time series such as oden sales, which rise sharply at a particular time of year, do not meet this condition.
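As an informal, visual way to look at these two conditions, one can overlay a rolling mean and a rolling standard deviation on the series. This is only a rough check, and the 12-month window below is an assumption for monthly data:
import matplotlib.pyplot as plt

# Rough visual check: a stationary series should have a roughly flat
# rolling mean (constant expected value) and rolling standard deviation.
rolling_mean = sales_sparkling.rolling(window=12).mean()
rolling_std = sales_sparkling.rolling(window=12).std()

plt.plot(sales_sparkling, label="original")
plt.plot(rolling_mean, label="rolling mean")
plt.plot(rolling_std, label="rolling std")
plt.legend()
plt.show()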
To convert non-stationary time series data into stationary time series data, we eliminate the trend and seasonal fluctuations:
1. Eliminate the trend and seasonal fluctuations.
2. Analyze the time series data after it has been made stationary.
3. Build a model of the stationary data.
4. Recombine the trend and seasonal fluctuations to construct a model of the original series.
The ARIMA model handles time series data in this way. (To be precise, when there are seasonal fluctuations, a model called the SARIMA model is used; a rough sketch follows below.)
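As a rough sketch only, such a seasonal model could later be fit with statsmodels as shown here; the order and seasonal_order values are placeholders, not tuned for this data:
import statsmodels.api as sm

# Sketch: fit a SARIMA model to the monthly sparkling wine sales.
# order=(p, d, q) and seasonal_order=(P, D, Q, s) below are placeholder values.
sarima_result = sm.tsa.statespace.SARIMAX(
    sales_sparkling,
    order=(1, 1, 1),
    seasonal_order=(1, 1, 1, 12)
).fit(disp=False)
print(sarima_result.summary())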
Ways to make a time series into a stationary process include the following, among others:
1. Use a logarithmic transformation to make the variance of the fluctuations uniform.
2. Estimate the trend by taking a moving average, then remove that trend component.
3. Eliminate the trend and seasonal fluctuations by converting to a difference series.
4. Use seasonal adjustment.
A logarithmic transformation moderates the fluctuations of the data: the larger a value is, the more strongly the logarithm compresses it.
In other words, for a volatile time series the autocovariance can be made more uniform. Let's actually apply a logarithmic transformation to a time series.
Note that the trend cannot be removed by a logarithmic transformation, so the trend must still be removed separately.
For the logarithmic transformation, np.log() is used.
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import numpy as np
#Data reading
sunspots = sm.datasets.sunspots.load_pandas().data
sunspots.index = pd.Index(sm.tsa.datetools.dates_from_range("1700", "2008"))
del sunspots["YEAR"]
# Logarithmic transformation
sunspots_log = np.log(sunspots)
# Plot after the logarithmic transformation
plt.title("Sunspots")
plt.xlabel("date")
plt.ylabel("sunspots_log")
plt.plot(sunspots_log)
plt.show()
A moving average takes the mean of k consecutive values: you take the average over a fixed window of the time series and repeat this while moving the window along the series.
This smooths the data while retaining the characteristics of the original data.
For example, if monthly data has seasonal fluctuations, taking the moving average of 12 consecutive values removes the seasonal fluctuations and extracts the trend component.
Then subtract the computed moving average from the original series; this removes the trend component from the series.
Let's take a moving average over 51 weeks (roughly one year) of the CO2 concentration data from the Mauna Loa Observatory and check the trend. Also, confirm that the data obtained by subtracting the moving average from the original series is closer to a stationary process.
The moving average can be computed with
DATA.rolling(window=window size).mean()
as in the small example below.
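A tiny illustration of how rolling().mean() works (the values here are made up):
import pandas as pd

# Toy example: moving average with a window of 3
s = pd.Series([1, 2, 3, 4, 5, 6])
print(s.rolling(window=3).mean())
# The first two values are NaN, then 2.0, 3.0, 4.0, 5.0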
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import numpy as np
co2_tsdata = sm.datasets.co2.load_pandas().data
# Handle missing values by forward fill
co2_tsdata2 = co2_tsdata.ffill()
#Graph of the original series
plt.subplot(6, 1, 1)
plt.xlabel("date")
plt.ylabel("co2")
plt.plot(co2_tsdata2)
#Find the moving average
co2_moving_avg = co2_tsdata2.rolling(window=51).mean()
#Moving average graph
plt.subplot(6, 1, 3)
plt.xlabel("date")
plt.ylabel("co2")
plt.plot(co2_moving_avg)
# Graph of (original series - moving average)
plt.subplot(6, 1, 5)
plt.xlabel("date")
plt.ylabel("co2")
mov_diff_co2_tsdata = co2_tsdata2-co2_moving_avg
plt.plot(mov_diff_co2_tsdata)
plt.show()
Conversion to a difference series is the most commonly used method for giving a series stationarity. Taking differences can eliminate trends and seasonal fluctuations.
A difference series is the series obtained by subtracting each value from the adjacent (next) one. For example, the difference series of the time series [1, 5, 3, 5, 3, 2, 2, 9] is [4, -2, 2, -2, -1, 0, 7].
The first-order difference is obtained with DATA.diff().
Taking the difference series of the resulting difference series gives the second-order difference series, as in the quick check below.
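A quick check of the example above with pandas:
import pandas as pd

# The difference series of [1, 5, 3, 5, 3, 2, 2, 9]
s = pd.Series([1, 5, 3, 5, 3, 2, 2, 9])
print(s.diff())          # NaN, 4, -2, 2, -2, -1, 0, 7
# Taking the difference again gives the second-order difference series
print(s.diff().diff())   # NaN, NaN, -6, 4, -4, 1, 1, 7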
Let's convert the carbon dioxide concentration data from the Mauna Loa Observatory into a first-order difference series and confirm that it is closer to a stationary process than the original series.
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import numpy as np
co2_tsdata = sm.datasets.co2.load_pandas().data
# Handle missing values by forward fill
co2_tsdata2 = co2_tsdata.ffill()
#Original series plot
plt.subplot(2, 1, 1)
plt.xlabel("date")
plt.ylabel("co2")
plt.plot(co2_tsdata2)
#Take the difference
plt.subplot(2, 1, 2)
plt.xlabel("date")
plt.ylabel("co2_diff")
co2_data_diff = co2_tsdata2.diff()
#Plot of difference series
plt.plot(co2_data_diff)
plt.show()
We have now seen the difference series. Recall that the original series can be divided into a trend, seasonal fluctuations, and residuals.
This decomposition is expressed as
original series = trend + seasonal variation + residual
and therefore
original series - trend - seasonal variation = residual.
In other words, the residual is stationary time series data with the trend and seasonal fluctuations removed. Let's check that the residual is a stationary process.
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import numpy as np
#Data reading
co2_tsdata = sm.datasets.co2.load_pandas().data
#Handling of missing values
co2_tsdata2 = co2_tsdata.dropna()
# Seasonal decomposition (seasonal adjustment) and plot of the components
# Note: older statsmodels versions took freq=51 instead of period=51
res = sm.tsa.seasonal_decompose(co2_tsdata2, period=51)
fig = res.plot()
plt.show()
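The individual components can also be taken out of res. As a small sketch (assuming the default additive model), trend + seasonal variation + residual reproduces the original series, apart from the NaN values at both ends where the moving-average trend is undefined:
# Access the individual components of the decomposition
trend = res.trend
seasonal = res.seasonal
residual = res.resid

# original series = trend + seasonal variation + residual
# (NaN appears at both ends, where the moving-average trend is undefined)
reconstructed = trend + seasonal + residual
diff = (reconstructed - co2_tsdata2).dropna()
print(diff.abs().max())  # should be essentially zero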