I am an M2 student majoring in CS. I usually focus on image processing, but since I had the opportunity to work with time-series data for forecasting future values, I am leaving this as a memo. I hope it serves as a reference for those who want to process time-series data. **Formulas etc. are omitted, so this is for those who want to get a general feel for the topic**. Also, if you find any mistakes, please let me know.
Time-series data is **"a collection of results measured at regular intervals"**. Think of values such as temperature changes, precipitation, and store sales, paired with the time at which each was measured.
**AR model (autoregressive model)**

-- The future y is explained by past y
-- Uses the series' own past values as explanatory variables
-- Represents the value of interest as a combination of several past values, each multiplied by a coefficient
-- Assumes a stationary process
**MA model (moving average model)**

-- The future y is explained by past errors
-- The forecast is determined by the errors between past forecasts and actual values
-- (Example) If this month's sales exceed the expected sales, next month's sales will also increase
-- Expresses the relationship through error terms shared between the value of interest and past values
-- Assumes a stationary process
**ARMA model (autoregressive moving average model)**

-- An AR process plus an MA process; it behaves according to whichever property is stronger
-- Therefore both the autocorrelation and the partial autocorrelation decay as the lag grows
-- The ARMA model estimates and forecasts under the assumption that the series is stationary, but real data is often non-stationary
-- Assumes a stationary process
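To get a rough feel for these three models, here is a minimal sketch (assuming statsmodels is installed) that simulates an AR(1), an MA(1), and an ARMA(1,1) series with `ArmaProcess`; the coefficients are arbitrary examples:

```python
import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess

np.random.seed(0)
# Coefficient arrays include the lag-0 term; AR coefficients are passed with inverted sign
ar1 = ArmaProcess(ar=[1, -0.7], ma=[1]).generate_sample(nsample=200)           # AR(1), phi = 0.7
ma1 = ArmaProcess(ar=[1], ma=[1, 0.5]).generate_sample(nsample=200)            # MA(1), theta = 0.5
arma11 = ArmaProcess(ar=[1, -0.7], ma=[1, 0.5]).generate_sample(nsample=200)   # ARMA(1,1)
```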
**ARIMA model (autoregressive integrated moving average model)**

-- Differs from the ARMA model in that it incorporates differencing
-- Adds to ARMA the number of times the series must be differenced to become stationary
-- A process whose d-th difference follows a stationary and invertible ARMA(p, q) process
**SARIMA model (seasonal ARIMA model)**

-- Differs from ARIMA in that it also takes seasonality into account
-- In addition to the ARIMA(p, d, q) part in the time direction, it has an ARIMA(P, D, Q) part in the seasonal-difference direction, with period s
**Unit root**

-- The data is generated by accumulating values
-- Data with a unit root is called a "unit root process"
-- ex) Random walk (the cumulative sum of white noise)
-- White noise: pure "noise" following a normal distribution, with no autocorrelation
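For example, a random walk (a typical unit root process) can be generated as the cumulative sum of white noise; a minimal sketch:

```python
import numpy as np

np.random.seed(0)
noise = np.random.normal(loc=0, scale=1, size=200)  # white noise: i.i.d. normal, no autocorrelation
walk = np.cumsum(noise)                             # random walk: cumulative sum of the noise
```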
**ADF test (unit root test)**

-- Since many time-series models assume a stationary process, a series is often checked for a unit root first
-- Null hypothesis: **unit root process**; alternative hypothesis: **stationary process**
-- If the p-value is 0.05 or less, the null hypothesis is rejected and the series is regarded as stationary
-- In general, taking the difference series or applying a logarithmic transformation tends to make a series stationary
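A minimal sketch of the test (assuming `series` is your 1-D time series; a full example on real data appears later):

```python
import statsmodels.api as sm

# adfuller returns a tuple; the element at index 1 is the p-value
p_value = sm.tsa.adfuller(series)[1]
if p_value <= 0.05:
    print('Unit-root null rejected -> treat the series as stationary')
```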
**Autocorrelation**

-- How much do past values affect the current data?
-- The number of steps by which the data is shifted is called the lag
**Partial autocorrelation**

-- Autocorrelation with the indirect influence of intermediate time steps removed
-- The correlation between today and two days ago indirectly includes the influence of one day ago
-- Partial autocorrelation lets us examine the relationship between today and two days ago while excluding the effect of one day ago
**Correlogram**

-- A plot of autocorrelation against lag
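A minimal sketch of computing these quantities on toy data (the `plot_acf`/`plot_pacf` calls later in this post draw the actual correlograms):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import acf, pacf

s = pd.Series(np.random.randn(100)).cumsum()  # toy series
print(s.autocorr(lag=1))   # autocorrelation at lag 1
print(acf(s, nlags=10))    # autocorrelation for lags 0..10
print(pacf(s, nlags=10))   # partial autocorrelation for lags 0..10
```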
**Handling time-series data in pandas**

import numpy as np
import pandas as pd
pd.date_range('2020-1-1', freq='D', periods=3)
'''
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03'], dtype='datetime64[ns]', freq='D')
'''
df = pd.Series(np.arange(3))
df.index = pd.date_range('2020-1-1', freq='D', periods=3)
df
'''
2020-01-01 0
2020-01-02 1
2020-01-03 2
Freq: D, dtype: int64
'''
idx = pd.date_range('2020-1-1', freq='D', periods=365)
df = pd.DataFrame({'Product A' : np.random.randint(100, size=365),
'Product B' : np.random.randint(100, size=365)},
index=idx)
df
'''
Product A Product B
2020-01-01 99 23
2020-01-02 73 98
2020-01-03 86 85
2020-01-04 44 37
2020-01-05 67 63
... ... ...
2020-12-26 23 25
2020-12-27 91 35
2020-12-28 3 23
2020-12-29 92 47
2020-12-30 55 84
365 rows × 2 columns
'''
# Get data for a specific date
df.loc['2020-2-3']
'''
Product A 51
Product B 46
Name: 2020-02-03 00:00:00, dtype: int64
'''
# Get data by slicing
df.loc[:'2020-1-4']
'''
Product A Product B
2020-01-01 99 23
2020-01-02 73 98
2020-01-03 86 85
2020-01-04 44 37
'''
df.loc['2020-1-4':'2020-1-7']
'''
Product A Product B
2020-01-04 44 37
2020-01-05 67 63
2020-01-06 6 94
2020-01-07 47 11
'''
df.loc['2020-1']
'''
(All rows for January are displayed; output omitted)
'''
# Get the month
df.index.month
'''
Int64Index([ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
...
12, 12, 12, 12, 12, 12, 12, 12, 12, 12],
dtype='int64', length=365)
'''
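As a small sketch of what the extracted month can be used for (assuming the `df` from above), you can group by month to get monthly summaries:

```python
# Monthly mean of each product, grouped by the month component of the index
df.groupby(df.index.month).mean()
```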
This time we will use the 'AirPassengers' dataset, a famous time-series dataset.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Load the dataset (monthly airline passenger counts)
data = pd.read_csv('AirPassengers.csv', index_col=0, parse_dates=[0])
plt.plot(data)

import statsmodels.api as sm

# Decompose the series into trend, seasonal, and residual components
res = sm.tsa.seasonal_decompose(data)
fig = res.plot()

# Correlograms: autocorrelation (ACF) and partial autocorrelation (PACF)
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
sm.tsa.graphics.plot_acf(data, ax=axes[0])
sm.tsa.graphics.plot_pacf(data, ax=axes[1])

# Plot the first-difference series
plt.figure(figsize=(15, 5))
plt.plot(data.diff(1))
`adfuller()` returns a tuple; the element at index 1 is the p-value. If the p-value is 0.05 or less, the null hypothesis can be rejected.
# Raw data
sm.tsa.adfuller(data)[1]
0.991880243437641

# Log transformation
ldata = np.log(data)
sm.tsa.adfuller(ldata)[1]
0.42236677477039125

# Log transformation + first difference
sm.tsa.adfuller(ldata.diff().dropna())[1]
0.0711205481508595
Set the parameters with `order` and `seasonal_order`. Train the model with `fit()`. Use `forecast()` to predict beyond the training range, and `predict()` to predict points that include the training range.
Parameter tuning is done by brute force. (It seems statsmodels does not have a function that searches for the best model automatically?) A sketch of such a search follows the plotting code below.
model = sm.tsa.SARIMAX(ldata, order=(1, 1, 1), seasonal_order=(0, 1, 2, 12))
res_model = model.fit()

# Forecast 36 steps beyond the training range
pred = res_model.forecast(36)
plt.plot(ldata, label='Original')
plt.plot(pred, label='Pred')
plt.legend()
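A minimal sketch of the brute-force search mentioned above (hypothetical, deliberately small parameter ranges; the best model is chosen by AIC):

```python
import itertools
import numpy as np

best_aic, best_params = np.inf, None
for p, q, P, Q in itertools.product(range(2), range(2), range(2), range(2)):
    try:
        res = sm.tsa.SARIMAX(ldata, order=(p, 1, q),
                             seasonal_order=(P, 1, Q, 12)).fit(disp=False)
    except Exception:
        continue  # skip combinations that fail to converge
    if res.aic < best_aic:
        best_aic, best_params = res.aic, (p, q, P, Q)
print(best_aic, best_params)
```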
**Information likely to be useful as features for time-series data**
# Create a simple table
df = pd.DataFrame(np.arange(6).reshape(6, 1), columns=['values'])

# Difference (lag 1)
df['diff_1'] = df['values'].diff(1)

# Difference (lag 2)
df['diff_2'] = df['values'].diff(2)

# Simply shift the values
df['shift'] = df['values'].shift(1)

# Rate of change
df['ch'] = df['values'].pct_change(1)

# Moving statistics with a rolling window
df['rolling_mean'] = df['values'].rolling(2).mean()
df['rolling_max'] = df['values'].rolling(2).max()
-- Features can be created with a library called tsfresh
-- You can do cross-validation with sklearn's TimeSeriesSplit (see the sketch after this list)
-- If the series is a stationary process, isn't a statistical model better than a machine learning model?
-- The SARIMA model cannot handle NaN
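A minimal sketch of the `TimeSeriesSplit` cross-validation mentioned above (toy data); note that each training window always precedes its test window:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print('train:', train_idx, 'test:', test_idx)
```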
This was a quick overview, but I have summarized the basics of time series. What I am still unsure about is whether to use a machine learning model or a statistical model. Personally, I feel the statistical model produced better results (not on this data, though...).