How to handle time series data (implementation)


I am a second-year master's student (M2) in CS. I usually work on image processing, but I recently had the chance to handle time series data, so I am leaving these notes as a memo. I hope they serve as a reference for anyone who wants to process time series data. **Formulas are omitted, so this is aimed at readers who want to get a feel for the topic.** If you find any mistakes, please let me know.

What is time series data?

Time series data is **"a collection of results measured at regular intervals"**. Think of it as observations such as temperature, precipitation, or store sales, each stored together with the time at which it was measured.

Models and terms used for time series data

AR model (autoregressive model)

--The future y is explained by the past y
--The series' own past values are used as explanatory variables
--The value of interest is expressed as a sum of several past values, each multiplied by a coefficient
--Assumes a stationary process
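As a rough illustration of the AR idea, the sketch below simulates an AR(1) series, y_t = phi * y_{t-1} + noise, and recovers the coefficient by regressing each value on the previous one. The coefficient 0.7, the seed, and the helper names `simulate_ar1` and `estimate_ar1` are assumptions made for this example, not part of the original notes.

```python
import numpy as np

def simulate_ar1(phi, n, seed=0):
    """Simulate y_t = phi * y_{t-1} + e_t with standard normal noise e_t."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(size=n)
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = phi * y[t - 1] + noise[t]
    return y

def estimate_ar1(y):
    """Least-squares estimate of phi from the pairs (y_{t-1}, y_t), no intercept."""
    prev, curr = y[:-1], y[1:]
    return float(prev @ curr / (prev @ prev))

y = simulate_ar1(phi=0.7, n=5000)
phi_hat = estimate_ar1(y)
print(round(phi_hat, 2))  # should land close to the true value 0.7
```

With enough data the estimate settles near the true coefficient, which is all "the future y is explained by the past y" means in practice.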

MA model (moving average model)

--The future y is explained by past errors
--The forecast is determined by the error between past forecasts and the actual values
--(Example) If this month's sales are higher than forecast, next month's sales are expected to rise as well
--The relationship is expressed by giving the value of interest and the past values a term they have in common
--Assumes a stationary process
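A small sketch of the "shared term" idea, assuming an MA(1) process y_t = e_t + theta * e_{t-1}. Neighbouring values share one error term, so they are correlated at lag 1 but not at lag 2. The value theta = 0.6 and the helper names are assumptions for illustration.

```python
import numpy as np

def simulate_ma1(theta, n, seed=0):
    """Simulate y_t = e_t + theta * e_{t-1} with standard normal errors."""
    rng = np.random.default_rng(seed)
    e = rng.normal(size=n + 1)
    return e[1:] + theta * e[:-1]

def autocorr(y, lag):
    """Sample autocorrelation of y at a positive lag."""
    y = y - y.mean()
    return float(y[lag:] @ y[:-lag] / (y @ y))

theta = 0.6
y_ma = simulate_ma1(theta, n=20000)
r1 = autocorr(y_ma, 1)   # theory: theta / (1 + theta**2)
r2 = autocorr(y_ma, 2)   # theory: 0, because y_t and y_{t-2} share no error term
print(round(r1, 2), round(r2, 2))
```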

ARMA model (autoregressive moving average model)

--An AR + MA process; its properties follow whichever component is stronger
--As a result, both the autocorrelation and the partial autocorrelation decay as the lag grows
--The ARMA model estimates and predicts under the assumption that the series is stationary, but real data is often non-stationary
--Assumes a stationary process

ARIMA model (autoregressive integrated moving average model)

--The difference from the ARMA model is that it incorporates differencing
--It adds to ARMA the order of differencing d needed to make the series stationary
--A process whose d-th difference follows a stationary and invertible ARMA(p, q) process
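To make the differencing step concrete, here is a minimal sketch with d = 1. A non-stationary level series is built by summing stationary increments, and one round of differencing recovers those increments. The drift of 0.5 and the variable names are assumptions for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
steps = rng.normal(loc=0.5, scale=1.0, size=500)  # stationary increments
walk = pd.Series(np.cumsum(steps))                # level series, needs d = 1

diff1 = walk.diff(1).dropna()                     # first difference

# The differenced series is the stationary increment series again.
recovered = np.allclose(diff1.to_numpy(), steps[1:])
print(recovered)
```

ARIMA applies an ARMA model to a differenced series like `diff1` rather than to the raw level series.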

SARIMA model (seasonal autoregressive integrated moving average model)

--The difference from ARIMA is that seasonality is taken into account
--In addition to ARIMA(p, d, q) along the time axis, it has ARIMA(P, D, Q) along the seasonal-difference axis and a period s
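The seasonal difference can be sketched with pandas alone. The example assumes monthly data with period s = 12 and a made-up series consisting of a linear trend plus a repeating yearly pattern. Differencing at lag 12 (D = 1) removes the pattern, and an ordinary difference (d = 1) then removes what is left of the trend.

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2015-01-01', freq='MS', periods=48)
t = np.arange(48)
pattern = np.tile([5, 3, 0, -2, -4, -5, -3, 0, 2, 4, 6, 8], 4)  # repeats every 12 months
y = pd.Series(2.0 * t + pattern, index=idx)

seasonal_diff = y.diff(12).dropna()     # D = 1 with s = 12: y_t - y_{t-12}
both = seasonal_diff.diff(1).dropna()   # then d = 1

print(seasonal_diff.unique())  # only the trend contribution over 12 months remains
print(both.unique())
```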

Unit root process

--The data is generated by accumulating (summing) values
--A process that has a unit root is called a "unit root process"
--ex) Random walk (cumulative sum of white noise)
--White noise: plain "noise" that follows a normal distribution and has no autocorrelation
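A quick way to see why a random walk is not stationary is to simulate many paths and look at the variance across paths at different time steps. This is a sketch under assumed settings (2000 paths, 400 steps, standard normal noise): the white noise variance stays near 1, while the random walk variance keeps growing with time.

```python
import numpy as np

rng = np.random.default_rng(0)
noise = rng.normal(size=(2000, 400))   # 2000 independent white-noise paths
walks = np.cumsum(noise, axis=1)       # each row is a random walk

var_noise = noise.var(axis=0)          # variance across paths at each time step
var_walk = walks.var(axis=0)

print(round(var_noise[99], 2), round(var_noise[399], 2))  # both near 1
print(round(var_walk[99], 1), round(var_walk[399], 1))    # grows roughly in proportion to time
```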

ADF test

--Many time series models assume a stationary process, so it is common to first check the series for a unit root
--Null hypothesis: **unit root process**, alternative hypothesis: **stationary process**
--If the p-value is 0.05 or less, the null hypothesis is rejected and the series is treated as stationary
--In general, taking a difference series or applying a logarithmic transformation tends to make a series stationary

Autocorrelation (ACF: Autocorrelation Function)

--How much do past values influence the current value?
--The number of steps by which the data is shifted is called the lag
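The "shift the data by the lag and compare" idea can be written directly with pandas. The sketch assumes a made-up sine wave with a period of 12 steps: shifting by 12 lines the series up with itself, while shifting by 6 lines it up with its mirror image.

```python
import numpy as np
import pandas as pd

t = np.arange(240)
y = pd.Series(np.sin(2 * np.pi * t / 12))   # repeats every 12 steps

def lag_corr(series, lag):
    """Correlation between the series and a copy of itself shifted by `lag` steps."""
    return series.corr(series.shift(lag))

corr_12 = lag_corr(y, 12)   # one full period apart
corr_6 = lag_corr(y, 6)     # half a period apart
print(round(corr_12, 3), round(corr_6, 3))
```

pandas also provides `Series.autocorr(lag)`, which computes the same quantity.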

Partial Autocorrelation Function (PACF)

--The autocorrelation with the influence of the intermediate time points removed
--The relationship between today and two days ago indirectly includes the influence of one day ago
--Partial autocorrelation lets you examine the relationship between today and two days ago with the effect of one day ago excluded
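The "today versus two days ago" example can be checked numerically. The sketch below assumes an AR(1) series with coefficient 0.7, where two days ago matters only through yesterday. The plain autocorrelation at lag 2 is clearly non-zero, but the lag-2 partial autocorrelation, taken here as the coefficient on y_{t-2} when y_t is regressed on both y_{t-1} and y_{t-2}, is close to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, phi = 20000, 0.7
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + rng.normal()
y = y - y.mean()

# Plain autocorrelation at lag 2 (includes the indirect path through lag 1).
acf2 = float(y[2:] @ y[:-2] / (y @ y))

# Regress y_t on y_{t-1} and y_{t-2}; the second coefficient is the lag-2 partial autocorrelation.
X = np.column_stack([y[1:-1], y[:-2]])
coef, *_ = np.linalg.lstsq(X, y[2:], rcond=None)
pacf2 = float(coef[1])

print(round(acf2, 2), round(pacf2, 2))
```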


--Lag + autocorrelation


Handling of dates

import numpy as np
import pandas as pd

pd.date_range('2020-1-1', freq='D', periods=3)

DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03'], dtype='datetime64[ns]', freq='D')

df = pd.Series(np.arange(3))
df.index = pd.date_range('2020-1-1', freq='D', periods=3)

2020-01-01    0
2020-01-02    1
2020-01-03    2
Freq: D, dtype: int64

idx = pd.date_range('2020-1-1',freq='D',periods=365)
df = pd.DataFrame({'Product A' : np.random.randint(100, size=365),
                   'Product B' : np.random.randint(100, size=365)},
                  index=idx)

Product A Product B
2020-01-01	99	23
2020-01-02	73	98
2020-01-03	86	85
2020-01-04	44	37
2020-01-05	67	63
...	...	...
2020-12-26	23	25
2020-12-27	91	35
2020-12-28	3	23
2020-12-29	92	47
2020-12-30	55	84
365 rows × 2 columns
#Data acquisition for a specific date
df.loc['2020-02-03']

Product A 51
Product B 46
Name: 2020-02-03 00:00:00, dtype: int64

#Data acquisition by slicing
df.loc[:'2020-01-04']

Product A Product B
2020-01-01	99	23
2020-01-02	73	98
2020-01-03	86	85
2020-01-04	44	37

df.loc['2020-01-04':'2020-01-07']

Product A Product B
2020-01-04	44	37
2020-01-05	67	63
2020-01-06	6	94
2020-01-07	47	11


#Display all data for January (output omitted)
df.loc['2020-01']

#Get the month
df.index.month

Int64Index([ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
            12, 12, 12, 12, 12, 12, 12, 12, 12, 12],
           dtype='int64', length=365)

Simple data analysis

This time we will use the 'AirPassengers' dataset, a well-known example of time series data.

Loading and displaying data
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
data = pd.read_csv('AirPassengers.csv', index_col=0, parse_dates=[0])


Decompose into trend, seasonal, resid using statsmodels

import statsmodels.api as sm 
res = sm.tsa.seasonal_decompose(data)
fig = res.plot()


Display of autocorrelation and partial autocorrelation

fig, axes = plt.subplots(1,2, figsize=(15,5))
sm.graphics.tsa.plot_acf(data, ax=axes[0])
sm.graphics.tsa.plot_pacf(data, ax=axes[1])


Trend removal



ADF test

The test returns a tuple, and the element at index 1 of that tuple is the p-value. The null hypothesis can be rejected if the p-value is 0.05 or less.

#raw data

#Logarithmic conversion
ldata = np.log(data)

#Logarithmic conversion + differencing

SARIMA model estimation

Set the parameters with order and seasonal_order. Train the model with fit(). Use forecast() for forecasts outside the training range and predict() for predictions over a range that includes the training data. Parameter tuning apparently has to be done by brute force. (Doesn't statsmodels have a function that finds the best model?)

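#Set the non-seasonal order (p,d,q) and the seasonal order (P,D,Q,s)
model = sm.tsa.SARIMAX(ldata, order=(1,1,1),seasonal_order=(0,1,2,12))
#Train the model
res_model = model.fit()
#Forecast 36 steps beyond the training range
pred = res_model.forecast(36)
plt.plot(ldata, label='Original')
plt.plot(pred, label='Pred')
plt.legend()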


Feature creation in time series data

Information that is likely to be useful as a feature in time series data

#Easy table creation
df = pd.DataFrame(np.arange(6).reshape(6, 1),columns=['values'])

#Difference from the previous value
df['diff_1'] = df['values'].diff(1)
#Difference from the value 2 steps earlier
df['diff_2'] = df['values'].diff(2)
#Just shift the value
df['shift'] = df['values'].shift(1)
#Rate of change
df['ch'] = df['values'].pct_change(1)
#Moving average with window function
df['rolling_mean'] = df['values'].rolling(2).mean()
df['rolling_max'] = df['values'].rolling(2).max()


Other notes

--Features can be created with a library called tsfresh
--You can cross-validate with sklearn's TimeSeriesSplit
--Since machine learning models assume a stationary process, might a statistical model be the better choice?
--The SARIMA model cannot handle NaN
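As a small illustration of the cross-validation note, the sketch below runs sklearn's `TimeSeriesSplit` on eight dummy samples. Every training fold ends before its test fold begins, so the model is never trained on data from the future. The sample count and `n_splits=3` are assumptions for this example.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(8).reshape(-1, 1)   # eight time-ordered samples
tscv = TimeSeriesSplit(n_splits=3)

splits = [(train.tolist(), test.tolist()) for train, test in tscv.split(X)]
for train, test in splits:
    print('train:', train, 'test:', test)
```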



In closing

This was only a brief overview, but I have summarized the basics of handling time series data. What I am still unsure about is whether to use a machine learning model or a statistical model. Personally, I feel that statistical models give better results (not on this data, but ...).

