How to handle time series data (implementation)

Introduction

I am a second-year master's student (M2) in computer science. I usually work on image processing, but I recently had the chance to work with time series data, so I am leaving these notes as a memo. I hope they serve as a reference for anyone who wants to process time series data. **Formulas are omitted, so this is aimed at readers who want to get a general feel for the topic.** If you find any mistakes, please let me know.

What is time series data?

Time series data is **"a collection of results measured at regular intervals"**. Think of it as measurements such as temperature changes, precipitation, or store sales, each stored together with the time at which it was measured.

Models + terms that can be used for time series data

AR model (autoregressive model)

- The future y is explained by past values of y
- The series' own past data is used as the explanatory variables
- The value of interest is expressed as a sum of several past values, each multiplied by a coefficient
- Assumes a stationary process
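To get a feel for the points above, here is a minimal sketch that simulates an AR(1) series and recovers its coefficient by regressing each value on the previous one. The coefficient `phi`, the length `n`, and the random seed are illustrative assumptions, not values from this article.

```python
import numpy as np

# Simulate an AR(1) process: y[t] = phi * y[t-1] + noise[t].
# phi, n, and the seed are arbitrary choices for illustration.
rng = np.random.default_rng(0)
phi = 0.7
n = 2000
noise = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + noise[t]

# Estimate phi by least squares: regress y[t] on y[t-1] (no intercept).
past, current = y[:-1], y[1:]
phi_hat = float(past @ current / (past @ past))
print(f"true phi = {phi}, estimated phi = {phi_hat:.3f}")
```

With enough data the estimate should land close to the true coefficient, which is the sense in which the past y "explains" the future y.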

MA model (moving average model)

- The future y is explained by past errors
- The forecast is determined by the error between past forecasts and the actual values
- (Example) If this month's sales are higher than expected, next month's sales will also increase
- The relationship is expressed by having a term that is shared between the value of interest and past values
- Assumes a stationary process
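The "shared term" idea can be sketched with a simulated MA(1) series: consecutive values share one noise term, so they are correlated at lag 1 but (in theory) not at lag 2. The coefficient `theta`, the length `n`, and the helper `autocorr` are assumptions made up for this illustration.

```python
import numpy as np

# Simulate an MA(1) process: y[t] = eps[t] + theta * eps[t-1].
# theta, n, and the seed are arbitrary choices for illustration.
rng = np.random.default_rng(0)
theta = 0.6
n = 5000
eps = rng.normal(size=n + 1)
y = eps[1:] + theta * eps[:-1]


def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag (lag >= 1)."""
    x = x - x.mean()
    return float(x[lag:] @ x[:-lag] / (x @ x))


r1 = autocorr(y, 1)
r2 = autocorr(y, 2)
# Theoretical lag-1 autocorrelation of an MA(1) process.
r1_theory = theta / (1 + theta ** 2)
print(f"lag 1: {r1:.3f} (theory {r1_theory:.3f}), lag 2: {r2:.3f} (theory 0)")
```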

ARMA model (autoregressive moving average model)

- AR + MA process; its behavior follows whichever property is stronger
- As a result, both the autocorrelation and the partial autocorrelation decay as the lag increases
- The ARMA model estimates and predicts under the assumption that the series is stationary, but real data is often non-stationary
- Assumes a stationary process

ARIMA model (autoregressive integrated moving average model)

- The difference from the ARMA model is that it incorporates differencing
- It adds to ARMA the number of times d the series must be differenced to become stationary
- A process whose d-th difference follows a stationary and invertible ARMA(p, q) process
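As a small illustration of what the differencing order d means, the sketch below differences two toy series with `np.diff`. The series are made up for this example: a linear trend disappears after one difference, while a quadratic trend needs two.

```python
import numpy as np

t = np.arange(10)
linear = 3 * t + 5      # toy series with a linear trend
quadratic = t ** 2      # toy series with a quadratic trend

# One difference turns the linear trend into a constant (d = 1 is enough).
d1_linear = np.diff(linear, n=1)

# One difference of the quadratic series still increases over time ...
d1_quadratic = np.diff(quadratic, n=1)
# ... but a second difference is constant (d = 2).
d2_quadratic = np.diff(quadratic, n=2)

print(d1_linear)
print(d1_quadratic)
print(d2_quadratic)
```

Each difference shortens the series by one element, which is why differenced data in pandas starts with NaN values.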

SARIMA model (seasonal autoregressive integrated moving average model)

- The difference from ARIMA is that it takes seasonality into account
- In addition to ARIMA(p, d, q) in the time series direction, it has ARIMA(P, D, Q) in the seasonal difference direction, plus the period s
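The seasonal difference can be sketched with pandas as follows. The monthly toy series (a linear trend plus a pattern repeating every 12 months) is an assumption for illustration only: `diff(s)` removes the repeating pattern, and an ordinary `diff(1)` afterwards removes what is left of the trend.

```python
import numpy as np
import pandas as pd

s = 12  # assumed seasonal period (monthly data with a yearly cycle)
idx = pd.date_range("2015-01-01", periods=48, freq="MS")
t = np.arange(48)
seasonal = np.tile(np.arange(s), 4) * 10.0  # pattern repeating every 12 months
trend = 2.0 * t                             # linear trend
y = pd.Series(trend + seasonal, index=idx)

# Seasonal difference (D = 1): subtract the value from one period earlier.
seasonal_diff = y.diff(s).dropna()
# Ordinary difference (d = 1) on top of the seasonal difference.
both = seasonal_diff.diff(1).dropna()

print(seasonal_diff.head())
print(both.head())
```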

Unit root process

- The data is generated by cumulatively adding values
- Data with a unit root is called a "unit root process"
- Example: random walk (cumulative sum of white noise)
- White noise: pure "noise" with no autocorrelation, drawn from a normal distribution
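The random walk example can be written in a few lines of NumPy. This is only a sketch with an arbitrary length and seed; it shows that the random walk is the cumulative sum of white noise and that taking the difference gives the white noise back.

```python
import numpy as np

rng = np.random.default_rng(0)
white_noise = rng.normal(size=1000)   # no autocorrelation, normally distributed
random_walk = np.cumsum(white_noise)  # unit root process: cumulative sum

# Differencing the random walk recovers the white noise (except the first value).
recovered = np.diff(random_walk)
print(np.allclose(recovered, white_noise[1:]))
```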

ADF test

- Many time series models assume a stationary process, so it is common to first check the series for a unit root
- Null hypothesis: **unit root process**; alternative hypothesis: **stationary process**
- If the p-value is 0.05 or less, the null hypothesis is rejected and the series is regarded as stationary
- In general, taking a difference series or applying a log transform tends to make a series stationary

Autocorrelation (ACF: Autocorrelation Function)

- How much do past values affect the current value?
- The number of steps by which the data is shifted is called the lag
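To make the idea of a lag concrete, here is a hand-written sketch of the sample autocorrelation. The function name `acf` and the toy series that repeats every 4 steps are assumptions for illustration; in practice a library function would normally be used instead.

```python
import numpy as np


def acf(x, max_lag):
    """Sample autocorrelation of x for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = x @ x
    values = [1.0]  # the correlation of a series with itself (lag 0) is 1
    for k in range(1, max_lag + 1):
        # Correlate the series with a copy of itself shifted by k steps.
        values.append(float(x[k:] @ x[:-k] / denom))
    return np.array(values)


# Toy series that repeats every 4 steps.
x = np.tile([1.0, 2.0, 3.0, 2.0], 25)
r = acf(x, 8)
for lag, value in enumerate(r):
    print(f"lag {lag}: {value:+.2f}")
```

The autocorrelation is strongly positive at lags 4 and 8 (one and two full cycles) and strongly negative at lag 2 (half a cycle).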

Partial Autocorrelation Function (PACF)

- The autocorrelation with the influence of the intermediate time points removed
- The relationship between today and two days ago indirectly includes the influence of one day ago
- With partial autocorrelation, the relationship between today and two days ago can be examined with the effect of one day ago excluded
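The "today versus two days ago" example can be checked numerically. The sketch below simulates an AR(1) series (coefficient, length, and seed are arbitrary assumptions), in which two days ago affects today only through yesterday. The lag-2 autocorrelation is clearly non-zero, but the lag-2 partial autocorrelation, computed here with the standard formula (r2 - r1^2) / (1 - r1^2), is close to zero.

```python
import numpy as np

# Simulate an AR(1) process: today depends directly only on yesterday.
rng = np.random.default_rng(0)
phi = 0.8
n = 5000
noise = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + noise[t]


def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag (lag >= 1)."""
    x = x - x.mean()
    return float(x[lag:] @ x[:-lag] / (x @ x))


r1 = autocorr(y, 1)
r2 = autocorr(y, 2)
# Partial autocorrelation at lag 2: remove the part of r2 explained by lag 1.
pacf2 = (r2 - r1 ** 2) / (1 - r1 ** 2)
print(f"ACF lag 2 = {r2:.3f}, PACF lag 2 = {pacf2:.3f}")
```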

Correlogram

- A plot of the autocorrelation against the lag

Analysis

import numpy as np
import pandas as pd

Handling of dates

pd.date_range('2020-1-1', freq='D', periods=3)

'''
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03'], dtype='datetime64[ns]', freq='D')
'''

df = pd.Series(np.arange(3))
df.index = pd.date_range('2020-1-1', freq='D', periods=3)
df

'''
2020-01-01    0
2020-01-02    1
2020-01-03    2
Freq: D, dtype: int64
'''
idx = pd.date_range('2020-1-1',freq='D',periods=365)
df = pd.DataFrame({'Product A' : np.random.randint(100, size=365),
                   'Product B' : np.random.randint(100, size=365)},
                   index=idx)
df     

'''

Product A Product B
2020-01-01	99	23
2020-01-02	73	98
2020-01-03	86	85
2020-01-04	44	37
2020-01-05	67	63
...	...	...
2020-12-26	23	25
2020-12-27	91	35
2020-12-28	3	23
2020-12-29	92	47
2020-12-30	55	84
365 rows × 2 columns
'''      
#Data acquisition for a specific date
df.loc['2020-2-3']

'''
Product A 51
Product B 46
Name: 2020-02-03 00:00:00, dtype: int64
'''

#Data acquisition by slicing
df.loc[:'2020-1-4']

'''
Product A Product B
2020-01-01	99	23
2020-01-02	73	98
2020-01-03	86	85
2020-01-04	44	37
'''

df.loc['2020-1-4':'2020-1-7']
'''
Product A Product B
2020-01-04	44	37
2020-01-05	67	63
2020-01-06	6	94
2020-01-07	47	11
'''

df.loc['2020-1']

'''
(All data for January is displayed; omitted here)
'''

#Get the month
df.index.month

'''
Int64Index([ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
            ...
            12, 12, 12, 12, 12, 12, 12, 12, 12, 12],
           dtype='int64', length=365)
'''
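One thing the month values are handy for is aggregation. The sketch below groups a daily table by month and sums each column. The data is made up and deterministic so the result is easy to check; only the column names follow the table above.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2020-1-1", freq="D", periods=365)
# Deterministic toy data: 'Product B' sells exactly one unit per day.
df = pd.DataFrame({"Product A": np.arange(365),
                   "Product B": np.ones(365, dtype=int)},
                  index=idx)

# Group the rows by the month of their timestamp and sum each column.
monthly_total = df.groupby(df.index.month).sum()
print(monthly_total)
```

Since 2020 is a leap year, 365 daily rows starting on January 1 end on December 30, so the December total for 'Product B' is 30 rather than 31.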

Simple data analysis

This time we will use the 'AirPassengers' dataset, a well-known example of time series data.

Loading and displaying data
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
data = pd.read_csv('AirPassengers.csv', index_col=0, parse_dates=[0])
plt.plot(data)

plot1.png

Decompose into trend, seasonal, and residual components using statsmodels

import statsmodels.api as sm 
res = sm.tsa.seasonal_decompose(data)
fig = res.plot()

trend.png

Display of autocorrelation and partial autocorrelation

fig, axes = plt.subplots(1,2, figsize=(15,5))
sm.tsa.graphics.plot_acf(data, ax=axes[0])
sm.tsa.graphics.plot_pacf(data, ax=axes[1])

autocorrelation.png

Trend removal

plt.figure(figsize=(15,5))
plt.plot(data.diff(1))

remove_trend.png

ADF test

adfuller returns a tuple, and its second element (index 1) is the p-value. The null hypothesis can be rejected if the p-value is 0.05 or less.


#Raw data
sm.tsa.adfuller(data)[1]
'''
0.991880243437641
'''

#Log transform
ldata = np.log(data)
sm.tsa.adfuller(ldata)[1]
'''
0.42236677477039125
'''

#Log transform + first difference
sm.tsa.adfuller(ldata.diff().dropna())[1]
'''
0.0711205481508595
'''

SARIMA model estimation

Set the parameters with `order` and `seasonal_order`. Train the model with `fit()`. Use `forecast()` for forecasts beyond the training range and `predict()` for predictions over a range that includes the training data. Parameter tuning has to be done by brute force. (statsmodels does not seem to have a function that finds the best model automatically?)

model = sm.tsa.SARIMAX(ldata, order=(1,1,1),seasonal_order=(0,1,2,12))
res_model = model.fit()
pred = res_model.forecast(36)
plt.plot(ldata, label='Original')
plt.plot(pred, label='Pred')
#Show the labels set above
plt.legend()

pred.png

Feature creation in time series data

Information that is likely to be useful as features in time series data

#Create a simple table
df = pd.DataFrame(np.arange(6).reshape(6, 1),columns=['values'])

#Difference from the previous row
df['diff_1'] = df['values'].diff(1)
#Difference from two rows earlier (lag 2)
df['diff_2'] = df['values'].diff(2)
#Just shift the values by one row
df['shift'] = df['values'].shift(1)
#Rate of change from the previous row
df['ch'] = df['values'].pct_change(1)
#Moving average and moving maximum with a window function
df['rolling_mean'] = df['values'].rolling(2).mean()
df['rolling_max'] = df['values'].rolling(2).max()

table.png
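If such features are fed to a model that predicts the current value, they should only use past values. The sketch below is one possible way to do that under this assumption: lag features via `shift`, plus a rolling mean computed after shifting by one row so that the current value does not leak into its own feature. The column names are made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"values": np.arange(6)})

# Lag features: the value one and two rows earlier.
df["lag_1"] = df["values"].shift(1)
df["lag_2"] = df["values"].shift(2)
# Mean of the previous two values only (shift first, then roll).
df["prev_rolling_mean"] = df["values"].shift(1).rolling(2).mean()

# The first rows have no complete history, so drop them before training.
train = df.dropna()
X = train[["lag_1", "lag_2", "prev_rolling_mean"]]
y = train["values"]
print(train)
```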

Other notes

- Features can be created with a library called tsfresh
- Cross-validation can be done with sklearn's TimeSeriesSplit
- Since machine learning models implicitly assume a stationary process, might a statistical model be the better choice?
- The SARIMA model cannot handle NaN
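A minimal sketch of cross-validation with `TimeSeriesSplit`, using a made-up array of 12 samples. Unlike an ordinary shuffled split, every training fold comes strictly before its test fold, so the model is never evaluated on data older than what it was trained on.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy feature matrix: 12 samples in chronological order.
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))
for fold, (train_idx, test_idx) in enumerate(splits):
    print(f"fold {fold}: train = {train_idx}, test = {test_idx}")
```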

Conclusion

This was brief, but I have summarized the basics of handling time series data. What I am still unsure about is whether to use a machine learning model or a statistical model. Personally, I feel that statistical models end up giving better results (not on this data, but ...).
