Challenge to future sales forecast: ② Time series analysis using PyFlux

Introduction

Last time, in "Challenge to future sales forecast: ① What is time series analysis?", I introduced time series analysis as a way to forecast future sales. It was less an introduction than a summary based on my own interpretation (I do not have a solid grounding in the mathematics), so readers who understand the formulas well may have spotted places where I have misunderstood something. If you notice any mistakes, I would appreciate it if you could point them out.

This time, I would like to actually build a model for time series analysis.

The models are ARIMA and ARIMAX, built with PyFlux, the library introduced in "Prediction library of time series data--PyFlux--". (I actually wanted to try SARIMA to take seasonality into account, but I could not work out how to do it with PyFlux, so I gave up this time.)

Analytical environment

Google Colaboratory

Target data

The data is very simple: daily sales as the objective variable, with temperature (average, highest, lowest) as explanatory variables.

date        Sales amount  Average temperature  Highest temperature  Lowest Temperature
2018-01-01  7,400,000     4.9                  7.3                  2.2
2018-01-02  6,800,000     4.0                  8.0                  0.0
2018-01-03  5,000,000     3.6                  4.5                  2.7
2018-01-04  7,800,000     5.6                  10.0                 2.6

1. Original data creation

As usual, download the target data from BigQuery into Colaboratory's Python environment. Note that the real column names cannot be written this way (the model summaries further down show names such as kingaku and kion_min), but descriptive names are used in this article for the sake of clarity.

import pandas as pd

query = """
SELECT * 
FROM `myproject.mydataset.mytable`
WHERE CAST(Date AS TIMESTAMP) BETWEEN CAST("2018-01-01" AS TIMESTAMP) AND CAST("2018-12-31" AS TIMESTAMP)
ORDER BY Date
"""

df = pd.io.gbq.read_gbq(query, project_id="myproject", dialect="standard")

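# Replace missing values with zero
df.fillna(0, inplace=True)

# Use the date column as the index and convert it to a Datetime type.
# Note that df[1:] skips the first row of the query result.
df = df[1:].set_index('date')
# Parse as UTC, shift to Japan time, then drop the timezone information
df.index = pd.to_datetime(df.index, utc=True).tz_convert('Asia/Tokyo')
df.index = df.index.tz_localize(None)
df = df.sort_index()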

The date has to be converted to a Datetime type and set as the index for the later processing.

Incidentally, I used Pandas' read_gbq here, even though it was the slowest option when I benchmarked it in "I tried using BigQuery Storage API for BQ → Python cooperation". When the source data is not very large, it is simply the easiest to write... (that is my excuse).
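As a rough sketch of what the index conversion does, here is the same sequence of calls on a tiny hand-made frame. The three rows and the assumption that the timestamps arrive in UTC are mine, purely for illustration; the column name `sales` is a stand-in.

```python
import pandas as pd

# Hypothetical rows: timestamps in UTC, deliberately out of order
raw = pd.DataFrame({
    "date": [
        "2018-01-02 15:00:00+00:00",
        "2017-12-31 15:00:00+00:00",
        "2018-01-01 15:00:00+00:00",
    ],
    "sales": [5000000, 7400000, 6800000],
})

demo = raw.set_index("date")
# Parse the strings as UTC, then shift to Japan time (UTC+9)
demo.index = pd.to_datetime(demo.index, utc=True).tz_convert("Asia/Tokyo")
# Drop the timezone so the index holds plain local timestamps
demo.index = demo.index.tz_localize(None)
# Put the rows in chronological order
demo = demo.sort_index()

print(demo)
```

After these steps the index is a timezone-naive DatetimeIndex in chronological order, which is the shape the later plotting and modelling steps work with.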

Let's take a look at the sales data for now.

%matplotlib inline
import matplotlib.pyplot as plt

plt.figure()
df["Sales amount"].plot(figsize=(12, 8))

[Figure: line plot of the daily sales amount]

Because the data is daily, it goes up and down considerably. Sales rise sharply at the beginning of March and at the end of the year.

2. Building the ARIMA model

From here, we will perform time series analysis using PyFlux. However, PyFlux is not installed as standard in Colaboratory, so let's install it first.

pip install pyflux

The model code itself is very simple. Besides the data, ARIMA takes five arguments:

- ar: autoregressive order
- ma: moving average order
- integ: order of differencing
- target: objective variable
- family: probability distribution

The model is then fitted by MLE (maximum likelihood estimation).

import pyflux as pf

model = pf.ARIMA(data=df, ar=5, ma=5, integ=1, target='Sales amount', family=pf.Normal())
x = model.fit('MLE')

However, I have no idea what the values of ar, ma, and integ should be, so for now I simply have to run it and see.
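To get a feel for what integ=1 means, here is a minimal sketch that takes the first difference of the four sample rows from the data table with pandas. This is my own illustration of first-order differencing, not PyFlux's internal code.

```python
import pandas as pd

# The four sample rows of daily sales shown in the data table
sales = pd.Series(
    [7400000, 6800000, 5000000, 7800000],
    index=pd.to_datetime(["2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04"]),
    name="Sales amount",
)

# First-order differencing: each value minus the previous day's value
differenced = sales.diff().dropna()

print(differenced)
```

With integ=1 the model works on this day-over-day change rather than on the raw sales, which is why the summary reports the dependent variable as "Differenced". One observation is lost for each order of differencing.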

x.summary()

The summary output is shown below. For evaluating the model as a whole, AIC (Akaike's Information Criterion) or BIC (Bayesian Information Criterion) can be used, and a model with a lower value is considered better. For evaluating the individual variables, should I use the P value (P>|z|)? The Constant, AR(3), and MA(5) do not appear to be significant because their P values are high.

If one of the AR / MA terms is not significant, does that mean the corresponding order should be reduced by one? And if even the constant is not significant, is this data perhaps difficult to fit with an ARIMA model in the first place?
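For reference, the z and P>|z| columns can be roughly reproduced from the Estimate and Std Error columns. The sketch below assumes a two-sided test against a standard normal distribution, which is consistent with the printed numbers but is my assumption about how PyFlux computes them; the helper name `z_and_p` is hypothetical.

```python
import math

def z_and_p(estimate, std_error):
    """Return the z statistic and a two-sided normal p-value."""
    z = estimate / std_error
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# (Estimate, Std Error) pairs copied from the ARIMA summary output
rows = {
    "Constant": (18212.1745, 51000.329),
    "AR(1)": (0.2046, 0.0583),
    "AR(3)": (-0.0762, 0.0807),
    "MA(5)": (0.007, 0.0725),
}

for name, (estimate, std_error) in rows.items():
    z, p = z_and_p(estimate, std_error)
    print(f"{name:9s} z={z:8.4f}  p={p:.4f}")
```

Small differences from the printed table are expected, because the table shows rounded estimates and standard errors.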

Normal ARIMA(5,1,5)                                                                                       
======================================================= ==================================================
Dependent Variable: Differenced kingaku                 Method: MLE                                       
Start Date: 2018-01-09 03:00:00                         Log Likelihood: -5401.3927                        
End Date: 2019-01-01 03:00:00                           AIC: 10826.7854                                   
Number of observations: 357                             BIC: 10873.3182                                   
==========================================================================================================
Latent Variable                          Estimate   Std Error  z        P>|z|    95% C.I.                 
======================================== ========== ========== ======== ======== =========================
Constant                                 18212.1745 51000.329  0.3571   0.721    (-81748.4703 | 118172.819
AR(1)                                    0.2046     0.0583     3.507    0.0005   (0.0902 | 0.3189)        
AR(2)                                    -0.9284    0.0476     -19.4981 0.0      (-1.0217 | -0.8351)      
AR(3)                                    -0.0762    0.0807     -0.9438  0.3453   (-0.2343 | 0.082)        
AR(4)                                    -0.4864    0.0465     -10.4663 0.0      (-0.5774 | -0.3953)      
AR(5)                                    -0.5857    0.0555     -10.5508 0.0      (-0.6945 | -0.4769)      
MA(1)                                    -0.8716    0.0787     -11.0703 0.0      (-1.0259 | -0.7173)      
MA(2)                                    0.9898     0.0905     10.9326  0.0      (0.8123 | 1.1672)        
MA(3)                                    -0.5321    0.1217     -4.3708  0.0      (-0.7707 | -0.2935)      
MA(4)                                    0.4706     0.0945     4.9784   0.0      (0.2853 | 0.6558)        
MA(5)                                    0.007      0.0725     0.0973   0.9225   (-0.135 | 0.1491)        
Normal Scale                             900768.835                                                       
==========================================================================================================
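The AIC and BIC in this summary can be reproduced from the log likelihood with the standard formulas AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L. In the sketch below I count k = 12 parameters (the constant, five AR terms, five MA terms, and the Normal scale); that count is my reading of the table, and the function names are my own.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike's Information Criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k ln(n) - 2 ln L."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

# Values copied from the ARIMA(5,1,5) summary
log_lik = -5401.3927
n_obs = 357
n_params = 12  # constant + 5 AR + 5 MA + Normal scale (my count)

arima_aic = aic(log_lik, n_params)
arima_bic = bic(log_lik, n_params, n_obs)

print(f"AIC: {arima_aic:.4f}")
print(f"BIC: {arima_bic:.4f}")
```

Both criteria reward a higher likelihood and penalize every additional parameter, so raising ar or ma only pays off if the fit improves enough to offset the penalty.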

Statistically it did not look very good, but what about the graph?

model.plot_fit(figsize=(15, 10))

[Figure: ARIMA model fit plotted against the actual sales]

Blue is the actual data and black is the model's fitted values. The timing of the rises and falls is similar, but in many places the model fails to capture the magnitude of the actual fluctuations.

3. Building the ARIMAX model

Next, let's build an ARIMAX model, which is ARIMA with explanatory variables added. The code is very simple here as well. The formula argument is written as "objective variable ~ 1 + explanatory variable". (This notation is a little unintuitive.)

import pyflux as pf

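# ARIMAX (not ARIMA) is the PyFlux class that accepts a formula.
# The column names are this article's descriptive stand-ins; names containing spaces may not parse in a formula.
model = pf.ARIMAX(data=df, formula='Sales amount~1+Average temperature+Highest temperature+Lowest Temperature', ar=5, ma=5, integ=1, target='Sales amount', family=pf.Normal())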
x = model.fit('MLE')
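To make the formula notation a little less mysterious, here is a sketch of what the right-hand side "1 + explanatory variable" stands for: a column of ones (the intercept) next to the explanatory variable. It uses plain least squares on the four sample rows from the data table, so it is not an ARIMAX fit, and the column names `sales` and `temp_avg` are stand-ins of my own.

```python
import numpy as np
import pandas as pd

# Four sample rows from the data table (sales and average temperature)
sample = pd.DataFrame({
    "sales": [7400000.0, 6800000.0, 5000000.0, 7800000.0],
    "temp_avg": [4.9, 4.0, 3.6, 5.6],
})

# "sales ~ 1 + temp_avg": the "1" is a constant column, i.e. the intercept
X = pd.DataFrame({"1": 1.0, "temp_avg": sample["temp_avg"]})
y = sample["sales"].to_numpy()

# Ordinary least squares, only to show one coefficient per column of X
beta, *_ = np.linalg.lstsq(X.to_numpy(), y, rcond=None)
residuals = y - X.to_numpy() @ beta

for name, value in zip(X.columns, beta):
    print(f"Beta {name}: {value:.2f}")
```

This is why the ARIMAX summary lists one "Beta" row for the constant "1" and one for each explanatory variable.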

Then evaluate the model in the same way as ARIMA.

x.summary()

AIC and BIC are almost the same as before. The temperature variable added as an explanatory variable has a P value of 1.0, so it contributes nothing at all...

Normal ARIMAX(5,1,5)                                                                                      
======================================================= ==================================================
Dependent Variable: Differenced kingaku                 Method: MLE                                       
Start Date: 2018-01-09 03:00:00                         Log Likelihood: -5401.6313                        
End Date: 2019-01-01 03:00:00                           AIC: 10829.2627                                   
Number of observations: 357                             BIC: 10879.6732                                   
==========================================================================================================
Latent Variable                          Estimate   Std Error  z        P>|z|    95% C.I.                 
======================================== ========== ========== ======== ======== =========================
AR(1)                                    0.2036     0.0581     3.5023   0.0005   (0.0897 | 0.3175)        
AR(2)                                    -0.9277    0.0475     -19.5352 0.0      (-1.0208 | -0.8346)      
AR(3)                                    -0.0777    0.0804     -0.9658  0.3342   (-0.2353 | 0.08)         
AR(4)                                    -0.4857    0.0463     -10.4841 0.0      (-0.5765 | -0.3949)      
AR(5)                                    -0.5869    0.0552     -10.6292 0.0      (-0.6952 | -0.4787)      
MA(1)                                    -0.8687    0.0775     -11.2101 0.0      (-1.0205 | -0.7168)      
MA(2)                                    0.989      0.0902     10.9702  0.0      (0.8123 | 1.1657)        
MA(3)                                    -0.5284    0.1211     -4.3651  0.0      (-0.7657 | -0.2912)      
MA(4)                                    0.47       0.0942     4.9874   0.0      (0.2853 | 0.6547)        
MA(5)                                    0.0097     0.0715     0.1353   0.8924   (-0.1305 | 0.1499)       
Beta 1                                   0.0        59845.8347 0.0      1.0      (-117297.836 | 117297.836
Beta kion_min                            -0.0       755.0035   -0.0     1.0      (-1479.8069 | 1479.8068) 
Normal Scale                             901399.389                                                       
==========================================================================================================
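Putting the two summaries side by side, a small sketch like the following picks the model with the lower criterion. The dictionary layout and variable names are my own; the numbers are copied from the two summary outputs.

```python
# AIC and BIC copied from the two summary outputs
results = {
    "ARIMA(5,1,5)": {"aic": 10826.7854, "bic": 10873.3182},
    "ARIMAX(5,1,5)": {"aic": 10829.2627, "bic": 10879.6732},
}

# Lower is better for both criteria
best_by_aic = min(results, key=lambda name: results[name]["aic"])
best_by_bic = min(results, key=lambda name: results[name]["bic"])

aic_gap = results["ARIMAX(5,1,5)"]["aic"] - results["ARIMA(5,1,5)"]["aic"]
bic_gap = results["ARIMAX(5,1,5)"]["bic"] - results["ARIMA(5,1,5)"]["bic"]

print(f"Best by AIC: {best_by_aic} (gap {aic_gap:.4f})")
print(f"Best by BIC: {best_by_bic} (gap {bic_gap:.4f})")
```

The log likelihood barely changes between the two models, so the slightly higher AIC and BIC of the ARIMAX model mostly reflect the penalty for its additional parameter.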

Finally, graph it.

model.plot_fit(figsize=(15, 10))

[Figure: ARIMAX model fit plotted against the actual sales]

Hmm. It looks the same as the ARIMA model.

Conclusion

With PyFlux, the time series analysis code itself was very easy to write. However, while both ARIMA and ARIMAX capture the direction of the rises and falls well, the predicted swings are too small and the accuracy of the model does not improve. It is also difficult to determine the best value for each parameter.

What remains may be seasonality. (SARIMA, which I could not work out how to use with PyFlux?) Also, the explanatory variable based on temperature did not help at all this time, so there seems to be room for improvement there as well.
