Hello. I'm Sugato from Future Search Brazil.
This topic has probably been rehashed many times already, but I would like to write about forecasting time series data.
Time series forecasting has a reputation for not being very usable, so I would like to see how accurate it actually is and whether it can be used in practice.
The specific things I tried are as follows:
**Get the daily follower count on Twitter** ~~That stingy~~ I will do my best with the API
**Try to predict my own Twitter follower count**
(1) Prediction with the SARIMA model
・ [Combining neural network model with seasonal time series ARIMA model] https://www.sciencedirect.com/science/article/pii/S004016250000113X
・ [Analysis of time series data with SARIMA (prediction of PV number)] https://www.kumilog.net/entry/sarima-pv @xkumiyu
(2) Prediction with the Prophet model
・ [Prophet Official] https://facebook.github.io/prophet/docs/quick_start.html
・ [Time Series Analysis Library Prophet Official Document Translation 1 (Overview & Features)] https://qiita.com/japanesebonobo/items/96868e58d4da42d36807 @japanesebonobo
Predicting my own follower count, which shrinks day by day because I don't tweet, only deepens the heartache. To state the conclusion first: the number of followers will keep decreasing, and there is no prospect of an increase.
The daily follower data looks like this. I can't bear to look at it. (https://twitter.com/Ndtn_/) http://web.sfc.wide.ad.jp/~nadechin/follower.csv
date follower
2018/9/6 39,569
2018/9/7 39,570
2018/9/8 39,573
...
2019/12/10 37,861
Separate the data into training data and test data. It doesn't matter whether you use pandas or numpy, but for the time being:
・ 2018/09/06 ~ 2019/12/10 original data
・ 2018/09/06 ~ 2019/11/30 training data
・ 2019/12/01 ~ 2019/12/10 test data
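As a rough sketch of that split with pandas: the five rows below are the values quoted in this article, but the exact CSV layout (a comma-delimited file with quoted counts such as "39,569") is my assumption about follower.csv, and the names `raw`, `train`, and `test` are only for illustration.

```python
import io

import pandas as pd

# Assumed layout of follower.csv: header "date,follower", dates like 2018/9/6,
# counts quoted with thousands separators. Only rows quoted in the article are used.
raw = io.StringIO(
    'date,follower\n'
    '2018/9/6,"39,569"\n'
    '2018/9/7,"39,570"\n'
    '2018/9/8,"39,573"\n'
    '2019/12/1,"38,003"\n'
    '2019/12/10,"37,861"\n'
)

df = pd.read_csv(raw)
# Strip the thousands separators and convert the counts to integers.
df["follower"] = df["follower"].str.replace(",", "", regex=False).astype(int)
# Parse the dates and use them as a sorted index so that date slicing works.
df["date"] = pd.to_datetime(df["date"], format="%Y/%m/%d")
df = df.set_index("date").sort_index()

# Label-based slicing on a DatetimeIndex includes both end dates.
train = df.loc["2018-09-06":"2019-11-30", "follower"]
test = df.loc["2019-12-01":"2019-12-10", "follower"]

print(len(train), len(test))
```

Splitting by date rather than by shuffling keeps the test period strictly after the training period, which is what a forecast needs.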
Check the stationarity of the data with the ADF test.
・ [statsmodels.tsa.stattools.adfuller] http://www.statsmodels.org/dev/generated/statsmodels.tsa.stattools.adfuller.html
・ [Null hypothesis, significance level] http://www.gen-info.osaka-u.ac.jp/MEPHAS/express/express11.html
import statsmodels.api as sm
res = sm.tsa.stattools.adfuller(df.follower)  # res[1] is the p-value
The output is as follows:
p-value = 0.9774
⇨p-value > 0.05
Therefore, the null hypothesis of the ADF test (that the series has a unit root) cannot be rejected, so the data cannot be said to be stationary. To make it stationary, take the difference and remove the seasonality.
predict.py
# Plot the first difference (Scatter is assumed to be plotly.graph_objs.Scatter)
data = [Scatter(x=df.index, y=df.follower.diff())]
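Next, remove the seasonal component.
predict.py
# Assumption: res here is a seasonal decomposition result (e.g. from sm.tsa.seasonal_decompose), not the ADF result above
data = [Scatter(x=df.index, y=df.follower-res.seasonal)]
Then run the ADF test again.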
p-value = 1.109e-25
⇨p-value < 0.05
As a result, the time series has been converted into stationary data.
For the SARIMA model, a model is created for the data as follows.
predict.py
# coding:utf-8
from statsmodels.tsa.statespace.sarimax import SARIMAX

# p, d, q and sp, sd, sq, s are placeholders for the orders you choose
model = SARIMAX(
    train,
    order=(p, d, q),
    seasonal_order=(sp, sd, sq, s),
    enforce_stationarity=False,
    enforce_invertibility=False)
result = model.fit()
Here, order = (p, d, q) holds the parameters of the ARIMA model, and seasonal_order = (sp, sd, sq, s) holds the seasonal parameters.
See the following:
・ [statsmodels.tsa.statespace.sarimax.SARIMAX] https://www.statsmodels.org/dev/generated/statsmodels.tsa.statespace.sarimax.SARIMAX.html
・ [Analysis of time series data with SARIMA (prediction of PV number)] https://www.kumilog.net/entry/sarima-pv @xkumiyu
Next, create a Prophet model.
Prophet builds a model just by feeding it the training data. It delivers the experience of "I don't know what I'm doing, but I got something that looks like a prediction." Starting today, anyone can become a data scientist with two seconds of copy and paste.
predict.py
# coding:utf-8
import pandas as pd
import numpy as np
from fbprophet import Prophet  # Prophet 1.0 and later: from prophet import Prophet

data = pd.read_csv('follower.csv')
# Strip the thousands separators (e.g. "39,569") and convert to int
data.follower = data.follower.apply(lambda x: int(x.replace(',', '')))
# The column names must be 'ds' and 'y'
data = data.rename(columns={'date': 'ds', 'follower': 'y'})
model = Prophet()
model.fit(data)
・ SARIMA model
Predictions for the test period from the SARIMA model
2019-12-01 38002.878685
2019-12-02 38001.204647
2019-12-03 37998.080676
2019-12-04 37988.324131
2019-12-05 37981.134367
2019-12-06 37974.569498
2019-12-07 37966.333432
2019-12-08 37958.270232
2019-12-09 37956.258566
2019-12-10 37952.875398
・ Prophet model
Predictions for the test period from the Prophet model
2019-12-01 37958.337506
2019-12-02 37959.963661
2019-12-03 37957.304699
2019-12-04 37943.272430
2019-12-05 37934.533210
2019-12-06 37920.537811
2019-12-07 37908.529618
2019-12-08 37905.819057
2019-12-09 37907.445213
2019-12-10 37904.786251
Numbers alone look a bit lonely, so I'll plot them.
[Overall view]
[Prediction part]
[Enlarged view of the predicted part]
Let's look at the forecast data for the day after the last day of the training data.
date, follower
# Real data
2019-12-01, 38003.000000
# SARIMA
2019-12-01, 38002.878685
# Prophet
2019-12-01, 37958.337506
As you can see from the [Enlarged view of the predicted part], the SARIMA prediction for the day right after the training data almost matches the actual value. Forecasting the very next time point after the training data seems to work well.
Prophet was, honestly, underwhelming.
I thought it might work surprisingly well to train on data through 2019/12/09 and predict the value for 2019/12/10, so I will try that.
The results are below.
date, follower
# Real data
2019-12-10, 37861.000000
# SARIMA
2019-12-10, 37868.158032
That looks good. After all, for a forecast of just one day ahead, it seems to give reasonably good accuracy at a practical level.
As I keep saying, Prophet was honestly underwhelming.
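To put numbers on these impressions, the absolute errors can be computed directly from the values listed above. This is only a small sketch: actual counts are quoted for just two days (2019-12-01 and 2019-12-10), so only those days can be compared, and the helper `abs_errors` is a name I made up.

```python
# Actual follower counts quoted in this article.
actual = {"2019-12-01": 38003.0, "2019-12-10": 37861.0}

# Forecasts from the models trained through 2019/11/30 (values from the tables above).
forecasts = {
    "SARIMA": {"2019-12-01": 38002.878685, "2019-12-10": 37952.875398},
    "Prophet": {"2019-12-01": 37958.337506, "2019-12-10": 37904.786251},
}

# SARIMA retrained through 2019/12/09, predicting one day ahead.
sarima_one_day = {"2019-12-10": 37868.158032}


def abs_errors(pred, truth):
    """Absolute error for every day that has both a prediction and an actual value."""
    return {day: abs(pred[day] - truth[day]) for day in pred if day in truth}


for name, pred in forecasts.items():
    for day, err in abs_errors(pred, actual).items():
        print(f"{name:8s} {day}  abs error = {err:.3f}")

for day, err in abs_errors(sarima_one_day, actual).items():
    print(f"SARIMA (one day ahead) {day}  abs error = {err:.3f}")
```

On these numbers the SARIMA error grows from about 0.12 followers on the first day to about 91.9 on the tenth, and retraining through 2019/12/09 brings the 2019/12/10 error down to about 7.2. Prophet is off by about 44.7 and 43.8 on the same two days.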
Prophet is convenient, but it falls short in practical use. With the SARIMA model, I felt that time series forecasting is usable for one-day-ahead predictions. I would have liked to compare a few more models at once. See you next time.
Also, the number of followers will decrease.