We will analyze the number of coronavirus patients in Japan and build predictive models.
Here I will try three methods: Prophet, a forecasting library provided by Facebook; random forest regression; and XGBoost.
I use the Jupyter Notebook that is installed together with Anaconda on Windows 10.
Version information: Jupyter Notebook 6.1.4, Python 3.8.5
I installed fbprophet, plotly, and xgboost from the Anaconda Prompt with
conda install -c conda-forge fbprophet
pip install plotly
pip install xgboost
and then imported the libraries in Jupyter Notebook as follows.
import pandas as pd
import seaborn as sns
import numpy as np
from scipy.stats import norm
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
pd.pandas.set_option('display.max_columns', None)
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel
pd.pandas.set_option('display.max_columns', None)
import matplotlib.pyplot as plt
from fbprophet import Prophet
import plotly.express as px
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import accuracy_score
from sklearn import metrics
from sklearn.model_selection import ParameterGrid
from tqdm import tqdm
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
- nhk_news_covid19_domestic_daily_data.csv: daily coronavirus patient count data released by NHK
Data provider: https://www3.nhk.or.jp/news/special/coronavirus/data-all/
Display patient count data as a data frame:
path = "../input/covid19data_from_NHK/nhk_news_covid19_domestic_daily_data.csv"
df_Jap = pd.read_csv(path,encoding = 'utf-8')
df_Jap
Graph display:
fig = px.area(df_Jap, x="date", y="Number of infected people in Japan_Number of announcements per day", height=600, width=700,
color_discrete_sequence = ["blue"])
fig.update_layout(xaxis_rangeslider_visible=True)
fig.show()
You can see the first wave around April 2020, the second wave around August, and the third wave around December.
I split the data into training data and test data as follows: training data from 2020/7/14 to 2020/11/30, test data from 2020/12/1 to 2020/12/31. The reason for starting on July 14, 2020 is that before then the number of patients was close to 0, so the characteristics of the data would be different.
Here, the "date" column is converted to datetime64 type by the to_datetime method of pandas to make it easier to handle.
df_Jap['Date (datetime64)'] = pd.to_datetime(df_Jap['date'], errors='coerce')
import datetime as dt
Train_Jap=df_Jap[(dt.datetime(2020,7,14) <= df_Jap['Date (datetime64)']) & (df_Jap['Date (datetime64)'] <= dt.datetime(2020,11,30) )]
Test_Jap=df_Jap[(dt.datetime(2020,12,1) <= df_Jap['Date (datetime64)']) & (df_Jap['Date (datetime64)'] <= dt.datetime(2020,12,31) )]
Prophet is a time series forecasting library developed by Facebook; the details are described here: https://facebook.github.io/prophet/docs/quick_start.html
Features:
- The input to Prophet must be a DataFrame with columns ds and y
- ds is the date stamp (date), and y is the numerical measurement you want to predict
Renamed for use with Prophet:
Train_Jap=Train_Jap[["date","Number of infected people in Japan_Number of announcements per day"]].rename(columns={"date":"ds","Number of infected people in Japan_Number of announcements per day":"y"})
Test_Jap=Test_Jap[["date","Number of infected people in Japan_Number of announcements per day"]].rename(columns={"date":"ds","Number of infected people in Japan_Number of announcements per day":"y"})
The Prophet model is trained on the training data (Train_Jap) and used to make predictions for the test data (Test_Jap). The following settings are available.
growth: you can choose growth='linear' or growth='logistic' (logistic function). With growth='logistic', you can set an upper limit ('cap', the carrying capacity) and a lower limit for the data, and that option is used here. 'cap' is set to 10000 for the sake of graph readability (any value is fine as long as the number of infected people per day does not exceed it).
Train_Jap['cap']=10000
Test_Jap['cap']=10000
changepoints: '2020/8/7', '2020/9/7', '2020/10/24', and '2020/11/20' are set as trend change points (read visually from the graph).
seasonality_mode: you can choose between seasonality_mode='additive' and seasonality_mode='multiplicative'. Looking at the graph above, the weekly periodicity grows in proportion to the number of infected people, so multiplicative seasonality is adopted.
seasonality_prior_scale: a parameter that adjusts how strongly the model's periodic components fit the data. For now it is set to 1; later, multiple values will be tried with a grid search.
changepoint_prior_scale: represents the spread of the Laplace distribution used as the prior for the trend change terms; the larger it is, the more easily change points are detected. If this term is made too large, a great many points will be detected as change points. For now it is set to 1; later, multiple values will be tried with a grid search.
Information source: https://www.atmarkit.co.jp/ait/articles/1906/07/news004_2.html
interval_width: represents the width of the uncertainty interval, 0.8 by default. Multiple values will be tried later in the grid search.
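As a rough sketch of how these pieces fit together (my paraphrase of the Prophet documentation, not part of the original post): the additive model is y(t) = g(t) + s(t) + h(t) + error, where g(t) is the trend, s(t) the seasonality, and h(t) holiday effects, while with multiplicative seasonality the seasonal component scales with the trend, roughly y(t) = g(t) * (1 + s(t)).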
model=Prophet(growth='logistic', seasonality_mode='multiplicative', seasonality_prior_scale=1, changepoint_prior_scale=1, changepoints=['2020/8/7', '2020/9/7', '2020/10/24', '2020/11/20'])
model.fit(Train_Jap)
forecast = model.predict(Test_Jap)
fig = model.plot_components(forecast)
The learned trend (the movement excluding the yearly and weekly cycles) and the weekly pattern are shown below. The yearly cycle is not displayed because there is not a full year of data. I knew that the number of reported infections drops on Mondays because few tests are run on Sundays, but conversely there seems to be a tendency for the counts on Thursday, Friday, and Saturday to be higher:
View the training data and the forecast results. The trend change points that were set are displayed as red dashed lines, the forecast trend line as a red line, and the forecast itself as a blue line.
from fbprophet.plot import add_changepoints_to_plot
plot = model.plot(forecast)
a = add_changepoints_to_plot(plot.gca(), model, forecast)
plt.ylim([-0, 5000])
Add the predicted value yhat obtained by training to Test_Jap as a column:
Test_Jap['yhat']=forecast['yhat'].values
Compare the actual number of infected people in December with the predicted results:
plt.figure(figsize=(20, 8))
plt.plot(Test_Jap['ds'], Test_Jap['y'], 'b-', label = 'Actual')
plt.plot(Test_Jap['ds'], Test_Jap['yhat'], 'r--', label = 'Prediction')
plt.xlabel('Date',rotation=90); plt.ylabel('Number of infected people'); plt.title('Actual vs Prediction')
plt.xticks(rotation=90)
plt.legend();
Let's calculate the accuracy:
Test_Jap['diff']=(Test_Jap.y-Test_Jap.yhat).abs()
acc_ts2=(1-(Test_Jap['diff'].sum()/Test_Jap['y'].sum()))*100
acc_ts2
#85.55183564855452
The accuracy this time was about 86%. Next, let's calculate MAE (Mean Absolute Error), MSE (Mean Squared Error), and RMSE (Root Mean Squared Error) as metrics.
MAE_ts=metrics.mean_absolute_error(Test_Jap['y'], Test_Jap['yhat'])
MSE_ts=metrics.mean_squared_error(Test_Jap['y'], Test_Jap['yhat'])
RMSE_ts=np.sqrt(metrics.mean_squared_error(Test_Jap['y'], Test_Jap['yhat']))
print('MAE:', MAE_ts)
print('MSE:', MSE_ts)
print('RMSE:', RMSE_ts)
#MAE: 404.44140578238205
#MSE: 305342.084316079
#RMSE: 552.5776726543328
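For reference, these metrics and the "accuracy" above correspond to the following simple calculations (a sketch written with numpy on the same Test_Jap columns, equivalent to the sklearn calls; not code from the original post):
y = Test_Jap['y'].values          # actual values
yhat = Test_Jap['yhat'].values    # predicted values
mae = np.mean(np.abs(y - yhat))                     # Mean Absolute Error
mse = np.mean((y - yhat) ** 2)                      # Mean Squared Error
rmse = np.sqrt(mse)                                 # Root Mean Squared Error
acc = (1 - np.abs(y - yhat).sum() / y.sum()) * 100  # the "accuracy" used above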
The result is not bad as it is, but below we tune the parameters with a grid search.
The following values are set for 'changepoint_prior_scale', 'seasonality_prior_scale', and 'interval_width'.
params_grid = {'changepoint_prior_scale':[0.1, 0.5, 1, 2, 10],
               'seasonality_prior_scale':[0.1, 0.5, 1, 2, 10],
               'interval_width':[0.8, 0.85, 0.9, 0.95]
              }
grid = ParameterGrid(params_grid)
model_parameters = pd.DataFrame(columns = ['Acc','Parameters'])
for p in tqdm(grid):
    Train=Train_Jap.copy()
    Valid=Test_Jap[['ds','y','cap']].reset_index()
    m=Prophet(growth='logistic',
              seasonality_mode='multiplicative',
              seasonality_prior_scale=p['seasonality_prior_scale'],
              changepoint_prior_scale=p['changepoint_prior_scale'],
              changepoints=['2020/8/7', '2020/9/7', '2020/10/24', '2020/11/20'],
              interval_width = p['interval_width']
              )
    m.fit(Train_Jap)
    forecast = m.predict(Valid[['ds','cap']])
    forecast = forecast.astype({"ds": object,"cap": object})
    Valid=pd.concat([Valid.reset_index()[['ds','y']],forecast['yhat']], axis=1)
    #performance metric
    Valid['diff']=(Valid.y-Valid.yhat).abs()
    acc=(1-((Valid['diff'].sum()/Valid['y'].sum())))*100
    model_parameters = model_parameters.append({'Acc':acc,'Parameters':p},ignore_index=True)
parameters = model_parameters.sort_values(by=['Acc'],ascending=False)
parameters = parameters.reset_index(drop=True)
best_parameters=parameters['Parameters'][0]
The best parameters turned out to be 'changepoint_prior_scale': 0.5, 'interval_width': 0.8, 'seasonality_prior_scale': 2. Let's draw the graph and calculate the accuracy again using these parameters.
m = Prophet(growth='logistic',
seasonality_mode='multiplicative',
seasonality_prior_scale=best_parameters['seasonality_prior_scale'],
changepoint_prior_scale=best_parameters['changepoint_prior_scale'],
changepoints=['2020/8/7', '2020/9/7', '2020/10/24', '2020/11/20'],
interval_width =best_parameters['interval_width']
)
m.fit(Train_Jap)
forecast=m.predict(Test_Jap)
from fbprophet.plot import add_changepoints_to_plot
plot = m.plot(forecast)
a = add_changepoints_to_plot(plot.gca(), m, forecast)
plt.ylim([-0, 5000])
Test_Jap['yhat']=forecast['yhat'].values
plt.figure(figsize=(20, 8))
plt.plot(Test_Jap['ds'], Test_Jap['y'], 'b-', label = 'Actual')
plt.plot(Test_Jap['ds'], Test_Jap['yhat'], 'r--', label = 'Prediction')
plt.xlabel('Date',rotation=90); plt.ylabel('Number of infected people'); plt.title('Actual vs Prediction')
plt.xticks(rotation=90)
plt.legend();
Test_Jap['diff']=(Test_Jap.y-Test_Jap.yhat).abs()
acc_ts2=(1-(Test_Jap['diff'].sum()/Test_Jap['y'].sum()))*100
acc_ts2
#90.92890097106074
The accuracy was about 91% and the MAE, MSE, and RMSE were as follows:
MAE_ts=metrics.mean_absolute_error(Test_Jap['y'], Test_Jap['yhat'])
MSE_ts=metrics.mean_squared_error(Test_Jap['y'], Test_Jap['yhat'])
RMSE_ts=np.sqrt(metrics.mean_squared_error(Test_Jap['y'], Test_Jap['yhat']))
print('MAE:', MAE_ts)
print('MSE:', MSE_ts)
print('RMSE:', RMSE_ts)
#MAE: 253.92347110782634
#MSE: 111763.48089497152
#RMSE: 334.31045585648604
For random forest regression, we use the month, day, and day of the week as explanatory variables (variables thought to affect the number of coronavirus patients).
It might be more accurate to also use the number of PCR tests or the number of travelers from overseas as explanatory variables, but such data are not available when predicting the number of patients beyond the present, so they are not used here.
def create_date_features(df):
    df['Month']=df['Date (datetime64)'].dt.month
    df['Day']=df['Date (datetime64)'].dt.day
    df['Day of the week (number)'] = df['Date (datetime64)'].dt.weekday
    return df
reg_Jap=create_date_features(df_Jap)
def getdayofweek(day):
    if (day == 0):
        return "Monday"
    elif(day == 1):
        return "Tuesday"
    elif(day == 2):
        return "Wednesday"
    elif(day == 3):
        return "Thursday"
    elif(day == 4):
        return "Friday"
    elif(day == 5):
        return "Saturday"
    elif(day == 6):
        return "Sunday"
reg_Jap['Day of the week'] = reg_Jap['Day of the week (number)'].apply(getdayofweek)
reg_Jap=pd.get_dummies(reg_Jap,columns=['Day of the week'])
reg_Jap
Training data:
Train_reg_Jap=reg_Jap[(dt.datetime(2020,7,14) <= reg_Jap['Date (datetime64)']) & (reg_Jap['Date (datetime64)'] <= dt.datetime(2020,11,30) )]
x_train_reg_Jap=Train_reg_Jap[['Month','Day','Day of the week_Monday','Day of the week_Tuesday','Day of the week_Wednesday','Day of the week_Thursday','Day of the week_Friday','Day of the week_Saturday','Day of the week_Sunday']]
y_train_reg_Jap=Train_reg_Jap[['Number of infected people in Japan_Number of announcements per day']]
Test data:
Test_reg_Jap=reg_Jap[(dt.datetime(2020,12,1) <= reg_Jap['Date (datetime64)']) & (reg_Jap['Date (datetime64)'] <= dt.datetime(2020,12,31) )]
x_test_reg_Jap=Test_reg_Jap[['Month','Day','Day of the week_Monday','Day of the week_Tuesday','Day of the week_Wednesday','Day of the week_Thursday','Day of the week_Friday','Day of the week_Saturday','Day of the week_Sunday']]
y_test_reg_Jap=Test_reg_Jap[['Number of infected people in Japan_Number of announcements per day']]
Perform learning. First, set only n_estimators = 100 as a parameter, and use the default values for the others.
- n_estimators: Number of decision trees
The parameters for random forest regression can be found here. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor :
rf_Jap = RandomForestRegressor(n_estimators=100)
rf_Jap.fit(x_train_reg_Jap,y_train_reg_Jap)
pred_rf = rf_Jap.predict(x_test_reg_Jap)
Accuracy calculation:
y_test_reg_Jap['diff']=abs((y_test_reg_Jap['Number of infected people in Japan_Number of announcements per day'].values-pred_rf))
acc_rf=(1-(y_test_reg_Jap['diff'].sum()/y_test_reg_Jap['Number of infected people in Japan_Number of announcements per day'].sum()))*100
acc_rf
#57.238258785021955
The accuracy was a low value of about 57%. Let's tune the following parameters with a grid search.
- max_features: Number of features considered for each split. With "auto", max_features = n_features; with "sqrt", max_features = sqrt(n_features); with "log2", max_features = log2(n_features).
- min_samples_split: Minimum number of samples required to split a node
- bootstrap: Whether to use bootstrap sampling (a method of randomly drawing data with replacement)
param_grid = {
"n_estimators" : [10,20,300,100,200,500],
"max_features" : ["auto", "sqrt", "log2"],
"min_samples_split" : [2,4,6,8],
"bootstrap": [True, False],
}
grid = ParameterGrid(param_grid)
model_parameters = pd.DataFrame(columns = ['Acc','Parameters'])
for p in tqdm(grid):
    X_Train=x_train_reg_Jap.copy()
    Y_Train=y_train_reg_Jap.copy()
    X_Valid=x_test_reg_Jap.copy()
    Y_Valid=y_test_reg_Jap.copy()
    m = RandomForestRegressor(n_estimators = p['n_estimators'],
                              max_features = p['max_features'],
                              min_samples_split = p['min_samples_split'],
                              bootstrap=p['bootstrap']
                              )
    m.fit(X_Train,Y_Train)
    pred_rf2 = m.predict(X_Valid)
    Y_Valid['yhat']=pred_rf2
    #performance metric
    Y_Valid_diff=(Y_Valid['Number of infected people in Japan_Number of announcements per day']-Y_Valid.yhat).abs()
    acc=(1-((Y_Valid_diff.sum()/Y_Valid['Number of infected people in Japan_Number of announcements per day'].sum())))*100
    model_parameters = model_parameters.append({'Acc':acc,'Parameters':p},ignore_index=True)
parameters = model_parameters.sort_values(by=['Acc'],ascending=False)
parameters = parameters.reset_index(drop=True)
best_parameters=parameters['Parameters'][0]
m2 = RandomForestRegressor(
bootstrap=best_parameters['bootstrap'],
max_features=best_parameters['max_features'],
min_samples_split=best_parameters['min_samples_split'],
n_estimators=best_parameters['n_estimators']
)
m2.fit(x_train_reg_Jap,y_train_reg_Jap)
pred_rf2 = m2.predict(x_test_reg_Jap)
y_test_reg_Jap_diff2=abs((y_test_reg_Jap['Number of infected people in Japan_Number of announcements per day'].values-pred_rf2))
acc_rf2=(1-(y_test_reg_Jap_diff2.sum()/y_test_reg_Jap['Number of infected people in Japan_Number of announcements per day'].sum()))*100
acc_rf2
#57.96572037955858
Even with the best parameters, the accuracy was almost unchanged at about 57%. Also, because of the randomness in random forest regression, there were runs of the code above in which the accuracy with the best parameters actually came out lower.
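As a side note, one way to make the random forest result reproducible between runs is to fix the random seed. This is a hypothetical addition of mine, not part of the original code:
# Fixing random_state makes RandomForestRegressor return the same result on every run
rf_fixed = RandomForestRegressor(n_estimators=100, random_state=0)
rf_fixed.fit(x_train_reg_Jap, y_train_reg_Jap)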
XGBoost (extreme gradient boosting) is a type of gradient boosting; its characteristic is that each new decision tree is built using the results of the previous trees, fitting the error between the measured values and the predicted values.
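To illustrate this idea, here is a minimal conceptual sketch of residual-based boosting written with plain scikit-learn decision trees. It only illustrates the general principle and is not how XGBoost is actually implemented (XGBoost additionally uses gradients, regularization, and many optimizations):
from sklearn.tree import DecisionTreeRegressor

def simple_boosting(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    base = y.mean()                  # start from a constant prediction (the mean of y)
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residual = y - pred                             # error of the current ensemble
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)                           # fit the next tree to that error
        pred = pred + learning_rate * tree.predict(X)   # add its shrunken contribution
        trees.append(tree)
    return base, trees

def boosting_predict(base, trees, X, learning_rate=0.1):
    return base + learning_rate * sum(t.predict(X) for t in trees)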
The parameters of XGBoost are described here. https://xgboost.readthedocs.io/en/latest/parameter.html
Also, the explanation on this site was easy to understand. http://kamonohashiperry.com/archives/209
- nthread: Number of parallel threads used for computation
- learning_rate: Learning rate
- max_depth: Maximum depth of each decision tree
- min_child_weight: Minimum sum of instance weights required in a child node
- subsample: Fraction of samples randomly drawn for each tree
- colsample_bytree: Fraction of columns randomly drawn for each tree
The data uses x_train_reg_Jap, y_train_reg_Jap, x_test_reg_Jap, and y_test_reg_Jap used in random forest regression. (Explanatory variables are month, day and day of the week)
Execution of learning:
xgb_Jap = XGBRegressor(n_estimators=100)
xgb_Jap.fit(x_train_reg_Jap,y_train_reg_Jap)
xgb_pred = xgb_Jap.predict(x_test_reg_Jap)
Accuracy calculation:
y_test_reg_Jap['diff']=(y_test_reg_Jap['Number of infected people in Japan_Number of announcements per day']-xgb_pred).abs()
acc_xg=(1-(y_test_reg_Jap['diff'].sum()/y_test_reg_Jap['Number of infected people in Japan_Number of announcements per day'].sum()))*100
acc_xg
#57.22788239483825
param_grid = {
'nthread':[4], #when use hyperthread, xgboost may become slower,
'learning_rate': [.03, 0.05, .07], #so called `eta` value
'max_depth': [5, 6, 7],
'min_child_weight': [1,4],
'subsample': [0.7],
'colsample_bytree': [0.7],
'n_estimators': [100,200,500]
}
grid = ParameterGrid(param_grid)
model_parameters = pd.DataFrame(columns = ['Acc','Parameters'])
for p in tqdm(grid):
    X_Train=x_train_reg_Jap.copy()
    Y_Train=y_train_reg_Jap.copy()
    X_Valid=x_test_reg_Jap.copy()
    Y_Valid=y_test_reg_Jap.copy()
    m = XGBRegressor(nthread = p['nthread'],
                     learning_rate = p['learning_rate'],
                     max_depth=p['max_depth'],
                     min_child_weight = p['min_child_weight'],
                     subsample = p['subsample'],
                     colsample_bytree=p['colsample_bytree'],
                     n_estimators=p['n_estimators'] )
    m.fit(X_Train,Y_Train)
    pred_xg2 = m.predict(X_Valid)
    Y_Valid['yhat']=pred_xg2
    #performance metric
    Y_Valid['diff']=(Y_Valid['Number of infected people in Japan_Number of announcements per day']-Y_Valid.yhat).abs()
    acc=(1-((Y_Valid['diff'].sum()/Y_Valid['Number of infected people in Japan_Number of announcements per day'].sum())))*100
    model_parameters = model_parameters.append({'Acc':acc,'Parameters':p},ignore_index=True)
parameters = model_parameters.sort_values(by=['Acc'],ascending=False)
parameters = parameters.reset_index(drop=True)
best_parameters=parameters['Parameters'][0]
m = XGBRegressor(
nthread=best_parameters['nthread'],
learning_rate=best_parameters['learning_rate'],
max_depth=best_parameters['max_depth'],
min_child_weight=best_parameters['min_child_weight'],
subsample=best_parameters['subsample'],
colsample_bytree=best_parameters['colsample_bytree'],
n_estimators=best_parameters['n_estimators']
)
m.fit(x_train_reg_Jap,y_train_reg_Jap)
pred_xg2 = m.predict(x_test_reg_Jap)
y_test_reg_Jap['diff']=(y_test_reg_Jap['Number of infected people in Japan_Number of announcements per day']-pred_xg2).abs()
acc_xg2=(1-(y_test_reg_Jap['diff'].sum()/y_test_reg_Jap['Number of infected people in Japan_Number of announcements per day'].sum()))*100
acc_xg2
#57.36199327011835
Again, even with the best parameters, the accuracy was almost the same as before.
With these three methods (Prophet, random forest regression, and XGBoost), models were trained on the number of coronavirus infections in Japan from July 14 to November 30, 2020, and the number of infections from December 1 to December 31 was predicted. The accuracy was about 91% for Prophet, but as low as about 57% for random forest regression and XGBoost.
I think this is not unreasonable, since only the month, day, and day of the week could be used as explanatory variables.
My impression is that Prophet reads and extrapolates the pattern of the two-dimensional graph, and that it is better suited than random forest regression or XGBoost for predicting the number of coronavirus patients in a case like this.