Trying to predict cherry blossom flowering with xgboost, using data from March of last year to February of this year

Tags: Python, Beginner, Machine learning
There is an AI-based "AI Sakura" cherry blossom forecast, and an article reported that it uses xgboost, so I tried predicting cherry blossom flowering with xgboost myself. https://www.businessinsider.jp/post-186528
The result was mediocre, which drove home how good the AI Sakura forecast above really is. The factors with the largest effect on the flowering date turned out to be the annual average temperature, the sunshine hours in July, the rainfall in August, and the lowest temperature in October. The annual average temperature I can understand, but the July sunshine hours, the August rainfall, and the October minimum temperature surprised me.
https://www.data.jma.go.jp/gmd/risk/obsdl/index.php
The data above, from the Japan Meteorological Agency, was downloaded and processed for use here.
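The preprocessing itself isn't shown, but judging from the column names further down (avtmp3 ... kumo14), each weather variable gets one column per month, with the suffix running from 3 (March of the previous year) to 14 (February of the flowering year, i.e. month + 12). Purely as a sketch of that reshaping, assuming a hypothetical long-format table with year/month/variable columns (the actual JMA download is formatted differently):
import pandas as pd
# Hypothetical long-format input: one row per (year, month) with weather variables.
# The file name and column names here are assumptions, not the real JMA layout.
monthly = pd.read_csv("jma_monthly.csv")  # columns: year, month, avtmp, maxtmp, ...
# March-December count toward the following year's flowering (suffix 3-12);
# January-February count toward the current year's flowering (suffix 13-14).
monthly["season"] = monthly.year.where(monthly.month >= 3, monthly.year - 1) + 1
monthly["suffix"] = monthly.month.where(monthly.month >= 3, monthly.month + 12)
wide = monthly.pivot(index="season", columns="suffix",
                     values=["avtmp", "maxtmp", "mintmp", "ame", "nisho", "joki", "kumo"])
wide.columns = [f"{var}{m}" for var, m in wide.columns]  # avtmp3, ..., kumo14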
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from xgboost import XGBRegressor, plot_importance
from sklearn.model_selection import GridSearchCV, KFold
from tqdm import tqdm_notebook
path="./"
train = pd.read_csv(path+"kikou5train.csv")
x_test = pd.read_csv(path+"kikou5test.csv")
y_train = train.kaika.copy()
x_train = train.drop("kaika", axis=1)
train_id = x_train.Id
x_train.head()
    date  avtmp3  maxtmp3  mintmp3   ame3  nisho3  joki3  kumo3  avtmp4  maxtmp4  ...  kumo13  avtmp14  maxtmp14  mintmp14  ame14  nisho14  joki14  kumo14  kaika  TotalInc
Id
1   1961     8.2     21.9     -0.4  106.6   181.1    6.7    6.3    14.9     26.0  ...     3.8      5.9      24.5      -2.6   13.5    195.0     4.5     4.1    NaN     193.6
2   1962     8.2     18.8     -0.8   65.5   189.8    6.3    4.7    14.1     24.5  ...     2.0      4.8      15.3      -4.1   21.3    199.9     4.1     4.9    NaN     182.3
Join
Concatenate train and test so that the feature engineering below is applied to both at once.
df = pd.concat([x_train, x_test])
df.head()
Feature Engineering
Added an annual temperature feature (the sum of the twelve monthly average temperatures).
df["TotalInc"] = df.avtmp3 + df.avtmp4 + df.avtmp5 + df.avtmp6 + df.avtmp7 + df.avtmp8 + df.avtmp9 + df.avtmp10 + df.avtmp11 + df.avtmp12 + df.avtmp13 + df.avtmp14 #Of average temperature
df.head()
   date  avtmp3  maxtmp3  mintmp3   ame3  nisho3  joki3  kumo3  avtmp4  maxtmp4  ...  kumo13  avtmp14  maxtmp14  mintmp14  ame14  nisho14  joki14  kumo14  kaika  TotalInc
0  1980     8.2     21.2      1.3  173.5   157.5    6.0    6.2    13.6     24.0  ...     2.9      5.3      17.2      -3.5   38.0    157.3     4.6     5.5    NaN     183.4

1 rows × 87 columns
x_train = df[df.Id.isin(train_id)].set_index("Id")
x_test = df[~df.Id.isin(train_id)].set_index("Id")
# Note: kikou5test.csv still contains the kaika label, so both frames carry a
# kaika column (all NaN on the training rows); strictly it should be dropped here.
Optimal hyperparameter search
random_state = 0
params = {
"learning_rate": [0.01, 0.05, 0.1],
"min_child_weight": [0.1],
"gamma": [0],
"reg_alpha": [0],
"reg_lambda": [1],
"max_depth": [3, 5, 7],
"max_delta_step": [0],
"random_state": [random_state],
"n_estimators": [50, 100, 200],
}
reg = XGBRegressor()
cv = KFold(n_splits=3, shuffle=True, random_state=random_state)
reg_gs = GridSearchCV(reg, params, cv=cv)
reg_gs.fit(x_train, y_train)
GridSearchCV(cv=KFold(n_splits=3, random_state=0, shuffle=True),
estimator=XGBRegressor(base_score=None, booster=None,
colsample_bylevel=None,
colsample_bynode=None,
colsample_bytree=None, gamma=None,
gpu_id=None, importance_type='gain',
interaction_constraints=None,
learning_rate=None, max_delta_step=None,
max_depth=None, min_child_weight=None,
missing=nan, monoto...
num_parallel_tree=None, random_state=None,
reg_alpha=None, reg_lambda=None,
scale_pos_weight=None, subsample=None,
tree_method=None, validate_parameters=None,
verbosity=None),
param_grid={'gamma': [0], 'learning_rate': [0.01, 0.05, 0.1],
'max_delta_step': [0], 'max_depth': [3, 5, 7],
'min_child_weight': [0.1],
'n_estimators': [50, 100, 200], 'random_state': [0],
'reg_alpha': [0], 'reg_lambda': [1]})
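Since no scoring argument was passed, GridSearchCV falls back to XGBRegressor's default .score, which is R²; the best_score_ shown below (about 0.36) is therefore the mean cross-validated R², not an error in days. If you wanted to select hyperparameters by RMSE directly, a minimal variant using scikit-learn's built-in "neg_root_mean_squared_error" scorer would look like this:
# Variant: select hyperparameters by RMSE instead of the default R^2 scorer.
reg_gs_rmse = GridSearchCV(XGBRegressor(), params, cv=cv,
                           scoring="neg_root_mean_squared_error")
reg_gs_rmse.fit(x_train, y_train)
print(-reg_gs_rmse.best_score_)  # mean cross-validated RMSE in days (sign flipped back)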
display(reg_gs.best_params_)
display(reg_gs.best_score_)
ax = plot_importance(reg_gs.best_estimator_, importance_type="gain")
ax.figure.set_size_inches(18, 18)  # enlarge the feature importance plot
{'gamma': 0,
'learning_rate': 0.1,
'max_delta_step': 0,
'max_depth': 5,
'min_child_weight': 0.1,
'n_estimators': 50,
'random_state': 0,
'reg_alpha': 0,
'reg_lambda': 1}
0.36250088820449333
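This plot is where the influential factors mentioned at the top come from. To read the same information numerically, the gain importances can also be pulled straight off the booster (get_score is part of xgboost's Booster API; the sorting is just for readability):
# Numeric view of the gain importances shown in the plot above.
booster = reg_gs.best_estimator_.get_booster()
gains = pd.Series(booster.get_score(importance_type="gain"))
print(gains.sort_values(ascending=False).head(10))  # top 10 features by gain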
Forecast
y_pred3 = reg_gs.predict(x_test)
Evaluate the error against the true labels
y_true = pd.read_csv(path+"kikou5test.csv")
preds = pd.DataFrame({"pred3": y_pred3})
df_out = pd.concat([y_true, preds], axis=1)
df_out.head()
    Id  date  avtmp3  maxtmp3  mintmp3  ame3  nisho3  joki3  kumo3  avtmp4  ...  avtmp14  maxtmp14  mintmp14  ame14  nisho14  joki14  kumo14  kaika      pred3     loss3
0  100  1966     9.6     21.6      1.2  99.9   150.4    7.0    6.6    13.6  ...      4.9      19.1      -4.0   43.8    162.6     5.1     5.0     30  29.816103  0.033818
RMSE
df_out["loss3"] = (df_out.kaika - df_out.pred3)**2
df_out.iloc[:, -3:].mean()
kaika 24.909091
pred3 26.849123
loss3 23.966188
dtype: float64
from sklearn.metrics import mean_squared_error, mean_absolute_error
#RMSE
rmse_kaika = np.sqrt(mean_squared_error(df_out.kaika, df_out.pred3))
rmse_kaika
4.895527368155607
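As a sanity check, √23.966 ≈ 4.896, matching the loss3 mean above. mean_absolute_error was imported but never used; as a small supplement (my addition, not in the original post), the MAE gives the average miss in days directly:
# Supplementary metric: average absolute error in days.
mae_kaika = mean_absolute_error(df_out.kaika, df_out.pred3)
mae_kaika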
So the flowering date can be predicted with an RMSE of just under 5 days. It was more predictable than I expected, but still a mediocre result.