Trying to predict cherry blossom flowering with xgboost, using data from March of last year to February of this year

Tags: Python, Beginner, Machine learning
There is an AI-based "AI Sakura" cherry blossom forecast, and an article reported that it uses xgboost, so I tried predicting cherry blossom flowering with xgboost myself. https://www.businessinsider.jp/post-186528
The result was mediocre, which drove home how good the AI Sakura forecast above really is. The factors with the largest effect on the flowering date turned out to be the annual average temperature, the sunshine hours in July, the rainfall in August, and the lowest temperature in October. The annual average temperature I can understand, but the July sunshine hours, the August rainfall, and the October minimum temperature surprised me.
https://www.data.jma.go.jp/gmd/risk/obsdl/index.php
The data above, from the Japan Meteorological Agency, was downloaded and processed for use here.
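The preprocessing itself isn't shown, but judging from the column names further down (avtmp3 ... kumo14), each weather variable gets one column per month, with the suffix running from 3 (March of the previous year) to 14 (February of the flowering year, i.e. month + 12). Purely as a sketch of that reshaping, assuming a hypothetical long-format table with year/month/variable columns (the actual JMA download is formatted differently):
import pandas as pd
# Hypothetical long-format input: one row per (year, month) with weather variables.
# The file name and column names here are assumptions, not the real JMA layout.
monthly = pd.read_csv("jma_monthly.csv")  # columns: year, month, avtmp, maxtmp, ...
# March-December count toward the following year's flowering (suffix 3-12);
# January-February count toward the current year's flowering (suffix 13-14).
monthly["season"] = monthly.year.where(monthly.month >= 3, monthly.year - 1) + 1
monthly["suffix"] = monthly.month.where(monthly.month >= 3, monthly.month + 12)
wide = monthly.pivot(index="season", columns="suffix",
                     values=["avtmp", "maxtmp", "mintmp", "ame", "nisho", "joki", "kumo"])
wide.columns = [f"{var}{m}" for var, m in wide.columns]  # avtmp3, ..., kumo14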
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from xgboost import XGBRegressor, plot_importance
from sklearn.model_selection import GridSearchCV, KFold
from tqdm import tqdm_notebook
path="./"
train = pd.read_csv(path+"kikou5train.csv")
x_test = pd.read_csv(path+"kikou5test.csv")
y_train = train.kaika.copy()
x_train = train.drop("kaika", axis=1)
train_id = x_train.Id
x_train.head()
    date  avtmp3  maxtmp3  mintmp3   ame3  nisho3  joki3  kumo3  avtmp4  maxtmp4  ...  kumo13  avtmp14  maxtmp14  mintmp14  ame14  nisho14  joki14  kumo14  kaika  TotalInc
Id
1   1961     8.2     21.9     -0.4  106.6   181.1    6.7    6.3    14.9     26.0  ...     3.8      5.9      24.5      -2.6   13.5    195.0     4.5     4.1    NaN     193.6
2   1962     8.2     18.8     -0.8   65.5   189.8    6.3    4.7    14.1     24.5  ...     2.0      4.8      15.3      -4.1   21.3    199.9     4.1     4.9    NaN     182.3
Join
Concatenate train and test so that the feature engineering below is applied to both at once.
df = pd.concat([x_train, x_test])
df.head()
Feature Engineering
Added an annual temperature feature (the sum of the twelve monthly average temperatures).
df["TotalInc"] = df.avtmp3 + df.avtmp4 + df.avtmp5 + df.avtmp6 + df.avtmp7 + df.avtmp8 + df.avtmp9 + df.avtmp10 + df.avtmp11 + df.avtmp12 + df.avtmp13 + df.avtmp14 #Of average temperature
df.head()
   date  avtmp3  maxtmp3  mintmp3   ame3  nisho3  joki3  kumo3  avtmp4  maxtmp4  ...  kumo13  avtmp14  maxtmp14  mintmp14  ame14  nisho14  joki14  kumo14  kaika  TotalInc
0  1980     8.2     21.2      1.3  173.5   157.5    6.0    6.2    13.6     24.0  ...     2.9      5.3      17.2      -3.5   38.0    157.3     4.6     5.5    NaN     183.4

1 rows × 87 columns
x_train = df[df.Id.isin(train_id)].set_index("Id")
x_test = df[~df.Id.isin(train_id)].set_index("Id")
# Note: kikou5test.csv still contains the kaika label, so both frames carry a
# kaika column (all NaN on the training rows); strictly it should be dropped here.
Optimal hyperparameter search
random_state = 0
params = {
"learning_rate": [0.01, 0.05, 0.1],
"min_child_weight": [0.1],
"gamma": [0],
"reg_alpha": [0],
"reg_lambda": [1],
"max_depth": [3, 5, 7],
"max_delta_step": [0],
"random_state": [random_state],
"n_estimators": [50, 100, 200],
}
reg = XGBRegressor()
cv = KFold(n_splits=3, shuffle=True, random_state=random_state)
reg_gs = GridSearchCV(reg, params, cv=cv)
reg_gs.fit(x_train, y_train)
GridSearchCV(cv=KFold(n_splits=3, random_state=0, shuffle=True),
estimator=XGBRegressor(base_score=None, booster=None,
colsample_bylevel=None,
colsample_bynode=None,
colsample_bytree=None, gamma=None,
gpu_id=None, importance_type='gain',
interaction_constraints=None,
learning_rate=None, max_delta_step=None,
max_depth=None, min_child_weight=None,
missing=nan, monoto...
num_parallel_tree=None, random_state=None,
reg_alpha=None, reg_lambda=None,
scale_pos_weight=None, subsample=None,
tree_method=None, validate_parameters=None,
verbosity=None),
param_grid={'gamma': [0], 'learning_rate': [0.01, 0.05, 0.1],
'max_delta_step': [0], 'max_depth': [3, 5, 7],
'min_child_weight': [0.1],
'n_estimators': [50, 100, 200], 'random_state': [0],
'reg_alpha': [0], 'reg_lambda': [1]})
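Since no scoring argument was passed, GridSearchCV falls back to XGBRegressor's default .score, which is R²; the best_score_ shown below (about 0.36) is therefore the mean cross-validated R², not an error in days. If you wanted to select hyperparameters by RMSE directly, a minimal variant using scikit-learn's built-in "neg_root_mean_squared_error" scorer would look like this:
# Variant: select hyperparameters by RMSE instead of the default R^2 scorer.
reg_gs_rmse = GridSearchCV(XGBRegressor(), params, cv=cv,
                           scoring="neg_root_mean_squared_error")
reg_gs_rmse.fit(x_train, y_train)
print(-reg_gs_rmse.best_score_)  # mean cross-validated RMSE in days (sign flipped back)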
display(reg_gs.best_params_)
display(reg_gs.best_score_)
ax = plot_importance(reg_gs.best_estimator_, importance_type="gain")
ax.figure.set_size_inches(18, 18)  # enlarge the feature importance plot
{'gamma': 0,
'learning_rate': 0.1,
'max_delta_step': 0,
'max_depth': 5,
'min_child_weight': 0.1,
'n_estimators': 50,
'random_state': 0,
'reg_alpha': 0,
'reg_lambda': 1}
0.36250088820449333
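This plot is where the influential factors mentioned at the top come from. To read the same information numerically, the gain importances can also be pulled straight off the booster (get_score is part of xgboost's Booster API; the sorting is just for readability):
# Numeric view of the gain importances shown in the plot above.
booster = reg_gs.best_estimator_.get_booster()
gains = pd.Series(booster.get_score(importance_type="gain"))
print(gains.sort_values(ascending=False).head(10))  # top 10 features by gain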
Forecast
y_pred3 = reg_gs.predict(x_test)
Evaluate the error against the true labels
y_true = pd.read_csv(path+"kikou5test.csv")
preds = pd.DataFrame({"pred3": y_pred3})
df_out = pd.concat([y_true, preds], axis=1)
df_out.head()
    Id  date  avtmp3  maxtmp3  mintmp3  ame3  nisho3  joki3  kumo3  avtmp4  ...  avtmp14  maxtmp14  mintmp14  ame14  nisho14  joki14  kumo14  kaika      pred3     loss3
0  100  1966     9.6     21.6      1.2  99.9   150.4    7.0    6.6    13.6  ...      4.9      19.1      -4.0   43.8    162.6     5.1     5.0     30  29.816103  0.033818
RMSE
df_out["loss3"] = (df_out.kaika - df_out.pred3)**2
df_out.iloc[:, -3:].mean()
kaika 24.909091
pred3 26.849123
loss3 23.966188
dtype: float64
from sklearn.metrics import mean_squared_error, mean_absolute_error
#RMSE
rmse_kaika = np.sqrt(mean_squared_error(df_out.kaika, df_out.pred3))
rmse_kaika
4.895527368155607
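As a sanity check, √23.966 ≈ 4.896, matching the loss3 mean above. mean_absolute_error was imported but never used; as a small supplement (my addition, not in the original post), the MAE gives the average miss in days directly:
# Supplementary metric: average absolute error in days.
mae_kaika = mean_absolute_error(df_out.kaika, df_out.pred3)
mae_kaika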
So the flowering date can be predicted with an RMSE of just under 5 days. It was more predictable than I expected, but still a mediocre result.