GBDT (Gradient Boosting Decision Tree) is an algorithm often used in data analysis competitions.

G ... Gradient (gradient descent)
B ... Boosting (an ensemble method)
D ... Decision
T ... Tree

In other words, it is a method that combines gradient descent, boosting (an ensemble technique), and decision trees.
Gradient descent is an algorithm that updates the weights little by little to find the point where the error is minimized. Think of it as "smaller error = more accurate prediction".
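As a minimal sketch of the idea (not part of the original article; the error function (w - 3)^2 and the learning rate are illustrative assumptions), gradient descent repeatedly nudges a weight in the direction that reduces the error:

import numpy as np

w = 0.0    # initial weight
lr = 0.1   # learning rate (step size)
for step in range(100):
    grad = 2 * (w - 3)  # gradient of the error (w - 3)**2 at the current weight
    w -= lr * grad      # move the weight a little in the downhill direction

print(w)  # converges toward 3, the point where the error is minimized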
Boosting is an ensemble method that creates a model by combining multiple models. Models of the same type are combined in series, each one trained to correct the predictions of the models before it. By combining many weak learners (models whose individual accuracy is not very high), a strong learner (high accuracy) can be created.
A decision tree is a method of analyzing data by following a tree of conditions. For example, when predicting "whether someone will buy ice cream":

"Temperature 30 °C or above" => will buy
"Temperature below 30 °C" => will not buy

Conditions like these are prepared and followed to make a prediction.
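As a small illustration of this ice-cream example (the temperatures and labels below are made up), a depth-1 decision tree learns exactly this kind of temperature threshold:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# hypothetical toy data: temperature in °C and whether ice cream was bought (1 = yes, 0 = no)
X = np.array([[22], [25], [28], [30], [32], [35]])
y = np.array([0, 0, 0, 1, 1, 1])

tree = DecisionTreeClassifier(max_depth=1, random_state=0)
tree.fit(X, y)

print(tree.predict([[31]]))  # above the learned threshold -> [1] (will buy)
print(tree.predict([[24]]))  # below the threshold -> [0] (will not buy)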
- Features must be numerical values
Each branch of a decision tree checks whether a feature is larger or smaller than a threshold, so the features need to be numerical.
- Can handle missing values
Because samples are simply routed down the branches, missing values can be used without imputation (see the sketch after this list).
- Reflects interactions between variables
Because branching is repeated, interactions between variables are reflected in the model.
- No need to scale features
Scaling such as standardization is not required, because splits depend only on the magnitude relationship (ordering) of the feature values.
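A small sketch of the missing-value point (the data below is synthetic and not from the article): xgboost can train directly on a matrix that still contains NaN, with no imputation or scaling.

import numpy as np
import xgboost as xgb

# synthetic data with missing values (np.nan) left as-is
X_demo = np.array([[1.0, np.nan],
                   [2.0, 10.0],
                   [np.nan, 12.0],
                   [4.0, 13.0]])
y_demo = np.array([0, 0, 1, 1])

dtrain_demo = xgb.DMatrix(X_demo, label=y_demo)  # NaN is treated as "missing" by default
params_demo = {'objective': 'binary:logistic', 'eval_metric': 'logloss'}
bst_demo = xgb.train(params_demo, dtrain_demo, num_boost_round=10)
print(bst_demo.predict(dtrain_demo))  # predictions are produced without any preprocessing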
Accuracy is improved by having each new decision tree correct the difference (residual) between the current predicted values and the target variable, as sketched below.
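Here is a minimal sketch of that residual-correction loop, written with scikit-learn regression trees on made-up data; it is a simplified illustration of the idea, not XGBoost itself and not the article's own code.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# toy regression data (made up for illustration)
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = np.sin(X).ravel()

learning_rate = 0.1
pred = np.zeros_like(y)  # start from a prediction of 0
trees = []

for _ in range(50):
    residual = y - pred                      # difference between target and current prediction
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residual)                    # the next tree learns to correct that difference
    pred += learning_rate * tree.predict(X)  # add its (shrunken) correction
    trees.append(tree)

print(np.mean((y - pred) ** 2))  # the error shrinks as more trees are added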
In this example, we perform binary classification with XGBoost.
import xgboost as xgb
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
%matplotlib inline
df = pd.read_csv('hoge.csv')
df.head()  # check that the data was read correctly

# use 'bar' as the target variable y, and drop both 'foo' and 'bar' from the features X
X = df.drop(['foo', 'bar'], axis=1)
y = df['bar']
X.head()  # check that the columns were dropped
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, shuffle=True)
- test_size: the fraction of the data to hold out as evaluation data (30% for 0.3)
- random_state: the seed used when generating random numbers
- shuffle: whether to shuffle the data randomly before splitting
xgboost requires the dataset to be in DMatrix format.
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
xgb_params = {
    # binary classification (logistic loss, predictions are probabilities)
    'objective': 'binary:logistic',
    # evaluation metric: logloss
    'eval_metric': 'logloss',
}
bst = xgb.train(xgb_params,
                dtrain,
                # number of boosting rounds
                num_boost_round=100,
                # data to monitor each round (required for early stopping)
                evals=[(dtrain, 'train'), (dtest, 'eval')],
                # stop training if 'eval' does not improve for this many rounds
                early_stopping_rounds=10,
                )
# predict() returns probabilities for 'binary:logistic', so convert them to 0/1 labels
y_pred_proba = bst.predict(dtest)
y_pred = np.where(y_pred_proba > 0.5, 1, 0)
acc = accuracy_score(y_test, y_pred)
print('Accuracy:', acc)
This will output the accuracy, so adjust the parameters as needed.
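As a sketch of what adjusting the parameters might look like (the values below are illustrative assumptions, not tuned results, and dtrain / dtest are reused from above), typical XGBoost parameters to try include max_depth, eta, and subsample:

xgb_params = {
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'max_depth': 4,           # depth of each tree (larger = more expressive, easier to overfit)
    'eta': 0.1,               # learning rate (smaller usually needs more boosting rounds)
    'subsample': 0.8,         # fraction of rows sampled for each tree
    'colsample_bytree': 0.8,  # fraction of columns sampled for each tree
}

bst = xgb.train(xgb_params,
                dtrain,
                num_boost_round=500,
                evals=[(dtrain, 'train'), (dtest, 'eval')],
                early_stopping_rounds=10,
                )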
- Python: Try using XGBoost (https://blog.amedama.jp/entry/2019/01/29/235642)
- Intuitively understand the mechanism and procedure of GBDT with figures and concrete examples (https://www.acceluniverse.com/blog/developers/2019/12/gbdt.html)
- Kaggle Master Explains Gradient Boosting (https://qiita.com/woody_egg/items/232e982094cd3c80b3ee)
- Book: Daisuke Kadowaki, Takashi Sakata, Keisuke Hosaka, Yuji Hiramatsu (2019) "Technology for Data Analysis to Win with Kaggle", Gijutsu-Hyoronsha