XGBoost is one of the GBDT (gradient boosted decision tree) libraries that can be used from Python. However, when I looked at implementation examples, I was confused because the same library can be used with several different styles of code. The purpose of this article is therefore to do the same thing with each style, mainly as a memorandum for myself. (A detailed explanation of XGBoost itself is omitted.)
pip install xgboost
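Parameter names and default values can differ between xgboost versions, so it may help to check the installed version first. A minimal sketch (the file name is just illustrative):
check_version.py
import xgboost as xgb
#Print the installed xgboost version
print(xgb.__version__)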
import_boston_datasets.py
#First import the libraries used in this article
import pandas as pd
import numpy as np
import xgboost as xgb
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
boston = load_boston()
df_boston = pd.DataFrame(boston.data, columns=boston.feature_names)
#Append the target variable to the end of the dataframe as PRICE so everything can be displayed together
df_boston['PRICE'] = boston.target
print(df_boston.head())
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
make_train_test.py
#Split the data into features x and target variable y
x = df_boston.loc[:,'CRIM':'LSTAT']
y = df_boston['PRICE']
#Split into train data and test data at a ratio of 7:3
trainX, testX, trainY, testY = train_test_split(x, y, test_size=0.3)
#Try to output each shape
print('x.shape = {}'.format(x.shape))
print('y.shape = {}'.format(y.shape))
print('trainX.shape = {}'.format(trainX.shape))
print('trainY.shape = {}'.format(trainY.shape))
print('testX.shape = {}'.format(testX.shape))
print('testY.shape = {}'.format(testY.shape))
# x.shape = (506, 13)
# y.shape = (506,)
# trainX.shape = (354, 13)
# trainY.shape = (354,)
# testX.shape = (152, 13)
# testY.shape = (152,)
Method 1 uses the **scikit-learn compatible API** interface, which you may already be familiar with, so I will start with it. First, the simplest implementation without specifying any parameters.
regression1-1.py
#Since this is regression, use XGBRegressor
reg = xgb.XGBRegressor()
#Set the validation data with eval_set
reg.fit(trainX, trainY,
        eval_set=[(trainX, trainY),(testX, testY)])
#[0] validation_0-rmse:21.5867 validation_1-rmse:21.7497
#[1] validation_0-rmse:19.5683 validation_1-rmse:19.7109
#[2] validation_0-rmse:17.7456 validation_1-rmse:17.8998
#(output omitted)
#[97] validation_0-rmse:1.45198 validation_1-rmse:2.7243
#[98] validation_0-rmse:1.44249 validation_1-rmse:2.72238
#[99] validation_0-rmse:1.43333 validation_1-rmse:2.7233
#Run prediction
predY = reg.predict(testX)
#Display the MSE
print(mean_squared_error(testY, predY))
#7.4163707577050655
If you specify the parameters a little more explicitly, it looks like the following.
regression1-2.py
reg = xgb.XGBRegressor(#Objective function (the default is also the squared error)
                       objective='reg:squarederror',
                       #Number of boosting rounds; set it large because early stopping is used
                       n_estimators=50000,
                       #Which booster to use (the default is also gbtree)
                       booster='gbtree',
                       #Learning rate
                       learning_rate=0.01,
                       #Maximum depth of the trees
                       max_depth=6,
                       #Seed value
                       random_state=2525)
#Prepare a variable to record the learning history
evals_result = {}
reg.fit(trainX, trainY,
        eval_set=[(trainX, trainY),(testX, testY)],
        #Evaluation metric used during training
        eval_metric='rmse',
        #Number of rounds without improvement after which training stops
        early_stopping_rounds=15,
        #Record the learning history into the variable above via the callback API
        callbacks=[xgb.callback.record_evaluation(evals_result)])
#[1] validation_0-rmse:19.5646 validation_1-rmse:19.7128
#[2] validation_0-rmse:17.7365 validation_1-rmse:17.9048
#[3] validation_0-rmse:16.0894 validation_1-rmse:16.2733
#(output omitted)
#[93] validation_0-rmse:0.368592 validation_1-rmse:2.47429
#[94] validation_0-rmse:0.3632 validation_1-rmse:2.47945
#[95] validation_0-rmse:0.356932 validation_1-rmse:2.48028
#Stopping. Best iteration:
#[80] validation_0-rmse:0.474086 validation_1-rmse:2.46597
predY = reg.predict(testX)
print(mean_squared_error(testY, predY))
#6.080995445035289
The loss has improved a little compared to when nothing was specified. There are other parameters that can be specified as well; for details, see the **Scikit-Learn API** section of the XGBoost Python API Reference.
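As a rough illustration only (the values below are arbitrary examples, not tuned recommendations; the parameter names all appear in the default output shown in the next section), a few of the other commonly tuned parameters can be passed in the same way:
ex_other_params.py
#Sketch only: arbitrary example values, not recommendations
reg = xgb.XGBRegressor(objective='reg:squarederror',
                       n_estimators=1000,
                       learning_rate=0.05,
                       max_depth=6,
                       #Fraction of rows sampled for each tree
                       subsample=0.8,
                       #Fraction of columns sampled for each tree
                       colsample_bytree=0.8,
                       #L1 and L2 regularization on the leaf weights
                       reg_alpha=0.0,
                       reg_lambda=1.0,
                       random_state=2525)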
However, there are also parameters that are not documented in this API, which is often confusing. For example, if you print the model created without specifying anything:
ex.py
reg = xgb.XGBRegressor()
print(reg)
#XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
# colsample_bynode=1, colsample_bytree=1, gamma=0,
# importance_type='gain', learning_rate=0.1, max_delta_step=0,
# max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
# n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
# reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
# silent=None, subsample=1, verbosity=1)
You can check the default values of the model this way. But near the bottom there is
silent=None
and even if you check the API, no such parameter is documented in the first place (it appears to be an old parameter that was later superseded by verbosity). Some sites describe it as a flag that toggles output during training, but specifying it did not change anything in particular.
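If you just want the current settings as a dictionary, the scikit-learn compatible estimator also exposes the usual get_params()/set_params() methods; a minimal sketch (the file name is just illustrative):
ex_get_params.py
reg = xgb.XGBRegressor()
#Get the current parameters as a dictionary
print(reg.get_params())
#Change parameters afterwards in the usual scikit-learn way
reg.set_params(max_depth=5, learning_rate=0.05)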
Next is the second method. This uses the library's own **native API**, so the handling of the dataset is slightly different.
regression2-1.py
#Convert the data with xgb.DMatrix so it can be used by the native API
#feature_names does not have to be specified, but it is convenient later, so it is safer to add it
dtrain = xgb.DMatrix(trainX, label=trainY, feature_names = x.columns)
dtest = xgb.DMatrix(testX, label=testY, feature_names = x.columns)
reg = xgb.train(#Pass an empty list on purpose so that training uses the default parameters
                params=[],
                dtrain=dtrain,
                #Set the validation data with evals
                evals=[(dtrain, 'train'), (dtest, 'eval')])
#[0] train-rmse:17.1273 eval-rmse:17.3433
#[1] train-rmse:12.3964 eval-rmse:12.7432
#[2] train-rmse:9.07831 eval-rmse:9.44546
#[3] train-rmse:6.6861 eval-rmse:7.16429
#[4] train-rmse:5.03358 eval-rmse:5.70227
#[5] train-rmse:3.88521 eval-rmse:4.7088
#[6] train-rmse:3.03311 eval-rmse:4.09655
#[7] train-rmse:2.44077 eval-rmse:3.6657
#[8] train-rmse:2.0368 eval-rmse:3.40768
#[9] train-rmse:1.72258 eval-rmse:3.29363
predY = reg.predict(dtest)
print(mean_squared_error(testY, predY))
#10.847961069710934
You can see that the default number of boosting rounds is quite small (training stopped after 10 rounds), so the result is worse than Method 1. Let's set the parameters in the same way as before and run it again.
regression2-2.py
#Convert the data with xgb.DMatrix so it can be used by the native API
dtrain = xgb.DMatrix(trainX, label=trainY, feature_names = x.columns)
dtest = xgb.DMatrix(testX, label=testY, feature_names = x.columns)
#First set the training parameters as xgb_params
xgb_params = {#Objective function
              'objective': 'reg:squarederror',
              #Evaluation metric used during training
              'eval_metric': 'rmse',
              #Which booster to use
              'booster': 'gbtree',
              #Equivalent to learning_rate
              'eta': 0.1,
              #Maximum depth of the trees
              'max_depth': 6,
              #Equivalent to random_state
              'seed': 2525}
#Prepare a variable to record the learning history
evals_result = {}
reg = xgb.train(#Use the training parameters defined above
                params=xgb_params,
                dtrain=dtrain,
                #Number of boosting rounds
                num_boost_round=50000,
                #Number of rounds for early stopping
                early_stopping_rounds=15,
                #Validation data
                evals=[(dtrain, 'train'), (dtest, 'eval')],
                #Pass the variable prepared above
                evals_result=evals_result)
#[1] train-rmse:19.5646 eval-rmse:19.7128
#[2] train-rmse:17.7365 eval-rmse:17.9048
#[3] train-rmse:16.0894 eval-rmse:16.2733
#(output omitted)
#[93] train-rmse:0.368592 eval-rmse:2.47429
#[94] train-rmse:0.3632 eval-rmse:2.47945
#[95] train-rmse:0.356932 eval-rmse:2.48028
#Stopping. Best iteration:
#[80] train-rmse:0.474086 eval-rmse:2.46597
predY = reg.predict(dtest)
print(mean_squared_error(testY, predY))
#6.151798278561384
The loss displayed during training is exactly the same as in Method 1. However, the MSE computed after predict is different... I could not pin down the cause, so I will let it pass here. As you can see by comparing the two code listings, the parameter names and the places where they are written differ slightly, so be careful. For example, if you write the learning rate as learning_rate instead of eta, as in Method 1, the code still runs but the value is not reflected. For these parameters, check XGBoost Parameters and the **Learning API** section of the XGBoost Python API Reference.
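One guess about the MSE difference (an assumption I could not verify): after early stopping, the native API's predict() may use all boosted rounds by default, so limiting prediction to the best iteration might bring the result closer to Method 1. A minimal sketch using the Booster's best_ntree_limit attribute:
ex_best_iteration.py
#Sketch only: best_ntree_limit is set when early stopping triggers;
#ntree_limit restricts how many rounds are used for prediction (0 = all rounds)
predY_best = reg.predict(dtest, ntree_limit=reg.best_ntree_limit)
print(mean_squared_error(testY, predY_best))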
Now, since the loss values are stored in the **evals_result** dictionary prepared at training time, let's use it to graph how the loss evolves.
plot_validation1.py
#plot the loss transition for train data
plt.plot(evals_result['validation_0']['rmse'], label='train rmse')
#plot loss transition for test data
plt.plot(evals_result['validation_1']['rmse'], label='eval rmse')
plt.grid()
plt.legend()
plt.xlabel('rounds')
plt.ylabel('rmse')
plt.show()
Note that in Method 2 the keys under which the results are stored in evals_result differ from Method 1: they are the names given in evals ('train' and 'eval') instead of 'validation_0' and 'validation_1'.
plot_validation2.py
#plot the loss transition for train data
plt.plot(evals_result['train']['rmse'], label='train rmse')
#plot loss transition for test data
plt.plot(evals_result['eval']['rmse'], label='eval rmse')
plt.grid()
plt.legend()
plt.xlabel('rounds')
plt.ylabel('rmse')
plt.savefig("img.png", bbox_inches='tight')
plt.show()
xgboost also provides a method called **xgb.plot_importance()** that plots the feature importance, and it can be used with both Method 1 and Method 2.
plot_importance.py
xgb.plot_importance(reg)
You can also pass importance_type as an argument. According to the API description and the referenced article on importance_type, three types can be used: weight, gain and cover. (Reading the API, it seems the totals, total_gain and total_cover, can also be used instead of the averages.) The default appears to be weight; if you specify gain instead, the output looks like the following.
plot_importance.py
xgb.plot_importance(reg, importance_type='gain')
There is also a **get_score()** method, which lets you get the values used in the graph as a dictionary. However, this method is only available in Method 2 (it belongs to the Booster object), and I am not sure whether there is a built-in way to do the same thing in Method 1... these small inconsistencies bother me. (A possible workaround for Method 1 is sketched after the output below.)
print_importance.py
print(reg.get_score(importance_type='weight'))
#{'LSTAT': 251,
# 'RM': 363,
# 'CRIM': 555,
# 'DIS': 295,
# 'B': 204,
# 'INDUS': 81,
# 'NOX': 153,
# 'AGE': 290,
# 'PTRATIO': 91,
# 'RAD': 41,
# 'ZN': 36,
# 'TAX': 91,
# 'CHAS': 13}
print(reg.get_score(importance_type='gain'))
#{'LSTAT': 345.9503342748026,
# 'RM': 67.2338906183525,
# 'CRIM': 9.066383988597524,
# 'DIS': 20.52948739887609,
# 'B': 5.704856272869067,
# 'INDUS': 6.271976581219753,
# 'NOX': 17.48982672038596,
# 'AGE': 3.396609941187381,
# 'PTRATIO': 15.018738197646142,
# 'RAD': 5.182013825021951,
# 'ZN': 2.7426182845938896,
# 'TAX': 12.025571026275834,
# 'CHAS': 1.172155851074923}
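As a hedged workaround for Method 1 (I have not confirmed this is the officially recommended route): the scikit-learn wrapper exposes the underlying Booster via get_booster(), so the same dictionary should be reachable from there as well:
ex_get_score_sklearn.py
#Sketch only: reach the native Booster behind the scikit-learn wrapper
reg_sk = xgb.XGBRegressor()
reg_sk.fit(trainX, trainY)
booster = reg_sk.get_booster()
print(booster.get_score(importance_type='gain'))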
As mentioned above, feature_names was specified when creating the DMatrix. If you do not specify it, the feature names are not shown on the graph or in the output (they appear as f0, f1, ...), which makes things hard to read, so it is worth setting.
I think the most convenient point of Method 1 is that hyperparameter search is possible with scikit-learn's GridSearchCV. Every article I found that did a parameter search with XGBoost used Method 1. Code using RandomizedSearchCV is shown below for reference.
randomized_search.py
from sklearn.model_selection import RandomizedSearchCV
params = {
'n_estimators':[50000],
'objective':['reg:squarederror'],
'eval_metric': ['rmse'],
'booster': ['gbtree'],
'learning_rate':[0.1,0.05,0.01],
'max_depth':[5,7,10,15],
'random_state':[2525]
}
mod = xgb.XGBRegressor()
#Run a random search over n_iter randomly sampled parameter combinations
rds = RandomizedSearchCV(mod, params, random_state=2525, scoring='r2', n_jobs=1, n_iter=50)
rds.fit(trainX,
trainY,
eval_metric='rmse',
early_stopping_rounds=15,
eval_set=[(testX, testY)])
print(rds.best_params_)
#{'seed': 2525,
# 'objective': 'reg:squarederror',
# 'n_estimators': 50000,
# 'max_depth': 5,
# 'learning_rate': 0.1,
# 'eval_metric': 'rmse',
# 'booster': 'gbtree'}
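Since RandomizedSearchCV refits the best model by default (refit=True), it can be evaluated directly afterwards; a minimal sketch:
ex_best_model.py
#Sketch only: evaluate the refit best estimator on the test data
predY = rds.best_estimator_.predict(testX)
print(mean_squared_error(testY, predY))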
To be honest, the native API feels unfamiliar to me, and I do not think you need to go out of your way to use it. However, as seen above, some methods are only available through it, and there may be other such cases. My conclusion is to implement with Method 1 as the base, and switch to Method 2 only when Method 1 cannot do what you need.
This time, while investigating XGBoost, I have tried to summarize, as simply as possible, the points that confused me because of the mixture of the two styles. I hope someone in a similar situation finds this article and that it helps solve their problem. This is my first ever Qiita article and it surely has shortcomings, but thank you for reading this far. The xgboost library has many other useful methods, such as one that draws the generated trees as diagrams, so I recommend skimming the API and trying various things (a small sketch follows).
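As a small closing sketch (assuming the tree-drawing method mentioned above is xgb.plot_tree(), which typically also requires the graphviz package to be installed):
ex_plot_tree.py
#Sketch only: draw the first boosted tree; num_trees selects which tree to draw
xgb.plot_tree(reg, num_trees=0)
plt.show()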
Python: Try using XGBoost
How to use xgboost: Multi-class classification by iris data
Using XGBoost with Python
Xgboost: How to calculate importance_type of feature_importance
xgboost: Effective machine learning model for table data