XGBoost is one of the GBDT (gradient boosted decision tree) libraries that can be used from Python. However, when I looked at implementation examples, I was confused because the same library can be used with several different styles of code. The purpose of this article is therefore to do the same thing with each style, mainly as a memorandum for myself. (A detailed explanation of XGBoost itself is omitted.)
pip install xgboost
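Parameter names and default values can differ between xgboost versions, so it may help to check the installed version first. A minimal sketch (the file name is just illustrative):
check_version.py
import xgboost as xgb
#Print the installed xgboost version
print(xgb.__version__)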
import_boston_datasets.py
#First import the libraries used in this article
import pandas as pd
import numpy as np
import xgboost as xgb
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
boston = load_boston()
df_boston = pd.DataFrame(boston.data, columns=boston.feature_names)
#Append the target variable to the end of the dataframe as PRICE so everything can be displayed together
df_boston['PRICE'] = boston.target
print(df_boston.head())
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
make_train_test.py
#Split the data into features x and target variable y
x = df_boston.loc[:,'CRIM':'LSTAT']
y = df_boston['PRICE']
#Split into train data and test data at a ratio of 7:3
trainX, testX, trainY, testY = train_test_split(x, y, test_size=0.3)
#Try to output each shape
print('x.shape = {}'.format(x.shape))
print('y.shape = {}'.format(y.shape))
print('trainX.shape = {}'.format(trainX.shape))
print('trainY.shape = {}'.format(trainY.shape))
print('testX.shape = {}'.format(testX.shape))
print('testY.shape = {}'.format(testY.shape))
# x.shape = (506, 13)
# y.shape = (506,)
# trainX.shape = (354, 13)
# trainY.shape = (354,)
# testX.shape = (152, 13)
# testY.shape = (152,)
Method 1 uses the **scikit-learn compatible API** interface, which you may already be familiar with, so I will start with it. First, the simplest implementation without specifying any parameters.
regression1-1.py
#Since this is regression, use XGBRegressor
reg = xgb.XGBRegressor()
#Set the validation data with eval_set
reg.fit(trainX, trainY,
        eval_set=[(trainX, trainY),(testX, testY)])
#[0] validation_0-rmse:21.5867 validation_1-rmse:21.7497
#[1] validation_0-rmse:19.5683 validation_1-rmse:19.7109
#[2] validation_0-rmse:17.7456 validation_1-rmse:17.8998
#(output omitted)
#[97] validation_0-rmse:1.45198 validation_1-rmse:2.7243
#[98] validation_0-rmse:1.44249 validation_1-rmse:2.72238
#[99] validation_0-rmse:1.43333 validation_1-rmse:2.7233
#Run prediction
predY = reg.predict(testX)
#Display the MSE
print(mean_squared_error(testY, predY))
#7.4163707577050655
If you specify the parameters a little more explicitly, it looks like the following.
regression1-2.py
reg = xgb.XGBRegressor(#Objective function (the default is also the squared error)
                       objective='reg:squarederror',
                       #Number of boosting rounds; set it large because early stopping is used
                       n_estimators=50000,
                       #Which booster to use (the default is also gbtree)
                       booster='gbtree',
                       #Learning rate
                       learning_rate=0.01,
                       #Maximum depth of the trees
                       max_depth=6,
                       #Seed value
                       random_state=2525)
#Prepare a variable to record the learning history
evals_result = {}
reg.fit(trainX, trainY,
        eval_set=[(trainX, trainY),(testX, testY)],
        #Evaluation metric used during training
        eval_metric='rmse',
        #Number of rounds without improvement after which training stops
        early_stopping_rounds=15,
        #Record the learning history into the variable above via the callback API
        callbacks=[xgb.callback.record_evaluation(evals_result)])
#[1] validation_0-rmse:19.5646 validation_1-rmse:19.7128
#[2] validation_0-rmse:17.7365 validation_1-rmse:17.9048
#[3] validation_0-rmse:16.0894 validation_1-rmse:16.2733
#(output omitted)
#[93] validation_0-rmse:0.368592 validation_1-rmse:2.47429
#[94] validation_0-rmse:0.3632 validation_1-rmse:2.47945
#[95] validation_0-rmse:0.356932 validation_1-rmse:2.48028
#Stopping. Best iteration:
#[80] validation_0-rmse:0.474086 validation_1-rmse:2.46597
predY = reg.predict(testX)
print(mean_squared_error(testY, predY))
#6.080995445035289
The loss has improved a little compared to when nothing was specified. There are other parameters that can be specified as well; for details, see the **Scikit-Learn API** section of the XGBoost Python API Reference.
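As a rough illustration only (the values below are arbitrary examples, not tuned recommendations; the parameter names all appear in the default output shown in the next section), a few of the other commonly tuned parameters can be passed in the same way:
ex_other_params.py
#Sketch only: arbitrary example values, not recommendations
reg = xgb.XGBRegressor(objective='reg:squarederror',
                       n_estimators=1000,
                       learning_rate=0.05,
                       max_depth=6,
                       #Fraction of rows sampled for each tree
                       subsample=0.8,
                       #Fraction of columns sampled for each tree
                       colsample_bytree=0.8,
                       #L1 and L2 regularization on the leaf weights
                       reg_alpha=0.0,
                       reg_lambda=1.0,
                       random_state=2525)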
However, there are also parameters that are not documented in this API, which is often confusing. For example, if you print the model created without specifying anything:
ex.py
reg = xgb.XGBRegressor()
print(reg)
#XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
# colsample_bynode=1, colsample_bytree=1, gamma=0,
# importance_type='gain', learning_rate=0.1, max_delta_step=0,
# max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
# n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
# reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
# silent=None, subsample=1, verbosity=1)
You can check the default values of the model this way. But near the bottom there is
silent=None
and even if you check the API, no such parameter is documented in the first place (it appears to be an old parameter that was later superseded by verbosity). Some sites describe it as a flag that toggles output during training, but specifying it did not change anything in particular.
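If you just want the current settings as a dictionary, the scikit-learn compatible estimator also exposes the usual get_params()/set_params() methods; a minimal sketch (the file name is just illustrative):
ex_get_params.py
reg = xgb.XGBRegressor()
#Get the current parameters as a dictionary
print(reg.get_params())
#Change parameters afterwards in the usual scikit-learn way
reg.set_params(max_depth=5, learning_rate=0.05)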
Next is the second method. This uses the library's own **native API**, so the handling of the dataset is slightly different.
regression2-1.py
#Convert the data with xgb.DMatrix so it can be used by the native API
#feature_names does not have to be specified, but it is convenient later, so it is safer to add it
dtrain = xgb.DMatrix(trainX, label=trainY, feature_names = x.columns)
dtest = xgb.DMatrix(testX, label=testY, feature_names = x.columns)
reg = xgb.train(#Pass an empty list on purpose so that training uses the default parameters
                params=[],
                dtrain=dtrain,
                #Set the validation data with evals
                evals=[(dtrain, 'train'), (dtest, 'eval')])
#[0] train-rmse:17.1273 eval-rmse:17.3433
#[1] train-rmse:12.3964 eval-rmse:12.7432
#[2] train-rmse:9.07831 eval-rmse:9.44546
#[3] train-rmse:6.6861 eval-rmse:7.16429
#[4] train-rmse:5.03358 eval-rmse:5.70227
#[5] train-rmse:3.88521 eval-rmse:4.7088
#[6] train-rmse:3.03311 eval-rmse:4.09655
#[7] train-rmse:2.44077 eval-rmse:3.6657
#[8] train-rmse:2.0368 eval-rmse:3.40768
#[9] train-rmse:1.72258 eval-rmse:3.29363
predY = reg.predict(dtest)
print(mean_squared_error(testY, predY))
#10.847961069710934
You can see that the default number of boosting rounds is quite small (training stopped after 10 rounds), so the result is worse than Method 1. Let's set the parameters in the same way as before and run it again.
regression2-2.py
#Convert the data with xgb.DMatrix so it can be used by the native API
dtrain = xgb.DMatrix(trainX, label=trainY, feature_names = x.columns)
dtest = xgb.DMatrix(testX, label=testY, feature_names = x.columns)
#First set the training parameters as xgb_params
xgb_params = {#Objective function
              'objective': 'reg:squarederror',
              #Evaluation metric used during training
              'eval_metric': 'rmse',
              #Which booster to use
              'booster': 'gbtree',
              #Equivalent to learning_rate
              'eta': 0.1,
              #Maximum depth of the trees
              'max_depth': 6,
              #Equivalent to random_state
              'seed': 2525}
#Prepare a variable to record the learning history
evals_result = {}
reg = xgb.train(#Use the training parameters defined above
                params=xgb_params,
                dtrain=dtrain,
                #Number of boosting rounds
                num_boost_round=50000,
                #Number of rounds for early stopping
                early_stopping_rounds=15,
                #Validation data
                evals=[(dtrain, 'train'), (dtest, 'eval')],
                #Pass the variable prepared above
                evals_result=evals_result)
#[1] train-rmse:19.5646 eval-rmse:19.7128
#[2] train-rmse:17.7365 eval-rmse:17.9048
#[3] train-rmse:16.0894 eval-rmse:16.2733
#(output omitted)
#[93] train-rmse:0.368592 eval-rmse:2.47429
#[94] train-rmse:0.3632 eval-rmse:2.47945
#[95] train-rmse:0.356932 eval-rmse:2.48028
#Stopping. Best iteration:
#[80] train-rmse:0.474086 eval-rmse:2.46597
predY = reg.predict(dtest)
print(mean_squared_error(testY, predY))
#6.151798278561384
The loss displayed during training is exactly the same as in Method 1. However, the MSE computed after predict is different... I could not pin down the cause, so I will let it pass here. As you can see by comparing the two code listings, the parameter names and the places where they are written differ slightly, so be careful. For example, if you write the learning rate as learning_rate instead of eta, as in Method 1, the code still runs but the value is not reflected. For these parameters, check XGBoost Parameters and the **Learning API** section of the XGBoost Python API Reference.
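One guess about the MSE difference (an assumption I could not verify): after early stopping, the native API's predict() may use all boosted rounds by default, so limiting prediction to the best iteration might bring the result closer to Method 1. A minimal sketch using the Booster's best_ntree_limit attribute:
ex_best_iteration.py
#Sketch only: best_ntree_limit is set when early stopping triggers;
#ntree_limit restricts how many rounds are used for prediction (0 = all rounds)
predY_best = reg.predict(dtest, ntree_limit=reg.best_ntree_limit)
print(mean_squared_error(testY, predY_best))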
Now, since the loss values are stored in the **evals_result** dictionary prepared at training time, let's use it to graph how the loss evolves.
plot_validation1.py
#plot the loss transition for train data
plt.plot(evals_result['validation_0']['rmse'], label='train rmse')
#plot loss transition for test data
plt.plot(evals_result['validation_1']['rmse'], label='eval rmse')
plt.grid()
plt.legend()
plt.xlabel('rounds')
plt.ylabel('rmse')
plt.show()
Note that in Method 2 the keys under which the results are stored in evals_result differ from Method 1: they are the names given in evals ('train' and 'eval') instead of 'validation_0' and 'validation_1'.
plot_validation2.py
#plot the loss transition for train data
plt.plot(evals_result['train']['rmse'], label='train rmse')
#plot loss transition for test data
plt.plot(evals_result['eval']['rmse'], label='eval rmse')
plt.grid()
plt.legend()
plt.xlabel('rounds')
plt.ylabel('rmse')
plt.savefig("img.png", bbox_inches='tight')
plt.show()
xgboost also provides a method called **xgb.plot_importance()** that plots the feature importance, and it can be used with both Method 1 and Method 2.
plot_importance.py
xgb.plot_importance(reg)
You can also pass importance_type as an argument. According to the API description and the referenced article on importance_type, three types can be used: weight, gain and cover. (Reading the API, it seems the totals, total_gain and total_cover, can also be used instead of the averages.) The default appears to be weight; if you specify gain instead, the output looks like the following.
plot_importance.py
xgb.plot_importance(reg, importance_type='gain')
There is also a **get_score()** method, which lets you get the values used in the graph as a dictionary. However, this method is only available in Method 2 (it belongs to the Booster object), and I am not sure whether there is a built-in way to do the same thing in Method 1... these small inconsistencies bother me. (A possible workaround for Method 1 is sketched after the output below.)
print_importance.py
print(reg.get_score(importance_type='weight'))
#{'LSTAT': 251,
# 'RM': 363,
# 'CRIM': 555,
# 'DIS': 295,
# 'B': 204,
# 'INDUS': 81,
# 'NOX': 153,
# 'AGE': 290,
# 'PTRATIO': 91,
# 'RAD': 41,
# 'ZN': 36,
# 'TAX': 91,
# 'CHAS': 13}
print(reg.get_score(importance_type='gain'))
#{'LSTAT': 345.9503342748026,
# 'RM': 67.2338906183525,
# 'CRIM': 9.066383988597524,
# 'DIS': 20.52948739887609,
# 'B': 5.704856272869067,
# 'INDUS': 6.271976581219753,
# 'NOX': 17.48982672038596,
# 'AGE': 3.396609941187381,
# 'PTRATIO': 15.018738197646142,
# 'RAD': 5.182013825021951,
# 'ZN': 2.7426182845938896,
# 'TAX': 12.025571026275834,
# 'CHAS': 1.172155851074923}
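As a hedged workaround for Method 1 (I have not confirmed this is the officially recommended route): the scikit-learn wrapper exposes the underlying Booster via get_booster(), so the same dictionary should be reachable from there as well:
ex_get_score_sklearn.py
#Sketch only: reach the native Booster behind the scikit-learn wrapper
reg_sk = xgb.XGBRegressor()
reg_sk.fit(trainX, trainY)
booster = reg_sk.get_booster()
print(booster.get_score(importance_type='gain'))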
As mentioned above, feature_names was specified when creating the DMatrix. If you do not specify it, the feature names are not shown on the graph or in the output (they appear as f0, f1, ...), which makes things hard to read, so it is worth setting.
I think the most convenient point of Method 1 is that hyperparameter search is possible with scikit-learn's GridSearchCV. Every article I found that did a parameter search with XGBoost used Method 1. Code using RandomizedSearchCV is shown below for reference.
randomized_search.py
from sklearn.model_selection import RandomizedSearchCV
params = {
'n_estimators':[50000],
'objective':['reg:squarederror'],
'eval_metric': ['rmse'],
'booster': ['gbtree'],
'learning_rate':[0.1,0.05,0.01],
'max_depth':[5,7,10,15],
'random_state':[2525]
}
mod = xgb.XGBRegressor()
#Run a random search over n_iter randomly sampled parameter combinations
rds = RandomizedSearchCV(mod, params, random_state=2525, scoring='r2', n_jobs=1, n_iter=50)
rds.fit(trainX,
trainY,
eval_metric='rmse',
early_stopping_rounds=15,
eval_set=[(testX, testY)])
print(rds.best_params_)
#{'seed': 2525,
# 'objective': 'reg:squarederror',
# 'n_estimators': 50000,
# 'max_depth': 5,
# 'learning_rate': 0.1,
# 'eval_metric': 'rmse',
# 'booster': 'gbtree'}
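Since RandomizedSearchCV refits the best model by default (refit=True), it can be evaluated directly afterwards; a minimal sketch:
ex_best_model.py
#Sketch only: evaluate the refit best estimator on the test data
predY = rds.best_estimator_.predict(testX)
print(mean_squared_error(testY, predY))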
To be honest, the native API feels unfamiliar to me, and I do not think you need to go out of your way to use it. However, as seen above, some methods are only available through it, and there may be other such cases. My conclusion is to implement with Method 1 as the base, and switch to Method 2 only when Method 1 cannot do what you need.
This time, while investigating XGBoost, I have tried to summarize, as simply as possible, the points that confused me because of the mixture of the two styles. I hope someone in a similar situation finds this article and that it helps solve their problem. This is my first ever Qiita article and it surely has shortcomings, but thank you for reading this far. The xgboost library has many other useful methods, such as one that draws the generated trees as diagrams, so I recommend skimming the API and trying various things (a small sketch follows).
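As a small closing sketch (assuming the tree-drawing method mentioned above is xgb.plot_tree(), which typically also requires the graphviz package to be installed):
ex_plot_tree.py
#Sketch only: draw the first boosted tree; num_trees selects which tree to draw
xgb.plot_tree(reg, num_trees=0)
plt.show()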
Python: Try using XGBoost
How to use xgboost: Multi-class classification by iris data
Using XGBoost with Python
Xgboost: How to calculate importance_type of feature_importance
xgboost: Effective machine learning model for table data