**xgboost** is a library that implements **GBDT**, a type of decision tree model. This article summarizes the steps to install and use it. The library is available for various languages, but here I describe how to use it from Python.
- A type of decision tree model
- A gradient boosting tree
Random forest is another well-known decision tree-based model; the differences are briefly summarized in the following article: [Machine learning] I tried to summarize the differences between the decision tree models --Qiita
- Easy to get good accuracy
- Can handle missing values
- Can handle numerical data
Because it is easy to use and accurate, it is popular in Kaggle, the machine learning competition platform.
I used the iris dataset (iris variety data), one of the scikit-learn datasets. The OS is Amazon Linux 2.
I'm running Amazon Linux 2. The installation procedure for each environment is described in the official documentation: Installation Guide — xgboost 1.1.0-SNAPSHOT documentation
pip3 install xgboost
import xgboost as xgb
There are no special steps: load the iris data and create a pandas DataFrame and Series.
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
iris_data = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_target = pd.Series(iris.target)
Again, there are no special steps: scikit-learn's `train_test_split` splits the data into training and test sets.
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(iris_data, iris_target, test_size=0.2, shuffle=True)
xgboost uses its own data structure called `DMatrix`.
dtrain = xgb.DMatrix(train_x, label=train_y)
A `DMatrix` can be created from a numpy `ndarray` or a pandas `DataFrame`, so you won't have any trouble handling the data.
The types of data that can be handled are detailed in the official documentation: Python Package Introduction — xgboost 1.1.0-SNAPSHOT documentation
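For example, a minimal sketch of building a `DMatrix` directly from a numpy `ndarray` (the array here is dummy data for illustration):

```python
import numpy as np
import xgboost as xgb

# Dummy data for illustration: 5 samples, 3 features, labels in {0, 1, 2}
data = np.random.rand(5, 3)
label = np.random.randint(0, 3, size=5)

dmat = xgb.DMatrix(data, label=label)
```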
Set various parameters.
param = {'max_depth': 2, 'eta': 1, 'objective': 'multi:softmax', 'num_class': 3}
The meaning of each parameter is as follows.
| Parameter name | Meaning |
|---|---|
| max_depth | Maximum depth of the tree |
| eta | Learning rate |
| objective | Learning objective |
| num_class | Number of classes |
Specify the learning objective (regression, classification, etc.) in `objective`. Since this is multi-class classification, `multi:softmax` is specified.
The details are in the official documentation: XGBoost Parameters — xgboost 1.1.0-SNAPSHOT documentation
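For reference, a sketch of parameter dictionaries for other common tasks (the values other than `objective` are just examples):

```python
# Binary classification: predicts the probability of the positive class
param_binary = {'max_depth': 2, 'eta': 1, 'objective': 'binary:logistic'}

# Regression: squared error loss (num_class is not needed here)
param_reg = {'max_depth': 2, 'eta': 1, 'objective': 'reg:squarederror'}
```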
`num_round` is the number of boosting rounds.
num_round = 10
bst = xgb.train(param, dtrain, num_round)
dtest = xgb.DMatrix(test_x)
pred = bst.predict(dtest)
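With `'multi:softmax'`, `predict` returns the predicted class label for each row rather than probabilities. A quick sanity check (the output depends on the random split):

```python
# predict() returns predicted class labels (as floats) when objective is 'multi:softmax'
print(pred[:5])
print(test_y.iloc[:5].values)
```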
Check the accuracy with `accuracy_score` from scikit-learn.
from sklearn.metrics import accuracy_score
score = accuracy_score(test_y, pred)
print('score:{0:.4f}'.format(score))
# 0.9667
Visualize which features contributed to the prediction results.
xgb.plot_importance(bst)
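`plot_importance` draws the figure with matplotlib, so in a plain script you may need to show it explicitly. A minimal sketch, assuming matplotlib is installed:

```python
import matplotlib.pyplot as plt

xgb.plot_importance(bst)
plt.show()
```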
You can easily perform validation during training by using validation data and early stopping (stopping training early).
Part of the training data is set aside as validation data.
train_x, valid_x, train_y, valid_y = train_test_split(train_x, train_y, test_size=0.2, shuffle=True)
dtrain = xgb.DMatrix(train_x, label=train_y)
dvalid = xgb.DMatrix(valid_x, label=valid_y)
Add `eval_metric` to the parameters for validation; it specifies the evaluation metric.
param = {'max_depth': 2, 'eta': 0.5, 'objective': 'multi:softmax', 'num_class': 3, 'eval_metric': 'mlogloss'}
Specify the data to be monitored during validation in `evallist`. Here `'eval'` is the name given to the validation data and `'train'` to the training data.
Add `early_stopping_rounds` as an argument to `xgb.train`. `early_stopping_rounds=5` means that training stops if the evaluation metric does not improve for 5 consecutive rounds.
evallist = [(dvalid, 'eval'), (dtrain, 'train')]
num_round = 10000
bst = xgb.train(param, dtrain, num_round, evallist, early_stopping_rounds=5)
# [0] eval-mlogloss:0.61103 train-mlogloss:0.60698
# Multiple eval metrics have been passed: 'train-mlogloss' will be used for early stopping.
#
# Will train until train-mlogloss hasn't improved in 5 rounds.
# [1] eval-mlogloss:0.36291 train-mlogloss:0.35779
# [2] eval-mlogloss:0.22432 train-mlogloss:0.23488
#
# ~~~ output omitted ~~~
#
# Stopping. Best iteration:
# [1153] eval-mlogloss:0.00827 train-mlogloss:0.01863
print('Best Score:{0:.4f}, Iteration:{1:d}, Ntree_Limit:{2:d}'.format(
    bst.best_score, bst.best_iteration, bst.best_ntree_limit))
# Best Score:0.0186, Iteration:1153, Ntree_Limit:1154
Make predictions using the model from the iteration with the best validation score.
dtest = xgb.DMatrix(test_x)
pred = bst.predict(dtest, ntree_limit=bst.best_ntree_limit)
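As a quick check, this prediction can be scored the same way as before (reusing `accuracy_score` from earlier; the value depends on the split):

```python
from sklearn.metrics import accuracy_score  # already imported earlier

score = accuracy_score(test_y, pred)
print('score:{0:.4f}'.format(score))
```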
Since xgboost works with pandas `DataFrame` and `Series` directly, I felt the barrier to entry is low for anyone who has already been doing machine learning.
I tried multi-class classification this time, but it can also be used for binary classification and regression, so it is useful in a variety of situations.