XGBoost often appears in Kaggle competitions. There were many parts I couldn't understand even after reading the code, so I investigated them and summarized what I learned from a beginner's point of view. Please note that, because I tried to write in plain language and avoid difficult terms, the explanation is not strictly rigorous. Please do not hesitate to let me know if you have any additions or corrections. This time, I will explain XGBoost while implementing it.
- Windows: 10
- Anaconda
- Python: 3.7.4
- pandas: 0.25.1
- numpy: 1.16.5
- scikit-learn: 0.21.2
- XGBoost: 0.90
This time, we will use scikit-learn's breast cancer dataset (Breast Cancer Wisconsin (Diagnostic) dataset). The dataset contains features computed from the cell nuclei of breast tumors, and we will use them to determine whether a tumor is malignant or benign.
This article does not explain the detailed parameters of XGBoost.
The source of this article is listed below. https://github.com/Bacchan0718/qiita/blob/master/xgb_breast_cancer_wisconsin.ipynb
XGBoost (eXtreme Gradient Boosting) is an implementation of gradient boosting with decision trees. A decision tree is a method that classifies a dataset using a tree-like model of branching rules, which makes it possible to analyze which factors influenced the result and to use the learned rules for future predictions.
The name "gradient boosting" combines two ideas: "gradient" and "boosting". The gradient part refers to minimizing the loss function (the gap between predicted and actual values) by following its gradient, which reduces the prediction error. Boosting is an algorithm that combines weak learners (simple models that on their own make rather inaccurate judgments; here they are decision trees) in series, so that each new learner corrects the mistakes of the previous ones and the overall prediction accuracy improves.

Reference links:
- XGBoost: https://logmi.jp/tech/articles/322734
- XGBoost: http://kamonohashiperry.com/archives/209
- Gradient method: https://to-kei.net/basic-study/neural-network/optimizer/
- Loss function: https://qiita.com/mine820/items/f8a8c03ef1a7b390e372
- Decision tree: https://enterprisezine.jp/iti/detail/6323
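As a rough illustration of the boosting idea described above (a conceptual sketch only, not XGBoost's actual implementation, which adds regularization and uses second-order gradient information), each new tree is fit to the errors left by the trees built so far:

# Conceptual sketch of gradient boosting for regression with squared error.
# Each tree is fit to the residuals (the errors) of the current prediction.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def simple_gradient_boosting(X, y, n_rounds=50, learning_rate=0.1):
    prediction = np.full(len(y), y.mean())  # start from a constant prediction
    trees = []
    for _ in range(n_rounds):
        residual = y - prediction                      # errors of the current model
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
        prediction += learning_rate * tree.predict(X)  # add a small correction
        trees.append(tree)
    return trees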
**(1) Open Anaconda Prompt** Open it from Start > Anaconda 3 (64-bit) > Anaconda Prompt.
**(2) Run conda install -c anaconda py-xgboost**
**(3) Open a Terminal from Anaconda** Open it from Anaconda Navigator > click the virtual environment to install into > Open Terminal.
**(4) Run conda install py-xgboost** During execution, "Proceed ([y]/n)?" is displayed. Enter "y" and press Enter.
Now you can use XGBoost by running import xgboost as xgb in a Jupyter Notebook.
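To check that the installation worked, for example, you can print the installed version (in this article's environment it should be 0.90):

# Quick check that XGBoost can be imported and which version is installed
import xgboost as xgb
print(xgb.__version__)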
**Important note** The installation method for XGBoost differs depending on your environment. If your environment differs from this article's, you may not be able to install it with this method.
You can load the scikit-learn dataset as follows.
xgb_breast_cancer_wisconsin.ipynb
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
The variable cancer is of type Bunch (a subclass of dict); the explanatory variables are stored in data and the objective variable is stored in target.
xgb_breast_cancer_wisconsin.ipynb
X = cancer.data
y = cancer.target
The objective variable is the result of determining whether a tumor is malignant or benign. It is 0 for malignant (cancer) cells and 1 for benign cells.
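To confirm the label names and how many samples fall into each class, for example:

# Check the label names and the number of samples in each class
# (index 0 corresponds to malignant, index 1 to benign)
import numpy as np
print(cancer.target_names)
print(np.bincount(y))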
The dataset's description is stored in DESCR.
xgb_breast_cancer_wisconsin.ipynb
print(cancer.DESCR)
A detailed description of the dataset can be found below. https://ensekitt.hatenablog.com/entry/2018/08/22/200000
Divide into training data and test data.
xgb_breast_cancer_wisconsin.ipynb
import numpy as np
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2)
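Note that train_test_split shuffles the data randomly, so the split (and therefore the accuracy reported later) can change on every run. A minimal sketch for a reproducible split (the seed value 42 is just an example), optionally keeping the class ratio the same in both sets with stratify:

# Optional: fix the random seed and preserve the class ratio in both sets
X_train, X_test, Y_train, Y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)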
After splitting, convert the data into DMatrix, the dataset format handled by XGBoost. Pass the column names to feature_names so that the features can be visualized later.
xgb_breast_cancer_wisconsin.ipynb
import xgboost as xgb
xgb_train = xgb.DMatrix(X_train, label=Y_train, feature_names=cancer.feature_names)
xgb_test = xgb.DMatrix(X_test, label=Y_test, feature_names=cancer.feature_names)
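As an optional sanity check, a DMatrix reports its size through num_row() and num_col():

# Optional check: number of rows and feature columns in each DMatrix
print(xgb_train.num_row(), xgb_train.num_col())
print(xgb_test.num_row(), xgb_test.num_col())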
xgb_breast_cancer_wisconsin.ipynb
param = {
    # Binary classification problem
    'objective': 'binary:logistic',
}
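Other commonly used parameters can be added to the same dictionary. As an illustrative example of parameters that XGBoost accepts (the values here are examples, not tuned settings):

# Example of additional common parameters (values are illustrative, not tuned)
param = {
    'objective': 'binary:logistic',  # binary classification, outputs probabilities
    'max_depth': 3,                  # maximum depth of each tree
    'eta': 0.1,                      # learning rate (shrinkage)
    'eval_metric': 'logloss',        # metric reported during training
}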
Train the model.
xgb_breast_cancer_wisconsin.ipynb
model = xgb.train(param, xgb_train)
Reference link: https://blog.amedama.jp/entry/2019/01/29/235642
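xgb.train also accepts the number of boosting rounds and a watchlist of datasets to monitor during training. A minimal sketch (the value 100 is just an example; if num_boost_round is omitted, the library's default is used):

# Sketch: set the number of boosting rounds and monitor the loss on both sets
evals = [(xgb_train, 'train'), (xgb_test, 'eval')]
model = xgb.train(param, xgb_train, num_boost_round=100, evals=evals)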
Using the trained model, calculate the predicted probabilities for the test data. With the binary:logistic objective, predict returns, for each sample, the probability that its label is 1 (benign).
xgb_breast_cancer_wisconsin.ipynb
y_pred_proba = model.predict(xgb_test)
Check the contents of y_pred_proba.
xgb_breast_cancer_wisconsin.ipynb
print(y_pred_proba)
The contents are as follows.
[0.974865 0.974865 0.974865 0.974865 0.02652072 0.02652072
0.02652072 0.93469375 0.15752992 0.9459383 0.05494327 0.974865
0.974865 0.793823 0.95098037 0.974865 0.93770874 0.02652072
0.92342764 0.96573967 0.92566985 0.95829874 0.9485401 0.974865
0.96885294 0.974865 0.9670915 0.9495995 0.9719596 0.9671308
0.974865 0.974865 0.9671308 0.974865 0.974865 0.974865
0.96525717 0.9248287 0.4881295 0.974865 0.9670915 0.02652072
0.974865 0.04612969 0.9459383 0.7825349 0.974865 0.02652072
0.04585124 0.974865 0.1232813 0.974865 0.974865 0.3750245
0.9522517 0.974865 0.05884887 0.02652072 0.02652072 0.02652072
0.974865 0.94800293 0.9533147 0.974865 0.9177746 0.9665209
0.9459383 0.02652072 0.974865 0.974865 0.974865 0.974865
0.6874632 0.72485 0.31191444 0.02912194 0.96525717 0.09619693
0.02652072 0.9719596 0.9346858 0.02652072 0.974865 0.02652072
0.0688739 0.974865 0.64381874 0.97141886 0.974865 0.974865
0.974865 0.1619863 0.974865 0.02652072 0.02652072 0.974865
0.9670915 0.45661741 0.02652072 0.02652072 0.974865 0.03072577
0.9670915 0.974865 0.9142289 0.7509865 0.9670915 0.02652072
0.02652072 0.9670915 0.02652072 0.78484446 0.974865 0.974865 ]
The objective variable of this dataset is binary, so the final predictions must be 0 or 1. Set the threshold (reference value) to 0.5 and convert the probabilities: values greater than 0.5 become 1, and the rest become 0.
xgb_breast_cancer_wisconsin.ipynb
y_pred = np.where(y_pred_proba > 0.5, 1, 0)
Evaluate the model. This time, we use accuracy (the proportion of correct predictions) as the evaluation metric.
xgb_breast_cancer_wisconsin.ipynb
from sklearn.metrics import accuracy_score
acc = accuracy_score(Y_test, y_pred)
print(acc)
The accuracy is 0.9912280701754386.
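Accuracy alone does not show which kinds of mistakes the model makes. As an additional sketch, a confusion matrix breaks the test results down by class:

# Sketch: rows are the true classes, columns are the predicted classes
from sklearn.metrics import confusion_matrix
print(confusion_matrix(Y_test, y_pred))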
A feature is a measurable characteristic used as input for learning. Check the graph to see which features are strongly related to the prediction target. Reference link: https://qiita.com/daichildren98/items/ebabef57bc19d5624682 The graph can be saved as a png with fig.savefig.
xgb_breast_cancer_wisconsin.ipynb
import matplotlib.pyplot as plt
fig, ax1 = plt.subplots(figsize=(8,15))
xgb.plot_importance(model, ax=ax1)
plt.show()
fig.savefig("FeatureImportance.png")
Looking at the graph, I found that the predictions were most strongly related to worst texture.
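By default, plot_importance ranks features by "weight" (how often a feature is used for splitting). As a hedged sketch, other importance measures such as average gain can also be read directly from the trained booster:

# Sketch: read feature importance scores directly, ranked by average gain
scores = model.get_score(importance_type='gain')
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(name, score)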
XGBoost often appears in Kaggle competitions, but when I first saw it, I didn't understand it at all, so I looked it up and summarized what I learned. Next time, I plan to investigate and summarize the parameters.