XGBoost often appears in Kaggle competitions. There were many parts I couldn't understand even after reading the code, so I investigated them and summarized what I learned from a beginner's point of view. Please note that, because I tried to write in plain language and avoid difficult terms, the explanation is not strictly rigorous. Please do not hesitate to let me know if you have any additions or corrections. This time, I will explain XGBoost while implementing it.
- Windows: 10
- Anaconda
- Python: 3.7.4
- pandas: 0.25.1
- numpy: 1.16.5
- scikit-learn: 0.21.2
- XGBoost: 0.90
This time, we will use scikit-learn's breast cancer dataset (Breast Cancer Wisconsin (Diagnostic) dataset). The dataset contains features computed from the cell nuclei of breast tumors, and we will use them to determine whether a tumor is malignant or benign.
This article does not explain the detailed parameters of XGBoost.
The source of this article is listed below. https://github.com/Bacchan0718/qiita/blob/master/xgb_breast_cancer_wisconsin.ipynb
XGBoost (eXtreme Gradient Boosting) is an implementation of gradient boosting with decision trees. A decision tree is a method that classifies a dataset using a tree-like model of branching rules, which makes it possible to analyze which factors influenced the result and to use the learned rules for future predictions.
The name "gradient boosting" combines two ideas: "gradient" and "boosting". The gradient part refers to minimizing the loss function (the gap between predicted and actual values) by following its gradient, which reduces the prediction error. Boosting is an algorithm that combines weak learners (simple models that on their own make rather inaccurate judgments; here they are decision trees) in series, so that each new learner corrects the mistakes of the previous ones and the overall prediction accuracy improves.

Reference links:
- XGBoost: https://logmi.jp/tech/articles/322734
- XGBoost: http://kamonohashiperry.com/archives/209
- Gradient method: https://to-kei.net/basic-study/neural-network/optimizer/
- Loss function: https://qiita.com/mine820/items/f8a8c03ef1a7b390e372
- Decision tree: https://enterprisezine.jp/iti/detail/6323
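As a rough illustration of the boosting idea described above (a conceptual sketch only, not XGBoost's actual implementation, which adds regularization and uses second-order gradient information), each new tree is fit to the errors left by the trees built so far:

# Conceptual sketch of gradient boosting for regression with squared error.
# Each tree is fit to the residuals (the errors) of the current prediction.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def simple_gradient_boosting(X, y, n_rounds=50, learning_rate=0.1):
    prediction = np.full(len(y), y.mean())  # start from a constant prediction
    trees = []
    for _ in range(n_rounds):
        residual = y - prediction                      # errors of the current model
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
        prediction += learning_rate * tree.predict(X)  # add a small correction
        trees.append(tree)
    return trees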
**(1) Open Anaconda Prompt** Open it from Start > Anaconda 3 (64-bit) > Anaconda Prompt.
**(2) Run conda install -c anaconda py-xgboost**
**(3) Open a Terminal from Anaconda** Open it from Anaconda Navigator > click the virtual environment to install into > Open Terminal.
**(4) Run conda install py-xgboost** During execution, "Proceed ([y]/n)?" is displayed. Enter "y" and press Enter.
Now you can use XGBoost by running import xgboost as xgb in a Jupyter Notebook.
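To check that the installation worked, for example, you can print the installed version (in this article's environment it should be 0.90):

# Quick check that XGBoost can be imported and which version is installed
import xgboost as xgb
print(xgb.__version__)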
**Important note** The installation method for XGBoost differs depending on your environment. If your environment differs from this article's, you may not be able to install it with this method.
You can load the scikit-learn dataset as follows.
xgb_breast_cancer_wisconsin.ipynb
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
The variable cancer is of type Bunch (a subclass of dict); the explanatory variables are stored in data and the objective variable is stored in target.
xgb_breast_cancer_wisconsin.ipynb
X = cancer.data
y = cancer.target
The objective variable is the result of determining whether a tumor is malignant or benign. It is 0 for malignant (cancer) cells and 1 for benign cells.
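To confirm the label names and how many samples fall into each class, for example:

# Check the label names and the number of samples in each class
# (index 0 corresponds to malignant, index 1 to benign)
import numpy as np
print(cancer.target_names)
print(np.bincount(y))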
The dataset's description is stored in DESCR.
xgb_breast_cancer_wisconsin.ipynb
print(cancer.DESCR)
A detailed description of the dataset can be found below. https://ensekitt.hatenablog.com/entry/2018/08/22/200000
Divide into training data and test data.
xgb_breast_cancer_wisconsin.ipynb
import numpy as np
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2)
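Note that train_test_split shuffles the data randomly, so the split (and therefore the accuracy reported later) can change on every run. A minimal sketch for a reproducible split (the seed value 42 is just an example), optionally keeping the class ratio the same in both sets with stratify:

# Optional: fix the random seed and preserve the class ratio in both sets
X_train, X_test, Y_train, Y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)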
After splitting, convert the data into DMatrix, the dataset format handled by XGBoost. Pass the column names to feature_names so that the features can be visualized later.
xgb_breast_cancer_wisconsin.ipynb
import xgboost as xgb
xgb_train = xgb.DMatrix(X_train, label=Y_train, feature_names=cancer.feature_names)
xgb_test = xgb.DMatrix(X_test, label=Y_test, feature_names=cancer.feature_names)
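As an optional sanity check, a DMatrix reports its size through num_row() and num_col():

# Optional check: number of rows and feature columns in each DMatrix
print(xgb_train.num_row(), xgb_train.num_col())
print(xgb_test.num_row(), xgb_test.num_col())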
xgb_breast_cancer_wisconsin.ipynb
param = {
    # Binary classification problem
    'objective': 'binary:logistic',
}
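Other commonly used parameters can be added to the same dictionary. As an illustrative example of parameters that XGBoost accepts (the values here are examples, not tuned settings):

# Example of additional common parameters (values are illustrative, not tuned)
param = {
    'objective': 'binary:logistic',  # binary classification, outputs probabilities
    'max_depth': 3,                  # maximum depth of each tree
    'eta': 0.1,                      # learning rate (shrinkage)
    'eval_metric': 'logloss',        # metric reported during training
}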
Train the model.
xgb_breast_cancer_wisconsin.ipynb
model = xgb.train(param, xgb_train)
Reference link: https://blog.amedama.jp/entry/2019/01/29/235642
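xgb.train also accepts the number of boosting rounds and a watchlist of datasets to monitor during training. A minimal sketch (the value 100 is just an example; if num_boost_round is omitted, the library's default is used):

# Sketch: set the number of boosting rounds and monitor the loss on both sets
evals = [(xgb_train, 'train'), (xgb_test, 'eval')]
model = xgb.train(param, xgb_train, num_boost_round=100, evals=evals)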
Using the trained model, calculate the predicted probabilities for the test data. With the binary:logistic objective, predict returns, for each sample, the probability that its label is 1 (benign).
xgb_breast_cancer_wisconsin.ipynb
y_pred_proba = model.predict(xgb_test)
Check the contents of y_pred_proba.
xgb_breast_cancer_wisconsin.ipynb
print(y_pred_proba)
The contents are as follows.
[0.974865 0.974865 0.974865 0.974865 0.02652072 0.02652072
0.02652072 0.93469375 0.15752992 0.9459383 0.05494327 0.974865
0.974865 0.793823 0.95098037 0.974865 0.93770874 0.02652072
0.92342764 0.96573967 0.92566985 0.95829874 0.9485401 0.974865
0.96885294 0.974865 0.9670915 0.9495995 0.9719596 0.9671308
0.974865 0.974865 0.9671308 0.974865 0.974865 0.974865
0.96525717 0.9248287 0.4881295 0.974865 0.9670915 0.02652072
0.974865 0.04612969 0.9459383 0.7825349 0.974865 0.02652072
0.04585124 0.974865 0.1232813 0.974865 0.974865 0.3750245
0.9522517 0.974865 0.05884887 0.02652072 0.02652072 0.02652072
0.974865 0.94800293 0.9533147 0.974865 0.9177746 0.9665209
0.9459383 0.02652072 0.974865 0.974865 0.974865 0.974865
0.6874632 0.72485 0.31191444 0.02912194 0.96525717 0.09619693
0.02652072 0.9719596 0.9346858 0.02652072 0.974865 0.02652072
0.0688739 0.974865 0.64381874 0.97141886 0.974865 0.974865
0.974865 0.1619863 0.974865 0.02652072 0.02652072 0.974865
0.9670915 0.45661741 0.02652072 0.02652072 0.974865 0.03072577
0.9670915 0.974865 0.9142289 0.7509865 0.9670915 0.02652072
0.02652072 0.9670915 0.02652072 0.78484446 0.974865 0.974865 ]
The objective variable of this dataset is binary, so the final predictions must be 0 or 1. Set the threshold (reference value) to 0.5 and convert the probabilities: values greater than 0.5 become 1, and the rest become 0.
xgb_breast_cancer_wisconsin.ipynb
y_pred = np.where(y_pred_proba > 0.5, 1, 0)
Evaluate the model. This time, we use accuracy (the proportion of correct predictions) as the evaluation metric.
xgb_breast_cancer_wisconsin.ipynb
from sklearn.metrics import accuracy_score
acc = accuracy_score(Y_test, y_pred)
print(acc)
The accuracy is 0.9912280701754386.
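Accuracy alone does not show which kinds of mistakes the model makes. As an additional sketch, a confusion matrix breaks the test results down by class:

# Sketch: rows are the true classes, columns are the predicted classes
from sklearn.metrics import confusion_matrix
print(confusion_matrix(Y_test, y_pred))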
A feature is a measurable characteristic used as input for learning. Check the graph to see which features are strongly related to the prediction target. Reference link: https://qiita.com/daichildren98/items/ebabef57bc19d5624682 The graph can be saved as a png with fig.savefig.
xgb_breast_cancer_wisconsin.ipynb
import matplotlib.pyplot as plt
fig, ax1 = plt.subplots(figsize=(8,15))
xgb.plot_importance(model, ax=ax1)
plt.show()
fig.savefig("FeatureImportance.png")
Looking at the graph, I found that the predictions were most strongly related to worst texture.
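By default, plot_importance ranks features by "weight" (how often a feature is used for splitting). As a hedged sketch, other importance measures such as average gain can also be read directly from the trained booster:

# Sketch: read feature importance scores directly, ranked by average gain
scores = model.get_score(importance_type='gain')
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(name, score)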
XGBoost often appears in Kaggle competitions, but when I first saw it, I didn't understand it at all, so I looked it up and summarized what I learned. Next time, I plan to investigate and summarize the parameters.