This time, I looked into **LightGBM**, a gradient boosting framework based on decision trees. LightGBM is often used on Kaggle, frequently as one component of an ensemble. The steps around it (such as plotting the ROC curve) are also needed in practice, so I have summarized them here as well.
The articles I referred to are listed below.
- Explanation of LightGBM
- [Python: Try using LightGBM](https://blog.amedama.jp/entry/2018/05/01/081842#%E4%BA%8C%E5%80%A4%E5%88%86%E9%A1%9E%E5%95%8F%E9%A1%8C-Breast-Cancer-%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88)
- Memo of a series of binary classification with lightgbm
terminal
pip install lightgbm
That should be all it takes to install it. However, when I ran the following in Jupyter,
jupyter
import lightgbm as lgb
I got an error, so I looked into how to deal with it.
I eventually found a solution. According to it, running the following fixes the problem:
terminal
brew install libomp
After that, the import worked without errors, so we can move on.
First, import the required libraries.
In[1]
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
print(pd.__version__)  # Check the pandas version from within the script
Out[1]
0.21.0
Next, load the dataset. This time we will use the iris dataset bundled with scikit-learn.
In[2]
from sklearn.datasets import load_iris
iris = load_iris()
#print(iris.DESCR) #Show description about the dataset
print(type(iris))
Out[2]
<class 'sklearn.utils.Bunch'>
In[3]
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target
print(df.shape)
df.head()
Let's plot the data using only two of the features, sepal length and sepal width.
In[4]
fig = plt.figure(figsize = (5,5))
ax = fig.add_subplot(111)
plt.scatter(df[df['target']==0].iloc[:,0], df[df['target']==0].iloc[:,1], label='Setosa')
plt.scatter(df[df['target']==1].iloc[:,0], df[df['target']==1].iloc[:,1], label='Versicolour')
plt.scatter(df[df['target']==2].iloc[:,0], df[df['target']==2].iloc[:,1], label='Virginica')
plt.xlabel("sepal length[cm]", fontsize=13)
plt.ylabel("sepal width[cm]", fontsize=13)
plt.legend()
plt.show()
This time, I will treat it as a two-class classification problem between Versicolour and Virginica, which look like the hardest pair to separate among the three species. The rows with label 0 (Setosa) are therefore dropped, and the remaining two species are relabeled as 0 and 1.
In[5]
data = df[df['target']!=0]
data.head()
In[6]
data['target'] = data['target'] - 1
X = data.drop('target', axis=1)
y = data['target']
Out[6]
/Users/{user}/.pyenv/versions/3.6.6/lib/python3.6/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
"""Entry point for launching an IPython kernel.
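I could not get rid of this warning myself, so if you know how to handle it, please let me know. (It appears because data is a filtered slice of df; taking an explicit copy with .copy() when creating data is the usual way to avoid it.)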
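One way to avoid the SettingWithCopyWarning is sketched below. This is only an illustration: the small DataFrame is a made-up stand-in for the iris DataFrame, and the column values are not real measurements. The only change from the cells above is the explicit .copy() when filtering.

```python
import pandas as pd

# Toy stand-in for the iris DataFrame (values are made up for illustration)
df = pd.DataFrame({
    "sepal length (cm)": [5.1, 7.0, 6.3, 4.9, 6.4, 5.8],
    "target": [0, 1, 2, 0, 1, 2],
})

# .copy() makes `data` an independent DataFrame instead of a slice of df,
# so assigning to a column no longer triggers the warning
data = df[df["target"] != 0].copy()
data["target"] = data["target"] - 1  # relabel 1, 2 -> 0, 1

X = data.drop("target", axis=1)
y = data["target"]
print(sorted(y.unique()))
```

The original df is left untouched, which is usually what you want when deriving a filtered working set.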
Before building the model, split the data into training data and test data. Since we will monitor the score on held-out data during training (hold-out validation rather than k-fold cross-validation), we also set aside validation data.
In[7]
from sklearn.model_selection import train_test_split
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=7, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.2, random_state=7, stratify=y_trainval)
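To check what the two-stage split produces, here is a small sketch that repeats the same calls on synthetic data of the same shape as our problem (100 rows, 50 per class). The arrays are placeholders I made up; only the sizes and class counts matter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows, 50 per class, like Versicolour vs. Virginica
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 50 + [1] * 50)

# First split off the test set, then split the rest into train and validation
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.2, random_state=7, stratify=y_trainval)

# Print the size and per-class counts of each subset
for name, labels in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(name, len(labels), np.bincount(labels))
```

Because of stratify, each subset keeps the 50/50 class balance, which matters with so few samples.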
Now for the main topic: building a LightGBM model.
The params argument and the other arguments of the lightgbm.train function are the hyperparameters. They are described in the official LightGBM documentation.
Also, I think the following articles will be helpful.
- Thorough introduction to LightGBM: how to use LightGBM, how it works, and how it differs from XGBoost
- [Machine learning] How to tune hyperparameters
Here, I will explain only the ones that seem to be important.
- objective: since this is a two-class classification, binary is specified.
- metric: specifies auc, one of the metrics that can be used for two-class classification.
- early_stopping_rounds: if the metric on the validation data does not improve for the specified number of rounds, training stops early.
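The early stopping rule can be illustrated with a few lines of plain Python. This is a simplified sketch of the idea, not LightGBM's actual implementation; the function name simulate_early_stopping and the score sequence are made up for illustration.

```python
def simulate_early_stopping(scores, stopping_rounds):
    """Return (best_round, stopped_round) for a higher-is-better metric."""
    best_score = float("-inf")
    best_round = 0
    for rnd, score in enumerate(scores, start=1):
        if score > best_score:
            # Strict improvement: remember this round as the best so far
            best_score, best_round = score, rnd
        elif rnd - best_round >= stopping_rounds:
            # No improvement for `stopping_rounds` rounds: stop here
            return best_round, rnd
    # Ran out of rounds without triggering early stopping
    return best_round, len(scores)


# Hypothetical validation AUC per round: improves at round 3, then plateaus
scores = [0.8125, 0.8125, 0.9140625] + [0.9140625] * 40
best_round, stopped_round = simulate_early_stopping(scores, stopping_rounds=20)
print(best_round, stopped_round)
```

The important point is that the model from the best round is the one worth keeping, not the one from the round where training stopped.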
In[8]
import lightgbm as lgb
lgb_train = lgb.Dataset(X_train, y_train)
lgb_val = lgb.Dataset(X_val, y_val)
params = {
    'metric': 'auc',  # 'binary_logloss' is also possible
    'objective': 'binary',
    'max_depth': 1,
    'num_leaves': 2,
    'min_data_in_leaf': 5,
}
evals_result = {}  # Dictionary for storing results
gbm = lgb.train(params,
                lgb_train,
                valid_sets=[lgb_train, lgb_val],
                valid_names=['train', 'eval'],
                num_boost_round=500,
                early_stopping_rounds=20,
                verbose_eval=10,
                evals_result=evals_result
                )
Out[8]
Training until validation scores don't improve for 20 rounds
[10] train's auc: 0.997559 eval's auc: 0.914062
[20] train's auc: 0.997559 eval's auc: 0.914062
Early stopping, best iteration is:
[3] train's auc: 0.998047 eval's auc: 0.914062
The evals_result dictionary passed to the train function above now holds the per-round values of auc, the metric we specified. We will use it to draw a learning curve.
In[9]
print(evals_result.keys())
print(evals_result['eval'].keys())
print(evals_result['train'].keys())
train_metric = evals_result['train']['auc']
eval_metric = evals_result['eval']['auc']
train_metric[:5], eval_metric[:5]
Out[9]
dict_keys(['train', 'eval'])
odict_keys(['auc'])
odict_keys(['auc'])
([0.9375, 0.9375, 0.998046875, 0.998046875, 0.998046875],
[0.8125, 0.8125, 0.9140625, 0.9140625, 0.9140625])
The values are stored as lists inside the nested dictionary, so specify the keys to retrieve each list.
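As a small check, the best round can be recovered from such a list with numpy. The sketch below uses the first five validation values shown in Out[9]; rounds are counted from 1, so 1 is added to the index.

```python
import numpy as np

# First five validation AUC values from Out[9]
eval_metric = [0.8125, 0.8125, 0.9140625, 0.9140625, 0.9140625]

# argmax returns the index of the first maximum; rounds are 1-based
best_round = int(np.argmax(eval_metric)) + 1
print(best_round, eval_metric[best_round - 1])
```

This agrees with the "best iteration is: [3]" line in the training log.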
Then draw a learning curve.
In[10]
plt.plot(train_metric, label='train auc')
plt.plot(eval_metric, label='eval auc')
plt.grid()
plt.legend()
plt.ylim(0, 1.1)
plt.xlabel('rounds')
plt.ylabel('auc')
plt.show()
LightGBM can visualize the importance of each feature for the classification with lightgbm.plot_importance.
The documentation for lightgbm.plot_importance is here (https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.plot_importance.html).
In[11]
lgb.plot_importance(gbm, figsize=(12, 6), max_num_features=4)
plt.show();
Next, predict on the test data and compute the ROC curve and its AUC.
In[12]
from sklearn import metrics
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred, drop_intermediate=False)
auc = metrics.auc(fpr, tpr)
print('auc:', auc)
Out[12]
auc: 1.0
The documentation for roc_curve is here (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html).
Also, while I was reading up on the ROC curve, I found the following article, which explains it clearly, so I am linking it here.
- Calculate ROC curve and its AUC with scikit-learn
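To make clear what roc_curve and auc compute, here is a hand-rolled sketch in numpy. It is a simplified illustration that assumes no tied scores, not a replacement for scikit-learn; the function name roc_points and the four-sample data are made up for the example.

```python
import numpy as np


def roc_points(y_true, y_score):
    """Return (fpr, tpr) obtained by lowering the threshold one sample at a time."""
    order = np.argsort(-np.asarray(y_score))  # highest score first
    y_sorted = np.asarray(y_true)[order]
    tps = np.cumsum(y_sorted == 1)  # true positives at each threshold
    fps = np.cumsum(y_sorted == 0)  # false positives at each threshold
    # Prepend the origin (threshold above every score)
    tpr = np.concatenate([[0.0], tps / tps[-1]])
    fpr = np.concatenate([[0.0], fps / fps[-1]])
    return fpr, tpr


y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_true, y_score)

# Area under the curve by the trapezoidal rule
auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
print(fpr, tpr, auc)
```

An AUC of 1.0, as in Out[12], means every positive sample received a higher score than every negative sample.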
In[13]
plt.plot(fpr, tpr, marker='o')
plt.xlabel('FPR: False positive rate')
plt.ylabel('TPR: True positive rate')
plt.grid()
plt.show()
Because the problem turned out to be relatively easy this time, the ROC curve has an extreme shape.
Next, choose a threshold and convert the predicted probabilities into class labels based on it.
In[14]
y_pred = (y_pred >= 0.5).astype(int)
y_pred
Out[14]
array([1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0])
Finally, I drew a heat map of the confusion matrix, with reference to the following article.
- Beautiful visualization with seaborn heatmap
In[15]
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
cm = confusion_matrix(y_test.values, y_pred)
sns.heatmap(cm, annot=True, annot_kws={'size': 20}, cmap= 'Reds');
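For reference, the cells of the confusion matrix and the usual scores derived from them can be computed by hand. This is a small sketch on made-up labels and probabilities (not the iris results above), using the same layout as scikit-learn: rows are true labels, columns are predicted labels.

```python
import numpy as np

# Made-up true labels and predicted probabilities for illustration
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_prob = np.array([0.9, 0.2, 0.6, 0.4, 0.1, 0.8, 0.7, 0.3])

# Same thresholding as above: probability >= 0.5 becomes class 1
y_pred = (y_prob >= 0.5).astype(int)

# Rows: true label, columns: predicted label
cm = np.zeros((2, 2), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1

tn, fp, fn, tp = cm.ravel()
accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(cm)
print(accuracy, precision, recall)
```

Changing the 0.5 threshold moves samples between the columns, which is the trade-off that the ROC curve summarizes.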