0. Introduction

This time I looked into **LightGBM**, a gradient boosting method for machine learning. LightGBM is also commonly used on Kaggle as one of the models in an ensemble. The surrounding steps (such as plotting the ROC curve) are needed as well, so I have summarized them here.

The articles I referred to are listed below.

- Explanation of LightGBM
- [Python: Try using LightGBM](https://blog.amedama.jp/entry/2018/05/01/081842#%E4%BA%8C%E5%80%A4%E5%88%86%E9%A1%9E%E5%95%8F%E9%A1%8C-Breast-Cancer-%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88)
- Memo of a series of binary classification with lightgbm

1. Install LightGBM

`terminal`


pip install lightgbm

That is all it should take. However, when I ran the following import,

`jupyter`


import lightgbm as lgb

I got an error. Looking for a fix, I arrived at this article:

- Error in importing LightGBM

That article suggests the following (Homebrew on macOS):

`terminal`


brew install libomp

Running this resolved the error, so I could move on.
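
To confirm that the import really works before moving on, a quick check like the one below can help. This is a minimal sketch using only the standard library; the helper name try_import is my own and not part of LightGBM. As far as I know, a missing package shows up as ImportError, while a missing shared library such as libomp typically shows up as OSError.

```python
import importlib


def try_import(module_name):
    """Try to import a module; return (module, None) or (None, error)."""
    try:
        return importlib.import_module(module_name), None
    except (ImportError, OSError) as exc:
        # ImportError: the package is not installed.
        # OSError: the package is installed, but a shared library it
        # depends on could not be loaded.
        return None, exc


lgb, err = try_import("lightgbm")
if err is None:
    print("lightgbm version:", lgb.__version__)
else:
    print(f"import failed ({type(err).__name__}): {err}")
```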

2. Prepare a sample dataset

First, import the required libraries.

`In[1]`


%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

print(pd.__version__)  # Check the pandas version from within the script

`Out[1]`


0.21.0

Next, load the dataset. This time we will use the iris dataset bundled with scikit-learn.

`In[2]`


from sklearn.datasets import load_iris

iris = load_iris()
#print(iris.DESCR)  #Show description about the dataset
print(type(iris))

`Out[2]`


<class 'sklearn.utils.Bunch'>

`In[3]`


df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

print(df.shape)
df.head()
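
To get a feel for what the loaded object holds, the sketch below prints the species name behind each numeric label and the number of rows per class. It only relies on attributes that load_iris returns (data, target, target_names, feature_names); the variable names label_names and class_counts are my own.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = iris.target

# Map each numeric label to its species name
label_names = dict(enumerate(iris.target_names))
print(label_names)

# Count the rows in each class
class_counts = df["target"].value_counts().sort_index()
print(class_counts)
```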

Let's plot the data using only two of the variables (sepal length and sepal width).

`In[4]`


fig = plt.figure(figsize = (5,5))
ax = fig.add_subplot(111)

plt.scatter(df[df['target']==0].iloc[:,0], df[df['target']==0].iloc[:,1], label='Setosa')
plt.scatter(df[df['target']==1].iloc[:,0], df[df['target']==1].iloc[:,1], label='Versicolour')
plt.scatter(df[df['target']==2].iloc[:,0], df[df['target']==2].iloc[:,1], label='Virginica')
plt.xlabel("sepal length[cm]", fontsize=13)
plt.ylabel("sepal width[cm]", fontsize=13)

plt.legend()
plt.show()

This time, I will treat it as a two-class classification problem between Versicolour and Virginica, which look like the hardest pair to separate among the three species. Rows with label 0 (Setosa) are therefore excluded, and the remaining two species are relabeled as 0 and 1.

`In[5]`


data = df[df['target']!=0]
data.head()

`In[6]`


data['target'] = data['target'] - 1

X = data.drop('target', axis=1)
y = data['target']

`Out[6]`


/Users/{user}/.pyenv/versions/3.6.6/lib/python3.6/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.

I could not work out how to get rid of this warning, so if you know how to handle it, please let me know.
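
One common way to avoid this warning, as far as I understand it, is to take an explicit copy of the filtered rows with .copy() before modifying them, so that the new DataFrame is independent of df. The sketch below uses a small hypothetical DataFrame in place of the iris data and turns warnings into errors to show that none is raised.

```python
import warnings

import pandas as pd

# Small stand-in for the iris DataFrame (hypothetical values)
df = pd.DataFrame({
    "sepal length (cm)": [5.1, 7.0, 6.3, 6.4, 5.8],
    "target": [0, 1, 2, 1, 2],
})

# .copy() makes `data` an independent DataFrame rather than a slice of `df`
data = df[df["target"] != 0].copy()

with warnings.catch_warnings():
    warnings.simplefilter("error")  # any warning would now raise an exception
    data["target"] = data["target"] - 1  # relabel 1, 2 -> 0, 1

print(data["target"].tolist())  # [0, 1, 0, 1]
print(df["target"].tolist())    # [0, 1, 2, 1, 2] (original is unchanged)
```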

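3. Train with LightGBM

Before training, split the data into training data and test data. We also want to evaluate the model while it trains, so validation data is split off as well. (Strictly speaking, this is hold-out validation rather than k-fold cross-validation.)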

`In[7]`


from sklearn.model_selection import train_test_split

X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=7, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.2, random_state=7, stratify=y_trainval)
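
To make the two-stage split concrete, the sketch below repeats it on 100 dummy rows (standing in for the 100 Versicolour and Virginica samples) and prints the resulting sizes. The dummy arrays are placeholders of my own.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 dummy rows with balanced 0/1 labels, like the filtered iris data
X = np.arange(400).reshape(100, 4)
y = np.array([0] * 50 + [1] * 50)

# First split: hold out 20% as the final test set
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y)

# Second split: hold out 20% of the remainder for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.2, random_state=7, stratify=y_trainval)

print(len(X_train), len(X_val), len(X_test))  # 64 16 20
print(np.bincount(y_train), np.bincount(y_val), np.bincount(y_test))
```

Because of stratify, each of the three sets keeps the same 50/50 class balance as the full data.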

Now for the main topic: building a LightGBM model.

About hyperparameters

The params argument and the other arguments of the lightgbm function train are the hyperparameters. They are described in the official LightGBM documentation.

I think the following articles are also helpful.

- Thorough introduction to LightGBM: how to use LightGBM, how it works, and how it differs from XGBoost
- [Machine learning] How to tune hyperparameters

Here I explain only the ones that seem important.

- objective: binary is specified, because this is a two-class classification.
- metric: auc is specified, which is one of the metrics that can be used for two-class classification.
- early_stopping_rounds: if the validation metric does not improve for the specified number of rounds, training is stopped early.

My understanding may be wrong; if so, please correct me.
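
To illustrate what early_stopping_rounds does, here is a simplified pure-Python simulation of the stopping rule. It is a sketch of the idea, not LightGBM's actual implementation, and the function name find_stopping_round is hypothetical. The score list is illustrative: a validation AUC that improves until round 3 and then stays flat.

```python
def find_stopping_round(scores, patience):
    """Return (best_round, stop_round), both 1-indexed.

    Training stops once `patience` rounds have passed without the score
    exceeding the best value seen so far (higher is better, as with AUC).
    stop_round is None if the rule never triggers.
    """
    best_score = float("-inf")
    best_round = 0
    for current_round, score in enumerate(scores, start=1):
        if score > best_score:
            best_score, best_round = score, current_round
        elif current_round - best_round >= patience:
            return best_round, current_round
    return best_round, None


# Illustrative validation AUC: improves until round 3, then stays flat
val_auc = [0.8125, 0.8125] + [0.9140625] * 48

best_round, stop_round = find_stopping_round(val_auc, patience=20)
print(best_round, stop_round)  # 3 23
```

With a patience of 20, training that peaks at round 3 is cut off at round 23 instead of running for all 500 rounds.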

`In[8]`


import lightgbm as lgb

lgb_train = lgb.Dataset(X_train, y_train)
lgb_val = lgb.Dataset(X_val, y_val)

params = {
    'metric': 'auc',  # 'binary_logloss' can also be used
    'objective': 'binary',
    'max_depth': 1,
    'num_leaves': 2,
    'min_data_in_leaf': 5,
}

evals_result = {} #Dictionary for storing results

gbm = lgb.train(params,
                lgb_train,
                valid_sets = [lgb_train, lgb_val],
                valid_names = [ 'train', 'eval'],
                num_boost_round = 500,
                early_stopping_rounds = 20,
                verbose_eval = 10,
                evals_result = evals_result
               )

`Out[8]`


Training until validation scores don't improve for 20 rounds
[10]	train's auc: 0.997559	eval's auc: 0.914062
[20]	train's auc: 0.997559	eval's auc: 0.914062
Early stopping, best iteration is:
[3]	train's auc: 0.998047	eval's auc: 0.914062

4. Plot the learning curve

The evals_result dictionary passed to the train function earlier now holds the per-round history of auc, the metric we specified. We will use it to draw a learning curve.

`In[9]`


print(evals_result.keys())
print(evals_result['eval'].keys())
print(evals_result['train'].keys())

train_metric = evals_result['train']['auc']
eval_metric = evals_result['eval']['auc']
train_metric[:5], eval_metric[:5]

`Out[9]`


dict_keys(['train', 'eval'])
odict_keys(['auc'])
odict_keys(['auc'])

([0.9375, 0.9375, 0.998046875, 0.998046875, 0.998046875],
 [0.8125, 0.8125, 0.9140625, 0.9140625, 0.9140625])

The history is stored as lists inside nested dictionaries, so we retrieve each list by specifying its keys.
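
As a small illustration of working with this structure, the sketch below finds the round with the best validation AUC and the gap between the train and eval scores at that round. The dictionary is hand-built from the first five values printed above, so it only stands in for the real evals_result.

```python
# Hand-built stand-in for evals_result, using the first five values shown above
evals_result = {
    'train': {'auc': [0.9375, 0.9375, 0.998046875, 0.998046875, 0.998046875]},
    'eval': {'auc': [0.8125, 0.8125, 0.9140625, 0.9140625, 0.9140625]},
}

train_metric = evals_result['train']['auc']
eval_metric = evals_result['eval']['auc']

# Index of the first maximum; rounds are 1-indexed in the training log
best_index = max(range(len(eval_metric)), key=eval_metric.__getitem__)
best_round = best_index + 1

# A large gap between train and eval scores hints at overfitting
gap = train_metric[best_index] - eval_metric[best_index]

print(best_round)  # 3
print(gap)         # 0.083984375
```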

Then draw a learning curve.

`In[10]`


plt.plot(train_metric, label='train auc')
plt.plot(eval_metric, label='eval auc')
plt.grid()
plt.legend()
plt.ylim(0, 1.1)

plt.xlabel('rounds')
plt.ylabel('auc')
plt.show()

5. Examine important features

LightGBM can visualize how important each feature was for the classification with lightgbm.plot_importance.

The documentation for lightgbm.plot_importance is here (https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.plot_importance.html).

`In[11]`


lgb.plot_importance(gbm, figsize=(12, 6), max_num_features=4)
plt.show();

6. Predict on the test data and plot the ROC curve

`In[12]`


from sklearn import metrics

y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred, drop_intermediate=False)
auc = metrics.auc(fpr, tpr)
print('auc:', auc)

`Out[12]`


auc: 1.0

The documentation for roc_curve is here (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html).

While looking into ROC curves, I found the following article clearly organized, so I link it here.

- Calculate ROC curve and its AUC with scikit-learn
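
As a sanity check on what AUC means, the sketch below computes it two ways on a tiny hypothetical example: with roc_curve plus auc as above, and as the fraction of (positive, negative) pairs in which the positive sample gets the higher score. The helper pairwise_auc is my own, written for illustration (ties count as half).

```python
from itertools import product

from sklearn import metrics


def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted scores
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = metrics.roc_curve(y_true, y_score)
print(metrics.auc(fpr, tpr))          # 0.75
print(pairwise_auc(y_true, y_score))  # 0.75
```

An AUC of 1.0 therefore means every positive sample was scored above every negative one.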

`In[13]`


plt.plot(fpr, tpr, marker='o')
plt.xlabel('FPR: False positive rate')
plt.ylabel('TPR: True positive rate')
plt.grid()
plt.show()

The problem was relatively easy this time, so the ROC curve came out extreme (an AUC of 1.0).

Finally, choose a threshold and convert the predicted probabilities into class labels based on it.

`In[14]`


y_pred = (y_pred >= 0.5).astype(int)
y_pred

`Out[14]`


array([1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0])
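
The threshold of 0.5 is a choice, not a given. The sketch below applies several thresholds to a handful of hypothetical predicted probabilities to show how the labels change.

```python
import numpy as np

# Hypothetical predicted probabilities of class 1
proba = np.array([0.92, 0.08, 0.55, 0.47, 0.30])

for threshold in (0.3, 0.5, 0.7):
    labels = (proba >= threshold).astype(int)
    print(threshold, labels, "predicted positives:", labels.sum())
```

Keeping the probabilities in their own variable (proba here) rather than overwriting them makes it easy to try other thresholds later.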

7. Draw a heatmap of the confusion matrix

Finally, I also drew a heatmap of the confusion matrix, with reference to the following article.

- Beautiful visualization with seaborn heatmap

`In[15]`


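from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_test.values, y_pred)

sns.heatmap(cm, annot=True, annot_kws={'size': 20}, cmap='Reds')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.show()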
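
The four cells of the confusion matrix are also enough to compute the usual summary metrics. The sketch below does this for hypothetical labels (not the results of this article), relying on the scikit-learn convention that rows are true classes and columns are predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)
# For binary labels, ravel() yields the cells in this order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(cm)
print(accuracy, precision, recall)  # 0.75 0.75 0.75
```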
