[Kaggle] I tried ensemble learning using LightGBM

0. Introduction

In this post, I look into **LightGBM**, one of the commonly used learning methods (a gradient boosting framework). LightGBM is often used on Kaggle as part of ensemble learning. The surrounding steps (such as plotting the ROC curve) are needed along with it, so I have summarized those as well.

The articles I referred to are listed below.

- Explanation of LightGBM
- [Python: Try using LightGBM](https://blog.amedama.jp/entry/2018/05/01/081842#%E4%BA%8C%E5%80%A4%E5%88%86%E9%A1%9E%E5%95%8F%E9%A1%8C-Breast-Cancer-%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88)
- Memo of a series of binary classification with lightgbm

1. Install LightGBM

terminal


pip install lightgbm

is all it takes to install it. However, when I ran

jupyter


import lightgbm as lgb 

I got an error, so I looked into how to deal with it and arrived at this page:

Error in importing LightGBM

According to it, running

terminal


brew install libomp

resolved the error, and lightgbm can now be imported. With that, we can move on.
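
As a quick sanity check (my own addition, not from the page above), you can confirm that the installation works by importing the package and printing its version:

jupyter


import lightgbm as lgb

# If this prints a version string without raising an ImportError, the installation is fine
print(lgb.__version__)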

2. Prepare a sample dataset

First, import the required libraries.

In[1]


%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

print(pd.__version__) # The version can also be checked from inside the script

Out[1]


0.21.0

Next, load the dataset. This time we will use the iris data bundled with scikit-learn.

In[2]


from sklearn.datasets import load_iris

iris = load_iris()
#print(iris.DESCR)  #Show description about the dataset
print(type(iris))

Out[2]


<class 'sklearn.utils.Bunch'>

In[3]


df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

print(df.shape)
df.head()

[Figure: df.head() output, showing the four iris feature columns plus the target column]

Let's plot this data using just two of the features (sepal length and sepal width).

In[4]


fig = plt.figure(figsize = (5,5))
ax = fig.add_subplot(111)

plt.scatter(df[df['target']==0].iloc[:,0], df[df['target']==0].iloc[:,1], label='Setosa')
plt.scatter(df[df['target']==1].iloc[:,0], df[df['target']==1].iloc[:,1], label='Versicolour')
plt.scatter(df[df['target']==2].iloc[:,0], df[df['target']==2].iloc[:,1], label='Virginica')
plt.xlabel("sepal length[cm]", fontsize=13)
plt.ylabel("sepal width[cm]", fontsize=13)

plt.legend()
plt.show()

[Figure: scatter plot of sepal length vs. sepal width, colored by class (Setosa, Versicolour, Virginica)]

Of these three classes, Versicolour and Virginica look relatively hard to separate, so I will treat this as a two-class classification problem between them. Rows with label 0 (Setosa) are therefore dropped, and the remaining two classes are relabeled as 0 and 1.

In[5]


data = df[df['target']!=0]
data.head()

[Figure: data.head() output after dropping the Setosa rows]

In[6]


data['target'] = data['target'] - 1

X = data.drop('target', axis=1)
y = data['target']

Out[6]


/Users/{user}/.pyenv/versions/3.6.6/lib/python3.6/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.

I couldn't figure out a clean way to deal with this warning at the time.
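
A common workaround (my own addition, not something from the original write-up) is to take an explicit copy of the filtered DataFrame before modifying it, so that pandas no longer treats the assignment as writing to a view of df:


data = df[df['target'] != 0].copy()  # explicit copy instead of a view of df
data['target'] = data['target'] - 1  # no SettingWithCopyWarning, because data owns its own memory

X = data.drop('target', axis=1)
y = data['target']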

3. Train with LightGBM

Before that, split the data into training and test sets. We also hold out a validation set so that the metric can be monitored during training and used for early stopping.

In[7]


from sklearn.model_selection import train_test_split

X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=7, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.2, random_state=7, stratify=y_trainval)

Now for the main topic: we will build a LightGBM model.

About hyperparameters

The params argument and the other arguments of LightGBM's train function are hyperparameters. The documentation for them is here.

Also, I think the following articles will be helpful.

- Thorough introduction to LightGBM: How to use LightGBM, how it works, and how it differs from XGBoost
- [Machine learning] How to tune hyperparameters

Here, I will explain only the ones that seem to be important.

- objective: Since this is a two-class classification problem, binary is specified.
- metric: auc is specified, one of the metrics that can be used for binary classification.
- early_stopping_rounds: If the metric does not improve for the specified number of rounds, training is stopped early.

In[8]


import lightgbm as lgb

lgb_train = lgb.Dataset(X_train, y_train)
lgb_val = lgb.Dataset(X_val, y_val)

params = {
    'metric' :'auc', # binary_logloss can also be used
    'objective' :'binary',
    'max_depth' :1,
    'num_leaves' :2,
    'min_data_in_leaf' : 5,    
}

evals_result = {} #Dictionary for storing results

gbm = lgb.train(params,
                lgb_train,
                valid_sets = [lgb_train, lgb_val],
                valid_names = [ 'train', 'eval'],
                num_boost_round = 500,
                early_stopping_rounds = 20,
                verbose_eval = 10,
                evals_result = evals_result
               )

Out[8]


Training until validation scores don't improve for 20 rounds
[10]	train's auc: 0.997559	eval's auc: 0.914062
[20]	train's auc: 0.997559	eval's auc: 0.914062
Early stopping, best iteration is:
[3]	train's auc: 0.998047	eval's auc: 0.914062

4. Plot the learning curve

The evals_result dictionary that we passed to the train function now holds the history of the auc metric we specified. We will use it to draw a learning curve.

In[9]


print(evals_result.keys())
print(evals_result['eval'].keys())
print(evals_result['train'].keys())

train_metric = evals_result['train']['auc']
eval_metric = evals_result['eval']['auc']
train_metric[:5], eval_metric[:5]

Out[9]


dict_keys(['train', 'eval'])
odict_keys(['auc'])
odict_keys(['auc'])

([0.9375, 0.9375, 0.998046875, 0.998046875, 0.998046875],
 [0.8125, 0.8125, 0.9140625, 0.9140625, 0.9140625])

As you can see, the metric history is stored as a list inside the nested dictionary, so we specify the keys to retrieve it.

Then draw a learning curve.

In[10]


plt.plot(train_metric, label='train auc')
plt.plot(eval_metric, label='eval auc')
plt.grid()
plt.legend()
plt.ylim(0, 1.1)

plt.xlabel('rounds')
plt.ylabel('auc')
plt.show()

[Figure: learning curve of the train and eval AUC over boosting rounds]

5. Examine important features

LightGBM can visualize how important each feature was for the classification using lightgbm.plot_importance.

The documentation for lightgbm.plot_importance is here (https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.plot_importance.html).

In[11]


lgb.plot_importance(gbm, figsize=(12, 6), max_num_features=4)
plt.show();

[Figure: feature importance plot produced by lgb.plot_importance]
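
Note that, at least in the LightGBM versions I have used, plot_importance by default counts how many times each feature is used in a split; passing importance_type='gain' instead ranks features by the total gain of the splits that use them. A minimal sketch:


lgb.plot_importance(gbm, figsize=(12, 6), max_num_features=4, importance_type='gain')  # rank by total split gain
plt.show()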

6. Predict on the test data and plot the ROC curve

In[12]


from sklearn import metrics

y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred, drop_intermediate=False, )
auc = metrics.auc(fpr, tpr)
print('auc:', auc)

Out[12]


auc: 1.0

The documentation for roc_curve is here (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html).

Also, while I was looking into ROC curves, the following article explained things in an easy-to-understand way, so I will link it here.

- Calculate ROC curve and its AUC with scikit-learn
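
As a side note (my own addition, not from the article above), scikit-learn can also compute the AUC directly from the predicted scores with roc_auc_score, which should give the same value as metrics.auc(fpr, tpr) computed earlier:


from sklearn.metrics import roc_auc_score

# Computes the ROC AUC directly from the true labels and the predicted scores
print('auc:', roc_auc_score(y_test, y_pred))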

In[13]


plt.plot(fpr, tpr, marker='o')
plt.xlabel('FPR: False positive rate')
plt.ylabel('TPR: True positive rate')
plt.grid()
plt.show()

[Figure: ROC curve for the test set predictions]

The problem turned out to be relatively easy this time, so the ROC curve is an extreme one that hugs the top-left corner.

Finally, set a threshold and convert the predicted scores into class labels based on it.

In[14]


y_pred = (y_pred >= 0.5).astype(int)
y_pred

Out[14]


array([1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0])

7. Draw a heatmap of the confusion matrix

Finally, referring to

- Beautiful visualization with seaborn heatmap

I also drew a heat map of the confusion matrix.

In[15]


from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test.values, y_pred)

sns.heatmap(cm, annot=True, annot_kws={'size': 20}, cmap= 'Reds');

[Figure: heatmap of the confusion matrix for the test set]
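
To go with the heatmap, a quick numerical summary can be printed with scikit-learn's classification_report (my own addition; after the relabeling above, label 0 corresponds to Versicolour and label 1 to Virginica):


from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 score for the thresholded predictions
print(classification_report(y_test, y_pred, target_names=['Versicolour', 'Virginica']))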
