This time, I worked on the following competition with LightGBM and summarized it briefly.
【Overview】
・ Titanic: Machine Learning from Disaster
・ Based on the passenger information of the sunken ship "Titanic", predict which passengers survived and which did not.
【Target readers】
・ Kaggle beginners
・ Those who want to learn about the basic code of LightGBM
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import lightgbm as lgb
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.metrics import accuracy_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
print(train.shape)
print(test.shape)
# (891, 12)
# (418, 11)
train.head()
【Data items】
・ PassengerId: Passenger ID
・ Survived: Whether the passenger survived (0: not saved, 1: saved)
・ Pclass: Ticket class (1: upper class, 2: middle class, 3: lower class)
・ Name: Passenger's name
・ Sex: Gender
・ Age: Age
・ SibSp: Number of siblings / spouses aboard
・ Parch: Number of parents / children aboard
・ Ticket: Ticket number
・ Fare: Fare
・ Cabin: Cabin number
・ Embarked: Port of embarkation (C: Cherbourg, Q: Queenstown, S: Southampton)
test.head()
Save the passenger number (PassengerId) of the test data.
PassengerId = test['PassengerId']
The model itself will be built only on the train data, but since I want to preprocess the train and test data together, I will combine them.
The train data has one extra column (the objective variable, Survived), so it is separated first.
y = train['Survived']
train = train.drop(columns='Survived')
print(train.shape)
print(test.shape)
# (891, 11)
# (418, 11)
Now that the train data and test data have the same number of columns (features), combine them.
X_total = pd.concat([train, test], axis=0)
print(X_total.shape)
X_total.head()
# (1309, 11)
First, check how many missing values there are.
print(X_total.isnull().sum())
'''
PassengerId 0
Pclass 0
Name 0
Sex 0
Age 263
SibSp 0
Parch 0
Ticket 0
Fare 1
Cabin 1014
Embarked 2
dtype: int64
'''
LightGBM can build a model with string data as it is, so preprocessing is done without converting those columns to numbers. Here, the missing values are simply filled with -999.
X_total.fillna(value=-999, inplace=True)
print(X_total.isnull().sum())
'''
PassengerId 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 0
Embarked 0
dtype: int64
'''
Now, check which columns hold string data (hereinafter referred to as categorical columns).
categorical_col = [col for col in X_total.columns if X_total[col].dtype == 'object']
print('categorical_col:', categorical_col)
# categorical_col: ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']
Examine the data type of each categorical column.
for i in categorical_col:
    print('{}: {}'.format(i, X_total[i].dtype))
'''
Name: object
Sex: object
Ticket: object
Cabin: object
Embarked: object
'''
LightGBM can model string data, but the columns need to be of category type rather than object type, so we convert the data type.
for i in categorical_col:
    X_total[i] = X_total[i].astype("category")
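As an aside, a category column stores its distinct levels plus an integer code per row, which is what lets LightGBM consume it directly. You can peek at one converted column like this:
print(X_total['Sex'].cat.categories)    # the distinct levels (female / male)
print(X_total['Sex'].cat.codes.head())  # the integer codes backing each row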
Let's look at the data types in the combined data (X_total).
for i in X_total.columns:
    print('{}: {}'.format(i, X_total[i].dtype))
'''
PassengerId: int64
Pclass: int64
Name: category
Sex: category
Age: float64
SibSp: int64
Parch: int64
Ticket: category
Fare: float64
Cabin: category
Embarked: category
'''
Now extract the rows corresponding to the original train data from the combined data.
train_rows = train.shape[0]
X = X_total.iloc[:train_rows]
print(X.shape)
print(y.shape)
# (891, 11)
# (891,)
Now that the features and the objective variable corresponding to the train data are ready, we further split them into training data and test data and build the model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
# (623, 11)
# (623,)
# (268, 11)
# (268,)
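As an optional refinement (not used in this article), passing stratify=y keeps the ratio of survivors roughly the same in both halves of the split:
# Optional: a stratified split preserves the class balance (not used above).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)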
Set the parameters in a dictionary and pass them to LGBMClassifier() as keyword arguments.
params = {
"random_state": 42
}
cls = lgb.LGBMClassifier(**params)
cls.fit(X_train, y_train, categorical_feature=categorical_col)
'''
LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
importance_type='split', learning_rate=0.1, max_depth=-1,
min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
random_state=42, reg_alpha=0.0, reg_lambda=0.0, silent=True,
subsample=1.0, subsample_for_bin=200000, subsample_freq=0)
'''
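The output above shows the model's default parameters. If you want to adjust them, a few commonly tuned LGBMClassifier parameters can be added to the same dictionary (the values below are illustrative, not tuned for this data):
# Illustrative, untuned values for commonly adjusted parameters.
params = {
    "random_state": 42,
    "n_estimators": 200,    # number of boosting rounds (default is 100)
    "learning_rate": 0.05,  # smaller steps, usually paired with more rounds
    "num_leaves": 31,       # controls the complexity of each tree
}
cls = lgb.LGBMClassifier(**params)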
Next, obtain the predicted values.
Taking column [:, 1] of predict_proba gives y_proba, the predicted probability of Class 1 (Survived = 1). y_pred is 1 where that probability exceeds 0.5 and 0 otherwise.
y_proba = cls.predict_proba(X_test)[:, 1]
print(y_proba[:5])
y_pred = cls.predict(X_test)
print(y_pred[:5])
# [0.38007409 0.00666063 0.04531554 0.95244042 0.35233708]
# [0 0 0 1 0]
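As a quick sanity check (nothing model-specific), manually thresholding the probabilities at 0.5 should reproduce predict():
# Applying the 0.5 threshold by hand should match predict().
y_pred_manual = (y_proba > 0.5).astype(int)
print((y_pred_manual == y_pred).all())
# True (expected)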
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
auc_score = roc_auc_score(y_test, y_proba)
plt.plot(fpr, tpr, label='AUC = %.3f' % (auc_score))
plt.legend()
plt.title('ROC curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.grid(True)
plt.show()
print('accuracy:', accuracy_score(y_test, y_pred))
print('f1_score:', f1_score(y_test, y_pred))
# accuracy: 0.8208955223880597
# f1_score: 0.7446808510638298
We will also evaluate using the confusion matrix.
classes = [1, 0]
cm = confusion_matrix(y_test, y_pred, labels=classes)
cmdf = pd.DataFrame(cm, index=classes, columns=classes)
sns.heatmap(cmdf, annot=True)
plt.show()
print(classification_report(y_test, y_pred))
'''
precision recall f1-score support
0 0.83 0.89 0.86 168
1 0.80 0.70 0.74 100
accuracy 0.82 268
macro avg 0.81 0.80 0.80 268
weighted avg 0.82 0.82 0.82 268
'''
6. Submit
Now that a model has been built and evaluated on the train data, we feed it the test data and obtain the predicted values.
First, extract the part corresponding to the test data from the total data (X_total).
X_submit = X_total.iloc[train_rows:]
print(X_train.shape)
print(X_submit.shape)
# (623, 11)
# (418, 11)
Compared to X_train, which was used to build the model, X_submit has the same number of features (11). Feed X_submit into the model to get the predicted values.
y_proba_submit = cls.predict_proba(X_submit)[:, 1]
print(y_proba_submit[:5])
y_pred_submit = cls.predict(X_submit)
print(y_pred_submit[:5])
# [0.00948223 0.02473048 0.01005387 0.50935871 0.45433965]
# [0 0 0 1 0]
Prepare the CSV data to submit to Kaggle.
First, create a data frame with the necessary information.
df_submit = pd.DataFrame(y_pred_submit, index=PassengerId, columns=['Survived'])
df_submit.head()
Then convert it to CSV data.
df_submit.to_csv('titanic_lgb_submit.csv')
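Before uploading, it doesn't hurt to reload the file and confirm its format (418 rows with PassengerId and Survived columns):
# Reload the submission file to verify its format before uploading.
check = pd.read_csv('titanic_lgb_submit.csv')
print(check.shape)
print(check.columns.tolist())
# (418, 2)
# ['PassengerId', 'Survived']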
That completes the submission.
This article was written for Kaggle beginners. I hope it helped you, even a little.
Thank you for reading.