Predicting Titanic survivors on Kaggle with a neural network [80.8%]

Last time, I used xgboost, a decision-tree-based method, to predict survival. Previous article: Survival prediction on Kaggle's Titanic with XGBoost [80.1%]

This time, I will predict Titanic survival using a **neural network**, which is often used on Kaggle.

1. Loading the data and checking for missing values

import pandas as pd
import numpy as np

train = pd.read_csv('/kaggle/input/titanic/train.csv')
test = pd.read_csv('/kaggle/input/titanic/test.csv')
# Combine the train and test data into one frame
data = pd.concat([train,test]).reset_index(drop=True)
# Count the missing values in each column
print(train.isnull().sum())
print(test.isnull().sum())

The number of missing values in each column is as follows.

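| Column | train data | test data |
|---|---|---|
| PassengerId | 0 | 0 |
| Survived | 0 | - |
| Pclass | 0 | 0 |
| Name | 0 | 0 |
| Sex | 0 | 0 |
| Age | 177 | 86 |
| SibSp | 0 | 0 |
| Parch | 0 | 0 |
| Ticket | 0 | 0 |
| Fare | 0 | 1 |
| Cabin | 687 | 327 |
| Embarked | 2 | 0 |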
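
A side-by-side count like this can be produced by concatenating the two `isnull().sum()` results. The sketch below uses two tiny made-up frames (`toy_train` and `toy_test` are hypothetical names, not the Kaggle files) to show why a column that exists only in the train data ends up without a value on the test side.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the train and test frames (illustrative values only)
toy_train = pd.DataFrame({'Survived': [0, 1], 'Age': [22.0, np.nan]})
toy_test = pd.DataFrame({'Age': [np.nan, np.nan]})

# Count missing values per column and line the two results up side by side.
# 'Survived' does not exist in toy_test, so its test-side count is NaN.
missing = pd.concat([toy_train.isnull().sum(), toy_test.isnull().sum()],
                    axis=1, keys=['train data', 'test data'])
print(missing)
```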

2. Imputing missing values and creating features

2.1 Imputing Fare

The row with the missing Fare had a **Pclass of 3** and an **Embarked of S**. Fill it with the **median Fare among passengers who meet these two conditions**.

data['Fare'] = data['Fare'].fillna(data.query('Pclass==3 & Embarked=="S"')['Fare'].median())
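
To see what this conditional fill does, here is a minimal sketch on a made-up five-row frame. The values are illustrative and not taken from the Titanic data; `toy` and `median_3s` are names introduced only for this example.

```python
import numpy as np
import pandas as pd

# Toy data: illustrative values only, not real Titanic rows
toy = pd.DataFrame({
    'Pclass':   [3, 3, 3, 1, 3],
    'Embarked': ['S', 'S', 'S', 'S', 'C'],
    'Fare':     [7.0, 8.0, np.nan, 80.0, 15.0],
})

# Median Fare among rows with Pclass == 3 and Embarked == "S" (NaN is skipped)
median_3s = toy.query('Pclass == 3 & Embarked == "S"')['Fare'].median()

# Fill the missing Fare with that conditional median
toy['Fare'] = toy['Fare'].fillna(median_3s)
print(median_3s)             # 7.5
print(toy['Fare'].tolist())  # [7.0, 8.0, 7.5, 80.0, 15.0]
```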

2.2 Creating 'Family_survival' from survival differences between groups

I will create the 'Family_survival' feature introduced in the notebook Titanic [0.82] - [0.83].

**Family members and friends are likely to have stayed together on board**, so whether they survived **tends to be the same within a group**.

Therefore, passengers are grouped first by surname together with Fare and then by ticket number, and the value is set according to whether the other members of the group survived.

**Adding this feature improved prediction accuracy by about 2%**, so this grouping is quite effective.

# Extract the surname from Name and store it in 'Last_name'
data['Last_name'] = data['Name'].apply(lambda x: x.split(",")[0])

data['Family_survival'] = 0.5  # default value
# Group by Last_name and Fare
for grp, grp_df in data.groupby(['Last_name', 'Fare']):
    if (len(grp_df) != 1):
        # Two or more people share the same surname and the same Fare
        for index, row in grp_df.iterrows():
            smax = grp_df.drop(index)['Survived'].max()
            smin = grp_df.drop(index)['Survived'].min()
            passID = row['PassengerId']
            if (smax == 1.0):
                data.loc[data['PassengerId'] == passID, 'Family_survival'] = 1
            elif (smin == 0.0):
                data.loc[data['PassengerId'] == passID, 'Family_survival'] = 0
            # Looking at the group members other than this passenger:
            # at least one survived → 1
            # none survived (some may be NaN) → 0
            # all NaN → 0.5

# Group by ticket number
for grp, grp_df in data.groupby('Ticket'):
    if (len(grp_df) != 1):
        # Two or more people share the same ticket number
        # If any other member of the group survived, set 'Family_survival' to 1
        for ind, row in grp_df.iterrows():
            if (row['Family_survival'] == 0) | (row['Family_survival']== 0.5):
                smax = grp_df.drop(ind)['Survived'].max()
                smin = grp_df.drop(ind)['Survived'].min()
                passID = row['PassengerId']
                if (smax == 1.0):
                    data.loc[data['PassengerId'] == passID, 'Family_survival'] = 1
                elif (smin == 0.0):
                    data.loc[data['PassengerId'] == passID, 'Family_survival'] = 0
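
The rule applied inside each group can be checked on a small made-up frame. The sketch below wraps the same logic in a helper, `family_survival`, which is a hypothetical name introduced here for illustration and is not part of the referenced notebook. The toy rows cover each case: a known survivor among the others, a known death only, and nothing known.

```python
import numpy as np
import pandas as pd

def family_survival(df, keys):
    """For each row, look at the OTHER members of its group:
    1.0 if any of them survived, 0.0 if none survived but at least one
    is known to have died, 0.5 if nothing is known (or the row is alone)."""
    result = pd.Series(0.5, index=df.index)
    for _, grp in df.groupby(keys):
        if len(grp) == 1:
            continue
        for idx in grp.index:
            others = grp.drop(idx)['Survived']
            if others.max() == 1.0:
                result.loc[idx] = 1.0
            elif others.min() == 0.0:
                result.loc[idx] = 0.0
    return result

# Toy data: illustrative values only (NaN marks test rows with unknown outcome)
toy = pd.DataFrame({
    'Last_name': ['Smith', 'Smith', 'Jones', 'Jones', 'Brown', 'Lee', 'Lee'],
    'Fare':      [10.0, 10.0, 20.0, 20.0, 30.0, 40.0, 40.0],
    'Survived':  [1.0, 0.0, 0.0, np.nan, 1.0, np.nan, np.nan],
})
toy['Family_survival'] = family_survival(toy, ['Last_name', 'Fare'])
print(toy['Family_survival'].tolist())  # [0.0, 1.0, 0.5, 0.0, 0.5, 0.5, 0.5]
```

Note that a passenger's own outcome is never used for their own value, which is what keeps the feature usable for test rows where Survived is unknown.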

2.3 Creating and binning 'Family_size', the number of family members

Using the values of SibSp and Parch, create the feature 'Family_size', which indicates how many family members (including the passenger) boarded the Titanic together, and bin it by the number of people.

# Create Family_size
data['Family_size'] = data['SibSp']+data['Parch']+1
# Bin into four groups: 1, 2-4, 5-7, 8 or more
data['Family_size_bin'] = 0
data.loc[(data['Family_size']>=2) & (data['Family_size']<=4),'Family_size_bin'] = 1
data.loc[(data['Family_size']>=5) & (data['Family_size']<=7),'Family_size_bin'] = 2
data.loc[(data['Family_size']>=8),'Family_size_bin'] = 3
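
The same binning can also be written with `pd.cut`. This is an alternative sketch on made-up rows, not the code used for the score above. The bin edges are assumptions chosen to reproduce the four groups (1, 2-4, 5-7, 8 or more).

```python
import pandas as pd

# Toy data: illustrative SibSp/Parch combinations
toy = pd.DataFrame({'SibSp': [0, 1, 1, 3, 4, 5], 'Parch': [0, 0, 2, 1, 2, 2]})
toy['Family_size'] = toy['SibSp'] + toy['Parch'] + 1  # 1, 2, 4, 5, 7, 8

# Intervals are right-inclusive: (0,1] -> 0, (1,4] -> 1, (4,7] -> 2, (7,inf] -> 3
toy['Family_size_bin'] = pd.cut(toy['Family_size'],
                                bins=[0, 1, 4, 7, float('inf')],
                                labels=[0, 1, 2, 3]).astype(int)
print(toy['Family_size_bin'].tolist())  # [0, 1, 1, 2, 2, 3]
```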

2.4 Creating the title feature 'Title'

Extract titles such as 'Mr' and 'Miss' from the Name column. Merge the rare titles ('Mme', 'Mlle', etc.) into more common titles with the same meaning.

# Extract the title from Name and store it in 'Title'
data['Title'] = data['Name'].map(lambda x: x.split(', ')[1].split('. ')[0])
# Merge the rare titles (assign the result back; inplace replace on a selected column is unreliable in recent pandas)
data['Title'] = data['Title'].replace(['Capt', 'Col', 'Major', 'Dr', 'Rev'], 'Officer')
data['Title'] = data['Title'].replace(['Don', 'Sir',  'the Countess', 'Lady', 'Dona'], 'Royalty')
data['Title'] = data['Title'].replace(['Mme', 'Ms'], 'Mrs')
data['Title'] = data['Title'].replace(['Mlle'], 'Miss')
data['Title'] = data['Title'].replace(['Jonkheer'], 'Master')
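
As a variation on the split-based extraction, the title can be pulled out with a regular expression and the merging done with a single mapping. This is a sketch on a few names in the dataset's "Surname, Title. Given names" format. The regular expression and the `mapping` dictionary are choices made for this example, and the mapping covers only the titles that appear in it.

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
    'Aubart, Mme. Leontine Pauline',
])

# The text between ", " and the first "." is the title
titles = names.str.extract(r',\s*([^.]+)\.', expand=False).str.strip()

# Map rare titles onto broader groups; titles not in the mapping are kept as-is
mapping = {'the Countess': 'Royalty', 'Mme': 'Mrs'}
titles = titles.replace(mapping)
print(titles.tolist())  # ['Mr', 'Miss', 'Royalty', 'Mrs']
```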

2.5 Imputing and binning Age

Fill the missing Age values with **the mean age computed for each title**. Then bin Age into three categories: **children (up to 18), adults (over 18, up to 60), and elderly people (over 60)**.

# Fill missing Age values with the mean age for each title
title_list = data['Title'].unique().tolist()
for t in title_list:
    mask = data['Title'] == t
    age = np.round(data.loc[mask, 'Age'].mean(), 1)
    # Select the column by name rather than by position
    data.loc[mask, 'Age'] = data.loc[mask, 'Age'].fillna(age)

# Bin by age (0: up to 18, 1: over 18 and up to 60, 2: over 60)
data['Age_bin'] = 0
data.loc[(data['Age']>18) & (data['Age']<=60),'Age_bin'] = 1
data.loc[(data['Age']>60),'Age_bin'] = 2
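
The per-title fill can also be expressed without a loop by using `groupby(...).transform`. This is an alternative sketch on made-up rows, under the same assumptions as above (mean per title rounded to one decimal, bins at 18 and 60).

```python
import numpy as np
import pandas as pd

# Toy data: illustrative values only
toy = pd.DataFrame({
    'Title': ['Mr', 'Mr', 'Mr', 'Miss', 'Miss', 'Master', 'Mrs'],
    'Age':   [30.0, 40.0, np.nan, 20.0, np.nan, 4.0, 62.0],
})

# Mean age per title, broadcast back to every row, rounded to one decimal
title_mean = toy.groupby('Title')['Age'].transform('mean').round(1)
toy['Age'] = toy['Age'].fillna(title_mean)

# The first matching condition wins: over 60 -> 2, over 18 -> 1, otherwise 0
toy['Age_bin'] = np.select([toy['Age'] > 60, toy['Age'] > 18], [2, 1], default=0)
print(toy['Age'].tolist())      # [30.0, 40.0, 35.0, 20.0, 20.0, 4.0, 62.0]
print(toy['Age_bin'].tolist())  # [1, 1, 1, 1, 1, 0, 2]
```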

2.6 Standardizing Fare and creating dummy variables

Since Fare values span a much wider range than the other features, **standardize** them (mean 0, standard deviation 1) so that the neural network can learn more easily.

Then convert the string-valued columns to dummy variables with get_dummies. **Pclass is numeric**, but **the magnitude of its value has no meaning in itself**, so convert it to dummy variables as well.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
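# Store the standardized Fare in 'Fare_std'
data['Fare_std'] = sc.fit_transform(data[['Fare']]).ravel()
# Encode Sex as 0/1 and convert the categorical columns to dummy variables
data['Sex'] = data['Sex'].map({'male':0, 'female':1})
# dtype=int keeps the dummies as 0/1 (recent pandas versions default to True/False)
data = pd.get_dummies(data=data, columns=['Title','Pclass','Family_survival'], dtype=int)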
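
Standardization itself is just z = (x - mean) / std. The following NumPy sketch on a few made-up fares shows the calculation and checks the resulting mean and standard deviation. It uses the population standard deviation (ddof=0), which is also what StandardScaler uses.

```python
import numpy as np

fare = np.array([7.25, 71.28, 7.92, 53.10, 8.05])  # illustrative values

# z = (x - mean) / std, with the population standard deviation (NumPy's default, ddof=0)
fare_std = (fare - fare.mean()) / fare.std()

print(np.isclose(fare_std.mean(), 0.0))  # True
print(np.isclose(fare_std.std(), 1.0))   # True
```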

Finally, remove unnecessary features.

data = data.drop(['PassengerId','Name','Age','SibSp','Parch','Ticket',
                     'Fare','Cabin','Embarked','Family_size','Last_name'], axis=1)

The data frame looks like this.

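| | Survived | Sex | Family_size_bin | Age_bin | Fare_std | Title_Master | Title_Miss | Title_Mr | Title_Mrs | Title_Officer | Title_Royalty | Pclass_1 | Pclass_2 | Pclass_3 | Family_survival_0.0 | Family_survival_0.5 | Family_survival_1.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0 | 1 | 1 | -0.503176 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 1 | 1.0 | 1 | 1 | 1 | 0.734809 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 1.0 | 1 | 0 | 1 | -0.490126 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 3 | 1.0 | 1 | 1 | 1 | 0.383263 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0.0 | 0 | 0 | 1 | -0.487709 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1304 | NaN | 0 | 0 | 1 | -0.487709 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 1305 | NaN | 1 | 0 | 1 | 1.462069 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1306 | NaN | 0 | 0 | 1 | -0.503176 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 1307 | NaN | 0 | 0 | 1 | -0.487709 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 1308 | NaN | 0 | 1 | 0 | -0.211081 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |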

1309 rows × 17 columns

Splitting the combined data back into train data and test data completes the feature processing.

model_train = data[:891]
model_test = data[891:]

x_train = model_train.drop('Survived', axis=1)
y_train = pd.DataFrame(model_train['Survived'])
x_test = model_test.drop('Survived', axis=1)

3. Building the model and making predictions

Now that the data frame is complete, let's build a neural network model and make predictions.

from keras.layers import Dense,Dropout
from keras.models import Sequential
from keras.callbacks import EarlyStopping
# Initialize the model
model = Sequential()
# Add the layers
model.add(Dense(12, activation='relu', input_dim=16))
model.add(Dense(8, activation='relu'))
model.add(Dense(5, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# Compile the model (metrics is passed as a list)
model.compile(optimizer = 'adam', loss='binary_crossentropy', metrics=['acc'])
# Show the model structure
model.summary()
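
The layer widths fix the number of trainable parameters, which is what the model structure view reports. As a small worked calculation, assuming only the widths given in the model above (16 inputs, then 12, 8, 5 and 1 units):

```python
# Layer widths taken from the model above: 16 inputs -> 12 -> 8 -> 5 -> 1
widths = [16, 12, 8, 5, 1]

# A Dense layer has (inputs * units) weights plus one bias per unit
params = [n_in * n_out + n_out for n_in, n_out in zip(widths, widths[1:])]
print(params)       # [204, 104, 45, 6]
print(sum(params))  # 359
```

With only a few hundred parameters the network is small, which suits a training set of 891 rows.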

Train the model by passing it the train data. If you set validation_split, the validation data is split off from the train data automatically, which is convenient.

log = model.fit(x_train, y_train, epochs=5000, batch_size=32, verbose=1,
                )  # also pass validation_split and an EarlyStopping callback here


Plotting the training progress as a graph looks like this.

import matplotlib.pyplot as plt

[Figure: graph of the training progress]

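Finally, output the predicted classes by thresholding the sigmoid output of predict at 0.5 (predict_classes, which did this in one call, is no longer available in recent Keras releases).

# Predict whether each passenger is classified as 0 or 1
y_pred_cls = (model.predict(x_test) > 0.5).astype(int)
# Create the submission data frame for Kaggle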
y_pred_cls = y_pred_cls.reshape(-1)
submission = pd.DataFrame({'PassengerId':test['PassengerId'], 'Survived':y_pred_cls})
submission.to_csv('titanic_nn.csv', index=False)
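
For a single sigmoid output, turning probabilities into class labels amounts to thresholding at 0.5 and flattening the result. A minimal NumPy sketch, with made-up probabilities standing in for the model output (`proba` and `pred` are names used only here):

```python
import numpy as np

# Made-up sigmoid outputs with shape (n_samples, 1), the shape a Keras model returns
proba = np.array([[0.91], [0.12], [0.50], [0.67]])

# Class 1 when the probability exceeds 0.5, then flatten to a 1-D vector
pred = (proba > 0.5).astype(int).reshape(-1)
print(pred.tolist())  # [1, 0, 0, 1]
```

A probability of exactly 0.5 falls into class 0 with this strict comparison.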

The accuracy of this prediction model was **80.8%**. Because the number of layers and the other parameters of a neural network can be chosen freely, I don't know whether this model is optimal, but exceeding 80% is a reasonable result.

If you have any opinions or suggestions, I would appreciate it if you could leave a comment or an edit request.

Sites and books I referred to

- Titanic - Neural Networks [KERAS] - 81.8%
- Titanic [0.82] - [0.83]
- Data analysis technology that wins with Kaggle (Amazon, dp/B07YTDBC3Z)
- Deep Learning from scratch: Theory and implementation of deep learning learned with Python (Amazon, dp/4873117585)

