Last time, I predicted survival with the decision-tree-based XGBoost: Survival prediction for Kaggle's Titanic with XGBoost [80.1%].
This time, I will try to predict Titanic survival using a **neural network**, which is also commonly used on Kaggle.
import pandas as pd
import numpy as np
train = pd.read_csv('/kaggle/input/titanic/train.csv')
test = pd.read_csv('/kaggle/input/titanic/test.csv')
#Combine train data and test data into one
data = pd.concat([train,test]).reset_index(drop=True)
# Check the number of missing values in each column
train.isnull().sum()
test.isnull().sum()
The number of missing values in each column is as follows.
| | train data | test data |
|---|---|---|
| PassengerId | 0 | 0 |
| Survived | 0 | |
| Pclass | 0 | 0 |
| Name | 0 | 0 |
| Sex | 0 | 0 |
| Age | 177 | 86 |
| SibSp | 0 | 0 |
| Parch | 0 | 0 |
| Ticket | 0 | 0 |
| Fare | 0 | 1 |
| Cabin | 687 | 327 |
| Embarked | 2 | 0 |
The row with the missing Fare has a **Pclass of 3** and **Embarked of S**, so fill it with the **median** Fare among passengers who satisfy both of these conditions.
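As a quick check (a minimal sketch, the column selection is just for illustration), the row in question can be inspected like this:
# Show the row whose Fare is missing, together with the columns used for imputation
print(data.loc[data['Fare'].isnull(), ['PassengerId', 'Pclass', 'Embarked']])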
data['Fare'] = data['Fare'].fillna(data.query('Pclass==3 & Embarked=="S"')['Fare'].median())
Next, I create the feature 'Family_survival' introduced in Titanic [0.82] - [0.83].
**Family members and friends were likely to act together on board**, so whether or not they survived **tends to have the same outcome within a group**.
Therefore, passengers are grouped by the surname taken from Name and by ticket number, and the value is determined by whether other members of the group survived.
**Creating this feature improved the prediction accuracy by about 2%**, so this grouping is quite effective.
# Extract the surname from Name and store it in 'Last_name'
data['Last_name'] = data['Name'].apply(lambda x: x.split(",")[0])
data['Family_survival'] = 0.5  # Default value
# Group by Last_name and Fare
for grp, grp_df in data.groupby(['Last_name', 'Fare']):
    if (len(grp_df) != 1):
        # Two or more passengers share the same surname and the same Fare
        for index, row in grp_df.iterrows():
            smax = grp_df.drop(index)['Survived'].max()
            smin = grp_df.drop(index)['Survived'].min()
            passID = row['PassengerId']
            if (smax == 1.0):
                data.loc[data['PassengerId'] == passID, 'Family_survival'] = 1
            elif (smin == 0.0):
                data.loc[data['PassengerId'] == passID, 'Family_survival'] = 0
# For the group members other than the passenger themselves:
#   at least one survivor                -> 1
#   no survivors (at least one known 0)  -> 0
#   all Survived values are NaN          -> 0.5 (default)
# Group by ticket number
for grp, grp_df in data.groupby('Ticket'):
    if (len(grp_df) != 1):
        # Two or more passengers share the same ticket number
        # If anyone else in the group survived, set 'Family_survival' to 1
        for ind, row in grp_df.iterrows():
            if (row['Family_survival'] == 0) | (row['Family_survival'] == 0.5):
                smax = grp_df.drop(ind)['Survived'].max()
                smin = grp_df.drop(ind)['Survived'].min()
                passID = row['PassengerId']
                if (smax == 1.0):
                    data.loc[data['PassengerId'] == passID, 'Family_survival'] = 1
                elif (smin == 0.0):
                    data.loc[data['PassengerId'] == passID, 'Family_survival'] = 0
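As a rough sanity check (not part of the original code), the distribution of the new feature can be inspected; the exact counts depend on the data:
# How many passengers fall into each Family_survival value
print(data['Family_survival'].value_counts())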
Using the values of SibSp and Parch, create a feature 'Family_size' that indicates how many family members boarded the Titanic together, and bin it by group size.
# Create Family_size
data['Family_size'] = data['SibSp'] + data['Parch'] + 1
# Bin into four groups: 1 / 2-4 / 5-7 / 8+
data['Family_size_bin'] = 0
data.loc[(data['Family_size']>=2) & (data['Family_size']<=4), 'Family_size_bin'] = 1
data.loc[(data['Family_size']>=5) & (data['Family_size']<=7), 'Family_size_bin'] = 2
data.loc[(data['Family_size']>=8), 'Family_size_bin'] = 3
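The same binning could also be written with pd.cut; this is only an equivalent sketch, not the code actually used above:
# Equivalent binning with pd.cut: 1 / 2-4 / 5-7 / 8+ (right-inclusive bin edges)
data['Family_size_bin'] = pd.cut(data['Family_size'],
                                 bins=[0, 1, 4, 7, data['Family_size'].max()],
                                 labels=[0, 1, 2, 3]).astype(int)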
Extract titles such as 'Mr' and 'Miss' from the Name column, and merge rare titles ('Mme', 'Mlle', etc.) into titles with the same meaning.
# Extract the title from Name and store it in 'Title'
data['Title'] = data['Name'].map(lambda x: x.split(', ')[1].split('. ')[0])
# Merge rare titles
data['Title'].replace(['Capt', 'Col', 'Major', 'Dr', 'Rev'], 'Officer', inplace=True)
data['Title'].replace(['Don', 'Sir', 'the Countess', 'Lady', 'Dona'], 'Royalty', inplace=True)
data['Title'].replace(['Mme', 'Ms'], 'Mrs', inplace=True)
data['Title'].replace(['Mlle'], 'Miss', inplace=True)
data['Title'].replace(['Jonkheer'], 'Master', inplace=True)
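To see which titles remain after the merging (output not shown here), something like this can be used:
# Check the remaining titles and their frequencies
print(data['Title'].value_counts())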
Fill the missing values of Age with **the mean age computed for each title**. Then bin Age into three categories: **children (0-18), adults (18-60), and the elderly (60+)**.
# Fill the missing Age values with the mean age of each title
title_list = data['Title'].unique().tolist()
for t in title_list:
    index = data[data['Title']==t].index.values.tolist()
    age = data.iloc[index]['Age'].mean()
    age = np.round(age, 1)
    data.iloc[index, 5] = data.iloc[index, 5].fillna(age)  # column 5 is 'Age'
# Bin by age
data['Age_bin'] = 0
data.loc[(data['Age']>18) & (data['Age']<=60), 'Age_bin'] = 1
data.loc[(data['Age']>60), 'Age_bin'] = 2
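The Age imputation above can also be written more compactly with groupby/transform; this is just an equivalent sketch, not the code used in this post:
# Equivalent: fill missing Age with the rounded mean age of each Title group
data['Age'] = data['Age'].fillna(data.groupby('Title')['Age'].transform('mean').round(1))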
Since Fare has a much larger scale than the other features, **standardize** it (mean 0, standard deviation 1) so that the neural network can learn more easily.
Then convert the string columns into dummy variables with get_dummies. **Pclass is numeric**, but **the magnitude of the value itself has no meaning**, so convert it to dummy variables as well.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Store the standardized Fare in 'Fare_std'
data['Fare_std'] = sc.fit_transform(data[['Fare']])
#Convert to dummy variable
data['Sex'] = data['Sex'].map({'male':0, 'female':1})
data = pd.get_dummies(data=data, columns=['Title','Pclass','Family_survival'])
Finally, remove unnecessary features.
data = data.drop(['PassengerId','Name','Age','SibSp','Parch','Ticket',
'Fare','Cabin','Embarked','Family_size','Last_name'], axis=1)
The data frame looks like this.
| | Survived | Sex | Family_size_bin | Age_bin | Fare_std | Title_Master | Title_Miss | Title_Mr | Title_Mrs | Title_Officer | Title_Royalty | Pclass_1 | Pclass_2 | Pclass_3 | Family_survival_0.0 | Family_survival_0.5 | Family_survival_1.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0 | 1 | 1 | -0.503176 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
1 | 1.0 | 1 | 1 | 1 | 0.734809 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
2 | 1.0 | 1 | 0 | 1 | -0.490126 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
3 | 1.0 | 1 | 1 | 1 | 0.383263 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
4 | 0.0 | 0 | 0 | 1 | -0.487709 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1304 | NaN | 0 | 0 | 1 | -0.487709 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
1305 | NaN | 1 | 0 | 1 | 1.462069 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
1306 | NaN | 0 | 0 | 1 | -0.503176 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
1307 | NaN | 0 | 0 | 1 | -0.487709 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
1308 | NaN | 0 | 1 | 0 | -0.211081 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
1309 rows × 17 columns
Split the combined data back into train data and test data, and the feature engineering is complete.
model_train = data[:891]
model_test = data[891:]
x_train = model_train.drop('Survived', axis=1)
y_train = pd.DataFrame(model_train['Survived'])
x_test = model_test.drop('Survived', axis=1)
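As a quick check that the split matches the original 891 train rows and 418 test rows:
# Expected shapes: (891, 16), (891, 1), (418, 16)
print(x_train.shape, y_train.shape, x_test.shape)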
Now that the data frame is complete, let's build a neural network model and make predictions.
from keras.layers import Dense,Dropout
from keras.models import Sequential
from keras.callbacks import EarlyStopping
#Model initialization
model = Sequential()
#Layer construction
model.add(Dense(12, activation='relu', input_dim=16))
model.add(Dropout(0.2))
model.add(Dense(8, activation='relu'))
model.add(Dense(5, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
#View model structure
model.summary()
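As a sanity check against the summary output, the dense layers hold (16+1)×12 = 204, (12+1)×8 = 104, (8+1)×5 = 45, and (5+1)×1 = 6 weights including biases, so the model has 359 trainable parameters in total.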
Train the model by passing the train data. If you set validation_split, the validation data is automatically split off from the train data, which is convenient.
log = model.fit(x_train, y_train, epochs=5000, batch_size=32,verbose=1,
callbacks=[EarlyStopping(monitor='val_loss',min_delta=0,patience=100,verbose=1)],
validation_split=0.3)
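One optional tweak (not used in the run above): EarlyStopping also accepts restore_best_weights=True, which rolls the model back to the weights of the epoch with the lowest val_loss instead of keeping the weights from the last epoch, for example:
EarlyStopping(monitor='val_loss', min_delta=0, patience=100, verbose=1, restore_best_weights=True)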
Plotting the training progress gives a graph like the following.
import matplotlib.pyplot as plt
plt.plot(log.history['loss'],label='loss')
plt.plot(log.history['val_loss'],label='val_loss')
plt.legend(frameon=False)
plt.xlabel('epochs')
plt.ylabel('crossentropy')
plt.show()
Finally, use predict_classes to output the predicted classes.
#Predict whether it will be classified as 0 or 1
y_pred_cls = model.predict_classes(x_test)
#Create a data frame for kaggle
y_pred_cls = y_pred_cls.reshape(-1)
submission = pd.DataFrame({'PassengerId':test['PassengerId'], 'Survived':y_pred_cls})
submission.to_csv('titanic_nn.csv', index=False)
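Note that predict_classes only exists on Sequential models in older Keras/TensorFlow versions (it was removed around TensorFlow 2.6). On newer versions, an equivalent is to threshold the sigmoid output yourself:
# Equivalent on newer TensorFlow versions: threshold the sigmoid output at 0.5
y_pred_cls = (model.predict(x_test) > 0.5).astype(int).reshape(-1)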
The accuracy of this prediction model was **80.8%**. Since the number of layers and the parameters of a neural network can be chosen freely, I don't know whether this model is optimal, but exceeding 80% seems reasonable.
If you have any opinions or suggestions, I would appreciate a comment or an edit request.
References:
Titanic - Neural Networks [KERAS] - 81.8%
Titanic [0.82] - [0.83]
Data analysis technology that wins with Kaggle: https://www.amazon.co.jp/dp/B07YTDBC3Z
Deep Learning from scratch - Theory and implementation of deep learning learned with Python: https://www.amazon.co.jp/dp/4873117585