Having started studying machine learning, I decided to begin with **Kaggle's beginner tutorial, Titanic**.
At first I played with the features in various ways, referring to Japanese web articles, but even when the model looked accurate on the training data, the score on the submitted test data never improved as much as I expected. I was stuck, unable to break the 80% wall.
In that situation I borrowed the wisdom of my predecessors in the Kaggle/Titanic **Notebooks** (harder to approach because they are in English) and finally made it into the top 2%, so I am leaving this memo focusing on the points that helped the most.
Now, let's follow the code.
First, load the dataset. If train and test are handled separately, every preprocessing step has to be done twice, so concatenate them into a single df.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#Data set loading
train_data = pd.read_csv('./train.csv')
test_data = pd.read_csv('./test.csv')
# train_data and test_data concatenation
test_data['Survived'] = np.nan
df = pd.concat([train_data, test_data], ignore_index=True, sort=False)
#df information
df.info()
#Relationship between Sex and Survival Rate
sns.barplot(x='Sex', y='Survived', data=df, palette='Set3')
plt.show()
df has 13 columns x 1309 rows. The missing values are **Age**: 1309 - 1046 = 263, **Fare**: 1309 - 1308 = 1, **Cabin**: 1309 - 295 = 1014, and **Embarked**: 1309 - 1307 = 2.
Looking at survival rates by sex, **women survived at an overwhelmingly higher rate**.
Since title and age are related, I first filled the missing ages with the **average age for each title**. For some reason, however, using Age imputed this way lowered the accuracy, so I did not adopt it. Perhaps the titles whose ages vary widely are to blame.
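For reference, a minimal sketch of that discarded approach (it assumes the Title extraction shown later in the Name section):

# Sketch of the discarded approach, not part of the final pipeline:
# fill missing Age with the mean age of each honorific (Title).
df['Title'] = df['Name'].map(lambda x: x.split(', ')[1].split('. ')[0])
df['Age'] = df['Age'].fillna(df.groupby('Title')['Age'].transform('mean'))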
In the Notebooks, someone **estimated the missing Age values with a random forest trained on the columns that have no missing values (Pclass, Sex, SibSp, Parch)**. I had this idea from the beginning but thought it was overkill. That was a mistake; it works.
# ------------ Age ------------
# Estimate missing Age values from Pclass, Sex, Parch and SibSp with a random forest
from sklearn.ensemble import RandomForestRegressor
# Columns used for the estimation
age_df = df[['Age', 'Pclass','Sex','Parch','SibSp']]
# One-hot encode the label features
age_df = pd.get_dummies(age_df)
# Split into rows with and without Age, and convert to numpy
known_age = age_df[age_df.Age.notnull()].values
unknown_age = age_df[age_df.Age.isnull()].values
# Split into features X and target y
X = known_age[:, 1:]
y = known_age[:, 0]
# Build the estimation model with a random forest
rfr = RandomForestRegressor(random_state=0, n_estimators=100, n_jobs=-1)
rfr.fit(X, y)
# Predict the missing ages with the fitted model and fill them in
predictedAges = rfr.predict(unknown_age[:, 1:])
df.loc[(df.Age.isnull()), 'Age'] = predictedAges
# Age-specific survival and death curves (training rows only)
facet = sns.FacetGrid(df[0:891], hue="Survived", aspect=2)
facet.map(sns.kdeplot, 'Age', shade=True)
facet.set(xlim=(0, df.loc[0:890, 'Age'].max()))
facet.add_legend()
plt.show()
Drawing the survival and death curves by age after filling in the missing Age values, you can see that **the survival peak is under 10 years old** and **the death peak is in the late 20s**.
From the start I had also **extracted the title from Name and used it as a feature**, and sure enough it contributes to the accuracy (the accuracy drops when it is removed).
# ------------ Name --------------
# Extract the honorific (Title) from Name and group the rare titles
df['Title'] = df['Name'].map(lambda x: x.split(', ')[1].split('. ')[0])
df['Title'].replace(['Capt', 'Col', 'Major', 'Dr', 'Rev'], 'Officer', inplace=True)
df['Title'].replace(['Don', 'Sir', 'the Countess', 'Lady', 'Dona'], 'Royalty', inplace=True)
df['Title'].replace(['Mme', 'Ms'], 'Mrs', inplace=True)
df['Title'].replace(['Mlle'], 'Miss', inplace=True)
df['Title'].replace(['Jonkheer'], 'Master', inplace=True)
sns.barplot(x='Title', y='Survived', data=df, palette='Set3')
Looking at the survival rates by title, **Mr has the lowest survival rate and Mrs the highest**.
Now, in the Notebooks, someone went a step further: they **extracted the last name from Name and focused on groups of passengers who share the same last name**, that is, on **the idea that members of a family tend to share their fate**.
# ------------ Surname ------------
# Extract the Surname (last name) from Name
df['Surname'] = df['Name'].map(lambda name:name.split(',')[0].strip())
# Count how often each Surname appears (2 or more means a family)
df['FamilyGroup'] = df['Surname'].map(df['Surname'].value_counts())
If you then split the family members into a group of **those aged 16 or under, or female** (the so-called women and children) and a group of **men over 16**, an interesting pattern appears in the survival rates.
# Survival rate within families: members aged 16 or under, or female
Female_Child_Group=df.loc[(df['FamilyGroup']>=2) & ((df['Age']<=16) | (df['Sex']=='female'))]
Female_Child_Group=Female_Child_Group.groupby('Surname')['Survived'].mean()
print(Female_Child_Group.value_counts())
In the **group of passengers aged 16 or under, or female**, 113 families have a survival rate of 100%, while 32 families have a survival rate of 0%. In other words, **most of these families survived together, but a few were lost together**.
# Survival rate within families: men over 16
Male_Adult_Group=df.loc[(df['FamilyGroup']>=2) & (df['Age']>16) & (df['Sex']=='male')]
Male_Adult_List=Male_Adult_Group.groupby('Surname')['Survived'].mean()
print(Male_Adult_List.value_counts())
In the **group of men over 16**, 115 families have a survival rate of 0%, while 21 families have a survival rate of 100%. In other words, **most of these families perished together, but in a few every man survived**.
The valuable insight from these facts is that **there are minorities whose fate ran opposite to the overall trend**. Based on this minority information, we take the following action.
# Build the dead list and the survived list
Dead_list=set(Female_Child_Group[Female_Child_Group.apply(lambda x:x==0)].index)
Survived_list=set(Male_Adult_List[Male_Adult_List.apply(lambda x:x==1)].index)
# Show the dead list and the survived list
print('Dead_list = ', Dead_list)
print('Survived_list = ', Survived_list)
# Reflect the dead list and the survived list in Sex, Age and Title
df.loc[(df['Survived'].isnull()) & (df['Surname'].apply(lambda x:x in Dead_list)),\
['Sex','Age','Title']] = ['male',28.0,'Mr']
df.loc[(df['Survived'].isnull()) & (df['Surname'].apply(lambda x:x in Survived_list)),\
['Sex','Age','Title']] = ['female',5.0,'Mrs']
From the training data we collected a **dead list** (Dead_list) of surnames whose women-and-children members all died, and a **survived list** (Survived_list) of surnames whose adult-male members all survived. These lists are now reflected in the test data.
Specifically, if a row in the test data matches the **dead list**, its Sex, Age and Title are rewritten to typical values for a victim so that it will always be predicted as dead; if a row matches the **survived list**, its Sex, Age and Title are rewritten to typical values for a survivor so that it will always be predicted as alive.
It sounds a bit tricky, but since it keeps the estimation model itself simple, I think it is a smart approach. Such is the wisdom of my predecessors.
The missing Fare is filled with the median fare of passengers with the same port of embarkation and class (Embarked = S, Pclass = 3), on the reasoning that the fare should depend on both. This shouldn't cause any problem.
# ----------- Fare -------------
# Fill the missing Fare with the median fare for Embarked='S', Pclass=3
fare=df.loc[(df['Embarked'] == 'S') & (df['Pclass'] == 3), 'Fare'].median()
df['Fare']=df['Fare'].fillna(fare)
**SibSp** is the number of siblings and spouses aboard the Titanic, and **Parch** is the number of parents and children aboard. Rather than using them as separate features, it works better to add them up into a single **Family** size and group it by survival rate. This shouldn't be a problem either.
# ----------- Family -------------
# Family = SibSp + Parch + 1, grouped into Family_label as a feature
df['Family']=df['SibSp']+df['Parch']+1
df.loc[(df['Family']>=2) & (df['Family']<=4), 'Family_label'] = 2
df.loc[(df['Family']>=5) & (df['Family']<=7) | (df['Family']==1), 'Family_label'] = 1  # (5-7 people) or alone
df.loc[(df['Family']>=8), 'Family_label'] = 0
At first, with no particular reasoning, I made a feature out of **the first letter of the Ticket number**. Sure enough, using it lowered the accuracy, so I dropped it. After all, a feature without a rationale behind it is rarely a good idea.
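For reference, a sketch of that discarded feature (the column name Ticket_initial is just for illustration):

# Sketch of the discarded feature, not used in the final model:
# the first character of the Ticket number.
df['Ticket_initial'] = df['Ticket'].str.get(0)
sns.barplot(x='Ticket_initial', y='Survived', data=df, palette='Set3')
plt.show()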
In the Notebooks, however, someone made a feature out of **how many passengers share the same Ticket number**. That one has a clear rationale: **passengers with the same ticket number probably stayed in the same cabin and shared their fate, and the chance of survival should depend on the size of that group**.
Below is a graph of survival rates by number of people with the same Ticket number.
# ----------- Ticket ----------------
# Use the number of passengers sharing the same Ticket number as a feature
Ticket_Count = dict(df['Ticket'].value_counts())
df['TicketGroup'] = df['Ticket'].map(Ticket_Count)
sns.barplot(x='TicketGroup', y='Survived', data=df, palette='Set3')
plt.show()
The survival rate is high for groups of 2 to 4, middling for groups of 5 to 8 and for solo passengers, and zero for groups of 11. So we group them into three labels.
#Grouping into 3 by survival rate
df.loc[(df['TicketGroup']>=2) & (df['TicketGroup']<=4), 'Ticket_label'] = 2
df.loc[(df['TicketGroup']>=5) & (df['TicketGroup']<=8) | (df['TicketGroup']==1), 'Ticket_label'] = 1
df.loc[(df['TicketGroup']>=11), 'Ticket_label'] = 0
sns.barplot(x='Ticket_label', y='Survived', data=df, palette='Set3')
plt.show()
Cabin has many missing values, but the survival rate of the missing-value category U is clearly low, so there is no need to impute it specially; U simply becomes its own label. This shouldn't be a problem either.
# ------------- Cabin ----------------
#Use the first character of Cabin as a feature(Missing value is U)
df['Cabin'] = df['Cabin'].fillna('Unknown')
df['Cabin_label']=df['Cabin'].str.get(0)
sns.barplot(x='Cabin_label', y='Survived', data=df, palette='Set3')
plt.show()
The two missing Embarked values are filled with S, the port that boarded the most passengers. This should be fine too.
# ---------- Embarked ---------------
#Complement missing values with S
df['Embarked'] = df['Embarked'].fillna('S')
Next comes the preprocessing for building the estimation model with a random forest. **Label features can be decomposed by one-hot encoding.**
For example, Embarked consists of three labels: C, Q and S. Applying one-hot encoding automatically creates three columns, Embarked_C, Embarked_Q and Embarked_S, in which exactly one of the three is 1 and the rest are 0.
This makes fine-grained overfitting countermeasures possible: Embarked_C and Embarked_S can be adopted as features while Embarked_Q, which invites overfitting, is dropped.
I saw this for the first time in a Notebook, and I felt that **decomposing features with one-hot encoding and then selecting only the ones you need is a powerful weapon** once you have generated many candidate features.
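As a toy illustration (the small DataFrame below is made up and is not part of the Titanic data), this is what pd.get_dummies does to a label column:

# Toy example: one label column becomes one column per label value, and exactly
# one of them is set per row (1/0 or True/False depending on the pandas version).
demo = pd.DataFrame({'Embarked': ['C', 'Q', 'S', 'S']})
print(pd.get_dummies(demo))  # columns: Embarked_C, Embarked_Q, Embarked_S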
# -------------Preprocessing---------------
#Specify the item to be used for estimation
df = df[['Survived','Pclass','Sex','Age','Fare','Embarked','Title','Family_label','Cabin_label','Ticket_label']]
#One-hot encoding of label features
df = pd.get_dummies(df)
#Split the dataset into train and test
train = df[df['Survived'].notnull()]
test = df[df['Survived'].isnull()].drop('Survived',axis=1)
#Convert dataframe to numpy
X = train.values[:,1:]
y = train.values[:,0]
test_x = test.values
At first, I chose the best combination by manually adding and removing the features I had made and checking how the accuracy changed.
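Roughly, that manual trial and error looked like the sketch below (the candidate subsets here are only illustrative):

# Sketch of manual feature selection: cross-validate a few hand-picked
# subsets of the one-hot encoded features and compare their mean scores.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
candidate_subsets = {
    'all': list(train.columns[1:]),
    'without_Cabin_label': [c for c in train.columns[1:] if not c.startswith('Cabin_label')],
}
for name, cols in candidate_subsets.items():
    scores = cross_val_score(RandomForestClassifier(random_state=10), train[cols].values, y, cv=10)
    print(name, scores.mean())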
However, in the Notebooks there were **people who used SelectKBest to select the features automatically**. This is much more efficient!
The number of features to keep is specified in the form **select = SelectKBest(k=20)**.
# -----------Estimated model construction---------------
from sklearn.feature_selection import SelectKBest
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_validate
#Narrow down the features to be adopted from 25 to 20
select = SelectKBest(k = 20)
clf = RandomForestClassifier(random_state = 10,
                             warm_start = True,  # add estimators to an already fitted model
                             n_estimators = 26,
                             max_depth = 6,
                             max_features = 'sqrt')
pipeline = make_pipeline(select, clf)
pipeline.fit(X, y)
#Display of fit results
cv_result = cross_validate(pipeline, X, y, cv= 10)
print('mean_score = ', np.mean(cv_result['test_score']))
print('mean_std = ', np.std(cv_result['test_score']))
With the features narrowed down to 20, the mean cross-validation score was 0.8417441.
Now, let's check which features were adopted in the model built with the above code.
# --------Adopted features---------------
# Which features were selected
mask = select.get_support()
# Feature names
list_col = list(df.columns[1:])
# Show whether each feature was selected
for i, j in enumerate(list_col):
    print('No'+str(i+1), j, '=', mask[i])
#Checking the shape
X_selected = select.transform(X)
print('X.shape={}, X_selected.shape={}'.format(X.shape, X_selected.shape))
Of the 25 features prepared, Embarked_Q, Title_Officer, Cabin_label_A, Cabin_label_G and Cabin_label_T were not selected and the rest were, so the features were indeed narrowed down to 20.
# -----Creating Submit data-------
PassengerId=test_data['PassengerId']
predictions = pipeline.predict(test_x)
submission = pd.DataFrame({"PassengerId": PassengerId, "Survived": predictions.astype(np.int32)})
submission.to_csv("my_submission.csv", index=False)
As of November 21, 2019, the result was **259th place** with an accuracy of **0.83732**. With 15,889 participants, that puts it around the **top 1.6%**.
To wrap up, here is the know-how gained from the Kaggle/Titanic Notebooks:
1) For missing values, build an estimation model from the columns that have no missing values and impute with it (Age).
2) Even a seemingly random column can yield a new feature if you reason about it (Ticket).
3) Even after one new feature has been found in a column, more may still be hiding there (Name).
4) Frequency of appearance is a useful viewpoint when looking for new features (Surname, Ticket).
5) Decomposing label features with one-hot encoding and selecting among the resulting columns is an effective guard against overfitting (Embarked, Title, Cabin).
6) SelectKBest lets you select features efficiently.
Finally, here is the full code in one place.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#Data set loading
train_data = pd.read_csv('./train.csv')
test_data = pd.read_csv('./test.csv')
# train_data and test_data concatenation
test_data['Survived'] = np.nan
df = pd.concat([train_data, test_data], ignore_index=True, sort=False)
#df information
df.info()
#Relationship between Sex and Survival Rate
sns.barplot(x='Sex', y='Survived', data=df, palette='Set3')
plt.show()
# ------------ Age ------------
# Estimate missing Age values from Pclass, Sex, Parch and SibSp with a random forest
from sklearn.ensemble import RandomForestRegressor
# Columns used for the estimation
age_df = df[['Age', 'Pclass','Sex','Parch','SibSp']]
# One-hot encode the label features
age_df = pd.get_dummies(age_df)
# Split into rows with and without Age, and convert to numpy
known_age = age_df[age_df.Age.notnull()].values
unknown_age = age_df[age_df.Age.isnull()].values
# Split into features X and target y
X = known_age[:, 1:]
y = known_age[:, 0]
# Build the estimation model with a random forest
rfr = RandomForestRegressor(random_state=0, n_estimators=100, n_jobs=-1)
rfr.fit(X, y)
# Predict the missing ages with the fitted model and fill them in
predictedAges = rfr.predict(unknown_age[:, 1:])
df.loc[(df.Age.isnull()), 'Age'] = predictedAges
# Age-specific survival and death curves (training rows only)
facet = sns.FacetGrid(df[0:891], hue="Survived", aspect=2)
facet.map(sns.kdeplot, 'Age', shade=True)
facet.set(xlim=(0, df.loc[0:890, 'Age'].max()))
facet.add_legend()
plt.show()
# ------------ Name --------------
# Extract the honorific (Title) from Name and group the rare titles
df['Title'] = df['Name'].map(lambda x: x.split(', ')[1].split('. ')[0])
df['Title'].replace(['Capt', 'Col', 'Major', 'Dr', 'Rev'], 'Officer', inplace=True)
df['Title'].replace(['Don', 'Sir', 'the Countess', 'Lady', 'Dona'], 'Royalty', inplace=True)
df['Title'].replace(['Mme', 'Ms'], 'Mrs', inplace=True)
df['Title'].replace(['Mlle'], 'Miss', inplace=True)
df['Title'].replace(['Jonkheer'], 'Master', inplace=True)
sns.barplot(x='Title', y='Survived', data=df, palette='Set3')
# ------------ Surname ------------
# Extract the Surname (last name) from Name
df['Surname'] = df['Name'].map(lambda name:name.split(',')[0].strip())
# Count how often each Surname appears (2 or more means a family)
df['FamilyGroup'] = df['Surname'].map(df['Surname'].value_counts())
# Survival rate within families: members aged 16 or under, or female
Female_Child_Group=df.loc[(df['FamilyGroup']>=2) & ((df['Age']<=16) | (df['Sex']=='female'))]
Female_Child_Group=Female_Child_Group.groupby('Surname')['Survived'].mean()
print(Female_Child_Group.value_counts())
# Survival rate within families: men over 16
Male_Adult_Group=df.loc[(df['FamilyGroup']>=2) & (df['Age']>16) & (df['Sex']=='male')]
Male_Adult_List=Male_Adult_Group.groupby('Surname')['Survived'].mean()
print(Male_Adult_List.value_counts())
# Build the dead list and the survived list
Dead_list=set(Female_Child_Group[Female_Child_Group.apply(lambda x:x==0)].index)
Survived_list=set(Male_Adult_List[Male_Adult_List.apply(lambda x:x==1)].index)
# Show the dead list and the survived list
print('Dead_list = ', Dead_list)
print('Survived_list = ', Survived_list)
# Reflect the dead list and the survived list in Sex, Age and Title
df.loc[(df['Survived'].isnull()) & (df['Surname'].apply(lambda x:x in Dead_list)),\
['Sex','Age','Title']] = ['male',28.0,'Mr']
df.loc[(df['Survived'].isnull()) & (df['Surname'].apply(lambda x:x in Survived_list)),\
['Sex','Age','Title']] = ['female',5.0,'Mrs']
# ----------- Fare -------------
# Fill the missing Fare with the median fare for Embarked='S', Pclass=3
fare=df.loc[(df['Embarked'] == 'S') & (df['Pclass'] == 3), 'Fare'].median()
df['Fare']=df['Fare'].fillna(fare)
# ----------- Family -------------
# Family = SibSp + Parch + 1, grouped into Family_label as a feature
df['Family']=df['SibSp']+df['Parch']+1
df.loc[(df['Family']>=2) & (df['Family']<=4), 'Family_label'] = 2
df.loc[(df['Family']>=5) & (df['Family']<=7) | (df['Family']==1), 'Family_label'] = 1  # (5-7 people) or alone
df.loc[(df['Family']>=8), 'Family_label'] = 0
# ----------- Ticket ----------------
# Use the number of passengers sharing the same Ticket number as a feature
Ticket_Count = dict(df['Ticket'].value_counts())
df['TicketGroup'] = df['Ticket'].map(Ticket_Count)
sns.barplot(x='TicketGroup', y='Survived', data=df, palette='Set3')
plt.show()
#Grouping into 3 by survival rate
df.loc[(df['TicketGroup']>=2) & (df['TicketGroup']<=4), 'Ticket_label'] = 2
df.loc[(df['TicketGroup']>=5) & (df['TicketGroup']<=8) | (df['TicketGroup']==1), 'Ticket_label'] = 1
df.loc[(df['TicketGroup']>=11), 'Ticket_label'] = 0
sns.barplot(x='Ticket_label', y='Survived', data=df, palette='Set3')
plt.show()
# ------------- Cabin ----------------
# Use the first character of Cabin as a feature (missing values become U)
df['Cabin'] = df['Cabin'].fillna('Unknown')
df['Cabin_label']=df['Cabin'].str.get(0)
sns.barplot(x='Cabin_label', y='Survived', data=df, palette='Set3')
plt.show()
# ---------- Embarked ---------------
#Complement missing values with S
df['Embarked'] = df['Embarked'].fillna('S')
# -------------Preprocessing---------------
#Specify the item to be used for estimation
df = df[['Survived','Pclass','Sex','Age','Fare','Embarked','Title','Family_label','Cabin_label','Ticket_label']]
#One-hot encoding of label features
df = pd.get_dummies(df)
#Split the dataset into train and test
train = df[df['Survived'].notnull()]
test = df[df['Survived'].isnull()].drop('Survived',axis=1)
#Convert dataframe to numpy
X = train.values[:,1:]
y = train.values[:,0]
test_x = test.values
# -----------Estimated model construction---------------
from sklearn.feature_selection import SelectKBest
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_validate
#Narrow down the features to be adopted from 25 to 20
select = SelectKBest(k = 20)
clf = RandomForestClassifier(random_state = 10,
                             warm_start = True,  # add estimators to an already fitted model
                             n_estimators = 26,
                             max_depth = 6,
                             max_features = 'sqrt')
pipeline = make_pipeline(select, clf)
pipeline.fit(X, y)
#Display of fit results
cv_result = cross_validate(pipeline, X, y, cv= 10)
print('mean_score = ', np.mean(cv_result['test_score']))
print('mean_std = ', np.std(cv_result['test_score']))
# --------Adopted features---------------
# Which features were selected
mask = select.get_support()
# Feature names
list_col = list(df.columns[1:])
# Show whether each feature was selected
for i, j in enumerate(list_col):
    print('No'+str(i+1), j, '=', mask[i])
#Checking the shape
X_selected = select.transform(X)
print('X.shape={}, X_selected.shape={}'.format(X.shape, X_selected.shape))
# -----Creating Submit data-------
PassengerId=test_data['PassengerId']
predictions = pipeline.predict(test_x)
submission = pd.DataFrame({"PassengerId": PassengerId, "Survived": predictions.astype(np.int32)})
submission.to_csv("my_submission.csv", index=False)