Following on from last time, I will explain the approach that took Kaggle Titanic to the top 1.5% (0.83732). The code used is titanic(0.83732)_3 from GitHub. I will explain how to improve from the submitted score of the previous article to 0.83732.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline,make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn import model_selection
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore')
#Read CSV
train= pd.read_csv("train.csv")
test= pd.read_csv("test.csv")
#Data integration
dataset = pd.concat([train, test], ignore_index = True)
#For submission
PassengerId = test['PassengerId']
#Survival rate comparison by cabin deck
dataset['Cabin'] = dataset['Cabin'].fillna('Unknown') #Fill missing cabin data with 'Unknown'
dataset['Deck'] = dataset['Cabin'].str.get(0) #Take the first letter of Cabin (the deck letter)
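The listing creates the Deck column without showing the comparison itself; a minimal sketch of that plot (my addition), reusing the seaborn import above:
#Sketch: survival rate by deck letter
sns.barplot(x="Deck", y="Survived", data=dataset, palette='Set3')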
#Survival rate comparison by the number of passengers sharing a ticket
Ticket_Count = dict(dataset['Ticket'].value_counts()) #Count how many passengers share each ticket number
dataset['TicketGroup'] = dataset['Ticket'].apply(lambda x:Ticket_Count[x]) #Assign each passenger their ticket's group size
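To see where the group-size thresholds used below come from, a hedged sketch of the comparison plot (my addition, not in the original listing):
#Sketch: survival rate by ticket group size
sns.barplot(x="TicketGroup", y="Survived", data=dataset, palette='Set3')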
#Label the ticket-group sizes by survival rate:
#2 for sizes with a high survival rate, 1 for low, 0 for very large groups
def Ticket_Label(s):
    if (s >= 2) & (s <= 4): #Group sizes with a high survival rate
        return 2
    elif ((s > 4) & (s <= 8)) | (s == 1): #Group sizes with a low survival rate
        return 1
    elif (s > 8):
        return 0
dataset['TicketGroup'] = dataset['TicketGroup'].apply(Ticket_Label)
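Mirroring what we do later for FamilyLabel, a quick hedged check (my addition) that the three labels separate cleanly:
#Sketch: survival rate by ticket label (2 = high, 1 = low, 0 = very large groups)
sns.barplot(x="TicketGroup", y="Survived", data=dataset, palette='Set3')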
Looking at Kaggle's top code, we can see that using name titles is the key to high scores. Honorifics such as Mr, Mrs, and Miss appear in the middle of Name. Occupations such as Dr (doctor) and Rev (priest or minister) may appear in place of Mr. Extract and group this information.
# Split by 'Honorifics' (honorific title)
dataset['Honorifics'] = dataset['Name'].apply(lambda x:x.split(',')[1].split('.')[0].strip()) #Extract the honorific (the word between ',' and '.')
#Group the titles
#Example: 'Capt', 'Col', 'Major', 'Dr', 'Rev' become 'Officer'
Honorifics_Dict = {}
Honorifics_Dict.update(dict.fromkeys(['Capt', 'Col', 'Major', 'Dr', 'Rev'], 'Officer'))
Honorifics_Dict.update(dict.fromkeys(['Don', 'Sir', 'the Countess', 'Dona', 'Lady'], 'Royalty'))
Honorifics_Dict.update(dict.fromkeys(['Mme', 'Ms', 'Mrs'], 'Mrs'))
Honorifics_Dict.update(dict.fromkeys(['Mlle', 'Miss'], 'Miss'))
Honorifics_Dict.update(dict.fromkeys(['Mr'], 'Mr'))
Honorifics_Dict.update(dict.fromkeys(['Master','Jonkheer'], 'Master'))
dataset['Honorifics'] = dataset['Honorifics'].map(Honorifics_Dict)
sns.barplot(x="Honorifics", y="Survived", data=dataset, palette='Set3')
"""List of titles
Mr: man,Master: Boy,Jonkheer: Dutch aristocrat(Man),
Mlle: Mademoiselle(France unmarried woman),Miss: Unmarried women, girls,Mme: Madam(French married woman),Ms: Female(Unmarried or married),Mrs: married woman,
Don: Man(Spain),Sir: Man(England),the Countess: Countess,Dona: Married woman(Spain),Lady: Married woman(England),
Capt: Captain,Col: Colonel,Major: Military personnel,Dr: Doctor,Rev: priests and ministers
"""
As expected, adult men have a low survival rate while women and children have a high one. This time, though, we can also see that the royalty group (aristocrats) has an even higher survival rate than children. It seems the aristocrats of this era were given priority in rescue. Whether or not someone was an aristocrat looks like a usable signal for survival.
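As a hedged numeric check (my addition; the original relies on the bar plot), the group means behind the plot can be listed directly; the NaN Survived values of the test rows are skipped automatically:
#Sketch: numeric survival rate per honorific group
dataset.groupby('Honorifics')['Survived'].mean().sort_values(ascending=False)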
In the previous two articles, we simply put the median into the missing values for the time being. Reviewing those fills improves prediction accuracy. Here we substitute machine-learned predictions for the missing Age values. The title (occupation) data extracted above can also be used as a predictor (it prevents, say, a Dr from being predicted to be 5 years old).
##Predict and fill the missing Age values
#Extract the columns used for age prediction and create dummy variables
age = dataset[['Age','Pclass','Sex','Honorifics']]
age_dummies = pd.get_dummies(age)
age_dummies.head(3)
#Split into rows with known age and rows with missing age
#(as_matrix() was removed in pandas 1.0; .values is the current equivalent)
known_age = age_dummies[age_dummies.Age.notnull()].values
null_age = age_dummies[age_dummies.Age.isnull()].values
#Split into features and target (Age is column 0)
age_X = known_age[:, 1:]
age_y = known_age[:, 0]
#Create an age prediction model and substitute the predicted value
rf = RandomForestRegressor()
rf.fit(age_X, age_y)
pred_Age = rf.predict(null_age[:, 1:])
dataset.loc[(dataset.Age.isnull()),'Age'] = pred_Age
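A hedged sanity check (my addition) that the title information did its job, i.e. that a Dr is not imputed a child's age:
#Sketch: median age per honorific after imputation
dataset.groupby('Honorifics')['Age'].median()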
Next, check the missing data in order to fill in the missing values of 'Embarked' (port of departure).
# Show the rows where 'Embarked' is missing
dataset[dataset['Embarked'].isnull()]
In both rows, 'Pclass' (ticket class) is 1 and 'Fare' is 80. Comparing the median 'Fare' for each 'Embarked' value where 'Pclass' is 1, C is the closest. Substitute C for the two missing values.
# Show the median 'Fare' per 'Embarked' (port of departure) where 'Pclass' (ticket class) is 1
C = dataset[(dataset['Embarked']=='C') & (dataset['Pclass'] == 1)]['Fare'].median()
print("Median C", C)
S = dataset[(dataset['Embarked']=='S') & (dataset['Pclass'] == 1)]['Fare'].median()
print("Median S", S)
Q = dataset[(dataset['Embarked']=='Q') & (dataset['Pclass'] == 1)]['Fare'].median()
print("Median Q", Q)
# Fill the missing 'Embarked' values with 'C'
dataset['Embarked'] = dataset['Embarked'].fillna('C')
Median C 76.7292
Median S 52.0
Median Q 90.0
Next, fill in the missing value of 'Fare' (fare). First check the missing row:
# Show the row where 'Fare' is missing
dataset[dataset['Fare'].isnull()]
Its 'Pclass' (ticket class) is 3 and its 'Embarked' (port of departure) is 'S'. Therefore, substitute the median 'Fare' of rows with 'Pclass' 3 and 'Embarked' 'S' for this missing value. The missing values of Age, Embarked, and Fare are then all filled, so check them.
# Fill with the median 'Fare' where 'Pclass' is 3 and 'Embarked' is 'S'
fare_median=dataset[(dataset['Embarked'] == "S") & (dataset['Pclass'] == 3)].Fare.median()
dataset['Fare']=dataset['Fare'].fillna(fare_median)
#Check the total number of missing data
dataset_null = dataset.fillna(np.nan)
dataset_null.isnull().sum()
Age              0
Cabin            0
Embarked         0
Fare             0
Name             0
Parch            0
PassengerId      0
Pclass           0
Sex              0
SibSp            0
Survived       418
Ticket           0
Deck             0
TicketGroup      0
Honorifics       0
dtype: int64
There are no missing values left (the 418 missing 'Survived' entries are the test rows we have to predict).
Two articles ago, we could not make good use of the number of siblings/spouses and parents/children aboard; here we process them into usable data. We total up the family members aboard and group them by survival rate according to family size.
#Survival rate comparison by number of siblings/spouses aboard
sns.barplot(x="SibSp", y="Survived", data=train, palette='Set3')
#Survival rate comparison by number of parents/children aboard
sns.barplot(x="Parch", y="Survived", data=train, palette='Set3')
#Number of families on board
dataset['FamilySize']=dataset['SibSp']+dataset['Parch']+1
sns.barplot(x="FamilySize", y="Survived", data=dataset, palette='Set3')
#Group family sizes by survival rate
def Family_label(s):
    if (s >= 2) & (s <= 4):
        return 2
    elif ((s > 4) & (s <= 7)) | (s == 1):
        return 1
    elif (s > 7):
        return 0
dataset['FamilyLabel']=dataset['FamilySize'].apply(Family_label)
sns.barplot(x="FamilyLabel", y="Survived", data=dataset, palette='Set3')
The labels now separate the survival rates cleanly.
'SibSp' and 'Parch' do not capture family relationships beyond siblings, spouses, parents, and children, so instead of the declared family we also investigate survival rate by surname. The surname turns out to show a large difference in survival rate.
#Examine the characteristics of surnames
dataset['Surname'] = dataset['Name'].apply(lambda x:x.split(',')[0].strip()) #Extract the surname (the word before ',' in Name)
Surname_Count = dict(dataset['Surname'].value_counts()) #Count the number of surnames
dataset['Surname_Count'] = dataset['Surname'].apply(lambda x:Surname_Count[x]) #Substitute the number of surnames
#Among people who share a surname with at least one other passenger, split into a women-and-children group and an adult-male group
Female_Child_Group=dataset.loc[(dataset['Surname_Count']>=2) & ((dataset['Age']<=12) | (dataset['Sex']=='female'))]
Male_Adult_Group=dataset.loc[(dataset['Surname_Count']>=2) & (dataset['Age']>12) & (dataset['Sex']=='male')]
#Tabulate the per-surname average survival rates in the women-and-children group
Female_Child_mean = Female_Child_Group.groupby('Surname')['Survived'].mean() #Average survival rate per surname
Female_Child_mean_count = pd.DataFrame(Female_Child_mean.value_counts()) #Number of surnames with each average survival rate
Female_Child_mean_count.columns=['GroupCount']
Female_Child_mean_count
#Tabulate the per-surname average survival rates in the adult-male group
Male_Adult_mean = Male_Adult_Group.groupby('Surname')['Survived'].mean() #Average survival rate per surname
Male_Adult_mean_count = pd.DataFrame(Male_Adult_mean.value_counts()) #Number of surnames with each average survival rate
Male_Adult_mean_count.columns=['GroupCount']
Male_Adult_mean_count
In both groups the per-surname average is usually 1 or 0, showing a sharp split between families: the women and children (or the adult men) of a given family typically all survive or all die together. This clear pattern is valuable. Treating the results that contradict it as outliers can be expected to improve the score. Concretely, we rewrite the test data: passengers whose surname belongs to a women-and-children family in which everyone died (or an adult-male family in which everyone survived) are given profile data matching the opposite rule.
#Handle exceptions for each group
#Extract surnames that are exceptions to each group
# Dead_List: surnames where everyone in the women-and-children group died
# Survived_List: surnames where everyone in the adult-male group survived
Dead_List = set(Female_Child_mean[Female_Child_mean.apply(lambda x:x==0)].index)
print("Dead_List", Dead_List, sep="\n")
Survived_List = set(Male_Adult_mean[Male_Adult_mean.apply(lambda x:x==1)].index)
print("Survived_List", Survived_List, sep="\n")
Dead_List
{'Danbom', 'Turpin', 'Zabour', 'Bourke', 'Olsson', 'Goodwin', 'Cacic', 'Robins', 'Canavan', 'Lobb', 'Palsson', 'Ilmakangas', 'Oreskovic', 'Lefebre', 'Sage', 'Johnston', 'Arnold-Franchi', 'Skoog', 'Attalah', 'Lahtinen', 'Jussila', 'Ford', 'Vander Planke', 'Rosblom', 'Boulos', 'Rice', 'Caram', 'Strom', 'Panula', 'Barbara', 'Van Impe'}
Survived_List
{'Chambers', 'Beane', 'Jonsson', 'Cardeza', 'Dick', 'Bradley', 'Duff Gordon', 'Greenfield', 'Daly', 'Nakid', 'Taylor', 'Frolicher-Stehli', 'Beckwith', 'Kimball', 'Jussila', 'Frauenthal', 'Harder', 'Bishop', 'Goldenberg', 'McCoy'}
#Rewrite test data
#Decompose data into train and test
train = dataset.loc[dataset['Survived'].notnull()]
test = dataset.loc[dataset['Survived'].isnull()].copy() #copy so the rewrites below modify an independent frame, not a view
#People whose surname is in Dead_List (women-and-children families that all died) → profile of a 60-year-old man with title Mr
#People whose surname is in Survived_List (adult-male families that all survived) → profile of a 5-year-old girl with title Miss
test.loc[(test['Surname'].apply(lambda x:x in Dead_List)),'Sex'] = 'male'
test.loc[(test['Surname'].apply(lambda x:x in Dead_List)),'Age'] = 60
test.loc[(test['Surname'].apply(lambda x:x in Dead_List)),'Honorifics'] = 'Mr' #the column created earlier is 'Honorifics', not 'Title'
test.loc[(test['Surname'].apply(lambda x:x in Survived_List)),'Sex'] = 'female'
test.loc[(test['Surname'].apply(lambda x:x in Survived_List)),'Age'] = 5
test.loc[(test['Surname'].apply(lambda x:x in Survived_List)),'Honorifics'] = 'Miss'
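As a small hedged check (my addition), you can count how many test rows this rewriting actually touched:
#Sketch: number of test rows rewritten per list
print(test['Surname'].isin(Dead_List).sum(), "rows given the all-died profile")
print(test['Surname'].isin(Survived_List).sum(), "rows given the all-survived profile")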
#Combine data again
dataset = pd.concat([train, test])
#Extract variables to use
dataset6 = dataset[['Survived','Pclass','Sex','Age','Fare','Embarked','Honorifics','FamilyLabel','Deck','TicketGroup']]
#Create a dummy variable
dataset_dummies = pd.get_dummies(dataset6)
dataset_dummies.head(3)
#Decompose data into train and test
#('Survived' exists in train but not in test)
train_set = dataset_dummies[dataset_dummies['Survived'].notnull()]
test_set = dataset_dummies[dataset_dummies['Survived'].isnull()]
del test_set["Survived"]
#Separate the train data into features and labels (.values instead of the removed as_matrix())
X = train_set.values[:, 1:] #Feature columns after 'Survived'
y = train_set.values[:, 0] #Label: 'Survived'
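A quick shape check (my addition) before modeling; the text below mentions 26 features, and printing the shape verifies the count on your run:
#Sketch: confirm the feature matrix and label vector line up
print(X.shape, y.shape)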
#Creating a predictive model
pipe = Pipeline([('classify', RandomForestClassifier(random_state = 10, max_features = 'sqrt'))])
param_test = {'classify__n_estimators': list(range(20, 30, 1)),
              'classify__max_depth': list(range(3, 10, 1))}
gsearch = GridSearchCV(estimator = pipe, param_grid = param_test, scoring='accuracy', cv=10)
gsearch.fit(X, y)
print(gsearch.best_params_, gsearch.best_score_)
#Prediction of test data
predictions = gsearch.predict(test_set)
#Creating a csv file for Kaggle submission
submission = pd.DataFrame({"PassengerId": PassengerId, "Survived": predictions.astype(np.int32)})
submission.to_csv("submission6.csv", index=False)
{'classify__max_depth': 5, 'classify__n_estimators': 28} 0.8451178451178452
The submitted score was 0.81818.
Since the number of features has grown significantly (to 26) compared to last time, we now exclude the unimportant ones.
pipe = Pipeline([('select', SelectKBest(k=20)), #Keep only the 20 features most useful for prediction
                 ('classify', RandomForestClassifier(random_state = 10, max_features = 'sqrt'))])
param_test = {'classify__n_estimators': list(range(20, 30, 1)),
              'classify__max_depth': list(range(3, 10, 1))}
gsearch = GridSearchCV(estimator = pipe, param_grid = param_test, scoring='accuracy', cv=10)
gsearch.fit(X, y)
print(gsearch.best_params_, gsearch.best_score_)
{'classify__max_depth': 6, 'classify__n_estimators': 26} 0.8451178451178452
Compared with the previous model, max_depth and n_estimators have changed. Using these values, we again narrow the features down to 20, build the prediction model, and predict.
#Using the chosen max_depth and n_estimators, narrow the features to 20 and build the final prediction model
select = SelectKBest(k = 20)
clf = RandomForestClassifier(random_state = 10,
                             warm_start = True,
                             n_estimators = 26,
                             max_depth = 6,
                             max_features = 'sqrt')
pipeline = make_pipeline(select, clf)
pipeline.fit(X, y)
cv_score = model_selection.cross_val_score(pipeline, X, y, cv= 10)
print("CV Score : Mean - %.7g | Std - %.7g " % (np.mean(cv_score), np.std(cv_score)))
#Prediction of test data
predictions = pipeline.predict(test_set)
#Creating a csv file for Kaggle submission
submission = pd.DataFrame({"PassengerId": PassengerId, "Survived": predictions.astype(np.int32)})
submission.to_csv("submission7.csv", index=False)
CV Score : Mean - 0.8451402 | Std - 0.03276752
The submitted score is now 0.83732. As of 2019 this ranks 217th, which corresponds to the top 1.5%.
By filling in missing values logically, engineering new features such as titles, and rewriting the test data, we reached a score of 0.83732, equivalent to the top 1.5% of Kaggle Titanic. This walkthrough touched many kinds of data processing, which shows why Titanic is treated as a tutorial for data-analysis skills.
This is the end of the Titanic series. I hope it helps those who have read this far.