Following on from last time, I will explain the approach that took the Kaggle Titanic competition into the top 1.5% (a score of 0.83732). The code used is titanic(0.83732)_2 on GitHub. This time we raise the submitted score to 0.81339 and set up for the final 0.83732. Before making predictions, we also visualize and analyze the data used last time.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline,make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn import model_selection
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore')
#Read CSV
train= pd.read_csv("train.csv")
test= pd.read_csv("test.csv")
#Data integration
dataset = pd.concat([train, test], ignore_index = True)
#For submission
PassengerId = test['PassengerId']
Let's look at how each variable relates to survival.
#Sex and survival bar graph
sns.barplot(x="Sex", y="Survived", data=train, palette='Set3')
#Survival rate by gender
print("females: %.2f" %(train['Survived'][train['Sex'] == 'female'].value_counts(normalize = True)[1]))
print("males: %.2f" %(train['Survived'][train['Sex'] == 'male'].value_counts(normalize = True)[1]))
females: 0.74
males: 0.19

You can see that women were far more likely to survive. What about the survival rate for each ticket class?
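The same rates can be computed more compactly with groupby (a minimal equivalent sketch, not part of the original code):

#Mean of the 0/1 Survived flag per group is the survival rate
print(train.groupby('Sex')['Survived'].mean())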
#Ticket class and survival bar graph
sns.barplot(x='Pclass', y='Survived', data=train, palette='Set3')
#Survival rate by ticket class
print("Pclass = 1 : %.2f" %(train['Survived'][train['Pclass']==1].value_counts(normalize = True)[1]))
print("Pclass = 2 : %.2f" %(train['Survived'][train['Pclass']==2].value_counts(normalize = True)[1]))
print("Pclass = 3 : %.2f" %(train['Survived'][train['Pclass']==3].value_counts(normalize = True)[1]))
Pclass = 1 : 0.63
Pclass = 2 : 0.47
Pclass = 3 : 0.24

The higher the ticket class, the higher the survival rate. What about fares?
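A single crosstab shows the same breakdown at once (a sketch using the same train DataFrame):

#Row-normalized table: each row gives the death/survival shares for one class
print(pd.crosstab(train['Pclass'], train['Survived'], normalize='index'))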
#Survival rate comparison by price
fare = sns.FacetGrid(train, hue="Survived",aspect=2)
fare.map(sns.kdeplot,'Fare',shade= True)
fare.set(xlim=(0, 200))
fare.add_legend()
As expected, the survival rate is low for passengers who paid low fares.
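To attach rough numbers to the plot, Fare can be binned into quartiles (a hedged sketch; the quartile split is my choice, not the author's):

#Survival rate per fare quartile; qcut forms four roughly equal-sized bins
print(train.groupby(pd.qcut(train['Fare'], 4))['Survived'].mean())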
#Survival rate comparison by age
age = sns.FacetGrid(train, hue="Survived",aspect=2)
age.map(sns.kdeplot,'Age',shade= True)
age.set(xlim=(0, train['Age'].max()))
age.add_legend()
Were children rescued first? You can see that the survival rate for passengers under 10 is high.
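The child effect can be confirmed numerically by bucketing Age (a sketch; the 10-year bins are an assumption of mine, not taken from the post):

#Survival rate per 10-year age band; rows with missing Age drop out automatically
bins = list(range(0, 90, 10))
print(train.groupby(pd.cut(train['Age'], bins))['Survived'].mean())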
From here, we will look at data that was not used last time. First is the cabin information. The first letter of Cabin (the room number) appears to indicate the deck, i.e., the level of the ship the room was on.
#Survival rate comparison by deck
dataset['Cabin'] = dataset['Cabin'].fillna('Unknown') #Use 'Unknown' where the cabin is missing
dataset['Deck'] = dataset['Cabin'].str.get(0) #Take the first character of Cabin as the deck
sns.barplot(x="Deck", y="Survived", data=dataset, palette='Set3')
Survival clearly varies by deck. As last time, we fill the missing Age and Fare values with their means and missing Embarked with 'S', confirm that no missing values remain, and then add the new 'Deck' (deck level) feature and predict again.
# Fill Age and Fare with their respective means, and Embarked with 'S' (Southampton)
dataset["Age"].fillna(dataset.Age.mean(), inplace=True)
dataset["Fare"].fillna(dataset.Fare.mean(), inplace=True)
dataset["Embarked"].fillna("S", inplace=True)
#Check the number of missing values per column
dataset_null = dataset.fillna(np.nan)
dataset_null.isnull().sum()
#Extract variables to use
dataset3 = dataset[['Survived','Pclass','Sex','Age','Fare','Embarked', 'Deck']]
#Create a dummy variable
dataset_dummies = pd.get_dummies(dataset3)
dataset_dummies.head(3)
#Decompose data into train and test
#('Survived' exists in train but not in test)
train_set = dataset_dummies[dataset_dummies['Survived'].notnull()]
test_set = dataset_dummies[dataset_dummies['Survived'].isnull()]
del test_set["Survived"]
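As a quick sanity check (a sketch; 891 and 418 are the standard Titanic train/test row counts):

#The competition data has 891 labeled and 418 unlabeled rows
assert len(train_set) == 891
assert len(test_set) == 418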
#Split train data into features and target
X = train_set.values[:, 1:] #Explanatory variables (all columns after Survived)
y = train_set.values[:, 0]  #Target: Survived
#Creating a predictive model
clf = RandomForestClassifier(random_state = 10, max_features='sqrt')
pipe = Pipeline([('classify', clf)])
param_test = {'classify__n_estimators':list(range(20, 30, 1)), #Try 20-29 in steps of 1
              'classify__max_depth':list(range(3, 10, 1))} #Try 3-9 in steps of 1
grid = GridSearchCV(estimator = pipe, param_grid = param_test, scoring='accuracy', cv=10)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
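To see how flat the grid is around the optimum, the full cross-validation table is available on the fitted search object (a sketch using scikit-learn's standard cv_results_ attribute):

#Top five parameter combinations by mean CV accuracy
results = pd.DataFrame(grid.cv_results_)
print(results[['params', 'mean_test_score']].sort_values('mean_test_score', ascending=False).head())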
#Prediction of test data
pred = grid.predict(test_set)
#Creating a csv file for Kaggle submission
submission = pd.DataFrame({"PassengerId": PassengerId, "Survived": pred.astype(np.int32)})
submission.to_csv("submission3.csv", index=False)
{'classify__max_depth': 8, 'classify__n_estimators': 22} 0.8327721661054994

The submitted score was 0.78947. Adding the deck information improved on the previous submission.
Next, let's try the ticket information. But how should tickets be grouped? We could split by ticket length or by the leading letter, but adding too many categories will hurt accuracy. Here we group passengers by how many people share the same ticket number and compare survival rates across group sizes.
#Survival rate comparison by ticket group size
Ticket_Count = dict(dataset['Ticket'].value_counts()) #Count how many passengers share each ticket number
dataset['TicketGroup'] = dataset['Ticket'].apply(lambda x:Ticket_Count[x]) #Assign each passenger their ticket-group size
sns.barplot(x='TicketGroup', y='Survived', data=dataset, palette='Set3')
There is visible variation here too, distinct from the earlier Cabin (deck) split.
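It also helps to see how the group sizes are distributed before modeling (a sketch; TicketGroup here is still the raw shared-ticket count):

#Number of passengers at each ticket-group size
print(dataset['TicketGroup'].value_counts().sort_index())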
#Extract variables to use
dataset4 = dataset[['Survived','Pclass','Sex','Age','Fare','Embarked', 'Deck', 'TicketGroup']]
#Create a dummy variable
dataset_dummies = pd.get_dummies(dataset4)
dataset_dummies.head(4)
#Decompose data into train and test
#('Survived' exists in train but not in test)
train_set = dataset_dummies[dataset_dummies['Survived'].notnull()]
test_set = dataset_dummies[dataset_dummies['Survived'].isnull()]
del test_set["Survived"]
#Split train data into features and target
X = train_set.values[:, 1:] #Explanatory variables (all columns after Survived)
y = train_set.values[:, 0]  #Target: Survived
#Creating a predictive model
clf = RandomForestClassifier(random_state = 10, max_features='sqrt')
pipe = Pipeline([('classify', clf)])
param_test = {'classify__n_estimators':list(range(20, 30, 1)), #Try 20-29 in steps of 1
              'classify__max_depth':list(range(3, 10, 1))} #Try 3-9 in steps of 1
grid = GridSearchCV(estimator = pipe, param_grid = param_test, scoring='accuracy', cv=10)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_, sep="\n")
#Prediction of test data
pred = grid.predict(test_set)
#Creating a csv file for Kaggle submission
submission = pd.DataFrame({"PassengerId": PassengerId, "Survived": pred.astype(np.int32)})
submission.to_csv("submission4.csv", index=False)
{'classify__max_depth': 8, 'classify__n_estimators': 23} 0.8406285072951739

The cross-validation score went up, but the score submitted to Kaggle fell to 0.77990. Realistically, the raw ticket-group size probably has only a weak correlation with survival. Still, since it is a feature we went to some trouble to create, let's compress it into a few labels: 2 for group sizes with high survival, 1 for sizes with low survival, and 0 for very large groups.
#Relabel ticket group sizes by survival rate:
#2 for sizes with high survival, 1 for low survival, 0 for very large groups
def Ticket_Label(s):
    if 2 <= s <= 4:                  #Group sizes 2-4 show high survival
        return 2
    elif (4 < s <= 8) or (s == 1):   #Singletons and sizes 5-8 show low survival
        return 1
    elif s > 8:                      #Very large groups
        return 0
dataset['TicketGroup'] = dataset['TicketGroup'].apply(Ticket_Label)
sns.barplot(x='TicketGroup', y='Survived', data=dataset, palette='Set3')
The groups now separate cleanly.
#Rebuild the feature table with the relabeled TicketGroup
dataset5 = dataset[['Survived','Pclass','Sex','Age','Fare','Embarked', 'Deck', 'TicketGroup']]
dataset_dummies = pd.get_dummies(dataset5)
#Decompose data into train and test
#('Survived' exists in train but not in test)
train_set = dataset_dummies[dataset_dummies['Survived'].notnull()]
test_set = dataset_dummies[dataset_dummies['Survived'].isnull()]
del test_set["Survived"]
#Split train data into features and target
X = train_set.values[:, 1:] #Explanatory variables (all columns after Survived)
y = train_set.values[:, 0]  #Target: Survived
#Creating a predictive model
clf = RandomForestClassifier(random_state = 10, max_features='sqrt')
pipe = Pipeline([('classify', clf)])
param_test = {'classify__n_estimators':list(range(20, 30, 1)), #Try 20-29 in steps of 1
              'classify__max_depth':list(range(3, 10, 1))} #Try 3-9 in steps of 1
grid = GridSearchCV(estimator = pipe, param_grid = param_test, scoring='accuracy', cv=10)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_, sep="\n")
#Prediction of test data
pred = grid.predict(test_set)
#Creating a csv file for Kaggle submission
submission = pd.DataFrame({"PassengerId": PassengerId, "Survived": pred.astype(np.int32)})
submission.to_csv("submission5.csv", index=False)
{'classify__max_depth': 7, 'classify__n_estimators': 23} 0.8417508417508418

The score submitted to Kaggle improved significantly, to 0.81339.
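As an extra check that is not in the original post, the fitted forest's feature importances show which of the added columns actually mattered (a sketch reaching into the pipeline's 'classify' step; column order matches the X array built above):

#Importance of each dummy column in the best random forest
best_rf = grid.best_estimator_.named_steps['classify']
importances = pd.Series(best_rf.feature_importances_, index=train_set.columns[1:])
print(importances.sort_values(ascending=False))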
This time, by adding the deck information and the ticket-group feature relabeled into high- and low-survival groups, the submitted score improved from the previous 0.78468 to 0.81339. Next time, I will finally explain the approach that reaches the submitted score of 0.83732, corresponding to the top 1.5%.