Kaggle is a site where you can test your skills by competing to solve various problems with data analysis. It is also a good place to study, because you can download datasets and read the analyses (kernels) that other people have published.
The Titanic competition is one of Kaggle's competitions and is used by many beginners as a tutorial. The task is to predict which passengers survived the sinking of the Titanic: given data on 891 passengers, predict the survival of the remaining 418.
This series explains, step by step for beginners, the techniques needed to reach a submission score of 0.83732 (equivalent to the top 1.5%) using Random Forest. This installment covers the steps up to a submission score of 0.78468; the next raises the score to 0.81339, and the one after that reaches the top-1.5% score of 0.83732. All the code used is published on GitHub; the code for this installment is titanic(0.83732)_1.
Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn import model_selection
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore')
Read CSV and check the contents
#Read CSV
train= pd.read_csv("train.csv")
test= pd.read_csv("test.csv")
#Data integration
dataset = pd.concat([train, test], ignore_index = True)
#For submission
PassengerId = test['PassengerId']
#Check the first three rows of train
train.head(3)
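For reference, these are the first three rows of the standard Kaggle train.csv (shown here for orientation, since the descriptions below refer to this table):
PassengerId  Survived  Pclass  Name                                                  Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
1            0         3       Braund, Mr. Owen Harris                               male    22.0  1      0      A/5 21171         7.2500   NaN    S
2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Thayer)   female  38.0  1      0      PC 17599          71.2833  C85    C
3            1         3       Heikkinen, Miss. Laina                                female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S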
A brief description of each column is as follows.
· PassengerId – unique passenger ID
· Survived – survival flag (0 = died, 1 = survived)
· Pclass – ticket class
· Name – passenger's name
· Sex – gender (male / female)
· Age – age
· SibSp – number of siblings / spouses aboard the Titanic
· Parch – number of parents / children aboard the Titanic
· Ticket – ticket number
· Fare – fare
· Cabin – cabin number
· Embarked – port of embarkation
Some of the variable values need a little more explanation.
Pclass – ticket class:
· 1 = upper class (wealthy)
· 2 = middle class (ordinary)
· 3 = lower class (working class)
Embarked – port of embarkation:
· C = Cherbourg
· Q = Queenstown
· S = Southampton
NaN represents missing data (in the table above, you can see two NaNs in Cabin). Let's check the total number of missing values.
#Check the total number of missing data
dataset_null = dataset.fillna(np.nan)
dataset_null.isnull().sum()
Age            263
Cabin         1014
Embarked         2
Fare             1
Name             0
Parch            0
PassengerId      0
Pclass           0
Sex              0
SibSp            0
Survived       418
Ticket           0
dtype: int64
Cabin alone has as many as 1014 missing values. Next, let's check the overall summary statistics.
#Check statistical data
dataset.describe()
As a first pass, fill the missing values with simple substitutes (the mean for the numeric columns, the most common port for Embarked) and check the accuracy that results.
#Drop Cabin for now (it has too many missing values)
del dataset["Cabin"]
#Fill Age and Fare with their respective means, and Embarked with "S" (Southampton)
dataset["Age"] = dataset["Age"].fillna(dataset["Age"].mean())
dataset["Fare"] = dataset["Fare"].fillna(dataset["Fare"].mean())
dataset["Embarked"] = dataset["Embarked"].fillna("S")
#Check the total number of missing data
dataset_null = dataset.fillna(np.nan)
dataset_null.isnull().sum()
Age              0
Embarked         0
Fare             0
Name             0
Parch            0
PassengerId      0
Pclass           0
Sex              0
SibSp            0
Survived       418
Ticket           0
dtype: int64
Now there are no missing values apart from Survived, and its 418 missing entries simply correspond to the 418 rows of test data, so this is expected. Next, organize the data for prediction. To start, use Pclass, Sex, Age, Fare, and Embarked. The categorical columns are also converted to dummy variables so the model can work with them. (Sex, for example, currently has the two values male and female; after conversion it becomes the columns Sex_male and Sex_female, with Sex_male set to 1 for a male passenger and 0 otherwise.)
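As a quick sanity check (a minimal sketch, not part of the original code), we can confirm that the unlabeled rows really are the test rows:
#The rows with missing Survived should be exactly the 418 test rows
assert dataset["Survived"].isnull().sum() == len(test)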
#Extract only variables to use
dataset1 = dataset[['Survived','Pclass','Sex','Age','Fare','Embarked']]
#Create dummy variables
dataset_dummies=pd.get_dummies(dataset1)
dataset_dummies.head(3)
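Note that get_dummies only expands the string-typed columns (Sex and Embarked); Pclass is an integer column and is left as-is. To confirm what the encoding produced, you can list the columns (a minimal check; the expected names assume pandas' default get_dummies naming):
#List the columns after one-hot encoding
print(dataset_dummies.columns.tolist())
#Expected: ['Survived', 'Pclass', 'Age', 'Fare', 'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_Q', 'Embarked_S']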
Now let the machine learn. We search for the best predictive model by varying n_estimators and max_depth of the RandomForestClassifier.
#Decompose data into train and test
#('Survived' exists in train but not in test)
train_set = dataset_dummies[dataset_dummies['Survived'].notnull()]
test_set = dataset_dummies[dataset_dummies['Survived'].isnull()].drop(columns=['Survived'])
#Split train data into features and target
X = train_set.values[:, 1:] #feature variables (Pclass onward)
y = train_set.values[:, 0] #target: Survived
#Creating a predictive model
clf = RandomForestClassifier(random_state = 10, max_features='sqrt')
pipe = Pipeline([('classify', clf)])
param_test = {'classify__n_estimators':list(range(20, 30, 1)), #try n_estimators from 20 to 29
              'classify__max_depth':list(range(3, 10, 1))} #try max_depth from 3 to 9
grid = GridSearchCV(estimator = pipe, param_grid = param_test, scoring='accuracy', cv=10)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_, sep="\n")
{'classify__max_depth': 8, 'classify__n_estimators': 23}
0.8316498316498316
With max_depth 8 and n_estimators 23, the best model reaches a cross-validation accuracy of about 83% on the training data. Predict the test data with this model and create a submission file (submission1.csv).
#Prediction of test data
pred = grid.predict(test_set)
#Creating a csv file for Kaggle submission
submission = pd.DataFrame({"PassengerId": PassengerId, "Survived": pred.astype(np.int32)})
submission.to_csv("submission1.csv", index=False)
When I actually submitted it, the score was 0.78468, a fairly high score right from the start.
Next, add Parch (number of parents / children aboard) and SibSp (number of siblings / spouses aboard) to the features used for prediction.
#Extract variables to use
dataset2 = dataset[['Survived','Pclass','Sex','Age','Fare','Embarked', 'Parch', 'SibSp']]
#Create dummy variables
dataset_dummies = pd.get_dummies(dataset2)
dataset_dummies.head(3)
#Decompose data into train and test
#('Survived' exists in train but not in test)
train_set = dataset_dummies[dataset_dummies['Survived'].notnull()]
test_set = dataset_dummies[dataset_dummies['Survived'].isnull()].drop(columns=['Survived'])
#Split train data into features and target
X = train_set.values[:, 1:] #feature variables (Pclass onward)
y = train_set.values[:, 0] #target: Survived
#Creating a predictive model
clf = RandomForestClassifier(random_state = 10, max_features='sqrt')
pipe = Pipeline([('classify', clf)])
param_test = {'classify__n_estimators':list(range(20, 30, 1)), #try n_estimators from 20 to 29
              'classify__max_depth':list(range(3, 10, 1))} #try max_depth from 3 to 9
grid = GridSearchCV(estimator = pipe, param_grid = param_test, scoring='accuracy', cv=10)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_, sep="\n")
#Prediction of test data
pred = grid.predict(test_set)
#Creating a csv file for Kaggle submission
submission = pd.DataFrame({"PassengerId": PassengerId, "Survived": pred.astype(np.int32)})
submission.to_csv("submission2.csv", index=False)
{'classify__max_depth': 7, 'classify__n_estimators': 25}
0.8417508417508418
With max_depth 7 and n_estimators 25, the best model reaches a cross-validation accuracy of about 84% on the training data. Although this is more accurate than before, when I submitted this model's test predictions (submission2.csv) the score dropped to 0.76076. The model appears to have overfit, so it seems better not to use Parch (number of parents / children aboard) and SibSp (number of siblings / spouses aboard).
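One way to probe the gap between the cross-validation score and the leaderboard score (a minimal sketch, not from the original article) is to look at the spread of the individual fold scores rather than just their mean; a higher mean paired with a wider spread is one hint that the extra features make the model less stable:
from sklearn.model_selection import cross_val_score

#10-fold CV scores of the best pipeline on the current feature set
scores = cross_val_score(grid.best_estimator_, X, y, scoring='accuracy', cv=10)
print('mean', round(scores.mean(), 4), 'std', round(scores.std(), 4))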
I made predictions for Kaggle's tutorial competition, Titanic. The best submission score this time was 0.78468. The next installment will visualize the data and walk through the process of raising the submission score to 0.83732.