Try Kaggle's Titanic tutorial

I tried Kaggle's Titanic tutorial. Here I record the code from my Jupyter Notebook as a record of learning how to use pandas and scikit-learn.

Kaggle is a site where you compete for scores by applying machine learning to real data.

It is very convenient for studying machine learning: there are tutorials you can try without entering a competition, and you get real data that can be used for machine learning.

To learn how to use Kaggle and the Titanic tutorial, I referred to the article "[Introduction to Kaggle for Beginners] Who will survive the Titanic?".

I modified the code from that article to make it easier for me to understand.

Once you register as a user on Kaggle, you can download the data from Titanic: Machine Learning from Disaster | Kaggle.

The algorithm is a decision tree.

From here on, I work in a Jupyter Notebook environment prepared according to "Easy installation and startup of Jupyter Notebook using Docker (also supports nbextensions and Scala)" on Qiita.

In this environment, you can use Jupyter Notebook by accessing port 8888 in a browser. You can open a new notebook via New > Python 3 from the button at the top right.

Data preparation

import numpy as np
import pandas as pd
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

train and test are DataFrame objects.
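You can display the first few rows to see what columns the data has:

# Show the first five rows of the training data
train.head()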

(screenshot: the first rows of the train DataFrame)

To apply the same preprocessing to train and test, the data is concatenated once. I add a flag column so that I can split it back later. Also, the Survived column exists only in train, so I drop it for now.

train["is_train"] = 1
test["is_train"] = 0
data = pd.concat([train.drop(columns=["Survived"]), test])

`train.drop(columns=["Survived"])` returns a new DataFrame object with the column removed. Because it returns a new object, the column remains in the original `train`. Writing `train.drop(["Survived"], axis=1)` does the same thing; `axis=1` is a flag that drops columns instead of rows. The former is easier to understand, though.
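A quick check confirms that `drop` leaves the original untouched:

# drop returns a new DataFrame; the original train keeps its column
dropped = train.drop(columns=["Survived"])
print("Survived" in train.columns)    # True
print("Survived" in dropped.columns)  # False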

`pd.concat` returns a new DataFrame that is a concatenation of multiple DataFrames. Incidentally, if you pass `axis=1` to `pd.concat`, they are concatenated horizontally instead of vertically.
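Here is a minimal illustration with two throwaway DataFrames (`a` and `b` are made up for this example):

a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"y": [3, 4]})
pd.concat([a, b])          # vertical: 4 rows, NaN where a column is missing
pd.concat([a, b], axis=1)  # horizontal: 2 rows with both columns x and y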

As a result of the concatenation, the data looks like this: 1309 rows.

(screenshot: the concatenated data)
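You can confirm the size with `shape` (the column count assumes the standard Kaggle files plus the `is_train` flag added above):

# 891 train rows + 418 test rows = 1309; 11 original columns + is_train
data.shape  # (1309, 12)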

Dealing with missing data

Check the status of missing data. The cells displayed as NaN in the screenshot above were empty in the original CSV file.

data.isnull().sum()

(screenshot: number of missing values per column)

You can see what values the column contains, and its median, with the code below.

data["Age"].unique()
data["Age"].median()

I fill the missing values in the `Age` column with the median.

data["Age"] = data["Age"].fillna(data["Age"].median())

Next, check what values `Embarked` contains. `value_counts` shows the number of records for each value.

data["Embarked"].value_counts()

(screenshot: record counts for each Embarked value)

Since `S` accounts for most of `Embarked`, I fill the missing values with `S`.

data["Embarked"] = data["Embarked"].fillna("S")
data.isnull().sum()

Convert string data to numbers

Next, convert the string columns `Sex` and `Embarked` to numbers.

`Embarked` has only three patterns, `S`, `C`, and `Q`, so I would like to convert it as shown below. This is called one-hot encoding.

Before:

Embarked
S
S
C
Q

After:

Embarked_C  Embarked_Q  Embarked_S
0           0           1
0           0           1
1           0           0
0           1           0

One-hot encoding can be done with a function called `pd.get_dummies`. The code `pd.get_dummies(data["Embarked"], prefix="Embarked")` produces a three-column DataFrame with columns `Embarked_C`, `Embarked_Q`, and `Embarked_S`; in each row, one of the three columns is 1 and the others are 0. Concatenate this horizontally with the original data and remove the original `Embarked` column.

data = pd.concat([data, pd.get_dummies(data["Embarked"], prefix="Embarked")], axis=1).drop(columns=["Embarked"])

Since `Sex` has only `male` and `female`, there is no need for two columns; a single column of 0s and 1s is enough.

Before:

Sex
male
female
female

After:

Sex
1
0
0

Adding the option `drop_first=True` to `pd.get_dummies` drops the first of the generated columns, leaving only one column in the result.

data["Sex"] = pd.get_dummies(data["Sex"], drop_first=True)

Of the two columns that `pd.get_dummies` would generate, the first one happened to be `female`, so `male` became 1 and `female` became 0. I think it doesn't matter which way around.
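As an aside, the same 0/1 mapping can be written explicitly without `pd.get_dummies` (an alternative to the line above, not an additional step):

# Map male -> 1 and female -> 0 directly
data["Sex"] = (data["Sex"] == "male").astype(int)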

The result is this kind of data.

(screenshot: the converted data)

Split the preprocessed data back into training data and test data, and further narrow it down to just the columns used this time.

feature_columns = ["Pclass", "Sex", "Age", "Embarked_C", "Embarked_Q", "Embarked_S"]
feature_train = data[data["is_train"] == 1].drop(columns=["is_train"])[feature_columns]
feature_test = data[data["is_train"] == 0].drop(columns=["is_train"])[feature_columns]

The objective variable is not included in `data`, so extract it from the original `train`.

target_train = train["Survived"]
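It is worth confirming that the split matches the original sizes (891 training rows and 418 test rows in the standard files):

# The feature matrices and the target should line up row for row
feature_train.shape, target_train.shape, feature_test.shape
# ((891, 6), (891,), (418, 6))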

Learning

Train the model.

from sklearn import tree
model = tree.DecisionTreeClassifier()
model.fit(feature_train, target_train)

Reference: sklearn.tree.DecisionTreeClassifier — scikit-learn 0.21.3 documentation
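As an aside, `DecisionTreeClassifier` grows the tree until the leaves are pure by default, which easily overfits. A sketch of a depth-limited variant (`max_depth=3` and `random_state=0` are illustrative values of my own, not part of the tutorial):

# A shallower tree is less likely to memorize the training data;
# random_state makes the result reproducible
model_shallow = tree.DecisionTreeClassifier(max_depth=3, random_state=0)
model_shallow.fit(feature_train, target_train)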

Let's check the accuracy on the training data.

from sklearn import metrics
pred_train = model.predict(feature_train)
metrics.accuracy_score(target_train, pred_train)

I got 0.9001122334455668, so about 90% of the training data is classified correctly.
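Training accuracy is optimistic because the tree is evaluated on the same data it was fit to. A quick sanity check is k-fold cross-validation (a sketch; the scores vary from run to run):

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation on the training data
scores = cross_val_score(model, feature_train, target_train, cv=5)
scores.mean()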

Evaluation

Submit the predictions to Kaggle to have them evaluated.

First, save the predictions for the test data to my_prediction.csv.

pred_test = model.predict(feature_test)
my_prediction = pd.DataFrame(pred_test, index=test["PassengerId"], columns=["Survived"])
my_prediction.to_csv("my_prediction.csv", index_label=["PassengerId"])

The first line of the CSV file is the header line: PassengerId,Survived.
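You can read the file back to double-check it before uploading:

# Confirm the header line and the first few rows
pd.read_csv("my_prediction.csv").head()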

Upload this file to Kaggle's site and it will give you a score.

(screenshot: the submission score shown on Kaggle)

It was 0.74641.
