I tried Kaggle's Titanic tutorial. I'm recording the code from my Jupyter Notebook here as a way of learning how to use Pandas and scikit-learn.
Kaggle is a site where people compete for scores by applying machine learning to real data.
It is very convenient for studying machine learning because there are tutorials you can follow without entering a competition, and real data you can use for machine learning.
For how to use Kaggle and for the Titanic tutorial itself, I referred to [Introduction to Kaggle Beginners] Who will survive the Titanic?.
I modified the code from that article to make it easier for me to understand.
Once you register as a user on Kaggle, you can download the data from Titanic: Machine Learning from Disaster | Kaggle.
The algorithm is a decision tree.
From here on, I am working in a Jupyter Notebook environment set up by following Easy installation and startup of Jupyter Notebook using Docker (also supports nbextensions and Scala) - Qiita.
In this environment, you can access port 8888 with a browser to use the Jupyter Notebook, and open a new notebook via New > Python 3 from the button at the top right.
import numpy as np
import pandas as pd
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
`train` and `test` are DataFrame objects.
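As a quick sanity check, you can look at the shapes (a minimal sketch; 891 and 418 rows are the standard sizes of Kaggle's Titanic files):

```python
print(train.shape)  # (891, 12): 891 passengers, 12 columns including Survived
print(test.shape)   # (418, 11): the test set has no Survived column
```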
To preprocess `train` and `test` in the same way, the data is concatenated once. I add a flag so that they can be split again later. Also, the `Survived` column exists only in `train`, so it is dropped for now.
train["is_train"] = 1
test["is_train"] = 0
data = pd.concat([train.drop(columns=["Survived"]), test])
`train.drop(columns=["Survived"])` returns a new DataFrame with the column removed. Because it returns a new object, the column remains in the original `train`. Writing `train.drop(["Survived"], axis=1)` does the same thing; `axis=1` is a flag meaning that columns are dropped instead of rows. The former is easier to understand, though.
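To see that `drop` does not mutate its receiver, here is a minimal sketch with a toy DataFrame made up for illustration:

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
dropped = toy.drop(columns=["b"])  # returns a new DataFrame without "b"
print(toy.columns.tolist())      # ['a', 'b'] -- the original keeps its columns
print(dropped.columns.tolist())  # ['a']
```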
`pd.concat` returns a new DataFrame that is the concatenation of multiple DataFrames. Incidentally, if you pass `axis=1` to `pd.concat`, the DataFrames are concatenated horizontally.
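A minimal sketch of the difference, using toy DataFrames made up for illustration:

```python
import pandas as pd

x = pd.DataFrame({"a": [1, 2]})
y = pd.DataFrame({"a": [3, 4]})

print(pd.concat([x, y]))          # vertical (default axis=0): 4 rows, one "a" column
print(pd.concat([x, y], axis=1))  # horizontal: 2 rows, two "a" columns side by side
```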
As a result of the concatenation, the data looks like this: 1309 rows in total.
Check the state of the missing data. The cells displayed as `NaN` in the screenshot above were empty in the original CSV file.
data.isnull().sum()
You can see what values `Age` contains, and its median, with the code below.
data["Age"].unique()
data["Age"].median()
I will fill the missing values in `Age` with the median.
data["Age"] = data["Age"].fillna(data["Age"].median())
Check what values are in `Embarked`. You can see the number of records for each value.
data["Embarked"].value_counts()
Since `S` accounts for most of `Embarked`, I will fill the missing values with `S`.
data["Embarked"] = data["Embarked"].fillna("S")
data.isnull().sum()
Next, convert the string columns `Sex` and `Embarked` to numbers.
`Embarked` has only three values, `S`, `C`, and `Q`, so I would like to convert it as shown below. This is called one-hot encoding.
Embarked |
---|
S |
S |
C |
Q |
↓
Embarked_C | Embarked_Q | Embarked_S |
---|---|---|
0 | 0 | 1 |
0 | 0 | 1 |
1 | 0 | 0 |
0 | 1 | 0 |
One-hot encoding can be done with a function called `pd.get_dummies`. The call `pd.get_dummies(data["Embarked"], prefix="Embarked")` produces a three-column DataFrame with `Embarked_C`, `Embarked_Q`, and `Embarked_S`. In each row, one of the three columns is 1 and the others are 0. Concatenate this horizontally with the original data and drop the original `Embarked` column.
data = pd.concat([data, pd.get_dummies(data["Embarked"], prefix="Embarked")], axis=1).drop(columns=["Embarked"])
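A minimal sketch reproducing the table above with a toy column (values made up; recent pandas versions may print `True`/`False` instead of 1/0):

```python
import pandas as pd

embarked = pd.Series(["S", "S", "C", "Q"], name="Embarked")
print(pd.get_dummies(embarked, prefix="Embarked"))
#    Embarked_C  Embarked_Q  Embarked_S
# 0           0           0           1
# 1           0           0           1
# 2           1           0           0
# 3           0           1           0
```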
Since `Sex` has only `male` and `female`, two columns are unnecessary; a single 0/1 column is enough.
Sex |
---|
male |
female |
female |
↓
Sex |
---|
1 |
0 |
0 |
Adding the option `drop_first=True` to `pd.get_dummies` removes the first of the generated columns, leaving only one column as a result.
data["Sex"] = pd.get_dummies(data["Sex"], drop_first=True)
Of the two columns `pd.get_dummies` would generate, the first happened to be `female`, so `male` becomes 1 and `female` becomes 0. I don't think it matters which one is dropped.
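The same thing on a toy column (a minimal sketch; again, recent pandas versions may print booleans instead of 1/0):

```python
import pandas as pd

sex = pd.Series(["male", "female", "female"])
print(pd.get_dummies(sex, drop_first=True))  # "female" column dropped
#    male
# 0     1
# 1     0
# 2     0
```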
The result is this kind of data.
Split the preprocessed `data` back into training data and test data, and narrow it down to only the columns used this time.
feature_columns =["Pclass", "Sex", "Age", "Embarked_C", "Embarked_Q", "Embarked_S"]
feature_train = data[data["is_train"] == 1].drop(columns=["is_train"])[feature_columns]
feature_test = data[data["is_train"] == 0].drop(columns=["is_train"])[feature_columns]
The objective variable is not included in `data`, so extract it from the original `train`.
target_train = train["Survived"]
Learn the model.
from sklearn import tree
model = tree.DecisionTreeClassifier()
model.fit(feature_train, target_train)
Reference: sklearn.tree.DecisionTreeClassifier — scikit-learn 0.21.3 documentation
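Once fitted, the model can predict for any rows that have the same columns in the same order. A minimal sketch with a made-up passenger (all feature values here are hypothetical):

```python
# A hypothetical 1st-class, 25-year-old male who embarked at S.
passenger = pd.DataFrame(
    [[1, 1, 25.0, 0, 0, 1]],  # Pclass, Sex, Age, Embarked_C, Embarked_Q, Embarked_S
    columns=feature_columns,
)
print(model.predict(passenger))  # array with a single 0 (died) or 1 (survived)
```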
Let's check the accuracy on the training data.
from sklearn import metrics
pred_train = model.predict(feature_train)
metrics.accuracy_score(target_train, pred_train)
The result was `0.9001122334455668`, so about 90% of the training data is predicted correctly.
Submit the predictions to Kaggle to have them evaluated.
First, save the predictions for the test data to `my_prediction.csv`.
pred_test = model.predict(feature_test)
my_prediction = pd.DataFrame(pred_test, test["PassengerId"], columns=["Survived"])
my_prediction.to_csv("my_prediction.csv", index_label=["PassengerId"])
The first line of the CSV file is the header line: `PassengerId,Survived`.
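To double-check the file before uploading, you can read it back (a minimal sketch):

```python
check = pd.read_csv("my_prediction.csv")
print(check.columns.tolist())  # ['PassengerId', 'Survived']
print(len(check))              # 418 rows, one per test passenger
```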
Upload this file to Kaggle's site and it will give you a score.
Mine was `0.74641`.