This is the story of my first time participating in a Kaggle competition. In the previous post, "First Kaggle", I mainly covered: how to sign up for Kaggle, how to enter a competition, how to join and write code, and how to submit results. This time I would like to go as far as actually training a model on the Titanic competition. Can we beat the sample code's accuracy of 76%?
First, some context on how much machine learning I knew. About half a year ago (April 2019) I became interested in machine learning and studied mainly from the following books. ・ [Python Machine Learning Programming: Theory and Practice by Expert Data Scientists](https://www.amazon.co.jp/dp/4295003379) ・ [Detailed Deep Learning: Time Series Data Processing with TensorFlow and Keras](https://www.amazon.co.jp/dp/4839962510)
In other words, I was at the level of knowing what "scikit-learn", "TensorFlow", and "Keras" are, but not much more.
My understanding is roughly as follows.
At my level, I can just about manage to write training code using scikit-learn or Keras.
The machine learning workflow is as follows.
First, inspect and prepare the data.
First of all, since we will start with a new Notebook, separate from the previous one, click "New Notebook" and, as before, select "Python" as the language and "Notebook" as the type.
Check train.csv. Since we can write code, we could display the data with pandas' `head()`, but the file can also be downloaded, so let's download it. Click train.csv and the first 100 rows are shown on screen; the download button saves the file.
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 3 | Braund, | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S |
2 | 1 | 1 | Cumings, | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
3 | 1 | 3 | Heikkinen, | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
4 | 1 | 1 | Futrelle, | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S |
5 | 0 | 3 | Allen, | male | 35 | 0 | 0 | 373450 | 8.05 | | S |
6 | 0 | 3 | Moran, | male | | 0 | 0 | 330877 | 8.4583 | | Q |
7 | 0 | 1 | McCarthy | male | 54 | 0 | 0 | 17463 | 51.8625 | E46 | S |
8 | 0 | 3 | Palsson, | male | 2 | 3 | 1 | 349909 | 21.075 | | S |
9 | 1 | 3 | Johnson, | female | 27 | 0 | 2 | 347742 | 11.1333 | | S |
10 | 1 | 2 | Nasser, | female | 14 | 1 | 0 | 237736 | 30.0708 | | C |
Check the CSV in Excel or similar. Some columns were unclear to me, but they are described on the competition's Data page. As an aside, as explained in the Overview, the sample gender_submission.csv apparently assumes that "only women survived". Indeed, the values of "Sex" in test.csv and "Survived" in gender_submission.csv match exactly. That makes the sample's 76% accuracy quite a formidable baseline.
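The "only women survived" claim can also be checked in code. This is a minimal sketch with a few inline rows standing in for the real test.csv and gender_submission.csv files (the PassengerId values and labels here are made up for illustration).

```python
import pandas

# A few toy rows standing in for test.csv and gender_submission.csv
test = pandas.DataFrame({
    'PassengerId': [892, 893, 894, 895],
    'Sex': ['male', 'female', 'male', 'female'],
})
submission = pandas.DataFrame({
    'PassengerId': [892, 893, 894, 895],
    'Survived': [0, 1, 0, 1],
})

# "Only women survived" means Survived == 1 exactly where Sex == 'female'
merged = test.merge(submission, on='PassengerId')
only_women_survived = ((merged['Sex'] == 'female') == (merged['Survived'] == 1)).all()
print(only_women_survived)  # True
```

Running the same merge against the real files should print True as well, if the sample submission really is the gender rule.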
Data Dictionary
Variable | Definition | Key |
---|---|---|
survival | Survival | 0 = No, 1 = Yes |
pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
sex | Sex | |
age | Age in years | |
sibsp | # of siblings / spouses aboard the Titanic | |
parch | # of parents / children aboard the Titanic | |
ticket | Ticket number | |
fare | Passenger fare | |
cabin | Cabin number | |
embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
Consider which columns to use for training. "Survived" is what we are asked to predict, so it becomes the training label. Since women and children were more likely to board lifeboats first, "Sex" and "Age" are used. Wealth may also have had an effect, so let's use "Pclass" (ticket class) and "Fare" as well. "Name", "Ticket", and "Embarked" do not seem relevant, so they are excluded. The tricky ones are "SibSp" and "Parch". Aggregating them in Excel gives the tables below. They do look related to survival, but this time I excluded them for simplicity.
value of sibsp | Survived | Total | Survival rate |
---|---|---|---|
0 | 210 | 608 | 35% |
1 | 112 | 209 | 54% |
2 | 13 | 28 | 46% |
3 | 4 | 16 | 25% |
4 | 3 | 18 | 17% |
5 | 0 | 5 | 0% |
8 | 0 | 7 | 0% |
value of parch | Survived | Total | Survival rate |
---|---|---|---|
0 | 233 | 678 | 34% |
1 | 65 | 118 | 55% |
2 | 40 | 80 | 50% |
3 | 3 | 5 | 60% |
4 | 0 | 4 | 0% |
5 | 1 | 5 | 20% |
6 | 0 | 1 | 0% |
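The Excel aggregation above can also be done with pandas `groupby`. This is a sketch on a handful of made-up rows; the real numbers come from train.csv.

```python
import pandas

# Toy rows standing in for train.csv
df = pandas.DataFrame({
    'SibSp':    [0, 0, 0, 1, 1, 2],
    'Survived': [0, 1, 0, 1, 1, 0],
})

# Survivors, totals, and survival rate per SibSp value
summary = df.groupby('SibSp')['Survived'].agg(['sum', 'count'])
summary['rate'] = summary['sum'] / summary['count']
print(summary)
```

Replacing the toy DataFrame with the real train.csv (and 'SibSp' with 'Parch') reproduces the tables above.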
Delete the sample code and write the following code, which loads train.csv and extracts only the required columns ('Survived', 'Pclass', 'Sex', 'Age', 'Fare').
import numpy
import pandas
##############################
# Data preprocessing 1
# Extract the required columns
##############################
# Load train.csv
df_train = pandas.read_csv('/kaggle/input/titanic/train.csv')
df_train = df_train.loc[:, ['Survived', 'Pclass', 'Sex', 'Age', 'Fare']]
df_train.head()
index | Survived | Pclass | Sex | Age | Fare |
---|---|---|---|---|---|
0 | 0 | 3 | male | 22 | 7.25 |
1 | 1 | 1 | female | 38 | 71.2833 |
2 | 1 | 3 | female | 26 | 7.925 |
3 | 1 | 1 | female | 35 | 53.1 |
4 | 0 | 3 | male | 35 | 8.05 |
Only the required columns have been extracted.
Check for missing values.
##############################
# Data preprocessing 2
# Handle missing values
##############################
# Check for missing values
df_train.isnull().sum()
Column | count |
---|---|
Survived | 0 |
Pclass | 0 |
Sex | 0 |
Age | 177 |
Fare | 0 |
Many rows are missing Age. Ideally we would fill in the missing values, but this time let's simply delete those rows.
# Delete rows where Age is null
df_train = df_train.dropna(subset=['Age']).reset_index(drop=True)
len(df_train)
count |
---|
714 |
The rows with null Age have been removed.
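As an alternative to dropping rows, the missing ages could be filled instead. This is a sketch on toy data, using the median age as the fill value (one common choice; the real df_train would be used in practice).

```python
import pandas

# Toy Age column with missing values
df = pandas.DataFrame({'Age': [22.0, None, 38.0, None, 26.0]})

# Fill missing ages with the median instead of deleting the rows
median_age = df['Age'].median()  # 26.0 for this toy data
df['Age'] = df['Age'].fillna(median_age)
print(df['Age'].isnull().sum())  # 0
```

Filling keeps all 891 training rows instead of the 714 that survive the dropna.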
Gender "male" and "female" are difficult to handle as they are, so digitize them. Since there are only two types, male and female, you can convert them yourself, but scikit-learn has a convenient class called LabelEncoder </ b>, so let's use it. LabelEncoder: The fit method and fit_transform method replace the character string with an integer from 0 to N-1 when there are N types of character strings appearing in the input.
##############################
# Data preprocessing 3
# Encode string labels as numbers
##############################
from sklearn.preprocessing import LabelEncoder
# Encode Sex numerically with LabelEncoder
encoder = LabelEncoder()
df_train['Sex'] = encoder.fit_transform(df_train['Sex'].values)
df_train.head()
index | Survived | Pclass | Sex | Age | Fare |
---|---|---|---|---|---|
0 | 0 | 3 | 1 | 22 | 7.25 |
1 | 1 | 1 | 0 | 38 | 71.2833 |
2 | 1 | 3 | 0 | 26 | 7.925 |
3 | 1 | 1 | 0 | 35 | 53.1 |
4 | 0 | 3 | 1 | 35 | 8.05 |
"Sex" has been quantified. This encoder will also be used later when quantifying sex in test.csv.
Rather than feeding raw numeric values into training, it often works better to adjust their scale first (standardization). For example, when analyzing exam results, standardized scores are easier to compare than raw points (out of 100 or out of 200). Let's standardize "Age" and "Fare". As with label encoding, scikit-learn has a handy class for standardization: StandardScaler.
##############################
# Data preprocessing 4
# Standardize numeric columns
##############################
from sklearn.preprocessing import StandardScaler
# Fit the scaler on Age and Fare
standard = StandardScaler()
df_train_std = pandas.DataFrame(standard.fit_transform(df_train.loc[:, ['Age', 'Fare']]), columns=['Age', 'Fare'])
# Standardize Age
df_train['Age'] = df_train_std['Age']
# Standardize Fare
df_train['Fare'] = df_train_std['Fare']
df_train.head()
index | Survived | Pclass | Sex | Age | Fare |
---|---|---|---|---|---|
0 | 0 | 3 | 1 | -0.530376641 | -0.518977865 |
1 | 1 | 1 | 0 | 0.571830994 | 0.69189675 |
2 | 1 | 3 | 0 | -0.254824732 | -0.506213563 |
3 | 1 | 1 | 0 | 0.365167062 | 0.348049152 |
4 | 0 | 3 | 1 | 0.365167062 | -0.503849804 |
Age and Fare have been standardized. Data preparation is now complete.
Once the data is ready, it's time to build the model. For now, let's build one with scikit-learn. Below is the algorithm-selection flowchart from the scikit-learn site.
Let's select a model by following this flowchart. Predicting a category? Yes. Do we have labeled data? Yes. That takes us to "classification" at the upper left, which corresponds to supervised classification, and the chart arrives at "LinearSVC".
When training, the features (x_train) and the answers (y_train) are passed to the model separately. The image is as follows.
y_train | x_train | ||||
---|---|---|---|---|---|
index | Survived | Pclass | Sex | Age | Fare |
0 | 0 | 3 | 1 | -0.530376641 | -0.518977865 |
1 | 1 | 1 | 0 | 0.571830994 | 0.69189675 |
2 | 1 | 3 | 0 | -0.254824732 | -0.506213563 |
3 | 1 | 1 | 0 | 0.365167062 | 0.348049152 |
4 | 0 | 3 | 1 | 0.365167062 | -0.503849804 |
The code is below.
##############################
# Model building
##############################
from sklearn.svm import LinearSVC
# Prepare the training data
x_train = df_train.loc[:, ['Pclass', 'Sex', 'Age', 'Fare']].values
y_train = df_train.loc[:, ['Survived']].values
# Flatten y_train to a 1-D array
y_train = numpy.reshape(y_train, (-1))
# Create the model
model = LinearSVC(random_state=1)
Training is simply a matter of passing the training data to the model.
##############################
# Training
##############################
model.fit(x_train, y_train)
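Before submitting, the model's accuracy can also be estimated locally with cross-validation. This is a sketch only: random toy data stands in for the real x_train and y_train here, so the score itself is meaningless; with the real data, `scores.mean()` gives a rough preview of the leaderboard score.

```python
import numpy
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Toy data standing in for the real x_train / y_train
rng = numpy.random.default_rng(1)
x_train = rng.normal(size=(100, 4))        # 4 features, like Pclass/Sex/Age/Fare
y_train = (x_train[:, 1] > 0).astype(int)  # toy labels tied to one feature

# 5-fold cross-validation: train on 4/5 of the data, score on the held-out 1/5
scores = cross_val_score(LinearSVC(random_state=1), x_train, y_train, cv=5)
print(len(scores), 0.0 <= scores.mean() <= 1.0)
```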
Let's check the training result with the test data. test.csv must be transformed the same way as the training data (x_train). Some Age and Fare values are missing, but we still have to predict a result for every row, so for the test data the missing values are converted to 0 instead of being dropped.
##############################
# Convert test.csv
##############################
# Load test.csv
df_test = pandas.read_csv('/kaggle/input/titanic/test.csv')
# Extract 'PassengerId' (to combine with the results later)
df_test_index = df_test.loc[:, ['PassengerId']]
# Extract 'Pclass', 'Sex', 'Age', 'Fare'
df_test = df_test.loc[:, ['Pclass', 'Sex', 'Age', 'Fare']]
# Encode Sex numerically with the fitted LabelEncoder
df_test['Sex'] = encoder.transform(df_test['Sex'].values)
# Standardize Age and Fare with the fitted StandardScaler
df_test_std = pandas.DataFrame(standard.transform(df_test.loc[:, ['Age', 'Fare']]), columns=['Age', 'Fare'])
# Standardize Age
df_test['Age'] = df_test_std['Age']
# Standardize Fare
df_test['Fare'] = df_test_std['Fare']
# Convert NaN in Age and Fare to 0
df_test = df_test.fillna({'Age': 0, 'Fare': 0})
df_test.head()
Index | Pclass | Sex | Age | Fare |
---|---|---|---|---|
0 | 3 | 1 | 0.298549339 | -0.497810518 |
1 | 3 | 0 | 1.181327932 | -0.512659955 |
2 | 2 | 1 | 2.240662243 | -0.464531805 |
3 | 3 | 1 | -0.231117817 | -0.482887658 |
4 | 3 | 0 | -0.584229254 | -0.417970618 |
The test data has been converted in the same way. Now predict the results: simply pass the test data to predict().
##############################
# Predict results
##############################
x_test = df_test.values
y_test = model.predict(x_test)
The result is in y_test. Save the result in the same format as gender_submission.csv.
# Combine PassengerId and the predicted results into one DataFrame
df_output = pandas.concat([df_test_index, pandas.DataFrame(y_test, columns=['Survived'])], axis=1)
# Write result.csv to the current directory
df_output.to_csv('result.csv', index=False)
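Before committing, it is worth confirming that the output file matches the gender_submission.csv format: exactly two columns, PassengerId and Survived. Sketched here with a few made-up rows in place of the real predictions.

```python
import pandas

# Toy rows standing in for the real PassengerId / prediction output
df_output = pandas.DataFrame({
    'PassengerId': [892, 893, 894],
    'Survived': [0, 1, 0],
})
df_output.to_csv('result.csv', index=False)

# Read the file back and confirm the submission columns
check = pandas.read_csv('result.csv')
print(list(check.columns))  # ['PassengerId', 'Survived']
```

A submission with extra columns, a missing header, or an index column is rejected by the competition scorer.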
With that, we have our results. Run it with "Commit" as before. When execution finishes, click "Open Version" and you can see that result.csv has been created.
Click "Submit to Competition" to submit. What will happen ...
The result was "0.75119", i.e. 75%. That's worse than the sample submission ^^;
So, how was it? I didn't tune the learning parameters at all, but the overall training workflow should now be clear. Next time I will examine the data and the learning parameters more closely so that the score improves a little.
2019/12/11 First edition released
2019/12/26 Added link to the next article
2020/01/03 Fixed source comments
2020/03/26 Partially revised the source code of "6. Predict the result with test data"