This article is the day-23 entry in the PONOS Advent Calendar 2019.
It aims to let you try machine learning quickly, without any tedious setup work. I will not go into detailed explanations of the methods or into approaches for solving the problem itself.
First, register for Kaggle. Kaggle is a platform where data scientists and machine learning engineers around the world compete day and night. A Python execution environment is provided on the web, with all the necessary libraries and training data already available, so you can try things immediately without building a local environment.
Competitions covering a wide variety of data are always running on Kaggle. This time, rather than an active competition, we will use Titanic: Machine Learning from Disaster, which is permanently open as a tutorial. The goal of this competition is to predict whether passengers without survival information survived, using the Titanic passenger list (name, age, gender, cabin class, etc.) and the known survival outcomes as training data.
You can join by pressing the Join Competition button.
Next, create a Notebook by going to the Notebooks tab and pressing New Notebook. You will be taken to a settings screen; the defaults are fine, so just press Create.
First, let's look at the data we will train on. Delete the code that is pre-filled in the notebook and write the following instead.
cell1
import pandas as pd
You can execute the contents of a cell by pressing Ctrl+Enter or the play button on its left. Nothing visible happens here, since we are only loading the library. Press B, or the + Code button below the cell, to add a new cell, then write the following code.
cell2
train = pd.read_csv('../input/titanic/train.csv')
test = pd.read_csv('../input/titanic/test.csv')
passenger_id = test.PassengerId #Save for submission
train.head(3)
If a table is displayed when you run it, it worked. Of these columns, we will use Survived, Pclass, Sex, Age, SibSp, Parch, Fare, and Embarked.
cell2
train = train.iloc[:, [1, 2, 4, 5, 6, 7, 9, 11]]  # Survived, Pclass, Sex, Age, SibSp, Parch, Fare, Embarked
test = test.iloc[:, [1, 3, 4, 5, 6, 8, 10]]  # Pclass, Sex, Age, SibSp, Parch, Fare, Embarked
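Incidentally, the same selection can be written by column name, which is easier to read than positional indices. This is just an equivalent sketch of the iloc lines above:
columns = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
train = train[['Survived'] + columns]  # keep the label plus the feature columns
test = test[columns]  # the test set has no Survived column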
Numerical data is required for training, so we need to shape the data. First, fill in the missing values. train.Age, train.Embarked, test.Age, and test.Fare contain missing entries, so we fill them with reasonable values: here, Embarked is filled with 'S', and the others with their median.
cell2
train.Age = train.Age.fillna(train.Age.median())
train.Embarked = train.Embarked.fillna('S')
test.Age = test.Age.fillna(test.Age.median())
test.Fare = test.Fare.fillna(test.Fare.median())
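You can check the result (or see which columns were missing in the first place, if you run it before the fills) by counting nulls per column:
print(train.isnull().sum())  # per-column count of missing values; all zeros after the fills
print(test.isnull().sum())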
Next, convert Sex and Embarked to numbers with one-hot encoding.
cell2
train = pd.get_dummies(train)
test = pd.get_dummies(test)
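To see what get_dummies actually does, here is a toy illustration (depending on your pandas version, the dummy columns may print as 0/1 or as True/False):
df = pd.DataFrame({'Embarked': ['S', 'C', 'Q']})  # a throwaway example frame
print(pd.get_dummies(df))  # the column becomes Embarked_C, Embarked_Q, Embarked_S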
Finally, convert Age and Fare into discrete values. This uses NumPy, so load that library as well.
cell1
import numpy as np
cell2
train.Age = np.digitize(train.Age, bins=[10, 20, 30, 40, 50])
train.Fare = np.digitize(train.Fare, bins=[10, 20, 30])
test.Age = np.digitize(test.Age, bins=[10, 20, 30, 40, 50])
test.Fare = np.digitize(test.Fare, bins=[10, 20, 30])
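np.digitize returns, for each value, the index of the bin it falls into; a small example makes the mapping concrete:
# With bins=[10, 20, 30]: values below 10 map to 0, 10-19 to 1, 20-29 to 2, 30 and above to 3
print(np.digitize([5, 10, 25, 60], bins=[10, 20, 30]))  # -> [0 1 2 3]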
This time we will use a random forest: a method that trains many slightly different decision trees and averages their outputs. First, load the library (scikit-learn).
cell1
from sklearn.ensemble import RandomForestClassifier
Separate the Survived column of the training data from the features. Add a new cell and write the following code.
cell3
X = train.iloc[:, 1:]  # features (everything except Survived)
y = train.iloc[:, 0]   # label (Survived)
Now that the training data is ready, let's train.
cell3
forest = RandomForestClassifier(n_estimators=5, random_state=0)  # an ensemble of 5 trees, seeded for reproducibility
forest.fit(X, y)
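Before submitting, you can get a rough local estimate of accuracy with cross-validation. This step is not part of the original walkthrough; it is a minimal sketch using scikit-learn's model_selection module, and the exact score will vary:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(forest, X, y, cv=5)  # 5-fold cross-validation on the training data
print(scores.mean())  # mean accuracy across the folds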
Now that training is done, let's make predictions on the test data.
cell3
predictions = forest.predict(test)
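As a quick sanity check, predictions should be an array of 0s and 1s with one entry per test row (418 for this competition):
print(predictions[:10])  # a run of 0s and 1s
print(len(predictions))  # 418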
Finally, save the prediction results to a file.
cell3
submission = pd.DataFrame({ 'PassengerId': passenger_id, 'Survived': predictions })
submission.to_csv('submission.csv', index=False)
Press the Commit button and a pop-up window will appear. When it finishes, press the Open Version button. In the Output section of the screen that opens, you will find the submission.csv saved earlier and a Submit to Competition button; press it. When the submission completes, your score is displayed. It should be around 0.76 (the closer to 1, the better).
As this exercise shows, the library does most of the learning part for you. The real difficulty was overwhelmingly in shaping the data (even more so if you want to push the accuracy higher). If you enjoy that kind of work, the path of machine learning may be for you.
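If you want to dig into that data-shaping work, one simple starting point (not covered in the walkthrough above) is to ask the trained forest which features it relied on:
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))  # most influential features first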