This article is the day-23 entry in the PONOS Advent Calendar 2019.
It aims to let you try machine learning quickly, without any tedious setup work. I will not go into detailed explanations of the methods or into approaches for solving the problem itself.
First, register for Kaggle. Kaggle is a platform where data scientists and machine learning engineers around the world compete day and night. A Python execution environment is provided on the web, with all the necessary libraries and training data already available, so you can try things immediately without building a local environment.
Competitions covering a wide variety of data are always running on Kaggle. This time, rather than an active competition, we will use Titanic: Machine Learning from Disaster, which is permanently open as a tutorial. The goal of this competition is to predict whether passengers without survival information survived, using the Titanic passenger list (name, age, gender, cabin class, etc.) and the known survival outcomes as training data.
You can join by pressing the Join Competition button.
Next, create a Notebook by going to the Notebooks tab and pressing New Notebook. You will be taken to a settings screen; the defaults are fine, so just press Create.
First, let's look at the data we will train on. Delete the code that is pre-filled in the notebook and write the following instead.
cell1
import pandas as pd
You can execute the contents of a cell by pressing Ctrl+Enter or the play button on its left. Nothing visible happens here, since we are only loading the library. Press B, or the + Code button below the cell, to add a new cell, then write the following code.
cell2
train = pd.read_csv('../input/titanic/train.csv')
test = pd.read_csv('../input/titanic/test.csv')
passenger_id = test.PassengerId #Save for submission
train.head(3)
If a table is displayed when you run it, it worked. Of these columns, we will use Survived, Pclass, Sex, Age, SibSp, Parch, Fare, and Embarked.
cell2
train = train.iloc[:, [1, 2, 4, 5, 6, 7, 9, 11]]  # Survived, Pclass, Sex, Age, SibSp, Parch, Fare, Embarked
test = test.iloc[:, [1, 3, 4, 5, 6, 8, 10]]  # Pclass, Sex, Age, SibSp, Parch, Fare, Embarked
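Incidentally, the same selection can be written by column name, which is easier to read than positional indices. This is just an equivalent sketch of the iloc lines above:
columns = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
train = train[['Survived'] + columns]  # keep the label plus the feature columns
test = test[columns]  # the test set has no Survived column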
Numerical data is required for training, so we need to shape the data. First, fill in the missing values. train.Age, train.Embarked, test.Age, and test.Fare contain missing entries, so we fill them with reasonable values: here, Embarked is filled with 'S', and the others with their median.
cell2
train.Age = train.Age.fillna(train.Age.median())
train.Embarked = train.Embarked.fillna('S')
test.Age = test.Age.fillna(test.Age.median())
test.Fare = test.Fare.fillna(test.Fare.median())
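You can check the result (or see which columns were missing in the first place, if you run it before the fills) by counting nulls per column:
print(train.isnull().sum())  # per-column count of missing values; all zeros after the fills
print(test.isnull().sum())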
Next, convert Sex and Embarked to numbers with one-hot encoding.
cell2
train = pd.get_dummies(train)
test = pd.get_dummies(test)
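To see what get_dummies actually does, here is a toy illustration (depending on your pandas version, the dummy columns may print as 0/1 or as True/False):
df = pd.DataFrame({'Embarked': ['S', 'C', 'Q']})  # a throwaway example frame
print(pd.get_dummies(df))  # the column becomes Embarked_C, Embarked_Q, Embarked_S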
Finally, convert Age and Fare into discrete values. This uses NumPy, so load that library as well.
cell1
import numpy as np
cell2
train.Age = np.digitize(train.Age, bins=[10, 20, 30, 40, 50])
train.Fare = np.digitize(train.Fare, bins=[10, 20, 30])
test.Age = np.digitize(test.Age, bins=[10, 20, 30, 40, 50])
test.Fare = np.digitize(test.Fare, bins=[10, 20, 30])
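np.digitize returns, for each value, the index of the bin it falls into; a small example makes the mapping concrete:
# With bins=[10, 20, 30]: values below 10 map to 0, 10-19 to 1, 20-29 to 2, 30 and above to 3
print(np.digitize([5, 10, 25, 60], bins=[10, 20, 30]))  # -> [0 1 2 3]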
This time we will use a random forest: a method that trains many slightly different decision trees and averages their outputs. First, load the library (scikit-learn).
cell1
from sklearn.ensemble import RandomForestClassifier
Separate the Survived column of the training data from the features. Add a new cell and write the following code.
cell3
X = train.iloc[:, 1:]  # features (everything except Survived)
y = train.iloc[:, 0]   # label (Survived)
Now that the training data is ready, let's train.
cell3
forest = RandomForestClassifier(n_estimators=5, random_state=0)  # an ensemble of 5 trees, seeded for reproducibility
forest.fit(X, y)
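Before submitting, you can get a rough local estimate of accuracy with cross-validation. This step is not part of the original walkthrough; it is a minimal sketch using scikit-learn's model_selection module, and the exact score will vary:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(forest, X, y, cv=5)  # 5-fold cross-validation on the training data
print(scores.mean())  # mean accuracy across the folds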
Now that training is done, let's make predictions on the test data.
cell3
predictions = forest.predict(test)
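As a quick sanity check, predictions should be an array of 0s and 1s with one entry per test row (418 for this competition):
print(predictions[:10])  # a run of 0s and 1s
print(len(predictions))  # 418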
Finally, save the prediction results to a file.
cell3
submission = pd.DataFrame({ 'PassengerId': passenger_id, 'Survived': predictions })
submission.to_csv('submission.csv', index=False)
Press the Commit button and a pop-up window will appear. When it finishes, press the Open Version button. In the Output section of the screen that opens, you will find the submission.csv saved earlier and a Submit to Competition button; press it. When the submission completes, your score is displayed. It should be around 0.76 (the closer to 1, the better).
As this exercise shows, the library does most of the learning part for you. The real difficulty was overwhelmingly in shaping the data (even more so if you want to push the accuracy higher). If you enjoy that kind of work, the path of machine learning may be for you.
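If you want to dig into that data-shaping work, one simple starting point (not covered in the walkthrough above) is to ask the trained forest which features it relied on:
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))  # most influential features first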