Kaggle's Titanic challenge. Last time, I made an all-survival model and a gender-based model
(males die / females survive).
Day 66 [Introduction to Kaggle] The easiest Titanic prediction
This time the topic is machine learning, so I tried Random Forest. The original recipe is here, the most popular notebook on Kaggle: Titanic Data Science Solutions.
It is written in English, so I skimmed it from top to bottom for now. The takeaway: Random Forest seems to be the easiest to use.
I immediately tried running it, building on the previous gender-based model.
Reading train.csv and test.csv is the same as last time.
11.py
import pandas as pd

# Read the CSVs with pandas
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

# Encode Sex as numbers: male -> 0, female -> 1
train_df.replace({'Sex': {'male': 0, 'female': 1}}, inplace=True)
test_df.replace({'Sex': {'male': 0, 'female': 1}}, inplace=True)

# Keep only the columns we need
train_df = train_df.loc[:, ['PassengerId', 'Survived', 'Sex']]
test_df = test_df.loc[:, ['PassengerId', 'Sex']]
- The training data train.csv is split column-wise into the explanatory variables (X) and the objective variable (y).
- It is then split row-wise into pseudo training data (X_train, y_train) and pseudo test data (X_valid, y_valid).
12.py
# Build a baseline model
# Import the data-splitting function
from sklearn.model_selection import train_test_split

# Cut the training columns out of the original data; .values converts to numpy.ndarray
X = train_df.iloc[:, 2:].values  # explanatory variables (causes)
y = train_df.iloc[:, 1].values   # objective variable (result)

# Test data
X_test = test_df.iloc[:, 1:].values  # explanatory variables

# Split the training data to build a prediction model,
# using scikit-learn's train_test_split function.
# Randomize the split; fix the seed at 42 (per The Hitchhiker's Guide to the Galaxy)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=42)
Then I score the model on both the pseudo training data and the pseudo test data. The closer the two scores are, the better the model generalizes. If the training score is much higher, the model is overfitting; if both scores are too low, it is underfitting. Either way, the model needs to be reviewed.
13.py
# Build a prediction model with a random forest
from sklearn.ensemble import RandomForestClassifier

# Fit the model on the pseudo training data
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)

# Score on the pseudo training data (X_train, y_train)
print('Train Score: {}'.format(round(rfc.score(X_train, y_train), 3)))
# Score on the pseudo test data (X_valid, y_valid)
print(' Test Score: {}'.format(round(rfc.score(X_valid, y_valid), 3)))
Train Score: 0.785
 Test Score: 0.791
So this is the result. Is this good enough? Hard to say; I'm the one lacking in learning here. Anyway, the model is built, so let's make predictions with it.
14.py
# Predict on the test data (X_test) with the trained model (rfc.predict)
y_pred = rfc.predict(X_test)

# Convert the result to a pandas DataFrame
submission = pd.DataFrame({
    "PassengerId": test_df["PassengerId"],
    "Survived": y_pred
})

# Output to CSV
submission.to_csv('titanic1-2.csv', index=False)
I'll upload it to Kaggle right away.
Public Score:0.76555
???
This is exactly the same result as the previous "males die / females survive" model. When I checked the CSV file, it was indeed identical. Looking at the original train.csv, the survival rate for women is about 75% and for men about 18%, so I had expected the predictions to differ at least a little, but they did not.
The prediction model was trained on just over 600 of the 891 rows of train.csv after the 7:3 split. Maybe that is not enough data to predict with. Or maybe Random Forest isn't good at making such ambiguous predictions, or maybe I coded something wrong somewhere.
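Actually, the identical output has a simple explanation: with Sex as the only feature, the forest can produce at most two distinct predictions, one per gender, and each leaf votes for the majority outcome of its group. Since fewer than half of the men survived and more than half of the women did, the majority vote collapses to exactly the "males die / females survive" rule. A minimal sketch with synthetic data (the ~18%/~75% survival rates come from train.csv as noted above; the sample size and seed here are made up for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data mimicking the observed Titanic rates:
# Sex = 0 (male) survives ~18% of the time, Sex = 1 (female) ~75%
rng = np.random.default_rng(42)
n = 1000
sex = rng.integers(0, 2, size=n)
p = np.where(sex == 0, 0.18, 0.75)
survived = (rng.random(n) < p).astype(int)

# Fit a forest on the single binary feature
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(sex.reshape(-1, 1), survived)

# The forest can only output the per-gender majority class
print(rfc.predict([[0], [1]]))  # male -> 0 (died), female -> 1 (survived)
```

So as long as Sex is the only input, no classifier that predicts hard labels can do better than the gender rule; different predictions require adding more features.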
I'm not sure about this area, so I'll put it on hold.