This is the story of participating in a Kaggle competition for the first time. In the previous article, "Checking the correlation with Kaggle's Titanic", I checked the correlations and decided to use three inputs: Pclass (ticket class), Sex (gender), and Fare (fare). This time I would like to try several models.
To state the result first, the score went up a little to "0.77511", which puts me in the top 58% (as of December 29, 2019). Let's walk through the flow up to resubmission.
Last time, I used "Linear SVC", following the scikit-learn algorithm cheat sheet. The book from which I first learned machine learning (Amazon product ID 4295003379) takes up the following scikit-learn models for classification problems.
・ sklearn.svm.LinearSVC
・ sklearn.svm.SVC
・ sklearn.ensemble.RandomForestClassifier
・ sklearn.linear_model.LogisticRegression
・ sklearn.linear_model.SGDClassifier
This time, I would like to try the models above.
The procedure for evaluating the model is as follows.
Kaggle's Titanic provides training data, train.csv (data with known results), and test data, test.csv (data with unknown results). If you use test.csv for step 2 ("predict") and step 3 ("confirm"), you have to commit and submit the result every time, which is inefficient. Because the results in train.csv are known, you can evaluate a model efficiently by dividing train.csv itself into training data and test data. scikit-learn provides the function "train_test_split" for this split.
from sklearn.model_selection import train_test_split
# Split training data and test data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1, shuffle=True)
The split looks like the following. With test_size=0.3, the data is divided into training data and test data at a ratio of 7:3.
〇 Data before division
No. | Survived (y) | Pclass (x) | Sex (x) | Fare (x) |
1 | 0 | 3 | male | 7.25 |
2 | 1 | 1 | female | 71.2833 |
3 | 1 | 3 | female | 7.925 |
4 | 1 | 1 | female | 53.1 |
5 | 0 | 3 | male | 8.05 |
6 | 0 | 3 | male | 8.4583 |
7 | 0 | 1 | male | 51.8625 |
8 | 0 | 3 | male | 21.075 |
9 | 1 | 3 | female | 11.1333 |
10 | 1 | 2 | female | 30.0708 |
〇 Training data after division
No. | Survived (y_train) | Pclass (x_train) | Sex (x_train) | Fare (x_train) |
1 | 0 | 3 | male | 7.25 |
2 | 1 | 1 | female | 71.2833 |
4 | 1 | 1 | female | 53.1 |
5 | 0 | 3 | male | 8.05 |
6 | 0 | 3 | male | 8.4583 |
8 | 0 | 3 | male | 21.075 |
10 | 1 | 2 | female | 30.0708 |
〇 Test data after division
No. | Survived (y_test) | Pclass (x_test) | Sex (x_test) | Fare (x_test) |
3 | 1 | 3 | female | 7.925 |
7 | 0 | 1 | male | 51.8625 |
9 | 1 | 3 | female | 11.1333 |
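The split illustrated above can be reproduced with a small sketch. It rebuilds the ten sample rows by hand (the DataFrame here is an illustration, not the real train.csv) and only checks the sizes of the two parts. Which rows land in the test data depends on the shuffle, so they may differ from the rows shown in the tables.

```python
import pandas
from sklearn.model_selection import train_test_split

# The ten sample rows from the tables above, typed in by hand for illustration
df = pandas.DataFrame({
    'Survived': [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
    'Pclass':   [3, 1, 3, 1, 3, 3, 1, 3, 3, 2],
    'Sex':      ['male', 'female', 'female', 'female', 'male',
                 'male', 'male', 'male', 'female', 'female'],
    'Fare':     [7.25, 71.2833, 7.925, 53.1, 8.05,
                 8.4583, 51.8625, 21.075, 11.1333, 30.0708],
}, index=range(1, 11))

x = df.drop(columns='Survived')
y = df['Survived']

# test_size=0.3 puts 3 of the 10 rows in the test data and 7 in the training data
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=1, shuffle=True)

print(len(x_train), len(x_test))
# x and y stay aligned: the same row numbers end up in x_test and y_test
print(sorted(x_test.index), sorted(y_test.index))
```

Because x and y are passed together, each row's features and its Survived value always go to the same side of the split.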
Next is learning and prediction.
scikit-learn models provide a method "fit" for learning and a method "score" for evaluating predictions.
from sklearn.svm import LinearSVC
model = LinearSVC(random_state=1)
# Training
model.fit(x_train, y_train)
# Evaluate the predicted results
score = model.score(x_test, y_test)
"fit" performs the learning. "score" predicts the results from "x_test", compares them with "y_test", and returns the correct answer rate (accuracy). In the above case, the score is "0.753731343283582", that is, a correct answer rate of about 75%.
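To make the meaning of "score" concrete, here is a minimal sketch showing that, for a classifier, "score" matches the accuracy computed by hand from "predict". It uses synthetic data from make_classification instead of the Titanic data, which is an assumption made only so that the example is self-contained, so the number it prints is unrelated to the 0.753 above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in data with 3 features (not the Titanic data)
x, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=1)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=1, shuffle=True)

model = LinearSVC(random_state=1)
model.fit(x_train, y_train)

# score() predicts from x_test and compares the predictions with y_test
score = model.score(x_test, y_test)

# The same value computed by hand: the fraction of correct predictions
manual = (model.predict(x_test) == y_test).mean()

print(score, manual)
```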
You can evaluate the performance of models by trying different ones and comparing their scores. Let's try the models listed in "2. About the model to use".
The overall code is below.
import numpy
import pandas
# Load train.csv
df = pandas.read_csv('/kaggle/input/titanic/train.csv')
# Data preprocessing
# Extract the necessary items: 'Survived', 'Pclass', 'Sex', 'Fare'
df = df[['Survived', 'Pclass', 'Sex', 'Fare']]
# Data preprocessing
# Digitize labels
from sklearn.preprocessing import LabelEncoder
# Digitize gender using LabelEncoder
encoder_sex = LabelEncoder()
df['Sex'] = encoder_sex.fit_transform(df['Sex'].values)
# Data preprocessing
# Standardize numbers
from sklearn.preprocessing import StandardScaler
# Standardize Pclass and Fare
standard = StandardScaler()
df_std = pandas.DataFrame(standard.fit_transform(df[['Pclass', 'Fare']]), columns=['Pclass', 'Fare'])
# Write the standardized values back
df['Pclass'] = df_std['Pclass']
df['Fare'] = df_std['Fare']
from sklearn.model_selection import train_test_split
x = df.drop(columns='Survived')
y = df[['Survived']]
Training data, test data creation
# Split training data and test data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1, shuffle=True)
y_train = numpy.ravel(y_train)
y_test = numpy.ravel(y_test)
Model evaluation
# Evaluate the model
from sklearn.svm import LinearSVC
model = LinearSVC(random_state=1)
model.fit(x_train, y_train)
score = model.score(x_test, y_test)
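As a side note on the preprocessing in the code above, the following sketch shows what LabelEncoder and StandardScaler do to a few values. The four-row DataFrame is made up for illustration. LabelEncoder assigns integers in sorted order of the labels, and StandardScaler rescales a column to mean 0 and standard deviation 1.

```python
import pandas
from sklearn.preprocessing import LabelEncoder, StandardScaler

# A tiny made-up sample, for illustration only
df = pandas.DataFrame({
    'Sex':  ['male', 'female', 'female', 'male'],
    'Fare': [7.25, 71.2833, 7.925, 8.05],
})

# Labels are sorted before numbering, so 'female' becomes 0 and 'male' becomes 1
encoder_sex = LabelEncoder()
df['Sex'] = encoder_sex.fit_transform(df['Sex'].values)
print(list(encoder_sex.classes_))
print(df['Sex'].tolist())

# After standardization the column has mean 0 and (population) standard deviation 1
standard = StandardScaler()
df['Fare'] = standard.fit_transform(df[['Fare']]).ravel()
print(df['Fare'].mean(), df['Fare'].std(ddof=0))
```

Keeping the fitted encoder_sex and standard objects around matters later, because the same transformation has to be applied to test.csv.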
By replacing the model definition in "Model evaluation", you can evaluate various models. I tried the models described in "2. About the model to use". The results are as follows.
model | score |
sklearn.svm.LinearSVC | 0.753 |
sklearn.svm.SVC | 0.783 |
sklearn.ensemble.RandomForestClassifier | 0.805 |
sklearn.linear_model.LogisticRegression | 0.753 |
sklearn.linear_model.SGDClassifier | 0.753 |
The result is that Random Forest is the best. Next, let's adjust the parameters of the random forest model.
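Instead of swapping the model definition by hand, the comparison can also be written as a loop. The following is a minimal sketch under the assumption that synthetic data stands in for the preprocessed Titanic data, so the printed scores will not match the table above. The dictionary name "models" is just a name chosen for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC

# Synthetic stand-in data with 3 features (not the Titanic data)
x, y = make_classification(n_samples=300, n_features=3, n_informative=3,
                           n_redundant=0, random_state=1)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=1, shuffle=True)

# The five models to compare, all with default parameters
models = {
    'sklearn.svm.LinearSVC': LinearSVC(random_state=1),
    'sklearn.svm.SVC': SVC(random_state=1),
    'sklearn.ensemble.RandomForestClassifier': RandomForestClassifier(random_state=1),
    'sklearn.linear_model.LogisticRegression': LogisticRegression(random_state=1),
    'sklearn.linear_model.SGDClassifier': SGDClassifier(random_state=1),
}

# Train each model on the same split and record its accuracy
scores = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    scores[name] = model.score(x_test, y_test)

for name, score in scores.items():
    print(name, round(score, 3))
```

Using the same split for every model keeps the comparison fair, since each score is measured on identical test rows.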
Use grid search (GridSearchCV) in scikit-learn to adjust the parameters. Grid search evaluates every combination of the specified parameter values and finds the best combination. However, since every combination is evaluated, the more parameters and values you specify, the longer the process takes. After checking the Random Forest documentation, I decided to adjust the following parameters:
Parameters | pattern |
criterion | gini / entropy |
n_estimators | 25 / 100 / 500 / 1000 / 2000 |
min_samples_split | 0.5 / 2 / 4 / 10 |
min_samples_leaf | 1 / 2 / 4 / 10 |
bootstrap | True / False |
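To get a feel for how much work this grid implies, the number of combinations can be counted with ParameterGrid before running anything. This is a small sketch using the values from the table above. The number of cross-validation folds (5) is an assumption based on the default in recent scikit-learn versions and may differ in other versions.

```python
from sklearn.model_selection import ParameterGrid

# The parameter values from the table above
param_grid = {'criterion': ['gini', 'entropy'],
              'n_estimators': [25, 100, 500, 1000, 2000],
              'min_samples_split': [0.5, 2, 4, 10],
              'min_samples_leaf': [1, 2, 4, 10],
              'bootstrap': [True, False]}

# Every combination is one candidate: 2 * 5 * 4 * 4 * 2
n_candidates = len(ParameterGrid(param_grid))
print(n_candidates)

# Each candidate is trained once per cross-validation fold (5 folds assumed)
n_folds = 5
print(n_candidates * n_folds)
```

Adding one more value to any row multiplies the total, which is why grid search slows down quickly as the grid grows.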
You can perform grid search by replacing "model evaluation" with the following "grid search".
Grid search
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
# Tune RandomForestClassifier parameters with grid search
param_grid = {'criterion':['gini','entropy'],
              'n_estimators':[25, 100, 500, 1000, 2000],
              'min_samples_split':[0.5, 2, 4, 10],
              'min_samples_leaf':[1, 2, 4, 10],
              'bootstrap':[True, False]}
grid = GridSearchCV(estimator=RandomForestClassifier(random_state=1), param_grid=param_grid)
grid = grid.fit(x_train, y_train)
# Show the best combination of parameters
print(grid.best_params_)
The result is as follows. In my environment, it took about 10 minutes to execute the grid search.
Grid search results
{'bootstrap': False, 'criterion': 'entropy', 'min_samples_leaf': 10, 'min_samples_split': 2, 'n_estimators': 100}
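The best parameters do not have to be copied by hand. As a minimal sketch, the dictionary in best_params_ can be unpacked straight into a new model. The grid here is deliberately tiny and runs on synthetic data (both are assumptions made to keep the example fast), so its result is unrelated to the one above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with 3 features (not the Titanic data)
x, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=1)

# A deliberately small grid so that the search finishes quickly
param_grid = {'n_estimators': [10, 25],
              'min_samples_leaf': [1, 10]}

grid = GridSearchCV(estimator=RandomForestClassifier(random_state=1),
                    param_grid=param_grid, cv=3)
grid.fit(x, y)

# best_params_ is a plain dict, so it can be passed on with **
print(grid.best_params_)
model = RandomForestClassifier(random_state=1, **grid.best_params_)
model.fit(x, y)
```

With the default refit=True, grid.best_estimator_ already holds a model trained with these parameters on the data passed to fit, so it can be used directly as well.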
Let's specify the parameters tuned by grid search, then learn and predict. Replace the code from "Training data, test data creation" through "Grid search" with the following to perform learning and prediction.
Learning, prediction
# Model building
from sklearn.ensemble import RandomForestClassifier
# Generate a model with the parameters found by grid search
# (random_state=1 is assumed here, matching the grid search estimator)
model = RandomForestClassifier(n_estimators=100,
                               criterion='entropy',
                               min_samples_split=2,
                               min_samples_leaf=10,
                               bootstrap=False,
                               random_state=1)
# Training on all of train.csv
y = numpy.ravel(y)
model.fit(x, y)
# Convert test.csv
# Load test.csv
df_test = pandas.read_csv('/kaggle/input/titanic/test.csv')
# Replace NaN in Fare with 0
df_test = df_test.fillna({'Fare':0})
# Extract 'PassengerId' (to combine with the result)
df_test_index = df_test[['PassengerId']]
# Extract 'Pclass', 'Sex', 'Fare'
df_test = df_test[['Pclass', 'Sex', 'Fare']]
# Standardize with the scaler fitted on train.csv
df_test_std = pandas.DataFrame(standard.transform(df_test[['Pclass', 'Fare']]), columns=['Pclass', 'Fare'])
df_test['Pclass'] = df_test_std['Pclass']
df_test['Fare'] = df_test_std['Fare']
# Label encoding with the encoder fitted on train.csv
df_test['Sex'] = encoder_sex.transform(df_test['Sex'].values)
# Predict results
x_test = df_test.values
y_test = model.predict(x_test)
# Combine the PassengerId DataFrame with the result
df_output = pandas.concat([df_test_index, pandas.DataFrame(y_test, columns=['Survived'])], axis=1)
# Write result.csv to the current directory
df_output.to_csv('result.csv', index=False)
Write the above in a Kaggle environment. Run "Run All" and verify that result.csv is created.
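Before submitting, it can help to check the shape of the output. The sketch below builds a small df_output the same way as the code above and runs a few sanity checks. The PassengerId values and predictions are hypothetical and exist only for illustration.

```python
import pandas

# Hypothetical PassengerId values and predictions, for illustration only
df_test_index = pandas.DataFrame({'PassengerId': [892, 893, 894]})
y_pred = [0, 1, 0]

# Same construction as in the code above: both frames share the default
# 0, 1, 2, ... index, so concat with axis=1 lines the rows up one to one
df_output = pandas.concat(
    [df_test_index, pandas.DataFrame(y_pred, columns=['Survived'])], axis=1)

# Sanity checks: the two expected columns, no missing values, only 0 or 1
columns_ok = list(df_output.columns) == ['PassengerId', 'Survived']
no_missing = not df_output.isnull().values.any()
binary_only = set(df_output['Survived'].unique()) <= {0, 1}
print(columns_ok, no_missing, binary_only)
```

If the two DataFrames did not share the same index, concat would introduce missing values, and the second check would catch that.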
Submit by selecting "Commit" ⇒ "Open Version" ⇒ "Submit to Competition".
The score is now "0.77511".
This time, I was able to raise the score a little by comparing five types of models and tuning the parameters. Next time, I would like to find a more suitable model among the various models in scikit-learn.
2019/12/29 First edition released
2020/01/01 Added link to the next article
2020/01/03 Corrected source comments