This is the story of participating in a Kaggle competition for the first time. In the previous article, "Checking the correlation with Kaggle's Titanic" (https://qiita.com/sudominoru/items/840e87cc77de29f10ca2), I checked the correlations and decided to use three input features: Pclass (ticket class), Sex (gender), and Fare (fare). This time I would like to try several models.
History
As a result, the score improved slightly to "0.77511", which puts it in the top 58% (as of December 29, 2019). Below is the flow leading up to the resubmission.
Last time, I used "LinearSVC", following the scikit-learn algorithm cheat sheet. [The book](https://www.amazon.co.jp/Python-%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92%E3%83%97%E3%83%AD%E3%82%B0%E3%83%A9%E3%83%9F%E3%83%B3%E3%82%B0-%E9%81%94%E4%BA%BA%E3%83%87%E3%83%BC%E3%82%BF%E3%82%B5%E3%82%A4%E3%82%A8%E3%83%B3%E3%83%86%E3%82%A3%E3%82%B9%E3%83%88%E3%81%AB%E3%82%88%E3%82%8B%E7%90%86%E8%AB%96%E3%81%A8%E5%AE%9F%E8%B7%B5-impress-gear/dp/4295003379/ref=dp_ob_title_bk) I first learned machine learning from covers the following scikit-learn models for classification problems:

・sklearn.svm.LinearSVC
・sklearn.svm.SVC
・sklearn.ensemble.RandomForestClassifier
・sklearn.linear_model.LogisticRegression
・sklearn.linear_model.SGDClassifier

This time, I would like to try these models.
The procedure for evaluating a model is as follows.

1. Train
2. Predict
3. Check the results

Kaggle's Titanic provides training data (train.csv, whose outcomes are known) and test data (test.csv, whose outcomes are unknown). If you use test.csv for step 2 (predict) and step 3 (check), you have to commit and submit the result every time, which is inefficient. Since the outcomes in train.csv are known, you can evaluate efficiently by splitting it into training data and test data. scikit-learn provides a function, "train_test_split", that performs this split.
from sklearn.model_selection import train_test_split
######################################
# Split training data and test data
######################################
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1, shuffle=True)
The image is as follows. With test_size=0.3, the data is split into training data and test data at a ratio of 7:3 (a quick check of the resulting split is sketched after the tables).
〇 Data before the split (y = Survived; x = Pclass, Sex, Fare)

| | Survived | Pclass | Sex | Fare |
|---|---|---|---|---|
| 1 | 0 | 3 | male | 7.25 |
| 2 | 1 | 1 | female | 71.2833 |
| 3 | 1 | 3 | female | 7.925 |
| 4 | 1 | 1 | female | 53.1 |
| 5 | 0 | 3 | male | 8.05 |
| 6 | 0 | 3 | male | 8.4583 |
| 7 | 0 | 1 | male | 51.8625 |
| 8 | 0 | 3 | male | 21.075 |
| 9 | 1 | 3 | female | 11.1333 |
| 10 | 1 | 2 | female | 30.0708 |
〇 Training data after the split (y_train = Survived; x_train = Pclass, Sex, Fare)

| | Survived | Pclass | Sex | Fare |
|---|---|---|---|---|
| 1 | 0 | 3 | male | 7.25 |
| 2 | 1 | 1 | female | 71.2833 |
| 4 | 1 | 1 | female | 53.1 |
| 5 | 0 | 3 | male | 8.05 |
| 6 | 0 | 3 | male | 8.4583 |
| 8 | 0 | 3 | male | 21.075 |
| 10 | 1 | 2 | female | 30.0708 |
〇 Test data after the split (y_test = Survived; x_test = Pclass, Sex, Fare)

| | Survived | Pclass | Sex | Fare |
|---|---|---|---|---|
| 3 | 1 | 3 | female | 7.925 |
| 7 | 0 | 1 | male | 51.8625 |
| 9 | 1 | 3 | female | 11.1333 |
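If you want to confirm the ratio on the actual data, a minimal check (assuming x and y were built from train.csv as in the full code below) is to print the shapes of the split arrays:

print(x_train.shape, x_test.shape)  # train.csv has 891 rows, so roughly 623 / 268
print(y_train.shape, y_test.shape)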
Next is learning and prediction.
scikit-learn models provide a "fit" method for training and a "score" method for evaluating predictions.
from sklearn.svm import LinearSVC
model = LinearSVC(random_state=1)
######################################
# Train the model
######################################
model.fit(x_train, y_train)
######################################
# Evaluate the predictions
######################################
score = model.score(x_test, y_test)
"fit" trains the model. "score" predicts results for "x_test", compares them with "y_test", and returns the accuracy. In this case the score is "0.753731343283582", i.e. an accuracy of about 75%.
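For classifiers, "score" returns the same value as computing the accuracy explicitly. A small sketch of that equivalence, using the same variables as above:

import numpy
from sklearn.metrics import accuracy_score

# "score" for a classifier is simply the accuracy of its predictions
y_pred = model.predict(x_test)
print(accuracy_score(numpy.ravel(y_test), y_pred))  # matches model.score(x_test, y_test)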
You can compare the performance of models by trying different ones and comparing their scores. Try the models listed in "2. About the model to use".
The overall code is below.
Preparation
import numpy
import pandas
# Load train.csv
df = pandas.read_csv('/kaggle/input/titanic/train.csv')
##############################
# Data preprocessing
# Extract the required items
##############################
# Extract 'Survived', 'Pclass', 'Sex', 'Fare'
df = df[['Survived', 'Pclass', 'Sex', 'Fare']]
##############################
# Data preprocessing
# Encode labels (Sex) as numbers
##############################
from sklearn.preprocessing import LabelEncoder
# Encode Sex as a number using LabelEncoder
encoder_sex = LabelEncoder()
df['Sex'] = encoder_sex.fit_transform(df['Sex'].values)
##############################
# Data preprocessing
# Standardize numeric values
##############################
from sklearn.preprocessing import StandardScaler
# Standardize Pclass and Fare
standard = StandardScaler()
df_std = pandas.DataFrame(standard.fit_transform(df[['Pclass', 'Fare']]), columns=['Pclass', 'Fare'])
# Replace Pclass and Fare with the standardized values
df['Pclass'] = df_std['Pclass']
df['Fare'] = df_std['Fare']
from sklearn.model_selection import train_test_split
x = df.drop(columns='Survived')
y = df[['Survived']]
Training data, test data creation
#######################################
# Split training data and test data
#######################################
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1, shuffle=True)
y_train = numpy.ravel(y_train)
y_test = numpy.ravel(y_test)
Model evaluation
#######################################
# Evaluate the model
#######################################
from sklearn.svm import LinearSVC
model = LinearSVC(random_state=1)
model.fit(x_train, y_train)
score = model.score(x_test, y_test)
score
By replacing the model definition in "Model evaluation", you can evaluate various models. Try the models described in "2. About the model to use". The results are as follows (a loop that automates the comparison is sketched after the table).
model | score |
---|---|
sklearn.svm.LinearSVC | 0.753 |
sklearn.svm.SVC | 0.783 |
sklearn.ensemble.RandomForestClassifier | 0.805 |
sklearn.linear_model.LogisticRegression | 0.753 |
sklearn.linear_model.SGDClassifier | 0.753 |
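For convenience, the same comparison can also be written as a loop instead of editing the model definition by hand. This is just a sketch using the five models above with default settings and random_state=1; exact scores may vary with scikit-learn versions.

from sklearn.svm import LinearSVC, SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier

# Compare the five candidate models on the same train/test split
models = {
    'LinearSVC': LinearSVC(random_state=1),
    'SVC': SVC(random_state=1),
    'RandomForestClassifier': RandomForestClassifier(random_state=1),
    'LogisticRegression': LogisticRegression(random_state=1),
    'SGDClassifier': SGDClassifier(random_state=1),
}
for name, candidate in models.items():
    candidate.fit(x_train, y_train)
    print(name, candidate.score(x_test, y_test))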
The result is that Random Forest is the best. Next, let's adjust the parameters of the random forest model.
Use grid search (GridSearchCV) from scikit-learn to tune the parameters. Grid search evaluates every combination of the specified parameter values and finds the best one. However, since all combinations are evaluated, the more values you specify, the longer the search takes. I checked the RandomForestClassifier documentation (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) and decided to tune the following parameters:
Parameter | Values tried |
---|---|
criterion | gini / entropy |
n_estimators | 25 / 100 / 500 / 1000 / 2000 |
min_samples_split | 0.5 / 2 / 4 / 10 |
min_samples_leaf | 1 / 2 / 4 / 10 |
bootstrap | True / False |
You can perform grid search by replacing "model evaluation" with the following "grid search".
Grid search
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
###############################################
# Tune RandomForestClassifier parameters with grid search
###############################################
param_grid = {'criterion':['gini','entropy'],
'n_estimators':[25, 100, 500, 1000, 2000],
'min_samples_split':[0.5, 2,4,10],
'min_samples_leaf':[1,2,4,10],
'bootstrap':[True, False]
}
grid = GridSearchCV(estimator=RandomForestClassifier(random_state=1), param_grid=param_grid)
grid = grid.fit(x_train, y_train)
print(grid.best_score_)
print(grid.best_params_)
The result is as follows. In my environment, it took about 10 minutes to execute the grid search.
Grid search results
0.8105939004815409
{'bootstrap': False, 'criterion': 'entropy', 'min_samples_leaf': 10, 'min_samples_split': 2, 'n_estimators': 100}
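Note that GridSearchCV refits the best parameter combination on the training data by default (refit=True), so the tuned model is also available directly as grid.best_estimator_, and passing n_jobs=-1 to GridSearchCV is a common way to shorten the roughly 10-minute run. A minimal sketch:

# The refit best model can be used as-is instead of rebuilding it by hand
best_model = grid.best_estimator_
print(best_model.score(x_test, y_test))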
Let's train and predict using the parameters tuned by the grid search. Rewrite the code in "Training data, test data creation" and "Grid search" as follows to perform training and prediction.
Learning, prediction
##############################
#Model building
# Model building
##############################
from sklearn.ensemble import RandomForestClassifier
#Generate a model
# Generate a model
model = RandomForestClassifier(n_estimators=100,
                               criterion='entropy',
                               min_samples_split=2,
                               min_samples_leaf=10,
                               bootstrap=False,
                               random_state=1)
##############################
# Training
##############################
y = numpy.ravel(y)
model.fit(x, y)
##############################
# Convert test.csv
##############################
# Load test.csv
df_test = pandas.read_csv('/kaggle/input/titanic/test.csv')
# Replace NaN in Fare with 0
df_test = df_test.fillna({'Fare':0})
# Extract 'PassengerId' (to combine with the results later)
df_test_index = df_test[['PassengerId']]
# Extract 'Pclass', 'Sex', 'Fare'
df_test = df_test[['Pclass', 'Sex', 'Fare']]
# Standardize Pclass and Fare using the scaler fitted on train.csv
df_test_std = pandas.DataFrame(standard.transform(df_test[['Pclass', 'Fare']]), columns=['Pclass', 'Fare'])
df_test['Pclass'] = df_test_std['Pclass']
df_test['Fare'] = df_test_std['Fare']
# Label-encode Sex using the encoder fitted on train.csv
df_test['Sex'] = encoder_sex.transform(df_test['Sex'].values)
##############################
# Predict the results
##############################
x_test = df_test.values
y_test = model.predict(x_test)
# Combine PassengerId with the predicted results
df_output = pandas.concat([df_test_index, pandas.DataFrame(y_test, columns=['Survived'])], axis=1)
# Write result.csv to the current directory
df_output.to_csv('result.csv', index=False)
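Before committing, it can be worth checking that the submission file has the expected format: a PassengerId column and a Survived column, one row per passenger in test.csv. A quick check (not part of the original notebook) might look like this:

# Inspect the submission file before committing
print(df_output.shape)   # test.csv has 418 passengers, so (418, 2) is expected
print(df_output.head())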
Write the above code in the Kaggle environment, run "Run All", and verify that result.csv is created.
Submit by selecting "Commit" ⇒ "Open Version" ⇒ "Submit to Competition".
The score is now "0.77511".
This time, I was able to raise the score a little by comparing five types of models and tuning the parameters. Next time, I would like to look for a more suitable model among scikit-learn's many models.
2019/12/29 First edition released
2020/01/01 Added link to the next article
2020/01/03 Corrected source comments