In the Kaggle Titanic tutorial, the model is trained with RandomForestClassifier(). Let's adjust its parameters and see whether the accuracy improves.
import numpy as np
import pandas as pd
import seaborn
import matplotlib.pyplot as plt
train_data = pd.read_csv("../train.csv")
from sklearn.model_selection import train_test_split
train_data_orig = train_data
train_data, cv_data = train_test_split(train_data_orig, test_size=0.3, random_state=1)
We used train_test_split to split the data into train : cv = 7 : 3 (cv: cross-validation set).
train_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 623 entries, 114 to 37
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 623 non-null int64
1 Survived 623 non-null int64
2 Pclass 623 non-null int64
3 Name 623 non-null object
4 Sex 623 non-null object
5 Age 496 non-null float64
6 SibSp 623 non-null int64
7 Parch 623 non-null int64
8 Ticket 623 non-null object
9 Fare 623 non-null float64
10 Cabin 135 non-null object
11 Embarked 622 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 63.3+ KB
cv_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 268 entries, 862 to 92
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 268 non-null int64
1 Survived 268 non-null int64
2 Pclass 268 non-null int64
3 Name 268 non-null object
4 Sex 268 non-null object
5 Age 218 non-null float64
6 SibSp 268 non-null int64
7 Parch 268 non-null int64
8 Ticket 268 non-null object
9 Fare 268 non-null float64
10 Cabin 69 non-null object
11 Embarked 267 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 27.2+ KB
There are 623 rows for train and 268 for cv, 891 in total.
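As a quick sanity check (not in the original article), the split sizes can be printed directly:

# With test_size=0.3, the 891 original rows are split into 623 train and 268 cv rows
print(len(train_data), len(cv_data), len(train_data_orig))  # 623 268 891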
from sklearn.ensemble import RandomForestClassifier

features = ["Pclass", "Sex", "SibSp", "Parch"]
y = train_data["Survived"]
y_cv = cv_data["Survived"]
# One-hot encode the categorical column (Sex) in both sets
X = pd.get_dummies(train_data[features])
X_cv = pd.get_dummies(cv_data[features])
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1, max_features="auto")
model.fit(X, y)
predictions = model.predict(X_cv)
print('Train score: {}'.format(model.score(X, y)))
print('CV score: {}'.format(model.score(X_cv, y_cv)))
Train score: 0.8394863563402889
CV score: 0.753731343283582
The train score is about 84%, but the cv score is only about 75%. Is the model overfitting?
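One way to probe this, not used in the original article, is scikit-learn's learning_curve, which compares train and cross-validated accuracy as the training set grows; a persistent gap suggests overfitting. A minimal sketch reusing X and y from above:

from sklearn.model_selection import learning_curve

# Mean train vs. cross-validated accuracy for increasing training-set sizes
sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1),
    X, y, cv=3, scoring="accuracy", train_sizes=np.linspace(0.1, 1.0, 5))
plt.plot(sizes, train_scores.mean(axis=1), marker="o", label="train")
plt.plot(sizes, val_scores.mean(axis=1), marker="o", label="cv")
plt.xlabel("training examples")
plt.ylabel("accuracy")
plt.legend()
plt.show()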
n_estimators
Try changing the value of n_estimators.
rfc_results = pd.DataFrame(columns=["train", "cv"])
for iter in [1, 10, 100]:
    model = RandomForestClassifier(n_estimators=iter, max_depth=5, random_state=1, max_features="auto")
    model.fit(X, y)
    predictions = model.predict(X_cv)
    rfc_results.loc[iter] = model.score(X, y), model.score(X_cv, y_cv)
| n_estimators | train | cv |
|---|---|---|
| 1 | 0.826645 | 0.753731 |
| 10 | 0.833066 | 0.753731 |
| 100 | 0.839486 | 0.753731 |
As the number of decision trees increases, the train score increases slightly, but the cv score does not change.
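Since matplotlib is already imported, the rfc_results DataFrame can also be plotted; a small sketch, not part of the original article:

# Plot train vs. cv accuracy against n_estimators (the DataFrame index)
rfc_results.plot(marker="o")
plt.xlabel("n_estimators")
plt.ylabel("accuracy")
plt.show()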
max_depth
Try changing the value of max_depth.
max_depth = 2
for iter in [1, 10, 100]:
    model = RandomForestClassifier(n_estimators=iter, max_depth=2, random_state=1, max_features="auto")
    model.fit(X, y)
    predictions = model.predict(X_cv)
    rfc_results.loc[iter] = model.score(X, y), model.score(X_cv, y_cv)
| n_estimators | train | cv |
|---|---|---|
| 1 | 0.813804 | 0.731343 |
| 10 | 0.81862 | 0.753731 |
| 100 | 0.817014 | 0.761194 |
I got a cv score of 76%.
max_depth = 3
for iter in [1, 10, 100]:
    model = RandomForestClassifier(n_estimators=iter, max_depth=3, random_state=1, max_features="auto")
    model.fit(X, y)
    predictions = model.predict(X_cv)
    rfc_results.loc[iter] = model.score(X, y), model.score(X_cv, y_cv)
| n_estimators | train | cv |
|---|---|---|
| 1 | 0.81862 | 0.753731 |
| 10 | 0.82504 | 0.776119 |
| 100 | 0.82504 | 0.768657 |
I got a cv score of 77.6%.
max_depth = 4
for iter in [1, 10, 100]:
    model = RandomForestClassifier(n_estimators=iter, max_depth=4, random_state=1, max_features="auto")
    model.fit(X, y)
    predictions = model.predict(X_cv)
    rfc_results.loc[iter] = model.score(X, y), model.score(X_cv, y_cv)
| n_estimators | train | cv |
|---|---|---|
| 1 | 0.823435 | 0.764925 |
| 10 | 0.82825 | 0.761194 |
| 100 | 0.826645 | 0.764925 |
The cv score is about 76.5%.
From the above, max_depth = 3 with n_estimators = 10 gave the highest score.
Now let's find the best parameters with grid search (GridSearchCV). Grid search tries every combination of the listed parameter values and picks the one that scores best.
from sklearn.model_selection import GridSearchCV

param_grid = {"max_depth": [2, 3, 4, 5, None],
              "n_estimators": [1, 3, 10, 30, 100],
              "max_features": ["auto", None]}

model_grid = GridSearchCV(estimator=RandomForestClassifier(random_state=1),
                          param_grid=param_grid,
                          scoring="accuracy",  # evaluation metric
                          cv=3,                # number of cross-validation folds
                          n_jobs=1)            # number of CPU cores to use
model_grid.fit(X, y)

model_grid_best = model_grid.best_estimator_  # best estimator found
print("Best Model Parameter: ", model_grid.best_params_)
Note the from line. When I searched online, some examples use from sklearn.grid_search import GridSearchCV, but that did not work for me (the old sklearn.grid_search module has been removed in recent scikit-learn versions; GridSearchCV now lives in sklearn.model_selection).
Best Model Parameter: {'max_depth': 3, 'max_features': 'auto', 'n_estimators': 10}
As in my manual search, max_depth = 3 and n_estimators = 10 were the best. I also tried two settings of max_features, and "auto" was the better one.
print('Train score: {}'.format(model.score(X, y)))
print('CV score: {}'.format(model.score(X_cv, y_cv)))
Train score: 0.826645264847512
CV score: 0.7649253731343284
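Note that model here still refers to the last classifier fitted in the max_depth = 4 loop above (the printed scores match that table's n_estimators = 100 row), not the estimator grid search selected. To score the grid-search winner instead, a small sketch:

# Score the estimator chosen by grid search (max_depth=3, n_estimators=10);
# GridSearchCV has already refit it on the full training split
print('Train score: {}'.format(model_grid_best.score(X, y)))
print('CV score: {}'.format(model_grid_best.score(X_cv, y_cv)))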
I made predictions with these parameters and submitted them to Kaggle. However, the accuracy was 0.77751, the same as with the tutorial's parameters. Hmm.
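The submission step itself is not shown above; here is a minimal sketch following the tutorial's pattern, assuming the test file sits next to train.csv as ../test.csv and that the grid-search best estimator is the one submitted:

# Predict on the Kaggle test set and write the submission file
test_data = pd.read_csv("../test.csv")
X_test = pd.get_dummies(test_data[features])
predictions = model_grid_best.predict(X_test)
output = pd.DataFrame({"PassengerId": test_data["PassengerId"], "Survived": predictions})
output.to_csv("my_submission.csv", index=False)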
Even before working on the features, we were able to improve the score a little just by tuning the model's hyperparameters. Next, I would like to look at the features.
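As one possible starting point for that feature work (not covered in the original article), the random forest's feature_importances_ shows how much the model relies on each of the current one-hot encoded columns:

# Importance of each encoded feature according to the grid-search best estimator
importances = pd.Series(model_grid_best.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))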
- Scikit-learn splits data for training and testing: train_test_split
- Add columns and rows to pandas.DataFrame (assign, append, etc.)
- Let's tune the model hyperparameters with scikit-learn!
- Scikit-learn's GridSearchCV for hyperparameter search
- What to do when sklearn grid search cannot be used in Python