Some parameters have to be decided in advance, depending on the model (for example, the number of k-means clusters, the strength of the SVC regularization term, the depth of a decision tree, and so on).
These are called **hyperparameters**, and the trouble is that even for the same model, accuracy can change **significantly** depending on their values.
Hyperparameter tuning is the process of choosing them well using the training data!
Among the tuning methods, we will cover two: grid search and random search. Roughly speaking, given a hyperparameter α, both follow this flow.
- Grid search: specify the **range** of candidate values for α in advance (e.g. 0, 1, 2, 3, 4, 5), actually train the model with each candidate and measure its accuracy, then adopt the best-performing value as the parameter.
- Random search: specify the **distribution** that α follows in advance (e.g. a normal distribution with mean 0 and standard deviation 1), randomly sample values from it, actually train the model with each sampled value and measure its accuracy, then adopt the best-performing value as the parameter.
As you can see, neither procedure sets the hyperparameter α directly. Instead, you first specify a **range or distribution** and then let the actual training data decide the value (see the references for details).
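To make the two flows concrete, here is a minimal conceptual sketch (not taken from the notebook; the iris data, the decision tree, and the candidate values are purely illustrative assumptions). It tunes a single hyperparameter, `max_depth`, once over a fixed grid and once over values sampled from a distribution, using mean cross-validated accuracy as the score.

```python
# Conceptual sketch only: grid search vs. random search over one hyperparameter.
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)  # demo data, not the data used below

def score(depth):
    # mean 3-fold cross-validated accuracy for one candidate max_depth
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    return cross_val_score(model, X_demo, y_demo, cv=3, scoring="accuracy").mean()

# Grid search: try every value in a range fixed in advance
grid = [1, 2, 3, 4, 5]
best_grid = max(grid, key=score)

# Random search: sample candidate values from a distribution fixed in advance
candidates = randint(1, 11).rvs(size=5, random_state=0)  # uniform integers 1..10
best_random = max(candidates, key=score)

print(best_grid, best_random)
```

In both cases the hyperparameter itself is never set by hand; only the grid (or the distribution) is, and the data decides the rest.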
Both of these come standard with scikit-learn, so we will use that implementation! The code is for Python 3.5.1 and scikit-learn 0.18.1.
This time, we take the breast cancer data from the UCI Machine Learning Repository and tune the parameters of the RandomForestClassifier classifier. The full code has been uploaded to GitHub.
Grid_and_Random_Search.ipynb
```python
import pandas as pd

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases'
                 '/breast-cancer-wisconsin/wdbc.data', header=None)
```
For clarity, rename the column we want to predict to Target and the others to a0, a1, and so on.
Grid_and_Random_Search.ipynb
```python
columns_list = []
for i in range(df.shape[1]):
    columns_list.append("a%d" % i)
columns_list[1] = "Target"
df.columns = columns_list
```
Grid_and_Random_Search.ipynb
```python
y = df["Target"].values
X = df.drop(["a0", "Target"], axis=1)
```
Split the data into train data and test data.
Grid_and_Random_Search.ipynb
```python
# split X, y into train and test sets (0.5:0.5)
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=2017)
```
Grid_and_Random_Search.ipynb
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

def model_check(model):
    # fit on the training split and print classification reports for train and test
    model.fit(X_train, y_train)
    y_train_pred = classification_report(y_train, model.predict(X_train))
    y_test_pred = classification_report(y_test, model.predict(X_test))
    print("【{model_name}】\n Train Accuracy: \n{train}\n\n Test Accuracy: \n{test}".format(
        model_name=model.__class__.__name__, train=y_train_pred, test=y_test_pred))

model_check(RandomForestClassifier())
```
Output result 1 (Default)
```
【RandomForestClassifier】
 Train Accuracy:
             precision    recall  f1-score   support

          B       1.00      1.00      1.00        67
          M       1.00      1.00      1.00        75

avg / total       1.00      1.00      1.00       142

 Test Accuracy:
             precision    recall  f1-score   support

          B       0.89      0.93      0.91        72
          M       0.93      0.89      0.91        70

avg / total       0.91      0.91      0.91       142
```
The accuracy on the train data is 1.00 and the accuracy on the test data is 0.91. From here, we implement grid search and random search; the following is based on Reference 3.
Grid_and_Random_Search.ipynb
```python
# Grid search
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection.GridSearchCV in newer versions

# use a full grid over all parameters
param_grid = {"max_depth": [2, 3, None],
              "n_estimators": [50, 100, 200, 300, 400, 500],
              "max_features": [1, 3, 10],
              "min_samples_split": [2, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

forest_grid = GridSearchCV(estimator=RandomForestClassifier(random_state=0),
                           param_grid=param_grid,
                           scoring="accuracy",  # metric
                           cv=3,                # number of cross-validation folds
                           n_jobs=1)            # number of cores

forest_grid.fit(X_train, y_train)  # fit
forest_grid_best = forest_grid.best_estimator_  # best estimator
print("Best Model Parameter: ", forest_grid.best_params_)
```
Output result 2 (Grid search)
```
【RandomForestClassifier】
 Train Accuracy:
             precision    recall  f1-score   support

          B       0.99      0.99      0.99        67
          M       0.99      0.99      0.99        75

avg / total       0.99      0.99      0.99       142

 Test Accuracy:
             precision    recall  f1-score   support

          B       0.96      0.89      0.92        72
          M       0.89      0.96      0.92        70

avg / total       0.92      0.92      0.92       142
```
The overall test accuracy and f1-score have both improved!
Random search uses scipy to express the distributions that the parameters follow. This time, the number of iterations is set to the same value as the number of grid-search combinations.
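As a quick illustration (not part of the notebook), a frozen scipy distribution such as `sp_randint` simply exposes an `rvs` method that draws random samples; `RandomizedSearchCV` uses these objects to propose candidate values.

```python
# Illustration only: what a scipy distribution object in param_dist provides.
from scipy.stats import randint as sp_randint

dist = sp_randint(1, 11)                 # uniform integers from 1 to 10
print(dist.rvs(size=5, random_state=0))  # five randomly drawn candidate values
```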
Grid_and_Random_Search.ipynb
```python
# Random search
from sklearn.grid_search import RandomizedSearchCV  # sklearn.model_selection.RandomizedSearchCV in newer versions
from scipy.stats import randint as sp_randint

# lists and scipy distributions can be mixed in the parameter specification
param_dist = {"max_depth": [3, None],
              "n_estimators": [50, 100, 200, 300, 400, 500],
              "max_features": sp_randint(1, 11),
              "min_samples_split": sp_randint(2, 11),
              "min_samples_leaf": sp_randint(1, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

forest_random = RandomizedSearchCV(estimator=RandomForestClassifier(random_state=0),
                                   param_distributions=param_dist,
                                   cv=3,                # number of cross-validation folds
                                   n_iter=1944,         # number of iterations (= number of grid combinations)
                                   scoring="accuracy",  # metric
                                   n_jobs=1,            # number of cores
                                   verbose=0,
                                   random_state=1)

forest_random.fit(X, y)  # note: unlike the grid search above, this is fit on the full data set
forest_random_best = forest_random.best_estimator_  # best estimator
print("Best Model Parameter: ", forest_random.best_params_)
```
Output result 3 (Random search)
```
【RandomForestClassifier】
 Train Accuracy:
             precision    recall  f1-score   support

          B       1.00      1.00      1.00        67
          M       1.00      1.00      1.00        75

avg / total       1.00      1.00      1.00       142

 Test Accuracy:
             precision    recall  f1-score   support

          B       0.94      0.92      0.93        72
          M       0.92      0.94      0.93        70

avg / total       0.93      0.93      0.93       142
```
Compared to the default case, the averaged precision, recall, and f1-score on the test data all improved by about two percentage points!
Both grid search and random search improved the accuracy! However, the effect is hard to see here because this data set already yields high accuracy by default. The benefit of tuning may be easier to see on a data set where the baseline accuracy is lower.
The full code has been uploaded to GitHub.
References