Some parameters have to be decided in advance, depending on the model (for example, the number of k-means clusters, the strength of the SVC regularization term, the depth of a decision tree, and so on).
These are called **hyperparameters**, and the trouble is that even for the same model, accuracy can change **significantly** depending on their values.
Hyperparameter tuning is the process of choosing them well using the training data!
Among the tuning methods, we will cover two: grid search and random search. Roughly speaking, given a hyperparameter α, both follow this flow.
- Grid search: specify the **range** of candidate values for α in advance (e.g. 0, 1, 2, 3, 4, 5), actually train the model with each candidate and measure its accuracy, then adopt the best-performing value as the parameter.
- Random search: specify the **distribution** that α follows in advance (e.g. a normal distribution with mean 0 and standard deviation 1), randomly sample values from it, actually train the model with each sampled value and measure its accuracy, then adopt the best-performing value as the parameter.
As you can see, neither procedure sets the hyperparameter α directly. Instead, you first specify a **range or distribution** and then let the actual training data decide the value (see the references for details).
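To make the two flows concrete, here is a minimal conceptual sketch (not taken from the notebook; the iris data, the decision tree, and the candidate values are purely illustrative assumptions). It tunes a single hyperparameter, `max_depth`, once over a fixed grid and once over values sampled from a distribution, using mean cross-validated accuracy as the score.

```python
# Conceptual sketch only: grid search vs. random search over one hyperparameter.
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)  # demo data, not the data used below

def score(depth):
    # mean 3-fold cross-validated accuracy for one candidate max_depth
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    return cross_val_score(model, X_demo, y_demo, cv=3, scoring="accuracy").mean()

# Grid search: try every value in a range fixed in advance
grid = [1, 2, 3, 4, 5]
best_grid = max(grid, key=score)

# Random search: sample candidate values from a distribution fixed in advance
candidates = randint(1, 11).rvs(size=5, random_state=0)  # uniform integers 1..10
best_random = max(candidates, key=score)

print(best_grid, best_random)
```

In both cases the hyperparameter itself is never set by hand; only the grid (or the distribution) is, and the data decides the rest.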
Both of these come standard with scikit-learn, so we will use that implementation! The code is for Python 3.5.1 and scikit-learn 0.18.1.
This time, we take the breast cancer data from the UCI Machine Learning Repository and tune the parameters of the RandomForestClassifier classifier. The full code has been uploaded to GitHub.
Grid_and_Random_Search.ipynb
```python
import pandas as pd

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases'
                 '/breast-cancer-wisconsin/wdbc.data', header=None)
```
For clarity, rename the column we want to predict to Target and the others to a0, a1, and so on.
Grid_and_Random_Search.ipynb
```python
columns_list = []
for i in range(df.shape[1]):
    columns_list.append("a%d" % i)
columns_list[1] = "Target"
df.columns = columns_list
```
Grid_and_Random_Search.ipynb
```python
y = df["Target"].values
X = df.drop(["a0", "Target"], axis=1)
```
Split the data into train data and test data.
Grid_and_Random_Search.ipynb
```python
# split X, y into train and test sets (0.5:0.5)
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=2017)
```
Grid_and_Random_Search.ipynb
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

def model_check(model):
    # fit on the training split and print classification reports for train and test
    model.fit(X_train, y_train)
    y_train_pred = classification_report(y_train, model.predict(X_train))
    y_test_pred = classification_report(y_test, model.predict(X_test))
    print("【{model_name}】\n Train Accuracy: \n{train}\n\n Test Accuracy: \n{test}".format(
        model_name=model.__class__.__name__, train=y_train_pred, test=y_test_pred))

model_check(RandomForestClassifier())
```
Output result 1 (Default)
```
【RandomForestClassifier】
 Train Accuracy:
             precision    recall  f1-score   support

          B       1.00      1.00      1.00        67
          M       1.00      1.00      1.00        75

avg / total       1.00      1.00      1.00       142

 Test Accuracy:
             precision    recall  f1-score   support

          B       0.89      0.93      0.91        72
          M       0.93      0.89      0.91        70

avg / total       0.91      0.91      0.91       142
```
The accuracy on the train data is 1.00 and the accuracy on the test data is 0.91. From here, we implement grid search and random search; the following is based on Reference 3.
Grid_and_Random_Search.ipynb
```python
# Grid search
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection.GridSearchCV in newer versions

# use a full grid over all parameters
param_grid = {"max_depth": [2, 3, None],
              "n_estimators": [50, 100, 200, 300, 400, 500],
              "max_features": [1, 3, 10],
              "min_samples_split": [2, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

forest_grid = GridSearchCV(estimator=RandomForestClassifier(random_state=0),
                           param_grid=param_grid,
                           scoring="accuracy",  # metric
                           cv=3,                # number of cross-validation folds
                           n_jobs=1)            # number of cores

forest_grid.fit(X_train, y_train)  # fit
forest_grid_best = forest_grid.best_estimator_  # best estimator
print("Best Model Parameter: ", forest_grid.best_params_)
```
Output result 2 (Grid search)
```
【RandomForestClassifier】
 Train Accuracy:
             precision    recall  f1-score   support

          B       0.99      0.99      0.99        67
          M       0.99      0.99      0.99        75

avg / total       0.99      0.99      0.99       142

 Test Accuracy:
             precision    recall  f1-score   support

          B       0.96      0.89      0.92        72
          M       0.89      0.96      0.92        70

avg / total       0.92      0.92      0.92       142
```
The overall test accuracy and f1-score have both improved!
Random search uses scipy to express the distributions that the parameters follow. This time, the number of iterations is set to the same value as the number of grid-search combinations.
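As a quick illustration (not part of the notebook), a frozen scipy distribution such as `sp_randint` simply exposes an `rvs` method that draws random samples; `RandomizedSearchCV` uses these objects to propose candidate values.

```python
# Illustration only: what a scipy distribution object in param_dist provides.
from scipy.stats import randint as sp_randint

dist = sp_randint(1, 11)                 # uniform integers from 1 to 10
print(dist.rvs(size=5, random_state=0))  # five randomly drawn candidate values
```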
Grid_and_Random_Search.ipynb
```python
# Random search
from sklearn.grid_search import RandomizedSearchCV  # sklearn.model_selection.RandomizedSearchCV in newer versions
from scipy.stats import randint as sp_randint

# lists and scipy distributions can be mixed in the parameter specification
param_dist = {"max_depth": [3, None],
              "n_estimators": [50, 100, 200, 300, 400, 500],
              "max_features": sp_randint(1, 11),
              "min_samples_split": sp_randint(2, 11),
              "min_samples_leaf": sp_randint(1, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

forest_random = RandomizedSearchCV(estimator=RandomForestClassifier(random_state=0),
                                   param_distributions=param_dist,
                                   cv=3,                # number of cross-validation folds
                                   n_iter=1944,         # number of iterations (= number of grid combinations)
                                   scoring="accuracy",  # metric
                                   n_jobs=1,            # number of cores
                                   verbose=0,
                                   random_state=1)

forest_random.fit(X, y)  # note: unlike the grid search above, this is fit on the full data set
forest_random_best = forest_random.best_estimator_  # best estimator
print("Best Model Parameter: ", forest_random.best_params_)
```
Output result 3 (Random search)
```
【RandomForestClassifier】
 Train Accuracy:
             precision    recall  f1-score   support

          B       1.00      1.00      1.00        67
          M       1.00      1.00      1.00        75

avg / total       1.00      1.00      1.00       142

 Test Accuracy:
             precision    recall  f1-score   support

          B       0.94      0.92      0.93        72
          M       0.92      0.94      0.93        70

avg / total       0.93      0.93      0.93       142
```
Compared to the default case, the averaged precision, recall, and f1-score on the test data all improved by about two percentage points!
Both grid search and random search improved the accuracy! However, the effect is hard to see here because this data set already yields high accuracy by default. The benefit of tuning may be easier to see on a data set where the baseline accuracy is lower.
The full code has been uploaded to GitHub.
References