Try Black Box Optimization Techniques (Bayesian Optimization: Optuna, Genetic Programming: TPOT)

Introduction

We often want to solve complex problems with optimization, but sometimes the problem is too complex for ordinary optimization methods. In such cases you can use black box optimization: even if you only know the output for a given input and not the mechanism in between, it searches for a good input in a reasonably smart way. In this article, we take hyperparameter (pipeline) optimization of a machine learning model as an example of such a complicated problem and try two tools: Optuna, which performs Bayesian optimization, and TPOT, which uses genetic programming. The article covers:

- A brief explanation of how to use each tool
- Implementation examples
- Observations on how the optimization progressed

Problems and methods (tools) to deal with

With a black box optimization method, you do not need to know the shape of the objective function you want to maximize or minimize. As long as you can observe the output for a given input, the method uses those observations to choose a promising input as the next search point. By repeating this loop (get the output, decide the next search point, get the output again, ...), a reasonably good result is found automatically. In this article, we deal with hyperparameter optimization of machine learning as a familiar example of such a problem.
\# I actually have a more complicated problem at hand, but this time I use a standard problem setting to verify the methods.
\# For that reason, I won't go deep into machine learning performance.
\# Also, since the search spaces are completely different in the first place, this is not a performance comparison between Optuna and TPOT. Please keep that in mind.
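As a minimal illustration of this loop (not tied to any particular library), the sketch below uses random sampling as the simplest possible "decide the next search point" strategy; smarter methods such as Bayesian optimization only replace that step. The objective here is a made-up stand-in.

import random

def black_box(x):
    # Stand-in for an expensive, unknown objective (e.g. a cross-validation score).
    return -(x - 3.0) ** 2

best_x, best_y = None, float('-inf')
for step in range(100):
    x = random.uniform(-10, 10)   # decide the next search point (here: at random)
    y = black_box(x)              # observe the output for that input
    if y > best_y:                # keep track of the best result so far
        best_x, best_y = x, y

print(best_x, best_y)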

Problem

Using the familiar Titanic data from Kaggle, we consider the binary classification of survival.

Method

As naive methods of hyperparameter optimization, there are:

- GridSearch: try all candidate combinations exhaustively
- RandomSearch: try candidate combinations at random

Here we try two approaches that should be smarter:

Bayesian optimization

Reference: Introduction to Bayesian Optimization

Roughly speaking, based on the values evaluated up to each step, we estimate the shape of the function that maps the search variables to the objective value, and use that estimate to decide which point to examine next. When choosing the next point, both the mean and the variance of the estimated objective are taken into account: points with a large estimated mean (in the case of maximization) and points with a large variance are more likely to be selected as the next search target. In other words, the search focuses not only on "places where the result is likely to be good" but also on "places that have barely been explored and so might turn out to be good"; this is the balance between exploration and exploitation. There are various Python libraries for Bayesian optimization, such as GPyOpt and Hyperopt, but this time I tried Optuna, published by PFN. There are also various concrete algorithms for Bayesian optimization; Optuna uses one called TPE.
・ Reference: Understanding the algorithm of Optuna (TPE) -Part 1-
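TPE itself works differently (it models the densities of good and bad trials), but as a minimal sketch of the mean/variance trade-off described above, here is an Upper Confidence Bound style acquisition rule. The posterior mean and standard deviation are assumed to come from some surrogate regression model fitted to the trials so far; the numbers below are made up.

import numpy as np

def ucb_next_point(candidates, post_mean, post_std, kappa=2.0):
    # post_mean / post_std: posterior mean and std of the surrogate model at each
    # candidate point (e.g. from a Gaussian process). kappa controls how much
    # weight is given to exploration (large uncertainty).
    acquisition = post_mean + kappa * post_std  # favor high mean OR high uncertainty
    return candidates[np.argmax(acquisition)]

# Toy usage with made-up surrogate outputs
candidates = np.linspace(0, 1, 5)
mean = np.array([0.80, 0.82, 0.79, 0.75, 0.70])
std = np.array([0.01, 0.02, 0.10, 0.03, 0.02])
print(ucb_next_point(candidates, mean, std))  # picks the point with modest mean but large uncertainty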

Genetic programming

As another approach, genetic programming can also be used for black box optimization. Genetic programming can be thought of as an extension of the genetic algorithm that handles tree structures. Limited to machine learning applications, there is a tool called TPOT.
・ Reference: I tried automatic machine learning (regression) with TPOT
Its strong point is that it optimizes not only hyperparameters but also the pipeline itself, including preprocessing such as feature engineering and feature selection. The pipeline is represented as a tree structure, and the structure and the hyperparameters used in each element are optimized together.
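As a minimal, hand-rolled sketch of the evolutionary idea (a plain genetic algorithm over two hyperparameters, not tree-structured genetic programming, and not how TPOT is actually implemented), assuming a hypothetical evaluate() function that would normally return a cross-validated score:

import random

def evaluate(individual):
    # Hypothetical objective: in practice this would be a cross-validated score.
    num_leaves, min_child_samples = individual
    return -abs(num_leaves - 100) - abs(min_child_samples - 25)

def mutate(individual):
    # Randomly perturb one gene.
    num_leaves, min_child_samples = individual
    if random.random() < 0.5:
        num_leaves = max(2, num_leaves + random.randint(-20, 20))
    else:
        min_child_samples = max(5, min_child_samples + random.randint(-10, 10))
    return (num_leaves, min_child_samples)

def crossover(a, b):
    # Combine genes from two parents.
    return (a[0], b[1])

population = [(random.randint(2, 256), random.randint(5, 100)) for _ in range(20)]
for generation in range(50):
    scored = sorted(population, key=evaluate, reverse=True)
    parents = scored[:10]                       # selection: keep the best half
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(10)]             # variation: crossover + mutation
    population = parents + children

print(max(population, key=evaluate))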

Other?

Recently there also seem to be methods that use reinforcement learning for this kind of black box optimization, but I don't understand them well yet...

Environment

1. Bayesian optimization: Optuna

What is Optuna

The open source hyperparameter auto-optimization framework Optuna ™ automates trial and error regarding hyperparameter values and automatically discovers hyperparameter values that perform well. It can be used with a variety of machine learning software, including the open source deep learning framework Chainer.

https://preferred.jp/ja/projects/optuna/

To perform black box optimization, you need to define the search space of the optimization targets (here, hyperparameters) and implement the objective function. Given those two things, the rest of the search is handled for you. A notable feature of Optuna is its Define-by-Run style: the parameter search space is not defined separately from the objective function, but together with it (inside `objective` below). This means dynamic settings are possible, such as changing how one parameter is defined depending on the value of another. The official tutorial gives the following examples:

- Select the type of training model from several candidates, and optimize the hyperparameters specific to the chosen type.
- Treat the number of layers of a neural network as a hyperparameter, and optimize the number of units of each layer individually.

If you use this well, I think it can handle complicated problem settings such as optimizing a pipeline structure; a rough sketch follows below.
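As a sketch of such a conditional (Define-by-Run) search space, loosely following the model-selection example from the official tutorial (the classifier candidates, dataset, and parameter ranges here are illustrative, not from this article's experiment):

import optuna
import sklearn.datasets
import sklearn.ensemble
import sklearn.model_selection
import sklearn.svm

def objective(trial):
    X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
    # The choice made here determines which parameters are suggested next.
    classifier_name = trial.suggest_categorical('classifier', ['SVC', 'RandomForest'])
    if classifier_name == 'SVC':
        svc_c = trial.suggest_loguniform('svc_c', 1e-10, 1e10)
        clf = sklearn.svm.SVC(C=svc_c, gamma='auto')
    else:
        max_depth = trial.suggest_int('rf_max_depth', 2, 32)
        clf = sklearn.ensemble.RandomForestClassifier(max_depth=max_depth, n_estimators=10)
    return sklearn.model_selection.cross_val_score(clf, X, y, cv=3).mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=20)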

Experiment

Now let's actually try the binary classification with the Titanic data. The implementation is based on the official example, adapted to the Titanic data, with changes such as using cross validation for evaluation. The model is a simple LightGBM.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import lightgbm as lgb
import sklearn.metrics
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

import optuna

def objective(trial):
    data = pd.read_csv('data/titanic/train.csv')
    #Preprocessing of Titanic data
    #Delete PassengerId and Name
    #Convert other categorical variables to ints
    data = data.drop(['PassengerId', 'Name'], axis=1)
    cat_columns = ['Sex', 'Ticket', 'Cabin', 'Embarked']
    for cat_col in cat_columns:
        target_column = data[cat_col].fillna('0')
        le = preprocessing.LabelEncoder()
        le.fit(target_column)
        label_encoded_column = le.transform(target_column)
        data[cat_col] = pd.Series(label_encoded_column).astype('category')
    data_y = data['Survived']
    data_x = data.drop('Survived', axis=1)
    dtrain = lgb.Dataset(data_x, label=data_y)
    param = {
        'objective': 'binary',
        'metric': 'binary_logloss',
        'verbosity': -1,
        'boosting_type': 'gbdt',
        'lambda_l1': trial.suggest_loguniform('lambda_l1', 1e-8, 10.0),
        'lambda_l2': trial.suggest_loguniform('lambda_l2', 1e-8, 10.0),
        'num_leaves': trial.suggest_int('num_leaves', 2, 256),
        'feature_fraction': trial.suggest_uniform('feature_fraction', 0.4, 1.0),
        'bagging_fraction': trial.suggest_uniform('bagging_fraction', 0.4, 1.0),
        'bagging_freq': trial.suggest_int('bagging_freq', 1, 7),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
    }
    cv_result = lgb.cv(param, dtrain, nfold=5, metrics='auc')
    return cv_result['auc-mean'][-1]

This `objective` is the objective function to optimize. Inside it, the search space is defined with `trial.suggest_int` and similar calls. Because the search space is defined within the objective function like this, flexible settings such as conditional parameters are possible (here everything is simply set at once). The function trains LightGBM on the data and returns the AUC obtained by cross validation; the goal is to maximize it.

We use this `objective` to perform the optimization. Since we want to maximize the AUC, we create a study object with `direction='maximize'` and set the number of trials to 200. Calling `study.optimize` runs the optimization, and the progress and results are stored in the study object.

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=200)

print(f'Number of finished trials: {len(study.trials)}')

print('Best trial:')
trial = study.best_trial

print(f'  Value: {trial.value}')

print('  Params: ')
for key, value in trial.params.items():
    print(f'    {key}: {value}')

Output


Number of finished trials: 200
Best trial:
  Value: 0.8746183253734463
  Params: 
    lambda_l1: 6.04386664092051e-07
    lambda_l2: 0.2406416092392933
    num_leaves: 103
    feature_fraction: 0.661309206008689
    bagging_fraction: 0.670194374502007
    bagging_freq: 1
    min_child_samples: 25

We were able to confirm the best trial score in the search and the parameters at that time.

Let's see how the score (the value of the objective function at each trial) changed over the course of the search.

result_values = [trial.value for trial in study.trials]
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.plot(range(1, len(study.trials)+1), result_values)

optuna_score0.png

It's pretty noisy, but... in Bayesian optimization we proceed while balancing exploration and exploitation, so I don't think the score is supposed to improve smoothly and monotonically; being "rough but gradually improving" is fine. In that sense, it doesn't look bad. In practice we adopt the best result found up to each step, so plotting that instead gives the following.


fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.plot(range(1, len(study.trials)+1), pd.Series(result_values).cummax())

optuna_score1.png

It looks good.

Next, let's visualize how the search space is explored. Below, we pick two hyperparameters (num_leaves, min_child_samples) and plot the points examined in the 200 trials. The red point is the one finally selected as the best. As expected of Bayesian optimization, it seems to search intensively around promising points while still trying out other regions.

searched_params = np.array([[trial.params['num_leaves'], trial.params['min_child_samples']] for trial in study.trials]).T
best_trial = study.best_trial
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.scatter(searched_params[0], searched_params[1])
ax.scatter(best_trial.params['num_leaves'], best_trial.params['min_child_samples'], color='r')
ax.set_xlabel('num_leaves')
ax.set_ylabel('min_child_samples')

optuna_searchspace.png

So, with the optimization of Optuna, the AUC has improved from 0.856 ... to 0.874 ...!

Try to submit

For reference, compare the scores (accuracy) submitted to kaggle by making predictions for the test data.

- Score of a LightGBM model trained with default parameters, without any optimization: **0.746**
- Score of the Optuna-optimized model: **0.741**

It got slightly worse... LightGBM is not particularly sensitive to hyperparameter values to begin with, so I suspect this is mostly overfitting to the training data.
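For reference, here is a rough sketch of how such a submission could be produced (not the exact code used in this article; the file paths, the submission format, and the choice of fitting the label encoders on train and test combined are assumptions for illustration): retrain LightGBM with the best parameters found by the study and predict on the test data.

import lightgbm as lgb
import pandas as pd
from sklearn import preprocessing

train = pd.read_csv('data/titanic/train.csv')
test = pd.read_csv('data/titanic/test.csv')

cat_columns = ['Sex', 'Ticket', 'Cabin', 'Embarked']
for cat_col in cat_columns:
    # Fit the encoder on train + test so categories unseen in train can still be transformed.
    combined = pd.concat([train[cat_col], test[cat_col]]).fillna('0')
    le = preprocessing.LabelEncoder().fit(combined)
    train[cat_col] = pd.Series(le.transform(train[cat_col].fillna('0'))).astype('category')
    test[cat_col] = pd.Series(le.transform(test[cat_col].fillna('0'))).astype('category')

train_x = train.drop(['PassengerId', 'Name', 'Survived'], axis=1)
train_y = train['Survived']
test_x = test.drop(['PassengerId', 'Name'], axis=1)

# Reuse the best hyperparameters found by the study above.
params = {'objective': 'binary', 'metric': 'binary_logloss', 'verbosity': -1,
          **study.best_trial.params}
model = lgb.train(params, lgb.Dataset(train_x, label=train_y))

pred = (model.predict(test_x) > 0.5).astype(int)
pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived': pred}).to_csv('submission.csv', index=False)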

2. Genetic programming: TPOT

What is TPOT

TPOT is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

https://epistasislab.github.io/tpot/

Another approach is black box optimization with a genetic algorithm. In Python there is a library called DEAP for genetic algorithms; TPOT uses DEAP internally and is specialized for optimizing machine learning pipelines. Compared with Bayesian optimization tools such as Optuna (TPE), its appealing point is that it optimizes the preprocessing pipeline itself. Components such as feature engineering, feature selection, and learning models are prepared in advance, and TPOT searches over which of them to run, in what order, and with which hyperparameters. The pipeline is represented as a tree structure, and that structure is included in the optimization, which looks great. To do something similar with Optuna, you would likely need conditional parameters and a very complex objective. On the other hand, since TPOT's components are limited to those provided by the tool, applying the same approach more flexibly to a different kind of task seems to require using DEAP directly.
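As a side note, TPOT lets you restrict or customize the set of operators it searches over via the `config_dict` argument. Below is a rough sketch following the format described in the TPOT documentation; the specific operators and parameter candidates are made up for illustration and are not the ones used in this article.

from tpot import TPOTClassifier

# Only these operators (with these hyperparameter candidates) will appear in pipelines.
tpot_config = {
    'sklearn.ensemble.RandomForestClassifier': {
        'n_estimators': [100, 200],
        'max_depth': range(2, 11),
    },
    'sklearn.feature_selection.SelectPercentile': {
        'percentile': range(10, 100, 10),
    },
}

tpot = TPOTClassifier(generations=10, population_size=20, cv=5,
                      config_dict=tpot_config, verbosity=2)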

Experiment

Prepare the data in the same way as before. Earlier I used LightGBM, but TPOT searches for a suitable model among multiple candidates (including xgboost if it is installed). LightGBM, however, cannot be used because it is not among the candidates.

import pandas as pd
from sklearn import preprocessing
from tpot import TPOTClassifier

tpot = TPOTClassifier(generations=100, population_size=50, cv=5, verbosity=2)

#Data preparation under the same conditions as before
data = pd.read_csv('data/titanic/train.csv')
#Preprocessing of Titanic data
#Delete PassengerId and Name
#Convert other categorical variables to ints
data = data.drop(['PassengerId', 'Name'], axis=1)
cat_columns = ['Sex', 'Ticket', 'Cabin', 'Embarked']
for cat_col in cat_columns:
    target_column = data[cat_col].fillna('0')
    le = preprocessing.LabelEncoder()
    le.fit(target_column)
    label_encoded_column = le.transform(target_column)
    data[cat_col] = pd.Series(label_encoded_column).astype('category')

#It seems that the name of the target column needs to be "class"
data.rename(columns={'Survived': 'class'}, inplace=True)
data_y = data['class']
data_x = data.drop('class', axis=1)

tpot.fit(data_x, data_y)

We use TPOTClassifier, set the genetic programming parameters as appropriate, then give it the data and call fit to run the optimization. The evaluation metric here is again the cross-validated AUC. I ran it for 100 generations; since each generation evaluates 50 individual pipelines, this takes a while.

The pipeline resulting from the optimization can be output as a .py file as follows.

tpot.export('tpot_titanic_pipeline.py')

tpot_titanic_pipeline.py


import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union
from tpot.builtins import OneHotEncoder, StackingEstimator, ZeroCount
from xgboost import XGBClassifier
from sklearn.impute import SimpleImputer

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=None)

imputer = SimpleImputer(strategy="median")
imputer.fit(training_features)
training_features = imputer.transform(training_features)
testing_features = imputer.transform(testing_features)

# Average CV score on the training set was: 0.858602724248321
exported_pipeline = make_pipeline(
    make_union(
        ZeroCount(),
        StackingEstimator(estimator=make_pipeline(
            OneHotEncoder(minimum_fraction=0.2, sparse=False, threshold=10),
            GradientBoostingClassifier(learning_rate=0.1, max_depth=6, max_features=0.8500000000000001, min_samples_leaf=9, min_samples_split=19, n_estimators=100, subsample=0.8500000000000001)
        ))
    ),
    XGBClassifier(learning_rate=0.001, max_depth=5, min_child_weight=17, n_estimators=100, nthread=1, subsample=0.6000000000000001)
)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)

Again, let's look at how the score changed. The progress of the search is stored in tpot.evaluated_individuals_, but not enough information is kept there... (there is a history of evaluated pipelines, but the order is not recorded, so you cannot tell which generation each one belongs to and therefore cannot reconstruct the score transition).
・ Reference: I tried more automatic machine learning with TPOT

Since there is no other way, here is the result of painstakingly reconstructing that information from the standard output. The search seems to be progressing properly.
tpot_score.png

So, by optimizing TPOT, AUC has improved from 0.832 ... to 0.858 ...!

Try to submit

Again, make a prediction for the test data and look at the submitted score (accuracy).

- Score of an xgboost-only model trained with default parameters: **0.799**
- Score of the TPOT-optimized model: **0.770**

This is also a little worse. The degradation is larger than in the Optuna case; since TPOT optimizes the whole pipeline and the search space is larger, it may be easier to overfit. And the default xgboost is quite strong to begin with.

Summary

In both cases the models seemed to overfit the training data, but in terms of maximizing the objective function, I was able to confirm that the search proceeded well.

- If you want to do Bayesian optimization, Optuna can handle even complicated problems with flexible settings.
- If you want to use a genetic algorithm:
  - TPOT is very easy and convenient for the specific purpose of optimizing a machine learning pipeline.
  - Use DEAP directly if you want to tackle a different kind of problem.
