I recently heard about Bayesian optimization, so I wrote up a summary. I am not a specialist, so I may have gotten something wrong; if you find a mistake, I would appreciate it if you could point it out.
There are quite a lot of things in the world that are expensive to actually experiment with, so we want to design experiments systematically. Bayesian optimization designs each next experiment in a "Bayesian" way, based on the results of the experiments so far. There seem to be a number of fields where it is actually used. The most familiar example is hyperparameter search in machine learning: it is often done by hand, or, if automated, by grid search or random search. Bayesian optimization can do better. It builds a surrogate (acquisition) function from the data gathered so far that predicts how promising the next experiment will be, finds the point that maximizes that function, and runs the next experiment there.
As in explanations that often come up in reinforcement learning, there is a trade-off between exploration and exploitation. Exploitation means using what is already known to be reasonably good, while exploration means trying something unknown to gain a completely different kind of experience. Roughly speaking, Bayesian optimization selects the next action within an optimization framework by considering both exploitation and exploration; in other words, it uses an acquisition function that accounts for both. One could also say that humans unconsciously perform something like Bayesian optimization when buying something or trying something for the first time.
The gif above is a rough illustration of Bayesian optimization. The yellow line on the left is the true function. The task is to pick points one by one from the domain ($(-20, 20)$ this time) to find the point that maximizes this function. In this demo each evaluation (experiment) finishes in an instant, but keep in mind that in practice an evaluation is assumed to take time. The red line is the mean of the predictions made from the data points so far, and the green area is the confidence band (mean $\pm$ standard deviation). This mean and variance are computed with a Gaussian process from the observed data points.
How the next point is scored and picked is where Bayesian optimization comes in. The first point is chosen at random; each subsequent point is the one that maximizes the acquisition function, shown as the blue line on the right. In other words, points where the blue function is high are promising, so try them next. As mentioned above, designing an acquisition function that balances exploration and exploitation is the key to Bayesian optimization.
First I will explain GPs, then introduce some acquisition functions. After that, I summarize other points that seemed important in the literature I read.
First and foremost, a GP is a stochastic process. A stochastic process is a collection of random variables $\{X_{t}\}_{t \in T}$ indexed by a set $T$.
For example, if $T$ is discrete there are discrete-time Markov chains, and if it is continuous there is the Wiener process. An important point is that, depending on what is held fixed, a stochastic process can be viewed either as a function of time or as a single random variable cut out at a fixed time; the former view is often called a sample path. The existence of a continuous-time stochastic process is guaranteed when a "consistency" condition is satisfied.[^1] In a Gaussian process, the joint distribution of $y$ over any finite set of $x$ follows a Gaussian distribution; satisfying "consistency" then guarantees that such a continuous stochastic process exists. A Gaussian distribution is uniquely determined once its mean and covariance are fixed. The stochastic process whose mean is 0 and whose covariance is given by a kernel is the Gaussian process in the machine-learning sense.
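As a concrete illustration of sample paths from a zero-mean Gaussian process whose covariance is given by a kernel, here is a minimal sketch. The RBF (squared-exponential) kernel, the length scale, and the grid are my own illustrative choices, not from the post:

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0):
    # Gram matrix of k(a, b) = exp(-(a - b)^2 / (2 * length_scale^2))
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

rng = np.random.default_rng(0)
xs = np.linspace(-5, 5, 100)
# A finite set of inputs gives a multivariate Gaussian; sampling from it
# yields sample paths of the process restricted to the grid.
K = rbf_kernel(xs, xs) + 1e-8 * np.eye(len(xs))  # jitter for stability
paths = rng.multivariate_normal(np.zeros(len(xs)), K, size=3)
print(paths.shape)  # (3, 100): three sample paths over the grid
```

Each row of `paths` is one draw of the "function of time" view; fixing a column instead gives the single-random-variable view.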
[^1]: The claim that consistency under marginalization should be enough is easy to accept, but the proof is quite involved; it was proven by Kolmogorov. Treating stochastic processes over continuous time rigorously can be a daunting task.
The simplest interpretation of kernel regression is ridge regression with the kernel substituted in, but in the context of Bayesian optimization the interpretation as a Gaussian process is important, so I will describe that. Suppose we have $x$, $y$, $t$: $x$ is the input, $y$ is the output, and $t$ is $y$ plus noise (with variance $\beta$), representing the actually observed value. First, put the prior $N(0, K)$ on $y$, where $K$ is the Gram matrix of the kernel. Then, after $N$ points have been observed, the posterior (predictive) distribution of $t_{N+1}$ at $x_{N+1}$ is determined. This posterior is Gaussian, with the following mean and variance.
$\mu_{N+1} = k^{T}(K+\beta I)^{-1} t_{1:N}$ (where $k$ is the vector whose entries are $k(x_{n}, x_{N+1})$)
$\sigma^{2}_{N+1} = k(x_{N+1},x_{N+1}) - k^{T}(K+\beta I)^{-1}k$
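A minimal NumPy sketch of these posterior formulas. The RBF kernel, the length scale, and the helper names are my own assumptions; the linked notebook is the author's actual code:

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, t_train, x_star, beta=1e-4, length_scale=1.0):
    """Posterior mean/variance at x_star given noisy observations t_train."""
    K = rbf_kernel(x_train, x_train, length_scale)
    k = rbf_kernel(x_train, x_star, length_scale)          # N x M matrix
    A_inv = np.linalg.inv(K + beta * np.eye(len(x_train)))
    mu = k.T @ A_inv @ t_train                             # k^T (K + beta I)^{-1} t
    # diag of k(x*, x*) - k^T (K + beta I)^{-1} k, computed column-wise
    var = rbf_kernel(x_star, x_star, length_scale).diagonal() \
          - np.sum(k * (A_inv @ k), axis=0)
    return mu, var
```

Near the training points the variance collapses toward the noise level, and far away it returns to the prior variance, which is exactly the behavior the green band in the gif shows.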
The acquisition function $a(x)$ is defined in order to select the $x_{N+1}$ above. As mentioned, the goal is a function that successfully captures the exploitation–exploration trade-off: from the "exploitation" viewpoint, points with a high mean should be selected, and from the "exploration" viewpoint, points with a high variance should be selected. From here on, $f^{+}$ denotes the maximum output among the points obtained up to the $N$-th. The code is available here: https://github.com/Ma-sa-ue/practice/blob/master/machine%20learning(python)/bayeisan_optimization.ipynb
Probability of Improvement (PI)
$\Phi$ is the cdf of the standard normal distribution. You can see from the gif that PI does not work so well. $\xi$ is introduced to keep it from becoming too greedy, but it still tends to weight exploitation too heavily.

Expected Improvement (EI)
$\phi$ is the pdf of the standard normal distribution. This works considerably better than PI. Taking the expected value of the improvement ($f_{N+1}(x) - f^{+}$) gives the above. A value of about 0.01 seems to be recommended for $\xi$.
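The PI and EI formulas above can be sketched as follows. The exact expressions follow the standard definitions with the $\xi$ offset the post mentions; `mu`, `sigma`, and the function names are my own, and a hypothetical GP posterior is assumed to supply them:

```python
import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, sigma, f_best, xi=0.01):
    # PI(x) = Phi((mu(x) - f+ - xi) / sigma(x))
    z = (mu - f_best - xi) / np.maximum(sigma, 1e-12)
    return norm.cdf(z)

def expected_improvement(mu, sigma, f_best, xi=0.01):
    # EI(x) = (mu - f+ - xi) * Phi(z) + sigma * phi(z)
    sigma = np.maximum(sigma, 1e-12)
    z = (mu - f_best - xi) / sigma
    return (mu - f_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)
```

Note how EI rewards both a high mean (through the first term) and a high variance (through the second), which is the exploitation–exploration balance PI largely misses.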
GP Upper (Lower) Confidence Bound
This is based on the simple Upper Confidence Bound method, in which the coefficient applied to the standard deviation is originally a constant. Since that does not work very well, GP-UCB adjusts the coefficient at each update. You can see that it shows a similar tendency to EI.
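A minimal sketch of GP-UCB with a round-dependent coefficient. The particular schedule below is in the style of common GP-UCB presentations, and the constants (`delta`, the $\pi^2/6$ factor) are illustrative assumptions rather than the post's:

```python
import numpy as np

def gp_ucb(mu, sigma, t, dim=1, delta=0.1):
    """UCB acquisition: mean plus a slowly growing multiple of the std.

    t is the 1-based round number; beta_t grows with t so that the weight
    on exploration increases over the run instead of staying constant.
    """
    beta_t = 2.0 * np.log(dim * (t ** 2) * (np.pi ** 2) / (6.0 * delta))
    return mu + np.sqrt(beta_t) * sigma
```

With a fixed constant instead of `beta_t`, this degenerates into the plain UCB method the post says does not work very well.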
Thompson sampling
Draw a sample path and use it as the acquisition function.
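On a grid this is very short to sketch: draw one path from the GP posterior and take its argmax as the next point. The function name is my own, and the posterior mean vector and covariance matrix are assumed to come from a GP fit such as the one described above:

```python
import numpy as np

def thompson_next_point(xs, mu, cov, rng=None):
    """Pick the next point by maximizing one posterior sample path."""
    rng = rng or np.random.default_rng()
    # One joint draw over the grid acts as the acquisition function.
    path = rng.multivariate_normal(mu, cov + 1e-8 * np.eye(len(xs)))
    return xs[np.argmax(path)]
```

Because each call uses a fresh random path, the method explores automatically: where the posterior is uncertain, the path can swing high and get picked.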
Information-theoretic approaches
Use acquisition functions based on information theory.
The basic concepts are probably covered by what has been written so far, but various studies build on these ideas. While getting an overview, I tried to summarize points that seem important but were not covered above.
A method has been proposed that assumes a manifold hypothesis for the search input, i.e., that the effective input lies on a low-dimensional submanifold, and performs the optimization while reducing the dimension accordingly. It seems to be a really good technique. http://arxiv.org/pdf/1301.1942.pdf
An ordinary kernel represents a stationary stochastic process, but real-world data is often non-stationary. The idea is to represent such a non-stationary process by warping the input with a bijective function and performing Bayesian optimization on the warped input. http://jmlr.org/proceedings/papers/v32/snoek14.pdf
How to choose the kernel is very important. In my experiment I used a simple kernel with a quadratic form in the exponent (the squared-exponential kernel), but there seem to be various kernels designed for Bayesian optimization.
What to do about the kernel's hyperparameter settings is an annoying problem. For example, the method used in the well-known library spearmint seems to approximately integrate the hyperparameters out. https://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms.pdf
Bayesian optimization itself has been around for quite a long time. With the development of machine learning, it seems to have seen a revival for hyperparameter search. The basics written at the beginning may be old, but recent research has produced various developments in both theory and application.
Note that Bayesian optimization is performed on the premise that evaluating the objective function is expensive. In the demo video, $y$ is obtained instantly, but of course in real problems it is not. The acquisition function, by contrast, is cheap to evaluate; however, there is no guarantee that it is convex or differentiable, so a primitive method such as dividing the interval into a grid is used to find its maximum (probably).
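The "dividing the interval" idea above can be sketched in a few lines: evaluate the cheap acquisition function densely on a grid and take the argmax, with no convexity or gradient assumptions. The bounds match the $(-20, 20)$ domain used in the demo; the function name and grid size are my own choices:

```python
import numpy as np

def argmax_on_grid(acquisition, lo=-20.0, hi=20.0, n=2001):
    """Maximize an acquisition function by brute-force grid evaluation."""
    xs = np.linspace(lo, hi, n)
    return xs[np.argmax(acquisition(xs))]

# Toy check with a known maximizer at x = 3.
best = argmax_on_grid(lambda x: -(x - 3.0) ** 2)
print(best)  # close to 3.0, up to the grid spacing
```

This is wasteful in higher dimensions, but for a cheap one-dimensional acquisition function it is simple and robust.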
Observation is assumed to be costly, and moreover the cost can differ depending on the input. There seem to be models that take this into account.