An introduction to Bayesian optimization

I recently learned about Bayesian optimization, so this post is my summary of it. I am not a specialist, so I may have gotten some things wrong; if you spot a mistake, I would appreciate it if you could point it out.

Motivation for Bayesian optimization

There are quite a few things in the world whose experiments are expensive or difficult to actually run, so we want to design those experiments systematically. Bayesian optimization designs the next experiment in a "Bayesian" way, at each step, based on the results of the previous experiments.

It seems to be used in several fields. The most basic example is hyperparameter search in machine learning: I think this is often done by hand, or, when automated, by grid search or random search, but Bayesian optimization can do better. In Bayesian optimization, a surrogate function (acquisition function) that predicts how promising the next experiment would be is built from the data obtained so far, the point that maximizes that function is found, and the next experiment is run there.

Intuitive description of Bayesian optimization

This is an explanation that often comes up in reinforcement learning: we are constantly torn between exploration and exploitation. Exploitation means relying on something already known to be reasonably good, while exploration means trying something unknown to gain completely new experience. Bayesian optimization can be loosely described as a method that chooses the next action by balancing both exploration and exploitation within an optimization framework; in other words, it uses an acquisition function that accounts for both. One could also say that humans unconsciously perform something like Bayesian optimization when buying something or trying something for the first time.

Specific description of Bayesian optimization

(Animated gif: Bayesian optimization on a 1-D example, showing the true function, the GP posterior, and the acquisition function.)

The gif above is a rough illustration of Bayesian optimization. The yellow line in the left panel is the true function. The task is to find the point that maximizes this function by sampling points, one at a time, from the constraint interval ((-20, 20) in this case). In this demo the evaluation (the "experiment") of a point finishes in an instant, but keep in mind that in real applications each evaluation takes time. The red line is the mean of the values predicted from the data points obtained so far, and the green band is a confidence interval ($\mu \pm \sigma$). This mean and variance are computed with a Gaussian process fitted to the observed data points.

The question is how to pick the points, and that is where Bayesian optimization comes in. Say the first point is chosen at random. From then on, each next point is the one that maximizes the acquisition function, shown as the blue line in the right panel. In other words, points where the blue function is high look promising, so we evaluate one of them next. As mentioned earlier, designing an acquisition function that balances exploration and exploitation is the key to Bayesian optimization.
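To make the loop concrete, here is a minimal, self-contained sketch (this is not the code from my notebook; the true function, RBF kernel, noise level, and UCB-style acquisition are all stand-ins chosen just for illustration):

```python
import numpy as np

def rbf_kernel(a, b, length_scale=2.0):
    """Squared-exponential kernel matrix between 1-D input arrays a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale ** 2)

def gp_posterior(x_obs, y_obs, x_grid, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP at x_grid."""
    K = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k = rbf_kernel(x_obs, x_grid)                  # cross-covariances
    K_inv = np.linalg.inv(K)
    mean = k.T @ K_inv @ y_obs
    var = 1.0 - np.sum(k * (K_inv @ k), axis=0)    # k(x, x) = 1 for this kernel
    return mean, np.sqrt(np.maximum(var, 0.0))

def true_function(x):                              # stand-in for the "yellow line"
    return np.sin(x) + 0.1 * x

rng = np.random.default_rng(0)
x_grid = np.linspace(-20, 20, 1001)                # the constraint interval (-20, 20)
x_obs = rng.uniform(-20, 20, size=1)               # first point chosen at random
y_obs = true_function(x_obs)

for _ in range(10):
    mean, std = gp_posterior(x_obs, y_obs, x_grid)
    acq = mean + 2.0 * std                         # UCB-style acquisition ("blue line")
    x_next = x_grid[np.argmax(acq)]                # most promising candidate
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, true_function(x_next))

print(x_obs[np.argmax(y_obs)], y_obs.max())        # best point found so far
```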

A little more detailed description of Bayesian optimization

First I will explain GPs, then introduce some acquisition functions, and after that summarize other points that seemed important in the literature I read.

Introduction to Gaussian processes (GP)

First and foremost, a GP is a stochastic process. A stochastic process is a family of random variables $\{X_t\}_{t \in T}$. (In signal processing $T$ is time; in the machine learning context of GPs it is the input domain.)

For example, a discrete-time Markov chain has discrete $T$, and the Wiener process has continuous $T$. One important point is that, depending on what is held fixed, a stochastic process can be viewed either as a function of $t$ (fixing the random outcome) or as an ordinary random variable cut out at a particular $t$; the former is usually called a sample path. Also, the existence of a continuous-time stochastic process is guaranteed when "consistency" is satisfied.[^1] In a Gaussian process, the joint distribution of the outputs $y$ for any finite set of inputs $x$ is Gaussian, and satisfying "consistency" guarantees that such a continuous stochastic process exists. A Gaussian distribution is uniquely determined by its mean and covariance, so the stochastic process whose mean is 0 and whose covariance is given by a kernel is what is called a Gaussian process in machine learning.

[^1]: The claim that the finite-dimensional distributions should be consistent under marginalization is easy to understand, but the proof of existence is quite involved; it was proven by Kolmogorov. Treating stochastic processes in continuous time rigorously can be a daunting task.
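As a small illustration of the sample-path view, here is a sketch (assuming a squared-exponential kernel, which is my choice and not forced by anything above) of drawing a few sample paths from a zero-mean GP prior: each draw from the joint Gaussian over a finite grid of inputs is one function, while fixing a single grid point gives an ordinary random variable.

```python
import numpy as np

# Sketch: three sample paths from a zero-mean GP prior with a squared-exponential
# kernel, evaluated on a finite grid of inputs.
x = np.linspace(-5, 5, 200)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)        # gram matrix k(x_i, x_j)
rng = np.random.default_rng(1)
paths = rng.multivariate_normal(np.zeros_like(x), K + 1e-8 * np.eye(len(x)), size=3)
print(paths.shape)                                       # (3, 200): three sample paths
```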

Description of kernel regression

The simplest view of kernel regression is as ridge regression with the kernel trick, but in the context of Bayesian optimization the interpretation as a Gaussian process is what matters, so I will describe that. Suppose we have $x$, $y$, and $t$, where $x$ is the input, $y$ is the output, and $t$ is $y$ plus noise (with variance $\beta$), i.e. the actually observed value. Let the prior on $y$ be $N(0, K)$, where $K$ is the Gram matrix built from the kernel. Then, after $N$ points have been observed, the posterior (predictive) distribution of $t_{N+1}$ at a new input $x_{N+1}$ is determined. This predictive distribution is Gaussian, with the following mean and variance.

$\mu_{N+1} = k^{T}(K+\beta I)^{-1} t_{1:N}$ (where $k$ is the vector whose entries are $k(x_{n}, x_{N+1})$ for $n = 1, \dots, N$)

$\sigma^{2}_{N+1} = k(x_{N+1}, x_{N+1}) - k^{T}(K+\beta I)^{-1} k$
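Written out directly in numpy, the two formulas above look like this (a sketch with a toy squared-exponential kernel and made-up data; $\beta$ is the observation noise variance):

```python
import numpy as np

def kern(a, b):
    """Toy squared-exponential kernel."""
    return np.exp(-0.5 * (a - b) ** 2)

x = np.array([-1.0, 0.0, 2.0])            # observed inputs x_{1:N}
t = np.array([0.3, 0.8, -0.5])            # noisy observations t_{1:N}
beta = 0.1                                # observation noise variance
x_new = 1.0                               # the new input x_{N+1}

K = kern(x[:, None], x[None, :])          # gram matrix K
k = kern(x, x_new)                        # vector with entries k(x_n, x_{N+1})
A_inv = np.linalg.inv(K + beta * np.eye(len(x)))
mu = k @ A_inv @ t                        # predictive mean mu_{N+1}
sigma2 = kern(x_new, x_new) - k @ A_inv @ k   # predictive variance sigma^2_{N+1}
print(mu, sigma2)
```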

Acquisition function design

We define an acquisition function $a(x)$ in order to select the $x_{N+1}$ above. As mentioned earlier, the goal is a function that captures the trade-off between exploitation and exploration: from the "exploitation" point of view, points with a high predicted mean should be chosen, and from the "exploration" point of view, points with a high variance should be chosen. From here on, $f^{+}$ denotes the maximum output among the points obtained up to the $N$-th one. My code is available here: https://github.com/Ma-sa-ue/practice/blob/master/machine%20learning(python)/bayeisan_optimization.ipynb

Probability of Improvement (PI)

(Animated gif: behaviour of the PI acquisition function.)

$$ a_{PI}(x) = \Phi(Z), \qquad Z = \frac{\mu(x) - f^{+} - \xi}{\sigma(x)} $$

$\Phi$ is the cdf of the standard normal distribution. You can see from the gif that PI is not so good: $\xi$ is included to keep it from becoming too greedy, but it still tends to lean heavily toward exploitation.
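As a sketch, PI can be computed from the GP posterior like this (`mu`, `sigma`, and `f_best` are assumed to be the posterior mean, posterior standard deviation, and best observed value; these names are mine, not from the notebook):

```python
import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, sigma, f_best, xi=0.01):
    """PI acquisition: Phi(Z) with Z = (mu - f_best - xi) / sigma."""
    z = (mu - f_best - xi) / np.maximum(sigma, 1e-12)   # guard against sigma = 0
    return norm.cdf(z)

# Toy usage: two candidate points, best observation so far is 1.0.
print(probability_of_improvement(np.array([0.5, 1.2]), np.array([0.3, 0.3]), 1.0))
```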

Expected Improvement (EI)

(Animated gif: behaviour of the EI acquisition function.)

$$ a_{EI}(x) = (\mu(x) - f^{+} - \xi)\,\Phi(Z) + \sigma(x)\,\phi(Z) $$

$\phi$ is the pdf of the standard normal distribution. This works considerably better than PI. Taking the expected value of the improvement ($f_{N+1}(x) - f^{+}$) under the GP posterior gives the expression above. A value of about 0.01 is apparently recommended for $\xi$.
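Under the same conventions as the PI sketch (`mu`, `sigma`, and `f_best` are assumed names), EI looks like this:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """EI acquisition: (mu - f_best - xi) * Phi(Z) + sigma * phi(Z)."""
    sigma = np.maximum(sigma, 1e-12)                    # guard against sigma = 0
    z = (mu - f_best - xi) / sigma
    return (mu - f_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)
```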

GP Upper (Lower) Confidence Bound

(Animated gif: behaviour of the UCB acquisition function.)

$$ a_{UCB}(x) = \mu(x) + k_{N}\,\sigma(x) $$

The original Upper Confidence Bound is also a simple method in which the coefficient applied to the standard deviation is a constant. That alone does not work very well, so GP-UCB adjusts the coefficient $k_{N}$ at every update. You can see that it behaves similarly to EI.
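A sketch of a GP-UCB acquisition whose coefficient grows with the iteration count (the schedule below is one choice from the GP-UCB literature for a finite candidate set, not necessarily the one used in the gif; `delta` is a confidence parameter):

```python
import numpy as np

def gp_ucb(mu, sigma, n_iter, n_candidates, delta=0.1):
    """UCB acquisition with an iteration-dependent coefficient k_N."""
    k_n = np.sqrt(2.0 * np.log(n_candidates * n_iter ** 2 * np.pi ** 2 / (6.0 * delta)))
    return mu + k_n * sigma
```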

Thompson sampling

Draw a sample path from the GP posterior and use it as the acquisition function.
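A sketch of this idea on a discrete grid (`mean` and `cov` are assumed to be the GP posterior mean vector and covariance matrix over the candidate points; these names are mine):

```python
import numpy as np

def thompson_sample_next(mean, cov, x_grid, rng=None):
    """Draw one sample path from the GP posterior and return its argmax."""
    rng = np.random.default_rng() if rng is None else rng
    path = rng.multivariate_normal(mean, cov + 1e-8 * np.eye(len(mean)))  # jitter for stability
    return x_grid[np.argmax(path)]
```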

Information-theoretic approaches

Use acquisition functions based on information-theoretic criteria.

Others

The basic concepts are probably covered by what has been written so far, but various studies have built on them. Below I summarize points that seemed important while I was getting an overview, even though they are not covered above.

What about high-dimensional exploration?

A method has been proposed that assumes a manifold hypothesis on the search input, that is, it optimizes while reducing the dimension under the hypothesis that the effective input lives on a low-dimensional submanifold. It seems to be a really good technique. http://arxiv.org/pdf/1301.1942.pdf

Effectiveness of input warping

With an ordinary kernel the model represents a stationary stochastic process, but real-world data is often non-stationary. The idea here is to perform Bayesian optimization after warping the input with a bijective function, so that such a non-stationary process can be represented. http://jmlr.org/proceedings/papers/v32/snoek14.pdf
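The paper above warps each input dimension with the cdf of a Beta distribution. As a rough sketch (the hyperparameters `a` and `b` are fixed here for illustration, whereas in the paper they are learned together with the kernel hyperparameters):

```python
import numpy as np
from scipy.stats import beta

def warp(x, a=2.0, b=0.5):
    """Bijective warping of inputs in [0, 1] through a Beta cdf."""
    return beta.cdf(x, a, b)

def warped_rbf(x1, x2, length_scale=0.2, a=2.0, b=0.5):
    """Squared-exponential kernel applied to warped (non-stationary) inputs."""
    u1, u2 = warp(x1, a, b), warp(x2, a, b)
    return np.exp(-0.5 * (u1[:, None] - u2[None, :]) ** 2 / length_scale ** 2)
```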

What kernel do you choose in the first place?

How to choose the kernel is very important. In my experiment I used a simple squared-exponential kernel (an exponential of a quadratic form), but there seem to be various kernels suited to Bayesian optimization.
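For illustration, here is the squared-exponential kernel next to the Matern 5/2 kernel, which is often suggested for Bayesian optimization (a sketch for 1-D inputs with a fixed length scale; picking Matern 5/2 as the comparison is my addition, not something from the original post):

```python
import numpy as np

def squared_exponential(r, l=1.0):
    """k(r) = exp(-r^2 / (2 l^2)) for the distance r between two inputs."""
    return np.exp(-0.5 * (r / l) ** 2)

def matern52(r, l=1.0):
    """Matern 5/2 kernel as a function of the distance r."""
    s = np.sqrt(5.0) * np.abs(r) / l
    return (1.0 + s + s ** 2 / 3.0) * np.exp(-s)
```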

What to do with kernel hyperparameters?

What to do about the kernel hyperparameters is an annoying problem. For example, the method used in the well-known library spearmint seems to approximately integrate the hyperparameters out. https://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms.pdf

Is it really that simple?

Bayesian optimization itself has been around for quite some time; it seems to have seen a revival for hyperparameter search with the development of machine learning. The basics I wrote about at the beginning may be old, but at the current research level there have been various developments in both theory and application.

How to maximize the acquisition function?

Keep in mind that Bayesian optimization is carried out on the premise that observing the objective function is costly; in the animation $y$ is obtained instantly, but of course in practice it is not. Evaluating the acquisition function, on the other hand, is cheap. However, there is no guarantee that it is convex or differentiable, so a primitive method such as dividing the interval into a grid is (probably) used to find its maximum.
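A sketch of such a primitive maximization over the constraint interval (the grid size and interval bounds here are arbitrary; `acquisition` is any function that maps an array of candidates to acquisition values):

```python
import numpy as np

def maximize_on_grid(acquisition, lo=-20.0, hi=20.0, n=2001):
    """Evaluate the acquisition on a fine grid and return the best candidate."""
    grid = np.linspace(lo, hi, n)
    return grid[np.argmax(acquisition(grid))]
```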

Isn't the cost of observing the objective function different for each input?

We assume that observation has a certain cost, but it is also true that the cost can differ depending on the input. There seem to be models that take this into account.

References (honestly, you are better off reading these than this article)
