Nice to meet you. My name is Mimura, and I am in charge of day 21 of the NTT DoCoMo Service Innovation Department Advent Calendar. DoCoMo volunteers are writing an Advent Calendar again this year, so I would like to contribute an article.
First, a question: do you know what a regression problem is? Regression problems show up in many places, such as stock price forecasting, and DoCoMo also deals with them in many areas.
In this article I will implement Gaussian process regression, one of the methods for solving regression problems. The goal this time is an implementation in numpy, not a theoretical explanation. For the details of the theory, I recommend ["Gaussian Process and Machine Learning (Machine Learning Professional Series)"](https://www.amazon.co.jp/dp/4061529269).
Also, if you want to use Gaussian processes in practice, use GPyTorch or GPy! Finally, note that this time I have not optimized the kernel function or tuned its parameters. Sorry...
A regression problem is, simply put, "I want to predict some data $Y$ from some observable data $X$." One way to solve it is to prepare a function such as $y = Ax + B$: if we can observe a variable $x_i$, we can predict $y_i$.
In reality, however, the data are often hard to express with such a simple function. One way around this is to make the formula more complex and increase its expressiveness, for example $y = Ax^2 + Bx + C$.
This time, we will write such a complicated function as $y = W^T \phi(x)$. For example, with $W = [A, B, C]$ and $\phi(x) = [x^2, x^1, x^0]$, we recover $y = Ax^2 + Bx + C$. In other words, if we make $\phi(x)$ rich enough and find the corresponding $W$, we can predict $Y$ well!!!
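As a concrete illustration (a minimal sketch with made-up toy data, not part of the original article), $W$ for such a feature map can be found by ordinary least squares:

```python
import numpy as np

# Hypothetical toy data: y = 2x^2 - 3x + 1 plus a little noise
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 2 * x**2 - 3 * x + 1 + 0.1 * rng.standard_normal(len(x))

# Feature map phi(x) = [x^2, x^1, x^0]
Phi = np.stack([x**2, x, np.ones_like(x)], axis=1)   # shape (50, 3)

# Least-squares estimate of W; the prediction is y_hat = Phi @ W
W, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_hat = Phi @ W
print(W)   # close to [2, -3, 1]
```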
Yay!!!!!!!!!!
…
If I stopped here it would defeat the title of this article, so from here on I will explain the Gaussian process.
First, consider using Gaussian basis functions for $\phi(x)$ in $y = W^T \phi(x)$; that is, try $\phi_h(x) = \exp\left(-\frac{(x-\mu_h)^2}{\sigma}\right)$. The regression problem can still be solved this way.
However, with this approach the amount of computation becomes enormous because of the curse of dimensionality. To deal with this, consider taking the expectation over $W$ and integrating it out of the model.
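To make the problem concrete (a rough sketch with assumed numbers, not from the original article): a Gaussian basis needs a grid of centers $\mu_h$, and the number of centers grows exponentially with the input dimension.

```python
import numpy as np

def gaussian_features(x, centers, sigma=0.1):
    """Gaussian basis features for 1-D inputs x (N,) and centers (H,); returns (N, H)."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / sigma)

x = np.linspace(-1, 1, 5)
centers = np.linspace(-1, 1, 10)              # H = 10 centers in one dimension
print(gaussian_features(x, centers).shape)    # (5, 10)

# For d-dimensional inputs, a grid of centers means H**d basis functions
H = 10
for d in (1, 2, 3, 6):
    print(d, H ** d)                          # 10, 100, 1000, 1000000
```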
With $y = \Phi W$, suppose $W$ is generated from a Gaussian distribution with mean $0$ and covariance $\lambda^2 I$, i.e. $W \sim N(0, \lambda^2 I)$. This means that $y$ is a linear transformation of the Gaussian vector $W$ by the constant matrix $\Phi$.
At this time, the expected value of $y$ is $E[y] = \Phi E[W] = 0$, and its covariance is $\mathrm{cov}[y] = E[yy^T] - E[y]E[y]^T = \Phi E[WW^T]\Phi^T = \lambda^2 \Phi\Phi^T$.
From this result, $y$ follows the multivariate Gaussian distribution $y \sim N(0, \lambda^2 \Phi \Phi^T)$.
Let $K = \lambda^2 \Phi \Phi^T$, i.e. $K_{n,n'} = \lambda^2 \phi(x_n)^T \phi(x_{n'})$, and normalize $y$ so that its mean is $0$; then $y \sim N(0, K)$. Even so, computing every $K_{n,n'} = \lambda^2 \phi(x_n)^T \phi(x_{n'})$ is still heavy... This is where the famous "use the kernel trick!" comes in: the idea is that $K_{n,n'}$ can be computed directly, without ever having to work out $\phi(x_n)$.
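Concretely, the kernel trick means choosing a kernel function $k$ and filling in each entry of $K$ directly; the RBF kernel used later in this article is one such choice (the $\lambda^2$ factor is absorbed into $\theta_1$):

```math
K_{n,n'} = k(x_n, x_{n'}),\qquad
k(x,x')=\theta_1\exp\left(-\frac{(x-x')^2}{\theta_2}\right)
```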
At this point, given data

```math
D=\{(x_1,y_1),(x_2,y_2),\cdots,(x_N,y_N)\}
```

all we have to do is answer the question: after learning from $D$, what is $y^{new} = w^T \phi(x^{new})$ when we observe a new input $x^{new}$?
To do this, let $y'=(y_1,y_2,\cdots,y_N,y^{new})$ and $X'=(x_1,x_2,\cdots,x_N,x^{new})$; then $y' \sim N(0, K')$ for the corresponding Gram matrix $K'$.
Written out, this joint distribution is

```math
\begin{pmatrix}
y \\
y^{new}
\end{pmatrix}
\sim
N\left(0,\begin{pmatrix}
K & k_{new} \\
k_{new}^T & k_{new,new}
\end{pmatrix}\right)
```

where $k_{new}$ and $k_{new,new}$ are as follows.
```math
\left\{
\begin{array}{ll}
k_{new} &= (k(x^{new},x_1),k(x^{new},x_2),\cdots,k(x^{new},x_N))^T \\
k_{new,new} &= k(x^{new},x^{new})
\end{array}
\right.
```
Here we use the standard result for a partitioned Gaussian: when

```math
\begin{pmatrix}
y_1 \\
y_2
\end{pmatrix}
\sim N\left(
\begin{pmatrix}
\mu_1 \\
\mu_2
\end{pmatrix}
,
\begin{pmatrix}
\Sigma_{1,1} & \Sigma_{1,2} \\
\Sigma_{2,1} & \Sigma_{2,2}
\end{pmatrix}
\right)
```

the conditional distribution is $p(y_2|y_1)=N\left(\mu_2+\Sigma_{2,1}\Sigma_{1,1}^{-1}(y_1-\mu_1),\ \Sigma_{2,2}-\Sigma_{2,1}\Sigma_{1,1}^{-1}\Sigma_{1,2}\right)$.
It was a long road, but in short, applying this with $\mu_1=\mu_2=0$, $\Sigma_{1,1}=K$, $\Sigma_{1,2}=k_{new}$, $\Sigma_{2,1}=k_{new}^T$, and $\Sigma_{2,2}=k_{new,new}$ gives

```math
p(y^{new}|x^{new},D)=N(k_{new}^TK^{-1}y,\ k_{new,new}-k_{new}^TK^{-1}k_{new})
```

All we have to do is implement this!!!
Now let's actually implement this formula with numpy.
Here, $K$, $k_{new}$, and $k_{new,new}$ are given by
```math
\left\{
\begin{array}{ll}
K &=\left[\begin{array}{llll}
k(x_1,x_1) & k(x_1,x_2) & \cdots & k(x_1,x_N) \\
k(x_2,x_1) & k(x_2,x_2) & \cdots & k(x_2,x_N) \\
\vdots \\
k(x_N,x_1) & k(x_N,x_2) & \cdots & k(x_N,x_N)
\end{array}
\right] \\
k_{new} &= (k(x^{new},x_1),k(x^{new},x_2),\cdots,k(x^{new},x_N))^T \\
k_{new,new} &= k(x^{new},x^{new})
\end{array}
\right.
```
For the kernel $k(x,x')$, this article uses an RBF kernel plus a noise term:

```math
k(x,x')=\theta_1\exp \left(-\frac{(x-x')^2}{\theta_2}\right)+\theta_3\delta(x,x')
```

In numpy, this kernel function can be written as follows.
```python
import numpy as np

def RBF(X, X_):
    # RBF kernel with fixed (untuned) hyperparameters
    theta_1 = 1
    theta_2 = 0.2
    theta_3 = 0.01
    # Squared Euclidean distance between every pair of points
    distance = ((X - X_)**2).sum(-1)
    if distance.shape[0] == distance.shape[1]:
        # Square Gram matrix: add the noise term theta_3 on the diagonal
        return theta_1 * np.exp(-1 * distance/theta_2) + theta_3 * np.eye(len(X))[:,:X.shape[1]]
    else:
        # Rectangular cross matrix (k_new): no noise term
        return theta_1 * np.exp(-1 * distance/theta_2)
```
Here the noise term $\theta_3$ is added on the diagonal only when the resulting Gram matrix is square (i.e. for $K$ and $k_{new,new}$); for the rectangular cross matrix $k_{new}$ it is skipped.
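The snippets below assume that the training inputs `X` (shape `(N, d)`), the training targets `Y` (shape `(N,)`), and the test inputs `Test_X` (shape `(M, d)`) have already been loaded. As a minimal sketch with made-up toy data (the article itself uses the bike-share data introduced later):

```python
import numpy as np

# Hypothetical toy data, just so the snippets below run end to end
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(30, 2))            # training inputs, shape (N, d)
Y = np.sin(3 * X[:, 0]) + np.cos(3 * X[:, 1])   # training targets, shape (N,)

# Test inputs on a regular grid, shape (M, d)
g = np.linspace(-1, 1, 10)
Test_X = np.array(np.meshgrid(g, g)).reshape(2, -1).T
```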
Let's use this to calculate $K^{-1}$, $k_{new}$, and $k_{new,new}$. First, calculate $K^{-1}$.
```python
X_ = np.array([X for _ in range(len(X))])   # stack X so RBF sees every pair, shape (N, N, d)
K = RBF(X_, X_.transpose(1, 0, 2))          # Gram matrix K, shape (N, N)
inv_K = np.linalg.inv(K)                    # K^{-1}
```
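As an aside (not in the original article): explicitly inverting $K$ is fine for small $N$, but it is more numerically stable to solve a linear system instead, reusing the `K` and `Y` above:

```python
# Alternative to computing inv_K explicitly: solve the linear system K alpha = Y
alpha = np.linalg.solve(K, Y)   # same result as inv_K.dot(Y)
# the posterior mean below then becomes k.dot(alpha) instead of k.dot(inv_K).dot(Y)
```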
Then calculate $k_{new}$ and $k_{new,new}$.
```python
X__ = np.array([X for _ in range(len(Test_X))])                               # shape (M, N, d)
Test_X__ = np.array([Test_X for _ in range(len(X))]).transpose(1,0,2)         # shape (M, N, d)
Test_X__a = np.array([Test_X for _ in range(len(Test_X))]).transpose(1,0,2)   # shape (M, M, d)
k = RBF(X__, Test_X__)                            # k_new for every test point, shape (M, N)
k__ = RBF(Test_X__a, Test_X__a.transpose(1,0,2))  # k_new,new between test points, shape (M, M);
                                                  # the transpose keeps the pairwise distances from all being zero
```
Finally, to draw a sample of $y^{new}$ from

```math
p(y^{new}|x^{new},D)=N(k_{new}^TK^{-1}y,\ k_{new,new}-k_{new}^TK^{-1}k_{new})
```

you just call:

```python
# Posterior mean k K^{-1} y and posterior covariance k_new,new - k K^{-1} k^T
Y_result = np.random.multivariate_normal(k.dot(inv_K).dot(Y), k__ - k.dot(inv_K).dot(k.T))
```
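If you want point predictions with error bars rather than a random sample, a common choice (a minimal sketch, not from the original article) is the posterior mean together with the square root of the diagonal of the posterior covariance, reusing `k`, `inv_K`, `Y`, and `k__` from above:

```python
# Posterior mean and per-point standard deviation
mean = k.dot(inv_K).dot(Y)                       # shape (M,)
cov = k__ - k.dot(inv_K).dot(k.T)                # shape (M, M)
std = np.sqrt(np.clip(np.diag(cov), 0, None))    # clip tiny negative values from numerical error

for m, s in zip(mean[:5], std[:5]):
    print(f"prediction: {m:.3f} +/- {2 * s:.3f}")   # rough 95% interval
```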
Now let's make a prediction with New York's bike share data (https://www.citibikenyc.com/system-data). This time I visualized the number of bicycles returned between 8:00 am and 12:00 noon on June 30th.
The number of returns is log-normalized for readability.
Visualizing it with the Gaussian process gives the following.
The stars are the positions of the ports (docking stations). Values of $0$ and below are clipped to $0$ for readability.
The results show high demand in the center. The apparent high demand at positions without ports comes from the large variance there: various values get sampled at those locations. This illustrates a nice property of Gaussian processes: they can not only predict at locations with no data, but also give the variance of those predictions.
You can solve regression problems like this, so why not give it a try?