I am studying Bayesian theory. This time, I will summarize linear regression based on Bayesian theory. The books and articles that I referred to are as follows.
[Pattern Recognition and Machine Learning](https://www.amazon.co.jp/dp/4612061224)
[Bayesian Deep Learning](https://www.amazon.co.jp/dp/B07YSHL8MS)
[Bayesian linear regression](https://openbook4.me/sections/1563)
[PRML Questions 3.7, 3.8 Answers](https://qiita.com/halhorn/items/037db507e9884265d757)
The linear regression you usually come across while learning machine learning is different in concept from the Bayesian linear regression we deal with this time. An outline is shown below.
To distinguish linear regression based on the least squares method from Bayesian linear regression, I will call it frequentist linear regression here. *If this is not strictly the correct name, I would appreciate a comment.
This method can be carried out very easily, even in Excel, and I often use it when compiling data at work. The idea behind it is very simple.
E(w) = \frac {1}{2}\sum_{n=1}^{N}(y(x_n,w)-t_n)^2
where $y(x_n, w)$ is the model output, for example a polynomial in $x_n$ with coefficients $w$. All we have to do is find the coefficients $w$ that minimize this error function $E(w)$.
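For reference, here is a minimal sketch of this frequentist fit in Python, using `np.polyfit` on toy data (the data and polynomial degree are illustrative, not from the article):

```python
import numpy as np

# Toy data: noisy samples of sin(2*pi*x) (illustrative values only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 10)

# Least-squares fit of a degree-3 polynomial: the w that minimizes E(w)
w = np.polyfit(x, t, deg=3)
print(w)                   # fitted coefficients, highest degree first
print(np.polyval(w, 0.5))  # prediction at x = 0.5
```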
On the other hand, consider Bayesian linear regression. For the observed data $X, Y$, we consider the following probability.
p(w|X,Y) = \frac {p(w)p(Y|X,w)}{p(Y|X)}
Here, $p(w)$ is the prior distribution of the weights, $p(Y|X,w)$ is the likelihood of the observed data, and $p(w|X,Y)$ is the posterior distribution we want.
The key point of the Bayesian way of thinking is that everything is handled in terms of probabilities (= distributions). **Rather than seeking a single optimal solution, we try to capture it as a distribution that contains the optimal solution.**
Another point is the idea of Bayesian inference: using a distribution assumed in advance (= the prior probability) to obtain the probability we want (= the posterior probability). When I first heard this, it felt refreshing.
In this case, the prior over the weights $w$ is updated by the observed data to give the posterior. In this Bayesian way of thinking, it is also necessary to be aware of the time axis: the weight parameters are provisionally determined from some data and then updated to better values as new data arrives, as sketched below. This is a very convenient idea when starting from a small amount of data and gradually adding more.
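As a minimal illustration of this updating idea (a toy 1-D example, not part of the article's regression problem), the posterior over the mean of noisy observations can be refined one data point at a time:

```python
import numpy as np

# Sequential Bayesian updating of the mean mu of noisy observations,
# assuming a known noise precision beta and a Gaussian prior on mu.
beta = 4.0            # assumed known noise precision
m, s_inv = 0.0, 1.0   # prior mean and prior precision of mu

np.random.seed(0)
for t in np.random.normal(0.5, beta ** -0.5, size=10):
    # posterior precision and mean after observing one more point
    s_inv_new = s_inv + beta
    m = (s_inv * m + beta * t) / s_inv_new
    s_inv = s_inv_new
    print(f"posterior mean = {m:.3f}, sd = {s_inv ** -0.5:.3f}")
```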
Now, let us actually solve the regression problem with Bayesian linear regression. Writing the vector of observed target values as $t$, we consider the following for the posterior over $w$.
p(w|t) \propto p(t|w)p(w)
Strictly speaking, $p(t)$ should appear in the denominator, but since it complicates the calculation (and does not depend on $w$) we work with the proportionality $\propto$. Here, $p(t|w)$ and $p(w)$ are given as follows.
p(t|w)=\prod_{n=1}^N \mathcal{N} (t_n|w^T\phi(x_n), \beta^{-1})\\
p(w)=\mathcal{N}(w|m_0, S_0)
Both are formulated as Gaussian distributions.
The constants and functions appearing here are: $\beta$, the precision of the observation noise; $\phi(x)$, the vector of basis functions; and $m_0$ and $S_0$, the mean and covariance of the prior.
p(t|w)p(w) = \prod_{n=1}^N \mathcal{N} (t_n|w^T\phi(x_n), \beta^{-1}) \, \mathcal{N}(w|m_0, S_0)\\
\propto \left( \prod_{n=1}^N \exp\left[ -\frac{\beta}{2} \left\{ t_n - w^T \phi(x_n) \right\}^2 \right] \right) \exp\left[ -\frac{1}{2} (w - m_0)^TS_0^{-1}(w - m_0) \right]\\
= \exp\left[ -\frac{1}{2} \left\{ \beta \sum_{n=1}^N \left( t_n - w^T\phi(x_n) \right)^2 + (w - m_0)^TS_0^{-1}(w - m_0) \right\} \right]
The posterior is therefore proportional to this expression. Next, we carefully expand the term inside the exponential.
\beta \sum_{n=1}^N \left( t_n - w^T\phi(x_n) \right)^2 + (w - m_0)^TS_0^{-1}(w - m_0)\\
= \beta \left( \begin{array}{c}
t_1 - w^T\phi(x_1) \\
\vdots\\
t_N - w^T\phi(x_N)
\end{array} \right)^T
\left( \begin{array}{c}
t_1 - w^T\phi(x_1) \\
\vdots\\
t_N - w^T\phi(x_N)
\end{array} \right)
+ (w - m_0)^TS_0^{-1}(w - m_0)
Since $w^T\phi(x_n)$ is a scalar, it can be rewritten as $w^T\phi(x_1) = \phi^T(x_1)w$. Therefore,
= \beta \left( \begin{array}{c}
t_1 - \phi^T(x_1)w \\
\vdots\\
t_N - \phi^T(x_N)w
\end{array} \right)^T
\left( \begin{array}{c}
t_1 - \phi^T(x_1)w \\
\vdots\\
t_N - \phi^T(x_N)w
\end{array} \right)
+ (w - m_0)^TS_0^{-1}(w - m_0)\\
Now, we collect the basis function vectors into what is called the design matrix:
\Phi = \left( \begin{array}{c}
\phi^T(x_1) \\
\vdots\\
\phi^T(x_N)
\end{array} \right)
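For instance, with a simple polynomial basis $\phi(x) = (1, x, x^2)^T$ (just an illustration), the design matrix for $N$ data points is the $N \times 3$ matrix

\Phi = \left( \begin{array}{ccc}
1 & x_1 & x_1^2 \\
\vdots & \vdots & \vdots \\
1 & x_N & x_N^2
\end{array} \right)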
Using this design matrix, we can simplify further. Terms containing only $\beta$, $t$, and $m_0$ (i.e. not containing $w$) are constants, and we collect them into $C$.
= \beta (t - \Phi w)^T(t - \Phi w) + (w - m_0)^TS_0^{-1}(w - m_0)\\
= \beta ( w^T\Phi^T\Phi w - w^T\Phi^Tt - t^T\Phi w ) + w^TS_0^{-1}w - w^TS_0^{-1}m_0 - m_0^TS_0^{-1}w + C\\
Since $S_0$ is a covariance matrix, it is symmetric, and therefore its inverse is also symmetric: $(S_0^{-1})^T = S_0^{-1}$. Applying this and collecting the terms by powers of $w$, we get
= w^T(S_0^{-1} + \beta \Phi^T\Phi)w - w^T(S_0^{-1}m_0 + \beta \Phi^T t) - (S_0^{-1}m_0 + \beta \Phi^T t)^Tw + C
Here, defining $S_N^{-1} = S_0^{-1} + \beta \Phi^T\Phi$ and $R = S_0^{-1} m_0 + \beta \Phi^T t$, this becomes
= w^TS_N^{-1}w - w^TR - R^Tw + C
We have now collected the terms. From here we could complete the square and factor directly, but that is a rather tedious calculation. Instead, taking the contrarian route, we expand the already-completed Gaussian form and confirm that it matches.
\mathcal{N}(w|m_N, S_N)\\
\propto \exp\left\{ -\frac{1}{2} (w - m_N)^T S_N^{-1} (w - m_N) \right\}
As before, we look only at the quantity multiplying $-\frac{1}{2}$ inside the $\exp$ of the Gaussian.
(w - m_N)^T S_N^{-1} (w - m_N)\\
Here, $m_N$ and $S_N$ are given as follows.
m_N = S_N(S_0^{-1} m_0 + \beta \Phi^Tt) = S_NR\\
S_N^{-1} = S_0^{-1} + \beta \Phi^T \Phi
Since $S_N^{-1}$ already appears in the target expression derived from the posterior, we leave it as is and expand only $m_N$.
(w - m_N)^T S_N^{-1} (w - m_N)\\
= (w - S_NR)^T S_N^{-1} (w - S_NR)\\
= w^T S_N^{-1} w - w^T R - R^Tw + C
The expansion matches the expression we derived from the posterior. Therefore, the posterior distribution is $p(w|t) = \mathcal{N}(w|m_N, S_N)$.
Now let us regress data points randomly generated from a $\sin$ function.
beyes.ipynb
import numpy as np
import matplotlib.pyplot as plt

# Make the design matrix (N x M) from the observed inputs X (generated in the next cell);
# phi(x) returns the basis-function vector of x (see the sketch below)
Phi = np.array([phi(x) for x in X])
# Hyperparameters: prior precision alpha, noise precision beta, number of basis functions M
alpha = 0.1
beta = 9.0
M = 12
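The definition of the basis-function vector `phi` is not shown in these snippets (see the full program linked at the end). As a placeholder assumption, a simple polynomial basis with `M` terms could look like this (it reuses `np` and `M` from the cell above):

```python
# Hypothetical basis function (an assumption; the original may use a different basis,
# e.g. Gaussian): polynomial features x^0, x^1, ..., x^(M-1).
def phi(x):
    return np.array([x ** j for j in range(M)])
```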
beyes.ipynb
# Generate n noisy observations of sin(2*pi*x)
n = 10
X = np.random.uniform(0, 1, n)
T = np.sin(2 * np.pi * X) + np.random.normal(0, 0.1, n)
# Plot the observations together with the true function
plt.scatter(X, T)
plt.plot(np.linspace(0, 1), np.sin(2 * np.pi * np.linspace(0, 1)), c="g")
plt.show()
To simplify the calculation, we set $m_0 = 0$ and $S_0 = \alpha^{-1} I$, so that $S_N^{-1} = \alpha I + \beta \Phi^T \Phi$ and $m_N = \beta S_N \Phi^T t$.
beyes.ipynb
# Posterior covariance S_N = (alpha*I + beta*Phi^T Phi)^(-1)
S = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T.dot(Phi))
# Posterior mean m_N = beta * S_N * Phi^T * t
m = beta * S.dot(Phi.T).dot(T)
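The function `norm(x, y)` used in the next cell to draw the density heat map is also not shown above. Here is a sketch, assuming it evaluates the standard predictive distribution of Bayesian linear regression, $p(y|x) = \mathcal{N}(y \mid m_N^T\phi(x),\ \beta^{-1} + \phi(x)^T S_N \phi(x))$:

```python
from scipy.stats import norm as gaussian

# Assumed implementation of norm(x, y): the predictive density p(y | x)
# with mean m_N^T phi(x) and variance 1/beta + phi(x)^T S_N phi(x).
def norm(x, y):
    mean = m.dot(phi(x))
    var = 1.0 / beta + phi(x).dot(S).dot(phi(x))
    return gaussian.pdf(y, loc=mean, scale=np.sqrt(var))
```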
beyes.ipynb
# Evaluate the predictive density on a grid for the heat map
x_, y_ = np.meshgrid(np.linspace(0, 1), np.linspace(-1.5, 1.5))
Z = np.vectorize(norm)(x_, y_)

# Mean of the predictive distribution
x = np.linspace(0, 1)
y = [m.dot(phi(x__)) for x__ in x]

plt.figure(figsize=(10, 6))
plt.pcolor(x_, y_, Z, alpha=0.2)
plt.colorbar()
plt.scatter(X, T)
# Mean of the predictive distribution
plt.plot(x, y)
# True function
plt.plot(np.linspace(0, 1), np.sin(2 * np.pi * np.linspace(0, 1)), c="g")
plt.show()

# Sample parameter vectors w from the posterior and plot the corresponding curves
m_list = [np.random.multivariate_normal(m, S) for i in range(5)]
for m_ in m_list:
    x = np.linspace(0, 1)
    y = [m_.dot(phi(x__)) for x__ in x]
    plt.plot(x, y, c="r")
plt.plot(np.linspace(0, 1), np.sin(2 * np.pi * np.linspace(0, 1)), c="g")
plt.show()
The shading indicates regions where the predictive probability density is high.
In this post, I summarized Bayesian linear regression from the underlying idea to the implementation. The formula manipulation was involved enough that I almost got lost, but the idea itself is very simple, and how to obtain plausible probabilities from a small amount of data is an important theme in machine learning.
I will continue studying to gain a deeper understanding of the Bayesian way of thinking.
The full program is here. https://github.com/Fumio-eisan/Beyes_20200512