I am studying Bayesian theory. This time, I will summarize linear regression based on Bayesian theory. The books and articles that I referred to are as follows.
[Pattern Recognition and Machine Learning](https://www.amazon.co.jp/dp/4612061224)
[Bayesian Deep Learning](https://www.amazon.co.jp/dp/B07YSHL8MS)
[Bayesian linear regression](https://openbook4.me/sections/1563)
[PRML Questions 3.7, 3.8 Answers](https://qiita.com/halhorn/items/037db507e9884265d757)
The linear regression you usually come across while learning machine learning is different in concept from the Bayesian linear regression we deal with this time. An outline is shown below.
To distinguish linear regression based on the least squares method from Bayesian linear regression, I will call it frequentist linear regression here. *If this is not strictly the correct name, I would appreciate a comment.
This method can be carried out very easily, even in Excel, and I often use it when compiling data at work. The idea behind it is very simple.
E(w) = \frac {1}{2}\sum_{n=1}^{N}(y(x_n,w)-t_n)^2
where $y(x_n, w)$ is the model output, for example a polynomial in $x_n$ with coefficients $w$. All we have to do is find the coefficients $w$ that minimize this error function $E(w)$.
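For reference, here is a minimal sketch of this frequentist fit in Python, using `np.polyfit` on toy data (the data and polynomial degree are illustrative, not from the article):

```python
import numpy as np

# Toy data: noisy samples of sin(2*pi*x) (illustrative values only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 10)

# Least-squares fit of a degree-3 polynomial: the w that minimizes E(w)
w = np.polyfit(x, t, deg=3)
print(w)                   # fitted coefficients, highest degree first
print(np.polyval(w, 0.5))  # prediction at x = 0.5
```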
On the other hand, consider Bayesian linear regression. For the observed data $X, Y$, we consider the following probability.
p(w|X,Y) = \frac {p(w)p(Y|X,w)}{p(Y|X)}
Here, $p(w)$ is the prior distribution of the weights, $p(Y|X,w)$ is the likelihood of the observed data, and $p(w|X,Y)$ is the posterior distribution we want.
The key point of the Bayesian way of thinking is that everything is handled in terms of probabilities (= distributions). **Rather than seeking a single optimal solution, we try to capture it as a distribution that contains the optimal solution.**
Another point is the idea of Bayesian inference: using a distribution assumed in advance (= the prior probability) to obtain the probability we want (= the posterior probability). When I first heard this, it felt refreshing.
In this case, the prior over the weights $w$ is updated by the observed data to give the posterior. In this Bayesian way of thinking, it is also necessary to be aware of the time axis: the weight parameters are provisionally determined from some data and then updated to better values as new data arrives, as sketched below. This is a very convenient idea when starting from a small amount of data and gradually adding more.
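As a minimal illustration of this updating idea (a toy 1-D example, not part of the article's regression problem), the posterior over the mean of noisy observations can be refined one data point at a time:

```python
import numpy as np

# Sequential Bayesian updating of the mean mu of noisy observations,
# assuming a known noise precision beta and a Gaussian prior on mu.
beta = 4.0            # assumed known noise precision
m, s_inv = 0.0, 1.0   # prior mean and prior precision of mu

np.random.seed(0)
for t in np.random.normal(0.5, beta ** -0.5, size=10):
    # posterior precision and mean after observing one more point
    s_inv_new = s_inv + beta
    m = (s_inv * m + beta * t) / s_inv_new
    s_inv = s_inv_new
    print(f"posterior mean = {m:.3f}, sd = {s_inv ** -0.5:.3f}")
```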
Now, let us actually solve the regression problem with Bayesian linear regression. Writing the vector of observed target values as $t$, we consider the following for the posterior over $w$.
p(w|t) \propto p(t|w)p(w)
Strictly speaking, $p(t)$ should appear in the denominator, but since it complicates the calculation (and does not depend on $w$) we work with the proportionality $\propto$. Here, $p(t|w)$ and $p(w)$ are given as follows.
p(t|w)=\prod_{n=1}^N \mathcal{N} (t_n|w^T\phi(x_n), \beta^{-1})\\
p(w)=\mathcal{N}(w|m_0, S_0)
Both are formulated as Gaussian distributions.
The constants and functions appearing here are: $\beta$, the precision of the observation noise; $\phi(x)$, the vector of basis functions; and $m_0$ and $S_0$, the mean and covariance of the prior.
p(t|w)p(w) = \prod_{n=1}^N \mathcal{N} (t_n|w^T\phi(x_n), \beta^{-1}) \, \mathcal{N}(w|m_0, S_0)\\
\propto \left( \prod_{n=1}^N \exp\left[ -\frac{\beta}{2} \left\{ t_n - w^T \phi(x_n) \right\}^2 \right] \right) \exp\left[ -\frac{1}{2} (w - m_0)^TS_0^{-1}(w - m_0) \right]\\
= \exp\left[ -\frac{1}{2} \left\{ \beta \sum_{n=1}^N \left( t_n - w^T\phi(x_n) \right)^2 + (w - m_0)^TS_0^{-1}(w - m_0) \right\} \right]
The posterior is therefore proportional to this expression. Next, we carefully expand the term inside the exponential.
\beta \sum_{n=1}^N \left( t_n - w^T\phi(x_n) \right)^2 + (w - m_0)^TS_0^{-1}(w - m_0)\\
= \beta \left( \begin{array}{c}
t_1 - w^T\phi(x_1) \\
\vdots\\
t_N - w^T\phi(x_N)
\end{array} \right)^T
\left( \begin{array}{c}
t_1 - w^T\phi(x_1) \\
\vdots\\
t_N - w^T\phi(x_N)
\end{array} \right)
+ (w - m_0)^TS_0^{-1}(w - m_0)
Since $w^T\phi(x_n)$ is a scalar, it can be rewritten as $w^T\phi(x_1) = \phi^T(x_1)w$. Therefore,
= \beta \left( \begin{array}{c}
t_1 - \phi^T(x_1)w \\
\vdots\\
t_N - \phi^T(x_N)w
\end{array} \right)^T
\left( \begin{array}{c}
t_1 - \phi^T(x_1)w \\
\vdots\\
t_N - \phi^T(x_N)w
\end{array} \right)
+ (w - m_0)^TS_0^{-1}(w - m_0)\\
Now, we collect the basis function vectors into what is called the design matrix:
\Phi = \left( \begin{array}{c}
\phi^T(x_1) \\
\vdots\\
\phi^T(x_N)
\end{array} \right)
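For instance, with a simple polynomial basis $\phi(x) = (1, x, x^2)^T$ (just an illustration), the design matrix for $N$ data points is the $N \times 3$ matrix

\Phi = \left( \begin{array}{ccc}
1 & x_1 & x_1^2 \\
\vdots & \vdots & \vdots \\
1 & x_N & x_N^2
\end{array} \right)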
Using this design matrix, we can simplify further. Terms containing only $\beta$, $t$, and $m_0$ (i.e. not containing $w$) are constants, and we collect them into $C$.
= \beta (t - \Phi w)^T(t - \Phi w) + (w - m_0)^TS_0^{-1}(w - m_0)\\
= \beta ( w^T\Phi^T\Phi w - w^T\Phi^Tt - t^T\Phi w ) + w^TS_0^{-1}w - w^TS_0^{-1}m_0 - m_0^TS_0^{-1}w + C\\
Since $S_0$ is a covariance matrix, it is symmetric, and therefore its inverse is also symmetric: $(S_0^{-1})^T = S_0^{-1}$. Applying this and collecting the terms by powers of $w$, we get
= w^T(S_0^{-1} + \beta \Phi^T\Phi)w - w^T(S_0^{-1}m_0 + \beta \Phi^T t) - (S_0^{-1}m_0 + \beta \Phi^T t)^Tw + C
Here, defining $S_N^{-1} = S_0^{-1} + \beta \Phi^T\Phi$ and $R = S_0^{-1} m_0 + \beta \Phi^T t$, this becomes
= w^TS_N^{-1}w - w^TR - R^Tw + C
We have now collected the terms. From here we could complete the square and factor directly, but that is a rather tedious calculation. Instead, taking the contrarian route, we expand the already-completed Gaussian form and confirm that it matches.
\mathcal{N}(w|m_N, S_N)\\
\propto \exp\left\{ -\frac{1}{2} (w - m_N)^T S_N^{-1} (w - m_N) \right\}
As before, we look only at the quantity multiplying $-\frac{1}{2}$ inside the $\exp$ of the Gaussian.
(w - m_N)^T S_N^{-1} (w - m_N)\\
Here, $m_N$ and $S_N$ are given as follows.
m_N = S_N(S_0^{-1} m_0 + \beta \Phi^Tt) = S_NR\\
S_N^{-1} = S_0^{-1} + \beta \Phi^T \Phi
Since $S_N^{-1}$ already appears in the target expression derived from the posterior, we leave it as is and expand only $m_N$.
(w - m_N)^T S_N^{-1} (w - m_N)\\
= (w - S_NR)^T S_N^{-1} (w - S_NR)\\
= w^T S_N^{-1} w - w^T R - R^Tw + C
The expansion matches the expression we derived from the posterior. Therefore, the posterior distribution is $p(w|t) = \mathcal{N}(w|m_N, S_N)$.
Now let us regress data points randomly generated from a $\sin$ function.
beyes.ipynb
import numpy as np
import matplotlib.pyplot as plt

# Make the design matrix (N x M) from the observed inputs X (generated in the next cell);
# phi(x) returns the basis-function vector of x (see the sketch below)
Phi = np.array([phi(x) for x in X])
# Hyperparameters: prior precision alpha, noise precision beta, number of basis functions M
alpha = 0.1
beta = 9.0
M = 12
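The definition of the basis-function vector `phi` is not shown in these snippets (see the full program linked at the end). As a placeholder assumption, a simple polynomial basis with `M` terms could look like this (it reuses `np` and `M` from the cell above):

```python
# Hypothetical basis function (an assumption; the original may use a different basis,
# e.g. Gaussian): polynomial features x^0, x^1, ..., x^(M-1).
def phi(x):
    return np.array([x ** j for j in range(M)])
```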
beyes.ipynb
# Generate n noisy observations of sin(2*pi*x)
n = 10
X = np.random.uniform(0, 1, n)
T = np.sin(2 * np.pi * X) + np.random.normal(0, 0.1, n)
# Plot the observations together with the true function
plt.scatter(X, T)
plt.plot(np.linspace(0, 1), np.sin(2 * np.pi * np.linspace(0, 1)), c="g")
plt.show()
To simplify the calculation, we set $m_0 = 0$ and $S_0 = \alpha^{-1} I$, so that $S_N^{-1} = \alpha I + \beta \Phi^T \Phi$ and $m_N = \beta S_N \Phi^T t$.
beyes.ipynb
# Posterior covariance S_N = (alpha*I + beta*Phi^T Phi)^(-1)
S = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T.dot(Phi))
# Posterior mean m_N = beta * S_N * Phi^T * t
m = beta * S.dot(Phi.T).dot(T)
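The function `norm(x, y)` used in the next cell to draw the density heat map is also not shown above. Here is a sketch, assuming it evaluates the standard predictive distribution of Bayesian linear regression, $p(y|x) = \mathcal{N}(y \mid m_N^T\phi(x),\ \beta^{-1} + \phi(x)^T S_N \phi(x))$:

```python
from scipy.stats import norm as gaussian

# Assumed implementation of norm(x, y): the predictive density p(y | x)
# with mean m_N^T phi(x) and variance 1/beta + phi(x)^T S_N phi(x).
def norm(x, y):
    mean = m.dot(phi(x))
    var = 1.0 / beta + phi(x).dot(S).dot(phi(x))
    return gaussian.pdf(y, loc=mean, scale=np.sqrt(var))
```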
beyes.ipynb
# Evaluate the predictive density on a grid for the heat map
x_, y_ = np.meshgrid(np.linspace(0, 1), np.linspace(-1.5, 1.5))
Z = np.vectorize(norm)(x_, y_)

# Mean of the predictive distribution
x = np.linspace(0, 1)
y = [m.dot(phi(x__)) for x__ in x]

plt.figure(figsize=(10, 6))
plt.pcolor(x_, y_, Z, alpha=0.2)
plt.colorbar()
plt.scatter(X, T)
# Mean of the predictive distribution
plt.plot(x, y)
# True function
plt.plot(np.linspace(0, 1), np.sin(2 * np.pi * np.linspace(0, 1)), c="g")
plt.show()

# Sample parameter vectors w from the posterior and plot the corresponding curves
m_list = [np.random.multivariate_normal(m, S) for i in range(5)]
for m_ in m_list:
    x = np.linspace(0, 1)
    y = [m_.dot(phi(x__)) for x__ in x]
    plt.plot(x, y, c="r")
plt.plot(np.linspace(0, 1), np.sin(2 * np.pi * np.linspace(0, 1)), c="g")
plt.show()
The shading indicates regions where the predictive probability density is high.
In this post, I summarized Bayesian linear regression from the underlying idea to the implementation. The formula manipulation was involved enough that I almost got lost, but the idea itself is very simple, and how to obtain plausible probabilities from a small amount of data is an important theme in machine learning.
I will continue studying to gain a deeper understanding of the Bayesian way of thinking.
The full program is here. https://github.com/Fumio-eisan/Beyes_20200512