Linear regression based on Bayesian estimation (least squares estimation, maximum likelihood estimation, MAP estimation, Bayesian estimation)

Background

The purpose of a regression problem is to predict the target value for a new observation, given a training data set consisting of $N$ observations and their corresponding target values. The linear regression model treated here is the simplest such model, exploiting the fact that a polynomial is a linear combination of adjustable parameters. A useful class of functions is obtained by taking linear combinations of a fixed set of nonlinear basis functions of the input variables.

For the observed data $D = \{(x_i, y_i);\ i = 1,2,\dots,n\}$, define the regression model as the following linear combination of basis functions, where $\Phi$ is the design matrix of basis functions evaluated at $x$, $w$ is the parameter vector, and $\epsilon$ is the error term.

y = \Phi w + \epsilon
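As a concrete illustration, here is a minimal NumPy sketch of this model, assuming the degree-9 polynomial basis used in the experiment section below; the data and variable names are made up for this example only.

```python
import numpy as np

def design_matrix(x, degree=9):
    """Design matrix Phi: the j-th column holds the basis function x**j."""
    x = np.asarray(x, dtype=float)
    return np.vstack([x**j for j in range(degree + 1)]).T   # shape (n, degree+1)

# The model y = Phi w + epsilon, generated here with Gaussian noise for illustration.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 15)
Phi = design_matrix(x)
w_true = rng.normal(size=Phi.shape[1])                   # some "true" parameters (assumed)
y = Phi @ w_true + rng.normal(scale=0.1, size=x.size)    # epsilon ~ N(0, 0.1^2)
```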

About Bayesian linear regression

Least squares estimation

Least squares estimation finds the $\hat{w}$ that minimizes the sum of squared prediction errors $S(w)$ of the regression model. Partially differentiate $S(w)$ with respect to $w$ to find $\hat{w}$.

S(w)=\epsilon^{T}\epsilon=(y-\Phi w)^T(y-\Phi w)
\frac{dS(w)}{dw}=-2\Phi^{T}y+2\Phi^T\Phi w

Setting $\frac{dS(w)}{dw} = 0$ and solving for $w$ gives

\hat{w}=(\Phi^T\Phi)^{-1}\Phi^{T}y

Therefore, the prediction model by least squares estimation is as follows.

\hat{y}=\Phi\hat{w}=\Phi(\Phi^T\Phi)^{-1}\Phi^{T}y
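A minimal sketch of the least squares solution, assuming toy data (a noisy sine, chosen only for illustration) and the degree-9 polynomial basis; `np.linalg.lstsq` solves the normal equations in a numerically stable way instead of forming the inverse explicitly.

```python
import numpy as np

# Toy data and a degree-9 polynomial design matrix (assumed for illustration).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 15)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)
Phi = np.vstack([x**j for j in range(10)]).T

# w_hat = (Phi^T Phi)^{-1} Phi^T y, computed with a stable least squares solver.
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_hat = Phi @ w_hat   # predictions on the training inputs
```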

Maximum likelihood estimation

Maximum likelihood estimation finds the $\hat{w}$ that maximizes the likelihood $P(y \mid w)$. Consider a model that assumes a normal distribution for the error term. The observed value $y$ then follows an $n$-dimensional normal distribution with mean $\Phi w$ and covariance matrix $\sigma^2 I_n$, so the likelihood is given as follows.

y= \Phi w+ \epsilon,\quad \epsilon \sim \mathcal{N}(0,\sigma^2I_n)
P(y\mid w,\sigma^2)=\mathcal{N}(\Phi w,\sigma^2I_n)
=\frac{1}{(2\pi\sigma^2)^{\frac{n}{2}}}\exp\left\{-\frac{1}{2\sigma^2}(y-\Phi w)^T(y-\Phi w)\right\}

Partially differentiate the logarithm of $P(y \mid w)$ with respect to $w$ to find $\hat{w}$.

\log P(y\mid w) = -\frac{n}{2}\log(2\pi\sigma^2)-\frac{(y-\Phi w)^T(y-\Phi w)}{2\sigma^2}
\frac{d\log P(y\mid w)}{dw}=\frac{1}{\sigma^2}(\Phi^{T}y-\Phi^{T}\Phi w)

Setting $\frac{d\log P(y\mid w)}{dw} = 0$ gives

\hat{w}=(\Phi^T\Phi)^{-1}\Phi^{T}y

Therefore, the prediction model based on maximum likelihood estimation is as follows.

\hat{y}=\Phi\hat{w}=\Phi(\Phi^T\Phi)^{-1}\Phi^{T}y

This is the same as the prediction model obtained by the least squares method.
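As a sanity check, the closed-form solution can be compared with a direct numerical maximization of the Gaussian log-likelihood. The following is a minimal sketch assuming SciPy, toy data, and a low-degree polynomial basis (chosen to keep the problem well conditioned); the fitted values should agree up to optimizer tolerance.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data and a degree-3 polynomial basis (assumed; low degree keeps Phi well conditioned).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 15)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)
Phi = np.vstack([x**j for j in range(4)]).T

def neg_log_likelihood(w, Phi, y, sigma2=0.04):
    """Negative Gaussian log-likelihood; for any fixed sigma2 its minimizer in w is the same."""
    r = y - Phi @ w
    return 0.5 * y.size * np.log(2 * np.pi * sigma2) + 0.5 * r @ r / sigma2

w_mle = minimize(neg_log_likelihood, np.zeros(Phi.shape[1]), args=(Phi, y)).x
w_ls, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(np.allclose(Phi @ w_mle, Phi @ w_ls, atol=1e-3))  # fitted values should coincide
```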

MAP estimation

Methods based on least squares and maximum likelihood estimation tend to overfit when the number of model parameters is large and the number of observations is small. When overfitting occurs, generalization performance suffers, so preventing it is important. MAP estimation treats $w$ as a random variable. Introduce the prior distribution of $w$ and the likelihood function of the observed data as follows, where $\alpha$ and $\beta$ are hyperparameters.

P(w;\alpha)=\mathcal{N}(w\mid0,\alpha^{-1}I_n)
P(y \mid w;\beta)=\mathcal{N}(\Phi w,\beta^{-1}I_n)

MAP estimation finds the $\hat{w}$ that maximizes the posterior distribution $P(w \mid y)$ of $w$. From Bayes' theorem,

P(w \mid y)=\frac{P(y \mid w)P(w)}{P(y)}

Here, $P(y \mid w)$ represents the likelihood and $P(w)$ represents the prior probability.

P(w \mid y)=\frac{\frac{1}{(2\pi \beta^{-1})^{\frac{n}{2}}}\exp\left\{-\frac{\beta}{2}(y-\Phi w)^T(y-\Phi w)\right\}\frac{1}{(2\pi \alpha^{-1})^{\frac{n}{2}}}\exp\left\{-\frac{\alpha}{2}w^{T}w\right\}}{P(y)}
=\frac{\frac{1}{(2\pi)^{n}(\alpha\beta)^{-\frac{n}{2}}}\exp\left\{-\frac{\beta}{2}(y-\Phi w)^T(y-\Phi w)-\frac{\alpha}{2}w^{T}w\right\}}{P(y)}

The $\hat{w}$ that maximizes $P(w \mid y)$ is equal to the $\hat{w}$ that maximizes $Z=-\frac{\beta}{2}(y-\Phi w)^T(y-\Phi w)-\frac{\alpha}{2}w^{T}w$.

\frac{dZ}{dw}=\beta(\Phi^{T}y-\Phi^T\Phi w)-\alpha w

Setting $\frac{dZ}{dw} = 0$ and solving for $w$ gives

\hat{w}=(\frac{\alpha}{\beta}I_n+\Phi^T\Phi)^{-1}\Phi^{T}y

Therefore, the prediction model by MAP estimation is as follows.

\hat{y}=\Phi\hat{w}=\Phi(\frac{\alpha}{\beta}I_n+\Phi^T\Phi)^{-1}\Phi^{T}y
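A minimal sketch of the MAP (ridge-like) estimator under the same toy-data assumptions as above, using the hyperparameter values $\alpha = \beta = 10$ from the experiment section.

```python
import numpy as np

# Toy data and a degree-9 polynomial basis (assumed for illustration).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 15)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)
Phi = np.vstack([x**j for j in range(10)]).T

def fit_map(Phi, y, alpha=10.0, beta=10.0):
    """MAP estimate: w_hat = (Phi^T Phi + (alpha/beta) I)^{-1} Phi^T y."""
    A = Phi.T @ Phi + (alpha / beta) * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ y)

w_map = fit_map(Phi, y)
y_map = Phi @ w_map   # MAP predictions on the training inputs
```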

Bayesian estimation

In the least squares method, maximum likelihood estimation, and MAP estimation, the parameter estimate was obtained as a single point, which does not allow parameter uncertainty to be taken into account when predicting data. By treating the posterior distribution itself as a probability distribution, we obtain a predictive distribution that accounts for the uncertainty of the parameter estimates. In MAP estimation we only maximized $P(w \mid y)$, so the evidence $P(y)$ could be ignored, but in Bayesian estimation it must be considered. Marginalizing the joint probability $P(y, w)$ over $w$ gives the marginal probability $P(y)$.

P(y)=\int P(y\mid w)P(w)dw
P(w \mid y) = \frac{P(y \mid w)P(w)}{\int P(y\mid w)P(w)dw}

Here, from Bayes' theorem for Gaussian distributions, the posterior is also Gaussian:

P(w \mid y) = \mathcal{N}(w\mid\mu_N,\Sigma_N)

The mean and covariance are calculated using the results for marginal and conditional Gaussian distributions given in Reference 1 (p. 90).

\mu_N=(\frac{\alpha}{\beta}I_n+\Phi^{T}\Phi)^{-1}\Phi^{T}y
\Sigma_{N}=(\alpha I_n + \beta \Phi^{T}\Phi)^{-1}

Therefore, from the above, the prediction model by Bayesian estimation is as follows.

\hat{y}=\Phi\mu_N=\Phi(\frac{\alpha}{\beta}I_n+\Phi^T\Phi)^{-1}\Phi^{T}y
\Sigma_{N}=(\alpha I_n + \beta \Phi^{T}\Phi)^{-1}
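A minimal sketch of the Bayesian posterior and its predictive distribution under the same toy-data assumptions. The predictive variance formula $\beta^{-1} + \phi(x)^T \Sigma_N \phi(x)$ used below is the standard result for this model (it is not written out explicitly above).

```python
import numpy as np

# Toy data and a degree-9 polynomial basis (assumed for illustration).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 15)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)
Phi = np.vstack([x**j for j in range(10)]).T

def posterior(Phi, y, alpha=10.0, beta=10.0):
    """Posterior N(mu_N, Sigma_N) over w for the Gaussian prior and likelihood."""
    Sigma_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
    mu_N = beta * Sigma_N @ Phi.T @ y    # same point as the MAP estimate
    return mu_N, Sigma_N

def predictive(Phi_new, mu_N, Sigma_N, beta=10.0):
    """Predictive mean and variance at new design rows Phi_new."""
    mean = Phi_new @ mu_N
    var = 1.0 / beta + np.einsum("ij,jk,ik->i", Phi_new, Sigma_N, Phi_new)
    return mean, var

mu_N, Sigma_N = posterior(Phi, y)
x_test = np.linspace(0, 1, 100)
Phi_test = np.vstack([x_test**j for j in range(10)]).T
mean, var = predictive(Phi_test, mu_N, Sigma_N)
```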

Experiment

Prediction models were created by least squares estimation, maximum likelihood estimation, MAP estimation, and Bayesian estimation for the training data $D_{\mathrm{train}} = \{(x^{\mathrm{train}}_i, y^{\mathrm{train}}_i);\ i = 1,2,\dots,15\}$. For each model, the predictive distribution was checked on the test data $D_{\mathrm{test}} = \{(x^{\mathrm{test}}_i, y^{\mathrm{test}}_i);\ i = 1,2,\dots,100\}$. The code is here.

The following basis functions are used.

f_j(x)=x^{j},\quad j=0,1,\dots,9

The hyperparameters are set to $\alpha = 10,\ \beta = 10$.

- Least squares estimation (figure: 1.png)
- MAP estimation (figure: 3.png)
- Bayesian estimation (figure: 4.png)

The point estimates are evaluated with the coefficient of determination $R^2$, which expresses the goodness of fit of the regression model, i.e. how well the values predicted by the model match the actual values. With the actual data $(x_i, y_i)$ and the values $\hat{y}_i$ predicted from the regression equation, it is calculated as $R^2 = 1-\frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2}$, where $\bar{y}$ is the mean of the actual values. The closer the value is to 1, in the range of 0 to 1, the better the fit. (Figure: fig8.png)
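A minimal sketch of computing $R^2$ directly from this formula (it should match `sklearn.metrics.r2_score`):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)              # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)     # total sum of squares
    return 1.0 - ss_res / ss_tot
```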

Next, the variance $\beta^{-1}$ of the likelihood function is changed by varying $\beta$.

- Bayesian estimation with $\beta = 50$ (figure: 50.png)
- Bayesian estimation with $\beta = 100$ (figure: 100.png)
- Bayesian estimation with $\beta = 1000$ (figure: 1000.png)

Summary

- The least squares method, maximum likelihood estimation, MAP estimation, and Bayesian estimation were applied to a regression problem. Prediction accuracy was found to drop in regions where no training data exist.
- Among the point estimates (least squares, maximum likelihood, and MAP estimation), MAP estimation produced fewer outliers and the better model in this case.
- For Bayesian and MAP estimation, the smaller the variance of the likelihood function, the smaller the error; however, if the variance is made too small, overfitting occurs and generalization performance decreases.

References

- C.M. Bishop, "Pattern Recognition and Machine Learning: Statistical Prediction by Bayesian Theory", Springer Japan
- Atsushi Suyama, "Introduction to Machine Learning by Bayesian Inference", Kodansha Scientific
