This time, I implemented the evidence approximation that was briefly mentioned at the end of First PRML implementation. This method is also known as empirical Bayes or type-2 maximum likelihood estimation. When using machine learning algorithms, we often have to set hyperparameters. With this method, hyperparameters that until now had to be chosen somehow by hand are determined automatically from the data.
Before we get into the explanation of evidence approximation, let's explain Bayesian linear regression.
Given the training data $\{x_i, t_i\}_{i=1}^N$, we prepare a feature vector ${\bf\phi}(x)$ (for example, ${\bf\phi}(x) = (1, x, x^2, x^3)^{\rm T}$) and define
{\bf t} =
\begin{bmatrix}
t_1\\
t_2\\
\vdots\\
t_N
\end{bmatrix}\\
{\bf\Phi} =
\begin{bmatrix}
\phi_1(x_1) & \phi_2(x_1) & \cdots & \phi_M(x_1)\\
\phi_1(x_2) & \phi_2(x_2) & \cdots & \phi_M(x_2)\\
\vdots & \vdots & \ddots & \vdots\\
\phi_1(x_N) & \phi_2(x_N) & \cdots & \phi_M(x_N)
\end{bmatrix}\\
{\bf w} =
\begin{bmatrix}
w_1\\
w_2\\
\vdots\\
w_M
\end{bmatrix}\\
With these definitions, maximum likelihood estimation maximizes the following likelihood function with respect to the parameter ${\bf w}$, where ${\bf I}$ is the $N\times N$ identity matrix and $\beta$ is the precision parameter.
p({\bf t}|{\bf\Phi},{\bf w},\beta) = \mathcal{N}({\bf t}|{\bf\Phi}{\bf w}, \beta^{-1}{\bf I})
Since maximum likelihood estimation may overfit the training data, we place a prior distribution on the parameter ${\bf w}$. Taking a Gaussian with mean ${\bf 0}$ and covariance $\alpha^{-1}{\bf I}$ as the prior and applying Bayes' theorem, we get
\begin{align}
p({\bf w}|{\bf\Phi},{\bf t}, \alpha, \beta) &= {\mathcal{N}({\bf t}|{\bf\Phi}{\bf w}, \beta^{-1}{\bf I}) \mathcal{N}({\bf w}|{\bf 0}, \alpha^{-1}{\bf I})\over p({\bf t}|{\bf\Phi}, \alpha, \beta)}\\
&= \mathcal{N}({\bf w}|{\bf m}_N, {\bf S}_N)
\end{align}
as the posterior distribution of the parameter ${\bf w}$, where
{\bf m}_N = \beta {\bf S}_N{\bf\Phi}^{\rm T}{\bf t}\\
{\bf S}_N^{-1} = \alpha{\bf I} + \beta{\bf\Phi}^{\rm T}{\bf\Phi}
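As a sanity check on these formulas: in the limit $\alpha \to 0$ (an infinitely broad prior), they give
{\bf m}_N \to \left({\bf\Phi}^{\rm T}{\bf\Phi}\right)^{-1}{\bf\Phi}^{\rm T}{\bf t}
which is the ordinary least-squares solution, while a large $\alpha$ shrinks ${\bf m}_N$ toward ${\bf 0}$.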
Using the posterior distribution obtained above, the posterior predictive distribution of $t$ for a new input $x$ is
p(t|x,{\bf\Phi},{\bf t},\alpha,\beta) = \int \mathcal{N}(t|{\bf w}^{\rm T}{\bf\phi}(x), \beta^{-1}) \mathcal{N}({\bf w}|{\bf m}_N, {\bf S}_N) {\rm d}{\bf w}\\
= \mathcal{N}(t|{\bf m}_N^{\rm T}{\bf\phi}(x), \sigma^2_N(x))
which is obtained by integrating out the parameter ${\bf w}$. The variance is given by
\sigma^2_N(x) = {1\over\beta} + \phi(x)^{\rm T}{\bf S}_N\phi(x)
With this posterior predictive distribution we can do Bayesian linear regression. Its behavior, however, depends on the hyperparameters $\alpha, \beta$: if the values we choose are too large or too small, the regression result may not be what we want.
Above, the hyperparameters $\alpha, \beta$ were fixed to values chosen by hand; these can instead be estimated from the data. When computing the posterior distribution of the parameter ${\bf w}$, we applied Bayes' theorem:
p({\bf w}|{\bf\Phi},{\bf t}, \alpha, \beta) = {\mathcal{N}({\bf t}|{\bf\Phi}{\bf w}, \beta^{-1}{\bf I}) \mathcal{N}({\bf w}|{\bf 0}, \alpha^{-1}{\bf I})\over p({\bf t}|{\bf\Phi}, \alpha, \beta)}
When computing the posterior, we usually pay attention only to the numerator on the right-hand side; the denominator $p({\bf t}|{\bf\Phi}, \alpha, \beta)$, however, is called the evidence function, and the following relationship holds.
(Likelihood function)\times(Prior distribution)=(Evidence function)\times(Posterior distribution)
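Written out, the evidence function is the marginal likelihood obtained by integrating the likelihood against the prior over ${\bf w}$:
p({\bf t}|{\bf\Phi},\alpha,\beta) = \int \mathcal{N}({\bf t}|{\bf\Phi}{\bf w}, \beta^{-1}{\bf I}) \mathcal{N}({\bf w}|{\bf 0}, \alpha^{-1}{\bf I}) {\rm d}{\bf w}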
Considering what $p({\bf t}|{\bf\Phi}, \alpha, \beta)$ means, the evidence function measures how readily the given hyperparameters $\alpha, \beta$ generate the training data ${\bf t}$. If its value is small, that choice of $\alpha, \beta$ rarely generates the data; conversely, if its value is large, the data is generated easily. In other words, **the larger the value of the evidence function, the better**. Writing the hyperparameter values that maximize the evidence function as ${\hat\alpha}, {\hat\beta}$ and substituting them into the posterior predictive distribution, as in
p(t|x,{\bf\Phi},{\bf t}) \approx p(t|x,{\bf\Phi},{\bf t},{\hat\alpha},{\hat\beta})
is the evidence approximation.
We look for the $\alpha, \beta$ that maximize the evidence function by alternating two steps: first, compute the posterior distribution of ${\bf w}$ (that is, ${\bf m}_N$ and ${\bf S}_N$) with the current $\alpha, \beta$; second, update $\gamma, \alpha, \beta$ as follows.
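With $\lambda_i$ denoting the eigenvalues of ${\bf\Phi}^{\rm T}{\bf\Phi}$ (this is how the code below computes them), the re-estimation equations derived in PRML (Section 3.5) are
\gamma = \sum_i {\beta\lambda_i \over \alpha + \beta\lambda_i}\\
\alpha = {\gamma \over {\bf m}_N^{\rm T}{\bf m}_N}\\
\beta^{-1} = {1 \over N - \gamma}\sum_{n=1}^N \left(t_n - {\bf m}_N^{\rm T}{\bf\phi}(x_n)\right)^2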
By repeating these two steps, $\alpha$ and $\beta$ converge to the values ${\hat\alpha}, {\hat\beta}$ that maximize the evidence function. See PRML for the derivation of these update equations.
I am reusing the code I made for Bayes curve fitting.
The PolynomialFeatures class transforms the input vector $(x_1, \cdots, x_N)^{\rm T}$ into the design matrix ${\bf\Phi}$ for a given polynomial degree.
```python
import matplotlib.pyplot as plt
import numpy as np


class PolynomialFeatures(object):

    def __init__(self, degree):
        # degree of the polynomial features
        assert type(degree) is int, "%s is not int" % type(degree)
        self.degree = degree

    def transform(self, x):
        # convert the input vector (x_1, ..., x_N)^T into the design matrix Phi
        features = [x ** i for i in range(self.degree + 1)]
        return np.array(features).transpose()
```
PolynomialFeatures | Method description |
---|---|
__init__ | Set the degree of polynomial features |
transform | Convert input to design matrix |
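For example, with degree 3 two inputs are mapped to a $2\times 4$ design matrix (the input values below are chosen arbitrarily):

```python
features = PolynomialFeatures(degree=3)
X = features.transform(np.array([1., 2.]))
print(X)
# [[1. 1. 1. 1.]
#  [1. 2. 4. 8.]]
```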
The BayesianRegression class estimates the posterior distribution of the parameters and computes the posterior predictive distribution, as described in the Bayesian linear regression section above.
```python
class BayesianRegression(object):

    def __init__(self, alpha=1., beta=1.):
        # alpha: precision of the prior, beta: precision of the observation noise
        self.alpha = alpha
        self.beta = beta

    def fit(self, X, t):
        # posterior covariance S_N and mean m_N of the parameter w
        self.w_var = np.linalg.inv(
            self.alpha * np.identity(np.size(X, 1))
            + self.beta * X.T.dot(X))
        self.w_mean = self.beta * self.w_var.dot(X.T.dot(t))

    def predict(self, X):
        # mean of the posterior predictive distribution
        return X.dot(self.w_mean)

    def predict_dist(self, X):
        # mean and standard deviation of the posterior predictive distribution
        y = X.dot(self.w_mean)
        y_var = 1 / self.beta + np.sum(X.dot(self.w_var) * X, axis=1)
        y_std = np.sqrt(y_var)
        return y, y_std
```
BayesianRegression | Method description |
---|---|
__init__ | Set the hyperparameters $\alpha, \beta$ |
fit | Estimate the posterior distribution of the parameter ${\bf w}$ |
predict | Calculate the mode of the posterior predictive distribution |
predict_dist | Calculate the mode and standard deviation of the posterior predictive distribution |
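As a small usage sketch (the data-generating function, noise level, and hyperparameter values here are arbitrary choices for illustration, not from the article), fitting a cubic model with fixed $\alpha, \beta$ looks like this:

```python
np.random.seed(0)  # for reproducibility of the toy example
x = np.random.uniform(-5, 5, 20)
t = x * (x - 5) * (x + 5) + np.random.normal(scale=5., size=20)

X = PolynomialFeatures(degree=3).transform(x)
regression = BayesianRegression(alpha=1e-3, beta=1e-2)
regression.fit(X, t)

x_test = np.linspace(-5, 5, 5)
X_test = PolynomialFeatures(degree=3).transform(x_test)
y, y_std = regression.predict_dist(X_test)
print(y)      # predictive mean at each test point
print(y_std)  # predictive standard deviation at each test point
```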
The evidence function is maximized by alternately updating the parameter ${\bf w}$ and the hyperparameters $\alpha, \beta$. Updating the parameter ${\bf w}$ is the same computation as in the Bayesian curve fitting above, so that method is reused through class inheritance. The evidence method calculates the value of the log evidence function; by comparing this value, the best model among several candidates can be selected automatically.
```python
class EvidenceApproximation(BayesianRegression):

    def __init__(self, iter_max=100, alpha=1., beta=1.):
        self.iter_max = iter_max
        self.alpha = alpha
        self.beta = beta

    def fit(self, X, t):
        M = X.T.dot(X)
        eigenvalues = np.linalg.eigvalsh(M)  # eigenvalues lambda_i of Phi^T Phi
        for i in range(self.iter_max):
            params = [self.alpha, self.beta]
            # update the posterior of w (m_N, S_N) with the current alpha, beta
            super(EvidenceApproximation, self).fit(X, t)
            # re-estimate gamma, alpha, beta
            self.gamma = np.sum(self.beta * eigenvalues / (self.alpha + self.beta * eigenvalues))
            self.alpha = self.gamma / self.w_mean.dot(self.w_mean)
            self.beta = (len(t) - self.gamma) / np.sum((t - X.dot(self.w_mean)) ** 2)
            if np.allclose(params, [self.alpha, self.beta]):
                break
        super(EvidenceApproximation, self).fit(X, t)

    def evidence(self, X, t):
        # log evidence function ln p(t|alpha, beta) (PRML, Section 3.5.1)
        M = X.T.dot(X)
        A = self.alpha * np.identity(len(M)) + self.beta * M
        return 0.5 * (len(M) * np.log(self.alpha)
                      + len(t) * np.log(self.beta)
                      - self.beta * np.sum((t - X.dot(self.w_mean)) ** 2)
                      - self.alpha * self.w_mean.dot(self.w_mean)
                      - np.linalg.slogdet(A)[1]
                      - len(t) * np.log(2 * np.pi))
```
EvidenceApproximation | Method description |
---|---|
__init__ | Set the maximum number of iterations and the initial values of $\alpha, \beta$ |
fit | Estimate $\alpha, \beta$ by maximizing the evidence function, then estimate the posterior distribution of ${\bf w}$ |
evidence | Calculate the value of the log evidence function |
Generate data points from the cubic polynomial $x(x-5)(x+5)$ plus noise, perform curve fitting by evidence approximation for polynomial models of degree 0 through 7, compute the model evidence for each, and compute the posterior predictive distribution using the model whose evidence is largest.
evidence_approximation.py
```python
import matplotlib.pyplot as plt
import numpy as np


class PolynomialFeatures(object):

    def __init__(self, degree):
        assert type(degree) is int, "%s is not int" % type(degree)
        self.degree = degree

    def transform(self, x):
        # design matrix with rows (1, x_n, x_n^2, ..., x_n^degree)
        features = [x ** i for i in range(self.degree + 1)]
        return np.array(features).transpose()


class BayesianRegression(object):

    def __init__(self, alpha=1., beta=1.):
        self.alpha = alpha
        self.beta = beta

    def fit(self, X, t):
        # posterior covariance S_N and mean m_N of the parameter w
        self.w_var = np.linalg.inv(
            self.alpha * np.identity(np.size(X, 1))
            + self.beta * X.T.dot(X))
        self.w_mean = self.beta * self.w_var.dot(X.T.dot(t))

    def predict(self, X):
        return X.dot(self.w_mean)

    def predict_dist(self, X):
        # mean and standard deviation of the posterior predictive distribution
        y = X.dot(self.w_mean)
        y_var = 1 / self.beta + np.sum(X.dot(self.w_var) * X, axis=1)
        y_std = np.sqrt(y_var)
        return y, y_std


class EvidenceApproximation(BayesianRegression):

    def __init__(self, iter_max=100, alpha=1., beta=1.):
        self.iter_max = iter_max
        self.alpha = alpha
        self.beta = beta

    def fit(self, X, t):
        M = X.T.dot(X)
        eigenvalues = np.linalg.eigvalsh(M)  # eigenvalues lambda_i of Phi^T Phi
        for i in range(self.iter_max):
            params = [self.alpha, self.beta]
            # update the posterior of w with the current alpha, beta
            super(EvidenceApproximation, self).fit(X, t)
            # re-estimate gamma, alpha, beta
            self.gamma = np.sum(self.beta * eigenvalues / (self.alpha + self.beta * eigenvalues))
            self.alpha = self.gamma / self.w_mean.dot(self.w_mean)
            self.beta = (len(t) - self.gamma) / np.sum((t - X.dot(self.w_mean)) ** 2)
            if np.allclose(params, [self.alpha, self.beta]):
                break
        super(EvidenceApproximation, self).fit(X, t)

    def evidence(self, X, t):
        # log evidence function ln p(t|alpha, beta) (PRML, Section 3.5.1)
        M = X.T.dot(X)
        A = self.alpha * np.identity(len(M)) + self.beta * M
        return 0.5 * (len(M) * np.log(self.alpha)
                      + len(t) * np.log(self.beta)
                      - self.beta * np.sum((t - X.dot(self.w_mean)) ** 2)
                      - self.alpha * self.w_mean.dot(self.w_mean)
                      - np.linalg.slogdet(A)[1]
                      - len(t) * np.log(2 * np.pi))


def create_toy_data(func, low=0, high=1, size=10, sigma=1.):
    x = np.random.uniform(low, high, size)
    t = func(x) + np.random.normal(scale=sigma, size=size)
    return x, t


def main():

    def func(x):
        return x * (x - 5) * (x + 5)

    # noisy observations of the cubic polynomial
    x, t = create_toy_data(func, low=-5, high=5, size=20, sigma=5.)
    plt.scatter(x, t, s=50, alpha=0.5, label="observation")

    # fit polynomial models of degree 0 to 7 by evidence approximation
    evidences = []
    regressions = []
    for i in range(8):
        features = PolynomialFeatures(degree=i)
        X = features.transform(x)
        regression = EvidenceApproximation(alpha=100., beta=100.)
        regression.fit(X, t)
        evidences.append(regression.evidence(X, t))
        regressions.append(regression)

    # select the model with the largest evidence
    degree = np.argmax(evidences)
    regression = regressions[degree]

    # posterior predictive distribution of the selected model
    x_test = np.linspace(-5, 5, 100)
    X_test = PolynomialFeatures(degree=int(degree)).transform(x_test)
    y, y_std = regression.predict_dist(X_test)

    plt.plot(x_test, func(x_test), color="blue", label="x(x-5)(x+5)")
    plt.plot(x_test, y, color="red", label="predict_mean")
    plt.fill_between(x_test, y - y_std, y + y_std, color="pink", alpha=0.5, label="predict_std")
    plt.legend()
    plt.show()

    plt.plot(evidences)
    plt.title("Model evidence")
    plt.xlabel("M")
    plt.ylabel("evidence")
    plt.show()


if __name__ == '__main__':
    main()
```
The main steps of main() are:

- `def func(x)`: the cubic polynomial $x(x-5)(x+5)$ used to generate the data
- `x, t = create_toy_data(...)`: generate noisy observations of func
- `X = features.transform(x)`: convert the inputs into the design matrix
- `regression.fit(X, t)`: estimate $\alpha, \beta$ and the posterior distribution of ${\bf w}$ by evidence approximation
- `degree = np.argmax(evidences)`: select the degree whose model evidence is largest
- `regression = regressions[degree]`: use the model fitted with that degree
As in Figure 3.14 of PRML, the log evidence is largest for the cubic polynomial model, and the predictive distribution of that model also looks good. The initial values of the hyperparameters $\alpha, \beta$ were both set to 100. With ordinary Bayesian curve fitting these would be poor choices: with $\alpha$ that large each component of ${\bf w}$ is pulled close to 0, and with $\beta$ that large the variance of the predictive distribution becomes too small, so the posterior predictive distribution would not fit the data at all. Because $\alpha, \beta$ are estimated by maximizing the evidence function, the posterior predictive distribution fits the data well.
With evidence approximation, even if bad hyperparameter values are set initially, good values are estimated from the data. Until now I searched for hyperparameters by grid search, holding out part of the training data for validation, but with evidence approximation the hyperparameters can be estimated from all of the training data in a small number of iterations. Moreover, evidence approximation is used not only for linear regression models but also for logistic regression and for estimating the hyperparameters of neural networks. If I get the chance, I will implement those as well.