This time, I implemented the evidence approximation that was briefly mentioned at the end of First PRML implementation. This method is also known as empirical Bayes or type-2 maximum likelihood estimation. When using machine learning algorithms, we often have to set hyperparameters. With this method, hyperparameters that until now had to be chosen somehow by hand are determined automatically from the data.
Before we get into the explanation of evidence approximation, let's explain Bayesian linear regression.
Given the training data $\{x_i, t_i\}_{i=1}^N$, we prepare a feature vector ${\bf\phi}(x)$ (for example, ${\bf\phi}(x) = (1, x, x^2, x^3)^{\rm T}$) and define
{\bf t} =
\begin{bmatrix}
t_1\\
t_2\\
\vdots\\
t_N
\end{bmatrix}\\
{\bf\Phi} =
\begin{bmatrix}
\phi_1(x_1) & \phi_2(x_1) & \cdots & \phi_M(x_1)\\
\phi_1(x_2) & \phi_2(x_2) & \cdots & \phi_M(x_2)\\
\vdots & \vdots & \ddots & \vdots\\
\phi_1(x_N) & \phi_2(x_N) & \cdots & \phi_M(x_N)
\end{bmatrix}\\
{\bf w} =
\begin{bmatrix}
w_1\\
w_2\\
\vdots\\
w_M
\end{bmatrix}\\
With these definitions, maximum likelihood estimation maximizes the following likelihood function with respect to the parameter ${\bf w}$, where ${\bf I}$ is the $N\times N$ identity matrix and $\beta$ is the precision parameter.
p({\bf t}|{\bf\Phi},{\bf w},\beta) = \mathcal{N}({\bf t}|{\bf\Phi}{\bf w}, \beta^{-1}{\bf I})
Since maximum likelihood estimation may overfit the training data, we place a prior distribution on the parameter ${\bf w}$. Taking a Gaussian with mean ${\bf 0}$ and covariance $\alpha^{-1}{\bf I}$ as the prior and applying Bayes' theorem, we get
\begin{align}
p({\bf w}|{\bf\Phi},{\bf t}, \alpha, \beta) &= {\mathcal{N}({\bf t}|{\bf\Phi}{\bf w}, \beta^{-1}{\bf I}) \mathcal{N}({\bf w}|{\bf 0}, \alpha^{-1}{\bf I})\over p({\bf t}|{\bf\Phi}, \alpha, \beta)}\\
&= \mathcal{N}({\bf w}|{\bf m}_N, {\bf S}_N)
\end{align}
as the posterior distribution of the parameter ${\bf w}$, where
{\bf m}_N = \beta {\bf S}_N{\bf\Phi}^{\rm T}{\bf t}\\
{\bf S}_N^{-1} = \alpha{\bf I} + \beta{\bf\Phi}^{\rm T}{\bf\Phi}
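As a sanity check on these formulas: in the limit $\alpha \to 0$ (an infinitely broad prior), they give
{\bf m}_N \to \left({\bf\Phi}^{\rm T}{\bf\Phi}\right)^{-1}{\bf\Phi}^{\rm T}{\bf t}
which is the ordinary least-squares solution, while a large $\alpha$ shrinks ${\bf m}_N$ toward ${\bf 0}$.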
Using the posterior distribution obtained above, the posterior predictive distribution of $t$ for a new input $x$ is
p(t|x,{\bf\Phi},{\bf t},\alpha,\beta) = \int \mathcal{N}(t|{\bf w}^{\rm T}{\bf\phi}(x), \beta^{-1}) \mathcal{N}({\bf w}|{\bf m}_N, {\bf S}_N) {\rm d}{\bf w}\\
= \mathcal{N}(t|{\bf m}_N^{\rm T}{\bf\phi}(x), \sigma^2_N(x))
which is obtained by integrating out the parameter ${\bf w}$. The variance is given by
\sigma^2_N(x) = {1\over\beta} + \phi(x)^{\rm T}{\bf S}_N\phi(x)
With this posterior predictive distribution we can do Bayesian linear regression. Its behavior, however, depends on the hyperparameters $\alpha, \beta$: if the values we choose are too large or too small, the regression result may not be what we want.
Above, the hyperparameters $\alpha, \beta$ were fixed to values chosen by hand; these can instead be estimated from the data. When computing the posterior distribution of the parameter ${\bf w}$, we applied Bayes' theorem:
p({\bf w}|{\bf\Phi},{\bf t}, \alpha, \beta) = {\mathcal{N}({\bf t}|{\bf\Phi}{\bf w}, \beta^{-1}{\bf I}) \mathcal{N}({\bf w}|{\bf 0}, \alpha^{-1}{\bf I})\over p({\bf t}|{\bf\Phi}, \alpha, \beta)}
When computing the posterior, we usually pay attention only to the numerator on the right-hand side; the denominator $p({\bf t}|{\bf\Phi}, \alpha, \beta)$, however, is called the evidence function, and the following relationship holds.
(Likelihood function)\times(Prior distribution)=(Evidence function)\times(Posterior distribution)
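Written out, the evidence function is the marginal likelihood obtained by integrating the likelihood against the prior over ${\bf w}$:
p({\bf t}|{\bf\Phi},\alpha,\beta) = \int \mathcal{N}({\bf t}|{\bf\Phi}{\bf w}, \beta^{-1}{\bf I}) \mathcal{N}({\bf w}|{\bf 0}, \alpha^{-1}{\bf I}) {\rm d}{\bf w}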
Considering what $p({\bf t}|{\bf\Phi}, \alpha, \beta)$ means, the evidence function measures how readily the given hyperparameters $\alpha, \beta$ generate the training data ${\bf t}$. If its value is small, that choice of $\alpha, \beta$ rarely generates the data; conversely, if its value is large, the data is generated easily. In other words, **the larger the value of the evidence function, the better**. Writing the hyperparameter values that maximize the evidence function as ${\hat\alpha}, {\hat\beta}$ and substituting them into the posterior predictive distribution, as in
p(t|x,{\bf\Phi},{\bf t}) \approx p(t|x,{\bf\Phi},{\bf t},{\hat\alpha},{\hat\beta})
is the evidence approximation.
We look for the $\alpha, \beta$ that maximize the evidence function by alternating two steps: first, compute the posterior distribution of ${\bf w}$ (that is, ${\bf m}_N$ and ${\bf S}_N$) with the current $\alpha, \beta$; second, update $\gamma, \alpha, \beta$ as follows.
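With $\lambda_i$ denoting the eigenvalues of ${\bf\Phi}^{\rm T}{\bf\Phi}$ (this is how the code below computes them), the re-estimation equations derived in PRML (Section 3.5) are
\gamma = \sum_i {\beta\lambda_i \over \alpha + \beta\lambda_i}\\
\alpha = {\gamma \over {\bf m}_N^{\rm T}{\bf m}_N}\\
\beta^{-1} = {1 \over N - \gamma}\sum_{n=1}^N \left(t_n - {\bf m}_N^{\rm T}{\bf\phi}(x_n)\right)^2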
By repeating these two steps, $\alpha$ and $\beta$ converge to the values ${\hat\alpha}, {\hat\beta}$ that maximize the evidence function. See PRML for the derivation of these update equations.
I am reusing the code I made for Bayes curve fitting.
The PolynomialFeatures class transforms the input vector $(x_1, \cdots, x_N)^{\rm T}$ into the design matrix ${\bf\Phi}$ for a given polynomial degree.
```python
import matplotlib.pyplot as plt
import numpy as np


class PolynomialFeatures(object):

    def __init__(self, degree):
        # degree of the polynomial features
        assert type(degree) is int, "%s is not int" % type(degree)
        self.degree = degree

    def transform(self, x):
        # convert the input vector (x_1, ..., x_N)^T into the design matrix Phi
        features = [x ** i for i in range(self.degree + 1)]
        return np.array(features).transpose()
```
PolynomialFeatures | Method description |
---|---|
__init__ | Set the degree of polynomial features |
transform | Convert input to design matrix |
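For example, with degree 3 two inputs are mapped to a $2\times 4$ design matrix (the input values below are chosen arbitrarily):

```python
features = PolynomialFeatures(degree=3)
X = features.transform(np.array([1., 2.]))
print(X)
# [[1. 1. 1. 1.]
#  [1. 2. 4. 8.]]
```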
The BayesianRegression class estimates the posterior distribution of the parameters and computes the posterior predictive distribution, as described in the Bayesian linear regression section above.
```python
class BayesianRegression(object):

    def __init__(self, alpha=1., beta=1.):
        # alpha: precision of the prior, beta: precision of the observation noise
        self.alpha = alpha
        self.beta = beta

    def fit(self, X, t):
        # posterior covariance S_N and mean m_N of the parameter w
        self.w_var = np.linalg.inv(
            self.alpha * np.identity(np.size(X, 1))
            + self.beta * X.T.dot(X))
        self.w_mean = self.beta * self.w_var.dot(X.T.dot(t))

    def predict(self, X):
        # mean of the posterior predictive distribution
        return X.dot(self.w_mean)

    def predict_dist(self, X):
        # mean and standard deviation of the posterior predictive distribution
        y = X.dot(self.w_mean)
        y_var = 1 / self.beta + np.sum(X.dot(self.w_var) * X, axis=1)
        y_std = np.sqrt(y_var)
        return y, y_std
```
BayesianRegression | Method description |
---|---|
__init__ | Set the hyperparameters $\alpha, \beta$ |
fit | Estimate the posterior distribution of the parameter ${\bf w}$ |
predict | Calculate the mode of the posterior predictive distribution |
predict_dist | Calculate the mode and standard deviation of the posterior predictive distribution |
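As a small usage sketch (the data-generating function, noise level, and hyperparameter values here are arbitrary choices for illustration, not from the article), fitting a cubic model with fixed $\alpha, \beta$ looks like this:

```python
np.random.seed(0)  # for reproducibility of the toy example
x = np.random.uniform(-5, 5, 20)
t = x * (x - 5) * (x + 5) + np.random.normal(scale=5., size=20)

X = PolynomialFeatures(degree=3).transform(x)
regression = BayesianRegression(alpha=1e-3, beta=1e-2)
regression.fit(X, t)

x_test = np.linspace(-5, 5, 5)
X_test = PolynomialFeatures(degree=3).transform(x_test)
y, y_std = regression.predict_dist(X_test)
print(y)      # predictive mean at each test point
print(y_std)  # predictive standard deviation at each test point
```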
The evidence function is maximized by alternately updating the parameter ${\bf w}$ and the hyperparameters $\alpha, \beta$. Updating the parameter ${\bf w}$ is the same computation as in the Bayesian curve fitting above, so that method is reused through class inheritance. The evidence method calculates the value of the log evidence function; by comparing this value, the best model among several candidates can be selected automatically.
```python
class EvidenceApproximation(BayesianRegression):

    def __init__(self, iter_max=100, alpha=1., beta=1.):
        self.iter_max = iter_max
        self.alpha = alpha
        self.beta = beta

    def fit(self, X, t):
        M = X.T.dot(X)
        eigenvalues = np.linalg.eigvalsh(M)  # eigenvalues lambda_i of Phi^T Phi
        for i in range(self.iter_max):
            params = [self.alpha, self.beta]
            # update the posterior of w (m_N, S_N) with the current alpha, beta
            super(EvidenceApproximation, self).fit(X, t)
            # re-estimate gamma, alpha, beta
            self.gamma = np.sum(self.beta * eigenvalues / (self.alpha + self.beta * eigenvalues))
            self.alpha = self.gamma / self.w_mean.dot(self.w_mean)
            self.beta = (len(t) - self.gamma) / np.sum((t - X.dot(self.w_mean)) ** 2)
            if np.allclose(params, [self.alpha, self.beta]):
                break
        super(EvidenceApproximation, self).fit(X, t)

    def evidence(self, X, t):
        # log evidence function ln p(t|alpha, beta) (PRML, Section 3.5.1)
        M = X.T.dot(X)
        A = self.alpha * np.identity(len(M)) + self.beta * M
        return 0.5 * (len(M) * np.log(self.alpha)
                      + len(t) * np.log(self.beta)
                      - self.beta * np.sum((t - X.dot(self.w_mean)) ** 2)
                      - self.alpha * self.w_mean.dot(self.w_mean)
                      - np.linalg.slogdet(A)[1]
                      - len(t) * np.log(2 * np.pi))
```
EvidenceApproximation | Method description |
---|---|
__init__ | Set the maximum number of iterations and the initial values of $\alpha, \beta$ |
fit | Estimate $\alpha, \beta$ by maximizing the evidence function, then estimate the posterior distribution of ${\bf w}$ |
evidence | Calculate the value of the log evidence function |
Generate data points from the cubic polynomial $x(x-5)(x+5)$ plus noise, perform curve fitting by evidence approximation for polynomial models of degree 0 through 7, compute the model evidence for each, and compute the posterior predictive distribution using the model whose evidence is largest.
evidence_approximation.py
```python
import matplotlib.pyplot as plt
import numpy as np


class PolynomialFeatures(object):

    def __init__(self, degree):
        assert type(degree) is int, "%s is not int" % type(degree)
        self.degree = degree

    def transform(self, x):
        # design matrix with rows (1, x_n, x_n^2, ..., x_n^degree)
        features = [x ** i for i in range(self.degree + 1)]
        return np.array(features).transpose()


class BayesianRegression(object):

    def __init__(self, alpha=1., beta=1.):
        self.alpha = alpha
        self.beta = beta

    def fit(self, X, t):
        # posterior covariance S_N and mean m_N of the parameter w
        self.w_var = np.linalg.inv(
            self.alpha * np.identity(np.size(X, 1))
            + self.beta * X.T.dot(X))
        self.w_mean = self.beta * self.w_var.dot(X.T.dot(t))

    def predict(self, X):
        return X.dot(self.w_mean)

    def predict_dist(self, X):
        # mean and standard deviation of the posterior predictive distribution
        y = X.dot(self.w_mean)
        y_var = 1 / self.beta + np.sum(X.dot(self.w_var) * X, axis=1)
        y_std = np.sqrt(y_var)
        return y, y_std


class EvidenceApproximation(BayesianRegression):

    def __init__(self, iter_max=100, alpha=1., beta=1.):
        self.iter_max = iter_max
        self.alpha = alpha
        self.beta = beta

    def fit(self, X, t):
        M = X.T.dot(X)
        eigenvalues = np.linalg.eigvalsh(M)  # eigenvalues lambda_i of Phi^T Phi
        for i in range(self.iter_max):
            params = [self.alpha, self.beta]
            # update the posterior of w with the current alpha, beta
            super(EvidenceApproximation, self).fit(X, t)
            # re-estimate gamma, alpha, beta
            self.gamma = np.sum(self.beta * eigenvalues / (self.alpha + self.beta * eigenvalues))
            self.alpha = self.gamma / self.w_mean.dot(self.w_mean)
            self.beta = (len(t) - self.gamma) / np.sum((t - X.dot(self.w_mean)) ** 2)
            if np.allclose(params, [self.alpha, self.beta]):
                break
        super(EvidenceApproximation, self).fit(X, t)

    def evidence(self, X, t):
        # log evidence function ln p(t|alpha, beta) (PRML, Section 3.5.1)
        M = X.T.dot(X)
        A = self.alpha * np.identity(len(M)) + self.beta * M
        return 0.5 * (len(M) * np.log(self.alpha)
                      + len(t) * np.log(self.beta)
                      - self.beta * np.sum((t - X.dot(self.w_mean)) ** 2)
                      - self.alpha * self.w_mean.dot(self.w_mean)
                      - np.linalg.slogdet(A)[1]
                      - len(t) * np.log(2 * np.pi))


def create_toy_data(func, low=0, high=1, size=10, sigma=1.):
    x = np.random.uniform(low, high, size)
    t = func(x) + np.random.normal(scale=sigma, size=size)
    return x, t


def main():

    def func(x):
        return x * (x - 5) * (x + 5)

    # noisy observations of the cubic polynomial
    x, t = create_toy_data(func, low=-5, high=5, size=20, sigma=5.)
    plt.scatter(x, t, s=50, alpha=0.5, label="observation")

    # fit polynomial models of degree 0 to 7 by evidence approximation
    evidences = []
    regressions = []
    for i in range(8):
        features = PolynomialFeatures(degree=i)
        X = features.transform(x)
        regression = EvidenceApproximation(alpha=100., beta=100.)
        regression.fit(X, t)
        evidences.append(regression.evidence(X, t))
        regressions.append(regression)

    # select the model with the largest evidence
    degree = np.argmax(evidences)
    regression = regressions[degree]

    # posterior predictive distribution of the selected model
    x_test = np.linspace(-5, 5, 100)
    X_test = PolynomialFeatures(degree=int(degree)).transform(x_test)
    y, y_std = regression.predict_dist(X_test)

    plt.plot(x_test, func(x_test), color="blue", label="x(x-5)(x+5)")
    plt.plot(x_test, y, color="red", label="predict_mean")
    plt.fill_between(x_test, y - y_std, y + y_std, color="pink", alpha=0.5, label="predict_std")
    plt.legend()
    plt.show()

    plt.plot(evidences)
    plt.title("Model evidence")
    plt.xlabel("M")
    plt.ylabel("evidence")
    plt.show()


if __name__ == '__main__':
    main()
```
The main steps of main() are:

- `def func(x)`: the cubic polynomial $x(x-5)(x+5)$ used to generate the data
- `x, t = create_toy_data(...)`: generate noisy observations of func
- `X = features.transform(x)`: convert the inputs into the design matrix
- `regression.fit(X, t)`: estimate $\alpha, \beta$ and the posterior distribution of ${\bf w}$ by evidence approximation
- `degree = np.argmax(evidences)`: select the degree whose model evidence is largest
- `regression = regressions[degree]`: use the model fitted with that degree
As in Figure 3.14 of PRML, the log evidence is largest for the cubic polynomial model, and the predictive distribution of that model also looks good. The initial values of the hyperparameters $\alpha, \beta$ were both set to 100. With ordinary Bayesian curve fitting these would be poor choices: with $\alpha$ that large each component of ${\bf w}$ is pulled close to 0, and with $\beta$ that large the variance of the predictive distribution becomes too small, so the posterior predictive distribution would not fit the data at all. Because $\alpha, \beta$ are estimated by maximizing the evidence function, the posterior predictive distribution fits the data well.
With evidence approximation, even if bad hyperparameter values are set initially, good values are estimated from the data. Until now I searched for hyperparameters by grid search, holding out part of the training data for validation, but with evidence approximation the hyperparameters can be estimated from all of the training data in a small number of iterations. Moreover, evidence approximation is used not only for linear regression models but also for logistic regression and for estimating the hyperparameters of neural networks. If I get the chance, I will implement those as well.