PRML: Chapter 7 Kernel Machines with Sparse Solutions

Overview

Regression with the relevance vector machine (RVM) from PRML Chapter 7, implemented in Python.

The code and experimental results are summarized in a Jupyter notebook.

RVM model

The conditional distribution of the real-valued target variable $t$ given the input vector $\mathbf{x}$ is

p(t|\mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t|y(\mathbf{x}), \beta^{-1})

Here, $\beta = \sigma^{-2}$ is the noise precision parameter (the reciprocal of the noise variance), and the mean is given by the following linear model.

y(\mathbf{x})=\sum_{i=1}^{M}w_{i}\phi_{i}(\mathbf{x})=\mathbf{w}^{T}\mathbf{\phi}(\mathbf{x})

As basis functions, we use kernel functions that each take an individual training data point as one of their arguments.

y(\mathbf{x})=\sum_{n=1}^{N}w_{n}k(\mathbf{x}, \mathbf{x}_{n})+b

The total number of parameters is $M = N + 1$. The likelihood function is given by the following equation.

p(\mathbf{t}| \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} p(t_{n}|\mathbf{x}_{n}, \mathbf{w}, \beta)
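To make the kernel-basis model above concrete, here is a minimal sketch of how the $N \times (N+1)$ design matrix can be built. It assumes a Gaussian (RBF) kernel with a `length_scale` parameter; the original post does not state which kernel was actually used, so both helper names and the kernel choice are illustrative.

```python
import numpy as np

def rbf_kernel(x, y, length_scale=1.0):
    """Gaussian (RBF) kernel between two inputs (an assumed kernel choice)."""
    diff = np.atleast_1d(x) - np.atleast_1d(y)
    return np.exp(-0.5 * np.dot(diff, diff) / length_scale ** 2)

def design_matrix(X, length_scale=1.0):
    """N x (N+1) design matrix: N kernel columns plus a constant bias column."""
    N = len(X)
    Phi = np.empty((N, N + 1))
    for n in range(N):
        for i in range(N):
            Phi[n, i] = rbf_kernel(X[n], X[i], length_scale)
    Phi[:, N] = 1.0   # bias column, Phi_{nM} = 1
    return Phi
```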

A zero-mean Gaussian prior is used for the parameter vector $\mathbf{w}$. The RVM introduces a separate hyperparameter $\alpha_{i}$ for each weight parameter $w_{i}$. That is, the prior over the weights is as follows.

p(\mathbf{w}|\mathbf{\alpha}) = \prod_{i=1}^{M} \mathcal{N} (w_{i}|0, \alpha_{i}^{-1})

Here $\alpha_{i}$ represents the precision of the corresponding weight parameter $w_{i}$, and $\mathbf{\alpha} = (\alpha_{1}, \dots, \alpha_{M})^{T}$. Maximizing the evidence with respect to these hyperparameters drives many of them to infinity, so the posterior distributions of the corresponding weight parameters concentrate at zero. The basis functions associated with those parameters (kernel functions centered on the corresponding data points) then play no role in prediction, so they can be pruned and a sparse model is obtained.

The posterior distribution over the weight vector is again Gaussian and takes the following form.

p(\mathbf{w}|\mathbf{t}, \mathbf{X}, \mathbf{\alpha}, \beta) = \mathcal{N}(\mathbf{w}|\mathbf{m}, \mathbf{\Sigma})

Here, the mean and covariance are given by the following equations.

\mathbf{m} = \beta \mathbf{\Sigma} \mathbf{\Phi}^{T} \mathbf{t}
\mathbf{\Sigma} = \left( \mathbf{A} + \beta \mathbf{\Phi}^{T} \mathbf{\Phi} \right)^{-1}

Here, $\mathbf{\Phi}$ is the $N \times M$ design matrix with elements $\Phi_{ni} = \phi_{i}(\mathbf{x}_{n})$ for $i = 1, \dots, N$ and $\Phi_{nM} = 1$ for $n = 1, \dots, N$, and $\mathbf{A} = \mathrm{diag}(\alpha_{i})$.
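These two equations translate directly into code. The sketch below uses helper names of my own choosing and assumes `Phi`, `t`, and `alpha` are NumPy arrays and `beta` a scalar, as defined above.

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Posterior mean m and covariance Sigma of the weights:
    Sigma = (A + beta * Phi^T Phi)^{-1},  m = beta * Sigma Phi^T t."""
    A = np.diag(alpha)                              # A = diag(alpha_i)
    Sigma = np.linalg.inv(A + beta * Phi.T @ Phi)
    m = beta * Sigma @ Phi.T @ t
    return m, Sigma
```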

The values of $\mathbf{\alpha}$ and $\beta$ are obtained by type-2 maximum likelihood, also known as the __evidence approximation__. To carry out type-2 maximum likelihood, we first integrate out the weight parameters.

p(\mathbf{t}|\mathbf{X}, \mathbf{\alpha}, \beta) = \int p(\mathbf{t}|\mathbf{X}, \mathbf{w}, \beta)p(\mathbf{w}|\mathbf{\alpha}) d\mathbf{w}

Since this is a convolution of two Gaussian distributions, the integral can be evaluated analytically, giving the following log likelihood.

\ln p(\mathbf{t}|\mathbf{X}, \mathbf{\alpha}, \beta) = \ln \mathcal{N} (\mathbf{t}|\mathbf{0}, \mathbf{C}) = -\frac{1}{2}\{ N \ln (2 \pi) + \ln |\mathbf{C}| + \mathbf{t}^{T}\mathbf{C}^{-1}\mathbf{t}\}

Here, $\mathbf{t} = (t_{1}, \dots, t_{N})^{T}$. We also defined the $N \times N$ matrix $\mathbf{C}$ as follows.

\mathbf{C} = \beta^{-1}\mathbf{I} + \mathbf{\Phi} \mathbf{A}^{-1} \mathbf{\Phi}^{T}
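As a sketch, the log evidence can be evaluated directly from this expression. Using `slogdet` and `solve` for numerical stability is my own choice here, not something stated in the post.

```python
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    """Log marginal likelihood ln p(t | X, alpha, beta) = ln N(t | 0, C)."""
    N = len(t)
    C = np.eye(N) / beta + Phi @ np.diag(1.0 / alpha) @ Phi.T
    _, logdet = np.linalg.slogdet(C)   # log |C|; C is positive definite
    return -0.5 * (N * np.log(2 * np.pi) + logdet + t @ np.linalg.solve(C, t))
```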

Setting the derivatives of this log likelihood to zero yields the following hyperparameter update equations.

\begin{split} \alpha_{i}^{new} &= \frac{\gamma_{i}}{m_{i}^{2}} \\ (\beta^{new})^{-1} &= \frac{\|\mathbf{t} - \mathbf{\Phi m}\|^{2}}{N - \sum_{i}\gamma_{i}} \end{split}

Here, $\gamma_{i}$ measures how well the corresponding weight parameter $w_{i}$ is determined by the data, and is defined by the following equation.

\gamma_{i} = 1 - \alpha_{i} \Sigma_{ii}

Using the results above, hyperparameter learning proceeds as follows. First, the posterior mean $\mathbf{m}$ and covariance $\mathbf{\Sigma}$ are computed from suitably chosen initial values of $\mathbf{\alpha}$ and $\beta$. Next, the hyperparameters are re-estimated from these values. The two steps are repeated alternately until a suitable convergence criterion is met.
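A minimal sketch of this alternating procedure, assuming the `posterior()` helper from the earlier snippet; the initial values, iteration cap, and convergence test on the change in $\mathbf{\alpha}$ are arbitrary choices for illustration rather than the post's exact settings.

```python
import numpy as np

def fit_rvm(Phi, t, n_iter=1000, tol=1e-6, alpha_init=1.0, beta_init=1.0):
    """Alternate between the weight posterior and the alpha/beta re-estimation."""
    N, M = Phi.shape
    alpha = np.full(M, alpha_init)
    beta = beta_init
    for _ in range(n_iter):
        m, Sigma = posterior(Phi, t, alpha, beta)    # from the earlier sketch
        gamma = 1.0 - alpha * np.diag(Sigma)         # gamma_i = 1 - alpha_i * Sigma_ii
        alpha_new = gamma / (m ** 2 + 1e-12)         # alpha_i^new = gamma_i / m_i^2
        beta = (N - gamma.sum()) / np.sum((t - Phi @ m) ** 2)
        if np.max(np.abs(alpha_new - alpha)) < tol:  # crude convergence test
            alpha = alpha_new
            break
        alpha = alpha_new
    m, Sigma = posterior(Phi, t, alpha, beta)        # final posterior
    return alpha, beta, m, Sigma
```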

Experiment

Training data

synthetic_data_sin.png

Results

Design matrix and relevance vectors

design_matrix_relevance_vector.png

For vectors that are irrelevant to the output $\mathbf{t}$, the value of $\alpha$ goes to $\infty$. In other words, the columns with large values of $\alpha^{-1}$ correspond to the relevance vectors.
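For illustration, a hypothetical end-to-end usage on synthetic sinusoidal data, assuming the `design_matrix()` and `fit_rvm()` sketches above; the cut-off used to decide which $\alpha$ counts as "finite" is an arbitrary threshold, not a value from the post.

```python
import numpy as np

# Synthetic sinusoidal data (assumed setup, analogous to the training data above).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, 30)
t = np.sin(2 * np.pi * X) + rng.normal(0.0, 0.2, 30)

Phi = design_matrix(X, length_scale=0.1)
alpha, beta, m, Sigma = fit_rvm(Phi, t)

# Columns whose alpha stayed small (i.e. alpha^{-1} is large) correspond to
# the relevance vectors; the 1e4 cut-off is an arbitrary illustrative threshold.
relevance_idx = np.where(alpha[:-1] < 1e4)[0]   # last column is the bias term
print("relevance vectors:", X[relevance_idx])
```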

RVM regression results

RVM_regression.png

Discussion
