PRML Chapter 7 Relevance Vector Machine Python Implementation for Regression Problems

This time I implemented a relevance vector machine. It is not as famous as the support vector machine, but it has advantages such as producing probabilistic outputs. I think many people would use it if it were in scikit-learn (a library that implements machine learning methods), but it is not implemented there because Microsoft holds a patent on it.

Relevance vector machine

Chapter 7 is about "sparse kernel machines", but the discussion of relevance vector machines is generally valid even without the kernel method. In Chapter 3, the prior distribution over the linear regression parameters ${\bf w}$ was set to $p({\bf w}|\alpha)=\mathcal{N}({\bf w}|{\bf 0},\alpha^{-1}I)$ with a single hyperparameter $\alpha$. Here, instead,

p({\bf w}|{\bf \alpha}) = \prod_i\mathcal{N}(w_i|0,\alpha_i^{-1})

is used, **introducing a separate hyperparameter for each parameter**. The hyperparameters ${\bf\alpha}$ are then estimated in the same way as in the Chapter 3 evidence approximation. Many components of the estimated ${\bf\alpha}$ go to infinity. This means the variance $\alpha_i^{-1}$ of the corresponding parameter $w_i$ approaches 0, so $w_i$ follows a sharply peaked distribution around the prior mean of 0. The parameter vector ${\bf w}$ therefore becomes sparse, and highly relevant features are selected automatically (automatic relevance determination).
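As a rough illustration of this effect (not part of the implementation below, and the numbers are arbitrary), sampling $w_i$ from $\mathcal{N}(0,\alpha_i^{-1})$ with a very large $\alpha_i$ pins that weight to essentially 0:

import numpy as np

np.random.seed(0)
# Two hypothetical precisions: a moderate one and one driven towards infinity
alphas = np.array([1., 1e10])
# Draw w_i ~ N(0, 1/alpha_i); the second column comes out essentially all zeros
samples = np.random.normal(scale=np.sqrt(1. / alphas), size=(5, 2))
print(samples)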

Relevance vector machine for regression problems

The likelihood function of the weight parameters ${\bf w}$, given $N$ observed training data points $\{x_n, t_n\}_{n=1}^N$, is

\begin{align}
p({\bf t}|{\bf\Phi},{\bf w},\beta) &= \prod_{n=1}^N p(t_n|\phi(x_n),{\bf w}, \beta)\\
&= \prod_{n=1}^N \mathcal{N}(t_n|{\bf w}^{\rm T}\phi(x_n), \beta^{-1})
\end{align}

where ${\bf t}=(t_1,\dots,t_N)^{\rm T}$, and the feature vector $\phi(\cdot)$ is built from Gaussian functions centered on the training data points:

\phi(x) =
\begin{bmatrix}
\phi_1(x)\\
\vdots\\
\phi_N(x)
\end{bmatrix}
=
\begin{bmatrix}
a\exp(-b(x - x_1)^2)\\
\vdots\\
a\exp(-b(x-x_N)^2)
\end{bmatrix}
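For concreteness, the feature vector for a single input $x$ could be computed as in the small sketch below (the implementation further down corresponds to $a=1$, $b=10$; `feature_vector` is just an illustrative helper):

import numpy as np

def feature_vector(x, x_train, a=1., b=10.):
    # phi_n(x) = a * exp(-b * (x - x_n)^2), one component per training point
    return a * np.exp(-b * (x - x_train) ** 2)

x_train = np.array([0.1, 0.5, 0.9])
print(feature_vector(0.3, x_train))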

${\bf\Phi}$ is the $N\times N$ design matrix whose elements are $\Phi_{ni}=\phi_i(x_n)$. Using $p({\bf w}|{\bf\alpha})$ above as the prior distribution, Bayes' theorem gives the posterior distribution of the weight parameters ${\bf w}$:

p({\bf w}|{\bf t},{\bf\Phi},{\bf\alpha},\beta) = \mathcal{N}({\bf w}|{\bf m},{\bf \Sigma})

where the mean and covariance are

\begin{align}
{\bf m} &= \beta{\bf\Sigma}{\bf\Phi}^{\rm T}{\bf t}\\
{\bf\Sigma} &= \left({\bf A} + \beta{\bf\Phi}^{\rm T}{\bf\Phi}\right)^{-1}
\end{align}

The matrix ${\bf A}$ is a diagonal matrix with elements $A_{ii}=\alpha_i$.
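In numpy these two equations translate almost directly into the code used in `fit()` below; a minimal sketch, assuming `Phi`, `alphas`, `beta` and `t` are already set up as numpy arrays:

import numpy as np

def posterior(Phi, alphas, beta, t):
    # Precision A + beta * Phi^T Phi and its inverse, PRML eq. (7.83)
    precision = np.diag(alphas) + beta * Phi.T.dot(Phi)
    covariance = np.linalg.inv(precision)
    # Mean m = beta * Sigma * Phi^T t, PRML eq. (7.82)
    mean = beta * covariance.dot(Phi.T).dot(t)
    return mean, covariance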

Based on the discussion so far, the hyperparameters ${\bf\alpha}$ and $\beta$ are estimated by maximizing the evidence. Writing ${\bf C} = \beta^{-1}{\bf I} + {\bf\Phi}{\bf A}^{-1}{\bf\Phi}^{\rm T}$, the log evidence function is

\begin{align}
\ln p({\bf t}|{\bf\Phi},{\bf\alpha},\beta) &= \ln\mathcal{N}({\bf t}|{\bf 0},{\bf C})\\
&= -{1\over2}\left\{N\ln(2\pi) + \ln|{\bf C}| + {\bf t}^{\rm T}{\bf C}^{-1}{\bf t}\right\}
\end{align}
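The implementation below does not actually evaluate this quantity, but it can be handy for monitoring convergence; a minimal sketch under the same definitions (again assuming `Phi`, `alphas`, `beta`, `t` as numpy arrays):

import numpy as np

def log_evidence(Phi, alphas, beta, t):
    N = len(t)
    # C = beta^-1 I + Phi A^-1 Phi^T
    C = np.eye(N) / beta + Phi.dot(np.diag(1. / alphas)).dot(Phi.T)
    # ln N(t | 0, C)
    return -0.5 * (N * np.log(2 * np.pi)
                   + np.linalg.slogdet(C)[1]
                   + t.dot(np.linalg.solve(C, t)))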

Differentiating this with respect to ${\bf\alpha}$ and $\beta$ and setting the derivatives to zero yields the hyperparameter update equations. With $\gamma_i = 1-\alpha_i\Sigma_{ii}$, these are

\begin{align}
\alpha_i^{new} &= {\gamma_i\over m_i^2}\\
\beta^{new} &= {N - \sum_i\gamma_i\over||{\bf t} - {\bf\Phi}{\bf m}||^2}
\end{align}
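A single re-estimation step then looks roughly as follows, using the posterior `mean` and `covariance` from the previous step (a sketch of the same updates that appear in `fit()` below):

import numpy as np

def update_hyperparameters(mean, covariance, alphas, Phi, t):
    N = len(t)
    # Effective number of well-determined parameters, PRML eq. (7.89)
    gamma = 1 - alphas * np.diag(covariance)
    # alpha update, PRML eq. (7.87)
    new_alphas = gamma / np.square(mean)
    # beta update, PRML eq. (7.88)
    new_beta = (N - np.sum(gamma)) / np.sum((t - Phi.dot(mean)) ** 2)
    return new_alphas, new_beta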

Using the new hyperparameters ${\bf\alpha}^{new}, \beta^{new}$ obtained in this way, the posterior distribution of the weight parameters ${\bf w}$ is computed again, and then the hyperparameters are updated again. This is repeated until convergence.

Relevance vectors

This time, since Gaussian functions centered on the training data points are used as the features, the value of each parameter $w_i$ indicates how much the corresponding training point contributes to the prediction. **When automatic relevance determination is applied in the relevance vector machine, many of the parameters $w_i$ become 0, and the training points corresponding to the remaining nonzero parameters are called relevance vectors.** When computing the predictive distribution, using only the features corresponding to the relevance vectors should hardly change the prediction.
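Concretely, after fitting, the relevance vectors can be picked out by thresholding the posterior weight means; the sketch below mirrors what `main()` in the full code does (the threshold 0.1 is just a heuristic, and one could equally threshold the precisions `alphas`):

import numpy as np

def relevance_vectors(regression, x, t, threshold=0.1):
    # Training points whose weight was not driven to (near) zero
    mask = np.abs(regression.mean) > threshold
    return x[mask], t[mask]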

Implementation

Library

Again, the algorithm part is coded only with numpy.

import matplotlib.pyplot as plt
import numpy as np

Relevance vector regression

#A class that performs relevance vector regression
class RelevanceVectorRegression(object):
    #Initialization of hyperparameters
    def __init__(self, alpha=1., beta=1.):
        self.alpha = alpha
        self.beta = beta

    #Calculation of feature vector using Gaussian kernel
    def _kernel(self, x, y):
        return np.exp(-10 * (x - y) ** 2)

    #Hyperparameter alpha,beta estimation
    def fit(self, x, t, iter_max=1000):
        self.x = x
        self.t = t
        N = len(x)

        #Design matrix
        Phi = self._kernel(*np.meshgrid(x, x))
        self.alphas = np.zeros(N) + self.alpha
        for _ in range(iter_max):
            params = np.hstack([self.alphas, self.beta])

            #Covariance of the posterior distribution of the weight parameters w, PRML eq. (7.83)
            self.precision = np.diag(self.alphas) + self.beta * Phi.T.dot(Phi)
            self.covariance = np.linalg.inv(self.precision)

            #Mean of the posterior distribution of the weight parameters w, PRML eq. (7.82)
            self.mean = self.beta * self.covariance.dot(Phi.T).dot(t)

            #Effective number of well-determined parameters, PRML eq. (7.89)
            gamma = 1 - self.alphas * np.diag(self.covariance)

            #Hyperparameter alpha update, PRML eq. (7.87)
            self.alphas = gamma / np.square(self.mean)

            #Clip alpha at 10^10 so that division by zero does not occur
            self.alphas = np.clip(self.alphas, 0, 1e10)

            #Hyperparameter beta update, PRML eq. (7.88)
            self.beta = (N - np.sum(gamma)) / np.sum((t - Phi.dot(self.mean)) ** 2)

            #Stop once the parameter updates become sufficiently small
            if np.allclose(params, np.hstack([self.alphas, self.beta])):
                break
        else:
            #Warn if the parameters did not converge within the specified number of iterations
            print("parameters may not have converged")

    #Calculate posterior predictive distribution for input x
    def predict_dist(self, x):
        K = self._kernel(*np.meshgrid(x, self.x, indexing='ij'))

        #Mean of the posterior predictive distribution, PRML eq. (7.90)
        mean = K.dot(self.mean)

        #Variance of the posterior predictive distribution, PRML eq. (7.91)
        var = 1 / self.beta + np.sum(K.dot(self.covariance) * K, axis=1)

        #Returns the mean and standard deviation of the posterior predictive distribution
        return mean, np.sqrt(var)

Whole code

relevance_vector_regression.py


import matplotlib.pyplot as plt
import numpy as np


class RelevanceVectorRegression(object):

    def __init__(self, alpha=1., beta=1.):
        self.alpha = alpha
        self.beta = beta

    def _kernel(self, x, y):
        return np.exp(-10 * (x - y) ** 2)

    def fit(self, x, t, iter_max=1000):
        self.x = x
        self.t = t
        N = len(x)
        Phi = self._kernel(*np.meshgrid(x, x))
        self.alphas = np.zeros(N) + self.alpha
        for _ in range(iter_max):
            params = np.hstack([self.alphas, self.beta])
            self.precision = np.diag(self.alphas) + self.beta * Phi.T.dot(Phi)
            self.covariance = np.linalg.inv(self.precision)
            self.mean = self.beta * self.covariance.dot(Phi.T).dot(t)
            gamma = 1 - self.alphas * np.diag(self.covariance)
            self.alphas = gamma / np.square(self.mean)
            self.alphas = np.clip(self.alphas, 0, 1e10)
            self.beta = (N - np.sum(gamma)) / np.sum((t - Phi.dot(self.mean)) ** 2)
            if np.allclose(params, np.hstack([self.alphas, self.beta])):
                break
        else:
            print("parameters may not have converged")

    def predict_dist(self, x):
        K = self._kernel(*np.meshgrid(x, self.x, indexing='ij'))
        mean = K.dot(self.mean)
        var = 1 / self.beta + np.sum(K.dot(self.covariance) * K, axis=1)
        return mean, np.sqrt(var)


def create_toy_data(func, low=0., high=1., n=10, std=0.1):
    x = np.random.uniform(low, high, n)
    t = func(x) + np.random.normal(scale=std, size=n)
    return x, t


def main():

    def func(x):
        return np.sin(2 * np.pi * x)

    x, t = create_toy_data(func, n=10)
    plt.scatter(x, t, color="blue", alpha=0.5, label="observation")

    regression = RelevanceVectorRegression()
    regression.fit(x, t)
    relevance_vector = np.abs(regression.mean) > 0.1

    x_test = np.linspace(0, 1, 100)
    plt.scatter(x[relevance_vector], t[relevance_vector], color="green", s=100, marker="D", label="relevance vector")
    plt.plot(x_test, func(x_test), color="blue", label=r"sin($2\pi x$)")
    y, y_std = regression.predict_dist(x_test)
    plt.plot(x_test, y, color="red", label="predict_mean")
    plt.fill_between(x_test, y - y_std, y + y_std, color="pink", alpha=0.5, label="predict_std")
    plt.legend()
    plt.show()


if __name__ == '__main__':
    main()

Result

Regressing on the blue and green points as training data with the relevance vector machine gives the result shown in the figure below (a reproduction of PRML Figure 7.9); the green points are the relevance vectors.

result.png

In closing

It is said that the relevance vector machine can make faster predictions while having generalization performance comparable to the support vector machine, so I would like to take this opportunity to use not only support vector machines but also relevance vector machines. Being able to treat the output in a Bayesian way and evaluate its variance is another nice advantage. However, it is a pity that, contrary to intuition, the predictive variance becomes small in regions where there are no training data points. This time we solved a regression problem with the relevance vector machine, but it can of course be extended to classification problems by using the logistic sigmoid function. If I get the chance, I would like to implement that as well.
