This time I implemented the relevance vector machine. It is not as well known as the support vector machine, but it has advantages such as producing probabilistic outputs. I suspect many people would use it if it were available in scikit-learn (a library that implements machine learning methods), but it is not included there because Microsoft holds a patent on it.
Chapter 7 of PRML is titled "Sparse Kernel Machines", but the discussion of the relevance vector machine is valid in general, even without the kernel method. For the parameters of a model such as linear regression, a prior of the form
p({\bf w}|{\bf \alpha}) = \prod_i\mathcal{N}(w_i|0,\alpha_i^{-1})
is used, **introducing a separate hyperparameter for each parameter**. The hyperparameters ${\bf\alpha}$ are then estimated in the same way as in the evidence approximation of Chapter 3. Many components of the estimated ${\bf\alpha}$ become infinite, which means the variance $\alpha_i^{-1}$ of the corresponding parameter $w_i$ approaches 0, so that $w_i$ is sharply concentrated around the prior mean of 0. The parameter vector ${\bf w}$ therefore becomes sparse, and highly relevant features are selected automatically (automatic relevance determination).
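As a minimal illustration of this prior (my own sketch, not part of the implementation below), the following samples weights with one ordinary precision and one very large precision; the component with the huge precision is effectively pinned to zero.

```python
import numpy as np

# Minimal sketch of the ARD prior p(w|alpha) = prod_i N(w_i | 0, 1/alpha_i).
# The precisions below are illustrative values, not fitted ones.
alphas = np.array([1., 1e10])   # small alpha: free weight, huge alpha: pruned weight
samples = np.random.normal(scale=np.sqrt(1 / alphas), size=(5, 2))
print(samples)                  # the second column is essentially zero
```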
The likelihood function for the weight parameter ${\bf w}$, given $N$ observed training data points $\{x_n, t_n\}_{n=1}^N$, is
\begin{align}
p({\bf t}|{\bf\Phi},{\bf w},\beta) &= \prod_{n=1}^N p(t_n|\phi(x_n),{\bf w}, \beta)\\
&= \prod_{n=1}^N \mathcal{N}(t_n|{\bf w}^{\rm T}\phi(x_n), \beta^{-1})
\end{align}
where ${\bf t} = (t_1,\dots,t_N)^{\rm T}$ and $\phi(\cdot)$ is the following feature vector made of Gaussian functions centered on the training data points,
\phi(x) =
\begin{bmatrix}
\phi_1(x)\\
\vdots\\
\phi_N(x)
\end{bmatrix}
=
\begin{bmatrix}
a\exp(-b(x - x_1)^2)\\
\vdots\\
a\exp(-b(x-x_N)^2)
\end{bmatrix}
and ${\bf\Phi}$ is the $N\times N$ design matrix whose elements are $\Phi_{ni} = \phi_i(x_n)$. Using $p({\bf w}|{\bf\alpha})$ as the prior, Bayes' theorem gives the posterior distribution of the weight parameter ${\bf w}$ as
p({\bf w}|{\bf t},{\bf\Phi},{\bf\alpha},\beta) = \mathcal{N}({\bf w}|{\bf m},{\bf \Sigma})
where the mean and covariance are
\begin{align}
{\bf m} &= \beta{\bf\Sigma}{\bf\Phi}^{\rm T}{\bf t}\\
{\bf\Sigma} &= \left({\bf A} + \beta{\bf\Phi}^{\rm T}{\bf\Phi}\right)^{-1}
\end{align}
and the matrix ${\bf A}$ is a diagonal matrix with elements $A_{ii} = \alpha_i$.
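A small NumPy sketch of these two equations, previewing the implementation further below (`Phi`, `t`, `alphas`, and `beta` are assumed to be given; the function name is my own):

```python
import numpy as np

def posterior(Phi, t, alphas, beta):
    """Posterior over the weights, PRML eqs. (7.82) and (7.83).

    Phi is the N x N design matrix with Phi[n, i] = phi_i(x_n),
    t the target vector, alphas the weight precisions, beta the noise precision.
    """
    Sigma = np.linalg.inv(np.diag(alphas) + beta * Phi.T.dot(Phi))  # covariance, eq. (7.83)
    m = beta * Sigma.dot(Phi.T).dot(t)                              # mean, eq. (7.82)
    return m, Sigma
```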
Based on the discussion so far, the hyperparameters ${\bf\alpha}, \beta$ are estimated by maximizing the evidence. Writing ${\bf C} = \beta^{-1}{\bf I} + {\bf\Phi}{\bf A}^{-1}{\bf\Phi}^{\rm T}$, the log evidence function is
\begin{align}
\ln p({\bf t}|{\bf\Phi},{\bf\alpha},\beta) &= \ln\mathcal{N}({\bf t}|{\bf 0},{\bf C})\\
&= -{1\over2}\left\{N\ln(2\pi) + \ln|{\bf C}| + {\bf t}^{\rm T}{\bf C}^{-1}{\bf t}\right\}
\end{align}
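The fitting loop below does not evaluate this quantity, but a short helper like the following sketch (my own addition, using only the definitions above) can be used to monitor convergence:

```python
import numpy as np

def log_evidence(Phi, t, alphas, beta):
    """Log evidence of the targets given the hyperparameters (equation above)."""
    N = len(t)
    C = np.eye(N) / beta + Phi.dot(np.diag(1 / alphas)).dot(Phi.T)
    sign, logdet = np.linalg.slogdet(C)                 # numerically stable log-determinant
    return -0.5 * (N * np.log(2 * np.pi) + logdet + t.dot(np.linalg.solve(C, t)))
```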
Setting the derivatives of the log evidence with respect to ${\bf\alpha}$ and $\beta$ to zero yields the hyperparameter update equations. With $\gamma_i = 1 - \alpha_i\Sigma_{ii}$,
\begin{align}
\alpha_i^{new} &= {\gamma_i\over m_i^2}\\
\beta^{new} &= {N - \sum_i\gamma_i\over||{\bf t} - {\bf\Phi}{\bf m}||^2}
\end{align}
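These re-estimation equations translate directly into NumPy; the following sketch (again a preview of the class below, with a function name of my own) performs one update given the current posterior:

```python
import numpy as np

def update_hyperparameters(m, Sigma, Phi, t, alphas):
    """One hyperparameter update, PRML eqs. (7.87)-(7.89)."""
    N = len(t)
    gamma = 1 - alphas * np.diag(Sigma)                           # eq. (7.89)
    new_alphas = gamma / m ** 2                                   # eq. (7.87)
    new_beta = (N - gamma.sum()) / np.sum((t - Phi.dot(m)) ** 2)  # eq. (7.88)
    return new_alphas, new_beta
```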
Using the new hyperparameters ${\bf\alpha}^{new}, \beta^{new}$ obtained in this way, the posterior distribution of the weight parameter ${\bf w}$ is computed again, and the hyperparameter updates are repeated until convergence.
Since Gaussian functions centered on the training data points are used as features here, the value of each parameter $w_i$ can be read as how much the corresponding training point contributes to the prediction. **When automatic relevance determination is applied in the relevance vector machine, many of the parameters $w_i$ become 0, and the training points corresponding to the remaining parameters are called relevance vectors**. When computing the predictive distribution, using only the features corresponding to the relevance vectors should change the result very little.
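A hedged sketch of this pruning idea (the threshold of 0.1 is arbitrary and mirrors the one used in `main()` further below; the kernel matches the one in the implementation):

```python
import numpy as np

def predict_with_relevance_vectors(x_new, x_train, mean, threshold=0.1):
    """Predictive mean using only the basis functions of the relevance vectors."""
    keep = np.abs(mean) > threshold                                   # relevance vectors
    K = np.exp(-10 * (x_new[:, None] - x_train[keep][None, :]) ** 2)  # reduced design matrix
    return K.dot(mean[keep])
```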
As before, only NumPy is used to implement the algorithm itself.
```python
import matplotlib.pyplot as plt
import numpy as np


# A class that performs relevance vector regression
class RelevanceVectorRegression(object):

    # Initialization of the hyperparameters
    def __init__(self, alpha=1., beta=1.):
        self.alpha = alpha
        self.beta = beta

    # Compute the feature vector with a Gaussian kernel
    def _kernel(self, x, y):
        return np.exp(-10 * (x - y) ** 2)

    # Estimate the hyperparameters alpha and beta
    def fit(self, x, t, iter_max=1000):
        self.x = x
        self.t = t
        N = len(x)
        # Design matrix
        Phi = self._kernel(*np.meshgrid(x, x))
        self.alphas = np.zeros(N) + self.alpha
        for _ in range(iter_max):
            params = np.hstack([self.alphas, self.beta])
            # Covariance of the posterior over the weights w, PRML eq. (7.83)
            self.precision = np.diag(self.alphas) + self.beta * Phi.T.dot(Phi)
            self.covariance = np.linalg.inv(self.precision)
            # Mean of the posterior over the weights w, PRML eq. (7.82)
            self.mean = self.beta * self.covariance.dot(Phi.T).dot(t)
            # Well-determinedness of each parameter, PRML eq. (7.89)
            gamma = 1 - self.alphas * np.diag(self.covariance)
            # Hyperparameter update, PRML eq. (7.87)
            self.alphas = gamma / np.square(self.mean)
            # Clip alphas above 10^10 so that division by zero does not occur
            self.alphas = np.clip(self.alphas, 0, 1e10)
            # Hyperparameter update, PRML eq. (7.88)
            self.beta = (N - np.sum(gamma)) / np.sum((t - Phi.dot(self.mean)) ** 2)
            # Stop when the parameter updates become small
            if np.allclose(params, np.hstack([self.alphas, self.beta])):
                break
        else:
            # Print a warning if the loop ran the specified number of times without converging
            print("parameters may not have converged")

    # Compute the posterior predictive distribution for input x
    def predict_dist(self, x):
        K = self._kernel(*np.meshgrid(x, self.x, indexing='ij'))
        # Mean of the posterior predictive distribution, PRML eq. (7.90)
        mean = K.dot(self.mean)
        # Variance of the posterior predictive distribution, PRML eq. (7.91)
        var = 1 / self.beta + np.sum(K.dot(self.covariance) * K, axis=1)
        # Return the mean and standard deviation of the posterior predictive distribution
        return mean, np.sqrt(var)
```
relevance_vector_regression.py

```python
import matplotlib.pyplot as plt
import numpy as np


class RelevanceVectorRegression(object):

    def __init__(self, alpha=1., beta=1.):
        self.alpha = alpha
        self.beta = beta

    def _kernel(self, x, y):
        return np.exp(-10 * (x - y) ** 2)

    def fit(self, x, t, iter_max=1000):
        self.x = x
        self.t = t
        N = len(x)
        Phi = self._kernel(*np.meshgrid(x, x))
        self.alphas = np.zeros(N) + self.alpha
        for _ in range(iter_max):
            params = np.hstack([self.alphas, self.beta])
            self.precision = np.diag(self.alphas) + self.beta * Phi.T.dot(Phi)
            self.covariance = np.linalg.inv(self.precision)
            self.mean = self.beta * self.covariance.dot(Phi.T).dot(t)
            gamma = 1 - self.alphas * np.diag(self.covariance)
            self.alphas = gamma / np.square(self.mean)
            self.alphas = np.clip(self.alphas, 0, 1e10)
            self.beta = (N - np.sum(gamma)) / np.sum((t - Phi.dot(self.mean)) ** 2)
            if np.allclose(params, np.hstack([self.alphas, self.beta])):
                break
        else:
            print("parameters may not have converged")

    def predict_dist(self, x):
        K = self._kernel(*np.meshgrid(x, self.x, indexing='ij'))
        mean = K.dot(self.mean)
        var = 1 / self.beta + np.sum(K.dot(self.covariance) * K, axis=1)
        return mean, np.sqrt(var)


def create_toy_data(func, low=0., high=1., n=10, std=0.1):
    x = np.random.uniform(low, high, n)
    t = func(x) + np.random.normal(scale=std, size=n)
    return x, t


def main():

    def func(x):
        return np.sin(2 * np.pi * x)

    x, t = create_toy_data(func, n=10)
    plt.scatter(x, t, color="blue", alpha=0.5, label="observation")

    regression = RelevanceVectorRegression()
    regression.fit(x, t)
    relevance_vector = np.abs(regression.mean) > 0.1

    x_test = np.linspace(0, 1, 100)
    plt.scatter(x[relevance_vector], t[relevance_vector], color="green", s=100, marker="D", label="relevance vector")
    plt.plot(x_test, func(x_test), color="blue", label=r"sin($2\pi x$)")
    y, y_std = regression.predict_dist(x_test)
    plt.plot(x_test, y, color="red", label="predict_mean")
    plt.fill_between(x_test, y - y_std, y + y_std, color="pink", alpha=0.5, label="predict_std")
    plt.legend()
    plt.show()


if __name__ == '__main__':
    main()
```
Regressing the blue and green points as training data with the relevance vector machine gives the result shown in the figure below (a reproduction of PRML Figure 7.9). The green points are the relevance vectors.
The relevance vector machine is said to allow faster predictions while having generalization performance comparable to the support vector machine, so I would like to take this opportunity to use not only the support vector machine but also the relevance vector machine. Being able to treat the output in a Bayesian way and evaluate its variance is another nice advantage. However, it is a pity that, counter to intuition, the predictive variance becomes small in regions where there are no training data points. This time we solved a regression problem with the relevance vector machine, but it can of course be extended to classification by using the logistic sigmoid function. If I have a chance, I would like to implement that as well.