This time I implemented the relevance vector machine. It is not as well known as the support vector machine, but it has advantages such as producing probabilistic outputs. I suspect many people would use it if it were available in scikit-learn (a library that implements machine learning methods), but it is not included there because Microsoft holds a patent on it.
Chapter 7 of PRML is titled "Sparse Kernel Machines", but the discussion of the relevance vector machine is valid in general, even without the kernel method. For the parameters of a model such as linear regression, a prior of the form
p({\bf w}|{\bf \alpha}) = \prod_i\mathcal{N}(w_i|0,\alpha_i^{-1})
is used, **introducing a separate hyperparameter for each parameter**. The hyperparameters ${\bf\alpha}$ are then estimated in the same way as in the evidence approximation of Chapter 3. Many components of the estimated ${\bf\alpha}$ become infinite, which means the variance $\alpha_i^{-1}$ of the corresponding parameter $w_i$ approaches 0, so that $w_i$ is sharply concentrated around the prior mean of 0. The parameter vector ${\bf w}$ therefore becomes sparse, and highly relevant features are selected automatically (automatic relevance determination).
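As a minimal illustration of this prior (my own sketch, not part of the implementation below), the following samples weights with one ordinary precision and one very large precision; the component with the huge precision is effectively pinned to zero.

```python
import numpy as np

# Minimal sketch of the ARD prior p(w|alpha) = prod_i N(w_i | 0, 1/alpha_i).
# The precisions below are illustrative values, not fitted ones.
alphas = np.array([1., 1e10])   # small alpha: free weight, huge alpha: pruned weight
samples = np.random.normal(scale=np.sqrt(1 / alphas), size=(5, 2))
print(samples)                  # the second column is essentially zero
```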
The likelihood function for the weight parameter ${\bf w}$, given $N$ observed training data points $\{x_n, t_n\}_{n=1}^N$, is
\begin{align}
p({\bf t}|{\bf\Phi},{\bf w},\beta) &= \prod_{n=1}^N p(t_n|\phi(x_n),{\bf w}, \beta)\\
&= \prod_{n=1}^N \mathcal{N}(t_n|{\bf w}^{\rm T}\phi(x_n), \beta^{-1})
\end{align}
where ${\bf t} = (t_1,\dots,t_N)^{\rm T}$ and $\phi(\cdot)$ is the following feature vector made of Gaussian functions centered on the training data points,
\phi(x) =
\begin{bmatrix}
\phi_1(x)\\
\vdots\\
\phi_N(x)
\end{bmatrix}
=
\begin{bmatrix}
a\exp(-b(x - x_1)^2)\\
\vdots\\
a\exp(-b(x-x_N)^2)
\end{bmatrix}
and ${\bf\Phi}$ is the $N\times N$ design matrix whose elements are $\Phi_{ni} = \phi_i(x_n)$. Using $p({\bf w}|{\bf\alpha})$ as the prior, Bayes' theorem gives the posterior distribution of the weight parameter ${\bf w}$ as
p({\bf w}|{\bf t},{\bf\Phi},{\bf\alpha},\beta) = \mathcal{N}({\bf w}|{\bf m},{\bf \Sigma})
where the mean and covariance are
\begin{align}
{\bf m} &= \beta{\bf\Sigma}{\bf\Phi}^{\rm T}{\bf t}\\
{\bf\Sigma} &= \left({\bf A} + \beta{\bf\Phi}^{\rm T}{\bf\Phi}\right)^{-1}
\end{align}
and the matrix ${\bf A}$ is a diagonal matrix with elements $A_{ii} = \alpha_i$.
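A small NumPy sketch of these two equations, previewing the implementation further below (`Phi`, `t`, `alphas`, and `beta` are assumed to be given; the function name is my own):

```python
import numpy as np

def posterior(Phi, t, alphas, beta):
    """Posterior over the weights, PRML eqs. (7.82) and (7.83).

    Phi is the N x N design matrix with Phi[n, i] = phi_i(x_n),
    t the target vector, alphas the weight precisions, beta the noise precision.
    """
    Sigma = np.linalg.inv(np.diag(alphas) + beta * Phi.T.dot(Phi))  # covariance, eq. (7.83)
    m = beta * Sigma.dot(Phi.T).dot(t)                              # mean, eq. (7.82)
    return m, Sigma
```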
Based on the discussion so far, the hyperparameters ${\bf\alpha}, \beta$ are estimated by maximizing the evidence. Writing ${\bf C} = \beta^{-1}{\bf I} + {\bf\Phi}{\bf A}^{-1}{\bf\Phi}^{\rm T}$, the log evidence function is
\begin{align}
\ln p({\bf t}|{\bf\Phi},{\bf\alpha},\beta) &= \ln\mathcal{N}({\bf t}|{\bf 0},{\bf C})\\
&= -{1\over2}\left\{N\ln(2\pi) + \ln|{\bf C}| + {\bf t}^{\rm T}{\bf C}^{-1}{\bf t}\right\}
\end{align}
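The fitting loop below does not evaluate this quantity, but a short helper like the following sketch (my own addition, using only the definitions above) can be used to monitor convergence:

```python
import numpy as np

def log_evidence(Phi, t, alphas, beta):
    """Log evidence of the targets given the hyperparameters (equation above)."""
    N = len(t)
    C = np.eye(N) / beta + Phi.dot(np.diag(1 / alphas)).dot(Phi.T)
    sign, logdet = np.linalg.slogdet(C)                 # numerically stable log-determinant
    return -0.5 * (N * np.log(2 * np.pi) + logdet + t.dot(np.linalg.solve(C, t)))
```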
Setting the derivatives of the log evidence with respect to ${\bf\alpha}$ and $\beta$ to zero yields the hyperparameter update equations. With $\gamma_i = 1 - \alpha_i\Sigma_{ii}$,
\begin{align}
\alpha_i^{new} &= {\gamma_i\over m_i^2}\\
\beta^{new} &= {N - \sum_i\gamma_i\over||{\bf t} - {\bf\Phi}{\bf m}||^2}
\end{align}
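These re-estimation equations translate directly into NumPy; the following sketch (again a preview of the class below, with a function name of my own) performs one update given the current posterior:

```python
import numpy as np

def update_hyperparameters(m, Sigma, Phi, t, alphas):
    """One hyperparameter update, PRML eqs. (7.87)-(7.89)."""
    N = len(t)
    gamma = 1 - alphas * np.diag(Sigma)                           # eq. (7.89)
    new_alphas = gamma / m ** 2                                   # eq. (7.87)
    new_beta = (N - gamma.sum()) / np.sum((t - Phi.dot(m)) ** 2)  # eq. (7.88)
    return new_alphas, new_beta
```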
Using the new hyperparameters ${\bf\alpha}^{new}, \beta^{new}$ obtained in this way, the posterior distribution of the weight parameter ${\bf w}$ is computed again, and the hyperparameter updates are repeated until convergence.
Since Gaussian functions centered on the training data points are used as features here, the value of each parameter $w_i$ can be read as how much the corresponding training point contributes to the prediction. **When automatic relevance determination is applied in the relevance vector machine, many of the parameters $w_i$ become 0, and the training points corresponding to the remaining parameters are called relevance vectors**. When computing the predictive distribution, using only the features corresponding to the relevance vectors should change the result very little.
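A hedged sketch of this pruning idea (the threshold of 0.1 is arbitrary and mirrors the one used in `main()` further below; the kernel matches the one in the implementation):

```python
import numpy as np

def predict_with_relevance_vectors(x_new, x_train, mean, threshold=0.1):
    """Predictive mean using only the basis functions of the relevance vectors."""
    keep = np.abs(mean) > threshold                                   # relevance vectors
    K = np.exp(-10 * (x_new[:, None] - x_train[keep][None, :]) ** 2)  # reduced design matrix
    return K.dot(mean[keep])
```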
As before, only NumPy is used to implement the algorithm itself.
```python
import matplotlib.pyplot as plt
import numpy as np


# A class that performs relevance vector regression
class RelevanceVectorRegression(object):

    # Initialization of the hyperparameters
    def __init__(self, alpha=1., beta=1.):
        self.alpha = alpha
        self.beta = beta

    # Compute the feature vector with a Gaussian kernel
    def _kernel(self, x, y):
        return np.exp(-10 * (x - y) ** 2)

    # Estimate the hyperparameters alpha and beta
    def fit(self, x, t, iter_max=1000):
        self.x = x
        self.t = t
        N = len(x)
        # Design matrix
        Phi = self._kernel(*np.meshgrid(x, x))
        self.alphas = np.zeros(N) + self.alpha
        for _ in range(iter_max):
            params = np.hstack([self.alphas, self.beta])
            # Covariance of the posterior over the weights w, PRML eq. (7.83)
            self.precision = np.diag(self.alphas) + self.beta * Phi.T.dot(Phi)
            self.covariance = np.linalg.inv(self.precision)
            # Mean of the posterior over the weights w, PRML eq. (7.82)
            self.mean = self.beta * self.covariance.dot(Phi.T).dot(t)
            # Well-determinedness of each parameter, PRML eq. (7.89)
            gamma = 1 - self.alphas * np.diag(self.covariance)
            # Hyperparameter update, PRML eq. (7.87)
            self.alphas = gamma / np.square(self.mean)
            # Clip alphas above 10^10 so that division by zero does not occur
            self.alphas = np.clip(self.alphas, 0, 1e10)
            # Hyperparameter update, PRML eq. (7.88)
            self.beta = (N - np.sum(gamma)) / np.sum((t - Phi.dot(self.mean)) ** 2)
            # Stop when the parameter updates become small
            if np.allclose(params, np.hstack([self.alphas, self.beta])):
                break
        else:
            # Print a warning if the loop ran the specified number of times without converging
            print("parameters may not have converged")

    # Compute the posterior predictive distribution for input x
    def predict_dist(self, x):
        K = self._kernel(*np.meshgrid(x, self.x, indexing='ij'))
        # Mean of the posterior predictive distribution, PRML eq. (7.90)
        mean = K.dot(self.mean)
        # Variance of the posterior predictive distribution, PRML eq. (7.91)
        var = 1 / self.beta + np.sum(K.dot(self.covariance) * K, axis=1)
        # Return the mean and standard deviation of the posterior predictive distribution
        return mean, np.sqrt(var)
```
relevance_vector_regression.py

```python
import matplotlib.pyplot as plt
import numpy as np


class RelevanceVectorRegression(object):

    def __init__(self, alpha=1., beta=1.):
        self.alpha = alpha
        self.beta = beta

    def _kernel(self, x, y):
        return np.exp(-10 * (x - y) ** 2)

    def fit(self, x, t, iter_max=1000):
        self.x = x
        self.t = t
        N = len(x)
        Phi = self._kernel(*np.meshgrid(x, x))
        self.alphas = np.zeros(N) + self.alpha
        for _ in range(iter_max):
            params = np.hstack([self.alphas, self.beta])
            self.precision = np.diag(self.alphas) + self.beta * Phi.T.dot(Phi)
            self.covariance = np.linalg.inv(self.precision)
            self.mean = self.beta * self.covariance.dot(Phi.T).dot(t)
            gamma = 1 - self.alphas * np.diag(self.covariance)
            self.alphas = gamma / np.square(self.mean)
            self.alphas = np.clip(self.alphas, 0, 1e10)
            self.beta = (N - np.sum(gamma)) / np.sum((t - Phi.dot(self.mean)) ** 2)
            if np.allclose(params, np.hstack([self.alphas, self.beta])):
                break
        else:
            print("parameters may not have converged")

    def predict_dist(self, x):
        K = self._kernel(*np.meshgrid(x, self.x, indexing='ij'))
        mean = K.dot(self.mean)
        var = 1 / self.beta + np.sum(K.dot(self.covariance) * K, axis=1)
        return mean, np.sqrt(var)


def create_toy_data(func, low=0., high=1., n=10, std=0.1):
    x = np.random.uniform(low, high, n)
    t = func(x) + np.random.normal(scale=std, size=n)
    return x, t


def main():

    def func(x):
        return np.sin(2 * np.pi * x)

    x, t = create_toy_data(func, n=10)
    plt.scatter(x, t, color="blue", alpha=0.5, label="observation")

    regression = RelevanceVectorRegression()
    regression.fit(x, t)
    relevance_vector = np.abs(regression.mean) > 0.1

    x_test = np.linspace(0, 1, 100)
    plt.scatter(x[relevance_vector], t[relevance_vector], color="green", s=100, marker="D", label="relevance vector")
    plt.plot(x_test, func(x_test), color="blue", label=r"sin($2\pi x$)")
    y, y_std = regression.predict_dist(x_test)
    plt.plot(x_test, y, color="red", label="predict_mean")
    plt.fill_between(x_test, y - y_std, y + y_std, color="pink", alpha=0.5, label="predict_std")
    plt.legend()
    plt.show()


if __name__ == '__main__':
    main()
```
Regressing the blue and green points as training data with the relevance vector machine gives the result shown in the figure below (a reproduction of PRML Figure 7.9). The green points are the relevance vectors.
The relevance vector machine is said to allow faster predictions while having generalization performance comparable to the support vector machine, so I would like to take this opportunity to use not only the support vector machine but also the relevance vector machine. Being able to treat the output in a Bayesian way and evaluate its variance is another nice advantage. However, it is a pity that, counter to intuition, the predictive variance becomes small in regions where there are no training data points. This time we solved a regression problem with the relevance vector machine, but it can of course be extended to classification by using the logistic sigmoid function. If I have a chance, I would like to implement that as well.