I have been planning to implement the algorithms in PRML using almost nothing but NumPy, and with the Japanese edition of PRML I have now moved on to the second volume. In the kernel methods of Chapter 6, inner products of data points in feature space are computed with a kernel function (the kernel trick), and linear regression or linear classification is performed in that space. I am not very familiar with Gaussian processes, so there are probably mistakes, but for now the code below does seem to work.
In the linear regression we worked on in Chapter 3 and elsewhere, the parameter ${\bf w}$ often followed a Gaussian distribution. Then $y = {\bf w}^{\rm T}\phi(x)$ also follows a one-dimensional Gaussian distribution, and ${\bf y} = {\bf \Phi}{\bf w}$ follows a multidimensional Gaussian distribution, where ${\bf \Phi}$ is the design matrix for $\{x_1, \dots, x_N\}$. In such cases $p({\bf y})$ is said to follow a Gaussian process. **A Gaussian process can be interpreted as an infinite-dimensional Gaussian distribution: it is Gaussian not only in one dimension ($y$ a scalar) or in finitely many dimensions (${\bf y}$ a finite-length vector), but also in infinitely many dimensions (${\bf y}$ an infinitely long vector).**
The joint distribution of ${\bf y}$ is then a Gaussian with mean ${\bf 0}$ and covariance given by the Gram matrix $K$:
p({\bf y}) = \mathcal{N}({\bf y}|{\bf 0},K)
Here the elements of the Gram matrix are $K_{nm} = k(x_n, x_m)$, where $k(x, x')$ is the kernel function. A Gaussian kernel $k(x, x') = a\exp\left(-b(x - x')^2\right)$, with constants $a$ and $b$, is often used. The correlation between $y(x_n)$ and $y(x_m)$ is high when $x_n$ and $x_m$ are similar to each other, and what counts as "similar" depends on which kernel function you use.
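To make this concrete, here is a minimal sketch (not part of the implementation below) that builds the Gram matrix for the Gaussian kernel above and draws a few sample vectors from $\mathcal{N}({\bf 0}, K)$, i.e. sample functions from the Gaussian process prior; the constants `a`, `b` and the input grid are arbitrary choices.

```python
import numpy as np

# Minimal sketch: sample functions from the GP prior p(y) = N(y|0, K).
a, b = 1.0, 10.0                  # arbitrary kernel constants
x = np.linspace(0, 1, 50)         # arbitrary input points x_1, ..., x_N

# Gram matrix K_nm = k(x_n, x_m) = a * exp(-b * (x_n - x_m)**2)
K = a * np.exp(-b * (x[:, None] - x[None, :]) ** 2)

# Draw three sample vectors y ~ N(0, K); a small jitter keeps K numerically positive definite.
y_samples = np.random.multivariate_normal(
    np.zeros(len(x)), K + 1e-8 * np.eye(len(x)), size=3)
```

Nearby inputs get strongly correlated $y$ values, so each row of `y_samples` looks like a smooth random function.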
Let the observed target be $t_n = y_n + \epsilon_n$, where $y_n = {\bf w}^{\rm T}\phi(x_n)$ and $\epsilon_n$ is Gaussian noise added to the $n$-th observation. With precision parameter $\beta$,
p(t_n|y_n) = \mathcal{N}(t_n|y_n,\beta^{-1})
Therefore, writing ${\bf t} = (t_1, \dots, t_N)^{\rm T}$,
p({\bf t}|{\bf y}) = \mathcal{N}({\bf t}|{\bf y},\beta^{-1}{\bf I}_N)
Since we have already defined the joint distribution of ${\bf y}$ above, we can find the marginal distribution of ${\bf t}$:
\begin{align}
p({\bf t}) &= \int p({\bf t}|{\bf y})p({\bf y})d{\bf y}\\
&= \int \mathcal{N}({\bf t}|{\bf y},\beta^{-1}{\bf I}_N)\mathcal{N}({\bf y}|{\bf 0},K)d{\bf y}\\
&= \mathcal{N}({\bf t}|{\bf 0},{\bf C}_N)
\end{align}
where ${\bf C}_N = K + \beta^{-1}{\bf I}_N$.
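For illustration only, the marginal covariance just adds $\beta^{-1}$ to the diagonal of the Gram matrix; a short self-contained sketch with arbitrary constants:

```python
import numpy as np

# Sketch of C_N = K + beta^{-1} I and a draw from the marginal p(t) = N(t|0, C_N).
a, b, beta = 1.0, 10.0, 25.0               # arbitrary kernel constants and noise precision
x = np.linspace(0, 1, 50)
K = a * np.exp(-b * (x[:, None] - x[None, :]) ** 2)
C_N = K + np.eye(len(x)) / beta            # covariance of the marginal distribution of t
t_sample = np.random.multivariate_normal(np.zeros(len(x)), C_N)
```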
Letting $t_{N+1}$ be the target to be newly predicted in addition to the $N$ observations ${\bf t}$, the joint distribution of ${\bf t}$ and $t_{N+1}$ follows from the discussion above:
p({\bf t},t_{N+1}) = \mathcal{N}({\bf t},t_{N+1}|{\bf 0}, {\bf C}_{N+1})
where, with ${\bf k} = \left(k(x_1, x_{N+1}), \dots, k(x_N, x_{N+1})\right)^{\rm T}$ and $c = k(x_{N+1}, x_{N+1}) + \beta^{-1}$, the covariance matrix is
{\bf C}_{N+1} =
\begin{bmatrix}
{\bf C}_N & {\bf k}\\
{\bf k}^{\rm T} & c
\end{bmatrix}
Now that we know the joint distribution of ${\bf t}$ and $t_{N+1}$, we can derive the conditional distribution $p(t_{N+1}|{\bf t})$:
p(t_{N+1}|{\bf t}) = \mathcal{N}(t_{N+1}|{\bf k}^{\rm T}{\bf C}_N^{-1}{\bf t},c-{\bf k}^{\rm T}{\bf C}_N^{-1}{\bf k})
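The following is a minimal NumPy sketch of these two formulas (PRML 6.66 and 6.67) for a single new input; the function name and the kernel constants are hypothetical, not part of the implementation below.

```python
import numpy as np

def gp_predict_one(x_train, t, x_new, a=1.0, b=10.0, beta=25.0):
    # Sketch of the predictive mean and variance for one new input (assumed constants a, b, beta).
    def kernel(p, q):
        return a * np.exp(-b * (p - q) ** 2)
    K = kernel(x_train[:, None], x_train[None, :])          # Gram matrix over the training inputs
    C_inv = np.linalg.inv(K + np.eye(len(x_train)) / beta)  # C_N^{-1}
    k = kernel(x_train, x_new)                              # k_n = k(x_n, x_{N+1})
    c = kernel(x_new, x_new) + 1 / beta                     # c = k(x_{N+1}, x_{N+1}) + beta^{-1}
    mean = k.dot(C_inv).dot(t)                              # k^T C_N^{-1} t
    var = c - k.dot(C_inv).dot(k)                           # c - k^T C_N^{-1} k
    return mean, var
```

The `predict_dist` method later in this article vectorizes the same computation over many test points at once.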
The discussion above assumed a fixed kernel function $k(x, x')$; for example, in the kernel $k(x, x') = a\exp(-b(x-x')^2)$ the values $a, b$ were treated as constants. So how should the constants used in the kernel function be chosen? This parameter estimation corresponds to the hyperparameter estimation implemented in PRML Chapter 3, where the evidence function $p({\bf t}|\theta)$ was maximized. Here $\theta$ denotes the kernel function parameters, for example $\theta = (a, b)^{\rm T}$. The log evidence function is
\ln p({\bf t}|\theta) = -{1\over2}\ln |{\bf C}_N| - {1\over2}{\bf t}^{\rm T}{\bf C}_N^{-1}{\bf t} - {N\over2}\ln(2\pi)
and differentiating it with respect to the parameters gives
{\partial\over\partial\theta_i}\ln p({\bf t}|\theta) = -{1\over2}{\rm Tr}({\bf C}_N^{-1}{\partial{\bf C}_N\over\partial\theta_i}) + {1\over2}{\bf t}^{\rm T}{\bf C}_N^{-1}{\partial{\bf C}_N\over\partial\theta_i}{\bf C}_N^{-1}{\bf t}
I updated the parameters by gradient ascent using this derivative.
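As a sketch of that gradient, using the same Gaussian kernel with its derivatives with respect to $a$ and $b$ written out by hand (the function name is hypothetical):

```python
import numpy as np

def log_evidence_gradient(x, t, a, b, beta):
    # Sketch of d/d(theta_i) ln p(t|theta) for k(x, x') = a * exp(-b * (x - x')**2).
    sq_diff = (x[:, None] - x[None, :]) ** 2
    K = a * np.exp(-b * sq_diff)                      # Gram matrix
    C_inv = np.linalg.inv(K + np.eye(len(x)) / beta)  # C_N^{-1}
    dC_da = np.exp(-b * sq_diff)                      # dC_N/da
    dC_db = -sq_diff * K                              # dC_N/db
    grads = []
    for dC in (dC_da, dC_db):
        grads.append(-0.5 * np.trace(C_inv.dot(dC))
                     + 0.5 * t.dot(C_inv).dot(dC).dot(C_inv).dot(t))
    return np.array(grads)                            # one component per parameter (a, b)
```

A gradient-ascent step is then simply `theta += learning_rate * grads`; the `fit_kernel` method below does essentially the same computation, with the factor of 1/2 folded into the learning rate and with its own parameterization of the kernel.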
import matplotlib.pyplot as plt
import numpy as np
There are only two imports: matplotlib for drawing graphs and numpy for matrix computation.
This time I used $a\exp\left(-{b\over2}(x-x')^2\right)$ as the kernel function.
# Kernel function class
class GaussianKernel(object):

    # Initialize the kernel function parameters a, b
    def __init__(self, params):
        assert np.shape(params) == (2,)
        self.__params = params

    # Return the kernel function parameters a, b
    def get_params(self):
        return np.copy(self.__params)

    # Compute the kernel function value for inputs x, y  PRML eq. (6.63)
    def __call__(self, x, y):
        return self.__params[0] * np.exp(-0.5 * self.__params[1] * (x - y) ** 2)

    # Compute the derivatives with respect to the kernel parameters for inputs x, y
    def derivatives(self, x, y):
        sq_diff = (x - y) ** 2
        # derivative with respect to parameter a
        delta_0 = np.exp(-0.5 * self.__params[1] * sq_diff)
        # derivative with respect to parameter b
        delta_1 = -0.5 * sq_diff * delta_0 * self.__params[0]
        return (delta_0, delta_1)

    # Update the kernel function parameters
    def update_parameters(self, updates):
        assert np.shape(updates) == (2,)
        self.__params += updates
# Class that performs regression with a Gaussian process
class GaussianProcessRegression(object):

    # Initialize with a kernel function and the noise precision parameter
    def __init__(self, kernel, beta=1.):
        self.kernel = kernel
        self.beta = beta

    # Regression without estimating the kernel function parameters
    def fit(self, x, t):
        self.x = x
        self.t = t
        # compute the Gram matrix  PRML eq. (6.54)
        Gram = self.kernel(*np.meshgrid(x, x))
        # compute the covariance matrix  PRML eq. (6.62)
        self.covariance = Gram + np.identity(len(x)) / self.beta
        # compute the precision matrix
        self.precision = np.linalg.inv(self.covariance)

    # Regression that also estimates the kernel function parameters
    def fit_kernel(self, x, t, learning_rate=0.1, iter_max=10000):
        for i in range(iter_max):
            params = self.kernel.get_params()
            # regression with the current kernel parameters
            self.fit(x, t)
            # derivatives of the log evidence function with respect to the parameters
            gradients = self.kernel.derivatives(*np.meshgrid(x, x))
            # compute the parameter updates  PRML eq. (6.70)
            updates = np.array(
                [-np.trace(self.precision.dot(grad)) + t.dot(self.precision.dot(grad).dot(self.precision).dot(t)) for grad in gradients])
            # update the parameters
            self.kernel.update_parameters(learning_rate * updates)
            # stop updating once the parameter change is small
            if np.allclose(params, self.kernel.get_params()):
                break
        else:
            # if the parameters have not converged after the maximum number of iterations, print a warning
            print("parameters may not have converged")

    # Compute the predictive distribution
    def predict_dist(self, x):
        K = self.kernel(*np.meshgrid(x, self.x, indexing='ij'))
        # mean of the predictive distribution  PRML eq. (6.66)
        mean = K.dot(self.precision).dot(self.t)
        # variance of the predictive distribution  PRML eq. (6.67)
        var = self.kernel(x, x) + 1 / self.beta - np.sum(K.dot(self.precision) * K, axis=1)
        return mean.ravel(), np.sqrt(var.ravel())
Replacing `regression.fit_kernel(x, t, ...)` with `regression.fit(x, t)` performs the regression without estimating the kernel parameters.
def main():

    # function that the training data follow
    def func(x):
        return np.sin(2 * np.pi * x)

    # generate training data
    x, t = create_toy_data(func, high=0.7, std=0.1)

    # set up the kernel function; the parameter values here are arbitrary initial guesses
    kernel = GaussianKernel(params=np.array([1., 1.]))

    # set up Gaussian process regression with the kernel function and the precision parameter (set to its true value)
    regression = GaussianProcessRegression(kernel=kernel, beta=100)

    # regression including estimation of the kernel parameters; replace with fit(x, t) to regress without estimating them
    regression.fit_kernel(x, t, learning_rate=0.1, iter_max=10000)

    # compute the predictive distribution for the test data
    x_test = np.linspace(0, 1, 100)
    y, y_std = regression.predict_dist(x_test)

    # plot the regression results
    plt.scatter(x, t, alpha=0.5, color="blue", label="observation")
    plt.plot(x_test, func(x_test), color="blue", label="sin$(2\pi x)$")
    plt.plot(x_test, y, color="red", label="predict_mean")
    plt.fill_between(x_test, y - y_std, y + y_std, color="pink", alpha=0.5, label="predict_std")
    plt.legend(loc="lower left")
    plt.show()
gaussian_process_regression.py
import matplotlib.pyplot as plt
import numpy as np


class GaussianKernel(object):

    def __init__(self, params):
        assert np.shape(params) == (2,)
        self.__params = params

    def get_params(self):
        return np.copy(self.__params)

    def __call__(self, x, y):
        return self.__params[0] * np.exp(-0.5 * self.__params[1] * (x - y) ** 2)

    def derivatives(self, x, y):
        sq_diff = (x - y) ** 2
        delta_0 = np.exp(-0.5 * self.__params[1] * sq_diff)
        delta_1 = -0.5 * sq_diff * delta_0 * self.__params[0]
        return (delta_0, delta_1)

    def update_parameters(self, updates):
        assert np.shape(updates) == (2,)
        self.__params += updates


class GaussianProcessRegression(object):

    def __init__(self, kernel, beta=1.):
        self.kernel = kernel
        self.beta = beta

    def fit(self, x, t):
        self.x = x
        self.t = t
        Gram = self.kernel(*np.meshgrid(x, x))
        self.covariance = Gram + np.identity(len(x)) / self.beta
        self.precision = np.linalg.inv(self.covariance)

    def fit_kernel(self, x, t, learning_rate=0.1, iter_max=10000):
        for i in range(iter_max):
            params = self.kernel.get_params()
            self.fit(x, t)
            gradients = self.kernel.derivatives(*np.meshgrid(x, x))
            updates = np.array(
                [-np.trace(self.precision.dot(grad)) + t.dot(self.precision.dot(grad).dot(self.precision).dot(t)) for grad in gradients])
            self.kernel.update_parameters(learning_rate * updates)
            if np.allclose(params, self.kernel.get_params()):
                break
        else:
            print("parameters may not have converged")

    def predict_dist(self, x):
        K = self.kernel(*np.meshgrid(x, self.x, indexing='ij'))
        mean = K.dot(self.precision).dot(self.t)
        var = self.kernel(x, x) + 1 / self.beta - np.sum(K.dot(self.precision) * K, axis=1)
        return mean.ravel(), np.sqrt(var.ravel())


def create_toy_data(func, low=0, high=1., n=10, std=1.):
    x = np.random.uniform(low, high, n)
    t = func(x) + np.random.normal(scale=std, size=n)
    return x, t


def main():

    def func(x):
        return np.sin(2 * np.pi * x)

    x, t = create_toy_data(func, high=0.7, std=0.1)
    kernel = GaussianKernel(params=np.array([1., 1.]))
    regression = GaussianProcessRegression(kernel=kernel, beta=100)
    regression.fit_kernel(x, t, learning_rate=0.1, iter_max=10000)
    x_test = np.linspace(0, 1, 100)
    y, y_std = regression.predict_dist(x_test)
    plt.scatter(x, t, alpha=0.5, color="blue", label="observation")
    plt.plot(x_test, func(x_test), color="blue", label="sin$(2\pi x)$")
    plt.plot(x_test, y, color="red", label="predict_mean")
    plt.fill_between(x_test, y - y_std, y + y_std, color="pink", alpha=0.5, label="predict_std")
    plt.legend(loc="lower left")
    plt.show()


if __name__ == '__main__':
    main()
Running this produces a figure like Figure 6.8 in PRML. The variance of the predictive distribution is large in regions without training data and small in regions with training data.
By the way, the figure below shows the result on the same training data when maximum likelihood estimation of the hyperparameters is not performed. The predictive distribution above fits the training data (and test data) better.
In the end, what we are doing is similar to Bayesian linear regression, except that instead of using the feature vectors explicitly, the kernel trick lets us regress using only inner products of feature vectors. Although it is not covered in PRML, Gaussian processes are apparently used in a method called Bayesian optimization, which seems to be attracting attention recently; Bayesian optimization is reportedly used to choose parameters such as the learning rate of a neural network. I would like to implement Bayesian optimization if I get the chance.