The Gaussian process originated as a continuous-time stochastic process, but it is also widely used in machine learning as a probabilistic model. The key idea is that **a Gaussian process is an infinite-dimensional multivariate Gaussian distribution**. Viewed as a distribution over functions, the outputs for any set of inputs jointly follow a Gaussian distribution. A kernel defines the similarity between inputs: if two inputs are similar, their outputs are similar, which is why the predictions come out as smooth curves. From another point of view, you can think of it as a linear regression model whose basis functions are laid out densely (in the limit, infinitely) over the input space. Given observed data, the kernel matrix is used to compute the mean and variance at unobserved locations (the distribution remains Gaussian). In this computation, the kernel trick lets us obtain the output without ever computing feature vectors explicitly.
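To make this concrete, here is a minimal sketch in plain torch (the helper names rbf_kernel and gp_posterior are my own, not part of Pyro): any finite set of outputs is jointly Gaussian, and conditioning that joint Gaussian on the observed outputs yields the predictive mean and covariance at the test points. Pyro's gp module does essentially this for us below, adding hyperparameter learning on top.
import torch

def rbf_kernel(x1, x2, variance=1.0, lengthscale=0.5):
    # k(x, x') = variance * exp(-(x - x')^2 / (2 * lengthscale^2))
    sqdist = (x1.unsqueeze(1) - x2.unsqueeze(0)) ** 2
    return variance * torch.exp(-0.5 * sqdist / lengthscale ** 2)

def gp_posterior(X, y, Xtest, noise=0.01):
    # Condition the joint Gaussian of (training outputs, test outputs) on the observed y.
    Kxx = rbf_kernel(X, X) + noise * torch.eye(len(X))  # train-train (+ observation noise)
    Kxs = rbf_kernel(X, Xtest)                           # train-test
    Kss = rbf_kernel(Xtest, Xtest)                       # test-test
    mean = Kxs.T @ torch.linalg.solve(Kxx, y)
    cov = Kss - Kxs.T @ torch.linalg.solve(Kxx, Kxs)
    return mean, cov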
For a theoretical explanation of the Gaussian process, refer to "Gaussian Process and Machine Learning (Machine Learning Professional Series)". Interestingly, it turns out that the Gaussian process is equivalent to a neural network with an infinite number of units (https://oumpy.github.io/blog/2020/04/neural_tangents.html).
This time, I will implement a Gaussian process with Pyro, which was introduced in the previous article. I will work through an example from Pyro's official tutorial; the official documentation is also helpful.
import matplotlib.pyplot as plt
import torch
import pyro
import pyro.contrib.gp as gp
import pyro.distributions as dist
pyro.set_rng_seed(100)
Take 20 points from y = 0.5 * sin(3x), with Gaussian observation noise, and use them as the observed data.
N = 20
X = dist.Uniform(0.0, 5.0).sample(sample_shape=(N,))
y = 0.5 * torch.sin(3*X) + dist.Normal(0.0, 0.2).sample(sample_shape=(N,))
plt.plot(X.numpy(), y.numpy(), 'kx')
From these 20 observed points, we predict the original function (in this case, y = 0.5 * sin(3x)). Choose a kernel and perform Gaussian process regression.
# Set hyperparameters
variance = torch.tensor(0.1)
lengthscale = torch.tensor(0.1)
noise = torch.tensor(0.01)
# Regression
kernel = gp.kernels.RBF(input_dim=1, variance=variance, lengthscale=lengthscale)
gpr = gp.models.GPRegression(X, y, kernel, noise=noise)
That completes the regression model. Let's display the prediction result.
Xtest = torch.linspace(-0.5, 5.5, 500)
with torch.no_grad():
    mean, cov = gpr(Xtest, full_cov=True, noiseless=False)
    sd = cov.diag().sqrt()
plt.plot(Xtest.numpy(), mean.numpy(), 'r', lw=2)
plt.fill_between(Xtest.numpy(), (mean - 2.0 * sd).numpy(), (mean + 2.0 * sd).numpy(), color='C0', alpha=0.3)
plt.plot(X.numpy(), y.numpy(), 'kx')
In Gaussian process regression, the prediction is expressed as a cloud of functions (a set of prediction curves), as in the figure above. In this respect it resembles the result of Bayesian linear regression. However, this figure assumes that the kernel hyperparameters (variance, lengthscale, noise) were chosen appropriately. To tune them, we optimize the hyperparameters by gradient descent on a variational objective (MCMC would also work).
optimizer = torch.optim.Adam(gpr.parameters(), lr=0.005)
loss_fn = pyro.infer.Trace_ELBO().differentiable_loss
losses = []
num_steps = 2500
for i in range(num_steps):
    optimizer.zero_grad()
    loss = loss_fn(gpr.model, gpr.guide)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
# Plot the loss curve
plt.plot(losses)
plt.xlabel("step")
plt.ylabel("loss")
# Display the optimization results
print('variance = {}'.format(gpr.kernel.variance))
print('lengthscale = {}'.format(gpr.kernel.lengthscale))
print('noise = {}'.format(gpr.noise))
variance = 0.15705525875091553
lengthscale = 0.4686208963394165
noise = 0.017524730414152145
Let's display the prediction result again.
Xtest = torch.linspace(-0.5, 5.5, 500)
with torch.no_grad():
    mean, cov = gpr(Xtest, full_cov=True, noiseless=False)
    sd = cov.diag().sqrt()
plt.plot(Xtest.numpy(), mean.numpy(), 'r', lw=2)
plt.fill_between(Xtest.numpy(), (mean - 2.0 * sd).numpy(), (mean + 2.0 * sd).numpy(), color='C0', alpha=0.3)
plt.plot(X.numpy(), y.numpy(), 'kx')
The cloud of functions is smoother than in the previous figure and closer to the original function (y = 0.5 * sin(3x)), so I think the prediction has improved thanks to tuning the hyperparameters.
https://pyro.ai/examples/bo.html Bayesian optimization: by repeatedly fitting a GP regression to the observations and using its predictive distribution to decide where to evaluate next, you can search for the minimum of a function approximately (a simplified sketch follows below).
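As a rough sketch of that idea (my own simplified loop, not the tutorial's code), the fitted gpr model above can drive the search: its predictive mean and variance are combined into a lower confidence bound, the most promising candidate is evaluated, and the new observation is fed back into the GP.
def objective(x):
    return 0.5 * torch.sin(3 * x)  # stand-in for an expensive black-box function

candidates = torch.linspace(-0.5, 5.5, 200)
for step in range(10):
    with torch.no_grad():
        mean, var = gpr(candidates, full_cov=False, noiseless=False)
    lcb = mean - 2.0 * var.sqrt()      # lower confidence bound as the acquisition function
    x_next = candidates[lcb.argmin()]  # most promising candidate point
    y_next = objective(x_next)
    X = torch.cat([X, x_next.unsqueeze(0)])
    y = torch.cat([y, y_next.unsqueeze(0)])
    gpr.set_data(X, y)                 # update the GP (hyperparameters could be re-optimized here)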
https://pyro.ai/examples/gplvm.html GPLVM (the Gaussian process latent variable model) is an unsupervised method: by inferring the latent inputs from the observed outputs, it performs dimensionality reduction.
In ordinary data analysis, you have to choose a probabilistic model that suits the situation; in Bayesian statistical modeling, for example, you must define the model and its prior distributions. With a Gaussian process, **you only need to specify the kernel (the similarity between samples)**, which makes it highly versatile. Even the relevance of each input dimension can be determined automatically with ARD (Automatic Relevance Determination), as in the sketch below.
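A rough sketch of ARD (X3d and y3d here are hypothetical 3-dimensional training data, not from this article): giving the RBF kernel one lengthscale per input dimension and optimizing as before lets the learned lengthscales indicate each dimension's relevance (a large lengthscale means that dimension hardly affects the prediction).
ard_kernel = gp.kernels.RBF(input_dim=3, variance=torch.tensor(1.0),
                            lengthscale=torch.ones(3))  # one lengthscale per input dimension
ard_gpr = gp.models.GPRegression(X3d, y3d, ard_kernel, noise=torch.tensor(0.01))
optimizer = torch.optim.Adam(ard_gpr.parameters(), lr=0.005)
for i in range(1000):
    optimizer.zero_grad()
    loss = loss_fn(ard_gpr.model, ard_gpr.guide)
    loss.backward()
    optimizer.step()
print(ard_gpr.kernel.lengthscale)  # per-dimension lengthscales after optimization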
In addition, a Gaussian process expresses the uncertainty of its predictions, and I think it can learn flexibly even from a small number of samples. On the other hand, the model itself is complex and difficult to interpret.