This article is Day 23 of the Furukawa Lab Advent Calendar.
GPLVM is an unsupervised learning method based on Gaussian processes (GPs) that estimates the low-dimensional manifold on which data are distributed in a high-dimensional space. Its most attractive point is that the model is very simple, which makes it highly extensible and theoretically easy to handle. In fact, hyperparameters can be estimated within a Bayesian framework, and the model has been extended to more complicated settings such as time-series analysis [^1], multi-view analysis [^2], and tensor decomposition [^3]. In this article, the most basic GPLVM algorithm is derived by starting from probabilistic principal component analysis (probabilistic PCA: pPCA). The derivation largely follows Lawrence's paper, so please refer to it for details [^4].
probabilistic PCA
As the name implies, pPCA reformulates principal component analysis within the framework of probability theory. In pPCA, each observation $\mathbf{x}$ in the data set $\mathbf{X}=(\mathbf{x}_1,\mathbf{x}_2,\cdots,\mathbf{x}_N)\in\mathbb{R}^{D\times N}$ is assumed to be generated from a latent variable $\mathbf{z}\in\mathbb{R}^L$ through the following mapping [^5].
\mathbf{x} = \mathbf{W}\mathbf{z} + \boldsymbol{\epsilon},
Here, $\boldsymbol{\epsilon}$ is observation noise, assumed to be isotropic Gaussian noise with precision parameter $\beta$ (i.e., variance $\beta^{-1}$), and $\mathbf{W}\in\mathbb{R}^{D\times L}$ is the linear mapping. We also assume that $\mathbf{z}$ follows an $L$-dimensional standard normal distribution. That is, $\mathbf{x}$ and $\mathbf{z}$ follow the probability distributions below.
\begin{align}
p(\mathbf{x}\mid\mathbf{W},\mathbf{z},\beta) &= \mathcal{N}(\mathbf{x}\mid\mathbf{W}\mathbf{z},\beta^{-1}\mathbf{I}_D) \\
p(\mathbf{z}) &= \mathcal{N}(\mathbf{z}\mid\mathbf{0},\mathbf{I}_L)
\end{align}
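To make the generative process concrete, here is a minimal NumPy sketch of sampling data from this model. It is my own illustration, not part of the original article, and the dimensions, $\mathbf{W}$, and $\beta$ are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
D, L, N = 5, 2, 100
beta = 10.0                                        # noise precision
W = rng.standard_normal((D, L))                    # fixed linear map (parameter)

Z = rng.standard_normal((L, N))                    # z_n ~ N(0, I_L), one column per sample
E = rng.normal(0.0, np.sqrt(1.0 / beta), (D, N))   # eps_n ~ N(0, beta^{-1} I_D)
X = W @ Z + E                                      # x_n = W z_n + eps_n, column-wise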
Since both $p(\mathbf{x}\mid\mathbf{W},\mathbf{z},\beta)$ and $p(\mathbf{z})$ are Gaussian, the joint distribution $p(\mathbf{x},\mathbf{z}\mid\mathbf{W},\beta)$ is also Gaussian, and so is $p(\mathbf{x}\mid\mathbf{W},\beta)$, which is obtained by marginalizing the joint distribution over the latent variable $\mathbf{z}$. Writing $p(\mathbf{x}\mid\mathbf{W},\beta)=\mathcal{N}(\mathbf{x}\mid\mathbb{E}[\mathbf{x}],\mathbb{V}[\mathbf{x}])$, the mean and covariance can be computed as follows.
\begin{align}
\mathbb{E}[\mathbf{x}]&=\mathbb{E}[\mathbf{W}\mathbf{z} + \boldsymbol{\epsilon}] \\
&=\mathbf{W}\mathbb{E}[\mathbf{z}] + \mathbb{E}[\boldsymbol{\epsilon}] \\
&=\mathbf{0} \\
\mathbb{V}[\mathbf{x}]&=\mathbb{E}[(\mathbf{W}\mathbf{z} + \boldsymbol{\epsilon})(\mathbf{W}\mathbf{z} + \boldsymbol{\epsilon})^{\rm T}] \\
&=\mathbf{W}\mathbb{E}[\mathbf{z}\mathbf{z}^{\rm T}]\mathbf{W}^{\rm T} + \mathbf{W}\mathbb{E}[\mathbf{z}\boldsymbol{\epsilon}^{\rm T}] + \mathbb{E}[\boldsymbol{\epsilon}\mathbf{z}^{\rm T}]\mathbf{W}^{\rm T} + \mathbb{E}[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^{\rm T}]\\
&=\mathbf{W}\mathbf{W}^{\rm T} + \beta^{-1}\mathbf{I}_D \\
&=\mathbf{C}
\end{align}
In the above derivation, we used the facts that $\mathbf{W}$ is not a random variable and can therefore be taken outside the expectations, and that $\mathbf{z}$ and $\boldsymbol{\epsilon}$ are independent with zero mean.
Since both $\boldsymbol{\epsilon}$ and $\mathbf{z}$ are drawn independently and identically for each observation, the observations $\mathbf{x}_n$ are also independently and identically distributed under $p(\mathbf{x}\mid\mathbf{W},\beta)$. Therefore,
\begin{align}
p(\mathbf{X}\mid\mathbf{W},\beta)=\prod^N_{n=1}p(\mathbf{x}_n\mid\mathbf{W},\beta)
\tag{2}
\end{align}
Taking the logarithm of the likelihood in equation (2), the function to be maximized with respect to $\mathbf{W}$ and $\beta$ is as follows.
\begin{align}
L_{\rm pPCA}&=\log{p(\mathbf{X}\mid\mathbf{W},\beta)} \\
&=\sum^N_{n=1}\log{p(\mathbf{x}_n\mid\mathbf{W},\beta)} \\
&=\sum^N_{n=1}\left(-\frac{D}{2}\log{2\pi}-\frac{1}{2}\log{|\mathbf{C}|}-\frac{1}{2}{\rm Tr}[\mathbf{C}^{-1}\mathbf{x}_n\mathbf{x}^{\rm T}_n]\right) \\
&=-\frac{ND}{2}\log{2\pi}-\frac{N}{2}\log{|\mathbf{C}|}-\frac{N}{2}{\rm Tr}[\mathbf{C}^{-1}\mathbf{V}] \tag{3}
\end{align}
This is the objective function of probabilistic principal component analysis [^6], where $\mathbf{V}=\frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n\mathbf{x}_n^{\rm T}=\frac{1}{N}\mathbf{X}\mathbf{X}^{\rm T}$. There are two ways to estimate the $\mathbf{W}$ and $\beta$ that maximize this objective function: differentiate $L_{\rm pPCA}$ and obtain a closed-form solution, or use the EM algorithm. Since this is outside the main topic, the derivation is omitted here; please refer to this paper for details [^7].
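For reference, here is a small NumPy sketch (mine, not from the original article) that evaluates the log-likelihood in equation (3) directly and checks it against a per-sample evaluation of the corresponding Gaussian density with SciPy; the test data, $\mathbf{W}$, and $\beta$ are arbitrary.

import numpy as np
from scipy.stats import multivariate_normal


def ppca_log_likelihood(X, W, beta):
    # L_pPCA from equation (3); X is D x N, W is D x L, beta is the noise precision
    D, N = X.shape
    C = W @ W.T + np.eye(D) / beta
    V = X @ X.T / N
    _, logdetC = np.linalg.slogdet(C)
    return -0.5 * N * (D * np.log(2 * np.pi) + logdetC + np.trace(np.linalg.solve(C, V)))


rng = np.random.default_rng(0)
X = rng.standard_normal((4, 50))
W = rng.standard_normal((4, 2))
beta = 5.0
direct = multivariate_normal(mean=np.zeros(4), cov=W @ W.T + np.eye(4) / beta).logpdf(X.T).sum()
print(np.isclose(ppca_log_likelihood(X, W, beta), direct))  # True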
Dual probabilistic PCA
In pPCA, we considered a probabilistic model of $\mathbf x_n$ in which $\mathbf z_n$ is a random variable and $\mathbf{W}$ is a parameter. In Dual Probabilistic PCA (DPPCA), by contrast, $\mathbf{Z}=(\mathbf z_1,\mathbf z_2,\cdots,\mathbf z_N)\in\mathbb{R}^{L\times N}$ is treated as a parameter and $\mathbf w_d$ as a random variable, and we consider a probabilistic model of $\mathbf x_{:d}=(x_{1d},x_{2d},\cdots,x_{Nd})^{\rm T}$, i.e., the $d$-th dimension of the data collected across all observations. In other words,
\mathbf{x}_{:d} = \mathbf{Z}^{\rm T}\mathbf{w}_d + \boldsymbol{\epsilon}_{:d},
where $\mathbf w_d$ and $\boldsymbol{\epsilon}_{:d}$ are assumed to follow the probability distributions below.
\begin{align}
p(\mathbf w_d)&=\mathcal{N}(\mathbf w_d\mid\mathbf{0},\mathbf{I}_L) \\
p(\boldsymbol{\epsilon}_{:d})&=\mathcal{N}(\boldsymbol{\epsilon}_{:d}\mid\mathbf{0},\beta^{-1}\mathbf{I}_N)
\end{align}
Since both $p(\mathbf w_d)$ and $p(\mathbf x_{:d}\mid\mathbf{Z},\mathbf{w}_d,\beta)$ are Gaussian, the likelihood function can be obtained, just as in pPCA, by marginalizing over $\mathbf{w}_d$ as follows.
\begin{align}
p(\mathbf{X} \mid \mathbf{Z}, \beta)&=\prod^D_{d=1}\mathcal{N}(\mathbf{x}_{:d}\mid\mathbf{0},\mathbf{Z}^{\rm T}\mathbf{Z}+\beta^{-1}\mathbf{I}_N) \\
\end{align}
The logarithm of this likelihood function gives the objective function of DPPCA.
\begin{align}
\log{p(\mathbf{X} \mid \mathbf{Z}, \beta)} &=\sum^D_{d=1}\log{p(\mathbf{x}_{:d} \mid \mathbf{Z}, \beta)} \\
&=-\frac{ND}{2}\log{2\pi}-\frac{D}{2}\log{|\mathbf{K}|}-\frac{D}{2}{\rm Tr}[\mathbf{K}^{-1}\mathbf{S}] \tag{4}
\end{align}
Here, $\mathbf S$ and $\mathbf K$ are defined as follows.
\begin{align}
\mathbf S &= \frac 1 D \mathbf{X}^{\rm T}\mathbf{X} \\
\mathbf K &= \mathbf Z^{\rm T} \mathbf Z+\beta^{-1}\mathbf I_N
\end{align}
Looking at equation (4), it may seem that DPPCA is not doing anything very different from pPCA, since the roles of $\mathbf W$ and $\mathbf Z$ are merely swapped. However, consider subtracting equation (4) from the following constant, which does not depend on $\mathbf{Z}$.
D\int \mathcal{N}(\mathbf{x}\mid\mathbf{0},\mathbf{S})\log{\mathcal{N}(\mathbf{x}\mid\mathbf{0},\mathbf{S})}d\mathbf{x}=-\frac{ND}{2}\log{2\pi}-\frac{D}{2}\log{|\mathbf S|}-\frac{ND}{2}
From the equations below, we can see that estimating the $\mathbf{Z}$ that maximizes equation (4) is equivalent to estimating the $\mathbf{Z}$ that minimizes the KL divergence between a Gaussian whose covariance is the Gram matrix of the observed data and a Gaussian whose covariance is the Gram matrix of the latent variables. In other words, the aim of DPPCA is to estimate the latent variables $\mathbf{Z}$ so that the similarities between the observed data and the similarities between the corresponding latent variables match as closely as possible.
\begin{align}
D\int \mathcal{N}(\mathbf{x}\mid\mathbf{0},\mathbf{S})\log{\mathcal{N}(\mathbf{x}\mid\mathbf{0},\mathbf{S})}d\mathbf{x}-L_{\rm DPPCA} &=\frac{D}{2}\{-\log{|\mathbf S|}+\log{|\mathbf{K}|}+{\rm Tr}[\mathbf{K}^{-1}\mathbf{S}]-N\} \\
&= D\, D_{\rm KL}[\mathcal{N}(\mathbf{x}\mid\mathbf{0},\mathbf{S}) \,\|\, \mathcal{N}(\mathbf{x}\mid\mathbf{0},\mathbf{K})]
\end{align}
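As a sanity check of this relation, the following small NumPy sketch (my own, with arbitrary sizes; note that $D \ge N$ is needed here so that $\mathbf S$ is full rank) confirms numerically that the constant minus $L_{\rm DPPCA}$ equals $D$ times the KL divergence.

import numpy as np

rng = np.random.default_rng(0)
N, D, L, beta = 4, 6, 2, 2.0          # D >= N so that S is invertible
X = rng.standard_normal((D, N))       # observations (D x N, as in the text)
Z = rng.standard_normal((L, N))       # latent variables (L x N)

S = X.T @ X / D                       # Gram matrix of the observed data
K = Z.T @ Z + np.eye(N) / beta        # Gram matrix of the latent variables

_, logdetS = np.linalg.slogdet(S)
_, logdetK = np.linalg.slogdet(K)
tr = np.trace(np.linalg.solve(K, S))

L_dppca = -0.5 * N * D * np.log(2 * np.pi) - 0.5 * D * logdetK - 0.5 * D * tr
const = -0.5 * N * D * np.log(2 * np.pi) - 0.5 * D * logdetS - 0.5 * N * D
kl = 0.5 * (tr - N + logdetK - logdetS)   # KL[N(0,S) || N(0,K)]

print(np.isclose(const - L_dppca, D * kl))  # True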
Based on this view, DPPCA can achieve nonlinear dimensionality reduction by defining the similarities between observed data or between latent variables with a kernel function $k(\cdot,\cdot)$ instead of the standard inner product. In particular, the method that applies the kernel function to the similarities between observed data is called kernel principal component analysis [^8], and the method that applies the kernel function to the similarities between latent variables is called GPLVM.
Both methods allow nonlinear dimensionality reduction, but each has advantages and disadvantages. In kernel principal component analysis, the kernel function is applied to the similarities between observed data, so once $\mathbf{S}$ has been computed with the kernel function, an analytical solution can be obtained just as in ordinary principal component analysis. However, since the mapping from the latent space to the observation space is unknown, estimating the point in the observation space that corresponds to an arbitrary point in the latent space requires solving the pre-image problem [^9]. In GPLVM, on the other hand, the mapping from the latent space to the observation space can be written explicitly, so the pre-image problem does not arise. Instead, $\mathbf{K}$ must be updated every time the latent variables change.
Gaussian Process Latent Variable Model
To summarize so far, GPLVM is a model that estimates the latent variables $\mathbf Z$ by maximizing the following objective function,
L_{\rm DPPCA} = -\frac{ND}{2}\log{2\pi}-\frac{D}{2}\log{|\mathbf{K}|}-\frac{D}{2}{\rm Tr}[\mathbf{K}^{-1}\mathbf{S}] \tag{5}
where $\mathbf K$ is defined using the kernel function $k(\cdot,\cdot)$ as $\mathbf K = k(\mathbf Z,\mathbf Z) + \beta^{-1}\mathbf{I}_N$.
In fact, this objective function can also be derived from the GP perspective. The details of GPs are omitted here; if you would like to learn more, please refer to this book [^10]. Let the mapping from the latent space $\mathcal Z$ to the observation space $\mathcal X$ be $f:\mathcal Z\rightarrow\mathcal X=\mathbb{R}^D$, and assume that the dimensions of $\mathbf x\in\mathcal X$ are generated independently. That is,
\begin{align}
x_{nd}=f_d(\mathbf z_n)+\epsilon_{nd}
\end{align}
so that $x_{nd}$ is generated from $\mathbf z_n$ independently for each dimension. Here, $\epsilon_{nd}$ is Gaussian noise with precision parameter $\beta$. Let the prior over $f_d$ be $f_d\sim\mathcal{GP}(0,k(\mathbf{z},\mathbf{z}'))$ [^11]. When the observed data are $\mathbf X=(\mathbf x_1,\mathbf x_2,\cdots,\mathbf x_N)$ and the corresponding latent variables are $\mathbf Z$, the prior becomes $p(\mathbf f_d\mid\mathbf Z)=\mathcal{N}(\mathbf f_d\mid\mathbf 0,k(\mathbf Z,\mathbf Z))$ [^12], where $\mathbf f_d=(f_d(\mathbf z_1),f_d(\mathbf z_2),\cdots,f_d(\mathbf z_N))^{\rm T}$. Since the $\mathbf f_d$ are independent across dimensions $d$, the marginal likelihood becomes
\begin{align}
p(\mathbf{X}\mid \mathbf{Z},\beta) &= \prod^D_{d=1}p(\mathbf{x}_{:d}\mid \mathbf{Z},\beta) \\
&= \prod^D_{d=1}\int{p(\mathbf{x}_{:d}\mid \mathbf{f}_d,\beta)p(\mathbf{f}_d\mid \mathbf{Z})}d\mathbf{f}_d
\end{align}
Moreover, since $p(\mathbf{x}_{:d}\mid\mathbf{f}_d,\beta)$ and $p(\mathbf{f}_d\mid\mathbf{Z})$ are both Gaussian, the marginal likelihood is also Gaussian.
Writing $p(\mathbf x_{:d}\mid\mathbf Z,\beta)=\mathcal N(\mathbf x_{:d}\mid\mathbb E[\mathbf x_{:d}],\mathbb V[\mathbf x_{:d}])$, the mean and covariance are computed as follows.
\begin{align}
\mathbb{E}[\mathbf x_{:d}]&=\mathbb{E}[\mathbf f_d+\boldsymbol{\epsilon}_{:d}]\\
&=\mathbf 0\\
\mathbb{V}[\mathbf x_{:d}]&=\mathbb{E}[(\mathbf f_d+\boldsymbol{\epsilon}_{:d})(\mathbf f_d+\boldsymbol{\epsilon}_{:d})^{\rm T}] \\
&=\mathbb{E}[\mathbf f_d\mathbf f^{\rm T}_d]+\mathbb{E}[\mathbf f_d\boldsymbol{\epsilon}_{:d}^{\rm T}]+\mathbb{E}[\boldsymbol{\epsilon}_{:d}\mathbf f^{\rm T}_d]+\mathbb{E}[\boldsymbol{\epsilon}_{:d}\boldsymbol{\epsilon}^{\rm T}_{:d}] \\
&=k(\mathbf Z,\mathbf Z)+\beta^{-1}\mathbf I_N \\
&=\mathbf K
\end{align}
From this,
\begin{align}
\log{p(\mathbf{X}\mid \mathbf{Z},\beta)} &= \sum^D_{d=1}\log{p(\mathbf{x}_{:d}\mid \mathbf{Z},\beta)} \\
&= -\frac{ND}{2}\log{2\pi}-\frac{D}{2}\log{|\mathbf{K}|}-\frac{D}{2}{\rm Tr}[\mathbf{K}^{-1}\mathbf{S}]\\
\end{align}
which agrees with equation (5). In other words, the latent variables $\mathbf Z$ estimated by DPPCA can also be interpreted as the latent variables that maximize the marginal likelihood of a Gaussian process regression with multidimensional output. Consequently, the posterior distribution of the mapping $f$ from the latent space to the observation space can be written as the following Gaussian process.
\begin{align}
f &\sim \mathcal{GP}(\mu(\mathbf{z}),\sigma(\mathbf{z},\mathbf{z}')) \\
\mu(\mathbf{z}) &= k(\mathbf{z},\mathbf{Z}) (k(\mathbf{Z},\mathbf{Z})+\beta^{-1}\mathbf{I})^{-1}\mathbf{X}^{\rm T} \\
\sigma(\mathbf{z},\mathbf{z}') &= k(\mathbf{z},\mathbf{z}') - k(\mathbf{z},\mathbf{Z})(k(\mathbf{Z},\mathbf{Z})+\beta^{-1}\mathbf{I})^{-1}k(\mathbf{Z},\mathbf{z})
\end{align}
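As an illustration of these predictive equations, here is a self-contained sketch of mine (not the article's code) that computes the posterior mean and covariance at new latent points. It assumes a Gaussian kernel of width sigma2 and uses the row-wise N x D data layout that the implementation further below also uses.

import numpy as np


def rbf(Z1, Z2, sigma2):
    # Gaussian kernel k(z, z') = exp(-||z - z'||^2 / (2 * sigma2))
    d2 = ((Z1[:, None, :] - Z2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-0.5 * d2 / sigma2)


def gp_posterior(Znew, Z, Y, sigma2, beta):
    # Posterior mean and covariance of f at new latent points Znew,
    # following the equations above. Z: (N, L) latent variables (rows),
    # Y: (N, D) observations, beta: noise precision.
    K = rbf(Z, Z, sigma2) + np.eye(len(Z)) / beta
    Kinv = np.linalg.inv(K)
    Ks = rbf(Znew, Z, sigma2)
    mean = Ks @ Kinv @ Y                                  # (M, D)
    cov = rbf(Znew, Znew, sigma2) - Ks @ Kinv @ Ks.T      # (M, M)
    return mean, cov


# Example usage with random data (shapes only; not a trained model)
rng = np.random.default_rng(0)
Z = rng.standard_normal((10, 2))
Y = rng.standard_normal((10, 3))
mean, cov = gp_posterior(rng.standard_normal((5, 2)), Z, Y, sigma2=1.0, beta=10.0)
print(mean.shape, cov.shape)  # (5, 3) (5, 5)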
Furthermore, since estimating the hyperparameters that maximize equation (5) is equivalent to estimating the hyperparameters that maximize the marginal likelihood, the validity of hyperparameter estimation in this unsupervised setting is also guaranteed from the viewpoint of marginal likelihood maximization.
Since equation (5) cannot be maximized analytically, $\mathbf Z$ and the hyperparameters are estimated with gradient methods. When differentiating $L_{\rm DPPCA}$ with respect to $z_{nl}$, note that $z_{nl}$ enters only through the kernel function, so the derivative can be computed with the following chain rule.
\begin{align}
\frac{\partial L_{\rm DPPCA}}{\partial z_{nl}}={\rm Tr}\left(\frac{\partial L_{\rm DPPCA}}{\partial \mathbf{K}}\frac{\partial \mathbf{K}}{\partial z_{nl}}\right)
\end{align}
where
\begin{align}
\frac{\partial L_{\rm DPPCA}}{\partial \mathbf{K}} = \frac{D}{2}(\mathbf{K}^{-1}\mathbf{S}\mathbf{K}^{-1}-\mathbf{K}^{-1})
\end{align}
As for $\frac{\partial \mathbf{K}}{\partial z_{nl}}$, it depends on the chosen kernel function, so differentiate it accordingly. In actual implementations, the log prior $\log{p(\mathbf{Z})}$ is added to the objective to prevent the latent variables from taking extreme values; a standard normal distribution is typically used as the prior.
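To check a gradient of this form, the following sketch of mine (not from the article; Gaussian kernel with width sigma2, row-wise layout, arbitrary sizes) compares the analytic derivative obtained through the chain rule with a finite-difference approximation. The factor of 2 in the gradient comes from the symmetry of $\mathbf K$; constant factors like this are often simply absorbed into the learning rate in practice.

import numpy as np


def rbf(Z1, Z2, sigma2):
    # Gaussian kernel k(z, z') = exp(-||z - z'||^2 / (2 * sigma2))
    d2 = ((Z1[:, None, :] - Z2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-0.5 * d2 / sigma2)


def objective(Z, S, D, sigma2, beta):
    # L_DPPCA without the constant -ND/2 log(2*pi) term
    K = rbf(Z, Z, sigma2) + np.eye(len(Z)) / beta
    _, logdetK = np.linalg.slogdet(K)
    return -0.5 * D * (logdetK + np.trace(np.linalg.solve(K, S)))


def grad_Z(Z, S, D, sigma2, beta):
    # Chain rule: dL/dz_n = 2 * sum_m dL/dK[n,m] * dK[n,m]/dz_n (factor 2 from K's symmetry)
    K = rbf(Z, Z, sigma2) + np.eye(len(Z)) / beta
    Kinv = np.linalg.inv(K)
    G = 0.5 * D * (Kinv @ S @ Kinv - Kinv)   # dL/dK
    dKdZ = -(Z[:, None, :] - Z[None, :, :]) * rbf(Z, Z, sigma2)[:, :, None] / sigma2
    return 2.0 * (G[:, :, None] * dKdZ).sum(axis=1)


rng = np.random.default_rng(0)
N, D, L, sigma2, beta = 6, 3, 2, 1.0, 10.0
Y = rng.standard_normal((N, D))
Z = 0.1 * rng.standard_normal((N, L))
S = Y @ Y.T / D
eps = 1e-6
Zp = Z.copy()
Zm = Z.copy()
Zp[0, 0] += eps
Zm[0, 0] -= eps
numerical = (objective(Zp, S, D, sigma2, beta) - objective(Zm, S, D, sigma2, beta)) / (2 * eps)
print(numerical, grad_Z(Z, S, D, sigma2, beta)[0, 0])  # the two values should agree closely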
Finally, we implement GPLVM in Python and verify that learning works on simple data. Here, the observed data were generated by randomly sampling 200 latent points $\mathbf z$ in the range $[-1,1]^2$ and mapping them into three-dimensional space with the following functions, where $\epsilon$ is Gaussian noise. A Gaussian kernel is used as the kernel function, and the hyperparameters and the observation noise are estimated by maximizing the marginal likelihood. The initial values of the latent variables are determined randomly.
\begin{align}
x_{n1} &= z_{n1}+\epsilon_{n1} \\
x_{n2} &= z_{n2}+\epsilon_{n2} \\
x_{n3} &= z_{n1}^2 - z_{n2}^2+\epsilon_{n3} \\
\end{align}
The actual program is as follows.
GPLVM.py
import numpy as np


class GPLVM(object):
    def __init__(self, Y, LatentDim, HyperParam, X=None):
        self.Y = Y
        self.hyperparam = HyperParam
        self.dataNum = self.Y.shape[0]
        self.dataDim = self.Y.shape[1]
        self.latentDim = LatentDim
        if X is not None:
            self.X = X
        else:
            self.X = 0.1 * np.random.randn(self.dataNum, self.latentDim)
        self.S = Y @ Y.T  # Gram matrix of the observed data
        self.history = {}

    def fit(self, epoch=100, epsilonX=0.5, epsilonSigma=0.0025, epsilonAlpha=0.00005):
        resolution = 10
        M = resolution ** self.latentDim
        self.history['X'] = np.zeros((epoch, self.dataNum, self.latentDim))
        self.history['F'] = np.zeros((epoch, M, self.dataDim))
        sigma = np.log(self.hyperparam[0])   # hyperparameters are updated in log space
        alpha = np.log(self.hyperparam[1])
        for i in range(epoch):
            # Latent variable update
            K = self.kernel(self.X, self.X, self.hyperparam[0]) + self.hyperparam[1] * np.eye(self.dataNum)
            Kinv = np.linalg.inv(K)
            G = 0.5 * (Kinv @ self.S @ Kinv - self.dataDim * Kinv)  # dL/dK
            dKdX = -(((self.X[:, None, :] - self.X[None, :, :]) * K[:, :, None])) / self.hyperparam[0]
            # dFdX = (G[:,:,None] * dKdX).sum(axis=1)-self.X
            dFdX = (G[:, :, None] * dKdX).sum(axis=1)
            self.X = self.X + epsilonX * dFdX
            self.history['X'][i] = self.X

            # Hyperparameter updates
            Dist = ((self.X[:, None, :] - self.X[None, :, :]) ** 2).sum(axis=2)
            dKdSigma = 0.5 * Dist / self.hyperparam[0] * K
            dFdSigma = np.trace(G @ dKdSigma)
            sigma = sigma + epsilonSigma * dFdSigma
            self.hyperparam[0] = np.exp(sigma)
            dKdAlpha = self.hyperparam[1] * np.eye(self.dataNum)
            dFdAlpha = np.trace(G @ dKdAlpha)
            alpha = alpha + epsilonAlpha * dFdAlpha
            self.hyperparam[1] = np.exp(alpha)

            # Mapping estimated from the current latent variables (for visualization)
            zeta = np.meshgrid(np.linspace(self.X[:, 0].min(), self.X[:, 0].max(), resolution),
                               np.linspace(self.X[:, 1].min(), self.X[:, 1].max(), resolution))
            zeta = np.dstack(zeta).reshape(M, self.latentDim)
            K = self.kernel(self.X, self.X, self.hyperparam[0]) + self.hyperparam[1] * np.eye(self.dataNum)
            Kinv = np.linalg.inv(K)
            KStar = self.kernel(zeta, self.X, self.hyperparam[0])
            self.F = KStar @ Kinv @ self.Y
            self.history['F'][i] = self.F

    def kernel(self, X1, X2, length):
        # Gaussian kernel: k(x, x') = exp(-||x - x'||^2 / (2 * length))
        Dist = (((X1[:, None, :] - X2[None, :, :]) ** 2) / length).sum(axis=2)
        K = np.exp(-0.5 * Dist)
        return K
main.py
from GPLVM import GPLVM
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D


def createKuraData(N, D, sigma=0.1):
    # Sample latent points uniformly in [-1, 1]^2 and map them to a saddle surface in 3D
    X = (np.random.rand(N, 2) - 0.5) * 2
    Y = np.zeros((N, D))
    Y[:, :2] = X
    Y[:, 2] = X[:, 0] ** 2 - X[:, 1] ** 2
    Y += np.random.normal(0, sigma, (N, D))
    return [X, Y]


def plot_prediction(Y, f, Z, epoch, isSave=False):
    fig = plt.figure(1, [10, 8])
    nb_nodes = f.shape[1]
    nb_dim = f.shape[2]
    resolution = np.sqrt(nb_nodes).astype('int')
    for i in range(epoch):
        if i % 10 == 0:
            # Display of observation space
            ax_input = fig.add_subplot(1, 2, 1, projection='3d', aspect='equal')
            ax_input.cla()
            r_f = f[i].reshape(resolution, resolution, nb_dim)
            ax_input.plot_wireframe(r_f[:, :, 0], r_f[:, :, 1], r_f[:, :, 2], color='k')
            ax_input.scatter(Y[:, 0], Y[:, 1], Y[:, 2], c=Y[:, 0], edgecolors="k", marker='x')
            ax_input.set_xlim(Y[:, 0].min(), Y[:, 0].max())
            ax_input.set_ylim(Y[:, 1].min(), Y[:, 1].max())
            ax_input.set_zlim(Y[:, 2].min(), Y[:, 2].max())
            # plt.savefig("fig1.pdf")

            # Display of latent space
            ax_latent = fig.add_subplot(1, 2, 2, aspect='equal')
            ax_latent.cla()
            ax_latent.set_xlim(Z[:, :, 0].min(), Z[:, :, 0].max())
            ax_latent.set_ylim(Z[:, :, 1].min(), Z[:, :, 1].max())
            ax_latent.scatter(Z[i, :, 0], Z[i, :, 1], c=Y[:, 0], edgecolors="k")
            plt.savefig("fig/fig{0}.png".format(i))
            plt.pause(0.001)
    if isSave:
        plt.savefig("result.png", dpi=100)
    plt.show()


if __name__ == '__main__':
    L = 2
    N = 200
    D = 3
    sigma = 3
    alpha = 0.05
    beta = 0.08
    seedData = 1
    resolution = 10
    M = resolution ** L
    # Input data generation
    # np.random.seed(seedData)
    [X, Y] = createKuraData(N, D, sigma=0.01)
    # Y = np.loadtxt('kura.txt', delimiter=' ')
    # Kernel settings
    [U, D, Vt] = np.linalg.svd(Y)
    model = GPLVM(Y, L, np.array([sigma ** 2, alpha / beta]))
    # GPLVM optimization
    epoch = 200
    model.fit(epoch=epoch, epsilonX=0.05, epsilonSigma=0.0005, epsilonAlpha=0.00001)
    # Get the estimated latent variables
    X = model.history['X']
    f = model.history['F']
    # Display of learning results
    plot_prediction(Y, f, X, epoch, True)
The figure on the left shows the learning process of the estimated manifold in the observation space, and the figure on the right shows the latent variables at each step. As long as the initial parameter values are not too unreasonable, learning proceeds stably even when the latent variables are initialized randomly. In particular, setting the kernel width relatively large in advance tends to make learning more stable.
This article summarized my understanding of GPLVM. If you have any questions or notice any mistakes, please let me know.