Last time, I implemented an ordinary neural network without using libraries such as TensorFlow or Chainer. There are already plenty of implementations like that on the net, though, and it is not very interesting on its own, so that post was really preparation for this one: an implementation of a **mixture density network**. The code reuses the neural network implemented last time, so please refer to the previous article as needed.
When solving a regression problem, the sum-of-squares error function cannot always handle the data well. The figure below (a reproduction of PRML Figure 5.19) shows the result of training a neural network with the sum-of-squares error function, using the blue points as training data. At $x = 0.5$, values of $y$ around 0.2, 0.5, and 0.8 all look equally plausible, but since the network has to pick a single value it ends up around 0.5 and fails to fit the other two branches. **The trained network fits poorly because this cost function corresponds to a unimodal Gaussian model.** In that model, the relationship between the input $x$ and the target $t$ is expressed by a single Gaussian distribution
p(t|x) = \mathcal{N}(t|\mu(x),\sigma^2),
and fitting amounts to estimating the function $\mu(x)$. The estimated $\mu(x)$ is the red curve in the figure above, which is clearly a poor fit.
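Loosely speaking, this is exactly what should happen: with a sum-of-squares (unimodal Gaussian) cost and a sufficiently flexible model, the optimal prediction approaches the conditional average of the target,

y(x) = \mathbb{E}[t\,|\,x] = \int t\,p(t|x)\,\mathrm{d}t,

which at $x = 0.5$ averages over the three branches instead of following any one of them.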
The mixture density network solves this problem by using a cost function based on a multimodal Gaussian mixture. We introduce a mixing coefficient $\pi_k$ for each of the $K$ Gaussian components and model the relationship between the input $x$ and the target $t$ as
p(t|x) = \sum_{k=1}^K\pi_k(x)\mathcal{N}(t|\mu_k(x),\sigma^2_k(x))
where the mixing coefficients sum to one, $\sum_k \pi_k = 1$. In the training data above, at $x = 0.5$ there are three clusters of data points, around $y = 0.2, 0.5, 0.8$, so $K = 3$ seems a good choice. By fitting the three Gaussian components to different clusters of points, even multimodal data can be handled. Given a training data set $\{x_n, t_n\}_{n=1}^N$, the cost function to optimize is
\begin{align}
E(w) &= \sum_{n=1}^N E_n\\
&= -\sum_{n=1}^N \ln p(t_n|x_n)\\
&= -\sum_{n=1}^N \ln \left\{\sum_{k=1}^K\pi_k(x_n,w)\mathcal{N}\left(t_n|\mu_k(x_n,w),\sigma^2_k(x_n,w)\right)\right\}
\end{align}
We then estimate the parameters $w$ that minimize this expression. A neural network provides the functions $\pi_k, \mu_k, \sigma_k$: feeding an input $x$ into the network with parameters $w$ yields the mixing coefficients, means, and standard deviations, from which the probability distribution of $t$ is computed.
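As a concrete illustration of how this predictive density and its per-sample cost $E_n$ are evaluated once the network has produced $\pi_k, \mu_k, \sigma_k$ for one input, here is a minimal sketch; the parameter values are made up purely for illustration.

```python
import numpy as np

def gauss(t, mu, sigma):
    # N(t | mu, sigma^2)
    return np.exp(-0.5 * (t - mu) ** 2 / sigma ** 2) / np.sqrt(2 * np.pi * sigma ** 2)

# hypothetical network outputs for a single input x (K = 3)
pi = np.array([0.3, 0.4, 0.3])        # mixing coefficients, sum to 1
mu = np.array([0.2, 0.5, 0.8])        # means of the three components
sigma = np.array([0.05, 0.05, 0.05])  # standard deviations
t = 0.5                               # target value

p = np.sum(pi * gauss(t, mu, sigma))  # p(t|x), the mixture density above
E_n = -np.log(p)                      # this sample's contribution to E(w)
print(E_n)
```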
Since we perform regression on data like that shown above, the input ${\bf x}$ is one-dimensional. Following PRML, the hidden layer ${\bf z}$ has 5 units and the output ${\bf y}$ has 9 units. With first-layer parameters $W^{(1)}, {\bf b}^{(1)}$, activation function $f(\cdot)$, and second-layer parameters $W^{(2)}, {\bf b}^{(2)}$, forward propagation is
\begin{align}
{\bf z} &= f(W^{(1)}{\bf x} + {\bf b}^{(1)})\\
{\bf y} &= W^{(2)}{\bf z} + {\bf b}^{(2)}
\end{align}
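As a rough numpy sketch of these two lines (the tanh activation and the 1-5-9 layer sizes come from the text; the random initialization and the column-vector array shapes here are only illustrative and do not necessarily match the previous article's layer classes):

```python
import numpy as np

rng = np.random.RandomState(0)
# 1 input unit, 5 hidden units, 9 output units (3 components x 3 parameter types)
W1, b1 = rng.normal(scale=0.1, size=(5, 1)), np.zeros((5, 1))
W2, b2 = rng.normal(scale=0.1, size=(9, 5)), np.zeros((9, 1))

x = np.array([[0.5]])        # a single scalar input
z = np.tanh(W1.dot(x) + b1)  # hidden layer, f = tanh
y = W2.dot(z) + b2           # nine output activations
```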
These nine outputs are the activations for the Gaussian mixture parameters $\pi_k, \mu_k, \sigma_k$. In groups of three from the top, the components of ${\bf y}$ correspond to the activations for $\pi$, $\mu$, and $\sigma$:
{\bf y} =
\begin{bmatrix}
a_{\pi_1}\\
a_{\pi_2}\\
a_{\pi_3}\\
a_{\mu_1}\\
a_{\mu_2}\\
a_{\mu_3}\\
a_{\sigma_1}\\
a_{\sigma_2}\\
a_{\sigma_3}
\end{bmatrix}
Since the mixing coefficients are constrained by $\sum_k \pi_k = 1$, they are produced with the softmax function:
\pi_k = {\exp(a_{\pi_k})\over\sum_{l=1}^K\exp(a_{\pi_l})}
The means use no nonlinear transformation: $\mu_k = a_{\mu_k}$. Since the standard deviations must be positive, the exponential function is used: $\sigma_k = \exp(a_{\sigma_k})$. We now have the functions $\pi_k, \mu_k, \sigma_k$ needed to evaluate the Gaussian mixture.
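Continuing the sketch above, the nine activations are split and transformed exactly as described. Note that this illustration follows the ordering of the vector shown above ($\pi$, $\mu$, $\sigma$), while the article's code below splits in the order $\sigma$, $\pi$, $\mu$:

```python
a_pi, a_mu, a_sigma = y[:3], y[3:6], y[6:]  # split as in the vector above

pi = np.exp(a_pi - a_pi.max())              # softmax, shifted by the max for stability
pi /= pi.sum()                              # mixing coefficients sum to 1
mu = a_mu                                   # means: identity
sigma = np.exp(a_sigma)                     # standard deviations: exponential keeps them positive
```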
As I wrote in the previous article on implementing a neural network, training requires the gradient of the cost function with respect to the network outputs; with it, backpropagation then gives the derivatives with respect to the parameters $w$. Defining the responsibilities
\gamma_{nk}(t_n|x_n) = {\pi_k\mathcal{N}(t_n|\mu_k(x_n,w),\sigma_k^2(x_n,w))\over\sum_l\pi_l\mathcal{N}(t_n|\mu_l(x_n,w),\sigma_l^2(x_n,w))}
the gradients of $E_n$ with respect to the output activations are
\begin{align}
{\partial E_n\over\partial a_{\pi_k}} &= \pi_k - \gamma_{nk}\\
{\partial E_n\over\partial a_{\mu_k}} &= \gamma_{nk}{\mu_k - t_n\over\sigma^2_k}\\
{\partial E_n\over\partial a_{\sigma_k}} &= \gamma_{nk}\left(1 - {||t_n - \mu_k||^2\over \sigma^2_k}\right)
\end{align}
The derivation of these formulas is an exercise in PRML, so refer to the solutions or to the original paper on mixture density networks.
The code implementing the formulas above is as follows.
# Cost function class
class GaussianMixture(object):
    """Negative Log Likelihood of Gaussian Mixture model"""

    def __init__(self, n_components):
        # number of Gaussian components; 3 in the example above
        self.n_components = n_components

    # compute the value of the cost function
    def __call__(self, X, targets):
        # convert the network output X with the output activation functions
        # to get the standard deviations, mixing coefficients, and means
        sigma, weight, mu = self.activate(X)
        # value of the Gaussian N(t|mu, sigma^2)
        gauss = self.gauss(mu, sigma, targets)
        # negative log-likelihood E(w) of the Gaussian mixture: PRML Eq. (5.153)
        return -np.sum(np.log(np.sum(weight * gauss, axis=1)))

    # apply the output activation functions
    def activate(self, X):
        assert np.size(X, 1) == 3 * self.n_components
        # split X into the parts corresponding to standard deviation, mixing coefficient, and mean
        X_sigma, X_weight, X_mu = np.split(X, [self.n_components, 2 * self.n_components], axis=1)
        # transform the standard deviations with the exponential
        sigma = np.exp(X_sigma)
        # softmax for the mixing coefficients; subtract the maximum to avoid overflow
        weight = np.exp(X_weight - np.max(X_weight, 1, keepdims=True))
        weight /= np.sum(weight, axis=1, keepdims=True)
        return sigma, weight, X_mu

    # Gaussian N(target|mu, sigma^2)
    def gauss(self, mu, sigma, targets):
        return np.exp(-0.5 * (mu - targets) ** 2 / np.square(sigma)) / np.sqrt(2 * np.pi * np.square(sigma))

    # derivative of the cost function with respect to the output activations
    def delta(self, X, targets):
        sigma, weight, mu = self.activate(X)
        var = np.square(sigma)
        gamma = weight * self.gauss(mu, sigma, targets)
        gamma /= np.sum(gamma, axis=1, keepdims=True)
        # compute each derivative
        delta_mu = gamma * (mu - targets) / var
        delta_sigma = gamma * (1 - (mu - targets) ** 2 / var)
        delta_weight = weight - gamma
        # concatenate and return
        delta = np.hstack([delta_sigma, delta_weight, delta_mu])
        return delta


# create training data following the inverse of func
def create_toy_dataset(func, n=300):
    t = np.random.uniform(size=(n, 1))
    x = func(t) + np.random.uniform(-0.05, 0.05, size=(n, 1))
    return x, t
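Before moving on, the `delta` method above can be checked against a numerical derivative of the cost for a single sample (the previous article's `NeuralNetwork` exposes a similar `_gradient_check`, which is used in `main()` below). A minimal sketch, assuming the `GaussianMixture` class above is in scope:

```python
import numpy as np

cost = GaussianMixture(3)
X = 0.1 * np.random.randn(1, 9)   # fake output activations for one input
t = np.array([[0.5]])

analytic = cost.delta(X, t)
numeric = np.zeros_like(X)
eps = 1e-6
for i in range(X.size):
    Xp, Xm = X.copy(), X.copy()
    Xp.flat[i] += eps
    Xm.flat[i] -= eps
    # central finite difference of the negative log-likelihood
    numeric.flat[i] = (cost(Xp, t) - cost(Xm, t)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))
```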
The following is a trick I threw in because training sometimes did not go well: instead of computing the gradient from the whole training set, we draw a subset of it at each step. This mini-batch sampling can help the optimization escape local minima.
# randomly select n pairs from the training set (x, t)
def sample(x, t, n=None):
    assert len(x) == len(t)
    N = len(x)
    if n is None:
        n = N
    indices = np.random.choice(N, n, replace=False)
    return x[indices], t[indices]
This is the main function that trains the mixture density network and plots the results.
def main():

    def func(x):
        return x + 0.3 * np.sin(2 * np.pi * x)

    # generate data points that follow the inverse of func
    x, t = create_toy_dataset(func)

    # define the structure of the neural network
    layers = [TanhLayer(1, 5, std=0.1), LinearLayer(5, 9, std=0.1)]
    # choose the cost function to optimize
    cost_function = GaussianMixture(3)
    # build the neural network
    nn = NeuralNetwork(layers, cost_function)
    # uncomment the line below to check that the implementation is working
    # nn._gradient_check(np.array([[0.5]]), np.array([[0.5]]))

    # set the initial learning rate
    learning_rate = 1e-4
    for i in xrange(500000):
        # every 10,000 iterations, print the cost and multiply the learning rate by 0.9
        if i % 10000 == 0:
            print "step %6d, cost %f" % (i, nn.cost(x, t))
            learning_rate *= 0.9
        # mini-batch processing
        batch = sample(x, t, n=100)
        nn.fit(*batch, learning_rate=learning_rate)

    # create test data
    x_test = np.linspace(x.min(), x.max(), 100)
    y_test = np.linspace(t.min(), t.max(), 100)
    X_test, Y_test = np.meshgrid(x_test, y_test)
    test = np.array([X_test, Y_test]).transpose(1, 2, 0).reshape(-1, 2)

    # compute the likelihood of the test data
    sigma, weight, mu = nn(test[:, 0].reshape(-1, 1))
    probs = cost_function.gauss(mu, sigma, test[:, 1].reshape(-1, 1))
    probs = np.sum(weight * probs, axis=1)
    Probs = probs.reshape(100, 100)

    # reproduction of PRML Figure 5.21(a)
    plt.plot(x_test, weight[:100, 0], color="blue")
    plt.plot(x_test, weight[:100, 1], color="red")
    plt.plot(x_test, weight[:100, 2], color="green")
    plt.title("weights")
    plt.show()

    # reproduction of PRML Figure 5.21(b)
    plt.plot(x_test, mu[:100, 0], color="blue")
    plt.plot(x_test, mu[:100, 1], color="red")
    plt.plot(x_test, mu[:100, 2], color="green")
    plt.title("means")
    plt.show()

    # reproduction of PRML Figure 5.21(c)
    plt.scatter(x, t, alpha=0.5, label="observation")
    levels_log = np.linspace(0, np.log(probs.max()), 21)
    levels = np.exp(levels_log)
    levels[0] = 0
    plt.contourf(X_test, Y_test, Probs, levels, alpha=0.5)
    plt.colorbar()
    plt.xlim(x.min(), x.max())
    plt.ylim(t.min(), t.max())
    plt.show()
The `neural_network` module imported on the third line is the neural_network.py implemented last time; put it in the same directory as this code.
mixture_density_network.py
import matplotlib.pyplot as plt
import numpy as np
from neural_network import TanhLayer, LinearLayer, NeuralNetwork


class GaussianMixture(object):
    """Negative Log Likelihood of Gaussian Mixture model"""

    def __init__(self, n_components):
        self.n_components = n_components

    def __call__(self, X, targets):
        sigma, weight, mu = self.activate(X)
        gauss = self.gauss(mu, sigma, targets)
        return -np.sum(np.log(np.sum(weight * gauss, axis=1)))

    def activate(self, X):
        assert np.size(X, 1) == 3 * self.n_components
        X_sigma, X_weight, X_mu = np.split(X, [self.n_components, 2 * self.n_components], axis=1)
        sigma = np.exp(X_sigma)
        weight = np.exp(X_weight - np.max(X_weight, 1, keepdims=True))
        weight /= np.sum(weight, axis=1, keepdims=True)
        return sigma, weight, X_mu

    def gauss(self, mu, sigma, targets):
        return np.exp(-0.5 * (mu - targets) ** 2 / np.square(sigma)) / np.sqrt(2 * np.pi * np.square(sigma))

    def delta(self, X, targets):
        sigma, weight, mu = self.activate(X)
        var = np.square(sigma)
        gamma = weight * self.gauss(mu, sigma, targets)
        gamma /= np.sum(gamma, axis=1, keepdims=True)
        delta_mu = gamma * (mu - targets) / var
        delta_sigma = gamma * (1 - (mu - targets) ** 2 / var)
        delta_weight = weight - gamma
        delta = np.hstack([delta_sigma, delta_weight, delta_mu])
        return delta


def create_toy_dataset(func, n=300):
    t = np.random.uniform(size=(n, 1))
    x = func(t) + np.random.uniform(-0.05, 0.05, size=(n, 1))
    return x, t


def sample(x, t, n=None):
    assert len(x) == len(t)
    N = len(x)
    if n is None:
        n = N
    indices = np.random.choice(N, n, replace=False)
    return x[indices], t[indices]


def main():

    def func(x):
        return x + 0.3 * np.sin(2 * np.pi * x)

    x, t = create_toy_dataset(func)

    layers = [TanhLayer(1, 5, std=0.1), LinearLayer(5, 9, std=0.1)]
    cost_function = GaussianMixture(3)
    nn = NeuralNetwork(layers, cost_function)
    # nn._gradient_check(np.array([[0.5]]), np.array([[0.5]]))

    learning_rate = 1e-4
    for i in xrange(500000):
        if i % 10000 == 0:
            print "step %6d, cost %f" % (i, nn.cost(x, t))
            learning_rate *= 0.9
        batch = sample(x, t, n=100)
        nn.fit(*batch, learning_rate=learning_rate)

    x_test = np.linspace(x.min(), x.max(), 100)
    y_test = np.linspace(t.min(), t.max(), 100)
    X_test, Y_test = np.meshgrid(x_test, y_test)
    test = np.array([X_test, Y_test]).transpose(1, 2, 0).reshape(-1, 2)

    sigma, weight, mu = nn(test[:, 0].reshape(-1, 1))
    probs = cost_function.gauss(mu, sigma, test[:, 1].reshape(-1, 1))
    probs = np.sum(weight * probs, axis=1)
    Probs = probs.reshape(100, 100)

    plt.plot(x_test, weight[:100, 0], color="blue")
    plt.plot(x_test, weight[:100, 1], color="red")
    plt.plot(x_test, weight[:100, 2], color="green")
    plt.title("weights")
    plt.show()

    plt.plot(x_test, mu[:100, 0], color="blue")
    plt.plot(x_test, mu[:100, 1], color="red")
    plt.plot(x_test, mu[:100, 2], color="green")
    plt.title("means")
    plt.show()

    plt.scatter(x, t, alpha=0.5, label="observation")
    levels_log = np.linspace(0, np.log(probs.max()), 21)
    levels = np.exp(levels_log)
    levels[0] = 0
    plt.contourf(X_test, Y_test, Probs, levels, alpha=0.5)
    plt.colorbar()
    plt.xlim(x.min(), x.max())
    plt.ylim(t.min(), t.max())
    plt.show()


if __name__ == '__main__':
    main()
Running the code above with the blue points as training data trains the mixture density network and reproduces the graphs shown in PRML Figure 5.21 (the line colors may differ). The training data points are fitted by three Gaussian components, shown in blue, red, and green.
Like last time, I made a video of the learning process.
I had not heard of mixture density networks until I read about them in PRML; searching the net turned up a 1994 paper by C. M. Bishop, the author of PRML, so he may well be the one who proposed them. Mixture density networks seem more effective than conventional models when the data are multimodal, so I would like to apply this to some real data.