Last time, I implemented an ordinary neural network without using libraries such as TensorFlow or Chainer. There are already plenty of implementations like that on the net, though, and it is not very interesting on its own, so that post was really preparation for this one: an implementation of a **mixture density network**. The code reuses the neural network implemented last time, so please refer to the previous article as needed.
When solving a regression problem, the sum-of-squares error function cannot always handle the data well. The figure below (a reproduction of PRML Figure 5.19) shows the result of training a neural network with the sum-of-squares error function, using the blue points as training data. At $x = 0.5$, values of $y$ around 0.2, 0.5, and 0.8 all look equally plausible, but since the network has to pick a single value it ends up around 0.5 and fails to fit the other two branches. **The trained network fits poorly because this cost function corresponds to a unimodal Gaussian model.** In that model, the relationship between the input $x$ and the target $t$ is expressed by a single Gaussian distribution
p(t|x) = \mathcal{N}(t|\mu(x),\sigma^2),
and fitting amounts to estimating the function $\mu(x)$. The estimated $\mu(x)$ is the red curve in the figure above, which is clearly a poor fit.
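Loosely speaking, this is exactly what should happen: with a sum-of-squares (unimodal Gaussian) cost and a sufficiently flexible model, the optimal prediction approaches the conditional average of the target,

y(x) = \mathbb{E}[t\,|\,x] = \int t\,p(t|x)\,\mathrm{d}t,

which at $x = 0.5$ averages over the three branches instead of following any one of them.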
The mixture density network solves this problem by using a cost function based on a multimodal Gaussian mixture. We introduce a mixing coefficient $\pi_k$ for each of the $K$ Gaussian components and model the relationship between the input $x$ and the target $t$ as
p(t|x) = \sum_{k=1}^K\pi_k(x)\mathcal{N}(t|\mu_k(x),\sigma^2_k(x))
where the mixing coefficients sum to one, $\sum_k \pi_k = 1$. In the training data above, at $x = 0.5$ there are three clusters of data points, around $y = 0.2, 0.5, 0.8$, so $K = 3$ seems a good choice. By fitting the three Gaussian components to different clusters of points, even multimodal data can be handled. Given a training data set $\{x_n, t_n\}_{n=1}^N$, the cost function to optimize is
\begin{align}
E(w) &= \sum_{n=1}^N E_n\\
&= -\sum_{n=1}^N \ln p(t_n|x_n)\\
&= -\sum_{n=1}^N \ln \left\{\sum_{k=1}^K\pi_k(x_n,w)\mathcal{N}\left(t_n|\mu_k(x_n,w),\sigma^2_k(x_n,w)\right)\right\}
\end{align}
We then estimate the parameters $w$ that minimize this expression. A neural network provides the functions $\pi_k, \mu_k, \sigma_k$: feeding an input $x$ into the network with parameters $w$ yields the mixing coefficients, means, and standard deviations, from which the probability distribution of $t$ is computed.
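As a concrete illustration of how this predictive density and its per-sample cost $E_n$ are evaluated once the network has produced $\pi_k, \mu_k, \sigma_k$ for one input, here is a minimal sketch; the parameter values are made up purely for illustration.

```python
import numpy as np

def gauss(t, mu, sigma):
    # N(t | mu, sigma^2)
    return np.exp(-0.5 * (t - mu) ** 2 / sigma ** 2) / np.sqrt(2 * np.pi * sigma ** 2)

# hypothetical network outputs for a single input x (K = 3)
pi = np.array([0.3, 0.4, 0.3])        # mixing coefficients, sum to 1
mu = np.array([0.2, 0.5, 0.8])        # means of the three components
sigma = np.array([0.05, 0.05, 0.05])  # standard deviations
t = 0.5                               # target value

p = np.sum(pi * gauss(t, mu, sigma))  # p(t|x), the mixture density above
E_n = -np.log(p)                      # this sample's contribution to E(w)
print(E_n)
```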
Since we perform regression on data like that shown above, the input ${\bf x}$ is one-dimensional. Following PRML, the hidden layer ${\bf z}$ has 5 units and the output ${\bf y}$ has 9 units. With first-layer parameters $W^{(1)}, {\bf b}^{(1)}$, activation function $f(\cdot)$, and second-layer parameters $W^{(2)}, {\bf b}^{(2)}$, forward propagation is
\begin{align}
{\bf z} &= f(W^{(1)}{\bf x} + {\bf b}^{(1)})\\
{\bf y} &= W^{(2)}{\bf z} + {\bf b}^{(2)}
\end{align}
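As a rough numpy sketch of these two lines (the tanh activation and the 1-5-9 layer sizes come from the text; the random initialization and the column-vector array shapes here are only illustrative and do not necessarily match the previous article's layer classes):

```python
import numpy as np

rng = np.random.RandomState(0)
# 1 input unit, 5 hidden units, 9 output units (3 components x 3 parameter types)
W1, b1 = rng.normal(scale=0.1, size=(5, 1)), np.zeros((5, 1))
W2, b2 = rng.normal(scale=0.1, size=(9, 5)), np.zeros((9, 1))

x = np.array([[0.5]])        # a single scalar input
z = np.tanh(W1.dot(x) + b1)  # hidden layer, f = tanh
y = W2.dot(z) + b2           # nine output activations
```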
These nine outputs are the activations for the Gaussian mixture parameters $\pi_k, \mu_k, \sigma_k$. In groups of three from the top, the components of ${\bf y}$ correspond to the activations for $\pi$, $\mu$, and $\sigma$:
{\bf y} =
\begin{bmatrix}
a_{\pi_1}\\
a_{\pi_2}\\
a_{\pi_3}\\
a_{\mu_1}\\
a_{\mu_2}\\
a_{\mu_3}\\
a_{\sigma_1}\\
a_{\sigma_2}\\
a_{\sigma_3}
\end{bmatrix}
Since the mixing coefficients are constrained by $\sum_k \pi_k = 1$, they are produced with the softmax function:
\pi_k = {\exp(a_{\pi_k})\over\sum_{l=1}^K\exp(a_{\pi_l})}
The means use no nonlinear transformation: $\mu_k = a_{\mu_k}$. Since the standard deviations must be positive, the exponential function is used: $\sigma_k = \exp(a_{\sigma_k})$. We now have the functions $\pi_k, \mu_k, \sigma_k$ needed to evaluate the Gaussian mixture.
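Continuing the sketch above, the nine activations are split and transformed exactly as described. Note that this illustration follows the ordering of the vector shown above ($\pi$, $\mu$, $\sigma$), while the article's code below splits in the order $\sigma$, $\pi$, $\mu$:

```python
a_pi, a_mu, a_sigma = y[:3], y[3:6], y[6:]  # split as in the vector above

pi = np.exp(a_pi - a_pi.max())              # softmax, shifted by the max for stability
pi /= pi.sum()                              # mixing coefficients sum to 1
mu = a_mu                                   # means: identity
sigma = np.exp(a_sigma)                     # standard deviations: exponential keeps them positive
```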
As I wrote in the previous article on implementing a neural network, training requires the gradient of the cost function with respect to the network outputs; with it, backpropagation then gives the derivatives with respect to the parameters $w$. Defining the responsibilities
\gamma_{nk}(t_n|x_n) = {\pi_k\mathcal{N}(t_n|\mu_k(x_n,w),\sigma_k^2(x_n,w))\over\sum_l\pi_l\mathcal{N}(t_n|\mu_l(x_n,w),\sigma_l^2(x_n,w))}
the gradients of $E_n$ with respect to the output activations are
\begin{align}
{\partial E_n\over\partial a_{\pi_k}} &= \pi_k - \gamma_{nk}\\
{\partial E_n\over\partial a_{\mu_k}} &= \gamma_{nk}{\mu_k - t_n\over\sigma^2_k}\\
{\partial E_n\over\partial a_{\sigma_k}} &= \gamma_{nk}\left(1 - {||t_n - \mu_k||^2\over \sigma^2_k}\right)
\end{align}
The derivation of these formulas is an exercise in PRML, so refer to the solutions or to the original paper on mixture density networks.
The code implementing the formulas above is as follows.
# Cost function class
class GaussianMixture(object):
    """Negative Log Likelihood of Gaussian Mixture model"""

    def __init__(self, n_components):
        # number of Gaussian components; 3 in the example above
        self.n_components = n_components

    # compute the value of the cost function
    def __call__(self, X, targets):
        # convert the network output X with the output activation functions
        # to get the standard deviations, mixing coefficients, and means
        sigma, weight, mu = self.activate(X)
        # value of the Gaussian N(t|mu, sigma^2)
        gauss = self.gauss(mu, sigma, targets)
        # negative log-likelihood E(w) of the Gaussian mixture: PRML Eq. (5.153)
        return -np.sum(np.log(np.sum(weight * gauss, axis=1)))

    # apply the output activation functions
    def activate(self, X):
        assert np.size(X, 1) == 3 * self.n_components
        # split X into the parts corresponding to standard deviation, mixing coefficient, and mean
        X_sigma, X_weight, X_mu = np.split(X, [self.n_components, 2 * self.n_components], axis=1)
        # transform the standard deviations with the exponential
        sigma = np.exp(X_sigma)
        # softmax for the mixing coefficients; subtract the maximum to avoid overflow
        weight = np.exp(X_weight - np.max(X_weight, 1, keepdims=True))
        weight /= np.sum(weight, axis=1, keepdims=True)
        return sigma, weight, X_mu

    # Gaussian N(target|mu, sigma^2)
    def gauss(self, mu, sigma, targets):
        return np.exp(-0.5 * (mu - targets) ** 2 / np.square(sigma)) / np.sqrt(2 * np.pi * np.square(sigma))

    # derivative of the cost function with respect to the output activations
    def delta(self, X, targets):
        sigma, weight, mu = self.activate(X)
        var = np.square(sigma)
        gamma = weight * self.gauss(mu, sigma, targets)
        gamma /= np.sum(gamma, axis=1, keepdims=True)
        # compute each derivative
        delta_mu = gamma * (mu - targets) / var
        delta_sigma = gamma * (1 - (mu - targets) ** 2 / var)
        delta_weight = weight - gamma
        # concatenate and return
        delta = np.hstack([delta_sigma, delta_weight, delta_mu])
        return delta


# create training data following the inverse of func
def create_toy_dataset(func, n=300):
    t = np.random.uniform(size=(n, 1))
    x = func(t) + np.random.uniform(-0.05, 0.05, size=(n, 1))
    return x, t
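Before moving on, the `delta` method above can be checked against a numerical derivative of the cost for a single sample (the previous article's `NeuralNetwork` exposes a similar `_gradient_check`, which is used in `main()` below). A minimal sketch, assuming the `GaussianMixture` class above is in scope:

```python
import numpy as np

cost = GaussianMixture(3)
X = 0.1 * np.random.randn(1, 9)   # fake output activations for one input
t = np.array([[0.5]])

analytic = cost.delta(X, t)
numeric = np.zeros_like(X)
eps = 1e-6
for i in range(X.size):
    Xp, Xm = X.copy(), X.copy()
    Xp.flat[i] += eps
    Xm.flat[i] -= eps
    # central finite difference of the negative log-likelihood
    numeric.flat[i] = (cost(Xp, t) - cost(Xm, t)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))
```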
The following is a trick I threw in because training sometimes did not go well: instead of computing the gradient from the whole training set, we draw a subset of it at each step. This mini-batch sampling can help the optimization escape local minima.
# randomly select n pairs from the training set (x, t)
def sample(x, t, n=None):
    assert len(x) == len(t)
    N = len(x)
    if n is None:
        n = N
    indices = np.random.choice(N, n, replace=False)
    return x[indices], t[indices]
This is the main function that trains the mixture density network and plots the results.
def main():

    def func(x):
        return x + 0.3 * np.sin(2 * np.pi * x)

    # generate data points that follow the inverse of func
    x, t = create_toy_dataset(func)

    # define the structure of the neural network
    layers = [TanhLayer(1, 5, std=0.1), LinearLayer(5, 9, std=0.1)]
    # choose the cost function to optimize
    cost_function = GaussianMixture(3)
    # build the neural network
    nn = NeuralNetwork(layers, cost_function)
    # uncomment the line below to check that the implementation is working
    # nn._gradient_check(np.array([[0.5]]), np.array([[0.5]]))

    # set the initial learning rate
    learning_rate = 1e-4
    for i in xrange(500000):
        # every 10,000 iterations, print the cost and multiply the learning rate by 0.9
        if i % 10000 == 0:
            print "step %6d, cost %f" % (i, nn.cost(x, t))
            learning_rate *= 0.9
        # mini-batch processing
        batch = sample(x, t, n=100)
        nn.fit(*batch, learning_rate=learning_rate)

    # create test data
    x_test = np.linspace(x.min(), x.max(), 100)
    y_test = np.linspace(t.min(), t.max(), 100)
    X_test, Y_test = np.meshgrid(x_test, y_test)
    test = np.array([X_test, Y_test]).transpose(1, 2, 0).reshape(-1, 2)

    # compute the likelihood of the test data
    sigma, weight, mu = nn(test[:, 0].reshape(-1, 1))
    probs = cost_function.gauss(mu, sigma, test[:, 1].reshape(-1, 1))
    probs = np.sum(weight * probs, axis=1)
    Probs = probs.reshape(100, 100)

    # reproduction of PRML Figure 5.21(a)
    plt.plot(x_test, weight[:100, 0], color="blue")
    plt.plot(x_test, weight[:100, 1], color="red")
    plt.plot(x_test, weight[:100, 2], color="green")
    plt.title("weights")
    plt.show()

    # reproduction of PRML Figure 5.21(b)
    plt.plot(x_test, mu[:100, 0], color="blue")
    plt.plot(x_test, mu[:100, 1], color="red")
    plt.plot(x_test, mu[:100, 2], color="green")
    plt.title("means")
    plt.show()

    # reproduction of PRML Figure 5.21(c)
    plt.scatter(x, t, alpha=0.5, label="observation")
    levels_log = np.linspace(0, np.log(probs.max()), 21)
    levels = np.exp(levels_log)
    levels[0] = 0
    plt.contourf(X_test, Y_test, Probs, levels, alpha=0.5)
    plt.colorbar()
    plt.xlim(x.min(), x.max())
    plt.ylim(t.min(), t.max())
    plt.show()
The `neural_network` module imported on the third line is the neural_network.py implemented last time; put it in the same directory as this code.
mixture_density_network.py
import matplotlib.pyplot as plt
import numpy as np
from neural_network import TanhLayer, LinearLayer, NeuralNetwork


class GaussianMixture(object):
    """Negative Log Likelihood of Gaussian Mixture model"""

    def __init__(self, n_components):
        self.n_components = n_components

    def __call__(self, X, targets):
        sigma, weight, mu = self.activate(X)
        gauss = self.gauss(mu, sigma, targets)
        return -np.sum(np.log(np.sum(weight * gauss, axis=1)))

    def activate(self, X):
        assert np.size(X, 1) == 3 * self.n_components
        X_sigma, X_weight, X_mu = np.split(X, [self.n_components, 2 * self.n_components], axis=1)
        sigma = np.exp(X_sigma)
        weight = np.exp(X_weight - np.max(X_weight, 1, keepdims=True))
        weight /= np.sum(weight, axis=1, keepdims=True)
        return sigma, weight, X_mu

    def gauss(self, mu, sigma, targets):
        return np.exp(-0.5 * (mu - targets) ** 2 / np.square(sigma)) / np.sqrt(2 * np.pi * np.square(sigma))

    def delta(self, X, targets):
        sigma, weight, mu = self.activate(X)
        var = np.square(sigma)
        gamma = weight * self.gauss(mu, sigma, targets)
        gamma /= np.sum(gamma, axis=1, keepdims=True)
        delta_mu = gamma * (mu - targets) / var
        delta_sigma = gamma * (1 - (mu - targets) ** 2 / var)
        delta_weight = weight - gamma
        delta = np.hstack([delta_sigma, delta_weight, delta_mu])
        return delta


def create_toy_dataset(func, n=300):
    t = np.random.uniform(size=(n, 1))
    x = func(t) + np.random.uniform(-0.05, 0.05, size=(n, 1))
    return x, t


def sample(x, t, n=None):
    assert len(x) == len(t)
    N = len(x)
    if n is None:
        n = N
    indices = np.random.choice(N, n, replace=False)
    return x[indices], t[indices]


def main():

    def func(x):
        return x + 0.3 * np.sin(2 * np.pi * x)

    x, t = create_toy_dataset(func)

    layers = [TanhLayer(1, 5, std=0.1), LinearLayer(5, 9, std=0.1)]
    cost_function = GaussianMixture(3)
    nn = NeuralNetwork(layers, cost_function)
    # nn._gradient_check(np.array([[0.5]]), np.array([[0.5]]))

    learning_rate = 1e-4
    for i in xrange(500000):
        if i % 10000 == 0:
            print "step %6d, cost %f" % (i, nn.cost(x, t))
            learning_rate *= 0.9
        batch = sample(x, t, n=100)
        nn.fit(*batch, learning_rate=learning_rate)

    x_test = np.linspace(x.min(), x.max(), 100)
    y_test = np.linspace(t.min(), t.max(), 100)
    X_test, Y_test = np.meshgrid(x_test, y_test)
    test = np.array([X_test, Y_test]).transpose(1, 2, 0).reshape(-1, 2)

    sigma, weight, mu = nn(test[:, 0].reshape(-1, 1))
    probs = cost_function.gauss(mu, sigma, test[:, 1].reshape(-1, 1))
    probs = np.sum(weight * probs, axis=1)
    Probs = probs.reshape(100, 100)

    plt.plot(x_test, weight[:100, 0], color="blue")
    plt.plot(x_test, weight[:100, 1], color="red")
    plt.plot(x_test, weight[:100, 2], color="green")
    plt.title("weights")
    plt.show()

    plt.plot(x_test, mu[:100, 0], color="blue")
    plt.plot(x_test, mu[:100, 1], color="red")
    plt.plot(x_test, mu[:100, 2], color="green")
    plt.title("means")
    plt.show()

    plt.scatter(x, t, alpha=0.5, label="observation")
    levels_log = np.linspace(0, np.log(probs.max()), 21)
    levels = np.exp(levels_log)
    levels[0] = 0
    plt.contourf(X_test, Y_test, Probs, levels, alpha=0.5)
    plt.colorbar()
    plt.xlim(x.min(), x.max())
    plt.ylim(t.min(), t.max())
    plt.show()


if __name__ == '__main__':
    main()
Running the code above with the blue points as training data trains the mixture density network and reproduces the graphs shown in PRML Figure 5.21 (the line colors may differ). The training data points are fitted by three Gaussian components, shown in blue, red, and green.
Like last time, I made a video of the learning process.
I had not heard of mixture density networks until I read about them in PRML; searching the net turned up a 1994 paper by C. M. Bishop, the author of PRML, so he may well be the one who proposed them. Mixture density networks seem more effective than conventional models when the data are multimodal, so I would like to apply this to some real data.