Chapter 5 of PRML covers neural networks, which have become very popular recently. There are already many neural network implementations on the web, so I wanted to work on something I was less familiar with and decided to implement a mixture density network almost entirely with NumPy. However, the amount of code turned out to be fairly large, so I am splitting it into two parts: in this article I implement an ordinary neural network, and the mixture density network will follow in the next one.
The example network is a two-layer neural network with a three-dimensional input ${\bf x} = (x_1, x_2, x_3)$, a four-dimensional hidden unit ${\bf z} = (z_1, z_2, z_3, z_4)$, and a two-dimensional output ${\bf y} = (y_1, y_2)$.
Forward propagation is the step of computing the network's output from its input: first the hidden unit ${\bf z}$ is computed from the input ${\bf x}$, and then the output ${\bf y}$ is computed from ${\bf z}$.
One of the first-layer hidden units, $z_1$, is computed as
\begin{align}
a_{z_1} &= w_{11}^{(1)}x_1+w_{12}^{(1)}x_2+w_{13}^{(1)}x_3 + b_1^{(1)}\\
z_1 &= f^{(1)}(a_{z_1})
\end{align}
where $a_{z_1}$ is the activation of the first hidden unit, $w_{1j}^{(1)}$ is the weight from the $j$-th input unit to the first hidden unit, $b_1^{(1)}$ is the bias of the first hidden unit, and $f^{(1)}$ is the activation function of the first layer.
The same kind of formula can be written for $z_2, z_3, z_4$, but carrying around three more equations becomes cumbersome, so the matrix form is usually used:
\begin{align}
\begin{bmatrix}
a_{z_1}\\
a_{z_2}\\
a_{z_3}\\
a_{z_4}
\end{bmatrix}
&=
\begin{bmatrix}
w_{11}^{(1)} & w_{12}^{(1)} & w_{13}^{(1)}\\
w_{21}^{(1)} & w_{22}^{(1)} & w_{23}^{(1)}\\
w_{31}^{(1)} & w_{32}^{(1)} & w_{33}^{(1)}\\
w_{41}^{(1)} & w_{42}^{(1)} & w_{43}^{(1)}
\end{bmatrix}
\begin{bmatrix}
x_1\\
x_2\\
x_3
\end{bmatrix}
+
\begin{bmatrix}
b_1^{(1)}\\
b_2^{(1)}\\
b_3^{(1)}\\
b_4^{(1)}
\end{bmatrix}
\\
\begin{bmatrix}
z_1\\
z_2\\
z_3\\
z_4
\end{bmatrix}
&=
\begin{bmatrix}
f^{(1)}(a_{z_1})\\
f^{(1)}(a_{z_2})\\
f^{(1)}(a_{z_3})\\
f^{(1)}(a_{z_4})
\end{bmatrix}
\end{align}
Written more concisely, this becomes
\begin{align}
{\bf a}_z &= W^{(1)}{\bf x} + {\bf b}^{(1)}\\
{\bf z} &= f^{(1)}({\bf a}_z)
\end{align}
This completes the forward propagation of the first layer.
The second layer can be expressed in the same way:
\begin{align}
{\bf a}_y &= W^{(2)}{\bf z} + {\bf b}^{(2)}\\
{\bf y} &= f^{(2)}({\bf a}_y)
\end{align}
Putting the two layers together,
{\bf y} = f^{(2)}(W^{(2)}f^{(1)}(W^{(1)}{\bf x} + {\bf b}^{(1)}) + {\bf b}^{(2)})
In this way, the output can be computed from the input.
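To make the shapes concrete, here is a minimal NumPy sketch of this forward pass for the 3-4-2 network above (all values are made up; tanh for the hidden layer and the identity for the output are just one possible choice of activation functions):

```python
import numpy as np

x = np.array([0.5, -0.2, 0.1])          # 3-dimensional input
W1 = np.random.normal(size=(4, 3))      # first-layer weights
b1 = np.zeros(4)                        # first-layer biases
W2 = np.random.normal(size=(2, 4))      # second-layer weights
b2 = np.zeros(2)                        # second-layer biases

a_z = W1.dot(x) + b1                    # first-layer activations
z = np.tanh(a_z)                        # hidden unit, f^(1) = tanh
a_y = W2.dot(z) + b2                    # second-layer activations
y = a_y                                 # output, f^(2) = identity
print(y.shape)                          # (2,)
```

This sketch follows the column-vector convention of the equations; the `Layer` class later in this article stores the weights with shape `(dim_input, dim_output)` and works with batches of row vectors, computing `X.dot(self.w) + self.b`.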
The neural network has many parameters: in this example, $4\times3$ weights plus $4$ biases in the first layer and $2\times4$ weights plus $2$ biases in the second layer, 26 in total. Backpropagation is an efficient way to compute the gradients of the cost function with respect to all of them.
Given a pair of an input and its target, we first compute the partial derivative of the cost function $E$ with respect to each output,
{\partial E\over\partial y_i}
Using this, the error at ${\bf a}_y$, that is, the partial derivative of the cost function with respect to ${\bf a}_y$, is obtained as
\begin{align}
{\partial E\over\partial a_{y_i}} &= {\partial E\over\partial y_i}{\partial y_i\over\partial a_{y_i}}\\
&= {\partial E\over\partial y_i}f'^{(2)}(a_{y_i})\\
(&= y_i - t_i)
\end{align}
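As a quick check of the parenthesized line (my worked example, using the sum-of-squares error and the identity output activation):
\begin{align}
E &= {1\over2}\sum_i(y_i - t_i)^2, \qquad y_i = f^{(2)}(a_{y_i}) = a_{y_i}\\
{\partial E\over\partial a_{y_i}} &= {\partial E\over\partial y_i}f'^{(2)}(a_{y_i}) = (y_i - t_i)\cdot 1 = y_i - t_i
\end{align}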
If the output-layer activation function $f^{(2)}$ is the canonical link function for the cost, such as the identity map, the sigmoid function, or the softmax function, the error at ${\bf a}_y$ is simply **the difference between the output and the target**. Once the error at ${\bf a}_y$ is known, the gradients of the second-layer parameters can be computed. Since
\begin{align}
{\partial a_{y_i}\over\partial w_{ij}^{(2)}} &= z_j\\
{\partial a_{y_i}\over\partial b_i^{(2)}} &= 1
\end{align}
we have
\begin{align}
{\partial E\over\partial w_{ij}^{(2)}} &= {\partial E\over\partial a_{y_i}}{\partial a_{y_i}\over\partial w_{ij}^{(2)}}\\
&= {\partial E\over\partial a_{y_i}}z_j\\
{\partial E\over\partial b_i^{(2)}} &= {\partial E\over\partial a_{y_i}}{\partial a_{y_i}\over\partial b_i^{(2)}}\\
&= {\partial E\over\partial a_{y_i}}
\end{align}
We propagated the output error back to obtain the error at the activation ${\bf a}_y$ and used it to compute the parameter gradients. Furthermore, once the error at ${\bf a}_y$ is known, we can also compute the error at the second layer's input ${\bf z}$:
\begin{align}
{\partial E\over\partial z_j} &= \sum_{i=1}^2 {\partial E\over\partial a_{y_i}}{\partial a_{y_i}\over\partial z_j}\\
&= \sum_{i=1}^2 {\partial E\over\partial a_{y_i}}w_{ij}^{(2)}
\end{align}
In this way, the error is propagated from the output ${\bf y}$ of the second layer back to its input ${\bf z}$. Since the error at the second layer's input ${\bf z}$ is also the error at the first layer's output, repeating the same procedure yields the error at the first-layer activations and the gradients of the first-layer parameters:
\begin{align}
{\partial E\over\partial a_{z_i}} &= {\partial E\over\partial z_i}{\partial z_i\over\partial a_{z_i}}\\
&= {\partial E\over\partial z_i}f'^{(1)}(a_{z_i})\\
{\partial E\over\partial x_j} &= \sum_{i=1}^4 {\partial E\over\partial a_{z_i}}{\partial a_{z_i}\over\partial x_j}\\
&= \sum_{i=1}^4 {\partial E\over\partial a_{z_i}}w_{ij}^{(1)}\\
{\partial E\over\partial w_{ij}^{(1)}} &= {\partial E\over\partial a_{z_i}}{\partial a_{z_i}\over\partial w_{ij}^{(1)}}\\
&= {\partial E\over\partial a_{z_i}}x_j\\
{\partial E\over\partial b_i^{(1)}} &= {\partial E\over\partial a_{z_i}}{\partial a_{z_i}\over\partial b_i^{(1)}}\\
&= {\partial E\over\partial a_{z_i}}
\end{align}
In this way, the error at the output layer is propagated back toward the input, and the gradients needed to update the parameters are obtained along the way.
To summarize: the error at the output units, ${\partial E\over\partial y_i}$, is propagated backwards to the error at the second-layer activations, ${\partial E\over\partial a_{y_i}}$, and the gradients of the second-layer parameters are computed from that error:
\begin{align}
{\partial E\over\partial w_{ij}^{(2)}} &= {\partial E\over\partial a_{y_i}}z_j\\
{\partial E\over\partial b_i^{(2)}} &= {\partial E\over\partial a_{y_i}}
\end{align}
The error is then propagated to the first layer: from the error at the hidden units, ${\partial E\over\partial z_j}$, the error at the first-layer activations, ${\partial E\over\partial a_{z_i}}$, is obtained, and the first-layer parameter gradients follow:
\begin{align}
{\partial E\over\partial w_{ij}^{(1)}} &= {\partial E\over\partial a_{z_i}}x_j\\
{\partial E\over\partial b_i^{(1)}} &= {\partial E\over\partial a_{z_i}}
\end{align}
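The whole procedure fits in a few lines of NumPy. This is a minimal sketch continuing the made-up 3-4-2 example above (tanh hidden layer, identity output, sum-of-squares error), not the class-based implementation used later:

```python
import numpy as np

# made-up data and parameters for a 3-4-2 network
x = np.array([0.5, -0.2, 0.1])
t = np.array([1.0, 0.0])
W1, b1 = np.random.normal(size=(4, 3)), np.zeros(4)
W2, b2 = np.random.normal(size=(2, 4)), np.zeros(2)

# forward propagation
a_z = W1.dot(x) + b1
z = np.tanh(a_z)
a_y = W2.dot(z) + b2
y = a_y                                   # identity output activation

# backward propagation of the error
delta_y = y - t                           # dE/da_y for sum-of-squares + identity
grad_W2 = np.outer(delta_y, z)            # dE/dW2
grad_b2 = delta_y                         # dE/db2
delta_z = W2.T.dot(delta_y) * (1 - z**2)  # dE/da_z, since tanh'(a) = 1 - tanh(a)^2
grad_W1 = np.outer(delta_z, x)            # dE/dW1
grad_b1 = delta_z                         # dE/db1

# one gradient-descent step
learning_rate = 0.1
W2 -= learning_rate * grad_W2; b2 -= learning_rate * grad_b2
W1 -= learning_rate * grad_W1; b1 -= learning_rate * grad_b1
```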
For weight initialization I use a truncated normal distribution, which is the one thing here that needs a library other than NumPy. I normally use TensorFlow when building neural networks, and since TensorFlow initializes weights with a truncated normal distribution, I follow the same convention here.
```python
import numpy as np
from scipy.stats import truncnorm
```
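One thing to note about `scipy.stats.truncnorm`: its `a` and `b` arguments are specified in units of the unscaled standard normal, so with `scale=std` the samples are clipped to `[a * std, b * std]`. A quick check (the numbers here are arbitrary):

```python
import numpy as np
from scipy.stats import truncnorm

# with a=-2, b=2 and scale=0.5 the samples are clipped to roughly [-1.0, 1.0],
# i.e. two standard deviations, matching TensorFlow's truncated normal initializer
samples = truncnorm(a=-2, b=2, scale=0.5).rvs(10000)
print(samples.min(), samples.max())
```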
A class representing one layer of a neural network. The weights are initialized when an instance is created; calling the instance propagates an input forward, and `back_propagation` propagates the error backwards. The layers used to actually build the network are derived from this class.
```python
class Layer(object):

    def __init__(self, dim_input, dim_output, std=1., bias=0.):
        # weights drawn from a normal distribution truncated at two standard deviations
        self.w = truncnorm(a=-2, b=2, scale=std).rvs((dim_input, dim_output))
        self.b = np.ones(dim_output) * bias

    def __call__(self, X):
        self.input = X
        return self.forward_propagation(X)

    def back_propagation(self, delta, learning_rate):
        # derivative with respect to activation
        delta = delta * self.activation_derivative()
        w = np.copy(self.w)
        self.w -= learning_rate * self.input.T.dot(delta)
        self.b -= learning_rate * np.sum(delta, axis=0)
        # derivative with respect to input
        return delta.dot(w.T)
```
Layer | Description |
---|---|
__init__ | Initializes the parameters given this layer's input and output dimensions |
__call__ | Computes this layer's output by forward propagation from its input |
back_propagation | Given the error at this layer's output and the learning rate, updates the parameters and returns the error at this layer's input |
A layer whose activation function is the identity map $f(a) = a$. Each concrete layer is built by defining the methods for forward propagation and for the derivative of its activation function.
```python
class LinearLayer(Layer):

    def forward_propagation(self, X):
        return X.dot(self.w) + self.b

    def activation_derivative(self):
        return 1
```
A layer whose activation function is the logistic sigmoid $f(a) = {1\over1+\exp(-a)}$. The derivative of the logistic sigmoid is $f'(a) = f(a)(1 - f(a))$.
```python
class SigmoidLayer(Layer):

    def forward_propagation(self, X):
        activation = X.dot(self.w) + self.b
        self.output = 1 / (1 + np.exp(-activation))
        return self.output

    def activation_derivative(self):
        return self.output * (1 - self.output)
```
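As a quick shape check (a minimal sketch with made-up values, assuming the classes above have been saved in the `neural_network.py` module shown later): calling a layer maps an `(N, dim_input)` batch to an `(N, dim_output)` batch, and `back_propagation` maps an output-shaped error back to an input-shaped one.

```python
import numpy as np
from neural_network import SigmoidLayer

layer = SigmoidLayer(3, 4)
X = np.random.uniform(-1, 1, size=(5, 3))   # batch of 5 three-dimensional inputs
Z = layer(X)                                # forward propagation, shape (5, 4)
delta = np.random.normal(size=Z.shape)      # pretend error at this layer's output
delta_in = layer.back_propagation(delta, learning_rate=0.0)  # error at the input
print(Z.shape, delta_in.shape)              # (5, 4) (5, 3)
```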
In the same way, layers using the hyperbolic tangent $\tanh(a)$ or the rectified linear function $\max(a, 0)$ as the activation function can be built (both are included in the full code below).
This is the error function typically used for **regression** problems, the sum-of-squares error.
```python
class SumSquaresError(object):

    def activate(self, X):
        return X

    def __call__(self, X, targets):
        return 0.5 * np.sum((X - targets) ** 2)

    def delta(self, X, targets):
        return X - targets
```
This is the error function used for **two-class classification**: the cross-entropy computed after the nonlinear transformation by the logistic sigmoid function.
```python
class SigmoidCrossEntropy(object):

    def activate(self, logits):
        return 1 / (1 + np.exp(-logits))

    def __call__(self, logits, targets):
        probs = self.activate(logits)
        p = np.clip(probs, 1e-10, 1 - 1e-10)
        return np.sum(-targets * np.log(p) - (1 - targets) * np.log(1 - p))

    def delta(self, logits, targets):
        probs = self.activate(logits)
        return probs - targets
```
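The reason `delta` can simply return `probs - targets` is that the sigmoid is the canonical link for this cost: differentiating the cross-entropy with respect to the logits $a_i$ (my notation, with $\sigma$ the logistic sigmoid) gives
\begin{align}
E &= -\sum_i \left\{t_i\ln\sigma(a_i) + (1-t_i)\ln(1-\sigma(a_i))\right\}\\
{\partial E\over\partial a_i} &= \sigma(a_i) - t_i
\end{align}
which is again output minus target.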
The softmax cross-entropy used for multi-class classification is implemented in the same way (see the full code below).
Since each cost-function object applies its own output activation, the final layer of the network should be a `LinearLayer`. Whether backpropagation is implemented correctly can be checked against a finite-difference approximation of the gradient.
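Concretely, the `_gradient_check` method below perturbs the first component of the input by a small step $\epsilon$ (`eps` in the code) and compares the backpropagated error at that component with the central-difference approximation
\begin{align}
{\partial E\over\partial x_1} \approx {E(x_1+\epsilon) - E(x_1-\epsilon)\over 2\epsilon}
\end{align}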
```python
class NeuralNetwork(object):

    def __init__(self, layers, cost_function):
        self.layers = layers
        self.cost_function = cost_function

    def __call__(self, X):
        for layer in self.layers:
            X = layer(X)
        return self.cost_function.activate(X)

    def fit(self, X, t, learning_rate):
        for layer in self.layers:
            X = layer(X)
        delta = self.cost_function.delta(X, t)
        for layer in reversed(self.layers):
            delta = layer.back_propagation(delta, learning_rate)

    def cost(self, X, t):
        for layer in self.layers:
            X = layer(X)
        return self.cost_function(X, t)

    def _gradient_check(self, X=None, t=None, eps=1e-6):
        if X is None:
            X = np.array([[0.5 for _ in range(np.size(self.layers[0].w, 0))]])
        if t is None:
            t = np.zeros((1, np.size(self.layers[-1].w, 1)))
            t[0, 0] = 1.
        e = np.zeros_like(X)
        e[:, 0] += eps
        x_plus_e = X + e
        x_minus_e = X - e
        grad = (self.cost(x_plus_e, t) - self.cost(x_minus_e, t)) / (2 * eps)
        for layer in self.layers:
            X = layer(X)
        delta = self.cost_function.delta(X, t)
        for layer in reversed(self.layers):
            delta = layer.back_propagation(delta, 0)
        print("===================================")
        print("checking gradient")
        print("finite difference", grad)
        print(" back propagation", delta[0, 0])
        print("===================================")
```
NeuralNetwork | Description |
---|---|
__init__ | Defines the network structure and the cost function |
__call__ | Forward propagation |
fit | Trains the network |
cost | Computes the value of the cost function |
_gradient_check | Checks the backpropagated gradient against a finite difference |
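For instance, the 3-4-2 network from the beginning of this article could be put together like this (a minimal sketch with made-up data, assuming the module is saved as the `neural_network.py` shown just below):

```python
import numpy as np
from neural_network import SigmoidLayer, LinearLayer, SumSquaresError, NeuralNetwork

# 3-dimensional input, 4 hidden units, 2 outputs
nn = NeuralNetwork([SigmoidLayer(3, 4), LinearLayer(4, 2)], SumSquaresError())
nn._gradient_check()                        # backprop vs. finite difference

X = np.random.uniform(-1, 1, size=(10, 3))  # made-up inputs
t = np.random.uniform(-1, 1, size=(10, 2))  # made-up targets
for _ in range(10000):
    nn.fit(X, t, learning_rate=0.001)       # gradient-descent training
print(nn.cost(X, t))                        # cost after training
y = nn(X)                                   # predictions, shape (10, 2)
```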
The whole code is below. The scripts that follow import just what they need from this module to solve regression and classification problems.
neural_network.py
```python
import numpy as np
from scipy.stats import truncnorm


class Layer(object):

    def __init__(self, dim_input, dim_output, std=1., bias=0.):
        # weights drawn from a normal distribution truncated at two standard deviations
        self.w = truncnorm(a=-2, b=2, scale=std).rvs((dim_input, dim_output))
        self.b = np.ones(dim_output) * bias

    def __call__(self, X):
        self.input = X
        return self.forward_propagation(X)

    def back_propagation(self, delta, learning_rate):
        # derivative with respect to activation
        delta = delta * self.activation_derivative()
        w = np.copy(self.w)
        self.w -= learning_rate * self.input.T.dot(delta)
        self.b -= learning_rate * np.sum(delta, axis=0)
        # derivative with respect to input
        return delta.dot(w.T)


class LinearLayer(Layer):

    def forward_propagation(self, X):
        return X.dot(self.w) + self.b

    def activation_derivative(self):
        return 1


class SigmoidLayer(Layer):

    def forward_propagation(self, X):
        activation = X.dot(self.w) + self.b
        self.output = 1 / (1 + np.exp(-activation))
        return self.output

    def activation_derivative(self):
        return self.output * (1 - self.output)


class TanhLayer(Layer):

    def forward_propagation(self, X):
        activation = X.dot(self.w) + self.b
        self.output = np.tanh(activation)
        return self.output

    def activation_derivative(self):
        return 1 - self.output ** 2


class ReLULayer(Layer):

    def forward_propagation(self, X):
        activation = X.dot(self.w) + self.b
        self.output = activation.clip(min=0)
        return self.output

    def activation_derivative(self):
        return (self.output > 0).astype(float)


class SigmoidCrossEntropy(object):

    def activate(self, logits):
        return 1 / (1 + np.exp(-logits))

    def __call__(self, logits, targets):
        probs = self.activate(logits)
        p = np.clip(probs, 1e-10, 1 - 1e-10)
        return np.sum(-targets * np.log(p) - (1 - targets) * np.log(1 - p))

    def delta(self, logits, targets):
        probs = self.activate(logits)
        return probs - targets


class SoftmaxCrossEntropy(object):

    def activate(self, logits):
        a = np.exp(logits - np.max(logits, 1, keepdims=True))
        a /= np.sum(a, 1, keepdims=True)
        return a

    def __call__(self, logits, targets):
        probs = self.activate(logits)
        p = probs.clip(min=1e-10)
        return -np.sum(targets * np.log(p))

    def delta(self, logits, targets):
        probs = self.activate(logits)
        return probs - targets


class SumSquaresError(object):

    def activate(self, X):
        return X

    def __call__(self, X, targets):
        return 0.5 * np.sum((X - targets) ** 2)

    def delta(self, X, targets):
        return X - targets


class NeuralNetwork(object):

    def __init__(self, layers, cost_function):
        self.layers = layers
        self.cost_function = cost_function

    def __call__(self, X):
        for layer in self.layers:
            X = layer(X)
        return self.cost_function.activate(X)

    def fit(self, X, t, learning_rate):
        for layer in self.layers:
            X = layer(X)
        delta = self.cost_function.delta(X, t)
        for layer in reversed(self.layers):
            delta = layer.back_propagation(delta, learning_rate)

    def cost(self, X, t):
        for layer in self.layers:
            X = layer(X)
        return self.cost_function(X, t)

    def _gradient_check(self, X=None, t=None, eps=1e-6):
        if X is None:
            X = np.array([[0.5 for _ in range(np.size(self.layers[0].w, 0))]])
        if t is None:
            t = np.zeros((1, np.size(self.layers[-1].w, 1)))
            t[0, 0] = 1.
        e = np.zeros_like(X)
        e[:, 0] += eps
        x_plus_e = X + e
        x_minus_e = X - e
        grad = (self.cost(x_plus_e, t) - self.cost(x_minus_e, t)) / (2 * eps)
        for layer in self.layers:
            X = layer(X)
        delta = self.cost_function.delta(X, t)
        for layer in reversed(self.layers):
            delta = layer.back_propagation(delta, 0)
        print("===================================")
        print("checking gradient")
        print("finite difference", grad)
        print(" back propagation", delta[0, 0])
        print("===================================")
```
Put the neural_network.py above and this file in the same directory.
binary_classification.py
```python
import matplotlib.pyplot as plt
import numpy as np

from neural_network import TanhLayer, LinearLayer, SigmoidCrossEntropy, NeuralNetwork


def create_toy_dataset():
    # label is 1 when x1 and x2 have the same sign (XOR-like quadrant pattern)
    x = np.random.uniform(-1., 1., size=(1000, 2))
    labels = (np.prod(x, axis=1) > 0).astype(float)
    return x, labels.reshape(-1, 1)


def main():
    x, labels = create_toy_dataset()
    colors = ["blue", "red"]
    plt.scatter(x[:, 0], x[:, 1], c=[colors[int(label)] for label in labels.ravel()])

    layers = [TanhLayer(2, 4), LinearLayer(4, 1)]
    cost_function = SigmoidCrossEntropy()
    nn = NeuralNetwork(layers, cost_function)
    nn._gradient_check()
    for i in range(100000):
        if i % 10000 == 0:
            print("step %6d, cost %f" % (i, nn.cost(x, labels)))
        nn.fit(x, labels, learning_rate=0.001)

    # colour the plane according to the network's output probability
    X_test, Y_test = np.meshgrid(np.linspace(-1, 1, 100), np.linspace(-1, 1, 100))
    x_test = np.array([X_test, Y_test]).transpose(1, 2, 0).reshape(-1, 2)
    probs = nn(x_test)
    Probs = probs.reshape(100, 100)
    levels = np.linspace(0, 1, 11)
    plt.contourf(X_test, Y_test, Probs, levels, alpha=0.5)
    plt.colorbar()
    plt.xlim(-1, 1)
    plt.ylim(-1, 1)
    plt.show()


if __name__ == '__main__':
    main()
```
Also put this file in the same directory as the neural_network.py above.
regression.py
```python
import matplotlib.pyplot as plt
import numpy as np

from neural_network import TanhLayer, LinearLayer, SumSquaresError, NeuralNetwork


def create_toy_dataset(func, n=100):
    x = np.random.uniform(size=(n, 1))
    t = func(x) + np.random.uniform(-0.1, 0.1, size=(n, 1))
    return x, t


def main():

    def func(x):
        return x + 0.3 * np.sin(2 * np.pi * x)

    x, t = create_toy_dataset(func)

    layers = [TanhLayer(1, 6, std=1., bias=-0.5), LinearLayer(6, 1, std=1., bias=0.5)]
    cost_function = SumSquaresError()
    nn = NeuralNetwork(layers, cost_function)
    nn._gradient_check()
    for i in range(100000):
        if i % 10000 == 0:
            print("step %6d, cost %f" % (i, nn.cost(x, t)))
        nn.fit(x, t, learning_rate=0.001)

    plt.scatter(x, t, alpha=0.5, label="observation")
    x_test = np.linspace(0, 1, 1000)[:, np.newaxis]
    y = nn(x_test)
    plt.plot(x_test, func(x_test), color="blue", label=r"$x+0.3\sin(2\pi x)$")
    plt.plot(x_test, y, color="red", label="regression")
    plt.legend(loc="upper left")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.show()


if __name__ == '__main__':
    main()
```
Running the code that trains the two-class classification neural network produces output like the following.
Terminal output
```
===================================
checking gradient
finite difference 0.349788735199
 back propagation 0.349788735237
===================================
```
The gradient obtained from the finite-difference approximation and the one obtained by error backpropagation agree closely, so the implementation looks correct.
A neural network for two-class classification is trained with the blue and red points as training data, and the two-dimensional plane is colour-coded according to the network's output. The animation shows the network as training progresses. (The two-class classification code above displays only a still image of the final result.)
This is the result of using the neural network for regression: the network is trained with the blue points as training data, and the figure illustrates how the network's output changes during training. (The regression code above also displays only a still image of the final result.)
In this article I implemented and trained a neural network. Next time I will use this code to implement a mixture density network. When an ordinary neural network is used for a regression problem, the cost function corresponds to fitting a unimodal Gaussian, so it cannot handle multimodal target distributions. A mixture density network solves this by modelling the output with a Gaussian mixture.
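To spell out that claim in formulas (a standard correspondence, written in my own notation rather than taken from the code above): minimizing the sum-of-squares error is equivalent to maximizing the likelihood of a single Gaussian centred on the network output, whereas the mixture density network outputs the parameters of a Gaussian mixture:
\begin{align}
p(t|{\bf x}) &= \mathcal{N}\left(t\,|\,y({\bf x}), \sigma^2\right) &&\text{(ordinary regression network)}\\
p(t|{\bf x}) &= \sum_k \pi_k({\bf x})\,\mathcal{N}\left(t\,|\,\mu_k({\bf x}), \sigma_k^2({\bf x})\right) &&\text{(mixture density network)}
\end{align}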