Chapter 5 of PRML covers neural networks, which have become very popular recently. There are already many neural network implementations on the web, so I wanted to work on something I was less familiar with and decided to implement a mixture density network almost entirely with NumPy. However, the amount of code turned out to be fairly large, so I am splitting it into two parts: in this article I implement an ordinary neural network, and the mixture density network will follow in the next one.
The example network is a two-layer neural network with a three-dimensional input ${\bf x} = (x_1, x_2, x_3)$, a four-dimensional hidden unit ${\bf z} = (z_1, z_2, z_3, z_4)$, and a two-dimensional output ${\bf y} = (y_1, y_2)$.
Forward propagation is the step of computing the network's output from its input: first the hidden unit ${\bf z}$ is computed from the input ${\bf x}$, and then the output ${\bf y}$ is computed from ${\bf z}$.
One of the first-layer hidden units, $z_1$, is computed as
\begin{align}
a_{z_1} &= w_{11}^{(1)}x_1+w_{12}^{(1)}x_2+w_{13}^{(1)}x_3 + b_1^{(1)}\\
z_1 &= f^{(1)}(a_{z_1})
\end{align}
where $a_{z_1}$ is the activation of the first hidden unit, $w_{1j}^{(1)}$ is the weight from the $j$-th input unit to the first hidden unit, $b_1^{(1)}$ is the bias of the first hidden unit, and $f^{(1)}$ is the activation function of the first layer.
The same kind of formula can be written for $z_2, z_3, z_4$, but carrying around three more equations becomes cumbersome, so the matrix form is usually used:
\begin{align}
\begin{bmatrix}
a_{z_1}\\
a_{z_2}\\
a_{z_3}\\
a_{z_4}
\end{bmatrix}
&=
\begin{bmatrix}
w_{11}^{(1)} & w_{12}^{(1)} & w_{13}^{(1)}\\
w_{21}^{(1)} & w_{22}^{(1)} & w_{23}^{(1)}\\
w_{31}^{(1)} & w_{32}^{(1)} & w_{33}^{(1)}\\
w_{41}^{(1)} & w_{42}^{(1)} & w_{43}^{(1)}
\end{bmatrix}
\begin{bmatrix}
x_1\\
x_2\\
x_3
\end{bmatrix}
+
\begin{bmatrix}
b_1^{(1)}\\
b_2^{(1)}\\
b_3^{(1)}\\
b_4^{(1)}
\end{bmatrix}
\\
\begin{bmatrix}
z_1\\
z_2\\
z_3\\
z_4
\end{bmatrix}
&=
\begin{bmatrix}
f^{(1)}(a_{z_1})\\
f^{(1)}(a_{z_2})\\
f^{(1)}(a_{z_3})\\
f^{(1)}(a_{z_4})
\end{bmatrix}
\end{align}
Written more concisely, this becomes
\begin{align}
{\bf a}_z &= W^{(1)}{\bf x} + {\bf b}^{(1)}\\
{\bf z} &= f^{(1)}({\bf a}_z)
\end{align}
This completes the forward propagation of the first layer.
The second layer can be expressed in the same way:
\begin{align}
{\bf a}_y &= W^{(2)}{\bf z} + {\bf b}^{(2)}\\
{\bf y} &= f^{(2)}({\bf a}_y)
\end{align}
Putting the two layers together,
{\bf y} = f^{(2)}(W^{(2)}f^{(1)}(W^{(1)}{\bf x} + {\bf b}^{(1)}) + {\bf b}^{(2)})
In this way, the output can be computed from the input.
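To make the shapes concrete, here is a minimal NumPy sketch of this forward pass for the 3-4-2 network above (all values are made up; tanh for the hidden layer and the identity for the output are just one possible choice of activation functions):

```python
import numpy as np

x = np.array([0.5, -0.2, 0.1])          # 3-dimensional input
W1 = np.random.normal(size=(4, 3))      # first-layer weights
b1 = np.zeros(4)                        # first-layer biases
W2 = np.random.normal(size=(2, 4))      # second-layer weights
b2 = np.zeros(2)                        # second-layer biases

a_z = W1.dot(x) + b1                    # first-layer activations
z = np.tanh(a_z)                        # hidden unit, f^(1) = tanh
a_y = W2.dot(z) + b2                    # second-layer activations
y = a_y                                 # output, f^(2) = identity
print(y.shape)                          # (2,)
```

This sketch follows the column-vector convention of the equations; the `Layer` class later in this article stores the weights with shape `(dim_input, dim_output)` and works with batches of row vectors, computing `X.dot(self.w) + self.b`.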
The neural network has many parameters: in this example, $4\times3$ weights plus $4$ biases in the first layer and $2\times4$ weights plus $2$ biases in the second layer, 26 in total. Backpropagation is an efficient way to compute the gradients of the cost function with respect to all of them.
Given a pair of an input and its target, we first compute the partial derivative of the cost function $E$ with respect to each output,
{\partial E\over\partial y_i}
Using this, the error at ${\bf a}_y$, that is, the partial derivative of the cost function with respect to ${\bf a}_y$, is obtained as
\begin{align}
{\partial E\over\partial a_{y_i}} &= {\partial E\over\partial y_i}{\partial y_i\over\partial a_{y_i}}\\
&= {\partial E\over\partial y_i}f'^{(2)}(a_{y_i})\\
(&= y_i - t_i)
\end{align}
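As a quick check of the parenthesized line (my worked example, using the sum-of-squares error and the identity output activation):
\begin{align}
E &= {1\over2}\sum_i(y_i - t_i)^2, \qquad y_i = f^{(2)}(a_{y_i}) = a_{y_i}\\
{\partial E\over\partial a_{y_i}} &= {\partial E\over\partial y_i}f'^{(2)}(a_{y_i}) = (y_i - t_i)\cdot 1 = y_i - t_i
\end{align}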
If the output-layer activation function $f^{(2)}$ is the canonical link function for the cost, such as the identity map, the sigmoid function, or the softmax function, the error at ${\bf a}_y$ is simply **the difference between the output and the target**. Once the error at ${\bf a}_y$ is known, the gradients of the second-layer parameters can be computed. Since
\begin{align}
{\partial a_{y_i}\over\partial w_{ij}^{(2)}} &= z_j\\
{\partial a_{y_i}\over\partial b_i^{(2)}} &= 1
\end{align}
we have
\begin{align}
{\partial E\over\partial w_{ij}^{(2)}} &= {\partial E\over\partial a_{y_i}}{\partial a_{y_i}\over\partial w_{ij}^{(2)}}\\
&= {\partial E\over\partial a_{y_i}}z_j\\
{\partial E\over\partial b_i^{(2)}} &= {\partial E\over\partial a_{y_i}}{\partial a_{y_i}\over\partial b_i^{(2)}}\\
&= {\partial E\over\partial a_{y_i}}
\end{align}
We propagated the output error back to obtain the error at the activation ${\bf a}_y$ and used it to compute the parameter gradients. Furthermore, once the error at ${\bf a}_y$ is known, we can also compute the error at the second layer's input ${\bf z}$:
\begin{align}
{\partial E\over\partial z_j} &= \sum_{i=1}^2 {\partial E\over\partial a_{y_i}}{\partial a_{y_i}\over\partial z_j}\\
&= \sum_{i=1}^2 {\partial E\over\partial a_{y_i}}w_{ij}^{(2)}
\end{align}
In this way, the error is propagated from the output ${\bf y}$ of the second layer back to its input ${\bf z}$. Since the error at the second layer's input ${\bf z}$ is also the error at the first layer's output, repeating the same procedure yields the error at the first-layer activations and the gradients of the first-layer parameters:
\begin{align}
{\partial E\over\partial a_{z_i}} &= {\partial E\over\partial z_i}{\partial z_i\over\partial a_{z_i}}\\
&= {\partial E\over\partial z_i}f'^{(1)}(a_{z_i})\\
{\partial E\over\partial x_j} &= \sum_{i=1}^4 {\partial E\over\partial a_{z_i}}{\partial a_{z_i}\over\partial x_j}\\
&= \sum_{i=1}^4 {\partial E\over\partial a_{z_i}}w_{ij}^{(1)}\\
{\partial E\over\partial w_{ij}^{(1)}} &= {\partial E\over\partial a_{z_i}}{\partial a_{z_i}\over\partial w_{ij}^{(1)}}\\
&= {\partial E\over\partial a_{z_i}}x_j\\
{\partial E\over\partial b_i^{(1)}} &= {\partial E\over\partial a_{z_i}}{\partial a_{z_i}\over\partial b_i^{(1)}}\\
&= {\partial E\over\partial a_{z_i}}
\end{align}
In this way, the error at the output layer is propagated back toward the input, and the gradients needed to update the parameters are obtained along the way.
To summarize: the error at the output units, ${\partial E\over\partial y_i}$, is propagated backwards to the error at the second-layer activations, ${\partial E\over\partial a_{y_i}}$, and the gradients of the second-layer parameters are computed from that error:
\begin{align}
{\partial E\over\partial w_{ij}^{(2)}} &= {\partial E\over\partial a_{y_i}}z_j\\
{\partial E\over\partial b_i^{(2)}} &= {\partial E\over\partial a_{y_i}}
\end{align}
The error is then propagated to the first layer: from the error at the hidden units, ${\partial E\over\partial z_j}$, the error at the first-layer activations, ${\partial E\over\partial a_{z_i}}$, is obtained, and the first-layer parameter gradients follow:
\begin{align}
{\partial E\over\partial w_{ij}^{(1)}} &= {\partial E\over\partial a_{z_i}}x_j\\
{\partial E\over\partial b_i^{(1)}} &= {\partial E\over\partial a_{z_i}}
\end{align}
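The whole procedure fits in a few lines of NumPy. This is a minimal sketch continuing the made-up 3-4-2 example above (tanh hidden layer, identity output, sum-of-squares error), not the class-based implementation used later:

```python
import numpy as np

# made-up data and parameters for a 3-4-2 network
x = np.array([0.5, -0.2, 0.1])
t = np.array([1.0, 0.0])
W1, b1 = np.random.normal(size=(4, 3)), np.zeros(4)
W2, b2 = np.random.normal(size=(2, 4)), np.zeros(2)

# forward propagation
a_z = W1.dot(x) + b1
z = np.tanh(a_z)
a_y = W2.dot(z) + b2
y = a_y                                   # identity output activation

# backward propagation of the error
delta_y = y - t                           # dE/da_y for sum-of-squares + identity
grad_W2 = np.outer(delta_y, z)            # dE/dW2
grad_b2 = delta_y                         # dE/db2
delta_z = W2.T.dot(delta_y) * (1 - z**2)  # dE/da_z, since tanh'(a) = 1 - tanh(a)^2
grad_W1 = np.outer(delta_z, x)            # dE/dW1
grad_b1 = delta_z                         # dE/db1

# one gradient-descent step
learning_rate = 0.1
W2 -= learning_rate * grad_W2; b2 -= learning_rate * grad_b2
W1 -= learning_rate * grad_W1; b1 -= learning_rate * grad_b1
```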
For weight initialization I use a truncated normal distribution, which is the one thing here that needs a library other than NumPy. I normally use TensorFlow when building neural networks, and since TensorFlow initializes weights with a truncated normal distribution, I follow the same convention here.
```python
import numpy as np
from scipy.stats import truncnorm
```
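One thing to note about `scipy.stats.truncnorm`: its `a` and `b` arguments are specified in units of the unscaled standard normal, so with `scale=std` the samples are clipped to `[a * std, b * std]`. A quick check (the numbers here are arbitrary):

```python
import numpy as np
from scipy.stats import truncnorm

# with a=-2, b=2 and scale=0.5 the samples are clipped to roughly [-1.0, 1.0],
# i.e. two standard deviations, matching TensorFlow's truncated normal initializer
samples = truncnorm(a=-2, b=2, scale=0.5).rvs(10000)
print(samples.min(), samples.max())
```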
A class representing one layer of a neural network. The weights are initialized when an instance is created; calling the instance propagates an input forward, and `back_propagation` propagates the error backwards. The layers used to actually build the network are derived from this class.
```python
class Layer(object):

    def __init__(self, dim_input, dim_output, std=1., bias=0.):
        # weights drawn from a normal distribution truncated at two standard deviations
        self.w = truncnorm(a=-2, b=2, scale=std).rvs((dim_input, dim_output))
        self.b = np.ones(dim_output) * bias

    def __call__(self, X):
        self.input = X
        return self.forward_propagation(X)

    def back_propagation(self, delta, learning_rate):
        # derivative with respect to activation
        delta = delta * self.activation_derivative()
        w = np.copy(self.w)
        self.w -= learning_rate * self.input.T.dot(delta)
        self.b -= learning_rate * np.sum(delta, axis=0)
        # derivative with respect to input
        return delta.dot(w.T)
```
Layer | Description |
---|---|
__init__ | Initializes the parameters given this layer's input and output dimensions |
__call__ | Computes this layer's output by forward propagation from its input |
back_propagation | Given the error at this layer's output and the learning rate, updates the parameters and returns the error at this layer's input |
A layer whose activation function is the identity map $f(a) = a$. Each concrete layer is built by defining the methods for forward propagation and for the derivative of its activation function.
```python
class LinearLayer(Layer):

    def forward_propagation(self, X):
        return X.dot(self.w) + self.b

    def activation_derivative(self):
        return 1
```
A layer whose activation function is the logistic sigmoid $f(a) = {1\over1+\exp(-a)}$. The derivative of the logistic sigmoid is $f'(a) = f(a)(1 - f(a))$.
```python
class SigmoidLayer(Layer):

    def forward_propagation(self, X):
        activation = X.dot(self.w) + self.b
        self.output = 1 / (1 + np.exp(-activation))
        return self.output

    def activation_derivative(self):
        return self.output * (1 - self.output)
```
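As a quick shape check (a minimal sketch with made-up values, assuming the classes above have been saved in the `neural_network.py` module shown later): calling a layer maps an `(N, dim_input)` batch to an `(N, dim_output)` batch, and `back_propagation` maps an output-shaped error back to an input-shaped one.

```python
import numpy as np
from neural_network import SigmoidLayer

layer = SigmoidLayer(3, 4)
X = np.random.uniform(-1, 1, size=(5, 3))   # batch of 5 three-dimensional inputs
Z = layer(X)                                # forward propagation, shape (5, 4)
delta = np.random.normal(size=Z.shape)      # pretend error at this layer's output
delta_in = layer.back_propagation(delta, learning_rate=0.0)  # error at the input
print(Z.shape, delta_in.shape)              # (5, 4) (5, 3)
```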
In the same way, layers using the hyperbolic tangent $\tanh(a)$ or the rectified linear function $\max(a, 0)$ as the activation function can be built (both are included in the full code below).
This is the error function typically used for **regression** problems, the sum-of-squares error.
```python
class SumSquaresError(object):

    def activate(self, X):
        return X

    def __call__(self, X, targets):
        return 0.5 * np.sum((X - targets) ** 2)

    def delta(self, X, targets):
        return X - targets
```
This is the error function used for **two-class classification**: the cross-entropy computed after the nonlinear transformation by the logistic sigmoid function.
```python
class SigmoidCrossEntropy(object):

    def activate(self, logits):
        return 1 / (1 + np.exp(-logits))

    def __call__(self, logits, targets):
        probs = self.activate(logits)
        p = np.clip(probs, 1e-10, 1 - 1e-10)
        return np.sum(-targets * np.log(p) - (1 - targets) * np.log(1 - p))

    def delta(self, logits, targets):
        probs = self.activate(logits)
        return probs - targets
```
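The reason `delta` can simply return `probs - targets` is that the sigmoid is the canonical link for this cost: differentiating the cross-entropy with respect to the logits $a_i$ (my notation, with $\sigma$ the logistic sigmoid) gives
\begin{align}
E &= -\sum_i \left\{t_i\ln\sigma(a_i) + (1-t_i)\ln(1-\sigma(a_i))\right\}\\
{\partial E\over\partial a_i} &= \sigma(a_i) - t_i
\end{align}
which is again output minus target.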
The softmax cross-entropy used for multi-class classification is implemented in the same way (see the full code below).
Since each cost-function object applies its own output activation, the final layer of the network should be a `LinearLayer`. Whether backpropagation is implemented correctly can be checked against a finite-difference approximation of the gradient.
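Concretely, the `_gradient_check` method below perturbs the first component of the input by a small step $\epsilon$ (`eps` in the code) and compares the backpropagated error at that component with the central-difference approximation
\begin{align}
{\partial E\over\partial x_1} \approx {E(x_1+\epsilon) - E(x_1-\epsilon)\over 2\epsilon}
\end{align}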
```python
class NeuralNetwork(object):

    def __init__(self, layers, cost_function):
        self.layers = layers
        self.cost_function = cost_function

    def __call__(self, X):
        for layer in self.layers:
            X = layer(X)
        return self.cost_function.activate(X)

    def fit(self, X, t, learning_rate):
        for layer in self.layers:
            X = layer(X)
        delta = self.cost_function.delta(X, t)
        for layer in reversed(self.layers):
            delta = layer.back_propagation(delta, learning_rate)

    def cost(self, X, t):
        for layer in self.layers:
            X = layer(X)
        return self.cost_function(X, t)

    def _gradient_check(self, X=None, t=None, eps=1e-6):
        if X is None:
            X = np.array([[0.5 for _ in range(np.size(self.layers[0].w, 0))]])
        if t is None:
            t = np.zeros((1, np.size(self.layers[-1].w, 1)))
            t[0, 0] = 1.
        e = np.zeros_like(X)
        e[:, 0] += eps
        x_plus_e = X + e
        x_minus_e = X - e
        grad = (self.cost(x_plus_e, t) - self.cost(x_minus_e, t)) / (2 * eps)
        for layer in self.layers:
            X = layer(X)
        delta = self.cost_function.delta(X, t)
        for layer in reversed(self.layers):
            delta = layer.back_propagation(delta, 0)
        print("===================================")
        print("checking gradient")
        print("finite difference", grad)
        print(" back propagation", delta[0, 0])
        print("===================================")
```
NeuralNetwork | Description |
---|---|
__init__ | Defines the network structure and the cost function |
__call__ | Forward propagation |
fit | Trains the network |
cost | Computes the value of the cost function |
_gradient_check | Checks the backpropagated gradient against a finite difference |
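For instance, the 3-4-2 network from the beginning of this article could be put together like this (a minimal sketch with made-up data, assuming the module is saved as the `neural_network.py` shown just below):

```python
import numpy as np
from neural_network import SigmoidLayer, LinearLayer, SumSquaresError, NeuralNetwork

# 3-dimensional input, 4 hidden units, 2 outputs
nn = NeuralNetwork([SigmoidLayer(3, 4), LinearLayer(4, 2)], SumSquaresError())
nn._gradient_check()                        # backprop vs. finite difference

X = np.random.uniform(-1, 1, size=(10, 3))  # made-up inputs
t = np.random.uniform(-1, 1, size=(10, 2))  # made-up targets
for _ in range(10000):
    nn.fit(X, t, learning_rate=0.001)       # gradient-descent training
print(nn.cost(X, t))                        # cost after training
y = nn(X)                                   # predictions, shape (10, 2)
```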
The whole code is below. The scripts that follow import just what they need from this module to solve regression and classification problems.
neural_network.py
```python
import numpy as np
from scipy.stats import truncnorm


class Layer(object):

    def __init__(self, dim_input, dim_output, std=1., bias=0.):
        # weights drawn from a normal distribution truncated at two standard deviations
        self.w = truncnorm(a=-2, b=2, scale=std).rvs((dim_input, dim_output))
        self.b = np.ones(dim_output) * bias

    def __call__(self, X):
        self.input = X
        return self.forward_propagation(X)

    def back_propagation(self, delta, learning_rate):
        # derivative with respect to activation
        delta = delta * self.activation_derivative()
        w = np.copy(self.w)
        self.w -= learning_rate * self.input.T.dot(delta)
        self.b -= learning_rate * np.sum(delta, axis=0)
        # derivative with respect to input
        return delta.dot(w.T)


class LinearLayer(Layer):

    def forward_propagation(self, X):
        return X.dot(self.w) + self.b

    def activation_derivative(self):
        return 1


class SigmoidLayer(Layer):

    def forward_propagation(self, X):
        activation = X.dot(self.w) + self.b
        self.output = 1 / (1 + np.exp(-activation))
        return self.output

    def activation_derivative(self):
        return self.output * (1 - self.output)


class TanhLayer(Layer):

    def forward_propagation(self, X):
        activation = X.dot(self.w) + self.b
        self.output = np.tanh(activation)
        return self.output

    def activation_derivative(self):
        return 1 - self.output ** 2


class ReLULayer(Layer):

    def forward_propagation(self, X):
        activation = X.dot(self.w) + self.b
        self.output = activation.clip(min=0)
        return self.output

    def activation_derivative(self):
        return (self.output > 0).astype(float)


class SigmoidCrossEntropy(object):

    def activate(self, logits):
        return 1 / (1 + np.exp(-logits))

    def __call__(self, logits, targets):
        probs = self.activate(logits)
        p = np.clip(probs, 1e-10, 1 - 1e-10)
        return np.sum(-targets * np.log(p) - (1 - targets) * np.log(1 - p))

    def delta(self, logits, targets):
        probs = self.activate(logits)
        return probs - targets


class SoftmaxCrossEntropy(object):

    def activate(self, logits):
        a = np.exp(logits - np.max(logits, 1, keepdims=True))
        a /= np.sum(a, 1, keepdims=True)
        return a

    def __call__(self, logits, targets):
        probs = self.activate(logits)
        p = probs.clip(min=1e-10)
        return -np.sum(targets * np.log(p))

    def delta(self, logits, targets):
        probs = self.activate(logits)
        return probs - targets


class SumSquaresError(object):

    def activate(self, X):
        return X

    def __call__(self, X, targets):
        return 0.5 * np.sum((X - targets) ** 2)

    def delta(self, X, targets):
        return X - targets


class NeuralNetwork(object):

    def __init__(self, layers, cost_function):
        self.layers = layers
        self.cost_function = cost_function

    def __call__(self, X):
        for layer in self.layers:
            X = layer(X)
        return self.cost_function.activate(X)

    def fit(self, X, t, learning_rate):
        for layer in self.layers:
            X = layer(X)
        delta = self.cost_function.delta(X, t)
        for layer in reversed(self.layers):
            delta = layer.back_propagation(delta, learning_rate)

    def cost(self, X, t):
        for layer in self.layers:
            X = layer(X)
        return self.cost_function(X, t)

    def _gradient_check(self, X=None, t=None, eps=1e-6):
        if X is None:
            X = np.array([[0.5 for _ in range(np.size(self.layers[0].w, 0))]])
        if t is None:
            t = np.zeros((1, np.size(self.layers[-1].w, 1)))
            t[0, 0] = 1.
        e = np.zeros_like(X)
        e[:, 0] += eps
        x_plus_e = X + e
        x_minus_e = X - e
        grad = (self.cost(x_plus_e, t) - self.cost(x_minus_e, t)) / (2 * eps)
        for layer in self.layers:
            X = layer(X)
        delta = self.cost_function.delta(X, t)
        for layer in reversed(self.layers):
            delta = layer.back_propagation(delta, 0)
        print("===================================")
        print("checking gradient")
        print("finite difference", grad)
        print(" back propagation", delta[0, 0])
        print("===================================")
```
Put the neural_network.py above and this file in the same directory.
binary_classification.py
```python
import matplotlib.pyplot as plt
import numpy as np

from neural_network import TanhLayer, LinearLayer, SigmoidCrossEntropy, NeuralNetwork


def create_toy_dataset():
    # label is 1 when x1 and x2 have the same sign (XOR-like quadrant pattern)
    x = np.random.uniform(-1., 1., size=(1000, 2))
    labels = (np.prod(x, axis=1) > 0).astype(float)
    return x, labels.reshape(-1, 1)


def main():
    x, labels = create_toy_dataset()
    colors = ["blue", "red"]
    plt.scatter(x[:, 0], x[:, 1], c=[colors[int(label)] for label in labels.ravel()])

    layers = [TanhLayer(2, 4), LinearLayer(4, 1)]
    cost_function = SigmoidCrossEntropy()
    nn = NeuralNetwork(layers, cost_function)
    nn._gradient_check()
    for i in range(100000):
        if i % 10000 == 0:
            print("step %6d, cost %f" % (i, nn.cost(x, labels)))
        nn.fit(x, labels, learning_rate=0.001)

    # colour the plane according to the network's output probability
    X_test, Y_test = np.meshgrid(np.linspace(-1, 1, 100), np.linspace(-1, 1, 100))
    x_test = np.array([X_test, Y_test]).transpose(1, 2, 0).reshape(-1, 2)
    probs = nn(x_test)
    Probs = probs.reshape(100, 100)
    levels = np.linspace(0, 1, 11)
    plt.contourf(X_test, Y_test, Probs, levels, alpha=0.5)
    plt.colorbar()
    plt.xlim(-1, 1)
    plt.ylim(-1, 1)
    plt.show()


if __name__ == '__main__':
    main()
```
Also put this file in the same directory as the neural_network.py above.
regression.py
```python
import matplotlib.pyplot as plt
import numpy as np

from neural_network import TanhLayer, LinearLayer, SumSquaresError, NeuralNetwork


def create_toy_dataset(func, n=100):
    x = np.random.uniform(size=(n, 1))
    t = func(x) + np.random.uniform(-0.1, 0.1, size=(n, 1))
    return x, t


def main():

    def func(x):
        return x + 0.3 * np.sin(2 * np.pi * x)

    x, t = create_toy_dataset(func)

    layers = [TanhLayer(1, 6, std=1., bias=-0.5), LinearLayer(6, 1, std=1., bias=0.5)]
    cost_function = SumSquaresError()
    nn = NeuralNetwork(layers, cost_function)
    nn._gradient_check()
    for i in range(100000):
        if i % 10000 == 0:
            print("step %6d, cost %f" % (i, nn.cost(x, t)))
        nn.fit(x, t, learning_rate=0.001)

    plt.scatter(x, t, alpha=0.5, label="observation")
    x_test = np.linspace(0, 1, 1000)[:, np.newaxis]
    y = nn(x_test)
    plt.plot(x_test, func(x_test), color="blue", label=r"$x+0.3\sin(2\pi x)$")
    plt.plot(x_test, y, color="red", label="regression")
    plt.legend(loc="upper left")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.show()


if __name__ == '__main__':
    main()
```
Running the code that trains the two-class classification neural network produces output like the following.
Terminal output
```
===================================
checking gradient
finite difference 0.349788735199
 back propagation 0.349788735237
===================================
```
The gradient obtained from the finite-difference approximation and the one obtained by error backpropagation agree closely, so the implementation looks correct.
A neural network for two-class classification is trained with the blue and red points as training data, and the two-dimensional plane is colour-coded according to the network's output. The animation shows the network as training progresses. (The two-class classification code above displays only a still image of the final result.)
This is the result of using the neural network for regression: the network is trained with the blue points as training data, and the figure illustrates how the network's output changes during training. (The regression code above also displays only a still image of the final result.)
In this article I implemented and trained a neural network. Next time I will use this code to implement a mixture density network. When an ordinary neural network is used for a regression problem, the cost function corresponds to fitting a unimodal Gaussian, so it cannot handle multimodal target distributions. A mixture density network solves this by modelling the output with a Gaussian mixture.
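To spell out that claim in formulas (a standard correspondence, written in my own notation rather than taken from the code above): minimizing the sum-of-squares error is equivalent to maximizing the likelihood of a single Gaussian centred on the network output, whereas the mixture density network outputs the parameters of a Gaussian mixture:
\begin{align}
p(t|{\bf x}) &= \mathcal{N}\left(t\,|\,y({\bf x}), \sigma^2\right) &&\text{(ordinary regression network)}\\
p(t|{\bf x}) &= \sum_k \pi_k({\bf x})\,\mathcal{N}\left(t\,|\,\mu_k({\bf x}), \sigma_k^2({\bf x})\right) &&\text{(mixture density network)}
\end{align}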