I started studying Chapter 5 of "Deep Learning from Scratch: The Theory and Implementation of Deep Learning Learned with Python". This is a memo of the things I stumbled over along the way.
The execution environment is macOS Mojave + Anaconda 2019.10, and the Python version is 3.7.4. For details, refer to Chapter 1 of this memo.
(To other chapters of this memo: Chapter 1 / Chapter 2 / Chapter 3 / Chapter 4 / Chapter 5 / [Chapter 6](https://qiita.com/segavvy/items/ca4ac4c9ee1a126bff41) / Chapter 7 / Chapter 8 / Summary)
This chapter describes the error backpropagation method, which speeds up the calculation of the gradients of the weight parameters in neural network training.
By the way, the Japanese word 伝播 (denpa) is a translation of "propagation," but the terminology is chaotic because it is also rendered as 伝搬 (denpan). I found a blog article that looked into this, so if you are interested, see "rest after walking" > [Machine learning] Should "propagation" be translated as 伝播 (denpa) or 伝搬 (denpan)?
The explanation using computational graphs makes the error backpropagation method very easy to understand. The secret of this book's popularity may well be Chapter 5.
Training is essential for neural networks, and it requires finding the gradient of the weight parameters, that is, calculating the derivative of the loss with respect to each parameter. Using a computational graph, you can see that this calculation can be done very efficiently by propagating local derivatives backward from the output side toward the input side.
This section reviews the derivative of a composite function (the chain rule) and explains why backpropagation works. I don't think there is any problem just reading through it, but if you want to understand it properly, you need to review high school differentiation.
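For reference, a quick worked example of the chain rule that backpropagation relies on: the derivative of a composite function is the product of the local derivatives (the example below is my own choice of functions, not a quotation from the book).

\begin{align}
z &= t^2, \quad t = x + y \\
\frac{\partial z}{\partial x} &= \frac{\partial z}{\partial t}\frac{\partial t}{\partial x} = 2t \cdot 1 = 2(x + y)
\end{align}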
I didn't remember the differentiation formulas at all, so I decided to review them for the first time in ages. However, I usually study after my day job is done, and with books and websites sleepiness attacks me and nothing sticks. Studying in spare moments is quite difficult for office workers.
What I found was Try IT, a video lesson service run by the home tutoring company Try. The lessons are free yet very easy to understand, and each is split into chunks of about 15 minutes, so they are ideal for studying on a smartphone while commuting or on a break. It is a service aimed at students, but I think it also suits working adults who want to review.
For reference, here are links to the videos covering the introductory part of differentiation. Note that they are limited to personal viewing.
- Video lesson Try IT > Mathematics II: Limits and derivative functions
- Video lesson Try IT > Mathematics II: Differentiation methods
This part describes backpropagation at addition and multiplication nodes. Since the derivatives of both addition and multiplication are simple, the backpropagation calculations are easy.
This is the implementation of the addition layer and the multiplication layer from the previous section. Since the derivatives are simple, the implementation is easy as well (a sketch follows below).
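The post itself does not reproduce these two layers, so here is a minimal sketch of what they might look like; the class names and the apple/tax example values are my own assumptions in the spirit of the book, not a quotation of its code.

```python
# coding: utf-8
class AddLayer:
    """Addition layer: z = x + y"""
    def forward(self, x, y):
        return x + y

    def backward(self, dout):
        # The local derivative of addition is 1, so dout simply passes through to both inputs
        return dout * 1, dout * 1


class MulLayer:
    """Multiplication layer: z = x * y"""
    def __init__(self):
        self.x = None
        self.y = None

    def forward(self, x, y):
        self.x = x  # keep the inputs, they are needed for backpropagation
        self.y = y
        return x * y

    def backward(self, dout):
        # The local derivative of x*y is y with respect to x, and x with respect to y
        dx = dout * self.y
        dy = dout * self.x
        return dx, dy


if __name__ == '__main__':
    # Tiny check: 2 apples at 100 yen with 10% consumption tax
    apple_layer = MulLayer()
    tax_layer = MulLayer()
    apple_price = apple_layer.forward(100, 2)    # 200
    total = tax_layer.forward(apple_price, 1.1)  # 220.0
    dprice, dtax = tax_layer.backward(1.0)       # 1.1, 200
    dapple, dnum = apple_layer.backward(dprice)  # 2.2, approx. 110
    print(total, dapple, dnum, dtax)
```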
An implementation of the activation function layer.
Since the derivative of ReLU is simple to calculate, it is easy to implement and should also contribute to faster learning. Below is my implementation.
relu.py
```python
# coding: utf-8


class ReLU:
    def __init__(self):
        """ReLU layer"""
        self.mask = None

    def forward(self, x):
        """Forward propagation

        Args:
            x (numpy.ndarray): input

        Returns:
            numpy.ndarray: output
        """
        self.mask = (x <= 0)
        out = x.copy()
        out[self.mask] = 0

        return out

    def backward(self, dout):
        """Backpropagation

        Args:
            dout (numpy.ndarray): derivative value propagated from the layer on the right

        Returns:
            numpy.ndarray: derivative value
        """
        dout[self.mask] = 0
        dx = dout

        return dx
```
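As a quick sanity check with a toy input of my own (not from the book), the mask zeroes out exactly the positions where the forward input was non-positive:

```python
import numpy as np

from relu import ReLU  # assumes relu.py above is saved alongside

relu = ReLU()
x = np.array([[1.0, -0.5], [-2.0, 3.0]])
print(relu.forward(x))                 # [[1. 0.] [0. 3.]]
print(relu.backward(np.ones_like(x)))  # [[1. 0.] [0. 1.]]
```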
The derivative of the sigmoid function is a little complicated, but it can be worked out step by step with a computational graph. However, it is tough going unless you review the derivative of $y = \frac{1}{x}$ needed for the first "/" node and the derivative of $y = \exp(x)$ for the "$\exp$" node. For the formulas in this area, I recommend Mathematics Learned from Concrete Examples > Calculus > All 59 differentiation formulas organized by importance, because it is neatly summarized.
After the computational-graph calculation, the book transforms the result into equation (5.12), but at first I did not understand how the second-to-last line turns into the last line. It becomes easier to follow if you insert one extra line in between, as follows.
\begin{align}
\frac{\partial L}{\partial y}y^2\exp(-x) &= \frac{\partial L}{\partial y} \frac{1}{(1+\exp(-x))^2} \exp(-x) \\
&=\frac{\partial L}{\partial y} \frac{1}{1+\exp(-x)} \frac{\exp(-x)}{1+\exp(-x)} \\
&=\frac{\partial L}{\partial y} \frac{1}{1+\exp(-x)} \biggl(\frac{1+\exp(-x)}{1+\exp(-x)} - \frac{1}{1+\exp(-x)}\biggr) ← added line\\
&=\frac{\partial L}{\partial y} y(1-y)
\end{align}
What is surprising about this result is that it can be computed very easily from $y$, the output of forward propagation. Since the error backpropagation method computes forward propagation first, having that result on hand leads to faster learning. The property that $e^x$ remains $e^x$ when differentiated is also important; whoever came up with the sigmoid function is really amazing.
The sigmoid layer is not used in this chapter, so the code is omitted.
The calculation process behind the Affine layer backpropagation formula (5.13) is omitted in the book, but @yuyasat works through the components step by step in [Calculating the components of the Affine layer's backpropagation](https://qiita.com/yuyasat/items/d9cdd4401221df5375b6), which is helpful for reference.
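For reference, here are the relationships that the implementation below follows, written in my own notation (with input $X$, weights $W$, bias $B$, and forward output $Y = XW + B$); treat this as my summary rather than a quotation of equation (5.13).

\begin{align}
\frac{\partial L}{\partial X} &= \frac{\partial L}{\partial Y}\,W^T \\
\frac{\partial L}{\partial W} &= X^T\,\frac{\partial L}{\partial Y} \\
\frac{\partial L}{\partial B} &= \sum_{i}\frac{\partial L}{\partial Y_i} \quad \text{(sum over the batch axis)}
\end{align}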
Below is my implementation. The book's code additionally handles the case where the input data is a tensor (four-dimensional data), but I have not included that because I do not yet know where it is used.
affine.py
```python
# coding: utf-8
import numpy as np


class Affine:
    def __init__(self, W, b):
        """Affine layer

        Args:
            W (numpy.ndarray): weight
            b (numpy.ndarray): bias
        """
        self.W = W      # weight
        self.b = b      # bias
        self.x = None   # input
        self.dW = None  # derivative value of the weight
        self.db = None  # derivative value of the bias

    def forward(self, x):
        """Forward propagation

        Args:
            x (numpy.ndarray): input

        Returns:
            numpy.ndarray: output
        """
        self.x = x
        out = np.dot(x, self.W) + self.b

        return out

    def backward(self, dout):
        """Backpropagation

        Args:
            dout (numpy.ndarray): derivative value propagated from the layer on the right

        Returns:
            numpy.ndarray: derivative value
        """
        dx = np.dot(dout, self.W.T)
        self.dW = np.dot(self.x.T, dout)
        self.db = np.sum(dout, axis=0)

        return dx
```
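A quick shape check with toy sizes of my own (not from the book):

```python
import numpy as np

from affine import Affine  # assumes affine.py above is saved alongside

W = np.random.randn(4, 3)
b = np.zeros(3)
layer = Affine(W, b)

x = np.random.randn(2, 4)               # batch of 2 samples with 4 features
out = layer.forward(x)
dx = layer.backward(np.ones_like(out))
print(out.shape, dx.shape, layer.dW.shape, layer.db.shape)  # (2, 3) (2, 4) (4, 3) (3,)
```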
Backpropagation through the layer that combines the softmax function and the cross entropy error. The calculation process is described in detail in Appendix A of the book; properties such as $e^x$ remaining $e^x$ when differentiated, and the elements of a one-hot teacher label summing to 1, contribute to simplifying the formula.
The final result of $(y_1 - t_1, y_2 - t_2, y_3 - t_3)$ is also surprising. It is simply the difference between the output of the neural network and the teacher label, and it can be computed quickly. It seems the cross entropy error was designed so that it comes out this "cleanly" when used as the loss function for softmax; whoever came up with the cross entropy error is really amazing.
Note that to handle batches, the backpropagated value must be divided by the number of batches. The book only says that "by dividing the value to be propagated by the number of batches (batch_size), the error per piece of data is propagated to the previous layer," and I could not understand why that division was necessary. However, I understood it from the code explanation in @Yoko303's Deep Learning from scratch ~Softmax-with-Loss layer~. In the batch version of forward propagation, the cross entropy errors are finally summed and divided by the number of batches (batch_size) to produce a single value. Backpropagation must pass through this part as well, and its derivative is $\frac{1}{batch\_size}$. I drew the computational graph for that part below.
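In formulas (my own notation, with $N$ standing for batch_size, one-hot teacher labels $t_{nk}$, softmax outputs $y_{nk}$, and softmax inputs $x_{nk}$):

\begin{align}
L &= -\frac{1}{N}\sum_{n=1}^{N}\sum_{k} t_{nk}\log y_{nk} \\
\frac{\partial L}{\partial x_{nk}} &= \frac{y_{nk} - t_{nk}}{N}
\end{align}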
I think this understanding is correct, but please point out any mistakes. Below is the implemented code.
softmax_with_loss.py
```python
# coding: utf-8
from functions import softmax, cross_entropy_error


class SoftmaxWithLoss:
    def __init__(self):
        """Softmax-with-Loss layer"""
        self.loss = None    # loss
        self.y = None       # softmax output
        self.t = None       # teacher data (one-hot vector)

    def forward(self, x, t):
        """Forward propagation

        Args:
            x (numpy.ndarray): input
            t (numpy.ndarray): teacher data

        Returns:
            float: cross entropy error
        """
        self.t = t
        self.y = softmax(x)
        self.loss = cross_entropy_error(self.y, self.t)

        return self.loss

    def backward(self, dout=1):
        """Backpropagation

        Args:
            dout (float, optional): derivative value propagated from the layer on the right. Default is 1.

        Returns:
            numpy.ndarray: derivative value
        """
        batch_size = self.t.shape[0]    # number of batches
        dx = (self.y - self.t) * (dout / batch_size)

        return dx
```
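A quick check with made-up numbers (mine, not from the book), assuming the files from this post are saved alongside; it confirms that `backward()` really returns $(y - t)$ divided by the batch size:

```python
import numpy as np

from softmax_with_loss import SoftmaxWithLoss

layer = SoftmaxWithLoss()
x = np.array([[2.0, 0.5, -1.0], [0.1, 0.3, 0.2]])  # scores for 2 samples, 3 classes
t = np.array([[1, 0, 0], [0, 0, 1]])               # one-hot teacher labels
loss = layer.forward(x, t)
dx = layer.backward()
print(loss)                    # a single loss value averaged over the batch
print(dx - (layer.y - t) / 2)  # (almost) all zeros: dx is (y - t) / batch_size
```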
Note that the code in the book does not use `dout` inside `backward()`. Since `backward()` is only ever called with `dout=1`, this causes no problem in practice, but I think it is probably a mistake.
This is a review of the implementation flow. There is no particular stumbling block.
First, the general-purpose functions. I brought over only what I needed from what I wrote in the previous chapter.
functions.py
```python
# coding: utf-8
import numpy as np


def softmax(x):
    """Softmax function

    Args:
        x (numpy.ndarray): input

    Returns:
        numpy.ndarray: output
    """
    # For batch processing, x is a two-dimensional array of shape (number of batches, 10).
    # In that case each image must be handled separately using broadcasting.
    # So that the 1D and 2D cases can share the same code, np.max() and np.sum()
    # are calculated with axis=-1 and keepdims=True to keep the dimensions for broadcasting.
    c = np.max(x, axis=-1, keepdims=True)
    exp_a = np.exp(x - c)  # overflow countermeasure
    sum_exp_a = np.sum(exp_a, axis=-1, keepdims=True)
    y = exp_a / sum_exp_a

    return y


def numerical_gradient(f, x):
    """Gradient calculation

    Args:
        f (function): loss function
        x (numpy.ndarray): array of weight parameters whose gradient is to be checked

    Returns:
        numpy.ndarray: gradient
    """
    h = 1e-4
    grad = np.zeros_like(x)

    # Enumerate the elements of the multidimensional array with np.nditer
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:

        idx = it.multi_index  # it.multi_index is the index of the current element
        tmp_val = x[idx]      # save the original value

        # Calculate f(x + h)
        x[idx] = tmp_val + h
        fxh1 = f()

        # Calculate f(x - h)
        x[idx] = tmp_val - h
        fxh2 = f()

        # Calculate the gradient
        grad[idx] = (fxh1 - fxh2) / (2 * h)

        x[idx] = tmp_val  # restore the original value
        it.iternext()

    return grad


def cross_entropy_error(y, t):
    """Calculation of the cross entropy error

    Args:
        y (numpy.ndarray): neural network output
        t (numpy.ndarray): correct labels

    Returns:
        float: cross entropy error
    """
    # If there is a single piece of data, reshape it so each data item is one row
    if y.ndim == 1:
        t = t.reshape(1, t.size)
        y = y.reshape(1, y.size)

    # Calculate the error and normalize by the number of batches
    batch_size = y.shape[0]
    return -np.sum(t * np.log(y + 1e-7)) / batch_size
```
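A quick check with made-up numbers of my own: each row of the softmax output should sum to 1, and the cross entropy error should come back as a single value averaged over the batch.

```python
import numpy as np

from functions import softmax, cross_entropy_error  # assumes functions.py above is saved alongside

x = np.array([[0.3, 2.9, 4.0], [0.1, 0.2, 0.3]])
y = softmax(x)
print(np.sum(y, axis=-1))         # [1. 1.]

t = np.array([[0, 0, 1], [0, 1, 0]])
print(cross_entropy_error(y, t))  # averaged over the 2 samples
```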
And then there is the neural network class. Since it is based on the code in the previous chapter, there are many of the same parts.
two_layer_net.py
```python
# coding: utf-8
import numpy as np

from affine import Affine
from functions import numerical_gradient
from relu import ReLU
from softmax_with_loss import SoftmaxWithLoss


class TwoLayerNet:

    def __init__(self, input_size, hidden_size, output_size,
                 weight_init_std=0.01):
        """Two-layer neural network

        Args:
            input_size (int): number of neurons in the input layer
            hidden_size (int): number of neurons in the hidden layer
            output_size (int): number of neurons in the output layer
            weight_init_std (float, optional): scaling parameter for the initial weights. Default is 0.01.
        """
        # Weight initialization
        self.params = {}
        self.params['W1'] = weight_init_std * \
            np.random.randn(input_size, hidden_size)
        self.params['b1'] = np.zeros(hidden_size)
        self.params['W2'] = weight_init_std * \
            np.random.randn(hidden_size, output_size)
        self.params['b2'] = np.zeros(output_size)

        # Layer generation
        self.layers = {}  # dict keeps insertion order from Python 3.7, so OrderedDict is unnecessary
        self.layers['Affine1'] = \
            Affine(self.params['W1'], self.params['b1'])
        self.layers['Relu1'] = ReLU()
        self.layers['Affine2'] = \
            Affine(self.params['W2'], self.params['b2'])

        self.lastLayer = SoftmaxWithLoss()

    def predict(self, x):
        """Inference with the neural network

        Args:
            x (numpy.ndarray): input to the neural network

        Returns:
            numpy.ndarray: neural network output
        """
        # Propagate forward through the layers
        for layer in self.layers.values():
            x = layer.forward(x)

        return x

    def loss(self, x, t):
        """Calculate the value of the loss function

        Args:
            x (numpy.ndarray): input to the neural network
            t (numpy.ndarray): correct labels

        Returns:
            float: value of the loss function
        """
        # Inference
        y = self.predict(x)

        # Calculated by forward propagation of the Softmax-with-Loss layer
        loss = self.lastLayer.forward(y, t)

        return loss

    def accuracy(self, x, t):
        """Calculate recognition accuracy

        Args:
            x (numpy.ndarray): input to the neural network
            t (numpy.ndarray): correct labels

        Returns:
            float: recognition accuracy
        """
        y = self.predict(x)
        y = np.argmax(y, axis=1)
        t = np.argmax(t, axis=1)

        accuracy = np.sum(y == t) / x.shape[0]
        return accuracy

    def numerical_gradient(self, x, t):
        """Calculate the gradient of the weight parameters by numerical differentiation

        Args:
            x (numpy.ndarray): input to the neural network
            t (numpy.ndarray): correct labels

        Returns:
            dictionary: dictionary containing the gradients
        """
        grads = {}
        grads['W1'] = \
            numerical_gradient(lambda: self.loss(x, t), self.params['W1'])
        grads['b1'] = \
            numerical_gradient(lambda: self.loss(x, t), self.params['b1'])
        grads['W2'] = \
            numerical_gradient(lambda: self.loss(x, t), self.params['W2'])
        grads['b2'] = \
            numerical_gradient(lambda: self.loss(x, t), self.params['b2'])

        return grads

    def gradient(self, x, t):
        """Calculate the gradient of the weight parameters by error backpropagation

        Args:
            x (numpy.ndarray): input to the neural network
            t (numpy.ndarray): correct labels

        Returns:
            dictionary: dictionary containing the gradients
        """
        # Forward propagation (propagate forward to calculate the loss value)
        self.loss(x, t)

        # Backpropagation
        dout = self.lastLayer.backward()
        for layer in reversed(list(self.layers.values())):
            dout = layer.backward(dout)

        # Extract the derivative values from each layer
        grads = {}
        grads['W1'] = self.layers['Affine1'].dW
        grads['b1'] = self.layers['Affine1'].db
        grads['W2'] = self.layers['Affine2'].dW
        grads['b2'] = self.layers['Affine2'].db

        return grads
```
The code in the book uses `OrderedDict`, but here I use an ordinary `dict`. This is because, starting with Python 3.7, `dict` objects preserve insertion order[^1].
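A quick way to convince yourself (assuming Python 3.7 or later): forward propagation iterates over the values in insertion order, and backpropagation simply iterates over them in reverse.

```python
layers = {}
layers['Affine1'] = 'first'
layers['Relu1'] = 'second'
layers['Affine2'] = 'third'

# dict preserves insertion order from Python 3.7 onward
print(list(layers.keys()))                    # ['Affine1', 'Relu1', 'Affine2']
print(list(reversed(list(layers.values()))))  # ['third', 'second', 'first']
```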
This code compares the gradient obtained by the error back propagation method with the gradient obtained by numerical differentiation.
gradient_check.py
```python
# coding: utf-8
import os
import sys

import numpy as np

from two_layer_net import TwoLayerNet

sys.path.append(os.pardir)  # add the parent directory to the path
from dataset.mnist import load_mnist


# Read the MNIST training data and test data
(x_train, t_train), (x_test, t_test) = \
    load_mnist(normalize=True, one_hot_label=True)

# Generate a two-layer neural network
network = TwoLayerNet(input_size=784, hidden_size=50, output_size=10)

# Prepare data for verification
x_batch = x_train[:3]
t_batch = t_train[:3]

# Gradient calculation by numerical differentiation and by error backpropagation
grad_numerical = network.numerical_gradient(x_batch, t_batch)
grad_backprop = network.gradient(x_batch, t_batch)

# Check the difference for each weight
for key in grad_numerical.keys():

    # Calculate the absolute value of the difference
    diff = np.abs(grad_backprop[key] - grad_numerical[key])

    # Show the average and the maximum
    print(f"{key}: [Average difference]{np.average(diff):.10f} [Maximum difference]{np.max(diff):.10f}")
```
In the book only the average of the absolute differences is checked, but I also checked the maximum of the absolute differences.
```
W1: [Average difference]0.0000000003 [Maximum difference]0.0000000080
b1: [Average difference]0.0000000021 [Maximum difference]0.0000000081
W2: [Average difference]0.0000000063 [Maximum difference]0.0000000836
b2: [Average difference]0.0000001394 [Maximum difference]0.0000002334
```
Since the differences for `b2` only agree down to about the seventh decimal place, the error seems a little larger than in the book. There may be something wrong with my implementation; if you notice anything, please let me know :sweat:
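As an aside, when the gradient values themselves are small it might be more informative to look at relative rather than absolute differences; below is a minimal sketch of that idea (my own, not from the book).

```python
import numpy as np


def relative_diff(a, b, eps=1e-12):
    """Element-wise relative difference |a - b| / (|a| + |b|), guarded against division by zero."""
    return np.abs(a - b) / (np.abs(a) + np.abs(b) + eps)


# Usage, assuming grad_backprop and grad_numerical from gradient_check.py above:
# for key in grad_numerical.keys():
#     print(key, np.max(relative_diff(grad_backprop[key], grad_numerical[key])))
```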
Below is the learning code.
mnist.py
```python
# coding: utf-8
import os
import sys

import matplotlib.pylab as plt
import numpy as np

from two_layer_net import TwoLayerNet

sys.path.append(os.pardir)  # add the parent directory to the path
from dataset.mnist import load_mnist


# Read the MNIST training data and test data
(x_train, t_train), (x_test, t_test) = \
    load_mnist(normalize=True, one_hot_label=True)

# Hyperparameter settings
iters_num = 10000       # number of updates
batch_size = 100        # batch size
learning_rate = 0.1     # learning rate

# Lists for recording the results
train_loss_list = []    # transition of the value of the loss function
train_acc_list = []     # recognition accuracy for the training data
test_acc_list = []      # recognition accuracy for the test data

train_size = x_train.shape[0]  # training data size
iter_per_epoch = max(int(train_size / batch_size), 1)  # number of iterations per epoch

# Generate a two-layer neural network
network = TwoLayerNet(input_size=784, hidden_size=50, output_size=10)

# Start training
for i in range(iters_num):

    # Generate a mini-batch
    batch_mask = np.random.choice(train_size, batch_size, replace=False)
    x_batch = x_train[batch_mask]
    t_batch = t_train[batch_mask]

    # Gradient calculation
    grad = network.gradient(x_batch, t_batch)

    # Update the weight parameters
    for key in ('W1', 'b1', 'W2', 'b2'):
        network.params[key] -= learning_rate * grad[key]

    # Calculate the value of the loss function
    loss = network.loss(x_batch, t_batch)
    train_loss_list.append(loss)

    # Calculate the recognition accuracy for each epoch
    if i % iter_per_epoch == 0:
        train_acc = network.accuracy(x_train, t_train)
        test_acc = network.accuracy(x_test, t_test)
        train_acc_list.append(train_acc)
        test_acc_list.append(test_acc)

        # Show progress
        print(f'[Number of updates]{i:>4} [Loss function value]{loss:.4f} '
              f'[Training data recognition accuracy]{train_acc:.4f} [Test data recognition accuracy]{test_acc:.4f}')

# Draw the transition of the value of the loss function
x = np.arange(len(train_loss_list))
plt.plot(x, train_loss_list, label='loss')
plt.xlabel('iteration')
plt.ylabel('loss')
plt.xlim(left=0)
plt.ylim(bottom=0)
plt.show()

# Draw the transition of the recognition accuracy for the training data and the test data
x2 = np.arange(len(train_acc_list))
plt.plot(x2, train_acc_list, label='train acc')
plt.plot(x2, test_acc_list, label='test acc', linestyle='--')
plt.xlabel('epochs')
plt.ylabel('accuracy')
plt.xlim(left=0)
plt.ylim(0, 1.0)
plt.legend(loc='lower right')
plt.show()
```
And here are the execution results, in the same format as in the previous chapter.
```
[Number of updates]   0 [Loss function value]2.3008 [Training data recognition accuracy]0.0926 [Test data recognition accuracy]0.0822
[Number of updates] 600 [Loss function value]0.2575 [Training data recognition accuracy]0.9011 [Test data recognition accuracy]0.9068
[Number of updates]1200 [Loss function value]0.2926 [Training data recognition accuracy]0.9219 [Test data recognition accuracy]0.9242
[Number of updates]1800 [Loss function value]0.2627 [Training data recognition accuracy]0.9324 [Test data recognition accuracy]0.9341
[Number of updates]2400 [Loss function value]0.0899 [Training data recognition accuracy]0.9393 [Test data recognition accuracy]0.9402
[Number of updates]3000 [Loss function value]0.1096 [Training data recognition accuracy]0.9500 [Test data recognition accuracy]0.9483
[Number of updates]3600 [Loss function value]0.1359 [Training data recognition accuracy]0.9559 [Test data recognition accuracy]0.9552
[Number of updates]4200 [Loss function value]0.1037 [Training data recognition accuracy]0.9592 [Test data recognition accuracy]0.9579
[Number of updates]4800 [Loss function value]0.1065 [Training data recognition accuracy]0.9639 [Test data recognition accuracy]0.9600
[Number of updates]5400 [Loss function value]0.0419 [Training data recognition accuracy]0.9665 [Test data recognition accuracy]0.9633
[Number of updates]6000 [Loss function value]0.0393 [Training data recognition accuracy]0.9698 [Test data recognition accuracy]0.9649
[Number of updates]6600 [Loss function value]0.0575 [Training data recognition accuracy]0.9718 [Test data recognition accuracy]0.9663
[Number of updates]7200 [Loss function value]0.0850 [Training data recognition accuracy]0.9728 [Test data recognition accuracy]0.9677
[Number of updates]7800 [Loss function value]0.0403 [Training data recognition accuracy]0.9749 [Test data recognition accuracy]0.9686
[Number of updates]8400 [Loss function value]0.0430 [Training data recognition accuracy]0.9761 [Test data recognition accuracy]0.9685
[Number of updates]9000 [Loss function value]0.0513 [Training data recognition accuracy]0.9782 [Test data recognition accuracy]0.9715
[Number of updates]9600 [Loss function value]0.0584 [Training data recognition accuracy]0.9777 [Test data recognition accuracy]0.9707
```
Compared to the results in the previous chapter, the recognition accuracy rises faster, ending up at about 97%. Since the only difference between numerical differentiation and the error backpropagation method should be how the gradient is calculated, it seems that the switch from the sigmoid function to the ReLU function is what led to the improvement.
The explanation with computational graphs really was easy to understand. I also understood well that the output layer and the loss function are designed so that the derivative values can be obtained easily.
That's all for this chapter. If you find any mistakes, I would be grateful if you could point them out. (To other chapters of this memo: Chapter 1 / Chapter 2 / Chapter 3 / Chapter 4 / Chapter 5 / [Chapter 6](https://qiita.com/segavvy/items/ca4ac4c9ee1a126bff41) / Chapter 7 / Chapter 8 / Summary)
[^1]: See "Improvement of Python's Data Model" in What's New In Python 3.7.