I started studying Chapter 5 of "Deep Learning from Scratch: The Theory and Implementation of Deep Learning Learned with Python". This is a memo of the things I stumbled over along the way.
The execution environment is macOS Mojave + Anaconda 2019.10, and the Python version is 3.7.4. For details, refer to Chapter 1 of this memo.
(To other chapters of this memo: Chapter 1 / Chapter 2 / Chapter 3 / Chapter 4 / Chapter 5 / [Chapter 6](https://qiita.com/segavvy/items/ca4ac4c9ee1a126bff41) / Chapter 7 / Chapter 8 / Summary)
This chapter describes the error backpropagation method, which speeds up the calculation of the gradients of the weight parameters in neural network training.
By the way, the Japanese word 伝播 (denpa) is a translation of "propagation," but the terminology is chaotic because it is also rendered as 伝搬 (denpan). I found a blog article that looked into this, so if you are interested, see "rest after walking" > [Machine learning] Should "propagation" be translated as 伝播 (denpa) or 伝搬 (denpan)?
The explanation using computational graphs makes the error backpropagation method very easy to understand. The secret of this book's popularity may well be Chapter 5.
Training is essential for neural networks, and it requires finding the gradient of the weight parameters, that is, calculating the derivative of the loss with respect to each parameter. Using a computational graph, you can see that this calculation can be done very efficiently by propagating local derivatives backward from the output side toward the input side.
This section reviews the derivative of a composite function (the chain rule) and explains why backpropagation works. I don't think there is any problem just reading through it, but if you want to understand it properly, you need to review high school differentiation.
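For reference, a quick worked example of the chain rule that backpropagation relies on: the derivative of a composite function is the product of the local derivatives (the example below is my own choice of functions, not a quotation from the book).

\begin{align}
z &= t^2, \quad t = x + y \\
\frac{\partial z}{\partial x} &= \frac{\partial z}{\partial t}\frac{\partial t}{\partial x} = 2t \cdot 1 = 2(x + y)
\end{align}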
I didn't remember the differentiation formulas at all, so I decided to review them for the first time in ages. However, I usually study after my day job is done, and with books and websites sleepiness attacks me and nothing sticks. Studying in spare moments is quite difficult for office workers.
What I found was Try IT, a video lesson service run by the home tutoring company Try. The lessons are free yet very easy to understand, and each is split into chunks of about 15 minutes, so they are ideal for studying on a smartphone while commuting or on a break. It is a service aimed at students, but I think it also suits working adults who want to review.
For reference, here are links to the videos covering the introductory part of differentiation. Note that they are limited to personal viewing.
- Video lesson Try IT > Mathematics II: Limits and derivative functions
- Video lesson Try IT > Mathematics II: Differentiation methods
This part describes backpropagation at addition and multiplication nodes. Since the derivatives of both addition and multiplication are simple, the backpropagation calculations are easy.
This is the implementation of the addition layer and the multiplication layer from the previous section. Since the derivatives are simple, the implementation is easy as well (a sketch follows below).
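The post itself does not reproduce these two layers, so here is a minimal sketch of what they might look like; the class names and the apple/tax example values are my own assumptions in the spirit of the book, not a quotation of its code.

```python
# coding: utf-8
class AddLayer:
    """Addition layer: z = x + y"""
    def forward(self, x, y):
        return x + y

    def backward(self, dout):
        # The local derivative of addition is 1, so dout simply passes through to both inputs
        return dout * 1, dout * 1


class MulLayer:
    """Multiplication layer: z = x * y"""
    def __init__(self):
        self.x = None
        self.y = None

    def forward(self, x, y):
        self.x = x  # keep the inputs, they are needed for backpropagation
        self.y = y
        return x * y

    def backward(self, dout):
        # The local derivative of x*y is y with respect to x, and x with respect to y
        dx = dout * self.y
        dy = dout * self.x
        return dx, dy


if __name__ == '__main__':
    # Tiny check: 2 apples at 100 yen with 10% consumption tax
    apple_layer = MulLayer()
    tax_layer = MulLayer()
    apple_price = apple_layer.forward(100, 2)    # 200
    total = tax_layer.forward(apple_price, 1.1)  # 220.0
    dprice, dtax = tax_layer.backward(1.0)       # 1.1, 200
    dapple, dnum = apple_layer.backward(dprice)  # 2.2, approx. 110
    print(total, dapple, dnum, dtax)
```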
An implementation of the activation function layer.
Since the derivative of ReLU is simple to calculate, it is easy to implement and should also contribute to faster learning. Below is my implementation.
relu.py
```python
# coding: utf-8


class ReLU:
    def __init__(self):
        """ReLU layer"""
        self.mask = None

    def forward(self, x):
        """Forward propagation

        Args:
            x (numpy.ndarray): input

        Returns:
            numpy.ndarray: output
        """
        self.mask = (x <= 0)
        out = x.copy()
        out[self.mask] = 0

        return out

    def backward(self, dout):
        """Backpropagation

        Args:
            dout (numpy.ndarray): derivative value propagated from the layer on the right

        Returns:
            numpy.ndarray: derivative value
        """
        dout[self.mask] = 0
        dx = dout

        return dx
```
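As a quick sanity check with a toy input of my own (not from the book), the mask zeroes out exactly the positions where the forward input was non-positive:

```python
import numpy as np

from relu import ReLU  # assumes relu.py above is saved alongside

relu = ReLU()
x = np.array([[1.0, -0.5], [-2.0, 3.0]])
print(relu.forward(x))                 # [[1. 0.] [0. 3.]]
print(relu.backward(np.ones_like(x)))  # [[1. 0.] [0. 1.]]
```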
The derivative of the sigmoid function is a little complicated, but it can be worked out step by step with a computational graph. However, it is tough going unless you review the derivative of $y = \frac{1}{x}$ needed for the first "/" node and the derivative of $y = \exp(x)$ for the "$\exp$" node. For the formulas in this area, I recommend Mathematics Learned from Concrete Examples > Calculus > All 59 differentiation formulas organized by importance, because it is neatly summarized.
After the computational-graph calculation, the book transforms the result into equation (5.12), but at first I did not understand how the second-to-last line turns into the last line. It becomes easier to follow if you insert one extra line in between, as follows.
\begin{align}
\frac{\partial L}{\partial y}y^2\exp(-x) &= \frac{\partial L}{\partial y} \frac{1}{(1+\exp(-x))^2} \exp(-x) \\
&=\frac{\partial L}{\partial y} \frac{1}{1+\exp(-x)} \frac{\exp(-x)}{1+\exp(-x)} \\
&=\frac{\partial L}{\partial y} \frac{1}{1+\exp(-x)} \biggl(\frac{1+\exp(-x)}{1+\exp(-x)} - \frac{1}{1+\exp(-x)}\biggr) ← added line\\
&=\frac{\partial L}{\partial y} y(1-y)
\end{align}
What is surprising about this result is that it can be computed very easily from $y$, the output of forward propagation. Since the error backpropagation method computes forward propagation first, having that result on hand leads to faster learning. The property that $e^x$ remains $e^x$ when differentiated is also important; whoever came up with the sigmoid function is really amazing.
The sigmoid layer is not used in this chapter, so the code is omitted.
The calculation process behind the Affine layer backpropagation formula (5.13) is omitted in the book, but @yuyasat works through the components step by step in [Calculating the components of the Affine layer's backpropagation](https://qiita.com/yuyasat/items/d9cdd4401221df5375b6), which is helpful for reference.
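For reference, here are the relationships that the implementation below follows, written in my own notation (with input $X$, weights $W$, bias $B$, and forward output $Y = XW + B$); treat this as my summary rather than a quotation of equation (5.13).

\begin{align}
\frac{\partial L}{\partial X} &= \frac{\partial L}{\partial Y}\,W^T \\
\frac{\partial L}{\partial W} &= X^T\,\frac{\partial L}{\partial Y} \\
\frac{\partial L}{\partial B} &= \sum_{i}\frac{\partial L}{\partial Y_i} \quad \text{(sum over the batch axis)}
\end{align}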
Below is my implementation. The book's code additionally handles the case where the input data is a tensor (four-dimensional data), but I have not included that because I do not yet know where it is used.
affine.py
```python
# coding: utf-8
import numpy as np


class Affine:
    def __init__(self, W, b):
        """Affine layer

        Args:
            W (numpy.ndarray): weight
            b (numpy.ndarray): bias
        """
        self.W = W      # weight
        self.b = b      # bias
        self.x = None   # input
        self.dW = None  # derivative value of the weight
        self.db = None  # derivative value of the bias

    def forward(self, x):
        """Forward propagation

        Args:
            x (numpy.ndarray): input

        Returns:
            numpy.ndarray: output
        """
        self.x = x
        out = np.dot(x, self.W) + self.b

        return out

    def backward(self, dout):
        """Backpropagation

        Args:
            dout (numpy.ndarray): derivative value propagated from the layer on the right

        Returns:
            numpy.ndarray: derivative value
        """
        dx = np.dot(dout, self.W.T)
        self.dW = np.dot(self.x.T, dout)
        self.db = np.sum(dout, axis=0)

        return dx
```
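A quick shape check with toy sizes of my own (not from the book):

```python
import numpy as np

from affine import Affine  # assumes affine.py above is saved alongside

W = np.random.randn(4, 3)
b = np.zeros(3)
layer = Affine(W, b)

x = np.random.randn(2, 4)               # batch of 2 samples with 4 features
out = layer.forward(x)
dx = layer.backward(np.ones_like(out))
print(out.shape, dx.shape, layer.dW.shape, layer.db.shape)  # (2, 3) (2, 4) (4, 3) (3,)
```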
Backpropagation through the layer that combines the softmax function and the cross entropy error. The calculation process is described in detail in Appendix A of the book; properties such as $e^x$ remaining $e^x$ when differentiated, and the elements of a one-hot teacher label summing to 1, contribute to simplifying the formula.
The final result of $(y_1 - t_1, y_2 - t_2, y_3 - t_3)$ is also surprising. It is simply the difference between the output of the neural network and the teacher label, and it can be computed quickly. It seems the cross entropy error was designed so that it comes out this "cleanly" when used as the loss function for softmax; whoever came up with the cross entropy error is really amazing.
Note that to handle batches, the backpropagated value must be divided by the number of batches. The book only says that "by dividing the value to be propagated by the number of batches (batch_size), the error per piece of data is propagated to the previous layer," and I could not understand why that division was necessary. However, I understood it from the code explanation in @Yoko303's Deep Learning from scratch ~Softmax-with-Loss layer~. In the batch version of forward propagation, the cross entropy errors are finally summed and divided by the number of batches (batch_size) to produce a single value. Backpropagation must pass through this part as well, and its derivative is $\frac{1}{batch\_size}$. I drew the computational graph for that part below.
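In formulas (my own notation, with $N$ standing for batch_size, one-hot teacher labels $t_{nk}$, softmax outputs $y_{nk}$, and softmax inputs $x_{nk}$):

\begin{align}
L &= -\frac{1}{N}\sum_{n=1}^{N}\sum_{k} t_{nk}\log y_{nk} \\
\frac{\partial L}{\partial x_{nk}} &= \frac{y_{nk} - t_{nk}}{N}
\end{align}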
I think this understanding is correct, but please point out any mistakes. Below is the implemented code.
softmax_with_loss.py
```python
# coding: utf-8
from functions import softmax, cross_entropy_error


class SoftmaxWithLoss:
    def __init__(self):
        """Softmax-with-Loss layer"""
        self.loss = None    # loss
        self.y = None       # softmax output
        self.t = None       # teacher data (one-hot vector)

    def forward(self, x, t):
        """Forward propagation

        Args:
            x (numpy.ndarray): input
            t (numpy.ndarray): teacher data

        Returns:
            float: cross entropy error
        """
        self.t = t
        self.y = softmax(x)
        self.loss = cross_entropy_error(self.y, self.t)

        return self.loss

    def backward(self, dout=1):
        """Backpropagation

        Args:
            dout (float, optional): derivative value propagated from the layer on the right. Default is 1.

        Returns:
            numpy.ndarray: derivative value
        """
        batch_size = self.t.shape[0]    # number of batches
        dx = (self.y - self.t) * (dout / batch_size)

        return dx
```
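A quick check with made-up numbers (mine, not from the book), assuming the files from this post are saved alongside; it confirms that `backward()` really returns $(y - t)$ divided by the batch size:

```python
import numpy as np

from softmax_with_loss import SoftmaxWithLoss

layer = SoftmaxWithLoss()
x = np.array([[2.0, 0.5, -1.0], [0.1, 0.3, 0.2]])  # scores for 2 samples, 3 classes
t = np.array([[1, 0, 0], [0, 0, 1]])               # one-hot teacher labels
loss = layer.forward(x, t)
dx = layer.backward()
print(loss)                    # a single loss value averaged over the batch
print(dx - (layer.y - t) / 2)  # (almost) all zeros: dx is (y - t) / batch_size
```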
Note that the code in the book does not use `dout` inside `backward()`. Since `backward()` is only ever called with `dout=1`, this causes no problem in practice, but I think it is probably a mistake.
This is a review of the implementation flow. There is no particular stumbling block.
First, the general-purpose functions. I brought over only what I needed from what I wrote in the previous chapter.
functions.py
```python
# coding: utf-8
import numpy as np


def softmax(x):
    """Softmax function

    Args:
        x (numpy.ndarray): input

    Returns:
        numpy.ndarray: output
    """
    # For batch processing, x is a two-dimensional array of shape (number of batches, 10).
    # In that case each image must be handled separately using broadcasting.
    # So that the 1D and 2D cases can share the same code, np.max() and np.sum()
    # are calculated with axis=-1 and keepdims=True to keep the dimensions for broadcasting.
    c = np.max(x, axis=-1, keepdims=True)
    exp_a = np.exp(x - c)  # overflow countermeasure
    sum_exp_a = np.sum(exp_a, axis=-1, keepdims=True)
    y = exp_a / sum_exp_a

    return y


def numerical_gradient(f, x):
    """Gradient calculation

    Args:
        f (function): loss function
        x (numpy.ndarray): array of weight parameters whose gradient is to be checked

    Returns:
        numpy.ndarray: gradient
    """
    h = 1e-4
    grad = np.zeros_like(x)

    # Enumerate the elements of the multidimensional array with np.nditer
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:

        idx = it.multi_index  # it.multi_index is the index of the current element
        tmp_val = x[idx]      # save the original value

        # Calculate f(x + h)
        x[idx] = tmp_val + h
        fxh1 = f()

        # Calculate f(x - h)
        x[idx] = tmp_val - h
        fxh2 = f()

        # Calculate the gradient
        grad[idx] = (fxh1 - fxh2) / (2 * h)

        x[idx] = tmp_val  # restore the original value
        it.iternext()

    return grad


def cross_entropy_error(y, t):
    """Calculation of the cross entropy error

    Args:
        y (numpy.ndarray): neural network output
        t (numpy.ndarray): correct labels

    Returns:
        float: cross entropy error
    """
    # If there is a single piece of data, reshape it so each data item is one row
    if y.ndim == 1:
        t = t.reshape(1, t.size)
        y = y.reshape(1, y.size)

    # Calculate the error and normalize by the number of batches
    batch_size = y.shape[0]
    return -np.sum(t * np.log(y + 1e-7)) / batch_size
```
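A quick check with made-up numbers of my own: each row of the softmax output should sum to 1, and the cross entropy error should come back as a single value averaged over the batch.

```python
import numpy as np

from functions import softmax, cross_entropy_error  # assumes functions.py above is saved alongside

x = np.array([[0.3, 2.9, 4.0], [0.1, 0.2, 0.3]])
y = softmax(x)
print(np.sum(y, axis=-1))         # [1. 1.]

t = np.array([[0, 0, 1], [0, 1, 0]])
print(cross_entropy_error(y, t))  # averaged over the 2 samples
```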
And then there is the neural network class. Since it is based on the code in the previous chapter, there are many of the same parts.
two_layer_net.py
```python
# coding: utf-8
import numpy as np

from affine import Affine
from functions import numerical_gradient
from relu import ReLU
from softmax_with_loss import SoftmaxWithLoss


class TwoLayerNet:

    def __init__(self, input_size, hidden_size, output_size,
                 weight_init_std=0.01):
        """Two-layer neural network

        Args:
            input_size (int): number of neurons in the input layer
            hidden_size (int): number of neurons in the hidden layer
            output_size (int): number of neurons in the output layer
            weight_init_std (float, optional): scaling parameter for the initial weights. Default is 0.01.
        """
        # Weight initialization
        self.params = {}
        self.params['W1'] = weight_init_std * \
            np.random.randn(input_size, hidden_size)
        self.params['b1'] = np.zeros(hidden_size)
        self.params['W2'] = weight_init_std * \
            np.random.randn(hidden_size, output_size)
        self.params['b2'] = np.zeros(output_size)

        # Layer generation
        self.layers = {}  # dict keeps insertion order from Python 3.7, so OrderedDict is unnecessary
        self.layers['Affine1'] = \
            Affine(self.params['W1'], self.params['b1'])
        self.layers['Relu1'] = ReLU()
        self.layers['Affine2'] = \
            Affine(self.params['W2'], self.params['b2'])

        self.lastLayer = SoftmaxWithLoss()

    def predict(self, x):
        """Inference with the neural network

        Args:
            x (numpy.ndarray): input to the neural network

        Returns:
            numpy.ndarray: neural network output
        """
        # Propagate forward through the layers
        for layer in self.layers.values():
            x = layer.forward(x)

        return x

    def loss(self, x, t):
        """Calculate the value of the loss function

        Args:
            x (numpy.ndarray): input to the neural network
            t (numpy.ndarray): correct labels

        Returns:
            float: value of the loss function
        """
        # Inference
        y = self.predict(x)

        # Calculated by forward propagation of the Softmax-with-Loss layer
        loss = self.lastLayer.forward(y, t)

        return loss

    def accuracy(self, x, t):
        """Calculate recognition accuracy

        Args:
            x (numpy.ndarray): input to the neural network
            t (numpy.ndarray): correct labels

        Returns:
            float: recognition accuracy
        """
        y = self.predict(x)
        y = np.argmax(y, axis=1)
        t = np.argmax(t, axis=1)

        accuracy = np.sum(y == t) / x.shape[0]
        return accuracy

    def numerical_gradient(self, x, t):
        """Calculate the gradient of the weight parameters by numerical differentiation

        Args:
            x (numpy.ndarray): input to the neural network
            t (numpy.ndarray): correct labels

        Returns:
            dictionary: dictionary containing the gradients
        """
        grads = {}
        grads['W1'] = \
            numerical_gradient(lambda: self.loss(x, t), self.params['W1'])
        grads['b1'] = \
            numerical_gradient(lambda: self.loss(x, t), self.params['b1'])
        grads['W2'] = \
            numerical_gradient(lambda: self.loss(x, t), self.params['W2'])
        grads['b2'] = \
            numerical_gradient(lambda: self.loss(x, t), self.params['b2'])

        return grads

    def gradient(self, x, t):
        """Calculate the gradient of the weight parameters by error backpropagation

        Args:
            x (numpy.ndarray): input to the neural network
            t (numpy.ndarray): correct labels

        Returns:
            dictionary: dictionary containing the gradients
        """
        # Forward propagation (propagate forward to calculate the loss value)
        self.loss(x, t)

        # Backpropagation
        dout = self.lastLayer.backward()
        for layer in reversed(list(self.layers.values())):
            dout = layer.backward(dout)

        # Extract the derivative values from each layer
        grads = {}
        grads['W1'] = self.layers['Affine1'].dW
        grads['b1'] = self.layers['Affine1'].db
        grads['W2'] = self.layers['Affine2'].dW
        grads['b2'] = self.layers['Affine2'].db

        return grads
```
The code in the book uses `OrderedDict`, but here I use an ordinary `dict`. This is because, starting with Python 3.7, `dict` objects preserve insertion order[^1].
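A quick way to convince yourself (assuming Python 3.7 or later): forward propagation iterates over the values in insertion order, and backpropagation simply iterates over them in reverse.

```python
layers = {}
layers['Affine1'] = 'first'
layers['Relu1'] = 'second'
layers['Affine2'] = 'third'

# dict preserves insertion order from Python 3.7 onward
print(list(layers.keys()))                    # ['Affine1', 'Relu1', 'Affine2']
print(list(reversed(list(layers.values()))))  # ['third', 'second', 'first']
```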
This code compares the gradient obtained by the error back propagation method with the gradient obtained by numerical differentiation.
gradient_check.py
```python
# coding: utf-8
import os
import sys

import numpy as np

from two_layer_net import TwoLayerNet

sys.path.append(os.pardir)  # add the parent directory to the path
from dataset.mnist import load_mnist


# Read the MNIST training data and test data
(x_train, t_train), (x_test, t_test) = \
    load_mnist(normalize=True, one_hot_label=True)

# Generate a two-layer neural network
network = TwoLayerNet(input_size=784, hidden_size=50, output_size=10)

# Prepare data for verification
x_batch = x_train[:3]
t_batch = t_train[:3]

# Gradient calculation by numerical differentiation and by error backpropagation
grad_numerical = network.numerical_gradient(x_batch, t_batch)
grad_backprop = network.gradient(x_batch, t_batch)

# Check the difference for each weight
for key in grad_numerical.keys():

    # Calculate the absolute value of the difference
    diff = np.abs(grad_backprop[key] - grad_numerical[key])

    # Show the average and the maximum
    print(f"{key}: [Average difference]{np.average(diff):.10f} [Maximum difference]{np.max(diff):.10f}")
```
In the book only the average of the absolute differences is checked, but I also checked the maximum of the absolute differences.
```
W1: [Average difference]0.0000000003 [Maximum difference]0.0000000080
b1: [Average difference]0.0000000021 [Maximum difference]0.0000000081
W2: [Average difference]0.0000000063 [Maximum difference]0.0000000836
b2: [Average difference]0.0000001394 [Maximum difference]0.0000002334
```
Since the differences for `b2` only agree down to about the seventh decimal place, the error seems a little larger than in the book. There may be something wrong with my implementation; if you notice anything, please let me know :sweat:
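As an aside, when the gradient values themselves are small it might be more informative to look at relative rather than absolute differences; below is a minimal sketch of that idea (my own, not from the book).

```python
import numpy as np


def relative_diff(a, b, eps=1e-12):
    """Element-wise relative difference |a - b| / (|a| + |b|), guarded against division by zero."""
    return np.abs(a - b) / (np.abs(a) + np.abs(b) + eps)


# Usage, assuming grad_backprop and grad_numerical from gradient_check.py above:
# for key in grad_numerical.keys():
#     print(key, np.max(relative_diff(grad_backprop[key], grad_numerical[key])))
```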
Below is the learning code.
mnist.py
```python
# coding: utf-8
import os
import sys

import matplotlib.pylab as plt
import numpy as np

from two_layer_net import TwoLayerNet

sys.path.append(os.pardir)  # add the parent directory to the path
from dataset.mnist import load_mnist


# Read the MNIST training data and test data
(x_train, t_train), (x_test, t_test) = \
    load_mnist(normalize=True, one_hot_label=True)

# Hyperparameter settings
iters_num = 10000       # number of updates
batch_size = 100        # batch size
learning_rate = 0.1     # learning rate

# Lists for recording the results
train_loss_list = []    # transition of the value of the loss function
train_acc_list = []     # recognition accuracy for the training data
test_acc_list = []      # recognition accuracy for the test data

train_size = x_train.shape[0]  # training data size
iter_per_epoch = max(int(train_size / batch_size), 1)  # number of iterations per epoch

# Generate a two-layer neural network
network = TwoLayerNet(input_size=784, hidden_size=50, output_size=10)

# Start training
for i in range(iters_num):

    # Generate a mini-batch
    batch_mask = np.random.choice(train_size, batch_size, replace=False)
    x_batch = x_train[batch_mask]
    t_batch = t_train[batch_mask]

    # Gradient calculation
    grad = network.gradient(x_batch, t_batch)

    # Update the weight parameters
    for key in ('W1', 'b1', 'W2', 'b2'):
        network.params[key] -= learning_rate * grad[key]

    # Calculate the value of the loss function
    loss = network.loss(x_batch, t_batch)
    train_loss_list.append(loss)

    # Calculate the recognition accuracy for each epoch
    if i % iter_per_epoch == 0:
        train_acc = network.accuracy(x_train, t_train)
        test_acc = network.accuracy(x_test, t_test)
        train_acc_list.append(train_acc)
        test_acc_list.append(test_acc)

        # Show progress
        print(f'[Number of updates]{i:>4} [Loss function value]{loss:.4f} '
              f'[Training data recognition accuracy]{train_acc:.4f} [Test data recognition accuracy]{test_acc:.4f}')

# Draw the transition of the value of the loss function
x = np.arange(len(train_loss_list))
plt.plot(x, train_loss_list, label='loss')
plt.xlabel('iteration')
plt.ylabel('loss')
plt.xlim(left=0)
plt.ylim(bottom=0)
plt.show()

# Draw the transition of the recognition accuracy for the training data and the test data
x2 = np.arange(len(train_acc_list))
plt.plot(x2, train_acc_list, label='train acc')
plt.plot(x2, test_acc_list, label='test acc', linestyle='--')
plt.xlabel('epochs')
plt.ylabel('accuracy')
plt.xlim(left=0)
plt.ylim(0, 1.0)
plt.legend(loc='lower right')
plt.show()
```
And here are the execution results, in the same format as in the previous chapter.
```
[Number of updates]   0 [Loss function value]2.3008 [Training data recognition accuracy]0.0926 [Test data recognition accuracy]0.0822
[Number of updates] 600 [Loss function value]0.2575 [Training data recognition accuracy]0.9011 [Test data recognition accuracy]0.9068
[Number of updates]1200 [Loss function value]0.2926 [Training data recognition accuracy]0.9219 [Test data recognition accuracy]0.9242
[Number of updates]1800 [Loss function value]0.2627 [Training data recognition accuracy]0.9324 [Test data recognition accuracy]0.9341
[Number of updates]2400 [Loss function value]0.0899 [Training data recognition accuracy]0.9393 [Test data recognition accuracy]0.9402
[Number of updates]3000 [Loss function value]0.1096 [Training data recognition accuracy]0.9500 [Test data recognition accuracy]0.9483
[Number of updates]3600 [Loss function value]0.1359 [Training data recognition accuracy]0.9559 [Test data recognition accuracy]0.9552
[Number of updates]4200 [Loss function value]0.1037 [Training data recognition accuracy]0.9592 [Test data recognition accuracy]0.9579
[Number of updates]4800 [Loss function value]0.1065 [Training data recognition accuracy]0.9639 [Test data recognition accuracy]0.9600
[Number of updates]5400 [Loss function value]0.0419 [Training data recognition accuracy]0.9665 [Test data recognition accuracy]0.9633
[Number of updates]6000 [Loss function value]0.0393 [Training data recognition accuracy]0.9698 [Test data recognition accuracy]0.9649
[Number of updates]6600 [Loss function value]0.0575 [Training data recognition accuracy]0.9718 [Test data recognition accuracy]0.9663
[Number of updates]7200 [Loss function value]0.0850 [Training data recognition accuracy]0.9728 [Test data recognition accuracy]0.9677
[Number of updates]7800 [Loss function value]0.0403 [Training data recognition accuracy]0.9749 [Test data recognition accuracy]0.9686
[Number of updates]8400 [Loss function value]0.0430 [Training data recognition accuracy]0.9761 [Test data recognition accuracy]0.9685
[Number of updates]9000 [Loss function value]0.0513 [Training data recognition accuracy]0.9782 [Test data recognition accuracy]0.9715
[Number of updates]9600 [Loss function value]0.0584 [Training data recognition accuracy]0.9777 [Test data recognition accuracy]0.9707
```
Compared to the results in the previous chapter, the recognition accuracy rises faster, ending up at about 97%. Since the only difference between numerical differentiation and the error backpropagation method should be how the gradient is calculated, it seems that the switch from the sigmoid function to the ReLU function is what led to the improvement.
The explanation with computational graphs really was easy to understand. I also understood well that the output layer and the loss function are designed so that the derivative values can be obtained easily.
That's all for this chapter. If you find any mistakes, I would be grateful if you could point them out. (To other chapters of this memo: Chapter 1 / Chapter 2 / Chapter 3 / Chapter 4 / Chapter 5 / [Chapter 6](https://qiita.com/segavvy/items/ca4ac4c9ee1a126bff41) / Chapter 7 / Chapter 8 / Summary)
[^1]: See "Improvement of Python's Data Model" in What's New In Python 3.7.