Notes of an amateur stumbling through Deep Learning from Scratch: Chapter 8

Introduction

This is a memo of my journey through Chapter 8 of "Deep Learning from Scratch: The Theory and Implementation of Deep Learning Learned with Python".

The execution environment is macOS Mojave + Anaconda 2019.10, and the Python version is 3.7.4. For details, refer to Chapter 1 of this memo.

(To other chapters of this memo: Chapter 1 / Chapter 2 / Chapter 3 / Chapter 4 / [Chapter 5](https://qiita.com/segavvy/items/8707e4e65aa7fa357d8a) / Chapter 6 / Chapter 7 / Summary)

Chapter 8 Deep Learning

This chapter covers deep neural networks, that is, networks with many layers.

8.1 Making the network deeper

Using what we have learned so far, we take on the implementation of MNIST handwritten digit recognition with a deep network. Unfortunately, this chapter is hard going because the book provides no explanation of the source code at all.

I skipped implementing Dropout and Adam when they came up in earlier chapters, but since they are needed this time, I will start by cleaning that up.

(1) Implementation of Dropout layer

The Dropout layer is explained in "6.4.3 Dropout" of the book, so I implemented it while referring to that section.

dropout.py


# coding: utf-8
import numpy as np


class Dropout:
    def __init__(self, dropout_ratio=0.5):
        """Dropout layer

        Args:
            dropout_ratio (float): Ratio of neurons dropped during training, default 0.5.
        """
        self.dropout_ratio = dropout_ratio              # ratio of neurons dropped during training
        self.valid_ratio = 1.0 - self.dropout_ratio     # ratio of neurons kept during training
        self.mask = None                                # array of flags marking which neurons are dropped

    def forward(self, x, train_flg=True):
        """Forward propagation

        Args:
            x (numpy.ndarray): Input
            train_flg (bool, optional): True during training, default is True.

        Returns:
            numpy.ndarray: Output
        """
        if train_flg:
            # Generate the mask that decides which neurons to drop during training
            self.mask = np.random.rand(*x.shape) > self.dropout_ratio

            # Calculate the output
            return x * self.mask

        else:
            # No neurons are dropped during inference, but the output is scaled
            # by the ratio of neurons that were kept during training.
            return x * self.valid_ratio

    def backward(self, dout):
        """Backpropagation

        Args:
            dout (numpy.ndarray): Derivative propagated from the downstream layer

        Returns:
            numpy.ndarray: Derivative (gradient)
        """
        # Propagate the downstream derivative only through the neurons that were not dropped
        assert self.mask is not None, 'backward was called without a prior forward pass'
        return dout * self.mask
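
To check the behavior, here is a small usage sketch of this Dropout class (my own addition, not from the book): in training mode roughly half of the inputs are zeroed out, while in inference mode the whole input is simply scaled by the retention ratio.


import numpy as np
from dropout import Dropout

layer = Dropout(dropout_ratio=0.5)
x = np.ones((2, 5))

y_train = layer.forward(x, train_flg=True)    # random mask zeroes out roughly half the values
dx = layer.backward(np.ones_like(x))          # the gradient flows only through the surviving neurons
y_test = layer.forward(x, train_flg=False)    # every value is scaled by 1 - dropout_ratio = 0.5
print(y_train)
print(dx)
print(y_test)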

(2) Implementation of Adam

Adam, used for optimization, is only briefly explained in "6.1.6 Adam" of the book, and that alone is not enough to implement it. I also couldn't follow the algorithm well just by reading the book's source code. So I first got a rough grasp of how it works from @omiita's article "[2020 definitive edition] Super easy-to-understand optimization algorithms: from the loss function to Adam and Newton's method". Then I implemented it while following the explanation of "Algorithm 1" on page 2 of the original paper introduced in the book (reference [8], Adam: A Method for Stochastic Optimization; the PDF can be downloaded from the upper right of https://arxiv.org/abs/1412.6980). It is in English, but the explanation is about 20 lines of pseudocode, so even I, who am not good at English, could manage. I also followed the values recommended in the paper for the initial parameter values.

adam.py


# coding: utf-8
import numpy as np


class Adam:

    def __init__(self, alpha=0.001, beta1=0.9, beta2=0.999):
        """Parameter optimization by Adam
        
        Args:
            alpha (float, optional):Learning factor, default 0.001。
            beta1 (float, optional):Coefficients of past and present velocity in Momentum, default 0.9。
            beta2 (float, optional):Past and present proportional division coefficient of learning coefficient in AdaGrad, default is 0.999。
        """
        self.alpha = alpha
        self.beta1 = beta1
        self.beta2 = beta2

        self.m = None   #Speed in Momentum
        self.v = None   #Learning factor in AdaGrad
        self.t = 0      #Time step

    def update(self, params, grads):
        """Parameter update
        
        Args:
            params (dict):The dictionary of parameters to be updated, key is'W1'、'b1'Such.
            grads (dict):Gradient dictionary corresponding to params
        """
        #Initialization of m and v
        if self.m is None:
            self.m = {}
            self.v = {}
            for key, val in params.items():
                self.m[key] = np.zeros_like(val)
                self.v[key] = np.zeros_like(val)

        # Update
        self.t += 1     # advance the time step
        for key in params.keys():

            # Update m: the velocity update as in Momentum,
            # blending past and current gradients with beta1 : (1 - beta1)
            self.m[key] = \
                self.beta1 * self.m[key] + (1 - self.beta1) * grads[key]

            # Update v: the per-parameter scaling as in AdaGrad,
            # blending past and current squared gradients with beta2 : (1 - beta2)
            self.v[key] = \
                self.beta2 * self.v[key] + (1 - self.beta2) * (grads[key] ** 2)

            # Bias-corrected estimates of m and v used for the parameter update
            hat_m = self.m[key] / (1.0 - self.beta1 ** self.t)
            hat_v = self.v[key] / (1.0 - self.beta2 ** self.t)

            # Parameter update; the final 1e-7 avoids division by zero
            params[key] -= self.alpha * hat_m / (np.sqrt(hat_v) + 1e-7)
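
Here is a minimal usage sketch of the Adam class above (my own addition): one update step on a toy parameter dictionary, updated in place.


import numpy as np
from adam import Adam

params = {'W1': np.array([[1.0, 2.0], [3.0, 4.0]]), 'b1': np.zeros(2)}
grads = {'W1': np.full((2, 2), 0.1), 'b1': np.full(2, 0.1)}

optimizer = Adam(alpha=0.001, beta1=0.9, beta2=0.999)
optimizer.update(params, grads)   # params are modified in place
print(params['W1'])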

(3) Calculation of output size of convolution layer and pooling layer

This network has many layers, so the output-size calculation for the convolution and pooling layers comes up again and again. I therefore added it to functions.py as the functions conv_output_size and pool_output_size. The other functions are unchanged from the previous chapter.

functions.py


# coding: utf-8
import numpy as np


def softmax(x):
    """Softmax function
    
    Args:
        x (numpy.ndarray):input
    
    Returns:
        numpy.ndarray:output
    """
    # For batch processing, x becomes a 2-D array of shape (batch size, 10).
    # In that case the calculation has to be done per image via broadcasting.
    # To handle both the 1-D and 2-D cases, np.max() and np.sum() are computed
    # with axis=-1 and keepdims=True so the result broadcasts back onto x.
    c = np.max(x, axis=-1, keepdims=True)
    exp_a = np.exp(x - c)  # guard against overflow
    sum_exp_a = np.sum(exp_a, axis=-1, keepdims=True)
    y = exp_a / sum_exp_a
    return y


def cross_entropy_error(y, t):
    """Calculation of cross entropy error
    
    Args:
        y (numpy.ndarray):Neural network output
        t (numpy.ndarray):Correct label
    
    Returns:
        float:Cross entropy error
    """

    # If there is only a single sample, reshape it into one row
    if y.ndim == 1:
        t = t.reshape(1, t.size)
        y = y.reshape(1, y.size)

    #Calculate the error and normalize by the number of batches
    batch_size = y.shape[0]
    return -np.sum(t * np.log(y + 1e-7)) / batch_size


def conv_output_size(input_size, filter_size, pad, stride):
    """Calculation of output size of convolution layer
    
    Args:
        input_size (int):The size of one side of the input (assuming the same value in the vertical and horizontal directions)
        filter_size (int):The size of one side of the filter (assuming the same value in the vertical and horizontal directions)
        pad (int):Padding size (assuming the same value in the vertical and horizontal directions)
        stride (int):Stride width (assuming the same value in the vertical and horizontal directions)
    
    Returns:
        int:The size of one side of the output
    """
    assert (input_size + 2 * pad - filter_size) \
        % stride == 0, 'The output size of the convolution layer is not divisible!'
    return int((input_size + 2 * pad - filter_size) / stride + 1)


def pool_output_size(input_size, pool_size, stride):
    """Calculation of output size of pooling layer
    
    Args:
        input_size (int):The size of one side of the input (assuming the same value in the vertical and horizontal directions)
        pool_size (int):Pooling window size (assuming the same value in height and width)
        stride (int):Stride width (assuming the same value in the vertical and horizontal directions)
    
    Returns:
        int:The size of one side of the output
    """
    assert (input_size - pool_size) % stride == 0, 'The output size of the pooling layer is not divisible!'
    return int((input_size - pool_size) / stride + 1)
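
As a quick sanity check (my own addition, not in the book), the two helpers reproduce the layer sizes of the network built in the next section.


from functions import conv_output_size, pool_output_size

size = conv_output_size(28, filter_size=3, pad=1, stride=1)    # 28 (Convolution #1, #2)
size = pool_output_size(size, pool_size=2, stride=2)           # 14 (Pooling #1)
size = conv_output_size(size, filter_size=3, pad=1, stride=1)  # 14 (Convolution #3)
size = conv_output_size(size, filter_size=3, pad=2, stride=1)  # 16 (Convolution #4)
size = pool_output_size(size, pool_size=2, stride=2)           # 8  (Pooling #2)
size = conv_output_size(size, filter_size=3, pad=1, stride=1)  # 8  (Convolution #5, #6)
size = pool_output_size(size, pool_size=2, stride=2)           # 4  (Pooling #3)
print(size)  # 4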

(4) Deep CNN implementation

Now that we have implemented the necessary parts, it is time to implement the network.

First, let me organize the input and output shapes at each layer of this network.

| Layer | Input/output shape | Shape in this implementation |
|:--|:--|:--|
| (Input) | (batch size $N$, number of channels $CH$, image height $H$, width $W$) | $(100, 1, 28, 28)$ |
| [1] Convolution #1 | (batch size $N$, number of filters $FN$, output height $OH$, width $OW$) | $(100, 16, 28, 28)$ |
| [2] ReLU #1 | (batch size $N$, number of filters $FN$, output height $OH$, width $OW$) | $(100, 16, 28, 28)$ |
| [3] Convolution #2 | (batch size $N$, number of filters $FN$, output height $OH$, width $OW$) | $(100, 16, 28, 28)$ |
| [4] ReLU #2 | (batch size $N$, number of filters $FN$, output height $OH$, width $OW$) | $(100, 16, 28, 28)$ |
| [5] Pooling #1 | (batch size $N$, number of filters $FN$, output height $OH$, width $OW$) | $(100, 16, 14, 14)$ |
| [6] Convolution #3 | (batch size $N$, number of filters $FN$, output height $OH$, width $OW$) | $(100, 32, 14, 14)$ |
| [7] ReLU #3 | (batch size $N$, number of filters $FN$, output height $OH$, width $OW$) | $(100, 32, 14, 14)$ |
| [8] Convolution #4 | (batch size $N$, number of filters $FN$, output height $OH$, width $OW$) | $(100, 32, 16, 16)$ |
| [9] ReLU #4 | (batch size $N$, number of filters $FN$, output height $OH$, width $OW$) | $(100, 32, 16, 16)$ |
| [10] Pooling #2 | (batch size $N$, number of filters $FN$, output height $OH$, width $OW$) | $(100, 32, 8, 8)$ |
| [11] Convolution #5 | (batch size $N$, number of filters $FN$, output height $OH$, width $OW$) | $(100, 64, 8, 8)$ |
| [12] ReLU #5 | (batch size $N$, number of filters $FN$, output height $OH$, width $OW$) | $(100, 64, 8, 8)$ |
| [13] Convolution #6 | (batch size $N$, number of filters $FN$, output height $OH$, width $OW$) | $(100, 64, 8, 8)$ |
| [14] ReLU #6 | (batch size $N$, number of filters $FN$, output height $OH$, width $OW$) | $(100, 64, 8, 8)$ |
| [15] Pooling #3 | (batch size $N$, number of filters $FN$, output height $OH$, width $OW$) | $(100, 64, 4, 4)$ |
| [16] Affine #1 | (batch size $N$, hidden layer size) | $(100, 50)$ |
| [17] ReLU #7 | (batch size $N$, hidden layer size) | $(100, 50)$ |
| [18] Dropout #1 | (batch size $N$, hidden layer size) | $(100, 50)$ |
| [19] Affine #2 | (batch size $N$, output size) | $(100, 10)$ |
| [20] Dropout #2 | (batch size $N$, output size) | $(100, 10)$ |
| [21] Softmax | (batch size $N$, final output size) | $(100, 10)$ |

It is quite a long table, but we will implement it one layer at a time.

The code in the book is organized compactly with loops, but the input/output size calculation for each layer is easy to get confused about, so I wrote out the parameter initialization and layer construction explicitly, one layer at a time. The result is rather long-winded code. The parameters are initialized with the "He initial value".
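
For reference, here is a minimal sketch of what the "He initial value" means, assuming the standard formulation (a Gaussian with standard deviation sqrt(2 / n), where n is the number of input connections feeding one node); this matches the np.random.normal calls in the code below.


import numpy as np

n = 1 * 3 ** 2                                 # Convolution #1: 1 input channel, 3x3 filter
W1 = np.random.normal(scale=np.sqrt(2.0 / n), size=(16, 1, 3, 3))
print(W1.std())                                # roughly sqrt(2 / 9), about 0.47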

deep_conv_net.py


# coding: utf-8
import numpy as np
from affine import Affine
from convolution import Convolution
from dropout import Dropout
from functions import conv_output_size, pool_output_size
from pooling import Pooling
from relu import ReLU
from softmax_with_loss import SoftmaxWithLoss


class DeepConvNet:

    def __init__(
        self, input_dim=(1, 28, 28),
        conv_param_1={
            'filter_num': 16, 'filter_size': 3, 'pad': 1, 'stride': 1
        },
        conv_param_2={
            'filter_num': 16, 'filter_size': 3, 'pad': 1, 'stride': 1
        },
        conv_param_3={
            'filter_num': 32, 'filter_size': 3, 'pad': 1, 'stride': 1
        },
        conv_param_4={
            'filter_num': 32, 'filter_size': 3, 'pad': 2, 'stride': 1
        },
        conv_param_5={
            'filter_num': 64, 'filter_size': 3, 'pad': 1, 'stride': 1
        },
        conv_param_6={
            'filter_num': 64, 'filter_size': 3, 'pad': 1, 'stride': 1
        },
        hidden_size=50, output_size=10
    ):
        """Deep convolutional neural network
        
        Args:
            input_dim (tuple, optional):Input data shape, default is(1, 28, 28)。
            conv_param_1 (dict, optional):Hyperparameters of convolution layer 1,
The default is{'filter_num':16, 'filter_size':3, 'pad':1, 'stride':1}。
            conv_param_2 (dict, optional):Hyperparameters of convolution layer 2,
The default is{'filter_num':16, 'filter_size':3, 'pad':1, 'stride':1}。
            conv_param_3 (dict, optional):Hyperparameters of convolution layer 3,
The default is{'filter_num':32, 'filter_size':3, 'pad':1, 'stride':1}。
            conv_param_4 (dict, optional):Hyperparameters of convolution layer 4,
The default is{'filter_num':32, 'filter_size':3, 'pad':2, 'stride':1}。
            conv_param_5 (dict, optional):Hyperparameters of convolution layer 5,
The default is{'filter_num':64, 'filter_size':3, 'pad':1, 'stride':1}。
            conv_param_6 (dict, optional):Hyperparameters of convolution layer 6,
The default is{'filter_num':64, 'filter_size':3, 'pad':1, 'stride':1}。
            hidden_size (int, optional):The number of neurons in the hidden layer, the default is 50.
            output_size (int, optional):The number of neurons in the output layer, the default is 10.
        """
        assert input_dim[1] == input_dim[2], 'Input data is assumed to have the same height and width!'

        # Parameter initialization and layer construction
        self.params = {}    # parameters
        self.layers = {}    # layers (OrderedDict is unnecessary because dicts keep insertion order since Python 3.7)

        # Input size
        channel_num = input_dim[0]                          # number of input channels
        input_size = input_dim[1]                           # size of one side of the input

        # [1]Convolution layer#1 :Parameter initialization and layer generation
        filter_num, filter_size, pad, stride = list(conv_param_1.values())
        pre_node_num = channel_num * (filter_size ** 2)     #Number of connected nodes in the previous layer for one node
        key_w, key_b = 'W1', 'b1'                           #Key when storing the dictionary
        self.params[key_w] = np.random.normal(
            scale=np.sqrt(2.0 / pre_node_num),              #Standard deviation of the initial value of He
            size=(filter_num, channel_num, filter_size, filter_size)
        )
        self.params[key_b] = np.zeros(filter_num)

        self.layers['Conv1'] = Convolution(
            self.params[key_w], self.params[key_b], stride, pad
        )

        #Input size calculation for the next layer
        channel_num = filter_num
        input_size = conv_output_size(input_size, filter_size, pad, stride)

        # [2]ReLU layer#1 :Layer generation
        self.layers['ReLU1'] = ReLU()
   
        # [3]Convolution layer#2 :Parameter initialization and layer generation
        filter_num, filter_size, pad, stride = list(conv_param_2.values())
        pre_node_num = channel_num * (filter_size ** 2)     #Number of connected nodes in the previous layer for one node
        key_w, key_b = 'W2', 'b2'                           #Key when storing the dictionary
        self.params[key_w] = np.random.normal(
            scale=np.sqrt(2.0 / pre_node_num),              #Standard deviation of the initial value of He
            size=(filter_num, channel_num, filter_size, filter_size)
        )
        self.params[key_b] = np.zeros(filter_num)

        self.layers['Conv2'] = Convolution(
            self.params[key_w], self.params[key_b], stride, pad
        )

        #Input size calculation for the next layer
        channel_num = filter_num
        input_size = conv_output_size(input_size, filter_size, pad, stride)

        # [4]ReLU layer#2 :Layer generation
        self.layers['ReLU2'] = ReLU()
        
        # [5]Pooling layer#1 :Layer generation
        self.layers['Pool1'] = Pooling(pool_h=2, pool_w=2, stride=2)
        
        #Input size calculation for the next layer
        input_size = pool_output_size(input_size, pool_size=2, stride=2)

        # [6]Convolution layer#3 :Parameter initialization and layer generation
        filter_num, filter_size, pad, stride = list(conv_param_3.values())
        pre_node_num = channel_num * (filter_size ** 2)     #Number of connected nodes in the previous layer for one node
        key_w, key_b = 'W3', 'b3'                           #Key when storing the dictionary
        self.params[key_w] = np.random.normal(
            scale=np.sqrt(2.0 / pre_node_num),              #Standard deviation of the initial value of He
            size=(filter_num, channel_num, filter_size, filter_size)
        )
        self.params[key_b] = np.zeros(filter_num)

        self.layers['Conv3'] = Convolution(
            self.params[key_w], self.params[key_b], stride, pad
        )

        #Input size calculation for the next layer
        channel_num = filter_num
        input_size = conv_output_size(input_size, filter_size, pad, stride)

        # [7]ReLU layer#3 :Layer generation
        self.layers['ReLU3'] = ReLU()
   
        # [8]Convolution layer#4 :Parameter initialization and layer generation
        filter_num, filter_size, pad, stride = list(conv_param_4.values())
        pre_node_num = channel_num * (filter_size ** 2)     #Number of connected nodes in the previous layer for one node
        key_w, key_b = 'W4', 'b4'                           #Key when storing the dictionary
        self.params[key_w] = np.random.normal(
            scale=np.sqrt(2.0 / pre_node_num),              #Standard deviation of the initial value of He
            size=(filter_num, channel_num, filter_size, filter_size)
        )
        self.params[key_b] = np.zeros(filter_num)
        
        self.layers['Conv4'] = Convolution(
            self.params[key_w], self.params[key_b], stride, pad
        )

        #Input size calculation for the next layer
        channel_num = filter_num
        input_size = conv_output_size(input_size, filter_size, pad, stride)

        # [9]ReLU layer#4 :Layer generation
        self.layers['ReLU4'] = ReLU()
        
        # [10]Pooling layer#2 :Layer generation
        self.layers['Pool2'] = Pooling(pool_h=2, pool_w=2, stride=2)
        
        #Input size calculation for the next layer
        input_size = pool_output_size(input_size, pool_size=2, stride=2)

        # [11]Convolution layer#5 :Parameter initialization and layer generation
        filter_num, filter_size, pad, stride = list(conv_param_5.values())
        pre_node_num = channel_num * (filter_size ** 2)     #Number of connected nodes in the previous layer for one node
        key_w, key_b = 'W5', 'b5'                           #Key when storing the dictionary
        self.params[key_w] = np.random.normal(
            scale=np.sqrt(2.0 / pre_node_num),              #Standard deviation of the initial value of He
            size=(filter_num, channel_num, filter_size, filter_size)
        )
        self.params[key_b] = np.zeros(filter_num)

        self.layers['Conv5'] = Convolution(
            self.params[key_w], self.params[key_b], stride, pad
        )

        #Input size calculation for the next layer
        channel_num = filter_num
        input_size = conv_output_size(input_size, filter_size, pad, stride)

        # [12]ReLU layer#5 :Layer generation
        self.layers['ReLU5'] = ReLU()
   
        # [13]Convolution layer#6 :Parameter initialization and layer generation
        filter_num, filter_size, pad, stride = list(conv_param_6.values())
        pre_node_num = channel_num * (filter_size ** 2)     #Number of connected nodes in the previous layer for one node
        key_w, key_b = 'W6', 'b6'                           #Key when storing the dictionary
        self.params[key_w] = np.random.normal(
            scale=np.sqrt(2.0 / pre_node_num),              #Standard deviation of the initial value of He
            size=(filter_num, channel_num, filter_size, filter_size)
        )
        self.params[key_b] = np.zeros(filter_num)
        
        self.layers['Conv6'] = Convolution(
            self.params[key_w], self.params[key_b], stride, pad
        )

        #Input size calculation for the next layer
        channel_num = filter_num
        input_size = conv_output_size(input_size, filter_size, pad, stride)

        # [14]ReLU layer#6 :Layer generation
        self.layers['ReLU6'] = ReLU()
        
        # [15]Pooling layer#3 :Layer generation
        self.layers['Pool3'] = Pooling(pool_h=2, pool_w=2, stride=2)
        
        #Input size calculation for the next layer
        input_size = pool_output_size(input_size, pool_size=2, stride=2)

        # [16]Affine layer#1 :Parameter initialization and layer generation
        pre_node_num = channel_num * (input_size ** 2)      #Number of connected nodes in the previous layer for one node
        key_w, key_b = 'W7', 'b7'                           #Key when storing the dictionary
        self.params[key_w] = np.random.normal(
            scale=np.sqrt(2.0 / pre_node_num),              #Standard deviation of the initial value of He
            size=(channel_num * (input_size ** 2), hidden_size)
        )
        self.params[key_b] = np.zeros(hidden_size)

        self.layers['Affine1'] = Affine(self.params[key_w], self.params[key_b])
 
        #Input size calculation for the next layer
        input_size = hidden_size

        # [17]ReLU layer#7 :Layer generation
        self.layers['ReLU7'] = ReLU()

        # [18]Dropout layer#1 :Layer generation
        self.layers['Drop1'] = Dropout(dropout_ratio=0.5)

        # [19]Affine layer#2 :Parameter initialization and layer generation
        pre_node_num = input_size                           #Number of connected nodes in the previous layer for one node
        key_w, key_b = 'W8', 'b8'                           #Key when storing the dictionary
        self.params[key_w] = np.random.normal(
            scale=np.sqrt(2.0 / pre_node_num),              #Standard deviation of the initial value of He
            size=(input_size, output_size)
        )
        self.params[key_b] = np.zeros(output_size)

        self.layers['Affine2'] = Affine(self.params[key_w], self.params[key_b])

        # [20]Dropout layer#2 :Layer generation
        self.layers['Drop2'] = Dropout(dropout_ratio=0.5)

        # [21]Softmax layer:Layer generation
        self.lastLayer = SoftmaxWithLoss()

    def predict(self, x, train_flg=False):
        """Inference by neural network
        
        Args:
            x (numpy.ndarray):Input to neural network
            train_flg (bool): True during training (neurons are dropped in the Dropout layers)
        
        Returns:
            numpy.ndarray:Neural network output
        """
        #Propagate layers forward
        for layer in self.layers.values():
            if isinstance(layer, Dropout):
                x = layer.forward(x, train_flg)  #In the case of the Dropout layer, tell if you are learning
            else:
                x = layer.forward(x)
        return x

    def loss(self, x, t):
        """Loss function value calculation
        
        Args:
            x (numpy.ndarray):Input to neural network
            t (numpy.ndarray):Correct label

        Returns:
            float:Loss function value
        """
        # Inference
        y = self.predict(x, True)   # the loss is only computed during training, so train_flg is always True

        # Compute the loss by forward propagation through the Softmax-with-Loss layer
        loss = self.lastLayer.forward(y, t)

        return loss

    def accuracy(self, x, t, batch_size=100):
        """Recognition accuracy calculation
        batch_size is the batch size at the time of calculation. When trying to calculate a large amount of data at once
Because im2col eats too much memory and thrashing occurs and it does not work
To avoid that.

        Args:
            x (numpy.ndarray):Input to neural network
            t (numpy.ndarray):Correct label (one-hot)
            batch_size (int), optional):Batch size at the time of calculation, default is 100.
        
        Returns:
            float:Recognition accuracy
        """
        #Calculation of the number of divisions
        batch_num = max(int(x.shape[0] / batch_size), 1)

        #Split
        x_list = np.array_split(x, batch_num, 0)
        t_list = np.array_split(t, batch_num, 0)

        #Process in divided units
        correct_num = 0  #Total number of correct answers
        for (sub_x, sub_t) in zip(x_list, t_list):
            assert sub_x.shape[0] == sub_t.shape[0], 'Did the division boundary shift?'
            y = self.predict(sub_x, False)  # accuracy is never computed in training mode, so train_flg is always False
            y = np.argmax(y, axis=1)
            t = np.argmax(sub_t, axis=1)
            correct_num += np.sum(y == t)
        
        #Calculation of recognition accuracy
        return correct_num / x.shape[0]

    def gradient(self, x, t):
        """Gradient for weight parameters calculated by error backpropagation
        
         Args:
            x (numpy.ndarray):Input to neural network
            t (numpy.ndarray):Correct label
        
        Returns:
            dictionary:A dictionary containing gradients
        """
        #Forward propagation
        self.loss(x, t)     #Propagate forward to calculate loss value

        #Backpropagation
        dout = self.lastLayer.backward()
        for layer in reversed(list(self.layers.values())):
            dout = layer.backward(dout)

        #Extract the differential value of each layer
        grads = {}
        layer = self.layers['Conv1']
        grads['W1'], grads['b1'] = layer.dW, layer.db
        layer = self.layers['Conv2']
        grads['W2'], grads['b2'] = layer.dW, layer.db
        layer = self.layers['Conv3']
        grads['W3'], grads['b3'] = layer.dW, layer.db
        layer = self.layers['Conv4']
        grads['W4'], grads['b4'] = layer.dW, layer.db
        layer = self.layers['Conv5']
        grads['W5'], grads['b5'] = layer.dW, layer.db
        layer = self.layers['Conv6']
        grads['W6'], grads['b6'] = layer.dW, layer.db
        layer = self.layers['Affine1']
        grads['W7'], grads['b7'] = layer.dW, layer.db
        layer = self.layers['Affine2']
        grads['W8'], grads['b8'] = layer.dW, layer.db

        return grads
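
Before moving on to training, here is a small shape check I used (my own addition, not in the book). It assumes the layer classes from the previous chapters (Convolution, Pooling, Affine, ReLU, SoftmaxWithLoss) are available alongside deep_conv_net.py.


import numpy as np
from deep_conv_net import DeepConvNet

network = DeepConvNet()
x = np.random.rand(100, 1, 28, 28)          # a dummy batch of 100 MNIST-sized images
y = network.predict(x, train_flg=False)
print(y.shape)                              # expected: (100, 10), matching the table above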

(5) Implementation of learning

The training code is almost the same as in the previous chapter. I considered implementing a Trainer class like the one in the book's code, but since this is the last chapter and the last implementation, I left it as it is.

I increased the number of updates to 12,000 (20 epochs).

mnist.py


# coding: utf-8
import os
import sys
import matplotlib.pylab as plt
import numpy as np
from adam import Adam
from deep_conv_net import DeepConvNet
sys.path.append(os.pardir)  #Add parent directory to path
from dataset.mnist import load_mnist


#Read MNIST training data and test data
(x_train, t_train), (x_test, t_test) = \
    load_mnist(normalize=True, flatten=False, one_hot_label=True)

#Hyperparameter settings
iters_num = 12000           #Number of updates
batch_size = 100            #Batch size
adam_param_alpha = 0.001    #Adam parameters
adam_param_beta1 = 0.9      #Adam parameters
adam_param_beta2 = 0.999    #Adam parameters

train_size = x_train.shape[0]  #Training data size
iter_per_epoch = max(int(train_size / batch_size), 1)    #Number of iterations per epoch

#Deep convolutional neural network generation
network = DeepConvNet()

#Optimizer generation, using Adam
optimizer = Adam(adam_param_alpha, adam_param_beta1, adam_param_beta2)

#Confirmation of recognition accuracy before learning
train_acc = network.accuracy(x_train, t_train)
test_acc = network.accuracy(x_test, t_test)
train_loss_list = []            #Storage location of the transition of the value of the loss function
train_acc_list = [train_acc]    #Storage location of changes in recognition accuracy for training data
test_acc_list = [test_acc]      #Storage destination of transition of recognition accuracy for test data
print(f'Before learning[Training data recognition accuracy]{train_acc:.4f} [Test data recognition accuracy]{test_acc:.4f}')

#Start learning
for i in range(iters_num):

    #Mini batch generation
    batch_mask = np.random.choice(train_size, batch_size, replace=False)
    x_batch = x_train[batch_mask]
    t_batch = t_train[batch_mask]

    #Gradient calculation
    grads = network.gradient(x_batch, t_batch)

    #Weight parameter update
    optimizer.update(network.params, grads)
    
    #Loss function value calculation
    loss = network.loss(x_batch, t_batch)
    train_loss_list.append(loss)

    #Recognition accuracy calculation for each epoch
    if (i + 1) % iter_per_epoch == 0:
        train_acc = network.accuracy(x_train, t_train)
        test_acc = network.accuracy(x_test, t_test)
        train_acc_list.append(train_acc)
        test_acc_list.append(test_acc)

        #Progress display
        print(
            f'[epoch]{(i + 1) // iter_per_epoch:>2} '
            f'[Number of updates]{i + 1:>5} [Loss function value]{loss:.4f} '
            f'[Training data recognition accuracy]{train_acc:.4f} [Test data recognition accuracy]{test_acc:.4f}'
        )

#Draw the transition of the value of the loss function
x = np.arange(len(train_loss_list))
plt.plot(x, train_loss_list, label='loss')
plt.xlabel('iteration')
plt.ylabel('loss')
plt.xlim(left=0)
plt.ylim(0, 2.5)
plt.show()

#Draw the transition of recognition accuracy of training data and test data
x2 = np.arange(len(train_acc_list))
plt.plot(x2, train_acc_list, label='train acc')
plt.plot(x2, test_acc_list, label='test acc', linestyle='--')
plt.xlabel('epochs')
plt.ylabel('accuracy')
plt.xlim(left=0)
plt.ylim(0, 1.0)
plt.legend(loc='lower right')
plt.show()

(6) Execution result

Below are the execution results. It took about half a day in my environment.

Before learning[Training data recognition accuracy]0.0975 [Test data recognition accuracy]0.0974
[epoch] 1 [Number of updates]  600 [Loss function value]1.0798 [Training data recognition accuracy]0.9798 [Test data recognition accuracy]0.9811
[epoch] 2 [Number of updates] 1200 [Loss function value]0.8792 [Training data recognition accuracy]0.9881 [Test data recognition accuracy]0.9872
[epoch] 3 [Number of updates] 1800 [Loss function value]0.9032 [Training data recognition accuracy]0.9884 [Test data recognition accuracy]0.9890
[epoch] 4 [Number of updates] 2400 [Loss function value]0.8012 [Training data recognition accuracy]0.9914 [Test data recognition accuracy]0.9906
[epoch] 5 [Number of updates] 3000 [Loss function value]0.9475 [Training data recognition accuracy]0.9932 [Test data recognition accuracy]0.9907
[epoch] 6 [Number of updates] 3600 [Loss function value]0.8105 [Training data recognition accuracy]0.9939 [Test data recognition accuracy]0.9910
[epoch] 7 [Number of updates] 4200 [Loss function value]0.8369 [Training data recognition accuracy]0.9920 [Test data recognition accuracy]0.9915
[epoch] 8 [Number of updates] 4800 [Loss function value]0.8727 [Training data recognition accuracy]0.9954 [Test data recognition accuracy]0.9939
[epoch] 9 [Number of updates] 5400 [Loss function value]0.9640 [Training data recognition accuracy]0.9958 [Test data recognition accuracy]0.9935
[epoch]10 [Number of updates] 6000 [Loss function value]0.8375 [Training data recognition accuracy]0.9953 [Test data recognition accuracy]0.9925
[epoch]11 [Number of updates] 6600 [Loss function value]0.8500 [Training data recognition accuracy]0.9955 [Test data recognition accuracy]0.9915
[epoch]12 [Number of updates] 7200 [Loss function value]0.7959 [Training data recognition accuracy]0.9966 [Test data recognition accuracy]0.9932
[epoch]13 [Number of updates] 7800 [Loss function value]0.7778 [Training data recognition accuracy]0.9946 [Test data recognition accuracy]0.9919
[epoch]14 [Number of updates] 8400 [Loss function value]0.9212 [Training data recognition accuracy]0.9973 [Test data recognition accuracy]0.9929
[epoch]15 [Number of updates] 9000 [Loss function value]0.9046 [Training data recognition accuracy]0.9974 [Test data recognition accuracy]0.9934
[epoch]16 [Number of updates] 9600 [Loss function value]0.9806 [Training data recognition accuracy]0.9970 [Test data recognition accuracy]0.9924
[epoch]17 [Number of updates]10200 [Loss function value]0.7837 [Training data recognition accuracy]0.9975 [Test data recognition accuracy]0.9931
[epoch]18 [Number of updates]10800 [Loss function value]0.8948 [Training data recognition accuracy]0.9976 [Test data recognition accuracy]0.9928
[epoch]19 [Number of updates]11400 [Loss function value]0.7936 [Training data recognition accuracy]0.9980 [Test data recognition accuracy]0.9932
[epoch]20 [Number of updates]12000 [Loss function value]0.8072 [Training data recognition accuracy]0.9984 [Test data recognition accuracy]0.9939

[Figures: transition of the loss function value, and transition of the recognition accuracy for the training and test data]

The final recognition accuracy was 99.39%. The CNN of the previous chapter reached 98.60%, so this is an improvement of 0.79 percentage points. The result gave me a real sense of the potential of making the layers deeper.

The loss function values are larger relative to the recognition accuracy than in the previous chapter, but I think this is because of Dropout. The recognition accuracy is computed with all neurons active, whereas the loss was computed with half of the neurons dropped (since I ran with a Dropout ratio of 0.5).

This is the last implementation in the book, but the chapter goes on to introduce methods for pushing recognition accuracy even further, such as ensemble learning and data augmentation. It also summarizes the benefits of making the layers deeper.

8.2 A short history of deep learning

This section introduces the trends in deep learning. In every case, I could see that the CNNs we have studied so far form the foundation.

8.3 Speeding up deep learning

This section explains how to speed up deep learning. What I found interesting is that half-precision floating point is attracting attention because single-precision floating point offers more accuracy than deep learning actually needs. I had never come across a half-precision floating-point type in the development languages I have used, but it turns out that NumPy provides one called float16.
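
As a small illustration (my own, not from the book), NumPy's float16 really does halve the memory footprint of float32, at the cost of some precision.


import numpy as np

a32 = np.random.rand(1000).astype(np.float32)
a16 = a32.astype(np.float16)
print(a32.nbytes, a16.nbytes)        # 4000 2000: half precision uses half the memory
print(np.max(np.abs(a32 - a16)))     # small rounding error introduced by the conversion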

8.4 Practical examples of deep learning

Object detection, segmentation, image captioning, and other fascinating applications have already been achieved. That said, with what I have learned so far I still cannot fully understand how they work.

8.5 The future of deep learning

This section introduces fields still under active research, such as image generation, autonomous driving, and reinforcement learning. It makes me feel the potential of deep learning.

8.6 Summary

I managed to finish the final implementation. I am relieved that I could reproduce the accuracy reported in the book, and I was also able to get a sense of the possibilities of deep learning.

That's all for this chapter. If you find any mistakes, I would be grateful if you could point them out.

(To other chapters of this memo: Chapter 1 / Chapter 2 / Chapter 3 / Chapter 4 / [Chapter 5](https://qiita.com/segavvy/items/8707e4e65aa7fa357d8a) / Chapter 6 / Chapter 7 / Summary)
