An amateur's notes on stumbling through "Deep Learning from scratch": Chapter 7

Introduction

This is a memo of the things I stumbled over while studying Chapter 7 of "Deep Learning from Scratch: The Theory and Implementation of Deep Learning Learned with Python".

The execution environment is macOS Mojave + Anaconda 2019.10, and the Python version is 3.7.4. For details, refer to Chapter 1 of this memo.

(To other chapters of this memo: Chapter 1 / Chapter 2 / Chapter 3 / Chapter 4 / [Chapter 5](https://qiita.com/segavvy/items/8707e4e65aa7fa357d8a) / Chapter 6 / Chapter 7 / Chapter 8 / Summary)

Chapter 7 Convolutional Neural Network

This chapter describes Convolutional Neural Networks (CNN).

7.1 Overall structure

In addition to the existing Affine layer, Softmax layer, and ReLU layer, the Convolution layer and Pooling layer will appear.

7.2 Convolution layer

The explanation of the convolution layer is easier to follow if you have a little background in image processing.

It says, "Images are usually three-dimensional shapes in the vertical, horizontal, and channel directions." However, since images are vertical and horizontal 2D data, isn't it 3D data with depth added? Some people may think that.

"Channel" here refers to information for each color such as RGB. For grayscale (black and white shades only) data such as MNIST, the density of one point can be expressed by one value, so one channel is sufficient, but in a color image, one point is red, green, and blue. Since it is expressed by the density of the three values of (RGB), three channels are required. The color information channels include not only RGB, but also CMYK, HSV, and transparency alpha. For details, go to "RGB CMYK" etc. and you will find a lot of explanations (although there are many stories that are a little closer to printing).

The word "filter" is also a technical term: in image processing it refers to processing used to extract only the necessary parts of an image (for example, contours) or to remove unnecessary information. If you are not familiar with it, it helps to first get an overview of convolution filters in image processing. @t-tkd3a's post with 3x3 convolution filter result images is recommended because it makes them easy to picture.
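
To get a feel for what such a filter does, here is a toy sketch of my own (not from the book): a 3x3 averaging filter slid over a small grayscale image with plain NumPy loops, which is exactly the kind of operation the Convolution layer generalizes.

import numpy as np

# 6x6 grayscale image with arbitrary pixel values
img = np.arange(36, dtype=float).reshape(6, 6)

# 3x3 averaging (smoothing) filter
kernel = np.ones((3, 3)) / 9.0

# Slide the filter over the image (no padding, stride 1) -> 4x4 output
out = np.zeros((4, 4))
for i in range(4):
    for j in range(4):
        out[i, j] = np.sum(img[i:i + 3, j:j + 3] * kernel)

print(out.shape)  # (4, 4)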

As an aside, this book seems to follow the rule of not adding the long-vowel mark to katakana words of three or more morae, such as "layer" (レイヤ). However, "filter" (フィルター) does have the long-vowel mark, so it may be an oversight in unification. Incidentally, when Microsoft switched its katakana long-vowel notation in 2008 [^1], I was developing packaged applications for Windows and was in charge of fixing the wording in programs, manuals, and so on, which was hard work. Before that, I was involved in removing half-width kana from GUIs back around Windows 98... Japanese really is inconvenient in this industry :sweat:

Let's get back to the story and move on.

7.3 Pooling layer

As for the pooling layer, I didn't have any particular stumbling blocks.

7.4 Implementation of Convolution / Pooling layer

The implementation of the Convolution and Pooling layers is short in code, but confusing because the shape of the data keeps changing through `im2col`, `numpy.ndarray.reshape`, and `numpy.ndarray.transpose`. I was lost at first, but I could follow it by referring to @daizutabi's "Deep Learning from scratch" Convolution / Pooling layer implementation.
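
Before reading the implementation, it helped me to check the shape change of `im2col` on its own with dummy data. A minimal sketch, assuming the book's repository layout (`common/util.py` in the parent directory):

# coding: utf-8
import os
import sys
import numpy as np
sys.path.append(os.pardir)  # Add parent directory to path
from common.util import im2col

# Dummy input: batch size 1, 3 channels, 7x7
x = np.random.rand(1, 3, 7, 7)

# 5x5 filter, stride 1, no padding -> OH = OW = 3
col = im2col(x, 5, 5, stride=1, pad=0)
print(col.shape)  # (1 * 3 * 3, 3 * 5 * 5) = (9, 75)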

First is the implementation of the Convolution layer. It has a lot of comments because I cannot keep track of what is going on unless I write down the shapes.

convolution.py


# coding: utf-8
import os
import sys
import numpy as np
sys.path.append(os.pardir)  #Add parent directory to path
from common.util import im2col, col2im


class Convolution:
    def __init__(self, W, b, stride=1, pad=0):
        """Convolution layer
        
        Args:
            W (numpy.ndarray):Filter (weight), shape(FN, C, FH, FW)。
            b (numpy.ndarray):Bias, shape(FN)。
            stride (int, optional):Stride, default is 1.
            pad (int, optional):Padding, default is 0.
        """
        self.W = W
        self.b = b
        self.stride = stride
        self.pad = pad

        self.dW = None      #Derivative value of weight
        self.db = None      #Derivative value of bias

        self.x = None       #Input for forward propagation required for back propagation
        self.col_x = None   #Input col expansion result at the time of forward propagation required for back propagation
        self.col_W = None   #Col expansion result of filter at the time of forward propagation required for back propagation

    def forward(self, x):
        """Forward propagation
        
        Args:
            x (numpy.ndarray): Input. The shape is (N, C, H, W).

        Returns:
            numpy.ndarray: Output. The shape is (N, FN, OH, OW).
        """
        FN, C, FH, FW = self.W.shape  # FN:Number of filters, C:Number of channels, FH:Filter height, FW:width
        N, x_C, H, W = x.shape        # N:Batch size, x_C:Number of channels, H: Height of input data, W:width
        assert C == x_C, f'Mismatch in the number of channels![C]{C}, [x_C]{x_C}'

        #Output size calculation
        assert (H + 2 * self.pad - FH) % self.stride == 0, 'OH is not divisible!'
        assert (W + 2 * self.pad - FW) % self.stride == 0, 'OW is not divisible!'
        OH = int((H + 2 * self.pad - FH) / self.stride + 1)
        OW = int((W + 2 * self.pad - FW) / self.stride + 1)

        #Expand input data
        # (N, C, H, W) → (N * OH * OW, C * FH * FW)
        col_x = im2col(x, FH, FW, self.stride, self.pad)

        #Expand filter
        # (FN, C, FH, FW) → (C * FH * FW, FN)
        col_W = self.W.reshape(FN, -1).T

        # Calculate output (the calculation with col_x, col_W, and b is exactly the same as in the Affine layer)
        # (N * OH * OW, C * FH * FW)・(C * FH * FW, FN) → (N * OH * OW, FN)
        out = np.dot(col_x, col_W) + self.b

        #Result shaping
        # (N * OH * OW, FN) → (N, OH, OW, FN) → (N, FN, OH, OW)
        out = out.reshape(N, OH, OW, FN).transpose(0, 3, 1, 2)

        #Save for backpropagation
        self.x = x
        self.col_x = col_x
        self.col_W = col_W

        return out

    def backward(self, dout):
        """Backpropagation
        
        Args:
            dout (numpy.ndarray): Derivative values propagated from the layer to the right, shape (N, FN, OH, OW).

        Returns:
            numpy.ndarray: Derivative values (gradient), shape (N, C, H, W).
        """
        FN, C, FH, FW = self.W.shape  #The shape of the differential value is the same as W(FN, C, FH, FW)

        #Expand the differential value from the right layer
        # (N, FN, OH, OW) → (N, OH, OW, FN) → (N * OH * OW, FN)
        dout = dout.transpose(0, 2, 3, 1).reshape(-1, FN)

        # Derivative calculation (the calculation with col_x, col_W, and b is exactly the same as in the Affine layer)
        dcol_x = np.dot(dout, self.col_W.T)     # → (N * OH * OW, C * FH * FW)
        self.dW = np.dot(self.col_x.T, dout)    # → (C * FH * FW, FN)
        self.db = np.sum(dout, axis=0)          # → (FN)

        #Formatting the derivative of the filter (weight)
        # (C * FH * FW, FN) → (FN, C * FH * FW) → (FN, C, FH, FW)
        self.dW = self.dW.transpose(1, 0).reshape(FN, C, FH, FW)

        #Forming the result (gradient)
        # (N * OH * OW, C * FH * FW) → (N, C, H, W)
        dx = col2im(dcol_x, self.x.shape, FH, FW, self.stride, self.pad)
    
        return dx
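
As a quick sanity check of the shapes, I ran the layer once with dummy data (my own sketch, assuming the same directory layout as convolution.py above):

import numpy as np
from convolution import Convolution

# 30 filters of shape (1, 5, 5), as in the network built later in this chapter
W = 0.01 * np.random.randn(30, 1, 5, 5)
b = np.zeros(30)
conv = Convolution(W, b, stride=1, pad=0)

x = np.random.rand(100, 1, 28, 28)   # batch of 100 MNIST-sized images
out = conv.forward(x)
print(out.shape)  # (100, 30, 24, 24)

dx = conv.backward(np.ones_like(out))
print(dx.shape)   # (100, 1, 28, 28)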

Next is the implementation of the Pooling layer. This is also full of comments.

pooling.py


# coding: utf-8
import os
import sys
import numpy as np
sys.path.append(os.pardir)  #Add parent directory to path
from common.util import im2col, col2im


class Pooling:
    def __init__(self, pool_h, pool_w, stride=1, pad=0):
        """Pooling layer
        
        Args:
            pool_h (int):Pooling area height
            pool_w (int):Pooling area width
            stride (int, optional):Stride, default is 1.
            pad (int, optional):Padding, default is 0.
        """
        self.pool_h = pool_h
        self.pool_w = pool_w
        self.stride = stride
        self.pad = pad

        self.x = None           #Input for forward propagation required for back propagation
        self.arg_max = None     # Position of the maximum in each row of col_x at forward propagation, needed for backpropagation

    def forward(self, x):
        """Forward propagation
        
        Args:
            x (numpy.ndarray): Input, shape (N, C, H, W).

        Returns:
            numpy.ndarray: Output, shape (N, C, OH, OW).
        """
        N, C, H, W = x.shape  # N:Number of data, C:Number of channels, H:Height, W:width

        #Output size calculation
        assert (H - self.pool_h) % self.stride == 0, 'OH is not divisible!'
        assert (W - self.pool_w) % self.stride == 0, 'OW is not divisible!'
        OH = int((H - self.pool_h) / self.stride + 1)
        OW = int((W - self.pool_w) / self.stride + 1)

        #Expand and format input data
        # (N, C, H, W) → (N * OH * OW, C * PH * PW)
        col_x = im2col(x, self.pool_h, self.pool_w, self.stride, self.pad)
        # (N * OH * OW, C * PH * PW) → (N * OH * OW * C, PH * PW)
        col_x = col_x.reshape(-1, self.pool_h * self.pool_w)

        #Calculate output
        # (N * OH * OW * C, PH * PW) → (N * OH * OW * C)
        out = np.max(col_x, axis=1)

        #Result shaping
        # (N * OH * OW * C) → (N, OH, OW, C) → (N, C, OH, OW)
        out = out.reshape(N, OH, OW, C).transpose(0, 3, 1, 2)

        #Save for backpropagation
        self.x = x
        self.arg_max = np.argmax(col_x, axis=1)  # Position (index) of the maximum in each row of col_x

        return out

    def backward(self, dout):
        """Backpropagation
        
        Args:
            dout (numpy.ndarray):The differential value and shape transmitted from the right layer(N, C, OH, OW)。
        
        Returns:
            numpy.ndarray:Derivative value (gradient), shape(N, C, H, W)。
        """
        #Shape the differential value from the right layer
        # (N, C, OH, OW) → (N, OH, OW, C)
        dout = dout.transpose(0, 2, 3, 1)

        #Initialize col for the resulting derivative with 0
        # (N * OH * OW * C, PH * PW)
        pool_size = self.pool_h * self.pool_w
        dcol_x = np.zeros((dout.size, pool_size))

        # Put the derivative from dout (dout as-is) only at the positions that were taken as the maximum during forward propagation.
        # Positions whose values were not selected during forward propagation keep their initial value of 0.
        # (Same idea as ReLU, which passes the gradient where x > 0 and blocks it where x <= 0.)
        assert dout.size == self.arg_max.size, 'Does not match the number of rows of col_x at forward propagation!'
        dcol_x[np.arange(self.arg_max.size), self.arg_max.flatten()] = \
            dout.flatten()

        #Formatting the derivative of the result 1
        # (N * OH * OW * C, PH * PW) → (N, OH, OW, C, PH * PW)
        dcol_x = dcol_x.reshape(dout.shape + (pool_size,))  # the trailing ',' makes (pool_size,) a one-element tuple

        #Formatting the derivative of the result 2
        # (N, OH, OW, C, PH * PW) → (N * OH * OW, C * PH * PW)
        dcol_x = dcol_x.reshape(
            dcol_x.shape[0] * dcol_x.shape[1] * dcol_x.shape[2], -1
        )

        #Formatting the derivative of the result 3
        # (N * OH * OW, C * PH * PW) → (N, C, H, W)
        dx = col2im(
            dcol_x, self.x.shape, self.pool_h, self.pool_w, self.stride, self.pad
        )

        return dx
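
The behavior of 2x2 max pooling with stride 2 is easy to see on a tiny input (my own toy example, same directory layout as above):

import numpy as np
from pooling import Pooling

# 1 sample, 1 channel, 4x4 input with values 0..15
x = np.arange(16, dtype=float).reshape(1, 1, 4, 4)

pool = Pooling(pool_h=2, pool_w=2, stride=2, pad=0)
out = pool.forward(x)
print(out)  # [[[[ 5.  7.]
            #    [13. 15.]]]] : the maximum of each 2x2 block

dx = pool.backward(np.ones_like(out))
print(dx.shape)  # (1, 1, 4, 4), with the gradient only at the max positions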

7.5 CNN implementation

Implement CNN by combining the previous implementations.

(1) Implementation of each layer

First, I will organize the input and output in this network.

| Layer | Input/output shape | Shape in this implementation |
|:--|:--|:--|
| (Input) | (batch size $N$, number of channels $CH$, image height $H$, width $W$) | (100, 1, 28, 28) |
| :one: Convolution | (batch size $N$, number of filters $FN$, output height $OH$, width $OW$) | (100, 30, 24, 24) |
| :two: ReLU | (batch size $N$, number of filters $FN$, output height $OH$, width $OW$) | (100, 30, 24, 24) |
| :three: Pooling | (batch size $N$, number of filters $FN$, output height $OH$, width $OW$) | (100, 30, 12, 12) |
| :four: Affine | (batch size $N$, hidden layer size) | (100, 100) |
| :five: ReLU | (batch size $N$, hidden layer size) | (100, 100) |
| :six: Affine | (batch size $N$, final output size) | (100, 10) |
| :seven: Softmax | (batch size $N$, final output size) | (100, 10) |

The implementations of the Convolution layer and the Pooling layer are as described above.

The Affine layer requires some modifications to the previous implementation. When it was implemented previously in [5.6.2 Batch version of the Affine layer](https://qiita.com/segavvy/items/8707e4e65aa7fa357d8a#562-%E3%83%90%E3%83%83%E3%83%81%E7%89%88affine%E3%83%AC%E3%82%A4%E3%83%A4), the input was two-dimensional (batch size $N$, image size), but this time the input to the fourth layer, the Affine layer, is four-dimensional (batch size $N$, number of filters $FN$, pooling result $OH$, $OW$), so this has to be handled. Page 152 of the book has the note that "the implementation of Affine in common/layers.py is an implementation that considers the case where the input data is a tensor (4-dimensional data)". At the time I did not understand why that note was there, but it turns out it was meant to be used here.
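
The core of the change is simply to flatten everything after the batch axis before the matrix product, and to restore the shape on the way back. A quick check of my own with the shapes from the table above (dummy data):

import numpy as np

# Output of the Pooling layer: (N, FN, OH, OW) = (100, 30, 12, 12)
x = np.random.rand(100, 30, 12, 12)

# Flatten everything except the batch axis before the Affine calculation
x2d = x.reshape(x.shape[0], -1)
print(x2d.shape)  # (100, 4320)

# Backpropagation restores the original shape
dx = x2d.reshape(*x.shape)
print(dx.shape)   # (100, 30, 12, 12)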

The following is an implementation of the Affine layer that supports 3D or higher input.

affine.py


# coding: utf-8
import numpy as np


class Affine:

    def __init__(self, W, b):
        """Affine layer
        
        Args:
            W (numpy.ndarray):weight
            b (numpy.ndarray):bias
        """
        self.W = W                      #weight
        self.b = b                      #bias
        self.x = None                   # Input (after flattening to 2D)
        self.dW = None                  #Derivative value of weight
        self.db = None                  #Derivative value of bias
        self.original_x_shape = None    #Original input shape (for input of 3D or more)

    def forward(self, x):
        """Forward propagation
        
        Args:
            x (numpy.ndarray):input
            
        Returns:
            numpy.ndarray:output
        """
        # Flatten input with 3 or more dimensions (a tensor) to 2D
        self.original_x_shape = x.shape  # Save the shape so it can be restored in backpropagation
        x = x.reshape(x.shape[0], -1)
        self.x = x

        #Calculate output
        out = np.dot(x, self.W) + self.b

        return out

    def backward(self, dout):
        """Backpropagation
        
        Args:
            dout (numpy.ndarray):Derivative value transmitted from the right layer

        Returns:
            numpy.ndarray:Derivative value
        """
        #Derivative value calculation
        dx = np.dot(dout, self.W.T)
        self.dW = np.dot(self.x.T, dout)
        self.db = np.sum(dout, axis=0)

        #Return to the original shape
        dx = dx.reshape(*self.original_x_shape)
        return dx

The ReLU and Softmax layers are the same as in the previous implementation, but will be reprinted.

relu.py


# coding: utf-8


class ReLU:
    def __init__(self):
        """ReLU layer
        """
        self.mask = None

    def forward(self, x):
        """Forward propagation
        
        Args:
            x (numpy.ndarray):input
            
        Returns:
            numpy.ndarray:output
        """
        self.mask = (x <= 0)
        out = x.copy()
        out[self.mask] = 0

        return out

    def backward(self, dout):
        """Backpropagation
        
        Args:
            dout (numpy.ndarray):Derivative value transmitted from the right layer
        
        Returns:
            numpy.ndarray:Derivative value
        """
        dout[self.mask] = 0
        dx = dout

        return dx

softmax_with_loss.py


# coding: utf-8
from functions import softmax, cross_entropy_error


class SoftmaxWithLoss:
    def __init__(self):
        """Softmax-with-Loss layer
        """
        self.loss = None    #loss
        self.y = None       #softmax output
        self.t = None       #Teacher data (one-hot vector)

    def forward(self, x, t):
        """Forward propagation
        
        Args:
            x (numpy.ndarray):input
            t (numpy.ndarray):Teacher data

        Returns:
            float:Cross entropy error
        """
        self.t = t
        self.y = softmax(x)
        self.loss = cross_entropy_error(self.y, self.t)

        return self.loss

    def backward(self, dout=1):
        """Backpropagation
        
        Args:
            dout (float, optional):Derivative value transmitted from the right layer. The default is 1.

        Returns:
            numpy.ndarray:Derivative value
        """
        batch_size = self.t.shape[0]    #Number of batches
        dx = (self.y - self.t) * (dout / batch_size)

        return dx

The functions required by the Softmax layer are also reprinted from before. Functions that are not used this time have been removed.

functions.py


# coding: utf-8
import numpy as np


def softmax(x):
    """Softmax function
    
    Args:
        x (numpy.ndarray):input
    
    Returns:
        numpy.ndarray:output
    """
    # When batch processing, x is a two-dimensional array of shape (batch size, 10).
    # In that case the calculation has to be done per image, using broadcasting.
    # Here np.max() and np.sum() are computed with axis=-1 so that the same code works for both 1D and 2D input,
    # and keepdims=True keeps the dimensions so that broadcasting works as-is.
    c = np.max(x, axis=-1, keepdims=True)
    exp_a = np.exp(x - c)  #Overflow measures
    sum_exp_a = np.sum(exp_a, axis=-1, keepdims=True)
    y = exp_a / sum_exp_a
    return y


def cross_entropy_error(y, t):
    """Calculation of cross entropy error
    
    Args:
        y (numpy.ndarray):Neural network output
        t (numpy.ndarray):Correct label
    
    Returns:
        float:Cross entropy error
    """

    # If there is only one sample, reshape it into a single row
    if y.ndim == 1:
        t = t.reshape(1, t.size)
        y = y.reshape(1, y.size)

    #Calculate the error and normalize by the number of batches
    batch_size = y.shape[0]
    return -np.sum(t * np.log(y + 1e-7)) / batch_size
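
A quick sanity check of these two functions on a small batch (my own example, run from the same directory as functions.py):

import numpy as np
from functions import softmax, cross_entropy_error

# Scores for a batch of 2 samples with 3 classes each
x = np.array([[0.3, 2.9, 4.0],
              [0.1, 0.2, 0.7]])
y = softmax(x)
print(np.sum(y, axis=1))  # each row sums to (approximately) 1

# One-hot labels
t = np.array([[0.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
print(cross_entropy_error(y, t))  # average cross entropy over the batch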

(2) Implementation of optimizer

For the parameter optimizer, I had only read [6.1 Updating parameters](https://qiita.com/segavvy/items/ca4ac4c9ee1a126bff41#61-%E3%83%91%E3%83%A9%E3%83%A1%E3%83%BC%E3%82%BF%E3%81%AE%E6%9B%B4%E6%96%B0) and skipped the implementation, so this time I decided to implement AdaGrad. It is almost the same as the code in the book.
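
For reference, the update rule implemented below is the one from Chapter 6 of the book (my own restatement, with the small constant from the code included): AdaGrad accumulates the square of every gradient in $h$ and divides the learning rate by its square root, where the $10^{-7}$ avoids division by zero.

$$
h \leftarrow h + \frac{\partial L}{\partial W} \odot \frac{\partial L}{\partial W},
\qquad
W \leftarrow W - \eta \, \frac{1}{\sqrt{h} + 10^{-7}} \, \frac{\partial L}{\partial W}
$$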

ada_grad.py


# coding: utf-8
import numpy as np


class AdaGrad:

    def __init__(self, lr=0.01):
        """Parameter optimization with AdaGrad
        
        Args:
            lr (float, optional): Learning rate, default is 0.01.
        """
        self.lr = lr
        self.h = None   #Sum of squares of the gradient so far

    def update(self, params, grads):
        """Parameter update
        
        Args:
            params (dict): Dictionary of the parameters to update; keys are 'W1', 'b1', etc.
            grads (dict):Gradient dictionary corresponding to params
        """

        #initialization of h
        if self.h is None:
            self.h = {}
            for key, val in params.items():
                self.h[key] = np.zeros_like(val)

        #update
        for key in params.keys():

            #h update
            self.h[key] += grads[key] ** 2

            #Parameter update, last 1e-7 avoids division by zero
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)
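
A minimal usage sketch with a dummy parameter and gradient (just to show the calling convention used later from the training script):

import numpy as np
from ada_grad import AdaGrad

params = {'W1': np.array([1.0, 2.0, 3.0])}
grads = {'W1': np.array([0.1, 0.1, 0.1])}

optimizer = AdaGrad(lr=0.1)
optimizer.update(params, grads)  # params['W1'] is updated in place
print(params['W1'])              # roughly [0.9 1.9 2.9]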

(3) Implementation of CNN

The CNN is implemented following the book's instructions, based on the TwoLayerNet made previously in [5.7.2 Implementation of a neural network supporting backpropagation](https://qiita.com/segavvy/items/8707e4e65aa7fa357d8a#572-%E8%AA%A4%E5%B7%AE%E9%80%86%E4%BC%9D%E6%92%AD%E6%B3%95%E3%81%AB%E5%AF%BE%E5%BF%9C%E3%81%97%E3%81%9F%E3%83%8B%E3%83%A5%E3%83%BC%E3%83%A9%E3%83%AB%E3%83%8D%E3%83%83%E3%83%88%E3%83%AF%E3%83%BC%E3%82%AF%E3%81%AE%E5%AE%9F%E8%A3%85).

The code in the book uses `OrderedDict`, but as last time, a plain `dict` is used here, because from Python 3.7 the insertion order of `dict` objects is preserved [^2]. Also, I stumbled on the implementation of `accuracy`, which I will explain later.

Below is the implementation of CNN.

simple_conv_net.py


# coding: utf-8
import numpy as np
from affine import Affine
from convolution import Convolution
from pooling import Pooling
from relu import ReLU
from softmax_with_loss import SoftmaxWithLoss


class SimpleConvNet:

    def __init__(
        self, input_dim=(1, 28, 28),
        conv_param={'filter_num': 30, 'filter_size': 5, 'pad': 0, 'stride': 1},
        hidden_size=100, output_size=10, weight_init_std=0.01
    ):
        """Simple convolutional neural network
        
        Args:
            input_dim (tuple, optional): Input data shape, default is (1, 28, 28).
            conv_param (dict, optional): Hyperparameters of the convolution layer,
                the default is {'filter_num': 30, 'filter_size': 5, 'pad': 0, 'stride': 1}.
            hidden_size (int, optional): Number of neurons in the hidden layer, default is 100.
            output_size (int, optional): Number of neurons in the output layer, default is 10.
            weight_init_std (float, optional): Scaling parameter for the initial weights, default is 0.01.
        """
        #Extract hyperparameters of convolution layer
        filter_num = conv_param['filter_num']       #Number of filters
        filter_size = conv_param['filter_size']     #Filter size (same height and width)
        filter_stride = conv_param['stride']        #stride
        filter_pad = conv_param['pad']              #Padding
        
        #The hyperparameters of the pooling layer are fixed
        pool_size = 2                               #Size (same height and width)
        pool_stride = 2                             #stride
        pool_pad = 0                                #Padding

        #Input data size calculation
        input_ch = input_dim[0]                     #Number of input data channels
        assert input_dim[1] == input_dim[2], 'Input data is assumed to have the same height and width!'
        input_size = input_dim[1]                   #Input data size
        
        #Calculation of output size of convolution layer
        assert (input_size + 2 * filter_pad - filter_size) \
            % filter_stride == 0, 'The output size of the convolution layer is not divisible!'
        conv_output_size = int(
            (input_size + 2 * filter_pad - filter_size) / filter_stride + 1
        )

        #Calculation of output size of pooling layer
        assert (conv_output_size - pool_size) % pool_stride == 0, \
            'The output size of the pooling layer is not divisible!'
        pool_output_size_one = int(
            (conv_output_size - pool_size) / pool_stride + 1  #Height / width size
        )
        pool_output_size = filter_num * \
            pool_output_size_one * pool_output_size_one     #Total size of all filters

        #Weight initialization
        self.params = {}
        #Convolution layer
        self.params['W1'] = weight_init_std * \
            np.random.randn(filter_num, input_ch, filter_size, filter_size)
        self.params['b1'] = np.zeros(filter_num)
        #Affine layer 1
        self.params['W2'] = weight_init_std * \
            np.random.randn(pool_output_size, hidden_size)
        self.params['b2'] = np.zeros(hidden_size)
        #Affine layer 2
        self.params['W3'] = weight_init_std * \
            np.random.randn(hidden_size, output_size)
        self.params['b3'] = np.zeros(output_size)
            
        #Layer generation
        self.layers = {}    # OrderedDict is unnecessary because dict preserves insertion order from Python 3.7
        #Convolution layer
        self.layers['Conv1'] = Convolution(
            self.params['W1'], self.params['b1'], filter_stride, filter_pad
        )
        self.layers['Relu1'] = ReLU()
        self.layers['Pool1'] = Pooling(
            pool_size, pool_size, pool_stride, pool_pad
        )
        #Affine layer 1
        self.layers['Affine1'] = \
            Affine(self.params['W2'], self.params['b2'])
        self.layers['Relu2'] = ReLU()
        #Affine layer 2
        self.layers['Affine2'] = \
            Affine(self.params['W3'], self.params['b3'])
    
        self.lastLayer = SoftmaxWithLoss()

    def predict(self, x):
        """Inference by neural network
        
        Args:
            x (numpy.ndarray):Input to neural network
        
        Returns:
            numpy.ndarray:Neural network output
        """
        #Propagate layers forward
        for layer in self.layers.values():
            x = layer.forward(x)

        return x

    def loss(self, x, t):
        """Loss function value calculation
        
        Args:
            x (numpy.ndarray):Input to neural network
            t (numpy.ndarray):Correct label

        Returns:
            float:Loss function value
        """
        #inference
        y = self.predict(x)

        # Softmax-with-Calculated by forward propagation of Loss layer
        loss = self.lastLayer.forward(y, t)

        return loss

    def accuracy(self, x, t, batch_size=100):
        """Recognition accuracy calculation
        batch_size is the batch size used during the calculation. If a large amount of
        data is processed at once, im2col consumes too much memory and thrashing makes
        it unusable, so the data is split up to avoid that.

        Args:
            x (numpy.ndarray):Input to neural network
            t (numpy.ndarray):Correct label (one-hot)
            batch_size (int, optional): Batch size used during the calculation, default is 100.
        
        Returns:
            float:Recognition accuracy
        """
        #Calculation of the number of divisions
        batch_num = max(int(x.shape[0] / batch_size), 1)

        #Split
        x_list = np.array_split(x, batch_num, 0)
        t_list = np.array_split(t, batch_num, 0)

        #Process in divided units
        correct_num = 0  #Total number of correct answers
        for (sub_x, sub_t) in zip(x_list, t_list):
            assert sub_x.shape[0] == sub_t.shape[0], 'Did the division boundary shift?'
            y = self.predict(sub_x)
            y = np.argmax(y, axis=1)
            t = np.argmax(sub_t, axis=1)
            correct_num += np.sum(y == t)
        
        #Calculation of recognition accuracy
        return correct_num / x.shape[0]

    def gradient(self, x, t):
        """Gradient for weight parameters calculated by error backpropagation
        
         Args:
            x (numpy.ndarray):Input to neural network
            t (numpy.ndarray):Correct label
        
        Returns:
            dictionary:A dictionary containing gradients
        """
        #Forward propagation
        self.loss(x, t)     #Propagate forward to calculate loss value

        #Backpropagation
        dout = self.lastLayer.backward()
        for layer in reversed(list(self.layers.values())):
            dout = layer.backward(dout)

        #Extract the differential value of each layer
        grads = {}
        grads['W1'] = self.layers['Conv1'].dW
        grads['b1'] = self.layers['Conv1'].db
        grads['W2'] = self.layers['Affine1'].dW
        grads['b2'] = self.layers['Affine1'].db
        grads['W3'] = self.layers['Affine2'].dW
        grads['b3'] = self.layers['Affine2'].db

        return grads

The stumbling block in this implementation is `accuracy`, which is omitted in the book.

During training, the recognition accuracy is calculated once per epoch, but in the code written in Chapter 4, all 60,000 training images were fed in at once to compute it. If I do the same thing this time, however, the expansion by `im2col` consumes an enormous amount of memory, and my VM with 4 GB of memory grinds to a halt with thrashing [^3] :sweat:
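
A rough estimate (my own arithmetic, assuming float64) shows why: expanding all 60,000 images at once with `im2col` for the 5x5 filter gives a matrix of shape $(60000 \times 24 \times 24,\ 1 \times 5 \times 5)$, roughly $8.6 \times 10^8$ elements, which is about 7 GB at 8 bytes per element, well over the 4 GB of my VM before even counting the other intermediate arrays.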

Yet the book's source works fine in my environment with far less memory. That seemed odd, so I traced the source and found that it splits the data and processes it in chunks internally. So I imitated that and split the data internally as well, using `numpy.array_split` for the splitting.
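
The behavior of `numpy.array_split` used for the splitting, checked with dummy data of the same size:

import numpy as np

x = np.zeros((60000, 1, 28, 28))
t = np.zeros((60000, 10))

batch_num = max(int(x.shape[0] / 100), 1)  # 600 splits of 100 samples each
x_list = np.array_split(x, batch_num, 0)
t_list = np.array_split(t, batch_num, 0)

print(len(x_list), x_list[0].shape)  # 600 (100, 1, 28, 28)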

(4) Implementation of learning

The training is implemented based on the previous [5.7.4 Learning using the error backpropagation method](https://qiita.com/segavvy/items/8707e4e65aa7fa357d8a#574-%E8%AA%A4%E5%B7%AE%E9%80%86%E4%BC%9D%E6%92%AD%E6%B3%95%E3%82%92%E4%BD%BF%E3%81%A3%E3%81%9F%E5%AD%A6%E7%BF%92). Some points:

- Unlike last time, the input images this time are (1, 28, 28), so `flatten=False` must be specified when reading the MNIST data with `load_mnist`.
- The hyperparameter `learning_rate` was reduced for AdaGrad and set to 0.06 after a few trials.
- The number of updates was set to 6,000 (10 epochs), because the recognition accuracy on the test data stabilizes relatively quickly.
- In the previous source, the displayed number of updates was off by one, and the first recognition accuracy shown was not the value before any update but the value after one update, so this has been corrected.

Below is the implementation of learning.

mnist.py


# coding: utf-8
import os
import sys
import matplotlib.pylab as plt
import numpy as np
from ada_grad import AdaGrad
from simple_conv_net import SimpleConvNet
sys.path.append(os.pardir)  #Add parent directory to path
from dataset.mnist import load_mnist


#Read MNIST training data and test data
(x_train, t_train), (x_test, t_test) = \
    load_mnist(normalize=True, flatten=False, one_hot_label=True)

#Hyperparameter settings
iters_num = 6000        #Number of updates
batch_size = 100        #Batch size
learning_rate = 0.06    # Learning rate (assuming AdaGrad)

train_size = x_train.shape[0]  #Training data size
iter_per_epoch = max(int(train_size / batch_size), 1)    #Number of iterations per epoch

#Simple convolutional neural network generation
network = SimpleConvNet(
    input_dim=(1, 28, 28),
    conv_param={'filter_num': 30, 'filter_size': 5, 'pad': 0, 'stride': 1},
    hidden_size=100, output_size=10, weight_init_std=0.01
)

#Optimizer generation
optimizer = AdaGrad(learning_rate)   # AdaGrad

#Confirmation of recognition accuracy before learning
train_acc = network.accuracy(x_train, t_train)
test_acc = network.accuracy(x_test, t_test)
train_loss_list = []            #Storage location of the transition of the value of the loss function
train_acc_list = [train_acc]    #Storage location of changes in recognition accuracy for training data
test_acc_list = [test_acc]      #Storage destination of transition of recognition accuracy for test data
print(f'Before learning[Training data recognition accuracy]{train_acc:.4f} [Test data recognition accuracy]{test_acc:.4f}')

#Start learning
for i in range(iters_num):

    #Mini batch generation
    batch_mask = np.random.choice(train_size, batch_size, replace=False)
    x_batch = x_train[batch_mask]
    t_batch = t_train[batch_mask]

    #Gradient calculation
    grads = network.gradient(x_batch, t_batch)

    #Weight parameter update
    optimizer.update(network.params, grads)
    
    #Loss function value calculation
    loss = network.loss(x_batch, t_batch)
    train_loss_list.append(loss)

    #Recognition accuracy calculation for each epoch
    if (i + 1) % iter_per_epoch == 0:
        train_acc = network.accuracy(x_train, t_train)
        test_acc = network.accuracy(x_test, t_test)
        train_acc_list.append(train_acc)
        test_acc_list.append(test_acc)

        #Progress display
        print(
            f'[epoch]{(i + 1) // iter_per_epoch:>2} '
            f'[Number of updates]{i + 1:>5} [Loss function value]{loss:.4f} '
            f'[Training data recognition accuracy]{train_acc:.4f} [Test data recognition accuracy]{test_acc:.4f}'
        )

#Draw the transition of the value of the loss function
x = np.arange(len(train_loss_list))
plt.plot(x, train_loss_list, label='loss')
plt.xlabel('iteration')
plt.ylabel('loss')
plt.xlim(left=0)
plt.ylim(0, 2.5)
plt.show()

#Draw the transition of recognition accuracy of training data and test data
x2 = np.arange(len(train_acc_list))
plt.plot(x2, train_acc_list, label='train acc')
plt.plot(x2, test_acc_list, label='test acc', linestyle='--')
plt.xlabel('epochs')
plt.ylabel('accuracy')
plt.xlim(left=0)
plt.ylim(0, 1.0)
plt.legend(loc='lower right')
plt.show()

(5) Execution result

Below are the execution results. It took about an hour in my environment.

Before learning[Training data recognition accuracy]0.0909 [Test data recognition accuracy]0.0909
[epoch] 1 [Number of updates]  600 [Loss function value]0.0699 [Training data recognition accuracy]0.9784 [Test data recognition accuracy]0.9780
[epoch] 2 [Number of updates] 1200 [Loss function value]0.0400 [Training data recognition accuracy]0.9844 [Test data recognition accuracy]0.9810
[epoch] 3 [Number of updates] 1800 [Loss function value]0.0362 [Training data recognition accuracy]0.9885 [Test data recognition accuracy]0.9853
[epoch] 4 [Number of updates] 2400 [Loss function value]0.0088 [Training data recognition accuracy]0.9907 [Test data recognition accuracy]0.9844
[epoch] 5 [Number of updates] 3000 [Loss function value]0.0052 [Training data recognition accuracy]0.9926 [Test data recognition accuracy]0.9851
[epoch] 6 [Number of updates] 3600 [Loss function value]0.0089 [Training data recognition accuracy]0.9932 [Test data recognition accuracy]0.9850
[epoch] 7 [Number of updates] 4200 [Loss function value]0.0029 [Training data recognition accuracy]0.9944 [Test data recognition accuracy]0.9865
[epoch] 8 [Number of updates] 4800 [Loss function value]0.0023 [Training data recognition accuracy]0.9954 [Test data recognition accuracy]0.9873
[epoch] 9 [Number of updates] 5400 [Loss function value]0.0051 [Training data recognition accuracy]0.9959 [Test data recognition accuracy]0.9860
[epoch]10 [Number of updates] 6000 [Loss function value]0.0037 [Training data recognition accuracy]0.9972 [Test data recognition accuracy]0.9860

(Figures: the transition of the loss function value, and the transition of recognition accuracy for the training and test data.) As a result, the recognition accuracy was 99.72% for the training data and 98.60% for the test data. After only one epoch it had already exceeded the previous recognition accuracy. Since the recognition accuracy on the test data has barely changed since around epoch 7, everything after that may have just been overfitting. Even so, reaching 98.60% with a simple CNN is impressive.

I also tried running the book's source, but for some reason the calculation of the recognition accuracy for each epoch was very fast. Puzzled, I looked into it and found that sampling is possible with the `evaluate_sample_num_per_epoch` parameter of the `Trainer` class, and both the training and test accuracy were being calculated with only the first 1,000 images. Unfair! :unamused:

7.6 CNN visualization

It's amazing that the necessary filters such as edge and blob extraction are automatically created. It is very interesting that the level of abstraction increases as the layers are layered.

7.7 Typical CNN

It is said that big data and GPUs are making a big contribution to the development of deep learning, but I think that the spread of the cloud, which makes it possible to use huge machine resources at low cost, is also a big point.

As a complete digression, the book mentions that LeNet was proposed in 1998, 20 years ago, which made me feel sentimental; or rather, to me 1998 feels much more recent. I don't want to get old :sweat:

7.8 Summary

It was a bit difficult to implement, but it helped me understand CNN. That's all for this chapter. If you have any mistakes, I would be grateful if you could point them out.

(To other chapters of this memo: Chapter 1 / Chapter 2 / Chapter 3 / Chapter 4 / [Chapter 5](https://qiita.com/segavvy/items/8707e4e65aa7fa357d8a) / Chapter 6 / Chapter 7 / Chapter 8 / Summary)

[^1]: [Changes in the long vowel notation at the end of foreign words and katakana terms in Microsoft products and services](https://web.archive.org/web/20130228002415/http://www.microsoft.com/japan/presspass/detail.aspx?newsid=3491) (Since the page from that time no longer exists, this links to the Internet Archive's Wayback Machine; the link from Wikipedia's "Chōonpu" article is also a Wayback Machine link.)

[^2]: See "Improvement of Python's Data Model" in What's New In Python 3.7.

[^3]: Thrashing is a phenomenon that occurs when memory runs short, and it is troublesome because the entire OS can become unresponsive. If you are interested in how the OS manages memory, please also see my earlier post Introduction to Memory Management for Everyone: 01! :grin:
