The theme of Chapter 7 is the convolutional neural network: **CNN**
A CNN, like the neural networks we have seen so far, is built by combining layers like Lego blocks. Two new layer types appear:
・"Convolution layer"
・"Pooling layer"
Typical features of a CNN:
・The processing flow is "Convolution - ReLU - (Pooling)"
・The Pooling layer is sometimes omitted
・The combination "Affine - ReLU" is used in layers close to the output
・The final output layer is the combination "Affine - Softmax"
The following terms also appear:
・Padding
・Stride
In addition, three-dimensional data appears.
The problem with fully connected layers is that **the structure of the data is "ignored"**.
For example, an image usually has a three-dimensional shape in the height, width, and channel directions. This shape contains important spatial information, for example:
・Spatially close pixels tend to have similar values
・The RGB channels are closely related to each other
・Pixels that are far apart have little relationship to each other
The three-dimensional shape contains essential patterns that should be picked up.
A fully connected layer ignores this shape and treats all inputs as equivalent neurons (neurons of the same dimension). A convolution layer, on the other hand, maintains the shape.
In a CNN, the input/output data of a convolution layer is called a **feature map**; the input data is sometimes called the **input feature map** and the output data the **output feature map**.
"Convolution operation" Equivalent to "filter processing" in image processing In some literature, the term "filter" is sometimes referred to as "kernel".
The parameters used for this filter correspond to the "weights" in the fully coupled neural network.
Calculation example
Operation with a bias
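Since the figures are not reproduced here, a minimal NumPy sketch of the same kind of computation (single channel, stride 1, no padding, with a bias) may help; the input, filter, and bias values below are made up for illustration.

```python
import numpy as np

# Toy example: 4x4 input, 3x3 filter, stride 1, no padding (values are made up)
x = np.array([[1, 2, 3, 0],
              [0, 1, 2, 3],
              [3, 0, 1, 2],
              [2, 3, 0, 1]])
w = np.array([[2, 0, 1],
              [0, 1, 2],
              [1, 0, 2]])
b = 3  # the bias is added to every output element

out = np.zeros((2, 2))  # output size: (4 - 3) + 1 = 2 in each direction
for i in range(2):
    for j in range(2):
        # Multiply the filter with the overlapping 3x3 window element-wise and sum
        out[i, j] = np.sum(x[i:i+3, j:j+3] * w) + b
print(out)
# [[18. 19.]
#  [ 9. 18.]]
```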
Padding: filling fixed values (e.g. 0) around the input data.
In the figure below, the input is padded with a border of zeros.
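Such a zero border can be added with np.pad; below is a small sketch (the 1-pixel width is just an example, not taken from the figure).

```python
import numpy as np

x = np.arange(9).reshape(3, 3)
# Pad the array with a 1-pixel-wide border of zeros on every side
x_padded = np.pad(x, pad_width=1, mode='constant', constant_values=0)
print(x_padded.shape)  # (5, 5)
```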
Stride: the interval between positions where the filter is applied.
With input size (H, W), filter size (FH, FW), output size (OH, OW), padding P, and stride S, the output size is given by:
OH = \frac{H + 2P - FH}{S} + 1\\
OW = \frac{W + 2P - FW}{S} + 1
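As a sanity check, the formula can be wrapped in a small helper. The function name conv_output_size is my own, written directly from the formula above; it is not taken from the book's code.

```python
def conv_output_size(H, W, FH, FW, pad=0, stride=1):
    """Return (OH, OW) for the given input size, filter size, padding, and stride."""
    OH = (H + 2 * pad - FH) // stride + 1
    OW = (W + 2 * pad - FW) // stride + 1
    return OH, OW

print(conv_output_size(7, 7, 3, 3))            # (5, 5)
print(conv_output_size(28, 28, 5, 5))          # (24, 24)
print(conv_output_size(28, 28, 5, 5, pad=2))   # (28, 28)
```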
The 3D convolution operation is easier to understand when the data and filters are pictured as rectangular blocks, as follows.
The above produces a single output feature map, in other words a feature map with one channel.
The following diagram shows the case with multiple output channels.
Adding the bias term looks like this.
When batch-processing N pieces of data, the same operation is applied; the data simply gains a batch dimension in its shape (see the shape sketch below).
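The shapes involved can be summarized as follows. This is a shape-only sketch (no actual convolution is computed), and the filter count 16 is an arbitrary example, not a value from the book.

```python
import numpy as np

N, C, H, W = 10, 3, 28, 28              # batch of 10 three-channel 28x28 inputs
FN, FH, FW = 16, 5, 5                   # 16 filters of spatial size 5x5 (example)
x = np.zeros((N, C, H, W))              # input data
filters = np.zeros((FN, C, FH, FW))     # filter weights
bias = np.zeros((FN, 1, 1))             # one bias per output channel

OH = (H - FH) + 1                       # output height with pad=0, stride=1
OW = (W - FW) + 1                       # output width with pad=0, stride=1
print((N, FN, OH, OW))                  # expected output shape: (10, 16, 24, 24)
```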
Pooling: an operation that reduces the vertical and horizontal spatial size.
In the figure below, the spatial size is reduced by aggregating each 2x2 region into a single element.
This example shows 2x2 max pooling with a stride of 2.
Max pooling: an operation that takes the maximum value in the region. In general, the pooling window size and the stride are set to the same value (a small NumPy example appears after the list of properties below).
Besides max pooling, there is also average pooling, which takes the average value in the region.
・ There are no parameters to learn
Since pooling is a process that only takes the maximum value (or average value) from the target, there are no parameters to learn.
・ The number of channels does not change
The number of channels of the input and output data is not changed by the pooling operation (OH and OW change, but the number of channels C does not).
・ Robust against minute changes in position
Pooling returns similar results for small shifts in the input data, so it is robust to slight positional deviations.
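Here is a minimal NumPy illustration of the 2x2, stride-2 max pooling described above, applied to a single channel; the values are made up.

```python
import numpy as np

x = np.array([[1, 2, 1, 0],
              [0, 3, 2, 4],
              [5, 0, 1, 2],
              [2, 4, 3, 1]])

# 2x2 max pooling with stride 2: take the max of each non-overlapping 2x2 block
out = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(out)
# [[3 4]
#  [5 3]]
```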
#Randomly generate data
x = np.random.rand(10,1,28,28)
x.shape
# (10, 1, 28, 28)
x[0].shape
# (1, 28, 28)
x[1].shape
# (1, 28, 28)
x[0, 0].shape  # x[0][0] also works
# (28, 28)
Implementing the convolution as shown in the previous figure requires nesting several for loops, and NumPy becomes slow when elements are accessed with for loops.
Instead, we implement it with a function called im2col, which expands the input data into a form suited to the filter.
In the figure, for ease of understanding, the example uses filter regions that do not overlap.
Advantages and disadvantages of im2col:
・Advantage: the computation reduces to a matrix product, so optimized linear algebra libraries can be used effectively
・Disadvantage: it consumes more memory than usual
#----------------------------------------------------
# Parameters
#   input_data : input data as a 4-dimensional array of (number of data, channels, height, width)
#   filter_h : filter height
#   filter_w : filter width
#   stride : stride
#   pad : padding
# Returns
#   col : a two-dimensional array (each row is one expanded filter region)
#----------------------------------------------------
def im2col(input_data, filter_h, filter_w, stride=1, pad=0):
N, C, H, W = input_data.shape
out_h = (H + 2*pad - filter_h)//stride + 1
out_w = (W + 2*pad - filter_w)//stride + 1
img = np.pad(input_data, [(0,0), (0,0), (pad, pad), (pad, pad)], 'constant')
col = np.zeros((N, C, filter_h, filter_w, out_h, out_w))
for y in range(filter_h):
y_max = y + stride*out_h
for x in range(filter_w):
x_max = x + stride*out_w
col[:, :, y, x, :, :] = img[:, :, y:y_max:stride, x:x_max:stride]
col = col.transpose(0, 4, 5, 1, 2, 3).reshape(N*out_h*out_w, -1)
return col
Trying out im2col:
import sys, os
sys.path.append(os.pardir)
from common.util import im2col
x1 = np.random.rand(1, 3, 7, 7)
col1 = im2col(x1, 5, 5, stride=1, pad=0)
print(col1.shape)
x2 = np.random.rand(10, 3, 7, 7)
col2 = im2col(x2, 5, 5, stride=1, pad=0)
print(col2.shape)
Result:
(9, 75)
(90, 75)
x1 is 7x7 data with batch size 1 and 3 channels; x2 is 7x7 data with batch size 10 and 3 channels.
In both cases, the second dimension has 75 elements, which is the total number of elements in a single filter region (3 channels x 5 x 5 = 75).
After expanding the data with im2col, all that remains is to expand each filter (weight) of the convolution layer into a single column and compute the matrix product of the two. This is almost the same as what we did in the Affine layer of the fully connected network.
class Convolution:
def __init__(self, W, b, stride=1, pad=0):
self.W = W
self.b = b
self.stride = stride
self.pad = pad
#Intermediate data (used during backward)
self.x = None
self.col = None
self.col_W = None
#Gradient of weight / bias parameters
self.dW = None
self.db = None
def forward(self, x):
FN, C, FH, FW = self.W.shape
N, C, H, W = x.shape
out_h = 1 + int((H + 2*self.pad - FH) / self.stride)
out_w = 1 + int((W + 2*self.pad - FW) / self.stride)
col = im2col(x, FH, FW, self.stride, self.pad)
        # Passing -1 to reshape makes NumPy infer that dimension so the total number of elements matches
col_W = self.W.reshape(FN, -1).T
out = np.dot(col, col_W) + self.b
        # Finally, rearrange the output into the appropriate shape:
        # reshape arranges the elements into the specified output shape,
        # transpose swaps the order of the axes
out = out.reshape(N, out_h, out_w, -1).transpose(0, 3, 1, 2)
self.x = x
self.col = col
self.col_W = col_W
return out
def backward(self, dout):
FN, C, FH, FW = self.W.shape
dout = dout.transpose(0,2,3,1).reshape(-1, FN)
        # The backward computation itself is the following lines and is the same as in the Affine layer;
        # the only difference is aligning the dimensions of the matrices.
self.db = np.sum(dout, axis=0)
self.dW = np.dot(self.col.T, dout)
self.dW = self.dW.transpose(1, 0).reshape(FN, C, FH, FW)
dcol = np.dot(dout, self.col_W.T)
#Reverse processing of im2col
dx = col2im(dcol, self.x.shape, FH, FW, self.stride, self.pad)
return dx
As with the Convolution layer, the Pooling layer is implemented by expanding the input data with im2col. The difference is that, for pooling, the expansion is independent in the channel direction.
class Pooling:
def __init__(self, pool_h, pool_w, stride=1, pad=0):
self.pool_h = pool_h
self.pool_w = pool_w
self.stride = stride
self.pad = pad
self.x = None
self.arg_max = None
def forward(self, x):
N, C, H, W = x.shape
out_h = int(1 + (H - self.pool_h) / self.stride)
out_w = int(1 + (W - self.pool_w) / self.stride)
col = im2col(x, self.pool_h, self.pool_w, self.stride, self.pad)
col = col.reshape(-1, self.pool_h*self.pool_w)
arg_max = np.argmax(col, axis=1)
out = np.max(col, axis=1)
out = out.reshape(N, out_h, out_w, C).transpose(0, 3, 1, 2)
self.x = x
self.arg_max = arg_max
return out
def backward(self, dout):
dout = dout.transpose(0, 2, 3, 1)
pool_size = self.pool_h * self.pool_w
dmax = np.zeros((dout.size, pool_size))
        # flatten collapses the array into a one-dimensional array
dmax[np.arange(self.arg_max.size), self.arg_max.flatten()] = dout.flatten()
dmax = dmax.reshape(dout.shape + (pool_size,))
dcol = dmax.reshape(dmax.shape[0] * dmax.shape[1] * dmax.shape[2], -1)
dx = col2im(dcol, self.x.shape, self.pool_h, self.pool_w, self.stride, self.pad)
return dx
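As a quick shape check, here is a snippet of my own (not from the book) that assumes the Convolution and Pooling classes above are defined and that im2col/col2im from common.util are importable.

```python
import numpy as np

x = np.random.rand(10, 3, 28, 28)    # batch of 10, 3 channels, 28x28
W = np.random.rand(16, 3, 5, 5)      # 16 filters of shape 3x5x5 (example count)
b = np.zeros(16)

conv = Convolution(W, b, stride=1, pad=0)
out = conv.forward(x)
print(out.shape)                     # (10, 16, 24, 24)

pool = Pooling(pool_h=2, pool_w=2, stride=2)
print(pool.forward(out).shape)       # (10, 16, 12, 12)
```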
# coding: utf-8
import sys, os
sys.path.append(os.pardir) #Settings for importing files in the parent directory
import pickle
import numpy as np
from collections import OrderedDict
from common.layers import *
from common.gradient import numerical_gradient
#Simple ConvNet
# conv - relu - pool - affine - relu - affine - softmax
class SimpleConvNet:
#----------------------------------------------------
# Parameters
    #   input_size : input size (784 for MNIST)
    #   hidden_size_list : list of the numbers of neurons in the hidden layers (e.g. [100, 100, 100])
    #   output_size : output size (10 for MNIST)
    #   activation : 'relu' or 'sigmoid'
    #   weight_init_std : standard deviation of the weights (e.g. 0.01)
    #       If 'relu' or 'he' is specified, the "He initial value" is used.
    #       If 'sigmoid' or 'xavier' is specified, the "Xavier initial value" is used.
#----------------------------------------------------
def __init__(self, input_dim=(1, 28, 28),
conv_param={'filter_num':30, 'filter_size':5, 'pad':0, 'stride':1},
hidden_size=100, output_size=10, weight_init_std=0.01):
#Initialization of weights, calculation of output size of convolution layer
filter_num = conv_param['filter_num']
filter_size = conv_param['filter_size']
filter_pad = conv_param['pad']
filter_stride = conv_param['stride']
input_size = input_dim[1]
conv_output_size = (input_size - filter_size + 2*filter_pad) / filter_stride + 1
pool_output_size = int(filter_num * (conv_output_size/2) * (conv_output_size/2))
#Weight initialization
self.params = {}
self.params['W1'] = weight_init_std * \
np.random.randn(filter_num, input_dim[0], filter_size, filter_size)
self.params['b1'] = np.zeros(filter_num)
self.params['W2'] = weight_init_std * \
np.random.randn(pool_output_size, hidden_size)
self.params['b2'] = np.zeros(hidden_size)
self.params['W3'] = weight_init_std * \
np.random.randn(hidden_size, output_size)
self.params['b3'] = np.zeros(output_size)
#Layer generation
self.layers = OrderedDict()
self.layers['Conv1'] = Convolution(self.params['W1'], self.params['b1'],
conv_param['stride'], conv_param['pad'])
self.layers['Relu1'] = Relu()
self.layers['Pool1'] = Pooling(pool_h=2, pool_w=2, stride=2)
self.layers['Affine1'] = Affine(self.params['W2'], self.params['b2'])
self.layers['Relu2'] = Relu()
self.layers['Affine2'] = Affine(self.params['W3'], self.params['b3'])
self.last_layer = SoftmaxWithLoss()
#Make inferences
def predict(self, x):
for layer in self.layers.values():
x = layer.forward(x)
return x
#Find the loss function
def loss(self, x, t):
"""Find the loss function
The argument x is the input data and t is the teacher label.
"""
y = self.predict(x)
return self.last_layer.forward(y, t)
def accuracy(self, x, t, batch_size=100):
if t.ndim != 1 : t = np.argmax(t, axis=1)
acc = 0.0
for i in range(int(x.shape[0] / batch_size)):
tx = x[i*batch_size:(i+1)*batch_size]
tt = t[i*batch_size:(i+1)*batch_size]
y = self.predict(tx)
y = np.argmax(y, axis=1)
acc += np.sum(y == tt)
return acc / x.shape[0]
def numerical_gradient(self, x, t):
"""Find the gradient (numerical differentiation)
Parameters
----------
x :Input data
t :Teacher label
Returns
-------
Dictionary variable with gradient for each layer
        grads['W1'], grads['W2'], ... are the weights of each layer
        grads['b1'], grads['b2'], ... are the biases of each layer
"""
loss_w = lambda w: self.loss(x, t)
grads = {}
for idx in (1, 2, 3):
grads['W' + str(idx)] = numerical_gradient(loss_w, self.params['W' + str(idx)])
grads['b' + str(idx)] = numerical_gradient(loss_w, self.params['b' + str(idx)])
return grads
def gradient(self, x, t):
"""Find the gradient (error backpropagation method)
Parameters
----------
x :Input data
t :Teacher label
Returns
-------
Dictionary variable with gradient for each layer
        grads['W1'], grads['W2'], ... are the weights of each layer
        grads['b1'], grads['b2'], ... are the biases of each layer
"""
# forward
self.loss(x, t)
# backward
dout = 1
dout = self.last_layer.backward(dout)
layers = list(self.layers.values())
layers.reverse()
for layer in layers:
dout = layer.backward(dout)
#Setting
grads = {}
grads['W1'], grads['b1'] = self.layers['Conv1'].dW, self.layers['Conv1'].db
grads['W2'], grads['b2'] = self.layers['Affine1'].dW, self.layers['Affine1'].db
grads['W3'], grads['b3'] = self.layers['Affine2'].dW, self.layers['Affine2'].db
return grads
def save_params(self, file_name="params.pkl"):
params = {}
for key, val in self.params.items():
params[key] = val
with open(file_name, 'wb') as f:
pickle.dump(params, f)
def load_params(self, file_name="params.pkl"):
with open(file_name, 'rb') as f:
params = pickle.load(f)
for key, val in params.items():
self.params[key] = val
for i, key in enumerate(['Conv1', 'Affine1', 'Affine2']):
self.layers[key].W = self.params['W' + str(i+1)]
self.layers[key].b = self.params['b' + str(i+1)]
The point is that the network can be built simply by adding layers and adjusting the hyperparameters used in the hidden layers.
Now run the training. On my MacBook Air the CPU load was heavy, so I uncommented the data-reduction lines below and ran with the reduced dataset.
# coding: utf-8
import sys, os
sys.path.append(os.pardir) #Settings for importing files in the parent directory
import numpy as np
import matplotlib.pyplot as plt
from dataset.mnist import load_mnist
from simple_convnet import SimpleConvNet
from common.trainer import Trainer
#Data reading
(x_train, t_train), (x_test, t_test) = load_mnist(flatten=False)
#Reduce data if processing takes time
#x_train, t_train = x_train[:5000], t_train[:5000]
#x_test, t_test = x_test[:1000], t_test[:1000]
max_epochs = 20
network = SimpleConvNet(input_dim=(1,28,28),
conv_param = {'filter_num': 30, 'filter_size': 5, 'pad': 0, 'stride': 1},
hidden_size=100, output_size=10, weight_init_std=0.01)
trainer = Trainer(network, x_train, t_train, x_test, t_test,
epochs=max_epochs, mini_batch_size=100,
optimizer='Adam', optimizer_param={'lr': 0.001},
evaluate_sample_num_per_epoch=1000)
trainer.train()
#Save parameters
network.save_params("params.pkl")
print("Saved Network Parameters!")
#Drawing a graph
markers = {'train': 'o', 'test': 's'}
x = np.arange(max_epochs)
plt.plot(x, trainer.train_acc_list, marker='o', label='train', markevery=2)
plt.plot(x, trainer.test_acc_list, marker='s', label='test', markevery=2)
plt.xlabel("epochs")
plt.ylabel("accuracy")
plt.ylim(0, 1.0)
plt.legend(loc='lower right')
plt.show()
……
train loss:0.0145554384445
train loss:0.0275851756417
train loss:0.00785021651885
train loss:0.00986611950473
=============== Final Test Accuracy ===============
test acc:0.956
Saved Network Parameters!
Before learning: the filters are initialized randomly, so there is no regularity in the black-and-white shading.
After learning: regular patterns appear.
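To check this yourself, the first-layer filters can be drawn with matplotlib. This is my own sketch, assuming the network trained above is in scope as `network` (its W1 has shape (30, 1, 5, 5)); the helper name show_filters is hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

def show_filters(filters, ncols=8):
    """Draw each (1, FH, FW) filter as a grayscale image."""
    FN = filters.shape[0]
    nrows = int(np.ceil(FN / ncols))
    fig = plt.figure()
    for i in range(FN):
        ax = fig.add_subplot(nrows, ncols, i + 1, xticks=[], yticks=[])
        ax.imshow(filters[i, 0], cmap=plt.cm.gray_r, interpolation='nearest')
    plt.show()

show_filters(network.params['W1'])  # 30 filters of shape (1, 5, 5)
```

Running it once right after constructing SimpleConvNet and once after training makes the before/after contrast described above visible.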
What are these regular filters "looking at"?
・**Edge**: a boundary where the color changes
・**Blob**: a locally blob-like region
The first convolution layer extracts low-level information such as edges and blobs; stacking convolution layers extracts more complex and abstract information.
DEMO 1 below is quoted from http://vision03.csail.mit.edu/cnn_art/index.html#v_single
In the demo, the layers responded as follows:
Conv1: edges and blobs (Edge + Blob)
Conv3: textures
Conv5: object parts
Fc8: object classes such as dogs and cats
Thus, as the layers get deeper, the neurons change from responding to simple shapes to responding to more "advanced" information. In other words, what the neurons react to shifts, step by step, toward the "meaning" of things.
This book covers the following:
・LeNet, the original CNN, first proposed in 1998
・AlexNet, from 2012, when deep learning first attracted wide attention
LeNet
Compared with current CNNs, the following points differ:
・The sigmoid function is used as the activation function (currently the ReLU function is used)
・The size of intermediate data is reduced by subsampling (currently max pooling is used)
http://dx.doi.org/10.1109/5.726791
AlexNet
AlexNet stacks convolution and pooling layers and finally outputs the result through fully connected layers. It differs from LeNet in the following points:
・The ReLU function is used as the activation function
・A local normalization layer called LRN (Local Response Normalization) is used
・Dropout is used
There is no big difference between LeNet and AlexNet in network architecture, but there have been major advances in computing technology. In particular:
・Large amounts of data can now be obtained by anyone
・GPUs specialized in massively parallel computation have become widespread, making it possible to perform huge amounts of computation at high speed