"Deep Learning from scratch" Self-study memo (No. 11) CNN

While reading "Deep Learning from scratch" (written by Koki Saito, published by O'Reilly Japan), I am making notes on the sites I referred to. Part 10 ←

Chapter 7, on convolutional neural networks (CNNs), looks quite different from what we did up to Chapter 6. It may look as if many different things are going on, but in the end we still compute and store the gradients of the weights and biases. In other words, the basic principle has not changed at all; what has changed is the input data.

P207 If the input data is an image, it usually has a 3D shape: height, width, and channel. However, when it is fed to a fully connected layer, the 3D data must be flattened into 1D data. In fact, in the previous examples using the MNIST dataset, the input image had the shape (1, 28, 28), that is, 1 channel, 28 pixels high, and 28 pixels wide, but it was lined up into a single row of 784 values and fed to the first Affine layer. ・・・ The convolution layer, on the other hand, preserves the shape. For an image, it receives the input as 3D data and likewise outputs 3D data to the next layer. As a result, a CNN can (potentially) correctly understand data that has a shape, such as images.

In fact, in Self-study memo No. 6-2, when I processed Kaggle's cat and dog dataset, I converted the 3D data to one dimension before using it. If the data can be processed in three dimensions as it is, the recognition rate may improve.

Convolution layer, padding, stride

These explanations are not difficult at all, and I could follow them as written. But then this formula suddenly appears on P212. What is it? Is it really true? So I thought it through.

OH = \frac{H + 2P - FH}{S} + 1

OW = \frac{W + 2P - FW}{S} + 1

For the time being, let's consider the case without a stride (that is, S = 1) and check a few input and filter sizes.

When the input size is (n, n) and the filter size is (m, m), the output size appears to be (n - m + 1, n - m + 1). If you place the filter in the upper left corner, you can shift it to the right (n - m) times and down (n - m) times. Adding 1 for the starting position in the upper left corner gives n - m + 1.

So what happens with a stride? When the stride is 2, the number of shifts to the right is halved to (n - m) / 2. When it is 3, it becomes one third.

In other words, the number of times you can shift is (n - m) / S, so the output size is (n - m) / S + 1.

Assuming that the input data size is (H, W), the padding is P, and the filter size is (FH, FW), we have n = H + 2P and m = FH in the vertical direction, and likewise n = W + 2P and m = FW in the horizontal direction. So the output size is

OH = \frac{H + 2P - FH}{S} + 1

OW = \frac{W + 2P - FW}{S} + 1
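The reasoning above can be checked numerically. The following is only a minimal sketch: the helper names conv_output_size and count_positions are made up for this memo (they are not from the book), and floor division is assumed when the stride does not divide evenly, as im2col does. It compares the formula with a direct count of the positions a filter can take along one axis.

```python
def conv_output_size(input_size, filter_size, pad=0, stride=1):
    """Formula from P212 for one axis, using floor division (an assumption)."""
    return (input_size + 2 * pad - filter_size) // stride + 1


def count_positions(input_size, filter_size, pad=0, stride=1):
    """Count the filter positions by sliding along the padded axis."""
    n = input_size + 2 * pad
    # Valid start offsets are 0, stride, 2*stride, ... up to n - filter_size
    return len(range(0, n - filter_size + 1, stride))


print(conv_output_size(28, 5, pad=0, stride=1))  # 24
print(conv_output_size(8, 4, pad=0, stride=2))   # 3

# The formula and the direct count agree for a range of settings
for n in range(5, 12):
    for m in range(1, 5):
        for p in range(0, 3):
            for s in range(1, 4):
                assert conv_output_size(n, m, p, s) == count_positions(n, m, p, s)
```

The first call corresponds to the MNIST setting used below (28x28 input, 5x5 filter, no padding, stride 1).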

Learning and testing MNIST data

From P230, the book describes the SimpleConvNet class as an example for training on MNIST data. Let's train with this class.

import sys, os
sys.path.append(os.pardir)  #Settings for importing files in the parent directory
import numpy as np

from dataset.mnist import load_mnist
from common.simple_convnet import SimpleConvNet
from common.trainer import Trainer

#Data reading
(x_train, t_train), (x_test, t_test) = load_mnist(flatten=False)

max_epochs = 20
network = SimpleConvNet(input_dim=(1,28,28), 
                        conv_param = {'filter_num': 30, 'filter_size': 5, 'pad': 0, 'stride': 1},
                        hidden_size=100, output_size=10, weight_init_std=0.01)
trainer = Trainer(network, x_train, t_train, x_test, t_test,
                  epochs=max_epochs, mini_batch_size=100,
                  optimizer='Adam', optimizer_param={'lr': 0.001},
                  evaluate_sample_num_per_epoch=1000, verbose=False)
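trainer.train()

# Save the trained parameters so that load_params("params.pkl") can read them later
network.save_params("params.pkl")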

Next, I checked the predictions on the test data.

import numpy as np
from common.simple_convnet import SimpleConvNet
from dataset.mnist import load_mnist
import pickle

import matplotlib.pyplot as plt

def showImg(x):
    example = x.reshape((28, 28))
    plt.figure()
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(example, cmap=plt.cm.binary)
    plt.show()
    return

# Evaluate with test data
(x_train, t_train), (x_test, t_test) = load_mnist(flatten=False)
x = x_test
t = t_test

network = SimpleConvNet(input_dim=(1,28,28), 
                        conv_param = {'filter_num': 30, 'filter_size': 5, 'pad': 0, 'stride': 1},
                        hidden_size=100, output_size=10, weight_init_std=0.01)
network.load_params("params.pkl")
    
y = network.predict(x)

accuracy_cnt = 0
for i in range(len(x)):
    p= np.argmax(y[i])
    #print(str(x[i]) + " : " + str(p))
    if p == t[i]:
        accuracy_cnt += 1
    else:
        print("Correct answer: " + str(t[i]) + "  Inference result: " + str(p))
        showImg(x[i])

print("Accuracy:" + str(float(accuracy_cnt) / len(x)))

The resulting accuracy is

Accuracy:0.988

The misclassified images looked like this (images omitted).

However, it took hours to train on the 60,000 training images. Furthermore, when I tried to process the test data after training, I ran into out-of-memory errors and could not proceed easily. Is 4 GB of memory too little for deep learning?
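If the memory problem comes from passing all 10,000 test images to predict at once, one possible workaround is to predict in small batches and join the results. The sketch below is an assumption on my part, not code from the book: predict_in_batches and dummy_predict are hypothetical names, and dummy_predict merely stands in for network.predict so that the example runs on its own.

```python
import numpy as np


def predict_in_batches(predict_fn, x, batch_size=100):
    """Call predict_fn on slices of x and concatenate the outputs.

    Only one batch of intermediate data exists at a time, so the peak
    memory depends on batch_size rather than on len(x).
    """
    outputs = []
    for start in range(0, len(x), batch_size):
        outputs.append(predict_fn(x[start:start + batch_size]))
    return np.concatenate(outputs, axis=0)


def dummy_predict(batch):
    """Stand-in for network.predict: returns 10 'scores' per image."""
    return batch.reshape(len(batch), -1)[:, :10]


x_demo = np.random.rand(250, 1, 28, 28)
y_demo = predict_in_batches(dummy_predict, x_demo, batch_size=100)
print(y_demo.shape)  # (250, 10)
```

With the real network, the call would presumably look like predict_in_batches(network.predict, x_test). The accuracy method of SimpleConvNet listed later slices the data in a similar way.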

For the time being, I was able to confirm that a CNN can be trained to high accuracy.

As usual, I would like to follow the contents of the program.

SimpleConvNet class

# coding: utf-8
import sys, os
sys.path.append(os.pardir)  #Settings for importing files in the parent directory
import pickle
import numpy as np
from collections import OrderedDict
from common.layers import *
from common.gradient import numerical_gradient


class SimpleConvNet:
    def __init__(self, input_dim=(1, 28, 28), 
                 conv_param={'filter_num':30, 'filter_size':5, 'pad':0, 'stride':1},
                 hidden_size=100, output_size=10, weight_init_std=0.01):
        filter_num = conv_param['filter_num']
        filter_size = conv_param['filter_size']
        filter_pad = conv_param['pad']
        filter_stride = conv_param['stride']
        input_size = input_dim[1]
        conv_output_size = (input_size - filter_size + 2*filter_pad) / filter_stride + 1
        pool_output_size = int(filter_num * (conv_output_size/2) * (conv_output_size/2))

        #Weight initialization
        self.params = {}
        self.params['W1'] = weight_init_std * \
                            np.random.randn(filter_num, input_dim[0], filter_size, filter_size)
        self.params['b1'] = np.zeros(filter_num)
        self.params['W2'] = weight_init_std * \
                            np.random.randn(pool_output_size, hidden_size)
        self.params['b2'] = np.zeros(hidden_size)
        self.params['W3'] = weight_init_std * \
                            np.random.randn(hidden_size, output_size)
        self.params['b3'] = np.zeros(output_size)

        #Layer generation
        self.layers = OrderedDict()
        self.layers['Conv1'] = Convolution(self.params['W1'], self.params['b1'],
                                           conv_param['stride'], conv_param['pad'])
        self.layers['Relu1'] = Relu()
        self.layers['Pool1'] = Pooling(pool_h=2, pool_w=2, stride=2)
        self.layers['Affine1'] = Affine(self.params['W2'], self.params['b2'])
        self.layers['Relu2'] = Relu()
        self.layers['Affine2'] = Affine(self.params['W3'], self.params['b3'])

        self.last_layer = SoftmaxWithLoss()

    def predict(self, x):
        for layer in self.layers.values():
            x = layer.forward(x)

        return x

    def loss(self, x, t):
        y = self.predict(x)
        return self.last_layer.forward(y, t)

    def accuracy(self, x, t, batch_size=100):
        if t.ndim != 1 : t = np.argmax(t, axis=1)
        
        acc = 0.0
        
        for i in range(int(x.shape[0] / batch_size)):
            tx = x[i*batch_size:(i+1)*batch_size]
            tt = t[i*batch_size:(i+1)*batch_size]
            y = self.predict(tx)
            y = np.argmax(y, axis=1)
            acc += np.sum(y == tt) 
        
        return acc / x.shape[0]

    def gradient(self, x, t):
        # forward
        self.loss(x, t)

        # backward
        dout = 1
        dout = self.last_layer.backward(dout)

        layers = list(self.layers.values())
        layers.reverse()
        for layer in layers:
            dout = layer.backward(dout)

        # Collect the gradients computed by each layer
        grads = {}
        grads['W1'], grads['b1'] = self.layers['Conv1'].dW, self.layers['Conv1'].db
        grads['W2'], grads['b2'] = self.layers['Affine1'].dW, self.layers['Affine1'].db
        grads['W3'], grads['b3'] = self.layers['Affine2'].dW, self.layers['Affine2'].db

        return grads
        
    def save_params(self, file_name="params.pkl"):
        params = {}
        for key, val in self.params.items():
            params[key] = val
        with open(file_name, 'wb') as f:
            pickle.dump(params, f)

    def load_params(self, file_name="params.pkl"):
        with open(file_name, 'rb') as f:
            params = pickle.load(f)
        for key, val in params.items():
            self.params[key] = val

        for i, key in enumerate(['Conv1', 'Affine1', 'Affine2']):
            self.layers[key].W = self.params['W' + str(i+1)]
            self.layers[key].b = self.params['b' + str(i+1)]

The only difference is which layers are stacked; otherwise it is not much different from the MultiLayerNet class.

        self.layers['Conv1'] = Convolution(self.params['W1'], self.params['b1'],
                                           conv_param['stride'], conv_param['pad'])

The Convolution class is also defined in layers.py

class Convolution:
    def __init__(self, W, b, stride=1, pad=0):
        self.W = W
        self.b = b
        self.stride = stride
        self.pad = pad
        
        #Intermediate data (used during backward)
        self.x = None   
        self.col = None
        self.col_W = None
        
        #Gradient of weight / bias parameters
        self.dW = None
        self.db = None

    def forward(self, x):
        FN, C, FH, FW = self.W.shape
        N, C, H, W = x.shape
        out_h = 1 + int((H + 2*self.pad - FH) / self.stride)
        out_w = 1 + int((W + 2*self.pad - FW) / self.stride)

        col = im2col(x, FH, FW, self.stride, self.pad)
        col_W = self.W.reshape(FN, -1).T

        out = np.dot(col, col_W) + self.b
        out = out.reshape(N, out_h, out_w, -1).transpose(0, 3, 1, 2)

        self.x = x
        self.col = col
        self.col_W = col_W

        return out

    def backward(self, dout):
        FN, C, FH, FW = self.W.shape
        dout = dout.transpose(0,2,3,1).reshape(-1, FN)

        self.db = np.sum(dout, axis=0)
        self.dW = np.dot(self.col.T, dout)
        self.dW = self.dW.transpose(1, 0).reshape(FN, C, FH, FW)

        dcol = np.dot(dout, self.col_W.T)
        dx = col2im(dcol, self.x.shape, FH, FW, self.stride, self.pad)

        return dx

im2col

The heart of this is the im2col function, which is defined in util.py.

def im2col(input_data, filter_h, filter_w, stride=1, pad=0):
    N, C, H, W = input_data.shape
    out_h = (H + 2*pad - filter_h)//stride + 1
    out_w = (W + 2*pad - filter_w)//stride + 1

    img = np.pad(input_data, [(0,0), (0,0), (pad, pad), (pad, pad)], 'constant')
    col = np.zeros((N, C, filter_h, filter_w, out_h, out_w))

    for y in range(filter_h):
        y_max = y + stride*out_h
        for x in range(filter_w):
            x_max = x + stride*out_w
            col[:, :, y, x, :, :] = img[:, :, y:y_max:stride, x:x_max:stride]

    col = col.transpose(0, 4, 5, 1, 2, 3).reshape(N*out_h*out_w, -1)
    return col

This function seems to be the cause of running out of memory. If you process more rows of data at once, you get a MemoryError here.

The first three lines read the dimensions of the input data and compute the output size from the input size and the filter size. The // operator seems to be used for the division by the stride so that the fractional part is truncated when the stride does not divide evenly.

Checking the size of the input data (the MNIST test data)

len(x_test)  #The number of data

10000

len(x_test[0]) #channel

1

len(x_test[0][0]) #height

28

len(x_test[0][0][0]) #Width

28

Check the size of filter W1

len(network.params['W1']) #Number of filters

30

len(network.params['W1'][0]) #Number of channels

1

len(network.params['W1'][0][0]) #Filter height

5

len(network.params['W1'][0][0][0]) #Filter width

5
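As a side note, the nested len calls above can be replaced by the shape attribute of a NumPy array, which reports every dimension at once. The sketch below uses zero-filled stand-in arrays (x_demo and w_demo are hypothetical names) with the same shapes as x_test and W1, so that it runs without the dataset.

```python
import numpy as np

# Stand-ins with the same shapes as x_test and network.params['W1']
x_demo = np.zeros((10000, 1, 28, 28), dtype=np.float32)
w_demo = np.zeros((30, 1, 5, 5))

print(x_demo.shape)  # (10000, 1, 28, 28)
print(w_demo.shape)  # (30, 1, 5, 5)

# The same unpacking that Convolution.forward performs
N, C, H, W = x_demo.shape
FN, C, FH, FW = w_demo.shape
```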

Checking the padding and stride

                        conv_param = {'filter_num': 30, 'filter_size': 5, 'pad': 0, 'stride': 1},

Padding 0 and stride 1 are specified when the network object is created.

Output size of the convolution layer

The output size should therefore be

OH = \frac{H + 2P - FH}{S} + 1 = \frac{28 + 2 \times 0 - 5}{1} + 1 = 24

OW = \frac{W + 2P - FW}{S} + 1 = \frac{28 + 2 \times 0 - 5}{1} + 1 = 24

len(network.layers['Conv1'].forward(x_test))  #The number of data

10000

len(network.layers['Conv1'].forward(x_test)[0]) #Number of filters

30

len(network.layers['Conv1'].forward(x_test)[0][0]) #Output height

24

len(network.layers['Conv1'].forward(x_test)[0][0][0]) #Output width

24

Convolution layer tracking

        self.layers['Conv1'] = Convolution(self.params['W1'], self.params['b1'],
                                           conv_param['stride'], conv_param['pad'])

class Convolution:
(Omitted)
    def forward(self, x):
        FN, C, FH, FW = self.W.shape   # 30, 1, 5, 5
        N, C, H, W = x.shape           # 10000, 1, 28, 28
        out_h = 1 + int((H + 2*self.pad - FH) / self.stride) # 24
        out_w = 1 + int((W + 2*self.pad - FW) / self.stride) # 24

        col = im2col(x, FH, FW, self.stride, self.pad)
(Omitted)

def im2col(input_data, filter_h, filter_w, stride=1, pad=0):
    N, C, H, W = input_data.shape          # 10000, 1, 28, 28
    out_h = (H + 2*pad - filter_h)//stride + 1 # 24
    out_w = (W + 2*pad - filter_w)//stride + 1 # 24

    img = np.pad(input_data, [(0,0), (0,0), (pad, pad), (pad, pad)], 'constant')

input_data has 4 dimensions (10000 images, 1 channel, height 28, width 28).

When pad = 0, the pad widths are [(0,0), (0,0), (0, 0), (0, 0)], so nothing is padded.
When pad = 1, they are [(0,0), (0,0), (1, 1), (1, 1)], so one element is added on the top, bottom, left, and right (the height and width axes).
When pad = 2, they are [(0,0), (0,0), (2, 2), (2, 2)], so two elements are added on the top, bottom, left, and right.

In this program example pad = 0, so img is the same as input_data.
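The effect of the pad widths is easy to confirm on a tiny array. This is a minimal sketch with a made-up 2x2 single-channel input (x_small is a hypothetical name); only np.pad itself is taken from the code above.

```python
import numpy as np

# One image, one channel, 2x2 pixels
x_small = np.arange(1, 5).reshape(1, 1, 2, 2)

for pad in (0, 1, 2):
    img = np.pad(x_small, [(0, 0), (0, 0), (pad, pad), (pad, pad)], 'constant')
    # Only the last two axes (height and width) grow, by 2*pad each
    print(pad, img.shape)

padded = np.pad(x_small, [(0, 0), (0, 0), (1, 1), (1, 1)], 'constant')
print(padded[0, 0])
# [[0 0 0 0]
#  [0 1 2 0]
#  [0 3 4 0]
#  [0 0 0 0]]
```

The batch and channel axes get (0, 0), so no extra images or channels are added; the 'constant' mode fills the border with zeros by default.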

    col = np.zeros((N, C, filter_h, filter_w, out_h, out_w)) #10000, 1, 5, 5, 24, 24

The input data (the images) is expanded into the array col. As a container for the expanded data, an array of shape (number of images, channels, filter height, filter width, output height, output width) is created first.
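The shape of this container also gives a rough idea of why memory runs out. The following back-of-the-envelope estimate assumes float64 elements (the default dtype of np.zeros) and counts only the arrays named in the comments, so the real peak usage is likely higher.

```python
import numpy as np

N, C, FH, FW, out_h, out_w = 10000, 1, 5, 5, 24, 24
FN = 30
itemsize = np.dtype(np.float64).itemsize  # 8 bytes per element

# The col container created by np.zeros in im2col
col_bytes = N * C * FH * FW * out_h * out_w * itemsize
# The output of the convolution layer: (N, FN, out_h, out_w)
conv_out_bytes = N * FN * out_h * out_w * itemsize
# The same col container when only 100 images are passed at a time
col_bytes_batch = 100 * C * FH * FW * out_h * out_w * itemsize

print(f"col for 10000 images : {col_bytes / 1024**3:.2f} GiB")        # about 1.07
print(f"conv output          : {conv_out_bytes / 1024**3:.2f} GiB")   # about 1.29
print(f"col for 100 images   : {col_bytes_batch / 1024**2:.2f} MiB")  # about 10.99
```

The transpose followed by reshape at the end of im2col generally has to copy the data, so a second array of the same size as col can exist for a moment. On a 4 GB machine that leaves very little room when all 10,000 test images are processed at once, whereas batches of 100 need only a few megabytes for col.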

    for y in range(filter_h):        
        y_max = y + stride*out_h     
        for x in range(filter_w):    
            x_max = x + stride*out_w 
            col[:, :, y, x, :, :] = img[:, :, y:y_max:stride, x:x_max:stride]

    col = col.transpose(0, 4, 5, 1, 2, 3).reshape(N*out_h*out_w, -1)
    return col

I could not picture at all what happens here, so I tested it with the following simplified array.

import numpy as np
N=1
C=1
H=8
W=8
filter_h=4
filter_w=4
stride=2
out_h=3
out_w=3
img= np.arange(64).reshape(N, C, 8, 8)
col = np.zeros((N, C, filter_h, filter_w, out_h, out_w))

for y in range(filter_h):        
    y_max = y + stride*out_h     
    for x in range(filter_w):    
        x_max = x + stride*out_w 
        col[:, :, y, x, :, :] = img[:, :, y:y_max:stride, x:x_max:stride]
col = col.transpose(0, 4, 5, 1, 2, 3).reshape(N*out_h*out_w, -1)

col

array([[ 0.,  1.,  2.,  3.,  8.,  9., 10., 11., 16., 17., 18., 19., 24., 25., 26., 27.],
       [ 2.,  3.,  4.,  5., 10., 11., 12., 13., 18., 19., 20., 21., 26., 27., 28., 29.],
       [ 4.,  5.,  6.,  7., 12., 13., 14., 15., 20., 21., 22., 23., 28., 29., 30., 31.],
       [16., 17., 18., 19., 24., 25., 26., 27., 32., 33., 34., 35., 40., 41., 42., 43.],
       [18., 19., 20., 21., 26., 27., 28., 29., 34., 35., 36., 37., 42., 43., 44., 45.],
       [20., 21., 22., 23., 28., 29., 30., 31., 36., 37., 38., 39., 44., 45., 46., 47.],
       [32., 33., 34., 35., 40., 41., 42., 43., 48., 49., 50., 51., 56., 57., 58., 59.],
       [34., 35., 36., 37., 42., 43., 44., 45., 50., 51., 52., 53., 58., 59., 60., 61.],
       [36., 37., 38., 39., 44., 45., 46., 47., 52., 53., 54., 55., 60., 61., 62., 63.]])

col[0] holds the part of the input data that the filter covers at its first position. col[1] holds the part covered after shifting the filter two elements to the right (stride 2). In the same way, the array lines up the parts of the input covered at each of the 9 filter positions.

I don't fully understand how the loop does this, but the result makes sense.
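One way to see what the double loop does is to print what a single iteration copies. In this sketch (same 8x8 input, 4x4 filter, and stride 2 as in the test above), each (y, x) pair selects one position inside the filter, and the strided slice picks the element at that position from all 9 windows at once. The variable names first and other are made up for this illustration.

```python
import numpy as np

img = np.arange(64).reshape(1, 1, 8, 8)
stride, out_h, out_w = 2, 3, 3

# Iteration y=0, x=0: the top-left element of every 4x4 window
y, x = 0, 0
first = img[0, 0, y:y + stride * out_h:stride, x:x + stride * out_w:stride]
print(first)
# [[ 0  2  4]
#  [16 18 20]
#  [32 34 36]]

# Iteration y=1, x=2: the element in row 1, column 2 of every window
y, x = 1, 2
other = img[0, 0, y:y + stride * out_h:stride, x:x + stride * out_w:stride]
print(other)
# [[10 12 14]
#  [26 28 30]
#  [42 44 46]]
```

Read row by row, first is exactly column 0 of the final col array (0, 2, 4, 16, ...), and other is column 6 (= 1 * 4 + 2). So the loop fills col one filter position at a time, rather than one window at a time, which is why it needs only filter_h * filter_w iterations regardless of the number of images.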

If you reshape the 4x4 filter into a single column and take the dot product with col, you obtain the results of all 9 filter applications in one operation.

#Convolution.forward

        col = im2col(x, FH, FW, self.stride, self.pad)
        col_W = self.W.reshape(FN, -1).T

        out = np.dot(col, col_W) + self.b
        out = out.reshape(N, out_h, out_w, -1).transpose(0, 3, 1, 2)
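To convince myself that this matrix product really is a convolution, it can be compared with a straightforward loop. The sketch below repeats im2col from util.py so that it runs on its own; naive_conv is a hypothetical reference implementation written for this check, and the input sizes are arbitrary.

```python
import numpy as np


def im2col(input_data, filter_h, filter_w, stride=1, pad=0):
    # Same logic as the im2col from util.py listed earlier
    N, C, H, W = input_data.shape
    out_h = (H + 2*pad - filter_h)//stride + 1
    out_w = (W + 2*pad - filter_w)//stride + 1
    img = np.pad(input_data, [(0, 0), (0, 0), (pad, pad), (pad, pad)], 'constant')
    col = np.zeros((N, C, filter_h, filter_w, out_h, out_w))
    for y in range(filter_h):
        y_max = y + stride*out_h
        for x in range(filter_w):
            x_max = x + stride*out_w
            col[:, :, y, x, :, :] = img[:, :, y:y_max:stride, x:x_max:stride]
    return col.transpose(0, 4, 5, 1, 2, 3).reshape(N*out_h*out_w, -1)


def naive_conv(x, W, b, stride=1, pad=0):
    # Hypothetical reference: apply every filter at every position, one by one
    N, C, H, Wd = x.shape
    FN, _, FH, FW = W.shape
    out_h = (H + 2*pad - FH)//stride + 1
    out_w = (Wd + 2*pad - FW)//stride + 1
    img = np.pad(x, [(0, 0), (0, 0), (pad, pad), (pad, pad)], 'constant')
    out = np.zeros((N, FN, out_h, out_w))
    for n in range(N):
        for f in range(FN):
            for i in range(out_h):
                for j in range(out_w):
                    patch = img[n, :, i*stride:i*stride + FH, j*stride:j*stride + FW]
                    out[n, f, i, j] = np.sum(patch * W[f]) + b[f]
    return out


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 8, 8))   # 2 images, 3 channels, 8x8
W = rng.standard_normal((4, 3, 4, 4))   # 4 filters of size 4x4
b = rng.standard_normal(4)
stride, pad = 2, 0

N = x.shape[0]
FN, C, FH, FW = W.shape
out_h = (x.shape[2] + 2*pad - FH)//stride + 1
out_w = (x.shape[3] + 2*pad - FW)//stride + 1

# The same steps as Convolution.forward
col = im2col(x, FH, FW, stride, pad)
col_W = W.reshape(FN, -1).T
out = (np.dot(col, col_W) + b).reshape(N, out_h, out_w, -1).transpose(0, 3, 1, 2)

print(out.shape)                                         # (2, 4, 3, 3)
print(np.allclose(out, naive_conv(x, W, b, stride, pad)))  # True
```

Both col and W.reshape(FN, -1) order their columns as (channel, filter row, filter column), which is why a plain dot product lines up each window with the corresponding filter weights.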

reference

Complete understanding of numpy.pad function
Manipulate the two-dimensional array freely. [Initialization / Reference / Extraction / Calculation / Transposition]
