The previous article is here
The DNN (Deep Neural Network) was completed last time.
(I plan to play with DNN in another article, including how to use the layer manager)
Here, we will create a CNN (Convolutional Neural Network) for image recognition.
The `im2col` and `col2im` functions used here are introduced here and here.
The next article is here
- [Convolution layer](#convolution-layer)
  - [Convolution layer forward propagation](#convolution-layer-forward-propagation)
  - [Convolution layer backpropagation](#convolution-layer-backpropagation)
  - [Convolution layer learning](#convolution-layer-learning)
  - [Convolution layer implementation](#convolution-layer-implementation)
- [Pooling layer](#pooling-layer)
  - [Pooling layer forward propagation](#pooling-layer-forward-propagation)
  - [Pooling layer backpropagation](#pooling-layer-backpropagation)
  - [Pooling layer learning](#pooling-layer-learning)
  - [Pooling layer implementation](#pooling-layer-implementation)
- Conclusion
A process called **convolution** brings a great benefit to image recognition. As an introduction: for data such as images, where positional relationships matter, simply flattening the input to one dimension and feeding it through a neural network throws away that positional information, which is a waste. The role of the convolution layer is to pass data through the network while keeping the dimensions of the input, that is, while preserving important information such as positional relationships. In a convolution layer, the filter plays the role that the weights play in an ordinary layer.

You could write code that works exactly as in this gif, but if you implement it literally, the code is too heavy to be practical. To see why, simplify the gif part and implement it directly:
```python
# Image:  array of shape (I_h, I_w)
# Filter: array of shape (F_h, F_w)
# Output: array of shape (O_h, O_w)
for h in range(O_h):
    h_lim = h + F_h
    for w in range(O_w):
        w_lim = w + F_w
        Output[h, w] = np.sum(Image[h:h_lim, w:w_lim] * Filter)
```
In this form we access the numpy array in a double loop, take the element-wise product of the filter with the corresponding part of the input, and store the summed result in the output.
Moreover, while this is written as a double loop here, the actual input is four-dimensional, so it becomes a quadruple loop; it is easy to imagine how quickly the number of iterations grows.
Since accessing numpy arrays with Python `for` loops is slow, we want to avoid looping over elements as much as possible. That is where the `im2col` function comes into play.
The computation in the previous gif is
a = 1W + 2X + 5Y + 6Z \\
b = 2W + 3X + 6Y + 7Z \\
c = 3W + 4X + 7Y + 8Z \\
d = 5W + 6X + 9Y + 10Z \\
e = 6W + 7X + 10Y + 11Z \\
f = 7W + 8X + 11Y + 12Z \\
g = 9W + 10X + 13Y + 14Z \\
h = 10W + 11X + 14Y + 15Z \\
i = 11W + 12X + 15Y + 16Z
Expressing this as a matrix product gives
\left(
\begin{array}{c}
a \\
b \\
c \\
d \\
e \\
f \\
g \\
h \\
i
\end{array}
\right)^{\top}
=
\left(
\begin{array}{cccc}
W & X & Y & Z
\end{array}
\right)
\left(
\begin{array}{ccccccccc}
1 & 2 & 3 & 5 & 6 & 7 & 9 & 10 & 11 \\
2 & 3 & 4 & 6 & 7 & 8 & 10 & 11 & 12 \\
5 & 6 & 7 & 9 & 10 & 11 & 13 & 14 & 15 \\
6 & 7 & 8 & 10 & 11 & 12 & 14 & 15 & 16
\end{array}
\right)
The `im2col` function converts an input image or filter into a matrix like this. For details, see [here](https://qiita.com/kuroitu/items/35d7b5a4bde470f69570). By using `im2col`, the problem described above is largely resolved.
However, since `im2col` changes the shape of the original input, learning by the error backpropagation method cannot proceed as is; during backpropagation you need to insert the `col2im` function, which performs the reverse transformation. For details, see here.
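The full, batched `im2col` is covered in the linked articles; as a rough sketch of the idea (my own minimal version, assuming a single-channel image, stride 1 and no padding, with the filter values 1-4 standing in for $W, X, Y, Z$), it could look like this:

```python
import numpy as np

def im2col_2d(image, F_h, F_w):
    """Minimal im2col sketch: one 2D image, stride 1, no padding.
    Each column holds one flattened (F_h, F_w) patch, so the whole
    convolution becomes a single matrix product."""
    I_h, I_w = image.shape
    O_h, O_w = I_h - F_h + 1, I_w - F_w + 1
    col = np.empty((F_h * F_w, O_h * O_w))
    for h in range(F_h):              # only F_h * F_w iterations, independent of the image size
        for w in range(F_w):
            col[h * F_w + w] = image[h:h + O_h, w:w + O_w].ravel()
    return col

image = np.arange(1, 17).reshape(4, 4)       # the 4x4 input from the example above
filt = np.array([[1., 2.], [3., 4.]])        # stand-in values for W, X, Y, Z
out = filt.ravel() @ im2col_2d(image, 2, 2)  # (4,) @ (4, 9) -> (9,)
print(out.reshape(3, 3))                     # same result as the naive double loop
```

The matrix built by `im2col_2d` is exactly the 4×9 matrix shown above, so all nine outputs come out of one matrix product.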
Now that I have briefly outlined the convolution layer, here is the blueprint.
Let's start with forward propagation. The relevant part is the colored part in the figure below. The basic operation is the same as the forward propagation of a normal neural network; the only difference is that the `im2col` function is applied beforehand. Let's take a closer look.

First, the convolution operation itself is as shown in the figure below.
![conv_filtering.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/640911/6d06ad5e-550c-2e59-7906-04989b5269a5.png)
Bias is omitted here. The input is a tensor with batch size $B$, number of channels $C$, and image size $(I_h, I_w)$. There are $M$ filters, each a tensor with the same number of channels as the input and filter size $(F_h, F_w)$. Each input channel is filtered by the corresponding filter channel, for all the batch data, resulting in a tensor of shape $(B, M, O_h, O_w)$.

Let's see how to carry out this process concretely. The input and the filters are processed as shown in the figures below.
![input_im2col.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/640911/f22aa967-6879-1cc1-41e2-d83ceeb956f3.png)
![filter_reshape.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/640911/d550d4be-8531-1986-aaa9-b3352e1008bb.png)
This drops the 4D tensors down to 2D, which allows us to perform matrix multiplication.
![convolution.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/640911/f763ca02-5d32-440a-44ec-5ca51817db9e.png)
The bias (a two-dimensional matrix of shape $(M, 1)$) is then added to this output. At this point, numpy's broadcasting adds the same value to every column.
After that, this output is transformed and the dimensions are exchanged to obtain the output tensor.
Throw this output tensor into the activation function to complete the forward propagation of the convolution layer.
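As a shape-only sketch of this flow (hypothetical sizes, stride 1, no padding; the random matrix stands in for the real `im2col` output, and the activation function is omitted):

```python
import numpy as np

B, C, M = 2, 3, 4                                      # batch size, channels, number of filters
I_h = I_w = 5
F_h = F_w = 3
O_h, O_w = I_h - F_h + 1, I_w - F_w + 1                # 3, 3 with stride 1 and no padding

x_col = np.random.randn(C * F_h * F_w, B * O_h * O_w)  # stand-in for im2col(input)
w = np.random.randn(M, C, F_h, F_w).reshape(M, -1)     # filters flattened to (M, C*F_h*F_w)
b = np.random.randn(M, 1)                              # bias, broadcast over all columns

u = w @ x_col + b                                      # (M, B*O_h*O_w)
y = u.reshape(M, B, O_h, O_w).transpose(1, 0, 2, 3)    # output tensor (B, M, O_h, O_w)
print(y.shape)                                         # (2, 4, 3, 3)
```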
Next is backpropagation. The relevant part is the colored part in the figure below. Operationally, it follows the forward steps in reverse, ending with the `col2im` function.
Let's take a closer look.
The propagated gradient is a $ (B, M, O_h, O_w) $ tensor. First, this gradient is transformed in the reverse order of forward propagation.
The gradient with respect to the filter is calculated as the matrix product of the gradient and the input.
Since the obtained result is a two-dimensional matrix, it is transformed into a four-dimensional tensor with the same shape as the filter.
The key to the gradient with respect to the bias is that the same value is added to all columns during forward propagation. Adding the same value to several elements is equivalent to a network shaped as in the figure below (the numbers are arbitrary).
Therefore, backpropagation flows from each column back into the single bias, so the gradient with respect to the bias is the sum along $axis=1$, that is, over the columns.
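Here is a quick numeric check of that claim with a toy squared-error loss and made-up sizes (just an illustration, not part of the layer code):

```python
import numpy as np

M, cols = 3, 5                                 # rows = filters, columns = B*O_h*O_w
wx = np.random.randn(M, cols)                  # stand-in for w @ x
b = np.random.randn(M, 1)
target = np.random.randn(M, cols)

def loss(bias):
    return 0.5 * np.sum((wx + bias - target) ** 2)

grad_u = wx + b - target                       # upstream gradient d(loss)/d(u), u = wx + b
grad_b = np.sum(grad_u, axis=1, keepdims=True) # column-wise sum: one value per bias element

eps = 1e-6
numeric = np.zeros((M, 1))
for i in range(M):                             # central finite differences, one bias at a time
    d = np.zeros((M, 1))
    d[i] = eps
    numeric[i] = (loss(b + d) - loss(b - d)) / (2 * eps)
print(np.allclose(grad_b, numeric))            # True
```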
The gradient with respect to the input is calculated as the matrix product of the filter and the gradient.
As you can see from the shape of the result, it has the same shape as the output of the `im2col` function applied to the input tensor during forward propagation. So passing it through the `col2im` function, which does the opposite, produces the gradient tensor with respect to the input.
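Putting the backward shapes together in a shape-only sketch (hypothetical sizes, activation derivative omitted; the random matrices stand in for the values saved during forward propagation):

```python
import numpy as np

B, C, M = 2, 3, 4
F_h = F_w = 3
O_h = O_w = 3

w = np.random.randn(M, C * F_h * F_w)                  # flattened filters
x_col = np.random.randn(C * F_h * F_w, B * O_h * O_w)  # stand-in for im2col(input)
grad = np.random.randn(B, M, O_h, O_w)                 # gradient arriving from the next layer

dact = grad.transpose(1, 0, 2, 3).reshape(M, -1)       # back to the 2D shape (M, B*O_h*O_w)
grad_w = dact @ x_col.T                                # (M, C*F_h*F_w): same 2D shape as w
grad_b = np.sum(dact, axis=1).reshape(M, 1)            # one value per filter
grad_x_col = w.T @ dact                                # (C*F_h*F_w, B*O_h*O_w): goes to col2im
print(grad_w.shape, grad_b.shape, grad_x_col.shape)
```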
This completes the backpropagation of the convolution layer.
Actually, you don't have to reshape the filter every time; you only need to do it once at the beginning. The reason is that the filter is reshaped in exactly the same way every time, so there is no need to repeat it. Since the filter then stays in its reshaped form, the gradient with respect to the filter computed by backpropagation does not need to be reshaped back either. As a result, learning in the convolution layer works the same way as in a normal layer.
So let's implement it. However, a little ingenuity is required to inherit from `BaseLayer`.
```python:conv.py
import numpy as np


class ConvLayer(BaseLayer):
    def __init__(self, *, I_shape=None, F_shape=None,
                 stride=1, pad="same",
                 name="", wb_width=5e-2,
                 act="ReLU", opt="Adam",
                 act_dic={}, opt_dic={}, **kwds):
        self.name = name

        if I_shape is None:
            raise KeyError("Input shape is None.")
        if F_shape is None:
            raise KeyError("Filter shape is None.")

        if len(I_shape) == 2:
            C, I_h, I_w = 1, *I_shape
        else:
            C, I_h, I_w = I_shape
        self.I_shape = (C, I_h, I_w)

        if len(F_shape) == 2:
            M, F_h, F_w = 1, *F_shape
        else:
            M, F_h, F_w = F_shape
        self.F_shape = (M, C, F_h, F_w)

        if isinstance(stride, tuple):
            stride_ud, stride_lr = stride
        else:
            stride_ud = stride
            stride_lr = stride
        self.stride = (stride_ud, stride_lr)

        if isinstance(pad, tuple):
            pad_ud, pad_lr = pad
        elif isinstance(pad, int):
            pad_ud = pad
            pad_lr = pad
        elif pad == "same":
            pad_ud = 0.5*((I_h - 1)*stride_ud - I_h + F_h)
            pad_lr = 0.5*((I_w - 1)*stride_lr - I_w + F_w)
        self.pad = (pad_ud, pad_lr)

        O_h = get_O_shape(I_h, F_h, stride_ud, pad_ud)
        O_w = get_O_shape(I_w, F_w, stride_lr, pad_lr)
        self.O_shape = (M, O_h, O_w)
        self.n = np.prod(self.O_shape)

        # Set filters and bias
        self.w = wb_width*np.random.randn(*self.F_shape).reshape(M, -1).T
        self.b = wb_width*np.random.randn(M)

        # Get the activation function (class)
        self.act = get_act(act, **act_dic)

        # Get the optimizer (class)
        self.opt = get_opt(opt, **opt_dic)

    def forward(self, x):
        B = x.shape[0]
        M, O_h, O_w = self.O_shape

        x, _, self.pad_state = im2col(x, self.F_shape,
                                      stride=self.stride,
                                      pad=self.pad)
        super().forward(x.T)
        return self.y.reshape(B, O_h, O_w, M).transpose(0, 3, 1, 2)

    def backward(self, grad):
        B = grad.shape[0]
        I_shape = B, *self.I_shape
        M, O_h, O_w = self.O_shape

        grad = grad.transpose(0, 2, 3, 1).reshape(-1, M)
        super().backward(grad)
        self.grad_x = col2im(self.grad_x.T, I_shape, self.O_shape,
                             stride=self.stride, pad=self.pad_state)
        return self.grad_x
```
Let me explain where the ingenuity comes in. If you implement the layer exactly as explained above, without any tricks, it looks like this:
```python:conv.py
import numpy as np


class ConvLayer(BaseLayer):
    def __init__(self, *, I_shape=None, F_shape=None,
                 stride=1, pad="same",
                 name="", wb_width=5e-2,
                 act="ReLU", opt="Adam",
                 act_dic={}, opt_dic={}, **kwds):
        self.name = name

        if I_shape is None:
            raise KeyError("Input shape is None.")
        if F_shape is None:
            raise KeyError("Filter shape is None.")

        if len(I_shape) == 2:
            C, I_h, I_w = 1, *I_shape
        else:
            C, I_h, I_w = I_shape
        self.I_shape = (C, I_h, I_w)

        if len(F_shape) == 2:
            M, F_h, F_w = 1, *F_shape
        else:
            M, F_h, F_w = F_shape
        self.F_shape = (M, C, F_h, F_w)

        _, O_shape, self.pad_state = im2col(np.zeros((1, *self.I_shape)), self.F_shape,
                                            stride=stride, pad=pad)
        self.O_shape = (M, *O_shape)
        self.stride = stride
        self.pad = pad      # keep the padding spec so forward() can reuse it
        self.n = np.prod(self.O_shape)

        # Set filters and bias
        self.w = wb_width*np.random.randn(*self.F_shape).reshape(M, -1)
        self.b = wb_width*np.random.randn(M, 1)

        # Get the activation function (class)
        self.act = get_act(act, **act_dic)

        # Get the optimizer (class)
        self.opt = get_opt(opt, **opt_dic)

    def forward(self, x):
        B = x.shape[0]
        M, O_h, O_w = self.O_shape

        self.x, _, self.pad_state = im2col(x, self.F_shape,
                                           stride=self.stride,
                                           pad=self.pad)
        self.u = self.w@self.x + self.b   # (M, C*F_h*F_w) @ (C*F_h*F_w, B*O_h*O_w) + (M, 1)
        self.u = self.u.reshape(M, B, O_h, O_w).transpose(1, 0, 2, 3)
        self.y = self.act.forward(self.u)
        return self.y

    def backward(self, grad):
        B = grad.shape[0]
        I_shape = B, *self.I_shape
        M, O_h, O_w = self.O_shape

        dact = grad*self.act.backward(self.u, self.y)
        dact = dact.transpose(1, 0, 2, 3).reshape(M, -1)
        self.grad_w = [email protected]
        self.grad_b = np.sum(dact, axis=1).reshape(M, 1)
        self.grad_x = self.w.T@dact
        self.grad_x = col2im(self.grad_x, I_shape, self.O_shape,
                             stride=self.stride, pad=self.pad_state)
        return self.grad_x
```
Let's take a closer look at the differences from `BaseLayer`, omitting the unchanged code.
| Attention part | BaseLayer | ConvLayer |
|---|---|---|
| `w` | `randn(prev, n)` | `randn(*F_shape).reshape(M, -1)` |
| `b` | `randn(n)` | `randn(M, 1)` |
| `x` | - | `im2col(x)` |
| `u` | `x@w + b` | `w@x + b` |
| `u` | - | `u.reshape(M, B, O_h, O_w).transpose(1, 0, 2, 3)` |
| `y` | `act.forward(u)` | `act.forward(u)` |
| `grad` | - | - |
| `dact` | `grad*act.backward(u, y)` | `grad*act.backward(u, y)` |
| `dact` | - | `dact.transpose(1, 0, 2, 3).reshape(M, -1)` |
| `grad_w` | `x.T@dact` | `[email protected]` |
| `grad_b` | `sum(dact, axis=0)` | `sum(dact, axis=1).reshape(M, 1)` |
| `grad_x` | `[email protected]` | `w.T@dact` |
| `grad_x` | - | `col2im(grad_x)` |
First, let's align the forward propagation. The biggest difference in forward propagation is the calculation of `u`:
\boldsymbol{x}@\boldsymbol{w} + \boldsymbol{b} \quad \Leftrightarrow \quad \boldsymbol{w}@\boldsymbol{x} + \boldsymbol{b}
Since $(\boldsymbol{w}@\boldsymbol{x})^{\top} = \boldsymbol{x}^{\top}@\boldsymbol{w}^{\top}$, the order of the matrix product can be swapped by transposing. So, by setting
\begin{align}
\boldsymbol{x} &\leftarrow \textrm{im2col}(\boldsymbol{x})^{\top} = (BO_hO_w, CF_hF_w) \\
\boldsymbol{w} &\leftarrow \boldsymbol{w}^{\top} = (CF_hF_w, M) \\
\boldsymbol{b} & \leftarrow (M, )
\end{align}
we can match the forward propagation formula of `BaseLayer`. As for the bias, to make numpy's broadcasting work, we use a one-dimensional array of shape $(M,)$ instead of a two-dimensional matrix of shape $(M, 1)$.
If you change the forward propagation like this
\boldsymbol{x}@\boldsymbol{w} + \boldsymbol{b} = (BO_hO_w, CF_hF_w)@(CF_hF_w, M) + (M) = (BO_hO_w, M)
the shapes line up. After computing with `BaseLayer`'s `forward`, the value propagated to the next layer can be transformed into $(B, M, O_h, O_w)$ with `self.y.reshape(B, O_h, O_w, M).transpose(0, 3, 1, 2)`.
Also, if you look at the devised code, the `return` statement performs this transformation before passing the result on, which leaves the shapes of `u` and `y` as $(BO_hO_w, M)$. That is fine as it is.
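A quick numeric check of this rearrangement (hypothetical sizes; `x` stands in for the `im2col` output):

```python
import numpy as np

M, CFhFw, BOhOw = 4, 12, 18                 # hypothetical sizes
w = np.random.randn(M, CFhFw)               # flattened filters for the "w @ x" form
x = np.random.randn(CFhFw, BOhOw)           # stand-in for im2col(x)
b = np.random.randn(M)

naive = w @ x + b.reshape(M, 1)             # the w @ x + b form of the naive layer
aligned = x.T @ w.T + b                     # the x @ w + b form that matches BaseLayer
print(np.allclose(naive, aligned.T))        # True: same values, just a transposed layout
```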
Next is backpropagation. The incoming gradient `grad` has shape $(B, M, O_h, O_w)$, so the element-wise product `grad * act.backward(u, y)` cannot be computed as is:
\boldsymbol{grad} \otimes \textrm{act.backward}(\boldsymbol{u}, \boldsymbol{y}) = (B, M, O_h, O_w) \otimes (BO_hO_w, M)
So let's reshape `grad` to match. This can be done with `grad.transpose(0, 2, 3, 1).reshape(-1, M)`.
After this, passing it to `BaseLayer`'s `backward` gives
\begin{array}{llll}
\boldsymbol{dact} &= \boldsymbol{grad} \otimes \textrm{act.backward}(\boldsymbol{u}, \boldsymbol{y}) &= (BO_hO_w, M) & \\
\boldsymbol{grad_w} &= \boldsymbol{x}^{\top}@\boldsymbol{dact} &= (CF_hF_w, BO_hO_w)@(BO_hO_w, M) &= (CF_hF_w, M)\\
\boldsymbol{grad_b} &= \textrm{sum}(\boldsymbol{dact}, \textrm{axis}=0) &= (M, ) & \\
\boldsymbol{grad_x} &= \boldsymbol{dact}@\boldsymbol{w}^{\top} &= (BO_hO_w, M)@(M, CF_hF_w) &= (BO_hO_w, CF_hF_w)
\end{array}
Since the shapes work out as above, all that remains is
\boldsymbol{grad_x} \leftarrow \textrm{col2im}(\boldsymbol{grad_x}^{\top}) = (B, C, I_h, I_w)
and the backward pass is complete.
As described above, you do not need to change `BaseLayer`'s `update` function.
Therefore, this completes the convolution layer.
Next is the pooling layer. First, the pooling layer is a layer that reduces the data size by extracting only the information that seems to be important from the input image. The important information in this case is usually the maximum or average.
Also, as with the convolution layer, the implementation becomes faster and more efficient by using the `im2col` and `col2im` functions.
The blueprint for the pooling layer looks like this:
Let's look at forward propagation. The relevant part is the colored part. There are some values that must be kept for backpropagation. Let's take a closer look.

The target operation is as shown in the figure below. First, throw the input tensor into the `im2col` function to convert it into a two-dimensional matrix, and then reshape that matrix further. After reshaping it into such a tall, narrow matrix, take the maximum of each row (each row holds one pooling window), and finally reshape and swap the dimensions to complete the output. You also need to record the index of each maximum before taking it, for use in backpropagation.
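As a minimal sketch of this max-pooling operation (one 4×4 image, pool size 2, stride 2, no padding, without `im2col`; the actual layer below does the same thing for a whole batch via `im2col`):

```python
import numpy as np

image = np.arange(16, dtype=float).reshape(4, 4)
pool = 2
O_h, O_w = image.shape[0] // pool, image.shape[1] // pool

# Gather each pooling window into one row of a tall, narrow matrix,
# then take the row-wise maximum (and remember where it came from).
windows = (image.reshape(O_h, pool, O_w, pool)
                .transpose(0, 2, 1, 3)
                .reshape(-1, pool * pool))
max_index = np.argmax(windows, axis=1)       # kept for backpropagation
output = np.max(windows, axis=1).reshape(O_h, O_w)
print(output)                                # [[ 5.  7.] [13. 15.]]
```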
Next is backpropagation. The relevant part is the colored part in the figure. Operationally, the flow ends with the `col2im` function. It is hard to describe with words alone, so it looks like the figure below.
As you can see from the blueprint, the pooling layer has no parameters to learn, so no learning takes place.
The explanation of the pooling layer was much easier than that of the convolution layer. The implementation is not that complicated either.
```python:pool.py
import numpy as np


class PoolingLayer(BaseLayer):
    def __init__(self, *, I_shape=None,
                 pool=1, pad=0,
                 name="", **kwds):
        self.name = name

        if I_shape is None:
            raise KeyError("Input shape is None.")

        if len(I_shape) == 2:
            C, I_h, I_w = 1, *I_shape
        else:
            C, I_h, I_w = I_shape
        self.I_shape = (C, I_h, I_w)

        _, O_shape, self.pad_state = im2col(np.zeros((1, *self.I_shape)), (pool, pool),
                                            stride=pool, pad=pad)
        self.O_shape = (C, *O_shape)
        self.n = np.prod(self.O_shape)
        self.pool = pool
        self.F_shape = (pool, pool)

    def forward(self, x):
        B = x.shape[0]
        C, O_h, O_w = self.O_shape

        self.x, _, self.pad_state = im2col(x, self.F_shape,
                                           stride=self.pool,
                                           pad=self.pad_state)
        self.x = self.x.T.reshape(B*O_h*O_w*C, -1)
        self.max_index = np.argmax(self.x, axis=1)
        self.y = np.max(self.x, axis=1).reshape(B, O_h, O_w, C).transpose(0, 3, 1, 2)
        return self.y

    def backward(self, grad):
        B = grad.shape[0]
        I_shape = B, *self.I_shape
        C, O_h, O_w = self.O_shape

        grad = grad.transpose(0, 2, 3, 1).reshape(-1, 1)
        self.grad_x = np.zeros((grad.size, self.pool*self.pool))
        # Route each gradient value only to the position that was the maximum
        # of its pooling window during forward propagation.
        self.grad_x[np.arange(grad.size), self.max_index] = grad.ravel()
        self.grad_x = self.grad_x.reshape(B*O_h*O_w, C*self.pool*self.pool).T
        self.grad_x = col2im(self.grad_x, I_shape, self.O_shape,
                             stride=self.pool, pad=self.pad_state)
        return self.grad_x

    def update(self, **kwds):
        pass
```
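As a minimal sketch of this backpropagation, continuing the 4×4 example from the forward sketch with a hypothetical upstream gradient: the gradient is routed only to the position that produced each window's maximum, and everything else stays zero.

```python
import numpy as np

# The four 2x2 windows of the 4x4 example, one per row (same layout as the forward sketch).
windows = np.array([[ 0.,  1.,  4.,  5.],
                    [ 2.,  3.,  6.,  7.],
                    [ 8.,  9., 12., 13.],
                    [10., 11., 14., 15.]])
max_index = np.argmax(windows, axis=1)           # saved during forward propagation
grad = np.array([0.1, 0.2, 0.3, 0.4])            # hypothetical upstream gradient, one per window

grad_windows = np.zeros_like(windows)
grad_windows[np.arange(grad.size), max_index] = grad   # scatter into the argmax positions
print(grad_windows)
# A col2im-style rearrangement would then map these values back to their image positions.
```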
When I tried to build the experimental code for the CNN, it did not work well and I spent a long time investigating... In the end, there was no problem with the convolution layer or the pooling layer; the activation functions were the problem.
The implementation in the activation function list has also been changed.
I will post the experimental code in the next article. I have also changed the `LayerManager` class and so on.
- Introduction to Deep Learning ~ Basics ~
- Introduction to Deep Learning ~ Coding Preparation ~
- Introduction to Deep Learning ~ Forward Propagation ~
- Introduction to Deep Learning ~ Backpropagation ~
- Introduction to Deep Learning ~ Learning Rules ~
- Introduction to Deep Learning ~ Localization and Loss Functions ~
- Introduction to Deep Learning ~ Function Approximation ~
- Introduction to Deep Learning ~ Convolution and Pooling ~
- Introduction to Deep Learning ~ CNN Experiment ~
- List of activation functions (2020)
- Gradient descent method list (2020)
- See and understand! Comparison of optimization methods (2020)
- Thorough understanding of im2col
- Thorough understanding of col2im
- Complete understanding of the numpy.pad function