In this article, I introduce the paper "Deep Residual Learning for Image Recognition" (CVPR 2016) [1], which proposed the famous ResNet. I also implemented the CIFAR-10 classification experiment described in the paper (GitHub) and ran a reproduction experiment. The ResNet proposed in this paper not only won the ILSVRC classification competition on ImageNet, but is also used in a wide variety of tasks thanks to the effectiveness and versatility of its approach to training deep networks. (The paper has been cited more than 52,000 times as of August 2020.)
Deep neural networks have traditionally been considered difficult to train. This paper makes deep networks easier to train by connecting many small blocks that learn the "residual" (the difference from the target mapping) instead of learning the target mapping directly, and this greatly improved performance on a variety of image recognition tasks.
In image recognition, it is known that the deeper the network, the more abstract, "deeper" semantic features it can extract. However, simply stacking layers caused problems such as vanishing and exploding gradients, so the network parameters did not converge and training did not go well.
The convergence problem itself has largely been addressed by better weight initialization and normalization methods, but even when training converges, accuracy degrades as the network is made deeper (this is not overfitting: the training error also gets worse, as shown in the graph below).
(The figure is from the paper. The same applies below)
The method proposed in this paper realizes a deep network by connecting a large number of small "residual blocks". Each residual block consists of several weight layers and an identity mapping (shortcut), as shown in the figure below. If $H(x)$ is the function that the entire block should represent, the stacked weight layers learn $F(x) = H(x) - x$. This is why the block is called "residual".
This method addresses the problem mentioned at the end of the background section, namely that accuracy drops as layers are added. In practice it is difficult for a stack of non-linear layers to learn the identity mapping ($H(x) = x$), so when the number of layers is increased, blocks for which the identity mapping is the optimal solution cannot learn it well, and the accuracy is said to drop.
With the proposed residual block, however, the identity mapping can be learned easily by simply driving the weights of the weight layers to 0. In practice the optimal mapping is rarely the exact identity, but learning still becomes easier in cases where the difference between $H(x)$ and $x$ is small.
When the output of the residual block is $y$, the block computes $y = F(x, \{W_i\}) + x$, where $F(x, \{W_i\})$ is the residual mapping learned by the weight layers $\{W_i\}$.
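To make this relationship concrete, here is a minimal PyTorch sketch of a residual block (the two-layer form of $F$ and the channel count are illustrative assumptions; the full CIFAR-10 implementation appears later in this article):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MinimalResidualBlock(nn.Module):
    """Illustrative residual block: y = F(x) + x, where F is two 3x3 convolutions."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        f = self.conv2(F.relu(self.conv1(x)))  # the residual F(x) learned by the weight layers
        return F.relu(f + x)                   # add the identity shortcut: y = F(x) + x
```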
This network diagram shows the networks used for the ImageNet classification problem; the one on the far right is the proposed method.
The network structure follows the principle that if the feature map size stays the same, the number of channels stays the same, and if the feature map size is halved, the number of channels is doubled. The feature map is reduced by setting the stride of the first convolution of a stage to 2 rather than by pooling; pooling is used only at the beginning and the end of the network. (Incidentally, the pooling before the final FC layer is Global Average Pooling.)
 Batch Normalization runs immediately after each convolution.
When the number of channels differs across the shortcut, the identity mapping is replaced by a **1x1 convolution**; identity shortcuts are used everywhere else.
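As a rough sketch of these two rules (the tensor sizes below are illustrative assumptions, not taken from the paper), a stride-2 convolution halves the feature map while doubling the channels, and a 1x1 convolution on the shortcut produces a tensor of the same shape so the addition is possible:

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration: spatial size halves (56 -> 28) while channels double (64 -> 128).
x = torch.randn(1, 64, 56, 56)
conv3x3 = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)  # stride-2 conv replaces pooling
shortcut = nn.Conv2d(64, 128, kernel_size=1, stride=2)            # 1x1 projection for the shortcut
print(conv3x3(x).shape)   # torch.Size([1, 128, 28, 28])
print(shortcut(x).shape)  # torch.Size([1, 128, 28, 28]) -- same shape, so the two can be added
```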
CIFAR-10

For CIFAR-10 image classification, the network structure is slightly different because the input images are much smaller than those of ImageNet. The first layer is a single 3x3 convolution, followed by residual blocks. There are $6n + 1$ convolution layers in total, broken down as shown in the table below, which corresponds to $3n$ residual blocks.
After these $6n + 1$ layers, Global Average Pooling is applied and classification is performed with a fully connected layer with 10 outputs. Experiments are conducted with $n \in \{3, 5, 7, 9\}$, giving 20-, 32-, 44-, and 56-layer networks, respectively.
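Counting the final FC layer as well, the total depth works out to $6n + 2$:

$$
6n + 2 \;=\;
\begin{cases}
20 & (n = 3) \\
32 & (n = 5) \\
44 & (n = 7) \\
56 & (n = 9)
\end{cases}
$$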
When the number of channels differs across the shortcut, the identity mapping is matched by padding the missing channels with zeros (option A in the paper).
Of the datasets covered in the paper, I implemented the code for the CIFAR-10 classification problem in PyTorch. The full source code is posted on GitHub.
Only the most important part, the model definition, is shown here as well. The residual block is implemented as a single class, ResNetCifarBlock, and a general-purpose function make_resblock_group that builds a group of blocks with the same number of channels keeps the code concise and extensible. Where the feature map size and the number of channels change, the feature map is first subsampled and then the missing channels are zero-padded.
nets.py
import torch
import torch.nn as nn
import torch.nn.functional as F
class ResNetCifarBlock(nn.Module):
    def __init__(self, input_nc, output_nc):
        super().__init__()
        stride = 1
        self.expand = False
        if input_nc != output_nc:
            # when the number of channels doubles, halve the feature map with a stride-2 convolution
            assert input_nc * 2 == output_nc, 'output_nc must be input_nc * 2'
            stride = 2
            self.expand = True
        self.conv1 = nn.Conv2d(input_nc, output_nc, kernel_size=3, stride=stride, padding=1)
        self.bn1 = nn.BatchNorm2d(output_nc)
        self.conv2 = nn.Conv2d(output_nc, output_nc, kernel_size=3, stride=1, padding=1)
        self.bn2 = nn.BatchNorm2d(output_nc)
    def forward(self, x):
        xx = F.relu(self.bn1(self.conv1(x)), inplace=True)
        y = self.bn2(self.conv2(xx))
        if self.expand:
            x = F.interpolate(x, scale_factor=0.5, mode='nearest')  # subsampling
            zero = torch.zeros_like(x)
            x = torch.cat([x, zero], dim=1)  # option A in the original paper
        h = F.relu(y + x, inplace=True)
        return h
def make_resblock_group(cls, input_nc, output_nc, n):
    blocks = []
    blocks.append(cls(input_nc, output_nc))
    for _ in range(1, n):
        blocks.append(cls(output_nc, output_nc))
    return nn.Sequential(*blocks)
class ResNetCifar(nn.Module):
    def __init__(self, n):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)
        self.bn = nn.BatchNorm2d(16)
        self.block1 = make_resblock_group(ResNetCifarBlock, 16, 16, n)  # 32x32 feature map, 16 channels
        self.block2 = make_resblock_group(ResNetCifarBlock, 16, 32, n)  # 16x16 feature map, 32 channels
        self.block3 = make_resblock_group(ResNetCifarBlock, 32, 64, n)  # 8x8 feature map, 64 channels
        self.pool = nn.AdaptiveAvgPool2d(output_size=(1, 1))  # global average pooling
        self.fc = nn.Linear(64, 10)
    def forward(self, x):
        x = F.relu(self.bn(self.conv(x)), inplace=True)
        x = self.block1(x)
        x = self.block2(x)
        x = self.block3(x)
        x = self.pool(x)
        x = x.view(x.shape[0], -1)
        x = self.fc(x)
        return x
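As a quick sanity check (this snippet is my own addition, not part of the repository), the model can be instantiated and run on a dummy batch to confirm the output shape:

```python
import torch
from nets import ResNetCifar  # the file defined above

model = ResNetCifar(n=3)        # n = 3 corresponds to ResNet-20
x = torch.randn(8, 3, 32, 32)   # dummy batch of CIFAR-10-sized images
logits = model(x)
print(logits.shape)             # torch.Size([8, 10]) -- one score per class
```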
All other settings follow the paper.
CIFAR-10 is a dataset of 60,000 images in 10 classes; 50,000 of them were used for training and 10,000 for evaluation. The image size is 32x32.
The results of training and evaluation with the above settings are as follows. The top-1 error rate is used as the evaluation metric, and each value is the mean ± standard deviation over 5 runs.
| Method | $n$ | Top-1 error rate (%) | Error rate reported in the paper (%) |
|---|---|---|---|
| ResNet-20 | 3 | 8.586 ± 0.120 | 8.75 |
| ResNet-32 | 5 | 7.728 ± 0.318 | 7.51 |
| ResNet-44 | 7 | 7.540 ± 0.475 | 7.17 |
| ResNet-56 | 9 | 7.884 ± 0.523 | 6.97 |
The deeper the network, the larger the variation in the error rate. Although the averages differ somewhat from the values reported in the paper, they are generally close to the paper's values. (It is not stated in the paper, but I suspect the reported numbers are the best of multiple runs, in which case these results seem reasonable.)