Read & implement Deep Residual Learning for Image Recognition

In this article, I introduce the paper "Deep Residual Learning for Image Recognition" (CVPR 2016) [1], which proposed the well-known ResNet. I also implemented the CIFAR-10 classification experiment from the paper (code on GitHub) and ran a reproduction experiment. ResNet not only won the ILSVRC competition on ImageNet classification, but is also used in a wide variety of tasks because the proposed method is effective and versatile for training deep networks. (The paper had more than 52,000 citations as of August 2020.)

Paper commentary

Overview

Deep neural networks have traditionally been considered difficult to train. This paper makes deep networks easier to train by connecting a large number of blocks that learn a "residual" (the difference between the desired output of the block and its input) instead of learning the desired mapping directly. This greatly improved performance on a variety of image recognition tasks.

Background

In image recognition, deeper networks are known to extract higher-level semantic features. However, simply stacking more layers caused problems such as vanishing and exploding gradients, so the network parameters did not converge and training did not go well. This convergence problem has largely been addressed by careful weight initialization and normalization layers, but even when a deep network converges, accuracy degrades as more layers are added. This is not overfitting: the training error itself gets worse, as shown in the graph below.

(Figure: image.png. The figure is from the paper; the same applies to the figures below.)

Proposed method

The method proposed in this paper builds a deep network by connecting a large number of small "residual blocks". Each residual block consists of several weight layers and an identity mapping (shortcut), as shown in the figure below. If the function the whole block should represent is $H(x)$, the stacked weight layers learn $F(x) = H(x) - x$. This is why it is called a "residual".

(Figure: image.png)

This method addresses the problem mentioned at the end of the background, that accuracy drops as the network gets deeper. It is difficult for a stack of non-linear layers to learn an identity mapping ($H(x) = x$), so when the number of layers is increased, blocks whose optimal solution is the identity mapping cannot learn it well, and this is thought to cause the drop in accuracy. With the proposed residual block, the identity mapping is easy to learn: the weights of the weight layers only need to go to 0. In practice the optimal mapping is rarely an exact identity, but the residual form still makes learning easier when the difference between $H(x)$ and $x$ is small.
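
The claim that a residual block can represent the identity just by zeroing its weights can be checked with a small sketch. The class `TinyResidualBlock` and its fully connected layers are assumptions made for illustration (the paper and the implementation in this article use convolutions); it only shows $y = F(x) + x$ with a two-layer $F$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyResidualBlock(nn.Module):
    """Computes y = F(x) + x with F(x) = W2 relu(W1 x) (biases omitted)."""

    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim, bias=False)
        self.fc2 = nn.Linear(dim, dim, bias=False)

    def residual(self, x):
        # The part learned by the weight layers, F(x).
        return self.fc2(F.relu(self.fc1(x)))

    def forward(self, x):
        # Shortcut connection: add the input back to the residual.
        return self.residual(x) + x


torch.manual_seed(0)
block = TinyResidualBlock(dim=8)
x = torch.randn(4, 8)

with torch.no_grad():
    # With random weights the block is generally not the identity.
    differs_at_init = not torch.equal(block(x), x)
    # Setting the last weight layer to zero makes F(x) = 0, so y = x exactly.
    block.fc2.weight.zero_()
    is_identity = torch.equal(block(x), x)

print(differs_at_init, is_identity)
```

A plain stack of the same two layers without the shortcut would have to fit $W_2\sigma(W_1x) = x$ to achieve the same thing, which is the harder problem the text describes.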

Details of the residual block

When the output of a residual block is $y$, the block can be expressed by the following operation.

$$y = F(x) + x$$

$F(x)$ is realized by combining two or more layers. For example, the block in the figure above has two layers, so with $\sigma$ denoting the ReLU function,

$$F(x) = W_2\sigma(W_1x)$$

(the bias terms are omitted here). The ReLU non-linearity of the second layer is applied after the addition, so the block actually computes

$$y = \sigma(W_2\sigma(W_1x) + x)$$

If the number of channels of $x$ and $F(x)$ does not match, the shortcut is adjusted either by padding with zeros or by applying a 1x1 convolution.
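
The two ways of matching channel counts can be compared side by side. This is a minimal sketch, not the code from the repository: the helper name `shortcut_zero_pad`, the channel counts 16 and 32, and the stride of 2 are assumptions chosen to match the case where the feature map is halved while the channels double.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def shortcut_zero_pad(x, out_channels):
    """Parameter-free shortcut: subsample spatially by 2, then append zero channels."""
    x = x[:, :, ::2, ::2]
    extra = out_channels - x.shape[1]
    # F.pad takes pairs starting from the last dimension:
    # (W_left, W_right, H_top, H_bottom, C_front, C_back)
    return F.pad(x, (0, 0, 0, 0, 0, extra))


# Projection shortcut: a strided 1x1 convolution that learns the channel mapping.
shortcut_projection = nn.Conv2d(16, 32, kernel_size=1, stride=2, bias=False)

x = torch.randn(2, 16, 32, 32)
a = shortcut_zero_pad(x, out_channels=32)
b = shortcut_projection(x)
print(a.shape, b.shape)  # both shortcuts produce the same output shape
```

Zero padding adds no parameters, while the 1x1 convolution adds `16 * 32` weights per shortcut in this example.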

Overall network structure

(Figure: image.png) This network diagram shows the networks for the ImageNet classification problem. The rightmost one is the proposed method. The network is designed on the principle that layers with the same feature map size have the same number of channels, and that when the feature map size is halved, the number of channels is doubled. The feature map is reduced mainly by setting the stride of the first convolution at the new resolution to 2; pooling is used only at the beginning and at the end of the network. (The pooling before the final FC layer is global average pooling.) Batch normalization is applied immediately after each convolution.
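
The paper motivates the halve-the-map, double-the-channels rule as a way to keep the cost of each layer roughly constant. The arithmetic can be sketched as follows; the function name and the list of stage sizes are illustrative assumptions, and only 3x3 convolutions with equal input and output channels are counted.

```python
def conv_multiply_adds(height, width, in_channels, out_channels, kernel_size=3):
    """Multiply-add count of one stride-1 convolution that preserves the map size."""
    return height * width * in_channels * out_channels * kernel_size * kernel_size


# (feature map side, channels) pairs that follow the halve/double rule (illustrative)
stages = [(56, 64), (28, 128), (14, 256), (7, 512)]
costs = [conv_multiply_adds(side, side, ch, ch) for side, ch in stages]

for (side, ch), cost in zip(stages, costs):
    print(f"{side}x{side} map, {ch} channels: {cost} multiply-adds")
```

Halving the side length divides the number of positions by 4, and doubling both input and output channels multiplies the work per position by 4, so the two effects cancel.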

Implementation differences for each dataset

CIFAR-10

For CIFAR-10 image classification, the network structure is slightly different because the input images are much smaller than those of ImageNet. The first layer is a single 3x3 convolution, followed by the residual blocks. There are $6n + 1$ convolution layers in total, broken down as shown in the table below, which gives $3n$ residual blocks.

(Table: image.png)

Global average pooling is performed after these $6n + 1$ layers, and classification is done by a fully connected layer with 10 outputs. The experiments use $n = \{3, 5, 7, 9\}$, which gives networks of $20, 32, 44, 56$ layers, respectively. When the number of channels differs, the identity mapping is added after filling the missing channels with zeros.
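
The layer counts quoted above follow from simple arithmetic: one initial convolution, three groups of $n$ residual blocks with two convolutions each, and one final FC layer. A short sketch (the function name is an assumption for illustration):

```python
def resnet_cifar_depth(n):
    """Number of weighted layers in the CIFAR-10 network for a given n."""
    first_conv = 1
    groups = 3             # three groups of residual blocks
    convs_per_block = 2    # each residual block has two 3x3 convolutions
    conv_layers = first_conv + groups * n * convs_per_block  # 6n + 1
    fc_layers = 1
    return conv_layers + fc_layers  # 6n + 2


depths = {n: resnet_cifar_depth(n) for n in (3, 5, 7, 9)}
print(depths)
```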

Reproduction implementation

Of the datasets covered in the paper, I implemented the code for the CIFAR-10 classification problem in PyTorch. The entire source code is available on GitHub.

Code

Only the most important part, the model definition, is shown here. A residual block is implemented as the class ResNetCifarBlock, and the helper function make_resblock_group, which creates a group of blocks with the same number of channels, keeps the code concise and extensible. Where the feature map size and the number of channels change, the shortcut is first subsampled spatially and then padded with zero channels.

nets.py


import torch
import torch.nn as nn
import torch.nn.functional as F


class ResNetCifarBlock(nn.Module):
    def __init__(self, input_nc, output_nc):
        super().__init__()
        stride = 1
        self.expand = False
        if input_nc != output_nc:
            assert input_nc * 2 == output_nc, 'output_nc must be input_nc * 2'
            stride = 2
            self.expand = True

        self.conv1 = nn.Conv2d(input_nc, output_nc, kernel_size=3, stride=stride, padding=1)
        self.bn1 = nn.BatchNorm2d(output_nc)
        self.conv2 = nn.Conv2d(output_nc, output_nc, kernel_size=3, stride=1, padding=1)
        self.bn2 = nn.BatchNorm2d(output_nc)

    def forward(self, x):
        # F(x): conv -> BN -> ReLU -> conv -> BN
        xx = F.relu(self.bn1(self.conv1(x)), inplace=True)
        y = self.bn2(self.conv2(xx))
        if self.expand:
            x = x[:, :, ::2, ::2]  # subsampling: keep every other pixel to match stride 2
            zero = torch.zeros_like(x)
            x = torch.cat([x, zero], dim=1)  # zero-pad channels (option A in the original paper)
        h = F.relu(y + x, inplace=True)  # add the shortcut, then apply ReLU
        return h


def make_resblock_group(cls, input_nc, output_nc, n):
    blocks = []
    blocks.append(cls(input_nc, output_nc))
    for _ in range(1, n):
        blocks.append(cls(output_nc, output_nc))
    return nn.Sequential(*blocks)


class ResNetCifar(nn.Module):
    def __init__(self, n):
        super().__init__()

        self.conv = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)
        self.bn = nn.BatchNorm2d(16)
        self.block1 = make_resblock_group(ResNetCifarBlock, 16, 16, n)
        self.block2 = make_resblock_group(ResNetCifarBlock, 16, 32, n)
        self.block3 = make_resblock_group(ResNetCifarBlock, 32, 64, n)
        self.pool = nn.AdaptiveAvgPool2d(output_size=(1, 1))  # global average pooling
        self.fc = nn.Linear(64, 10)

    def forward(self, x):
        x = F.relu(self.bn(self.conv(x)), inplace=True)
        x = self.block1(x)
        x = self.block2(x)
        x = self.block3(x)
        x = self.pool(x)
        x = x.view(x.shape[0], -1)
        x = self.fc(x)
        return x
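
As a sanity check on the model definition, the number of learnable parameters can be counted by hand and compared with the roughly 0.27M parameters the paper lists for the 20-layer network. The sketch below mirrors the layer layout of ResNetCifar without importing it; the function names are assumptions for illustration, and `conv_bias=True` reflects the default bias of `nn.Conv2d` used in the listing.

```python
def conv_params(in_ch, out_ch, kernel_size=3, bias=True):
    """Weights (and optional biases) of one convolution layer."""
    return in_ch * out_ch * kernel_size * kernel_size + (out_ch if bias else 0)


def bn_params(channels):
    """Learnable scale and shift of one batch normalization layer."""
    return 2 * channels


def resnet_cifar_params(n, num_classes=10, conv_bias=True):
    # Initial 3x3 convolution and its batch normalization.
    total = conv_params(3, 16, bias=conv_bias) + bn_params(16)
    in_ch = 16
    for out_ch in (16, 32, 64):
        for block_idx in range(n):
            # Only the first block of a group sees the previous channel count.
            block_in = in_ch if block_idx == 0 else out_ch
            total += conv_params(block_in, out_ch, bias=conv_bias) + bn_params(out_ch)
            total += conv_params(out_ch, out_ch, bias=conv_bias) + bn_params(out_ch)
        in_ch = out_ch
    # Final fully connected layer (weights + biases); the zero-padded shortcuts add nothing.
    total += in_ch * num_classes + num_classes
    return total


print(resnet_cifar_params(3))                   # layout of the listing (conv biases on)
print(resnet_cifar_params(3, conv_bias=False))  # same layout without conv biases
```

Both variants round to 0.27M for $n = 3$, so the conv biases make little difference to the total.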

Dataset used and various parameters

All settings follow the paper.

Dataset

CIFAR-10 is a dataset of 60,000 images in 10 classes; 50,000 of them were used for training and 10,000 for evaluation. The image size is 32x32.

Parameters
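
Since the settings follow the paper, the optimization schedule described there for CIFAR-10 can be sketched as below. The values are the ones stated in the paper [1] (SGD with momentum, step-wise learning rate decay); the dictionary `PAPER_SETTINGS` and the function `learning_rate` are illustrative names, not part of the repository.

```python
# Hyperparameters as described in the paper for the CIFAR-10 experiments.
PAPER_SETTINGS = {
    "batch_size": 128,
    "momentum": 0.9,
    "weight_decay": 1e-4,
    "base_lr": 0.1,
    "lr_milestones": (32_000, 48_000),  # iterations at which the rate is divided by 10
    "total_iterations": 64_000,
}


def learning_rate(iteration, settings=PAPER_SETTINGS):
    """Learning rate at a given training iteration under the step schedule."""
    drops = sum(iteration >= m for m in settings["lr_milestones"])
    return settings["base_lr"] / (10 ** drops)


for it in (0, 32_000, 48_000):
    print(it, learning_rate(it))
```

In PyTorch, these values would typically be passed to `torch.optim.SGD` and `torch.optim.lr_scheduler.MultiStepLR`, with the scheduler stepped per iteration rather than per epoch.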

Results

The results of training and evaluation with the above settings are as follows. The top-1 error rate is used as the evaluation metric, reported as the mean ± standard deviation over 5 runs.

| Method | n | Top-1 error rate (%) | Reported error rate (%) |
|---|---|---|---|
| ResNet-20 | 3 | 8.586 ± 0.120 | 8.75 |
| ResNet-32 | 5 | 7.728 ± 0.318 | 7.51 |
| ResNet-44 | 7 | 7.540 ± 0.475 | 7.17 |
| ResNet-56 | 9 | 7.884 ± 0.523 | 6.97 |

The deeper the network, the larger the variation in the error rate across runs. The means differ from the values reported in the paper, but they are generally close. (The paper does not say so, but I think the reported values are reasonable if they are the best of several runs.)
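
One way to make "generally close" concrete is to express each gap between the reproduced mean and the reported value in units of the run-to-run standard deviation. The sketch below uses only the numbers from the table above; the function name `gap_in_std` is an assumption for illustration.

```python
# (mean error %, std over 5 runs, error % reported in the paper), from the table above
results = {
    "ResNet-20": (8.586, 0.120, 8.75),
    "ResNet-32": (7.728, 0.318, 7.51),
    "ResNet-44": (7.540, 0.475, 7.17),
    "ResNet-56": (7.884, 0.523, 6.97),
}


def gap_in_std(mean, std, reported):
    """Signed gap between reproduced mean and reported value, in standard deviations."""
    return (mean - reported) / std


gaps = {name: gap_in_std(*values) for name, values in results.items()}
for name, gap in gaps.items():
    print(f"{name}: {gap:+.2f} std")
```

All four gaps stay within two standard deviations; ResNet-20 comes out slightly better than the reported value, and the gap grows for the deeper networks.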

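References

[1] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," CVPR 2016.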
