In this article, we introduce the paper "Deep Residual Learning for Image Recognition" (CVPR 2016) [1], which proposed the famous ResNet. In addition, I implemented the CIFAR-10 classification experiment described in the paper (code on GitHub) and ran a reproduction experiment. The ResNet proposed in this paper not only took first place in the ILSVRC 2015 classification task on ImageNet, but is also used in a wide variety of tasks because the proposed way of handling deep networks is both effective and versatile. (The number of citations exceeds 52,000 as of August 2020.)
Deep neural networks have traditionally been considered difficult to train. This paper makes deep networks easier to train by stacking many neural network blocks that learn the "residual" (the difference between the desired mapping and the input) instead of learning the desired mapping directly, greatly improving performance on various image recognition tasks.
In image recognition, it is known that the deeper the network, the more abstract ("deeper") the semantic features that can be extracted. However, simply stacking layers causes problems such as vanishing/exploding gradients, so the network parameters do not converge and training fails. This convergence problem has largely been solved by better weight initialization and normalization methods, but even when training converges, accuracy degrades as more layers are added (this is not overfitting: as the graph below shows, the training error also gets worse). (The figures are from the paper; the same applies below.)
The method proposed in this paper builds a deep network by stacking many small "residual blocks". Each residual block consists of several weight layers and an identity mapping (shortcut), as shown in the figure below. If $H(x)$ is the mapping the whole block should represent, the stacked weight layers learn $F(x) = H(x) - x$. This is why the method is called "residual" learning. This design addresses the problem mentioned at the end of the background, namely that accuracy drops as layers are added. In fact, it is difficult for a stack of non-linear layers to learn the identity mapping ($H(x) = x$), so in blocks where the identity mapping is (close to) the optimal solution, which becomes more common as the number of layers increases, that mapping cannot be learned well and accuracy drops. With the proposed residual block, however, the identity mapping can be learned easily by simply driving the weights of the weight layers to 0. In practice the optimal mapping is rarely the exact identity, but learning still becomes easier whenever the difference between $x$ and $H(x)$ is small.
When the output of a residual block is $y$, it is expressed as $y = F(x, \{W_i\}) + x$, where $F(x, \{W_i\})$ is the residual mapping learned by the stacked weight layers.
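For reference, the paper also defines a variant for the case where the dimensions of $F$ and $x$ do not match, in which a linear projection $W_s$ is applied on the shortcut:

$$
y = F(x, \{W_i\}) + W_s x
$$

In the CIFAR-10 experiments described below, dimension changes are instead handled by zero-padding the shortcut (option A in the paper).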
CIFAR-10
The image classification of CIFAR-10 uses a slightly different network structure because the input images are much smaller than those of ImageNet. The first layer is a single 3x3 convolution, followed by residual blocks. There are $6n + 1$ convolution layers in total, broken down as in the table below, which amounts to $3n$ residual blocks. Global average pooling is applied after these $6n + 1$ layers, and classification is performed with a 10-way fully connected layer. Experiments are conducted with $n = \{3, 5, 7, 9\}$, giving networks of $20, 32, 44, 56$ layers, respectively. Where the number of channels changes, the identity mapping (shortcut) is formed by padding the missing channels with zeros.

output map size | 32×32 | 16×16 | 8×8
---|---|---|---
number of layers | $1 + 2n$ | $2n$ | $2n$
number of filters | 16 | 32 | 64
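As a quick check of these numbers, here is a small illustrative snippet (not part of the repository) that prints the number of residual blocks and the total number of weighted layers (convolutions plus the final FC layer) for each $n$:

```python
# 6n convolutions inside residual blocks + 1 initial convolution + 1 FC layer = 6n + 2 weighted layers
for n in (3, 5, 7, 9):
    num_blocks = 3 * n          # residual blocks (each block has 2 conv layers)
    num_layers = 6 * n + 2      # total weighted layers, i.e. ResNet-20/32/44/56
    print(f'n={n}: {num_blocks} residual blocks, ResNet-{num_layers}')
```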
Of the datasets featured in the paper, I implemented the code that solves the CIFAR-10 classification problem using PyTorch. The entire source code has been posted on GitHub.
Only the most important part, the model definition, is shown here as well. The residual block is implemented as a single class, ResNetCifarBlock, and a generic function, make_resblock_group, which builds a group of blocks with the same number of output channels, keeps the code concise and extensible. Where the feature map size and the number of channels change, the input is first subsampled and the missing channels are then padded with zeros.
nets.py
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResNetCifarBlock(nn.Module):
    """A residual block for the CIFAR-10 ResNet: two 3x3 convolutions plus a shortcut."""

    def __init__(self, input_nc, output_nc):
        super().__init__()
        stride = 1
        self.expand = False
        if input_nc != output_nc:
            assert input_nc * 2 == output_nc, 'output_nc must be input_nc * 2'
            stride = 2
            self.expand = True

        self.conv1 = nn.Conv2d(input_nc, output_nc, kernel_size=3, stride=stride, padding=1)
        self.bn1 = nn.BatchNorm2d(output_nc)
        self.conv2 = nn.Conv2d(output_nc, output_nc, kernel_size=3, stride=1, padding=1)
        self.bn2 = nn.BatchNorm2d(output_nc)

    def forward(self, x):
        xx = F.relu(self.bn1(self.conv1(x)), inplace=True)
        y = self.bn2(self.conv2(xx))
        if self.expand:
            # Match the shortcut to the downsampled, widened feature map:
            # subsample the input, then pad the extra channels with zeros.
            x = F.interpolate(x, scale_factor=0.5, mode='nearest')  # subsampling
            zero = torch.zeros_like(x)
            x = torch.cat([x, zero], dim=1)  # option A in the original paper
        h = F.relu(y + x, inplace=True)
        return h


def make_resblock_group(cls, input_nc, output_nc, n):
    """Create a group of n residual blocks that share the same output channel count."""
    blocks = []
    blocks.append(cls(input_nc, output_nc))
    for _ in range(1, n):
        blocks.append(cls(output_nc, output_nc))
    return nn.Sequential(*blocks)


class ResNetCifar(nn.Module):
    """ResNet for CIFAR-10 with 6n+2 weighted layers (n = 3, 5, 7, 9 -> ResNet-20/32/44/56)."""

    def __init__(self, n):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)
        self.bn = nn.BatchNorm2d(16)
        self.block1 = make_resblock_group(ResNetCifarBlock, 16, 16, n)
        self.block2 = make_resblock_group(ResNetCifarBlock, 16, 32, n)
        self.block3 = make_resblock_group(ResNetCifarBlock, 32, 64, n)
        self.pool = nn.AdaptiveAvgPool2d(output_size=(1, 1))  # global average pooling
        self.fc = nn.Linear(64, 10)

    def forward(self, x):
        x = F.relu(self.bn(self.conv(x)), inplace=True)
        x = self.block1(x)
        x = self.block2(x)
        x = self.block3(x)
        x = self.pool(x)
        x = x.view(x.shape[0], -1)
        x = self.fc(x)
        return x
```
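As a quick sanity check (illustrative only, not part of the repository code), the model can be instantiated and run on a dummy batch:

```python
model = ResNetCifar(n=3)            # the 20-layer network
dummy = torch.randn(8, 3, 32, 32)   # a batch of 8 CIFAR-10-sized RGB images
logits = model(dummy)
print(logits.shape)                 # torch.Size([8, 10]), one score per class
```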
Everything else follows the settings described in the paper.
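For reference, here is a minimal sketch of the optimizer and learning-rate schedule described in the paper: SGD with momentum 0.9, weight decay 1e-4, mini-batch size 128, and an initial learning rate of 0.1 that is divided by 10 at 32k and 48k iterations, terminating at 64k iterations. The epoch-based milestones below are my own approximation of that iteration schedule and do not necessarily match the repository code:

```python
import torch.optim as optim

model = ResNetCifar(n=3)
criterion = nn.CrossEntropyLoss()
# SGD with momentum 0.9 and weight decay 1e-4, starting from lr = 0.1 (as in the paper)
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
# The paper divides the lr by 10 at 32k and 48k iterations and stops at 64k;
# with 50,000 training images and batch size 128 this is roughly epochs 82, 123 and 164.
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[82, 123], gamma=0.1)
```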
CIFAR-10 is a data set containing 60,000 images in 10 classes, and 50,000 of them were used for training and 10,000 for evaluation. The image size is 32x32.
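For reference, here is a data-loading sketch with torchvision that applies the augmentation described in the paper (4-pixel padding on each side, a random 32x32 crop, and random horizontal flip). The per-channel normalization constants are the commonly used CIFAR-10 statistics rather than the paper's per-pixel mean subtraction, and the root path and worker count are illustrative, so this does not necessarily match the repository code:

```python
import torchvision
import torchvision.transforms as transforms

cifar_mean = (0.4914, 0.4822, 0.4465)  # commonly used CIFAR-10 channel statistics
cifar_std = (0.2470, 0.2435, 0.2616)

train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),   # pad 4 pixels on each side and crop 32x32, as in the paper
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(cifar_mean, cifar_std),
])
test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(cifar_mean, cifar_std),
])

train_set = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=train_transform)
test_set = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=test_transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True, num_workers=2)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=128, shuffle=False, num_workers=2)
```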
The results of training and evaluation with the above settings are as follows. The Top-1 error rate is used as the evaluation metric, and each value is the mean ± standard deviation over 5 runs.
Method | $n$ | Top-1 error rate (%) | Reported error rate (%)
---|---|---|---
ResNet-20 | 3 | 8.586 ± 0.120 | 8.75
ResNet-32 | 5 | 7.728 ± 0.318 | 7.51
ResNet-44 | 7 | 7.540 ± 0.475 | 7.17
ResNet-56 | 9 | 7.884 ± 0.523 | 6.97
The deeper the network, the larger the variation in the error rate, and although the means differ somewhat from the values reported in the paper, they are generally close to the reported values. (Although the paper does not state it, the reported numbers would be reasonable if they are the best of several runs.)