At first

Hello, you belong to CodeNext, it is @ aiskoaskosd. I have been indebted to Chainer on a regular basis, so I thought it would be great if I could give back, so I wrote an article. Today, I will focus on the image recognition model that has become a hot topic in the last 1-2 years, publish the implementation, and explain the contents of some papers. Some papers before 2013 are also exceptionally implemented. ** 22 out of 24 models were implemented in Chainer. ** Unfortunately, as of December 22, all implementations and verification with cifar10 have not been completed. We will update it one by one. I think there are some misinterpretations and implementation mistakes. In that case, I would be very happy if you could tell me.

paper

1. Netowork In Network 2. Very Deep Convolutional Networks for Large-Scale Image Recognition 3. Going deeper with convolutions 4. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift 5. Rethinking the Inception Architecture for Computer Vision 6. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification 7. Training Very Deep Networks 8. Deep Residual Learning for Image Recognition 9. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning 10. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size 11. Identity Mappings in Deep Residual Networks 12. Resnet in Resnet: Generalizing Residual Architectures 13. Deep Networks with Stochastic Depth 14. Swapout: Learning an ensemble of deep architectures 15. Wide Residual Networks 16. FractalNet: Ultra-Deep Neural Networks without Residuals 17. Weighted Residuals for Very Deep Networks 18. Residual Networks of Residual Networks: Multilevel Residual Networks 19. Densely Connected Convolutional Networks 20. Xception: Deep Learning with Depthwise Separable Convolutions 21. Deep Pyramidal Residual Networks 22. Neural Architecture Search with Reinforcement Learning 23. Aggregated Residual Transformations for Deep Neural Networks 24. Deep Pyramidal Residual Networks with Separated Stochastic Depth

The number of parameters is basically calculated from the model actually implemented, and the total weight excluding the bias term of the convolution layer and FC layer is calculated.
The date is the posting date of the first edition of arxiv.
How to read the table is as follows. $ \ phi $ (cifar 10 total accuracy implementation): Implementation is not finished yet $ \ times $ (cifar 10 total accuracy implementation): Implementation verification not yet $ \ bigtriangleup $ (cifar 10 total accuracy implementation): I verified the implementation, but did not reach the accuracy described in the paper. $ \ bigcirc $ (cifar 10 total accuracy implementation): Implementation verification results almost match the accuracy described in the paper $ \ times $ (cifar 10 total accuracy paper): No description in the paper $ \ times $ (imagenet top-5 error): Not mentioned in the paper

Introductory paper	date	model	Number of parameters(10^6)	cifar10 total accuracy paper(%)	cifar10 total accuracy implementation	imagenet top-5 error(%)
1	131116	Caffe implementation reference	0.1	91.19	\bigtriangleup(90.10)	\times
1	131116	Caffe implementation reference with BN	0.1	No paper exists	91.52%	No paper exists
2	140904	Model A	129	\times	92.1(Model A)	6.8(Model E)
3	140917	googlenet	6	\times	91.33%	6.67
4	150211	inceptionv2	10	\times	94.89%	4.9
5	151202	inceptionv3(reference)	22.5	\times	94.74%	3.58
6	150206	model A	43.9(global average pooling instead of spp)	\times	94.98%	4.94
7	150722	Highway(Fitnet19)	2.8	92.46	\bigcirc(93.35%,However, BN is attached and the configuration of the higway part is different.)	\times
8	151210	ResNet110	1.6	93.57	\bigcirc(93.34%)	3.57
9	160223	inception v4	\times	\times	\phi	3.1
10	160224	Squeezenet with BN	0.7	82%(alexnet without data augmentation)	\bigcirc(92.63%)	17.5(withoutBNandsingle)
11	160316	ResNet164	1.6	94.54	\bigcirc(94.39%)	4.8(single)
12	160325	18-layer + wide RiR	9.5	94.99	\bigtriangleup(94.43%)	\times
13	160330	ResNet110	1.7	94.75	\bigcirc(94.76%)	\times
14	160520	Swapout v2(32)W×4	7.1	95.24	\bigcirc(95.34%)	\times
15	160523	WRN28-10	36.2	96.0	\bigcirc(95.76%)	\times
16	160524	20 layers	33.7	95.41	\bigtriangleup(93.77%)	7.39%(FractalNet-34)
17	160528	WResNet-d	19.1	95.3	\times	\times
18	160809	RoR-3-WRN58-4	13.6	96.23	\times	\times
19	160825	k=24, depth=100	27.0	96.26	\bigcirc 95.12%(k=12, depth=40)	\times
20	161007	xception	\times	\times	\phi	5.5(single)
21	161010	\alpha = 270	28.4	96.23	\bigcirc(95.9%)	4.7(\alpha = 450)
22	161105	depth=49	32(From the dissertation)	96.16	$\bigtriangleup$90.35(Appendix A: 4.1M)	\times
23	161116	ResNeXt-29, 16×64d	68.3	96.42	\bigcirc(95.72%: 2x64d)	\times
24	161205	depth=182, \alpha=150	16.5	96.69	\times	\times

Implementation

https://github.com/nutszebra/chainer_image_recognition

Be careful when using a model that has not been verified

Numerical sense of cifar10

At the moment, I think that 97% or more will be SoTA. As far as I know, the highest accuracy is 96.69%. It may be time to focus on cifar100 or another dataset.

Recent trends

I think this year was the year of the Resnet family. The characteristic point is that depth = accuracy is over. Although googlenet and others have been insisting for a long time, various papers have shown that ** if it is deep to some extent, the accuracy will be higher if the width is wider than the depth **. From around March, the result that it is better to widen the width of Resnet has come out as a side effect, and I think that it became decisive in Wide Residual Netowrks released on May 23. I think it is clear this year that width is important. Looking at the paper from a bird's-eye view, it seems that the Resnet family was mostly a ** Res block modified **.

Modify the Res block: 11/12/14/15/17/20/21/23
Other than that: 9/13/16/18/19/22/24

It's hard to tell which is the best derivative of this Res block. As everyone thinks, in the paper, the number of parameters and FLOPS of forward are different in each model, so it does not make much sense to simply compare the accuracy. Therefore, it is difficult to understand which method is essential and in the right direction even after reading the paper. I think there is currently a need for a rule that everyone builds a model with a certain metric tied up and then posts the test accuracy of a single model (like FLOPS?). All the papers are in a difficult state to evaluate, but I think the overall tendency is that ** ReLU is not applied to the final output value of the Res block **. I think it is better to base the Res block of BN-ReLU-Conv-BN-ReLU-Conv proposed in 11 instead of the original 8 at present. It's a personal impression, but it seems like a calm year when the accuracy has improved steadily. I don't think there are any new structures coming out this year, such as residual. Imagenet 2016 also saw a lot of ensembles based on Residual Networks and Inception.

Impressions

I personally think that the google paper from 22 is very shocking and there is a little possibility that something will develop from here (there is no basis). 22 is searching the network with 800 gpu and RNN + policy gradient, and has recorded 96.16% with cifar10. If mnist is 99% with this, it's over, but it is very amazing that cifar 10 has a value close to SoTA. Letting the data determine the network structure is a very fascinating idea, reminiscent of the advent of DNN (making feature design data). Also, I didn't introduce it here, but I remember that HyperNetworks was very interesting and shocked when I read it. HyperNetworks is a network that generates network weights, and is a technology that is likely to be fused or developed in the future. If there is nothing, it seems that the way of remodeling the Res block and connecting the Res block will develop in the future, but what will happen?

Introduction of some papers

4. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift Batch normalization is, in a nutshell, "normalize each input between batches at the channel level". This Batch Normalization is very useful. There is no reason not to be on the network. Convergence is no longer possible and accuracy is slightly higher. Batch Normalization is important, so I will explain it firmly. First, I will explain the phenomenon called internal covariate shift, which is the basis of the motivation for Batch Normalization, and the specific algorithm.

Internal covariate shift

Suppose the weight of layer $ l_ {i} $ is changed by the error backpropagation method. Then, the distribution of the output value (the nature of the output) that is output using that weight changes. Layer $ l_ {i + 1} $ must learn the proper nonlinear map, corresponding to the distribution of the changed output values. What's wrong here is that learning a non-linear map for discrimination is very slow because of the great effort it takes to learn to adapt to the changing distribution of outputs. This phenomenon is defined in the paper as internal covariate shift. This internal covariate shift causes learning to stagnate. Let's assume that the learning coefficient of SGD is set large. Then, the weight of layer $ l_ {i} $ changes greatly. As a result, layer $ l_ {i + 1} $ cannot adapt to the changed output value (or it takes a tremendous amount of time to adapt), and learning stagnates. As a result, the learning coefficient must be set small in order to learn, and this time learning will slow down the convergence of learning. If you stand there, it feels like you can't stand here. This internal covariate shift becomes a more serious problem as the model gets deeper. Even if the output change in the lower layer is slight, it will be amplified in the upper layer, and even a small change will become a big change. It's like a butterfly effect. The solution to this is very simple. If the distribution of output values changes, you can adjust it each time so that the distribution is the same. BN (Batch Normalization) normalizes the input (mean 0, variance 1) and outputs it. Therefore, when the output of BN is used as an input, the distribution of the output becomes stable and the need for learning to respond to changes in the output distribution is reduced. You can focus on learning the non-linear maps that you essentially have to do, and as a result the learning converges quickly. Furthermore, since the distribution is stable, the learning coefficient can be set large. This also contributes significantly to the rapid convergence of learning. If you include this BN, the learning time of GoogLeNet will be about 7%. It's amazing.

Batch Normalization algorithm

Input is $ x_ {i, cxy} $. This means that the input is in the $ (x, y) $ position of channel $ c $ in batch $ i $. Let the mean on channel $ c $ be $ \ mu_ {c} $ and the variance on channel $ c $ be $ \ sigma_ {c} ^ 2 $. Then, when the number of batches is $ m $, the input height is $ Y $, and the input width is $ X $, $ \ mu_ {c} $ and $ \ sigma_ {c} ^ 2 $ can be expressed as follows. I will.

\begin{equation} \mu_{c} = \frac{1}{mXY}\sum_{i=1}^{m} \sum_{y=1}^{Y} \sum_{x=1}^{X} x_{i,cxy} \tag{4.1} \end{equation}

\begin{equation} \sigma_{c}^2 = \frac{1}{mXY}\sum_{i=1}^{m} \sum_{y=1}^{Y} \sum_{x=1}^{X} (x_{i,cxy} -\mu_c)^2 \tag{4.2} \end{equation}

Looking at equations (4.1) and (4.2), we can see that the mean $ \ mu_ {c} $ and the variance $ \ sigma_ {c} ^ 2 $ are calculated for each channel between batches. This is not learning. Next, let $ \ hat {x_ {i, cxy}} $ be the normalized version of each input $ x_ {i, cxy} $. Here we define the scale $ \ gamma_ {c} $ and the shift $ \ beta_ {c} $ for each channel. $ \ Gamma_ {c} $ and shift $ \ beta_ {c} $ are parameters learned by the error backpropagation method. The reason for introducing such a thing will be described later. If you enter $ x_ {i, cxy} $ in Batch Normalization, the final output $ y_ {i, cxy} $ and the normalized value $ \ hat {x_ {i, cxy}} $ Is defined as follows.

\begin{equation} \hat{x_{i,cxy}} = \frac{x_{i,cxy} - \mu_{c}}{\sqrt{\sigma_{c}^2 + \epsilon}} \tag{4.3} \end{equation}

\begin{equation} y_{i,cxy} = BN(x_{i,cxy}) = \gamma_{c} \hat{x_{i,cxy}} + \beta_{c} \tag{4.4} \end{equation}

You can see that equation (4.3) simply normalizes $ x_ {i, cxy} $ using equations (4.1) and (4.2) (variance $ \ sigma_ {c} ^ 2 $ is 0). If $ \ hat {x_ {i, cxy}} $ becomes infinite, add a small number $ \ epsilon $. Chainer uses $ 2.0 \ times 10 ^ {-5} $ as the default value It is.). The question here is equation (4.4), where I am linearly mapping with $ \ gamma_ {c} $ and shift $ \ beta_ {c} $, but equation (4.3) already normalizes the main enclosure. I'm done, what are you doing? Let $ \ gamma_ {c} = \ sigma_ {c}, \ beta_ {c} = \ mu_ {c} $. Then, when $ \ epsilon $ is small and ignored, $ y_ {i, cxy} $ will be as follows. $\begin{equation} y_{i,cxy} = \gamma_{c} \hat{x_{i,cxy}} + \beta_{c} = \sigma_{c} \times \frac{x_{i,cxy} - \mu_{c}}{\sqrt{\sigma_{c}^2}} + \mu_{c} = x_{i,cxy} \tag{4.5} \end{equation}$ Looking at equation (4.5), the normalized $ \ hat {x_ {i, cxy}} $ is $ \ gamma_ {c} = \ sigma_ {c} ^ 2, \ beta_ {c} = \ mu_ {c } $ Returns the original input $ x_ {i, cxy} $. By introducing $ \ gamma_ {c} $ and shift $ \ beta_ {c} $, this aims to retain significant features that disappear due to normalization. The only parameters learned by Batch Normlization are scale $ \ gamma_ {c} $ and shift $ \ beta_ {c} $. $ x_ {i, cxy}, \ mu_ {c}, \ sigma_ {c} ^ 2, \ hat {x_ {i, cxy}}, \ gamma_ {c}, \ beta_ {c} $ are differentiable. .. The derivation is in the paper. When I confirmed it, it was certainly differentiable. If you are interested, please try to derive it. Batch Normalization can be described in one line with Chainer. It's easy and very good.

5. Rethinking the Inception Architecture for Computer Vision The network itself is a graceful extension of googlenet. The good thing about this paper is that it verbalizes the network design policy. Especially before ** downsampling, it is very important to increase the number of channels on the network **.

6. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification This is a ReLU extension of xavier initialization, which is the standard initialization method of the current DNN. It is also called ReLU initialization, msra initialization, He initialization, etc. This xavier initialization is a very revolutionary method of calculating the weight variance so that the output variance value does not change during the forward. Thing. As data-driven initialization, All you need is a good init · Data-dependent Initializations of Convolutional Neural Networks /1511.06856) etc. were proposed this year, but this is also an extension of xavier initialization. I basically use the variance values calculated by msra initialization to generate diagonalized random matrices and use them as the initial values of the weights. I am. This paper proposes PReLU, which is an extension of ReLU other than initialization. The content is simple, such as changing the x <0 part to ax instead of 0 like ReLU. At this time, a is learned by the error back propagation method. The interesting thing about this paper is the learned value of a. The result is: It is a large a in the initial layer, and becomes a small value in the upper layer. It seems that the initial layer retains the information and discards it as it goes up. It can be seen that the initial layer retains linearity and becomes non-linear as it goes up. This result is very close to CReLU. CReLU is a non-linear function of the idea of outputting a concatenation of ReLU (x) and ReLU (-x). Another interesting point is that the value of a increases before downsampling (pool). Information is dropped in the pool, so it looks like you're trying not to drop it. Looking at the value of a, I feel like I can understand the feelings of CNN, and I like it.

7. Training Very Deep Networks It's called Highway networks. Highway networks are mathematically as follows. $y=H(x, W_h)T(x, W_T)+xC(x, W_c)$ $ H (x, W_h) $ is a normal nonlinear function. $ T (x, W_T) $ is a non-linear function, $ C (x, W_c) $ Is a function that calculates how much to load the input x. Residual Networks is a simplification of $ T (x, W_T) = 1, C (x, W_c) = 1 $. Therefore, it is often said that Residual Networks is a simplification of Highway Networks. In the paper, we are building a network in the form of $ C (x, W_c) = 1-T (x, W_T), 0 \ le T (x, WT) \ le 1 $. The interesting thing about this paper is that it was observed that the accuracy of the network hardly changed even if the upper layer of the learned network was removed. In fact, the same phenomenon has been confirmed in Residual Networks, and it is known that the accuracy decreases when too many upper layers are tampered with, but the accuracy does not change even if several layers are removed or shuffled [[. Residual Networks Behave Like Ensembles of Relatively Shallow Networks]. I was surprised when I learned this.

8. Deep Residual Learning for Image Recognition It is a network that became the champion of ILSVRC2015 in the recognition department. It has a simplified structure of Highway Netoworks (7), and the characteristic part is called residual. Most of the networks that came out this year will be based on this and improved. The figure above clearly shows the residual structure. Simply add the input x to F (x) with the nonlinear function applied. This F is composed of several layers of conv, BN, and ReLU. Below is an overview of the network.

10. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size This squeezenet is a very cospa model. A module called fire module has been proposed, and a network is built using it. The fire module looks like this: The idea of the fire module is very simple, such as inputting the output of 1x1conv with dimensionality reduction (reducing the number of input channels in the next stage) to 3x3conv and 1x1conv, and concatenating the respective outputs. If the network in the paper is given BN, the weight number is about 0.7M and 92.6% can be obtained with cifar10. If you're solving 2-100 class image recognition tasks in practice, this is pretty much a piece. 11. Identity Mappings in Deep Residual Networks A paper that has tried several types of Res blocks and has shown that BN-ReLU-Conv-BN-ReLU-Conv was good. Generally speaking, Residual Networks now refers to this.

13. Deep Networks with Stochastic Depth It's called stochastic depth, and it's a regularization method that stochastically drops out Res blocks. This has already been adopted in multiple paper models and seems to have had some effect. However, there is a report that the normal stochastic depth was not effective at 24, so the evaluation for stochastic depth is tentative and should be watched. The method of decreasing the dropout probability for the Res block from the lower layer to the upper layer seems to be the most accurate. In the paper, set a drop probability of about 1 (do not drop) in the bottom layer and about 0.5 (drop with a 50% probability) in the upper layer, and apply the drop probability calculated linearly to those values to the middle layer. The way to do it is taken in the paper and the result is the best.

14. Swapout: Learning an ensemble of deep architectures This is a method to drop out independently the residual part of the Res block and the output value of the nonlinear function. It is expressed by a mathematical formula as follows. $y=p_{1}F(x)+ p_{2}x$ $ y $ is the output of the Res block, $ x $ is the input, $ F $ is the nonlinear function, $ p_ {1}, p_ {2} $ is the Bernoulli distribution and takes an output value of 0 or 1. What this means is that the figure below is intuitive. E in the figure is swap out, but as you can see, you can see that the output value is 0, x, F (x), F (x) + x. The above formula only expresses that. swapout is a network that applies this to all Res blocks. Assuming that p outputs 1 with a probability of T and 0 with a probability of 1-T, the pattern of increasing the value of T in the lower layers and decreasing it in the upper layers works well.

15. Wide Residual Networks The paper says that if you increase the width of the Res block instead of making it deeper, the accuracy will increase. The entire network looks like the figure below.

The network configuration is determined by the number of blocks N and the width parameter k. The pattern with 4 for N and 10 for k works best in the paper. Of course, this alone is not a paper, so I am also verifying which Res block will give accuracy. The conclusion shows that it was good to use two 3x3 convs inside the Res block. It was interesting to find that the performance deteriorates in the case of stacking 1, 3 or 4 instead of stacking 2 3x3conv. It doesn't work with cifar10, but for tasks such as cifar100, inserting dropout between convs in the Res block seems to improve the accuracy. If you set it to wide, the learning speed will be slower (8 times the learning speed of Resnet-1001), and it is possible to learn even if you use up to 5 times the number of parameters compared to normal Resnet. Learning is really fast when you try to learn with the model you actually built. With cifar10, the accuracy of about 60% comes out at the 1st epoch.

16. FractalNet: Ultra-Deep Neural Networks without Residuals 16 is a fractal net, which has no residual and is characterized by a structure that takes an average value and a fractal network. The title has without Residuals to take the average value. The fractal structure is easy to understand in the figure below. Configure your network with the Fractal Expansion Rule as shown. In this paper, the mean value is applied to the output from the two convs (join layer in the figure). I am very interested in what happens to the accuracy of the network when joining is retained (in the paper, only the average is applied to join). The title of the dissertation says without residuals, but I feel that averaging is essentially the same as residual. The author claims it is different, but I personally have doubts. ~~ What I found interesting when learning networks was that learning stagnated for the first 20 epochs (described in the paper). This is a behavior I haven't seen much. ~~ When I increased the learning coefficient, I learned from the 1st epoch (tears). 17. Weighted Residuals for Very Deep Networks When H is a nonlinear function, x is an input, $ \ alpha $ is a learning parameter, and y is the output of a Res block, the main idea of this paper is to define a Res block like the following formula. $y=\alpha H(x)+ x$ The diagram is as follows. The interesting thing about this paper is the value of $ \ alpha $ learned by the error backpropagation method. Looking at the figure, it can be seen that the value of $ \ alpha $ tends to be large in the upper layer. It seems that the non-linearity is actively utilized in the upper layer because H (x) is greatly added. However, the output value of H (x) may be very small, and there is no verification of that part in the paper. The figure above shows how $ \ alpha $ changed. What is interesting is that it satisfies the symmetry. I don't know why this happens, but it's very interesting.

18. Residual Networks of Residual Networks: Multilevel Residual Networks It's a simple idea to put skip connections from other layers as well. It is easy to see the figure below. This is the only idea.

19. Densely Connected Convolutional Networks 19 is Dense net, which is characterized by connecting the output to the input and putting it in the block. The output recursively becomes the next input, as shown in the figure below. It seems to be doing essentially the same thing as a Residual network, except that it recursively concatenates the outputs instead of the residual.

20. Xception: Deep Learning with Depthwise Separable Convolutions It is a network configured by replacing the conv of the Res block with a separable conv. separable conv is a channel wise conv and 1x1 conv applied in order, and has the feature that the number of weights can be reduced. A channel wise conv is a conv that does not look at the correlation between channels, and is a conv that outputs without summing after calculating the convolution window. As claimed in this paper, it was observed that a vgg-based network could be comparable to inception v4 without a residual structure. However, it is also verified in this paper that the convergence is faster with the residual structure. Is it a feeling that the residual structure should be silently put in?

21. Deep Pyramidal Residual Networks The simple idea is to gradually increase the number of output channels in the Res block. In general Residual Networks, when applying stride2 conv as a down sample in the Res block, the number of channels is doubled in advance. pyramidal networks will be a network like d in the figure, which will gradually increase the number of channels instead of rapidly increasing the number of channels before downsample.

22. Neural Architecture Search with Reinforcement Learning This is a paper that google uses 800 gpu and RNN to generate a network. Learn the RNN network that spits out the appropriate CNN network from the error backpropagation method using the accuracy of the validation data and the policy gradient. What is surprising is that cifar 10 is as accurate as SoTA. It seems that the reason why it worked was that the generated network was simplified (filter is only 3,5,7, etc.) and that the generated network was simply evaluated by SGD. The generated network is very interesting. The following is the generated lightweight network. The points where the arrows meet are connected. There is no residual structure. Similar in configuration to the 19 dense net. Since the data determines the network structure, it is a very complex input that humans cannot understand. The interesting thing is that removing or increasing this arrow reduced the test accuracy. I feel that the more freedom you give to network generation, the more accurate it will be. However, if the degree of freedom is too high, it seems that the network cannot be generated properly, so hardware for adjustment work and experimentation is essential. I think there are only a few environments in the world where you can verify this experiment. .. ..

23. Aggregated Residual Transformations for Deep Neural Networks In a network called ResNeXt, replace the Res block on the left in the figure above with something on the right. The Res block on the right of the above figure is equivalent to (a), (b), (c) in the figure below. Since there is no group conv in chainer, I implemented it in (b). In the paper, the number of input branches in the Res block is called cardinality, and increasing the cardinality and block width appropriately is compared with Wide Residual Networks, which has the same number of parameters and simply increased the block width. It claims to be accurate.

24. Deep Pyramidal Residual Networks with Separated Stochastic Depth 24 is 21 pyramidal net with separated stochastic depth applied. Separated stochastic depth is the idea of applying stochastic depth independently to the part where the channel increases. It looks like the figure below.

Image recognition model using deep learning in 2016