A convolutional neural network (CNN) is a neural network specialized for images. Whereas an ordinary multi-layer perceptron consists of an input layer, intermediate (hidden) layers, and an output layer, a CNN additionally has convolution layers, pooling layers, and local response normalization (LRN) layers.
Looking at AlexNet (the winner of ILSVRC 2012) in the Chainer examples, it looks like this:
```python
# … (Omitted)
import chainer
import chainer.functions as F
import chainer.links as L


class AlexBN(chainer.Chain):

    """Single-GPU AlexNet with LRN layers replaced by BatchNormalization."""

    insize = 227

    def __init__(self):
        super(AlexBN, self).__init__(
            conv1=L.Convolution2D(3, 96, 11, stride=4),
            bn1=L.BatchNormalization(96),
            conv2=L.Convolution2D(96, 256, 5, pad=2),
            bn2=L.BatchNormalization(256),
            conv3=L.Convolution2D(256, 384, 3, pad=1),
            conv4=L.Convolution2D(384, 384, 3, pad=1),
            conv5=L.Convolution2D(384, 256, 3, pad=1),
            fc6=L.Linear(9216, 4096),
            fc7=L.Linear(4096, 4096),
            fc8=L.Linear(4096, 1000),
        )
        self.train = True

    # … (Omitted)
```
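As a quick sanity check (a minimal sketch I am adding, not part of the original example; it assumes the class above is defined, Chainer is installed, and uses an all-zero dummy input), the first convolution can be applied on its own to verify the output shape:

```python
import numpy as np

# Feed a dummy batch through conv1 only (the full forward pass is omitted above).
model = AlexBN()
x = np.zeros((1, 3, model.insize, model.insize), dtype=np.float32)
h = model.conv1(x)
print(h.data.shape)  # (1, 96, 55, 55), since (227 - 11) // 4 + 1 = 55
```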
It consists of five convolution layers and three fully connected layers. The paper notes that the ReLU activation function, multi-GPU training, LRN, and overlapping pooling are important. The Chainer example above uses batch normalization instead of LRN. Batch normalization is described in detail in the following article:
- Batch Normalization mechanism and its intuitive understanding
```python
L.Convolution2D(3, 96, 11, stride=4)
```
The formula for convolution appears in many reference books, so I will omit it here; the goal is to be able to use this function.
First, convolution filters an image to transform it. Like images, filters have a pixel size and a number of channels. If an image has $N \times N$ pixels and $K$ channels, its size is written as $N \times N \times K$; likewise, the size of a filter is written as $H \times H \times K$. The number of channels of the image and the filter must be the same.
Multiple kinds of filters may be applied to the same image. With $M$ kinds of filters, the number of channels in the output image becomes $M$. When a filter is applied, it is slid across the image, and the step of this movement is called the stride. A larger stride makes it easier to miss image features, so a smaller stride is generally preferable. Providing virtual pixels outside the edge of the image is called padding. Padding suppresses the shrinkage of the image caused by convolution: with padding $P$ and stride $S$, the output has $\lfloor (N + 2P - H) / S \rfloor + 1$ pixels per side, so if you want the output to be the same size as the input (at stride 1), set the padding to $H / 2$ rounded down.
To summarize, the arguments of L.Convolution2D() are:

- First argument: number of input channels, i.e. $K$.
- Second argument: number of output channels, i.e. $M$.
- Third argument: filter size, i.e. $H$.
- Keyword argument stride: stride width. (Smaller is better; it is a trade-off with the input image size.)
- Keyword argument pad: padding width. (Often $H / 2$ rounded down.)

Note that the number of pixels in the input/output image is not specified anywhere.
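As a small illustration (a sketch I am adding, not from the original article), the output size formula above can be checked against the conv1 and conv2 settings:

```python
def conv_output_size(n, h, stride=1, pad=0):
    """Number of output pixels per side for an n x n input,
    an h x h filter, the given stride, and the given padding."""
    return (n + 2 * pad - h) // stride + 1

# conv1: 227 x 227 input, 11 x 11 filter, stride 4, no padding -> 55
print(conv_output_size(227, 11, stride=4))        # 55
# a 5 x 5 filter with pad=2 (= 5 // 2) preserves the size at stride 1
print(conv_output_size(55, 5, stride=1, pad=2))   # 55
```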
```python
L.BatchNormalization(96)
```
The argument is the number of channels of the image to be normalized, which equals the number of output channels of the preceding convolution ($M$).
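To make the per-channel behavior concrete, here is a minimal NumPy sketch of the normalization step (my own illustration, not Chainer's actual implementation; the learned scale gamma and shift beta are left at their initial values of 1 and 0):

```python
import numpy as np

def batch_norm_2d(x, eps=2e-5):
    """Normalize each channel of an NCHW batch to zero mean and unit
    variance over the batch and spatial axes (gamma=1, beta=0)."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.randn(8, 96, 55, 55).astype(np.float32) * 3 + 1
y = batch_norm_2d(x)
print(y.mean(), y.std())  # approximately 0 and 1
```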
```python
h = self.bn1(self.conv1(x), test=not self.train)
h = F.max_pooling_2d(F.relu(h), 3, stride=2)
```
The result of the convolution layer is passed through ReLU, and then pooling is applied. Pooling looks at a local region, much like a filter does, and outputs a single representative value for that region according to some rule. This provides a degree of position invariance. There are several variations of pooling:
- Average (mean) pooling: takes the average of the values in the region.
- Max pooling: takes the maximum value in the region.
AlexNet uses max pooling.
To summarize, the arguments of F.max_pooling_2d() are:

- First argument: the input image.
- Second argument: the size of the pooling region. If the region is made too large, accuracy will drop.
- Third argument (stride): stride width, usually 2 or more.

As before, the number of input/output pixels is not specified.
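For intuition, here is a naive NumPy sketch of max pooling that I am adding (not Chainer's implementation, which also handles batches, channels, and partially covered edge regions):

```python
import numpy as np

def max_pooling_2d(x, ksize, stride):
    """Naive max pooling over a single-channel 2-D array (no padding)."""
    n = (x.shape[0] - ksize) // stride + 1
    m = (x.shape[1] - ksize) // stride + 1
    out = np.empty((n, m), dtype=x.dtype)
    for i in range(n):
        for j in range(m):
            out[i, j] = x[i * stride:i * stride + ksize,
                          j * stride:j * stride + ksize].max()
    return out

x = np.arange(25, dtype=np.float32).reshape(5, 5)
print(max_pooling_2d(x, ksize=3, stride=2))
# [[12. 14.]
#  [22. 24.]]
```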
In the next article (in progress), I will write up the results of actually running AlexNet.