PyTorch Neural Network (CNN) Tutorial 1.3.1

Assumed reader

I am a PyTorch beginner myself. Having now worked through PyTorch's Neural Network (CNN) Tutorial 1.3.1, let's go through it together.

This article is for readers who:

--have a rough grasp of how a CNN works
--have touched Python at some point
--tried to study PyTorch for the first time but could not follow the official tutorial

For that reason, I explain things fairly carefully; please read only the parts you need. Also, since the focus is on understanding the official tutorial, arguments that do not appear in the tutorial are not explained. This article covers PyTorch Tutorial 1.3.1.

What to do in this tutorial

This tutorial shows how, in PyTorch, you feed a two-dimensional image into a neural network, compute the value of the objective function (forward propagation), and then update each parameter value (backpropagation).

[Image: LeNet network diagram] Image source: [PyTorch Tutorial 1.3.1](https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html#backprop)

To give a more detailed overview: as shown above, the image goes through convolution → pooling → convolution → pooling → conversion to a one-dimensional array → a fully connected network down to the output layer (10 nodes). Each of these steps is explained below. After that, although it is not shown in the image above, the value of the objective function (here, the mean squared error) is computed by comparing this output with the answer you have in advance, and the parameter values are updated.

(Incidentally, this model is the five-layer LeNet proposed back in 1998, when CNNs were first presented as well suited to simple object recognition such as handwritten characters, introduced in the paper Object Recognition with Gradient-Based Learning, http://yann.lecun.com/exdb/publis/pdf/lecun-99.pdf.)

Now, let's look at more details while writing the code.

Make a model



import torch
import torch.nn as nn
import torch.nn.functional as F

First, we import torch. nn is the module that contains layers with learnable parameters, and F (torch.nn.functional) is the module that contains functions without parameters.
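As a quick illustration of the difference (a minimal sketch I added, not part of the tutorial): a layer from nn is an object that owns its parameters, while a function from F is stateless.

import torch
import torch.nn as nn
import torch.nn.functional as F

layer = nn.Linear(3, 2)                        # an nn layer: holds a weight and a bias
print([p.size() for p in layer.parameters()])  # [torch.Size([2, 3]), torch.Size([2])]

x = torch.randn(1, 3)
print(F.relu(x))                               # an F function: no parameters, no state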



class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        # 1 input image channel, 6 output channels, 3x3 square convolution
        # kernel
        self.conv1 = nn.Conv2d(1, 6, 3)
        self.conv2 = nn.Conv2d(6, 16, 3)
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(16 * 6 * 6, 120)  # 6*6 from image dimension
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the size is a square you can only specify a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features


net = Net()
print(net)

Now, the network is defined here. In PyTorch, you create a class that inherits from nn.Module (here, Net) and define the network inside that class. I will explain each of its three methods, taking the code above piece by piece.

__init__: where the layers with parameters go

    def __init__(self):
        super(Net, self).__init__()

The first method handles the layers that have parameters. Basically, the layers with parameters go in the constructor __init__. First, super(Net, self).__init__() calls the constructor of the parent class. Defining __init__ in the child class would otherwise override the parent's, so the idea is to take over the parent's constructor and add just the parts needed this time. By the way, super(Net, self).__init__() can be abbreviated as super().__init__().
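To see why calling the parent constructor matters (a minimal sketch I added, not from the tutorial): nn.Module's own __init__ sets up the bookkeeping that registers layers, so assigning a layer before calling it fails.

import torch.nn as nn

class Broken(nn.Module):
    def __init__(self):
        # super().__init__() is missing, so nn.Module's internal
        # registries are never created...
        self.fc = nn.Linear(3, 2)  # ...and this assignment raises
                                   # "cannot assign module before Module.__init__() call"

class Fixed(nn.Module):
    def __init__(self):
        super().__init__()         # parent constructor first
        self.fc = nn.Linear(3, 2)  # now the layer is registered correctly

Instantiating Broken() raises an AttributeError, while Fixed() works as expected.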

Convolution layer

        self.conv1 = nn.Conv2d(1, 6, 3)
        self.conv2 = nn.Conv2d(6, 16, 3)

nn.Conv2d is the class used for 2D convolution; in other words, it is what compresses the image in the vertical and horizontal directions here. The arguments are (number of input channels, number of output channels, filter size).

What are channels (depth)?

In addition to height and width, images have depth, and this depth is called the channel dimension. For an image, the depth corresponds to color: the number of channels is 3 for RGB and 1 for monochrome. We convolve a filter over the input, and the filter is automatically given the same number of channels as its input. For example, if the input has 3 channels, the filter automatically has 3 channels as well.

[Image: convolving a 3-channel filter over an RGB image] Image source: https://axa.biopapyrus.jp/deep-learning/cnn.html

In this way, if the input image has 3 channels, the filter also has 3 channels. In other words, each channel of the filter is convolved with R, G, and B respectively, and their sum forms one feature map.

The number of output channels depends on how many filters you prepare. [Image: multiple filters producing multiple feature maps] Image source: https://qiita.com/icoxfog417/items/5aa1b3f87bb294f84bac

Looking at the arguments of the first conv1: the input is 1 channel (monochrome), the output is 6 channels, and the filter size is 3x3. In other words, 6 filters with the same depth (1) as the input are prepared and convolved, producing 6 feature maps. The next conv2 takes 6 input channels and produces 16 output channels, so 16 filters of depth 6 are convolved. The number of output channels is always the number of filters.
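You can confirm this relationship between filters and channels with a small shape check (a sketch I added; the 8 * 8 input size is arbitrary):

>>>conv1 = nn.Conv2d(1, 6, 3)
>>>conv1.weight.size()   # 6 filters, each 1 channel deep and 3x3
torch.Size([6, 1, 3, 3])
>>>x = torch.randn(1, 1, 8, 8)   # (batch, channels, height, width)
>>>conv1(x).size()   # 6 output channels; spatially 8 - 3 + 1 = 6
torch.Size([1, 6, 6, 6])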

Fully connected layer

Next, nn.Linear is a class that applies a linear transformation to the input data; the arguments are (number of input units, number of output units). It forms a fully connected layer, in which every unit (also called a node) is connected to every unit of the next layer.

        self.fc1 = nn.Linear(16 * 6 * 6, 120)  # 6*6 from image dimension
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

By the way, the number 16 * 6 * 6 (= 576) appears suddenly here; it is the flattened, one-dimensional size of the three-dimensional data produced up to this point.

Before reaching this fully connected layer, the image data had 16 channels from the convolution layer, so one sample is 3D data of shape (channels, height, width) = (16, height, width). To bring this data to the fully connected layer, the 3D data has to be made one-dimensional. This model assumes that images of height 6 and width 6 arrive here, so 16 * 6 * 6 = 576 is the number of nodes in the input layer of the fully connected part. If the image data fed to this model were not 6 * 6 just before the fully connected layer, you would need, for example, an extra layer to resize it to 6 * 6 beforehand.
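Here is that shape bookkeeping in code (a sketch I added, using a dummy tensor of the shape this model expects at this point):

>>>x = torch.randn(1, 16, 6, 6)   # (batch, channels, height, width)
>>>x = x.view(1, 16 * 6 * 6)      # flatten each sample to 576 features
>>>fc1 = nn.Linear(16 * 6 * 6, 120)
>>>fc1(x).size()
torch.Size([1, 120])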

forward: describing forward propagation

Next is the forward method. It takes the data x as an argument and describes the network that computes the values of the output layer.

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the size is a square you can only specify a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

Here, the flow of forward propagation is described: input -> conv2d -> relu -> maxpool2d -> conv2d -> relu -> maxpool2d -> view -> linear -> relu -> linear -> relu -> linear. The layers with parameters have already been explained, so let's look at the remaining ones.

The F.relu function is the ramp function, one of the activation functions, which applies a non-linear transformation to the convolved data. ReLU(x) = max(x, 0): if the input is greater than 0, the value itself is output; if it is less than 0, 0 is output.
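For example (a quick check I added):

>>>F.relu(torch.tensor([-1.5, 0.0, 2.0]))
tensor([0., 0., 2.])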

The max_pool2d function performs two-dimensional max pooling; here a 2 * 2 window is used. The picture is as follows: [Image: 2 * 2 max pooling] The maximum value within each window is output, starting from the top left. Since no stride is given, it defaults to the window size, so the windows do not overlap. The window here is 2 * 2, so the height and width are each compressed by half.
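As a small demonstration (a sketch I added, with a hand-made 4 * 4 input):

>>>x = torch.tensor([[[[ 1.,  2.,  5.,  6.],
...                    [ 3.,  4.,  7.,  8.],
...                    [ 9., 10., 13., 14.],
...                    [11., 12., 15., 16.]]]])
>>>F.max_pool2d(x, 2)   # 2x2 windows; the stride defaults to the window size
tensor([[[[ 4.,  8.],
          [12., 16.]]]])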

Let's take a closer look at view. view returns a new tensor with the same data as the input but a different shape. It is used here to convert the image data to one-dimensional data before the fully connected layer. For example:

>>>x = torch.randn(2,2)
>>>x
tensor([[-0.2834, -0.3660],
        [-0.1678, -0.3034]])
>>>x.view(4)
tensor([-0.2834, -0.3660, -0.1678, -0.3034])

The shape of the data is changed just as specified. This time there is a -1 as the first argument, which makes that dimension adjust automatically so that it is consistent with the other argument (the second one, in this case). For example:

>>>x = torch.randn(4,3)
>>>x
tensor([[-1.2163,  1.6905,  0.1850],
        [-0.2123,  0.5995,  0.7282],
        [-0.5564, -0.1090, -0.8454],
        [-0.5643,  1.2565, -0.5475]])
>>>x.view(-1,6)
tensor([[-1.2163,  1.6905,  0.1850, -0.2123,  0.5995,  0.7282],
        [-0.5564, -0.1090, -0.8454, -0.5643,  1.2565, -0.5475]])

It works like this: if you say "change 4 * 3 into x * 6", it automatically becomes the appropriate 2 * 6.

Now consider the tutorial's x = x.view(-1, self.num_flat_features(x)).

If one sample of image data is (16, 6, 6), x.view(576) might seem good enough, but in fact the input tensor has four dimensions: (number of samples, number of channels, height, width). Until now I treated the input as a single image and did not mention the sample dimension, but in machine learning the parameters are basically updated after processing several images at once as a mini-batch (PyTorch's torch.nn is built on the assumption that mini-batches are used), so the input data also carries the number of samples. Therefore the output here is reshaped to (number of samples, channels x height x width): the features of each sample are laid out as a one-dimensional array, which becomes the starting nodes of the fully connected layer for that sample.

So x = x.view(-1, self.num_flat_features(x)) is, in this case, x = x.view(-1, 576). self.num_flat_features(x) is defined as a method that computes the number of features per sample, and its result is simply substituted here. (More on num_flat_features below.)
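Concretely (a sketch I added, with a dummy mini-batch of 4 samples):

>>>x = torch.randn(4, 16, 6, 6)    # a mini-batch of 4 samples
>>>x.view(-1, 16 * 6 * 6).size()   # -1 is resolved to the batch size, 4
torch.Size([4, 576])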

Once forward propagation is defined with this forward method, the backward computation is defined as well. backward just traces the path of forward propagation in reverse to compute the gradients of the objective function, so if the forward network is built, this computation is created automatically by autograd.

num_flat_features(x): counting the number of features

Here, to flatten the features of each sample into one dimension, only channels x height x width is computed, excluding the number of samples.

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features

It simply multiplies together all the dimensions of the input data except the first one (the number of samples). [1:] extracts index [1] onward from (number of samples, 16, 6, 6), that is, (16, 6, 6). num_features *= s is shorthand for num_features = num_features * s.

In other words, here (number of samples, 16, 6, 6) → (16, 6, 6), and then 16 * 6 * 6 per image gives the number of features.

Now the blueprint for the model is complete. Let's instantiate.

>>>net = Net()
>>>print(net)
Net(
  (conv1): Conv2d(1, 6, kernel_size=(3, 3), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(3, 3), stride=(1, 1))
  (fc1): Linear(in_features=576, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)

In the machine learning process, training is carried out through an object that instantiates the class you defined yourself (here, the Net class).

Sorting out the image sizes

Just to be safe, let's sort out what image size this model expects and how the size has changed along the way.

input -> conv2d -> relu -> maxpool2d -> conv2d -> relu -> maxpool2d -> view -> linear -> relu -> linear -> relu -> linear

This time, "The input data of LeNet is assumed to be 32 * 32" is written in the official tutorial of pytorch, but I think that 30 is the optimum for this parameter. Enter 30x30 data → Convolution filter size is (3 * 3,1 stride) so 28 * 28 → Pooling (2 * 2) 14 * 14 → Convolution (3 * 3,1 stride) so 12 * 12 → pooling (2 * 2) → 6 * 6 to put into the fully connected layer, isn't it? In the original paper, the first convolution layer used a filter size of 5 * 5, so 32 * 32 is the best choice for that. (Pytorch's'nn.MaxPool2d' pooling is truncated after the decimal point, so it is possible to enter 32 * 32 with this parameter as well)

Check parameters

Now, let's check the parameters. The parameters to be learned can be retrieved with net.parameters().

>>>params = list(net.parameters())
>>>print(len(params))
>>>print(params[0].size())  # conv1's .weight
>>>print(params[1].size())

10
torch.Size([6, 1, 3, 3])
torch.Size([6])

This time there are 10 parameter tensors. The first convolution layer has parameters of shape [6, 1, 3, 3] and [6]. In a convolution layer, the parameters are the filter values themselves; since those values are updated by training, there are 6 (output channels) x 1 (input channels) x 3 (height) x 3 (width) weights plus 6 biases. By the same reasoning, the next convolution layer has [16, 6, 3, 3] and [16]. If you have followed so far, the remaining six tensors, belonging to the three fully connected layers, are also easy to read: [120, 576], [120], [84, 120], [84], [10, 84], [10].
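To see which tensor belongs to which layer, named_parameters() is convenient (a sketch I added):

>>>for name, p in net.named_parameters():
...    print(name, p.size())
conv1.weight torch.Size([6, 1, 3, 3])
conv1.bias torch.Size([6])
conv2.weight torch.Size([16, 6, 3, 3])
conv2.bias torch.Size([16])
fc1.weight torch.Size([120, 576])
fc1.bias torch.Size([120])
fc2.weight torch.Size([84, 120])
fc2.bias torch.Size([84])
fc3.weight torch.Size([10, 84])
fc3.bias torch.Size([10])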

Feeding in data

Now, let's feed some random numbers into the model above as if they were image data.

>>>input = torch.randn(1, 1, 32, 32)
>>>out = net(input)
>>>print(out)
tensor([[-0.0843,  0.0283,  0.0677,  0.0639, -0.0076, -0.0293,  0.1049,  0.2183,
         -0.1275, -0.1151]], grad_fn=<AddmmBackward>)

Ten values are output, as expected. By the way, the first of the four dimensions of the input is the number of images per batch.
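For instance, a mini-batch of 5 images yields 5 rows of 10 outputs (a sketch I added):

>>>net(torch.randn(5, 1, 32, 32)).size()
torch.Size([5, 10])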

Loss calculation

The objective function takes a pair (output value, target (the answer)) as input and computes how far the output is from the answer you wanted to get. There are several loss functions in the nn package; this time we use nn.MSELoss, which computes the mean squared error between the output and the target.

>>>output = net(input)
>>>target = torch.randn(10)  # a dummy target, for example
>>>target = target.view(1, -1)  # make it the same shape as output
>>>criterion = nn.MSELoss()

>>>loss = criterion(output, target)
>>>print(loss)
tensor(0.6110, grad_fn=<MseLossBackward>)

output holds the model's output; target is filled with arbitrary numbers for this example and reshaped to match the model's output (which, including the batch dimension, has shape (1, 10)). The loss function is instantiated before being used.
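By the way, you can trace the computation graph backwards from loss through grad_fn; this step also appears in the official tutorial (the object addresses will differ on your machine):

>>>print(loss.grad_fn)   # the last operation: MSELoss
<MseLossBackward object at 0x7f...>
>>>print(loss.grad_fn.next_functions[0][0])   # the Linear layer before it
<AddmmBackward object at 0x7f...>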

Parameter update

PyTorch's optim module provides various parameter-update methods, and with it you can easily run backpropagation and update the parameters.

import torch.optim as optim

# create your optimizer
optimizer = optim.SGD(net.parameters(), lr=0.01)

# in your training loop:
optimizer.zero_grad()   # zero the gradient buffers
output = net(input)
loss = criterion(output, target)
loss.backward()
optimizer.step()    # Does the update

optim.SGD(net.parameters(), lr=0.01) means: update the specified parameters (net.parameters()) with a learning rate of 0.01 using stochastic gradient descent.
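What SGD does per step is essentially weight = weight - learning_rate * gradient; the official tutorial also shows it as a hand-written loop:

learning_rate = 0.01
for f in net.parameters():
    f.data.sub_(f.grad.data * learning_rate)

optimizer.step() performs exactly this update for you.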

optimizer.zero_grad() resets the gradients of the objective function to zero. This may seem puzzling if you are used to define-and-run frameworks like TensorFlow, where the backpropagation computation is fixed in advance. PyTorch and Chainer are define-by-run: they keep the full computation history needed for gradient calculation, which allows very flexible parameter updates, but, conversely, nothing specifies where one gradient computation ends. So unless you reset the gradients to 0 at the right point, the gradients of new data are accumulated on top of the gradients computed from earlier input data, and the correct gradient cannot be obtained. In other words, this gradient-initialization step must be performed every time you backpropagate (i.e., for every batch).
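You can observe the accumulation directly (a sketch I added; input, target, and criterion are the objects defined above):

>>>net.zero_grad()
>>>criterion(net(input), target).backward()
>>>g1 = net.conv1.bias.grad.clone()
>>>criterion(net(input), target).backward()     # second backward, no zero_grad
>>>torch.allclose(net.conv1.bias.grad, g1 * 2)  # the gradient has doubled
True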

So far, we have walked through the PyTorch flow of building a model -> forward propagation -> computing the loss function -> backpropagation and parameter update. Next, a tutorial that trains a model on real data is waiting for you, so please give it a try.

In conclusion

I referred to various pages to build my own understanding, so let me introduce them here. All of them are recommended.

Image source

--Implementing Convolutional Neural Network: https://qiita.com/icoxfog417/items/5aa1b3f87bb294f84bac
--Convolutional neural network: https://axa.biopapyrus.jp/deep-learning/cnn.html

Next, the reference URLs:

--Convolutional Neural Network_CNN (Vol.16): https://products.sint.co.jp/aisia/blog/vol1-16#toc-3
--Latest research trends of convolutional neural networks (~2017): https://qiita.com/yu4u/items/7e93c454c9410c4b5427#fn3
--Medical AI Specialized Course Online Lecture Materials: https://japan-medical-ai.github.io/medical-ai-course-materials/index.html
