Thorough commentary on PyTorch optim SGD

2020/1/27 Posted

0. Target audience

- People who have used Python and have a working execution environment
- People who have used PyTorch to some extent
- People who want to understand the SGD optimizer used for machine learning with PyTorch
- People who want to use PyTorch's SGD optimizer on something other than a network model (such as ordinary variables)

1. Introduction

Most machine learning research today is done in Python, because Python has many libraries (called modules) for fast data analysis and numerical computation. In this article we use one of them, **PyTorch**, and explain **in terms of the program** how parameters are updated with stochastic gradient descent (SGD). Please note that this article is essentially a personal memo and is meant as a reference only; for the sake of brevity, some expressions and terms may be imprecise. I will also not explain SGD itself in detail, so if you are interested, please study it on your own.

Also, this article does not actually train a network. If you are interested in that, please refer to the link below.

Thorough explanation of CNNs with pyTorch

2. Installing PyTorch

If you are using PyTorch for the first time, you need to install it from a command prompt (cmd or similar), because it does not come with Python. Go to the link below, select your environment under "QUICK START LOCALLY" at the bottom of the page, and run the command that is displayed (you should be able to copy and paste it as is).

PyTorch official website

3. The special type provided by PyTorch

Just as numpy has a type called ndarray, PyTorch has a type called the **Tensor type**. Like ndarray, it supports matrix calculations, and the two are quite similar, but the Tensor type has the advantage for machine learning that it can use the GPU. Machine learning requires a considerable amount of computation, so a GPU, which computes quickly, is used. In addition, a Tensor can be differentiated very easily, which is what is needed to update machine learning parameters. How easily this can be done is the key point of this article.

Please refer to the following link for the operation and an explanation of the Tensor type.
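As a rough illustration of the points above, the following sketch performs a matrix product, moves a tensor to the GPU only when one is available, and differentiates a simple expression. The variable names and the function $z = w^2$ are assumptions chosen for this example and do not appear elsewhere in the article.

```python
import torch

# A 2x2 tensor; matrix products work much like they do with ndarray.
t = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
prod = t @ t
print(prod)

# Use the GPU only if one is available; otherwise stay on the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
t_dev = t.to(device)
print(t_dev.device)

# Differentiation: for z = w^2, dz/dw at w = 3 is 6.
w = torch.tensor(3.0, requires_grad=True)
z = w ** 2
z.backward()
print(w.grad)
```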

What is the Tensor type of pyTorch

Please refer to the link below for how the differentiation is realized.

Summary of examples where PyTorch backward cannot be used

4. What is Stochastic Gradient Descent (SGD)?

SGD stands for stochastic gradient descent, which is, simply put, a method for updating parameters. I will not explain it here, because many mathematical explanations can be found by searching online.
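For readers who only want the gist of "a parameter update method", here is a minimal sketch in plain Python, without PyTorch. It repeatedly applies the update $x \leftarrow x - \eta f'(x)$ to the function $f(x) = (x - 2)^2$, which is an assumption chosen purely for illustration. The example is deterministic; in actual SGD, "stochastic" refers to estimating the gradient from randomly chosen samples.

```python
def grad_f(x):
    # Derivative of the illustrative function f(x) = (x - 2)^2
    return 2.0 * (x - 2.0)

x = 0.0    # initial value of the parameter
lr = 0.1   # learning rate (eta)

for _ in range(100):
    # Move the parameter a small step against the gradient.
    x = x - lr * grad_f(x)

print(x)   # close to 2.0, the minimizer of f
```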

5. PyTorch SGD

5-1. Importing PyTorch

First, import PyTorch so that it can be used. From here on, write the code in a Python file rather than at the command prompt. The following code makes the modules available.

filename.py


import torch
import torch.optim as optim

The second line, "**import torch.optim as optim**", imports the module that provides SGD.

5-2. optim.SGD

First, the arguments of SGD are explained. It is used as follows.

filename.py


op = optim.SGD(params, lr=l, momentum=m, dampening=d, weight_decay=w, nesterov=n)

The arguments are as follows.

- params: The parameters you want to update. They must be differentiable.
- lr: The learning rate. Pass a float.
- momentum: The momentum factor. Pass a float.
- dampening: The dampening applied to the momentum. Pass a float.
- weight_decay: The strength of the L2 regularization (L2 norm penalty) applied to params. Pass a float.
- nesterov: Whether to use Nesterov momentum. Pass True or False.

This time, momentum, dampening, weight_decay, and nesterov, which are not needed to observe the basic behavior of SGD, are left at their default values (all 0 or False).
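The default values mentioned above can be checked with a small sketch like the following. The parameter p and the specific values passed to the second optimizer are assumptions for illustration only. Note that PyTorch accepts nesterov=True only when momentum is greater than 0 and dampening is 0.

```python
import torch
import torch.optim as optim

p = torch.tensor(1.0, requires_grad=True)  # hypothetical parameter

# Only params and lr are given; the other arguments keep their defaults.
op_default = optim.SGD([p], lr=0.01)
group = op_default.param_groups[0]
print(group["momentum"], group["dampening"], group["weight_decay"], group["nesterov"])

# Every argument spelled out (the values are arbitrary examples).
op_full = optim.SGD([p], lr=0.01, momentum=0.9, dampening=0.0,
                    weight_decay=1e-4, nesterov=True)
print(op_full.param_groups[0]["momentum"])
```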

5-3. Use of SGD

The program itself is very simple. As preparation, here is an example calculation.

filename.py


x = torch.tensor(5.0, requires_grad=True)  # differentiable
c = torch.tensor(3.0, requires_grad=True)  # differentiable
b = torch.tensor(2.0)                      # constant tensor (no gradient)
d = 4.0                                    # plain Python float
y = c * 3 * torch.exp(x) + b * x + d
print(y)

------------Output below---------------
tensor(1349.7185, grad_fn=<AddBackward0>)

Written as a formula, this program computes

y = 3 c e^{x} + bx + d

with $x = 5$, $c = 3$, $b = 2$, $d = 4$. Because "**requires_grad=True**" is set only for **x** and **c**, derivatives are computed only for these two variables. We now pass them to SGD so that these two variables are updated. The following is an example.

filename.py


op = optim.SGD([x,c], lr=1.0)

Note that the argument passed as **params** is "**[x, c]**": the parameters must be wrapped in a list "**[]**". This is **also true when there is only one parameter**. When you build a network for machine learning, you pass the model's parameters instead, and in that case the brackets are not needed. In a little more detail, params expects an iterable, and a model's parameters (model.parameters()) are already an iterable, so no list is required. For an example of passing a model, see the explanation of CNNs linked above.

The variable **op** now acts as the SGD optimizer. Printing it shows the SGD settings.

filename.py
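The difference between the two ways of passing params can be sketched as follows. The tensor w and the nn.Linear model are hypothetical examples that do not appear in this article; nn.Linear(2, 1) is used only because it is a small model with two parameters (a weight and a bias).

```python
import torch
import torch.nn as nn
import torch.optim as optim

# A single tensor still has to be wrapped in a list.
w = torch.tensor(1.0, requires_grad=True)
op_single = optim.SGD([w], lr=0.1)

# Passing the bare tensor is rejected by the optimizer.
try:
    optim.SGD(w, lr=0.1)
except TypeError as e:
    print("TypeError:", e)

# model.parameters() is already an iterable, so no brackets are needed.
model = nn.Linear(2, 1)  # hypothetical model with a weight and a bias
op_model = optim.SGD(model.parameters(), lr=0.1)
print(len(op_model.param_groups[0]["params"]))
```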


print(op)

------------Output below---------------
SGD (
Parameter Group 0
    dampening: 0
    lr: 1.0
    momentum: 0
    nesterov: False
    weight_decay: 0
)

The derivatives with respect to the parameters are actually computed as follows.

filename.py


y.backward()
print(x.grad)
print(c.grad)

------------Output below---------------
tensor(1337.7185)
tensor(445.2395)
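These two values can be checked by hand. From $y = 3ce^{x} + bx + d$, the partial derivatives are $\partial y/\partial x = 3ce^{x} + b$ and $\partial y/\partial c = 3e^{x}$. The following sketch recomputes them with the standard math module and compares them with the result of automatic differentiation; the names dy_dx and dy_dc are introduced only for this check.

```python
import math
import torch

x = torch.tensor(5.0, requires_grad=True)
c = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(2.0)
d = 4.0
y = c * 3 * torch.exp(x) + b * x + d
y.backward()

# Hand-derived partial derivatives evaluated at x = 5, c = 3, b = 2
dy_dx = 3 * 3.0 * math.exp(5.0) + 2.0  # 3*c*e^x + b
dy_dc = 3 * math.exp(5.0)              # 3*e^x

print(x.grad.item(), dy_dx)
print(c.grad.item(), dy_dc)
print(b.grad)  # None, because b was created without requires_grad=True
```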

The derivative with respect to each variable can be viewed with "**variable_name.grad**". Calling "**y.backward()**" in this way automatically differentiates the final output **y** with respect to the variables it depends on. Then update the parameters as follows.

filename.py


op.step()
print(x)
print(c)

------------Output below---------------
tensor(-1332.7185, requires_grad=True)
tensor(-442.2395, requires_grad=True)

In this way, "**op.step()**" updates each variable using its gradient. In other words, the variables **x** and **c** and the parameters held by SGD refer to the same memory. The update formula under the current settings is as follows.

x \leftarrow x - \eta\frac{\partial y}{\partial x}

$\eta$ is the learning rate, which is currently 1.0. Substituting the actual variable value and derivative into the formula above gives

-1332.7185 \leftarrow 5.0 - 1.0\times 1337.7185

It certainly matches.
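The same check can be done in code. The sketch below computes the expected values from the update formula before calling op.step() and then compares them with the updated variables. The names expected_x and expected_c are introduced only for this comparison.

```python
import torch
import torch.optim as optim

x = torch.tensor(5.0, requires_grad=True)
c = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(2.0)
d = 4.0
lr = 1.0

y = c * 3 * torch.exp(x) + b * x + d
y.backward()

# Expected results of x <- x - lr * dy/dx, computed before the step.
expected_x = x.detach() - lr * x.grad
expected_c = c.detach() - lr * c.grad

op = optim.SGD([x, c], lr=lr)
op.step()

print(torch.allclose(x.detach(), expected_x))
print(torch.allclose(c.detach(), expected_c))
```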

5-4. Rewriting the contents of SGD

The variable **op**, which holds the SGD optimizer, was printed above, but detailed information such as the parameters was not shown. The details can be printed as follows. Note that the following program is run on the data as it was before the differentiation and update above (the variables x and c still have their initial values).

filename.py


print(op.param_groups)

------------Output below---------------
[{'params': [tensor(5., requires_grad=True), tensor(3., requires_grad=True)],
  'lr': 1.0,
  'momentum': 0,
  'dampening': 0,
  'weight_decay': 0,
  'nesterov': False}]

This shows the detailed settings of SGD, and you can see that the information is stored in a list "**[]**". So if you want to change the contents, do as follows.

filename.py


op.param_groups[0]['lr'] = 0.1
print(op)

------------Output below---------------
SGD (
Parameter Group 0
    dampening: 0
    lr: 0.1
    momentum: 0
    nesterov: False
    weight_decay: 0
)

The learning rate **lr** has now been rewritten to 0.1, as the printed output confirms.
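A rewritten learning rate takes effect from the next call to step(). The following sketch halves lr after every update. The parameter w and the toy loss $2w$ (whose gradient is always 2) are assumptions chosen so that the size of each step is easy to read off.

```python
import torch
import torch.optim as optim

w = torch.tensor(0.0, requires_grad=True)  # hypothetical parameter
op = optim.SGD([w], lr=1.0)

for i in range(3):
    op.zero_grad()   # discard the gradient left over from the previous step
    loss = 2.0 * w   # toy loss whose gradient is always 2
    loss.backward()
    op.step()        # w <- w - lr * 2
    print(i, op.param_groups[0]["lr"], w.item())
    op.param_groups[0]["lr"] *= 0.5  # halve the learning rate for the next step
```

With lr going 1.0, 0.5, 0.25, the parameter moves by 2, 1, and 0.5 in turn.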

5-5. Rewriting the variable you want to update (SGD side)

Next, in the same way as the SGD settings were rewritten above, let's rewrite the variable **x** through the optimizer. First, the current parameters are as follows.

filename.py


print(op.param_groups[0]['params'])

------------Output below---------------
[tensor(5., requires_grad=True), tensor(3., requires_grad=True)]

Now rewrite element 0, which corresponds to the variable **x**.

filename.py


op.param_groups[0]['params'][0] = torch.tensor(10., requires_grad=True)
print(op.param_groups)

------------Output below---------------
[{'params': [tensor(10., requires_grad=True), tensor(3., requires_grad=True)],
  'lr': 0.1,
  'momentum': 0,
  'dampening': 0,
  'weight_decay': 0,
  'nesterov': False}]

It has certainly been rewritten. **But let's look at the actual value of the variable.**

filename.py


print(x)

------------Output below---------------
tensor(5., requires_grad=True)

The variable itself has not been rewritten at all. Updating it with backward() and step() gives the following.

filename.py


y.backward()
op.step()
print(x)
print(x.grad)
print(op.param_groups)

------------Output below---------------
tensor(5., requires_grad=True)
tensor(1337.7185)
[{'params': [tensor(10., requires_grad=True),
   tensor(-41.5240, requires_grad=True)],
  'lr': 0.1,
  'momentum': 0,
  'dampening': 0,
  'weight_decay': 0,
  'nesterov': False}]

Note that the value of the variable c differs from the earlier example because the learning rate lr is now 0.1. As you can see, neither the variable x nor the x held by SGD has been updated. Moreover, **x.grad** is the derivative computed at x = 5. This happens because assigning "**torch.tensor(10., requires_grad=True)**" creates a new tensor at a different memory location, so the optimizer no longer refers to the original x. Be very careful when changing the value of a variable.
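Whether the variable and the optimizer still refer to the same tensor can be checked directly with Python's is operator. The following is a small sketch of that check using the same setup as above.

```python
import torch
import torch.optim as optim

x = torch.tensor(5.0, requires_grad=True)
c = torch.tensor(3.0, requires_grad=True)
op = optim.SGD([x, c], lr=0.1)

params = op.param_groups[0]["params"]

# Initially the optimizer stores the very same objects as x and c.
linked_before = params[0] is x
print(linked_before)

# Assigning a new tensor to the list slot breaks the link to x only.
params[0] = torch.tensor(10.0, requires_grad=True)
linked_after = params[0] is x
print(linked_after)
print(params[1] is c)
```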

5-6. Rewriting the variable you want to update (normal variable side)

Above, we saw that rewriting the variable stored in SGD disconnects the variable from the SGD parameters, and that the cause is a mismatch in memory location. This time, let's rewrite the variable itself.

filename.py


op = optim.SGD([x,c], lr=1.0)
print(op.param_groups)

------------Output below---------------
[{'params': [tensor(5., requires_grad=True), tensor(3., requires_grad=True)],
  'lr': 1.0,
  'momentum': 0,
  'dampening': 0,
  'weight_decay': 0,
  'nesterov': False}]

Now assign a new tensor to the variable **x**.

filename.py


x = torch.tensor(10., requires_grad=True)
print(x)

------------Output below---------------
tensor(10., requires_grad=True)

Looking at the details of **op** again gives the following.

filename.py


print(op.param_groups)

------------Output below---------------
[{'params': [tensor(5., requires_grad=True), tensor(3., requires_grad=True)],
  'lr': 1.0,
  'momentum': 0,
  'dampening': 0,
  'weight_decay': 0,
  'nesterov': False}]

The parameters have not changed. Naturally, updating with backward() and step() in this state gives the following.

filename.py


y.backward()
op.step()
print(x)
print(x.grad)
print(op.param_groups)

------------Output below---------------
tensor(10., requires_grad=True)
None
[{'params': [tensor(-1332.7185, requires_grad=True),
   tensor(-442.2395, requires_grad=True)],
  'lr': 1.0,
  'momentum': 0,
  'dampening': 0,
  'weight_decay': 0,
  'nesterov': False}]

Interestingly, nothing happens to the new variable **x**, which lives at a different memory location, while the parameters held by SGD are updated. In conclusion, **you should avoid needlessly reassigning a variable that you want to update**.
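If the value really does need to change, one possible approach (not covered in the original experiments, so treat it as a sketch) is to overwrite the tensor in place instead of assigning a new one. Here the in-place method fill_ is called inside torch.no_grad() so that the same tensor object, and therefore the link to the optimizer, is kept. The name value_seen_by_op is introduced only for this example.

```python
import torch
import torch.optim as optim

x = torch.tensor(5.0, requires_grad=True)
c = torch.tensor(3.0, requires_grad=True)
b = torch.tensor(2.0)
d = 4.0
op = optim.SGD([x, c], lr=1.0)

# Overwrite the value in place; no_grad keeps autograd from recording it.
with torch.no_grad():
    x.fill_(10.0)

# The optimizer still holds the same object and sees the new value.
still_linked = op.param_groups[0]["params"][0] is x
value_seen_by_op = op.param_groups[0]["params"][0].item()
print(still_linked, value_seen_by_op)

# Build y after the change so the gradient is evaluated at x = 10.
y = c * 3 * torch.exp(x) + b * x + d
y.backward()
op.step()
print(x)  # x itself has been updated by the optimizer
```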

6. Closing remarks

This time I explained PyTorch's SGD optimizer, albeit briefly. Surprisingly, I could not find an example of applying SGD to anything other than a network, so I decided to write one up. I think there were many parts that were hard to read, but thank you for reading.
