This is the second installment of the PyTorch official tutorials, following on from last time. This time I would like to work through Autograd: Automatic Differentiation.
1. Autograd
2. Tensor
3. Gradient
4. You can do many crazy things with autograd!
5. Finally
History
1. Autograd
PyTorch provides an Autograd (automatic differentiation) feature. Gradient information is stored in the Tensor, and gradients for a defined computation graph (expression) are computed with the backward() method. Let's look at Autograd through concrete examples below.
2. Tensor
A PyTorch Tensor records gradients when its requires_grad attribute is set to True. When the gradient is computed with backward(), it is stored in the Tensor's grad attribute.
The following code defines a Tensor. requires_grad=True is specified so that the gradient is recorded.
import torch
x = torch.ones(2, 2, requires_grad=True)
print(x)
tensor([[1., 1.],
[1., 1.]], requires_grad=True)
Create a calculation graph (formula) y.
y = x + 2
print(y)
tensor([[3., 3.],
[3., 3.]], grad_fn=<AddBackward0>)
Printing y shows grad_fn in the output, which indicates that a computation graph has been built for calculating the gradient.
print(y.grad_fn)
<AddBackward0 object at 0x7f8cc977e5c0>
Use y to create further calculation graphs (formulas) z and out.
z = y * y * 3
out = z.mean()
print(z, out)
tensor([[27., 27.],
[27., 27.]], grad_fn=<MulBackward0>) tensor(27., grad_fn=<MeanBackward0>)
[Reference information] You can change the requires_grad attribute in place with tensor.requires_grad_().
a = torch.randn(2, 2)
a = ((a * 3) / (a - 1))
print(a.requires_grad)
a.requires_grad_(True)
print(a.requires_grad)
b = (a * a).sum()
print(b.grad_fn)
False
True
<SumBackward0 object at 0x7fcb2ba0a3c8>
3. Gradient
Calculate the gradient with out.backward().
out.backward()
Print the partial derivative of out with respect to x, d(out)/dx.
print(x.grad)
tensor([[4.5000, 4.5000],
[4.5000, 4.5000]])
Since out = z.mean() and z = y * y * 3, the formulas are as follows.
out = \frac{1}{4}\sum_{i=1}^{4} z_i, \qquad
z_i = 3(x_i + 2)^2, \qquad
z_i\bigr\rvert_{x_i=1} = 27
Therefore, partially differentiating out with respect to x_i gives
\begin{align}
\frac{\partial out}{\partial x_i} &= \frac{1}{4}\cdot 6(x_i+2) = \frac{3}{2}(x_i+2)\\
\frac{\partial out}{\partial x_i}\biggr\rvert_{x_i=1} &= \frac{9}{2} = 4.5
\end{align}
This matches the values in x.grad above, confirming that the differentiation was carried out automatically.
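As a quick sanity check, the hand-derived formula can be compared with what autograd computes. This is a minimal sketch that rebuilds the same expression; it is not part of the original tutorial code.
import torch
x = torch.ones(2, 2, requires_grad=True)
out = (3 * (x + 2) ** 2).mean()            # same as z = y * y * 3 with y = x + 2, then z.mean()
out.backward()
analytical = 1.5 * (x.detach() + 2)        # d(out)/dx_i = (3/2)(x_i + 2)
print(torch.allclose(x.grad, analytical))  # True: every entry is 4.5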
4. You can do many crazy things with autograd!
I'm not sure what the code below means, but let's take a look.
x = torch.randn(3, requires_grad=True)
y = x * 2
while y.data.norm() < 1000:
y = y * 2
print(y)
tensor([ -492.4446, -1700.8485, -339.7951], grad_fn=<MulBackward0>)
x is a random value drawn from a standard normal distribution (mean 0, standard deviation 1). y.data.norm() is the distance in vector space described on Wikipedia, namely the following Euclidean norm.
\text{Euclidean norm} = \sqrt{|x_1|^2+\cdots+|x_n|^2}
In two dimensions this is exactly the familiar distance between two points. In fact, if you print x and x.norm(), the value returned matches the formula above.
\begin{eqnarray}
\text{Euclidean norm} &=& \sqrt{|-0.9618|^2+|-3.3220|^2+|-0.6637|^2}\\
&=& 3.5215
\end{eqnarray}
print(x)
print(x.data.norm())
tensor([-0.9618, -3.3220, -0.6637], requires_grad=True)
tensor(3.5215)
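To confirm this, the norm can be recomputed by hand from the squared components. A small sketch reusing the printed values of x:
import torch
x = torch.tensor([-0.9618, -3.3220, -0.6637])  # the values printed above
manual = torch.sqrt((x ** 2).sum())            # sqrt(|x1|^2 + |x2|^2 + |x3|^2)
print(manual)                                  # tensor(3.5215)
print(x.norm())                                # the built-in norm gives the same value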
So the code above keeps doubling the value (starting from y = x * 2) until the norm of y is at least 1,000.
I want to calculate the gradient of y with y.backward(), but since y is not a scalar, it cannot be computed as is. In fact, running y.backward() gives an error.
y.backward()
RuntimeError: grad can be implicitly created only for scalar outputs
The gradient can be calculated by passing an appropriate vector to backward().
gradients = torch.tensor([0.1, 1.0, 0.0001], dtype=torch.float)
y.backward(gradients)
print(x.grad)
tensor([5.1200e+01, 5.1200e+02, 5.1200e-02])
Let's consider what this means. As mentioned in the tutorial, the gradient of a vector-valued function can be represented by the Jacobian matrix.
\begin{split}J=\left(\begin{array}{ccc}
\frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{1}}{\partial x_{n}}\\
\vdots & \ddots & \vdots\\
\frac{\partial y_{m}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
\end{array}\right)\end{split}
The tutorial also states that autograd is an engine for computing the product of a given vector and the Jacobian matrix (the vector-Jacobian product J^T v):
\begin{split}J^{T}\cdot v=\left(\begin{array}{ccc}
\frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{1}}\\
\vdots & \ddots & \vdots\\
\frac{\partial y_{1}}{\partial x_{n}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
\end{array}\right)\left(\begin{array}{c}
\frac{\partial l}{\partial y_{1}}\\
\vdots\\
\frac{\partial l}{\partial y_{m}}
\end{array}\right)=\left(\begin{array}{c}
\frac{\partial l}{\partial x_{1}}\\
\vdots\\
\frac{\partial l}{\partial x_{n}}
\end{array}\right)\end{split}
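This relationship can be verified numerically by building the full Jacobian explicitly and comparing J^T v with what backward() returns. A sketch under the assumption that torch.autograd.functional.jacobian is available (it exists in recent PyTorch releases and is not used in the tutorial itself):
import torch
x = torch.randn(3, requires_grad=True)
v = torch.tensor([0.1, 1.0, 0.0001])
def f(t):
    return t * t * 3                             # elementwise function; its Jacobian is diag(6 * t)
y = f(x)
y.backward(v)                                    # stores the vector-Jacobian product J^T v in x.grad
J = torch.autograd.functional.jacobian(f, x)     # the full 3x3 Jacobian
print(torch.allclose(x.grad, J.t() @ v))         # True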
Based on this, let me apply it to the case above. From here on, some of this is my own interpretation, so it may not be entirely correct.
x = torch.randn(3, requires_grad=True)
Since x consists of three random values, the number of inputs n is 3:
x_1 , x_2 , x_3
Consider y. The definition of y is as follows.
y = x * 2
while y.data.norm() < 1000:
y = y * 2
First, look at the initial y = x * 2. Since y simply doubles x element-wise, the number of variables does not change: after the transformation there are still three values. Therefore m is also 3, and the Jacobian matrix is 3 × 3.
{\begin{split}J=\left(\begin{array}{ccc}
\frac{\partial y_{1}}{\partial x_{1}} & \frac{\partial y_{1}}{\partial x_{2}} & \frac{\partial y_{1}}{\partial x_{3}}\\
\frac{\partial y_{2}}{\partial x_{1}} & \frac{\partial y_{2}}{\partial x_{2}} & \frac{\partial y_{2}}{\partial x_{3}}\\
\frac{\partial y_{3}}{\partial x_{1}} & \frac{\partial y_{3}}{\partial x_{2}} & \frac{\partial y_{3}}{\partial x_{3}}\\
\end{array}\right)\end{split}
}
The values printed for x above, [-0.9618, -3.3220, -0.6637], correspond to [x1, x2, x3]. Applying y = x * 2 gives [-1.9236, -6.6440, -1.3274], which corresponds to [y1, y2, y3]. Plugging in the numbers is not strictly necessary; the transformation from x to y is:
y_1 = 2x_1\\
y_2 = 2x_2\\
y_3 = 2x_3\\
The partial derivatives of these equations with respect to x1, x2, and x3 are:
\frac{\partial y_{1}}{\partial x_{1}} = 2 ,
\frac{\partial y_{1}}{\partial x_{2}} = 0 ,
\frac{\partial y_{1}}{\partial x_{3}} = 0\\
\frac{\partial y_{2}}{\partial x_{1}} = 0 ,
\frac{\partial y_{2}}{\partial x_{2}} = 2 ,
\frac{\partial y_{2}}{\partial x_{3}} = 0\\
\frac{\partial y_{3}}{\partial x_{1}} = 0 ,
\frac{\partial y_{3}}{\partial x_{2}} = 0 ,
\frac{\partial y_{3}}{\partial x_{3}} = 2\\
Therefore, the Jacobian matrix is as follows. It is written J1 to denote the first transformation (y = x * 2).
{\begin{split}J_1=\left(\begin{array}{ccc}
2 & 0 & 0\\
0 & 2 & 0\\
0 & 0 & 2\\
\end{array}\right)\end{split}
}
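The same matrix can be recovered from autograd itself by calling backward() once per unit vector, which yields one row of J at a time. A sketch; jacobian_by_rows is a helper name of my own, not a PyTorch function:
import torch
def jacobian_by_rows(f, x):
    # backpropagating the unit vector e_i yields J^T e_i, i.e. the i-th row of J
    rows = []
    for i in range(x.numel()):
        xi = x.detach().clone().requires_grad_(True)
        y = f(xi)
        e = torch.zeros_like(y)
        e[i] = 1.0
        y.backward(e)
        rows.append(xi.grad.clone())
    return torch.stack(rows)
x = torch.randn(3, requires_grad=True)
print(jacobian_by_rows(lambda t: t * 2, x))
# tensor([[2., 0., 0.],
#         [0., 2., 0.],
#         [0., 0., 2.]])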
Now consider the second y, the y = y * 2 inside the while loop. Since the expression has the same form as the first one, its Jacobian matrix is identical. Let's call it J2 for the second transformation.
{\begin{split}J_2=\left(\begin{array}{ccc}
2 & 0 & 0\\
0 & 2 & 0\\
0 & 0 & 2\\
\end{array}\right)\end{split}
}
This repeats. Since x.data.norm() starts at 3.5215 and the loop runs while y.data.norm() < 1000, the loop body executes 8 times, so y is defined 9 times in total (a quick check in code follows the table). As a whole it looks like this:
Transformation | Value of x1 | Value of x2 | Value of x3 |
---|---|---|---|
(initial) | x1 | x2 | x3 |
1st y (y = x * 2) | 2 * x1 | 2 * x2 | 2 * x3 |
2nd y | 4 * x1 | 4 * x2 | 4 * x3 |
3rd y | 8 * x1 | 8 * x2 | 8 * x3 |
4th y | 16 * x1 | 16 * x2 | 16 * x3 |
5th y | 32 * x1 | 32 * x2 | 32 * x3 |
6th y | 64 * x1 | 64 * x2 | 64 * x3 |
7th y | 128 * x1 | 128 * x2 | 128 * x3 |
8th y | 256 * x1 | 256 * x2 | 256 * x3 |
9th y | 512 * x1 | 512 * x2 | 512 * x3 |
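The iteration count can be checked directly with a small counter that mirrors the loop above. A sketch reusing the printed values of x (the count depends on the particular x):
import torch
x = torch.tensor([-0.9618, -3.3220, -0.6637], requires_grad=True)  # the x printed above
y = x * 2
doublings = 1
while y.data.norm() < 1000:
    y = y * 2
    doublings += 1
print(doublings, 2 ** doublings)   # 9 512 -> the overall scale factor is 512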
In the end, y is the composition of these nine transformations. As described on this math site, the Jacobian matrix of a composite function is the product of the Jacobian matrices of the individual transformations, so the overall Jacobian matrix J of y is
\begin{eqnarray}
J &=& J_9 \times J_8 \times J_7 \times J_6 \times J_5 \times J_4 \times J_3 \times J_2 \times J_1\\\\
&=&
\left(
\begin{array}{ccc}
2 & 0 & 0 \\
0 & 2 & 0 \\
0 & 0 & 2
\end{array}
\right)
\left(
\begin{array}{ccc}
2 & 0 & 0 \\
0 & 2 & 0 \\
0 & 0 & 2
\end{array}
\right)
\cdots
\left(
\begin{array}{ccc}
2 & 0 & 0 \\
0 & 2 & 0 \\
0 & 0 & 2
\end{array}
\right)
\left(
\begin{array}{ccc}
2 & 0 & 0 \\
0 & 2 & 0 \\
0 & 0 & 2
\end{array}
\right) \\\\
&=&
\left(
\begin{array}{ccc}
512 & 0 & 0 \\
0 & 512 & 0 \\
0 & 0 & 512
\end{array}
\right)
\end{eqnarray}
Each factor is 2I, and multiplying nine of them together gives 2^9 I = 512I.
Let's apply this to the following calculation.
gradients = torch.tensor([0.1, 1.0, 0.0001], dtype=torch.float)
y.backward(gradients)
print(x.grad)
gradients is the vector that multiplies the Jacobian matrix. Applying it to the Jacobian matrix calculated above gives:
{\begin{split}x.grad=\left(\begin{array}{ccc}
512 & 0 & 0\\
0 & 512 & 0\\
0 & 0 & 512\\
\end{array}\right)\left(\begin{array}{c}
0.1\\
1.0\\
0.0001
\end{array}\right)=\left(\begin{array}{c}
51.2\\
512\\
0.0512
\end{array}\right)\end{split}
}
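If this reasoning is right, x.grad should simply be 512 times the vector passed to backward(). A one-line check using the tensors from the code above (assuming this x also needed nine doublings):
print(torch.allclose(x.grad, 512 * gradients))   # True when y was doubled 9 times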
In summary, my mental image of Autograd is something like this:
- Every time a function (expression) is defined, the corresponding Jacobian matrix is kept.
- The backward method calculates the "derivative" from the Jacobian matrices that have been kept.
Changing topics: by wrapping code in a torch.no_grad() block as shown below, operations are not tracked by autograd. Here (x ** 2) inside the block is not tracked.
print(x.requires_grad)
print((x ** 2).requires_grad)
with torch.no_grad():
print((x ** 2).requires_grad)
True
True
False
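A typical use of torch.no_grad() is model evaluation, where building a computation graph would only waste memory. A minimal sketch; the tiny Linear model and random inputs are stand-ins, not part of the tutorial:
import torch
import torch.nn as nn
model = nn.Linear(4, 2)        # stand-in model for illustration
inputs = torch.randn(8, 4)
model.eval()
with torch.no_grad():          # nothing inside this block is tracked by autograd
    outputs = model(inputs)
print(outputs.requires_grad)   # False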
detach() also returns a new tensor with the same values (sharing the same underlying data), but it is detached from the computation graph, so gradients are not tracked.
print(x.requires_grad)
y = x.detach()
print(y.requires_grad)
print(x.eq(y).all())
True
False
tensor(True)
5. Finally
That's it for PyTorch's second tutorial, Autograd: Automatic Differentiation. The content was quite different from the first tutorial. The second half includes some of my own interpretation, so it may contain mistakes; I would appreciate any corrections.
Next time I would like to proceed to the third tutorial, "NEURAL NETWORKS".
History
2020/02/28 First edition released
2020/04/22 Added link to the next article