In this article, to understand **"Define-by-Run"**, the most characteristic concept of the Neural Network framework Chainer, we will implement a library called "1f-chainer" that has only the minimum necessary functionality, using nothing but NumPy, and use it to describe and train a network that classifies handwritten digits. I have tried to move every explanation that requires mathematical formulas to the Appendix, so that the main text can be followed with code and prose alone.
All the code used in this article is available here: 1f-chainer. When I started writing I wanted to add various things but ran out of time, so I will update it piece by piece by the end of this week.
In addition, all content of this article is based on my personal opinion and understanding, and has nothing to do with the organization to which I belong.
This article assumes basic knowledge of training Neural Networks with Backpropagation and experience with Python and NumPy. Please note that the author is a fan of Chainer, so the impressions of Chainer below are personal feelings, not official views.
Chainer is a framework written in Python for building and training Neural Networks.
- High flexibility (dynamic networks are easy to write)
- Relatively easy debugging (a faster trial-and-error cycle)
- Intuitive models (models can be written in plain Python)

These, I think, are the things the library aims for.
As far as I know there are many well-known frameworks and libraries for building and training Neural Networks, but few of them are written in Python all the way down to the backend. Chainer, on the other hand, adopts NumPy as its tensor library for CPUs and CuPy, originally developed for Chainer, for GPUs; one of Chainer's features is that both are independent Python libraries that can also be used on their own.
CuPy is written mainly in Cython and has a mechanism for compiling and executing CUDA kernels internally. It also serves as a wrapper for cuDNN, NVIDIA's Neural Network library for NVIDIA GPUs. A major feature of CuPy is its NumPy-compatible API: CPU code written with NumPy can often be made to run on the GPU with very few changes (sometimes just replacing the string `numpy` in the code with `cupy`). There are still NumPy functions and features that CuPy does not support (advanced indexing, etc. [^cupy-PR]), so various restrictions and exceptions exist, but the basic goal of CuPy is to keep CPU code and GPU code as close to each other as possible.
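To make this concrete, the swap can be reduced to a single import alias. This is a minimal sketch of the pattern (assuming CuPy may or may not be installed), not code from the article's repository:

```python
# Bind whichever backend is available to one alias and write the rest once.
try:
    import cupy as xp   # arrays live on the GPU
except ImportError:
    import numpy as xp  # CPU fallback; the code below is unchanged

a = xp.arange(6).reshape(2, 3)
b = xp.ones((3, 2))
print(xp.dot(a, b))  # the same call regardless of which library xp names
```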
I have not tried all of the major frameworks/libraries, but the internal design Chainer uses to build and train Neural Networks is based on the idea of **"Define-by-Run"**, and I think this is the important feature that sets Chainer apart from other frameworks. "Define-by-Run" means, as the name implies, that the structure of a Neural Network is "defined by running" it: the structure is not fixed before execution, and only once the code runs is it determined how the parts of the network are connected. Concretely, when a function is applied to an input variable, the output is a new variable that remembers "which function produced me", so the computation that was actually performed can be traced backwards as many times as you like; in other words, the computational graph is constructed at runtime. Because of this, the user only has to describe the forward computation of the network in Python (including rule-based conditional branching, stochastic branching, or even generating new layers mid-computation): writing the code that represents the forward computation is itself defining the network.
Several classes that realize these characteristics are at the heart of Chainer. In this article, I will implement only the most basic parts of Chainer myself as a single-file library (like Python's web framework, Bottle), with the aim of understanding, at least superficially, that heart.
Let's start by looking at what happens if we write code that builds and trains a Neural Network as simply as possible, without worrying about making it "Chainer-like". The code in the rest of this section assumes that `import numpy` has been done at the top. It is also assumed throughout the following that the parameters are updated by mini-batch SGD.
A three-layer perceptron consisting of only two kinds of layers, the Linear layer and the activation function ReLU, can be defined as follows.
```python
class Linear(object):

    def __init__(self, in_sz, out_sz):
        self.W = numpy.random.randn(out_sz, in_sz) * numpy.sqrt(2. / in_sz)
        self.b = numpy.zeros((out_sz,))

    def __call__(self, x):
        self.x = x
        return x.dot(self.W.T) + self.b

    def update(self, gy, lr):
        self.W -= lr * gy.T.dot(self.x)
        self.b -= lr * gy.sum(axis=0)
        return gy.dot(self.W)


class ReLU(object):

    def __call__(self, x):
        self.x = x
        return numpy.maximum(0, x)

    def update(self, gy, lr):
        return gy * (self.x > 0)


model = [
    Linear(784, 100),
    ReLU(),
    Linear(100, 100),
    ReLU(),
    Linear(100, 10)
]
```
That's all; you can see it can be written very concisely. A detailed description of the Linear layer above is given later in the Appendix: [Linear Layer Basics](#Linear Layer Basics), along with a brief supplement on the ReLU layer ([About the ReLU layer](#About the ReLU layer)).
Next, we will prepare the functions `forward` and `update` required to train this three-layer perceptron using the data `x` and the correct answer `t`.
```python
def forward(model, x):
    for layer in model:
        x = layer(x)
    return x

def update(model, gy, lr=0.0001):
    for layer in reversed(model):
        gy = layer.update(gy, lr)
```
These two functions are enough to train the three-layer perceptron above. The `forward` function walks the layers of the `model` (given as a list) in order, propagating the data forward. Conversely, the `update` function walks the layers of the `model` **in reverse order**, back-propagating `gy`, the running product of gradients that appears in the chain rule. In summary, the learning procedure consists of three steps:
- Inference: give the `forward` function the data `x`, and call the resulting output `y`
- Loss computation: compute the error between `y` and the correct answer `t`, take its gradient with respect to `y`, and call it `gy`
- Update: give `gy` to the `update` function and update all the parameters in the network
The code that actually repeats these steps over the MNIST dataset to train the network can be written as follows. Here `td` is a `numpy.ndarray` of shape `(60000, 784)` holding $60000$ $784$-dimensional data vectors, and `tl` is a `numpy.ndarray` of shape `(60000, 10)` holding $60000$ $10$-dimensional one-hot vectors (vectors in which only the dimension corresponding to the correct class is $1$ and all other dimensions are $0$). The code that actually creates these arrays is given in the Appendix: [Read Dataset](#Read Dataset).
```python
def softmax_cross_entropy_gy(y, t):
    return (numpy.exp(y.T) / numpy.exp(y.T).sum(axis=0)).T - t
```
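As a side note, the exponentials above can overflow for large logits. A mathematically equivalent variant that subtracts the per-row maximum first avoids this; this defensive rewrite is my addition, not part of the article's code:

```python
def softmax_cross_entropy_gy_stable(y, t):
    # Subtracting the row-wise max leaves the softmax unchanged
    # (exp(y - c) / sum(exp(y - c)) == exp(y) / sum(exp(y))) but avoids overflow.
    e = numpy.exp(y - y.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True) - t
```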
```python
# Training
for epoch in range(30):
    for i in range(0, len(td), 128):
        x, t = td[i:i + 128], tl[i:i + 128]
        y = forward(model, x)
        gy = softmax_cross_entropy_gy(y, t)
        update(model, gy)
```
You can see that the basic training flow with mini-batch SGD is also very simple. Since the network is defined as a Python list, the forward computation is just a matter of giving the input variable to the first element of the list and feeding each output to the next element as its input. In the `update` function that updates the parameters, we look at this list in reverse order, computing `gy` from the back and passing it to the `update` method of each layer.
Note that the `softmax_cross_entropy_gy` function returns the gradient of the loss with respect to its input, not the loss value itself. The value of the Softmax Cross Entropy loss function and its gradient are derived in the Appendix: Calculation and Differentiation of Softmax Cross Entropy.
After running the training loop above, let's examine the classification accuracy on MNIST's validation dataset.
```python
y = forward(model, vd)
n_correct = numpy.sum(vl[numpy.arange(len(vl)), y.argmax(axis=1)])
acc = n_correct / float(len(vl))
print(acc)
```
Here `vd` is the validation data and `vl` the corresponding labels; how to create these is shown in the Appendix along with the training data: [Read Dataset](#Read Dataset). After training for $30$ epochs and checking the accuracy on the validation data with the code above, an accuracy of $0.9674$, i.e. $96.74\%$, was obtained.
The whole code is available here: minimum.py. Running `python minimum.py` will train the network, display the accuracy on the validation dataset, and exit.
Now for the main subject. We have seen that implementing the basic layers that make up a Neural Network is very easy. So what kind of implementation does Chainer use to define and train similar networks?
As mentioned at the beginning, Chainer defines the network structure based on the idea of **"Define-by-Run"**. The implementation style described above, in contrast, is called **"Define-and-Run"**: the network structure is fixed **in advance**, and the learning process (forward computation, backward computation, parameter updates, and so on) is executed afterwards. As a result, changing the structure of the network depending on the data (say, a branch point in the middle, where which branch the forward pass takes depends on the content of the data) is very cumbersome or even impossible [^branch]. In "Define-by-Run", however, **describing the forward computation and defining the network structure are one and the same**, and since the forward computation can be described in Python, networks that include branching or stochastic elements are also very easy to write.
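To make the contrast concrete, here is a small sketch of a forward pass whose structure depends on the data. The layers `l1`–`l3` and the branch condition are invented for illustration; the point is only that ordinary Python control flow decides the connectivity at run time:

```python
import numpy

def l1(x):  # illustrative layers: any callables would do
    return numpy.maximum(0, x - 1)

def l2(x):
    return x * 2

def l3(x):
    return x + 1

def dynamic_forward(x):
    # Ordinary Python control flow *is* the network definition: which layer
    # feeds which is decided only while this code actually runs.
    h = l1(x)
    if h.mean() > 0:        # data-dependent branch
        h = l2(h)
    else:
        for _ in range(3):  # even the depth can vary per input
            h = l3(h)
    return h

print(dynamic_forward(numpy.array([0.0, 3.0])))  # => [0. 4.] (takes the l2 branch)
print(dynamic_forward(numpy.array([0.0, 0.5])))  # => [3. 3.] (takes the l3 loop)
```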
Let's take a step-by-step look at how this flexibility is achieved.
First, Chainer has several major classes for achieving the most fundamental functionality.
| Name of the class | Feature overview |
|---|---|
| Variable | A variable that remembers the Function that created it |
| Function | A transformation between Variables that keeps references to its input and output |
| Link | A layer that has parameters |
| Chain | A class for handling multiple Links at once |
| Optimizer | Receives a Chain or Link, and walks over and updates the parameters they hold |
Personally, I think the Variable class (together with the Function class) is at the center of Chainer's unique implementation. To realize "Define-by-Run", it must be possible to **trace, after the computation has been executed, what forward computation was actually performed**, because that trace itself **is the definition of the network**. The Variable class is the basis for this. Put very roughly, it is a **variable that remembers how it was created**.
A Variable should always be **the output of some function**, unless it is a root of the network, i.e. a Variable representing input data. So we add the ability to store **the function that output it** in a member variable called `creator`.
With this alone, however, looking at `creator` only tells you which function output you; you cannot go further back to the function that generated the variable that was input to that function, and so on through the history of executed functions. To make that possible, the **Function class, which performs the actual computation on Variables, holds both its input Variable and its output Variable**. This way you can go from the `creator` that generated a Variable to that Function's `input` and onwards, so you can trace back from any Variable to every leaf node connected to it.
Also, a Variable of course needs to be able to hold a value, so we give it a `data` member variable that holds the array. For the root Variable at the very root, which represents the input data, we put `None` in the `creator` member. To summarize so far, we need at least the following:

Variable

- `creator`: the Function that output this Variable
- `data`: the value itself

Function

- `input`: the Variable that was fed into this Function
- `forward()`: receives the input and performs some computation
- `output`: the result computed by `forward()`

If Variable and Function have these, then it is possible to **recover the history of the computation performed so far from the output Variable**, as follows.
```python
x = Variable(data)

f_1 = Function()  # create a Function object
y_1 = f_1(x)      # internally, Variable.set_creator is called, and the
                  # output y_1 stores f_1 in its creator member
f_2 = Function()
y_2 = f_2(y_1)

y_2.creator                      # => f_2
y_2.creator.input                # => y_1
y_2.creator.input.creator        # => f_1
y_2.creator.input.creator.input  # => x
```
From `y_2`, obtained as the result of the computation, we were able to reach the most fundamental Variable `x` by **alternately following the `creator` that generated each Variable and the `input` held by that `creator`**. The flow above can be represented by a simple diagram as follows.
** Forward calculation flow and calculation graph restoration by reverse reference **
Until now we have spoken of the computation proceeding toward the "upper layers", but note that in this figure, for convenience, the computation proceeds from top to bottom.
If you read the boxes in the figure from top to bottom, you can see how the input data is fed through the Functions in order, generating new Variables, in correspondence with the code above. The blue arrows show the actual movement of data between the objects, and the red arrows show how the computation history can be traced back from the final output Variable.
First, let's write the Variable and Function classes so that the actual forward computation can be performed along the blue arrows in the figure above.
```python
class Variable(object):

    def __init__(self, data):
        self.data = data
        self.creator = None

    def set_creator(self, gen_func):
        self.creator = gen_func


class Function(object):

    def __call__(self, in_var):
        in_data = in_var.data
        output = self.forward(in_data)
        ret = Variable(output)
        ret.set_creator(self)
        self.input = in_var
        self.output = ret
        return ret

    def forward(self, in_data):
        return in_data
```
Using these, let's perform the forward computation as before, and then trace back from the final output Variable through all the functions in between.
```python
data = numpy.array([0, 1, 2, 3])

x = Variable(data)
f_1 = Function()
y_1 = f_1(x)
f_2 = Function()
y_2 = f_2(y_1)

print(y_2.data)
print(y_2.creator)                           # => f_2
print(y_2.creator.input)                     # => y_1
print(y_2.creator.input.creator)             # => f_1
print(y_2.creator.input.creator.input)       # => x
print(y_2.creator.input.creator.input.data)  # => data
```

```
>>> [0 1 2 3]
>>> <__main__.Function object at 0x1021efcf8>
>>> <__main__.Variable object at 0x1021efd30>
>>> <__main__.Function object at 0x1021efcc0>
>>> <__main__.Variable object at 0x1023204a8>
>>> [0 1 2 3]
```
First, data in `numpy.ndarray` format is passed to the Variable constructor. The resulting Variable object `x` is the input to the network.
```python
f_1 = Function()
```
Here we instantiate the Function we want to use as one layer of the network. This Function is an identity map and has no parameters, so there is nothing to give to the constructor and it takes no arguments.
```python
y_1 = f_1(x)
```
This line applies the function `f_1` to the input data `x` and assigns its output to `y_1`. The output should also be a Variable, so `y_1` is an instance of the Variable class. Calling `f_1` as a function invokes its internal `__call__`, so this line means that `x` is passed to the `__call__` method of the Function class. Let's first look at the contents of the `__call__` method.
```python
in_data = in_var.data
```
The current code doesn't do any type checking, but assuming the passed argument is a Variable, we take its `data` element and put it in `in_data`. This is the data itself, needed for the actual forward computation.
```python
output = self.forward(in_data)
```
Here the object's own `forward` method is given the `numpy.ndarray` fetched from the input Variable in the previous line, and the return value is stored in `output`.
```python
ret = Variable(output)
```
In this line, the result of the forward computation, `output`, assumed to be a `numpy.ndarray`, is wrapped in a Variable again. This means the `forward` method itself should be a function that takes a `numpy.ndarray` and returns a `numpy.ndarray`.
```python
ret.set_creator(self)
```
Next, in this line, the Function **registers itself as the creator** of the forward result `ret` that was wrapped in a Variable. Let's take a look at the `set_creator` method of the Variable class.
```python
def set_creator(self, gen_func):
    self.creator = gen_func
```
Here the received Function object is stored in the Variable's own `self.creator` member variable. This allows the Variable to hold a reference to the Function that output it.
```python
self.input = in_var
```
Next, the input Variable `in_var` that was passed to this Function is stored in `self.input`, so that the Function called before this one can later be traced from here.
```python
self.output = ret
```
In addition, the Variable `ret` holding the result of the forward computation is stored in `self.output`. This is because backpropagation will need the gradient at the next upper layer; it makes sense when you think of the chain rule of differentiation. Reference: [Gradient for a layer's parameters to loss](#Gradient for a layer's parameters to loss)
```python
return ret
```
Finally, it returns `ret`. In total: if you call a Function object as a function and pass it a Variable, you get back, again wrapped in a Variable, the result of applying the `forward` method to its contents `data`.
The Functions in the code so far had no parameters, so we could only build networks with nothing to update. But if a Function's `forward` performs a transformation governed by some parameter, we want to compute the optimal value of that parameter so as to minimize some loss measure of the transformation. For Neural Networks this parameter is usually optimized with a method based on gradient descent, which requires the gradient of the loss function with respect to every parameter. Backpropagation is a method for doing this for multi-layer networks, viewed as composite maps of many functions.
The implementation of the chain rule in Backpropagation is very simple and there are many ways to do it, but here I will explain, in code, an implementation that uses the fact that "the computation history can be traced backwards from the output Variable", based on the Variable and Function implementation described above.
First, define the necessary functions and perform the forward computation. Consider a network that applies two functions to the data and then computes a loss.
```python
f1 = Function()      # define the first function
f2 = Function()      # define the second function
f3 = Function()      # define the loss function

y0 = Variable(data)  # input data
y1 = f1(y0)          # apply the first function
y2 = f2(y1)          # apply the second function
y3 = f3(y2)          # apply the loss function
```
Now, using this final output Variable (`y3`) as the base point, we compute the gradient of each layer in order while tracing the functions applied to the data in reverse order [^ What is the gradient of each layer]. Each computed gradient is stored in the `grad` member of the input Variable of the corresponding function. Since the `input` of one layer is the `output` of the layer below it, that lower layer can then refer to this gradient.
```python
f3 = y3.creator                      # 0. from the loss value, follow the loss function
gx = f3.backward()                   # 1. gradient of the loss function
f3.input.grad = gx                   # 2. store it in the grad of the input

f2 = f3.input.creator                # 3. follow the function one layer below
gx = f2.backward()                   # 4. gradient of the current layer (d output / d input)
f2.input.grad = f2.output.grad * gx  # 5. gradient of the loss w.r.t. this function's input

f1 = f2.input.creator                # 3. follow the function one layer below
gx = f1.backward()                   # 4. gradient of the current layer (d output / d input)
f1.input.grad = f1.output.grad * gx  # 5. gradient of the loss w.r.t. this function's input
```
Step 5 computes `d loss / d input = (d loss / d output) * (d output / d input)`, following the chain rule of differentiation. Also, once you reach step 5, the process simply repeats from step 3.
In this way, starting from the final output Variable `y3`, we could compute the gradient of the loss with respect to every layer's input. Written more compactly, reusing the named intermediate output Variables from the forward computation instead of splitting every assignment onto its own line, this is the same as the following code.
```python
# Starting from y3
f3 = y3.creator

y2.grad = f3.backward() * 1        # (dy3 / dy2) * (dy3 / dy3 = 1)
y1.grad = f2.backward() * y2.grad  # (dy2 / dy1) * (dy3 / dy2)
y0.grad = f1.backward() * y1.grad  # (dy1 / dy0) * (dy3 / dy1)
```
Furthermore, writing everything after `f3 = y3.creator` in a single line gives:

```python
y0.grad = 1 * f3.backward() * f2.backward() * f1.backward()
```
This corresponds exactly to the following chain of derivatives:

$$
\frac{\partial y_3}{\partial y_0} = \frac{\partial y_3}{\partial y_3} \frac{\partial y_3}{\partial y_2} \frac{\partial y_2}{\partial y_1} \frac{\partial y_1}{\partial y_0}
$$
Since `y0` is an input to the network it would more naturally be named `x`, `f3` is a loss function so it might be called `l`, and `y3` is a loss so it might be called `loss`, and so on; but here, to emphasize that backward computes the gradient by repeating the same calculation from the upper layers toward the lower layers, I adopted a notation in which only the subscript changes. Also, the derivative of `y3` with respect to itself is $1$.
However, we have not yet computed the gradients needed to update the parameters of a Function. To update a parameter, we need the **gradient of the loss function with respect to that parameter**. To obtain it, each `backward` method should first compute the **gradient of its own output with respect to its parameter**, and then multiply it by the gradient transmitted from the upper layers to obtain `gw`.
For example, if the middle function `f2` has a parameter `w` and performs the transformation `w * y1` on its input `y1`, then the gradient of `f2`'s output with respect to `w` is `y1`. Since this is the input `f2.input` to `f2`, `gw` becomes:
```python
gw = f2.input.data * f2.output.grad
```
Information from the upper layers is aggregated in the `f2.output` Variable, and information from the lower layers in the `f2.input` Variable, so the two together make it possible to compute the gradient of the loss function with respect to the parameters inside the layer.
This `gw` is used to update the parameter `w`. The update rule itself varies with the optimization method; the Optimizer that walks over the network and updates its parameters will be described later.
By the way, to let a single method called `backward` on the final output Variable carry out the whole Backpropagation process described above, we add the following to Variable and Function.

Variable

- A `grad` member variable
- A `backward` method, which
    - traces the computation history backwards, starting from its own `creator`
    - calls the `backward` method of every Function along the way
    - stores the return value of each Function's `backward` in the `grad` of that Function's input Variable

Function

- A `backward` method, which
    - computes the gradient `gx` of its own output with respect to its input
    - receives `grad_output`, the gradient of the loss with respect to its output passed from the upper layer, and returns its product with `gx`
    - computes the gradient of its own output with respect to its parameters, multiplies it by `grad_output` to obtain `gw`, and holds it

Concretely, it looks like this.
```python
class Variable(object):

    def __init__(self, data):
        self.data = data
        self.creator = None
        self.grad = 1

    def set_creator(self, gen_func):
        self.creator = gen_func

    def backward(self):
        if self.creator is None:  # input data
            return
        func = self.creator
        while func:
            gy = func.output.grad
            func.input.grad = func.backward(gy)
            func = func.input.creator


class Function(object):

    def __call__(self, in_var):
        in_data = in_var.data
        output = self.forward(in_data)
        ret = Variable(output)
        ret.set_creator(self)
        self.input = in_var
        self.output = ret
        return ret

    def forward(self, in_data):
        raise NotImplementedError()

    def backward(self, grad_output):
        raise NotImplementedError()
```
An identity map alone is not very interesting, so from here on Function is only an interface definition, allowing various functions to be defined; a class called `Mul` that actually implements `forward` and `backward` is defined by inheriting from this Function.
```python
class Mul(Function):

    def __init__(self, init_w):
        self.w = init_w  # initialize the parameter

    def forward(self, in_data):
        return in_data * self.w

    def backward(self, grad_output):
        gx = self.w * grad_output                # gradient w.r.t. the input
        self.gw = self.input.data * grad_output  # gradient w.r.t. the parameter
        return gx
```
This is just a function that multiplies its input by the parameter given at initialization. In `backward`, the gradient of its own transformation and the gradient of its parameter are each computed, and the parameter gradient is kept in `self.gw`.
Let's do a forward calculation using these extended Variables and Functions.
```python
data = numpy.array([0, 1, 2, 3])

f1 = Mul(2)
f2 = Mul(3)
f3 = Mul(4)

y0 = Variable(data)
y1 = f1(y0)  # y1 = y0 * 2
y2 = f2(y1)  # y2 = y1 * 3
y3 = f3(y2)  # y3 = y2 * 4

print(y0.data)
print(y1.data)
print(y2.data)
print(y3.data)
```

```
>>> [0 1 2 3]
>>> [0 2 4 6]
>>> [ 0  6 12 18]
>>> [ 0 24 48 72]
```
You can see that each application of a function multiplies the values by the initial value given to that function. Now, when `y3.backward()` is executed, the functions applied so far are traced back from `y3` in reverse order, the `backward` of each Function is called in turn, and the `grad` member variable of each intermediate input/output Variable is filled with the gradient of the final output with respect to it.
```python
y3.backward()

print(y3.grad)  # df3 / dy3 = 1
print(y2.grad)  # df3 / dy2 = (df3 / dy3) * (dy3 / dy2) = 1 * 4
print(y1.grad)  # df3 / dy1 = (df3 / dy3) * (dy3 / dy2) * (dy2 / dy1) = 1 * 4 * 3
print(y0.grad)  # df3 / dy0 = (df3 / dy3) * (dy3 / dy2) * (dy2 / dy1) * (dy1 / dy0) = 1 * 4 * 3 * 2
```

```
>>> 1
>>> 4
>>> 12
>>> 24
```
```python
print(f3.gw)
print(f2.gw)
print(f1.gw)
```

```
>>> [ 0  6 12 18]  # f3.gw = y2 * (df3 / dy3) = y2 * 1
>>> [ 0  8 16 24]  # f2.gw = y1 * (df3 / dy2) = y1 * 4
>>> [ 0 12 24 36]  # f1.gw = y0 * (df3 / dy1) = y0 * 12
```
The following is a simple diagram of what happens inside each object over this sequence: writing the forward computation yourself and then calling `backward()` on the final output Variable.
Link
A Link calls a Function internally, passing to the Function the parameters needed for the transformation it performs. These parameters are kept as member variables of the Link object and are updated by the Optimizer during network training.
(to be continued)
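Although that section is still to come, a minimal sketch of the idea might look like this, assuming the Link simply owns the parameter and instantiates a Function to do the actual computation (my assumption, not Chainer's actual code):

```python
class Link(object):
    # Minimal sketch: the Link owns the parameter; the Function does the math.
    def __init__(self, init_w):
        self.w = init_w  # parameter; an Optimizer would update this

    def __call__(self, x):
        self.func = Mul(self.w)  # hand the current parameter to the Function
        return self.func(x)
```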
Chain
A Chain can hold any number of Links, which is useful for grouping parameters that you want to update together, and for describing easy-to-understand sub-units of a large network.
(to be continued)
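Likewise, a minimal sketch of the grouping idea, under the same assumptions as the Link sketch above:

```python
class Chain(object):
    # Minimal sketch: a Chain groups child Links so that everything an
    # Optimizer needs to update can be collected in one place.
    def __init__(self, **links):
        self.links = links

    def __call__(self, x):
        # Illustrative forward pass: apply the child Links in insertion order.
        for link in self.links.values():
            x = link(x)
        return x

    def params(self):
        return [link.w for link in self.links.values()]
```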
Appendix
If we somewhat forcefully describe a Neural Network as a composite map consisting of alternating applications of linear and non-linear transformations, an affine transformation can serve as one of the linear transformations that compose it. An affine transformation here means the transformation that takes a real vector ${\bf x} \in \mathbb{R}^{d_{in}}$, multiplies it by a weight matrix ${\bf W} \in \mathbb{R}^{d_{out} \times d_{in}}$, and adds a bias vector ${\bf b} \in \mathbb{R}^{d_{out}}$; geometrically it covers "rotation, scaling, shearing, and translation".
Consider implementing this as one layer of a Neural Network, called Linear. A layer may or may not have trainable parameters; the Linear layer's ${\bf W}$ and ${\bf b}$ for the affine transformation are trainable parameters, since they will be updated toward values that perform the desired transformation.
Now, let's express the Linear layer as a class written in Python.
We will implement functions for:

- forward computation (take ${\bf x}$ and return ${\bf y} = {\bf W}{\bf x} + {\bf b}$)
- backward computation (compute $\partial {\bf y} / \partial {\bf W}$ and $\partial {\bf y} / \partial {\bf b}$)
- update computation (update the parameters ${\bf W}, {\bf b}$ using these gradients)
```python
class Linear(object):

    def __init__(self, in_sz, out_sz):
        self.W = numpy.random.randn(out_sz, in_sz) * numpy.sqrt(2. / in_sz)
        self.b = numpy.zeros((out_sz,))

    def __call__(self, x):
        self.x = x
        return x.dot(self.W.T) + self.b

    def update(self, gy, lr):
        self.W -= lr * gy.T.dot(self.x)
        self.b -= lr * gy.sum(axis=0)
        return gy.dot(self.W)
```
In this Linear class, the constructor first initializes the layer's parameters (${\bf W}$, ${\bf b}$): ${\bf W}$ with normal random numbers of mean $0$ and standard deviation $\sqrt{2 / {\rm in\_sz}}$, and ${\bf b}$ with zeros.
```python
self.W = numpy.random.randn(out_sz, in_sz) * numpy.sqrt(2. / in_sz)
self.b = numpy.zeros((out_sz,))
```
This initialization method is called HeNormal [^HeNormal]. `in_sz` is the input size, i.e. the dimension $d_{in}$ of the input vector, and `out_sz` is the output size, i.e. the dimension $d_{out}$ of the output vector after the transformation.
Next, the `__call__` method corresponds to the forward computation; it calculates ${\bf W}{\bf x} + {\bf b}$:
```python
return x.dot(self.W.T) + self.b
```
Computing the gradients of the output with respect to the parameters (= backward) is so simple for the Linear layer that it is not provided as an independent method in the code above. Specifically, $\partial {\bf y} / \partial {\bf W} = {\bf x}$ and $\partial {\bf y} / \partial {\bf b} = {\bf 1}$ (where ${\bf 1}$ is a $d_{out}$-dimensional vector whose elements are all $1$), and these are used as known facts in the `update` method. In the first line of the snippet below, `self.x` corresponds to $\partial {\bf y} / \partial {\bf W}$. On the right side of the second line, `gy.sum(axis=0)` performs the same computation as `gy.T.dot(numpy.ones((gy.shape[0],)))`; the `numpy.ones((gy.shape[0],))` part corresponds to $\partial {\bf y} / \partial {\bf b}$.
```python
self.W -= lr * gy.T.dot(self.x)
self.b -= lr * gy.sum(axis=0)
```
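The claimed equivalence of `gy.sum(axis=0)` and `gy.T.dot(numpy.ones((gy.shape[0],)))` is easy to confirm numerically; this quick check is added here for illustration:

```python
import numpy

gy = numpy.random.randn(4, 3)  # a dummy batch of four output gradients
print(numpy.allclose(gy.sum(axis=0),
                     gy.T.dot(numpy.ones((gy.shape[0],)))))  # => True
```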
If the gradient computation were more complicated, it would be better to provide a separate `backward` method or similar, and compute the gradients of the parameters separately from `update`, as sketched below.
Parameter updates really ought to be abstracted away, or cut out into a separate class, to accommodate the various gradient-descent variants [^optimizers]; but here we consider only the simplest case: the Linear class that holds the parameters also has an `update` method, which updates the parameters with plain stochastic gradient descent (sometimes called Vanilla SGD).
What the `update` method does is simple. By the chain rule for the derivative of a composite function, the product over all upper layers of each layer's output-to-input gradients arrives as `gy`; the method multiplies it by the gradients of this layer's output with respect to its parameters ${\bf W}, {\bf b}$. The result is the gradient of the objective function with respect to ${\bf W}, {\bf b}$, so multiplying it by the learning rate `lr` gives the update amount, which is then subtracted from the parameters.
The `update` method then returns the product of the gradient passed down from the upper layers and ${\bf W}$, the gradient of this layer's output with respect to its input. This becomes the `gy` of the layers below.
Thanks to the chain rule, the gradient computation required for Backpropagation is very easy to implement. Each layer passes $\partial f / \partial {\bf x}$, the gradient of its transformation $f$ with respect to its input ${\bf x}$, down to the lower layers as `gy`, and each layer can update its parameters by multiplying the `gy` passed from above with the gradient of its own transformation $f$ with respect to its parameters.
With a class implementing the above, you can create a Linear layer of any input/output size. To pass a value to the Linear layer, call the object as a function with a `numpy.ndarray` argument; to update the internal parameters, pass the `gy` from the upper layer and the learning rate `lr` used in the update rule to the `update` method.
At the beginning of the Appendix on the Linear layer above, I forcibly described a Neural Network as "alternating applications of linear and non-linear transformations", so we now want to apply a non-linear transformation to the output of the Linear layer. A wide variety of non-linear transformations, called activation functions, have been proposed for Neural Networks. One of the most common at the moment is ReLU, whose transformation can be written as follows.
```python
class ReLU(object):

    def __call__(self, x):
        self.x = x
        return numpy.maximum(0, x)

    def update(self, gy, lr):
        return gy * (self.x > 0)
```
Since the activation function is a parameterless transformation, `update` has no parameters to update. Instead, it computes its own subgradient and multiplies it by `gy`.
A library called Scikit-learn makes it easy to download and load MNIST datasets.
```python
from sklearn.datasets import fetch_mldata

# Load the MNIST dataset
mnist = fetch_mldata('MNIST original', data_home='.')
td, tl = mnist.data[:60000] / 255.0, mnist.target[:60000]

# Turn the targets into one-hot vectors
tl = numpy.array([tl == i for i in range(10)]).T.astype(numpy.int)

# Shuffle
perm = numpy.random.permutation(len(td))
td, tl = td[perm], tl[perm]
```
I also created the Validation data in the same way.
```python
vd, vl = mnist.data[60000:] / 255.0, mnist.target[60000:]
vl = numpy.array([vl == i for i in range(10)]).T.astype(numpy.int)
```
When the network output is ${\bf y} \in \mathbb{R}^{d_l}$ and $\hat{\bf y}$ is its conversion into a probability vector by the Softmax function, $\hat{\bf y}$ is computed as follows:

$$
\hat{y}_i = \frac{\exp(y_i)}{\sum_{j=1}^{d_l} \exp(y_j)} \qquad (i = 1, 2, \dots, d_l)
$$

Since each $\hat{y}_i\ (i = 1, 2, \dots, d_l)$ represents a probability, $0 \leq \hat{y}_i \leq 1$.
Now, let the teacher signal also be a $d_l$-dimensional one-hot vector (a vector in which exactly one element is $1$ and all the others are $0$), ${\bf t} = [t_1, t_2, \dots, t_{d_l}]^{\rm T}$. Then the likelihood of $\hat{\bf y} = {\bf t}$ can be defined as follows:

$$
L(\hat{\bf y} = {\bf t}) = \prod_{i=1}^{d_l} \hat{y}_i^{t_i}
$$
$t_i$ is $1$ only when $i$ is the index of the correct class and $0$ otherwise, so if the correct class is $i = 5$, the formula above becomes $1 \cdot 1 \cdots \hat{y}_5 \cdots 1 = \hat{y}_5$. In other words, $L$ can be interpreted as "how well, and with how much confidence, did we predict the correct answer". We would like to make this value larger; in practice one takes the logarithm of $L$, inverts the sign, and **minimizes** $-\log(L)$. Since $\log$ is monotonically increasing, $L$ is maximal when $\log(L)$ is maximal, and $-\log(L)$ is minimal when $\log(L)$ is maximal. As a result, minimizing $-\log(L)$ maximizes the likelihood above [^SoftmaxCrossEntropy_derivation]. This $-\log(L)$ is called the "negative log-likelihood", since it is the sign-inverted log of the likelihood, but in the context of Neural Networks it is more often called the cross-entropy. Putting this cross-entropy as $\mathcal{L}$, we have:
$$
\mathcal{L} = -\log(L) = -\sum_{i=1}^{d_l} t_i \log \hat{y}_i
$$

We will use this as the loss function for training a Neural Network that solves a classification problem, and aim to minimize it.
Now, in a layer definition using a Python class as above, if we adopt the convention that "$1$ is always passed to the `gy` of the `update` method", the loss function itself can be treated as just another layer. The loss function has no parameters to update, so all its `update` method has to do is compute the gradient of its output (= the loss value) with respect to its input (= the network output). That is, for $k = 1, 2, \dots, d_l$ we want to compute:

$$
\frac{\partial \mathcal{L}}{\partial y_k} = -\sum_{i=1}^{d_l} t_i \frac{\partial \log \hat{y}_i}{\partial y_k} = -\sum_{i=1}^{d_l} \frac{t_i}{\hat{y}_i} \frac{\partial \hat{y}_i}{\partial y_k} \qquad (1)
$$
The summation symbol does not drop out here because the Softmax function has all dimensions of ${\bf y}$ in its denominator, so each $\hat{y}_i$ is a function of every $y_k$. Now, the gradient of the Softmax function is:
When $ k \ neq i $
When $ k = i $ $$ \begin{eqnarray} \frac{\partial \hat{y}_i}{\partial y_k} &=& \frac{\exp(y_i)}{\sum_j \exp(y_j)}
Therefore, we can use this to split the sum in expression $(1)$ into the term where $i = k$ and the terms where $i \neq k$:

$$
\frac{\partial \mathcal{L}}{\partial y_k} = -\frac{t_k}{\hat{y}_k}\hat{y}_k(1 - \hat{y}_k) + \sum_{i \neq k} \frac{t_i}{\hat{y}_i}\hat{y}_i\hat{y}_k
$$

Here the first term is the $i = k$ part and the second term the $i \neq k$ part. Transforming further,
$$
\frac{\partial \mathcal{L}}{\partial y_k} = -t_k + t_k \hat{y}_k + \sum_{i \neq k} t_i \hat{y}_k = -t_k + \hat{y}_k \sum_i t_i = \hat{y}_k - t_k
$$

where the property of the one-hot vector ($\sum_i t_i = 1$) is used in the final transformation. Rewriting this result in vector form:
$$
\frac{\partial \mathcal{L}}{\partial {\bf y}} = \hat{\bf y} - {\bf t}
$$

In other words, this is exactly the `gy` returned from the `update` method of the Softmax Cross Entropy class.
If the loss function is $\mathcal{L}$, the gradient of the loss with respect to a parameter ${\bf W}_l$ in layer $l$, with layers $l+1, l+2, \dots, L$ above it, is given by the chain rule of differentiation as follows:

$$
\frac{\partial \mathcal{L}}{\partial {\bf W}_l} = \frac{\partial \mathcal{L}}{\partial y_L} \frac{\partial y_L}{\partial y_{L-1}} \cdots \frac{\partial y_{l+1}}{\partial y_l} \frac{\partial y_l}{\partial {\bf W}_l}
$$

Here every gradient from $\partial y_{l+1} / \partial y_l$ up to $\partial y_L / \partial y_{L-1}$ is the gradient of an upper layer's **output with respect to its input**, for the layers leading down to layer $l$. Let's call these input/output gradients. As for the last factor $\partial \mathcal{L} / \partial y_L$: if you think of $\mathcal{L}$ as a loss layer at level $L+1$, it too is an input/output gradient, the gradient of that layer's output (the loss) with respect to its input (the network's prediction). In other words, **what the backward computation has to produce is the product of all the input/output gradients of the layers connecting a given layer to the loss, multiplied by the gradient of the layer's own output with respect to its parameters**. So each layer passes its input/output gradient to the Function that handed it a Variable, and the receiving side in turn passes the product with its own input/output gradient further down. That is all that is needed.
[^cupy-PR]: Related PRs include "Support advanced indexing with boolean array for getitem": https://github.com/pfnet/chainer/pull/1840

[^optimizers]: Chainer implements AdaDelta, AdaGrad, Adam, MomentumSGD, NesterovAG, RMSprop, RMSpropGraves, SGD (= Vanilla SGD), and SMORMS3. There is no RProp, and Eve has a PR (https://github.com/pfnet/chainer/pull/1847) that has not been merged yet. I want to get Adam and Eve together as soon as possible (?).