In this article, to understand **"Define-by-Run"**, the most characteristic concept of the Neural Network framework Chainer, we will implement a library called "1f-chainer" that has only the minimum necessary functionality, using nothing but NumPy, and use it to describe and train a network that classifies handwritten digits. I have tried to move every explanation that requires mathematical formulas to the Appendix, so that the main text can be followed with code and prose alone.
All the code used in this article is available here: 1f-chainer. When I started writing I wanted to add various things but ran out of time, so I will update it piece by piece by the end of this week.
In addition, all content of this article is based on my personal opinion and understanding, and has nothing to do with the organization to which I belong.
This article assumes basic knowledge of training Neural Networks with Backpropagation and experience with Python and NumPy. Please note that the author is a fan of Chainer, so the impressions of Chainer below are personal feelings, not official views.
Chainer is a framework written in Python for building and training Neural Networks.
- High flexibility (dynamic networks are easy to write)
- Relatively easy debugging (a faster trial-and-error cycle)
- Intuitive models (models can be written in plain Python)

These, I think, are the things the library aims for.
As far as I know there are many well-known frameworks and libraries for building and training Neural Networks, but few of them are written in Python all the way down to the backend. Chainer, on the other hand, adopts NumPy as its tensor library for CPUs and CuPy, originally developed for Chainer, for GPUs; one of Chainer's features is that both are independent Python libraries that can also be used on their own.
CuPy is written mainly in Cython and has a mechanism for compiling and executing CUDA kernels internally. It also serves as a wrapper for cuDNN, NVIDIA's Neural Network library for NVIDIA GPUs. A major feature of CuPy is its NumPy-compatible API: CPU code written with NumPy can often be made to run on the GPU with very few changes (sometimes just replacing the string `numpy` in the code with `cupy`). There are still NumPy functions and features that CuPy does not support (advanced indexing, etc. [^cupy-PR]), so various restrictions and exceptions exist, but the basic goal of CuPy is to keep CPU code and GPU code as close to each other as possible.
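To make this concrete, the swap can be reduced to a single import alias. This is a minimal sketch of the pattern (assuming CuPy may or may not be installed), not code from the article's repository:

```python
# Bind whichever backend is available to one alias and write the rest once.
try:
    import cupy as xp   # arrays live on the GPU
except ImportError:
    import numpy as xp  # CPU fallback; the code below is unchanged

a = xp.arange(6).reshape(2, 3)
b = xp.ones((3, 2))
print(xp.dot(a, b))  # the same call regardless of which library xp names
```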
I have not tried all of the major frameworks/libraries, but the internal design Chainer uses to build and train Neural Networks is based on the idea of **"Define-by-Run"**, and I think this is the important feature that sets Chainer apart from other frameworks. "Define-by-Run" means, as the name implies, that the structure of a Neural Network is "defined by running" it: the structure is not fixed before execution, and only once the code runs is it determined how the parts of the network are connected. Concretely, when a function is applied to an input variable, the output is a new variable that remembers "which function produced me", so the computation that was actually performed can be traced backwards as many times as you like; in other words, the computational graph is constructed at runtime. Because of this, the user only has to describe the forward computation of the network in Python (including rule-based conditional branching, stochastic branching, or even generating new layers mid-computation): writing the code that represents the forward computation is itself defining the network.
Several classes that realize these characteristics are at the heart of Chainer. In this article, I will implement only the most basic parts of Chainer myself as a single-file library (like Python's web framework, Bottle), with the aim of understanding, at least superficially, that heart.
Let's start by looking at what happens if we write code that builds and trains a Neural Network as simply as possible, without worrying about making it "Chainer-like". The code in the rest of this section assumes that `import numpy` has been done at the top. It is also assumed throughout the following that the parameters are updated by mini-batch SGD.
A three-layer perceptron consisting of only two kinds of layers, the Linear layer and the activation function ReLU, can be defined as follows.
```python
class Linear(object):

    def __init__(self, in_sz, out_sz):
        self.W = numpy.random.randn(out_sz, in_sz) * numpy.sqrt(2. / in_sz)
        self.b = numpy.zeros((out_sz,))

    def __call__(self, x):
        self.x = x
        return x.dot(self.W.T) + self.b

    def update(self, gy, lr):
        self.W -= lr * gy.T.dot(self.x)
        self.b -= lr * gy.sum(axis=0)
        return gy.dot(self.W)


class ReLU(object):

    def __call__(self, x):
        self.x = x
        return numpy.maximum(0, x)

    def update(self, gy, lr):
        return gy * (self.x > 0)


model = [
    Linear(784, 100),
    ReLU(),
    Linear(100, 100),
    ReLU(),
    Linear(100, 10)
]
```
That's all; you can see it can be written very concisely. A detailed description of the Linear layer above is given later in the Appendix: [Linear Layer Basics](#Linear Layer Basics), along with a brief supplement on the ReLU layer ([About the ReLU layer](#About the ReLU layer)).
Next, we will prepare the functions `forward` and `update` required to train this three-layer perceptron using the data `x` and the correct answer `t`.
```python
def forward(model, x):
    for layer in model:
        x = layer(x)
    return x

def update(model, gy, lr=0.0001):
    for layer in reversed(model):
        gy = layer.update(gy, lr)
```
These two functions are enough to train the three-layer perceptron above. The `forward` function walks the layers of the `model` (given as a list) in order, propagating the data forward. Conversely, the `update` function walks the layers of the `model` **in reverse order**, back-propagating `gy`, the running product of gradients that appears in the chain rule. In summary, the learning procedure consists of three steps:
- Inference: give the `forward` function the data `x`, and call the resulting output `y`
- Loss computation: compute the error between `y` and the correct answer `t`, take its gradient with respect to `y`, and call it `gy`
- Update: give `gy` to the `update` function and update all the parameters in the network
The code that actually repeats these steps over the MNIST dataset to train the network can be written as follows. Here `td` is a `numpy.ndarray` of shape `(60000, 784)` holding $60000$ $784$-dimensional data vectors, and `tl` is a `numpy.ndarray` of shape `(60000, 10)` holding $60000$ $10$-dimensional one-hot vectors (vectors in which only the dimension corresponding to the correct class is $1$ and all other dimensions are $0$). The code that actually creates these arrays is given in the Appendix: [Read Dataset](#Read Dataset).
```python
def softmax_cross_entropy_gy(y, t):
    return (numpy.exp(y.T) / numpy.exp(y.T).sum(axis=0)).T - t
```
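As a side note, the exponentials above can overflow for large logits. A mathematically equivalent variant that subtracts the per-row maximum first avoids this; this defensive rewrite is my addition, not part of the article's code:

```python
def softmax_cross_entropy_gy_stable(y, t):
    # Subtracting the row-wise max leaves the softmax unchanged
    # (exp(y - c) / sum(exp(y - c)) == exp(y) / sum(exp(y))) but avoids overflow.
    e = numpy.exp(y - y.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True) - t
```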
```python
# Training
for epoch in range(30):
    for i in range(0, len(td), 128):
        x, t = td[i:i + 128], tl[i:i + 128]
        y = forward(model, x)
        gy = softmax_cross_entropy_gy(y, t)
        update(model, gy)
```
You can see that the basic training flow with mini-batch SGD is also very simple. Since the network is defined as a Python list, the forward computation is just a matter of giving the input variable to the first element of the list and feeding each output to the next element as its input. In the `update` function that updates the parameters, we look at this list in reverse order, computing `gy` from the back and passing it to the `update` method of each layer.
Note that the `softmax_cross_entropy_gy` function returns the gradient of the loss with respect to its input, not the loss value itself. The value of the Softmax Cross Entropy loss function and its gradient are derived in the Appendix: Calculation and Differentiation of Softmax Cross Entropy.
After running the training loop above, let's examine the classification accuracy on MNIST's validation dataset.
```python
y = forward(model, vd)
n_correct = numpy.sum(vl[numpy.arange(len(vl)), y.argmax(axis=1)])
acc = n_correct / float(len(vl))
print(acc)
```
Here `vd` is the validation data and `vl` the corresponding labels; how to create these is shown in the Appendix along with the training data: [Read Dataset](#Read Dataset). After training for $30$ epochs and checking the accuracy on the validation data with the code above, an accuracy of $0.9674$, i.e. $96.74\%$, was obtained.
The whole code is available here: minimum.py. Running `python minimum.py` will train the network, display the accuracy on the validation dataset, and exit.
Now for the main subject. We have seen that implementing the basic layers that make up a Neural Network is very easy. So what kind of implementation does Chainer use to define and train similar networks?
As mentioned at the beginning, Chainer defines the network structure based on the idea of **"Define-by-Run"**. The implementation style described above, in contrast, is called **"Define-and-Run"**: the network structure is fixed **in advance**, and the learning process (forward computation, backward computation, parameter updates, and so on) is executed afterwards. As a result, changing the structure of the network depending on the data (say, a branch point in the middle, where which branch the forward pass takes depends on the content of the data) is very cumbersome or even impossible [^branch]. In "Define-by-Run", however, **describing the forward computation and defining the network structure are one and the same**, and since the forward computation can be described in Python, networks that include branching or stochastic elements are also very easy to write.
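To make the contrast concrete, here is a small sketch of a forward pass whose structure depends on the data. The layers `l1`–`l3` and the branch condition are invented for illustration; the point is only that ordinary Python control flow decides the connectivity at run time:

```python
import numpy

def l1(x):  # illustrative layers: any callables would do
    return numpy.maximum(0, x - 1)

def l2(x):
    return x * 2

def l3(x):
    return x + 1

def dynamic_forward(x):
    # Ordinary Python control flow *is* the network definition: which layer
    # feeds which is decided only while this code actually runs.
    h = l1(x)
    if h.mean() > 0:        # data-dependent branch
        h = l2(h)
    else:
        for _ in range(3):  # even the depth can vary per input
            h = l3(h)
    return h

print(dynamic_forward(numpy.array([0.0, 3.0])))  # => [0. 4.] (takes the l2 branch)
print(dynamic_forward(numpy.array([0.0, 0.5])))  # => [3. 3.] (takes the l3 loop)
```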
Let's take a step-by-step look at how this flexibility is achieved.
First, Chainer has several major classes for achieving the most fundamental functionality.
| Name of the class | Feature overview |
|---|---|
| Variable | A variable that remembers the Function that created it |
| Function | A transformation between Variables that keeps references to its input and output |
| Link | A layer that has parameters |
| Chain | A class for handling multiple Links at once |
| Optimizer | Receives a Chain or Link, and walks over and updates the parameters they hold |
Personally, I think the Variable class (together with the Function class) is at the center of Chainer's unique implementation. To realize "Define-by-Run", it must be possible to **trace, after the computation has been executed, what forward computation was actually performed**, because that trace itself **is the definition of the network**. The Variable class is the basis for this. Put very roughly, it is a **variable that remembers how it was created**.
A Variable should always be **the output of some function**, unless it is a root of the network, i.e. a Variable representing input data. So we add the ability to store **the function that output it** in a member variable called `creator`.
With this alone, however, looking at `creator` only tells you which function output you; you cannot go further back to the function that generated the variable that was input to that function, and so on through the history of executed functions. To make that possible, the **Function class, which performs the actual computation on Variables, holds both its input Variable and its output Variable**. This way you can go from the `creator` that generated a Variable to that Function's `input` and onwards, so you can trace back from any Variable to every leaf node connected to it.
Also, a Variable of course needs to be able to hold a value, so we give it a `data` member variable that holds the array. For the root Variable at the very root, which represents the input data, we put `None` in the `creator` member. To summarize so far, we need at least the following:

Variable

- `creator`: the Function that output this Variable
- `data`: the value itself

Function

- `input`: the Variable that was fed into this Function
- `forward()`: receives the input and performs some computation
- `output`: the result computed by `forward()`

If Variable and Function have these, then it is possible to **recover the history of the computation performed so far from the output Variable**, as follows.
```python
x = Variable(data)

f_1 = Function()  # create a Function object
y_1 = f_1(x)      # internally, Variable.set_creator is called, and the
                  # output y_1 stores f_1 in its creator member
f_2 = Function()
y_2 = f_2(y_1)

y_2.creator                      # => f_2
y_2.creator.input                # => y_1
y_2.creator.input.creator        # => f_1
y_2.creator.input.creator.input  # => x
```
From `y_2`, obtained as the result of the computation, we were able to reach the most fundamental Variable `x` by **alternately following the `creator` that generated each Variable and the `input` held by that `creator`**. The flow above can be represented by a simple diagram as follows.
** Forward calculation flow and calculation graph restoration by reverse reference **
Until now we have spoken of the computation proceeding toward the "upper layers", but note that in this figure, for convenience, the computation proceeds from top to bottom.
If you read the boxes in the figure from top to bottom, you can see how the input data is fed through the Functions in order, generating new Variables, in correspondence with the code above. The blue arrows show the actual movement of data between the objects, and the red arrows show how the computation history can be traced back from the final output Variable.
First, let's write the Variable and Function classes so that the actual forward computation can be performed along the blue arrows in the figure above.
```python
class Variable(object):

    def __init__(self, data):
        self.data = data
        self.creator = None

    def set_creator(self, gen_func):
        self.creator = gen_func


class Function(object):

    def __call__(self, in_var):
        in_data = in_var.data
        output = self.forward(in_data)
        ret = Variable(output)
        ret.set_creator(self)
        self.input = in_var
        self.output = ret
        return ret

    def forward(self, in_data):
        return in_data
```
Using these, let's perform the forward computation as before, and then trace back from the final output Variable through all the functions in between.
```python
data = numpy.array([0, 1, 2, 3])

x = Variable(data)
f_1 = Function()
y_1 = f_1(x)
f_2 = Function()
y_2 = f_2(y_1)

print(y_2.data)
print(y_2.creator)                           # => f_2
print(y_2.creator.input)                     # => y_1
print(y_2.creator.input.creator)             # => f_1
print(y_2.creator.input.creator.input)       # => x
print(y_2.creator.input.creator.input.data)  # => data
```

```
>>> [0 1 2 3]
>>> <__main__.Function object at 0x1021efcf8>
>>> <__main__.Variable object at 0x1021efd30>
>>> <__main__.Function object at 0x1021efcc0>
>>> <__main__.Variable object at 0x1023204a8>
>>> [0 1 2 3]
```
First, data in `numpy.ndarray` format is passed to the Variable constructor. The resulting Variable object `x` is the input to the network.
```python
f_1 = Function()
```
Here we instantiate the Function we want to use as one layer of the network. This Function is an identity map and has no parameters, so there is nothing to give to the constructor and it takes no arguments.
```python
y_1 = f_1(x)
```
This line applies the function `f_1` to the input data `x` and assigns its output to `y_1`. The output should also be a Variable, so `y_1` is an instance of the Variable class. Calling `f_1` as a function invokes its internal `__call__`, so this line means that `x` is passed to the `__call__` method of the Function class. Let's first look at the contents of the `__call__` method.
```python
in_data = in_var.data
```
The current code doesn't do any type checking, but assuming the passed argument is a Variable, we take its `data` element and put it in `in_data`. This is the data itself, needed for the actual forward computation.
```python
output = self.forward(in_data)
```
Here the object's own `forward` method is given the `numpy.ndarray` fetched from the input Variable in the previous line, and the return value is stored in `output`.
```python
ret = Variable(output)
```
In this line, the result of the forward computation, `output`, assumed to be a `numpy.ndarray`, is wrapped in a Variable again. This means the `forward` method itself should be a function that takes a `numpy.ndarray` and returns a `numpy.ndarray`.
```python
ret.set_creator(self)
```
Next, in this line, the Function **registers itself as the creator** of the forward result `ret` that was wrapped in a Variable. Let's take a look at the `set_creator` method of the Variable class.
```python
def set_creator(self, gen_func):
    self.creator = gen_func
```
Here the received Function object is stored in the Variable's own `self.creator` member variable. This allows the Variable to hold a reference to the Function that output it.
```python
self.input = in_var
```
Next, the input Variable `in_var` that was passed to this Function is stored in `self.input`, so that the Function called before this one can later be traced from here.
```python
self.output = ret
```
In addition, the Variable `ret` holding the result of the forward computation is stored in `self.output`. This is because backpropagation will need the gradient at the next upper layer; it makes sense when you think of the chain rule of differentiation. Reference: [Gradient for a layer's parameters to loss](#Gradient for a layer's parameters to loss)
```python
return ret
```
Finally, it returns `ret`. In total: if you call a Function object as a function and pass it a Variable, you get back, again wrapped in a Variable, the result of applying the `forward` method to its contents `data`.
The Functions in the code so far had no parameters, so we could only build networks with nothing to update. But if a Function's `forward` performs a transformation governed by some parameter, we want to compute the optimal value of that parameter so as to minimize some loss measure of the transformation. For Neural Networks this parameter is usually optimized with a method based on gradient descent, which requires the gradient of the loss function with respect to every parameter. Backpropagation is a method for doing this for multi-layer networks, viewed as composite maps of many functions.
The implementation of the chain rule in Backpropagation is very simple and there are many ways to do it, but here I will explain, in code, an implementation that uses the fact that "the computation history can be traced backwards from the output Variable", based on the Variable and Function implementation described above.
First, define the necessary functions and perform the forward computation. Consider a network that applies two functions to the data and then computes a loss.
```python
f1 = Function()      # define the first function
f2 = Function()      # define the second function
f3 = Function()      # define the loss function

y0 = Variable(data)  # input data
y1 = f1(y0)          # apply the first function
y2 = f2(y1)          # apply the second function
y3 = f3(y2)          # apply the loss function
```
Now, using this final output Variable (`y3`) as the base point, we compute the gradient of each layer in order while tracing the functions applied to the data in reverse order [^ What is the gradient of each layer]. Each computed gradient is stored in the `grad` member of the input Variable of the corresponding function. Since the `input` of one layer is the `output` of the layer below it, that lower layer can then refer to this gradient.
```python
f3 = y3.creator                      # 0. from the loss value, follow the loss function
gx = f3.backward()                   # 1. gradient of the loss function
f3.input.grad = gx                   # 2. store it in the grad of the input

f2 = f3.input.creator                # 3. follow the function one layer below
gx = f2.backward()                   # 4. gradient of the current layer (d output / d input)
f2.input.grad = f2.output.grad * gx  # 5. gradient of the loss w.r.t. this function's input

f1 = f2.input.creator                # 3. follow the function one layer below
gx = f1.backward()                   # 4. gradient of the current layer (d output / d input)
f1.input.grad = f1.output.grad * gx  # 5. gradient of the loss w.r.t. this function's input
```
Step 5 computes `d loss / d input = (d loss / d output) * (d output / d input)`, following the chain rule of differentiation. Also, once you reach step 5, the process simply repeats from step 3.
In this way, starting from the final output Variable `y3`, we could compute the gradient of the loss with respect to every layer's input. Written more compactly, reusing the named intermediate output Variables from the forward computation instead of splitting every assignment onto its own line, this is the same as the following code.
```python
# Starting from y3
f3 = y3.creator

y2.grad = f3.backward() * 1        # (dy3 / dy2) * (dy3 / dy3 = 1)
y1.grad = f2.backward() * y2.grad  # (dy2 / dy1) * (dy3 / dy2)
y0.grad = f1.backward() * y1.grad  # (dy1 / dy0) * (dy3 / dy1)
```
Furthermore, writing everything after `f3 = y3.creator` in a single line gives:

```python
y0.grad = 1 * f3.backward() * f2.backward() * f1.backward()
```
This corresponds exactly to the following chain of derivatives:

$$
\frac{\partial y_3}{\partial y_0} = \frac{\partial y_3}{\partial y_3} \frac{\partial y_3}{\partial y_2} \frac{\partial y_2}{\partial y_1} \frac{\partial y_1}{\partial y_0}
$$
Since `y0` is an input to the network it would more naturally be named `x`, `f3` is a loss function so it might be called `l`, and `y3` is a loss so it might be called `loss`, and so on; but here, to emphasize that backward computes the gradient by repeating the same calculation from the upper layers toward the lower layers, I adopted a notation in which only the subscript changes. Also, the derivative of `y3` with respect to itself is $1$.
However, we have not yet computed the gradients needed to update the parameters of a Function. To update a parameter, we need the **gradient of the loss function with respect to that parameter**. To obtain it, each `backward` method should first compute the **gradient of its own output with respect to its parameter**, and then multiply it by the gradient transmitted from the upper layers to obtain `gw`.
For example, if the middle function `f2` has a parameter `w` and performs the transformation `w * y1` on its input `y1`, then the gradient of `f2`'s output with respect to `w` is `y1`. Since this is the input `f2.input` to `f2`, `gw` becomes:
```python
gw = f2.input.data * f2.output.grad
```
Information from the upper layers is aggregated in the `f2.output` Variable, and information from the lower layers in the `f2.input` Variable, so the two together make it possible to compute the gradient of the loss function with respect to the parameters inside the layer.
This `gw` is used to update the parameter `w`. The update rule itself varies with the optimization method; the Optimizer that walks over the network and updates its parameters will be described later.
By the way, to let a single method called `backward` on the final output Variable carry out the whole Backpropagation process described above, we add the following to Variable and Function.

Variable

- A `grad` member variable
- A `backward` method, which
    - traces the computation history backwards, starting from its own `creator`
    - calls the `backward` method of every Function along the way
    - stores the return value of each Function's `backward` in the `grad` of that Function's input Variable

Function

- A `backward` method, which
    - computes the gradient `gx` of its own output with respect to its input
    - receives `grad_output`, the gradient of the loss with respect to its output passed from the upper layer, and returns its product with `gx`
    - computes the gradient of its own output with respect to its parameters, multiplies it by `grad_output` to obtain `gw`, and holds it

Concretely, it looks like this.
```python
class Variable(object):

    def __init__(self, data):
        self.data = data
        self.creator = None
        self.grad = 1

    def set_creator(self, gen_func):
        self.creator = gen_func

    def backward(self):
        if self.creator is None:  # input data
            return
        func = self.creator
        while func:
            gy = func.output.grad
            func.input.grad = func.backward(gy)
            func = func.input.creator


class Function(object):

    def __call__(self, in_var):
        in_data = in_var.data
        output = self.forward(in_data)
        ret = Variable(output)
        ret.set_creator(self)
        self.input = in_var
        self.output = ret
        return ret

    def forward(self, in_data):
        raise NotImplementedError()

    def backward(self, grad_output):
        raise NotImplementedError()
```
An identity map alone is not very interesting, so from here on Function is only an interface definition, allowing various functions to be defined; a class called `Mul` that actually implements `forward` and `backward` is defined by inheriting from this Function.
```python
class Mul(Function):

    def __init__(self, init_w):
        self.w = init_w  # initialize the parameter

    def forward(self, in_data):
        return in_data * self.w

    def backward(self, grad_output):
        gx = self.w * grad_output                # gradient w.r.t. the input
        self.gw = self.input.data * grad_output  # gradient w.r.t. the parameter
        return gx
```
This is just a function that multiplies its input by the parameter given at initialization. In `backward`, the gradient of its own transformation and the gradient of its parameter are each computed, and the parameter gradient is kept in `self.gw`.
Let's do a forward calculation using these extended Variables and Functions.
```python
data = numpy.array([0, 1, 2, 3])

f1 = Mul(2)
f2 = Mul(3)
f3 = Mul(4)

y0 = Variable(data)
y1 = f1(y0)  # y1 = y0 * 2
y2 = f2(y1)  # y2 = y1 * 3
y3 = f3(y2)  # y3 = y2 * 4

print(y0.data)
print(y1.data)
print(y2.data)
print(y3.data)
```

```
>>> [0 1 2 3]
>>> [0 2 4 6]
>>> [ 0  6 12 18]
>>> [ 0 24 48 72]
```
You can see that each application of a function multiplies the values by the initial value given to that function. Now, when `y3.backward()` is executed, the functions applied so far are traced back from `y3` in reverse order, the `backward` of each Function is called in turn, and the `grad` member variable of each intermediate input/output Variable is filled with the gradient of the final output with respect to it.
```python
y3.backward()

print(y3.grad)  # df3 / dy3 = 1
print(y2.grad)  # df3 / dy2 = (df3 / dy3) * (dy3 / dy2) = 1 * 4
print(y1.grad)  # df3 / dy1 = (df3 / dy3) * (dy3 / dy2) * (dy2 / dy1) = 1 * 4 * 3
print(y0.grad)  # df3 / dy0 = (df3 / dy3) * (dy3 / dy2) * (dy2 / dy1) * (dy1 / dy0) = 1 * 4 * 3 * 2
```

```
>>> 1
>>> 4
>>> 12
>>> 24
```
```python
print(f3.gw)
print(f2.gw)
print(f1.gw)
```

```
>>> [ 0  6 12 18]  # f3.gw = y2 * (df3 / dy3) = y2 * 1
>>> [ 0  8 16 24]  # f2.gw = y1 * (df3 / dy2) = y1 * 4
>>> [ 0 12 24 36]  # f1.gw = y0 * (df3 / dy1) = y0 * 12
```
The following is a simple diagram of what happens inside each object over this sequence: writing the forward computation yourself and then calling `backward()` on the final output Variable.
Link
A Link calls a Function internally, passing to the Function the parameters needed for the transformation it performs. These parameters are kept as member variables of the Link object and are updated by the Optimizer during network training.
(to be continued)
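Although that section is still to come, a minimal sketch of the idea might look like this, assuming the Link simply owns the parameter and instantiates a Function to do the actual computation (my assumption, not Chainer's actual code):

```python
class Link(object):
    # Minimal sketch: the Link owns the parameter; the Function does the math.
    def __init__(self, init_w):
        self.w = init_w  # parameter; an Optimizer would update this

    def __call__(self, x):
        self.func = Mul(self.w)  # hand the current parameter to the Function
        return self.func(x)
```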
Chain
A Chain can hold any number of Links, which is useful for grouping parameters that you want to update together, and for describing easy-to-understand sub-units of a large network.
(to be continued)
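Likewise, a minimal sketch of the grouping idea, under the same assumptions as the Link sketch above:

```python
class Chain(object):
    # Minimal sketch: a Chain groups child Links so that everything an
    # Optimizer needs to update can be collected in one place.
    def __init__(self, **links):
        self.links = links

    def __call__(self, x):
        # Illustrative forward pass: apply the child Links in insertion order.
        for link in self.links.values():
            x = link(x)
        return x

    def params(self):
        return [link.w for link in self.links.values()]
```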
Appendix
If we somewhat forcefully describe a Neural Network as a composite map consisting of alternating applications of linear and non-linear transformations, an affine transformation can serve as one of the linear transformations that compose it. An affine transformation here means the transformation that takes a real vector ${\bf x} \in \mathbb{R}^{d_{in}}$, multiplies it by a weight matrix ${\bf W} \in \mathbb{R}^{d_{out} \times d_{in}}$, and adds a bias vector ${\bf b} \in \mathbb{R}^{d_{out}}$; geometrically it covers "rotation, scaling, shearing, and translation".
Consider implementing this as one layer of a Neural Network, called Linear. A layer may or may not have trainable parameters; the Linear layer's ${\bf W}$ and ${\bf b}$ for the affine transformation are trainable parameters, since they will be updated toward values that perform the desired transformation.
Now, let's express the Linear layer as a class written in Python.
We will implement functions for:

- forward computation (take ${\bf x}$ and return ${\bf y} = {\bf W}{\bf x} + {\bf b}$)
- backward computation (compute $\partial {\bf y} / \partial {\bf W}$ and $\partial {\bf y} / \partial {\bf b}$)
- update computation (update the parameters ${\bf W}, {\bf b}$ using these gradients)
```python
class Linear(object):

    def __init__(self, in_sz, out_sz):
        self.W = numpy.random.randn(out_sz, in_sz) * numpy.sqrt(2. / in_sz)
        self.b = numpy.zeros((out_sz,))

    def __call__(self, x):
        self.x = x
        return x.dot(self.W.T) + self.b

    def update(self, gy, lr):
        self.W -= lr * gy.T.dot(self.x)
        self.b -= lr * gy.sum(axis=0)
        return gy.dot(self.W)
```
In this Linear class, the constructor first initializes the layer's parameters (${\bf W}$, ${\bf b}$): ${\bf W}$ with normal random numbers of mean $0$ and standard deviation $\sqrt{2 / {\rm in\_sz}}$, and ${\bf b}$ with zeros.
```python
self.W = numpy.random.randn(out_sz, in_sz) * numpy.sqrt(2. / in_sz)
self.b = numpy.zeros((out_sz,))
```
This initialization method is called HeNormal [^HeNormal]. `in_sz` is the input size, i.e. the dimension $d_{in}$ of the input vector, and `out_sz` is the output size, i.e. the dimension $d_{out}$ of the output vector after the transformation.
Next, the `__call__` method corresponds to the forward computation; it calculates ${\bf W}{\bf x} + {\bf b}$:
```python
return x.dot(self.W.T) + self.b
```
Computing the gradients of the output with respect to the parameters (= backward) is so simple for the Linear layer that it is not provided as an independent method in the code above. Specifically, $\partial {\bf y} / \partial {\bf W} = {\bf x}$ and $\partial {\bf y} / \partial {\bf b} = {\bf 1}$ (where ${\bf 1}$ is a $d_{out}$-dimensional vector whose elements are all $1$), and these are used as known facts in the `update` method. In the first line of the snippet below, `self.x` corresponds to $\partial {\bf y} / \partial {\bf W}$. On the right side of the second line, `gy.sum(axis=0)` performs the same computation as `gy.T.dot(numpy.ones((gy.shape[0],)))`; the `numpy.ones((gy.shape[0],))` part corresponds to $\partial {\bf y} / \partial {\bf b}$.
```python
self.W -= lr * gy.T.dot(self.x)
self.b -= lr * gy.sum(axis=0)
```
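The claimed equivalence of `gy.sum(axis=0)` and `gy.T.dot(numpy.ones((gy.shape[0],)))` is easy to confirm numerically; this quick check is added here for illustration:

```python
import numpy

gy = numpy.random.randn(4, 3)  # a dummy batch of four output gradients
print(numpy.allclose(gy.sum(axis=0),
                     gy.T.dot(numpy.ones((gy.shape[0],)))))  # => True
```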
If the gradient computation were more complicated, it would be better to provide a separate `backward` method or similar, and compute the gradients of the parameters separately from `update`, as sketched below.
Parameter updates really ought to be abstracted away, or cut out into a separate class, to accommodate the various gradient-descent variants [^optimizers]; but here we consider only the simplest case: the Linear class that holds the parameters also has an `update` method, which updates the parameters with plain stochastic gradient descent (sometimes called Vanilla SGD).
What the `update` method does is simple. By the chain rule for the derivative of a composite function, the product over all upper layers of each layer's output-to-input gradients arrives as `gy`; the method multiplies it by the gradients of this layer's output with respect to its parameters ${\bf W}, {\bf b}$. The result is the gradient of the objective function with respect to ${\bf W}, {\bf b}$, so multiplying it by the learning rate `lr` gives the update amount, which is then subtracted from the parameters.
The `update` method then returns the product of the gradient passed down from the upper layers and ${\bf W}$, the gradient of this layer's output with respect to its input. This becomes the `gy` of the layers below.
Thanks to the chain rule, the gradient computation required for Backpropagation is very easy to implement. Each layer passes $\partial f / \partial {\bf x}$, the gradient of its transformation $f$ with respect to its input ${\bf x}$, down to the lower layers as `gy`, and each layer can update its parameters by multiplying the `gy` passed from above with the gradient of its own transformation $f$ with respect to its parameters.
With a class implementing the above, you can create a Linear layer of any input/output size. To pass a value to the Linear layer, call the object as a function with a `numpy.ndarray` argument; to update the internal parameters, pass the `gy` from the upper layer and the learning rate `lr` used in the update rule to the `update` method.
At the beginning of the Appendix on the Linear layer above, I forcibly described a Neural Network as "alternating applications of linear and non-linear transformations", so we now want to apply a non-linear transformation to the output of the Linear layer. A wide variety of non-linear transformations, called activation functions, have been proposed for Neural Networks. One of the most common at the moment is ReLU, whose transformation can be written as follows.
```python
class ReLU(object):

    def __call__(self, x):
        self.x = x
        return numpy.maximum(0, x)

    def update(self, gy, lr):
        return gy * (self.x > 0)
```
Since the activation function is a parameterless transformation, `update` has no parameters to update. Instead, it computes its own subgradient and multiplies it by `gy`.
A library called Scikit-learn makes it easy to download and load MNIST datasets.
```python
from sklearn.datasets import fetch_mldata

# Load the MNIST dataset
mnist = fetch_mldata('MNIST original', data_home='.')
td, tl = mnist.data[:60000] / 255.0, mnist.target[:60000]

# Turn the targets into one-hot vectors
tl = numpy.array([tl == i for i in range(10)]).T.astype(numpy.int)

# Shuffle
perm = numpy.random.permutation(len(td))
td, tl = td[perm], tl[perm]
```
I also created the Validation data in the same way.
```python
vd, vl = mnist.data[60000:] / 255.0, mnist.target[60000:]
vl = numpy.array([vl == i for i in range(10)]).T.astype(numpy.int)
```
When the network output is ${\bf y} \in \mathbb{R}^{d_l}$ and $\hat{\bf y}$ is its conversion into a probability vector by the Softmax function, $\hat{\bf y}$ is computed as follows:

$$
\hat{y}_i = \frac{\exp(y_i)}{\sum_{j=1}^{d_l} \exp(y_j)} \qquad (i = 1, 2, \dots, d_l)
$$

Since each $\hat{y}_i\ (i = 1, 2, \dots, d_l)$ represents a probability, $0 \leq \hat{y}_i \leq 1$.
Now, let the teacher signal also be a $d_l$-dimensional one-hot vector (a vector in which exactly one element is $1$ and all the others are $0$), ${\bf t} = [t_1, t_2, \dots, t_{d_l}]^{\rm T}$. Then the likelihood of $\hat{\bf y} = {\bf t}$ can be defined as follows:

$$
L(\hat{\bf y} = {\bf t}) = \prod_{i=1}^{d_l} \hat{y}_i^{t_i}
$$
$t_i$ is $1$ only when $i$ is the index of the correct class and $0$ otherwise, so if the correct class is $i = 5$, the formula above becomes $1 \cdot 1 \cdots \hat{y}_5 \cdots 1 = \hat{y}_5$. In other words, $L$ can be interpreted as "how well, and with how much confidence, did we predict the correct answer". We would like to make this value larger; in practice one takes the logarithm of $L$, inverts the sign, and **minimizes** $-\log(L)$. Since $\log$ is monotonically increasing, $L$ is maximal when $\log(L)$ is maximal, and $-\log(L)$ is minimal when $\log(L)$ is maximal. As a result, minimizing $-\log(L)$ maximizes the likelihood above [^SoftmaxCrossEntropy_derivation]. This $-\log(L)$ is called the "negative log-likelihood", since it is the sign-inverted log of the likelihood, but in the context of Neural Networks it is more often called the cross-entropy. Putting this cross-entropy as $\mathcal{L}$, we have:
$$
\mathcal{L} = -\log(L) = -\sum_{i=1}^{d_l} t_i \log \hat{y}_i
$$

We will use this as the loss function for training a Neural Network that solves a classification problem, and aim to minimize it.
Now, in a layer definition using a Python class as above, if we adopt the convention that "$1$ is always passed to the `gy` of the `update` method", the loss function itself can be treated as just another layer. The loss function has no parameters to update, so all its `update` method has to do is compute the gradient of its output (= the loss value) with respect to its input (= the network output). That is, for $k = 1, 2, \dots, d_l$ we want to compute:

$$
\frac{\partial \mathcal{L}}{\partial y_k} = -\sum_{i=1}^{d_l} t_i \frac{\partial \log \hat{y}_i}{\partial y_k} = -\sum_{i=1}^{d_l} \frac{t_i}{\hat{y}_i} \frac{\partial \hat{y}_i}{\partial y_k} \qquad (1)
$$
The summation symbol does not drop out here because the Softmax function has all dimensions of ${\bf y}$ in its denominator, so each $\hat{y}_i$ is a function of every $y_k$. Now, the gradient of the Softmax function is:
When $ k \ neq i $
When $ k = i $ $$ \begin{eqnarray} \frac{\partial \hat{y}_i}{\partial y_k} &=& \frac{\exp(y_i)}{\sum_j \exp(y_j)}
Therefore, we can use this to split the sum in expression $(1)$ into the term where $i = k$ and the terms where $i \neq k$:

$$
\frac{\partial \mathcal{L}}{\partial y_k} = -\frac{t_k}{\hat{y}_k}\hat{y}_k(1 - \hat{y}_k) + \sum_{i \neq k} \frac{t_i}{\hat{y}_i}\hat{y}_i\hat{y}_k
$$

Here the first term is the $i = k$ part and the second term the $i \neq k$ part. Transforming further,
$$
\frac{\partial \mathcal{L}}{\partial y_k} = -t_k + t_k \hat{y}_k + \sum_{i \neq k} t_i \hat{y}_k = -t_k + \hat{y}_k \sum_i t_i = \hat{y}_k - t_k
$$

where the property of the one-hot vector ($\sum_i t_i = 1$) is used in the final transformation. Rewriting this result in vector form:
$$
\frac{\partial \mathcal{L}}{\partial {\bf y}} = \hat{\bf y} - {\bf t}
$$

In other words, this is exactly the `gy` returned from the `update` method of the Softmax Cross Entropy class.
If the loss function is $\mathcal{L}$, the gradient of the loss with respect to a parameter ${\bf W}_l$ in layer $l$, with layers $l+1, l+2, \dots, L$ above it, is given by the chain rule of differentiation as follows:

$$
\frac{\partial \mathcal{L}}{\partial {\bf W}_l} = \frac{\partial \mathcal{L}}{\partial y_L} \frac{\partial y_L}{\partial y_{L-1}} \cdots \frac{\partial y_{l+1}}{\partial y_l} \frac{\partial y_l}{\partial {\bf W}_l}
$$

Here every gradient from $\partial y_{l+1} / \partial y_l$ up to $\partial y_L / \partial y_{L-1}$ is the gradient of an upper layer's **output with respect to its input**, for the layers leading down to layer $l$. Let's call these input/output gradients. As for the last factor $\partial \mathcal{L} / \partial y_L$: if you think of $\mathcal{L}$ as a loss layer at level $L+1$, it too is an input/output gradient, the gradient of that layer's output (the loss) with respect to its input (the network's prediction). In other words, **what the backward computation has to produce is the product of all the input/output gradients of the layers connecting a given layer to the loss, multiplied by the gradient of the layer's own output with respect to its parameters**. So each layer passes its input/output gradient to the Function that handed it a Variable, and the receiving side in turn passes the product with its own input/output gradient further down. That is all that is needed.
[^cupy-PR]: Related PRs include "Support advanced indexing with boolean array for getitem": https://github.com/pfnet/chainer/pull/1840

[^optimizers]: Chainer implements AdaDelta, AdaGrad, Adam, MomentumSGD, NesterovAG, RMSprop, RMSpropGraves, SGD (= Vanilla SGD), and SMORMS3. There is no RProp, and Eve has a PR (https://github.com/pfnet/chainer/pull/1847) that has not been merged yet. I want to get Adam and Eve together as soon as possible (?).