"Deep Learning from scratch" Self-study memo (10) MultiLayerNet class

While reading "Deep Learning from Scratch" (written by Koki Saito, published by O'Reilly Japan), I am keeping notes on the sites I referred to. Part 9 ← → Part 11

After Chapter 5 explains the layer-based implementation, the book explains little of the program code itself from Chapter 6 onward. The example programs are in the files you download at the start, so presumably you are meant to run them and read through the contents yourself, but that is quite hard for a beginner.

Well, I'll go little by little.

Check the contents of the MultiLayerNet class in Chapter 6

Chapter 3 explained the basics of neural networks, and Chapter 4 implemented the two-layer neural network class TwoLayerNet. After various further explanations, this grew into the MultiLayerNet class. It looks a lot more complicated, but the basics are the same as TwoLayerNet. The library layers.py referenced by this class is the same one used by the TwoLayerNet class. What makes it look complicated seems to be two things: it is implemented layer by layer to make the program more general, and the activation function, parameter update method, initial weight values, and so on can now be selected.

When you want to understand a program, the surest way is to trace it by hand, line by line.

So, let's trace the program on p. 192.

Generate a neural net object network

weight_decay_lambda = 0.1

network = MultiLayerNet(input_size=784, 
                        hidden_size_list=[100, 100, 100, 100, 100, 100],
                        output_size=10,
                        weight_decay_lambda=weight_decay_lambda)

input_size=784 means the input is MNIST data with 784 elements (28 × 28 pixels). output_size=10 means there are 10 possible recognition results. So what does hidden_size_list=[100, 100, 100, 100, 100, 100] do inside the network object?

The answer is in the initialization (__init__) in the definition of MultiLayerNet in multi_layer_net.py:

    def __init__(self, input_size, hidden_size_list, output_size,
                 activation='relu', weight_init_std='relu', weight_decay_lambda=0):
        self.input_size = input_size
        self.output_size = output_size
        self.hidden_size_list = hidden_size_list
        self.hidden_layer_num = len(hidden_size_list)
        self.weight_decay_lambda = weight_decay_lambda
        self.params = {}

        #Weight initialization
        self.__init_weight(weight_init_std)

The arguments I omitted when creating the object take their default values: activation='relu' uses ReLU as the activation function, and weight_init_std='relu' selects initial weight values suited to ReLU, that is, the He initial values. self.hidden_layer_num = len(hidden_size_list) means that as many hidden layers are created as there are elements in the list hidden_size_list.

Generate a layer

So a for loop runs once for each element:
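As a rough illustration of what the He initial values look like, the sketch below draws each weight matrix from a Gaussian scaled by sqrt(2 / n), where n is the number of nodes in the previous layer. The function name `init_weights_he` and the fixed seed are my own assumptions for this memo; the actual `__init_weight` method in multi_layer_net.py may differ in detail.

```python
import numpy as np

def init_weights_he(input_size, hidden_size_list, output_size, seed=0):
    """Hypothetical helper: He-initialized weights for a fully connected net."""
    rng = np.random.default_rng(seed)
    all_sizes = [input_size] + list(hidden_size_list) + [output_size]
    params = {}
    for idx in range(1, len(all_sizes)):
        fan_in = all_sizes[idx - 1]
        scale = np.sqrt(2.0 / fan_in)  # He scaling, suited to ReLU
        params['W' + str(idx)] = scale * rng.standard_normal((fan_in, all_sizes[idx]))
        params['b' + str(idx)] = np.zeros(all_sizes[idx])
    return params

params = init_weights_he(784, [100] * 6, 10)
for key in ('W1', 'W2', 'W7', 'b7'):
    print(key, params[key].shape)
```

With six hidden layers this produces W1 to W7 and b1 to b7, which are the keys that the layer generation code looks up in self.params.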

        #Layer generation
        activation_layer = {'sigmoid': Sigmoid, 'relu': Relu}
        self.layers = OrderedDict()
        for idx in range(1, self.hidden_layer_num+1):
            self.layers['Affine' + str(idx)] = Affine(self.params['W' + str(idx)],
                                                      self.params['b' + str(idx)])
            self.layers['Activation_function' + str(idx)] = activation_layer[activation]()

After this loop, the final Affine layer is added, and SoftmaxWithLoss is set as last_layer for the output.

        idx = self.hidden_layer_num + 1
        self.layers['Affine' + str(idx)] = Affine(self.params['W' + str(idx)],
            self.params['b' + str(idx)])

        self.last_layer = SoftmaxWithLoss()

In other words, there are 6 hidden layers + 1 output layer, making a 7-layer network. The contents of the OrderedDict layers are as follows:

OrderedDict([
    ('Affine1', Affine(params[W1],params[b1])),
    ('Activation_function1', Relu),
    ('Affine2', Affine(params[W2],params[b2])),
    ('Activation_function2', Relu),
    ('Affine3', Affine(params[W3],params[b3])),
    ('Activation_function3', Relu),
    ('Affine4', Affine(params[W4],params[b4])),
    ('Activation_function4', Relu),
    ('Affine5', Affine(params[W5],params[b5])),
    ('Activation_function5', Relu),
    ('Affine6', Affine(params[W6],params[b6])),
    ('Activation_function6', Relu),
    ('Affine7', Affine(params[W7],params[b7]))
])

Because it is implemented layer by layer, the number of hidden layers is determined simply by the number of elements in hidden_size_list. With about 6 layers you could still write out each layer by hand in the program, as in the TwoLayerNet class, but with 100 layers that would be impractical.

Train the network
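To see how the layer names follow from the length of hidden_size_list, here is a small sketch that reproduces only the key generation of the loop above. `layer_names` is a hypothetical helper of my own, not part of the book's code.

```python
def layer_names(hidden_size_list):
    """Hypothetical helper: list the layer keys for a given hidden_size_list."""
    names = []
    hidden_layer_num = len(hidden_size_list)
    for idx in range(1, hidden_layer_num + 1):
        names.append('Affine' + str(idx))
        names.append('Activation_function' + str(idx))
    names.append('Affine' + str(hidden_layer_num + 1))  # output layer
    return names

print(len(layer_names([100] * 6)))    # 13
print(len(layer_names([50] * 100)))   # 201
print(layer_names([100, 100])[-1])    # Affine3
```

Going from 6 to 100 hidden layers only changes the list passed in, not the program.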

MNIST data is given to this network object for training.

optimizer = SGD(lr=0.01)

for i in range(1000000000):
    batch_mask = np.random.choice(train_size, batch_size)
    x_batch = x_train[batch_mask]
    t_batch = t_train[batch_mask]

    grads = network.gradient(x_batch, t_batch)
    optimizer.update(network.params, grads)

Inside the mini-batch loop, grads = network.gradient(x_batch, t_batch) computes the gradients. The contents of grads look like this:

{
 'W1': array([[-0.00240062, -0.01276378, 0.00096349, ..., 0.0054993 ],
              [-0.00232299, -0.0022137 , 0.0036697 , ..., -0.00693252],
              ...,
              [-0.00214929, 0.00358515, -0.00982791, ..., 0.00342722]]),
 'b1': array([-4.51501921e-03, 5.25825778e-03, ..., -8.60827293e-03]),
 'W2': array([[ 0.00394647, -0.01781943, 0.00114132, ..., 0.0029042 ],
              [-0.00551014, 0.00238989, 0.01442266, ..., 0.00171659],
              ...,
              [ 0.00279524, 0.01496588, 0.01859664, ..., -0.02194152]]),
 'b2': array([ 2.08738753e-03, -8.28071395e-03, ..., 1.22945079e-02]),
 'W3': array([[ ..., ]]),
 'b3': array([ ..., ]),
 'W4': array([[ ..., ]]),
 'b4': array([ ..., ]),
 'W5': array([[ ..., ]]),
 'b5': array([ ..., ]),
 'W6': array([[ ..., ]]),
 'b6': array([ ..., ]),
 'W7': array([[ 6.72420338e-02, 3.36979669e-04, 1.26773417e-02, -2.30916938e-03, -4.84414774e-02,
               -2.58458587e-02, -5.26754173e-02, 3.61136740e-02, -4.29689699e-03, -2.85799599e-02],
              [ ...],
              [-1.68008362e-02, 6.87882255e-03, -3.15578291e-03, -8.00362948e-04, 8.81555008e-03,
               -9.23032804e-03, -1.83337109e-02, 2.17933554e-02, -6.52331525e-03, 1.50930257e-02]]),
 'b7': array([ 0.11697053, -0.02521648, 0.03697393, -0.015763 , -0.0456317 ,
              -0.03476072, -0.05961871, 0.0096403 , 0.03581566, -0.01840983])
}

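The last entry, grads['W7'], is the gradient of the loss with respect to the weights of the output layer Affine7. It has 100 rows, one for each node of the last hidden layer, and 10 columns, one for each output class 0 to 9. These values are gradients, not the probabilities output by the softmax function, and the number of rows comes from the hidden layer size, not from the number of rows of training data in the mini-batch. Next: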
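To check for myself what determines the shape of the W7 gradient, I wrote the small sketch below. It uses random arrays in place of real activations (an assumption for illustration only) and applies the same two formulas as Affine.backward in layers.py.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, output_size = 100, 10

shapes = []
for batch_size in (32, 100):
    x = rng.standard_normal((batch_size, hidden_size))     # stand-in for the input to Affine7
    dout = rng.standard_normal((batch_size, output_size))  # stand-in for the gradient from last_layer
    dW = np.dot(x.T, dout)      # same formula as Affine.backward
    db = np.sum(dout, axis=0)
    shapes.append((batch_size, dW.shape, db.shape))
    print(batch_size, dW.shape, db.shape)
```

Both batch sizes give a (100, 10) weight gradient and a (10,) bias gradient, so the shape depends only on the layer sizes.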

    optimizer.update(network.params, grads)

Here the update method of the SGD class in the library optimizer.py in the common folder updates the parameters params by subtracting the contents of grads multiplied by the learning rate. The example above uses SGD. Besides SGD, the library also defines Momentum, AdaGrad, Adam, and RMSprop.

The updated params are used for the next batch, so learning proceeds as the batch loop repeats.

What the gradient method is doing

So what the gradient method does is find the gradients of the weight parameters by backpropagation. First it calculates the value of the loss function in the forward direction, then it traverses the layers that were set up when the network object was created in reverse order to find the gradients.

    def gradient(self, x, t):
        # forward
        self.loss(x, t)

        # backward
        dout = 1
        dout = self.last_layer.backward(dout)

        layers = list(self.layers.values())
        layers.reverse()
        for layer in layers:
            dout = layer.backward(dout)

        #Setting
        grads = {}
        for idx in range(1, self.hidden_layer_num+2):
            grads['W' + str(idx)] = self.layers['Affine' + str(idx)].dW + self.weight_decay_lambda * self.layers['Affine' + str(idx)].W
            grads['b' + str(idx)] = self.layers['Affine' + str(idx)].db

        return grads

At first I didn't really understand self.loss(x, t). It calls a function, but the result doesn't appear to be used afterwards. So I traced its contents. What is being called is the loss method defined in multi_layer_net.py.

I tried tracing the loss function loss

network.loss(x_batch, t_batch)

62.09479496490768

    def loss(self, x, t):
        y = self.predict(x)
        weight_decay = 0
        for idx in range(1, self.hidden_layer_num + 2):
            W = self.params['W' + str(idx)]
            weight_decay += 0.5 * self.weight_decay_lambda * np.sum(W ** 2)

        return self.last_layer.forward(y, t) + weight_decay

In the loss method, predict computes the prediction y from the input data. Inside it, the forward methods of the layers from Affine1 through Affine7 are executed in order.

    def predict(self, x):
        for layer in self.layers.values():
            x = layer.forward(x)
        return x

Then weight_decay is calculated from the weights (params['W1'] etc.) to suppress overfitting, and it is added to the output of last_layer.

weight_decay

59.84568388277881

network.last_layer.forward(y, t_batch)

2.2491110821288687
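The weight decay term is 0.5 * lambda * sum(W ** 2) for each layer, and its derivative with respect to W is lambda * W, which is the extra term added to dW in the gradient method. The sketch below checks that relationship numerically on a small random matrix; `l2_penalty` is a hypothetical helper name of my own, not a function from the book's library.

```python
import numpy as np

def l2_penalty(weights, weight_decay_lambda):
    """0.5 * lambda * sum of squared weights, summed over all given matrices."""
    return sum(0.5 * weight_decay_lambda * np.sum(W ** 2) for W in weights)

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2))
lam = 0.1

# Analytic gradient of the penalty with respect to W
analytic = lam * W

# Numerical gradient by central differences, for comparison
numeric = np.zeros_like(W)
h = 1e-5
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        orig = W[i, j]
        W[i, j] = orig + h
        plus = l2_penalty([W], lam)
        W[i, j] = orig - h
        minus = l2_penalty([W], lam)
        W[i, j] = orig  # restore the original value
        numeric[i, j] = (plus - minus) / (2 * h)

print(np.allclose(analytic, numeric))
```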

As for self.last_layer.forward(y, t), last_layer is set in the initialization of the MultiLayerNet class:

self.last_layer = SoftmaxWithLoss()

Because of this definition, what is actually executed is the forward method of SoftmaxWithLoss defined in layers.py.

class SoftmaxWithLoss:
    def __init__(self):
        self.loss = None
        self.y = None #softmax output
        self.t = None #Teacher data

    def forward(self, x, t):
        self.t = t
        self.y = softmax(x)
        self.loss = cross_entropy_error(self.y, self.t)
        
        return self.loss

    def backward(self, dout=1):
        batch_size = self.t.shape[0]
        if self.t.size == self.y.size: # when the teacher data is a one-hot vector
            dx = (self.y - self.t) / batch_size
        else:
            dx = self.y.copy()
            dx[np.arange(batch_size), self.t] -= 1
            dx = dx / batch_size
        
        return dx

So, in this forward method, the cross entropy error is calculated and returned.

network.last_layer.loss

2.2491110821288687

from common.functions import *
cross_entropy_error(network.last_layer.y, network.last_layer.t)

2.2491110821288687

With that, I understood what self.loss(x, t) refers to and what it does.

The SoftmaxWithLoss layer then uses its backward method during backpropagation to find the gradient. That method refers to self.y and self.t, which are variables set when the forward method is executed. In other words, **the first self.loss(x, t) is not there to obtain the loss value, but to prepare for calling the backward methods in backpropagation**.

To go backward, you first have to go forward. Once you understand it, it's obvious.
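The same dependency can be seen in a single layer. Below is a minimal sketch of a ReLU layer that stores a mask in forward and reuses it in backward. The class name `ReluSketch` and the explicit RuntimeError check are my own additions for illustration; I am not claiming the Relu class in layers.py performs such a check.

```python
import numpy as np

class ReluSketch:
    """Minimal ReLU layer that caches a mask in forward for use in backward."""
    def __init__(self):
        self.mask = None

    def forward(self, x):
        self.mask = (x <= 0)   # remember which inputs were not positive
        out = x.copy()
        out[self.mask] = 0
        return out

    def backward(self, dout):
        if self.mask is None:  # illustrative guard, added for this sketch
            raise RuntimeError('forward must be called before backward')
        dx = dout.copy()
        dx[self.mask] = 0      # no gradient flows where the input was not positive
        return dx

layer = ReluSketch()
try:
    layer.backward(np.ones((1, 3)))
except RuntimeError as e:
    print('error:', e)

layer.forward(np.array([[-1.0, 2.0, 0.0]]))
print(layer.backward(np.ones((1, 3))))   # [[0. 1. 0.]]
```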

Find the gradient with backward

After executing self.loss(x, t) to set the predicted values and so on from the input data, the gradients are calculated by backpropagation.

        # backward
        dout = 1
        dout = self.last_layer.backward(dout)

self.last_layer.backward(dout) means SoftmaxWithLoss.backward(). The dout it returns is an array of the differences between the predicted values y and the teacher labels t, each divided by the batch size N: [(y1 - t1)/N, (y2 - t2)/N, (y3 - t3)/N, ..., (y100 - t100)/N]
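To convince myself that (y - t) / batch_size really is the gradient of the cross entropy error with respect to the input of softmax, I compared it with a numerical gradient in the sketch below. The `softmax` and `cross_entropy` functions here are simplified versions written for this memo (one-hot labels only, no small constant inside the log), so they are assumptions and not the ones in common/functions.py.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x, axis=1, keepdims=True)  # for numerical stability
    e = np.exp(x)
    return e / np.sum(e, axis=1, keepdims=True)

def cross_entropy(y, t):
    """Mean cross entropy over the batch; t is one-hot."""
    return -np.sum(t * np.log(y)) / y.shape[0]

rng = np.random.default_rng(0)
batch_size, n_classes = 4, 3
x = rng.standard_normal((batch_size, n_classes))
t = np.eye(n_classes)[rng.integers(0, n_classes, batch_size)]

# Analytic gradient, same form as SoftmaxWithLoss.backward for one-hot t
analytic = (softmax(x) - t) / batch_size

# Numerical gradient by central differences
numeric = np.zeros_like(x)
h = 1e-4
for i in range(batch_size):
    for j in range(n_classes):
        orig = x[i, j]
        x[i, j] = orig + h
        plus = cross_entropy(softmax(x), t)
        x[i, j] = orig - h
        minus = cross_entropy(softmax(x), t)
        x[i, j] = orig  # restore the original value
        numeric[i, j] = (plus - minus) / (2 * h)

print(np.allclose(analytic, numeric, atol=1e-5))
```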

        layers = list(self.layers.values())
        layers.reverse()
        for layer in layers:
            dout = layer.backward(dout)

layers.reverse() reverses the order of the stacked layers, and dout = layer.backward(dout) is repeated to find the gradients. Unrolled, the loop looks like this:

dout = layers[0].backward(dout)  #Affine7
dout = layers[1].backward(dout)  #Activation_function6 Relu
dout = layers[2].backward(dout)  #Affine6
dout = layers[3].backward(dout)  #Activation_function5 Relu
dout = layers[4].backward(dout)  #Affine5
dout = layers[5].backward(dout)  #Activation_function4 Relu
dout = layers[6].backward(dout)  #Affine4
dout = layers[7].backward(dout)  #Activation_function3 Relu
dout = layers[8].backward(dout)  #Affine3
dout = layers[9].backward(dout)  #Activation_function2 Relu
dout = layers[10].backward(dout) #Affine2
dout = layers[11].backward(dout) #Activation_function1 Relu
dout = layers[12].backward(dout) #Affine1

The self.x referenced in each Affine layer's backward method is the value that was stored when the forward method was executed, and self.W is the weight array passed in when the layer was created.

class Affine:
    def __init__(self, W, b):
        self.W = W
        self.b = b
        
        self.x = None
        self.original_x_shape = None
        # gradients of the weight and bias parameters
        self.dW = None
        self.db = None

    def forward(self, x):
        #Compatible with tensors
        self.original_x_shape = x.shape
        x = x.reshape(x.shape[0], -1)
        self.x = x

        out = np.dot(self.x, self.W) + self.b

        return out

    def backward(self, dout):
        dx = np.dot(dout, self.W.T)
        self.dW = np.dot(self.x.T, dout)
        self.db = np.sum(dout, axis=0)
        
        dx = dx.reshape(*self.original_x_shape)  #Return to the shape of the input data (compatible with tensors)
        return dx

Using the dW and db obtained in each layer, the weight and bias gradients of each layer are set (with the weight decay term added to the weight gradient) and returned as the value of the function.

        #Setting
        grads = {}
        for idx in range(1, self.hidden_layer_num+2):
            grads['W' + str(idx)] = self.layers['Affine' + str(idx)].dW + self.weight_decay_lambda * self.layers['Affine' + str(idx)].W
            grads['b' + str(idx)] = self.layers['Affine' + str(idx)].db

With the returned gradients, the parameters are updated, and one mini-batch iteration is complete.

    grads = network.gradient(x_batch, t_batch)
    optimizer.update(network.params, grads)

class SGD:
    def __init__(self, lr=0.01):
        self.lr = lr
        
    def update(self, params, grads):
        for key in params.keys():
            params[key] -= self.lr * grads[key]

lr is the learning rate. In this example, it is set to 0.01.
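One thing I wanted to confirm is why updating network.params also changes the weights used inside the Affine layers. With NumPy arrays, -= modifies the array in place, so any object holding a reference to the same array sees the new values. The sketch below demonstrates this with a stand-in class `AffineSketch`, a hypothetical name of my own that keeps only the reference to W.

```python
import numpy as np

class AffineSketch:
    """Stand-in layer that only keeps a reference to W."""
    def __init__(self, W):
        self.W = W

params = {'W1': np.ones((2, 2))}
grads = {'W1': np.full((2, 2), 0.5)}
layer = AffineSketch(params['W1'])

lr = 0.01
params['W1'] -= lr * grads['W1']      # in place: the same array object is modified
print(layer.W is params['W1'])        # True
print(layer.W[0, 0])                  # the layer sees the updated value

params['W1'] = params['W1'] - lr * grads['W1']   # rebinding: a new array is created
print(layer.W is params['W1'])        # False
```

If the update were written in the second form, the layer would keep using the old array, so the in-place form in SGD.update matters.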

MultiLayerNetExtend class

The MultiLayerNetExtend class in multi_layer_net_extend.py supports Dropout and Batch Normalization in layer generation, but the basics are the same as MultiLayerNet.
