While reading "Deep Learning from Scratch" (by Koki Saitoh, published by O'Reilly Japan), I am keeping notes on the sites I referred to. Part 9 ← → Part 11
Chapter 5 explains the layer-based implementation, but from Chapter 6 onward the programs themselves are hardly explained. The sample programs are in the files you download at the start, so presumably you are expected to run them and read the code yourself, but that is quite hard for a beginner.
Well, I'll take it a little at a time.
Chapter 3 explained the basics of neural networks, and Chapter 4 implemented the two-layer neural network class TwoLayerNet. After various further explanations, we arrive at the MultiLayerNet class. It looks much more complicated, but the basics are the same as TwoLayerNet. The library layers.py that this class refers to is the same one that TwoLayerNet uses. It seems to look complicated for two reasons: it is implemented layer by layer to make the program more general, and the activation function, the parameter update method, the initial weight values, and so on can now be selected.
When you want to understand a program, the surest way is to trace it by hand, line by line.
So let's trace the program on p. 192.
weight_decay_lambda = 0.1
network = MultiLayerNet(input_size=784,
                        hidden_size_list=[100, 100, 100, 100, 100, 100],
                        output_size=10,
                        weight_decay_lambda=weight_decay_lambda)
input_size=784 means that each MNIST image has 784 elements (28 × 28 pixels). output_size=10 means that there are 10 possible recognition results (the digits 0 to 9). So what does hidden_size_list=[100, 100, 100, 100, 100, 100] do inside the network object?
The answer is in the initialization of MultiLayerNet in multi_layer_net.py:
def __init__(self, input_size, hidden_size_list, output_size,
             activation='relu', weight_init_std='relu', weight_decay_lambda=0):
    self.input_size = input_size
    self.output_size = output_size
    self.hidden_size_list = hidden_size_list
    self.hidden_layer_num = len(hidden_size_list)
    self.weight_decay_lambda = weight_decay_lambda
    self.params = {}
    # Weight initialization
    self.__init_weight(weight_init_std)
The arguments I omitted when creating the object take their default values: activation='relu' uses ReLU as the activation function, and weight_init_std='relu' selects the He initial value, which suits ReLU. self.hidden_layer_num = len(hidden_size_list) means that as many hidden layers are created as there are elements in the list hidden_size_list.
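As a rough illustration of what "He initial value" means for this network, here is a minimal sketch. It assumes the usual definition (normally distributed weights with standard deviation sqrt(2 / n), where n is the number of inputs to the layer). init_he_weights and its seed argument are hypothetical names written for this note; this is not the book's __init_weight method.

```python
import numpy as np

def init_he_weights(input_size, hidden_size_list, output_size, seed=0):
    """Return He-initialized weights and zero biases (illustrative helper)."""
    rng = np.random.default_rng(seed)
    all_sizes = [input_size] + list(hidden_size_list) + [output_size]
    params = {}
    for idx in range(1, len(all_sizes)):
        fan_in = all_sizes[idx - 1]      # number of inputs to this layer
        fan_out = all_sizes[idx]         # number of outputs of this layer
        scale = np.sqrt(2.0 / fan_in)    # He initialization: std = sqrt(2 / fan_in)
        params['W' + str(idx)] = scale * rng.standard_normal((fan_in, fan_out))
        params['b' + str(idx)] = np.zeros(fan_out)
    return params

params = init_he_weights(784, [100] * 6, 10)
for key in ('W1', 'W2', 'W7'):
    # Print each weight's shape and its empirical standard deviation
    print(key, params[key].shape, round(float(params[key].std()), 3))
```

With six hidden layers this gives seven weight matrices, W1 to W7, and seven bias vectors, which matches the parameter names that appear later in grads.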
So the for loop runs once for each element:
# Layer generation
activation_layer = {'sigmoid': Sigmoid, 'relu': Relu}
self.layers = OrderedDict()
for idx in range(1, self.hidden_layer_num+1):
    self.layers['Affine' + str(idx)] = Affine(self.params['W' + str(idx)],
                                              self.params['b' + str(idx)])
    self.layers['Activation_function' + str(idx)] = activation_layer[activation]()
After the loop, one more Affine layer is added for the output, and SoftmaxWithLoss is set as last_layer.
idx = self.hidden_layer_num + 1
self.layers['Affine' + str(idx)] = Affine(self.params['W' + str(idx)],
                                          self.params['b' + str(idx)])
self.last_layer = SoftmaxWithLoss()
In other words, 6 hidden layers + 1 output layer make a 7-layer network. The contents of the OrderedDict layers are as follows:
OrderedDict([
    ('Affine1', Affine(params[W1], params[b1])),
    ('Activation_function1', Relu),
    ('Affine2', Affine(params[W2], params[b2])),
    ('Activation_function2', Relu),
    ('Affine3', Affine(params[W3], params[b3])),
    ('Activation_function3', Relu),
    ('Affine4', Affine(params[W4], params[b4])),
    ('Activation_function4', Relu),
    ('Affine5', Affine(params[W5], params[b5])),
    ('Activation_function5', Relu),
    ('Affine6', Affine(params[W6], params[b6])),
    ('Activation_function6', Relu),
    ('Affine7', Affine(params[W7], params[b7]))
])
Because it is implemented layer by layer, the number of hidden layers is determined simply by the number of elements in hidden_size_list. With about 6 layers you could still write out each layer by hand in the program, as the TwoLayerNet class does, but with 100 layers that would be impractical.
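To see how the layer list grows with hidden_size_list, here is a minimal sketch that only generates the layer names, in the same order as the loop quoted above. layer_names is a hypothetical helper written for this note, not part of multi_layer_net.py.

```python
def layer_names(hidden_size_list):
    """List the keys that the layer-generation loop would create, in order."""
    names = []
    hidden_layer_num = len(hidden_size_list)
    for idx in range(1, hidden_layer_num + 1):
        names.append('Affine' + str(idx))
        names.append('Activation_function' + str(idx))
    # The output Affine layer is added after the loop
    names.append('Affine' + str(hidden_layer_num + 1))
    return names

six_hidden = layer_names([100] * 6)
print(len(six_hidden), six_hidden[0], six_hidden[-1])
print(len(layer_names([100] * 100)))
```

The number of entries is always 2 × (number of hidden layers) + 1, so only the list passed in has to change, not the class.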
MNIST data is given to this network object for training.
optimizer = SGD(lr=0.01)
for i in range(1000000000):
    batch_mask = np.random.choice(train_size, batch_size)
    x_batch = x_train[batch_mask]
    t_batch = t_train[batch_mask]
    grads = network.gradient(x_batch, t_batch)
    optimizer.update(network.params, grads)
Inside the mini-batch loop, grads = network.gradient(x_batch, t_batch) computes the gradients. The contents of grads look like this:
{ 'W1': array([[-0.00240062, -0.01276378, 0.00096349, ..., 0.0054993 ], [-0.00232299, -0.0022137 , 0.0036697 , ..., -0.00693252], ..., [-0.00214929, 0.00358515, -0.00982791, ..., 0.00342722]]), 'b1': array([-4.51501921e-03, 5.25825778e-03, ..., -8.60827293e-03]), 'W2': array([[ 0.00394647, -0.01781943, 0.00114132, ..., 0.0029042 ], [-0.00551014, 0.00238989, 0.01442266, ..., 0.00171659], ..., [ 0.00279524, 0.01496588, 0.01859664, ..., -0.02194152]]), 'b2': array([ 2.08738753e-03, -8.28071395e-03, ..., 1.22945079e-02]), 'W3': array([[ ..., ]]), 'b3': array([ ..., ]), 'W4': array([[ ..., ]]), 'b4': array([ ..., ]), 'W5': array([[ ..., ]]), 'b5': array([ ..., ]), 'W6': array([[ ..., ]]), 'b6': array([ ..., ]), 'W7': array([ [ 6.72420338e-02,3.36979669e-04,1.26773417e-02,-2.30916938e-03, -4.84414774e-02, -2.58458587e-02,-5.26754173e-02,3.61136740e-02,-4.29689699e-03, -2.85799599e-02], [ ...], [-1.68008362e-02, 6.87882255e-03, -3.15578291e-03, -8.00362948e-04, 8.81555008e-03, -9.23032804e-03,-1.83337109e-02, 2.17933554e-02, -6.52331525e-03, 1.50930257e-02] ]), 'b7': array([ 0.11697053, -0.02521648, 0.03697393, -0.015763 , -0.0456317 , -0.03476072, -0.05961871, 0.0096403 , 0.03581566, -0.01840983]) }
grads['W7'], the last weight entry, has the same shape as W7: one row for each of the 100 units in the last hidden layer, and 10 columns, one for each output class 0 to 9. Each element is the gradient of the loss with respect to the corresponding weight, not a probability output by the softmax function. Then
optimizer.update(network.params, grads)
This calls the update method of the SGD class in the library optimizer.py in the common folder, which subtracts the contents of grads, multiplied by the learning rate, from the parameters params. The example above updates with the SGD method. Besides SGD, the library also defines Momentum, AdaGrad, Adam, and RMSprop.
The updated params are used for the next mini-batch, so learning proceeds with each pass through the loop.
So what does this gradient method do? It finds the gradients of the weight parameters by backpropagation. It first computes the value of the loss function in the forward direction, then traces the layers set up when the network object was created in reverse order to find the gradients.
def gradient(self, x, t):
    # forward
    self.loss(x, t)
    # backward
    dout = 1
    dout = self.last_layer.backward(dout)
    layers = list(self.layers.values())
    layers.reverse()
    for layer in layers:
        dout = layer.backward(dout)
    # Collect the gradients
    grads = {}
    for idx in range(1, self.hidden_layer_num+2):
        grads['W' + str(idx)] = self.layers['Affine' + str(idx)].dW + self.weight_decay_lambda * self.layers['Affine' + str(idx)].W
        grads['b' + str(idx)] = self.layers['Affine' + str(idx)].db
    return grads
At first I didn't understand the self.loss(x, t) call. It runs a function, but the result doesn't seem to be used afterwards. So I traced what it does. What runs is the loss function defined in multi_layer_net.py.
network.loss(x_batch, t_batch)
62.09479496490768
def loss(self, x, t):
    y = self.predict(x)
    weight_decay = 0
    for idx in range(1, self.hidden_layer_num + 2):
        W = self.params['W' + str(idx)]
        weight_decay += 0.5 * self.weight_decay_lambda * np.sum(W ** 2)
    return self.last_layer.forward(y, t) + weight_decay
In the loss function, predict computes the result y from the input data. Inside it, the forward method of each layer from Affine1 to Affine7 is executed in order.
def predict(self, x):
    for layer in self.layers.values():
        x = layer.forward(x)
    return x
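To suppress overfitting, weight_decay is calculated from the weights (params['W1'] and so on) and added to the output. In the run traced here the two terms are as follows; they add up to the 62.09... returned by network.loss above, and the weight decay term is by far the larger of the two.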
weight_decay
59.84568388277881
network.last_layer.forward(y, t_batch)
2.2491110821288687
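The loss function adds 0.5 * weight_decay_lambda * np.sum(W ** 2) for each weight, and the gradient method adds weight_decay_lambda * W to each dW. The second expression is the derivative of the first. The following sketch checks that numerically on a small random matrix; l2_penalty, lam, and the matrix size are names and values chosen for this note only.

```python
import numpy as np

def l2_penalty(weights, weight_decay_lambda):
    """Sum of 0.5 * lambda * ||W||^2 over a list of weight arrays."""
    return sum(0.5 * weight_decay_lambda * np.sum(W ** 2) for W in weights)

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2))
lam = 0.1

# Analytic gradient of the penalty: lambda * W
analytic = lam * W

# Numerical gradient by central differences
h = 1e-4
numeric = np.zeros_like(W)
for i in np.ndindex(W.shape):
    original = W[i]
    W[i] = original + h
    plus = l2_penalty([W], lam)
    W[i] = original - h
    minus = l2_penalty([W], lam)
    W[i] = original                      # restore the element
    numeric[i] = (plus - minus) / (2 * h)

print(np.allclose(analytic, numeric))
```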
self.last_layer in self.last_layer.forward(y, t) is set in the initialization of the MultiLayerNet class:
self.last_layer = SoftmaxWithLoss()
Since it is defined this way, what actually runs is the forward method of SoftmaxWithLoss, defined in layers.py.
class SoftmaxWithLoss:
    def __init__(self):
        self.loss = None
        self.y = None  # softmax output
        self.t = None  # teacher data

    def forward(self, x, t):
        self.t = t
        self.y = softmax(x)
        self.loss = cross_entropy_error(self.y, self.t)
        return self.loss

    def backward(self, dout=1):
        batch_size = self.t.shape[0]
        if self.t.size == self.y.size:  # when the teacher data is a one-hot vector
            dx = (self.y - self.t) / batch_size
        else:
            dx = self.y.copy()
            dx[np.arange(batch_size), self.t] -= 1
            dx = dx / batch_size
        return dx
So, in this forward method, the cross entropy error is calculated and returned.
network.last_layer.loss
2.2491110821288687
from common.functions import *
cross_entropy_error(network.last_layer.y, network.last_layer.t)
2.2491110821288687
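The value 2.249... is close to ln 10 ≈ 2.303, which is the cross entropy error you get when all 10 classes are predicted with equal probability. The sketch below shows this with simplified stand-ins for softmax and cross_entropy_error. They are written for this note and assume integer class labels; they are not the versions in common/functions.py.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with the usual max subtraction for stability."""
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=-1, keepdims=True)

def cross_entropy_error(y, t):
    """Mean cross entropy for a batch; t holds integer class labels."""
    batch_size = y.shape[0]
    return -np.sum(np.log(y[np.arange(batch_size), t] + 1e-7)) / batch_size

scores = np.zeros((4, 10))            # identical scores for every class
labels = np.array([3, 1, 4, 1])       # arbitrary labels for the example
y = softmax(scores)                   # every probability is 0.1
loss = cross_entropy_error(y, labels)
print(loss, np.log(10))
```

So a loss a little below 2.3 is what you would expect from a network that has barely started learning.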
With that, I understood what self.loss(x, t) refers to and what it does.
The SoftmaxWithLoss layer later uses its backward method to find the gradient during backpropagation. That method refers to self.y and self.t, which are set when the forward method is executed. In other words, **the first self.loss(x, t) is not there to obtain the value of the loss function, but to prepare for the backward methods used in backpropagation**.
To go backward, you first have to go forward. Once you understand it, it's obvious.
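The same dependence on forward can be seen in a single small layer. Below is a simplified ReLU layer written for this note (an assumption about the general shape of such a layer, not a copy of layers.py): forward stores a mask, and backward only makes sense once that mask exists.

```python
import numpy as np

class Relu:
    """Minimal ReLU layer that caches a mask in forward for use in backward."""
    def __init__(self):
        self.mask = None

    def forward(self, x):
        self.mask = (x <= 0)      # remember where the input was not positive
        out = x.copy()
        out[self.mask] = 0
        return out

    def backward(self, dout):
        dx = dout.copy()
        dx[self.mask] = 0         # no gradient flows where forward output was 0
        return dx

layer = Relu()
x = np.array([[1.0, -2.0], [-3.0, 4.0]])
out = layer.forward(x)                    # sets layer.mask
dx = layer.backward(np.ones_like(x))      # uses layer.mask
print(out)
print(dx)
```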
After self.loss(x, t) has run and the predicted values and so on have been set from the input data, the gradient is calculated by backpropagation.
# backward
dout = 1
dout = self.last_layer.backward(dout)
self.last_layer.backward(dout) calls SoftmaxWithLoss.backward(). The dout it returns is an array of the differences between the predicted values y and the teacher labels t, divided by the batch size. With a batch of 100 it has one row per sample: [(y1 - t1)/100, (y2 - t2)/100, (y3 - t3)/100, ..., (y100 - t100)/100]
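The backward code quoted earlier computes dx = (self.y - self.t) / batch_size for one-hot labels. As a sanity check, the sketch below compares that formula with a numerical gradient of the softmax cross entropy loss on a tiny random batch. softmax and loss_fn here are simplified versions written for this note, and the batch size and class count are arbitrary.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=-1, keepdims=True)

def loss_fn(x, t):
    """Mean cross entropy of softmax(x) against one-hot labels t."""
    y = softmax(x)
    return -np.sum(t * np.log(y)) / x.shape[0]

rng = np.random.default_rng(0)
batch_size, classes = 5, 3
x = rng.standard_normal((batch_size, classes))
t = np.eye(classes)[rng.integers(0, classes, size=batch_size)]  # one-hot labels

# Analytic gradient, same formula as SoftmaxWithLoss.backward
analytic = (softmax(x) - t) / batch_size

# Numerical gradient by central differences
h = 1e-5
numeric = np.zeros_like(x)
for i in np.ndindex(x.shape):
    original = x[i]
    x[i] = original + h
    plus = loss_fn(x, t)
    x[i] = original - h
    minus = loss_fn(x, t)
    x[i] = original
    numeric[i] = (plus - minus) / (2 * h)

print(np.max(np.abs(analytic - numeric)))
```

The printed difference should be tiny, which is a good sign that the formula and the loss agree.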
layers = list(self.layers.values())
layers.reverse()
for layer in layers:
    dout = layer.backward(dout)
layers.reverse() reverses the order of the stacked layers, and dout = layer.backward(dout) is repeated to find the gradients. Unrolled, the loop looks like this.
dout = layers[0].backward(dout) #Affine7
dout = layers[1].backward(dout) #Activation_function6 Relu
dout = layers[2].backward(dout) #Affine6
dout = layers[3].backward(dout) #Activation_function5 Relu
dout = layers[4].backward(dout) #Affine5
dout = layers[5].backward(dout) #Activation_function4 Relu
dout = layers[6].backward(dout) #Affine4
dout = layers[7].backward(dout) #Activation_function3 Relu
dout = layers[8].backward(dout) #Affine3
dout = layers[9].backward(dout) #Activation_function2 Relu
dout = layers[10].backward(dout) #Affine2
dout = layers[11].backward(dout) #Activation_function1 Relu
dout = layers[12].backward(dout) #Affine1
In each Affine layer's backward method, the self.x that is referenced was set when the forward method was executed, and self.W is the weight array passed in when the layer was created.
class Affine:
    def __init__(self, W, b):
        self.W = W
        self.b = b
        self.x = None
        self.original_x_shape = None
        # gradients of the weight and bias parameters
        self.dW = None
        self.db = None

    def forward(self, x):
        # support tensor input
        self.original_x_shape = x.shape
        x = x.reshape(x.shape[0], -1)
        self.x = x
        out = np.dot(self.x, self.W) + self.b
        return out

    def backward(self, dout):
        dx = np.dot(dout, self.W.T)
        self.dW = np.dot(self.x.T, dout)
        self.db = np.sum(dout, axis=0)
        dx = dx.reshape(*self.original_x_shape)  # restore the shape of the input data (support tensor input)
        return dx
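The three formulas in backward are easier to trust once the shapes are written out. The sketch below applies the same expressions to small random arrays and checks dW against a numerical gradient. The sizes, the scalar helper, and the random upstream gradient dout are all assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, n_in, n_out = 4, 3, 2
x = rng.standard_normal((batch_size, n_in))
W = rng.standard_normal((n_in, n_out))
b = np.zeros(n_out)
dout = rng.standard_normal((batch_size, n_out))   # stand-in upstream gradient

# Same expressions as Affine.backward
dx = np.dot(dout, W.T)        # shape (batch_size, n_in), same as x
dW = np.dot(x.T, dout)        # shape (n_in, n_out), same as W
db = np.sum(dout, axis=0)     # shape (n_out,), same as b
print(dx.shape, dW.shape, db.shape)

def scalar(W_):
    """Scalar whose gradient with respect to the layer output is exactly dout."""
    return np.sum((np.dot(x, W_) + b) * dout)

# Numerical gradient of that scalar with respect to W
h = 1e-5
numeric_dW = np.zeros_like(W)
for i in np.ndindex(W.shape):
    Wp, Wm = W.copy(), W.copy()
    Wp[i] += h
    Wm[i] -= h
    numeric_dW[i] = (scalar(Wp) - scalar(Wm)) / (2 * h)

print(np.allclose(dW, numeric_dW))
```

Each gradient has the same shape as the array it belongs to, which is why grads['W7'] above has 10 columns.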
Using the dW and db obtained in each layer, the weight and bias gradients of each layer are collected (with the weight decay term weight_decay_lambda * W added to dW) and returned as the value of the function.
# Collect the gradients
grads = {}
for idx in range(1, self.hidden_layer_num+2):
    grads['W' + str(idx)] = self.layers['Affine' + str(idx)].dW + self.weight_decay_lambda * self.layers['Affine' + str(idx)].W
    grads['b' + str(idx)] = self.layers['Affine' + str(idx)].db
The parameters are updated with the returned gradients, and that completes one mini-batch iteration.
grads = network.gradient(x_batch, t_batch)
optimizer.update(network.params, grads)
class SGD:
    def __init__(self, lr=0.01):
        self.lr = lr

    def update(self, params, grads):
        for key in params.keys():
            params[key] -= self.lr * grads[key]
lr is the learning rate. In this example it is set to 0.01.
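One detail worth noticing is that the update uses -=, which changes a NumPy array in place. As far as I can tell from the code quoted in this note, each Affine layer keeps a reference to the same array that is stored in params, so the layers see the new weights without being rebuilt. The sketch below illustrates only the NumPy behaviour; layer_W is a hypothetical stand-in for a layer's self.W.

```python
import numpy as np

params = {'W1': np.array([1.0, 2.0])}
layer_W = params['W1']                 # a second reference to the same array
grads = {'W1': np.array([0.5, -0.5])}
lr = 0.01

# Same statement as in SGD.update: an in-place subtraction
for key in params.keys():
    params[key] -= lr * grads[key]

print(layer_W)                         # the "layer" sees the updated values
print(layer_W is params['W1'])         # still the same object
```

Writing params[key] = params[key] - lr * grads[key] instead would create a new array, and a layer holding the old reference would not see the change.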
The MultiLayerNetExtend class in multi_layer_net_extend.py supports Dropout and Batch Normalization in layer generation, but the basics are the same as MultiLayerNet.