Somewhat abruptly, I have started studying Chapter 4 of "Deep Learning from Scratch: The Theory and Implementation of Deep Learning Learned with Python". This is a memo of the things I stumbled over along the way.
The execution environment is macOS Mojave + Anaconda 2019.10, with Python 3.7.4. For details, see Chapter 1 of this memo.
(To other chapters of this memo: Chapter 1 / Chapter 2 / Chapter 3 / Chapter 4 / Chapter 5 / [Chapter 6](https://qiita.com/segavvy/items/ca4ac4c9ee1a126bff41) / Chapter 7 / Chapter 8 / Summary)
This chapter is about how neural networks learn.
Usually, a person finds the regularities, comes up with an algorithm, writes it as a program, and has a computer execute it. Machine learning, neural networks, and deep learning instead let the computer do the work of coming up with that algorithm.
In this book, approaches where a person first has to extract features from the data to be processed (vectorization, etc.) are called "machine learning", while approaches that also leave the feature extraction to the machine, so that raw data can be passed in as is, are called neural networks (deep learning). This definition may seem a bit rough, but I'm not very particular about terminology, so I'll move on without worrying about it.
The book explains training data, test data, overfitting, and so on; there was no particular stumbling block here.
It explains the sum of squared errors and the cross entropy error, which are often used as loss functions, and mini-batch learning, which trains on only a part of the training data. No particular stumbling block here either. Using all of the training data every time seems ideal, but it takes time and is inefficient; I think of it as something like a sample survey.
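For reference, the two loss functions as the book defines them, with $y_k$ the network output and $t_k$ the one-hot correct label:

```math
E = \frac{1}{2}\sum_k (y_k - t_k)^2
\qquad
E = -\sum_k t_k \log y_k
```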
It also explains why recognition accuracy cannot be used in place of the loss function: recognition accuracy does not respond to small changes in the parameters and changes discontinuously, so learning cannot proceed well. This may not click at first, but I think it will sink in after the explanation of derivatives that follows.
Next comes the explanation of differentiation. The discussion of rounding error in the implementation is practical. The words "derivative" and "partial derivative" sound difficult, but they just ask how the result changes when you change a value a little, so I could move on without having to review high school mathematics.
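As a minimal sketch of that rounding-error point (my own illustration, following the book's use of a central difference with $h = 10^{-4}$):

```python
import numpy as np

print(np.float32(1e-50))  # 0.0 -- an overly small h simply underflows

def numerical_diff(f, x):
    # So a moderate h = 1e-4 is used, together with the central difference
    # (f(x + h) - f(x - h)) / 2h, which keeps the approximation error small.
    h = 1e-4
    return (f(x + h) - f(x - h)) / (2 * h)

print(numerical_diff(lambda x: x ** 2, 3.0))  # roughly 6.0, the true derivative at x = 3
```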
By the way, according to Wikipedia the symbol $\partial$ that appears in partial derivatives is read "del", "dee", "partial dee", "round dee", and so on.
Still, it's nice how easily Python lets you pass a function as an argument. Back when I worked as a programmer, mainly in C/C++, I hated the notation of function pointers because it was so confusing :sweat:
The gradient is simply the vector of the partial derivatives with respect to all the variables. This in itself is not difficult.
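In symbols, for a function $f(x_0, x_1)$ it is just the vector that collects the partial derivatives:

```math
\nabla f = \left( \frac{\partial f}{\partial x_0}, \frac{\partial f}{\partial x_1} \right)
```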
It's nice that the values are shown rounded when you print a NumPy array of decimals.
```python
>>> import numpy as np
>>> a = np.array([1.00000000123, 2.99999999987])
>>> a
array([1., 3.])
```
However, it could be a problem if values were rounded without my knowledge, so I looked into what the actual behavior is and found that there is a function for configuring the display: `numpy.set_printoptions`. It lets you change how decimals are displayed and how arrays with many elements are abbreviated. For example, if you specify a large number of digits after the decimal point with `precision`, the values are displayed without being rounded.
```python
>>> np.set_printoptions(precision=12)
>>> a
array([1.00000000123, 2.99999999987])
```
This is convenient!
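To go back to the default display afterwards (my own addition, not from the book; NumPy's default precision is 8):

```python
>>> np.set_printoptions(precision=8)  # back to the default
>>> a
array([1., 3.])
```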
In the text, the term "gradient descent method" appears; in the teaching materials I used when I studied this before, it was translated as "the steepest descent method".
There is also the symbol $\eta$ indicating the learning rate, which is the Greek letter read "eta" (I memorized the reading when I studied this before, but I completely forgot it and had to google it :sweat:).
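For reference, the gradient descent update rule with the learning rate $\eta$, as given in the book:

```math
x_i \leftarrow x_i - \eta \frac{\partial f}{\partial x_i}
```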
To compute the gradient we use `numerical_gradient(f, x)`, but the function passed as `f` is the following:
```python
def f(W):
    return net.loss(x, t)
```
Wait, does this function even use its argument `W`? It puzzled me for a moment, but since the book reuses the form of `numerical_gradient(f, x)` implemented in "4.4 Gradient" as is, the argument `W` is just a dummy. Indeed, the `simpleNet` class holds its own weights `W`, so there is no need to pass the weights `W` to the loss function `simpleNet.loss`. Having a dummy argument is confusing, so I decided to implement the function with no arguments.
Also, `numerical_gradient` needs to be modified here so that it can handle multidimensional arrays.
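A minimal sketch of the two variants, assuming the book's `simpleNet` with `x` and `t` already defined:

```python
# The book's form: W is a dummy argument that exists only so the signature
# matches the numerical_gradient(f, x) from section 4.4.
f_with_dummy = lambda W: net.loss(x, t)

# The argument-less form used in this memo; numerical_gradient is modified
# accordingly so that it calls f() with no arguments.
f_no_args = lambda: net.loss(x, t)
```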
From here on, we actually implement Stochastic Gradient Descent (SGD) using what we have learned so far.
The first file is `functions.py`, a collection of the functions we need.
functions.py
```python
# coding: utf-8
import numpy as np


def sigmoid(x):
    """Sigmoid function.

    The book's implementation overflows, so this is fixed with
    reference to the following site.
    http://www.kamishima.net/mlmpyja/lr/sigmoid.html

    Args:
        x (numpy.ndarray): Input.
    Returns:
        numpy.ndarray: Output.
    """
    # Clip x to a range that does not overflow
    sigmoid_range = 34.538776394910684
    x2 = np.maximum(np.minimum(x, sigmoid_range), -sigmoid_range)

    # Sigmoid function
    return 1 / (1 + np.exp(-x2))


def softmax(x):
    """Softmax function.

    Args:
        x (numpy.ndarray): Input.
    Returns:
        numpy.ndarray: Output.
    """
    # For batch processing, x is a two-dimensional array of shape
    # (batch size, 10), and the calculation must be done per image using
    # broadcasting. So that the 1D and 2D cases can share the same code,
    # np.max() and np.sum() are computed with axis=-1, and keepdims=True
    # keeps the dimensions so the result can be broadcast as is.
    c = np.max(x, axis=-1, keepdims=True)
    exp_a = np.exp(x - c)  # Overflow countermeasure
    sum_exp_a = np.sum(exp_a, axis=-1, keepdims=True)
    y = exp_a / sum_exp_a
    return y


def numerical_gradient(f, x):
    """Gradient calculation.

    Args:
        f (function): Loss function.
        x (numpy.ndarray): Array of weight parameters whose gradient is wanted.
    Returns:
        numpy.ndarray: Gradient.
    """
    h = 1e-4
    grad = np.zeros_like(x)

    # Enumerate the elements of the multidimensional array with np.nditer
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:

        idx = it.multi_index  # it.multi_index is the index (tuple) of the current element

        tmp_val = x[idx]  # Save the original value

        # Calculation of f(x + h)
        x[idx] = tmp_val + h
        fxh1 = f()

        # Calculation of f(x - h)
        x[idx] = tmp_val - h
        fxh2 = f()

        # Calculate the gradient
        grad[idx] = (fxh1 - fxh2) / (2 * h)

        x[idx] = tmp_val  # Restore the original value
        it.iternext()

    return grad


def cross_entropy_error(y, t):
    """Calculation of the cross entropy error.

    Args:
        y (numpy.ndarray): Neural network output.
        t (numpy.ndarray): Correct labels (one-hot).
    Returns:
        float: Cross entropy error.
    """
    # Reshape if there is only a single sample
    if y.ndim == 1:
        t = t.reshape(1, t.size)
        y = y.reshape(1, y.size)

    # Calculate the error and normalize by the batch size
    batch_size = y.shape[0]
    return -np.sum(t * np.log(y + 1e-7)) / batch_size


def sigmoid_grad(x):
    """Function covered in Chapter 5. Needed for backpropagation.
    """
    return (1.0 - sigmoid(x)) * sigmoid(x)
```
`softmax` was tidied up even further compared with [Notes on what an amateur stumbled over in "Deep Learning from Scratch": Chapter 3](https://qiita.com/segavvy/items/6d79d0c3b4367869f4ea#35-%E5%87%BA%E5%8A%9B%E5%B1%A4%E3%81%AE%E8%A8%AD%E8%A8%88). I referred to "softmax function code improvement plan #45" in the issues of the book's GitHub repository.
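A quick check (my own example) that the same `softmax` handles both a single input and a batch:

```python
import numpy as np
from functions import softmax

print(softmax(np.array([0.3, 2.9, 4.0])))       # 1D input: the outputs sum to 1
print(softmax(np.array([[0.3, 2.9, 4.0],
                        [1.0, 1.0, 1.0]])))     # 2D batch: each row sums to 1
```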
As mentioned above, `numerical_gradient` no longer takes a function with an argument: the function passed as `f` is called without arguments. It also loops with `numpy.nditer` so that it can handle multidimensional arrays. In the book's code, `op_flags=['readwrite']` is specified when using `numpy.nditer`, but here `multi_index` is only used to get the index for accessing `x`, and the objects enumerated by the iterator are never updated, so I omitted `op_flags` (the default is `op_flags=['readonly']`). See Iterating Over Arrays # Modifying Array Values for details.
The last function, `sigmoid_grad`, is covered in Chapter 5, but it is needed to shorten the processing time (described later), so it is implemented as in the book.
Next is `two_layer_net.py`, which implements the two-layer neural network.
two_layer_net.py
```python
# coding: utf-8
from functions import sigmoid, softmax, numerical_gradient, \
    cross_entropy_error, sigmoid_grad
import numpy as np


class TwoLayerNet:

    def __init__(self, input_size, hidden_size, output_size,
                 weight_init_std=0.01):
        """Two-layer neural network.

        Args:
            input_size (int): Number of neurons in the input layer.
            hidden_size (int): Number of neurons in the hidden layer.
            output_size (int): Number of neurons in the output layer.
            weight_init_std (float, optional): Scaling parameter for the
                initial weights. Defaults to 0.01.
        """
        # Weight initialization
        self.params = {}
        self.params['W1'] = weight_init_std * \
            np.random.randn(input_size, hidden_size)
        self.params['b1'] = np.zeros(hidden_size)
        self.params['W2'] = weight_init_std * \
            np.random.randn(hidden_size, output_size)
        self.params['b2'] = np.zeros(output_size)

    def predict(self, x):
        """Inference with the neural network.

        Args:
            x (numpy.ndarray): Input to the neural network.
        Returns:
            numpy.ndarray: Output of the neural network.
        """
        # Retrieve the parameters
        W1, W2 = self.params['W1'], self.params['W2']
        b1, b2 = self.params['b1'], self.params['b2']

        # Neural network calculation (forward)
        a1 = np.dot(x, W1) + b1
        z1 = sigmoid(a1)
        a2 = np.dot(z1, W2) + b2
        y = softmax(a2)

        return y

    def loss(self, x, t):
        """Calculation of the loss function value.

        Args:
            x (numpy.ndarray): Input to the neural network.
            t (numpy.ndarray): Correct labels.
        Returns:
            float: Value of the loss function.
        """
        # Inference
        y = self.predict(x)

        # Calculation of the cross entropy error
        loss = cross_entropy_error(y, t)

        return loss

    def accuracy(self, x, t):
        """Calculation of the recognition accuracy.

        Args:
            x (numpy.ndarray): Input to the neural network.
            t (numpy.ndarray): Correct labels.
        Returns:
            float: Recognition accuracy.
        """
        y = self.predict(x)
        y = np.argmax(y, axis=1)
        t = np.argmax(t, axis=1)

        accuracy = np.sum(y == t) / x.shape[0]
        return accuracy

    def numerical_gradient(self, x, t):
        """Gradient calculation for the weight parameters.

        Args:
            x (numpy.ndarray): Input to the neural network.
            t (numpy.ndarray): Correct labels.
        Returns:
            dictionary: Dictionary containing the gradients.
        """
        grads = {}
        grads['W1'] = \
            numerical_gradient(lambda: self.loss(x, t), self.params['W1'])
        grads['b1'] = \
            numerical_gradient(lambda: self.loss(x, t), self.params['b1'])
        grads['W2'] = \
            numerical_gradient(lambda: self.loss(x, t), self.params['W2'])
        grads['b2'] = \
            numerical_gradient(lambda: self.loss(x, t), self.params['b2'])

        return grads

    def gradient(self, x, t):
        """Method covered in Chapter 5: implementation of backpropagation.
        """
        W1, W2 = self.params['W1'], self.params['W2']
        b1, b2 = self.params['b1'], self.params['b2']
        grads = {}

        batch_num = x.shape[0]

        # forward
        a1 = np.dot(x, W1) + b1
        z1 = sigmoid(a1)
        a2 = np.dot(z1, W2) + b2
        y = softmax(a2)

        # backward
        dy = (y - t) / batch_num
        grads['W2'] = np.dot(z1.T, dy)
        grads['b2'] = np.sum(dy, axis=0)

        dz1 = np.dot(dy, W2.T)
        da1 = sigmoid_grad(a1) * dz1
        grads['W1'] = np.dot(x.T, da1)
        grads['b1'] = np.sum(da1, axis=0)

        return grads
```
It's almost the same as the code in the book. The last method, `gradient`, is something covered in Chapter 5, but it is needed to shorten the processing time (described later), so it is implemented as in the book.
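As a quick smoke test of the class (my own sketch, using random data in place of MNIST):

```python
import numpy as np
from two_layer_net import TwoLayerNet

net = TwoLayerNet(input_size=784, hidden_size=50, output_size=10)
x = np.random.rand(3, 784)                # 3 dummy "images"
t = np.eye(10)[np.random.choice(10, 3)]   # 3 one-hot dummy labels

print(net.loss(x, t))       # before training, roughly ln(10) ≈ 2.3
print(net.accuracy(x, t))   # around chance level
grads = net.gradient(x, t)  # gradients for 'W1', 'b1', 'W2', 'b2'
print(grads['W1'].shape)    # (784, 50)
```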
Finally, the implementation of mini-batch learning.
mnist.py
```python
# coding: utf-8
import numpy as np
import matplotlib.pylab as plt
import os
import sys
from two_layer_net import TwoLayerNet
sys.path.append(os.pardir)  # Add the parent directory to the path
from dataset.mnist import load_mnist

# Read the MNIST training data and test data
(x_train, t_train), (x_test, t_test) = \
    load_mnist(normalize=True, one_hot_label=True)

# Hyperparameter settings
iters_num = 10000    # Number of updates
batch_size = 100     # Batch size
learning_rate = 0.1  # Learning rate

# Lists for recording the results
train_loss_list = []  # Transition of the loss function value
train_acc_list = []   # Recognition accuracy on the training data
test_acc_list = []    # Recognition accuracy on the test data

train_size = x_train.shape[0]  # Training data size
iter_per_epoch = max(train_size / batch_size, 1)  # Number of iterations per epoch

# Generate the two-layer neural network
network = TwoLayerNet(input_size=784, hidden_size=50, output_size=10)

# Start learning
for i in range(iters_num):

    # Mini-batch generation
    batch_mask = np.random.choice(train_size, batch_size, replace=False)
    x_batch = x_train[batch_mask]
    t_batch = t_train[batch_mask]

    # Gradient calculation
    # grad = network.numerical_gradient(x_batch, t_batch)  # too slow, so use backpropagation instead
    grad = network.gradient(x_batch, t_batch)

    # Weight parameter update
    for key in ('W1', 'b1', 'W2', 'b2'):
        network.params[key] -= learning_rate * grad[key]

    # Calculation of the loss function value
    loss = network.loss(x_batch, t_batch)
    train_loss_list.append(loss)

    # Recognition accuracy calculation for each epoch
    if i % iter_per_epoch == 0:
        train_acc = network.accuracy(x_train, t_train)
        test_acc = network.accuracy(x_test, t_test)
        train_acc_list.append(train_acc)
        test_acc_list.append(test_acc)

        # Progress display
        print(f"[Number of updates]{i: >4} [Loss function value]{loss:.4f} "
              f"[Training data recognition accuracy]{train_acc:.4f} "
              f"[Test data recognition accuracy]{test_acc:.4f}")

# Draw the transition of the loss function value
x = np.arange(len(train_loss_list))
plt.plot(x, train_loss_list, label='loss')
plt.xlabel("iteration")
plt.ylabel("loss")
plt.xlim(left=0)
plt.ylim(bottom=0)
plt.show()

# Draw the transition of the recognition accuracy on training data and test data
x2 = np.arange(len(train_acc_list))
plt.plot(x2, train_acc_list, label='train acc')
plt.plot(x2, test_acc_list, label='test acc', linestyle='--')
plt.xlabel("epochs")
plt.ylabel("accuracy")
plt.xlim(left=0)
plt.ylim(0, 1.0)
plt.legend(loc='lower right')
plt.show()
```
In the book's code, [`numpy.random.choice`](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.choice.html), which is used for mini-batch generation, is called without `replace=False`, but I specified it because otherwise the same element can be picked more than once.
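For example (my own illustration of the difference):

```python
import numpy as np

print(np.random.choice(5, 3))                 # e.g. [2 2 4] -- duplicates possible
print(np.random.choice(5, 3, replace=False))  # e.g. [0 3 1] -- always distinct
```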
Originally, the gradient is supposed to be calculated by numerical differentiation with `TwoLayerNet.numerical_gradient`, but the processing is slow: in my environment, ~~it looked like 10,000 updates would not finish even after a whole day~~ only about 600 updates finished in half a day, so 10,000 updates would take roughly 8 days. Therefore, following the advice in the book, I used `TwoLayerNet.gradient`, which implements the backpropagation method covered in Chapter 5, instead.
Finally, the transitions of the loss function value and of the recognition accuracy on the training and test data are plotted.
Below are the execution results.
```
[Number of updates] 0 [Loss function value]2.2882 [Training data recognition accuracy]0.1044 [Test data recognition accuracy]0.1028
[Number of updates] 600 [Loss function value]0.8353 [Training data recognition accuracy]0.7753 [Test data recognition accuracy]0.7818
[Number of updates]1200 [Loss function value]0.4573 [Training data recognition accuracy]0.8744 [Test data recognition accuracy]0.8778
[Number of updates]1800 [Loss function value]0.4273 [Training data recognition accuracy]0.8972 [Test data recognition accuracy]0.9010
[Number of updates]2400 [Loss function value]0.3654 [Training data recognition accuracy]0.9076 [Test data recognition accuracy]0.9098
[Number of updates]3000 [Loss function value]0.2816 [Training data recognition accuracy]0.9142 [Test data recognition accuracy]0.9146
[Number of updates]3600 [Loss function value]0.3238 [Training data recognition accuracy]0.9195 [Test data recognition accuracy]0.9218
[Number of updates]4200 [Loss function value]0.2017 [Training data recognition accuracy]0.9231 [Test data recognition accuracy]0.9253
[Number of updates]4800 [Loss function value]0.1910 [Training data recognition accuracy]0.9266 [Test data recognition accuracy]0.9289
[Number of updates]5400 [Loss function value]0.1528 [Training data recognition accuracy]0.9306 [Test data recognition accuracy]0.9320
[Number of updates]6000 [Loss function value]0.1827 [Training data recognition accuracy]0.9338 [Test data recognition accuracy]0.9347
[Number of updates]6600 [Loss function value]0.1208 [Training data recognition accuracy]0.9362 [Test data recognition accuracy]0.9375
[Number of updates]7200 [Loss function value]0.1665 [Training data recognition accuracy]0.9391 [Test data recognition accuracy]0.9377
[Number of updates]7800 [Loss function value]0.1787 [Training data recognition accuracy]0.9409 [Test data recognition accuracy]0.9413
[Number of updates]8400 [Loss function value]0.1564 [Training data recognition accuracy]0.9431 [Test data recognition accuracy]0.9429
[Number of updates]9000 [Loss function value]0.2361 [Training data recognition accuracy]0.9449 [Test data recognition accuracy]0.9437
[Number of updates]9600 [Loss function value]0.2183 [Training data recognition accuracy]0.9456 [Test data recognition accuracy]0.9448
```
Looking at the results, the recognition accuracy already reached around 94.5%, exceeding the accuracy of the pre-trained parameters provided in Chapter 3.
Chapter 4 may be easy enough to just read through, but it was quite hard going to work through it while implementing everything myself. (I would have liked an explanation of how to make the softmax function and the numerical differentiation function handle multidimensional arrays...)
That's all for this chapter. If you notice any mistakes, I would appreciate it if you could point them out. (To other chapters of this memo: Chapter 1 / Chapter 2 / Chapter 3 / Chapter 4 / Chapter 5 / [Chapter 6](https://qiita.com/segavvy/items/ca4ac4c9ee1a126bff41) / Chapter 7 / Chapter 8 / Summary)