Notes from an amateur stumbling through "Deep Learning from Scratch": Chapter 4

Introduction

Jumping right in, I have started studying Chapter 4 of "Deep Learning from Scratch: The Theory and Implementation of Deep Learning Learned with Python". This is a memo of the stumbles I hit along the way.

The execution environment is macOS Mojave + Anaconda 2019.10, and the Python version is 3.7.4. For details, refer to Chapter 1 of this memo.

(To other chapters of this memo: Chapter 1 / Chapter 2 / Chapter 3 / Chapter 4 / Chapter 5 / [Chapter 6](https://qiita.com/segavvy/items/ca4ac4c9ee1a126bff41) / Chapter 7 / Chapter 8 / Summary)

Chapter 4 Learning Neural Networks

This chapter describes how neural networks learn.

4.1 Learning from data

Usually, a person finds the regularities, devises an algorithm, writes it as a program, and has a computer execute it. Machine learning, neural networks, and deep learning instead let the computer do the work of devising that algorithm.

In this book, approaches that require features (vectorization, etc.) designed by a person in advance for the data to be processed are called "machine learning", while approaches that can take the raw data as-is, leaving even the feature extraction to the machine, are called "neural networks (deep learning)". This definition may feel a bit rough, but I'm not very interested in the fine points of terminology, so I'll move on without worrying about it.

The section explains training data, test data, overfitting, and so on, but there was nothing in particular to stumble over.

4.2 Loss function

This section explains the sum-of-squares error and the cross-entropy error, which are commonly used as loss functions, and mini-batch learning, which learns from just a part of the training data. There was nothing in particular to stumble over here either. Using all the training data sounds ideal, but it takes time and is inefficient; I think of it as something like a sample survey.
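
For my own reference, the two losses are the sum-of-squares error $E = \frac{1}{2}\sum_k (y_k - t_k)^2$ and the cross-entropy error $E = -\sum_k t_k \log y_k$, where $y_k$ is the network output and $t_k$ is the one-hot correct label.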

It also explains why recognition accuracy cannot be used in place of the loss function: accuracy does not react to small changes in the parameters and changes discontinuously, so learning cannot proceed well. This may not click right away, but I think it will fall into place after the explanation of derivatives that follows.
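
To convince myself, here is a toy check of my own (not from the book): nudging the output slightly leaves the accuracy unchanged, while the cross-entropy error moves continuously.

python

import numpy as np

t = np.array([0, 0, 1])            # one-hot correct label
y1 = np.array([0.30, 0.31, 0.39])  # output before a tiny parameter change
y2 = np.array([0.30, 0.30, 0.40])  # output after a tiny parameter change

# Accuracy only looks at argmax, so it is correct in both cases -- no signal to learn from
print(np.argmax(y1) == np.argmax(t), np.argmax(y2) == np.argmax(t))

# Cross-entropy error changes continuously, so its gradient is usable
print(-np.sum(t * np.log(y1 + 1e-7)), -np.sum(t * np.log(y2 + 1e-7)))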

4.3 Numerical differentiation

This section explains differentiation. The discussion of rounding error at implementation time is practical. The words "derivative" and "partial derivative" sound difficult, but the idea is simply: how does the result change if you change the value a little? Seen that way, I could move on without having to review high-school mathematics.
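
As a minimal sketch of the idea (my own rendering of the central-difference approach discussed here, not the book's repository code): if h is made far too small, say 1e-50, the value is lost to rounding error, so h = 1e-4 is used instead.

python

def numerical_diff(f, x):
    """Central-difference approximation of the derivative of f at x."""
    h = 1e-4  # a much smaller h (e.g. 1e-50) disappears due to rounding error
    return (f(x + h) - f(x - h)) / (2 * h)

# Example: the derivative of x**2 at x = 3 is 6
print(numerical_diff(lambda x: x ** 2, 3.0))  # approximately 6.0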

By the way, according to Wikipedia, the symbol $\partial$ that appears in partial derivatives is read "del", "dee", "partial dee", "round dee", and so on.

Still, it is nice how easily Python lets you pass a function as an argument. Back when I was a programmer I mainly used C/C++, and I hated function-pointer notation because it was so confusing :sweat:

4.4 Gradient

The gradient is the vector of the partial derivatives of all variables. This in itself is not difficult.

It is also handy that NumPy rounds the values for display when printing an array of decimals.

python


>>> import numpy as np
>>> a = np.array([1.00000000123, 2.99999999987])
>>> a
array([1., 3.])

However, being rounded without asking could be a problem in some cases, so I looked up how this behavior is specified and found that there is a function for configuring the display: numpy.set_printoptions. It lets you change how decimals are displayed and how arrays with many elements are abbreviated. For example, if you specify a larger number of digits after the decimal point with precision, the values are displayed without being rounded.

python


>>> np.set_printoptions(precision=12)
>>> a
array([1.00000000123, 2.99999999987])

This is convenient!

4.4.1 Gradient method

The term "gradient descent method" appears in the text; in the teaching materials I used when I studied this before, it was called "the steepest descent method".

There is also the symbol $\eta$ indicating the learning rate, which is the Greek letter read "eta" (I had learned how to read it when I studied before, but I completely forgot and had to google it :sweat:).
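
As a refresher for myself, the gradient method repeats the update $x \leftarrow x - \eta \frac{\partial f}{\partial x}$. Below is a minimal sketch of my own (the book's version computes the gradient numerically inside the loop).

python

import numpy as np

def gradient_descent(grad_f, init_x, lr=0.01, step_num=100):
    """Repeat the update x <- x - lr * grad_f(x) for step_num steps."""
    x = init_x.copy()
    for _ in range(step_num):
        x -= lr * grad_f(x)
    return x

# Example: minimize f(x0, x1) = x0**2 + x1**2, whose gradient is (2*x0, 2*x1)
print(gradient_descent(lambda x: 2 * x, np.array([-3.0, 4.0]), lr=0.1))
# -> both elements end up very close to 0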

4.4.2 Gradient with respect to neural network

We use numerical_gradient(f, x) to find the gradient, but the function passed as f is

python


def f(W):
    return net.loss(x, t)

Huh? Does this function even use the argument W? I was a little confused, but the argument W is a dummy, kept only so that the signature of numerical_gradient(f, x) implemented in "4.4 Gradient" can be reused as-is. Indeed, the simpleNet class holds its own weights W, so there is no need to pass W to the loss function simpleNet.loss. Having a dummy argument is confusing, so I decided to implement it with no arguments instead.
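
Concretely, what I ended up with looks roughly like this (a sketch only: SimpleNet here is a hypothetical stand-in for the book's simpleNet, and it assumes the zero-argument numerical_gradient from the functions.py shown later).

python

import numpy as np
from functions import softmax, cross_entropy_error, numerical_gradient


class SimpleNet:
    """A tiny net like the book's simpleNet: a single 2x3 weight matrix, no bias."""
    def __init__(self):
        self.W = np.random.randn(2, 3)

    def loss(self, x, t):
        y = softmax(np.dot(x, self.W))
        return cross_entropy_error(y, t)


net = SimpleNet()
x = np.array([0.6, 0.9])
t = np.array([0, 0, 1])

# No dummy W argument: the lambda just captures x and t, and
# numerical_gradient() calls it with no arguments while perturbing net.W in place.
dW = numerical_gradient(lambda: net.loss(x, t), net.W)
print(dW)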

In addition, numerical_gradient needs to be modified here so that it can handle multidimensional arrays.

4.5 Implementation of learning algorithm

From now on, we will actually implement Stochastic Gradient Descent (SGD) using what we have learned so far.

The first is functions.py, which is a collection of necessary functions.

functions.py


# coding: utf-8
import numpy as np


def sigmoid(x):
    """Sigmoid function
Since it overflows in the implementation of the book, it is corrected by referring to the following site.
    http://www.kamishima.net/mlmpyja/lr/sigmoid.html

    Args:
        x (numpy.ndarray):input
    
    Returns:
        numpy.ndarray:output
    """
    # Clip x to a range that does not overflow
    sigmoid_range = 34.538776394910684
    x2 = np.maximum(np.minimum(x, sigmoid_range), -sigmoid_range)

    #Sigmoid function
    return 1 / (1 + np.exp(-x2))


def softmax(x):
    """Softmax function
    
    Args:
        x (numpy.ndarray):input
    
    Returns:
        numpy.ndarray:output
    """
    # For batch processing, x becomes a two-dimensional array of shape
    # (batch size, 10). In that case each image has to be handled properly
    # via broadcasting. So that the same code works for both 1D and 2D input,
    # np.max() and np.sum() are computed along axis=-1, with keepdims=True
    # so that the result keeps its dimensions and can be broadcast as is.
    c = np.max(x, axis=-1, keepdims=True)
    exp_a = np.exp(x - c)  # overflow countermeasure
    sum_exp_a = np.sum(exp_a, axis=-1, keepdims=True)
    y = exp_a / sum_exp_a
    return y


def numerical_gradient(f, x):
    """Gradient calculation
    
    Args:
        f (function): loss function, called with no arguments
        x (numpy.ndarray): array of weight parameters whose gradient is to be computed

    Returns:
        numpy.ndarray: gradient
    """
    h = 1e-4
    grad = np.zeros_like(x)

    # Enumerate the elements of the multidimensional array with np.nditer
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:

        idx = it.multi_index  # index tuple of the current element
        tmp_val = x[idx]  # save the original value

        # Calculate f(x + h)
        x[idx] = tmp_val + h
        fxh1 = f()

        # Calculate f(x - h)
        x[idx] = tmp_val - h
        fxh2 = f()

        #Calculate the gradient
        grad[idx] = (fxh1 - fxh2) / (2 * h)
    
        x[idx] = tmp_val  # restore the original value
        it.iternext()

    return grad


def cross_entropy_error(y, t):
    """Calculation of cross entropy error
    
    Args:
        y (numpy.ndarray):Neural network output
        t (numpy.ndarray):Correct label
    
    Returns:
        float:Cross entropy error
    """

    # Reshape when there is only a single sample
    if y.ndim == 1:
        t = t.reshape(1, t.size)
        y = y.reshape(1, y.size)

    #Calculate the error and normalize by the number of batches
    batch_size = y.shape[0]
    return -np.sum(t * np.log(y + 1e-7)) / batch_size


def sigmoid_grad(x):
    """Functions learned in Chapter 5. Required when using the error back propagation method.
    """
    return (1.0 - sigmoid(x)) * sigmoid(x)

softmax is a further cleanup of the version from [Notes from an amateur stumbling through "Deep Learning from Scratch": Chapter 3](https://qiita.com/segavvy/items/6d79d0c3b4367869f4ea#35-%E5%87%BA%E5%8A%9B%E5%B1%A4%E3%81%AE%E8%A8%AD%E8%A8%88). I referred to "softmax function code improvement plan" #45 in the issues of this book's GitHub repository.

In numerical_gradient, the argument of the function passed as f has been eliminated, as mentioned above. It also loops with numpy.nditer so that it can handle multidimensional arrays. The book's code specifies op_flags=['readwrite'] when using numpy.nditer, but since this loop only extracts the index for accessing x via multi_index and never updates the objects enumerated by the iterator, I omitted op_flags (leaving the default, op_flags=['readonly']). See "Iterating Over Arrays # Modifying Array Values" in the NumPy documentation for details.
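
As a small aside, here is my own toy example of what enumerating indices with numpy.nditer and multi_index looks like; the loop only reads the index, so the default read-only mode is enough.

python

import numpy as np

a = np.array([[10.0, 20.0], [30.0, 40.0]])
it = np.nditer(a, flags=['multi_index'])
while not it.finished:
    idx = it.multi_index  # (0, 0), (0, 1), (1, 0), (1, 1) in turn
    print(idx, a[idx])    # the element is accessed through a[idx], not the iterator
    it.iternext()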

The last function, sigmoid_grad, is something learned in Chapter 5, but it is needed to shorten the processing time (described later), so it is implemented as in the book.

Next is two_layer_net.py, which implements a two-layer neural network.

two_layer_net.py


# coding: utf-8
from functions import sigmoid, softmax, numerical_gradient, \
    cross_entropy_error, sigmoid_grad
import numpy as np


class TwoLayerNet:

    def __init__(self, input_size, hidden_size, output_size,
                 weight_init_std=0.01):
        """Two-layer neural network
        
        Args:
            input_size (int):Number of neurons in the input layer
            hidden_size (int):Number of neurons in the hidden layer
            output_size (int):Number of neurons in the output layer
            weight_init_std (float, optional): scale of the initial weights. Defaults to 0.01.
        """

        #Weight initialization
        self.params = {}
        self.params['W1'] = weight_init_std * \
            np.random.randn(input_size, hidden_size)
        self.params['b1'] = np.zeros(hidden_size)
        self.params['W2'] = weight_init_std * \
            np.random.randn(hidden_size, output_size)
        self.params['b2'] = np.zeros(output_size)

    def predict(self, x):
        """Inference by neural network
        
        Args:
            x (numpy.ndarray):Input to neural network
        
        Returns:
            numpy.ndarray:Neural network output
        """
        #Parameter retrieval
        W1, W2 = self.params['W1'], self.params['W2']
        b1, b2 = self.params['b1'], self.params['b2']

        #Neural network calculation (forward)
        a1 = np.dot(x, W1) + b1
        z1 = sigmoid(a1)
        
        a2 = np.dot(z1, W2) + b2
        y = softmax(a2)

        return y

    def loss(self, x, t):
        """Loss function value calculation
        
        Args:
            x (numpy.ndarray):Input to neural network
            t (numpy.ndarray):Correct label

        Returns:
            float:Loss function value
        """
        #inference
        y = self.predict(x)

        #Calculation of cross entropy error
        loss = cross_entropy_error(y, t)

        return loss

    def accuracy(self, x, t):
        """Recognition accuracy calculation
        
        Args:
            x (numpy.ndarray):Input to neural network
            t (numpy.ndarray):Correct label
        
        Returns:
            float:Recognition accuracy
        """
        y = self.predict(x)
        y = np.argmax(y, axis=1)
        t = np.argmax(t, axis=1)
        
        accuracy = np.sum(y == t) / x.shape[0]
        return accuracy

    def numerical_gradient(self, x, t):
        """Gradient calculation for weight parameters
        
        Args:
            x (numpy.ndarray):Input to neural network
            t (numpy.ndarray):Correct label
        
        Returns:
            dictionary:A dictionary containing gradients
        """
        grads = {}
        grads['W1'] = \
            numerical_gradient(lambda: self.loss(x, t), self.params['W1'])
        grads['b1'] = \
            numerical_gradient(lambda: self.loss(x, t), self.params['b1'])
        grads['W2'] = \
            numerical_gradient(lambda: self.loss(x, t), self.params['W2'])
        grads['b2'] = \
            numerical_gradient(lambda: self.loss(x, t), self.params['b2'])

        return grads

    def gradient(self, x, t):
        """Functions learned in Chapter 5. Implementation of error back propagation method
        """
        W1, W2 = self.params['W1'], self.params['W2']
        b1, b2 = self.params['b1'], self.params['b2']
        grads = {}
        
        batch_num = x.shape[0]
        
        # forward
        a1 = np.dot(x, W1) + b1
        z1 = sigmoid(a1)
        a2 = np.dot(z1, W2) + b2
        y = softmax(a2)
        
        # backward
        dy = (y - t) / batch_num
        grads['W2'] = np.dot(z1.T, dy)
        grads['b2'] = np.sum(dy, axis=0)
        
        dz1 = np.dot(dy, W2.T)
        da1 = sigmoid_grad(a1) * dz1
        grads['W1'] = np.dot(x.T, da1)
        grads['b1'] = np.sum(da1, axis=0)

        return grads

This is almost the same as the code in the book. The last method, gradient, is something learned in Chapter 5, but it is needed to shorten the processing time (described later), so it is implemented as in the book.

Finally, the implementation of mini-batch learning.

mnist.py


# coding: utf-8
import numpy as np
import matplotlib.pylab as plt
import os
import sys
from two_layer_net import TwoLayerNet
sys.path.append(os.pardir)  #Add parent directory to path
from dataset.mnist import load_mnist


#Read MNIST training data and test data
(x_train, t_train), (x_test, t_test) = \
    load_mnist(normalize=True, one_hot_label=True)

#Hyperparameter settings
iters_num = 10000       #Number of updates
batch_size = 100        #Batch size
learning_rate = 0.1     #Learning rate

#Record list of results
train_loss_list = []    #Changes in the value of the loss function
train_acc_list = []     #Recognition accuracy for training data
test_acc_list = []      #Recognition accuracy for test data

train_size = x_train.shape[0]  #Training data size
iter_per_epoch = max(train_size / batch_size, 1)    #Number of iterations per epoch

#Two-layer neural network generation
network = TwoLayerNet(input_size=784, hidden_size=50, output_size=10)

#Start learning
for i in range(iters_num):

    #Mini batch generation
    batch_mask = np.random.choice(train_size, batch_size, replace=False)
    x_batch = x_train[batch_mask]
    t_batch = t_train[batch_mask]

    #Gradient calculation
    # grad = network.numerical_gradient(x_batch, t_batch)  # too slow, so backpropagation is used instead ...
    grad = network.gradient(x_batch, t_batch)

    #Weight parameter update
    for key in ('W1', 'b1', 'W2', 'b2'):
        network.params[key] -= learning_rate * grad[key]
    
    #Loss function value calculation
    loss = network.loss(x_batch, t_batch)
    train_loss_list.append(loss)

    #Recognition accuracy calculation for each epoch
    if i % iter_per_epoch == 0:
        train_acc = network.accuracy(x_train, t_train)
        test_acc = network.accuracy(x_test, t_test)
        train_acc_list.append(train_acc)
        test_acc_list.append(test_acc)

        #Progress display
        print(f"[Number of updates]{i: >4} [Loss function value]{loss:.4f} "
              f"[Training data recognition accuracy]{train_acc:.4f} [Test data recognition accuracy]{test_acc:.4f}")

#Draw the transition of the value of the loss function
x = np.arange(len(train_loss_list))
plt.plot(x, train_loss_list, label='loss')
plt.xlabel("iteration")
plt.ylabel("loss")
plt.xlim(left=0)
plt.ylim(bottom=0)
plt.show()

#Draw the transition of recognition accuracy of training data and test data
x2 = np.arange(len(train_acc_list))
plt.plot(x2, train_acc_list, label='train acc')
plt.plot(x2, test_acc_list, label='test acc', linestyle='--')
plt.xlabel("epochs")
plt.ylabel("accuracy")
plt.xlim(left=0)
plt.ylim(0, 1.0)
plt.legend(loc='lower right')
plt.show()

In the book's code, [numpy.random.choice](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.choice.html), which is used to generate the mini-batch, is called without specifying replace=False, but I specified it because otherwise the same element may be drawn more than once.
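
As a quick check of my own (not from the book), the difference looks like this:

python

import numpy as np

# With the default replace=True, the same index can be drawn more than once
print(np.random.choice(10, 8))
# With replace=False, all drawn indices are distinct, as wanted for a mini-batch
print(np.random.choice(10, 8, replace=False))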

Originally the gradient should be computed by numerical differentiation with TwoLayerNet.numerical_gradient, but it is slow: in my environment, ~~it looked like 10,000 updates would not finish even after a full day~~ it managed only about 600 updates in half a day, so 10,000 updates would take roughly 8 days. Therefore, following the advice in the book, I used TwoLayerNet.gradient, which implements the backpropagation method learned in Chapter 5.

Finally, the transition of the loss function value and the transition of the recognition accuracy on the training and test data are plotted as graphs.

Below are the execution results.

[Number of updates]   0 [Loss function value]2.2882 [Training data recognition accuracy]0.1044 [Test data recognition accuracy]0.1028
[Number of updates] 600 [Loss function value]0.8353 [Training data recognition accuracy]0.7753 [Test data recognition accuracy]0.7818
[Number of updates]1200 [Loss function value]0.4573 [Training data recognition accuracy]0.8744 [Test data recognition accuracy]0.8778
[Number of updates]1800 [Loss function value]0.4273 [Training data recognition accuracy]0.8972 [Test data recognition accuracy]0.9010
[Number of updates]2400 [Loss function value]0.3654 [Training data recognition accuracy]0.9076 [Test data recognition accuracy]0.9098
[Number of updates]3000 [Loss function value]0.2816 [Training data recognition accuracy]0.9142 [Test data recognition accuracy]0.9146
[Number of updates]3600 [Loss function value]0.3238 [Training data recognition accuracy]0.9195 [Test data recognition accuracy]0.9218
[Number of updates]4200 [Loss function value]0.2017 [Training data recognition accuracy]0.9231 [Test data recognition accuracy]0.9253
[Number of updates]4800 [Loss function value]0.1910 [Training data recognition accuracy]0.9266 [Test data recognition accuracy]0.9289
[Number of updates]5400 [Loss function value]0.1528 [Training data recognition accuracy]0.9306 [Test data recognition accuracy]0.9320
[Number of updates]6000 [Loss function value]0.1827 [Training data recognition accuracy]0.9338 [Test data recognition accuracy]0.9347
[Number of updates]6600 [Loss function value]0.1208 [Training data recognition accuracy]0.9362 [Test data recognition accuracy]0.9375
[Number of updates]7200 [Loss function value]0.1665 [Training data recognition accuracy]0.9391 [Test data recognition accuracy]0.9377
[Number of updates]7800 [Loss function value]0.1787 [Training data recognition accuracy]0.9409 [Test data recognition accuracy]0.9413
[Number of updates]8400 [Loss function value]0.1564 [Training data recognition accuracy]0.9431 [Test data recognition accuracy]0.9429
[Number of updates]9000 [Loss function value]0.2361 [Training data recognition accuracy]0.9449 [Test data recognition accuracy]0.9437
[Number of updates]9600 [Loss function value]0.2183 [Training data recognition accuracy]0.9456 [Test data recognition accuracy]0.9448

(Figures: the transition of the loss function value, and the transition of recognition accuracy on the training and test data)

Looking at the results, the recognition accuracy is already around 94.5%, exceeding the accuracy of the pretrained parameters provided in Chapter 3.

4.6 Summary

Chapter 4 may be fine to simply read as a book, but it was quite hard to work through while implementing everything. (I would have liked an explanation of how to make the softmax function and the numerical differentiation function handle multidimensional arrays...)

That's all for this chapter. If I have made any mistakes, I would be grateful if you could point them out. (To other chapters of this memo: Chapter 1 / Chapter 2 / Chapter 3 / Chapter 4 / Chapter 5 / [Chapter 6](https://qiita.com/segavvy/items/ca4ac4c9ee1a126bff41) / Chapter 7 / Chapter 8 / Summary)
