While reading "Deep Learning from scratch" (written by Yasuki Saito, published by O'Reilly Japan), I will make a note of the sites I referred to. Part 8 ←
In Chapter 5, we implement the components of the neural network as layers, and in Chapter 6, we use them. The individual layers are explained in Chapter 5, but the implemented MultiLayerNet class is not explained in the book. The source is in the folder common multi_layer_net.py.
So let's take a look at the contents of the MultiLayerNet class.
# coding: utf-8
import os
import sys
sys.path.append(os.pardir)  # Settings for importing files in the parent directory
import numpy as np
import matplotlib.pyplot as plt
from dataset.mnist import load_mnist
from common.util import smooth_curve
from common.multi_layer_net import MultiLayerNet
from common.optimizer import *

# 0: Read the MNIST data ==========
(x_train, t_train), (x_test, t_test) = load_mnist(normalize=True)

train_size = x_train.shape[0]
batch_size = 128
max_iterations = 2000

# 1: Experiment settings ==========
optimizers = {}
optimizers['SGD'] = SGD()
optimizers['Momentum'] = Momentum()
optimizers['AdaGrad'] = AdaGrad()
optimizers['Adam'] = Adam()
#optimizers['RMSprop'] = RMSprop()

networks = {}
train_loss = {}
for key in optimizers.keys():
    networks[key] = MultiLayerNet(
        input_size=784, hidden_size_list=[100, 100, 100, 100],
        output_size=10)
    train_loss[key] = []

# 2: Start of training ==========
for i in range(max_iterations):
    batch_mask = np.random.choice(train_size, batch_size)
    x_batch = x_train[batch_mask]
    t_batch = t_train[batch_mask]

    for key in optimizers.keys():
        grads = networks[key].gradient(x_batch, t_batch)
        optimizers[key].update(networks[key].params, grads)

        loss = networks[key].loss(x_batch, t_batch)
        train_loss[key].append(loss)
Because it uses the MNIST dataset:
input_size=784: the input is image data with 784 columns (originally 28 x 28).
output_size=10: the output is the predicted probability for each of the digits 0 to 9.
hidden_size_list=[100, 100, 100, 100]: there are 4 hidden layers with 100 neurons each.
And batch_size=128: the mini-batch size (the number of rows of input data) is 128.
Since these settings create a fully connected multi-layer neural network, arrays like the following are created. The output below is what you see when you refer to the arrays and objects held in networks.
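As a quick check of what these arguments produce, the constructor can be called on its own and the shapes of the parameter arrays printed. This is just a minimal sketch, assuming the weights and biases are stored in the params dictionary under keys like 'W1', 'b1', ..., which is how I understand the book's common code to name them.

# Quick shape check (sketch); assumes params keys 'W1', 'b1', ..., 'W5', 'b5'
from common.multi_layer_net import MultiLayerNet

net = MultiLayerNet(input_size=784, hidden_size_list=[100, 100, 100, 100], output_size=10)
for key, value in net.params.items():
    print(key, value.shape)
# Expected shapes: W1 (784, 100), b1 (100,), W2-W4 (100, 100), b2-b4 (100,),
# W5 (100, 10), b5 (10,)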
networks
{'SGD': common.multi_layer_net.MultiLayerNet at 0x8800d50, 'Momentum': common.multi_layer_net.MultiLayerNet at 0x8800a50, 'AdaGrad': common.multi_layer_net.MultiLayerNet at 0x8800710, 'Adam': common.multi_layer_net.MultiLayerNet at 0x88003d0}
networks['SGD']
common.multi_layer_net.MultiLayerNet at 0x8800d50
networks['SGD'].layers
OrderedDict([('Affine1', common.layers.Affine at 0x8800c30), ('Activation_function1', common.layers.Relu at 0x8800c70), ('Affine2', common.layers.Affine at 0x8800bf0), ('Activation_function2', common.layers.Relu at 0x8800bd0), ('Affine3', common.layers.Affine at 0x8800b90), ('Activation_function3', common.layers.Relu at 0x8800b70), ('Affine4', common.layers.Affine at 0x8800ab0), ('Activation_function4', common.layers.Relu at 0x8800b30), ('Affine5', common.layers.Affine at 0x8800af0)])
networks['SGD'].layers['Affine1'].x
array([[0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], ..., [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
networks['SGD'].layers['Affine1'].x.shape
(128, 784)
networks['SGD'].layers['Affine1'].W
array([[-0.04430735, -0.00916858, -0.05385046, ..., -0.01356, 0.0366878, -0.04629992], [-0.0974915 , 0.01896 , 0.0016755 , ..., 0.00820512, -0.01012246, -0.04869024], ..., [ 0.03065034, -0.02653425, -0.00433941, ..., -0.06933382, 0.03986452, 0.06821553], [ 0.01673732, 0.04850334, -0.0291053 , ..., -0.05045292, 0.00599257, 0.08265754]])
networks['SGD'].layers['Affine1'].W.shape
(784, 100)
networks['SGD'].layers['Affine1'].b
array([ 0.01646891, 0.01467293, 0.02892796, 0.02414651, 0.02259769, -0.00919552, -0.01567924, 0.0039934 , 0.00693527, 0.03932801, ..., -0.00536202, 0.00508444, 0.00204647, 0.01040528, 0.00355356, -0.00960685, 0.06204312, 0.02886584, 0.06678846, 0.0186539 ])
networks['SGD'].layers['Affine1'].b.shape
(100,)
networks['SGD'].layers['Activation_function1']
common.layers.Relu at 0x8800c70
networks['SGD'].layers['Affine2'].x.shape
(128, 100)
networks['SGD'].layers['Affine2'].W.shape
(100, 100)
networks['SGD'].layers['Affine2'].b.shape
(100,)
networks['SGD'].layers['Activation_function2']
common.layers.Relu at 0x8800bd0
networks['SGD'].layers['Affine3'].x.shape
(128, 100)
networks['SGD'].layers['Affine3'].W.shape
(100, 100)
networks['SGD'].layers['Affine3'].b.shape
(100,)
networks['SGD'].layers['Activation_function3']
common.layers.Relu at 0x8800b70
networks['SGD'].layers['Affine4'].x.shape
(128, 100)
networks['SGD'].layers['Affine4'].W.shape
(100, 100)
networks['SGD'].layers['Affine4'].b.shape
(100,)
networks['SGD'].layers['Activation_function4']
common.layers.Relu at 0x8800b30
Four hidden layers were specified, but after them a fifth Affine layer is created as the output layer.
networks['SGD'].layers['Affine5'].x.shape
(128, 100)
networks['SGD'].layers['Affine5'].W.shape
(100, 10)
networks['SGD'].layers['Affine5'].b.shape
(10,)
The activation function of the output layer is held in last_layer; here the SoftmaxWithLoss layer (softmax plus cross-entropy loss) is used. A sketch of how all of these layers are assembled follows the inspection below.
networks['SGD'].last_layer
common.layers.SoftmaxWithLoss at 0x8800770
networks['SGD'].last_layer.y
array([[2.08438091e-05, 2.66555051e-09, 1.29436456e-03, ..., 1.83391350e-07, 9.98317338e-01, 6.77137764e-05], [5.68871828e-04, 1.59787427e-06, 3.60265866e-03, ..., 7.25385216e-05, 1.80220134e-03, 4.95014520e-02], ..., [3.01731618e-03, 5.57601184e-03, 1.40908372e-02, ..., 8.49627989e-02, 5.44208078e-03, 2.32114245e-01], [9.82201047e-07, 3.01213101e-07, 1.05657504e-03, ..., 1.03584551e-05, 9.92242677e-01, 5.06642654e-03]])
networks['SGD'].last_layer.y.shape
(128, 10)
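Putting the inspection together: the hidden layers are alternating Affine and ReLU layers held in an OrderedDict, the fifth Affine layer is the output layer, and SoftmaxWithLoss is kept separately in last_layer. The following is a simplified sketch of that assembly, not the actual implementation (the real common/multi_layer_net.py also handles the weight initialization scale, weight decay, and a selectable activation function).

# Simplified sketch of how MultiLayerNet assembles its layers
# (the real common/multi_layer_net.py also handles the weight init scale,
#  weight decay and a selectable activation function)
from collections import OrderedDict
import numpy as np
from common.layers import Affine, Relu, SoftmaxWithLoss

input_size, hidden_size_list, output_size = 784, [100, 100, 100, 100], 10
all_size_list = [input_size] + hidden_size_list + [output_size]

params, layers = {}, OrderedDict()
for idx in range(1, len(all_size_list)):
    params['W' + str(idx)] = 0.01 * np.random.randn(all_size_list[idx - 1], all_size_list[idx])
    params['b' + str(idx)] = np.zeros(all_size_list[idx])
    layers['Affine' + str(idx)] = Affine(params['W' + str(idx)], params['b' + str(idx)])
    if idx != len(all_size_list) - 1:          # no ReLU after the last Affine layer
        layers['Activation_function' + str(idx)] = Relu()
last_layer = SoftmaxWithLoss()                 # loss layer used during training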
P177: In this experiment, we targeted a 5-layer neural network with 100 neurons in each layer, and ReLU was used as the activation function. Looking at the results in Figure 6-9, we can see that the other methods learn faster than SGD, and the remaining three methods appear to learn at about the same pace. If you look closely, AdaGrad's learning seems to be slightly faster. One thing to note about this experiment is that the results change depending on the hyperparameters (such as the learning rate) and on the structure of the neural network (how many layers deep it is). Generally, however, the three methods learn faster than SGD and sometimes achieve better final recognition performance.
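The difference in learning speed comes from how each optimizer's update() uses the gradients. As a rough illustration, here are simplified sketches of two of the update rules; the actual classes live in common/optimizer.py, and the class names below are my own.

# Rough sketch of two of the update rules being compared
# (simplified versions; the real classes are in common/optimizer.py)
import numpy as np

class SGDSketch:                      # plain gradient descent
    def __init__(self, lr=0.01):
        self.lr = lr
    def update(self, params, grads):
        for key in params.keys():
            params[key] -= self.lr * grads[key]

class AdaGradSketch:                  # per-parameter step size that shrinks over time
    def __init__(self, lr=0.01):
        self.lr = lr
        self.h = None
    def update(self, params, grads):
        if self.h is None:
            self.h = {key: np.zeros_like(val) for key, val in params.items()}
        for key in params.keys():
            self.h[key] += grads[key] * grads[key]
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)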
So, I checked the difference in recognition performance with test data.
# Evaluate with the test data
x = x_test
t = t_test

for key in optimizers.keys():
    network = networks[key]
    y = network.predict(x)
    accuracy_cnt = 0
    for i in range(len(x)):
        p = np.argmax(y[i])
        if p == t[i]:
            accuracy_cnt += 1
    print(key + " Accuracy:" + str(float(accuracy_cnt) / len(x)))
SGD Accuracy:0.934
Momentum Accuracy:0.9676
AdaGrad Accuracy:0.97
Adam Accuracy:0.9701
Certainly, the recognition rate of SGD seems to be low.
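As an aside, if the MultiLayerNet class provides an accuracy() method (I believe the version in the common folder does, but treat the method name and signature as an assumption), the evaluation loop above can be shortened:

# Shorter evaluation, assuming MultiLayerNet has an accuracy(x, t) method
for key in optimizers.keys():
    print(key + " Accuracy:" + str(networks[key].accuracy(x_test, t_test)))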
max_iterations = 5000
SGD Accuracy:0.9557
Momentum Accuracy:0.9754
AdaGrad Accuracy:0.9755
Adam Accuracy:0.9752
As the number of iterations increased, the recognition rate improved for all four. SGD also improved, but it is still lower than what the other methods reached at 2,000 iterations.
Next, with only two hidden layers:

networks[key] = MultiLayerNet(
    input_size=784, hidden_size_list=[100, 100],
    output_size=10)
SGD Accuracy:0.922
Momentum Accuracy:0.9633
AdaGrad Accuracy:0.9682
Adam Accuracy:0.9701
The recognition rate of all four is about 1% worse.
Then with eight hidden layers:

networks[key] = MultiLayerNet(
    input_size=784, hidden_size_list=[100, 100, 100, 100, 100, 100, 100, 100],
    output_size=10)
SGD Accuracy:0.9479
Momentum Accuracy:0.9656
AdaGrad Accuracy:0.9692
Adam Accuracy:0.9701
The recognition rate of SGD increased, but the others stayed the same or got slightly worse. Perhaps it is better to think of adding layers as speeding up learning rather than raising the final recognition rate.
With four hidden layers of 50 neurons each:

networks[key] = MultiLayerNet(
    input_size=784, hidden_size_list=[50, 50, 50, 50],
    output_size=10)
SGD Accuracy:0.9275
Momentum Accuracy:0.9636
AdaGrad Accuracy:0.962
Adam Accuracy:0.9687
And with four hidden layers of 200 neurons each:

networks[key] = MultiLayerNet(
    input_size=784, hidden_size_list=[200, 200, 200, 200],
    output_size=10)
SGD Accuracy:0.9372
Momentum Accuracy:0.9724
AdaGrad Accuracy:0.9775
Adam Accuracy:0.9753
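In these comparisons only the MultiLayerNet construction line changes, so the whole run can be wrapped in a small helper function. The following is just a sketch (run_experiment is a name I made up); it reuses the data, the optimizer classes, and MultiLayerNet loaded above.

# Hypothetical helper: rerun the same experiment for a given layer configuration
# (reuses x_train, t_train, train_size, x_test, t_test and the imports above)
def run_experiment(hidden_size_list, max_iterations=2000, batch_size=128):
    optimizers = {'SGD': SGD(), 'Momentum': Momentum(),
                  'AdaGrad': AdaGrad(), 'Adam': Adam()}
    networks = {key: MultiLayerNet(input_size=784,
                                   hidden_size_list=hidden_size_list,
                                   output_size=10)
                for key in optimizers}
    for i in range(max_iterations):
        batch_mask = np.random.choice(train_size, batch_size)
        x_batch, t_batch = x_train[batch_mask], t_train[batch_mask]
        for key in optimizers.keys():
            grads = networks[key].gradient(x_batch, t_batch)
            optimizers[key].update(networks[key].params, grads)
    for key in optimizers.keys():
        y = networks[key].predict(x_test)
        acc = np.mean(np.argmax(y, axis=1) == t_test)
        print(key, hidden_size_list, "Accuracy:", acc)

# e.g. run_experiment([50, 50, 50, 50]) or run_experiment([200, 200, 200, 200])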