While reading "Deep Learning from scratch" (written by Yasuki Saito, published by O'Reilly Japan), I will make a note of the sites I referred to. Part 8 ←
In Chapter 5, we implement the components of the neural network as layers, and in Chapter 6, we use them. The individual layers are explained in Chapter 5, but the implemented MultiLayerNet class is not explained in the book. The source is in the folder common multi_layer_net.py.
So let's take a look at the contents of the MultiLayerNet class.
# coding: utf-8
import os
import sys
sys.path.append(os.pardir)  # Settings for importing files in the parent directory
import numpy as np
import matplotlib.pyplot as plt
from dataset.mnist import load_mnist
from common.util import smooth_curve
from common.multi_layer_net import MultiLayerNet
from common.optimizer import *

# 0: Read the MNIST data ==========
(x_train, t_train), (x_test, t_test) = load_mnist(normalize=True)

train_size = x_train.shape[0]
batch_size = 128
max_iterations = 2000

# 1: Experiment settings ==========
optimizers = {}
optimizers['SGD'] = SGD()
optimizers['Momentum'] = Momentum()
optimizers['AdaGrad'] = AdaGrad()
optimizers['Adam'] = Adam()
#optimizers['RMSprop'] = RMSprop()

networks = {}
train_loss = {}
for key in optimizers.keys():
    networks[key] = MultiLayerNet(
        input_size=784, hidden_size_list=[100, 100, 100, 100],
        output_size=10)
    train_loss[key] = []

# 2: Start of training ==========
for i in range(max_iterations):
    batch_mask = np.random.choice(train_size, batch_size)
    x_batch = x_train[batch_mask]
    t_batch = t_train[batch_mask]

    for key in optimizers.keys():
        grads = networks[key].gradient(x_batch, t_batch)
        optimizers[key].update(networks[key].params, grads)

        loss = networks[key].loss(x_batch, t_batch)
        train_loss[key].append(loss)
Because it uses the MNIST dataset:
input_size=784: the input is image data with 784 columns (originally 28 x 28).
output_size=10: the output is the predicted probability for each of the digits 0 to 9.
hidden_size_list=[100, 100, 100, 100]: there are 4 hidden layers with 100 neurons each.
And batch_size=128: the mini-batch size (the number of rows of input data) is 128.
Since these settings create a fully connected multi-layer neural network, arrays like the following are created. The output below is what you see when you refer to the arrays and objects held in networks.
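As a quick check of what these arguments produce, the constructor can be called on its own and the shapes of the parameter arrays printed. This is just a minimal sketch, assuming the weights and biases are stored in the params dictionary under keys like 'W1', 'b1', ..., which is how I understand the book's common code to name them.

# Quick shape check (sketch); assumes params keys 'W1', 'b1', ..., 'W5', 'b5'
from common.multi_layer_net import MultiLayerNet

net = MultiLayerNet(input_size=784, hidden_size_list=[100, 100, 100, 100], output_size=10)
for key, value in net.params.items():
    print(key, value.shape)
# Expected shapes: W1 (784, 100), b1 (100,), W2-W4 (100, 100), b2-b4 (100,),
# W5 (100, 10), b5 (10,)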
networks
{'SGD': common.multi_layer_net.MultiLayerNet at 0x8800d50, 'Momentum': common.multi_layer_net.MultiLayerNet at 0x8800a50, 'AdaGrad': common.multi_layer_net.MultiLayerNet at 0x8800710, 'Adam': common.multi_layer_net.MultiLayerNet at 0x88003d0}
networks['SGD']
common.multi_layer_net.MultiLayerNet at 0x8800d50
networks['SGD'].layers
OrderedDict([('Affine1', common.layers.Affine at 0x8800c30), ('Activation_function1', common.layers.Relu at 0x8800c70), ('Affine2', common.layers.Affine at 0x8800bf0), ('Activation_function2', common.layers.Relu at 0x8800bd0), ('Affine3', common.layers.Affine at 0x8800b90), ('Activation_function3', common.layers.Relu at 0x8800b70), ('Affine4', common.layers.Affine at 0x8800ab0), ('Activation_function4', common.layers.Relu at 0x8800b30), ('Affine5', common.layers.Affine at 0x8800af0)])
networks['SGD'].layers['Affine1'].x
array([[0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], ..., [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
networks['SGD'].layers['Affine1'].x.shape
(128, 784)
networks['SGD'].layers['Affine1'].W
array([[-0.04430735, -0.00916858, -0.05385046, ..., -0.01356, 0.0366878, -0.04629992], [-0.0974915 , 0.01896 , 0.0016755 , ..., 0.00820512, -0.01012246, -0.04869024], ..., [ 0.03065034, -0.02653425, -0.00433941, ..., -0.06933382, 0.03986452, 0.06821553], [ 0.01673732, 0.04850334, -0.0291053 , ..., -0.05045292, 0.00599257, 0.08265754]])
networks['SGD'].layers['Affine1'].W.shape
(784, 100)
networks['SGD'].layers['Affine1'].b
array([ 0.01646891, 0.01467293, 0.02892796, 0.02414651, 0.02259769, -0.00919552, -0.01567924, 0.0039934 , 0.00693527, 0.03932801, ..., -0.00536202, 0.00508444, 0.00204647, 0.01040528, 0.00355356, -0.00960685, 0.06204312, 0.02886584, 0.06678846, 0.0186539 ])
networks['SGD'].layers['Affine1'].b.shape
(100,)
networks['SGD'].layers['Activation_function1']
common.layers.Relu at 0x8800c70
networks['SGD'].layers['Affine2'].x.shape
(128, 100)
networks['SGD'].layers['Affine2'].W.shape
(100, 100)
networks['SGD'].layers['Affine2'].b.shape
(100,)
networks['SGD'].layers['Activation_function2']
common.layers.Relu at 0x8800bd0
networks['SGD'].layers['Affine3'].x.shape
(128, 100)
networks['SGD'].layers['Affine3'].W.shape
(100, 100)
networks['SGD'].layers['Affine3'].b.shape
(100,)
networks['SGD'].layers['Activation_function3']
common.layers.Relu at 0x8800b70
networks['SGD'].layers['Affine4'].x.shape
(128, 100)
networks['SGD'].layers['Affine4'].W.shape
(100, 100)
networks['SGD'].layers['Affine4'].b.shape
(100,)
networks['SGD'].layers['Activation_function4']
common.layers.Relu at 0x8800b30
Four hidden layers were specified, but after them a fifth Affine layer is created as the output layer.
networks['SGD'].layers['Affine5'].x.shape
(128, 100)
networks['SGD'].layers['Affine5'].W.shape
(100, 10)
networks['SGD'].layers['Affine5'].b.shape
(10,)
The activation function of the output layer is held in last_layer; here the SoftmaxWithLoss layer (softmax plus cross-entropy loss) is used. A sketch of how all of these layers are assembled follows the inspection below.
networks['SGD'].last_layer
common.layers.SoftmaxWithLoss at 0x8800770
networks['SGD'].last_layer.y
array([[2.08438091e-05, 2.66555051e-09, 1.29436456e-03, ..., 1.83391350e-07, 9.98317338e-01, 6.77137764e-05], [5.68871828e-04, 1.59787427e-06, 3.60265866e-03, ..., 7.25385216e-05, 1.80220134e-03, 4.95014520e-02], ..., [3.01731618e-03, 5.57601184e-03, 1.40908372e-02, ..., 8.49627989e-02, 5.44208078e-03, 2.32114245e-01], [9.82201047e-07, 3.01213101e-07, 1.05657504e-03, ..., 1.03584551e-05, 9.92242677e-01, 5.06642654e-03]])
networks['SGD'].last_layer.y.shape
(128, 10)
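Putting the inspection together: the hidden layers are alternating Affine and ReLU layers held in an OrderedDict, the fifth Affine layer is the output layer, and SoftmaxWithLoss is kept separately in last_layer. The following is a simplified sketch of that assembly, not the actual implementation (the real common/multi_layer_net.py also handles the weight initialization scale, weight decay, and a selectable activation function).

# Simplified sketch of how MultiLayerNet assembles its layers
# (the real common/multi_layer_net.py also handles the weight init scale,
#  weight decay and a selectable activation function)
from collections import OrderedDict
import numpy as np
from common.layers import Affine, Relu, SoftmaxWithLoss

input_size, hidden_size_list, output_size = 784, [100, 100, 100, 100], 10
all_size_list = [input_size] + hidden_size_list + [output_size]

params, layers = {}, OrderedDict()
for idx in range(1, len(all_size_list)):
    params['W' + str(idx)] = 0.01 * np.random.randn(all_size_list[idx - 1], all_size_list[idx])
    params['b' + str(idx)] = np.zeros(all_size_list[idx])
    layers['Affine' + str(idx)] = Affine(params['W' + str(idx)], params['b' + str(idx)])
    if idx != len(all_size_list) - 1:          # no ReLU after the last Affine layer
        layers['Activation_function' + str(idx)] = Relu()
last_layer = SoftmaxWithLoss()                 # loss layer used during training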
P177: In this experiment, we targeted a 5-layer neural network with 100 neurons in each layer, and ReLU was used as the activation function. Looking at the results in Figure 6-9, we can see that the other methods learn faster than SGD, and the remaining three methods appear to learn at about the same pace. If you look closely, AdaGrad's learning seems to be slightly faster. One thing to note about this experiment is that the results change depending on the hyperparameters (such as the learning rate) and on the structure of the neural network (how many layers deep it is). Generally, however, the three methods learn faster than SGD and sometimes achieve better final recognition performance.
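The difference in learning speed comes from how each optimizer's update() uses the gradients. As a rough illustration, here are simplified sketches of two of the update rules; the actual classes live in common/optimizer.py, and the class names below are my own.

# Rough sketch of two of the update rules being compared
# (simplified versions; the real classes are in common/optimizer.py)
import numpy as np

class SGDSketch:                      # plain gradient descent
    def __init__(self, lr=0.01):
        self.lr = lr
    def update(self, params, grads):
        for key in params.keys():
            params[key] -= self.lr * grads[key]

class AdaGradSketch:                  # per-parameter step size that shrinks over time
    def __init__(self, lr=0.01):
        self.lr = lr
        self.h = None
    def update(self, params, grads):
        if self.h is None:
            self.h = {key: np.zeros_like(val) for key, val in params.items()}
        for key in params.keys():
            self.h[key] += grads[key] * grads[key]
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)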
So, I checked the difference in recognition performance with test data.
# Evaluate with the test data
x = x_test
t = t_test

for key in optimizers.keys():
    network = networks[key]
    y = network.predict(x)
    accuracy_cnt = 0
    for i in range(len(x)):
        p = np.argmax(y[i])
        if p == t[i]:
            accuracy_cnt += 1
    print(key + " Accuracy:" + str(float(accuracy_cnt) / len(x)))
SGD Accuracy:0.934
Momentum Accuracy:0.9676
AdaGrad Accuracy:0.97
Adam Accuracy:0.9701
Certainly, the recognition rate of SGD seems to be low.
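As an aside, if the MultiLayerNet class provides an accuracy() method (I believe the version in the common folder does, but treat the method name and signature as an assumption), the evaluation loop above can be shortened:

# Shorter evaluation, assuming MultiLayerNet has an accuracy(x, t) method
for key in optimizers.keys():
    print(key + " Accuracy:" + str(networks[key].accuracy(x_test, t_test)))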
max_iterations = 5000
SGD Accuracy:0.9557
Momentum Accuracy:0.9754
AdaGrad Accuracy:0.9755
Adam Accuracy:0.9752
As the number of iterations increased, the recognition rate improved for all four. SGD also improved, but it is still lower than what the other methods reached at 2,000 iterations.
Next, with only two hidden layers:

networks[key] = MultiLayerNet(
    input_size=784, hidden_size_list=[100, 100],
    output_size=10)
SGD Accuracy:0.922
Momentum Accuracy:0.9633
AdaGrad Accuracy:0.9682
Adam Accuracy:0.9701
The recognition rate of all four is about 1% worse.
Then with eight hidden layers:

networks[key] = MultiLayerNet(
    input_size=784, hidden_size_list=[100, 100, 100, 100, 100, 100, 100, 100],
    output_size=10)
SGD Accuracy:0.9479
Momentum Accuracy:0.9656
AdaGrad Accuracy:0.9692
Adam Accuracy:0.9701
The recognition rate of SGD increased, but the others stayed the same or got slightly worse. Perhaps it is better to think of adding layers as speeding up learning rather than raising the final recognition rate.
With four hidden layers of 50 neurons each:

networks[key] = MultiLayerNet(
    input_size=784, hidden_size_list=[50, 50, 50, 50],
    output_size=10)
SGD Accuracy:0.9275
Momentum Accuracy:0.9636
AdaGrad Accuracy:0.962
Adam Accuracy:0.9687
And with four hidden layers of 200 neurons each:

networks[key] = MultiLayerNet(
    input_size=784, hidden_size_list=[200, 200, 200, 200],
    output_size=10)
SGD Accuracy:0.9372
Momentum Accuracy:0.9724
AdaGrad Accuracy:0.9775
Adam Accuracy:0.9753
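In these comparisons only the MultiLayerNet construction line changes, so the whole run can be wrapped in a small helper function. The following is just a sketch (run_experiment is a name I made up); it reuses the data, the optimizer classes, and MultiLayerNet loaded above.

# Hypothetical helper: rerun the same experiment for a given layer configuration
# (reuses x_train, t_train, train_size, x_test, t_test and the imports above)
def run_experiment(hidden_size_list, max_iterations=2000, batch_size=128):
    optimizers = {'SGD': SGD(), 'Momentum': Momentum(),
                  'AdaGrad': AdaGrad(), 'Adam': Adam()}
    networks = {key: MultiLayerNet(input_size=784,
                                   hidden_size_list=hidden_size_list,
                                   output_size=10)
                for key in optimizers}
    for i in range(max_iterations):
        batch_mask = np.random.choice(train_size, batch_size)
        x_batch, t_batch = x_train[batch_mask], t_train[batch_mask]
        for key in optimizers.keys():
            grads = networks[key].gradient(x_batch, t_batch)
            optimizers[key].update(networks[key].params, grads)
    for key in optimizers.keys():
        y = networks[key].predict(x_test)
        acc = np.mean(np.argmax(y, axis=1) == t_test)
        print(key, hidden_size_list, "Accuracy:", acc)

# e.g. run_experiment([50, 50, 50, 50]) or run_experiment([200, 200, 200, 200])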