Verification of Batch Normalization with a multi-layer neural network

Introduction

Hello. The other day I read the article

- Japanese translation of "10 Deep Learning Trends at NIPS 2015" | Memorandum

whose takeaway was, roughly, "if you aren't using Batch Normalization, you're missing out."

So I implemented Batch Normalization in Theano and verified it for myself, partly referring to the article above.

Batch Normalization

Algorithm

Each mini-batch is normalized so that its mean is 0 and its variance is 1. Let $B$ be the set of inputs in a mini-batch and $m$ be the batch size.

B = \{x_{1...m}\}\\

In what follows, $\epsilon$ is a small constant added for numerical stability.

\epsilon = 10^{-5}\\
\mu_{B} \leftarrow \frac{1}{m} \sum_{i=1}^{m} x_i\\
\sigma^2_{B} \leftarrow \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{B})^2\\
\hat{x_i} \leftarrow \frac{x_i - \mu_{B}}{\sqrt{\sigma^2_{B} + \epsilon}}\\
y_i \leftarrow \gamma \hat{x_i} + \beta

In the formulas above, $\gamma$ and $\beta$ are parameters that scale and shift the normalized values, respectively. Both are learned by backpropagation, but the derivation of their gradients is omitted here.
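To make the formulas concrete, here is a minimal NumPy sketch of the forward pass (the function name and the fixed $\gamma = 1$, $\beta = 0$ are just for illustration; learning the parameters is omitted):

import numpy as np

def batch_norm_forward(x, gamma=1.0, beta=0.0, eps=1e-5):
    """x: mini-batch of shape (m, n_features)."""
    mu = x.mean(axis=0)                    # mu_B
    var = x.var(axis=0)                    # sigma^2_B
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta            # scale and shift

x = np.random.randn(32, 784).astype("float32")
y = batch_norm_forward(x)
print(y.mean(axis=0)[:3], y.std(axis=0)[:3])  # roughly 0 and 1 per feature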

For Fully-Connected Layer

In a normal Fully-Connected Layer, the mean and variance are computed per input dimension. In other words, if the input shape is (BatchSize, 784), 784 means and variances are computed.
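As a quick check, a NumPy sketch of the fully-connected case (the array name is mine):

import numpy as np

fc_input = np.random.randn(128, 784)  # (BatchSize, 784)
print(fc_input.mean(axis=0).shape)    # (784,): one mean per input dimension
print(fc_input.var(axis=0).shape)     # (784,): one variance per input dimension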

For Convolutional Layer

In a Convolutional Layer, on the other hand, the mean and variance are computed per channel. In other words, if the input shape is (BatchSize, 64 (number of channels), 32, 32), 64 means and variances are computed.
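And the convolutional case, where the statistics are reduced over the batch and spatial axes (again just a NumPy illustration):

import numpy as np

conv_input = np.random.randn(128, 64, 32, 32)  # (BatchSize, channels, height, width)
print(conv_input.mean(axis=(0, 2, 3)).shape)   # (64,): one mean per channel
print(conv_input.var(axis=(0, 2, 3)).shape)    # (64,): one variance per channel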

Merits

The main merit claimed for Batch Normalization is that a larger learning rate can be used, so training can be accelerated.

Implementation in Theano

import numpy as np
import theano
import theano.tensor as T


class BatchNormalizationLayer(object):
    def __init__(self, input, shape=None):
        self.shape = shape
        if len(shape) == 2:  # for fully connected: shape = (BatchSize, n_units)
            # one gamma/beta per unit
            gamma = theano.shared(value=np.ones(shape[1], dtype=theano.config.floatX), name="gamma", borrow=True)
            beta = theano.shared(value=np.zeros(shape[1], dtype=theano.config.floatX), name="beta", borrow=True)
            # mean/variance per unit, taken over the batch axis
            mean = input.mean((0,), keepdims=True)
            var = input.var((0,), keepdims=True)
        elif len(shape) == 4:  # for cnn: shape = (BatchSize, channels, height, width)
            gamma = theano.shared(value=np.ones(shape[1:], dtype=theano.config.floatX), name="gamma", borrow=True)
            beta = theano.shared(value=np.zeros(shape[1:], dtype=theano.config.floatX), name="beta", borrow=True)
            # mean/variance per channel, taken over the batch and spatial axes
            mean = input.mean((0, 2, 3), keepdims=True)
            var = input.var((0, 2, 3), keepdims=True)
            mean = self.change_shape(mean)
            var = self.change_shape(var)

        self.params = [gamma, beta]
        self.output = gamma * (input - mean) / T.sqrt(var + 1e-5) + beta

    def change_shape(self, vec):
        # repeat each per-channel statistic height*width times and reshape to (channels, height, width)
        ret = T.repeat(vec, self.shape[2] * self.shape[3])
        ret = ret.reshape(self.shape[1:])
        return ret

An example of how to use it (mostly pseudo code):

...
input = previous_layer.output  # symbolic variable: output of the previous layer, shape=(batchsize, 784)
h = BatchNormalizationLayer(input, shape=(batchsize, 784))
# when applying an activation
h.output = activation(h.output)  # activation = some activation function
...
params = ... + h.params + ...  # used when updating the network parameters

Experiment

Experimental settings

The experiment used a simple multi-layer neural network on MNIST.

- Number of hidden layers: 10
- Number of units in each hidden layer: 784
- Optimization method: plain SGD (learning rate: 0.01)
- Activation function: tanh
- Dropout ratio: 0.1 for the first hidden layer, 0.5 for all other layers except the input and output layers

The network structure is: Input layer → (Fully-Connected Layer → Batch Normalization Layer → Activation) × 10 → Output layer.
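Roughly, that structure corresponds to the following (pseudo code in the same style as the usage example above; HiddenLayer and LogisticRegression are placeholder names for an ordinary fully-connected layer and the output layer, and dropout is omitted):

layer_input = x  # symbolic input, shape=(batchsize, 784)
params = []
for i in range(10):
    fc = HiddenLayer(layer_input, n_in=784, n_out=784)               # Fully-Connected Layer
    bn = BatchNormalizationLayer(fc.output, shape=(batchsize, 784))  # Batch Normalization Layer
    layer_input = T.tanh(bn.output)                                  # Activation
    params += fc.params + bn.params
output_layer = LogisticRegression(layer_input, n_in=784, n_out=10)   # Output layer
params += output_layer.params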

Experimental results

- Error function value

<img src="https://qiita-image-store.s3.amazonaws.com/0/31899/2cb66cbb-0581-fe69-a043-7794a2103393.png" width=640>

- Classification accuracy

<img src="https://qiita-image-store.s3.amazonaws.com/0/31899/bc14eede-2acf-591e-80b3-632269b0d19d.png" width=640>

Finally

The experimental setup may have been somewhat contrived, but hopefully it shows that you are missing out if you don't use Batch Normalization.
