Try the book "Introduction to Natural Language Processing Application Development in 15 Steps" - Chapter 3, Step 10 Memo: "Details and Improvements of Neural Networks"

Contents

This is a personal memo written while reading "Introduction to Natural Language Processing Application Development in 15 Steps". This time, I note my own takeaways from Chapter 3, Step 10.

Preparation

- Personal Mac: macOS Mojave version 10.14.6
- docker version: 19.03.2 for both Client and Server

Chapter overview

In Step 08, we built up from a simple perceptron to a multi-layer perceptron, and in Step 09 we implemented a multi-class classifier. Step 10 aims to improve those neural networks.

- Difficulties in deepening the layers of a neural network
- Weight optimization methods
- Neural network tuning methods

10.1 Deep Neural Networks

A neural network with three or more layers has traditionally been called deep. To deepen a model, you simply add more layers with model.add, as in the sketch below.
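A minimal sketch of what that looks like; the unit counts and input_dim here are placeholder values chosen purely for illustration:

from keras.models import Sequential
from keras.layers import Dense

# Sketch: deepening the network simply means chaining more model.add calls.
# The unit counts (32, 10) and input_dim are placeholder values.
model = Sequential()
model.add(Dense(32, input_dim = 100, activation = 'relu'))
model.add(Dense(32, activation = 'relu'))  # an extra hidden layer
model.add(Dense(32, activation = 'relu'))  # and another
model.add(Dense(10, activation = 'softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer = 'adam')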

Difficulties in deepening the layers of a neural network

Problem: easy to overfit → Solution: early stopping
・ With many layers, a neural network's expressive power is high, so it easily overfits the training data.
・ Training is repeated epoch by epoch, but is cut short before accuracy on the validation data starts to fall.

Problem: easy to overfit → Solution: dropout
・ During training, some units are randomly ignored at a fixed rate; at prediction time all units are used.
・ Overfitting is harder because only a small number of units are effective in any one training pass.
・ Prediction behaves like an average of multiple neural networks, giving the same effect as ensemble learning.

Problem: learning does not proceed well → Solution: batch normalization
・ Internal covariate shift occurs: weight updates in the earlier layers interfere with weight updates in the later layers.
・ Normalize the data distribution to mean 0 and variance 1.
・ Add it as a new layer, or apply it before the activation function inside a layer.

Problem: increased computational cost → Solution: train the neural network on GPUs, which can run parallel computations at high speed.

EarlyStopping


from keras.callbacks import EarlyStopping

model.fit(X, y,
    epochs = 100,
    validation_split = 0.1,
    callbacks = [EarlyStopping(min_delta = 0.0, patience = 1)])

# epochs: set this large enough that training does not finish before EarlyStopping cuts it short
# validation_split: the fraction of the given training data to set aside as validation data
# callbacks: the callbacks listed here are invoked one by one during training

Dropout


from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(..))
model.add(Dropout(0.5))
model.add(Dense(..))
model.add(Dropout(0.5))
model.add(Dense(..))
model.add(Dropout(0.5))
model.add(Dense(.., activation = 'softmax'))
model.compile(..)

# The argument to the Dropout constructor is the fraction of units to ignore

BatchNormalization


from keras.models import Sequential
from keras.layers import Dense, Activation, BatchNormalization

model = Sequential()

# Added as a new layer
model.add(Dense(.., activation = 'relu'))
model.add(BatchNormalization())

# Applied before the activation function
model.add(Dense(..))
model.add(BatchNormalization())
model.add(Activation('relu'))

model.add(Dense(.., activation = 'softmax'))
model.compile(..)

10.2 Neural network learning

Gradient descent

The gradient is obtained by differentiating the error function with respect to the weights, and the weights are updated in the direction opposite to the gradient; repeating this advances the learning of the neural network (see the sketch below).
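A minimal sketch of this update rule in plain Python, using a made-up one-dimensional error function E(w) = (w - 3)^2 purely for illustration:

# Sketch of gradient descent on the toy error function E(w) = (w - 3)^2.
# The function, learning rate, and iteration count are illustrative choices.
def grad(w):
    return 2.0 * (w - 3.0)  # dE/dw

w = 0.0   # initial weight
lr = 0.1  # learning rate
for _ in range(100):
    w -= lr * grad(w)  # step in the direction opposite to the gradient
print(w)  # converges toward 3.0, the weight that minimizes E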

Global optimal solution and local optimal solution

- Global optimal solution: the optimal solution we actually want to find; among all possible weights, the one with the smallest error.
- Local optimal solution: optimal within its neighborhood, but better solutions exist elsewhere.

Stochastic gradient descent and mini-batch method

- Gradient descent (batch method)
  - Feeds all the data at once and updates the weights against the average over all data.
  - Learning is difficult, and it easily falls into a local optimum.
- Stochastic gradient descent (SGD)
  - Feeds one randomly selected training sample at a time and updates the weights for that sample.
  - Vulnerable to noise; learning becomes unstable when a weight update goes in the wrong direction.
- Mini-batch method
  - A middle ground between the batch method and SGD.
  - Feeds a small number of randomly selected training samples at a time and updates the weights against their average (see the sketch after this list).
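In Keras, these three variants differ only in the batch_size argument to model.fit. A sketch, assuming a compiled model and training data X, y:

# Sketch: the three update styles differ only in batch_size.
# X, y and the concrete sizes are placeholder assumptions.
model.fit(X, y, batch_size = len(X))  # batch method: all data at once
model.fit(X, y, batch_size = 1)       # SGD: one sample at a time
model.fit(X, y, batch_size = 32)      # mini-batch: a few samples at a time (Keras default is 32)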

10.3 Neural network tuning

Batch size
・ The batch size during training; the Keras default is 32.
・ Powers of 2 are common, but that is only a convention; what does make sense is to search small values densely and large values sparsely.

Optimizer
・ Adam, a later arrival, is often used, but depending on the problem a plain SGD can be the best choice.
・ Tune it together with the learning rate below.

Learning rate
・ The proportion by which the weights are updated in one step; the Keras default for Adam is 0.001.

Activation function
・ ReLU (Rectified Linear Unit) is widely used, but improved variants such as Leaky ReLU and SELU (Scaled Exponential Linear Unit) are also worth considering.
・ Leaky ReLU: for inputs of 0 or less, applies a linear function with a small slope.
・ ELU: for inputs of 0 or less, outputs the exponential function minus 1.

Regularization / weight decay
・ To avoid overfitting, constrain the weights so they do not grow too large by adding one of the following norms to the loss function.
・ L1 norm: the sum of the absolute values of the weight's elements.
・ L2 norm: the sum of the squares of the weight's elements.
・ L∞ norm: the maximum absolute value among the weight's elements.

Weight initialization
・ By default, Keras initializes weights with random numbers.
・ The distribution can also be specified explicitly.
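A sketch of setting the optimizer and learning rate explicitly when compiling; the lr values simply restate the Keras defaults, and the loss here is a placeholder choice:

from keras import optimizers

# Sketch: choosing the optimizer and learning rate explicitly.
model.compile(loss = 'categorical_crossentropy',
    optimizer = optimizers.Adam(lr = 0.001))  # 0.001 is the Keras default for Adam

# Depending on the problem, a plain SGD may work best:
model.compile(loss = 'categorical_crossentropy',
    optimizer = optimizers.SGD(lr = 0.01))    # 0.01 is the Keras default for SGD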

Activation function


from keras.models import Sequential
from keras.layers import Dense, LeakyReLU

model = Sequential()
model.add(Dense(.., activation = 'selu'))
model.add(Dense(.., activation = LeakyReLU(0.3)))
model.add(Dense(.., activation = 'softmax'))
model.compile(..)

Regularization


from keras import regularizers
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(.., activation = 'relu',
    kernel_regularizer = regularizers.l2(0.1)))
model.add(Dense(.., activation = 'softmax',
    kernel_regularizer = regularizers.l2(0.1)))
model.compile(..)

Weight initialization


from keras import initializers
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(.., activation = 'relu',
    kernel_initializer = initializers.glorot_normal()))
model.add(Dense(.., activation = 'softmax',
    kernel_initializer = initializers.glorot_normal()))
model.compile(..)

glorot_normal(): the Glorot normal distribution, also known as the Xavier normal distribution (the name I had heard before). Xavier initialization suits the sigmoid and tanh functions, but when ReLU is the activation function, **He initialization, which is specialized for ReLU**, is said to work better.
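Keras also ships He initializers, so switching is a one-line change. A sketch, where the unit counts and input_dim are placeholder assumptions:

from keras import initializers
from keras.models import Sequential
from keras.layers import Dense

# Sketch: He initialization for the ReLU layer, Glorot for the softmax output.
# The unit counts (64, 10) and input_dim are placeholder values.
model = Sequential()
model.add(Dense(64, input_dim = 100, activation = 'relu',
    kernel_initializer = initializers.he_normal()))
model.add(Dense(10, activation = 'softmax',
    kernel_initializer = initializers.glorot_normal()))
model.compile(loss = 'categorical_crossentropy', optimizer = 'adam')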
