Try the book "Introduction to Natural Language Processing Application Development in 15 Steps" - Chapter 3, Step 10 Memo: "Details and Improvements of Neural Networks"

Contents

This is a personal memo written while reading "Introduction to Natural Language Processing Application Development in 15 Steps". This time, I note my own takeaways from Chapter 3, Step 10.

Preparation

- Personal Mac: macOS Mojave version 10.14.6
- docker version: 19.03.2 for both Client and Server

Chapter overview

In Step 08, we built up from a simple perceptron to a multi-layer perceptron, and in Step 09 we implemented a multi-class classifier. Step 10 aims to improve those neural networks.

- Difficulties in deepening the layers of a neural network
- Weight optimization methods
- Neural network tuning methods

10.1 Deep Neural Networks

A neural network with three or more layers has traditionally been called deep. To deepen a model, you simply add more layers with model.add, as in the sketch below.
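A minimal sketch of what that looks like; the unit counts and input_dim here are placeholder values chosen purely for illustration:

from keras.models import Sequential
from keras.layers import Dense

# Sketch: deepening the network simply means chaining more model.add calls.
# The unit counts (32, 10) and input_dim are placeholder values.
model = Sequential()
model.add(Dense(32, input_dim = 100, activation = 'relu'))
model.add(Dense(32, activation = 'relu'))  # an extra hidden layer
model.add(Dense(32, activation = 'relu'))  # and another
model.add(Dense(10, activation = 'softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer = 'adam')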

Difficulties in deepening the layers of a neural network

Problem: easy to overfit → Solution: early stopping
・ With many layers, a neural network's expressive power is high, so it easily overfits the training data.
・ Training is repeated epoch by epoch, but is cut short before accuracy on the validation data starts to fall.

Problem: easy to overfit → Solution: dropout
・ During training, some units are randomly ignored at a fixed rate; at prediction time all units are used.
・ Overfitting is harder because only a small number of units are effective in any one training pass.
・ Prediction behaves like an average of multiple neural networks, giving the same effect as ensemble learning.

Problem: learning does not proceed well → Solution: batch normalization
・ Internal covariate shift occurs: weight updates in the earlier layers interfere with weight updates in the later layers.
・ Normalize the data distribution to mean 0 and variance 1.
・ Add it as a new layer, or apply it before the activation function inside a layer.

Problem: increased computational cost → Solution: train the neural network on GPUs, which can run parallel computations at high speed.

EarlyStopping


from keras.callbacks import EarlyStopping

model.fit(X, y,
    epochs = 100,
    validation_split = 0.1,
    callbacks = [EarlyStopping(min_delta = 0.0, patience = 1)])

# epochs: set this large enough that training does not finish before EarlyStopping cuts it short
# validation_split: the fraction of the given training data to set aside as validation data
# callbacks: the callbacks listed here are invoked one by one during training

Dropout


from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(..))
model.add(Dropout(0.5))
model.add(Dense(..))
model.add(Dropout(0.5))
model.add(Dense(..))
model.add(Dropout(0.5))
model.add(Dense(.., activation = 'softmax'))
model.compile(..)

# The argument to the Dropout constructor is the fraction of units to ignore

BatchNormalization


from keras.models import Sequential
from keras.layers import Dense, Activation, BatchNormalization

model = Sequential()

# Added as a new layer
model.add(Dense(.., activation = 'relu'))
model.add(BatchNormalization())

# Applied before the activation function
model.add(Dense(..))
model.add(BatchNormalization())
model.add(Activation('relu'))

model.add(Dense(.., activation = 'softmax'))
model.compile(..)

10.2 Neural network learning

Gradient descent

The gradient is obtained by differentiating the error function with respect to the weights, and the weights are updated in the direction opposite to the gradient; repeating this advances the learning of the neural network (see the sketch below).
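A minimal sketch of this update rule in plain Python, using a made-up one-dimensional error function E(w) = (w - 3)^2 purely for illustration:

# Sketch of gradient descent on the toy error function E(w) = (w - 3)^2.
# The function, learning rate, and iteration count are illustrative choices.
def grad(w):
    return 2.0 * (w - 3.0)  # dE/dw

w = 0.0   # initial weight
lr = 0.1  # learning rate
for _ in range(100):
    w -= lr * grad(w)  # step in the direction opposite to the gradient
print(w)  # converges toward 3.0, the weight that minimizes E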

Global optimal solution and local optimal solution

- Global optimal solution: the optimal solution we actually want to find; among all possible weights, the one with the smallest error.
- Local optimal solution: optimal within its neighborhood, but better solutions exist elsewhere.

Stochastic gradient descent and mini-batch method

- Gradient descent (batch method)
  - Feeds all the data at once and updates the weights against the average over all data.
  - Learning is difficult, and it easily falls into a local optimum.
- Stochastic gradient descent (SGD)
  - Feeds one randomly selected training sample at a time and updates the weights for that sample.
  - Vulnerable to noise; learning becomes unstable when a weight update goes in the wrong direction.
- Mini-batch method
  - A middle ground between the batch method and SGD.
  - Feeds a small number of randomly selected training samples at a time and updates the weights against their average (see the sketch after this list).
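In Keras, these three variants differ only in the batch_size argument to model.fit. A sketch, assuming a compiled model and training data X, y:

# Sketch: the three update styles differ only in batch_size.
# X, y and the concrete sizes are placeholder assumptions.
model.fit(X, y, batch_size = len(X))  # batch method: all data at once
model.fit(X, y, batch_size = 1)       # SGD: one sample at a time
model.fit(X, y, batch_size = 32)      # mini-batch: a few samples at a time (Keras default is 32)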

10.3 Neural network tuning

Batch size
・ The batch size during training; the Keras default is 32.
・ Powers of 2 are common, but that is only a convention; what does make sense is to search small values densely and large values sparsely.

Optimizer
・ Adam, a later arrival, is often used, but depending on the problem a plain SGD can be the best choice.
・ Tune it together with the learning rate below.

Learning rate
・ The proportion by which the weights are updated in one step; the Keras default for Adam is 0.001.

Activation function
・ ReLU (Rectified Linear Unit) is widely used, but improved variants such as Leaky ReLU and SELU (Scaled Exponential Linear Unit) are also worth considering.
・ Leaky ReLU: for inputs of 0 or less, applies a linear function with a small slope.
・ ELU: for inputs of 0 or less, outputs the exponential function minus 1.

Regularization / weight decay
・ To avoid overfitting, constrain the weights so they do not grow too large by adding one of the following norms to the loss function.
・ L1 norm: the sum of the absolute values of the weight's elements.
・ L2 norm: the sum of the squares of the weight's elements.
・ L∞ norm: the maximum absolute value among the weight's elements.

Weight initialization
・ By default, Keras initializes weights with random numbers.
・ The distribution can also be specified explicitly.
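A sketch of setting the optimizer and learning rate explicitly when compiling; the lr values simply restate the Keras defaults, and the loss here is a placeholder choice:

from keras import optimizers

# Sketch: choosing the optimizer and learning rate explicitly.
model.compile(loss = 'categorical_crossentropy',
    optimizer = optimizers.Adam(lr = 0.001))  # 0.001 is the Keras default for Adam

# Depending on the problem, a plain SGD may work best:
model.compile(loss = 'categorical_crossentropy',
    optimizer = optimizers.SGD(lr = 0.01))    # 0.01 is the Keras default for SGD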

Activation function


from keras.models import Sequential
from keras.layers import Dense, LeakyReLU

model = Sequential()
model.add(Dense(.., activation = 'selu'))
model.add(Dense(.., activation = LeakyReLU(0.3)))
model.add(Dense(.., activation = 'softmax'))
model.compile(..)

Regularization


from keras import regularizers
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(.., activation = 'relu',
    kernel_regularizer = regularizers.l2(0.1)))
model.add(Dense(.., activation = 'softmax',
    kernel_regularizer = regularizers.l2(0.1)))
model.compile(..)

Weight initialization


from keras import initializers
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(.., activation = 'relu',
    kernel_initializer = initializers.glorot_normal()))
model.add(Dense(.., activation = 'softmax',
    kernel_initializer = initializers.glorot_normal()))
model.compile(..)

glorot_normal(): the Glorot normal distribution, also known as the Xavier normal distribution (the name I had heard before). Xavier initialization suits the sigmoid and tanh functions, but when ReLU is the activation function, **He initialization, which is specialized for ReLU**, is said to work better.
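Keras also ships He initializers, so switching is a one-line change. A sketch, where the unit counts and input_dim are placeholder assumptions:

from keras import initializers
from keras.models import Sequential
from keras.layers import Dense

# Sketch: He initialization for the ReLU layer, Glorot for the softmax output.
# The unit counts (64, 10) and input_dim are placeholder values.
model = Sequential()
model.add(Dense(64, input_dim = 100, activation = 'relu',
    kernel_initializer = initializers.he_normal()))
model.add(Dense(10, activation = 'softmax',
    kernel_initializer = initializers.glorot_normal()))
model.compile(loss = 'categorical_crossentropy', optimizer = 'adam')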
