Subjects> Deep Learning: Day3 RNN

study-ai


Deep learning

Table of contents
[Deep Learning: Day1 NN](https://qiita.com/matsukura04583/items/6317c57bc21de646da8e)
[Deep Learning: Day2 CNN](https://qiita.com/matsukura04583/items/29f0dcc3ddeca4bf69a2)
[Deep Learning: Day3 RNN](https://qiita.com/matsukura04583/items/9b77a238da4441e0f973)
[Deep Learning: Day4 Reinforcement Learning / TensorFlow](https://qiita.com/matsukura04583/items/50806b750c8d77f2305d)

Deep Learning: Day3 RNN (Lecture Summary)

Reviewing the Big Picture of Deep Learning – Learning Concepts


Latest CNN

• AlexNet
AlexNet is the model that won first place, by a large margin over second place, in the image recognition competition held in 2012. With the advent of AlexNet, deep learning received a lot of attention.
Model structure: five convolutional layers with pooling layers, followed by three fully connected layers. [Figure]

About recurrent neural networks

Section1) Concept of recurrent neural network

python


# Hidden-layer pre-activation: input term plus recurrent term
u[:,t+1] = np.dot(X, W_in) + np.dot(z[:,t].reshape(1, -1), W)
# Hidden-layer output
z[:,t+1] = functions.sigmoid(u[:,t+1])
# Output-layer result
y[:,t] = functions.sigmoid(np.dot(z[:,t+1].reshape(1, -1), W_out))
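As a rough illustration of the forward step in this section, here is a self-contained sketch that runs it over a whole sequence. The layer sizes, the random weights, the function name `rnn_forward`, and the local `sigmoid` helper are assumptions made for this example (the course code uses `functions.sigmoid` from its own helper module), and biases are left out.

```python
import numpy as np

def sigmoid(x):
    # Stand-in for the course's functions.sigmoid
    return 1.0 / (1.0 + np.exp(-x))

def rnn_forward(xs, W_in, W, W_out):
    """Run a simple RNN over a sequence xs of shape (T, input_size)."""
    z = np.zeros(W.shape[0])           # initial hidden state z_0
    ys = []
    for t in range(xs.shape[0]):
        u = xs[t] @ W_in + z @ W       # input term plus recurrent term
        z = sigmoid(u)                 # new hidden state
        ys.append(sigmoid(z @ W_out))  # output at time t
    return np.array(ys)

rng = np.random.default_rng(0)
W_in = rng.standard_normal((2, 16))    # input -> hidden (hypothetical sizes)
W = rng.standard_normal((16, 16))      # hidden -> hidden
W_out = rng.standard_normal((16, 1))   # hidden -> output
xs = rng.integers(0, 2, size=(8, 2)).astype(float)  # 8 time steps of 2 bits

ys = rnn_forward(xs, W_in, W, W_out)
print(ys.shape)  # (8, 1)
```

The same hidden state `z` is carried from one time step to the next, which is what distinguishes this loop from a plain feed-forward network.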

Section2) LSTM: big picture (the flow so far and an overall view of the issues)

Overall view of LSTM


Peephole connections are a mechanism created to meet the need to propagate the past information stored in the CEC to other nodes at any time, or to forget it at any time. Without them, the value of the CEC itself does not affect gate control.

What is a peephole connection? $\Rightarrow$ A structure that allows the value of the CEC itself to be propagated via a weight matrix.

Section3) GRU

GRU overview

Section4) Bidirectional RNN

Natural language processing with RNN

Section5)Seq2Seq

Seq2Seq Big picture

[Figure]

What is the specific use of Seq2seq? $\Rightarrow$ It is used for machine dialogue and machine translation.

What is Seq2seq? $\Rightarrow$ It is an Encoder-Decoder model.

Section6)Word2vec

Confirmation test

[P11] Confirmation test Answer the size of the output image when an input image of size 5x5 is convolved with a filter of size 3x3. The stride is 2 and the padding is 1. ⇒ [Discussion] The answer is 3 × 3. Notation: input height (H), input width (W), output height (OH), output width (OW), filter height (FH), filter width (FW), stride (S), padding (P).

[P12] Find dz / dx using the chain rule.

     z = t^2, t = x + y

⇒ [Discussion] It can be calculated as follows.

 \frac{dz}{dx}=\frac{dz}{dt}\frac{dt}{dx}

Since z = t^2, differentiating with respect to t gives \frac{dz}{dt}=2t

Since t = x + y, differentiating with respect to x gives \frac{dt}{dx}=1

\frac{dz}{dx}=2t \cdot 1=2t=2(x+y)

For the convolution in [P11], the output height and width are:

   OH =\frac{H+2P-FH}{S}+1 =\frac{5+2 \cdot 1-3}{2}+1=3
   OW =\frac{W+2P-FW}{S}+1 =\frac{5+2 \cdot 1-3}{2}+1=3

It's a fixed calculation method, so let's remember it as a formula.
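The output-size formula can be written as a small helper. This is only a sketch: the function name `conv_output_size` is made up for this example, and the use of floor division for cases where the stride does not divide evenly is an assumption, not something stated in the lecture.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Output size along one axis: (H + 2P - FH) / S + 1."""
    # Floor division is an assumption for strides that do not divide evenly
    return (input_size + 2 * padding - filter_size) // stride + 1

oh = conv_output_size(5, 3, stride=2, padding=1)  # height
ow = conv_output_size(5, 3, stride=2, padding=1)  # width
print(oh, ow)  # 3 3
```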

[P23] Confirmation test RNN networks have three main weights. One is the weight applied when computing the current intermediate layer from the input, and another is the weight applied when computing the output from the intermediate layer. Explain the remaining weight. ⇒ [Discussion] The answer is the weight applied from the previous intermediate layer to the current intermediate layer. [Figure]

[P37] Find dz / dx using the chain rule.

     z = t^2, t = x + y

⇒ [Discussion] It can be calculated as follows.

 \frac{dz}{dx}=\frac{dz}{dt}\frac{dt}{dx}

Since z = t^2, differentiating with respect to t gives \frac{dz}{dt}=2t

Since t = x + y, differentiating with respect to x gives \frac{dt}{dx}=1

\frac{dz}{dx}=2t \cdot 1=2t=2(x+y)
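One way to sanity-check the chain-rule result 2(x + y) is to compare it against a finite-difference approximation. The function names and the sample point below are chosen only for this illustration.

```python
def z_of(x, y):
    t = x + y
    return t ** 2

def dz_dx_chain(x, y):
    t = x + y
    dz_dt = 2 * t   # derivative of t^2 with respect to t
    dt_dx = 1.0     # derivative of x + y with respect to x
    return dz_dt * dt_dx

def dz_dx_numeric(x, y, h=1e-6):
    # Central difference approximation of dz/dx
    return (z_of(x + h, y) - z_of(x - h, y)) / (2 * h)

x, y = 1.5, 2.0
print(dz_dx_chain(x, y))    # 7.0
print(dz_dx_numeric(x, y))  # close to the value above
```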

[P46] Confirmation test

Express y1 in the figure below with a mathematical formula using x, s0, s1, w_in, w, and w_out. * Define the bias with any symbol. * Also, let the sigmoid function g(x) act on the output of the intermediate layer.


Z_1=sigmoid(S_0W+x_1W_{(in)}+b)

The output layer also uses sigmoid

y_1=sigmoid(Z_1W_{(out)}+c)

Notation differs from book to book, so focus on understanding the essence.

[P54] Code exercises [Figure] ⇒ [Discussion] The answer is (2). [Explanation] In an RNN, the intermediate layer output h_{t} depends on the past intermediate layer outputs h_{t-1}, ..., h_{1}. When partially differentiating the loss function with respect to the weights W and U, this dependence must be taken into account: since dh_{t}/dh_{t-1} = U, U is multiplied in each time we go back one step in time. That is, delta_t = delta_t.dot(U).
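The update delta_t = delta_t.dot(U) can be checked numerically on a toy recurrence. The sketch below assumes an identity activation and no input term, so that dh_t/dh_{t-1} is exactly U; with a nonlinear activation an extra derivative factor would appear. The sizes, the linear loss, and all variable names other than `delta_t` and `U` are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.standard_normal((3, 3))   # recurrent weight (hypothetical size)
c = rng.standard_normal(3)        # loss is L = c . h_T
h0 = rng.standard_normal(3)
T = 4

def loss(h_start):
    h = h_start
    for _ in range(T):
        h = U @ h                 # h_t = U h_{t-1}, identity activation
    return c @ h

# Backward pass: start from dL/dh_T = c and go back T steps
delta_t = c.copy()
for _ in range(T):
    delta_t = delta_t.dot(U)      # dL/dh_{t-1} = dL/dh_t . U

# Finite-difference gradient with respect to h_0 for comparison
eps = 1e-6
numeric = np.array([
    (loss(h0 + eps * e) - loss(h0 - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(delta_t, numeric, rtol=1e-4, atol=1e-6))
```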

[P63] The derivative of the sigmoid function takes its maximum value when the input is 0. Select the correct value from the options. (1) 0.15 (2) 0.25 (3) 0.35 (4) 0.45

⇒ [Discussion] The derivative of the sigmoid is

     (sigmoid)'=(1-sigmoid)(sigmoid)

Since the sigmoid function takes the value 0.5 at input 0,

     (sigmoid)'=(1-0.5)(0.5)=0.25

so the answer is (2).
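A short check of this value, written with the standard library only. The helper names `sigmoid` and `d_sigmoid` are chosen for this sketch.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    # (sigmoid)' = (1 - sigmoid) * sigmoid
    s = sigmoid(x)
    return (1.0 - s) * s

print(d_sigmoid(0.0))  # 0.25

# Scan a grid of inputs from -5.0 to 5.0 to see that no value exceeds the one at 0
grid_max = max(d_sigmoid(i / 10) for i in range(-50, 51))
print(grid_max)  # 0.25
```

Because the derivative never exceeds 0.25, multiplying many such factors together during backpropagation shrinks the gradient, which is the vanishing gradient problem this test leads into.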

[P65] Exercise Challenge [Figure]

⇒ [Discussion] Correct answer: 1 [Explanation] When the norm of the gradient is larger than the threshold, the gradient is rescaled so that its norm equals the threshold. The clipped gradient is therefore gradient × (threshold / norm of gradient), that is, gradient * rate. It is easy to understand as simply multiplying the gradient by the ratio of the threshold to the norm.
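The clipping rule described above might look like the following. This is a sketch of the idea rather than the exercise's exact code; the function signature and the sample values are assumptions.

```python
import numpy as np

def gradient_clipping(grad, threshold):
    """Rescale grad so that its norm does not exceed threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        rate = threshold / norm
        return grad * rate
    return grad

g = np.array([3.0, 4.0])            # norm is 5
clipped = gradient_clipping(g, 1.0)
print(clipped)                      # direction kept, norm reduced to the threshold
print(np.linalg.norm(clipped))

small = np.array([0.1, 0.2])        # norm below the threshold
print(gradient_clipping(small, 1.0))  # returned unchanged
```

Only the length of the gradient changes; its direction is preserved.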

[P79] Confirmation test Suppose you want to enter the following sentence into an LSTM and predict the word that fits in the blank. The word "very" in the sentence is considered to have no effect on the blank prediction even if it disappears. Which gate is considered to work in such a case? "The movie was interesting. By the way, I'm very hungry, so something ____." ⇒ [Discussion] Correct answer: the forget gate. The forget gate determines how much of the stored past information is kept, so it is the gate that lets information with no bearing on the prediction be dropped.

[P80] Exercise Challenge [Figure]

⇒ [Discussion] Correct answer: 3 [Explanation] The new cell state is the sum of the computed cell input multiplied by the input gate and the cell state one step before multiplied by the forget gate. That is, input_gate * a + forget_gate * c.
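To place the cell update in context, here is a minimal sketch of one LSTM step without peephole connections. The function name `lstm_step`, the stacked weight layout (cell input, input gate, forget gate, output gate), and the sizes are assumptions for this example; only the line computing `c_next` corresponds directly to the exercise answer.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4H, D), U: (4H, H), b: (4H,)."""
    H = h_prev.shape[0]
    gates = W @ x + U @ h_prev + b
    a = np.tanh(gates[:H])                   # candidate cell input
    input_gate = sigmoid(gates[H:2 * H])
    forget_gate = sigmoid(gates[2 * H:3 * H])
    output_gate = sigmoid(gates[3 * H:])
    # New cell state: gated input plus gated previous state
    c_next = input_gate * a + forget_gate * c_prev
    h_next = output_gate * np.tanh(c_next)
    return h_next, c_next

rng = np.random.default_rng(0)
D, H = 2, 3
W = rng.standard_normal((4 * H, D))
U = rng.standard_normal((4 * H, H))
b = np.zeros(4 * H)
h, c = lstm_step(rng.standard_normal(D), np.zeros(H), np.zeros(H), W, U, b)
print(h.shape, c.shape)  # (3,) (3,)
```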

[P89] Confirmation test Briefly describe the challenges facing LSTM and CEC.

⇒ [Discussion] LSTM has a large number of parameters, so its computational load is high. The CEC itself has no concept of learning and uses no weights, so on its own it cannot meet the need to propagate the stored past information to other nodes at any time or to forget it at any time.

[P91] Exercise Challenge [Figure]

[P93] Confirmation test Briefly describe the difference between LSTM and GRU. ⇒ [Discussion] LSTM has the problem that the number of parameters is large and the computational load is high, whereas GRU reduces the parameters and processes faster. However, GRU is not always superior, so in some cases it is better to compare the two and choose.

[P96] Exercise Challenge

[Figure] ⇒ [Discussion] Correct answer: 4 [Explanation] In a bidirectional RNN, the feature is the combination of the intermediate layer representations from the forward and backward passes, so the answer is np.concatenate([h_f, h_b[::-1]], axis=1). (Reference) [Learn the np.concatenate syntax here](https://www.sejuku.net/blog/67869)
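The shapes involved in that concatenation can be seen in a small example. The array contents and sizes here are made up; the sketch assumes `h_f` and `h_b` each have shape (time steps, hidden size) and that `h_b` was computed over the reversed sequence, so its row 0 belongs to the last time step.

```python
import numpy as np

T, H = 4, 3                                            # time steps, hidden size
h_f = np.arange(T * H, dtype=float).reshape(T, H)      # forward-direction states
h_b = -np.arange(T * H, dtype=float).reshape(T, H)     # backward-direction states

# Reverse h_b along time so both rows refer to the same time step,
# then join the two representations along the feature axis
h = np.concatenate([h_f, h_b[::-1]], axis=1)
print(h.shape)  # (4, 6)
```

With `axis=0` the result would instead stack the two sequences in time, giving shape (8, 3), which is not the per-time-step feature the explanation describes.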

[P111] Exercise Challenge [Figure] ⇒ [Discussion] Correct answer: 1 [Explanation] The word w is a one-hot vector, which is converted into another feature vector by word embedding. Using the embedding matrix E, this can be written as E.dot(w).
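The sketch below illustrates E.dot(w) with a one-hot w. The vocabulary size, embedding size, and random matrix are assumptions for the example, and E is taken to have shape (embedding size, vocabulary size) so that the product is defined.

```python
import numpy as np

vocab_size, embed_size = 5, 3
rng = np.random.default_rng(0)
E = rng.standard_normal((embed_size, vocab_size))  # embedding matrix

word_id = 2
w = np.zeros(vocab_size)
w[word_id] = 1.0                                   # one-hot vector for the word

e = E.dot(w)                                       # embedded representation
print(e.shape)                                     # (3,)
print(np.allclose(e, E[:, word_id]))               # True
```

Multiplying by a one-hot vector just selects one column of E, which is why implementations usually index the matrix directly instead of building large one-hot vectors.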

(Reference) On the relationship between natural language processing and one-hot encoding: when the document is large, the one-hot data also becomes large, and processing may become too slow to be practical.

[P120] Confirmation test Briefly describe the difference between seq2seq and HRED, and between HRED and VHRED. ⇒ [Discussion] seq2seq can only answer one question at a time, and HRED was created to solve that problem. HRED in turn has the problem that it can only answer in the same way each time, and VHRED solves this so that it can answer while varying its wording.

[P129] Confirmation test Fill in the blank in the description below about VAE. "Introduces ____ into the latent variable of the autoencoder." ⇒ [Discussion] The answer is "random variables": a VAE introduces random variables into the latent variable.

[P138] Confirmation test Briefly describe the difference between RNN and word2vec, and between seq2seq and Attention. ⇒ [Discussion] An RNN needs a weight matrix of size (vocabulary size) × (vocabulary size), whereas word2vec needs only a weight matrix of size (vocabulary size) × (arbitrary word-vector dimension). As for seq2seq and Attention, seq2seq can only give the same answer to the same question, but Attention uses importance and relevance, so it can return answers with variation. Through iterative learning, its answers become more accurate.

[Video DN60] Exercise Challenge [Figure] ⇒ [Discussion] The answer is (2). Each node is represented by a representation vector, which is computed by applying weights to the left and right representation vectors.

# Exercise

DN42_Source Exercise ①

Simple RNN: binary addition

Execution result of binary addition: [Figure]

[try] Let's change weight_init_std, learning_rate, and hidden_layer_size
• weight_init_std: 1 → 10
• learning_rate: 0.1 → 0.01
• hidden_layer_size: 16 → 32
[Figure] Learning got worse.

[try] Let's change the weight initialization method. Try both Xavier and He. (Source change)

python


###########changes##############
#Weight initialization(Bias is omitted for simplicity)
#W_in = weight_init_std * np.random.randn(input_layer_size, hidden_layer_size)
#W_out = weight_init_std * np.random.randn(hidden_layer_size, output_layer_size)
#W = weight_init_std * np.random.randn(hidden_layer_size, hidden_layer_size)

#Weight initialization using Xavier
W_in = np.random.randn(input_layer_size, hidden_layer_size) / (np.sqrt(input_layer_size))
W_out = np.random.randn(hidden_layer_size, output_layer_size) / (np.sqrt(hidden_layer_size))
W = np.random.randn(hidden_layer_size, hidden_layer_size) / (np.sqrt(hidden_layer_size))

#Weight initialization using He
# W_in = np.random.randn(input_layer_size, hidden_layer_size) / (np.sqrt(input_layer_size)) * np.sqrt(2)
# W_out = np.random.randn(hidden_layer_size, output_layer_size) / (np.sqrt(hidden_layer_size)) * np.sqrt(2)
# W = np.random.randn(hidden_layer_size, hidden_layer_size) / (np.sqrt(hidden_layer_size)) * np.sqrt(2)

#####################################

Results using Xavier: [Figure]

Results using He: [Figure]

The two results were nearly the same.
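The two initializations differ only in the scale of the random weights, which a small helper makes visible. The function `init_weight` and the 256 × 256 size are assumptions for this sketch (the exercise itself uses much smaller layers); the scaling follows the source change above, 1 / sqrt(n_in) for Xavier and sqrt(2 / n_in) for He.

```python
import numpy as np

def init_weight(n_in, n_out, method="xavier", rng=None):
    """Return an (n_in, n_out) weight matrix scaled by the chosen method."""
    rng = np.random.default_rng() if rng is None else rng
    w = rng.standard_normal((n_in, n_out))
    if method == "xavier":
        return w / np.sqrt(n_in)
    if method == "he":
        return w * np.sqrt(2.0 / n_in)
    raise ValueError("method must be 'xavier' or 'he'")

rng = np.random.default_rng(0)
w_xavier = init_weight(256, 256, "xavier", rng)
w_he = init_weight(256, 256, "he", rng)

# The sample standard deviations should be near the target scales
print(w_xavier.std(), 1 / np.sqrt(256))
print(w_he.std(), np.sqrt(2 / 256))
```

He initialization is larger than Xavier by a factor of sqrt(2), a modest difference that is in line with the two runs giving similar results.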

[try] Let's change the activation function of the middle layer

ReLU (let's check the gradient explosion)

python changes


     #  z[:,t+1] = functions.sigmoid(u[:,t+1])
        z[:,t+1] = functions.relu(u[:,t+1])
     #  z[:,t+1] = np.tanh(u[:,t+1])
[Figure]

tanh (tanh is provided in numpy. Let's create its derivative as d_tanh)

python changes Added definition of derivative


def d_tanh(x):
     # Derivative of tanh: 1 - tanh(x)^2
     return 1 - np.tanh(x) ** 2

python changes


     #  z[:,t+1] = functions.sigmoid(u[:,t+1])
     #  z[:,t+1] = functions.relu(u[:,t+1])
        # The forward pass uses tanh itself; d_tanh belongs in the backward pass
        z[:,t+1] = np.tanh(u[:,t+1])
[Figure]
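When writing a derivative such as d_tanh by hand, it can help to compare it with a finite-difference estimate before using it in backpropagation. The sketch below uses the identity tanh'(x) = 1 - tanh(x)^2; the grid of test points and the step size are arbitrary choices for this example.

```python
import numpy as np

def d_tanh(x):
    # Derivative of tanh: 1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2

xs = np.linspace(-3.0, 3.0, 13)
h = 1e-5
# Central difference approximation of the derivative of np.tanh
numeric = (np.tanh(xs + h) - np.tanh(xs - h)) / (2 * h)

print(np.max(np.abs(d_tanh(xs) - numeric)))  # should be very small
print(d_tanh(0.0))                           # 1.0
```

Unlike the sigmoid derivative, which peaks at 0.25, the tanh derivative reaches 1 at the origin.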
