Deep Learning / Deep Learning from Zero 2 Chapter 8 Memo

1. Introduction

I'm reading the masterpiece **"Deep Learning from Zero 2"**. This is my memo for Chapter 8. To run the code, download the complete source from GitHub and use Jupyter Notebook in the ch08 directory.

2. Attention

Below is a schematic diagram of Attention in the seq2seq model.

[Figure: Attention layers inserted into the seq2seq model]

An **Attention** layer is inserted between the **LSTM** and **Affine** layers at every time step of the **Decoder**. Then **hs**, the collection of the **Encoder's hidden vectors (hs0 to hs4)** at each time step, is fed into every Attention layer.

In each Attention layer, the similarity (inner product) between the input from the LSTM and hs is computed, and the importance of each hidden vector (hs0 to hs4) is expressed as a probability distribution. The hidden vectors are then weighted by this probability distribution, combined into a single vector, and sent to the Affine layer.

[Figure: inside the Attention layer and the matrix operations performed there (batch size N = 1)]

The figure above shows the inside of the Attention layer and an image of the matrix operations performed there (for batch size N = 1).

First, the **Attention_weight** part. Take the **element-wise product** of the Encoder's **vectors hs0 to hs4** and the Decoder's **vector h**, and sum each result in the **column direction (axis = 2)**; the higher the **similarity** between the two vectors, the larger this value becomes. Passing these values through Softmax yields a **vector a** that represents **how heavily each of the vectors hs0 to hs4 should be weighted**.

Next is the **Weight_Sum** part. Take the **product** of the **vectors hs0 to hs4** and the **vector a**, and sum the result in the **row direction (axis = 1)** to obtain a **vector c**, in which the information to be attended to is weighted and combined (added) into a single vector. By sending this to Affine, the Affine layer receives, in addition to the information from the conventional LSTM, the **information on the Encoder side that should be referred to at that time step**.
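For example, if Attention_weight produces a = (0.1, 0.6, 0.1, 0.1, 0.1) (illustrative numbers), then c = 0.1·hs0 + 0.6·hs1 + 0.1·hs2 + 0.1·hs3 + 0.1·hs4, so the context vector sent to Affine is dominated by hs1.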

By the way, when computing np.sum(x, axis=?) for an x of shape (N, T, H): with **axis = 0** the sum removes the N axis (addition in the batch direction), with **axis = 1** it removes the T axis (addition in the row direction), and with **axis = 2** it removes the H axis (addition in the column direction).
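As a quick sanity check of these axis conventions (a minimal sketch, not from the book):

import numpy as np

x = np.arange(2 * 3 * 4).reshape(2, 3, 4)  # shape (N, T, H) = (2, 3, 4)
print(np.sum(x, axis=0).shape)  # (3, 4): the N axis disappears (batch direction)
print(np.sum(x, axis=1).shape)  # (2, 4): the T axis disappears (row direction)
print(np.sum(x, axis=2).shape)  # (2, 3): the H axis disappears (column direction)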

Let's take a look at the AttentionWeight code.

import sys
sys.path.append('..')
import numpy as np
from common.layers import Softmax


class AttentionWeight:
    def __init__(self):
        self.params, self.grads = [], []
        self.softmax = Softmax()
        self.cache = None

    def forward(self, hs, h):
        N, T, H = hs.shape

        # hr is broadcast against hs, so only its shape is adjusted instead of repeating it
        hr = h.reshape(N, 1, H)  #.repeat(T, axis=1)
        t = hs * hr  # element-wise product of the vectors
        s = np.sum(t, axis=2)  # sum in the column direction (similarity scores)
        a = self.softmax.forward(s)  # probability distribution a: importance of each hidden vector
        self.cache = (hs, hr)
        return a

    def backward(self, da):
        hs, hr = self.cache
        N, T, H = hs.shape

        ds = self.softmax.backward(da)
        dt = ds.reshape(N, T, 1).repeat(H, axis=2)  # expand to (N, T, H)
        dhs = dt * hr
        dhr = dt * hs
        dh = np.sum(dhr, axis=1)

        return dhs, dh

In forward propagation, t = hs * hr is computed by broadcasting, so hr does not need to be repeated explicitly.
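To double-check the shapes, here is a minimal plain-NumPy sketch of the same computation (illustrative sizes, not the book's code):

import numpy as np

# Shape check of the AttentionWeight forward pass
N, T, H = 2, 5, 4
hs = np.random.randn(N, T, H)          # Encoder hidden vectors hs0 ... hs4
h = np.random.randn(N, H)              # Decoder hidden vector at one time step

hr = h.reshape(N, 1, H)                # broadcast against hs
s = np.sum(hs * hr, axis=2)            # (N, T): similarity scores
a = np.exp(s - s.max(axis=1, keepdims=True))
a /= a.sum(axis=1, keepdims=True)      # softmax over the T axis
print(a.shape, a.sum(axis=1))          # (2, 5), each row sums to 1

Next, let's take a look at WeightSum.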

class WeightSum:
    def __init__(self):
        self.params, self.grads = [], []
        self.cache = None

    def forward(self, hs, a):
        N, T, H = hs.shape

        # ar is broadcast against hs, so only its shape is adjusted instead of repeating it
        ar = a.reshape(N, T, 1) #.repeat(T, axis=1)
        t = hs * ar  # element-wise product of the vectors
        c = np.sum(t, axis=1)  # sum in the row direction (weighted sum over time)
        self.cache = (hs, ar)
        return c

    def backward(self, dc):
        hs, ar = self.cache
        N, T, H = hs.shape
        dt = dc.reshape(N, 1, H).repeat(T, axis=1)
        dar = dt * hs
        dhs = dt * ar
        da = np.sum(dar, axis=2)

        return dhs, da

In forward propagation, t = hs * ar is likewise computed by broadcasting, so ar does not need to be repeated explicitly.
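Likewise, a minimal plain-NumPy shape check of the WeightSum forward pass (illustrative sizes):

import numpy as np

# Shape check of the WeightSum forward pass
N, T, H = 2, 5, 4
hs = np.random.randn(N, T, H)          # Encoder hidden vectors
a = np.random.rand(N, T)
a /= a.sum(axis=1, keepdims=True)      # weights as produced by AttentionWeight (rows sum to 1)

ar = a.reshape(N, T, 1)                # broadcast against hs
c = np.sum(hs * ar, axis=1)            # (N, H): weighted sum of the hidden vectors over time
print(c.shape)                         # (2, 4)

Together, these two classes make up the Attention class.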

class Attention:
    def __init__(self):
        self.params, self.grads = [], []
        self.attention_weight_layer = AttentionWeight()
        self.weight_sum_layer = WeightSum()
        self.attention_weight = None

    def forward(self, hs, h):
        a = self.attention_weight_layer.forward(hs, h)
        out = self.weight_sum_layer.forward(hs, a)
        self.attention_weight = a
        return out

    def backward(self, dout):
        dhs0, da = self.weight_sum_layer.backward(dout)
        dhs1, dh = self.attention_weight_layer.backward(da)
        dhs = dhs0 + dhs1
        return dhs, dh

Finally, these layers are bundled into the TimeAttention class so that all of the Decoder's time steps can be handled at once.

class TimeAttention:
    def __init__(self):
        self.params, self.grads = [], []
        self.layers = None
        self.attention_weights = None

    def forward(self, hs_enc, hs_dec):
        N, T, H = hs_dec.shape
        out = np.empty_like(hs_dec)
        self.layers = []
        self.attention_weights = []

        for t in range(T):
            layer = Attention()
            out[:, t, :] = layer.forward(hs_enc, hs_dec[:,t,:])
            self.layers.append(layer)
            self.attention_weights.append(layer.attention_weight)

        return out

    def backward(self, dout):
        N, T, H = dout.shape
        dhs_enc = 0
        dhs_dec = np.empty_like(dout)

        for t in range(T):
            layer = self.layers[t]
            dhs, dh = layer.backward(dout[:, t, :])
            dhs_enc += dhs
            dhs_dec[:,t,:] = dh

        return dhs_enc, dhs_dec
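As a rough usage sketch of TimeAttention (it assumes the classes defined above, with Softmax available from the repository's common/layers.py, and uses made-up sizes):

import numpy as np

# Hypothetical sizes: 5 Encoder time steps, 3 Decoder time steps
N, T_enc, T_dec, H = 2, 5, 3, 8
hs_enc = np.random.randn(N, T_enc, H)   # all Encoder hidden states
hs_dec = np.random.randn(N, T_dec, H)   # all Decoder hidden states

layer = TimeAttention()
out = layer.forward(hs_enc, hs_dec)
print(out.shape)                         # (2, 3, 8): one context vector per Decoder time step

dhs_enc, dhs_dec = layer.backward(np.ones_like(out))
print(dhs_enc.shape, dhs_dec.shape)      # (2, 5, 8) (2, 3, 8)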

Now, let's run **train.py**, the seq2seq sample (date-format conversion) that uses this Attention mechanism.

import sys
sys.path.append('..')
import numpy as np
import matplotlib.pyplot as plt
from dataset import sequence
from common.optimizer import Adam
from common.trainer import Trainer
from common.util import eval_seq2seq
from attention_seq2seq import AttentionSeq2seq
from ch07.seq2seq import Seq2seq

# Load the dataset
(x_train, t_train), (x_test, t_test) = sequence.load_data('date.txt')
char_to_id, id_to_char = sequence.get_vocab()

# Reverse the input sequences
x_train, x_test = x_train[:, ::-1], x_test[:, ::-1]

# Hyperparameter settings
vocab_size = len(char_to_id)
wordvec_size = 16
hidden_size = 256
batch_size = 128
max_epoch = 10
max_grad = 5.0

model = AttentionSeq2seq(vocab_size, wordvec_size, hidden_size)
# model = Seq2seq(vocab_size, wordvec_size, hidden_size)
# model = PeekySeq2seq(vocab_size, wordvec_size, hidden_size)

optimizer = Adam()
trainer = Trainer(model, optimizer)

acc_list = []
for epoch in range(max_epoch):
    trainer.fit(x_train, t_train, max_epoch=1,
                batch_size=batch_size, max_grad=max_grad)

    correct_num = 0
    for i in range(len(x_test)):
        question, correct = x_test[[i]], t_test[[i]]
        verbose = i < 10
        correct_num += eval_seq2seq(model, question, correct,
                                    id_to_char, verbose, is_reverse=True)

    acc = float(correct_num) / len(x_test)
    acc_list.append(acc)
    print('val acc %.3f%%' % (acc * 100))

model.save_params()

# Plot the accuracy graph
x = np.arange(len(acc_list))
plt.plot(x, acc_list, marker='o')
plt.xlabel('epochs')
plt.ylabel('accuracy')
plt.ylim(-0.05, 1.05)
plt.show()

[Figure: validation accuracy per epoch]

After 2 epochs, the accuracy reaches 100%. That is indeed the power of Attention.
