I'm reading the masterpiece **"Deep Learning from Scratch 2"**. This is a memo on Chapter 8. To run the code, download the entire source from GitHub and use Jupyter Notebook in the ch08 directory.
**2. Attention**
Below is a schematic diagram of Attention in the seq2seq model.
An **Attention** layer is inserted between the **LSTM** and **Affine** layers at every time step of the **Decoder**. Then **hs**, the collection of the **hidden vectors (hs0 to hs4)** from every time step of the **Encoder**, is fed into all of the Attention layers.
Each Attention layer computes the similarity (the product of the vectors) between the input coming from the LSTM and hs, and turns it into a probability distribution that expresses the importance of each hidden vector (hs0 to hs4). The hidden vectors are then weighted by that probability distribution, summed into a single vector, and sent to the Affine layer.
Below is a schematic of the inside of the Attention layer, along with an image of the matrix operations performed there (for batch size N = 1).
First, the **Attention_weight** part. Take the **product** of each of the Encoder **vectors hs0 to hs4** with the Decoder **vector h**, and sum each result in the **column direction (axis=2)**; the higher the **similarity, the larger the resulting value**. Passing these values through Softmax yields a **vector a** that represents **how heavily each of the vectors hs0 to hs4 should be weighted**.
Next is the **Weight_Sum** part. The **product** of the **vectors hs0 to hs4** and the **vector a** is calculated, and summing in the **row direction (axis=1)** gives a **vector c**, in which **the information to attend to is weighted and combined (added) into a single vector**. By sending this to the Affine layer, Affine receives, in addition to the usual information from the LSTM, **the information on the Encoder side that should be referred to at that time step**.
By the way, when computing np.sum(x, axis=?) for x of shape (N, T, H): with **axis=0** the sum is taken in the batch direction and the N axis disappears, with **axis=1** the sum is taken in the row direction and the T axis disappears, and with **axis=2** the sum is taken in the column direction and the H axis disappears.
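As a quick sanity check of these axis rules (a minimal sketch of my own, with made-up sizes N=2, T=5, H=4):

```python
import numpy as np

x = np.random.randn(2, 5, 4)    # shape (N, T, H)

print(np.sum(x, axis=0).shape)  # (5, 4): the N axis is summed away (batch direction)
print(np.sum(x, axis=1).shape)  # (2, 4): the T axis is summed away (row direction)
print(np.sum(x, axis=2).shape)  # (2, 5): the H axis is summed away (column direction)
```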
Let's take a look at the code for AttentionWeight.
```python
import numpy as np
from common.layers import Softmax  # Softmax layer from the book's common library


class AttentionWeight:
    def __init__(self):
        self.params, self.grads = [], []
        self.softmax = Softmax()
        self.cache = None

    def forward(self, hs, h):
        N, T, H = hs.shape

        # hr is only reshaped; broadcasting makes an explicit repeat unnecessary
        hr = h.reshape(N, 1, H)  # .repeat(T, axis=1)
        t = hs * hr              # product of the vectors
        s = np.sum(t, axis=2)    # sum in the column direction
        a = self.softmax.forward(s)  # probability distribution a: the importance of each hidden vector
        self.cache = (hs, hr)
        return a

    def backward(self, da):
        hs, hr = self.cache
        N, T, H = hs.shape

        ds = self.softmax.backward(da)
        dt = ds.reshape(N, T, 1).repeat(H, axis=2)  # expand back to (N, T, H)
        dhs = dt * hr
        dhr = dt * hs
        dh = np.sum(dhr, axis=1)

        return dhs, dh
```
In the forward pass, t = hs * hr is computed using broadcasting, so hr does not need to be repeated. Next, let's take a look at WeightSum.
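To see the broadcasting at work, here is a minimal sketch of my own with made-up sizes (N=2, T=5, H=4); reshaping h to (N, 1, H) is enough, because NumPy expands the singleton axis automatically:

```python
import numpy as np

N, T, H = 2, 5, 4
hs = np.random.randn(N, T, H)   # Encoder hidden vectors
h = np.random.randn(N, H)       # Decoder hidden vector at one time step

hr = h.reshape(N, 1, H)         # (N, 1, H): no explicit repeat needed
t = hs * hr                     # broadcast up to (N, T, H)
print(t.shape)                  # (2, 5, 4)
print(np.allclose(t, hs * hr.repeat(T, axis=1)))  # True: same result as an explicit repeat
```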
```python
class WeightSum:
    def __init__(self):
        self.params, self.grads = [], []
        self.cache = None

    def forward(self, hs, a):
        N, T, H = hs.shape

        # ar is only reshaped; broadcasting makes an explicit repeat unnecessary
        ar = a.reshape(N, T, 1)  # .repeat(H, axis=2)
        t = hs * ar              # product of the vectors
        c = np.sum(t, axis=1)    # sum in the row direction
        self.cache = (hs, ar)
        return c

    def backward(self, dc):
        hs, ar = self.cache
        N, T, H = hs.shape

        dt = dc.reshape(N, 1, H).repeat(T, axis=1)
        dar = dt * hs
        dhs = dt * ar
        da = np.sum(dar, axis=2)

        return dhs, da
```
In the forward pass, t = hs * ar is computed using broadcasting, so ar does not need to be repeated. Putting these two classes together gives the Attention class.
```python
class Attention:
    def __init__(self):
        self.params, self.grads = [], []
        self.attention_weight_layer = AttentionWeight()
        self.weight_sum_layer = WeightSum()
        self.attention_weight = None

    def forward(self, hs, h):
        a = self.attention_weight_layer.forward(hs, h)
        out = self.weight_sum_layer.forward(hs, a)
        self.attention_weight = a
        return out

    def backward(self, dout):
        dhs0, da = self.weight_sum_layer.backward(dout)
        dhs1, dh = self.attention_weight_layer.backward(da)
        dhs = dhs0 + dhs1
        return dhs, dh
```
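As a quick shape check (my own snippet, not from the book; it assumes the classes above, including the Softmax imported from common.layers, are already defined in the session), the context vector returned by Attention has the same shape as the Decoder hidden vector h:

```python
import numpy as np

N, T, H = 2, 5, 4                      # made-up sizes for illustration
hs = np.random.randn(N, T, H)          # Encoder hidden vectors
h = np.random.randn(N, H)              # Decoder hidden vector at one time step

layer = Attention()
out = layer.forward(hs, h)
print(out.shape)                       # (2, 4): context vector, same shape as h
print(layer.attention_weight.shape)    # (2, 5): one weight per Encoder time step
```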
To handle all the time steps at once, these layers are bundled into the TimeAttention class.
```python
class TimeAttention:
    def __init__(self):
        self.params, self.grads = [], []
        self.layers = None
        self.attention_weights = None

    def forward(self, hs_enc, hs_dec):
        N, T, H = hs_dec.shape
        out = np.empty_like(hs_dec)
        self.layers = []
        self.attention_weights = []

        for t in range(T):
            layer = Attention()
            out[:, t, :] = layer.forward(hs_enc, hs_dec[:, t, :])
            self.layers.append(layer)
            self.attention_weights.append(layer.attention_weight)

        return out

    def backward(self, dout):
        N, T, H = dout.shape
        dhs_enc = 0
        dhs_dec = np.empty_like(dout)

        for t in range(T):
            layer = self.layers[t]
            dhs, dh = layer.backward(dout[:, t, :])
            dhs_enc += dhs
            dhs_dec[:, t, :] = dh

        return dhs_enc, dhs_dec
```
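Again as a quick shape check (my own snippet, with made-up sizes and assuming the classes above are defined), TimeAttention returns one context vector per Decoder time step:

```python
import numpy as np

N, T_enc, T_dec, H = 2, 5, 3, 4           # made-up sizes
hs_enc = np.random.randn(N, T_enc, H)     # all Encoder hidden vectors
hs_dec = np.random.randn(N, T_dec, H)     # all Decoder hidden vectors

time_attention = TimeAttention()
out = time_attention.forward(hs_enc, hs_dec)
print(out.shape)                               # (2, 3, 4): one context vector per Decoder step
print(len(time_attention.attention_weights))   # 3 attention weight arrays, each of shape (2, 5)
```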
Now, let's run **train.py**, the seq2seq sample code (date-format conversion) that uses this Attention mechanism.
```python
import sys
sys.path.append('..')
import numpy as np
import matplotlib.pyplot as plt
from dataset import sequence
from common.optimizer import Adam
from common.trainer import Trainer
from common.util import eval_seq2seq
from attention_seq2seq import AttentionSeq2seq
from ch07.seq2seq import Seq2seq

# Load the data
(x_train, t_train), (x_test, t_test) = sequence.load_data('date.txt')
char_to_id, id_to_char = sequence.get_vocab()

# Reverse the input sentences
x_train, x_test = x_train[:, ::-1], x_test[:, ::-1]

# Hyperparameter settings
vocab_size = len(char_to_id)
wordvec_size = 16
hidden_size = 256
batch_size = 128
max_epoch = 10
max_grad = 5.0

model = AttentionSeq2seq(vocab_size, wordvec_size, hidden_size)
# model = Seq2seq(vocab_size, wordvec_size, hidden_size)
# model = PeekySeq2seq(vocab_size, wordvec_size, hidden_size)

optimizer = Adam()
trainer = Trainer(model, optimizer)

acc_list = []
for epoch in range(max_epoch):
    trainer.fit(x_train, t_train, max_epoch=1,
                batch_size=batch_size, max_grad=max_grad)

    correct_num = 0
    for i in range(len(x_test)):
        question, correct = x_test[[i]], t_test[[i]]
        verbose = i < 10
        correct_num += eval_seq2seq(model, question, correct,
                                    id_to_char, verbose, is_reverse=True)

    acc = float(correct_num) / len(x_test)
    acc_list.append(acc)
    print('val acc %.3f%%' % (acc * 100))

model.save_params()

# Draw the graph
x = np.arange(len(acc_list))
plt.plot(x, acc_list, marker='o')
plt.xlabel('epochs')
plt.ylabel('accuracy')
plt.ylim(-0.05, 1.05)
plt.show()
```
The accuracy reaches 100% after only 2 epochs. That is, as expected, the power of Attention.