Following the Seq2Seq implementation in my previous post, this time I implemented Attention Seq2Seq, i.e. Seq2Seq with Attention added, in PyTorch.
As a beginner myself, I could not find much source code that implements Attention in PyTorch. There is the official PyTorch Attention tutorial, but it does not appear to do mini-batch training, and its Attention seems customized for that particular task, so I wanted a simpler, plainer Attention and ended up implementing it myself. I hope this provides some helpful information for anyone who is struggling to implement Attention.
As for the mechanism of Attention itself, [Deep Learning from scratch ❷ ― Natural language processing](https://www.amazon.co.jp/dp/4873118360) was by far the easiest to understand.
The implementation example I introduce here is (or should be) just a from-scratch reimplementation of that book (Deep Learning from Scratch 2), so if this article is hard to follow, I strongly recommend reading the book.
There seem to be various types of Attention, such as soft Attention and hard Attention, but the Attention here refers to the (soft) Attention of [Deep Learning from scratch ❷ ― Natural language processing](https://www.amazon.co.jp/dp/4873118360).
Seq2Seq has the problem that the Encoder compresses the input into a fixed-length vector regardless of the length of the input sequence, so the characteristics of long sequences cannot be captured. Attention adds a mechanism that lets the Decoder take the whole (variable-length) input sequence on the Encoder side into account, in order to solve this problem.
To explain Attention very roughly, it performs the following two operations:

1. Pass the hidden-layer vectors of every Encoder time step to the Decoder.
2. At each Decoder time step, select which of those Encoder hidden-layer vectors to pay attention to.

In 1., the number of hidden-layer vectors on the Encoder side depends on the length of the Encoder input, so the shape of what is passed reflects the sequence length. In 2., a hard "select" operation cannot be differentiated, so instead the choice of where to attend is expressed probabilistically, by weighting each element with $softmax$.
For simplicity, the figure below deals with the case where the Encoder side has three input tokens w1, w2, w3 and the Decoder side has two, w'1 and w'2.
① Let the hidden-layer values on the Encoder side be $h_1, h_2, \cdots, h_n$, and pass $hs = [h_1, h_2, \cdots, h_n]$ to every layer on the Decoder side.
② Compute the inner product of each Decoder hidden-layer vector (here $d_i$) and each vector $h_1, h_2, \cdots$ of $hs$. This amounts to computing how similar each Decoder vector is to each vector of $hs$. (The inner product is written $(\cdot, \cdot)$.)
③ Convert the inner products computed in ② into probabilities with $softmax$ (these are called the attention weights).
④ Weight each element of $hs$ by its attention weight and sum them all into a single vector (this is called the context vector).
⑤ Concatenate the context vector and $d_i$ into a single vector. (A minimal numeric sketch of steps ② to ⑤ follows below.)
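To make steps ② to ⑤ concrete, here is a minimal PyTorch sketch for a single Decoder step with toy sizes (the names hs, d_i, etc. and the sizes are my own illustration, not the implementation used later in this article):

import torch

n, hidden = 3, 4                       # 3 Encoder steps, hidden size 4
hs = torch.randn(n, hidden)            # ① the Encoder hidden states [h_1, h_2, h_3]
d_i = torch.randn(hidden)              # hidden state of one Decoder step

score = hs @ d_i                                           # ② inner products, shape (3,)
attention_weight = torch.softmax(score, dim=0)             # ③ probabilities summing to 1
context = (attention_weight.unsqueeze(1) * hs).sum(dim=0)  # ④ weighted sum, shape (4,)
combined = torch.cat([context, d_i])                       # ⑤ concatenation, shape (8,)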
- Add the processing of ① to ⑤ explained above to the Decoder side and you are done.
- As in Deep Learning from Scratch 2, we deal with the date-format conversion problem (because visualizing the attention weights makes it easy to check how plausibly the model has learned).
- Everything below was implemented on Google Colab.
- Since we only add Attention processing to the Seq2Seq implementation explained last time, most of the previous source is reused; please also refer to the previous article, "I implemented Seq2Seq with PyTorch".
Let's solve the task of converting dates written in various formats, like the following, into the YYYY-MM-DD format with Attention Seq2Seq.
Before conversion | After conversion |
---|---|
Nobenver, 30, 1995 | 1995-11-30 |
Monday, July 9, 2001 | 2001-07-09 |
1/23/01 | 2001-01-23 |
WEDNESDAY, AUGUST 1, 2001 | 2001-08-01 |
sep 7, 1981 | 1981-09-07 |
We borrow the data from the GitHub repository of Deep Learning from Scratch 2: https://github.com/oreilly-japan/deep-learning-from-scratch-2/tree/master/dataset
Put this file on Google Drive and split each line into the parts before and after conversion as follows.
from sklearn.model_selection import train_test_split
import random
from sklearn.utils import shuffle
# Mount Google Drive in advance and store date.txt in the following location
file_path = "drive/My Drive/Colab Notebooks/date.txt"

input_date = []   # date data before conversion
output_date = []  # date data after conversion

# Read date.txt line by line, split each line before/after conversion, and separate into input and output
with open(file_path, "r") as f:
    date_list = f.readlines()
    for date in date_list:
        date = date[:-1]
        input_date.append(date.split("_")[0])
        output_date.append("_" + date.split("_")[1])

# Get the length of the input and output sequences
# Since they are all the same length, take len() of the 0th element
input_len = len(input_date[0])    # 29
output_len = len(output_date[0])  # 10

# Assign an ID to every character that appears in date.txt
char2id = {}
for input_chars, output_chars in zip(input_date, output_date):
    for c in input_chars:
        if not c in char2id:
            char2id[c] = len(char2id)
    for c in output_chars:
        if not c in char2id:
            char2id[c] = len(char2id)

input_data = []   # ID-converted date data before conversion
output_data = []  # ID-converted date data after conversion
for input_chars, output_chars in zip(input_date, output_date):
    input_data.append([char2id[c] for c in input_chars])
    output_data.append([char2id[c] for c in output_chars])

# Split into train and test at a ratio of 7:3
train_x, test_x, train_y, test_y = train_test_split(input_data, output_data, train_size=0.7)

# Define a function to split the data into mini-batches
def train2batch(input_data, output_data, batch_size=100):
    input_batch = []
    output_batch = []
    input_shuffle, output_shuffle = shuffle(input_data, output_data)
    for i in range(0, len(input_data), batch_size):
        input_batch.append(input_shuffle[i:i+batch_size])
        output_batch.append(output_shuffle[i:i+batch_size])
    return input_batch, output_batch
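As a quick sanity check (a sketch of my own, assuming the data above loaded correctly), you can inspect what train2batch returns:

input_batch, output_batch = train2batch(train_x, train_y, batch_size=100)
print(len(input_batch))        # number of mini-batches
print(len(input_batch[0]))     # samples per batch (100, except possibly the last batch)
print(len(input_batch[0][0]))  # 29 -- character IDs per input sample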
Encoder

- The Encoder side is almost the same as the seq2seq implemented last time.
- For a small change of pace, I switched the LSTM to a GRU.
- Since the hidden-layer value of every GRU step is used for Attention on the Decoder side, the first return value of the GRU ($hs$) is also kept.
import torch
import torch.nn as nn
import torch.optim as optim
#Various parameters, etc.
embedding_dim = 200
hidden_dim = 128
BATCH_NUM = 100
vocab_size = len(char2id)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Encoder class
class Encoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super(Encoder, self).__init__()
        self.hidden_dim = hidden_dim
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim, padding_idx=char2id[" "])
        self.gru = nn.GRU(embedding_dim, hidden_dim, batch_first=True)

    def forward(self, sequence):
        embedding = self.word_embeddings(sequence)
        # hs holds the GRU hidden-layer vector of every step of the sequence
        # and is the ingredient used by Attention
        hs, h = self.gru(embedding)
        return hs, h
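As a rough shape check (a throwaway sketch of my own; check_encoder and dummy_input are hypothetical names, with a mini-batch of 100 sequences of length 29 as in this task):

check_encoder = Encoder(vocab_size, embedding_dim, hidden_dim).to(device)
dummy_input = torch.randint(0, vocab_size, (100, 29), device=device)  # (batch, seq_len)
hs, h = check_encoder(dummy_input)
print(hs.size())  # torch.Size([100, 29, 128]) -- hidden state at every Encoder step
print(h.size())   # torch.Size([1, 100, 128]) -- final hidden state only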
Decoder

- As on the Encoder side, the LSTM is changed to a GRU compared to last time.
- Writing down on paper what each axis of each layer's tensor means while implementing makes it much easier to keep things straight.
- To help with that, the size of each tensor in the Attention layer is noted in the comments.
# Attention Decoder class
class AttentionDecoder(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, batch_size):
        super(AttentionDecoder, self).__init__()
        self.hidden_dim = hidden_dim
        self.batch_size = batch_size
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim, padding_idx=char2id[" "])
        self.gru = nn.GRU(embedding_dim, hidden_dim, batch_first=True)
        # hidden_dim*2 because the hidden state of each GRU step and the context vector
        # computed by the Attention layer are concatenated with torch.cat, doubling the length
        self.hidden2linear = nn.Linear(hidden_dim * 2, vocab_size)
        # We want to turn the column direction into probabilities, so dim=1
        self.softmax = nn.Softmax(dim=1)

    def forward(self, sequence, hs, h):
        embedding = self.word_embeddings(sequence)
        output, state = self.gru(embedding, h)

        # Attention layer
        # hs.size() = ([100, 29, 128])
        # output.size() = ([100, 10, 128])

        # To compute, per batch, the matrix product of the Encoder output (hs) and the Decoder
        # output (output) with bmm, keep the batch axis fixed and transpose the Decoder output
        t_output = torch.transpose(output, 1, 2)  # t_output.size() = ([100, 128, 10])

        # Batched matrix multiplication with bmm
        s = torch.bmm(hs, t_output)  # s.size() = ([100, 29, 10])

        # Take softmax along the column direction (dim=1) to convert to a probability distribution
        # This value is also returned, since it is used later to visualize Attention
        attention_weight = self.softmax(s)  # attention_weight.size() = ([100, 29, 10])

        # Prepare a container to collect the context vectors
        c = torch.zeros(self.batch_size, 1, self.hidden_dim, device=device)  # c.size() = ([100, 1, 128])

        # I didn't find a way to compute the context vector for every Decoder GRU step at once,
        # so take out the attention weight for each step (the Decoder side has 10 steps because
        # the generated string is 10 characters) and build one context vector per loop iteration.
        # The batch direction can be handled collectively, so the batch axis stays as it is.
        for i in range(attention_weight.size()[2]):  # 10 loops

            # attention_weight[:,:,i].size() = ([100, 29])
            # Take the attention weight for the i-th GRU step and unsqueeze it to match the size of hs
            unsq_weight = attention_weight[:,:,i].unsqueeze(2)  # unsq_weight.size() = ([100, 29, 1])

            # Weight each vector of hs by its attention weight
            weighted_hs = hs * unsq_weight  # weighted_hs.size() = ([100, 29, 128])

            # Sum all the weighted vectors of hs to create one context vector
            weight_sum = torch.sum(weighted_hs, axis=1).unsqueeze(1)  # weight_sum.size() = ([100, 1, 128])

            c = torch.cat([c, weight_sum], dim=1)  # c grows by one context vector along dim=1 each loop

        # The zero element prepared as a placeholder is still at the front, so slice it off
        c = c[:,1:,:]

        output = torch.cat([output, c], dim=2)  # output.size() = ([100, 10, 256])
        output = self.hidden2linear(output)

        return output, state, attention_weight
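Incidentally, the for loop above can in principle be collapsed into a single batched matrix product, since every context vector is just a weighted sum of the vectors in hs. A sketch of that alternative (my own variation, not the code used in this article; the loop version above works just as well):

# attention_weight: ([100, 29, 10]), hs: ([100, 29, 128])
# Move the Decoder-step axis forward and batch-multiply with hs:
# ([100, 10, 29]) x ([100, 29, 128]) -> all context vectors at once, ([100, 10, 128])
c = torch.bmm(attention_weight.transpose(1, 2), hs)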
- No particular change from last time.
encoder = Encoder(vocab_size, embedding_dim, hidden_dim).to(device)
attn_decoder = AttentionDecoder(vocab_size, embedding_dim, hidden_dim, BATCH_NUM).to(device)
#Loss function
criterion = nn.CrossEntropyLoss()
# Optimizers
encoder_optimizer = optim.Adam(encoder.parameters(), lr=0.001)
attn_decoder_optimizer = optim.Adam(attn_decoder.parameters(), lr=0.001)
- Don't forget to pass the Encoder output $hs$ to the Attention Decoder.
- Since the inputs and outputs of both the Encoder and the Decoder are unchanged, this is almost the same as the previous Seq2Seq.
- The loss decreases very quickly.
- Below, training stops once the loss drops under 0.1, which is already reached at epoch 16.
BATCH_NUM = 100
EPOCH_NUM = 100

all_losses = []
print("training ...")
for epoch in range(1, EPOCH_NUM+1):
    epoch_loss = 0
    # Split the data into mini-batches
    input_batch, output_batch = train2batch(train_x, train_y, batch_size=BATCH_NUM)
    for i in range(len(input_batch)):

        # Initialize the gradients
        encoder_optimizer.zero_grad()
        attn_decoder_optimizer.zero_grad()

        # Convert the data to tensors
        input_tensor = torch.tensor(input_batch[i], device=device)
        output_tensor = torch.tensor(output_batch[i], device=device)

        # Forward pass through the Encoder
        hs, h = encoder(input_tensor)

        # Input to the Attention Decoder
        source = output_tensor[:, :-1]

        # Ground truth for the Attention Decoder
        target = output_tensor[:, 1:]

        loss = 0
        decoder_output, _, attention_weight = attn_decoder(source, hs, h)
        for j in range(decoder_output.size()[1]):
            loss += criterion(decoder_output[:, j, :], target[:, j])

        epoch_loss += loss.item()

        # Backpropagation
        loss.backward()

        # Update the parameters
        encoder_optimizer.step()
        attn_decoder_optimizer.step()

    # Show the loss
    print("Epoch %d: %.2f" % (epoch, epoch_loss))
    all_losses.append(epoch_loss)
    if epoch_loss < 0.1: break

print("Done")
# training ...
# Epoch 1: 1500.33
# Epoch 2: 77.53
# Epoch 3: 12.98
# Epoch 4: 3.40
# Epoch 5: 1.78
# Epoch 6: 1.13
# Epoch 7: 0.78
# Epoch 8: 0.56
# Epoch 9: 0.42
# Epoch 10: 0.32
# Epoch 11: 0.25
# Epoch 12: 0.20
# Epoch 13: 0.16
# Epoch 14: 0.13
# Epoch 15: 0.11
# Epoch 16: 0.09
# Done
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(all_losses)
- Prediction is done in almost the same way as with the previous Seq2Seq.
import pandas as pd

# Return the index of the largest element in the Decoder's output tensor,
# i.e. the ID of the generated character
def get_max_index(decoder_output):
    results = []
    for h in decoder_output:
        results.append(torch.argmax(h))
    return torch.tensor(results, device=device).view(BATCH_NUM, 1)

# Evaluation data
test_input_batch, test_output_batch = train2batch(test_x, test_y)
input_tensor = torch.tensor(test_input_batch, device=device)

predicts = []
for i in range(len(test_input_batch)):
    with torch.no_grad():
        hs, encoder_state = encoder(input_tensor[i])

        # The Decoder's first input is "_", which marks the start of string generation,
        # so create a tensor of "_" of batch size
        start_char_batch = [[char2id["_"]] for _ in range(BATCH_NUM)]
        decoder_input_tensor = torch.tensor(start_char_batch, device=device)

        decoder_hidden = encoder_state
        batch_tmp = torch.zeros(100, 1, dtype=torch.long, device=device)
        for _ in range(output_len - 1):
            decoder_output, decoder_hidden, _ = attn_decoder(decoder_input_tensor, hs, decoder_hidden)
            # The predicted character becomes, as it is, the input of the next Decoder step
            decoder_input_tensor = get_max_index(decoder_output.squeeze())
            batch_tmp = torch.cat([batch_tmp, decoder_input_tensor], dim=1)

        predicts.append(batch_tmp[:,1:])
# The prediction results are hard to read if the IDs are left as they are, so define a
# dictionary that converts IDs back into characters and restore the original strings
id2char = {}
for k, v in char2id.items():
    id2char[v] = k

row = []
for i in range(len(test_input_batch)):
    batch_input = test_input_batch[i]
    batch_output = test_output_batch[i]
    batch_predict = predicts[i]
    for inp, output, predict in zip(batch_input, batch_output, batch_predict):
        x = [id2char[idx] for idx in inp]
        y = [id2char[idx] for idx in output[1:]]
        p = [id2char[idx.item()] for idx in predict]

        x_str = "".join(x)
        y_str = "".join(y)
        p_str = "".join(p)

        judge = "O" if y_str == p_str else "X"
        row.append([x_str, y_str, p_str, judge])

predict_df = pd.DataFrame(row, columns=["input", "answer", "predict", "judge"])
predict_df.head()
- It happened not to be 100% this time, but the accuracy should be roughly 100%.
print(len(predict_df.query('judge == "O"')) / len(predict_df))
# 0.9999333333333333
predict_df.query('judge == "X"').head(10)
- Only the one case below was answered incorrectly.
- When the model does make a mistake on this task, it seems to be mostly with slash-separated date formats like the one below.
- Let's visualize the attention weights, which is one of the real attractions of Attention.
- Looking at the attention weights lets you check how plausibly the model has learned.
- A heatmap is commonly used to visualize attention weights, so we use a seaborn heatmap here.
- We feed in the first mini-batch of the test data (the 3 side of the 7:3 split).
import seaborn as sns
import pandas as pd

input_batch, output_batch = train2batch(test_x, test_y, batch_size=BATCH_NUM)
input_minibatch, output_minibatch = input_batch[0], output_batch[0]

with torch.no_grad():
    # Convert the data to tensors
    input_tensor = torch.tensor(input_minibatch, device=device)
    output_tensor = torch.tensor(output_minibatch, device=device)
    hs, h = encoder(input_tensor)
    source = output_tensor[:, :-1]
    decoder_output, _, attention_weight = attn_decoder(source, hs, h)

for i in range(3):
    with torch.no_grad():
        df = pd.DataFrame(data=torch.transpose(attention_weight[i], 0, 1).cpu().numpy(),
                          columns=[id2char[idx.item()] for idx in input_tensor[i]],
                          index=[id2char[idx.item()] for idx in output_tensor[i][1:]])

        plt.figure(figsize=(12, 8))
        sns.heatmap(df, xticklabels=1, yticklabels=1, square=True, linewidths=.3,
                    cbar_kws=dict(use_gridspec=False, location="top"))
It is a little hard to see, but the characters "Tuesday, March 27, 2012" along the bottom of the figure above are the characters before conversion (the Encoder input), and "2012-03-27" arranged vertically on the left are the generated characters. The way to read the heatmap is: for each character generated by the Decoder, look along its row; the brighter a cell, the more strongly the model attended to the Encoder character in that column when generating it. (Please point out if this is wrong...) (And of course, the values in each row sum to 1.)
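That last point can be verified numerically (a quick check of my own, reusing the attention_weight computed in the cell above):

# softmax was taken over the Encoder positions (dim=1), so summing over them gives 1
# for every generated character: a tensor of ten values, all approximately 1.0
print(attention_weight[0].sum(dim=0))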
In the example above, you can see the following.
- Broadly, you can see that it attends to the year part of the input when generating YYYY and to the month part when generating MM.
- This task converts to YYYY-MM-DD, that is, the day of the week is not needed, so none of the generated characters attend to "Tuesday".
- The "0" attends to the "a" of "March". "May" becomes "05" and "March" becomes "03", so once the letters "Ma" have appeared the "0" is already determined; only the following "rch" settles which month it is, which may be why the final "3" attends to the "h".
Here are some more examples of how the Attention behaves ↓
- As described in Deep Learning from Scratch 2, there seem to be various other patterns of Attention.
- Next I plan to tackle Self-Attention, which is more versatile than this form of Attention!
end