This article is the 25th-day entry of the PyTorch Advent Calendar 2019!
Last time, I implemented Attention in an Encoder-Decoder model; this time I will implement sentence classification with Self Attention.
Sentence embeddings with Self Attention were introduced in the following paper, which is also cited in the famous Transformer paper "Attention Is All You Need".
This article implements the Self Attention introduced in that paper.
For the implementation, I referred to the following article; my code is almost a direct adaptation of it.
I also use torchtext, which conveniently handles the preprocessing, and for torchtext I referred to the following article by the same author.
② Easy and deep natural language processing with torchtext
The mechanism of the paper is explained briefly in Reference ①; roughly, the algorithm consists of the following three steps.
Here, $d_a$ and $r$ used when computing the Attention are hyperparameters. $d_a$ is the size of the weight matrix of the Neural Network that predicts the Attention, and $r$ corresponds to how many Attention layers (hops) are stacked.
The idea is very simple: the point is to let a Neural Network learn which words should be emphasized (weighted) when classifying a sentence.
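For reference, here is my paraphrase of the paper's formulation (a sketch based on the paper, not a quotation from this article): the $n$ BiLSTM hidden states are stacked into a matrix $H \in \mathbb{R}^{n \times 2u}$, and a two-layer Neural Network with weights $W_{s1} \in \mathbb{R}^{d_a \times 2u}$ and $W_{s2} \in \mathbb{R}^{r \times d_a}$ produces the Attention matrix $A$, which then pools $H$ into the sentence embedding $M$:

$$
A = \mathrm{softmax}\left(W_{s2}\,\tanh\left(W_{s1} H^{\top}\right)\right) \in \mathbb{R}^{r \times n},
\qquad
M = A H \in \mathbb{R}^{r \times 2u}
$$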
Now let's implement the above mechanism with PyTorch. The task to solve is negative/positive classification of IMDb movie reviews. The data can be downloaded from the following.
--Import the libraries used in the implementation. --Since the dataset is in English, a morphological analyzer is not strictly necessary, but for now I prepared a function that does some preprocessing with nltk (part of it overlaps with torchtext's own preprocessing, but I did not worry about that). Please refer to here for nltk.
# torchtext
import torchtext
from torchtext import data
from torchtext import datasets
from torchtext.vocab import GloVe
from torchtext.vocab import Vectors
# pytorch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torch
#Others
import os
import pickle
import numpy as np
import pandas as pd
from itertools import chain
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Used at the end to visualize attention
import itertools
import random
from IPython.display import display, HTML
#For preprocessing by nltk
import re
import nltk
from nltk import stem
nltk.download('punkt')
# Simple tokenizer / preprocessor using nltk
def nltk_analyzer(text):
    stemmer = stem.LancasterStemmer()
    text = re.sub(re.compile(r'[!-\/:-@[-`{-~]'), ' ', text)
    text = stemmer.stem(text)
    text = text.replace('\n', '')  # remove line breaks
    text = text.replace('\t', '')  # remove tabs
    morph = nltk.word_tokenize(text)
    return morph
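As a quick check (the example sentence is my own, not from the original article), the tokenizer can be called directly:

```python
# Example call; the exact tokens depend on how the Lancaster stemmer treats the whole string
print(nltk_analyzer("This movie was surprisingly good!"))
```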
--Download the dataset from the URL above and prepare tsv files in the following format. --Prepare both train and test splits. --Encode the label as 0 for positive and 1 for negative.
For example, I prepared the data as follows.
train_pos_dir = 'aclImdb/train/pos/'
train_neg_dir = 'aclImdb/train/neg/'
test_pos_dir = 'aclImdb/test/pos/'
test_neg_dir = 'aclImdb/test/neg/'
header = ['text', 'label', 'label_id']
train_pos_files = os.listdir(train_pos_dir)
train_neg_files = os.listdir(train_neg_dir)
test_pos_files = os.listdir(test_pos_dir)
test_neg_files = os.listdir(test_neg_dir)
def make_row(root_dir, files, label, idx):
    row = []
    for file in files:
        tmp = []
        with open(root_dir + file, 'r') as f:
            text = f.read()
            tmp.append(text)
            tmp.append(label)
            tmp.append(idx)
            row.append(tmp)
    return row
row = make_row(train_pos_dir, train_pos_files, 'pos', 0)
row += make_row(train_neg_dir, train_neg_files, 'neg', 1)
train_df = pd.DataFrame(row, columns=header)
row = make_row(test_pos_dir, test_pos_files, 'pos', 0)
row += make_row(test_neg_dir, test_neg_files, 'neg', 1)
test_df = pd.DataFrame(row, columns=header)
Prepare the data as above, and finally create the following dataframe (the label column is dropped because it is not needed here).
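The save step itself is not shown here; as a minimal sketch (my assumption: write the text and label_id columns tab-separated without a header, into the directory that torchtext will read from below):

```python
# Hypothetical save step (not in the original): write tsv files without a header,
# matching how they are read back with pd.read_csv / torchtext below
train_df[['text', 'label_id']].to_csv('train.tsv', sep='\t', index=False, header=False)
test_df[['text', 'label_id']].to_csv('test.tsv', sep='\t', index=False, header=False)
```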
train_df = pd.read_csv(imdb_dir + 'train.tsv', delimiter="\t", header=None)
train_df
--With torchtext, you can quickly handle preprocessing, word embeddings (distributed representations), mini-batching, and so on. --For the distributed representation of words, I used 200-dimensional GloVe. You can download it via torchtext, but since I did not want to download it every time, I grabbed glove.6B.200d.txt from here. Be careful, the file is large!
# Put train.tsv and test.tsv here
imdb_dir = "drive/My Drive/Colab Notebooks/imdb_datasets/"
# Put glove.6B.200d.txt here
word_embedding_dir = "drive/My Drive/Colab Notebooks/word_embedding_models/"
TEXT = data.Field(sequential=True, tokenize=nltk_analyzer, lower=True, include_lengths=True, batch_first=True)
LABEL = data.Field(sequential=False, use_vocab=False, is_target=True)
train, test = data.TabularDataset.splits(
path=imdb_dir, train='train.tsv', test='test.tsv', format='tsv',
fields=[('Text', TEXT), ('Label', LABEL)])
glove_vectors = Vectors(name=word_embedding_dir + "glove.6B.200d.txt")
TEXT.build_vocab(train, test, vectors=glove_vectors, min_freq=1)
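As a quick check (my addition, not in the original), you can confirm that the vocabulary and the embedding matrix were built:

```python
# Sanity check: vocabulary size and shape of the GloVe-initialized embedding matrix
print(len(TEXT.vocab.itos))        # number of words in the vocabulary
print(TEXT.vocab.vectors.size())   # torch.Size([vocab_size, 200])
```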
--There is no particular reason for these values, but I used the following parameters. --The Attention is computed with 3 hops ($r = 3$).
# Use GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

BATCH_SIZE = 100  # batch size
EMBEDDING_DIM = 200  # dimension of the word embeddings
LSTM_DIM = 128  # dimension of the LSTM hidden layer
VOCAB_SIZE = TEXT.vocab.vectors.size()[0]  # total number of words in the vocabulary
TAG_SIZE = 2  # binary negative/positive classification, so the final output size is 2
DA = 64  # size of the weight matrix when computing the Attention with the Neural Network
R = 3  # number of Attention hops (layers)
Bidirectional LSTM
--Encode the sentences with a Bidirectional LSTM. --Please refer to here for the specification of PyTorch's Bidirectional LSTM.
class BiLSTMEncoder(nn.Module):
    def __init__(self, embedding_dim, lstm_dim, vocab_size):
        super(BiLSTMEncoder, self).__init__()
        self.lstm_dim = lstm_dim
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # Use the pretrained word vectors as the embedding weights
        self.word_embeddings.weight.data.copy_(TEXT.vocab.vectors)

        # Set requires_grad to False so the word vectors are not updated by backpropagation
        self.word_embeddings.weight.requires_grad = False

        # bidirectional=True gives a bidirectional LSTM
        self.bilstm = nn.LSTM(embedding_dim, lstm_dim, batch_first=True, bidirectional=True)

    def forward(self, text):
        embeds = self.word_embeddings(text)

        # Take the first return value because we want the output of every time step
        out, _ = self.bilstm(embeds)

        # Return the forward and backward hidden states, concatenated as they are
        return out
--Take the hidden-state vectors of the Bidirectional LSTM and compute the Attention with a Neural Network. --According to the paper, Tanh() is used as the activation function, but the article introduced in Reference ① uses ReLU(), so either one seems to be fine.
class SelfAttention(nn.Module):
    def __init__(self, lstm_dim, da, r):
        super(SelfAttention, self).__init__()
        self.lstm_dim = lstm_dim
        self.da = da
        self.r = r
        self.main = nn.Sequential(
            # Since the LSTM is bidirectional, each hidden-state vector is twice lstm_dim
            nn.Linear(lstm_dim * 2, da),
            nn.Tanh(),
            nn.Linear(da, r)
        )

    def forward(self, out):
        # Softmax over the sequence dimension so each hop's weights sum to 1
        return F.softmax(self.main(out), dim=1)
--The Attention weights are used to weight the hidden-state vectors of each time step, and a Neural Network returns the prediction for the binary classification. --Honestly, I was not sure what to do after weighting with each Attention hop, so for now I weighted and summed the hidden states with each of the 3 hops, concatenated the three resulting vectors into a single vector of size lstm_dim * 2 * 3, and fed it into a final linear layer.
class SelfAttentionClassifier(nn.Module):
    def __init__(self, lstm_dim, da, r, tagset_size):
        super(SelfAttentionClassifier, self).__init__()
        self.lstm_dim = lstm_dim
        self.r = r
        self.attn = SelfAttention(lstm_dim, da, r)
        self.main = nn.Linear(lstm_dim * 6, tagset_size)

    def forward(self, out):
        attention_weight = self.attn(out)

        # Weighted sum of the hidden states with each of the three Attention hops
        m1 = (out * attention_weight[:, :, 0].unsqueeze(2)).sum(dim=1)
        m2 = (out * attention_weight[:, :, 1].unsqueeze(2)).sum(dim=1)
        m3 = (out * attention_weight[:, :, 2].unsqueeze(2)).sum(dim=1)

        # Concatenate into a single feature vector of size lstm_dim * 2 * 3
        feats = torch.cat([m1, m2, m3], dim=1)
        return F.log_softmax(self.main(feats), dim=1), attention_weight
encoder = BiLSTMEncoder(EMBEDDING_DIM, LSTM_DIM, VOCAB_SIZE).to(device)
classifier = SelfAttentionClassifier(LSTM_DIM, DA, R, TAG_SIZE).to(device)
loss_function = nn.NLLLoss()
# Pass the parameters of both models to a single optimizer via itertools.chain
optimizer = optim.Adam(chain(encoder.parameters(), classifier.parameters()), lr=0.001)
train_iter, test_iter = data.Iterator.splits((train, test), batch_sizes=(BATCH_SIZE, BATCH_SIZE), device=device, repeat=False, sort=False)
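As another quick sanity check (my addition, not part of the original article), you can push a dummy batch through the two models and confirm the tensor shapes before training:

```python
# Dummy batch of 2 sequences of length 7 with arbitrary token ids (hypothetical check)
dummy = torch.randint(0, VOCAB_SIZE, (2, 7)).to(device)
with torch.no_grad():
    out = encoder(dummy)           # (2, 7, LSTM_DIM * 2)
    score, attn = classifier(out)  # score: (2, TAG_SIZE), attn: (2, 7, R)
print(out.size(), score.size(), attn.size())
```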
--For now, I trained for 10 epochs. --The loss decreases steadily, so it looks fine.
losses = []

for epoch in range(10):
    all_loss = 0

    for idx, batch in enumerate(train_iter):
        batch_loss = 0
        encoder.zero_grad()
        classifier.zero_grad()

        text_tensor = batch.Text[0]
        label_tensor = batch.Label
        out = encoder(text_tensor)
        score, attn = classifier(out)
        batch_loss = loss_function(score, label_tensor)
        batch_loss.backward()
        optimizer.step()
        all_loss += batch_loss.item()

    losses.append(all_loss)  # keep the per-epoch loss (used for plotting below)
    print("epoch", epoch, "\t", "loss", all_loss)
#epoch 0 loss 97.37978366017342
#epoch 1 loss 50.07680431008339
#epoch 2 loss 27.79373042844236
#epoch 3 loss 9.353876578621566
#epoch 4 loss 1.9509600398596376
#epoch 5 loss 0.22650832029466983
#epoch 6 loss 0.021685686125238135
#epoch 7 loss 0.011305359620109812
#epoch 8 loss 0.007448446772286843
#epoch 9 loss 0.005398457038154447
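Not in the original article, but since the per-epoch losses are kept in `losses` above, a quick plot is easy (assuming matplotlib is available):

```python
import matplotlib.pyplot as plt  # assumption: matplotlib is installed

plt.plot(range(len(losses)), losses)
plt.xlabel('epoch')
plt.ylabel('training loss')
plt.show()
```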
--The accuracy is not as good as I had hoped... --Reference ① reported around 90% accuracy, so the places where my implementation differs probably hurt somewhere...
answer = []
prediction = []

with torch.no_grad():
    for batch in test_iter:
        text_tensor = batch.Text[0]
        label_tensor = batch.Label

        out = encoder(text_tensor)
        score, _ = classifier(out)
        _, pred = torch.max(score, 1)

        prediction += list(pred.cpu().numpy())
        answer += list(label_tensor.cpu().numpy())

print(classification_report(prediction, answer, target_names=['positive', 'negative']))
# precision recall f1-score support
#
# positive 0.86 0.88 0.87 12103
# negative 0.89 0.86 0.87 12897
#
# accuracy 0.87 25000
# macro avg 0.87 0.87 0.87 25000
#weighted avg 0.87 0.87 0.87 25000
--Highlight and visualize which words the model attends to. --The highlighting function is borrowed as-is from Reference ①. --Please refer to here for displaying HTML in a jupyter notebook, etc. --The loop below looks a bit odd, but I just wanted to pick one example at random from the test data and predict it. Sorry for the clumsy implementation...
def highlight(word, attn):
    html_color = '#%02X%02X%02X' % (255, int(255 * (1 - attn)), int(255 * (1 - attn)))
    return '<span style="background-color: {}">{}</span>'.format(html_color, word)

def mk_html(sentence, attns):
    html = ""
    for word, attn in zip(sentence, attns):
        html += ' ' + highlight(TEXT.vocab.itos[word], attn)
    return html

id2ans = {'0': 'positive', '1': 'negative'}

_, test_iter = data.Iterator.splits((train, test), batch_sizes=(1, 1), device=device, repeat=False, sort=False)

n = random.randrange(len(test_iter))
for batch in itertools.islice(test_iter, n - 1, n):
    x = batch.Text[0]
    y = batch.Label
    encoder_outputs = encoder(x)
    output, attn = classifier(encoder_outputs)
    pred = output.data.max(1, keepdim=True)[1]

    display(HTML('[Correct answer] ' + id2ans[str(y.item())] + '\t[Prediction] ' + id2ans[str(pred.item())] + '<br><br>'))
    for i in range(attn.size()[2]):
        display(HTML(mk_html(x.data[0], attn.data[0, :, i]) + '<br><br>'))
Sorry that the image is small, but this is what the visualization looks like. The same sentence is shown three times: since there are three Attention hops, each row shows which words that hop attends to. The attended words differ slightly between the hops, but overall they attend to roughly the same words.
--By the way, when this negative/positive classification is solved with only a Bidirectional LSTM, without Self Attention, the accuracy is about 79.4%. --For the LSTM-only baseline, I used the following network and left all other parameters unchanged. --So Self Attention seems to contribute a great deal to the accuracy.
class BiLSTMEncoder(nn.Module):
    def __init__(self, embedding_dim, lstm_dim, vocab_size, tagset_size):
        super(BiLSTMEncoder, self).__init__()
        self.lstm_dim = lstm_dim
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.word_embeddings.weight.data.copy_(TEXT.vocab.vectors)
        self.word_embeddings.weight.requires_grad = False  # freeze the pretrained embeddings
        self.bilstm = nn.LSTM(embedding_dim, lstm_dim, batch_first=True, bidirectional=True)
        self.hidden2tag = nn.Linear(lstm_dim * 2, tagset_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, text):
        embeds = self.word_embeddings(text)

        # Use the final hidden states of the forward and backward directions
        _, bilstm_hc = self.bilstm(embeds)
        bilstm_out = torch.cat([bilstm_hc[0][0], bilstm_hc[0][1]], dim=1)
        tag_space = self.hidden2tag(bilstm_out)
        tag_scores = self.softmax(tag_space.squeeze())
        return tag_scores
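For completeness, here is a sketch of how I imagine the baseline is trained (my assumption; this part is not shown above). The baseline encoder returns log-probabilities directly, so a single model and optimizer suffice:

```python
# Hypothetical training loop for the LSTM-only baseline (sketch, not from the original)
baseline = BiLSTMEncoder(EMBEDDING_DIM, LSTM_DIM, VOCAB_SIZE, TAG_SIZE).to(device)
baseline_optimizer = optim.Adam(baseline.parameters(), lr=0.001)

for epoch in range(10):
    all_loss = 0
    for batch in train_iter:
        baseline.zero_grad()
        score = baseline(batch.Text[0])
        loss = loss_function(score, batch.Label)
        loss.backward()
        baseline_optimizer.step()
        all_loss += loss.item()
    print("epoch", epoch, "loss", all_loss)
```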
--I am still not quite sure about the difference between the dot-product style of Attention used in the Transformer and elsewhere (the one that splits word embeddings into query, key, and value) and this paper's approach of predicting the Attention with a Neural Network. Before I came across this paper, I had assumed Attention always meant taking inner products, so it was a good reminder that there are various ways to compute Attention. --Next, I want to write something about the Transformer!
The end.