I tried to implement sentence classification with Self Attention in PyTorch

This article is the Day 25 entry of the PyTorch Advent Calendar 2019!

Introduction

Last time, I implemented Attention in an Encoder-Decoder model; this time I will implement sentence classification with Self Attention.

Sentence embeddings based on Self Attention were introduced in the following paper, which is also cited in the famous Transformer paper "Attention Is All You Need".

This article implements the Self Attention introduced in this paper.

References

For the implementation, I referred heavily to the following article (much of my code follows it closely):

① [self attention] Implement a document classification model that can easily visualize the reason for prediction

The implementation also uses torchtext, which makes preprocessing convenient; for torchtext I referred to the following article by the same author:

② Easy and deep natural language processing with torchtext

How it works

The mechanism of the paper is also briefly explained in Reference ①, but roughly speaking the algorithm consists of the following three steps.

  1. Encode a sentence of length $n$ with a Bidirectional LSTM whose hidden layer has dimension $u$, obtaining the hidden states $h_i$, which form an $n \times 2u$ matrix ((a) in the figure below)
  2. Compute Attention with a neural network that takes each Bidirectional LSTM hidden state as input, obtaining $A = (A_{ij})$, $1 \leq i \leq r$, $1 \leq j \leq n$ ((b) in the figure below)
  3. Weight the hidden-state vectors of the Bidirectional LSTM by the Attention weights $A_{ij}$ to obtain the sentence embedding fed to the neural network classifier

Here, $d_a$ and $r$ used when computing Attention are hyperparameters: $d_a$ is the size of the weight matrix of the neural network that predicts Attention, and $r$ corresponds to how many Attention layers are stacked.
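For reference, in the paper's notation these steps amount to the following: with $H$ the $n \times 2u$ matrix whose rows are the hidden states $h_i$, the Attention matrix is

$A = \mathrm{softmax}(W_{s2} \tanh(W_{s1} H^{\top}))$

where $W_{s1}$ is a $d_a \times 2u$ weight matrix and $W_{s2}$ is an $r \times d_a$ weight matrix, so $A$ is $r \times n$; the sentence embedding is then the $r \times 2u$ matrix $M = AH$.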

[Figure: the Self Attention architecture from the paper, with (a) the Bidirectional LSTM hidden states and (b) the Attention computation]

The idea is very simple: the point is to let the neural network learn which words should be emphasized (weighted) when classifying a sentence.

Implementation

Now let's implement the above mechanism in PyTorch. The task is negative/positive classification of IMDb movie reviews. The data can be downloaded from the following:

  • http://ai.stanford.edu/~amaas/data/sentiment/
  • The following implementation example is written on the assumption that it will run on Google Colab.

Library import

- Import the various libraries used in the implementation.
- Since the dataset is in English, a morphological analyzer isn't really necessary, but for now I prepared a function that does some preprocessing with nltk (it partly overlaps with torchtext's own preprocessing, but I don't mind for now). Please refer to here for nltk.

# torchtext
import torchtext
from torchtext import data
from torchtext import datasets
from torchtext.vocab import GloVe
from torchtext.vocab import Vectors

# pytorch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torch

#Others
import os
import pickle
import numpy as np
import pandas as pd
from itertools import chain
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

#Finally, it is used to visualize attention.
import itertools
import random
from IPython.display import display, HTML

#For preprocessing by nltk
import re
import nltk
from nltk import stem
nltk.download('punkt')

#Tokenizer / preprocessing function using nltk
def nltk_analyzer(text):
    stemmer = stem.LancasterStemmer()
    text = re.sub(re.compile(r'[!-\/:-@[-`{-~]'), ' ', text) #Strip symbols and punctuation
    text = text.replace('\n', '') #Delete line breaks
    text = text.replace('\t', '') #Delete tabs
    morph = [stemmer.stem(word) for word in nltk.word_tokenize(text)] #Tokenize, then stem each token
    return morph
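As a quick check, the function turns a raw review string into a list of lowercased, stemmed tokens (this is just an illustration; the exact output depends on the Lancaster stemmer):

print(nltk_analyzer("This movie was surprisingly good!"))
# -> a list of lowercased, stemmed tokens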

Data preparation

- Download the dataset from the URL above and prepare tsv files in the following format.
- Prepare both train and test.
- The label is encoded as a number: 0 for positive and 1 for negative.

Reference

As an example, here is how I prepared the data.

train_pos_dir = 'aclImdb/train/pos/'
train_neg_dir = 'aclImdb/train/neg/'

test_pos_dir = 'aclImdb/test/pos/'
test_neg_dir = 'aclImdb/test/neg/'

header = ['text', 'label', 'label_id']

train_pos_files = os.listdir(train_pos_dir)
train_neg_files = os.listdir(train_neg_dir)
test_pos_files = os.listdir(test_pos_dir)
test_neg_files = os.listdir(test_neg_dir)


def make_row(root_dir, files, label, idx):
    row = []
    for file in files:
        tmp = []
        with open(root_dir + file, 'r') as f:
            text = f.read()
            tmp.append(text)
            tmp.append(label)
            tmp.append(idx)
        row.append(tmp)
    return row

row = make_row(train_pos_dir, train_pos_files, 'pos', 0)
row += make_row(train_neg_dir, train_neg_files, 'neg', 1)
train_df = pd.DataFrame(row, columns=header)


row = make_row(test_pos_dir, test_pos_files, 'pos', 0)
row += make_row(test_neg_dir, test_neg_files, 'neg', 1)
test_df = pd.DataFrame(row, columns=header)
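To turn these dataframes into the train.tsv / test.tsv files used below, a minimal sketch might look like this (the string label column is dropped, and the output paths are assumptions; adjust them to your own layout):

# Keep only the review text and the numeric label, and write tab-separated files
# without header or index, matching the read_csv / TabularDataset calls below
train_df[['text', 'label_id']].to_csv('train.tsv', sep='\t', header=False, index=False)
test_df[['text', 'label_id']].to_csv('test.tsv', sep='\t', header=False, index=False)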

Prepare the data as above (the label column is dropped because it is no longer needed); reading the saved tsv back in gives a dataframe like the following.

train_df = pd.read_csv(imdb_dir + 'train.tsv', delimiter="\t", header=None)
train_df

[Figure: the loaded dataframe, with one column for the review text and one for the numeric label]

Pre-processing with torchtext

- torchtext lets you quickly handle preprocessing, word embeddings, mini-batching, and so on.
- For the word embeddings I used 200-dimensional GloVe vectors. You can download them through torchtext, but since I didn't want to download them every time, I borrowed glove.6B.200d.txt from here. Be careful, the file is large!

# Put train.tsv and test.tsv here
imdb_dir = "drive/My Drive/Colab Notebooks/imdb_datasets/"

# Put glove.6B.200d.txt here
word_embedding_dir = "drive/My Drive/Colab Notebooks/word_embedding_models/"

TEXT = data.Field(sequential=True, tokenize=nltk_analyzer, lower=True, include_lengths=True, batch_first=True)
LABEL = data.Field(sequential=False, use_vocab=False, is_target=True)

train, test = data.TabularDataset.splits(
      path=imdb_dir, train='train.tsv', test='test.tsv', format='tsv',
      fields=[('Text', TEXT), ('Label', LABEL)])

glove_vectors = Vectors(name=word_embedding_dir + "glove.6B.200d.txt")
TEXT.build_vocab(train, test, vectors=glove_vectors, min_freq=1)
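As a quick sanity check of what build_vocab produced (a sketch; the printed values depend on the data):

# The vocabulary built from train + test, with the GloVe vectors attached to it
print(len(TEXT.vocab))            # number of words in the vocabulary
print(TEXT.vocab.vectors.size())  # (vocabulary size, 200)
print(TEXT.vocab.itos[:10])       # the first few entries of the index-to-string mapping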

Hyperparameter settings, etc.

- There is no particular reason, but I used the following parameters.
- The number of Attention layers ($r$ in the figure above) is set to 3.

#Use the GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

BATCH_SIZE = 100 #Batch size
EMBEDDING_DIM = 200 #Word embedding dimension
LSTM_DIM = 128 #Dimension of the LSTM hidden layer
VOCAB_SIZE = TEXT.vocab.vectors.size()[0] #Vocabulary size
TAG_SIZE = 2 #Binary negative/positive classification, so the final output size is 2
DA = 64 #Size of the weight matrix when computing Attention with the neural network
R = 3 #Number of Attention layers

Model definition

Bidirectional LSTM

- Encode sentences with a Bidirectional LSTM.
- Please refer to here for the specification of PyTorch's Bidirectional LSTM.

class BiLSTMEncoder(nn.Module):
    def __init__(self, embedding_dim, lstm_dim, vocab_size):
        super(BiLSTMEncoder, self).__init__()
        self.lstm_dim = lstm_dim
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        #Initialize the embedding layer with the pretrained word vectors
        self.word_embeddings.weight.data.copy_(TEXT.vocab.vectors)

        #Set requires_grad to False so the word vectors are not updated by backpropagation
        self.word_embeddings.weight.requires_grad = False

        # bidirectional=True makes this a bidirectional LSTM
        self.bilstm = nn.LSTM(embedding_dim, lstm_dim, batch_first=True, bidirectional=True)
  
    def forward(self, text):
        embeds = self.word_embeddings(text)

        #We want the hidden state of every time step, so take the first return value
        out, _ = self.bilstm(embeds)

        #Return the forward and backward hidden states concatenated at each time step
        return out
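As a rough shape check (a sketch, assuming the cells above have already been run): the encoder takes a batch of token ids of shape (batch, sequence length) and returns the concatenated forward/backward hidden states.

enc = BiLSTMEncoder(EMBEDDING_DIM, LSTM_DIM, VOCAB_SIZE)
dummy = torch.randint(0, VOCAB_SIZE, (2, 7))  # 2 dummy sentences of 7 token ids each
print(enc(dummy).size())  # torch.Size([2, 7, 256]) since LSTM_DIM = 128 and the LSTM is bidirectional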

Self Attention layer

- Receive the hidden-state vectors of the Bidirectional LSTM and compute Attention with a neural network.
- According to the paper, Tanh() is used as the activation function, but the article in Reference ① uses ReLU(), so either seems to be fine.

class SelfAttention(nn.Module):
  def __init__(self, lstm_dim, da, r):
    super(SelfAttention, self).__init__()
    self.lstm_dim = lstm_dim
    self.da = da
    self.r = r
    self.main = nn.Sequential(
        #Since it is Bidirectional, the vector dimension of each hidden layer is doubled in size.
        nn.Linear(lstm_dim * 2, da), 
        nn.Tanh(),
        nn.Linear(da, r)
    )
  def forward(self, out):
    return F.softmax(self.main(out), dim=1)
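Similarly, a small sketch to see the shape of the attention weights: for an input of shape (batch, sequence length, lstm_dim * 2), the layer returns (batch, sequence length, r), and since the softmax is taken over dim=1, each of the r attention distributions sums to 1 over the tokens of a sentence.

attn_layer = SelfAttention(LSTM_DIM, DA, R)
dummy_out = torch.randn(2, 7, LSTM_DIM * 2)  # stand-in for the BiLSTM output
weights = attn_layer(dummy_out)
print(weights.size())      # torch.Size([2, 7, 3])
print(weights.sum(dim=1))  # each of the R columns sums to (approximately) 1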

Classifier that takes Attention into account

- Weight the hidden-state vectors with the Attention weights, and have a neural network return the prediction for the binary classification.
- Honestly, I'm not sure what the right way to combine the Attention layers after weighting is, so for now I took the following steps.

  1. Weight the hidden-state vectors of the Bidirectional LSTM with the weights of each of the three Attention layers
  2. Sum the weighted vectors over the sentence to get m1, m2, and m3
  3. Concatenate the three vectors m1, m2, and m3 as they are (the dimension becomes lstm_dim * 2 * 3)

class SelfAttentionClassifier(nn.Module):
  def __init__(self, lstm_dim, da, r, tagset_size):
    super(SelfAttentionClassifier, self).__init__()
    self.lstm_dim = lstm_dim
    self.r = r
    self.attn = SelfAttention(lstm_dim, da, r)
    self.main = nn.Linear(lstm_dim * 6, tagset_size)

  def forward(self, out):
    attention_weight = self.attn(out)
    m1 = (out * attention_weight[:,:,0].unsqueeze(2)).sum(dim=1)
    m2 = (out * attention_weight[:,:,1].unsqueeze(2)).sum(dim=1)
    m3 = (out * attention_weight[:,:,2].unsqueeze(2)).sum(dim=1)
    feats = torch.cat([m1, m2, m3], dim=1)
    return F.log_softmax(self.main(feats), dim=1), attention_weight

Model declaration

encoder = BiLSTMEncoder(EMBEDDING_DIM, LSTM_DIM, VOCAB_SIZE).to(device)
classifier = SelfAttentionClassifier(LSTM_DIM, DA, R, TAG_SIZE).to(device)
loss_function = nn.NLLLoss()

#Use itertools.chain to pass the parameters of both models to a single optimizer
optimizer = optim.Adam(chain(encoder.parameters(), classifier.parameters()), lr=0.001)

train_iter, test_iter = data.Iterator.splits((train, test), batch_sizes=(BATCH_SIZE, BATCH_SIZE), device=device, repeat=False, sort=False)

Training

- For now, I trained for 10 epochs.
- The loss decreases steadily, so it looks fine for now.

losses = []
for epoch in range(10):
    all_loss = 0

    for idx, batch in enumerate(train_iter):
        batch_loss = 0
        encoder.zero_grad()
        classifier.zero_grad()

        text_tensor = batch.Text[0]
        label_tensor = batch.Label
        out = encoder(text_tensor)
        score, attn = classifier(out)
        batch_loss = loss_function(score, label_tensor)
        batch_loss.backward()
        optimizer.step()
        all_loss += batch_loss.item()
    print("epoch", epoch, "\t" , "loss", all_loss)
#epoch 0 	 loss 97.37978366017342
#epoch 1 	 loss 50.07680431008339
#epoch 2 	 loss 27.79373042844236
#epoch 3 	 loss 9.353876578621566
#epoch 4 	 loss 1.9509600398596376
#epoch 5 	 loss 0.22650832029466983
#epoch 6 	 loss 0.021685686125238135
#epoch 7 	 loss 0.011305359620109812
#epoch 8 	 loss 0.007448446772286843
#epoch 9 	 loss 0.005398457038154447

Prediction & accuracy

- The accuracy wasn't as good as I had hoped...
- Reference ① reported an accuracy of around 90%, so the places where my implementation differs probably backfired somewhere...

answer = []
prediction = []
with torch.no_grad():
    for batch in test_iter:

        text_tensor = batch.Text[0]
        label_tensor = batch.Label
    
        out = encoder(text_tensor)
        score, _ = classifier(out)
        _, pred = torch.max(score, 1)

        prediction += list(pred.cpu().numpy())
        answer += list(label_tensor.cpu().numpy())
print(classification_report(prediction, answer, target_names=['positive', 'negative']))
#              precision    recall  f1-score   support
#
#    positive       0.86      0.88      0.87     12103
#    negative       0.89      0.86      0.87     12897
#
#    accuracy                           0.87     25000
#   macro avg       0.87      0.87      0.87     25000
#weighted avg       0.87      0.87      0.87     25000

Attention visualization

- Highlight and visualize which words are attended to.
- The highlighting function is borrowed as-is from Reference ①.
- Please refer to here for displaying HTML in a Jupyter notebook or similar.
- The for loop and so on look a bit odd, but I just wanted to pick one example at random from the test data and run the prediction on it. Sorry for the clumsy implementation...

def highlight(word, attn):
    html_color = '#%02X%02X%02X' % (255, int(255*(1 - attn)), int(255*(1 - attn)))
    return '<span style="background-color: {}">{}</span>'.format(html_color, word)

def mk_html(sentence, attns):
    html = ""
    for word, attn in zip(sentence, attns):
        html += ' ' + highlight(
            TEXT.vocab.itos[word],
            attn
        )
    return html


id2ans = {'0': 'positive', '1':'negative'}

_, test_iter = data.Iterator.splits((train, test), batch_sizes=(1, 1), device=device, repeat=False, sort=False)

n = random.randrange(len(test_iter))

for batch in itertools.islice(test_iter, n-1,n):
    x = batch.Text[0]
    y = batch.Label
    encoder_outputs = encoder(x)
    output, attn = classifier(encoder_outputs)
    pred = output.data.max(1, keepdim=True)[1]

    display(HTML('[Answer] ' + id2ans[str(y.item())] + '\t[Prediction] ' + id2ans[str(pred.item())] + '<br><br>'))
    for i in range(attn.size()[2]):
      display(HTML(mk_html(x.data[0], attn.data[0,:,i]) + '<br><br>'))

Sorry that it comes out a bit small, but visualizing the result gives a display like the one below. The same sentence is shown three times because there are three Attention layers, and each one shows which words that layer attends to. The degree of attention differs slightly between the layers, but they all seem to attend to roughly the same words.

[Figure: the review text highlighted according to the attention weights, shown once for each of the three Attention layers]

Supplement

Without Self Attention ...

- By the way, solving this negative/positive classification with only a Bidirectional LSTM, without Self Attention, gives an accuracy of about 79.4%.
- For the LSTM-only model, use the following network and leave the other parameters as they are.
- Self Attention seems to contribute a great deal to raising the accuracy.

class BiLSTMEncoder(nn.Module):
    def __init__(self, embedding_dim, lstm_dim, vocab_size, tagset_size):
        super(BiLSTMEncoder, self).__init__()
        self.lstm_dim = lstm_dim
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.word_embeddings.weight.data.copy_(TEXT.vocab.vectors)
        self.word_embeddings.requires_grad_ = False
        self.bilstm = nn.LSTM(embedding_dim, lstm_dim, batch_first=True, bidirectional=True)
        self.hidden2tag = nn.Linear(lstm_dim * 2, tagset_size)
        self.softmax = nn.LogSoftmax(dim=1)
  
    def forward(self, text):
        embeds = self.word_embeddings(text)
        _, bilstm_hc = self.bilstm(embeds)
        bilstm_out = torch.cat([bilstm_hc[0][0], bilstm_hc[0][1]], dim=1)
        tag_space = self.hidden2tag(bilstm_out)
        tag_scores = self.softmax(tag_space.squeeze())
        return tag_scores
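To train this baseline, only the model declaration and the forward call change; a minimal sketch (the variable name model is mine):

model = BiLSTMEncoder(EMBEDDING_DIM, LSTM_DIM, VOCAB_SIZE, TAG_SIZE).to(device)
loss_function = nn.NLLLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Inside the training / evaluation loops, the model now returns the scores directly:
#   score = model(text_tensor)
#   batch_loss = loss_function(score, label_tensor)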

In conclusion

- I still don't really understand the difference between the style of Attention computed with dot products, as in the Transformer (where the word embeddings are split into query, key, and value), and the approach of this paper, which predicts Attention with a neural network. Before I knew this paper I assumed Attention always meant taking inner products everywhere, so it was a good reminder that there are various ways to compute Attention.
- Next, I want to write something about the Transformer!

end
