When declaring a bidirectional LSTM in PyTorch, as described in the LSTM reference (https://pytorch.org/docs/stable/nn.html?highlight=lstm#torch.nn.LSTM), you only need to specify bidirectional=True, so it is very easy to handle (in Keras, it is enough to wrap the LSTM in Bidirectional).
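For comparison, the Keras side really is just a wrapper; a minimal sketch of my own, assuming the tf.keras API (return_sequences=True here simply mirrors the PyTorch example below, which returns an output for every time step):

from tensorflow.keras.layers import LSTM, Bidirectional

# Wrapping an LSTM layer in Bidirectional is all Keras needs
bilstm_layer = Bidirectional(LSTM(6, return_sequences=True))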
However, the PyTorch reference does not say much about what the output looks like once the LSTM is made bidirectional.
Even after googling, I could not find a clear explanation of the output specification of a bidirectional LSTM in PyTorch, so I summarize it here.
As you can see from References 1 and 2, a bidirectional RNN or LSTM is nothing more than a forward RNN/LSTM and a backward RNN/LSTM laid on top of each other.
Let's actually try using it.
import torch
import torch.nn as nn

# Embedding dimension of each element in the sequence is 5
# Hidden size of the LSTM layer is 6
# batch_first=True, so the input format is (batch_size, seq_len, embedding_dim)
# bidirectional=True declares a bidirectional LSTM
bilstm = nn.LSTM(5, 6, batch_first=True, bidirectional=True)

# Generate a tensor with
#   batch size 1,
#   sequence length 4,
#   embedding dimension 5
a = torch.rand(1, 4, 5)
print(a)
#tensor([[[0.1360, 0.4574, 0.4842, 0.6409, 0.1980],
# [0.0364, 0.4133, 0.0836, 0.2871, 0.3542],
# [0.7796, 0.7209, 0.1754, 0.0147, 0.6572],
# [0.1504, 0.1003, 0.6787, 0.1602, 0.6571]]])
# Like a normal LSTM, it returns two outputs, so we receive both of them
out, hc = bilstm(a)
print(out)
#tensor([[[-0.0611, 0.0054, -0.0828, 0.0416, -0.0570, -0.1117, 0.0902, -0.0747, -0.0215, -0.1434, -0.2318, 0.0783],
# [-0.1194, -0.0127, -0.2058, 0.1152, -0.1627, -0.2206, 0.0747, -0.0210, 0.0307, -0.0708, -0.2458, 0.1627],
# [-0.0163, -0.0568, -0.0266, 0.0878, -0.1461, -0.1745, 0.1097, 0.0230, 0.0353, -0.0739, -0.2186, 0.0818],
# [-0.1145, -0.0460, -0.0732, 0.0950, -0.1765, -0.2599, 0.0063, 0.0143, 0.0124, 0.0089, -0.1188, 0.0996]]],
# grad_fn=<TransposeBackward0>)
print(hc)
#(tensor([[[-0.1145, -0.0460, -0.0732, 0.0950, -0.1765, -0.2599]],
# [[ 0.0902, -0.0747, -0.0215, -0.1434, -0.2318, 0.0783]]],
# grad_fn=<StackBackward>),
#tensor([[[-0.2424, -0.1340, -0.1559, 0.3499, -0.3792, -0.5514]],
# [[ 0.1876, -0.1413, -0.0384, -0.2345, -0.4982, 0.1573]]],
# grad_fn=<StackBackward>))
Like a normal LSTM, there are two outputs, `out` and `hc`, and `hc` is returned as a tuple `hc = (h, c)`, just as with a normal LSTM. I think there are two differences from the output of a normal (unidirectional) LSTM:

- Each element of `out` is not the size of the LSTM hidden layer (6 this time) but twice that size (12 this time).
- `h` and `c` in `hc` each contain two elements, one for the forward direction and one for the backward direction.

(You can confirm both points just by checking the shapes, as in the quick check below.)
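A quick shape check, continuing from the sample above (the commented shapes are what this toy setup produces):

h, c = hc
print(out.shape)  # torch.Size([1, 4, 12])  -> (batch, seq_len, 2 * hidden_size)
print(h.shape)    # torch.Size([2, 1, 6])   -> (num_directions, batch, hidden_size)
print(c.shape)    # torch.Size([2, 1, 6])   -> (num_directions, batch, hidden_size)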
Below is a brief explanation of what these mean.
(In the figure, the cell state `c` is omitted. The Embedding layer is also drawn, but the embedding is not done by the LSTM.)
As you can see from the figure above, each element of `out` is the concatenation of the forward and backward hidden vectors at that position (which is why the dimension of each element is twice the usual size). Also, `h` in `hc = (h, c)` holds the last hidden vector of each direction, forward and backward.

In other words:

- The first half of the last element of `out` matches `h[0]` of `hc = (h, c)`.
- The second half of the first element of `out` matches `h[1]` of `hc = (h, c)`.

You can confirm this from the output of the sample code above:
print(out[:,-1][:,:6]) # first half of the last element of out
print(hc[0][0]) # last hidden vector of the forward LSTM
#tensor([[-0.1145, -0.0460, -0.0732, 0.0950, -0.1765, -0.2599]], grad_fn=<SliceBackward>)
#tensor([[-0.1145, -0.0460, -0.0732, 0.0950, -0.1765, -0.2599]], grad_fn=<SelectBackward>)
print(out[:,0][:,6:]) # second half of the first element of out
print(hc[0][1]) # last hidden vector of the backward LSTM
#tensor([[ 0.0902, -0.0747, -0.0215, -0.1434, -0.2318, 0.0783]], grad_fn=<SliceBackward>)
#tensor([[ 0.0902, -0.0747, -0.0215, -0.1434, -0.2318, 0.0783]], grad_fn=<SelectBackward>)
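As References 1 and 2 say, the backward half really is just a plain LSTM run over the reversed sequence. Here is a small check of my own (it relies on PyTorch storing the backward-direction weights under the `_reverse` suffix), which copies those weights into a unidirectional LSTM and compares the outputs:

# Copy the backward-direction weights of bilstm into a plain (unidirectional) LSTM
backward_lstm = nn.LSTM(5, 6, batch_first=True)
with torch.no_grad():
    backward_lstm.weight_ih_l0.copy_(bilstm.weight_ih_l0_reverse)
    backward_lstm.weight_hh_l0.copy_(bilstm.weight_hh_l0_reverse)
    backward_lstm.bias_ih_l0.copy_(bilstm.bias_ih_l0_reverse)
    backward_lstm.bias_hh_l0.copy_(bilstm.bias_hh_l0_reverse)

# Run it on the time-reversed input and flip the result back to the original order
out_rev, _ = backward_lstm(torch.flip(a, dims=[1]))
print(torch.allclose(out[:, :, 6:], torch.flip(out_rev, dims=[1]), atol=1e-6))
# should print True: the backward half of out is a plain LSTM over the reversed sequence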
Once you know the output specification, you can process it however you like. When turning a many-to-one model such as sentence classification into a bidirectional LSTM, there seem to be several ways to combine the two last hidden vectors returned by the LSTM: concatenating them, averaging them, or taking their element-wise product. In Keras this combination is apparently done for you (concatenation by default), but in PyTorch you need to implement it yourself. For example, if I rewrite the model from my earlier post, Sentence classification by LSTM, as a bidirectional LSTM, it looks like the following.
class LSTMClassifier(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size, batch_size=100):
        super(LSTMClassifier, self).__init__()
        self.batch_size = batch_size
        self.hidden_dim = hidden_dim
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.bilstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True, bidirectional=True)
        # The classifier receives the concatenation of the forward and backward last
        # hidden vectors, so its input size is hidden_dim * 2
        self.hidden2tag = nn.Linear(hidden_dim * 2, tagset_size)
        self.softmax = nn.LogSoftmax(dim=-1)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        _, bilstm_hc = self.bilstm(embeds)
        # bilstm_hc[0][0] -> last hidden vector of the forward LSTM
        # bilstm_hc[0][1] -> last hidden vector of the backward LSTM
        bilstm_out = torch.cat([bilstm_hc[0][0], bilstm_hc[0][1]], dim=1)
        tag_space = self.hidden2tag(bilstm_out)
        tag_scores = self.softmax(tag_space.squeeze())
        return tag_scores
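The class above concatenates the two last hidden vectors. As a rough sketch of my own for the other options mentioned above (averaging and the element-wise product; the variable names are made up for illustration), the merging step could instead look like this:

import torch
import torch.nn as nn

bilstm = nn.LSTM(5, 6, batch_first=True, bidirectional=True)
_, (h, c) = bilstm(torch.rand(1, 4, 5))
h_forward, h_backward = h[0], h[1]  # last hidden vector of each direction, shape (batch, 6)

merged_concat = torch.cat([h_forward, h_backward], dim=1)  # what LSTMClassifier does, dim 12
merged_avg = (h_forward + h_backward) / 2                  # average, dim stays 6
merged_prod = h_forward * h_backward                       # element-wise product, dim stays 6
# With the average or the product, the following Linear layer would take hidden_dim
# instead of hidden_dim * 2.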
- This may be obvious to many people, but I hope this article helps anyone who, like me, briefly wondered how to handle a bidirectional LSTM in PyTorch and had to look it up.
- By the way, a GRU also becomes a bidirectional GRU with bidirectional=True, just like the LSTM. As for the output format, if you understand the LSTM specification above, there should be no problem.
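For reference, a quick check of my own for the GRU case: the only difference in the return values is that there is no cell state, so the second return value is just h.

import torch
import torch.nn as nn

bigru = nn.GRU(5, 6, batch_first=True, bidirectional=True)
out, h = bigru(torch.rand(1, 4, 5))
print(out.shape)  # torch.Size([1, 4, 12])
print(h.shape)    # torch.Size([2, 1, 6])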
end