CopyNet, the third model in the seq2seq series: explanation and implementation
Previous posts in this series: http://qiita.com/kenchin110100/items/b34f5106d5a211f4c004 and http://qiita.com/kenchin110100/items/eb70d69d1d65fb451b67
Having implemented plain seq2seq and the Attention Model in those posts, this time I implemented CopyNet.
I will first explain CopyNet, then walk through the implementation and its results.
CopyNet
To explain CopyNet, let's start with a review of Seq2Seq.
(Figure: Sequence to Sequence)
Seq2Seq is a type of Encoder-Decoder model: the Encoder converts the utterance ("How are you feeling?") into a vector, and the Decoder outputs the response ("I feel good") from that vector.
In plain Seq2Seq, only the Encoder's final intermediate vector is considered; the Attention Model also takes the other intermediate vectors into account.
(Figure: Attention Model)
Now, consider what CopyNet does when the utterance is "How are you feeling?" and the response is "I feel good."
A word from the utterance ("feel") also appears in the response, so rather than generating it from the vocabulary, the model can copy it directly from the input sentence and use it in the output.
(Figure: CopyNet; the figure is only a rough sketch)
The strength of CopyNet is that it can handle unknown words.
For example, even if a word is not in the vocabulary, it can still appear in the response by being copied from the input sentence.
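To make the idea concrete, here is a minimal toy sketch in numpy (my own illustration, not the paper's formulation; all numbers are made up). Because Copy Mode points at input positions rather than vocabulary entries, a token outside the vocabulary can still be emitted:

```python
import numpy as np

vocab = ["<unk>", "i", "am", "good"]             # toy output vocabulary
input_tokens = ["how", "are", "you", "feeling"]  # "feeling" is not in vocab

gen_probs = np.array([0.7, 0.1, 0.1, 0.1])     # Generative Mode: over the vocabulary
copy_probs = np.array([0.05, 0.05, 0.1, 0.8])  # Copy Mode: over the input positions

# Emit whichever candidate is more probable; copying points at an input
# position, so an out-of-vocabulary token can still be output
if copy_probs.max() > gen_probs.max():
    print(input_tokens[int(copy_probs.argmax())])  # -> "feeling"
else:
    print(vocab[int(gen_probs.argmax())])
```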
In the following, I will introduce two papers on CopyNet.
Jiatao Gu et al.
This is the original CopyNet paper: Gu, Jiatao, et al. "Incorporating copying mechanism in sequence-to-sequence learning." arXiv preprint arXiv:1603.06393 (2016).
(Figure from Gu, Jiatao, et al.)
The figure above is the one used in the paper; if you look at it more closely, it breaks down as in the figure below.
(Figure: CopyMode and StateUpdate)
The method proposed by Gu et al. has two main mechanisms: StateUpdate and CopyMode.
In StateUpdate, if the word input to the Decoder was copied from the input sentence, the Encoder intermediate vector at the corresponding input position is also used to update the Decoder state.
In CopyMode, if the word to be output appears in the input sentence, the Encoder intermediate vectors of the matching positions raise the probability of outputting it.
(My explanation is rather rough, so please read the paper for details ...)
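As a very rough numpy sketch of the CopyMode idea (my own simplification of the paper's scoring; all scores are made up): generation scores over the vocabulary and copy scores over the input positions are normalized by one shared softmax, and a word that can be produced both ways collects probability mass from both modes.

```python
import numpy as np

vocab = ["i", "feel", "good"]
input_tokens = ["how", "are", "you", "feel"]

gen_scores = np.array([1.0, 0.5, 2.0])        # one score per vocabulary word
copy_scores = np.array([0.1, 0.2, 0.3, 2.5])  # one score per input position

# Normalize both score vectors jointly with a single softmax
all_scores = np.concatenate([gen_scores, copy_scores])
probs = np.exp(all_scores - all_scores.max())
probs /= probs.sum()

# p(word) = its generation probability plus the copy probability of
# every input position holding the same word
p = {w: float(probs[i]) for i, w in enumerate(vocab)}
for j, w in enumerate(input_tokens):
    p[w] = p.get(w, 0.0) + float(probs[len(vocab) + j])
print(p)  # "feel" receives mass from both Generative Mode and Copy Mode
```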
Ziqiang Cao et al.
I would like to introduce one more paper related to CopyNet. Strictly speaking, it is not CopyNet, but the following paper implements a similar mechanism.
(Figure from Ziqiang Cao et al.'s paper)
This one is a little simpler; briefly, it works as follows.
(Figure: Restricted Generative Decoder)
The basic policy is to reuse the weights calculated by the Attention Model as they are.
If the word to be output does not appear in the input, the probability from Generative Mode is used as-is.
If the word to be output does appear in the input, its generation probability and the attention weight of the matching input position are balanced with λ.
The point is how to balance this λ, and λ itself is also learned. (Please read the paper for details ...)
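As a small numeric sketch of this λ balance (my own toy numbers, and a simplified reading rather than the paper's exact equations):

```python
import numpy as np

def output_prob(word, vocab, input_tokens, gen_probs, att_weights, lam):
    """Probability of emitting `word` under the lambda balance described above."""
    p_gen = gen_probs[vocab.index(word)] if word in vocab else 0.0
    if word in input_tokens:
        # The word appears in the input: mix generation and copying with lambda
        p_copy = att_weights[input_tokens.index(word)]
        return lam * p_copy + (1.0 - lam) * p_gen
    # The word is absent from the input: use the generation probability as-is
    return p_gen

vocab = ["i", "feel", "good"]
input_tokens = ["how", "are", "you", "feel"]
gen_probs = np.array([0.3, 0.2, 0.5])         # Generative Mode distribution
att_weights = np.array([0.1, 0.1, 0.2, 0.6])  # Attention Model weights (Copy Mode)

print(output_prob("feel", vocab, input_tokens, gen_probs, att_weights, lam=0.7))  # 0.48
print(output_prob("good", vocab, input_tokens, gen_probs, att_weights, lam=0.7))  # 0.5
```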
This time, I implemented the method of Ziqiang Cao et al. in Chainer. There aren't many CopyNet implementations on the net, so I apologize in advance if I've made a mistake ...
The Encoder and Decoder reuse the models from the Attention Model post as they are.
Attention
It is basically the same as the Attention Model, except that it is changed to also output the weight of each intermediate vector.
attention.py
```python
# Excerpt: the Attention base class is defined in the linked repository;
# Variable and functions come from Chainer.
from chainer import Variable, functions


class Copy_Attention(Attention):

    def __call__(self, fs, bs, h):
        """
        Attention calculation
        :param fs: list of forward Encoder intermediate vectors
        :param bs: list of backward Encoder intermediate vectors
        :param h: intermediate vector output by the Decoder
        :return att_f: weighted average of the forward Encoder intermediate vectors
        :return att_b: weighted average of the backward Encoder intermediate vectors
        :return att: weight of each intermediate vector
        """
        # Remember the size of the mini-batch
        batch_size = h.data.shape[0]
        # Initialize the lists that record the weights
        ws = []
        att = []
        # Initialize the accumulator for the total weight
        sum_w = Variable(self.ARR.zeros((batch_size, 1), dtype='float32'))
        # Compute a weight from each pair of forward and backward Encoder
        # intermediate vectors together with the Decoder intermediate vector
        for f, b in zip(fs, bs):
            w = self.hw(functions.tanh(self.fh(f) + self.bh(b) + self.hh(h)))
            # Record the raw score; it is returned as the Copy Mode weight
            att.append(w)
            # Normalize with the softmax function
            w = functions.exp(w)
            # Record the calculated weight
            ws.append(w)
            sum_w += w
        # Initialize the output weighted-average vectors
        att_f = Variable(self.ARR.zeros((batch_size, self.hidden_size), dtype='float32'))
        att_b = Variable(self.ARR.zeros((batch_size, self.hidden_size), dtype='float32'))
        for f, b, w in zip(fs, bs, ws):
            # Normalize so that the weights sum to 1
            w /= sum_w
            # Add weight * Encoder intermediate vector to each output vector
            att_f += functions.reshape(functions.batch_matmul(f, w), (batch_size, self.hidden_size))
            att_b += functions.reshape(functions.batch_matmul(b, w), (batch_size, self.hidden_size))
        att = functions.concat(att, axis=1)
        return att_f, att_b, att
```
Seq2Seq with CopyNet
The model that combines the Encoder, Decoder, and Attention is as follows.
copy_seq2seq.py
```python
import numpy as np
from chainer import Chain, Variable, cuda, links
# LSTM_Encoder, Att_LSTM_Decoder, and Copy_Attention are defined in the
# linked repository (see the previous posts in this series)


class Copy_Seq2Seq(Chain):

    def __init__(self, vocab_size, embed_size, hidden_size, batch_size, flag_gpu=True):
        super(Copy_Seq2Seq, self).__init__(
            # Forward Encoder
            f_encoder=LSTM_Encoder(vocab_size, embed_size, hidden_size),
            # Backward Encoder
            b_encoder=LSTM_Encoder(vocab_size, embed_size, hidden_size),
            # Attention Model
            attention=Copy_Attention(hidden_size, flag_gpu),
            # Decoder
            decoder=Att_LSTM_Decoder(vocab_size, embed_size, hidden_size),
            # Network for computing the weight λ
            predictor=links.Linear(hidden_size, 1)
        )
        self.vocab_size = vocab_size
        self.embed_size = embed_size
        self.hidden_size = hidden_size
        self.batch_size = batch_size
        if flag_gpu:
            self.ARR = cuda.cupy
        else:
            self.ARR = np

        # Initialize the lists that store the forward and backward Encoder
        # intermediate vectors
        self.fs = []
        self.bs = []

    def encode(self, words):
        """
        Encoder calculation
        :param words: list of the words used as input
        :return:
        """
        # Initialize the internal memory and the intermediate vector
        c = Variable(self.ARR.zeros((self.batch_size, self.hidden_size), dtype='float32'))
        h = Variable(self.ARR.zeros((self.batch_size, self.hidden_size), dtype='float32'))
        # First, compute the forward Encoder
        for w in words:
            c, h = self.f_encoder(w, c, h)
            # Record the computed intermediate vector
            self.fs.append(h)

        # Re-initialize the internal memory and the intermediate vector
        c = Variable(self.ARR.zeros((self.batch_size, self.hidden_size), dtype='float32'))
        h = Variable(self.ARR.zeros((self.batch_size, self.hidden_size), dtype='float32'))
        # Compute the backward Encoder
        for w in reversed(words):
            c, h = self.b_encoder(w, c, h)
            # Record the computed intermediate vector
            self.bs.insert(0, h)

        # Initialize the internal memory and the intermediate vector for decoding
        self.c = Variable(self.ARR.zeros((self.batch_size, self.hidden_size), dtype='float32'))
        self.h = Variable(self.ARR.zeros((self.batch_size, self.hidden_size), dtype='float32'))

    def decode(self, w):
        """
        Decoder calculation
        :param w: word to input to the Decoder
        :return t: predicted word
        :return att: attention weight for each word
        :return lambda_: weight that decides whether to favor Copy Mode or Generative Mode
        """
        # Compute the input vectors with the Attention Model
        att_f, att_b, att = self.attention(self.fs, self.bs, self.h)
        # Feed the vectors into the Decoder
        t, self.c, self.h = self.decoder(w, self.c, self.h, att_f, att_b)
        # Compute λ from the resulting intermediate vector
        lambda_ = self.predictor(self.h)
        return t, att, lambda_
```
This is actually not much different from the Attention Model. The changes are that it also outputs the attention weights and that it computes λ, which balances Copy Mode and Generative Mode.
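For reference, here is a hedged numpy sketch of how the three outputs of decode (t, att, lambda_) could be combined at prediction time; the actual generation code lives in the repository linked below, and all numbers here are made up:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    return np.exp(z) / np.exp(z).sum()

y = np.array([2.0, 1.0, 0.5])         # Generative Mode scores over the vocabulary
att = np.array([0.5, 3.0, 0.2, 0.1])  # attention scores over the input positions
lam = 1.0 / (1.0 + np.exp(-0.8))      # sigmoid(lambda_): the weight of Copy Mode

p_gen = (1.0 - lam) * softmax(y)      # weighted Generative Mode distribution
p_copy = lam * softmax(att)           # weighted Copy Mode distribution

# Emit whichever candidate scores higher: a vocabulary word ("gen")
# or a token copied from an input position ("copy")
if p_copy.max() > p_gen.max():
    print("copy: input position", int(p_copy.argmax()))
else:
    print("gen: vocabulary id", int(p_gen.argmax()))
```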
forward
The big change is in the forward function. The forward function looks at the input sentence and the word to be output, and decides whether to compute Copy Mode.
forward.py
```python
import numpy as np
from chainer import Variable, functions


def forward(enc_words, dec_words, model, ARR):
    """
    Function that computes the forward pass
    :param enc_words: input sentence
    :param dec_words: output sentence
    :param model: the model
    :param ARR: either numpy or cuda.cupy
    :return loss: the loss
    """
    # Record the batch size
    batch_size = len(enc_words[0])
    # Reset the gradients recorded in the model
    model.reset()
    # Prepare a list for checking which words appear in the input sentence
    enc_key = enc_words.T
    # Convert the sentence fed to the Encoder to Variable type
    enc_words = [Variable(ARR.array(row, dtype='int32')) for row in enc_words]
    # Encoder calculation
    model.encode(enc_words)
    # Initialize the loss
    loss = Variable(ARR.zeros((), dtype='float32'))
    # Feed <eos> to the Decoder first
    t = Variable(ARR.array([0 for _ in range(batch_size)], dtype='int32'))
    # Decoder calculation
    for w in dec_words:
        # Decode word by word
        y, att, lambda_ = model.decode(t)
        # Convert the correct word to Variable type
        t = Variable(ARR.array(w, dtype='int32'))
        # Take the log-softmax of the Generative Mode output
        s = functions.log_softmax(y)
        # Take the log-softmax of the attention weights
        att_s = functions.log_softmax(att)
        # Squash lambda into the range 0 to 1 with the sigmoid function
        lambda_s = functions.reshape(functions.sigmoid(lambda_), (batch_size,))
        # Initialize the Generative Mode loss
        Pg = Variable(ARR.zeros((), dtype='float32'))
        # Initialize the Copy Mode loss
        Pc = Variable(ARR.zeros((), dtype='float32'))
        # Initialize the loss for learning the balance lambda
        epsilon = Variable(ARR.zeros((), dtype='float32'))
        # From here on, loop over every word in the batch to compute its loss ...
        counter = 0
        for i, words in enumerate(w):
            # -1 is the label attached to words that are not learned; skip it
            if words != -1:
                # Generative Mode loss calculation
                Pg += functions.get_item(functions.get_item(s, i), words) * functions.reshape((1.0 - functions.get_item(lambda_s, i)), ())
                counter += 1
                # If the word to be output appears in the input sentence
                if words in enc_key[i]:
                    # Copy Mode loss calculation
                    Pc += functions.get_item(functions.get_item(att_s, i), list(enc_key[i]).index(words)) * functions.reshape(functions.get_item(lambda_s, i), ())
                    # Train lambda to favor Copy Mode
                    epsilon += functions.log(functions.get_item(lambda_s, i))
                # If the word to be output does not appear in the input sentence
                else:
                    # Train lambda to favor Generative Mode
                    epsilon += functions.log(1.0 - functions.get_item(lambda_s, i))
        # Normalize each loss by the number of learned words and accumulate
        Pg *= (-1.0 / np.max([1, counter]))
        Pc *= (-1.0 / np.max([1, counter]))
        epsilon *= (-1.0 / np.max([1, counter]))
        loss += Pg + Pc + epsilon
    return loss
```
In the code, three losses, Pg, Pc, and epsilon, are defined and computed in order to train Generative Mode, Copy Mode, and λ, respectively.
The point is to use functions.log_softmax. Computing log(softmax(x)) naively breaks down when the softmax underflows to 0, but this function handles that case gracefully (presumably via the usual log-sum-exp trick).
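A tiny check of the difference (assuming Chainer and numpy are installed; the values are made up):

```python
import numpy as np
import chainer.functions as F

x = np.array([[1000.0, 0.0, -1000.0]], dtype=np.float32)

# Naive log(softmax(x)): the small entries underflow to 0, so log() gives -inf
naive = F.log(F.softmax(x))
print(naive.data)  # [[ 0. -inf -inf]]

# log_softmax computes x - logsumexp(x) directly, so the result stays finite
stable = F.log_softmax(x)
print(stable.data)  # [[ 0. -1000. -2000.]]
```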
If you used the functions.softmax_cross_entropy function, none of this fiddly calculation would be needed; but since this time I want to balance the Copy Mode loss and the Generative Mode loss with λ, I compute the loss with the functions.get_item and functions.log_softmax functions instead.
If you know a better implementation, please let me know ...
The full code is here: https://github.com/kenchin110100/machine_learning/blob/master/sampleCopySeq2Seq.py
As before, I used the chat dialogue corpus from the dialogue breakdown detection challenge: https://sites.google.com/site/dialoguebreakdowndetection/chat-dialogue-corpus
We will look at the responses to the following four utterances at each epoch.
Epoch 1
```
Utterance: Good morning => Response: ['Good morning', '</s>'] ['copy', 'copy']
Utterance: How's it going? => Response: ['Condition', 'Is', 'Is', 'is', 'is', '</s>'] ['copy', 'copy', 'copy', 'copy', 'copy', 'copy']
Utterance: I'm hungry => Response: ['stomach', 'But', 'But', 'But', 'Ta', 'Ta', 'is', '</s>'] ['copy', 'copy', 'copy', 'copy', 'copy', 'copy', 'gen', 'copy']
Utterance: It's hot today => Response: ['today', 'Is', 'Is', 'is', 'is', '</s>'] ['copy', 'copy', 'copy', 'copy', 'copy', 'copy']
```
It's completely broken ...
Epoch 3
```
Utterance: Good morning => Response: ['Good morning', '</s>'] ['copy', 'copy']
Utterance: How's it going? => Response: ['Condition', 'Is', '</s>'] ['copy', 'gen', 'copy']
Utterance: I'm hungry => Response: ['stomach', '</s>'] ['copy', 'copy']
Utterance: It's hot today => Response: ['hot', 'Is', 'Like', 'is', 'Ne', '</s>'] ['copy', 'copy', 'gen', 'gen', 'gen', 'copy']
```
Epoch 5
```
Utterance: Good morning => Response: ['Good morning', '</s>'] ['copy', 'copy']
Utterance: How's it going? => Response: ['Condition', 'Is', 'Like', 'is', 'Or', '</s>'] ['copy', 'copy', 'gen', 'copy', 'gen', 'copy']
Utterance: I'm hungry => Response: ['stomach', '</s>'] ['copy', 'copy']
Utterance: It's hot today => Response: ['hot', 'is', '</s>'] ['copy', 'gen', 'copy']
```
Even though I told it I'm hungry ...
Epoch 7
```
Utterance: Good morning => Response: ['Good morning', 'Thank you', 'Masu', '</s>'] ['copy', 'gen', 'gen', 'copy']
Utterance: How's it going? => Response: ['Condition', 'Is', '</s>'] ['copy', 'gen', 'copy']
Utterance: I'm hungry => Response: ['stomach', 'But', 'Free', 'Better', 'Ta', '</s>'] ['copy', 'gen', 'copy', 'copy', 'gen', 'gen']
Utterance: It's hot today => Response: ['hot', 'is', '</s>'] ['copy', 'gen', 'copy']
```
It is now mixing copied and generated words, as in the "Good morning" response, where the greeting is copied and the polite ending is generated.
Honestly, though, I would like it to respond a little better.
Since it has to learn both Copy Mode and Generative Mode, the Decoder does not seem to learn the language model sufficiently.
This may be related to the fact that the paper was evaluated on a summarization task rather than a dialogue task. (Well, the main cause may just be my implementation ...)
I implemented CopyNet using Chainer. Having built dialogue models three times now, I have had my fill of them lol. Next time I will try something else.