CopyNet, the third model in the seq2seq series: explanation and implementation
Previous posts in this series: http://qiita.com/kenchin110100/items/b34f5106d5a211f4c004 and http://qiita.com/kenchin110100/items/eb70d69d1d65fb451b67
Having implemented plain seq2seq and the Attention Model in those posts, this time I implemented CopyNet.
I will first explain CopyNet, then walk through the implementation and its results.
CopyNet
To explain CopyNet, let's start with a review of Seq2Seq.
(Figure: Sequence to Sequence)
Seq2Seq is a type of Encoder-Decoder model: the Encoder converts the utterance ("How are you feeling?") into a vector, and the Decoder outputs the response ("I feel good") from that vector.
In plain Seq2Seq, only the Encoder's final intermediate vector is considered; the Attention Model also takes the other intermediate vectors into account.
(Figure: Attention Model)
Now, consider what CopyNet does when the utterance is "How are you feeling?" and the response is "I feel good."
A word from the utterance ("feel") also appears in the response, so rather than generating it from the vocabulary, the model can copy it directly from the input sentence and use it in the output.
(Figure: CopyNet; the figure is only a rough sketch)
The strength of CopyNet is that it can handle unknown words.
For example, even if a word is not in the vocabulary, it can still appear in the response by being copied from the input sentence.
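To make the idea concrete, here is a minimal toy sketch in numpy (my own illustration, not the paper's formulation; all numbers are made up). Because Copy Mode points at input positions rather than vocabulary entries, a token outside the vocabulary can still be emitted:

```python
import numpy as np

vocab = ["<unk>", "i", "am", "good"]             # toy output vocabulary
input_tokens = ["how", "are", "you", "feeling"]  # "feeling" is not in vocab

gen_probs = np.array([0.7, 0.1, 0.1, 0.1])     # Generative Mode: over the vocabulary
copy_probs = np.array([0.05, 0.05, 0.1, 0.8])  # Copy Mode: over the input positions

# Emit whichever candidate is more probable; copying points at an input
# position, so an out-of-vocabulary token can still be output
if copy_probs.max() > gen_probs.max():
    print(input_tokens[int(copy_probs.argmax())])  # -> "feeling"
else:
    print(vocab[int(gen_probs.argmax())])
```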
In the following, I will introduce two papers on CopyNet.
Jiatao Gu et al.
This is the original CopyNet paper: Gu, Jiatao, et al. "Incorporating copying mechanism in sequence-to-sequence learning." arXiv preprint arXiv:1603.06393 (2016).
(Figure from Gu, Jiatao, et al.)
The figure above is the one used in the paper; if you look at it more closely, it breaks down as in the figure below.
(Figure: CopyMode and StateUpdate)
The method proposed by Gu et al. has two main mechanisms: StateUpdate and CopyMode.
In StateUpdate, if the word input to the Decoder was copied from the input sentence, the Encoder intermediate vector at the corresponding input position is also used to update the Decoder state.
In CopyMode, if the word to be output appears in the input sentence, the Encoder intermediate vectors of the matching positions raise the probability of outputting it.
(My explanation is rather rough, so please read the paper for details ...)
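As a very rough numpy sketch of the CopyMode idea (my own simplification of the paper's scoring; all scores are made up): generation scores over the vocabulary and copy scores over the input positions are normalized by one shared softmax, and a word that can be produced both ways collects probability mass from both modes.

```python
import numpy as np

vocab = ["i", "feel", "good"]
input_tokens = ["how", "are", "you", "feel"]

gen_scores = np.array([1.0, 0.5, 2.0])        # one score per vocabulary word
copy_scores = np.array([0.1, 0.2, 0.3, 2.5])  # one score per input position

# Normalize both score vectors jointly with a single softmax
all_scores = np.concatenate([gen_scores, copy_scores])
probs = np.exp(all_scores - all_scores.max())
probs /= probs.sum()

# p(word) = its generation probability plus the copy probability of
# every input position holding the same word
p = {w: float(probs[i]) for i, w in enumerate(vocab)}
for j, w in enumerate(input_tokens):
    p[w] = p.get(w, 0.0) + float(probs[len(vocab) + j])
print(p)  # "feel" receives mass from both Generative Mode and Copy Mode
```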
Ziqiang Cao et al.
I would like to introduce one more paper related to CopyNet. Strictly speaking, it is not CopyNet, but the following paper implements a similar mechanism.
(Figure from Ziqiang Cao et al.'s paper)
This one is a little simpler; briefly, it works as follows.
(Figure: Restricted Generative Decoder)
The basic policy is to reuse the weights calculated by the Attention Model as they are.
If the word to be output does not appear in the input, the probability from Generative Mode is used as-is.
If the word to be output does appear in the input, its generation probability and the attention weight of the matching input position are balanced with λ.
The point is how to balance this λ, and λ itself is also learned. (Please read the paper for details ...)
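As a small numeric sketch of this λ balance (my own toy numbers, and a simplified reading rather than the paper's exact equations):

```python
import numpy as np

def output_prob(word, vocab, input_tokens, gen_probs, att_weights, lam):
    """Probability of emitting `word` under the lambda balance described above."""
    p_gen = gen_probs[vocab.index(word)] if word in vocab else 0.0
    if word in input_tokens:
        # The word appears in the input: mix generation and copying with lambda
        p_copy = att_weights[input_tokens.index(word)]
        return lam * p_copy + (1.0 - lam) * p_gen
    # The word is absent from the input: use the generation probability as-is
    return p_gen

vocab = ["i", "feel", "good"]
input_tokens = ["how", "are", "you", "feel"]
gen_probs = np.array([0.3, 0.2, 0.5])         # Generative Mode distribution
att_weights = np.array([0.1, 0.1, 0.2, 0.6])  # Attention Model weights (Copy Mode)

print(output_prob("feel", vocab, input_tokens, gen_probs, att_weights, lam=0.7))  # 0.48
print(output_prob("good", vocab, input_tokens, gen_probs, att_weights, lam=0.7))  # 0.5
```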
This time, I implemented the method of Ziqiang Cao et al. in Chainer. There aren't many CopyNet implementations on the net, so I apologize in advance if I've made a mistake ...
The Encoder and Decoder reuse the models from the Attention Model post as they are.
Attention
It is basically the same as the Attention Model, except that it is changed to also output the weight of each intermediate vector.
attention.py
```python
# Excerpt: the Attention base class is defined in the linked repository;
# Variable and functions come from Chainer.
from chainer import Variable, functions


class Copy_Attention(Attention):

    def __call__(self, fs, bs, h):
        """
        Attention calculation
        :param fs: list of forward Encoder intermediate vectors
        :param bs: list of backward Encoder intermediate vectors
        :param h: intermediate vector output by the Decoder
        :return att_f: weighted average of the forward Encoder intermediate vectors
        :return att_b: weighted average of the backward Encoder intermediate vectors
        :return att: weight of each intermediate vector
        """
        # Remember the size of the mini-batch
        batch_size = h.data.shape[0]
        # Initialize the lists that record the weights
        ws = []
        att = []
        # Initialize the accumulator for the total weight
        sum_w = Variable(self.ARR.zeros((batch_size, 1), dtype='float32'))
        # Compute a weight from each pair of forward and backward Encoder
        # intermediate vectors together with the Decoder intermediate vector
        for f, b in zip(fs, bs):
            w = self.hw(functions.tanh(self.fh(f) + self.bh(b) + self.hh(h)))
            # Record the raw score; it is returned as the Copy Mode weight
            att.append(w)
            # Normalize with the softmax function
            w = functions.exp(w)
            # Record the calculated weight
            ws.append(w)
            sum_w += w
        # Initialize the output weighted-average vectors
        att_f = Variable(self.ARR.zeros((batch_size, self.hidden_size), dtype='float32'))
        att_b = Variable(self.ARR.zeros((batch_size, self.hidden_size), dtype='float32'))
        for f, b, w in zip(fs, bs, ws):
            # Normalize so that the weights sum to 1
            w /= sum_w
            # Add weight * Encoder intermediate vector to each output vector
            att_f += functions.reshape(functions.batch_matmul(f, w), (batch_size, self.hidden_size))
            att_b += functions.reshape(functions.batch_matmul(b, w), (batch_size, self.hidden_size))
        att = functions.concat(att, axis=1)
        return att_f, att_b, att
```
Seq2Seq with CopyNet
The model that combines the Encoder, Decoder, and Attention is as follows.
copy_seq2seq.py
```python
import numpy as np
from chainer import Chain, Variable, cuda, links
# LSTM_Encoder, Att_LSTM_Decoder, and Copy_Attention are defined in the
# linked repository (see the previous posts in this series)


class Copy_Seq2Seq(Chain):

    def __init__(self, vocab_size, embed_size, hidden_size, batch_size, flag_gpu=True):
        super(Copy_Seq2Seq, self).__init__(
            # Forward Encoder
            f_encoder=LSTM_Encoder(vocab_size, embed_size, hidden_size),
            # Backward Encoder
            b_encoder=LSTM_Encoder(vocab_size, embed_size, hidden_size),
            # Attention Model
            attention=Copy_Attention(hidden_size, flag_gpu),
            # Decoder
            decoder=Att_LSTM_Decoder(vocab_size, embed_size, hidden_size),
            # Network for computing the weight λ
            predictor=links.Linear(hidden_size, 1)
        )
        self.vocab_size = vocab_size
        self.embed_size = embed_size
        self.hidden_size = hidden_size
        self.batch_size = batch_size
        if flag_gpu:
            self.ARR = cuda.cupy
        else:
            self.ARR = np

        # Initialize the lists that store the forward and backward Encoder
        # intermediate vectors
        self.fs = []
        self.bs = []

    def encode(self, words):
        """
        Encoder calculation
        :param words: list of the words used as input
        :return:
        """
        # Initialize the internal memory and the intermediate vector
        c = Variable(self.ARR.zeros((self.batch_size, self.hidden_size), dtype='float32'))
        h = Variable(self.ARR.zeros((self.batch_size, self.hidden_size), dtype='float32'))
        # First, compute the forward Encoder
        for w in words:
            c, h = self.f_encoder(w, c, h)
            # Record the computed intermediate vector
            self.fs.append(h)

        # Re-initialize the internal memory and the intermediate vector
        c = Variable(self.ARR.zeros((self.batch_size, self.hidden_size), dtype='float32'))
        h = Variable(self.ARR.zeros((self.batch_size, self.hidden_size), dtype='float32'))
        # Compute the backward Encoder
        for w in reversed(words):
            c, h = self.b_encoder(w, c, h)
            # Record the computed intermediate vector
            self.bs.insert(0, h)

        # Initialize the internal memory and the intermediate vector for decoding
        self.c = Variable(self.ARR.zeros((self.batch_size, self.hidden_size), dtype='float32'))
        self.h = Variable(self.ARR.zeros((self.batch_size, self.hidden_size), dtype='float32'))

    def decode(self, w):
        """
        Decoder calculation
        :param w: word to input to the Decoder
        :return t: predicted word
        :return att: attention weight for each word
        :return lambda_: weight that decides whether to favor Copy Mode or Generative Mode
        """
        # Compute the input vectors with the Attention Model
        att_f, att_b, att = self.attention(self.fs, self.bs, self.h)
        # Feed the vectors into the Decoder
        t, self.c, self.h = self.decoder(w, self.c, self.h, att_f, att_b)
        # Compute λ from the resulting intermediate vector
        lambda_ = self.predictor(self.h)
        return t, att, lambda_
```
This is actually not much different from the Attention Model. The changes are that it also outputs the attention weights and that it computes λ, which balances Copy Mode and Generative Mode.
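For reference, here is a hedged numpy sketch of how the three outputs of decode (t, att, lambda_) could be combined at prediction time; the actual generation code lives in the repository linked below, and all numbers here are made up:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    return np.exp(z) / np.exp(z).sum()

y = np.array([2.0, 1.0, 0.5])         # Generative Mode scores over the vocabulary
att = np.array([0.5, 3.0, 0.2, 0.1])  # attention scores over the input positions
lam = 1.0 / (1.0 + np.exp(-0.8))      # sigmoid(lambda_): the weight of Copy Mode

p_gen = (1.0 - lam) * softmax(y)      # weighted Generative Mode distribution
p_copy = lam * softmax(att)           # weighted Copy Mode distribution

# Emit whichever candidate scores higher: a vocabulary word ("gen")
# or a token copied from an input position ("copy")
if p_copy.max() > p_gen.max():
    print("copy: input position", int(p_copy.argmax()))
else:
    print("gen: vocabulary id", int(p_gen.argmax()))
```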
forward
The big change is in the forward function. The forward function looks at the input sentence and the word to be output, and decides whether to compute Copy Mode.
forward.py
```python
import numpy as np
from chainer import Variable, functions


def forward(enc_words, dec_words, model, ARR):
    """
    Function that computes the forward pass
    :param enc_words: input sentence
    :param dec_words: output sentence
    :param model: the model
    :param ARR: either numpy or cuda.cupy
    :return loss: the loss
    """
    # Record the batch size
    batch_size = len(enc_words[0])
    # Reset the gradients recorded in the model
    model.reset()
    # Prepare a list for checking which words appear in the input sentence
    enc_key = enc_words.T
    # Convert the sentence fed to the Encoder to Variable type
    enc_words = [Variable(ARR.array(row, dtype='int32')) for row in enc_words]
    # Encoder calculation
    model.encode(enc_words)
    # Initialize the loss
    loss = Variable(ARR.zeros((), dtype='float32'))
    # Feed <eos> to the Decoder first
    t = Variable(ARR.array([0 for _ in range(batch_size)], dtype='int32'))
    # Decoder calculation
    for w in dec_words:
        # Decode word by word
        y, att, lambda_ = model.decode(t)
        # Convert the correct word to Variable type
        t = Variable(ARR.array(w, dtype='int32'))
        # Take the log-softmax of the Generative Mode output
        s = functions.log_softmax(y)
        # Take the log-softmax of the attention weights
        att_s = functions.log_softmax(att)
        # Squash lambda into the range 0 to 1 with the sigmoid function
        lambda_s = functions.reshape(functions.sigmoid(lambda_), (batch_size,))
        # Initialize the Generative Mode loss
        Pg = Variable(ARR.zeros((), dtype='float32'))
        # Initialize the Copy Mode loss
        Pc = Variable(ARR.zeros((), dtype='float32'))
        # Initialize the loss for learning the balance lambda
        epsilon = Variable(ARR.zeros((), dtype='float32'))
        # From here on, loop over every word in the batch to compute its loss ...
        counter = 0
        for i, words in enumerate(w):
            # -1 is the label attached to words that are not learned; skip it
            if words != -1:
                # Generative Mode loss calculation
                Pg += functions.get_item(functions.get_item(s, i), words) * functions.reshape((1.0 - functions.get_item(lambda_s, i)), ())
                counter += 1
                # If the word to be output appears in the input sentence
                if words in enc_key[i]:
                    # Copy Mode loss calculation
                    Pc += functions.get_item(functions.get_item(att_s, i), list(enc_key[i]).index(words)) * functions.reshape(functions.get_item(lambda_s, i), ())
                    # Train lambda to favor Copy Mode
                    epsilon += functions.log(functions.get_item(lambda_s, i))
                # If the word to be output does not appear in the input sentence
                else:
                    # Train lambda to favor Generative Mode
                    epsilon += functions.log(1.0 - functions.get_item(lambda_s, i))
        # Normalize each loss by the number of learned words and accumulate
        Pg *= (-1.0 / np.max([1, counter]))
        Pc *= (-1.0 / np.max([1, counter]))
        epsilon *= (-1.0 / np.max([1, counter]))
        loss += Pg + Pc + epsilon
    return loss
```
In the code, three losses, Pg, Pc, and epsilon, are defined and computed in order to train Generative Mode, Copy Mode, and λ, respectively.
The point is to use functions.log_softmax. Computing log(softmax(x)) naively breaks down when the softmax underflows to 0, but this function handles that case gracefully (presumably via the usual log-sum-exp trick).
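A tiny check of the difference (assuming Chainer and numpy are installed; the values are made up):

```python
import numpy as np
import chainer.functions as F

x = np.array([[1000.0, 0.0, -1000.0]], dtype=np.float32)

# Naive log(softmax(x)): the small entries underflow to 0, so log() gives -inf
naive = F.log(F.softmax(x))
print(naive.data)  # [[ 0. -inf -inf]]

# log_softmax computes x - logsumexp(x) directly, so the result stays finite
stable = F.log_softmax(x)
print(stable.data)  # [[ 0. -1000. -2000.]]
```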
If you used the functions.softmax_cross_entropy function, none of this fiddly calculation would be needed; but since this time I want to balance the Copy Mode loss and the Generative Mode loss with λ, I compute the loss with the functions.get_item and functions.log_softmax functions instead.
If you know a better implementation, please let me know ...
The full code is here: https://github.com/kenchin110100/machine_learning/blob/master/sampleCopySeq2Seq.py
As before, I used the chat dialogue corpus from the dialogue breakdown detection challenge: https://sites.google.com/site/dialoguebreakdowndetection/chat-dialogue-corpus
We will look at the responses to the following four utterances at each epoch.
Epoch 1
```
Utterance: Good morning => Response: ['Good morning', '</s>'] ['copy', 'copy']
Utterance: How's it going? => Response: ['Condition', 'Is', 'Is', 'is', 'is', '</s>'] ['copy', 'copy', 'copy', 'copy', 'copy', 'copy']
Utterance: I'm hungry => Response: ['stomach', 'But', 'But', 'But', 'Ta', 'Ta', 'is', '</s>'] ['copy', 'copy', 'copy', 'copy', 'copy', 'copy', 'gen', 'copy']
Utterance: It's hot today => Response: ['today', 'Is', 'Is', 'is', 'is', '</s>'] ['copy', 'copy', 'copy', 'copy', 'copy', 'copy']
```
It's completely broken ...
Epoch 3
```
Utterance: Good morning => Response: ['Good morning', '</s>'] ['copy', 'copy']
Utterance: How's it going? => Response: ['Condition', 'Is', '</s>'] ['copy', 'gen', 'copy']
Utterance: I'm hungry => Response: ['stomach', '</s>'] ['copy', 'copy']
Utterance: It's hot today => Response: ['hot', 'Is', 'Like', 'is', 'Ne', '</s>'] ['copy', 'copy', 'gen', 'gen', 'gen', 'copy']
```
Epoch 5
```
Utterance: Good morning => Response: ['Good morning', '</s>'] ['copy', 'copy']
Utterance: How's it going? => Response: ['Condition', 'Is', 'Like', 'is', 'Or', '</s>'] ['copy', 'copy', 'gen', 'copy', 'gen', 'copy']
Utterance: I'm hungry => Response: ['stomach', '</s>'] ['copy', 'copy']
Utterance: It's hot today => Response: ['hot', 'is', '</s>'] ['copy', 'gen', 'copy']
```
Even though I told it I'm hungry ...
Epoch 7
```
Utterance: Good morning => Response: ['Good morning', 'Thank you', 'Masu', '</s>'] ['copy', 'gen', 'gen', 'copy']
Utterance: How's it going? => Response: ['Condition', 'Is', '</s>'] ['copy', 'gen', 'copy']
Utterance: I'm hungry => Response: ['stomach', 'But', 'Free', 'Better', 'Ta', '</s>'] ['copy', 'gen', 'copy', 'copy', 'gen', 'gen']
Utterance: It's hot today => Response: ['hot', 'is', '</s>'] ['copy', 'gen', 'copy']
```
It is now mixing copied and generated words, as in the "Good morning" response, where the greeting is copied and the polite ending is generated.
Honestly, though, I would like it to respond a little better.
Since it has to learn both Copy Mode and Generative Mode, the Decoder does not seem to learn the language model sufficiently.
This may be related to the fact that the paper was evaluated on a summarization task rather than a dialogue task. (Well, the main cause may just be my implementation ...)
I implemented CopyNet using Chainer. Having built dialogue models three times now, I have had my fill of them lol. Next time I will try something else.