Implementing sequence to sequence with Chainer: code and verification (1)
A well-known sentence generation model based on recurrent neural networks (RNNs) is sequence to sequence (Seq2Seq).
In this post, I summarize how I implemented Seq2Seq with Chainer and the results of verifying it.
Sequence to Sequence (Seq2Seq)
Seq2Seq is a kind of Encoder-Decoder model based on RNNs, and it can be used as a model for machine dialogue and machine translation.
The original paper is: Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in Neural Information Processing Systems. 2014.
The overall flow of Seq2Seq is shown in the figure below.
For example, given an utterance-response pair such as "How are you feeling?" and "It's pretty good", the Encoder (blue in the figure) turns the utterance into a vector, and the Decoder (red in the figure) is trained as an RNN to generate the response from that vector.
"<'EOS'>" is an abbreviation for End Of Statement, which is a signal that the sentence ends here.
A key point of Seq2Seq is that the utterance is fed in from the opposite direction: if the utterance is "How are you feeling?", the Encoder input becomes "?", "feeling", "you", "are", "How".
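As a minimal sketch of just this reversal step (my own example, not code from the original post):

utterance = ['How', 'are', 'you', 'feeling', '?']
enc_input = list(reversed(utterance))  # fed to the Encoder in this order
print(enc_input)  # ['?', 'feeling', 'you', 'are', 'How']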
Separate embedding layers are used on the Encoder side and the Decoder side; only the intermediate layer produced by the Encoder (the red line in the figure) is shared.
I described Seq2Seq above as an RNN-based neural network, but in this implementation I use Long Short-Term Memory (LSTM) units.
For a detailed explanation of LSTM, the articles at http://qiita.com/t_Signull/items/21b82be280b46f467d1b and http://qiita.com/KojiOhki/items/89cd7b69a8a6239d67ca are easy to follow.
The key feature of LSTM is that each unit has a memory cell (a kind of accumulated memory); when a new input arrives, the cell decides what to forget (Forget gate), what to store (Input gate), and what to output (Output gate).
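As a rough illustration of one LSTM step (a plain NumPy sketch of the standard formulation, which is essentially what Chainer's functions.lstm computes; it is not code from this post):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(c_prev, a, i, f, o):
    # a: cell input, i: input gate, f: forget gate, o: output gate (all pre-activation)
    c = np.tanh(a) * sigmoid(i) + sigmoid(f) * c_prev  # forget part of the old memory, add new memory
    h = sigmoid(o) * np.tanh(c)                        # let part of the memory out as the new output
    return c, h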
This time, I implemented Seq2Seq using Chainer.
There is plenty of sample code that implements Seq2Seq with Chainer, but here I tried to write it as simply as possible (or so I intend).
The code I referred to is https://github.com/odashi/chainer_examples. Thank you, oda.
In Chainer, a neural network model is written as a class.
Encoder
First, the Encoder, which converts an utterance into a vector.
encoder.py
# Imports needed by the snippets in this post (Chainer v1-style API)
import random
import numpy as np
from chainer import Chain, Variable, cuda, functions, links, optimizer, optimizers, serializers

class LSTM_Encoder(Chain):
    def __init__(self, vocab_size, embed_size, hidden_size):
        """
        Class initialization
        :param vocab_size: number of distinct words used (vocabulary size)
        :param embed_size: size of the word-vector representation
        :param hidden_size: size of the hidden layer
        """
        super(LSTM_Encoder, self).__init__(
            # layer that converts a word into a word vector
            xe = links.EmbedID(vocab_size, embed_size, ignore_label=-1),
            # layer that transforms a word vector into a vector four times the size of the hidden layer
            eh = links.Linear(embed_size, 4 * hidden_size),
            # layer that transforms the hidden layer into a vector four times its size
            hh = links.Linear(hidden_size, 4 * hidden_size)
        )

    def __call__(self, x, c, h):
        """
        Encoder computation
        :param x: one-hot vector (word IDs)
        :param c: internal memory
        :param h: hidden layer
        :return: next internal memory, next hidden layer
        """
        # convert the word IDs to word vectors with xe and apply tanh
        e = functions.tanh(self.xe(x))
        # feed the LSTM the previous internal memory together with the sum of the
        # 4x-sized word vector and the 4x-sized hidden layer
        return functions.lstm(c, self.eh(e) + self.hh(h))
The point to note in the Encoder is why the vectors are converted to four times the size of the specified hidden layer.
Chainer's official documentation explains that the input to functions.lstm has to be four times the hidden size.
In other words: the input vector is split internally into the forget gate, input gate, output gate, and cell inputs, so it must be four times as large.
Chainer's functions.lstm only performs the LSTM computation; it holds no learnable parameters. The linear layers eh and hh in the code take on that role instead.
Chainer actually has a convenient links.LSTM class that holds the parameters internally and returns only the output, but I did not use it this time, because I want to hand the hidden-layer value over from the Encoder to the Decoder explicitly (links.LSTM could probably still be used, but I leave that for the future...).
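To make the "four times" point concrete, here is a tiny shape check with functions.lstm (my own snippet, only meant to show the expected shapes):

import numpy as np
from chainer import Variable, functions

batch_size, hidden_size = 2, 3
c_prev = Variable(np.zeros((batch_size, hidden_size), dtype='float32'))
# the input to functions.lstm must be 4 * hidden_size; it is split into the four gate inputs
x = Variable(np.random.randn(batch_size, 4 * hidden_size).astype('float32'))
c, h = functions.lstm(c_prev, x)
print(c.data.shape, h.data.shape)  # (2, 3) (2, 3)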
The image of the computation looks like the figure below (the lines overlap, so it is a bit hard to see...).
Decoder
Next, the Decoder.
decoder.py
class LSTM_Decoder(Chain):
    def __init__(self, vocab_size, embed_size, hidden_size):
        """
        Class initialization
        :param vocab_size: number of distinct words used (vocabulary size)
        :param embed_size: size of the word-vector representation
        :param hidden_size: size of the intermediate vector
        """
        super(LSTM_Decoder, self).__init__(
            # layer that converts an input word into a word vector
            ye = links.EmbedID(vocab_size, embed_size, ignore_label=-1),
            # layer that transforms a word vector into a vector four times the size of the intermediate vector
            eh = links.Linear(embed_size, 4 * hidden_size),
            # layer that transforms the intermediate vector into a vector four times its size
            hh = links.Linear(hidden_size, 4 * hidden_size),
            # layer that converts the output intermediate vector back to word-vector size
            he = links.Linear(hidden_size, embed_size),
            # layer that converts a word vector into a vocabulary-sized (one-hot-sized) vector
            ey = links.Linear(embed_size, vocab_size)
        )

    def __call__(self, y, c, h):
        """
        :param y: one-hot vector (word IDs)
        :param c: internal memory
        :param h: intermediate vector
        :return: predicted word scores, next internal memory, next intermediate vector
        """
        # convert the input word to a word vector and apply tanh
        e = functions.tanh(self.ye(y))
        # feed the LSTM the internal memory together with the sum of the 4x-sized
        # word vector and the 4x-sized intermediate vector
        c, h = functions.lstm(c, self.eh(e) + self.hh(h))
        # convert the output intermediate vector to a word vector, then to a vocabulary-sized output vector
        t = self.ey(functions.tanh(self.he(h)))
        return t, c, h
The Decoder also converts vectors to four times the hidden size. The difference from the Encoder is that the output intermediate vector is further converted into a vector of vocabulary size.
Therefore, the Decoder needs the layers he and ey, which the Encoder did not have.
The image of this computation is as follows.
In the Decoder, backpropagation is performed using this final output vector.
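At test time, the same Decoder can generate a response greedily. That code is not shown in this post; the following is only my sketch of the idea, assuming the model was built with batch_size=1, that <eos> has ID 0 (as in forward.py below), and that enc_words is the reversed utterance given as a list of Variable word-ID batches of shape (1,):

def generate(model, enc_words, ARR, limit=30):
    model.reset()
    model.encode(enc_words)
    t = Variable(ARR.array([0], dtype='int32'))  # start decoding from <eos>
    response = []
    for _ in range(limit):
        y = model.decode(t)                      # vocabulary-sized word scores
        word_id = int(ARR.argmax(y.data))        # greedy: take the highest-scoring word
        if word_id == 0:                         # <eos> ends the response
            break
        response.append(word_id)
        t = Variable(ARR.array([word_id], dtype='int32'))
    return response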
Seq2Seq
The code below builds Seq2Seq by combining the Encoder and Decoder above.
seq2seq.py
class Seq2Seq(Chain):
    def __init__(self, vocab_size, embed_size, hidden_size, batch_size, flag_gpu=True):
        """
        Seq2Seq initialization
        :param vocab_size: vocabulary size
        :param embed_size: word-vector size
        :param hidden_size: intermediate-vector size
        :param batch_size: mini-batch size
        :param flag_gpu: whether to use the GPU
        """
        super(Seq2Seq, self).__init__(
            # Encoder instantiation
            encoder = LSTM_Encoder(vocab_size, embed_size, hidden_size),
            # Decoder instantiation
            decoder = LSTM_Decoder(vocab_size, embed_size, hidden_size)
        )
        self.vocab_size = vocab_size
        self.embed_size = embed_size
        self.hidden_size = hidden_size
        self.batch_size = batch_size
        # use cupy when computing on the GPU, numpy when computing on the CPU
        if flag_gpu:
            self.ARR = cuda.cupy
        else:
            self.ARR = np

    def encode(self, words):
        """
        Encoder computation
        :param words: list of words (as ID batches) in the utterance
        :return:
        """
        # initialize the internal memory and the intermediate vector
        c = Variable(self.ARR.zeros((self.batch_size, self.hidden_size), dtype='float32'))
        h = Variable(self.ARR.zeros((self.batch_size, self.hidden_size), dtype='float32'))
        # have the Encoder read the words in order
        for w in words:
            c, h = self.encoder(w, c, h)
        # store the computed intermediate vector as an instance variable so the Decoder can take it over
        self.h = h
        # the internal memory is not carried over, so reinitialize it
        self.c = Variable(self.ARR.zeros((self.batch_size, self.hidden_size), dtype='float32'))

    def decode(self, w):
        """
        Decoder computation
        :param w: word (ID batch)
        :return: a vocabulary-sized vector of word scores
        """
        t, self.c, self.h = self.decoder(w, self.c, self.h)
        return t

    def reset(self):
        """
        Initialize the intermediate vector, the internal memory, and the gradients
        :return:
        """
        self.h = Variable(self.ARR.zeros((self.batch_size, self.hidden_size), dtype='float32'))
        self.c = Variable(self.ARR.zeros((self.batch_size, self.hidden_size), dtype='float32'))
        self.zerograds()
The forward propagation calculation using the Seq2Seq class is performed as follows.
forward.py
def forward(enc_words, dec_words, model, ARR):
    """
    Computes forward propagation
    :param enc_words: list of words in the utterance
    :param dec_words: list of words in the response sentence
    :param model: instance of Seq2Seq
    :param ARR: cuda.cupy or numpy
    :return: the total computed loss
    """
    # record the batch size
    batch_size = len(enc_words[0])
    # reset the gradients stored in the model
    model.reset()
    # convert the words in the utterance list to Variable, Chainer's data type
    enc_words = [Variable(ARR.array(row, dtype='int32')) for row in enc_words]
    # encoding computation (1)
    model.encode(enc_words)
    # initialize the loss
    loss = Variable(ARR.zeros((), dtype='float32'))
    # feed <eos> to the decoder (2)
    t = Variable(ARR.array([0 for _ in range(batch_size)], dtype='int32'))
    # decoder computation
    for w in dec_words:
        # decode one word at a time (3)
        y = model.decode(t)
        # convert the correct word to Variable
        t = Variable(ARR.array(w, dtype='int32'))
        # compute the loss by comparing the correct word with the predicted word (4)
        loss += functions.softmax_cross_entropy(y, t)
    return loss
The flow of this computation is illustrated as follows.
The words in enc_words and dec_words used for training must be converted to IDs (numbers) in advance.
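How that ID conversion is done is not shown here; a minimal sketch of the kind of mapping I mean (my assumption: a tokenized corpus, with <eos> fixed to ID 0, which is what forward() feeds the decoder first) could look like this:

word_to_id = {'<eos>': 0}
for sentence in corpus:                  # corpus: list of tokenized sentences (assumed to exist)
    for word in sentence:
        word_to_id.setdefault(word, len(word_to_id))

def to_ids(sentence):
    return [word_to_id[word] for word in sentence]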
The softmax cross-entropy function is used to calculate the loss.
All that remains is to backpropagate the loss computed by forward and let Chainer update the network.
The main code for learning is as follows.
train.py
def train():
    # check the vocabulary size
    vocab_size = len(word_to_id)
    # instantiate the model
    model = Seq2Seq(vocab_size=vocab_size,
                    embed_size=EMBED_SIZE,
                    hidden_size=HIDDEN_SIZE,
                    batch_size=BATCH_SIZE,
                    flag_gpu=FLAG_GPU)
    # initialize the model
    model.reset()
    # decide whether to use the GPU
    if FLAG_GPU:
        ARR = cuda.cupy
        # put the model into GPU memory
        cuda.get_device(0).use()
        model.to_gpu(0)
    else:
        ARR = np

    # start training
    for epoch in range(EPOCH_NUM):
        # initialize the optimizer for each epoch
        # Adam is a safe choice
        opt = optimizers.Adam()
        # attach the model to the optimizer
        opt.setup(model)
        # clip gradients that are too large
        opt.add_hook(optimizer.GradientClipping(5))
        # read the training data created in advance
        data = Filer.read_pkl(path)
        # shuffle the data
        random.shuffle(data)
        # start mini-batch training
        for num in range(len(data)//BATCH_SIZE):
            # create a mini-batch of the chosen size
            minibatch = data[num*BATCH_SIZE: (num+1)*BATCH_SIZE]
            # create the data for training
            enc_words, dec_words = make_minibatch(minibatch)
            # compute the loss by forward propagation
            total_loss = forward(enc_words=enc_words,
                                 dec_words=dec_words,
                                 model=model,
                                 ARR=ARR)
            # compute the gradients by backpropagation
            total_loss.backward()
            # update the network with the computed gradients
            opt.update()
            # clear the recorded gradients
            opt.zero_grads()
        # save the model after each epoch
        serializers.save_hdf5(outputpath, model)
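The make_minibatch function used above is not shown in this post. My guess at what it does, based on how forward() reads the data: pad every utterance and response in the mini-batch to a common length with -1 (the ignore_label given to EmbedID, and also the default ignore label of softmax_cross_entropy), then transpose so that each row holds the word IDs of one time step across the whole batch:

def make_minibatch(minibatch):
    # minibatch: list of (utterance_ids, response_ids) pairs  <-- my assumption
    enc_max = max(len(enc) for enc, _ in minibatch)
    dec_max = max(len(dec) for _, dec in minibatch)
    # left-pad the (reversed) utterances, right-pad the responses with -1
    enc_rows = [[-1] * (enc_max - len(enc)) + enc for enc, _ in minibatch]
    dec_rows = [dec + [-1] * (dec_max - len(dec)) for _, dec in minibatch]
    # transpose to time-major: one list of word IDs per time step
    enc_words = [list(step) for step in zip(*enc_rows)]
    dec_words = [list(step) for step in zip(*dec_rows)]
    return enc_words, dec_words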
That was rather long, but the code explanation is done. The full code is available at https://github.com/kenchin110100/machine_learning/blob/master/sampleSeq2Sep.py.
For training data, I used the dialogue breakdown detection chat dialogue corpus: https://sites.google.com/site/dialoguebreakdowndetection/chat-dialogue-corpus
I really wanted to train on a larger corpus, but I gave up because training took too long...
Let's look at how the model responds to the following four utterances at each epoch.
First, after 1 epoch:
Utterance: Good morning => Response: ['Yes', '</s>']
Utterance: How's it going? => Response: ['boring', 'Is', 'Like', 'is', 'Ne', '</s>']
Utterance: I'm hungry => Response: ['so', 'is', '</s>']
Utterance: It's hot today => Response: ['so', 'is', '</s>']
Are you a philosopher?
Then, after 3 epochs:
Utterance: Good morning => Response: ['Good morning', 'Thank you', 'Masu', '</s>']
Utterance: How's it going? => Response: ['watermelon', 'Is', 'I love You', 'is', 'Ne', '</s>']
Utterance: I'm hungry => Response: ['so', 'Nana', 'Hmm', 'is', 'Or', '?', '</s>']
Utterance: It's hot today => Response: ['what', 'Or', 'To', 'Go', 'hand', 'Masu', 'Or', '?', '</s>']
I haven't gone anywhere ...
After 5 epochs:
Utterance: Good morning => Response: ['Good morning', 'Thank you', 'Masu', '</s>']
Utterance: How's it going? => Response: ['Sea', 'Is', 'one', 'Man', 'so', 'Go', 'hand', 'Masu', 'Or', '?', '</s>']
Utterance: I'm hungry => Response: ['Yup', '</s>']
Utterance: It's hot today => Response: ['what', 'To', 'eat', 'Better', 'Ta', 'Or', '?', '</s>']
I haven't even gone to the sea ...
And after 8 epochs...
Utterance: Good morning => Response: ['Good morning', 'Thank you', 'Masu', '</s>']
Utterance: How's it going? => Response: ['jellyfish', 'Is', 'Good', 'is', 'Ne', '</s>']
Utterance: I'm hungry => Response: ['Also', '</s>']
Utterance: It's hot today => Response: ['heatstroke', 'To', 'Qi', 'To', 'Attach', 'Absent', 'Hmm', 'is', 'Or', '?', '</s>']
The responses are getting close, but is this about the limit? I tried training for more epochs, but the accuracy did not change much.
I have implemented Seq2Seq using Chainer. The accuracy would probably improve with a larger corpus, but then the amount of computation grows and training does not converge easily...
By the way, I added (1) to the title because I am planning a second and third installment! Next, I would like to add Attention to this Seq2Seq.