Implementing sequence to sequence with Chainer: code and verification (1)
A well-known sentence generation model based on recurrent neural networks (RNNs) is sequence to sequence (Seq2Seq).
In this post, I summarize how I implemented Seq2Seq with Chainer and the results of verifying it.
Sequence to Sequence (Seq2Seq)
Seq2Seq is a kind of Encoder-Decoder model based on RNNs, and it can be used as a model for machine dialogue and machine translation.
The original paper is: Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in Neural Information Processing Systems. 2014.
The overall flow of Seq2Seq is shown in the figure below.
For example, given an utterance-response pair such as "How are you feeling?" and "It's pretty good", the Encoder (blue in the figure) turns the utterance into a vector, and the Decoder (red in the figure) is trained as an RNN to generate the response from that vector.
"<'EOS'>" is an abbreviation for End Of Statement, which is a signal that the sentence ends here.
A key point of Seq2Seq is that the utterance is fed in from the opposite direction: if the utterance is "How are you feeling?", the Encoder input becomes "?", "feeling", "you", "are", "How".
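As a minimal sketch of just this reversal step (my own example, not code from the original post):

utterance = ['How', 'are', 'you', 'feeling', '?']
enc_input = list(reversed(utterance))  # fed to the Encoder in this order
print(enc_input)  # ['?', 'feeling', 'you', 'are', 'How']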
Separate embedding layers are used on the Encoder side and the Decoder side; only the intermediate layer produced by the Encoder (the red line in the figure) is shared.
I described Seq2Seq above as an RNN-based neural network, but in this implementation I use Long Short-Term Memory (LSTM) units.
For a detailed explanation of LSTM, the articles at http://qiita.com/t_Signull/items/21b82be280b46f467d1b and http://qiita.com/KojiOhki/items/89cd7b69a8a6239d67ca are easy to follow.
The key feature of LSTM is that each unit has a memory cell (a kind of accumulated memory); when a new input arrives, the cell decides what to forget (Forget gate), what to store (Input gate), and what to output (Output gate).
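As a rough illustration of one LSTM step (a plain NumPy sketch of the standard formulation, which is essentially what Chainer's functions.lstm computes; it is not code from this post):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(c_prev, a, i, f, o):
    # a: cell input, i: input gate, f: forget gate, o: output gate (all pre-activation)
    c = np.tanh(a) * sigmoid(i) + sigmoid(f) * c_prev  # forget part of the old memory, add new memory
    h = sigmoid(o) * np.tanh(c)                        # let part of the memory out as the new output
    return c, h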
This time, I implemented Seq2Seq using Chainer.
There is plenty of sample code that implements Seq2Seq with Chainer, but here I tried to write it as simply as possible (or so I intend).
The code I referred to is https://github.com/odashi/chainer_examples. Thank you, oda.
In Chainer, a neural network model is written as a class.
Encoder
First, the Encoder, which converts an utterance into a vector.
encoder.py
# Imports needed by the snippets in this post (Chainer v1-style API)
import random
import numpy as np
from chainer import Chain, Variable, cuda, functions, links, optimizer, optimizers, serializers

class LSTM_Encoder(Chain):
    def __init__(self, vocab_size, embed_size, hidden_size):
        """
        Class initialization
        :param vocab_size: number of distinct words used (vocabulary size)
        :param embed_size: size of the word-vector representation
        :param hidden_size: size of the hidden layer
        """
        super(LSTM_Encoder, self).__init__(
            # layer that converts a word into a word vector
            xe = links.EmbedID(vocab_size, embed_size, ignore_label=-1),
            # layer that transforms a word vector into a vector four times the size of the hidden layer
            eh = links.Linear(embed_size, 4 * hidden_size),
            # layer that transforms the hidden layer into a vector four times its size
            hh = links.Linear(hidden_size, 4 * hidden_size)
        )

    def __call__(self, x, c, h):
        """
        Encoder computation
        :param x: one-hot vector (word IDs)
        :param c: internal memory
        :param h: hidden layer
        :return: next internal memory, next hidden layer
        """
        # convert the word IDs to word vectors with xe and apply tanh
        e = functions.tanh(self.xe(x))
        # feed the LSTM the previous internal memory together with the sum of the
        # 4x-sized word vector and the 4x-sized hidden layer
        return functions.lstm(c, self.eh(e) + self.hh(h))
The point to note in the Encoder is why the vectors are converted to four times the size of the specified hidden layer.
Chainer's official documentation explains that the input to functions.lstm has to be four times the hidden size.
In other words: the input vector is split internally into the forget gate, input gate, output gate, and cell inputs, so it must be four times as large.
Chainer's functions.lstm only performs the LSTM computation; it holds no learnable parameters. The linear layers eh and hh in the code take on that role instead.
Chainer actually has a convenient links.LSTM class that holds the parameters internally and returns only the output, but I did not use it this time, because I want to hand the hidden-layer value over from the Encoder to the Decoder explicitly (links.LSTM could probably still be used, but I leave that for the future...).
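To make the "four times" point concrete, here is a tiny shape check with functions.lstm (my own snippet, only meant to show the expected shapes):

import numpy as np
from chainer import Variable, functions

batch_size, hidden_size = 2, 3
c_prev = Variable(np.zeros((batch_size, hidden_size), dtype='float32'))
# the input to functions.lstm must be 4 * hidden_size; it is split into the four gate inputs
x = Variable(np.random.randn(batch_size, 4 * hidden_size).astype('float32'))
c, h = functions.lstm(c_prev, x)
print(c.data.shape, h.data.shape)  # (2, 3) (2, 3)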
The image of the computation looks like the figure below (the lines overlap, so it is a bit hard to see...).
Decoder
Next, the Decoder.
decoder.py
class LSTM_Decoder(Chain):
    def __init__(self, vocab_size, embed_size, hidden_size):
        """
        Class initialization
        :param vocab_size: number of distinct words used (vocabulary size)
        :param embed_size: size of the word-vector representation
        :param hidden_size: size of the intermediate vector
        """
        super(LSTM_Decoder, self).__init__(
            # layer that converts an input word into a word vector
            ye = links.EmbedID(vocab_size, embed_size, ignore_label=-1),
            # layer that transforms a word vector into a vector four times the size of the intermediate vector
            eh = links.Linear(embed_size, 4 * hidden_size),
            # layer that transforms the intermediate vector into a vector four times its size
            hh = links.Linear(hidden_size, 4 * hidden_size),
            # layer that converts the output intermediate vector back to word-vector size
            he = links.Linear(hidden_size, embed_size),
            # layer that converts a word vector into a vocabulary-sized (one-hot-sized) vector
            ey = links.Linear(embed_size, vocab_size)
        )

    def __call__(self, y, c, h):
        """
        :param y: one-hot vector (word IDs)
        :param c: internal memory
        :param h: intermediate vector
        :return: predicted word scores, next internal memory, next intermediate vector
        """
        # convert the input word to a word vector and apply tanh
        e = functions.tanh(self.ye(y))
        # feed the LSTM the internal memory together with the sum of the 4x-sized
        # word vector and the 4x-sized intermediate vector
        c, h = functions.lstm(c, self.eh(e) + self.hh(h))
        # convert the output intermediate vector to a word vector, then to a vocabulary-sized output vector
        t = self.ey(functions.tanh(self.he(h)))
        return t, c, h
The Decoder also converts vectors to four times the hidden size. The difference from the Encoder is that the output intermediate vector is further converted into a vector of vocabulary size.
Therefore, the Decoder needs the layers he and ey, which the Encoder did not have.
The image of this computation is as follows.
In the Decoder, backpropagation is performed using this final output vector.
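At test time, the same Decoder can generate a response greedily. That code is not shown in this post; the following is only my sketch of the idea, assuming the model was built with batch_size=1, that <eos> has ID 0 (as in forward.py below), and that enc_words is the reversed utterance given as a list of Variable word-ID batches of shape (1,):

def generate(model, enc_words, ARR, limit=30):
    model.reset()
    model.encode(enc_words)
    t = Variable(ARR.array([0], dtype='int32'))  # start decoding from <eos>
    response = []
    for _ in range(limit):
        y = model.decode(t)                      # vocabulary-sized word scores
        word_id = int(ARR.argmax(y.data))        # greedy: take the highest-scoring word
        if word_id == 0:                         # <eos> ends the response
            break
        response.append(word_id)
        t = Variable(ARR.array([word_id], dtype='int32'))
    return response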
Seq2Seq
The code below builds Seq2Seq by combining the Encoder and Decoder above.
seq2seq.py
class Seq2Seq(Chain):
    def __init__(self, vocab_size, embed_size, hidden_size, batch_size, flag_gpu=True):
        """
        Seq2Seq initialization
        :param vocab_size: vocabulary size
        :param embed_size: word-vector size
        :param hidden_size: intermediate-vector size
        :param batch_size: mini-batch size
        :param flag_gpu: whether to use the GPU
        """
        super(Seq2Seq, self).__init__(
            # Encoder instantiation
            encoder = LSTM_Encoder(vocab_size, embed_size, hidden_size),
            # Decoder instantiation
            decoder = LSTM_Decoder(vocab_size, embed_size, hidden_size)
        )
        self.vocab_size = vocab_size
        self.embed_size = embed_size
        self.hidden_size = hidden_size
        self.batch_size = batch_size
        # use cupy when computing on the GPU, numpy when computing on the CPU
        if flag_gpu:
            self.ARR = cuda.cupy
        else:
            self.ARR = np

    def encode(self, words):
        """
        Encoder computation
        :param words: list of words (as ID batches) in the utterance
        :return:
        """
        # initialize the internal memory and the intermediate vector
        c = Variable(self.ARR.zeros((self.batch_size, self.hidden_size), dtype='float32'))
        h = Variable(self.ARR.zeros((self.batch_size, self.hidden_size), dtype='float32'))
        # have the Encoder read the words in order
        for w in words:
            c, h = self.encoder(w, c, h)
        # store the computed intermediate vector as an instance variable so the Decoder can take it over
        self.h = h
        # the internal memory is not carried over, so reinitialize it
        self.c = Variable(self.ARR.zeros((self.batch_size, self.hidden_size), dtype='float32'))

    def decode(self, w):
        """
        Decoder computation
        :param w: word (ID batch)
        :return: a vocabulary-sized vector of word scores
        """
        t, self.c, self.h = self.decoder(w, self.c, self.h)
        return t

    def reset(self):
        """
        Initialize the intermediate vector, the internal memory, and the gradients
        :return:
        """
        self.h = Variable(self.ARR.zeros((self.batch_size, self.hidden_size), dtype='float32'))
        self.c = Variable(self.ARR.zeros((self.batch_size, self.hidden_size), dtype='float32'))
        self.zerograds()
The forward propagation calculation using the Seq2Seq class is performed as follows.
forward.py
def forward(enc_words, dec_words, model, ARR):
    """
    Computes forward propagation
    :param enc_words: list of words in the utterance
    :param dec_words: list of words in the response sentence
    :param model: instance of Seq2Seq
    :param ARR: cuda.cupy or numpy
    :return: the total computed loss
    """
    # record the batch size
    batch_size = len(enc_words[0])
    # reset the gradients stored in the model
    model.reset()
    # convert the words in the utterance list to Variable, Chainer's data type
    enc_words = [Variable(ARR.array(row, dtype='int32')) for row in enc_words]
    # encoding computation (1)
    model.encode(enc_words)
    # initialize the loss
    loss = Variable(ARR.zeros((), dtype='float32'))
    # feed <eos> to the decoder (2)
    t = Variable(ARR.array([0 for _ in range(batch_size)], dtype='int32'))
    # decoder computation
    for w in dec_words:
        # decode one word at a time (3)
        y = model.decode(t)
        # convert the correct word to Variable
        t = Variable(ARR.array(w, dtype='int32'))
        # compute the loss by comparing the correct word with the predicted word (4)
        loss += functions.softmax_cross_entropy(y, t)
    return loss
The flow of this computation is illustrated as follows.
The words in enc_words and dec_words used for training must be converted to IDs (numbers) in advance.
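How that ID conversion is done is not shown here; a minimal sketch of the kind of mapping I mean (my assumption: a tokenized corpus, with <eos> fixed to ID 0, which is what forward() feeds the decoder first) could look like this:

word_to_id = {'<eos>': 0}
for sentence in corpus:                  # corpus: list of tokenized sentences (assumed to exist)
    for word in sentence:
        word_to_id.setdefault(word, len(word_to_id))

def to_ids(sentence):
    return [word_to_id[word] for word in sentence]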
The softmax cross-entropy function is used to calculate the loss.
All that remains is to backpropagate the loss computed by forward and let Chainer update the network.
The main code for learning is as follows.
train.py
def train():
    # check the vocabulary size
    vocab_size = len(word_to_id)
    # instantiate the model
    model = Seq2Seq(vocab_size=vocab_size,
                    embed_size=EMBED_SIZE,
                    hidden_size=HIDDEN_SIZE,
                    batch_size=BATCH_SIZE,
                    flag_gpu=FLAG_GPU)
    # initialize the model
    model.reset()
    # decide whether to use the GPU
    if FLAG_GPU:
        ARR = cuda.cupy
        # put the model into GPU memory
        cuda.get_device(0).use()
        model.to_gpu(0)
    else:
        ARR = np

    # start training
    for epoch in range(EPOCH_NUM):
        # initialize the optimizer for each epoch
        # Adam is a safe choice
        opt = optimizers.Adam()
        # attach the model to the optimizer
        opt.setup(model)
        # clip gradients that are too large
        opt.add_hook(optimizer.GradientClipping(5))
        # read the training data created in advance
        data = Filer.read_pkl(path)
        # shuffle the data
        random.shuffle(data)
        # start mini-batch training
        for num in range(len(data)//BATCH_SIZE):
            # create a mini-batch of the chosen size
            minibatch = data[num*BATCH_SIZE: (num+1)*BATCH_SIZE]
            # create the data for training
            enc_words, dec_words = make_minibatch(minibatch)
            # compute the loss by forward propagation
            total_loss = forward(enc_words=enc_words,
                                 dec_words=dec_words,
                                 model=model,
                                 ARR=ARR)
            # compute the gradients by backpropagation
            total_loss.backward()
            # update the network with the computed gradients
            opt.update()
            # clear the recorded gradients
            opt.zero_grads()
        # save the model after each epoch
        serializers.save_hdf5(outputpath, model)
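The make_minibatch function used above is not shown in this post. My guess at what it does, based on how forward() reads the data: pad every utterance and response in the mini-batch to a common length with -1 (the ignore_label given to EmbedID, and also the default ignore label of softmax_cross_entropy), then transpose so that each row holds the word IDs of one time step across the whole batch:

def make_minibatch(minibatch):
    # minibatch: list of (utterance_ids, response_ids) pairs  <-- my assumption
    enc_max = max(len(enc) for enc, _ in minibatch)
    dec_max = max(len(dec) for _, dec in minibatch)
    # left-pad the (reversed) utterances, right-pad the responses with -1
    enc_rows = [[-1] * (enc_max - len(enc)) + enc for enc, _ in minibatch]
    dec_rows = [dec + [-1] * (dec_max - len(dec)) for _, dec in minibatch]
    # transpose to time-major: one list of word IDs per time step
    enc_words = [list(step) for step in zip(*enc_rows)]
    dec_words = [list(step) for step in zip(*dec_rows)]
    return enc_words, dec_words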
That was rather long, but the code explanation is done. The full code is available at https://github.com/kenchin110100/machine_learning/blob/master/sampleSeq2Sep.py.
For training data, I used the dialogue breakdown detection chat dialogue corpus: https://sites.google.com/site/dialoguebreakdowndetection/chat-dialogue-corpus
I really wanted to train on a larger corpus, but I gave up because training took too long...
Let's look at how the model responds to the following four utterances at each epoch.
First, after 1 epoch:
Utterance: Good morning => Response: ['Yes', '</s>']
Utterance: How's it going? => Response: ['boring', 'Is', 'Like', 'is', 'Ne', '</s>']
Utterance: I'm hungry => Response: ['so', 'is', '</s>']
Utterance: It's hot today => Response: ['so', 'is', '</s>']
Are you a philosopher?
Then, after 3 epochs:
Utterance: Good morning => Response: ['Good morning', 'Thank you', 'Masu', '</s>']
Utterance: How's it going? => Response: ['watermelon', 'Is', 'I love You', 'is', 'Ne', '</s>']
Utterance: I'm hungry => Response: ['so', 'Nana', 'Hmm', 'is', 'Or', '?', '</s>']
Utterance: It's hot today => Response: ['what', 'Or', 'To', 'Go', 'hand', 'Masu', 'Or', '?', '</s>']
I haven't gone anywhere ...
After 5 epochs:
Utterance: Good morning => Response: ['Good morning', 'Thank you', 'Masu', '</s>']
Utterance: How's it going? => Response: ['Sea', 'Is', 'one', 'Man', 'so', 'Go', 'hand', 'Masu', 'Or', '?', '</s>']
Utterance: I'm hungry => Response: ['Yup', '</s>']
Utterance: It's hot today => Response: ['what', 'To', 'eat', 'Better', 'Ta', 'Or', '?', '</s>']
I haven't even gone to the sea ...
And after 8 epochs...
Utterance: Good morning => Response: ['Good morning', 'Thank you', 'Masu', '</s>']
Utterance: How's it going? => Response: ['jellyfish', 'Is', 'Good', 'is', 'Ne', '</s>']
Utterance: I'm hungry => Response: ['Also', '</s>']
Utterance: It's hot today => Response: ['heatstroke', 'To', 'Qi', 'To', 'Attach', 'Absent', 'Hmm', 'is', 'Or', '?', '</s>']
The responses are getting close, but is this about the limit? I tried training for more epochs, but the accuracy did not change much.
I have implemented Seq2Seq using Chainer. The accuracy would probably improve with a larger corpus, but then the amount of computation grows and training does not converge easily...
By the way, I added (1) to the title because I am planning a second and third installment! Next, I would like to add Attention to this Seq2Seq.