This is a memo of things I stumbled over in Chapter 7 of "Deep Learning from Scratch ❷ --- Natural Language Processing", which I started studying on a whim.
The execution environment is macOS Catalina + Anaconda 2019.10, and the Python version is 3.7.4. For details, refer to Chapter 1 of this memo.
(To other chapters of this memo: Chapter 1 / Chapter 2 / Chapter 3 / Chapter 4 / Chapter 5 / Chapter 6 / Chapter 7)
This chapter covers sentence generation using the language model built in the previous chapter, and a new model called seq2seq. I didn't have time to implement everything myself, so I'm still at the level of running the book's implementation. Please keep that in mind.
Since I'm not good at English, I can't tell whether the output of a language model trained on the PTB corpus is good or bad, so I tried it with the Aozora Bunko language model built in the previous chapter instead.
First, to try the RNNLM whose perplexity was 105.19 in the previous chapter, we modify ch07/generate_text.py slightly. The parts marked ★ are the changes.
ch07/generate_text.py
# coding: utf-8
import sys
sys.path.append('..')
from rnnlm_gen import RnnlmGen
from dataset import aozorabunko #★ Changed to use the corpus of Aozora Bunko
corpus, word_to_id, id_to_word = aozorabunko.load_data('train') #★ Change corpus
vocab_size = len(word_to_id)
corpus_size = len(corpus)
model = RnnlmGen(vocab_size=vocab_size) #★ Specify the vocabulary size (Rnnlm's default is the PTB value)
model.load_params('../ch06/Rnnlm.pkl')
#Set start and skip characters
start_word = 'you' # ★you
start_id = word_to_id[start_word]
skip_words = [] #★ No skip characters as it has not been preprocessed
skip_ids = [word_to_id[w] for w in skip_words]
#Sentence generation
word_ids = model.generate(start_id, skip_ids)
#★ Since the text is Japanese, join the words without spaces and replace <eos> with a period + line break
eos_id = word_to_id['<eos>']
txt = ''.join([id_to_word[i] if i != eos_id else '。\n' for i in word_ids])
txt = txt.replace('\n。\n', '\n') #Removal of blank lines
txt = txt.replace('」。\n', '」\n') #Remove the punctuation mark at the end of the conversation
print(txt)
Below are the results of generating sentences starting from "you". I tried it several times.
Perplexity 105.19 version
After a while, you guys are going on, and after a while, you're the gentleman in that woman's train, so when you hear a little more about Edo, you still can't see it as BC. ..
He wouldn't cross quietly as it was, so I had to mess it up, and I just gave him a literary story about Pamijin.
The young ladies try to be disappointed
Perplexity 105.19 version
I interrupted you and sent it to the pigeon's mouth.
"That's the way back, this house is pomponing." ""
"Isn't it?"
"That's right"
"Then I can't help.
」
"How is it?" Giovanni inadvertently flew through the hairy straw on the head of the instrument.
Also, seaweed is like gold.
Perplexity 105.19 version
I caught up with your shield.
The two did not return.
However, I was always sleeping because of the difficulty.
However, outside the mirror, the mad neck may be meaningless, so the sound of silence depends on what it was, and in what way it was the very original of white intractable liquor. This is a long and narrow one, which is too big for the main gate, and Iisaki Tako is irresistible.
Therefore
Somehow these are turning into sentences. Since the corpus consists of novels, some of the output reads like a novel.
In the second result, you can see that the quotation brackets roughly match up, so the model seems to have properly remembered the relationship between opening and closing brackets.
Even so, the flow of meaning in that second result, from "I sent it to the pigeon's mouth" to "Giovanni inadvertently flew through the hairy straw on the head of the instrument", is a little worrying :grin:
Next, I tried the improved version, whose perplexity dropped to 73.66 in the previous chapter. Below is the modified ch07/generate_better_text.py. The parts marked ★ are the changes.
ch07/generate_better_text.py
# coding: utf-8
import sys
sys.path.append('..')
from common.np import *
from rnnlm_gen import BetterRnnlmGen
from dataset import aozorabunko #★ Changed to use the corpus of Aozora Bunko
corpus, word_to_id, id_to_word = aozorabunko.load_data('train') #★ Change corpus
vocab_size = len(word_to_id)
corpus_size = len(corpus)
model = BetterRnnlmGen(vocab_size=vocab_size) #★ Specify the vocabulary size (BetterRnnlm's default is the PTB value)
model.load_params('../ch06/BetterRnnlm.pkl')
#Set start and skip characters
start_word = 'you' # ★you
start_id = word_to_id[start_word]
skip_words = [] #★ No skip characters as it has not been preprocessed
skip_ids = [word_to_id[w] for w in skip_words]
#Sentence generation
word_ids = model.generate(start_id, skip_ids)
#★ Since the text is Japanese, join the words without spaces and replace <eos> with a period + line break
eos_id = word_to_id['<eos>']
txt = ''.join([id_to_word[i] if i != eos_id else '。\n' for i in word_ids])
txt = txt.replace('\n。\n', '\n') #Removal of blank lines
txt = txt.replace('」。\n', '」\n') #Remove the punctuation mark at the end of the conversation
print(txt)
model.reset_state()
start_words = 'The meaning of life is' # ★the meaning of life is
start_ids = [word_to_id[w] for w in start_words.split(' ')]
# Feed the leading words one at a time to set up the hidden state
for x in start_ids[:-1]:
    x = np.array(x).reshape(1, 1)
    model.predict(x)
word_ids = model.generate(start_ids[-1], skip_ids)
word_ids = start_ids[:-1] + word_ids
#★ Since the text is Japanese, join the words without spaces and replace <eos> with a period + line break
txt = ''.join([id_to_word[i] if i != eos_id else '。\n' for i in word_ids])
txt = txt.replace('\n。\n', '\n') #Removal of blank lines
txt = txt.replace('」。\n', '」\n') #Remove the punctuation mark at the end of the conversation
print('-' * 50)
print(txt)
The following is the result of generating sentences starting with "you".
Perplexity 73.66 version
I took it off without even knowing you.
However, I wonder if that person will be interested in me from now on, and when I get to work, I can't help myself.
slip'It will be the death of a family called the royal family.
It would be fun to read the history of household races and see the four or two letters if people are left in the water like the world and shoulders with tools to throw away money and pleasure.
Belly representative
Perplexity 73.66 version
You told me to talk to you.
(I'm such a go.
At one point, I came with Ueno with a gate entrance lantern.
I will not go on my own because it will come out in the middle of the night of this year.
It's decreasing, my hands are crying, and now it's not so easy.
I'm sick now, so the teacher has broken the way of holding it rather than the best means. "
"why"
"No still
Perplexity 73.66 version
I wonder if you came so dizzy because you have the fact that you went.
I don't know if I'm still sick.
No matter how much you can swim, it's definitely coming.
This twenty-four.
I'm spinning around on the floor, looking for customers with anxious faces.
Both teachers had sex near noon and started walking while returning to the tatami room.
When we came to Tokyo in the back, it was a little lively next time
Somehow, this feels more like natural Japanese than the previous results.
Also, the token "slip'" suddenly appeared in the first result. When I searched the corpus, it turned out to come from a passage in "I Am a Cat" quoting the Western proverb "many a slip 'twixt the cup and the lip". It is used only in that one place, so rare words like this really should be preprocessed the way the PTB corpus is :sweat:
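For reference, PTB-style preprocessing replaces words below some frequency threshold with a placeholder such as <unk>. The following is a minimal sketch of that idea (my own illustration, not the book's dataset code; the function name and threshold are made up):

# Minimal sketch: replace words that appear fewer than min_count times with <unk>.
from collections import Counter

def replace_rare_words(words, min_count=2, unk='<unk>'):
    counts = Counter(words)
    return [w if counts[w] >= min_count else unk for w in words]

words = "many a slip twixt the cup and the lip".split()
print(replace_rare_words(words))  # only 'the' appears twice; every one-off word becomes <unk>

Applying something like this to the Aozora Bunko corpus before training would keep one-off tokens such as "slip'" out of the vocabulary.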
Next, following the book, I had the model continue the phrase "The meaning of life is".
Perplexity 73.66 version
The garden was restless to the end because the meaning of life seemed to be the same as the history of the mind.
Something is related to the famous literary artist.
Is it okay for a burnt sand lens to be hit by pressure?
This isn't a big stimulus to the spiritual world, as they saw the problems every month and tied their lies to the cup.
It ’s just a belly
Perplexity 73.66 version
The meaning of life is messy because the writer's advertisement is wrong.
So I felt like I wouldn't die in the future.
Sanshiro is like this.
Even in Tokyo, I thought that the person who put in a circle early and dug up the corporal was slaughtered, and it was just like this.
Sanshiro was still sitting under the bag with his master from the lady's god and giving lessons.
Perplexity 73.66 version
The meaning of life was just that I chose the second distant place.
However, when such a face is over, that word.
I started today when I went to Noonmachi.
A man who lives as a friend does not go to the inn The singing basket is the light of stripes, so he was alive when he urged him now, so it was the face of his mother who gave him the surrounding thirteen meals.
The teacher is Matachi and Kura
Nothing particularly profound came out, but since the generation is probabilistic, perhaps it is just a matter of repeating it until something good appears.
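For reference, the generation loop picks each next word by sampling from the softmax distribution over the vocabulary, which is why every run produces a different text. Below is a minimal sketch of that sampling step (my own illustration; the book's RnnlmGen.generate is built on this idea but its details differ):

import numpy as np

def sample_next_word(scores):
    # scores: unnormalized model outputs for every word in the vocabulary
    exp = np.exp(scores - scores.max())  # numerically stable softmax
    p = exp / exp.sum()
    return int(np.random.choice(len(p), p=p))  # sample a word id instead of taking argmax

Because words are drawn from the distribution rather than taken greedily, low-probability words occasionally show up, and repeating the generation can yield better (or worse) sentences.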
Finally, let's have it write the continuation of "I am a dog".
Perplexity 73.66 version
I am a dog.
While climbing the cup, shake your hands while holding your hands on your front legs.
A duster room is in the works.
I was in a bad mood.
Then it ends with a different eye.
Looking beyond the water, they are lined up.
I have a relative in the daytime while crushing from the tatami mat on the front.
But even a man who may look at him asked this question and brought him 100,000 today.
I don't know what it is, but he made something that looks like a novel.
7.2 seq2seq
This section describes seq2seq, which converts one sequence of time-series data into another. The book uses addition as its toy problem, but doing exactly the same thing is no fun, so I decided to solve square roots instead: given "2" as input, the model should output "1.414".
The dataset I created is simple: pairs of the 50,000 numbers from 0 to 49,999 and their square roots (4 significant digits). The digits are padded to a fixed width and the input and output are separated by _, so the book's code can train on it as is. Below is the dataset generation code dataset/create_sqroot_dataset.py. Running it in the dataset directory produces sqroot.txt.
dataset/create_sqroot_dataset.py
# coding: utf-8
import math
file_name = 'sqroot.txt'
with open(file_name, mode='w') as f:
    for i in range(50000):
        res = f'{math.sqrt(i):.4g}'  # square root rounded to 4 significant digits
        f.write(f'{i: <5}_{res: <5}\n')  # pad both fields to width 5, separated by '_'
The contents of the generated dataset sqroot.txt look like this:
dataset/sqroot.txt
0    _0
1    _1
2    _1.414
3    _1.732
4    _2
5    _2.236
6    _2.449
7    _2.646
8    _2.828
9    _3
10   _3.162
11   _3.317
12   _3.464
13   _3.606
14   _3.742
15   _3.873
16   _4
17   _4.123
18   _4.243
19   _4.359
(Omitted)
49980_223.6
49981_223.6
49982_223.6
49983_223.6
49984_223.6
49985_223.6
49986_223.6
49987_223.6
49988_223.6
49989_223.6
49990_223.6
49991_223.6
49992_223.6
49993_223.6
49994_223.6
49995_223.6
49996_223.6
49997_223.6
49998_223.6
49999_223.6
The input is 5 characters, the output is 6 characters including the leading _, and the vocabulary size is 13, the same as the addition dataset (the + used for addition goes away and the decimal point . comes in).
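As a quick sanity check (my own addition, not from the book; the file location is an assumption), counting the distinct characters in sqroot.txt gives exactly 13: the digits 0-9, the padding space, _ and the decimal point.

# Count the distinct characters in the generated dataset (newline excluded),
# assuming this is run from the directory that contains sqroot.txt.
with open('sqroot.txt') as f:
    chars = set(f.read()) - {'\n'}
print(len(chars), sorted(chars))  # expect 13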
With the unimproved model, the accuracy did not rise as easily as it did for addition. With the two improvements, reversing the input data and the peeky decoder, it managed to learn square roots.
Below is the source of ch07/train_seq2seq.py. The parts marked ★ are the changes from the book's code. I tried the hyperparameters several times and ended up increasing the hidden layer size a bit.
ch07/train_seq2seq.py
# coding: utf-8
import sys
sys.path.append('..')
import numpy as np
import matplotlib.pyplot as plt
from dataset import sequence
from common.optimizer import Adam
from common.trainer import Trainer
from common.util import eval_seq2seq
from seq2seq import Seq2seq
from peeky_seq2seq import PeekySeq2seq
#Data set loading
(x_train, t_train), (x_test, t_test) = sequence.load_data('sqroot.txt') #★ Data set change
char_to_id, id_to_char = sequence.get_vocab()
# Reverse input? =================================================
is_reverse = True #★ Improved version
if is_reverse:
    x_train, x_test = x_train[:, ::-1], x_test[:, ::-1]
# ================================================================
#Hyperparameter settings
vocab_size = len(char_to_id)
wordvec_size = 16
hidden_size = 192 #★ Adjustment
batch_size = 128
max_epoch = 25
max_grad = 5.0
# Normal or Peeky? ==============================================
# model = Seq2seq(vocab_size, wordvec_size, hidden_size)
model = PeekySeq2seq(vocab_size, wordvec_size, hidden_size) #★ Improved version
# ================================================================
optimizer = Adam()
trainer = Trainer(model, optimizer)
acc_list = []
for epoch in range(max_epoch):
    trainer.fit(x_train, t_train, max_epoch=1,
                batch_size=batch_size, max_grad=max_grad)

    correct_num = 0
    for i in range(len(x_test)):
        question, correct = x_test[[i]], t_test[[i]]
        verbose = i < 10
        correct_num += eval_seq2seq(model, question, correct,
                                    id_to_char, verbose, is_reverse)

    acc = float(correct_num) / len(x_test)
    acc_list.append(acc)
    print('val acc %.3f%%' % (acc * 100))
#Drawing a graph
x = np.arange(len(acc_list))
plt.plot(x, acc_list, marker='o')
plt.xlabel('epochs')
plt.ylabel('accuracy')
plt.ylim(0, 1.0)
plt.show()
Below is the last part of the execution result.
| epoch 25 | iter 1 / 351 | time 0[s] | loss 0.08
| epoch 25 | iter 21 / 351 | time 6[s] | loss 0.08
| epoch 25 | iter 41 / 351 | time 13[s] | loss 0.09
| epoch 25 | iter 61 / 351 | time 18[s] | loss 0.08
| epoch 25 | iter 81 / 351 | time 22[s] | loss 0.09
| epoch 25 | iter 101 / 351 | time 27[s] | loss 0.09
| epoch 25 | iter 121 / 351 | time 32[s] | loss 0.08
| epoch 25 | iter 141 / 351 | time 38[s] | loss 0.08
| epoch 25 | iter 161 / 351 | time 43[s] | loss 0.09
| epoch 25 | iter 181 / 351 | time 48[s] | loss 0.08
| epoch 25 | iter 201 / 351 | time 52[s] | loss 0.08
| epoch 25 | iter 221 / 351 | time 56[s] | loss 0.09
| epoch 25 | iter 241 / 351 | time 61[s] | loss 0.08
| epoch 25 | iter 261 / 351 | time 66[s] | loss 0.09
| epoch 25 | iter 281 / 351 | time 72[s] | loss 0.09
| epoch 25 | iter 301 / 351 | time 77[s] | loss 0.08
| epoch 25 | iter 321 / 351 | time 81[s] | loss 0.09
| epoch 25 | iter 341 / 351 | time 85[s] | loss 0.09
Q 27156
T 164.8
☑ 164.8
---
Q 41538
T 203.8
☑ 203.8
---
Q 82
T 9.055
☒ 9.124
---
Q 40944
T 202.3
☑ 202.3
---
Q 36174
T 190.2
☑ 190.2
---
Q 13831
T 117.6
☑ 117.6
---
Q 16916
T 130.1
☑ 130.1
---
Q 1133
T 33.66
☒ 33.63
---
Q 31131
T 176.4
☑ 176.4
---
Q 21956
T 148.2
☑ 148.2
---
val acc 79.000%
I managed to get an accuracy of just under 80%. Tuning the hyperparameters a bit more might improve it, but I found that simply reusing a model that works well for addition does not easily give high accuracy. Selecting and adjusting the model to fit the problem at hand seems to be the hard part.
Examples like chatbots and image captioning really open up the imagination. It's moving to think how much trial and error by those who came before lies behind them.
That's all for this chapter. If you find any mistakes, I would be grateful if you could point them out.
(To other chapters of this memo: Chapter 1 / Chapter 2 / Chapter 3 / Chapter 4 / Chapter 5 / Chapter 6 / Chapter 7)