Introduction

This is a memo of the points I stumbled over in Chapter 6 of "Deep Learning from scratch ❷ --- Natural language processing", a book I started studying on a whim.

The execution environment is macOS Catalina + Anaconda 2019.10, with Python 3.7.4. See Chapter 1 of this memo for details. Partway through this chapter I also start using Google Colaboratory.

(To other chapters of this memo: Chapter 1 / Chapter 2 / Chapter 3 / Chapter 4 / Chapter 5 / Chapter 6 / Chapter 7)

Chapter 6 Gated RNN

This chapter describes gated RNNs.

6.1 Problems with RNN

This section explains that a plain RNN is poor at long-term memory because of exploding and vanishing gradients.

The first topic is the countermeasure against exploding gradients, and the gradient clipping used here is a remarkably low-tech fix: when the overall norm of the gradients exceeds a threshold, the gradients are simply scaled down. Deep learning has many stories of the form "I tried it and it worked," and this is one of them. I have never been a researcher and have only learned technologies after they were established, so this gives me a feel for what cutting-edge development is like.
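
To make the idea concrete, here is a minimal sketch of norm-based gradient clipping written with NumPy. It is my own illustration in the spirit of the book's approach, not a copy of its code; the function name `clip_grads` and the sample gradients are assumptions made for this example.

```python
import numpy as np


def clip_grads(grads, max_norm):
    """Scale all gradients in place so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    rate = max_norm / (total_norm + 1e-6)  # small constant avoids division by zero
    if rate < 1:
        for g in grads:
            g *= rate  # in-place scaling keeps the direction, shrinks the length
    return total_norm


# Hypothetical gradients for two parameter arrays
grads = [np.full((3, 3), 10.0), np.full(3, 10.0)]
norm_before = clip_grads(grads, max_norm=0.25)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in grads))
print(norm_before, norm_after)
```

The threshold 0.25 matches the `max_grad` value used in the training scripts later in this memo. Only the length of the combined gradient changes; its direction is preserved.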

6.2 Vanishing gradients and LSTM

Gated RNNs are used as a countermeasure against vanishing gradients. The book mainly explains LSTM and covers GRU in the appendix. The explanation was easy to follow, and I did not stumble over anything in particular.

The book's explanation is based on colah's blog post "Understanding LSTM Networks", listed as Reference [31]. A Japanese translation of that post is available at https://qiita.com/KojiOhki/items/89cd7b69a8a6239d67ca. The post also explains GRU as a variation of LSTM.
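
As a reading aid, the following is a minimal sketch of a single LSTM forward step in NumPy. It assumes the common formulation with forget, input, and output gates plus a tanh candidate; the function name `lstm_step`, the stacked weight layout, and the sizes in the usage lines are assumptions for illustration, not the book's implementation.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, h_prev, c_prev, Wx, Wh, b):
    """One LSTM time step.

    x: (N, D) input, h_prev and c_prev: (N, H) previous hidden and cell state,
    Wx: (D, 4H), Wh: (H, 4H), b: (4H,) weights of the four transforms, stacked.
    """
    H = h_prev.shape[1]
    A = x @ Wx + h_prev @ Wh + b
    f = sigmoid(A[:, :H])             # forget gate
    g = np.tanh(A[:, H:2 * H])        # candidate information to store
    i = sigmoid(A[:, 2 * H:3 * H])    # input gate
    o = sigmoid(A[:, 3 * H:])         # output gate
    c_next = f * c_prev + g * i       # cell state: element-wise update only
    h_next = o * np.tanh(c_next)
    return h_next, c_next


rng = np.random.default_rng(0)
N, D, H = 2, 3, 4  # hypothetical batch, input, and hidden sizes
x = rng.standard_normal((N, D))
h, c = np.zeros((N, H)), np.zeros((N, H))
Wx = rng.standard_normal((D, 4 * H)) * 0.1
Wh = rng.standard_normal((H, 4 * H)) * 0.1
b = np.zeros(4 * H)
h, c = lstm_step(x, h, c, Wx, Wh, b)
print(h.shape, c.shape)
```

The point to notice is the cell-state line: going backward through it, the gradient is multiplied element-wise by the forget gate instead of repeatedly by the same weight matrix, which is why the cell state is less prone to vanishing gradients.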

6.3 Implementation of LSTM

Ideally I would implement it myself from here, but I am too busy to find the time, so I am still only reading the code in the book. I will try implementing it when I have time and add notes here if I run into anything.

6.4 Language model using LSTM

Even though I am not implementing it myself, just reading is boring, so I trained the model on word-segmented Aozora Bunko text, as in Chapters 2 and 4.

The dataset/aozorabunko.py that I created in Chapter 2 by adapting dataset/ptb.py could only fetch training data, so I added several more works to the word-segmented Aozora Bunko text and made it possible to fetch test and validation data as well.

The three PTB sets (train, test, and valid) share the same vocabulary, but my data is just a split of some arbitrarily chosen Aozora Bunko works, so the vocabulary is not shared. I therefore build the vocabulary from the train set, which has the most data, and when a word that is not in it appears in test or valid, I simply ignore that word when building the corpus.
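
The effect of this policy can be seen on a toy example. The word lists below are made up for illustration; only the idea (vocabulary from train only, unknown words dropped elsewhere) comes from the description above.

```python
import numpy as np

# Toy word lists standing in for the word-segmented train and test texts
train_words = '吾輩 は 猫 で ある <eos> 名前 は まだ 無い <eos>'.split()
test_words = '吾輩 は 犬 で ある <eos>'.split()

# Build the vocabulary from the training words only
word_to_id = {}
for word in train_words:
    if word not in word_to_id:
        word_to_id[word] = len(word_to_id)

train_corpus = np.array([word_to_id[w] for w in train_words])
# Words missing from the training vocabulary ('犬' here) are silently dropped
test_corpus = np.array([word_to_id[w] for w in test_words if w in word_to_id])

print(len(word_to_id))
print(test_corpus)
```

Dropping a word changes the word sequence the model is evaluated on, so perplexities measured this way are not strictly comparable with a setup like PTB's, where rare words are replaced with `<unk>` instead.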

Below is the code for dataset/aozorabunko.py. The ★ part is the change from dataset/ptb.py.

`dataset/aozorabunko.py`


# coding: utf-8
import sys
import os
sys.path.append('..')
try:
    import urllib.request
except ImportError:
    raise ImportError('Use Python3!')
import pickle
import numpy as np

#★ Base URL for downloading the word-segmented Aozora Bunko texts uploaded to GitHub.
# For details, see https://github.com/segavvy/wakatigaki-aozorabunko.
url_base = 'https://github.com/segavvy/wakatigaki-aozorabunko/raw/master/'
key_file = {
    'train': '20200516merge.txt',
    'test': '20201207merge.txt',
    'valid': '20201231merge.txt'
}
save_file = {
    'train': 'aozorabunko.train.npy',
    'test': 'aozorabunko.test.npy',
    'valid': 'aozorabunko.valid.npy'
}
vocab_file = 'aozorabunko.vocab.pkl'

dataset_dir = os.path.dirname(os.path.abspath(__file__))


def _download(file_name):
    file_path = dataset_dir + '/' + file_name
    if os.path.exists(file_path):
        return

    print('Downloading ' + file_name + ' ... ')

    try:
        urllib.request.urlretrieve(url_base + file_name, file_path)
    except urllib.error.URLError:
        import ssl
        ssl._create_default_https_context = ssl._create_unverified_context
        urllib.request.urlretrieve(url_base + file_name, file_path)

    print('Done')


#★ Word splitting is needed in two places, so it is factored out into a function. The implementation is very ad hoc ...
def _split_data(text):
    return text.replace('\n', '<eos> ').replace('。', '<eos> ').strip().split()


def load_vocab():
    vocab_path = dataset_dir + '/' + vocab_file

    if os.path.exists(vocab_path):
        with open(vocab_path, 'rb') as f:
            word_to_id, id_to_word = pickle.load(f)
        return word_to_id, id_to_word

    word_to_id = {}
    id_to_word = {}
    data_type = 'train'
    file_name = key_file[data_type]
    file_path = dataset_dir + '/' + file_name

    _download(file_name)

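    # Read as UTF-8 explicitly (assumed encoding of the text files) and close the file
    with open(file_path, encoding='utf-8') as f:
        words = _split_data(f.read())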

    for i, word in enumerate(words):
        if word not in word_to_id:
            tmp_id = len(word_to_id)
            word_to_id[word] = tmp_id
            id_to_word[tmp_id] = word

    with open(vocab_path, 'wb') as f:
        pickle.dump((word_to_id, id_to_word), f)

    return word_to_id, id_to_word


def load_data(data_type='train'):
    '''
        :param data_type: type of data: 'train', 'test', or 'valid' ('val')
        :return:
    '''
    if data_type == 'val': data_type = 'valid'
    save_path = dataset_dir + '/' + save_file[data_type]

    word_to_id, id_to_word = load_vocab()

    if os.path.exists(save_path):
        corpus = np.load(save_path)
        return corpus, word_to_id, id_to_word

    file_name = key_file[data_type]
    file_path = dataset_dir + '/' + file_name
    _download(file_name)

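    # Read as UTF-8 explicitly (assumed encoding of the text files) and close the file
    with open(file_path, encoding='utf-8') as f:
        words = _split_data(f.read())
    #★ Unlike PTB, the Aozora Bunko data used here is not preprocessed, so words that
    # are not in train also appear in the test and valid data. It is rough, but such words are skipped.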
    if data_type == 'train':
        corpus = np.array([word_to_id[w] for w in words])
    else:
        corpus = np.array([word_to_id[w] for w in words if w in word_to_id])

    np.save(save_path, corpus)
    return corpus, word_to_id, id_to_word


if __name__ == '__main__':
    for data_type in ('train', 'val', 'test'):
        load_data(data_type)

The training script ch06/train_rnnlm.py only changes the part that uses the PTB corpus to use Aozora Bunko instead. The parts marked ★ are the changes.

`ch06/train_rnnlm.py`


# coding: utf-8
import sys
sys.path.append('..')
from common.optimizer import SGD
from common.trainer import RnnlmTrainer
from common.util import eval_perplexity
from dataset import aozorabunko  #★ Changed to use the corpus of Aozora Bunko
from rnnlm import Rnnlm


# Hyperparameter settings
batch_size = 20
wordvec_size = 100
hidden_size = 100  # Number of elements in the RNN's hidden state vector
time_size = 35  # Number of time steps to unroll the RNN
lr = 20.0
max_epoch = 4
max_grad = 0.25

# Load the training data
corpus, word_to_id, id_to_word = aozorabunko.load_data('train')  #★ Change corpus
corpus_test, _, _ = aozorabunko.load_data('test')  #★ Change corpus
vocab_size = len(word_to_id)
xs = corpus[:-1]
ts = corpus[1:]

# Build the model
model = Rnnlm(vocab_size, wordvec_size, hidden_size)
optimizer = SGD(lr)
trainer = RnnlmTrainer(model, optimizer)

# Train with gradient clipping applied
trainer.fit(xs, ts, max_epoch, batch_size, time_size, max_grad,
            eval_interval=20)
trainer.plot(ylim=(0, 500))

# Evaluate on the test data
model.reset_state()
ppl_test = eval_perplexity(model, corpus_test)
print('test perplexity: ', ppl_test)

# Save the parameters
model.save_params()

Here is the execution result.

| epoch 1 |  iter 1 / 1247 | time 1[s] | perplexity 27249.26
| epoch 1 |  iter 21 / 1247 | time 36[s] | perplexity 4283.50
| epoch 1 |  iter 41 / 1247 | time 71[s] | perplexity 1291.40
| epoch 1 |  iter 61 / 1247 | time 110[s] | perplexity 846.53
| epoch 1 |  iter 81 / 1247 | time 148[s] | perplexity 597.52
| epoch 1 |  iter 101 / 1247 | time 185[s] | perplexity 459.56
| epoch 1 |  iter 121 / 1247 | time 221[s] | perplexity 351.08
| epoch 1 |  iter 141 / 1247 | time 258[s] | perplexity 327.10
| epoch 1 |  iter 161 / 1247 | time 294[s] | perplexity 294.42
| epoch 1 |  iter 181 / 1247 | time 331[s] | perplexity 248.45
| epoch 1 |  iter 201 / 1247 | time 367[s] | perplexity 242.54
| epoch 1 |  iter 221 / 1247 | time 404[s] | perplexity 222.16
| epoch 1 |  iter 241 / 1247 | time 439[s] | perplexity 197.64
| epoch 1 |  iter 261 / 1247 | time 478[s] | perplexity 200.25
| epoch 1 |  iter 281 / 1247 | time 527[s] | perplexity 193.02
| epoch 1 |  iter 301 / 1247 | time 568[s] | perplexity 189.60
| epoch 1 |  iter 321 / 1247 | time 612[s] | perplexity 185.42
| epoch 1 |  iter 341 / 1247 | time 645[s] | perplexity 196.01
| epoch 1 |  iter 361 / 1247 | time 675[s] | perplexity 183.63
| epoch 1 |  iter 381 / 1247 | time 707[s] | perplexity 175.95

(Omitted)

| epoch 4 |  iter 861 / 1247 | time 7444[s] | perplexity 48.34
| epoch 4 |  iter 881 / 1247 | time 7474[s] | perplexity 49.01
| epoch 4 |  iter 901 / 1247 | time 7502[s] | perplexity 43.83
| epoch 4 |  iter 921 / 1247 | time 7533[s] | perplexity 42.77
| epoch 4 |  iter 941 / 1247 | time 7564[s] | perplexity 43.90
| epoch 4 |  iter 961 / 1247 | time 7595[s] | perplexity 44.55
| epoch 4 |  iter 981 / 1247 | time 7626[s] | perplexity 43.74
| epoch 4 |  iter 1001 / 1247 | time 7657[s] | perplexity 44.34
| epoch 4 |  iter 1021 / 1247 | time 7687[s] | perplexity 42.23
| epoch 4 |  iter 1041 / 1247 | time 7714[s] | perplexity 43.39
| epoch 4 |  iter 1061 / 1247 | time 7742[s] | perplexity 43.26
| epoch 4 |  iter 1081 / 1247 | time 7769[s] | perplexity 43.42
| epoch 4 |  iter 1101 / 1247 | time 7795[s] | perplexity 47.44
| epoch 4 |  iter 1121 / 1247 | time 7822[s] | perplexity 43.31
| epoch 4 |  iter 1141 / 1247 | time 7849[s] | perplexity 51.94
| epoch 4 |  iter 1161 / 1247 | time 7875[s] | perplexity 45.76
| epoch 4 |  iter 1181 / 1247 | time 7900[s] | perplexity 45.39
| epoch 4 |  iter 1201 / 1247 | time 7925[s] | perplexity 45.50
| epoch 4 |  iter 1221 / 1247 | time 7950[s] | perplexity 47.77
| epoch 4 |  iter 1241 / 1247 | time 7975[s] | perplexity 45.17
Unable to create basic Accelerated OpenGL renderer.
Unable to create basic Accelerated OpenGL renderer.
Core Image is now using the software OpenGL renderer. This will be slow.
evaluating perplexity ...
670 / 671
test perplexity:  105.19809806111377

I think the "Unable to create basic Accelerated OpenGL renderer." message is probably caused by my unusual environment, in which Visual Studio Code is started with the GPU disabled [^1].

In the end, after 4 epochs the training perplexity dropped to 45.17, and the test perplexity was 105.19. These values are better than the book's, but the data is too different for a simple comparison. The first perplexity is also far larger than in the book because the vocabulary is larger.
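
The link between vocabulary size and the first perplexity can be checked with a short calculation. The sketch below assumes that an untrained model assigns roughly the same probability to every word; under that assumption the perplexity equals the vocabulary size (27,255 for the Aozora Bunko data used here).

```python
import numpy as np

vocab_size = 27255  # vocabulary size of the Aozora Bunko training set in this memo

# Assumption: an untrained model gives every word the same probability
prob_of_correct_word = 1.0 / vocab_size
cross_entropy = -np.log(prob_of_correct_word)  # average loss per word
perplexity = np.exp(cross_entropy)

print(perplexity)
```

This agrees well with the 27249.26 logged at the first iteration above. The same reasoning gives about 10,000 for PTB's 10,000-word vocabulary, so the gap at the start of training reflects the data, not the model.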

The following is a summary of the differences between the PTB dataset used in the book and the data in Aozora Bunko used this time.

| Item | PTB dataset | Aozora Bunko data used here |
|---|---|---|
| Vocabulary size | 10,000 | 27,255 |
| Training data (train), number of words | 929,589 | 873,028 |
| Validation data (valid), number of words | 73,760 | 154,598 |
| Test data (test), number of words | 82,430 | 234,943 |
| Main preprocessing | Rare words replaced with `<unk>`, concrete numbers replaced with `N`, etc. | No preprocessing of rare words or numbers. The vocabulary is built from the training data only, and unknown words appearing in the validation and test data are ignored. |

The preprocessing is far too sloppy compared with the PTB dataset, but since this is for study purposes, I will move on :sweat_smile:

6.5 Further improvements to RNNLM

I also trained this model on the Aozora Bunko data. The training script ch06/train_better_rnnlm.py only changes the part that uses the PTB corpus to use Aozora Bunko instead. The parts marked ★ are the changes.

`ch06/train_better_rnnlm.py`


# coding: utf-8
import sys
sys.path.append('..')
from common import config
# To run on GPU, uncomment the line below (requires cupy)
# ==============================================
# config.GPU = True
# ==============================================
from common.optimizer import SGD
from common.trainer import RnnlmTrainer
from common.util import eval_perplexity, to_gpu
from dataset import aozorabunko  #★ Changed to use the corpus of Aozora Bunko
from better_rnnlm import BetterRnnlm


# Hyperparameter settings
batch_size = 20
wordvec_size = 650
hidden_size = 650
time_size = 35
lr = 20.0
max_epoch = 40
max_grad = 0.25
dropout = 0.5

# Load the training data
corpus, word_to_id, id_to_word = aozorabunko.load_data('train')  #★ Change corpus
corpus_val, _, _ = aozorabunko.load_data('val')  #★ Change corpus
corpus_test, _, _ = aozorabunko.load_data('test')  #★ Change corpus

print(len(corpus))
print(len(corpus_val))
print(len(corpus_test))


if config.GPU:
    corpus = to_gpu(corpus)
    corpus_val = to_gpu(corpus_val)
    corpus_test = to_gpu(corpus_test)

vocab_size = len(word_to_id)
xs = corpus[:-1]
ts = corpus[1:]

model = BetterRnnlm(vocab_size, wordvec_size, hidden_size, dropout)
optimizer = SGD(lr)
trainer = RnnlmTrainer(model, optimizer)

best_ppl = float('inf')
for epoch in range(max_epoch):
    trainer.fit(xs, ts, max_epoch=1, batch_size=batch_size,
                time_size=time_size, max_grad=max_grad)

    model.reset_state()
    ppl = eval_perplexity(model, corpus_val)
    print('valid perplexity: ', ppl)

    if best_ppl > ppl:
        best_ppl = ppl
        model.save_params()
    else:
        lr /= 4.0
        optimizer.lr = lr

    model.reset_state()
    print('-' * 50)


# Evaluate on the test data
model.reset_state()
ppl_test = eval_perplexity(model, corpus_test)
print('test perplexity: ', ppl_test)
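
One detail of the loop in this script is its learning-rate schedule: after every epoch the validation perplexity is evaluated, and the learning rate is divided by 4 whenever it fails to improve. The sketch below isolates just that rule. The function name `decay_on_plateau` and the perplexity values are made up for illustration.

```python
def decay_on_plateau(val_ppls, lr=20.0, factor=4.0):
    """Return the learning rate used in each epoch for a sequence of validation perplexities."""
    best_ppl = float('inf')
    lrs = []
    for ppl in val_ppls:
        lrs.append(lr)      # learning rate used for this epoch
        if ppl < best_ppl:
            best_ppl = ppl  # improvement: keep lr (the script also saves parameters here)
        else:
            lr /= factor    # no improvement: shrink lr for the following epochs
    return lrs


# Made-up validation perplexities, for illustration only
schedule = decay_on_plateau([130.0, 110.0, 112.0, 100.0, 101.0])
print(schedule)
```

Because parameters are saved only on improvement, the saved file always holds the best model seen on the validation data so far.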

I tried running ch06/train_better_rnnlm.py like this ...

| epoch 1 |  iter 1 / 1247 | time 29[s] | perplexity 27253.64
| epoch 1 |  iter 21 / 1247 | time 464[s] | perplexity 5073.64
| epoch 1 |  iter 41 / 1247 | time 876[s] | perplexity 1873.61
| epoch 1 |  iter 61 / 1247 | time 1246[s] | perplexity 1320.39
| epoch 1 |  iter 81 / 1247 | time 1611[s] | perplexity 996.53
| epoch 1 |  iter 101 / 1247 | time 1978[s] | perplexity 775.86
| epoch 1 |  iter 121 / 1247 | time 2374[s] | perplexity 564.95
| epoch 1 |  iter 141 / 1247 | time 2789[s] | perplexity 504.54
| epoch 1 |  iter 161 / 1247 | time 3201[s] | perplexity 424.97
| epoch 1 |  iter 181 / 1247 | time 3605[s] | perplexity 374.88
| epoch 1 |  iter 201 / 1247 | time 4007[s] | perplexity 339.03
| epoch 1 |  iter 221 / 1247 | time 4409[s] | perplexity 302.10
| epoch 1 |  iter 241 / 1247 | time 4810[s] | perplexity 272.72
| epoch 1 |  iter 261 / 1247 | time 5208[s] | perplexity 260.39
| epoch 1 |  iter 281 / 1247 | time 5608[s] | perplexity 247.86
| epoch 1 |  iter 301 / 1247 | time 6009[s] | perplexity 235.96
| epoch 1 |  iter 321 / 1247 | time 6530[s] | perplexity 237.14
| epoch 1 |  iter 341 / 1247 | time 7067[s] | perplexity 241.14
| epoch 1 |  iter 361 / 1247 | time 7597[s] | perplexity 214.95
| epoch 1 |  iter 381 / 1247 | time 8124[s] | perplexity 212.66
| epoch 1 |  iter 401 / 1247 | time 8653[s] | perplexity 194.03
| epoch 1 |  iter 421 / 1247 | time 9171[s] | perplexity 191.89
| epoch 1 |  iter 441 / 1247 | time 9685[s] | perplexity 189.85
| epoch 1 |  iter 461 / 1247 | time 10187[s] | perplexity 176.23
| epoch 1 |  iter 481 / 1247 | time 10692[s] | perplexity 172.90
| epoch 1 |  iter 501 / 1247 | time 11206[s] | perplexity 169.14
| epoch 1 |  iter 521 / 1247 | time 11713[s] | perplexity 169.99
| epoch 1 |  iter 541 / 1247 | time 12221[s] | perplexity 160.42
| epoch 1 |  iter 561 / 1247 | time 12731[s] | perplexity 150.20
| epoch 1 |  iter 581 / 1247 | time 13189[s] | perplexity 154.43
| epoch 1 |  iter 601 / 1247 | time 13561[s] | perplexity 169.05
| epoch 1 |  iter 621 / 1247 | time 13926[s] | perplexity 145.88
| epoch 1 |  iter 641 / 1247 | time 14305[s] | perplexity 149.73
| epoch 1 |  iter 661 / 1247 | time 14671[s] | perplexity 140.01
| epoch 1 |  iter 681 / 1247 | time 15041[s] | perplexity 136.77
| epoch 1 |  iter 701 / 1247 | time 15414[s] | perplexity 138.95
| epoch 1 |  iter 721 / 1247 | time 15778[s] | perplexity 128.21
| epoch 1 |  iter 741 / 1247 | time 16143[s] | perplexity 129.12
| epoch 1 |  iter 761 / 1247 | time 16509[s] | perplexity 117.92
| epoch 1 |  iter 781 / 1247 | time 16877[s] | perplexity 118.82
| epoch 1 |  iter 801 / 1247 | time 17252[s] | perplexity 131.58
| epoch 1 |  iter 821 / 1247 | time 17659[s] | perplexity 125.46
| epoch 1 |  iter 841 / 1247 | time 18055[s] | perplexity 121.22
| epoch 1 |  iter 861 / 1247 | time 18490[s] | perplexity 124.50
| epoch 1 |  iter 881 / 1247 | time 18909[s] | perplexity 128.52
| epoch 1 |  iter 901 / 1247 | time 19254[s] | perplexity 110.78
| epoch 1 |  iter 921 / 1247 | time 19613[s] | perplexity 109.87
| epoch 1 |  iter 941 / 1247 | time 19974[s] | perplexity 104.43
| epoch 1 |  iter 961 / 1247 | time 20334[s] | perplexity 109.85
| epoch 1 |  iter 981 / 1247 | time 20713[s] | perplexity 105.33
| epoch 1 |  iter 1001 / 1247 | time 21214[s] | perplexity 107.46
| epoch 1 |  iter 1021 / 1247 | time 21727[s] | perplexity 98.93
| epoch 1 |  iter 1041 / 1247 | time 22195[s] | perplexity 101.21
| epoch 1 |  iter 1061 / 1247 | time 22608[s] | perplexity 99.85
| epoch 1 |  iter 1081 / 1247 | time 23070[s] | perplexity 100.58
| epoch 1 |  iter 1101 / 1247 | time 23493[s] | perplexity 106.59
| epoch 1 |  iter 1121 / 1247 | time 23930[s] | perplexity 101.43
| epoch 1 |  iter 1141 / 1247 | time 24364[s] | perplexity 118.50
| epoch 1 |  iter 1161 / 1247 | time 24754[s] | perplexity 103.30
| epoch 1 |  iter 1181 / 1247 | time 25166[s] | perplexity 101.68
| epoch 1 |  iter 1201 / 1247 | time 25574[s] | perplexity 101.07
| epoch 1 |  iter 1221 / 1247 | time 26001[s] | perplexity 105.31
| epoch 1 |  iter 1241 / 1247 | time 26425[s] | perplexity 97.66
evaluating perplexity ...
440 / 441
valid perplexity:  131.10745282836612

In my environment, one epoch took about eight hours. Running the 40 epochs the book specifies would take about two weeks. That is not feasible on my machine, an eight-year-old Mac mini running macOS in a virtual machine. So from here on, I tried using Google Colaboratory.
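
The estimate can be reproduced from the log with a little arithmetic. This rough sketch uses the elapsed time printed near the end of epoch 1 and ignores the time spent evaluating validation perplexity, which the log does not report, so the real figures are somewhat higher.

```python
seconds_per_epoch = 26425  # elapsed time at iter 1241 / 1247 of epoch 1 in the log above
epochs = 40

hours_per_epoch = seconds_per_epoch / 3600
total_days = seconds_per_epoch * epochs / (3600 * 24)

print(f'{hours_per_epoch:.1f} hours per epoch')
print(f'{total_days:.1f} days for {epochs} epochs')
```

Training alone comes to a little over 7 hours per epoch and about 12 days in total, which is consistent with the "eight hours" and "two weeks" figures once evaluation time is added.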

(Digression) Running the code in Google Colaboratory

Google Colaboratory (hereafter, Google Colab) is a Jupyter notebook environment that Google provides free of charge. All the modules this book requires are preinstalled, and you can even use a GPU. The following is a brief summary of how I ran the book's program there.

If you have never used Jupyter notebooks or Google Colab, "Introduction to Python from scratch > Python first experience" on the Japanese Python information site python.jp is easy to follow, and I recommend it. For working with Google Drive, @tomo_makes's "[Use free GPU at speed per second] Deep Learning Practice Tips on Colaboratory" is easy to understand and also recommended.

Since a GPU is available in Google Colab, uncomment the GPU setting at the beginning of ch06/train_better_rnnlm.py.

`ch06/train_better_rnnlm.py` (first part)


# coding: utf-8
import sys
sys.path.append('..')
from common import config
# To run on GPU, uncomment the line below (requires cupy)
# ==============================================
config.GPU = True  #★ Run on GPU
# ==============================================

Create a suitable folder in Google Drive and upload the files required for execution. This time Chapter 6 needs the ch06, common, and dataset folders, so I uploaded those three. Do not upload the __pycache__ folders, because they are where compiled modules are cached [^2]; they are generated automatically when you run the code in Google Colab.
Open the ch06 folder, right-click on an empty area, and create a Google Colab file to use for running the program.
An empty notebook opens, so give it a name. I named mine "train_better_rnnlm run.ipynb".
Click the folder icon on the left side of the notebook, then select the icon for mounting Google Drive. When asked "Do you want to allow this notebook to access files in Google Drive?", choose "Connect to GOOGLE Drive". Google Drive is then mounted under "My Drive", and the notebook can access the files you just uploaded.
The GPU is not enabled by default in Google Colab, so change the setting. From the "Runtime" menu, select "Change Runtime Type", choose "GPU" as the hardware accelerator, and click "Save".
After that, change the current directory to ch06 in the notebook and run the script with import train_better_rnnlm. With Google Colab, it took only 3 hours.
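
The mounting and directory change can also be done from a notebook cell instead of the sidebar. The following is a sketch under several assumptions: the folder name `deep-learning-2` is hypothetical (use the folder you created), the helper `project_dir` exists only for this example, and the Drive root may appear as `MyDrive` instead of `My Drive` depending on the Colab version. Outside Colab the cell does nothing.

```python
import os
import sys


def project_dir(folder_name, mount_point='/content/drive', drive_root='My Drive'):
    """Build the path of the uploaded ch06 folder under the mounted Google Drive."""
    return f'{mount_point}/{drive_root}/{folder_name}/ch06'


try:
    from google.colab import drive  # only importable inside Google Colab
except ImportError:
    drive = None

if drive is not None:
    drive.mount('/content/drive')
    # 'deep-learning-2' is a hypothetical folder name; replace it with your own
    os.chdir(project_dir('deep-learning-2'))
    sys.path.append(os.getcwd())
    import train_better_rnnlm  # importing the script starts the training
```

Changing into ch06 first matters because the script adds `..` to `sys.path` to find the common and dataset folders, and that relative path is resolved from the current directory.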

------------------------------------------------------------
                       GPU Mode (cupy)
------------------------------------------------------------

| epoch 1 |  iter 1 / 1247 | time 0[s] | perplexity 27254.36
| epoch 1 |  iter 21 / 1247 | time 4[s] | perplexity 5519.64
| epoch 1 |  iter 41 / 1247 | time 9[s] | perplexity 1787.35
| epoch 1 |  iter 61 / 1247 | time 13[s] | perplexity 1414.92
| epoch 1 |  iter 81 / 1247 | time 17[s] | perplexity 1023.22
| epoch 1 |  iter 101 / 1247 | time 21[s] | perplexity 841.00
| epoch 1 |  iter 121 / 1247 | time 26[s] | perplexity 645.57
| epoch 1 |  iter 141 / 1247 | time 30[s] | perplexity 554.69
| epoch 1 |  iter 161 / 1247 | time 34[s] | perplexity 466.50
| epoch 1 |  iter 181 / 1247 | time 38[s] | perplexity 378.45
| epoch 1 |  iter 201 / 1247 | time 43[s] | perplexity 355.62
| epoch 1 |  iter 221 / 1247 | time 47[s] | perplexity 312.13
| epoch 1 |  iter 241 / 1247 | time 51[s] | perplexity 284.31
| epoch 1 |  iter 261 / 1247 | time 55[s] | perplexity 263.32
| epoch 1 |  iter 281 / 1247 | time 60[s] | perplexity 249.79
| epoch 1 |  iter 301 / 1247 | time 64[s] | perplexity 241.19
| epoch 1 |  iter 321 / 1247 | time 68[s] | perplexity 240.92
| epoch 1 |  iter 341 / 1247 | time 72[s] | perplexity 245.29
| epoch 1 |  iter 361 / 1247 | time 77[s] | perplexity 223.77
| epoch 1 |  iter 381 / 1247 | time 81[s] | perplexity 212.58
| epoch 1 |  iter 401 / 1247 | time 85[s] | perplexity 196.46
| epoch 1 |  iter 421 / 1247 | time 90[s] | perplexity 193.10
| epoch 1 |  iter 441 / 1247 | time 94[s] | perplexity 190.85
| epoch 1 |  iter 461 / 1247 | time 98[s] | perplexity 178.43
| epoch 1 |  iter 481 / 1247 | time 102[s] | perplexity 175.06
| epoch 1 |  iter 501 / 1247 | time 107[s] | perplexity 172.07
| epoch 1 |  iter 521 / 1247 | time 111[s] | perplexity 174.37
| epoch 1 |  iter 541 / 1247 | time 115[s] | perplexity 162.38
| epoch 1 |  iter 561 / 1247 | time 120[s] | perplexity 153.01
| epoch 1 |  iter 581 / 1247 | time 124[s] | perplexity 153.36
| epoch 1 |  iter 601 / 1247 | time 128[s] | perplexity 167.39
| epoch 1 |  iter 621 / 1247 | time 132[s] | perplexity 147.94
| epoch 1 |  iter 641 / 1247 | time 137[s] | perplexity 151.36
| epoch 1 |  iter 661 / 1247 | time 141[s] | perplexity 140.71
| epoch 1 |  iter 681 / 1247 | time 145[s] | perplexity 137.02
| epoch 1 |  iter 701 / 1247 | time 150[s] | perplexity 140.36
| epoch 1 |  iter 721 / 1247 | time 154[s] | perplexity 129.04
| epoch 1 |  iter 741 / 1247 | time 158[s] | perplexity 132.23
| epoch 1 |  iter 761 / 1247 | time 163[s] | perplexity 121.74
| epoch 1 |  iter 781 / 1247 | time 167[s] | perplexity 123.19
| epoch 1 |  iter 801 / 1247 | time 171[s] | perplexity 132.10
| epoch 1 |  iter 821 / 1247 | time 175[s] | perplexity 126.86
| epoch 1 |  iter 841 / 1247 | time 180[s] | perplexity 121.01
| epoch 1 |  iter 861 / 1247 | time 184[s] | perplexity 124.81
| epoch 1 |  iter 881 / 1247 | time 188[s] | perplexity 127.83
| epoch 1 |  iter 901 / 1247 | time 193[s] | perplexity 111.56
| epoch 1 |  iter 921 / 1247 | time 197[s] | perplexity 110.17
| epoch 1 |  iter 941 / 1247 | time 201[s] | perplexity 106.49
| epoch 1 |  iter 961 / 1247 | time 206[s] | perplexity 109.84
| epoch 1 |  iter 981 / 1247 | time 210[s] | perplexity 105.97
| epoch 1 |  iter 1001 / 1247 | time 214[s] | perplexity 106.18
| epoch 1 |  iter 1021 / 1247 | time 219[s] | perplexity 98.83
| epoch 1 |  iter 1041 / 1247 | time 223[s] | perplexity 100.83
| epoch 1 |  iter 1061 / 1247 | time 227[s] | perplexity 100.39
| epoch 1 |  iter 1081 / 1247 | time 232[s] | perplexity 98.95
| epoch 1 |  iter 1101 / 1247 | time 236[s] | perplexity 107.26
| epoch 1 |  iter 1121 / 1247 | time 240[s] | perplexity 100.78
| epoch 1 |  iter 1141 / 1247 | time 245[s] | perplexity 115.91
| epoch 1 |  iter 1161 / 1247 | time 249[s] | perplexity 103.40
| epoch 1 |  iter 1181 / 1247 | time 253[s] | perplexity 102.27
| epoch 1 |  iter 1201 / 1247 | time 258[s] | perplexity 102.13
| epoch 1 |  iter 1221 / 1247 | time 262[s] | perplexity 103.44
| epoch 1 |  iter 1241 / 1247 | time 266[s] | perplexity 97.24
evaluating perplexity ...
440 / 441
valid perplexity:  131.96678
--------------------------------------------------
| epoch 2 |  iter 1 / 1247 | time 0[s] | perplexity 152.90

(Omitted)

| epoch 39 |  iter 1241 / 1247 | time 268[s] | perplexity 26.58
evaluating perplexity ...
440 / 441
valid perplexity:  87.54218
--------------------------------------------------
| epoch 40 |  iter 1 / 1247 | time 0[s] | perplexity 44.52
| epoch 40 |  iter 21 / 1247 | time 4[s] | perplexity 35.15
| epoch 40 |  iter 41 / 1247 | time 8[s] | perplexity 34.62
| epoch 40 |  iter 61 / 1247 | time 13[s] | perplexity 34.63
| epoch 40 |  iter 81 / 1247 | time 17[s] | perplexity 30.87
| epoch 40 |  iter 101 / 1247 | time 21[s] | perplexity 31.86
| epoch 40 |  iter 121 / 1247 | time 26[s] | perplexity 29.81
| epoch 40 |  iter 141 / 1247 | time 30[s] | perplexity 31.46
| epoch 40 |  iter 161 / 1247 | time 34[s] | perplexity 31.74
| epoch 40 |  iter 181 / 1247 | time 39[s] | perplexity 29.82
| epoch 40 |  iter 201 / 1247 | time 43[s] | perplexity 30.46
| epoch 40 |  iter 221 / 1247 | time 47[s] | perplexity 30.38
| epoch 40 |  iter 241 / 1247 | time 52[s] | perplexity 29.90
| epoch 40 |  iter 261 / 1247 | time 56[s] | perplexity 30.13
| epoch 40 |  iter 281 / 1247 | time 60[s] | perplexity 31.11
| epoch 40 |  iter 301 / 1247 | time 65[s] | perplexity 30.98
| epoch 40 |  iter 321 / 1247 | time 69[s] | perplexity 30.86
| epoch 40 |  iter 341 / 1247 | time 73[s] | perplexity 33.41
| epoch 40 |  iter 361 / 1247 | time 78[s] | perplexity 31.96
| epoch 40 |  iter 381 / 1247 | time 82[s] | perplexity 31.97
| epoch 40 |  iter 401 / 1247 | time 86[s] | perplexity 30.91
| epoch 40 |  iter 421 / 1247 | time 91[s] | perplexity 32.48
| epoch 40 |  iter 441 / 1247 | time 95[s] | perplexity 30.68
| epoch 40 |  iter 461 / 1247 | time 99[s] | perplexity 29.06
| epoch 40 |  iter 481 / 1247 | time 104[s] | perplexity 29.30
| epoch 40 |  iter 501 / 1247 | time 108[s] | perplexity 30.14
| epoch 40 |  iter 521 / 1247 | time 112[s] | perplexity 30.65
| epoch 40 |  iter 541 / 1247 | time 117[s] | perplexity 30.09
| epoch 40 |  iter 561 / 1247 | time 121[s] | perplexity 28.05
| epoch 40 |  iter 581 / 1247 | time 125[s] | perplexity 30.44
| epoch 40 |  iter 601 / 1247 | time 130[s] | perplexity 31.19
| epoch 40 |  iter 621 / 1247 | time 134[s] | perplexity 28.56
| epoch 40 |  iter 641 / 1247 | time 138[s] | perplexity 31.40
| epoch 40 |  iter 661 / 1247 | time 143[s] | perplexity 28.68
| epoch 40 |  iter 681 / 1247 | time 147[s] | perplexity 28.82
| epoch 40 |  iter 701 / 1247 | time 151[s] | perplexity 29.54
| epoch 40 |  iter 721 / 1247 | time 156[s] | perplexity 26.66
| epoch 40 |  iter 741 / 1247 | time 160[s] | perplexity 27.79
| epoch 40 |  iter 761 / 1247 | time 164[s] | perplexity 26.76
| epoch 40 |  iter 781 / 1247 | time 169[s] | perplexity 26.98
| epoch 40 |  iter 801 / 1247 | time 173[s] | perplexity 29.41
| epoch 40 |  iter 821 / 1247 | time 177[s] | perplexity 27.64
| epoch 40 |  iter 841 / 1247 | time 182[s] | perplexity 28.54
| epoch 40 |  iter 861 / 1247 | time 186[s] | perplexity 29.63
| epoch 40 |  iter 881 / 1247 | time 190[s] | perplexity 28.86
| epoch 40 |  iter 901 / 1247 | time 195[s] | perplexity 26.40
| epoch 40 |  iter 921 / 1247 | time 199[s] | perplexity 25.49
| epoch 40 |  iter 941 / 1247 | time 203[s] | perplexity 26.11
| epoch 40 |  iter 961 / 1247 | time 208[s] | perplexity 27.68
| epoch 40 |  iter 981 / 1247 | time 212[s] | perplexity 26.79
| epoch 40 |  iter 1001 / 1247 | time 216[s] | perplexity 27.18
| epoch 40 |  iter 1021 / 1247 | time 221[s] | perplexity 25.34
| epoch 40 |  iter 1041 / 1247 | time 225[s] | perplexity 26.43
| epoch 40 |  iter 1061 / 1247 | time 229[s] | perplexity 26.31
| epoch 40 |  iter 1081 / 1247 | time 234[s] | perplexity 26.05
| epoch 40 |  iter 1101 / 1247 | time 238[s] | perplexity 27.87
| epoch 40 |  iter 1121 / 1247 | time 242[s] | perplexity 26.48
| epoch 40 |  iter 1141 / 1247 | time 247[s] | perplexity 30.43
| epoch 40 |  iter 1161 / 1247 | time 251[s] | perplexity 28.00
| epoch 40 |  iter 1181 / 1247 | time 255[s] | perplexity 26.65
| epoch 40 |  iter 1201 / 1247 | time 260[s] | perplexity 27.46
| epoch 40 |  iter 1221 / 1247 | time 264[s] | perplexity 27.82
| epoch 40 |  iter 1241 / 1247 | time 269[s] | perplexity 26.43
evaluating perplexity ...
440 / 441
valid perplexity:  87.54218
--------------------------------------------------
evaluating perplexity ...
670 / 671
test perplexity:  73.6651

Thanks to Google Colab, even someone like me with only a low-spec PC could try full-scale training on a GPU. I am grateful.

The test perplexity also dropped to 73.66. That looks pretty good.

6.6 Summary

Now that I have a model trained on Aozora Bunko, I would like to try generating text, but I will leave that for the beginning of the next chapter.

That's all for this chapter. If you find any mistakes, I would be grateful if you could point them out.