Natural Language: GPT - Japanese Generative Pretraining Transformer

Target

I tried GPT using Microsoft Cognitive Toolkit (CNTK).

A Japanese corpus is prepared for training. It is assumed that you have an NVIDIA GPU with CUDA installed.

Introduction

This time, I prepared a Japanese corpus and trained a Japanese sentence-generation model. For word segmentation, a subword model is created with SentencePiece [1].
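A minimal sketch of training such a subword model with SentencePiece might look like this (hypothetical file names and assumed settings, not the author's exact call):

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="japanese_corpus.txt",     # one sentence per line (hypothetical file name)
    model_prefix="corpus",           # writes corpus.model and corpus.vocab
    vocab_size=32000,                # assumed vocabulary size
    model_type="unigram",            # SentencePiece's default subword algorithm
)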

GPT

Generative Pretraining Transformer (GPT) [2] uses only the Decoder part of the Transformer [3]. The Transformer is introduced in Natural Language: Machine Translation Part2 - Neural Machine Translation Transformer.

Also, as in Natural Language: BERT Part2 - Unsupervised pretraining ALBERT, the model is configured as a Pre-Layer Normalization Transformer [4]. The details of the layer structure are shown in the figure below.

Factorized embedding parameterization and Cross-layer parameter sharing are also used to reduce the number of model parameters (see the sketch after the figure below).

gpt.png
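The following is a minimal sketch, with assumed sizes and layers (not the author's jgpt_training.py), of the pieces described above: a factorized embedding (vocab to E, then E to H), a single Pre-LN decoder block whose parameters are shared across all layers, and a residual connection around each sub-layer. The masked Multi-Head Attention sub-layer is stood in for by a Dense layer to keep the sketch short.

import cntk as C

vocab_size, embed_dim, hidden_dim, num_layers = 32000, 128, 512, 6   # assumed sizes

tokens = C.sequence.input_variable(vocab_size, is_sparse=True)

# Factorized embedding parameterization: vocab -> E, then E -> H
h = C.layers.Dense(hidden_dim)(C.layers.Embedding(embed_dim)(tokens))

# One Pre-LN decoder block, reused for every layer (cross-layer parameter sharing)
attention = C.layers.Dense(hidden_dim)                      # stand-in for masked Multi-Head Attention
ffn = C.layers.Sequential([C.layers.Dense(hidden_dim * 4, activation=C.relu),
                           C.layers.Dense(hidden_dim)])
ln1, ln2 = C.layers.LayerNormalization(), C.layers.LayerNormalization()

for _ in range(num_layers):
    h = h + attention(ln1(h))                               # LayerNorm before the sub-layer, residual after
    h = h + ffn(ln2(h))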

GPT's Multi-Head Attention uses Masked Self-Attention so that unsupervised learning can be performed by autoregression.
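To make the masking concrete, here is a small NumPy sketch (illustrative only) of how attention weights can be computed so that each position attends only to itself and earlier positions:

import numpy as np

def masked_attention_weights(queries, keys):
    # Scaled dot-product attention scores for a sequence of length t
    t, d = queries.shape
    scores = queries @ keys.T / np.sqrt(d)
    # Mask out future positions (strictly above the diagonal) before the softmax
    future = np.triu(np.ones((t, t), dtype=bool), 1)
    scores = np.where(future, -1e9, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)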

Settings in training

The initial value of each parameter was drawn from a normal distribution with a variance of 0.02.

Cross-entropy error is used as the loss function.
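In CNTK this corresponds to something like the following sketch (assumed shapes, not the author's exact code):

import cntk as C

vocab_size, hidden_dim = 32000, 512                     # assumed sizes
features = C.sequence.input_variable(hidden_dim)
labels = C.sequence.input_variable(vocab_size, is_sparse=True)

# Weights drawn from a normal distribution (scale 0.02) and cross-entropy over the vocabulary
logits = C.layers.Dense(vocab_size, init=C.initializer.normal(0.02))(features)
loss = C.cross_entropy_with_softmax(logits, labels)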

Adam [5] was used as the optimization algorithm. Adam's hyperparameter $β_1$ was set to 0.9, and $β_2$ was left at CNTK's default value.
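A sketch of the corresponding learner setup in CNTK might look like this (the model here is a placeholder, not the actual GPT network):

import cntk as C

features = C.input_variable(128)
model = C.layers.Dense(32000)(features)                 # placeholder standing in for the GPT network

lr = C.learning_parameter_schedule(1e-4)                # replaced each cycle by the CLR value
beta1 = C.momentum_schedule(0.9)                        # β1 = 0.9; β2 stays at CNTK's default
learner = C.adam(model.parameters, lr=lr, momentum=beta1)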

The learning rate follows the Cyclical Learning Rate (CLR) [6] schedule, with a maximum learning rate of 1e-4, a base learning rate of 1e-8, a step size of 10 times the number of epochs, and the triangular2 policy.
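For reference, the triangular2 policy can be written as follows (a sketch; the step size value here is only a placeholder):

import numpy as np

def clr_triangular2(iteration, base_lr=1e-8, max_lr=1e-4, step_size=10000):
    # step_size is a placeholder; the article sets it to 10 times the number of epochs
    cycle = np.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    # triangular2: the triangle's amplitude is halved after every cycle
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x) / (2 ** (cycle - 1))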

Model training ran for 1,000,000 iterations of mini-batch learning.

Implementation

Execution environment

hardware

・ CPU Intel(R) Core(TM) i7-5820K 3.30GHz
・ GPU NVIDIA Quadro RTX 6000 24GB

software

・ Windows 10 Pro 1909
・ CUDA 10.0
・ cuDNN 7.6
・ Python 3.6.6
・ cntk-gpu 2.7
・ cntkx 0.1.53
・ pandas 1.1.2
・ sentencepiece 0.1.91

Program to run

The training program is available on GitHub.

jgpt_training.py


Commentary

I will supplement the background required for this implementation.

OpenAI GPT

GPT was proposed as a pre-training model for natural language processing. It is an autoregressive language model that predicts the word $w_{t+1}$ at the next time step $t+1$ from the input words $w_1, w_2, ..., w_t$ up to time $t$.

p(w) = \prod^T_{t=1} p(w_{t+1} | w_1, w_2, ..., w_t)
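As a small illustration of this factorization (hypothetical helper names), the probability of a word sequence is the product of the model's predicted probabilities for each next word:

def sequence_probability(next_word_probs, targets):
    # next_word_probs[t]: the model's distribution over the vocabulary after reading w_1 .. w_t
    # targets[t]: the index of the observed next word w_{t+1}
    p = 1.0
    for probs, target in zip(next_word_probs, targets):
        p *= probs[target]
    return p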

Like BERT, GPT performs unsupervised pre-training and is then fine-tuned on datasets for multiple tasks. BERT achieves unsupervised learning with the special [MASK] token, whereas GPT performs unsupervised pre-training by autoregression, as shown in the figure below.

autoregression.png

The figure below compares the Multi-Head Attention of BERT and GPT. BERT can use bidirectional information from both past and future positions, whereas GPT masks future positions and uses only unidirectional information.

bert_gpt.png

GPT-2

GPT-2 [7] uses Pre-Layer Normalization, and its Transformer Decoder has up to 48 layers and 1.5 billion parameters.

GPT-2 performed well on multiple tasks in a zero-shot setting by pre-training on WebText, a huge 40GB dataset containing 8 million documents.

GPT-3

GPT-3 [8] obtains a more accurate language model by scaling up the size of GPT-2's network and dataset.

The GPT-3 model has the same basic configuration as GPT-2, but introduces the Sparse Transformer [9] into the Transformer Decoder, with up to 96 layers and 175 billion parameters. Training GPT-3 is said to cost about 490 million yen, and training on a single GPU would take about 355 years.

GPT-3 is said to generate text at a level that does not feel unnatural when exchanging sentences with humans, but it appears to have the weakness of being inferior to BERT-style models on tasks that require bidirectional information.

result

Training Loss

The figure below visualizes the loss function log during training. The horizontal axis represents the number of iterations, and the vertical axis represents the value of the loss function.

gpt_logging.png

Example of Japanese sentence generation

Here are examples of generation with the trained model. A prompt beginning with > is entered, and the model generates the continuation.

>Mankind
Mankind has learned this reconstructed bioengineering uterus for the first time in economics, learning devices and not switching the price.
>Magic
Magic is the one who has great success.
>Earth
The Earth emphasizes the importance of the latter.
>Aoi Hazuki is
Aoi Hazuki crouched down, don't be so peaceful.

The output looks like Japanese, but the sentences are meaningless. You can see that generating text at a comfortable level requires a larger model, a larger training dataset, and the hardware to run them.
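For reference, a decoding loop that produces such continuations might look like the following sketch (hypothetical file and function names; greedy decoding is used for simplicity):

import numpy as np
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("corpus.model")                                 # hypothetical SentencePiece model file

def generate(predict_next, prompt, max_len=50, eos_id=2):
    # predict_next(ids) is assumed to return the model's distribution over the vocabulary
    ids = sp.encode_as_ids(prompt)
    for _ in range(max_len):
        next_id = int(np.argmax(predict_next(ids)))     # greedy choice of the next token
        if next_id == eos_id:                           # eos_id is an assumed end-of-sentence id
            break
        ids.append(next_id)
    return sp.decode_ids(ids)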

reference

Natural Language : Machine Translation Part2 - Neural Machine Translation Transformer
Natural Language : BERT Part2 - Unsupervised pretraining ALBERT

  1. Taku Kudo and John Richardson. "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing", arXiv preprint arXiv:1808.06226, (2018).
  2. Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. "Improving Language Understanding by Generative Pre-Training", (2018): 12.
  3. Ashish Vaswani, et al. "Attention Is All You Need", Advances in neural information processing systems. 2017. p. 5998-6008.
  4. Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu. "On Layer Normalization in the Transformer Architecture", arXiv preprint arXiv:2002.04745 (2020).
  5. Diederik P. Kingma and Jimmy Lei Ba. "Adam: A method for stochastic optimization", arXiv preprint arXiv:1412.6980 (2014).
  6. Leslie N. Smith. "Cyclical Learning Rates for Training Neural Networks", 2017 IEEE Winter Conference on Applications of Computer Vision. 2017, p. 464-472.
  7. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. "Language Models are Unsupervised Multitask Learners", OpenAI blog 1.8 (2019): 9.
  8. Tom B. Brown, et al. "Language Models are Few-Shot Learners", arXiv preprint arXiv:2005.14165 (2020).
  9. Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. "Generating Long Sequences with Sparse Transformers", arXiv preprint arXiv:1904.10509 (2019).
