I tried GPT using Microsoft Cognitive Toolkit (CNTK).
A Japanese corpus is needed for training. It is also assumed that you have an NVIDIA GPU and CUDA installed.
This time, I prepared a Japanese corpus and trained a Japanese sentence generation model. For word segmentation, a subword model is created with sentencepiece [1].
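As a reference, here is a minimal sketch of how such a subword model can be built with sentencepiece. The corpus file name `corpus.txt`, the model prefix `jgpt`, and the vocabulary size are placeholder assumptions, not the exact settings used in this article.

```python
import sentencepiece as spm

# Train a unigram subword model on the raw Japanese corpus.
# corpus.txt, the prefix "jgpt", and vocab_size=32000 are placeholders.
spm.SentencePieceTrainer.Train(
    "--input=corpus.txt --model_prefix=jgpt "
    "--vocab_size=32000 --character_coverage=0.9995 --model_type=unigram"
)

# Load the trained model and split a sentence into subwords / IDs.
sp = spm.SentencePieceProcessor()
sp.Load("jgpt.model")
print(sp.EncodeAsPieces("自然言語処理を学ぶ"))
print(sp.EncodeAsIds("自然言語処理を学ぶ"))
```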
GPT
The Generative Pretraining Transformer (GPT) [2] uses only the Decoder part of the Transformer [3]. The Transformer is introduced in Natural Language: Machine Translation Part2 - Neural Machine Translation Transformer.
Also, as in Natural Language: BERT Part2 - Unsupervised pretraining ALBERT, the model is configured as a Pre-Layer Normalization Transformer [4]. The details of the layer structure are shown in the figure below.
We also used Factorized embedding parameterization and Cross-layer parameter sharing to reduce the size of the model.
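To illustrate these two ALBERT-style techniques, here is a minimal numpy sketch, not the code of the actual model: Factorized embedding parameterization replaces one V×H embedding matrix with a V×E matrix followed by an E×H projection, and Cross-layer parameter sharing reuses a single set of block weights for every layer. All dimensions below are assumed values.

```python
import numpy as np

V, E, H, num_layers = 32000, 128, 512, 12  # assumed sizes, not the article's values

# Factorized embedding parameterization: V x E followed by E x H
token_emb = np.random.normal(0.0, 0.02, (V, E))   # vocabulary -> small embedding
emb_proj  = np.random.normal(0.0, 0.02, (E, H))   # embedding -> hidden size
print("factorized params:  ", V * E + E * H)      # 4,161,536
print("unfactorized params:", V * H)               # 16,384,000

# Cross-layer parameter sharing: one weight set reused by every layer
shared_block_weights = {"w": np.random.normal(0.0, 0.02, (H, H))}

def transformer_block(x, weights):
    # stand-in for the real pre-LN attention + feed-forward block
    return x + np.tanh(x @ weights["w"])

x = np.random.normal(0.0, 0.02, (10, H))  # a dummy sequence of 10 hidden vectors
for _ in range(num_layers):
    x = transformer_block(x, shared_block_weights)  # same weights every layer
```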
GPT's Multi-Head Attention uses Masked Self-Attention for unsupervised learning by autoregression.
The initial value of each parameter was drawn from a normal distribution with a variance of 0.02.
The loss function uses the Cross Entropy Error.
Adam [5] was used as the optimization algorithm. Adam's hyperparameter $ β_1 $ was set to 0.9 and $ β_2 $ to the CNTK default value.
For the learning rate, the Cyclical Learning Rate (CLR) [6] was used: the maximum learning rate is 1e-4, the base learning rate is 1e-8, the step size is 10 times the number of epochs, and the policy is triangular2.
The model was trained for 1,000,000 iterations with mini-batch learning.
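For reference, the triangular2 policy can be written down directly as a function of the iteration number. This is a plain-Python sketch of the schedule itself, assuming a placeholder step size; it is not the CNTK/cntkx learner configuration used in the training script.

```python
import math

def clr_triangular2(iteration, base_lr=1e-8, max_lr=1e-4, step_size=10000):
    """Cyclical Learning Rate with the triangular2 policy:
    the learning rate oscillates between base_lr and max_lr, and the
    amplitude is halved after every full cycle (two step_sizes)."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x) / (2 ** (cycle - 1))

# Example: learning rate at a few iterations (step_size here is an assumed value)
for it in [0, 5000, 10000, 20000, 30000]:
    print(it, clr_triangular2(it))
```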
・CPU Intel(R) Core(TM) i7-5820K 3.30GHz
・GPU NVIDIA Quadro RTX 6000 24GB

・Windows 10 Pro 1909
・CUDA 10.0
・cuDNN 7.6
・Python 3.6.6
・cntk-gpu 2.7
・cntkx 0.1.53
・pandas 1.1.2
・sentencepiece 0.1.91
The training program is available on GitHub.
jgpt_training.py
Below, I supplement the concepts required for this implementation.
OpenAI GPT
GPT was proposed as a pre-training model for natural language processing. It is an autoregressive language model that predicts the word $ w_{t+1} $ at the next time $ t+1 $ from the input words $ w_1, w_2, ..., w_t $ up to time $ t $.
p(w) = \prod^{T}_{t=1} p(w_{t+1} \mid w_1, w_2, \ldots, w_t)
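Written out concretely, the sentence probability is simply the product of the per-step conditional probabilities (in practice the sum of log-probabilities, which is what the cross-entropy loss averages). A small illustration with made-up probabilities:

```python
import math

# Hypothetical next-token probabilities p(w_{t+1} | w_1 ... w_t) for a 4-step sentence
step_probs = [0.20, 0.35, 0.10, 0.50]

# Sentence probability: the product of the conditionals
p_sentence = 1.0
for p in step_probs:
    p_sentence *= p

# In practice the sum of log-probabilities is used for numerical stability
log_p_sentence = sum(math.log(p) for p in step_probs)

print(p_sentence)                 # 0.0035
print(math.exp(log_p_sentence))   # same value
```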
Like BERT, GPT performs unsupervised pre-training and is then fine-tuned on the datasets of multiple tasks. In BERT, unsupervised learning was realized with a special [MASK] token, whereas in GPT, unsupervised pre-training is performed autoregressively, as shown in the figure below.
The figure below compares the Multi-Head Attention of BERT and GPT. BERT can use bidirectional information from both the past and the future, whereas GPT masks future information and uses only unidirectional information.
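The difference between the two comes down to the mask added to the attention scores before the softmax. Below is a minimal numpy sketch of this causal mask, not the article's CNTK implementation; the sequence length and head dimension are arbitrary.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

T, d = 5, 8  # sequence length and head dimension (assumed values)
q = np.random.randn(T, d)
k = np.random.randn(T, d)

scores = q @ k.T / np.sqrt(d)  # raw attention scores (T x T)

# BERT-style self-attention: no mask, every position sees past and future
bidirectional = softmax(scores)

# GPT-style masked self-attention: positions j > i (the future) get -inf,
# so their weights become zero after the softmax
causal_mask = np.triu(np.ones((T, T)), k=1).astype(bool)
masked_scores = np.where(causal_mask, -np.inf, scores)
unidirectional = softmax(masked_scores)

print(np.round(unidirectional, 2))  # the upper triangle is all zeros
```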
GPT-2
GPT-2 [7] uses Pre-Layer Normalization, and its Transformer Decoder has up to 48 layers and 1.5 billion parameters.
GPT-2 performed well on multiple tasks in a zero-shot setting by pre-training on a huge 40GB dataset called WebText, which contains about 8 million documents.
GPT-3
GPT-3 [8] obtained a more accurate language model by scaling up GPT-2's network and dataset.
The GPT-3 model has the same configuration as GPT-2, but introduces the Sparse Transformer [9] into the Transformer Decoder, with up to 96 layers and 175 billion parameters. Training GPT-3 is said to cost about 490 million yen, and training on a single GPU would take about 355 years.
GPT-3 is said to generate text natural enough that exchanging sentences with a human does not feel strange, but it appears to have a weakness: it is inferior to BERT-type models on tasks that require bidirectional information.
Training Loss
The figure below visualizes the loss function log during training. The horizontal axis is the number of iterations, and the vertical axis is the value of the loss function.
Here are generation examples from the trained model. The word after > is the input prompt, and the model generates the continuation; a sketch of the generation loop follows the examples.
>Mankind
Mankind has learned this reconstructed bioengineering uterus for the first time in economics, learning devices and not switching the price.
>Magic
Magic is the one who has great success.
>Earth
The Earth emphasizes the importance of the latter.
>Aoi Hazuki is
Aoi Hazuki crouched down, don't be so peaceful.
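For reference, the continuation is typically produced by an autoregressive loop: encode the prompt, predict a distribution over the next subword, sample one, append it, and repeat until an end-of-sentence token appears. The sketch below assumes a hypothetical `model(ids)` that returns next-token probabilities and a loaded sentencepiece processor `sp`; it is not the exact code of jgpt_training.py.

```python
import numpy as np

def generate(model, sp, prompt, max_len=50, eos_id=2):
    """Autoregressive generation: repeatedly predict the next subword
    from everything generated so far. `model(ids)` is assumed to return
    a probability distribution over the vocabulary for the next token."""
    ids = sp.EncodeAsIds(prompt)
    for _ in range(max_len):
        probs = model(ids)                                    # p(w_{t+1} | w_1 ... w_t)
        next_id = int(np.random.choice(len(probs), p=probs))  # sample the next subword
        if next_id == eos_id:                                 # stop at end of sentence
            break
        ids.append(next_id)
    return sp.DecodeIds(ids)
```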
Japanese-looking sentences are generated, but their content is meaningless. It becomes clear that a larger model, a larger training dataset, and hardware capable of training them are needed to generate text at a comfortable level.
Natural Language : Machine Translation Part2 - Neural Machine Translation Transformer
Natural Language : BERT Part2 - Unsupervised pretraining ALBERT