This is a continuation of machine translation using the Microsoft Cognitive Toolkit (CNTK).
In Part 2, we train a Transformer-based machine translation model using the Japanese-English bilingual dataset prepared in Part 1. It is assumed that CNTK and NVIDIA GPU CUDA are already installed.
In Natural Language: Machine Translation Part1 - Japanese-English Subtitle Corpus, we prepared Japanese and English sentence pairs from the Japanese-English Subtitle Corpus (JESC) [1].
In Part 2, we build and train a machine translation model with the Transformer.
Transformer

The Transformer [2] was proposed as a replacement for the RNN [3] and CNN [4] architectures that were previously mainstream in natural language processing.
RNN performance has been improved by gate structures and the attention mechanism, but because an RNN cannot compute the next time step until the computation of the current time step is finished, it cannot exploit the parallel computation of a GPU and training takes a long time.
The Transformer allows parallel computation on a GPU during training, has a simpler structure than an RNN, and can realize a wider receptive field than a CNN.
In the figure, the part outlined in blue on the left is the Encoder and the part outlined in green on the right is the Decoder, each consisting of 6 layers.
To improve accuracy and reduce the number of parameters, the Decoder's embedding layer and its fully connected output layer share weights [5].
Each parameter is initialized with Glorot initialization [6].
Since predicting the next word is a classification problem, we set the loss function to Cross Entropy Error and adopted Adam [7] as the optimization algorithm. Adam's hyperparameter $\beta_1$ was set to 0.9 and $\beta_2$ to CNTK's default value.
For the learning rate we used the Cyclical Learning Rate (CLR) [8], with a maximum learning rate of 0.04, a base learning rate of 1e-8, a step size of 10 times the number of epochs, the exp_range policy, and $\gamma$ set to 0.99994.
Model training ran for 10 epochs using mini-batch learning.
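For reference, the exp_range variant of CLR can be sketched as follows. This is a minimal NumPy illustration of the schedule described above, not the training script's actual code; the function name and the concrete step_size value are assumptions.

```python
import numpy as np

def clr_exp_range(iteration, base_lr=1e-8, max_lr=0.04, step_size=10000, gamma=0.99994):
    """Cyclical Learning Rate with the exp_range policy (Smith, 2017).

    The learning rate oscillates between base_lr and max_lr in a triangular
    cycle of length 2 * step_size, with the amplitude decayed by gamma ** iteration.
    """
    cycle = np.floor(1 + iteration / (2 * step_size))
    x = np.abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x) * gamma ** iteration

# Example: learning rate at iteration 25,000
print(clr_exp_range(25000))
```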
・CPU Intel(R) Core(TM) i7-5820K 3.30GHz
・GPU NVIDIA Quadro RTX 6000 24GB

・Windows 10 Pro 1909
・CUDA 10.0
・cuDNN 7.6
・Python 3.6.6
・cntk-gpu 2.7
・cntkx 0.1.33
・numpy 1.17.3
・pandas 0.25.0
・sentencepiece 0.1.86
The training program is available on GitHub.
nmtt_training.py
Below, I supplement the main points of this implementation.
Scaled Dot-Product Attention

Let $Source$ be the tensor consisting of the Encoder's hidden states at each time step and $Target$ be the tensor consisting of the Decoder's hidden states at each time step. Then the basic dot-product attention is expressed by the following formula.
Attention(Target, Source) = Target \cdot Source^T
Here, as shown in the figure below, $Source$ is duplicated into a dictionary-like pair of $Key$ and $Value$, and $Target$ is treated as the $Query$. The attention weights are obtained by normalizing the dot product of $Query$ and $Key$ with Softmax, and the output is the dot product of the attention weights and $Value$.
Attention(Q, K, V) = Softmax \left( QK^T \right) V
By copying $Source$ into $Key$ and $Value$ in this way, we expect to obtain a non-trivial transformation between $Source$ and $Target$.
However, when the model dimension $d_k$ becomes large, the dot product of $Q$ and $K$ becomes too large, so it is scaled by the square root of $d_k$.
Attention(Q, K, V) = Softmax \left( \frac{QK^T}{\sqrt {d_{k}}} \right) V
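As a minimal sketch in plain NumPy (not the CNTK implementation used in the training script), scaled dot-product attention can be written as follows.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = Softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # (..., len_q, len_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)       # suppress masked positions
    weights = softmax(scores, axis=-1)              # attention weights
    return weights @ V
```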
The attention mechanism in the figure above is called **Source-Target Attention**, and the attention mechanism in the figure below, where $Q$, $K$, and $V$ are all copies of $Source$, is called **Self-Attention**.
The Transformer uses Self-Attention in the Encoder, and both Self-Attention and Source-Target Attention in the Decoder. However, the Decoder's Self-Attention masks future information during training, as sketched below.
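The masking can be realized with a lower-triangular (causal) mask applied before the Softmax. A small sketch, reusing the scaled_dot_product_attention helper above (the shapes are illustrative assumptions):

```python
# Causal mask for Decoder Self-Attention: position t may only attend to positions <= t.
seq_len = 5
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))  # True = keep, False = mask

Q = K = V = np.random.randn(seq_len, 64)  # toy Self-Attention inputs
out = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
```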
Multi-Head Attention

Instead of applying a single Scaled Dot-Product Attention to the whole input, the Transformer splits the computation into multiple Heads: a fully connected layer is applied to Key, Value, and Query before they are fed into each Head, the outputs of all Heads are concatenated, and a fully connected layer is applied again.
MultiHeadAttention(Q, K, V) = \left[ Attention_1(QW^Q_1, KW^K_1, VW^V_1), ..., Attention_h(QW^Q_h, KW^K_h, VW^V_h) \right] W^O
By executing Attention separately on each of these parts, each Head is expected to acquire a different subspace representation.
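Continuing the NumPy sketch above (the projection shapes and helper names are assumptions chosen for illustration, not the actual layer definitions in nmtt_training.py):

```python
def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o, num_heads):
    """Project Q, K, V, split into num_heads heads, attend per head, concatenate, re-project.

    W_q, W_k, W_v, W_o: (d_model, d_model) weight matrices.
    """
    d_model = Q.shape[-1]
    d_head = d_model // num_heads

    def split_heads(x):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return x.reshape(x.shape[0], num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(Q @ W_q), split_heads(K @ W_k), split_heads(V @ W_v)
    heads = scaled_dot_product_attention(q, k, v)             # (num_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(Q.shape[0], d_model)
    return concat @ W_o
```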
Position-wise Feed-Forward Network

The Position-wise Feed-Forward Network applies a two-layer fully connected network at each position in the sequence. In the original paper [2], the outer dimension is 512, the inner dimension is 2048 (four times larger), and the inner activation function is ReLU.
FFN(x) = max(0, xW_{inner} + b_{inner})W_{outer} + b_{outer}
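In NumPy, again as a minimal sketch rather than the actual CNTK layers:

```python
def position_wise_ffn(x, W_inner, b_inner, W_outer, b_outer):
    """FFN(x) = max(0, x W_inner + b_inner) W_outer + b_outer, applied at every position.

    x: (seq_len, 512); W_inner: (512, 2048); W_outer: (2048, 512).
    """
    hidden = np.maximum(0.0, x @ W_inner + b_inner)  # ReLU on the 2048-dim inner layer
    return hidden @ W_outer + b_outer
```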
Positional Encoding

Since the Transformer has no recursive structure like an RNN, it cannot take word order into account by itself. Therefore, positional information for each word is added immediately after the embedding layer [9].
Positional Encoding uses the following formulas.
PE_{(pos, 2i)} = \sin \left( \frac{pos}{10000^{\frac{2i}{d_{k}}}} \right) \\
PE_{(pos, 2i+1)} = \cos \left( \frac{pos}{10000^{\frac{2i}{d_{k}}}} \right)
Here $d_k$ is the dimension of the embedding layer, $pos$ is the position of the word, and $2i$ and $2i+1$ are the even and odd dimensions of the embedding layer, respectively. Assuming a maximum sequence length of 97 and an embedding dimension of 512, the Positional Encoding is visualized as shown in the figure below.
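A small sketch of how such a Positional Encoding matrix can be computed (the default shapes, 97 positions by 512 dimensions, match the figure; the function name is an assumption):

```python
def positional_encoding(max_len=97, d_k=512):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_k)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_k))."""
    pos = np.arange(max_len)[:, np.newaxis]     # (max_len, 1)
    i = np.arange(0, d_k, 2)[np.newaxis, :]     # (1, d_k / 2) even dimension indices
    angle = pos / np.power(10000.0, i / d_k)
    pe = np.zeros((max_len, d_k))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe
```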
The Transformer uses this formula because $PE_{pos+\tau}$ can be represented as a linear function of $PE_{pos}$.
Here, setting
u_i = \frac{1}{10000^{\frac{2i}{d_{k}}}}
the Positional Encoding expressions become the following.
PE_{(pos, 2i)} = \sin (pos \cdot u_i) \\
PE_{(pos, 2i+1)} = \cos (pos \cdot u_i)
Then $PE_{(pos+\tau, 2i)}$ can be expanded as
\begin{align}
PE_{(pos+\tau, 2i)} &= \sin ((pos+\tau) \cdot u_i) \\
&= \sin (pos \cdot u_i) \cos (\tau u_i) + \cos (pos \cdot u_i) \sin (\tau u_i) \\
&= PE_{(pos, 2i)} \cos(\tau u_i) + PE_{(pos, 2i+1)} \sin (\tau u_i)
\end{align}
That is, it can be expressed as a linear combination of $PE_{(pos, 2i)}$ and $PE_{(pos, 2i+1)}$.
Training loss and perplexity

The figure below visualizes the logs of the loss function and perplexity during training. The left graph shows the loss function and the right graph shows the perplexity; the horizontal axis is the number of epochs and the vertical axis is the value of the loss function and the perplexity, respectively.
Validation BLEU score

Now that the Japanese-English translation model has been trained, I evaluated its performance using the validation data. Greedy decoding was used for this evaluation.
Bilingual Evaluation Understudy (BLEU) [10] was computed for this performance evaluation. BLEU was calculated with nltk, using NIST smoothing. Using dev as the validation data gave the following results; the number after the hyphen indicates the n-gram order.
BLEU-4 Score 1.84
BLEU-1 Score 12.22
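A sketch of how such scores can be computed with nltk (the tokenized sentences below are placeholders, not the actual dev data):

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Placeholder tokenized data: each hypothesis is paired with a list of reference translations.
references = [[["this", "is", "a", "test", "."]]]
hypotheses = [["this", "is", "test", "."]]

smooth = SmoothingFunction().method3  # NIST geometric sequence smoothing

bleu4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth)
bleu1 = corpus_bleu(references, hypotheses, weights=(1.0, 0.0, 0.0, 0.0), smoothing_function=smooth)
print("BLEU-4 %.2f / BLEU-1 %.2f" % (bleu4 * 100, bleu1 * 100))
```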
The Transformer allows its Attention maps to be visualized. The figure below visualizes the Attention map of each head in the 5th and 6th Self-Attention layers of the Encoder, displayed with the hot colormap.
Encoder 5
Encoder 6
In the 5th layer, each head appears to attend to specific words, but the middle head in the 6th layer does not seem to be working well.
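A minimal sketch of this kind of visualization (assuming the attention weights of one layer have already been extracted into a NumPy array; matplotlib and the array shape are assumptions):

```python
import matplotlib.pyplot as plt

def plot_attention_maps(attention, tokens, title="Encoder Self-Attention"):
    """attention: (num_heads, seq_len, seq_len) attention weights for one layer."""
    num_heads = attention.shape[0]
    fig, axes = plt.subplots(1, num_heads, figsize=(3 * num_heads, 3))
    for h, ax in enumerate(axes):
        ax.imshow(attention[h], cmap="hot")  # hot colormap, as in the figures above
        ax.set_title("head %d" % (h + 1))
        ax.set_xticks(range(len(tokens)))
        ax.set_xticklabels(tokens, rotation=90)
        ax.set_yticks(range(len(tokens)))
        ax.set_yticklabels(tokens)
    fig.suptitle(title)
    plt.tight_layout()
    plt.show()
```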