This is a continuation of machine translation using the Microsoft Cognitive Toolkit (CNTK).
In Part 2, we train a Transformer-based machine translation model using the Japanese-English bilingual dataset prepared in Part 1. It is assumed that CNTK and NVIDIA GPU CUDA are already installed.
In Natural Language: Machine Translation Part1 - Japanese-English Subtitle Corpus, we prepared Japanese and English sentence pairs from the Japanese-English Subtitle Corpus (JESC) [1].
In Part 2, we build and train a machine translation model with the Transformer.
Transformer

The Transformer [2] was proposed as a replacement for the RNN [3] and CNN [4] architectures that were previously mainstream in natural language processing.
RNN performance has been improved by gate structures and the attention mechanism, but because an RNN cannot compute the next time step until the computation of the current time step is finished, it cannot exploit the parallel computation of a GPU and training takes a long time.
The Transformer allows parallel computation on a GPU during training, has a simpler structure than an RNN, and can realize a wider receptive field than a CNN.
In the figure, the part outlined in blue on the left is the Encoder and the part outlined in green on the right is the Decoder, each consisting of 6 layers.
To improve accuracy and reduce the number of parameters, the Decoder's embedding layer and its fully connected output layer share weights [5].
Each parameter is initialized with Glorot initialization [6].
Since predicting the next word is a classification problem, we set the loss function to Cross Entropy Error and adopted Adam [7] as the optimization algorithm. Adam's hyperparameter $\beta_1$ was set to 0.9 and $\beta_2$ to CNTK's default value.
For the learning rate we used the Cyclical Learning Rate (CLR) [8], with a maximum learning rate of 0.04, a base learning rate of 1e-8, a step size of 10 times the number of epochs, the exp_range policy, and $\gamma$ set to 0.99994.
Model training ran for 10 epochs using mini-batch learning.
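For reference, the exp_range variant of CLR can be sketched as follows. This is a minimal NumPy illustration of the schedule described above, not the training script's actual code; the function name and the concrete step_size value are assumptions.

```python
import numpy as np

def clr_exp_range(iteration, base_lr=1e-8, max_lr=0.04, step_size=10000, gamma=0.99994):
    """Cyclical Learning Rate with the exp_range policy (Smith, 2017).

    The learning rate oscillates between base_lr and max_lr in a triangular
    cycle of length 2 * step_size, with the amplitude decayed by gamma ** iteration.
    """
    cycle = np.floor(1 + iteration / (2 * step_size))
    x = np.abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x) * gamma ** iteration

# Example: learning rate at iteration 25,000
print(clr_exp_range(25000))
```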
・CPU Intel(R) Core(TM) i7-5820K 3.30GHz
・GPU NVIDIA Quadro RTX 6000 24GB

・Windows 10 Pro 1909
・CUDA 10.0
・cuDNN 7.6
・Python 3.6.6
・cntk-gpu 2.7
・cntkx 0.1.33
・numpy 1.17.3
・pandas 0.25.0
・sentencepiece 0.1.86
The training program is available on GitHub.
nmtt_training.py
Below, I supplement the main points of this implementation.
Scaled Dot-Product Attention

Let $Source$ be the tensor consisting of the Encoder's hidden states at each time step and $Target$ be the tensor consisting of the Decoder's hidden states at each time step. Then the basic dot-product attention is expressed by the following formula.
Attention(Target, Source) = Target \cdot Source^T
Here, as shown in the figure below, $Source$ is duplicated into a dictionary-like pair of $Key$ and $Value$, and $Target$ is treated as the $Query$. The attention weights are obtained by normalizing the dot product of $Query$ and $Key$ with Softmax, and the output is the dot product of the attention weights and $Value$.
Attention(Q, K, V) = Softmax \left( QK^T \right) V
By copying $Source$ into $Key$ and $Value$ in this way, we expect to obtain a non-trivial transformation between $Source$ and $Target$.
However, when the model dimension $d_k$ becomes large, the dot product of $Q$ and $K$ becomes too large, so it is scaled by the square root of $d_k$.
Attention(Q, K, V) = Softmax \left( \frac{QK^T}{\sqrt {d_{k}}} \right) V
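As a minimal sketch in plain NumPy (not the CNTK implementation used in the training script), scaled dot-product attention can be written as follows.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = Softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # (..., len_q, len_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)       # suppress masked positions
    weights = softmax(scores, axis=-1)              # attention weights
    return weights @ V
```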
The attention mechanism in the figure above is called **Source-Target Attention**, and the attention mechanism in the figure below, where $Q$, $K$, and $V$ are all copies of $Source$, is called **Self-Attention**.
The Transformer uses Self-Attention in the Encoder, and both Self-Attention and Source-Target Attention in the Decoder. However, the Decoder's Self-Attention masks future information during training, as sketched below.
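The masking can be realized with a lower-triangular (causal) mask applied before the Softmax. A small sketch, reusing the scaled_dot_product_attention helper above (the shapes are illustrative assumptions):

```python
# Causal mask for Decoder Self-Attention: position t may only attend to positions <= t.
seq_len = 5
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))  # True = keep, False = mask

Q = K = V = np.random.randn(seq_len, 64)  # toy Self-Attention inputs
out = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
```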
Multi-Head Attention

Instead of applying a single Scaled Dot-Product Attention to the whole input, the Transformer splits the computation into multiple Heads: a fully connected layer is applied to Key, Value, and Query before they are fed into each Head, the outputs of all Heads are concatenated, and a fully connected layer is applied again.
MultiHeadAttention(Q, K, V) = \left[ Attention_1(QW^Q_1, KW^K_1, VW^V_1), ..., Attention_h(QW^Q_h, KW^K_h, VW^V_h) \right] W^O
By executing Attention separately on each of these parts, each Head is expected to acquire a different subspace representation.
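Continuing the NumPy sketch above (the projection shapes and helper names are assumptions chosen for illustration, not the actual layer definitions in nmtt_training.py):

```python
def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o, num_heads):
    """Project Q, K, V, split into num_heads heads, attend per head, concatenate, re-project.

    W_q, W_k, W_v, W_o: (d_model, d_model) weight matrices.
    """
    d_model = Q.shape[-1]
    d_head = d_model // num_heads

    def split_heads(x):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return x.reshape(x.shape[0], num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(Q @ W_q), split_heads(K @ W_k), split_heads(V @ W_v)
    heads = scaled_dot_product_attention(q, k, v)             # (num_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(Q.shape[0], d_model)
    return concat @ W_o
```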
Position-wise Feed-Forward Network

The Position-wise Feed-Forward Network applies a two-layer fully connected network at each position in the sequence. In the original paper [2], the outer dimension is 512, the inner dimension is 2048 (four times larger), and the inner activation function is ReLU.
FFN(x) = max(0, xW_{inner} + b_{inner})W_{outer} + b_{outer}
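In NumPy, again as a minimal sketch rather than the actual CNTK layers:

```python
def position_wise_ffn(x, W_inner, b_inner, W_outer, b_outer):
    """FFN(x) = max(0, x W_inner + b_inner) W_outer + b_outer, applied at every position.

    x: (seq_len, 512); W_inner: (512, 2048); W_outer: (2048, 512).
    """
    hidden = np.maximum(0.0, x @ W_inner + b_inner)  # ReLU on the 2048-dim inner layer
    return hidden @ W_outer + b_outer
```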
Positional Encoding

Since the Transformer has no recursive structure like an RNN, it cannot take word order into account by itself. Therefore, positional information for each word is added immediately after the embedding layer [9].
Positional Encoding uses the following formulas.
PE_{(pos, 2i)} = \sin \left( \frac{pos}{10000^{\frac{2i}{d_{k}}}} \right) \\
PE_{(pos, 2i+1)} = \cos \left( \frac{pos}{10000^{\frac{2i}{d_{k}}}} \right)
Here $d_k$ is the dimension of the embedding layer, $pos$ is the position of the word, and $2i$ and $2i+1$ are the even and odd dimensions of the embedding layer, respectively. Assuming a maximum sequence length of 97 and an embedding dimension of 512, the Positional Encoding is visualized as shown in the figure below.
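A small sketch of how such a Positional Encoding matrix can be computed (the default shapes, 97 positions by 512 dimensions, match the figure; the function name is an assumption):

```python
def positional_encoding(max_len=97, d_k=512):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_k)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_k))."""
    pos = np.arange(max_len)[:, np.newaxis]     # (max_len, 1)
    i = np.arange(0, d_k, 2)[np.newaxis, :]     # (1, d_k / 2) even dimension indices
    angle = pos / np.power(10000.0, i / d_k)
    pe = np.zeros((max_len, d_k))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe
```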
The Transformer uses this formula because $PE_{pos+\tau}$ can be represented as a linear function of $PE_{pos}$.
Here, setting
u_i = \frac{1}{10000^{\frac{2i}{d_{k}}}}
the Positional Encoding expressions become the following.
PE_{(pos, 2i)} = \sin (pos \cdot u_i) \\
PE_{(pos, 2i+1)} = \cos (pos \cdot u_i)
Then $PE_{(pos+\tau, 2i)}$ can be expanded as
\begin{align}
PE_{(pos+\tau, 2i)} &= \sin ((pos+\tau) \cdot u_i) \\
&= \sin (pos \cdot u_i) \cos (\tau u_i) + \cos (pos \cdot u_i) \sin (\tau u_i) \\
&= PE_{(pos, 2i)} \cos(\tau u_i) + PE_{(pos, 2i+1)} \sin (\tau u_i)
\end{align}
That is, it can be expressed as a linear combination of $PE_{(pos, 2i)}$ and $PE_{(pos, 2i+1)}$.
Training loss and perplexity

The figure below visualizes the logs of the loss function and perplexity during training. The left graph shows the loss function and the right graph shows the perplexity; the horizontal axis is the number of epochs and the vertical axis is the value of the loss function and the perplexity, respectively.
Validation BLEU score

Now that the Japanese-English translation model has been trained, I evaluated its performance using the validation data. Greedy decoding was used for this evaluation.
Bilingual Evaluation Understudy (BLEU) [10] was computed for this performance evaluation. BLEU was calculated with nltk, using NIST smoothing. Using dev as the validation data gave the following results; the number after the hyphen indicates the n-gram order.
BLEU-4 Score 1.84
BLEU-1 Score 12.22
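A sketch of how such scores can be computed with nltk (the tokenized sentences below are placeholders, not the actual dev data):

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Placeholder tokenized data: each hypothesis is paired with a list of reference translations.
references = [[["this", "is", "a", "test", "."]]]
hypotheses = [["this", "is", "test", "."]]

smooth = SmoothingFunction().method3  # NIST geometric sequence smoothing

bleu4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth)
bleu1 = corpus_bleu(references, hypotheses, weights=(1.0, 0.0, 0.0, 0.0), smoothing_function=smooth)
print("BLEU-4 %.2f / BLEU-1 %.2f" % (bleu4 * 100, bleu1 * 100))
```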
The Transformer allows its Attention maps to be visualized. The figure below visualizes the Attention map of each head in the 5th and 6th Self-Attention layers of the Encoder, displayed with the hot colormap.
Encoder 5
Encoder 6
In the 5th layer, each head appears to attend to specific words, but the middle head in the 6th layer does not seem to be working well.
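A minimal sketch of this kind of visualization (assuming the attention weights of one layer have already been extracted into a NumPy array; matplotlib and the array shape are assumptions):

```python
import matplotlib.pyplot as plt

def plot_attention_maps(attention, tokens, title="Encoder Self-Attention"):
    """attention: (num_heads, seq_len, seq_len) attention weights for one layer."""
    num_heads = attention.shape[0]
    fig, axes = plt.subplots(1, num_heads, figsize=(3 * num_heads, 3))
    for h, ax in enumerate(axes):
        ax.imshow(attention[h], cmap="hot")  # hot colormap, as in the figures above
        ax.set_title("head %d" % (h + 1))
        ax.set_xticks(range(len(tokens)))
        ax.set_xticklabels(tokens, rotation=90)
        ax.set_yticks(range(len(tokens)))
        ax.set_yticklabels(tokens)
    fig.suptitle(title)
    plt.tight_layout()
    plt.show()
```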