Natural Language: BERT Part2 --Unsupervised pretraining ALBERT

Target

This is a continuation of the BERT series using the Microsoft Cognitive Toolkit (CNTK).

In Part 2, ALBERT is pre-trained on the Japanese Wikipedia corpus prepared in Part 1. It is assumed that you have an NVIDIA GPU and CUDA installed.

Introduction

In Natural Language: BERT Part1 - Japanese Wikipedia Corpus, we prepared a pre-training corpus from Japanese Wikipedia.

In Part 2, we will create and train an unsupervised pre-training model.

BERT

Bidirectional Encoder Representations from Transformers (BERT) [1] uses only the Encoder part of the Transformer [2]. The Transformer is introduced in Natural Language: Machine Translation Part2 - Neural Machine Translation Transformer.

In addition, this time we implemented the base model of ALBERT [3], a lighter version of BERT, and configured it as a Pre-Layer Normalization Transformer [4]. The details of the layer structure are shown in the figure below.

bert.png

BERT's Multi-Head Attention uses Self-Attention, which enables bidirectional learning.

Training settings

The initial values of the parameters were drawn from a normal distribution with a variance of 0.02.

The loss function is Cross Entropy Error for the masked word prediction in Masked LM and Binary Cross Entropy for the classification in Sentence Prediction.

Adam [5] was used as the optimization algorithm. Adam's hyperparameter $\beta_1$ was set to 0.9, and $\beta_2$ was left at CNTK's default value.

For the learning rate, we used the Cyclical Learning Rate (CLR) [6] with a maximum learning rate of 1e-4, a base learning rate of 1e-8, a step size of 10 times the number of epochs, and the triangular2 policy.
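For reference, here is a minimal sketch of the triangular2 policy in plain Python; the step size value is illustrative, and this is not the CNTK/cntkx learner configuration used in the actual training script.

import math

def triangular2_lr(iteration, base_lr=1e-8, max_lr=1e-4, step_size=10000):
    """Cyclical Learning Rate with the triangular2 policy: the learning rate
    oscillates between base_lr and max_lr, and the amplitude of each cycle
    is halved relative to the previous one."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    scale = 1.0 / (2 ** (cycle - 1))          # halve the peak every cycle
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x) * scale

print(triangular2_lr(10000))  # peak of the first cycle, approximately 1e-4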

The model was trained for 3,000,000 iterations of mini-batch learning.

Implementation

Execution environment

Hardware

・ CPU: Intel(R) Core(TM) i7-5820K 3.30GHz
・ GPU: NVIDIA Quadro RTX 6000 24GB

Software

・ Windows 10 Pro 1909
・ CUDA 10.0
・ cuDNN 7.6
・ Python 3.6.6
・ cntk-gpu 2.7
・ cntkx 0.1.33
・ MeCab 0.996
・ numpy 1.17.3
・ pandas 0.25.0

Program to run

The training program is available on GitHub.

bert_pretraining.py


Commentary

Here I explain the main points of this implementation.

Masked LM and Sentence Prediction

In BERT pre-training, the input begins with the special token [CLS] and consists of two sentences, as shown in the figure below, with the special token [SEP] inserted at the end of each sentence. Two kinds of training are then performed: Masked LM and Sentence Prediction.

pretraining.png
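As an illustration, the input layout could be assembled as in the sketch below; build_bert_input and the example tokens are hypothetical and only show the [CLS]/[SEP] arrangement and the segment IDs.

def build_bert_input(tokens_a, tokens_b):
    """Arrange two tokenized sentences as [CLS] A [SEP] B [SEP] and
    build segment IDs that mark which sentence each token belongs to."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segments = build_bert_input(["he", "reads", "books"], ["he", "likes", "novels"])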

Masked LM

In Masked LM, 15% of the tokens in the input sentence are replaced with the special token [MASK], and the words at the same positions in the original sentence serve as the correct answers.

The following layers predict the masked words. Only the gradients at the masked positions are used to update the parameters.

bert_masked_lm


import cntk as C
import cntkx as Cx
from cntk.layers import Dense, LayerNormalization

def bert_masked_lm(encode):
    """ Masked Language Model: project the encoder output back onto the vocabulary """
    h = Dense(num_hidden, activation=Cx.gelu_fast, init=C.normal(0.02))(encode)  # num_hidden: hidden layer dimension (768)
    h = LayerNormalization()(h)
    return Dense(num_word, activation=None, init=C.normal(0.02))(h)  # num_word: vocabulary size

However, since [MASK] is a special token that appears only during BERT pre-training, it makes the language model unnatural at fine-tuning time. BERT therefore uses the following strategy to reduce this mismatch.

・ 80% of the time, replace the word with [MASK].
・ 10% of the time, replace it with a random word.
・ 10% of the time, keep the original word unchanged. This keeps the representation closer to the actual words.
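The sketch below illustrates this 80/10/10 rule on a plain Python list of token IDs; mask_id and vocab_size are illustrative values, not the ones used in the actual script.

import random

def mask_tokens(token_ids, mask_id=4, vocab_size=32000, mask_rate=0.15):
    """Pick roughly 15% of the positions and apply the 80/10/10 replacement rule.
    Returns the corrupted input and the masked positions (the prediction targets)."""
    inputs = list(token_ids)
    positions = [i for i in range(len(inputs)) if random.random() < mask_rate]
    for i in positions:
        r = random.random()
        if r < 0.8:                            # 80%: replace with [MASK]
            inputs[i] = mask_id
        elif r < 0.9:                          # 10%: replace with a random word
            inputs[i] = random.randrange(vocab_size)
        # remaining 10%: keep the original word as it is
    return inputs, positions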

Next Sentence Prediction

Next Sentence Prediction aims at contextual understanding by solving a binary classification problem: whether or not the two sentences in the input are consecutive.

For this classification, the hidden state at the [CLS] position (the pooler output) is extracted as a feature, and a fully connected layer with a sigmoid activation is applied.

bert_sentence_prediction


def bert_sentence_prediction(pooler):
    """ Sentence Prediction: binary classification from the [CLS] feature """
    # unpack the sequence axis and take the first position ([CLS]) as the pooler feature
    h = C.sequence.unpack(pooler, padding_value=0, no_mask_output=True)[0, :]
    return Dense(1, activation=C.sigmoid, init=C.normal(0.02))(h)

In 50% of the training data the two sentences are consecutive; for the remaining 50%, a discontinuous negative example is created by pairing randomly chosen sentences.

A Lite BERT

A Lite BERT (ALBERT) addresses BERT's problems by reducing the model size and improving contextual understanding.

Factorized embedding parameterization

Factorization reduces the number of parameters in the embedding layer.

If the vocabulary size is $V$, the hidden layer dimension is $H$, and the lower embedding dimension is $E$, the number of parameters can be reduced from $V \times H$ to $V \times E + E \times H$.

With an actual vocabulary size of $V = 32,000$, a hidden layer dimension of $H = 768$, and a lower embedding dimension of $E = 128$,

V \times H = 24,576,000 \\
V \times E + E \times H = 4,096,000 + 98,304 = 4,194,304

so the number of parameters in the embedding layer can be reduced by about 83%.
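A minimal CNTK-style sketch of this factorization is shown below; the sizes match the numbers above, but the function name is illustrative and not taken from the actual training script.

import cntk as C
from cntk.layers import Dense, Embedding

num_word, embed_dim, num_hidden = 32000, 128, 768

def factorized_embedding(x):
    """Replace a V x H embedding with a V x E lookup followed by an E x H projection."""
    h = Embedding(embed_dim, init=C.normal(0.02))(x)                    # 32,000 x 128 parameters
    return Dense(num_hidden, activation=None, init=C.normal(0.02))(h)   # 128 x 768 parameters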

Cross-layer parameter sharing

ALBERT shares the parameters of each Transformer Encoder layer's Self-Attention and Position-wise Feed-Forward Network across all 12 layers.

This can significantly reduce the number of parameters.
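The sketch below illustrates the idea in CNTK terms: layer objects created once keep their own parameters, so applying the same objects in every block shares the weights across all 12 layers. Multi-Head Self-Attention is omitted for brevity, so this is not the actual encoder used in the script.

import cntk as C
import cntkx as Cx
from cntk.layers import Dense, LayerNormalization

num_hidden = 768

# Create the sub-layers once; reusing these objects in every block is what
# shares the parameters across all layers.
shared_ffn = Dense(num_hidden * 4, activation=Cx.gelu_fast, init=C.normal(0.02))
shared_proj = Dense(num_hidden, activation=None, init=C.normal(0.02))
shared_norm = LayerNormalization()

def shared_encoder_block(h):
    """Simplified Position-wise Feed-Forward block with Pre-Layer Normalization."""
    return h + shared_proj(shared_ffn(shared_norm(h)))

def albert_encoder(h, num_layers=12):
    for _ in range(num_layers):
        h = shared_encoder_block(h)  # same parameters in every layer
    return h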

Sentence-Order Prediction

Since Next Sentence Prediction is too easy a task for learning contextual understanding, its usefulness has been questioned by RoBERTa [7] and others.

Therefore, ALBERT learns contextual understanding with Sentence-Order Prediction instead of Next Sentence Prediction.

It is very easy to implement: instead of creating negative examples by pairing random sentences, the negative examples are created by swapping the order of the two consecutive sentences.
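A minimal sketch of building a Sentence-Order Prediction pair from two consecutive sentences, assuming the sentences are plain Python lists of tokens:

import random

def make_sop_pair(sentence_a, sentence_b):
    """sentence_a and sentence_b are consecutive in the corpus.
    With probability 0.5 keep the original order (label 1),
    otherwise swap the two sentences to create a negative example (label 0)."""
    if random.random() < 0.5:
        return sentence_a, sentence_b, 1
    return sentence_b, sentence_a, 0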

GELU

The Gaussian Error Linear Unit (GELU) [8] has been proposed as an activation function that combines the ideas of Dropout [9], Zoneout [10], and ReLU. It is a smooth, differentiable function that can be expected to regularize the input stochastically. GELU looks like the figure below.

gelu.png

GELU is expressed by the following formula.

GELU(x) = x \Phi(x)

Here, $\Phi$ is the cumulative distribution function of the normal distribution. Assuming that Batch Normalization or Layer Normalization pushes the input $x$ toward a mean of 0 and a variance of 1, the cumulative distribution function below is used.

\Phi(x) = \frac{1}{2} \left( 1 + \mathrm{erf} \left( \frac{x - \mu}{\sqrt{2\sigma^2}} \right) \right) \\
\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int^x_0 e^{-u^2} du

Here, $\mathrm{erf}$ denotes the error function. In practice GELU is computed with the following approximation of the formula above.

GELU(x) \approx 0.5x \left( 1 + \tanh \left[ \sqrt{\frac{2}{\pi}}(x + 0.044715x^3) \right] \right)

However, this approximation is relatively expensive to compute, so in this implementation we used the following further approximation.

GELU(x) \approx x\sigma(1.702x)

Here, $\sigma$ denotes the sigmoid function.
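The small NumPy sketch below compares the two approximations; it only illustrates that they stay close to each other on typical input ranges.

import numpy as np

def gelu_tanh(x):
    """tanh-based approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def gelu_sigmoid(x):
    """Faster sigmoid-based approximation: x * sigmoid(1.702 * x)."""
    return x / (1.0 + np.exp(-1.702 * x))

x = np.linspace(-3.0, 3.0, 13)
print(np.max(np.abs(gelu_tanh(x) - gelu_sigmoid(x))))  # small difference over this range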

Result

Training loss

The figure below shows the logged loss during training. The horizontal axis is the number of iterations and the vertical axis is the value of the loss function.

bert_logging.png

Masked LM Prediction

I had the trained ALBERT solve a fill-in-the-blank problem. Here, answer is the original sentence, masked is the original sentence with part of it replaced by [MASK], and albert is the sentence with ALBERT's prediction at the [MASK] position.

answer :Mankind must evolve by using intelligence correctly.
masked :Mankind[MASK]Must be used correctly to evolve.
albert :Mankind must evolve by using living things correctly.

Visualization of Self-Attention

The figures below visualize the Attention maps of each Self-Attention head in the 11th and 12th layers of the Encoder, displayed with the hot color map.

Encoder 11

enc11.png

Encoder 12

enc12.png

BERT fine-tuning

The original motivation for BERT is transfer learning from the pre-trained model. Therefore, using the model pre-trained here, I performed transfer learning on the document classification task of the livedoor NEWS Corpus used in Natural Language: Doc2Vec Part1 - livedoor NEWS Corpus.

Text data preprocessing and morphological analysis

We used MeCab with the NEologd dictionary to extract only nouns, verbs, and adjectives, removed stopwords, and then converted the words to IDs with the SentencePiece model.

This time, a fully connected layer with 9 output classes was attached to the pooler output and trained for 5 epochs.
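As an illustration, the classification head could look like the sketch below, reusing the same unpacking of the pooler output as in bert_sentence_prediction; the function name is hypothetical and the actual fine-tuning script may differ.

import cntk as C
from cntk.layers import Dense

num_classes = 9  # livedoor NEWS Corpus categories

def bert_classification_head(pooler):
    """Fully connected layer with 9 output classes on top of the [CLS] (pooler) feature."""
    h = C.sequence.unpack(pooler, padding_value=0, no_mask_output=True)[0, :]
    return Dense(num_classes, activation=None, init=C.normal(0.02))(h)

# training uses cross entropy with softmax over these 9 outputs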

As in Natural Language: Doc2Vec Part2 - Document Classification, performance was evaluated on the validation data with the result below. It is lower than the roughly 90% that Doc2Vec achieved after 10 epochs.

Accuracy 75.56%

Due to time constraints, ALBERT's pre-training effectively covered only about 1 epoch, and the model may be too complex for this task; both could be causes of the lower performance.

Reference

Natural Language : Doc2Vec Part1 - livedoor NEWS Corpus
Natural Language : Doc2Vec Part2 - Document Classification
Natural Language : BERT Part1 - Japanese Wikipedia Corpus

  1. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", arXiv preprint arXiv:1810.04805, 2018.
  2. Ashish Vaswani, et al. "Attention Is All You Need", Advances in Neural Information Processing Systems. 2017, p. 5998-6008.
  3. Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. "ALBERT: A Lite BERT for self-supervised learning of language representations", arXiv preprint arXiv:1909.11942 (2019).
  4. Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu. "On Layer Normalization in the Transformer Architecture", arXiv preprint arXiv:2002.04745 (2020).
  5. Diederik P. Kingma and Jimmy Lei Ba. "Adam: A method for stochastic optimization", arXiv preprint arXiv:1412.6980 (2014).
  6. Leslie N. Smith. "Cyclical Learning Rates for Training Neural Networks", 2017 IEEE Winter Conference on Applications of Computer Vision. 2017, p. 464-472.
  7. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. "RoBERTa: A Robustly Optimized BERT Pretraining Approach." arXiv preprint arXiv:1907.11692 (2019).
  8. Dan Hendrycks and Kevin Gimpel. "Gaussian Error Linear Units(GELUs)", arXiv preprint arXiv:1606.08415, (2016).
  9. Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting", The Journal of Machine Learning Research 15.1 (2014), p. 1929-1958.
  10. David Krueger, Tegan Maharaj, János Kramár, Mohammad Pezeshki, Nicolas Ballas, Nan Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Aaron Courville, and Christopher Pal. "Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations", arXiv preprint arXiv:1606.01305 (2016).
