I'm reading ** "Developmental Deep Learning with PyTorch" **. This time, I studied BERT in Chapter 8, so I would like to output my own summary.
In 2018, the year after the Transformer was announced, **BERT** appeared and finally achieved accuracy exceeding humans in the field of natural language processing as well.
BERT could handle a wide range of natural language processing tasks with fine-tuning alone, and achieved overwhelming SoTA on 11 tasks.
This is the model diagram of BERT from the paper. It looks complicated because it is unrolled along the sequence-length direction, but in simple terms it is just the Encoder part of the **Transformer** taken on its own. So what is the difference between BERT and the Transformer? It is the **two-step training**: **pre-training on two types of tasks**, followed by **fine-tuning** on the target task.
**1) Pre-training on two types of tasks**
Two tasks are **learned at the same time**: masking 15% of the words in a sentence and predicting them (**Masked Language Model**), and judging whether two sentences follow each other in context (**Next Sentence Prediction**).
A **[CLS]** token is placed at the beginning of the input, a **segment embedding** indicating whether each token belongs to the first or second sentence is added, and a **[SEP]** token is inserted between the two sentences.
By learning these two tasks, BERT acquires **the ability to convert words into feature vectors according to their context** and **the ability to judge whether sentences are semantically connected** (roughly speaking, the ability to understand the meaning of sentences).
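To make the pre-training setup concrete, here is a toy sketch (not code from the book) of how a single input pair for the two tasks might be assembled; the sentences, the 15% masking rate, and all variable names are illustrative only.

```python
import random

# Toy illustration (not the book's code): building one pre-training example
# for Masked Language Model + Next Sentence Prediction.
sent_a = ["the", "movie", "was", "great"]
sent_b = ["i", "want", "to", "see", "it", "again"]

# [CLS] at the start, [SEP] between the two sentences
tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
# Segment ids (the segment embedding's input): 0 = first sentence, 1 = second
segment_ids = [0] * (len(sent_a) + 2) + [1] * (len(sent_b) + 1)

# Masked Language Model: mask roughly 15% of the word positions
masked_tokens = list(tokens)
candidates = [i for i, t in enumerate(tokens) if t not in ("[CLS]", "[SEP]")]
for i in random.sample(candidates, max(1, round(len(candidates) * 0.15))):
    masked_tokens[i] = "[MASK]"  # the model must predict the original word here

# Next Sentence Prediction label: 1 if sent_b really follows sent_a, else 0
is_next = 1

print(masked_tokens)
print(segment_ids, is_next)
```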
This pre-training of the base model has a high computational cost; it reportedly takes about 4 days even with 4 TPUs. However, once someone has done it, the network can be turned into one that solves a wide variety of tasks with just fine-tuning.
**2) Fine-tuning**
With the **pre-trained weights as initial values**, fine-tuning is performed on labeled data. Since the model's foundations have already been trained considerably during pre-training, **a high-performance model can be built from a small amount of text data**. The paper states that the computational cost of fine-tuning for the various tasks was less than an hour on a single TPU.
Below are the 11 NLP tasks on which BERT recorded SoTA.
Dataset | Type | Overview |
---|---|---|
MNLI | Inference | Judge whether a premise/hypothesis pair is entailment, contradiction, or neutral |
QQP | Similarity judgment | Determine whether two questions are semantically equivalent |
QNLI | Inference | Given a sentence-question pair, determine whether the sentence contains the answer |
SST-2 | Single-sentence classification | Positive/negative sentiment analysis of a sentence |
CoLA | Single-sentence classification | Determine whether a sentence is grammatically acceptable |
STS-B | Similarity judgment | Score how semantically similar two sentences are on a scale of 1 to 5 |
MRPC | Similarity judgment | Determine whether two sentences are semantically equivalent |
RTE | Inference | Determine whether one sentence entails the other |
SQuAD v1.1 | Inference | Given a question and a passage containing the answer, predict where the answer is |
SQuAD v2.0 | Inference | v1.1 plus the option that there is no answer |
SWAG | Inference | Select the sentence that follows a given sentence from 4 choices |
**3) Other differences**
- In the Transformer, word position information is given as fixed values built from sin and cos in the Positional Encoder, whereas in BERT the position embeddings are learned.
- GELU (whose output is smooth around an input of 0 rather than having a sharp kink) is used instead of ReLU as the activation function.
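As a rough illustration of these two points (the values and module layout are my own, not the book's), the following sketch contrasts learned position embeddings with the fixed sin/cos encoding and shows how GELU behaves smoothly around 0:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Learned position embeddings (BERT) instead of fixed sin/cos values (Transformer).
# 512 and 768 are the usual BERT-Base maximum length and hidden size.
position_embeddings = nn.Embedding(512, 768)  # these weights are trained

seq_len = 10
position_ids = torch.arange(seq_len).unsqueeze(0)   # [1, seq_len]
pos_vec = position_embeddings(position_ids)         # [1, seq_len, 768]

# GELU vs ReLU: around an input of 0, GELU is smooth while ReLU has a sharp kink.
x = torch.linspace(-2, 2, 9)
print(F.gelu(x))
print(F.relu(x))
```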
This time, we will take a **pre-trained BERT model** and **fine-tune** it on the task of judging whether a sentence is negative or positive. BERT comes in two model sizes; here we use the smaller one, called Base.
BERT produces two kinds of outputs, **one for classification** and **one at the token level**; this time we attach a fully connected layer to the classification output and make the negative/positive judgment. The dataset used is **IMDb** (Internet Movie Database), in which each (English) movie review is labeled as positive or negative.
After training the model, when a movie review is entered it judges whether the review is positive or negative, and the Self-Attention among the review's words is used to show which words the judgment was based on.
from bert import get_config, BertModel, set_learned_params

# Read the JSON file of the model settings as an object variable
config = get_config(file_path="./data/bert_config.json")

# Create a BERT model
config = get_config(file_path="./data/bert_config.json") if False else config  # (no-op guard removed below)
net_bert = BertModel(config)

# Set the trained parameters in the BERT model
net_bert = set_learned_params(
    net_bert, weights_path="./data/pytorch_model.bin")
Create a BERT model and set the pre-trained weight parameters.
import torch.nn as nn


class BertForIMDb(nn.Module):
    '''A model that connects the BERT model with a head that judges the positive/negative of IMDb reviews.'''

    def __init__(self, net_bert):
        super(BertForIMDb, self).__init__()

        # BERT module
        self.bert = net_bert  # BERT model

        # Add a positive/negative prediction head
        # The input is the dimension of BERT's output features; the output is the two classes (negative/positive)
        self.cls = nn.Linear(in_features=768, out_features=2)

        # Weight initialization
        nn.init.normal_(self.cls.weight, std=0.02)
        nn.init.normal_(self.cls.bias, 0)

    def forward(self, input_ids, token_type_ids=None, attention_mask=None, output_all_encoded_layers=False, attention_show_flg=False):
        '''
        input_ids: [batch_size, sequence_length] list of word IDs of the sentences
        token_type_ids: [batch_size, sequence_length] ids indicating whether each word belongs to the first or second sentence
        attention_mask: mask that works the same way as the Transformer's mask
        output_all_encoded_layers: whether to return the outputs of all 12 BertLayers as a list, or only the final layer
        attention_show_flg: flag to also return the Self-Attention weights
        '''

        # Forward propagation through the base BERT model
        if attention_show_flg == True:
            # When attention_show_flg is True, also return attention_probs
            encoded_layers, pooled_output, attention_probs = self.bert(
                input_ids, token_type_ids, attention_mask, output_all_encoded_layers, attention_show_flg)
        elif attention_show_flg == False:
            encoded_layers, pooled_output = self.bert(
                input_ids, token_type_ids, attention_mask, output_all_encoded_layers, attention_show_flg)

        # Classify positive/negative using the features of the first word of the input, [CLS]
        vec_0 = encoded_layers[:, 0, :]
        vec_0 = vec_0.view(-1, 768)  # convert the size to [batch_size, hidden_size]
        out = self.cls(vec_0)

        # When attention_show_flg is True, also return attention_probs (of the last layer)
        if attention_show_flg == True:
            return out, attention_probs
        elif attention_show_flg == False:
            return out
This model connects BERT with a Linear layer that judges whether an IMDb review is negative or positive. Since updating the weight parameters of all the BertLayer layers would be computationally heavy, only the final (12th) BertLayer and the added Linear layer are updated.
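A minimal sketch of this selective update is shown below. Note that the attribute path `net.bert.encoder.layer` is an assumption about how the book's `BertModel` is organized, and the learning rate is illustrative; adjust the names to the actual implementation.

```python
import torch.optim as optim

net = BertForIMDb(net_bert)

# 1. Freeze every parameter of the model
for param in net.parameters():
    param.requires_grad = False

# 2. Unfreeze only the final (12th) BertLayer and the added classification head
#    (net.bert.encoder.layer[-1] is an assumed attribute path)
for param in net.bert.encoder.layer[-1].parameters():
    param.requires_grad = True
for param in net.cls.parameters():
    param.requires_grad = True

# 3. Pass only the trainable parts to the optimizer
optimizer = optim.Adam([
    {"params": net.bert.encoder.layer[-1].parameters(), "lr": 5e-5},
    {"params": net.cls.parameters(), "lr": 5e-5},
], betas=(0.9, 0.999))
```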
The entire code was created on Google Colab and posted on GitHub, so if you want to try it yourself, open this [**"link"**](https://github.com/cedro3/BERT/blob/master/BERT_IMDb_run.ipynb) and click the "Colab on Web" button at the top of the displayed notebook to run it.
When I ran the code, training for only 2 epochs gave an accuracy of about 90% on the test data. Last time I did the same task with a Transformer, the accuracy was about 85%, so this is an **improvement of about 5 points**.
As for showing the basis of the judgment, the Attention weights are visualized to highlight which words the decision was based on.
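Here is a rough sketch of how such a visualization could be pulled out with the `attention_show_flg` option defined above; the shape of `attention_probs` and the dummy `input_ids` batch are assumptions, not the book's actual code.

```python
import torch

# Dummy batch of word IDs standing in for one tokenized review
# (30522 is the standard BERT vocabulary size; 256 is the sequence length assumed here)
input_ids = torch.randint(0, 30522, (1, 256))

net.eval()
with torch.no_grad():
    outputs, attention_probs = net(input_ids, attention_show_flg=True)

pred = outputs.argmax(dim=1)  # 0 = negative, 1 = positive

# Assuming attention_probs has shape [batch_size, num_heads, seq_len, seq_len],
# take the attention that [CLS] (position 0) pays to every word, averaged over heads.
cls_attention = attention_probs[0, :, 0, :].mean(dim=0)  # [seq_len]
top_positions = cls_attention.topk(5).indices            # most-attended word positions
print(pred, top_positions)
```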
(References)
- Learn While Making! Developmental Deep Learning with PyTorch
- Thorough explanation of the paper on "BERT", the king of natural language processing