I'm reading ** "Developmental Deep Learning with PyTorch" **. This time, I studied BERT in Chapter 8, so I would like to output my own summary.
In 2018, the year after the Transformer was announced, **BERT** appeared and finally achieved accuracy exceeding humans in the field of natural language processing as well.
BERT could handle a wide range of natural language processing tasks with fine-tuning alone, and achieved overwhelming SoTA on 11 tasks.
This is the model diagram of BERT from the paper. It looks complicated because it is unrolled along the sequence-length direction, but in simple terms it is just the Encoder part of the **Transformer** taken on its own. So what is the difference between BERT and the Transformer? It is the **two-step training**: **pre-training on two types of tasks**, followed by **fine-tuning** on the target task.
**1) Pre-training on two types of tasks**
Two tasks are **learned at the same time**: masking 15% of the words in a sentence and predicting them (**Masked Language Model**), and judging whether two sentences follow each other in context (**Next Sentence Prediction**).
A **[CLS]** token is placed at the beginning of the input, a **segment embedding** indicating whether each token belongs to the first or second sentence is added, and a **[SEP]** token is inserted between the two sentences.
By learning these two tasks, BERT acquires **the ability to convert words into feature vectors according to their context** and **the ability to judge whether sentences are semantically connected** (roughly speaking, the ability to understand the meaning of sentences).
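To make the pre-training setup concrete, here is a toy sketch (not code from the book) of how a single input pair for the two tasks might be assembled; the sentences, the 15% masking rate, and all variable names are illustrative only.

```python
import random

# Toy illustration (not the book's code): building one pre-training example
# for Masked Language Model + Next Sentence Prediction.
sent_a = ["the", "movie", "was", "great"]
sent_b = ["i", "want", "to", "see", "it", "again"]

# [CLS] at the start, [SEP] between the two sentences
tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
# Segment ids (the segment embedding's input): 0 = first sentence, 1 = second
segment_ids = [0] * (len(sent_a) + 2) + [1] * (len(sent_b) + 1)

# Masked Language Model: mask roughly 15% of the word positions
masked_tokens = list(tokens)
candidates = [i for i, t in enumerate(tokens) if t not in ("[CLS]", "[SEP]")]
for i in random.sample(candidates, max(1, round(len(candidates) * 0.15))):
    masked_tokens[i] = "[MASK]"  # the model must predict the original word here

# Next Sentence Prediction label: 1 if sent_b really follows sent_a, else 0
is_next = 1

print(masked_tokens)
print(segment_ids, is_next)
```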
This pre-training of the base model has a high computational cost; it reportedly takes about 4 days even with 4 TPUs. However, once someone has done it, the network can be turned into one that solves a wide variety of tasks with just fine-tuning.
**2) Fine-tuning**
With the **pre-trained weights as initial values**, fine-tuning is performed on labeled data. Since the model's foundations have already been trained considerably during pre-training, **a high-performance model can be built from a small amount of text data**. The paper states that the computational cost of fine-tuning for the various tasks was less than an hour on a single TPU.
Below are the 11 NLP tasks on which BERT recorded SoTA.
Dataset | Type | Overview |
---|---|---|
MNLI | Inference | Judge whether a premise/hypothesis pair is entailment, contradiction, or neutral |
QQP | Similarity judgment | Determine whether two questions are semantically equivalent |
QNLI | Inference | Given a sentence-question pair, determine whether the sentence contains the answer |
SST-2 | Single-sentence classification | Positive/negative sentiment analysis of a sentence |
CoLA | Single-sentence classification | Determine whether a sentence is grammatically acceptable |
STS-B | Similarity judgment | Score how semantically similar two sentences are on a scale of 1 to 5 |
MRPC | Similarity judgment | Determine whether two sentences are semantically equivalent |
RTE | Inference | Determine whether one sentence entails the other |
SQuAD v1.1 | Inference | Given a question and a passage containing the answer, predict where the answer is |
SQuAD v2.0 | Inference | v1.1 plus the option that there is no answer |
SWAG | Inference | Select the sentence that follows a given sentence from 4 choices |
**3) Other differences**
- In the Transformer, word position information is given as fixed values built from sin and cos in the Positional Encoder, whereas in BERT the position embeddings are learned.
- GELU (whose output is smooth around an input of 0 rather than having a sharp kink) is used instead of ReLU as the activation function.
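As a rough illustration of these two points (the values and module layout are my own, not the book's), the following sketch contrasts learned position embeddings with the fixed sin/cos encoding and shows how GELU behaves smoothly around 0:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Learned position embeddings (BERT) instead of fixed sin/cos values (Transformer).
# 512 and 768 are the usual BERT-Base maximum length and hidden size.
position_embeddings = nn.Embedding(512, 768)  # these weights are trained

seq_len = 10
position_ids = torch.arange(seq_len).unsqueeze(0)   # [1, seq_len]
pos_vec = position_embeddings(position_ids)         # [1, seq_len, 768]

# GELU vs ReLU: around an input of 0, GELU is smooth while ReLU has a sharp kink.
x = torch.linspace(-2, 2, 9)
print(F.gelu(x))
print(F.relu(x))
```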
This time, we will take a **pre-trained BERT model** and **fine-tune** it on the task of judging whether a sentence is negative or positive. BERT comes in two model sizes; here we use the smaller one, called Base.
BERT produces two kinds of outputs, **one for classification** and **one at the token level**; this time we attach a fully connected layer to the classification output and make the negative/positive judgment. The dataset used is **IMDb** (Internet Movie Database), in which each (English) movie review is labeled as positive or negative.
After training the model, when a movie review is entered it judges whether the review is positive or negative, and the Self-Attention among the review's words is used to show which words the judgment was based on.
from bert import get_config, BertModel, set_learned_params

# Read the JSON file of the model settings as an object variable
config = get_config(file_path="./data/bert_config.json")

# Create a BERT model
config = get_config(file_path="./data/bert_config.json") if False else config  # (no-op guard removed below)
net_bert = BertModel(config)

# Set the trained parameters in the BERT model
net_bert = set_learned_params(
    net_bert, weights_path="./data/pytorch_model.bin")
Create a BERT model and set the pre-trained weight parameters.
import torch.nn as nn


class BertForIMDb(nn.Module):
    '''A model that connects the BERT model with a head that judges the positive/negative of IMDb reviews.'''

    def __init__(self, net_bert):
        super(BertForIMDb, self).__init__()

        # BERT module
        self.bert = net_bert  # BERT model

        # Add a positive/negative prediction head
        # The input is the dimension of BERT's output features; the output is the two classes (negative/positive)
        self.cls = nn.Linear(in_features=768, out_features=2)

        # Weight initialization
        nn.init.normal_(self.cls.weight, std=0.02)
        nn.init.normal_(self.cls.bias, 0)

    def forward(self, input_ids, token_type_ids=None, attention_mask=None, output_all_encoded_layers=False, attention_show_flg=False):
        '''
        input_ids: [batch_size, sequence_length] list of word IDs of the sentences
        token_type_ids: [batch_size, sequence_length] ids indicating whether each word belongs to the first or second sentence
        attention_mask: mask that works the same way as the Transformer's mask
        output_all_encoded_layers: whether to return the outputs of all 12 BertLayers as a list, or only the final layer
        attention_show_flg: flag to also return the Self-Attention weights
        '''

        # Forward propagation through the base BERT model
        if attention_show_flg == True:
            # When attention_show_flg is True, also return attention_probs
            encoded_layers, pooled_output, attention_probs = self.bert(
                input_ids, token_type_ids, attention_mask, output_all_encoded_layers, attention_show_flg)
        elif attention_show_flg == False:
            encoded_layers, pooled_output = self.bert(
                input_ids, token_type_ids, attention_mask, output_all_encoded_layers, attention_show_flg)

        # Classify positive/negative using the features of the first word of the input, [CLS]
        vec_0 = encoded_layers[:, 0, :]
        vec_0 = vec_0.view(-1, 768)  # convert the size to [batch_size, hidden_size]
        out = self.cls(vec_0)

        # When attention_show_flg is True, also return attention_probs (of the last layer)
        if attention_show_flg == True:
            return out, attention_probs
        elif attention_show_flg == False:
            return out
This model connects BERT with a Linear layer that judges whether an IMDb review is negative or positive. Since updating the weight parameters of all the BertLayer layers would be computationally heavy, only the final (12th) BertLayer and the added Linear layer are updated.
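A minimal sketch of this selective update is shown below. Note that the attribute path `net.bert.encoder.layer` is an assumption about how the book's `BertModel` is organized, and the learning rate is illustrative; adjust the names to the actual implementation.

```python
import torch.optim as optim

net = BertForIMDb(net_bert)

# 1. Freeze every parameter of the model
for param in net.parameters():
    param.requires_grad = False

# 2. Unfreeze only the final (12th) BertLayer and the added classification head
#    (net.bert.encoder.layer[-1] is an assumed attribute path)
for param in net.bert.encoder.layer[-1].parameters():
    param.requires_grad = True
for param in net.cls.parameters():
    param.requires_grad = True

# 3. Pass only the trainable parts to the optimizer
optimizer = optim.Adam([
    {"params": net.bert.encoder.layer[-1].parameters(), "lr": 5e-5},
    {"params": net.cls.parameters(), "lr": 5e-5},
], betas=(0.9, 0.999))
```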
The entire code was created on Google Colab and posted on GitHub, so if you want to try it yourself, open this [**"link"**](https://github.com/cedro3/BERT/blob/master/BERT_IMDb_run.ipynb) and click the "Colab on Web" button at the top of the displayed notebook to run it.
When I ran the code, training for only 2 epochs gave an accuracy of about 90% on the test data. Last time I did the same task with a Transformer, the accuracy was about 85%, so this is an **improvement of about 5 points**.
As for showing the basis of the judgment, the Attention weights are visualized to highlight which words the decision was based on.
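Here is a rough sketch of how such a visualization could be pulled out with the `attention_show_flg` option defined above; the shape of `attention_probs` and the dummy `input_ids` batch are assumptions, not the book's actual code.

```python
import torch

# Dummy batch of word IDs standing in for one tokenized review
# (30522 is the standard BERT vocabulary size; 256 is the sequence length assumed here)
input_ids = torch.randint(0, 30522, (1, 256))

net.eval()
with torch.no_grad():
    outputs, attention_probs = net(input_ids, attention_show_flg=True)

pred = outputs.argmax(dim=1)  # 0 = negative, 1 = positive

# Assuming attention_probs has shape [batch_size, num_heads, seq_len, seq_len],
# take the attention that [CLS] (position 0) pays to every word, averaged over heads.
cls_attention = attention_probs[0, :, 0, :].mean(dim=0)  # [seq_len]
top_positions = cls_attention.topk(5).indices            # most-attended word positions
print(pred, top_positions)
```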
(References)
- Learn While Making! Developmental Deep Learning with PyTorch
- Thorough explanation of the paper on "BERT", the king of natural language processing