Have a Japanese BERT model take the Center Test and generate sentences

1. Introduction

Last time, I **fine-tuned a pre-trained BERT model on a negative/positive classification task**, but of course BERT can also be used without any fine-tuning.

This time, I will try the **Japanese BERT pre-trained model** as is on Google Colab.

2. Setup

The setup procedure is as follows. See the code on Google Colab (there is a link at the end).

**1) Module installation** Install the required modules (**pyknp, transformers**).

**2) Installation of the morphological analysis library** The [**"Japanese Morphological Analysis System JUMAN++"**](http://nlp.ist.i.kyoto-u.ac.jp/index.php?JUMAN++) is used.

**3) Download the Japanese BERT pre-trained model** The **"BERT Japanese Pretrained Model"** provided by the Kurohashi / Murawaki laboratory at Kyoto University is used.

3. Can BERT solve center exam questions?

This model was pre-trained on **Japanese Wikipedia**. In other words, **it has studied all kinds of knowledge in the form of fill-in-the-blank questions**, so I wondered whether it might be able to solve a **fill-in-the-blank question from the National Center Test for University Admissions**. The subject is Question 9 of World History B from 2018.

[Screenshot: the exam question]

The correct answer is ① "noble" and "Caesar", but how does BERT answer?

import torch
from transformers import BertTokenizer, BertForMaskedLM, BertConfig
import numpy as np
import textwrap
config = BertConfig.from_json_file('./bert/Japanese_L-12_H-768_A-12_E-30_BPE_transformers/config.json')
model = BertForMaskedLM.from_pretrained('./bert/Japanese_L-12_H-768_A-12_E-30_BPE_transformers/pytorch_model.bin', config=config)
bert_tokenizer = BertTokenizer('./bert/Japanese_L-12_H-768_A-12_E-30_BPE_transformers/vocab.txt',
 do_lower_case=False, do_basic_tokenize=False)
from pyknp import Juman
jumanpp = Juman()

First, import the required libraries and configure BERT.

When feeding text to BERT, [CLS] goes at the beginning of the word list, [SEP] goes at each sentence boundary, and the word to be predicted is replaced with [MASK], so I define a function that does this.

# Function that adds [CLS], [SEP], and [MASK] to the word list
def preparation(tokenized_text):

    # Insert [CLS] and [SEP]
    tokenized_text.insert(0, '[CLS]')  # Put [CLS] at the beginning of the word list
    tokenized_text.append('[SEP]')  # Put [SEP] at the end of the word list

    # Find every "。" except the last word, which is already followed by [SEP]
    maru = []
    for i, word in enumerate(tokenized_text):
        if word == '。' and i != len(tokenized_text) - 2:
            maru.append(i)

    # Insert [SEP] right after each "。"; +i accounts for earlier insertions
    for i, loc in enumerate(maru):
        tokenized_text.insert(loc + 1 + i, '[SEP]')

    # Replace "□" with [MASK] and record its position
    mask_index = []
    for index, word in enumerate(tokenized_text):
        if word == '□':
            tokenized_text[index] = '[MASK]'
            mask_index.append(index)

    return tokenized_text, mask_index

The function inserts [CLS] at the beginning of the word list, appends [SEP] at the end, and inserts [SEP] after each "。" (the Japanese full stop) in the middle of the text. Once the word positions are fixed in this way, it replaces each "□" to be predicted with [MASK] and returns the word list together with the [MASK] positions.
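To make the behavior concrete, here is a minimal sketch of the same idea on a toy word list. It is a non-mutating variant written for illustration: the name `add_special_tokens` and the sample words are assumptions, not part of the original notebook.

```python
def add_special_tokens(words, blank="□", stop="。"):
    """Return (tokens, mask_positions) without modifying `words`."""
    tokens = ["[CLS]"]
    mask_positions = []
    last = len(words) - 1
    for i, word in enumerate(words):
        if word == blank:
            # Remember where the [MASK] ends up in the final token list
            mask_positions.append(len(tokens))
            tokens.append("[MASK]")
        else:
            tokens.append(word)
        # Add [SEP] after a full stop, except after the last word,
        # which gets the closing [SEP] below
        if word == stop and i != last:
            tokens.append("[SEP]")
    tokens.append("[SEP]")
    return tokens, mask_positions


# Hypothetical two-sentence input, already split into words
words = ["ローマ", "は", "□", "だ", "。", "□", "は", "強い", "。"]
tokens, mask_positions = add_special_tokens(words)
print(tokens)
print(mask_positions)
```

Building a new list instead of inserting into the input keeps the caller's word list reusable, at the cost of one extra copy.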

Then convert the text to an ID tensor.

# Convert text to an ID tensor
text = "In his book History, the Greek Polybius praises the Roman Republic's national system (political system) as excellent. According to him, the national system has a royal element called consul, a □ system element called the Senate, and a democratic element called the people, and these three elements cooperate and restrain each other and balance. It is said that it is doing. The Romans are proud of this political system, which can also be read from the name they called 'Roman Senate and the People' to refer to their nation. Even □, who seemed to have won the civil war at the end of the republican government, was assassinated on suspicion of trying to break this system."
result = jumanpp.analysis(text)  # Split into words (morphological analysis)
tokenized_text = [mrph.midasi for mrph in result.mrph_list()]  # Convert to a word list
tokenized_text, mask_index = preparation(tokenized_text)  # Add [CLS], [SEP], [MASK]
tokens = bert_tokenizer.convert_tokens_to_ids(tokenized_text)  # Convert to an ID list
tokens_tensor = torch.tensor([tokens])  # Convert to an ID tensor

The text is split into a word list, [CLS], [SEP], and [MASK] are added with the function above, and the result is converted first to an ID list and then to an ID tensor that PyTorch can read.
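As a rough illustration of the token-to-ID step, the sketch below mimics the lookup with a toy vocabulary, where unknown words fall back to [UNK]. The helper `tokens_to_ids` and the vocabulary are assumptions for illustration; the special-token IDs follow the values this article's code relies on (4 for [MASK], 1 for [UNK]), and the remaining entries are made up.

```python
# Toy vocabulary (hypothetical); the real one is loaded from vocab.txt
toy_vocab = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4,
    "ローマ": 5, "は": 6, "。": 7,
}


def tokens_to_ids(tokens, vocab, unk_token="[UNK]"):
    """Look up each token; words missing from the vocabulary map to [UNK]."""
    unk_id = vocab[unk_token]
    return [vocab.get(token, unk_id) for token in tokens]


tokens = ["[CLS]", "ローマ", "は", "[MASK]", "だ", "。", "[SEP]"]
ids = tokens_to_ids(tokens, toy_vocab)
batch = [ids]  # One extra nesting level = batch of size 1, as in torch.tensor([tokens])
print(ids)
```

Here "だ" is deliberately left out of the toy vocabulary to show the [UNK] fallback.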

[Screenshot: result of the conversion]

The conversion is done like this.

Now, infer the [MASK] part (top 5 candidates).

# Infer the [MASK] positions (top 5)
model.eval()
tokens_tensor = tokens_tensor.to('cuda')
model.to('cuda')
print(textwrap.fill(text, 45))
print()

with torch.no_grad():
    outputs = model(tokens_tensor)
    predictions = outputs[0]  # Scores of shape (batch, sequence length, vocabulary size)

    for i in range(len(mask_index)):
        # IDs of the 5 highest-scoring vocabulary entries at this [MASK]
        _, predicted_indexes = torch.topk(predictions[0, mask_index[i]], k=5)
        predicted_tokens = bert_tokenizer.convert_ids_to_tokens(predicted_indexes.tolist())
        print(i, predicted_tokens)

[Screenshot: top-5 predictions for each blank]

It was a long shot, but the predictions for the first blank included the correct answer, **"noble"**! Unfortunately, it could not get the second answer, "Caesar", but BERT does better than I expected.
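As an aside, the ranking that `torch.topk` performs on the scores at one [MASK] position can be sketched in plain Python, here also converting the scores to probabilities with a softmax so that the candidates can be compared. The function name `top_k_with_probs`, the candidate words, and the scores are made up for illustration; they are not the model's actual output.

```python
import math


def top_k_with_probs(logits, id_to_token, k=5):
    """Return the k highest-scoring tokens with their softmax probabilities."""
    # Subtract the maximum before exponentiating for numerical stability
    peak = max(logits)
    exps = [math.exp(score - peak) for score in logits]
    total = sum(exps)
    probs = [value / total for value in exps]
    # Indices sorted by score, highest first
    best = sorted(range(len(logits)), key=lambda j: logits[j], reverse=True)[:k]
    return [(id_to_token[j], probs[j]) for j in best]


# Hypothetical five-word vocabulary and scores at a single [MASK] position
toy_tokens = ["貴族", "民主", "共和", "君主", "王"]
toy_logits = [2.0, 0.5, 1.0, 1.5, -1.0]
ranked = top_k_with_probs(toy_logits, toy_tokens, k=3)
for token, prob in ranked:
    print(token, round(prob, 3))
```

Printing probabilities next to the candidates would show how confident the model is in each answer, which the token list alone does not reveal.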

4. Can BERT generate sentences?

A BERT model that has only been pre-trained has learned just two tasks, **fill-in-the-blank prediction** and **judging whether two sentences are connected**, so it is not suited to sentence generation as is. In principle, however, it is not impossible.

Prepare some text, replace its first word with [MASK] and predict it, overwrite the first word with the prediction, then replace the next word with [MASK] and predict it, and so on. Repeating this should generate a new sentence that resembles the original text.
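The procedure above can be sketched independently of BERT by plugging in a stand-in predictor. In this sketch, `regenerate` and `toy_predict` are hypothetical names, the predictor is a dummy rule rather than a language model, and skipping the first and last positions (assumed to hold [CLS] and [SEP]) is a choice made here for illustration.

```python
MASK_ID = 4  # [MASK] ID used in this article's vocabulary
UNK_ID = 1   # [UNK] ID used in this article's vocabulary


def regenerate(ids, predict_top1, mask_id=MASK_ID, unk_id=UNK_ID):
    """Mask each position in turn and overwrite it with the top prediction."""
    ids = list(ids)
    for pos in range(1, len(ids) - 1):  # Leave [CLS] and the final [SEP] alone
        masked = list(ids)              # Work on a copy so only `pos` is hidden
        masked[pos] = mask_id
        guess = predict_top1(masked, pos)
        if guess != unk_id:             # Keep the original word if the guess is [UNK]
            ids[pos] = guess
    return ids


def toy_predict(masked, pos):
    """Dummy predictor: returns [UNK] at position 2, else left neighbour + 100."""
    if pos == 2:
        return UNK_ID
    return masked[pos - 1] + 100


original = [2, 10, 20, 30, 3]
generated = regenerate(original, toy_predict)
print(generated)
```

Because each prediction sees the words already overwritten to its left, the output at position 3 depends on what was kept at position 2, which is the sequential dependence the procedure relies on.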

Let's try it. The subject is ["President Kennedy's speech expressing support for the Apollo program"](https://ja.wikipedia.org/wiki/アポロ計画).

[Screenshot: text of the speech]

# Morphological analysis
text = "We decided to go to the moon within 10 years, not because it was easy. It's rather difficult. This goal will help us to bring together the best of our actions and skills and see how much they are. That challenge is what we want to take and don't want to procrastinate. And that's what we want to win, and not just us."
result = jumanpp.analysis(text)  # Split into words (morphological analysis)
tokenized_text = [mrph.midasi for mrph in result.mrph_list()]  # Convert to a word list
tokenized_text, mask_index = preparation(tokenized_text)  # Add [CLS], [SEP]
tokens = bert_tokenizer.convert_tokens_to_ids(tokenized_text)  # Convert to an ID list
tokens_tensor = torch.tensor([tokens])  # Convert to an ID tensor

As before, the text is split into a word list, [CLS] and [SEP] are added with the function defined earlier, and the result is converted to an ID list and then to an ID tensor that PyTorch can read.

Since we make word predictions many times, we define a function that predicts one word.

# Function that predicts one word
def predict_one(tokens_tensor, mask_index):

    model.eval()
    tokens_tensor = tokens_tensor.to('cuda')
    model.to('cuda')

    with torch.no_grad():
        outputs = model(tokens_tensor)
        predictions = outputs[0]

        # Top-5 candidate IDs at the [MASK] position
        _, predicted_indexes = torch.topk(predictions[0, mask_index], k=5)
        predicted_tokens = bert_tokenizer.convert_ids_to_tokens(predicted_indexes.tolist())
    return predicted_tokens, predicted_indexes.tolist()

This function predicts the word at the [MASK] position and returns the top-5 predicted words and their IDs.

Then write the code to generate the sentence.

# Sentence generation
for i in range(1, len(tokens_tensor[0])):
    tmp = tokens_tensor.clone()  # Copy tokens_tensor to tmp
    tmp[0, i] = 4  # Rewrite the i-th token to [MASK] (ID 4)
    predicted_tokens, predicted_indexes = predict_one(tmp, i)  # Predict the [MASK]
    if predicted_indexes[0] != 1:  # If the top prediction is not [UNK] (ID 1)
        tokens_tensor[0, i] = predicted_indexes[0]  # Overwrite the i-th token of tokens_tensor with the top predicted ID

target_list = tokens_tensor.tolist()[0]  
predict_list = bert_tokenizer.convert_ids_to_tokens(target_list)  
predict_sentence = ''.join(predict_list[1:])

print('------ original_text -------')
print(textwrap.fill(text,45))
print('------ predict_text -------')
print(textwrap.fill(predict_sentence,45))  

The loop copies tokens_tensor to tmp, replaces each position of tmp with [MASK] in turn, and overwrites the corresponding position of tokens_tensor with the prediction. Running this produces the following:

[Screenshot: original text and generated text]

The original said "go to the moon within 10 years," but the generated sentence says "should go abroad within a year," so the ambition has shrunk considerably (laughs). The meaning of the generated text is also somewhat unclear. Sentence generation does not seem to work very well with pre-training alone.

The entire code was written on Google Colab and posted on GitHub. If you want to try it yourself, click this [**"link"**](https://github.com/cedro3/BERT/blob/master/BERT_pretrained_model.ipynb) and then click the **"Colab on Web"** button at the top of the displayed sheet to run it.

(References)
・I tried to guess what I wanted for a Christmas present using the BERT Japanese model
・Use JUMAN++ with Colab
