In a previous article I tried Japanese BERT with huggingface/transformers, and once you are using huggingface/transformers, other pre-trained BERT models are just as easy to work with.
The list of available models also includes models that appear to support Japanese, such as DistilBERT and ALBERT, both positioned as lightweight versions of BERT.
This time I will briefly introduce the Japanese DistilBERT provided by Bandai Namco, which can also be used from huggingface/transformers, and compare its accuracy with regular BERT. Finally, I will introduce one technique for improving accuracy when classifying sentences with BERT.
The following is borrowed more or less as-is from the README of Bandai Namco's GitHub repository.
DistilBERT is a model released by Hugging Face at NeurIPS 2019, and its name stands for "Distilled BERT". Please refer to the original paper for details.
DistilBERT is a small, fast and light Transformer model based on the BERT architecture. It is reported to have 40% fewer parameters than BERT-base, to run 60% faster, and to retain 97% of BERT's performance as measured on the GLUE benchmark.
DistilBERT is trained using knowledge distillation, a technique that compresses a large model (the teacher) into a smaller model (the student). By distilling BERT, you obtain a Transformer model that shares much of its behavior with the original BERT model while being lighter and faster.
In short, it is a lightweight, high-speed version of BERT-base, at the cost of slightly lower accuracy. Let's actually use it and see how fast it is and how much accuracy it gives up.
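To make the idea of distillation a little more concrete, here is a minimal sketch of a temperature-based distillation loss in PyTorch. This is only an illustration of the general technique (soft targets from a teacher), not the exact recipe used to train DistilBERT; the function name and temperature value are my own choices.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # The student is trained to match the teacher's softened output distribution.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # KL divergence between the two distributions; the T^2 factor keeps gradient
    # magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2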
As described in the [Official Github](https://github.com/BandaiNamcoResearchInc/DistilBERT-base-jp/blob/master/docs/GUIDE.md#transformers-%E3%83%A9%E3%82%A4%E3%83%96%E3%83%A9%E3%83%AA%E3%83%BC%E3%81%8B%E3%82%89%E8%AA%AD%E8%BE%BC%E3%81%BF) guide, the model can easily be called from huggingface/transformers.
- For the tokenizer, specify the cl-tohoku/ model name as shown below; otherwise an error occurs.
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-whole-word-masking")
distil_model = AutoModel.from_pretrained("bandainamco-mirai/distilbert-base-japanese")
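As a quick check that this pairing of the cl-tohoku tokenizer with the Bandai Namco DistilBERT model works, you can encode a sentence and run it through the model. A minimal sketch (the example sentence is arbitrary):

import torch

text = "今日は良い天気ですね。"  # an arbitrary example sentence
input_ids = torch.tensor([tokenizer.encode(text)])  # [CLS] and [SEP] are added automatically
with torch.no_grad():
    output = distil_model(input_ids)
print(output[0].shape)  # last hidden state: (1, sequence_length, 768)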
Basically it can be used in the same way as the Japanese BERT-base introduced in the last article, but because the internal network structure differs, the fine-tuning code needs to be changed slightly.
First, let's check the structure of DistilBERT.
print(distil_model)
The output is long, so it is collapsed here.
DistilBertModel(
(embeddings): Embeddings(
(word_embeddings): Embedding(32000, 768, padding_idx=0)
(position_embeddings): Embedding(512, 768)
(LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
(transformer): Transformer(
(layer): ModuleList(
(0): TransformerBlock(
(attention): MultiHeadSelfAttention(
(dropout): Dropout(p=0.1, inplace=False)
(q_lin): Linear(in_features=768, out_features=768, bias=True)
(k_lin): Linear(in_features=768, out_features=768, bias=True)
(v_lin): Linear(in_features=768, out_features=768, bias=True)
(out_lin): Linear(in_features=768, out_features=768, bias=True)
)
(sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(ffn): FFN(
(dropout): Dropout(p=0.1, inplace=False)
(lin1): Linear(in_features=768, out_features=3072, bias=True)
(lin2): Linear(in_features=3072, out_features=768, bias=True)
)
(output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
)
(1): TransformerBlock(
(attention): MultiHeadSelfAttention(
(dropout): Dropout(p=0.1, inplace=False)
(q_lin): Linear(in_features=768, out_features=768, bias=True)
(k_lin): Linear(in_features=768, out_features=768, bias=True)
(v_lin): Linear(in_features=768, out_features=768, bias=True)
(out_lin): Linear(in_features=768, out_features=768, bias=True)
)
(sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(ffn): FFN(
(dropout): Dropout(p=0.1, inplace=False)
(lin1): Linear(in_features=768, out_features=3072, bias=True)
(lin2): Linear(in_features=3072, out_features=768, bias=True)
)
(output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
)
(2): TransformerBlock(
(attention): MultiHeadSelfAttention(
(dropout): Dropout(p=0.1, inplace=False)
(q_lin): Linear(in_features=768, out_features=768, bias=True)
(k_lin): Linear(in_features=768, out_features=768, bias=True)
(v_lin): Linear(in_features=768, out_features=768, bias=True)
(out_lin): Linear(in_features=768, out_features=768, bias=True)
)
(sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(ffn): FFN(
(dropout): Dropout(p=0.1, inplace=False)
(lin1): Linear(in_features=768, out_features=3072, bias=True)
(lin2): Linear(in_features=3072, out_features=768, bias=True)
)
(output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
)
(3): TransformerBlock(
(attention): MultiHeadSelfAttention(
(dropout): Dropout(p=0.1, inplace=False)
(q_lin): Linear(in_features=768, out_features=768, bias=True)
(k_lin): Linear(in_features=768, out_features=768, bias=True)
(v_lin): Linear(in_features=768, out_features=768, bias=True)
(out_lin): Linear(in_features=768, out_features=768, bias=True)
)
(sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(ffn): FFN(
(dropout): Dropout(p=0.1, inplace=False)
(lin1): Linear(in_features=768, out_features=3072, bias=True)
(lin2): Linear(in_features=3072, out_features=768, bias=True)
)
(output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
)
(4): TransformerBlock(
(attention): MultiHeadSelfAttention(
(dropout): Dropout(p=0.1, inplace=False)
(q_lin): Linear(in_features=768, out_features=768, bias=True)
(k_lin): Linear(in_features=768, out_features=768, bias=True)
(v_lin): Linear(in_features=768, out_features=768, bias=True)
(out_lin): Linear(in_features=768, out_features=768, bias=True)
)
(sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(ffn): FFN(
(dropout): Dropout(p=0.1, inplace=False)
(lin1): Linear(in_features=768, out_features=3072, bias=True)
(lin2): Linear(in_features=3072, out_features=768, bias=True)
)
(output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
)
(5): TransformerBlock(
(attention): MultiHeadSelfAttention(
(dropout): Dropout(p=0.1, inplace=False)
(q_lin): Linear(in_features=768, out_features=768, bias=True)
(k_lin): Linear(in_features=768, out_features=768, bias=True)
(v_lin): Linear(in_features=768, out_features=768, bias=True)
(out_lin): Linear(in_features=768, out_features=768, bias=True)
)
(sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
(ffn): FFN(
(dropout): Dropout(p=0.1, inplace=False)
(lin1): Linear(in_features=768, out_features=3072, bias=True)
(lin2): Linear(in_features=3072, out_features=768, bias=True)
)
(output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
)
)
)
)
The difference from BERT-base is the number of layers: BERT-base has 12 transformer blocks, while DistilBERT has only 6. You can also see that the names of the inner layers differ slightly from BERT-base.
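You can confirm this directly from the loaded model. A quick sanity check, assuming distil_model from the code above:

print(type(distil_model).__name__)          # DistilBertModel
print(len(distil_model.transformer.layer))  # 6 transformer blocks (BERT-base has 12)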
Therefore, fine-tuning has to be written as follows. An example declaration of a classification model is also shown.
It makes no particular difference this time, but it is worth checking the DistilBERT reference as well; the model's return value is also slightly different.
(As in the previous article, the task is title classification on the livedoor news corpus. The data iterators are assumed to be prepared as last time; a rough sketch follows below.)
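The train_iter / test_iter iterators (and the categories list used when printing the classification reports) are built the same way as in the previous article and are not repeated here. As a rough sketch of what that preparation might look like with the legacy torchtext API (file names, paths, batch size and the label order are hypothetical; the Text field uses include_lengths=True, which is why the code below accesses batch.Text[0]):

from torchtext import data  # torchtext <= 0.8; torchtext.legacy.data in later versions

# Category names as they appear in the classification reports below.
# The order must match the integer labels assigned when the TSV files were created.
categories = ["kaden-channel", "dokujo-tsushin", "peachy", "movie-enter", "smax",
              "livedoor-homme", "it-life-hack", "topic-news", "sports-watch"]

def encode_with_bert_tokenizer(text):
    # Convert a title into token IDs; [CLS] and [SEP] are added automatically.
    return tokenizer.encode(text)

TEXT = data.Field(sequential=True, use_vocab=False, tokenize=encode_with_bert_tokenizer,
                  include_lengths=True, batch_first=True, pad_token=0)
LABEL = data.Field(sequential=False, use_vocab=False)

train_ds, test_ds = data.TabularDataset.splits(
    path=".", train="train.tsv", test="test.tsv", format="tsv",
    fields=[("Text", TEXT), ("Label", LABEL)])

train_iter, test_iter = data.BucketIterator.splits(
    (train_ds, test_ds), batch_sizes=(16, 16),
    sort_key=lambda x: len(x.Text), repeat=False)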
import torch
from torch import nn
import torch.nn.functional as F
from transformers import *
class DistilBertClassifier(nn.Module):
def __init__(self):
super(DistilBertClassifier, self).__init__()
        # This is the only place that differs from BERT-base.
self.distil_bert = AutoModel.from_pretrained("bandainamco-mirai/distilbert-base-japanese")
        # DistilBERT's hidden size is 768; livedoor news has 9 categories
self.linear = nn.Linear(768, 9)
#Weight initialization processing
nn.init.normal_(self.linear.weight, std=0.02)
nn.init.normal_(self.linear.bias, 0)
def forward(self, input_ids):
        # DistilBertModel returns a tuple whose first element is the last hidden state
        # (unlike BertModel, there is no pooler output to unpack)
        vec = self.distil_bert(input_ids)[0]
        # Take only the vector of the first token ([CLS])
        vec = vec[:, 0, :]
vec = vec.view(-1, 768)
#Convert dimensions for classification in fully connected layers
out = self.linear(vec)
return F.log_softmax(out)
#Instance of classification model
distil_classifier = DistilBertClassifier()
# First, turn OFF gradient updates for all parameters
for param in distil_classifier.parameters():
param.requires_grad = False
# Turn ON updates for only the last layer of DistilBERT.
# For BERT-base this was .encoder.layer[-1];
# for DistilBERT, as confirmed above, the corresponding module is .transformer.layer[-1].
for param in distil_classifier.distil_bert.transformer.layer[-1].parameters():
param.requires_grad = True
#Class classification is also ON
for param in distil_classifier.linear.parameters():
param.requires_grad = True
import torch.optim as optim
# Use a small learning rate for the pre-trained part and a larger one for the final fully connected layer.
# Don't forget that for DistilBERT the parameter path is .transformer.layer[-1].
optimizer = optim.Adam([
{'params': distil_classifier.distil_bert.transformer.layer[-1].parameters(), 'lr': 5e-5},
{'params': distil_classifier.linear.parameters(), 'lr': 1e-4}
])
As last time, the task is title classification on the livedoor news corpus.
BERT-base
- The following source code is almost the same as last time.
class BertClassifier(nn.Module):
def __init__(self):
super(BertClassifier, self).__init__()
self.bert = BertModel.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')
        # BERT's hidden size is 768; livedoor news has 9 categories
self.linear = nn.Linear(768, 9)
#Weight initialization processing
nn.init.normal_(self.linear.weight, std=0.02)
nn.init.normal_(self.linear.bias, 0)
def forward(self, input_ids):
        # Use only last_hidden_state (the pooler output is discarded)
vec, _ = self.bert(input_ids)
#Get only the vector of the first token cls
vec = vec[:,0,:]
vec = vec.view(-1, 768)
#Convert dimensions for classification in fully connected layers
out = self.linear(vec)
return F.log_softmax(out)
#Classification model instance declaration
bert_classifier = BertClassifier()
#Fine tuning settings
# First, turn OFF gradient updates for all parameters
for param in bert_classifier.parameters():
param.requires_grad = False
# Turn ON updates for only the last layer of BERT
for param in bert_classifier.bert.encoder.layer[-1].parameters():
param.requires_grad = True
#Class classification is also ON
for param in bert_classifier.linear.parameters():
param.requires_grad = True
import torch.optim as optim
# Use a small learning rate for the pre-trained part and a larger one for the final fully connected layer.
optimizer = optim.Adam([
{'params': bert_classifier.bert.encoder.layer[-1].parameters(), 'lr': 5e-5},
{'params': bert_classifier.linear.parameters(), 'lr': 1e-4}
])
#Loss function settings
loss_function = nn.NLLLoss()
# Measure the training time
import time
start = time.time()
#GPU settings
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
#Send network to GPU
bert_classifier.to(device)
losses = []
#The number of epochs is 10
for epoch in range(10):
all_loss = 0
for idx, batch in enumerate(train_iter):
batch_loss = 0
bert_classifier.zero_grad()
input_ids = batch.Text[0].to(device)
label_ids = batch.Label.to(device)
out = bert_classifier(input_ids)
batch_loss = loss_function(out, label_ids)
batch_loss.backward()
optimizer.step()
all_loss += batch_loss.item()
print("epoch", epoch, "\t" , "loss", all_loss)
end = time.time()
print ("time : ", end - start)
#epoch 0 loss 251.19750046730042
#epoch 1 loss 110.7038831859827
#epoch 2 loss 82.88570280373096
#epoch 3 loss 67.0771074667573
#epoch 4 loss 56.24497305601835
#epoch 5 loss 42.61423560976982
#epoch 6 loss 35.98485875874758
#epoch 7 loss 25.728398952633142
#epoch 8 loss 20.40780107676983
#epoch 9 loss 16.567239843308926
#time : 101.97362518310547
# Inference
from sklearn.metrics import classification_report
answer = []
prediction = []
with torch.no_grad():
for batch in test_iter:
text_tensor = batch.Text[0].to(device)
label_tensor = batch.Label.to(device)
score = bert_classifier(text_tensor)
_, pred = torch.max(score, 1)
prediction += list(pred.cpu().numpy())
answer += list(label_tensor.cpu().numpy())
print(classification_report(prediction, answer, target_names=categories))
# precision recall f1-score support
# kaden-channel 0.94 0.92 0.93 172
#dokujo-tsushin 0.75 0.86 0.80 156
# peachy 0.81 0.68 0.74 211
# movie-enter 0.78 0.81 0.80 171
# smax 0.98 0.91 0.94 176
#livedoor-homme 0.68 0.83 0.75 83
# it-life-hack 0.79 0.94 0.86 150
# topic-news 0.81 0.76 0.78 172
# sports-watch 0.89 0.82 0.85 185
# accuracy 0.83 1476
# macro avg 0.83 0.84 0.83 1476
# weighted avg 0.84 0.83 0.83 1476
Training for 10 epochs took about 102 seconds, and the accuracy (F1 score) was 0.83.
DistilBERT
- Training and inference are run as follows, using the model definition and fine-tuning settings above.
- The procedure is no different from BERT-base, but it is shown here just in case.
import time
#GPU settings
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
#Send network to GPU
distil_classifier.to(device)
losses = []
start = time.time()
#The number of epochs is 10
for epoch in range(10):
all_loss = 0
for idx, batch in enumerate(train_iter):
batch_loss = 0
distil_classifier.zero_grad()
input_ids = batch.Text[0].to(device)
label_ids = batch.Label.to(device)
out = distil_classifier(input_ids)
batch_loss = loss_function(out, label_ids)
batch_loss.backward()
optimizer.step()
all_loss += batch_loss.item()
print("epoch", epoch, "\t" , "loss", all_loss)
end = time.time()
print ("time : ", end - start)
#/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:26: UserWarning: Implicit dimension choice for log_softmax has been deprecated. Change the call to include dim=X as an argument.
#epoch 0 loss 450.1027842760086
#epoch 1 loss 317.39041769504547
#epoch 2 loss 211.34138756990433
#epoch 3 loss 144.4813650548458
#epoch 4 loss 106.24609130620956
#epoch 5 loss 83.87273170053959
#epoch 6 loss 68.9661111086607
#epoch 7 loss 59.31868125498295
#epoch 8 loss 49.874382212758064
#epoch 9 loss 41.56027300283313
#time : 60.22182369232178
from sklearn.metrics import classification_report
answer = []
prediction = []
with torch.no_grad():
for batch in test_iter:
text_tensor = batch.Text[0].to(device)
label_tensor = batch.Label.to(device)
score = distil_classifier(text_tensor)
_, pred = torch.max(score, 1)
prediction += list(pred.cpu().numpy())
answer += list(label_tensor.cpu().numpy())
print(classification_report(prediction, answer, target_names=categories))
#/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:26: UserWarning: Implicit dimension choice for log_softmax has been deprecated. Change the call to include dim=X as an argument.
# precision recall f1-score support
# kaden-channel 0.93 0.96 0.95 163
#dokujo-tsushin 0.88 0.88 0.88 178
# peachy 0.86 0.75 0.80 202
# movie-enter 0.86 0.84 0.85 183
# smax 0.96 0.95 0.95 165
#livedoor-homme 0.67 0.71 0.69 96
# it-life-hack 0.91 0.91 0.91 178
# topic-news 0.80 0.86 0.83 148
# sports-watch 0.88 0.91 0.89 163
# accuracy 0.87 1476
# macro avg 0.86 0.86 0.86 1476
# weighted avg 0.87 0.87 0.87 1476
- Training for 10 epochs took about 60 seconds, and the accuracy (F1 score) was 0.87.
- The shorter training time is welcome, but surprisingly the accuracy also improved (a quick parameter-count check follows below).
- In principle DistilBERT should be slightly less accurate than BERT-base, but apparently it can come out ahead.
- Perhaps the title classification task on the livedoor news corpus, which I always use for experiments, is just not a very good benchmark ...
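For reference, the size difference is easy to check by counting parameters. A small sketch, assuming both classifier instances defined above are still in scope (I won't quote exact numbers here):

def count_parameters(model):
    return sum(p.numel() for p in model.parameters())

print("BERT-base classifier :", count_parameters(bert_classifier))
print("DistilBERT classifier:", count_parameters(distil_classifier))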
From here on, rather than continuing the comparison with DistilBERT, I will introduce one technique for improving accuracy when classifying sentences with Japanese BERT.
(Strictly speaking, you should first thoroughly consider preprocessing appropriate to the task, but this technique seems to improve accuracy without depending much on the task, so I introduce it here.)
The technique is introduced in Section 5.3 "Feature-based Approach with BERT" of the BERT paper, and it is also said to have been used in the 1st place solution of Jigsaw Unintended Bias in Toxicity Classification, an NLP competition previously held on Kaggle.
Please refer to the following article for details of the technique.
-[Kaggle Competition Review] Google QUEST Q & A
The point is that, of BERT-base's 12 encoder layers, concatenating the [CLS] token vectors from the last 4 layers works better than using only the [CLS] vector from the final layer. (I don't know exactly why ...)
The idea is simple, so let's try it on this livedoor news corpus title classification task.
class BertClassifierRevised(nn.Module):
def __init__(self):
super(BertClassifierRevised, self).__init__()
self.bert = BertModel.from_pretrained('cl-tohoku/bert-base-japanese-whole-word-masking')
#The number of dimensions of the hidden layer of BERT is 768, but since the vector of the last 4 layers is combined, it is set to 768 × 4 dimensions.
self.linear = nn.Linear(768*4, 9)
#Weight initialization processing
nn.init.normal_(self.linear.weight, std=0.02)
nn.init.normal_(self.linear.bias, 0)
#Prepare a function to get the vector of cls token
def _get_cls_vec(self, vec):
return vec[:,0,:].view(-1, 768)
def forward(self, input_ids):
        # The first return value (last_hidden_state) only gives the final layer,
        # so pass output_hidden_states=True to obtain the vectors of all hidden layers
        # and take the third return value (the states of all hidden layers).
_, _, hidden_states = self.bert(input_ids, output_hidden_states=True)
#Get the vector of cls token from each of the last 4 hidden layers
vec1 = self._get_cls_vec(hidden_states[-1])
vec2 = self._get_cls_vec(hidden_states[-2])
vec3 = self._get_cls_vec(hidden_states[-3])
vec4 = self._get_cls_vec(hidden_states[-4])
#Combine four cls tokens into one vector.
vec = torch.cat([vec1, vec2, vec3, vec4], dim=1)
#Convert dimensions for classification in fully connected layers
out = self.linear(vec)
return F.log_softmax(out)
#Instance declaration
bert_classifier_revised = BertClassifierRevised()
# First, turn OFF gradient updates for all parameters
for param in bert_classifier_revised.parameters():
param.requires_grad = False
#Turn on the last 4 layers of BERT
for param in bert_classifier_revised.bert.encoder.layer[-1].parameters():
param.requires_grad = True
for param in bert_classifier_revised.bert.encoder.layer[-2].parameters():
param.requires_grad = True
for param in bert_classifier_revised.bert.encoder.layer[-3].parameters():
param.requires_grad = True
for param in bert_classifier_revised.bert.encoder.layer[-4].parameters():
param.requires_grad = True
#Class classification is also ON
for param in bert_classifier_revised.linear.parameters():
param.requires_grad = True
import torch.optim as optim
# Use a small learning rate for the pre-trained part and a larger one for the final fully connected layer.
optimizer = optim.Adam([
{'params': bert_classifier_revised.bert.encoder.layer[-1].parameters(), 'lr': 5e-5},
{'params': bert_classifier_revised.bert.encoder.layer[-2].parameters(), 'lr': 5e-5},
{'params': bert_classifier_revised.bert.encoder.layer[-3].parameters(), 'lr': 5e-5},
{'params': bert_classifier_revised.bert.encoder.layer[-4].parameters(), 'lr': 5e-5},
{'params': bert_classifier_revised.linear.parameters(), 'lr': 1e-4}
])
#Loss function settings
loss_function = nn.NLLLoss()
import time
start = time.time()
#GPU settings
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
#Send network to GPU
bert_classifier_revised.to(device)
losses = []
# The number of epochs is 10
for epoch in range(10):
all_loss = 0
for idx, batch in enumerate(train_iter):
batch_loss = 0
bert_classifier_revised.zero_grad()
input_ids = batch.Text[0].to(device)
label_ids = batch.Label.to(device)
out = bert_classifier_revised(input_ids)
batch_loss = loss_function(out, label_ids)
batch_loss.backward()
optimizer.step()
all_loss += batch_loss.item()
print("epoch", epoch, "\t" , "loss", all_loss)
end = time.time()
print ("time : ", end - start)
#epoch 0 loss 196.0047192275524
#epoch 1 loss 75.8067753687501
#epoch 2 loss 42.30751228891313
#epoch 3 loss 16.470114511903375
#epoch 4 loss 7.427484432584606
#epoch 5 loss 2.9392087209271267
#epoch 6 loss 1.5984382012393326
#epoch 7 loss 1.7370687873335555
#epoch 8 loss 0.9278695838729618
#epoch 9 loss 1.499190401067608
#time : 149.01919651031494
#inference
answer = []
prediction = []
with torch.no_grad():
for batch in test_iter:
text_tensor = batch.Text[0].to(device)
label_tensor = batch.Label.to(device)
score = bert_classifier_revised(text_tensor)
_, pred = torch.max(score, 1)
prediction += list(pred.cpu().numpy())
answer += list(label_tensor.cpu().numpy())
print(classification_report(prediction, answer, target_names=categories))
# precision recall f1-score support
# kaden-channel 0.80 0.99 0.89 137
#dokujo-tsushin 0.89 0.86 0.88 183
# peachy 0.78 0.82 0.80 168
# movie-enter 0.87 0.88 0.87 176
# smax 0.95 0.93 0.94 168
#livedoor-homme 0.72 0.83 0.77 88
# it-life-hack 0.95 0.79 0.86 215
# topic-news 0.83 0.84 0.83 159
# sports-watch 0.92 0.86 0.89 182
# accuracy 0.86 1476
# macro avg 0.86 0.87 0.86 1476
# weighted avg 0.87 0.86 0.86 1476
- The loss decreases faster than with BERT-base.
- Training took about 150 seconds, a little longer than BERT-base.
- The accuracy is 0.86, an improvement over BERT-base's 0.83. Great.
- I compared BERT-base and DistilBERT. The result that DistilBERT came out ahead in both speed and accuracy was unexpected, but I feel I now have a rough grasp of how to use DistilBERT.
- In the second half, I introduced concatenating the [CLS] token vectors of the last 4 layers as a way to improve BERT's accuracy. It does seem to contribute to better accuracy compared with plain BERT-base, so from now on, when implementing a classification model with BERT, I will start from the last-4-layers approach.
end