In this article, we walk through fine-tuning a pre-trained BERT model on the task of classifying English news headlines into categories. Japanese text, unlike English, needs morphological analysis first, but the overall flow is the same as described here.
This implementation also serves as an answer to problem 89 of Language Processing 100 Knocks 2020. For sample answers to the other problems, see [Language Processing 100 Knock 2020] Summary of Answer Examples in Python.
Google Colaboratory is used for the implementation. For details on how to set up and use Google Colaboratory, see [this article](https://cpp-fu learning.com/python_colaboratory/). **If you want to use a GPU to reproduce the results, change the hardware accelerator to "GPU" under "Runtime" -> "Change runtime type" and save the setting beforehand.** The notebook containing the execution results is available on GitHub.
Using the public News Aggregator Data Set, we implement a BERT document classification model that assigns each news headline to one of four categories: "Business", "Science and Technology", "Entertainment", or "Health".
First, download the target data.
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00359/NewsAggregatorDataset.zip
!unzip NewsAggregatorDataset.zip
#Check the number of lines
!wc -l ./newsCorpora.csv
output
422937 ./newsCorpora.csv
#Check the first 10 lines
!head -10 ./newsCorpora.csv
output
1 Fed official says weak data caused by weather, should not slow taper http://www.latimes.com/business/money/la-fi-mo-federal-reserve-plosser-stimulus-economy-20140310,0,1312750.story\?track=rss Los Angeles Times b ddUyU0VZz0BRneMioxUPQVP6sIxvM www.latimes.com 1394470370698
2 Fed's Charles Plosser sees high bar for change in pace of tapering http://www.livemint.com/Politics/H2EvwJSK2VE6OF7iK1g3PP/Feds-Charles-Plosser-sees-high-bar-for-change-in-pace-of-ta.html Livemint b ddUyU0VZz0BRneMioxUPQVP6sIxvM www.livemint.com 1394470371207
3 US open: Stocks fall after Fed official hints at accelerated tapering http://www.ifamagazine.com/news/us-open-stocks-fall-after-fed-official-hints-at-accelerated-tapering-294436 IFA Magazine b ddUyU0VZz0BRneMioxUPQVP6sIxvM www.ifamagazine.com 1394470371550
4 Fed risks falling 'behind the curve', Charles Plosser says http://www.ifamagazine.com/news/fed-risks-falling-behind-the-curve-charles-plosser-says-294430 IFA Magazine b ddUyU0VZz0BRneMioxUPQVP6sIxvM www.ifamagazine.com 1394470371793
5 Fed's Plosser: Nasty Weather Has Curbed Job Growth http://www.moneynews.com/Economy/federal-reserve-charles-plosser-weather-job-growth/2014/03/10/id/557011 Moneynews b ddUyU0VZz0BRneMioxUPQVP6sIxvM www.moneynews.com 1394470372027
6 Plosser: Fed May Have to Accelerate Tapering Pace http://www.nasdaq.com/article/plosser-fed-may-have-to-accelerate-tapering-pace-20140310-00371 NASDAQ b ddUyU0VZz0BRneMioxUPQVP6sIxvM www.nasdaq.com 1394470372212
7 Fed's Plosser: Taper pace may be too slow http://www.marketwatch.com/story/feds-plosser-taper-pace-may-be-too-slow-2014-03-10\?reflink=MW_news_stmp MarketWatch b ddUyU0VZz0BRneMioxUPQVP6sIxvM www.marketwatch.com 1394470372405
8 Fed's Plosser expects US unemployment to fall to 6.2% by the end of 2014 http://www.fxstreet.com/news/forex-news/article.aspx\?storyid=23285020-b1b5-47ed-a8c4-96124bb91a39 FXstreet.com b ddUyU0VZz0BRneMioxUPQVP6sIxvM www.fxstreet.com 1394470372615
9 US jobs growth last month hit by weather:Fed President Charles Plosser http://economictimes.indiatimes.com/news/international/business/us-jobs-growth-last-month-hit-by-weatherfed-president-charles-plosser/articleshow/31788000.cms Economic Times b ddUyU0VZz0BRneMioxUPQVP6sIxvM economictimes.indiatimes.com 1394470372792
10 ECB unlikely to end sterilisation of SMP purchases - traders http://www.iii.co.uk/news-opinion/reuters/news/152615 Interactive Investor b dPhGU51DcrolUIMxbRm0InaHGA2XM www.iii.co.uk 1394470501265
#Replaced double quotes with single quotes to avoid errors when reading
!sed -e 's/"/'\''/g' ./newsCorpora.csv > ./newsCorpora_re.csv
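To see why stray double quotes can cause trouble when reading the file, the sketch below uses Python's standard `csv` module as a stand-in for a CSV parser (the pandas parser may differ in detail). The two rows in `raw` are made-up examples, not lines from the dataset. With default quoting, a field that opens with an unmatched `"` swallows the following tab and newline, so two rows collapse into one; with `csv.QUOTE_NONE` the quote is kept as an ordinary character.

```python
import csv
import io

# Hypothetical tab-separated rows; the first title starts with an unmatched double quote.
raw = '1\t"Hello world\tb\n2\tSecond title\te\n'

# Default quoting: the opening " starts a quoted field that never closes.
default_rows = list(csv.reader(io.StringIO(raw), delimiter='\t'))

# QUOTE_NONE: double quotes are treated as ordinary characters.
literal_rows = list(csv.reader(io.StringIO(raw), delimiter='\t', quoting=csv.QUOTE_NONE))

print(len(default_rows))  # number of rows seen with default quoting
print(len(literal_rows))  # number of rows seen with quoting disabled
```

`pandas.read_csv` accepts the same constant through its `quoting` parameter, which is one possible alternative to rewriting the file with `sed`.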
Next, read the file into a data frame, keep only the rows whose publisher (PUBLISHER) is Reuters, Huffington Post, Businessweek, Contactmusic.com, or Daily Mail, and then split them into training, validation, and evaluation data (80%, 10%, and 10%, stratified by category).
import pandas as pd
from sklearn.model_selection import train_test_split
#Data reading
df = pd.read_csv('./newsCorpora_re.csv', header=None, sep='\t', names=['ID', 'TITLE', 'URL', 'PUBLISHER', 'CATEGORY', 'STORY', 'HOSTNAME', 'TIMESTAMP'])
#Data extraction
df = df.loc[df['PUBLISHER'].isin(['Reuters', 'Huffington Post', 'Businessweek', 'Contactmusic.com', 'Daily Mail']), ['TITLE', 'CATEGORY']]
#Data split
train, valid_test = train_test_split(df, test_size=0.2, shuffle=True, random_state=123, stratify=df['CATEGORY'])
valid, test = train_test_split(valid_test, test_size=0.5, shuffle=True, random_state=123, stratify=valid_test['CATEGORY'])
train.reset_index(drop=True, inplace=True)
valid.reset_index(drop=True, inplace=True)
test.reset_index(drop=True, inplace=True)
print(train.head())
output
TITLE CATEGORY
0 REFILE-UPDATE 1-European car sales up for sixt... b
1 Amazon Plans to Fight FTC Over Mobile-App Purc... t
2 Kids Still Get Codeine In Emergency Rooms Desp... m
3 What On Earth Happened Between Solange And Jay... e
4 NATO Missile Defense Is Flight Tested Over Hawaii b
#Confirmation of the number of cases
print('[Learning data]')
print(train['CATEGORY'].value_counts())
print('[Verification data]')
print(valid['CATEGORY'].value_counts())
print('[Evaluation data]')
print(test['CATEGORY'].value_counts())
output
[Learning data]
b 4501
e 4235
t 1220
m 728
Name: CATEGORY, dtype: int64
[Verification data]
b 563
e 529
t 153
m 91
Name: CATEGORY, dtype: int64
[Evaluation data]
b 563
e 530
t 152
m 91
Name: CATEGORY, dtype: int64
(b: Business, e: Entertainment, t: Science and Technology, m: Health)
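As a quick sanity check on the stratified split, the category shares can be recomputed from the counts printed above. This is only a sketch in plain Python; the `counts` dictionary is copied by hand from the output, and `shares` is a helper name introduced here for illustration.

```python
# Category counts copied from the output above (b, e, t, m).
counts = {
    'train': {'b': 4501, 'e': 4235, 't': 1220, 'm': 728},
    'valid': {'b': 563, 'e': 529, 't': 153, 'm': 91},
    'test': {'b': 563, 'e': 530, 't': 152, 'm': 91},
}

def shares(split):
    """Return each category's fraction of the given split."""
    total = sum(split.values())
    return {cat: n / total for cat, n in split.items()}

grand_total = sum(sum(split.values()) for split in counts.values())
for name, split in counts.items():
    size = sum(split.values())
    ratios = ', '.join(f'{cat}: {r:.3f}' for cat, r in shares(split).items())
    print(f'{name}: {size} rows ({size / grand_total:.1%} of all rows), {ratios}')
```

If stratification worked, the per-category fractions should be nearly identical across the three splits, which also shows how imbalanced the classes are (Health is the smallest).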
Install the `transformers` library to use the BERT model. With `transformers`, many pre-trained models besides BERT can be used with only a few lines of code.
!pip install transformers
Import the libraries needed to train and evaluate your model.
import numpy as np
import transformers
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertModel
from torch import optim
from torch import cuda
import time
from matplotlib import pyplot as plt
Next, shape the data into a form that can be populated into the model.
First, define a class that creates a `Dataset`, the structure commonly used in PyTorch to hold feature vectors and label vectors together. By passing a `tokenizer` to this class, the input text can be preprocessed, padded to the specified maximum sequence length, and converted to word IDs. The `tokenizer` itself, which implements all of the BERT-specific processing, is obtained later from `transformers`, so the class only needs to pass the text to the `tokenizer` and receive the result.
# Dataset definition
class CreateDataset(Dataset):
    def __init__(self, X, y, tokenizer, max_len):
        self.X = X
        self.y = y
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):  # Value returned by len(Dataset)
        return len(self.y)

    def __getitem__(self, index):  # Value returned by Dataset[index]
        text = self.X[index]
        inputs = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',  # replaces the deprecated pad_to_max_length=True (transformers v3 and later)
            truncation=True  # cut titles longer than max_len so every sample has the same length
        )
        ids = inputs['input_ids']
        mask = inputs['attention_mask']

        return {
            'ids': torch.LongTensor(ids),
            'mask': torch.LongTensor(mask),
            'labels': torch.Tensor(self.y[index])
        }
Create the `Dataset` objects using the class above.
Four pre-trained English BERT models are available: LARGE, the configuration aiming for the highest accuracy, and BASE, which has fewer parameters, each in a lowercase-only (Uncased) and a mixed-case (Cased) variant.
This time, we use BASE Uncased, which is easy to try.
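For orientation, the four variants can be written down as a small lookup table. The identifiers below are the names these checkpoints are published under for `transformers`, and the hidden sizes are those of the BASE and LARGE configurations; `BERT_VARIANTS` and `select_bert` are hypothetical names introduced only for this illustration.

```python
# Model identifiers for the four English BERT variants,
# paired with the hidden size of each configuration.
BERT_VARIANTS = {
    ('base', 'uncased'): ('bert-base-uncased', 768),
    ('base', 'cased'): ('bert-base-cased', 768),
    ('large', 'uncased'): ('bert-large-uncased', 1024),
    ('large', 'cased'): ('bert-large-cased', 1024),
}

def select_bert(size='base', casing='uncased'):
    """Hypothetical helper: return (model name, hidden size) for a variant."""
    return BERT_VARIANTS[(size, casing)]

model_name, hidden_size = select_bert('base', 'uncased')
print(model_name, hidden_size)
```

The hidden size matters later: the fully connected layer placed on top of BERT must take an input of that dimension, so switching to a LARGE model would mean changing 768 to 1024.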
# One-hot encode the correct labels (dtype=float keeps them numeric; recent pandas versions default to booleans)
y_train = pd.get_dummies(train, columns=['CATEGORY'], dtype=float)[['CATEGORY_b', 'CATEGORY_e', 'CATEGORY_t', 'CATEGORY_m']].values
y_valid = pd.get_dummies(valid, columns=['CATEGORY'], dtype=float)[['CATEGORY_b', 'CATEGORY_e', 'CATEGORY_t', 'CATEGORY_m']].values
y_test = pd.get_dummies(test, columns=['CATEGORY'], dtype=float)[['CATEGORY_b', 'CATEGORY_e', 'CATEGORY_t', 'CATEGORY_m']].values
#Creating a Dataset
max_len = 20
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
dataset_train = CreateDataset(train['TITLE'], y_train, tokenizer, max_len)
dataset_valid = CreateDataset(valid['TITLE'], y_valid, tokenizer, max_len)
dataset_test = CreateDataset(test['TITLE'], y_test, tokenizer, max_len)
for var in dataset_train[0]:
    print(f'{var}: {dataset_train[0][var]}')
output
ids: tensor([ 101, 25416, 9463, 1011, 10651, 1015, 1011, 2647, 2482, 4341,
2039, 2005, 4369, 3204, 2004, 18730, 8980, 102, 0, 0])
mask: tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
labels: tensor([1., 0., 0., 0.])
The information for the first headline is printed.
You can see that the input string has been converted to a sequence of IDs in `ids`. In BERT, the special tokens [CLS] and [SEP] are inserted at the beginning and end of the original sentence during conversion, so they appear in the sequence as `101` and `102`. `0` represents padding. The correct label is held in one-hot format as `labels`. We also keep a `mask` that distinguishes real tokens (1) from padding (0) so that it can be passed to the model together with `ids` during training.
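The relationship between `ids` and `mask` can be reproduced by hand. The following is a minimal sketch in plain Python, not the tokenizer's actual implementation: `pad_and_mask` is a hypothetical helper, and the word IDs are copied from the output above with the special tokens and padding removed.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # IDs seen in the output above

def pad_and_mask(token_ids, max_len):
    """Hypothetical helper mimicking special tokens, truncation and padding."""
    body = token_ids[:max_len - 2]  # leave room for [CLS] and [SEP]
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)  # 1 marks a real token
    n_pad = max_len - len(ids)
    return ids + [PAD_ID] * n_pad, mask + [0] * n_pad  # 0 marks padding

# Word IDs of the first headline, copied from the output above without 101/102/0.
title_ids = [25416, 9463, 1011, 10651, 1015, 1011, 2647, 2482, 4341,
             2039, 2005, 4369, 3204, 2004, 18730, 8980]
ids, mask = pad_and_mask(title_ids, max_len=20)
print(ids)
print(mask)
```

Headlines longer than `max_len - 2` word pieces are cut in this sketch, so every sample ends up with exactly `max_len` entries, which is what allows the samples to be stacked into a batch.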
Next, define the network. With `transformers`, the entire BERT part can be expressed with `BertModel`. Then, to handle the classification task, define a dropout layer and a fully connected layer that receive BERT's output vector, and the network is complete.
# Definition of the BERT classification model
class BERTClass(torch.nn.Module):
    def __init__(self, drop_rate, output_size):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.drop = torch.nn.Dropout(drop_rate)
        self.fc = torch.nn.Linear(768, output_size)  # 768 matches the hidden size of BERT BASE

    def forward(self, ids, mask):
        # Index 1 is the pooled [CLS] output; indexing works whether the model
        # returns a tuple (older transformers) or a ModelOutput (v4 and later)
        out = self.bert(ids, attention_mask=mask)[1]
        out = self.fc(self.drop(out))
        return out
Now that the `Dataset` and the network are ready, all that remains is the usual training loop. Here, the whole flow is defined as the `train_model` function. For the meaning of each component, refer to the explanations that accompany the problems in the article [Language Processing 100 Knock 2020] Chapter 8: Neural Net.
def calculate_loss_and_accuracy(model, criterion, loader, device):
    """Calculate loss and accuracy"""
    model.eval()
    loss = 0.0
    total = 0
    correct = 0
    with torch.no_grad():
        for data in loader:
            # Move the batch to the device
            ids = data['ids'].to(device)
            mask = data['mask'].to(device)
            labels = data['labels'].to(device)

            # Forward propagation
            outputs = model(ids, mask)

            # Loss calculation
            loss += criterion(outputs, labels).item()

            # Accuracy calculation
            pred = torch.argmax(outputs, dim=-1).cpu().numpy()  # Predicted labels, one per sample in the batch
            labels = torch.argmax(labels, dim=-1).cpu().numpy()  # Correct labels, one per sample in the batch
            total += len(labels)
            correct += (pred == labels).sum().item()

    return loss / len(loader), correct / total
def train_model(dataset_train, dataset_valid, batch_size, model, criterion, optimizer, num_epochs, device=None):
    """Train the model and return a log of loss and accuracy"""
    # Move the model to the device
    model.to(device)

    # Create the dataloaders
    dataloader_train = DataLoader(dataset_train, batch_size=batch_size, shuffle=True)
    dataloader_valid = DataLoader(dataset_valid, batch_size=len(dataset_valid), shuffle=False)

    # Training
    log_train = []
    log_valid = []
    for epoch in range(num_epochs):
        # Record start time
        s_time = time.time()

        # Set to training mode
        model.train()
        for data in dataloader_train:
            # Move the batch to the device
            ids = data['ids'].to(device)
            mask = data['mask'].to(device)
            labels = data['labels'].to(device)

            # Initialize gradients to zero
            optimizer.zero_grad()

            # Forward propagation + backpropagation + weight update
            outputs = model(ids, mask)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

        # Calculate loss and accuracy
        loss_train, acc_train = calculate_loss_and_accuracy(model, criterion, dataloader_train, device)
        loss_valid, acc_valid = calculate_loss_and_accuracy(model, criterion, dataloader_valid, device)
        log_train.append([loss_train, acc_train])
        log_valid.append([loss_valid, acc_valid])

        # Save checkpoint
        torch.save({'epoch': epoch, 'model_state_dict': model.state_dict(), 'optimizer_state_dict': optimizer.state_dict()}, f'checkpoint{epoch + 1}.pt')

        # Record end time
        e_time = time.time()

        # Output log
        print(f'epoch: {epoch + 1}, loss_train: {loss_train:.4f}, accuracy_train: {acc_train:.4f}, loss_valid: {loss_valid:.4f}, accuracy_valid: {acc_valid:.4f}, {(e_time - s_time):.4f}sec')

    return {'train': log_train, 'valid': log_valid}
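The accuracy calculation in `calculate_loss_and_accuracy` compares the position of the largest logit with the position of the 1 in the one-hot label. A minimal sketch of that idea in plain Python follows; the logits are made-up numbers, and `argmax` and `accuracy` are helper names introduced here, not part of the code above.

```python
def argmax(values):
    """Index of the largest value (the first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(logits_batch, onehot_batch):
    """Fraction of rows whose highest logit matches the one-hot label."""
    correct = sum(argmax(l) == argmax(y) for l, y in zip(logits_batch, onehot_batch))
    return correct / len(onehot_batch)

# Made-up logits for three headlines; columns are ordered b, e, t, m.
logits = [[2.1, -0.3, 0.0, -1.2],
          [0.2, 1.5, -0.7, 0.1],
          [0.9, 0.4, -0.2, 0.3]]
labels = [[1, 0, 0, 0],
          [0, 1, 0, 0],
          [0, 0, 1, 0]]
print(accuracy(logits, labels))
```

In this toy batch the first two headlines are predicted correctly and the third is not, since its largest logit is in the b column while the label is t.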
Set the parameters and perform fine tuning.
#Parameter setting
DROP_RATE = 0.4
OUTPUT_SIZE = 4
BATCH_SIZE = 32
NUM_EPOCHS = 4
LEARNING_RATE = 2e-5
#Model definition
model = BERTClass(DROP_RATE, OUTPUT_SIZE)
#Definition of loss function
criterion = torch.nn.BCEWithLogitsLoss()
#Optimizer definition
optimizer = torch.optim.AdamW(params=model.parameters(), lr=LEARNING_RATE)
#Device specification
device = 'cuda' if cuda.is_available() else 'cpu'
#Model learning
log = train_model(dataset_train, dataset_valid, BATCH_SIZE, model, criterion, optimizer, NUM_EPOCHS, device=device)
output
epoch: 1, loss_train: 0.0859, accuracy_train: 0.9516, loss_valid: 0.1142, accuracy_valid: 0.9229, 49.9137sec
epoch: 2, loss_train: 0.0448, accuracy_train: 0.9766, loss_valid: 0.1046, accuracy_valid: 0.9259, 49.7376sec
epoch: 3, loss_train: 0.0316, accuracy_train: 0.9831, loss_valid: 0.1082, accuracy_valid: 0.9266, 49.5454sec
epoch: 4, loss_train: 0.0170, accuracy_train: 0.9932, loss_valid: 0.1179, accuracy_valid: 0.9289, 49.4525sec
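The loss values in this log come from `BCEWithLogitsLoss`, which applies a sigmoid to each of the four logits independently and, with its default settings, averages the binary cross-entropy over all elements. The sketch below recomputes that quantity for a single sample in plain Python; the logits are made-up numbers and `bce_with_logits` is a name introduced for illustration.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all elements, computed from raw logits."""
    total = 0.0
    for x, y in zip(logits, targets):
        # Numerically stable form of -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))]
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)

target = [1.0, 0.0, 0.0, 0.0]  # one-hot label for category b
loss_uninformed = bce_with_logits([0.0, 0.0, 0.0, 0.0], target)
loss_confident = bce_with_logits([4.0, -4.0, -4.0, -4.0], target)
print(loss_uninformed)  # all logits zero: every sigmoid output is 0.5
print(loss_confident)   # large logit for b, small logits elsewhere
```

Because three of the four targets are always 0, a model that merely pushes every logit down already gets a fairly low loss, so the absolute values are not directly comparable with a softmax cross-entropy loss.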
Check the result.
#Log visualization
x_axis = [x for x in range(1, len(log['train']) + 1)]
fig, ax = plt.subplots(1, 2, figsize=(15, 5))
ax[0].plot(x_axis, np.array(log['train']).T[0], label='train')
ax[0].plot(x_axis, np.array(log['valid']).T[0], label='valid')
ax[0].set_xlabel('epoch')
ax[0].set_ylabel('loss')
ax[0].legend()
ax[1].plot(x_axis, np.array(log['train']).T[1], label='train')
ax[1].plot(x_axis, np.array(log['valid']).T[1], label='valid')
ax[1].set_xlabel('epoch')
ax[1].set_ylabel('accuracy')
ax[1].legend()
plt.show()
# Calculation of accuracy
def calculate_accuracy(model, dataset, device):
    # Create the Dataloader
    loader = DataLoader(dataset, batch_size=len(dataset), shuffle=False)

    model.eval()
    total = 0
    correct = 0
    with torch.no_grad():
        for data in loader:
            # Move the batch to the device
            ids = data['ids'].to(device)
            mask = data['mask'].to(device)
            labels = data['labels'].to(device)

            # Forward propagation + get predictions + count correct answers
            outputs = model(ids, mask)
            pred = torch.argmax(outputs, dim=-1).cpu().numpy()
            labels = torch.argmax(labels, dim=-1).cpu().numpy()
            total += len(labels)
            correct += (pred == labels).sum().item()

    return correct / total
print(f'Correct answer rate (learning data):{calculate_accuracy(model, dataset_train, device):.3f}')
print(f'Correct answer rate (verification data):{calculate_accuracy(model, dataset_valid, device):.3f}')
print(f'Correct answer rate (evaluation data):{calculate_accuracy(model, dataset_test, device):.3f}')
output
Correct answer rate (learning data): 0.993
Correct answer rate (verification data): 0.929
Correct answer rate (evaluation data): 0.948
The accuracy on the evaluation data was about 95%.
In practice, settings such as whether to freeze the weights of each BERT layer and the learning rate are usually tuned while monitoring accuracy on the validation data. Here the parameters were fixed without any tuning, yet the accuracy was relatively high, which shows the strength of pre-training.
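As a rough illustration of what such tuning could look like in PyTorch, the sketch below freezes one layer and assigns different learning rates to the encoder and the classification head. `TinyClassifier` is a hypothetical stand-in with the same attribute names as `BERTClass` (`bert` and `fc`), used so that the example runs without downloading a model; the learning rate of 1e-3 for the head is an arbitrary value chosen for illustration, not a recommendation from this article.

```python
import torch

class TinyClassifier(torch.nn.Module):
    """Hypothetical stand-in with the same attribute names as BERTClass."""
    def __init__(self):
        super().__init__()
        self.bert = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Linear(8, 8))
        self.fc = torch.nn.Linear(8, 4)

model = TinyClassifier()

# Freeze the first encoder layer: its weights receive no gradient updates.
for param in model.bert[0].parameters():
    param.requires_grad = False

# Give the remaining encoder weights a smaller learning rate than the new head.
optimizer = torch.optim.AdamW([
    {'params': [p for p in model.bert.parameters() if p.requires_grad], 'lr': 2e-5},
    {'params': model.fc.parameters(), 'lr': 1e-3},
])

n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
n_total = sum(p.numel() for p in model.parameters())
print(f'trainable parameters: {n_trainable} / {n_total}')
```

With the real model, the same pattern would apply to submodules of `model.bert`; in `BertModel`, the individual Transformer blocks are available as `model.bert.encoder.layer[i]`.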
transformers
BERT (official)
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Devlin, J. et al. (2018) (original paper)
[Language Processing 100 Knock 2020] Summary of Answer Examples in Python