Table of contents
1. First of all
2. Introduction of the library
3. Test code on the livedoor corpus
In this article, I will introduce a library that lets you easily perform multi-class classification (document classification, text classification) of text with BERT. The "three lines" in the title refer to the fact that the library lets you do this in roughly three lines of code.
BERT is an abbreviation for Bidirectional Encoder Representations from Transformers. Roughly translated as "bidirectional encoder representations from the Transformer", it is a natural language processing model published in a paper by Jacob Devlin et al. of Google in October 2018. Natural language processing work such as translation, document classification, and question answering is referred to as "(natural language processing) tasks", and BERT set the highest scores of its time on a wide range of tasks. Quote: Ledge.ai "What is BERT | Explaining the features and mechanism of Google's proud natural language processing model"
Reference: Qiita "Thorough explanation of the paper on "BERT", the king of natural language processing"
Thankfully, there are already many sample articles on text classification with BERT. However, they are quite long and hard to get started with.
Reference:
- Japanese sentence classification using natural language processing model (BERT)
- Multi-value classification of Japanese sentences using BERT
- [PyTorch] Introduction to Japanese Document Classification Using BERT
So, after a little research, I found that someone had already packed all of this into a handy library:
"Simple Transformers"
Original article: [Simple Transformers — Multi-Class Text Classification with BERT, RoBERTa, XLNet, XLM, and DistilBERT](https://medium.com/swlh/simple-transformers-multi-class-text-classification-with-bert-roberta-xlnet-xlm-and-8b585000ce3a)
This library is a "ready-to-use" Transformer library. This is great if you want to use Transformer with three lines of code without worrying about the technical details. (Original article translation)
There are several BERT-like models. A library called Transformers can run eight of them, BERT, GPT, GPT-2, Transformer-XL, XLNet, XLM, RoBERTa, and DistilBERT, through a common interface. "Simple Transformers" is a library that makes this even easier to use.
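As a rough illustration (a minimal sketch based on the usage shown later in this article; the checkpoint name 'bert-base-cased' is just an example I picked), switching models is mostly a matter of changing the first two arguments:

```python
from simpletransformers.classification import ClassificationModel

# The first argument selects the model type, the second the pretrained weights.
# Swapping RoBERTa for BERT (for example) only changes these two strings.
model = ClassificationModel('bert', 'bert-base-cased', num_labels=4)
# model = ClassificationModel('roberta', 'roberta-base', num_labels=4)

# train_df / eval_df are pandas DataFrames with 'text' and 'label' columns,
# prepared as in the demo below.
# model.train_model(train_df)
# result, model_outputs, wrong_predictions = model.eval_model(eval_df)
```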
Officially, conda is recommended, but I used a venv virtual environment.
Prerequisites: $ pip install pandas tqdm scipy scikit-learn transformers tensorboardx simpletransformers
**In addition to these, you will need PyTorch.**
If you use GPU, you need to install CUDA separately, so please check it.
For CPU, you only need to install pytorch.
You can get the installation command that suits your environment from the official site. → PyTorch Official
By the way, in my environment I could not avoid GPU out-of-memory errors, so I ran everything on the CPU. It takes a long time.
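If you are not sure whether your environment can actually use the GPU, a quick check with PyTorch itself (a small sketch, independent of Simple Transformers) looks like this:

```python
import torch

# True only if a CUDA-capable GPU and a matching CUDA runtime are available
print(torch.cuda.is_available())

# If this prints False, Simple Transformers can still be run on the CPU
# by passing use_cuda=False, as done later in this article.
```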
First of all, here is a summary of the demo from the official page: put train.csv and test.csv in the data/ directory and run the following.
import pandas as pd
train_df = pd.read_csv('data/train.csv', header=None)
train_df['text'] = train_df.iloc[:, 1] + " " + train_df.iloc[:, 2]
train_df = train_df.drop(train_df.columns[[1, 2]], axis=1)
train_df.columns = ['label', 'text']
train_df = train_df[['text', 'label']]
train_df['text'] = train_df['text'].apply(lambda x: x.replace('\\', ' '))
train_df['label'] = train_df['label'].apply(lambda x:x-1)
eval_df = pd.read_csv('data/test.csv', header=None)
eval_df['text'] = eval_df.iloc[:, 1] + " " + eval_df.iloc[:, 2]
eval_df = eval_df.drop(eval_df.columns[[1, 2]], axis=1)
eval_df.columns = ['label', 'text']
eval_df = eval_df[['text', 'label']]
eval_df['text'] = eval_df['text'].apply(lambda x: x.replace('\\', ' '))
eval_df['label'] = eval_df['label'].apply(lambda x:x-1)
from simpletransformers.classification import ClassificationModel
model = ClassificationModel('roberta', 'roberta-base', num_labels=4)
model.train_model(train_df)
result, model_outputs, wrong_predictions = model.eval_model(eval_df)
The above is the sample published in the original article. It's easy.
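By default, eval_model reports metrics such as MCC and the evaluation loss. As far as I know (treat this as an assumption and check the documentation), you can also pass additional metric functions as keyword arguments, for example:

```python
from sklearn.metrics import accuracy_score

# Extra metrics are passed as keyword arguments: name -> function(labels, predictions)
result, model_outputs, wrong_predictions = model.eval_model(eval_df, acc=accuracy_score)
print(result)  # the returned dict should then also contain an 'acc' entry
```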
Next, I wondered how well it works on Japanese text (although I still don't understand BERT well enough), so I tried it on the familiar livedoor corpus.
As downloaded, the corpus is scattered across .txt files for each domain, so I combined it into a single CSV, keeping only the label and the body and replacing each domain name with a numeric label (a rough sketch of this preprocessing is shown below). Since running on the CPU is somewhat painful, the test used only 3 domains, labeled 0 to 2 (dokujo-tsushin, it-life-hack, kaden-channel).
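The sketch below shows one way this could be done; the file layout, the label mapping, and the number of header lines to skip are my assumptions about the standard livedoor corpus download, so adjust them to your data:

```python
import glob
import os

import pandas as pd

# Assumed layout after extraction: text/<domain>/<domain>-*.txt
# Only three domains are used here, mapped to labels 0-2.
label_map = {'dokujo-tsushin': 0, 'it-life-hack': 1, 'kaden-channel': 2}

rows = []
for domain, label in label_map.items():
    for path in glob.glob(os.path.join('text', domain, f'{domain}-*.txt')):
        with open(path, encoding='utf-8') as f:
            lines = f.read().splitlines()
        # The first lines hold the URL, date, and title; the rest is the body.
        rows.append({'label': label, 'text': ' '.join(lines[3:])})

data = pd.DataFrame(rows)
data.to_csv('livedoor_3domains.csv', index=False)
```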
Divide this into train and test:
from sklearn.model_selection import train_test_split
X_train_df, X_test_df, y_train_s, y_test_s = train_test_split(
data["text"], data["label"], test_size=0.2, random_state=0, stratify=data["label"]
)
train_df = pd.DataFrame([X_train_df,y_train_s]).T
test_df = pd.DataFrame([X_test_df,y_test_s]).T
train_df["label"] = train_df["label"].astype("int")
test_df["label"] = test_df["label"].astype("int")
from simpletransformers.classification import ClassificationModel
model = ClassificationModel('roberta', 'roberta-base', num_labels=3,use_cuda=False)
model.train_model(train_df)
result, model_outputs, wrong_predictions = model.eval_model(test_df)
Accuracy: 0.8798329801724872
Loss: 0.24364208317164218
I haven't read the original data closely, so I don't know the characteristics of each domain, but that is decent accuracy.
As a bonus, I also tried predicting articles from other domains.
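A minimal sketch of how such a prediction could be run (other_texts here is a hypothetical list of article bodies from a domain that was not used for training):

```python
# Hypothetical list of article bodies from an unseen domain (e.g. S-MAX)
other_texts = ["..."]

# predictions are the predicted label indices, raw_outputs the per-class scores
predictions, raw_outputs = model.predict(other_texts)
print(predictions[:10])
```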
It seems that IT life hack and gadget site S-MAX are similar.
I also plotted the outputs roughly, without separating them by domain, and the classes are quite clearly separated.
You can easily perform text classification with BERT just by preparing the data like this. Judging from the GitHub page, it can also handle detailed settings and other tasks (a rough sketch of the settings is shown below). I tried it before properly studying BERT, so I will study a bit more and then try again with various data. Please give it a try.
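As a rough illustration of what "detailed settings" means, training options can be passed as a dict (a sketch from memory of the library's README; the exact option names should be checked against the GitHub page):

```python
from simpletransformers.classification import ClassificationModel

# Assumed option names; newer versions also accept a ClassificationArgs object.
train_args = {
    'num_train_epochs': 1,
    'train_batch_size': 8,
    'max_seq_length': 128,
    'output_dir': 'outputs/',
    'overwrite_output_dir': True,
}

model = ClassificationModel('roberta', 'roberta-base', num_labels=3,
                            use_cuda=False, args=train_args)
```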