An era when text classification by BERT is possible with 3 lines of code

Table of contents

1. Introduction
2. Introducing the library
3. Test code on the livedoor corpus

Introduction

In this article, I will introduce a library that lets you easily perform multi-class classification (document classification, text classification) of text with BERT. The "three lines" in the title refers to the library's method calls: the classification itself takes about three lines of code.

What is BERT

BERT is an abbreviation for Bidirectional Encoder Representations from Transformers. Roughly translated as "bidirectional encoder representations from Transformers", it is a natural language processing model published in a paper by Jacob Devlin et al. of Google in October 2018. Jobs in natural language processing such as translation, document classification, and question answering are called "(natural language processing) tasks", and BERT set the highest scores of its time on a variety of tasks. Quote: Ledge.ai "What is BERT | Explaining the features and mechanism of Google's proud natural language processing model"

Reference: Qiita "Thorough explanation of the paper on 'BERT', the king of natural language processing"

Text classification by BERT

Thankfully, there are already many sample articles on text classification with BERT. However, they are quite long, and it is hard to get started.

Reference:
Japanese sentence classification using a natural language processing model (BERT)
Multi-class classification of Japanese sentences using BERT
[PyTorch] Introduction to Japanese document classification using BERT

So, after a little research, I found that someone had packaged it all into a handy library:

「Simple Transformers」

Original article: [Simple Transformers — Multi-Class Text Classification with BERT, RoBERTa, XLNet, XLM, and DistilBERT](https://medium.com/swlh/simple-transformers-multi-class-text-classification-with-bert-roberta-xlnet-xlm-and-8b585000ce3a)

This library is a "ready to use" Transformer library: great if you want to use a Transformer with three lines of code without worrying about the technical details. (translated from the original article)

There are several models in the BERT family. The Transformers library lets you run eight of them (BERT, GPT, GPT-2, Transformer-XL, XLNet, XLM, RoBERTa, and DistilBERT) with a common interface, and "Simple Transformers" is a wrapper that makes that library even easier to use.

Installation

Officially, conda is recommended, but I used a venv virtual environment.

Prerequisites:

$ pip install pandas tqdm scipy scikit-learn transformers tensorboardx simpletransformers

In addition to these, you will need PyTorch. If you use a GPU, you also need to install CUDA separately, so please check that. For CPU-only use, installing PyTorch is enough. You can get the installation command that matches your environment from the official site. → PyTorch Official

Incidentally, in my environment I could not avoid a GPU out-of-memory error, so I ran everything on the CPU. It takes a long time.
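If you want to confirm which device you will end up running on, a minimal check with plain PyTorch (nothing specific to Simple Transformers) looks like this:

import torch

# quick environment check: installed PyTorch version and whether a CUDA GPU is usable
print(torch.__version__)
print(torch.cuda.is_available())  # False means you will run on the CPU (pass use_cuda=False later)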

Trying it out

First, here is a quick summary of the demo from the official documentation.

Data acquisition

  1. Download the data from here
  2. Extract train.csv and test.csv into the data/ directory (a quick check of the expected format follows below)
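The preprocessing below assumes a headerless CSV whose first column is a 1-to-4 class label and whose next two columns are the title and description. A quick sanity check of that format (not part of the official demo) might look like this:

import pandas as pd

# peek at the raw file: expect three columns (label, title, description) and no header row
raw = pd.read_csv('data/train.csv', header=None)
print(raw.shape)
print(raw.head(3))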

Preprocessing

import pandas as pd

# build the training frame: merge the title and description columns into a single text column
train_df = pd.read_csv('data/train.csv', header=None)
train_df['text'] = train_df.iloc[:, 1] + " " + train_df.iloc[:, 2]
train_df = train_df.drop(train_df.columns[[1, 2]], axis=1)
train_df.columns = ['label', 'text']
train_df = train_df[['text', 'label']]  # Simple Transformers expects the text column first, the label column second
train_df['text'] = train_df['text'].apply(lambda x: x.replace('\\', ' '))
train_df['label'] = train_df['label'].apply(lambda x: x - 1)  # shift labels from 1-4 to 0-3

# the same preprocessing for the evaluation frame
eval_df = pd.read_csv('data/test.csv', header=None)
eval_df['text'] = eval_df.iloc[:, 1] + " " + eval_df.iloc[:, 2]
eval_df = eval_df.drop(eval_df.columns[[1, 2]], axis=1)
eval_df.columns = ['label', 'text']
eval_df = eval_df[['text', 'label']]
eval_df['text'] = eval_df['text'].apply(lambda x: x.replace('\\', ' '))
eval_df['label'] = eval_df['label'].apply(lambda x: x - 1)
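Before training, it does not hurt to confirm that the frames look the way the library expects (again, a check added here, not part of the demo):

# the text column should come first, and the labels should be 0-indexed integers
print(train_df.head())
print(sorted(train_df['label'].unique()))  # expect [0, 1, 2, 3] after the -1 shift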

Instance generation

from simpletransformers.classification import ClassificationModel

model = ClassificationModel('roberta', 'roberta-base', num_labels=4)  # model type, pretrained weights, number of classes

Training

model.train_model(train_df)

Evaluation

result, model_outputs, wrong_predictions = model.eval_model(eval_df)

The above is the sample published in the original article. It's easy.
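For reference, the constructor also accepts an args dictionary for training options. Something like the following should work (a sketch based on the library's documented options, with hypothetical values rather than settings used in this article):

from simpletransformers.classification import ClassificationModel

# hypothetical example: tweak a few training options via the args dictionary
model = ClassificationModel(
    'roberta', 'roberta-base', num_labels=4,
    args={'num_train_epochs': 1, 'train_batch_size': 8, 'overwrite_output_dir': True}
)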

How about in Japanese?

Next, I wondered how well it works on Japanese text (although I do not understand BERT well enough yet). I tried it on the familiar livedoor news corpus.

Preprocessing

As downloaded, the corpus is scattered across .txt files for each domain (news site), so I consolidated it into a single CSV. In doing so, I replaced the domain with a numeric label, keeping only the label and the body text. Since running on the CPU is slow, the test used only three domains, labeled 0 to 2 (dokujo-tsushin, it-life-hack, kaden-channel).
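The consolidation itself is not shown in the article, but a rough sketch might look like the following (the text/<domain>/*.txt layout and the three header lines per file are assumptions about the extracted corpus; adjust to your copy):

import glob
import os
import pandas as pd

# rough sketch: turn the per-domain .txt files into one DataFrame with 'text' and 'label' columns
domains = ['dokujo-tsushin', 'it-life-hack', 'kaden-channel']  # three domains -> labels 0, 1, 2
rows = []
for label, domain in enumerate(domains):
    for path in glob.glob(os.path.join('text', domain, f'{domain}-*.txt')):
        with open(path, encoding='utf-8') as f:
            lines = f.read().splitlines()
        body = ' '.join(lines[3:])  # assumed file layout: URL, date, title, then the body
        rows.append({'text': body, 'label': label})
data = pd.DataFrame(rows)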

Divide this into train and test

from sklearn.model_selection import train_test_split

# 'data' is the consolidated DataFrame with 'text' and 'label' columns
X_train_df, X_test_df, y_train_s, y_test_s = train_test_split(
    data["text"], data["label"], test_size=0.2, random_state=0, stratify=data["label"]
)

# reassemble (text, label) DataFrames in the layout Simple Transformers expects
train_df = pd.DataFrame([X_train_df, y_train_s]).T
test_df = pd.DataFrame([X_test_df, y_test_s]).T

train_df["label"] = train_df["label"].astype("int")
test_df["label"] = test_df["label"].astype("int")

Training & evaluation

from simpletransformers.classification import ClassificationModel

model = ClassificationModel('roberta', 'roberta-base', num_labels=3, use_cuda=False)  # use_cuda=False forces a CPU run
model.train_model(train_df)
result, model_outputs, wrong_predictions = model.eval_model(test_df)

Result

Accuracy: 0.8798329801724872 
Loss: 0.24364208317164218

I have not read through the original data carefully, so I do not know the characteristics of each domain, but that is decent accuracy.
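Incidentally, accuracy is not one of eval_model's default metrics; it can be obtained by passing an extra metric function as a keyword argument, for example (a sketch of the documented mechanism, not necessarily how the numbers above were produced):

from sklearn.metrics import accuracy_score

# extra metrics are passed to eval_model as keyword arguments (name=function taking true labels and predictions)
result, model_outputs, wrong_predictions = model.eval_model(test_df, acc=accuracy_score)
print(result)  # now includes 'acc' alongside defaults such as eval_loss and mcc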

As a bonus, here is what happens when I predict articles from the other domains. [figure: prediction results for articles from other domains]

It seems that IT life hack and gadget site S-MAX are similar.
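The prediction itself uses the model's predict method, which takes a list of raw strings, roughly like this (the texts here are placeholders):

# predict returns the predicted labels plus the raw model outputs
texts = ['article body from another domain ...', 'another article body ...']  # placeholder texts
predictions, raw_outputs = model.predict(texts)
print(predictions)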

Overall probability

I plotted it roughly, without splitting by domain, and the classes separate quite cleanly. [figure: plot of prediction probabilities across all domains]
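One way to turn the raw outputs into per-class probabilities for such a plot (a sketch; not necessarily how the figures here were produced):

import matplotlib.pyplot as plt
from scipy.special import softmax

# model_outputs from eval_model are unnormalized scores; softmax turns each row into probabilities
probs = softmax(model_outputs, axis=1)
plt.hist(probs.max(axis=1), bins=20)  # confidence of the predicted class for each article
plt.xlabel('predicted-class probability')
plt.show()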

Probability per domain

[figure: prediction probabilities per domain]

Conclusion

You can easily perform text classification with BERT just by preparing the data like this. Looking at the GitHub repository, it also seems usable for more detailed settings and for other tasks. I tried it before properly learning about BERT, so I will study a bit more and then try again with various data. Please give it a try.
