One of the tasks in NLP is document classification: estimating which label should be attached to a document.
Document classification can be broadly divided into the following two types, according to the nature of the label attached to the document.
- Topic classification
  - Documents are labeled by topic
  - You often see this with news articles labeled as politics, sports, entertainment, and so on
  - Some setups are binary classification, others are multi-label (multi-label is the more common case)
  - Applied to news article recommendation, etc.
- Sentiment analysis
  - Documents are labeled by whether they are positive or negative
  - There are binary setups, and also setups with more classes (e.g., the 3 labels positive, neutral, negative)
  - Also used for marketing research
There are many ways to solve these document classification problems. Two typical approaches are the following. (I think there are others.)
- Create a document vector and classify it with a machine-learning method
  - How to make the document vector
    - Tf-idf
    - Bag of embeddings (take the mean or the max over the distributed representations of the words in the document; see the sketch after this list)
  - How to classify
    - Logistic regression
    - Naive Bayes
    - Support vector machine
    - Random forest, XGBoost
    - and so on
- Feed the raw text into a neural network
  - LSTM
  - BERT fine-tuning
  - and so on
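As an aside, the "bag of embeddings" above just means pooling (mean or max) the word vectors of the words in a document. A minimal sketch, assuming some pretrained word vectors are available; the `word_vecs` dict and its values here are made up for illustration:

```python
import numpy as np

# Hypothetical pretrained word vectors (in practice: word2vec, GloVe, fastText, ...)
word_vecs = {
    "stocks":  np.array([0.2, -0.1, 0.5]),
    "fell":    np.array([0.0,  0.3, -0.2]),
    "sharply": np.array([0.1,  0.1, 0.4]),
}

def doc_vector(tokens, pooling="mean"):
    """Build a document vector by pooling the word vectors of its tokens."""
    vecs = [word_vecs[t] for t in tokens if t in word_vecs]
    if not vecs:
        return np.zeros(3)  # zero vector for documents with no known words (dim matches the toy vectors)
    return np.mean(vecs, axis=0) if pooling == "mean" else np.max(vecs, axis=0)

print(doc_vector(["stocks", "fell", "sharply"]))
```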
I went for the easiest one, though I'm not really sure which one is actually the easiest (hey). This time, I would like to classify Tf-idf vectors with an SVM (with a linear kernel). Tf-idf is a vector whose elements are the frequency of each word in a document multiplied by the importance of that word, so the dimension of the document vector equals the vocabulary size (a tiny example below).
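To make that concrete, here is a minimal sketch with two toy sentences (not the Reuters data) showing that sklearn's TfidfVectorizer returns one row per document and one column per vocabulary word:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix of shape (n_documents, n_vocabulary)

print(X.shape)                      # (2, vocabulary size)
print(len(vectorizer.vocabulary_))  # same vocabulary size
```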
SVM with linear kernel seems to be a little difficult to explain, so I will omit it.
This time, I will use the one included in sklearn.
Since the model is simple (?), I will try a corpus that is a little more complicated (multi-label: each document is given several topics; see the small example below). The corpus is the Reuters news corpus, with about 10,000 documents and 90 labels.
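Multi-label means each document can carry several topic labels at the same time. Later in the code this is handled with sklearn's MultiLabelBinarizer; a minimal sketch with made-up label sets:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up label sets: each document has one or more topics
labels = [["acq", "crude"], ["earn"], ["crude", "ship"]]

mlb = MultiLabelBinarizer()
binary = mlb.fit_transform(labels)

print(mlb.classes_)  # ['acq' 'crude' 'earn' 'ship']
print(binary)
# [[1 1 0 0]
#  [0 0 1 0]
#  [0 1 0 1]]
```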
Download the corpus. The Python module nltk includes the Reuters corpus, so use that.
First, if nltk is not installed:

```
pip install nltk
```
Then type the following in a Python interactive shell:

```
python
>>> import nltk
>>> nltk.download("reuters")
```
Then a directory called nltk_data is created under your user directory, and the downloaded data is placed inside it.
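If you want to check what was downloaded, the corpus can also be inspected through nltk's corpus reader; a quick sketch (the counts should match the roughly 10,000 documents and 90 labels mentioned above):

```python
from nltk.corpus import reuters

print(len(reuters.fileids()))     # number of documents (training/... and test/... ids)
print(len(reuters.categories()))  # number of topic labels (90)
print(reuters.categories(reuters.fileids()[0]))  # topics of one document
```

The full preprocessing and classification code follows.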
```python
import glob
import nltk
import re
import codecs
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords, reuters
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn import metrics
path = "../nltk_data/corpora/reuters/"

# Read the stopword list shipped with the corpus (one word per line)
with open(path + "stopwords") as sw:
    stopwords = [x.strip() for x in sw]
# Define the tokenizer: lowercase, drop stopwords, stem, keep alphabetic tokens of length >= 3
def tokenize(text):
    min_length = 3
    words = map(lambda word: word.lower(), word_tokenize(text))
    words = [word for word in words if word not in stopwords]
    tokens = list(map(lambda token: PorterStemmer().stem(token), words))
    p = re.compile('[a-zA-Z]+')
    filtered_tokens = list(filter(lambda token: p.match(token) and len(token) >= min_length, tokens))
    return filtered_tokens
# Get the document ids and their categories from cats.txt
with codecs.open("../nltk_data/corpora/reuters/cats.txt", "r", "utf-8", "ignore") as categories:
    train_docs_id = [line.split(" ")[0][9:] for line in categories if line.split(" ")[0][:9] == 'training/']
    categories.seek(0)
    test_docs_id = [line.split(" ")[0][5:] for line in categories if line.split(" ")[0][:5] == 'test/']
    categories.seek(0)
    train_docs_cat = [line.strip("\n").split(" ")[1:] for line in categories if line.split(" ")[0][:9] == 'training/']
    categories.seek(0)
    test_docs_cat = [line.strip("\n").split(" ")[1:] for line in categories if line.split(" ")[0][:5] == 'test/']
# Read the documents into lists
train_docs = []
test_docs = []
for num in train_docs_id:
    with codecs.open(path + "training/" + num, "r", "utf-8", "ignore") as doc:
        train_docs.append(" ".join([line.strip(" ") for line in doc.read().split("\n")]))
for num in test_docs_id:
    with codecs.open(path + "test/" + num, "r", "utf-8", "ignore") as doc:
        test_docs.append(" ".join([line.strip(" ") for line in doc.read().split("\n")]))
# Generate document vectors from the document lists with sklearn's TfidfVectorizer
vectorizer = TfidfVectorizer(tokenizer=tokenize)
vectorised_train_documents = vectorizer.fit_transform(train_docs)
vectorised_test_documents = vectorizer.transform(test_docs)

# Convert the multi-label category lists into binary (0 or 1) indicator vectors
mlb = MultiLabelBinarizer()
train_labels = mlb.fit_transform(train_docs_cat)
test_labels = mlb.transform(test_docs_cat)
# Classifier: try different values of the regularization parameter C
param_list = [0.001, 0.01, 0.1, 0.5, 1, 10, 100]
print("parameter test_f1 train_f1")
for C in param_list:
    classifier = OneVsRestClassifier(LinearSVC(C=C, random_state=42))
    classifier.fit(vectorised_train_documents, train_labels)
    predictions = classifier.predict(vectorised_test_documents)
    train_predictions = classifier.predict(vectorised_train_documents)
    ftest = metrics.f1_score(test_labels, predictions, average="macro")
    ftrain = metrics.f1_score(train_labels, train_predictions, average="macro")
    print("c={}:\t{}\t{}".format(C, ftest, ftrain))
```
Running the above code gives the following result:
```
parameter test_f1 train_f1
c=0.001: 0.009727246626471432 0.007884179312750742
c=0.01: 0.02568945815128711 0.02531440097069285
c=0.1: 0.20504347026711428 0.26430270726815386
c=0.5: 0.3908058642922242 0.6699048987962078
c=1: 0.45945765878179573 0.9605946547451458
c=10: 0.5253686991407462 0.9946632502765812
c=100: 0.5312185383446876 0.9949908225328556
```
It is overfitting to its heart's content. According to the paper below, the same kind of method should give an accuracy in the high 80s... https://www.aclweb.org/anthology/N19-1408/
If anyone knows why it doesn't work well, please let me know.
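One thing to note (my own remark, not from the paper): the scores above are macro-averaged F1, which weights all 90 labels equally, including the very rare ones, whereas Reuters results are often reported micro-averaged, which tends to come out considerably higher. The micro-averaged score can be computed from the same predictions; a minimal sketch reusing the variables from the last loop iteration above (C=100):

```python
# Micro-averaged F1 over the same test predictions (assumes `test_labels`, `predictions`, `metrics` from above)
micro_test = metrics.f1_score(test_labels, predictions, average="micro")
print("micro-averaged test F1: {}".format(micro_test))
```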