Chapter 6 of *Practical Machine Learning System* has a text classification example using Naive Bayes, so I decided to try it myself. The flow is: load the sklearn 20newsgroups dataset, vectorize it, and classify it with [sklearn.naive_bayes.MultinomialNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html). All defaults are used except for the stop-word setting.
```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
import nltk

def stopwords():
    symbols = ["'", '"', '`', '.', ',', '-', '!', '?', ':', ';', '(', ')', '*', '--', '\\']
    stopwords = nltk.corpus.stopwords.words('english')
    return stopwords + symbols

newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

vectorizer = CountVectorizer(stop_words=stopwords())
vectorizer.fit(newsgroups_train.data)

# Train
X = vectorizer.transform(newsgroups_train.data)
y = newsgroups_train.target
print(X.shape)

clf = MultinomialNB()
clf.fit(X, y)
print(clf.score(X, y))

# Test
X_test = vectorizer.transform(newsgroups_test.data)
y_test = newsgroups_test.target
print(clf.score(X_test, y_test))
```
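The same fit-transform-predict pattern can be seen in miniature on a self-contained example; the tiny corpus and labels below are made up purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical two-class toy corpus (0 = sports, 1 = tech)
docs = [
    "the team won the game",
    "a great goal in the match",
    "the new cpu is fast",
    "install the software update",
]
labels = [0, 0, 1, 1]

# Build the vocabulary from the training documents
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

clf = MultinomialNB()
clf.fit(X, labels)

# An unseen document must be transformed with the same vocabulary
pred = clf.predict(vectorizer.transform(["the team scored a goal"]))
print(pred[0])  # → 0 (the sports class)
```

The key point is that `transform` (not `fit_transform`) is used on new text, so the test documents are mapped onto the vocabulary learned from the training set.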
The accuracy on the test data was 62% (81% on the training data).
Using sklearn makes it easy to classify text with a Naive Bayes classifier. However, since the test accuracy is only 62%, it seems necessary to apply additional natural language processing, such as TF-IDF weighting and stemming, to improve it.
I switched to TfidfVectorizer and searched for the optimal parameters with GridSearchCV. The accuracy on the test data improved slightly, to 66%.
```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
import nltk

def stopwords():
    symbols = ["'", '"', '`', '.', ',', '-', '!', '?', ':', ';', '(', ')', '*', '--', '\\']
    stopwords = nltk.corpus.stopwords.words('english')
    return stopwords + symbols

newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

# Pipeline: vectorize with TF-IDF, then classify with multinomial Naive Bayes
pipeline = Pipeline([('vectorizer', TfidfVectorizer()), ('multinomial_nb', MultinomialNB())])
params = {
    'vectorizer__max_df': [1.0, 0.99],
    'vectorizer__ngram_range': [(1, 1), (1, 2)],
    'vectorizer__stop_words': [stopwords()],
}
clf = GridSearchCV(pipeline, params)

# Train
X = newsgroups_train.data
y = newsgroups_train.target
clf.fit(X, y)
print(clf.score(X, y))

# Test
X_test = newsgroups_test.data
y_test = newsgroups_test.target
print(clf.score(X_test, y_test))
```
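After fitting, GridSearchCV exposes the winning parameter combination via `best_params_`. A minimal self-contained sketch on a made-up toy corpus (the real run above searches over the 20newsgroups data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: two classes with two documents each
docs = ["good game", "great match", "fast cpu", "new software"]
labels = [0, 0, 1, 1]

pipeline = Pipeline([('vectorizer', TfidfVectorizer()), ('multinomial_nb', MultinomialNB())])
params = {'vectorizer__ngram_range': [(1, 1), (1, 2)]}

# cv=2 so each stratified fold still contains both classes
clf = GridSearchCV(pipeline, params, cv=2)
clf.fit(docs, labels)
print(clf.best_params_)
```

Inspecting `best_params_` (and `best_score_`) shows which vectorizer settings the search selected, which helps decide what to tune next.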