Chapter 6 of *Practical Machine Learning System* has a text classification example using Naive Bayes, so I decided to try it myself. The flow is: load the sklearn 20newsgroups dataset, vectorize it, and classify it with [sklearn.naive_bayes.MultinomialNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html). All defaults are used except for the stop-word setting.
```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
import nltk

def stopwords():
    symbols = ["'", '"', '`', '.', ',', '-', '!', '?', ':', ';', '(', ')', '*', '--', '\\']
    stopwords = nltk.corpus.stopwords.words('english')
    return stopwords + symbols

newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

vectorizer = CountVectorizer(stop_words=stopwords())
vectorizer.fit(newsgroups_train.data)

# Train
X = vectorizer.transform(newsgroups_train.data)
y = newsgroups_train.target
print(X.shape)

clf = MultinomialNB()
clf.fit(X, y)
print(clf.score(X, y))

# Test
X_test = vectorizer.transform(newsgroups_test.data)
y_test = newsgroups_test.target
print(clf.score(X_test, y_test))
```
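The same fit-transform-predict pattern can be seen in miniature on a self-contained example; the tiny corpus and labels below are made up purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical two-class toy corpus (0 = sports, 1 = tech)
docs = [
    "the team won the game",
    "a great goal in the match",
    "the new cpu is fast",
    "install the software update",
]
labels = [0, 0, 1, 1]

# Build the vocabulary from the training documents
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

clf = MultinomialNB()
clf.fit(X, labels)

# An unseen document must be transformed with the same vocabulary
pred = clf.predict(vectorizer.transform(["the team scored a goal"]))
print(pred[0])  # → 0 (the sports class)
```

The key point is that `transform` (not `fit_transform`) is used on new text, so the test documents are mapped onto the vocabulary learned from the training set.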
The accuracy on the test data was 62% (81% on the training data).
Using sklearn makes it easy to classify text with a Naive Bayes classifier. However, since the test accuracy is only 62%, it seems necessary to apply additional natural language processing, such as TF-IDF weighting and stemming, to improve it.
I switched to TfidfVectorizer and searched for the optimal parameters with GridSearchCV. The accuracy on the test data improved slightly, to 66%.
```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
import nltk

def stopwords():
    symbols = ["'", '"', '`', '.', ',', '-', '!', '?', ':', ';', '(', ')', '*', '--', '\\']
    stopwords = nltk.corpus.stopwords.words('english')
    return stopwords + symbols

newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

# Pipeline: vectorize with TF-IDF, then classify with multinomial Naive Bayes
pipeline = Pipeline([('vectorizer', TfidfVectorizer()), ('multinomial_nb', MultinomialNB())])
params = {
    'vectorizer__max_df': [1.0, 0.99],
    'vectorizer__ngram_range': [(1, 1), (1, 2)],
    'vectorizer__stop_words': [stopwords()],
}
clf = GridSearchCV(pipeline, params)

# Train
X = newsgroups_train.data
y = newsgroups_train.target
clf.fit(X, y)
print(clf.score(X, y))

# Test
X_test = newsgroups_test.data
y_test = newsgroups_test.target
print(clf.score(X_test, y_test))
```
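After fitting, GridSearchCV exposes the winning parameter combination via `best_params_`. A minimal self-contained sketch on a made-up toy corpus (the real run above searches over the 20newsgroups data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: two classes with two documents each
docs = ["good game", "great match", "fast cpu", "new software"]
labels = [0, 0, 1, 1]

pipeline = Pipeline([('vectorizer', TfidfVectorizer()), ('multinomial_nb', MultinomialNB())])
params = {'vectorizer__ngram_range': [(1, 1), (1, 2)]}

# cv=2 so each stratified fold still contains both classes
clf = GridSearchCV(pipeline, params, cv=2)
clf.fit(docs, labels)
print(clf.best_params_)
```

Inspecting `best_params_` (and `best_score_`) shows which vectorizer settings the search selected, which helps decide what to tune next.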