I was reading smoothly through Introductory Natural Language Processing when I got stuck on Chapter 6, Exercise 5. When I tried to classify documents with the maximum entropy classifier, it spewed out warnings and training got nowhere.
# -*- coding: utf-8 -*-
#from __future__ import division
import nltk, re
import random
import numpy

# Classify movie reviews as positive or negative
# Data
from nltk.corpus import movie_reviews
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

# Feature extractor
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = all_words.keys()[:2000]  # the 2000 most frequent words

def document_features(document):
    document_words = set(document)  # set membership tests are fast
    features = {}
    for w in word_features:
        # Whether each of the top 2000 words appears in the document
        features['contains(%s)' % w] = (w in document_words)
    return features

# Train and test the classifier
featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]

# Maximum entropy classification
maxentclassifier = nltk.MaxentClassifier.train(train_set)

# Test
print "MaxentClassifier"
print nltk.classify.accuracy(maxentclassifier, test_set)
# show_most_informative_features prints its own output and returns None
maxentclassifier.show_most_informative_features(5)
The output looked like this:
==> Training (100 iterations)
    Iteration    Log Likelihood    Accuracy
    ---------------------------------------
        1          -0.69315        0.498
/usr/local/lib/python2.7/site-packages/nltk/classify/maxent.py:1332: RuntimeWarning: overflow encountered in power
  exp_nf_delta = 2 ** nf_delta
/usr/local/lib/python2.7/site-packages/nltk/classify/maxent.py:1334: RuntimeWarning: invalid value encountered in multiply
  sum1 = numpy.sum(exp_nf_delta * A, axis=0)
/usr/local/lib/python2.7/site-packages/nltk/classify/maxent.py:1335: RuntimeWarning: invalid value encountered in multiply
  sum2 = numpy.sum(nf_exp_nf_delta * A, axis=0)
/usr/local/lib/python2.7/site-packages/nltk/classify/maxent.py:1341: RuntimeWarning: invalid value encountered in divide
  deltas -= (ffreq_empirical - sum1) / -sum2
    Final               nan        0.502
Apparently, the exponentiation in maxent.py overflows with the default settings. I googled around, but information in Japanese was hard to find, so I am leaving this note.
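The warnings can be reproduced outside NLTK. Here is a minimal sketch that assumes only NumPy and uses made-up values (the arrays are illustrative, not taken from the actual training run): once an exponent is larger than a 64-bit float can represent, `2 ** x` becomes `inf`, and multiplying `inf` by zero gives `nan`, which then spreads through the sums.

```python
import numpy

# Hypothetical exponents standing in for nf_delta; 2000.0 is far beyond
# what float64 can represent as a power of two (the limit is about 2 ** 1024).
nf_delta = numpy.array([10.0, 2000.0])

# Hypothetical weights standing in for A; the zero entry meets the inf.
A = numpy.array([1.0, 0.0])

# Silence the RuntimeWarnings so the demo output stays readable.
with numpy.errstate(over='ignore', invalid='ignore'):
    exp_nf_delta = 2 ** nf_delta          # second entry overflows to inf
    total = numpy.sum(exp_nf_delta * A)   # inf * 0.0 is nan, so the sum is nan

print(exp_nf_delta)
print(total)
```

This would explain why the log ends with a log likelihood of nan after the first iteration.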
I fixed it by following the discussion quoted below.
Hello Dmitry,
will this change affect the performance? Based on my test, the improvement between iterations drops a lot, comparing to GIS algorithm with default set. the accuracy could reach to 70% after three iterations using GIS, but only 58% after using the modified IIS.
On Monday, May 7, 2012 at 6:05:38 PM UTC-4, Dmitry Sergeev wrote: It seems that changing exp_nf_delta = 2 ** nf_delta (maxent.py line ~1350) to exp_nf_delta = 2 ** numpy.sqrt(nf_delta) do the trick.
So I edited the file as follows.
sudo vi /usr/local/lib/python2.7/site-packages/nltk/classify/maxent.py
maxent.py
.
.
.
for rangenum in range(MAX_NEWTON):
    nf_delta = numpy.outer(nfarray, deltas)
    #exp_nf_delta = 2 ** nf_delta  # original line
    exp_nf_delta = 2 ** numpy.sqrt(nf_delta)  # changed to this
    nf_exp_nf_delta = nftranspose * exp_nf_delta
    sum1 = numpy.sum(exp_nf_delta * A, axis=0)
    sum2 = numpy.sum(nf_exp_nf_delta * A, axis=0)
.
.
.
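To see why the square root avoids the overflow, here is a small sketch with made-up exponents (assumptions for illustration, not values from the real training run). The square root shrinks large exponents enough for the result to stay finite. It also changes the quantity being computed, so this is a workaround rather than an exact fix, which fits the slower convergence reported in the thread above. Note as well that `numpy.sqrt` returns `nan` for negative inputs.

```python
import numpy

# Hypothetical exponents; 2000.0 would overflow float64 as a power of two.
nf_delta = numpy.array([4.0, 2000.0])

with numpy.errstate(over='ignore'):
    original = 2 ** nf_delta             # second entry overflows to inf

# With the square root the exponents become 2.0 and roughly 44.7.
patched = 2 ** numpy.sqrt(nf_delta)

print(original)
print(patched)

# Caveat: the square root of a negative number is nan in NumPy.
with numpy.errstate(invalid='ignore'):
    negative_case = numpy.sqrt(numpy.array([-1.0]))
print(negative_case)
```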
I ran it again and training succeeded. There is not much information in Japanese, which makes this hard to study, but I want to learn natural language processing properly!