nltk.MaxentClassifier.train() throws an error

I had been reading Introductory Natural Language Processing smoothly until I got stuck on Chapter 6, Exercise 5. When I tried to classify documents with a maximum entropy classifier, it spewed out warnings and training made no progress.

# -*- coding: utf-8 -*-
#from __future__ import division
import nltk,re
import random
import numpy
#Categorize movie reviews as positive or negative

#data
from nltk.corpus import movie_reviews
documents = [(list(movie_reviews.words(fileid)),category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

#Feature extractor
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = all_words.keys()[:2000]  #The 2000 most frequent words (assumes an NLTK 2.x FreqDist, whose keys() are sorted by frequency)

def document_features(document):
  document_words = set(document)  #A set makes the membership tests below fast
  features = {}
  for w in word_features:
    features['contains(%s)' % w] = (w in document_words)  #Whether each of the top 2000 words is in the document
  return features

#Classifier training and testing
featuresets = [(document_features(d),c) for (d,c) in documents]
train_set,test_set = featuresets[100:],featuresets[:100]

#Maximum entropy classification
maxentclassifier = nltk.MaxentClassifier.train(train_set)

#test
print "MaxentClassifier"
print nltk.classify.accuracy(maxentclassifier,test_set)
maxentclassifier.show_most_informative_features(5)  #Prints its own output and returns None
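The script above is written for Python 2.7 and the NLTK release of that time. On Python 3, `all_words.keys()[:2000]` raises a TypeError, and in NLTK 3 `FreqDist` is a `collections.Counter` subclass whose keys are no longer sorted by frequency. The following is only a sketch of the same feature extractor written for that situation. It uses a plain `Counter` and an invented toy corpus as stand-ins for `nltk.FreqDist` and `movie_reviews`, so it runs without downloading anything.

```python
from collections import Counter

# Toy corpus standing in for movie_reviews; the words are invented for illustration.
documents = [
    (["great", "great", "film"], "pos"),
    (["dull", "film"], "neg"),
    (["film"], "neg"),
]

# Count every word, as nltk.FreqDist does in the original script.
all_words = Counter(w.lower() for doc, _ in documents for w in doc)

# most_common(n) returns (word, count) pairs sorted by descending count.
# The original uses the top 2000; 2 is enough for the toy corpus.
word_features = [w for w, _ in all_words.most_common(2)]

def document_features(document):
    document_words = set(document)  # set membership is fast
    return dict(("contains(%s)" % w, w in document_words) for w in word_features)

features = document_features(["dull", "film"])
print(features)
```

With the real corpus, `nltk.FreqDist(...)` can take the place of `Counter(...)` and `most_common(2000)` the place of `most_common(2)`.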

The output looks like this. After the first iteration, the log likelihood turns into nan:

  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.498
/usr/local/lib/python2.7/site-packages/nltk/classify/maxent.py:1332: RuntimeWarning: overflow encountered in power
  exp_nf_delta = 2 ** nf_delta
/usr/local/lib/python2.7/site-packages/nltk/classify/maxent.py:1334: RuntimeWarning: invalid value encountered in multiply
  sum1 = numpy.sum(exp_nf_delta * A, axis=0)
/usr/local/lib/python2.7/site-packages/nltk/classify/maxent.py:1335: RuntimeWarning: invalid value encountered in multiply
  sum2 = numpy.sum(nf_exp_nf_delta * A, axis=0)
/usr/local/lib/python2.7/site-packages/nltk/classify/maxent.py:1341: RuntimeWarning: invalid value encountered in divide
  deltas -= (ffreq_empirical - sum1) / -sum2
         Final               nan        0.502

Apparently, the computation `2 ** nf_delta` in maxent.py overflows under the default settings. I searched around, but little information turned up in Japanese, so I am leaving this note.
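To make the chain of warnings easier to follow, here is a small NumPy sketch of how one overflow can end in a nan log likelihood. The exponent values are hypothetical and only stand in for `nf_delta`. That the "invalid value encountered in multiply" warning comes from multiplying inf by a zero entry of `A` is my assumption, not something I confirmed in the NLTK source.

```python
import numpy as np

# Hypothetical exponents standing in for nf_delta; real values depend on the data.
nf_delta = np.array([10.0, 1000.0, 2000.0])

# Silence the RuntimeWarnings so the values themselves are easy to inspect.
with np.errstate(over="ignore", invalid="ignore"):
    exp_nf_delta = 2 ** nf_delta   # float64 cannot represent 2 ** 2000, so it becomes inf
    product = exp_nf_delta * 0.0   # inf * 0 is undefined, so NumPy yields nan
    total = np.sum(product)        # a single nan makes the whole sum nan

print(exp_nf_delta)
print(product)
print(total)
```

Once a nan enters `sum1` or `sum2`, every later update built from them is nan as well, which would match the `Final nan` row in the output above.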

I made the fix with reference to the following thread.

Hello Dmitry,

will this change affect the performance? Based on my test, the improvement between iterations drops a lot, comparing to GIS algorithm with default set. the accuracy could reach to 70% after three iterations using GIS, but only 58% after using the modified IIS.

On Monday, May 7, 2012 at 6:05:38 PM UTC-4, Dmitry Sergeev wrote:

It seems that changing exp_nf_delta = 2 ** nf_delta (maxent.py line ~1350) to exp_nf_delta = 2 ** numpy.sqrt(nf_delta) do the trick.
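The reply in this thread compares the result against GIS. For reference, here is a minimal sketch of selecting GIS through the `algorithm` argument of `nltk.MaxentClassifier.train`, which leaves the installed library untouched. The toy feature sets are invented for illustration; with the real data, `train_set` from the script would be passed instead. I have not verified in this note whether GIS avoids the overflow on the full movie review data.

```python
import nltk

# Toy feature sets, invented purely for illustration.
toy_train_set = [
    ({"contains(good)": True, "contains(bad)": False}, "pos"),
    ({"contains(good)": True, "contains(bad)": False}, "pos"),
    ({"contains(good)": False, "contains(bad)": True}, "neg"),
    ({"contains(good)": False, "contains(bad)": True}, "neg"),
]

# algorithm="GIS" selects Generalized Iterative Scaling instead of the default IIS.
# max_iter caps the number of iterations; trace=0 silences the progress table.
classifier = nltk.MaxentClassifier.train(
    toy_train_set, algorithm="GIS", trace=0, max_iter=10
)

label = classifier.classify({"contains(good)": True, "contains(bad)": False})
print(label)
```

According to the reply, GIS reached about 70% accuracy after three iterations on their data, against 58% for the modified IIS, so it may be worth trying before patching.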

So I opened maxent.py in an editor and applied that change:

sudo vi /usr/local/lib/python2.7/site-packages/nltk/classify/maxent.py

`maxent.py`


.
.
.

for rangenum in range(MAX_NEWTON):
    nf_delta = numpy.outer(nfarray, deltas)
    #exp_nf_delta = 2 ** nf_delta               #Original line
    exp_nf_delta = 2 ** numpy.sqrt(nf_delta)    #Changed to this
    nf_exp_nf_delta = nftranspose * exp_nf_delta
    sum1 = numpy.sum(exp_nf_delta * A, axis=0)
    sum2 = numpy.sum(nf_exp_nf_delta * A, axis=0)
.
.
.
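To see what this change does numerically, here is a small sketch comparing the two expressions on hypothetical `nf_delta` values. It suggests that the square root keeps large exponents finite, but also that the patched expression returns different numbers even where nothing overflowed, so the update is no longer the original IIS step. That may be related to the slower convergence reported in the reply. The negative input is also an assumption on my part; I did not check whether `nf_delta` goes negative in practice.

```python
import numpy as np

# Hypothetical values: a small exponent, a large one, and a negative one.
nf_delta = np.array([4.0, 2000.0, -4.0])

with np.errstate(over="ignore", invalid="ignore"):
    original = 2 ** nf_delta           # the line as shipped in maxent.py
    patched = 2 ** np.sqrt(nf_delta)   # the line after the change

print(original)
print(patched)
```

The large exponent is finite after the patch, the small one changes from 16 to 4, and the negative one becomes nan because the square root of a negative number is undefined for real floats.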

I ran it again and training completed. Little of this is documented in Japanese, which makes it hard to study, but I want to learn natural language processing properly!