Introduction

The most troublesome thing when checking the operation of Robinson's Bayesian Spam Filter written in PHP was the calculation of chi-square. After researching various things, a python program written by Robinson himself came out. I didn't like this, so I added some processing to check the calculation of php, and when I noticed, I made a class of Bayesian Filter.

I couldn't find the implementation of Robinson's Bayesian Spam Filter so much even if I googled it, so I decided to publish it for the time being. I'm new to python, so I'm sorry if I make a lot of mistakes. We welcome you. I hope it helps someone. By the way, I don't intend to claim the copyright, so please like it by boiling or baking.

Click here for a Japanese explanation of the Bayesian filter. http://akademeia.info/index.php?%A5%D9%A5%A4%A5%B8%A5%A2%A5%F3%A5%D5%A5%A3%A5%EB%A5%BF

program

""" Robinson's Spam filter program
This program is inspired by the following article
http://www.linuxjournal.com/article/6467?page=0,0
"""
import math

class RobinsonsBayes(object):
    """RobinsonsBayes
        This class only support calculation assuming you already have training set.
    """
    x = float(0.5) #possibility that first appeard word would be spam
    s = float(1)   #intensity of x
    def __init__(self,spam_doc_num,ham_doc_num):
        self.spam_doc_num = spam_doc_num
        self.ham_doc_num = ham_doc_num
        self.total_doc_num = spam_doc_num+ham_doc_num
        self.possibility_list = []

    def CalcProbabilityToBeSpam(self,num_in_spam_docs,num_in_ham_docs):
        degree_of_spam = float(num_in_spam_docs)/self.spam_doc_num;
        degree_of_ham  = float(num_in_ham_docs)/self.ham_doc_num;

        #p(w)
        probability = degree_of_spam/(degree_of_spam+degree_of_ham);

        #f(w)
        robinson_probability = ((self.x*self.s) + (self.total_doc_num*probability))/(self.s+self.total_doc_num)
        return robinson_probability

    def AddWord(self,num_in_spam_docs,num_in_ham_docs):
        probability = self.CalcProbabilityToBeSpam(num_in_spam_docs,num_in_ham_docs)
        self.possibility_list.append(probability)
        return probability

    #retrieved from
    #http://www.linuxjournal.com/files/linuxjournal.com/linuxjournal/articles/064/6467/6467s2.html
    def chi2P(self,chi, df):
        """Return prob(chisq >= chi, with df degrees of freedom).
        df must be even.
        """
        assert df & 1 == 0

        # XXX If chi is very large, exp(-m) will underflow to 0.
        m = chi / 2.0
        sum = term = math.exp(-m)
        for i in range(1, df//2):
            term *= m / i
            sum += term
        # With small chi and large df, accumulated
        # roundoff error, plus error in
        # the platform exp(), can cause this to spill
        # a few ULP above 1.0. For
        # example, chi2P(100, 300) on my box
        # has sum == 1.0 + 2.0**-52 at this
        # point.  Returning a value even a teensy
        # bit over 1.0 is no good.
        return min(sum, 1.0)

    def CalcNess(self,f,n):
        Ness = self.chi2P(-2*math.log(f),2*n)
        return Ness

    def CalcIndicator(self):
        fwpi_h=fwpi_s=1
        for fwi in self.possibility_list:
            fwpi_h *= fwi
            fwpi_s *= (1-fwi)

        H = self.CalcNess(fwpi_h,3)
        S = self.CalcNess(fwpi_s,3)

        #Notice that the bigger H(Hamminess) indicates that the document is more likely to be SPAM.
        I = (1+H-S)/2
        return I

if __name__ == '__main__':
    """
        This is a exapmple of checking if "I have a pen" is a spam.

        Following program assuming like:
            - We have 10 spam documents and 10 ham documents in our hand.
            - Number of "I" in spam documents is 1 and that of ham documents is 5
            - Number of "have" in spam documents is 2 and that of ham documents is 6
            - Number of "a" in spam documents is 1 and that of ham documents is 2
            - Number of "pen" in spam documents is 5 and that of ham documents is 1

        By the way, "I have a pen" is an sentence the most of Japanese learn in the first English class.
        Enjoy!
    """

    #init class by giving the number of document
    RobinsonsBayes = RobinsonsBayes(10,10)

    #Add train data of words one by one
    print "I   : "+str(RobinsonsBayes.AddWord(1,5))
    print "have: "+str(RobinsonsBayes.AddWord(2,6))
    print "a   : "+str(RobinsonsBayes.AddWord(1,2))
    print "pen : "+str(RobinsonsBayes.AddWord(5,1))

    #calculate Indicater
    print "I (probability to be spam)"
    print RobinsonsBayes.CalcIndicator()
    print ""

I implemented Robinson's Bayesian Spam Filter in python

Introduction

program