The most troublesome part of verifying my PHP implementation of Robinson's Bayesian spam filter was the chi-square calculation. After some searching, I found a Python program written by Robinson himself. Building on it, I kept adding code to check the PHP calculation, and before I knew it I had written a whole Bayesian filter class.
Googling did not turn up many implementations of Robinson's Bayesian spam filter, so I decided to publish this one for the time being. I'm new to Python, so please forgive any mistakes; corrections are welcome. I hope it helps someone. I don't intend to claim copyright, so feel free to do whatever you like with it.
A Japanese-language explanation of the Bayesian filter is available here: http://akademeia.info/index.php?%A5%D9%A5%A4%A5%B8%A5%A2%A5%F3%A5%D5%A5%A3%A5%EB%A5%BF
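The quantities the class computes, following the Linux Journal article it cites, can be summarized as below. The notation is an assumption made for this summary: b(w) and g(w) are the per-document frequencies of word w in spam and ham, N is the total number of training documents, n is the number of words being combined, and C^{-1}(chi, df) is the chi-square upper-tail probability returned by chi2P.

```latex
\begin{aligned}
b(w) &= \frac{\text{count of } w \text{ in spam}}{\text{number of spam documents}}, \qquad
g(w) = \frac{\text{count of } w \text{ in ham}}{\text{number of ham documents}} \\
p(w) &= \frac{b(w)}{b(w) + g(w)} \\
f(w) &= \frac{s\,x + N\,p(w)}{s + N} \\
H &= C^{-1}\!\left(-2 \ln \prod_{i=1}^{n} f(w_i),\; 2n\right) \\
S &= C^{-1}\!\left(-2 \ln \prod_{i=1}^{n} \bigl(1 - f(w_i)\bigr),\; 2n\right) \\
I &= \frac{1 + H - S}{2}
\end{aligned}
```

As a worked example, take "pen" with 10 spam and 10 ham documents, 5 occurrences in spam and 1 in ham: b = 0.5, g = 0.1, p = 0.5 / 0.6 ≈ 0.8333, and with x = 0.5, s = 1, N = 20, f = (0.5 + 20 × 0.8333) / 21 ≈ 0.8175. Note that the class weights p(w) by the total document count N, whereas Robinson's article uses the number of messages that contain the word.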
""" Robinson's Spam filter program
This program is inspired by the following article
http://www.linuxjournal.com/article/6467?page=0,0
"""
import math


class RobinsonsBayes(object):
    """RobinsonsBayes
    This class only supports the calculation and assumes you already have a training set.
    """
    x = 0.5  # assumed probability that a word seen for the first time is spam
    s = 1.0  # strength (weight) given to x

    def __init__(self, spam_doc_num, ham_doc_num):
        self.spam_doc_num = spam_doc_num
        self.ham_doc_num = ham_doc_num
        self.total_doc_num = spam_doc_num + ham_doc_num
        self.possibility_list = []

    def CalcProbabilityToBeSpam(self, num_in_spam_docs, num_in_ham_docs):
        degree_of_spam = float(num_in_spam_docs) / self.spam_doc_num
        degree_of_ham = float(num_in_ham_docs) / self.ham_doc_num
        # p(w)
        probability = degree_of_spam / (degree_of_spam + degree_of_ham)
        # f(w)
        robinson_probability = (
            (self.x * self.s) + (self.total_doc_num * probability)
        ) / (self.s + self.total_doc_num)
        return robinson_probability

    def AddWord(self, num_in_spam_docs, num_in_ham_docs):
        probability = self.CalcProbabilityToBeSpam(num_in_spam_docs, num_in_ham_docs)
        self.possibility_list.append(probability)
        return probability

    # retrieved from
    # http://www.linuxjournal.com/files/linuxjournal.com/linuxjournal/articles/064/6467/6467s2.html
    def chi2P(self, chi, df):
        """Return prob(chisq >= chi, with df degrees of freedom).
        df must be even.
        """
        assert df & 1 == 0
        # XXX If chi is very large, exp(-m) will underflow to 0.
        m = chi / 2.0
        sum = term = math.exp(-m)
        for i in range(1, df // 2):
            term *= m / i
            sum += term
        # With small chi and large df, accumulated roundoff error, plus error
        # in the platform exp(), can cause this to spill a few ULP above 1.0.
        # For example, chi2P(100, 300) on my box has sum == 1.0 + 2.0**-52 at
        # this point. Returning a value even a teensy bit over 1.0 is no good.
        return min(sum, 1.0)

    def CalcNess(self, f, n):
        Ness = self.chi2P(-2 * math.log(f), 2 * n)
        return Ness

    def CalcIndicator(self):
        fwpi_h = fwpi_s = 1
        for fwi in self.possibility_list:
            fwpi_h *= fwi
            fwpi_s *= (1 - fwi)
        # The chi-square test uses 2n degrees of freedom,
        # where n is the number of words added.
        n = len(self.possibility_list)
        H = self.CalcNess(fwpi_h, n)
        S = self.CalcNess(fwpi_s, n)
        # Note that a bigger H (hamminess) indicates that the document
        # is more likely to be SPAM.
        I = (1 + H - S) / 2
        return I


if __name__ == '__main__':
    """
    This is an example of checking whether "I have a pen" is spam.
    The program assumes the following:
    - We have 10 spam documents and 10 ham documents at hand.
    - "I" appears 1 time in the spam documents and 5 times in the ham documents
    - "have" appears 2 times in the spam documents and 6 times in the ham documents
    - "a" appears 1 time in the spam documents and 2 times in the ham documents
    - "pen" appears 5 times in the spam documents and 1 time in the ham documents
    By the way, "I have a pen" is a sentence most Japanese people learn in their first English class.
    Enjoy!
    """
    # initialize the class with the number of spam and ham documents
    robinson = RobinsonsBayes(10, 10)
    # add the training counts for each word, one by one
    print("I   : " + str(robinson.AddWord(1, 5)))
    print("have: " + str(robinson.AddWord(2, 6)))
    print("a   : " + str(robinson.AddWord(1, 2)))
    print("pen : " + str(robinson.AddWord(5, 1)))
    # calculate the indicator
    print("I (probability to be spam)")
    print(robinson.CalcIndicator())
    print("")
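CalcIndicator multiplies the word probabilities together before taking the logarithm, and chi2P starts from exp(-m). Both can underflow to 0.0 for long messages: with 2000 words at f(w) = 0.9, for example, the product of (1 - f(w)) is about 10^-2000, which is 0.0 in double precision, so math.log would fail. The following is a minimal sketch of one way to avoid this by working with logarithms throughout. The names chi2_survival and indicator_from_probs are hypothetical and not part of the class above; the series is the same one chi2P sums, only evaluated in the log domain.

```python
import math


def chi2_survival(chi, df):
    """Return P(X >= chi) for a chi-square X with df degrees of freedom (df even, >= 2)."""
    assert df >= 2 and df % 2 == 0
    m = chi / 2.0
    if m <= 0.0:
        return 1.0
    # log of each series term exp(-m) * m**i / i!, for i = 0 .. df/2 - 1
    log_terms = [-m + i * math.log(m) - math.lgamma(i + 1) for i in range(df // 2)]
    largest = max(log_terms)
    # log-sum-exp: after factoring out the largest term the sum is at least 1,
    # so it never collapses to 0 the way a running sum starting at exp(-m) can
    scaled_sum = sum(math.exp(t - largest) for t in log_terms)
    return min(math.exp(largest + math.log(scaled_sum)), 1.0)


def indicator_from_probs(probs):
    """Combine word probabilities f(w), each strictly between 0 and 1, into I."""
    n = len(probs)
    if n == 0:
        return 0.5  # no words, no evidence either way
    # Sum logarithms instead of multiplying, so long lists do not underflow.
    log_prod_f = sum(math.log(p) for p in probs)
    log_prod_not_f = sum(math.log(1.0 - p) for p in probs)
    h = chi2_survival(-2.0 * log_prod_f, 2 * n)
    s = chi2_survival(-2.0 * log_prod_not_f, 2 * n)
    return (1.0 + h - s) / 2.0


# 2000 strongly spammy words: the plain product (1 - 0.9) ** 2000 is 0.0 in floats
print(indicator_from_probs([0.9] * 2000))
```

Probabilities of exactly 0 or 1 would still make math.log fail here, but f(w) as computed by the class stays strictly between 0 and 1 with the defaults x = 0.5 and s = 1.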