Introduction

The most troublesome part of checking the behavior of a Robinson Bayesian spam filter written in PHP was the chi-square calculation. After some searching, I found a Python program written by Robinson himself. I added some processing to it so I could check the PHP calculation, and before I knew it I had written a Bayesian filter class.
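
As a quick sanity check on the chi-square step, the tail probability for an even number of degrees of freedom can be compared against its closed forms for small df. The following is only a minimal sketch and not part of the original program: the function name `chi2_tail_even` and the test value `chi = 3.0` are made up for illustration.

```python
import math

def chi2_tail_even(chi, df):
    """P(chisq >= chi) for an even number of degrees of freedom df."""
    assert df % 2 == 0 and df >= 2
    m = chi / 2.0
    total = term = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

# Closed forms used as reference values (m = chi / 2):
#   df = 2: exp(-m)
#   df = 4: exp(-m) * (1 + m)
chi = 3.0  # arbitrary test value, chosen only for this example
m = chi / 2.0
diff_df2 = abs(chi2_tail_even(chi, 2) - math.exp(-m))
diff_df4 = abs(chi2_tail_even(chi, 4) - math.exp(-m) * (1 + m))
print(diff_df2 < 1e-12, diff_df4 < 1e-12)
```

If both comparisons hold, the same inputs can be fed to the PHP version and the outputs compared digit by digit.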

I could not find many implementations of Robinson's Bayesian spam filter, even after searching on Google, so I decided to publish this one for now. I am new to Python, so I apologize if there are many mistakes. Corrections are welcome. I hope this helps someone. By the way, I do not intend to claim copyright, so feel free to do whatever you like with it.

A Japanese explanation of the Bayesian filter is available here: http://akademeia.info/index.php?%A5%D9%A5%A4%A5%B8%A5%A2%A5%F3%A5%D5%A5%A3%A5%EB%A5%BF

Program

""" Robinson's Spam filter program
This program is inspired by the following article
http://www.linuxjournal.com/article/6467?page=0,0
"""
import math

class RobinsonsBayes(object):
    """RobinsonsBayes
        This class only does the calculation; it assumes you already have a training set.
    """
    x = float(0.5) #assumed probability that a word seen for the first time is spam
    s = float(1)   #strength (weight) given to x
    def __init__(self,spam_doc_num,ham_doc_num):
        self.spam_doc_num = spam_doc_num
        self.ham_doc_num = ham_doc_num
        self.total_doc_num = spam_doc_num+ham_doc_num
        self.possibility_list = []

    def CalcProbabilityToBeSpam(self,num_in_spam_docs,num_in_ham_docs):
        #n: number of documents that contain the word
        n = num_in_spam_docs+num_in_ham_docs
        if n == 0:
            #no data for this word, so fall back to the prior x
            return self.x

        degree_of_spam = float(num_in_spam_docs)/self.spam_doc_num
        degree_of_ham  = float(num_in_ham_docs)/self.ham_doc_num

        #p(w)
        probability = degree_of_spam/(degree_of_spam+degree_of_ham)

        #f(w) = (s*x + n*p(w))/(s+n)
        robinson_probability = ((self.x*self.s) + (n*probability))/(self.s+n)
        return robinson_probability

    def AddWord(self,num_in_spam_docs,num_in_ham_docs):
        probability = self.CalcProbabilityToBeSpam(num_in_spam_docs,num_in_ham_docs)
        self.possibility_list.append(probability)
        return probability

    #retrieved from
    #http://www.linuxjournal.com/files/linuxjournal.com/linuxjournal/articles/064/6467/6467s2.html
    def chi2P(self,chi, df):
        """Return prob(chisq >= chi, with df degrees of freedom).
        df must be even.
        """
        assert df & 1 == 0

        # XXX If chi is very large, exp(-m) will underflow to 0.
        m = chi / 2.0
        sum = term = math.exp(-m)
        for i in range(1, df//2):
            term *= m / i
            sum += term
        # With small chi and large df, accumulated
        # roundoff error, plus error in
        # the platform exp(), can cause this to spill
        # a few ULP above 1.0. For
        # example, chi2P(100, 300) on my box
        # has sum == 1.0 + 2.0**-52 at this
        # point.  Returning a value even a teensy
        # bit over 1.0 is no good.
        return min(sum, 1.0)

    def CalcNess(self,f,n):
        Ness = self.chi2P(-2*math.log(f),2*n)
        return Ness

    def CalcIndicator(self):
        #n: number of words added so far; the chi-square test uses 2n degrees of freedom
        n = len(self.possibility_list)
        fwpi_h=fwpi_s=1.0
        for fwi in self.possibility_list:
            fwpi_h *= fwi
            fwpi_s *= (1-fwi)

        H = self.CalcNess(fwpi_h,n)
        S = self.CalcNess(fwpi_s,n)

        #Note: as computed here, a larger H (Hamminess) means the document is more likely to be SPAM.
        I = (1+H-S)/2
        return I

if __name__ == '__main__':
    """
        This is an example of checking whether "I have a pen" is spam.

        The following program assumes that:
            - We have 10 spam documents and 10 ham documents on hand.
            - The number of "I" in spam documents is 1 and in ham documents is 5
            - The number of "have" in spam documents is 2 and in ham documents is 6
            - The number of "a" in spam documents is 1 and in ham documents is 2
            - The number of "pen" in spam documents is 5 and in ham documents is 1

        By the way, "I have a pen" is a sentence that most Japanese people learn in their first English class.
        Enjoy!
    """

    #initialize the class with the number of spam and ham documents
    #(the instance gets its own name so it does not shadow the class)
    bayes = RobinsonsBayes(10,10)

    #add the training counts for each word, one by one
    print("I   : "+str(bayes.AddWord(1,5)))
    print("have: "+str(bayes.AddWord(2,6)))
    print("a   : "+str(bayes.AddWord(1,2)))
    print("pen : "+str(bayes.AddWord(5,1)))

    #calculate the indicator
    print("I (probability to be spam)")
    print(bayes.CalcIndicator())
    print("")
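
The products built up in `CalcIndicator` shrink quickly as more words are added. For a long document they can underflow to 0, at which point `math.log` raises an error. One common workaround is to sum logarithms instead of multiplying. The following is only a minimal sketch of that idea and not part of the program above: `chi2_tail_even` and `indicator_from_logs` are names made up for this example, and it assumes every f(w) lies strictly between 0 and 1 (which holds for Robinson's f(w) when s > 0 and 0 < x < 1).

```python
import math

def chi2_tail_even(chi, df):
    """P(chisq >= chi) for even df (same series as chi2P above)."""
    assert df % 2 == 0
    m = chi / 2.0
    total = term = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def indicator_from_logs(f_values):
    """Compute I = (1 + H - S) / 2 from a list of f(w) values.

    Assumes 0 < f < 1 for every value (hypothetical helper).
    """
    n = len(f_values)
    # Sum of logs equals the log of the product, without the underflow.
    log_h = sum(math.log(f) for f in f_values)
    log_s = sum(math.log(1.0 - f) for f in f_values)
    H = chi2_tail_even(-2.0 * log_h, 2 * n)
    S = chi2_tail_even(-2.0 * log_s, 2 * n)
    return (1.0 + H - S) / 2.0

spammy = indicator_from_logs([0.9, 0.9, 0.9])
hammy = indicator_from_logs([0.1, 0.1, 0.1])
neutral = indicator_from_logs([0.5, 0.5])
print(spammy, hammy, neutral)
```

This only removes the underflow in the product. As the comment in `chi2P` notes, `exp(-m)` itself can still underflow to 0 when chi is very large.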

I tried implementing Robinson's Bayesian spam filter in Python
