**This is the day 1 article of the Escapism Advent Calendar 2013.**
The third installment of gihyo.jp's "Let's get started with machine learning" series, "Implementing a Bayesian filter", introduced naive Bayes.

However, the code in that article is a bit confusing. For example, it has several traps, such as a variable with the singular name `word` whose type is actually a list. I also spotted what looks like a common mistake in List 7 on page 3, where the comparison operator is used the wrong way around.

So I reimplemented the Bayesian filter introduced in that article:

- in Python 3.3
- using MeCab instead of the Yahoo! Developer Network's Japanese morphological analysis API
- with more readable code

In a follow-up post, I use the Bayesian filter built here to train on and classify web pages fetched with the Bing API: http://qiita.com/katryo/items/62291ba328de9d12bd30
naive_bayes.py

```python
# coding: utf-8
# The Bayesian filter from http://gihyo.jp/dev/serial/01/machine-learning/0003,
# reimplemented in Python 3.3 with readability in mind.
import math
import sys

import MeCab


class NaiveBayes:
    def __init__(self):
        self.vocabularies = set()
        self.word_count = {}      # e.g. {'Measures against pollinosis': {'Cedar pollen': 4, 'medicine': 2, ...}}
        self.category_count = {}  # e.g. {'Measures against pollinosis': 16, ...}

    def to_words(self, sentence):
        """
        input:  a raw sentence string
        output: a tuple of the base forms of the words in the sentence
        """
        tagger = MeCab.Tagger('mecabrc')  # a different Tagger configuration also works
        mecab_result = tagger.parse(sentence)
        info_of_words = mecab_result.split('\n')
        words = []
        for info in info_of_words:
            # MeCab ends its output with 'EOS' followed by an empty string
            if info == 'EOS' or info == '':
                break
            # info => 'な\t助詞,終助詞,*,*,*,*,な,ナ,ナ'
            info_elems = info.split(',')
            # Index 6 holds the base (uninflected) form of the word.
            # If it is '*', fall back to the surface form at index 0.
            if info_elems[6] == '*':
                # info_elems[0] => 'ヴァンロッサム\t名詞'; [:-3] strips the trailing tab + part-of-speech tag
                words.append(info_elems[0][:-3])
                continue
            words.append(info_elems[6])
        return tuple(words)

    def word_count_up(self, word, category):
        self.word_count.setdefault(category, {})
        self.word_count[category].setdefault(word, 0)
        self.word_count[category][word] += 1
        self.vocabularies.add(word)

    def category_count_up(self, category):
        self.category_count.setdefault(category, 0)
        self.category_count[category] += 1

    def train(self, doc, category):
        words = self.to_words(doc)
        for word in words:
            self.word_count_up(word, category)
        self.category_count_up(category)

    def prior_prob(self, category):
        num_of_categories = sum(self.category_count.values())
        num_of_docs_of_the_category = self.category_count[category]
        return num_of_docs_of_the_category / num_of_categories

    def num_of_appearance(self, word, category):
        if word in self.word_count[category]:
            return self.word_count[category][word]
        return 0

    def word_prob(self, word, category):
        # Bayes' theorem calculation. The result is usually a decimal very close to 0.
        numerator = self.num_of_appearance(word, category) + 1  # the +1 is Laplace (additive) smoothing
        denominator = sum(self.word_count[category].values()) + len(self.vocabularies)
        # In Python 3, / is always float division
        prob = numerator / denominator
        return prob

    def score(self, words, category):
        # Take the log because word_prob returns tiny decimals like 0.000...01
        score = math.log(self.prior_prob(category))
        for word in words:
            score += math.log(self.word_prob(word, category))
        return score

    # Without the log, the value can be too small and underflow.
    def score_without_log(self, words, category):
        score = self.prior_prob(category)
        for word in words:
            score *= self.word_prob(word, category)
        return score

    def classify(self, doc):
        best_guessed_category = None
        max_prob_before = -sys.maxsize
        words = self.to_words(doc)
        for category in self.category_count.keys():
            prob = self.score(words, category)
            if prob > max_prob_before:
                max_prob_before = prob
                best_guessed_category = category
        return best_guessed_category


if __name__ == '__main__':
    nb = NaiveBayes()
    nb.train('''Python is an open source programming language created by the Dutch programmer Guido van Rossum.
It is a kind of object-oriented scripting language and, along with Perl, is widely used in Europe and the United States. It is named after "Monty Python's Flying Circus", a comedy show produced by the British broadcaster BBC.
Python means the reptile python in English, and the snake is sometimes used as a mascot or icon for the Python language. Python is a general-purpose high-level language. Designed with programmer productivity and code reliability in mind, it has a large, convenient standard library while keeping its core syntax and semantics to a minimum.
It supports string operations using Unicode, so Japanese text can be processed out of the box. It runs on many platforms, and its abundant documentation and libraries mean its use is growing in industry as well.
''',
             'Python')
    nb.train('''Snake is a general term for reptiles classified in the suborder Serpentes of the order Squamata.
They are characterized by their slender bodies and lack of limbs. However, animals of similar shape also exist in other groups.
''', 'Snake')
    nb.train('''Ruby is an object-oriented scripting language developed by Yukihiro Matsumoto (commonly known as Matz).
It brings object-oriented programming to the areas where scripting languages such as Perl have traditionally been used.
Ruby was originally born on February 24, 1993, and was announced on fj in December 1995.
The name Ruby comes from the fact that Perl is pronounced the same as pearl, the birthstone for June;
it was named after the ruby, the birthstone (July) of a colleague of Matsumoto.
''',
             'Ruby')
    nb.train('''A ruby is a variety of corundum. It is a gem characterized by its red color.
Natural rubies are produced in Asia and cannot be obtained in Europe or the United States.
Even in the producing regions, the places where stones beautiful enough to be used as gems can be obtained are extremely limited.
Large stones over 3 carats are rarely produced.
''', 'Gem')

    doc = 'Open source made by Guido van Rossum'
    print('%s => Estimated category: %s' % (doc, nb.classify(doc)))  # Estimated category should be Python
    print('Probability of being in the Python category: %s' % nb.score_without_log(['Guido van Rossum', 'But', 'Had made'], 'Python'))
    print('Probability of being in the Ruby category: %s' % nb.score_without_log(['Guido van Rossum', 'But', 'Had made'], 'Ruby'))

    doc = 'The programming language Ruby is a pure object-oriented language.'
    print('%s => Estimated category: %s' % (doc, nb.classify(doc)))  # Estimated category should be Ruby
    print('Probability of being in the Ruby category: %s' % nb.score_without_log(['Programming language', 'of', 'Ruby', 'Is', 'pure', 'Nana', 'Object-oriented language', 'is', '。'], 'Ruby'))
    print('Probability of being in the Python category: %s' % nb.score_without_log(['Programming language', 'of', 'Ruby', 'Is', 'pure', 'Nana', 'Object-oriented language', 'is', '。'], 'Python'))

    doc = 'corundum'
    print('%s => Estimated category: %s' % (doc, nb.classify(doc)))  # Estimated category should be Gem
    print('Probability of being in the Gem category: %s' % nb.score_without_log(['corundum'], 'Gem'))
    print('Probability of being in the Ruby category: %s' % nb.score_without_log(['corundum'], 'Ruby'))
```
Let me explain what this code does, split into training (`train`) and classification (`classify`).
NaiveBayes's `train(self, doc, category)` method takes a string `doc` and the category that string belongs to, and teaches the NaiveBayes object that documents containing the words appearing in `doc` should be classified into that category.
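As a minimal sketch of what `train` does to the internal counters (the document and its tokens here are hypothetical, since the real tokenization depends on MeCab's dictionary):

```python
nb = NaiveBayes()

# Suppose to_words() splits the document into the two tokens
# ('Cedar pollen', 'medicine') (hypothetical tokens, for illustration only).
nb.train('Cedar pollen medicine', 'Measures against pollinosis')

# The counters would then look like this:
# nb.word_count     == {'Measures against pollinosis': {'Cedar pollen': 1, 'medicine': 1}}
# nb.category_count == {'Measures against pollinosis': 1}
# nb.vocabularies   == {'Cedar pollen', 'medicine'}
```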
The `classify` method, in summary, asks: of all the categories seen during the training phase so far, which is the most plausible one for the input text to belong to?

That is, it uses the `score` method to compute a probability for each category (taking the logarithm to make the computation tractable). The category with the highest score among all categories is judged to be the most likely, and that category is the final output. Laplace (additive) smoothing is applied when computing the word probabilities.
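Written out as formulas, this is just what `prior_prob`, `word_prob`, and `score` compute, where $N_C$ is the number of documents trained for category $C$, $c(w, C)$ is the count of word $w$ in category $C$, and $V$ is the whole vocabulary:

```math
P(C) = \frac{N_C}{\sum_{C'} N_{C'}}, \qquad
P(w \mid C) = \frac{c(w, C) + 1}{\sum_{w'} c(w', C) + |V|}
```

```math
\mathrm{score}(C, w_1 \dots w_n) = \log P(C) + \sum_{i=1}^{n} \log P(w_i \mid C)
```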
Why does the `score` function take the logarithm of the probabilities computed by `prior_prob` and `word_prob`?

As the original gihyo.jp article states, the probability computed from Bayes' theorem is a value very close to 0. The total number of words appearing in the documents is enormous, and the number of times any particular word (such as "Python", "pure", or "made") appears is tiny in comparison. So we take the natural logarithm, which lets us keep computing even with values extremely close to 0.
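Here is a small, self-contained demonstration of the underflow that taking the log avoids (the value 1e-5 is an arbitrary stand-in for a typical `word_prob` result):

```python
import math

p = 1e-5  # arbitrary stand-in for a typical word_prob value

# Multiplying 100 such probabilities underflows 64-bit floats to exactly 0.0
# (the true value, 1e-500, is far below the smallest representable float):
prob = 1.0
for _ in range(100):
    prob *= p
print(prob)  # 0.0

# Summing the logs instead stays comfortably representable:
log_prob = sum(math.log(p) for _ in range(100))
print(log_prob)  # about -1151.29
```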
So what happens if we compute that probability without taking the log? Let's try it.
```python
    def score(self, words, category):
        # Take the log because word_prob returns tiny decimals like 0.000...01
        score = math.log(self.prior_prob(category))
        for word in words:
            score += math.log(self.word_prob(word, category))
        return score

    # Without the log, the value can be too small and underflow.
    def score_without_log(self, words, category):
        score = self.prior_prob(category)
        for word in words:
            score *= self.word_prob(word, category)
        return score
```
`score_without_log` is a version that computes the probability without taking the logarithm. Since log(a * b * c) = log(a) + log(b) + log(c), the `score` method repeatedly *adds* the result of `word_prob` to `score` inside the `for word in words` loop, while the `score_without_log` method repeatedly *multiplies* it.
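A quick numeric check of that identity, with arbitrary example values:

```python
import math

a, b, c = 0.2, 0.5, 0.01
print(math.log(a * b * c))                      # -6.907755...
print(math.log(a) + math.log(b) + math.log(c))  # same value, up to float rounding
```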
If you run the code in naive_bayes.py as it is, you can see the effect of taking the log.

By the way, this part of the code trains four categories: Python, Snake, Ruby, and Gem.
naive_bayes.py

```python
if __name__ == '__main__':
    nb = NaiveBayes()
    nb.train('''Python is an open source programming language created by the Dutch programmer Guido van Rossum.
It is a kind of object-oriented scripting language and, along with Perl, is widely used in Europe and the United States. It is named after "Monty Python's Flying Circus", a comedy show produced by the British broadcaster BBC.
Python means the reptile python in English, and the snake is sometimes used as a mascot or icon for the Python language. Python is a general-purpose high-level language. Designed with programmer productivity and code reliability in mind, it has a large, convenient standard library while keeping its core syntax and semantics to a minimum.
It supports string operations using Unicode, so Japanese text can be processed out of the box. It runs on many platforms, and its abundant documentation and libraries mean its use is growing in industry as well.
''',
             'Python')
    nb.train('''Snake is a general term for reptiles classified in the suborder Serpentes of the order Squamata.
They are characterized by their slender bodies and lack of limbs. However, animals of similar shape also exist in other groups.
''', 'Snake')
    nb.train('''Ruby is an object-oriented scripting language developed by Yukihiro Matsumoto (commonly known as Matz).
It brings object-oriented programming to the areas where scripting languages such as Perl have traditionally been used.
Ruby was originally born on February 24, 1993, and was announced on fj in December 1995.
The name Ruby comes from the fact that Perl is pronounced the same as pearl, the birthstone for June;
it was named after the ruby, the birthstone (July) of a colleague of Matsumoto.
''',
             'Ruby')
    nb.train('''A ruby is a variety of corundum. It is a gem characterized by its red color.
Natural rubies are produced in Asia and cannot be obtained in Europe or the United States.
Even in the producing regions, the places where stones beautiful enough to be used as gems can be obtained are extremely limited.
Large stones over 3 carats are rarely produced.
''', 'Gem')

    doc = 'Open source made by Guido van Rossum'
    print('%s => Estimated category: %s' % (doc, nb.classify(doc)))  # Estimated category should be Python
    print('Probability of being in the Python category: %s' % nb.score_without_log(['Guido van Rossum', 'But', 'Had made'], 'Python'))
    print('Probability of being in the Ruby category: %s' % nb.score_without_log(['Guido van Rossum', 'But', 'Had made'], 'Ruby'))

    doc = 'The programming language Ruby is a pure object-oriented language.'
    print('%s => Estimated category: %s' % (doc, nb.classify(doc)))  # Estimated category should be Ruby
    print('Probability of being in the Ruby category: %s' % nb.score_without_log(['Programming language', 'of', 'Ruby', 'Is', 'pure', 'Nana', 'Object-oriented language', 'is', '。'], 'Ruby'))
    print('Probability of being in the Python category: %s' % nb.score_without_log(['Programming language', 'of', 'Ruby', 'Is', 'pure', 'Nana', 'Object-oriented language', 'is', '。'], 'Python'))

    doc = 'corundum'
    print('%s => Estimated category: %s' % (doc, nb.classify(doc)))  # Estimated category should be Gem
    print('Probability of being in the Gem category: %s' % nb.score_without_log(['corundum'], 'Gem'))
    print('Probability of being in the Ruby category: %s' % nb.score_without_log(['corundum'], 'Ruby'))
```
This is the execution result:

```
Open source made by Guido van Rossum => Estimated category: Python
Probability of being in the Python category: 5.05740150710565e-08
Probability of being in the Ruby category: 2.592066486298008e-08
The programming language Ruby is a pure object-oriented language. => Estimated category: Ruby
Probability of being in the Ruby category: 1.0568043348436783e-20
Probability of being in the Python category: 1.4013428667584096e-21
corundum => Estimated category: Gem
Probability of being in the Gem category: 0.0018248175182481751
Probability of being in the Ruby category: 0.0008143322475570033
```
**1.0568043348436783e-20** …… (゚Д゚;)

That is on the order of 10 to the minus 20th power, a number extremely close to 0.

The longer the input string, the smaller the probability of it falling into any particular category. Conversely, the last example shortens the input to the single word "corundum" and computes the probability of it belonging to a specific category. Even so, the probability of the most likely category, "Gem", is only **0.0018248175182481751**, about 0.18%. That is far larger than the probability of the "Ruby" category, 0.0008143322475570033, but it is still a very small number, and you can see why such values are hard to compute directly.
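For comparison, the log scores that `classify` actually works with are much easier to handle. Plugging in the probabilities from the run above:

```python
import math

ruby_prob = 1.0568043348436783e-20    # from the run above
python_prob = 1.4013428667584096e-21  # from the run above

# On the log scale these become modest, easily comparable negative numbers:
print(math.log(ruby_prob))    # about -46.0
print(math.log(python_prob))  # about -48.0
```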
A Japanese article about naive Bayes: http://aidiary.hatenablog.com/entry/20100613/1276389337

Using the Bayesian filter created in this article, I trained on and classified web pages fetched with the Bing API. The sequel is here: http://qiita.com/katryo/items/62291ba328de9d12bd30