Machine learning beginners try to reach out to Naive Bayes (2) --Implementation

This page

Last time, in Machine learning beginners try to reach out to Naive Bayes (1) --Theory, I went over the theory (the math) behind naive Bayes.

On this page, I'll try a simple implementation in Python. As usual, it is modeled on the site I referred to, so please have a look there as well if you are interested. By the way, I have hardly written any Python before, so please watch with warm eyes ...

The sample code is at the bottom.

A little review

The naive Bayes score can be expressed as follows.

\log P(cat|doc) \propto \log P(cat) + \sum_{i=1}^{k} \log P(word_i|cat)

I can't quite see the program in this yet ... so I'll first organize what we need.

< P(cat) > ... the prior probability of a category, estimated from how many training documents carry that label

< P(word|cat) > ... the probability of a word given a category, estimated from word counts within each category
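To make the first ingredient concrete, here is a minimal sketch (using the same data layout as the training data at the end of this article, where each document is a list whose first element is the label): P(cat) is just the fraction of documents carrying that label.

```python
from collections import Counter

# Each document: [label, word, word, ...] -- same layout as the
# training data used later in this article.
data = [["yes", "Chinese", "Beijing", "Chinese"],
        ["yes", "Chinese", "Chinese", "Shanghai"],
        ["yes", "Chinese", "Macao"],
        ["no", "Tokyo", "Japan", "Chinese"]]

# P(cat) = (documents labeled cat) / (total documents)
category_count = Counter(d[0] for d in data)
total = sum(category_count.values())
priors = {cat: n / total for cat, n in category_count.items()}
print(priors)  # {'yes': 0.75, 'no': 0.25}
```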

Implementation!

Policy

nb = NaiveBayes()
nb.train(data)

I want to use it as above, so I'll design it as a class. Declaring everything we need up front, the code through __init__ looks like this.

class NaiveBayes:

    def __init__(self):
        self.vocabularies    = set() #Set of distinct words (the vocabulary)
        self.categories      = set() #Set of distinct categories
        self.category_count  = {}    #Number of documents per category, category_count[cat]
        self.word_count      = {}    #Word occurrence counts per category, word_count[cat][word]
        ...

Training

def train(self, data):
    self.category_count = defaultdict(lambda: 0)

    for d in data:
        category = d[0]
        self.categories.add(category) #Add category
        self.category_count[category] += 1

        for word in d[1:]: #Add vocabulary
            self.vocabularies.add(word)

    for category in self.categories:
        self.word_count[category]  = defaultdict(lambda: 0)

    for d in data:
        category = d[0]
        for word in d[1:]:
            self.word_count[category][word] += 1

That's all the preparation we need. One thing worth noting:

from collections import defaultdict

counts = {}                      #A plain dict raises KeyError for a missing key
counts = defaultdict(lambda: 0)  #A defaultdict returns 0 instead

With defaultdict you can specify the value to use when a key is not present yet. As for Laplace smoothing, I simply add 1 in the probability calculation below.
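A quick illustration of the behavior (defaultdict(int) would work just as well as the lambda):

```python
from collections import defaultdict

counts = defaultdict(lambda: 0)

# Incrementing a never-seen key works: the missing key is created
# with the default value 0 first, then incremented.
counts["Chinese"] += 1
counts["Chinese"] += 1
print(counts["Chinese"])  # 2
print(counts["Tokyo"])    # 0 (unseen key, default value)

# With a plain dict, counts["Tokyo"] += 1 on an unseen key
# would raise KeyError instead.
```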

P(word|cat)

\sum_{i=1}^{k} \log P(word_i|cat)

Define a function called word_probability to compute each term of this sum. The formula for it was:

P(word_i|cat) = \frac{count(word_i, cat) + 1}{\sum_{w \in V} count(w, cat) + |V|}

That is, the number of times word_i appears in category cat plus 1, divided by the total number of words appearing in cat plus the vocabulary size.

Here we apply Laplace smoothing: the +1 in the numerator keeps unseen words from getting zero probability.

def word_probability(self, word, category):
        '''Probability of a word given a category, P(word|cat)'''
        #Apply Laplace Smoothing
        word_count       = self.word_count[category][word] + 1
        vocabulary_count = sum(self.word_count[category].values()) + len(self.vocabularies)
        return float(word_count) / float(vocabulary_count)
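As a hand check on the toy data used in the complete code below (the three "yes" documents contain 8 words in total, "Chinese" appears 5 times among them, and the full vocabulary has 6 distinct words), the smoothed estimate works out like this:

```python
# Counts taken from the "yes" documents of the toy training data
chinese_in_yes = 5   # occurrences of "Chinese" in "yes" documents
words_in_yes   = 8   # total words across all "yes" documents
vocab_size     = 6   # {Chinese, Beijing, Shanghai, Macao, Tokyo, Japan}

p_seen = (chinese_in_yes + 1) / (words_in_yes + vocab_size)
print(p_seen)  # 6/14 = 3/7 ~ 0.4286

# Thanks to the +1, a word never seen under "yes" (e.g. "Tokyo")
# still gets a small non-zero probability instead of 0:
p_unseen = (0 + 1) / (words_in_yes + vocab_size)
print(p_unseen)  # 1/14 ~ 0.0714
```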

Calculate the score for a sentence

Calculate P (cat | doc).

\log P(cat|doc) \propto \log P(cat) + \sum_{i=1}^{k} \log P(word_i|cat)

Given a document, create a function that computes this score using the function above, since we need to combine the probabilities of every word. Note that the log of a product is the sum of the logs, so we take a sum, not a total product. Also, math.log is the natural logarithm and every probability is at most 1, so each term, and hence the score, comes out negative. That is fine: we simply pick the category whose score is largest (closest to zero).

First, put P(cat) into score, then add the P(word|cat) terms in a loop.

def score(self, words, category):
        '''Log probability of the category given a document (list of words), log P(cat|doc)'''
        documents_count = sum(self.category_count.values())
        score = math.log(float(self.category_count[category]) / documents_count)

        for word in words:
            score += math.log(self.word_probability(word, category))

        #Probabilities are at most 1, so the score is negative; larger (closer to 0) is better
        return score

Classify

Well, it's finally time to classify. In practice, you will just call this function after training.

def classify(self, words):
        '''Return the category with the largest P(cat|doc)'''
        best  = None
        value = float("-inf") #Scores are negative, so start from minus infinity

        for category in self.categories:
            v = self.score(words, category)
            if v > value:
                best  = category
                value = v
        return best

The structure is simple: it computes the score for every category seen during training and returns the one with the highest score.

Complete code

#coding:utf-8

from collections import defaultdict
import math

class NaiveBayes:

    def __init__(self):
        self.vocabularies    = set() #Set of distinct words (the vocabulary)
        self.categories      = set() #Set of distinct categories
        self.category_count  = {}    #Number of documents per category, category_count[cat]
        self.word_count      = {}    #Word occurrence counts per category, word_count[cat][word]


    def train(self, data):
        self.category_count = defaultdict(lambda: 0)

        for d in data:
            category = d[0]
            self.categories.add(category) #Add category
            self.category_count[category] += 1

            for word in d[1:]: #Add vocabulary
                self.vocabularies.add(word)

        for category in self.categories:
            self.word_count[category]  = defaultdict(lambda: 0)

        for d in data:
            category = d[0]
            for word in d[1:]:
                self.word_count[category][word] += 1


    def word_probability(self, word, category):
        '''Probability of a word given a category, P(word|cat)'''
        #Apply Laplace Smoothing
        word_count       = self.word_count[category][word] + 1
        vocabulary_count = sum(self.word_count[category].values()) + len(self.vocabularies)
        return float(word_count) / float(vocabulary_count)


    def score(self, words, category):
        '''Log probability of the category given a document (list of words), log P(cat|doc)'''
        documents_count = sum(self.category_count.values())
        score = math.log(float(self.category_count[category]) / documents_count)

        for word in words:
            score += math.log(self.word_probability(word, category))

        #Probabilities are at most 1, so the score is negative; larger (closer to 0) is better
        return score


    def classify(self, words):
        '''Return the category with the largest P(cat|doc)'''
        best  = None
        value = float("-inf") #Scores are negative, so start from minus infinity

        for category in self.categories:
            v = self.score(words, category)
            if v > value:
                best  = category
                value = v
        return best


if __name__ == "__main__":
    data = [["yes", "Chinese", "Beijing", "Chinese"],
            ["yes", "Chinese", "Chinese", "Shanghai"],
            ["yes", "Chinese", "Macao"],
            ["no", "Tokyo", "Japan", "Chinese"]]

    # Train the naive Bayes classifier
    nb = NaiveBayes()
    nb.train(data)

    print("P(Chinese|yes) =", nb.word_probability("Chinese", "yes"))
    print("P(Tokyo|yes) =", nb.word_probability("Tokyo", "yes"))
    print("P(Japan|yes) =", nb.word_probability("Japan", "yes"))
    print("P(Chinese|no) =", nb.word_probability("Chinese", "no"))
    print("P(Tokyo|no) =", nb.word_probability("Tokyo", "no"))
    print("P(Japan|no) =", nb.word_probability("Japan", "no"))

    # Predict the category of the test document
    test = ["Chinese", "Chinese", "Chinese", "Tokyo", "Japan"]
    print("log P(yes|test) =", nb.score(test, "yes"))
    print("log P(no|test) =", nb.score(test, "no"))
    print(nb.classify(test))
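As a sanity check, the same decision can be reproduced without logs, straight from hand-computed smoothed probabilities (3 of the 4 training documents are "yes"; the "yes" documents hold 8 words, 5 of them "Chinese"; the single "no" document holds 3 words; the vocabulary has 6 distinct words):

```python
# Smoothed per-word probabilities, from the raw counts of the toy data
p_chinese_yes = (5 + 1) / (8 + 6)  # 3/7
p_other_yes   = (0 + 1) / (8 + 6)  # 1/14, for Tokyo and for Japan
p_word_no     = (1 + 1) / (3 + 6)  # 2/9, for Chinese, Tokyo and Japan alike

# test = [Chinese, Chinese, Chinese, Tokyo, Japan]
p_yes = (3/4) * p_chinese_yes**3 * p_other_yes**2
p_no  = (1/4) * p_word_no**5
print(p_yes > p_no)  # True -> "yes" wins for the test document
```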

