I'm a beginner in both machine learning and Python, so this post basically follows along with the pages I referred to.
Let's start with a little math. Have you ever heard the term conditional probability?
I heard a little about it in class before, but I could hardly remember it.
P(B|A) = \frac{P(A \cap B)}{P(A)} = \frac{P(B)P(A|B)}{P(A)}
P(B|A) refers to the probability that event B occurs given that event A has occurred.
A well-known example is the two-children problem:
A couple has two children, and it turns out that at least one of the two is a boy. Find the probability that both of them are boys, assuming that the probability of having a boy and the probability of having a girl are each 1/2.
Let event A be "at least one is a boy" and event B be "both are boys".
If you list all the patterns for two children, there are 4: boy & boy, boy & girl, girl & boy, girl & girl (the birth order of the siblings is distinguished).
Of these, the 3 patterns other than girl & girl contain at least one boy, so the probability of event A is 3/4.
The probability that both are boys, P(A \cap B), is of course 1/4.
So the probability P(B|A) is
P(B|A) = \frac{\frac{1}{4}}{\frac{3}{4}} = \frac{1}{3}
Reference: Beautiful story of high school mathematics
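As a quick check, here is a minimal Python sketch (mine, not from the referenced page) that enumerates the four equally likely patterns and confirms that P(B|A) = 1/3:

```python
from itertools import product

# All equally likely patterns for two children (birth order matters)
patterns = list(product(["boy", "girl"], repeat=2))

# Event A: at least one is a boy
event_a = [p for p in patterns if "boy" in p]
# Event A and B: both are boys
event_a_and_b = [p for p in patterns if p == ("boy", "boy")]

# P(B|A) = P(A and B) / P(A) = (1/4) / (3/4)
print(len(event_a_and_b) / len(event_a))  # 0.333... = 1/3
```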
Now for the main subject. Naive Bayes is also called the simple Bayes classifier. It is apparently used for categorizing documents by slightly rearranging Bayes' theorem.
P(cat|doc) = \frac{P(cat)P(doc|cat)}{P(doc)}\propto P(cat)P(doc|cat)
This is the basic form.
I would like to explain each element.
P(cat|doc) is the probability of the document (doc) belonging to the category (cat).
To put it more concretely, it means the probability that a document belongs to a category such as "IT" when it contains words such as "Apple", "iPhone", and "MacBook Pro".
\frac{P(cat)P(doc|cat)}{P(doc)}
Since P(doc) is common to every category, we ignore it and calculate only
P(cat)P(doc|cat)
P(cat) is easy.
If the total number of documents is 100 and the number of documents in the IT category is 50, it can be calculated like this:
P(IT) = \frac{50}{100} = \frac{1}{2}
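For instance, a minimal sketch of computing the prior P(cat) from labeled documents (the data and variable names here are made up for illustration):

```python
from collections import Counter

# Hypothetical example: 100 labeled documents in total
labels = ["IT"] * 50 + ["sports"] * 30 + ["cooking"] * 20

counts = Counter(labels)
total = len(labels)

# P(cat) = number of documents in the category / total number of documents
priors = {cat: n / total for cat, n in counts.items()}
print(priors["IT"])  # 0.5
```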
P(doc|cat) may not make sense at first. The probability of the document (doc) given the category (cat)...? This is the heart of Naive Bayes.
Naive Bayes does not compute the exact P(doc|cat); instead, it assumes independence between word occurrences.
Independence between word occurrences means, for example, that although "machine" and "learning" are words that often appear together in texts about machine learning, we simplify the calculation by assuming that "machine" and "learning" appear independently of each other. It is then calculated as
P(doc|cat) = P(word_1 \wedge \cdots \wedge word_k|cat) = \prod_{i=1}^k P(word_i|cat)
\prod is the multiplication (×) counterpart of \Sigma.
Now let's calculate P(word|cat). Semantically, this is the probability that a word appears given a category (cat), so it is calculated as follows. (The formula is on the reference page, so please refer to it.)
P(word_k|cat) = \frac{\text{number of appearances of the word } (word_k) \text{ in category } (cat)}{\text{total number of words appearing in category } (cat)}
All you have to do now is calculate using these!
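As an illustration (my own sketch with made-up data, not the reference implementation), P(word|cat) and the product of the word probabilities could be computed from word counts like this:

```python
from collections import Counter

# Hypothetical training data: all words appearing in each category's documents
training = {
    "IT": ["apple", "iphone", "macbook", "apple", "python"],
    "cooking": ["apple", "recipe", "oven", "salt"],
}

word_counts = {cat: Counter(words) for cat, words in training.items()}
total_words = {cat: len(words) for cat, words in training.items()}

def p_word_given_cat(word, cat):
    # number of appearances of the word in the category /
    # total number of words appearing in the category
    return word_counts[cat][word] / total_words[cat]

def p_doc_given_cat(doc_words, cat):
    # Independence assumption: multiply the individual word probabilities
    p = 1.0
    for w in doc_words:
        p *= p_word_given_cat(w, cat)
    return p

print(p_doc_given_cat(["apple", "iphone"], "IT"))  # (2/5) * (1/5) = 0.08
```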
Two keywords will appear.
The first boils down to just adding +1. There is apparently something called the zero frequency problem. What is that? Since P(doc|cat) is expressed as a product of word appearance probabilities, if even one of those probabilities is 0, the product as a whole becomes 0.
For example, suppose you have trained on the "IT" category and then come across a new word, "Python", that was not in the training data. Then the formula
P(word_k|cat) = \frac{\text{number of appearances of the word } (word_k) \text{ in category } (cat)}{\text{total number of words appearing in category } (cat)}
has a numerator of 0, and we get
P(Python|IT) = 0
So, add +1 to the number of appearances in advance to prevent the probability from becoming 0! That technique is apparently called Laplace smoothing. Naturally, the denominator also changes. Writing the formula again:
P(word_k|cat) = \frac{\text{number of appearances of the word } (word_k) \text{ in category } (cat) + 1}{\text{total number of words appearing in category } (cat) + \text{total number of words in the vocabulary}}
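Here is the same kind of sketch with Laplace smoothing added (again my own illustrative code; I am reading "total number of words" as the number of distinct words seen during training, i.e. the vocabulary size):

```python
from collections import Counter

# Same hypothetical training data as in the earlier sketch
training = {
    "IT": ["apple", "iphone", "macbook", "apple", "python"],
    "cooking": ["apple", "recipe", "oven", "salt"],
}
word_counts = {cat: Counter(words) for cat, words in training.items()}
total_words = {cat: len(words) for cat, words in training.items()}
# Vocabulary: all distinct words seen in any category
vocabulary = {w for words in training.values() for w in words}

def p_word_given_cat(word, cat):
    # +1 in the numerator and +|vocabulary| in the denominator (Laplace smoothing)
    return (word_counts[cat][word] + 1) / (total_words[cat] + len(vocabulary))

print(p_word_given_cat("python", "cooking"))  # > 0 even though unseen in "cooking"
```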
With the sample data it is not a big deal, but when you apply this to real data, the denominator of P(word|cat) can become very large, because the number of words can be huge. In that case, underflow may occur.
Underflow is the opposite of overflow: a value has so many digits after the decimal point that it becomes too small to handle.
That's where the logarithm comes in. Taking the logarithm turns the product into a sum of logarithms with the same base, and since the logarithm is monotonically increasing, comparing the logarithms gives the same classification result. So we rewrite the following in logarithmic form:
\prod_{i=1}^k P(word_i|cat)
\sum_{i=1}^k \log P(word_i|cat)
Putting these together, the final formula to compute for each category is:
\log P(cat)P(doc|cat) = \log P(cat) + \sum_{i=1}^k \log P(word_i|cat)
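Putting everything together, here is a minimal end-to-end sketch (my own illustrative code and toy data, not the code from the implementation post) that scores each category with log P(cat) plus the sum of log P(word_i|cat), using Laplace smoothing:

```python
import math
from collections import Counter, defaultdict

# Hypothetical training documents: (list of words, category)
training = [
    (["apple", "iphone", "macbook"], "IT"),
    (["python", "machine", "learning"], "IT"),
    (["apple", "recipe", "oven", "salt"], "cooking"),
]

doc_counts = Counter(cat for _, cat in training)
word_counts = defaultdict(Counter)
for words, cat in training:
    word_counts[cat].update(words)
total_words = {cat: sum(c.values()) for cat, c in word_counts.items()}
vocabulary = {w for words, _ in training for w in words}
total_docs = len(training)

def score(doc_words, cat):
    # log P(cat) + sum of log P(word|cat), with Laplace smoothing
    s = math.log(doc_counts[cat] / total_docs)
    for w in doc_words:
        s += math.log((word_counts[cat][w] + 1) /
                      (total_words[cat] + len(vocabulary)))
    return s

def classify(doc_words):
    # Pick the category with the highest log score
    return max(doc_counts, key=lambda cat: score(doc_words, cat))

print(classify(["apple", "python"]))  # expected: "IT"
```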
Reference: Logarithm (log) calculation and formulas! This is perfect!
Machine learning beginners try Naive Bayes (2) --Implementation
So I would like to try implementing it in Python (referring to the site).
I referred to the following sites a great deal. Thank you very much.