Trying to understand Bayesian updating, the binomial distribution, and the likelihood function

Introduction

Last time, I studied Bayesian inference, which is hard to avoid when learning machine learning. This time, as a follow-up, I summarize Bayesian updating, the binomial distribution, and the likelihood function.

Previous post: "I tried hard to understand Bayes' theorem, which you cannot avoid in machine learning" https://qiita.com/Fumio-eisan/items/f33533c1e5bc8d6c93db

Machine learning and its theory (Information Olympics 2015 Spring Training Camp Lecture Material) https://www.slideshare.net/irrrrr/2015-46395273

Understand Bayesian updating

Bayesian updating is an idea that applies Bayes' theorem: each time new evidence arrives, the prior probability is updated to a posterior probability, which then serves as the prior for the next piece of evidence. Let's work through a concrete calculation, using a spam filter as an example.

Spam filter

Consider determining whether an email $A$ is spam ($A_1$) or not spam ($A_2$). The email contains phrases such as "absolute victory" and "completely free", which tend to appear in spam. Suppose that **from the database** we know the probabilities that these phrases appear in spam and non-spam mail, as shown in the table below.

|  | Spam | Non-spam |
|:--|--:|--:|
| "Absolute victory" | 0.11 | 0.01 |
| "Completely free" | 0.12 | 0.02 |

Now, let $B_1$ be the event that an email contains the phrase "absolute victory", and $B_2$ the event that it contains "completely free". Then

\begin{align}
P(B_1|A_1) &= P(\text{absolute victory}\,|\,\text{spam}) = 0.11\\
P(B_1|A_2) &= P(\text{absolute victory}\,|\,\text{non-spam}) = 0.01
\end{align}

That is, even in non-spam mail, roughly 1 in 100 messages contains the phrase "absolute victory". Suppose further that **the database shows** that spam makes up $P(A_1) = 0.6$ of all mail. The probability that an email containing the word "absolute victory" is spam then follows from Bayes' theorem:

\begin{align}
P(A_1|B_1) &= P(\text{spam}\,|\,\text{absolute victory}) = \frac{P(B_1|A_1)P(A_1)}{P(B_1)}\\
&= \frac{P(\text{absolute victory}\,|\,\text{spam})\,P(\text{spam})}{P(\text{absolute victory})}\\
&= \frac{0.11 \times 0.6}{0.11 \times 0.6 + 0.01 \times (1-0.6)}\\
&= 0.9429
\end{align}

So about 94% of emails containing "absolute victory" are spam, and about 6% are not. Through this calculation, the initial assumption that 60% of mail is spam has been replaced, for this email, by a spam probability of 94%. This is the key idea: **the posterior probability becomes the new prior probability.** Using this idea, let's apply the principle of Bayesian updating to "completely free":

\begin{align}
P(A_1|B_2) &= P(\text{spam}\,|\,\text{completely free}) = \frac{P(B_2|A_1)P(A_1)}{P(B_2)}\\
&= \frac{P(\text{completely free}\,|\,\text{spam})\,P(\text{spam})}{P(\text{completely free})}\\
&= \frac{0.12 \times 0.9429}{0.12 \times 0.9429 + 0.02 \times (1-0.9429)}\\
&= 0.9900
\end{align}

In other words, for an email that also contains "completely free", the probability of being spam rises to about 99%. This way of thinking is called Bayesian updating.
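
As a quick check of the arithmetic above, here is a minimal Python sketch of the sequential update (the helper `bayes_update` is just illustrative, not a library function):

def bayes_update(prior, p_word_given_spam, p_word_given_ham):
    # Bayes' theorem: P(spam | word) computed from the current prior P(spam)
    numerator = p_word_given_spam * prior
    denominator = numerator + p_word_given_ham * (1 - prior)
    return numerator / denominator

prior = 0.6                                           # P(spam) from the database
posterior_1 = bayes_update(prior, 0.11, 0.01)         # after "absolute victory"
posterior_2 = bayes_update(posterior_1, 0.12, 0.02)   # after "completely free"
print(posterior_1, posterior_2)                       # ≈ 0.9429, 0.9900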

Understand the Bernoulli distribution and the binomial distribution

The Bernoulli distribution is the probability distribution of a single trial with exactly two outcomes, such as success or failure:


f(x|θ) = θ^x(1-θ)^{1-x}, x=0,1

$x$ is a random variable taking the value 1 or 0. Typically, outcomes such as heads or tails of a coin, or the presence or absence of a disease, are encoded as 1 and 0. $θ$ is the probability that the event occurs ($x=1$). This $θ$ is called a parameter: a numerical quantity that characterizes the probability distribution.
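
For example, the Bernoulli pmf can be evaluated with scipy; a small sketch, assuming $θ = 1/3$ for illustration:

from scipy.stats import bernoulli

theta = 1/3                              # assumed success probability
print(bernoulli.pmf(1, theta))           # P(x=1) = θ ≈ 0.333
print(bernoulli.pmf(0, theta))           # P(x=0) = 1-θ ≈ 0.667
print(bernoulli.rvs(theta, size=10))     # ten simulated 0/1 trials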

Understand the binomial distribution

The Bernoulli distribution above describes a single trial. Next, consider repeating the trial $n$ times. Take basketball as an example: a player makes a free throw with probability $\frac{1}{3}$. What is the probability of succeeding exactly 2 times out of 3 throws? For instance, the probability of observing success ⇒ success ⇒ failure is


θ^2×(1-θ)^1 = \left(\frac{1}{3}\right)^2×\left(1-\frac{1}{3}\right)=\frac{2}{27}

However, there are three orderings in total: success ⇒ success ⇒ failure, success ⇒ failure ⇒ success, and failure ⇒ success ⇒ success. The probability is therefore $3 × \frac{2}{27} = \frac{2}{9}$. Generalizing this gives the binomial distribution:


f(x;n,θ) = {}_n\mathrm{C}_x×θ^x(1-θ)^{n-x}, \quad x=0,1,\dots,n
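
As a sanity check, the basketball example above can be reproduced with scipy.stats.binom (a small sketch using the numbers from the example):

from scipy.stats import binom

# probability of exactly 2 successes in 3 throws with success probability 1/3
print(binom.pmf(2, 3, 1/3))   # ≈ 0.2222 = 2/9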

Let's see what this probability distribution looks like for $n = 30$ and $θ = 0.3, 0.5, 0.8$.

bayes.ipynb


import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import binom

# number of successes k = 0..30 for n = 30 trials
k = np.arange(0, 31)
p_1 = binom.pmf(k, 30, 0.3)   # θ = 0.3
p_2 = binom.pmf(k, 30, 0.5)   # θ = 0.5
p_3 = binom.pmf(k, 30, 0.8)   # θ = 0.8

plt.plot(k, p_1, label='θ=0.3')
plt.plot(k, p_2, color="orange", label='θ=0.5')
plt.plot(k, p_3, color="green", label='θ=0.8')
plt.xlabel('number of successes k')
plt.ylabel('probability')
plt.legend()
plt.show()

002.png

For 30 trials, the most likely number of successes is about 9 when θ = 0.3, about 15 when θ = 0.5, and about 24 when θ = 0.8 (the mode sits near the mean $nθ$). The peak is lowest for θ = 0.5 because the variance $nθ(1-θ)$ is largest there.
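
To confirm this, scipy can return the mean and variance of each distribution directly (a small sketch reusing the same n and θ values):

from scipy.stats import binom

for theta in (0.3, 0.5, 0.8):
    mean, var = binom.stats(30, theta, moments='mv')
    print(f'θ={theta}: mean={float(mean)}, variance={float(var):.2f}')
# the variance nθ(1-θ) is largest at θ=0.5 (7.5), which is why that curve is flattest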

Understand the likelihood function

Let us return to the binomial distribution from above. In the previous example, we treated the parameters (the 30 trials and the success probability) as constants and computed the probability of each outcome. Reversing this viewpoint, a function in which the parameter (here, the number of trials) is treated as the variable while the observations are held fixed is called a **likelihood function**. The value obtained from this likelihood function is called the likelihood.


f(x;n,θ) = {}_n\mathrm{C}_x×θ^x(1-θ)^{n-x}, \quad x=0,1,\dots,n

Let's plot it in the same way as before, this time for $n = 30, 50, 80$ with the success probability fixed at $θ = 0.3$.

bayes.ipynb


import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import binom

# number of successes k; mass beyond k = 50 is negligible for these parameters
k = np.arange(0, 51)
p_1 = binom.pmf(k, 30, 0.3)   # n = 30
p_2 = binom.pmf(k, 50, 0.3)   # n = 50
p_3 = binom.pmf(k, 80, 0.3)   # n = 80

plt.plot(k, p_1, label='n=30')
plt.plot(k, p_2, color="orange", label='n=50')
plt.plot(k, p_3, color="green", label='n=80')
plt.xlabel('number of successes k')
plt.ylabel('probability')
plt.legend()
plt.show()

003.png

As the number of trials increases, the variance $nθ(1-θ)$ increases and the peak of the distribution becomes lower, while the expected value $nθ$ shifts to the right.
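
As an aside, the textbook formulation of the likelihood function fixes the observed data and treats θ itself as the variable. A minimal sketch, assuming an observation of x = 9 successes in n = 30 trials:

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import binom

n, x = 30, 9                              # assumed observation for illustration
theta = np.linspace(0.01, 0.99, 200)
likelihood = binom.pmf(x, n, theta)       # L(θ) = P(x | n, θ) as a function of θ

plt.plot(theta, likelihood)
plt.xlabel('θ')
plt.ylabel('likelihood L(θ)')
plt.show()
# the likelihood peaks at θ = x/n = 0.3, the maximum-likelihood estimate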

In conclusion

This time, the binomial distribution was easy to understand because it falls within the scope of high-school mathematics. However, it was eye-opening that the meaning of the likelihood function, and the interpretation of the probabilities, changes depending on what is treated as the variable. I also understood that Bayesian updating is a concept that can be applied in machine learning, but I found that great care is needed when setting the prior probabilities yourself. There may be pitfalls if you do not ask whether the underlying assumptions are really valid.

Reference URL https://lib-arts.hatenablog.com/entry/implement_bayes1
