The previous articles explained the Markov chain and the Monte Carlo method, respectively, and outlined the Markov chain Monte Carlo method (commonly known as the MCMC method). There are multiple algorithms for the MCMC method; this time I would like to summarize the Metropolis-Hastings method (commonly known as the MH method).
I referred to the following articles and books:
[Bayesian Statistics from the Basics: A Practical Introduction to the Hamiltonian Monte Carlo Method](https://www.amazon.co.jp/%E5%9F%BA%E7%A4%8E%E3%81%8B%E3%82%89%E3%81%AE%E3%83%99%E3%82%A4%E3%82%BA%E7%B5%B1%E8%A8%88%E5%AD%A6-%E3%83%8F%E3%83%9F%E3%83%AB%E3%83%88%E3%83%8B%E3%82%A2%E3%83%B3%E3%83%A2%E3%83%B3%E3%83%86%E3%82%AB%E3%83%AB%E3%83%AD%E6%B3%95%E3%81%AB%E3%82%88%E3%82%8B%E5%AE%9F%E8%B7%B5%E7%9A%84%E5%85%A5%E9%96%80-%E8%B1%8A%E7%94%B0-%E7%A7%80%E6%A8%B9/dp/4254122128)
[Statistics] Explaining sampling with the Markov chain Monte Carlo method (MCMC) through animation
Bayesian Statistics from the Basics, reading-group material, Chapter 4: The Metropolis-Hastings Method
The Markov chain Monte Carlo method (commonly known as the MCMC method) is a technique for obtaining a complex distribution, such as a probability distribution over a multidimensional space, from **sampling based on that probability distribution**.
Consider, for example, computing the expected value $θ_{eap}$ of a parameter (the EAP estimate). In data analysis this integral tends to be high-dimensional, which makes the calculation very complicated and often analytically intractable, so we instead draw random samples from the target distribution (the Monte Carlo method). The Markov chain then adjusts the distribution of the sampled values so that it approaches the desired distribution. Several major algorithms implement the MCMC method; this time I will summarize the Metropolis-Hastings method.
In the MCMC method, we consider the detailed balance condition shown below for arbitrary values $θ$ and $θ'$ of the random variable.
f(θ'|θ)f(θ) = f(θ|θ')f(θ') \hspace{1cm} (Equation 1)
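As a minimal illustration (my own check, not from the referenced book), detailed balance can be verified numerically for a small two-state Markov chain whose stationary distribution is known:

```python
import numpy as np

# Hypothetical two-state chain: stationary distribution f and transition matrix P
f = np.array([0.25, 0.75])   # f(θ=0), f(θ=1)
P = np.array([[0.7, 0.3],    # P[i, j] = f(θ'=j | θ=i)
              [0.1, 0.9]])

# Check f(θ'|θ)f(θ) = f(θ|θ')f(θ') for every pair (θ, θ')
for i in range(2):
    for j in range(2):
        assert np.isclose(P[i, j] * f[i], P[j, i] * f[j])
print("detailed balance holds")  # and indeed f @ P equals f
```

A chain satisfying this condition leaves $f$ invariant, which is exactly the property the MCMC method exploits.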
Here, $f(θ'|θ)$ denotes the transition kernel: the conditional probability of moving from the state $θ$ to the state $θ'$. Designing a transition kernel that satisfies Equation 1 directly is difficult, so instead we prepare a proposal distribution $q(θ'|θ)$, a probability distribution that proposes suitable candidates for samples from the target distribution. Because the proposal distribution is chosen somewhat arbitrarily, it does not in general satisfy the equality. For example, we may have
q(θ|θ')f(θ') > q(θ'|θ)f(θ) \hspace{1cm} (Equation 2)
as in Equation 2. The method of applying a probabilistic correction to this situation so that the detailed balance condition is satisfied is called the **Metropolis-Hastings algorithm (commonly known as the MH method)**.
It is an algorithm originally proposed in the field of physical chemistry and later applied to Bayesian statistics. Equation 2 says that the probability of moving to $θ$ is greater than the probability of moving to $θ'$. To correct this through the transition probabilities, we introduce positive correction coefficients $c$ and $c'$:
f(θ|θ') = cq(θ|θ')\\
f(θ'|θ) = c'q(θ'|θ)
Therefore, after correction,
cq(θ|θ')f(θ')=c'q(θ'|θ)f(θ)
and the equality holds. Even so, the values of $c$ and $c'$ cannot be determined uniquely from this single equation. We therefore work only with their ratio, setting
r = \frac{c}{c'} = \frac{q(θ'|θ)f(θ)}{q(θ|θ')f(θ')}\\
r' =\frac{c'}{c'} = 1
Substituting these back,
rq(θ|θ')f(θ')=r'q(θ'|θ)f(θ)
holds identically by the definition of $r$, so the detailed balance condition is satisfied.
The MH algorithm is then as follows. Draw a candidate $a$ from the proposal distribution $q(a|θ^{t})$. When
q(a|θ^{t})f(θ^{t}) > q(θ^{t}|a)f(a)
holds, the correction coefficient

r =\frac {q(θ^{t}|a)f(a) }{q(a|θ^{t})f(θ^{t})}

is calculated; $a$ is accepted with probability $r$, setting $θ^{t+1} = a$, and discarded with probability $1-r$, keeping $θ^{t+1} = θ^{t}$.
In summary, we **accept the proposed candidate point $a$ with probability $\min(1, r)$ (setting $θ^{t+1} = a$) or otherwise stay in place ($θ^{t+1} = θ^{t}$), and repeat this as many times as needed**. In this way, $θ$ tends to move toward regions where the target distribution $f(θ)$ is high, as the sketch below illustrates.
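To make the flow concrete, here is a minimal sketch of a single MH update. The names `f`, `q_pdf`, and `q_sample` are placeholders of my own for the target density, the proposal density, and the proposal sampler; they are not from the original article.

```python
import numpy as np

def mh_step(theta_t, f, q_pdf, q_sample):
    """One Metropolis-Hastings update from the current state theta_t.
    q_pdf(x, cond) is the density of proposing x given the state cond."""
    a = q_sample(theta_t)  # draw a candidate from the proposal distribution
    r = (q_pdf(theta_t, a) * f(a)) / (q_pdf(a, theta_t) * f(theta_t))
    if r >= 1 or np.random.uniform() < r:  # accept with probability min(1, r)
        return a
    return theta_t  # otherwise stay in place
```

Iterating `mh_step` produces the chain $θ^{0}, θ^{1}, θ^{2}, \dots$ whose long-run distribution approaches $f(θ)$.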
Now I would like to actually implement the MH method. There are two variants: the independent MH method and the random walk MH method. First, the independent MH method. The problem setting is as follows.
Suppose the posterior distribution of the parameter $θ$ is the gamma distribution $f(θ|α=11, λ=13)$. For the proposal distribution, we use a normal distribution $q(θ)$ with mean 1 and variance 0.5. Since this proposal does not depend on the current state, the correction coefficient is
r =\frac {q(θ^{t})f(a) }{q(a)f(θ^{t})}
The program is below (library imports and helper definitions are not written out here; see the full-code link at the end). To keep the article self-contained, a sketch of those omitted pieces comes first.
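The helper names `f_gamma`, `q`, and `rand_prop` are assumed to match the linked repository, so the details there may differ; this is a minimal sketch of the problem setup described above:

```python
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt

# Target: posterior f(θ) = Gamma(α=11, λ=13) (shape-rate parameterization)
def f_gamma(theta):
    return st.gamma(a=11, scale=1/13.).pdf(theta)

# Independent proposal: normal with mean 1 and variance 0.5
prop_mean, prop_sd = 1.0, np.sqrt(0.5)

def q(theta):
    return st.norm(prop_mean, prop_sd).pdf(theta)

def rand_prop():
    return st.norm.rvs(prop_mean, prop_sd)
```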
```python
theta = []
current = 3  # initial value
theta.append(current)
n_itr = 100000

for i in range(n_itr):
    a = rand_prop()  # draw a candidate from the proposal distribution
    if a < 0:
        # the gamma density is zero for negative values, so always reject
        theta.append(theta[-1])
        continue
    # correction coefficient r for the independent proposal
    r = (q(current) * f_gamma(a)) / (q(a) * f_gamma(current))
    if r >= 1 or r > st.uniform.rvs():  # accept with probability min(1, r)
        theta.append(a)
        current = a
    else:
        # reject: stay at the current value
        theta.append(theta[-1])
```
The key point is how to express "accept with probability $r$". In the line

```python
if r >= 1 or r > st.uniform.rvs():
```

st.uniform.rvs() generates a random number from the uniform distribution on $[0, 1]$, and the candidate is accepted if $r$ is larger than that number (or immediately if $r \ge 1$).
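As a quick sanity check (my own addition, not in the original article), repeating this comparison many times accepts at almost exactly the intended rate:

```python
import scipy.stats as st

r = 0.3                           # example acceptance ratio
u = st.uniform.rvs(size=100000)   # uniform draws on [0, 1]
print((r > u).mean())             # ≈ 0.3: the accepted fraction matches r
```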
Let us check whether this program reproduces the posterior distribution.
```python
n_burn_in = 10000  # number of initial samples to discard as burn-in (value assumed)
plt.title("Histogram from MCMC sampling.")
plt.hist(theta[n_burn_in:], bins=30, density=True)  # density=True replaces the deprecated normed=True
xx = np.linspace(0, 2.5, 501)
plt.plot(xx, st.gamma(11, 0, 1/13.).pdf(xx))
plt.show()
```
The blue histogram is the frequency distribution of the $θ$ values visited by the MH method, normalized so that the total area is 1. The orange curve is the posterior distribution we originally wanted to reproduce, and the two agree, so we can confirm that the distribution has been reproduced.
By the way, this seemed to work without any problem, but the proposal distribution's mean of 1 and variance of 0.5 were chosen arbitrarily, and other choices may not work. For example, if a normal distribution with mean 3 and variance 0.5 is used as the proposal, the sampled histogram shifts away from the target, as sketched below. In fact, the independent MH method has the known problem that the chain does not converge well unless the proposal distribution is close to the target distribution. In other words, with the independent MH method, the quality of the proposal makes a big difference in the results and in the time until convergence. Here the posterior distribution was known for simplicity, but in actual data analysis the distribution is usually unknown, so we need a way to work around this problem.
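A minimal sketch of that failure case, reusing the helpers above and changing only the proposal mean (the exact parameters of the original experiment are assumed):

```python
import scipy.stats as st

# Mis-specified independent proposal: mean 3 instead of 1
prop_mean_bad = 3.0

def q_bad(theta):
    return st.norm(prop_mean_bad, prop_sd).pdf(theta)

def rand_prop_bad():
    return st.norm.rvs(prop_mean_bad, prop_sd)

# Re-running the sampling loop with q_bad / rand_prop_bad in place of
# q / rand_prop yields a histogram visibly shifted away from the target.
```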
An algorithm called the random walk MH method has been proposed as a method to solve this problem.
The candidate is proposed as

a =θ^{t} + e

where $e$ is drawn from, for example, a normal distribution with mean 0 or a uniform distribution.
If a symmetric distribution such as a normal or uniform distribution is chosen as the proposal,

q(a|θ^{t}) = q(θ^{t}|a)

holds. Therefore, the correction coefficient $r$ in the random walk MH method reduces to

r = \frac {f(a)}{f(θ^{t})}

and the proposal distribution always cancels out of the ratio.
Now, let's implement and check the same problem as before.
```python
# MCMC sampling (random walk MH)
theta = []
current = 5  # initial value
theta.append(current)
n_itr = 100000
prop_sd = np.sqrt(0.10)  # standard deviation of the random walk step e

def rand_prop1(mean):
    # candidate a = θ^t + e, where e ~ N(0, prop_sd^2)
    return st.norm.rvs(mean, prop_sd)

for i in range(n_itr):
    a = rand_prop1(current)  # draw a candidate centered on the current state
    if a < 0:
        # the gamma density is zero for negative values, so always reject
        theta.append(theta[-1])
        continue
    # the symmetric proposal cancels, leaving only the ratio of target densities
    r = f_gamma(a) / f_gamma(current)
    if r >= 1 or r > st.uniform.rvs():  # accept with probability min(1, r)
        theta.append(a)
        current = a
    else:
        # reject: stay at the current value
        theta.append(theta[-1])
```
The plotting code is below.
```python
plt.title("Histogram from MCMC sampling.")
n, b, p = plt.hist(theta[n_burn_in:], bins=50, density=True)
xx = np.linspace(0, 2.5, 501)
plt.plot(xx, st.gamma(11, 0, 1/13.).pdf(xx))
plt.show()
```
```python
kk = 11.
tt = 13.
print("sample mean:{0:.5f}, sample std:{1:.5f}".format(np.mean(theta), np.std(theta)))
# theoretical mean and standard deviation of Gamma(α=11, λ=13)
print("mean:{0:.5f}, std:{1:.5f}".format(kk/tt, np.sqrt(kk*((1/tt)**2))))
```
The distribution was reproduced neatly. Compared with the independent MH method described earlier, the random walk MH algorithm seems safer to use.

I would like to say this is a relief, but in reality issues remain. The scale of $e$ still has to be specified properly, so a hyperparameter choice is again required. If the variance of $e$ is small, the proportion of accepted transitions is high, but the chain advances through the state space as a slow random walk and needs a long time to converge. Conversely, the larger the variance, the higher the rejection rate. One practical way to monitor this trade-off is sketched below. The Hamiltonian Monte Carlo method is a way to resolve the trade-off itself; I hope to summarize it next time.
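A minimal sketch (my own addition) that estimates the acceptance rate from the chain produced above, counting a move as accepted whenever the state actually changed:

```python
# fraction of iterations in which the chain moved to a new value
n_accept = sum(1 for prev, cur in zip(theta[:-1], theta[1:]) if prev != cur)
print("acceptance rate: {0:.3f}".format(n_accept / (len(theta) - 1)))
```

If this rate is very high, the steps are probably too small; if it is very low, too many candidates are being wasted.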
This time, I have summarized the Metropolis-Hastings method. Thinking about how the chain converges toward the posterior distribution felt much like optimizing the weight parameters of a neural network.

The algorithms differ, but the underlying goal is similar, and I felt that grasping this connection reveals the fun and breadth of statistics and machine learning.
The full program is stored here: https://github.com/Fumio-eisan/Montecarlo20200516