I took part in a round-reading session at our company of "Introduction to Statistical Modeling for Data Analysis" (the so-called "Midori-bon", or green book). It is a very easy-to-understand book, but sadly for Mac users the sample implementations are in WinBUGS, so I reimplemented the Bayesian inference approach to the generalized linear model from Chapter 9 in Python + STAN.
The rough flow of the implementation is as follows. Specifically, the body size of a certain plant (taking discrete values from 3.0 to 7.0 in 0.1 increments) is used as the explanatory variable, and we estimate the probability distribution of its number of seeds (an integer greater than or equal to 0), which follows a Poisson distribution.
MCMC MCMC is an abbreviation for the Markov chain Monte Carlo method. "A Markov chain Monte Carlo method is a Markov chain that uses the Monte Carlo method" ... that is a tautology, so let me explain a little more.
The property that a state depends only on the immediately preceding state is called the **Markov property**. A **Markov chain** is a stochastic model in which states that depend only on the immediately preceding state occur in a chain. From an engineer's point of view, it is easier to understand as a **kind of automaton**.
The Monte Carlo method is a general term for numerical calculations and simulations that use random numbers.
In other words, the **Markov chain Monte Carlo method** is a method of generating Markov chains using random numbers. Here it refers in particular to algorithms that generate samples from a probability distribution (the posterior distribution of the parameters) by exploiting the properties of Markov chains.
By combining the Bayesian estimation framework (posterior distribution ∝ likelihood × prior distribution) with MCMC, which samples from a probability distribution proportional to the likelihood, the posterior distribution can be estimated even for models that cannot be solved analytically, as long as they can be expressed as a formula.
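As a quick illustration, here is a minimal random-walk Metropolis sampler, one of the simplest MCMC algorithms. This is just a sketch and not from the book: the standard-normal target and step size are arbitrary choices for the example. The proposal depends only on the current state (the Markov property), and accepting proposals in proportion to the target density makes the chain's samples follow that density.

import math
import random

def target(theta):
    # Unnormalized target density: here a standard normal, exp(-theta^2 / 2)
    return math.exp(-theta * theta / 2)

def metropolis(n_samples, step=1.0):
    theta = 0.0  # arbitrary initial state
    samples = []
    for _ in range(n_samples):
        proposal = theta + random.gauss(0, step)  # depends only on the current state
        # Accept with probability min(1, target(proposal) / target(theta))
        if random.random() < target(proposal) / target(theta):
            theta = proposal
        samples.append(theta)
    return samples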
A detailed explanation of generalized linear models and MCMC would take very long, so let's start implementing. "Introduction to Statistical Modeling for Data Analysis: Generalized Linear Models, Hierarchical Bayesian Models, MCMC (The Science of Probability and Information)" explains them in an easy-to-understand way.
Body size ranges over 3.0 ~ 7.0, and each individual's seed count is generated from a Poisson distribution with mean μ = exp(1.5 + 0.1 × body size).
import math
import random
import numpy as np

def generate(n):
    for i in range(n):
        x = round(random.random() * 4 + 3, 1)  # random body size from 3.0 to 7.0, in 0.1 steps
        mu = math.exp(1.5 + 0.1 * x)           # mean of the Poisson distribution (log link)
        print(x, np.random.poisson(mu))
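To feed these pairs into the PyStan script later, one option (a sketch, not from the original; the file name data.txt simply matches the np.loadtxt call below) is to write them to a space-delimited file:

import math
import random
import numpy as np

def generate_file(n, path='data.txt'):
    # Write n space-delimited (body size, seed count) pairs,
    # readable later with np.loadtxt(path, delimiter=' ')
    with open(path, 'w') as f:
        for i in range(n):
            x = round(random.random() * 4 + 3, 1)
            mu = math.exp(1.5 + 0.1 * x)
            f.write('%.1f %d\n' % (x, np.random.poisson(mu)))

generate_file(100)  # 100 is an arbitrary sample size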
"Body size" and "number of seeds".
6.1 11
5.9 6
4.1 7
5.1 6
6.8 13
5.6 7
5.0 7
5.4 16
5.4 6
STAN mcmc.stan
data {
    int<lower=1> N;
    vector[N] x;
    int<lower=0> y[N];
}
parameters {
    real beta1;
    real beta2;
}
model {
    for (i in 1:N) {
        y[i] ~ poisson(exp(beta1 + beta2 * x[i])); // Poisson distribution with a log link function
    }
    beta1 ~ normal(0, 1000); // mean 0, standard deviation 1000: a nearly non-informative prior
    beta2 ~ normal(0, 1000); // mean 0, standard deviation 1000: a nearly non-informative prior
}
data This is the data passed to the Stan program. Declare each item in the form **{data type} {variable name};**. The data is passed from Python using the variable names declared here.
parameters These are the variables used in the model described in Stan. This time they are the intercept β1 and the coefficient β2 of the Poisson model's log link; inside Stan they are given a non-informative prior (approximated by a normal distribution with a very large variance).
model This is the prediction model. The **'~'** operator means that the left-hand side follows the probability distribution on the right-hand side. Here the number of seeds **y** follows a Poisson distribution whose mean is **exp(β1 + β2x)**, i.e. the model uses a log link function.
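To see why the exp is there: the linear predictor β1 + β2x can take any real value, while the Poisson mean must be positive. A quick check in Python, using the true values from the data generation step above:

import numpy as np

beta1, beta2 = 1.5, 0.1         # the true values used to generate the data
x = np.arange(3.0, 7.01, 0.1)   # the range of body sizes
mu = np.exp(beta1 + beta2 * x)  # the log link keeps the Poisson mean positive
print(mu.min(), mu.max())       # roughly 6.0 to 9.0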
Python
import numpy as np
import pystan
import matplotlib.pyplot as plt

data = np.loadtxt('./data.txt', delimiter=' ')

# Build the data to pass to the Stan interface
x = data[:, 0]              # numpy notation: slice out the first column
y = data[:, 1].astype(int)  # the objective variable of the Poisson distribution, so convert to integers
N = data.shape[0]           # the number of data points
stan_data = {'N': N, 'x': x, 'y': y}

fit = pystan.stan(file='./mcmc.stan',
                  data=stan_data, iter=5000, chains=3, thin=10)
# iter   = number of sampling iterations per chain
# chains = number of independent repetitions of the iter iterations
# thin   = thinning interval (keep every thin-th sample)
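matplotlib is imported above but not used yet; continuing the script, here is a sketch of pulling out the posterior draws and plotting them (in the PyStan 2 API, fit.extract() returns a dict of arrays keyed by parameter name):

samples = fit.extract()  # dict: parameter name -> array of post-warmup draws
plt.hist(samples['beta1'], bins=30)
plt.xlabel('beta1')
plt.ylabel('frequency')
plt.show()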
When this is run, if it goes well a log like the following appears. STAN itself is implemented in C++, so a compilation step runs first.
INFO:pystan:COMPILING THE C++ CODE FOR MODEL ~ NOW.
Chain 1, Iteration: 1 / 10000 [ 0%] (Warmup)
Chain 0, Iteration: 1 / 10000 [ 0%] (Warmup)
Chain 2, Iteration: 1 / 10000 [ 0%] (Warmup)
Chain 1, Iteration: 1000 / 10000 [ 10%] (Warmup)
Chain 2, Iteration: 1000 / 10000 [ 10%] (Warmup)
Chain 0, Iteration: 1000 / 10000 [ 10%] (Warmup)
...
Chain 0, Iteration: 8000 / 10000 [ 80%] (Sampling)
Chain 2, Iteration: 8000 / 10000 [ 80%] (Sampling)
Chain 1, Iteration: 9000 / 10000 [ 90%] (Sampling)
Chain 0, Iteration: 9000 / 10000 [ 90%] (Sampling)
Chain 2, Iteration: 9000 / 10000 [ 90%] (Sampling)
Chain 1, Iteration: 10000 / 10000 [100%] (Sampling)
...
# Elapsed Time: 9.51488 seconds (Warm-up)
# 10.7133 seconds (Sampling)
# 20.2282 seconds (Total)
#
summary
Inference for Stan model: anon_model_381b30a63720cfb3906aa9ce3e051d13.
3 chains, each with iter=10000; warmup=5000; thin=10;
post-warmup draws per chain=500, total post-warmup draws=1500.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta1 1.41 2.4e-3 0.05 1.31 1.38 1.41 1.45 1.52 481 1.0
beta2 0.12 4.6e-4 0.01 0.1 0.11 0.12 0.13 0.14 478 1.0
lp__ 7821.4 0.04 0.97 7818.8 7821.1 7821.7 7822.1 7822.4 496 1.0
Samples were drawn using NUTS(diag_e) at Tue Feb 9 23:31:02 2016.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
Executing the fit.summary() method gives output like the above. First, looking at β1: the mean of the samples is 1.41, and with 95% probability it lies in the range 1.31 to 1.52 (in Bayesian terms this is called a credible interval).
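That interval can also be recomputed directly from the draws. A sketch, continuing the script above (np.percentile simply takes the 2.5% and 97.5% quantiles of the samples):

samples = fit.extract()
lo, hi = np.percentile(samples['beta1'], [2.5, 97.5])
print('beta1 95%% credible interval: [%.2f, %.2f]' % (lo, hi))  # %% prints a literal %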
Since the distribution has only one peak this time, we take the mean as the representative value: the true β1 (1.5) was estimated as 1.41 and the true β2 (0.1) as 0.12. Neither is far off, so the parameters could be estimated reasonably well.
Rhat When calling the Stan code from Python, sampling was repeated 3 times via the *chains* parameter. In posterior estimation by MCMC, the initial values of the parameters are chosen arbitrarily, and depending on the model the estimated probability distribution can differ with the initial value. Sampling is therefore repeated several times, and **Rhat**, which quantifies the variation between the sets, is used to check whether the distributions have converged. A value of 1.1 or less is said to be acceptable; this time both beta1 and beta2 are below that, so we can judge there is no problem.
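If you want the Rhat values programmatically rather than reading them off the printed table, a sketch continuing the script above, assuming the PyStan 2 API where fit.summary() returns the table as arrays with row and column names:

s = fit.summary()  # dict with 'summary' plus row/column names
rhat_col = list(s['summary_colnames']).index('Rhat')
for name, row in zip(s['summary_rownames'], s['summary']):
    print(name, row[rhat_col])  # should be close to 1.0 at convergence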
Since the focus here is on introducing an implementation, I have skipped the meaning of arguments such as *thin* when calling STAN, why the first half of the sampling is used for *warm-up*, and what the concrete algorithms are (MCMC is, after all, a general term for a family of methods). If you want to know more, please read the following.
[Introduction to Statistical Modeling for Data Analysis: Generalized Linear Models, Hierarchical Bayesian Models, MCMC (The Science of Probability and Information)](http://www.amazon.co.jp/%E3%83%87%E3%83%BC%E3%82%BF%E8%A7%A3%E6%9E%90%E3%81%AE%E3%81%9F%E3%82%81%E3%81%AE%E7%B5%B1%E8%A8%88%E3%83%A2%E3%83%87%E3%83%AA%E3%83%B3%E3%82%B0%E5%85%A5%E9%96%80%E2%80%95%E2%80%95%E4%B8%80%E8%88%AC%E5%8C%96%E7%B7%9A%E5%BD%A2%E3%83%A2%E3%83%87%E3%83%AB%E3%83%BB%E9%9A%8E%E5%B1%A4%E3%83%99%E3%82%A4%E3%82%BA%E3%83%A2%E3%83%87%E3%83%AB%E3%83%BBMCMC-%E7%A2%BA%E7%8E%87%E3%81%A8%E6%83%85%E5%A0%B1%E3%81%AE%E7%A7%91%E5%AD%A6-%E4%B9%85%E4%BF%9D-%E6%8B%93%E5%BC%A5/dp/400006973X)