The previous article gave an overview of the probabilistic programming language Pyro and showed how to draw samples with it. This time we use Pyro to implement the first example (Chapter 3, Part 2: the simple regression model) from "Introduction to Data Analysis by Bayesian Statistical Modeling Beginning with R and Stan" (Practical Data Science Series; hereinafter, the reference book). The implementation is available at 3-2 Simple Regression Model (Google Colaboratory).
The data handled this time (from Section 3-2 of the reference book) is the **relationship between beer sales and temperature**. Our statistical modeling builds a model that explains sales by temperature. A sample of the actual data and a scatter plot are shown below.
```python
import pandas as pd

# Load the data
beer_sales_2 = pd.read_csv(
    'https://raw.githubusercontent.com/logics-of-blue/book-r-stan-bayesian-model-intro/master/book-data/3-2-1-beer-sales-2.csv'
)  # shape = (100, 2)
beer_sales_2.head()
```
|  | sales | temperature |
|---|---|---|
| 0 | 41.68 | 13.7 |
| 1 | 110.99 | 24 |
| 2 | 65.32 | 21.5 |
| 3 | 72.64 | 13.4 |
| 4 | 76.54 | 28.9 |
```python
import matplotlib.pyplot as plt

# Visualize the data
beer_sales_2.plot.scatter('temperature', 'sales')
plt.title("Relationship between beer sales and temperature")
plt.show()
```
There appears to be a positive correlation.
Below, we describe a model that explains sales in terms of temperature using simple regression.
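Concretely, the model takes the standard simple-regression form (restated here to match the code below; the noise scale $\sigma$ corresponds to the variable `sigma2` in the code):

$$
\text{sales}_i \sim \text{Normal}(\text{Intercept} + \beta \times \text{temperature}_i,\ \sigma)
$$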
In Pyro, model description and parameter estimation proceed as follows.

The process by which the assumed statistical model generates data is described in a single method, conventionally named `model`. When writing it, take care to satisfy the following points:

* Declare random variables with `pyro.sample` and conditionally independent observations with `pyro.plate`, both introduced in the previous article.
* Pass the observed data to the observation site via the `obs=` argument, so that the same method can both generate data and condition on observations.

```python
import torch
import pyro
import pyro.distributions as dist

def model(temperature: torch.Tensor, sales: torch.Tensor = None):
    # Weakly informative priors on the regression parameters
    intercept = pyro.sample("intercept", dist.Normal(0, 1000))
    beta = pyro.sample("beta", dist.Normal(0, 1000))
    # Note: dist.Normal takes the standard deviation as its scale parameter,
    # so "sigma2" acts as the noise standard deviation here
    sigma2 = pyro.sample("sigma2", dist.Uniform(0, 1000))
    # Each of the temperature.size(0) observations is conditionally independent
    with pyro.plate("plate", size=temperature.size(0)):
        y_ = intercept + beta * temperature
        y = pyro.sample("y", dist.Normal(y_, sigma2), obs=sales)
    return y
```
All four variables (`intercept`, `beta`, `sigma2`, `y`) defined by `pyro.sample` in `model` are treated as random variables. Among them, those that do not correspond to observed data (everything except `y`) are latent variables, and they are the targets of the posterior estimation performed later.
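Because `sales` defaults to `None`, the same `model` can also be run generatively. A minimal sketch (the temperature values below are made up for illustration):

```python
# With sales=None, "y" is drawn from the prior predictive distribution
# instead of being conditioned on data (hypothetical inputs)
fake_temperature = torch.tensor([15., 20., 25.])
prior_sales = model(fake_temperature)
print(prior_sales)  # three simulated sales values
```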
After writing the `model` method for the statistical model, the posterior distribution of the latent variables is estimated. Two estimation methods are mainly provided:

* MCMC
* Variational inference

The classes and functions required for estimation live in `pyro.infer`.
To run the estimation with the NUTS kernel, write the following.
```python
import pyro.infer as infer

# Convert the explanatory and observed variables to torch.Tensor
temperature = torch.Tensor(beer_sales_2.temperature)
sales = torch.Tensor(beer_sales_2.sales)

# Estimate the posterior distribution
nuts_kernel = infer.NUTS(model, adapt_step_size=True, jit_compile=True, ignore_jit_warnings=True)
mcmc = infer.MCMC(nuts_kernel, num_samples=3000, warmup_steps=200, num_chains=3)
mcmc.run(temperature, sales)
mcmc.summary()  # prints the posterior summary shown below
```
|  | mean | std | median | 5.0% | 95.0% | n_eff | r_hat |
|---|---|---|---|---|---|---|---|
| intercept | 21.00 | 6.03 | 20.99 | 11.01 | 30.80 | 2869.59 | 1.00 |
| beta | 2.47 | 0.29 | 2.47 | 1.98 | 2.94 | 2866.62 | 1.00 |
| sigma2 | 17.06 | 1.22 | 16.98 | 15.02 | 19.00 | 4014.61 | 1.00 |
The posterior distributions of the latent variables in the `model` method, namely the intercept, the slope (`beta`), and the noise scale (`sigma2`), have been estimated. The intercept is about 21 and the slope about 2.5, in good agreement with the results in the reference book.
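Beyond the summary table, the raw posterior draws can be retrieved with `MCMC.get_samples` for further analysis. A minimal sketch:

```python
# Posterior draws as a dict mapping site names to tensors
posterior = mcmc.get_samples()
print(posterior["beta"].shape)          # (num_chains * num_samples,)
print(posterior["beta"].mean().item())  # should be close to 2.47
```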
Pyro can also estimate the posterior by variational inference. In variational inference, the variational distribution $q(Z)$ over the latent variables is described in a method conventionally named `guide`. Looking at the `model` method, there are three latent variables to estimate in this problem: `intercept`, `beta`, and `sigma2`, so the process generating these three variables is what the `guide` method must describe. However, writing the guide arbitrarily will not bring $q(Z)$ close to $p(Z|X)$. In Pyro, variational inference works by updating parameters declared with `pyro.param` so that $q(Z)$ approaches $p(Z|X)$, using PyTorch's automatic differentiation and gradient descent. With this in mind, how to write the `guide` method is summarized below.
* Declare the parameters to be updated with `pyro.param` inside the `guide` method. Constraints such as positivity are specified via `dist.constraints`.
* Draw each latent variable with `pyro.sample` from a probability distribution (`dist.*`) parameterized by those parameters.
* Give each sample site the same name as the corresponding one in the `model` method.

The following is an example implementation of `guide` for this problem. Here we assume the three latent variables are independent and choose the delta distribution (a point mass) as the functional form of each.
```python
# Implementation of q(Z)
def guide(temperature, sales=None):
    # Declare the variational parameters and their initial values;
    # they are registered in Pyro's global parameter store
    intercept_q = pyro.param("intercept_q", torch.tensor(0.))
    beta_q = pyro.param("beta_q", torch.tensor(0.))
    sigma2_q = pyro.param("sigma2_q", torch.tensor(10.),
                          constraint=dist.constraints.positive)
    # Point-mass (delta) distributions; site names match those in model
    pyro.sample("intercept", dist.Delta(intercept_q))
    pyro.sample("beta", dist.Delta(beta_q))
    pyro.sample("sigma2", dist.Delta(sigma2_q))
```
If, for example, you want a normal distribution instead of a delta distribution, declare its mean ($\mu$) and standard deviation ($\sigma$) with `pyro.param` and draw the latent variable from `dist.Normal` with `pyro.sample`. In this way, you can flexibly change how the posterior distribution is approximated just by changing the `guide`.
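As a sketch, a guide that approximates the intercept with a normal distribution could look like this (the parameter names `intercept_mu` and `intercept_sigma` are made up for illustration):

```python
def normal_guide(temperature, sales=None):
    # Variational parameters of q(intercept) = Normal(mu, sigma)
    mu = pyro.param("intercept_mu", torch.tensor(0.))
    sigma = pyro.param("intercept_sigma", torch.tensor(1.),
                       constraint=dist.constraints.positive)
    pyro.sample("intercept", dist.Normal(mu, sigma))
    # "beta" and "sigma2" would be declared in the same way
```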
By the way, when all latent variables are given independent delta distributions as in the code example, writing

```python
guide = infer.autoguide.guides.AutoDelta(model)
```

gives the same result. `infer.autoguide.guides` provides convenient functions that create a `guide` method automatically (see the [autoguide documentation](http://docs.pyro.ai/en/0.2.1-release/contrib.autoguide.html)).
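The same module offers other automatic guides as well; for instance, a mean-field normal approximation is one line (a sketch, not used in this article's experiments):

```python
guide = infer.autoguide.guides.AutoDiagonalNormal(model)
```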
Once the `model` and `guide` methods are written, the posterior distribution can be estimated easily.
```python
from collections import defaultdict
from tqdm import tqdm

# Record the value of each parameter being estimated at every step
def update_param_dict(param_dict):
    for name in pyro.get_param_store():
        param_dict[name].append(pyro.param(name).item())

# Run variational inference
adam_params = {"lr": 1e-1, "betas": (0.95, 0.999)}
optimizer = pyro.optim.Adam(adam_params)
svi = infer.SVI(model, guide, optimizer, loss=infer.JitTrace_ELBO())

n_steps = 10000
losses = []
param_dict = defaultdict(list)
for step in tqdm(range(n_steps)):
    loss = svi.step(temperature, sales)
    update_param_dict(param_dict)
    losses.append(loss)

for name in pyro.get_param_store():
    print("{}: {}".format(name, pyro.param(name)))
```
```
intercept_q: 21.170936584472656
beta_q: 2.459129810333252
sigma2_q: 16.684457778930664
```
With an intercept of about 21, a slope of about 2.5, and a noise scale of about 17, the estimates are reasonable and consistent with the MCMC results. The figure below also shows that the loss and every parameter being updated have converged.
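Such convergence plots can be drawn from `losses` and `param_dict` along these lines (a sketch):

```python
import matplotlib.pyplot as plt

fig, (ax_loss, ax_params) = plt.subplots(1, 2, figsize=(10, 4))
ax_loss.plot(losses)
ax_loss.set(title="ELBO loss", xlabel="step")
for name, values in param_dict.items():
    ax_params.plot(values, label=name)
ax_params.set(title="Variational parameters", xlabel="step")
ax_params.legend()
plt.show()
```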
Now that we have confirmed that the posterior distribution can be estimated by variational inference, let us look at Pyro's scalability on a GPU, which is one of its greatest advantages.
The easiest way to use the GPU with Pyro is to write

```python
torch.set_default_tensor_type(torch.cuda.FloatTensor)
```

at the beginning of the code, so that every tensor created afterwards is placed on the GPU from the start.
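If the same code should also run on machines without a GPU, the switch can be guarded (a sketch):

```python
import torch

if torch.cuda.is_available():
    torch.set_default_tensor_type(torch.cuda.FloatTensor)
```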
In the example above, the number of samples was 100. How does the time required for variational inference grow as the sample size increases to 1,000, 10,000, and beyond?

Here, treating the intercept, slope, and noise scale estimated from the 100 samples as the true values, we generated 1,000, 10,000, ... samples from this provisional true distribution, used them as the observed data, and compared the time required for variational inference on the CPU and on the GPU.
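The data generation step can be sketched as follows; the rounded "true" parameter values and the temperature range are assumptions for illustration:

```python
# Draw n observations from the provisional true distribution
# (parameter values rounded from the estimates above; the
# temperature range 10-30 degrees is an assumption)
n = 1_000_000
true_intercept, true_beta, true_sigma = 21.0, 2.5, 17.0
temperature_big = dist.Uniform(10., 30.).sample((n,))
sales_big = true_intercept + true_beta * temperature_big \
            + true_sigma * torch.randn(n)
```

The measured times are summarized below.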
| Number of samples | CPU (seconds) | GPU (seconds) |
|---|---|---|
| $10^2$ | 1.34 | 3.66 |
| $10^3$ | 1.4 | 1.95 |
| $10^4$ | 1.51 | 1.95 |
| $10^5$ | 4.2 | 1.96 |
| $10^6$ | 30.41 | 2.26 |
| $10^7$ | 365.36 | 11.41 |
| $10^8$ | 3552.44 | 104.62 |
As the table shows, for sample sizes up to 10,000 the CPU is faster, since the benefit of parallelization does not yet outweigh the overhead. Beyond 100,000 samples the gap widens, and at $10^8$ samples the GPU is more than 30 times faster. With a GPU, statistical modeling remains stress-free even on very large data.
This article introduced how to analyze data with a simple regression model in Pyro. The process by which the assumed statistical model generates data is described in the `model` method, and the posterior distribution can then be estimated by MCMC or by variational inference. We also experimented with GPU acceleration and confirmed that Pyro enables scalable analysis.