The previous article gave an overview of the probabilistic programming language Pyro and showed how to draw samples with it. This time we use Pyro to implement the first example (Chapter 3, Part 2: the simple regression model) from "Introduction to Data Analysis by Bayesian Statistical Modeling Beginning with R and Stan" (Practical Data Science Series; hereinafter, the reference book). The implementation is available at 3-2 Simple Regression Model (Google Colaboratory).
The data handled this time (from Section 3-2 of the reference book) is the **relationship between beer sales and temperature**. Our statistical modeling builds a model that explains sales by temperature. A sample of the actual data and a scatter plot are shown below.
```python
import pandas as pd

# Load the data
beer_sales_2 = pd.read_csv(
    'https://raw.githubusercontent.com/logics-of-blue/book-r-stan-bayesian-model-intro/master/book-data/3-2-1-beer-sales-2.csv'
)  # shape = (100, 2)
beer_sales_2.head()
```
|  | sales | temperature |
|---|---|---|
| 0 | 41.68 | 13.7 |
| 1 | 110.99 | 24 |
| 2 | 65.32 | 21.5 |
| 3 | 72.64 | 13.4 |
| 4 | 76.54 | 28.9 |
```python
import matplotlib.pyplot as plt

# Visualize the data
beer_sales_2.plot.scatter('temperature', 'sales')
plt.title("Relationship between beer sales and temperature")
plt.show()
```
There appears to be a positive correlation.
Below, we describe a model that explains sales in terms of temperature using simple regression.
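Concretely, the model takes the standard simple-regression form (restated here to match the code below; the noise scale $\sigma$ corresponds to the variable `sigma2` in the code):

$$
\text{sales}_i \sim \text{Normal}(\text{Intercept} + \beta \times \text{temperature}_i,\ \sigma)
$$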
In Pyro, model description and parameter estimation proceed as follows.

The process by which the assumed statistical model generates data is described in a single method, conventionally named `model`. When writing it, take care to satisfy the following points:

* Declare random variables with `pyro.sample` and conditionally independent observations with `pyro.plate`, both introduced in the previous article.
* Pass the observed data to the observation site via the `obs=` argument, so that the same method can both generate data and condition on observations.

```python
import torch
import pyro
import pyro.distributions as dist

def model(temperature: torch.Tensor, sales: torch.Tensor = None):
    # Weakly informative priors on the regression parameters
    intercept = pyro.sample("intercept", dist.Normal(0, 1000))
    beta = pyro.sample("beta", dist.Normal(0, 1000))
    # Note: dist.Normal takes the standard deviation as its scale parameter,
    # so "sigma2" acts as the noise standard deviation here
    sigma2 = pyro.sample("sigma2", dist.Uniform(0, 1000))
    # Each of the temperature.size(0) observations is conditionally independent
    with pyro.plate("plate", size=temperature.size(0)):
        y_ = intercept + beta * temperature
        y = pyro.sample("y", dist.Normal(y_, sigma2), obs=sales)
    return y
```
All four variables (`intercept`, `beta`, `sigma2`, `y`) defined by `pyro.sample` in `model` are treated as random variables. Among them, those that do not correspond to observed data (everything except `y`) are latent variables, and they are the targets of the posterior estimation performed later.
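Because `sales` defaults to `None`, the same `model` can also be run generatively. A minimal sketch (the temperature values below are made up for illustration):

```python
# With sales=None, "y" is drawn from the prior predictive distribution
# instead of being conditioned on data (hypothetical inputs)
fake_temperature = torch.tensor([15., 20., 25.])
prior_sales = model(fake_temperature)
print(prior_sales)  # three simulated sales values
```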
After writing the `model` method for the statistical model, the posterior distribution of the latent variables is estimated. Two estimation methods are mainly provided:

* MCMC
* Variational inference

The classes and functions required for estimation live in `pyro.infer`.
To run the estimation with the NUTS kernel, write the following.
```python
import pyro.infer as infer

# Convert the explanatory and observed variables to torch.Tensor
temperature = torch.Tensor(beer_sales_2.temperature)
sales = torch.Tensor(beer_sales_2.sales)

# Estimate the posterior distribution
nuts_kernel = infer.NUTS(model, adapt_step_size=True, jit_compile=True, ignore_jit_warnings=True)
mcmc = infer.MCMC(nuts_kernel, num_samples=3000, warmup_steps=200, num_chains=3)
mcmc.run(temperature, sales)
mcmc.summary()  # prints the posterior summary shown below
```
|  | mean | std | median | 5.0% | 95.0% | n_eff | r_hat |
|---|---|---|---|---|---|---|---|
| intercept | 21.00 | 6.03 | 20.99 | 11.01 | 30.80 | 2869.59 | 1.00 |
| beta | 2.47 | 0.29 | 2.47 | 1.98 | 2.94 | 2866.62 | 1.00 |
| sigma2 | 17.06 | 1.22 | 16.98 | 15.02 | 19.00 | 4014.61 | 1.00 |
The posterior distributions of the latent variables in the `model` method, namely the intercept, the slope (`beta`), and the noise scale (`sigma2`), have been estimated. The intercept is about 21 and the slope about 2.5, in good agreement with the results in the reference book.
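Beyond the summary table, the raw posterior draws can be retrieved with `MCMC.get_samples` for further analysis. A minimal sketch:

```python
# Posterior draws as a dict mapping site names to tensors
posterior = mcmc.get_samples()
print(posterior["beta"].shape)          # (num_chains * num_samples,)
print(posterior["beta"].mean().item())  # should be close to 2.47
```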
Pyro can also estimate the posterior by variational inference. In variational inference, the variational distribution $q(Z)$ over the latent variables is described in a method conventionally named `guide`. Looking at the `model` method, there are three latent variables to estimate in this problem: `intercept`, `beta`, and `sigma2`, so the process generating these three variables is what the `guide` method must describe. However, writing the guide arbitrarily will not bring $q(Z)$ close to $p(Z|X)$. In Pyro, variational inference works by updating parameters declared with `pyro.param` so that $q(Z)$ approaches $p(Z|X)$, using PyTorch's automatic differentiation and gradient descent. With this in mind, how to write the `guide` method is summarized below.
* Declare the parameters to be updated with `pyro.param` inside the `guide` method. Constraints such as positivity are specified via `dist.constraints`.
* Draw each latent variable with `pyro.sample` from a probability distribution (`dist.*`) parameterized by those parameters.
* Give each sample site the same name as the corresponding one in the `model` method.

The following is an example implementation of `guide` for this problem. Here we assume the three latent variables are independent and choose the delta distribution (a point mass) as the functional form of each.
```python
# Implementation of q(Z)
def guide(temperature, sales=None):
    # Declare the variational parameters and their initial values;
    # they are registered in Pyro's global parameter store
    intercept_q = pyro.param("intercept_q", torch.tensor(0.))
    beta_q = pyro.param("beta_q", torch.tensor(0.))
    sigma2_q = pyro.param("sigma2_q", torch.tensor(10.),
                          constraint=dist.constraints.positive)
    # Point-mass (delta) distributions; site names match those in model
    pyro.sample("intercept", dist.Delta(intercept_q))
    pyro.sample("beta", dist.Delta(beta_q))
    pyro.sample("sigma2", dist.Delta(sigma2_q))
```
If, for example, you want a normal distribution instead of a delta distribution, declare its mean ($\mu$) and standard deviation ($\sigma$) with `pyro.param` and draw the latent variable from `dist.Normal` with `pyro.sample`. In this way, you can flexibly change how the posterior distribution is approximated just by changing the `guide`.
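As a sketch, a guide that approximates the intercept with a normal distribution could look like this (the parameter names `intercept_mu` and `intercept_sigma` are made up for illustration):

```python
def normal_guide(temperature, sales=None):
    # Variational parameters of q(intercept) = Normal(mu, sigma)
    mu = pyro.param("intercept_mu", torch.tensor(0.))
    sigma = pyro.param("intercept_sigma", torch.tensor(1.),
                       constraint=dist.constraints.positive)
    pyro.sample("intercept", dist.Normal(mu, sigma))
    # "beta" and "sigma2" would be declared in the same way
```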
By the way, when all latent variables are given independent delta distributions as in the code example, writing

```python
guide = infer.autoguide.guides.AutoDelta(model)
```

gives the same result. `infer.autoguide.guides` provides convenient functions that create a `guide` method automatically (see the [autoguide documentation](http://docs.pyro.ai/en/0.2.1-release/contrib.autoguide.html)).
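The same module offers other automatic guides as well; for instance, a mean-field normal approximation is one line (a sketch, not used in this article's experiments):

```python
guide = infer.autoguide.guides.AutoDiagonalNormal(model)
```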
Once the `model` and `guide` methods are written, the posterior distribution can be estimated easily.
```python
from collections import defaultdict
from tqdm import tqdm

# Record the value of each parameter being estimated at every step
def update_param_dict(param_dict):
    for name in pyro.get_param_store():
        param_dict[name].append(pyro.param(name).item())

# Run variational inference
adam_params = {"lr": 1e-1, "betas": (0.95, 0.999)}
optimizer = pyro.optim.Adam(adam_params)
svi = infer.SVI(model, guide, optimizer, loss=infer.JitTrace_ELBO())

n_steps = 10000
losses = []
param_dict = defaultdict(list)
for step in tqdm(range(n_steps)):
    loss = svi.step(temperature, sales)
    update_param_dict(param_dict)
    losses.append(loss)

for name in pyro.get_param_store():
    print("{}: {}".format(name, pyro.param(name)))
```
```
intercept_q: 21.170936584472656
beta_q: 2.459129810333252
sigma2_q: 16.684457778930664
```
With an intercept of about 21, a slope of about 2.5, and a noise scale of about 17, the estimates are reasonable and consistent with the MCMC results. The figure below also shows that the loss and every parameter being updated have converged.
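Such convergence plots can be drawn from `losses` and `param_dict` along these lines (a sketch):

```python
import matplotlib.pyplot as plt

fig, (ax_loss, ax_params) = plt.subplots(1, 2, figsize=(10, 4))
ax_loss.plot(losses)
ax_loss.set(title="ELBO loss", xlabel="step")
for name, values in param_dict.items():
    ax_params.plot(values, label=name)
ax_params.set(title="Variational parameters", xlabel="step")
ax_params.legend()
plt.show()
```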
Now that we have confirmed that the posterior distribution can be estimated by variational inference, let us look at Pyro's scalability on a GPU, which is one of its greatest advantages.
The easiest way to use the GPU with Pyro is to write

```python
torch.set_default_tensor_type(torch.cuda.FloatTensor)
```

at the beginning of the code, so that every tensor created afterwards is placed on the GPU from the start.
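If the same code should also run on machines without a GPU, the switch can be guarded (a sketch):

```python
import torch

if torch.cuda.is_available():
    torch.set_default_tensor_type(torch.cuda.FloatTensor)
```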
In the example above, the number of samples was 100. How does the time required for variational inference grow as the sample size increases to 1,000, 10,000, and beyond?

Here, treating the intercept, slope, and noise scale estimated from the 100 samples as the true values, we generated 1,000, 10,000, ... samples from this provisional true distribution, used them as the observed data, and compared the time required for variational inference on the CPU and on the GPU.
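The data generation step can be sketched as follows; the rounded "true" parameter values and the temperature range are assumptions for illustration:

```python
# Draw n observations from the provisional true distribution
# (parameter values rounded from the estimates above; the
# temperature range 10-30 degrees is an assumption)
n = 1_000_000
true_intercept, true_beta, true_sigma = 21.0, 2.5, 17.0
temperature_big = dist.Uniform(10., 30.).sample((n,))
sales_big = true_intercept + true_beta * temperature_big \
            + true_sigma * torch.randn(n)
```

The measured times are summarized below.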
| Number of samples | CPU (seconds) | GPU (seconds) |
|---|---|---|
| $10^2$ | 1.34 | 3.66 |
| $10^3$ | 1.4 | 1.95 |
| $10^4$ | 1.51 | 1.95 |
| $10^5$ | 4.2 | 1.96 |
| $10^6$ | 30.41 | 2.26 |
| $10^7$ | 365.36 | 11.41 |
| $10^8$ | 3552.44 | 104.62 |
As the table shows, for sample sizes up to 10,000 the CPU is faster, since the benefit of parallelization does not yet outweigh the overhead. Beyond 100,000 samples the gap widens, and at $10^8$ samples the GPU is more than 30 times faster. With a GPU, statistical modeling remains stress-free even on very large data.
This article introduced how to analyze data with a simple regression model in Pyro. The process by which the assumed statistical model generates data is described in the `model` method, and the posterior distribution can then be estimated by MCMC or by variational inference. We also experimented with GPU acceleration and confirmed that Pyro enables scalable analysis.