As you may know, the Python MCMC (Markov Chain Monte Carlo) library **PyMC3** is built on top of the numerical computation library **Theano**. This combination is said to give it both the flexibility to handle complicated problems and sufficient computing power, but unfortunately **Theano** itself is quite hard for a casual user: its approach of first defining operations on symbols and only later binding numerical values to them takes some getting used to. Here I introduce a trick for debugging problems that come up in PyMC3 simulation work without having to touch **Theano**'s own functions.
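To illustrate that "symbols first, values later" style, here is a minimal Theano sketch of my own, independent of the PyMC3 example below:

```python
import theano
import theano.tensor as T

# 1. define symbolic variables and an expression (no numbers are involved yet)
x = T.dscalar('x')
y = x ** 2 + 1.0

# 2. compile the expression into a callable function
f = theano.function([x], y)

# 3. only now are numerical values bound and computed
print(f(3.0))   # -> 10.0
```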
The general workflow of a simulation is as follows.

1. Data set preparation
2. Statistical model description
3. MCMC calculation
4. Output of calculation results (summary and trace plot)
The step that needs attention is the "statistical model description". A simple model is easy to build by following the tutorial, but as the model gets more complicated the calculation may fail, or the results may come out strange.
There is no doubt that printing the intermediate values of variables is a great help for debugging, but under the Theano & PyMC3 environment this is not easy.
Take the following code, a logistic regression simulation, as an example.
```python
import numpy as np
import pymc3 as pm

# 1. Data set preparation
n_betl = np.array([59, 60, 62, 56, 63, 59, 62, 60])
y_betl = np.array([6, 13, 18, 28, 52, 53, 61, 60])
x1 = np.array([1.6907, 1.7242, 1.7552, 1.7842,
               1.8113, 1.8369, 1.8610, 1.8839])

# 2. Statistical model description
mymodel1 = pm.Model()

def invlogit(x):
    return pm.exp(x) / (1 + pm.exp(x))

with mymodel1:  # model definition
    mytheta = pm.Normal('theta', mu=0, sd=32, shape=2)  # 2-dim array
    p = invlogit(mytheta[0] + mytheta[1] * x1)
    y_obs = pm.Binomial('y_obs', n=n_betl, p=p, observed=y_betl)

# 3. MCMC calculation
with mymodel1:  # running simulation
    start = pm.find_MAP()
    step = pm.NUTS()
    trace1 = pm.sample(10000, step, start=start)

# 4. Calculation result output ... summary and trace plot
pm.summary(trace1)
pm.traceplot(trace1)
```
What I often want to do is print the values of variables at each step of the MCMC calculation, but where should such a statement go? Intuitively it belongs in the MCMC calculation part, but the whole calculation is triggered by this single line:
```python
trace1 = pm.sample(10000, step, start=start)
```
so there is nowhere to insert it. I then wondered whether something could be done in the preceding "statistical model description", searched [Theano's documentation](http://deeplearning.net/software/theano/tutorial/debug_faq.html#how-do-i-print-an-intermediate-value-in-a-function), and added the lines below (the relevant part is **theano.printing.Print()**).
```python
import theano

# (omitted)

with mymodel1:  # model definition
    mytheta = pm.Normal('theta', mu=0, sd=32, shape=2)  # 2-dim array
    # I want to output logit_p in the middle of the MCMC calculation ...
    logit_p = mytheta[0] + mytheta[1] * x1
    my_printed = theano.printing.Print('logit_p = ')(logit_p)
    p = invlogit(mytheta[0] + mytheta[1] * x1)
    y_obs = pm.Binomial('y_obs', n=n_betl, p=p, observed=y_betl)
```
As a result, only the initial value, all zeros, was printed:
```
logit_p = __str__ = [ 0.  0.  0.  0.  0.  0.  0.  0.]
```
So this is of no use for debugging. It seems that trying to "print" from inside the statistical model description is not the way to go after all.
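For what it's worth, my reading of the Theano documentation (not something verified in this article) is that `Print()` is an identity op that only fires when its output is actually used in the computation graph; since `my_printed` above is never used downstream, it is only evaluated once, on the test value, at model construction time. A minimal sketch of wiring it into the likelihood might look like this, where `debug_model` is a hypothetical separate model reusing `x1`, `n_betl`, `y_betl` and `invlogit` from above:

```python
import theano
import pymc3 as pm

with pm.Model() as debug_model:  # hypothetical model, just for illustration
    mytheta = pm.Normal('theta', mu=0, sd=32, shape=2)
    logit_p = mytheta[0] + mytheta[1] * x1
    # wrap the tensor, then use the wrapped version downstream so Print() becomes part of the graph
    logit_p = theano.printing.Print('logit_p = ')(logit_p)
    p = invlogit(logit_p)
    y_obs = pm.Binomial('y_obs', n=n_betl, p=p, observed=y_betl)

# caveat: this would print at (more or less) every logp evaluation,
# which floods the console during sampling
```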
As another approach, I came up with the idea of logging by adding an extra variable to the PyMC trace. The main variables of the statistical model are recorded in the trace at every step, so I simply added the variable I wanted to watch.
```python
with mymodel1:  # model definition
    mytheta = pm.Normal('theta', mu=0, sd=32, shape=2)  # 2-dim array
    p = invlogit(mytheta[0] + mytheta[1] * x1)
    # for debug
    logit_p = pm.Deterministic('logit_p', (mytheta[0] + mytheta[1] * x1))
    y_obs = pm.Binomial('y_obs', n=n_betl, p=p, observed=y_betl)
```
As shown above, I used **pm.Deterministic()**. Its original purpose is to define a variable that is derived deterministically from other variables, but a useful side effect is that the variable registered under the name passed as the first argument ('logit_p' here) becomes part of the trace. After the calculation, the trace can therefore be consulted as debug information. Of course you can refer to other variables that make up mymodel1 in the same way, or register additional formulas and record their results.
```python
>>> mylog = trace1['logit_p']
>>> mylog[-5:]        # show only the last 5 steps
array([[-2.60380812, -1.54381452, -0.56292492,  0.35468148,  1.21216885,
         2.02219381,  2.78475637,  3.50934901],
       [-2.44719254, -1.38360671, -0.39939295,  0.52132315,  1.38171646,
         2.19448653,  2.95963336,  3.68668158],
       [-2.73650878, -1.64559829, -0.63609903,  0.30827125,  1.19076899,
         2.02441999,  2.80922426,  3.55495113],
       [-2.71951223, -1.62859273, -0.61908514,  0.32529294,  1.20779796,
         2.04145585,  2.82626659,  3.57199962],
       [-2.51597252, -1.42300483, -0.41160189,  0.53454925,  1.41871117,
         2.25393425,  3.04021847,  3.78735161]])
```
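Since the trace is ordinary NumPy data, you can also apply further formulas to the logged values after the fact. As a small sketch of my own (not part of the original workflow), the logged logit_p can be pushed back through the inverse logit and its posterior mean compared with the observed proportions:

```python
import numpy as np

logit_samples = trace1['logit_p']                  # shape: (draws, 8)
p_samples = 1.0 / (1.0 + np.exp(-logit_samples))   # inverse logit applied to every draw
print(p_samples.mean(axis=0))                      # posterior mean of p for each x1
print(y_betl / n_betl)                             # observed proportions, for comparison
```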
And since it is part of the trace, it can also be plotted with traceplot().
In addition to the regression parameter theta that the simulation was after in the first place, logit_p now appears in the output. (It was a small surprise that logit_p is a vector of size 8, one element per data point.)
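If the extra debug variables start to clutter the plots, traceplot can be restricted to the variables of interest. The sketch below assumes a PyMC3 3.x release where the keyword is `varnames` (more recent releases renamed it to `var_names`):

```python
# plot only the debug variable; use var_names instead of varnames on recent PyMC3 releases
pm.traceplot(trace1, varnames=['logit_p'])
```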
By adding more variables to the trace in this way, I was able to obtain more debug information. In MCMC modeling work the libraries rarely give very helpful error messages, so debugging is quite tiring (and maddening at times). With this method the situation has improved a little, and I feel I can keep going.
This time the approach was to cope without "going down" into Theano, but Theano is said to be an effective tool for machine learning well beyond PyMC3, so even though it is hard I want to keep investigating it little by little without giving up. I would also like to build up more tips for Python debugging in general.