This article uses Stan to fit a logistic regression to the Titanic data and then evaluates the classification performance.
The probabilistic programming language Stan used in this article estimates distribution parameters with the Hamiltonian Monte Carlo method (HMC) and NUTS. Strictly speaking, the principle of random number generation is different, but a slightly simpler Markov chain Monte Carlo method is the Metropolis-Hastings method (MH). I have written two articles (@kenmatsu4) about how these work, so please refer to them if you like. If the goal is just to get an image of what they are doing, the MH method and the HMC method are not so different (the main difference is random number sampling efficiency).
The full code is posted on GitHub.
I was worried that the installation might not go smoothly, but it was simple and just worked:
conda install pystan
import numpy as np
import numpy.random as rd
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import animation as ani
import matplotlib.cm as cm
import seaborn as sns
sns.set(style="whitegrid", palette="muted", color_codes=True)
from tabulate import tabulate
from time import time
import pystan
from pystan import StanModel
from sklearn import cross_validation
from scipy.stats import gaussian_kde
This article uses the Titanic data that was already converted to numeric values in [Machine learning] Summary and execution of model evaluation / metrics (with Titanic data set). It is available on GitHub, so please feel free to use it as well.
titanic = pd.read_table("https://raw.githubusercontent.com/matsuken92/Qiita_Contents/master/PyStan-Titanic/data/titanic_converted.csv",
sep=",", header=0)
titanic.head(10)
ID | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alone |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | 1 | 22 | 1 | 0 | 7.25 | 1 | 3 | 1 | 1 | 0 | 3 | 1 |
1 | 1 | 1 | 0 | 38 | 1 | 0 | 71.2833 | 2 | 1 | 2 | 0 | 3 | 1 | 0 |
2 | 1 | 3 | 0 | 26 | 0 | 0 | 7.925 | 1 | 3 | 2 | 0 | 0 | 3 | 0 |
3 | 1 | 1 | 0 | 35 | 1 | 0 | 53.1 | 1 | 1 | 2 | 0 | 3 | 3 | 0 |
4 | 0 | 3 | 1 | 35 | 0 | 0 | 8.05 | 1 | 3 | 1 | 1 | 0 | 3 | 1 |
5 | 0 | 3 | 1 | 28 | 0 | 0 | 8.4583 | 3 | 3 | 1 | 1 | 0 | 2 | 1 |
6 | 0 | 1 | 1 | 54 | 0 | 0 | 51.8625 | 1 | 1 | 1 | 1 | 5 | 3 | 1 |
7 | 0 | 3 | 1 | 2 | 3 | 1 | 21.075 | 1 | 3 | 0 | 0 | 0 | 3 | 0 |
8 | 1 | 3 | 0 | 27 | 0 | 2 | 11.1333 | 1 | 3 | 2 | 0 | 0 | 3 | 0 |
9 | 1 | 2 | 0 | 14 | 1 | 0 | 30.0708 | 2 | 2 | 0 | 0 | 0 | 1 | 0 |
Divide the data into the explanatory variables `data` and the objective variable `target`.
target = titanic.ix[:, 0]             # survived
data = titanic.ix[:, [1,2,3,4,5,6]]   # pclass, sex, age, sibsp, parch, fare
# Split into training data (80%) and test data (20%)
X_train, X_test, y_train, y_test = cross_validation.train_test_split(data, target, test_size=0.2, random_state=0)
print [d.shape for d in [X_train, X_test, y_train, y_test]]
out
[(712, 6), (179, 6), (712,), (179,)]
That is, the training data $X$ is a matrix with $N = 712$ records and $M = 6$ features, and each row vector $x_i$ is the data for one person. The six features are

pclass sex age sibsp parch fare

The teacher data is the Titanic survival information: survived = 1, did not return = 0, giving an $N$-dimensional vector $y$.
Now, given such data, the [logistic regression model (Wikipedia link)](https://ja.wikipedia.org/wiki/Logistic Regression) is expressed as follows:

\operatorname{logit} (p_{i}) = \ln \left({\frac {p_{i}}{1-p_{i}}}\right)
= \beta_0 + \beta_{1}x_{1,i} + \cdots + \beta_{M}x_{M,i} \\
(i=1, 2, \cdots, N)
$p_i$ can be regarded as the survival probability of the $i$-th record. Let's rearrange the formula a little to see why. First, apply $\exp(\cdot)$ to both sides:

\frac{p_{i}}{1-p_{i}} = \exp \left( \beta_0 + \beta_{1}x_{1,i} + \cdots + \beta_{M}x_{M,i} \right)

Solving this for $p_i$:

p_{i} = \frac{1}{1 + \exp \left( -(\beta_0 + \beta_{1}x_{1,i} + \cdots + \beta_{M}x_{M,i}) \right)}

Here we use the standard sigmoid function

\sigma(x) = \frac{1}{1 + e^{-x}}

which has the same shape as the formula above: replacing $x$ with $(\beta_0 + \beta_{1}x_{1,i} + \cdots + \beta_{M}x_{M,i})$ gives

p_{i} = \sigma(\beta_0 + \beta_{1}x_{1,i} + \cdots + \beta_{M}x_{M,i})

so $p_i$ can be treated as a probability.
It looks like this in a graph.
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x_range = np.linspace(-10, 10, 501)

plt.figure(figsize=(9,5))
plt.ylim(-0.1, 1.1)
plt.xlim(-10, 10)
plt.plot([-10,10],[0,0], "k", lw=1)
plt.plot([0,0],[-1,1.5], "k", lw=1)
plt.plot(x_range, sigmoid(x_range), zorder=100)
As you can see in the graph, the sigmoid function $\sigma(x)$ takes values in the range $(0, 1)$, so its output can be interpreted and used as a probability.
From the above: logistic regression takes the part that looks like an ordinary linear regression,

\beta_0 + \beta_{1}x_{1,i} + \cdots + \beta_{M}x_{M,i}

and, by making good use of the sigmoid function, arrives at the form

p_{i} = \sigma(\beta_0 + \beta_{1}x_{1,i} + \cdots + \beta_{M}x_{M,i}) \qquad (i=1, 2, \cdots, N)

whose value it uses by interpreting it as a probability.
This time, let's get acquainted with Stan by estimating the parameters $\beta_0, \beta_1, \cdots, \beta_M$ of this model.
Create a Dictionary object packed with training data for passing to Stan.
dat = {'N': X_train.shape[0], 'M': X_train.shape[1], 'X': X_train, 'y': y_train}
code = """
data {
int<lower=0> N;
int<lower=0> M;
matrix[N, M] X;
int<lower=0, upper=1> y[N];
} parameters {
real beta0;
vector[M] beta;
} model {
for (i in 1:N)
y[i] ~ bernoulli(inv_logit (beta0 + dot_product(X[i] , beta)));
}
"""
Stan code is divided into several blocks; this time it consists of three: `data`, `parameters`, and `model`. Let's look at each one.
data {
    int<lower=0> N;
    int<lower=0> M;
    matrix[N, M] X;
    int<lower=0, upper=1> y[N];
}
This block plays a placeholder-like role: the values are packed in from the Python side. Since on the Python side we assigned values with

dat = {'N': X_train.shape[0], 'M': X_train.shape[1], 'X': X_train, 'y': y_train}

they can be used on the Stan side. This time the variables are used as follows: `N` is the number of records, `M` is the number of features, `X` is the matrix of explanatory variables, and `y` is the objective variable (survival).
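The dictionary values can be pandas objects, but if dtype complaints ever come up, one workaround (just a sketch, using the same `X_train` / `y_train` as above) is to hand PyStan plain numpy arrays with explicit types:

import numpy as np

# A sketch: pack plain numpy arrays with explicit dtypes
dat = {'N': X_train.shape[0],                  # number of records (712)
       'M': X_train.shape[1],                  # number of features (6)
       'X': np.asarray(X_train, dtype=float),  # N x M matrix of explanatory variables
       'y': np.asarray(y_train, dtype=int)}    # N-vector of 0/1 survival labels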
parameters {
    real beta0;
    vector[M] beta;
}
The `parameters` block declares the variables you want to estimate, such as the parameters of a probability distribution. This time we want to estimate the regression parameters (the intercept $\beta_0$ and the regression coefficients $\beta_i\ (i = 1, 2, \cdots, M)$), so we declare them here.
The `model` block is the key: the declaration of the statistical model.
model {
    for (i in 1:N)
        y[i] ~ bernoulli(inv_logit(beta0 + dot_product(X[i], beta)));
}
Several Stan functions appear here, so let me introduce them first.
real dot_product(vector x, vector y)
Stan Manual v2.9 p358
Computes the inner product. Here it corresponds to the $\beta_{1}x_{1,i} + \cdots + \beta_{M}x_{M,i}$ part of the formula: each record's features and the parameters are multiplied elementwise and summed.
real inv_logit(real y)
Stan Manual v2.9 p338
A function that computes

\operatorname{inv\_logit}(y) = \frac{1}{1 + \exp(-y)}

In other words, it is the standard sigmoid function.
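As a quick sanity check (just a sketch using `scipy.special.expit`, SciPy's implementation of the same function), you can confirm that this matches the `sigmoid` defined earlier:

import numpy as np
from scipy.special import expit  # SciPy's standard sigmoid

x = np.linspace(-5, 5, 11)
# inv_logit(y) = 1 / (1 + exp(-y)), identical to the sigmoid plotted above
print(np.allclose(expit(x), 1 / (1 + np.exp(-x))))  # True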
y ~ bernoulli(theta)
Stan Manual v2.9 p384
An expression written with `~` is called a sampling statement. It is actually executed as the following statement:

increment_log_prob(bernoulli_log(y, theta));

Since we are computing the posterior probability by Bayesian estimation, `y` is given as the observed data and `theta` as the parameter to be estimated. Here `theta` is the `inv_logit(beta0 + dot_product(X[i], beta))` part explained earlier: the explanatory variables are linearly combined with the parameters $\beta_i$ and passed as the argument of the standard sigmoid function.
In summary, this one line computes the log-likelihood. The probability function of the Bernoulli distribution is

f(y_i \mid \theta_i) = \theta_i^{y_i} (1-\theta_i)^{1-y_i}

and taking its logarithm gives

\ln f(y_i \mid \theta_i) = y_i \ln \theta_i + (1-y_i) \ln (1-\theta_i)

Furthermore, since there are $N$ data points, these log terms are added up:

\sum_{i=1}^{N} \left\{ y_i \ln \theta_i + (1-y_i) \ln (1-\theta_i) \right\}

This addition is the part processed by `increment_log_prob`: the terms are accumulated while looping through the `for` statement (hence "increment").
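To make the accumulation concrete, here is a minimal numpy sketch of the same computation, with hypothetical fixed parameter values `beta0_hat` and `beta_hat` (for example, the EAP estimates obtained later); `X` is assumed to be an $N \times M$ numpy array and `y` a 0/1 vector:

import numpy as np

# A sketch of what the model block accumulates: the Bernoulli log-likelihood
# summed over all N records, for fixed parameter values
def log_likelihood(X, y, beta0_hat, beta_hat):
    total = 0.0
    for i in range(len(y)):
        theta = 1 / (1 + np.exp(-(beta0_hat + np.dot(X[i], beta_hat))))  # inv_logit
        total += y[i] * np.log(theta) + (1 - y[i]) * np.log(1 - theta)   # log Bernoulli pmf
    return total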
Also, no prior distribution is explicitly set for the parameters `beta0` and `beta`; in that case Stan uses a non-informative prior (a uniform distribution over the range $(-\infty, \infty)$).
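If you did want explicit priors, the model block could be extended as below. This is just a sketch with hypothetical weakly-informative normal priors, not what this article actually runs:

code_with_priors = """
data {
    int<lower=0> N;
    int<lower=0> M;
    matrix[N, M] X;
    int<lower=0, upper=1> y[N];
}
parameters {
    real beta0;
    vector[M] beta;
}
model {
    beta0 ~ normal(0, 100);  // weakly-informative prior on the intercept
    beta ~ normal(0, 100);   // weakly-informative priors on the coefficients
    for (i in 1:N)
        y[i] ~ bernoulli(inv_logit(beta0 + dot_product(X[i], beta)));
}
"""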
With the above Stan code, everything needed to compute a value proportional to the posterior distribution (likelihood × prior) is in place. Stan uses this to generate random numbers with HMC and NUTS.
Reference: Stan Manual v2.9, 26.3. Sampling Statements; 6.2. Priors for Coefficients and Scales
Stan translates the model code into C++ code, compiles it, and runs the compiled code to speed things up. Since compiling the code every time is inefficient, once the model code is settled it is convenient to compile it once and keep it as a Python object.
# Build Stan model
%time stm = StanModel(model_code=code)
out
CPU times: user 1.36 s, sys: 113 ms, total: 1.48 s
Wall time: 15.4 s
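To avoid recompiling in a later session as well, one common pattern is to save the compiled model to disk with Python's standard `pickle`; a sketch (the file name is arbitrary):

import pickle

# Save the compiled model once...
with open('logistic_model.pkl', 'wb') as f:
    pickle.dump(stm, f)

# ...and reload it later without recompiling
with open('logistic_model.pkl', 'rb') as f:
    stm = pickle.load(f)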
Once the model build is complete, it's time to sample. This is where most of the time goes. The settings related to the number of samples are: `n_itr = 3000` iterations per chain, of which the first `n_warmup = 200` are discarded as warm-up[^1], and `chains = 2` chains, so sampling runs until (3000 − 200) × 2 = 5600 post-warmup samples are drawn. For the execution settings, the NUTS algorithm is used and the chains are run in parallel (`n_jobs=-1`).
[^1]: According to BDA3 (Bayesian Data Analysis, 3rd edition, p.282), the term burn-in can be misleading, so calling it warm-up is recommended.
n_itr = 3000
n_warmup = 200
chains = 2

# Sampling
%time fit = stm.sampling(data=dat, iter=n_itr, chains=chains, n_jobs=-1, warmup=n_warmup, algorithm="NUTS", verbose=False)

# Extract the sampled columns
la = fit.extract(permuted=True)  # returns a dictionary of arrays

# Parameter names
names = fit.model_pars

# Number of parameters
n_param = np.sum([1 if len(x) == 0 else x[0] for x in fit.par_dims])
It took about 5 minutes to execute the sampling.
out
CPU times: user 156 ms, sys: 59.1 ms, total: 215 ms
Wall time: 5min 24s
Displays a summary of sampling results.
print fit
You can see the mean, standard error, standard deviation, and percentiles for each parameter; information about the likelihood is also shown at the bottom. Since `Rhat`, the value used to judge convergence of the sampling results, is 1.0, the distributions can be considered sufficiently converged. (In general, an Rhat of 1.1 or less is judged to mean sufficient convergence[^2].)
[^2]: Reference: a memo with citations on the specific Rhat threshold for MCMC convergence diagnosis.
out
Inference for Stan model: anon_model_b12e3f2368679a0c562b9f74618b2f82.
2 chains, each with iter=3000; warmup=200; thin=1;
post-warmup draws per chain=2800, total post-warmup draws=5600.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta0 5.02 8.0e-3 0.6 3.89 4.61 5.01 5.42 6.23 5600.0 1.0
beta[0] -1.04 2.1e-3 0.15 -1.35 -1.15 -1.04 -0.93 -0.75 5600.0 1.0
beta[1] -2.77 3.0e-3 0.23 -3.22 -2.92 -2.76 -2.61 -2.33 5600.0 1.0
beta[2] -0.04 1.2e-4 8.9e-3 -0.06 -0.05 -0.04 -0.04 -0.03 5600.0 1.0
beta[3] -0.41 1.6e-3 0.12 -0.66 -0.49 -0.41 -0.33 -0.18 5600.0 1.0
beta[4] -0.08 1.8e-3 0.14 -0.35 -0.17 -0.08 0.02 0.18 5600.0 1.0
beta[5] 2.5e-3 3.4e-5 2.5e-3-2.3e-3 7.0e-4 2.4e-3 4.1e-3 7.7e-3 5600.0 1.0
lp__ -322.4 0.02 1.86 -326.8 -323.4 -322.0 -321.0 -319.7 5600.0 1.0
Samples were drawn using NUTS(diag_e) at Sun Dec 13 02:53:45 2015.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
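Rather than eyeballing the printout, the Rhat values can also be pulled out programmatically; a sketch assuming PyStan 2's `fit.summary()` layout:

summary = fit.summary()
rhat_idx = summary['summary_colnames'].index('Rhat')  # column position of Rhat
rhat = summary['summary'][:, rhat_idx]

# All parameters have converged if every Rhat is at most 1.1
print(dict(zip(summary['summary_rownames'], rhat)))
print((rhat <= 1.1).all())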
The expected value of each parameter's posterior distribution is called the EAP (Expected A Posteriori) estimate. The EAP estimate can be obtained by simply averaging the sampling results. Easy :grin:
# Extract the list of EAP estimates for each parameter
mean_list = np.array(fit.summary()['summary'])[:,0]
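Equivalently, since the EAP estimate is just the sample mean, you can average the extracted draws directly; a sketch using the `la` dictionary from above:

# EAP estimates by directly averaging the posterior draws
eap_beta0 = np.mean(la['beta0'])        # intercept
eap_beta = np.mean(la['beta'], axis=0)  # one mean per coefficient, shape (6,)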
Next, find the MAP (Maximum A Posteriori) estimates. Here I computed them by kernel density estimation, taking the value at which the estimated density is maximized. (If anyone knows an easier way, please let me know m(_ _)m)
# Generate the list of MAP estimates for each parameter
def find_MAP(data):
    kde = gaussian_kde(data)
    x_range = np.linspace(min(data), max(data), 501)
    eval_kde = kde.evaluate(x_range)
    #plt.plot(x_range, eval_kde)
    return x_range[np.argmax(eval_kde)]

MAP_list = []
MAP_list.append(find_MAP(la['beta0']))
for i in range(n_param-1):
    MAP_list.append(find_MAP(la['beta'][:,i]))
Finally, let's visualize the sampling results for each parameter. The left side shows the kernel density of the samples; the right side is the trace plot, which plots the random numbers in the order they were sampled.
f, axes = plt.subplots(n_param, 2, figsize=(15, 4*n_param))

cnt = 0
for name in names:
    dat = la[name]
    if dat.ndim == 2:
        for j in range(dat.shape[1]):
            d = dat[:,j]
            sns.distplot(d, hist=False, rug=True, ax=axes[cnt, 0])
            sns.tsplot(d, alpha=0.8, lw=.5, ax=axes[cnt, 1])
            cnt += 1
    else:
        # Intercept
        sns.distplot(dat, hist=False, rug=True, ax=axes[cnt, 0])
        sns.tsplot(dat, alpha=0.8, lw=.5, ax=axes[cnt, 1])
        cnt += 1

name_list = []
for name in names:
    if la[name].ndim == 2:
        for i in range(la[name].shape[1]):
            name_list.append("{}{}".format(name, i+1))
    else:
        name_list.append(name)

for i in range(2):
    for j, t in enumerate(name_list):
        axes[j, i].set_title(t)

plt.show()
Similarly, I plotted a graph for the likelihood.
# Likelihood
f, axes = plt.subplots(1, 2, figsize=(15, 4))
sns.distplot(la['lp__'], hist=False, rug=True, ax=axes[0])
sns.tsplot(la['lp__'], alpha=0.8, lw=.5, ax=axes[1])
axes[0].set_title("Histgram of likelihood.")
axes[1].set_title("Traceplot of likelihood.")
plt.show()
Finally, let's look at the performance of the estimation results. Since the data was split 8:2 by the holdout method, we can check both the performance on re-substitution (training data) and the generalization performance (test data). Let's compare using both the EAP and the MAP estimates as the parameter values.
# A function that plugs the estimated parameter values into the model and computes the probability p_i
def logis(x, beta):
    assert len(beta) == 7
    if not isinstance(x, np.ndarray):
        x = np.array(x)
    x = np.hstack(([1], x))  # prepend a 1 for the intercept term
    return (1 + np.exp(-np.dot(x, beta)))**(-1)
# Judge 0 or 1 with the given threshold (default 0.5) and return the accuracy
def check_accuracy(data, target, param, threshold=0.5):
    ans_list = []
    for i in range(len(data)):
        idx = data.index[i]
        res = logis(data.ix[idx], param)
        ans = 1 if res > threshold else 0
        ans_list.append(ans == target.ix[idx])
    return np.mean(ans_list)
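For reference, the same accuracy can also be computed without the Python loop by vectorizing over the whole matrix; a sketch, assuming `data` can be converted to a numpy array in the same column order:

# Vectorized sketch: compute all N probabilities at once
def check_accuracy_vec(data, target, param, threshold=0.5):
    X = np.hstack([np.ones((len(data), 1)), np.asarray(data)])  # prepend an intercept column
    p = 1 / (1 + np.exp(-np.dot(X, param)))                     # survival probabilities
    return np.mean((p > threshold).astype(int) == np.asarray(target))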
param = mean_list[0:7]

# Accuracy on re-substitution (training data)
print u"[train][EAP] Accuracy:{0:.5f}".format(check_accuracy(X_train, y_train, param))
print u"[train][MAP] Accuracy:{0:.5f}".format(check_accuracy(X_train, y_train, MAP_list))

# Generalization performance: accuracy on the test data
print u"[test][EAP] Accuracy:{0:.5f}".format(check_accuracy(X_test, y_test, param))
print u"[test][MAP] Accuracy:{0:.5f}".format(check_accuracy(X_test, y_test, MAP_list))
out
[train][EAP] Accuracy:0.79073
[train][MAP] Accuracy:0.80618
[test][EAP] Accuracy:0.81006
[test][MAP] Accuracy:0.79888
On re-substitution the MAP estimate gives a higher accuracy, while the EAP estimate looks better in terms of generalization performance. I'm curious whether that is just what happened this time or an actual tendency.
References:
http://aaaazzzz036.hatenablog.com/entry/2013/11/06/225417 → I referred to this for the Stan code.
Stan Manual (Modeling Language User's Guide and Reference Manual, v2.9.0) https://github.com/stan-dev/stan/releases/download/v2.9.0/stan-reference-2.9.0.pdf