This article uses Stan to fit a logistic regression to the Titanic data and then evaluates the classification performance.
The probabilistic programming language Stan used in this article estimates distribution parameters with the Hamiltonian Monte Carlo method (HMC) and NUTS. Strictly speaking, the principle of random number generation is different, but a slightly simpler Markov chain Monte Carlo method is the Metropolis-Hastings method (MH). I have written two articles (@kenmatsu4) about how these work, so please refer to them if you like. If the goal is just to get an image of what they are doing, the MH method and the HMC method are not so different (the main difference is random number sampling efficiency).
The full code is posted on GitHub.
I was worried that the installation might not go smoothly, but it was simple and just worked:
conda install pystan
import numpy as np
import numpy.random as rd
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import animation as ani
import matplotlib.cm as cm
import seaborn as sns
sns.set(style="whitegrid", palette="muted", color_codes=True)
from tabulate import tabulate
from time import time
import pystan
from pystan import StanModel
from sklearn import cross_validation
from scipy.stats import gaussian_kde
This article uses the Titanic data that was already converted to numeric values in [Machine learning] Summary and execution of model evaluation / metrics (with Titanic data set). It is available on GitHub, so please feel free to use it as well.
titanic = pd.read_table("https://raw.githubusercontent.com/matsuken92/Qiita_Contents/master/PyStan-Titanic/data/titanic_converted.csv",
sep=",", header=0)
titanic.head(10)
ID | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alone |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | 1 | 22 | 1 | 0 | 7.25 | 1 | 3 | 1 | 1 | 0 | 3 | 1 |
1 | 1 | 1 | 0 | 38 | 1 | 0 | 71.2833 | 2 | 1 | 2 | 0 | 3 | 1 | 0 |
2 | 1 | 3 | 0 | 26 | 0 | 0 | 7.925 | 1 | 3 | 2 | 0 | 0 | 3 | 0 |
3 | 1 | 1 | 0 | 35 | 1 | 0 | 53.1 | 1 | 1 | 2 | 0 | 3 | 3 | 0 |
4 | 0 | 3 | 1 | 35 | 0 | 0 | 8.05 | 1 | 3 | 1 | 1 | 0 | 3 | 1 |
5 | 0 | 3 | 1 | 28 | 0 | 0 | 8.4583 | 3 | 3 | 1 | 1 | 0 | 2 | 1 |
6 | 0 | 1 | 1 | 54 | 0 | 0 | 51.8625 | 1 | 1 | 1 | 1 | 5 | 3 | 1 |
7 | 0 | 3 | 1 | 2 | 3 | 1 | 21.075 | 1 | 3 | 0 | 0 | 0 | 3 | 0 |
8 | 1 | 3 | 0 | 27 | 0 | 2 | 11.1333 | 1 | 3 | 2 | 0 | 0 | 3 | 0 |
9 | 1 | 2 | 0 | 14 | 1 | 0 | 30.0708 | 2 | 2 | 0 | 0 | 0 | 1 | 0 |
Divide the data into the explanatory variables `data` and the objective variable `target`.
target = titanic.ix[:, 0]             # survived
data = titanic.ix[:, [1,2,3,4,5,6]]   # pclass, sex, age, sibsp, parch, fare
# Split into training data (80%) and test data (20%)
X_train, X_test, y_train, y_test = cross_validation.train_test_split(data, target, test_size=0.2, random_state=0)
print [d.shape for d in [X_train, X_test, y_train, y_test]]
out
[(712, 6), (179, 6), (712,), (179,)]
That is, the training data $X$ is a matrix with $N = 712$ records and $M = 6$ features, and each row vector $x_i$ is the data for one person. The six features are

pclass sex age sibsp parch fare

The teacher data is the Titanic survival information: survived = 1, did not return = 0, giving an $N$-dimensional vector $y$.
Now, given such data, the [logistic regression model (Wikipedia link)](https://ja.wikipedia.org/wiki/Logistic Regression) is expressed as follows:

\operatorname{logit} (p_{i}) = \ln \left({\frac {p_{i}}{1-p_{i}}}\right)
= \beta_0 + \beta_{1}x_{1,i} + \cdots + \beta_{M}x_{M,i} \\
(i=1, 2, \cdots, N)
$p_i$ can be regarded as the survival probability of the $i$-th record. Let's rearrange the formula a little to see why. First, apply $\exp(\cdot)$ to both sides:

\frac{p_{i}}{1-p_{i}} = \exp \left( \beta_0 + \beta_{1}x_{1,i} + \cdots + \beta_{M}x_{M,i} \right)

Solving this for $p_i$:

p_{i} = \frac{1}{1 + \exp \left( -(\beta_0 + \beta_{1}x_{1,i} + \cdots + \beta_{M}x_{M,i}) \right)}

Here we use the standard sigmoid function

\sigma(x) = \frac{1}{1 + e^{-x}}

which has the same shape as the formula above: replacing $x$ with $(\beta_0 + \beta_{1}x_{1,i} + \cdots + \beta_{M}x_{M,i})$ gives

p_{i} = \sigma(\beta_0 + \beta_{1}x_{1,i} + \cdots + \beta_{M}x_{M,i})

so $p_i$ can be treated as a probability.
It looks like this in a graph.
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x_range = np.linspace(-10, 10, 501)

plt.figure(figsize=(9,5))
plt.ylim(-0.1, 1.1)
plt.xlim(-10, 10)
plt.plot([-10,10],[0,0], "k", lw=1)
plt.plot([0,0],[-1,1.5], "k", lw=1)
plt.plot(x_range, sigmoid(x_range), zorder=100)
As you can see in the graph, the sigmoid function $\sigma(x)$ takes values in the range $(0, 1)$, so its output can be interpreted and used as a probability.
From the above: logistic regression takes the part that looks like an ordinary linear regression,

\beta_0 + \beta_{1}x_{1,i} + \cdots + \beta_{M}x_{M,i}

and, by making good use of the sigmoid function, arrives at the form

p_{i} = \sigma(\beta_0 + \beta_{1}x_{1,i} + \cdots + \beta_{M}x_{M,i}) \qquad (i=1, 2, \cdots, N)

whose value it uses by interpreting it as a probability.
This time, let's get acquainted with Stan by estimating the parameters $\beta_0, \beta_1, \cdots, \beta_M$ of this model.
Create a Dictionary object packed with training data for passing to Stan.
dat = {'N': X_train.shape[0], 'M': X_train.shape[1], 'X': X_train, 'y': y_train}
code = """
data {
int<lower=0> N;
int<lower=0> M;
matrix[N, M] X;
int<lower=0, upper=1> y[N];
} parameters {
real beta0;
vector[M] beta;
} model {
for (i in 1:N)
y[i] ~ bernoulli(inv_logit (beta0 + dot_product(X[i] , beta)));
}
"""
Stan code is divided into several blocks; this time it consists of three: `data`, `parameters`, and `model`. Let's look at each one.
data {
    int<lower=0> N;
    int<lower=0> M;
    matrix[N, M] X;
    int<lower=0, upper=1> y[N];
}
This block plays a placeholder-like role: the values are packed in from the Python side. Since on the Python side we assigned values with

dat = {'N': X_train.shape[0], 'M': X_train.shape[1], 'X': X_train, 'y': y_train}

they can be used on the Stan side. This time the variables are used as follows: `N` is the number of records, `M` is the number of features, `X` is the matrix of explanatory variables, and `y` is the objective variable (survival).
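The dictionary values can be pandas objects, but if dtype complaints ever come up, one workaround (just a sketch, using the same `X_train` / `y_train` as above) is to hand PyStan plain numpy arrays with explicit types:

import numpy as np

# A sketch: pack plain numpy arrays with explicit dtypes
dat = {'N': X_train.shape[0],                  # number of records (712)
       'M': X_train.shape[1],                  # number of features (6)
       'X': np.asarray(X_train, dtype=float),  # N x M matrix of explanatory variables
       'y': np.asarray(y_train, dtype=int)}    # N-vector of 0/1 survival labels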
parameters {
    real beta0;
    vector[M] beta;
}
The `parameters` block declares the variables you want to estimate, such as the parameters of a probability distribution. This time we want to estimate the regression parameters (the intercept $\beta_0$ and the regression coefficients $\beta_i\ (i = 1, 2, \cdots, M)$), so we declare them here.
The `model` block is the key: the declaration of the statistical model.
model {
    for (i in 1:N)
        y[i] ~ bernoulli(inv_logit(beta0 + dot_product(X[i], beta)));
}
Several Stan functions appear here, so let me introduce them first.
real dot_product(vector x, vector y)
Stan Manual v2.9 p358
Computes the inner product. Here it corresponds to the $\beta_{1}x_{1,i} + \cdots + \beta_{M}x_{M,i}$ part of the formula: each record's features and the parameters are multiplied elementwise and summed.
real inv_logit(real y)
Stan Manual v2.9 p338
A function that computes

\operatorname{inv\_logit}(y) = \frac{1}{1 + \exp(-y)}

In other words, it is the standard sigmoid function.
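As a quick sanity check (just a sketch using `scipy.special.expit`, SciPy's implementation of the same function), you can confirm that this matches the `sigmoid` defined earlier:

import numpy as np
from scipy.special import expit  # SciPy's standard sigmoid

x = np.linspace(-5, 5, 11)
# inv_logit(y) = 1 / (1 + exp(-y)), identical to the sigmoid plotted above
print(np.allclose(expit(x), 1 / (1 + np.exp(-x))))  # True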
y ~ bernoulli(theta)
Stan Manual v2.9 p384
An expression written with `~` is called a sampling statement. It is actually executed as the following statement:

increment_log_prob(bernoulli_log(y, theta));

Since we are computing the posterior probability by Bayesian estimation, `y` is given as the observed data and `theta` as the parameter to be estimated. Here `theta` is the `inv_logit(beta0 + dot_product(X[i], beta))` part explained earlier: the explanatory variables are linearly combined with the parameters $\beta_i$ and passed as the argument of the standard sigmoid function.
In summary, this one line computes the log-likelihood. The probability function of the Bernoulli distribution is

f(y_i \mid \theta_i) = \theta_i^{y_i} (1-\theta_i)^{1-y_i}

and taking its logarithm gives

\ln f(y_i \mid \theta_i) = y_i \ln \theta_i + (1-y_i) \ln (1-\theta_i)

Furthermore, since there are $N$ data points, these log terms are added up:

\sum_{i=1}^{N} \left\{ y_i \ln \theta_i + (1-y_i) \ln (1-\theta_i) \right\}

This addition is the part processed by `increment_log_prob`: the terms are accumulated while looping through the `for` statement (hence "increment").
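To make the accumulation concrete, here is a minimal numpy sketch of the same computation, with hypothetical fixed parameter values `beta0_hat` and `beta_hat` (for example, the EAP estimates obtained later); `X` is assumed to be an $N \times M$ numpy array and `y` a 0/1 vector:

import numpy as np

# A sketch of what the model block accumulates: the Bernoulli log-likelihood
# summed over all N records, for fixed parameter values
def log_likelihood(X, y, beta0_hat, beta_hat):
    total = 0.0
    for i in range(len(y)):
        theta = 1 / (1 + np.exp(-(beta0_hat + np.dot(X[i], beta_hat))))  # inv_logit
        total += y[i] * np.log(theta) + (1 - y[i]) * np.log(1 - theta)   # log Bernoulli pmf
    return total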
Also, no prior distribution is explicitly set for the parameters `beta0` and `beta`; in that case Stan uses a non-informative prior (a uniform distribution over the range $(-\infty, \infty)$).
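If you did want explicit priors, the model block could be extended as below. This is just a sketch with hypothetical weakly-informative normal priors, not what this article actually runs:

code_with_priors = """
data {
    int<lower=0> N;
    int<lower=0> M;
    matrix[N, M] X;
    int<lower=0, upper=1> y[N];
}
parameters {
    real beta0;
    vector[M] beta;
}
model {
    beta0 ~ normal(0, 100);  // weakly-informative prior on the intercept
    beta ~ normal(0, 100);   // weakly-informative priors on the coefficients
    for (i in 1:N)
        y[i] ~ bernoulli(inv_logit(beta0 + dot_product(X[i], beta)));
}
"""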
With the above Stan code, everything needed to compute a value proportional to the posterior distribution (likelihood × prior) is in place. Stan uses this to generate random numbers with HMC and NUTS.
Reference: Stan Manual v2.9, 26.3. Sampling Statements; 6.2. Priors for Coefficients and Scales
Stan translates the model code into C++ code, compiles it, and runs the compiled code to speed things up. Since compiling the code every time is inefficient, once the model code is settled it is convenient to compile it once and keep it as a Python object.
# Build Stan model
%time stm = StanModel(model_code=code)
out
CPU times: user 1.36 s, sys: 113 ms, total: 1.48 s
Wall time: 15.4 s
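To avoid recompiling in a later session as well, one common pattern is to save the compiled model to disk with Python's standard `pickle`; a sketch (the file name is arbitrary):

import pickle

# Save the compiled model once...
with open('logistic_model.pkl', 'wb') as f:
    pickle.dump(stm, f)

# ...and reload it later without recompiling
with open('logistic_model.pkl', 'rb') as f:
    stm = pickle.load(f)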
Once the model build is complete, it's time to sample. This is where most of the time goes. The settings related to the number of samples are: `n_itr = 3000` iterations per chain, of which the first `n_warmup = 200` are discarded as warm-up[^1], and `chains = 2` chains, so sampling runs until (3000 − 200) × 2 = 5600 post-warmup samples are drawn. For the execution settings, the NUTS algorithm is used and the chains are run in parallel (`n_jobs=-1`).
[^1]: According to BDA3 (Bayesian Data Analysis, 3rd edition, p.282), the term burn-in can be misleading, so calling it warm-up is recommended.
n_itr = 3000
n_warmup = 200
chains = 2

# Sampling
%time fit = stm.sampling(data=dat, iter=n_itr, chains=chains, n_jobs=-1, warmup=n_warmup, algorithm="NUTS", verbose=False)

# Extract the sampled columns
la = fit.extract(permuted=True)  # returns a dictionary of arrays

# Parameter names
names = fit.model_pars

# Number of parameters
n_param = np.sum([1 if len(x) == 0 else x[0] for x in fit.par_dims])
It took about 5 minutes to execute the sampling.
out
CPU times: user 156 ms, sys: 59.1 ms, total: 215 ms
Wall time: 5min 24s
Displays a summary of sampling results.
print fit
You can see the mean, standard error, standard deviation, and percentiles for each parameter; information about the likelihood is also shown at the bottom. Since `Rhat`, the value used to judge convergence of the sampling results, is 1.0, the distributions can be considered sufficiently converged. (In general, an Rhat of 1.1 or less is judged to mean sufficient convergence[^2].)
[^2]: Reference: a memo with citations on the specific Rhat threshold for MCMC convergence diagnosis.
out
Inference for Stan model: anon_model_b12e3f2368679a0c562b9f74618b2f82.
2 chains, each with iter=3000; warmup=200; thin=1;
post-warmup draws per chain=2800, total post-warmup draws=5600.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
beta0 5.02 8.0e-3 0.6 3.89 4.61 5.01 5.42 6.23 5600.0 1.0
beta[0] -1.04 2.1e-3 0.15 -1.35 -1.15 -1.04 -0.93 -0.75 5600.0 1.0
beta[1] -2.77 3.0e-3 0.23 -3.22 -2.92 -2.76 -2.61 -2.33 5600.0 1.0
beta[2] -0.04 1.2e-4 8.9e-3 -0.06 -0.05 -0.04 -0.04 -0.03 5600.0 1.0
beta[3] -0.41 1.6e-3 0.12 -0.66 -0.49 -0.41 -0.33 -0.18 5600.0 1.0
beta[4] -0.08 1.8e-3 0.14 -0.35 -0.17 -0.08 0.02 0.18 5600.0 1.0
beta[5] 2.5e-3 3.4e-5 2.5e-3-2.3e-3 7.0e-4 2.4e-3 4.1e-3 7.7e-3 5600.0 1.0
lp__ -322.4 0.02 1.86 -326.8 -323.4 -322.0 -321.0 -319.7 5600.0 1.0
Samples were drawn using NUTS(diag_e) at Sun Dec 13 02:53:45 2015.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
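Rather than eyeballing the printout, the Rhat values can also be pulled out programmatically; a sketch assuming PyStan 2's `fit.summary()` layout:

summary = fit.summary()
rhat_idx = summary['summary_colnames'].index('Rhat')  # column position of Rhat
rhat = summary['summary'][:, rhat_idx]

# All parameters have converged if every Rhat is at most 1.1
print(dict(zip(summary['summary_rownames'], rhat)))
print((rhat <= 1.1).all())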
The expected value of each parameter's posterior distribution is called the EAP (Expected A Posteriori) estimate. The EAP estimate can be obtained by simply averaging the sampling results. Easy :grin:
# Extract the list of EAP estimates for each parameter
mean_list = np.array(fit.summary()['summary'])[:,0]
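Equivalently, since the EAP estimate is just the sample mean, you can average the extracted draws directly; a sketch using the `la` dictionary from above:

# EAP estimates by directly averaging the posterior draws
eap_beta0 = np.mean(la['beta0'])        # intercept
eap_beta = np.mean(la['beta'], axis=0)  # one mean per coefficient, shape (6,)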
Next, find the MAP (Maximum A Posteriori) estimates. Here I computed them by kernel density estimation, taking the value at which the estimated density is maximized. (If anyone knows an easier way, please let me know m(_ _)m)
# Generate the list of MAP estimates for each parameter
def find_MAP(data):
    kde = gaussian_kde(data)
    x_range = np.linspace(min(data), max(data), 501)
    eval_kde = kde.evaluate(x_range)
    #plt.plot(x_range, eval_kde)
    return x_range[np.argmax(eval_kde)]

MAP_list = []
MAP_list.append(find_MAP(la['beta0']))
for i in range(n_param-1):
    MAP_list.append(find_MAP(la['beta'][:,i]))
Finally, let's visualize the sampling results for each parameter. The left side shows the kernel density of the samples; the right side is the trace plot, which plots the random numbers in the order they were sampled.
f, axes = plt.subplots(n_param, 2, figsize=(15, 4*n_param))

cnt = 0
for name in names:
    dat = la[name]
    if dat.ndim == 2:
        for j in range(dat.shape[1]):
            d = dat[:,j]
            sns.distplot(d, hist=False, rug=True, ax=axes[cnt, 0])
            sns.tsplot(d, alpha=0.8, lw=.5, ax=axes[cnt, 1])
            cnt += 1
    else:
        # Intercept
        sns.distplot(dat, hist=False, rug=True, ax=axes[cnt, 0])
        sns.tsplot(dat, alpha=0.8, lw=.5, ax=axes[cnt, 1])
        cnt += 1

name_list = []
for name in names:
    if la[name].ndim == 2:
        for i in range(la[name].shape[1]):
            name_list.append("{}{}".format(name, i+1))
    else:
        name_list.append(name)

for i in range(2):
    for j, t in enumerate(name_list):
        axes[j, i].set_title(t)

plt.show()
Similarly, I plotted a graph for the likelihood.
# Likelihood
f, axes = plt.subplots(1, 2, figsize=(15, 4))
sns.distplot(la['lp__'], hist=False, rug=True, ax=axes[0])
sns.tsplot(la['lp__'], alpha=0.8, lw=.5, ax=axes[1])
axes[0].set_title("Histgram of likelihood.")
axes[1].set_title("Traceplot of likelihood.")
plt.show()
Finally, let's look at the performance of the estimation results. Since the data was split 8:2 by the holdout method, we can check both the performance on re-substitution (training data) and the generalization performance (test data). Let's compare using both the EAP and the MAP estimates as the parameter values.
# A function that plugs the estimated parameter values into the model and computes the probability p_i
def logis(x, beta):
    assert len(beta) == 7
    if not isinstance(x, np.ndarray):
        x = np.array(x)
    x = np.hstack(([1], x))  # prepend a 1 for the intercept term
    return (1 + np.exp(-np.dot(x, beta)))**(-1)
# Judge 0 or 1 with the given threshold (default 0.5) and return the accuracy
def check_accuracy(data, target, param, threshold=0.5):
    ans_list = []
    for i in range(len(data)):
        idx = data.index[i]
        res = logis(data.ix[idx], param)
        ans = 1 if res > threshold else 0
        ans_list.append(ans == target.ix[idx])
    return np.mean(ans_list)
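For reference, the same accuracy can also be computed without the Python loop by vectorizing over the whole matrix; a sketch, assuming `data` can be converted to a numpy array in the same column order:

# Vectorized sketch: compute all N probabilities at once
def check_accuracy_vec(data, target, param, threshold=0.5):
    X = np.hstack([np.ones((len(data), 1)), np.asarray(data)])  # prepend an intercept column
    p = 1 / (1 + np.exp(-np.dot(X, param)))                     # survival probabilities
    return np.mean((p > threshold).astype(int) == np.asarray(target))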
param = mean_list[0:7]

# Accuracy on re-substitution (training data)
print u"[train][EAP] Accuracy:{0:.5f}".format(check_accuracy(X_train, y_train, param))
print u"[train][MAP] Accuracy:{0:.5f}".format(check_accuracy(X_train, y_train, MAP_list))

# Generalization performance: accuracy on the test data
print u"[test][EAP] Accuracy:{0:.5f}".format(check_accuracy(X_test, y_test, param))
print u"[test][MAP] Accuracy:{0:.5f}".format(check_accuracy(X_test, y_test, MAP_list))
out
[train][EAP] Accuracy:0.79073
[train][MAP] Accuracy:0.80618
[test][EAP] Accuracy:0.81006
[test][MAP] Accuracy:0.79888
On re-substitution the MAP estimate gives a higher accuracy, while the EAP estimate looks better in terms of generalization performance. I'm curious whether that is just what happened this time or an actual tendency.
References:
http://aaaazzzz036.hatenablog.com/entry/2013/11/06/225417 → I referred to this for the Stan code.
Stan Manual (Modeling Language User's Guide and Reference Manual, v2.9.0) https://github.com/stan-dev/stan/releases/download/v2.9.0/stan-reference-2.9.0.pdf