In curve fitting, the posterior predictive distribution is computed by treating the problem in a Bayesian way, but my impression is that this is rarely done in classification. This time I would like to treat logistic regression, which is widely used for classification, in a more Bayesian way and implement code that computes the posterior predictive distribution.
As I wrote above, I have seen little or no code that computes the predictive distribution for classification (perhaps only a handful of examples). To compute the predictive distribution in a Bayesian manner, the weight parameters must be integrated out using their posterior distribution. However, because logistic regression uses the logistic sigmoid function, this integral cannot be evaluated analytically. The predictive distribution is therefore obtained approximately using the Laplace approximation.
First, consider logistic regression for classifying two classes. Let $\phi$ be the feature vector of a point $x$. The probability that the point belongs to class $C_1$ is expressed as
p(C_1|\phi) = y(\phi) = \sigma({\bf w}^\intercal\phi)
where $\sigma(\cdot)$ is the logistic sigmoid function. Given a training data set $\{\phi_n, t_n\}_{n=1}^N$ with $\phi_n = \phi(x_n)$ and $t_n \in \{0,1\}$, the likelihood function is modeled with the Bernoulli distribution:
\begin{align}
p({\bf t}|{\bf\Phi},{\bf w}) &= \prod_{n=1}^N{\rm Bern}(t_n|y_n)\\
&= \prod_{n=1}^N y_n^{t_n}(1 - y_n)^{1 - t_n}
\end{align}
Here ${\bf\Phi}$ is the design matrix whose rows are the vectors $\phi_n^\intercal$. As usual, defining the cost function as the negative log-likelihood gives the cross-entropy form:
E({\bf w}) = -\ln p({\bf t}|{\bf\Phi},{\bf w}) = -\sum_{n=1}^N\{t_n\ln y_n + (1 - t_n)\ln(1 - y_n)\}
The first and second derivatives of this cost function with respect to the parameter ${\bf w}$ are
\begin{align}
\nabla E({\bf w}) &= \sum_{n=1}^N (y_n - t_n) {\bf\phi}_n = {\bf\Phi}^\intercal({\bf y} - {\bf t})\\
{\bf H} = \nabla\nabla E({\bf w}) &= \sum_{n=1}^N y_n(1 - y_n)\phi_n\phi_n^\intercal = {\bf\Phi}^\intercal{\bf R}{\bf\Phi}
\end{align}
where ${\bf R}$ is the diagonal matrix with elements $R_{nn} = y_n(1 - y_n)$. Using these results, the Newton-Raphson update
{\bf w}^{new} = {\bf w}^{old} - {\bf H}^{-1}\nabla E({\bf w})
is repeated until the parameter ${\bf w}$ converges.
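For reference, here is a minimal NumPy sketch of a single update step (the function and variable names are illustrative only, and the prior term introduced below is omitted):

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def newton_step(w, Phi, t):
    """One Newton-Raphson update for logistic regression (no prior term)."""
    y = sigmoid(Phi @ w)                 # current predictions y_n
    grad = Phi.T @ (y - t)               # gradient of E(w)
    R = np.diag(y * (1 - y))             # diagonal matrix R
    H = Phi.T @ R @ Phi                  # Hessian of E(w)
    return w - np.linalg.solve(H, grad)  # w_new = w_old - H^{-1} grad
```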
We now introduce a prior distribution $p({\bf w}) = \mathcal{N}({\bf w}|{\bf 0}, \alpha^{-1}{\bf I})$ over the parameter ${\bf w}$ and use the Laplace approximation to obtain an approximate posterior distribution in the form of a Gaussian. From Bayes' theorem,
p({\bf w}|{\bf t},{\bf\Phi}) \propto p({\bf t}|{\bf\Phi},{\bf w})p({\bf w})
and taking the logarithm gives
\ln p({\bf w}|{\bf t},{\bf\Phi}) = -{\alpha\over2}{\bf w}^\intercal{\bf w} + \sum_{n=1}^N\{t_n\ln y_n + (1 - t_n)\ln(1 - y_n)\} + {\rm const}
To obtain the Gaussian approximation, we first maximize this expression with respect to ${\bf w}$ to find ${\bf w}_{MAP}$, and then use the negative second derivative at that point, $S_N^{-1}$, as the precision. The approximate posterior distribution is then
q({\bf w}) = \mathcal{N}({\bf w}|{\bf w}_{MAP}, S_N)
which is the Laplace approximation to the posterior distribution.
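Concretely, $S_N^{-1}$ is the negative second derivative (Hessian) of the log-posterior above, evaluated at ${\bf w}_{MAP}$:
S_N^{-1} = -\nabla\nabla\ln p({\bf w}|{\bf t},{\bf\Phi}) = \alpha{\bf I} + \sum_{n=1}^N y_n(1 - y_n)\phi_n\phi_n^\intercal
This is the same quantity that appears as the regularized Hessian in the code later in this post.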
The posterior predictive distribution for a new point $\phi$ is obtained by marginalizing over the posterior distribution of the parameters:
\begin{align}
p(C_1|\phi,{\bf t},{\bf\Phi}) &= \int p(C_1|\phi,{\bf w})p({\bf w}|{\bf t},{\bf\Phi}){\rm d}{\bf w}\\
&\approx \int \sigma({\bf w}^\intercal\phi)\mathcal{N}({\bf w}|{\bf w}_{MAP},S_N){\rm d}{\bf w}\\
&\approx \sigma\left({\mu\over\sqrt{1 + \pi\sigma^2/8}}\right)
\end{align}
where
\begin{align}
\mu &= {\bf w}_{MAP}^\intercal\phi\\
\sigma^2 &= \phi^\intercal S_N\phi
\end{align}
Two approximations are used here: the Laplace approximation of the posterior distribution of the parameter ${\bf w}$, and the approximation of the logistic sigmoid function by the probit function. See PRML for the detailed derivation.
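For reference, the key intermediate result (from PRML, Chapter 4) is that the convolution of a logistic sigmoid with a Gaussian is well approximated via the probit function:
\int\sigma(a)\mathcal{N}(a|\mu,\sigma^2){\rm d}a \approx \sigma\left(\kappa(\sigma^2)\mu\right),\qquad \kappa(\sigma^2) = \left(1 + \pi\sigma^2/8\right)^{-1/2}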
In addition to the usual matplotlib and numpy, the code uses the standard-library modules itertools and functools.
import itertools
from functools import reduce

import matplotlib.pyplot as plt
import numpy as np
A two-dimensional point $(x_1, x_2)$ is converted to the quadratic polynomial feature vector
\phi(x_1,x_2) =
\begin{bmatrix}
1\\
x_1\\
x_2\\
x_1^2\\
x_1x_2\\
x_2^2
\end{bmatrix}
Below is a class that transforms a set of 2D points into a design matrix. For example, the design matrix of quadratic polynomial features for the two-dimensional points $(2,5)$ and $(-3,1)$ is
{\bf\Phi}=
\begin{bmatrix}
1 & 2 & 5 & 2^2 & 2 \times 5 & 5^2\\
1 & -3 & 1 & (-3)^2 & (-3) \times 1 & 1^2
\end{bmatrix}
class PolynomialFeatures(object):

    def __init__(self, degree=2):
        self.degree = degree

    def transform(self, x):
        # x has shape (N, 2): one 2D point per row
        x_t = x.transpose()
        features = [np.ones(len(x))]  # bias term
        for degree in range(1, self.degree + 1):
            # all monomials of this degree, e.g. x1^2, x1*x2, x2^2 for degree 2
            for items in itertools.combinations_with_replacement(x_t, degree):
                features.append(reduce(lambda a, b: a * b, items))
        return np.array(features).transpose()
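As a quick sanity check, the class can be applied to the two example points from above (a usage sketch; the expected rows follow directly from the feature definition):

```python
import numpy as np

# assumes the PolynomialFeatures class above has been defined
features = PolynomialFeatures(degree=2)
Phi = features.transform(np.array([[2., 5.], [-3., 1.]]))
print(Phi)
# expected rows: [1, 2, 5, 4, 10, 25] and [1, -3, 1, 9, -3, 1]
```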
It is not strictly necessary to define separate classes for logistic regression and Bayesian logistic regression, but they are kept separate here to make the difference between the two clear.
class LogisticRegression(object):

    def __init__(self, iter_max, alpha=0):
        self.iter_max = iter_max  # maximum number of parameter updates
        self.alpha = alpha        # precision of the Gaussian prior over w

    def _sigmoid(self, a):
        return np.divide(1, 1 + np.exp(-a))

    def fit(self, X, t):
        self.w = np.zeros(np.size(X, 1))
        for i in range(self.iter_max):
            w = np.copy(self.w)
            y = self.predict_proba(X)
            grad = X.T.dot(y - t) + self.alpha * w
            hessian = (X.T.dot(np.diag(y * (1 - y))).dot(X)
                       + self.alpha * np.identity(len(w)))
            try:
                # Newton-Raphson update: w_new = w_old - H^{-1} grad
                self.w -= np.linalg.solve(hessian, grad)
            except np.linalg.LinAlgError:
                break
            if np.allclose(w, self.w):
                break
            if i == self.iter_max - 1:
                print("weight parameter w may not have converged")

    def predict(self, X):
        return (self.predict_proba(X) > 0.5).astype(int)

    def predict_proba(self, X):
        return self._sigmoid(X.dot(self.w))
LogisticRegression | Method description |
---|---|
__init__ | Set the maximum number of parameter updates and the precision of the prior distribution over the parameters |
_sigmoid | Logistic sigmoid function |
fit | Parameter estimation |
predict | Predict the label of an input |
predict_proba | Compute the probability that an input has label 1 |
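A minimal usage sketch of the class above on a hypothetical toy data set (the points and labels here are made up for illustration):

```python
import numpy as np

# toy data: two points per class, labels in {0, 1}
x_train = np.array([[0., 0.], [0.5, -0.5], [2., 2.], [2.5, 1.5]])
t_train = np.array([0., 0., 1., 1.])

features = PolynomialFeatures(degree=2)
clf = LogisticRegression(iter_max=100, alpha=0.1)  # small alpha keeps the Hessian well conditioned
clf.fit(features.transform(x_train), t_train)

x_new = np.array([[1., 1.]])
print(clf.predict_proba(features.transform(x_new)))  # probability of class 1
print(clf.predict(features.transform(x_new)))        # hard 0/1 label
```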
Maximum a posteriori estimation of the parameters yields not only the mode but also an approximate covariance. The logistic regression above also computes the Hessian matrix (the precision matrix of the approximate parameter posterior), but it is not stored in a variable because it is needed only during estimation. Bayesian logistic regression, on the other hand, also needs the parameter covariance to compute the predictive distribution, so it is saved. The predict_dist method computes the predictive distribution using this covariance matrix.
class BayesianLogisticRegression(LogisticRegression):

    def __init__(self, iter_max, alpha=0.1):
        super().__init__(iter_max, alpha)

    def fit(self, X, t):
        # MAP estimation of w via the parent class
        super().fit(X, t)
        # Laplace approximation: the covariance S_N is the inverse Hessian at w_MAP
        y = self.predict_proba(X)
        hessian = (X.T.dot(np.diag(y * (1 - y))).dot(X)
                   + self.alpha * np.identity(len(self.w)))
        self.w_var = np.linalg.inv(hessian)

    def predict_dist(self, X):
        # probit approximation: sigma(mu / sqrt(1 + pi * var / 8))
        mu_a = X.dot(self.w)
        var_a = np.sum(X.dot(self.w_var) * X, axis=1)
        return self._sigmoid(mu_a / np.sqrt(1 + np.pi * var_a / 8))
BayesianLogisticRegression | Method description |
---|---|
__init__ | Set the maximum number of parameter updates and the precision of the prior distribution over the parameters |
fit | Parameter estimation |
predict_dist | Compute the posterior predictive distribution for an input |
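Before the full script, a short sketch contrasting predict_proba (which uses only ${\bf w}_{MAP}$) with predict_dist (which also uses the covariance), again on made-up toy data:

```python
import numpy as np

x_train = np.array([[0., 0.], [0.5, -0.5], [2., 2.], [2.5, 1.5]])
t_train = np.array([0., 0., 1., 1.])

features = PolynomialFeatures(degree=2)
model = BayesianLogisticRegression(iter_max=100, alpha=0.1)
model.fit(features.transform(x_train), t_train)

# the second test point lies far from the training data, where the variance is large
x_new = features.transform(np.array([[1., 1.], [-5., 5.]]))
print(model.predict_proba(x_new))  # sigmoid evaluated at w_MAP
print(model.predict_dist(x_new))   # pulled toward 0.5 where the predictive variance is large
```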
import itertools
from functools import reduce

import matplotlib.pyplot as plt
import numpy as np


class PolynomialFeatures(object):

    def __init__(self, degree=2):
        self.degree = degree

    def transform(self, x):
        x_t = x.transpose()
        features = [np.ones(len(x))]
        for degree in range(1, self.degree + 1):
            for items in itertools.combinations_with_replacement(x_t, degree):
                features.append(reduce(lambda a, b: a * b, items))
        return np.array(features).transpose()


class LogisticRegression(object):

    def __init__(self, iter_max, alpha=0):
        self.iter_max = iter_max
        self.alpha = alpha

    def _sigmoid(self, a):
        return np.divide(1, 1 + np.exp(-a))

    def fit(self, X, t):
        self.w = np.zeros(np.size(X, 1))
        for i in range(self.iter_max):
            w = np.copy(self.w)
            y = self.predict_proba(X)
            grad = X.T.dot(y - t) + self.alpha * w
            hessian = (X.T.dot(np.diag(y * (1 - y))).dot(X)
                       + self.alpha * np.identity(len(w)))
            try:
                self.w -= np.linalg.solve(hessian, grad)
            except np.linalg.LinAlgError:
                break
            if np.allclose(w, self.w):
                break
            if i == self.iter_max - 1:
                print("weight parameter w may not have converged")

    def predict(self, X):
        return (self.predict_proba(X) > 0.5).astype(int)

    def predict_proba(self, X):
        return self._sigmoid(X.dot(self.w))


class BayesianLogisticRegression(LogisticRegression):

    def __init__(self, iter_max, alpha=0.1):
        super().__init__(iter_max, alpha)

    def fit(self, X, t):
        super().fit(X, t)
        y = self.predict_proba(X)
        hessian = (X.T.dot(np.diag(y * (1 - y))).dot(X)
                   + self.alpha * np.identity(len(self.w)))
        self.w_var = np.linalg.inv(hessian)

    def predict_dist(self, X):
        mu_a = X.dot(self.w)
        var_a = np.sum(X.dot(self.w_var) * X, axis=1)
        return self._sigmoid(mu_a / np.sqrt(1 + np.pi * var_a / 8))
def create_data_set():
    # 25 points from each of two Gaussian blobs: class 0 around (0, 0), class 1 around (2, 2)
    x = np.random.normal(size=50).reshape(-1, 2)
    y = np.random.normal(size=50).reshape(-1, 2)
    y += np.array([2., 2.])
    return (np.concatenate([x, y]),
            np.concatenate([np.zeros(25), np.ones(25)]))


def main():
    x, labels = create_data_set()
    colors = ['blue', 'red']
    plt.scatter(x[:, 0], x[:, 1], c=[colors[int(label)] for label in labels])

    features = PolynomialFeatures(degree=3)

    classifier = BayesianLogisticRegression(iter_max=100, alpha=0.1)
    classifier.fit(features.transform(x), labels)

    # evaluate on a grid to draw the decision contours
    X_test, Y_test = np.meshgrid(np.linspace(-2, 4, 100), np.linspace(-2, 4, 100))
    x_test = np.array([X_test, Y_test]).transpose(1, 2, 0).reshape(-1, 2)
    probs = classifier.predict_proba(features.transform(x_test))   # MAP output
    Probs = probs.reshape(100, 100)
    dists = classifier.predict_dist(features.transform(x_test))    # predictive distribution
    Dists = dists.reshape(100, 100)

    levels = np.linspace(0, 1, 5)
    cp = plt.contour(X_test, Y_test, Probs, levels, colors='k', linestyles='dashed')
    plt.clabel(cp, inline=True, fontsize=10)
    plt.contourf(X_test, Y_test, Dists, levels, alpha=0.5)
    plt.colorbar()
    plt.xlim(-2, 4)
    plt.ylim(-2, 4)
    plt.show()


if __name__ == '__main__':
    main()
- Training data: blue points (class 0) and red points (class 1)
- **Black dashed lines: ordinary logistic regression output**; from the top, the contours where the probability of belonging to class 1 is 0.75, 0.5, and 0.25 (somewhat hard to see)
- **Color coding: Bayesian logistic regression output**, the probability of belonging to class 1
  - Red: 0.75 and above
  - Yellow: 0.5 to 0.75
  - Light blue: 0.25 to 0.5
  - Blue: up to 0.25
Looking at the result, the contour with probability 0.5 (the middle dashed line, which coincides with the boundary between light blue and yellow) is the same in both cases, while the 0.25 and 0.75 contours of the Bayesian treatment lie farther from the 0.5 contour, telling us that it is less certain which class those regions belong to. Furthermore, in the upper left of the figure, where there are no data points, the predictive distribution is more ambiguous than elsewhere. Around the center, on the other hand, where there are many data points, the band between 0.25 and 0.75 is narrow. After computing the probability of belonging to class 1 by either method, a point is assigned to a class by some decision criterion; if that criterion is minimizing misclassification, it does not matter which method is used, because the 0.5 contour is the same for both. With more complicated decision criteria, however, the Bayesian treatment may be preferable.
This time, we implemented Bayesian logistic regression using the Laplace approximation. The advantage of the Bayesian treatment is that the prediction behaves differently in regions with many data points and in regions with few. Besides the Laplace approximation, the predictive distribution can apparently also be derived using variational Bayes; Chapter 10 of PRML uses variational Bayes to estimate the hyperparameter $\alpha$ as well. I would like to implement that too.