This article is a sequel to my earlier article [Python] Implementation of clustering using a mixed Gaussian model. I am reading the Machine Learning Professional Series book "Nonparametric Bayes Point Process and Statistical Machine Learning Mathematics" and implementing its models in Python. This time, we improve the model implemented last time and perform clustering with **marginalized Gibbs sampling**.
Here is some notation used in this article that you may not often see elsewhere:

- $x_{1:N}$: $x_1, \cdots, x_N$
- $x_{1:N}^{\backslash i}$: $x_{1:N}$ with $x_i$ removed
In addition, the vectors in this article are column vectors and are not shown in bold.
Before we talk about marginalization, let's revisit the previous Gibbs sampling in a little more detail.
What we did last time was to assume that the dataset $x_{1:N}$ is generated by a Gaussian mixture model with $K$ clusters, and to estimate the mean $\mu_{1:K}$ of the Gaussian distribution corresponding to each cluster together with the cluster $z_{1:N}$ of each data point. The joint distribution of $z_{1:N}$ and $\mu_{1:K}$ is the following conditional probability, which is difficult to handle analytically.
```math
P(\mu_{1:K}, z_{1:N} | x_{1:N}, \Sigma_{1:K}, \pi)
```
In the above equation, $\Sigma_{1:K}$ are the covariance matrices of the Gaussian distributions, and $\pi := (\pi_1, \cdots, \pi_K)^T$ is the parameter of the categorical distribution that each $z_i$ follows.[^2] Therefore, instead of estimating $\mu_{1:K}$ and $z_{1:N}$ all at once, we sample the variables $\mu_1, \cdots, \mu_K, z_1, \cdots, z_N$ one at a time in turn, and estimate the whole by repeating this. This technique is called **Gibbs sampling**. The estimation is therefore carried out by repeating the following two steps.
- For each $k = 1, \cdots, K$, sample $\mu_k$ from $P(\mu_k | \mu_{1:K}^{\backslash k}, z_{1:N}, x_{1:N}, \Sigma_{1:K}, \pi, \mu_{\rm pri}, \Sigma_{\rm pri})$
- For each $i = 1, \cdots, N$, sample $z_i$ from $P(z_i | z_{1:N}^{\backslash i}, \mu_{1:K}, x_{1:N}, \Sigma_{1:K}, \pi)$
$\mu_{\rm pri}$ and $\Sigma_{\rm pri}$ are the parameters of the prior distribution of $\mu_k$.[^1]
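As a rough sketch of this loop structure (this is not the previous article's code; the one-dimensional toy data, $K = 2$, and all hyperparameter values below are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data from two well-separated Gaussians
X = np.concatenate([rng.normal(-2.0, 0.3, 50), rng.normal(2.0, 0.3, 50)])
N, K = len(X), 2
sigma2, sigma2_pri, mu_pri = 0.3 ** 2, 1.0, 0.0   # likelihood variance, prior variance, prior mean
pi = np.full(K, 1 / K)

z = rng.integers(0, K, N)         # random initial cluster assignments
mu = rng.normal(mu_pri, 1.0, K)   # initial cluster means

for _ in range(20):
    # Step 1: sample each cluster mean from its Gaussian posterior given the assignments
    for k in range(K):
        members = X[z == k]
        prec = len(members) / sigma2 + 1 / sigma2_pri
        mean = (members.sum() / sigma2 + mu_pri / sigma2_pri) / prec
        mu[k] = rng.normal(mean, np.sqrt(1 / prec))
    # Step 2: sample each assignment from its categorical conditional given the means
    for i in range(N):
        logp = np.log(pi) - (X[i] - mu) ** 2 / (2 * sigma2)
        p = np.exp(logp - logp.max())
        z[i] = rng.choice(K, p=p / p.sum())

print(np.round(np.sort(mu), 2))   # should land near the two true means (-2 and 2)
```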
In the Gibbs sampling above, it is the second step that actually estimates the clusters of the data; the first step is auxiliary work needed to compute them. Now, using a technique called **marginalization**, consider sampling $z_{1:N}$ in the second step directly, without the first step of sampling $\mu_{1:K}$. This technique is called **marginalized Gibbs sampling**. Marginalized Gibbs sampling repeats a single step:
- For each $i = 1, \cdots, N$, sample $z_i$ from $P(z_i | z_{1:N}^{\backslash i}, x_{1:N}, \Sigma_{1:K}, \pi, \mu_{\rm pri}, \Sigma_{\rm pri})$
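Here is the same toy problem sampled with this single marginalized step (again a made-up sketch; the predictive mean and variance inside the loop come from the formula derived later in this article, and the full multivariate implementation appears further below):

```python
import numpy as np

rng = np.random.default_rng(0)

# Same toy 1-D data as in the previous sketch
X = np.concatenate([rng.normal(-2.0, 0.3, 50), rng.normal(2.0, 0.3, 50)])
N, K = len(X), 2
sigma2, sigma2_pri, mu_pri = 0.3 ** 2, 1.0, 0.0
pi = np.full(K, 1 / K)

z = rng.integers(0, K, N)   # random initial cluster assignments; no mu is kept anywhere

for _ in range(20):
    for i in range(N):
        others = np.arange(N) != i
        logp = np.empty(K)
        for k in range(K):
            members = X[others & (z == k)]          # cluster k with x_i removed
            a_k = 1 / (len(members) / sigma2 + 1 / sigma2_pri)
            mean = a_k * (members.sum() / sigma2 + mu_pri / sigma2_pri)
            var = sigma2 + a_k                      # predictive variance
            logp[k] = np.log(pi[k]) - np.log(var) / 2 - (X[i] - mean) ** 2 / (2 * var)
        p = np.exp(logp - logp.max())
        z[i] = rng.choice(K, p=p / p.sum())

print(np.bincount(z, minlength=K))   # cluster sizes; the 50/50 structure should be recovered
```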
With marginalized Gibbs sampling, even when the variance of $\mu_{1:K}$ is large[^6], $z_{1:N}$ can be sampled without being affected by unstable samples of $\mu_{1:K}$. You may wonder how estimation is possible without $\mu_{1:K}$; the intuition is that the information about $\mu_{1:K}$ is carried by $z_{1:N}^{\backslash i}, x_{1:N}, \Sigma_{1:K}, \pi, \mu_{\rm pri}, \Sigma_{\rm pri}$, so we use these to estimate $z_i$ (without using $\mu_{1:K}$). Now, let us set Gibbs sampling aside for a moment and talk about marginalization.
Let $x \in X$ and $y \in Y$ be continuous random variables, and suppose we know the joint distribution $P(x, y)$. Eliminating a random variable using integration, as follows, is called **marginalization**.
```math
P(x) = \int_{Y}P(x, y)dy
```
Transforming the right-hand side using Bayes' theorem and writing it with conditional probabilities gives the following. Use this form when you know the conditional probability rather than the joint distribution.
```math
P(x) = \int_{Y}P(x|y)P(y)dy
```
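As a quick numerical check of this formula (a sketch with made-up numbers, assuming SciPy is available): when $P(x|y)$ is a Gaussian with mean $y$ and $P(y)$ is itself a Gaussian, the integral collapses to a single Gaussian whose variance is the sum of the two variances, which is exactly the kind of result used for the predictive distribution later on.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

sigma, sigma_pri, mu_pri = 0.5, 1.0, 2.0   # made-up values
x = 1.3

# Left-hand side: integrate P(x|y) * P(y) over y numerically
lhs, _ = quad(lambda y: norm.pdf(x, loc=y, scale=sigma) * norm.pdf(y, loc=mu_pri, scale=sigma_pri),
              -np.inf, np.inf)

# The integral is again a Gaussian, with the two variances added
rhs = norm.pdf(x, loc=mu_pri, scale=np.sqrt(sigma ** 2 + sigma_pri ** 2))

print(lhs, rhs)   # the two values agree up to numerical error
```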
Incidentally, if the random variable $y$ is discrete, marginalization is formulated as follows. This may be easier to picture.
```math
P(x) = \sum_{y \in Y}P(x, y) = \sum_{y \in Y}P(x|y)P(y)
```
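For example, with small made-up tables for $P(y)$ and $P(x|y)$, the marginal $P(x)$ is just the conditional distributions averaged with weights $P(y)$:

```python
import numpy as np

p_y = np.array([0.2, 0.5, 0.3])        # P(y) for three values of y
p_x_given_y = np.array([[0.9, 0.1],    # P(x|y=0) for two values of x
                        [0.4, 0.6],    # P(x|y=1)
                        [0.2, 0.8]])   # P(x|y=2)

p_x = p_x_given_y.T @ p_y              # P(x) = sum_y P(x|y) P(y)
print(p_x, p_x.sum())                  # [0.44 0.56], sums to 1
```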
Let us return to Gibbs sampling. We perform marginalization to obtain the probability distribution of $z_i$ without $\mu_{1:K}$. Here $D$ is the dimension of $x_i$ and $\mu_k$. Since the conditional probabilities have many conditioning variables, most of them are abbreviated as $\cdots$.
```math
P(z_i=k | \cdots)
= \int_{\mathbb{R}^D} P(z_i=k | \mu_{1:K}, \cdots)P(\mu_{1:K} | \cdots)d\mu_{1:K}
```
As last time, assuming $\Sigma_k = \sigma^2 I_D$ (common to all clusters) and $\Sigma_{\rm pri} := \sigma_{\rm pri}^2 I_D$, this can be computed as follows.[^3]
```math
\begin{align}
P(z_i=k | \cdots)
&\propto \pi_k N\left(x_i \,\middle|\, a_k\left(\frac{n_k^{\backslash i}}{\sigma^2}\overline{x_k}^{\backslash i} + \frac{1}{\sigma_{\rm pri}^2} \mu_{\rm pri} \right),
\left( \sigma^2 + a_k \right)I_D\right) \\
a_k &:= \left( \frac{n_k^{\backslash i}}{\sigma^2} + \frac{1}{\sigma_{\rm pri}^2} \right)^{-1}
\end{align}
```
$n_k^{\backslash i}$ is the number of data points belonging to cluster $k$ in the dataset obtained by removing $x_i$ from $x_{1:N}$, and $\overline{x_k}^{\backslash i}$ is their mean.
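To make these quantities concrete, here is the computation of $a_k$, the predictive mean, and the predictive variance for a single cluster with made-up values; the `_sample_zi` method in the implementation below does exactly this for all clusters at once.

```python
import numpy as np

# Made-up values for one cluster k, with D = 2
sigma2, sigma2_pri = 0.1, 1.0
mu_pri = np.zeros(2)
n_k = 5                          # number of points in cluster k, excluding x_i
x_bar_k = np.array([1.2, 0.8])   # their mean

a_k = 1 / (n_k / sigma2 + 1 / sigma2_pri)                         # (n_k / sigma^2 + 1 / sigma_pri^2)^(-1)
pred_mean = a_k * (n_k / sigma2 * x_bar_k + mu_pri / sigma2_pri)  # mean of the predictive Gaussian
pred_var = sigma2 + a_k                                           # its variance (times the identity)

print(a_k, pred_mean, pred_var)
```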
We now implement marginalized Gibbs sampling, building on the previous implementation. There are two major changes from last time.
- Removal of the class attribute `mu` (it is no longer needed because it is not sampled)
- A change to the `log_deformed_gaussian` method
The first change is the core idea of marginalized Gibbs sampling and is the more important one.
Let me explain the second change. Last time, since we only needed the ratios of the probability density values of the normal distributions, the computation ignored factors that do not depend on the cluster $k$. I repost the previous formula.[^4]
```math
\begin{align}
P(z_i = k | x_{1:N}^{\backslash i}, z_{1:N}^{\backslash i}, \mu_{1:K}, \Sigma_{1:K}, \pi)
&\propto \pi_k \exp\left\{- \frac{1}{2}\log|\Sigma_k| - \frac{1}{2} (x_i - \mu_k)^T\Sigma_k^{-1}(x_i - \mu_k)\right\} \\
&\propto \pi_k \exp\left\{- \frac{1}{2 \sigma^2} \| x_i - \mu_k \|_2^2\right\}
\end{align}
```
This time, however, the variance of the predictive distribution, $\sigma^2 + a_k$, differs from cluster to cluster, so the term involving the log of the variance can no longer be dropped. We therefore keep the expression in the following form:
```math
\begin{align}
P(z_i = k | x_{1:N}^{\backslash i}, z_{1:N}^{\backslash i}, \mu_{1:K}, \Sigma_{1:K}, \pi)
&\propto \pi_k \exp\left\{-\frac{D}{2}\log \sigma_k^2 - \frac{1}{2 \sigma_k^2} \| x_i - \mu_k \|_2^2\right\}
\end{align}
```
The `log_deformed_gaussian` method, which computes the contents of this `exp`, is implemented as follows.
```python
import numpy as np


def log_deformed_gaussian(x, mu, var):
    # log N(x | mu, var * I_D) with the constant term -D/2 * log(2 * pi) dropped
    D = x.shape[0]
    norm_squared = ((x - mu) * (x - mu)).sum(axis=1)
    return -D * np.log(var) / 2 - norm_squared / (2 * var)
```
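As a small check (the numbers here are made up, and this assumes NumPy, SciPy, and the `log_deformed_gaussian` defined above are available in the session), the returned values differ from `scipy.stats.multivariate_normal.logpdf` only by the constant $-\frac{D}{2}\log(2\pi)$, which is the same for every cluster and therefore cancels when the probabilities are normalized.

```python
import numpy as np
from scipy.stats import multivariate_normal

x = np.array([1.0, 2.0])                   # one data point (D = 2)
mu = np.array([[0.0, 0.0], [1.0, 2.5]])    # two hypothetical cluster means
var = np.array([0.3, 0.5])                 # their per-cluster variances

approx = log_deformed_gaussian(x, mu, var)
exact = np.array([multivariate_normal.logpdf(x, mean=m, cov=v * np.eye(2))
                  for m, v in zip(mu, var)])
print(exact - approx)  # both entries equal -log(2 * pi), i.e. -D/2 * log(2 * pi) with D = 2
```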
Based on the above, here is the full implementation. It is a little long.
```python
import numpy as np
from collections import Counter


def log_deformed_gaussian(x, mu, var):
    # log N(x | mu, var * I_D) with the constant term -D/2 * log(2 * pi) dropped
    D = x.shape[0]
    norm_squared = ((x - mu) * (x - mu)).sum(axis=1)
    return -D * np.log(var) / 2 - norm_squared / (2 * var)


class ClusterArray(object):
    def __init__(self, array):
        # array is a one-dimensional array-like of cluster labels
        self._array = np.array(array, dtype=int)
        self._counter = Counter(array)

    @property
    def array(self):
        return self._array.copy()

    def count(self, k):
        return self._counter[k]

    def __setitem__(self, i, k):
        # self._counter is updated together with the assignment
        pre_value = self._array[i]
        if pre_value == k:
            return
        if self._counter[pre_value] > 0:
            self._counter[pre_value] -= 1
        self._array[i] = k
        self._counter[k] += 1

    def __getitem__(self, i):
        return self._array[i]


# A class that performs clustering by marginalized Gibbs sampling
class GMMClustering(object):
    def __init__(self, X, K, var=1, var_pri=1, seed=None):
        self.X = X.copy()
        self.K = K
        self.N = X.shape[0]
        self.D = X.shape[1]
        self.z = None

        # Parameters of the probability distributions
        # self.mu is not defined because it is no longer used
        self.var = var  # fixed, common to all clusters
        self.pi = np.full(self.K, 1 / self.K)  # fixed, common to all clusters

        # Prior distribution settings
        self.mu_pri = np.zeros(self.D)
        self.var_pri = var_pri

        self._random = np.random.RandomState(seed)

    def fit(self, n_iter):
        init_z = self._random.randint(0, self.K, self.N)
        self.z = ClusterArray(init_z)

        for _ in range(n_iter):
            for i, x_i in enumerate(self.X):
                self.z[i] = self._sample_zi(i)

    def _sample_zi(self, i):
        n_vec = np.array([
            self.z.count(k) - bool(k == self.z.array[i])
            for k in range(self.K)
        ])
        x_bar_vec = np.array([self._mean_in_removed(k, i) for k in range(self.K)])

        tmp = 1 / (n_vec / self.var + 1 / self.var_pri)
        mu = np.tile(tmp, (self.D, 1)).T * (np.tile(n_vec, (self.D, 1)).T * x_bar_vec / self.var + self.mu_pri / self.var_pri)
        var = tmp + self.var

        log_probs_xi = log_deformed_gaussian(self.X[i], mu, var)
        probs_zi = np.exp(log_probs_xi) * self.pi
        probs_zi = probs_zi / probs_zi.sum()

        z_i = self._random.multinomial(1, probs_zi)
        z_i = np.where(z_i)[0][0]
        return z_i

    def _mean_in_removed(self, k, i):
        # Mean of the data belonging to cluster k in the dataset X with the i-th point removed
        members = np.array([
            x for j, x in enumerate(self.X)
            if (j != i) and (self.z[j] == k)
        ])
        if len(members) == 0:
            # If cluster k is empty, this mean is multiplied by n_k = 0 later, so return zeros to avoid NaN
            return np.zeros(self.D)
        return members.mean(axis=0)
```
As last time, I tried clustering the Iris dataset. The hyperparameters are $\sigma^2 = 0.1$ and $\sigma_{\rm pri}^2 = 1$, and the number of Gibbs sampling iterations is $10$. These are the same settings as last time. Comparing the labels of the actual dataset with the clustering results gives the following.
```python
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Load the dataset
df = sns.load_dataset('iris')
X = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']].values

# Clustering with the Gaussian mixture model
gmc = GMMClustering(X, K=3, var=0.05, seed=1)
gmc.fit(n_iter=10)
df['GMM_cluster'] = gmc.z.array

# Visualize the results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
for sp in df['species'].unique():
    x = df[df['species'] == sp]['petal_length']
    y = df[df['species'] == sp]['petal_width']
    ax1.scatter(x, y, label=sp)
ax1.legend()
ax1.set_title('species')
ax1.set_xlabel('petal_length')
ax1.set_ylabel('petal_width')
for k in range(gmc.K):
    x = df[df['GMM_cluster'] == k]['petal_length']
    y = df[df['GMM_cluster'] == k]['petal_width']
    ax2.scatter(x, y, label=k)
ax2.legend()
ax2.set_title('GMM cluster')
ax2.set_xlabel('petal_length')
ax2.set_ylabel('petal_width')
plt.show()
```
Visualizing the result with seaborn's `pairplot` gives the following.
```python
sns.pairplot(
    df.drop(columns=['species']),
    vars=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'],
    hue='GMM_cluster'
)
```
Even with marginalized Gibbs sampling, which does not sample $\mu_{1:K}$, the data can be clustered nicely.
I implemented marginalized Gibbs sampling from the Machine Learning Professional Series book "Nonparametric Bayes Point Process and Statistical Machine Learning Mathematics" and confirmed that it produces the same clustering as last time. Next time, I plan to implement the main topic of the book, nonparametric Bayes.
[^1]: In order to estimate the probability distribution of $\mu_k$ in a Bayesian way, a prior distribution must be specified.

[^2]: As last time, these are treated as fixed values, not random variables.

[^3]: The book describes how to derive this, but I worked through the calculation myself. It was quite hard.

[^4]: Last time I wrote down the probability distribution itself, but this time I wrote the proportional form. Once you get used to it, this form is easier to read.

[^6]: The variance of $\mu_k$ is large when the number of data points belonging to cluster $k$ is small. As an extreme example, marginalized Gibbs sampling can compute the probability that $z_i = k$ even when the number of data points belonging to cluster $k$ is $0$. This property is important for nonparametric Bayes.