This article is a sequel to my earlier article [Python] Implementation of clustering using a mixed Gaussian model. I am reading the Machine Learning Professional Series book "Nonparametric Bayes Point Process and Statistical Machine Learning Mathematics" and implementing its models in Python. This time, we improve the model implemented last time and perform clustering with **marginalized Gibbs sampling**.
Here is some notation used in this article that you may not often see elsewhere:

- $x_{1:N}$: $x_1, \cdots, x_N$
- $x_{1:N}^{\backslash i}$: $x_{1:N}$ with $x_i$ removed
In addition, the vectors in this article are column vectors and are not shown in bold.
Before we talk about marginalization, let's revisit the previous Gibbs sampling in a little more detail.
What we did last time was to assume that the dataset $x_{1:N}$ is generated by a Gaussian mixture model with $K$ clusters, and to estimate the mean $\mu_{1:K}$ of the Gaussian distribution corresponding to each cluster together with the cluster $z_{1:N}$ of each data point. The joint distribution of $z_{1:N}$ and $\mu_{1:K}$ is the following conditional probability, which is difficult to handle analytically.
```math
P(\mu_{1:K}, z_{1:N} | x_{1:N}, \Sigma_{1:K}, \pi)
```
In the above equation, $\Sigma_{1:K}$ are the covariance matrices of the Gaussian distributions, and $\pi := (\pi_1, \cdots, \pi_K)^T$ is the parameter of the categorical distribution that each $z_i$ follows.[^2] Therefore, instead of estimating $\mu_{1:K}$ and $z_{1:N}$ all at once, we sample the variables $\mu_1, \cdots, \mu_K, z_1, \cdots, z_N$ one at a time in turn, and estimate the whole by repeating this. This technique is called **Gibbs sampling**. The estimation is therefore carried out by repeating the following two steps.
- For each $k = 1, \cdots, K$, sample $\mu_k$ from $P(\mu_k | \mu_{1:K}^{\backslash k}, z_{1:N}, x_{1:N}, \Sigma_{1:K}, \pi, \mu_{\rm pri}, \Sigma_{\rm pri})$
- For each $i = 1, \cdots, N$, sample $z_i$ from $P(z_i | z_{1:N}^{\backslash i}, \mu_{1:K}, x_{1:N}, \Sigma_{1:K}, \pi)$
$\mu_{\rm pri}$ and $\Sigma_{\rm pri}$ are the parameters of the prior distribution of $\mu_k$.[^1]
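As a rough sketch of this loop structure (this is not the previous article's code; the one-dimensional toy data, $K = 2$, and all hyperparameter values below are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data from two well-separated Gaussians
X = np.concatenate([rng.normal(-2.0, 0.3, 50), rng.normal(2.0, 0.3, 50)])
N, K = len(X), 2
sigma2, sigma2_pri, mu_pri = 0.3 ** 2, 1.0, 0.0   # likelihood variance, prior variance, prior mean
pi = np.full(K, 1 / K)

z = rng.integers(0, K, N)         # random initial cluster assignments
mu = rng.normal(mu_pri, 1.0, K)   # initial cluster means

for _ in range(20):
    # Step 1: sample each cluster mean from its Gaussian posterior given the assignments
    for k in range(K):
        members = X[z == k]
        prec = len(members) / sigma2 + 1 / sigma2_pri
        mean = (members.sum() / sigma2 + mu_pri / sigma2_pri) / prec
        mu[k] = rng.normal(mean, np.sqrt(1 / prec))
    # Step 2: sample each assignment from its categorical conditional given the means
    for i in range(N):
        logp = np.log(pi) - (X[i] - mu) ** 2 / (2 * sigma2)
        p = np.exp(logp - logp.max())
        z[i] = rng.choice(K, p=p / p.sum())

print(np.round(np.sort(mu), 2))   # should land near the two true means (-2 and 2)
```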
In the Gibbs sampling above, it is the second step that actually estimates the clusters of the data; the first step is auxiliary work needed to compute them. Now, using a technique called **marginalization**, consider sampling $z_{1:N}$ in the second step directly, without the first step of sampling $\mu_{1:K}$. This technique is called **marginalized Gibbs sampling**. Marginalized Gibbs sampling repeats a single step:
- For each $i = 1, \cdots, N$, sample $z_i$ from $P(z_i | z_{1:N}^{\backslash i}, x_{1:N}, \Sigma_{1:K}, \pi, \mu_{\rm pri}, \Sigma_{\rm pri})$
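Here is the same toy problem sampled with this single marginalized step (again a made-up sketch; the predictive mean and variance inside the loop come from the formula derived later in this article, and the full multivariate implementation appears further below):

```python
import numpy as np

rng = np.random.default_rng(0)

# Same toy 1-D data as in the previous sketch
X = np.concatenate([rng.normal(-2.0, 0.3, 50), rng.normal(2.0, 0.3, 50)])
N, K = len(X), 2
sigma2, sigma2_pri, mu_pri = 0.3 ** 2, 1.0, 0.0
pi = np.full(K, 1 / K)

z = rng.integers(0, K, N)   # random initial cluster assignments; no mu is kept anywhere

for _ in range(20):
    for i in range(N):
        others = np.arange(N) != i
        logp = np.empty(K)
        for k in range(K):
            members = X[others & (z == k)]          # cluster k with x_i removed
            a_k = 1 / (len(members) / sigma2 + 1 / sigma2_pri)
            mean = a_k * (members.sum() / sigma2 + mu_pri / sigma2_pri)
            var = sigma2 + a_k                      # predictive variance
            logp[k] = np.log(pi[k]) - np.log(var) / 2 - (X[i] - mean) ** 2 / (2 * var)
        p = np.exp(logp - logp.max())
        z[i] = rng.choice(K, p=p / p.sum())

print(np.bincount(z, minlength=K))   # cluster sizes; the 50/50 structure should be recovered
```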
With marginalized Gibbs sampling, even when the variance of $\mu_{1:K}$ is large[^6], $z_{1:N}$ can be sampled without being affected by unstable samples of $\mu_{1:K}$. You may wonder how estimation is possible without $\mu_{1:K}$; the intuition is that the information about $\mu_{1:K}$ is carried by $z_{1:N}^{\backslash i}, x_{1:N}, \Sigma_{1:K}, \pi, \mu_{\rm pri}, \Sigma_{\rm pri}$, so we use these to estimate $z_i$ (without using $\mu_{1:K}$). Now, let us set Gibbs sampling aside for a moment and talk about marginalization.
Let $x \in X$ and $y \in Y$ be continuous random variables, and suppose we know the joint distribution $P(x, y)$. Eliminating a random variable using integration, as follows, is called **marginalization**.
```math
P(x) = \int_{Y}P(x, y)dy
```
Transforming the right-hand side using Bayes' theorem and writing it with conditional probabilities gives the following. Use this form when you know the conditional probability rather than the joint distribution.
```math
P(x) = \int_{Y}P(x|y)P(y)dy
```
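As a quick numerical check of this formula (a sketch with made-up numbers, assuming SciPy is available): when $P(x|y)$ is a Gaussian with mean $y$ and $P(y)$ is itself a Gaussian, the integral collapses to a single Gaussian whose variance is the sum of the two variances, which is exactly the kind of result used for the predictive distribution later on.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

sigma, sigma_pri, mu_pri = 0.5, 1.0, 2.0   # made-up values
x = 1.3

# Left-hand side: integrate P(x|y) * P(y) over y numerically
lhs, _ = quad(lambda y: norm.pdf(x, loc=y, scale=sigma) * norm.pdf(y, loc=mu_pri, scale=sigma_pri),
              -np.inf, np.inf)

# The integral is again a Gaussian, with the two variances added
rhs = norm.pdf(x, loc=mu_pri, scale=np.sqrt(sigma ** 2 + sigma_pri ** 2))

print(lhs, rhs)   # the two values agree up to numerical error
```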
Incidentally, if the random variable $y$ is discrete, marginalization is formulated as follows. This may be easier to picture.
```math
P(x) = \sum_{y \in Y}P(x, y) = \sum_{y \in Y}P(x|y)P(y)
```
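For example, with small made-up tables for $P(y)$ and $P(x|y)$, the marginal $P(x)$ is just the conditional distributions averaged with weights $P(y)$:

```python
import numpy as np

p_y = np.array([0.2, 0.5, 0.3])        # P(y) for three values of y
p_x_given_y = np.array([[0.9, 0.1],    # P(x|y=0) for two values of x
                        [0.4, 0.6],    # P(x|y=1)
                        [0.2, 0.8]])   # P(x|y=2)

p_x = p_x_given_y.T @ p_y              # P(x) = sum_y P(x|y) P(y)
print(p_x, p_x.sum())                  # [0.44 0.56], sums to 1
```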
Let us return to Gibbs sampling. We perform marginalization to obtain the probability distribution of $z_i$ without $\mu_{1:K}$. Here $D$ is the dimension of $x_i$ and $\mu_k$. Since the conditional probabilities have many conditioning variables, most of them are abbreviated as $\cdots$.
```math
P(z_i=k | \cdots)
= \int_{\mathbb{R}^D} P(z_i=k | \mu_{1:K}, \cdots)P(\mu_{1:K} | \cdots)d\mu_{1:K}
```
As last time, assuming $\Sigma_k = \sigma^2 I_D$ (common to all clusters) and $\Sigma_{\rm pri} := \sigma_{\rm pri}^2 I_D$, this can be computed as follows.[^3]
```math
\begin{align}
P(z_i=k | \cdots)
&\propto \pi_k N\left(x_i \,\middle|\, a_k\left(\frac{n_k^{\backslash i}}{\sigma^2}\overline{x_k}^{\backslash i} + \frac{1}{\sigma_{\rm pri}^2} \mu_{\rm pri} \right),
\left( \sigma^2 + a_k \right)I_D\right) \\
a_k &:= \left( \frac{n_k^{\backslash i}}{\sigma^2} + \frac{1}{\sigma_{\rm pri}^2} \right)^{-1}
\end{align}
```
$n_k^{\backslash i}$ is the number of data points belonging to cluster $k$ in the dataset obtained by removing $x_i$ from $x_{1:N}$, and $\overline{x_k}^{\backslash i}$ is their mean.
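To make these quantities concrete, here is the computation of $a_k$, the predictive mean, and the predictive variance for a single cluster with made-up values; the `_sample_zi` method in the implementation below does exactly this for all clusters at once.

```python
import numpy as np

# Made-up values for one cluster k, with D = 2
sigma2, sigma2_pri = 0.1, 1.0
mu_pri = np.zeros(2)
n_k = 5                          # number of points in cluster k, excluding x_i
x_bar_k = np.array([1.2, 0.8])   # their mean

a_k = 1 / (n_k / sigma2 + 1 / sigma2_pri)                         # (n_k / sigma^2 + 1 / sigma_pri^2)^(-1)
pred_mean = a_k * (n_k / sigma2 * x_bar_k + mu_pri / sigma2_pri)  # mean of the predictive Gaussian
pred_var = sigma2 + a_k                                           # its variance (times the identity)

print(a_k, pred_mean, pred_var)
```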
We now implement marginalized Gibbs sampling, building on the previous implementation. There are two major changes from last time.
- Removal of the class attribute `mu` (it is no longer needed because it is not sampled)
- A change to the `log_deformed_gaussian` method
The first change is the core idea of marginalized Gibbs sampling and is the more important one.
Let me explain the second change. Last time, since we only needed the ratios of the probability density values of the normal distributions, the computation ignored factors that do not depend on the cluster $k$. I repost the previous formula.[^4]
```math
\begin{align}
P(z_i = k | x_{1:N}^{\backslash i}, z_{1:N}^{\backslash i}, \mu_{1:K}, \Sigma_{1:K}, \pi)
&\propto \pi_k \exp\left\{- \frac{1}{2}\log|\Sigma_k| - \frac{1}{2} (x_i - \mu_k)^T\Sigma_k^{-1}(x_i - \mu_k)\right\} \\
&\propto \pi_k \exp\left\{- \frac{1}{2 \sigma^2} \| x_i - \mu_k \|_2^2\right\}
\end{align}
```
This time, however, the variance of the predictive distribution, $\sigma^2 + a_k$, differs from cluster to cluster, so the term involving the log of the variance can no longer be dropped. We therefore keep the expression in the following form:
```math
\begin{align}
P(z_i = k | x_{1:N}^{\backslash i}, z_{1:N}^{\backslash i}, \mu_{1:K}, \Sigma_{1:K}, \pi)
&\propto \pi_k \exp\left\{-\frac{D}{2}\log \sigma_k^2 - \frac{1}{2 \sigma_k^2} \| x_i - \mu_k \|_2^2\right\}
\end{align}
```
The `log_deformed_gaussian` method, which computes the contents of this `exp`, is implemented as follows.
```python
import numpy as np


def log_deformed_gaussian(x, mu, var):
    # log N(x | mu, var * I_D) with the constant term -D/2 * log(2 * pi) dropped
    D = x.shape[0]
    norm_squared = ((x - mu) * (x - mu)).sum(axis=1)
    return -D * np.log(var) / 2 - norm_squared / (2 * var)
```
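As a small check (the numbers here are made up, and this assumes NumPy, SciPy, and the `log_deformed_gaussian` defined above are available in the session), the returned values differ from `scipy.stats.multivariate_normal.logpdf` only by the constant $-\frac{D}{2}\log(2\pi)$, which is the same for every cluster and therefore cancels when the probabilities are normalized.

```python
import numpy as np
from scipy.stats import multivariate_normal

x = np.array([1.0, 2.0])                   # one data point (D = 2)
mu = np.array([[0.0, 0.0], [1.0, 2.5]])    # two hypothetical cluster means
var = np.array([0.3, 0.5])                 # their per-cluster variances

approx = log_deformed_gaussian(x, mu, var)
exact = np.array([multivariate_normal.logpdf(x, mean=m, cov=v * np.eye(2))
                  for m, v in zip(mu, var)])
print(exact - approx)  # both entries equal -log(2 * pi), i.e. -D/2 * log(2 * pi) with D = 2
```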
Based on the above, here is the full implementation. It is a little long.
```python
import numpy as np
from collections import Counter


def log_deformed_gaussian(x, mu, var):
    # log N(x | mu, var * I_D) with the constant term -D/2 * log(2 * pi) dropped
    D = x.shape[0]
    norm_squared = ((x - mu) * (x - mu)).sum(axis=1)
    return -D * np.log(var) / 2 - norm_squared / (2 * var)


class ClusterArray(object):
    def __init__(self, array):
        # array is a one-dimensional array-like of cluster labels
        self._array = np.array(array, dtype=int)
        self._counter = Counter(array)

    @property
    def array(self):
        return self._array.copy()

    def count(self, k):
        return self._counter[k]

    def __setitem__(self, i, k):
        # self._counter is updated together with the assignment
        pre_value = self._array[i]
        if pre_value == k:
            return
        if self._counter[pre_value] > 0:
            self._counter[pre_value] -= 1
        self._array[i] = k
        self._counter[k] += 1

    def __getitem__(self, i):
        return self._array[i]


# A class that performs clustering by marginalized Gibbs sampling
class GMMClustering(object):
    def __init__(self, X, K, var=1, var_pri=1, seed=None):
        self.X = X.copy()
        self.K = K
        self.N = X.shape[0]
        self.D = X.shape[1]
        self.z = None

        # Parameters of the probability distributions
        # self.mu is not defined because it is no longer used
        self.var = var  # fixed, common to all clusters
        self.pi = np.full(self.K, 1 / self.K)  # fixed, common to all clusters

        # Prior distribution settings
        self.mu_pri = np.zeros(self.D)
        self.var_pri = var_pri

        self._random = np.random.RandomState(seed)

    def fit(self, n_iter):
        init_z = self._random.randint(0, self.K, self.N)
        self.z = ClusterArray(init_z)

        for _ in range(n_iter):
            for i, x_i in enumerate(self.X):
                self.z[i] = self._sample_zi(i)

    def _sample_zi(self, i):
        n_vec = np.array([
            self.z.count(k) - bool(k == self.z.array[i])
            for k in range(self.K)
        ])
        x_bar_vec = np.array([self._mean_in_removed(k, i) for k in range(self.K)])

        tmp = 1 / (n_vec / self.var + 1 / self.var_pri)
        mu = np.tile(tmp, (self.D, 1)).T * (np.tile(n_vec, (self.D, 1)).T * x_bar_vec / self.var + self.mu_pri / self.var_pri)
        var = tmp + self.var

        log_probs_xi = log_deformed_gaussian(self.X[i], mu, var)
        probs_zi = np.exp(log_probs_xi) * self.pi
        probs_zi = probs_zi / probs_zi.sum()

        z_i = self._random.multinomial(1, probs_zi)
        z_i = np.where(z_i)[0][0]
        return z_i

    def _mean_in_removed(self, k, i):
        # Mean of the data belonging to cluster k in the dataset X with the i-th point removed
        members = np.array([
            x for j, x in enumerate(self.X)
            if (j != i) and (self.z[j] == k)
        ])
        if len(members) == 0:
            # If cluster k is empty, this mean is multiplied by n_k = 0 later, so return zeros to avoid NaN
            return np.zeros(self.D)
        return members.mean(axis=0)
```
As last time, I tried clustering the Iris dataset. The hyperparameters are $\sigma^2 = 0.1$ and $\sigma_{\rm pri}^2 = 1$, and the number of Gibbs sampling iterations is $10$. These are the same settings as last time. Comparing the labels of the actual dataset with the clustering results gives the following.
```python
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Load the dataset
df = sns.load_dataset('iris')
X = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']].values

# Clustering with the Gaussian mixture model
gmc = GMMClustering(X, K=3, var=0.05, seed=1)
gmc.fit(n_iter=10)
df['GMM_cluster'] = gmc.z.array

# Visualize the results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
for sp in df['species'].unique():
    x = df[df['species'] == sp]['petal_length']
    y = df[df['species'] == sp]['petal_width']
    ax1.scatter(x, y, label=sp)
ax1.legend()
ax1.set_title('species')
ax1.set_xlabel('petal_length')
ax1.set_ylabel('petal_width')
for k in range(gmc.K):
    x = df[df['GMM_cluster'] == k]['petal_length']
    y = df[df['GMM_cluster'] == k]['petal_width']
    ax2.scatter(x, y, label=k)
ax2.legend()
ax2.set_title('GMM cluster')
ax2.set_xlabel('petal_length')
ax2.set_ylabel('petal_width')
plt.show()
```
Visualizing the result with seaborn's `pairplot` gives the following.
```python
sns.pairplot(
    df.drop(columns=['species']),
    vars=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'],
    hue='GMM_cluster'
)
```
Even with marginalized Gibbs sampling, which does not sample $\mu_{1:K}$, the data can be clustered nicely.
I implemented marginalized Gibbs sampling from the Machine Learning Professional Series book "Nonparametric Bayes Point Process and Statistical Machine Learning Mathematics" and confirmed that it produces the same clustering as last time. Next time, I plan to implement the main topic of the book, nonparametric Bayes.
[^1]: In order to estimate the probability distribution of $\mu_k$ in a Bayesian way, a prior distribution must be specified.

[^2]: As last time, these are treated as fixed values, not random variables.

[^3]: The book describes how to derive this, but I worked through the calculation myself. It was quite hard.

[^4]: Last time I wrote down the probability distribution itself, but this time I wrote the proportional form. Once you get used to it, this form is easier to read.

[^6]: The variance of $\mu_k$ is large when the number of data points belonging to cluster $k$ is small. As an extreme example, marginalized Gibbs sampling can compute the probability that $z_i = k$ even when the number of data points belonging to cluster $k$ is $0$. This property is important for nonparametric Bayes.