I read the Machine Learning Professional Series book "Nonparametric Bayes: Point Processes and the Mathematics of Statistical Machine Learning" and implemented one of its models in Python. Nonparametric Bayes is an extension of the Gaussian mixture model. I ultimately want to implement nonparametric Bayes, but as preparation for that, this article implements a Gaussian mixture model.
-"Introduction to Machine Learning by Bayesian Inference" Approximate Inference of Poisson Mixed Model Implemented Only with Python numpy -Mixed Gaussian distribution and logsumexp
Before we get into the main story, let's take a look at some of the mathematical symbols used in this article that are rarely seen elsewhere.
- $x_{1:N}$: $x_1, \cdots, x_N$
- $x_{1:N}^{\backslash i}$: $x_{1:N}$ with $x_i$ removed
In addition, the vectors in this article are column vectors and are not shown in bold.
A Gaussian mixture model (GMM) models a dataset in which data generated from several Gaussian distributions are mixed together. Let's take the Iris dataset as an example. It records characteristics and species of irises and mixes three species: "setosa", "versicolor", and "virginica". Many readers will know it because it is often used in machine learning classification exercises. The dataset has four explanatory variables, but for simplicity we take only two of them, "petal_length" and "petal_width", and draw two scatter plots: one without labels and one color-coded by iris species.
```python
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

df = sns.load_dataset('iris')

ax1.scatter(df['petal_length'], df['petal_width'], color='gray')
ax1.set_title('without label')
ax1.set_xlabel('petal_length')
ax1.set_ylabel('petal_width')

for sp in df['species'].unique():
    x = df[df['species'] == sp]['petal_length']
    y = df[df['species'] == sp]['petal_width']
    ax2.scatter(x, y, label=sp)
ax2.legend()
ax2.set_title('with label')
ax2.set_xlabel('petal_length')
ax2.set_ylabel('petal_width')

plt.show()
```
Looking at the plot on the right, each data point can be interpreted as being generated from a multivariate Gaussian distribution[^1] whose mean differs by iris species. This is the Gaussian mixture model. Rough means are shown in the table below.
| type | petal_length | petal_width |
|---|---|---|
| setosa | about 1.5 | about 0.25 |
| versicolor | about 4.5 | about 1.3 |
| virginica | about 6.0 | about 2.0 |
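These rough means can be checked directly from the dataset; a minimal sketch (using the `df` loaded above, rounding only for display):

```python
import seaborn as sns

df = sns.load_dataset('iris')
# Per-species means of the two variables used in the scatter plots
print(df.groupby('species')[['petal_length', 'petal_width']].mean().round(2))
```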
By using a Gaussian mixture model, the means of the multiple Gaussian distributions can be estimated from the unlabeled dataset on the left, and each data point can be associated with one of the Gaussian distributions, that is, clustered.
The Gaussian mixture model is described as a probabilistic model. Let $z_i$ be the cluster of data point $x_i$, and let $N(\mu_k, \Sigma_k)$ be the multivariate Gaussian distribution corresponding to each cluster $k = 1, \cdots, K$. In other words, when $x_i$ belongs to cluster $k$, its generation probability can be written as follows. Here $D$ is the dimension of $x_i$ and $N$ is the number of data points.
```math
\begin{align}
P(x_i | z_i=k, x_{1:N}^{\backslash i}, z_{1:N}^{\backslash i}, \mu_{1:K}, \Sigma_{1:K})
&= N(x_i | \mu_k, \Sigma_k)
\end{align}
```
On the other hand, $z_i$ is also expressed probabilistically. Suppose $z_i$ is generated from a categorical distribution with parameter $\pi := (\pi_1, \cdots, \pi_K)^T$, where $K$ is the number of clusters.
```math
P(z_i = k | \pi) = \pi_k
```
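As a minimal sketch of what a single draw from this categorical distribution looks like in code (the values of `pi` here are purely illustrative):

```python
import numpy as np

rng = np.random.RandomState(0)
pi = np.array([0.2, 0.5, 0.3])    # categorical parameters, must sum to 1
one_hot = rng.multinomial(1, pi)  # one-hot vector, e.g. array([0, 1, 0])
z_i = np.where(one_hot)[0][0]     # index of the sampled cluster
```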
From the above, the probability that the cluster $z_i$ of data point $x_i$ equals $k$ can be written as follows.[^2]
```math
P(z_i = k | x_{1:N}^{\backslash i}, z_{1:N}^{\backslash i}, \mu_{1:K}, \Sigma_{1:K}, \pi)
= \frac{\pi_k N(x_i| \mu_k, \Sigma_k)}{\sum_{l=1}^K \pi_l N(x_i| \mu_l, \Sigma_l)} \qquad\cdots (1)
```
Clustering is done by estimating each $z_i$ from the dataset using this formula.
To estimate the cluster $z_{1:N}$ of each data point from equation $(1)$, we need to know the parameters $\mu_{1:K}, \Sigma_{1:K}$ of the multivariate Gaussian distributions corresponding to each cluster, as well as the categorical distribution parameter $\pi$. These are not given explicitly, so we must either estimate them from the dataset or assume appropriate values. This time, we estimate the means $\mu_{1:K}$ of the multivariate Gaussian distributions from the dataset and fix the remaining parameters to the following values, where $I_D$ is the $D \times D$ identity matrix and $\sigma^2$ is a hyperparameter.[^6]
```math
\Sigma_1 = \cdots = \Sigma_K = \sigma^2 I_D \\
\pi = (1 / K, \cdots, 1 / K)^T
```
Then, to estimate $\mu_{1:K}$ and $z_{1:N}$, we adopt the approach of sampling them one at a time from their conditional distributions. This method is called **Gibbs sampling**.[^9][^10]
Since $z_i$ can be sampled according to equation $(1)$, let us consider the probability distribution of $\mu_k$. We want to estimate a probability distribution, so Bayesian estimation can be used. The mean $\mu_{\rm pri}$ and covariance matrix $\Sigma_{\rm pri}$ of the conjugate prior distribution of $\mu_k$ are set as follows.[^7] They are common to all $k = 1, \cdots, K$, and $\sigma_{\rm pri}^2$ is a hyperparameter.
```math
\begin{align}
\mu_{\rm pri} &= (0, \cdots, 0)^T \\
\Sigma_{\rm pri} &= \sigma_{\rm pri}^2 I_D
\end{align}
```
The posterior distribution of $\mu_k$ is then as follows, where $n_k$ is the number of data points in $x_{1:N}$ belonging to cluster $k$ and $\bar{x}_k$ is their mean.
```math
\begin{align}
\mu_{k, {\rm pos}} &= \Sigma_{k, {\rm pos}}(n_k \Sigma_k^{-1} \overline{x}_k + \Sigma_{\rm pri}^{-1}\mu_{\rm pri}) \\
\Sigma_{k, {\rm pos}} &= (n_k \Sigma_{k}^{-1} + \Sigma_{\rm pri}^{-1})^{-1} \\
\mu_k &\sim N(\mu | \mu_{k, {\rm pos}}, \Sigma_{k, {\rm pos}}) \qquad\cdots (2)
\end{align}
```
This time, $\Sigma_k$ and $\Sigma_{\rm pri}$ are both constant multiples of $I_D$, so $\mu_{k, {\rm pos}}$ and $\Sigma_{k, {\rm pos}}$ simplify to the following.
```math
\begin{align}
\mu_{k, {\rm pos}} &= \Sigma_{k, {\rm pos}}\left(\frac{n_k}{\sigma^2} \overline{x}_k + \frac{1}{\sigma_{\rm pri}^2}\mu_{\rm pri}\right) \\
\Sigma_{k, {\rm pos}} &= \left(\frac{n_k}{\sigma^2} + \frac{1}{\sigma_{\rm pri}^{2}}\right)^{-1} I_D
\end{align}
```
From the above, $\mu_k$ should be sampled according to equation $(2)$.
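As a minimal sketch of this sampling step (with illustrative variable names; it corresponds to `_sample_mu_k` in the implementation below):

```python
import numpy as np

rng = np.random.RandomState(0)

def sample_mu_k(xk_bar, n_k, var, var_pri, mu_pri):
    # Posterior under Sigma_k = var * I_D and Sigma_pri = var_pri * I_D
    var_pos = 1 / (n_k / var + 1 / var_pri)
    mu_pos = var_pos * (n_k * xk_bar / var + mu_pri / var_pri)
    return rng.multivariate_normal(mu_pos, var_pos * np.eye(len(mu_pos)))
```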
Here are some important points in the implementation.
To evaluate equation $(2)$, we need the number of data points currently assigned to each cluster. So we implement an array-like class `ClusterArray` with a `count` method that returns that number.[^8] Besides `count`, the methods we are likely to need are delegated to an internal `numpy.ndarray`.
```python
import numpy as np
from collections import Counter


class ClusterArray(object):
    def __init__(self, array):
        # array: a one-dimensional list of cluster labels
        self._array = np.array(array, dtype=int)
        self._counter = Counter(array)

    @property
    def array(self):
        return self._array.copy()

    def count(self, k):
        return self._counter[k]

    def __setitem__(self, i, k):
        # Keep self._counter consistent with the assignment
        pre_value = self._array[i]
        if pre_value == k:
            return
        if self._counter[pre_value] > 0:
            self._counter[pre_value] -= 1
        self._array[i] = k
        self._counter[k] += 1

    def __getitem__(self, i):
        return self._array[i]
```
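A quick usage sketch of `ClusterArray` (the values are illustrative):

```python
z = ClusterArray([0, 1, 1, 2])
print(z.count(1))              # 2
z[0] = 1                       # reassign data point 0 to cluster 1
print(z.count(0), z.count(1))  # 0 3
print(z.array)                 # [1 1 1 2]
```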
Equation $(1)$ could be evaluated by computing the probability density function directly, but the computation can be reduced by dropping the factors that do not depend on $k$.
```math
\begin{align}
P(z_i = k | x_{1:N}^{\backslash i}, z_{1:N}^{\backslash i}, \mu_{1:K}, \Sigma_{1:K}, \pi)
&= \frac{\pi_k \exp\{- \frac{1}{2}\log|\Sigma_k| - \frac{1}{2} (x_i - \mu_k)^T\Sigma_k^{-1}(x_i - \mu_k)\}}{\sum_{l=1}^K \pi_l \exp\{- \frac{1}{2}\log|\Sigma_l| - \frac{1}{2} (x_i - \mu_l)^T\Sigma_l^{-1}(x_i - \mu_l)\}} \\
&= \frac{\pi_k \exp\{- \frac{1}{2 \sigma^2} \| x_i - \mu_k \|_2^2\}}{\sum_{l=1}^{K}\pi_l \exp\{- \frac{1}{2 \sigma^2} \| x_i - \mu_l \|_2^2\}}
\end{align}
```
We implement a function `log_deformed_gaussian` that computes the contents of the $\exp$ in this expression and use it in the calculation.
```python
def log_deformed_gaussian(x, mu, var):
    # Log of the Gaussian density, up to factors that do not depend on the cluster
    norm_squared = ((x - mu) * (x - mu)).sum(axis=1)
    return -norm_squared / (2 * var)
```
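For example, the cluster probabilities of a single data point (the right-hand side of the simplified expression above) can then be computed with this function; the values of `x_i`, `mu`, `pi`, and `var` below are only illustrative:

```python
import numpy as np

x_i = np.array([1.5, 0.25])
mu = np.array([[1.5, 0.25], [4.5, 1.3], [6.0, 2.0]])  # K x D matrix of cluster means
pi = np.full(3, 1 / 3)
var = 0.1

log_probs = log_deformed_gaussian(x_i, mu, var)  # shape (K,)
probs = pi * np.exp(log_probs)
probs /= probs.sum()  # probabilities of z_i = 0, 1, 2
```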
For calculations of the form $\log(\sum_{i} \exp f_i(x))$, logsumexp is a well-known technique for preventing overflow and underflow. However, it is not very efficient because it applies $\log$ and $\exp$ many times, so we do not use logsumexp this time.[^5]
For logsumexp, the article "Mixed Gaussian distribution and logsumexp" linked at the top is very helpful.
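For reference, a minimal sketch of the logsumexp trick itself (not used in the implementation in this article):

```python
import numpy as np

def logsumexp(a):
    # Subtract the maximum before exponentiating so that exp never overflows
    a_max = a.max()
    return a_max + np.log(np.exp(a - a_max).sum())

log_p = np.array([-1000.0, -1001.0, -1002.0])
probs = np.exp(log_p - logsumexp(log_p))  # normalized even though np.exp(log_p) underflows
```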
I implemented the model with the above points in mind. Following the scikit-learn interface, clustering is run with a `fit` method. The source code is a little long, so it is folded below.
```python
import numpy as np
from collections import Counter


def log_deformed_gaussian(x, mu, var):
    # Log of the Gaussian density, up to factors that do not depend on the cluster
    norm_squared = ((x - mu) * (x - mu)).sum(axis=1)
    return -norm_squared / (2 * var)


class ClusterArray(object):
    def __init__(self, array):
        # array: a one-dimensional list of cluster labels
        self._array = np.array(array, dtype=int)
        self._counter = Counter(array)

    @property
    def array(self):
        return self._array.copy()

    def count(self, k):
        return self._counter[k]

    def __setitem__(self, i, k):
        # Keep self._counter consistent with the assignment
        pre_value = self._array[i]
        if pre_value == k:
            return
        if self._counter[pre_value] > 0:
            self._counter[pre_value] -= 1
        self._array[i] = k
        self._counter[k] += 1

    def __getitem__(self, i):
        return self._array[i]


class GaussianMixtureClustering(object):
    def __init__(self, K, D, var=1, var_pri=1, seed=None):
        self.K = K  # Number of clusters
        self.D = D  # Dimension of the explanatory variables (set in the constructor for ease of implementation)
        self.z = None

        # Parameters of the probability distributions
        self.mu = np.zeros((self.K, self.D))
        self.var = var  # Fixed, common to all clusters
        self.pi = np.full(self.K, 1 / self.K)  # Fixed, uniform over clusters

        # Prior distribution settings
        self.mu_pri = np.zeros(self.D)
        self.var_pri = var_pri

        self._random = np.random.RandomState(seed)

    def fit(self, X, n_iter):
        init_z = self._random.randint(0, self.K, X.shape[0])
        self.z = ClusterArray(init_z)

        for _ in range(n_iter):
            for k in range(self.K):
                self.mu[k] = self._sample_mu_k(X, k)
            for i, x_i in enumerate(X):
                self.z[i] = self._sample_zi(x_i)

    def _sample_zi(self, x_i):
        log_probs_xi = log_deformed_gaussian(x_i, self.mu, self.var)

        probs_zi = np.exp(log_probs_xi) * self.pi
        probs_zi = probs_zi / probs_zi.sum()

        z_i = self._random.multinomial(1, probs_zi)
        z_i = np.where(z_i)[0][0]
        return z_i

    def _sample_mu_k(self, X, k):
        xk_bar = np.array([x for i, x in enumerate(X) if self.z[i] == k]).mean(axis=0)
        var_pos = 1 / (self.z.count(k) / self.var + 1 / self.var_pri)
        mu_pos = var_pos * (xk_bar * self.z.count(k) / self.var + self.mu_pri / self.var_pri)

        mu_k = self._random.multivariate_normal(mu_pos, var_pos * np.eye(self.D))
        return mu_k
```
Let's cluster the Iris dataset from the introduction with the implemented Gaussian mixture model. The hyperparameters are $\sigma^2 = 0.1$ and $\sigma_{\rm pri}^2 = 1$, and the number of Gibbs sampling iterations is 10. Comparing the actual dataset labels with the clustering results gives the following.
```python
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Load the dataset
df = sns.load_dataset('iris')
X = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']].values

# Clustering with the Gaussian mixture model
gmc = GaussianMixtureClustering(K=3, D=4, var=0.1, seed=1)
gmc.fit(X, n_iter=10)
df['GMM_cluster'] = gmc.z.array

# Visualize the results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

for sp in df['species'].unique():
    x = df[df['species'] == sp]['petal_length']
    y = df[df['species'] == sp]['petal_width']
    ax1.scatter(x, y, label=sp)
ax1.legend()
ax1.set_title('species')
ax1.set_xlabel('petal_length')
ax1.set_ylabel('petal_width')

for k in range(gmc.K):
    x = df[df['GMM_cluster'] == k]['petal_length']
    y = df[df['GMM_cluster'] == k]['petal_width']
    ax2.scatter(x, y, label=k)
ax2.legend()
ax2.set_title('GMM cluster')
ax2.set_xlabel('petal_length')
ax2.set_ylabel('petal_width')

plt.show()
```
The boundary between "versicolor" and "virginica" is a little uncertain, but the clustering almost perfectly matches the labels.
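To check the agreement numerically, one can cross-tabulate the true species against the estimated clusters (a small sketch using the `df` from the code above):

```python
import pandas as pd

print(pd.crosstab(df['species'], df['GMM_cluster']))
```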
I also visualized the clustering result with `pairplot` from `seaborn`.
```python
sns.pairplot(
    df.drop(columns=['species']),
    vars=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'],
    hue='GMM_cluster'
)
```
The clustering works nicely. Looking at the diagonal plots of the `pairplot`, you can see that the dataset is well represented by a Gaussian mixture model.
Based on the Machine Learning Professional Series book "Nonparametric Bayes: Point Processes and the Mathematics of Statistical Machine Learning", I implemented a Gaussian mixture model and tried simple clustering with it. This time we dealt with a Gaussian mixture model, but you can also build models such as a "Bernoulli mixture model" or a "Poisson mixture model" and use them for clustering in the same way. Next time, I will write about marginalized (collapsed) Gibbs sampling, a technique needed for nonparametric Bayes.
(Added January 21, 2020) The follow-up article is now available: [Python] I implemented marginalized Gibbs sampling
[^1]: The variances also differ, but for simplicity we focus only on the means.
[^2]: This can be derived using Bayes' theorem.
[^5]: I decided it is acceptable not to use logsumexp, because overflow and underflow can be avoided by choosing a $\sigma^2$ that is neither too large nor too small for the dataset.
[^6]: $\sigma^2$ is the variance, and all covariances are $0$. This may seem like a rough assumption, but if $\sigma^2$ is chosen sensibly, clustering still works well enough, and the simplification reduces the amount of computation. Of course, estimating these from the dataset should allow more accurate clustering.
[^7]: The conjugate prior of $\mu_k$ is a multivariate Gaussian distribution.
[^8]: It would also be fine to keep the `Counter` instance outside without creating a class like this; I simply like to encapsulate such things in classes.
[^9]: Gibbs sampling is a technique for approximating joint distributions that are difficult to compute analytically. Here we approximate $P(\mu_{1:K}, z_{1:N} | \cdots)$.
[^10]: Other methods can be used to estimate a Gaussian mixture model, but my understanding is that Gibbs sampling is needed for the extension to nonparametric Bayes. To be honest, I do not fully understand this, so please let me know if you do.