I read the Machine Learning Professional Series book "Nonparametric Bayes: Point Processes and the Mathematics of Statistical Machine Learning" and implemented one of its models in Python. Nonparametric Bayes is an extension of the Gaussian mixture model. I ultimately want to implement nonparametric Bayes, but as preparation for that, this article implements a Gaussian mixture model.
-"Introduction to Machine Learning by Bayesian Inference" Approximate Inference of Poisson Mixed Model Implemented Only with Python numpy -Mixed Gaussian distribution and logsumexp
Before we get into the main story, let's take a look at some of the mathematical symbols used in this article that are rarely seen elsewhere.
- $x_{1:N}$: $x_1, \cdots, x_N$
- $x_{1:N}^{\backslash i}$: $x_{1:N}$ with $x_i$ removed
In addition, the vectors in this article are column vectors and are not shown in bold.
A Gaussian mixture model (GMM) models a dataset in which data generated from several Gaussian distributions are mixed together. Let's take the Iris dataset as an example. It records characteristics and species of irises and mixes three species: "setosa", "versicolor", and "virginica". Many readers will know it because it is often used in machine learning classification exercises. The dataset has four explanatory variables, but for simplicity we take only two of them, "petal_length" and "petal_width", and draw two scatter plots: one without labels and one color-coded by iris species.
```python
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

df = sns.load_dataset('iris')

ax1.scatter(df['petal_length'], df['petal_width'], color='gray')
ax1.set_title('without label')
ax1.set_xlabel('petal_length')
ax1.set_ylabel('petal_width')

for sp in df['species'].unique():
    x = df[df['species'] == sp]['petal_length']
    y = df[df['species'] == sp]['petal_width']
    ax2.scatter(x, y, label=sp)
ax2.legend()
ax2.set_title('with label')
ax2.set_xlabel('petal_length')
ax2.set_ylabel('petal_width')

plt.show()
```
Looking at the plot on the right, each data point can be interpreted as being generated from a multivariate Gaussian distribution[^1] whose mean differs by iris species. This is the Gaussian mixture model. Rough means are shown in the table below.
| type | petal_length | petal_width |
|---|---|---|
| setosa | about 1.5 | about 0.25 |
| versicolor | about 4.5 | about 1.3 |
| virginica | about 6.0 | about 2.0 |
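These rough means can be checked directly from the dataset; a minimal sketch (using the `df` loaded above, rounding only for display):

```python
import seaborn as sns

df = sns.load_dataset('iris')
# Per-species means of the two variables used in the scatter plots
print(df.groupby('species')[['petal_length', 'petal_width']].mean().round(2))
```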
By using a Gaussian mixture model, the means of the multiple Gaussian distributions can be estimated from the unlabeled dataset on the left, and each data point can be associated with one of the Gaussian distributions, that is, clustered.
The Gaussian mixture model is described as a probabilistic model. Let $z_i$ be the cluster of data point $x_i$, and let $N(\mu_k, \Sigma_k)$ be the multivariate Gaussian distribution corresponding to each cluster $k = 1, \cdots, K$. In other words, when $x_i$ belongs to cluster $k$, its generation probability can be written as follows. Here $D$ is the dimension of $x_i$ and $N$ is the number of data points.
```math
\begin{align}
P(x_i | z_i=k, x_{1:N}^{\backslash i}, z_{1:N}^{\backslash i}, \mu_{1:K}, \Sigma_{1:K})
&= N(x_i | \mu_k, \Sigma_k)
\end{align}
```
On the other hand, $z_i$ is also expressed probabilistically. Suppose $z_i$ is generated from a categorical distribution with parameter $\pi := (\pi_1, \cdots, \pi_K)^T$, where $K$ is the number of clusters.
```math
P(z_i = k | \pi) = \pi_k
```
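As a minimal sketch of what a single draw from this categorical distribution looks like in code (the values of `pi` here are purely illustrative):

```python
import numpy as np

rng = np.random.RandomState(0)
pi = np.array([0.2, 0.5, 0.3])    # categorical parameters, must sum to 1
one_hot = rng.multinomial(1, pi)  # one-hot vector, e.g. array([0, 1, 0])
z_i = np.where(one_hot)[0][0]     # index of the sampled cluster
```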
From the above, the probability that the cluster $z_i$ of data point $x_i$ equals $k$ can be written as follows.[^2]
```math
P(z_i = k | x_{1:N}^{\backslash i}, z_{1:N}^{\backslash i}, \mu_{1:K}, \Sigma_{1:K}, \pi)
= \frac{\pi_k N(x_i| \mu_k, \Sigma_k)}{\sum_{l=1}^K \pi_l N(x_i| \mu_l, \Sigma_l)} \qquad\cdots (1)
```
Clustering is done by estimating each $z_i$ from the dataset using this formula.
To estimate the cluster $z_{1:N}$ of each data point from equation $(1)$, we need to know the parameters $\mu_{1:K}, \Sigma_{1:K}$ of the multivariate Gaussian distributions corresponding to each cluster, as well as the categorical distribution parameter $\pi$. These are not given explicitly, so we must either estimate them from the dataset or assume appropriate values. This time, we estimate the means $\mu_{1:K}$ of the multivariate Gaussian distributions from the dataset and fix the remaining parameters to the following values, where $I_D$ is the $D \times D$ identity matrix and $\sigma^2$ is a hyperparameter.[^6]
```math
\Sigma_1 = \cdots = \Sigma_K = \sigma^2 I_D \\
\pi = (1 / K, \cdots, 1 / K)^T
```
Then, to estimate $\mu_{1:K}$ and $z_{1:N}$, we adopt the approach of sampling them one at a time from their conditional distributions. This method is called **Gibbs sampling**.[^9][^10]
Since $z_i$ can be sampled according to equation $(1)$, let us consider the probability distribution of $\mu_k$. We want to estimate a probability distribution, so Bayesian estimation can be used. The mean $\mu_{\rm pri}$ and covariance matrix $\Sigma_{\rm pri}$ of the conjugate prior distribution of $\mu_k$ are set as follows.[^7] They are common to all $k = 1, \cdots, K$, and $\sigma_{\rm pri}^2$ is a hyperparameter.
```math
\begin{align}
\mu_{\rm pri} &= (0, \cdots, 0)^T \\
\Sigma_{\rm pri} &= \sigma_{\rm pri}^2 I_D
\end{align}
```
The posterior distribution of $\mu_k$ is then as follows, where $n_k$ is the number of data points in $x_{1:N}$ belonging to cluster $k$ and $\bar{x}_k$ is their mean.
```math
\begin{align}
\mu_{k, {\rm pos}} &= \Sigma_{k, {\rm pos}}(n_k \Sigma_k^{-1} \overline{x}_k + \Sigma_{\rm pri}^{-1}\mu_{\rm pri}) \\
\Sigma_{k, {\rm pos}} &= (n_k \Sigma_{k}^{-1} + \Sigma_{\rm pri}^{-1})^{-1} \\
\mu_k &\sim N(\mu | \mu_{k, {\rm pos}}, \Sigma_{k, {\rm pos}}) \qquad\cdots (2)
\end{align}
```
This time, $\Sigma_k$ and $\Sigma_{\rm pri}$ are both constant multiples of $I_D$, so $\mu_{k, {\rm pos}}$ and $\Sigma_{k, {\rm pos}}$ simplify to the following.
```math
\begin{align}
\mu_{k, {\rm pos}} &= \Sigma_{k, {\rm pos}}\left(\frac{n_k}{\sigma^2} \overline{x}_k + \frac{1}{\sigma_{\rm pri}^2}\mu_{\rm pri}\right) \\
\Sigma_{k, {\rm pos}} &= \left(\frac{n_k}{\sigma^2} + \frac{1}{\sigma_{\rm pri}^{2}}\right)^{-1} I_D
\end{align}
```
From the above, $\mu_k$ should be sampled according to equation $(2)$.
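As a minimal sketch of this sampling step (with illustrative variable names; it corresponds to `_sample_mu_k` in the implementation below):

```python
import numpy as np

rng = np.random.RandomState(0)

def sample_mu_k(xk_bar, n_k, var, var_pri, mu_pri):
    # Posterior under Sigma_k = var * I_D and Sigma_pri = var_pri * I_D
    var_pos = 1 / (n_k / var + 1 / var_pri)
    mu_pos = var_pos * (n_k * xk_bar / var + mu_pri / var_pri)
    return rng.multivariate_normal(mu_pos, var_pos * np.eye(len(mu_pos)))
```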
Here are some important points in the implementation.
To evaluate equation $(2)$, we need the number of data points currently assigned to each cluster. So we implement an array-like class `ClusterArray` with a `count` method that returns that number.[^8] Besides `count`, the methods we are likely to need are delegated to an internal `numpy.ndarray`.
```python
import numpy as np
from collections import Counter


class ClusterArray(object):
    def __init__(self, array):
        # array: a one-dimensional list of cluster labels
        self._array = np.array(array, dtype=int)
        self._counter = Counter(array)

    @property
    def array(self):
        return self._array.copy()

    def count(self, k):
        return self._counter[k]

    def __setitem__(self, i, k):
        # Keep self._counter consistent with the assignment
        pre_value = self._array[i]
        if pre_value == k:
            return
        if self._counter[pre_value] > 0:
            self._counter[pre_value] -= 1
        self._array[i] = k
        self._counter[k] += 1

    def __getitem__(self, i):
        return self._array[i]
```
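A quick usage sketch of `ClusterArray` (the values are illustrative):

```python
z = ClusterArray([0, 1, 1, 2])
print(z.count(1))              # 2
z[0] = 1                       # reassign data point 0 to cluster 1
print(z.count(0), z.count(1))  # 0 3
print(z.array)                 # [1 1 1 2]
```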
Equation $(1)$ could be evaluated by computing the probability density function directly, but the computation can be reduced by dropping the factors that do not depend on $k$.
```math
\begin{align}
P(z_i = k | x_{1:N}^{\backslash i}, z_{1:N}^{\backslash i}, \mu_{1:K}, \Sigma_{1:K}, \pi)
&= \frac{\pi_k \exp\{- \frac{1}{2}\log|\Sigma_k| - \frac{1}{2} (x_i - \mu_k)^T\Sigma_k^{-1}(x_i - \mu_k)\}}{\sum_{l=1}^K \pi_l \exp\{- \frac{1}{2}\log|\Sigma_l| - \frac{1}{2} (x_i - \mu_l)^T\Sigma_l^{-1}(x_i - \mu_l)\}} \\
&= \frac{\pi_k \exp\{- \frac{1}{2 \sigma^2} \| x_i - \mu_k \|_2^2\}}{\sum_{l=1}^{K}\pi_l \exp\{- \frac{1}{2 \sigma^2} \| x_i - \mu_l \|_2^2\}}
\end{align}
```
We implement a function `log_deformed_gaussian` that computes the contents of the $\exp$ in this expression and use it in the calculation.
```python
def log_deformed_gaussian(x, mu, var):
    # Log of the Gaussian density, up to factors that do not depend on the cluster
    norm_squared = ((x - mu) * (x - mu)).sum(axis=1)
    return -norm_squared / (2 * var)
```
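For example, the cluster probabilities of a single data point (the right-hand side of the simplified expression above) can then be computed with this function; the values of `x_i`, `mu`, `pi`, and `var` below are only illustrative:

```python
import numpy as np

x_i = np.array([1.5, 0.25])
mu = np.array([[1.5, 0.25], [4.5, 1.3], [6.0, 2.0]])  # K x D matrix of cluster means
pi = np.full(3, 1 / 3)
var = 0.1

log_probs = log_deformed_gaussian(x_i, mu, var)  # shape (K,)
probs = pi * np.exp(log_probs)
probs /= probs.sum()  # probabilities of z_i = 0, 1, 2
```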
For calculations of the form $\log(\sum_{i} \exp f_i(x))$, logsumexp is a well-known technique for preventing overflow and underflow. However, it is not very efficient because it applies $\log$ and $\exp$ many times, so we do not use logsumexp this time.[^5]
For logsumexp, the article "Mixed Gaussian distribution and logsumexp" linked at the top is very helpful.
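For reference, a minimal sketch of the logsumexp trick itself (not used in the implementation in this article):

```python
import numpy as np

def logsumexp(a):
    # Subtract the maximum before exponentiating so that exp never overflows
    a_max = a.max()
    return a_max + np.log(np.exp(a - a_max).sum())

log_p = np.array([-1000.0, -1001.0, -1002.0])
probs = np.exp(log_p - logsumexp(log_p))  # normalized even though np.exp(log_p) underflows
```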
I implemented the model with the above points in mind. Following the scikit-learn interface, clustering is run with a `fit` method. The source code is a little long, so it is folded below.
```python
import numpy as np
from collections import Counter


def log_deformed_gaussian(x, mu, var):
    # Log of the Gaussian density, up to factors that do not depend on the cluster
    norm_squared = ((x - mu) * (x - mu)).sum(axis=1)
    return -norm_squared / (2 * var)


class ClusterArray(object):
    def __init__(self, array):
        # array: a one-dimensional list of cluster labels
        self._array = np.array(array, dtype=int)
        self._counter = Counter(array)

    @property
    def array(self):
        return self._array.copy()

    def count(self, k):
        return self._counter[k]

    def __setitem__(self, i, k):
        # Keep self._counter consistent with the assignment
        pre_value = self._array[i]
        if pre_value == k:
            return
        if self._counter[pre_value] > 0:
            self._counter[pre_value] -= 1
        self._array[i] = k
        self._counter[k] += 1

    def __getitem__(self, i):
        return self._array[i]


class GaussianMixtureClustering(object):
    def __init__(self, K, D, var=1, var_pri=1, seed=None):
        self.K = K  # Number of clusters
        self.D = D  # Dimension of the explanatory variables (set in the constructor for ease of implementation)
        self.z = None

        # Parameters of the probability distributions
        self.mu = np.zeros((self.K, self.D))
        self.var = var  # Fixed, common to all clusters
        self.pi = np.full(self.K, 1 / self.K)  # Fixed, uniform over clusters

        # Prior distribution settings
        self.mu_pri = np.zeros(self.D)
        self.var_pri = var_pri

        self._random = np.random.RandomState(seed)

    def fit(self, X, n_iter):
        init_z = self._random.randint(0, self.K, X.shape[0])
        self.z = ClusterArray(init_z)

        for _ in range(n_iter):
            for k in range(self.K):
                self.mu[k] = self._sample_mu_k(X, k)
            for i, x_i in enumerate(X):
                self.z[i] = self._sample_zi(x_i)

    def _sample_zi(self, x_i):
        log_probs_xi = log_deformed_gaussian(x_i, self.mu, self.var)

        probs_zi = np.exp(log_probs_xi) * self.pi
        probs_zi = probs_zi / probs_zi.sum()

        z_i = self._random.multinomial(1, probs_zi)
        z_i = np.where(z_i)[0][0]
        return z_i

    def _sample_mu_k(self, X, k):
        xk_bar = np.array([x for i, x in enumerate(X) if self.z[i] == k]).mean(axis=0)
        var_pos = 1 / (self.z.count(k) / self.var + 1 / self.var_pri)
        mu_pos = var_pos * (xk_bar * self.z.count(k) / self.var + self.mu_pri / self.var_pri)

        mu_k = self._random.multivariate_normal(mu_pos, var_pos * np.eye(self.D))
        return mu_k
```
Let's cluster the Iris dataset from the introduction with the implemented Gaussian mixture model. The hyperparameters are $\sigma^2 = 0.1$ and $\sigma_{\rm pri}^2 = 1$, and the number of Gibbs sampling iterations is 10. Comparing the actual dataset labels with the clustering results gives the following.
```python
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Load the dataset
df = sns.load_dataset('iris')
X = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']].values

# Clustering with the Gaussian mixture model
gmc = GaussianMixtureClustering(K=3, D=4, var=0.1, seed=1)
gmc.fit(X, n_iter=10)
df['GMM_cluster'] = gmc.z.array

# Visualize the results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

for sp in df['species'].unique():
    x = df[df['species'] == sp]['petal_length']
    y = df[df['species'] == sp]['petal_width']
    ax1.scatter(x, y, label=sp)
ax1.legend()
ax1.set_title('species')
ax1.set_xlabel('petal_length')
ax1.set_ylabel('petal_width')

for k in range(gmc.K):
    x = df[df['GMM_cluster'] == k]['petal_length']
    y = df[df['GMM_cluster'] == k]['petal_width']
    ax2.scatter(x, y, label=k)
ax2.legend()
ax2.set_title('GMM cluster')
ax2.set_xlabel('petal_length')
ax2.set_ylabel('petal_width')

plt.show()
```
The boundary between "versicolor" and "virginica" is a little uncertain, but the clustering almost perfectly matches the labels.
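To check the agreement numerically, one can cross-tabulate the true species against the estimated clusters (a small sketch using the `df` from the code above):

```python
import pandas as pd

print(pd.crosstab(df['species'], df['GMM_cluster']))
```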
I also visualized the clustering result with `pairplot` from `seaborn`.
```python
sns.pairplot(
    df.drop(columns=['species']),
    vars=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'],
    hue='GMM_cluster'
)
```
The clustering works nicely. Looking at the diagonal plots of the `pairplot`, you can see that the dataset is well represented by a Gaussian mixture model.
Based on the Machine Learning Professional Series book "Nonparametric Bayes: Point Processes and the Mathematics of Statistical Machine Learning", I implemented a Gaussian mixture model and tried simple clustering with it. This time we dealt with a Gaussian mixture model, but you can also build models such as a "Bernoulli mixture model" or a "Poisson mixture model" and use them for clustering in the same way. Next time, I will write about marginalized (collapsed) Gibbs sampling, a technique needed for nonparametric Bayes.
(Added January 21, 2020) The follow-up article is now available: [Python] I implemented marginalized Gibbs sampling
[^1]: The variances also differ, but for simplicity we focus only on the means.
[^2]: This can be derived using Bayes' theorem.
[^5]: I decided it is acceptable not to use logsumexp, because overflow and underflow can be avoided by choosing a $\sigma^2$ that is neither too large nor too small for the dataset.
[^6]: $\sigma^2$ is the variance, and all covariances are $0$. This may seem like a rough assumption, but if $\sigma^2$ is chosen sensibly, clustering still works well enough, and the simplification reduces the amount of computation. Of course, estimating these from the dataset should allow more accurate clustering.
[^7]: The conjugate prior of $\mu_k$ is a multivariate Gaussian distribution.
[^8]: It would also be fine to keep the `Counter` instance outside without creating a class like this; I simply like to encapsulate such things in classes.
[^9]: Gibbs sampling is a technique for approximating joint distributions that are difficult to compute analytically. Here we approximate $P(\mu_{1:K}, z_{1:N} | \cdots)$.
[^10]: Other methods can be used to estimate a Gaussian mixture model, but my understanding is that Gibbs sampling is needed for the extension to nonparametric Bayes. To be honest, I do not fully understand this, so please let me know if you do.