Chapter 9 of PRML introduces the EM algorithm. The EM algorithm is a technique that can be used in many settings, and I have myself combined it with a classifier to make classification more robust to noise. However, I had never used it for clustering with a Gaussian mixture model, which is its best-known application, so this time I implemented **maximum likelihood estimation of a Gaussian mixture model with the EM algorithm**.
Suppose, for example, that we fit data points like the blue dots in the figure above with an ordinary Gaussian distribution
p({\bf x}) = \mathcal{N}({\bf x}|{\bf\mu},{\bf\Sigma})
Such a unimodal Gaussian is clearly not a good model in this case. Noting that the data points split into three clumps, a Gaussian mixture built from three Gaussian distributions
p({\bf x}) = \sum_{k=1}^3 \pi_k\mathcal{N}({\bf x}|{\bf\mu}_k,{\bf\Sigma}_k)
looks like a much more suitable model. Here $\sum_k \pi_k = 1$, and ${\bf\mu}_1$, ${\bf\mu}_2$, and ${\bf\mu}_3$ are the means of the top, bottom-right, and bottom-left clumps of data points, respectively. Fitting each clump with its own Gaussian would be easy enough, but while it is obvious to us humans which clump each data point belongs to, the machine does not know that.
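To make the mixture density above concrete, here is a minimal NumPy sketch that evaluates $p({\bf x})$ for hand-picked placeholder parameters (the weights, means, and covariances below are only guesses for illustration and are not the estimates computed later):

import numpy as np

# placeholder parameters for a 3-component mixture (clump locations are guesses)
pis = np.array([1 / 3, 1 / 3, 1 / 3])              # mixing coefficients, sum to 1
mus = np.array([[0., 5.], [5., -5.], [-5., -5.]])  # component means
Sigmas = np.array([np.eye(2)] * 3)                 # component covariances

def gauss_pdf(x, mu, Sigma):
    # density of a single multivariate Gaussian N(x | mu, Sigma)
    diff = x - mu
    norm = np.sqrt(np.linalg.det(Sigma) * (2 * np.pi) ** len(mu))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

def mixture_pdf(x):
    # p(x) = sum_k pi_k N(x | mu_k, Sigma_k)
    return sum(pi * gauss_pdf(x, mu, Sigma) for pi, mu, Sigma in zip(pis, mus, Sigmas))

print(mixture_pdf(np.array([0., 5.])))  # largest near one of the component means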
Taking the coordinates $\{{\bf x}_n\}_{n=1}^N$ of the $N$ data points as observed variables, we simultaneously estimate the latent variables $\{{\bf z}_n\}_{n=1}^N$, which indicate which of the $K$ clumps each point belongs to, and the parameters $\{\pi_k, {\bf\mu}_k, {\bf\Sigma}_k\}_{k=1}^K$. When latent variables are involved like this, the EM algorithm is the usual tool.
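Here each ${\bf z}_n$ is a 1-of-$K$ binary vector with $z_{nk}=1$ when the $n$-th point belongs to the $k$-th clump and $0$ otherwise, so that, in PRML's notation (equations (9.10) and (9.11)),

p({\bf z}) = \prod_{k=1}^K \pi_k^{z_k},\qquad p({\bf x}|{\bf z}) = \prod_{k=1}^K \mathcal{N}({\bf x}|{\bf\mu}_k,{\bf\Sigma}_k)^{z_k}

and marginalizing out ${\bf z}$ recovers the mixture $p({\bf x})$ above.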
The procedure for maximum likelihood estimation of a Gaussian mixture with the EM algorithm (PRML equations (9.23) through (9.27)) is summarized in Section 9.2.2 of PRML, so the derivation is not repeated here.
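For quick reference, the E step evaluates the responsibilities $\gamma(z_{nk})$ and the M step re-estimates the parameters from them; these are the formulas cited in the code comments below.

\gamma(z_{nk}) = \frac{\pi_k\mathcal{N}({\bf x}_n|{\bf\mu}_k,{\bf\Sigma}_k)}{\sum_{j=1}^K\pi_j\mathcal{N}({\bf x}_n|{\bf\mu}_j,{\bf\Sigma}_j)},\qquad N_k = \sum_{n=1}^N\gamma(z_{nk})

{\bf\mu}_k^{\rm new} = \frac{1}{N_k}\sum_{n=1}^N\gamma(z_{nk}){\bf x}_n,\qquad {\bf\Sigma}_k^{\rm new} = \frac{1}{N_k}\sum_{n=1}^N\gamma(z_{nk})({\bf x}_n-{\bf\mu}_k^{\rm new})({\bf x}_n-{\bf\mu}_k^{\rm new})^{\rm T},\qquad \pi_k^{\rm new} = \frac{N_k}{N}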
As usual, we use NumPy and matplotlib.
import matplotlib.pyplot as plt
import numpy as np


class GaussianMixture(object):

    def __init__(self, n_component):
        # number of Gaussian components
        self.n_component = n_component

    # maximum likelihood estimation with the EM algorithm
    def fit(self, X, iter_max=10):
        # dimensionality of the data
        self.ndim = np.size(X, 1)
        # initialize the mixing coefficients
        self.weights = np.ones(self.n_component) / self.n_component
        # initialize the means
        self.means = np.random.uniform(X.min(), X.max(), (self.ndim, self.n_component))
        # initialize the covariance matrices
        self.covs = np.repeat(10 * np.eye(self.ndim), self.n_component).reshape(self.ndim, self.ndim, self.n_component)
        # alternate the E step and the M step
        for i in range(iter_max):
            params = np.hstack((self.weights.ravel(), self.means.ravel(), self.covs.ravel()))
            # E step: compute the responsibilities
            resps = self.expectation(X)
            # M step: update the parameters
            self.maximization(X, resps)
            # stop if the parameters have converged
            if np.allclose(params, np.hstack((self.weights.ravel(), self.means.ravel(), self.covs.ravel()))):
                break
        else:
            print("parameters may not have converged")

    # Gaussian density of every component evaluated at every data point
    def gauss(self, X):
        precisions = np.linalg.inv(self.covs.T).T
        diffs = X[:, :, None] - self.means
        assert diffs.shape == (len(X), self.ndim, self.n_component)
        exponents = np.sum(np.einsum('nik,ijk->njk', diffs, precisions) * diffs, axis=1)
        assert exponents.shape == (len(X), self.n_component)
        return np.exp(-0.5 * exponents) / np.sqrt(np.linalg.det(self.covs.T).T * (2 * np.pi) ** self.ndim)

    # E step
    def expectation(self, X):
        # PRML eq. (9.23)
        resps = self.weights * self.gauss(X)
        resps /= resps.sum(axis=-1, keepdims=True)
        return resps

    # M step
    def maximization(self, X, resps):
        # PRML eq. (9.27)
        Nk = np.sum(resps, axis=0)
        # PRML eq. (9.26)
        self.weights = Nk / len(X)
        # PRML eq. (9.24)
        self.means = X.T.dot(resps) / Nk
        diffs = X[:, :, None] - self.means
        # PRML eq. (9.25)
        self.covs = np.einsum('nik,njk->ijk', diffs, diffs * np.expand_dims(resps, 1)) / Nk

    # evaluate the probability distribution p(x)
    def predict_proba(self, X):
        # PRML eq. (9.7)
        gauss = self.weights * self.gauss(X)
        return np.sum(gauss, axis=-1)

    # clustering: assign each point to the component with the highest joint probability
    def classify(self, X):
        joint_prob = self.weights * self.gauss(X)
        return np.argmax(joint_prob, axis=1)
gaussian_mixture.py
import matplotlib.pyplot as plt
import numpy as np


class GaussianMixture(object):

    def __init__(self, n_component):
        self.n_component = n_component

    def fit(self, X, iter_max=10):
        self.ndim = np.size(X, 1)
        self.weights = np.ones(self.n_component) / self.n_component
        self.means = np.random.uniform(X.min(), X.max(), (self.ndim, self.n_component))
        self.covs = np.repeat(10 * np.eye(self.ndim), self.n_component).reshape(self.ndim, self.ndim, self.n_component)
        for i in range(iter_max):
            params = np.hstack((self.weights.ravel(), self.means.ravel(), self.covs.ravel()))
            resps = self.expectation(X)
            self.maximization(X, resps)
            if np.allclose(params, np.hstack((self.weights.ravel(), self.means.ravel(), self.covs.ravel()))):
                break
        else:
            print("parameters may not have converged")

    def gauss(self, X):
        precisions = np.linalg.inv(self.covs.T).T
        diffs = X[:, :, None] - self.means
        assert diffs.shape == (len(X), self.ndim, self.n_component)
        exponents = np.sum(np.einsum('nik,ijk->njk', diffs, precisions) * diffs, axis=1)
        assert exponents.shape == (len(X), self.n_component)
        return np.exp(-0.5 * exponents) / np.sqrt(np.linalg.det(self.covs.T).T * (2 * np.pi) ** self.ndim)

    def expectation(self, X):
        resps = self.weights * self.gauss(X)
        resps /= resps.sum(axis=-1, keepdims=True)
        return resps

    def maximization(self, X, resps):
        Nk = np.sum(resps, axis=0)
        self.weights = Nk / len(X)
        self.means = X.T.dot(resps) / Nk
        diffs = X[:, :, None] - self.means
        self.covs = np.einsum('nik,njk->ijk', diffs, diffs * np.expand_dims(resps, 1)) / Nk

    def predict_proba(self, X):
        gauss = self.weights * self.gauss(X)
        return np.sum(gauss, axis=-1)

    def classify(self, X):
        joint_prob = self.weights * self.gauss(X)
        return np.argmax(joint_prob, axis=1)


def create_toy_data():
    # three Gaussian clumps: bottom left, bottom right, and top
    x1 = np.random.normal(size=(100, 2))
    x1 += np.array([-5, -5])
    x2 = np.random.normal(size=(100, 2))
    x2 += np.array([5, -5])
    x3 = np.random.normal(size=(100, 2))
    x3 += np.array([0, 5])
    return np.vstack((x1, x2, x3))


def main():
    X = create_toy_data()
    model = GaussianMixture(3)
    model.fit(X, iter_max=100)
    labels = model.classify(X)

    # draw p(x) as contour lines and color each point by its cluster
    x_test, y_test = np.meshgrid(np.linspace(-10, 10, 100), np.linspace(-10, 10, 100))
    X_test = np.array([x_test, y_test]).reshape(2, -1).transpose()
    probs = model.predict_proba(X_test)
    Probs = probs.reshape(100, 100)
    colors = ["red", "blue", "green"]
    plt.scatter(X[:, 0], X[:, 1], c=[colors[int(label)] for label in labels])
    plt.contour(x_test, y_test, Probs)
    plt.xlim(-10, 10)
    plt.ylim(-10, 10)
    plt.show()


if __name__ == '__main__':
    main()
The parameters of the Gaussian mixture were estimated by maximum likelihood with the points as training data, and the resulting probability distribution $p({\bf x})$ is drawn as contour lines; the color of each dot shows which cluster it was assigned to. This is a successful run, but **the fitting sometimes fails**. As described in PRML, maximizing this log-likelihood function is not a well-posed problem: a component can collapse onto a single data point and the likelihood diverges. There are heuristics that work around this, and since none of them are implemented here the fit occasionally fails, although it should not happen very often.
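As a sketch of one such heuristic (not used in the code above), the covariances can be kept away from singularities by adding a small constant to their diagonals after every M step, so that no component can shrink onto a single data point. The subclass name and the reg_covar parameter below are only illustrative; scikit-learn's GaussianMixture applies the same idea through its own reg_covar argument.

import numpy as np

# Hypothetical workaround, not part of the implementation above:
# inflate the diagonal of every covariance after the M step so that
# the likelihood cannot diverge when a component collapses onto one point.
class RegularizedGaussianMixture(GaussianMixture):

    def __init__(self, n_component, reg_covar=1e-6):
        super().__init__(n_component)
        self.reg_covar = reg_covar  # amount added to each covariance diagonal

    def maximization(self, X, resps):
        # run the usual M step, then regularize the covariances
        super().maximization(X, resps)
        # self.covs has shape (ndim, ndim, n_component); broadcast the ridge over components
        self.covs += self.reg_covar * np.eye(self.ndim)[:, :, None]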
Unsupervised clustering was performed by fitting a Gaussian mixture model. The number of Gaussian components had to be specified by hand here. Chapter 10 introduces a method that automatically estimates an appropriate number of mixture components, so next time I will implement the variational Bayes approach used there.