Chapter 9 of PRML introduces the EM algorithm. The EM algorithm is a technique that can be used in many settings, and I have myself combined it with a classifier to make classification more robust to noise. However, I had never used it for clustering with a Gaussian mixture model, which is its best-known application, so this time I implemented **maximum likelihood estimation of a Gaussian mixture model with the EM algorithm**.
Suppose, for example, that we fit data points like the blue dots in the figure above with an ordinary Gaussian distribution
p({\bf x}) = \mathcal{N}({\bf x}|{\bf\mu},{\bf\Sigma})
Such a unimodal Gaussian is clearly not a good model in this case. Noting that the data points split into three clumps, a Gaussian mixture built from three Gaussian distributions
p({\bf x}) = \sum_{k=1}^3 \pi_k\mathcal{N}({\bf x}|{\bf\mu}_k,{\bf\Sigma}_k)
looks like a much more suitable model. Here $\sum_k \pi_k = 1$, and ${\bf\mu}_1$, ${\bf\mu}_2$, and ${\bf\mu}_3$ are the means of the top, bottom-right, and bottom-left clumps of data points, respectively. Fitting each clump with its own Gaussian would be easy enough, but while it is obvious to us humans which clump each data point belongs to, the machine does not know that.
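To make the mixture density above concrete, here is a minimal NumPy sketch that evaluates $p({\bf x})$ for hand-picked placeholder parameters (the weights, means, and covariances below are only guesses for illustration and are not the estimates computed later):

import numpy as np

# placeholder parameters for a 3-component mixture (clump locations are guesses)
pis = np.array([1 / 3, 1 / 3, 1 / 3])              # mixing coefficients, sum to 1
mus = np.array([[0., 5.], [5., -5.], [-5., -5.]])  # component means
Sigmas = np.array([np.eye(2)] * 3)                 # component covariances

def gauss_pdf(x, mu, Sigma):
    # density of a single multivariate Gaussian N(x | mu, Sigma)
    diff = x - mu
    norm = np.sqrt(np.linalg.det(Sigma) * (2 * np.pi) ** len(mu))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

def mixture_pdf(x):
    # p(x) = sum_k pi_k N(x | mu_k, Sigma_k)
    return sum(pi * gauss_pdf(x, mu, Sigma) for pi, mu, Sigma in zip(pis, mus, Sigmas))

print(mixture_pdf(np.array([0., 5.])))  # largest near one of the component means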
Taking the coordinates $\{{\bf x}_n\}_{n=1}^N$ of the $N$ data points as observed variables, we simultaneously estimate the latent variables $\{{\bf z}_n\}_{n=1}^N$, which indicate which of the $K$ clumps each point belongs to, and the parameters $\{\pi_k, {\bf\mu}_k, {\bf\Sigma}_k\}_{k=1}^K$. When latent variables are involved like this, the EM algorithm is the usual tool.
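Here each ${\bf z}_n$ is a 1-of-$K$ binary vector with $z_{nk}=1$ when the $n$-th point belongs to the $k$-th clump and $0$ otherwise, so that, in PRML's notation (equations (9.10) and (9.11)),

p({\bf z}) = \prod_{k=1}^K \pi_k^{z_k},\qquad p({\bf x}|{\bf z}) = \prod_{k=1}^K \mathcal{N}({\bf x}|{\bf\mu}_k,{\bf\Sigma}_k)^{z_k}

and marginalizing out ${\bf z}$ recovers the mixture $p({\bf x})$ above.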
The procedure for maximum likelihood estimation of a Gaussian mixture with the EM algorithm (PRML equations (9.23) through (9.27)) is summarized in Section 9.2.2 of PRML, so the derivation is not repeated here.
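For quick reference, the E step evaluates the responsibilities $\gamma(z_{nk})$ and the M step re-estimates the parameters from them; these are the formulas cited in the code comments below.

\gamma(z_{nk}) = \frac{\pi_k\mathcal{N}({\bf x}_n|{\bf\mu}_k,{\bf\Sigma}_k)}{\sum_{j=1}^K\pi_j\mathcal{N}({\bf x}_n|{\bf\mu}_j,{\bf\Sigma}_j)},\qquad N_k = \sum_{n=1}^N\gamma(z_{nk})

{\bf\mu}_k^{\rm new} = \frac{1}{N_k}\sum_{n=1}^N\gamma(z_{nk}){\bf x}_n,\qquad {\bf\Sigma}_k^{\rm new} = \frac{1}{N_k}\sum_{n=1}^N\gamma(z_{nk})({\bf x}_n-{\bf\mu}_k^{\rm new})({\bf x}_n-{\bf\mu}_k^{\rm new})^{\rm T},\qquad \pi_k^{\rm new} = \frac{N_k}{N}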
As usual, we use NumPy and matplotlib.
import matplotlib.pyplot as plt
import numpy as np


class GaussianMixture(object):

    def __init__(self, n_component):
        # number of Gaussian components
        self.n_component = n_component

    # maximum likelihood estimation with the EM algorithm
    def fit(self, X, iter_max=10):
        # dimensionality of the data
        self.ndim = np.size(X, 1)
        # initialize the mixing coefficients
        self.weights = np.ones(self.n_component) / self.n_component
        # initialize the means
        self.means = np.random.uniform(X.min(), X.max(), (self.ndim, self.n_component))
        # initialize the covariance matrices
        self.covs = np.repeat(10 * np.eye(self.ndim), self.n_component).reshape(self.ndim, self.ndim, self.n_component)
        # alternate the E step and the M step
        for i in range(iter_max):
            params = np.hstack((self.weights.ravel(), self.means.ravel(), self.covs.ravel()))
            # E step: compute the responsibilities
            resps = self.expectation(X)
            # M step: update the parameters
            self.maximization(X, resps)
            # stop if the parameters have converged
            if np.allclose(params, np.hstack((self.weights.ravel(), self.means.ravel(), self.covs.ravel()))):
                break
        else:
            print("parameters may not have converged")

    # Gaussian density of every component evaluated at every data point
    def gauss(self, X):
        precisions = np.linalg.inv(self.covs.T).T
        diffs = X[:, :, None] - self.means
        assert diffs.shape == (len(X), self.ndim, self.n_component)
        exponents = np.sum(np.einsum('nik,ijk->njk', diffs, precisions) * diffs, axis=1)
        assert exponents.shape == (len(X), self.n_component)
        return np.exp(-0.5 * exponents) / np.sqrt(np.linalg.det(self.covs.T).T * (2 * np.pi) ** self.ndim)

    # E step
    def expectation(self, X):
        # PRML eq. (9.23)
        resps = self.weights * self.gauss(X)
        resps /= resps.sum(axis=-1, keepdims=True)
        return resps

    # M step
    def maximization(self, X, resps):
        # PRML eq. (9.27)
        Nk = np.sum(resps, axis=0)
        # PRML eq. (9.26)
        self.weights = Nk / len(X)
        # PRML eq. (9.24)
        self.means = X.T.dot(resps) / Nk
        diffs = X[:, :, None] - self.means
        # PRML eq. (9.25)
        self.covs = np.einsum('nik,njk->ijk', diffs, diffs * np.expand_dims(resps, 1)) / Nk

    # evaluate the probability distribution p(x)
    def predict_proba(self, X):
        # PRML eq. (9.7)
        gauss = self.weights * self.gauss(X)
        return np.sum(gauss, axis=-1)

    # clustering: assign each point to the component with the highest joint probability
    def classify(self, X):
        joint_prob = self.weights * self.gauss(X)
        return np.argmax(joint_prob, axis=1)
gaussian_mixture.py
import matplotlib.pyplot as plt
import numpy as np


class GaussianMixture(object):

    def __init__(self, n_component):
        self.n_component = n_component

    def fit(self, X, iter_max=10):
        self.ndim = np.size(X, 1)
        self.weights = np.ones(self.n_component) / self.n_component
        self.means = np.random.uniform(X.min(), X.max(), (self.ndim, self.n_component))
        self.covs = np.repeat(10 * np.eye(self.ndim), self.n_component).reshape(self.ndim, self.ndim, self.n_component)
        for i in range(iter_max):
            params = np.hstack((self.weights.ravel(), self.means.ravel(), self.covs.ravel()))
            resps = self.expectation(X)
            self.maximization(X, resps)
            if np.allclose(params, np.hstack((self.weights.ravel(), self.means.ravel(), self.covs.ravel()))):
                break
        else:
            print("parameters may not have converged")

    def gauss(self, X):
        precisions = np.linalg.inv(self.covs.T).T
        diffs = X[:, :, None] - self.means
        assert diffs.shape == (len(X), self.ndim, self.n_component)
        exponents = np.sum(np.einsum('nik,ijk->njk', diffs, precisions) * diffs, axis=1)
        assert exponents.shape == (len(X), self.n_component)
        return np.exp(-0.5 * exponents) / np.sqrt(np.linalg.det(self.covs.T).T * (2 * np.pi) ** self.ndim)

    def expectation(self, X):
        resps = self.weights * self.gauss(X)
        resps /= resps.sum(axis=-1, keepdims=True)
        return resps

    def maximization(self, X, resps):
        Nk = np.sum(resps, axis=0)
        self.weights = Nk / len(X)
        self.means = X.T.dot(resps) / Nk
        diffs = X[:, :, None] - self.means
        self.covs = np.einsum('nik,njk->ijk', diffs, diffs * np.expand_dims(resps, 1)) / Nk

    def predict_proba(self, X):
        gauss = self.weights * self.gauss(X)
        return np.sum(gauss, axis=-1)

    def classify(self, X):
        joint_prob = self.weights * self.gauss(X)
        return np.argmax(joint_prob, axis=1)


def create_toy_data():
    # three Gaussian clumps: bottom left, bottom right, and top
    x1 = np.random.normal(size=(100, 2))
    x1 += np.array([-5, -5])
    x2 = np.random.normal(size=(100, 2))
    x2 += np.array([5, -5])
    x3 = np.random.normal(size=(100, 2))
    x3 += np.array([0, 5])
    return np.vstack((x1, x2, x3))


def main():
    X = create_toy_data()
    model = GaussianMixture(3)
    model.fit(X, iter_max=100)
    labels = model.classify(X)

    # draw p(x) as contour lines and color each point by its cluster
    x_test, y_test = np.meshgrid(np.linspace(-10, 10, 100), np.linspace(-10, 10, 100))
    X_test = np.array([x_test, y_test]).reshape(2, -1).transpose()
    probs = model.predict_proba(X_test)
    Probs = probs.reshape(100, 100)
    colors = ["red", "blue", "green"]
    plt.scatter(X[:, 0], X[:, 1], c=[colors[int(label)] for label in labels])
    plt.contour(x_test, y_test, Probs)
    plt.xlim(-10, 10)
    plt.ylim(-10, 10)
    plt.show()


if __name__ == '__main__':
    main()
The parameters of the Gaussian mixture were estimated by maximum likelihood with the points as training data, and the resulting probability distribution $p({\bf x})$ is drawn as contour lines; the color of each dot shows which cluster it was assigned to. This is a successful run, but **the fitting sometimes fails**. As described in PRML, maximizing this log-likelihood function is not a well-posed problem: a component can collapse onto a single data point and the likelihood diverges. There are heuristics that work around this, and since none of them are implemented here the fit occasionally fails, although it should not happen very often.
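As a sketch of one such heuristic (not used in the code above), the covariances can be kept away from singularities by adding a small constant to their diagonals after every M step, so that no component can shrink onto a single data point. The subclass name and the reg_covar parameter below are only illustrative; scikit-learn's GaussianMixture applies the same idea through its own reg_covar argument.

import numpy as np

# Hypothetical workaround, not part of the implementation above:
# inflate the diagonal of every covariance after the M step so that
# the likelihood cannot diverge when a component collapses onto one point.
class RegularizedGaussianMixture(GaussianMixture):

    def __init__(self, n_component, reg_covar=1e-6):
        super().__init__(n_component)
        self.reg_covar = reg_covar  # amount added to each covariance diagonal

    def maximization(self, X, resps):
        # run the usual M step, then regularize the covariances
        super().maximization(X, resps)
        # self.covs has shape (ndim, ndim, n_component); broadcast the ridge over components
        self.covs += self.reg_covar * np.eye(self.ndim)[:, :, None]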
Unsupervised clustering was performed by fitting a Gaussian mixture model. The number of Gaussian components had to be specified by hand here. Chapter 10 introduces a method that automatically estimates an appropriate number of mixture components, so next time I will implement the variational Bayes approach used there.