When the distribution obtained from a sample is multimodal, a single Gaussian distribution is not an appropriate model. A multimodal distribution can instead be modeled with a **Gaussian mixture distribution**, which combines multiple Gaussian distributions. This article shows an example of using the EM algorithm to determine the parameters of a **Gaussian mixture distribution**.
Let the sample be $x_n\ (n = 1, \dots, N)$. Maximum likelihood estimation for a single Gaussian distribution gives the mean and variance in closed form:
```math
\mu_{ML}=\frac{1}{N}\sum_{n=1}^N x_n \\
\sigma^2_{ML}=\frac{1}{N-1}\sum_{n=1}^N (x_n-\mu_{ML})^2
```
Strictly speaking, maximum likelihood gives the factor $1/N$ for the variance; the unbiased estimator with $1/(N-1)$ is used here instead (hence `ddof=1` in the code below).
```python
# -*- coding: utf-8 -*-
import math

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm


def draw_hist(xs, bins):
    # Plot a normalized histogram of the sample
    plt.hist(xs, bins=bins, density=True, alpha=0.5)


def predict(data):
    # Estimate the mean and variance from the sample by maximum likelihood
    mu = np.mean(data)
    var = np.var(data, ddof=1)  # unbiased estimator (divides by N - 1)
    return mu, var


def main():
    # Generate N samples from a Gaussian with mean mu and standard deviation std
    mu = 3.0
    v = 2.0
    std = math.sqrt(v)
    N = 10000
    data = np.random.normal(mu, std, N)

    # Estimate the mean and variance by maximum likelihood
    mu_predicted, var_predicted = predict(data)
    std_predicted = math.sqrt(var_predicted)
    print("original: mu={0}, var={1}".format(mu, v))
    print(" predict: mu={0}, var={1}".format(mu_predicted, var_predicted))

    # Plot the histogram with the fitted Gaussian density on top
    draw_hist(data, bins=40)
    xs = np.linspace(min(data), max(data), 200)
    plt.plot(xs, norm.pdf(xs, mu_predicted, std_predicted), color="red")
    plt.xlim(min(xs), max(xs))
    plt.xlabel("x")
    plt.ylabel("Probability")
    plt.show()


if __name__ == '__main__':
    main()
```
The histogram (blue) represents the sample and the red line represents the Gaussian distribution modeled using the estimated values.
```
original: mu=3.0, var=2.0
 predict: mu=2.98719564872, var=2.00297779707
```
The estimated parameters give a Gaussian distribution that models the sample well.
Next, we use the Old Faithful geyser data, which can be downloaded online (the "Old Faithful" dataset). This time only the eruption duration (the first column) is used as the sample. From this sample, the following bimodal distribution was obtained.
Looking at the distribution, a single Gaussian clearly cannot model it. Let us instead model it with a Gaussian mixture distribution that combines two Gaussian distributions. This requires determining the means $\mu_1, \mu_2$, the variances $\sigma_1^2, \sigma_2^2$, and the mixing probability $\pi$. Writing the Gaussian density as $\phi(x|\mu, \sigma^2)$, the mixture distribution is given by the following equation.
```math
y=(1-\pi)\phi(x|\mu_1, \sigma^2_1)+\pi\phi(x|\mu_2, \sigma^2_2)
```
For a single Gaussian distribution, the mean and variance can be found by maximizing the likelihood function analytically (see PRML §2.3.4). For a Gaussian mixture, however, the likelihood cannot be maximized analytically, so the maximization is performed with the EM algorithm, an iterative optimization method that alternates between two steps, the E step and the M step.
See §8.5 of The Elements of Statistical Learning for the details of the algorithm; the English PDF can be downloaded for free.
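Concretely, for this two-component model the two steps implemented in the code below are as follows. The E step computes, for each sample $x_n$, the responsibility of the second component (called `burden_rate` in the code):

```math
\gamma_n = \frac{\pi\,\phi(x_n|\mu_2,\sigma^2_2)}{(1-\pi)\,\phi(x_n|\mu_1,\sigma^2_1)+\pi\,\phi(x_n|\mu_2,\sigma^2_2)}
```

The M step then re-estimates the parameters as responsibility-weighted averages:

```math
\mu_1=\frac{\sum_n (1-\gamma_n)x_n}{\sum_n (1-\gamma_n)},\quad
\sigma^2_1=\frac{\sum_n (1-\gamma_n)(x_n-\mu_1)^2}{\sum_n (1-\gamma_n)} \\
\mu_2=\frac{\sum_n \gamma_n x_n}{\sum_n \gamma_n},\quad
\sigma^2_2=\frac{\sum_n \gamma_n (x_n-\mu_2)^2}{\sum_n \gamma_n},\quad
\pi=\frac{1}{N}\sum_n \gamma_n
```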
```python
# -*- coding: utf-8 -*-
import math
import random

import numpy as np
import matplotlib.pyplot as plt


def gaussian(x, m, v):
    # Gaussian density with mean m and variance v (works on scalars and arrays)
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)


def e_step(xs, ms, vs, p):
    # E step: responsibility of the second component for each sample
    burden_rates = []
    for x in xs:
        d = (1 - p) * gaussian(x, ms[0], vs[0]) + p * gaussian(x, ms[1], vs[1])
        n = p * gaussian(x, ms[1], vs[1])
        burden_rates.append(n / d)
    return burden_rates


def m_step(xs, burden_rates):
    # M step: re-estimate means, variances, and the mixing probability
    d = sum(1 - r for r in burden_rates)
    n = sum((1 - r) * x for x, r in zip(xs, burden_rates))
    mu1 = n / d
    n = sum((1 - r) * (x - mu1) ** 2 for x, r in zip(xs, burden_rates))
    var1 = n / d

    d = sum(burden_rates)
    n = sum(r * x for x, r in zip(xs, burden_rates))
    mu2 = n / d
    n = sum(r * (x - mu2) ** 2 for x, r in zip(xs, burden_rates))
    var2 = n / d

    p = sum(burden_rates) / len(xs)
    return [mu1, mu2], [var1, var2], p


def calc_log_likelihood(xs, ms, vs, p):
    # Log-likelihood of the two-component mixture
    s = 0
    for x in xs:
        g1 = gaussian(x, ms[0], vs[0])
        g2 = gaussian(x, ms[1], vs[1])
        s += math.log((1 - p) * g1 + p * g2)
    return s


def draw_hist(xs, bins):
    # Plot a normalized histogram of the sample
    plt.hist(xs, bins=bins, density=True, alpha=0.5)


def main():
    # Read the first column (eruption duration) of the dataset
    with open("faithful.txt") as fp:
        data = [float(row.split()[0]) for row in fp]

    # Initial values: random samples for the means, the sample variance, p = 0.5
    p = 0.5
    ms = [random.choice(data), random.choice(data)]
    vs = [np.var(data), np.var(data)]

    T = 50   # number of iterations
    ls = []  # log-likelihood at each step

    # EM algorithm
    for t in range(T):
        burden_rates = e_step(data, ms, vs, p)
        ms, vs, p = m_step(data, burden_rates)
        ls.append(calc_log_likelihood(data, ms, vs, p))

    print("predict: mu1={0}, mu2={1}, v1={2}, v2={3}, p={4}".format(
        ms[0], ms[1], vs[0], vs[1], p))

    # Plot the fitted mixture density over the histogram
    plt.subplot(211)
    xs = np.linspace(min(data), max(data), 200)
    norm1 = gaussian(xs, ms[0], vs[0])
    norm2 = gaussian(xs, ms[1], vs[1])
    draw_hist(data, 20)
    plt.plot(xs, (1 - p) * norm1 + p * norm2, color="red", lw=3)
    plt.xlim(min(data), max(data))
    plt.xlabel("x")
    plt.ylabel("Probability")

    # Plot the log-likelihood over the EM iterations
    plt.subplot(212)
    plt.plot(np.arange(len(ls)), ls)
    plt.xlabel("step")
    plt.ylabel("log_likelihood")
    plt.show()


if __name__ == '__main__':
    main()
```
```
predict: mu1=2.01860781706, mu2=4.27334342119, v1=0.0555176191851, v2=0.191024193785, p=0.651595365985
```
The upper figure shows the sample from the dataset modeled with the Gaussian mixture distribution. The lower figure shows the log-likelihood increasing as the EM algorithm iterates.
As initial values, the means were set to randomly chosen samples, the variances to the sample variance, and the mixing probability to 0.5. There are many ways to choose initial values; apparently the choice of initialization is a research topic in its own right. One simple safeguard is to run EM from several random initializations and keep the run with the highest log-likelihood, as sketched below.
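Here is a minimal sketch of such random restarts, reusing `e_step`, `m_step`, and `calc_log_likelihood` from the script above (the restart loop itself is an illustration, not part of the original script):

```python
import random
import numpy as np

# Assumes data, e_step, m_step, calc_log_likelihood from the script above.
best = None
for trial in range(10):
    # Fresh random initialization for each trial
    p = 0.5
    ms = [random.choice(data), random.choice(data)]
    vs = [np.var(data), np.var(data)]
    for t in range(50):
        ms, vs, p = m_step(data, e_step(data, ms, vs, p))
    ll = calc_log_likelihood(data, ms, vs, p)
    # Keep the run with the highest log-likelihood
    if best is None or ll > best[0]:
        best = (ll, ms, vs, p)
ll, ms, vs, p = best  # parameters of the best run
```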
Since this sample has a bimodal distribution, we combined two Gaussian distributions. Of course, three or more Gaussian distributions can be combined, although the number of parameters to determine grows accordingly; the sketch below shows how the updates generalize to $K$ components.
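Below is a minimal sketch, written for this article rather than taken from the original script, of the same E and M steps for $K$ one-dimensional components, vectorized with numpy (the function name `em_gmm_1d` is my own):

```python
import numpy as np

def em_gmm_1d(xs, K, T=100, seed=0):
    # EM for a K-component 1-D Gaussian mixture
    rng = np.random.default_rng(seed)
    xs = np.asarray(xs, dtype=float)
    N = len(xs)
    # Initialization: random samples as means, overall variance, uniform weights
    mus = rng.choice(xs, size=K, replace=False)
    vars_ = np.full(K, xs.var())
    pis = np.full(K, 1.0 / K)
    for _ in range(T):
        # E step: responsibilities gamma[n, k], shape (N, K)
        dens = np.exp(-(xs[:, None] - mus) ** 2 / (2 * vars_)) \
            / np.sqrt(2 * np.pi * vars_)
        gamma = pis * dens
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M step: responsibility-weighted means, variances, and mixing weights
        Nk = gamma.sum(axis=0)
        mus = (gamma * xs[:, None]).sum(axis=0) / Nk
        vars_ = (gamma * (xs[:, None] - mus) ** 2).sum(axis=0) / Nk
        pis = Nk / N
    return mus, vars_, pis
```

For example, `mus, vars_, pis = em_gmm_1d(data, K=3)` would fit a three-component mixture to the same sample.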
This time we used a one-dimensional sample, but the EM algorithm also applies to multivariate Gaussian mixtures. A multivariate Gaussian has significantly more parameters to determine than a one-dimensional one (a mean vector and a covariance matrix per component); in practice a library implementation can handle this, as in the sketch below.
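As one option, scikit-learn (not used in this article) provides a `GaussianMixture` class that runs the EM algorithm for multivariate data. A minimal sketch, assuming `faithful.txt` holds two whitespace-separated numeric columns (duration and waiting time):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Assumption: faithful.txt has two numeric columns (duration, waiting time)
X = np.loadtxt("faithful.txt")
gm = GaussianMixture(n_components=2, covariance_type="full").fit(X)
print(gm.weights_)       # mixing probabilities
print(gm.means_)         # component mean vectors
print(gm.covariances_)   # component covariance matrices
```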