This time we will implement maximum likelihood estimation of the Student's t distribution. The Student's t distribution is well known for being more robust to outliers than the Gaussian distribution, but as far as I can remember I have never actually used it, so this is a good opportunity to check its robustness for real. For the first PRML implementation I set myself the rule that nothing but numpy may be used besides the standard library, but this time not-so-common functions such as the digamma function appear, so I also used scipy. Reaching for a third-party package other than numpy as early as the second post of this project makes me worry about what lies ahead.
The Student's t distribution is obtained from the Gaussian distribution by placing a Gamma prior on the precision $\tau$ and integrating $\tau$ out.
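Written out, this marginalization is (in PRML's parameterization, the result is a Student's t distribution with precision $\lambda=a/b$ and degrees of freedom $\nu=2a$):

$$p(x|\mu,a,b) = \int_0^\infty \mathcal{N}(x|\mu,\tau^{-1})\,{\rm Gam}(\tau|a,b)\,{\rm d}\tau = {\rm St}(x|\mu,\lambda=a/b,\nu=2a)$$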
This is the E (expectation) step of the EM algorithm. In this step we fix the parameters to be estimated ($\mu, a, b$) and compute, for each sample point $x_i$, the precision $\tau_i$ (equivalently, the variance) of the Gaussian distribution it was sampled from.
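Concretely, the posterior over each precision is again a Gamma distribution, and its mean is what the code below stores as precisions:

$$p(\tau_i|x_i,\mu,a,b) = {\rm Gam}\left(\tau_i\Big|a+{1\over2},\,b+{1\over2}(x_i-\mu)^2\right),\qquad \mathbb{E}[\tau_i] = {a+{1\over2}\over b+{1\over2}(x_i-\mu)^2}$$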
This is the M (maximization) step of the EM algorithm. Compute the log-likelihood function for the complete data $\{x_i,\tau_i\}$ and maximize it with respect to the parameters $\mu, a, b$. For the precision parameters $\tau_i$ here, use the values obtained in the E step.
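Keeping only the parameter-dependent terms of the complete-data log-likelihood $\sum_i\{\ln\mathcal{N}(x_i|\mu,\tau_i^{-1})+\ln{\rm Gam}(\tau_i|a,b)\}$ gives

$$-{1\over2}\sum_i\tau_i(x_i-\mu)^2 + a\sum_i\ln\tau_i + aN\ln b - b\sum_i\tau_i - N\ln\Gamma(a).$$

Setting the derivatives with respect to $\mu$ and $b$ to zero, and taking a single gradient-ascent step for $a$ (which has no closed-form maximizer), yields the three update equations implemented in the code below ($\psi$ is the digamma function, $\eta$ the learning rate):

$$\mu = {\sum_i\tau_i x_i\over\sum_i\tau_i},\qquad a \leftarrow a + \eta\left(N\ln b + \sum_i\ln\tau_i - N\psi(a)\right),\qquad b = {aN\over\sum_i\tau_i}$$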
If you look at the solutions for $a$ and $b$, they depend on each other, so the three equations above probably do not fully maximize the log-likelihood function. An EM algorithm whose M step does not completely maximize the log-likelihood in this way is called a generalized EM algorithm. When performing maximum likelihood estimation of the Student's t distribution, the degrees-of-freedom parameter $\nu\,(=2a)$ is usually fixed at some value, probably so that the M step can be carried out exactly. If you fix the degrees-of-freedom parameter, the remaining estimation targets are $\mu$ and $\lambda$, and I think the log-likelihood function can then be maximized easily, as in the sketch below.
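For reference, a minimal sketch of that fixed-$\nu$ variant, using the standard scale-mixture EM updates; the function name fit_fixed_nu and the default $\nu=4$ are my own choices for illustration, not something from this article:

import numpy as np

def fit_fixed_nu(x, nu=4.0, n_iter=100):
    # EM for a Student's t distribution with fixed degrees of freedom nu;
    # returns the location mu and the precision lambda_
    mu, lambda_ = np.mean(x), 1.0 / np.var(x)
    for _ in range(n_iter):
        # E step: posterior mean of each latent precision weight
        w = (nu + 1) / (nu + lambda_ * (x - mu) ** 2)
        # M step: both updates are available in closed form
        mu = np.sum(w * x) / np.sum(w)
        lambda_ = len(x) / np.sum(w * (x - mu) ** 2)
    return mu, lambda_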
I will make some corrections and supplements based on the comments I received. In the proper E step, we only go as far as computing the posterior distribution of the precision parameter $\tau_i$,

$$p(\tau_i|x_i,\mu,a,b) = {\rm Gam}\left(\tau_i\Big|a+{1\over2},\,b+{1\over2}(x_i-\mu)^2\right),$$

and in the subsequent M step we compute the expectation of the complete-data log-likelihood $\ln p(x_i,\tau_i|\mu,a,b)$ with respect to that posterior, $\mathbb{E}[\sum_i\ln p(x_i,\tau_i|\mu,a,b)]$. Extracting only the parameter-dependent part of that calculation gives

$$-{1\over2}\sum_i\mathbb{E}[\tau_i](x_i-\mu)^2 + a\sum_i\mathbb{E}[\ln\tau_i] + aN\ln b - b\sum_i\mathbb{E}[\tau_i] - N\ln\Gamma(a),$$

so the shape of the function to be maximized is partially different. The EM algorithm in the original article amounts to a sample approximation of the expectation, $\mathbb{E}[\sum_i\ln p(x_i,\tau_i|\mu,a,b)] \approx \sum_i\ln p(x_i,\tau_i^{({\rm sample})}|\mu,a,b)$, where there is always exactly one sample per point, $\tau_i^{({\rm sample})} = \mathbb{E}[\tau_i]$; I would appreciate it if you could interpret it that way. Judging from the figure below it seems to work to some extent, so this sample approximation may not hurt the accuracy that much (I would be really happy if that is the case).
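For reference, the expectations needed for the exact M step follow from standard Gamma identities: for $\tau\sim{\rm Gam}(\tau|a',b')$,

$$\mathbb{E}[\tau] = {a'\over b'},\qquad \mathbb{E}[\ln\tau] = \psi(a') - \ln b',$$

where $\psi$ is the digamma function. Since the sample approximation replaces $\mathbb{E}[\ln\tau_i]$ with $\ln\mathbb{E}[\tau_i]$, the two objectives differ only in the $a\sum_i\mathbb{E}[\ln\tau_i]$ term.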
To summarize the above: in the E step, set each $\tau_i$ to its posterior mean $\mathbb{E}[\tau_i]$ under the current parameters; in the M step, update $\mu$ and $b$ with their closed-form maximizers and take one gradient-ascent step on $a$. We use this generalized EM algorithm to find the parameters of the Student's t distribution.
Import matplotlib and numpy to illustrate the results, and in addition the gamma and digamma functions from scipy.
import matplotlib.pyplot as plt
import numpy as np
from scipy.special import gamma, digamma
For comparison with the Student's t distribution, here is a class that performs maximum likelihood estimation with a plain Gaussian distribution.
class Gaussian(object):

    def fit(self, x):
        # maximum likelihood estimates: sample mean and sample variance
        self.mean = np.mean(x)
        self.var = np.var(x)

    def predict_proba(self, x):
        # Gaussian probability density at each point in x
        return (np.exp(-0.5 * (x - self.mean) ** 2 / self.var)
                / np.sqrt(2 * np.pi * self.var))
This is the code that performs maximum likelihood estimation with the Student's t distribution. Maximum likelihood estimation is performed by the fit method, which repeats the E and M steps and finishes when the parameters are no longer updated.
class StudentsT(object):

    def __init__(self, mean=0, a=1, b=1, learning_rate=0.01):
        self.mean = mean
        self.a = a
        self.b = b
        self.learning_rate = learning_rate

    def fit(self, x):
        # generalized EM: repeat E and M steps until the parameters converge
        while True:
            params = [self.mean, self.a, self.b]
            self._expectation(x)
            self._maximization(x)
            if np.allclose(params, [self.mean, self.a, self.b]):
                break

    def _expectation(self, x):
        # E step: posterior mean of each precision tau_i
        self.precisions = (self.a + 0.5) / (self.b + 0.5 * (x - self.mean) ** 2)

    def _maximization(self, x):
        # M step: closed-form update of the mean
        self.mean = np.sum(self.precisions * x) / np.sum(self.precisions)
        a = self.a
        b = self.b
        # single gradient-ascent step on a (no closed-form solution);
        # np.sum(np.log(...)) avoids the overflow/underflow risk of
        # np.log(np.prod(...))
        self.a = a + self.learning_rate * (
            len(x) * np.log(b)
            + np.sum(np.log(self.precisions))
            - len(x) * digamma(a))
        # closed-form update of b given a
        self.b = a * len(x) / np.sum(self.precisions)

    def predict_proba(self, x):
        # Student's t density with nu = 2a and lambda = a / b
        return ((1 + (x - self.mean) ** 2 / (2 * self.b)) ** (-self.a - 0.5)
                * gamma(self.a + 0.5)
                / (gamma(self.a) * np.sqrt(2 * np.pi * self.b)))
students_t_mle.py
import matplotlib.pyplot as plt
import numpy as np
from scipy.special import gamma, digamma


class Gaussian(object):

    def fit(self, x):
        # maximum likelihood estimates: sample mean and sample variance
        self.mean = np.mean(x)
        self.var = np.var(x)

    def predict_proba(self, x):
        # Gaussian probability density at each point in x
        return (np.exp(-0.5 * (x - self.mean) ** 2 / self.var)
                / np.sqrt(2 * np.pi * self.var))


class StudentsT(object):

    def __init__(self, mean=0, a=1, b=1, learning_rate=0.01):
        self.mean = mean
        self.a = a
        self.b = b
        self.learning_rate = learning_rate

    def fit(self, x):
        # generalized EM: repeat E and M steps until the parameters converge
        while True:
            params = [self.mean, self.a, self.b]
            self._expectation(x)
            self._maximization(x)
            if np.allclose(params, [self.mean, self.a, self.b]):
                break

    def _expectation(self, x):
        # E step: posterior mean of each precision tau_i
        self.precisions = (self.a + 0.5) / (self.b + 0.5 * (x - self.mean) ** 2)

    def _maximization(self, x):
        # M step: closed-form update of the mean
        self.mean = np.sum(self.precisions * x) / np.sum(self.precisions)
        a = self.a
        b = self.b
        # single gradient-ascent step on a (no closed-form solution)
        self.a = a + self.learning_rate * (
            len(x) * np.log(b)
            + np.sum(np.log(self.precisions))
            - len(x) * digamma(a))
        # closed-form update of b given a
        self.b = a * len(x) / np.sum(self.precisions)

    def predict_proba(self, x):
        # Student's t density with nu = 2a and lambda = a / b
        return ((1 + (x - self.mean) ** 2 / (2 * self.b)) ** (-self.a - 0.5)
                * gamma(self.a + 0.5)
                / (gamma(self.a) * np.sqrt(2 * np.pi * self.b)))


def main():
    # create toy data including outliers and plot histogram
    x = np.random.normal(size=20)
    x = np.concatenate([x, np.random.normal(loc=20., size=3)])
    # density=True replaces the old normed argument removed from matplotlib
    plt.hist(x, bins=50, density=True, label="samples")

    # prepare models
    students_t = StudentsT()
    gaussian = Gaussian()

    # maximum likelihood estimation
    students_t.fit(x)
    gaussian.fit(x)

    # plot results
    x = np.linspace(-5, 25, 1000)
    plt.plot(x, students_t.predict_proba(x), label="student's t", linewidth=2)
    plt.plot(x, gaussian.predict_proba(x), label="gaussian", linewidth=2)
    plt.legend()
    plt.show()


if __name__ == '__main__':
    main()
If you run the above code, you get the following result. As in Figure 2.16 of PRML, the fit by the Student's t distribution is indeed robust even in the presence of outliers. The mean of the Student's t distribution stays around 0, while the mean of the Gaussian distribution is pulled to around 2.5 by the outliers.
This confirmed that maximum likelihood estimation with the Student's t distribution is more robust than modeling with a Gaussian distribution. If I get the chance, I would like to try solving a curve-fitting problem with the Student's t distribution as well.