Speaking of data analysis, it is the construction of a statistical model that can explain the relationships hidden in observation data. Understanding the probability distribution is essential for building statistical models. In the idea of data analysis, the data of a variable arises from a stochastic phenomenon. Conversely, what kind of probability distribution can be used to represent the observed data pattern? </ b>
One way to do this is to apply a probability distribution. However, the shape of the probability distribution changes depending on the parameters, so it is necessary to determine appropriate parameter values based on the observed data for fitting. The maximum likelihood estimation method is a major method for calculating the parameter value.
Roughly describing the method of maximum likelihood estimation, it is assumed that the obtained data are N data Y extracted from a probability distribution with a certain parameter θ, and the extraction probability of each value is $ p (y_i | θ). ) Let the likelihood function L be the product of $ by N pieces.
There are two types of probability distributions, discrete type and continuous type, depending on the data. This time, we are doing maximum likelihood estimation of Poisson distribution and fitting of probability distribution as an example of discrete type probability distribution.
--Discrete probability distribution: Poisson distribution (this time) -~~ Continuous probability distribution: Normal distribution ~~ (next time)
The Poisson distribution is mainly used for count data. The probability function that represents the distribution is as follows.
The test data for estimating and fitting the model is created in advance from the probability distribution of the target so that the answers can be matched.
Python
"""Random numbers from Poisson distribution"""
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(seed=10)
#Parameter λ=2.4
poisson_values = np.random.poisson(lam=2.4, size=1000)
p_y, bin_edges, patches = plt.hist(poisson_values , bins=11, range=[-0.5, 10.5], normed=True)
y_k = 0.5*(bin_edges[1:] + bin_edges[:-1])
y_k #y
p_y #p(y)
You can see the values and numbers of the 1000 pseudo data obtained. Now, let's apply the Poisson distribution to such data and estimate its parameters with maximum likelihood.
Python
"""Parameter estimation by partial differentiation of the log-likelihood function"""
import sympy #Library for algebraic calculations (solving derivatives and equations)
#Define variables
sympy.var('λ y ')
#Likelihood p(Parameters|y)Define
f = (λ**y)*sympy.exp(-λ)/sympy.factorial(y)
#Logarithmic
logf=sympy.log(f)
#Partially differentiate f and expand the equation
pdff = sympy.expand(sympy.diff(logf, λ))
The formula pdff, which is the partial derivative of the log-likelihood with respect to the parameter λ, is $ \ frac {y_i} {λ} -1 $. Since this can be calculated as λ, which is the sum of the data and $ \ sum_ {i = 1} ^ {1000} (\ frac {y_i} {λ} -1) = 0 $.
Python
def L_sympy(f,var,values):
likelihood = 0 #Initial value of likelihood
for i in np.arange(len(values)): #For the number of data
# model output
# print(values[i])
likelihood += f.subs(var,values[i]) #Substitute a value for y
# print(likelihood)
param = sympy.solve(likelihood, λ) #Solve the equation for λ
# print(param)
return param
L_sympy(pdff,"y",poisson_values)
Parameter estimator:[289/125]=2.312
The parameter λ was estimated to be 2.312. The setting of λ of this pseudo data is 2.4, so you can estimate it nicely.
As for the textbook method, you can directly obtain the formula that can calculate the parameters by solving the partially differentiated formula (derivative) as described above. However, in general statistical modeling, there are multiple parameters to estimate, and the probability distribution formula is more complicated, so it is not easy to solve.
In the commonly used generalized linear model (GLM), the statistical model that describes the data is represented by the form (and link function) of the probability distribution and linear prediction.
The example given above just treats Y as Y ... It can be said that it is a statistical model.
By the way, it is not easy to solve because there are multiple parameters to estimate as a general statistical model and the formula is complicated (not a linear formula like so-called $ \ frac {y} {λ} -1 $, but $ λ. Non-linear like ^ 2 + θ ^ 3 + ... $). Therefore, we use the method of solving nonlinear equations.
Python
"""Function of discrete probability distribution (probability mass function)"""
def probability_poisson(y,λ):
from scipy.special import factorial
# λ:Parameters
# y:Data y: Occurs y times
# return:Probability P of data Y(y|Parameters)
return (λ**y)*np.exp(-λ)/factorial(y)
"""Log-likelihood function: discrete type"""
def L_func(param,y):
likelihood = 0 #Initial value of likelihood
for i in np.arange(len(y)):
# model output
p = probability_poisson(y[i], param)
# likelihood
likelihood += -np.log(p) #Likelihood
return likelihood
"""Parameter estimation"""
"""
Quasi-Newton method (BFGS, L)-BFGS method: Memory can be saved in complicated cases)
"""
from scipy import optimize
x0 = [2] #Initial values of parameters
bound = [(0, None)] #Scope of optimal parameter search(min,max)
params_MLE = optimize.minimize(L_func,x0,args=(poisson_values),method='l-bfgs-b',
jac=None, bounds=bound, tol=None, callback=None,
options={'disp': None, 'maxls': 20, 'iprint': -1,
'gtol': 1e-05, 'eps': 1e-08, 'maxiter': 15000,
'ftol': 2.220446049250313e-09, 'maxcor': 10,
'maxfun': 15000})
#Maximum likelihood estimator of parameters
print('Parameters:',params_MLE.x)
#Number of parameters
k=1
#AIC (the model with the smallest value is a good fit)
print('AIC:',params_MLE.fun*(2)+2*k)
Parameters:[2.31200054]
AIC: [3607.0081302]
Similarly, the parameter λ was estimated to be 2.312.
Python
"""curve_There is also a parameter calculation with fit for the time being"""
from scipy.optimize import curve_fit
parameters, cov_matrix = curve_fit(f=probability_poisson, xdata=y_k, ydata=p_y)
print("Parameters:",parameters, "Covariance:",cov_matrix)
Parameters: [2.31643586]Covariance: [[0.00043928]]
Let's apply the model (Poisson distribution) to the data with the estimated parameter λ = 2.312.
Python
"""Illustration of whether it applies"""
acc_mle = probability_poisson(y_k, params_MLE.x)
plt.hist(poisson_values , bins=11, range=[-0.5, 10.5], normed=True)
plt.plot(y_k,acc_mle)
It feels good.
Recommended Posts