This article is the day-16 entry of the Nifty Group Advent Calendar 2020. Yesterday's post was @hajimete's "I made my own planning poker with Node.js". I will use it when an opportunity for Scrum development comes around!
Data analysis efforts are currently spreading through our company, and I am also taking part in a data analysis project.
This time, partly to organize my own thoughts, I have summarized the generalized linear model, one of the statistical methods I learned recently. We will touch on what a generalized linear model is, the work required for modeling, and how to evaluate the resulting model.
If you spot any mistakes or have any concerns, please leave a comment!
I mainly referred to the following book:

- Introduction to Statistical Modeling for Data Analysis: Generalized Linear Models, Hierarchical Bayesian Models, MCMC (Science of Probability and Information)
The Generalized Linear Model (GLM) is one of the methods for building a model within the framework of statistics. Modeling is performed by choosing the following three components with reference to the distribution of the data to be analyzed:
- Probability distribution
- Link function
- Linear predictor
After setting these three, a model is fitted with what is called the maximum likelihood estimation method, and the fitted model is then evaluated.
A probability distribution associates each value a random variable can take with the probability of obtaining that value.
- For example, in the case of a die, the random variable takes the six integers from 1 to 6, and the probability of each roll is 1/6, so the probability distribution is uniform, as in the sketch below.
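A minimal sketch of this distribution in Python (the variable names are only for illustration):

```python
# Probability distribution of a fair six-sided die:
# each face 1..6 is mapped to the probability of obtaining it
die_distribution = {face: 1 / 6 for face in range(1, 7)}
print(die_distribution)
print(sum(die_distribution.values()))  # the probabilities sum to 1
```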
GLM treats the residuals of the objective variable as random variables and assumes a probability distribution that those residuals follow.
A residual is the difference between a predicted value and the actual data. Taking linear regression as an example, suppose you draw a regression line through the distribution of the objective variable. The distance between the line and a data point is then the difference between the predicted value (the point on the regression line) and the actual data point, and this is called the residual.
If the residuals follow a normal distribution, more of the data gathers around the regression line, so the linear regression fits the data points well. On the other hand, if the residuals do not follow a normal distribution, the data vary widely and cannot be captured by a regression line, resulting in poor accuracy. The sketch below illustrates this idea.
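A minimal sketch, assuming synthetic data: fit a least-squares line and inspect the distribution of its residuals.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=x.size)  # noisy straight line

slope, intercept = np.polyfit(x, y, 1)   # least-squares regression line
residuals = y - (slope * x + intercept)  # actual data minus predicted values

plt.hist(residuals, bins=20)
plt.xlabel("residual")
plt.ylabel("frequency")
plt.show()  # roughly bell-shaped, consistent with a normal distribution
```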
In GLM, probability distributions other than the normal distribution can be assumed for the residuals, which makes modeling possible for a variety of residual distributions. Which probability distribution to use is determined to a large extent by the characteristics of the objective variable:
- Poisson distribution: data are discrete and non-negative, have no upper bound, and the mean and variance are roughly equal.
- Binomial distribution: data are discrete, non-negative, and within a finite range; the variance is a function of the mean.
- Bernoulli distribution: data take only two discrete values within a finite range; the variance is a function of the mean.
- Normal distribution: data are continuous, the range is $(-\infty, +\infty)$, and the variance is independent of the mean.
- Gamma distribution: data are continuous, the range is $[0, +\infty)$, and the variance is a function of the mean.
The link function is the function that associates the objective variable $y_i$ with the explanatory variables when fitting to the data points. In general it can be written as

$$f(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots \tag{1}$$

$f$: link function

The right-hand side of equation (1) is called the linear predictor, and is represented by the sum of the products of the explanatory variables $x_i$ and the corresponding parameters $\beta_i$ $(i = 1, 2, \ldots)$, plus the intercept $\beta_0$.
The linear predictor expresses the "effect" of the explanatory variables: which explanatory variables to take into account for the objective variable on the left-hand side, and how strongly each of them acts.
- A transformed explanatory variable, such as $\log{x_i}$, may also be used.
- Note, however, that the linear predictor must be a linear combination of the parameters $\beta_i$: for example, $\beta_0 + \beta_1 \log{x_1} + \beta_2 x_2^2$ is valid, while something like $\beta_0 + x_1^{\beta_1}$ is not.
For GLM, select the probability distribution and a link function corresponding to that distribution. As with the probability distribution itself, this correspondence is largely fixed:
- Poisson distribution: log link function
- Binomial distribution: logit link function
- Bernoulli distribution: logit link function
- Normal distribution: identity link function
- Gamma distribution: log link function
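For reference, here is a minimal sketch of how these pairs can be written with the statsmodels API used later in this article. Note that newer statsmodels versions use capitalized link instances such as `links.Log()`, and that the Bernoulli case is handled by the Binomial family:

```python
import statsmodels.api as sm

# distribution -> (family, link) pairs, following the list above
links = sm.genmod.families.links
pairs = {
    "Poisson":  sm.families.Poisson(link=links.log),
    "Binomial": sm.families.Binomial(link=links.logit),
    "Normal":   sm.families.Gaussian(link=links.identity),
    "Gamma":    sm.families.Gamma(link=links.log),
}
```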
From here, let's look at how GLM actually works, using the fictitious data handled in the reference.
I will borrow the setting of the data used in the reference as-is. Suppose you are given a dataset of seed counts and plant sizes for a fictitious plant species. For each individual plant $i$, we want to model the number of seeds $y_i$ obtained from a plant of size $x_i$. Also assume that $y_i$ has no upper limit.
- Objective variable: number of seeds $y_i$
- Explanatory variable: plant size $x_i$
Since the number of seeds $y_i$ is a non-negative discrete value with no upper limit, let's assume a Poisson distribution as the probability distribution. The Poisson distribution is expressed by the following equation:

$$p(y_i \mid \lambda_i) = \frac{\lambda_i^{y_i} \exp(-\lambda_i)}{y_i!}$$
Here, $\lambda_i$ is the average number of seeds obtained from the plant; it is the mean of the Poisson distribution and the parameter that determines its shape. Let's confirm with the code below that the value of $\lambda_i$ changes the shape of the Poisson distribution.
```python
# Plot Poisson distributions for several values of λ
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson

x = np.arange(1, 50, 1)
y1 = [poisson.pmf(i, 10) for i in x]
y2 = [poisson.pmf(i, 20) for i in x]
y3 = [poisson.pmf(i, 30) for i in x]

plt.bar(x, y1, align="center", width=0.4, color="red",
        alpha=0.5, label="Poisson λ= %d" % 10)
plt.bar(x, y2, align="center", width=0.4, color="green",
        alpha=0.5, label="Poisson λ= %d" % 20)
plt.bar(x, y3, align="center", width=0.4, color="blue",
        alpha=0.5, label="Poisson λ= %d" % 30)
plt.xlabel("x")
plt.ylabel("probability")
plt.legend()
plt.show()
```
As the figure above shows, the shape of the Poisson distribution changes with the parameter $\lambda_i$. In other words, given the average number of seeds $\lambda_i$, we obtain the probability distribution followed by the number of seeds $y_i$ from the individual plant $i$.
Next, we determine the link function and linear predictor for the Poisson distribution. Their form is more or less fixed for ease of calculation; in the case of the Poisson distribution, they are set as follows:

$$\log{\lambda_i} = \beta_0 + \beta_1 x_i$$
Transforming this gives $\lambda_i = \exp(\beta_0 + \beta_1 x_i)$. From the plant size $x_i$ and the parameters $\beta_i$, we obtain the average number of seeds $\lambda_i$, and hence the Poisson distribution followed by an individual plant of size $x_i$.
Let's plot the relationship between $\lambda_i$ and $x_i$.
```python
# Plot the relationship between average seed number and plant size
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 30, 1)
beta_0 = 1
beta_1 = 0.1
lam = np.exp(beta_0 + beta_1*x)

plt.plot(x, lam)
plt.xlabel("x")
plt.ylabel("λ")
plt.show()
```
This curve is the model curve we are trying to fit to the $(x_i, y_i)$ dataset.
Finally, the values of the still-undetermined parameters $\beta_i$ are estimated by the maximum likelihood estimation method. This method defines a quantity called the likelihood, which expresses the goodness of fit to the data at hand, and finds the parameters that maximize it. I will omit the detailed explanation, but the likelihood for this example is shown below.
For the Poisson distribution, the likelihood is expressed as a function of the parameters $\beta_i$ as follows:

$$L(\beta_0, \beta_1) = \prod_i \frac{\lambda_i^{y_i} \exp(-\lambda_i)}{y_i!}, \qquad \lambda_i = \exp(\beta_0 + \beta_1 x_i)$$

For ease of calculation, the log-likelihood, obtained by taking the logarithm of the likelihood, is often used instead:

$$\log L(\beta_0, \beta_1) = \sum_i \left( y_i \log{\lambda_i} - \lambda_i - \log{y_i!} \right)$$
We find the parameters $\beta_i$ that maximize this log-likelihood, as in the sketch below.
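A minimal sketch of maximum likelihood estimation for this model, assuming synthetic data and minimizing the negative log-likelihood with scipy:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln  # gammaln(y + 1) = log(y!)

rng = np.random.default_rng(0)
x = rng.uniform(5, 15, size=100)        # plant sizes
y = rng.poisson(np.exp(1.0 + 0.1 * x))  # seed counts; true β0 = 1.0, β1 = 0.1

def neg_log_likelihood(beta):
    lam = np.exp(beta[0] + beta[1] * x)
    # Poisson log-likelihood: Σ (y log λ - λ - log y!)
    return -np.sum(y * np.log(lam) - lam - gammaln(y + 1))

result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
print(result.x)  # estimates should be close to (1.0, 0.1)
```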
Let's take a brief look at model evaluation.
When fitting, we used the maximum likelihood estimation method to find the parameter values that maximize the log-likelihood. However, this maximum value of the log-likelihood (the maximum log-likelihood) only expresses the goodness of fit to the data at hand, so prediction accuracy may drop for unknown data. In statistical modeling, multiple models are therefore evaluated by comparing the AIC, which expresses "goodness of prediction". I will omit a detailed explanation of AIC; in short, the lower the AIC, the better the expected prediction.
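For reference, AIC is computed from the maximum log-likelihood $\log{L^*}$ and the number of estimated parameters $k$:

$$\mathrm{AIC} = -2 \log{L^*} + 2k$$

The $2k$ term penalizes models with more parameters, which guards against overfitting the data at hand.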
Below is a quick look at how to run a GLM in Python.
We fit a GLM to the Boston house price data using the Python library statsmodels.
We model the house price (column name: PRICE) against the average number of rooms per dwelling (column name: RM), and compare the AICs when the assumed probability distribution is changed.
Google Colaboratory
```python
# Basic library imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()

# statsmodels
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Import the Boston house price data
# (note: load_boston was removed in scikit-learn 1.2; this assumes an older version)
from sklearn.datasets import load_boston

# Convert the data to a DataFrame
boston = load_boston()
boston_df = pd.DataFrame(boston.data, columns=boston.feature_names)

# Prepare the explanatory variables x and the objective variable y
boston_df["PRICE"] = boston.target
x = boston_df.drop("PRICE", axis=1)
y = boston_df["PRICE"]
```
Let's check the explanatory variables of the Boston Home Price Dataset.
```python
# Description of the Boston dataset columns
print(boston.DESCR)
```
```
CRIM     per capita crime rate by town
ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS    proportion of non-retail business acres per town
CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX      nitric oxides concentration (parts per 10 million)
RM       average number of rooms per dwelling
AGE      proportion of owner-occupied units built prior to 1940
DIS      weighted distances to five Boston employment centres
RAD      index of accessibility to radial highways
TAX      full-value property-tax rate per $10,000
PTRATIO  pupil-teacher ratio by town
B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT    % lower status of the population
MEDV     Median value of owner-occupied homes in $1000's
```
Let's plot a scatter plot of PRICE against RM (with the objective variable PRICE on the vertical axis, matching the later plots).

```python
column = "RM"
sns.scatterplot(x=column, y="PRICE", data=boston_df[["PRICE", column]])
```
You can see that it has a distribution that seems to be able to fit in a relatively straight line.
Finally, we do the modeling. As mentioned above, the implementation itself is very simple thanks to the library.
```python
# DataFrame containing both the explanatory and objective variables
data = boston_df

# Define the linear predictor
formula = "PRICE ~ 1 + RM"

# Set the link functions
# (newer statsmodels versions use capitalized instances, e.g. links.Log())
link = sm.genmod.families.links.log
link_gauss = sm.genmod.families.links.identity

# Set the probability distributions
family = sm.families.Poisson(link=link)
family_gauss = sm.families.Gaussian(link=link_gauss)

# Modeling (Poisson distribution, log link)
model = smf.glm(formula=formula, data=data, family=family)

# Modeling (normal distribution, identity link)
model_gauss = smf.glm(formula=formula, data=data, family=family_gauss)

# Estimation
result = model.fit()
result_gauss = model_gauss.fit()
```
Output the modeling results.

```python
print("===== Modeling result =====")
print()
print(result.summary())

print("===== Modeling result (model_gauss) =====")
print()
print(result_gauss.summary())
```
Check the Log-Likelihood item in the output. This is the value of the log-likelihood: the larger it is, the better the fit to the data used for estimation. Here, the Poisson distribution model gives -1689.2 and the normal distribution model gives -1673.1, so there is not much difference.
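As a side note, the maximum log-likelihood can also be read directly from the fitted results via the llf attribute of statsmodels GLM results:

```python
print(result.llf)        # Poisson model
print(result_gauss.llf)  # normal distribution model
```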
Next, let's output the AIC, our model evaluation criterion, and compare the two models.
```python
# Check the AIC
print("AIC")
print(result.aic)
print()
print("AIC(model_gauss)")
print(result_gauss.aic)
print()
```
```
AIC
3382.3210451034556

AIC(model_gauss)
3350.151117225073
```
Since the normal distribution model has the lower AIC, it can be said to have the better predictive performance for this model of PRICE against RM.
As a bonus, let's plot the fitted curves from the models over the data points.
```python
# Predicted values from each model
y_hat = result.predict(x)
y_hat_gauss = result_gauss.predict(x)

fig = plt.figure(figsize=(6.0, 6.0))
plt.plot(x["RM"], y, "o", label="data")
plt.plot(x["RM"], y_hat, "*", label="RM")
plt.plot(x["RM"], y_hat_gauss, "*", label="RM: gauss")
plt.xlabel("x (RM)")
plt.ylabel("y (PRICE)")
plt.legend()
plt.show()
```
It was nice to be able to summarize the generalized linear model that I had been learning.
In business settings, prediction is sometimes what matters most, but I personally like this approach of facing the phenomenon directly and assuming a model that fits it.
When actually using it, there are still many unclear points such as how to select the probability distribution, so I would like to continue learning.
Tomorrow's article is by @y_kono. Looking forward to it!