- I want to implement machine learning algorithms from scratch in Python.
- When you google "machine learning" there is a well-regarded technical review serial that also works well as a study guide. It is an old series, but the explanations are easy to follow, so it is recommended if you have not read it yet.
- What follows is my own re-summary of that series for self-study (see the original pages, linked at the end, for details).
――In "Linear Regression", first define a "basis set" --Literally, what you use as a base to express a function --There are no special conditions for basis functions, so you can use any function you like. ――However, since the performance of the model and the shape of the obtained function are determined by the selection method, it is necessary to select the basis function according to the problem to be solved. ――Two types of basis functions that are often used are as follows (details of each will be described later).
**Polynomial basis**
\phi_i(x)=x^i\hspace{1.0em}(i=0,\cdots,M-1)
**Gaussian basis**
\phi_i(x)=\exp\left\{-\frac{(x-c_i)^2}{2s^2}\right\}\hspace{1.0em}(i=0,\cdots,M-1)
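To make the two basis families concrete, here is a small sketch (my own addition, not from the original page) that plots a few polynomial and Gaussian basis functions on [0, 1]; the number of bases M, the width s, and the centers are chosen arbitrarily for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 200)
M = 4                              # number of basis functions (arbitrary choice)
s = 0.1                            # width of the Gaussian bases (arbitrary choice)
centers = np.linspace(0, 1, M)     # centers c_i of the Gaussian bases

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
for i in range(M):
    ax1.plot(x, x ** i, label=f"$x^{i}$")                       # polynomial basis phi_i(x) = x^i
    ax2.plot(x, np.exp(-(x - centers[i]) ** 2 / (2 * s ** 2)),  # Gaussian basis phi_i(x)
             label=f"$c_{i}$={centers[i]:.2f}")
ax1.set_title("Polynomial basis")
ax2.set_title("Gaussian basis")
ax1.legend()
ax2.legend()
plt.show()
```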
- The basis functions are determined first and then fixed; the function $f(x)$ to be estimated is expressed as their linear sum.
- By treating the weight $w_i$ given to each basis function as a parameter, you obtain the "set of candidate functions".
f(x)= \sum^{M-1}_{i=0}w_i\phi_i(x) = w^T\phi(x) \hspace{1.0em}(Equation 1)
- Since $f(x)$ is determined once the parameters $w_i$ are fixed, we need a way to choose $w_i$ appropriately.
- => Use the squared error as the error function.
E(w)=\frac{1}{2}\sum^N_{n=1}(f(x_n)-t_n)^2
=\frac{1}{2}\sum^N_{n=1}\left(\sum^{M-1}_{i=0}w_{i}x^{i}_{n}-t_n\right)^2 \hspace{1.0em}(Equation 2)
- Although it looks complicated, remember that $x_n$ and $t_n$ are constants (the given data points); viewed as a function of $w_i$, this is just a quadratic expression in $w_i$.
- Therefore you can find the $w_i$ that minimizes the error function $E(w)$, which in turn determines the function $f(x)$ you are looking for.
- The basic flow of linear regression is to find the weights $w$ by minimizing the squared error.
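As a concrete illustration of the quantity being minimized, here is a minimal sketch (my own addition) that evaluates the squared error $E(w)$ of Equation 2 for candidate weight vectors; the toy data points and the candidate weights are arbitrary.

```python
import numpy as np

# Toy data points (x_n, t_n); arbitrary values for illustration only
x_n = np.array([0.0, 0.5, 1.0])
t_n = np.array([0.1, 0.8, 0.2])

def phi(x, M=3):
    """Polynomial basis: phi_i(x) = x**i for i = 0 .. M-1."""
    return np.array([x ** i for i in range(M)])

def squared_error(w, x_n, t_n):
    """E(w) = 1/2 * sum_n (f(x_n) - t_n)**2 with f(x) = w^T phi(x) (Equation 2)."""
    f = np.array([w @ phi(x) for x in x_n])
    return 0.5 * np.sum((f - t_n) ** 2)

# E(w) is a quadratic function of w, so different candidate w give different errors
print(squared_error(np.array([0.0, 1.0, -1.0]), x_n, t_n))
print(squared_error(np.array([0.1, 2.0, -2.0]), x_n, t_n))
```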
Given the following plot, I want to find its linear regression
```python
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

X = np.array([0.02, 0.12, 0.19, 0.27, 0.42, 0.51, 0.64, 0.84, 0.88, 0.99])
t = np.array([0.05, 0.87, 0.94, 0.92, 0.54, -0.11, -0.78, -0.79, -0.89, -0.04])

plt.xlim([0, 1.0])
plt.ylim([-1.5, 1.5])
plt.plot(X, t, 'o')
plt.show()
```
Find the relationship between X and t by linear regression
The polynomial basis is defined by the following formula (repeated from above).
\phi_i(x)=x^i\hspace{1.0em}(i=0,\cdots,M-1)
- The polynomial basis has the advantage that the solution comes out in the familiar form of a polynomial, but each basis function strongly affects the estimate even at distant points.
- In other words, the constraints it places on the form of the solution are rather strong.
- Here we use a cubic polynomial basis (M = 4: 1, x, x², x³) for the feature function φ. The regression function $f(x)$ we want to find is therefore:
f(x)=w_{1}+w_{2}x+w_{3}x^2+w_{4}x^3
- With the basis functions φ(x) fixed, the parameters $w$ can be varied freely; as $w$ changes, so does $f(x)$.
- Finding "the $w$ that best fits the given data points" is the linear regression approach.
- The $w$ for which $f(x)$ passes "closest" to the data points $(x_n, t_n)$ $(n=1,\cdots,N)$ is found by setting the partial derivatives of the error function $E(w)$ to zero, since the partial derivatives vanish at the $w$ that minimizes $E(w)$.
\frac{\partial E(w)}{\partial w_m}=\sum^N_{n=1}\phi_m(x_n)\left(\sum^M_{j=1}w_j\phi_j(x_n)-t_n\right)=0 \hspace{1.0em}(m=1,\cdots,M) \hspace{1.0em}(Equation 3)
- Arranging these simultaneous equations in matrix form, the solution $w$ is obtained by the following equation.
w = (\Phi^T\Phi)^{-1}\Phi^{T}t \hspace{1.0em}(Equation 4)
Here, Φ is the design matrix defined as follows.
\Phi=\left(
\begin{array}{cccc}
\phi_{1}(x_1) & \phi_{2}(x_1) & \cdots & \phi_{M}(x_1) \\
\vdots & \vdots & & \vdots \\
\phi_{1}(x_N) & \phi_{2}(x_N) & \cdots & \phi_{M}(x_N) \\
\end{array}
\right) \hspace{1.0em}(Equation 5)
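For reference, the step from Equation 3 to Equation 4 can be written compactly in matrix notation (this intermediate step is my own addition, using the Φ defined in Equation 5):

\frac{\partial E(w)}{\partial w} = \Phi^T(\Phi w - t) = 0
\;\Longrightarrow\;
\Phi^T\Phi\, w = \Phi^T t
\;\Longrightarrow\;
w = (\Phi^T\Phi)^{-1}\Phi^T t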
- Substituting the parameters $w$ obtained this way into $f(x)=\sum^{M-1}_{i=0}w_i\phi_i(x)=w^T\phi(x)$ gives the function $f(x)$ that passes near the data points.
Writing the above directly as code, find the linear regression with the polynomial basis (cubic function). (A regularization term is not considered here.)
```python
# Using a polynomial basis for the feature function φ
def phi(x):
    return [1, x, x**2, x**3]

PHI = np.array([phi(x) for x in X])

# Computing (Equation 4). np.dot: matrix product, np.linalg.inv: matrix inverse
w = np.dot(np.linalg.inv(np.dot(PHI.T, PHI)), np.dot(PHI.T, t))
# The simultaneous equations can also be solved with np.linalg.solve, which is faster.
# np.linalg.solve(A, b) returns A^{-1} b
#w = np.linalg.solve(np.dot(PHI.T, PHI), np.dot(PHI.T, t))

xlist = np.arange(0, 1, 0.01)                # x points of the regression curve, plotted finely in 0.01 steps
ylist = [np.dot(w, phi(x)) for x in xlist]   # y values of the regression function

plt.plot(xlist, ylist)   # plot the regression function
plt.plot(X, t, 'o')
plt.show()

# Output the w obtained by the above calculation
w
```
```
array([ -0.14051447, 11.51076413, -33.6161329 , 22.33919877])
```
In other words, the linear regression yields the regression function $f(x)=-0.14+11.5x-33.6x^2+22.3x^3$.
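As a quick sanity check (my own addition, not part of the original article, assuming X, t and w from the cells above are still in scope): NumPy's built-in least-squares polynomial fit should give the same coefficients. Note that `np.polyfit` returns them from the highest degree down.

```python
# Cross-check the hand-computed w with NumPy's least-squares polynomial fit.
# np.polyfit returns coefficients in descending order of degree, so reverse them
# to compare with w = (w_0, w_1, w_2, w_3) in ascending order.
w_check = np.polyfit(X, t, 3)[::-1]
print(w_check)   # should agree with the w computed above (up to numerical precision)
```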
The Gaussian basis is a bell-shaped function (not a distribution) defined by the following equation.
\phi_i(x)=\exp\left\{-\frac{(x-c_i)^2}{2s^2}\right\}\hspace{1.0em}(i=0,\cdots,M-1)\hspace{1.0em}(Equation 6)
- The Gaussian basis yields a model that relies strongly on the information near each data point and whose influence weakens with distance.
- It is based on the idea that "some random noise is added to the observed data".
- The explanation assumes that this deviation of the observed values (the random noise) follows a normal distribution.
- $s$ is a parameter controlling the distance over which a data point exerts influence (the Gaussian basis is not a distribution, but $s$ plays a role like the standard deviation of a normal distribution).
- The larger $s$ is, the farther its influence reaches, so a small value is generally desirable.
- However, if $s$ is too small, regions where the data density is low end up using no data points at all, and inference there breaks down.
- This time, try running it with $s = 0.1$.
- $c_i$ represents the center of each Gaussian basis function.
- Since the function we want to find is a linear sum of these Gaussian bases, the centers must be chosen so that they cover the interval over which we want to perform the regression.
- As a rule of thumb, spacing the centers at the value chosen for $s$ above works fine.
- This time we use 11 Gaussian basis functions with centers placed at $s = 0.1$ intervals over the interval 0 to 1 (see the sketch below).
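To visualize this setup (s = 0.1, centers every 0.1 on [0, 1]), here is a small sketch (my own addition, not from the original page) that plots the 11 Gaussian basis functions:

```python
import numpy as np
import matplotlib.pyplot as plt

s = 0.1                            # width of each Gaussian basis
centers = np.arange(0, 1 + s, s)   # 11 centers: 0.0, 0.1, ..., 1.0

x = np.linspace(0, 1, 200)
for c in centers:
    plt.plot(x, np.exp(-(x - c) ** 2 / (2 * s ** 2)))  # Equation 6 for each center c_i
plt.title("11 Gaussian basis functions (s = 0.1)")
plt.show()
```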
First, let us simply produce a plot in which the feature function φ used in the linear regression above is switched from the polynomial basis to the Gaussian basis.
```python
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

X = np.array([0.02, 0.12, 0.19, 0.27, 0.42, 0.51, 0.64, 0.84, 0.88, 0.99])
t = np.array([0.05, 0.87, 0.94, 0.92, 0.54, -0.11, -0.78, -0.79, -0.89, -0.04])

# Using a Gaussian basis for the feature function φ
# phi returns a 12-dimensional vector: the 11 Gaussian basis functions plus a leading 1 for the constant term
# (Gaussian bases centered at 0 to 1 in 0.1 increments = 11 values, plus the constant term = 12 dimensions)
def phi(x):
    s = 0.1  # "width" of the Gaussian bases
    # Writing down Equation 6
    return np.append(1, np.exp(-(x - np.arange(0, 1 + s, s)) ** 2 / (2 * s * s)))

PHI = np.array([phi(x) for x in X])
w = np.linalg.solve(np.dot(PHI.T, PHI), np.dot(PHI.T, t))

xlist = np.arange(0, 1, 0.01)
ylist = [np.dot(w, phi(x)) for x in xlist]

plt.plot(xlist, ylist)
plt.plot(X, t, 'o')
plt.show()
```
With 12 dimensions, the Gaussian basis has more degrees of freedom and expressive power than the 4-dimensional polynomial basis, and it overfits completely.
- The difference lies in the meaning of the estimated value $w$:
- Linear regression => find the coefficients $w$ that minimize the squared error within the range of linear sums of the basis functions.
- Bayesian linear regression => take the value that **maximizes the posterior probability** as the estimate of $w$.
- Probabilistic linear regression starts from the idea that "some random noise is added to the observed data".
- This "random noise", i.e. the deviation, is formulated as the normal distribution $N(t|\mu,\beta^{-1})$ in the following equation.
p(t|w,x) = N(t|\mu,\beta^{-1})=\frac{1}{Z}\exp\left\{-\frac{1}{2}\beta(t-\mu)^2\right\} \hspace{1.0em}(Equation 7)
where $\mu=f(x)=\sum^M_{i=1}w_i\phi_i(x)$ and $Z=\sqrt{2\pi\beta^{-1}}$.
- $N(\mu,\beta^{-1})$ is a distribution centered on the predicted value $\mu=f(x)$, with probability decreasing as the distance from it increases.
- How quickly it decreases is controlled by $\beta$: when $\beta$ is large the mass concentrates tightly around the center, and when it is small the spread becomes wide.
- This $\beta$ is the precision, the inverse of the variance of the noise.
- Plugging a data point $(x_1, t_1)$ into Equation 7 above gives a function of $w$, $p(t_1|w,x_1)$, called the "likelihood function".
- In the probabilistic (maximum-likelihood) formulation of linear regression, the $w$ that maximizes the likelihood function is selected as the optimal parameter.
- The likelihood of the whole data set is the product of the likelihoods at the individual points, and we want the $w$ that maximizes it.
- Products are awkward to work with, so take the logarithm of the likelihood (the log-likelihood function); the $w$ that maximizes the likelihood also maximizes its logarithm.
- In other words, we can find $w$ by maximizing Equation 8 below, the logarithm of Equation 7 over all data points.
\ln{p(T|w,X)}=-\frac{1}{2}\beta\sum^N_{n=1}\left(t_n-\sum^{M}_{i=1}w_{i}\phi_i(x_n)\right)^2+C \hspace{1.0em}(Equation 8)
- Looking at its structure, Equation 8 is essentially the same as Equation 2, where the squared error was computed.
- The sign is flipped because the squared error is to be "minimized" while the log-likelihood is to be "maximized".
- β (> 0) and C are constants that do not involve $w$, so they have no effect on the maximization.
- In other words, **the $w$ that maximizes Equation 8 is given by the same Equation 4**.
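As a check on the argument above, the log-likelihood of Equation 8 can be evaluated directly. A minimal sketch (my own addition), assuming the Gaussian-basis `PHI`, `t`, and `w` from the cells above are still in scope and choosing β arbitrarily:

```python
import numpy as np

beta = 9.0   # noise precision; arbitrary value for illustration

def log_likelihood(w, PHI, t, beta):
    """Equation 8: ln p(T|w,X) = -beta/2 * sum_n (t_n - w^T phi(x_n))^2 + C."""
    N = len(t)
    C = 0.5 * N * np.log(beta / (2 * np.pi))   # constant term, independent of w
    residuals = t - PHI.dot(w)
    return -0.5 * beta * np.sum(residuals ** 2) + C

# The w that minimizes the squared error also maximizes this log-likelihood,
# so the least-squares solution of Equation 4 is the maximum-likelihood solution.
print(log_likelihood(w, PHI, t, beta))
```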
- Posterior distribution
p(w|t)=p(t|w)p(w)/p(t) \\
p(t)=\int p(t,w)dw=\int p(t|w)p(w)dw
- As the prior distribution, assume a normal distribution $p(w)=N(w|0,\alpha^{-1}I)$ with mean 0 and a covariance matrix that is a constant multiple of the identity matrix.
- Update formula (when a single data point $(x, t)$ is observed)
p(w|t)=\frac{1}{Z'}\exp\left\{-\frac{1}{2}(w-m)^T\Sigma^{-1}(w-m)\right\} \hspace{1.0em}(Equation 9)
where
m=\beta t\Sigma\phi(x)\\
\Sigma^{-1}=\alpha I+\beta\phi(x)\phi(x)^T
- Looking inside the braces of the exp in Equation 9, it matches the exp part of the multivariate normal distribution $N(w|\mu,\Sigma)$.
- In other words,
N(w|\mu,\Sigma)=\frac{1}{Z}\exp\left\{-\frac{1}{2}(w-\mu)^T\Sigma^{-1}(w-\mu)\right\}\\
p(w|t)=N(w|m,\Sigma)
- Since the prior distribution is a normal distribution and the posterior obtained here is also a normal distribution, the normal distribution serves as a conjugate prior.
- For N data points $x=(x_1,\cdots,x_N)^T, t=(t_1,\cdots,t_N)^T$, the posterior distribution $p(w|t,x)$ can be computed in the same way.
- That is, it is the following normal distribution.
p(w|t,x)=N(w|m_N,\Sigma_N) \hspace{1.0em}(Equation 10)
where
m_N=\beta \Sigma_N\Phi^{T}t\\
\Sigma^{-1}_N=\alpha I+\beta\Phi^T\Phi
- By computing this posterior distribution from the data points, we have "solved" the Bayesian linear regression.
- Bayesian linear regression is also connected to the idea of regularization.
- Substituting $\Sigma_N$ from Equation 10 above into $m_N$ gives an expression of the same form as the coefficients $w$ of linear regression with regularization.
- It is known that Bayesianizing linear regression and regularizing it lead to the same optimal solution (see the original page for details).
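To see this equivalence numerically, here is a small sketch (my own addition, not from the original page): the posterior mean $m_N$ coincides with the ridge (L2-regularized) least-squares solution with λ = α/β. `PHI` and `t` from the Gaussian-basis cells above are assumed to be in scope, and α, β are set arbitrarily.

```python
import numpy as np

alpha, beta = 0.1, 9.0   # arbitrary hyperparameter values for illustration
lam = alpha / beta       # equivalent ridge regularization strength

M = PHI.shape[1]
# Posterior mean from Equation 10
Sigma_N = np.linalg.inv(alpha * np.identity(M) + beta * PHI.T.dot(PHI))
m_N = beta * Sigma_N.dot(PHI.T).dot(t)

# Ridge (L2-regularized) least squares: w = (lam*I + PHI^T PHI)^{-1} PHI^T t
w_ridge = np.linalg.solve(lam * np.identity(M) + PHI.T.dot(PHI), PHI.T.dot(t))

print(np.allclose(m_N, w_ridge))   # True: the two solutions coincide
```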
Let's compare both Bayesian linear regression (blue line) and ordinary linear regression (green line) on the graph.
```python
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

X = np.array([0.02, 0.12, 0.19, 0.27, 0.42, 0.51, 0.64, 0.84, 0.88, 0.99])
t = np.array([0.05, 0.87, 0.94, 0.92, 0.54, -0.11, -0.78, -0.79, -0.89, -0.04])

# Using a Gaussian basis for the feature function φ
def phi(x):
    s = 0.1
    return np.append(1, np.exp(-(x - np.arange(0, 1 + s, s)) ** 2 / (2 * s * s)))

PHI = np.array([phi(x) for x in X])

# (Same as above) Solving plain linear regression with the Gaussian basis -> overfits
w = np.linalg.solve(np.dot(PHI.T, PHI), np.dot(PHI.T, t))

# Solving Bayesian linear regression with the Gaussian basis -> avoids overfitting
# Writing down Equation 10 above
alpha = 0.1  # tentative value
beta = 9.0   # tentative value
# np.identity(PHI.shape[1]) is the 12-dimensional identity matrix matching the Gaussian basis
Sigma_N = np.linalg.inv(alpha * np.identity(PHI.shape[1]) + beta * np.dot(PHI.T, PHI))
mu_N = beta * np.dot(Sigma_N, np.dot(PHI.T, t))

xlist = np.arange(0, 1, 0.01)
plt.plot(xlist, [np.dot(w, phi(x)) for x in xlist], 'g')     # plot the ordinary linear regression solution
plt.plot(xlist, [np.dot(mu_N, phi(x)) for x in xlist], 'b')  # plot the Bayesian linear regression solution
plt.plot(X, t, 'o')                                          # plot the sample points
plt.show()
```
- Bayesian linear regression (blue) fits better than ordinary linear regression (green) and avoids overfitting.
- In Bayesian linear regression, the value that maximizes the posterior probability is taken as the estimate of $w$; as noted above, this coincides with the result of linear regression with regularization.
- The covariance $\Sigma_N$ expresses the "confidence" of the estimate at the data points.
- The variance of a one-dimensional distribution expresses how scattered the data are.
- The multidimensional covariance, being a matrix, is not as easy to read as a variance.
- The $\Sigma_N$ here is a 12x12 matrix.
```python
# Print the covariance matrix Sigma_N in an easy-to-read format
print("\n".join(' '.join("% .2f" % x for x in y) for y in Sigma_N))
```

```
2.94 -2.03 -0.92 -1.13 -1.28 -1.10 -1.21 -1.14 -1.23 -1.06 -0.96 -2.08
-2.03 2.33 -0.70 1.95 0.13 1.02 0.85 0.65 0.98 0.70 0.65 1.44
-0.92 -0.70 2.52 -1.86 1.97 -0.29 0.42 0.59 0.13 0.40 0.32 0.63
-1.13 1.95 -1.86 3.02 -1.66 1.50 0.17 0.29 0.73 0.33 0.36 0.82
-1.28 0.13 1.97 -1.66 2.82 -1.11 1.39 0.22 0.55 0.49 0.40 0.92
-1.10 1.02 -0.29 1.50 -1.11 2.39 -1.35 1.72 -0.29 0.53 0.46 0.69
-1.21 0.85 0.42 0.17 1.39 -1.35 2.94 -2.06 2.39 -0.02 0.25 1.01
-1.14 0.65 0.59 0.29 0.22 1.72 -2.06 4.05 -2.72 1.43 0.37 0.67
-1.23 0.98 0.13 0.73 0.55 -0.29 2.39 -2.72 3.96 -1.41 1.23 0.59
-1.06 0.70 0.40 0.33 0.49 0.53 -0.02 1.43 -1.41 3.30 -2.27 2.05
-0.96 0.65 0.32 0.36 0.40 0.46 0.25 0.37 1.23 -2.27 3.14 -0.86
-2.08 1.44 0.63 0.82 0.92 0.69 1.01 0.67 0.59 2.05 -0.86 2.45
```
- The covariance matrix turns out to be a symmetric matrix.
- The diagonal components represent the variance of each corresponding parameter viewed on its own, i.e. each diagonal entry is like the variance of a one-dimensional normal distribution.
- In this data, even the smallest diagonal component is 2.33; none of them are particularly small, so with this small amount of data the accuracy of the parameter estimates may not be very high.
- The off-diagonal elements represent the correlation between the corresponding pair of parameters: 0 means no correlation (independent of each other), a positive value means that when one is larger than average the other also tends to be larger than average, and a negative value means the opposite tendency.
- In this data with the Gaussian basis, there are negative correlations between the coefficient of the constant term (the first parameter) and the other parameters, and between adjacent Gaussian bases; Gaussian bases far apart show positive but weak (close to 0) correlations.
- In practice, you rarely bother to inspect the covariance matrix like this when solving a problem.
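To read off the quantities discussed above more directly, here is a small sketch (my own addition, not from the original article) that prints the diagonal (per-parameter variances) and the corresponding correlation matrix; `Sigma_N` from the cell above is assumed to be in scope.

```python
import numpy as np

variances = np.diag(Sigma_N)            # per-parameter variances (diagonal components)
std = np.sqrt(variances)                # per-parameter standard deviations
corr = Sigma_N / np.outer(std, std)     # correlation matrix (off-diagonal structure)

print("variances:", np.round(variances, 2))
print("correlations:")
print(np.round(corr, 2))
```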
- (Reference) Predictive distribution
- See the original page for the derivation of the predictive distribution plotted below.
```python
# Probability density function of the normal distribution
def normal_dist_pdf(x, mean, var):
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Quadratic form (computes x^T A x)
def quad_form(A, x):
    return np.dot(x, np.dot(A, x))

xlist = np.arange(0, 1, 0.01)
tlist = np.arange(-1.5, 1.5, 0.01)
# Predictive density at each (x, t): mean mu_N^T phi(x), variance 1/beta + phi(x)^T Sigma_N phi(x)
z = np.array([normal_dist_pdf(tlist, np.dot(mu_N, phi(x)), 1 / beta + quad_form(Sigma_N, phi(x))) for x in xlist]).T

plt.contourf(xlist, tlist, z, 5, cmap=plt.cm.binary)
plt.plot(xlist, [np.dot(mu_N, phi(x)) for x in xlist], 'r')
plt.plot(X, t, 'go')
plt.show()
```
- The dark regions have high probability density, i.e. they are where the estimated function is most likely to pass.
- The estimate is confident where the data density is high and unconfident where the data points are widely spaced.
- The parameter α appearing in the Bayesian prior distribution $p(w)=N(w|0,\alpha^{-1}I)$ is called a hyperparameter.
- The larger α is, the smaller the prior variance, i.e. the stronger the prior belief that $w$ is close to 0.
- Solving the Bayesian linear regression in that state pulls the estimate of $w$ strongly toward 0, which makes it easy to suppress overfitting, but it may require a lot of data to reach the true solution.
- Conversely, if α is small, the force holding $w$ down becomes weak; in particular, when α is 0 the result matches ordinary linear regression.
- It is therefore common to start with a small value close to 0, for example around 0.1 or 0.01.
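To see the effect described above, one can sweep α and compare the resulting posterior means. A minimal sketch (my own addition, not from the original article), assuming `X`, `t`, `phi`, and `PHI` from the cells above are in scope and keeping β fixed at 9.0; the three α values are arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

beta = 9.0
xlist = np.arange(0, 1, 0.01)

# Larger alpha pulls w toward 0 (stronger regularization); smaller alpha
# approaches ordinary (unregularized) linear regression.
for alpha in [10.0, 0.1, 0.001]:
    S = np.linalg.inv(alpha * np.identity(PHI.shape[1]) + beta * np.dot(PHI.T, PHI))
    m = beta * np.dot(S, np.dot(PHI.T, t))
    plt.plot(xlist, [np.dot(m, phi(x)) for x in xlist], label="alpha=%g" % alpha)

plt.plot(X, t, 'o')
plt.ylim([-1.5, 1.5])
plt.legend()
plt.show()
```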
- Reference: several installments of the "Let's get started with machine learning" serial: http://gihyo.jp/dev/serial/01/machine-learning