I often use logistic regression at work, but whenever a small question comes up I find it hard to get to the information I want, so I would like to summarize logistic regression here as a memo for myself.
Logistic regression seems to be used frequently in the medical field, and of course it is also widely used elsewhere because of its high interpretability, the simplicity of the model, and its good accuracy.
Now consider the problem of predicting which of the two classes $C_0$ and $C_1$ an input vector $x$ is assigned to.
Let $y \in \{0,1\}$ be the objective variable (output) and $x \in \mathbb{R}^d$ the explanatory variable (input). Here $y = 0$ when $x$ is assigned to $C_0$, and $y = 1$ when it is assigned to $C_1$.
```python
import matplotlib.pyplot as plt

# Toy binary data: x is the input, y is the 0/1 class label
x = [1.3, 2.5, 3.1, 4, 5.8, 6, 7.5, 8.4, 9.9, 10, 11.1, 12.2, 13.8, 14.4, 15.6, 16, 17.7, 18.1, 19.5, 20]
y = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
plt.scatter(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
```
I think the simplest approach to this classification problem is linear discrimination.
In a linear model, the output $y$ is linear in the input $x$ ($y(x) = \beta_0 + \beta_1 x$), and $y$ is a real number, so one more step is needed to adapt it to the classification problem.
For example, transform the linear function with a nonlinear function $f(\cdot)$:
y(x) = f(\beta_0 + \beta_1 x)
In this case, for example, the following activation function could be used:
f(z) = \left\{
\begin{array}{ll}
1 & (z \geq 0.5) \\
0 & (z \lt 0.5)
\end{array}
\right.
This time we will make predictions using this activation function. First we train a linear model, estimating its parameters by the least squares method.
As an aside, this book is highly recommended as a starting point for machine learning. It begins with the minimum mathematics needed to start studying machine learning; I do not know many books of this type, and it is quite a good one. The code is also very easy to read. In practice I will probably rely on libraries, but for study purposes I think it is the best book in terms of writing everything from scratch.
The original code is published on the book's support page.
```python
import matplotlib.pyplot as plt
import numpy as np

def reg(x, y):
    # Ordinary least squares for the simple linear model y = b + a * x
    n = len(x)
    a = ((np.dot(x, y) - y.sum() * x.sum() / n) /
         ((x**2).sum() - x.sum()**2 / n))
    b = (y.sum() - a * x.sum()) / n
    return a, b

x = np.array(x)
y = np.array(y)
a, b = reg(x, y)
print('y =', b, '+', a, 'x')

fig = plt.scatter(x, y)
xmax = x.max()
plt.plot([0, xmax], [b, a * xmax + b])
plt.axhline(0.5, ls="--", color="r")               # 0.5 threshold
plt.axhline(0, linewidth=1, ls="--", color="black")
plt.axhline(1, linewidth=1, ls="--", color="black")
plt.xlabel("x")
plt.ylabel("y")
plt.show()
```
The estimated model is the straight line $y = b + ax$ with the coefficients printed above.
Next, applying the activation function defined earlier to this line, the decision boundary appears to be around $x = 10$.
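As a quick check, here is a small sketch (assuming `a`, `b`, `x`, and `y` from the fit above are still in scope) that solves $b + ax = 0.5$ for $x$ and applies the step activation to the linear outputs:

```python
# Where does the fitted line cross the 0.5 threshold?
boundary = (0.5 - b) / a                 # solve b + a * x = 0.5 for x
pred = (a * x + b >= 0.5).astype(int)    # step activation applied to the linear output
print("decision boundary: x =", boundary)
print("predicted classes:", pred)
```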
Now, this method has some problems. The least squares method is equivalent to maximum likelihood estimation when a normal distribution is assumed for the conditional probability distribution.
On the other hand, a binary objective variable like ours is clearly far from normally distributed, which causes various problems. For details, see "Pattern Recognition and Machine Learning" (Bishop), but the main ones are:

- The approximation of the class posterior probabilities is poor.
- The linear model is not flexible enough.
  ⇒ Because of these two, the predicted "probability" can fall outside $[0,1]$.
- Predictions that are "too correct" are penalized.

Therefore, we adopt an appropriate probability model and consider a classification algorithm with better properties than least squares.
Before getting into logistic regression itself, let us think a little about the logistic distribution. Given the input vector $x$, the conditional probability of class $C_1$ is
\begin{eqnarray}
P(y=1|x)&=&\frac{P(x|y=1)P(y=1)}{P(x|y=1)P(y=1)+P(x|y=0)P(y=0)}\\
\\
&=&\frac{1}{1+\frac{P(x|y=0)P(y=0)}{P(x|y=1)P(y=1)}}\\
\\
&=&\frac{1}{1+e^{-\log\frac{P(x|y=1)P(y=1)}{P(x|y=0)P(y=0)}}}
\end{eqnarray}
Setting $a = \log\frac{P(x|y=1)P(y=1)}{P(x|y=0)P(y=0)}$, this becomes
P(y=1|x)=\frac{1}{1+e^{-a}}
This function is the logistic sigmoid (the distribution function of the standard logistic distribution) and will be denoted by $\sigma(a)$.
The form of the distribution function is as follows.
```python
import numpy as np
from matplotlib import pyplot as plt

# σ(a) = 1 / (1 + exp(-a))
a = np.arange(-8., 8., 0.001)
y = 1 / (1 + np.exp(-a))
plt.plot(a, y)
plt.axhline(0, linewidth=1, ls="--", color="black")
plt.axhline(1, linewidth=1, ls="--", color="black")
plt.xlabel("a")
plt.ylabel("σ(a)")
plt.show()
```
You can see that the range lies within $(0,1)$.
The distribution function of the (standard) logistic distribution is
\sigma(x)=\frac{1}{1+e^{-x}}
The probability density function $f(x)$ is obtained by differentiating $\sigma(x)$:
\begin{eqnarray}
f(x)&=&\frac{d}{dx}\frac{1}{1+e^{-x}}\\
\\
&=&\frac{e^{-x}}{(1+e^{-x})^2}
\end{eqnarray}
The shape of the probability density function is as follows.
```python
import numpy as np
from matplotlib import pyplot as plt

# f(x) = e^{-x} / (1 + e^{-x})^2
x = np.arange(-8., 8., 0.001)
y = np.exp(-x) / ((1 + np.exp(-x))**2)
plt.plot(x, y)
plt.xlabel("x")
plt.ylabel("f(x)")
plt.show()
```
Given a distribution, one naturally wants to know its mean and variance, so let us calculate them right away. The moment generating function $M(t)$ is
M(t) = \int_{-\infty}^{\infty}e^{tx}\frac{e^{-x}}{(1+e^{-x})^2}dx
Substituting $y = \frac{1}{1+e^{-x}}$ (so that $e^{tx} = e^{-t\log(\frac{1}{y}-1)}$ and $dy = \frac{e^{-x}}{(1+e^{-x})^2}dx$), we get
\begin{eqnarray}
M(t) &=& \int_{0}^{1}e^{-t\log(\frac{1}{y}-1)}dy\\
\\
&=& \int_{0}^{1}(\frac{1}{y}-1)^{-t}dy\\
\\
&=& \int_{0}^{1}(\frac{1-y}{y})^{-t}dy\\
\\
&=& \int_{0}^{1}(\frac{1}{y})^{-t}(1-y)^{-t}dy\\
\\
&=& \int_{0}^{1}y^t(1-y)^{-t}dy\\
\\
&=& \int_{0}^{1}y^{(t+1)-1}(1-y)^{(-t+1)-1}dy\\
\\
&=& Beta(t+1,1-t)\\
\\
&=& \frac{\Gamma(t+1)\Gamma(1-t)}{\Gamma((t+1)+(1-t))}\\
\\
&=& \frac{\Gamma(t+1)\Gamma(1-t)}{\Gamma(2)}=\Gamma(t+1)\Gamma(1-t)
\end{eqnarray}
(That was exhausting...!)
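As a sanity check (my own addition, assuming `scipy` is available), the sketch below compares a numerical evaluation of $M(t)$ with $\Gamma(t+1)\Gamma(1-t)$ for a few values of $|t| < 1$:

```python
# Numerically verify M(t) = Γ(t+1)Γ(1-t) for the standard logistic distribution
import numpy as np
from scipy import integrate
from scipy.special import gamma

def mgf_numeric(t):
    # E[e^{tX}] with the logistic density f(x) = e^{-x} / (1 + e^{-x})^2
    integrand = lambda x: np.exp(t * x) * np.exp(-x) / (1 + np.exp(-x))**2
    value, _ = integrate.quad(integrand, -50, 50)
    return value

for t in [-0.5, -0.1, 0.3, 0.7]:
    print(t, mgf_numeric(t), gamma(t + 1) * gamma(1 - t))
```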
Furthermore, assuming the order of differentiation and integration can be exchanged (we will not prove this), the first derivative of this moment generating function is
\begin{eqnarray}
\frac{dM(t)}{dt}=\Gamma'(t+1)\Gamma(1-t)-\Gamma(t+1)\Gamma'(1-t)
\end{eqnarray}
And if $t = 0$,
\begin{eqnarray}
M'(0)=\Gamma'(1)\Gamma(1)-\Gamma(1)\Gamma'(1)=0
\end{eqnarray}
That is, $E[X] = M'(0) = 0$. Next we find $E[X^2]$.
\begin{eqnarray}
\frac{d^2M(t)}{dt^2}&=&\Gamma''(t+1)\Gamma(1-t)-\Gamma'(t+1)\Gamma'(1-t)-\Gamma'(t+1)\Gamma'(1-t)+\Gamma(t+1)\Gamma''(1-t)\\
\\
&=& \Gamma''(t+1)\Gamma(1-t)-2\Gamma'(t+1)\Gamma'(1-t)+\Gamma(t+1)\Gamma''(1-t)
\end{eqnarray}
If $t = 0$,
\begin{eqnarray}
M''(0)&=&\Gamma''(1)-2\Gamma'(1)^2+\Gamma''(1)\\
\\
&=& 2\Gamma''(1)-2\Gamma'(1)^2
\end{eqnarray}
Here, let $\psi(x) = \frac{d}{dx}\log\Gamma(x) = \frac{\Gamma'(x)}{\Gamma(x)}$ and differentiate it. Then
\begin{eqnarray}
\frac{d}{dx}\psi(x)=\frac{\Gamma''(x)\Gamma(x)-\Gamma'(x)^2}{\Gamma(x)^2}
\end{eqnarray}
That is, $\psi'(1) = \Gamma''(1) - \Gamma'(1)^2$. Incidentally, it [appears](https://ja.wikipedia.org/wiki/%E3%83%9D%E3%83%AA%E3%82%AC%E3%83%B3%E3%83%9E%E9%96%A2%E6%95%B0)[^1] that $\psi'(1) = \zeta(2)$, so $\psi'(1) = \frac{\pi^2}{6}$.
Therefore $M''(0) = 2 \times \frac{\pi^2}{6}$, and we obtain $E[X^2] = M''(0) = \frac{\pi^2}{3}$. From this,
\begin{eqnarray}
V[X]&=&E[X^2]-E[X]^2\\
\\
&=& \frac{\pi^2}{3} - 0\\
\\
&=& \frac{\pi^2}{3}
\end{eqnarray}
It turns out that the expected value of the logistic distribution is $0$ and its variance is $\frac{\pi^2}{3}$.
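For reassurance, here is a small check (my own addition, assuming `scipy` is available) of the mean, the variance, and the claim $\psi'(1) = \zeta(2) = \pi^2/6$:

```python
import numpy as np
from scipy.stats import logistic
from scipy.special import polygamma

# Mean and variance of the standard logistic distribution
print("mean:", logistic.mean(), " variance:", logistic.var(), " pi^2/3:", np.pi**2 / 3)

# Trigamma at 1: ψ'(1) should equal π²/6
print("psi'(1):", polygamma(1, 1), " pi^2/6:", np.pi**2 / 6)

# The same thing by simulation
samples = logistic.rvs(size=1_000_000, random_state=0)
print("sample mean:", samples.mean(), " sample variance:", samples.var())
```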
By the way, the derivatives of the logarithm of the gamma function are apparently called polygamma functions; in particular, the first derivative is called the digamma function.
(That was tough, and the $\zeta$ function suddenly appearing threw me off, so I cannot really claim to have computed it myself...)
Now, consider $p = \sigma(\beta x)$, where the argument of the sigmoid is given by a linear combination of the explanatory variables. Solving this for $\beta x$:
\begin{eqnarray}
p &=& \frac{1}{1+e^{-\beta x}}\\
\\
(1+e^{-\beta x})p &=& 1\\
\\
p+e^{-\beta x}p &=& 1\\
\\
e^{-\beta x} &=& \frac{1-p}{p}\\
\\
-\beta x &=& \log\frac{1-p}{p}\\
\\
\beta x &=& \log\frac{p}{1-p}\\
\\
\end{eqnarray}
(The equals signs are aligned, but it is still a little hard to read...)
The right-hand side is called log odds in the field of statistics.
In other words, going the other way, if we model the log odds linearly and solve for $p$, we obtain an estimate of the probability of being assigned to each class.
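As a tiny illustration of this correspondence (my own sketch), map probabilities to log odds and back with the sigmoid:

```python
import numpy as np

p = np.array([0.1, 0.25, 0.5, 0.75, 0.9])
log_odds = np.log(p / (1 - p))           # maps (0, 1) onto (-inf, inf)
p_back = 1 / (1 + np.exp(-log_odds))     # the sigmoid recovers the probabilities
print(log_odds)
print(p_back)
```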
Incidentally, for $p \in (0,1)$ the odds satisfy $\frac{p}{1-p} \in (0, \infty)$ and the log odds satisfy $\log\frac{p}{1-p} \in (-\infty, \infty)$, so the range of the log odds matches the range of a linear function.

Now let us estimate logistic regression by maximum likelihood. Stack the objective variables of the $n$ observations into the vector
Y = \left(
\begin{array}{c}
y_1\\
\vdots\\
y_n
\end{array}
\right),\quad y_i \in \{ 0,1 \},(i=1,...n)
As for $X$, I would like to include the constant term among the parameters, but changing the notation is tedious, so I will simply write
X = \left(
\begin{array}{cccc}
1 & x_{11} & \cdots & x_{1d}\\
\vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & \cdots & x_{nd}
\end{array}
\right)
Also, for $i = 1, \dots, n$, let $x_i = (1, x_{i1}, \dots, x_{id})^T$ (that is, $x_i$ is the transpose of the $i$-th row of $X$).
The likelihood function for the parameter vector $\beta = (\beta_0, \beta_1, \dots, \beta_d)$ is
L(\beta) = P(Y | \beta)= \prod_{i=1}^{n} \sigma(\beta x_i)^{y_i}\{1-\sigma(\beta x_i)\}^{1-y_i}
and the error function defined as the negative log-likelihood can be written as
E(\beta)=-\log L(\beta)= -\sum_{i=1}^{n}\{y_i\log \sigma(\beta x_i)+(1-y_i)\log(1-\sigma(\beta x_i))\}
We find the parameter $\beta$ by solving the problem of minimizing this function.
Due to the nonlinearity of $\sigma(\cdot)$, however, the maximum likelihood solution cannot be derived analytically. Still, since $E$ is a convex function, it has a unique minimum, and we find this minimum by Newton's method (a minimal sketch follows after the reading list below).
Newton's method is also called the Newton-Raphson method. Asking Google will tell you plenty about it without me writing a memo, so I will leave the explanation to other sources; in short, it is a way of finding the solution of an equation numerically. Among the books I own, explanations can be found in:

- Essence of Machine Learning (Kato), p. 247
- Pattern Recognition and Machine Learning (Bishop), p. 207
- Basics of Statistical Learning (Hastie), p. 140
- Galois theory (Fujita), p. 74

among others.
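Here is a minimal sketch of Newton's method applied to the toy data from the beginning of this post. This is my own illustration, not the code from any of the books above; it assumes the Hessian stays invertible and simply iterates the Newton update $\beta \leftarrow \beta - H^{-1}\nabla E(\beta)$.

```python
import numpy as np

# Toy data from the beginning of the post
x = np.array([1.3, 2.5, 3.1, 4, 5.8, 6, 7.5, 8.4, 9.9, 10,
              11.1, 12.2, 13.8, 14.4, 15.6, 16, 17.7, 18.1, 19.5, 20])
y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1])

X = np.column_stack([np.ones_like(x), x])   # design matrix with a constant term
beta = np.zeros(X.shape[1])

for _ in range(100):
    p = 1 / (1 + np.exp(-X @ beta))         # σ(Xβ)
    grad = X.T @ (p - y)                    # gradient of E(β)
    W = np.diag(p * (1 - p))                # weights σ(1 - σ)
    H = X.T @ W @ X                         # Hessian of E(β)
    step = np.linalg.solve(H, grad)
    beta = beta - step                      # Newton update
    if np.max(np.abs(step)) < 1e-8:
        break

print("beta:", beta)
print("decision boundary: x =", -beta[0] / beta[1])   # where σ(β0 + β1 x) = 0.5
```

With the weights $\sigma(1-\sigma)$ on the diagonal of $W$, each Newton step amounts to a weighted least squares problem, which is why this procedure is often called iteratively reweighted least squares (IRLS).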
[^1]: If you search for this you will find various sources, but since many of them are direct links to PDFs such as lecture notes, I have given the Wikipedia link instead.