When I tried to write about logistic regression, I ended up finding the mean and variance of the logistic distribution.

I often use logistic regression at work, but whenever a small question comes up I find it surprisingly hard to track down the information I want, so I am summarizing logistic regression here as a memo for myself.

Logistic regression seems to be used especially often in the medical field. Of course, it is widely used in other fields as well, thanks to its high interpretability, the simplicity of the model, and its solid accuracy.

Now consider the problem of predicting which of the two classes $C_0$ and $C_1$ an input vector $x$ is assigned to.

Let $y \in \{0,1\}$ be the objective variable (output) and $x \in \mathbb{R}^d$ the explanatory variable (input). Here $y = 0$ when $x$ is assigned to $C_0$, and $y = 1$ when it is assigned to $C_1$.


## Linear discrimination

I have prepared 20 artificially generated data points (I just made them up arbitrarily). This time I will use Python.

```python
import matplotlib.pyplot as plt

x = [1.3, 2.5, 3.1, 4, 5.8, 6, 7.5, 8.4, 9.9, 10, 11.1, 12.2, 13.8, 14.4, 15.6, 16, 17.7, 18.1, 19.5, 20]
y = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

plt.scatter(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
```

(Figure: log_scatter.png — scatter plot of the toy data)

I think the simplest approach to the classification problem is linear discrimination.

In the linear model, the output $y$ is linear in the input $x$ ($y(x) = \beta_0 + \beta_1 x$), and $y$ is a real number. One more step is needed to adapt this to the classification problem.

For example, we can transform the linear function with a nonlinear function $f(\cdot)$:

y(x) = f(\beta_0 + \beta_1 x)

For example, in this case, the following activation function can be considered.

f(z) = \left\{
\begin{array}{ll}
1 & (z \geq 0.5) \\
0 & (z \lt 0.5)
\end{array}
\right.
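As a small illustration (my own sketch, not code from the original article; the function name `step` is arbitrary), this activation can be written in NumPy as:

```python
import numpy as np

# Threshold the real-valued linear output at 0.5, as in the definition above.
def step(z):
    return np.where(z >= 0.5, 1, 0)

print(step(np.array([-0.2, 0.49, 0.5, 1.3])))  # -> [0 0 1 1]
```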

Well, this time we will make a prediction using this activation function. First, we will train a linear model. We will use the least squares method to estimate the parameters.

As an aside, this book is highly recommended as a starting point for machine learning. It begins with the minimum mathematics needed to start studying the subject; I don't know of many books like it, and it is quite good. The code is also very easy to read. In practice you will probably use a library, but for learning by writing everything from scratch I think it is the best.

The original code is published on the Support Page.

```python
import matplotlib.pyplot as plt
import numpy as np


def reg(x, y):
    """Simple least squares fit: returns slope a and intercept b of y = b + a*x."""
    n = len(x)
    a = ((np.dot(x, y) - y.sum() * x.sum() / n) /
        ((x**2).sum() - x.sum()**2 / n))
    b = (y.sum() - a * x.sum()) / n
    return a, b


x = np.array(x)  # reuse the lists defined above
y = np.array(y)
a, b = reg(x, y)

print('y =', b, '+', a, 'x')

plt.scatter(x, y)
xmax = x.max()
plt.plot([0, xmax], [b, a * xmax + b])              # fitted line
plt.axhline(0.5, ls="--", color="r")                # 0.5 threshold
plt.axhline(0, linewidth=1, ls="--", color="black")
plt.axhline(1, linewidth=1, ls="--", color="black")
plt.xlabel("x")
plt.ylabel("y")
plt.show()
```

The estimated model is

\hat{y} = -0.206 + 0.07x


(Figure: log_reg.png — least squares fit with the 0.5 threshold line)

Next, applying the activation function defined earlier, the decision boundary appears to be at about $x = 10$.
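Concretely, setting the fitted value equal to the 0.5 threshold with the rounded coefficients above gives

0.5 = -0.206 + 0.07x \quad\Longrightarrow\quad x = \frac{0.5 + 0.206}{0.07} \approx 10.1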

Now, this method has some problems. The least squares method is equivalent to maximum likelihood when a normal distribution is assumed for the conditional probability distribution of the output.

A binary objective variable like this one, on the other hand, is clearly far from normally distributed, and that causes various problems. For details, please refer to "Pattern Recognition and Machine Learning" (Bishop), but the main issues are:

- The approximation accuracy of the class posterior probabilities is poor.
- The flexibility of the linear model is low.
  ⇒ Because of these two, the predicted "probability" can take values outside $[0,1]$.
- Predictions that are "too correct" (points lying far on the correct side of the decision boundary) are penalized.

Therefore, we adopt an appropriate probability model and consider a classification algorithm with better properties than least squares.

## Logistic distribution

Before we get into logistic regression, let's take a closer look at the logistic distribution. Given the input vector $x$, the conditional probability of class $C_1$ is

\begin{eqnarray}
P(y=1|x)&=&\frac{P(x|y=1)P(y=1)}{P(x|y=1)P(y=1)+P(x|y=0)P(y=0)}\\
\\
&=&\frac{1}{1+\frac{P(x|y=0)P(y=0)}{P(x|y=1)P(y=1)}}\\
\\
&=&\frac{1}{1+e^{-\log\frac{P(x|y=1)P(y=1)}{P(x|y=0)P(y=0)}}}\\
\end{eqnarray}

If we write $\log\frac{P(x|y=1)P(y=1)}{P(x|y=0)P(y=0)}=a$, then

P(y=1|x)=\frac{1}{1+e^{-a}}\\

This is the distribution function of the logistic distribution (the logistic sigmoid function), and we will write it as $\sigma(a)$.

The form of the distribution function is as follows.

```python
import numpy as np
from matplotlib import pyplot as plt

a = np.arange(-8., 8., 0.001)
y = 1 / (1 + np.exp(-a))

plt.plot(a, y)
plt.axhline(0, linewidth=1, ls="--", color="black")
plt.axhline(1, linewidth=1, ls="--", color="black")
plt.xlabel("a")
plt.ylabel("σ(a)")
plt.show()
```

(Figure: logit_P.png — the distribution function σ(a))

You can see that the range is within $ (0,1) $.



## Mean and variance of the logistic distribution

As mentioned above, the distribution function of the logistic distribution is

\sigma(x)=\frac{1}{1+e^{-x}}

The probability density function $f(x)$ is obtained by differentiating $\sigma(x)$:

\begin{eqnarray}
f(x)&=&\frac{d}{dx}\frac{1}{1+e^{-x}}\\
\\
&=&\frac{e^{-x}}{(1+e^{-x})^2}
\end{eqnarray}

The shape of the probability density function is as follows.

```python
import numpy as np
from matplotlib import pyplot as plt

x = np.arange(-8., 8., 0.001)
y = np.exp(-x) / ((1 + np.exp(-x))**2)

plt.plot(x, y)
plt.xlabel("x")
plt.ylabel("f(x)")
plt.show()
```

(Figure: logit_f.png — the probability density function f(x))
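Incidentally (this is not used in the derivation of the original article, but it is a handy standard identity), the density can be written in terms of $\sigma$ itself:

\begin{eqnarray}
f(x) &=& \frac{e^{-x}}{(1+e^{-x})^2} = \frac{1}{1+e^{-x}}\cdot\frac{e^{-x}}{1+e^{-x}} = \sigma(x)\{1-\sigma(x)\}
\end{eqnarray}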

Given a distribution, one naturally wants to know its mean and variance, so let's calculate them right away. The moment generating function $M(t)$ is

M(t) = \int_{-\infty}^{\infty}e^{tx}\frac{e^{-x}}{(1+e^{-x})^2}dx

Substituting $y = \frac{1}{1+e^{-x}}$ (so that $dy = f(x)\,dx$ and $x = -\log(\frac{1}{y}-1)$), we get

\begin{eqnarray}
M(t) &=& \int_{0}^{1}e^{-t\log(\frac{1}{y}-1)}dy\\
\\
&=& \int_{0}^{1}(\frac{1}{y}-1)^{-t}dy\\
\\
&=& \int_{0}^{1}(\frac{1-y}{y})^{-t}dy\\
\\
&=& \int_{0}^{1}(\frac{1}{y})^{-t}(1-y)^{-t}dy\\
\\
&=& \int_{0}^{1}y^t(1-y)^{-t}dy\\
\\
&=& \int_{0}^{1}y^{(t+1)-1}(1-y)^{(-t+1)-1}dy\\
\\
&=& Beta(t+1,1-t)\\
\\
&=& \frac{\Gamma(t+1)\Gamma(1-t)}{\Gamma((t+1)+(1-t))}\\
\\
&=& \frac{\Gamma(t+1)\Gamma(1-t)}{\Gamma(2)}=\Gamma(t+1)\Gamma(1-t)
\end{eqnarray}

(That was rough...!)
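As a quick numerical sanity check (my own addition, not part of the original derivation; it assumes SciPy is available and $|t| < 1$ so that the beta integral converges), we can compare the closed form $\Gamma(t+1)\Gamma(1-t)$ against direct numerical integration:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

def mgf_numeric(t):
    # numerically integrate e^{tx} * f(x) over a wide slice of the real line
    integrand = lambda x: np.exp(t * x) * np.exp(-x) / (1 + np.exp(-x)) ** 2
    value, _ = quad(integrand, -50, 50)
    return value

for t in [-0.5, -0.1, 0.3, 0.7]:
    print(t, mgf_numeric(t), gamma(1 + t) * gamma(1 - t))  # the two columns should agree
```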

Furthermore, assuming that the order of differentiation and integration may be exchanged (we accept this here without proof), the first derivative of this moment generating function is

\begin{eqnarray}
\frac{dM(t)}{dt}=\Gamma'(t+1)\Gamma(1-t)-\Gamma(t+1)\Gamma'(1-t)
\end{eqnarray}

And if $ t = 0 $,

\begin{eqnarray}
M'(0)=\Gamma'(1)\Gamma(1)-\Gamma(1)\Gamma'(1)=0
\end{eqnarray}

That is, $E[X] = M'(0) = 0$. Next, we find $E[X^2]$.

\begin{eqnarray}
\frac{d^2M(t)}{dt^2}&=&\Gamma''(t+1)\Gamma(1-t)-\Gamma'(t+1)\Gamma'(1-t)-\Gamma'(t+1)\Gamma'(1-t)+\Gamma(t+1)\Gamma''(1-t)\\
\\
&=& \Gamma''(t+1)\Gamma(1-t)-2\Gamma'(t+1)\Gamma'(1-t)+\Gamma(t+1)\Gamma''(1-t)
\end{eqnarray}

If $ t = 0 $,

\begin{eqnarray}
M''(0)&=&\Gamma''(1)-2\Gamma'(1)^2+\Gamma''(1)\\
\\
&=& 2\Gamma''(1)-2\Gamma'(1)^2
\end{eqnarray}

Here, define $\psi(x) = \frac{d}{dx}\log\Gamma(x) = \frac{\Gamma'(x)}{\Gamma(x)}$ and differentiate it. Then

\begin{eqnarray}
\frac{d}{dx}\psi(x)=\frac{\Gamma''(x)\Gamma(x)-\Gamma'(x)^2}{\Gamma(x)^2}
\end{eqnarray}

That is, $\psi'(1) = \Gamma''(1)-\Gamma'(1)^2$ (using $\Gamma(1)=1$). It [seems](https://ja.wikipedia.org/wiki/%E3%83%9D%E3%83%AA%E3%82%AC%E3%83%B3%E3%83%9E%E9%96%A2%E6%95%B0)[^1] that $\psi'(1) = \zeta(2)$, so it comes out as $\psi'(1) = \frac{\pi^2}{6}$.

Therefore $M''(0) = 2 \times \frac{\pi^2}{6}$, and we obtain $E[X^2] = M''(0) = \frac{\pi^2}{3}$. From this,

\begin{eqnarray}
V[X]&=&E[X^2]-E[X]^2\\
\\
&=& \frac{\pi^2}{3} - 0\\
\\
&=& \frac{\pi^2}{3}
\end{eqnarray}

It turns out that the expected value of the logistic distribution is $0$ and the variance is $\frac{\pi^2}{3}$.

By the way, the derivatives of the logarithm of the gamma function are apparently called polygamma functions; in particular, the first derivative is called the digamma function (and the second, which appeared above, the trigamma function).

(That was tough, and the $\zeta$ function suddenly appeared out of nowhere, so I can't honestly say I computed all of this myself...)
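As a rough check of the result (my own addition, using NumPy's built-in sampler for the standard logistic distribution):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.logistic(loc=0.0, scale=1.0, size=1_000_000)

print(samples.mean())      # should be close to 0
print(samples.var())       # should be close to pi^2 / 3 ~= 3.29
print(np.pi ** 2 / 3)
```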



## Logistic regression

Now, consider $p = \sigma(\beta x)$, i.e. the case where the argument of the logistic sigmoid is a linear combination of the inputs. Solving this for $\beta x$:

\begin{eqnarray}
p &=& \frac{1}{1+e^{-\beta x}}\\
\\
(1+e^{-\beta x})p &=& 1\\
\\
p+e^{-\beta x}p &=& 1\\
\\
e^{-\beta x} &=& \frac{1-p}{p}\\
\\
-\beta x &=& \log\frac{1-p}{p}\\
\\
\beta x &=& \log\frac{p}{1-p}\\
\\
\end{eqnarray}

(The equals signs are aligned, but it is still a little hard to read...)

The right-hand side is called log odds in the field of statistics.

Conversely, this means that if we regress the log odds linearly and then solve for $p$, we obtain an estimate of the probability of being assigned to each class.

By the way, for $p \in (0,1)$ the odds are $\frac{p}{1-p} \in (0,\infty)$ and the log odds are $\log\frac{p}{1-p} \in (-\infty,\infty)$, so we can also see that the range of the log odds matches the range of a linear function.
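A tiny sketch of this relationship (my own code; the function names are arbitrary and not from any particular library):

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def log_odds(p):
    return np.log(p / (1 - p))

p = np.array([0.1, 0.5, 0.9])
print(log_odds(p))           # real-valued log odds: [-2.197..., 0.0, 2.197...]
print(sigmoid(log_odds(p)))  # recovers [0.1, 0.5, 0.9]
```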



## Logistic regression parameter estimation

So that we don't lose track of what is what, let's first sort out the letters and symbols. The data set is $D = \{X, Y\}$, with
Y = \left(
\begin{array}{c}
y_1\\
\vdots\\
y_n
\end{array}
\right),\quad y_i \in \{ 0,1 \},\quad (i=1,\ldots,n)

For $X$, I want the constant term to be absorbed into the parameter vector, and since changing notation midway is a pain, I include a column of ones in $X$ from the start:

X = \left(
\begin{array}{cccc}
1 & x_{11} & \cdots & x_{1d}\\
\vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & \cdots & x_{nd}
\end{array}
\right)

Also, for $i = 1,\ldots,n$, let $x_i = (1, x_{i1},\ldots,x_{id})^T$ (that is, $x_i$ is the transpose of the $i$-th row of $X$).

The likelihood function for the parameter vector $\beta = (\beta_0, \beta_1, \ldots, \beta_d)$ is

L(\beta) = P(Y | \beta)= \prod_{i=1}^{n} \sigma(\beta x_i)^{y_i}\{1-\sigma(\beta x_i)\}^{1-y_i}

and the error function, defined as the negative log-likelihood, is

E(\beta)=-\log L(\beta)= -\sum_{i=1}^{n}\{y_i\log \sigma(\beta x_i)+(1-y_i)\log(1-\sigma(\beta x_i))\}

We estimate the parameter $\beta$ by solving the minimization problem for this $E(\beta)$.

However, because of the non-linearity of $\sigma$, the maximum likelihood solution cannot be obtained analytically.

Fortunately, $E$ is a convex function, so it has a unique minimum. We find this minimum with Newton's method.
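For reference, differentiating $E(\beta)$ (using the identity $\sigma'(a) = \sigma(a)\{1-\sigma(a)\}$) gives the gradient and Hessian that Newton's method needs:

\begin{eqnarray}
\nabla E(\beta) &=& \sum_{i=1}^{n}\{\sigma(\beta x_i)-y_i\}x_i\\
\\
\nabla\nabla E(\beta) &=& \sum_{i=1}^{n}\sigma(\beta x_i)\{1-\sigma(\beta x_i)\}x_i x_i^T = X^TRX
\end{eqnarray}

where $R$ is the diagonal matrix with $R_{ii} = \sigma(\beta x_i)\{1-\sigma(\beta x_i)\} > 0$. Since $X^TRX$ is positive semi-definite, this is also why $E$ is convex.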


## Newton's method

Newton's method is also called the Newton–Raphson method. Asking Google will turn up plenty of explanations, so I will not write a detailed memo here; in short, it is a way of finding the solution of an equation by numerical computation (a rough sketch applied to our problem follows the book list below). Among the books I own, there are explanations in:

- The Essence of Machine Learning (Kato), p. 247
- Pattern Recognition and Machine Learning (Bishop), p. 207
- Basics of Statistical Learning (Hastie), p. 140
- Galois Theory That Can Be Solved (Fujita), p. 74

among others.
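Below is a minimal sketch of Newton's method (in IRLS form) for the error function $E(\beta)$ above, applied to the toy data from the beginning. This is my own code rather than the code from any of the books listed; the function names are arbitrary, and the small `ridge` term is only there to keep the Hessian numerically invertible.

```python
import numpy as np


def sigmoid(a):
    return 1 / (1 + np.exp(-a))


def fit_logistic_newton(X, y, n_iter=20, ridge=1e-6):
    """Minimize E(beta) = -log L(beta) by Newton's method (IRLS).

    X : (n, d+1) design matrix whose first column is all ones.
    y : (n,) vector of 0/1 labels.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)                 # predicted probabilities sigma(beta x_i)
        grad = X.T @ (p - y)                  # gradient of E
        R = p * (1 - p)                       # diagonal of the weight matrix
        H = X.T @ (X * R[:, None]) + ridge * np.eye(X.shape[1])  # Hessian X^T R X
        beta -= np.linalg.solve(H, grad)      # Newton update
    return beta


# toy data from the beginning of the article
x = np.array([1.3, 2.5, 3.1, 4, 5.8, 6, 7.5, 8.4, 9.9, 10,
              11.1, 12.2, 13.8, 14.4, 15.6, 16, 17.7, 18.1, 19.5, 20])
y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1])

X = np.column_stack([np.ones_like(x), x])
beta = fit_logistic_newton(X, y)
print(beta)                 # (intercept, slope)
print(-beta[0] / beta[1])   # x where the predicted probability is 0.5
```

The last print gives the point where the predicted probability crosses 0.5, i.e. the decision boundary of the fitted logistic model.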



## I'm exhausted

I didn't think it would be so hard just to calculate the mean and variance of the logistic distribution. I'm exhausted.

## ★ References ★

[1] Kato: The Essence of Machine Learning (2018)
[2] Hastie, Tibshirani, Friedman: Basics of Statistical Learning (2014)
[3] Bishop: Pattern Recognition and Machine Learning (2006)
[4] Fujita: Galois Theory That Can Be Solved (2013)

[^1]: Searching turns up various sources, but many of them are direct links to PDFs such as lecture notes, so I am linking to Wikipedia instead.
