Chapter 6 Supervised Learning: Classification, pg. 212 onward [Learn by moving with Python! A new machine learning textbook]

https://www.amazon.co.jp/Python%E3%81%A7%E5%8B%95%E3%81%8B%E3%81%97%E3%81%A6%E5%AD%A6%E3%81%B6%EF%BC%81-%E3%81%82%E3%81%9F%E3%82%89%E3%81%97%E3%81%84%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92%E3%81%AE%E6%95%99%E7%A7%91%E6%9B%B8-%E4%BC%8A%E8%97%A4-%E7%9C%9F/dp/4798144983

Knowledge involved

- Maximum likelihood estimation
- Logistic regression
- Cross-entropy error
- Gradient method
- Two-dimensional input, 2-class classification
- Logistic regression (two-dimensional input version)
- Two-dimensional input, 3-class classification

What these methods are doing, and why these calculations are used for classification.

Maximum likelihood estimation (pg212)

First, maximum likelihood estimation. From an observed sequence of events (here, t = 0 on the 1st through 3rd trials and t = 1 on the 4th), we estimate the most plausible probability with which each event occurs.

$P(t=1|x)=w$: an expression saying that, for an input x, the probability that t = 1 is w.

Since T = [0,0,0,1], thinking about it the ordinary way, the answer is 1/4! Maximum likelihood estimation is the calculation that actually derives this.

Let's find the likelihood when w = 0.1.

Since $w = P(t=1|x) = 0.1$, the probability (likelihood) of the observations is $0.9 \times 0.9 \times 0.9 \times 0.1 = 0.0729$ (each factor is the probability of the corresponding entry of [0,0,0,1] appearing when w = 0.1).

The reason for multiplying these four: we want the probability of observing T = [0,0,0,1], which can be expressed as (probability that the 1st t = 0: 0.9) × (probability that the 2nd t = 0: 0.9) × (probability that the 3rd t = 0: 0.9) × (probability that the 4th t = 1: 0.1) = (probability that T = [0,0,0,1]). Maximum likelihood estimation is the calculation that checks how plausible the chosen w (here w = 0.1, so 1 - w = 0.9) is.

Therefore, the likelihood when w = 0.1 is 0.0729.

What if w = 0.2? Then $0.8 \times 0.8 \times 0.8 \times 0.2 = 0.1024$, so the likelihood when w = 0.2 is 0.1024. Which has the higher likelihood? The latter, of course, so w = 0.2 is a more plausible value for the probability of t = 1 than w = 0.1.

The likelihood finally becomes highest at w = 0.25 ($1/4$).
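As a numerical check, here is a minimal sketch (my own code, not the book's) that computes the likelihood of T = [0,0,0,1] for a few candidate values of w:

```python
def likelihood(w):
    """Probability of observing T = [0, 0, 0, 1] when P(t=1|x) = w."""
    return (1 - w) ** 3 * w

for w in [0.1, 0.2, 0.25, 0.3]:
    print(f"w = {w:.2f} -> likelihood = {likelihood(w):.4f}")
# w = 0.25 gives the largest value, matching the analytic result below.
```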

**To analytically find which probability is most plausible, consider which w is best over the range 0 to 1.** To do so, generalize the formula so that any value can be plugged in; that is, it can be expressed as $P(T=[0,0,0,1]|x)=(1-w)^3w$.

Plotting this over the range of w from 0 to 1 gives a mountain-shaped curve (P214), and the w at the peak can be said to be the most plausible one.

Let's find the maximum value!

Take the logarithm of $P = (1-w)^3 w$ to make the calculation easier, and find the maximum value (the w that gives the maximum does not change when the logarithm is taken). The logarithmic version is ↓ $\log P=\log\{(1-w)^3w\}=3\log(1-w)+\log w$ (the usual log rules: the power comes down and the product becomes addition).

Differentiate to find the point where the slope is 0:

$\frac{\partial}{\partial w}\log P = \frac{\partial}{\partial w}[3\log(1-w)+\log w] = -\frac{3}{1-w}+\frac{1}{w} = 0$, which gives $1-w=3w$, and therefore $w=\frac{1}{4}$.

So what this is saying is that $w=\frac{1}{4}$ is best: it is the model parameter under which T = [0,0,0,1] is most likely to be generated.

The maximum likelihood estimate is w = 1/4.

**Maximum likelihood estimation, elementary version, finished!**

Logistic regression model (P216)

- By using a logistic regression model, the output for an input x is compressed into the range 0 to 1, so it is easy to use as a classification probability.

By passing the linear model through the sigmoid function, the probability that t = 1 for an input x can be expressed as follows. **Logistic regression**: starting from the linear model $y=w_0x+w_1$, $y=\sigma(w_0x+w_1)=\frac{1}{1+\exp(-(w_0x+w_1))}$
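A minimal sketch of this model in Python (the function and variable names are my own, not the book's):

```python
import numpy as np

def logistic(x, w0, w1):
    """Logistic regression model: squash the linear model w0*x + w1
    into (0, 1) with the sigmoid function."""
    return 1.0 / (1.0 + np.exp(-(w0 * x + w1)))

# Example: probability that t = 1 for a few inputs, with arbitrary weights.
x = np.array([-2.0, 0.0, 2.0])
print(logistic(x, w0=1.0, w1=0.5))  # values strictly between 0 and 1
```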

(Average) Cross entropy error

・The output of logistic regression is regarded as a probability (the w value from before). ・Therefore, from the input values x, we want to find good weights w. ・Next, using the obtained w, we put x values into the created model to obtain predicted values.

y=\sigma(w_0x+w_1)=P(t=1|x)

At this time, y represents a probability. The sigmoid function may seem to appear suddenly, but it is what puts the probability w we have handled so far into y: by using the sigmoid function, the value y obtained from the input x through the linear model ($y=w_0x+w_1$) can be expressed as a probability.

⭐️ **From the above, $y=w=\sigma(w_0x+w_1)=P(t=1|x)$.**

(Important) On P218, maximum likelihood estimation is performed so that the parameters $w_0$ and $w_1$ of this model fit the insect data: "assuming the insect data was generated from this model, we find the most plausible (probabilistically highest) parameters."

In the earlier discussion we considered T = [0,0,0,1]; this time we generalize so that it applies to any case.

On P218 there is the description: "If there is only one data point, and t = 1 for a certain body weight x, then the probability that t = 1 is generated from the model is the output value y of the logistic regression model itself." What does this mean?

The figure on the previous page of the book shows why. **Looking at it, the logistic regression model can be treated as a probability because the linear model (a linear expression that obtains the value y from the input value x) is squashed into the range 0 to 1 by the sigmoid function.**

So here we treat y = (probability that t = 1) and 1 - y = (probability that t = 0).

The model that generalizes this is $P(t|x)=y^t(1-y)^{1-t}$
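A one-line check of this formula (my own illustration): plugging in t = 1 leaves $y^1(1-y)^0 = y$, and t = 0 leaves $1-y$.

```python
def p_t_given_x(t, y):
    """Generalized Bernoulli model: P(t|x) = y^t * (1-y)^(1-t)."""
    return y ** t * (1 - y) ** (1 - t)

y = 0.8                    # model's probability that t = 1
print(p_t_given_x(1, y))   # 0.8   -> reduces to y
print(p_t_given_x(0, y))   # ~0.2  -> reduces to 1 - y
```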


So now let's consider the case where the number of data points is N. Where the previous section used T = [0,0,0,1], we now take $X = [x_0, x_1, \ldots, x_{N-1}]$ and $T = [t_0, t_1, \ldots, t_{N-1}]$ (a pattern like T = [0,0,0,1,1,1,0,1, ... N entries]).

As in the previous section, where for T = [0,0,0,1] we computed something like $0.8 \times 0.8 \times 0.8 \times 0.2 = 0.1024$, the generalized formula should also be computed by maximum likelihood estimation, so all the factors from $x_0$ to $x_{N-1}$ are multiplied together.

This is what P219 means by "since the probability of generating each data point can be multiplied across all the data, we get equation (6-15). This is the likelihood."

So equation (6-15) is as follows (all the factors multiplied together): $P(T|X)=\prod_{n=0}^{N-1}P(t_n|x_n)=\prod_{n=0}^{N-1}y_n^{t_n}(1-y_n)^{1-t_n}$ **This represents the likelihood (the mountain-shaped graph on page 214).**

If you take the logarithm as before and simplify the formula, the log-likelihood becomes ↓ $\log P(T|X)=\sum_{n=0}^{N-1}\left[t_n\log y_n+(1-t_n)\log(1-y_n)\right]\hspace{70pt}(6-16)$

**As before, we pick the value that maximizes this probability (it is maximum likelihood estimation, so $w_0$ and $w_1$ are what we solve for).**

Multiplying this by -1 (flipping the graph upside down) and dividing by N gives the "mean cross-entropy error" $E(\mathbf{w})$:
E(\mathbf{w})=-\frac{1}{N}\log P(T|X)=-\frac{1}{N}\sum_{n=0}^{N-1}\left[t_n\log y_n+(1-t_n)\log(1-y_n)\right]\hspace{70pt}(6-17)
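Here is a minimal sketch of equation (6-17) as a function (my own code; the toy data is made up for illustration):

```python
import numpy as np

def mean_cross_entropy_error(w0, w1, x, t):
    """Mean cross-entropy error E(w) for 1D logistic regression (eq. 6-17)."""
    y = 1.0 / (1.0 + np.exp(-(w0 * x + w1)))  # model output = P(t=1|x)
    return -np.mean(t * np.log(y) + (1 - t) * np.log(1 - y))

x = np.array([-1.0, -0.5, 0.5, 1.0])  # toy inputs (e.g., body weights)
t = np.array([0, 0, 1, 1])            # toy labels
print(mean_cross_entropy_error(1.0, 0.0, x, t))
```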

**After that, compute: the $w_0, w_1$ at the minimum of this error are the appropriate weights, so for the analysis start from, say, $w_0 = 1, w_1 = 1$, check the slope by partial differentiation, and keep moving in the direction where the error gets smaller (the gradient method).**
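A minimal gradient-method sketch, using the standard gradients for this model, $\partial E/\partial w_0 = \frac{1}{N}\sum_n (y_n - t_n)x_n$ and $\partial E/\partial w_1 = \frac{1}{N}\sum_n (y_n - t_n)$ (the data, learning rate, and step count are my own, not the book's):

```python
import numpy as np

def fit_logistic_1d(x, t, lr=0.5, steps=2000):
    """Minimize the mean cross-entropy error by plain gradient descent."""
    w0, w1 = 1.0, 1.0                   # starting point, as in the text
    for _ in range(steps):
        y = 1.0 / (1.0 + np.exp(-(w0 * x + w1)))
        grad_w0 = np.mean((y - t) * x)  # dE/dw0
        grad_w1 = np.mean(y - t)        # dE/dw1
        w0 -= lr * grad_w0              # step downhill along the slope
        w1 -= lr * grad_w1
    return w0, w1

x = np.array([-1.0, -0.5, 0.5, 1.0])    # toy inputs
t = np.array([0, 0, 1, 1])              # toy labels
print(fit_logistic_1d(x, t))
```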

2D input 2 class classification (P228)

In the insect example, there are inputs of weight ($x_0$) and body length ($x_1$) (two-dimensional input), and the task is to distinguish males from females (two classes).

1-of-K coding

How to express which class a data point is classified into: the label t becomes a vector with a 1 at the index of its class and 0 everywhere else (for example, class 1 of 3 is t = [0, 1, 0]).
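A quick sketch of 1-of-K coding (my own illustration):

```python
import numpy as np

def one_of_k(label, num_classes):
    """1-of-K coding: a vector with 1 at the class index, 0 elsewhere."""
    t = np.zeros(num_classes, dtype=int)
    t[label] = 1
    return t

print(one_of_k(1, 3))  # class 1 of 3 -> [0 1 0]
```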

Logistic regression model (2D input ver.)

Up to the previous section we considered one-dimensional input, so this time let's think about two-dimensional input.

Consider $y=\sigma(a)$ on P232. Up to now, this a was the one-dimensional-input expression $a=w_0x+w_1$; now consider a two-dimensional input. In that case, the expression for a changes from $a=w_0x+w_1$ to $a=w_0x_0+w_1x_1+w_2$.

As before, the probability y of t = 1 or t = 0 is obtained by passing this through the sigmoid function. (This time, the probability when t = 0 is y, and the probability when t = 1 is 1 - y.)

Therefore, the model becomes $P(t=0|x)=\sigma(a)=y$ and $P(t=1|x)=1-\sigma(a)=1-y$.

Since the generalized model is $P(t|x)=y^t(1-y)^{1-t}$ as before, the loss function, the mean cross-entropy error $E(\mathbf{w})=-\frac{1}{N}\log P(T|X)=-\frac{1}{N}\sum_{n=0}^{N-1}\left[t_n\log y_n+(1-t_n)\log(1-y_n)\right]$, can be used as it is.

**If the minimum of this cross-entropy error is found by the gradient method, appropriate values $w_0, w_1, w_2$ are obtained.**

Finally, plugging the calculated $w_0, w_1, w_2$ into $y=\sigma(a)$ (with $a=w_0x_0+w_1x_1+w_2$) gives a classification model with small error.
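A minimal sketch of the whole 2D procedure (the toy data, learning rate, and step count are my own; for simplicity I use the convention y = P(t=1|x) from the 1D section, so the gradient of the mean cross-entropy error is $\frac{1}{N}\sum_n (y_n - t_n)\,[x_{n0}, x_{n1}, 1]$):

```python
import numpy as np

def fit_logistic_2d(X, t, lr=0.5, steps=3000):
    """2D logistic regression by gradient descent.
    X: (N, 2) inputs; t: (N,) labels in {0, 1}; y is treated as P(t=1|x)."""
    w = np.array([1.0, 1.0, 1.0])              # [w0, w1, w2], w2 is the bias
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a constant input of 1
    for _ in range(steps):
        y = 1.0 / (1.0 + np.exp(-(Xb @ w)))    # sigma(w0*x0 + w1*x1 + w2)
        w -= lr * Xb.T @ (y - t) / len(t)      # gradient of the mean CEE
    return w

# Toy insect-like data (made up): [weight, body length] and class labels.
X = np.array([[0.5, 1.0], [0.8, 1.2], [1.5, 2.0], [1.8, 2.2]])
t = np.array([0, 0, 1, 1])
print(fit_logistic_2d(X, t))
```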

The graph of the resulting classification model is the one on page 238, and the boundary line drawn there is the line that separates the two classes.

2D input 3 class classification

From here, consider the case where the input stays two-dimensional but there are three classes to classify. The model can be extended to 3 or more classes by introducing the softmax function in place of the sigmoid function at the output.

For example, in the case of a three-class classification problem, consider the total input $ a_k (k = 0,1,2) $ corresponding to the three classes. $ a_k=w_{k0}x_0+w_{k1}x_1+w_{k2}\hspace{20pt}(k=0,1,2)\hspace{40pt}(6-40) $

Currently there are two inputs, $X=[x_0,x_1]$, but we add a dummy input $x_2$ that always takes the value 1. Then $a_k=w_{k0}x_0+w_{k1}x_1+w_{k2}x_2=\sum_{i=0}^{D}w_{ki}x_{i}\hspace{10pt}(k=0,1,2)$.
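The dummy input is just a constant column of ones appended to the data, which lets the bias $w_{k2}$ be handled by the same sum (a sketch with made-up numbers):

```python
import numpy as np

X = np.array([[0.5, 1.0],
              [1.5, 2.0]])                  # two inputs x0, x1 per data point
Xb = np.hstack([X, np.ones((len(X), 1))])   # dummy input x2 = 1
W = np.ones((3, 3))                         # weights w_ki for k = 0, 1, 2
A = Xb @ W.T                                # a_k = sum_i w_ki * x_i, per row
print(A.shape)                              # (2, 3): one a_k triple per point
```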

So, as before, put $a_k$ into the softmax function, following the same procedure used for the sigmoid function. The softmax function is $y_k=\frac{\exp(x_k)}{\sum_{j=0}^{K-1}\exp(x_j)}$. Here x is replaced by the total input $a_k$, so the model used this time is $y_k=\frac{\exp(a_k)}{\sum_{j=0}^{K-1}\exp(a_j)}$. Writing the sum in the denominator as u, $u=\exp(a_0)+\exp(a_1)+\exp(a_2)$, the formula becomes

y_k=\frac{\exp(a_k)}{u}\hspace{20pt}(k=0,1,2)
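A minimal softmax sketch (my own code):

```python
import numpy as np

def softmax(a):
    """Softmax: exponentiate each a_k and divide by the sum u."""
    u = np.sum(np.exp(a))
    return np.exp(a) / u

a = np.array([1.0, 0.5, -0.5])   # example total inputs a_0, a_1, a_2
y = softmax(a)
print(y, y.sum())                # three probabilities summing to 1
```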

After that, do the same as in the previous section: ・Create the generalized formula for each probability: $P(t|x)=y_0^{t_0}y_1^{t_1}y_2^{t_2}$ (P242). ・Multiply over all N data points to get the probability that this data was generated (the likelihood). ・Take the negative log-likelihood and find its minimum to derive the appropriate w. ・Create the model for classification using that w (you get an expression that can draw the classification boundary lines). A sketch of the error function used in that minimization follows below.
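Following those steps, the mean cross-entropy error for three classes would look like this sketch (1-of-K labels; the toy probabilities and targets are my own, not the book's):

```python
import numpy as np

def mean_cee_3class(Y, T):
    """Mean cross-entropy error for 3 classes:
    -(1/N) * sum_n sum_k t_nk * log(y_nk)."""
    return -np.mean(np.sum(T * np.log(Y), axis=1))

# Toy example: model outputs (softmax probabilities) and 1-of-K targets.
Y = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])
T = np.array([[1, 0, 0],
              [0, 1, 0]])
print(mean_cee_3class(Y, T))
```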

**Create a model for classification using the training data; if you then put test data into that model, it outputs the classifications.**
