Reference: [Learn by Running with Python! A New Machine Learning Textbook](https://www.amazon.co.jp/Python%E3%81%A7%E5%8B%95%E3%81%8B%E3%81%97%E3%81%A6%E5%AD%A6%E3%81%B6%EF%BC%81-%E3%81%82%E3%81%9F%E3%82%89%E3%81%97%E3%81%84%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92%E3%81%AE%E6%95%99%E7%A7%91%E6%9B%B8-%E4%BC%8A%E8%97%A4-%E7%9C%9F/dp/4798144983)
- Goal: using a neural network, create a classification model based on the cross entropy error derived from the maximum likelihood estimation explained in Chapter 6. Create a model with three or more input values x (in Chapter 6, I only did 3-class classification of 2-dimensional input values).
As in Chapter 6, consider the case where the number of input dimensions is 2 (D = 2).
If you add the bias parameter as a dummy input $x_2$ (= 1), the total input can be expressed by the summation formula $a = w_0 x_0 + w_1 x_1 + w_2 x_2 = \sum_{i=0}^{2} w_i x_i$.
In Chapter 6, this value, which transitions from 0 to 1, was interpreted as a probability, but in this chapter's neural network the value in the range 0 to 1 is considered to represent the **firing frequency**.
(P253) Here, the output value is taken to represent the number of pulses per unit time, that is, the **firing frequency**. The larger a is, the closer the output gets to the upper limit of the firing frequency; conversely, the more negative a becomes, the closer the output gets to 0, meaning the neuron hardly fires at all, I think.
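This behavior can be seen with a minimal sketch of the logistic sigmoid (assuming the standard form $1/(1+e^{-a})$ for the output unit):

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: squashes the total input a into the range (0, 1)."""
    return 1 / (1 + np.exp(-a))

# Large positive a -> output close to 1 (near the maximum firing frequency);
# large negative a -> output close to 0 (the neuron hardly fires).
for a in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    print(f"a = {a:+.1f}  ->  y = {sigmoid(a):.4f}")
```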
Two-layer neural network model
Two-dimensional inputs are divided into three categories; each output value represents the probability that the input belongs to the corresponding category.
**For example, suppose the two input values are weight ($x_0$) and height ($x_1$), and a person is classified into one of three classes.**
Example:
Classes:
$t_0$=Black
$t_1$=Caucasian
$t_2$=Asian
Mr. A:
Weight: 90kg
Height: 189 cm
If these values are input into this model and it outputs:
$t_0$=0.95
$t_1$=0.04
$t_2$=0.01
(the sum of the outputs is 1),
then Mr. A is most likely Black.
Mr. B:
Weight: 65kg
Height: 168 cm
If these values are input into this model and it outputs:
$t_0$=0.06
$t_1$=0.04
$t_2$=0.90
then Mr. B is most likely Asian.
It is easy to understand by looking at the figure on page 239 of Chapter 6.
- Maximum likelihood estimation is used to obtain the values of $w_0, w_1, w_2$ from the training data inputs.
- From these, the total inputs $a_0, a_1, a_2$ can be computed.
- Each output value y is expressed as a probability by the sigmoid function: $y_0$ gives the probability that t = 0, $y_1$ the probability that t = 1, and $y_2$ the probability that t = 2.
Each output takes a value between 0 and 1 depending on the input, so classification is possible. **(I don't know why the sum of the outputs is 1.)**
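One way to see why the outputs sum to 1: as noted in the next section, the output layer passes the total inputs through the softmax function, which divides each $\exp(a_k)$ by the sum over all classes, so the probabilities always add up to 1. A small sketch (the total-input values here are made up for illustration):

```python
import numpy as np

def softmax(a):
    """Softmax: exponentiate each total input and normalize by the sum,
    so the outputs are positive and always sum to 1."""
    exp_a = np.exp(a - np.max(a))  # subtract the max for numerical stability
    return exp_a / np.sum(exp_a)

a = np.array([2.0, -1.0, 0.5])   # hypothetical total inputs a_0, a_1, a_2
y = softmax(a)
print(y, y.sum())                # approximately [0.79 0.04 0.18], sum = 1.0
```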
- Each input value x is weighted by w and passed to the middle layer.
- In the middle layer, the total input b is the weighted sum of the inputs and the dummy variable; this sum is passed through the sigmoid function to give the middle-layer output z.
- In the same way, each middle-layer output z is multiplied by the weight v, and the sums are taken again in the output layer.
- This time, the total input is passed through the softmax function instead of the sigmoid function so that the output can be used as a probability.
P258
Total input of the middle layer: $b_j = \sum_{i} w_{ji} x_i$
Middle-layer output: $z_j = h(b_j)$ (h is the sigmoid function)
Total input of the output layer: $a_k = \sum_{j} v_{kj} z_j$
Output-layer output: $y_k = \dfrac{\exp(a_k)}{\sum_{k'} \exp(a_{k'})}$ (softmax)
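A minimal sketch of this forward pass under the notation above (sigmoid middle layer, softmax output layer); the weight values and the input vector are made up for illustration, not taken from the book:

```python
import numpy as np

def sigmoid(b):
    return 1 / (1 + np.exp(-b))

def softmax(a):
    exp_a = np.exp(a - np.max(a))
    return exp_a / np.sum(exp_a)

def forward(x, w, v):
    """Two-layer feedforward network.
    x: input vector with the dummy 1 appended for the bias
    w: middle-layer weights, shape (M, D+1)
    v: output-layer weights, shape (K, M+1)
    """
    b = w @ x              # total input of the middle layer
    z = sigmoid(b)         # middle-layer output
    z = np.append(z, 1)    # dummy 1 for the output-layer bias
    a = v @ z              # total input of the output layer
    y = softmax(a)         # output-layer output (class probabilities)
    return y

# Made-up example: D = 2 inputs, M = 2 middle units, K = 3 classes
rng = np.random.default_rng(0)
w = rng.normal(size=(2, 3))        # 2 middle units, 2 inputs + dummy
v = rng.normal(size=(3, 3))        # 3 classes, 2 middle outputs + dummy
x = np.array([0.5, -0.2, 1.0])     # x_0, x_1 and the dummy input x_2 = 1
print(forward(x, w, v))            # three probabilities that sum to 1
```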
The cross entropy error is the error, derived from maximum likelihood estimation, between the probability the model outputs for a certain input (e.g. x = 5.8 g) and the target class (e.g. T = [1, 0, 0]). From the training data you can create a model that outputs these class probabilities, and by feeding a new input value into the model you can see which class it is assigned to.
The cross entropy error of the two-layer feedforward network is the mean cross entropy over the N training samples:
$$E(\mathbf{w}, \mathbf{v}) = -\frac{1}{N}\sum_{n=0}^{N-1}\sum_{k=0}^{K-1} t_{nk} \log y_{nk}$$
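A small sketch of this mean cross entropy, assuming one-hot targets T and predicted probabilities Y (the numbers reuse the Mr. A / Mr. B example above):

```python
import numpy as np

def cross_entropy_error(Y, T, eps=1e-12):
    """Mean cross entropy: -1/N * sum_n sum_k t_nk * log(y_nk).
    Y: (N, K) predicted probabilities, T: (N, K) one-hot targets."""
    N = Y.shape[0]
    return -np.sum(T * np.log(Y + eps)) / N

# Made-up example with N = 2 samples and K = 3 classes
Y = np.array([[0.95, 0.04, 0.01],
              [0.06, 0.04, 0.90]])
T = np.array([[1, 0, 0],
              [0, 0, 1]])
print(cross_entropy_error(Y, T))   # small value because the predictions match the targets
```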
This refers to the figure on page 267, but what it says is the same as what is described on this page.
First, what we want to do: E(w) is the cross entropy error, obtained by taking the negative log-likelihood from maximum likelihood estimation, and we want the value of w (the weights) at the bottom of the valley, where the error is smallest.
At that valley bottom the slope is 0, and that is the value of w we want. The y-axis at the valley bottom corresponds to the likelihood: because the log-likelihood from maximum likelihood estimation is multiplied by -1 and flipped, the most plausible value from maximum likelihood estimation corresponds to this valley bottom.
**Therefore, $w^*$ in the figure is the optimal weight w for the classification model.**
What is being said here?
- Calculating the partial derivatives analytically is difficult.
- If you evaluate the error at a point just a little ahead of a given w and at a point just a little behind it, you can find the straight line passing through those two points without computing the partial derivative, and its slope gives a value close to the true slope.
- That is formula (7-19). Formula (7-19) covers the case of a single parameter w, and the formula extended to multiple parameters is (7-20); see the sketch below.

**Therefore, even when there are multiple parameters, appropriate values of w can easily be obtained.**
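A minimal sketch of this idea, using a central difference to approximate each partial derivative (the same idea as formulas (7-19) and (7-20)); the error function here is a made-up stand-in, not the network's actual cross entropy:

```python
import numpy as np

def numerical_gradient(E, w, eps=1e-4):
    """Approximate dE/dw_i by evaluating E a little ahead of and a little
    behind each parameter, without deriving the partial derivatives by hand."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (E(w_plus) - E(w_minus)) / (2 * eps)
    return grad

# Made-up error function with its minimum at w = [4, 8]
E = lambda w: (w[0] - 4) ** 2 + (w[1] - 8) ** 2
print(numerical_gradient(E, np.array([1.0, 1.0])))  # approximately [-6, -14]
```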
Finally, how to read the graph on page 269: (probably) since the partial derivative values for each of the weight parameters w and v are plotted, the closer a value is to 0, the smaller the slope. Therefore 4 can be taken as a good parameter for w, and 8 as a good parameter for v. I think that is what it means.
(P373 and 273 can be read as written, so they are omitted here.)
In the figure above, a model is built from the weights w and v obtained on the previous page, and the plot shows what happens when the test data is actually fed in.
Since w and v were obtained so that the error for each of classes 1, 2, and 3 is small, the regions where the probability is high for an actual input are drawn for the range 0.5 to 0.9 (the $t_0$ to $t_2$ parts in the image). By displaying contour lines only where each class probability is high and dividing the plane with them, classification appears to be possible.
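A rough sketch of how such a plot could be drawn (assuming a trained `forward` function like the one sketched earlier that returns class probabilities for a 2-dimensional input); contour lines are drawn only where each class probability is in the 0.5 to 0.9 range:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_class_regions(forward, w, v, n_classes=3, grid=60):
    """Draw contour lines of each class probability in the 0.5-0.9 range
    over a grid of 2-dimensional inputs."""
    x0 = np.linspace(-3, 3, grid)
    x1 = np.linspace(-3, 3, grid)
    X0, X1 = np.meshgrid(x0, x1)
    Y = np.zeros((grid, grid, n_classes))
    for i in range(grid):
        for j in range(grid):
            x = np.array([X0[i, j], X1[i, j], 1.0])  # append the dummy input
            Y[i, j, :] = forward(x, w, v)
    for k in range(n_classes):
        # contours only where the probability of class k is high
        plt.contour(X0, X1, Y[:, :, k], levels=[0.5, 0.9])
    plt.xlabel("$x_0$")
    plt.ylabel("$x_1$")
    plt.show()
```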