Reference: [Learn by Running with Python! A New Machine Learning Textbook](https://www.amazon.co.jp/Python%E3%81%A7%E5%8B%95%E3%81%8B%E3%81%97%E3%81%A6%E5%AD%A6%E3%81%B6%EF%BC%81-%E3%81%82%E3%81%9F%E3%82%89%E3%81%97%E3%81%84%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92%E3%81%AE%E6%95%99%E7%A7%91%E6%9B%B8-%E4%BC%8A%E8%97%A4-%E7%9C%9F/dp/4798144983)
- Goal: using a neural network, create a classification model based on the cross entropy error derived from the maximum likelihood estimation explained in Chapter 6. Create a model with three or more input values x (in Chapter 6, I only did 3-class classification of 2-dimensional input values).
As in Chapter 6, consider the case where the number of input dimensions is 2 (D = 2).
If you add the bias parameter as a dummy input $x_2$ (= 1), the total input can be expressed by the summation formula $a = w_0 x_0 + w_1 x_1 + w_2 x_2 = \sum_{i=0}^{2} w_i x_i$.
In Chapter 6, this value, which transitions from 0 to 1, was interpreted as a probability, but in this chapter's neural network the value in the range 0 to 1 is considered to represent the **firing frequency**.
(P253) Here, the output value is taken to represent the number of pulses per unit time, that is, the **firing frequency**. The larger a is, the closer the output gets to the upper limit of the firing frequency; conversely, the more negative a becomes, the closer the output gets to 0, meaning the neuron hardly fires at all, I think.
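This behavior can be seen with a minimal sketch of the logistic sigmoid (assuming the standard form $1/(1+e^{-a})$ for the output unit):

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: squashes the total input a into the range (0, 1)."""
    return 1 / (1 + np.exp(-a))

# Large positive a -> output close to 1 (near the maximum firing frequency);
# large negative a -> output close to 0 (the neuron hardly fires).
for a in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    print(f"a = {a:+.1f}  ->  y = {sigmoid(a):.4f}")
```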
Two-layer neural network model
Two-dimensional inputs are divided into three categories; each output value represents the probability that the input belongs to the corresponding category.
**For example, suppose the two input values are weight ($x_0$) and height ($x_1$), and a person is classified into one of three classes.**
Example:
Classes:
$t_0$=Black
$t_1$=Caucasian
$t_2$=Asian
Mr. A:
Weight: 90kg
Height: 189 cm
If these values are input into this model and it outputs:
$t_0$=0.95
$t_1$=0.04
$t_2$=0.01
(the sum of the outputs is 1),
then Mr. A is most likely Black.
Mr. B:
Weight: 65kg
Height: 168 cm
If these values are input into this model and it outputs:
$t_0$=0.06
$t_1$=0.04
$t_2$=0.90
then Mr. B is most likely Asian.
It is easy to understand by looking at the figure on page 239 of Chapter 6.
- Maximum likelihood estimation is used to obtain the values of $w_0, w_1, w_2$ from the training data inputs.
- From these, the total inputs $a_0, a_1, a_2$ can be computed.
- Each output value y is expressed as a probability by the sigmoid function: $y_0$ gives the probability that t = 0, $y_1$ the probability that t = 1, and $y_2$ the probability that t = 2.
Each output takes a value between 0 and 1 depending on the input, so classification is possible. **(I don't know why the sum of the outputs is 1.)**
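One way to see why the outputs sum to 1: as noted in the next section, the output layer passes the total inputs through the softmax function, which divides each $\exp(a_k)$ by the sum over all classes, so the probabilities always add up to 1. A small sketch (the total-input values here are made up for illustration):

```python
import numpy as np

def softmax(a):
    """Softmax: exponentiate each total input and normalize by the sum,
    so the outputs are positive and always sum to 1."""
    exp_a = np.exp(a - np.max(a))  # subtract the max for numerical stability
    return exp_a / np.sum(exp_a)

a = np.array([2.0, -1.0, 0.5])   # hypothetical total inputs a_0, a_1, a_2
y = softmax(a)
print(y, y.sum())                # approximately [0.79 0.04 0.18], sum = 1.0
```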
- Each input value x is weighted by w and passed to the middle layer.
- In the middle layer, the total input b is the weighted sum of the inputs and the dummy variable; this sum is passed through the sigmoid function to give the middle-layer output z.
- In the same way, each middle-layer output z is multiplied by the weight v, and the sums are taken again in the output layer.
- This time, the total input is passed through the softmax function instead of the sigmoid function so that the output can be used as a probability.
P258
Total input of the middle layer: $b_j = \sum_{i} w_{ji} x_i$
Middle-layer output: $z_j = h(b_j)$ (h is the sigmoid function)
Total input of the output layer: $a_k = \sum_{j} v_{kj} z_j$
Output-layer output: $y_k = \dfrac{\exp(a_k)}{\sum_{k'} \exp(a_{k'})}$ (softmax)
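A minimal sketch of this forward pass under the notation above (sigmoid middle layer, softmax output layer); the weight values and the input vector are made up for illustration, not taken from the book:

```python
import numpy as np

def sigmoid(b):
    return 1 / (1 + np.exp(-b))

def softmax(a):
    exp_a = np.exp(a - np.max(a))
    return exp_a / np.sum(exp_a)

def forward(x, w, v):
    """Two-layer feedforward network.
    x: input vector with the dummy 1 appended for the bias
    w: middle-layer weights, shape (M, D+1)
    v: output-layer weights, shape (K, M+1)
    """
    b = w @ x              # total input of the middle layer
    z = sigmoid(b)         # middle-layer output
    z = np.append(z, 1)    # dummy 1 for the output-layer bias
    a = v @ z              # total input of the output layer
    y = softmax(a)         # output-layer output (class probabilities)
    return y

# Made-up example: D = 2 inputs, M = 2 middle units, K = 3 classes
rng = np.random.default_rng(0)
w = rng.normal(size=(2, 3))        # 2 middle units, 2 inputs + dummy
v = rng.normal(size=(3, 3))        # 3 classes, 2 middle outputs + dummy
x = np.array([0.5, -0.2, 1.0])     # x_0, x_1 and the dummy input x_2 = 1
print(forward(x, w, v))            # three probabilities that sum to 1
```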
The cross entropy error is the error, derived from maximum likelihood estimation, between the probability the model outputs for a certain input (e.g. x = 5.8 g) and the target class (e.g. T = [1, 0, 0]). From the training data you can create a model that outputs these class probabilities, and by feeding a new input value into the model you can see which class it is assigned to.
The cross entropy error of the two-layer feedforward network is the mean cross entropy over the N training samples:
$$E(\mathbf{w}, \mathbf{v}) = -\frac{1}{N}\sum_{n=0}^{N-1}\sum_{k=0}^{K-1} t_{nk} \log y_{nk}$$
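A small sketch of this mean cross entropy, assuming one-hot targets T and predicted probabilities Y (the numbers reuse the Mr. A / Mr. B example above):

```python
import numpy as np

def cross_entropy_error(Y, T, eps=1e-12):
    """Mean cross entropy: -1/N * sum_n sum_k t_nk * log(y_nk).
    Y: (N, K) predicted probabilities, T: (N, K) one-hot targets."""
    N = Y.shape[0]
    return -np.sum(T * np.log(Y + eps)) / N

# Made-up example with N = 2 samples and K = 3 classes
Y = np.array([[0.95, 0.04, 0.01],
              [0.06, 0.04, 0.90]])
T = np.array([[1, 0, 0],
              [0, 0, 1]])
print(cross_entropy_error(Y, T))   # small value because the predictions match the targets
```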
This refers to the figure on page 267, but what it says is the same as what is described on this page.
First, what we want to do: E(w) is the cross entropy error, obtained by taking the negative log-likelihood from maximum likelihood estimation, and we want the value of w (the weights) at the bottom of the valley, where the error is smallest.
At that valley bottom the slope is 0, and that is the value of w we want. The y-axis at the valley bottom corresponds to the likelihood: because the log-likelihood from maximum likelihood estimation is multiplied by -1 and flipped, the most plausible value from maximum likelihood estimation corresponds to this valley bottom.
**Therefore, $w^*$ in the figure is the optimal weight w for the classification model.**
What is being said here?
- Calculating the partial derivatives analytically is difficult.
- If you evaluate the error at a point just a little ahead of a given w and at a point just a little behind it, you can find the straight line passing through those two points without computing the partial derivative, and its slope gives a value close to the true slope.
- That is formula (7-19). Formula (7-19) covers the case of a single parameter w, and the formula extended to multiple parameters is (7-20); see the sketch below.

**Therefore, even when there are multiple parameters, appropriate values of w can easily be obtained.**
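A minimal sketch of this idea, using a central difference to approximate each partial derivative (the same idea as formulas (7-19) and (7-20)); the error function here is a made-up stand-in, not the network's actual cross entropy:

```python
import numpy as np

def numerical_gradient(E, w, eps=1e-4):
    """Approximate dE/dw_i by evaluating E a little ahead of and a little
    behind each parameter, without deriving the partial derivatives by hand."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (E(w_plus) - E(w_minus)) / (2 * eps)
    return grad

# Made-up error function with its minimum at w = [4, 8]
E = lambda w: (w[0] - 4) ** 2 + (w[1] - 8) ** 2
print(numerical_gradient(E, np.array([1.0, 1.0])))  # approximately [-6, -14]
```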
Finally, how to read the graph on page 269: (probably) since the partial derivative values for each of the weight parameters w and v are plotted, the closer a value is to 0, the smaller the slope. Therefore 4 can be taken as a good parameter for w, and 8 as a good parameter for v. I think that is what it means.
(P373 and 273 can be read as written, so they are omitted here.)
In the figure above, a model is built from the weights w and v obtained on the previous page, and the plot shows what happens when the test data is actually fed in.
Since w and v were obtained so that the error for each of classes 1, 2, and 3 is small, the regions where the probability is high for an actual input are drawn for the range 0.5 to 0.9 (the $t_0$ to $t_2$ parts in the image). By displaying contour lines only where each class probability is high and dividing the plane with them, classification appears to be possible.
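A rough sketch of how such a plot could be drawn (assuming a trained `forward` function like the one sketched earlier that returns class probabilities for a 2-dimensional input); contour lines are drawn only where each class probability is in the 0.5 to 0.9 range:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_class_regions(forward, w, v, n_classes=3, grid=60):
    """Draw contour lines of each class probability in the 0.5-0.9 range
    over a grid of 2-dimensional inputs."""
    x0 = np.linspace(-3, 3, grid)
    x1 = np.linspace(-3, 3, grid)
    X0, X1 = np.meshgrid(x0, x1)
    Y = np.zeros((grid, grid, n_classes))
    for i in range(grid):
        for j in range(grid):
            x = np.array([X0[i, j], X1[i, j], 1.0])  # append the dummy input
            Y[i, j, :] = forward(x, w, v)
    for k in range(n_classes):
        # contours only where the probability of class k is high
        plt.contour(X0, X1, Y[:, :, k], levels=[0.5, 0.9])
    plt.xlabel("$x_0$")
    plt.ylabel("$x_1$")
    plt.show()
```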