I'm jp_Pythonia, a first-year junior high school student. Nice to meet you. Since this article got a lot of attention and grew quite long, I have split it into parts.
What is machine learning? Tom Mitchell defines it as follows:
A computer program is said to learn from experience E
with respect to some task T and some performance measure P,
if its performance on T, as measured by P, improves with experience E.
There are two main types of machine learning. The next two sections discuss them: **"supervised learning"** and **"unsupervised learning"**.
Supervised learning is the **most common** machine learning problem.
Its main features are the following two points:
- There is a correct answer in the given dataset
- Problems are divided into classification problems and regression problems
→ In supervised learning, there is a "correct answer" in every given dataset.
e.g.)
Suppose we developed an algorithm that predicts the weather and outputs whether tomorrow will be rainy or sunny.
Every piece of weather data it was trained on stated whether that day was rainy or sunny.
An example like the one above is called **"supervised learning"**,
because the "correct answer" for the algorithm, the rainy-or-sunny label, was attached to the data from the beginning.
Supervised learning can be broadly divided into two genres.
They are the "classification problem" and the "regression problem".
The weather-prediction algorithm shown earlier falls under classification problems, because the outcome, **rain or sunny**, is being **"classified"**.
By the way, data like rain-or-sunny is called a "discrete value".
Regression problems, on the other hand, are a type of problem that deals with "continuous values".
e.g.)
We have developed an algorithm to predict tomorrow's temperature.
The data learned is a combination of the weather and the temperature.
In the case of the above example, the temperature of the output data is a "continuous value", so it is called a regression problem.
Unsupervised learning Unsupervised learning allows you to tackle problems with little or no understanding of what the results will look like. You can derive the structure from data where the effect of the variable is not always known. You can derive this structure by clustering the data based on the relationships between the variables in the data. In unsupervised learning, there is no predictive feedback.
Quoted from https://www.coursera.org/learn/machine-learning/supplement/1O0Bk/unsupervised-learning
As mentioned above, unsupervised learning has two main characteristics.
- Unlike supervised learning, there is no "correct answer" in the data
- Clustering algorithms are used
What is a clustering algorithm?
It is an algorithm that is useful when working with data where all of the labels are the same, where there are no labels in the first place, or where the data is otherwise unlabeled.
A clustering algorithm finds rules, genres, and so on in the given data, creates clusters on its own, and sorts the data into them.
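As an aside, here is a minimal Python sketch of that idea using scikit-learn's KMeans. The toy data, the choice of two clusters, and the variable names are my own assumptions for illustration; they do not come from this article.

```python
# Hypothetical example: clustering unlabeled 2-D points into 2 groups.
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: two loose blobs of points (no "correct answer" attached)
X = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
              [5.0, 5.1], [5.2, 4.8], [4.9, 5.3]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # cluster assigned to each point, e.g. [0 0 0 1 1 1]
print(kmeans.cluster_centers_)  # the center the algorithm found for each cluster
```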
I have prepared a dataset. Here it is.
Case | Land area(x) | Land price(y) |
---|---|---|
1 | 2104 | 460 |
2 | 1416 | 232 |
3 | 1534 | 315 |
4 | 852 | 178 |
・ ・ ・ | ・ ・ ・ | ・ ・ ・ |
8 | 680 | 143 |
Let's define the **symbols** we will use from now on.
- **m** … the number of cases (training examples); 8 in the dataset above
- **x** … the land area. xi denotes the area of the i-th plot of land; x2 = 1416
- **y** … the land price. yi denotes the price of the i-th plot of land; y4 = 178
- **h** … an abbreviation meaning hypothesis
Let's create an expression that represents the hypothesis h. Here it is.
hθ(x) = θ0 + θ1x
By the way, this may look difficult at first glance, but can you see that it is very similar to the linear-function formula Y = B + AX that junior high school students learn? The linear function gives a graph like the one below,
so you can see that it really is just that.
**And hθ(x) likewise describes a straight line.**
**This is the easiest place to get stuck, so please make sure you remember at least this much.**
Continuing from the model representation above, we will define the objective function. Its purpose, in the first place, is to draw the line that best fits a given (plotted) dataset. For example, look at a figure like this. (The data has been plotted.)
At this point, the goal is to find the red line that minimizes the squares of the (vertical) distances to the blue points.
To achieve the goal introduced in 5-1, we need a formula that we can minimize. Looking at the formula alone is not enough to understand it, so I will give an explanation that builds an intuitive understanding.
First of all, θ0 and θ1 are called parameters. Depending on what numbers you plug into them, the direction of the red line in the graph above changes. If you have studied linear functions in junior high school, you will be fine.
e.g.)
When θ0 = 1.5 and θ1 = 0, you get a horizontal line at height 1.5.
When θ0 = 1 and θ1 = 0.5, you get a line that crosses the y-axis at 1 and rises with slope 0.5.
In the end, what we want to do is find the value of θ0 (where the line meets the y-axis) and the value of θ1 (how much the line rises for each step along the x-axis) that fit the data.
Let's make a formula.
5-3-1 Step 1
```math
(h_\theta(x) - y)^2
```
Let me explain the formula above.
The Step 1 formula takes the value of the red straight line hθ(x) in the graph above, subtracts the value of y, and squares the result.
5-3-2 Step 2
Remember 4-1? We could refer to the i-th value of x as xi.
The Step 1 formula does not specify which sample it refers to and is very abstract, so let's add up the term over all the samples. Here it is:
```math
\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2
```
The Σ with i = 1 underneath and m on top means summing from i = 1 (case number 1) up to m (the total number of cases).
Also, (hθ(x) − y)² has been replaced by (hθ(x^(i)) − y^(i))², which means we subtract the i-th y from the hypothesis evaluated at the i-th x.
5-3-3 Step 3
Finally, multiply by 1/2m to simplify the calculation.
```math
\frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2
```
However, the 1/2 and the 1/m are there **only to simplify later calculations; they do not change which parameters minimize the result**, so they have no special meaning.
As I wrote in 5-1, we choose the values of θ0 and θ1, and the function we minimize so that the line fits the plotted data is the cost function. That is the formula for selecting the parameters that I just introduced.
Well, in the end, the cost function is written **J(θ0, θ1)**. So when we write the cost function out,
```math
J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2
```
is how it can be expressed.
Therefore, from now on, the cost function will be represented by J.
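To make the definition concrete, here is a small Python/NumPy sketch of the hypothesis and of J(θ0, θ1) exactly as written above. The x and y values are the ones from the land table in Section 4; everything else (function names, the sample θ values) is my own illustration, not code from the article.

```python
import numpy as np

# A few (x, y) pairs from the land dataset above: area x, price y
x = np.array([2104, 1416, 1534, 852])
y = np.array([460, 232, 315, 178])

def h(theta0, theta1, x):
    """Hypothesis: h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

def J(theta0, theta1, x, y):
    """Cost: J(theta0, theta1) = 1/(2m) * sum_i (h(x_i) - y_i)^2."""
    m = len(x)
    return np.sum((h(theta0, theta1, x) - y) ** 2) / (2 * m)

# Cost for one arbitrary choice of parameters; a smaller J means a better-fitting line
print(J(0.0, 0.2, x, y))
```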
So far, we have used hθ(x) = θ0 + θ1x. Here we simplify it to
hθ(x) = θ1x
as our expression.
In other words, we set θ0 = 0.
Since there is only one parameter, the formula for the cost function J is
```math
J(\theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2
```
Can be expressed as.
Because of the simplification in 6-1, B = θ0 = 0 in the linear-function form Y = AX + B, so the hypothesis line passes through the origin.
By the way, the hypothesis hθ(x) is a function of x: once θ1 has been fixed, the expression depends only on x.
The cost function J, on the other hand, is a function of θ1. You can see that from the formula in 6-1.
First, suppose you have three data points at (1,1), (2,2), and (3,3).
Since θ0 = 0, the straight line of the hypothesis hθ(x) changes depending only on the value of θ1.
Now, based on this hypothesis, let's calculate the values to plot for the objective function J.
First, the purple line. The green circles mark the data, so the differences are 0.5, 1.0, and 1.5 in ascending order.
Now, let's apply these three to the formula of the cost function J.
```math
J(\theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2
```
If you calculate this,
J(θ1) = (1/6)(0.5² + 1.0² + 1.5²) = (1/6)(3.5) = 0.58…
Similarly, on the magenta (pink) colored line,
J(θ1) = 0
and for the red line,
J(θ1) = 2.33…
Then plot the three values you just got.
The result is here.
The line color and the circle color correspond. The graph above shows the objective function J.
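If you want to check these numbers yourself, here is a short Python sketch. It assumes the three data points (1,1), (2,2), (3,3) and θ0 = 0, exactly as in this section, and reproduces the plotted values such as 0.58…, 0, and 2.33….

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([1, 2, 3])
m = len(x)

def J(theta1):
    """Cost for the simplified hypothesis h(x) = theta1 * x (theta0 fixed at 0)."""
    return np.sum((theta1 * x - y) ** 2) / (2 * m)

for t in [0.0, 0.5, 1.0, 1.5, 2.0]:
    print(t, round(J(t), 3))   # J(1.0) = 0, J(0.5) = J(1.5) = 0.583, J(0.0) = J(2.0) = 2.333
```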
Section 7 describes an algorithm called the **steepest descent method** (gradient descent) that minimizes the objective function J defined in 5-4.
The steepest descent method looks like this. ![Screenshot 2020-01-25 15.24.52.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/448030/8d4f647e-aa35-056a-ab16-965eaeb03156.png)
Quoted from https://linky-juku.com/simple-regression-analysis3/
However, its weakness is that, as shown in the graph above, it may settle on what is called a **local solution** (a local optimum rather than the overall best, i.e., not the desired answer).
This is the formula to use (for j = 0 and j = 1):
θj := θj − α ∂/∂θj J(θ0, θ1)
Let's look at it piece by piece.
Step 1
The := part means that the right-hand side is assigned to the left-hand side, i.e., the value gets updated.
Step 2
α refers to the learning rate. The steepest descent method reaches the minimum value in several steps as shown in the figure below. The larger α, the larger the step.
Step 3
Do you remember the cost function? This J(θ0, θ1) is, as you know, exactly that cost function.
So you can convert it like this:
```math
J(\theta_0,\theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 = \frac{1}{2m}\sum_{i=1}^{m}\left(\theta_0 + \theta_1 x^{(i)} - y^{(i)}\right)^2
```
Then, work out the partial derivatives for the two cases, θ0 and θ1.
```math
j = 0:\quad \frac{\partial}{\partial\theta_0} J(\theta_0, \theta_1) = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)
```
```math
j = 1:\quad \frac{\partial}{\partial\theta_1} J(\theta_0, \theta_1) = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x^{(i)}
```
OK.
Actually, as you may have noticed from the formula for the steepest descent method, there is one very important point.
It is that **θ0 and θ1 must be updated at the same time**.
Temp0 := θ0 − α ∂/∂θ0 J(θ0, θ1)
Temp1 := θ1 − α ∂/∂θ1 J(θ0, θ1)
θ0 := Temp0
θ1 := Temp1
In the formulas above, the update terms are first assigned to the temporary variables Temp0 and Temp1, and only after that are they assigned to θ0 and θ1.
Then, what happens if you don't do this and instead update in order, like the following?
Temp0 := θ0 − α ∂/∂θ0 J(θ0, θ1)
θ0 := Temp0
Temp1 := θ1 − α ∂/∂θ1 J(θ0, θ1)
θ1 := Temp1
Actually, there is a big problem. By the time we reach the third line, θ0 has already been overwritten via Temp0, so the right-hand side of the Temp1 expression ends up using the new value of θ0 instead of the old one.
Please note that ** simultaneous updates are mandatory ** regardless of whether the problem actually occurs.
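To make the simultaneous update concrete, here is a sketch of the loop in Python. The toy data, the learning rate α, and the iteration count are placeholders I chose; the two update formulas are the ones derived in 7-3.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
m = len(x)
alpha = 0.1                      # learning rate (arbitrary choice for this toy data)
theta0, theta1 = 0.0, 0.0

for _ in range(1000):
    h = theta0 + theta1 * x
    # Compute BOTH update terms from the current theta0 and theta1 ...
    temp0 = theta0 - alpha * np.sum(h - y) / m
    temp1 = theta1 - alpha * np.sum((h - y) * x) / m
    # ... and only then overwrite them: the simultaneous update.
    theta0, theta1 = temp0, temp1

print(theta0, theta1)            # should end up very close to 0 and 1 for this data
```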
As a premise, remember that the steepest descent method is applied to the cost function.
Quoted from (https://www.coursera.org/learn/machine-learning/lecture/kCvQc/gradient-descent-for-linear-regression)
By the way, could you see the above two images?
By the way, the figure on the right is ![Screenshot 2020-01-25 14.51.52.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/448030/1da1ec0c-0a05-eefa-b571-2f1e04c9a53b.png) the cost function diagram drawn with contour lines so that it can be viewed in two dimensions.
In the previous figure, as the blue line (the hypothesis) is moved, more red cross marks appear on the contour plot. As you can see, the left side is the linear-regression hypothesis and the right side is the cost function. Changing the direction of the hypothesis line changes the value of the cost function. This is what I wanted to convey in 7-4.
Incidentally, the cost function of linear regression is bowl-shaped (convex), so here the steepest descent method is not led astray into a local optimum.
\begin{bmatrix}
a & b \\
c & d \\
e & f
\end{bmatrix}
For example, the block above is a matrix; its dimension is 3 × 2 (3 rows by 2 columns).
A vector, on the other hand, has only one entry per row, i.e., a single column,
y =
\begin{bmatrix}
a \\
c \\
e
\end{bmatrix}
and it represents an n × 1 matrix, as shown above.
There are some rules.
For example
yi = the i-th element (in the example above, y1 = a)
**However, you have to be careful about whether the index starts from 0 or from 1.**
Basically, we will use vectors whose indexes start from 1.
Also, most people use uppercase letters to refer to matrices, and lowercase letters for vectors and raw data.
The rule for adding and subtracting matrices is that
**only matrices of the same dimensions**
can be added or subtracted.
\begin{bmatrix}
1 & 0 \\
2 & 5 \\
3 & 4
\end{bmatrix}
+
\begin{bmatrix}
4 & 7 \\
2 & 9 \\
0 & 8
\end{bmatrix}
=
\begin{bmatrix}
5 & 7 \\
4 & 14 \\
3 & 12
\end{bmatrix}
As you can see from the above, the two matrices being added and the resulting matrix are all 3 × 2, and the numbers in the same vertical and horizontal positions are added together.
"Scalar" sounds like an exaggeration, but it simply means a real number. So
3 ×
\begin{bmatrix}
2 & 5 \\
3 & 4
\end{bmatrix}
=
\begin{bmatrix}
6 & 15 \\
9 & 12
\end{bmatrix}
and that's all there is to it: every element is simply multiplied by the scalar.
Next, multiplying a matrix by a vector. It goes like this.
\begin{bmatrix}
2 & 5 \\
3 & 4
\end{bmatrix}
\begin{bmatrix}
6 \\
9
\end{bmatrix}
=
\begin{bmatrix}
57 \\
54
\end{bmatrix}
Name the matrix A, B, C in order from the left.
Step 1: Multiply the left entry of A's top row by the top entry of B, and the right entry of A's top row by the bottom entry of B, then add them (2 × 6 + 5 × 9).
Step 2: Do the same with the bottom row: 3 × 6 + 4 × 9.
By the way, A was 2 × 2 this time, B is 2 × 1, and C is 2 × 1. This is no coincidence. **There are fixed rules:**
A → m × n (2 × 2)
B →n × 1(2 × 1)
C → m × 1(2 × 1)
Please be careful as you will use this rule from now on.
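Here is the same rule in a quick NumPy sketch, using the 2 × 2 matrix and the vector from the example above (the `@` operator is NumPy's matrix multiplication; the sketch itself is my illustration, not from the article).

```python
import numpy as np

A = np.array([[2, 5],
              [3, 4]])       # m x n  (2 x 2)
B = np.array([6, 9])         # n x 1  (a length-2 vector)

C = A @ B                    # m x 1  (a length-2 vector)
print(C)                     # [57 54], matching the hand calculation above
print(A.shape, B.shape, C.shape)
```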
**In fact, there is a way to solve linear regression without using an iterative algorithm, and this section leads up to it.** What we are going to compute this time is
\begin{bmatrix}
2 & 5 & 2 \\
3 & 4 & 7
\end{bmatrix}
×
\begin{bmatrix}
6 & 8\\
9 & 3\\
2 & 3
\end{bmatrix}
=
First, split B into two column vectors, as below.
\begin{bmatrix}
2 & 5 & 2 \\
3 & 4 & 7
\end{bmatrix}
×
\begin{bmatrix}
6\\
9 \\
2
\end{bmatrix}
=
\begin{bmatrix}
2 & 5 & 2 \\
3 & 4 & 7
\end{bmatrix}
×
\begin{bmatrix}
8\\
3 \\
3
\end{bmatrix}
=
Each of these works out as follows.
\begin{bmatrix}
2 & 5 & 2\\
3 & 4 & 7
\end{bmatrix}
×
\begin{bmatrix}
6\\
9 \\
2
\end{bmatrix}
=
\begin{bmatrix}
61\\
68
\end{bmatrix}
\begin{bmatrix}
2 & 5 & 2\\
3 & 4 & 7
\end{bmatrix}
×
\begin{bmatrix}
8\\
3 \\
3
\end{bmatrix}
=
\begin{bmatrix}
37\\
57
\end{bmatrix}
Finally, put the two halves of C back together:
\begin{bmatrix}
61& 37 \\
68 & 57
\end{bmatrix}
that's all.
Of course, you cannot simply swap A × B to B × A; see 8-6 for the reason.
I don't want to write out only things that are too obvious, so I'll just say this: think for yourself about what goes wrong with the dimensions when the order of multiplication is reversed.
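As a sketch of that, here is the same A × B in NumPy, together with B × A computed the other way around (the matrices are the ones used in this section; the code is only my illustration).

```python
import numpy as np

A = np.array([[2, 5, 2],
              [3, 4, 7]])    # 2 x 3
B = np.array([[6, 8],
              [9, 3],
              [2, 3]])       # 3 x 2

print(A @ B)   # [[61 37], [68 57]] -> a 2 x 2 matrix, as computed by hand above
print(B @ A)   # a 3 x 3 matrix: defined here, but a completely different result
```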
The number 1 is the identity of the real numbers, because multiplying any number by 1 leaves it unchanged. Real numbers also have reciprocals; for example, the reciprocal of 3 is 1/3.
However, you can see that not every real number has a reciprocal. For example, 0 does not.
Matrices have a corresponding idea, called the inverse, or inverse matrix. What is it? Let's finish this article with that (and the transpose).
Let A be an m × m matrix, and suppose it has an inverse. We write the inverse matrix as A⁻¹.
Among matrices, only m × m matrices, also called square matrices, can have an inverse.
For example, suppose you have one square matrix.
```math
\begin{bmatrix}
3& 4 \\
2 & 16
\end{bmatrix}
```
And suppose we also happen to know its inverse matrix:
```math
\begin{bmatrix}
0.4& -0.1 \\
-0.05 & 0.075
\end{bmatrix}
```
And when this matrix is multiplied,
```math
=
\begin{bmatrix}
1& 0 \\
0 & 1
\end{bmatrix}
```
the result equals I, the 2 × 2 identity matrix. (I is the symbol for the identity matrix.)
Of course it is possible to compute an inverse matrix by hand, but hardly anyone does that nowadays. So how do we calculate it?
**We use software called Octave.**
Rest assured that Part 2 will explain how to use Octave.
Now, let's summarize the terms.
A matrix that does not have an inverse matrix is called a singular matrix or a degenerate matrix.
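In practice the article uses Octave for this; purely as a sketch, here is the same check done with NumPy, using the 2 × 2 matrix and the inverse given above.

```python
import numpy as np

A = np.array([[3.0, 4.0],
              [2.0, 16.0]])

A_inv = np.linalg.inv(A)
print(A_inv)       # [[ 0.4   -0.1  ] [-0.05   0.075]], matching the inverse shown above
print(A @ A_inv)   # the 2 x 2 identity matrix (up to floating-point rounding)

# A singular (degenerate) matrix has no inverse; np.linalg.inv raises an error for one.
```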
**Really, truly finally, I will show you how to compute the transpose.**
```math
\begin{bmatrix}
1 & 2 & 0 \\
3 & 4 & 9
\end{bmatrix}
```
→
```math
\begin{bmatrix}
1 & 3 \\
2 & 4 \\
0 & 9
\end{bmatrix}
```
If you thought, "Is that all?", you're right. That's all. You simply bring the top row to the left (as the first column) and the bottom row to the right (as the second column). That's it. Thank you for your hard work.
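And the transpose in NumPy is just `.T`, shown here with the same matrix (again, only a sketch of mine, not code from the article).

```python
import numpy as np

M = np.array([[1, 2, 0],
              [3, 4, 9]])    # 2 x 3

print(M.T)                   # 3 x 2: [[1 3], [2 4], [0 9]]
```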
Thank you for reading this far.
~~I will post a link when the sequel is ready, so please read it.~~ Part 2 has now been written.
Welcome to "Machine Learning Part 2 that even junior high school students can fully understand." If you haven't read Part 1 yet, you can still follow along without reading Part 1 first (at least up to Section 1).
1-1 Windows
I'm sorry, my environment is not Windows, so I will only give a brief explanation and a link.
First, install it using the link here.
I will also introduce a reference site here.
1-2 Mac OS
**Operation has been confirmed on macOS High Sierra.**
You may refer to here.
Step 1
Install Homebrew
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
Please put it in the terminal. (It's one line)
Step 2
Also include Homebrew Cask.
brew tap caskroom/cask
By the way, if you get an error, try typing `brew cask`.
If you see `See also "man brew-cask"`, it is already installed.
Step 3
Install XQuartz
Install using Homebrew Cask.
brew cask install xquartz
Step 4
Install Octave.
brew install octave
Let's wait patiently; this takes a while.
Step 5
Start Octave
octave
This should start it. If the prompt
octave:1>
appears, you're good.
exit
Type this to quit, and the terminal goes back to behaving as a normal console.
** The method of Step 1-5 so far is quoted from here: **
https://blog.amedama.jp/entry/2017/04/28/232547
1-3 Linux
Sorry. It's not that I have no environment to test on (Zorin OS), but I haven't done it because it's a hassle. Please look it up yourself.
Wow ... there are many kanji ...
Section 2 discusses more powerful linear regression that works for multiple variables.
In the linear regression so far, y was predicted from a single piece of information x; this is called a single feature (single variable). However, when predicting the price of a house or tomorrow's stock price, it is hard to predict well from only one piece of information.
Therefore, we will use multiple pieces of information using multivariable linear regression.
Next, here is a sample dataset.
Previously there was only one x; now the features are numbered x1 through x4.
Case No. (i) | Size (x1) | Bedrooms (x2) | Floors (x3) | Age in years (x4) | Price (y, 10,000 yen) |
---|---|---|---|---|---|
1 | 2104 | 5 | 1 | 45 | 4600 |
2 | 1416 | 3 | 2 | 40 | 2320 |
3 | 1534 | 3 | 2 | 30 | 3150 |
4 | 852 | 2 | 1 | 36 | 1780 |
・ ・ ・ | ・ ・ ・ | ・ ・ ・ | ・ ・ ・ | ・ ・ ・ | ・ ・ ・ |
Then, we will define the prerequisite numbers.
n … the number of features (the variables x); here n = 4.
x^(i) … the features of the i-th case, represented as a vector. For example, x^(2) is
\begin{bmatrix}
1416 \\
3 \\
2 \\
40
\end{bmatrix}
This does not mean x squared. The vector above is n-dimensional (here 4-dimensional, i.e., 4 × 1).
xj^(i) … here the subscript j and the superscript (i) are stacked on one x; it is the j-th feature of the i-th case. For example, x3^(2) = 2, the third entry of the vector above.
Until now, it was like this.
hθ(x) = θ0 + θ1x
However, since there is no longer just one x, the form of the expression changes as well. Here it is.
hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + θ4x4
However, the above formula has a big problem because it grows as n increases.
That is, writing formulas in markdown is surprisingly tiring. So let's simplify the formula.
What we do is define an extra feature x0 that is always equal to 1.
What this means: for each case i there is a feature vector x^(i), and using the notation xj^(i) introduced in 2-1, we define x0^(i) = 1. In other words, we declare that the i-th case has x0 = 1 even though x0 did not originally exist; we have arbitrarily defined the 0th feature of every x^(i) vector to be 1. As a result, x^(i) becomes an (n + 1)-dimensional vector. OK?
**The above content is difficult. Please write it out on paper and read it repeatedly. By about the third read, you should understand it the way I do.**
And we think of parameters as vectors.
θ =
\begin{bmatrix}
θ0 \\
θ1 \\
θ2 \\
・\\
・\\
・\\
θn \\
\end{bmatrix}
And now we can transform the formula.
hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + θ4x4
= θ0x0 + θ1x1 + θ2x2 + θ3x3 + θ4x4 (since x0 = 1)
By the way, we now transpose θ. In short, we change its shape as a vector from a column to a row.
θ =
\begin{bmatrix}
θ0 \\
θ1 \\
θ2 \\
・\\
・\\
・\\
θn \\
\end{bmatrix}
and its transpose is
θ^T =
\begin{bmatrix}
θ0 & θ1 & θ2 & ・・・ & θn
\end{bmatrix}
What was an (n + 1) × 1 vector is now 1 × (n + 1). This is the transpose, and it is written with the letter T. And when this is multiplied by the vector of x,
\begin{bmatrix}
θ0 & θ1 & θ2 & ・・・ & θn
\end{bmatrix}
×
\begin{bmatrix}
x0 \\
x1 \\
x2 \\
・\\
・\\
xn
\end{bmatrix}
= θ0x0 + θ1x1 + θ2x2 + ・・・ + θnxn,
which is exactly hθ(x) from the formula above. In other words,
hθ(x) = θ^T x
This is multivariate linear regression. It was the most difficult content since the beginning of Part 1…
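Here is a small sketch of hθ(x) = θ^T x in NumPy, using the x^(2) example from 2-1 with x0 = 1 prepended. The θ values are placeholders I made up purely for illustration.

```python
import numpy as np

# x^(2) from the table, with x0 = 1 added at the front -> an (n + 1)-dimensional vector
x = np.array([1.0, 1416.0, 3.0, 2.0, 40.0])

# Placeholder parameters theta_0 .. theta_4 (made-up values)
theta = np.array([80.0, 0.1, 50.0, 30.0, -2.0])

h = theta @ x    # theta^T x, the inner product
print(h)         # the predicted price for this example under these made-up parameters
```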
In the previous section, we had the idea of linear regression when there are multiple elements. In this section, we will consider how to fit the parameters to the hypothesis.
Based on Section 2, we will transform the formula.
```math
J(\theta_0, \theta_1, \ldots, \theta_n) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2
```
→
```math
J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2
```
The reason for the above formula is
θ =
\begin{bmatrix}
θ0 \\
θ1 \\
θ2 \\
・\\
・\\
・\\
θn \\
\end{bmatrix}
That's why.
next,
θj := θj - α ∂/∂θj J(θ0, θ1, ... ,θn)
→ θj := θj - α ∂/∂θj J(θ)
OK?
Now let's write out the steepest descent method.
Group A: when n = 1:
**Pattern 1**
```math
\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)
```
**Pattern 2**
```math
\theta_1 := \theta_1 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x^{(i)}
```
Group B: when n is greater than 1:
**Pattern 1**
```math
\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)}
```
**Pattern 2**
```math
\theta_1 := \theta_1 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_1^{(i)}
```
First, if you compare A-1 and B-1, you will notice that they are the same. This is because the rule x0^(i) = 1 means that multiplying by x0^(i) changes nothing.
The same is true of A-2 and B-2,
because when n = 1, x^(i) = x1^(i).
You don't have to memorize the formulas up to this point, but please review them carefully.
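For reference, here is a vectorized sketch of these updates in NumPy. The design matrix X already contains the x0 = 1 column as described in Section 2; the data, the learning rate, and the iteration count are placeholders of mine.

```python
import numpy as np

# Toy design matrix: first column is x0 = 1, the rest are features (size, bedrooms)
X = np.array([[1.0, 2104, 5],
              [1.0, 1416, 3],
              [1.0, 1534, 3],
              [1.0,  852, 2]])
y = np.array([4600.0, 2320.0, 3150.0, 1780.0])
m = len(y)

theta = np.zeros(X.shape[1])
alpha = 1e-7                   # tiny learning rate because the features are not scaled yet

for _ in range(1000):
    error = X @ theta - y                         # h_theta(x^(i)) - y^(i) for every i at once
    theta = theta - alpha * (X.T @ error) / m     # every theta_j updated simultaneously

print(theta)
```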
Section 4 presents practical ideas for making the steepest descent method work well.
First, let me introduce **feature scaling**.
For example, let x1 be the size of a house, between 0 and 2000 m².
Let x2 be the number of bedrooms, between 1 and 5.
In this situation, the contour graph for the steepest descent method follows one of two patterns: a thin, stretched-out one, or one that is closer to a circle. (Both figures are quoted from https://www.coursera.org/learn/machine-learning/lecture/xx3Da/gradient-descent-in-practice-i-feature-scaling)
What do these rings mean in the first place? The contour lines are a two-dimensional view of the bowl-shaped cost function shown in 7-4 of Part 1, so the center is the answer. This is the landscape the steepest descent method moves across.
Now, the red line represents the steps taken to reach the answer. It is clearly the latter graph, the one closer to a circle, that reaches the answer more efficiently.
So how do you get a graph that is close to a circle? The answer is simple: bring the ranges of the variables x1 and x2 (and therefore the scales of θ1 and θ2) closer to each other.
For example, here we multiply x1 by 1/2000 and x2 by 1/5, i.e., divide each feature by its maximum value. Doing so produces the latter graph; the former graph uses the data as it is.
This operation is called **feature scaling**.
So what is the guideline for how close the values need to be? As a premise, if the ranges of the data are already reasonable, there is no need to scale at all. **How do you decide whether to scale or not?**
Useful guideline examples come from Professor Andrew Ng of Stanford University.
-3 to 3 … OK
-1/3 to 1/3 … OK
-0.001 to 0.001 … NG (my father's gag)
-100 to 100 … NG
In short, it is a **rough guideline**.
Earlier, we divided the data by its maximum value. Another commonly used technique is
**mean normalization**.
This takes each feature xi and shifts it so that its mean ends up close to 0.
However, as you know, x0 is always 1, so note that this cannot be applied to x0.
**What you do is this:**
x1 = (size − mean) / maximum of x1 (here, x1 = (size − 1000) / 2000)
x2 = (number of bedrooms − mean) / maximum (here, x2 = (number of bedrooms − 2) / 5)
This brings the values close to zero. Strange, isn't it?
**Feature scaling does not have to be exact**; roughly right is basically fine.
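As a sketch of mean normalization, here it is applied to the two features in this example with NumPy. The raw values are taken from the table in Section 2; dividing by the maximum follows the formula in the text (dividing by the range or the standard deviation is also common), and x0 is left untouched as noted above.

```python
import numpy as np

size     = np.array([2104.0, 1416.0, 1534.0, 852.0])   # roughly the 0-2000 range
bedrooms = np.array([5.0, 3.0, 3.0, 2.0])              # roughly the 1-5 range

def mean_normalize(feature):
    """(value - mean) / maximum, following the formula in the text."""
    return (feature - feature.mean()) / feature.max()

x1 = mean_normalize(size)
x2 = mean_normalize(bedrooms)
print(x1)   # values now centered near 0 and on a similar scale ...
print(x2)   # ... which makes the contour plot rounder and descent faster
```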
e.g.)
For example, suppose you have two features for a home.
x1 is the length of the frontage (width facing the road)
x2 is the length of depth
Of course it is natural to use x1 and x2 as they are, but if we assume that it is the area of the land that determines the price of the house, then as a third feature we could define
x3 = x1 × x2 (the land area).
After all, x1 and x2 just happened to be the features we were given first. By the way, I have the impression that this kind of feature engineering is often used in machine learning competitions such as Kaggle. (Or so I hear.)
In defining the features, I would like to introduce you to what is called ** polynomial regression **.
There are many variations of the linear regression equation. For example, here.
hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3
And this can also be expressed as
hθ(x) = θ0 + θ1(size) + θ2(size)² + θ3(size)³
if we define
x1 = (size)
x2 = (size)²
x3 = (size)³.
This is a cubic function. This is just one example, but consider a graph plotting data that shows the relationship between house size and price.
In a case like this, a cubic function works well. Why?
A linear function gives a straight line, and a quadratic function eventually curves back down. It is hard to imagine the price of a house dropping just because the lot is too large.
And ** remember, feature scaling **. I will use it from now on.
If you set ** size: 1-1000 **,
x1 = 1-1000
x2 = 1-1,000,000
x3 = 1 to 1,000,000,000.
With ranges this different, this is exactly where feature scaling comes into play.
However, please do the calculation yourself. I haven't finished my school homework today (laughs)
Earlier I wrote that a quadratic function is hard to use because it eventually turns back down, but with features chosen this way it can still be used.
If you use the formula above, you can keep the graph from coming back down. Why that works is homework from the teacher (laughs).
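Here is a sketch of building those polynomial features and scaling them, with made-up size values in the 1-1000 range discussed above; the scaling simply divides each feature by its own maximum, as in Section 4.

```python
import numpy as np

size = np.array([120.0, 450.0, 800.0, 1000.0])   # made-up sizes in the 1-1000 range

# Polynomial features: x1 = size, x2 = size^2, x3 = size^3
x1, x2, x3 = size, size ** 2, size ** 3
print(x1.max(), x2.max(), x3.max())   # about 1e3, 1e6, 1e9 -> wildly different ranges

# Feature scaling: divide each feature by its own maximum so they all sit in (0, 1]
X = np.column_stack([x1 / x1.max(), x2 / x2.max(), x3 / x3.max()])
print(X)
```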
**So how do you choose features, after all?**
Please read Part 3, Part 4, and the rest of the series as they come out. Then you will surely understand.
(I can't say much while Part 3 is still being written...)
By the way, rest assured that in the end **the algorithm can select them automatically**. (Getting ahead of myself here.)
The normal equations provide a better way to find the parameter θ in some linear regression problems.
Specifically, the algorithm used for linear regression so far has been the steepest descent method.
The steepest descent method was an algorithm that converges on the optimal solution by repeating many small steps.
In contrast, the **normal equation** solves for θ analytically, reaching the optimal solution in one shot with no iterative algorithm at all, which makes it feel like the **strongest** method. (Later we will notice a big drawback.)
First, let's get an intuitive understanding.
As a premise, assume for now that θ is a single real number, not a vector. Please keep that in mind.
Consider an expression like this:
J(θ) = aθ² + bθ + c
In other words, consider a cost function J of this form.
In reality, though, remember that the θ we are dealing with is an (n + 1)-dimensional vector.
Its cost function looks like this:
```math
J(\theta_0, \theta_1, \ldots, \theta_n) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2
```
The problem is how to minimize this cost function J. One answer is to take the partial derivative of J with respect to each parameter θj, set it to zero, and solve; that derivation exists, but in this section we take a more conceptual approach.
Now, suppose you have a dataset like the one above.
Add a column for x0 (all 1s) and gather the features into a matrix X.
Do the same with y.
The matrix X above is m × (n + 1), and y is an m-dimensional vector. As a reminder, m is the number of cases in the dataset and n is the number of features (the variables x).
**So, what is the normal equation?**
```math
\theta = (X^T X)^{-1} X^T y
```
Remember that the T written above a matrix is the transpose, and the −1 is the inverse matrix, both of which appeared in Part 1.
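Here is a sketch of the normal equation in NumPy, reusing the same toy design matrix idea (X already has the x0 = 1 column). `np.linalg.pinv` is the pseudo-inverse, which still works when XᵀX happens not to be invertible; the equivalent one-liner in Octave would use `pinv` as well. The data values are placeholders of mine.

```python
import numpy as np

# Design matrix with the x0 = 1 column, plus y (toy values)
X = np.array([[1.0, 2104, 5],
              [1.0, 1416, 3],
              [1.0, 1534, 3],
              [1.0,  852, 2]])
y = np.array([4600.0, 2320.0, 3150.0, 1780.0])

# theta = (X^T X)^-1 X^T y : solved in one shot, no iterations, no feature scaling needed
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)
```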
In this second article, I was struck by the fact that I accidentally deleted data twice, which made me seriously think about working on a cloud basis. Well, which one are you?
Type A: it looked difficult, so you just scrolled down to this point.
Type B: you read everything, wondering what Type B would turn out to be.
I suspect about 95% of you are Type A...
Nice to meet you!
**Part 3 will be released Friday at midnight.**
Week 3 Published at 21:00 on the 16th
Week 4 Published at 21:00 on the 16th
Week 5 Published at 21:00 on the 17th
Week 6 Published at 21:00 on the 19th
Week 7 Published at 21:00 on the 20th
Week 8 Published at 21:00 on the 22nd
Week 9 Published at 21:00 on the 23rd
Week 10 Published at 21:00 on the 24th
Week 11 Published at 21:00 on the 26th
How was it? Right now much of this is still just the table of contents, but I will keep adding content. The reason I am working so hard is that the beginner tutorial in the computer club I belong to is about making games in C#, and I am convinced that Python and machine learning are the way to go from now on. If you enjoyed this, please give it a like.