Machine learning that junior high school students can fully understand with just this [updated daily]

0 Preface

I'm jp_Pythonia, a first-year junior high school student. Nice to meet you. Since this article got a lot of attention and the amount of information is large, I have split it into parts.

Part 1 Part 2

1 What is machine learning?

What is machine learning? Tom Mitchell defines it as follows:

A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

There are two main types of machine learning. In the next two sections, we will discuss them: **"supervised learning"** and **"unsupervised learning"**.

2 What is supervised learning?

Supervised learning is the ** most common ** machine learning problem.

Its main features are the following two points:

- There is a correct answer in the given dataset
- It can be divided into classification problems and regression problems

2-1-1 There is a correct answer in the given dataset

→ In supervised learning, there is a "correct answer" in every given dataset.

e.g.)

Suppose we developed an algorithm that predicts the weather: it determines whether tomorrow will be rainy or sunny and outputs the result.

All of the weather data it learned from stated whether each day was rainy or sunny.

A case like the above example is called **"supervised learning"**.

This is because the "correct answer" for the algorithm, the rain-or-sunny label, was attached to the data from the beginning.

2-1-2 Can be divided into classification problems and regression problems

Supervised learning can be broadly divided into two genres.

They are the "classification problem" and the "regression problem".

2-2 What is a classification problem?

The weather-prediction algorithm shown earlier falls under classification problems, because it **"classifies"** the output as **rain or sunny**.

By the way, data such as rain-or-sunny is called a "discrete value".

2-3 What is a regression problem?

Regression problems, on the other hand, are a type of problem that deals with "continuous values".

e.g.)

We have developed an algorithm to predict tomorrow's temperature.

The data it learned from is pairs of weather information and temperature.

In the case of the above example, the temperature of the output data is a "continuous value", so it is called a regression problem.

3 Unsupervised learning

Unsupervised learning Unsupervised learning allows you to tackle problems with little or no understanding of what the results will look like. You can derive the structure from data where the effect of the variable is not always known. You can derive this structure by clustering the data based on the relationships between the variables in the data. In unsupervised learning, there is no predictive feedback.

Quoted from https://www.coursera.org/learn/machine-learning/supplement/1O0Bk/unsupervised-learning

As quoted above, unsupervised learning has two main characteristics:

- Unlike supervised learning, there is no "correct answer" in the data
- It uses clustering algorithms

3-1 What is a clustering algorithm?

A clustering algorithm is an algorithm that is useful when working with data where

- all the labels are the same,
- there are no labels in the first place, or
- the data is otherwise unlabeled.

The clustering algorithm finds rules, genres, and so on in the given data, creates clusters by itself, and sorts the data into them.

4 Linear regression (model representation)

4-1 Oops ... before that

I prepared a dataset. it's here.

| Case | Land area (x) | Land price (y) |
|---|---|---|
| 1 | 2104 | 460 |
| 2 | 1416 | 232 |
| 3 | 1534 | 315 |
| 4 | 852 | 178 |
| ... | ... | ... |
| 8 | 680 | 143 |

We will define the ** characters ** that we will use from now on.

- **m** ... the number of cases (training examples). m = 8 in the dataset above.
- **x** ... the land area. x^(i) denotes the area of the i-th land: x^(2) = 1416.
- **y** ... the land price. y^(i) denotes the price of the i-th land: y^(4) = 178.
- **h** ... an abbreviation meaning "hypothesis".
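To make the notation concrete, here is a minimal Python sketch (the variable names are my own, not from the article) that stores the dataset above and looks up m, x^(2), and y^(4):

```python
# Hypothetical sketch of the dataset above (only the rows shown in the table are listed).
x = [2104, 1416, 1534, 852]   # land area: x^(1) .. x^(4)
y = [460, 232, 315, 178]      # land price: y^(1) .. y^(4)

m = len(x)     # number of cases actually listed here
print(m)       # 4 (the full dataset in the article has 8)
print(x[1])    # x^(2) = 1416 (Python lists are 0-indexed, so index 1)
print(y[3])    # y^(4) = 178
```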

4-2 Decide how to express the hypothesis

Create an expression that expresses the hypothesis h. Here is it.

hθ(x) = θ0 + θ1x

By the way, this may look difficult at first glance, but can you see that it is very similar to the linear function formula Y = B + AX that you learn in junior high school? The linear function has a graph like the one below,

(figure: graph of a linear function)

so you can see it is essentially the same thing.

**So h_θ(x) represents a straight line.**

**This is an easy place to get stuck, so if you memorize only one thing, memorize this.**

5 Cost function (objective function)

5-1 In the first place: Purpose

Continuing from the model representation above, we will define the objective function. In the first place, the objective is to draw the best-fitting line for a given (plotted) dataset. For example, there is a figure like this. (The data has been plotted.)

(figure: plotted data points (blue) with a fitted straight line (red))

At this time, the purpose is to find the red line that minimizes the squares of the distances to the blue points.

5-2 How to select parameters

To achieve the purpose introduced in 5-1, we need a formula that performs this minimization. However, the formula alone is hard to understand at a glance, so I will give an explanation that leads to an intuitive understanding.

First of all, "parameters" are the names θ_0 and θ_1. Depending on what numbers you plug into them, the position and slope of the red line in the graph above change. This should be no problem if you did linear functions in junior high school.

e.g.)

When θ_0 = 1.5 and θ_1 = 0, it looks like this:

(figure: the horizontal line at height 1.5)

When θ_0 = 1 and θ_1 = 0.5, it looks like this:

(figure: the line with intercept 1 and slope 0.5)

5-3 Expression to select parameters

After all, what you want to do is find the values of θ_0 (the intercept) and θ_1 (the slope) that make the line fit the data.

Let's make a formula.

5-3-1 Step 1

(h_θ(x) - y)²

I will explain the above formula.

(figure: the plotted data points (blue) and the straight line h_θ(x) (red))

The Step 1 formula takes the value of the red straight line h_θ(x) in the graph above, subtracts the actual value y, and squares the difference.

5-3-2 Step 2

Remember 4-1? You could refer to the i-th value of x as x^(i).

The Step 1 formula doesn't specify which sample it refers to and is very abstract, so we sum over all the samples. Here it is:

Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i))²

Σ_{i=1}^{m} means we add up the terms from i = 1 (case number 1) to m (the number of all cases).

Also, (h_θ(x) - y)² has been replaced by (h_θ(x^(i)) - y^(i))², which means we subtract the i-th y from the prediction for the i-th x.

5-3-3 Step 3

Finally, multiply by 1/2m to simplify the later calculations.

1/(2m) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i))²

The factors 1/2 and 1/m do not change where the minimum is; they are only there to **simplify things**, so there is no deeper meaning.

5-4 What is a cost function after all?

As I wrote in 5-1, the cost function is the function we minimize when choosing the numbers θ_0 and θ_1 so that the line fits the plotted data exactly. That is why I introduced the formula for selecting the parameters.

After all, the cost function is written **J(θ_0, θ_1)**. So the cost function can be expressed as

**J(θ_0, θ_1) = 1/(2m) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i))²**

Therefore, from now on, the cost function will be represented by J.
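To make J concrete, here is a small Python sketch of mine (not the article's code) that computes the cost for a given θ_0 and θ_1 on the dataset from 4-1:

```python
# A minimal sketch of the cost function J(θ0, θ1), assuming the small dataset from 4-1.
x = [2104, 1416, 1534, 852]
y = [460, 232, 315, 178]

def cost(theta0, theta1):
    m = len(x)
    # h_θ(x^(i)) = θ0 + θ1 * x^(i); sum the squared errors over all cases
    total = sum((theta0 + theta1 * xi - yi) ** 2 for xi, yi in zip(x, y))
    return total / (2 * m)

print(cost(0, 0.2))   # try θ0 = 0, θ1 = 0.2 and see how large the cost is
```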

6 Cost function (objective function) # 2

6-1 Simplification

So far, we have used h_θ(x) = θ_0 + θ_1x. Here we simplify it to

h_θ(x) = θ_1x

which is the same as saying θ_0 = 0.

Since there is only one parameter, the formula for the cost function J is

J(θ_1) = 1/(2m) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i))²

Can be expressed as.

6-2 Deepen understanding

With the simplification from 6-1, since B = θ_0 = 0 in the linear function Y = AX + B, the hypothesis line now passes through the origin.

By the way, the hypothesis h_θ(x) is a function of x: once θ_1 has been decided, it is an expression in x.

The cost function J, on the other hand, is a function of θ_1. You can see that from 6-1.

6-3 Connection between hypothesis and cost function

First, suppose you have three data points at positions (1,1), (2,2), (3,3).

The straight line of the hypothesis h_θ(x) changes depending on the value of θ_1 (remember that we set θ_0 = 0).

(figure: the three data points and three candidate lines with different values of θ_1)

Now, based on the above hypothesis, we will calculate the numerical value to be plotted on the objective function J.

First, the purple line. The green circles mark the data, and the differences between the line and the data are 0.5, 1.0, and 1.5 in ascending order.

Now, let's apply these three to the formula of the cost function J.

J(θ_1) = 1/(2m) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i))²

If you calculate this,

J(θ_1) = (0.5² + 1² + 1.5²) / 6 = 3.5 / 6 ≈ 0.58

Similarly, on the magenta (pink) colored line,

J(θ1) = 0

In the red line

J(θ_1) = 14 / 6 ≈ 2.33

Then plot the three values you just got.

The result is here.

(figure: the three values of J(θ_1) just computed, plotted against θ_1)

The line color and the circle color correspond. The graph above shows the objective function J.
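You can check the three plotted values with a few lines of Python. This is a sketch of mine, assuming the three data points (1,1), (2,2), (3,3) and h_θ(x) = θ_1x:

```python
# Verify J(θ1) for the three lines, with data (1,1), (2,2), (3,3) and θ0 = 0.
x = [1, 2, 3]
y = [1, 2, 3]
m = len(x)

def J(theta1):
    return sum((theta1 * xi - yi) ** 2 for xi, yi in zip(x, y)) / (2 * m)

print(J(0.5))  # ≈ 0.58  (the purple line)
print(J(1.0))  # 0       (the magenta line)
print(J(0.0))  # ≈ 2.33  (the red line, assuming it is the horizontal line θ1 = 0)
```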

7 The steepest descent method (gradient descent method)

Section 7 describes an algorithm called the **steepest descent method (gradient descent)**, which minimizes the objective function J defined in 5-4.

7-1 Image

The steepest descent method looks like this.

(figure: the cost function as a 3D surface, with gradient descent stepping downhill)

Quoted from https://linky-juku.com/simple-regression-analysis3/

(figure: a cost surface with more than one valley, where descent can end in a local solution)

Quoted from https://linky-juku.com/simple-regression-analysis3/

However, as a weakness, as shown in the figure above, the answer may turn out to be what is called a **local solution** (a local optimum rather than the desirable global optimum).

7-2 Mathematical commentary

This is the formula to use:

θ_j := θ_j - α ∂/∂θ_j J(θ_0, θ_1)   (repeated for j = 0 and j = 1 until convergence)

Step 1

The := part means that the right-hand side is assigned to the left-hand side, updating it.

Step 2

α is the learning rate. The steepest descent method reaches the minimum in a number of steps, as in the figure below; the larger α is, the larger each step is.

(figure: steps taken toward the minimum of the cost function)

Step 3

Do you remember the cost function? As you know, J(θ_0, θ_1) is exactly that cost function.

So the derivative in the update rule can be expanded like this:

∂/∂θ_j J(θ_0, θ_1) = ∂/∂θ_j [ 1/(2m) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i))² ] = ∂/∂θ_j [ 1/(2m) Σ_{i=1}^{m} (θ_0 + θ_1x^(i) - y^(i))² ]

Then, consider the patterns of θ 0 </ sub> and θ 1 </ sub>.

θ_0 (j = 0): ∂/∂θ_0 J(θ_0, θ_1) = 1/m Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i))

θ_1 (j = 1): ∂/∂θ_1 J(θ_0, θ_1) = 1/m Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i)) · x^(i)

OK?

7-3 Important points in the steepest descent method (gradient descent method)

Actually, as you may have noticed while looking at the formula for the steepest descent method, there is one very important point.

It is **updating θ_0 and θ_1 at the same time**.

temp0 := θ_0 - α ∂/∂θ_0 J(θ_0, θ_1)

temp1 := θ_1 - α ∂/∂θ_1 J(θ_0, θ_1)

θ_0 := temp0

θ_1 := temp1

In the lines above, the right-hand sides are first assigned to the variables temp0 and temp1, and only after that to θ_0 and θ_1.

So what happens if you don't do this and instead update like this?

(figure: the incorrect version, in which θ_0 is updated before temp1 is computed)

Actually, there is a big problem. By the time the expression on the second line has run, θ_0 has already been replaced by the value of temp0, so the expression on the third line ends up using the new value of θ_0 on its right-hand side.

Please note that **simultaneous updates are mandatory**, regardless of whether the problem actually shows up in a given case.
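As an illustration of the loop with simultaneous updates, here is a rough Python sketch of mine (not the article's code), using the three points (1,1), (2,2), (3,3) from 6-3 and a hand-picked learning rate α:

```python
# Gradient descent for h_θ(x) = θ0 + θ1·x with simultaneous updates (a sketch, not the article's code).
x = [1, 2, 3]
y = [1, 2, 3]
m = len(x)
alpha = 0.1                 # learning rate, chosen by hand for this toy data
theta0, theta1 = 0.0, 0.0

for _ in range(1000):
    # compute both gradients BEFORE touching θ0 or θ1 (simultaneous update)
    grad0 = sum((theta0 + theta1 * xi - yi) for xi, yi in zip(x, y)) / m
    grad1 = sum((theta0 + theta1 * xi - yi) * xi for xi, yi in zip(x, y)) / m
    temp0 = theta0 - alpha * grad0
    temp1 = theta1 - alpha * grad1
    theta0, theta1 = temp0, temp1

print(theta0, theta1)       # ends up close to θ0 = 0, θ1 = 1 for this data
```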

7-4 Solving the steepest descent method with linear regression

As a premise, here we apply the steepest descent method to the cost function of linear regression.

(figure: left, the hypothesis line drawn over the data; right, the cost function drawn as contour lines)

Quoted from (https://www.coursera.org/learn/machine-learning/lecture/kCvQc/gradient-descent-for-linear-regression)

Could you see what the two images above show?

The figure on the right is the bowl-shaped cost function from earlier, drawn with contour lines so that it can be viewed in two dimensions.

In the figure, moving the blue line (the hypothesis) moves the red cross mark (the current cost). As you can guess, the left side is the linear regression hypothesis and the right side is the cost function: changing the slope of the hypothesis line changes the value of the cost function. This is what I wanted to convey in 7-4.

Also, for linear regression the cost function is bowl-shaped (convex), so here the steepest descent method does not get stuck in a merely local optimum.

8 Matrix and vector

8-1 What is a matrix?

\begin{bmatrix}
a & b \\
c & d \\
e & f
\end{bmatrix}

For example, the above is a matrix. Its dimensions are 3 × 2 (3 rows by 2 columns).

8-2 What is a vector?

A vector is a matrix that has only one column,

y = 

\begin{bmatrix}

a \\
c \\
e 
\end{bmatrix}

and it represents an n × 1 matrix, as shown above (here n = 3).

8-3 rules

There are some rules.

For example

y_i = the i-th element (a in the example above, when i = 1)

**However, you have to be careful about whether the index starts from 0 or from 1.**

In this series, we basically use vectors whose indexes start from 1.

Also, most people use uppercase letters for matrices and lowercase letters for vectors and scalars (the raw data).

8-4 Matrix addition and subtraction

The rule for adding and subtracting matrices is that

**only matrices of the same dimensions**

can be added or subtracted.

\begin{bmatrix}
1 & 0 \\
2 & 5 \\
3 & 4
\end{bmatrix}
+
\begin{bmatrix}
4 & 7 \\
2 & 9 \\
0 & 8
\end{bmatrix}
=
\begin{bmatrix}
5 & 7 \\
4 & 14 \\
3 & 12
\end{bmatrix}

As you can see above, the two matrices being added and the resulting matrix are each 3 × 2, and the numbers in the same vertical and horizontal positions are added together.

8-5 scalar multiplication

"Scalar" may sound grand, but it just means a real number. So

3 ×
\begin{bmatrix}
2 & 5 \\
3 & 4
\end{bmatrix}
= 
\begin{bmatrix}
6 & 15 \\
9 & 12
\end{bmatrix}

That's all there is to it.

8-6 Multiply a matrix by a vector

It goes like this.

\begin{bmatrix}
2 & 5 \\
3 & 4
\end{bmatrix}

\begin{bmatrix}
6  \\
9 
\end{bmatrix}
=
\begin{bmatrix}
57  \\
54 
\end{bmatrix}

Name the matrices A, B, C in order from the left.

Step 1: Multiply the left entry of A's top row by the top entry of B, and the right entry of A's top row by the bottom entry of B, then add them: 2 × 6 + 5 × 9 = 57.

Step 2: Do the same with A's bottom row: 3 × 6 + 4 × 9 = 54.

By the way, this time A was 2 × 2, B was 2 × 1, and C was 2 × 1. That is not a coincidence; **there are fixed rules**:

A → m × n (2 × 2)

B →n × 1(2 × 1)

C → m × 1(2 × 1)

Please be careful as you will use this rule from now on.

8-7 Multiply two matrices

**In fact, there is a way to avoid iterative algorithms altogether in linear regression, and this section leads toward it.** What we are going to compute this time is

\begin{bmatrix}
2 & 5 & 2 \\
3 & 4 & 7
\end{bmatrix}
×
\begin{bmatrix}
6 & 8\\
9 & 3\\
2 & 3
\end{bmatrix}
=

Here is how. First, split B (the right-hand matrix) into two column vectors, as below.

\begin{bmatrix}
2 & 5 & 2 \\
3 & 4 & 7
\end{bmatrix}
×
\begin{bmatrix}
6\\
9 \\
2
\end{bmatrix}
=

\begin{bmatrix}
2 & 5 & 2 \\
3 & 4 & 7
\end{bmatrix}
×
\begin{bmatrix}
 8\\
 3 \\
 3
\end{bmatrix}
=

Then compute each product.

\begin{bmatrix}
2 & 5 & 2\\
3 & 4 & 7
\end{bmatrix}
×
\begin{bmatrix}
6\\
9 \\
2
\end{bmatrix}
=
\begin{bmatrix}
61\\
68
\end{bmatrix}

\begin{bmatrix}
2 & 5 & 2\\
3 & 4 & 7 
\end{bmatrix}
×
\begin{bmatrix}
 8\\
 3 \\
3
\end{bmatrix}
=
\begin{bmatrix}
37\\
57
\end{bmatrix}

Finally, put the two resulting vectors together to get C:

\begin{bmatrix}
61& 37 \\
68 & 57
\end{bmatrix}

that's all.
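Both the matrix-vector product from 8-6 and the matrix-matrix product above can be checked with NumPy. This is just my own illustration (the article itself uses Octave for this kind of calculation):

```python
import numpy as np

# Matrix × vector from 8-6
A = np.array([[2, 5],
              [3, 4]])
v = np.array([6, 9])
print(A @ v)          # [57 54]

# Matrix × matrix from 8-7
B = np.array([[2, 5, 2],
              [3, 4, 7]])
C = np.array([[6, 8],
              [9, 3],
              [2, 3]])
print(B @ C)          # [[61 37]
                      #  [68 57]]
```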

8-8 Do not make a mistake in the order of multiplication

Of course, you must not swap A × B into B × A; matrix multiplication is generally not commutative. Think back to the dimension rule in 8-6 to see why.

I don't want to write only obvious things, so I'll simply say: convince yourself of why the order of multiplication matters.

8-9 Inverse of matrix

The number 1 is the identity, the unit of the real numbers, because multiplying by 1 leaves a number unchanged. Real numbers have reciprocals; for example, the reciprocal of 3 is 1/3.

However, you can also see that not every real number has a reciprocal. For example, 0 does not.

Matrices have an analogous notion, called the inverse or inverse matrix. What is it? Let's conclude this article with that.

Let A be an m × m matrix, and suppose it has an inverse. We write the inverse matrix as A⁻¹.

Only m × m matrices, which are also called square matrices, can have an inverse.

For example, suppose you have one square matrix.

​```math 
\begin{bmatrix}
3& 4 \\
2 & 16
\end{bmatrix}
​```

Also, I happened to know the inverse matrix.

​```math 
\begin{bmatrix}
0.4& -0.1 \\
-0.05 & 0.075
\end{bmatrix}
​```

And when this matrix is multiplied,

​```math 
=
\begin{bmatrix}
1& 0 \\
0 & 1
\end{bmatrix}
​```

= I_{2×2}. (I is the symbol representing the identity matrix.)

Of course it is possible to compute an inverse matrix by hand, but hardly anyone does that nowadays. So how do you calculate it?

** I use a software called Octave **

Rest assured that Part 2 will explain how to use Octave.

Now, let's summarize the terms.

A matrix that does not have an inverse matrix is called a singular matrix or a degenerate matrix.

**Really, finally, I will show you how to compute the transpose.**

​```math
\begin{bmatrix}
1 & 2 & 0 \\
3 & 4 & 9
\end{bmatrix}
​```
→
​```math
\begin{bmatrix}
1 & 3  \\
2 & 4 \\
0 & 9
\end{bmatrix}
​```

If you thought, "Is that all?", you're right. That's all. The first row becomes the first column and the second row becomes the second column. That's it. Thank you for your hard work.
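The inverse, the identity, and the transpose from this section can also be checked in a few lines of NumPy (again a sketch of mine; the article uses Octave):

```python
import numpy as np

A = np.array([[3.0, 4.0],
              [2.0, 16.0]])

A_inv = np.linalg.inv(A)
print(A_inv)            # [[ 0.4   -0.1  ]
                        #  [-0.05   0.075]]
print(A @ A_inv)        # ≈ the 2 × 2 identity matrix I (up to rounding)

M = np.array([[1, 2, 0],
              [3, 4, 9]])
print(M.T)              # the transpose: [[1 3] [2 4] [0 9]]
```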

9 Thank you

Thank you for reading this far.

~~I will post a link when the sequel is ready, so please read it.~~ I have written Part 2.

To Week 2: Welcome

Welcome to "Machine Learning Part 2 that even junior high school students can fully understand". If you haven't read Part 1 yet, you can still follow this part up to section 1 without it.

1 Install Octave

1-1 Windows

Sorry, my environment is not Windows, so I will only give a brief explanation and links.

First, use the here link to install.

As a reference site, we will also introduce here.

1-2 Mac OS

**We have confirmed that this works on macOS High Sierra.**

You may refer to here.

Step 1

Install Homebrew

 /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Please put it in the terminal. (It's one line)

Step 2

Also include Homebrew Cask.

brew tap caskroom/cask

By the way, if you get an error, try typing `brew cask`. If you see `See also "man brew-cask"`, it is already installed.

Step 3

Install XQuartz

Install it using Homebrew Cask.

brew cask install xquartz

Step 4

Install Octave

brew install octave

Let's wait patiently; this takes a while.

Step 5

Start Octave

octave

This should start.

Octave:1> 

It's OK if it comes out

Type exit to quit, and you will be back in the normal console.

** The method of Step 1-5 so far is quoted from here: **

https://blog.amedama.jp/entry/2017/04/28/232547

1-3 Linux

Sorry. I might have a usable environment (Zorin OS), but I haven't tried it because it's a hassle. Please look it up yourself.

2 Multivariable linear regression

2-1 Premise

Wow ... there are many kanji ...

Section 2 discusses more powerful linear regression that works for multiple variables.

First, in the linear regression so far, y was predicted from a single piece of information x; since there was only one x, this is called a single feature (single variable). However, when predicting the price of a house or tomorrow's stock price, it is difficult to predict from only one piece of information.

Therefore, we will use multiple pieces of information using multivariable linear regression.

Then, I will present the database as a sample. it's here.

Previously there was just one x, but now the features are numbered x_1 through x_4.

| Case No. (i) | Size (x_1) | Bedrooms (x_2) | Floors (x_3) | Age in years (x_4) | Price (y), 10,000 yen |
|---|---|---|---|---|---|
| 1 | 2104 | 5 | 1 | 45 | 4600 |
| 2 | 1416 | 3 | 2 | 40 | 2320 |
| 3 | 1534 | 3 | 2 | 30 | 3150 |
| 4 | 852 | 2 | 1 | 36 | 1780 |
| ... | ... | ... | ... | ... | ... |

Then, we will define the prerequisite numbers.

n ... the number of features (the variables x). Here n = 4.

x^(i) ... the vector of all the features of the i-th case. For example, x^(2) is

\begin{bmatrix}
1416 \\
3 \\
2 \\
40
\end{bmatrix}

Note that (i) here is an index, not "x squared". The vector above is n-dimensional (here 4-dimensional, i.e. 4 × 1).

x_j^(i) ... the value of the j-th feature in the i-th case (j is written as a subscript and (i) as a superscript). For example, x_3^(2) = 2, the third entry of the vector above.

2-2 Form of linear regression

Until now, it was like this.

hθ(x) = θ0 + θ1x

However, since x is no longer one, the form of the expression also changes. Here is it.

hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + θ4x4

However, the formula above has a problem: it keeps growing as n increases.

In other words, writing the formula out in markdown gets surprisingly tiring. So let's simplify it.

What we do is define x_0 = 1.

What this means: for each case i there is a feature vector x^(i), and using the x_j^(i) notation from 2-1 we define x_0^(i) = 1. In other words, for every case i we define a "zeroth feature" that is always equal to 1, even though x_0 does not exist in the original data. This makes x^(i) an (n + 1)-dimensional vector. OK?

**The above content is difficult. Please write it out on paper and read it repeatedly. By about the third read, you should be able to understand it, as I did.**

And we think of parameters as vectors.

θ = 
\begin{bmatrix}
θ_0 \\
θ_1 \\
θ_2 \\
・\\
・\\
・\\
θ_n
\end{bmatrix}

And now you can transform the shape of the formula.

h_θ(x) = θ_0 + θ_1x_1 + θ_2x_2 + θ_3x_3 + θ_4x_4

= θ_0x_0 + θ_1x_1 + θ_2x_2 + θ_3x_3 + θ_4x_4   (since x_0 = 1)

Next, we transpose θ. In short, we change its shape as a vector.

θ = 
\begin{bmatrix}
θ_0 \\
θ_1 \\
θ_2 \\
・\\
・\\
・\\
θ_n
\end{bmatrix}
→ θᵀ = 
\begin{bmatrix}
θ_0 & θ_1 & θ_2 & ・ & ・ & ・ & θ_n
\end{bmatrix}

What was an (n + 1) × 1 vector has become a 1 × (n + 1) vector. This is the transpose, and it is written with the letter T. And when this is multiplied by the vector x,

\begin{bmatrix}
θ_0 & θ_1 & θ_2 & ・ & ・ & ・ & θ_n
\end{bmatrix}
×
\begin{bmatrix}
x0 \\
x1 \\
x2 \\
・\\
・\\
xn
\end{bmatrix}
= 

= θ_0x_0 + θ_1x_1 + θ_2x_2 + ・・・ + θ_nx_n = h_θ(x) (the formula above)

= θᵀx

This is multivariable linear regression in vectorized form. It was the most difficult content so far in this series...
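In code, the vectorized hypothesis h_θ(x) = θᵀx is just a dot product. Here is a small sketch of mine, with made-up parameter values and assuming x_0 = 1 has already been prepended:

```python
import numpy as np

# Made-up parameters θ0 .. θ4 (for illustration only) and one case from the table in 2-1.
theta = np.array([500.0, 1.5, 100.0, 50.0, -10.0])
x = np.array([1, 2104, 5, 1, 45])     # x0 = 1, size, bedrooms, floors, age

h = theta @ x        # θᵀx
print(h)             # the predicted price for this case under these made-up parameters
```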

3 Multiple variable steepest descent

In the previous section, we had the idea of linear regression when there are multiple elements. In this section, we will consider how to fit the parameters to the hypothesis. ‌

Based on Section 2, we will transform the formula.

J(θ_0, θ_1, ..., θ_n) = 1/(2m) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i))²

→ J(θ) = 1/(2m) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i))²

The reason for the above formula is

θ = 
\begin{bmatrix}
θ0 \\
θ1 \\
θ2 \\
・\\
・\\
・\\
θn \\ 
\end{bmatrix}

That's why.

next,

θ_j := θ_j - α ∂/∂θ_j J(θ_0, θ_1, ..., θ_n)

→ θ_j := θ_j - α ∂/∂θ_j J(θ)

OK?

And I will write the steepest descent method.

Group A, when n = 1:

pattern 1

θ_0 := θ_0 - α · 1/m Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i))

** Pattern 2 **

θ_1 := θ_1 - α · 1/m Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i)) · x^(i)

Group B, when n ≥ 1:

pattern 1

θ_0 := θ_0 - α · 1/m Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i)) · x_0^(i)

** Pattern 2 ** ‌

θ_1 := θ_1 - α · 1/m Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i)) · x_1^(i)

First, if you compare A's pattern 1 and B's pattern 1, you will notice that they are the same. This is because the rule x_0^(i) = 1 makes no difference.

The same is true for A's pattern 2 and B's pattern 2,

because in the single-variable case x^(i) = x_1^(i).

You don't have to memorize the formulas above, but please review them carefully.
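Here is a vectorized Python sketch of this update for any number of features (my own illustration, not the article's code; X is assumed to already contain the x_0 = 1 column):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iterations=5000):
    """X: m × (n+1) matrix whose first column is x0 = 1; y: length-m vector."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(iterations):
        gradient = X.T @ (X @ theta - y) / m   # ∂J/∂θ_j for every j at once
        theta = theta - alpha * gradient       # all θ_j updated simultaneously
    return theta

# Tiny made-up example with two features (plus the x0 = 1 column).
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 1.0],
              [1.0, 3.0, 3.0]])
y = np.array([5.0, 6.0, 10.0])
print(gradient_descent(X, y))   # approaches [1, 2, 1], the parameters that generated y
```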

4 Feature scaling

Section 4 presents a practical idea for making the steepest descent method work well:

feature scaling.

For example, let x_1 be the size of the house, in the range 0-2000 m².

Let x_2 be the number of bedrooms, between 1 and 5.

At this time, there are two patterns in the graph of the steepest descent method.

(figure: tall, narrow contour ellipses when the feature ranges are very different)

When,

(figure: nearly circular contours after the features are scaled)

is. (Both quoted from https://www.coursera.org/learn/machine-learning/lecture/xx3Da/gradient-descent-in-practice-i-feature-scaling)

First of all, what do these rings mean? The contour lines are a two-dimensional view of the bowl-shaped cost function shown in 7-4 of Part 1, and the center is the answer (the minimum).

The red line represents the steps the steepest descent method takes to reach the answer. As you can guess, it is the latter graph, the one closer to a circle, that reaches the answer efficiently.

So how do you get a graph that is close to a circle? The answer is simple: bring the ranges of the features x_1 and x_2 closer together.

In this example, divide x_1 (the size) by 2000 and x_2 (the number of bedrooms) by 5. Doing so gives the latter graph; the former graph uses the raw data as is.

The operation above is called **feature scaling**.

So what is the rule for bringing the values closer together? As a premise, if the ranges of the features are already about the same, there is no need to scale. **So how do you decide whether to scale or not?**

A useful rule of thumb comes from Professor Andrew Ng of Stanford University:

- −3 to 3 ... OK
- −1/3 to 1/3 ... OK
- −0.001 to 0.001 ... NG (my father's kind of gag)
- −100 to 100 ... NG

In short, it is a **rough** guideline.

Earlier, we just divided the data by its maximum value. Another commonly used method is **mean normalization**.

This takes each feature x_i and shifts it so that its mean (together with the other values of that feature) is close to 0.

However, as you know, x_0 is always 1, so note that mean normalization cannot be used on x_0.

**What you do is this:**

x_1 = (size − average size) / (maximum size), here x_1 = (size − 1000) / 2000

x_2 = (number of bedrooms − average) / (maximum), here x_2 = (bedrooms − 2) / 5

This brings the values close to zero. Neat, isn't it?

**Feature scaling does not have to be exact**; a rough adjustment is basically fine.
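A quick Python sketch of mean normalization (mine, not the article's), applied to the size and bedroom columns from the table in 2-1:

```python
# Mean normalization: x := (x - mean) / max, a rough sketch using the maximum as the
# divisor like the article does; (max - min) is another common choice.
sizes    = [2104, 1416, 1534, 852]
bedrooms = [5, 3, 3, 2]

def mean_normalize(values):
    mean = sum(values) / len(values)
    return [(v - mean) / max(values) for v in values]

print(mean_normalize(sizes))      # values are now small and centered near 0
print(mean_normalize(bedrooms))
```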

5 How to choose the feature (function x) to use

e.g.)

For example, suppose you have two features for a home.

x_1 is the length of the frontage (the width facing the road), and

x_2 is the length of the depth.

Of course, it is natural to use x_1 and x_2 as they are, but if you assume that the area of the land is what determines the price of the house, you could define a third feature

x = x_1 × x_2

and use that instead.

After all, x_1 and x_2 just happen to be the features that existed first. By the way, I have the impression that this kind of feature construction is often used in machine learning competitions such as Kaggle. (Who knows.)

In defining the features, I would like to introduce you to what is called ** polynomial regression **.

There are many variations of the linear regression equation. For example, here.

h_θ(x) = θ_0 + θ_1x_1 + θ_2x_2 + θ_3x_3

And this can also be expressed as

= θ_0 + θ_1(size) + θ_2(size)² + θ_3(size)³

if we set x_1 = (size),

x_2 = (size)²,

and x_3 = (size)³.

This is a cubic function. This is just one example, but consider a graph plotting data on the relationship between residential area and price.

(figure: plotted data of residential area versus price)

In a case like this, a cubic function works well. Why?

A linear function would be a straight line, and a quadratic function would eventually curve back down. It is hard to imagine that a house's price would drop just because the lot is too large.

And, **remember feature scaling**: we will need it now.

If **size ranges from 1 to 1000**, then

x_1 = size ranges from 1 to 1,000,

x_2 = size² ranges from 1 to 1,000,000,

x_3 = size³ ranges from 1 to 1,000,000,000.

As you can see, this is exactly where feature scaling comes into play.

However, please do the calculation yourself. I haven't finished my school homework today (laughs)

Earlier I wrote that a quadratic function can't be used because it eventually comes back down, but there is another form that can be used instead.

(figure: an alternative hypothesis whose graph keeps rising but flattens out, e.g. one using a square-root term in size)

With a formula like the one above, you can avoid the graph turning downward. Why it works is homework from the teacher (laughs).
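Here is a tiny sketch of mine that builds the polynomial and square-root features discussed above; it also makes it obvious why feature scaling becomes necessary:

```python
import math

sizes = [100, 400, 900]     # made-up lot sizes

features = [
    # each row: [size, size², size³, √size] for one house
    [s, s ** 2, s ** 3, math.sqrt(s)] for s in sizes
]
for row in features:
    print(row)   # the columns differ by many orders of magnitude,
                 # which is exactly when feature scaling is needed
```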

**So how do you choose features after all?**

Please read Part 3, 4, ... and all the series from now on. Then you will surely understand.

(I can't say anything while writing Part 3 yet ...)

By the way, rest assured that in the end **an algorithm can select them automatically**. (A bit of a spoiler.)

6 normal equation

The normal equations provide a better way to find the parameter θ in some linear regression problems.

Specifically, the algorithm used for linear regression so far has been the steepest descent method.

The steepest descent method was an algorithm that converges by arriving at the optimum solution by repeating many steps and multiple times.

In contrast, the **normal equation** solves for θ analytically, reaching the optimal solution in one shot with no need to run an iterative algorithm; the **strongest** method, you might say. (Later we will notice a big drawback.)

First, let's get an intuitive understanding.

As a premise, assume for the moment that θ is a single real number, not a vector. Please keep that in mind.

Expression like this

J(θ) = aθ² + bθ + c

In other words, let's consider the cost function J.

But before that, remember that the θ we actually deal with is an (n + 1)-dimensional vector.

The formula looks like this:

J(θ_0, θ_1, ..., θ_n) = 1/(2m) Σ_{i=1}^{m} (h_θ(x^(i)) - y^(i))²

The problem is how to minimize the cost function J. One answer is to take the partial derivative of J with respect to each parameter θ_j, set it to zero, and solve; but in this section we approach it from a more conceptual angle.

By the way, suppose you have a dataset like this.

(figure: a small example dataset of house features and prices)

Then add a column of x_0 = 1 values to build the matrix X.

(figure: the matrix X, with the added x_0 = 1 column)

Do the same for y.

(figure: the vector y of the prices)

The X above is m × (n + 1), and y is an m-dimensional vector. As before, m is the number of cases in the dataset and n is the number of features (variables x).

** So, what is a normal equation **

θ = (XᵀX)⁻¹ Xᵀy

Remember that the superscript T means the transpose.
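In NumPy, the normal equation is essentially one line. A sketch of mine, assuming X already has the x_0 = 1 column (np.linalg.pinv is used so that a nearly singular XᵀX does not cause an error):

```python
import numpy as np

# Made-up design matrix X (leading column of ones) and target vector y, based on the 2-1 table.
X = np.array([[1.0, 2104.0, 5.0],
              [1.0, 1416.0, 3.0],
              [1.0, 1534.0, 3.0],
              [1.0,  852.0, 2.0]])
y = np.array([4600.0, 2320.0, 3150.0, 1780.0])

theta = np.linalg.pinv(X.T @ X) @ X.T @ y   # θ = (XᵀX)⁻¹ Xᵀ y
print(theta)
print(X @ theta)    # predictions; close to y when a linear model fits the data well
```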

8 Thank you

While writing this second article I accidentally deleted my data twice, which made me seriously consider working in the cloud. Well, which one are you?

Type A: "this looks difficult", so you just scrolled down to this point

Type B: you read everything, wondering what Type B would be

I suspect about 90% of readers were Type A within 5 minutes...

Nice to meet you!

**Part 3 will be released Friday at midnight.**

Week 3 Published at 21:00 on the 16th

Classification problem

Hypothesis expression

Decision boundary

Cost function

Cost function and steepest descent (simple)

Advanced optimization

Multi-class classification: 1 vs. all

Overfit problem

Cost function

Regularized linear regression

Regularized logistic regression

Week 4 Published at 21:00 on the 16th

Non-linear hypothesis

Neurons and brain

Model representation (1)

Model representation (2)

Samples and intuition (1)

Samples and intuition (2)

Multi-class classification

Week 5 Published at 21:00 on the 17th

Cost function

Backpropagation algorithm

Intuitive backpropagation

Implementation note: Parameter expansion

Gradient check

Random initialization

Put it together

Week 6 Published at 21:00 on the 19th

Deciding what to try next

How to evaluate a hypothesis?

Model selection and train / verification / test set

Diagnosing bias and variance

Regularization and bias/variance

Learning curve

Rethink what to do next

Prioritizing work content

Error analysis

Error metrics for skewed classes

Trade-off between precision and recall

Machine learning data

Week 7 Published at 21:00 on the 20th

Purpose of optimization

Kernel (1)

Kernel (2)

Use SVM

Week 8 Published at 21:00 on the 22nd

Unsupervised learning (Introduction)

K-Means algorithm

Purpose of optimization

Random initialization

Select the number of clusters

Data compression

Visualization

Formulation of principal component analysis problem

Principal component analysis algorithm

Reconstruction from compressed representation

Selection of the number of principal components

Advice on applying PCA

Week 9 Published at 21:00 on the 23rd

Let's detect anomaly

Gaussian distribution

algorithm

Development and evaluation of anomaly detection system

Anomaly detection and supervised learning

Select the function to use

Multivariate Gaussian distribution

Anomaly detection using multivariate Gaussian distribution

Problem formulation

Content-based recommendations

Collaborative filtering

Collaborative filtering algorithm

Vectorization: Low rank matrix factorization

Implementation details: average normalization

Week 10 Published at 21:00 on the 24th

Learning with large datasets

Stochastic steepest descent method

Mini batch steepest descent method

Stochastic steepest descent convergence

Online learning

Map Reduce and data parallel processing

Week 11 Published at 21:00 on the 26th

Problem description and pipeline

Sliding window

Get large amounts of data and artificial data

Ceiling analysis: Which part of the pipeline should be addressed next

Afterword

How was it? Much of this is still just a table of contents, but I will keep adding content. The reason I'm putting in the effort is that the beginners' tutorial in the computer club I belong to is about making games in C#; I'm convinced that Python and machine learning are absolutely the way to go from now on. If you liked this, please give it a like.
