Anyone can implement machine learning relatively easily using scikit-learn and similar libraries. However, if you want to achieve results at work or raise your own level, **an explanation like "I don't know the background, but I got this result" is clearly weak**.
The purpose of this article is **for sections 2-3 to "set the theory aside and just try scikit-learn", and for section 4 onward to "understand the background from the mathematics"**.
I come from a private liberal arts background and am not good at mathematics, so I have tried to explain things so that they are as easy as possible to follow even for people who are uncomfortable with math. (However, this article requires some linear algebra, so if you find that part difficult, it is fine to skim over it.)
Similar articles have been posted as the series "Understanding from Mathematics", so I hope you will read them as well.

- [Machine learning] Understanding linear simple regression from both scikit-learn and mathematics
- [[Machine learning] Understanding logistic regression from both scikit-learn and mathematics](https://qiita.com/Hawaii/items/ee2a0687ca451fe213be)
- [[Machine learning] Understanding SVM from both scikit-learn and mathematics](https://qiita.com/Hawaii/items/4688a50cffb2140f297d)
- [[Machine learning] Understand from mathematics why the correlation coefficient ranges from -1 to 1](https://qiita.com/Hawaii/items/3f4e91cf9b86676c202f)
Since some of this overlaps with linear simple regression, please also refer to that article: [Machine learning] Understanding linear simple regression from both scikit-learn and mathematics.
**It predicts numerical values.** Machine learning also includes "classification", but when you want to predict a numerical value such as "●● yen" or "△ kg", you can think of using regression.
At the risk of being misunderstood, the idea is to express "the thing you want to know ($=y$, the objective variable)" in terms of "the things that influence it ($=x$, the explanatory variables)". Linear simple regression had one $x$, while linear multiple regression has multiple $x$.
Since this may be hard to picture, let's look at a concrete example.
◆ Concrete example
Suppose you run your own ice cream shop, and you have a strong desire to be able to predict the shop's ice cream sales in order to stabilize your revenue outlook.
You think hard about what affects your shop's ice cream sales. In the linear simple regression article, we assumed that it was the temperature that affected ice cream sales, but if you actually ran an ice cream shop, would you really conclude that "only temperature" affects sales?
Probably not: besides temperature, there is the traffic volume on the street in front of the shop that day, the influence of the employees working that day, and so on. As you can see, there are usually **multiple explanatory variables that affect the objective variable (ice cream sales)**, and in some cases tens of thousands of them.
So, suppose you plot each "candidate explanatory variable ($=x$)" against "sales ($=y$)" as scatter plots like the ones below.
- Scatter plot of sales vs. temperature
- Scatter plot of sales vs. traffic volume
- Scatter plot of sales vs. the number of employees on shift that day
From these plots, you would choose "temperature" and "number of employees", which appear to have a linear relationship with sales, as explanatory variables, and drop "traffic volume". However, since we will use it in an example later, we will keep "traffic volume" as an explanatory variable here as well.
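To back up this eyeballed choice with a number, one quick check (a sketch I am adding here, using the same data that appears in the next section) is to compute the correlation coefficient of each candidate variable against sales:

```python
import numpy as np

# Same data as used with scikit-learn below
temperature = [8,10,6,15,12,16,20,13,24,26,12,18,19,16,20,23,26,28]
car = [100,20,30,15,60,25,40,20,18,30,60,10,8,25,35,90,25,55]   # traffic volume
clerk = [3,5,4,6,6,7,12,8,12,10,7,7,8,6,6,10,12,13]             # employees on shift
sales = [30,35,28,38,35,40,60,34,63,65,38,40,41,43,42,55,65,69]

def corr(a, b):
    """Pearson correlation coefficient between two sequences."""
    return np.corrcoef(a, b)[0, 1]

r_temp = corr(temperature, sales)
r_car = corr(car, sales)
r_clerk = corr(clerk, sales)
print(r_temp, r_car, r_clerk)
```

On this data, temperature and the number of employees correlate strongly with sales, while traffic volume barely correlates at all, which matches what the scatter plots suggest.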
Next, let's use scikit-learn to build a machine learning model that calculates ice cream sales based on temperature, traffic volume, and number of employees.
Import what is required to perform linear regression.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
```
Set up the temperature, traffic volume, number of employees, and ice cream sales as data, as shown below.

```python
data = pd.DataFrame({
    "temperature":[8,10,6,15,12,16,20,13,24,26,12,18,19,16,20,23,26,28],
    "car":[100,20,30,15,60,25,40,20,18,30,60,10,8,25,35,90,25,55],
    "clerk":[3,5,4,6,6,7,12,8,12,10,7,7,8,6,6,10,12,13],
    "sales(=y)":[30,35,28,38,35,40,60,34,63,65,38,40,41,43,42,55,65,69]
})
```
First, we arrange the data into the shape needed to build the model.

```python
y = data["sales(=y)"].values
X = data.drop("sales(=y)", axis=1).values  # define all columns other than sales as X
```
Since this is not an article about Python grammar, I will omit the details, but here we arrange X and y into the form that scikit-learn's linear regression expects.
Now, finally, the model-building code.

```python
regr = LinearRegression(fit_intercept=True)
regr.fit(X, y)
```
That's all it takes to build a simple model. The image is: the first line declares "we will create a linear regression model in a variable called regr!", and the next line has regr fit (= learn from) the prepared X and y.
As described in "2. What is linear (multiple) regression?", scikit-learn calculates the $a$ and $b$ of $y = a_1x_1 + a_2x_2 + a_3x_3 + b$; behind the scenes it is searching for a linear formula that predicts sales from temperature, traffic volume, and the number of employees. Since you cannot see this as it is, let's actually output the slopes and the intercept.
```python
b = regr.intercept_
a1 = regr.coef_[0]
a2 = regr.coef_[1]
a3 = regr.coef_[2]
pd.DataFrame([b, a1, a2, a3], index=["b", "a1", "a2", "a3"])
```
Running this displays the coefficients, and we can see that the formula for this linear regression is $y = 1.074159x_1 + 0.04694x_2 + 2.170663x_3 + 8.131467$.
As an aside, the coefficient of traffic volume ($x_2$) is very small (about 0.047), which is consistent with the scatter plots above: traffic volume hardly has a linear relationship with sales.
Building the model is not the end, though. In the real world, you need to use this forecasting model to forecast future sales. Suppose you have written down the temperature, expected traffic volume, and number of employees for the next three days. Store them in a variable called z, as shown below.
```python
z = pd.DataFrame([[20,15,18],
                  [15,60,30],
                  [5,8,12]])
```
What we want to do is apply this future data to the linear formula obtained by scikit-learn earlier and predict sales.

```python
regr.predict(z)
```
Running this displays the result "([69.39068087, 92.18012508, 39.92573722])". In other words, tomorrow's sales are predicted to be about 694,000 yen, the day after tomorrow's about 922,000 yen, and so on. If you can obtain this data for the next month, you will have a rough idea of sales, and your goal is achieved.
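As a sanity check (a small self-contained sketch I am adding, reusing the data from above), predict really is nothing more than the linear formula $y = a_1x_1 + a_2x_2 + a_3x_3 + b$ applied to each row of z:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

data = pd.DataFrame({
    "temperature":[8,10,6,15,12,16,20,13,24,26,12,18,19,16,20,23,26,28],
    "car":[100,20,30,15,60,25,40,20,18,30,60,10,8,25,35,90,25,55],
    "clerk":[3,5,4,6,6,7,12,8,12,10,7,7,8,6,6,10,12,13],
    "sales(=y)":[30,35,28,38,35,40,60,34,63,65,38,40,41,43,42,55,65,69],
})
y = data["sales(=y)"].values
X = data.drop("sales(=y)", axis=1).values

regr = LinearRegression(fit_intercept=True).fit(X, y)

z = np.array([[20,15,18],
              [15,60,30],
              [5,8,12]])

# predict(z) is exactly "each row of z times the coefficients, plus the intercept"
manual = z @ regr.coef_ + regr.intercept_
print(manual)
```

This prints the same three values as `regr.predict(z)`, confirming that the model is just the fitted linear formula.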
There are many finer points, but I think it's good to first try implementing orthodox linear regression like this.
Now, up to section 3 we implemented the flow of using scikit-learn to calculate the $a$ and $b$ of $y = a_1x_1 + a_2x_2 + \cdots + a_ix_i + b$ → forecasting sales from the data for the next three days. Here, I would like to clarify **how the "calculate $a$ and $b$" part of this flow is computed mathematically**.
◆ Prerequisite knowledge
The derivation below uses the following matrix-differentiation formulas, where $c$ is a constant (vector) and $C$ is a constant matrix:

$\frac{∂c}{∂\boldsymbol{x}} = 0$ ← differentiating a constant by $\boldsymbol{x}$ gives 0

$\frac{∂(c^T\boldsymbol{x})}{∂\boldsymbol{x}} = c$

$\frac{∂(\boldsymbol{x}^TC\boldsymbol{x})}{∂\boldsymbol{x}} = (C + C^T)\boldsymbol{x}$
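These identities can also be verified numerically. The sketch below (my addition, not part of the original derivation) checks the quadratic-form formula against a finite-difference gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.standard_normal((3, 3))  # a constant (not necessarily symmetric) matrix
x = rng.standard_normal(3)

def f(v):
    # the quadratic form v^T C v
    return v @ C @ v

# analytic gradient from the identity d(x^T C x)/dx = (C + C^T) x
analytic = (C + C.T) @ x

# numerical gradient via central differences
eps = 1e-6
numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
print(analytic, numeric)
```

The two gradients agree to within the finite-difference error, which is a handy way to convince yourself of such formulas without re-deriving them.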
◆ Multiple regression analysis formula
As mentioned in the first half, the formula for multiple regression analysis is generally expressed as follows.
**$\hat{y} = a_1x_1 + a_2x_2 + \cdots + a_mx_m + a_0 \cdot 1$**
◆ In this example... $x_1$ is the temperature, $x_2$ the traffic volume, and $x_3$ the number of employees; each is multiplied by some coefficient $a_1, a_2, a_3$, and finally the constant $a_0 \cdot 1$ is added to obtain the sales $\hat{y}$.
Next, express the $x$ in the multiple regression formula as a vector $\boldsymbol{x}$:

$
\boldsymbol{x} = (1, x_1, x_2, \cdots, x_m)
$

Since there are three explanatory variables this time, it stops at $x_3$, but in general it is expressed as above (the leading 1 is the constant that $a_0$ multiplies).
And, for example, $x_1$ (temperature) does not hold just one value: it holds the temperature data for multiple days. Stacking one such row per day gives the matrix $\boldsymbol{X}$ below.

$
\boldsymbol{X} = \begin{pmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1m}\\
1 & x_{21} & x_{22} & \cdots & x_{2m}\\
\vdots & & & & \vdots\\
1 & x_{n1} & x_{n2} & \cdots & x_{nm}
\end{pmatrix}
$
Similarly, the coefficients can be expressed as the vector $\boldsymbol{a}$:

$
\boldsymbol{a} = \begin{pmatrix}
a_0\\
a_1\\
a_2\\
a_3\\
\vdots\\
a_m
\end{pmatrix}
$
In other words, since the multiple regression formula that predicts sales is $\hat{y} = a_1x_1 + a_2x_2 + \cdots + a_mx_m + a_0 \cdot 1$, it can be written compactly as $\hat{y} = \boldsymbol{X}\boldsymbol{a}$.
The important point is that $y$ and $\boldsymbol{X}$ can be obtained from the data you have, so by solving for $\boldsymbol{a}$, **the coefficients of each explanatory variable in the multiple regression formula can be calculated**.
From here, let's find $\boldsymbol{a}$ analytically (= by manual calculation) using this formulation. This is essentially the same calculation that scikit-learn does behind the scenes. (Strictly speaking it differs slightly; I will touch on this later.)
As mentioned in the linear simple regression article, to determine $a_1$, $a_2$, and $a_3$ in $\hat{y} = a_1x_1 + a_2x_2 + \cdots + a_mx_m + a_0 \cdot 1$, we **choose $a_1, a_2, a_3$ so that the difference between the actual sales $y$ and the predicted value $\hat{y}$ is as small as possible**.
Let's see what "as small as possible" means while writing down the difference between $y$ and $\hat{y}$ as an error function. Using the sum of squared errors:

$
E = (y - \boldsymbol{X}\boldsymbol{a})^T(y - \boldsymbol{X}\boldsymbol{a}) = y^Ty - 2y^T\boldsymbol{X}\boldsymbol{a} + \boldsymbol{a}^T\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{a}
$
To minimize this $E$, differentiate $E$ with respect to $\boldsymbol{a}$ and find the $\boldsymbol{a}$ for which the derivative becomes 0. (See the linear simple regression article for why we set the derivative to 0.)
$
\begin{align}
\frac{∂E}{∂\boldsymbol{a}} &= \frac{∂}{∂\boldsymbol{a}}(y^Ty) - 2\frac{∂}{∂\boldsymbol{a}}(y^T\boldsymbol{X}\boldsymbol{a}) + \frac{∂}{∂\boldsymbol{a}}(\boldsymbol{a}^T\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{a})\\
&= -2\boldsymbol{X}^Ty + [\boldsymbol{X}^T\boldsymbol{X} + (\boldsymbol{X}^T\boldsymbol{X})^T]\boldsymbol{a}\\
&= -2\boldsymbol{X}^Ty + 2\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{a}
\end{align}
$

(The term $\frac{∂}{∂\boldsymbol{a}}(y^Ty)$ becomes 0 by the first prerequisite formula, and $C$ in the third prerequisite formula corresponds to $\boldsymbol{X}^T\boldsymbol{X}$ here.)
Setting this derivative equal to 0:
$
\begin{align}
-2\boldsymbol{X}^Ty + 2\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{a} &= 0\\
2\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{a} &= 2\boldsymbol{X}^Ty\\
\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{a} &= \boldsymbol{X}^Ty
\end{align}
$
In other words, the $\boldsymbol{a}$ we wanted to find is calculated as follows:

$
\boldsymbol{a} = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^Ty
$
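In code, this normal equation can be solved directly. Below is an illustrative sketch on made-up data (the variable names and data are my own, not from the article): in practice, solving the linear system $\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{a} = \boldsymbol{X}^Ty$ with `np.linalg.solve` is preferred over forming the inverse explicitly, but both give the same $\boldsymbol{a}$ here.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 50, 3
# design matrix with a leading column of 1s for the constant term a_0
X = np.hstack([np.ones((n, 1)), rng.standard_normal((n, m))])
true_a = np.array([2.0, 1.5, -0.5, 3.0])
y = X @ true_a + 0.01 * rng.standard_normal(n)  # data generated with small noise

# a = (X^T X)^{-1} X^T y, written as a linear solve for numerical stability
a = np.linalg.solve(X.T @ X, X.T @ y)

# same result via the explicit inverse, matching the formula above
a_inv = np.linalg.inv(X.T @ X) @ X.T @ y
print(a)
```

With nearly noise-free data, the recovered coefficients land very close to the `true_a` used to generate `y`.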
We were able to express the $\boldsymbol{a}$ we wanted as a formula, but with only this much on the page, multiple regression may still not feel real (it didn't for me).
So let's use Python's NumPy to compute the multiple regression coefficients analytically from the formula above.
◆ Data set
(i) Import NumPy

```python
import numpy as np
```
(ii) Set up the data. It may be a little hard to read, but in x, the first column is the temperature, the second column is the traffic volume, and the third is the number of employees; y is the sales.
```python
x = np.matrix([[8,100,3],
               [10,20,5],
               [6,30,4],
               [15,15,6],
               [12,60,6],
               [16,25,7],
               [20,40,12],
               [13,20,8],
               [24,18,12],
               [26,30,10],
               [12,60,7],
               [18,10,7],
               [19,8,8],
               [16,25,6],
               [20,35,6],
               [23,90,10],
               [26,25,12],
               [28,55,13]])
y = np.matrix([[30],
               [35],
               [28],
               [38],
               [35],
               [40],
               [60],
               [34],
               [63],
               [65],
               [38],
               [40],
               [41],
               [43],
               [42],
               [55],
               [65],
               [69]])
```
(iii) Multiple regression analysis. As shown earlier, $\boldsymbol{a} = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^Ty$, so write the following.

```python
# with np.matrix, * is the matrix product and **-1 is the matrix inverse
(x.T * x)**-1 * x.T * y
```
Then you will see a result like this:

```
matrix([[1.26664688],
        [0.09371714],
        [2.47439799]])
```

In other words, calculated with NumPy, $a_1 = 1.27$, $a_2 = 0.09$, $a_3 = 2.47$.
These values differ slightly from the $a_1, a_2, a_3$ obtained by scikit-learn. That is because scikit-learn additionally accounts for the bias (intercept) term, which this NumPy calculation leaves out. Going that far makes things more complicated, so in terms of understanding the basic calculation scikit-learn performs behind the scenes, I think it's best to grasp this level first.
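The two results can in fact be reconciled. Following the formulation in section 4, prepending a column of 1s to x makes $a_0$ play the role of the intercept; the sketch below (my addition) then reproduces scikit-learn's coefficients exactly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[8,100,3],[10,20,5],[6,30,4],[15,15,6],[12,60,6],[16,25,7],
              [20,40,12],[13,20,8],[24,18,12],[26,30,10],[12,60,7],[18,10,7],
              [19,8,8],[16,25,6],[20,35,6],[23,90,10],[26,25,12],[28,55,13]])
y = np.array([30,35,28,38,35,40,60,34,63,65,38,40,41,43,42,55,65,69])

# prepend the constant column of 1s (the "1" that a_0 multiplies)
X1 = np.hstack([np.ones((x.shape[0], 1)), x])

# a = (X^T X)^{-1} X^T y, now with the intercept included as a[0]
a = np.linalg.inv(X1.T @ X1) @ X1.T @ y

regr = LinearRegression(fit_intercept=True).fit(x, y)
print(a[0], regr.intercept_)  # the intercepts agree
print(a[1:], regr.coef_)      # the coefficients agree
```

So the "strictly different" part mentioned earlier is just whether the column of 1s for the bias is included in $\boldsymbol{X}$.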
How was it? My view is that the attitude of "I can't interpret extremely complicated code from the start, so never mind accuracy for now; I'll just implement the basic sequence of steps with scikit-learn" is a very important first step.
However, once you get used to it, I feel it is very important to understand from the mathematical background how these models work behind the scenes.
Some parts may be hard to follow, but I hope this helps deepen your understanding. I also feel I need to learn these areas more solidly myself, so I will keep studying and posting articles.