Anyone can implement machine learning relatively easily using scikit-learn and similar libraries. However, if you want to achieve results at work or raise your own level, **an explanation like "I don't know the background, but I got this result" is clearly weak**.
The purpose of this article is **for sections 2-3 to "set the theory aside and just try scikit-learn", and for section 4 onward to "understand the background from the mathematics"**.
I come from a private liberal arts background and am not good at mathematics, so I have tried to explain things so that they are as easy as possible to follow even for people who are uncomfortable with math. (However, this article requires some linear algebra, so if you find that part difficult, it is fine to skim over it.)
Similar articles have been posted as the series "Understanding from Mathematics", so I hope you will read them as well.

- [Machine learning] Understanding linear simple regression from both scikit-learn and mathematics
- [[Machine learning] Understanding logistic regression from both scikit-learn and mathematics](https://qiita.com/Hawaii/items/ee2a0687ca451fe213be)
- [[Machine learning] Understanding SVM from both scikit-learn and mathematics](https://qiita.com/Hawaii/items/4688a50cffb2140f297d)
- [[Machine learning] Understand from mathematics why the correlation coefficient ranges from -1 to 1](https://qiita.com/Hawaii/items/3f4e91cf9b86676c202f)
Since some of this overlaps with linear simple regression, please also refer to that article: [Machine learning] Understanding linear simple regression from both scikit-learn and mathematics.
**It predicts numerical values.** Machine learning also includes "classification", but when you want to predict a numerical value such as "●● yen" or "△ kg", you can think of using regression.
At the risk of being misunderstood, the idea is to express "the thing you want to know ($=y$, the objective variable)" in terms of "the things that influence it ($=x$, the explanatory variables)". Linear simple regression had one $x$, while linear multiple regression has multiple $x$.
Since this may be hard to picture, let's look at a concrete example.
◆ Concrete example
Suppose you run your own ice cream shop, and you have a strong desire to be able to predict the shop's ice cream sales in order to stabilize your revenue outlook.
You think hard about what affects your shop's ice cream sales. In the linear simple regression article, we assumed that it was the temperature that affected ice cream sales, but if you actually ran an ice cream shop, would you really conclude that "only temperature" affects sales?
Probably not: besides temperature, there is the traffic volume on the street in front of the shop that day, the influence of the employees working that day, and so on. As you can see, there are usually **multiple explanatory variables that affect the objective variable (ice cream sales)**, and in some cases tens of thousands of them.
So, suppose you plot each "candidate explanatory variable ($=x$)" against "sales ($=y$)" as scatter plots like the ones below.
- Scatter plot of sales vs. temperature
- Scatter plot of sales vs. traffic volume
- Scatter plot of sales vs. the number of employees on shift that day
From these plots, you would choose "temperature" and "number of employees", which appear to have a linear relationship with sales, as explanatory variables, and drop "traffic volume". However, since we will use it in an example later, we will keep "traffic volume" as an explanatory variable here as well.
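To back up this eyeballed choice with a number, one quick check (a sketch I am adding here, using the same data that appears in the next section) is to compute the correlation coefficient of each candidate variable against sales:

```python
import numpy as np

# Same data as used with scikit-learn below
temperature = [8,10,6,15,12,16,20,13,24,26,12,18,19,16,20,23,26,28]
car = [100,20,30,15,60,25,40,20,18,30,60,10,8,25,35,90,25,55]   # traffic volume
clerk = [3,5,4,6,6,7,12,8,12,10,7,7,8,6,6,10,12,13]             # employees on shift
sales = [30,35,28,38,35,40,60,34,63,65,38,40,41,43,42,55,65,69]

def corr(a, b):
    """Pearson correlation coefficient between two sequences."""
    return np.corrcoef(a, b)[0, 1]

r_temp = corr(temperature, sales)
r_car = corr(car, sales)
r_clerk = corr(clerk, sales)
print(r_temp, r_car, r_clerk)
```

On this data, temperature and the number of employees correlate strongly with sales, while traffic volume barely correlates at all, which matches what the scatter plots suggest.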
Next, let's use scikit-learn to build a machine learning model that calculates ice cream sales based on temperature, traffic volume, and number of employees.
Import what is required to perform linear regression.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
```
Set up the temperature, traffic volume, number of employees, and ice cream sales as data, as shown below.

```python
data = pd.DataFrame({
    "temperature":[8,10,6,15,12,16,20,13,24,26,12,18,19,16,20,23,26,28],
    "car":[100,20,30,15,60,25,40,20,18,30,60,10,8,25,35,90,25,55],
    "clerk":[3,5,4,6,6,7,12,8,12,10,7,7,8,6,6,10,12,13],
    "sales(=y)":[30,35,28,38,35,40,60,34,63,65,38,40,41,43,42,55,65,69]
})
```
First, we arrange the data into the shape needed to build the model.

```python
y = data["sales(=y)"].values
X = data.drop("sales(=y)", axis=1).values  # define all columns other than sales as X
```
Since this is not an article about Python grammar, I will omit the details, but here we arrange X and y into the form that scikit-learn's linear regression expects.
Now, finally, the model-building code.

```python
regr = LinearRegression(fit_intercept=True)
regr.fit(X, y)
```
That's all it takes to build a simple model. The image is: the first line declares "we will create a linear regression model in a variable called regr!", and the next line has regr fit (= learn from) the prepared X and y.
As described in "2. What is linear (multiple) regression?", scikit-learn calculates the $a$ and $b$ of $y = a_1x_1 + a_2x_2 + a_3x_3 + b$; behind the scenes it is searching for a linear formula that predicts sales from temperature, traffic volume, and the number of employees. Since you cannot see this as it is, let's actually output the slopes and the intercept.
```python
b = regr.intercept_
a1 = regr.coef_[0]
a2 = regr.coef_[1]
a3 = regr.coef_[2]
pd.DataFrame([b, a1, a2, a3], index=["b", "a1", "a2", "a3"])
```
Running this displays the coefficients, and we can see that the formula for this linear regression is $y = 1.074159x_1 + 0.04694x_2 + 2.170663x_3 + 8.131467$.
As an aside, the coefficient of traffic volume ($x_2$) is very small (about 0.047), which is consistent with the scatter plots above: traffic volume hardly has a linear relationship with sales.
Building the model is not the end, though. In the real world, you need to use this forecasting model to forecast future sales. Suppose you have written down the temperature, expected traffic volume, and number of employees for the next three days. Store them in a variable called z, as shown below.
```python
z = pd.DataFrame([[20,15,18],
                  [15,60,30],
                  [5,8,12]])
```
What we want to do is apply this future data to the linear formula obtained by scikit-learn earlier and predict sales.

```python
regr.predict(z)
```
Running this displays the result "([69.39068087, 92.18012508, 39.92573722])". In other words, tomorrow's sales are predicted to be about 694,000 yen, the day after tomorrow's about 922,000 yen, and so on. If you can obtain this data for the next month, you will have a rough idea of sales, and your goal is achieved.
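As a sanity check (a small self-contained sketch I am adding, reusing the data from above), predict really is nothing more than the linear formula $y = a_1x_1 + a_2x_2 + a_3x_3 + b$ applied to each row of z:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

data = pd.DataFrame({
    "temperature":[8,10,6,15,12,16,20,13,24,26,12,18,19,16,20,23,26,28],
    "car":[100,20,30,15,60,25,40,20,18,30,60,10,8,25,35,90,25,55],
    "clerk":[3,5,4,6,6,7,12,8,12,10,7,7,8,6,6,10,12,13],
    "sales(=y)":[30,35,28,38,35,40,60,34,63,65,38,40,41,43,42,55,65,69],
})
y = data["sales(=y)"].values
X = data.drop("sales(=y)", axis=1).values

regr = LinearRegression(fit_intercept=True).fit(X, y)

z = np.array([[20,15,18],
              [15,60,30],
              [5,8,12]])

# predict(z) is exactly "each row of z times the coefficients, plus the intercept"
manual = z @ regr.coef_ + regr.intercept_
print(manual)
```

This prints the same three values as `regr.predict(z)`, confirming that the model is just the fitted linear formula.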
There are many finer points, but I think it's good to first try implementing orthodox linear regression like this.
Now, up to section 3 we implemented the flow of using scikit-learn to calculate the $a$ and $b$ of $y = a_1x_1 + a_2x_2 + \cdots + a_ix_i + b$ → forecasting sales from the data for the next three days. Here, I would like to clarify **how the "calculate $a$ and $b$" part of this flow is computed mathematically**.
◆ Prerequisite knowledge
The derivation below uses the following matrix-differentiation formulas, where $c$ is a constant (vector) and $C$ is a constant matrix:

$\frac{∂c}{∂\boldsymbol{x}} = 0$ ← differentiating a constant by $\boldsymbol{x}$ gives 0

$\frac{∂(c^T\boldsymbol{x})}{∂\boldsymbol{x}} = c$

$\frac{∂(\boldsymbol{x}^TC\boldsymbol{x})}{∂\boldsymbol{x}} = (C + C^T)\boldsymbol{x}$
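These identities can also be verified numerically. The sketch below (my addition, not part of the original derivation) checks the quadratic-form formula against a finite-difference gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.standard_normal((3, 3))  # a constant (not necessarily symmetric) matrix
x = rng.standard_normal(3)

def f(v):
    # the quadratic form v^T C v
    return v @ C @ v

# analytic gradient from the identity d(x^T C x)/dx = (C + C^T) x
analytic = (C + C.T) @ x

# numerical gradient via central differences
eps = 1e-6
numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
print(analytic, numeric)
```

The two gradients agree to within the finite-difference error, which is a handy way to convince yourself of such formulas without re-deriving them.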
◆ Multiple regression analysis formula
As mentioned in the first half, the formula for multiple regression analysis is generally expressed as follows.
**$\hat{y} = a_1x_1 + a_2x_2 + \cdots + a_mx_m + a_0 \cdot 1$**
◆ In this example... $x_1$ is the temperature, $x_2$ the traffic volume, and $x_3$ the number of employees; each is multiplied by some coefficient $a_1, a_2, a_3$, and finally the constant $a_0 \cdot 1$ is added to obtain the sales $\hat{y}$.
Next, express the $x$ in the multiple regression formula as a vector $\boldsymbol{x}$:

$
\boldsymbol{x} = (1, x_1, x_2, \cdots, x_m)
$

Since there are three explanatory variables this time, it stops at $x_3$, but in general it is expressed as above (the leading 1 is the constant that $a_0$ multiplies).
And, for example, $x_1$ (temperature) does not hold just one value: it holds the temperature data for multiple days. Stacking one such row per day gives the matrix $\boldsymbol{X}$ below.

$
\boldsymbol{X} = \begin{pmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1m}\\
1 & x_{21} & x_{22} & \cdots & x_{2m}\\
\vdots & & & & \vdots\\
1 & x_{n1} & x_{n2} & \cdots & x_{nm}
\end{pmatrix}
$
Similarly, the coefficients can be expressed as the vector $\boldsymbol{a}$:

$
\boldsymbol{a} = \begin{pmatrix}
a_0\\
a_1\\
a_2\\
a_3\\
\vdots\\
a_m
\end{pmatrix}
$
In other words, since the multiple regression formula that predicts sales is $\hat{y} = a_1x_1 + a_2x_2 + \cdots + a_mx_m + a_0 \cdot 1$, it can be written compactly as $\hat{y} = \boldsymbol{X}\boldsymbol{a}$.
The important point is that $y$ and $\boldsymbol{X}$ can be obtained from the data you have, so by solving for $\boldsymbol{a}$, **the coefficients of each explanatory variable in the multiple regression formula can be calculated**.
From here, let's find $\boldsymbol{a}$ analytically (= by manual calculation) using this formulation. This is essentially the same calculation that scikit-learn does behind the scenes. (Strictly speaking it differs slightly; I will touch on this later.)
As mentioned in the linear simple regression article, to determine $a_1$, $a_2$, and $a_3$ in $\hat{y} = a_1x_1 + a_2x_2 + \cdots + a_mx_m + a_0 \cdot 1$, we **choose $a_1, a_2, a_3$ so that the difference between the actual sales $y$ and the predicted value $\hat{y}$ is as small as possible**.
Let's see what "as small as possible" means while writing down the difference between $y$ and $\hat{y}$ as an error function. Using the sum of squared errors:

$
E = (y - \boldsymbol{X}\boldsymbol{a})^T(y - \boldsymbol{X}\boldsymbol{a}) = y^Ty - 2y^T\boldsymbol{X}\boldsymbol{a} + \boldsymbol{a}^T\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{a}
$
To minimize this $E$, differentiate $E$ with respect to $\boldsymbol{a}$ and find the $\boldsymbol{a}$ for which the derivative becomes 0. (See the linear simple regression article for why we set the derivative to 0.)
$
\begin{align}
\frac{∂E}{∂\boldsymbol{a}} &= \frac{∂}{∂\boldsymbol{a}}(y^Ty) - 2\frac{∂}{∂\boldsymbol{a}}(y^T\boldsymbol{X}\boldsymbol{a}) + \frac{∂}{∂\boldsymbol{a}}(\boldsymbol{a}^T\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{a})\\
&= -2\boldsymbol{X}^Ty + [\boldsymbol{X}^T\boldsymbol{X} + (\boldsymbol{X}^T\boldsymbol{X})^T]\boldsymbol{a}\\
&= -2\boldsymbol{X}^Ty + 2\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{a}
\end{align}
$

(The term $\frac{∂}{∂\boldsymbol{a}}(y^Ty)$ becomes 0 by the first prerequisite formula, and $C$ in the third prerequisite formula corresponds to $\boldsymbol{X}^T\boldsymbol{X}$ here.)
Setting this derivative equal to 0:
$
\begin{align}
-2\boldsymbol{X}^Ty + 2\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{a} &= 0\\
2\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{a} &= 2\boldsymbol{X}^Ty\\
\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{a} &= \boldsymbol{X}^Ty
\end{align}
$
In other words, the $\boldsymbol{a}$ we wanted to find is calculated as follows:

$
\boldsymbol{a} = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^Ty
$
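In code, this normal equation can be solved directly. Below is an illustrative sketch on made-up data (the variable names and data are my own, not from the article): in practice, solving the linear system $\boldsymbol{X}^T\boldsymbol{X}\boldsymbol{a} = \boldsymbol{X}^Ty$ with `np.linalg.solve` is preferred over forming the inverse explicitly, but both give the same $\boldsymbol{a}$ here.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 50, 3
# design matrix with a leading column of 1s for the constant term a_0
X = np.hstack([np.ones((n, 1)), rng.standard_normal((n, m))])
true_a = np.array([2.0, 1.5, -0.5, 3.0])
y = X @ true_a + 0.01 * rng.standard_normal(n)  # data generated with small noise

# a = (X^T X)^{-1} X^T y, written as a linear solve for numerical stability
a = np.linalg.solve(X.T @ X, X.T @ y)

# same result via the explicit inverse, matching the formula above
a_inv = np.linalg.inv(X.T @ X) @ X.T @ y
print(a)
```

With nearly noise-free data, the recovered coefficients land very close to the `true_a` used to generate `y`.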
We were able to express the $\boldsymbol{a}$ we wanted as a formula, but with only this much on the page, multiple regression may still not feel real (it didn't for me).
So let's use Python's NumPy to compute the multiple regression coefficients analytically from the formula above.
◆ Data set
(i) Import NumPy

```python
import numpy as np
```
(ii) Set up the data. It may be a little hard to read, but in x, the first column is the temperature, the second column is the traffic volume, and the third is the number of employees; y is the sales.
```python
x = np.matrix([[8,100,3],
               [10,20,5],
               [6,30,4],
               [15,15,6],
               [12,60,6],
               [16,25,7],
               [20,40,12],
               [13,20,8],
               [24,18,12],
               [26,30,10],
               [12,60,7],
               [18,10,7],
               [19,8,8],
               [16,25,6],
               [20,35,6],
               [23,90,10],
               [26,25,12],
               [28,55,13]])
y = np.matrix([[30],
               [35],
               [28],
               [38],
               [35],
               [40],
               [60],
               [34],
               [63],
               [65],
               [38],
               [40],
               [41],
               [43],
               [42],
               [55],
               [65],
               [69]])
```
(iii) Multiple regression analysis. As shown earlier, $\boldsymbol{a} = (\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^Ty$, so write the following.

```python
# with np.matrix, * is the matrix product and **-1 is the matrix inverse
(x.T * x)**-1 * x.T * y
```
Then you will see a result like this:

```
matrix([[1.26664688],
        [0.09371714],
        [2.47439799]])
```

In other words, calculated with NumPy, $a_1 = 1.27$, $a_2 = 0.09$, $a_3 = 2.47$.
These values differ slightly from the $a_1, a_2, a_3$ obtained by scikit-learn. That is because scikit-learn additionally accounts for the bias (intercept) term, which this NumPy calculation leaves out. Going that far makes things more complicated, so in terms of understanding the basic calculation scikit-learn performs behind the scenes, I think it's best to grasp this level first.
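The two results can in fact be reconciled. Following the formulation in section 4, prepending a column of 1s to x makes $a_0$ play the role of the intercept; the sketch below (my addition) then reproduces scikit-learn's coefficients exactly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[8,100,3],[10,20,5],[6,30,4],[15,15,6],[12,60,6],[16,25,7],
              [20,40,12],[13,20,8],[24,18,12],[26,30,10],[12,60,7],[18,10,7],
              [19,8,8],[16,25,6],[20,35,6],[23,90,10],[26,25,12],[28,55,13]])
y = np.array([30,35,28,38,35,40,60,34,63,65,38,40,41,43,42,55,65,69])

# prepend the constant column of 1s (the "1" that a_0 multiplies)
X1 = np.hstack([np.ones((x.shape[0], 1)), x])

# a = (X^T X)^{-1} X^T y, now with the intercept included as a[0]
a = np.linalg.inv(X1.T @ X1) @ X1.T @ y

regr = LinearRegression(fit_intercept=True).fit(x, y)
print(a[0], regr.intercept_)  # the intercepts agree
print(a[1:], regr.coef_)      # the coefficients agree
```

So the "strictly different" part mentioned earlier is just whether the column of 1s for the bias is included in $\boldsymbol{X}$.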
How was it? My view is that the attitude of "I can't interpret extremely complicated code from the start, so never mind accuracy for now; I'll just implement the basic sequence of steps with scikit-learn" is a very important first step.
However, once you get used to it, I feel it is very important to understand from the mathematical background how these models work behind the scenes.
Some parts may be hard to follow, but I hope this helps deepen your understanding. I also feel I need to learn these areas more solidly myself, so I will keep studying and posting articles.