In the previous article I summarized the overall classification of machine learning, so starting with this article I will walk through concrete implementations one at a time.
The previous article is here: https://qiita.com/U__ki/items/4ae54da6cadec8a84d1b
The theme this time is "What TV size fits the size of the room?" Some of you may be moving this spring. For anyone wondering how to choose a TV that suits a new room, I would like to estimate the TV size that fits a given room size using simple regression analysis.
Since this is a good opportunity, I decided to create simulated data with pandas and then run the analysis on that data.
First, referring to this article (https://www.olive-hitomawashi.com/lifestyle/2019/10/post-294.html), I compiled the recommended TV size for each room size as follows. (If you have this table, you don't really need this article.)
[TV size]
6 tatami mats: 24 inches
8 tatami mats: 32 inches
10 tatami mats: 40 inches
12 tatami mats: 50 inches
Output this data as a csv file using pandas.
create_csv.py
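# Create the CSV file with pandas
import pandas as pd
# Each row is [room size in tatami mats, recommended TV size in inches]
df=pd.DataFrame([
[6, 24],
[8, 32],
[10, 40],
[12, 50]],
columns=["room_size", "tv_inch"]
)
# index=False keeps the row index out of the CSV
df.to_csv("room_tv.csv", index=False)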
df is short for DataFrame. Setting index=False keeps the row index out of the CSV. This creates a new file called room_tv.csv in the same folder.
If the following file has been created in the folder, it worked.
room_tv.csv
room_size,tv_inch
6,24
8,32
10,40
12,50
Now you have the csv file to use this time.
Next, we move on to the main topic, simple regression analysis.
Simple regression analysis consists of the following three steps.
・ **Decide the model**
・ **Set the evaluation function**
・ **Minimize the evaluation function (determine the slope)**
First, load csv.
main.py
import pandas as pd
df=pd.read_csv("room_tv.csv")
If you are using Jupyter Notebook, displaying df at this point shows the data created earlier as a table.
Next, let's take a look at the data. In Python, matplotlib makes plotting straightforward.
main.py
x=df["room_size"]
y=df["tv_inch"]
import matplotlib.pyplot as plt
plt.scatter(x,y)
plt.show()
Here, x is the size of the room and y is the size of the TV.
The data look easy to handle. A linear function seems sufficient here (**model decision**).
We could fit y=ax+b to this data as it is, but then two parameters, a and b, appear. The calculation can be done that way, but we **center the data** to eliminate one of them.
**Centering the data** Compute the mean of each variable and subtract it from every data point. With pandas, both taking the mean and subtracting it from all the data are one-liners. After centering, the fitted line passes through the origin, so only y = ax needs to be considered.
main.py
# Get the mean of each variable
xm=x.mean()
ym=y.mean()
# Center the data by subtracting the mean from every value
xc=x-xm
yc=y-ym
# Plot again
plt.scatter(xc,yc)
plt.show()
With the above code, the plot changes as follows.
It looks much like the previous graph, but the data are now centered on the origin. Since the intercept (b) no longer has to be considered, the calculations that follow become considerably easier.
\hat{y}=ax
We only need to find a.
We want to choose the line so that the gap between the predicted values and the measured values is as small as possible. To do that we set an evaluation function, which means the same thing as a "loss function" in data science. Without going into much detail here, we use the sum of squared errors and look for the a that makes it small. With y as the measured value and \hat{y} (y-hat) as the predicted value, it can be written as
\begin{align}
L&=(y_1-\hat{y}_1)^2+(y_2-\hat{y}_2)^2+\cdots+(y_N-\hat{y}_N)^2\\
&=\sum_{n=1}^{N}(y_n-\hat{y}_n)^2
\end{align}
Here, *L* is called the evaluation function.
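To make the evaluation function concrete, here is a small sketch that evaluates L for a few candidate slopes. It is only an illustration under assumptions: the centered values are hardcoded from the table (means 9 and 36.5 subtracted), the function name `squared_error` is made up for this example, and the candidate slopes are arbitrary.

```python
# Centered data: room_size - 9 and tv_inch - 36.5
xc = [-3.0, -1.0, 1.0, 3.0]
yc = [-12.5, -4.5, 3.5, 13.5]

def squared_error(a):
    """Sum of squared errors L for the model y_hat = a * x."""
    return sum((y - a * x) ** 2 for x, y in zip(xc, yc))

# Arbitrary candidate slopes, chosen only for illustration
candidates = [3.5, 4.0, 4.3, 4.5, 5.0]
losses = {a: squared_error(a) for a in candidates}

# The candidate with the smallest L
best_a = min(losses, key=losses.get)

for a, loss in losses.items():
    print(f"a = {a:.1f}  L = {loss:.2f}")
print("best candidate:", best_a)
```

Among these candidates, L falls as a approaches 4.3 and rises again beyond it, which is the bowl shape of a quadratic function.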
Since \hat{y}_n=ax_n, the evaluation function is a quadratic function of a with a positive coefficient on a^2. It is therefore smallest at the point where its slope is 0.
Finding the point where the slope of a quadratic function is 0 is high school mathematics: differentiate and solve for the point where the derivative "= 0".
Because the variable is a, we solve
\frac{\partial}{\partial a}(L)=0
Substituting \hat{y}_n=ax_n into the formula for L above and expanding gives
\begin{align}
L&=\sum_{n=1}^{N}y_n^2-2(\sum_{n=1}^{N}x_ny_n)a+(\sum_{n=1}^{N}x_n^2)a^2\\
&=c_0-2c_1a+c_2a^2
\end{align}
Substituting this into the condition above gives
\frac{\partial}{\partial a}(c_0-2c_1a+c_2a^2)=0\\
\\
a=\frac{\sum_{n=1}^{N}x_ny_n}{\sum_{n=1}^{N}x_n^2}
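For reference, the differentiation step and the numbers for this data set can be worked out by hand. This is a sketch that assumes the centered values from the table, x = (-3, -1, 1, 3) and y = (-12.5, -4.5, 3.5, 13.5), and writes c_1 and c_2 for the two sums; the constant term drops out when differentiating with respect to a.

```latex
\begin{align}
\frac{\partial L}{\partial a} &= -2c_1 + 2c_2 a = 0
\quad\Longrightarrow\quad a = \frac{c_1}{c_2} \\
c_1 &= \sum_{n=1}^{4} x_n y_n = 37.5 + 4.5 + 3.5 + 40.5 = 86 \\
c_2 &= \sum_{n=1}^{4} x_n^2 = 9 + 1 + 1 + 9 = 20 \\
a &= \frac{86}{20} = 4.3
\end{align}
```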
Let's write this in code.
main.py
# Compute the element-wise products that appear in the formula
xx=xc*xc
xy=xc*yc
# Compute the slope a
a=xy.sum()/xx.sum()
# Plot the centered data and the fitted line in the same (centered) coordinates
plt.scatter(xc,yc, label="y")
plt.plot(xc,a*xc, label="y_hat", color="green")
plt.legend()
plt.show()
The graph is shown below.
The line is drawn only over the range of the measured values, so the segment is short, but the slope a has been obtained.
The slope a has now been obtained. However, because the data were centered, when actually applying the result, don't forget to subtract the mean of x from the x value, multiply by a, and finally add the mean of y:
\hat{y}=a(x-\bar{x})+\bar{y}
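The step of undoing the centering can be sketched as a small prediction function. This is an illustration rather than part of the original script: it uses plain Python lists, and the function name `predict_tv_inch` is hypothetical.

```python
# Data from the table: room size (tatami mats) and TV size (inches)
room_size = [6, 8, 10, 12]
tv_inch = [24, 32, 40, 50]

# Means used for centering
xm = sum(room_size) / len(room_size)
ym = sum(tv_inch) / len(tv_inch)

# Centered data
xc = [x - xm for x in room_size]
yc = [y - ym for y in tv_inch]

# Slope from the formula a = sum(x*y) / sum(x*x) on the centered data
a = sum(x * y for x, y in zip(xc, yc)) / sum(x * x for x in xc)

def predict_tv_inch(tatami):
    """Predict TV size: subtract the mean of x, multiply by a, add the mean of y."""
    return a * (tatami - xm) + ym

for size in [6, 9, 12]:
    print(f"{size} tatami mats -> {predict_tv_inch(size):.1f} inches")
```

For a 9-tatami room, which is exactly the mean room size, the prediction equals the mean TV size of 36.5 inches. Values outside the 6 to 12 range are extrapolations and should be read with more caution.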
Finally, here is all of the code in one place.
main.py
import pandas as pd
import matplotlib.pyplot as plt
df=pd.read_csv("room_tv.csv")
x=df["room_size"]
y=df["tv_inch"]
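# Means of each variable
xm=x.mean()
ym=y.mean()
# Centering
xc=x-xm
yc=y-ym
# Slope from the least-squares formula
xx=xc*xc
xy=xc*yc
a=xy.sum()/xx.sum()
# Plot the centered data and the fitted line in the same (centered) coordinates
plt.scatter(xc,yc, label="y")
plt.plot(xc,a*xc, label="y_hat", color="green")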
plt.legend()
plt.show()
Bonus
main.py
# Summary statistics of the data
df.describe()
This summarizes each column of the data (count, mean, standard deviation, minimum, quartiles, and maximum). It is useful when handling larger and more complex data.
The code was short, but it does a lot. Next time, I would like to cover multiple regression analysis. Going forward, I plan to post not only programming topics such as coding and error handling, but also articles related to neuroscience.