https://ocw.tsukuba.ac.jp/course/systeminformation/machine_learning/ This course, which was actually taught to students at the University of Tsukuba and is now open to the public, is very easy to follow. The assignments frequently involve analysis in Python, but since programming is out of scope for the course, the programs are neither explained nor provided. The source code can be glimpsed in the videos, but that is not much use when I cannot run it myself. So I decided to write programs that reproduce the same results.
I know Python and Pandas, but what is scikit-learn? That is the level I am starting from. However, since I used the least squares method back in college, the mathematical background is no problem (although I have forgotten the matrix calculations).
It runs on Docker so that it can be used anytime, anywhere. Since matplotlib plots cannot be displayed directly in that environment, they are saved as PNG files instead. Cloning this GitHub repository sets up the environment: https://github.com/legacyworld/sklearn-basic The .devcontainer included there is required for Remote Development with VS Code.
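Saving plots to PNG instead of displaying them is the standard approach in a headless container. A minimal sketch, assuming no display is available (the filename here is just for illustration):

```python
# Minimal sketch: rendering a plot to a file in a headless environment
# (e.g. inside a Docker container with no X display).
import matplotlib
matplotlib.use("Agg")  # file-only backend; must be set before importing pyplot
import matplotlib.pyplot as plt

fig = plt.figure()
plt.plot([0, 1, 2], [0, 1, 4])
fig.savefig("example.png")  # written to disk; no GUI window is opened
```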
Exercise 1.4 is the first program. The explanation starts around the 49-minute mark of lecture 2 (1), multiple regression. The exercise itself is explained from around the 43-minute mark, but the result shown there is not correct because scaling is not applied in that program.
python:Homework_1.4.py
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from sklearn import preprocessing

# Import the wine quality data (winequality-red.csv)
df = pd.read_csv('winequality-red.csv', sep=';')
# The target value quality is included, so create a dataframe with it dropped
df1 = df.drop(columns='quality')
y = df['quality'].values.reshape(-1, 1)
scaler = preprocessing.StandardScaler()
# Simple regression performed for each column
for column in df1:
    x = df[column]
    fig = plt.figure()
    plt.xlabel(column)
    plt.ylabel('quality')
    plt.scatter(x, y)
    # Convert to a 2D array (matrix)
    X = x.values.reshape(-1, 1)
    # Scaling
    X_fit = scaler.fit_transform(X)
    model = linear_model.LinearRegression()
    model.fit(X_fit, y)
    plt.plot(x, model.predict(X_fit))
    mse = mean_squared_error(model.predict(X_fit), y)
    print(f"quality = {model.coef_[0][0]} * {column} + {model.intercept_[0]}")
    print(f"MSE: {mse}")
    filename = f"{column}.png"
    fig.savefig(filename)
# Multiple regression
X = df1.values
X_fit = scaler.fit_transform(X)
model = linear_model.LinearRegression()
model.fit(X_fit, y)
print(model.coef_, model.intercept_)
The place where you will almost certainly get stuck with sklearn is building the matrix. You will definitely see this error:
ValueError: Expected 2D array, got 1D array instead:
array=[7.4 7.8 7.8 ... 6.3 5.9 6. ].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Let's see what is going on:
X = x.values.reshape(-1,1)
print(f"Before conversion{x.values}")
print(f"After conversion{X}")
Then, the output will be as follows.
Before conversion[ 9.4 9.8 9.8 ... 11. 10.2 11. ]
After conversion[[ 9.4]
[ 9.8]
[ 9.8]
...
[11. ]
[10.2]
[11. ]]
It has been converted to a two-dimensional array. After that, the calculation actually runs and produces results. Each graph is saved as a PNG named after the feature (fixed acidity.png, etc.).
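For reference, `reshape(-1, 1)` is not the only way to get the 2D shape sklearn expects. A small sketch with a toy dataframe (the column names and values here are just for illustration):

```python
# Equivalent ways to turn one dataframe column into the (n_samples, 1)
# shape that sklearn estimators require.
import pandas as pd
import numpy as np

df = pd.DataFrame({"alcohol": [9.4, 9.8, 11.0], "quality": [5, 5, 6]})

X1 = df["alcohol"].values.reshape(-1, 1)      # Series -> 1D array -> 2D array
X2 = df[["alcohol"]].values                   # double brackets keep a DataFrame, already 2D
X3 = df["alcohol"].to_numpy()[:, np.newaxis]  # np.newaxis adds the column axis

print(X1.shape, X2.shape, X3.shape)  # all (3, 1)
```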
The multiple regression at the end was added for my own study. The following is the execution result:
[root@316e28b88f45 workspace]# python test.py
quality = 0.10014898994431619 * fixed acidity + 5.6360225140712945
MSE: 0.6417307196439609
quality = -0.3153038874367112 * volatile acidity + 5.6360225140712945
MSE: 0.5523439983981253
quality = 0.18275435128971876 * citric acid + 5.6360225140712945
MSE: 0.6183613869155018
quality = 0.0110857825729839 * residual sugar + 5.6360225140712945
MSE: 0.6516376452555722
quality = -0.10406844138289646 * chlorides + 5.6360225140712945
MSE: 0.6409302993389623
quality = -0.04089548993375638 * free sulfur dioxide + 5.6360225140712945
MSE: 0.6500880987339057
quality = -0.14943458718129748 * total sulfur dioxide + 5.6360225140712945
MSE: 0.6294298439847829
quality = -0.14121524469500035 * density + 5.636022514071298
MSE: 0.6318187944965589
quality = -0.046607526450713255 * pH + 5.6360225140712945
MSE: 0.6495882783089737
quality = 0.20295710475205553 * sulphates + 5.6360225140712945
MSE: 0.6105689534614908
quality = 0.3844171096080022 * alcohol + 5.6360225140712945
MSE: 0.503984025671457
[[ 0.04349735 -0.19396667 -0.03555254 0.02301871 -0.08818339 0.04560596
-0.10735582 -0.03373717 -0.06384247 0.1552765 0.29424288]] [5.63602251]
The intercept is the same regardless of which feature is used. Since it is the estimated value when the features are zero, and zero is the mean of every standardized feature, it is simply the mean of quality.
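This can be checked directly: for ordinary least squares, the intercept is ȳ − Σβⱼx̄ⱼ, and standardization makes every x̄ⱼ zero. A minimal sketch using synthetic data in place of the wine dataset (the coefficients and noise scale are arbitrary):

```python
# With standardized features, the OLS intercept equals the mean of y.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # synthetic features stand in for the wine data
y = X @ np.array([1.0, -2.0, 0.5]) + 5 + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(StandardScaler().fit_transform(X), y)
print(np.isclose(model.intercept_, y.mean()))  # True
```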
From this result:

- Model that gives the best prediction = model with the lowest MSE = alcohol
- Most positive effect on quality prediction = alcohol (0.384)
  - Not that a higher alcohol content is always better, but...
- Most negative effect on quality prediction = volatile acidity (-0.315)
  - Volatile acidity refers to volatile acids, which adversely affect wine: http://www.worldfinewines.com/winefaults2.html
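Reading effects off the raw coefficient array is error-prone; pairing each coefficient with its column name makes the ranking explicit. A sketch using a few of the standardized coefficients from the multiple regression output above (not the full set):

```python
# Pair coefficients with feature names and sort to find the strongest effects.
import pandas as pd

coefs = pd.Series({
    "volatile acidity": -0.194,
    "sulphates": 0.155,
    "alcohol": 0.294,
})
ranked = coefs.sort_values()
print(ranked.idxmin())  # most negative effect: volatile acidity
print(ranked.idxmax())  # most positive effect: alcohol
```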
It turned out to be a nice little exercise for learning about wine, too.