Previous articles in this series:

- University of Tsukuba Machine Learning Course: Study sklearn while creating the Python script part of the assignment (1)
- University of Tsukuba Machine Learning Course: Study sklearn while creating the Python script part of the assignment (2)

Source code: https://github.com/legacyworld/sklearn-basic
The YouTube commentary is in the 4th video (1), around the 40-minute mark. The task: generate 30 training samples on $y = \cos(1.5\pi x)$ with noise of $N(0,1) \times 0.1$ added, then perform polynomial regression, fitting each degree in order from 1 to 20. Cross-validation enters the series from this task. The training data looks like this:
Source code
python:Homework_3.2.py
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures as PF
from sklearn import linear_model
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
DEGREE = 20
def true_f(x):
    return np.cos(1.5 * x * np.pi)
np.random.seed(0)
n_samples = 30
# x-axis data for plotting
x_plot = np.linspace(0, 1, 100)
# Training data
x_tr = np.sort(np.random.rand(n_samples))
y_tr = true_f(x_tr) + np.random.randn(n_samples) * 0.1
# Reshape to column vectors (matrices) for sklearn
X_tr = x_tr.reshape(-1, 1)
X_plot = x_plot.reshape(-1, 1)
for degree in range(1, DEGREE + 1):
    plt.scatter(x_tr, y_tr, label="Training Samples")
    plt.plot(x_plot, true_f(x_plot), label="True")
    plt.xlim(0, 1)
    plt.ylim(-2, 2)
    filename = f"{degree}.png"
    # Chain the polynomial feature expansion and the linear regression
    pf = PF(degree=degree, include_bias=False)
    linear_reg = linear_model.LinearRegression()
    steps = [("Polynomial_Features", pf), ("Linear_Regression", linear_reg)]
    pipeline = Pipeline(steps=steps)
    pipeline.fit(X_tr, y_tr)
    plt.plot(x_plot, pipeline.predict(X_plot), label="Model")
    # Training error: MSE on the data the model was fitted to
    y_predict = pipeline.predict(X_tr)
    mse = mean_squared_error(y_tr, y_predict)
    # Test error: 10-fold cross-validation (scores are negative MSE)
    scores = cross_val_score(pipeline, X_tr, y_tr, scoring="neg_mean_squared_error", cv=10)
    plt.title(f"Degree: {degree} TrainErr: {mse:.2e} TestErr: {-scores.mean():.2e}(+/- {scores.std():.2e})")
    plt.legend()
    plt.savefig(filename)
    plt.clf()
In the previous Task 3.1, I prepared $x, x^2, x^3$, and so on with PolynomialFeatures and then ran LinearRegression as a separate step, but I learned that the two can be done in one shot by using a Pipeline.
When I actually looked at the source code in the explanation video for Exercise 3.1, it also used a Pipeline.
There is nothing difficult about it: you just list the processing steps in `steps`.
steps = [("Polynomial_Features", pf), ("Linear_Regression", linear_reg)]
pipeline = Pipeline(steps=steps)
pipeline.fit(X_tr, y_tr)
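For comparison, here is a minimal sketch (my own illustration on made-up toy data, not the course code) of the Task 3.1 style manual two-step approach next to the Pipeline version; both produce identical predictions.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Toy data (illustrative only)
X = np.linspace(0, 1, 10).reshape(-1, 1)
y = np.cos(1.5 * np.pi * X).ravel()

# Task 3.1 style: expand features explicitly, then fit on the expanded matrix
poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(X)
manual = LinearRegression().fit(X_poly, y)

# Pipeline style: both steps wrapped into a single estimator
pipe = Pipeline([("poly", PolynomialFeatures(degree=3, include_bias=False)),
                 ("reg", LinearRegression())])
pipe.fit(X, y)

print(np.allclose(manual.predict(poly.transform(X)), pipe.predict(X)))  # True

A nice side effect is that the Pipeline can be handed as a single estimator to functions like cross_val_score, which then refit the whole chain on each fold.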
Apart from that, the difference from Task 3.1 is that cross-validation is added. It is this part of the program:
scores = cross_val_score(pipeline, X_tr, y_tr, scoring="neg_mean_squared_error", cv=10)
With `cv=10`, the data is split into 10 parts; one part at a time is held out as test data to evaluate the test error, while the model is trained on the remaining nine. Note that the `neg_mean_squared_error` scorer returns the negative MSE for each fold, which is why the title string flips the sign with `-scores.mean()`. A rough manual equivalent is sketched below.
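This sketch of what cross_val_score is doing internally is my own illustration; it assumes the unshuffled KFold splitter that scikit-learn uses by default for regression targets.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(0)
x = np.sort(rng.rand(30))
y = np.cos(1.5 * np.pi * x) + rng.randn(30) * 0.1
X = x.reshape(-1, 1)

pipe = Pipeline([("poly", PolynomialFeatures(degree=3, include_bias=False)),
                 ("reg", LinearRegression())])

fold_mse = []
for train_idx, test_idx in KFold(n_splits=10).split(X):
    pipe.fit(X[train_idx], y[train_idx])   # train on the other 9 folds
    pred = pipe.predict(X[test_idx])       # evaluate on the held-out fold
    fold_mse.append(mean_squared_error(y[test_idx], pred))

# cross_val_score with scoring="neg_mean_squared_error" returns -MSE per fold,
# so -scores.mean() in the script corresponds to this average:
print(np.mean(fold_mse))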
Basically, the model with the smaller test error is the better one.
Running the program creates 20 graph files, 1.png through 20.png.
- Minimum training error: degree 20
- Minimum test error: degree 3
From this we can see how harmful overfitting is: the degree-20 model fits the training data best, yet its cross-validated test error is far worse than that of the degree-3 model. A sketch that tabulates these errors follows.
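To see where those minima come from, here is a minimal sketch (my own addition, reusing the same data generation as the script above) that collects the training and cross-validated test errors over all 20 degrees:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

np.random.seed(0)
x = np.sort(np.random.rand(30))
y = np.cos(1.5 * np.pi * x) + np.random.randn(30) * 0.1
X = x.reshape(-1, 1)

train_err, test_err = [], []
for degree in range(1, 21):
    pipe = Pipeline([("poly", PolynomialFeatures(degree=degree, include_bias=False)),
                     ("reg", LinearRegression())])
    pipe.fit(X, y)
    # Training error on the full training set
    train_err.append(mean_squared_error(y, pipe.predict(X)))
    # 10-fold cross-validated test error (sign flipped back to MSE)
    scores = cross_val_score(pipe, X, y, scoring="neg_mean_squared_error", cv=10)
    test_err.append(-scores.mean())

print("minimum training error at degree", np.argmin(train_err) + 1)
print("minimum test error at degree", np.argmin(test_err) + 1)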