This time, I will summarize the implementation (code) of Ridge regression.
We will proceed in the following seven steps: import the modules; load, standardize, and split the data; search for the optimal alpha; build the Ridge model; predict on the test data; check the residuals; and evaluate with the coefficient of determination.
First, import the required modules.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Module to load the dataset (note: load_boston was deprecated and later removed in scikit-learn 1.2)
from sklearn.datasets import load_boston
# Module to split the data into training and test sets
from sklearn.model_selection import train_test_split
# Module for standardization (variance normalization)
from sklearn.preprocessing import StandardScaler
# Module to search for the parameter (alpha)
from sklearn.linear_model import RidgeCV
# Module to plot the parameter (alpha) search
from yellowbrick.regressor import AlphaSelection
# Module that performs Ridge regression (least squares method + L2 regularization term)
from sklearn.linear_model import Ridge
# Load the Boston housing dataset
boston = load_boston()
# Separate into the explanatory variables (X) and the objective variable (y)
X, y = boston.data, boston.target
# Standardization (variance normalization)
SS = StandardScaler()
X = SS.fit_transform(X)
# Split into training data and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
In standardization, when, for example, one feature (explanatory variable) has 2-digit values and another has 4-digit values, the latter would dominate the model. Setting every feature to mean 0 and variance 1 aligns the scales.
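As a quick check, after fit_transform each column of X should have a mean of approximately 0 and a standard deviation of approximately 1:

# Per-column mean and standard deviation after standardization
print(X.mean(axis=0))  # all approximately 0
print(X.std(axis=0))   # all approximately 1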
Fixing the seed with random_state makes the result of the data split the same on every run.
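One caveat about the preprocessing above: the scaler was fitted on the full dataset before splitting, so statistics from the test rows leak into the training features. A leak-free variant (same API, just split first) would look like this:

# Split first, then fit the scaler on the training data only
X_train, X_test, y_train, y_test = train_test_split(boston.data, y, test_size=0.3, random_state=123)
SS = StandardScaler()
X_train = SS.fit_transform(X_train)  # learn mean/variance from the training data
X_test = SS.transform(X_test)        # reuse the same transform on the test data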
Ridge regression adds a regularization term to the least-squares objective to avoid overfitting: increasing alpha makes the regularization stronger, and decreasing alpha makes it weaker.
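Concretely, the objective minimized is the sum of squared errors plus an L2 penalty on the weights (scikit-learn does not penalize the intercept $ w_0 $):

$ L(w) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{13} w_j^2 $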
Perform grid search and cross-validation on the training data to find the optimal alpha.
# Set the search range of the parameter (alpha): 500 candidates from 1e-10 to 10
alphas = np.logspace(-10, 1, 500)
# Cross-validate on the training data to find the optimal alpha
ridgeCV = RidgeCV(alphas=alphas)
# Plot the error for each alpha and mark the best one
visualizer = AlphaSelection(ridgeCV)
visualizer.fit(X_train, y_train)
visualizer.show()
Output result
From the plot above, the optimal alpha = 8.588 was found.
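If you only need the number and not the plot, RidgeCV exposes the selected value directly after fitting (by default it uses an efficient form of leave-one-out cross-validation); a minimal sketch:

# Fit RidgeCV directly; alpha_ holds the alpha selected by cross-validation
ridgeCV.fit(X_train, y_train)
print(ridgeCV.alpha_)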
We will create a model of Ridge regression using the parameter (alpha) obtained earlier.
# Create an instance of Ridge regression with the alpha found above
ridge = Ridge(alpha=8.588)
# Fit the model to the training data (least squares method + L2 regularization term)
ridge.fit(X_train, y_train)
# Output the intercept
print(ridge.intercept_)
# Output the regression coefficients (slopes)
print(ridge.coef_)
Output result
ridge.intercept_: 22.564747201001634
ridge.coef_: [-0.80818323 0.81261982 0.24268597 0.10593523 -1.39093785 3.4266411
-0.23114806 -2.53519513 1.7685398 -1.62416829 -1.99056814 0.57450373
-3.35123162]
ridge.intercept_: the intercept (weight $ w_0 $); ridge.coef_: the regression coefficients / slopes (weights $ w_1 $ to $ w_{13} $)
With these, the concrete numerical values of the model formula (regression formula) are obtained.
$ y = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_{12} x_{12} + w_{13} x_{13} $
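As a sanity check that predict really computes this formula, the first test sample can be evaluated by hand; a minimal sketch using the fitted ridge from above:

# Manual prediction for the first test sample: w_0 + w_1*x_1 + ... + w_13*x_13
manual = ridge.intercept_ + X_test[0] @ ridge.coef_
# Should agree with the library's prediction
print(np.isclose(manual, ridge.predict(X_test[:1])[0]))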
Feed the test data (X_test) into the fitted model to obtain the predicted values (y_pred).
y_pred = ridge.predict(X_test)
y_pred
Output result
y_pred: [15.25513373 27.80625237 39.25737057 17.59408487 30.55171547 37.48819278
25.35202855 ..... 17.59870574 27.10848827 19.12778747 16.60377079 22.13542152]
Residual: the difference between the predicted value and the true value (y_pred - y_test).
# Create the figure and subplot
fig, ax = plt.subplots()
# Residual plot: predicted values vs. residuals
ax.scatter(y_pred, y_pred - y_test, marker='o')
# Plot a red horizontal line at y = 0
ax.hlines(y=0, xmin=-10, xmax=50, linewidth=2, color='red')
# Set the axis labels
ax.set_xlabel('y_pred')
ax.set_ylabel('y_pred - y_test')
# Add the graph title
ax.set_title('Residual Plot')
plt.show()
Output result
The data points are well balanced above and below the red line (y_pred - y_test = 0), so we can confirm there is no large bias in the predicted values.
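To back the visual impression with a number, the mean residual can also be checked; it should be close to zero if the predictions are unbiased:

# A mean residual near zero supports the "no large bias" reading
print((y_pred - y_test).mean())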
Finally, we evaluate the model using the coefficient of determination ($ R^2 $).
# Score on the training data
print(ridge.score(X_train, y_train))
# Score on the test data
print(ridge.score(X_test, y_test))
Output result
Train Score: 0.763674626990198
Test Score: 0.6462122981958535
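For reference, score for a scikit-learn regressor returns the coefficient of determination, where $ \hat{y}_i $ are the predictions and $ \bar{y} $ is the mean of the true values:

$ R^2 = 1 - \dfrac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2} $

A value of 1 would be a perfect fit, and the gap between the train score (about 0.764) and the test score (about 0.646) gives a rough sense of how much the model still overfits.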
This time, for beginners, I summarized only the implementation (code). In a future article, I would like to cover the theory (the mathematical formulas).
Thank you for reading.