I have been studying regularization as an extension of multiple regression analysis. This time, I will summarize ridge regression (L2 regularization).
In trying to understand ridge regression (L2 regularization), I referred to the following.
- Essence of Machine Learning, Koichi Kato (Author), SB Creative Co., Ltd.
- Types and purposes of regularization: L1 regularization and L2 regularization
- Theory and implementation of ridge regression and lasso regression from scratch
In ridge regression, a regularization term is added to the loss function used in multiple regression analysis. Multiple regression analysis derives the optimal regression equation by finding the weights that minimize a loss function such as the following:

$$L = \sum_{n}(y_{n} - \hat{y}_{n})^{2}$$

- $y_{n}$ is the measured value
- $\hat{y}_{n}$ is the predicted value
Expressed in vector form, it looks like this:

$$L = (\boldsymbol{y} - X\boldsymbol{w})^{T}(\boldsymbol{y} - X\boldsymbol{w})$$

- $\boldsymbol{y}$ is the vector of measured values of the objective variable
- $\boldsymbol{w}$ is the vector of regression coefficients of the multiple regression equation
- $X$ is the matrix of measured values of the explanatory variables, with $n$ samples and $m$ variables
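As a quick illustration, this vectorized loss can be computed directly with NumPy. The following is a minimal sketch using made-up toy data and an arbitrary candidate weight vector:

```python
import numpy as np

# Made-up toy data: 5 samples, 2 explanatory variables
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([3.0, 3.5, 7.0, 7.5, 10.0])
w = np.array([1.0, 1.0])  # arbitrary candidate weight vector

# L = (y - Xw)^T (y - Xw), i.e. the sum of squared residuals
residual = y - X @ w
L = residual.T @ residual  # equivalent to np.sum((y - X @ w) ** 2)
print(L)
```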
It suffices to find the weight $\boldsymbol{w}$ that minimizes the above $L$. Differentiating $L$ with respect to $\boldsymbol{w}$ and setting the result to $0$ gives the following:

$$-2X^{T}\boldsymbol{y} + 2X^{T}X\boldsymbol{w} = 0$$

By solving this, the weight $\boldsymbol{w}$ can be obtained.
The formula for the loss function of ridge regression (L2 regularization) is as follows:

$$L = (\boldsymbol{y} - X\boldsymbol{w})^{T}(\boldsymbol{y} - X\boldsymbol{w}) + \lambda||\boldsymbol{w}||_{2}^{2}$$

It takes the form of adding the regularization term $\lambda||\boldsymbol{w}||_{2}^{2}$ to the loss function of multiple regression analysis. In ridge regression (L2 regularization), regularization is performed by adding the square of the L2 norm of the weights $\boldsymbol{w}$ in this way.
The square root of the sum of squares of the components of the difference between two vectors (the so-called "ordinary distance", i.e. the Euclidean distance) is the L2 norm. A norm is an index of "magnitude"; the L1 norm and the L∞ norm are also commonly used.
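As a small check, these norms can be computed with NumPy's `np.linalg.norm`. A minimal sketch with a made-up vector:

```python
import numpy as np

w = np.array([3.0, -4.0])  # made-up weight vector

l2 = np.linalg.norm(w, ord=2)         # sqrt(3^2 + (-4)^2) = 5.0
l1 = np.linalg.norm(w, ord=1)         # |3| + |-4| = 7.0
linf = np.linalg.norm(w, ord=np.inf)  # max(|3|, |-4|) = 4.0
print(l2, l1, linf)
```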
Adding a regularization term to the loss function has the effect of reducing the magnitude of the weights $\boldsymbol{w}$.
The following figure is often used to explain ridge regression (L2 regularization). Since it is hard to follow as it is, explanations have been added to the figure. It is a two-dimensional contour plot of the loss function value when there are two weight parameters. You can see from the figure how the weight values are kept small by adding the regularization term.
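Since the figure itself is not reproduced here, the following is a minimal sketch (with made-up toy data and an arbitrarily chosen constraint radius) of how this kind of plot can be drawn: contours of the squared-error loss over two weights, with a circle indicating the region favored by the L2 penalty.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up toy data with two explanatory variables
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([3.0, 2.0]) + rng.normal(scale=0.5, size=50)

# Squared-error loss evaluated on a grid of weight pairs (w1, w2)
w1, w2 = np.meshgrid(np.linspace(-1.0, 5.0, 200), np.linspace(-1.0, 5.0, 200))
W = np.stack([w1.ravel(), w2.ravel()])  # shape (2, n_grid_points)
loss = ((y[:, None] - X @ W) ** 2).sum(axis=0).reshape(w1.shape)

fig, ax = plt.subplots()
ax.contour(w1, w2, loss, levels=30)
# Circle indicating the region favored by the L2 penalty: ||w||_2 <= t (t chosen arbitrarily)
ax.add_patch(plt.Circle((0.0, 0.0), 2.0, fill=False, color="red"))
ax.set_xlabel("$w_1$")
ax.set_ylabel("$w_2$")
ax.set_aspect("equal")
plt.show()
```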
The weights are calculated by differentiating the above loss function with respect to $\boldsymbol{w}$ and setting the result to $0$.
$$\begin{aligned}
-2X^{T}\boldsymbol{y} + 2X^{T}X\boldsymbol{w} + 2\lambda\boldsymbol{w} &= 0 \\
(X^{T}X + \lambda I)\boldsymbol{w} - X^{T}\boldsymbol{y} &= 0 \\
(X^{T}X + \lambda I)\boldsymbol{w} &= X^{T}\boldsymbol{y} \\
\boldsymbol{w} &= (X^{T}X + \lambda I)^{-1}X^{T}\boldsymbol{y}
\end{aligned}$$
We have now derived the weight $\boldsymbol{w}$.
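As a quick sanity check of this formula, here is a minimal sketch with made-up toy data; with $\lambda = 0$ the result should match the ordinary least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

lam = 1.0
I = np.eye(X.shape[1])

# Closed-form ridge solution derived above
w_ridge = np.linalg.inv(X.T @ X + lam * I) @ X.T @ y

# With lambda = 0 this reduces to the ordinary least-squares solution
w_ols = np.linalg.inv(X.T @ X) @ X.T @ y
print(w_ridge)
print(w_ols)
print(np.allclose(w_ols, np.linalg.lstsq(X, y, rcond=None)[0]))  # True
```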
The following is a self-implementation of the ridge regression (L2 regularization) model.
```python
import numpy as np

class RidgeReg:

    def __init__(self, lambda_=1.0):
        self.lambda_ = lambda_
        self.coef_ = None
        self.intercept_ = None

    def fit(self, X, y):
        # Add a column of ones as the first column of the explanatory-variable
        # matrix so that the intercept is estimated together with the weights
        X = np.insert(X, 0, 1, axis=1)
        # Identity matrix for the regularization term
        # (note: this penalizes the bias term as well, unlike scikit-learn's Ridge,
        # which excludes the intercept from the penalty)
        i = np.eye(X.shape[1])
        # Closed-form solution for the weights: w = (X^T X + lambda I)^{-1} X^T y
        temp = np.linalg.inv(X.T @ X + self.lambda_ * i) @ X.T @ y
        # Regression coefficients
        self.coef_ = temp[1:]
        # Intercept
        self.intercept_ = temp[0]

    def predict(self, X):
        # Return the predicted values of the ridge regression model
        return X @ self.coef_ + self.intercept_
```
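As a side note on the design, explicitly inverting $X^{T}X + \lambda I$ works, but solving the linear system directly is generally more numerically stable. A sketch of an alternative for the corresponding line inside `fit`:

```python
# Equivalent to the inv(...)-based line above, but avoids forming an explicit inverse
temp = np.linalg.solve(X.T @ X + self.lambda_ * i, X.T @ y)
```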
Let us verify that the self-implemented model above matches the scikit-learn results. This time, we verify using the Boston house price dataset. Details of the dataset can be found in the Article on Verification of Multiple Regression Analysis.
```python
from sklearn.datasets import load_boston
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
import pandas as pd

# Load the data
boston = load_boston()

# Convert to a pandas DataFrame once
df = pd.DataFrame(boston.data, columns=boston.feature_names)

# Get the objective variable (the value we want to predict)
target = boston.target
df['target'] = target

X = df[['INDUS', 'CRIM']].values
X = StandardScaler().fit_transform(X)
y = df['target'].values

clf = Ridge(alpha=1)
clf.fit(X, y)

print(clf.coef_)
print(clf.intercept_)
```
Here is the output. From the top: the regression coefficients and the intercept of the model.
[-3.58037552 -2.1078602 ]
22.532806324110677
```python
X = df[['INDUS', 'CRIM']].values
X = StandardScaler().fit_transform(X)
y = df['target'].values

linear = RidgeReg(lambda_=1)
linear.fit(X, y)

print(linear.coef_)
print(linear.intercept_)
```
Here is the output. From the top: the regression coefficients and the intercept of the model.
[-3.58037552 -2.1078602 ]
22.532806324110677
The regression coefficients were exactly the same, but there was a subtle difference in the intercept. I investigated it but could not identify the cause, so I am posting it as it is. If anyone knows, please point it out.

Next, I will try to understand lasso regression (L1 regularization) and implement it myself.