Understand and implement ridge regression (L2 regularization)

Introduction

I have been studying regularization as an extension of multiple regression analysis. In this post, I summarize ridge regression (L2 regularization).

References

In understanding the ridge regression (L2 regularization), I referred to the following.

- Essence of Machine Learning, Koichi Kato (Author), Publisher: SB Creative Co., Ltd.
- Types and purposes of regularization: L1 regularization and L2 regularization
- Theory and implementation of ridge regression and lasso regression from the beginning

Ridge regression (L2 regularization) overview

Review of multiple regression analysis

Ridge regression adds a regularization term to the loss function used in multiple regression analysis. Multiple regression analysis derives the optimal regression equation by finding the weights that minimize a loss function such as the following:

L = \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^2

- $y_{i}$ is the measured value
- $\hat{y}_{i}$ is the predicted value

Expressed in vector form, it looks like this.

L = (\boldsymbol{y}-X\boldsymbol{w})^T(\boldsymbol{y}-X\boldsymbol{w})

- $\boldsymbol{y}$ is the vector of measured values of the objective variable
- $\boldsymbol{w}$ is the vector of regression coefficients of the multiple regression equation
- $X$ is the matrix of measured values of the explanatory variables, with $n$ samples and $m$ variables

We want the weight $\boldsymbol{w}$ that minimizes the above $L$. Differentiating $L$ with respect to $\boldsymbol{w}$ and setting the result to $0$ gives the following.

-2X^T\boldsymbol{y}+2X^TX\boldsymbol{w} = 0

By solving this, the weight $\boldsymbol{w}$ can be obtained.
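
For reference, solving the equation above (assuming $X^TX$ is invertible) gives the familiar least-squares solution:

\boldsymbol{w} = (X^TX)^{-1}X^T\boldsymbol{y}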

Ridge regression (L2 regularization)

L = (\boldsymbol{y} - X\boldsymbol{w})^T (\boldsymbol{y} - X\boldsymbol{w}) + \lambda||\boldsymbol{w}||_{2}^{2}

The above is the loss function of ridge regression (L2 regularization). It takes the form of adding the regularization term $\lambda||\boldsymbol{w}||_{2}^{2}$ to the loss function of multiple regression analysis. In other words, in ridge regression (L2 regularization), regularization is performed by adding the square of the L2 norm of the weight $\boldsymbol{w}$.

What is the L2 norm

The L2 norm of a vector is the square root of the sum of the squares of its components; for the difference of two vectors this is the so-called "ordinary distance", the Euclidean distance. The norm is an index of "magnitude", and besides the L2 norm, the L1 norm and the L∞ norm are also used.
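
As a small sketch (the vector below is made up for illustration), the L2 norm can be computed with NumPy either by hand or with np.linalg.norm:

import numpy as np

w = np.array([3.0, 4.0])  #example vector (made up for illustration)

#L2 norm: square root of the sum of squared components
l2_manual = np.sqrt(np.sum(w ** 2))
l2_numpy = np.linalg.norm(w, ord=2)

print(l2_manual)  #5.0
print(l2_numpy)   #5.0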

Effect of L2 regularization

Adding a regularization term to the loss function has the effect of reducing the magnitude of the weight $\boldsymbol{w}$. In ordinary multiple regression analysis, the only term to minimize is $(\boldsymbol{y} - X\boldsymbol{w})^T (\boldsymbol{y} - X\boldsymbol{w})$. When regularization is added, the term $\lambda||\boldsymbol{w}||_{2}^{2}$ must be minimized as well. Since the weight $\boldsymbol{w}$ now contributes directly to the loss function, its values are pushed toward smaller magnitudes, and the degree of this shrinkage is controlled by the size of $\lambda$.
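
As a rough sketch of this shrinkage effect (the synthetic data and the alpha values below are made up for illustration), we can fit sklearn's Ridge with several regularization strengths and watch the coefficients shrink:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

#Synthetic data (made up for illustration): y depends on two explanatory variables
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

#The larger alpha (lambda) is, the smaller the estimated weights become
for alpha in [0.1, 1.0, 10.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    print(alpha, model.coef_)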

The following figure is often used to explain ridge regression (L2 regularization). Since it is hard to follow as it is, explanatory notes have been added to the figure. It is a two-dimensional contour plot of the loss function value when there are two weight parameters, and you can see from the figure how adding the regularization term reduces the weight values.

[Figure 1]

Derivation of weights for ridge regression (L2 regularization)

L = (\boldsymbol{y} - X\boldsymbol{w})^T (\boldsymbol{y} - X\boldsymbol{w}) + \lambda||\boldsymbol{w}||_{2}^{2}

Differentiate the above loss function with respect to the weight $\boldsymbol{w}$, set the result to $0$, and solve.

-2X^T\boldsymbol{y}+2X^TX\boldsymbol{w}+2\lambda\boldsymbol{w} = 0 \\
(X^TX+\lambda I)\boldsymbol{w} - X^T\boldsymbol{y} = 0 \\
(X^TX+\lambda I)\boldsymbol{w} = X^T\boldsymbol{y} \\
\boldsymbol{w} = (X^TX+\lambda I)^{-1}X^T\boldsymbol{y}

We were able to derive the weight $\boldsymbol{w}$ here.

Implement ridge regression (L2 regularization)

Implementation

The following is my own implementation of the ridge regression (L2 regularization) model.


import numpy as np

class RidgeReg:

    def __init__(self, lambda_ = 1.0):
        self.lambda_ = lambda_
        self.coef_ = None
        self.intercept_ = None

    def fit(self, X, y):
        #Add a column of ones as the first column of the explanatory variable matrix so that the intercept is included in the calculation
        X = np.insert(X, 0, 1, axis=1)
        #Create an identity matrix
        i = np.eye(X.shape[1])
        #Calculation formula to calculate the weight
        temp = np.linalg.inv(X.T @ X + self.lambda_ * i) @ X.T @ y
        #This is the value of the regression coefficient
        self.coef_ = temp[1:]
        #This is the intercept value
        self.intercept_ = temp[0]
        
    def predict(self, X):
        #Returns the predicted value by the ridge regression model
        return (X @ self.coef_ + self.intercept_)
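
As a quick end-to-end check, here is a usage sketch of the class on synthetic data (the data and the lambda_ value are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)

#Synthetic data (made up for illustration): 100 samples, 2 explanatory variables
X = rng.normal(size=(100, 2))
y = 1.5 * X[:, 0] - 0.5 * X[:, 1] + 2.0 + rng.normal(scale=0.1, size=100)

model = RidgeReg(lambda_=1.0)
model.fit(X, y)

print(model.coef_)           #regression coefficients
print(model.intercept_)      #intercept
print(model.predict(X[:5]))  #predictions for the first five samples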

Verification

Let's verify that the self-implemented model above matches the results from sklearn. This time, we verify using the Boston house price dataset. Details of the dataset can be found in the Article on Verification of Multiple Regression Analysis.
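
Note that load_boston has been removed from recent scikit-learn releases (1.2 and later). If it is unavailable in your environment, one possible workaround (a sketch, not the code used in this article) is to fetch the same data from OpenML:

from sklearn.datasets import fetch_openml

#The "boston" dataset on OpenML has the same columns (CRIM, INDUS, ..., with MEDV as the target)
boston = fetch_openml(name="boston", version=1, as_frame=True)
df = boston.data.copy()
df['target'] = boston.target.astype(float)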

sklearn model

from sklearn.datasets import load_boston
import pandas as pd
from sklearn.preprocessing import StandardScaler

#Data reading
boston = load_boston()

#Once converted to pandas data frame format
df = pd.DataFrame(boston.data, columns=boston.feature_names)

#Get the objective variable (value you want to expect)
target = boston.target

df['target'] = target

from sklearn.linear_model import Ridge

X = df[['INDUS', 'CRIM']].values
X = StandardScaler().fit_transform(X)
y = df['target'].values

clf = Ridge(alpha=1)

clf.fit(X, y)

print(clf.coef_)
print(clf.intercept_)

Here is the output. From the top: the regression coefficients and the intercept of the model.

[-3.58037552 -2.1078602 ]
22.532806324110677

Self-implemented model

X = df[['INDUS', 'CRIM']].values
X = StandardScaler().fit_transform(X)
y = df['target'].values

linear = RidgeReg(lambda_ = 1)

linear.fit(X,y)

print(linear.coef_)
print(linear.intercept_)

Here is the output. From the top: the regression coefficients and the intercept of the model.

[-3.58037552 -2.1078602 ]
22.532806324110677

The regression coefficients matched exactly, but for some reason there was a subtle difference in the intercept. I investigated it but could not find the cause, so I am posting it as is. If anyone knows the reason, please point it out...

Next

Next, I will try to understand and implement lasso regression (L1 regularization) on my own.
