Ridge and Lasso

Hello. Yesterday I wrote about ordinary least squares (OLS), a typical linear regression model. This time I will write about Ridge regression and Lasso regression.

Preparation

Regression with a linear model literally uses a linear function to predict the target (objective) variable. The prediction can be expressed by the following formula.

y' = w_1x_1 + w_2x_2 + \dots + w_mx_m \quad (x_1, x_2, \dots, x_m: explanatory variables, \; y': predicted value, \; w_1, w_2, \dots, w_m: regression coefficients)

If you write it with the intercept added,

y' = w_0x_0 + w_1x_1 + w_2x_2 + \dots + w_mx_m \quad (w_0: intercept, \; x_0 = 1)

When written as a vector,

y' = {\bf x^T}{\bf w}

Since there is usually more than one sample rather than a single vector x of explanatory variables, the data are arranged as a matrix. In other words,

\left(
    \begin{array}{c}
      y_{1} \\
      y_{2} \\
      y_{3} \\
      \vdots \\
      y_{n}
    \end{array}
  \right)
=
\left(
    \begin{array}{ccccc}
      1 & x_{11} & x_{21} & \ldots & x_{m1} \\
      1 & x_{12} & x_{22} & \ldots & x_{m2} \\
      \vdots & \vdots & \vdots & \ddots & \vdots \\
      1 & x_{1n} & x_{2n} & \ldots & x_{mn}
    \end{array}
  \right)
\left(
    \begin{array}{c}
      w_{0} \\
      w_{1} \\
      w_{2} \\
      \vdots \\
      w_{m}
    \end{array}
  \right)
\\
y = {\bf Xw} \dots (1)
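As a minimal sketch of equation (1) (the numbers and variable names are made up purely for illustration), the same structure can be reproduced in NumPy by stacking a column of ones for the intercept in front of the explanatory variables:

```python
import numpy as np

# 4 samples, 2 explanatory variables (illustrative values only)
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5],
                  [4.0, 3.0]])

# prepend the x_0 = 1 column so that w_0 acts as the intercept
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

# w = (w_0, w_1, w_2): intercept followed by the regression coefficients
w = np.array([0.5, 2.0, -1.0])

# equation (1): y = Xw
y_pred = X @ w
print(y_pred)
```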

Least squares

L = \sum_{i=1}^{n}(y_i - y'_i)^2 

The parameters are determined so as to minimize this loss (error) function L. I will omit the detailed derivation, but if you rewrite this expression using equation (1), it can be written as:

L = (y - {\bf Xw})^{T}(y - {\bf Xw})


This is the least squares method. However, with this formula alone, nothing constrains the size of the parameters **w**, so overfitting is likely to occur when they grow large.
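Before adding regularization, here is a small sketch of solving plain least squares directly (the toy data below is made up): minimizing L leads to the normal equation w = (X^T X)^{-1} X^T y.

```python
import numpy as np

# illustrative data: 5 samples, intercept column already included in X
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# normal equation: w = (X^T X)^{-1} X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # roughly [1.0, 2.0]: intercept and slope of the underlying line
```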

Adding a **regularization term** to this L gives the Ridge and Lasso regressions described below.

When written in a formula

L = (y - {\bf Xw})^{T}(y - {\bf Xw}) + a \times (L_p norm)

a is called a hyperparameter and is a scalar that controls the strength of the regularization. In other words, the larger it is, the stronger the regularization. However, if it is too strong, it causes underfitting, so be careful.
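Jumping ahead a little to Ridge regression (described below), here is a rough illustration of the effect of a (called alpha in scikit-learn). The dataset and the two alpha values are arbitrary choices; the point is only that a stronger penalty shrinks the coefficients more.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# synthetic data with several features (purely illustrative)
X, y = make_regression(n_samples=100, n_features=10, noise=10.0, random_state=0)

weak = Ridge(alpha=0.1).fit(X, y)
strong = Ridge(alpha=100.0).fit(X, y)

# the stronger the regularization, the smaller the coefficients overall
print(np.abs(weak.coef_).sum())    # larger
print(np.abs(strong.coef_).sum())  # smaller
```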

Here, using the L1 norm gives **Lasso** regression, and using the L2 norm gives **Ridge** regression. Interesting, isn't it?

For more on the L1 norm and the L2 norm, see here.

Simply put, the L1 norm is called the **Manhattan distance** and is the sum of the absolute values of the differences between the vector components. The L2 norm is called the **Euclidean distance** and is the square root of the sum of the squares of the differences between the vector components.
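A quick numerical check with NumPy (the vector is just an example):

```python
import numpy as np

v = np.array([3.0, -4.0, 1.0])

l1 = np.abs(v).sum()          # L1 norm (Manhattan distance): |3| + |-4| + |1| = 8
l2 = np.sqrt((v ** 2).sum())  # L2 norm (Euclidean distance): sqrt(9 + 16 + 1) ≈ 5.10

print(l1, l2)
# the same values via NumPy's built-in norm function
print(np.linalg.norm(v, ord=1), np.linalg.norm(v, ord=2))
```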

Ridge regression

Ridge regression is a regression with a linear model. As the formula below shows, ridge regression uses the squared L2 norm as its regularization term, so all components of the parameter **w** are shrunk smoothly toward zero as a whole. The reason is that squaring penalizes large components much more heavily, so it is the larger values that get pushed toward 0 first.

If you write in mathematical formulas

L = (y - {\bf Xw})^{T}(y - {\bf Xw}) + a ||{\bf w}||_2^2
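As a sketch of what minimizing this L looks like (ignoring the intercept for simplicity and using made-up data), the penalized normal equation w = (X^T X + aI)^{-1} X^T y can be evaluated directly:

```python
import numpy as np

# illustrative data, assumed centered so the intercept can be ignored
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=50)

a = 1.0  # regularization strength

# minimizing the ridge loss gives w = (X^T X + a I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + a * np.eye(X.shape[1]), X.T @ y)
print(w_ridge)  # slightly shrunk toward 0 compared to w_true
```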

Lasso regression

Lasso regression is also a regression with a linear model. Since the regularization term of lasso regression is simply the sum of the absolute values of the coefficients, some coefficients often become exactly 0. If you write it as a formula,

L = \frac{1}{2}(y - {\bf Xw})^{T}(y - {\bf Xw}) + a ||{\bf w}||_1
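To see the "some coefficients become exactly 0" behaviour, here is a small sketch with scikit-learn's Lasso on a synthetic dataset. The dataset parameters and alpha are arbitrary, and the exact number of surviving coefficients depends on them, but most coefficients typically end up exactly zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# synthetic data where only a few of the 20 features are actually informative
X, y = make_regression(n_samples=100, n_features=20, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)

# many coefficients are driven to exactly 0 by the L1 penalty
print(lasso.coef_)
print("non-zero coefficients:", np.sum(lasso.coef_ != 0))
```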

Summary

Mathematics is really difficult, isn't it? There are still parts I don't fully understand myself. But the most important takeaway is that the difference between Lasso regression and Ridge regression is **the difference between the L1 norm and the L2 norm**. Once you understand the L1 norm and the L2 norm, you can see the characteristics of each regression. I haven't studied enough yet, but I'll keep at it...

Addendum (sample code)


```python
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

import mglearn
import numpy as np
import matplotlib.pyplot as plt


# wave dataset from mglearn: one explanatory variable, one target
X, y = mglearn.datasets.make_wave(n_samples=40)
x_for_graph = np.linspace(-3, 3, 100)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fit both models with the default regularization strength (alpha=1.0)
ridge = Ridge().fit(X_train, y_train)
lasso = Lasso().fit(X_train, y_train)

fig, ax = plt.subplots()

# plot the training data and the two fitted lines
ax.scatter(X_train, y_train, marker='o')

ax.plot(x_for_graph, ridge.coef_ * x_for_graph + ridge.intercept_, label='Ridge')
ax.plot(x_for_graph, lasso.coef_ * x_for_graph + lasso.intercept_, label='Lasso')

ax.legend(loc='best')

print("Training set score for Ridge: {:.2f}".format(ridge.score(X_train, y_train)))
print("test set score: {:.2f}".format(ridge.score(X_test, y_test)))

print("Training set score for Lasso: {:.2f}".format(lasso.score(X_train, y_train)))
print("test set score: {:.2f}".format(lasso.score(X_test, y_test)))
plt.show()
```

Result (Figure_1.png): scatter of the training data with the fitted Ridge and Lasso lines.

Training set score for Ridge: 0.69
test set score: 0.64
Training set score for Lasso: 0.40
test set score: 0.55

There are too few explanatory variables in this data for the difference to really show... lol
