This time, I will summarize the features of linear regression, Ridge regression, and Lasso regression, and the differences between them.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import mglearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge, Lasso
from sklearn.linear_model import RidgeCV
from yellowbrick.regressor import AlphaSelection
mglearn is a helper module used here for loading the datasets and visualizing the plots.
This time we will use the extended Boston housing dataset. Unlike the original data, it has 104 features.
X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print('X:', X.shape)
print('y:', y.shape)
print('X_train:', X_train.shape)
print('y_train:', y_train.shape)
print('X_test:', X_test.shape)
print('y_test:', y_test.shape)
# X: (506, 104)
# y: (506,)
# X_train: (379, 104)
# y_train: (379,)
# X_test: (127, 104)
# y_test: (127,)
lr = LinearRegression().fit(X_train, y_train)
#Coefficient of determination (an index that measures the accuracy of regression model predictions)
print('Train set score: {}'.format(lr.score(X_train, y_train)))
print('Test set score: {}'.format(lr.score(X_test, y_test)))
# Train set score: 0.9520519609032729
# Test set score: 0.6074721959665842
score() returns the coefficient of determination (R², an index that measures how well the regression model's predictions fit the data).
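As a quick cross-check (a small sketch of my own, not part of the original code), the same value can be computed with sklearn.metrics.r2_score on the model's predictions.
from sklearn.metrics import r2_score
# R^2 = 1 - SS_res / SS_tot; score() for regressors returns exactly this value
print(r2_score(y_test, lr.predict(X_test)))  # matches lr.score(X_test, y_test)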
With plain linear regression, prediction accuracy on the training data at hand is high, while prediction accuracy on test data (unknown data) tends to be low.
To use a baseball analogy: it is like practicing only against straight pitches and then being unable to handle curveballs at all in a real game.
In practical work and in Kaggle, generalization performance (the ability to handle unknown data in production) is what matters.
For this data, we can see that the linear regression model is unsuitable.
Fitting the training data too closely, as above, so that generalization performance on the test data deteriorates is called "overfitting".
To prevent this, Ridge regression applies regularization, controlled by the parameter alpha.
Overfitting is likely to occur when the regression coefficients (the weights on each feature) are large or vary widely. Increasing the parameter alpha pushes the regression coefficients toward zero.
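To make the role of alpha concrete, here is a minimal sketch (my addition, assuming fit_intercept=False so that the closed-form solution applies) of the quantity Ridge minimizes, ||y - Xw||^2 + alpha * ||w||^2.
# Sketch: with fit_intercept=False, Ridge solves w = (X^T X + alpha * I)^(-1) X^T y
alpha = 1.0
I = np.eye(X_train.shape[1])
w = np.linalg.solve(X_train.T @ X_train + alpha * I, X_train.T @ y_train)
ridge_check = Ridge(alpha=alpha, fit_intercept=False).fit(X_train, y_train)
print(np.allclose(w, ridge_check.coef_))  # True (up to numerical tolerance)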
ridge = Ridge(alpha=1).fit(X_train, y_train)
print('Training set score: {}'.format(ridge.score(X_train, y_train)))
print('Test set score: {}'.format(ridge.score(X_test, y_test)))
# Training set score: 0.885796658517094
# Test set score: 0.7527683481744752
Fit to the training data is reduced, but generalization performance on the test data improves. The alpha of Ridge regression defaults to 1, so let's try other values as well.
ridge10 = Ridge(alpha=10).fit(X_train, y_train)
print('Training set score: {:.2f}'.format(ridge10.score(X_train, y_train)))
print('Test set score: {:.2f}'.format(ridge10.score(X_test, y_test)))
# Training set score: 0.79
# Test set score: 0.64
The prediction accuracy for the test data is lower than when alpha = 1.
ridge01 = Ridge(alpha=0.1).fit(X_train, y_train)
print('Training set score: {}'.format(ridge01.score(X_train, y_train)))
print('Test set score: {}'.format(ridge01.score(X_test, y_test)))
This is about the same prediction accuracy as when alpha = 1.
Here, let's compare the magnitude and variation of each regression coefficient across the three patterns alpha = 0.1, 1, and 10.
plt.plot(ridge10.coef_, 's', label='Ridge alpha=10')
plt.plot(ridge.coef_, 's', label='Ridge alpha=1')
plt.plot(ridge01.coef_, 's', label='Ridge alpha=0.1')
plt.plot(lr.coef_, 'o', label='LinearRegression')
plt.xlabel('Coefficient index')
plt.ylabel('Coefficient magnitude')
plt.hlines(0, 0, len(lr.coef_))
plt.ylim(-25, 25)
plt.legend()
Horizontal axis: the 104 features. Vertical axis: the magnitude of each regression coefficient in the model.
You can see that the score is higher when the coefficients retain some variation, as with alpha = 0.1 and 1. However, when the variation is too large (overfitting), as with plain linear regression, or when the regularization is too strong, as with alpha = 10, the coefficient of determination is low.
Above, I plugged in a few alpha values and compared the scores, but there is also a way to search for the optimal alpha in advance.
First, set the range of alpha values to search. Then cross-validate on the training data with RidgeCV and plot the optimal value with AlphaSelection.
alphas = np.logspace(-10, 1, 500)
ridgeCV = RidgeCV(alphas=alphas)
alpha_selection = AlphaSelection(ridgeCV)
alpha_selection.fit(X_train, y_train)
alpha_selection.show()
plt.show()
From this, we found the optimum parameter (alpha) value for this Ridge regression.
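As a side check (a small sketch of my own, not in the original code), the fitted RidgeCV object also exposes the selected value directly through its alpha_ attribute, so you can read it without the plot.
# Sketch: RidgeCV stores the chosen regularization strength in .alpha_ after fitting
ridgeCV_check = RidgeCV(alphas=alphas).fit(X_train, y_train)
print('Selected alpha: {}'.format(ridgeCV_check.alpha_))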
ridge0069 = Ridge(alpha=0.069).fit(X_train, y_train)
print('Training set score: {}'.format(ridge0069.score(X_train, y_train)))
print('Test set score: {}'.format(ridge0069.score(X_test, y_test)))
When I actually tried it, the score was about as high as with alpha = 0.1 and 1.
Next, let's plot the learning curve as a comparison between linear regression and Ridge regression (alpha = 1).
mglearn.plots.plot_ridge_n_samples()
Horizontal axis: data size. Vertical axis: coefficient of determination.
Linear regression is prone to overfitting: when the data size is small, the training scores are high, but the generalization performance on the test data is almost zero.
However, if the data size is sufficient, it can be seen to have about the same generalization performance as Ridge regression.
Like Ridge regression, Lasso regression constrains the coefficients toward 0, but the penalty is applied in a slightly different way, and with Lasso regression some coefficients become exactly zero.
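The difference between the two penalties can be seen with a toy coefficient vector (a rough sketch of my own, using the standard formulations): Ridge adds an L2 penalty, alpha * sum(w_j ** 2), while Lasso adds an L1 penalty, alpha * sum(|w_j|), which is what can drive coefficients exactly to zero.
# Sketch: L2 vs L1 penalty terms on a toy coefficient vector (illustration only)
w_toy = np.array([0.5, -2.0, 0.0, 3.0])
print('L2 penalty term:', np.sum(w_toy ** 2))     # 13.25
print('L1 penalty term:', np.sum(np.abs(w_toy)))  # 5.5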
"Number of features used" below is the count of coefficients that are not zero.
lasso = Lasso().fit(X_train, y_train)
print('Training set score: {:.2f}'.format(lasso.score(X_train, y_train)))
print('Test set score: {:.2f}'.format(lasso.score(X_test, y_test)))
print('Number of features used: {}'.format(np.sum(lasso.coef_ != 0)))
# Training set score: 0.29
# Test set score: 0.21
# Number of features used: 4
lasso001 = Lasso(alpha=0.01, max_iter=100000).fit(X_train, y_train)
print('Training set score: {:.2f}'.format(lasso001.score(X_train, y_train)))
print('Test set score: {:.2f}'.format(lasso001.score(X_test, y_test)))
print('Number of features used: {}'.format(np.sum(lasso001.coef_ != 0)))
# Training set score: 0.90
# Test set score: 0.77
# Number of features used: 33
lasso00001 = Lasso(alpha=0.0001, max_iter=100000).fit(X_train, y_train)
print('Training set score: {:.2f}'.format(lasso00001.score(X_train, y_train)))
print('Test set score: {:.2f}'.format(lasso00001.score(X_test, y_test)))
print('Number of features used: {}'.format(np.sum(lasso00001.coef_ != 0)))
# Training set score: 0.95
# Test set score: 0.64
# Number of features used: 96
Training is carried out with an error function designed so that the coefficients of less important features tend toward 0, and this determines the features that are actually used.
In this case, you can see that generalization performance is low both when the number of features used is as large as 96 and when it is as small as 4.
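To see which features Lasso actually keeps, you can check which coefficients are non-zero (a small sketch I added, using the lasso001 model fitted above).
# Sketch: indices of the features with non-zero coefficients when alpha=0.01
print('Selected feature indices:', np.where(lasso001.coef_ != 0)[0])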
Now let's compare the magnitude and variability of the regression coefficients for the Lasso models and Ridge (alpha = 0.1).
plt.plot(lasso.coef_, 's', label='Lasso alpha=1')
plt.plot(lasso001.coef_, '^', label='Lasso alpha=0.01')
plt.plot(lasso00001.coef_, 'v', label='Lasso alpha=0.0001')
plt.plot(ridge01.coef_, 'o', label='Ridge alpha=0.1')
plt.legend(ncol=2, loc=(0, 1.05))
plt.ylim(-25, 25)
plt.xlabel('Coefficient index')
plt.ylabel('Coefficient magnitude')
Horizontal axis: the 104 features. Vertical axis: the magnitude of each regression coefficient in the model.
Looking at the figure above, it is again clear that the coefficients need to retain some variation.
Lasso (alpha = 0.0001) is about as scattered as Ridge (alpha = 0.1), and you can see that their coefficient of determination scores are also close.
When considering a linear model, it is usually best to model with Ridge regression first, and to switch to Lasso regression if you find that there are unnecessary features.
There is also ElasticNet, which combines the Ridge and Lasso penalties; it can be accurate, but its parameters are tedious to tune.
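For reference, here is a minimal ElasticNet sketch (the alpha and l1_ratio values below are illustrative, not tuned).
# Sketch: ElasticNet mixes the L1 and L2 penalties; l1_ratio controls the mix (values not tuned)
from sklearn.linear_model import ElasticNet
enet = ElasticNet(alpha=0.01, l1_ratio=0.5, max_iter=100000).fit(X_train, y_train)
print('Training set score: {:.2f}'.format(enet.score(X_train, y_train)))
print('Test set score: {:.2f}'.format(enet.score(X_test, y_test)))
print('Number of features used: {}'.format(np.sum(enet.coef_ != 0)))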
・ Introduction to Machine Learning with Python (Pythonではじめる機械学習), O'Reilly Japan