Difference between linear regression, Ridge regression and Lasso regression

■ Introduction

This time, I will summarize the characteristics of linear regression, Ridge regression, and Lasso regression, and the differences between them.

■ Preparing the modules and data


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import mglearn

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge, Lasso
from sklearn.linear_model import RidgeCV
from yellowbrick.regressor import AlphaSelection

mglearn is a helper module used here for loading datasets and visualizing plots.

This time we will use the extended Boston housing dataset. Unlike the original data (13 features), it has 104 features (the originals plus derived interaction features).


X, y = mglearn.datasets.load_extended_boston()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print('X:', X.shape)
print('y:', y.shape)
print('X_train:', X_train.shape)
print('y_train:', y_train.shape)
print('X_test:', X_test.shape)
print('y_test:', y_test.shape)

# X: (506, 104)
# y: (506,)
# X_train: (379, 104)
# y_train: (379,)
# X_test: (127, 104)
# y_test: (127,)

1. Linear regression


lr = LinearRegression().fit(X_train, y_train)

#Coefficient of determination (an index that measures the accuracy of regression model predictions)
print('Train set score: {}'.format(lr.score(X_train, y_train)))
print('Test set score: {}'.format(lr.score(X_test, y_test)))

# Train set score: 0.9520519609032729
# Test set score: 0.6074721959665842

score returns the coefficient of determination (R², an index that measures how well a regression model's predictions fit the data).
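As a quick check, the value returned by score can be reproduced from the predictions with scikit-learn's r2_score:


from sklearn.metrics import r2_score

# R^2 computed from the predictions matches lr.score(X_test, y_test)
pred = lr.predict(X_test)
print(r2_score(y_test, pred))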

With plain linear regression, prediction accuracy on the training data at hand is high, but prediction accuracy on test data (unknown data) tends to be low.

To use a baseball analogy: it is like practicing only against straight pitches and then being unable to handle curveballs at all in a real game.

In practical work and in Kaggle, generalization performance on unknown data (the ability to perform in production) is what matters. For this data, we can see that the plain linear regression model is unsuitable.

2. Ridge regression

Adapting too closely to the training data, as above, at the cost of generalization performance on test data is called "overfitting".

To prevent this, Ridge regression applies regularization (controlled by the parameter alpha).

Overfitting is likely to occur when the regression coefficients (the weights on each feature) are large or vary widely. Increasing the parameter alpha pushes the regression coefficients toward zero.
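Concretely, Ridge regression minimizes the usual least-squares error plus an L2 penalty on the coefficients (this matches scikit-learn's Ridge objective):

\min_w \; \sum_{i=1}^{n} (y_i - w^\top x_i)^2 + \alpha \lVert w \rVert_2^2

The larger alpha is, the more strongly large coefficients are penalized, so they shrink toward zero.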


ridge = Ridge(alpha=1).fit(X_train, y_train)

print('Training set score: {}'.format(ridge.score(X_train, y_train)))
print('Test set score: {}'.format(ridge.score(X_test, y_test)))

# Training set score: 0.885796658517094
# Test set score: 0.7527683481744752

The fit to the training data is reduced, but generalization performance on the test data is improved. The alpha of Ridge regression is 1 by default, so let's try other values as well.


ridge10 = Ridge(alpha=10).fit(X_train, y_train)

print('Training set score: {:.2f}'.format(ridge10.score(X_train, y_train)))
print('Test set score: {:.2f}'.format(ridge10.score(X_test, y_test)))

# Training set score: 0.79
# Test set score: 0.64

The prediction accuracy for the test data is lower than when alpha = 1.


ridge01 = Ridge(alpha=0.1).fit(X_train, y_train)

print('Training set score: {:.2f}'.format(ridge01.score(X_train, y_train)))
print('Test set score: {:.2f}'.format(ridge01.score(X_test, y_test)))

# Training set score: 0.93
# Test set score: 0.77

This is about the same prediction accuracy as when alpha = 1 (slightly better on the test data).
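To compare several values at once, a simple sweep also works (a minimal sketch reusing the score method):


for alpha in [0.1, 1, 10]:
    r = Ridge(alpha=alpha).fit(X_train, y_train)
    print('alpha={}: train={:.2f}, test={:.2f}'.format(
        alpha, r.score(X_train, y_train), r.score(X_test, y_test)))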

Now, let's compare the magnitude and variation of the regression coefficients across the three patterns alpha = 0.1, 1, and 10.


plt.plot(ridge10.coef_, 's', label='Ridge alpha=10')
plt.plot(ridge.coef_, 's', label='Ridge alpha=1')
plt.plot(ridge01.coef_, 's', label='Ridge alpha=0.1')

plt.plot(lr.coef_, 'o', label='LinearRegression')
plt.xlabel('Coefficient index')
plt.ylabel('Coefficient magnitude')

plt.hlines(0, 0, len(lr.coef_))
plt.ylim(-25, 25)

plt.legend()

[Figure] Horizontal axis: index of the 104 features. Vertical axis: magnitude of each regression coefficient in the model.

You can see that the score is higher when the coefficients retain some variation, as with alpha = 0.1 and 1. However, when the variation is too large (overfitting, as in linear regression) or the regularization is too strong (as with alpha = 10), the coefficient of determination is low.
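One way to quantify this variation is the L2 norm of each model's coefficient vector (a quick check; larger alpha should give a smaller norm):


# L2 norm of each coefficient vector; larger alpha shrinks the coefficients
for name, model in [('lr', lr), ('ridge01', ridge01), ('ridge', ridge), ('ridge10', ridge10)]:
    print(name, np.linalg.norm(model.coef_))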

Above, I plugged in a few alphas and compared the scores, but there is also a way to find the optimal alpha in advance.

First, set the range of values to search for the parameter alpha. Then cross-validate on the training data with RidgeCV and plot the optimal value with yellowbrick's AlphaSelection.


alphas = np.logspace(-10, 1, 500)

ridgeCV = RidgeCV(alphas=alphas)

alpha_selection = AlphaSelection(ridgeCV)
alpha_selection.fit(X_train, y_train)

alpha_selection.show()
plt.show()

[Figure: AlphaSelection plot of cross-validation error against alpha] From this, we found the optimal value of the parameter alpha for this Ridge regression (approximately 0.069).
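The selected value can also be read directly from the fitted RidgeCV object via its alpha_ attribute (assuming, as yellowbrick does during fit, that AlphaSelection fits the wrapped estimator in place):


# alpha_ is scikit-learn's standard attribute for the selected value;
# it should print a value around 0.069, matching the plot above
print('Best alpha: {:.3f}'.format(ridgeCV.alpha_))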


ridge0069 = Ridge(alpha=0.069).fit(X_train, y_train)

print('Training set score: {:.2f}'.format(ridge0069.score(X_train, y_train)))
print('Test set score: {:.2f}'.format(ridge0069.score(X_test, y_test)))

When I actually tried it, the scores were about as high as with alpha = 0.1 and 1.

Next, let's plot learning curves to compare linear regression and Ridge regression (alpha = 1).


mglearn.plots.plot_ridge_n_samples()

[Figure] Horizontal axis: dataset size. Vertical axis: coefficient of determination.

Linear regression is prone to overfitting: its training scores are high, but with little data its generalization performance on the test data is almost zero. However, you can see that when the dataset is large enough, it achieves the same generalization performance as Ridge regression.

3. Lasso regression

Like Ridge regression, Lasso constrains the coefficients toward 0, but the penalty is slightly different (L1 instead of L2), and with Lasso regression some coefficients become exactly zero.
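Concretely, scikit-learn's Lasso minimizes the least-squares error plus an L1 penalty:

\min_w \; \frac{1}{2n} \sum_{i=1}^{n} (y_i - w^\top x_i)^2 + \alpha \lVert w \rVert_1

It is the L1 penalty that drives some coefficients exactly to zero.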

Here, "Number of features used" in the output below is the number of nonzero coefficients, that is, the number of features the model actually uses.


lasso = Lasso().fit(X_train, y_train)

print('Training set score: {:.2f}'.format(lasso.score(X_train, y_train)))
print('Test set score: {:.2f}'.format(lasso.score(X_test, y_test)))
print('Number of features used: {}'.format(np.sum(lasso.coef_ != 0)))

# Training set score: 0.29
# Test set score: 0.21
# Number of features used: 4

lasso001 = Lasso(alpha=0.01, max_iter=100000).fit(X_train, y_train)

print('Training set score: {:.2f}'.format(lasso001.score(X_train, y_train)))
print('Test set score: {:.2f}'.format(lasso001.score(X_test, y_test)))
print('Number of features used: {}'.format(np.sum(lasso001.coef_ != 0)))

# Training set score: 0.90
# Test set score: 0.77
# Number of features used: 33

lasso00001 = Lasso(alpha=0.0001, max_iter=100000).fit(X_train, y_train)

print('Training set score: {:.2f}'.format(lasso00001.score(X_train, y_train)))
print('Test set score: {:.2f}'.format(lasso00001.score(X_test, y_test)))
print('Number of features used: {}'.format(np.sum(lasso00001.coef_ != 0)))

# Training set score: 0.95
# Test set score: 0.64
# Number of features used: 96

Training is carried out with an error function designed so that the coefficients of less important features tend to become 0, and this determines the features that are actually used.

In this case, you can see that generalization performance is low both when too many features are used (96) and when too few are used (4).
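As with Ridge, the alpha for Lasso can be chosen by cross-validation, for example with scikit-learn's LassoCV. A minimal sketch (the search range here is an illustrative assumption, not a tuned choice):


from sklearn.linear_model import LassoCV

# search a logarithmic range of alphas, as with RidgeCV above
lassoCV = LassoCV(alphas=np.logspace(-4, 0, 50), max_iter=100000).fit(X_train, y_train)
print('Best alpha: {:.4f}'.format(lassoCV.alpha_))
print('Test set score: {:.2f}'.format(lassoCV.score(X_test, y_test)))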

Now let's compare the magnitude and variation of the regression coefficients for Lasso (alpha = 1, 0.01, 0.0001) and Ridge (alpha = 0.1).


plt.plot(lasso.coef_, 's', label='Lasso alpha=1')
plt.plot(lasso001.coef_, '^', label='Lasso alpha=0.01')
plt.plot(lasso00001.coef_, 'v', label='Lasso alpha=0.0001')

plt.plot(ridge01.coef_, 'o', label='Ridge alpha=0.1')
plt.legend(ncol=2, loc=(0, 1.05))
plt.ylim(-25, 25)
plt.xlabel('Coefficient index')
plt.ylabel('Coefficient magnitude')

[Figure] Horizontal axis: index of the 104 features. Vertical axis: magnitude of each regression coefficient in the model.

Looking at the figure above, some variation in the coefficients is still necessary. Lasso (alpha = 0.01) has a coefficient spread comparable to Ridge (alpha = 0.1), and you can see that their coefficient of determination scores are also close.

■ Conclusion

When considering linear regression, it is better to model first with Ridge regression, and then switch to Lasso regression if you find there are unnecessary features.

ElasticNet (which has both the Ridge and Lasso penalties) can be accurate, but tuning it is tedious.
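For reference, a minimal ElasticNet sketch (the alpha and l1_ratio values here are illustrative assumptions, not tuned):


from sklearn.linear_model import ElasticNet

# l1_ratio mixes the penalties: 1.0 is pure Lasso, 0.0 is pure Ridge
enet = ElasticNet(alpha=0.01, l1_ratio=0.5, max_iter=100000).fit(X_train, y_train)
print('Test set score: {:.2f}'.format(enet.score(X_test, y_test)))
print('Number of features used: {}'.format(np.sum(enet.coef_ != 0)))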

■ References

Machine Learning Starting with Python (O'Reilly Japan)
