This time, I will summarize the features of linear regression, Ridge regression, and Lasso regression, and the differences between them.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import mglearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge, Lasso
from sklearn.linear_model import RidgeCV
from yellowbrick.regressor import AlphaSelection
mglearn is a helper module used here for loading the datasets and visualizing the plots.
This time we will use the extended Boston housing dataset. Unlike the original data, it has 104 features.
X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print('X:', X.shape)
print('y:', y.shape)
print('X_train:', X_train.shape)
print('y_train:', y_train.shape)
print('X_test:', X_test.shape)
print('y_test:', y_test.shape)
# X: (506, 104)
# y: (506,)
# X_train: (379, 104)
# y_train: (379,)
# X_test: (127, 104)
# y_test: (127,)
lr = LinearRegression().fit(X_train, y_train)
#Coefficient of determination (an index that measures the accuracy of regression model predictions)
print('Train set score: {}'.format(lr.score(X_train, y_train)))
print('Test set score: {}'.format(lr.score(X_test, y_test)))
# Train set score: 0.9520519609032729
# Test set score: 0.6074721959665842
score() returns the coefficient of determination (R², an index that measures how well the regression model's predictions fit the data).
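As a quick cross-check (a small sketch of my own, not part of the original code), the same value can be computed with sklearn.metrics.r2_score on the model's predictions.
from sklearn.metrics import r2_score
# R^2 = 1 - SS_res / SS_tot; score() for regressors returns exactly this value
print(r2_score(y_test, lr.predict(X_test)))  # matches lr.score(X_test, y_test)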
With plain linear regression, prediction accuracy on the training data at hand is high, while prediction accuracy on test data (unknown data) tends to be low.
To use a baseball analogy: it is like practicing only against straight pitches and then being unable to handle curveballs at all in a real game.
In practical work and in Kaggle, generalization performance (the ability to handle unknown data in production) is what matters.
For this data, we can see that the linear regression model is unsuitable.
Fitting the training data too closely, as above, so that generalization performance on the test data deteriorates is called "overfitting".
To prevent this, Ridge regression applies regularization, controlled by the parameter alpha.
Overfitting is likely to occur when the regression coefficients (the weights on each feature) are large or vary widely. Increasing the parameter alpha pushes the regression coefficients toward zero.
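To make the role of alpha concrete, here is a minimal sketch (my addition, assuming fit_intercept=False so that the closed-form solution applies) of the quantity Ridge minimizes, ||y - Xw||^2 + alpha * ||w||^2.
# Sketch: with fit_intercept=False, Ridge solves w = (X^T X + alpha * I)^(-1) X^T y
alpha = 1.0
I = np.eye(X_train.shape[1])
w = np.linalg.solve(X_train.T @ X_train + alpha * I, X_train.T @ y_train)
ridge_check = Ridge(alpha=alpha, fit_intercept=False).fit(X_train, y_train)
print(np.allclose(w, ridge_check.coef_))  # True (up to numerical tolerance)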
ridge = Ridge(alpha=1).fit(X_train, y_train)
print('Training set score: {}'.format(ridge.score(X_train, y_train)))
print('Test set score: {}'.format(ridge.score(X_test, y_test)))
# Training set score: 0.885796658517094
# Test set score: 0.7527683481744752
Fit to the training data is reduced, but generalization performance on the test data improves. The alpha of Ridge regression defaults to 1, so let's try other values as well.
ridge10 = Ridge(alpha=10).fit(X_train, y_train)
print('Training set score: {:.2f}'.format(ridge10.score(X_train, y_train)))
print('Test set score: {:.2f}'.format(ridge10.score(X_test, y_test)))
# Training set score: 0.79
# Test set score: 0.64
The prediction accuracy for the test data is lower than when alpha = 1.
ridge01 = Ridge(alpha=0.1).fit(X_train, y_train)
print('Training set score: {}'.format(ridge01.score(X_train, y_train)))
print('Test set score: {}'.format(ridge01.score(X_test, y_test)))
This is about the same prediction accuracy as when alpha = 1.
Here, let's compare the magnitude and variation of each regression coefficient across the three patterns alpha = 0.1, 1, and 10.
plt.plot(ridge10.coef_, 's', label='Ridge alpha=10')
plt.plot(ridge.coef_, 's', label='Ridge alpha=1')
plt.plot(ridge01.coef_, 's', label='Ridge alpha=0.1')
plt.plot(lr.coef_, 'o', label='LinearRegression')
plt.xlabel('Coefficient index')
plt.ylabel('Coefficient magnitude')
plt.hlines(0, 0, len(lr.coef_))
plt.ylim(-25, 25)
plt.legend()
Horizontal axis: the 104 features. Vertical axis: the magnitude of each regression coefficient in the model.
You can see that the score is higher when the coefficients retain some variation, as with alpha = 0.1 and 1. However, when the variation is too large (overfitting), as with plain linear regression, or when the regularization is too strong, as with alpha = 10, the coefficient of determination is low.
Above, I plugged in a few alpha values and compared the scores, but there is also a way to search for the optimal alpha in advance.
First, set the range of alpha values to search. Then cross-validate on the training data with RidgeCV and plot the optimal value with AlphaSelection.
alphas = np.logspace(-10, 1, 500)
ridgeCV = RidgeCV(alphas=alphas)
alpha_selection = AlphaSelection(ridgeCV)
alpha_selection.fit(X_train, y_train)
alpha_selection.show()
plt.show()
From this, we found the optimum parameter (alpha) value for this Ridge regression.
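As a side check (a small sketch of my own, not in the original code), the fitted RidgeCV object also exposes the selected value directly through its alpha_ attribute, so you can read it without the plot.
# Sketch: RidgeCV stores the chosen regularization strength in .alpha_ after fitting
ridgeCV_check = RidgeCV(alphas=alphas).fit(X_train, y_train)
print('Selected alpha: {}'.format(ridgeCV_check.alpha_))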
ridge0069 = Ridge(alpha=0.069).fit(X_train, y_train)
print('Training set score: {}'.format(ridge0069.score(X_train, y_train)))
print('Test set score: {}'.format(ridge0069.score(X_test, y_test)))
When I actually tried it, the score was about as high as with alpha = 0.1 and 1.
Next, let's plot the learning curve as a comparison between linear regression and Ridge regression (alpha = 1).
mglearn.plots.plot_ridge_n_samples()
Horizontal axis: data size. Vertical axis: coefficient of determination.
Linear regression is prone to overfitting: when the data size is small, the training scores are high, but the generalization performance on the test data is almost zero.
However, if the data size is sufficient, it can be seen to have about the same generalization performance as Ridge regression.
Like Ridge regression, Lasso regression constrains the coefficients toward 0, but the penalty is applied in a slightly different way, and with Lasso regression some coefficients become exactly zero.
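The difference between the two penalties can be seen with a toy coefficient vector (a rough sketch of my own, using the standard formulations): Ridge adds an L2 penalty, alpha * sum(w_j ** 2), while Lasso adds an L1 penalty, alpha * sum(|w_j|), which is what can drive coefficients exactly to zero.
# Sketch: L2 vs L1 penalty terms on a toy coefficient vector (illustration only)
w_toy = np.array([0.5, -2.0, 0.0, 3.0])
print('L2 penalty term:', np.sum(w_toy ** 2))     # 13.25
print('L1 penalty term:', np.sum(np.abs(w_toy)))  # 5.5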
"Number of features used" below is the count of coefficients that are not zero.
lasso = Lasso().fit(X_train, y_train)
print('Training set score: {:.2f}'.format(lasso.score(X_train, y_train)))
print('Test set score: {:.2f}'.format(lasso.score(X_test, y_test)))
print('Number of features used: {}'.format(np.sum(lasso.coef_ != 0)))
# Training set score: 0.29
# Test set score: 0.21
# Number of features used: 4
lasso001 = Lasso(alpha=0.01, max_iter=100000).fit(X_train, y_train)
print('Training set score: {:.2f}'.format(lasso001.score(X_train, y_train)))
print('Test set score: {:.2f}'.format(lasso001.score(X_test, y_test)))
print('Number of features used: {}'.format(np.sum(lasso001.coef_ != 0)))
# Training set score: 0.90
# Test set score: 0.77
# Number of features used: 33
lasso00001 = Lasso(alpha=0.0001, max_iter=100000).fit(X_train, y_train)
print('Training set score: {:.2f}'.format(lasso00001.score(X_train, y_train)))
print('Test set score: {:.2f}'.format(lasso00001.score(X_test, y_test)))
print('Number of features used: {}'.format(np.sum(lasso00001.coef_ != 0)))
# Training set score: 0.95
# Test set score: 0.64
# Number of features used: 96
Training is carried out with an error function designed so that the coefficients of less important features tend toward 0, and this determines the features that are actually used.
In this case, you can see that generalization performance is low both when the number of features used is as large as 96 and when it is as small as 4.
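To see which features Lasso actually keeps, you can check which coefficients are non-zero (a small sketch I added, using the lasso001 model fitted above).
# Sketch: indices of the features with non-zero coefficients when alpha=0.01
print('Selected feature indices:', np.where(lasso001.coef_ != 0)[0])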
Now let's compare the magnitude and variability of the regression coefficients for the Lasso models and Ridge (alpha = 0.1).
plt.plot(lasso.coef_, 's', label='Lasso alpha=1')
plt.plot(lasso001.coef_, '^', label='Lasso alpha=0.01')
plt.plot(lasso00001.coef_, 'v', label='Lasso alpha=0.0001')
plt.plot(ridge01.coef_, 'o', label='Ridge alpha=0.1')
plt.legend(ncol=2, loc=(0, 1.05))
plt.ylim(-25, 25)
plt.xlabel('Coefficient index')
plt.ylabel('Coefficient magnitude')
Horizontal axis: the 104 features. Vertical axis: the magnitude of each regression coefficient in the model.
Looking at the figure above, it is again clear that the coefficients need to retain some variation.
Lasso (alpha = 0.0001) is about as scattered as Ridge (alpha = 0.1), and you can see that their coefficient of determination scores are also close.
When considering a linear model, it is usually best to model with Ridge regression first, and to switch to Lasso regression if you find that there are unnecessary features.
There is also ElasticNet, which combines the Ridge and Lasso penalties; it can be accurate, but its parameters are tedious to tune.
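For reference, here is a minimal ElasticNet sketch (the alpha and l1_ratio values below are illustrative, not tuned).
# Sketch: ElasticNet mixes the L1 and L2 penalties; l1_ratio controls the mix (values not tuned)
from sklearn.linear_model import ElasticNet
enet = ElasticNet(alpha=0.01, l1_ratio=0.5, max_iter=100000).fit(X_train, y_train)
print('Training set score: {:.2f}'.format(enet.score(X_train, y_train)))
print('Test set score: {:.2f}'.format(enet.score(X_test, y_test)))
print('Number of features used: {}'.format(np.sum(enet.coef_ != 0)))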
・ Introduction to Machine Learning with Python (Pythonではじめる機械学習), O'Reilly Japan