2. Multivariate analysis spelled out in Python 6-2. Ridge regression / Lasso regression (scikit-learn) [Ridge regression vs. Lasso regression]

Ridge regression and lasso regression are both derived from multiple regression, and they add a mechanism that suppresses the overfitting of multiple regression. Specifically:

Ridge regression penalizes the **squared weights**, while lasso regression penalizes the **absolute values of the weights**.
To repeat the previous section, conventional multiple regression analysis finds the coefficients that minimize the sum of squared errors between the predicted and observed values. Adding a penalty that grows with the number and size of the weights prevents the coefficients from taking large values.
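As a rough illustration of the two penalty terms described above, the sketch below computes them for a made-up weight vector. The names `w` and `lam` and the values are assumptions for illustration only; they do not come from the automobile data.

```python
import numpy as np

# Hypothetical coefficient vector, chosen only for illustration
w = np.array([0.5, -2.0, 3.0])
lam = 1.0  # regularization strength (lambda), also an assumed value

# Ridge penalty: lambda times the sum of squared weights
ridge_penalty = lam * np.sum(w ** 2)

# Lasso penalty: lambda times the sum of absolute weights
lasso_penalty = lam * np.sum(np.abs(w))

print('ridge penalty:', ridge_penalty)  # 0.25 + 4.0 + 9.0 = 13.25
print('lasso penalty:', lasso_penalty)  # 0.5 + 2.0 + 3.0 = 5.5
```

In both cases the penalty is added to the squared-error term, so larger weights make the total objective larger.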

**Here, I would like to compare three regression models: multiple regression, ridge regression, and lasso regression.**

⑴ Import libraries

#Data processing / calculation / analysis library
import numpy as np
import pandas as pd

#Graph drawing library
import matplotlib.pyplot as plt
%matplotlib inline

#Machine learning library
import sklearn

⑵ Data acquisition and reading

#Get data
url = 'https://raw.githubusercontent.com/yumi-ito/datasets/master/datasets_auto_4variables_pre-processed.csv'

#Read the acquired data as a DataFrame object
df = pd.read_csv(url, header=None)

#Set column label
df.columns = ['width', 'height', 'horsepower', 'price']

print(df)

This data is used to predict price, the objective variable, from three of the many automobile specifications as explanatory variables: width, height, and horsepower.
For details such as the data source and an overview, see https://qiita.com/y_itoh/items/9befbf47869d66337dad
The unknown value "?" and missing values have already been removed, and the data types have been converted to float and int.
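For reference, preprocessing of this kind might look roughly like the sketch below. It is not necessarily how the linked dataset was prepared; the DataFrame `raw` and all of its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical raw rows; the values are made up for illustration
raw = pd.DataFrame({
    'width': ['64.1', '65.5', '?', '66.2'],
    'height': ['48.8', '52.4', '54.3', '54.3'],
    'horsepower': ['111', '?', '102', '115'],
    'price': ['13495', '16500', '13950', '17450']})

# Treat "?" as a missing value, then drop rows that contain any
cleaned = raw.replace('?', np.nan).dropna()

# Convert the remaining strings to numeric types
cleaned = cleaned.astype({'width': float, 'height': float,
                          'horsepower': int, 'price': int})

print(cleaned)
print(cleaned.dtypes)
```

Only the rows without a "?" survive, and every column ends up numeric.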

#Confirmation of data shape
print('Data shape:', df.shape)

#Confirmation of missing values
print('Number of missing values:{}\n'.format(df.isnull().sum().sum()))

#Data type confirmation
print(df.dtypes)

⑶ Splitting into training data and test data

#Import for model building
from sklearn.linear_model import Ridge, Lasso, LinearRegression

#Import for data splitting
from sklearn.model_selection import train_test_split

Use the pandas drop() method to remove the price column, so that x holds only the explanatory variables and y holds only the price.
scikit-learn's train_test_split function then splits both the explanatory variables x and the objective variable y into training data (train) and test data (test). With test_size=0.5, half of the rows go to each.

#Set explanatory variables and objective variables
x = df.drop('price', axis=1)
y = df['price']

#Divided into training data and test data
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.5, random_state=0)
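To see what the split does, here is a small sketch on a stand-in table. The DataFrame `demo` and its numbers are assumptions for illustration and are unrelated to the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for df: 10 rows of made-up numbers
demo = pd.DataFrame(np.arange(40).reshape(10, 4),
                    columns=['width', 'height', 'horsepower', 'price'])
x_demo = demo.drop('price', axis=1)
y_demo = demo['price']

# test_size=0.5 sends half of the rows to the test set
a_train, a_test, b_train, b_test = train_test_split(
    x_demo, y_demo, test_size=0.5, random_state=0)
print(a_train.shape, a_test.shape)

# Fixing random_state reproduces the same split on every run
a_train2, _, _, _ = train_test_split(
    x_demo, y_demo, test_size=0.5, random_state=0)
print(a_train.index.equals(a_train2.index))
```

The rows of x and y stay aligned after the split, since both are shuffled with the same indices.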

⑷ Model generation and evaluation

Initialize multiple regression, ridge regression, and lasso regression together, then use a for statement to fit each model and compute its score on the training data and on the test data in one pass. Note that for regression models, scikit-learn's score() returns the coefficient of determination $R^2$, not a classification accuracy ("correct answer rate").

#Instantiate each class and store it in the dict variable models
models = {
    'linear': LinearRegression(),
    'ridge': Ridge(random_state=0),
    'lasso': Lasso(random_state=0)}

#Initialize the dict variable that stores the scores
scores = {}

#Fit each model in turn, compute its R^2 scores, and store them
for model_name, model in models.items():
    #Fit the model to the training data
    model.fit(X_train, Y_train)
    #R^2 score on the training data
    scores[(model_name, 'train')] = model.score(X_train, Y_train)
    #R^2 score on the test data
    scores[(model_name, 'test')] = model.score(X_test, Y_test)

#Convert the dict to a pandas Series (one-dimensional) and print it
print(pd.Series(scores))

If you iterate over a dict object directly with a for statement, you get only the key of each element, but with `items()` you get both the key and the value of each element.
The score is computed with each model's score() method and stored under a key that pairs model_name with either 'train' or 'test'.
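The difference between iterating over a dict directly and using `items()`, and what tuple keys turn into inside a pandas Series, can be sketched as follows. The dict `demo_scores` and its values are made up for illustration.

```python
import pandas as pd

# Made-up scores keyed by (model_name, data_split) tuples
demo_scores = {
    ('linear', 'train'): 0.70, ('linear', 'test'): 0.68,
    ('ridge', 'train'): 0.69, ('ridge', 'test'): 0.71}

# Iterating over the dict directly yields only the keys
keys_only = [key for key in demo_scores]

# items() yields (key, value) pairs
pairs = [(key, value) for key, value in demo_scores.items()]

# Tuple keys become a two-level index; unstack() pivots the
# second level ('train' / 'test') into columns
table = pd.Series(demo_scores).unstack()
print(table)
```

Using unstack() is optional; it only makes the train and test scores easier to compare side by side.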

	Multiple regression	Ridge regression	Lasso regression
Training data score ($R^2$)	0.733358	0.733355	0.733358
Test data score ($R^2$)	0.737069	0.737768	0.737084
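The numbers in this table are what score() returns for a regressor, namely the coefficient of determination $R^2$. The sketch below checks this on synthetic data, which is an assumption for illustration and not the automobile dataset, by comparing score() with sklearn.metrics.r2_score and with a manual calculation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data (an assumption; not the automobile dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(size=50)

reg = LinearRegression().fit(X_demo, y_demo)
y_pred = reg.predict(X_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - y_pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# All three values should agree
print(reg.score(X_demo, y_demo), r2_score(y_demo, y_pred), r2_manual)
```

A value of 1.0 would mean a perfect fit, and a value near 0 means the model does little better than always predicting the mean.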

On the training data the scores rank as **ridge regression < lasso regression = multiple regression**, and on the test data as **ridge regression > lasso regression > multiple regression**, although the differences are at most about 0.0007.
Focusing on lasso regression, its training score is the same as that of multiple regression, and its test score is slightly higher than that of multiple regression, though not as high as that of ridge regression.
However, the parameter $λ$ that specifies the strength of regularization has been left untouched so far. scikit-learn calls it alpha, and it defaults to 1.0 for both Ridge and Lasso.

**So I would like to change the regularization parameter and compare the results.**

Regularization parameters

If the parameter $λ$ that specifies the strength of regularization is increased, the penalty has a stronger effect, so the absolute values of the regression coefficients are kept smaller.
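A minimal sketch of this shrinking effect is shown below. It uses synthetic data, which is an assumption for illustration and not the automobile dataset, and the list of alpha values is also arbitrary.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Synthetic data (an assumption; not the automobile dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(size=100)

alphas = [0.1, 1.0, 10.0, 100.0, 1000.0]
ridge_l2 = []  # Euclidean norm of the ridge coefficients per alpha
lasso_l1 = []  # sum of absolute lasso coefficients per alpha
for a in alphas:
    ridge = Ridge(alpha=a).fit(X_demo, y_demo)
    lasso = Lasso(alpha=a).fit(X_demo, y_demo)
    ridge_l2.append(np.linalg.norm(ridge.coef_))
    lasso_l1.append(np.abs(lasso.coef_).sum())
    print(a, ridge.coef_.round(3), lasso.coef_.round(3))
```

On data like this, the ridge coefficients shrink toward zero without reaching it, while the lasso coefficients become exactly zero once alpha is large enough.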

The parameter is specified with the `alpha=` argument when the class is instantiated. Let's try `alpha=10.0`.

#parameter settings
alpha = 10.0

#Initialize each class and store in models
models = {
    'ridge': Ridge(alpha=alpha, random_state=0),
    'lasso': Lasso(alpha=alpha, random_state=0)}

#Initialize the dict variable that stores the scores
scores = {}

#Fit each model in turn and store its R^2 scores
for model_name, model in models.items():
    model.fit(X_train, Y_train)
    scores[(model_name, 'train')] = model.score(X_train, Y_train)
    scores[(model_name, 'test')] = model.score(X_test, Y_test)

print(pd.Series(scores))

The following table shows the results of changing the regularization parameter $λ$ step by step.

λ	Ridge(train)	Ridge(test)	Lasso(train)	Lasso(test)
1	0.733355	0.737768	0.733358	0.737084
10	0.733100	0.743506	0.733357	0.737372
100	0.721015	0.771022	0.733289	0.740192
200	0.705228	0.778607	0.733083	0.743195
400	0.680726	0.779004	0.732259	0.748795
500	0.671349	0.777338	0.731640	0.751391
1000	0.640017	0.767504	0.726479	0.762336
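A table of this shape could be produced with a loop over alpha values, roughly as sketched below. The data here comes from make_regression and is only a stand-in, so the numbers will differ from the table above. The variable names are assumptions as well. Note also that the same alpha is not directly comparable between the two classes, because scikit-learn divides the squared-error term by twice the number of samples for Lasso but not for Ridge.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the automobile dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=3,
                                 noise=20.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=0)

alphas = [1.0, 10.0, 100.0, 200.0, 400.0, 500.0, 1000.0]
rows = {}
for a in alphas:
    row = {}
    for name, cls in [('Ridge', Ridge), ('Lasso', Lasso)]:
        model = cls(alpha=a, random_state=0).fit(Xd_train, yd_train)
        row[name + '(train)'] = model.score(Xd_train, yd_train)
        row[name + '(test)'] = model.score(Xd_test, yd_test)
    rows[a] = row

# One row per alpha, one column per model/split combination
result = pd.DataFrame.from_dict(rows, orient='index')
result.index.name = 'alpha'
print(result)
```

Collecting the scores in a DataFrame keeps the train and test columns next to each other, which makes trends across alpha easier to read.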

In this example, ridge regression moves in opposite directions on the two data sets: as $λ$ increases, the training score falls steadily (from 0.733 to 0.640) while the test score rises, peaking at 0.779 around $λ = 400$ before declining again. Lasso regression, by contrast, changes only gradually: the training score barely drops even as $λ$ increases (0.733 to 0.726), and the test score rises slowly but steadily (0.737 to 0.762).
Formally, the only difference is whether the coefficients are squared or their absolute values are taken, but which penalty is larger depends on the size of the coefficient (the absolute value exceeds the square when $|w| < 1$, and the square exceeds the absolute value when $|w| > 1$). In short, the two penalties influence the model in different ways.
To understand the effect of each kind of regularization, I would like to go a step further and look at its relationship with the coefficients.