2. Multivariate analysis spelled out in Python 6-1. Ridge regression / Lasso regression (scikit-learn) [multiple regression vs. ridge regression]

Put simply, both are extensions of multiple regression analysis. **Lasso regression is also called L1 regularization and ridge regression L2 regularization**, and the two are close siblings.
What they add is **an improvement to multiple regression analysis that guards against overfitting**.
In multiple regression analysis, the regression coefficients are estimated so as to minimize the loss function (the sum of squared errors between the predicted values and the objective variable). Ridge and lasso regression additionally include **a device that keeps the regression coefficients themselves from becoming large**.
In general, a model with large regression coefficients produces a large change in output for a small change in input. Such a sensitive model runs a high risk of fitting the training data well but generalizing poorly to unseen data.
Therefore, **a penalty** that grows with the number of variables and the size of the weights is added to the loss function, so that **the model itself restrains the size of its parameters**.
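
One common way to write such a penalized loss, using the symbols of this section ($M$ variables, weights $W_j$, exponent $q$), is given here as a sketch. The regularization strength $\lambda$ and the sample index $n = 1, \dots, N$ are notation assumed for this illustration.

$$
E(W) = \sum_{n=1}^{N} \left( y_n - \hat{y}_n \right)^2 + \lambda \sum_{j=1}^{M} \left| W_j \right|^q
$$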

The formula above defines the loss function with the penalty added; the penalty is more properly called the **regularization term**.
As the number of variables $M$ grows and the weights $W$ become larger, the regularization term grows and, being added to the loss, raises the value of the loss function.
When the exponent $q$ in the regularization term is $q = 1$, the method is called lasso regression; when $q = 2$, it is called ridge regression.
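
The role of $q$ can be checked with a few lines of NumPy. The following is a minimal sketch of the penalized loss described above, not scikit-learn's internal implementation; the function name `penalized_loss`, the parameter `lam`, and the sample numbers are all hypothetical.

```python
import numpy as np

def penalized_loss(y_true, y_pred, weights, lam=1.0, q=2):
    """Sum of squared errors plus lam * sum(|w_j| ** q)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    weights = np.asarray(weights, dtype=float)
    squared_error = np.sum((y_true - y_pred) ** 2)
    penalty = lam * np.sum(np.abs(weights) ** q)
    return squared_error + penalty

# Hypothetical values chosen only to make the arithmetic easy to follow
y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.5, 7.0]
weights = [2.0, -3.0]

lasso_loss = penalized_loss(y_true, y_pred, weights, lam=1.0, q=1)  # L1 penalty: |2| + |-3|
ridge_loss = penalized_loss(y_true, y_pred, weights, lam=1.0, q=2)  # L2 penalty: 2**2 + 3**2
print(lasso_loss, ridge_loss)
```

With the same weights, the $q = 2$ penalty punishes large coefficients much more heavily than the $q = 1$ penalty. Note that scikit-learn's `Lasso` scales the squared-error term differently from this sketch, so its `alpha` values are not directly comparable to `lam` here.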

⑴ Import libraries

#Data processing / calculation / analysis library
import numpy as np
import pandas as pd

#Graph drawing library
import matplotlib.pyplot as plt
%matplotlib inline

#Machine learning library
import sklearn

⑵ Data acquisition and reading

#Get data
url = 'https://raw.githubusercontent.com/yumi-ito/datasets/master/datasets_auto.csv'

#Read the acquired data as a DataFrame object
df = pd.read_csv(url, header=None)

#Set column label
df.columns = ['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration', 'num-of-doors', 'body-style', 'drive-wheels', 'engine-location', 'wheel-base', 
              'length', 'width', 'height', 'curb-weight', 'engine-type', 'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke', 
              'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price']

This dataset is used to predict automobile prices from specifications such as body style, body size, and fuel economy, together with attribute information such as an insurance risk rating.
It covers 1985 model-year imported cars and trucks in the United States and was compiled from three sources: the 1985 Ward's Automotive Yearbook, the Personal Auto Manuals of the Insurance Services Office, and the Insurance Collision Report of the Insurance Institute for Highway Safety. For details, see https://archive.ics.uci.edu/ml/datasets/Automobile
An outline follows. There are 25 explanatory variables in total, a mix of numeric and categorical types. The objective variable is price, and there are 205 samples.

	Variable name	Meaning	Values (notes)	Data type
0	symboling	Insurance risk rating	-3, -2, -1, 0, 1, 2, 3 (3 is high risk and dangerous, -3 is low risk and safe)	int64
1	normalized-losses	Normalized losses	65 to 256	object
2	make	Manufacturer	alfa-romero, audi, bmw, ..., volkswagen, volvo	object
3	fuel-type	Fuel type	diesel, gas	object
4	aspiration	Intake type	std, turbo	object
5	num-of-doors	Number of doors	four, two	object
6	body-style	Body style	hardtop, wagon, sedan, hatchback, convertible	object
7	drive-wheels	Drive wheels	4wd, fwd, rwd	object
8	engine-location	Engine position	front, rear	object
9	wheel-base	Wheelbase	86.6 to 120.9	float64
10	length	Vehicle length	141.1 to 208.1	float64
11	width	Vehicle width	60.3 to 72.3	float64
12	height	Vehicle height	47.8 to 59.8	float64
13	curb-weight	Curb weight (weight of the empty vehicle)	1488 to 4066	int64
14	engine-type	Engine type	dohc, dohcv, l, ohc, ohcf, ohcv, rotor	object
15	num-of-cylinders	Number of cylinders	eight, five, four, six, three, twelve, two	object
16	engine-size	Engine size	61 to 326	int64
17	fuel-system	Fuel system	1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi	object
18	bore	Cylinder bore (inner diameter)	2.54 to 3.94	object
19	stroke	Piston stroke	2.07 to 4.17	object
20	compression-ratio	Compression ratio	7 to 23	float64
21	horsepower	Horsepower	48 to 288	object
22	peak-rpm	Engine speed at peak output (rpm)	4150 to 6600	object
23	city-mpg	City fuel economy	13 to 49 (miles traveled per gallon)	int64
24	highway-mpg	Highway fuel economy	16 to 54 (miles traveled per gallon)	int64
25	price	Price	5118 to 45400	object

#Output the shape of the data and the number of missing values
print(df.shape)
print('Number of missing values: {}'.format(df.isnull().sum().sum()))

#Output the first 5 rows of the data
df.head()

⑶ Data preprocessing

First, we focus on just three explanatory variables: horsepower, width, and height.
Note that this dataset uses "?" for values that cannot be processed, so the samples containing it must be deleted.

#Create a DataFrame with only the target columns
auto = df[['price', 'horsepower', 'width', 'height']]

#For each column, count the cells that contain "?"
auto.isin(['?']).sum()

#Replace "?" with NaN and delete the rows that contain NaN
auto = auto.replace('?', np.nan).dropna()

#Check the shape of the DataFrame after deletion
auto.shape

This confirms that 199 rows remain, that is, the original 205 minus the 6 rows containing "?", and that the DataFrame holds only the 4 target variables.
Next, check the data types of these 4 variables.

#Data type confirmation
auto.dtypes

Columns of type object hold strings (str) here, so they must be converted to a numeric type.

#Convert data type
auto = auto.assign(price = pd.to_numeric(auto.price))
auto = auto.assign(horsepower = pd.to_numeric(auto.horsepower))

#Check the data type after conversion
auto.dtypes
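
The preprocessing steps above can be tried in isolation on a small frame. This is a minimal sketch with hypothetical toy values (the frame `toy` is not part of the dataset). It also shows `pd.to_numeric(..., errors='coerce')`, an alternative that turns any unparseable entry into NaN in one step.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mimicking the "?" placeholders in the real dataset
toy = pd.DataFrame({
    'price': ['13495', '?', '16500', '13950'],
    'horsepower': ['111', '154', '?', '102'],
    'width': [64.1, 65.5, 66.2, 66.4],
})

# Same steps as above: "?" -> NaN, drop incomplete rows, convert strings to numbers
cleaned = toy.replace('?', np.nan).dropna()
cleaned = cleaned.assign(
    price=pd.to_numeric(cleaned['price']),
    horsepower=pd.to_numeric(cleaned['horsepower']),
)

# Alternative: errors='coerce' converts anything non-numeric to NaN directly.
# These columns end up as float64 because NaN is present before dropna().
coerced = toy.assign(
    price=pd.to_numeric(toy['price'], errors='coerce'),
    horsepower=pd.to_numeric(toy['horsepower'], errors='coerce'),
).dropna()

print(cleaned.dtypes)
print(coerced.dtypes)
```

Both routes keep the same two complete rows; they differ only in the resulting numeric dtype.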

With pandas' `assign()` method, passing `column_name=value` as a keyword argument assigns the value to an existing column, or adds a new column if the name is new.
Finally, to finish the preprocessing, **check the correlation matrix** with pandas' `corr()` method.

auto.corr()

Since price is the objective variable, we look at the correlation coefficients among the remaining explanatory variables; the one between width and horsepower is somewhat high, at 0.6 or more.
In multiple regression analysis, when explanatory variables are highly correlated with one another, the usual practice is to keep only one variable to represent each highly correlated group. This is to avoid the phenomenon called **multicollinearity**, in which high correlation between explanatory variables causes computational problems and produces abnormal coefficient values and odds ratios.
However, since this is a trial, we use all three explanatory variables as they are.
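
The degree of multicollinearity can also be checked numerically. The sketch below is an illustration on synthetic data rather than the automobile dataset; the helper `variance_inflation_factors` and the column names `a`, `b`, `c` are hypothetical. It computes the variance inflation factor, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing variable j on the remaining explanatory variables.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def variance_inflation_factors(frame):
    """Return VIF_j = 1 / (1 - R^2_j) for every column of a numeric DataFrame."""
    vifs = {}
    for column in frame.columns:
        others = frame.drop(columns=column)
        target = frame[column]
        r_squared = LinearRegression().fit(others, target).score(others, target)
        vifs[column] = 1.0 / (1.0 - r_squared)
    return pd.Series(vifs)

# Synthetic stand-in data (an assumption for illustration)
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + rng.normal(scale=0.1, size=200)  # almost a copy of a
c = rng.normal(size=200)                 # unrelated to a and b
features = pd.DataFrame({'a': a, 'b': b, 'c': c})

print(features.corr().round(2))
vif = variance_inflation_factors(features)
print(vif.round(1))
```

A commonly quoted rule of thumb treats a VIF above about 10 as a sign of problematic multicollinearity. Here `a` and `b` far exceed that level, while `c` stays close to 1.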

⑷ Model construction and evaluation

#Check the data
print(auto)

**Using this data, fit a ridge regression model and a multiple regression model, and compare their scores.**

#Import for model building of ridge regression
from sklearn.linear_model import Ridge

#Import for model building of multiple regression analysis
from sklearn.linear_model import LinearRegression

#Import for data splitting (training data and test data)
from sklearn.model_selection import train_test_split

Use pandas' `drop()` method to remove the price column, assigning the explanatory variables to x and the price alone to y.
scikit-learn's `train_test_split` function then splits the explanatory variables x and the objective variable y into training data (train) and test data (test).

#Set explanatory variables and objective variables
x = auto.drop('price', axis=1)
y = auto['price']

#Split into training data and test data
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.5, random_state=0)

**First, build a multiple regression model and calculate its score on the training data and the test data.** For scikit-learn regressors, the `score()` method returns the coefficient of determination R².

#Initialization of LinearRegression class
linear = LinearRegression()

#Execution of learning
linear.fit(X_train, Y_train)

#R2 score on the training data
train_score_linear = linear.score(X_train, Y_train)
print('R2 score of multiple regression analysis (train):', 
      '{:.6f}'.format(train_score_linear))

#R2 score on the test data
test_score_linear = linear.score(X_test, Y_test)
print('R2 score of multiple regression analysis (test):', 
      '{:.6f}'.format(test_score_linear))

`'{:.Nf}'.format()` prints a number with N digits after the decimal point; here N is 6.
Since `score()` returns a float, train_score_linear and test_score_linear can be passed to `format()` directly, without any conversion.
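
For scikit-learn regressors such as `LinearRegression` and `Ridge`, `score()` returns the coefficient of determination R² rather than a classification accuracy. The following sketch, on synthetic data assumed purely for illustration, recomputes R² by hand and compares it with `score()` and `sklearn.metrics.r2_score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in data (not the automobile dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

r2_from_score = model.score(X, y)
r2_from_metric = r2_score(y, y_pred)
print('{:.6f} {:.6f} {:.6f}'.format(r2_manual, r2_from_score, r2_from_metric))
```

The three values agree up to floating-point rounding. A value of 1 means the predictions match the targets exactly, and the value can become negative on test data when the model predicts worse than the mean of the targets would.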

**Next, build a ridge regression model and calculate its score on the training data and the test data.**

#Initialization of Ridge class
ridge = Ridge()

#Execution of learning
ridge.fit(X_train, Y_train)

#R2 score on the training data
train_score_ridge = ridge.score(X_train, Y_train)
print('R2 score of ridge regression (train):', 
      '{:.6f}'.format(train_score_ridge))

#R2 score on the test data
test_score_ridge = ridge.score(X_test, Y_test)
print('R2 score of ridge regression (test):', 
      '{:.6f}'.format(test_score_ridge))

	Multiple regression analysis (L)	Ridge regression (R)	Difference (L-R)
R2 score on training data	0.733358	0.733355	0.000003
R2 score on test data	0.737069	0.737768	-0.000699

For these models, there is no significant difference in performance between multiple regression and ridge regression.
As a tendency, however, multiple regression scores very slightly higher on the training data, while ridge regression comes out ahead on the test data. This reversal is consistent with the intended effect of regularization, though a gap of less than 0.001 on a single split is too small to be conclusive.
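
How strongly regularization acts depends on the `alpha` parameter, which `Ridge()` leaves at its default of 1.0 in the comparison above. The sketch below uses synthetic data (an assumption for illustration, not the automobile dataset) to show how the size of the coefficient vector changes with `alpha` for ridge regression, and how lasso regression with a sufficiently large `alpha` sets coefficients to exactly zero.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic stand-in data with known coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
true_coef = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y = X @ true_coef + rng.normal(scale=1.0, size=60)

# Unregularized baseline
ols_norm = np.linalg.norm(LinearRegression().fit(X, y).coef_)
print('OLS coefficient norm: {:.4f}'.format(ols_norm))

# Ridge (L2): a larger alpha shrinks the coefficient vector
ridge_norms = []
for alpha in [0.1, 1.0, 10.0, 100.0]:
    ridge = Ridge(alpha=alpha).fit(X, y)
    ridge_norms.append(np.linalg.norm(ridge.coef_))
    print('Ridge alpha={:<6} coefficient norm: {:.4f}'.format(alpha, ridge_norms[-1]))

# Lasso (L1): a large enough alpha drives every coefficient to exactly zero
lasso = Lasso(alpha=10.0).fit(X, y)
print('Lasso alpha=10 coefficients:', lasso.coef_)
```

Because the penalty acts on the raw coefficients, its effect depends on the scale of each explanatory variable. Standardizing the variables before fitting (for example with scikit-learn's `StandardScaler`) is a common practice when using ridge or lasso regression.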