I started studying deep learning. This time, I will briefly summarize regularization.
The data is created from the equation $ y = -x^3 + x^2 + x $ (scaled by 0.001 in the code to keep the values small): x takes 50 evenly spaced values from -10 to 10, and y is the value of the equation at each x plus Gaussian noise with mean 0 and standard deviation 0.05.
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model
import numpy as np
import matplotlib.pyplot as plt
#Data generation
np.random.seed(0)
X = np.linspace(-10, 10, 50)
Y_truth = 0.001 * (-X**3 + X**2 + X)
Y = Y_truth + np.random.normal(0, 0.05, len(X))
plt.figure(figsize=(5, 5))
plt.plot(X, Y_truth, color='gray')
plt.plot(X, Y, '.', color='k')
plt.show()
This is the created data. The solid line is the true value (the value of the equation), and the points are the actually observed values (the true value plus noise).
Overfitting is more likely to occur when the model has many degrees of freedom, so we deliberately use degree-30 polynomial regression.
#graph display
def graph(Y_lr, name):
    plt.figure(figsize=(6, 6))
    plt.plot(X, Y_truth, color='gray', label='truth')
    plt.plot(xs, Y_lr, color='r', markersize=2, label=name)
    plt.plot(X, Y, '.', color='k')
    plt.legend()
    plt.ylim(-1, 1)
    plt.show()
#Display settings
xs = np.linspace(-10, 10, 200)
#Introduction of polynomial regression
poly = PolynomialFeatures(degree=30, include_bias=False)
X_poly = poly.fit_transform(X[:, np.newaxis])
After defining the plotting function and the x values used for display, PolynomialFeatures is instantiated and fitted with degree 30 (degree=30).
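As a quick check (a small sketch, not in the original post), the shape of the expanded feature matrix confirms what PolynomialFeatures produced: each of the 50 x values is turned into the 30 features x^1 through x^30 (no bias column, since include_bias=False).
#Quick check of the expanded features (sketch)
print(X_poly.shape)  #(50, 30): 50 samples, 30 polynomial features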
First, do polynomial regression without regularization.
#No regularization
lr0 = linear_model.LinearRegression(normalize=True)  #normalize was removed in scikit-learn 1.2, so this assumes an older version
lr0.fit(X_poly, Y)
Y_lr0 = lr0.predict(poly.fit_transform(xs[:, np.newaxis]))
graph(Y_lr0, 'No Regularization')
Because a degree-30 polynomial has so many degrees of freedom, the curve manages to pass through many of the observed points, which is typical overfitting. It is far from the true values, so no generalization performance can be expected from this model.
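To make the overfitting concrete, here is a small sketch (not in the original post) that compares the error on the noisy training points with the error against the true function on the dense grid xs; an overfit model tends to look much better on the former than on the latter.
#Training error vs. error against the true curve (sketch)
from sklearn.metrics import mean_squared_error
print('MSE on training points :', mean_squared_error(Y, lr0.predict(X_poly)))
print('MSE against true curve :', mean_squared_error(0.001 * (-xs**3 + xs**2 + xs), Y_lr0))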
L2 regularization is the technique known from Ridge regression: it keeps the coefficients from growing too large by adding the squared L2 norm of the parameters, $ C\|w\|_2^2 $, to the loss (C is a constant; in the code below it is passed as alpha).
#L2 regularization
lr2 = linear_model.Ridge(normalize=True, alpha=0.5)
lr2.fit(X_poly, Y)
Y_lr2 = lr2.predict(poly.fit_transform(xs[:, np.newaxis]))
graph(Y_lr2, 'L2')
This time the regression looks much better.
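The strength of the penalty is controlled by the constant (alpha in scikit-learn). As a small side experiment, not in the original post, here is a sketch that sweeps alpha to see how a weaker or stronger penalty changes the fit.
#Effect of the regularization strength (sketch; normalize=True assumes scikit-learn < 1.2)
for a in [0.01, 0.5, 10]:
    lr = linear_model.Ridge(normalize=True, alpha=a)
    lr.fit(X_poly, Y)
    graph(lr.predict(poly.transform(xs[:, np.newaxis])), 'L2 (alpha={})'.format(a))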
L1 regularization is the technique known from Lasso regression: it also keeps the coefficients from growing too large, this time by adding the L1 norm of the parameters, $ C\|w\|_1 $, to the loss (C is a constant; in the code below it is passed as alpha).
#L1 regularization
lr1 = linear_model.LassoLars(normalize=True, alpha=0.001)
lr1.fit(X_poly, Y)
Y_lr1 = lr1.predict(poly.fit_transform(xs[:, np.newaxis]))
graph(Y_lr1, 'L1')
The shape is very close to a perfect fit. Compared to L2 regularization, L1 regularization seems to produce an even better regression here.
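To put a number on that visual comparison, here is a sketch (not in the original post) that measures each fit against the true function on the dense grid.
#Mean squared error of each fit against the true function (sketch)
from sklearn.metrics import mean_squared_error
ys_truth = 0.001 * (-xs**3 + xs**2 + xs)
for name, pred in [('No Regularization', Y_lr0), ('L2', Y_lr2), ('L1', Y_lr1)]:
    print(name, ':', mean_squared_error(ys_truth, pred))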
Let's compare the 30 coefficients for each of no regularization, L2 regularization, and L1 regularization (listed from the lowest-order term).
import pandas as pd
result = []
for i in range(len(lr0.coef_)):
    tmp = lr0.coef_[i], lr2.coef_[i], lr1.coef_[i]
    result.append(tmp)
df = pd.DataFrame(result)
df.columns = ['No Regularization', 'L2', 'L1']
print(df)
You can see that the L2 coefficients are smaller than the unregularized ones, and that L1 gives a sparse representation in which many coefficients are exactly zero.
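As a quick check of that sparsity, here is a sketch (not in the original post) that counts how many coefficients each model sets exactly to zero.
#Number of exactly-zero coefficients per model (sketch)
for name, model in [('No Regularization', lr0), ('L2', lr2), ('L1', lr1)]:
    print(name, ':', int((model.coef_ == 0).sum()), 'of', len(model.coef_), 'coefficients are exactly zero')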
I'm glad that L1 regularization can suppress overfitting and reduce dimensions.