Comparison of L1 regularization and Leaky ReLU

We compare the effects of L1 regularization and Leaky ReLU on error backpropagation. L1 regularization is usually explained as a tool for dimensionality reduction, stripping away unnecessary explanatory variables in machine learning. Leaky ReLU, on the other hand, is an activation function that keeps learning from stalling even in deep, multi-layer networks by giving the output a slight slope even when $x$ is negative. At first glance the two seem completely unrelated, and honestly I am not even sure myself why I compared them.

Error backpropagation in an ordinary fully connected network

```python
from keras.layers import Input, Dense, Activation
from keras.models import Model

# fully connected model: 124 -> 100 -> ReLU -> 10
x1 = Input(shape=(124,))
x2 = Dense(100)(x1)
x3 = Activation('relu')(x2)
y = Dense(10)(x3)

model = Model(inputs=x1, outputs=y)
model.compile(optimizer='sgd', loss='mean_squared_error')
```

\begin{align}
x_1 &= input\\
x_2 &= Ax_1 + B\\
x_3 &= relu(x_2)\\
y &= Cx_3+D\\
L_{mse} &= \frac{1}{2}(t-y)^2\\
\end{align}

Assume that the loss function, the activation function, and the fully connected layers are computed as above. Each partial derivative is then

\begin{align}
\frac{\partial L_{mse}}{\partial y} &= -(t - y)\\
\frac{\partial y}{\partial C} &= x_3\\
\frac{\partial y}{\partial D} &= 1\\
\frac{\partial y}{\partial x_3} &= C\\
\frac{\partial x_3}{\partial x_2} &= H(x_2)\\
 H(x)&=\left\{
\begin{array}{ll}
1 & (x \geq 0) \\
0 & (x \lt 0)
\end{array}
\right.\\
\frac{\partial x_2}{\partial A} &= x_1\\
\frac{\partial x_2}{\partial B} &= 1\\
\end{align}

The model weights $A, B, C, D$ are updated with gradients obtained from the chain rule:

\begin{align}
\frac{\partial L_{mse}}{\partial A} &= \frac{\partial L_{mse}}{\partial y}\frac{\partial y}{\partial x_3}\frac{\partial x_3}{\partial x_2}\frac{\partial x_2}{\partial A}\\
&=-(t - y)\cdot C \cdot H(x_2) \cdot x_1\\ 
\frac{\partial L_{mse}}{\partial B} &= \frac{\partial L_{mse}}{\partial y}\frac{\partial y}{\partial x_3}\frac{\partial x_3}{\partial x_2}\frac{\partial x_2}{\partial B}\\
&=-(t - y)\cdot C \cdot H(x_2) \\ 
\frac{\partial L_{mse}}{\partial C} &= \frac{\partial L_{mse}}{\partial y}\frac{\partial y}{\partial C}\\
&=-(t - y)\cdot x_3 \\
\frac{\partial L_{mse}}{\partial D} &= \frac{\partial L_{mse}}{\partial y}\frac{\partial y}{\partial D}\\
&=-(t - y)\\
\end{align}

These are the gradients of ordinary error backpropagation.
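
As a quick sanity check of these formulas, here is a minimal scalar sketch (the values of $A, B, C, D, x_1, t$ are arbitrary) that compares the analytic gradients above with central finite differences:

```python
import numpy as np

# arbitrary scalar values for the check
A, B, C, D, x1, t = 0.7, -0.2, 1.3, 0.1, 0.5, 2.0

def loss(a, b, c, d):
    x2 = a * x1 + b
    x3 = np.maximum(x2, 0.0)        # relu
    y = c * x3 + d
    return 0.5 * (t - y) ** 2

# analytic gradients from the formulas above
x2 = A * x1 + B
x3 = np.maximum(x2, 0.0)
y = C * x3 + D
H = 1.0 if x2 >= 0 else 0.0
analytic = [-(t - y) * C * H * x1,   # dL/dA
            -(t - y) * C * H,        # dL/dB
            -(t - y) * x3,           # dL/dC
            -(t - y)]                # dL/dD

# central finite differences
eps = 1e-6
numeric = [
    (loss(A + eps, B, C, D) - loss(A - eps, B, C, D)) / (2 * eps),
    (loss(A, B + eps, C, D) - loss(A, B - eps, C, D)) / (2 * eps),
    (loss(A, B, C + eps, D) - loss(A, B, C - eps, D)) / (2 * eps),
    (loss(A, B, C, D + eps) - loss(A, B, C, D - eps)) / (2 * eps),
]

print(np.allclose(analytic, numeric))   # True
```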

Error backpropagation of a fully connected layer with L1 regularization

In Keras there are the following three kinds of regularizers (https://keras.io/ja/regularizers/):

- kernel_regularizer
- bias_regularizer
- activity_regularizer

In terms of the first fully connected layer $x_2 = Ax_1 + B$ from the previous section, kernel_regularizer regularizes the weight $A$, bias_regularizer regularizes the bias $B$, and activity_regularizer regularizes the output $x_2$. With L1 regularization, the absolute value of the variable multiplied by a small coefficient $\lambda$ is added to the loss function: $\lambda |A|$, $\lambda |B|$, and $\lambda |x_2|$, respectively. With L2 regularization, on the other hand, the square of the variable multiplied by a small coefficient $\lambda$ is added: $\lambda A^2$, $\lambda B^2$, and $\lambda x_2^2$, respectively.
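
For reference, here is a minimal sketch of attaching all three regularizers to a single Dense layer (the coefficient 0.01 is an arbitrary choice for $\lambda$):

```python
from keras import regularizers
from keras.layers import Dense

# L1 penalties on the weight A, the bias B, and the output x2, respectively
layer = Dense(
    100,
    kernel_regularizer=regularizers.l1(0.01),    # adds lambda*|A| to the loss
    bias_regularizer=regularizers.l1(0.01),      # adds lambda*|B| to the loss
    activity_regularizer=regularizers.l1(0.01),  # adds lambda*|x2| to the loss
)
```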

Here we consider the activity_regularizer with L1 regularization, i.e. the extra loss term $\lambda |x_2|$. The derivative of $|x|$ is $1$ for $x > 0$ and $-1$ for $x < 0$, so using the step function $H(x)$ we can write

\begin{align}
L_{L1} &= \lambda |x_2| \\
\frac{\partial |x|}{\partial x} &= \left\{
\begin{array}{ll}
1 & (x \geq 0) \\
-1 & (x \lt 0)
\end{array}
\right.\\
 H(x)&=\left\{
\begin{array}{ll}
1 & (x \geq 0) \\
0 & (x \lt 0)
\end{array}
\right.\\
\frac{\partial L_{L1}}{\partial x_2} &= \lambda(2H(x_2)-1)\\
\end{align}

Therefore, the change in the gradients used to update the model weights $A, B, C, D$ due to the activity_regularizer with L1 regularization is

\begin{align}
\frac{\partial L_{L1}}{\partial A} &= \frac{\partial L_{L1}}{\partial x_2}\frac{\partial x_2}{\partial A}\\
&=\lambda(2H(x_2)-1) \cdot x_1\\ 
\frac{\partial L_{L1}}{\partial B} &= \frac{\partial L_{L1}}{\partial x_2}\frac{\partial x_2}{\partial B}\\
&=\lambda(2H(x_2)-1) \\ 
\frac{\partial L_{L1}}{\partial C} &= 0\\
\frac{\partial L_{L1}}{\partial D} &= 0\\
\end{align}
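
As before, a small scalar sketch (arbitrary values) confirms these expressions against finite differences of the term $\lambda |x_2|$:

```python
import numpy as np

lam = 0.01                     # lambda
A, B, x1 = 0.7, -0.2, 0.5      # arbitrary values

def l1_term(a, b):
    return lam * abs(a * x1 + b)    # lambda * |x2|

x2 = A * x1 + B
H = 1.0 if x2 >= 0 else 0.0

eps = 1e-6
num_dA = (l1_term(A + eps, B) - l1_term(A - eps, B)) / (2 * eps)
num_dB = (l1_term(A, B + eps) - l1_term(A, B - eps)) / (2 * eps)

print(num_dA, lam * (2 * H - 1) * x1)   # both ~ 0.005
print(num_dB, lam * (2 * H - 1))        # both ~ 0.01
```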

Note that the explanation of L1 regularization as a tool for dimensionality reduction refers to the kernel_regularizer rather than the activity_regularizer, i.e. to the L1 loss term $L_{L1} = \lambda |A|$.

```python
from keras import regularizers
from keras.layers import Input, Dense, Activation
from keras.models import Model

# same model, with an L1 activity_regularizer on the first Dense layer
x1 = Input(shape=(124,))
x2 = Dense(100, activity_regularizer=regularizers.l1(0.01))(x1)
x3 = Activation('relu')(x2)
y = Dense(10)(x3)

model = Model(inputs=x1, outputs=y)
model.compile(optimizer='sgd', loss='mean_squared_error')
```

Error backpropagation of a fully connected layer with Leaky ReLU

LeakyReLU is the following activation function, where generally $\alpha \ll 1$:

\begin{align}
 LeakyRelu(x)&=\left\{
\begin{array}{ll}
x & (x \geq 0) \\
\alpha x & (x \lt 0)
\end{array}
\right.\\
\frac{\partial LeakyRelu(x)}{\partial x}
 &=\left\{
\begin{array}{ll}
1 & (x \geq 0) \\
\alpha  & (x \lt 0)
\end{array}
\right.\\
\end{align}

Using the step function

\begin{align}
 H(x)&=\left\{
\begin{array}{ll}
1 & (x \geq 0) \\
0 & (x \lt 0)
\end{array}
\right.\\
\end{align}

the difference between the gradient of LeakyReLU and that of the ordinary ReLU can be written as

\begin{align}
\frac{\partial LeakyRelu(x)}{\partial x} &=(1-\alpha )H(x) + \alpha \\
\frac{\partial LeakyRelu(x)}{\partial x}-\frac{\partial Relu(x)}{\partial x} 
 &=((1-\alpha )H(x) + \alpha )-H(x)\\
&=-\alpha (H(x) -1)\\
\end{align}
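
This relation is easy to check numerically; here is a small sketch with an arbitrary $\alpha$:

```python
import numpy as np

alpha = 0.01
x = np.linspace(-3.0, 3.0, 13)
H = (x >= 0).astype(float)                    # step function H(x)

grad_leaky = np.where(x >= 0, 1.0, alpha)     # d LeakyRelu(x) / dx
grad_relu = H                                 # d Relu(x) / dx

print(np.allclose(grad_leaky, (1 - alpha) * H + alpha))        # True
print(np.allclose(grad_leaky - grad_relu, -alpha * (H - 1)))   # True
```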

Therefore, the change in the gradients used to update the model weights $A, B, C, D$ when going from ReLU to LeakyReLU is

\begin{align}
\frac{\partial L_{mse}}{\partial A} &=(t - y)\cdot C \cdot \alpha (H(x_2)-1) \cdot x_1\\ 
\frac{\partial L_{mse}}{\partial B} &=(t - y)\cdot C \cdot \alpha (H(x_2)-1) \\ 
\frac{\partial L_{mse}}{\partial C} &= 0\\
\frac{\partial L_{mse}}{\partial D} &= 0\\
\end{align}

This can also be deliberately rearranged as follows. The second term (in square brackets) is simply the ordinary error backpropagation gradient multiplied by $\alpha$, so it will be ignored for now.

\begin{align}
\frac{\partial L_{mse}}{\partial A} &=(t - y)\cdot C \cdot \alpha (2H(x_2)-1) \cdot x_1-[\alpha (t - y)\cdot C \cdot H(x_2) \cdot x_1]\\ 
\frac{\partial L_{mse}}{\partial B} &=(t - y)\cdot C \cdot \alpha (2H(x_2)-1)-[\alpha (t - y)\cdot C \cdot H(x_2)] \\ 
\frac{\partial L_{mse}}{\partial C} &= 0\\
\frac{\partial L_{mse}}{\partial D} &= 0\\
\end{align}
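
The gradient change with respect to $A$ can also be checked numerically in the scalar case. In the sketch below (arbitrary values, chosen so that $x_2 < 0$ where the two activations actually differ, with $y$ taken from the LeakyReLU forward pass), the finite-difference gradients of the two models differ by $(t - y)\cdot C \cdot \alpha (H(x_2)-1) \cdot x_1$:

```python
import numpy as np

alpha = 0.01
A, B, C, D, x1, t = 0.7, -0.6, 1.3, 0.1, 0.5, 2.0   # arbitrary; here x2 = A*x1 + B < 0

def relu(x):
    return np.maximum(x, 0.0)

def leaky_relu(x):
    return np.where(x >= 0, x, alpha * x)

def loss(a, act):
    # same scalar model as before, with the weight A replaced by a
    x2 = a * x1 + B
    y = C * act(x2) + D
    return 0.5 * (t - y) ** 2

x2 = A * x1 + B
H = 1.0 if x2 >= 0 else 0.0
y = C * leaky_relu(x2) + D     # forward output of the LeakyReLU model

# finite-difference gradients dL/dA under each activation
eps = 1e-6
grad_leaky = (loss(A + eps, leaky_relu) - loss(A - eps, leaky_relu)) / (2 * eps)
grad_relu = (loss(A + eps, relu) - loss(A - eps, relu)) / (2 * eps)

print(grad_leaky - grad_relu)                # ~ -0.0124
print((t - y) * C * alpha * (H - 1) * x1)    # ~ -0.0124
```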

```python
from keras.layers import Input, Dense, LeakyReLU
from keras.models import Model

# same model, with LeakyReLU in place of the plain ReLU activation
x1 = Input(shape=(124,))
x2 = Dense(100)(x1)
x3 = LeakyReLU(alpha=0.01)(x2)
y = Dense(10)(x3)

model = Model(inputs=x1, outputs=y)
model.compile(optimizer='sgd', loss='mean_squared_error')
```

Comparison

Let us compare the changes relative to ordinary error backpropagation. Both coefficients $\lambda \ll 1$ and $\alpha \ll 1$ are well below one.

- Change due to the activity_regularizer with L1 regularization

\begin{align}
\frac{\partial L_{L1}}{\partial A} &=\lambda(2H(x_2)-1) \cdot x_1\\ 
\frac{\partial L_{L1}}{\partial B} &=\lambda(2H(x_2)-1) \\ 
\frac{\partial L_{L1}}{\partial C} &= 0\\
\frac{\partial L_{L1}}{\partial D} &= 0\\
\end{align}

- Change due to ReLU => LeakyReLU

\begin{align}
\frac{\partial L_{mse}}{\partial A} &=(t - y)\cdot C \cdot \alpha (2H(x_2)-1) \cdot x_1\\ 
\frac{\partial L_{mse}}{\partial B} &=(t - y)\cdot C \cdot \alpha (2H(x_2)-1) \\ 
\frac{\partial L_{mse}}{\partial C} &= 0\\
\frac{\partial L_{mse}}{\partial D} &= 0\\
\end{align}

Here, in the early stage of training, where $(t - y)\cdot C$ cannot be regarded as zero, the change due to ReLU => LeakyReLU can be seen as close to the change due to the activity_regularizer with L1 regularization. Conversely, once training has progressed far enough that $(t - y)\cdot C$ approaches zero, one might think of it as an L1 activity_regularizer whose effect gradually weakens. Note, however, that the change due to the L1 activity_regularizer only involves the layer to which the regularization is applied, whereas the change due to ReLU => LeakyReLU is also propagated to the layers before the activation function.

Also, the second term of the ReLU => LeakyReLU change, which was ignored above, is exactly the ordinary error backpropagation gradient multiplied by $\alpha$.

- Second term of the ReLU => LeakyReLU change

\begin{align}
\frac{\partial L_{mse}}{\partial A} &=-\alpha (t - y)\cdot C \cdot H(x_2) \cdot x_1\\ 
\frac{\partial L_{mse}}{\partial B} &=-\alpha (t - y)\cdot C \cdot H(x_2) \\ 
\frac{\partial L_{mse}}{\partial C} &= 0\\
\frac{\partial L_{mse}}{\partial D} &= 0\\
\end{align}

When this second term is combined with the original gradient, every gradient upstream of the LeakyReLU activation becomes $(1 + \alpha)$ times larger. This suggests that in a deep model, replacing all ReLU functions with LeakyReLU can amplify the gradients of the shallow layers close to the input by a factor of $(1 + \alpha)^n$, where $n$ is the number of LeakyReLU activations passed through.

Summary

We compared L1 regularization and Leaky ReLU. The change due to ReLU => LeakyReLU can be thought of as the change due to an activity_regularizer with L1 regularization, combined with multiplying the gradients before the activation function by $(1 + \alpha)$. This interpretation corresponds to decomposing LeakyReLU into the following sum of functions.

\begin{align}
 LeakyRelu(x)&=\left\{
\begin{array}{ll}
x & (x \geq 0) \\
\alpha x & (x \lt 0)
\end{array}
\right.\\
&=\left\{
\begin{array}{ll}
(1+\alpha )x & (x \geq 0) \\
0 & (x \lt 0)
\end{array}
\right.  - \alpha|x|\\
&=(1+\alpha )Relu(x)- \alpha|x|
\end{align}
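
This decomposition is easy to confirm numerically (a sketch with an arbitrary $\alpha$):

```python
import numpy as np

alpha = 0.01
x = np.linspace(-3.0, 3.0, 601)

leaky = np.where(x >= 0, x, alpha * x)    # LeakyRelu(x)
relu = np.maximum(x, 0.0)                 # Relu(x)

# LeakyRelu(x) = (1 + alpha) * Relu(x) - alpha * |x|
print(np.allclose(leaky, (1 + alpha) * relu - alpha * np.abs(x)))   # True
```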
