Recently, Swish and Mish have been getting attention as activation functions. I could not find a Japanese article explaining Gelu, so I would like to summarize my own understanding. Explaining Gelu should also lead to an understanding of Swish. Paper: https://arxiv.org/abs/1606.08415
\mathrm{Gelu}(x) = x\Phi(x) = x\cdot\frac{1}{2}\left[1+\mathrm{erf}\left(\frac{x}{\sqrt{2}}\right)\right]
Gelu is defined as above, where $\Phi(x)$ is the cumulative distribution function of the normal (Gaussian) distribution and $\mathrm{erf}()$ is the error function. This is the exact definition, but in general the error function cannot be expressed with elementary functions, so it is approximated in practice.
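For reference, the second equality follows from the standard relation between the cumulative distribution function of the standard normal distribution and the error function:

\Phi(x) = \int_{-\infty}^{x}\frac{1}{\sqrt{2\pi}}e^{-t^2/2}\,dt = \frac{1}{2}\left[1+\mathrm{erf}\left(\frac{x}{\sqrt{2}}\right)\right]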
\begin{align}
\mathrm{Gelu}(x) &\approx 0.5x\left(1+\tanh\left[\sqrt{\frac{2}{\pi}}\left(x+0.044715x^3\right)\right]\right)\\
\mathrm{Gelu}(x) &\approx x\sigma(1.702x)\\
\mathrm{Swish}(x) &= x\sigma(\beta x)
\end{align}
Let's plot Gelu and its approximations and look at how they differ from the Relu function. Gelu ideal and Gelu approximate 1 are almost indistinguishable, whereas Gelu approximate 2 deviates slightly from Gelu ideal. We can also see that the Swish function with $\beta = 1.702$ is identical to Gelu approximate 2, so whatever explains Gelu's advantage also explains Swish's advantage. In the Gelu paper, the Swish function with $\beta = 1$ is called SiLU.
Note, however, that although the Swish function with $\beta = 1.702$ equals Gelu approximate 2, the exact Gelu function differs from the Swish function.
gelu.py
import matplotlib.pyplot as plt
import numpy as np
from scipy import special
x = np.arange(-6, 6, 0.05)
y1 = np.maximum(x, 0) # Relu
y2 = x*0.5*(1+special.erf(np.sqrt(0.5)*x)) # Gelu ideal
y3 = x*0.5*(1+np.tanh(np.sqrt(2/np.pi)*(x+0.044715*x*x*x))) # Gelu approximate 1
y4 = x/(1+np.exp(-1.702*x)) # Gelu approximate 2
plt.plot(x, y1, label="Relu")
plt.plot(x, y2, label="Gelu ideal")
plt.plot(x, y3, label="Gelu approximate 1")
plt.plot(x, y4, label="Gelu approximate 2")
plt.legend()
plt.show()
plt.plot(x, y2-y1, label="Gelu ideal - Relu")
plt.plot(x, y3-y1, label="Gelu approximate 1 - Relu")
plt.plot(x, y4-y1, label="Gelu approximate 2 - Relu")
plt.legend()
plt.show()
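Incidentally, to check numerically that Gelu ideal and Gelu approximate 1 are almost the same while Gelu approximate 2 deviates, here is a minimal sketch that reuses y2, y3, y4 from the script above:

# Maximum absolute deviation of each approximation from the exact Gelu over x in [-6, 6)
print("approximate 1 vs ideal:", np.max(np.abs(y3 - y2)))
print("approximate 2 vs ideal:", np.max(np.abs(y4 - y2)))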
Let's look again at the Gelu - Relu graph. Gelu - Relu is a curve with two valleys, and its gradient becomes 0 near $x = \pm 1$. Gelu is an abbreviation of Gaussian Error Linear Unit. From this, it appears that this activation function has a gradient component that pulls the input of the function toward 1 or -1. (In other words, the activation function itself includes a term that promotes a kind of regularization.)
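If you want to locate the valley bottoms numerically rather than by eye, here is a minimal sketch reusing x, y1, y2 from the script above:

# Find the local minima of (Gelu ideal - Relu), i.e. the valley bottoms,
# by locating where its slope changes sign from negative to positive.
slope = np.gradient(y2 - y1, x)
valleys = np.where((slope[:-1] < 0) & (slope[1:] > 0))[0]
print(x[valleys])  # the x positions of the two valleys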
I would like to compare this with L1 and L2 activity regularization (activity_regularizer, i.e. regularization of the intermediate layer outputs). The loss function component added in that case is

\lambda\sum_i |h_i| \quad (\mathrm{L1}), \qquad \lambda\sum_i h_i^2 \quad (\mathrm{L2})

where $h_i$ denotes the intermediate layer outputs; this penalty is added to the loss explicitly, whereas in Gelu a similar pull appears implicitly through the shape of the activation function.
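For comparison, this is roughly how such an explicit penalty is attached in Keras (a minimal sketch; the layer sizes and the coefficient 1e-4 are arbitrary choices for illustration):

import tensorflow as tf

# An L2 penalty on the intermediate layer's output is added to the loss explicitly,
# unlike Gelu, where the regularizing pull comes from the activation's shape itself.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu",
                          activity_regularizer=tf.keras.regularizers.l2(1e-4)),
    tf.keras.layers.Dense(10),
])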
The differences of Gelu, Swish with $\beta = 1$, and Mish from Relu are shown below. For Mish, the depth and position of the valleys differ between the positive and negative sides. In the negative region Mish is close to Swish with $\beta = 1.0$, while in the positive region it feels closer to Gelu approximate 2 (Swish with $\beta = 1.702$). (As pure speculation about why Mish is biased toward the negative side: the larger the value the negative side converges to, the larger the fraction of inputs that take positive values once this regularization acts, or so I imagined.)
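For reference, Mish is defined as

\mathrm{Mish}(x) = x\tanh(\mathrm{softplus}(x)) = x\tanh(\ln(1+e^x))

which is exactly what y6 computes in the script below.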
gelu.py
y5 = x/(1+np.exp(-1.0*x)) # Swish (beta=1.0)
y6 = x*np.tanh(np.log(1+np.exp(x))) # Mish
plt.plot(x, y2-y1, label="Gelu ideal - Relu")
plt.plot(x, y4-y1, label="Gelu approximate 2 - Relu")
plt.plot(x, y5-y1, label="Swish (beta=1.0) - Relu")
plt.plot(x, y6-y1, label="Mish - Relu")
plt.legend()
plt.show()
Above, I plotted Gelu as the deviation from the Relu function. I have not found any other article that plots it this way, so I am not sure whether it is a good interpretation, but my understanding is that Gelu contains a gradient component that pulls the input toward 1 or -1. The Mish paper describes this kind of behavior with the phrase "self regularized", and I think it refers to the same idea.