Recently, Swish and Mish have been getting attention as activation functions. I could not find a Japanese article explaining Gelu, so I would like to summarize my own understanding. Explaining Gelu should also lead to an understanding of Swish. Paper: https://arxiv.org/abs/1606.08415
\mathrm{Gelu}(x) = x\Phi(x) = x\cdot\frac{1}{2}\left[1+\mathrm{erf}\left(\frac{x}{\sqrt{2}}\right)\right]
Gelu is defined as above, where $\Phi(x)$ is the cumulative distribution function of the normal (Gaussian) distribution and $\mathrm{erf}()$ is the error function. This is the exact definition, but in general the error function cannot be expressed with elementary functions, so it is approximated in practice.
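For reference, the second equality follows from the standard relation between the cumulative distribution function of the standard normal distribution and the error function:

\Phi(x) = \int_{-\infty}^{x}\frac{1}{\sqrt{2\pi}}e^{-t^2/2}\,dt = \frac{1}{2}\left[1+\mathrm{erf}\left(\frac{x}{\sqrt{2}}\right)\right]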
\begin{align}
\mathrm{Gelu}(x) &\approx 0.5x\left(1+\tanh\left[\sqrt{\frac{2}{\pi}}\left(x+0.044715x^3\right)\right]\right)\\
\mathrm{Gelu}(x) &\approx x\sigma(1.702x)\\
\mathrm{Swish}(x) &= x\sigma(\beta x)
\end{align}
Let's plot Gelu and its approximations and look at how they differ from the Relu function. Gelu ideal and Gelu approximate 1 are almost indistinguishable, whereas Gelu approximate 2 deviates slightly from Gelu ideal. We can also see that the Swish function with $\beta = 1.702$ is identical to Gelu approximate 2, so whatever explains Gelu's advantage also explains Swish's advantage. In the Gelu paper, the Swish function with $\beta = 1$ is called SiLU.
Note, however, that although the Swish function with $\beta = 1.702$ equals Gelu approximate 2, the exact Gelu function differs from the Swish function.
gelu.py
import matplotlib.pyplot as plt
import numpy as np
from scipy import special
x = np.arange(-6, 6, 0.05)
y1 = np.maximum(x, 0) # Relu
y2 = x*0.5*(1+special.erf(np.sqrt(0.5)*x)) # Gelu ideal
y3 = x*0.5*(1+np.tanh(np.sqrt(2/np.pi)*(x+0.044715*x*x*x))) # Gelu approximate 1
y4 = x/(1+np.exp(-1.702*x)) # Gelu approximate 2
plt.plot(x, y1, label="Relu")
plt.plot(x, y2, label="Gelu ideal")
plt.plot(x, y3, label="Gelu approximate 1")
plt.plot(x, y4, label="Gelu approximate 2")
plt.legend()
plt.show()
plt.plot(x, y2-y1, label="Gelu ideal - Relu")
plt.plot(x, y3-y1, label="Gelu approximate 1 - Relu")
plt.plot(x, y4-y1, label="Gelu approximate 2 - Relu")
plt.legend()
plt.show()
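Incidentally, to check numerically that Gelu ideal and Gelu approximate 1 are almost the same while Gelu approximate 2 deviates, here is a minimal sketch that reuses y2, y3, y4 from the script above:

# Maximum absolute deviation of each approximation from the exact Gelu over x in [-6, 6)
print("approximate 1 vs ideal:", np.max(np.abs(y3 - y2)))
print("approximate 2 vs ideal:", np.max(np.abs(y4 - y2)))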
Let's look again at the Gelu - Relu graph. Gelu - Relu is a curve with two valleys, and its gradient becomes 0 near $x = \pm 1$. Gelu is an abbreviation of Gaussian Error Linear Unit. From this, it appears that this activation function has a gradient component that pulls the input of the function toward 1 or -1. (In other words, the activation function itself includes a term that promotes a kind of regularization.)
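If you want to locate the valley bottoms numerically rather than by eye, here is a minimal sketch reusing x, y1, y2 from the script above:

# Find the local minima of (Gelu ideal - Relu), i.e. the valley bottoms,
# by locating where its slope changes sign from negative to positive.
slope = np.gradient(y2 - y1, x)
valleys = np.where((slope[:-1] < 0) & (slope[1:] > 0))[0]
print(x[valleys])  # the x positions of the two valleys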
I would like to compare this with L1 and L2 activity regularization (activity_regularizer, i.e. regularization of the intermediate layer outputs). The loss function component added in that case is

\lambda\sum_i |h_i| \quad (\mathrm{L1}), \qquad \lambda\sum_i h_i^2 \quad (\mathrm{L2})

where $h_i$ denotes the intermediate layer outputs; this penalty is added to the loss explicitly, whereas in Gelu a similar pull appears implicitly through the shape of the activation function.
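For comparison, this is roughly how such an explicit penalty is attached in Keras (a minimal sketch; the layer sizes and the coefficient 1e-4 are arbitrary choices for illustration):

import tensorflow as tf

# An L2 penalty on the intermediate layer's output is added to the loss explicitly,
# unlike Gelu, where the regularizing pull comes from the activation's shape itself.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu",
                          activity_regularizer=tf.keras.regularizers.l2(1e-4)),
    tf.keras.layers.Dense(10),
])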
The differences of Gelu, Swish with $\beta = 1$, and Mish from Relu are shown below. For Mish, the depth and position of the valleys differ between the positive and negative sides. In the negative region Mish is close to Swish with $\beta = 1.0$, while in the positive region it feels closer to Gelu approximate 2 (Swish with $\beta = 1.702$). (As pure speculation about why Mish is biased toward the negative side: the larger the value the negative side converges to, the larger the fraction of inputs that take positive values once this regularization acts, or so I imagined.)
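For reference, Mish is defined as

\mathrm{Mish}(x) = x\tanh(\mathrm{softplus}(x)) = x\tanh(\ln(1+e^x))

which is exactly what y6 computes in the script below.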
gelu.py
y5 = x/(1+np.exp(-1.0*x)) # Swish (beta=1.0)
y6 = x*np.tanh(np.log(1+np.exp(x))) # Mish
plt.plot(x, y2-y1, label="Gelu ideal - Relu")
plt.plot(x, y4-y1, label="Gelu approximate 2 - Relu")
plt.plot(x, y5-y1, label="Swish (beta=1.0) - Relu")
plt.plot(x, y6-y1, label="Mish - Relu")
plt.legend()
plt.show()
Above, I plotted Gelu as the deviation from the Relu function. I have not found any other article that plots it this way, so I am not sure whether it is a good interpretation, but my understanding is that Gelu contains a gradient component that pulls the input toward 1 or -1. The Mish paper describes this kind of behavior with the phrase "self regularized", and I think it refers to the same idea.