The previous article is here. In this article, we will look at dropout, one of the standard methods for suppressing overfitting. Although it is a simple method, the fact that it has been in continuous use since it was proposed suggests how effective it is. That said, there still seems to be no theoretical explanation of why it suppresses overfitting (perhaps simply a lack of research...), although many possible reasons have been proposed.
- [What is a dropout](#what-is-a-dropout)
- [Relationship with ensemble learning](#relationship-with-ensemble-learning)
- [View implementation from theory](#view-implementation-from-theory)
- [Implementation of dropout layer](#implementation-of-dropout-layer)
- Experiment
## What is a dropout

Dropout was proposed in 2012 as a method of suppressing overfitting and was adopted by the famous **AlexNet**. The outline is simply: "during training, shut out the output of each neuron in the fully connected layers with a certain probability $ratio$." It is surprising that this alone suppresses overfitting. There does not seem to be a settled theoretical explanation of why it works, but there are several hypotheses, one of which is ensemble learning.
## Relationship with ensemble learning

Ensemble learning is a technique that achieves high accuracy by combining multiple weak learners. Dropout is particularly related to the bagging technique; for more details, see [here](https://qiita.com/kuroitu/items/57425380546f7b9ed91c#%E3%82%A2%E3%83%B3%E3%82%B5%E3%83%B3%E3%83%96%E3%83%AB%E5%AD%A6%E7%BF%92). In any case, since dropout effectively trains multiple models at the same time, it resembles a kind of bagging: the set of deactivated neurons changes on every training step, so each mask pattern corresponds to training a different model.
From this, dropout can be thought of as performing ensemble learning with multiple learners in a simulated way. One characteristic of bagging is that the resulting model has higher bias and lower variance, so while it fits the training data to some extent, it does not fit it perfectly. This is thought to be why overfitting is suppressed.
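As a rough illustration of this view, here is a minimal NumPy sketch of my own (not code from this series): each training step draws a fresh random mask, and each distinct mask corresponds to a different sub-network being trained.

```python
import numpy as np

rng = np.random.default_rng(0)
ratio = 0.5
n_neurons = 4

# Each training step draws a fresh mask; each distinct mask is a different sub-network.
masks = {tuple(np.where(rng.random(n_neurons) >= ratio, 1, 0)) for _ in range(10)}
print(len(masks), "distinct sub-networks sampled in 10 steps")
for m in masks:
    print(m)
```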
## View implementation from theory

Now let's take a brief look at the implementation from the theory side. As mentioned earlier, the dropout layer is easy to implement because it only "shuts out the output of each neuron in the fully connected layers with a certain probability $ratio$ during training." However, as sharp readers may have noticed, the key phrase is "**during training**." So what happens when training is finished and inference begins?
No dropout is applied during inference, so all neurons remain active. As you can imagine, this means the output "density" differs between training and inference. One way to resolve this is to multiply the output by $(1 - ratio)$ at inference time.
Let's look at the formulas. Writing $y$ for the output before applying dropout and $\hat{y}$ for the output after applying it, the expected value of the output during training is
\mathbb{E}[\hat{y}] = \underbrace{(1 - ratio)\, y}_{\text{expected value of active neurons}} + \underbrace{ratio \times 0}_{\text{expected value of dropped neurons}} = (1 - ratio)\, y
At inference time, on the other hand, the dropout rate $ratio$ is 0, so the expected value of the output becomes
\mathbb{E}[\hat{y}] = \underbrace{(1 - 0)\, y}_{\text{expected value of active neurons}} + \underbrace{0 \times 0}_{\text{expected value of dropped neurons}} = y
That is, the inference output is $\frac{1}{1 - ratio}$ times "darker" than the training output (note that $0 \le ratio \lt 1$ here). The idea is to eliminate this mismatch by multiplying the inference output by $(1 - ratio)$ to correct for this "darkness":
(1 - ratio)\, \mathbb{E}[\hat{y}] = (1 - ratio) \left\{ \underbrace{(1 - 0)\, y}_{\text{expected value of active neurons}} + \underbrace{0 \times 0}_{\text{expected value of dropped neurons}} \right\} = (1 - ratio)\, y
However, this is an easy but risky method. Of course, training proceeds without any problem as it is, and inference also works. The risk is that this approach "changes the output at inference time." In many cases this is harmless, but since the output of the inference phase is used to evaluate the model's accuracy, it is better not to touch it.
Instead, there is a method that "aligns the training-time output with the inference-time output": the training output is made "darker" by dividing it by $(1 - ratio)$.
\cfrac{1}{1 - ratio}\, \mathbb{E}[\hat{y}] = \cfrac{1}{1 - ratio} \left\{ \underbrace{(1 - ratio)\, y}_{\text{expected value of active neurons}} + \underbrace{ratio \times 0}_{\text{expected value of dropped neurons}} \right\} = y
By doing this, the expected value of the output is the same at training time and at inference time, so the inference output does not need to be touched at all. This method of adjusting the output during training is called the **inverted dropout** method, in contrast to the plain dropout method.
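As a quick sanity check, here is a minimal NumPy sketch (not code from this series) showing that averaging many inverted-dropout outputs recovers the plain output $y$, i.e. the training-time expected value matches the inference-time output:

```python
import numpy as np

rng = np.random.default_rng(0)
ratio = 0.25
y = rng.random(10)                 # output of some layer before dropout
n_trials = 100_000

mean_out = np.zeros_like(y)
for _ in range(n_trials):
    # inverted dropout: drop with probability ratio, rescale the survivors
    mask = np.where(rng.random(y.shape) >= ratio, 1, 0)
    mean_out += y * mask / (1 - ratio)
mean_out /= n_trials

print(np.allclose(mean_out, y, rtol=1e-2))   # True: E[y_hat] matches y
```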
## Implementation of dropout layer

Now, let's implement the dropout layer using the inverted dropout method.
dropout.py

```python
class Dropout(BaseLayer):
    def __init__(self, *args,
                 mode="cpu", ratio=0.25,
                 prev=1, n=None, **kwds):
        if n is not None:
            raise KeyError("'n' must not be specified.")
        super().__init__(*args, mode=mode, **kwds)

        self.ratio = ratio
        self.mask = self.calculator.zeros(prev)
        self.prev = prev
        self.n = prev

    def forward(self, x, *args, train_flag=True, **kwds):
        if train_flag:
            # Draw a uniform random number per neuron; a neuron survives
            # with probability (1 - ratio).
            self.mask = self.calculator.random.rand(self.prev)
            self.mask = self.calculator.where(self.mask >= self.ratio, 1, 0)
            return x*self.mask/(1 - self.ratio)
        else:
            return x

    def backward(self, grad, *args, **kwds):
        return grad*self.mask/(1 - self.ratio)

    def update(self, *args, **kwds):
        pass
```
The implementation is simple, isn't it? The number of output neurons must match that of the previous layer, so specifying `n` is rejected (raises an error) at initialization.
In forward propagation, during training the variable `mask` randomly selects the neurons to drop out, and the inverted dropout is realized by dividing the output by $(1 - ratio)$. At inference time the input is simply passed through unchanged.
Backpropagation is only used during training, so there is no need to branch the processing as in forward propagation. The same `mask` as in forward propagation is applied as an element-wise product so that only the active neurons propagate gradients backward, and the result is likewise divided by $(1 - ratio)$. The dropout layer has no parameters to learn, so `update` simply does nothing.
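To see the forward/backward behaviour in isolation, here is a minimal self-contained sketch in plain NumPy (the `BaseLayer`/`calculator` plumbing is omitted, so the names here are mine, not the framework's):

```python
import numpy as np

rng = np.random.default_rng(42)
ratio = 0.25
x = rng.random((2, 5))        # a mini-batch of activations (2 samples, 5 neurons)
grad = np.ones_like(x)        # dummy gradient arriving from the next layer

# forward (training): zero out dropped neurons and rescale the survivors
mask = np.where(rng.random(5) >= ratio, 1, 0)
y_train = x * mask / (1 - ratio)

# forward (inference): pass the input through unchanged
y_test = x

# backward: the same mask blocks the gradients of the dropped neurons
dx = grad * mask / (1 - ratio)

print(mask)       # e.g. [1 0 1 1 1]
print(y_train)    # dropped columns are zero, the rest are scaled by 1/(1 - ratio)
print(dx)
```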
Also, to make the dropout layer usable, add it to the `_TypeManager` class, and add a `train_flag` argument to the `forward` function of the `Trainer` class, which is used both for the error calculation in the `training` function and in the `predict` function.
type_manager.py

```python
class _TypeManager():
    """
    Manager class for layer types
    """
    N_TYPE = 5  # Number of layer types

    BASE = -1
    MIDDLE = 0   # Middle layer numbering
    OUTPUT = 1   # Output layer numbering
    DROPOUT = 2  # Dropout layer numbering
    CONV = 3     # Convolutional layer numbering
    POOL = 4     # Pooling layer numbering

    REGULATED_DIC = {"Middle": MiddleLayer,
                     "Output": OutputLayer,
                     "Dropout": Dropout,
                     "Conv": ConvLayer,
                     "Pool": PoolingLayer,
                     "BaseLayer": None}

    @property
    def reg_keys(self):
        return list(self.REGULATED_DIC.keys())

    def name_rule(self, name):
        name = name.lower()
        if "middle" in name or name == "mid" or name == "m":
            name = self.reg_keys[self.MIDDLE]
        elif "output" in name or name == "out" or name == "o":
            name = self.reg_keys[self.OUTPUT]
        elif "dropout" in name or name == "drop" or name == "d":
            name = self.reg_keys[self.DROPOUT]
        elif "conv" in name or name == "c":
            name = self.reg_keys[self.CONV]
        elif "pool" in name or name == "p":
            name = self.reg_keys[self.POOL]
        else:
            raise UndefinedLayerError(name)
        return name
```
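As a small usage note (assuming the framework's layer classes referenced in `REGULATED_DIC` are already defined, as in the earlier articles), the aliases resolved by `name_rule` are what later let the dropout layer be appended with the short name `"d"`:

```python
# hypothetical interactive check; requires the framework classes to be defined
tm = _TypeManager()
print(tm.name_rule("d"))        # -> "Dropout"
print(tm.name_rule("drop"))     # -> "Dropout"
print(tm.name_rule("dropout"))  # -> "Dropout"
```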
trainer.py

```python
import time

import matplotlib.pyplot as plt
import matplotlib.animation as animation


softmax = type(get_act("softmax"))
sigmoid = type(get_act("sigmoid"))


class Trainer(Switch):
    def __init__(self, x, y, *args, mode="cpu", **kwds):
        # Check GPU availability
        if mode not in ["cpu", "gpu"]:
            raise KeyError("'mode' must be selected from {} ".format(["cpu", "gpu"])
                           + "but you specified '{}'.".format(mode))
        self.mode = mode.lower()
        super().__init__(*args, mode=self.mode, **kwds)

        self.x_train, self.x_test = x
        self.y_train, self.y_test = y
        self.x_train = self.calculator.asarray(self.x_train)
        self.x_test = self.calculator.asarray(self.x_test)
        self.y_train = self.calculator.asarray(self.y_train)
        self.y_test = self.calculator.asarray(self.y_test)

        self.make_anim = False

    def forward(self, x, train_flag=True, lim_memory=10):
        def propagate(x, train_flag=True):
            x_in = x
            n_batch = x.shape[0]

            switch = True
            for ll in self.layer_list:
                if switch and not self.is_CNN(ll.name):
                    x_in = x_in.reshape(n_batch, -1)
                    switch = False
                x_in = ll.forward(x_in, train_flag=train_flag)

        # Because the forward propagation method is also used for error
        # calculation and for predicting unknown data, it can consume a
        # lot of memory.
        if self.calculator.prod(
            self.calculator.asarray(x.shape))*8/2**20 >= lim_memory:
            # If more than lim_memory MB (default 10 MB = 10*2**20 bytes at
            # 8 bytes per double-precision float) would be used,
            # split the input into chunks of at most 5 MB.
            n_batch = int(5*2**20/(8*self.calculator.prod(
                self.calculator.asarray(x.shape[1:]))))
            y = self.calculator.zeros((x.shape[0], lm[-1].n))
            n_loop = int(self.calculator.ceil(x.shape[0]/n_batch))
            for i in range(n_loop):
                propagate(x[i*n_batch : (i+1)*n_batch], train_flag=train_flag)
                y[i*n_batch : (i+1)*n_batch] = lm[-1].y.copy()
            lm[-1].y = y
        else:
            # Otherwise run normally
            propagate(x, train_flag=train_flag)

    ・
    ・
    ・

    def training(self, epoch, n_batch=16, threshold=1e-8,
                 show_error=True, show_train_error=False, **kwds):
        if show_error:
            self.error_list = []
        if show_train_error:
            self.train_error_list = []
        if self.make_anim:
            self.images = []
        self.n_batch = n_batch

        n_train = self.x_train.shape[0]//n_batch
        n_test = self.x_test.shape[0]

        # Start learning
        if self.mode == "gpu":
            cp.cuda.Stream.null.synchronize()
        start_time = time.time()
        lap_time = -1
        error = 0
        error_prev = 0
        rand_index = self.calculator.arange(self.x_train.shape[0])
        for t in range(1, epoch+1):
            # Scene creation
            if self.make_anim:
                self.make_scene(t, epoch)

            # Training error calculation
            if show_train_error:
                self.forward(self.x_train[rand_index[:n_test]],
                             train_flag=False)
                error = lm[-1].get_error(self.y_train[rand_index[:n_test]])
                self.train_error_list.append(error)

            # Test error calculation
            self.forward(self.x_test, train_flag=False)
            error = lm[-1].get_error(self.y_test)
            if show_error:
                self.error_list.append(error)

            ・
            ・
            ・

    def predict(self, x=None, y=None, threshold=0.5):
        if x is None:
            x = self.x_test
        if y is None:
            y = self.y_test

        self.forward(x, train_flag=False)
        self.y_pred = self.pred_func(self[-1].y, threshold=threshold)
        y = self.pred_func(y, threshold=threshold)
        print("correct:", y[:min(16, int(y.shape[0]*0.1))])
        print("predict:", self.y_pred[:min(16, int(y.shape[0]*0.1))])
        print("accuracy rate:",
              100*self.calculator.sum(self.y_pred == y,
                                      dtype=int)/y.shape[0], "%",
              "({}/{})".format(self.calculator.sum(self.y_pred == y, dtype=int),
                               y.shape[0]))

        if self.mode == "cpu":
            return self.y_pred
        elif self.mode == "gpu":
            return self.y_pred.get()
```
## Experiment

Let's experiment. Note, however, that training on the MNIST dataset does not cause much overfitting, so the effect may look weak. The experiment is run on Google Colaboratory. I use Keras's MNIST dataset and run in GPU mode, but 200 epochs still take about 20 minutes. The code can be executed as it is by opening it from GitHub in Google Colaboratory.
test.py

```python
%matplotlib inline
# Build the network: convolution, pooling, middle, dropout and output layers
M, F_h, F_w = 10, 3, 3
lm = LayerManager((x_train, x_test), (t_train, t_test), mode="gpu")
lm.append(name="c", I_shape=(C, I_h, I_w), F_shape=(M, F_h, F_w), pad=1)
lm.append(name="p", I_shape=lm[-1].O_shape, pool=2)
lm.append(name="m", n=100, opt="eve")
lm.append(name="d", ratio=0.5)
lm.append(name="o", n=n_class, act="softmax", err_func="Cross")

# Train
epoch = 200
threshold = 1e-8
n_batch = 128
lm.training(epoch, threshold=threshold, n_batch=n_batch, show_train_error=True)

# Predict
print("training dataset")
_ = lm.predict(x=lm.x_train, y=lm.y_train)
print("test dataset")
y_pred = lm.predict()
```
Illustrating the experimental results takes a little extra work. First, execute the test code cell without the dropout layer, and then run the following code prepared in another cell.
get_error.py

```python
err_list = lm.error_list
```
Next, execute the test code cell with the dropout layer, and execute the following code prepared in another cell.
get_drop_error.py

```python
drop_error_list = lm.error_list
```
After the above setup, prepare the following code in another cell and execute it.
plot.py

```python
fig, ax = plt.subplots(1)
fig.suptitle("error comparison")
ax.set_xlabel("epoch")
ax.set_ylabel("error")
ax.set_yscale("log")
ax.grid()
ax.plot(drop_error_list, label="dropout error")
ax.plot(err_list, label="normal error")
ax.legend(loc="best")
```
The comparison can now be displayed. This is rather tedious, though, so I would like to think about an implementation that makes this kind of comparative verification easier to illustrate...
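As one possible direction, a small helper along the following lines could take the error lists directly. This is only a rough sketch of my own, not part of the series' code; the function name and interface are made up.

```python
import matplotlib.pyplot as plt


def compare_errors(error_lists, title="error comparison"):
    """Plot several error histories on one log-scale axis.

    error_lists: dict mapping a label to a list of per-epoch errors,
                 e.g. {"normal error": err_list, "dropout error": drop_error_list}.
    """
    fig, ax = plt.subplots(1)
    fig.suptitle(title)
    ax.set_xlabel("epoch")
    ax.set_ylabel("error")
    ax.set_yscale("log")
    ax.grid()
    for label, errors in error_lists.items():
        ax.plot(errors, label=label)
    ax.legend(loc="best")
    return fig, ax


# usage, after the two training runs:
# compare_errors({"normal error": err_list, "dropout error": drop_error_list})
```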