The previous article is here. In this article, we will look at dropout, one of the standard methods for suppressing overfitting. Although it is a simple method, the fact that it has been in continuous use since it was proposed suggests how effective it is. That said, there still seems to be no theoretical explanation of why it suppresses overfitting (perhaps simply a lack of research...), although many possible reasons have been proposed.
- [What is a dropout](#what-is-a-dropout)
- [Relationship with ensemble learning](#relationship-with-ensemble-learning)
- [View implementation from theory](#view-implementation-from-theory)
- [Implementation of dropout layer](#implementation-of-dropout-layer)
- Experiment
## What is a dropout

Dropout was proposed in 2012 as a method of suppressing overfitting and was adopted by the famous **AlexNet**. The outline is simply: "during training, shut out the output of each neuron in the fully connected layers with a certain probability $ratio$." It is surprising that this alone suppresses overfitting. There does not seem to be a settled theoretical explanation of why it works, but there are several hypotheses, one of which is ensemble learning.
## Relationship with ensemble learning

Ensemble learning is a technique that achieves high accuracy by combining multiple weak learners. Dropout is particularly related to the bagging technique; for more details, see [here](https://qiita.com/kuroitu/items/57425380546f7b9ed91c#%E3%82%A2%E3%83%B3%E3%82%B5%E3%83%B3%E3%83%96%E3%83%AB%E5%AD%A6%E7%BF%92). In any case, since dropout effectively trains multiple models at the same time, it resembles a kind of bagging: the set of deactivated neurons changes on every training step, so each mask pattern corresponds to training a different model.
From this, dropout can be thought of as performing ensemble learning with multiple learners in a simulated way. One characteristic of bagging is that the resulting model has higher bias and lower variance, so while it fits the training data to some extent, it does not fit it perfectly. This is thought to be why overfitting is suppressed.
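As a rough illustration of this view, here is a minimal NumPy sketch of my own (not code from this series): each training step draws a fresh random mask, and each distinct mask corresponds to a different sub-network being trained.

```python
import numpy as np

rng = np.random.default_rng(0)
ratio = 0.5
n_neurons = 4

# Each training step draws a fresh mask; each distinct mask is a different sub-network.
masks = {tuple(np.where(rng.random(n_neurons) >= ratio, 1, 0)) for _ in range(10)}
print(len(masks), "distinct sub-networks sampled in 10 steps")
for m in masks:
    print(m)
```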
## View implementation from theory

Now let's take a brief look at the implementation from the theory side. As mentioned earlier, the dropout layer is easy to implement because it only "shuts out the output of each neuron in the fully connected layers with a certain probability $ratio$ during training." However, as sharp readers may have noticed, the key phrase is "**during training**." So what happens when training is finished and inference begins?
No dropout is applied during inference, so all neurons remain active. As you can imagine, this means the output "density" differs between training and inference. One way to resolve this is to multiply the output by $(1 - ratio)$ at inference time.
Let's look at the formulas. Writing $y$ for the output before applying dropout and $\hat{y}$ for the output after applying it, the expected value of the output during training is
\mathbb{E}[\hat{y}] = \underbrace{(1 - ratio)\, y}_{\text{expected value of active neurons}} + \underbrace{ratio \times 0}_{\text{expected value of dropped neurons}} = (1 - ratio)\, y
At inference time, on the other hand, the dropout rate $ratio$ is 0, so the expected value of the output becomes
\mathbb{E}[\hat{y}] = \underbrace{(1 - 0)\, y}_{\text{expected value of active neurons}} + \underbrace{0 \times 0}_{\text{expected value of dropped neurons}} = y
That is, the inference output is $\frac{1}{1 - ratio}$ times "darker" than the training output (note that $0 \le ratio \lt 1$ here). The idea is to eliminate this mismatch by multiplying the inference output by $(1 - ratio)$ to correct for this "darkness":
(1 - ratio)\, \mathbb{E}[\hat{y}] = (1 - ratio) \left\{ \underbrace{(1 - 0)\, y}_{\text{expected value of active neurons}} + \underbrace{0 \times 0}_{\text{expected value of dropped neurons}} \right\} = (1 - ratio)\, y
However, this is an easy but risky method. Of course, training proceeds without any problem as it is, and inference also works. The risk is that this approach "changes the output at inference time." In many cases this is harmless, but since the output of the inference phase is used to evaluate the model's accuracy, it is better not to touch it.
Instead, there is a method that "aligns the training-time output with the inference-time output": the training output is made "darker" by dividing it by $(1 - ratio)$.
\cfrac{1}{1 - ratio}\, \mathbb{E}[\hat{y}] = \cfrac{1}{1 - ratio} \left\{ \underbrace{(1 - ratio)\, y}_{\text{expected value of active neurons}} + \underbrace{ratio \times 0}_{\text{expected value of dropped neurons}} \right\} = y
By doing this, the expected value of the output is the same at training time and at inference time, so the inference output does not need to be touched at all. This method of adjusting the output during training is called the **inverted dropout** method, in contrast to the plain dropout method.
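As a quick sanity check, here is a minimal NumPy sketch (not code from this series) showing that averaging many inverted-dropout outputs recovers the plain output $y$, i.e. the training-time expected value matches the inference-time output:

```python
import numpy as np

rng = np.random.default_rng(0)
ratio = 0.25
y = rng.random(10)                 # output of some layer before dropout
n_trials = 100_000

mean_out = np.zeros_like(y)
for _ in range(n_trials):
    # inverted dropout: drop with probability ratio, rescale the survivors
    mask = np.where(rng.random(y.shape) >= ratio, 1, 0)
    mean_out += y * mask / (1 - ratio)
mean_out /= n_trials

print(np.allclose(mean_out, y, rtol=1e-2))   # True: E[y_hat] matches y
```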
## Implementation of dropout layer

Now, let's implement the dropout layer using the inverted dropout method.
dropout.py

```python
class Dropout(BaseLayer):
    def __init__(self, *args,
                 mode="cpu", ratio=0.25,
                 prev=1, n=None, **kwds):
        if n is not None:
            raise KeyError("'n' must not be specified.")
        super().__init__(*args, mode=mode, **kwds)

        self.ratio = ratio
        self.mask = self.calculator.zeros(prev)
        self.prev = prev
        self.n = prev

    def forward(self, x, *args, train_flag=True, **kwds):
        if train_flag:
            # Draw a uniform random number per neuron; a neuron survives
            # with probability (1 - ratio).
            self.mask = self.calculator.random.rand(self.prev)
            self.mask = self.calculator.where(self.mask >= self.ratio, 1, 0)
            return x*self.mask/(1 - self.ratio)
        else:
            return x

    def backward(self, grad, *args, **kwds):
        return grad*self.mask/(1 - self.ratio)

    def update(self, *args, **kwds):
        pass
```
The implementation is simple, isn't it? The number of output neurons must match that of the previous layer, so specifying `n` is rejected (raises an error) at initialization.
In forward propagation, during training the variable `mask` randomly selects the neurons to drop out, and the inverted dropout is realized by dividing the output by $(1 - ratio)$. At inference time the input is simply passed through unchanged.
Backpropagation is only used during training, so there is no need to branch the processing as in forward propagation. The same `mask` as in forward propagation is applied as an element-wise product so that only the active neurons propagate gradients backward, and the result is likewise divided by $(1 - ratio)$. The dropout layer has no parameters to learn, so `update` simply does nothing.
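To see the forward/backward behaviour in isolation, here is a minimal self-contained sketch in plain NumPy (the `BaseLayer`/`calculator` plumbing is omitted, so the names here are mine, not the framework's):

```python
import numpy as np

rng = np.random.default_rng(42)
ratio = 0.25
x = rng.random((2, 5))        # a mini-batch of activations (2 samples, 5 neurons)
grad = np.ones_like(x)        # dummy gradient arriving from the next layer

# forward (training): zero out dropped neurons and rescale the survivors
mask = np.where(rng.random(5) >= ratio, 1, 0)
y_train = x * mask / (1 - ratio)

# forward (inference): pass the input through unchanged
y_test = x

# backward: the same mask blocks the gradients of the dropped neurons
dx = grad * mask / (1 - ratio)

print(mask)       # e.g. [1 0 1 1 1]
print(y_train)    # dropped columns are zero, the rest are scaled by 1/(1 - ratio)
print(dx)
```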
Also, to make the dropout layer usable, add it to the `_TypeManager` class, and add a `train_flag` argument to the `forward` function of the `Trainer` class, which is used both for the error calculation in the `training` function and in the `predict` function.
type_manager.py

```python
class _TypeManager():
    """
    Manager class for layer types
    """
    N_TYPE = 5  # Number of layer types

    BASE = -1
    MIDDLE = 0   # Middle layer numbering
    OUTPUT = 1   # Output layer numbering
    DROPOUT = 2  # Dropout layer numbering
    CONV = 3     # Convolutional layer numbering
    POOL = 4     # Pooling layer numbering

    REGULATED_DIC = {"Middle": MiddleLayer,
                     "Output": OutputLayer,
                     "Dropout": Dropout,
                     "Conv": ConvLayer,
                     "Pool": PoolingLayer,
                     "BaseLayer": None}

    @property
    def reg_keys(self):
        return list(self.REGULATED_DIC.keys())

    def name_rule(self, name):
        name = name.lower()
        if "middle" in name or name == "mid" or name == "m":
            name = self.reg_keys[self.MIDDLE]
        elif "output" in name or name == "out" or name == "o":
            name = self.reg_keys[self.OUTPUT]
        elif "dropout" in name or name == "drop" or name == "d":
            name = self.reg_keys[self.DROPOUT]
        elif "conv" in name or name == "c":
            name = self.reg_keys[self.CONV]
        elif "pool" in name or name == "p":
            name = self.reg_keys[self.POOL]
        else:
            raise UndefinedLayerError(name)
        return name
```
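As a small usage note (assuming the framework's layer classes referenced in `REGULATED_DIC` are already defined, as in the earlier articles), the aliases resolved by `name_rule` are what later let the dropout layer be appended with the short name `"d"`:

```python
# hypothetical interactive check; requires the framework classes to be defined
tm = _TypeManager()
print(tm.name_rule("d"))        # -> "Dropout"
print(tm.name_rule("drop"))     # -> "Dropout"
print(tm.name_rule("dropout"))  # -> "Dropout"
```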
trainer.py

```python
import time

import matplotlib.pyplot as plt
import matplotlib.animation as animation


softmax = type(get_act("softmax"))
sigmoid = type(get_act("sigmoid"))


class Trainer(Switch):
    def __init__(self, x, y, *args, mode="cpu", **kwds):
        # Check GPU availability
        if mode not in ["cpu", "gpu"]:
            raise KeyError("'mode' must be selected from {} ".format(["cpu", "gpu"])
                           + "but you specified '{}'.".format(mode))
        self.mode = mode.lower()
        super().__init__(*args, mode=self.mode, **kwds)

        self.x_train, self.x_test = x
        self.y_train, self.y_test = y
        self.x_train = self.calculator.asarray(self.x_train)
        self.x_test = self.calculator.asarray(self.x_test)
        self.y_train = self.calculator.asarray(self.y_train)
        self.y_test = self.calculator.asarray(self.y_test)

        self.make_anim = False

    def forward(self, x, train_flag=True, lim_memory=10):
        def propagate(x, train_flag=True):
            x_in = x
            n_batch = x.shape[0]

            switch = True
            for ll in self.layer_list:
                if switch and not self.is_CNN(ll.name):
                    x_in = x_in.reshape(n_batch, -1)
                    switch = False
                x_in = ll.forward(x_in, train_flag=train_flag)

        # Because the forward propagation method is also used for error
        # calculation and for predicting unknown data, it can consume a
        # lot of memory.
        if self.calculator.prod(
            self.calculator.asarray(x.shape))*8/2**20 >= lim_memory:
            # If more than lim_memory MB (default 10 MB = 10*2**20 bytes at
            # 8 bytes per double-precision float) would be used,
            # split the input into chunks of at most 5 MB.
            n_batch = int(5*2**20/(8*self.calculator.prod(
                self.calculator.asarray(x.shape[1:]))))
            y = self.calculator.zeros((x.shape[0], lm[-1].n))
            n_loop = int(self.calculator.ceil(x.shape[0]/n_batch))
            for i in range(n_loop):
                propagate(x[i*n_batch : (i+1)*n_batch], train_flag=train_flag)
                y[i*n_batch : (i+1)*n_batch] = lm[-1].y.copy()
            lm[-1].y = y
        else:
            # Otherwise run normally
            propagate(x, train_flag=train_flag)

    ・
    ・
    ・

    def training(self, epoch, n_batch=16, threshold=1e-8,
                 show_error=True, show_train_error=False, **kwds):
        if show_error:
            self.error_list = []
        if show_train_error:
            self.train_error_list = []
        if self.make_anim:
            self.images = []
        self.n_batch = n_batch

        n_train = self.x_train.shape[0]//n_batch
        n_test = self.x_test.shape[0]

        # Start learning
        if self.mode == "gpu":
            cp.cuda.Stream.null.synchronize()
        start_time = time.time()
        lap_time = -1
        error = 0
        error_prev = 0
        rand_index = self.calculator.arange(self.x_train.shape[0])
        for t in range(1, epoch+1):
            # Scene creation
            if self.make_anim:
                self.make_scene(t, epoch)

            # Training error calculation
            if show_train_error:
                self.forward(self.x_train[rand_index[:n_test]],
                             train_flag=False)
                error = lm[-1].get_error(self.y_train[rand_index[:n_test]])
                self.train_error_list.append(error)

            # Test error calculation
            self.forward(self.x_test, train_flag=False)
            error = lm[-1].get_error(self.y_test)
            if show_error:
                self.error_list.append(error)

            ・
            ・
            ・

    def predict(self, x=None, y=None, threshold=0.5):
        if x is None:
            x = self.x_test
        if y is None:
            y = self.y_test

        self.forward(x, train_flag=False)
        self.y_pred = self.pred_func(self[-1].y, threshold=threshold)
        y = self.pred_func(y, threshold=threshold)
        print("correct:", y[:min(16, int(y.shape[0]*0.1))])
        print("predict:", self.y_pred[:min(16, int(y.shape[0]*0.1))])
        print("accuracy rate:",
              100*self.calculator.sum(self.y_pred == y,
                                      dtype=int)/y.shape[0], "%",
              "({}/{})".format(self.calculator.sum(self.y_pred == y, dtype=int),
                               y.shape[0]))

        if self.mode == "cpu":
            return self.y_pred
        elif self.mode == "gpu":
            return self.y_pred.get()
```
## Experiment

Let's experiment. Note, however, that training on the MNIST dataset does not cause much overfitting, so the effect may look weak. The experiment is run on Google Colaboratory. I use Keras's MNIST dataset and run in GPU mode, but 200 epochs still take about 20 minutes. The code can be executed as it is by opening it from GitHub in Google Colaboratory.
test.py

```python
%matplotlib inline
# Build the network: convolution, pooling, middle, dropout and output layers
M, F_h, F_w = 10, 3, 3
lm = LayerManager((x_train, x_test), (t_train, t_test), mode="gpu")
lm.append(name="c", I_shape=(C, I_h, I_w), F_shape=(M, F_h, F_w), pad=1)
lm.append(name="p", I_shape=lm[-1].O_shape, pool=2)
lm.append(name="m", n=100, opt="eve")
lm.append(name="d", ratio=0.5)
lm.append(name="o", n=n_class, act="softmax", err_func="Cross")

# Train
epoch = 200
threshold = 1e-8
n_batch = 128
lm.training(epoch, threshold=threshold, n_batch=n_batch, show_train_error=True)

# Predict
print("training dataset")
_ = lm.predict(x=lm.x_train, y=lm.y_train)
print("test dataset")
y_pred = lm.predict()
```
Illustrating the experimental results takes a little extra work. First, execute the test code cell without the dropout layer, and then run the following code prepared in another cell.
get_error.py

```python
err_list = lm.error_list
```
Next, execute the test code cell with the dropout layer, and execute the following code prepared in another cell.
get_drop_error.py

```python
drop_error_list = lm.error_list
```
After the above setup, prepare the following code in another cell and execute it.
plot.py

```python
fig, ax = plt.subplots(1)
fig.suptitle("error comparison")
ax.set_xlabel("epoch")
ax.set_ylabel("error")
ax.set_yscale("log")
ax.grid()
ax.plot(drop_error_list, label="dropout error")
ax.plot(err_list, label="normal error")
ax.legend(loc="best")
```
The comparison can now be displayed. This is rather tedious, though, so I would like to think about an implementation that makes this kind of comparative verification easier to illustrate...
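As one possible direction, a small helper along the following lines could take the error lists directly. This is only a rough sketch of my own, not part of the series' code; the function name and interface are made up.

```python
import matplotlib.pyplot as plt


def compare_errors(error_lists, title="error comparison"):
    """Plot several error histories on one log-scale axis.

    error_lists: dict mapping a label to a list of per-epoch errors,
                 e.g. {"normal error": err_list, "dropout error": drop_error_list}.
    """
    fig, ax = plt.subplots(1)
    fig.suptitle(title)
    ax.set_xlabel("epoch")
    ax.set_ylabel("error")
    ax.set_yscale("log")
    ax.grid()
    for label, errors in error_lists.items():
        ax.plot(errors, label=label)
    ax.legend(loc="best")
    return fig, ax


# usage, after the two training runs:
# compare_errors({"normal error": err_list, "dropout error": drop_error_list})
```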