I wanted to try a deep neural network such as a Residual Network. I chose the CIFAR-10 image dataset for the classification task because MNIST is too easy, while ImageNet is hard to collect and takes a long time to train on.
The CIFAR-10 dataset is a small collection of color images (60,000 32 x 32 px images in 10 classes, split into 50,000 training and 10,000 test images): https://www.cs.toronto.edu/~kriz/cifar.html
I ran everything in the following environment.
The source code is available at https://github.com/dsanno/chainer-cifar. The commands below assume that you have cloned this repository and are at the root of the source tree.
It has the following features.
The classification error rates I measured are as follows.
You can download the dataset from the "CIFAR-10 python version" link at https://www.cs.toronto.edu/~kriz/cifar.html, or with the following command. The dataset file is 166MB, so the download takes a while on a slow connection.
$ python src/download.py
Unzip the downloaded dataset and you will find the following files.
Each file such as data_batch_1 is a pickled dict: the raw image data is stored under 'data' and the label information under 'labels'. Here is a sample session that inspects data_batch_1 and saves the first 100 images as a single tiled image.
$ python
>>> import cPickle as pickle
>>> f = open('dataset/cifar-10-batches-py/data_batch_1', 'rb')
>>> train_data = pickle.load(f)
>>> f.close()
>>> type(train_data['data'])
<type 'numpy.ndarray'>
>>> train_data['data'].shape #Get the shape of the raw data
(10000L, 3072L)
>>> train_data['data'][:5] #Get the first 5 raw data
array([[ 59, 43, 50, ..., 140, 84, 72],
[154, 126, 105, ..., 139, 142, 144],
[255, 253, 253, ..., 83, 83, 84],
[ 28, 37, 38, ..., 28, 37, 46],
[170, 168, 177, ..., 82, 78, 80]], dtype=uint8)
>>> type(train_data['labels'])
<type 'list'>
>>> train_data['labels'][:10] #Get the first 10 label data
[6, 9, 9, 4, 1, 1, 2, 7, 8, 3]
>>> from PIL import Image
>>> sample_image = train_data['data'][:100].reshape((10, 10, 3, 32, 32)).transpose((0, 3, 1, 4, 2)).reshape((320, 320, 3)) # Arrange the first 100 images in a 10 x 10 tile
>>> Image.fromarray(sample_image).save('sample.png')
This produces the following image.
Execute the following command to generate datasets with three kinds of preprocessing.
$ python src/dataset.py
For "Average value of image", we used the average value of the RGB values of the entire training data regardless of RGB. Contrast Normalization equalizes the contrast by subtracting the average value of each image from the RGB value and then multiplying by a constant so that the standard deviation becomes 1. I don't really understand ZCA Whitening, but Toki no Mori Wiki Says that "transformation so that the covariance matrix of data becomes an identity matrix" is performed. The specific calculation of Whitening is described in detail in Sunfish Diary "CIFAR-10 and ZCA whitening".
The image after Contrast Normalization + ZCA Whitening looks as follows. Because the resulting RGB values have a narrow distribution, each image is rescaled for display so that its RGB values spread over the 0 to 255 range.
The following augmentations are applied during each training run. At test time no augmentation is applied; the preprocessed test data is used as is.
The code for the augmentation part is as follows.
import numpy as np
import six

# (Omission)

def __trans_image(self, x):
    # Randomly translate each image by up to 4px and mirror it with probability 0.5.
    size = 32
    n = x.shape[0]
    images = np.zeros((n, 3, size, size), dtype=np.float32)
    offset = np.random.randint(-4, 5, size=(n, 2))
    mirror = np.random.randint(2, size=n)
    for i in six.moves.range(n):
        image = x[i]
        top, left = offset[i]
        left = max(0, left)
        top = max(0, top)
        right = min(size, left + size)
        bottom = min(size, top + size)
        if mirror[i] > 0:
            # Horizontal flip combined with the translated crop
            images[i, :, size-bottom:size-top, size-right:size-left] = image[:, top:bottom, left:right][:, :, ::-1]
        else:
            images[i, :, size-bottom:size-top, size-right:size-left] = image[:, top:bottom, left:right]
    return images
Let's first train a relatively shallow network like the one below. It uses a structure similar to the network in the TensorFlow tutorial (https://www.tensorflow.org/versions/r0.9/tutorials/deep_cnn/index.html), though not exactly the same; the layer composition and initial parameter values differ. Three blocks of convolution (CNN) + ReLU + max pooling are stacked, followed by two fully connected layers. Dropout is applied in front of each fully connected layer to prevent overfitting.
import chainer
import chainer.functions as F
import chainer.links as L

class CNN(chainer.Chain):
    def __init__(self):
        super(CNN, self).__init__(
            conv1=L.Convolution2D(3, 64, 5, stride=1, pad=2),
            conv2=L.Convolution2D(64, 64, 5, stride=1, pad=2),
            conv3=L.Convolution2D(64, 128, 5, stride=1, pad=2),
            l1=L.Linear(4 * 4 * 128, 1000),
            l2=L.Linear(1000, 10),
        )

    def __call__(self, x, train=True):
        # Three conv + ReLU + max pooling blocks: 32x32 -> 16x16 -> 8x8 -> 4x4
        h1 = F.max_pooling_2d(F.relu(self.conv1(x)), 3, 2)
        h2 = F.max_pooling_2d(F.relu(self.conv2(h1)), 3, 2)
        h3 = F.max_pooling_2d(F.relu(self.conv3(h2)), 3, 2)
        # Dropout before each fully connected layer to reduce overfitting
        h4 = F.relu(self.l1(F.dropout(h3, train=train)))
        return self.l2(F.dropout(h4, train=train))
Run training with the following command. It took about 40 minutes to run.
$ python src/train.py -g 0 -m cnn -b 128 -p cnn --optimizer adam --iter 300 --lr_decay_iter 100
The meanings of the options are as follows.
The error curve looks like this, with a test error rate of 18.94%. Looking at the curve, you can see that when the learning rate is lowered after training for a while, learning suddenly makes rapid progress again. Scheduling the learning rate to drop after a fixed number of iterations like this is a commonly used technique.
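As an illustration of such a schedule, here is a minimal sketch of step-wise learning-rate decay (my own code, not the exact logic of train.py; the function and parameter names are assumptions):

def decay_learning_rate(optimizer, epoch, lr_decay_iter=100, rate=0.1):
    # Every lr_decay_iter epochs, multiply the optimizer's step size by `rate`.
    if epoch > 0 and epoch % lr_decay_iter == 0:
        if hasattr(optimizer, 'lr'):       # SGD / MomentumSGD expose `lr`
            optimizer.lr *= rate
        elif hasattr(optimizer, 'alpha'):  # Adam uses `alpha` as its step size
            optimizer.alpha *= rate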
Next, train on the dataset that was preprocessed with Contrast Normalization + ZCA Whitening. Run training with the following command. It took about 40 minutes to run.
$ python src/train.py -g 0 -m cnn -b 128 -p cnn_zca --optimizer adam --iter 300 --lr_decay_iter 100 -d dataset/image_norm_zca.pkl
"-d dataset / image_norm_zca.pkl" specifies the dataset with Contrast Normalization + ZCA Whitening.
The error curve looks like this, with a test error rate of 18.76%. It is better than simply subtracting the mean, but only marginally.
Batch Normalization normalizes the output of a layer so that, within each mini-batch, it has mean 0 and variance 1.
The aim is to make the following layers easier to train.
The algorithm is described in detail in this article.
In Chainer, you can use chainer.links.BatchNormalization to apply Batch Normalization.
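As a small illustration (my own example, assuming the Chainer v1 API used throughout this article), chainer.links.BatchNormalization can be applied to a batch of feature maps like this:

import numpy as np
import chainer
import chainer.links as L

bn = L.BatchNormalization(16)  # one learned scale/shift pair (gamma, beta) per channel
x = chainer.Variable(np.random.randn(8, 16, 32, 32).astype(np.float32))
y = bn(x)  # during training, each channel is normalized over the mini-batch,
           # then scaled by gamma and shifted by beta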
The network code used this time is shown below. The configuration is almost the same as before; the only difference is the addition of Batch Normalization.
class BatchConv2D(chainer.Chain):
    def __init__(self, ch_in, ch_out, ksize, stride=1, pad=0, activation=F.relu):
        super(BatchConv2D, self).__init__(
            conv=L.Convolution2D(ch_in, ch_out, ksize, stride, pad),
            bn=L.BatchNormalization(ch_out),
        )
        self.activation = activation

    def __call__(self, x, train):
        # Convolution -> Batch Normalization -> (optional) activation
        h = self.bn(self.conv(x), test=not train)
        if self.activation is None:
            return h
        return self.activation(h)

class CNNBN(chainer.Chain):
    def __init__(self):
        super(CNNBN, self).__init__(
            bconv1=BatchConv2D(3, 64, 5, stride=1, pad=2),
            bconv2=BatchConv2D(64, 64, 5, stride=1, pad=2),
            bconv3=BatchConv2D(64, 128, 5, stride=1, pad=2),
            l1=L.Linear(4 * 4 * 128, 1000),
            l2=L.Linear(1000, 10),
        )

    def __call__(self, x, train=True):
        # Same structure as CNN above, with Batch Normalization after each convolution
        h1 = F.max_pooling_2d(self.bconv1(x, train), 3, 2)
        h2 = F.max_pooling_2d(self.bconv2(h1, train), 3, 2)
        h3 = F.max_pooling_2d(self.bconv3(h2, train), 3, 2)
        h4 = F.relu(self.l1(F.dropout(h3, train=train)))
        return self.l2(F.dropout(h4, train=train))
To train with Batch Normalization, run the following command. It took about 50 minutes to run.
$ python src/train.py -g 0 -m cnnbn -b 128 -p cnnbn --optimizer adam --iter 300 --lr_decay_iter 100
"-m cnnbn" selects the model with Batch Normalization.
The error curve looks like this, with an error rate of 12.40%. The error rate is dramatically lower than without Batch Normalization.
I also tried using the training data obtained by performing Contrast Normalization + ZCA Whitening. The command is as follows.
$ python src/train.py -g 0 -m cnnbn -b 128 -p cnnbn --optimizer adam --iter 300 --lr_decay_iter 100 -d dataset/image_norm_zca.pkl
The error rate was 12.27%, which was slightly better than just subtracting the mean.
Next, use a model similar to the 16-layer or 19-layer VGG networks. A VGG model repeats several blocks of kernel-size-3 convolutions followed by max pooling, and ends with fully connected layers. A VGG-based network was used in the blog post "The Story of Kaggle CIFAR-10", so I train a similar network here. That post achieved a high score of 94.15% recognition accuracy on the test data.
The differences from "The Story of Kaggle CIFAR-10" are as follows.

| | This implementation | The Story of Kaggle CIFAR-10 |
|---|---|---|
| Input data | 32 x 32 px | 24 x 24 px |
| Augmentation (training) | Translation, horizontal flip | Translation, horizontal flip, scaling |
| Augmentation (testing) | None | Translation, horizontal flip, scaling |
| Number of models | 1 | 6 (the average of the models' outputs is used) |
| Batch Normalization | Yes | No |
Written in Chainer, the network looks like this.
class VGG(chainer.Chain):
    def __init__(self):
        super(VGG, self).__init__(
            # Block 1: two 3x3 convolutions with 64 channels
            bconv1_1=BatchConv2D(3, 64, 3, stride=1, pad=1),
            bconv1_2=BatchConv2D(64, 64, 3, stride=1, pad=1),
            # Block 2: two 3x3 convolutions with 128 channels
            bconv2_1=BatchConv2D(64, 128, 3, stride=1, pad=1),
            bconv2_2=BatchConv2D(128, 128, 3, stride=1, pad=1),
            # Block 3: four 3x3 convolutions with 256 channels
            bconv3_1=BatchConv2D(128, 256, 3, stride=1, pad=1),
            bconv3_2=BatchConv2D(256, 256, 3, stride=1, pad=1),
            bconv3_3=BatchConv2D(256, 256, 3, stride=1, pad=1),
            bconv3_4=BatchConv2D(256, 256, 3, stride=1, pad=1),
            fc4=L.Linear(4 * 4 * 256, 1024),
            fc5=L.Linear(1024, 1024),
            fc6=L.Linear(1024, 10),
        )

    def __call__(self, x, train=True):
        h = self.bconv1_1(x, train)
        h = self.bconv1_2(h, train)
        h = F.dropout(F.max_pooling_2d(h, 2), 0.25, train=train)  # 32x32 -> 16x16
        h = self.bconv2_1(h, train)
        h = self.bconv2_2(h, train)
        h = F.dropout(F.max_pooling_2d(h, 2), 0.25, train=train)  # 16x16 -> 8x8
        h = self.bconv3_1(h, train)
        h = self.bconv3_2(h, train)
        h = self.bconv3_3(h, train)
        h = self.bconv3_4(h, train)
        h = F.dropout(F.max_pooling_2d(h, 2), 0.25, train=train)  # 8x8 -> 4x4
        h = F.relu(self.fc4(F.dropout(h, train=train)))
        h = F.relu(self.fc5(F.dropout(h, train=train)))
        h = self.fc6(h)
        return h
Execute it with the following command. A VGG-like model is specified with "-m vgg". It took about five and a half hours to run.
$ python src/train.py -g 0 -m vgg -b 128 -p vgg_adam --optimizer adam --iter 300 --lr_decay_iter 100
The error rate is 7.65%, improving recognition accuracy.
A Residual Network combines identity mappings with stacks of CNN layers so that training succeeds even when the network is made very deep. It is described in detail in this article.
The network implemented this time has 74 layers in total and, reading off the implementation below, the following configuration: an initial 3x3 convolution with 16 channels, 12 Residual Blocks with 16 channels, 12 Residual Blocks with 32 channels (the first uses stride 2 to halve the resolution), 12 Residual Blocks with 64 channels (again stride 2 for the first), then average pooling and a fully connected layer. Regarding how layers are counted: each Residual Block contains 2 CNN layers, so one Residual Block counts as 2 layers.
I really wanted to use 110 layers, but I could not run it due to insufficient GPU memory. I did confirm that, with the input images reduced to 24 x 24 px, a 110-layer network can be run.
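Just to make the layer counting concrete, here is a small helper (my own, following the counting rule above) that relates the --res_depth option to the total number of layers:

def total_layers(res_depth):
    # 1 initial convolution + 2 conv layers in each of the 3 * res_depth Residual Blocks
    # + 1 fully connected layer (pooling layers are not counted)
    return 1 + 2 * (3 * res_depth) + 1

print(total_layers(12))  # 74  (the configuration used here)
print(total_layers(18))  # 110 (the configuration I wanted to run)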
The network implementation in Chainer is as follows.
import math

class ResidualBlock(chainer.Chain):
    def __init__(self, ch_in, ch_out, stride=1, swapout=False, skip_ratio=0, activation1=F.relu, activation2=F.relu):
        w = math.sqrt(2)
        super(ResidualBlock, self).__init__(
            conv1=L.Convolution2D(ch_in, ch_out, 3, stride, 1, w),
            bn1=L.BatchNormalization(ch_out),
            conv2=L.Convolution2D(ch_out, ch_out, 3, 1, 1, w),
            bn2=L.BatchNormalization(ch_out),
        )
        self.activation1 = activation1
        self.activation2 = activation2
        self.skip_ratio = skip_ratio
        self.swapout = swapout

    def __call__(self, x, train):
        # Stochastic Depth: during training, skip the convolutional branch
        # with probability skip_ratio.
        skip = False
        if train and self.skip_ratio > 0 and np.random.rand() < self.skip_ratio:
            skip = True
        # Compute the output shape of the convolutional branch.
        sh, sw = self.conv1.stride
        c_out, c_in, kh, kw = self.conv1.W.data.shape
        b, c, hh, ww = x.data.shape
        if sh == 1 and sw == 1:
            shape_out = (b, c_out, hh, ww)
        else:
            hh = (hh + 2 - kh) // sh + 1
            ww = (ww + 2 - kw) // sw + 1
            shape_out = (b, c_out, hh, ww)
        h = x
        # Identity path: if the number of channels grows, pad the extra channels
        # with zeros; if the resolution shrinks, downsample with average pooling.
        if x.data.shape != shape_out:
            xp = chainer.cuda.get_array_module(x.data)
            n, c, hh, ww = x.data.shape
            pad_c = shape_out[1] - c
            p = xp.zeros((n, pad_c, hh, ww), dtype=xp.float32)
            p = chainer.Variable(p, volatile=not train)
            x = F.concat((p, x))
            if x.data.shape[2:] != shape_out[2:]:
                x = F.average_pooling_2d(x, 1, 2)
        if skip:
            return x
        h = self.bn1(self.conv1(h), test=not train)
        if self.activation1 is not None:
            h = self.activation1(h)
        h = self.bn2(self.conv2(h), test=not train)
        if not train:
            # At test time, scale the branch by its expected keep probability.
            h = h * (1 - self.skip_ratio)
        if self.swapout:
            h = F.dropout(h, train=train) + F.dropout(x, train=train)
        else:
            h = h + x
        if self.activation2 is not None:
            return self.activation2(h)
        else:
            return h
class ResidualNet(chainer.Chain):
    def __init__(self, depth=18, swapout=False, skip=True):
        super(ResidualNet, self).__init__()
        # Build the network as a list of (name, link/function, needs-train-flag) tuples.
        links = [('bconv1', BatchConv2D(3, 16, 3, 1, 1), True)]
        skip_size = depth * 3 - 3
        for i in six.moves.range(depth):
            if skip:
                skip_ratio = float(i) / skip_size * 0.5
            else:
                skip_ratio = 0
            links.append(('res{}'.format(len(links)), ResidualBlock(16, 16, swapout=swapout, skip_ratio=skip_ratio), True))
        links.append(('res{}'.format(len(links)), ResidualBlock(16, 32, stride=2, swapout=swapout), True))
        for i in six.moves.range(depth - 1):
            if skip:
                skip_ratio = float(i + depth) / skip_size * 0.5
            else:
                skip_ratio = 0
            links.append(('res{}'.format(len(links)), ResidualBlock(32, 32, swapout=swapout, skip_ratio=skip_ratio), True))
        links.append(('res{}'.format(len(links)), ResidualBlock(32, 64, stride=2, swapout=swapout), True))
        for i in six.moves.range(depth - 1):
            if skip:
                skip_ratio = float(i + depth * 2 - 1) / skip_size * 0.5
            else:
                skip_ratio = 0
            links.append(('res{}'.format(len(links)), ResidualBlock(64, 64, swapout=swapout, skip_ratio=skip_ratio), True))
        # Global average pooling over the final 8x8 feature map, then a linear classifier.
        links.append(('_apool{}'.format(len(links)), F.AveragePooling2D(8, 1, 0, False, True), False))
        links.append(('fc{}'.format(len(links)), L.Linear(64, 10), False))
        for name, f, _with_train in links:
            if not name.startswith('_'):
                self.add_link(*(name, f))
        self.layers = links

    def __call__(self, x, train=True):
        h = x
        for name, f, with_train in self.layers:
            if with_train:
                h = f(h, train=train)
            else:
                h = f(h)
        return h
There is a parameter called swapout, but it is an experimental implementation, so ignore it for now. In the Residual Block, when the input and output shapes differ (the width and height are smaller at the output, while the number of channels stays the same or grows), the identity path is handled as follows: the additional output channels are filled with zero padding, and if the spatial size shrinks, the input is downsampled with stride-2 average pooling.
Execute it with the following command.
python src/train.py -g 0 -m residual -b 128 -p residual --res_depth 12 --optimizer sgd --lr 0.1 --iter 300 --lr_decay_iter 100
"-m residual" specifies the Residual Network.
"--optimizer sgd" specifies Momentum SGD, which gave better results here than Adam.
The initial learning rate is set to 0.1; the Residual Network implementations of CIFAR-10 classification listed below also used an initial learning rate of 0.1.
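As a rough sketch of this optimizer setup in the Chainer v1 API (my own illustration; the momentum and weight decay values are assumptions, not necessarily what train.py uses):

import chainer

model = ResidualNet(depth=12)
optimizer = chainer.optimizers.MomentumSGD(lr=0.1, momentum=0.9)  # initial learning rate 0.1
optimizer.setup(model)
optimizer.add_hook(chainer.optimizer.WeightDecay(0.0001))  # assumed weight decay value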
It took about 10 hours to run. The test error rate was 8.06%, which was worse than the VGG-like model.
Stochastic Depth is a method that randomly skips Residual Blocks during training. It is explained in detail in this article.
The network code is the one shown above in the Residual Network section. Each ResidualBlock is given a skip_ratio property, and during training the CNN branch of the block is skipped with the probability given by skip_ratio. At test time, the output of the CNN branch is multiplied by (1 - skip_ratio).
The skip_ratio is given a slope so that deeper Residual Blocks get a larger value: here the first Residual Block has a skip_ratio of 0, the deepest block has a skip_ratio of 0.5, and the blocks in between change linearly.
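As a small illustration of that linear schedule (my own helper, not code from the repository):

def linear_skip_ratios(num_blocks, max_ratio=0.5):
    # skip_ratio rises linearly from 0 (first block) to max_ratio (deepest block)
    return [max_ratio * i / (num_blocks - 1) for i in range(num_blocks)]

print(linear_skip_ratios(5))  # [0.0, 0.125, 0.25, 0.375, 0.5]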
Execute it with the following command.
python src/train.py -g 0 -m residual -b 128 -p residual_skip --skip_depth --res_depth 12 --optimizer sgd --lr 0.1 --iter 300 --lr_decay_iter 100
"--skip_depth" is the option that enables Stochastic Depth.
It took about 9 hours to train. The test error rate improved to 7.42%.
I classified the CIFAR-10 image dataset using various models, and was able to confirm how the recognition rate changes with the preprocessing, the presence or absence of Batch Normalization, and the choice of model.
This time I implemented everything in Chainer, but, for example, the Torch7 implementation of Stochastic Depth (https://github.com/yueatsprograms/Stochastic_Depth) is faster and uses less memory than the implementation used here; in the same environment (Ubuntu) I was able to train a 56-layer Residual Network with it. If you want to run a deeper model, it may be worth considering another framework.