Classifying the CIFAR-10 image dataset with various deep learning models

Motivation

I wanted to try out a deep neural network such as a Residual Network. For the data, I chose the CIFAR-10 image dataset: MNIST is not challenging enough, while ImageNet data is hard to collect and takes a long time to train on.

What is the CIFAR-10 image dataset?

CIFAR-10 is a dataset of 60,000 small (32 x 32 px) color images in 10 classes, split into 50,000 training and 10,000 test images: https://www.cs.toronto.edu/~kriz/cifar.html

Execution environment

I ran it in the following environment.

Source code and features

The source code is available at https://github.com/dsanno/chainer-cifar. The commands described below assume that you have cloned this repository and are at the root of the source tree.

It has the following functions.

Error rate measurement

For classification, I measured the error rate: the fraction of test images whose predicted class differs from the true label.
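
As a minimal sketch (not the repository's code), the error rate can be computed from predicted and true labels like this:

import numpy as np

def error_rate(predicted, true):
    # Fraction of images whose predicted class differs from the true label.
    return float(np.mean(np.asarray(predicted) != np.asarray(true)))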

Get dataset

You can download it from the link "CIFAR-10 python version" at https://www.cs.toronto.edu/~kriz/cifar.html, or fetch it with the following command. The dataset file is 166MB, so it may take a while on a slow connection.

$ python src/download.py

Extracting the downloaded archive yields the following files: batches.meta, data_batch_1 through data_batch_5, test_batch, and readme.html.

The contents of data_batch_1 and the other batch files are pickled dicts: the raw image data is stored under 'data' and the label information under 'labels'. Here is a sample session (Python 2, hence cPickle) that inspects data_batch_1 and saves its first 100 images as a single tiled image.

$ python
>>> import cPickle as pickle
>>> f = open('dataset/cifar-10-batches-py/data_batch_1', 'rb')
>>> train_data = pickle.load(f)
>>> f.close()
>>> type(train_data['data'])
<type 'numpy.ndarray'>
>>> train_data['data'].shape #Get the shape of the raw data
(10000L, 3072L)
>>> train_data['data'][:5]   #Get the first 5 raw data
array([[ 59,  43,  50, ..., 140,  84,  72],
       [154, 126, 105, ..., 139, 142, 144],
       [255, 253, 253, ...,  83,  83,  84],
       [ 28,  37,  38, ...,  28,  37,  46],
       [170, 168, 177, ...,  82,  78,  80]], dtype=uint8)
>>> type(train_data['labels'])
<type 'list'>
>>> train_data['labels'][:10] #Get the first 10 label data
[6, 9, 9, 4, 1, 1, 2, 7, 8, 3]
>>> from PIL import Image
>>> sample_image = train_data['data'][:100].reshape((10, 10, 3, 32, 32)).transpose((0, 3, 1, 4, 2)).reshape((320, 320, 3)) # tile the first 100 images into a 10 x 10 grid
>>> Image.fromarray(sample_image).save('sample.png')

This produces the following image (sample.png).

Image preprocessing

Run the following command to generate datasets with three types of preprocessing.

$ python src/dataset.py

For "Average value of image", we used the average value of the RGB values of the entire training data regardless of RGB. Contrast Normalization equalizes the contrast by subtracting the average value of each image from the RGB value and then multiplying by a constant so that the standard deviation becomes 1. I don't really understand ZCA Whitening, but Toki no Mori Wiki Says that "transformation so that the covariance matrix of data becomes an identity matrix" is performed. The specific calculation of Whitening is described in detail in Sunfish Diary "CIFAR-10 and ZCA whitening".

The images after Contrast Normalization + ZCA Whitening look like this. Since the resulting RGB values have a narrow distribution, each image shown here is renormalized so that its RGB values spread over the range 0 to 255.

sample_norm_zca.png

Augmentation of training data

The following augmentations are applied during each training run: random translation by up to 4 pixels and random horizontal flipping (see the code below). At test time, no augmentation is applied; the preprocessed test data is used as-is.

The code for the augmentation part is as follows.


import numpy as np
import six

(Omission)

    def __trans_image(self, x):
        size = 32
        n = x.shape[0]
        images = np.zeros((n, 3, size, size), dtype=np.float32)
        offset = np.random.randint(-4, 5, size=(n, 2))  # random shift of up to 4 px
        mirror = np.random.randint(2, size=n)           # random horizontal flip
        for i in six.moves.range(n):
            image = x[i]
            top, left = offset[i]
            left = max(0, left)
            top = max(0, top)
            right = min(size, left + size)
            bottom = min(size, top + size)
            if mirror[i] > 0:
                images[i, :, size-bottom:size-top, size-right:size-left] = image[:, top:bottom, left:right][:, :, ::-1]
            else:
                images[i, :, size-bottom:size-top, size-right:size-left] = image[:, top:bottom, left:right]
        return images

Try to classify using a relatively shallow network

Let's train a relatively shallow network like the one below. It uses a structure similar to the network in the TensorFlow tutorial (https://www.tensorflow.org/versions/r0.9/tutorials/deep_cnn/index.html), though not exactly the same: the layer composition and initial parameter values differ. After three blocks of convolution (CNN) + ReLU + max pooling come two fully connected layers, with dropout applied in front of each fully connected layer to prevent overfitting.

import chainer
import chainer.functions as F
import chainer.links as L

class CNN(chainer.Chain):
    def __init__(self):
        super(CNN, self).__init__(
            conv1=L.Convolution2D(3, 64, 5, stride=1, pad=2),
            conv2=L.Convolution2D(64, 64, 5, stride=1, pad=2),
            conv3=L.Convolution2D(64, 128, 5, stride=1, pad=2),
            l1=L.Linear(4 * 4 * 128, 1000),
            l2=L.Linear(1000, 10),
        )

    def __call__(self, x, train=True):
        h1 = F.max_pooling_2d(F.relu(self.conv1(x)), 3, 2)
        h2 = F.max_pooling_2d(F.relu(self.conv2(h1)), 3, 2)
        h3 = F.max_pooling_2d(F.relu(self.conv3(h2)), 3, 2)
        h4 = F.relu(self.l1(F.dropout(h3, train=train)))
        return self.l2(F.dropout(h4, train=train))
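
As a quick sanity check (a sketch assuming the Chainer v1 API used throughout this article), you can pass a dummy mini-batch through the model and confirm the output shape:

import numpy as np
import chainer

model = CNN()
x = chainer.Variable(np.zeros((1, 3, 32, 32), dtype=np.float32))
y = model(x, train=False)
print(y.data.shape)  # (1, 10): one score per CIFAR-10 class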

Execute learning with the following command. It took about 40 minutes to run.

$ python src/train.py -g 0 -m cnn -b 128 -p cnn --optimizer adam --iter 300 --lr_decay_iter 100

The options mean the following: -g 0 runs on GPU 0, -m cnn selects the CNN model above, -b 128 sets the mini-batch size, -p cnn sets the prefix for the output files, --optimizer adam uses Adam, --iter 300 sets the number of training epochs, and --lr_decay_iter 100 decays the learning rate every 100 epochs.

The error curve looks like this; the test error rate was 18.94%. Notice that when the learning rate is lowered partway through, learning suddenly makes rapid progress again. Scheduling the learning rate to decay after a fixed number of epochs, as done here, is a commonly used technique.

cnn2_error.png
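
As a minimal sketch of this kind of step-wise schedule (assuming an epoch loop, a hypothetical train_one_epoch helper, and an optimizer with an lr attribute such as Chainer v1's MomentumSGD; for Adam the corresponding attribute is alpha, and the decay factor of 10 is just a common choice):

for epoch in range(300):
    if epoch > 0 and epoch % 100 == 0:
        optimizer.lr *= 0.1        # drop the learning rate every 100 epochs
    train_one_epoch(optimizer)     # hypothetical helper that trains one epoch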

Use a dataset with Contrast Normalization + ZCA Whitening

This time, we will train using a dataset that has undergone Contrast Normalization + ZCA Whitening in the preprocessing. Execute learning with the following command. It took about 40 minutes to run.

$ python src/train.py -g 0 -m cnn -b 128 -p cnn_zca --optimizer adam --iter 300 --lr_decay_iter 100 -d dataset/image_norm_zca.pkl

"-d dataset / image_norm_zca.pkl" specifies the dataset with Contrast Normalization + ZCA Whitening.

The error curve looks like this; the test error rate was 18.76%. This is slightly better than mean subtraction alone, but almost unchanged.

cnn_zca2_error.png

Use Batch Normalization

Batch Normalization normalizes the output of a given layer to zero mean and unit variance over each mini-batch, with the aim of making the layers that follow easier to train. The algorithm is described in detail in this article. In Chainer, chainer.links.BatchNormalization provides it.
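
As a minimal NumPy sketch of what Batch Normalization computes at training time (gamma and beta are the layer's learned scale and shift; eps is a small constant for numerical stability):

import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    # x: (batch, features); normalize each feature over the mini-batch
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta            # learned scale and shift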

The network code used this time is shown below. The configuration is almost the same as the previous one; the only difference is the presence of Batch Normalization.

class BatchConv2D(chainer.Chain):
    def __init__(self, ch_in, ch_out, ksize, stride=1, pad=0, activation=F.relu):
        super(BatchConv2D, self).__init__(
            conv=L.Convolution2D(ch_in, ch_out, ksize, stride, pad),
            bn=L.BatchNormalization(ch_out),
        )
        self.activation=activation

    def __call__(self, x, train):
        h = self.bn(self.conv(x), test=not train)
        if self.activation is None:
            return h
        return self.activation(h)  # apply the configured activation (F.relu by default)

class CNNBN(chainer.Chain):
    def __init__(self):
        super(CNNBN, self).__init__(
            bconv1=BatchConv2D(3, 64, 5, stride=1, pad=2),
            bconv2=BatchConv2D(64, 64, 5, stride=1, pad=2),
            bconv3=BatchConv2D(64, 128, 5, stride=1, pad=2),
            l1=L.Linear(4 * 4 * 128, 1000),
            l2=L.Linear(1000, 10),
        )

    def __call__(self, x, train=True):
        h1 = F.max_pooling_2d(self.bconv1(x, train), 3, 2)
        h2 = F.max_pooling_2d(self.bconv2(h1, train), 3, 2)
        h3 = F.max_pooling_2d(self.bconv3(h2, train), 3, 2)
        h4 = F.relu(self.l1(F.dropout(h3, train=train)))
        return self.l2(F.dropout(h4, train=train))

To train with Batch Normalization, run the following command. It took about 50 minutes to run.

$ python src/train.py -g 0 -m cnnbn -b 128 -p cnnbn --optimizer adam --iter 300 --lr_decay_iter 100

"-m cnnbn" selects the model with Batch Normalization.

The error curve looks like this, with an error rate of 12.40%. You can see that the error rate is dramatically lower than without Batch Normalization.

cnnbn2_error.png

I also tried using the training data obtained by performing Contrast Normalization + ZCA Whitening. The command is as follows.

$ python src/train.py -g 0 -m cnnbn -b 128 -p cnnbn --optimizer adam --iter 300 --lr_decay_iter 100 -d dataset/image_norm_zca.pkl

The error rate was 12.27%, slightly better than mean subtraction alone.

cnnbn_zca2_error.png

Use a VGG-like model

This model is similar to the 16- and 19-layer VGG networks: several groups of kernel-size-3 convolutions, each followed by max pooling, are repeated, and fully connected layers come at the end. The blog post "The story of Kaggle CIFAR-10" trained a VGG-based network, so I used a similar one; that post achieved a high score of 94.15% recognition accuracy on the test data.

The differences from "The story of Kaggle CIFAR-10" are as follows.

                              This implementation               "The story of Kaggle CIFAR-10"
Input data                    32 x 32 px                        24 x 24 px
Augmentation (training)       translation, horizontal flip      translation, horizontal flip, scaling
Augmentation (testing)        none                              translation, horizontal flip, scaling
Number of models              1                                 6 (the outputs of the models are averaged)
Batch Normalization           yes                               no

In Chainer it looks like this.

class VGG(chainer.Chain):
    def __init__(self):
        super(VGG, self).__init__(
            bconv1_1=BatchConv2D(3, 64, 3, stride=1, pad=1),
            bconv1_2=BatchConv2D(64, 64, 3, stride=1, pad=1),
            bconv2_1=BatchConv2D(64, 128, 3, stride=1, pad=1),
            bconv2_2=BatchConv2D(128, 128, 3, stride=1, pad=1),
            bconv3_1=BatchConv2D(128, 256, 3, stride=1, pad=1),
            bconv3_2=BatchConv2D(256, 256, 3, stride=1, pad=1),
            bconv3_3=BatchConv2D(256, 256, 3, stride=1, pad=1),
            bconv3_4=BatchConv2D(256, 256, 3, stride=1, pad=1),
            fc4=L.Linear(4 * 4 * 256, 1024),
            fc5=L.Linear(1024, 1024),
            fc6=L.Linear(1024, 10),
        )

    def __call__(self, x, train=True):
        h = self.bconv1_1(x, train)
        h = self.bconv1_2(h, train)
        h = F.dropout(F.max_pooling_2d(h, 2), 0.25, train=train)
        h = self.bconv2_1(h, train)
        h = self.bconv2_2(h, train)
        h = F.dropout(F.max_pooling_2d(h, 2), 0.25, train=train)
        h = self.bconv3_1(h, train)
        h = self.bconv3_2(h, train)
        h = self.bconv3_3(h, train)
        h = self.bconv3_4(h, train)
        h = F.dropout(F.max_pooling_2d(h, 2), 0.25, train=train)
        h = F.relu(self.fc4(F.dropout(h, train=train)))
        h = F.relu(self.fc5(F.dropout(h, train=train)))
        h = self.fc6(h)
        return h

Execute it with the following command. A VGG-like model is specified with "-m vgg". It took about five and a half hours to run.

$ python src/train.py -g 0 -m vgg -b 128 -p vgg_adam --optimizer adam --iter 300 --lr_decay_iter 100 

The error rate is 7.65%, improving recognition accuracy.

vgg_adam2_error.png

Use Residual Network

A Residual Network combines identity mappings with small stacks of convolution layers so that training still works well as the network gets deeper. It is described in detail in this article.
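
Conceptually, each Residual Block adds the output of its convolution layers to the unchanged input; a minimal sketch:

def residual_block(x, f):
    # f is the block's stack of convolution layers; since the identity path
    # passes x through unchanged, the block only has to learn the residual.
    return f(x) + x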

The network implemented this time has 74 layers in total: an initial 3 x 3 convolution, 36 Residual Blocks (12 each at 16, 32, and 64 channels, with stride-2 blocks at the channel transitions), average pooling, and a fully connected layer (see the code below). Regarding how to count layers: a Residual Block contains 2 convolution layers, so one block counts as 2 layers, giving 1 + 36 x 2 + 1 = 74.

I really wanted to make it 110 layers, but that did not run due to lack of GPU memory. I confirmed that with the input images reduced to 24 x 24 px, a 110-layer network can be run.

The network implementation in Chainer is as follows.

import math

class ResidualBlock(chainer.Chain):
    def __init__(self, ch_in, ch_out, stride=1, swapout=False, skip_ratio=0, activation1=F.relu, activation2=F.relu):
        w = math.sqrt(2)  # wscale passed to Convolution2D for weight initialization
        super(ResidualBlock, self).__init__(
            conv1=L.Convolution2D(ch_in, ch_out, 3, stride, 1, w),
            bn1=L.BatchNormalization(ch_out),
            conv2=L.Convolution2D(ch_out, ch_out, 3, 1, 1, w),
            bn2=L.BatchNormalization(ch_out),
        )
        self.activation1 = activation1
        self.activation2 = activation2
        self.skip_ratio = skip_ratio
        self.swapout = swapout

    def __call__(self, x, train):
        skip = False
        if train and self.skip_ratio > 0 and np.random.rand() < self.skip_ratio:
            skip = True  # Stochastic Depth: drop this block's CNN part for this batch
        sh, sw = self.conv1.stride
        c_out, c_in, kh, kw = self.conv1.W.data.shape
        b, c, hh, ww = x.data.shape
        # Compute the output shape of the block's CNN part.
        if sh == 1 and sw == 1:
            shape_out = (b, c_out, hh, ww)
        else:
            hh = (hh + 2 - kh) // sh + 1
            ww = (ww + 2 - kw) // sw + 1
            shape_out = (b, c_out, hh, ww)
        h = x
        if x.data.shape != shape_out:
            # Shape mismatch on the identity path: zero-pad the extra output
            # channels and, if needed, halve the resolution with average
            # pooling of kernel size 1 and stride 2 (plain subsampling).
            xp = chainer.cuda.get_array_module(x.data)
            n, c, hh, ww = x.data.shape
            pad_c = shape_out[1] - c
            p = xp.zeros((n, pad_c, hh, ww), dtype=xp.float32)
            p = chainer.Variable(p, volatile=not train)
            x = F.concat((p, x))
            if x.data.shape[2:] != shape_out[2:]:
                x = F.average_pooling_2d(x, 1, 2)
        if skip:
            return x
        h = self.bn1(self.conv1(h), test=not train)
        if self.activation1 is not None:
            h = self.activation1(h)
        h = self.bn2(self.conv2(h), test=not train)
        if not train:
            h = h * (1 - self.skip_ratio)  # expected-value correction at test time
        if self.swapout:
            h = F.dropout(h, train=train) + F.dropout(x, train=train)
        else:
            h = h + x
        if self.activation2 is not None:
            return self.activation2(h)
        else:
            return h

class ResidualNet(chainer.Chain):
    def __init__(self, depth=18, swapout=False, skip=True):
        super(ResidualNet, self).__init__()
        links = [('bconv1', BatchConv2D(3, 16, 3, 1, 1), True)]
        skip_size = depth * 3 - 3
        for i in six.moves.range(depth):
            if skip:
                skip_ratio = float(i) / skip_size * 0.5
            else:
                skip_ratio = 0
            links.append(('res{}'.format(len(links)), ResidualBlock(16, 16, swapout=swapout, skip_ratio=skip_ratio), True))
        links.append(('res{}'.format(len(links)), ResidualBlock(16, 32, stride=2, swapout=swapout), True))
        for i in six.moves.range(depth - 1):
            if skip:
                skip_ratio = float(i + depth) / skip_size * 0.5
            else:
                skip_ratio = 0
            links.append(('res{}'.format(len(links)), ResidualBlock(32, 32, swapout=swapout, skip_ratio=skip_ratio), True))
        links.append(('res{}'.format(len(links)), ResidualBlock(32, 64, stride=2, swapout=swapout), True))
        for i in six.moves.range(depth - 1):
            if skip:
                skip_ratio = float(i + depth * 2 - 1) / skip_size * 0.5
            else:
                skip_ratio = 0
            links.append(('res{}'.format(len(links)), ResidualBlock(64, 64, swapout=swapout, skip_ratio=skip_ratio), True))
        links.append(('_apool{}'.format(len(links)), F.AveragePooling2D(8, 1, 0, False, True), False))
        links.append(('fc{}'.format(len(links)), L.Linear(64, 10), False))

        for name, f, _with_train in links:
            if not name.startswith('_'):
                self.add_link(name, f)
        self.layers = links

    def __call__(self, x, train=True):
        h = x
        for name, f, with_train in self.layers:
            if with_train:
                h = f(h, train=train)
            else:
                h = f(h)
        return h

There is a parameter called swapout, but it is an experimental implementation, so ignore it for now. When a Residual Block's input and output sizes differ (the output is spatially smaller, with the same or a larger number of channels), the identity path zero-pads the extra channels and downsamples with stride-2 average pooling, as in the code above.
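
A NumPy sketch of this shortcut handling (not the Chainer code above; note that average pooling with kernel size 1 and stride 2 reduces to plain subsampling):

import numpy as np

def identity_shortcut(x, ch_out, stride):
    n, c, h, w = x.shape
    if ch_out > c:
        pad = np.zeros((n, ch_out - c, h, w), dtype=x.dtype)
        x = np.concatenate((pad, x), axis=1)  # zero-pad the extra channels
    if stride > 1:
        x = x[:, :, ::stride, ::stride]       # subsample width and height
    return x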

Execute it with the following command.

$ python src/train.py -g 0 -m residual -b 128 -p residual --res_depth 12 --optimizer sgd --lr 0.1 --iter 300 --lr_decay_iter 100

"-m residual" selects the Residual Network. "--optimizer sgd" uses momentum SGD, which gave better results here than Adam. "--lr 0.1" sets the initial learning rate; 0.1 is the value used by the CIFAR-10 Residual Network implementations I referred to.

It took about 10 hours to run. The test error rate was 8.06%, which was worse than the VGG-like model.

residual_noskip_error.png

Use Stochastic Depth

Stochastic Depth is a method that randomly skips Residual Blocks during training. The explanation is detailed in this article.

The network code was already shown in "Use Residual Network" above. Each ResidualBlock has a skip_ratio property; during training, the CNN part of the block is skipped with the probability given by skip_ratio, and at test time the output of the CNN part is multiplied by (1 - skip_ratio). The skip_ratio is ramped so that deeper Residual Blocks get larger values: 0 for the first block, 0.5 for the deepest, changing linearly for the blocks in between.
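
A minimal sketch of this behavior (f stands in for the block's CNN part; the linear schedule below assumes n_blocks blocks and a maximum skip probability of 0.5):

import numpy as np

def stochastic_depth_block(x, f, skip_ratio, train):
    if train:
        if np.random.rand() < skip_ratio:
            return x                        # skip the CNN part entirely
        return f(x) + x
    return (1 - skip_ratio) * f(x) + x      # expected value at test time

def linear_skip_schedule(n_blocks, max_ratio=0.5):
    # 0 for the first block, max_ratio for the deepest, linear in between
    return [max_ratio * i / (n_blocks - 1) for i in range(n_blocks)]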

Execute it with the following command.

$ python src/train.py -g 0 -m residual -b 128 -p residual_skip --skip_depth --res_depth 12 --optimizer sgd --lr 0.1 --iter 300 --lr_decay_iter 100

--skip_depth is an option to use Stochastic Depth.

Training took about 9 hours. The test error rate improved to 7.42%.

residual_error.png

Summary

I classified the CIFAR-10 image dataset using various models, and confirmed how the recognition rate changes with the preprocessing, the presence or absence of Batch Normalization, and the choice of model.

This time I implemented everything with Chainer, but the Torch7 implementation of Stochastic Depth (https://github.com/yueatsprograms/Stochastic_Depth), for example, is faster and uses less memory than the implementation used here; in the same environment, on Ubuntu, I was able to train a 56-layer Residual Network with it. If you want to run a deeper model, another framework may be worth considering.
