I wanted to try a deep neural network such as a Residual Network. I chose the CIFAR-10 image dataset for the classification task because MNIST is too easy, while ImageNet is hard to collect and takes a long time to train on.
The CIFAR-10 dataset is a small collection of color images (60,000 32 x 32 px images in 10 classes, split into 50,000 training and 10,000 test images): https://www.cs.toronto.edu/~kriz/cifar.html
I ran everything in the following environment.
The source code is available at https://github.com/dsanno/chainer-cifar. The commands below assume that you have cloned this repository and are at the root of the source tree.
It has the following features.
The classification error rates I measured are as follows.
You can download the dataset from the "CIFAR-10 python version" link at https://www.cs.toronto.edu/~kriz/cifar.html, or with the following command. The dataset file is 166MB, so the download takes a while on a slow connection.
$ python src/download.py
Unzip the downloaded dataset and you will find the following files.
Each file such as data_batch_1 is a pickled dict: the raw image data is stored under 'data' and the label information under 'labels'. Here is a sample session that inspects data_batch_1 and saves the first 100 images as a single tiled image.
$ python
>>> import cPickle as pickle
>>> f = open('dataset/cifar-10-batches-py/data_batch_1', 'rb')
>>> train_data = pickle.load(f)
>>> f.close()
>>> type(train_data['data'])
<type 'numpy.ndarray'>
>>> train_data['data'].shape #Get the shape of the raw data
(10000L, 3072L)
>>> train_data['data'][:5] #Get the first 5 raw data
array([[ 59, 43, 50, ..., 140, 84, 72],
[154, 126, 105, ..., 139, 142, 144],
[255, 253, 253, ..., 83, 83, 84],
[ 28, 37, 38, ..., 28, 37, 46],
[170, 168, 177, ..., 82, 78, 80]], dtype=uint8)
>>> type(train_data['labels'])
<type 'list'>
>>> train_data['labels'][:10] #Get the first 10 label data
[6, 9, 9, 4, 1, 1, 2, 7, 8, 3]
>>> from PIL import Image
>>> sample_image = train_data['data'][:100].reshape((10, 10, 3, 32, 32)).transpose((0, 3, 1, 4, 2)).reshape((320, 320, 3)) # Arrange the first 100 images in a 10 x 10 tile
>>> Image.fromarray(sample_image).save('sample.png')
This produces the following image.
Execute the following command to generate datasets with three kinds of preprocessing.
$ python src/dataset.py
For "Average value of image", we used the average value of the RGB values of the entire training data regardless of RGB. Contrast Normalization equalizes the contrast by subtracting the average value of each image from the RGB value and then multiplying by a constant so that the standard deviation becomes 1. I don't really understand ZCA Whitening, but Toki no Mori Wiki Says that "transformation so that the covariance matrix of data becomes an identity matrix" is performed. The specific calculation of Whitening is described in detail in Sunfish Diary "CIFAR-10 and ZCA whitening".
The image after Contrast Normalization + ZCA Whitening looks as follows. Because the resulting RGB values have a narrow distribution, each image is rescaled for display so that its RGB values spread over the 0 to 255 range.
The following augmentations are applied during each training run. At test time no augmentation is applied; the preprocessed test data is used as is.
The code for the augmentation part is as follows.
import numpy as np
import six

# (Omission)

def __trans_image(self, x):
    # Randomly translate each image by up to 4px and mirror it with probability 0.5.
    size = 32
    n = x.shape[0]
    images = np.zeros((n, 3, size, size), dtype=np.float32)
    offset = np.random.randint(-4, 5, size=(n, 2))
    mirror = np.random.randint(2, size=n)
    for i in six.moves.range(n):
        image = x[i]
        top, left = offset[i]
        left = max(0, left)
        top = max(0, top)
        right = min(size, left + size)
        bottom = min(size, top + size)
        if mirror[i] > 0:
            # Horizontal flip combined with the translated crop
            images[i, :, size-bottom:size-top, size-right:size-left] = image[:, top:bottom, left:right][:, :, ::-1]
        else:
            images[i, :, size-bottom:size-top, size-right:size-left] = image[:, top:bottom, left:right]
    return images
Let's first train a relatively shallow network like the one below. It uses a structure similar to the network in the TensorFlow tutorial (https://www.tensorflow.org/versions/r0.9/tutorials/deep_cnn/index.html), though not exactly the same; the layer composition and initial parameter values differ. Three blocks of convolution (CNN) + ReLU + max pooling are stacked, followed by two fully connected layers. Dropout is applied in front of each fully connected layer to prevent overfitting.
import chainer
import chainer.functions as F
import chainer.links as L

class CNN(chainer.Chain):
    def __init__(self):
        super(CNN, self).__init__(
            conv1=L.Convolution2D(3, 64, 5, stride=1, pad=2),
            conv2=L.Convolution2D(64, 64, 5, stride=1, pad=2),
            conv3=L.Convolution2D(64, 128, 5, stride=1, pad=2),
            l1=L.Linear(4 * 4 * 128, 1000),
            l2=L.Linear(1000, 10),
        )

    def __call__(self, x, train=True):
        # Three conv + ReLU + max pooling blocks: 32x32 -> 16x16 -> 8x8 -> 4x4
        h1 = F.max_pooling_2d(F.relu(self.conv1(x)), 3, 2)
        h2 = F.max_pooling_2d(F.relu(self.conv2(h1)), 3, 2)
        h3 = F.max_pooling_2d(F.relu(self.conv3(h2)), 3, 2)
        # Dropout before each fully connected layer to reduce overfitting
        h4 = F.relu(self.l1(F.dropout(h3, train=train)))
        return self.l2(F.dropout(h4, train=train))
Run training with the following command. It took about 40 minutes to run.
$ python src/train.py -g 0 -m cnn -b 128 -p cnn --optimizer adam --iter 300 --lr_decay_iter 100
The meanings of the options are as follows.
The error curve looks like this, with a test error rate of 18.94%. Looking at the curve, you can see that when the learning rate is lowered after training for a while, learning suddenly makes rapid progress again. Scheduling the learning rate to drop after a fixed number of iterations like this is a commonly used technique.
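As an illustration of such a schedule, here is a minimal sketch of step-wise learning-rate decay (my own code, not the exact logic of train.py; the function and parameter names are assumptions):

def decay_learning_rate(optimizer, epoch, lr_decay_iter=100, rate=0.1):
    # Every lr_decay_iter epochs, multiply the optimizer's step size by `rate`.
    if epoch > 0 and epoch % lr_decay_iter == 0:
        if hasattr(optimizer, 'lr'):       # SGD / MomentumSGD expose `lr`
            optimizer.lr *= rate
        elif hasattr(optimizer, 'alpha'):  # Adam uses `alpha` as its step size
            optimizer.alpha *= rate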
Next, train on the dataset that was preprocessed with Contrast Normalization + ZCA Whitening. Run training with the following command. It took about 40 minutes to run.
$ python src/train.py -g 0 -m cnn -b 128 -p cnn_zca --optimizer adam --iter 300 --lr_decay_iter 100 -d dataset/image_norm_zca.pkl
"-d dataset / image_norm_zca.pkl" specifies the dataset with Contrast Normalization + ZCA Whitening.
The error curve looks like this, with a test error rate of 18.76%. It is better than simply subtracting the mean, but only marginally.
Batch Normalization normalizes the output of a layer so that, within each mini-batch, it has mean 0 and variance 1.
The aim is to make the following layers easier to train.
The algorithm is described in detail in this article.
In Chainer, you can use chainer.links.BatchNormalization to apply Batch Normalization.
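As a small illustration (my own example, assuming the Chainer v1 API used throughout this article), chainer.links.BatchNormalization can be applied to a batch of feature maps like this:

import numpy as np
import chainer
import chainer.links as L

bn = L.BatchNormalization(16)  # one learned scale/shift pair (gamma, beta) per channel
x = chainer.Variable(np.random.randn(8, 16, 32, 32).astype(np.float32))
y = bn(x)  # during training, each channel is normalized over the mini-batch,
           # then scaled by gamma and shifted by beta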
The network code used this time is shown below. The configuration is almost the same as before; the only difference is the addition of Batch Normalization.
class BatchConv2D(chainer.Chain):
    def __init__(self, ch_in, ch_out, ksize, stride=1, pad=0, activation=F.relu):
        super(BatchConv2D, self).__init__(
            conv=L.Convolution2D(ch_in, ch_out, ksize, stride, pad),
            bn=L.BatchNormalization(ch_out),
        )
        self.activation = activation

    def __call__(self, x, train):
        # Convolution -> Batch Normalization -> (optional) activation
        h = self.bn(self.conv(x), test=not train)
        if self.activation is None:
            return h
        return self.activation(h)

class CNNBN(chainer.Chain):
    def __init__(self):
        super(CNNBN, self).__init__(
            bconv1=BatchConv2D(3, 64, 5, stride=1, pad=2),
            bconv2=BatchConv2D(64, 64, 5, stride=1, pad=2),
            bconv3=BatchConv2D(64, 128, 5, stride=1, pad=2),
            l1=L.Linear(4 * 4 * 128, 1000),
            l2=L.Linear(1000, 10),
        )

    def __call__(self, x, train=True):
        # Same structure as CNN above, with Batch Normalization after each convolution
        h1 = F.max_pooling_2d(self.bconv1(x, train), 3, 2)
        h2 = F.max_pooling_2d(self.bconv2(h1, train), 3, 2)
        h3 = F.max_pooling_2d(self.bconv3(h2, train), 3, 2)
        h4 = F.relu(self.l1(F.dropout(h3, train=train)))
        return self.l2(F.dropout(h4, train=train))
To train with Batch Normalization, run the following command. It took about 50 minutes to run.
$ python src/train.py -g 0 -m cnnbn -b 128 -p cnnbn --optimizer adam --iter 300 --lr_decay_iter 100
"-m cnnbn" selects the model with Batch Normalization.
The error curve looks like this, with an error rate of 12.40%. The error rate is dramatically lower than without Batch Normalization.
I also tried using the training data obtained by performing Contrast Normalization + ZCA Whitening. The command is as follows.
$ python src/train.py -g 0 -m cnnbn -b 128 -p cnnbn --optimizer adam --iter 300 --lr_decay_iter 100 -d dataset/image_norm_zca.pkl
The error rate was 12.27%, which was slightly better than just subtracting the mean.
Next, use a model similar to the 16-layer or 19-layer VGG networks. A VGG model repeats several blocks of kernel-size-3 convolutions followed by max pooling, and ends with fully connected layers. A VGG-based network was used in the blog post "The Story of Kaggle CIFAR-10", so I train a similar network here. That post achieved a high score of 94.15% recognition accuracy on the test data.
The differences from "The Story of Kaggle CIFAR-10" are as follows.

| | This implementation | The Story of Kaggle CIFAR-10 |
|---|---|---|
| Input data | 32 x 32 px | 24 x 24 px |
| Augmentation (training) | Translation, horizontal flip | Translation, horizontal flip, scaling |
| Augmentation (testing) | None | Translation, horizontal flip, scaling |
| Number of models | 1 | 6 (the average of the models' outputs is used) |
| Batch Normalization | Yes | No |
Written in Chainer, the network looks like this.
class VGG(chainer.Chain):
    def __init__(self):
        super(VGG, self).__init__(
            # Block 1: two 3x3 convolutions with 64 channels
            bconv1_1=BatchConv2D(3, 64, 3, stride=1, pad=1),
            bconv1_2=BatchConv2D(64, 64, 3, stride=1, pad=1),
            # Block 2: two 3x3 convolutions with 128 channels
            bconv2_1=BatchConv2D(64, 128, 3, stride=1, pad=1),
            bconv2_2=BatchConv2D(128, 128, 3, stride=1, pad=1),
            # Block 3: four 3x3 convolutions with 256 channels
            bconv3_1=BatchConv2D(128, 256, 3, stride=1, pad=1),
            bconv3_2=BatchConv2D(256, 256, 3, stride=1, pad=1),
            bconv3_3=BatchConv2D(256, 256, 3, stride=1, pad=1),
            bconv3_4=BatchConv2D(256, 256, 3, stride=1, pad=1),
            fc4=L.Linear(4 * 4 * 256, 1024),
            fc5=L.Linear(1024, 1024),
            fc6=L.Linear(1024, 10),
        )

    def __call__(self, x, train=True):
        h = self.bconv1_1(x, train)
        h = self.bconv1_2(h, train)
        h = F.dropout(F.max_pooling_2d(h, 2), 0.25, train=train)  # 32x32 -> 16x16
        h = self.bconv2_1(h, train)
        h = self.bconv2_2(h, train)
        h = F.dropout(F.max_pooling_2d(h, 2), 0.25, train=train)  # 16x16 -> 8x8
        h = self.bconv3_1(h, train)
        h = self.bconv3_2(h, train)
        h = self.bconv3_3(h, train)
        h = self.bconv3_4(h, train)
        h = F.dropout(F.max_pooling_2d(h, 2), 0.25, train=train)  # 8x8 -> 4x4
        h = F.relu(self.fc4(F.dropout(h, train=train)))
        h = F.relu(self.fc5(F.dropout(h, train=train)))
        h = self.fc6(h)
        return h
Execute it with the following command. A VGG-like model is specified with "-m vgg". It took about five and a half hours to run.
$ python src/train.py -g 0 -m vgg -b 128 -p vgg_adam --optimizer adam --iter 300 --lr_decay_iter 100
The error rate is 7.65%, improving recognition accuracy.
A Residual Network combines identity mappings with stacks of CNN layers so that training succeeds even when the network is made very deep. It is described in detail in this article.
The network implemented this time has 74 layers in total and, reading off the implementation below, the following configuration: an initial 3x3 convolution with 16 channels, 12 Residual Blocks with 16 channels, 12 Residual Blocks with 32 channels (the first uses stride 2 to halve the resolution), 12 Residual Blocks with 64 channels (again stride 2 for the first), then average pooling and a fully connected layer. Regarding how layers are counted: each Residual Block contains 2 CNN layers, so one Residual Block counts as 2 layers.
I really wanted to use 110 layers, but I could not run it due to insufficient GPU memory. I did confirm that, with the input images reduced to 24 x 24 px, a 110-layer network can be run.
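Just to make the layer counting concrete, here is a small helper (my own, following the counting rule above) that relates the --res_depth option to the total number of layers:

def total_layers(res_depth):
    # 1 initial convolution + 2 conv layers in each of the 3 * res_depth Residual Blocks
    # + 1 fully connected layer (pooling layers are not counted)
    return 1 + 2 * (3 * res_depth) + 1

print(total_layers(12))  # 74  (the configuration used here)
print(total_layers(18))  # 110 (the configuration I wanted to run)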
The network implementation in Chainer is as follows.
import math

class ResidualBlock(chainer.Chain):
    def __init__(self, ch_in, ch_out, stride=1, swapout=False, skip_ratio=0, activation1=F.relu, activation2=F.relu):
        w = math.sqrt(2)
        super(ResidualBlock, self).__init__(
            conv1=L.Convolution2D(ch_in, ch_out, 3, stride, 1, w),
            bn1=L.BatchNormalization(ch_out),
            conv2=L.Convolution2D(ch_out, ch_out, 3, 1, 1, w),
            bn2=L.BatchNormalization(ch_out),
        )
        self.activation1 = activation1
        self.activation2 = activation2
        self.skip_ratio = skip_ratio
        self.swapout = swapout

    def __call__(self, x, train):
        # Stochastic Depth: during training, skip the convolutional branch
        # with probability skip_ratio.
        skip = False
        if train and self.skip_ratio > 0 and np.random.rand() < self.skip_ratio:
            skip = True
        # Compute the output shape of the convolutional branch.
        sh, sw = self.conv1.stride
        c_out, c_in, kh, kw = self.conv1.W.data.shape
        b, c, hh, ww = x.data.shape
        if sh == 1 and sw == 1:
            shape_out = (b, c_out, hh, ww)
        else:
            hh = (hh + 2 - kh) // sh + 1
            ww = (ww + 2 - kw) // sw + 1
            shape_out = (b, c_out, hh, ww)
        h = x
        # Identity path: if the number of channels grows, pad the extra channels
        # with zeros; if the resolution shrinks, downsample with average pooling.
        if x.data.shape != shape_out:
            xp = chainer.cuda.get_array_module(x.data)
            n, c, hh, ww = x.data.shape
            pad_c = shape_out[1] - c
            p = xp.zeros((n, pad_c, hh, ww), dtype=xp.float32)
            p = chainer.Variable(p, volatile=not train)
            x = F.concat((p, x))
            if x.data.shape[2:] != shape_out[2:]:
                x = F.average_pooling_2d(x, 1, 2)
        if skip:
            return x
        h = self.bn1(self.conv1(h), test=not train)
        if self.activation1 is not None:
            h = self.activation1(h)
        h = self.bn2(self.conv2(h), test=not train)
        if not train:
            # At test time, scale the branch by its expected keep probability.
            h = h * (1 - self.skip_ratio)
        if self.swapout:
            h = F.dropout(h, train=train) + F.dropout(x, train=train)
        else:
            h = h + x
        if self.activation2 is not None:
            return self.activation2(h)
        else:
            return h
class ResidualNet(chainer.Chain):
    def __init__(self, depth=18, swapout=False, skip=True):
        super(ResidualNet, self).__init__()
        # Build the network as a list of (name, link/function, needs-train-flag) tuples.
        links = [('bconv1', BatchConv2D(3, 16, 3, 1, 1), True)]
        skip_size = depth * 3 - 3
        for i in six.moves.range(depth):
            if skip:
                skip_ratio = float(i) / skip_size * 0.5
            else:
                skip_ratio = 0
            links.append(('res{}'.format(len(links)), ResidualBlock(16, 16, swapout=swapout, skip_ratio=skip_ratio), True))
        links.append(('res{}'.format(len(links)), ResidualBlock(16, 32, stride=2, swapout=swapout), True))
        for i in six.moves.range(depth - 1):
            if skip:
                skip_ratio = float(i + depth) / skip_size * 0.5
            else:
                skip_ratio = 0
            links.append(('res{}'.format(len(links)), ResidualBlock(32, 32, swapout=swapout, skip_ratio=skip_ratio), True))
        links.append(('res{}'.format(len(links)), ResidualBlock(32, 64, stride=2, swapout=swapout), True))
        for i in six.moves.range(depth - 1):
            if skip:
                skip_ratio = float(i + depth * 2 - 1) / skip_size * 0.5
            else:
                skip_ratio = 0
            links.append(('res{}'.format(len(links)), ResidualBlock(64, 64, swapout=swapout, skip_ratio=skip_ratio), True))
        # Global average pooling over the final 8x8 feature map, then a linear classifier.
        links.append(('_apool{}'.format(len(links)), F.AveragePooling2D(8, 1, 0, False, True), False))
        links.append(('fc{}'.format(len(links)), L.Linear(64, 10), False))
        for name, f, _with_train in links:
            if not name.startswith('_'):
                self.add_link(*(name, f))
        self.layers = links

    def __call__(self, x, train=True):
        h = x
        for name, f, with_train in self.layers:
            if with_train:
                h = f(h, train=train)
            else:
                h = f(h)
        return h
There is a parameter called swapout, but it is an experimental implementation, so ignore it for now. In the Residual Block, when the input and output shapes differ (the width and height are smaller at the output, while the number of channels stays the same or grows), the identity path is handled as follows: the additional output channels are filled with zero padding, and if the spatial size shrinks, the input is downsampled with stride-2 average pooling.
Execute it with the following command.
python src/train.py -g 0 -m residual -b 128 -p residual --res_depth 12 --optimizer sgd --lr 0.1 --iter 300 --lr_decay_iter 100
"-m residual" specifies the Residual Network.
"--optimizer sgd" specifies Momentum SGD, which gave better results here than Adam.
The initial learning rate is set to 0.1; the Residual Network implementations of CIFAR-10 classification listed below also used an initial learning rate of 0.1.
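As a rough sketch of this optimizer setup in the Chainer v1 API (my own illustration; the momentum and weight decay values are assumptions, not necessarily what train.py uses):

import chainer

model = ResidualNet(depth=12)
optimizer = chainer.optimizers.MomentumSGD(lr=0.1, momentum=0.9)  # initial learning rate 0.1
optimizer.setup(model)
optimizer.add_hook(chainer.optimizer.WeightDecay(0.0001))  # assumed weight decay value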
It took about 10 hours to run. The test error rate was 8.06%, which was worse than the VGG-like model.
Stochastic Depth is a method that randomly skips Residual Blocks during training. It is explained in detail in this article.
The network code is the one shown above in the Residual Network section. Each ResidualBlock is given a skip_ratio property, and during training the CNN branch of the block is skipped with the probability given by skip_ratio. At test time, the output of the CNN branch is multiplied by (1 - skip_ratio).
The skip_ratio is given a slope so that deeper Residual Blocks get a larger value: here the first Residual Block has a skip_ratio of 0, the deepest block has a skip_ratio of 0.5, and the blocks in between change linearly.
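As a small illustration of that linear schedule (my own helper, not code from the repository):

def linear_skip_ratios(num_blocks, max_ratio=0.5):
    # skip_ratio rises linearly from 0 (first block) to max_ratio (deepest block)
    return [max_ratio * i / (num_blocks - 1) for i in range(num_blocks)]

print(linear_skip_ratios(5))  # [0.0, 0.125, 0.25, 0.375, 0.5]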
Execute it with the following command.
python src/train.py -g 0 -m residual -b 128 -p residual_skip --skip_depth --res_depth 12 --optimizer sgd --lr 0.1 --iter 300 --lr_decay_iter 100
"--skip_depth" is the option that enables Stochastic Depth.
It took about 9 hours to train. The test error rate improved to 7.42%.
I classified the CIFAR-10 image dataset using various models, and was able to confirm how the recognition rate changes with the preprocessing, the presence or absence of Batch Normalization, and the choice of model.
This time I implemented everything in Chainer, but, for example, the Torch7 implementation of Stochastic Depth (https://github.com/yueatsprograms/Stochastic_Depth) is faster and uses less memory than the implementation used here; in the same environment (Ubuntu) I was able to train a 56-layer Residual Network with it. If you want to run a deeper model, it may be worth considering another framework.