Computer Vision: Semantic Segmentation Part1 - ImageNet pretraining VoVNet

Target

This article summarizes semantic segmentation using the Microsoft Cognitive Toolkit (CNTK).

Part 1 pretrains the CNN used as the backbone for semantic segmentation. The pretraining uses 1,000 categories of ImageNet images.

The topics are introduced in the following order.

  1. Download and prepare from ImageNet
  2. VoVNet : One-Shot Aggregation module
  3. Settings in training

Introduction

Download and prepare from ImageNet

ImageNet [1] is a large-scale image database with more than 14 million registered images. Until 2017, it was used in the image recognition competition ILSVRC.

This time, we collected 1,000 categories of training data by downloading images from the URLs managed by ImageNet. However, since none of the images for the 850th category (teddy, teddy bear) could be downloaded, I substituted the images prepared in Computer Vision: Image Classification Part1 - Understanding COCO dataset.

Also, the downloaded images contained quite a few corrupted JPEG files and images unrelated to their category, so I cleaned them both automatically and manually. The final collection contained 775,983 images.
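As a rough illustration of the automatic part of that cleaning, the sketch below flags files that do not look like complete JPEGs. It is only a heuristic and not the method used in rtss_imagenet.py: it assumes a healthy file starts with the JPEG start-of-image marker and ends with the end-of-image marker. The function names are hypothetical, and images unrelated to their category still need manual review.

```python
from pathlib import Path

JPEG_SOI = b"\xff\xd8"  # start-of-image marker
JPEG_EOI = b"\xff\xd9"  # end-of-image marker


def looks_like_complete_jpeg(path):
    """Heuristic: the file begins with SOI and ends with EOI."""
    data = Path(path).read_bytes()
    return len(data) >= 4 and data[:2] == JPEG_SOI and data[-2:] == JPEG_EOI


def find_suspect_jpegs(root):
    """Return every *.jpg under root that fails the heuristic, sorted by path."""
    return sorted(p for p in Path(root).rglob("*.jpg")
                  if not looks_like_complete_jpeg(p))
```

A truncated download usually lacks the end-of-image marker, so it shows up in the list returned by find_suspect_jpegs. A file that passes this check can still fail to decode, so a decoding check with an image library is a reasonable second step.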

The structure of the directory this time is as follows.

COCO
MNIST
NICS
RTSS
 |―ImageNet
  |―n01440764
   |―n01440764_0.jpg
   |―…
 rtss_imagenet.py
 rtss_vovnet57.py
SSMD

VoVNet : One-Shot Aggregation module

This time, we adopted VoVNet [2] (Variety of View Network) as the convolutional neural network model. VoVNet is a CNN model that uses less memory and has a lower computational cost than DenseNet [3].

One-Shot Aggregation module

VoVNet uses the One-Shot Aggregation (OSA) module, which is boxed in the figure below.

osa.png

VoVNet57

The network configuration of VoVNet57 is as follows.

Layer                  Filters     Size/Stride    Input         Output
Convolution2D          64          3x3/2          3x224x224     64x112x112
Convolution2D          64          3x3/1          64x112x112    64x112x112
Convolution2D          128         3x3/1          64x112x112    128x112x112
MaxPooling2D           -           3x3/2          128x112x112   128x56x56
OSA module             128, 256    3x3/1, 1x1/1   128x56x56     256x56x56
MaxPooling2D           -           3x3/2          256x56x56     256x28x28
OSA module             160, 512    3x3/1, 1x1/1   256x28x28     512x28x28
MaxPooling2D           -           3x3/2          512x28x28     512x14x14
OSA module             192, 768    3x3/1, 1x1/1   512x14x14     768x14x14
OSA module             192, 768    3x3/1, 1x1/1   768x14x14     768x14x14
OSA module             192, 768    3x3/1, 1x1/1   768x14x14     768x14x14
OSA module             192, 768    3x3/1, 1x1/1   768x14x14     768x14x14
MaxPooling2D           -           3x3/2          768x14x14     768x7x7
OSA module             224, 1024   3x3/1, 1x1/1   768x7x7       1024x7x7
OSA module             224, 1024   3x3/1, 1x1/1   1024x7x7      1024x7x7
OSA module             224, 1024   3x3/1, 1x1/1   1024x7x7      1024x7x7
GlobalAveragePooling   -           global         1024x7x7      1024x1x1
Dense                  1000        -              1024x1x1      1000x1x1
Softmax                1000        -              1000          1000

The network has a total of 57 convolution layers and downsamples the input by a factor of 32. The total number of parameters is 31,429,159.
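The layer count and the downsampling factor can be checked against the table with a little arithmetic. The sketch below assumes that each OSA module contains five 3x3 convolutions followed by one 1x1 convolution, as in the VoVNet paper [2]; the table itself does not state that count. The constant names are hypothetical.

```python
# Number of OSA modules per stage, read from the configuration table.
OSA_MODULES_PER_STAGE = {"stage2": 1, "stage3": 1, "stage4": 4, "stage5": 3}

STEM_CONVS = 3         # the three Convolution2D layers at the top of the table
CONVS_PER_OSA = 5 + 1  # assumption from [2]: five 3x3 convs plus one 1x1 conv

conv_layers = STEM_CONVS + CONVS_PER_OSA * sum(OSA_MODULES_PER_STAGE.values())

# One stride-2 convolution in the stem plus four stride-2 max-pooling layers.
STRIDE2_OPS = 1 + 4
downsampling = 2 ** STRIDE2_OPS

print(conv_layers, downsampling)  # 57 32
```

With 32x downsampling, the 224x224 input becomes the 7x7 feature map shown before global average pooling.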

Each convolution layer omits the bias term and applies Batch Normalization [4] before the activation function.

The last fully connected layer uses a bias term instead of Batch Normalization.

Activation function Mish

We adopted Mish [5] as the activation function. Mish has been reported to outperform Swish [6], which in turn outperforms ReLU. Mish can be easily implemented by combining the softplus function and the tanh function, as expressed by the following formula.

Mish(x) = x \cdot \tanh \left( \log (1 + e^x) \right)

Mish looks like the figure below.

mish.png

Mish avoids the dead neuron problem of ReLU. In addition, while the derivative of ReLU is discontinuous, Mish remains continuous no matter how many times it is differentiated, which makes the loss function smoother and easier to optimize.
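The Mish formula above can be sketched in plain Python as follows. This is a scalar illustration, not the CNTK implementation used for training. The softplus term is written in a numerically stable form, which is an implementation choice added here to avoid overflow for large inputs.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """Mish(x) = x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"mish({x:+.1f}) = {mish(x):+.4f}")
```

For negative inputs the output is small but non-zero, so the gradient does not vanish completely as it does with ReLU.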

Settings in training

The input image is normalized by dividing it by the maximum brightness value of 255.

The parameters of each layer were initialized from the He normal distribution [7].
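As a minimal sketch of these two settings, the code below scales pixel values into [0, 1] and draws weights from a zero-mean normal distribution with standard deviation sqrt(2 / fan_in), following He et al. [7]. The function names, the flat list layout, and computing fan_in as input channels times kernel height times kernel width are assumptions for illustration. The training program relies on the framework's own initializer.

```python
import math
import random


def scale_pixels(pixels):
    """Map 8-bit brightness values to the range [0, 1]."""
    return [p / 255.0 for p in pixels]


def he_normal_std(fan_in):
    """Standard deviation of the He normal distribution."""
    return math.sqrt(2.0 / fan_in)


def he_normal_weights(num_filters, in_channels, kernel_h, kernel_w, seed=0):
    """Return num_filters rows, each holding fan_in sampled weights."""
    fan_in = in_channels * kernel_h * kernel_w
    std = he_normal_std(fan_in)
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)]
            for _ in range(num_filters)]


# First convolution of the table: 64 filters of 3x3 over 3 input channels.
weights = he_normal_weights(64, 3, 3, 3)
print(len(weights), len(weights[0]), round(he_normal_std(27), 4))
```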

The loss function is Cross Entropy Error, and the optimization algorithm is Stochastic Gradient Descent (SGD) with Momentum. The momentum was fixed at 0.9.

The learning rate follows the Cyclical Learning Rate (CLR) [8] with a maximum learning rate of 0.1, a base learning rate of 1e-4, a step size of 10 times the number of epochs, and the triangular2 policy.
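The triangular2 policy can be sketched as below, following the formula in Smith [8]: the learning rate rises linearly from the base to the maximum over one step, falls back over the next step, and the amplitude is halved after every full cycle. The function name is hypothetical, and the step size of 1000 iterations in the usage lines is only a placeholder, not the value used in training.

```python
import math


def triangular2_lr(iteration, step_size, base_lr=1e-4, max_lr=0.1):
    """Cyclical learning rate with the triangular2 policy."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    scale = 1.0 / (2 ** (cycle - 1))  # halve the amplitude each cycle
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x) * scale


STEP_SIZE = 1000  # placeholder value for illustration only
for it in (0, 1000, 2000, 3000, 4000):
    print(it, round(triangular2_lr(it, STEP_SIZE), 5))
```

The first peak reaches the maximum learning rate, and the second peak reaches only the midpoint between the base and maximum rates.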

As a measure against overfitting, I set the L2 regularization coefficient to 0.0005.
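To show how momentum and L2 regularization enter a single parameter update, here is a scalar sketch in the classical momentum form. It assumes the L2 penalty is (l2 / 2) * w^2, so its gradient is l2 * w. The function name is hypothetical, and the exact update performed by the CNTK learner may be scaled differently.

```python
def sgd_momentum_step(w, grad, velocity, lr, momentum=0.9, l2=0.0005):
    """One SGD-with-momentum update of a single weight with L2 regularization."""
    g = grad + l2 * w                          # add the L2 penalty gradient
    velocity = momentum * velocity - lr * g    # accumulate past updates
    return w + velocity, velocity


w, v = 1.0, 0.0
for _ in range(3):
    # A constant loss gradient of 0.5 is used here purely as an example.
    w, v = sgd_momentum_step(w, grad=0.5, velocity=v, lr=0.1)
    print(round(w, 6), round(v, 6))
```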

The model was trained for 100 epochs with a mini-batch size of 64.
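For a sense of scale, the number of parameter updates implied by these settings can be worked out from the dataset size given earlier. The calculation assumes that the final, partially filled mini-batch of each epoch also counts as one iteration.

```python
import math

NUM_IMAGES = 775_983  # final number of collected images
BATCH_SIZE = 64
NUM_EPOCHS = 100

iterations_per_epoch = math.ceil(NUM_IMAGES / BATCH_SIZE)
total_iterations = iterations_per_epoch * NUM_EPOCHS

print(iterations_per_epoch, total_iterations)  # 12125 1212500
```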

Implementation

Execution environment

Hardware

・ CPU Intel (R) Core (TM) i7-5820K 3.30GHz
・ GPU NVIDIA Quadro RTX 5000 16GB

Software

・ Windows 10 Pro 1909
・ CUDA 10.0
・ cuDNN 7.6
・ Python 3.6.6
・ cntk-gpu 2.7
・ cntkx 0.1.50
・ numpy 1.17.3
・ opencv-contrib-python 4.1.1.26
・ pandas 0.25.0
・ requests 2.22.0

Program to run

The program that downloads images from ImageNet and the training program are available on GitHub.

rtss_imagenet.py


rtss_vovnet57.py


Result

The figure below visualizes the logs of the loss function and the error rate during training. The graph on the left shows the loss function and the graph on the right shows the error rate. In both graphs, the horizontal axis is the number of epochs and the vertical axis is the respective value.

vovnet57_logging.png

Now that we have a pretrained backbone CNN, Part 2 will complete the model by adding the mechanisms needed for semantic segmentation.

Reference

ImageNet
Microsoft COCO Common Objects in Context

Computer Vision : Image Classification Part1 - Understanding COCO dataset

  1. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. "ImageNet: A Large-Scale Hierarchical Image Database", IEEE conference on Computer Vision and Pattern Recognition (CVPR). 2009, p. 248-255.
  2. Youngwan Lee, Joong-won Hwang, Sangrok Lee, Yuseok Bae, and Jongyoul Park. "An Energy and GPU-Computation Efficient Backbone Network for Real-Time Object Detection", The IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2019.
  3. Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. "Densely Connected Convolutional Networks", The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017, p. 4700-4708.
  4. Sergey Ioffe and Christian Szegedy. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", arXiv preprint arXiv:1502.03167 (2015).
  5. Diganta Misra. "Mish: A Self Regularized Non-Monotonic Neural Activation Function", arXiv preprint arXiv:1908.08681 (2019).
  6. Prajit Ramachandran, Barret Zoph, and Quoc V. Le. "Searching for Activation Functions", arXiv preprint arXiv:1710.05941 (2017).
  7. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification", The IEEE International Conference on Computer Vision (ICCV). 2015, p. 1026-1034.
  8. Leslie N. Smith. "Cyclical Learning Rates for Training Neural Networks", 2017 IEEE Winter Conference on Applications of Computer Vision. 2017, p. 464-472.
