This is a continuation of semantic segmentation using the Microsoft Cognitive Toolkit (CNTK).
In Part 2, semantic segmentation is performed using the pretrained model prepared in Part 1. It is assumed that you have an NVIDIA GPU with CUDA installed and an SSD with a capacity of 500 GB or more.
In Computer Vision: Semantic Segmentation Part1 - ImageNet pretraining VoVNet, we pretrained a CNN model using images collected from ImageNet.
In Part 2, we will create and train a semantic segmentation model with a neural network.
ADEChallengeData2016
We use ADEChallengeData2016 [1] as the semantic segmentation dataset. Download the zip file from the link below and unzip it. ADEChallengeData2016 predicts a total of 151 categories: 150 category labels plus background.
The input image is a BGR color image with a size of 320x480 and the output map is 151x320x480. For the category label information, I saved a 151x320x480 integer array as a numpy file.
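As a reference, the following is a minimal sketch of how such a one-hot label array could be produced from an ADEChallengeData2016 annotation PNG and saved as a numpy file. The file paths and the exact category layout (index 0 as background) are my assumptions, not necessarily the code in rtss_ade20k.py.

```python
import cv2
import numpy as np

num_classes = 151          # 150 categories + background
height, width = 320, 480   # target resolution used in this article

# annotation PNGs store the category index per pixel (0 = background); path is illustrative
label = cv2.imread("./ADEChallengeData2016/annotations/training/ADE_train_00000001.png",
                   cv2.IMREAD_GRAYSCALE)
label = cv2.resize(label, (width, height), interpolation=cv2.INTER_NEAREST)

# convert to a one-hot array of shape (151, 320, 480) and save it as a numpy file
onehot = np.eye(num_classes, dtype=np.uint8)[label]   # (H, W, C)
onehot = onehot.transpose(2, 0, 1)                    # (C, H, W)
np.save("./train_label_00000001.npy", onehot)
```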
This time, I created a model based on Joint Pyramid Upsampling (JPU) [2] and the decoder of DeepLabv3+ [3]. The outline of the implemented neural network is shown in the figure below.
The implemented model can be roughly divided into three stages of processing.
Several of the added convolution layers employ Separable Convolution [4]; in particular, the JPU uses Dilated Separable Convolution, which combines Separable Convolution with Dilated Convolution [5].
We applied Batch Normalization [6] immediately after every convolution layer except the final 1x1 convolution, and adopted Mish [7] as the activation function.
The parameters of the convolution layers to be trained are initialized with the He normal distribution [8].
This time we use a multi-task loss function: Focal Loss [9] for the classification of imbalanced categories and Generalized Dice Loss [10] for maximizing the overlap between the predicted and correct regions.
Loss = Focal Loss + Generalized Dice Loss
Adam [11] was used as the optimization algorithm. Adam's hyperparameter $\beta_1$ was set to 0.9 and $\beta_2$ to CNTK's default value.
For the learning rate, we use the Cyclical Learning Rate (CLR) [12], with a maximum learning rate of 1e-3, a base learning rate of 1e-5, a step size of 10 times the number of epochs, and the triangular2 policy.
Model training was run for 100 epochs with a mini-batch size of 8.
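The triangular2 policy oscillates the learning rate between the base and maximum values while halving the amplitude every cycle. A minimal sketch of the schedule, assuming the step size is expressed in iterations (the value below is only illustrative), might look like this:

```python
import numpy as np

def triangular2_clr(iteration, base_lr=1e-5, max_lr=1e-3, step_size=1000):
    """Cyclical Learning Rate with the triangular2 policy [12].

    The rate rises linearly from base_lr to max_lr over step_size iterations,
    falls back over the next step_size iterations, and the peak amplitude is
    halved after every full cycle.
    """
    cycle = np.floor(1 + iteration / (2 * step_size))
    x = np.abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x) / (2.0 ** (cycle - 1))

# example: learning rate at iteration 500 of the first cycle
print(triangular2_clr(500))
```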
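As a rough sketch, the learner and trainer setup in CNTK could look like the following. The schedule helpers shown here and the way the CLR value is fed back each mini-batch are my assumptions and may differ from rtss_training.py; model, loss, and dice are assumed to be defined elsewhere.

```python
import cntk as C

# model, loss (Focal + Generalized Dice) and dice metric are assumed to be defined elsewhere
lr_schedule = C.learning_parameter_schedule(1e-3)   # replaced each mini-batch with the CLR value
momentum = C.momentum_schedule(0.9)                 # beta_1 = 0.9; beta_2 stays at CNTK's default

learner = C.adam(model.parameters, lr=lr_schedule, momentum=momentum)
trainer = C.Trainer(model, (loss, dice), [learner])

# inside the training loop, the learning rate can be updated from the CLR schedule, e.g.
# learner.reset_learning_rate(C.learning_parameter_schedule(triangular2_clr(iteration)))
```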
・CPU: Intel(R) Core(TM) i7-5820K 3.30GHz
・GPU: NVIDIA Quadro RTX 6000 24GB
・Windows 10 Pro 1909
・CUDA 10.0
・cuDNN 7.6
・Python 3.6.6
・cntk-gpu 2.7
・cntkx 0.1.50
・h5py 2.10.0
・numpy 1.17.3
・opencv-contrib-python 4.1.1.26
・pandas 0.25.0
The program for creating training data and the program for training are available on GitHub.
rtss_ade20k.py
rtss_training.py
Below, I supplement the main points of this implementation.
Dilated Separable Convolution
Separable Convolution
Separable Convolution applies an independent spatial convolution for each channel (depthwise) and then a 1x1 convolution across channels (pointwise) in sequence, as shown below.
Following Xception [4], no activation function is applied between depthwise and pointwise.
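A minimal sketch of such a Separable Convolution block in CNTK is shown below. It assumes that the groups and dilation arguments of C.layers.Convolution2D and the mish activation in cntkx are available in the versions listed above; the function name and filter arrangement are mine, not necessarily those of the actual model.

```python
import cntk as C
import cntkx as Cx  # assumed to provide the Mish activation as Cx.mish

def separable_convolution(x, num_filters, dilation=1):
    """Depthwise 3x3 convolution followed by a pointwise 1x1 convolution.

    Following Xception [4], no activation function is applied between the
    depthwise and pointwise steps.
    """
    in_channels = x.shape[0]

    # depthwise: one 3x3 filter per input channel (groups = number of channels)
    h = C.layers.Convolution2D((3, 3), in_channels, pad=True, bias=False,
                               groups=in_channels, dilation=dilation)(x)
    # pointwise: 1x1 convolution mixing the channels
    h = C.layers.Convolution2D((1, 1), num_filters, pad=True, bias=False)(h)

    # Batch Normalization and Mish follow the block, as in the rest of the model
    h = C.layers.BatchNormalization(map_rank=1)(h)
    return Cx.mish(h)
```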
Dilated Convolution
In Dilated Convolution, as shown in the figure below, zeros are inserted between the weights of the convolution filter to enlarge its effective size, and the convolution is performed with this expanded filter. This widens the field of view of the convolution filter. When $r = 1$, it is a normal convolution.
The figure below compares a vertical Gaussian derivative filter between regular convolution and Dilated Convolution.
You can see that the Dilated Convolution result is smoother than the normal convolution result.
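To make the idea concrete, the following sketch expands a 3x3 filter with dilation rate $r$ by inserting zeros between its weights; with $r = 2$ the effective filter size becomes 5x5, which is what widens the field of view.

```python
import numpy as np

def dilate_filter(kernel, r):
    """Insert (r - 1) zeros between the weights of a square 2D convolution filter."""
    k = kernel.shape[0]
    size = r * (k - 1) + 1                  # effective filter size
    dilated = np.zeros((size, size), dtype=kernel.dtype)
    dilated[::r, ::r] = kernel              # original weights land on a strided grid
    return dilated

kernel = np.arange(1, 10, dtype=np.float32).reshape(3, 3)
print(dilate_filter(kernel, r=2).shape)     # (5, 5): a 3x3 filter dilated with r = 2
```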
Joint Pyramid Upsampling (JPU)
In the model implemented this time, the part that captures contextual information for semantic segmentation is the Joint Pyramid Upsampling module. The figure below shows the internal processing of the JPU, where 1/32, 1/16, and 1/8 denote the downscaling relative to the input size.
The JPU takes feature maps of three different resolutions as inputs. It first applies a normal 3x3 convolution to each, then upsamples them to 1/8 resolution and concatenates them. Next, four types of Dilated Separable Convolution are executed in parallel and their results are concatenated to form the output.
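A rough sketch of this JPU processing in CNTK follows. It assumes the three feature maps have already been brought to 1/8 resolution (the upsampling step is omitted), the number of filters is illustrative, and the dilation rates (1, 2, 4, 8) follow the JPU paper [2].

```python
import cntk as C

def dilated_separable(x, num_filters, dilation):
    # depthwise 3x3 with dilation, then pointwise 1x1 (see the Separable Convolution sketch above)
    h = C.layers.Convolution2D((3, 3), x.shape[0], pad=True, bias=False,
                               groups=x.shape[0], dilation=dilation)(x)
    return C.layers.Convolution2D((1, 1), num_filters, pad=True, bias=False)(h)

def jpu(f32, f16, f8, num_filters=128):
    """Joint Pyramid Upsampling [2]. f32, f16, f8 are feature maps assumed to
    have already been upsampled to 1/8 resolution."""
    # 3x3 convolution on each input, then concatenation along the channel axis
    convs = [C.layers.Convolution2D((3, 3), num_filters, pad=True, bias=False)(f)
             for f in (f32, f16, f8)]
    y = C.splice(*convs, axis=0)

    # four parallel Dilated Separable Convolutions with different rates, concatenated
    branches = [dilated_separable(y, num_filters, dilation=r) for r in (1, 2, 4, 8)]
    return C.splice(*branches, axis=0)
```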
Multi-Task Loss
Focal Loss
Assuming the probability predicted by Softmax is $p$, Cross Entropy and Focal Loss can be expressed by the following equations.
CrossEntropy = -\log(p) \\
FocalLoss = -\alpha(1 - p)^\gamma \log(p)
The figure below compares Cross Entropy and Focal Loss, with $p$ on the horizontal axis and the loss on the vertical axis. Focal Loss keeps the loss small for well-classified examples, e.g. $p$ in the range 0.8-1.0. In this implementation, we set $\alpha = 1, \gamma = 2$.
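A minimal sketch of Focal Loss on top of a per-pixel Softmax output in CNTK could look like this; it is only meant to illustrate the equation above, and the clipping constant is my addition for numerical stability.

```python
import cntk as C

def focal_loss(output, target, alpha=1.0, gamma=2.0):
    """Focal Loss [9] for a per-pixel Softmax output.

    output: raw network output of shape (151, 320, 480)
    target: one-hot ground truth of the same shape
    """
    p = C.clip(C.softmax(output, axis=0), 1e-7, 1.0)   # avoid log(0)
    ce = -C.element_times(target, C.log(p))            # per-pixel, per-class cross entropy
    weight = alpha * C.pow(1.0 - p, gamma)             # down-weight well-classified pixels
    return C.reduce_sum(C.element_times(weight, ce))
```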
Generalized Dice Loss
When the predicted region is $A$ and the correct region is $B$, the Dice coefficient is often used as an index quantifying the degree of overlap between the two regions.
Dice = \frac{2|A \cap B|}{|A| + |B|}
If the prediction area and the correct area match exactly, the Dice coefficient has a maximum value of 1. Therefore, assuming that the prediction is $ p $ and the correct answer is $ t $, Dice Loss is as follows.
Dice Loss = 1 - 2 \frac{\sum^C_{c=1}\sum^N_{i=1}p^c_i t^c_i}{\sum^C_{c=1}\sum^N_{i=1} \left( p^c_i + t^c_i \right)}
Where $C$ is the number of categories and $N$ is the total number of pixels. Generalized Dice Loss applies a per-category weight $w_c$ to Dice Loss to account for imbalance between categories.
Generalized Dice Loss = 1 - 2 \frac{\sum^C_{c=1}w_c \sum^N_{i=1}p^c_i t^c_i}{\sum^C_{c=1}w_c \sum^N_{i=1} \left( p^c_i + t^c_i \right)} \\
w_c = \frac{1}{\left( \sum^N_{i=1} t^c_i \right)^2}
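The Generalized Dice Loss can be sketched along the same lines; the epsilon terms are my additions for numerical stability, and the final comment shows how the two losses combine into the multi-task loss described above.

```python
import cntk as C

def generalized_dice_loss(output, target, epsilon=1e-7):
    """Generalized Dice Loss [10] with per-category weights w_c = 1 / (sum_i t_i^c)^2."""
    p = C.softmax(output, axis=0)

    # sum over the spatial axes, keeping the category axis
    intersection = C.reduce_sum(C.element_times(p, target), axis=[1, 2])
    union = C.reduce_sum(p + target, axis=[1, 2])
    w = 1.0 / (C.square(C.reduce_sum(target, axis=[1, 2])) + epsilon)

    dice = 2.0 * C.reduce_sum(w * intersection) / (C.reduce_sum(w * union) + epsilon)
    return 1.0 - dice

# multi-task loss: Loss = Focal Loss + Generalized Dice Loss
# loss = focal_loss(output, target) + generalized_dice_loss(output, target)
```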
Training loss and Dice coefficient
The figure below visualizes the loss function and Dice coefficient logged during training. The left graph shows the loss function and the right graph shows the Dice coefficient; the horizontal axis is the number of epochs and the vertical axis is the value of the loss function and the Dice coefficient, respectively.
Validation mIOU Score
Now that we have trained the semantic segmentation model, we evaluated its performance on the validation data.
For this performance evaluation, we calculated the mean intersection over union (mIOU). Using the validation set as the validation data gave the following result:
mIOU 3.0
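For reference, mIOU can be computed from a per-category confusion matrix accumulated over the validation images. The sketch below uses numpy and assumes the prediction and label are integer category maps of the same shape.

```python
import numpy as np

num_classes = 151

def update_confusion(confusion, prediction, label):
    """Accumulate a (C, C) confusion matrix from integer category maps."""
    idx = label.flatten() * num_classes + prediction.flatten()
    confusion += np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)
    return confusion

def mean_iou(confusion):
    intersection = np.diag(confusion)
    union = confusion.sum(axis=0) + confusion.sum(axis=1) - intersection
    iou = intersection / np.maximum(union, 1)   # avoid division by zero for absent categories
    return iou.mean()
```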
FPS and Demo
I also measured FPS as an index of execution speed. The measurement used the standard Python module time, and the hardware used was an NVIDIA GeForce GTX 1060 6GB GPU. Without coloring the results, the FPS is 4.0.
FPS 2.4
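The FPS measurement itself is straightforward with the time module. A sketch of the measurement loop, assuming a trained CNTK model and a list of preprocessed input frames, could look like this:

```python
import time

# frames: list of preprocessed input arrays, model: trained CNTK function (both assumed)
start = time.perf_counter()
for frame in frames:
    model.eval({model.arguments[0]: frame})
elapsed = time.perf_counter() - start

print("FPS {:.1f}".format(len(frames) / elapsed))
```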
Below is a video of trying semantic segmentation on a trained model.
Computer Vision : Semantic Segmentation Part1 - ImageNet pretraining VoVNet