This is a continuation of semantic segmentation using the Microsoft Cognitive Toolkit (CNTK).
In Part 2, semantic segmentation is performed using the pretrained model prepared in Part 1. It is assumed that you have an NVIDIA GPU with CUDA installed and an SSD with a capacity of 500GB or more.
In Computer Vision: Semantic Segmentation Part1 - ImageNet pretraining VoVNet, we pretrained a CNN model on images collected from ImageNet.
In Part 2, we will create and train a semantic segmentation model with a neural network.
ADEChallengeData2016
ADEChallengeData2016 [1] is used as the semantic segmentation dataset. Download the zip file from the link below and unzip it. ADEChallengeData2016 has a total of 151 categories to predict: 150 category labels plus background.
The input image is a BGR color image with a size of 320x480, and the output map is 151x320x480. The category label information for each image is saved as a 151x320x480 integer array in a numpy file.
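As a rough illustration of the label format described above, the sketch below converts a map of class indices into a 151x320x480 integer array. It assumes the array is a one-hot encoding, which the text does not state explicitly. The function `to_one_hot` and the random `class_map` are hypothetical and are not taken from rtss_ade20k.py.

```python
import numpy as np

NUM_CLASSES = 151  # 150 category labels plus background


def to_one_hot(class_map, num_classes=NUM_CLASSES):
    """Convert an (H, W) map of class indices into a (C, H, W) integer array."""
    # Row lookup in an identity matrix gives (H, W, C); move channels first.
    one_hot = np.eye(num_classes, dtype=np.uint8)[class_map]
    return np.ascontiguousarray(one_hot.transpose(2, 0, 1))


# Hypothetical annotation: random class indices stand in for a real label image.
rng = np.random.default_rng(0)
class_map = rng.integers(0, NUM_CLASSES, size=(320, 480))
label = to_one_hot(class_map)
print(label.shape)
```

An array like this could then be written with `np.save`, although the actual file layout used by the author's script may differ.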
This time, I created a model based on Joint Pyramid Upsampling (JPU) [2] and the decoder of DeepLabv3+ [3]. The outline of the implemented neural network is shown in the figure below.

The implemented model can be roughly divided into three stages of processing.
Several of the added convolutional layers employ Separable Convolution [4]; the JPU in particular uses Dilated Separable Convolution, which combines it with Dilated Convolution [5].
Batch Normalization [6] is applied immediately after every convolutional layer except the final 1x1 convolution, and Mish [7] is adopted as the activation function.
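For reference, Mish is defined as x * tanh(softplus(x)). The NumPy sketch below only illustrates the formula and is not the CNTK implementation used for training.

```python
import numpy as np


def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    # logaddexp(0, x) computes log(1 + exp(x)) without overflow for large x.
    return x * np.tanh(np.logaddexp(0.0, x))


x = np.linspace(-5.0, 5.0, 11)
print(mish(x))
```

Unlike ReLU, the function is smooth and lets small negative values pass through.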
The parameters of the convolutional layers to be trained are initialized with the He normal distribution [8].
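He initialization draws weights from a zero-mean normal distribution whose standard deviation is sqrt(2 / fan_in). The sketch below shows this in plain NumPy. The filter shape is a hypothetical example, and the code is not CNTK's own initializer.

```python
import numpy as np


def he_normal(shape, rng):
    """Sample convolution weights of shape (C_out, C_in, k_h, k_w)."""
    # fan_in is the number of inputs feeding one output unit.
    fan_in = shape[1] * shape[2] * shape[3]
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=shape)


rng = np.random.default_rng(0)
weights = he_normal((64, 32, 3, 3), rng)  # hypothetical 3x3 layer, 32 -> 64 channels
print(weights.std())
```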
This time we use a multi-task loss function: Focal Loss [9] to handle classification with imbalanced categories, and Generalized Dice Loss [10] to maximize the overlap between the prediction and the ground truth.
Loss = Focal Loss + Generalized Dice Loss
Adam [11] was used as the optimization algorithm. Adam's hyperparameter $ β_1 $ is set to 0.9, and $ β_2 $ is left at the CNTK default value.
For the learning rate, the Cyclical Learning Rate (CLR) [12] is used with a maximum learning rate of 1e-3, a base learning rate of 1e-5, a step size of 10 times the number of epochs, and the triangular2 policy.
The model was trained for 100 epochs with a mini-batch size of 8.
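The triangular2 policy mentioned above can be sketched as follows, using the formula from the CLR paper [12]. The learning rate rises linearly from the base to the maximum over `step_size` iterations, falls back over the next `step_size`, and the amplitude is halved after every cycle. The `step_size` of 100 in the usage lines is a hypothetical placeholder, not the value used in training.

```python
import math


def triangular2_lr(iteration, step_size, base_lr=1e-5, max_lr=1e-3):
    """Learning rate at a given iteration under the triangular2 policy."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    # The amplitude is halved at the start of each new cycle.
    scale = 1.0 / (2 ** (cycle - 1))
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x) * scale


# Start, first peak, end of first cycle, second (halved) peak.
schedule = [triangular2_lr(i, step_size=100) for i in (0, 100, 200, 300)]
print(schedule)
```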
・ CPU Intel(R) Core(TM) i7-5820K 3.30GHz
・ GPU NVIDIA Quadro RTX 6000 24GB
・ Windows 10 Pro 1909
・ CUDA 10.0
・ cuDNN 7.6
・ Python 3.6.6
・ cntk-gpu 2.7
・ cntkx 0.1.50
・ h5py 2.10.0
・ numpy 1.17.3
・ opencv-contrib-python 4.1.1.26
・ pandas 0.25.0
The program for creating training data and the program for training are available on GitHub.
rtss_ade20k.py
rtss_training.py
The main points of this implementation are explained below.
Dilated Separable Convolution
Separable Convolution
Separable Convolution applies a per-channel spatial convolution (depthwise) followed by a convolution across channels only (pointwise), in sequence, as shown below.

Following Xception [4], no activation function is applied between depthwise and pointwise.
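The two steps can be written out in NumPy as a minimal sketch. It assumes stride 1 and no padding, and all shapes are hypothetical. It also compares the number of weights with a regular convolution of the same size.

```python
import numpy as np


def depthwise_conv(x, dw):
    """x: (C, H, W), dw: (C, k, k). Each channel is filtered by its own kernel."""
    c, h, w = x.shape
    k = dw.shape[-1]
    out = np.zeros((c, h - k + 1, w - k + 1), dtype=x.dtype)
    for i in range(k):
        for j in range(k):
            # Accumulate one kernel tap for all channels at once.
            out += dw[:, i, j, None, None] * x[:, i:i + h - k + 1, j:j + w - k + 1]
    return out


def pointwise_conv(x, pw):
    """x: (C_in, H, W), pw: (C_out, C_in). A 1x1 convolution mixes channels only."""
    return np.einsum("oc,chw->ohw", pw, x)


def separable_conv(x, dw, pw):
    # No activation between the depthwise and pointwise steps.
    return pointwise_conv(depthwise_conv(x, dw), pw)


rng = np.random.default_rng(0)
c_in, c_out, k = 4, 8, 3  # hypothetical sizes
x = rng.standard_normal((c_in, 16, 16))
dw = rng.standard_normal((c_in, k, k))
pw = rng.standard_normal((c_out, c_in))
y = separable_conv(x, dw, pw)

separable_params = dw.size + pw.size   # C_in*k*k + C_out*C_in
regular_params = c_out * c_in * k * k  # weights of a regular k x k convolution
print(y.shape, separable_params, regular_params)
```

The saving in weights grows with the number of channels, which is the usual motivation for this factorization.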
Dilated Convolution
In Dilated Convolution, as shown in the figure below, zeros are inserted between the elements of the convolution filter to enlarge it, and the convolution is performed with the enlarged filter. This widens the field of view of the filter without adding parameters. When the dilation rate is $ r = 1 $, it is a normal convolution.

The figure below compares a vertical Gaussian derivative filter between regular convolution and Dilated Convolution.

You can see that the Dilated Convolution result is smoother than the normal convolution result.
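The zero insertion can be made concrete with a small sketch. A k x k filter with dilation rate r covers k + (k - 1)(r - 1) pixels per side while keeping the same k x k weights. The helper `dilate_kernel` is hypothetical and assumes a square filter.

```python
import numpy as np


def dilate_kernel(kernel, rate):
    """Insert (rate - 1) zeros between the elements of a square kernel."""
    k = kernel.shape[0]
    size = k + (k - 1) * (rate - 1)
    dilated = np.zeros((size, size), dtype=kernel.dtype)
    # Original weights land on every rate-th position.
    dilated[::rate, ::rate] = kernel
    return dilated


kernel = np.arange(1.0, 10.0).reshape(3, 3)
print(dilate_kernel(kernel, 2))
```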
Joint Pyramid Upsampling (JPU)
In the model implemented this time, Joint Pyramid Upsampling is the part that captures contextual information for semantic segmentation. The figure below shows the internal processing of the JPU, where 1/32, 1/16, and 1/8 represent the scale relative to the input size.

The JPU takes feature maps of three different resolutions as inputs, first applies a regular 3x3 convolution to each, then upsamples them to 1/8 resolution and concatenates them. Next, four types of Dilated Separable Convolution are executed in parallel, and their results are concatenated and output.
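To make the resolutions concrete, the following sketch traces only the shapes of the upsample-and-concatenate step for a 320x480 input. The convolutions are omitted, the channel count is hypothetical, and nearest-neighbor upsampling is an assumption; the actual model may use a different interpolation.

```python
import numpy as np


def upsample_nearest(x, factor):
    """Nearest-neighbor upsampling of a (C, H, W) feature map."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)


channels = 16  # hypothetical number of channels after the 3x3 convolutions
h, w = 320, 480
f8 = np.zeros((channels, h // 8, w // 8))     # 1/8 resolution: 40x60
f16 = np.zeros((channels, h // 16, w // 16))  # 1/16 resolution: 20x30
f32 = np.zeros((channels, h // 32, w // 32))  # 1/32 resolution: 10x15

# Bring everything to 1/8 resolution and stack along the channel axis.
merged = np.concatenate(
    [f8, upsample_nearest(f16, 2), upsample_nearest(f32, 4)], axis=0
)
print(merged.shape)
```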
Multi-Task Loss
Focal Loss
Assuming that the probability predicted by Softmax for the correct category is $ p $, Cross Entropy and Focal Loss can be expressed by the following equations.
$$
\begin{aligned}
\mathrm{CrossEntropy} &= -\log(p) \\
\mathrm{FocalLoss} &= -\alpha (1 - p)^\gamma \log(p)
\end{aligned}
$$
The figure below shows a comparison of Cross Entropy and Focal Loss. The horizontal axis represents $ p $ and the vertical axis represents the loss. Focal Loss keeps the loss small for well-classified examples, such as those with $ p $ between 0.8 and 1.0. In this implementation, we set $ \alpha = 1, \gamma = 2 $.
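The comparison can be reproduced numerically with a short sketch of the two formulas, here with alpha = 1 and gamma = 2. It evaluates the loss for a single probability and is not the CNTK training code.

```python
import numpy as np


def cross_entropy(p):
    """Cross entropy for the probability p assigned to the correct category."""
    return -np.log(p)


def focal_loss(p, alpha=1.0, gamma=2.0):
    """Focal loss: cross entropy scaled down by (1 - p)^gamma."""
    return -alpha * (1.0 - p) ** gamma * np.log(p)


for p in (0.1, 0.5, 0.9):
    print(p, cross_entropy(p), focal_loss(p))
```

With gamma = 2, the factor (1 - p)^2 is 0.01 at p = 0.9, so a well-classified pixel contributes about one hundredth of its cross entropy.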

Generalized Dice Loss
When the predicted region is $ A $ and the correct region is $ B $, the Dice coefficient is often used as a quantitative index of the degree of overlap between the two regions.
$$
\mathrm{Dice} = \frac{2|A \cap B|}{|A| + |B|}
$$
If the predicted region and the correct region match exactly, the Dice coefficient takes its maximum value of 1. Therefore, assuming that the prediction is $ p $ and the correct answer is $ t $, Dice Loss is as follows.
$$
\mathrm{Dice\ Loss} = 1 - 2 \frac{\sum^C_{c=1}\sum^N_{i=1}p^c_i t^c_i}{\sum^C_{c=1}\sum^N_{i=1} \left( p^c_i + t^c_i \right)}
$$
Where $ C $ is the number of categories and $ N $ is the total number of pixels. Generalized Dice Loss applies a weight $ w_c $ to Dice Loss to compensate for differences in region size between categories.
$$
\begin{aligned}
\mathrm{Generalized\ Dice\ Loss} &= 1 - 2 \frac{\sum^C_{c=1}w_c \sum^N_{i=1}p^c_i t^c_i}{\sum^C_{c=1}w_c \sum^N_{i=1} \left( p^c_i + t^c_i \right)} \\
w_c &= \frac{1}{\left( \sum^N_{i=1} t^c_i \right)^2}
\end{aligned}
$$
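A direct NumPy transcription of the Generalized Dice Loss formula is sketched below for arrays of shape (C, N). Two details are assumptions of this sketch and not taken from the text: categories that do not appear in the ground truth get weight 0 (the formula would otherwise divide by zero), and a small `eps` guards the final division.

```python
import numpy as np


def generalized_dice_loss(p, t, eps=1e-7):
    """p, t: (C, N) arrays of predicted probabilities and one-hot targets."""
    counts = t.sum(axis=1)
    # w_c = 1 / (sum_i t_i^c)^2; absent categories are given weight 0 (assumption).
    w = np.zeros_like(counts, dtype=float)
    present = counts > 0
    w[present] = 1.0 / counts[present] ** 2
    numerator = (w * (p * t).sum(axis=1)).sum()
    denominator = (w * (p + t).sum(axis=1)).sum()
    return 1.0 - 2.0 * numerator / (denominator + eps)


# Three categories, four pixels; the third category never occurs.
t = np.array([[1, 1, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 0]], dtype=float)
p_perfect = t.copy()
p_wrong = np.array([[0, 0, 0, 1],
                    [1, 1, 1, 0],
                    [0, 0, 0, 0]], dtype=float)
print(generalized_dice_loss(p_perfect, t), generalized_dice_loss(p_wrong, t))
```

Because of the weights, the single pixel of the second category influences the loss as much as the three pixels of the first.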
Training loss and dice coefficient
The figure below visualizes the logs of the loss function and the Dice coefficient during training. The graph on the left shows the loss function and the graph on the right shows the Dice coefficient; in both, the horizontal axis is the number of epochs and the vertical axis is the value.

Validation mIOU Score
Now that the semantic segmentation model has been trained, its performance is evaluated using the validation data.
For this performance evaluation, we calculated the mean intersection over union (mIOU). Using the dataset's validation images as the validation data resulted in the following:
mIOU 3.0
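For reference, mIOU can be computed from a confusion matrix as in the sketch below. The function `mean_iou` is hypothetical, and skipping categories that appear in neither the prediction nor the ground truth is an assumption; the exact protocol behind the score above is not described in the text.

```python
import numpy as np


def mean_iou(pred, target, num_classes):
    """pred, target: integer arrays of class indices with the same shape."""
    # Confusion matrix: rows are ground truth, columns are predictions.
    index = num_classes * target.ravel() + pred.ravel()
    cm = np.bincount(index, minlength=num_classes ** 2)
    cm = cm.reshape(num_classes, num_classes)
    intersection = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - intersection
    # Categories absent from both arrays are skipped (assumption).
    present = union > 0
    return (intersection[present] / union[present]).mean()


target = np.array([0, 0, 1, 1])
pred = np.array([0, 1, 1, 1])
print(mean_iou(pred, target, num_classes=3))
```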
FPS and Demo
I also measured FPS, which is an index of execution speed. The measurement used Python's standard time module, and the hardware used was an NVIDIA GeForce GTX 1060 6GB GPU. If the results are not colored, the speed is FPS 4.0.
FPS 2.4
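The measurement loop itself is not shown in the text. A minimal sketch with the standard time module might look like the following, where `measure_fps` and `dummy_predict` are hypothetical stand-ins for the real inference code and the warm-up pass is an assumption.

```python
import time


def measure_fps(predict, frames, warmup=1):
    """Return frames per second of predict() over the given frames."""
    # Run a few frames first so one-time setup cost is not counted.
    for frame in frames[:warmup]:
        predict(frame)
    start = time.perf_counter()
    for frame in frames:
        predict(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed


def dummy_predict(frame):
    # Stand-in for model inference plus optional coloring of the result.
    time.sleep(0.01)
    return frame


fps = measure_fps(dummy_predict, list(range(5)))
print(f"FPS {fps:.1f}")
```

Timing the loop with and without the coloring step would give the two kinds of figures quoted above.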
Below is a video of semantic segmentation being run with the trained model.

Computer Vision : Semantic Segmentation Part1 - ImageNet pretraining VoVNet