This is a continuation of object detection using the Microsoft Cognitive Toolkit (CNTK).
In Part2, we perform object detection with CNTK using the training data prepared in Part1. It is assumed that CNTK and an NVIDIA GPU with CUDA are installed.
In Computer Vision: Object Detection Part1 - Bounding Box preprocessing, we prepared the bounding boxes, category labels, and anchor boxes from Microsoft Common Objects in Context (COCO) [1].
In Part2, we will create and train a 1-stage object detection model using a neural network.
This time, I built a model that combines the multi-scale feature maps of SSD [[2]](#reference) and the direct location prediction of YOLOv2 [[3]](#reference). The outline of the implemented neural network is shown in the figure below.
Convolutional layers are added on top of the feature maps from the underlying pre-trained convolutional neural network (CNN). The added convolutional layers do not use a bias term, use Exponential Linear Units (ELUs) [[4]](#reference) as the activation function, and apply Batch Normalization [[5]](#reference).
The final output convolutional layers use a bias term, with no nonlinear activation function and no Batch Normalization, and predict the bounding box, objectness, and category.
The idea is to detect small objects on the 26x26 feature map, medium objects on the 13x13 feature map, and large objects on the 7x7 feature map. The anchor boxes are (0.06, 0.08) for 26x26; (0.19, 0.28), (0.31, 0.67), and (0.66, 0.35) for 13x13; and (0.31, 0.67), (0.66, 0.35), and (0.83, 0.83) for 7x7.
YOLO's algorithm is used to predict the bounding box.
$$
\begin{aligned}
x &= \sigma(t_x) + c_x \\
y &= \sigma(t_y) + c_y \\
w &= p_w \log(1 + e^{t_w}) \\
h &= p_h \log(1 + e^{t_h}) \\
objectness &= \sigma(t_o)
\end{aligned}
$$
Here, the sigmoid function is applied to the network outputs $t_x, t_y$, and the upper-left coordinates $c_x, c_y$ of the grid cell are added to predict the center coordinates of the bounding box. To predict the width and height, the softplus function is applied to the network outputs $t_w, t_h$, and the result is multiplied by the anchor box width and height $p_w, p_h$. The sigmoid function is applied to the output $t_o$ to obtain the objectness.
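As a rough illustration of the equations above, here is a minimal pure-Python sketch of decoding one prediction. The function name `decode_box` and the units (center in grid-cell units, width and height in the same normalized units as the anchor box) are my assumptions for this example and are not taken from ssmd_training.py.

```python
import math


def sigmoid(v):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-v))


def softplus(v):
    """log(1 + e^v), written so that large |v| does not overflow."""
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))


def decode_box(t_x, t_y, t_w, t_h, t_o, c_x, c_y, p_w, p_h):
    """Decode raw network outputs for one anchor in one grid cell.

    (c_x, c_y) is the upper-left corner of the grid cell and
    (p_w, p_h) is the anchor box size (hypothetical argument names).
    """
    x = sigmoid(t_x) + c_x      # center x, offset inside the cell
    y = sigmoid(t_y) + c_y      # center y, offset inside the cell
    w = p_w * softplus(t_w)     # width scaled from the anchor width
    h = p_h * softplus(t_h)     # height scaled from the anchor height
    objectness = sigmoid(t_o)   # probability that an object exists
    return x, y, w, h, objectness


# All-zero outputs in cell (3, 4) with the 13x13 anchor (0.19, 0.28)
print(decode_box(0.0, 0.0, 0.0, 0.0, 0.0, 3, 4, 0.19, 0.28))
```

With all-zero outputs the center lands in the middle of the cell, because the sigmoid of 0 is 0.5, and the size is the anchor size scaled by $\log 2$.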
The parameters of the added convolutional layers were initialized with the He normal distribution [6].
This time we use a multi-task loss function: Generalized IoU Loss [7] for bounding box regression, Binary Cross Entropy for objectness prediction, and Cross Entropy Error for categorization. The details of the loss function are explained later.
Loss = Generalized IoU Loss + Binary Cross Entropy + Cross Entropy Error
Adam [8] was used as the optimization algorithm. Adam's hyperparameter $β_1$ is set to 0.9, and $β_2$ is left at the CNTK default value.
For the learning rate, the Cyclical Learning Rate (CLR) [9] is used, with a maximum learning rate of 1e-3, a base learning rate of 1e-5, a step size of 10 times the number of epochs, and the triangular2 policy.
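For reference, the triangular2 policy can be sketched in a few lines of plain Python. This follows the formula from the CLR paper [9] rather than the actual training script; the function name `triangular2_lr` and the step size of 1000 iterations in the example are placeholders I chose for illustration.

```python
import math


def triangular2_lr(iteration, step_size, base_lr=1e-5, max_lr=1e-3):
    """Learning rate at a given iteration under the triangular2 policy.

    One cycle is 2 * step_size iterations: the rate rises linearly from
    base_lr to the peak, then falls back. The amplitude is halved after
    every completed cycle.
    """
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    scale = 1.0 / (2 ** (cycle - 1))
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x) * scale


# Hypothetical step size of 1000 iterations
for it in (0, 1000, 2000, 3000):
    print(it, triangular2_lr(it, step_size=1000))
```

The rate starts at the base learning rate, reaches the maximum at the end of the first half cycle, returns to the base, and peaks at only half the amplitude in the second cycle.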
The model was trained for 100 epochs with mini-batch learning.
・ CPU Intel(R) Core(TM) i7-5820K 3.30GHz
・ GPU NVIDIA Quadro RTX 5000 16GB

・ Windows 10 Pro 1909
・ CUDA 10.0
・ cuDNN 7.6
・ Python 3.6.6
・ cntk-gpu 2.7
・ cntkx 0.1.13
・ h5py 2.9.0
・ numpy 1.17.3
・ pandas 0.25.0
・ scikit-learn 0.21.3
The training program is available on GitHub.
ssmd_training.py
The main points of this implementation are explained below.
Generalized IoU Loss
Squared error [10] and smooth L1 loss [2] [11] are commonly used as the loss function for bounding box regression. Alternatively, Intersection over Union (IoU), which measures the degree of overlap between the predicted bounding box and the correct bounding box, can be used.
However, IoU is 0 whenever the two bounding boxes do not overlap at all, which introduces more saddle points into the optimization. Generalized IoU (GIoU) [7] was proposed to address this problem.
Assuming that the predicted bounding box is $A$ and the correct bounding box is $B$, GIoU is defined as follows:
$$
\begin{aligned}
IoU &= \frac{A \cap B}{A \cup B} \\
GIoU &= IoU - \frac{C - (A \cup B)}{C} \\
GIoU Loss &= 1 - GIoU
\end{aligned}
$$
Here, $C$ represents the smallest rectangular area that encloses the two bounding boxes. GIoU takes values in the range [-1, 1].
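The definition above can be checked with a small pure-Python sketch. I assume here that boxes are given as corner coordinates `(x_min, y_min, x_max, y_max)` with positive area. The model itself predicts center coordinates and width/height, so this format and the function names are only for illustration.

```python
def giou(box_a, box_b):
    """Generalized IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)

    # Intersection (zero when the boxes do not overlap)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = area_a + area_b - inter
    iou = inter / union

    # C: smallest rectangle enclosing both boxes
    c_area = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))

    return iou - (c_area - union) / c_area


def giou_loss(box_a, box_b):
    """GIoU Loss = 1 - GIoU."""
    return 1.0 - giou(box_a, box_b)


print(giou_loss((0, 0, 1, 1), (0, 0, 1, 1)))  # identical boxes
print(giou_loss((0, 0, 1, 1), (2, 0, 3, 1)))  # disjoint boxes
```

For two disjoint boxes the IoU is 0, but the loss still depends on how far apart the boxes are, because the enclosing rectangle $C$ grows with the distance between them.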
Multi-Task Loss
When training a neural network that performs multiple tasks, a loss function is defined for each task. As mentioned above, the loss function used here consists of the following terms.
Loss = Generalized IoU Loss + Binary Cross Entropy + Cross Entropy Error
Generalized IoU Loss is calculated for the center coordinates and width/height of the bounding box, Binary Cross Entropy for the objectness that indicates whether an object exists, and Cross Entropy Error for object categorization.
Therefore, the formula for the loss function is:
$$
\begin{aligned}
Loss ={} & \lambda^{coord}_{obj} \sum^N \sum^B \left\{1 - \left(IoU - \frac{C - (A \cup B)}{C} \right) \right\} + \lambda^{coord}_{noobj} \sum^N \sum^B \left\{1 - \left(IoU - \frac{C - (A \cup B')}{C} \right) \right\} \\
& + \lambda^{conf}_{obj} \sum^N \sum^B -t \log(\sigma(t_o)) + \lambda^{conf}_{noobj} \sum^N \sum^B -(1 - t) \log(1 - \sigma(t_o)) \\
& + \lambda^{prob}_{obj} \sum^N \sum^B -t \log(p_c) + \lambda^{prob}_{noobj} \sum^N \sum^B -t \log(p_c)
\end{aligned}
$$

$$
\lambda^{coord}_{obj} = 1.0, \lambda^{coord}_{noobj} = 0.1, \lambda^{conf}_{obj} = 1.0, \lambda^{conf}_{noobj} = 0.1, \lambda^{prob}_{obj} = 1.0, \lambda^{prob}_{noobj} = 0.0
$$
Here, $A$, $B$, and $C$ represent the predicted bounding box, the correct bounding box, and the smallest rectangular area that encloses the two bounding boxes, respectively, and $B'$ represents the default bounding box. The default bounding box is a bounding box placed at the center coordinates of each grid cell with the same width and height as the anchor box.
The contribution of each loss term is adjusted by the coefficient $\lambda$, which is set to 1.0 when an object is present and to 0.1 or 0.0 when no object is present.
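To make the weighting concrete, here is a simplified sketch of the loss for a single predicted box, written in plain Python. It assumes the GIoU against the assigned target (the correct box when an object is present, the default box otherwise) has already been computed. The function `box_loss`, its arguments, and the `eps` clamp are hypothetical and only mirror the formula above; the actual training code operates on whole tensors.

```python
import math

# Coefficients from the text, keyed by "is an object present?"
LAMBDA = {
    "coord": {True: 1.0, False: 0.1},
    "conf": {True: 1.0, False: 0.1},
    "prob": {True: 1.0, False: 0.0},
}


def box_loss(giou_value, objectness, class_probs, has_object,
             class_index=None, eps=1e-7):
    """Weighted multi-task loss for one predicted bounding box.

    giou_value  : GIoU between the prediction and its assigned target
    objectness  : sigmoid(t_o)
    class_probs : predicted category probabilities
    has_object  : True if a correct bounding box was assigned
    """
    coord = 1.0 - giou_value  # GIoU Loss
    if has_object:
        conf = -math.log(max(objectness, eps))              # BCE, t = 1
        prob = -math.log(max(class_probs[class_index], eps))  # cross entropy
    else:
        conf = -math.log(max(1.0 - objectness, eps))        # BCE, t = 0
        prob = 0.0                                          # weight is 0.0 anyway
    return (LAMBDA["coord"][has_object] * coord
            + LAMBDA["conf"][has_object] * conf
            + LAMBDA["prob"][has_object] * prob)


# A box with no assigned object contributes only a tenth of its
# coordinate and objectness terms.
print(box_loss(0.5, 0.5, [0.5, 0.5], has_object=False))
```

Summing this value over all grid cells and anchor boxes in a mini-batch would give the total loss in the formula.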
Dynamic Target Assignment
In network training, not every predicted bounding box corresponds to a correct bounding box. Therefore, the correct bounding boxes and category labels are assigned dynamically.
For example, when the upper left image in the figure below is the input, the bounding boxes output by the network where an object is predicted are the red boxes in the upper right image, while the correct bounding box is the green box in the lower left image. The IoU between each output bounding box and the correct bounding box is calculated, and the correct bounding box and correct category label are assigned to the predicted bounding box with the largest IoU. The lower right image shows the predicted bounding box that received the correct bounding box in blue.
However, some of the predicted bounding boxes that were not assigned a correct bounding box still have a high IoU, so the correct bounding box and correct category label are assigned to them as well. The predicted bounding boxes assigned in this step are shown in light blue in the lower right image.
If no correct bounding box can be assigned, the object is treated as absent and the default bounding box is assigned.
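The assignment procedure can be sketched roughly as follows in plain Python. The corner-coordinate box format, the function names, and especially the `iou_threshold=0.5` used to decide what counts as a "high" IoU are assumptions for this example, since the text does not state the threshold.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def assign_targets(pred_boxes, gt_boxes, gt_labels, iou_threshold=0.5):
    """Return one (gt_box, gt_label) per prediction, or None.

    None means no correct box was assigned, so the prediction would be
    trained against the default bounding box instead.
    """
    assigned = [None] * len(pred_boxes)
    ious = [[iou(p, g) for p in pred_boxes] for g in gt_boxes]

    # Step 1: each correct box goes to the prediction with the largest IoU.
    for g, row in enumerate(ious):
        best = max(range(len(pred_boxes)), key=row.__getitem__)
        assigned[best] = (gt_boxes[g], gt_labels[g])

    # Step 2: remaining predictions with a high IoU also get the target.
    for g, row in enumerate(ious):
        for i, value in enumerate(row):
            if assigned[i] is None and value >= iou_threshold:
                assigned[i] = (gt_boxes[g], gt_labels[g])

    return assigned


preds = [(0, 0, 1, 1), (0, 0, 1, 0.9), (5, 5, 6, 6)]
print(assign_targets(preds, [(0, 0, 1, 1)], [3]))
```

In this example the first prediction is the best match, the second is picked up by the threshold step, and the third stays unassigned.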
Training loss and error
The figure below is a visualization of each loss function during training. From the left: GIoU Loss for bounding box regression, Binary Cross Entropy for objectness, and Cross Entropy Error for categorization. The horizontal axis represents the number of epochs, and the vertical axis represents the value of the loss function.
Validation mAP score
Having trained the 1-stage object detection model, I evaluated its performance on the validation data.
For this evaluation, I calculated mean Average Precision (mAP). I used sklearn to calculate the mAP, with the IoU threshold set to 0.5. Using val2014 as the validation data gave the following result:
mAP50 Score 10.3
FPS and demo
I also measured FPS, an index of execution speed. The measurement used the standard Python module time, and the GPU used was an NVIDIA GeForce GTX 1060 6GB.
39.9 FPS
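A measurement like this can be sketched with the standard time module as follows. The text only says that time was used, so the choice of `time.perf_counter`, the warm-up pass, and the helper name `measure_fps` are my assumptions. The dummy `infer` function below stands in for the trained model.

```python
import time


def measure_fps(infer, inputs, warmup=1):
    """Average frames per second of `infer` over `inputs`."""
    # Warm-up calls are excluded so one-time setup cost is not measured.
    for x in inputs[:warmup]:
        infer(x)

    start = time.perf_counter()
    for x in inputs:
        infer(x)
    elapsed = time.perf_counter() - start

    return len(inputs) / elapsed


def infer(frame):
    # Placeholder workload standing in for model inference
    return sum(range(1000))


fps = measure_fps(infer, list(range(50)))
print("{:.1f} FPS".format(fps))
```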
Below is a video of object detection with the trained model.
The result is not good. I would like to try object detection again.
Microsoft COCO Common Objects in Context
Computer Vision : Image Classification Part2 - Training CNN model
Computer Vision : Object Detection Part1 - Bounding Box preprocessing