Computer Vision: Object Detection Part1 - Bounding Box Preprocessing

Target

This series summarizes object detection with the Microsoft Cognitive Toolkit (CNTK).

In Part 1, we prepare the training data for object detection with the Microsoft Cognitive Toolkit. The Microsoft Common Objects in Context (COCO) dataset is used as the training dataset.

The topics are covered in the following order:

  1. Object detection with a neural network
  2. Bounding box preprocessing
  3. Dimension clustering for creating anchor boxes
  4. Creating the files read by CNTK's built-in readers

Introduction

Object detection with a neural network

Object detection methods are roughly divided into two families: 2-stage methods, which first detect candidate regions and then classify them, and 1-stage methods, which detect and classify candidate regions at the same time. The two families are well known to trade accuracy against speed.

In particular, when real-time performance is required, a fast 1-stage method is adopted.

Therefore, this time we train a real-time object detection network that uses features from multiple layers, as in SSD [1], and follows the YOLO [2] algorithm for learning the bounding boxes and the classification.

The input is a 416x416 BGR color image, and the base convolutional neural network (CNN) is the original CNN (coco21) trained in Computer Vision: Image Classification Part2 - Training CNN model.
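
As a minimal sketch, loading an input image with OpenCV (which is listed in the execution environment below) could look like the following; the file path is only an illustration.

import cv2

# OpenCV reads images in BGR channel order by default
img = cv2.imread("./COCO/train2014/COCO_train2014_000000000009.jpg")
img = cv2.resize(img, (416, 416))  # resize to the 416x416 network input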

Bounding box preprocessing

As mentioned in [Computer Vision: Image Classification Part1 - Understanding COCO dataset](https://qiita.com/sho_watari/items/bf0cdaa32cbbc2192393#coco-%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88%E3%81%AE%E8%A7%A3%E6%9E%90), each Microsoft COCO image comes with bounding boxes, and every bounding box is labeled with one of 80 categories. [3]

The YOLO algorithm is used to detect the candidate regions where objects exist. As preprocessing, we therefore compute the center coordinates, width, and height of every bounding box in each image and convert them to ratios, treating the size of the original image as 1. The directory structure this time is as follows.

COCO
|―COCO
 |―train2014
  |―COCO_train2014_000000000009.jpg
  |―...
MNIST
NICS
SSMD
 ssmd_boundingbox.py
 coco21.h5

Dimension clustering for creating anchor boxes

Bounding boxes come in many shapes, so anchor boxes are used as a way to stabilize training. [4]

Both SSD and YOLOv2 [5] use anchor boxes. Like YOLOv2, we find typical widths and heights by applying k-means clustering, an unsupervised learning method, to the widths and heights of the bounding boxes in the training data. The number of anchor boxes is set to five.

Creating files with images, bounding boxes, and categories for training

What we need this time are a text file for ImageDeserializer, which reads the images used for training, and one for CTFDeserializer, which reads the bounding boxes and category labels corresponding to each image. ImageDeserializer and CTFDeserializer are CNTK's built-in readers. ImageDeserializer is introduced in [Computer Vision: Image Classification Part1 - Understanding COCO dataset](https://qiita.com/sho_watari/items/bf0cdaa32cbbc2192393#%E8%A7%A3%E8%AA%AC), and CTFDeserializer in Computer Vision: Image Caption Part1 - STAIR Captions.
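
As an illustration only, the following sketch writes one line of each file; the file names, the dummy label column in the map file, and the stream name "bbox" are assumptions, not the article's actual format.

# ImageDeserializer reads a tab-separated map file of <image path>\t<label>;
# CTFDeserializer reads CNTK Text Format lines of <sequence id> |<stream> <values>.
with open("train_ssmd_images.txt", "w") as map_file:
    map_file.write("./COCO/train2014/COCO_train2014_000000000009.jpg\t0\n")
with open("train_ssmd_bbox.txt", "w") as ctf_file:
    ctf_file.write("0 |bbox 0.4375 0.479167 0.125 0.25\n")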

However, this time we want to train on the multiple bounding boxes and categories that exist in one image at the same time, so some ingenuity is required during training. These ideas are introduced in Part 2, which covers the actual training.

Implementation

Execution environment

Hardware

・CPU Intel(R) Core(TM) i7-7700 3.60GHz

Software

・Windows 10 Pro 1909
・Python 3.6.6
・numpy 1.17.3
・opencv-contrib-python 4.1.1.26
・scikit-learn 0.21.3

Program to run

The implemented program is published on GitHub.

ssmd_boundingbox.py


Commentary

Here I extract and explain some parts of the program.

Since one image can be assigned multiple category labels and bounding boxes, the bounding boxes are stored in a dictionary with the image ID as the key.

ssmd_boundingbox.py


bbox_dict = {}
for ann in annotations:
    image_id = ann["image_id"]
    category_id = ann["category_id"]
    bbox = ann["bbox"]  # COCO format: [x_min, y_min, width, height]
    bbox.append(categories[str(category_id)][0])  # append the category label

    bbox_dict.setdefault(image_id, []).append(bbox)
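
With invented values, an entry of this dictionary would look like the following; the exact form of the label depends on how categories is built, which is not shown above.

# Invented values for illustration: one image ID mapping to two annotations,
# each a COCO-style [x_min, y_min, width, height] plus the appended label
example = {9: [[1.08, 187.69, 611.59, 285.84, "bear"],
               [385.53, 60.03, 214.97, 297.16, "bear"]]}
print(example[9][0][:4])  # raw COCO bounding box of the first object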

For the ground-truth bounding boxes, the center coordinates (x, y) and the width and height are normalized to [0, 1] by the original image width and height.

ssmd_boundingbox.py


# COCO box is [x_min, y_min, width, height]; convert to normalized [center_x, center_y, width, height]
box = [(bbox[0] + bbox[2] / 2) / width, (bbox[1] + bbox[3] / 2) / height,
       bbox[2] / width, bbox[3] / height]
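
As a quick check with invented numbers, a COCO box [x_min, y_min, width, height] = [240.0, 170.0, 80.0, 120.0] in a 640x480 image becomes:

# Invented example values, not taken from the dataset
width, height = 640, 480
bbox = [240.0, 170.0, 80.0, 120.0]
box = [(bbox[0] + bbox[2] / 2) / width, (bbox[1] + bbox[3] / 2) / height,
       bbox[2] / width, bbox[3] / height]
print(box)  # [0.4375, 0.4791666..., 0.125, 0.25]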

Finally, k-means clustering is performed on the widths and heights of all the bounding boxes.

dimension_clustering


import numpy as np
from sklearn.cluster import k_means


def dimension_clustering(bounding_boxes, num_anchors):
    # cluster the (width, height) pairs; centroid has shape (num_anchors, 2)
    centroid, label, _ = k_means(bounding_boxes, num_anchors)

    np.save("anchor_boxes.npy", centroid)
    print("\nSaved anchor_boxes.npy")

Result

When the program is executed, the center coordinates, width, and height of each bounding box and its category label are written out, and finally the typical bounding box widths and heights obtained by k-means clustering over all the bounding boxes are saved as a NumPy file.

Now 10000 samples...
Now 20000 samples...
...
Now 80000 samples...

Number of samples 82081

Saved anchor_boxes.npy

The widths and heights of the anchor boxes obtained this time are as follows, sorted in ascending order and shown to two decimal places; the anchor boxes are visualized in the figure below.

(0.06, 0.08)
(0.19, 0.28)
(0.31, 0.67)
(0.66, 0.35)
(0.83, 0.83)

dimension_clustering.png
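
For reference, a hedged sketch of drawing such a figure with matplotlib (an assumption here; it is not listed in the execution environment) might look like this:

import matplotlib.pyplot as plt
import matplotlib.patches as patches
import numpy as np

# Draw each anchor box centered at (0.5, 0.5) in normalized coordinates
anchor_boxes = np.load("anchor_boxes.npy")
fig, ax = plt.subplots(figsize=(5, 5))
for w, h in anchor_boxes:
    ax.add_patch(patches.Rectangle((0.5 - w / 2, 0.5 - h / 2), w, h,
                                   fill=False, linewidth=2))
ax.set_xlim(0, 1)
ax.set_ylim(0, 1)
plt.show()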

Now that the anchor boxes and the text files used for training have been created, in Part 2 we will train the object detection network end-to-end using CNTK.

Reference

Microsoft COCO Common Objects in Context

Computer Vision : Image Classification Part1 - Understanding COCO dataset
Computer Vision : Image Classification Part2 - Training CNN model
Computer Vision : Image Caption Part1 - STAIR Captions

  1. Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. "SSD: Single Shot MultiBox Detector", European Conference on Computer Vision. 2016, pp 21-37.
  2. Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. "You Only Look Once: Unified, Real-Time Object Detection", The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, pp 779-788.
  3. Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. "Microsoft COCO: Common Objects in Context", European Conference on Computer Vision. 2014, pp 740-755.
  4. Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", In Advances in Neural Information Processing Systems (NIPS). 2015, pp 91-99.
  5. Joseph Redmon and Ali Farhadi. "YOLO9000: Better, Faster, Stronger", The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017, pp 7263-7271.
