This series summarizes object detection using the Microsoft Cognitive Toolkit (CNTK).
In Part 1, we prepare for object detection with the Microsoft Cognitive Toolkit. Microsoft Common Objects in Context (COCO), provided by Microsoft, is used as the training dataset for object detection.
I will introduce the topics in the following order.
Object detection methods are roughly divided into two types: two-stage methods, which first detect candidate regions and then classify them, and one-stage methods, which detect and classify candidate regions at the same time. It is well known that there is a trade-off between accuracy and speed between the two.
In particular, for object detection that requires real-time performance, fast one-stage methods are adopted.
Therefore, this time we train a real-time object detection network that, like SSD [1], uses features from multiple layers, and follows the YOLO [2] algorithm for learning bounding boxes and classification.
The input is a 416x416 BGR color image, and as the base convolutional neural network (CNN) we use the original CNN (coco21) trained in Computer Vision : Image Classification Part2 - Training CNN model.
As mentioned in [Computer Vision : Image Classification Part1 - Understanding COCO dataset](https://qiita.com/sho_watari/items/bf0cdaa32cbbc2192393#coco-%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88%E3%81%AE%E8%A7%A3%E6%9E%90), each Microsoft COCO image is given bounding boxes, and each bounding box is assigned one of the 80 categories. [3]
YOLO's algorithm is used to detect the candidate regions where objects exist. Therefore, as preprocessing, it is necessary to compute the center coordinates, width, and height of the bounding boxes contained in each image, and convert each of them to a ratio, taking the size of the original image as 1. The directory structure this time is as follows.
COCO
|―COCO
   |―train2014
      |―COCO_train2014_000000000009.jpg
      |―...
MNIST
NICS
SSMD
 |―ssmd_boundingbox.py
 |―coco21.h5
Bounding boxes come in many shapes and sizes, so anchor boxes are a way to stabilize training. [4]
Both SSD and YOLOv2 [5] adopt anchor boxes. As in YOLOv2, it is common to apply k-means clustering, an unsupervised learning method, to the widths and heights of the bounding boxes contained in the training data to find typical widths and heights. I set the number of anchor boxes to five.
What we need this time are a text file for ImageDeserializer, which reads the images used for training, and one for CTFDeserializer, which reads the bounding boxes and category labels corresponding to each image. ImageDeserializer and CTFDeserializer are CNTK's built-in readers. ImageDeserializer is introduced in [Computer Vision : Image Classification Part1 - Understanding COCO dataset](https://qiita.com/sho_watari/items/bf0cdaa32cbbc2192393#%E8%A7%A3%E8%AA%AC), and CTFDeserializer in Computer Vision : Image Caption Part1 - STAIR Captions.
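For illustration only, the contents of the two text files might look like the following. The file names, the dummy label 0, and the stream names bbox and label are hypothetical placeholders, not necessarily what the actual program writes:

train_images.txt (map file for ImageDeserializer: image path, tab, dummy label)
./COCO/train2014/COCO_train2014_000000000009.jpg	0

train_bboxes.txt (CTF file for CTFDeserializer: sequence id followed by named streams)
0 |bbox 0.48 0.51 0.25 0.33 |label 51
0 |bbox 0.61 0.30 0.10 0.12 |label 51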
However, this time we want to learn the multiple bounding boxes and categories present in a single image at the same time, so some ingenuity is required during training. Those ideas will be introduced in Part 2, which covers the actual training.
・CPU Intel(R) Core(TM) i7-7700 3.60GHz

・Windows 10 Pro 1909
・Python 3.6.6
・Numpy 1.17.3
・Opencv-contrib-python 4.1.1.26
・Scikit-learn 0.21.3
The implemented program is published on GitHub.
ssmd_boundingbox.py
I will pick out and explain some parts of the program to be executed.
Since multiple category labels and bounding boxes are assigned to a single image, they are stored in a dictionary with the image ID as the key.
ssmd_boundingbox.py
bbox_dict = {}
for ann in annotations:
    image_id = ann["image_id"]
    category_id = ann["category_id"]
    bbox = ann["bbox"]  # COCO format: [x_min, y_min, width, height] in pixels
    bbox.append(categories[str(category_id)][0])  # append the category label
    bbox_dict.setdefault(image_id, []).append(bbox)  # group boxes by image ID
For the ground-truth bounding boxes, the center coordinates (x, y) and the width and height are normalized to [0, 1] by the width and height of the original image.
ssmd_boundingbox.py
# convert [x_min, y_min, w, h] in pixels to normalized [center_x, center_y, w, h]
box = [(bbox[0] + bbox[2] / 2) / width,   # center x
       (bbox[1] + bbox[3] / 2) / height,  # center y
       bbox[2] / width,                   # width
       bbox[3] / height]                  # height
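A quick worked example with made-up numbers (the image size and box below are hypothetical, not taken from the dataset):

# Hypothetical example: a 640x480 image with a COCO bbox of [320, 120, 64, 48]
width, height = 640, 480
bbox = [320, 120, 64, 48]
box = [(bbox[0] + bbox[2] / 2) / width,   # (320 + 32) / 640 = 0.55
       (bbox[1] + bbox[3] / 2) / height,  # (120 + 24) / 480 = 0.30
       bbox[2] / width,                   # 64 / 640 = 0.10
       bbox[3] / height]                  # 48 / 480 = 0.10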
Finally, k-means clustering is performed on the widths and heights of all bounding boxes.
dimension_clustering
import numpy as np
from sklearn.cluster import k_means

def dimension_clustering(bounding_boxes, num_anchors):
    # cluster the (width, height) pairs into num_anchors representative sizes
    centroid, label, _ = k_means(bounding_boxes, num_anchors)
    np.save("anchor_boxes.npy", centroid)  # save the cluster centers as anchors
    print("\nSaved anchor_boxes.npy")
When the program is executed, the center coordinates, width, and height of each bounding box and the corresponding category label are written to the text file, and finally the widths and heights of the typical bounding boxes obtained by k-means clustering over all bounding boxes are saved as a Numpy file.
Now 10000 samples...
Now 20000 samples...
...
Now 80000 samples...
Number of samples 82081
Saved anchor_boxes.npy
The widths and heights of the anchor boxes obtained this time are as follows, sorted in ascending order and shown to two significant digits. The resulting anchor boxes are illustrated in the figure below.
(0.06, 0.08)
(0.19, 0.28)
(0.31, 0.67)
(0.66, 0.35)
(0.83, 0.83)
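For reference, the saved anchors can be reloaded and displayed in the same way; a minimal sketch, where sorting by box area in ascending order is an assumption:

import numpy as np

anchor_boxes = np.load("anchor_boxes.npy")
order = np.argsort(anchor_boxes.prod(axis=1))  # sort by area, ascending (assumption)
print(np.round(anchor_boxes[order], 2))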
Now that the anchor boxes and the text files used for training have been created, Part 2 will cover training the object detection network end-to-end using CNTK.
Microsoft COCO Common Objects in Context
Computer Vision : Image Classification Part1 - Understanding COCO dataset
Computer Vision : Image Classification Part2 - Training CNN model
Computer Vision : Image Caption Part1 - STAIR Captions