CornerNet: Detecting Objects as Paired Keypoints
https://arxiv.org/abs/1808.01244
Presented at ECCV 2018.
Note: this is basically my own interpretation of the paper, so there may be some mistakes.
Faster R-CNN, SSD, and YOLO all rely on predetermined anchors: they classify each anchor and regress the offsets between anchors and the ground-truth bounding boxes.
In other words, everything in these detectors depends on the anchors, and the paper questions that dependence.
① The number of anchors keeps growing. DSSD uses more than 40k anchors, and RetinaNet more than 100k. The simple reasoning is: "with more anchors, the ground-truth bbox is more likely to match some anchor, so accuracy should improve."
But increasing the number of anchors also increases the number of negative anchors, which slows down learning.
② Anchors come with many hyperparameters: how many boxes, what sizes, what aspect ratios. This design work is tedious, complex, and cumbersome.
Is it possible to build an object detector that does not use anchors?
That is the paper's motivation.
The first feature is that CornerNet does not regress bbox coordinates and sizes. Instead, it outputs heatmaps over the input image: the probability that each location is a top-left corner, and the probability that it is a bottom-right corner. There is no regression through fully connected layers.
There are three basic outputs:
・Heatmaps of corner locations (top-left and bottom-right)
・Embeddings that separate different objects of the same class
・Offsets that recover the resolution lost during feature extraction
Let's take a quick look at each
The top-left corner of a ground-truth bounding box is essentially a single point, so predicting it exactly is too harsh a target. Therefore, a Gaussian is applied around each corner to soften the target slightly.
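This target-softening step can be sketched as follows (the function name and fixed sigma are my own choices for illustration; the paper actually derives the Gaussian radius from the object size):

```python
import numpy as np

def gaussian_corner_heatmap(shape, corners, sigma=2.0):
    """Build a corner-heatmap target for one class.

    Instead of marking only the exact ground-truth corner pixel (too
    harsh a target), a 2D Gaussian is splatted around each corner so
    that near-misses are penalized less.
    """
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    heatmap = np.zeros(shape, dtype=np.float32)
    for (cx, cy) in corners:
        g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        heatmap = np.maximum(heatmap, g)  # per-pixel max over all corners
    return heatmap
```

Each ground-truth corner becomes a bump that peaks at 1.0 exactly at the corner and decays smoothly around it.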
All that remains is to predict this heatmap with an encoder-decoder network, the Hourglass Network.
The loss function is a variant of focal loss.
Roughly speaking (this part is less critical): feature extraction reduces the resolution, so the exact coordinates in the original image are recovered by regressing an offset.
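A minimal illustration of what that offset is (the function name is mine; stride 4 is just an example output stride):

```python
def corner_offset(x, y, stride=4):
    """Downsampling by `stride` maps image coordinate x to floor(x/stride).
    The network regresses the fractional remainder so the corner can be
    placed back at full resolution instead of snapping to the grid."""
    return (x / stride - x // stride, y / stride - y // stride)
```

For example, an image-space corner at (103, 57) lands on feature-map cell (25, 14) with a regressed offset of (0.75, 0.25).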
Output heatmaps are produced per class. Therefore, if an image contains multiple objects of the same class, there are multiple corner points. For example, in an image with two people, the "person" heatmap detects two top-left points.
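A naive way to read multiple corner candidates of one class out of such a heatmap (a hypothetical decoding step for illustration; the paper applies a 3×3 max-pool NMS and keeps the top-k scores):

```python
import numpy as np

def heatmap_peaks(hm, thresh=0.5):
    """Return (x, y, score) for pixels that are local maxima in a 3x3
    neighbourhood and above a score threshold."""
    h, w = hm.shape
    padded = np.pad(hm, 1, constant_values=-np.inf)
    peaks = []
    for y in range(h):
        for x in range(w):
            window = padded[y:y + 3, x:x + 3]  # 3x3 patch centred on (y, x)
            if hm[y, x] >= thresh and hm[y, x] == window.max():
                peaks.append((x, y, float(hm[y, x])))
    return peaks
```

A heatmap with two separated bumps yields two corner candidates, one per object.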
Next, top-left and bottom-right points are paired to form bounding boxes. At this stage it is not clear which top-left goes with which bottom-right. Therefore, embeddings are used: losses push corners that belong together toward similar embedding values, which makes the correct combinations easy to find. Lpull makes the two embeddings of the same object similar; Lpush makes the embeddings of different objects dissimilar.
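A sketch of these two associative-embedding losses as I understand them (Δ=1 follows the paper; the function is my own simplification using scalar embeddings):

```python
import numpy as np

def pull_push_losses(top_emb, bot_emb, delta=1.0):
    """Pull/push losses over per-object corner embeddings.

    top_emb[k], bot_emb[k]: scalar embeddings predicted at the top-left
    and bottom-right corner of object k.
    """
    e = (top_emb + bot_emb) / 2.0  # mean embedding per object
    # pull: both corners of the same object toward their mean
    pull = np.mean((top_emb - e) ** 2 + (bot_emb - e) ** 2)
    # push: means of different objects at least `delta` apart
    n = len(e)
    push = 0.0
    for k in range(n):
        for j in range(n):
            if j != k:
                push += max(0.0, delta - abs(e[k] - e[j]))
    push /= max(n * (n - 1), 1)
    return pull, push
```

When the two corners of each object already agree and different objects are far apart in embedding space, both losses vanish.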
The total loss combines the detection (heatmap) loss, the embedding loss, and the offset loss.
Corner pooling: predicting top-left and bottom-right points fundamentally means predicting locations where the object itself is usually absent. It seems problematic that the point to be predicted carries no local evidence of the class. Therefore, a new pooling method called corner pooling is used.
As the figure shows, it propagates the maximum value toward each pixel horizontally and vertically.
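This "shift the maximum" operation for the top-left branch can be written as two running maxima, one right-to-left and one bottom-to-top (a numpy sketch; the real operation runs per channel on convolutional feature maps):

```python
import numpy as np

def top_left_corner_pool(x):
    """Top-left corner pooling on a 2-D feature map.

    Each output pixel is (max of its row to the right, inclusive)
    plus (max of its column below, inclusive), so evidence of an
    object to the right and below accumulates at its top-left corner.
    """
    # right-to-left running max along each row
    right_max = np.maximum.accumulate(x[:, ::-1], axis=1)[:, ::-1]
    # bottom-to-top running max along each column
    down_max = np.maximum.accumulate(x[::-1, :], axis=0)[::-1, :]
    return right_max + down_max
```

The bottom-right branch is symmetric, scanning left-to-right and top-to-bottom instead.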
It nicely exploits the fact that, seen from the top-left corner, the object always lies to the right and below.
The top of the figure is without corner pooling, and the bottom is with it. You can see that more accurate bounding boxes are proposed.
Quantitatively, corner pooling improves accuracy slightly, though the results are already reasonably good without it.
Compared with other methods, CornerNet clearly outperforms other one-stage detectors and is comparable to two-stage detectors.
Overlapping giraffes are detected cleanly; you can see that the embeddings are working well.
This is a failure example: some people are missed, and some embeddings are paired incorrectly.
Inference takes 244 ms per image, which is slow.
The paper proposes a new object detector that uses no anchors at all and obtains solid results; the accuracy in particular is quite good.
On the other hand, the 244 ms inference time is its most glaring weakness. However, since this is the first paper to use this heatmap-based approach, it may well be improved in future work.
(Personally) Corner pooling improves results a little, but it may not be very useful outside this corner-based formulation.
That was a brief explanation of CornerNet, a detector that does not use anchors. If anything is missing from the explanation, I would like to supplement it later.
CenterNet https://arxiv.org/abs/1904.07850
Grid R-CNN https://arxiv.org/abs/1811.12030