Let's take a quick look at CornerNet, an object detector that does not use anchors.

Paper introduced

CornerNet: Detecting Objects as Paired Keypoints

https://arxiv.org/abs/1808.01244

It was presented at ECCV 2018.

Note that this is basically my own interpretation, so there may be some mistakes.

Problems pointed out

Faster R-CNN, SSD, and YOLO rely on predefined anchors: they classify each anchor and regress the difference between the anchor and the actual bounding box.

In other words, the paper questions the fact that these detectors depend on anchors.

① The number of anchors grows
DSSD uses more than 40k anchors, and RetinaNet uses more than 100k. The simple reasoning is: "If you increase the number of anchors, a ground-truth bbox is more likely to overlap well with some anchor, so the accuracy should increase."

However, as the number of anchors increases, the number of negative anchors also increases, and training slows down.

② Anchors have many hyperparameters
・ How many boxes
・ What sizes
・ What aspect ratios
Designing these is tedious and complicated.

Is it possible to make an object detector that does not use anchors?

That is the motivation of the paper.

Rough structure

image.png

The first feature is that it does not regress the bbox coordinates and size. Instead, for the input image it outputs a heat map of the probability of the upper left corner and a heat map of the probability of the lower right corner. There is no regression using a fully connected layer.

Output

There are three basic outputs.
・ Heat maps of corner positions (upper left and lower right respectively)
・ Embeddings that separate objects of the same class
・ Offsets that recover the resolution lost by feature extraction

Let's take a quick look at each

Heat map generation

The upper left corner of the ground-truth bounding box is essentially a single "point", which is too strict a target to predict. So a Gaussian is applied around it to make the target a little larger.

All that is left is to predict this heat map with an encoder-decoder type network called the Hourglass Network.

The loss function is a variant of focal loss.
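To make the idea concrete, here is a minimal NumPy sketch of such a softened target. The function name corner_heatmap and the fixed sigma are my own assumptions for illustration; the paper derives the spread from the object size instead.

```python
import numpy as np

def corner_heatmap(height, width, corner_xy, sigma=2.0):
    """Ground-truth heat map for one corner: 1.0 at the corner, decaying around it.

    sigma is a fixed, illustrative value (an assumption of this sketch).
    """
    cx, cy = corner_xy
    ys, xs = np.mgrid[0:height, 0:width]
    # Unnormalized 2D Gaussian centered on the corner.
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

# Corner at x=2, y=3 on an 8x8 map.
heatmap = corner_heatmap(8, 8, corner_xy=(2, 3))
```

Pixels next to the true corner now get a target close to 1 instead of a hard 0, so a prediction that is off by one pixel is penalized less.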

image.png
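As a rough sketch of this kind of loss, the code below implements a focal-loss variant over a heat map in NumPy. The function name corner_focal_loss and the clipping constant eps are my own assumptions; alpha=2 and beta=4 are the values I recall from the paper, so please check them against the original.

```python
import numpy as np

def corner_focal_loss(pred, target, alpha=2.0, beta=4.0, eps=1e-12):
    """Focal-loss variant for a corner heat map (sketch, not the official code).

    pred:   predicted probabilities in [0, 1]
    target: Gaussian-softened ground truth, exactly 1.0 at true corners
    """
    pred = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    pos = target == 1.0
    # Positive locations: standard focal term.
    pos_loss = ((1.0 - pred) ** alpha * np.log(pred))[pos].sum()
    # Negative locations: down-weighted near a true corner by (1 - target)^beta.
    neg_loss = ((1.0 - target) ** beta * pred ** alpha * np.log(1.0 - pred))[~pos].sum()
    num_pos = max(int(pos.sum()), 1)
    return -(pos_loss + neg_loss) / num_pos

target = np.zeros((4, 4))
target[1, 2] = 1.0
loss_good = corner_focal_loss(target.copy(), target)       # perfect prediction
loss_bad = corner_focal_loss(np.full((4, 4), 0.5), target)  # uninformative prediction
```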

Offset

Roughly speaking, this part is not that important. Feature extraction lowers the resolution, so the exact coordinates in the original image are recovered by regressing an offset.

image.png
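The arithmetic behind the offset can be sketched as follows. The helper names corner_offset and restore_corner are hypothetical, and the stride of 4 is just an example value for the downsampling factor.

```python
import math

def corner_offset(x, y, stride):
    """Fractional part lost when an image coordinate is mapped to the heat map."""
    fx, fy = x / stride, y / stride
    return fx - math.floor(fx), fy - math.floor(fy)

def restore_corner(cell_x, cell_y, offset, stride):
    """Map a heat-map cell plus its offset back to image coordinates."""
    return (cell_x + offset[0]) * stride, (cell_y + offset[1]) * stride

stride = 4                      # example downsampling factor (assumption)
x, y = 101, 58                  # corner in the original image
cell = (x // stride, y // stride)
offset = corner_offset(x, y, stride)
restored = restore_corner(cell[0], cell[1], offset, stride)
```

Without the offset, the corner could only be recovered as (100, 56), which matters most for small boxes.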

Embedding

Output heat maps are created for each class. Therefore, if one image contains multiple objects of the same class, there will be multiple corner points. For example, in an image with two people, the "person class" heat map will contain two upper left points.

After that, an upper left point and a lower right point are combined to make a bounding box. At this stage it is not clear which upper left point pairs with which lower right point. Therefore, an embedding is predicted for each corner, and the loss trains corners that belong to the same box to have similar embeddings. This makes it easier to find the pairs. Lpull makes the two embeddings of the same pair similar, and Lpush makes the embeddings of different pairs different.

image.png image.png
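The two losses can be sketched in NumPy as below. This assumes scalar (1-D) embeddings per corner and a margin delta of 1, which is how I read the paper; the function name pull_push_loss is my own.

```python
import numpy as np

def pull_push_loss(top_left, bottom_right, delta=1.0):
    """Pull/push embedding losses for N objects (sketch, not the official code).

    top_left[k] and bottom_right[k] are the scalar embeddings of the two
    corners of object k.
    """
    tl = np.asarray(top_left, dtype=float)
    br = np.asarray(bottom_right, dtype=float)
    n = len(tl)
    mean = (tl + br) / 2.0
    # Pull: both corners of the same object move toward their mean.
    pull = np.sum((tl - mean) ** 2 + (br - mean) ** 2) / n
    if n < 2:
        return pull, 0.0
    # Push: means of different objects should be at least delta apart.
    dist = np.abs(mean[:, None] - mean[None, :])
    margin = np.maximum(0.0, delta - dist)
    push = (margin.sum() - np.trace(margin)) / (n * (n - 1))
    return pull, push

# Two objects whose embeddings are consistent per object but too close together.
pull, push = pull_push_loss([0.0, 0.5], [0.0, 0.5])
```

Here the pull term is zero because each pair already agrees, while the push term is positive because the two objects are only 0.5 apart.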

Loss function

The total loss is a combination of the detection loss, the embedding losses (Lpull and Lpush), and the offset loss.

image.png

Corner pooling

Predicting the upper left and lower right points often means predicting a location where the object itself is not present. When predicting the corners of a bounding box, it is a problem that there is no information about the class at the point being predicted. Therefore, the paper introduces a new pooling method called corner pooling.

image.png

What it does is shown in the figure: it propagates the maximum value vertically and horizontally.

This seems to make good use of the prior that, for example, "the person lies to the right of and below the upper left corner".
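For the upper left corner, this can be sketched with NumPy's running maximum. The function name top_left_corner_pool and the toy feature maps are my own assumptions; f_t is pooled downward and f_l is pooled to the right, following my reading of the paper.

```python
import numpy as np

def top_left_corner_pool(f_t, f_l):
    """Top-left corner pooling on two H x W feature maps (sketch).

    Each output location receives the max of f_t over everything below it
    plus the max of f_l over everything to its right.
    """
    # Reverse, take the cumulative max, and reverse back: max from here to the end.
    down_max = np.maximum.accumulate(f_t[::-1, :], axis=0)[::-1, :]
    right_max = np.maximum.accumulate(f_l[:, ::-1], axis=1)[:, ::-1]
    return down_max + right_max

# Toy example: evidence for one boundary at (row 2, col 1), the other at (row 1, col 2).
f_t = np.zeros((3, 3)); f_t[2, 1] = 2.0
f_l = np.zeros((3, 3)); f_l[1, 2] = 3.0
pooled = top_left_corner_pool(f_t, f_l)
peak = np.unravel_index(pooled.argmax(), pooled.shape)
```

The peak lands at (row 1, col 1), the upper left corner, even though neither input map has any activation at that location.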

Results

Result of corner pooling

The top of the figure is without corner pooling, and the bottom is with corner pooling. You can see that more accurate bounding boxes are produced with it.

image.png

Quantitatively, the accuracy is a little better with corner pooling. You can also see that the detector is good enough without it.

image.png

Comparative experiment

The results are much better than those of other one-stage detectors and comparable to those of two-stage detectors.

image.png

Success example

Overlapping giraffes are detected cleanly. You can see that the embedding is working well.

image.png

Failure example

This is a failure example. Some people are not detected, and some corners are paired incorrectly by the embedding.

image.png

Operating speed

244 ms per image, which is slow.

Summary

The paper proposes a new object detector that does not use anchors at all, and it obtains good results. The accuracy in particular is quite good.

On the other hand, the biggest problem is the slow operating speed of 244 ms per image. However, since this is an early paper that applies a heat map method to object detection, there is a good chance that it will be improved in the future.

(Personally) Corner pooling improves the results a little, but it may not be useful much outside this corner-based formulation.

That was a brief explanation of CornerNet, a detector that does not use anchors. If any explanation is lacking, I would like to supplement it later.

Follow-up work

CenterNet https://arxiv.org/abs/1904.07850

Grid R-CNN https://arxiv.org/abs/1811.12030
