A few months ago, I was a little worried about how little I knew about SSD (Single Shot MultiBox Detector), so I decided to build my own SSD little by little while reading the paper and the implementations I found on GitHub. To be honest, it's not finished yet, but I've learned how to create Default Boxes (Priors), an important part of SSD, so I'd like to share that with you.
If you're reading this article, you probably know what a Default Box is, but I'll explain it briefly just in case. When SSD processes an image, it outputs convolved images called Feature Maps. The number and size of the Feature Maps are specified in the model settings, but there are usually about 5 or 6. To recognize objects from a Feature Map, the areas where an object is likely to appear are specified in advance; such an area is called a Default Box (sometimes called a Prior) and is used for classification and regression.
This is what one set of Default Boxes looks like when visualized. With my model settings, a total of 8732 are created.
This is an image of COCO Dataset 2017.
According to the paper, calculating the Default Boxes requires the Scale (there is no specific definition, but it is something like the size of an object), the Aspect Ratios (the aspect ratios of the Default Boxes), and the size of each Feature Map.
In the paper, the scale is defined as s_k = s_min + (s_max - s_min) / (m - 1) * (k - 1) for k in [1, m]. Here m is the number of feature maps and k is the index of the feature map. s_min and s_max are determined by the size of the objects in the image. I would like to say there is a set way to choose them, but there does not seem to be one.
The function itself looks a bit complicated, but in a nutshell it spaces m values evenly from s_min to s_max, one per feature map.
For example, if s_min = 0.2, s_max = 0.9 and m = 6, then s_k would be [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]. Each scale is equally separated by 0.14.
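The example above can be reproduced with a few lines of Python. This is only a minimal sketch of the scale formula from the paper; the function name `compute_scales` is my own and does not come from any particular implementation.

```python
def compute_scales(s_min, s_max, m):
    """Return the m scales s_1..s_m, evenly spaced from s_min to s_max."""
    if m == 1:
        return [s_min]
    step = (s_max - s_min) / (m - 1)
    # k runs from 1 to m, as in the paper
    return [s_min + step * (k - 1) for k in range(1, m + 1)]


scales = compute_scales(0.2, 0.9, 6)
print([round(s, 2) for s in scales])  # [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
```

The values are rounded only for printing, because floating point arithmetic leaves tiny errors in the raw numbers.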
Aspect Ratios, like s_min and s_max, are chosen according to the objects. The paper states that the Aspect Ratios {1, 2, 3, 1/2, 1/3} were used. It was a little unclear from the paper alone, but 2 and 1/2 form a pair, and so do 3 and 1/3. So when a model uses fewer Aspect Ratios, it basically means that 3 and 1/3 are dropped together.
The Feature Map size was (personally) the most obvious part: it's simply the size of the convolution layer output that you pass to the classification and regression heads in the SSD. For example, in my model the first output is 38 x 38 and the last output is 1 x 1.
Once you have the parts above, you can calculate the Default Boxes. For a scale s_k and an aspect ratio a_r, the width and height of a Default Box are w = s_k * sqrt(a_r) and h = s_k / sqrt(a_r).
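As a small sketch of the width and height calculation from the paper (w = s_k * sqrt(a_r), h = s_k / sqrt(a_r)), the code below prints the box size for each aspect ratio at a scale of 0.2. The function name `box_size` is just something I made up for illustration.

```python
import math


def box_size(scale, aspect_ratio):
    """Width and height of a default box, relative to the image size."""
    w = scale * math.sqrt(aspect_ratio)
    h = scale / math.sqrt(aspect_ratio)
    return w, h


for ar in (1, 2, 3, 1 / 2, 1 / 3):
    w, h = box_size(0.2, ar)
    print(f"aspect ratio {ar:.2f}: w={w:.3f}, h={h:.3f}")
```

Whatever the aspect ratio, w * h stays equal to the scale squared, so the aspect ratio changes the shape of the box but not its area.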
 

Next we can calculate the center point (cx, cy) of the Default Box: cx = (i + 0.5) / f_k and cy = (j + 0.5) / f_k, where i and j run from 0 to f_k - 1. f_k here is simply the size of the feature map (e.g. 38).
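The center calculation can be sketched like this, assuming a square feature map of size f_k. The name `box_centers` and the choice of i for the x direction and j for the y direction are my own assumptions.

```python
from itertools import product


def box_centers(f_k):
    """Centers (cx, cy) of every cell in an f_k x f_k feature map, in [0, 1]."""
    centers = []
    # j indexes rows (y) and i indexes columns (x); this ordering is an assumption
    for j, i in product(range(f_k), repeat=2):
        cx = (i + 0.5) / f_k
        cy = (j + 0.5) / f_k
        centers.append((cx, cy))
    return centers


centers = box_centers(38)
print(len(centers))  # 1444
```

The 0.5 offset puts each center in the middle of its cell instead of on the top-left corner.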

By the way, the pattern [cx, cy, w, h] is common when representing the Default Box.
The calculations for cx, cy and w, h described above are done for each aspect ratio in each feature map. However, aspect ratio 1 is a small exception: it produces two Default Boxes. One is calculated with the normal scale s_k, and the other with an extra scale s'_k = sqrt(s_k * s_(k+1)), which is calculated from the scale of the current feature map and the scale one level higher.

So the Aspect Ratio of 1 gets one extra Default Box. For {1, 2, 3, 1/2, 1/3}, each feature map cell creates six Default Boxes. In the case of {1, 2, 1/2}, four are created.
I looked at several implementations to understand how to create Default Boxes, and there were a few points I couldn't understand from the paper alone.
First of all, we often see a variable called steps. In these implementations, f_k is calculated by dividing the image size by the step. There is no particular explanation anywhere and it is not written in the paper, but steps is roughly the size of the image divided by the size of the feature map. For example, dividing 300 by 38 gives 7.89, which becomes 8.
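To check the count per cell, here is a rough sketch that lists the (w, h) of every Default Box at one cell, including the extra box for aspect ratio 1 with scale sqrt(s_k * s_(k+1)) as in the paper. The function name `cell_box_sizes` and the example scales 0.2 and 0.34 are only for illustration.

```python
import math


def cell_box_sizes(s_k, s_k_next, aspect_ratios):
    """(w, h) of every default box placed at one feature map cell."""
    sizes = []
    for ar in aspect_ratios:
        sizes.append((s_k * math.sqrt(ar), s_k / math.sqrt(ar)))
        if ar == 1:
            # extra square box with scale sqrt(s_k * s_{k+1})
            s_extra = math.sqrt(s_k * s_k_next)
            sizes.append((s_extra, s_extra))
    return sizes


print(len(cell_box_sizes(0.2, 0.34, [1, 2, 3, 1 / 2, 1 / 3])))  # 6
print(len(cell_box_sizes(0.2, 0.34, [1, 2, 1 / 2])))            # 4
```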
steps: [8, 16, 32, 64, 100, 300]
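To see how these steps relate to f_k, the sketch below divides the image size by each step. The image size of 300 comes from the example above; the list of feature map sizes between 38 and 1 is an assumption on my part (the usual 300 x 300 layout), not something taken from the config.

```python
image_size = 300
steps = [8, 16, 32, 64, 100, 300]
feature_map_sizes = [38, 19, 10, 5, 3, 1]  # assumed; only 38 and 1 are from my model

# f_k as used by implementations that work with steps
effective_f_k = [image_size / step for step in steps]

for step, f_k, size in zip(steps, effective_f_k, feature_map_sizes):
    print(f"step={step:3d}  image_size/step={f_k:7.4f}  feature map size={size}")
```

Under that assumption, image_size / step is close to the feature map size but not always identical (for example 300 / 8 is 37.5, not 38), so the box centers shift slightly depending on which value you use.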
The other thing I got stuck on is the Aspect Ratio setting. It is often written like this.
aspect_ratios: [[2], [2,3], [2,3], [2,3], [2], [2]]
There was no 1/2 or 1/3, so I thought, "What's this?" It turns out that 2 alone stands for both 2 and 1/2, and 3 alone stands for both 3 and 1/3. So [2,3] means {1, 2, 3, 1/2, 1/3}.
Scales are also commonly written like this.
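The shorthand can be expanded with a small helper. This is a sketch of how I read the setting, not code from a specific implementation, and `expand_aspect_ratios` is a name I made up.

```python
def expand_aspect_ratios(config):
    """Turn e.g. [2, 3] into [1, 2, 1/2, 3, 1/3]."""
    ratios = [1.0]
    for ar in config:
        ratios += [float(ar), 1.0 / ar]
    return ratios


aspect_ratios = [[2], [2, 3], [2, 3], [2, 3], [2], [2]]
# +1 for the extra box that aspect ratio 1 gets
boxes_per_cell = [len(expand_aspect_ratios(c)) + 1 for c in aspect_ratios]
print(boxes_per_cell)  # [4, 6, 6, 6, 4, 4]
```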
scales: [30, 60, 111, 162, 213, 264, 315]
These are the already calculated s_k multiplied by the image size. For a 300-pixel image, an s_k of 0.1 gives a scales value of 30.
Finally, this may seem obvious to some, but when calculating the Default Boxes for 2 and 1/2, only the w and h for 2 are calculated. The 1/2 Default Box then uses the h of 2 as its w and the w of 2 as its h. The reason is that sqrt(2) == 1 / sqrt(1/2) and sqrt(1/2) == 1 / sqrt(2).
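Going the other way, the pixel values can be turned back into relative scales by dividing by the image size. The sketch below assumes a 300-pixel image. Pairing each value with the next one, so that seven entries serve six feature maps, is my reading of why the list has one more entry than there are feature maps.

```python
image_size = 300
scales_px = [30, 60, 111, 162, 213, 264, 315]

# relative scales s_k in units of the image size
s = [px / image_size for px in scales_px]
print([round(v, 2) for v in s])  # [0.1, 0.2, 0.37, 0.54, 0.71, 0.88, 1.05]

# (s_k, s_{k+1}) pairs: the second value feeds the extra box for aspect ratio 1
pairs = list(zip(s[:-1], s[1:]))
print(len(pairs))  # 6
```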
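Putting the pieces of this post together, here is a minimal sketch of a full Default Box generator that uses the swap for the 1/2 and 1/3 boxes. It is not taken from any particular implementation. The feature map sizes between 38 and 1 are an assumption, the centers use the feature map size directly instead of steps, and the function name `default_boxes` is my own.

```python
import math
from itertools import product


def default_boxes(feature_map_sizes, scales, aspect_ratios):
    """Return a list of [cx, cy, w, h], all relative to the image size."""
    boxes = []
    for k, f_k in enumerate(feature_map_sizes):
        s_k, s_next = scales[k], scales[k + 1]
        s_extra = math.sqrt(s_k * s_next)
        for j, i in product(range(f_k), repeat=2):
            cx, cy = (i + 0.5) / f_k, (j + 0.5) / f_k
            boxes.append([cx, cy, s_k, s_k])          # aspect ratio 1
            boxes.append([cx, cy, s_extra, s_extra])  # extra box for ratio 1
            for ar in aspect_ratios[k]:
                w, h = s_k * math.sqrt(ar), s_k / math.sqrt(ar)
                boxes.append([cx, cy, w, h])          # aspect ratio ar
                boxes.append([cx, cy, h, w])          # 1/ar: swap w and h
    return boxes


feature_map_sizes = [38, 19, 10, 5, 3, 1]  # assumed; only 38 and 1 are from my model
scales = [px / 300 for px in [30, 60, 111, 162, 213, 264, 315]]
aspect_ratios = [[2], [2, 3], [2, 3], [2, 3], [2], [2]]

boxes = default_boxes(feature_map_sizes, scales, aspect_ratios)
print(len(boxes))  # 8732
```

With these settings the count matches the 8732 Default Boxes mentioned at the beginning.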
Thank you for reading to the end! My mother tongue is not Japanese, so there may be things I explained poorly or words I used strangely. If you have any questions, please leave a comment and I will try to answer as best I can!
I'd like to post more and more while implementing SSD, so please look forward to the next post!