YOLO stands for You Only Look Once (presumably also a play on the slang acronym). It is a speed-oriented neural network for object detection and recognition, built on a C framework called darknet. The original paper is here. Its distinguishing feature is that the class probabilities and the bounding-box coordinates of objects are inferred directly from the whole image in a single pass. Unlike R-CNN and its relatives, it does not need to run inference repeatedly over many candidate regions, so it is fast precisely because it only looks once.
There is already an example of porting it to TensorFlow, so this time I will try porting it to Chainer. The point of converting something built in C into a scripting language is to understand its internals. In this article I only run inference with the pretrained weights, but reading the paper, the training procedure seems to be one of its biggest features, so I would like to trace that eventually.
See here for the code.
YOLO comes in normal, small, tiny, and other sizes, but this time I only convert tiny.
Note that tiny-yolo and yolo-tiny appear to be different versions: tiny-yolo is the more modern one, using Batch Normalization and so on, but this time I use the older yolo-tiny, which has already been converted to a TensorFlow model.
The weights of the TensorFlow version of yolo-tiny are available here, so I dump them out with numpy.
```python
import tensorflow as tf
from YOLO_tiny_tf import *
import numpy as np

c = YOLO_TF()
with c.sess.as_default():
    for v in tf.trainable_variables():
        name = v.name
        ary = v.eval()
        np.save("npy/" + name, ary)
```
This writes the coefficients in numpy format under the npy folder. `tf.trainable_variables()` lists the trainable parameters, the counterpart of the parameters held by a `Link` in Chainer, and `v.eval()` corresponds to `.data` in Chainer.
Incidentally, YOLO_tensorflow itself uses a modified darknet that writes the coefficients out as text in order to convert them from darknet.
We will convert these weights to Chainer.
The configuration of yolo-tiny, if you follow the history on github, is this. Essentially it repeats 3x3 convolutions and max pooling. I describe a Chain following this specification.
For a model with this many stages whose original definition is known, it is easier to write it Sequential-style, without distinguishing between Link and Function, as in Keras.
The next step is to replace the coefficients of the Chain built this way, but since Chainer is Define-by-Run, W, b, and so on cannot even be accessed until a forward pass has been run once. This time I fed a zero matrix forward once and then rewrote the weights. You could probably also specify them at construction time with initialW, initial_bias, etc.
The layout of the Convolution2D weights differs considerably between TensorFlow and Chainer:

- TensorFlow: [kernel height, kernel width, input channels, output channels]
- Chainer: [output channels, input channels, kernel height, kernel width]

so be careful when rearranging them. The weight matrices of the fully connected layers are transposed as well.
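Concretely, the rearrangement is a single `transpose` per layer (the shapes below are hypothetical, just to show the axis order):

```python
import numpy as np

# TensorFlow conv filter layout: [kernel_h, kernel_w, in_ch, out_ch]
w_tf = np.arange(3 * 3 * 3 * 16, dtype=np.float32).reshape(3, 3, 3, 16)

# Chainer expects [out_ch, in_ch, kernel_h, kernel_w]
w_ch = w_tf.transpose(3, 2, 0, 1)
print(w_ch.shape)  # (16, 3, 3, 3)

# Fully connected layers: TensorFlow stores [in, out], Chainer [out, in]
w_fc_tf = np.zeros((4096, 1470), dtype=np.float32)
w_fc_ch = w_fc_tf.T
```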
Also note that the default slope of leaky_relu is 0.2 in Chainer but 0.1 in YOLO, so it has to be passed explicitly, as in `F.leaky_relu(x, slope=0.1)`.
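A quick numpy check of what that slope difference means for negative inputs:

```python
import numpy as np

def leaky_relu(x, slope):
    # Identity for x >= 0, scaled by `slope` for x < 0
    return np.where(x >= 0, x, slope * x)

x = np.array([-1.0, 0.5])
print(leaky_relu(x, 0.2))  # Chainer's default -> [-0.2  0.5]
print(leaky_relu(x, 0.1))  # YOLO's slope     -> [-0.1  0.5]
```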
The code that builds the Chain and converts the weights is here. Other variants such as small should be doable in the same way.
The weights are trained on Pascal VOC, a dataset posing a 20-class classification problem.
The whole image is divided into a 7x7 grid, and for each grid cell the network directly infers everything at once: the class probabilities, up to two bounding boxes centered in that cell, and the confidence of those two boxes.
The bounding-box coordinates are the center position x, y normalized to 0-1.0 within the grid cell, and the size is the square root of the width and height normalized to 0-1.0 by the image size. The square root is used because the authors want to shrink the error penalty as boxes get larger.
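A tiny numpy illustration of why the square root helps: the same absolute width error costs less, after the square root, for a large box than for a small one.

```python
import numpy as np

# The same 0.05 width error, once for a small box and once for a large one
small = abs(np.sqrt(0.10) - np.sqrt(0.15))
large = abs(np.sqrt(0.80) - np.sqrt(0.85))
print(small, large)  # the sqrt-space error is smaller for the large box
```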
The output vector of the final stage contains:

- 7x7x20: 20 class probabilities per grid cell
- 7x7x2: confidence for each of the two bounding boxes in each cell
- 7x7x2x4: the two sets of bounding-box coordinates per cell (normalized to 0-1)
After reshaping, pick out the high-confidence bounding boxes and retrieve their coordinates and classes.
At inference time it is enough to multiply the class probabilities by the box confidences and cut at a threshold; at training time, it seems some care is taken so that only the important parts weigh heavily in the loss function.
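A sketch of that decoding step in numpy. The layout (probabilities, then confidences, then boxes: 980 + 98 + 392 = 1470 values) follows the order listed above; the 0.1 threshold is just an example value, and `out` is random here in place of a real network output:

```python
import numpy as np

out = np.random.rand(1470).astype(np.float32)  # stand-in for the final-layer output

probs = out[:980].reshape(7, 7, 20)       # class probabilities per cell
conf = out[980:1078].reshape(7, 7, 2)     # confidence of the 2 boxes per cell
boxes = out[1078:].reshape(7, 7, 2, 4)    # x, y, sqrt(w), sqrt(h) per box

# Class-specific score = class probability * box confidence
scores = probs[:, :, None, :] * conf[:, :, :, None]  # shape (7, 7, 2, 20)

# Keep only detections above a threshold
threshold = 0.1
cell_y, cell_x, box, cls = np.where(scores > threshold)

# Undo the square root on width/height when reading the boxes back
for cy, cx, b, c in zip(cell_y, cell_x, box, cls):
    x, y, sw, sh = boxes[cy, cx, b]
    w, h = sw ** 2, sh ** 2
```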
Let's try it on Flickr images marked "Commercial use & mods allowed".
Train: Looks good.
Sheep: It's a sheep.
Bicycle: The person's bounding box is shifted a little. The crowd in the background doesn't get picked up.
Birds: I wish it would find the person on the right as well. Perhaps it would appear if the threshold were lowered.
Cat: It's a cat.
PottedPlant: Something like this, I suppose.
As the paper states repeatedly, it is a bold architecture that outputs the bounding-box coordinates directly from a single network. I think the key is the training process that makes this bold idea actually work.
On the other hand, as also described in the paper, the accuracy of the bounding boxes is not that high, especially in tiny. Whether this matters depends on the application.
This was my first model conversion between TensorFlow and Chainer, but it seems doable as long as you pay attention to the layout of the coefficients. A general-purpose tool could probably do it too, but that is beyond the scope of my hobby.
YOLO ver2 is already out. If you are interested, see:

- Identify Penpainappo and AppoPen (YOLO ver2 in Chainer)
- Implementation including training, by leetenki