I tried 3D detection of a car

TL;DR

--I tried 3D detection of cars using only images from a monocular camera
--I applied CenterNet, which is used for human pose estimation
--There is still a lot of room for improvement, but it works to some extent

Motivation

Through work, I deal with a company that specializes in image recognition, and that company is simply amazing. Using images alone, it detects objects three-dimensionally (for example, telling which part is the front or rear face of a car and which part is its left or right side).

Until now, I had only worked with object detection that encloses an object in a rectangular bounding box, such as YOLO or SSD, and I could not imagine what kind of method was being used. This article is a record of surveying image-only 3D object detection methods and trying 3D detection of cars myself.

Let's investigate 3D object detection

The goal this time was to recognize a car three-dimensionally from images alone, as in this figure: w.png

First, I looked into methods that could be used for 3D object detection from images alone. I simply hadn't known about them, but there turn out to be various approaches.

--Pose estimation

A method of estimating a person's pose and joint positions from an image. For objects other than humans whose shape is fixed to some extent, it can be reused by replacing the connections between human joints with the structure of that shape. For human pose estimation, DeNA's blog summarizes recent trends in an easy-to-understand way. https://engineer.dena.com/posts/2019.11/cv-papers-19-2d-human-pose-estimation/

--Depth estimation

A method of estimating the depth from the camera to an object based on an image. It's amazing. https://ai.googleblog.com/2019/05/moving-camera-moving-people-deep.html However, to perform object detection, I think it has to be either combined with an existing image-based method or followed by a further object detection step on the estimated depth. This paper takes the latter approach. https://arxiv.org/pdf/1812.07179.pdf

--Semantic segmentation

A method of classifying, for each pixel of an image, what is shown there. If the front, rear, left, and right faces of a car are given different classes, it should be possible to recognize the car in three dimensions.

This time, I decided to apply the method used for pose estimation, for the following reasons.

--Even when a car is partly hidden behind other objects, it seems possible to infer the hidden part from the part visible in the image
--It seems relatively easy to implement

Implementation

How the ground-truth data is defined

This time, I used CenterNet, which is also used for pose estimation. CenterNet estimates pose by learning the center point of each target, the position of each joint, and the vectors between joints. (Besides pose estimation, CenterNet can apparently also be used for conventional bounding-box detection and for estimating the xyz coordinates of a target.) Screen Shot 2020-08-20 at 22.26.06.png https://github.com/xingyizhou/CenterNet
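As a rough illustration of how center points can be read out of a CenterNet-style heatmap: a cell is treated as a detection when it is the maximum of its 3x3 neighborhood and its score exceeds a threshold, and the predicted vectors are then followed from that point. This is a minimal sketch in plain Python. The names `find_peaks` and `follow_offset`, the threshold value, and the toy heatmap are assumptions for illustration and are not taken from the CenterNet repository.

```python
def find_peaks(heatmap, threshold=0.3):
    """Return (row, col, score) for every local maximum above threshold.

    heatmap is a list of rows of floats in [0, 1]. A cell is kept when no
    cell in its 3x3 neighborhood has a larger value (hypothetical helper).
    """
    height, width = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(height):
        for c in range(width):
            score = heatmap[r][c]
            if score < threshold:
                continue
            neighbors = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(height, r + 2))
                for cc in range(max(0, c - 1), min(width, c + 2))
            ]
            if score >= max(neighbors):
                peaks.append((r, c, score))
    return peaks


def follow_offset(point, offset):
    """Move from a point along a predicted (d_row, d_col) vector."""
    return (point[0] + offset[0], point[1] + offset[1])


# Toy heatmap with two car centers (values are made up).
toy_heatmap = [
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.2, 0.0],
    [0.0, 0.2, 0.1, 0.6],
]
peaks = find_peaks(toy_heatmap)
```

Because only local maxima are kept, no separate non-maximum suppression over boxes is needed, which is part of what keeps this family of methods simple.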

So that the same method as pose estimation can be applied, I treated the cuboid of a car as if it were a chain of connections like this:

Center point of the car
┣ Front or rear face of the car
┃ ┗ Four vertices of the quadrangle forming the front or rear face
┗ Left or right side of the car
  ┗ Four vertices of the quadrangle forming the left or right side

The following were given as ground-truth data: the center point of the car, the center point of each face, the positions of the cuboid's vertices, the vector from the center point of the car to the center point of each face, and the vectors from the center point of each face to the cuboid's vertices.

1.png

This is what the training data looks like when visualized. In the figure above, the following elements are color-coded:

Center point of the car
┣ Front or rear face of the car
┃ ┗ Four vertices of the quadrangle forming the front or rear face
┗ Left or right side of the car
  ┗ Four vertices of the quadrangle forming the left or right side
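The ground truth described above could be encoded roughly as follows: a Gaussian bump on a heatmap for the car's center point, plus the vectors from the car center to a face center and from the face center to its four vertices. This is only a sketch under my own assumptions. The function names, the `sigma` value, the dictionary layout, and the toy coordinates are hypothetical and do not reproduce the actual implementation.

```python
import math


def gaussian_heatmap(height, width, center, sigma=1.0):
    """Heatmap with a Gaussian bump of peak value 1.0 at center=(row, col)."""
    cr, cc = center
    return [
        [
            math.exp(-((r - cr) ** 2 + (c - cc) ** 2) / (2.0 * sigma ** 2))
            for c in range(width)
        ]
        for r in range(height)
    ]


def vector(src, dst):
    """Vector (d_row, d_col) pointing from src to dst."""
    return (dst[0] - src[0], dst[1] - src[1])


def encode_car(car_center, face_center, face_vertices, height, width):
    """Build the targets for one car and one visible face (toy layout)."""
    return {
        "center_heatmap": gaussian_heatmap(height, width, car_center),
        "center_to_face": vector(car_center, face_center),
        "face_to_vertices": [vector(face_center, v) for v in face_vertices],
    }


# Toy example on an 8x12 output grid; all coordinates are made up.
targets = encode_car(
    car_center=(4, 6),
    face_center=(4, 3),
    face_vertices=[(2, 2), (2, 4), (6, 4), (6, 2)],
    height=8,
    width=12,
)
```

At inference time the chain is walked in the same order: find the car center on the heatmap, follow the vector to the face center, then follow the four vectors to the vertices.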

Network structure

The structure is similar to U-Net, but to save computation time the output is not upsampled back to the original image size.

u-net-architecture.png https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/

For the part that reduces the image size (the encoder), I used pretrained EfficientNet weights.

Dataset

I used KITTI's 3D Object Detection Evaluation 2017 and split its training data 19:1 into training and validation sets. http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=3d
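A 19:1 split like this can be sketched with the standard library alone. The helper name `split_train_val`, the fixed seed, and the zero-padded sample IDs are assumptions for illustration; the count of 7,481 is the number of images in KITTI's object detection training set.

```python
import random


def split_train_val(sample_ids, val_fraction=0.05, seed=0):
    """Shuffle the IDs reproducibly and hold out val_fraction for validation.

    val_fraction=0.05 corresponds to a 19:1 train/validation split.
    """
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    n_val = max(1, round(len(ids) * val_fraction))
    return ids[n_val:], ids[:n_val]


# KITTI object detection training images are named 000000 ... 007480.
all_ids = [f"{i:06d}" for i in range(7481)]
train_ids, val_ids = split_train_val(all_ids)
```

Fixing the seed keeps the validation set identical between runs, which matters when comparing different training settings.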

Training

First, I froze the weights of the EfficientNet part and ran transfer learning for 30 epochs; then I unfroze those weights and fine-tuned for 20 epochs. I have only just started the trial and error of tuning the evaluation function, so for now this is simply a first attempt at training.
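The two-phase schedule can be written down independently of any framework. The sketch below only decides, per epoch, whether the encoder should be updated; the epoch counts come from the text, while the function name and the 0-based epoch numbering are my own assumptions.

```python
FROZEN_EPOCHS = 30      # encoder weights fixed (transfer learning phase)
FINE_TUNE_EPOCHS = 20   # encoder weights released (fine-tuning phase)
TOTAL_EPOCHS = FROZEN_EPOCHS + FINE_TUNE_EPOCHS


def encoder_trainable(epoch):
    """Return True when the encoder should be updated in this 0-based epoch."""
    if not 0 <= epoch < TOTAL_EPOCHS:
        raise ValueError("epoch outside the 50-epoch schedule")
    return epoch >= FROZEN_EPOCHS


# One flag per epoch: False for the first 30 epochs, True for the last 20.
schedule = [encoder_trainable(e) for e in range(TOTAL_EPOCHS)]
```

In a framework such as Keras, this flag would typically be applied by toggling the `trainable` attribute of the encoder layers before the second phase, often together with a smaller learning rate, although which framework was used here is not stated.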

Results

Here are inference results on the test data that can be downloaded from KITTI.

Cases that are recognized relatively well

These were not cherry-picked; other images were recognized with about this level of accuracy. download (1).png download (2).png

Cases that are not recognized well

The car facing sideways in the center of the image is not recognized. There seems to be room for improvement in robustness to occlusion. download (3).png There are some false positives on the left side of the image. download (4).png

Thoughts after trying it

--Compared with YOLO, CenterNet is a simple algorithm because there is no anchor box selection, and it was surprisingly easy to implement even at a beginner-to-intermediate level. Despite that, it can handle both orthodox bounding-box detection and 3D recognition like this, so it seems usable in many ways.
--The labels for car detection (the part filled in red) were created by me from the KITTI reference data. Afterwards, it occurred to me that making good use of existing training data for vehicle instance segmentation would probably improve accuracy considerably. This implementation resembles U-Net, which is often used for segmentation, so the two should be quite compatible.

What I want to do in the future

--Improve accuracy by combining with segmentation
--Run in real time on a Jetson or similar
--Recognize the position and movement of cars (this work plus something extra, such as a Kalman filter)
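For the last item, a Kalman filter could smooth the detected position over frames and estimate velocity at the same time. The following is a minimal one-dimensional constant-velocity sketch in plain Python, not something I have implemented for this article. The noise parameters `q` and `r`, the initial covariance, and the synthetic measurements are all assumptions.

```python
def kalman_step(state, cov, z, dt=1.0, q=0.01, r=1.0):
    """One predict/update cycle of a 1D constant-velocity Kalman filter.

    state = (position, velocity), cov = 2x2 covariance as nested tuples,
    z = measured position. q and r are assumed process/measurement noise.
    """
    pos, vel = state
    (p00, p01), (p10, p11) = cov

    # Predict: x = F x and P = F P F^T + Q, with F = [[1, dt], [0, 1]].
    pos = pos + dt * vel
    a00 = p00 + dt * (p10 + p01) + dt * dt * p11
    a01 = p01 + dt * p11
    a10 = p10 + dt * p11
    a11 = p11
    p00, p01, p10, p11 = a00 + q, a01, a10, a11 + q

    # Update with a position-only measurement, H = [1, 0].
    innovation = z - pos
    s = p00 + r
    k0, k1 = p00 / s, p10 / s
    pos += k0 * innovation
    vel += k1 * innovation
    new_cov = (
        ((1 - k0) * p00, (1 - k0) * p01),
        (p10 - k1 * p00, p11 - k1 * p01),
    )
    return (pos, vel), new_cov


# Synthetic, noise-free track: the car moves 2 units per frame.
state, cov = (0.0, 0.0), ((10.0, 0.0), (0.0, 10.0))
for k in range(1, 31):
    state, cov = kalman_step(state, cov, z=2.0 * k)
```

On real detections, the measurement would be something like the detected car center in each frame, and the state would need at least two spatial dimensions.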
