Dreaming up 3D from 2D video [Unsupervised Monocular Depth Learning in Dynamic Scenes]

Motivation

Wouldn't it be nice if you could turn an ordinary (2D) video into 3D? It could be useful for robotics and autonomous driving, and you could enjoy your own videos in 3D. The dreams just keep growing.

Introduction: what does this article cover?

This article introduces **Unsupervised Monocular Depth Learning in Dynamic Scenes** (https://arxiv.org/abs/2010.16404), a self-supervised monocular depth and motion estimation method proposed in 2020.

To put it plainly, it is a deep learning method that estimates 3D structure (depth) and object motion from an ordinary (2D) video, **without any need for humans to prepare ground-truth data**. fig_0.png

Summary

What is self-supervised learning in the first place?

**It is an attractive approach that generates the supervision signal from the data itself, so no human labeling or annotation is required.** There is already a great article on the topic, so please refer to it for details: https://qiita.com/omiita/items/a7429ec42e4eef4b6a4d

The method introduced here uses pairs of consecutive frames and solves the task of predicting the second frame (the "answer") from the first frame; **in the process, it learns to predict the 3D coordinates (xyz) and the motion of the scene**.

Model overview (during inference)

Again, this method uses consecutive frames and solves the task of predicting the second frame (the answer) from the first frame, and in the process it aims to predict the 3D coordinates (xyz) and motion of the scene.

The training phase is a bit complicated, so let's first look at the easier-to-understand inference model. The overall configuration is shown in the figure below. fig_1.png

As shown in the figure, the model consists of two networks, a Depth Network and a Motion Network. **The Depth Network estimates depth from an image, and the Motion Network estimates camera motion, object motion, and the camera parameters from the generated depth images and two frames of RGB.** Given the camera parameters (field of view, etc.) and a depth image, the scene can be expressed three-dimensionally (in xyz).

The inputs and outputs of each network are roughly summarized below.

(1) Depth Network

- **Input:** RGB image (3ch)
- **Output:** Depth image (1ch)

(2) Motion Network

- **Input:** 2 frames of RGB image + the generated Depth images
- **Output @ bottleneck:** Camera motion (XYZ translation and XYZ Euler angles, 6 parameters in total) and camera matrix (focal length and principal point along the image height and width, 4 parameters in total)
- **Output @ decoder:** Object motion (a 3ch xyz translation vector for each pixel in the image)
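
To make these inputs and outputs concrete, here is a minimal sketch of what the two network interfaces might look like, assuming PyTorch. The layer sizes, names, and internal structure are purely illustrative and are not the authors' implementation.

```python
import torch
import torch.nn as nn

class DepthNetwork(nn.Module):
    """RGB image (3ch) -> Depth image (1ch)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(nn.Conv2d(32, 1, 3, padding=1), nn.Softplus())  # depth > 0

    def forward(self, rgb):                      # (B, 3, H, W)
        return self.decoder(self.encoder(rgb))   # (B, 1, H, W)

class MotionNetwork(nn.Module):
    """Two RGB-D frames (8ch) -> camera motion (6), intrinsics (4), per-pixel object motion (3ch)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(8, 32, 3, stride=2, padding=1), nn.ReLU())
        self.bottleneck = nn.AdaptiveAvgPool2d(1)
        self.camera_motion = nn.Conv2d(32, 6, 1)   # XYZ translation + XYZ Euler angles
        self.intrinsics = nn.Conv2d(32, 4, 1)      # fx, fy, cx, cy
        self.object_motion = nn.Sequential(        # decoder part: per-pixel 3D translation
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(32, 3, 3, padding=1))

    def forward(self, rgbd_pair):                  # (B, 8, H, W): two concatenated RGB-D frames
        feat = self.encoder(rgbd_pair)
        b = self.bottleneck(feat)
        return (self.camera_motion(b).flatten(1),  # (B, 6)
                self.intrinsics(b).flatten(1),     # (B, 4)
                self.object_motion(feat))          # (B, 3, H, W)
```

The point is simply the shape of what goes in and out: depth per pixel, 6 camera-motion parameters, 4 intrinsics, and a 3ch object-motion map.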

Model overview (during training)

At training time, using the quantities estimated by the model above, **we create a warped image that projects the second frame back into the viewpoint of the original (first) frame**. In other words, we logically derive something like "given the inferred camera motion and object motion, this is how the scene should look from the original frame." The model is trained so that this warped image matches the original image.
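
To make the warping step concrete, here is a rough numpy sketch under simplifying assumptions: a pinhole camera, a rotation matrix already built from the Euler angles, and nearest-neighbor sampling instead of the differentiable bilinear sampling used in practice. The function and variable names are mine, not from the paper.

```python
import numpy as np

def warp_second_frame(frame2, depth1, rotation, translation, object_motion, fx, fy, cx, cy):
    """Reconstruct frame 1 by sampling frame 2 where each frame-1 pixel is predicted to land."""
    h, w = depth1.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    # 1) Back-project frame-1 pixels into 3D using depth and intrinsics.
    x = (u - cx) / fx * depth1
    y = (v - cy) / fy * depth1
    points = np.stack([x, y, depth1], axis=-1)                  # (H, W, 3)
    # 2) Apply camera motion plus per-pixel object motion.
    moved = points @ rotation.T + translation + object_motion   # (H, W, 3)
    # 3) Re-project into frame 2's image plane.
    u2 = fx * moved[..., 0] / moved[..., 2] + cx
    v2 = fy * moved[..., 1] / moved[..., 2] + cy
    # 4) Sample frame 2 at (u2, v2); nearest neighbour here, bilinear in practice.
    u2 = np.clip(np.round(u2).astype(int), 0, w - 1)
    v2 = np.clip(np.round(v2).astype(int), 0, h - 1)
    return frame2[v2, u2]                                       # warped image, compared to frame 1
```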

In addition, several constraints are applied during training. The configuration is as follows. fig_2.png It is a little complicated. Let me briefly explain the losses.
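
The main term that drives learning is the photometric comparison between the warped image and the original frame. For reference, here is a minimal sketch of the simplified SSIM loss that is commonly used for this comparison, assuming PyTorch; the exact formulation in the paper may differ.

```python
import torch
import torch.nn.functional as F

def ssim_loss(img1, img2, c1=0.01**2, c2=0.03**2):
    """Structural-similarity photometric loss between warped and original images (B, 3, H, W)."""
    mu1, mu2 = F.avg_pool2d(img1, 3, 1), F.avg_pool2d(img2, 3, 1)
    sigma1 = F.avg_pool2d(img1 * img1, 3, 1) - mu1 ** 2
    sigma2 = F.avg_pool2d(img2 * img2, 3, 1) - mu2 ** 2
    sigma12 = F.avg_pool2d(img1 * img2, 3, 1) - mu1 * mu2
    ssim = ((2 * mu1 * mu2 + c1) * (2 * sigma12 + c2)) / \
           ((mu1 ** 2 + mu2 ** 2 + c1) * (sigma1 + sigma2 + c2))
    return torch.clamp((1 - ssim) / 2, 0, 1).mean()   # 0 when images match perfectly
```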

Motion regularization is a bit tricky: it gently suppresses variation in the predicted motion map while still preferring flat plateaus, reflecting the assumption that the velocity of an object does not change from place to place within the object.
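
As a rough illustration of what such a regularizer looks like, here is a minimal sketch that combines spatial smoothness (flat plateaus inside an object) with sparsity (zero motion where nothing moves). The exact regularizer in the paper is more involved; the weights and names here are mine.

```python
import torch

def motion_regularization(motion, smooth_weight=1.0, sparsity_weight=1.0):
    """Encourage a piecewise-constant, mostly-zero object motion map of shape (B, 3, H, W)."""
    # Smoothness: penalize spatial differences so motion is flat inside an object.
    dx = (motion[..., :, 1:] - motion[..., :, :-1]).abs().mean()
    dy = (motion[..., 1:, :] - motion[..., :-1, :]).abs().mean()
    # Sparsity: prefer zero motion where nothing is moving.
    sparsity = motion.abs().mean()
    return smooth_weight * (dx + dy) + sparsity_weight * sparsity
```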

Prediction results (excerpt from the paper)

The figure below shows example inference results for several datasets. From left to right: the original image, the inferred depth image (shown as disparity, i.e. 1/depth), and the inferred motion image. The results are occasionally odd, but overall they look quite clean. Since the camera parameters are also predicted, you can even train on YouTube videos. fig_4.png
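
As a small aside, converting a predicted depth map into a disparity image for display is straightforward; a minimal sketch (the function name and normalization are mine, just for visualization):

```python
import numpy as np

def depth_to_disparity(depth, eps=1e-6):
    """Convert a depth map to disparity (1/depth) normalized to [0, 1] for display."""
    disparity = 1.0 / np.maximum(depth, eps)
    return (disparity - disparity.min()) / (disparity.max() - disparity.min() + eps)
```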

So what is actually amazing about it?

Subjectively, I like the following points:

You can estimate depth from a monocular video without any supervision!

The most straightforward way to learn depth would be to collect image-depth pairs and do supervised image ⇒ image learning. The elegant point of this method is that no ground-truth data is needed.

No camera parameters required!

With a standard approach you would want to use information such as the camera's field of view to lift the depth map into 3D, but this method boldly estimates even that.

You can estimate XYZ motion, not pixel movement!

It estimates motion along all three XYZ axes, not just the two-axis (up/down/left/right) movement of pixels. (This is different from optical flow.)

Camera motion and object motion can be estimated separately!

The motion caused by the camera and the motion caused by moving objects can be estimated separately. When you want to know how an object is moving, you can remove the apparent motion introduced by the camera.

Compared to past methods ...

One year before this method, the same authors published the paper Depth from Videos in the Wild. In that earlier paper, moving objects were masked with a pre-trained object detection model in order to separate object motion from camera motion. (Figure below excerpted from the paper.) fig_4.png

In contrast, the method introduced here removes the need for object detection of moving objects by adding motion-related regularization and assumptions. It is simpler as a model.

For the lineage of monocular depth estimation, the slides by Kazuyuki Miyazawa (DeNA) are extremely thorough, so please refer to them: https://www.slideshare.net/KazuyukiMiyazawa/depth-from-videos-in-the-wild-unsupervised-monocular-depth-learning-from-unknown-cameras-167891145

I actually tried it.

Referring to the authors' implementation on GitHub, I tried to **build and train the model myself**. **The following gets a little deep into the weeds, so feel free to skip it if you are not interested.**

Results

Below are some inference results from the model after training for 80 epochs on about 10,000 frames of KITTI tracking data. **The top three are relatively good examples, and the bottom three are somewhat disappointing ones.** Since it is monocular estimation, the depth cannot be recovered perfectly, but I think it does fairly well.

Even when the depth is misread, the model can sometimes make things consistent by using motion, which produces strange predictions. The road sign at the lower left is a typical example: the depth is clearly wrong. Because the depth is misread, the sign would appear distorted in the next frame if only camera motion were considered, so the model forcibly reconciles this by deciding that the sign itself is moving. fig_5_ok.png fig_5_ng.png At this point the losses were about 0.02 for motion regularization, 0.0007 for depth smoothness, and 0.52 for the RGB & motion cycle loss; the SSIM image-similarity loss accounts for by far the largest share.

Implementation points

To be honest, I struggled quite a bit until I was able to get the results shown above.

Building the model itself should not be too difficult if you are used to handling depth images. However, **because there are so many estimation targets that all influence the final warped image, I found it quite hard to keep the training balanced.** Personally, training was harder than modeling.

**There are local optima everywhere during training, and if you train carelessly you will quickly fall into one of them.** It may not be fun to read, but here are some of the traps I ran into (a small training-loop sketch follows after the list):

**1. Stuck with excessive camera motion (depth: too small, motion: too large)** The model settles into a solution where the target is extremely close to the camera and the camera moves a great deal; the frames can still be explained consistently, so the solution is stable. As a countermeasure, constrain the motion explicitly or arrange for the initial predictions to be small. Proper gradient clipping during training also helps.

**2. Stuck explaining everything with object motion (motion: too large)** Whether a distant object moves a lot or a nearby object moves a little, the apparent change seen from the camera is the same, so the model can reconcile everything this way and the solution becomes stable. As a countermeasure, freeze the object-motion estimation layers for a while: at first train with only depth and camera motion, and enable object motion toward the end.

**3. Stuck in the "better not to move at all" solution (motion: too small)** If clumsy motion predictions distort the warped image, predicting no motion at all becomes a neat stable point. If the initial values are too small or the gradient clipping is too strong, the model gets lazy and learning does not progress, so moderate randomness is needed.
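
To tie the countermeasures from pitfalls 1 and 2 together, here is a sketch of the kind of training loop I mean: gradient clipping plus a delayed object-motion head. `depth_net`, `motion_net`, `object_motion_head`, `loader`, and `compute_losses` are hypothetical placeholders for your own components, and the unfreeze epoch is purely illustrative.

```python
import torch

def train_with_safeguards(depth_net, motion_net, object_motion_head, loader, compute_losses,
                          epochs=80, unfreeze_epoch=20, max_grad_norm=1.0):
    """Training-loop sketch: gradient clipping (pitfall 1) and delayed object motion (pitfall 2)."""
    params = list(depth_net.parameters()) + list(motion_net.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-4)
    for epoch in range(epochs):
        # Keep the object-motion head frozen at first so depth and camera motion are learned alone.
        for p in object_motion_head.parameters():
            p.requires_grad = epoch >= unfreeze_epoch
        for batch in loader:
            loss = compute_losses(batch)   # warp reconstruction + regularization terms
            optimizer.zero_grad()
            loss.backward()
            # Clip gradients so early, wildly wrong motion predictions cannot blow up training.
            torch.nn.utils.clip_grad_norm_(params, max_norm=max_grad_norm)
            optimizer.step()
```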

There were plenty of other traps as well. Depending on the application and the video, it may be better to add appropriate constraints such as a maximum motion. I really respect the authors.

In closing

Last year, shortly after I first started using Python, I saw a commentary on Depth from Videos in the Wild in Nikkei Robotics and was amazed at the sheer power of deep learning. At the time my implementation skills were too weak to do anything about it, but about a year later I was able to implement a paper from the same line of research, so I can feel a little growth.

However, I still have a lot to learn, so please point out any mistakes. Thank you for reading.
