I may not explain everything perfectly, but I hope you can get the general idea. I have also written summaries of stereo depth, so if you are interested, please see https://qiita.com/minh33/items/55717aa1ace9d7f7e7dd and https://qiita.com/minh33/items/6b8d37ce08f85d3a3479
For monocular
By moving a car with a camera mounted on it, you can warp the current frame to the previous frame. First, the network estimates the depth of the image at t = t. Since depth is known, a 3D point cloud can be computed. Self-pose estimation is used to find how far the camera (car) has moved; VSLAM, odometry, GPS, IMU, etc. can be used for this. By transforming the 3D point cloud with the per-frame change in x, y, z, roll, pitch, and yaw computed above, we obtain the 3D point cloud at t = t-1. Projecting it back to an image view lets you warp the image at t = t to the image at t = t-1. However, there are also disadvantages: nothing can be learned unless the camera moves, and if an object in the scene is itself moving, it will be misaligned even after warping. (A sketch of this view synthesis step follows below.)
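As a rough illustration of this pipeline, here is a minimal PyTorch-style sketch (not the implementation of any specific paper). In practice this step is commonly done as inverse warping: each target pixel is back-projected with its estimated depth, transformed by the relative camera pose into the other frame, re-projected, and the other frame's image is sampled there. The function name `view_synthesis`, the tensor shapes, and the pose/intrinsics arguments are assumptions made for this example.

```python
import torch
import torch.nn.functional as F

def view_synthesis(I_s, D_t, K, T_t_to_s):
    """Synthesize the target-view image by sampling the source image I_s,
    using the estimated target depth D_t and the relative pose T_t_to_s.
    I_s: (B,3,H,W) source image, D_t: (B,1,H,W) depth,
    K: (B,3,3) intrinsics, T_t_to_s: (B,4,4) pose from target to source frame."""
    B, _, H, W = I_s.shape
    device = I_s.device

    # Pixel grid in homogeneous coordinates
    ys, xs = torch.meshgrid(torch.arange(H, device=device),
                            torch.arange(W, device=device), indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=0).float().view(1, 3, -1)   # (1,3,HW)

    # Back-project pixels to 3D using the estimated depth: X = D * K^-1 * p
    cam_points = torch.inverse(K) @ pix.expand(B, -1, -1)             # (B,3,HW)
    cam_points = cam_points * D_t.view(B, 1, -1)

    # Transform the 3D point cloud into the other (t-1) camera frame
    cam_points_h = torch.cat(
        [cam_points, torch.ones(B, 1, H * W, device=device)], dim=1)  # (B,4,HW)
    src_points = (T_t_to_s @ cam_points_h)[:, :3]                     # (B,3,HW)

    # Project back onto the source image plane
    proj = K @ src_points
    uv = proj[:, :2] / (proj[:, 2:3] + 1e-7)

    # Normalize to [-1, 1] for grid_sample and bilinearly sample I_s
    u = 2.0 * uv[:, 0] / (W - 1) - 1.0
    v = 2.0 * uv[:, 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)
    return F.grid_sample(I_s, grid, padding_mode="border", align_corners=True)
```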
・ It => target image (t = t)
・ Is => source image (t = t-1)
・ Dt => target depth (ground truth distance measured with LiDAR)
・ D^t => estimated target depth
・ I^t => estimated target image
・ View Synthesis => image reconstruction
・ Photometric Loss => comparison of the estimated image and the actual image
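With the notation above, the Photometric Loss can be written very compactly. The sketch below uses a plain L1 difference as an assumption for illustration; real implementations usually combine L1 with SSIM (see the losses further below). `view_synthesis` refers to the illustrative function sketched earlier.

```python
import torch

def photometric_loss(I_t, I_t_hat):
    """L1 Photometric Loss between the target image It and the
    reconstructed image I^t (both of shape (B, 3, H, W))."""
    return torch.mean(torch.abs(I_t - I_t_hat))

# Illustrative usage: synthesize I^t from the source image Is, then compare with It
# I_t_hat = view_synthesis(I_s, D_t_hat, K, T_t_to_s)
# loss = photometric_loss(I_t, I_t_hat)
```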
This is used to calculate the Loss in No. 3: you can convert the estimated Depth to Disparity and warp the right image onto the left image. By the way, some people may wonder whether this is binocular rather than mono depth; distance estimation itself is done with a single camera, and the opposite lens is only used as the ground truth for training. (A sketch of this disparity-based warping follows below.)
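Here is a minimal sketch of that right-to-left warping. It assumes the network predicts the left disparity in units normalized by image width, and the sign of the horizontal shift depends on the stereo rig convention, so treat it as illustrative rather than a reference implementation.

```python
import torch
import torch.nn.functional as F

def warp_right_to_left(I_right, disp_left):
    """Reconstruct the left image by sampling the right image at positions
    shifted horizontally by the left disparity map.
    I_right: (B,3,H,W) right image, disp_left: (B,1,H,W) disparity
    normalized by image width (disparity = focal * baseline / depth)."""
    B, _, H, W = I_right.shape
    device = I_right.device
    xs = torch.linspace(-1, 1, W, device=device).view(1, 1, W).expand(B, H, W)
    ys = torch.linspace(-1, 1, H, device=device).view(1, H, 1).expand(B, H, W)

    # Shift the sampling coordinates by the disparity (scaled to the [-1, 1] range);
    # the sign of the shift depends on which camera is left/right in your setup.
    xs_shifted = xs - 2.0 * disp_left.squeeze(1)
    grid = torch.stack([xs_shifted, ys], dim=-1)                      # (B,H,W,2)
    return F.grid_sample(I_right, grid, padding_mode="border", align_corners=True)
```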
This paper is probably the most famous one in monodepth.
・ Reconstruction Loss: A reconstructed left image can be created by warping the right image using the estimated left Disparity. Calculate the SAD and SSIM between this reconstruction and the input left image. Do the same in the opposite direction.
・ LR Consistency Loss: Warp the right Disparity map into the left Disparity map and calculate the absolute difference between the two disparities. Do the same in the opposite direction.
・ Smoothness Loss: Since the Disparity (Depth) of nearby pixels should be almost the same if they belong to the same object, the smoothness of the Disparity is penalized, for example with a Laplacian smoothness term. This is applied to both the left and the right Disparity maps. (A combined sketch of these three losses follows this list.)
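Below is a compact, illustrative sketch of the three loss terms. The SSIM window size, the alpha weighting, and the use of a first-order edge-aware gradient penalty (instead of a Laplacian term) are assumptions made for this example, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def ssim(x, y):
    """Simplified SSIM over 3x3 windows, returned as a dissimilarity in [0, 1]."""
    C1, C2 = 0.01 ** 2, 0.03 ** 2
    mu_x = F.avg_pool2d(x, 3, 1, padding=1)
    mu_y = F.avg_pool2d(y, 3, 1, padding=1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y
    ssim_n = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    ssim_d = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return torch.clamp((1 - ssim_n / ssim_d) / 2, 0, 1)

def reconstruction_loss(I, I_rec, alpha=0.85):
    """Weighted sum of SSIM and L1 (SAD) between the input and reconstructed image."""
    return torch.mean(alpha * ssim(I, I_rec) + (1 - alpha) * torch.abs(I - I_rec))

def lr_consistency_loss(disp_left, disp_right_warped_to_left):
    """Absolute difference between the left disparity and the right disparity
    warped into the left view (the full loss also does the opposite direction)."""
    return torch.mean(torch.abs(disp_left - disp_right_warped_to_left))

def smoothness_loss(disp, image):
    """Edge-aware smoothness: penalize disparity gradients except at image edges."""
    d_dx = torch.abs(disp[:, :, :, :-1] - disp[:, :, :, 1:])
    d_dy = torch.abs(disp[:, :, :-1, :] - disp[:, :, 1:, :])
    i_dx = torch.mean(torch.abs(image[:, :, :, :-1] - image[:, :, :, 1:]), 1, keepdim=True)
    i_dy = torch.mean(torch.abs(image[:, :, :-1, :] - image[:, :, 1:, :]), 1, keepdim=True)
    return torch.mean(d_dx * torch.exp(-i_dx)) + torch.mean(d_dy * torch.exp(-i_dy))
```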