Instance-wise Depth and Motion Learning from Monocular Videos


Reviewer 1

problem: when both the ego-vehicle/agent and objects move in the scene, it is difficult to accurately estimate the relative motion of the dynamic objects, and there is a flaw in how prior approaches warp the images to generate supervision.

summary:

  1. authors propose first projecting everything to an intermediate reference frame, where the estimated ego-motion has been used to warp the scene and only the object motions remain to be accounted for (see the first sketch after this list). Appropriate modifications are made to kernelise the operations and ensure everything is differentiable.
  2. They jointly optimise the depth net, ego-posenet, and obj-posenet, self-supervised by an image reconstruction loss (see the second sketch after this list).
  3. experiments are shown on:
     - depth net tested on KITTI scene flow
     - ego-posenet tested on KITTI VO
     - results suggest that posenet and depth net are complementary and that a better obj-posenet leads to a better depth net.
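To make step 1 concrete, here is a minimal (non-batched) PyTorch sketch of warping a frame's points with the estimated ego-motion alone, so that only object motion remains to be explained. The function name, shapes, and the clamp epsilon are my own assumptions, not the paper's code:

```python
import torch

def warp_with_ego_motion(depth, K, T_ego):
    """Back-project pixels with predicted depth, move them by the
    estimated ego-motion, and re-project into the image plane.
    depth: (H, W), K: (3, 3) intrinsics, T_ego: (4, 4) pose."""
    h, w = depth.shape
    v, u = torch.meshgrid(
        torch.arange(h, dtype=depth.dtype),
        torch.arange(w, dtype=depth.dtype),
        indexing="ij",
    )
    ones = torch.ones(h * w, dtype=depth.dtype)
    pix = torch.stack([u.reshape(-1), v.reshape(-1), ones])   # (3, H*W)
    cam = torch.linalg.inv(K) @ pix * depth.reshape(-1)       # 3D points
    cam = torch.cat([cam, ones[None]], dim=0)                 # homogeneous
    cam = (T_ego @ cam)[:3]                                   # apply ego-motion
    proj = K @ cam
    uv = proj[:2] / proj[2].clamp(min=1e-6)                   # re-projected coords
    return uv.reshape(2, h, w), cam[2].reshape(h, w)          # coords, warped depth
```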
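For step 2, a common form of the image reconstruction loss in this line of work mixes SSIM and L1 between the target frame and its reconstruction. This is a generic sketch; the paper's exact loss and weighting may differ:

```python
import torch
import torch.nn.functional as F

def ssim_map(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    # Local SSIM via 3x3 average pooling (Monodepth-style statistics).
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    var_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    cov = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return (num / den).clamp(0, 1)

def photometric_loss(target, recon, alpha=0.85):
    # Weighted mix of SSIM and L1; alpha=0.85 is a conventional choice,
    # not necessarily the paper's.
    l1 = (target - recon).abs()
    return (alpha * (1 - ssim_map(target, recon)) / 2 + (1 - alpha) * l1).mean()
```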

strengths:

  1. more principled/ geometrically correct way to do the projection in this problem setup
  2. it's a nice idea to fill in holes by scaling up the image and depth before projecting, so that when the result is scaled back down, some of the holes have been filled (see the sketch below).
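A rough sketch of that upsample-splat-downsample idea: forward-splatting at a finer resolution means more source pixels land on each target pixel, so fewer holes survive the downsampling. `project_fn` and the naive z-buffer here are my own stand-ins, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def splat_at_higher_resolution(img, depth, project_fn, scale=2):
    """img: (3, H, W), depth: (H, W). `project_fn` is a hypothetical
    stand-in for the motion-based projection; it returns target pixel
    coords (2, H', W') and warped depth (H', W')."""
    big = F.interpolate(img[None], scale_factor=scale, mode="bilinear",
                        align_corners=False)[0]
    big_d = F.interpolate(depth[None, None], scale_factor=scale,
                          mode="bilinear", align_corners=False)[0, 0]
    uv, z = project_fn(big_d)
    h, w = big_d.shape
    u = uv[0].round().long().clamp(0, w - 1).reshape(-1)
    v = uv[1].round().long().clamp(0, h - 1).reshape(-1)
    flat = v * w + u
    # Naive z-buffer: write far points first so near points overwrite
    # them (last-write-wins is not guaranteed on GPU; fine for a sketch).
    order = torch.argsort(z.reshape(-1), descending=True)
    out = torch.zeros_like(big).reshape(3, -1)
    out[:, flat[order]] = big.reshape(3, -1)[:, order]
    # Scale back down; the denser splat has filled in many holes.
    return F.interpolate(out.reshape(1, 3, h, w), size=img.shape[-2:],
                         mode="bilinear", align_corners=False)[0]
```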

weaknesses: not familiar enough with the field to critique

Reviewer 2

The work attempts to solve the problem of self-supervised monocular depth estimation from sequential frames and shows results on the KITTI and Cityscapes benchmarks.

Strengths:

  • The work uses the best of both worlds - forward projection for detailed moving objects and backward projection for the background motion. It's tricky to merge both into an end-to-end pipeline, but the authors seem to have done just that (see the sketch after this list).
  • By doing this, the work introduces a strong geometric prior.
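To illustrate that first point, here is a hedged sketch of how the two projections might be composited; the function names and the mask-based blend are my own assumptions, not the paper's actual merge:

```python
import torch
import torch.nn.functional as F

def inverse_warp(src, grid):
    # Backward warping: sample the source frame at the projected
    # coordinates (grid in [-1, 1], shape (B, H, W, 2)).
    return F.grid_sample(src, grid, padding_mode="border",
                         align_corners=False)

def merge_projections(bg_inverse_warped, obj_forward_splatted, obj_mask):
    # Composite the two reconstructions: forward-splatted pixels inside
    # the instance mask, inverse-warped background everywhere else.
    return obj_mask * obj_forward_splatted + (1 - obj_mask) * bg_inverse_warped
```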

Weaknesses:

  • No, you’re not beating the SotA. SotA on this task at the moment is BTS (https://arxiv.org/pdf/1907.10326v5.pdf), but I appreciate that you’re coming close with self-supervision (i.e. without explicit depth supervision).
  • Your method is complex and I don’t think a 4-page short paper does it justice. I’ve worked with related methods, so it’s fairly clear to me, but I wouldn’t expect a reader unfamiliar with all the 3D projection and warping operations to get into this from just this work. This is less a criticism of the paper and more a suggestion to re-submit it as a full paper. (That is to say, the extended method section and illustrative diagrams in the appendix are great. If you find a venue where you can keep them in the main body of the paper, you’re golden.)
  • I commend the idea of having videos embedded in a PDF, but I’m on Linux/Mac and I’m not going to install a special PDF viewer just to look at embedded videos. In the future, please upload them to an anonymous YouTube profile and put the link in the paper.
  • Maybe rephrase the first sentence a little bit. “Recent advances in deep neural networks (DNNs)” has become a bit of a meme.
  • How’s the depth inconsistency map represented in the main figure?
  • Also in the same figure, the “inverse warping” is just a disconnected box floating around with no arrows going in or out.
  • Other than that, the figure is good.