End-to-End Differentiable 6DoF Object Pose Estimation with Local and Global Constraints

Reviewer 1

First some positives about the paper:

  • The paper addresses an important problem.
  • The idea of using both local and global features seems sensible.

In the following, I list my questions and the limitations I see in the approach and the paper.

  • The paper lacks justification for the design choices of the proposed approach. For instance, why is a triplet loss used instead of an alternative? Why are local features as defined in equation (1) better than other ways of extracting local features?

  • It is also hard to tell whether the improvements in the results are due to the proposed approach or simply to randomness in training. In Table 1, the numbers are very close, so it is hard to attribute any improvement to the specific design choices that were made.

  • The paper is also not very well written. On a first reading it is very hard to understand what the exact contributions are. The abstract and introduction give no intuitive explanation of which local and global features are used, and the triplet loss is not intuitively explained.

Overall, the paper addresses an important problem, but it’s unclear if the proposed contributions actually help improve pose estimation.

The introduction could be rewritten to make the contributions sharper. It would also be useful to differentiate the approach more clearly from previous work.

Reviewer 2

The submission introduces regularization terms for 6DoF object pose estimation.

The technique is interesting and the results seem promising.

I think the weakest part is the exposition, especially Sec. 2. I suggest the authors add a figure and a paragraph that set the context and introduce the notation, and then briefly describe the main idea. That figure could also be referenced from the introduction to illustrate the ideas the authors started from.

Reviewer 3

The authors target 6DoF object pose estimation from a single RGB image. They propose two extensions to prior work that finds correspondences between 3D and 2D keypoints. The first extension incorporates pairwise features extracted from pairs of correspondences; the second incorporates a triplet loss on the features extracted for keypoints. These two terms add local and global constraints to the prior formulation.
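To make my reading of the second term concrete, the following is a minimal sketch of a margin-based triplet loss on per-keypoint features; the function name, tensor shapes, and margin value are my own illustration and are not taken from the paper:

    import torch
    import torch.nn.functional as F

    def keypoint_triplet_loss(anchor, positive, negative, margin=0.2):
        # anchor, positive, negative: (N, D) keypoint feature vectors, where
        # each positive comes from the same keypoint as its anchor and each
        # negative from a different keypoint (illustrative convention only).
        d_pos = F.pairwise_distance(anchor, positive)  # (N,) distances
        d_neg = F.pairwise_distance(anchor, negative)  # (N,) distances
        # Hinge: pull same-keypoint features together, push different ones apart.
        return F.relu(d_pos - d_neg + margin).mean()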

Overall, the work proposes two extensions to prior work. The experiments are conducted on the Occlusion Linemod dataset, which is a very popular and challenging benchmark for object pose estimation. The approach achieves a large improvement on this dataset. The authors have also carried out ablation studies that show the improvement from adding each component.

Reviewer 4

I feel that the method is not very well explained. The paper assumes that the reader knows [9] really well, but this may not be the case for many readers. I suggest having at least a few sentences giving a very high-level overview of [9], since it is crucial to understanding this paper. The use of terminology is quite confusing. In particular, the use of the word “correspondence” to describe the 4D vector on line 52 will confuse many readers, since a correspondence in this setting typically refers to a matching between a 2D point and a 3D point. I understand that the 4D vector points towards a 2D image point, which in turn corresponds to a 3D point, and that is why you call it a correspondence. However, this is very hard for the typical reader to decipher without some more background material.
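To make the terminology point concrete, this is what I expect most readers to have in mind when they see “correspondence” in this setting; the type and field names are purely my own illustration, not the paper's definition:

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class Correspondence:
        # Conventional usage: a 2D image point matched to a 3D model point.
        # The paper instead uses the word for a 4D vector that points
        # towards such a 2D image point.
        point_2d: np.ndarray  # (2,) pixel coordinates in the image
        point_3d: np.ndarray  # (3,) coordinates on the object model

Even a short remark along these lines in the paper would remove the ambiguity.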

Otherwise the results are fairly interesting. They show that the learned approach can outperform RANSAC-based approaches.