Towards end-to-end training of proposal-based 3D human pose estimation

Reviewer 1

The submission proposes 3D pose estimation methods that extend Mask R-CNN (2D pose estimation) in three ways: (i) Integral Pose Regression and location maps in a single step; (ii) an integral layer for 2D estimation followed by an MLP that takes the 2D coordinates and estimates the 3D pose, with each of the two stages trained separately; (iii) the same as (ii), except that the system is trained end-to-end.

There is some degree of novelty in proposing a method to extend existing implementations for 2D pose estimation to handle 3D estimation. The submission also makes use of an interesting dataset.

However, the submission makes no effort to compare the proposed solutions against current state-of-the-art 3D pose estimation methods. Such comparisons are difficult because the dataset is custom, so the reported results can only be interpreted within the submission itself. It would be good to run other state-of-the-art implementations on the custom dataset to evaluate the proposed framework.

The novelty is somewhat questionable in that it’s not clear how the proposed extension (using integral pose regression) differs from what is proposed in the cited Integral Pose Regression paper. It seems as if the authors have proposed a framework that just concatenates the Mask R-CNN implementation with an Integral Pose Regression layer. It’s not very clear where something original has been proposed. It would be good to highlight where originality has been introduced.

Very little is said about how the dataset is split into training and test subsets. This is one of several variables that need to be stated. Please consider doing so.

There are a few typos here and there; I would urge the authors to revisit their submission and fix grammatical errors as well as typos.

Seeing as you have cited other 3D pose estimation techniques, would it be possible to run them on your custom dataset and compare their accuracy with that of your proposed solution?

Would you be able to suggest what exactly the novelty is in your proposed solution? The solution submitted seems to concatenate the layers proposed in the Integral Pose Regression paper.

In Figure 6, surely there shouldn’t be an initial loss on the 2D output after the integral step, given that the optimisation is end-to-end? Is this an error in the image?

Are there no standard datasets for 3D pose estimation? Why go down the custom dataset route? It makes it hard to compare different approaches/implementations.

Reviewer 2

Strong Points:

They propose a fully differentiable end-to-end single-person pose estimation network and also extend this network to give 3D pose estimation from indoor videos.

They have achieved fairly good quantitative results on their own custom dataset.

Weak Points:

There is no information regarding the integral layer, the extended Mask R-CNN, the loss function definition, or how their framework performs better than the conventional one.

They have commented on improvements in both accuracy and inference time, but substantial experimental evidence is not present in their paper.

Please address the doubts/points mentioned in the “weak points” section.

Provide comparison with some other methods.

Reviewer 3

Scope: 3D Human Pose estimation

Contributions: This paper reduces the state-of-the-art methods to a single-step differentiable method, which improves on the classical pipeline.

Limitations: There are some writing problems in the paper, and proofreading by a native speaker is needed. Besides, the difference between Fig. 2 and Fig. 3 is really hard to identify. To my knowledge, the more advanced methods in 3D pose estimation use a 3DMM (such as FLAME and its variants) to do a coarse-to-fine regression, which can handle extreme cases such as occlusion or bad illumination. If this paper involved some 3D priors, it would be more appealing to me.

Q1: What is the purpose of this paper? An incremental method for 3D human pose estimation, or an innovation in network structure?

Suggestions: The dataset seems good; I would like to know when you plan to release it.

Reviewer 4

The main contribution of the paper is extending the Mask RCNN framework with integral pose regression (soft-argmax) for 2D and 3D human pose estimation on a new dataset. The dataset used by the authors is a niche dataset collected from professional alpine skiers. The paper’s scope is limited, as the authors apply a well-known backbone framework by He et al. (Mask RCNN) and a well-known soft-argmax technique by Sun et al. (Integral Human Pose Regression) to a new dataset. The authors describe three variants of this framework and compare performance among them.
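
For readers unfamiliar with the soft-argmax operation mentioned above, the following is a minimal NumPy sketch of the general idea: a heatmap is normalised with a softmax and the keypoint is taken as the expected pixel coordinate, which keeps the step differentiable. It is not the submission's implementation; the function name `soft_argmax_2d`, the temperature parameter `beta`, and the toy heatmap are assumptions made purely for illustration.

```python
import numpy as np


def soft_argmax_2d(heatmap, beta=1.0):
    """Expected (x, y) location under a softmax over a 2D heatmap.

    `beta` is a hypothetical temperature: larger values make the result
    approach the hard argmax; it is not taken from the reviewed paper.
    """
    h, w = heatmap.shape
    logits = beta * heatmap.reshape(-1)
    logits = logits - logits.max()          # subtract max for numerical stability
    probs = np.exp(logits)
    probs = (probs / probs.sum()).reshape(h, w)

    xs = np.arange(w)                       # column indices (x axis)
    ys = np.arange(h)                       # row indices (y axis)
    x = float((probs.sum(axis=0) * xs).sum())  # expectation over the x marginal
    y = float((probs.sum(axis=1) * ys).sum())  # expectation over the y marginal
    return x, y


# Toy example: a single sharp peak at row 2, column 4.
heatmap = np.zeros((5, 7))
heatmap[2, 4] = 10.0
peak_x, peak_y = soft_argmax_2d(heatmap, beta=5.0)

# A flat heatmap has no preferred location, so the expectation is the centre.
flat_x, flat_y = soft_argmax_2d(np.zeros((5, 7)))
```

Because the output is a weighted average rather than a discrete index, gradients can flow from a coordinate loss back into the heatmap, which is what makes the end-to-end variant discussed in the reviews possible in principle.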

Strengths - Their results show that direct end-to-end training of 2D and 3D keypoints is better than sequential training and putting individual modules together.

Weaknesses -

The dataset is very niche and limited, and hasn’t been used in prior 3D human pose estimation works.
The framework is not novel. It’s an application of two well-known techniques to a new dataset.
Approaches 3.2 and 3.3 are essentially just different training strategies, and not approaches in themselves.
Loss formulation is not explained throughout the paper.
Authors should be consistent in spellings and sentences (maskrcnn vs. mask rcnn vs Mask RCNN).
The values used in Eq. 2 are not explained anywhere.
A lot of relevant recent literature is not discussed.