Inverse articulated-body dynamics from video via variational sequential Monte Carlo

Reviewer 1

The main contribution of this work is a method to infer joint angles and torques from 2D video in which only the joint positions are annotated. The authors combine a strong physical prior (a forward rigid-body dynamics model) with variational sequential Monte Carlo estimation to reconstruct the underlying forces.

Strengths:

I appreciate the use of classic machine learning techniques (HMM and SMC sampling) to accomplish this task on real, noisy data.
The authors show strong results on a variety of tasks.
The mouse experiment is very impressive.

Weaknesses:

The authors could’ve used OpenPose (https://github.com/CMU-Perceptual-Computing-Lab/openpose) to achieve better human joint tracking.
Some quantitative results would’ve been nice for assessing the difference between ground-truth and inferred positions when a trajectory is played back through the forward dynamics (but I understand that there were space restrictions).

I think this would make for a great full paper or even journal article if a few more implementation details were added. I hope the mouse wasn’t harmed in the making of this. And kudos for shaving a mouse.

Reviewer 2

The paper introduces a pipeline for tracking the motion of articulated rigid bodies from an image sequence while inferring physical parameters such as link lengths. The inference process uses variational sequential Monte Carlo (SMC) to recover the sequence of states (i.e. joint torques, positions, velocities) and to approximate the posterior over the model parameters. The transition distribution is defined by the Newton-Euler equations of rigid-body dynamics, and the emission distribution is realized by the forward kinematics function that maps joint angles to Cartesian positions. The authors implement a parallel SMC sampler in PyTorch that runs multiple independent SMC samplers while aggregating statistics via importance sampling. Variational SMC further allows the model parameters to be inferred alongside the states of the observed motion. Experiments on inferring joint positions and stick lengths are provided for simulated data of a planar arm, for videos of a real planar human arm motion (with poses obtained from a separately trained ResNet-50), for a simulated 3D arm motion, and for a real 3D motion of a mouse moving a planar joystick (again with link poses tracked by a ResNet-50).
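To make the structure described above concrete, here is a minimal sketch of a state-space model with a dynamics-based transition and a forward-kinematics emission, filtered with a plain bootstrap SMC sampler. It is not the authors' implementation: it uses NumPy rather than PyTorch, a unit-inertia double integrator driven by random torques stands in for the Newton-Euler dynamics, there is no variational proposal or nested sampler, and all function names and default values (`forward_kinematics`, `bootstrap_smc`, `simulate`, the noise scales) are assumptions chosen for illustration.

```python
import numpy as np

def forward_kinematics(q, lengths):
    """Map joint angles (..., 2) of a planar two-link arm to the
    elbow and wrist positions, returned as (..., 4) = [ex, ey, wx, wy]."""
    l1, l2 = lengths
    a1 = q[..., 0]
    a12 = q[..., 0] + q[..., 1]
    elbow = np.stack([l1 * np.cos(a1), l1 * np.sin(a1)], axis=-1)
    wrist = elbow + np.stack([l2 * np.cos(a12), l2 * np.sin(a12)], axis=-1)
    return np.concatenate([elbow, wrist], axis=-1)

def bootstrap_smc(obs, lengths, n_particles=2000, dt=0.05,
                  torque_std=2.0, obs_std=0.1, seed=0):
    """Bootstrap particle filter. Returns filtered joint positions (T, 4)
    and a log-evidence estimate (up to a constant independent of lengths)."""
    rng = np.random.default_rng(seed)
    q = rng.uniform(-np.pi, np.pi, size=(n_particles, 2))   # joint angles
    dq = rng.normal(0.0, 0.5, size=(n_particles, 2))        # joint velocities
    est = np.zeros_like(obs)
    log_evidence = 0.0
    for t in range(obs.shape[0]):
        if t > 0:
            # Transition: ASSUMED unit-inertia double integrator with
            # random torques, a stand-in for Newton-Euler dynamics.
            tau = rng.normal(0.0, torque_std, size=(n_particles, 2))
            dq = dq + dt * tau
            q = q + dt * dq
        # Emission: Gaussian noise around the forward-kinematics prediction.
        pred = forward_kinematics(q, lengths)
        logw = -0.5 * np.sum((obs[t] - pred) ** 2, axis=-1) / obs_std ** 2
        shift = logw.max()                       # stabilise the exponentials
        w = np.exp(logw - shift)
        log_evidence += shift + np.log(w.mean())
        w /= w.sum()
        est[t] = w @ pred                        # weighted mean position
        idx = rng.choice(n_particles, size=n_particles, p=w)  # resample
        q, dq = q[idx], dq[idx]
    return est, log_evidence

def simulate(lengths=(1.0, 0.8), n_steps=100, dt=0.05, noise_std=0.05, seed=1):
    """Synthetic smooth arm motion with noisy 2D joint observations."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_steps) * dt
    q = np.stack([0.3 + 0.5 * (1 - np.cos(t)),
                  0.4 + 0.3 * (1 - np.cos(t))], axis=-1)
    clean = forward_kinematics(q, lengths)
    return clean, clean + rng.normal(0.0, noise_std, size=clean.shape)

if __name__ == "__main__":
    clean, noisy = simulate()
    est, ll = bootstrap_smc(noisy, (1.0, 0.8))
    rmse = np.sqrt(np.mean((est - clean) ** 2))
    print("filtered position RMSE:", rmse, "log-evidence estimate:", ll)
```

In this sketch the link lengths are fixed inputs. The log-evidence estimate returned by the sampler is the kind of quantity that variational SMC maximizes with respect to model and proposal parameters, so comparing it across candidate lengths gives a rough, non-differentiable analogue of the parameter inference described in the paper.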

This work presents an interesting approach for inferring states and model parameters of real-world dynamical systems where only noisy sensor measurements, such as camera images, are available. While the authors acknowledge that SMC scales poorly as the state dimension grows in 3D, the work is a promising direction: variational SMC could be used in many other inference problems that can leverage prior knowledge of the transition and emission dynamics while also inferring model parameters. The planar human arm experiment shows less accurate tracking than the simulated systems, which, as the authors state, is likely the result of poor initial tracking by the ResNet-50.

Reviewer 3

I think the results section could do a better job of quantifying the benefit of each individual contribution. From the plots in Figure 2, I can’t see the improvement of the proposed method over previous work. I think the question to answer is how the nested SMC method improves on a simpler approach. The torque inference generally seems quite noisy and often inaccurate. Although it seems reasonable to expect somewhat larger error here (since torques are noisier), some additional discussion would help in understanding the limitations of the system.

As mentioned previously, my two main requests are an assessment of how the paper’s choices affect the performance of the method, e.g. through an ablation study, and some more detail on how differentiability is used to improve training performance.

I do not feel I can confidently judge this paper, as it is outside my area of expertise, so I would only recommend acceptance if others can champion it.