Blendshape-augmented Facial Action Units Detection

Reviewer 1

The authors present a differentiable strategy for estimating facial action unit contributions to a given 2D image of a face. The strategy avoids complex modeling of facial muscle physics by using a pre-existing facial expression generation tool. The authors present a combined model: a top-down approach for combining AU features and a bottom-up approach for estimating AU features from data. By composing these models, the authors demonstrate results that are superior to either model acting alone.

The major limitation of this paper is the lack of comparisons to existing approaches. The comparison against the ablated models is interesting and subsumes naive approaches, but I believe more substantive comparisons are warranted.

This paper is very interesting, thank you. There are several minor typos throughout the text. The figures were informative.

For a longer submission, I believe that more comparisons against existing baselines are warranted (as discussed above). I think a new figure, or a modification of figure 1 to include the 3D Blendshape Engine, would aid in exposition.

Reviewer 2

The contribution of this paper is mostly the top-down model: the decomposition of the landmark deformations into a rigid part (rotation and translation) and a non-rigid part modeled by the blendshape engine.
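
For concreteness, here is a minimal sketch of the kind of rigid/non-rigid decomposition I understand this to be, written as my own illustration in NumPy under the assumption of a linear blendshape basis; none of the names correspond to the paper's notation or actual algorithm:

```python
import numpy as np

def decompose_landmarks(observed, neutral, basis, iters=10):
    """Split observed 3D landmarks into a rigid part (R, t) and non-rigid
    blendshape weights w, assuming observed ~= R @ (neutral + basis @ w) + t.
    observed, neutral: (N, 3); basis: (N*3, K).  Illustrative alternating
    least-squares sketch only, not the paper's method.
    """
    N, K = neutral.shape[0], basis.shape[1]
    w = np.zeros(K)
    for _ in range(iters):
        # 1) rigid fit (Kabsch) of the current non-rigid shape to the observation
        shape = neutral + (basis @ w).reshape(N, 3)
        mu_s, mu_o = shape.mean(0), observed.mean(0)
        U, _, Vt = np.linalg.svd((shape - mu_s).T @ (observed - mu_o))
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:      # avoid reflections
            Vt[-1] *= -1
            R = Vt.T @ U.T
        t = mu_o - R @ mu_s
        # 2) non-rigid fit: map the observation back into the model frame,
        #    then solve for blendshape weights in closed form
        aligned = (observed - t) @ R  # row-wise R^T (observed_i - t)
        residual = (aligned - neutral).reshape(-1)
        w, *_ = np.linalg.lstsq(basis, residual, rcond=None)
    return R, t, w
```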

It is apparent that the paper is a work in progress, as one would expect from a workshop short paper.

I believe the claim that the method takes into account facial anatomy, muscles, skin, etc. is exaggerated. That is exactly what the FACS AU framework (which most cited methods use) does. As far as I understand, the proposed top-down method builds a model between 3D landmark coordinates (and their displacements) and AU classes. In my view this does not mean that any content in the paper is redundant or irrelevant; it just means this claim makes it harder to understand the scope of the paper.

The experiments are only performed on the known 2D/3D landmark coordinates provided by the dataset, instead of with the full top-down method described in Section 2. Specifically, the paragraph "3D facial landmark estimator" in Section 2 implies that the input ought to be only a 2D image, from which 2D landmark coordinates are estimated and then used to infer the 3D landmark coordinates.

"”We assume that different subjects’ landmarks can be fitted into our AU bases through the rigid transformation with trivial errors”” seems like an assumption that could and should be tested experimentally. facial antinomy -> facial anatomy ????

The blending model g_{\phi}^m(·,·) is not defined at all; does it have parameters? All I could find is that "it gives probabilities".

The paragraphs on "Facial muscle and the modeling of its mechanics" seem to belong to a "motivation" or "background" section rather than to the description of the model.

Section 2.1 should be broken into what the FACS AU framework does and what your method does. It seems you are using facegen as a black box, so I do not consider it part of the presented method but rather a third-party generator of a training set.

If I have misunderstood, you should definitely clarify the boundary between your top-down method, FACS-AU, and facegen.com. Couldn’t the data from BP4D be used instead of facegen in order to create the 3D blendshape engine and the AU basis?

Reviewer 3

Contributions:

  1. The idea of integrating top-down and bottom-up processing is intuitive, and the ablation results show reasonable and favorable performance.

Limitations:

  1. The proposed model is not compared with other previous works.
  2. There is no qualitative result.
  3. Considering the title of the workshop, it would be better to highlight the differentiability of each module.
  4. Some parts of the paper are ambiguous. What is the main goal of the AU detection task? (Please consider that I am not an expert in action unit representation.) The inputs, outputs, and types of labels for training the deep model are difficult to discern.
  5. It would be better to visualize the actual embedded feature space of the AU representation, e.g., with t-SNE or another embedding. Since the facial representations are continuous due to the smooth motion of facial muscles, it would also be better to visualize the encoded feature maps for validation.
  6. It would be better to show how the top-down and bottom-up processes are weighted for the final integration. Are there any biased cases between them during training? I would like to see how they are correlated; attention-based training could also be one solution (see the sketch after this list).
  7. Experiments are only conducted as an ablation study. It would be better to show comparisons with other previous works.
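
Regarding point 6 above, one concrete way to expose and learn that weighting would be a per-AU gate between the two streams. The following is purely an illustrative sketch; the module and tensor names are my own, not the paper's:

```python
import torch
import torch.nn as nn

class GatedAUFusion(nn.Module):
    """Fuse per-AU logits from the top-down and bottom-up streams with a
    learned, input-dependent gate.  Inspecting the gate values shows how
    strongly each stream contributes per AU.  Illustrative sketch only.
    """
    def __init__(self, num_aus: int, feat_dim: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(feat_dim, num_aus),
            nn.Sigmoid(),                 # alpha in (0, 1), one weight per AU
        )

    def forward(self, feat, logits_td, logits_bu):
        alpha = self.gate(feat)           # (B, num_AUs)
        fused = alpha * logits_td + (1.0 - alpha) * logits_bu
        return fused, alpha               # alpha is the per-AU stream weight
```

Logging alpha over the test set would directly show whether training ends up biased toward one stream and how the two streams correlate per AU.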

Reviewer 4

Scope: 2/3D Facial Landmarks, Facial Action Units, Blendshapes.

Contribution: In this paper, the authors propose a novel way to exploit the combination of top-down and bottom-up methods to predict Action Units well. In comparison to 2D-only (bottom-up) methods, this paper takes 3D priors into the method and analyzes the motivation for leveraging them. The AU estimation (which I regard as a blendshape-parameter regression network) has proved to be effective in the combination of these two methods.

Limitations:

  1. 3D facial landmarks could use a 3DMM-based method to improve precision and other indicators.

  2. I think the increasingly popular FLAME model is more suitable for your blendshape prediction, since it is mainstream and holds a leading position compared to other methods.

This paper seems good to me; the narrative style is quite sober, and the explanation of the experimental setting is adequate.

I am wondering whether it is possible to add a differentiable module that renders/projects 3D to 2D with weak supervision signals (top-down network).
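
To make the suggestion concrete, a minimal sketch of what I have in mind, assuming a weak-perspective camera so that only 2D landmark labels are needed as the weak supervision signal; all names here are my own illustration, not part of the paper:

```python
import torch

def weak_perspective_project(lms3d, scale, trans2d):
    """Differentiable weak-perspective projection of 3D landmarks.
    lms3d: (B, N, 3); scale: (B, 1); trans2d: (B, 2).  Returns (B, N, 2).
    """
    return scale.unsqueeze(-1) * lms3d[..., :2] + trans2d.unsqueeze(1)

def weak_supervision_loss(pred_lms3d, scale, trans2d, gt_lms2d):
    """L1 reprojection loss: the only labels needed are 2D landmarks,
    so gradients flow back into the top-down (3D) branch."""
    proj = weak_perspective_project(pred_lms3d, scale, trans2d)
    return torch.nn.functional.l1_loss(proj, gt_lms2d)
```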