Learned Equivariant Rendering without Transformation Supervision

Reviewer 1: Contributions: The authors present a method for applying a transformation learned in one setting (background + foreground objects) to a new setting. The separate encoders for foreground and background also enable new combinations of foreground and background objects.

Scope: Scaled up appropriately, the method could be useful to the graphics community or as a method for compositional representation learning of videos (because of the separation of foreground/background/transformation).

Limitations: The setting is quite toy. While toy experiments are appropriate for a workshop submission, I am not sure that a loss only on pairs of frames is a realistic assumption to make.

  • It would be helpful to explain the interpretation of x_i more explicitly earlier on.
  • Providing a background section is helpful, but it would be great to go into more detail about how this work builds on the cited Dupont et al. (2020).
  • L 38: wrt -> with respect to

Reviewer 2: This paper uses self-supervised learning to disentangle foreground from background objects in videos, obtaining factorized codes so that they can be manipulated independently. To do this, it jointly optimizes encoders for the foreground and background, an affine transformation T^Z in the encoded space, and a decoder (renderer), learning to map between adjacent frames. A loss function is constructed to include a “scene matching” term on the rendered learned transformation, an “equivariance” term on the codes, and an “invariance” term on the learned backdrop decoders.
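
(For concreteness, here is a minimal sketch of how such a pairwise loss might be assembled. The names f_o, f_b, g, and T_z, the use of MSE for every term, the unweighted sum, and the assumption that T^Z acts only on the foreground code are illustrative guesses based on this summary, not details taken from the paper.)

```python
# Hypothetical sketch of the pairwise training loss described above;
# not the authors' code. f_o / f_b are foreground / background encoders,
# T_z is the learned latent transform, and g is the renderer.
import torch.nn.functional as F

def pair_loss(f_o, f_b, g, T_z, x1, x2):
    z1_o, z1_b = f_o(x1), f_b(x1)   # codes for frame 1
    z2_o, z2_b = f_o(x2), f_b(x2)   # codes for frame 2

    # "Scene matching": render the transformed frame-1 foreground code over
    # the frame-1 background code and compare to frame 2.
    scene = F.mse_loss(g(T_z(z1_o), z1_b), x2)

    # "Equivariance": applying the transform in code space should reproduce
    # the frame-2 foreground code.
    equiv = F.mse_loss(T_z(z1_o), z2_o)

    # "Invariance": the background code should not change between frames.
    invar = F.mse_loss(z1_b, z2_b)

    return scene + equiv + invar
```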

  • Experiments are shown with MNIST digits randomly placed on a procedurally generated backdrop. This method is shown to successfully learn manipulable decoders.

Some concerns:

  • Though the idea is promising, the results presented here might be too preliminary. If the background is completely static, the pixel-wise median across a video should be enough to recover it perfectly (a minimal illustration of this baseline follows this list). A slightly more sophisticated setting would be more convincing, even with the same MNIST-on-generated-backdrop methodology.
  • It seems that the architectures for f_o, f_b, and g are not specified; only that f_o = f_b = transpose of g. Even in this simple setting, this seems like an important detail to include.
  • An interpretation of the “Analyzing T^Z” section in the appendix might be in order. These statistics do not clearly match between the learned and true T^Z; what is the conclusion? With a slightly more challenging experimental setting and a more fleshed-out exposition, I believe this could be a strong conference submission.
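
(To make the first concern concrete, the static-background baseline the reviewer alludes to needs no learning at all. A minimal sketch, assuming the video is available as a single array; the name and shape of `frames` are illustrative.)

```python
# Pixel-wise median over time recovers a completely static background
# whenever the moving foreground covers each pixel in fewer than half
# of the frames. `frames` has shape (num_frames, height, width, channels).
import numpy as np

def median_background(frames: np.ndarray) -> np.ndarray:
    return np.median(frames, axis=0)
```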

Reviewer 3: Strengths

  • This paper is easy to follow; the authors describe the assumptions of this study and the proposed method clearly.
  • The main idea of this paper is very straightforward.

Weaknesses

  • Equation (3) is weird to me; please check it.
  • What does T(z1, z2) mean in Equation (4)? The authors need to describe this operation clearly, because T is an affine transformation, which is a unitary operator. In my opinion, this equation is likely to cause many misunderstandings for the reader. Can the authors rewrite it?
  1. As mentioned above, equations like T(z1, z2) seem strange to me throughout the manuscript. Since z2 = T^Z(z1), it is strange that the same operator is applied to z1 and z2 at the same time (one possible reading of the notation is sketched after this list).

  2. Could the authors show, if possible, image translations obtained by walking in the scene representation space, as in many GAN papers? If the neural renderer is well trained, a smooth image sequence should be produced.
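
(One possible reading of the notation questioned in point 1, consistent with Reviewer 2's summary above but not necessarily matching the paper's exact formulation, is sketched here; the symbols f_o, f_b, and g follow that summary.)

```latex
% Plausible reading (an assumption, not the paper's stated equations):
% T^Z acts on the foreground code only, and adjacent frames x_1, x_2 satisfy
\begin{align*}
  z_1 &= f_o(x_1), \qquad z_2 = f_o(x_2), \\
  z_2 &\approx T^Z(z_1)                          && \text{(equivariance of the foreground codes)}, \\
  f_b(x_2) &\approx f_b(x_1)                     && \text{(invariance of the backdrop code)}, \\
  x_2 &\approx g\bigl(T^Z(z_1),\, f_b(x_1)\bigr) && \text{(scene matching after rendering)}.
\end{align*}
```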