Inverse Graphics GAN

Reviewer 1

I think this is a promising research direction. You clearly outline the problem of rendering voxels differentiably, and propose a solution that seems to work well.

-> The paper is a little hard to parse because of its dense level of detail, and it tends to repeat itself. I would suggest removing much of the repetition, and also lines 73-84, as it is probably better to focus on what you are going to do than on what you can’t do in such a short format.

-> While the method seems to work well, it does seem a somewhat convoluted approach for what you are trying to accomplish. This is not necessarily a bad thing, but I can’t help but hope there is a more direct solution in which voxel rendering is made differentiable directly, perhaps something like “Multi-view Supervision for Single-view Reconstruction via Differentiable Ray Consistency”.

-> I think a good baseline to compare against would be training a GAN to generate images from your data distribution, and then using single-image reconstruction to produce the voxel models. Showing that your method outperforms this approach would go a long way toward justifying its complexity.

-> I hope that the age of ‘name’-GAN is nearly over; there are too many to track and it’s very confusing to parse them all. I would suggest changing the name.

-> You do not indicate what Absorption Only and Visual Hull are in the paper.

Overall I think this is a good paper, though maybe some time should be put into improving the writing.

Reviewer 2

There are a few things I like about the paper:

It addresses an important problem and explores an interesting (although not new) solution.
It’s well motivated.
The ability to capture concavities is interesting.

The paper also has a few shortcomings which I list below:

While the proposed idea is interesting, it’s not new. This is not a problem in itself, but the paper does not discuss or cite a large body of work with similar ideas. The idea of using projections (or silhouettes) of a reconstructed point cloud or voxel grid as a loss for 3D shape has been tried numerous times before. Here’s a good starting point to this line of work:

Unsupervised Learning of Shape and Pose with Differentiable Point Clouds, Insafutdinov et al., NeurIPS 2018
HoloGAN: Unsupervised Learning of 3D Representations from Natural Images, Nguyen-Phuoc et al., 2019
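
For concreteness, the kind of projection-based loss referred to above can be sketched as follows. This is only a minimal NumPy illustration under assumed conventions (orthographic projection along one grid axis, occupancies in [0, 1]); the names `project_silhouette` and `silhouette_loss` are hypothetical, and the two projection modes are generic choices, not necessarily the definitions used in the paper or in the cited works.

```python
import numpy as np


def project_silhouette(voxels, axis=2, mode="soft"):
    """Orthographically project an occupancy grid onto a 2D silhouette.

    Assumption: `voxels` holds occupancies in [0, 1]; values are clipped.
    """
    v = np.clip(np.asarray(voxels, dtype=np.float64), 0.0, 1.0)
    if mode == "max":
        # A pixel is covered as strongly as the most occupied voxel on its ray.
        return v.max(axis=axis)
    if mode == "soft":
        # Each voxel lets a fraction (1 - v) of the ray through; the
        # silhouette is one minus the product of those fractions.
        return 1.0 - np.prod(1.0 - v, axis=axis)
    raise ValueError(f"unknown mode: {mode!r}")


def silhouette_loss(voxels, target, axis=2, mode="soft"):
    """Mean squared error between the projected and the target silhouette."""
    pred = project_silhouette(voxels, axis=axis, mode=mode)
    return float(np.mean((pred - np.asarray(target, dtype=np.float64)) ** 2))


if __name__ == "__main__":
    grid = np.zeros((4, 4, 4))
    grid[1, 2, 0] = 0.5
    grid[1, 2, 3] = 0.5
    target = project_silhouette(grid, mode="soft")
    print(silhouette_loss(grid, target))
```

The soft mode stays sensitive to every voxel along a ray, whereas the max mode only passes gradient to the single largest voxel; this is one reason such losses are often written in the product form when used for training.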

The paper does not discuss the similarity of the proposed “continuous voxel grids” to implicit representations such as SDFs and occupancy functions.

Occupancy Networks: Learning 3D Reconstruction in Function Space, Mescheder et al.
DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation, Park et al.

When projecting the voxel grid to 2D, it is unclear how backface culling is handled.

Overall, this paper presents an interesting idea, but lacks connection to previous work. Still, publication in the workshop could enable further discussion and exposure to related work. I would encourage the authors to add a paragraph discussing this work in the final version.

Misc:

Line 13: GANS –> GANs
Third para in the intro is quite verbose and can be shortened.
Line 87: Closed –> close

Reviewer 3

Contribution - The authors propose a technique for using an off-the-shelf non-differentiable renderer to train a 3D generative model from 2D images, by employing a proxy differentiable renderer and a discriminator that ensures the output distributions of the two renderers are similar.

Scope - The approach is scalable to any rendering algorithm and allows the use of advanced industrial renderers. Building photorealistic neural renderers is an important topic and has a lot of applications in gaming and media.

Limitations:

Since the 3D representation used is voxels, I’m not sure how much information can actually be ‘distilled’ from an industrial renderer, given the low resolution of voxel grids. Fine details would still be difficult to capture at that resolution, which partly defeats the purpose of using a better renderer.
There’s no intermediate loss on the 3D voxel representation and no explicit disentanglement of features, which limits the generalization capability of the generator. Approaches like HoloGAN explicitly address this.

The method is clearly explained and well formulated, and the idea is pretty interesting. However, I would have liked to see the limitations of the proposed method discussed, along with a comparison of results against some of the recent differentiable renderers. Regarding scalability, I am unsure how effective a low-resolution voxel representation can be without explicit disentanglement of 3D features.