Tractable loss function and color image generation of multinary restricted Boltzmann machine

Reviewer 1

RBMs have always been tricky to optimize because their standard maximum-likelihood objective is intractable, and their results have not been as strong as those of current generative models. The authors propose a new loss function that is fully differentiable for both binary and multinary RBMs, so backpropagation can be used, making them easier to optimize. The authors further present both qualitative and quantitative experimental results. Although the FID scores are poor compared to current methods, the proposed RBM could pave the way for new types of generative models that may lead to new SOTA methods for image generation.
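For context, what makes the standard objective hard is the log-partition function; the free energy of an RBM is itself differentiable. The following is a minimal sketch of backpropagating through a binary RBM's free energy (illustrative names only; this is not the loss proposed in the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinaryRBMFreeEnergy(nn.Module):
    """Standard free energy of a binary RBM:

        F(v) = -v.b - sum_j softplus(c_j + (W v)_j)

    This quantity is differentiable in W, b, c and in the input v, so any
    loss built from it can be trained with backpropagation. (Illustrative
    only; not the authors' proposed loss.)
    """
    def __init__(self, n_visible: int, n_hidden: int):
        super().__init__()
        self.W = nn.Parameter(0.01 * torch.randn(n_hidden, n_visible))
        self.b = nn.Parameter(torch.zeros(n_visible))   # visible bias
        self.c = nn.Parameter(torch.zeros(n_hidden))    # hidden bias

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        # v: (batch, n_visible) with values in [0, 1]
        hidden_term = F.softplus(F.linear(v, self.W, self.c)).sum(dim=1)
        return -(v @ self.b) - hidden_term

rbm = BinaryRBMFreeEnergy(n_visible=784, n_hidden=256)
v = torch.rand(32, 784, requires_grad=True)
loss = rbm(v).mean()
loss.backward()   # gradients flow to W, b, c and to the input v
```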

Reviewer 2

Contributions:

  1. The motivation of the proposed method is easy to follow (although I am not an expert in RBMs).
  2. Starting from the intractability of traditional statistical learning, the authors propose a generalized loss function for optimizing multinary RBMs.

Regardless of the quality of the paper, my decision is currently leaning towards rejection, considering the scope of this workshop. I would like to see other reviewers' comments.

Reviewer 3

The formulation the authors propose for (multinary) RBMs allows them to stack RBMs and train them jointly, unlike the widely used contrastive divergence algorithm, which mandates greedy layer-wise training. The proposed loss function also allows gradients with respect to the (multinary) RBM input to be computed. This would facilitate the use of (multinary) RBM layers as building blocks in larger deep learning models.
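As a rough sketch of what such a building block could look like, assuming a mean-field-style layer (the class name and design below are hypothetical, not the authors' exact formulation):

```python
import torch
import torch.nn as nn

class MeanFieldRBMLayer(nn.Module):
    """Hypothetical RBM-style layer using mean-field hidden activations.

    E[h | v] = sigmoid(W v + c) is a smooth function of v, so a stack of
    such layers can be trained jointly end-to-end, in contrast to greedy
    layer-wise contrastive divergence. (A sketch of the idea only.)
    """
    def __init__(self, n_visible: int, n_hidden: int):
        super().__init__()
        self.linear = nn.Linear(n_visible, n_hidden)

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(v))

# Jointly trainable stack, usable inside any larger model:
model = nn.Sequential(
    MeanFieldRBMLayer(784, 512),
    MeanFieldRBMLayer(512, 256),
    nn.Linear(256, 10),   # e.g. a classification head
)
```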

The performance of the proposed models, in terms of classification accuracy and generated-image FID, is not state of the art, and the generated image samples in fig. 2 feel a lot like the “average images” produced by variational auto-encoders trained with KL regularization and an L2 loss.

Also, in line #105 the authors present the removal of facial accessories as a strength/generalization point. Since this happens on a sample from the dataset the model was trained on (regardless of whether it comes from the test or train split), I view it as a weakness. In my opinion it shows that the model lacks the capacity to model the sunglasses, which probably appear infrequently in the training dataset. Together with the discrete nature of the sigmoid-activated latents and the blurry “average image” feel of the outputs, this gives the impression that the generative model can output a number of different “average images”, each one capturing a different mode of the dataset [1]. This idea is further reinforced by the reconstructions of the two out-of-distribution images shown in the right half of fig. 2. The reconstructions of the provided faces are of similar quality to the reconstructions of images from the training dataset; however, for both pictures I would say that the faces in the reconstructions belong to different people from those in the original images.

In light of this, I still find this work interesting because it proposes a potential new building block for deep learning models. This block could in theory tie in nicely with developing areas such as the vector quantization exhibited in the VQVAE and vq-wav2vec models. I think this kind of work is both interesting and likely to benefit greatly from the feedback and exposure of being presented at the workshop.

Pushing these differentiable RBM layers further, possibly by combining them with other deep learning layers to achieve results that lag less far behind other methods both qualitatively and quantitatively, would definitely help this work gain wider dissemination.

As a follow-up to point [1] of the review, I would be interested in seeing an analysis of the diversity of the generated samples and/or the reconstructed images. Even just an appendix including a large number of images and their reconstructions, or of samples generated from the model (e.g. a full page of each, perhaps with light commentary in a caption), would be good. This would better showcase the diversity, or lack thereof, in the images modeled by the proposed model.
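For instance, a large batch of samples or image/reconstruction pairs could be tiled into a single appendix figure with torchvision (a hypothetical helper; `samples` below is a random stand-in for actual model outputs):

```python
import torch
from torchvision.utils import make_grid, save_image

# samples is assumed to be a (N, C, H, W) tensor in [0, 1]; here it is a
# random placeholder standing in for generated images or reconstructions.
samples = torch.rand(64, 3, 64, 64)
grid = make_grid(samples, nrow=8, padding=2)
save_image(grid, "appendix_samples.png")
```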

The CelebA dataset provides attributes, including whether or not an image contains glasses. I would like to ask the authors how many of the images their model was trained on contained eyeglasses, both as a percentage of the training dataset and as an absolute count.
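As a sketch, assuming the torchvision CelebA wrapper and its "Eyeglasses" attribute (the split used by the authors may of course differ), such a count could be obtained along these lines:

```python
import torchvision

# Count how many CelebA training images carry the "Eyeglasses" attribute.
# Assumes the dataset is available locally or can be downloaded.
celeba = torchvision.datasets.CelebA(
    root="data", split="train", target_type="attr", download=True
)

idx = celeba.attr_names.index("Eyeglasses")
count = int(celeba.attr[:, idx].sum())   # attributes are stored as 0/1
total = len(celeba)
print(f"{count} / {total} images ({100.0 * count / total:.2f}%)")
```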

A plot of loss/task metric vs. epoch, a plot of loss/task metric vs. wall-clock time, and further details on compute resources would be welcome additions to an appendix, since computational efficiency is a deal-breaking factor for many people when deciding whether or not to adopt a new algorithm/framework for training models.
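A minimal sketch of how both curves could be logged from a single run and plotted (`train_one_epoch` below is a hypothetical placeholder for the actual training step):

```python
import time
import matplotlib.pyplot as plt

def train_one_epoch(epoch: int) -> float:
    time.sleep(0.01)              # placeholder for real training work
    return 1.0 / (epoch + 1)      # placeholder decreasing "loss"

# Record the loss against both epoch index and wall-clock time so the
# two requested plots come from the same run.
epochs, losses, wall_clock = [], [], []
start = time.perf_counter()
for epoch in range(50):
    losses.append(train_one_epoch(epoch))
    epochs.append(epoch)
    wall_clock.append(time.perf_counter() - start)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3))
ax1.plot(epochs, losses)
ax1.set(xlabel="epoch", ylabel="loss")
ax2.plot(wall_clock, losses)
ax2.set(xlabel="wall-clock time (s)", ylabel="loss")
fig.tight_layout()
fig.savefig("loss_curves.png", dpi=150)
```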

Reviewer 4

This paper derives a loss function that enables backpropagation for restricted Boltzmann machines (RBMs). The authors then apply this loss to both multinary and binary RBMs. Finally, they evaluate the model on image classification and generation tasks.

The experiments show the potential of this model compared to the state of the art. Explanations of the higher errors on the classification task and the high FID scores (lower is better) would be welcome in a future submission.

I am not familiar with RBMs, so I am not confident that the derived formulations are correct.

Reviewer 5

The paper attempts to bring techniques considered outdated back into the present era via an enhanced loss function that should contribute to a new generation of Deep Belief Networks and to the democratization of such technologies through reduced training overhead. Irrespective of the relative efficacy of this technique, it should serve as a valuable foundation for future research. A brief discussion of why ImageNet (or even Tiny ImageNet, https://www.kaggle.com/c/tiny-imagenet) was not selected would be welcome.