End-to-End Differentiable Learning to HDR Image Synthesis for Multi-exposure Images

Reviewer 1

The authors take the classical computer vision problem of HDR image synthesis and propose a differentiable approximation for a discrete component used in the process. They then use this approximation to train an end-to-end model for HDR image synthesis. The proposed method compares favorably with existing benchmarks in terms of metrics, and the produced results are visually pleasing. The method is also shown to be data efficient, performing well with relatively small model sizes.

One significant implementation detail that I could not find in the submission is how the authors differentiate through the histogram count function, i.e., "cnt_l(I)" in the last term of the loss in Equation 4. The authors refer to this term as a histogram loss function. If they are referring to [1], they need to add an appropriate citation and make clear that they are optimizing a proxy of the term in Equation 2; if they do something else, they must clarify that in the camera-ready submission. This key detail should be either in the main body of the paper or in the appendix, with a clear reference to the specific appendix from the main body of the paper.

[1] Ustinova, Evgeniya, and Victor Lempitsky. "Learning deep embeddings with histogram loss." Advances in Neural Information Processing Systems. 2016.
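
To make the concern concrete, below is my own minimal sketch (not the authors' code) of the kind of soft-binning proxy used in [1]: each pixel value contributes to its two nearest bins with linearly decaying weights, so the "count" becomes differentiable. All names (soft_histogram, num_bins, etc.) are hypothetical.

    import torch

    def soft_histogram(values, num_bins=256, v_min=0.0, v_max=1.0):
        # Differentiable proxy for a hard histogram count: each value
        # contributes to its two nearest bins with triangular
        # (linear-interpolation) weights, so gradients flow back to `values`.
        delta = (v_max - v_min) / (num_bins - 1)          # bin spacing
        centers = torch.linspace(v_min, v_max, num_bins)  # bin centers
        d = torch.abs(values.reshape(1, -1) - centers.reshape(-1, 1))
        weights = torch.clamp(1.0 - d / delta, min=0.0)   # triangular kernel
        return weights.sum(dim=1)                          # soft count per bin

If the authors use something along these lines, a sentence plus the citation would resolve the ambiguity.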

The following comments are with the sole intent of improving the paper and have not influenced my outcome decision.

I found the sample HDR images in Appendix C very visually pleasing; it would be worth referring the reader to this appendix from the main body of the paper, possibly somewhere around line #125.

Figures 1 and 2 have been rendered at low quality in the submission PDF; the axes in the yellow box in Fig. 1 are barely readable unless the reader zooms in closely. Please re-render these figures in a suitably high-quality image format.

In Figure 1, the X axis of the upper plot in the yellow box contains a typo: it is labeled "Piexl intensity". Also, using title capitalization in the axis labels would adhere better to publication norms.

It was not immediately clear to me what the word “individually” in line #41 was referring to.

The final sentence in Section 1 (lines #42 - #44) was confusing to me. Consider breaking it into shorter sentences or re-wording it.

Line #55: "denotes" -> "denote". This specific grammatical error was actually a source of brief notational confusion for me. Also, in #55 "a pixel" -> "the pixel", and in #56 "the luminance" instead of just "luminance" would make things clearer.

In Appendix A, Fig. 3, the authors compare their method for differentiable approximation of the CRF to polynomial baselines. The comparison is not fair, as the polynomial baselines are not regularized, as can be seen from Eqn. 9 and the plots. Many smoothness regularizers would be easy to implement. It would also be relatively straightforward to obtain a smooth regressor from a kernel regression method with a Gaussian kernel or similar.
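
For reference, a smooth baseline along these lines takes only a few lines. The sketch below is my own illustration of Nadaraya-Watson regression with a Gaussian kernel on toy CRF-like samples; it is not taken from the paper, and the variable names and toy data are my own assumptions.

    import numpy as np

    def gaussian_kernel_regression(x_train, y_train, x_query, bandwidth=0.05):
        # Nadaraya-Watson estimator: each prediction is a Gaussian-weighted
        # average of the training targets, giving a smooth fitted curve whose
        # smoothness is controlled by the bandwidth.
        d2 = (x_query[:, None] - x_train[None, :]) ** 2
        w = np.exp(-0.5 * d2 / bandwidth ** 2)
        w /= w.sum(axis=1, keepdims=True)
        return w @ y_train

    # Toy usage: smooth a noisy, CRF-like curve sampled on [0, 1].
    x = np.linspace(0.0, 1.0, 64)
    y = np.clip(x ** 0.45 + 0.01 * np.random.randn(64), 0.0, 1.0)
    y_smooth = gaussian_kernel_regression(x, y, np.linspace(0.0, 1.0, 256))

A comparison against a baseline of roughly this form (or a regularized polynomial) would be more informative than the unregularized fit currently shown.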

Reviewer 2

This paper proposes an end-to-end method for synthesizing HDR images from LDR images. The method alleviates local inversion artifacts by reconstructing the multi-exposure stack with the introduced differentiable HDR synthesis layer. It makes the non-differentiable camera response function (CRF) differentiable through a piecewise linear approximation. The proposed method outperforms prior works.

Strengths: 1) The paper is well-written and easy to follow, and the contributions are clearly defined. 2) The presented work is of sufficient technical novelty and seems technically and theoretically sound. The use of piecewise linear approximation seems promising.

Limitations: 1) Though explained in the paper, I do not think the comparison among different methods is fair if they are trained on different datasets.

In terms of data efficiency (line 130), I think the results show good generalization rather than data efficiency. I would suggest training the model with different data volumes (say, 48 images, 256 images, 1024 images). If the resulting performance is similar, this would better demonstrate the data efficiency of the model.

Reviewer 3

Strong points:

They propose a recurrent approach to multi-exposure stack generation.

They make the discrete camera response function differentiable using a piecewise linear approximation.

Their method reduces local inversion artifacts and preserves image details and contrast in overexposed regions.

Their proposed network outperforms the state-of-the-art on 3 different datasets, and they compare their performance with 5 other frameworks.

Weak points:

The exact training process of the sub-networks is not very clear.

Is the sampling strategy for pixels from the multi-exposure stack random?

The title should use "Multi-Exposure Images".

Reviewer 4

This paper develops a new end-to-end method for reconstructing HDR images from LDR images. The method first generates a stack of multi-exposure images from a single LDR image via recurrent networks with U-Net structures. An HDR image is then reconstructed from the stack using the inverse CRF. To make the whole process fully differentiable, the CRF is interpolated with a piecewise linear function. This enables the direct integration of an HDR loss and results in the removal of local inversions with a performance gain (+3.3 HDR-VDP-2 on the VDS dataset). The authors achieve state-of-the-art results on three datasets.
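
As I read it, the key trick is to replace the discrete CRF table lookup with a linear interpolation between neighboring table entries so that gradients can flow through the synthesis step. The sketch below is my own paraphrase of that idea, not the authors' implementation; the function and knot names are hypothetical.

    import torch

    def piecewise_linear_lookup(x, knots_x, knots_y):
        # Evaluate a tabulated (discrete) response curve at continuous inputs
        # by linear interpolation between adjacent knots. The result is
        # differentiable w.r.t. both x and knots_y.
        x = x.clamp(min=float(knots_x[0]), max=float(knots_x[-1]))
        idx = torch.searchsorted(knots_x, x).clamp(1, knots_x.numel() - 1)
        x0, x1 = knots_x[idx - 1], knots_x[idx]
        y0, y1 = knots_y[idx - 1], knots_y[idx]
        t = (x - x0) / (x1 - x0 + 1e-8)
        return y0 + t * (y1 - y0)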

Overall, the paper is well-written and the proposed method seems to work well. My only concern is that the term "End-to-End" in the title sounds a little overclaimed, as the differentiable modification introduces only a single new L_HDR term into the already basically (although not fully) differentiable loss function of the task.

The manuscript would be more readable if the following two points were explicitly mentioned: (1) the connection between the main text and the appendix, and (2) how the inverse CRF g in Eq. 1 is used in Eq. 7.