Normalizing Flows with Multi-Scale Autoregressive Priors – Supplemental Material –
Apratim Bhattacharyya∗1 Shweta Mahajan∗2 Mario Fritz3 Bernt Schiele1 Stefan Roth2
1Max Planck Institute for Informatics, Saarland Informatics Campus
2Department of Computer Science, TU Darmstadt
3CISPA Helmholtz Center for Information Security, Saarland Informatics Campus
We provide additional details of Lemma 4.1 in the main paper, further details of our mAR-SCF model architecture, as well as additional results and qualitative examples.
A. Channel Dimensions in mAR-SCF
We begin by providing additional details of Lemma 4.1 in the main paper. We formally prove the claim that the number of channels at the last layer $n$ (which does not have a SPLIT operation) is $C_n = 2^{n+1} \cdot C$ for an image of size $[C, N, N]$. Thereby, we also show that the number of channels at any layer $i$ (with a SPLIT operation) is $C_i = 2^i \cdot C$.
Note that the number of time steps required for sampling from our mAR-SCF flow model with MARPS (Algorithm 1) depends on the number of channels.
Lemma A.1. Let the size of the sampled image be $[C, N, N]$. The number of channels at the last layer $h_n$ is $C_n = 2^{n+1} \cdot C$.
Proof. The shape of the image to be sampled is $[C, N, N]$. Since the network is invertible, the size of the image at level $h_0$ equals the size of the sampled image, i.e. $[C, N, N]$.

First, let the number of layers $n$ be 1. This implies that during the forward pass the network applies a SQUEEZE operation, which reshapes the image to $[4C, N/2, N/2]$, followed by STEPOFFLOW, which does not modify the shape of the input, i.e. the shape remains $[4C, N/2, N/2]$. Thus the number of channels at the last layer $h_1$ is $C_1 = 4 \cdot C = 2^{1+1} \cdot C$ when $n = 1$.
Next, let us assume that the number of channels at the last layer $h_{k-1}$ for a flow with $k-1$ layers is
$$C_{k-1} = 2^{k} \cdot C. \tag{12}$$
We aim to prove by induction that the number of channels at the last layer $h_k$ for a flow with $k$ layers then is
$$C_{k} = 2^{k+1} \cdot C. \tag{13}$$
∗Authors contributed equally
To that end, we note that the number of channels of the output at layer $k-1$ after SQUEEZE and STEPOFFLOW is, by the assumption of Eq. (12), $C_{k-1} = 2^{k} \cdot C$. For a flow with $k$ layers, the $(k-1)$st layer has a SPLIT operation resulting in $\{l_{k-1}, r_{k-1}\}$, each of size $[2^{k-1} C, N/2^{k-1}, N/2^{k-1}]$. The number of channels of $r_{k-1}$ at layer $k-1$ is thus $2^{k} \cdot C / 2 = 2^{k-1} \cdot C$. At layer $k$, the input with $2^{k-1} \cdot C$ channels is transformed by SQUEEZE to $2^{k-1} \cdot 4 \cdot C = 2^{(k-1)+2} \cdot C = 2^{k+1} \cdot C$ channels.
Therefore, by induction, Eq. (13) holds for $k = n$. Thus the number of channels at the last layer $h_n$ is given as $C_n = 2^{n+1} \cdot C$.
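The induction above is easy to check numerically. The following sketch (illustrative only, not the authors' code) tracks the channel count through the SQUEEZE and SPLIT operations of an $n$-level flow and compares it against the closed form $C_n = 2^{n+1} \cdot C$:

```python
def last_layer_channels(C, n):
    """Channel count at the last layer h_n of an n-level mAR-SCF flow.

    Each level applies SQUEEZE (x4 channels, halved spatial dims) followed
    by STEPOFFLOW (shape-preserving); every level except the last then
    applies SPLIT, keeping half the channels for the next level."""
    c = C
    for i in range(1, n + 1):
        c *= 4              # SQUEEZE: [c, N, N] -> [4c, N/2, N/2]
        if i < n:           # SPLIT at every level except the last
            c //= 2         # r_i keeps half the channels
    return c

# Matches the closed form C_n = 2^(n+1) * C of Lemma A.1:
assert all(last_layer_channels(3, n) == 2 ** (n + 1) * 3 for n in range(1, 8))
```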
B. mAR-SCF Architecture

In Fig. 6, we show the architecture of our mAR prior in detail for a layer $i$ of mAR-SCF. The network is a convolutional LSTM with three LSTM layers. The input at each time step is the previous channel of $l_i$, concatenated with the output of a convolution on $r_i$. Our mAR prior autoregressively outputs the probability of each channel $l_i^j$, given by the distribution $p_\phi(l_i^j \mid l_i^{j-1}, \cdots, l_i^1, r_i)$, modeled as $\mathcal{N}(\mu_i^j, \sigma_i^j)$ during inference. Because of the internal state (memory) of the convolutional LSTM, our mAR prior can learn long-range dependencies.
In Fig. 2 in the main paper, the STEPOFFLOW operation consists of an activation normalization layer and an invertible $1 \times 1$ convolution layer, followed by split coupling layers (32 layers for affine couplings or 4 layers of MixLogCDF at each level).
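To make the channel-wise factorization concrete, the sampling loop for a latent $l_i$ can be sketched as follows. Here `prior_step` is a hypothetical stand-in for one time step of the three-layer convolutional LSTM, returning the Gaussian parameters of the next channel; it is not the authors' implementation.

```python
import numpy as np

def sample_l_i(r_i, n_channels, prior_step, rng):
    """Sample l_i channel by channel from the mAR prior (illustrative sketch).

    `prior_step(prev, r_i)` stands in for one time step of the 3-layer
    convolutional LSTM: given the previously sampled channel (zeros for
    j = 1) and the conditioning r_i, it returns the Gaussian parameters
    (mu_j, sigma_j) of p(l_i^j | l_i^{j-1}, ..., l_i^1, r_i)."""
    H, W = r_i.shape[-2:]
    prev = np.zeros((H, W))              # no previous channel for j = 1
    channels = []
    for _ in range(n_channels):
        mu, sigma = prior_step(prev, r_i)
        l_j = mu + sigma * rng.standard_normal((H, W))  # reparameterized draw
        channels.append(l_j)
        prev = l_j                        # feed the sampled channel back in
    return np.stack(channels)             # shape [n_channels, H, W]
```

The sequential feedback of `prev` is what makes sampling cost scale with the number of channels, as analyzed in Appendix A.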
Figure 6. Architecture of our multi-scale autoregressive (mAR) prior. [Figure: a three-layer convolutional LSTM mapping $r_i$ and the previously sampled channels of $l_i$ to the parameters of $p_\phi(l_i^j \mid l_i^{j-1}, \cdots, l_i^1, r_i)$.]

C. Empirical Analysis of Sampling Speed

In Fig. 7, we analyze the real-world sampling time of our state-of-the-art mAR-SCF model with MixLogCDF couplings using varying image sizes $[C, N, N]$. In particular, we vary the input spatial resolution $N$. We report the mean and variance over 1000 runs with a batch size of 8 on an Nvidia V100 GPU with 32 GB of memory. Using a batch size of 8 ensures that we always have enough parallel resources for all image sizes. Under these conditions, we see that the sampling time increases linearly with the image size. This is because the number of sampling steps of our mAR-SCF model scales as $O(N)$, cf. Lemma 4.1.
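The linear scaling can be sanity-checked with a quick count of autoregressive steps. Assuming $n = \log_2 N$ levels, with $2^i \cdot C$ channels kept at split level $i$ and $2^{n+1} \cdot C$ channels at the last level (Lemma A.1), and one sampling step per channel (this is an illustrative back-of-the-envelope count, not the authors' benchmark code):

```python
import math

def sampling_steps(C, N):
    """Rough count of channel-wise sampling steps for a [C, N, N] image.

    Assumes n = log2(N) levels: split levels i = 1..n-1 sample l_i with
    2^i * C channels, and the last level h_n has 2^(n+1) * C channels
    (Lemma A.1); each channel costs one autoregressive step."""
    n = int(math.log2(N))
    steps = sum(2 ** i * C for i in range(1, n))  # split levels 1 .. n-1
    steps += 2 ** (n + 1) * C                     # last level h_n
    return steps

# Doubling N roughly doubles the step count, i.e. O(N) growth:
ratio = sampling_steps(3, 64) / sampling_steps(3, 32)  # ~2.02
```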
Figure 7. Analysis of real-world sampling time of our mAR-SCF model with varying image size $[C, N, N]$. For reference, we show the line $y = 2x$ in black.
D. Additional Qualitative Examples

We include additional qualitative examples from training our generative model on the MNIST, CIFAR10, and ImageNet (32 × 32) datasets. In Fig. 8, we see that the visual sample quality of our mAR-SCF model with MixLogCDF couplings is competitive with that of the state-of-the-art Residual Flows [4]. Moreover, note that we achieve better test log-likelihoods (0.88 vs. 1.00 bits/dim).
We additionally provide qualitative examples for ImageNet (32 × 32) in Fig. 10. We observe that, in comparison to the state-of-the-art Flow++ [14] and Residual Flows [4], the images generated by our mAR-SCF model are detailed and of competitive visual quality.

Finally, on CIFAR10 we compare with the fully autoregressive PixelCNN model [38] in Fig. 9. We see that although the fully autoregressive PixelCNN achieves significantly better test log-likelihoods (3.00 vs. 3.24 bits/dim), our mAR-SCF model achieves better visual sample quality (as also shown by the FID and Inception metrics in Table 3 of the main paper).
E. Additional Interpolation Results
Finally, in Fig. 11 we show qualitative examples comparing the effect of using our proposed interpolation method (Eq. 11) versus simple linear interpolation in the multimodal latent space of our mAR prior. We use the Adamax optimizer to optimize Eq. (11) with a learning rate of $5 \times 10^{-2}$ for $\sim 100$ iterations. The computational cost per iteration is approximately equivalent to that of a standard training iteration of our mAR-SCF model, i.e. $\sim 1$ sec for a batch of size 128. We observe that interpolated images obtained from our mAR-SCF affine model using our proposed interpolation method (Eq. 11) have better visual quality, especially in the middle of the interpolation path. Note that we find our scheme to be stable in practice for a reasonably wide range of the hyperparameters $\lambda_2 \approx \lambda_1 \in [0.2, 0.5]$, as it interpolates in high-density regions between $\mathbf{x}_A$ and $\mathbf{x}_B$. We support the better visual quality of the interpolations of our scheme compared to the linear method with Inception scores computed for interpolations obtained from both methods in Fig. 12.
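The refinement loop can be sketched as follows. Since we do not reproduce the Eq. (11) objective here, `grad_fn` is a hypothetical stand-in for its gradient with respect to the interpolated latent; the update rule is the standard Adamax ascent step with the learning rate of $5 \times 10^{-2}$ and $\sim 100$ iterations mentioned above.

```python
import numpy as np

def adamax_refine(z_init, grad_fn, lr=5e-2, iters=100, b1=0.9, b2=0.999):
    """Gradient-ascent refinement of an interpolated latent with Adamax.

    Illustrative sketch: `grad_fn(z)` stands in for the gradient of the
    Eq. (11) objective; any differentiable density-based objective fits
    this interface."""
    z = np.array(z_init, dtype=float)
    m = np.zeros_like(z)  # first-moment estimate
    u = np.zeros_like(z)  # exponentially weighted infinity norm
    for t in range(1, iters + 1):
        g = grad_fn(z)
        m = b1 * m + (1 - b1) * g
        u = np.maximum(b2 * u, np.abs(g))
        z = z + lr / (1 - b1 ** t) * m / (u + 1e-8)  # bias-corrected ascent step
    return z
```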
(a) Residual Flows [4] (b) Our mAR-SCF (MixLogCDF)
Figure 8. Random samples when trained on MNIST.
(a) PixelCNN [38] (b) Our mAR-SCF (MixLogCDF)
Figure 9. Random samples when trained on CIFAR10 (32 × 32).
(a) Real Data (b) Flow++ [14]
(c) Residual Flows [4] (d) Our mAR-SCF (MixLogCDF)
Figure 10. Random samples when trained on ImageNet (32 × 32).
Figure 11. Comparison of interpolation quality of our interpolation scheme (Eq. 11) with a standard linear interpolation scheme. The top rowof each result shows the interpolations between real images for the linear method and the corresponding bottom rows are the interpolationswith our proposed scheme.
Figure 12. Inception scores of random samples generated with our interpolation scheme versus linear interpolation.