Normalizing Flows with Multi-Scale Autoregressive Priors – Supplemental Material –
Apratim Bhattacharyya∗1 Shweta Mahajan∗2 Mario Fritz3 Bernt Schiele1 Stefan Roth2
1Max Planck Institute for Informatics, Saarland Informatics Campus
2Department of Computer Science, TU Darmstadt
3CISPA Helmholtz Center for Information Security, Saarland Informatics Campus
We provide additional details of Lemma 4.1 in the main paper, further details of our mAR-SCF model architecture, as well as additional results and qualitative examples.
A. Channel Dimensions in mAR-SCF
We begin by providing additional details of Lemma 4.1 in the main paper. We formally prove the claim that the number of channels at the last layer $n$ (which does not have a SPLIT operation) is $C_n = 2^{n+1} \cdot C$ for an image of size $[C, N, N]$. Thereby, we also show that the number of channels at any layer $i$ (with a SPLIT operation) is $C_i = 2^i \cdot C$.
Note that the number of time steps required for sampling from our mAR-SCF flow model with MARPS (Algorithm 1) depends on the number of channels.
Lemma A.1. Let the size of the sampled image be $[C, N, N]$. The number of channels at the last layer $h_n$ is $C_n = 2^{n+1} \cdot C$.
Proof. The shape of the image to be sampled is $[C, N, N]$. Since the network is invertible, the size of the image at level $h_0$ equals the size of the sampled image, i.e. $[C, N, N]$.

First, let the number of layers $n$ be 1. This implies that during the forward pass the network applies a SQUEEZE operation, which reshapes the image to $[4C, N/2, N/2]$, followed by STEPOFFLOW, which does not modify the shape of the input, i.e. the shape remains $[4C, N/2, N/2]$. Thus the number of channels at the last layer $h_1$ is $C_1 = 4 \cdot C = 2^{1+1} \cdot C$ when $n = 1$.
Next, let us assume that the number of channels at the last layer $h_{k-1}$ for a flow with $k-1$ layers is
$$C_{k-1} = 2^{k} \cdot C. \tag{12}$$
We aim to prove by induction that the number of channels at the last layer $h_k$ for a flow with $k$ layers then is
$$C_{k} = 2^{k+1} \cdot C. \tag{13}$$
∗Authors contributed equally
To that end, we note that the number of channels of the output at layer $k-1$ after SQUEEZE and STEPOFFLOW is, by the assumption of Eq. (12), $C_{k-1} = 2^{k} \cdot C$. For a flow with $k$ layers, the $(k-1)$st layer has a SPLIT operation resulting in $\{l_{k-1}, r_{k-1}\}$, each of size $[2^{k-1} C, N/2^{k-1}, N/2^{k-1}]$. The number of channels of $r_{k-1}$ at layer $k-1$ is thus $2^{k} \cdot C / 2 = 2^{k-1} \cdot C$. At layer $k$, the input with $2^{k-1} \cdot C$ channels is transformed by SQUEEZE to $2^{k-1} \cdot 4 \cdot C = 2^{(k-1)+2} \cdot C = 2^{k+1} \cdot C$ channels.
Therefore, by induction, Eq. (13) holds for $k = n$. Thus the number of channels at the last layer $h_n$ is given as $C_n = 2^{n+1} \cdot C$.
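The induction above is easy to check numerically. The following sketch (illustrative only, not the authors' code) tracks the channel count through the SQUEEZE and SPLIT operations of an $n$-level flow and compares it against the closed form $C_n = 2^{n+1} \cdot C$:

```python
def last_layer_channels(C, n):
    """Channel count at the last layer h_n of an n-level mAR-SCF flow.

    Each level applies SQUEEZE (x4 channels, halved spatial dims) followed
    by STEPOFFLOW (shape-preserving); every level except the last then
    applies SPLIT, keeping half the channels for the next level."""
    c = C
    for i in range(1, n + 1):
        c *= 4              # SQUEEZE: [c, N, N] -> [4c, N/2, N/2]
        if i < n:           # SPLIT at every level except the last
            c //= 2         # r_i keeps half the channels
    return c

# Matches the closed form C_n = 2^(n+1) * C of Lemma A.1:
assert all(last_layer_channels(3, n) == 2 ** (n + 1) * 3 for n in range(1, 8))
```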
B. mAR-SCF Architecture

In Fig. 6, we show the architecture of our mAR prior in detail for a layer $i$ of mAR-SCF. The network is a convolutional LSTM with three LSTM layers. The input at each time step is the previous channel of $l_i$, concatenated with the output of a convolution on $r_i$. Our mAR prior autoregressively outputs the probability of each channel $l_i^j$, given by the distribution $p_\phi(l_i^j \mid l_i^{j-1}, \cdots, l_i^1, r_i)$, modeled as $\mathcal{N}(\mu_i^j, \sigma_i^j)$ during inference. Because of the internal state (memory) of the convolutional LSTM, our mAR prior can learn long-range dependencies.
In Fig. 2 in the main paper, the STEPOFFLOW operation consists of an activation normalization layer and an invertible $1 \times 1$ convolution layer, followed by split coupling layers (32 layers for affine couplings or 4 layers of MixLogCDF at each level).
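To make the channel-wise factorization concrete, the sampling loop for a latent $l_i$ can be sketched as follows. Here `prior_step` is a hypothetical stand-in for one time step of the three-layer convolutional LSTM, returning the Gaussian parameters of the next channel; it is not the authors' implementation.

```python
import numpy as np

def sample_l_i(r_i, n_channels, prior_step, rng):
    """Sample l_i channel by channel from the mAR prior (illustrative sketch).

    `prior_step(prev, r_i)` stands in for one time step of the 3-layer
    convolutional LSTM: given the previously sampled channel (zeros for
    j = 1) and the conditioning r_i, it returns the Gaussian parameters
    (mu_j, sigma_j) of p(l_i^j | l_i^{j-1}, ..., l_i^1, r_i)."""
    H, W = r_i.shape[-2:]
    prev = np.zeros((H, W))              # no previous channel for j = 1
    channels = []
    for _ in range(n_channels):
        mu, sigma = prior_step(prev, r_i)
        l_j = mu + sigma * rng.standard_normal((H, W))  # reparameterized draw
        channels.append(l_j)
        prev = l_j                        # feed the sampled channel back in
    return np.stack(channels)             # shape [n_channels, H, W]
```

The sequential feedback of `prev` is what makes sampling cost scale with the number of channels, as analyzed in Appendix A.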
Figure 6. Architecture of our multi-scale autoregressive (mAR) prior. [Figure: a three-layer convolutional LSTM mapping $r_i$ and the previously sampled channels of $l_i$ to the parameters of $p_\phi(l_i^j \mid l_i^{j-1}, \cdots, l_i^1, r_i)$.]

C. Empirical Analysis of Sampling Speed

In Fig. 7, we analyze the real-world sampling time of our state-of-the-art mAR-SCF model with MixLogCDF couplings using varying image sizes $[C, N, N]$. In particular, we vary the input spatial resolution $N$. We report the mean and variance over 1000 runs with a batch size of 8 on an Nvidia V100 GPU with 32 GB of memory. Using a batch size of 8 ensures that we always have enough parallel resources for all image sizes. Under these conditions, we see that the sampling time increases linearly with the image size. This is because the number of sampling steps of our mAR-SCF model scales as $O(N)$, cf. Lemma 4.1.
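The linear scaling can be sanity-checked with a quick count of autoregressive steps. Assuming $n = \log_2 N$ levels, with $2^i \cdot C$ channels kept at split level $i$ and $2^{n+1} \cdot C$ channels at the last level (Lemma A.1), and one sampling step per channel (this is an illustrative back-of-the-envelope count, not the authors' benchmark code):

```python
import math

def sampling_steps(C, N):
    """Rough count of channel-wise sampling steps for a [C, N, N] image.

    Assumes n = log2(N) levels: split levels i = 1..n-1 sample l_i with
    2^i * C channels, and the last level h_n has 2^(n+1) * C channels
    (Lemma A.1); each channel costs one autoregressive step."""
    n = int(math.log2(N))
    steps = sum(2 ** i * C for i in range(1, n))  # split levels 1 .. n-1
    steps += 2 ** (n + 1) * C                     # last level h_n
    return steps

# Doubling N roughly doubles the step count, i.e. O(N) growth:
ratio = sampling_steps(3, 64) / sampling_steps(3, 32)  # ~2.02
```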
Figure 7. Analysis of real-world sampling time of our mAR-SCF model with varying image size $[C, N, N]$. For reference, we show the line $y = 2x$ in black.
D. Additional Qualitative Examples

We include additional qualitative examples from training our generative model on the MNIST, CIFAR10, and ImageNet (32 × 32) datasets. In Fig. 8, we see that the visual sample quality of our mAR-SCF model with MixLogCDF couplings is competitive with that of the state-of-the-art Residual Flows [4]. Moreover, note that we achieve better test log-likelihoods (0.88 vs. 1.00 bits/dim).
We additionally provide qualitative examples for ImageNet (32 × 32) in Fig. 10. We observe that, in comparison to the state-of-the-art Flow++ [14] and Residual Flows [4], the images generated by our mAR-SCF model are detailed and of competitive visual quality.

Finally, on CIFAR10 we compare with the fully autoregressive PixelCNN model [38] in Fig. 9. We see that although the fully autoregressive PixelCNN achieves significantly better test log-likelihoods (3.00 vs. 3.24 bits/dim), our mAR-SCF model achieves better visual sample quality (as also shown by the FID and Inception metrics in Table 3 of the main paper).
E. Additional Interpolation Results
Finally, in Fig. 11 we show qualitative examples comparing the effect of using our proposed interpolation method (Eq. 11) versus simple linear interpolation in the multimodal latent space of our mAR prior. We use the Adamax optimizer to optimize Eq. (11) with a learning rate of $5 \times 10^{-2}$ for $\sim 100$ iterations. The computational cost per iteration is approximately equivalent to that of a standard training iteration of our mAR-SCF model, i.e. $\sim 1$ sec for a batch of size 128. We observe that interpolated images obtained from our mAR-SCF affine model using our proposed interpolation method (Eq. 11) have better visual quality, especially in the middle of the interpolation path. Note that we find our scheme to be stable in practice for a reasonably wide range of the hyperparameters $\lambda_2 \approx \lambda_1 \in [0.2, 0.5]$, as it interpolates in high-density regions between $\mathbf{x}_A$ and $\mathbf{x}_B$. We support the better visual quality of the interpolations of our scheme compared to the linear method with Inception scores computed for interpolations obtained from both methods in Fig. 12.
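The refinement loop can be sketched as follows. Since we do not reproduce the Eq. (11) objective here, `grad_fn` is a hypothetical stand-in for its gradient with respect to the interpolated latent; the update rule is the standard Adamax ascent step with the learning rate of $5 \times 10^{-2}$ and $\sim 100$ iterations mentioned above.

```python
import numpy as np

def adamax_refine(z_init, grad_fn, lr=5e-2, iters=100, b1=0.9, b2=0.999):
    """Gradient-ascent refinement of an interpolated latent with Adamax.

    Illustrative sketch: `grad_fn(z)` stands in for the gradient of the
    Eq. (11) objective; any differentiable density-based objective fits
    this interface."""
    z = np.array(z_init, dtype=float)
    m = np.zeros_like(z)  # first-moment estimate
    u = np.zeros_like(z)  # exponentially weighted infinity norm
    for t in range(1, iters + 1):
        g = grad_fn(z)
        m = b1 * m + (1 - b1) * g
        u = np.maximum(b2 * u, np.abs(g))
        z = z + lr / (1 - b1 ** t) * m / (u + 1e-8)  # bias-corrected ascent step
    return z
```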
(a) Residual Flows [4] (b) Our mAR-SCF (MixLogCDF)
Figure 8. Random samples when trained on MNIST.
(a) PixelCNN [38] (b) Our mAR-SCF (MixLogCDF)
Figure 9. Random samples when trained on CIFAR10 (32 × 32).
(a) Real Data (b) Flow++ [14]
(c) Residual Flows [4] (d) Our mAR-SCF (MixLogCDF)
Figure 10. Random samples when trained on ImageNet (32 × 32).
Figure 11. Comparison of interpolation quality of our interpolation scheme (Eq. 11) with a standard linear interpolation scheme. The top rowof each result shows the interpolations between real images for the linear method and the corresponding bottom rows are the interpolationswith our proposed scheme.
Figure 12. Inception scores of random samples generated with our interpolation scheme versus linear interpolation.