arXiv:1801.09710v1 [cs.LG] 29 Jan 2018


tempoGAN: A Temporally Coherent, Volumetric GAN for Super-resolution Fluid Flow

YOU XIE∗, Technical University of Munich

ERIK FRANZ∗, Technical University of Munich

MENGYU CHU∗, Technical University of Munich

NILS THUEREY, Technical University of Munich

Fig. 1. Our convolutional neural network learns to generate highly detailed and temporally coherent features based on a low-resolution field containing a single time-step of density and velocity data. We introduce a novel discriminator that ensures the synthesized details change smoothly over time.

We propose a temporally coherent generative model addressing the super-resolution problem for fluid flows. Our work represents a first approach to synthesize four-dimensional physics fields with neural networks. Based on a conditional generative adversarial network that is designed for the inference of three-dimensional volumetric data, our model generates consistent and detailed results by using a novel temporal discriminator, in addition to the commonly used spatial one. Our experiments show that the generator is able to infer more realistic high-resolution details by using additional physical quantities, such as low-resolution velocities or vorticities. Besides improvements in the training process and in the generated outputs, these inputs offer means for artistic control as well. We additionally employ a physics-aware data augmentation step, which is crucial to avoid overfitting and to reduce memory requirements. In this way, our network learns to generate advected quantities with highly detailed, realistic, and temporally coherent features. Our method works instantaneously, using only a single time-step of low-resolution fluid data. We demonstrate the abilities of our method using a variety of complex inputs and applications in two and three dimensions.

This work is supported by the ERC Starting Grant 637014. (*) Similar amount of contributions.
Authors’ addresses: You Xie∗, Technical University of Munich, [email protected]; Erik Franz∗, Technical University of Munich, [email protected]; Mengyu Chu∗, Technical University of Munich, [email protected]; Nils Thuerey, Technical University of Munich, [email protected].

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
© 2018 Copyright held by the owner/author(s).
XXXX-XXXX/2018/11-ART
https://doi.org/10.1145/nnnnnnn.nnnnnnn

CCS Concepts: • Computing methodologies → Neural networks; Physical simulation;

Additional Key Words and Phrases: deep learning, generative models, physically-based animation, fluid simulation

ACM Reference Format:

You Xie∗, Erik Franz∗, Mengyu Chu∗, and Nils Thuerey. 2018. tempoGAN: A Temporally Coherent, Volumetric GAN for Super-resolution Fluid Flow. 1, 1 (November 2018), 15 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

In recent years, generative models have been highly successful at representing and synthesizing complex natural images [Goodfellow et al. 2014]. These works demonstrated that deep convolutional neural networks (CNNs) are able to capture the distribution of, e.g., photos of human faces, and generate novel, previously unseen versions that are virtually indistinguishable from the original inputs. Likewise, similar algorithms were shown to be extremely successful at generating natural high-resolution images from a coarse input [Karras et al. 2017]. However, in their original form, these generative models do not take into account the temporal evolution of the data, which is crucial for realistic physical systems. In the following, we will extend these methods to generate high-resolution volumetric data sets of passively advected flow quantities, and ensuring temporal coherence is one of the core aspects that we will focus on below. We will demonstrate that it is especially important to make the training process aware of the underlying transport phenomena, such that the network can learn to generate stable and highly detailed solutions.

Capturing the intricate details of turbulent flows has been a long-standing challenge for numerical simulations. Resolving such details with discretized models induces enormous computational costs and quickly becomes infeasible for flows on human space and time scales. While algorithms to increase the apparent resolution of simulations can alleviate this problem [Kim et al. 2008], they are typically based on procedural models that are only loosely inspired by the underlying physics. In contrast to all previous methods, our algorithm represents a physically-based interpolation that does not require any form of additional temporal data or quantities tracked over time. The super-resolution process is instantaneous, based on volumetric data from a single frame of a fluid simulation. We found that inference of high-resolution data in a fluid flow setting benefits from the availability of information about the flow. In our case, this takes the shape of additional physical variables such as velocity and vorticity as inputs, which in turn yield means for artistic control.

A particular challenge in the field of super-resolution flow is how to evaluate the quality of the generated output. As we are typically targeting turbulent motions, a single coarse approximation can be associated with a large variety of significantly different high-resolution versions. As long as the output matches the correlated spatial and temporal distributions of the reference data, it represents a correct solution. To encode this requirement in the training process of a neural network, we employ so-called generative adversarial networks (GANs). These methods train a generator, as well as a second network, the discriminator, which learns to judge how closely the generated output matches the ground truth data. In this way, we train a specialized, data-driven loss function alongside the generative network, while making sure it is differentiable and compatible with the training process. We not only employ this adversarial approach for the smoke density outputs, but we also train a specialized and novel adversarial loss function that learns to judge the temporal coherence of the outputs.

We additionally present best practices to set up a training pipeline for physics-based GANs. E.g., we found it particularly useful to have physics-aware data augmentation functionality in place. The large amounts of space-time data that arise in the context of many physics problems quickly bring typical hardware environments to their limits. As such, we found data augmentation crucial to avoid overfitting. We also explored a variety of different variants for setting up the networks as well as training them, and we will evaluate them in terms of their capabilities to learn high-resolution physics functions below.

To summarize, the main contributions of our work are:

• a novel temporal discriminator, to generate consistent and highly detailed results over time,

• artistic control of the outputs, in the form of additional loss terms and an intentional entangling of the physical quantities used as inputs,

• a physics-aware data augmentation method,

• and a thorough evaluation of adversarial training processes for physics functions.

To the best of our knowledge, our approach is the first generative adversarial network for four-dimensional functions, and we will demonstrate that it successfully learns to infer solutions for flow transport processes from approximate solutions.

2 RELATED WORK

In the area of computer vision, deep learning techniques have achieved significant breakthroughs in numerous fields such as classification [Krizhevsky et al. 2012], object detection [Girshick et al. 2014], style transfer [Luan et al. 2017], novel view synthesis [Flynn et al. 2016], and additionally, in the area of content creation. For more in-depth reviews of neural networks and deep learning techniques, we refer the readers to corresponding books [Bishop 2006; Goodfellow et al. 2016].

One of the most popular methods to generate content is the so-called generative adversarial network (GAN), introduced by Goodfellow et al. [2014]. GANs were shown to be particularly powerful at re-creating the distributions of complex data sets such as images of human faces. Depending on the kind of input data they take, GANs can be separated into unconditional and conditional ones. The former generate realistic data from samples of a synthetic data distribution such as Gaussian noise. The DC-GAN [Radford et al. 2016] is a good example of an unconditional GAN. It was designed for generic natural images, while the cycle-consistent GAN by Zhu et al. [2017] was developed to translate between different classes of images. Conditional GANs were introduced by Mirza and Osindero [2014], and provide the network with an input that is in some way related to the target function in order to control the generated output. Therefore, conditional variants are popular for transformation tasks, such as image translation problems [Isola et al. 2017] and super-resolution problems [Ledig et al. 2016].

In the field of super-resolution techniques, researchers have explored different network architectures. E.g., convolutional networks [Dong et al. 2016] were shown to be more effective than fully connected architectures. These networks can be trained with smaller tiles and later on be applied to images of arbitrary sizes [Isola et al. 2017]. Batch normalization [Lim et al. 2017] significantly improves results by removing value shifting in hidden layers, and networks employing so-called residual blocks [Kim et al. 2016; Lim et al. 2017] enable the training of deep networks without strongly vanishing gradients.

In terms of loss functions, pixel-wise losses between the network output and the ground truth data, such as the L1 and L2 loss [Dong et al. 2016], used to be common. Nowadays, using the adversarial structure of GANs, or using pre-trained networks such as the VGG net [Simonyan and Zisserman 2014], often leads to higher perceptual quality [Johnson et al. 2016; Mathieu et al. 2015].

Our method likewise uses residual blocks in conjunction with a conditional GAN architecture to infer three-dimensional flow quantities. Here, we try to use standard architectures. While our generator is similar to approaches for image super-resolution [Ledig et al. 2016], we show that loss terms and discriminators are crucial for high-quality outputs. We also employ a fairly traditional GAN training, instead of recently proposed alternatives [Arjovsky et al. 2017; Berthelot et al. 2017], which could potentially lead to additional gains in quality. Besides the super-resolution task, our work differs from many works in the GAN area with its focus on temporal coherence, as we will demonstrate in more detail later on.

While most works have focused on single images, several papers have addressed temporal changes of data sets. One way to solve this problem is by directly incorporating the time axis, i.e., by using sequences of data as input and output. E.g., Saito et al. [2017] propose a temporal generator, while Yu et al. [2017] propose a sequence generator that learns a stochastic policy. In these works, results need to be generated sequentially, while our algorithm processes individual frames independently, and in arbitrary order, if necessary. In addition, such approaches would explode in terms of weights and computational resources for typical four-dimensional fluid data sets.

An alternative here is to generate single frame data with additional loss terms to keep the results coherent over time. Bhattacharjee et al. [2017] achieved improved coherence in their results for video frame prediction by adding specially designed distance measures as a discontinuity penalty between nearby frames. For video style transfer, an L2 loss on warped nearby frames helped to alleviate temporal discontinuities, as shown by Ruder et al. [2016]. In addition to an L2 loss on nearby frames, Chen et al. [2017] used neural networks to learn frame warping and frame combination in VGG feature space. Similarly, Liu et al. [2017] used neural networks to learn spatial alignment for low-resolution inputs, and adaptive aggregation for high-resolution outputs, which also improved the temporal coherence. Due to the three-dimensional data sets we are facing, we also adopt the single frame view. However, in contrast to all previous works, we propose the use of a temporal discriminator. We will show that relying on data-driven, learned loss functions in the form of a discriminator helps to improve results over manually designed losses. Once our networks are trained, this discriminator can be discarded. Thus, unlike, e.g., aggregation methods, our approach does not influence runtime performance. While previous work shows that warping layers are useful in motion field learning [Chen et al. 2017; de Bezenac et al. 2017], our work targets the opposite direction: by providing our networks with velocities, warping layers can likewise improve the training of temporally coherent content generation.

More recently, deep learning algorithms have begun to influence computer graphics algorithms. E.g., they were successfully used for efficient and noise-free renderings [Bako et al. 2017; Chaitanya et al. 2017], the illumination of volumes [Kallweit et al. 2017], for modeling porous media [Mosser et al. 2017], and for character control [Peng et al. 2017]. First works also exist that target numerical simulations. E.g., a conditional GAN was used to compute solutions for smaller, two-dimensional advection-diffusion problems [Farimani et al. 2017; Long et al. 2017]. Others have demonstrated the inference of SPH forces with regression forests [Ladicky et al. 2015], proposed CNNs for fast pressure projections [Tompson et al. 2016], learned space-time deformations for interactive liquids [Prantl et al. 2017], and modeled splash statistics with NNs [Um et al. 2017]. Closer to our line of work, Chu et al. [2017] proposed a method to look up pre-computed patches using CNN-based descriptors. Despite a similar goal, their method still requires additional Lagrangian tracking information, while our method does not require any modifications of a basic solver. In addition, our method does not use any stored data at runtime apart from the trained generator model.

As our method focuses on the conditional inference of high-resolution flow data sets, we also give a brief overview of the related work here, with a particular focus on single-phase flows. The stable fluids algorithm [Stam 1999] arguably introduced single-phase simulations to computer animation. Based on this approach, a variety of powerful extensions and variants have been developed over the years. More accurate advection schemes [Kim et al. 2005; Selle et al. 2008] are often employed, and many applications require interactions with obstacle boundaries [Batty et al. 2007; Teng et al. 2016]. Additionally, methods to increase the resolution of flows with procedural turbulence methods are popular extensions [Kim et al. 2008; Narain et al. 2008]. Unlike our work, these methods require a full advection of the high-resolution density field over the full simulation sequence. In contrast, our method infers an instantaneous solution to the underlying advection problem based only on a single snapshot of data.

Our work also shares similarities in terms of goals with other physics-based up-sampling algorithms [Kavan et al. 2011], and due to this goal, is related to fluid control methods [McNamara et al. 2004; Pan et al. 2013]. These methods would work very well in conjunction with our approach, in order to generate a coarse input with the right shape and timing.

3 ADVERSARIAL LOSS FUNCTIONS

Based on a set of low-resolution inputs, with corresponding high-resolution references, our goal is to train a CNN that produces a temporally coherent, high-resolution solution with adversarial training. We will first very briefly summarize the basics of adversarial training, and then explain our extensions for temporal coherence and for changing the style of outputs.


3.1 Generative Adversarial Networks

GANs consist of two models, which are trained in conjunction: the generator G and the discriminator D. Both will be realized as convolutional neural networks in our case. For regular, i.e., not conditional GANs, the goal is to train a generator G(x) that maps a simple data distribution, typically noise x, to a complex desired output y, e.g., natural images. Instead of using a manually specified loss term to train the generator, another NN, the discriminator, is used as a complex, learned loss function [Goodfellow et al. 2014]. This discriminator takes the form of a simple binary classifier, which is trained in a supervised manner to reject generated data, i.e., it should return D(G(x)) = 0, and to accept the real data with D(y) = 1. For training, the loss for the discriminator is thus given by a sigmoid cross entropy for the two classes “generated” and “real”:

$$L_D(D,G) = \mathbb{E}_{y\sim p_{\mathrm{data}}(y)}[-\log D(y)] + \mathbb{E}_{x\sim p_{\mathrm{data}}(x)}[-\log(1 - D(G(x)))] = \mathbb{E}_m[-\log D(y_m)] + \mathbb{E}_n[-\log(1 - D(G(x_n)))] \tag{1}$$

where n is the number of drawn inputs x, while m denotes the number of real data samples y. Here we use the notation y ∼ pdata(y) for samples y being drawn from a corresponding probability data distribution pdata, which will later on be represented by our numerical simulation framework. In a discrete setting, the continuous distribution yields the average of the samples $y_m$ and $x_n$ in the second line of Eq. (1). We will omit the y ∼ pdata(y) and x ∼ pdata(x) subscripts of the sigmoid cross entropy, and the n and m subscripts of $D(y_m)$ and $G(x_n)$, for clarity below.

In contrast to the discriminator, the generator is trained to “fool” the discriminator into accepting its samples and thus to generate output that is close to the real data from y. In practice, this means that the generator is trained to drive the discriminator result for its outputs to one. Instead of directly using the negative discriminator loss, GANs typically use

$$L_G(D,G) = \mathbb{E}_{x\sim p_{\mathrm{data}}(x)}[-\log(D(G(x)))] = \mathbb{E}_n[-\log(D(G(x)))] \tag{2}$$

as the loss function for the generator, in order to reduce diminishing gradient problems [Goodfellow 2016]. As D is realized as a NN, it is guaranteed to be sufficiently differentiable as a loss function for G. In practice, both discriminator and generator are trained in turns and will optimally reach an equilibrium state.

As we target a super-resolution problem, our goal is not to generate an arbitrary high-resolution output, but one that corresponds to a low-resolution input, and hence we employ a conditional GAN. In terms of the dual optimization problem described above, this means that the input x now represents the low-resolution data set, and the discriminator is provided with x in order to establish and ensure the correct relationship between input and output, i.e., we now have D(x, y) and D(x, G(x)) [Mirza and Osindero 2014]. Furthermore, previous work [Zhao et al. 2015] has shown that an additional L1 loss term with a small weight can be added to the generator to ensure that its output stays close to the ground truth y. This yields $\lambda_{L_1} \mathbb{E}_n \| G(x) - y \|_1$, where $\lambda_{L_1}$ controls the strength of this term, and we use E for consistency to denote the expected value, in this discrete case being equivalent to an average.
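The corresponding non-saturating generator loss of Eq. (2), extended by the small L1 term of the conditional setting, then takes the following form (again a TensorFlow 1.x sketch with illustrative names; the value of lambda_l1 is a placeholder, not the weight used in our experiments):

```python
import tensorflow as tf

def generator_loss(fake_logits, generated, target, lambda_l1=1.0):
    """Non-saturating adversarial loss plus the L1 regularization term.
    fake_logits: discriminator output for (x, G(x)); generated: G(x); target: y."""
    adversarial = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
        labels=tf.ones_like(fake_logits), logits=fake_logits))    # -log D(x, G(x))
    l1 = tf.reduce_mean(tf.abs(generated - target))               # E[|G(x) - y|]
    return adversarial + lambda_l1 * l1
```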

3.2 Loss in Feature Spaces

In order to further control and guide the coupled, non-linear optimization process, the features of the underlying CNNs can be constrained. This not only leads to more realistic outputs [Dosovitskiy and Brox 2016], but was also shown to help with mode collapse problems [Salimans et al. 2016]. To achieve this goal, an L2 loss over parts or the whole feature space of a neural network is introduced for the generator. I.e., the intermediate results of the generator network are constrained w.r.t. a set of intermediate reference data. While previous work typically makes use of manually selected layers of pre-trained networks, such as the VGG net, we propose to use features of the discriminator as constraints instead. Thus, we incorporate a novel loss term of the form

$$L_f = \mathbb{E}_{n,j}\, \lambda_f^j \left\| F^j(G(x)) - F^j(y) \right\|_2^2 \tag{3}$$

where $j$ is a layer in our discriminator network, and $F^j$ denotes the activations of the corresponding layer. The factor $\lambda_f^j$ is a weighting term, which can be adjusted on a per-layer basis, as we will discuss in Sec. 5.2. It is particularly important in this case that we can employ the discriminator here, as no suitable pre-trained networks are available for three-dimensional flow problems.
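Assuming the discriminator exposes its intermediate activations, Eq. (3) amounts to a weighted sum of L2 feature distances, as the following sketch illustrates (layer selection and the weights $\lambda_f^j$ are discussed in Sec. 5.2; names are illustrative):

```python
import tensorflow as tf

def feature_space_loss(features_generated, features_real, layer_weights):
    """features_*: lists of activation tensors F^j from the same discriminator,
    evaluated on G(x) and y; layer_weights holds the per-layer factors lambda_f^j
    (which may also be chosen negative, see below)."""
    loss = 0.0
    for f_g, f_y, weight in zip(features_generated, features_real, layer_weights):
        loss += weight * tf.reduce_mean(tf.squared_difference(f_g, f_y))
    return loss
```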

These loss terms effectively encourage a minimization of the mean feature space distances of real and generated data sets for $\lambda_f > 0$. However, we found that, surprisingly, training runs with $\lambda_f < 0$ also yield excellent, and often even better results. As we are targeting conditional GANs, our networks are highly constrained by the inputs. A negative feature loss in this setting encourages the optimization to generate results that differ in terms of the features, but are still similar, ideally indistinguishable, in terms of their final output. This is possible as we are not targeting a single ground-truth result, but rather, we give the generator the freedom to generate any result that best fits the collection of inputs it receives. From our experience, this loss term drives the generator towards realistic detail, an example of which can be seen in Fig. 2.

Fig. 2. From left to right: a) a sample, low-resolution input, b) a CNN output with naive L2 loss (no GAN training), c) our tempoGAN output, and d) the high-resolution reference. The L2 version learns a smooth result without small scale details, while our output in (c) surpasses the detail of the reference in certain regions.

, Vol. 1, No. 1, Article . Publication date: November 2018. 2018-11-16 20:30 page 4 (pp. 1-15) Submission ID: submitted to SIGGRAPH

Page 5: tempoGAN: A Temporally Coherent, Volumetric GAN for Super ... · elot et al. 2017], which could potentially lead to additional gains in quality. Besides the super-resolution task,

tempoGAN: A Temporally Coherent, Volumetric GAN for Super-resolution Fluid Flow • :5

3.3 Temporal Coherence

While the GAN process described so far is highly successful at generating highly detailed and realistic outputs for static frames, these details are particularly challenging in terms of their temporal coherence. Since both the generator and the discriminator work on every frame independently, subtle changes of the input x can lead to outputs G(x) with distinctly different details for higher spatial frequencies.

When the ground truth data y comes from a transport process, such as frame motion or flow motion, it typically exhibits a very high degree of temporal coherence, and a velocity field $v_y$ exists for which $y^t = A(y^{t-1}, v_y^{t-1})$. Here, we denote the advection operator (also called warp or transport in other works) with $A$, and we assume without loss of generality that the time step between frames $t$ and $t-1$ is equal to one. Discrete time steps will be denoted by superscripts, i.e., for a function $y$ of space and time, $y^t = y(x, t)$ denotes a full spatial sample at time $t$. Similarly, in order to solve the temporal coherence problem, the relationship $G(x^t) = A(G(x^{t-1}), v_{G(x)}^{t-1})$ should hold, which assumes that we can compute a motion $v_{G(x)}$ based on the generator input x. While directly computing such a motion can be difficult and unnecessary for general GAN problems, we can make use of the ground truth data for y in our conditional setting. I.e., in the following, we will use a velocity reference $v_y$ corresponding to the target y, and perform a spatial down-sampling to compute the velocity $v_x$ for the input x.

Equipped with $v_x$, one possibility to improve temporal coherence would be to add an L2 loss term of the form:

$$L_{2,t} = \left\| G(x^t) - A\big(G(x^{t-1}), v_x^{t-1}\big) \right\|_2^2 \tag{4}$$

We found that extending the forward-advection difference with backward-advection improves the results further, i.e., the following L2 loss is clearly preferable over Eq. (4):

$$L_{2,t} = \left\| G(x^t) - A\big(G(x^{t-1}), v_x^{t-1}\big) \right\|_2^2 + \left\| G(x^t) - A\big(G(x^{t+1}), -v_x^{t+1}\big) \right\|_2^2 \tag{5}$$

where we align the next frame at $t+1$ by advecting with $-v_x^{t+1}$.
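Written out, Eq. (5) simply compares the current output with its advected temporal neighbors. The sketch below assumes a differentiable advect(field, velocity) helper, such as the advection layer described at the end of this section; it is an illustration, not our actual training code.

```python
import tensorflow as tf

def temporal_l2_loss(g_prev, g_curr, g_next, v_prev, v_next, advect):
    """Forward/backward-advected L2 difference of Eq. (5).
    g_*: generator outputs G(x^{t-1}), G(x^t), G(x^{t+1});
    v_prev, v_next: coarse velocities v_x^{t-1}, v_x^{t+1};
    advect: a differentiable semi-Lagrangian warp, assumed to be given."""
    forward  = g_curr - advect(g_prev, v_prev)    # G(x^t) - A(G(x^{t-1}),  v_x^{t-1})
    backward = g_curr - advect(g_next, -v_next)   # G(x^t) - A(G(x^{t+1}), -v_x^{t+1})
    return tf.reduce_sum(tf.square(forward)) + tf.reduce_sum(tf.square(backward))
```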

While this L2,t-based loss improves temporal coherence, our tests show that its effect is relatively small. E.g., it can improve outlines, but leads to clearly unsatisfactory results, which are best seen in the accompanying video. One side effect of this loss term is that it can easily be minimized by simply reducing the values of G(x). This is visible, e.g., in the second column of Fig. 3, which contains noticeably less density than the other versions and the ground truth. However, we do not want to drive the generator towards darker outputs, but rather make it aware of how the data should change over time.

Instead of manually encoding the allowed temporal changes, we propose to use another discriminator Dt that learns from the given data which changes are admissible. In this way, the original spatial discriminator, which we will denote as Ds(x, G(x)) from now on, guarantees that our generator learns to generate realistic details, while the new temporal discriminator Dt mainly focuses on driving G(x) towards solutions that match the temporal evolution of the ground-truth y.

Fig. 3. A comparison of different approaches for temporal coherence. The top two rows show the inferred densities, while the bottom two rows contain the time derivative of the frame content, computed with a finite difference between frames t and t+1. Positive and negative values are color-coded with red and blue, respectively. From left to right: no temporal loss applied, the L2,t loss applied, LD′t applied, i.e., without advection, LDt applied with advection (our full tempoGAN approach), and the ground-truth y. From left to right across the different versions, the derivatives become less jagged and less noisy, as well as more structured and narrow. This means the temporal coherence is improved.

Specifically, Dt takes three frames as input. We will denote such sets of three frames with a tilde in the following. As real data for the discriminator, the set $\tilde{Y}_A$ contains three consecutive and advected frames, thus $\tilde{Y}_A = \{A(y^{t-1}, v_x^{t-1}),\, y^t,\, A(y^{t+1}, -v_x^{t+1})\}$. The generated data set contains correspondingly advected samples from the generator: $\tilde{G}_A(\tilde{X}) = \{A(G(x^{t-1}), v_x^{t-1}),\, G(x^t),\, A(G(x^{t+1}), -v_x^{t+1})\}$.

Similar to our spatial discriminator Ds, the temporal discriminator Dt is trained as a binary classifier on the two sets of data:

$$L_{D_t}(D_t, G) = \mathbb{E}_m\big[-\log D_t(\tilde{Y}_A)\big] + \mathbb{E}_n\big[-\log\big(1 - D_t(\tilde{G}_A(\tilde{X}))\big)\big] \tag{6}$$

where the set $\tilde{X}$ also contains three consecutive frames, i.e., $\tilde{X} = \{x^{t-1}, x^t, x^{t+1}\}$. Note that unlike the spatial discriminator, Dt is not a conditional discriminator. It does not “see” the conditional input x, and thus Dt is forced to make its judgement purely based on the given sequence.
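In code, the two sets of Eq. (6) are the advected frame triplets, which are passed to Dt before evaluating the usual sigmoid cross entropy. The sketch below stacks the three frames along the channel dimension, which is one possible way to feed the triplet, and re-uses the hypothetical advect helper from above:

```python
import tensorflow as tf

def temporal_discriminator_loss(d_t, y_frames, g_frames, v_prev, v_next, advect):
    """d_t: the temporal discriminator network, returning a raw logit.
    y_frames, g_frames: (previous, current, next) ground-truth and generated frames."""
    real = tf.concat([advect(y_frames[0], v_prev), y_frames[1],
                      advect(y_frames[2], -v_next)], axis=-1)      # ~Y_A
    fake = tf.concat([advect(g_frames[0], v_prev), g_frames[1],
                      advect(g_frames[2], -v_next)], axis=-1)      # ~G_A(~X)
    real_logits, fake_logits = d_t(real), d_t(fake)
    loss_real = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
        labels=tf.ones_like(real_logits), logits=real_logits))
    loss_fake = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
        labels=tf.zeros_like(fake_logits), logits=fake_logits))
    return loss_real + loss_fake
```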

In Fig. 3, we show a comparison of the different loss variants for improving temporal coherence. The first column is generated with only the spatial discriminator, i.e., it provides a baseline for the improvements. The second column shows the result using the L2-based temporal loss L2,t from Eq. (5), while the fourth column shows the result using Dt from Eq. (6). The last column is the ground-truth data y. The first two rows show the generated density fields. While L2,t reduces the overall density content, the result with Dt is clearly closer to the ground truth. The bottom two rows show time derivatives of the densities for frames t and t+1.


Fig. 4. These images highlight data alignment due to advection. Three consecutive frames are encoded as the R, G, B channels of a single image; thus, an ideally fully aligned image would only contain shades of grey. The two rows contain front and top views in the top and bottom row, respectively. We show two examples, a) and b). Each of them contains Y on the left and YA on the right. The RGB channels are the three input frames t−1, t, and t+1. Compared with Y, YA is significantly less saturated, i.e., better aligned.

Again, the result from Dt and the ground-truth y match closely in terms of their time derivatives. The large and jagged values in the first two columns indicate the undesirable temporal changes produced by the regular GAN and the L2,t loss.

In the third column of Fig. 3, we show a simpler variant of our temporal discriminator. Here, we employ the discriminator without aligning the set of inputs with advection operations, i.e.,

$$L_{D_t'}(D_t', G) = \mathbb{E}_m\big[-\log D_t(\tilde{Y})\big] + \mathbb{E}_n\big[-\log\big(1 - D_t(\tilde{G}(\tilde{X}))\big)\big] \tag{7}$$

with $\tilde{Y} = \{y^{t-1}, y^t, y^{t+1}\}$ and $\tilde{G}(\tilde{X}) = \{G(x^{t-1}), G(x^t), G(x^{t+1})\}$.

This version improves results compared to L2,t, but does not reach the level of quality of LDt, as can be seen in Fig. 3. Additionally, we found that LDt often exhibits a faster convergence during the training process. This is an indication that the underlying neural networks have difficulties aligning and comparing the data by themselves when using LD′t. This intuition is illustrated in Fig. 4, where we show example content of the regular data sets Y and the advected version YA side by side. In this figure, the three chronological frames are visualized as red, green, and blue channels of the images. Thus, a pure gray-scale image would mean perfect alignment, while increasing visibility of individual colors indicates un-aligned features in the data. Fig. 4 shows that, although not perfect, the advected version leads to clear improvements in terms of aligning the features of the data sets, despite only using the approximated coarse velocity fields vx. Our experiments show that this alignment successfully improves the backpropagated gradients such that the generator learns to produce more coherent outputs. However, when no flow fields are available, LD′t still represents a better choice than the simpler L2,t version. We see this as another indicator of the power of adversarial training models. It seems to be preferable to let a neural network learn and judge the specifics of a data set, instead of manually specifying metrics, as we have demonstrated for data sets of fluid flow motions above.

It is worth pointing out that our formulation for Dt in Eq. (6) means that the advection step is an inherent part of the generator training process. While vx can be pre-computed, it needs to be applied to the outputs of the generator during training. This in turn means that the advection needs to be tightly integrated into the training loop. The results discussed in the previous paragraph indicate that if this is done correctly, the loss gradients of the temporal discriminator are successfully passed through the advection steps to give the generator feedback such that it can improve its results. In the general case, advection is a non-linear function, the discrete approximation of which we have abbreviated with $A(y^t, v_y^t)$ above. Given a known flow field vy and time step, we can linearize this equation to yield a matrix $M$ such that $M y^t = A(y^t, v_y^t) = y^{t+1}$. E.g., for a first-order approximation, $M$ would encode the Euler-step lookup of source positions and the linear interpolation to compute the solution. While we have found the first-order scheme (i.e., semi-Lagrangian advection) to work well, $M$ could likewise encode higher-order methods for advection, such as MacCormack or WENO schemes.

We have implemented this process as an advection layer in our network training, which computes the advection coefficients and performs the matrix multiplication such that the discriminator receives the correct sets of inputs. When training the generator, the same code is used, and the underlying NN framework can easily compute the necessary derivatives. In this way, the generator actually receives three accumulated and aligned gradients from the three input frames that were passed to Dt.
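For reference, a first-order semi-Lagrangian step in 2D reduces to a backward Euler lookup with bilinear interpolation, i.e., the coefficients of the matrix $M$. The NumPy sketch below only illustrates this computation; in the training pipeline, the same lookup and interpolation are expressed with differentiable framework operations.

```python
import numpy as np

def advect_semi_lagrangian(field, velocity, dt=1.0):
    """First-order (semi-Lagrangian) advection of a 2D scalar field.
    velocity has shape (H, W, 2), storing the components along grid axes 0 and 1."""
    h, w = field.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    # Euler step backwards in time: look up the source position of each cell.
    src_y = np.clip(ys - dt * velocity[..., 0], 0, h - 1)
    src_x = np.clip(xs - dt * velocity[..., 1], 0, w - 1)
    y0, x0 = np.floor(src_y).astype(int), np.floor(src_x).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    fy, fx = src_y - y0, src_x - x0
    # Bilinear interpolation of the four neighboring cells (the coefficients of M).
    top = field[y0, x0] * (1.0 - fx) + field[y0, x1] * fx
    bottom = field[y1, x0] * (1.0 - fx) + field[y1, x1] * fx
    return top * (1.0 - fy) + bottom * fy
```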

3.4 Full Algorithm

While the previous sections have explained the different parts of our final loss function, we summarize and discuss the combined loss in the following section. We will refer to our full algorithm as tempoGAN. The resulting optimization problem that is solved with NN training consists of three coupled non-linear sub-problems: the generator, the conditional spatial discriminator, and the un-conditional temporal discriminator. The generator has to effectively minimize both discriminator losses, additional feature space constraints, and an L1 regularization term. Thus, the loss functions can be summarized as:

$$\begin{aligned} L_{D_t}(D_t, G) &= -\mathbb{E}_m\big[\log D_t(\tilde{Y}_A)\big] - \mathbb{E}_n\big[\log\big(1 - D_t(\tilde{G}_A(\tilde{X}))\big)\big] \\ L_{D_s}(D_s, G) &= -\mathbb{E}_m\big[\log D_s(x, y)\big] - \mathbb{E}_n\big[\log\big(1 - D_s(x, G(x))\big)\big] \\ L_G(D_s, D_t, G) &= -\mathbb{E}_n\big[\log D_s(x, G(x))\big] - \mathbb{E}_n\big[\log D_t(\tilde{G}_A(\tilde{X}))\big] + \mathbb{E}_{n,j}\, \lambda_f^j \big\| F^j(G(x)) - F^j(y) \big\|_2^2 + \lambda_{L_1} \mathbb{E}_n \big\| G(x) - y \big\|_1 \end{aligned} \tag{8}$$
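Putting the pieces together, the generator loss $L_G$ of Eq. (8) combines the two adversarial terms, the feature space loss, and the L1 term. The following TensorFlow 1.x sketch re-uses the feature_space_loss helper sketched in Sec. 3.2; all weights and names are placeholders for illustration.

```python
import tensorflow as tf

def tempogan_generator_loss(ds_fake_logits, dt_fake_logits,
                            features_generated, features_real, layer_weights,
                            generated, target, lambda_l1):
    """ds_fake_logits: D_s(x, G(x)); dt_fake_logits: D_t(~G_A(~X))."""
    adv_spatial = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
        labels=tf.ones_like(ds_fake_logits), logits=ds_fake_logits))
    adv_temporal = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
        labels=tf.ones_like(dt_fake_logits), logits=dt_fake_logits))
    feature = feature_space_loss(features_generated, features_real, layer_weights)
    l1 = lambda_l1 * tf.reduce_mean(tf.abs(generated - target))
    return adv_spatial + adv_temporal + feature + l1
```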

Our generator has to effectively compete against two powerful adversaries, who, along the lines of "the enemy of my enemy is my friend", implicitly cooperate to expose the results of the generator. E.g., we have performed tests without Ds, only using Dt, and the resulting generator outputs were smooth in time, but clearly less detailed than when using both discriminators.

Among the loss terms of the generator, the L1 term has a relatively minor role to stabilize the training by keeping the averaged output close to the target. However, due to the complex optimization problem, it is nonetheless helpful for successful training runs.


Fig. 5. Here an overview of our tempoGAN architecture is shown. The three neural networks (blue boxes) are trained in conjunction. The data flow between them is highlighted by the red and black arrows. The diagram additionally lists the layer types (convolutional, residual, nearest-neighbor interpolation, advection, and fully connected layers) and the feature map sizes of the generator and the two discriminators; the exact parameters are given in Appendix A.

The feature space loss, on the other hand, directly influences the generated features. In the adversarial setting, the discriminator most likely learns distinct features that only arise for the ground truth (positive features), or those that make it easy to identify generated versions, i.e., negative features that are only produced by the generator. Thus, while training, the generator will receive gradients to make it produce more features of the targets from F(y), while the gradients from F(G(x)) will penalize the generation of recognizable negative features.

While positive values for λf reinforce this behavior, it is less clear why negative values can lead to even better results in certain cases. Our explanation for this behavior is that this drives the generator towards distinct features that have to adhere to the positive and negative features detected by the discriminator, as explained above, but at the same time differ from the average features in y. Thus, the generator cannot simply create different or no features, as the discriminator would easily detect this. Instead it needs to develop features that are like the ones present in the outputs y, but don’t correspond to the average features in F(y), which, e.g., leads to the fine detailed outputs shown in Fig. 2.

4 ARCHITECTURE AND TRAINING DATA

While our loss function theoretically works with any realization of G, Ds, and Dt, their specifics naturally have significant impact on performance and the quality of the generated outputs. A variety of network architectures has been proposed for training generative models [Berthelot et al. 2017; Goodfellow et al. 2014; Radford et al. 2016], and in the following, we will focus on purely convolutional networks for the generator, i.e., networks without any fully connected layers. A fully convolutional network has the advantage that the trained network can be applied to inputs of arbitrary sizes later on. We have experimented with a large variety of generator architectures, and while many simpler networks only yielded sub-optimal results, we have achieved high quality results with generators based on the popular U-net [Isola et al. 2017; Ronneberger et al. 2015], as well as with residual networks (res-nets) [Lim et al. 2017]. The U-net concatenates activations from earlier layers to later layers (so-called skip connections) in order to allow the network to combine high- and low-level information, while the res-net processes the data using multiple residual blocks. Each of these residual blocks convolves the inputs without changing their spatial size, and the result of two convolutional layers is added to the original signal as a “residual” correction. In the following, we will focus on the latter architecture, as it gave slightly sharper results in our tests.
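A single residual block of such a generator can be sketched as follows (TensorFlow 1.x; the filter count and kernel size are illustrative only, the exact parameters are listed in Appendix A):

```python
import tensorflow as tf

def residual_block(x, filters):
    """Two convolutions that keep the spatial size, added back onto the input.
    Assumes x already has `filters` channels so that the addition is valid."""
    h = tf.layers.conv3d(x, filters, kernel_size=3, padding='same',
                         activation=tf.nn.relu)
    h = tf.layers.conv3d(h, filters, kernel_size=3, padding='same')
    return tf.nn.relu(x + h)
```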

We found the discriminator architecture to be less crucial. As long as enough non-linearity is introduced over the course of several hidden layers, and there are enough weights, changing the connectivity of the discriminator did not significantly influence the generated outputs. Thus, in the following, we will always use discriminators with four convolutional layers with leaky ReLU activations1, followed by a fully connected layer to output the final score. As suggested by Odena et al. [2016], we use nearest-neighbor interpolation layers as the first two layers in our generator, instead of deconvolutional ones. In our discriminators, the kernel size is also divisible by its stride. In practice, these settings help to remove the common checkerboard artifacts, and improve the final results. An overview of the architecture of our neural networks is shown in Fig. 5, while details can be found in Appendix A.
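A corresponding discriminator sketch with four strided convolutions, leaky ReLU activations, and a final fully connected layer is given below; the filter counts and strides are again only illustrative, see Appendix A for the actual values.

```python
import tensorflow as tf

def discriminator(inp, name='disc', reuse=False):
    """Four convolutional layers with leaky ReLU, followed by one fully
    connected layer that outputs a single raw score (logit)."""
    with tf.variable_scope(name, reuse=reuse):
        h = inp
        for filters in (32, 64, 128, 256):
            # kernel size divisible by the stride, to avoid checkerboard artifacts
            h = tf.layers.conv3d(h, filters, kernel_size=4, strides=2, padding='same')
            h = tf.nn.leaky_relu(h, alpha=0.2)
        h = tf.layers.flatten(h)
        return tf.layers.dense(h, 1)
```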

4.1 Data Generation and Training

We use a randomized smoke simulation setup to generate the desired number of training samples as a pre-computation step. We typically generate around 20 simulations with 120 frames of output per simulation. For each of these, we randomly initialize a certain number of smoke inflow regions, another set of velocity inflows, and a randomized buoyancy force. As inputs x, we use a down-sampled version of the simulation data sets, typically by a factor of 4, while the full resolution data is used as ground truth y. To prevent a large number of primarily empty samples, we discard inputs with an average smoke density of less than 0.02. Details of the parametrization can be found in Appendix B, and visualizations of the training data sets can be found in the supplemental video.
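The per-frame pre-processing can be summarized by the following NumPy sketch; the use of an averaging filter for the 4× down-sampling is an assumption for illustration, and the shapes are given for 3D data.

```python
import numpy as np

def make_training_pair(density_hr, velocity_hr, factor=4, min_density=0.02):
    """Down-sample the high-resolution fields to obtain the input x, keep the
    original density as ground truth y, and discard mostly empty samples."""
    def downsample(field):
        d, h, w = field.shape[:3]
        shape = (d // factor, factor, h // factor, factor, w // factor, factor)
        return field.reshape(shape + field.shape[3:]).mean(axis=(1, 3, 5))

    density_lr = downsample(density_hr)
    if density_lr.mean() < min_density:
        return None                      # primarily empty, skip this sample
    velocity_lr = downsample(velocity_hr)
    return density_lr, velocity_lr, density_hr
```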

1 With a leaky tangent of 0.2 for the negative half space.


Fig. 6. An illustration of different training results after 40k iterations with different input fields: a) ρ, b) ρ + v, c) ρ + v + w, all with similar network sizes. Version d) with only ρ has 2x the number of weights. The seams in the images show the size of the training patches. Supplemental physical fields lead to clear improvements in b) and c) that even additional weights cannot compensate for.

In addition, we show examples generated from a two-dimensional rising smoke simulation with a different simulation setup than the one used for generating the training data. It is, e.g., used in Fig. 2.

We use the same modalities for all training runs: we employ the commonly used ADAM optimizer2 with an initial learning rate of 2·10−4 that decays to 1/20th for the second half of the training iterations. The number of training iterations is typically on the order of 10k. Full details are given in Appendix B. We use 20% of the data for testing and the remaining 80% for training. Our networks did not require any additional regularization such as dropout or weight decay. For training and running the trained networks, we use Nvidia GeForce GTX 1080 Ti GPUs (each with 11 GB RAM) and Intel Core i7-6850K CPUs, while we used the tensorflow and mantaflow software frameworks for the deep learning and fluid simulation implementations, respectively.

2 Parameterized with β = 0.5.
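The optimizer setup described above can be written as follows, assuming TensorFlow 1.x; total_steps stands for the overall number of training iterations and is only a placeholder here.

```python
import tensorflow as tf

total_steps = 10000                      # on the order of 10k iterations
global_step = tf.Variable(0, trainable=False, dtype=tf.int32)
# Initial learning rate 2e-4, decayed to 1/20th for the second half of training;
# ADAM is parameterized with beta1 = 0.5.
learning_rate = tf.train.piecewise_constant(
    global_step, boundaries=[total_steps // 2], values=[2e-4, 2e-4 / 20.0])
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate, beta1=0.5)
```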

4.2 Input Fields

At first sight, it might seem redundant and unnecessary to input the flow velocity v and vorticity w in addition to the density ρ. After all, we are only interested in the final output density, and many works on GANs exist that demonstrate that detailed images can be learned purely based on image content.

However, over the course of numerous training runs, we noticed that giving the networks additional information about the underlying physics significantly improves convergence and quality of the inferred results. An example is shown in Fig. 6. Here, we show how the training evolves for three networks with identical size, structure, and parameters, the only difference being the input fields. From left to right, the networks receive (ρ), (ρ, v), and (ρ, v, w). Note that these fields are only given to the generator, while the discriminator always only receives (ρ) as input. The version with only density passed to the generator, G(ρ), fails to reconstruct smooth and detailed outputs. Even after 40000 iterations, the results exhibit strong grid artifacts and lack detailed structures. In contrast, both versions with additional inputs start to yield higher quality outputs earlier during training. While adding v is crucial, the addition of w only yields subtle improvements (most apparent at the top of the images in Fig. 6), which is why we will use (ρ, v) to generate our final results below. The full training run comparison is in our supplemental video.

We believe that the insight that auxiliary fields help to improve training and inference quality is a surprising and important one. The networks do not get any explicit guidance on how to use the additional information. However, the network clearly not only learns to use this information, but also benefits from having this supporting information about the underlying physics processes. While larger networks can potentially alleviate the quality problems of the density-only version, as illustrated in Fig. 6 d), we believe it is highly preferable to instead construct and train smaller, physics-aware networks. This not only shortens training times and accelerates convergence, but also makes evaluating the trained model more efficient in the long run. The availability of physical inputs turned out to be a crucial addition in order to successfully realize high-dimensional GAN outputs for space-time data, which we will demonstrate in Sec. 5.

4.3 Augmenting Physical Data

Data augmentation turned out to be an important component of our pipeline due to the high dimensionality of our data sets and the large amount of memory they require. Without sufficient training data, the adversarial training yields undesirable results due to overfitting. While data augmentation is common practice for natural images [Dosovitskiy et al. 2016], we describe several aspects below that play a role for physical data sets.

The augmentation process allows us to train networks having millions of weights with data sets that only contain a few hundred samples without overfitting. At the same time, we can ensure that the trained networks respect the invariants of the underlying physical problems, which is crucial for the complex space-time data sets of flow fields that we are considering. E.g., we know from theory that solutions obey Galilean invariance, and we can make sure our networks are aware of this property not by providing large data sets, but instead by generating data with different inertial frames on the fly while training.

In order to minimize the necessary size of the training set without deteriorating the result quality, we generate modified data sets at training time. We focus on spatial transformations, which take the form of x(p) = x(Ap), where p is a spatial position, and A denotes a 4 × 4 matrix. For applying augmentation, we distinguish three types of components of a data set:

• passive: these components can be transformed in a straightforward manner as described above. An example of passive components are the advected smoke fields ρ, shown in many of our examples.

• directional: the content of these components needs to be transformed in conjunction with the augmentation. A good example is velocities, whose directions need to be adjusted for rotations and flippings, i.e., $v(p) = A_{3\times 3}\, v(Ap)$, where $A_{3\times 3}$ is the upper left 3 × 3 matrix of A.


• derived: finally, derived components would be invalid after applying augmentation, and thus need to be re-computed from other components after augmentation. A good example are physical quantities such as vorticity, which contain mixed derivatives that cannot be transformed with a simple matrix multiplication.

Fig. 7. An identical GAN network trained with the same set of input data. While version a) did not use data augmentation, leading to blurry results with streak-like artifacts, version b), with data augmentation, produced sharp and detailed outputs.

If the data set contains quantities that cannot be computed from other augmented fields, this unfortunately means that augmentation cannot be applied easily. However, we believe that a large class of typical physics data sets can in practice be augmented as described here.

For matrix A, we consider affine transformation matrices that contain combinations of randomized translations, uniform scaling, reflections, and rotations. Here, only those transformations are allowed that do not violate the physical model for the data set. While shearing and non-uniform scaling could easily be added, they violate divergence-freeness and thus should not be used for flow data. We have used values in the range [0.85, 1.15] for scaling, and rotations by [−90, 90] degrees. We typically do not load derived components into memory for training, as they are re-computed after augmentation. Thus, they are computed on the fly for a training batch and discarded afterwards.
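As an illustration, one randomized 2D augmentation step with the parameter ranges above could look as follows. This NumPy/SciPy sketch omits translations, the 3D case, and the boundary handling discussed below, and applies the directional rule $v(p) = A_{3\times 3}\, v(Ap)$ to the velocity; all names are illustrative.

```python
import numpy as np
from scipy.ndimage import affine_transform

def augment_sample_2d(density, velocity, rng=np.random):
    """density: (H, W) passive field; velocity: (H, W, 2) directional field with
    components along grid axes 0 and 1. Derived fields (e.g., vorticity) would be
    re-computed from the augmented velocity afterwards."""
    angle = np.deg2rad(rng.uniform(-90.0, 90.0))
    scale = rng.uniform(0.85, 1.15)
    c, s = np.cos(angle), np.sin(angle)
    A = scale * np.array([[c, -s], [s, c]])
    if rng.rand() < 0.5:                      # random reflection
        A[:, 0] *= -1.0
    # Passive part: resample the fields at the transformed positions A p,
    # rotating/scaling/reflecting about the grid center.
    center = (np.array(density.shape, dtype=float) - 1.0) / 2.0
    offset = center - A.dot(center)
    density_aug = affine_transform(density, A, offset=offset, order=1)
    velocity_aug = np.stack(
        [affine_transform(velocity[..., i], A, offset=offset, order=1)
         for i in range(2)], axis=-1)
    # Directional part: additionally transform the velocity vectors themselves.
    velocity_aug = velocity_aug @ A.T
    return density_aug, velocity_aug
```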

The outputs of our simulations typically have a significantly larger size than the input tiles that our networks receive. This gives us many choices for the tile offsets, which we use to train the networks for shift invariance. This also aligns with our goal to train a network that will later on work for arbitrarily sized inputs. We found it important to take special care at the spatial boundaries of the tiles. While the data could be extended by Dirichlet or periodic boundary conditions, it is important that the data set boundaries after augmentation do not lie outside the original data set. We enforce this by choosing suitable translations after applying the other transformations. This ensures that all data sets contain only valid content, and the network does not learn from potentially unphysical or unrepresentative data near boundaries. We also do not augment the time axis in the same way as the spatial axes. We found that the spatial transformations above, applied to the velocity fields, give enough variance in terms of temporal changes.

An example of the huge difference that data augmentation can make is shown in Fig. 7. Here we compare two runs with the same amount of training data (160 frames of data), one with and the other one without data augmentation. While training a GAN directly with this data produces blurry results, the network converges to a final state with significantly sharper results with data augmentation. The possibility to successfully train networks with only a small amount of training data is what makes it possible to train networks for 3D+time data, as we will demonstrate in Sec. 5.

5 RESULTS AND APPLICATIONS

In the following, we will apply our method discussed so far to different data sets, and explore different application settings. Among others, we will discuss related topics such as art direction, training convergence, and performance.3

5.1 3D Results

We have primarily used the 2D rising plume example in the previous sections to ensure the different variants can be compared easily. In Fig. 8, we demonstrate that these results directly extend to 3D. We apply our method to a three-dimensional plume with resolution $64^3$, and the $256^3$ output produced by our tempoGAN exhibits small scale features that are at least as detailed as the ground truth reference. The temporal coherence is especially important in this setting, which is best seen in the accompanying video.

We also apply our trained 3D model to two different inputs with higher resolutions. In both cases, we use a regular simulation augmented with additional turbulence to generate an interesting set of inputs for our method. A first scene with a resolution of 150×100×100 is shown in Fig. 9, where we generate a 600×400×400 output with our method. The output closely resembles the input volumes, but exhibits a large number of fine details.

Our method also has no problems with obstacles in the flow, as shown in Fig. 10. This example has resolutions of 256×180×180 and 1024×720×720 for the input and output volumes. The small-scale features closely adhere to the input flow around the obstacle. Although the obstacle is completely filled with densities towards the end of the simulation, there are no leaking artifacts, as our method is applied independently to each input volume in the sequence. When showing the low-resolution input, we always employ cubic up-sampling, in order to not make the input look unnecessarily bad.

5.2 Artistic Controls

Art direction of simulation results is an important topic for computer graphics applications. While we primarily rely on traditional guiding techniques to control the low-resolution input, our method offers different ways to adjust the details produced by our tempoGAN algorithm. A first control knob for changing the style of the outputs is to modify the data fields of the conditional inputs. As described in Sec. 4.2, our generator receives the velocity in addition to the density, and it internally builds tight relationships between the two. We can use these entangled inputs to control the features produced in the outputs.

3 We will make code and trained models available upon acceptance of our work.

Fig. 8. These images show our algorithm applied to a 3D volume. From left to right: a) a coarse input volume (rendered with cubic up-sampling), b) our result, and c) the high-resolution reference. As in 2D, our trained model generates sharp features and detailed sheets that are at least on par with the reference.

Fig. 9. We apply our algorithm to a horizontal jet of smoke in this example. The inset shows the coarse input volume (rendered with cubic up-sampling), and the result of our algorithm. The diffuse streaks caused by procedural turbulence in the input (esp. near the inflow) are turned into detailed wisps of smoke by our algorithm.

Fig. 11 shows the unmodified input alongside several modified velocity examples and the resulting density configurations.

In addition, Fig. 12 demonstrates that we can effectively suppress the generation of small-scale details by setting all velocities to zero. Thus, the network learns a correlation between velocity magnitudes and the amount of generated features. This is another indicator that the network learns to extract meaningful relationships from the data, as we expect turbulence and small-scale details to primarily form in regions with large velocities. Three-dimensional data can be controlled similarly, as illustrated in Fig. 13.
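To illustrate this kind of control, the sketch below shows two possible modifications of the conditional velocity input (the function names and the specific swirl pattern are our own examples; the paper only states that procedural functions are applied to the velocity before it is passed to the generator):

```python
import numpy as np

def zero_velocity(vel):
    """Suppress small-scale detail generation by removing velocity information."""
    return np.zeros_like(vel)

def add_swirl(vel, strength=0.5, freq=4.0):
    """Overlay a simple procedural swirl on a 2D velocity field of shape [H, W, 2].

    This is only one example of a procedural modification; any function of the
    spatial coordinates can be used to stylize the generated details.
    """
    h, w, _ = vel.shape
    y, x = np.meshgrid(np.linspace(0, 1, h), np.linspace(0, 1, w), indexing="ij")
    swirl = np.stack([np.sin(2 * np.pi * freq * y),
                      np.cos(2 * np.pi * freq * x)], axis=-1)
    return vel + strength * swirl

# The modified velocity simply replaces the original conditional input, e.g.:
#   cond = np.concatenate([density[..., None], add_swirl(velocity)], axis=-1)
#   detail = generator(cond[None, ...])
```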

In Sec. 3.2, we discussed the influence of the λ_f parameter on small-scale features. For situations where we might not have additional channels such as the velocity above, we can use λ_f to globally let the network generate different features.

Fig. 10. Our algorithm generates a high-resolution volume around an obstacle with a final resolution of 1024×720×720. The inset shows the input volume. This scene is also shown in Fig. 1 with a different visualization.

However, as this only provides a uniform change that is encoded in the trained network, the resulting differences are more subtle than those from the velocity modifications above. Examples of different 2D and 3D outputs can be found in Fig. 14 and Fig. 15, respectively.

5.3 Additional Variants

In order to verify that our network does not only work with two- or three-dimensional data from a Navier-Stokes solver, we generated a more synthetic data set by applying strong wavelet turbulence to a 4× up-sampled input flow. We then trained our network with down-sampled inputs, i.e., giving it the task to learn the output of the wavelet turbulence algorithm. Note that a key difference here is that wavelet turbulence normally requires a full high-resolution advection over time, while our method infers high-resolution data sets purely from low-resolution data of a single frame.
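A minimal sketch of how such training pairs can be assembled is shown below; the exact down-sampling filter is our assumption (simple 4× average pooling), while the high-resolution frames are the wavelet turbulence outputs.

```python
import numpy as np

def downsample4x(field):
    """Average-pool a 2D high-resolution field by a factor of 4.

    The low-resolution inputs are derived from the wavelet turbulence output,
    so the network learns to re-synthesize those structures from a single
    coarse frame.
    """
    h, w = field.shape
    return field.reshape(h // 4, 4, w // 4, 4).mean(axis=(1, 3))

# training pair: (down-sampled input, wavelet turbulence target)
hires_wt = np.random.rand(256, 256).astype(np.float32)   # stand-in for a WT frame
lores_in = downsample4x(hires_wt)
```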

Our network successfully learns to generate structures similar to the wavelet turbulence outputs, as shown in Fig. 16. However, this data set turned out to be more difficult to learn than the original fluid simulation inputs.

Fig. 11. The red&green images on the left of each pair represent the modified velocity inputs, while the corresponding result is shown on the right. For reference, pair a) shows the unmodified input velocity and the regular output of our algorithm.

Fig. 12. a) is a regular tempoGAN output, while for b), the velocity was artificially set to zero. The network has learned a relationship between detail and velocities, leading to reduced details when no velocities are present.

The training runs required twice as much training data as the regular simulation runs, and we used a feature loss weight of λ_f^{1,...,4} = 10^{-5}. We assume that these more difficult training conditions are caused by the more chaotic nature of the procedural turbulence, and by the weaker correlations between the density and velocity inputs. Note that despite using more wavelet turbulence input data, it is still a comparatively small data set.

We were additionally curious how well our network performs when it is applied to a generated output, i.e., a recursive application. The result can be found in Fig. 17, where we applied our network to its own output for an additional 2× up-sampling, leading to an 8× increase of resolution in total.

Fig. 13. a) is the result of tempoGAN with the velocity set to zero. The other three examples were generated with modified velocity inputs to achieve more stylized outputs.

Fig. 14. A comparison of training runs with different feature loss weights: a) λ_f^{1,...,4} = -10^{-5}, b) λ_f^{1,4} = 1/3·10^{-4}, λ_f^{2,3} = -1/3·10^{-4}.

Fig. 15. A comparison of training runs with different feature loss weights in 3D: a) with λ_f^{1,...,4} = -1/3·10^{-6}, b) with λ_f^{1} = 1/3·10^{-6}, λ_f^{2,3,4} = -1/3·10^{-6}. The latter yields a sharpened result. Image c) shows the high-resolution reference.

While the output is plausible, and clearly contains even more fine features such as thin sheets, there is a tendency to amplify features generated during the first application.

Fig. 16. Our regular model a) and one trained with wavelet turbulence data b). In contrast to the model trained with real simulation data, the wavelet turbulence model produces flat regions with sharper swirls, mimicking the input data.

Fig. 17. a) is the network output after a single application. b) is the network recursively applied to a) with a scaling factor of 2, resulting in a total increase of 8×.

5.4 Training Progress

With the training settings given in Appendix B, our training runs typically converged to stable solutions of around 1/2 for the discriminator outputs after sigmoid activation. While this by itself does not guarantee that a desirable solution was found, it at least indicates convergence towards one of the available local minima.

However, it is interesting to see how the discriminator loss changes in the presence of the temporal discriminator. Fig. 18 shows several graphs of discriminator losses over the course of a full training run. Note that we show the final loss outputs from Eq. (1) and Eq. (6) here. A large value means the discriminator does "worse", i.e., it has more difficulty distinguishing real samples from generated ones. Correspondingly, lower values mean it can separate them more successfully. In Fig. 18a), it is visible that the spatial discriminator loss decreases when the temporal discriminator is introduced. Here the graph only shows the spatial discriminator loss, and the discriminator itself is unchanged when the second discriminator is introduced. Our interpretation of the lower loss is that the existence of a temporal discriminator in the optimization prevents the generator from using the part of the solution space with detailed but flickering outputs. Hence, it has to find a solution among the temporally coherent ones, and as a consequence has a harder time, which in turn makes the job easier for the spatial discriminator.

Conversely, the existence of a spatial discriminator does not noticeably influence the temporal discriminator, as shown in Fig. 18b).

This is also intuitive, as the spatial discriminator does not influence temporal changes. In conjunction, both graphs indicate that the two discriminators successfully influence different aspects of the solution space, as intended. Lastly, Fig. 18c) shows that activating the negative feature loss from Sec. 3.2 makes the task for the generator slightly harder, resulting in a lowered spatial discriminator loss.

5.5 Performance

Training our two- and three-dimensional models is relatively expensive. Our full 2D runs typically take around 14 hours to complete (1 GPU), while the 3D runs took ca. 9 days using two GPUs. However, in practice, the state of the model after a quarter of this time is already indicative of the final performance. The remainder of the time is typically spent fine-tuning the network.

When using our trained network to generate high-resolution outputs in 3D, the limited memory of current GPUs poses a constraint on the volumes that can be processed at once, as the intermediate layers with their feature maps can take up significant amounts of memory. However, this does not pose a problem for generating larger final volumes, as we can subdivide the input volumes and process them piece by piece. We generate tiles with a size of 136³ on one GPU, with a corresponding input of size 34³. Our eight convolutional layers, with a receptive field of 16 cells, mean that up to four cells of an input tile could be influenced by its boundary. In practice, we found an overlap of 3 input cells to be sufficient. Generating a single 136³ output took ca. 2.2 seconds on average. Thus, generating a 256³ volume from a 64³ input took 17.9s on average. Note that this cost includes the full GPU scheduling and data transfer overhead (measured before and after the tensorflow python call). I.e., for a pipeline where the data would already reside on the GPU, and where the high-resolution data could be visualized directly, the runtime of our model could be significantly lower.
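As an illustration of this tile-based evaluation, here is a simplified NumPy sketch for cubic tiles; the callable generator stands in for the trained network, and the exact overlap and cropping scheme is our assumption based on the numbers above, not the authors' implementation.

```python
import numpy as np

def apply_tiled(generator, low, tile=34, overlap=3, factor=4):
    """Evaluate a 4x super-resolution generator on a large volume tile by tile.

    generator: callable mapping an input tile of shape [t, t, t, C] to a
               high-resolution density tile of shape [factor*t]*3.
    low:       low-resolution conditional input of shape [D, H, W, C].
    Neighbouring input tiles overlap by at least `overlap` cells; from every
    tile except the first along an axis, the first `overlap` cells (scaled by
    `factor`) are discarded, since they are the ones most affected by the
    tile boundary.
    """
    dims = low.shape[:3]
    out = np.zeros(tuple(factor * n for n in dims), dtype=np.float32)
    stride = tile - overlap

    def starts(n):
        s = list(range(0, n - tile + 1, stride))
        if s[-1] != n - tile:           # last tile must touch the volume border
            s.append(n - tile)
        return s

    for z in starts(dims[0]):
        for y in starts(dims[1]):
            for x in starts(dims[2]):
                hi = generator(low[z:z + tile, y:y + tile, x:x + tile, :])
                mz = 0 if z == 0 else overlap   # margins to drop at tile seams
                my = 0 if y == 0 else overlap
                mx = 0 if x == 0 else overlap
                out[factor * (z + mz):factor * (z + tile),
                    factor * (y + my):factor * (y + tile),
                    factor * (x + mx):factor * (x + tile)] = \
                    hi[factor * mz:, factor * my:, factor * mx:]
    return out
```

With tile = 34, overlap = 3 and a 64³ input, this scheme produces two tiles per axis, i.e., the eight 136³ output tiles mentioned above.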

This cost scales linearly with the number of cells in the volume, and, in contrast to all previous methods for increasing the resolution of flow simulations, our method does not require any additional tracking information. It is also fully independent for all frames. Thus, our method could ideally be applied on the fly before rendering a volume, after which the high-resolution data could be discarded.

5.6 Limitations

One limitation of our approach is that the network encodes a fixed resolution difference for the generated details. While the initial up-sampling layers can be stripped, and the network could thus be applied to inputs of any size, it will be interesting to explore different up-sampling factors beyond the factor of four which we have used throughout. Our networks have so far also focused on buoyant smoke clouds. While obstacle interactions worked in our tests, we assume that networks trained on larger data sets and with other types of interactions could yield even better results.

Our three-dimensional networks needed a long time to train, circa nine days for our final model.

[Fig. 18, described below, plots the discriminator loss L_D over 40,000 training steps in three panels a) to c). Legends: a) G trained with Ds, L1, and Lf vs. G trained with Ds, L1, Lf, and Dt; b) G trained with Dt and L1 vs. G trained with Dt, Ds, and L1; c) G trained with Ds and L1 vs. G trained with Ds, L1, and Lf.]

Fig. 18. Several discriminator loss functions over the course of the 40k training iterations. a) The spatial discriminator (Ds) loss is shown in green without Dt, and in orange with Dt. b) The temporal discriminator loss, in blue with only Dt, and in red for tempoGAN (i.e., with Ds and the feature loss). c) The spatial discriminator loss is shown in green with Lf, and in dark blue without. For each graph, the dark lines show smoothed curves; the full data is shown in a lighter color in the background.

Luckily, this is a one-time cost, and the network can be flexibly reused afterwards. However, if the synthesized small-scale features need to be fine-tuned, which we luckily did not find necessary for our work, the long runtimes could make this a difficult process. The feature loss weights are clearly also data dependent, e.g., we used different settings for simulation and wavelet turbulence data. Here, it will be an interesting direction for future work to give the network additional inputs for fine-tuning the results, beyond the velocity modifications which we discussed in Sec. 5.2.

6 CONCLUSIONS

We have realized a first conditional GAN approach for four-dimensional data sets, and we have demonstrated that it is possible to train generators that preserve temporal coherence using our novel time discriminator. The network architecture of this temporal discriminator, which ensures that the generator receives gradient information even for complex transport processes, makes it possible to robustly train networks for temporal evolutions. We have shown that this discriminator improves the generation of stable details as well as the learning process itself. At the same time, our fully convolutional networks can be applied to inputs of arbitrary size, and our approach provides basic means for art direction of the generated outputs. We also found it very promising to see that our CNNs are able to benefit from coherent, physical information even in complex 3D settings, which led to reduced network sizes.

Overall, we believe that our contributions yield a robust and very general method for generative models of physics problems, and for super-resolution flows in particular. It will be highly interesting as future work to apply our tempoGAN to other physical problem settings, or even to non-physical data such as video streams.

REFERENCES

Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein GAN. arXiv:1701.07875 (2017).

Steve Bako, Thijs Vogels, Brian McWilliams, Mark Meyer, Jan Novák, Alex Harvill, Pradeep Sen, Tony DeRose, and Fabrice Rousselle. 2017. Kernel-predicting convolutional networks for denoising Monte Carlo renderings. ACM Transactions on Graphics (TOG) 36, 4 (2017), 97.

Christopher Batty, Florence Bertails, and Robert Bridson. 2007. A fast variational framework for accurate solid-fluid coupling. ACM Trans. Graph. 26, 3, Article 100 (July 2007). https://doi.org/10.1145/1276377.1276502

David Berthelot, Tom Schumm, and Luke Metz. 2017. BEGAN: Boundary equilibrium generative adversarial networks. arXiv:1703.10717 (2017).

Prateep Bhattacharjee and Sukhendu Das. 2017. Temporal Coherency based Criteria for Predicting Video Frames using Deep Multi-stage Generative Adversarial Networks. In Advances in Neural Information Processing Systems. 4271–4280.

Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA.

Chakravarty Alla Chaitanya, Anton Kaplanyan, Christoph Schied, Marco Salvi, Aaron Lefohn, Derek Nowrouzezahrai, and Timo Aila. 2017. Interactive reconstruction of Monte Carlo image sequences using a recurrent denoising autoencoder. ACM Transactions on Graphics (TOG) 36, 4 (2017), 98.

Dongdong Chen, Jing Liao, Lu Yuan, Nenghai Yu, and Gang Hua. 2017. Coherent Online Video Style Transfer. In The IEEE International Conference on Computer Vision (ICCV).

Mengyu Chu and Nils Thuerey. 2017. Data-Driven Synthesis of Smoke Flows with CNN-based Feature Descriptors. ACM Trans. Graph. 36(4), 69 (2017).

Emmanuel de Bezenac, Arthur Pajot, and Patrick Gallinari. 2017. Deep Learning for Physical Processes: Incorporating Prior Scientific Knowledge. arXiv preprint arXiv:1711.07970 (2017).

Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. 2016. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 38, 2 (2016), 295–307.

Alexey Dosovitskiy and Thomas Brox. 2016. Generating images with perceptual similarity metrics based on deep networks. In Advances in Neural Information Processing Systems. 658–666.

Alexey Dosovitskiy, Philipp Fischer, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. 2016. Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE Trans. Pattern Analysis and Mach. Int. 38, 9 (2016), 1734–1747.

Amir Barati Farimani, Joseph Gomes, and Vijay S Pande. 2017. Deep Learning the Physics of Transport Phenomena. arXiv:1709.02432 (2017).

John Flynn, Ivan Neulander, James Philbin, and Noah Snavely. 2016. DeepStereo: Learning to predict new views from the world's imagery. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5515–5524.

Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proc. of IEEE Comp. Vision and Pattern Rec. IEEE, 580–587.

Ian Goodfellow. 2016. NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160 (2016).

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.

Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. stat 1050 (2014), 10.

Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. 2017. Image-to-image translation with conditional adversarial networks. Proc. of IEEE Comp. Vision and Pattern Rec. (2017).

Justin Johnson, Alexandre Alahi, and Li Fei-Fei. 2016. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision. Springer, 694–711.

Simon Kallweit, Thomas Müller, Brian McWilliams, Markus Gross, and Jan Novák. 2017. Deep Scattering: Rendering Atmospheric Clouds with Radiance-Predicting Neural Networks. arXiv:1709.05418 (2017).

Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2017. Progressive growing of GANs for improved quality, stability, and variation. arXiv:1710.10196 (2017).

Ladislav Kavan, Dan Gerszewski, Adam W. Bargteil, and Peter-Pike Sloan. 2011. Physics-inspired upsampling for cloth simulation in games. In ACM Transactions on Graphics (TOG), Vol. 30. ACM, 93.

Byungmoon Kim, Yingjie Liu, Ignacio Llamas, and Jarek Rossignac. 2005. FlowFixer: Using BFECC for Fluid Simulation. In Proceedings of the First Eurographics Conference on Natural Phenomena. 51–56.

Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. 2016. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1646–1654.

Theodore Kim, Nils Thuerey, Doug James, and Markus Gross. 2008. Wavelet Turbulence for Fluid Simulation. ACM Trans. Graph. 27 (3) (2008), 50:1–6.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. NIPS, 1097–1105.

Lubor Ladicky, SoHyeon Jeong, Barbara Solenthaler, Marc Pollefeys, and Markus Gross. 2015. Data-driven fluid simulations using regression forests. ACM Trans. Graph. 34, 6 (2015), 199.

Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. 2016. Photo-realistic single image super-resolution using a generative adversarial network. arXiv:1609.04802 (2016).

Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. 2017. Enhanced deep residual networks for single image super-resolution. In Proc. of IEEE Comp. Vision and Pattern Rec., Vol. 1. 3.

Ding Liu, Zhaowen Wang, Yuchen Fan, Xianming Liu, Zhangyang Wang, Shiyu Chang, and Thomas Huang. 2017. Robust Video Super-Resolution With Learned Temporal Dynamics. In The IEEE International Conference on Computer Vision (ICCV).

Zichao Long, Yiping Lu, Xianzhong Ma, and Bin Dong. 2017. PDE-Net: Learning PDEs from Data. arXiv:1710.09668 (2017).

Fujun Luan, Sylvain Paris, Eli Shechtman, and Kavita Bala. 2017. Deep Photo Style Transfer. arXiv preprint arXiv:1703.07511 (2017).

Michael Mathieu, Camille Couprie, and Yann LeCun. 2015. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440 (2015).

Antoine McNamara, Adrien Treuille, Zoran Popović, and Jos Stam. 2004. Fluid Control Using the Adjoint Method. ACM Trans. Graph. 23, 3 (2004), 449–456.

Mehdi Mirza and Simon Osindero. 2014. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014).

Lukas Mosser, Olivier Dubrule, and Martin J Blunt. 2017. Reconstruction of three-dimensional porous media using generative adversarial neural networks. arXiv:1704.03225 (2017).

Rahul Narain, Jason Sewall, Mark Carlson, and Ming C. Lin. 2008. Fast Animation of Turbulence Using Energy Transport and Procedural Synthesis. ACM Trans. Graph. 27, 5 (2008), article 166.

Augustus Odena, Vincent Dumoulin, and Chris Olah. 2016. Deconvolution and Checkerboard Artifacts. Distill (2016). https://doi.org/10.23915/distill.00003

Zherong Pan, Jin Huang, Yiying Tong, Changxi Zheng, and Hujun Bao. 2013. Interactive Localized Liquid Motion Editing. ACM Trans. Graph. 32, 6 (Nov. 2013).

Xue Bin Peng, Glen Berseth, KangKang Yin, and Michiel Van De Panne. 2017. DeepLoco: Dynamic locomotion skills using hierarchical deep reinforcement learning. ACM Trans. Graph. 36, 4 (2017), 41.

Lukas Prantl, Boris Bonev, and Nils Thuerey. 2017. Pre-computed Liquid Spaces with Generative Neural Networks and Optical Flow. arXiv:1704.07854 (2017).

Alec Radford, Luke Metz, and Soumith Chintala. 2016. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. Proc. ICLR (2016).

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 234–241.

Manuel Ruder, Alexey Dosovitskiy, and Thomas Brox. 2016. Artistic Style Transfer for Videos. In Pattern Recognition - 38th German Conference, GCPR 2016, Hannover, Germany, September 12-15, 2016, Proceedings. 26–36. https://doi.org/10.1007/978-3-319-45886-1_3

Masaki Saito, Eiichi Matsumoto, and Shunta Saito. 2017. Temporal generative adversarial nets with singular value clipping. In IEEE International Conference on Computer Vision (ICCV). 2830–2839.

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. 2016. Improved techniques for training GANs. In Advances in Neural Information Processing Systems. 2234–2242.

Andrew Selle, Ronald Fedkiw, Byungmoon Kim, Yingjie Liu, and Jarek Rossignac. 2008. An Unconditionally Stable MacCormack Method. J. Sci. Comput. 35, 2-3 (June 2008), 350–371.

Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 (2014).

Jos Stam. 1999. Stable Fluids. In Proc. ACM SIGGRAPH. ACM, 121–128.

Yun Teng, David IW Levin, and Theodore Kim. 2016. Eulerian solid-fluid coupling. ACM Trans. Graph. 35, 6 (2016), 200.

Jonathan Tompson, Kristofer Schlachter, Pablo Sprechmann, and Ken Perlin. 2016. Accelerating Eulerian Fluid Simulation With Convolutional Networks. arXiv:1607.03597 (2016).

Kiwon Um, Xiangyu Hu, and Nils Thuerey. 2017. Splash Modeling with Neural Networks. arXiv:1704.04456 (2017).

Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. In AAAI. 2852–2858.

Hang Zhao, Orazio Gallo, Iuri Frosio, and Jan Kautz. 2015. Loss Functions for Neural Networks for Image Processing. arXiv preprint arXiv:1511.08861 (2015).

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv:1703.10593 (2017).

A DETAILS OF NN ARCHITECTURES

To clearly specify our networks, we use the following notation. Let in(resolution, channels) and out(resolution, channels) denote the input and output information; NI(output-resolution) represents nearest-neighbor interpolation; C(output-resolution, filter size, output-channels) denotes a convolutional layer. Our resolutions and filter sizes are the same for every spatial dimension, for both 2D and 3D. Resolutions of feature maps are reduced when strides are >1. We use RB to represent our residual blocks, and CS for adding the residuals within an RB. E.g., RB3: [CA, ReLU, CB] + [CS], ReLU means [(input → CA → ReLU → CB) + (input → CS)] → ReLU, where + denotes element-wise addition. BN denotes batch normalization, which is not used in the last layer of G, the first layer of Dt, and the first layer of Ds [Radford et al. 2016]. In addition, | denotes concatenation of layer outputs along the channel dimension.

Architectures of G, Ds, and Dt:

G:
in(16, 4)
NI(64)
RB0: [CA(64, 5, 8), BN, ReLU, CB(64, 5, 32), BN] + [CS(64, 1, 32), BN], ReLU
RB1: [CA(64, 5, 128), BN, ReLU, CB(64, 5, 128), BN] + [CS(64, 1, 128), BN], ReLU
RB2: [CA(64, 5, 32), BN, ReLU, CB(64, 5, 8), BN] + [CS(64, 1, 8), BN], ReLU
RB3: [CA(64, 5, 2), ReLU, CB(64, 5, 1)] + [CS(64, 1, 1)], ReLU
out(64, 1)

Ds:
inx(16, 1), the conditional density
NI(64) | iny(64, 1), the high-res input to classify
C(32, 4, 32), leaky ReLU
C(16, 4, 64), BN, leaky ReLU
C(8, 4, 128), BN, leaky ReLU
C(8, 4, 256), BN, leaky ReLU
Fully connected, σ activation
out(1, 1)

Dt:
iny(64, 3), the 3 high-res frames to classify
C(32, 4, 32), leaky ReLU
C(16, 4, 64), BN, leaky ReLU
C(8, 4, 128), BN, leaky ReLU
C(8, 4, 256), BN, leaky ReLU
Fully connected, σ activation
out(1, 1)
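For readers who prefer code over the notation above, the following is a minimal TensorFlow/Keras sketch of the generator G as we read the listing (3D case, channels-last); it illustrates the structure rather than reproducing the authors' implementation, and helper names such as res_block are our own.

```python
import tensorflow as tf
from tensorflow.keras import layers

def res_block(x, mid_ch, out_ch, use_bn=True):
    """Residual block RB: [CA, (BN), ReLU, CB, (BN)] + [CS, (BN)], then ReLU."""
    y = layers.Conv3D(mid_ch, 5, padding="same")(x)
    if use_bn:
        y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv3D(out_ch, 5, padding="same")(y)
    if use_bn:
        y = layers.BatchNormalization()(y)
    s = layers.Conv3D(out_ch, 1, padding="same")(x)   # shortcut CS
    if use_bn:
        s = layers.BatchNormalization()(s)
    return layers.ReLU()(layers.Add()([y, s]))

def build_generator():
    # in(16, 4): a 16^3 tile with density plus three velocity channels
    inp = layers.Input(shape=(16, 16, 16, 4))
    x = layers.UpSampling3D(size=4)(inp)                   # NI(64): nearest-neighbor
    x = res_block(x, mid_ch=8,   out_ch=32)                # RB0
    x = res_block(x, mid_ch=128, out_ch=128)               # RB1
    x = res_block(x, mid_ch=32,  out_ch=8)                 # RB2
    x = res_block(x, mid_ch=2,   out_ch=1, use_bn=False)   # RB3: no BN in the last block
    return tf.keras.Model(inp, x)                          # out(64, 1)

gen = build_generator()
```

All up-sampling happens in the initial nearest-neighbor interpolation, so the four residual blocks operate at the full 64³ output resolution.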

B PARAMETERS & DATA STATISTICS

Below we summarize all parameters for the training runs and data generation. Note that the model size includes compression, and that we train the individual networks multiple times per iteration, as indicated below.

Details of generated results:

test          input size      tiles of 34³   output size      time
Fig. 14 a)    128²            -              512²             0.064 s/frame
-             34³             1              136³             2.2 s/frame
Fig. 8 b)     64³             8              256³             17.9 s/frame
Fig. 9        150×100×100     96             600×400×400      234.48 s/frame
Fig. 10       256×180×180     441            1024×720×720     1046.07 s/frame

Details of 2D training runs:

                            2D training 1 (Fig. 14 a)         2D training 2 (Fig. 14 b)         2D training 3 (Fig. 16 b)
Raw data                    4000 frames from 20 simulations   4000 frames from 20 simulations   2400 frames from 20 simulations
Training data set           160 frames                        160 frames                        320 frames
Testing data set            40 frames                         40 frames                         80 frames
Low-res data size           64²                               64²                               64²
High-res data size          256²                              256²                              256²
Input tile size             16²                               16²                               16²
λ_L1                        5                                 5                                 5
λ_f                         λ_f^{1,...,4} = -10^{-5}          λ_f^{1,4} = 1/3·10^{-4},          λ_f^{1,...,4} = 10^{-5}
                                                              λ_f^{2,3} = -1/3·10^{-4}
Batch size                  16                                16                                16
G, Ds, Dt trainings/epoch   2                                 2                                 2
Training epochs             40000                             40000                             40000
Model weights               634214 (G), 706017 (Ds), 706529 (Dt) for all three runs
Model size                  36.88 MB                          42.73 MB                          43.45 MB
Training time (min)         798.65 (1 GPU)                    905.72 (1 GPU)                    877.59 (1 GPU)

Details of 3D training runs:

                            3D training 1 (Fig. 8 b)          3D training 2 (Fig. 15 b)
Raw data                    2400 frames from 20 simulations   2400 frames from 20 simulations
Training data set           96 frames                         96 frames
Testing data set            24 frames                         24 frames
Low-res data size           64³                               64³
High-res data size          256³                              256³
Input tile size             16³                               16³
λ_L1                        5                                 5
λ_f                         λ_f^{1,...,4} = -1/3·10^{-6}      λ_f^{1} = 1/3·10^{-6}, λ_f^{2,3,4} = -1/3·10^{-6}
Batch size                  1                                 1
G, Ds, Dt trainings/epoch   16                                16
Training epochs             7000                              7000
Model weights               3148014 (G), 2888161 (Ds), 2890209 (Dt) for both runs
Model size                  134.93 MB                         140.79 MB
Training time (min)         12636.22 (2 GPUs)                 18231.97 (2 GPUs)
