
Learning Latent Dynamics for Planning from Pixels

Danijar Hafner 1 2   Timothy Lillicrap 3   Ian Fischer 4   Ruben Villegas 1 5   David Ha 1   Honglak Lee 1   James Davidson 1

Abstract

Planning has been very successful for control tasks with known environment dynamics. To leverage planning in unknown environments, the agent needs to learn the dynamics from interactions with the world. However, learning dynamics models that are accurate enough for planning has been a long-standing challenge, especially in image-based domains. We propose the Deep Planning Network (PlaNet), a purely model-based agent that learns the environment dynamics from images and chooses actions through fast online planning in latent space. To achieve high performance, the dynamics model must accurately predict the rewards ahead for multiple time steps. We approach this using a latent dynamics model with both deterministic and stochastic transition components. Moreover, we propose a multi-step variational inference objective that we name latent overshooting. Using only pixel observations, our agent solves continuous control tasks with contact dynamics, partial observability, and sparse rewards, which exceed the difficulty of tasks that were previously solved by planning with learned models. PlaNet uses substantially fewer episodes and reaches final performance close to and sometimes higher than strong model-free algorithms.

1. Introduction

Planning is a natural and powerful approach to decision making problems with known dynamics, such as game playing and simulated robot control (Tassa et al., 2012; Silver et al., 2017; Moravčík et al., 2017). To plan in unknown environments, the agent needs to learn the dynamics from experience. Learning dynamics models that are accurate enough for planning has been a long-standing challenge. Key difficulties include model inaccuracies, accumulating errors of multi-step predictions, failure to capture multiple possible futures, and overconfident predictions outside of the training distribution.

1 Google Brain  2 University of Toronto  3 DeepMind  4 Google Research  5 University of Michigan. Correspondence to: Danijar Hafner <[email protected]>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Planning using learned models offers several benefits over model-free reinforcement learning. First, model-based planning can be more data efficient because it leverages a richer training signal and does not require propagating rewards through Bellman backups. Moreover, planning carries the promise of increasing performance just by increasing the computational budget for searching for actions, as shown by Silver et al. (2017). Finally, learned dynamics can be independent of any specific task and thus have the potential to transfer well to other tasks in the environment.

Recent work has shown promise in learning the dynamics of simple low-dimensional environments (Deisenroth & Rasmussen, 2011; Gal et al., 2016; Amos et al., 2018; Chua et al., 2018; Henaff et al., 2018). However, these approaches typically assume access to the underlying state of the world and the reward function, which may not be available in practice. In high-dimensional environments, we would like to learn the dynamics in a compact latent space to enable fast planning. The success of such latent models has previously been limited to simple tasks such as balancing cartpoles and controlling 2-link arms from dense rewards (Watter et al., 2015; Banijamali et al., 2017).

In this paper, we propose the Deep Planning Network (PlaNet), a model-based agent that learns the environment dynamics from pixels and chooses actions through online planning in a compact latent space. To learn the dynamics, we use a transition model with both stochastic and deterministic components. Moreover, we experiment with a novel generalized variational objective that encourages multi-step predictions. PlaNet solves continuous control tasks from pixels that are more difficult than those previously solved by planning with learned models.

    Key contributions of this work are summarized as follows:

• Planning in latent spaces  We solve a variety of tasks from the DeepMind control suite, shown in Figure 1, by learning a dynamics model and efficiently planning in its latent space. Our agent substantially outperforms the model-free A3C and in some cases the D4PG algorithm in final performance, with on average 200× less environment interaction and similar computation time.


(a) Cartpole  (b) Reacher  (c) Cheetah  (d) Finger  (e) Cup  (f) Walker

Figure 1: Image-based control domains used in our experiments. The images show agent observations before downscaling to 64 × 64 × 3 pixels. (a) The cartpole swingup task has a fixed camera so the cart can move out of sight. (b) The reacher task has only a sparse reward. (c) The cheetah running task includes both contacts and a larger number of joints. (d) The finger spinning task includes contacts between the finger and the object. (e) The cup task has a sparse reward that is only given once the ball is caught. (f) The walker task requires balance and predicting difficult interactions with the ground when the robot is lying down.

• Recurrent state space model  We design a latent dynamics model with both deterministic and stochastic components (Buesing et al., 2018; Chung et al., 2015). Our experiments indicate that having both components is crucial for high planning performance.

• Latent overshooting  We generalize the standard variational bound to include multi-step predictions. Using only terms in latent space results in a fast regularizer that can improve long-term predictions and is compatible with any latent sequence model.

2. Latent Space Planning

To solve unknown environments via planning, we need to model the environment dynamics from experience. PlaNet does so by iteratively collecting data using planning and training the dynamics model on the gathered data. In this section, we introduce notation for the environment and describe the general implementation of our model-based agent. In this section, we assume access to a learned dynamics model. Our design and training objective for this model are detailed in Section 3.

Problem setup  Since individual image observations generally do not reveal the full state of the environment, we consider a partially observable Markov decision process (POMDP). We define a discrete time step $t$, hidden states $s_t$, image observations $o_t$, continuous action vectors $a_t$, and scalar rewards $r_t$, that follow the stochastic dynamics

Transition function:   $s_t \sim \mathrm{p}(s_t \mid s_{t-1}, a_{t-1})$
Observation function:  $o_t \sim \mathrm{p}(o_t \mid s_t)$
Reward function:       $r_t \sim \mathrm{p}(r_t \mid s_t)$
Policy:                $a_t \sim \mathrm{p}(a_t \mid o_{\le t}, a_{<t})$,

where the goal is to find a policy that maximizes the expected sum of rewards $\mathrm{E}_\mathrm{p}\big[\sum_{t=1}^{T} r_t\big]$.
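To make the notation concrete, here is a minimal Python sketch that samples a trajectory from such a POMDP. The linear-Gaussian transition, observation, and reward functions and the random policy are toy stand-ins chosen for illustration only; in PlaNet, these distributions are unknown to the agent or learned from data.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, obs_dim, action_dim, T = 4, 8, 2, 10

# Toy stand-ins for the (unknown) environment distributions.
A = rng.normal(size=(state_dim, state_dim)) * 0.3   # transition matrix
B = rng.normal(size=(state_dim, action_dim)) * 0.3  # action effect
C = rng.normal(size=(obs_dim, state_dim))           # observation map

s = np.zeros(state_dim)
total_reward = 0.0
for t in range(T):
    a = rng.uniform(-1.0, 1.0, size=action_dim)           # a_t ~ p(a_t | o_<=t, a_<t), here random
    s = A @ s + B @ a + 0.1 * rng.normal(size=state_dim)  # s_t ~ p(s_t | s_t-1, a_t-1)
    o = C @ s + 0.1 * rng.normal(size=obs_dim)            # o_t ~ p(o_t | s_t)
    r = -np.sum(s ** 2) + 0.1 * rng.normal()              # r_t ~ p(r_t | s_t)
    total_reward += r
print(total_reward)
```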


Model-based planning  PlaNet learns a transition model $p(s_t \mid s_{t-1}, a_{t-1})$, observation model $p(o_t \mid s_t)$, and reward model $p(r_t \mid s_t)$ from previously experienced episodes (note italic letters for the model compared to upright letters for the true dynamics). The observation model provides a rich training signal but is not used for planning. We also learn an encoder $q(s_t \mid o_{\le t}, a_{<t})$ to infer an approximate belief over the current hidden state from the history. Given this belief, the agent searches for the best action sequence under the model using the cross entropy method (CEM; Appendix B) and executes only the first action before replanning.


(a) Deterministic model (RNN)    (b) Stochastic model (SSM)    (c) Recurrent state-space model (RSSM)

Figure 2: Latent dynamics model designs. In this example, the model observes the first two time steps and predicts the third. Circles represent stochastic variables and squares deterministic variables. Solid lines denote the generative process and dashed lines the inference model. (a) Transitions in a recurrent neural network are purely deterministic. This prevents the model from capturing multiple futures and makes it easy for the planner to exploit inaccuracies. (b) Transitions in a state-space model are purely stochastic. This makes it difficult to remember information over multiple time steps. (c) We split the state into stochastic and deterministic parts, allowing the model to robustly learn to predict multiple futures.

The encoder is implemented as a convolutional neural network followed by a feed-forward neural network. We use the filtering posterior that conditions on past observations since we are ultimately interested in using the model for planning, but one may also use the full smoothing posterior during training (Babaeizadeh et al., 2017; Gregor & Besse, 2018).
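For illustration, the following is a minimal PyTorch sketch of one transition step of a model in the spirit of Figure 2c: a deterministic GRU path plus a stochastic state whose prior and posterior are diagonal Gaussians. The sizes follow Appendix A (200 GRU units, 30-dimensional latent states), but the exact layer layout and the 1024-dimensional image embedding are assumptions, not the authors' implementation.

```python
import torch
from torch import nn
from torch.nn import functional as F


class RSSMCell(nn.Module):
    """One transition step: deterministic GRU path + stochastic latent state."""

    def __init__(self, state_dim=30, hidden_dim=200, action_dim=2, embed_dim=1024):
        super().__init__()
        self.rnn = nn.GRUCell(state_dim + action_dim, hidden_dim)          # deterministic path
        self.prior_net = nn.Linear(hidden_dim, 2 * state_dim)              # p(s_t | h_t)
        self.post_net = nn.Linear(hidden_dim + embed_dim, 2 * state_dim)   # q(s_t | h_t, o_t)

    @staticmethod
    def _gaussian(stats):
        mean, std = stats.chunk(2, dim=-1)
        return torch.distributions.Normal(mean, F.softplus(std) + 1e-3)

    def forward(self, prev_state, prev_action, hidden, obs_embed=None):
        # The deterministic path remembers information over many time steps.
        hidden = self.rnn(torch.cat([prev_state, prev_action], -1), hidden)
        prior = self._gaussian(self.prior_net(hidden))
        if obs_embed is None:                      # open-loop prediction
            return prior.rsample(), hidden, prior, None
        # The posterior additionally conditions on the embedding of the current image.
        post = self._gaussian(self.post_net(torch.cat([hidden, obs_embed], -1)))
        return post.rsample(), hidden, prior, post


# Toy usage with random tensors in place of real actions and image embeddings.
cell = RSSMCell()
state, hidden = torch.zeros(1, 30), torch.zeros(1, 200)
state, hidden, prior, post = cell(state, torch.zeros(1, 2), hidden, torch.randn(1, 1024))
```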

Training objective  Using the encoder, we construct a variational bound on the data log-likelihood. For simplicity, we write losses for predicting only the observations; the reward losses follow by analogy. The variational bound obtained using Jensen's inequality is

$$
\ln p(o_{1:T} \mid a_{1:T})
  \triangleq \ln \int \prod_t p(s_t \mid s_{t-1}, a_{t-1})\, p(o_t \mid s_t)\, \mathrm{d}s_{1:T}
  \ge \sum_{t=1}^{T} \Big(
    \mathrm{E}_{q(s_t \mid o_{\le t}, a_{<t})}\big[\ln p(o_t \mid s_t)\big]
    - \mathrm{E}_{q(s_{t-1} \mid o_{\le t-1}, a_{<t-1})}\big[\mathrm{KL}\big[q(s_t \mid o_{\le t}, a_{<t}) \,\big\|\, p(s_t \mid s_{t-1}, a_{t-1})\big]\big]
  \Big). \tag{3}
$$
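The two terms of this bound can be estimated per time step with a single reparameterized sample. The sketch below, continuing the PyTorch example above, uses toy prior and posterior distributions and a linear stand-in for the image decoder; the unit-variance Gaussian image likelihood is an assumption for illustration.

```python
import torch

# Toy stand-ins for the posterior q(s_t | o_<=t, a_<t), the prior p(s_t | s_t-1, a_t-1),
# the deconvolutional decoder, and an observed image.
decoder = torch.nn.Linear(30, 64 * 64 * 3)
prior = torch.distributions.Normal(torch.zeros(1, 30), torch.ones(1, 30))
post = torch.distributions.Normal(torch.zeros(1, 30), 0.5 * torch.ones(1, 30))
image = torch.rand(1, 64 * 64 * 3)

state = post.rsample()                                # reparameterized sample
recon = torch.distributions.Normal(decoder(state), 1.0).log_prob(image).sum()
complexity = torch.distributions.kl_divergence(post, prior).sum()
step_objective = recon - complexity                   # one summand of the bound in Eq. (3)
```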


(a) Standard variational bound    (b) Observation overshooting    (c) Latent overshooting

Figure 3: Unrolling schemes. The labels $s_{i|j}$ are short for the state at time $i$ conditioned on observations up to time $j$. Arrows pointing at shaded circles indicate log-likelihood loss terms. Wavy arrows indicate KL-divergence loss terms. (a) The standard variational objective decodes the posterior at every step to compute the reconstruction loss. It also places a KL on the prior and posterior at every step, which trains the transition function for one-step predictions. (b) Observation overshooting (Amos et al., 2018) decodes all multi-step predictions to apply additional reconstruction losses. This is typically too expensive in image domains. (c) Latent overshooting predicts all multi-step priors. These state beliefs are trained towards their corresponding posteriors in latent space to encourage accurate multi-step predictions.

Several of the models we experimented with benefit from latent overshooting, although our final agent using the RSSM model does not require it (see Appendix D).

Limited capacity  If we could train our model to make perfect one-step predictions, it would also make perfect multi-step predictions, so this would not be a problem. However, when using a model with limited capacity and a restricted distributional family, training the model only on one-step predictions until convergence does not, in general, yield the model that is best at multi-step predictions. For successful planning, we need accurate multi-step predictions. Therefore, we take inspiration from Amos et al. (2018) and earlier related ideas (Krishnan et al., 2015; Lamb et al., 2016; Chiappa et al., 2017), and train the model on multi-step predictions of all distances. We develop this idea for latent sequence models, showing that multi-step predictions can be improved by a loss in latent space, without having to generate additional images.

Multi-step prediction  We start by generalizing the standard variational bound (Equation 3) from training one-step predictions to training multi-step predictions of a fixed distance $d$. For ease of notation, we omit actions in the conditioning set here; every distribution over $s_t$ is conditioned upon $a_{<t}$.


Latent overshooting trains the model on multi-step predictions of all distances $1 \le d \le D$,

$$
\frac{1}{D} \sum_{d=1}^{D} \ln p_d(o_{1:T})
  \ge \sum_{t=1}^{T} \Big(
    \underbrace{\mathrm{E}_{q(s_t \mid o_{\le t})}\big[\ln p(o_t \mid s_t)\big]}_{\text{reconstruction}}
    - \frac{1}{D} \sum_{d=1}^{D} \beta_d\,
      \underbrace{\mathrm{E}_{p(s_{t-1} \mid s_{t-d})\, q(s_{t-d} \mid o_{\le t-d})}\big[\mathrm{KL}\big[q(s_t \mid o_{\le t}) \,\big\|\, p(s_t \mid s_{t-1})\big]\big]}_{\text{latent overshooting}}
  \Big). \tag{7}
$$

Latent overshooting can be interpreted as a regularizer in latent space that encourages consistency between one-step and multi-step predictions, which we know should be equivalent in expectation over the data set. We include weighting factors $\{\beta_d\}_{d=1}^{D}$ analogously to the β-VAE (Higgins et al., 2016). While we set all $\beta_d$ for $d > 1$ to the same value for simplicity, they could be chosen to let the model focus more on long-term or short-term predictions. In practice, we stop gradients of the posterior distributions for overshooting distances $d > 1$, so that the multi-step predictions are trained towards the informed posteriors, but not the other way around.
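A minimal PyTorch sketch of the latent overshooting term for one time step and one overshooting distance d = 2: the multi-step prior is obtained by re-applying the transition to a sample of an earlier posterior, and the current posterior is detached so that only the prior is trained towards it. The toy distributions and the simple linear transition are assumptions for illustration.

```python
import torch

state_dim, beta_d = 30, 1.0
transition = torch.nn.Linear(state_dim, 2 * state_dim)      # stand-in for p(s_t | s_t-1)

# Posterior beliefs q(s_{t-d} | o_<=t-d) and q(s_t | o_<=t) from the encoder (toy values).
post_tmd = torch.distributions.Normal(torch.zeros(1, state_dim), torch.ones(1, state_dim))
post_t = torch.distributions.Normal(torch.randn(1, state_dim), torch.ones(1, state_dim))

# Roll the transition forward from a sample of the earlier posterior to obtain the
# multi-step prior p(s_t | s_t-1) with s_t-1 itself predicted by the model.
state = post_tmd.rsample()
for _ in range(2):                                           # d = 2 steps ahead
    mean, std = transition(state).chunk(2, dim=-1)
    prior = torch.distributions.Normal(mean, torch.nn.functional.softplus(std) + 1e-3)
    state = prior.rsample()

# Stop gradients on the posterior: the multi-step prior is trained towards the
# informed posterior, but not the other way around.
post_detached = torch.distributions.Normal(post_t.mean.detach(), post_t.stddev.detach())
overshoot_loss = beta_d * torch.distributions.kl_divergence(post_detached, prior).sum()
```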

5. Experiments

We evaluate PlaNet on six continuous control tasks from pixels. We explore multiple design axes of the agent: the stochastic and deterministic paths in the dynamics model, iterative planning, and online experience collection. We refer to the appendix for hyper parameters (Appendix A) and additional experiments (Appendices C to E). Besides the action repeat, we use the same hyper parameters for all tasks. Within less than one hundredth the episodes, PlaNet outperforms A3C (Mnih et al., 2016) and achieves similar performance to the top model-free algorithm D4PG (Barth-Maron et al., 2018). The training time of 10 to 20 hours (depending on the task) on a single Nvidia V100 GPU compares favorably to that of A3C and D4PG. Our implementation uses TensorFlow Probability (Dillon et al., 2017). Please visit https://danijar.com/planet for access to the code and videos of the trained agent.

For our evaluation, we consider six image-based continuous control tasks of the DeepMind control suite (Tassa et al., 2018), shown in Figure 1. These environments provide qualitatively different challenges. The cartpole swingup task requires a long planning horizon and to memorize the cart when it is out of view, reacher has a sparse reward given when the hand and goal area overlap, finger spinning includes contact dynamics between the finger and the object, cheetah exhibits larger state and action spaces, the cup task only has a sparse reward for when the ball is caught, and the walker is challenging because the robot first has to stand up and then walk, resulting in collisions with the ground that are difficult to predict. In all tasks, the only observations are third-person camera images of size 64 × 64 × 3 pixels.

Comparison to model-free methods  Figure 4 compares the performance of PlaNet to the model-free algorithms reported by Tassa et al. (2018). Within 100 episodes, PlaNet outperforms the policy-gradient method A3C trained from proprioceptive states for 100,000 episodes, on all tasks. After 500 episodes, it achieves performance similar to D4PG, trained from images for 100,000 episodes, except for the finger task. PlaNet surpasses the final performance of D4PG with a relative improvement of 26% on the cheetah running task. We refer to Table 1 for numerical results, which also includes the performance of CEM planning with the true dynamics of the simulator.

Model designs  Figure 4 additionally compares design choices of the dynamics model. We train PlaNet using our recurrent state-space model (RSSM), as well as versions with a purely deterministic GRU (Cho et al., 2014) and a purely stochastic state-space model (SSM). We observe the importance of both stochastic and deterministic elements in the transition function on all tasks. The deterministic part allows the model to remember information over many time steps. The stochastic component is even more important; the agent does not learn without it. This could be because the tasks are stochastic from the agent's perspective due to partial observability of the initial states. The noise might also add a safety margin to the planning objective that results in more robust action sequences.

Agent designs  Figure 5 compares PlaNet, a version collecting episodes under random actions rather than by planning, and a version that at each environment step selects the best action out of 1000 sequences rather than iteratively refining plans via CEM. We observe that online data collection helps for all tasks and is necessary for the cartpole, finger, and walker tasks. Iterative search for action sequences using CEM improves performance on all tasks.

One agent all tasks  Figure 7 in the appendix shows the performance of a single agent trained on all six tasks. The agent is not told which task it is facing; it needs to infer this from the image observations. We pad the action spaces with unused elements to make them compatible and adapt Algorithm 1 to collect one episode of each task every C update steps. We use the same hyper parameters as for the main experiments above. The agent solves all tasks while learning slower compared to individually trained agents. This indicates that the model can learn to predict multiple domains, regardless of the conceptually different visuals.

6. Related Work

Previous work in model-based reinforcement learning has focused on planning in low-dimensional state spaces (Gal et al., 2016; Higuera et al., 2018; Henaff et al., 2018;


[Figure 4 plots: Cartpole Swing Up, Reacher Easy, Cheetah Run, Finger Spin, Cup Catch, Walker Walk. Legend: PlaNet (RSSM), Stochastic (SSM), Deterministic (GRU), D4PG (100k episodes), A3C (100k episodes, proprio).]

Figure 4: Comparison of PlaNet to model-free algorithms and other model designs. Plots show test performance over the number of collected episodes. We compare PlaNet using our RSSM (Section 3) to purely deterministic (GRU) and purely stochastic (SSM) models. The RNN does not use latent overshooting, as it does not have stochastic latents. The lines show medians and the areas show percentiles 5 to 95 over 5 seeds and 10 trajectories. The shaded areas are large on two of the tasks due to the sparse rewards.

Table 1: Comparison of PlaNet to the model-free algorithms A3C and D4PG reported by Tassa et al. (2018). The training curves for these are shown as orange lines in Figure 4 and as solid green lines in Figure 6 in their paper. From these, we estimate the number of episodes that D4PG takes to achieve the final performance of PlaNet to estimate the data efficiency gain. We further include CEM planning ($H = 12$, $I = 10$, $J = 1000$, $K = 100$) with the true simulator instead of learned dynamics as an estimated upper bound on performance. Numbers indicate mean final performance over 5 seeds and 10 trajectories.

Method | Modality | Episodes | Cartpole Swing Up | Reacher Easy | Cheetah Run | Finger Spin | Cup Catch | Walker Walk
A3C | proprioceptive | 100,000 | 558 | 285 | 214 | 129 | 105 | 311
D4PG | pixels | 100,000 | 862 | 967 | 524 | 985 | 980 | 968
PlaNet (ours) | pixels | 1,000 | 821 | 832 | 662 | 700 | 930 | 951
CEM + true simulator | simulator state | 0 | 850 | 964 | 656 | 825 | 993 | 994
Data efficiency gain PlaNet over D4PG (factor) | - | - | 250 | 40 | 500+ | 300 | 100 | 90

Chua et al., 2018), combining the benefits of model-based and model-free approaches (Kalweit & Boedecker, 2017; Nagabandi et al., 2017; Weber et al., 2017; Kurutach et al., 2018; Buckman et al., 2018; Ha & Schmidhuber, 2018; Wayne et al., 2018; Igl et al., 2018; Srinivas et al., 2018), and pure video prediction without planning (Oh et al., 2015; Krishnan et al., 2015; Karl et al., 2016; Chiappa et al., 2017; Babaeizadeh et al., 2017; Gemici et al., 2017; Denton & Fergus, 2018; Buesing et al., 2018; Doerr et al., 2018; Gregor & Besse, 2018). Appendix G reviews these orthogonal research directions in more detail.

Relatively few works have demonstrated successful planning from pixels using learned dynamics models. The robotics community focuses on video prediction models for planning (Agrawal et al., 2016; Finn & Levine, 2017; Ebert et al., 2018; Zhang et al., 2018) that deal with the visual complexity of the real world and solve tasks with


[Figure 5 plots: Cartpole Swing Up, Reacher Easy, Cheetah Run, Finger Spin, Cup Catch, Walker Walk. Legend: PlaNet, Random collection, Random shooting, D4PG (100k episodes), A3C (100k episodes, proprio).]

Figure 5: Comparison of agent designs. Plots show test performance over the number of collected episodes. We compare PlaNet, a version that collects data under random actions (random collection), and a version that chooses the best action out of 1000 sequences at each environment step (random shooting) without iteratively refining plans via CEM. The lines show medians and the areas show percentiles 5 to 95 over 5 seeds and 10 trajectories.

a simple gripper, such as grasping or pushing objects. In comparison, we focus on simulated environments, where we leverage latent planning to scale to larger state and action spaces, longer planning horizons, as well as sparse reward tasks. E2C (Watter et al., 2015) and RCE (Banijamali et al., 2017) embed images into a latent space, where they learn local-linear latent transitions and plan for actions using LQR. These methods balance simulated cartpoles and control 2-link arms from images, but have been difficult to scale up. We lift the Markov assumption of these models, making our method applicable under partial observability, and present results on more challenging environments that include longer planning horizons, contact dynamics, and sparse rewards.

7. Discussion

We present PlaNet, a model-based agent that learns a latent dynamics model from image observations and chooses actions by fast planning in latent space. To enable accurate long-term predictions, we design a model with both stochastic and deterministic paths. We show that our agent succeeds at several continuous control tasks from image observations, reaching performance that is comparable to the best model-free algorithms while using 200× fewer episodes and similar or less computation time. The results show that learning latent dynamics models for planning in image domains is a promising approach.

Directions for future work include learning temporal abstraction instead of using a fixed action repeat, possibly through hierarchical models. To further improve final performance, one could learn a value function to approximate the sum of rewards beyond the planning horizon. Moreover, gradient-based planning could increase the computational efficiency of the agent, and learning representations without reconstruction could help to solve tasks with higher visual diversity. Our work provides a starting point for multi-task control by sharing the dynamics model.

Acknowledgements  We thank Jacob Buckman, Nicolas Heess, John Schulman, Rishabh Agarwal, Silviu Pitis, Mohammad Norouzi, George Tucker, David Duvenaud, Shane Gu, Chelsea Finn, Steven Bohez, Jimmy Ba, Stephanie Chan, and Jenny Liu for helpful discussions.


    References

Agrawal, P., Nair, A. V., Abbeel, P., Malik, J., and Levine, S. Learning to poke by poking: Experiential learning of intuitive physics. In Advances in Neural Information Processing Systems, pp. 5074–5082, 2016.

Amos, B., Dinh, L., Cabi, S., Rothörl, T., Muldal, A., Erez, T., Tassa, Y., de Freitas, N., and Denil, M. Learning awareness models. In International Conference on Learning Representations, 2018.

Babaeizadeh, M., Finn, C., Erhan, D., Campbell, R. H., and Levine, S. Stochastic variational video prediction. arXiv preprint arXiv:1710.11252, 2017.

Banijamali, E., Shu, R., Ghavamzadeh, M., Bui, H., and Ghodsi, A. Robust locally-linear controllable embedding. arXiv preprint arXiv:1710.05373, 2017.

Barth-Maron, G., Hoffman, M. W., Budden, D., Dabney, W., Horgan, D., Muldal, A., Heess, N., and Lillicrap, T. Distributed distributional deterministic policy gradients. arXiv preprint arXiv:1804.08617, 2018.

Bengio, S., Vinyals, O., Jaitly, N., and Shazeer, N. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 1171–1179, 2015.

Buckman, J., Hafner, D., Tucker, G., Brevdo, E., and Lee, H. Sample-efficient reinforcement learning with stochastic ensemble value expansion. arXiv preprint arXiv:1807.01675, 2018.

Buesing, L., Weber, T., Racaniere, S., Eslami, S., Rezende, D., Reichert, D. P., Viola, F., Besse, F., Gregor, K., Hassabis, D., et al. Learning and querying fast generative models for reinforcement learning. arXiv preprint arXiv:1802.03006, 2018.

Chiappa, S., Racaniere, S., Wierstra, D., and Mohamed, S. Recurrent environment simulators. arXiv preprint arXiv:1704.02254, 2017.

Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

Chua, K., Calandra, R., McAllister, R., and Levine, S. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. arXiv preprint arXiv:1805.12114, 2018.

Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A. C., and Bengio, Y. A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems, pp. 2980–2988, 2015.

Clevert, D.-A., Unterthiner, T., and Hochreiter, S. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.

Deisenroth, M. and Rasmussen, C. E. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 465–472, 2011.

Denton, E. and Fergus, R. Stochastic video generation with a learned prior. arXiv preprint arXiv:1802.07687, 2018.

Dillon, J. V., Langmore, I., Tran, D., Brevdo, E., Vasudevan, S., Moore, D., Patton, B., Alemi, A., Hoffman, M., and Saurous, R. A. TensorFlow distributions. arXiv preprint arXiv:1711.10604, 2017.

Doerr, A., Daniel, C., Schiegg, M., Nguyen-Tuong, D., Schaal, S., Toussaint, M., and Trimpe, S. Probabilistic recurrent state-space models. arXiv preprint arXiv:1801.10395, 2018.

Ebert, F., Finn, C., Dasari, S., Xie, A., Lee, A., and Levine, S. Visual foresight: Model-based deep reinforcement learning for vision-based robotic control. arXiv preprint arXiv:1812.00568, 2018.

Finn, C. and Levine, S. Deep visual foresight for planning robot motion. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pp. 2786–2793. IEEE, 2017.

Gal, Y., McAllister, R., and Rasmussen, C. E. Improving PILCO with Bayesian neural network dynamics models. In Data-Efficient Machine Learning Workshop, ICML, 2016.

Gemici, M., Hung, C.-C., Santoro, A., Wayne, G., Mohamed, S., Rezende, D. J., Amos, D., and Lillicrap, T. Generative temporal models with memory. arXiv preprint arXiv:1702.04649, 2017.

Gregor, K. and Besse, F. Temporal difference variational auto-encoder. arXiv preprint arXiv:1806.03107, 2018.

Ha, D. and Schmidhuber, J. World models. arXiv preprint arXiv:1803.10122, 2018.

Henaff, M., Whitney, W. F., and LeCun, Y. Model-based planning with discrete and continuous actions. arXiv preprint arXiv:1705.07177, 2018.


Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2016.

Higuera, J. C. G., Meger, D., and Dudek, G. Synthesizing neural network controllers with probabilistic model based reinforcement learning. arXiv preprint arXiv:1803.02291, 2018.

Igl, M., Zintgraf, L., Le, T. A., Wood, F., and Whiteson, S. Deep variational reinforcement learning for POMDPs. arXiv preprint arXiv:1806.02426, 2018.

Kalchbrenner, N., Oord, A. v. d., Simonyan, K., Danihelka, I., Vinyals, O., Graves, A., and Kavukcuoglu, K. Video pixel networks. arXiv preprint arXiv:1610.00527, 2016.

Kalweit, G. and Boedecker, J. Uncertainty-driven imagination for continuous deep reinforcement learning. In Conference on Robot Learning, pp. 195–206, 2017.

Karl, M., Soelch, M., Bayer, J., and van der Smagt, P. Deep variational Bayes filters: Unsupervised learning of state space models from raw data. arXiv preprint arXiv:1605.06432, 2016.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. arXiv preprint arXiv:1807.03039, 2018.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Krishnan, R. G., Shalit, U., and Sontag, D. Deep Kalman filters. arXiv preprint arXiv:1511.05121, 2015.

Krishnan, R. G., Shalit, U., and Sontag, D. Structured inference networks for nonlinear state space models. In AAAI, pp. 2101–2109, 2017.

Kurutach, T., Clavera, I., Duan, Y., Tamar, A., and Abbeel, P. Model-ensemble trust-region policy optimization. arXiv preprint arXiv:1802.10592, 2018.

Lamb, A. M., Goyal, A. G. A. P., Zhang, Y., Zhang, S., Courville, A. C., and Bengio, Y. Professor forcing: A new algorithm for training recurrent networks. In Advances in Neural Information Processing Systems, pp. 4601–4609, 2016.

Mathieu, M., Couprie, C., and LeCun, Y. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440, 2015.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.

Moerland, T. M., Broekens, J., and Jonker, C. M. Learning multimodal transition dynamics for model-based reinforcement learning. arXiv preprint arXiv:1705.00470, 2017.

Moravčík, M., Schmid, M., Burch, N., Lisý, V., Morrill, D., Bard, N., Davis, T., Waugh, K., Johanson, M., and Bowling, M. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508–513, 2017.

Nagabandi, A., Kahn, G., Fearing, R. S., and Levine, S. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. arXiv preprint arXiv:1708.02596, 2017.

Nair, V. and Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814, 2010.

Oh, J., Guo, X., Lee, H., Lewis, R. L., and Singh, S. Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems, pp. 2863–2871, 2015.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

Richards, A. G. Robust constrained model predictive control. PhD thesis, Massachusetts Institute of Technology, 2005.

Rubinstein, R. Y. Optimization of computer simulation models with rare events. European Journal of Operational Research, 99(1):89–112, 1997.

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.

Srinivas, A., Jabri, A., Abbeel, P., Levine, S., and Finn, C. Universal planning networks. arXiv preprint arXiv:1804.00645, 2018.


Talvitie, E. Model regularization for stable sample rollouts. In UAI, pp. 780–789, 2014.

Tassa, Y., Erez, T., and Todorov, E. Synthesis and stabilization of complex behaviors through online trajectory optimization. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 4906–4913. IEEE, 2012.

Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., Casas, D. d. L., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., et al. DeepMind control suite. arXiv preprint arXiv:1801.00690, 2018.

van den Oord, A., Vinyals, O., et al. Neural discrete representation learning. In Advances in Neural Information Processing Systems, pp. 6309–6318, 2017.

Venkatraman, A., Hebert, M., and Bagnell, J. A. Improving multi-step prediction of learned time series models. In AAAI, pp. 3024–3030, 2015.

Vondrick, C., Pirsiavash, H., and Torralba, A. Generating videos with scene dynamics. In Advances in Neural Information Processing Systems, 2016.

Watter, M., Springenberg, J., Boedecker, J., and Riedmiller, M. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems, pp. 2746–2754, 2015.

Wayne, G., Hung, C.-C., Amos, D., Mirza, M., Ahuja, A., Grabska-Barwinska, A., Rae, J., Mirowski, P., Leibo, J. Z., Santoro, A., et al. Unsupervised predictive memory in a goal-directed agent. arXiv preprint arXiv:1803.10760, 2018.

Weber, T., Racanière, S., Reichert, D. P., Buesing, L., Guez, A., Rezende, D. J., Badia, A. P., Vinyals, O., Heess, N., Li, Y., et al. Imagination-augmented agents for deep reinforcement learning. arXiv preprint arXiv:1707.06203, 2017.

Zhang, M., Vikram, S., Smith, L., Abbeel, P., Johnson, M., and Levine, S. SOLAR: Deep structured representations for model-based reinforcement learning. arXiv preprint arXiv:1808.09105, 2018.


A. Hyper Parameters

We use the convolutional and deconvolutional networks from Ha & Schmidhuber (2018), a GRU (Cho et al., 2014) with 200 units as the deterministic path in the dynamics model, and implement all other functions as two fully connected layers of size 200 with ReLU activations (Nair & Hinton, 2010). Distributions in latent space are 30-dimensional diagonal Gaussians with predicted mean and standard deviation.

We pre-process images by reducing the bit depth to 5 bits as in Kingma & Dhariwal (2018). The model is trained using the Adam optimizer (Kingma & Ba, 2014) with a learning rate of $10^{-3}$, $\epsilon = 10^{-4}$, and gradient clipping norm of 1000 on batches of $B = 50$ sequence chunks of length $L = 50$. We do not scale the KL divergence terms relative to the reconstruction terms but grant the model 3 free nats by clipping the divergence loss below this value. In a previous version of the agent, we used latent overshooting and an additional fixed global prior, but we found this to not be necessary.
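A short sketch of two of the details mentioned above, the 5-bit image preprocessing and the 3 free nats, under a minimal interpretation of the text; this is not the authors' exact code.

```python
import numpy as np

def preprocess(image_uint8, bits=5):
    """Reduce bit depth to `bits` and scale pixels to roughly [-0.5, 0.5]."""
    bins = 2 ** bits
    image = np.floor(image_uint8 / 2 ** (8 - bits))  # quantize 8-bit pixels to 5 bits
    return image / bins - 0.5

def clipped_kl(kl_value, free_nats=3.0):
    """Clip the divergence loss below the free-nats threshold, so KL values
    smaller than 3 nats produce no training signal."""
    return max(kl_value, free_nats)

print(preprocess(np.array([0, 128, 255], dtype=np.uint8)))
print(clipped_kl(2.1), clipped_kl(5.0))
```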

For planning, we use CEM with horizon length $H = 12$, optimization iterations $I = 10$, candidate samples $J = 1000$, and refitting to the best $K = 100$. We start from $S = 5$ seed episodes with random actions and collect another episode every $C = 100$ update steps under $\epsilon \sim \mathrm{Normal}(0, 0.3)$ action noise. The action repeat differs between domains: cartpole ($R = 8$), reacher ($R = 4$), cheetah ($R = 4$), finger ($R = 2$), cup ($R = 4$), walker ($R = 2$). We found important hyper parameters to be the action repeat, the KL-divergence scales $\beta$, and the learning rate.

    B. Planning Algorithm

    Algorithm 2: Latent planning with CEM

Input:
  H    Planning horizon distance
  I    Optimization iterations
  J    Candidates per iteration
  K    Number of top candidates to refit to
  q(s_t | o_{≤t}, a_{<t})    Current state belief
  p(s_t | s_{t−1}, a_{t−1}), p(r_t | s_t)    Transition and reward models
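The body of Algorithm 2 did not survive extraction, so the following is a hedged NumPy sketch of latent CEM planning that is consistent with the description in Sections 2 and 5 and the hyper parameters above. The transition and reward callables and the toy usage are illustrative assumptions.

```python
import numpy as np

def cem_plan(belief, transition, reward, action_dim,
             horizon=12, iterations=10, candidates=1000, top_k=100):
    """Search for an action sequence in latent space and return its first action."""
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iterations):
        # Sample J candidate action sequences from the current search distribution.
        actions = mean + std * np.random.randn(candidates, horizon, action_dim)
        # Evaluate candidates by rolling out the learned model from the belief.
        states = np.repeat(belief[None], candidates, axis=0)
        returns = np.zeros(candidates)
        for t in range(horizon):
            states = transition(states, actions[:, t])
            returns += reward(states)
        # Refit the search distribution to the K best action sequences.
        best = actions[np.argsort(returns)[-top_k:]]
        mean, std = best.mean(axis=0), best.std(axis=0)
    return mean[0]  # execute only the first action, then replan (MPC style)

# Toy usage with linear stand-ins for the learned transition and reward models.
state_dim, action_dim = 30, 2
A = 0.9 * np.eye(state_dim)
B = np.random.randn(state_dim, action_dim) * 0.1
action = cem_plan(np.zeros(state_dim),
                  transition=lambda s, a: s @ A.T + a @ B.T,
                  reward=lambda s: -np.sum(s ** 2, axis=-1),
                  action_dim=action_dim)
print(action)
```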


C. Multi-Task Learning

[Figure 6 plot: Average over tasks.]

Figure 6: We compare a single PlaNet agent trained on all tasks to individual PlaNet agents. The plot shows test performance over the number of episodes collected for each task. The single agent learns to solve all the tasks while learning more slowly compared to the individual agents. The lines show mean and one standard deviation over 6 tasks, 5 seeds, and 10 trajectories.

[Figure 7 plots: Cartpole Swing Up, Reacher Easy, Cheetah Run, Finger Spin, Cup Catch, Walker Walk. Legend: Separate agents, Single agent, D4PG (100k episodes), A3C (100k episodes, proprio).]

Figure 7: Per-task performance of a single PlaNet agent trained on the six tasks. Plots show test performance over the number of episodes collected per task. The agent is not told which task it is solving and it needs to infer this from the image observations. The agent learns to distinguish the tasks and solve them with just a moderate slowdown in learning. The lines show medians and the areas show percentiles 5 to 95 over 4 seeds and 10 trajectories.

D. Latent Overshooting

[Figure 8 plots: Cartpole Swing Up, Reacher Easy, Cheetah Run, Finger Spin, Cup Catch, Walker Walk. Legend: RSSM, RSSM + Overshooting, DRNN, DRNN + Overshooting, D4PG (100k episodes), A3C (100k episodes, proprio).]

Figure 8: We compare the standard variational objective with latent overshooting on our proposed RSSM and another model called DRNN that uses two RNNs as encoder and decoder with a stochastic state sequence in between. Latent overshooting can substantially improve the performance of the DRNN and other models we have experimented with (not shown), but slightly reduces performance of our RSSM. The lines show medians and the areas show percentiles 5 to 95 over 5 seeds and 10 trajectories.

E. Activation Function

[Figure 9 plots: Cartpole Swing Up, Reacher Easy, Cheetah Run, Finger Spin, Cup Catch, Walker Walk. Legend: PlaNet (ReLU), PlaNet (ELU), SSM (ReLU), SSM (ELU), D4PG (100k episodes), A3C (100k episodes, proprio).]

Figure 9: Comparison of hard ReLU (Nair & Hinton, 2010) and smooth ELU (Clevert et al., 2015) activation functions. We find that smooth activations help improve performance of the purely stochastic model (and the purely deterministic model; not shown) while our proposed RSSM is robust to the choice of activation function. The lines show medians and the areas show percentiles 5 to 95 over 5 seeds and 10 trajectories.

F. Bound Derivations

One-step predictive distribution  The variational bound for latent dynamics models $p(o_{1:T}, s_{1:T} \mid a_{1:T}) = \prod_t p(s_t \mid s_{t-1}, a_{t-1})\, p(o_t \mid s_t)$ and a variational posterior $q(s_{1:T} \mid o_{1:T}, a_{1:T}) = \prod_t q(s_t \mid o_{\le t}, a_{<t})$ is obtained using Jensen's inequality, as stated in Equation 3.


G. Additional Related Work

Planning in state space  When low-dimensional states of the environment are available to the agent, it is possible to learn the dynamics directly in state space. In the regime of control tasks with only a few state variables, such as the cart pole and mountain car tasks, PILCO (Deisenroth & Rasmussen, 2011) achieves remarkable sample efficiency using Gaussian processes to model the dynamics. Similar approaches using neural network dynamics models can solve two-link balancing problems (Gal et al., 2016; Higuera et al., 2018) and implement planning via gradients (Henaff et al., 2018). Chua et al. (2018) use ensembles of neural networks, scaling up to the cheetah running task. The limitation of these methods is that they access the low-dimensional Markovian state of the underlying system and sometimes the reward function. Amos et al. (2018) train a deterministic model using overshooting in observation space for active exploration with a robotics hand. We move beyond low-dimensional state representations and use a latent dynamics model to solve control tasks from images.

Hybrid agents  The challenges of model-based RL have motivated the research community to develop hybrid agents that accelerate policy learning by training on imagined experience (Kalweit & Boedecker, 2017; Nagabandi et al., 2017; Kurutach et al., 2018; Buckman et al., 2018; Ha & Schmidhuber, 2018), improving feature representations (Wayne et al., 2018; Igl et al., 2018), or leveraging the information content of the model directly (Weber et al., 2017). Srinivas et al. (2018) learn a policy network with integrated planning computation using reinforcement learning and without prediction loss, yet require expert demonstrations for training.

Multi-step predictions  Training sequence models on multi-step predictions has been explored for several years. Scheduled sampling (Bengio et al., 2015) changes the rollout distance of the sequence model over the course of training. Hallucinated replay (Talvitie, 2014) mixes predictions into the data set to indirectly train multi-step predictions. Venkatraman et al. (2015) take an imitation learning approach. Recently, Amos et al. (2018) train a dynamics model on all multi-step predictions at once. We generalize this idea to latent sequence models trained via variational inference.

Latent sequence models  Classic work has explored models for non-Markovian observation sequences, including recurrent neural networks (RNNs) with deterministic hidden state and probabilistic state-space models (SSMs). The ideas behind variational autoencoders (Kingma & Welling, 2013; Rezende et al., 2014) have enabled non-linear SSMs that are trained via variational inference (Krishnan et al., 2015). The VRNN (Chung et al., 2015) combines RNNs and SSMs and is trained via variational inference. In contrast to our RSSM, it feeds generated observations back into the model, which makes forward predictions expensive. Karl et al. (2016) address mode collapse to a single future by restricting the transition function, Moerland et al. (2017) focus on multi-modal transitions, and Doerr et al. (2018) stabilize training of purely stochastic models. Buesing et al. (2018) propose a model similar to ours but use it in a hybrid agent instead of for explicit planning.

Video prediction  Video prediction is an active area of research in deep learning. Oh et al. (2015) and Chiappa et al. (2017) achieve visually plausible predictions on Atari games using deterministic models. Kalchbrenner et al. (2016) introduce an autoregressive video prediction model using gated CNNs and LSTMs. Recent approaches introduce stochasticity to the model to capture multiple futures (Babaeizadeh et al., 2017; Denton & Fergus, 2018). To obtain realistic predictions, Mathieu et al. (2015) and Vondrick et al. (2016) use adversarial losses. In simulated environments, Gemici et al. (2017) augment dynamics models with an external memory to remember long-time contexts. van den Oord et al. (2017) propose a variational model that avoids sampling using a nearest neighbor look-up, yielding high fidelity image predictions. These models are complementary to our approach.


    H. Video Predictions

[Figure 10 image grids for PlaNet, PlaNet + Overshooting, Deterministic (GRU), and Stochastic (SSM); rows alternate model predictions and ground truth, columns show context frames and predicted frames 6–50.]

Figure 10: Open-loop video predictions for test episodes. The columns 1–5 show reconstructed context frames and the remaining images are generated open-loop. Our RSSM achieves pixel-accurate predictions for 50 steps into the future in the cheetah environment. We randomly selected action sequences from test episodes collected with action noise alongside the training episodes.

    I. State Diagnostics

Figure 11: Open-loop state diagnostics. We freeze the dynamics model of a PlaNet agent and learn small neural networks to predict the true positions, velocities, and reward of the simulator. The open-loop predictions of these quantities show that most information about the underlying system is present in the learned latent space and can be accurately predicted forward further than the planning horizons used in this work.


    J. Planning Parameters

[Figure 12 heat maps: planning performance over proposals (100, 300, 500, 1000) and optimization iterations (3, 5, 10, 15) for fractions 0.05, 0.1, 0.3, 0.5 and planning horizons 6 to 14.]

Figure 12: Planning performance on the cheetah running task with the true simulator using different planner settings. Performance ranges from 132 (blue) to 837 (yellow). Evaluating more action sequences, optimizing for more iterations, and re-fitting to fewer of the best proposals tend to improve performance. A planning horizon length of 6 is not sufficient and results in poor performance. Much longer planning horizons hurt performance because of the increased search space. For this environment, the best planning horizon length is near 8 steps.