
Deep Reinforcement Learning amidst Continual Structured Non-Stationarity

Annie Xie 1 James Harrison 1 Chelsea Finn 1

Abstract

As humans, our goals and our environment are persistently changing throughout our lifetime based on our experiences, actions, and internal and external drives. In contrast, typical reinforcement learning problem set-ups consider decision processes that are stationary across episodes. Can we develop reinforcement learning algorithms that can cope with the persistent change in the former, more realistic problem settings? While on-policy algorithms such as policy gradients in principle can be extended to non-stationary settings, the same cannot be said for more efficient off-policy algorithms that replay past experiences when learning. In this work, we formalize this problem setting, and draw upon ideas from the online learning and probabilistic inference literature to derive an off-policy RL algorithm that can reason about and tackle such lifelong non-stationarity. Our method leverages latent variable models to learn a representation of the environment from current and past experiences, and performs off-policy RL with this representation. We further introduce several simulation environments that exhibit lifelong non-stationarity, and empirically find that our approach substantially outperforms approaches that do not reason about environment shift.

1 Stanford University. Correspondence to: Annie Xie <[email protected]>.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

1. Introduction

In the standard reinforcement learning (RL) set-up, the agent is assumed to operate in a stationary environment, i.e., under fixed dynamics and reward. However, the assumption of stationarity rarely holds in more realistic settings, such as in the context of lifelong learning systems (Thrun, 1998). That is, over the course of its lifetime, an agent may be subjected to environment dynamics and rewards that vary with time. In robotics applications, for example, this non-stationarity manifests itself in changing terrains and weather conditions. In some situations, not even the objective is necessarily fixed: consider an assistive robot helping a human whose preferences gradually change over time. And, because stationarity is a core assumption in many existing RL algorithms, they are unlikely to perform well in these environments.

Crucially, in each of the above scenarios, the environment is specified by unknown, time-varying parameters. These latent parameters are also not i.i.d., and in fact have associated but unobserved dynamics. For example, outdoor robots experience weather conditions that are determined by the season; a user-facing robot's task depends on the user's preferences, which can vary based on their day-to-day routine. We formalize this problem setting with the dynamic parameter Markov decision process (DP-MDP). The DP-MDP corresponds to a sequence of stationary MDPs, related through a set of latent parameters governed by an autonomous dynamical system. While non-stationary MDPs are special instances of the partially observable Markov decision process (POMDP) (Kaelbling et al., 1998), in this setting, we can leverage structure available in the dynamics of the hidden parameters and avoid solving POMDPs in the general case.

On-policy RL algorithms can in principle cope with such non-stationarity (Sutton et al., 2007). However, in highly dynamic environments, only a limited amount of interaction is permitted before the environment changes, and on-policy methods may fail to adapt rapidly enough in this low-shot setting (Al-Shedivat et al., 2017). Instead, we desire an off-policy RL algorithm that can use past experience both to improve sample efficiency and to reason about the environment dynamics. In order to adapt, the agent needs the ability to predict how the MDP parameters will shift. We thus require a representation of the MDP as well as a model of how parameters evolve in this space, both of which can be learned from off-policy experience.

To this end, our core contribution is an off-policy RL algorithm that can operate under non-stationarity by jointly learning (1) a latent variable model, which lends a compact representation of the MDP, and (2) a maximum entropy policy with this representation. We validate our approach, which we call Lifelong Latent Actor-Critic (LILAC), on a set of simulated environments that demonstrate persistent non-stationarity. In our experimental evaluation, we find that our method far outperforms RL algorithms that do not account for environment non-stationarity, handles extrapolating environment shifts, and retains strong performance in stationary settings.

2. Dynamic Parameter Markov Decision Processes

The standard RL setting assumes episodic interaction with a fixed MDP (Sutton & Barto, 2018). In the real world, the assumption of episodic interaction with identical MDPs is limiting as it does not capture the wide variety of exogenous factors that may affect the decision-making problem. A common model to avoid the strict assumption of Markovian observations is the partially observed MDP (POMDP) formulation (Kaelbling et al., 1998). While the POMDP is highly general, we focus on leveraging known structure of the non-stationary MDP in this work to improve performance. In particular, we consider an episodic environment, which we call the dynamic parameter MDP (DP-MDP), where a new MDP (we also refer to MDPs as tasks) is presented in each episode. In reflection of the regularity of real-world non-stationarity, the tasks are sequentially related through a set of continuous parameters.

Formally, the DP-MDP is equipped with state space S, action space A, and initial state distribution ρ_s(s_1). Following the formulation of the Hidden Parameter MDP (HiP-MDP) (Doshi-Velez & Konidaris, 2016), a set of unobserved task parameters z ∈ Z defines the dynamics p_s(s_{t+1} | s_t, a_t; z) and reward function r(s_t, a_t; z) for each task. In contrast to the HiP-MDP, the task parameters z in the DP-MDP are not sampled i.i.d. but instead shift stochastically according to p_z(z^{i+1} | z^{1:i}), with initial distribution ρ_z(z^1). In other words, the DP-MDP is a sequence of tasks with parameters determined by the transition function p_z. If the task parameters z for each episode were known, the augmented state space S × Z would define a fully observable MDP which we could solve with standard RL algorithms. Hence, in our approach, we aim to infer the hidden task parameters and learn their transition function, allowing us to leverage existing RL algorithms by augmenting the observations with the inferred task parameters.
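To make the generative process concrete, the following is a minimal sketch of a toy DP-MDP, not one of the environments used in this paper; the Gaussian random-walk drift of z, the linear dynamics, and the z-dependent goal-reaching reward are all illustrative assumptions.

```python
import numpy as np

class ToyDPMDP:
    """A toy DP-MDP: latent task parameters z drift between episodes and
    parameterize each episode's dynamics and reward (illustrative only)."""

    def __init__(self, horizon=50, drift_std=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.horizon = horizon
        self.drift_std = drift_std
        self.z = self.rng.normal(size=2)   # z^1 ~ rho_z
        self.t = 0
        self.s = None

    def reset(self):
        # Between episodes, z shifts according to p_z(z^{i+1} | z^i),
        # modeled here as a Gaussian random walk.
        self.z = self.z + self.drift_std * self.rng.normal(size=2)
        self.t = 0
        self.s = self.rng.normal(size=2)   # s_1 ~ rho_s
        return self.s

    def step(self, a):
        # Dynamics p_s(s_{t+1} | s_t, a_t; z) and reward r(s_t, a_t; z)
        # both depend on the hidden, unobserved task parameters z.
        self.s = self.s + a + 0.05 * self.z + 0.01 * self.rng.normal(size=2)
        reward = -float(np.linalg.norm(self.s - self.z))  # z acts as a moving goal
        self.t += 1
        return self.s, reward, self.t >= self.horizon
```

An agent interacting with such an environment only observes states and rewards; recovering something like z from experience is exactly the inference problem addressed in Section 4.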

Figure 1. The graphical model for the RL-as-inference framework consists of states s_t, actions a_t, and optimality variables O_t. By incorporating rewards through the optimality variables, learning an RL policy amounts to performing inference in this model.

Approximate model of continuously varying environments. Some environments may not exhibit shifts only at episode boundaries, and instead change more smoothly at every timestep. Formally, continuously varying environments have a set of task parameters z^i_t for each timestep t in each episodic interaction i. While these environments do not explicitly fall under the setting of DP-MDPs, the DP-MDP can exactly represent these environments when the intra-episode timestep t is either provided as part of the state s or can be inferred. One way to see this mapping is to define our DP-MDP such that the task parameters for episodic interaction i are the concatenation of all parameters of the episode, i.e., z^i = [z^i_t]_{t=1}^{T}. Then, if the continuously varying environment has dynamics p′_s(s^i_{t+1} | s^i_t, a^i_t; z^i_t) and reward function r′(s^i_t, a^i_t; z^i_t), the equivalent DP-MDP, with augmented state [s, t], is defined by:

p_s(s^i_{t+1} | s^i_t, a^i_t; z^i) = p′_s(s^i_{t+1} | s^i_t, a^i_t; z^i[t])
r(s^i_t, a^i_t; z^i) = r′(s^i_t, a^i_t; z^i[t]).

Furthermore, even when the timestep is not provided, the DP-MDP can still be viewed as a quantized model of these forms of environment shift, and using this quantization can be significantly more efficient in computation than modeling small changes at every single timestep. Under this interpretation, algorithms for solving DP-MDPs are not necessarily limited to environments with inter-episode shifts, and can be applied to fairly general non-stationary environments. We validate this claim in the experiments, and indeed find that the algorithm proposed in the next section can solve instances of continuously varying environments.
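As a concrete illustration of the reduction above, the wrapper below appends the intra-episode timestep t to the observation of an environment whose hidden parameters change every step; the reset/step interface is an assumption, not a specific benchmark API.

```python
import numpy as np

class TimestepAugmentedEnv:
    """Wraps a continuously varying environment so that the augmented state
    [s, t] satisfies the DP-MDP correspondence above (sketch)."""

    def __init__(self, env):
        self.env = env
        self.t = 0

    def reset(self):
        self.t = 0
        s = self.env.reset()
        return np.concatenate([s, [self.t]])   # state becomes [s, t]

    def step(self, a):
        s, reward, done = self.env.step(a)
        self.t += 1
        return np.concatenate([s, [self.t]]), reward, done
```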

3. Preliminaries: RL as Inference

We first discuss an established connection between probabilistic inference and reinforcement learning (Toussaint, 2009; Levine, 2018) to provide some context for our approach. At a high level, this framework casts sequential decision-making as a probabilistic graphical model, and from this perspective, the maximum-entropy RL objective can be derived as an inference procedure in this model.

3.1. A Probabilistic Graphical Model for RL

As depicted in Figure 1, the proposed model consists of states s_t, actions a_t, and per-timestep optimality variables O_t, which are related to rewards by p(O_t = 1 | s_t, a_t) = exp(r(s_t, a_t)) and denote whether the action a_t taken from state s_t is optimal. While rewards are required to be non-positive through this relation, so long as the rewards are bounded, they can be scaled and centered to be no greater than 0. A trajectory is the sequence of states and actions, (s_1, a_1, s_2, . . . , s_T, a_T), and we aim to infer the posterior distribution p(s_{1:T}, a_{1:T} | O_{1:T} = 1), i.e., the trajectory distribution that is optimal for all timesteps.

Figure 2. The graphical model for the DP-MDP. Each episode presents a new task, or MDP, determined by latent variables z. The MDPs are related through a transition function p_z(z^{i+1} | z^{1:i}).

3.2. Variational Inference

Among existing inference tools, structured variational inference is particularly appealing for its scalability and efficiency in approximating the distribution of interest. In the variational inference framework, a variational distribution q is optimized through the variational lower bound to approximate another distribution p. Assuming a uniform prior over actions, the optimal trajectory distribution is:

p(s_{1:T}, a_{1:T} | O_{1:T} = 1) ∝ p(s_{1:T}, a_{1:T}, O_{1:T} = 1) = p(s_1) ∏_{t=1}^{T} exp(r(s_t, a_t)) p(s_{t+1} | s_t, a_t).

For our approximating distribution, we can choose the form q(s_{1:T}, a_{1:T}) = p(s_1) ∏_{t=1}^{T} p(s_{t+1} | s_t, a_t) q(a_t | s_t), where p(s_1) and p(s_{t+1} | s_t, a_t) are fixed and given by the environment. We now rename q(a_t | s_t) to π(a_t | s_t) since this represents the desired policy. By Jensen's inequality, the variational lower bound for the evidence O_{1:T} = 1 is

log p(O_{1:T} = 1) = log E_q[ p(s_{1:T}, a_{1:T}, O_{1:T} = 1) / q(s_{1:T}, a_{1:T}) ] ≥ E_π[ ∑_{t=1}^{T} r(s_t, a_t) − log π(a_t | s_t) ],

which is the maximum entropy RL objective (Ziebart et al., 2008; Toussaint, 2009; Rawlik et al., 2013; Fox et al., 2015; Haarnoja et al., 2017). This objective adds a conditional entropy term and thus maximizes both returns and the entropy of the policy. This formulation is known for its improvements in exploration, robustness, and stability over other RL algorithms, so we build upon it in our method to inherit these qualities. We capture non-stationarity by augmenting the RL-as-inference model with latent variables z^i for each task i. As we will see in the next section, by viewing non-stationarity from this probabilistic perspective, our algorithm can be derived as an inference procedure in a unified model.

4. Off-Policy Reinforcement Learning in Non-Stationary Environments

Building upon the RL-as-inference framework, in this section, we offer a probabilistic graphical model that underlies the dynamic parameter MDP setting introduced in Section 2. Then, using tools from variational inference, we derive a variational lower bound that performs joint RL and representation learning. Finally, we present our RL algorithm, which we call Lifelong Latent Actor-Critic (LILAC), that optimizes this objective and builds upon soft actor-critic (Haarnoja et al., 2018), an off-policy maximum entropy RL algorithm.

4.1. Non-stationarity as a Probabilistic Model

We can cast the dynamic parameter MDP as a probabilistic hierarchical model, where non-stationarity occurs at the episodic level, and within each episode is an instance of a stationary MDP. To do so, we construct a two-tiered model: on the first level, we have the sequence of latent variables z^i as a Markov chain, and on the second level, a Markov decision process corresponding to each z^i. The graphical model formulation of the DP-MDP is illustrated in Figure 2.

Within this formulation, the trajectories gathered from each episode are modeled individually, rather than amortized as in Subsection 3.2. Let u^i represent the sequence of actions a^i_{1:T} taken in trajectory i. Then, the probability distribution p(z^{1:N}, τ^{1:N} | u^{1:N}) is defined as follows:

p(z^1) p(τ^1 | z^1, u^1) ∏_{i=2}^{N} p(z^i | z^{1:i−1}) p(τ^i | z^i, u^i),

where the probability of each trajectory τ conditioned on z and action sequence u is

p(τ | z, u) = p(s_1) ∏_{t=1}^{T} p(O_t = 1 | s_t, a_t; z) p(s_{t+1} | s_t, a_t; z) = p(s_1) ∏_{t=1}^{T} exp(r(s_t, a_t; z)) p(s_{t+1} | s_t, a_t; z).


With this factorization, the non-stationary elements of the environment are captured by the latent variables z, and within a task, the dynamics and reward functions are necessarily stationary. This suggests that learning to infer z, which amounts to representing the non-stationary elements of the environment with z, will reduce this RL setting to a stationary one. Taking this type of approach is appealing since there already exists a rich body of algorithms for the standard RL setting. In the next subsection, we describe how we can approximate the posterior over z by deriving the evidence lower bound for this model under the variational inference framework.

4.2. Joint Representation and Reinforcement Learning via Variational Inference

Recall that the agent is operating in an online learning setting. That is, it must continuously adapt to a stream of tasks and leverage experience gathered from previous tasks for learning. Thus, at any episode i > 1, the agent has observed all of the trajectories collected from episodes 1 through i − 1, τ^{1:i−1} = {τ^1, · · · , τ^{i−1}}, where τ = {s_1, a_1, r_1, . . . , s_T, a_T, r_T}.

We aim to infer, at every episode i, the posterior distribution over actions, given the evidence O^i_{1:T} = 1 and the experience from the previous episodes τ^{1:i−1}. Following Subsection 3.2, we can leverage variational inference to optimize a variational lower bound to the log-probability of this set of evidence conditioned on the actions taken, log p(τ^{1:i−1}, O^i_{1:T} = 1 | u^{1:i−1}), where u^i represents a^i_{1:T}. Since p(τ^{1:i−1}, O^i_{1:T} = 1, u^{1:i−1}) factorizes as p(τ^{1:i−1} | u^{1:i−1}) p(O^i_{1:T} = 1 | τ^{1:i−1}), the log-probability of the evidence can be decomposed into log p(τ^{1:i−1} | u^{1:i−1}) + log p(O^i_{1:T} = 1 | τ^{1:i−1}). These two terms can be separately lower bounded and summed to form a single objective.

The variational lower bound of the first term follows from that of a variational auto-encoder (Kingma & Welling, 2014) with evidence τ^{1:i−1} and latent variables z^{1:i−1}:

log p(τ^{1:i−1} | u^{1:i−1}) = log E_q[ p(τ^{1:i−1}, z^{1:i−1} | u^{1:i−1}) / q(z^{1:i−1}) ].

We choose our approximating distribution over the latent variables z^i to be conditioned on the trajectory from episode i, i.e., q(z^i | τ^i). Then, the variational lower bound is:

L_rep = E_q[ ∑_{j=1}^{i} ( ∑_{t=1}^{T} log p(s^j_{t+1}, r^j_t | s^j_t, a^j_t, z^j) − D_KL(q(z^j | τ^j) || p(z^j | z^{1:j−1})) ) ].

Figure 3. An overview of our network architecture. Our method consists of the actor π, the critic Q, an inference network q, a decoder network, and a learned prior over latent embeddings. Each component is implemented with a neural network.

The lower bound L_rep corresponds to an objective for unsupervised representation learning in a sequential latent variable model. By optimizing the reconstruction loss of the transitions and rewards for each episode, the learned latent variables should encode the varying parameters of the MDP. Further, by imposing the prior p(z^i | z^{1:i−1}) on the approximated distribution q through the KL divergence, the latent variables are encouraged to be sequentially consistent across time. This prior corresponds to a model of the environment's latent dynamics and gives the agent a predictive estimate of future conditions of the environment (to the extent to which the DP-MDP is predictable).
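As a sketch of how these two terms can be computed for a single episode j, assuming the Gaussian parameterizations used in the implementation (Section 4.3): q_mu and q_std come from the inference network applied to τ^j, p_mu and p_std from the learned prior given z^{1:j−1}, and pred holds decoder means for the next states and rewards; the unit-variance decoder likelihood is an illustrative assumption.

```python
import torch
from torch.distributions import Normal, kl_divergence

def rep_loss(pred, target, q_mu, q_std, p_mu, p_std):
    # Reconstruction term of L_rep: log-likelihood of (s_{t+1}, r_t) under the
    # decoder, here a unit-variance Gaussian around the decoder output.
    recon = Normal(pred, 1.0).log_prob(target).sum()
    # KL term of L_rep: keeps the inferred posterior q(z^j | tau^j) close to
    # the prior's prediction p(z^j | z^{1:j-1}).
    kl = kl_divergence(Normal(q_mu, q_std), Normal(p_mu, p_std)).sum()
    return -(recon - kl)   # negate the per-episode bound to obtain a loss to minimize
```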

For the second term,

log p(O^i_{1:T} = 1 | τ^{1:i−1}) = log ∫ p(O^i_{1:T} = 1, z^i | τ^{1:i−1}) dz^i
= log ∫ p(O^i_{1:T} = 1 | z^i) p(z^i | τ^{1:i−1}) dz^i
≥ E_{p(z^i | τ^{1:i−1})}[ log p(O^i_{1:T} = 1 | z^i) ]
≥ E_{p(z^i | τ^{1:i−1}), π(a_t | s_t, z^i)}[ ∑_{t=1}^{T} r(s_t, a_t; z^i) − log π(a_t | s_t, z^i) ] = L_RL.

The final inequality is given by steps from Subsection 3.2. The bound L_RL optimizes for both policy returns and policy entropy, as in the maximum entropy RL objective, but here the policy is also conditioned on the inferred latent embeddings of the MDP. This objective essentially performs task-conditioned reinforcement learning where the task variables at episode i are given by p(z^i | τ^{1:i−1}). Learning a multi-task RL policy is appealing, especially compared to a policy that adapts between episodes. That is, if the shifts in the environment are similar to those seen previously, we do not expect its performance to degrade even if the environment is shifting quickly, whereas a single-task policy would likely struggle to adapt quickly enough.

Our proposed objective is the sum of the above two terms, L = L_rep + L_RL, which is also a variational lower bound for our entire model. Hence, while our objective was derived from and can be understood as an inference procedure in our probabilistic model, it also decomposes into two very intuitive objectives, with the first corresponding to unsupervised representation learning and the second corresponding to reinforcement learning.

4.3. Implementation

We introduce an inference network that outputs a distribution over latent variables, q(z^i | τ^i), conditioned on the trajectory from the i-th episode. The network outputs the parameters of a Gaussian distribution, and we use the reparameterization trick (Kingma & Welling, 2014) to sample z^i. The weights of the inference network are trained with gradients from both L_rep and L_RL, which we detail below.

Optimizing L_rep. Like in the standard VAE objective, the lower bound is L_rep = −(J_dec + J_KL), where

J_dec = −E_q[ ∑_{j=1}^{i} ∑_{t=1}^{T} log p(s^j_{t+1}, r^j_t | s^j_t, a^j_t, z^j) ]

J_KL = E_q[ ∑_{j=1}^{i} D_KL(q(z^j | τ^j) || p(z^j | z^{1:j−1})) ].

A decoder neural network reconstructs transitions and rewards given the latent embedding z^i, current state s_t, and action taken a_t. Finally, we approximate p(z^i | z^{1:i−1}) and p(z^i | τ^{1:i−1}) with a shared long short-term memory (LSTM) network (Hochreiter & Schmidhuber, 1997), which receives z^{i−1} from q(z^{i−1} | τ^{i−1}) and hidden state h^{i−1}, and produces z^i and the next hidden state h^i.
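The sketch below gives one plausible PyTorch rendering of these three components, i.e., the inference network q(z^i | τ^i) with the reparameterization trick, the transition/reward decoder, and the LSTM prior; the layer sizes, the mean-pooling over timesteps, and the diagonal-Gaussian heads are our assumptions rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn

class InferenceNetwork(nn.Module):
    # q(z^i | tau^i): encode each transition, pool over the episode, and
    # output a diagonal Gaussian over the latent variables.
    def __init__(self, transition_dim, latent_dim, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(transition_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),       # mean and log-variance
        )

    def forward(self, traj):                          # traj: [T, transition_dim]
        stats = self.encoder(traj).mean(dim=0)        # pool over timesteps
        mu, log_var = stats.chunk(2, dim=-1)
        std = (0.5 * log_var).exp()
        z = mu + std * torch.randn_like(std)          # reparameterization trick
        return z, mu, std

class Decoder(nn.Module):
    # Reconstructs (s_{t+1}, r_t) from (s_t, a_t, z).
    def __init__(self, state_dim, action_dim, latent_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim + 1),
        )

    def forward(self, s, a, z):
        return self.net(torch.cat([s, a, z], dim=-1))

class LatentPrior(nn.Module):
    # Shared LSTM mapping z^{i-1} and hidden state h^{i-1} to a Gaussian over
    # z^i, approximating p(z^i | z^{1:i-1}).
    def __init__(self, latent_dim, hidden=128):
        super().__init__()
        self.latent_dim = latent_dim
        self.lstm = nn.LSTM(latent_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2 * latent_dim)

    def forward(self, z_prev, hc=None):               # z_prev: [1, 1, latent_dim]
        out, hc = self.lstm(z_prev, hc)
        mu, log_var = self.head(out[:, -1]).chunk(2, dim=-1)
        return mu, (0.5 * log_var).exp(), hc
```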

Optimizing L_RL. To optimize L_RL, we extend soft actor-critic (SAC) (Haarnoja et al., 2018), which implements maximum entropy off-policy RL. As depicted in Figure 3, the policy and critic are conditioned on the environment state and the latent variables z. During training, z is sampled from q(z | τ) outputted by the inference network. At execution time, the latent variables z the policy conditions on are given by the LSTM network, based on the inferred latent variables from the previous episode. Following SAC (Haarnoja et al., 2018), the actor loss J_π and critic loss J_Q are

J_π = E_{τ∼D, z∼q(·|τ)}[ D_KL( π(a | s, z) || exp(Q(s, a, z)) / Z(s) ) ],

J_Q = E_{τ∼D, z∼q(·|τ)}[ (Q(s, a, z) − (r + V(s′, z)))^2 ],

where V denotes the target network. Our complete algorithm, Lifelong Latent Actor-Critic (LILAC), is summarized in Algorithm 1.

Algorithm 1 Lifelong Latent Actor-Critic (LILAC)

Input: env, α_Q, α_π, α_enc, α_dec, α_ψ
Randomly initialize θ_Q, θ_π, φ_enc, φ_dec, and ψ
Initialize empty replay buffer D
Assign z^1 ← 0
for i = 1, 2, . . . do
    Sample z^i ∼ p_ψ(z^i | z^{1:i−1})
    Collect trajectory τ^i from env with π_θ(a | s, z)
    Update replay buffer D[i] ← τ^i
    for j = 1, 2, . . . , N do
        Sample a batch of episodes E from D
        ▷ Update actor and critic
        θ_Q ← θ_Q − α_Q ∇_{θ_Q} J_Q
        θ_π ← θ_π − α_π ∇_{θ_π} J_π
        ▷ Update inference network
        φ_enc ← φ_enc − α_enc ∇_{φ_enc} (J_dec + J_KL + J_Q)
        ▷ Update model
        φ_dec ← φ_dec − α_dec ∇_{φ_dec} J_dec
        ψ ← ψ − α_ψ ∇_ψ J_KL
    end for
end for
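To connect the actor-critic updates in Algorithm 1 to code, here is a sketch of the latent-conditioned losses J_Q and J_π written against hypothetical actor, critic, and target_critic modules that accept the latent z alongside the state (and action); the entropy temperature alpha and discount gamma are standard SAC hyperparameters, and the exact network interfaces are assumptions.

```python
import torch

def critic_loss(critic, target_critic, actor, batch, z, gamma=0.99, alpha=0.2):
    s, a, r, s_next, done = batch
    with torch.no_grad():
        a_next, log_pi_next = actor.sample(s_next, z)
        # Soft value of the next state under the target network:
        # V(s', z) = Q_target(s', a', z) - alpha * log pi(a' | s', z)
        v_next = target_critic(s_next, a_next, z) - alpha * log_pi_next
        q_target = r + gamma * (1.0 - done) * v_next
    return ((critic(s, a, z) - q_target) ** 2).mean()

def actor_loss(critic, actor, batch, z, alpha=0.2):
    s = batch[0]
    a, log_pi = actor.sample(s, z)
    # KL(pi(.|s,z) || exp(Q(s,.,z)) / Z(s)) up to a constant independent of pi.
    return (alpha * log_pi - critic(s, a, z)).mean()
```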

5. Related Work

Partial observability in RL. The POMDP is a general, flexible framework capturing non-stationarity and partial observability in sequential decision-making problems. While exact solution methods are tractable only for tiny state and action spaces (Kaelbling et al., 1998), methods based (primarily) on approximate Bayesian inference have enabled scaling to larger problems over the course of the past two decades (Kurniawati et al., 2008; Roy et al., 2005). In recent years, representation learning, and especially deep learning paired with amortized variational inference, has enabled scaling to a larger class of problems, including continuous state and action spaces (Igl et al., 2018; Han et al., 2020; Lee et al., 2019a; Hafner et al., 2019) and image observations (Lee et al., 2019a; Kapturowski et al., 2019). However, the generality of the POMDP formulation both ignores possible performance improvements that may be realized by exploiting the structure of the DP-MDP, and does not explicitly consider between-episode non-stationarity.

A variety of intermediate problem statements between episodic MDPs and POMDPs have been proposed. The Bayes-adaptive MDP formulation (BAMDP) (Duff, 2002; Ross et al., 2008), as well as the hidden parameter MDP (HiP-MDP) (Doshi-Velez & Konidaris, 2016), consider an MDP with unknown parameters governing the reward and dynamics, which we aim to infer online over the course of one episode. In this formulation, the exploration-exploitation dilemma is resolved by augmenting the state space with a representation of posterior belief over the latent parameters. As noted by Duff (2002) in the RL literature and Feldbaum (1960); Bar-Shalom & Tse (1974) in control theory, this representation rapidly becomes intractable due to exploding state dimensionality. Recent work has developed effective methods for policy optimization in BAMDPs via, primarily, amortized inference (Zintgraf et al., 2020; Rakelly et al., 2019; Lee et al., 2019b). However, the BAMDP framework does not address the dynamics of the latent parameter between episodes, assuming a temporally-fixed structure. In contrast, we are capable of modeling the evolution of the latent variable over the course of episodes, leading to better priors for online inference.

A strongly related setting is the hidden-mode MDP (HM-MDP) (Choi et al., 2000), which augments the MDP with a latent parameter that evolves via a hidden Markov model with a discrete number of states. Algorithms that study the HM-MDP setting aim to quickly detect changes in the environment (Da Silva et al., 2006; Hadoux et al., 2014; Banerjee et al., 2017; Padakandla et al., 2020), while LILAC aims to anticipate future changes and adapt as they happen. In both the HM-MDP and the DP-MDP, the latent variable evolves infrequently, as opposed to at every time step as in the POMDP. The HM-MDP is limited to a fixed number of latent variable states due to the use of standard HMM inference algorithms. In contrast, our approach allows continuous latent variables, thus widely extending the range of applicability.

Non-stationarity in learning. LILAC also shares conceptual similarities with methods from online learning and lifelong learning (Shalev-Shwartz, 2012; Gama et al., 2014), which aim to capture non-stationarity in supervised learning, as well as meta-learning and meta-reinforcement learning algorithms, which aim to rapidly adapt to new settings. Within meta-reinforcement learning, two dominant techniques exist: optimization-based (Finn et al., 2017; Rothfuss et al., 2019; Zintgraf et al., 2019; Stadie et al., 2018) and context-based, which includes both recurrent architectures (Duan et al., 2016; Wang et al., 2016; Mishra et al., 2018) and architectures based on latent variable inference (Rakelly et al., 2019; Lee et al., 2019a; Zintgraf et al., 2020). LILAC fits into this last category within this taxonomy, but extends previous methods by considering inter-episode latent variable dynamics. Previous embedding-based meta-RL algorithms, while able to perform online inference of latent variables and incorporate this posterior belief into action selection, do not consider how these latent variables evolve over the lifetime of the agent, as in the DP-MDP setting. The inner latent variable inference component of LILAC possesses strong similarities to the continual and lifelong learning setting (Gama et al., 2014). Many continual and lifelong learning methods aim to learn a variety of tasks without forgetting previous tasks (Kirkpatrick et al., 2017; Zenke et al., 2017; Lopez-Paz et al., 2017; Aljundi et al., 2019; Parisi et al., 2019; Rusu et al., 2016; Shmelkov et al., 2017; Rebuffi et al., 2017; Shin et al., 2017). We consider a setting where it is practical to store past experiences in a replay buffer (Rolnick et al., 2019; Finn et al., 2019). Unlike these prior works, LILAC aims to learn the dynamics associated with latent factors, and perform online inference.

Non-stationarity within RL. Chandak et al. (2020) study a setting similar to ours, where the reward and transition dynamics change smoothly across episodes, and propose to use curve-fitting to estimate performance on future MDPs and learn a single policy that optimizes for future performance. This need for continual policy adaptation can result in performance lag in quickly changing environments; in contrast, LILAC learns a latent variable-conditioned policy, where different MDPs map to different values for these latent variables, and thus should be less sensitive to the rate of non-stationarity.

6. Experiments

In our experiments, we aim to address our central hypothesis: that existing off-policy RL algorithms struggle under persistent non-stationarity and that, by leveraging our latent variable model, LILAC can make learning in such settings both effective and efficient. To do so, we evaluate the agent's learning performance in various non-stationary environments, including environments with varying rates of change, intra-episodic shifts, and task parameters that exhibit extrapolating shifts.

Environments. We construct four continuous control environments with varying sources of change in the reward and/or dynamics. These environments are designed such that the policy needs to change in order to achieve good performance. The first is derived from the simulated Sawyer reaching task in the Meta-World benchmark (Yu et al., 2019), in which the target position is not observed and moves between episodes. In the second environment, based on Half-Cheetah from OpenAI Gym (Brockman et al., 2016), we consider changes in the direction and magnitude of wind forces on the agent, and changes in the target velocity. We next consider the 8-DoF minitaur environment (Tan et al., 2018) and vary the mass of the agent between episodes, representative of a varying payload. Finally, we construct a 2D navigation task in an infinite, non-episodic environment with non-stationary dynamics, which we call 2D Open World. The agent's goal is to collect food pellets and to avoid other objects and obstacles, whilst subjected to unknown perturbations that vary on an episodic schedule. These environments are illustrated in Figure 4. For full environment details, see Appendix A.

Comparisons. We compare our approach to standard soft actor-critic (SAC) (Haarnoja et al., 2018), which corresponds to our method without any latent variables, allowing us to evaluate the performance of off-policy algorithms amid non-stationarity. We also compare to stochastic latent actor-critic (SLAC) (Lee et al., 2019a), which learns to model partially observed environments with a latent variable model but does not address inter-episode non-stationarity. This comparison allows us to evaluate the importance of modeling non-stationarity between episodes. Finally, we include proximal policy optimization (PPO) (Schulman et al., 2017) as a comparison to on-policy RL. Since the tasks in the Sawyer and Half-Cheetah domains involve goal reaching, we can obtain an oracle by training a goal-conditioned SAC policy, i.e., with the true goal concatenated to the observation. We provide this comparison to help contextualize the performance of our method against other algorithms. We tune the hyperparameters for all approaches, and run each with the best hyperparameter setting with 3 random seeds¹. For all hyperparameter details, see Appendix B.

Figure 4. The environments in our evaluation. Each environment changes over the course of learning, including a changing target reaching position (left), variable wind and goal velocities (middle left), and variable payloads (middle right). We also introduce a 2D open world environment with non-stationary dynamics and visualize a partial snapshot of the LILAC agent's lifetime in purple (right).

Figure 5. Learning curves across our experimental domains. In all settings, our approach is substantially more stable and successful than SAC, SLAC, and PPO. As demonstrated in Half-Cheetah with varying target velocities and wind forces, our method can cope with non-stationarity in both dynamics and rewards. Error bars reflect 95% confidence intervals.

Results. Our experimental results are shown in Figure 5. Since on-policy algorithms tend to have worse sample complexity, we run PPO for 10 million environment steps and plot only the asymptotic returns. In all domains, LILAC attains higher and more stable returns compared to SAC, SLAC, and PPO. Since SAC amortizes experience collected across episodes into a single replay buffer, we observe that the algorithm converges to an averaged behavior. Meanwhile, SLAC does not have the mechanism to model non-stationarity across episodes, and has to infer the unknown dynamics and reward from the initial steps taken during each episode, which the algorithm is not very successful at. Due to the cyclical nature of the tasks, the learned behavior of SLAC results in oscillating returns across tasks. Similarly, PPO cannot adapt to per-episode changes in the environment and ultimately converges to learning an average policy. In contrast to these methods, LILAC infers how the environment changes in future episodes and steadily maintains high rewards over the training procedure, despite experiencing persistent shifts in the environment in each episode. Further, LILAC can learn under simultaneous shifts in both dynamics and rewards, as verified by the HC WindVel results. LILAC can also adeptly handle shifts in the 2D Open World environment without episodic resets. A partial snapshot of the agent's lifetime from this task is visualized in Figure 4.

¹We ran SAC in the Minitaur task with additional seeds for a total of 5 seeds, as recommended by a significance test of our results. The analysis, presented in Appendix C, suggests that no additional seeds are necessary for each of the other algorithms or environments.

Varying rates of environment shift. We next evaluate LILAC under varying rates of non-stationarity. To do so, we use the Sawyer reaching domain, where the goal moves along a fixed-radius circle, and vary the step size along the circle (0.2, 0.4, 0.6, and 0.8 radians/step) to generate environments that shift at different speeds. As depicted in Figure 6a, LILAC's performance is largely independent of the environment's rate of change. We also evaluate LILAC under stationary conditions, i.e., with a fixed goal, and find it achieves the same performance as SAC, thus retaining the ability to learn as effectively as SAC in a fixed environment. These results demonstrate LILAC's efficacy under a range of rates of non-stationarity, including the stationary case.
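For intuition on the shift process in this experiment, the toy update below advances a reaching goal along a fixed-radius circle by a constant angular step between episodes; the radius and center values are placeholders, and only the step sizes come from the evaluation above.

```python
import numpy as np

def advance_goal(theta, step=0.2, radius=0.15, center=(0.0, 0.6)):
    # One episode boundary: rotate the goal by `step` radians along the circle.
    theta = theta + step
    goal = np.array(center) + radius * np.array([np.cos(theta), np.sin(theta)])
    return theta, goal
```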

The gap in LILAC's performance between the non-stationary and stationary cases can likely be explained by the estimation error of future environment conditions given by the prior p_ψ(z^i | z^{1:i−1}). Currently, the executed policy uses a fixed z given by the prior for the entire duration of the episode, but a natural extension that may improve performance is updating z during each episode. In particular, we could encode the collected partial trajectory with the inference network and combine the inferred values with the prior to form an updated estimate, akin to Bayesian filtering.

Figure 6. (a) LILAC and SAC evaluated in the Sawyer task with varying rates of non-stationarity (0.2, 0.4, 0.6, and 0.8 radians/step). (b) We introduce a continuously varying variant of the Sawyer task with intra-episodic shifts, and evaluate LILAC with and without the timestep t included in the state s, finding that both are robust to shifts at every timestep. (c) The task parameters in this setting exhibit extrapolating shift: the target moves along a never-ending line between episodic trials. The LILAC agent can continually reach new goals, while the performance of SAC degrades over time.

Intra-episodic environment shifts. As described in Section 2, the DP-MDP can exactly represent environments that change at every timestep when the timestep t is provided as part of the state s. Even when the timestep t is not given, however, the DP-MDP can still be viewed as a quantization of these environments. To empirically investigate this proposition, we evaluate LILAC in a modified Sawyer reaching task: the target now moves after every timestep instead of every episode, thereby introducing intra-episode shifts. Here, the target moves at the same rate per episode as in the original setting, but moves smoothly over time. We evaluate LILAC with and without the timestep t given in the state, and as shown in Figure 6b, LILAC is robust to shifts in both scenarios and significantly outperforms SAC. Hence, our approach can handle a wide subset of non-stationary environments, including those that change at every timestep.

Extrapolating environment shifts. To understand whether LILAC can cope with other open-world environments, we study a setting in which the dynamic parameters of the environment exhibit extrapolating shift. Deep RL algorithms generally struggle to generalize to out-of-distribution environment conditions (Kumar et al., 2020; Mendonca et al., 2020; Agarwal et al., 2021). In this experiment, we study an instance of extrapolating variations in the task. Specifically, we construct a Sawyer reaching task in which the goal gradually moves along a never-ending line between trials. Our results, presented in Figure 6c, indicate that LILAC can indeed learn to model as well as reach extrapolating goal positions, especially when compared to the SAC agent, whose performance degrades over time.

7. Conclusion

We considered the problem of reinforcement learning with persistent but structured non-stationarity, a problem which we believe is a step towards reinforcement learning systems operating in the real world. This problem is at the intersection of reinforcement learning under partial observability (i.e., POMDPs) and online learning; hence, we formalized the problem as a special case of a POMDP that is also significantly more tractable. We derive a graphical model underlying this problem setting, and utilize it to derive our approach under the formalism of reinforcement learning as probabilistic inference (Levine, 2018). Our method leverages this latent variable model to model the change in the environment, and conditions the policy and critic on the inferred values of these latent variables. On several challenging continuous control tasks with significant non-stationarity, we observe that our approach leads to substantial improvement compared to state-of-the-art RL methods.

While the DP-MDP formulation represents a strict generalization of the commonly-considered meta-reinforcement learning settings (typically, a BAMDP (Zintgraf et al., 2020)), it is still somewhat limited in its generality. In particular, the assumption of task parameters shifting between episodes presents a possibly unrealistic limitation, but can be relaxed when the timestep is given, or can be inferred. Otherwise, the DP-MDP can still be viewed as an approximation of environments with intra-episodic shifts. For highly infrequent shifts, however, we may need to leverage alternative tools; in particular, this notion of infrequent, discrete shifts underlies the changepoint detection literature (Adams & MacKay, 2007; Fearnhead & Liu, 2007). Previous work within sequential decision making in changing environments (Da Silva et al., 2006; Hadoux et al., 2014; Banerjee et al., 2017) and meta-learning within changing data streams (Harrison et al., 2019) may enable a version of LILAC capable of handling unobserved changepoints.


Acknowledgements

The authors would also like to thank Allan Zhou, Evan Liu, and Laura Smith for helpful feedback on an early version of this paper. This work was supported by an NSF GRFP, ONR grant N00014-20-1-2675, and JPMorgan Chase & Co. Any views or opinions expressed herein are solely those of the authors listed, and may differ from the views and opinions expressed by JPMorgan Chase & Co. or its affiliates. This material is not a product of the Research Department of J.P. Morgan Securities LLC. This material should not be construed as an individual recommendation for any particular client and is not intended as a recommendation of particular securities, financial instruments or strategies for a particular client. This material does not constitute a solicitation or offer in any jurisdiction.

References

Adams, R. P. and MacKay, D. J. Bayesian online changepoint detection. arXiv:0710.3742, 2007.

Agarwal, R., Machado, M. C., Castro, P. S., and Bellemare, M. G. Contrastive behavioral similarity embeddings for generalization in reinforcement learning. International Conference on Learning Representations (ICLR), 2021.

Al-Shedivat, M., Bansal, T., Burda, Y., Sutskever, I., Mordatch, I., and Abbeel, P. Continuous adaptation via meta-learning in nonstationary and competitive environments. International Conference on Learning Representations (ICLR), 2017.

Aljundi, R., Kelchtermans, K., and Tuytelaars, T. Task-free continual learning. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

Banerjee, T., Liu, M., and How, J. P. Quickest change detection approach to optimal control in Markov decision processes with model changes. American Control Conference (ACC), 2017.

Bar-Shalom, Y. and Tse, E. Dual effect, certainty equivalence, and separation in stochastic control. IEEE Transactions on Automatic Control, 19(5):494–500, 1974.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

Chandak, Y., Theocharous, G., Shankar, S., Mahadevan, S., White, M., and Thomas, P. S. Optimizing for the future in non-stationary MDPs. ICML, 2020.

Choi, S. P., Yeung, D.-Y., and Zhang, N. L. Hidden-mode Markov decision processes for nonstationary sequential decision making. In Sequence Learning, pp. 264–287. Springer, 2000.

Colas, C., Sigaud, O., and Oudeyer, P.-Y. How many random seeds? Statistical power analysis in deep reinforcement learning experiments. arXiv preprint arXiv:1806.08295, 2018.

Da Silva, B. C., Basso, E. W., Bazzan, A. L., and Engel, P. M. Dealing with non-stationary environments using context detection. ICML, 2006.

Doshi-Velez, F. and Konidaris, G. Hidden parameter Markov decision processes: A semiparametric regression approach for discovering latent task parametrizations. International Joint Conference on Artificial Intelligence (IJCAI), 2016.

Duan, Y., Schulman, J., Chen, X., Bartlett, P. L., Sutskever, I., and Abbeel, P. RL2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.

Duff, M. O. Optimal Learning: Computational procedures for Bayes-adaptive Markov decision processes. PhD thesis, University of Massachusetts at Amherst, 2002.

Fearnhead, P. and Liu, Z. On-line inference for multiple changepoint problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2007.

Feldbaum, A. Dual control theory. Avtomatika i Telemekhanika, 1960.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. ICML, 2017.

Finn, C., Rajeswaran, A., Kakade, S., and Levine, S. Online meta-learning. ICML, 2019.

Fox, R., Pakman, A., and Tishby, N. Taming the noise in reinforcement learning via soft updates. arXiv preprint arXiv:1512.08562, 2015.

Gama, J., Zliobaite, I., Bifet, A., Pechenizkiy, M., and Bouchachia, A. A survey on concept drift adaptation. ACM Computing Surveys (CSUR), 2014.

Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. ICML, 2017.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. ICML, 2018.

Hadoux, E., Beynier, A., and Weng, P. Sequential decision-making under non-stationary environments via sequential change-point detection. 2014.


Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., and Davidson, J. Learning latent dynamics for planning from pixels. ICML, 2019.

Han, D., Doya, K., and Tani, J. Variational recurrent models for solving partially observable control tasks. International Conference on Learning Representations (ICLR), 2020.

Harrison, J., Sharma, A., Finn, C., and Pavone, M. Continuous meta-learning without tasks. arXiv preprint arXiv:1912.08866, 2019.

Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. Deep reinforcement learning that matters. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Igl, M., Zintgraf, L., Le, T. A., Wood, F., and Whiteson, S. Deep variational reinforcement learning for POMDPs. ICML, 2018.

Kaelbling, L. P., Littman, M. L., and Cassandra, A. R. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 1998.

Kapturowski, S., Ostrovski, G., Dabney, W., Quan, J., and Munos, R. Recurrent experience replay in distributed reinforcement learning. International Conference on Learning Representations (ICLR), 2019.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. International Conference on Learning Representations (ICLR), 2014.

Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 2017.

Kumar, S., Kumar, A., Levine, S., and Finn, C. One solution is not all you need: Few-shot extrapolation via structured MaxEnt RL. NeurIPS, 33, 2020.

Kurniawati, H., Hsu, D., and Lee, W. S. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. Robotics: Science and Systems (RSS), 2008.

Lee, A. X., Nagabandi, A., Abbeel, P., and Levine, S. Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model. arXiv preprint arXiv:1907.00953, 2019a.

Lee, G., Hou, B., Mandalika, A., Lee, J., Choudhury, S., and Srinivasa, S. S. Bayesian policy optimization for model uncertainty. International Conference on Learning Representations (ICLR), 2019b.

Levine, S. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.

Lopez-Paz, D. et al. Gradient episodic memory for continual learning. NeurIPS, 2017.

Mendonca, R., Geng, X., Finn, C., and Levine, S. Meta-reinforcement learning robust to distributional shift via model identification and experience relabeling. arXiv preprint arXiv:2006.07178, 2020.

Mishra, N., Rohaninejad, M., Chen, X., and Abbeel, P. A simple neural attentive meta-learner. International Conference on Learning Representations (ICLR), 2018.

Padakandla, S., Prabuchandran, K., and Bhatnagar, S. Reinforcement learning algorithm for non-stationary environments. Applied Intelligence, 50(11):3590–3606, 2020.

Parisi, G. I., Kemker, R., Part, J. L., Kanan, C., and Wermter, S. Continual lifelong learning with neural networks: A review. Neural Networks, 2019.

Rakelly, K., Zhou, A., Quillen, D., Finn, C., and Levine, S. Efficient off-policy meta-reinforcement learning via probabilistic context variables. ICML, 2019.

Rawlik, K., Toussaint, M., and Vijayakumar, S. On stochastic optimal control and reinforcement learning by approximate inference. International Joint Conference on Artificial Intelligence (IJCAI), 2013.

Rebuffi, S.-A., Kolesnikov, A., and Lampert, C. H. iCaRL: Incremental classifier and representation learning. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

Rolnick, D., Ahuja, A., Schwarz, J., Lillicrap, T., and Wayne, G. Experience replay for continual learning. NeurIPS, 2019.

Ross, S., Chaib-draa, B., and Pineau, J. Bayes-adaptive POMDPs. NeurIPS, 2008.

Rothfuss, J., Lee, D., Clavera, I., Asfour, T., and Abbeel, P. ProMP: Proximal meta-policy search. International Conference on Learning Representations (ICLR), 2019.

Roy, N., Gordon, G., and Thrun, S. Finding approximate POMDP solutions through belief compression. Journal of Artificial Intelligence Research, 23:1–40, 2005.


Rusu, A. A., Rabinowitz, N. C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., Pascanu, R., and Hadsell, R. Progressive neural networks. arXiv:1606.04671, 2016.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Shalev-Shwartz, S. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2012.

Shin, H., Lee, J. K., Kim, J., and Kim, J. Continual learning with deep generative replay. NeurIPS, 2017.

Shmelkov, K., Schmid, C., and Alahari, K. Incremental learning of object detectors without catastrophic forgetting. arXiv:1708.06977, 2017.

Stadie, B. C., Yang, G., Houthooft, R., Chen, X., Duan, Y., Wu, Y., Abbeel, P., and Sutskever, I. Some considerations on learning to explore via meta-reinforcement learning. arXiv:1803.01118, 2018.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.

Sutton, R. S., Koop, A., and Silver, D. On the role of tracking in stationary environments. ICML, 2007.

Tan, J., Zhang, T., Coumans, E., Iscen, A., Bai, Y., Hafner, D., Bohez, S., and Vanhoucke, V. Sim-to-real: Learning agile locomotion for quadruped robots. Robotics: Science and Systems (RSS), 2018.

Thrun, S. Lifelong learning algorithms. In Learning to Learn, pp. 181–209. Springer, 1998.

Toussaint, M. Robot trajectory optimization using approximate inference. ICML, 2009.

Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D., and Botvinick, M. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.

Yu, T., Quillen, D., He, Z., Julian, R., Hausman, K., Finn, C., and Levine, S. Meta-World: A benchmark and evaluation for multi-task and meta reinforcement learning. Conference on Robot Learning (CoRL), 2019.

Zenke, F., Poole, B., and Ganguli, S. Continual learning through synaptic intelligence. ICML, 2017.

Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. Maximum entropy inverse reinforcement learning. AAAI, 2008.

Zintgraf, L., Shiarlis, K., Igl, M., Schulze, S., Gal, Y., Hofmann, K., and Whiteson, S. VariBAD: A very good method for Bayes-adaptive deep RL via meta-learning. International Conference on Learning Representations (ICLR), 2020.

Zintgraf, L. M., Shiarlis, K., Kurin, V., Hofmann, K., and Whiteson, S. Fast context adaptation via meta-learning. ICML, 2019.