
Meta Reinforcement Learning Learning to Explore


Page 1: Meta Reinforcement Learning Learning to Explore

CS 330

Meta Reinforcement Learning Learning to Explore


Page 2: Meta Reinforcement Learning Learning to Explore

Course Reminders


Homework 3, due tonight.

Optional Homework 4 out today.

Project milestone due in two weeks: Wednesday 11/10.

Page 3: Meta Reinforcement Learning Learning to Explore

Plan for Today

Lecture goals:
- Understand the challenge of end-to-end optimization of exploration
- Understand the basics of using alternative exploration strategies in meta-RL
- Understand & be able to implement decoupled exploration & exploitation

Recap: Meta-RL Basics

Learning to Explore:
- End-to-End Optimization of Exploration Strategies
- Alternative Exploration Strategies
- Decoupled Exploration & Exploitation   <- focus of HW4

Page 4: Meta Reinforcement Learning Learning to Explore

Recall: Meta-RL Maze Navigation Example

By learning how to learn many tasks:
- [meta] train time: meta-training tasks 𝒯1, 𝒯2, …
- [meta] test time: given a small amount of experience, learn to solve the task

(diagram adapted from Duan et al. '17)

Page 5: Meta Reinforcement Learning Learning to Explore

Recall: The Meta Reinforcement Learning Problem

Inputs: 𝒟tr, the k rollouts collected for each task
Outputs: a policy adapted to the task
Data: a dataset of datasets, one per task

What must be learned: the design & optimization of f *and* collecting appropriate data (learning to explore).

Finn. Learning to Learn with Gradients. PhD Thesis 2018

This lecture: how to learn to collect 𝒟tr.

Page 6: Meta Reinforcement Learning Learning to Explore

Recall: Black-Box Meta-RL

A black-box network (LSTM, NTM, Conv, …) processes the exploration episode(s) and then the execution episode(s).

+ general & expressive
+ a variety of design choices in architecture
~ inherits sample efficiency from outer RL optimizer
-- hard to optimize
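To make the black-box recap concrete, here is a minimal sketch (assumptions: PyTorch, a discrete action space; the class name RL2Policy and all hyperparameters are illustrative, not the course's homework code) of a recurrent policy whose hidden state persists across exploration and execution episodes, so task information inferred while exploring can shape execution:

```python
import torch
import torch.nn as nn

class RL2Policy(nn.Module):
    """Black-box meta-RL policy: an LSTM reads (observation, previous action,
    previous reward) at every step, and its hidden state is carried across the
    exploration and execution episodes, so the task is inferred implicitly."""
    def __init__(self, obs_dim: int, num_actions: int, hidden_dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim + num_actions + 1, hidden_dim, batch_first=True)
        self.action_head = nn.Linear(hidden_dim, num_actions)

    def forward(self, obs, prev_action_onehot, prev_reward, hidden=None):
        # obs: [B, T, obs_dim], prev_action_onehot: [B, T, num_actions], prev_reward: [B, T, 1]
        x = torch.cat([obs, prev_action_onehot, prev_reward], dim=-1)
        features, hidden = self.lstm(x, hidden)   # hidden state is NOT reset between episodes
        dist = torch.distributions.Categorical(logits=self.action_head(features))
        return dist, hidden
```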

Page 7: Meta Reinforcement Learning Learning to Explore

Plan for Today

Recap: Meta-RL Basics

Learning to Explore:
- End-to-End Optimization of Exploration Strategies
- Alternative Exploration Strategies
- Decoupled Exploration & Exploitation   <- focus of HW4

Page 8: Meta Reinforcement Learning Learning to Explore

Taking a step back…

In RL, should we be using the same exploration algorithm for:
- Learning to navigate an environment
- Learning to make recommendations to users
- Learning a policy for computer system caching
- Learning to physically operate a new tool or machine

This is how we currently approach exploration.

Can we learn exploration strategies based on experience from other tasks in that domain?

Page 9: Meta Reinforcement Learning Learning to Explore

A simple, running example

Hallway 1, Hallway 2, …, Hallway N
(figure: an agent, plus information on where to go)

Different tasks: navigating to the ends of different hallways.

Question: What is one strategy for exploring and learning the task?

Page 10: Meta Reinforcement Learning Learning to Explore

How Do We Learn to Explore?

Solution #1: Optimize for Exploration & Exploitation End-to-End w.r.t. Task Reward
(Duan et al., 2016; Wang et al., 2016; Mishra et al., 2017; Stadie et al., 2018; Zintgraf et al., 2019; Kamienny et al., 2020)

RL2 with Policy Gradients:

Example episodes during meta-training:

- agent goes to the end of the correct hallway: gets positive reward for the current task, but 𝒟tr_i won't be different than for any other task
- agent looks at the instructions: good exploratory behavior, but won't get any reward for this behavior
- agent goes to the wrong hallway, then the correct hallway: +/- provides signal on a suboptimal exploration + exploitation strategy

It’s hard to learn exploration & exploitation at the same time!
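A rough sketch of what "end-to-end w.r.t. task reward" means for the gradient (assumption: a plain REINFORCE-style estimator over the whole trial, using the RL2Policy sketch above; real implementations use PPO or actor-critic methods with baselines):

```python
import torch

def end_to_end_pg_loss(dist, actions, returns_to_go):
    """REINFORCE-style surrogate loss over the *entire* trial (exploration episodes
    concatenated with execution episodes). The same return signal must shape both
    exploration and exploitation behavior, which is why this optimization is hard
    when good exploration earns no reward on its own.
    dist: Categorical over [B, T] steps; actions, returns_to_go: [B, T] tensors."""
    log_probs = dist.log_prob(actions)
    return -(log_probs * returns_to_go).mean()
```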

Page 11: Meta Reinforcement Learning Learning to Explore

How Do We Learn to Explore?

Solution #1: Optimize for Exploration & Exploitation End-to-End w.r.t. Task Reward

+ simple
+ leads to optimal strategy in principle

-- challenging optimization when exploration is hard

(Duan et al., 2016, Wang et al., 2016, Mishra et al., 2017, Stadie et al., 2018, Zintgraf et al., 2019, Kamienny et al., 2020)

Page 12: Meta Reinforcement Learning Learning to Explore

Another Example of a Hard Exploration Meta-RL Problem

Meta-training: learned cooking tasks in previous kitchens.
Meta-testing goal: quickly learn tasks in a new kitchen.

Page 13: Meta Reinforcement Learning Learning to Explore

Why is End-to-End Training Hard in This Example?

End-to-end approach: optimize exploration and execution episode behaviors end-to-end to maximize the reward of execution.

Coupling problem (🐔🥚): learning exploration and learning execution depend on each other.
- Learning to cook (execution) requires learning to find ingredients (exploration).
- Ingredient not found (bad exploration) -> cannot learn to cook (bad execution).
- Cannot cook (bad execution) -> low reward for any exploration.

Liu, Raghunathan, Liang, Finn. Decoupling Exploration and Exploitation for Meta-Reinforcement Learning without Sacrifices. ICML 2021

-> can lead to poor local optima and poor sample efficiency

Page 14: Meta Reinforcement Learning Learning to Explore

Plan for Today

Recap: Meta-RL Basics

Learning to Explore:
- End-to-End Optimization of Exploration Strategies
- Alternative Exploration Strategies   <- up next
- Decoupled Exploration & Exploitation   <- focus of HW4

Page 15: Meta Reinforcement Learning Learning to Explore

Rakelly, Zhou, Quillen, Finn, Levine. Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables. ICML 2019.

Solution #2: Leverage Alternative Exploration Strategies

2a. Use posterior sampling (also called Thompson sampling)

Method sketch:
1. Learn to solve & collect data for all of the training tasks (independent of learning to explore).
2. Form a latent representation z of each task, and a task-conditioned policy π(a | s, z_i).
3. Learn to infer a distribution over the task: p(z) and q(z | 𝒟tr_i).
4. Alternate between sampling from the current task distribution, z_i ∼ q(z | 𝒟tr_i), and collecting data according to that task with π(a | s, z_i) (posterior sampling).

Page 16: Meta Reinforcement Learning Learning to Explore

Rakelly, Zhou, Quillen, Finn, Levine. Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables. ICML 2019.

Method sketch (same as the previous slide), plus:

Use a particular form of black-box architecture for the task encoder, mapping 𝒟tr_i to z_i.

Page 17: Meta Reinforcement Learning Learning to Explore

Rakelly, Zhou, Quillen, Finn, Levine. Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables. ICML 2019.

Method sketch (same as the previous slide), plus:

Impose a distribution on the task encoder: D_KL( q(z | 𝒟tr_i) ‖ 𝒩(0, I) )

Complete objective:
max_{θ,ψ}  Σ_i  E_{τ∼π_θ}[ R_i(τ) ]  −  D_KL( q_ψ(z | 𝒟tr_i) ‖ 𝒩(0, I) )
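A minimal sketch of the regularized objective above (assumptions: a diagonal-Gaussian task encoder q_ψ producing a mean and standard deviation from 𝒟tr_i, and a policy-improvement term computed elsewhere; PEARL itself optimizes this with SAC, which is omitted here):

```python
import torch
import torch.distributions as D

def pearl_style_loss(enc_mean, enc_std, policy_objective, kl_weight=0.1):
    """Loss = -E[R_i(tau)] + beta * KL( q_psi(z | D_i^tr) || N(0, I) ).
    Minimizing it maximizes return while keeping the task posterior close to the prior."""
    q_z = D.Normal(enc_mean, enc_std)                                      # q_psi(z | D_i^tr)
    p_z = D.Normal(torch.zeros_like(enc_mean), torch.ones_like(enc_std))   # N(0, I)
    kl = D.kl_divergence(q_z, p_z).sum(dim=-1).mean()                      # sum over latent dims
    return -policy_objective + kl_weight * kl
```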

Page 18: Meta Reinforcement Learning Learning to Explore

Solution #2: Leverage Alternative Exploration Strategies

2a. Use posterior sampling (also called Thompson sampling)
1-3. Learn a distribution over the latent task variable z and a task-conditioned policy π(a | s, z_i).
4. Sample z from the current posterior q(z | 𝒟tr), sample actions from the policy π(a | s, z), and add the experience to 𝒟tr.

Question: In what situations might posterior sampling be bad?
E.g., goals are far away and a sign on the wall tells you the correct goal.
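A sketch of posterior sampling at meta-test time; here `env`, `policy`, and `encoder` are hypothetical stand-ins for a gym-style environment, a trained π(a | s, z), and a trained task encoder q(z | 𝒟tr) with prior 𝒩(0, I):

```python
import torch

def posterior_sampling_adaptation(env, policy, encoder, num_episodes=5):
    context = []                                       # D^tr: transitions collected so far
    for _ in range(num_episodes):
        # Sample a task hypothesis z: from the prior if no data yet, else from q(z | D^tr).
        if context:
            z = encoder.sample_posterior(context)
        else:
            z = torch.randn(encoder.latent_dim)
        obs, done = env.reset(), False
        while not done:
            action = policy.act(obs, z)                # behave as if z were the true task
            next_obs, reward, done, info = env.step(action)
            context.append((obs, action, reward, next_obs))
            obs = next_obs
    return context
```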

Page 19: Meta Reinforcement Learning Learning to Explore

Solution #2: Leverage Alternative Exploration Strategies

2a. Use posterior sampling (also called Thompson sampling): PEARL (Rakelly, Zhou, Quillen, Finn, Levine. ICML '19)
1-3. Learn a distribution over the latent task variable z and a task-conditioned policy π(a | s, z_i).
4. Sample z from the current posterior q(z | 𝒟tr), sample actions from the policy π(a | s, z), and add the experience to 𝒟tr.

2b. Task dynamics & reward prediction: MetaCURE (Zhang, Wang, Hu, Chen, Fan, Zhang. '20)
Key idea: train a model f(s′, r | s, a, 𝒟tr) & collect 𝒟tr so that the model is accurate.
Iteratively, for each task:
1. Collect data 𝒟tr with exploration policy πexp.
2. Collect data 𝒟ts with policy πexe(a | s, 𝒟tr).
3. Train policy πexp w.r.t. reward r_exp = −‖(ŝ′, r̂) − (s′, r)‖², where ŝ′, r̂ ∼ f(·, · | s, a, 𝒟tr).
4. Update policy πexe w.r.t. the task reward.
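A minimal sketch of the model-based exploration reward in 2b (assumption: `model` is a learned network predicting (s′, r) given (s, a) and the collected context 𝒟tr; all names and tensor shapes are illustrative):

```python
import torch

def exploration_reward(model, s, a, context, s_next, r):
    """r_exp = -|| (s'_hat, r_hat) - (s', r) ||^2. The reward is high when the model,
    conditioned on the context the exploration policy collected, predicts the true
    transition well, i.e. pi_exp is rewarded for gathering data that makes f accurate.
    Shapes: s, s_next: [B, obs_dim]; a: [B, act_dim]; r: [B]."""
    s_pred, r_pred = model(s, a, context)                          # f(s', r | s, a, D^tr)
    err = torch.cat([s_pred - s_next, (r_pred - r).reshape(-1, 1)], dim=-1)
    return -err.pow(2).sum(dim=-1)                                 # one reward per transition
```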

Page 20: Meta Reinforcement Learning Learning to Explore

Solution #2: Leverage Alternative Exploration Strategies

2a. Use posterior sampling (also called Thompson sampling): PEARL (Rakelly, Zhou, Quillen, Finn, Levine. ICML '19)
2b. Task dynamics & reward prediction: MetaCURE (Zhang, Wang, Hu, Chen, Fan, Zhang. '20)
Key idea: train a model f(s′, r | s, a, 𝒟tr) & collect 𝒟tr so that the model is accurate.

Question: In what situations might this (2b) be bad?
E.g., lots of task-irrelevant distractors, or complex, high-dimensional state dynamics.

Page 21: Meta Reinforcement Learning Learning to Explore

Solution #2: Leverage Alternative Exploration Strategies

2a. Use posterior sampling (also called Thompson sampling): PEARL (Rakelly, Zhou, Quillen, Finn, Levine. ICML '19)
2b. Task dynamics & reward prediction: MetaCURE (Zhang, Wang, Hu, Chen, Fan, Zhang. '20)
Key idea: train a model f(s′, r | s, a, 𝒟tr) & collect 𝒟tr so that the model is accurate.

Overall:
+ easy to optimize
+ many based on principled strategies
-- suboptimal by an arbitrarily large amount in some environments

Page 22: Meta Reinforcement Learning Learning to Explore

Can we decouple exploitation and exploration without sacrificing optimality?

(best of both worlds?)

Page 23: Meta Reinforcement Learning Learning to Explore

Plan for Today

Recap: Meta-RL Basics

Learning to Explore:
- End-to-End Optimization of Exploration Strategies
- Alternative Exploration Strategies
- Decoupled Exploration & Exploitation   <- up next; focus of HW4

Page 24: Meta Reinforcement Learning Learning to Explore

Solution #3

Idea from solution #2b: train a model f(s′, r | s, a, 𝒟tr) & collect 𝒟tr so that the model is accurate.

Do we have to learn a full dynamics & reward model?

Idea 3.0: Label each training task with a unique ID μ.

Meta-training:
- Execution policy: train the ID-conditioned policy πexec(a | s, μ_i).
- Exploration policy: train the policy πexp(a | s) and a task identification model q(μ | 𝒟tr), such that 𝒟tr ∼ πexp allows accurate task prediction (see the sketch below).

Meta-testing:
Explore: 𝒟tr ∼ πexp(a | s)  ->  Infer task: μ ∼ q(μ | 𝒟tr)  ->  Perform task: πexec(a | s, μ)

+ no longer need to model dynamics and rewards
-- may not generalize well for one-hot μ
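A minimal sketch of Idea 3.0's two training signals (assumptions: tasks are labeled with integer IDs, q(μ | 𝒟tr) is a classifier over an encoding of the collected transitions, and the execution-policy RL loss is computed elsewhere; names are illustrative):

```python
import torch
import torch.nn.functional as F

def idea_3_0_signals(exec_policy_loss, id_logits, true_task_id):
    """id_logits: [B, num_tasks] from q(mu | D^tr); true_task_id: [B] integer labels.
    - q(mu | D^tr) is trained by cross-entropy to identify the task from collected data.
    - pi_exp can be rewarded with log q(mu_i | D^tr): high when its data identifies the task.
    - pi_exec(a | s, mu_i) is trained on the usual task reward (exec_policy_loss)."""
    log_probs = F.log_softmax(id_logits, dim=-1)
    log_q_true = log_probs.gather(1, true_task_id.unsqueeze(1)).squeeze(1)   # [B]
    id_loss = -log_q_true.mean()
    exploration_reward = log_q_true.detach()        # per-trajectory reward for pi_exp
    total_loss = exec_policy_loss + id_loss
    return total_loss, exploration_reward
```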

Page 25: Meta Reinforcement Learning Learning to Explore

Solution #3: Decouple by acquiring representation of task relevant information

1) Learn execution & identify key information 🥚🐣
2) Learn to explore by recovering that information

(Meta-training diagram: the MDP identifier μ, which contains task-relevant information like the ingredients along with task-irrelevant details like wall color and decorations, is compressed into a bottlenecked representation that conditions the execution policy; the exploration policy is then trained in the exploration episode with an information recovery reward.)

Liu, Raghunathan, Liang, Finn. Decoupling Exploration and Exploitation for Meta-Reinforcement Learning without Sacrifices. ICML 2021

Page 26: Meta Reinforcement Learning Learning to Explore

Solution #3: Decouple by acquiring representation of task relevant information

1) Learn execution & identify key information 🥚🐣
2) Learn to explore by recovering that information

(Same meta-training diagram as the previous slide: MDP identifier μ -> bottlenecked representation -> execution policy; exploration episode -> information recovery reward -> exploration policy.)

Liu, Raghunathan, Liang, Finn. Decoupling Exploration and Exploitation for Meta-Reinforcement Learning without Sacrifices. ICML 2021

Train πexec(a | s, z_i) and the encoder F(z_i | μ_i) to maximize the task reward, with z_i bottlenecked to task-relevant information (a sketch follows below).
Train πexp such that the collected 𝒟tr is predictive of z_i.
In practice: (1) and (2) can be trained simultaneously.
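A minimal sketch of step (1), assuming the bottleneck is implemented as a KL penalty on a stochastic encoder F(z | μ) toward a standard normal prior (a common variational-information-bottleneck choice; DREAM's exact instantiation may differ), so that z keeps only the task information the execution policy actually needs:

```python
import torch
import torch.nn as nn
import torch.distributions as D

class TaskEncoder(nn.Module):
    """F(z | mu): embeds the one-hot task ID mu into a stochastic bottleneck z."""
    def __init__(self, num_tasks, latent_dim=8):
        super().__init__()
        self.mean = nn.Embedding(num_tasks, latent_dim)
        self.log_std = nn.Embedding(num_tasks, latent_dim)

    def forward(self, task_id):
        return D.Normal(self.mean(task_id), self.log_std(task_id).exp())

def execution_loss(encoder, task_id, policy_loss_fn, bottleneck_weight=0.01):
    """Train pi_exec(a | s, z) and F(z | mu) jointly: minimize the execution-policy RL loss
    (policy_loss_fn, computed elsewhere) while penalizing how much information z carries
    about mu (KL to N(0, I))."""
    q_z = encoder(task_id)
    z = q_z.rsample()                                   # reparameterized sample for gradients
    p_z = D.Normal(torch.zeros_like(q_z.loc), torch.ones_like(q_z.scale))
    bottleneck = D.kl_divergence(q_z, p_z).sum(-1).mean()
    return policy_loss_fn(z) + bottleneck_weight * bottleneck
```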

Page 27: Meta Reinforcement Learning Learning to Explore

Solution #3: Decouple by acquiring representation of task relevant information

1) Learn execution & identify key information 🥚🐣
2) Learn to explore by recovering that information

Meta-training:
Train πexec(a | s, z_i) and the encoder F(z_i | μ_i), as on the previous slide.
Train πexp such that the collected 𝒟tr is predictive of z_i.

How do we formulate the reward function for πexp?
(a) Train a model q(z_i | 𝒟tr).
(b) r_t = per-step information gain:
    r_t = (prediction error from τ_{1:t-1}) − (prediction error from τ_{1:t})

Decoupled Reward-free ExplorAtion and Execution in Meta-Reinforcement Learning (DREAM)

Liu, Raghunathan, Liang, Finn. Decoupling Exploration and Exploitation for Meta-Reinforcement Learning without Sacrifices. ICML 2021
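A minimal sketch of reward (b) above (assumptions: q(z_i | τ_{1:t}) is a Gaussian produced by a trajectory encoder, and "prediction error" is measured as the negative log-likelihood of the execution embedding z_i; names are illustrative and DREAM's exact implementation may differ):

```python
import torch
import torch.distributions as D

def information_gain_rewards(pred_means, pred_stds, z_target):
    """Per-step exploration reward r_t = err(tau_{1:t-1}) - err(tau_{1:t}), with
    err(tau_{1:t}) = -log q(z_i | tau_{1:t}): each step is rewarded for how much it
    reduces the error in predicting the task embedding z_i used by the execution policy.
    pred_means, pred_stds: [T+1, latent_dim] posterior params after 0, 1, ..., T steps;
    z_target: [latent_dim] embedding of the true task."""
    log_probs = D.Normal(pred_means, pred_stds).log_prob(z_target).sum(dim=-1)  # [T+1]
    errors = -log_probs
    rewards = errors[:-1] - errors[1:]                  # r_t for t = 1, ..., T
    return rewards
```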

Page 28: Meta Reinforcement Learning Learning to Explore

(Informal) Theoretical Analysis

Solution #3: Decouple by acquiring representation of task relevant information

(1) DREAM objective is consistent with end-to-end optimization.

-> can in principle recover the optimal exploration strategy

[under mild assumptions]

(2) Consider a bandit-like setting with |A| arms: in MDP i, arm i yields reward; in all MDPs, arm 0 reveals the rewarding arm.
[assuming Q-learning with uniform outer-loop exploration]

RL2 requires Ω(|A|² log |A|) samples for meta-optimization.
DREAM requires O(|A| log |A|) samples for meta-optimization.

(Plot: sample complexity, samples (1e3) until optimality vs. number of actions |A|, showing DREAM's O(|A| log |A|) scaling against RL2's Ω(|A|² log |A|).)
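To make the bandit-like construction concrete, here is a tiny illustrative sketch (episode structure and reward values are assumptions; only the "arm i pays in MDP i, arm 0 reveals the answer" structure comes from the slide):

```python
import random

class SignBanditMDP:
    """MDP i from the analysis: pulling arm i yields reward 1; pulling arm 0 yields no
    reward but reveals (as an observation) which arm is rewarding."""
    def __init__(self, num_arms, rewarding_arm):
        assert 1 <= rewarding_arm < num_arms
        self.num_arms = num_arms
        self.rewarding_arm = rewarding_arm

    def step(self, arm):
        obs = self.rewarding_arm if arm == 0 else None   # arm 0 acts like the sign/instructions
        reward = 1.0 if arm == self.rewarding_arm else 0.0
        return obs, reward

# A decoupled learner can be rewarded for pulling arm 0 (it identifies the task) and
# then exploit; an end-to-end learner with uniform outer-loop exploration only gets
# signal when it happens to pull arm 0 *and then* the rewarding arm, which is rare.
env = SignBanditMDP(num_arms=10, rewarding_arm=random.randint(1, 9))
obs, _ = env.step(0)          # explore: read which arm pays
_, r = env.step(obs)          # exploit: r == 1.0
```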

Liu, Raghunathan, Liang, Finn. Decoupling Exploration and Exploitation for Meta-Reinforcement Learning without Sacrifices. ICML 2021

Page 29: Meta Reinforcement Learning Learning to Explore

How Do We Learn to Explore?

End-to-End:
+ leads to optimal strategy in principle
-- challenging optimization when exploration is hard

Alternative Strategies:
+ easy to optimize
+ many based on principled strategies
-- suboptimal by an arbitrarily large amount in some environments

Decoupled Exploration & Execution:
+ leads to optimal strategy in principle
+ easy to optimize in practice
-- requires task identifier

Page 30: Meta Reinforcement Learning Learning to Explore

Empirical Comparison: Sparse Reward 3D Visual Navigation Problem

● Task: go to the (key / block / ball), color specified by the sign

● Agent starts on other side of barrier, must walk around to read the sign

● Pixel observations (80 x 60 RGB)

● Sparse binary reward

More challenging variant of task from Kamienny et al., 2020

Liu, Raghunathan, Liang, Finn. Explore then Execute: Adapting without Rewards via Factorized Meta-RL. ICML 2021

Page 31: Meta Reinforcement Learning Learning to Explore

Quantitative Comparison

● End-to-end algorithms (RL2, IMPORT, VARIBAD) perform poorly due to coupling.

● PEARL-UB is an upper bound on PEARL: even with an optimal task-conditioned policy and Thompson-sampling exploration, it does not learn the optimal exploration strategy.

● DREAM achieves near-optimal reward

RL2 (Duan et al., 2016), IMPORT (Kamienny et al., 2020), VARIBAD (Zintgraf et al., 2019), PEARL (Rakelly et al., 2019), Thompson, 1933

Liu, Raghunathan, Liang, Finn. Explore then Execute: Adapting without Rewards via Factorized Meta-RL. ICML 2021

Page 32: Meta Reinforcement Learning Learning to Explore

Qualitative Results for DREAM

(Exploration episode, then execution episode. Goal: go to the key.)

Liu, Raghunathan, Liang, Finn. Explore then Execute: Adapting without Rewards via Factorized Meta-RL. ICML 2021

Page 33: Meta Reinforcement Learning Learning to Explore

Plan for Today

Lecture goals:
- Understand the challenge of end-to-end optimization of exploration
- Understand the basics of using alternative exploration strategies in meta-RL
- Understand & be able to implement decoupled exploration & exploitation

Recap: Meta-RL Basics

Learning to Explore:
- End-to-End Optimization of Exploration Strategies
- Alternative Exploration Strategies
- Decoupled Exploration & Exploitation   <- focus of HW4

Page 34: Meta Reinforcement Learning Learning to Explore


Roadmap for Remaining Lectures

Next week: Offline Multi-Task RL and Hierarchical RL (final RL topics).

Following week: Lifelong Learning (learning tasks sequentially); guest lectures by Colin Raffel (multi-task NLP) and Jascha Sohl-Dickstein (learned optimizers).

Week after that: Frontiers & Open Challenges.

Thank you for being an engaging student group!