
Page 1:

Between MDPs and Semi-MDPs:
Learning, Planning and Representing Knowledge at Multiple Temporal Scales

Richard S. Sutton, Doina Precup
University of Massachusetts

Satinder Singh
University of Colorado

with thanks to Andy Barto,
Amy McGovern, Andrew Fagg, Ron Parr, Csaba Szepesvári

Page 2:

Related Work

“Classical” AI

Fikes, Hart & Nilsson (1972); Newell & Simon (1972); Sacerdoti (1974, 1977)

Macro-Operators: Korf (1985); Minton (1988); Iba (1989); Kibler & Ruby (1992)

Qualitative Reasoning: Kuipers (1979); de Kleer & Brown (1984); Dejong (1994)

Laird et al. (1986); Drescher (1991); Levinson & Fuchs (1994); Say & Selahatin (1996); Brafman & Moshe (1997)

Robotics and Control Engineering

Brooks (1986); Maes (1991); Koza & Rice (1992); Brockett (1993); Grossman et al. (1993); Dorigo & Colombetti (1994); Asada et al. (1996); Uchibe et al. (1996); Huber & Grupen (1997); Kalmar et al. (1997); Mataric (1997); Sastry (1997); Toth et al. (1997)

Reinforcement Learning and MDP Planning

Mahadevan & Connell (1992); Singh (1992); Lin (1993); Dayan & Hinton (1993); Kaelbling (1993); Chrisman (1994); Bradtke & Duff (1995); Ring (1995); Sutton (1995); Thrun & Schwartz (1995); Boutilier et al. (1997); Dietterich (1997); Wiering & Schmidhuber (1997); Precup, Sutton & Singh (1997); McGovern & Sutton (1998); Parr & Russell (1998); Drummond (1998); Hauskrecht et al. (1998); Meuleau et al. (1998); Ryan and Pendrith (1998)

Page 3:

Abstraction in Learning and Planning

• A long-standing, key problem in AI !

• How can we give abstract knowledge a clear semantics? e.g. “I could go to the library”

• How can different levels of abstraction be related? spatial: states; temporal: time scales

• How can we handle stochastic, closed-loop, temporally extended courses of action?

• Use RL/MDPs to provide a theoretical foundation

Page 4:

Outline

• RL and Markov Decision Processes (MDPs)

• Options and Semi-MDPs

• Rooms Example

• Between MDPs and Semi-MDPs

Termination Improvement

Intra-option Learning

Subgoals

Page 5:

RL is Learning from Interaction

[Diagram: agent–environment interaction loop — the Agent sends an action to the Environment; the Environment returns a perception and a reward]

• complete agent

• temporally situated

• continual learning and planning

• object is to affect environment

• environment is stochastic and uncertain

RL is like Life!

Page 6:

More Formally: Markov Decision Problems (MDPs)

An MDP is defined by ⟨S, A, p, r, γ⟩:
S – set of states of the environment
A(s) – set of actions possible in state s
p^a_{ss'} – probability of transition from s to s' when executing a
r^a_s – expected reward when executing a in s
γ – discount rate for expected reward

Assumption: discrete time t = 0, 1, 2, . . .

Trajectory: s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, . . .
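For concreteness, such an MDP can be held in plain lookup tables. A minimal Python sketch of the tuple ⟨S, A, p, r, γ⟩ (the two-state toy problem and all names are illustrative assumptions, not from the talk):

    # Illustrative two-state MDP <S, A, p, r, gamma>; the numbers are made up.
    gamma = 0.9
    states = ["s0", "s1"]
    actions = {"s0": ["stay", "go"], "s1": ["stay", "go"]}
    # p[s][a][s'] = probability of moving from s to s' when executing a in s
    p = {"s0": {"stay": {"s0": 1.0}, "go": {"s1": 0.9, "s0": 0.1}},
         "s1": {"stay": {"s1": 1.0}, "go": {"s0": 0.9, "s1": 0.1}}}
    # r[s][a] = expected reward when executing a in s
    r = {"s0": {"stay": 0.0, "go": 0.0},
         "s1": {"stay": 1.0, "go": 0.0}}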

Page 7:

The Objective

Find a policy (way of acting) that gets a lot of reward in the long run

These are called value functions - cf. evaluation functions

Find a policy π : S × A → [0,1] that maximizes the value (expected future reward) for each state

V^π(s) = E{ r_1 + γ r_2 + ⋯ + γ^{t−1} r_t + ⋯ | s_0 = s }

Q^π(s,a) = E{ r_1 + γ r_2 + ⋯ + γ^{t−1} r_t + ⋯ | s_0 = s, a_0 = a }

V*(s) = max_π V^π(s)

Q*(s,a) = max_π Q^π(s,a)

Page 8:

Outline

• RL and Markov Decision Processes (MDPs)

• Options and Semi-MDPs

• Rooms Example

• Between MDPs and Semi-MDPs

Termination Improvement

Intra-option Learning

Subgoals

Page 9:

Options

A generalization of actions to include courses of action

An option is a triple o = ⟨I, π, β⟩
• I ⊆ S is the set of states in which o may be started
• π : S × A → [0,1] is the policy followed during o
• β : S → [0,1] is the probability of terminating in each state

Option execution is assumed to be call-and-return
Options can take a variable number of steps

Example: docking
I : all states in which charger is in sight
π : hand-crafted controller
β : terminate when docked or charger not visible
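A minimal Python sketch of the option triple as a data structure (the names are illustrative assumptions; for simplicity the internal policy is a deterministic state-to-action map rather than the stochastic π : S × A → [0,1] above):

    from dataclasses import dataclass
    from typing import Callable, Hashable, Set

    State = Hashable
    Action = Hashable

    @dataclass(eq=False)   # identity hashing, so options can index dictionaries later
    class Option:
        initiation_set: Set[State]              # I: states in which the option may be started
        policy: Callable[[State], Action]       # pi: action the option takes in each state
        termination: Callable[[State], float]   # beta: probability of terminating in each state

        def can_start(self, s: State) -> bool:
            return s in self.initiation_set

For the docking example, initiation_set would hold the states where the charger is in sight, policy would wrap the hand-crafted controller, and termination would return 1.0 when docked or when the charger is not visible.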

Page 10:

Room Example

[Figure: the four-rooms gridworld, showing the hallway options O1 and O2 for one room and two candidate goal locations G]

4 rooms, 4 hallways
8 multi-step options (to each room's 2 hallways)
4 unreliable primitive actions (up, down, left, right); fail 33% of the time
Goal states are given a terminal value of 1; γ = .9
All rewards zero

Given goal location, quickly plan shortest route

Page 11:

Options define a Semi-Markov Decision Process (SMDP)

[Figure: state trajectories over time at three levels — MDP, SMDP, and Options over MDP]

MDP: discrete time, homogeneous discount
SMDP: continuous time, discrete events, interval-dependent discount
Options over MDP: discrete time, overlaid discrete events, interval-dependent discount

A discrete-time SMDP overlaid on an MDP; can be analyzed at either level

Page 12:

MDP + Options = SMDP

Theorem:
For any MDP, and any set of options, the decision process that chooses among the options, executing each to termination, is an SMDP.

Thus all Bellman equations and DP results extend for value functions over options and models of options (cf. SMDP theory).

Page 13:

What does the SMDP connection give us?

• Policies over options: μ : S × O → [0,1]
• Value functions over options: V^μ(s), Q^μ(s,o), V*_O(s), Q*_O(s,o)
• Learning methods: Bradtke & Duff (1995), Parr (1998)
• Models of options
• Planning methods: e.g. value iteration, policy iteration, Dyna...
• A coherent theory of learning and planning with courses of action at variable time scales, yet at the same level

A theoretical foundation for what we really need!

But the most interesting issues are beyond SMDPs...

Page 14:

Value Functions for Options

Define value functions for options, similar to the MDP case

V^μ(s) = E{ r_{t+1} + γ r_{t+2} + ⋯ | E(μ, s, t) }

Q^μ(s,o) = E{ r_{t+1} + γ r_{t+2} + ⋯ | E(oμ, s, t) }

Now consider policies μ ∈ Π(O) restricted to choose only from options in O:

V*_O(s) = max_{μ∈Π(O)} V^μ(s)

Q*_O(s,o) = max_{μ∈Π(O)} Q^μ(s,o)

Page 15:

Models of Options

Knowing how an option is executed is not enough for reasoning about it, or planning with it. We need information about its consequences.

The model of the consequences of starting option o in state s has:

• a reward part:
r^o_s = E{ r_1 + γ r_2 + ⋯ + γ^{k−1} r_k | s_0 = s, o taken in s_0, lasts k steps }

• a next-state part:
p^o_{ss'} = E{ γ^k δ_{s_k s'} | s_0 = s, o taken in s_0, lasts k steps }
where δ_{s_k s'} = 1 if s' = s_k is the termination state, 0 otherwise

This form follows from SMDP theory. Such models can be used interchangeably with models of primitive actions in Bellman equations.
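To make the two parts concrete: they can be estimated by Monte Carlo rollouts of the option. The sketch below is only an illustration under assumed interfaces (env.reset_to, env.step, and the Option fields from the earlier sketch are inventions for this example), not the learning method discussed later in the talk:

    import random
    from collections import defaultdict

    def estimate_option_model(env, option, start_state, gamma=0.9, n_rollouts=1000):
        """Monte Carlo estimate of the reward part r^o_s and next-state part p^o_{ss'}."""
        total_reward, next_state_mass = 0.0, defaultdict(float)
        for _ in range(n_rollouts):
            s = env.reset_to(start_state)       # hypothetical: put the environment in state s
            discount, ret = 1.0, 0.0
            while True:
                s, reward, done = env.step(option.policy(s))
                ret += discount * reward        # accumulates r_1 + gamma*r_2 + ... + gamma^(k-1)*r_k
                discount *= gamma
                if done or random.random() < option.termination(s):
                    break
            total_reward += ret
            next_state_mass[s] += discount      # gamma^k mass on the termination state s_k
        n = float(n_rollouts)
        return total_reward / n, {s2: m / n for s2, m in next_state_mass.items()}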

Page 16:

Outline

• RL and Markov Decision Processes (MDPs)

• Options and Semi-MDPs

• Rooms Example

• Between MDPs and Semi-MDPs

Termination Improvement

Intra-option Learning

Subgoals

Page 17:

Room Example

[Figure: the four-rooms gridworld, showing the hallway options O1 and O2 for one room and two candidate goal locations G]

4 rooms, 4 hallways
8 multi-step options (to each room's 2 hallways)
4 unreliable primitive actions (up, down, left, right); fail 33% of the time
Goal states are given a terminal value of 1; γ = .9
All rewards zero

Given goal location, quickly plan shortest route

Page 18:

Example: Synchronous Value Iteration Generalized to Options

Initialize: V_0(s) ← 0, ∀ s ∈ S

Iterate: V_{k+1}(s) ← max_{o∈O} [ r^o_s + Σ_{s'∈S} p^o_{ss'} V_k(s') ], ∀ s ∈ S

The algorithm converges to the optimal value function given the options: lim_{k→∞} V_k = V*_O

Once V*_O is computed, μ*_O is readily determined.

If O = A, the algorithm reduces to conventional value iteration. If A ⊆ O, then V*_O = V*.
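A minimal Python sketch of this iteration, assuming the option models are stored as nested dictionaries r[o][s] and p[o][s][s'] in the sense of the models slide (so the discount γ^k is already folded into p); all names are illustrative:

    def option_value_iteration(states, options, r, p, n_iterations=100):
        """Synchronous value iteration over options:
        V_{k+1}(s) = max_o [ r[o][s] + sum_{s'} p[o][s][s'] * V_k(s') ]."""
        V = {s: 0.0 for s in states}            # V_0(s) = 0 for all s
        for _ in range(n_iterations):
            V = {s: max(r[o][s] + sum(prob * V[s2] for s2, prob in p[o][s].items())
                        for o in options)
                 for s in states}
        return V

The greedy policy μ*_O can then be read off by taking, in each state, the option that achieves the maximum.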

Page 19:

Rooms Example

[Figure: value function after Iteration #0, #1, #2 — with cell-to-cell primitive actions (top row) and with room-to-room options (bottom row); V(goal) = 1]

Page 20:

Example with Goal ≠ Subgoal
(both primitive actions and options)

[Figure: value function over Initial values and Iterations #1–#5]

Page 21:

What does the SMDP connection give us?

• Policies over options: μ : S × O → [0,1]
• Value functions over options: V^μ(s), Q^μ(s,o), V*_O(s), Q*_O(s,o)
• Learning methods: Bradtke & Duff (1995), Parr (1998)
• Models of options
• Planning methods: e.g. value iteration, policy iteration, Dyna...
• A coherent theory of learning and planning with courses of action at variable time scales, yet at the same level

A theoretical foundation for what we really need!

But the most interesting issues are beyond SMDPs...

Page 22:

Advantages of Dual MDP/SMDP View

At the SMDP level: compute value functions and policies over options, with the benefit of increased speed / flexibility

At the MDP level: learn how to execute an option for achieving a given goal

Between the MDP and SMDP level:
Improve over existing options (e.g. by terminating early)
Learn about the effects of several options in parallel, without executing them to termination

Page 23:

Outline

• RL and Markov Decision Processes (MDPs)

• Options and Semi-MDPs

• Rooms Example

• Between MDPs and Semi-MDPs

Termination Improvement

Intra-option Learning

Subgoals

Page 24:

Between MDPs and SMDPs

• Termination Improvement: improving the value function by changing the termination conditions of options

• Intra-Option Learning: learning the values of options in parallel, without executing them to termination; learning the models of options in parallel, without executing them to termination

• Tasks and Subgoals: learning the policies inside the options

Page 25:

Termination Improvement

Idea: We can do better by sometimes interrupting ongoing options, forcing them to terminate before β says to

Theorem: For any policy over options μ : S × O → [0,1], suppose we interrupt its options one or more times, when Q^μ(s, o) < Q^μ(s, μ(s)), where s is the state at that time and o is the ongoing option, to obtain μ' : S × O' → [0,1]. Then μ' ≥ μ (it attains more or equal reward everywhere).

Application: Suppose we have determined Q*_O and thus μ = μ*_O. Then μ' is guaranteed better than μ*_O, and is available with no additional computation.
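A sketch of how the interruption might look at execution time, for the application above (μ = μ*_O read off a tabular Q indexed by (state, option) pairs); the env interface and the option objects are assumptions carried over from the earlier sketches:

    import random

    def run_with_interruption(env, s, Q, options):
        """Follow the greedy option policy, but interrupt the ongoing option o
        whenever Q[(s, o)] falls below the value of the best option available in s."""
        def greedy(state):
            available = [o for o in options if o.can_start(state)]
            return max(available, key=lambda o: Q[(state, o)])

        o = greedy(s)
        while True:
            s, reward, done = env.step(o.policy(s))      # one primitive step of the ongoing option
            if done:
                return s
            best_value = max(Q[(s, o2)] for o2 in options if o2.can_start(s))
            if random.random() < o.termination(s) or Q[(s, o)] < best_value:
                o = greedy(s)                            # beta terminated, or we interrupt early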

Page 26:

Landmarks Task

[Figure: start state S, goal G, seven landmarks, and the circular range (input set) of each run-to-landmark controller]

Task: navigate from S to G as fast as possible

4 primitive actions, for taking tiny steps up, down, left, right

7 controllers for going straight to each one of the landmarks, from within a circular region where the landmark is visible

In this task, planning at the level of primitive actions is computationally intractable; we need the controllers

Page 27:

Termination Improvement for Landmarks Task

[Figure: the SMDP solution (600 steps) vs. the termination-improved solution (474 steps) from S to G]

Allowing early termination based on models improves the value function at no additional cost!

Page 28:

Illustration: Reconnaissance Mission Planning (Problem)

• Mission: Fly over (observe) most valuable sites and return to base

• Stochastic weather affects observability (cloudy or clear) of sites

• Limited fuel

• Intractable with classical optimal control methods

• Temporal scales:
Actions: which direction to fly now
Options: which site to head for

• Options compress space and time:
Reduce steps from ~600 to ~6
Reduce states from ~10^11 to ~10^6

Q*_O(s, o) = r^o_s + Σ_{s'} p^o_{ss'} V*_O(s')
(labels on the slide: "any state (10^6)", "sites only (6)")

[Figure: mission map — base, sites with observation rewards (5, 8, 10, 15, 25, 25, 50, 50, 50, 100), roughly 100 primitive decision steps per mission, options that run to sites, and the mean time between weather changes]

Page 29:

Illustration: Reconnaissance Mission Planning (Results)

• SMDP planner:
Assumes options followed to completion
Plans optimal SMDP solution

• SMDP planner with re-evaluation:
Plans as if options must be followed to completion
But actually takes them for only one step
Re-picks a new option on every step

• Static planner:
Assumes weather will not change
Plans optimal tour among clear sites
Re-plans whenever weather changes

[Figure: expected reward per mission (high fuel and low fuel) for the SMDP planner, the static re-planner, and the SMDP planner with re-evaluation of options on each step]

Temporal abstraction finds better approximation than static planner, with little more computation than SMDP planner

Page 30:

Intra-Option Learning Methods for Markov Options

Proven to converge to correct values, under same assumptions as 1-step Q-learning

Idea: take advantage of each fragment of experience

SMDP Q-learning:
• execute option to termination, keeping track of reward along the way
• at the end, update only the option taken, based on reward and value of state in which option terminates

Intra-option Q-learning:
• after each primitive action, update all the options that could have taken that action, based on the reward and the expected value from the next state on
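A sketch of the intra-option update just described, for Markov options with deterministic internal policies and a tabular Q indexed by (state, option) pairs (the data layout and option objects are illustrative assumptions):

    def intra_option_q_update(Q, options, s, a, reward, s_next, alpha=0.1, gamma=0.9):
        """After one primitive step (s, a, reward, s_next), update every option
        whose internal policy would have taken action a in state s."""
        greedy_value = max(Q[(s_next, o2)] for o2 in options)
        for o in options:
            if o.policy(s) != a:                # this option could not have taken action a
                continue
            beta = o.termination(s_next)
            # Value of s_next under o: continue o with probability (1 - beta),
            # otherwise terminate and re-choose the best option.
            u = (1.0 - beta) * Q[(s_next, o)] + beta * greedy_value
            Q[(s, o)] += alpha * (reward + gamma * u - Q[(s, o)])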

Page 31:

Intra-Option Learning Methods for Markov Options

Proven to converge to correct values, under same assumptions as 1-step Q-learning

Idea: take advantage of each fragment of experience

Intra-Option Learning: after each primitive action, update all the options that could have taken that action

SMDP Learning: execute option to termination, then update only the option taken

Page 32:

Example of Intra-Option Value Learning

Intra-option methods learn correct values without ever taking the options! SMDP methods are not applicable here

Random start, goal in right hallway, random actions

[Figure: learned values of the upper-hallway and left-hallway options over 6000 episodes approach their true values, and the average value of the greedy policy approaches the value of the optimal policy, even though the options themselves are never executed]

Page 33:

Intra-Option Value Learning Is Faster Than SMDP Value Learning

Random start, goal in right hallway, choice from A ∪ H, 90% greedy

Page 34:

Intra-Option Model Learning

Intra-option methods work much faster than SMDP methods

Random start state, no goal, pick randomly among all options

[Figure: state-prediction error and reward-prediction error (max and average) vs. number of options executed (0–100,000), for SMDP, SMDP 1/t, and intra-option model learning; the intra-option method drives both errors down much faster]

Page 35:

Tasks and Subgoals

It is natural to define options as solutions to subtasks, e.g. treat hallways as subgoals, learn shortest paths

We have defined subgoals as pairs ⟨G, g⟩:
G ⊆ S is the set of states treated as subgoals
g : G → ℝ are their subgoal values (can be both good and bad)

Each subgoal has its own set of value functions, e.g.:

V^o_g(s) = E{ r_1 + γ r_2 + ⋯ + γ^{k−1} r_k + γ^k g(s_k) | s_0 = s, o, s_k ∈ G }

V*_g(s) = max_o V^o_g(s)

Policies inside options can be learned from subgoals in an intra-option, off-policy manner
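For example, the policy inside a hallway option can be learned off-policy from any stream of experience with one-step Q-learning toward its subgoal. A minimal sketch, assuming a tabular Qg keyed by (state, action) and an actions(s) helper (both illustrative):

    def subgoal_q_update(Qg, actions, G, g, s, a, reward, s_next, alpha=0.1, gamma=0.9):
        """One off-policy update of an option's internal action values toward
        the subgoal <G, g>: G is the set of subgoal states, g their subgoal values."""
        if s_next in G:
            target = reward + gamma * g[s_next]     # the option ends at the subgoal and collects g(s_k)
        else:
            target = reward + gamma * max(Qg[(s_next, a2)] for a2 in actions(s_next))
        Qg[(s, a)] += alpha * (target - Qg[(s, a)])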

Page 36:

Options Depend on Outcome Values

[Figure: policies learned inside options for two settings of the subgoal (outcome) values, with small negative rewards on each step — Large Outcome Values (g = 10 and g = 0 at the two subgoals) vs. Small Outcome Values (g = 1 and g = 0); the learned policies range from shortest paths to paths that avoid the negative rewards]

Page 37:

Between MDPs and SMDPs

• Termination Improvement: improving the value function by changing the termination conditions of options

• Intra-Option Learning: learning the values of options in parallel, without executing them to termination; learning the models of options in parallel, without executing them to termination

• Tasks and Subgoals: learning the policies inside the options

Page 38:

Summary: Benefits of Options

• Transfer
Solutions to sub-tasks can be saved and reused
Domain knowledge can be provided as options and subgoals

• Potentially much faster learning and planning
By representing action at an appropriate temporal scale

• Models of options are a form of knowledge representation
Expressive, clear, suitable for learning and planning

• Much more to learn than just one policy, one set of values
A framework for “constructivism” – for finding models of the world that are useful for rapid planning and learning