Questions about Final Project?
• Questions about grading the last project? Please come see me.
• Function Approximation and Human Control
• Average: 93% +/- 3.2
Next Week: Discussions on
• Intrinsically Motivated Reinforcement Learning: An Evolutionary Perspective
– Berkin, Ryan, Kevin, Maytee
• Autonomous Helicopter Flight via Reinforcement Learning
– Anthony, Tim, Khine, Javon, Bidur
• Can switch groups if someone wants to trade
• Decide which day you’ll lead discussion
• Decide responsibilities
Reinforcement Learning
• Unknown transition model P and reward function R.
• Learning Component: Estimate P and R from data observed in the environment.
• Planning Component: Decide which actions to take to maximize reward.
• Exploration vs. Exploitation – GLIE (Greedy in the Limit with Infinite Exploration)
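GLIE can be realized with an ε-greedy rule whose ε decays with the state's visit count; a minimal sketch (the dict-based tables and the 1/k decay schedule are illustrative assumptions, not from the slides):

```python
import random

def glie_epsilon_greedy(Q, state, actions, visit_count):
    """Epsilon-greedy with epsilon = 1/k, where k counts visits to this state.
    Epsilon -> 0 (greedy in the limit) while every action is still tried
    infinitely often, which is exactly the GLIE condition."""
    visit_count[state] = visit_count.get(state, 0) + 1
    epsilon = 1.0 / visit_count[state]
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit
```

Early on the agent explores almost uniformly; as a state is visited more often, the rule converges to pure exploitation of the learned Q-values.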
Learning
• Model-based learning
– Learn the model, and do planning
– Requires less data, more computation
• Model-free learning
– Plan without learning an explicit model
– Requires a lot of data, less computation
Semi-MDP: When actions take time.
The Semi-MDP Bellman equation:
V(s) = max_a [ R(s, a) + Σ_{s′,N} γ^N P(s′, N | s, a) V(s′) ]
The Semi-MDP Q-Learning update:
Q(s, a) ← Q(s, a) + α [ r + γ^N max_{a′} Q(s′, a′) − Q(s, a) ]
where the experience tuple is ⟨s, a, s′, r, N⟩ and r is the accumulated
discounted reward while action a was executing (for N time steps).
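The Semi-MDP Q-learning step from an experience tuple ⟨s, a, s′, r, N⟩ can be sketched as follows (the dict-as-table layout is an assumption for illustration):

```python
def smdp_q_update(Q, s, a, s_next, r, N, actions, alpha=0.1, gamma=0.9):
    """One Semi-MDP Q-learning step from experience tuple <s, a, s', r, N>.
    r is the discounted reward accumulated while a executed for N steps,
    so the bootstrap term is discounted by gamma**N."""
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    target = r + (gamma ** N) * best_next
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (target - old)
    return Q[(s, a)]
```

The only change from ordinary Q-learning is the γ^N factor: the longer the action ran, the more the bootstrapped future value is discounted.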
Printerbot
• Paul G. Allen Center @ UW has 85000 sq ft space
• Each floor ~ 12000 sq ft
• Discretize the locations on a floor: ~12,000 cells.
• State space (without a map): very large!
• How do humans make decisions?
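The "very large" claim can be made concrete with a back-of-envelope count; the location figure is from the slide, while the two boolean task features are illustrative assumptions:

```python
# Flat state count for one floor (boolean features assumed for illustration).
locations = 12000            # discretized cells on one floor
have_printout = 2            # is the robot holding the printouts?
user_has_printout = 2        # has the user received them?
states = locations * have_printout * user_has_printout
print(states)                # 48000 flat states, before adding map uncertainty
```

Even this known-map count is sizeable, and folding in uncertainty about the map multiplies it further.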
1. The Mathematical Perspective: A Structure Paradigm
• S : Relational MDP
• A : Concurrent MDP
• P : Dynamic Bayes Nets
• R : Continuous-state MDP
• G : Conjunction of state variables
• V : Algebraic Decision Diagrams
• π : Decision List (RMDP)
2. Modular Decision Making
• Humans plan modularly at different granularities of understanding.
• Going out of one room is similar to going out of another room.
• Navigation steps do not depend on whether we have the print out or not.
3. Background Knowledge
• Classical Planners using additional control knowledge can scale up to larger problems.
– E.g., HTN planning, TLPlan
• What forms of control knowledge can we provide to our Printerbot?
– First pickup printouts, then deliver them.
– Navigation – consider rooms & hallways separately, etc.
A mechanism that exploits all three avenues : Hierarchies
1. Way to add a special (hierarchical) structure on different parameters of an MDP.
2. Draws from the intuition and reasoning in human decision making.
3. Way to provide additional control knowledge to the system.
Hierarchy
• Hierarchy of: Behavior, Skill, Module, SubTask, Macro-action, etc.
– picking the pages
– collision avoidance
– fetch pages phase
– walk in hallway
• HRL: RL with temporally extended actions
Option : MoveE until end of hallway
• Start: Any state in hallway
• Execute: policy as shown
• Terminate: when s is end of hallway
Options
• What “high-level actions” would be needed in a navigation domain?
• What would we need to define in order to replace actions (A)?
Options [Sutton, Precup, Singh’99]
• o = ⟨I_o, π_o, β_o⟩
• I_o : Set of states (I_o ⊆ S) in which o can be initiated
• π_o(s) : Policy (S → A) followed while o is executing
  – Can be a policy over lower-level options
• β_o(s) : Probability that o terminates in s
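The triple ⟨I_o, π_o, β_o⟩ maps directly onto a small data structure; a sketch using a hypothetical one-dimensional hallway (states 0–5), not a construction from the slides:

```python
from dataclasses import dataclass
from typing import Any, Callable, Set

@dataclass
class Option:
    """An option o = <I, pi, beta> in the sense of Sutton, Precup & Singh."""
    initiation: Set[Any]                 # I_o: states where o may start
    policy: Callable[[Any], Any]         # pi_o(s): action (or lower option)
    termination: Callable[[Any], float]  # beta_o(s): prob. of stopping in s

def run_option(option, s, step):
    """Execute the option from s until beta terminates it (deterministic here)."""
    assert s in option.initiation
    n = 0
    while option.termination(s) < 1.0:
        s = step(s, option.policy(s))
        n += 1
    return s, n

# "Move east until the end of the hallway" on states 0..5
move_east = Option(
    initiation={0, 1, 2, 3, 4},
    policy=lambda s: "E",
    termination=lambda s: 1.0 if s == 5 else 0.0,
)
final, n = run_option(move_east, 2, lambda s, a: s + 1 if a == "E" else s)
print(final, n)  # 5 3
```

From the caller's point of view, `run_option` looks like a single action that happened to take `n` time steps, which is what makes Semi-MDP learning over options applicable.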
Learning
• An option is a temporally extended action with a well-defined policy
• The set of options (O) replaces the set of actions (A)
• Learning occurs outside options
• Learning over options: Semi-MDP Q-Learning
[Diagram] Machine: MoveE + Collision Avoidance — a finite-state machine that repeatedly executes MoveE; at an obstacle, a Choose node calls sub-machine M1 (MoveW, MoveN, MoveN, Return) or M2 (MoveW, MoveS, MoveS, Return); at the end of the hallway it Returns.
Hierarchies of Abstract Machines
• A machine is a partial policy represented by a Finite State Automaton.
• Node types:
– Execute a ground action.
– Call a machine as subroutine.
– Choose the next node.
– Return to the calling machine.
Learning
• Learning occurs within machines, as machines are only partially defined.
• Flatten all machines out and consider joint states [s, m], where s is a world state and m a machine node: an MDP.
• reduce(S ∘ M): Consider only the states where the machine node is a choice node: a Semi-MDP.
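A toy interpreter makes the four node types concrete; the list-of-nodes encoding and the example machine below are assumptions for illustration, not the original HAM formalism:

```python
def run_machine(machine, s, world_step, choose):
    """Run a partial-policy machine: a list of (kind, payload) nodes.
    'act' executes a ground action, 'call' runs a sub-machine,
    'choice' asks the (learned) choice policy for the next node index,
    'stop' returns control to the caller. Only choice nodes are learned."""
    i = 0
    while True:
        kind, payload = machine[i]
        if kind == "act":
            s = world_step(s, payload)   # execute ground action
            i += 1
        elif kind == "call":
            s = run_machine(payload, s, world_step, choose)
            i += 1
        elif kind == "choice":
            i = choose(s, payload)       # payload: candidate next-node indices
        else:  # "stop"
            return s

# Hypothetical machine: keep moving east until state >= 3, then stop.
machine = [("act", "E"), ("choice", [0, 2]), ("stop", None)]
out = run_machine(machine, 0,
                  lambda s, a: s + 1 if a == "E" else s,
                  lambda s, opts: opts[1] if s >= 3 else opts[0])
print(out)  # 3
```

Everything except the `choose` function is fixed in advance by the designer; learning fills in decisions only at the choice nodes, which is why the reduced Semi-MDP ranges over choice states alone.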
Task Hierarchy: MAXQ Decomposition [Dietterich ’00]
Root
• Fetch
  – Navigate(loc)
  – Take: Extend-arm, Grab
• Deliver
  – Navigate(loc)
  – Give: Extend-arm, Release
• Navigate(loc) expands to: MoveN, MoveS, MoveE, MoveW
(Children of a task are unordered.)
MAXQ Decomposition
• Augment the state s by adding the subtask i : [s,i].
• Define C([s,i],j) as the reward received in i after j finishes.
• Q([s,Fetch], Navigate(prr)) = V([s,Navigate(prr)]) + C([s,Fetch], Navigate(prr))
• Express V in terms of C
• Learn C, instead of learning Q
(V term: reward received while navigating; C term: reward received after navigation.)
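The decomposition can be evaluated recursively; a sketch with hypothetical tables, where both children of Fetch are treated as primitive for brevity (the dict encodings and all numbers are assumptions):

```python
def maxq_value(i, s, C, children, primitive_reward):
    """V for subtask i in state s: for a primitive action it is the expected
    immediate reward; for a composite task it is the best child's
    Q(s, i, j) = V(j, s) + C([s, i], j), so only C needs to be learned."""
    if i not in children:                        # primitive action
        return primitive_reward.get((i, s), 0.0)
    return max(maxq_value(j, s, C, children, primitive_reward)
               + C.get((s, i, j), 0.0)
               for j in children[i])

children = {"Fetch": ["Navigate", "Take"]}       # hypothetical hierarchy slice
primitive_reward = {("Navigate", "s0"): -3.0, ("Take", "s0"): 0.0}
C = {("s0", "Fetch", "Navigate"): 10.0, ("s0", "Fetch", "Take"): 1.0}
print(maxq_value("Fetch", "s0", C, children, primitive_reward))  # 7.0
```

Notice that no Q table is stored anywhere: Q is always recomputed from V and C, which is the point of learning C instead of Q.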
1. State Abstraction
• Abstract state: A state with fewer state variables; different world states map to the same abstract state.
• If we can drop some state variables, we reduce the learning time considerably!
• We may use different abstract states for different macro-actions.
State Abstraction in MAXQ
• Relevance: Only some variables are relevant for a task.
  – Fetch: user-loc is irrelevant.
  – Navigate(printer-room): h-r-po, h-u-po, user-loc are irrelevant.
  – Fewer parameters for V at the lower levels.
• Funnelling: A subtask maps many states to a smaller set of states.
  – Fetch: All states map to h-r-po = true, loc = printer-room.
  – Fewer parameters for C at the higher levels.
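The payoff of relevance is multiplicative, since a value table costs the product of the kept variables' domain sizes; illustrative arithmetic (the domain sizes are assumptions in the spirit of the Printerbot numbers):

```python
# Table size = product of the kept variables' domain sizes (assumed domains).
loc, h_r_po, h_u_po, user_loc = 12000, 2, 2, 12000
full_table = loc * h_r_po * h_u_po * user_loc        # no abstraction
navigate_table = loc   # h-r-po, h-u-po, user-loc irrelevant for Navigate
print(full_table, navigate_table)  # 576000000 12000
```

Dropping three irrelevant variables shrinks the Navigate table by a factor of 48,000.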
State Abstraction in Options, HAM
• Options : Learning required only in states that are terminal states for some option.
• HAM : Original work has no abstraction.
– Extension: Three-way value decomposition*:
Q([s,m],n) = V([s,n]) + C([s,m],n) + Cex([s,m])
– Similar abstractions are employed.
*[Andre,Russell’02]
2. Optimality
• Options : Hierarchically optimal
  – Interrupt options to improve
• HAM : Hierarchically optimal
• MAXQ : Recursively optimal
  – Interrupt subtasks
  – Use pseudo-rewards
  – Iterate!
3. Language Expressiveness
• Options – Can only input a complete policy
• HAM – Can input a complete policy.
– Can input a task hierarchy.
– Can represent “amount of effort”.
– Later extended to partial programs.
• MAXQ – Cannot input a policy (full or partial)
4. Knowledge Requirements
• Options
– Requires complete specification of policy.
– One could learn option policies – given subtasks.
• HAM
– Medium requirements
• MAXQ
– Minimal requirements
5. Advanced Models
• Options : Concurrency
• HAM : Richer representation, Concurrency
• MAXQ : Continuous time, state, actions; Multi-agents, Average-reward.
• In general, more researchers have followed MAXQ
– Less input knowledge
– Value decomposition
How to choose an appropriate hierarchy
• Look at the available domain knowledge
  – If some behaviours are completely specified – Options
  – If some behaviours are partially specified – HAM
  – If little domain knowledge is available – MAXQ
• We can use all three to specify different behaviours in tandem.