Questions about Final Project?
• Questions about grading the last project? Please come see me.
• Function Approximation and Human Control
• Average: 93% +/- 3.2
Next Week: Discussions on
• Intrinsically Motivated Reinforcement Learning: An Evolutionary Perspective
– Berkin, Ryan, Kevin, Maytee
• Autonomous Helicopter Flight via Reinforcement Learning
– Anthony, Tim, Khine, Javon, Bidur
• Can switch groups if someone wants to trade
• Decide which day you’ll lead discussion
• Decide responsibilities
Reinforcement Learning
• Unknown transition model P and reward function R.
• Learning Component: Estimate P and R from data observed in the environment.
• Planning Component: Decide which actions to take to maximize reward.
• Exploration vs. Exploitation – GLIE (Greedy in the Limit with Infinite Exploration)
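GLIE can be realized with an ε-greedy rule whose ε decays with the state's visit count; a minimal sketch (the dict-based tables and the 1/k decay schedule are illustrative assumptions, not from the slides):

```python
import random

def glie_epsilon_greedy(Q, state, actions, visit_count):
    """Epsilon-greedy with epsilon = 1/k, where k counts visits to this state.
    Epsilon -> 0 (greedy in the limit) while every action is still tried
    infinitely often, which is exactly the GLIE condition."""
    visit_count[state] = visit_count.get(state, 0) + 1
    epsilon = 1.0 / visit_count[state]
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit
```

Early on the agent explores almost uniformly; as a state is visited more often, the rule converges to pure exploitation of the learned Q-values.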
Learning
• Model-based learning
– Learn the model, and do planning
– Requires less data, more computation
• Model-free learning
– Plan without learning an explicit model
– Requires a lot of data, less computation
Semi-MDP: When actions take time.
The Semi-MDP Bellman equation:
V(s) = max_a [ R(s, a) + Σ_{s′,N} γ^N P(s′, N | s, a) V(s′) ]
The Semi-MDP Q-Learning update:
Q(s, a) ← Q(s, a) + α [ r + γ^N max_{a′} Q(s′, a′) − Q(s, a) ]
where the experience tuple is ⟨s, a, s′, r, N⟩ and r is the accumulated
discounted reward while action a was executing (for N time steps).
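The Semi-MDP Q-learning step from an experience tuple ⟨s, a, s′, r, N⟩ can be sketched as follows (the dict-as-table layout is an assumption for illustration):

```python
def smdp_q_update(Q, s, a, s_next, r, N, actions, alpha=0.1, gamma=0.9):
    """One Semi-MDP Q-learning step from experience tuple <s, a, s', r, N>.
    r is the discounted reward accumulated while a executed for N steps,
    so the bootstrap term is discounted by gamma**N."""
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    target = r + (gamma ** N) * best_next
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (target - old)
    return Q[(s, a)]
```

The only change from ordinary Q-learning is the γ^N factor: the longer the action ran, the more the bootstrapped future value is discounted.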
Printerbot
• Paul G. Allen Center @ UW has 85000 sq ft space
• Each floor ~ 12000 sq ft
• Discretize the locations on a floor: ~12,000 cells.
• State space (without a map): very large!
• How do humans make decisions?
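The "very large" claim can be made concrete with a back-of-envelope count; the location figure is from the slide, while the two boolean task features are illustrative assumptions:

```python
# Flat state count for one floor (boolean features assumed for illustration).
locations = 12000            # discretized cells on one floor
have_printout = 2            # is the robot holding the printouts?
user_has_printout = 2        # has the user received them?
states = locations * have_printout * user_has_printout
print(states)                # 48000 flat states, before adding map uncertainty
```

Even this known-map count is sizeable, and folding in uncertainty about the map multiplies it further.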
1. The Mathematical Perspective: A Structure Paradigm
• S : Relational MDP
• A : Concurrent MDP
• P : Dynamic Bayes Nets
• R : Continuous-state MDP
• G : Conjunction of state variables
• V : Algebraic Decision Diagrams
• π : Decision List (RMDP)
2. Modular Decision Making
• Humans plan modularly at different granularities of understanding.
• Going out of one room is similar to going out of another room.
• Navigation steps do not depend on whether we have the print out or not.
3. Background Knowledge
• Classical Planners using additional control knowledge can scale up to larger problems.
– E.g., HTN planning, TLPlan
• What forms of control knowledge can we provide to our Printerbot?
– First pickup printouts, then deliver them.
– Navigation – consider rooms & hallways separately, etc.
A mechanism that exploits all three avenues : Hierarchies
1. Way to add a special (hierarchical) structure on different parameters of an MDP.
2. Draws from the intuition and reasoning in human decision making.
3. Way to provide additional control knowledge to the system.
Hierarchy
• Hierarchy of: Behavior, Skill, Module, SubTask, Macro-action, etc.
– picking the pages
– collision avoidance
– fetch pages phase
– walk in hallway
• HRL: RL with temporally extended actions
Option : MoveE until end of hallway
• Start: Any state in hallway
• Execute: policy as shown
• Terminate: when s is end of hallway
Options
• What “high-level actions” would be needed in a navigation domain?
• What would we need to define in order to replace actions (A)?
Options [Sutton, Precup, Singh’99]
• o = ⟨I_o, π_o, β_o⟩
• I_o : Set of states (I_o ⊆ S) in which o can be initiated
• π_o(s) : Policy (S → A) followed while o is executing
  – Can be a policy over lower-level options
• β_o(s) : Probability that o terminates in s
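The triple ⟨I_o, π_o, β_o⟩ maps directly onto a small data structure; a sketch using a hypothetical one-dimensional hallway (states 0–5), not a construction from the slides:

```python
from dataclasses import dataclass
from typing import Any, Callable, Set

@dataclass
class Option:
    """An option o = <I, pi, beta> in the sense of Sutton, Precup & Singh."""
    initiation: Set[Any]                 # I_o: states where o may start
    policy: Callable[[Any], Any]         # pi_o(s): action (or lower option)
    termination: Callable[[Any], float]  # beta_o(s): prob. of stopping in s

def run_option(option, s, step):
    """Execute the option from s until beta terminates it (deterministic here)."""
    assert s in option.initiation
    n = 0
    while option.termination(s) < 1.0:
        s = step(s, option.policy(s))
        n += 1
    return s, n

# "Move east until the end of the hallway" on states 0..5
move_east = Option(
    initiation={0, 1, 2, 3, 4},
    policy=lambda s: "E",
    termination=lambda s: 1.0 if s == 5 else 0.0,
)
final, n = run_option(move_east, 2, lambda s, a: s + 1 if a == "E" else s)
print(final, n)  # 5 3
```

From the caller's point of view, `run_option` looks like a single action that happened to take `n` time steps, which is what makes Semi-MDP learning over options applicable.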
Learning
• An option is a temporally extended action with a well-defined policy
• The set of options (O) replaces the set of actions (A)
• Learning occurs outside options
• Learning over options: Semi-MDP Q-Learning
[Diagram] Machine: MoveE + Collision Avoidance — a finite-state machine that repeatedly executes MoveE; at an obstacle, a Choose node calls sub-machine M1 (MoveW, MoveN, MoveN, Return) or M2 (MoveW, MoveS, MoveS, Return); at the end of the hallway it Returns.
Hierarchies of Abstract Machines
• A machine is a partial policy represented by a Finite State Automaton.
• Node types:
– Execute a ground action.
– Call a machine as subroutine.
– Choose the next node.
– Return to the calling machine.
Learning
• Learning occurs within machines, as machines are only partially defined.
• Flatten all machines out and consider joint states [s, m], where s is a world state and m a machine node: an MDP.
• reduce(S ∘ M): Consider only the states where the machine node is a choice node: a Semi-MDP.
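A toy interpreter makes the four node types concrete; the list-of-nodes encoding and the example machine below are assumptions for illustration, not the original HAM formalism:

```python
def run_machine(machine, s, world_step, choose):
    """Run a partial-policy machine: a list of (kind, payload) nodes.
    'act' executes a ground action, 'call' runs a sub-machine,
    'choice' asks the (learned) choice policy for the next node index,
    'stop' returns control to the caller. Only choice nodes are learned."""
    i = 0
    while True:
        kind, payload = machine[i]
        if kind == "act":
            s = world_step(s, payload)   # execute ground action
            i += 1
        elif kind == "call":
            s = run_machine(payload, s, world_step, choose)
            i += 1
        elif kind == "choice":
            i = choose(s, payload)       # payload: candidate next-node indices
        else:  # "stop"
            return s

# Hypothetical machine: keep moving east until state >= 3, then stop.
machine = [("act", "E"), ("choice", [0, 2]), ("stop", None)]
out = run_machine(machine, 0,
                  lambda s, a: s + 1 if a == "E" else s,
                  lambda s, opts: opts[1] if s >= 3 else opts[0])
print(out)  # 3
```

Everything except the `choose` function is fixed in advance by the designer; learning fills in decisions only at the choice nodes, which is why the reduced Semi-MDP ranges over choice states alone.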
Task Hierarchy: MAXQ Decomposition [Dietterich ’00]
Root
• Fetch
  – Navigate(loc)
  – Take: Extend-arm, Grab
• Deliver
  – Navigate(loc)
  – Give: Extend-arm, Release
• Navigate(loc) expands to: MoveN, MoveS, MoveE, MoveW
(Children of a task are unordered.)
MAXQ Decomposition
• Augment the state s by adding the subtask i : [s,i].
• Define C([s,i],j) as the reward received in i after j finishes.
• Q([s,Fetch], Navigate(prr)) = V([s,Navigate(prr)]) + C([s,Fetch], Navigate(prr))
• Express V in terms of C
• Learn C, instead of learning Q
(V term: reward received while navigating; C term: reward received after navigation.)
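The decomposition can be evaluated recursively; a sketch with hypothetical tables, where both children of Fetch are treated as primitive for brevity (the dict encodings and all numbers are assumptions):

```python
def maxq_value(i, s, C, children, primitive_reward):
    """V for subtask i in state s: for a primitive action it is the expected
    immediate reward; for a composite task it is the best child's
    Q(s, i, j) = V(j, s) + C([s, i], j), so only C needs to be learned."""
    if i not in children:                        # primitive action
        return primitive_reward.get((i, s), 0.0)
    return max(maxq_value(j, s, C, children, primitive_reward)
               + C.get((s, i, j), 0.0)
               for j in children[i])

children = {"Fetch": ["Navigate", "Take"]}       # hypothetical hierarchy slice
primitive_reward = {("Navigate", "s0"): -3.0, ("Take", "s0"): 0.0}
C = {("s0", "Fetch", "Navigate"): 10.0, ("s0", "Fetch", "Take"): 1.0}
print(maxq_value("Fetch", "s0", C, children, primitive_reward))  # 7.0
```

Notice that no Q table is stored anywhere: Q is always recomputed from V and C, which is the point of learning C instead of Q.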
1. State Abstraction
• Abstract state: A state with fewer state variables; different world states map to the same abstract state.
• If we can drop some state variables, we reduce the learning time considerably!
• We may use different abstract states for different macro-actions.
State Abstraction in MAXQ
• Relevance: Only some variables are relevant for a task.
  – Fetch: user-loc is irrelevant.
  – Navigate(printer-room): h-r-po, h-u-po, user-loc are irrelevant.
  – Fewer parameters for V at the lower levels.
• Funnelling: A subtask maps many states to a smaller set of states.
  – Fetch: All states map to h-r-po = true, loc = printer-room.
  – Fewer parameters for C at the higher levels.
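The payoff of relevance is multiplicative, since a value table costs the product of the kept variables' domain sizes; illustrative arithmetic (the domain sizes are assumptions in the spirit of the Printerbot numbers):

```python
# Table size = product of the kept variables' domain sizes (assumed domains).
loc, h_r_po, h_u_po, user_loc = 12000, 2, 2, 12000
full_table = loc * h_r_po * h_u_po * user_loc        # no abstraction
navigate_table = loc   # h-r-po, h-u-po, user-loc irrelevant for Navigate
print(full_table, navigate_table)  # 576000000 12000
```

Dropping three irrelevant variables shrinks the Navigate table by a factor of 48,000.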
State Abstraction in Options, HAM
• Options : Learning required only in states that are terminal states for some option.
• HAM : Original work has no abstraction.
– Extension: Three-way value decomposition*:
Q([s,m],n) = V([s,n]) + C([s,m],n) + Cex([s,m])
– Similar abstractions are employed.
*[Andre,Russell’02]
2. Optimality
• Options : Hierarchically optimal
  – Interrupt options to improve
• HAM : Hierarchically optimal
• MAXQ : Recursively optimal
  – Interrupt subtasks
  – Use pseudo-rewards
  – Iterate!
3. Language Expressiveness
• Options – Can only input a complete policy
• HAM – Can input a complete policy.
– Can input a task hierarchy.
– Can represent “amount of effort”.
– Later extended to partial programs.
• MAXQ – Cannot input a policy (full or partial)
4. Knowledge Requirements
• Options
– Requires complete specification of policy.
– One could learn option policies – given subtasks.
• HAM
– Medium requirements
• MAXQ
– Minimal requirements
5. Advanced Models
• Options : Concurrency
• HAM : Richer representation, Concurrency
• MAXQ : Continuous time, state, actions; Multi-agents, Average-reward.
• In general, more researchers have followed MAXQ
– Less input knowledge
– Value decomposition
How to choose an appropriate hierarchy
• Look at the available domain knowledge
  – If some behaviours are completely specified – Options
  – If some behaviours are partially specified – HAM
  – If little domain knowledge is available – MAXQ
• We can use all three to specify different behaviours in tandem.