Distributed Planning in Hierarchical Factored MDPs

Carlos Guestrin, Stanford University
Geoffrey Gordon, Carnegie Mellon University

Multiagent Coordination Examples

- Search and rescue
- Factory management
- Supply chain
- Firefighting
- Network routing
- Air traffic control
Agents access only local information → distributed control and distributed planning
Hierarchical Decomposition
[Diagram: part-of hierarchy for a car — Chassis is composed of Engine and Steering; Engine is composed of Cylinders, Injection, and Exhaust]

Subsystems can share variables
Each subsystem only observes its local variables
Parallel decomposition → exponential state space
Outline

- Object-based representation: hierarchical factored MDPs
- Distributed planning: message passing algorithm based on LP decomposition
- Hierarchical action selection mechanism: limited observability and communication
- Reusing plans and computation: exploit classes of objects
Basic Subsystem MDP

[Diagram: two-time-slice model of the "Speed control" subsystem — internal variables I, S transition to I', S'; an external variable G influences them; actions and reward R are shown]
Subsystem j is decomposed into:
- Internal variables Xj
- External variables Yj
- Actions Aj

Subsystem model:
- Rewards: Rj(Xj, Yj, Aj)
- Transitions: Pj(Xj' | Xj, Yj, Aj)

A subsystem can be modeled with any representation
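The subsystem interface above can be sketched as a small data structure. This is a minimal sketch: the class layout, variable names, and the toy "speed control" dynamics are illustrative assumptions, not the talk's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# A basic subsystem MDP: internal variables Xj, external variables Yj,
# local actions Aj, a local reward Rj(Xj, Yj, Aj), and a transition
# model Pj(Xj' | Xj, Yj, Aj).
@dataclass
class Subsystem:
    internal_vars: Sequence[str]                  # Xj, observed locally
    external_vars: Sequence[str]                  # Yj, owned by neighbors
    actions: Sequence[str]                        # Aj
    reward: Callable[[dict, dict, str], float]    # Rj(xj, yj, aj)
    transition: Callable[[dict, dict, dict, str], float]  # Pj(xj' | xj, yj, aj)

# A toy "speed control" subsystem (hypothetical variables and dynamics):
speed = Subsystem(
    internal_vars=['S'],                 # speed
    external_vars=['G', 'I'],            # gear, injection (shared)
    actions=['accelerate', 'coast'],
    reward=lambda xj, yj, aj: 1.0 if xj['S'] == 'target' else 0.0,
    transition=lambda xj2, xj, yj, aj:
        1.0 if (aj == 'accelerate') == (xj2['S'] == 'target') else 0.0,
)
print(speed.reward({'S': 'target'}, {'G': 2, 'I': 'on'}, 'coast'))  # 1.0
```

Because reward and transition are opaque callables, the subsystem really can be "modeled with any representation" behind this interface.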
Hierarchical Subsystem Tree
Subsystem tree:
- Nodes are subsystems
- Hierarchical decomposition
- Tree reward = sum of subsystem rewards

Consistent subsystem tree:
- Running intersection property
- Consistent dynamics

Lemma: a consistent subsystem tree yields a well-defined global MDP
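The running intersection property can be checked mechanically: for every non-root subsystem, the variables it shares with anything outside its subtree (its SepSet) must also appear in its parent. A minimal sketch, with hypothetical variable sets loosely modeled on the transmission/speed/cooling example:

```python
def subtree(tree, root):
    """Nodes in the subtree rooted at `root`; tree maps child -> parent."""
    children = {}
    for c, p in tree.items():
        children.setdefault(p, []).append(c)
    stack, seen = [root], set()
    while stack:
        n = stack.pop()
        seen.add(n)
        stack.extend(children.get(n, []))
    return seen

def satisfies_rip(tree, vars_of):
    """SepSet[j] (variables j shares with anything outside its subtree)
    must be contained in j's parent, for every non-root subsystem j."""
    for node, parent in tree.items():
        if parent is None:
            continue
        inside = subtree(tree, node)
        outside = set().union(*[vars_of[n] for n in vars_of if n not in inside])
        if not (vars_of[node] & outside) <= vars_of[parent]:
            return False
    return True

# Hypothetical variable sets for the transmission/speed/cooling tree:
tree = {'M1': None, 'M2': 'M1', 'M3': 'M1'}
vars_of = {'M1': {'G', 'C'}, 'M2': {'G', 'S', 'I'}, 'M3': {'C', 'F', 'T'}}
print(satisfies_rip(tree, vars_of))  # True
```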
[Diagram: example subsystem tree — M1 Transmission at the root, with children M2 Speed control (variables I, G, S) and M3 Cooling (variables F, T); SepSet[M2] and SepSet[M3] label the variables shared across each edge]
Relationship to Factored MDPs
[Diagram: a multiagent factored MDP over variables X1, X2, X3 with actions A1, A2, rewards R1, R2, R3, and basis functions h1, h2, carved into subsystems M1 and M2 joined by SepSet[M2]]
Multiagent factored MDP [Guestrin et al. '01] vs. hierarchical factored MDP:
- Representational power is equivalent: a hierarchical factored MDP is a multiagent factored MDP with a particular choice of basis functions
- New capabilities: fully distributed planning algorithm, reuse for knowledge representation, reuse of computation
- MDP counterpart to Object-Oriented Bayes Nets (OOBNs) [Koller and Pfeffer '97]
Planning for Hierarchical Factored MDPs
Action space: joint action a = {a1, ..., an} for all subsystems
State space: joint state x of the entire system
Reward function: total reward r

Action and state spaces are exponential in the number of subsystems

Exploit hierarchical structure:
- Efficient, distributed approximate planning algorithm
- Simple message passing approach
- Each subsystem accesses only its local model
- Each local model is solved by any standard MDP algorithm
Solving MDPs as LPs
Bellman constraint: if x →(a) y with reward r, then V(x) ≥ V(y) + r = Q(a, x); similarly for stochastic transitions. The optimal V* satisfies all Bellman constraints and is componentwise smallest:

min V(x) + V(y) + V(z) + V(g)
s.t. V(x) ≥ V(y) + 1
     V(y) ≥ V(g) + 3
     V(x) ≥ V(z) + 2
     V(z) ≥ V(g) + 1
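The four-constraint example above can be handed directly to an off-the-shelf LP solver. The sketch below assumes g is a goal state pinned at V(g) = 0 (otherwise the minimization is unbounded below); scipy is an implementation choice here, not part of the talk.

```python
import numpy as np
from scipy.optimize import linprog

# Variables: v = [V(x), V(y), V(z), V(g)]
c = [1, 1, 1, 1]                 # minimize V(x) + V(y) + V(z) + V(g)
# Each Bellman constraint V(s) >= V(s') + r becomes -V(s) + V(s') <= -r:
A_ub = [[-1,  1,  0,  0],        # V(x) >= V(y) + 1
        [ 0, -1,  0,  1],        # V(y) >= V(g) + 3
        [-1,  0,  1,  0],        # V(x) >= V(z) + 2
        [ 0,  0, -1,  1]]        # V(z) >= V(g) + 1
b_ub = [-1, -3, -2, -1]
A_eq, b_eq = [[0, 0, 0, 1]], [0] # assumption: g is the goal, so V(g) = 0
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(None, None)] * 4)
print(np.round(res.x, 6))        # componentwise-smallest feasible V
```

The minimizer puts every value at its lower bound: V(g) = 0, V(z) = 1, V(y) = 3, and V(x) = max(V(y) + 1, V(z) + 2) = 4.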
Linear combination of restricted-domain functions [Bellman et al. '63] [Schweitzer & Seidmann '85] [Tsitsiklis & Van Roy '96] [Koller & Parr '99, '00] [Guestrin et al. '01]
Decomposable Value Functions
Each hi is the status of a small part of a complex system, e.g.:
- Status of a machine and its neighbors
- Load on a machine

Must find w giving a good approximate value function
Well-designed hi → exponentially fewer parameters

V(x) ≈ Σi wi hi(x)
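The decomposition above is just a weighted sum of local features. A minimal sketch, where the basis functions, state variables, and weights are hypothetical examples echoing the "machine status" and "load" features:

```python
# V(x) ~= sum_i w_i * h_i(x), where each h_i inspects only a small
# part of the global state x (here a dict of variable assignments).
def v_approx(w, basis, x):
    return sum(wi * hi(x) for wi, hi in zip(w, basis))

# Hypothetical restricted-domain basis functions and weights:
basis = [
    lambda x: 1.0,                                   # constant offset
    lambda x: 1.0 if x['machine'] == 'ok' else 0.0,  # machine status
    lambda x: x['load'],                             # load on the machine
]
w = [0.5, 2.0, -0.1]
print(v_approx(w, basis, {'machine': 'ok', 'load': 3.0}))  # ~2.2
```

The payoff is in the parameter count: the weight vector w grows with the number of features, not with the exponential number of joint states.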
Approximate Linear Programming
To solve a subsystem tree MDP as an LP: the overall state is the cross-product of the subsystem states, so the Bellman LP has exponentially many constraints and variables → we need to approximate.

Write V(x) = V1(X1) + V2(X2) + ...

Minimize V1(X1) + V2(X2) + ...
s.t. V1(X1) + V2(X2) + ... ≥ V1(Y1) + V2(Y2) + ... + R1 + R2 + ...

- One variable Vi(Xi) for each state of each subsystem
- One constraint for every state and action
- Vi, Qi depend only on small sets of variables/actions

Generates polynomially-sized LPs for factored MDPs [Guestrin et al. '01]
Overview of Algorithm

Each subsystem solves a local (stand-alone) MDP

Each subsystem computes messages by solving a simple local LP:
- Sends a 'constraint message' to its parent
- Sends 'reward messages' to its children

Repeat until convergence
[Diagram: subsystem tree — each subsystem Mj sends a constraint message up to its parent and reward messages down to its children Mk, ..., Ml]
Stand-alone MDPs and Reward Messages

Subsystem MDP:
- State: (Xj, Yj)
- Actions: Aj
- Rewards: Rj(Xj, Yj, Aj)
- Transitions: Pj(Xj' | Xj, Yj, Aj)

Reward messages: Sj from the parent, Sk to each child

Stand-alone MDP:
- State: Xj
- Actions: (Aj, Yj)
- Rewards: Rj(Xj, Yj, Aj) − Sj + Σk Sk
- Transitions: Pj(Xj' | Xj, Yj, Aj)

Reward messages are over SepSets
Solve the stand-alone MDP using any algorithm
Obtain visitation frequencies of the resulting policy:
μj = discounted frequency of visits to each state-action pair
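For a fixed policy, the discounted visitation frequencies satisfy μ = α + γ PπT μ, so for a small explicit MDP they come from a single linear solve. The chain below is a hypothetical example (the transition matrix and start distribution are not from the talk):

```python
import numpy as np

# mu = alpha + gamma * P_pi^T mu  =>  (I - gamma * P_pi^T) mu = alpha
gamma = 0.9
alpha = np.array([1.0, 0.0, 0.0])       # initial state distribution
P_pi = np.array([[0.0, 1.0, 0.0],       # P_pi[s, s'] under the fixed policy
                 [0.0, 0.0, 1.0],
                 [0.0, 0.0, 1.0]])      # the last state is absorbing
mu = np.linalg.solve(np.eye(3) - gamma * P_pi.T, alpha)
print(mu)                               # ≈ [1.0, 0.9, 8.1]
```

A quick sanity check: the total discounted mass sums to 1/(1 − γ) = 10. It is these per-state(-action) frequencies that the subsystems must agree on over their shared variables.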
Visitation Frequencies

Dual LP variables: the discounted frequency of visits to each state-action pair

Subsystems must agree on the frequencies of shared variables → reward messages
Approximation → relaxed enforcement of these consistency constraints
Overview of Algorithm: Detailed

Each subsystem solves a local (stand-alone) MDP:
- Compute local visitation frequencies μj
- Add a constraint to the reward message LP

Each subsystem computes messages by solving a simple local LP:
- Sends a 'constraint message' to its parent: visitation frequencies for the SepSet variables
- Sends 'reward messages' to its children

Repeat until convergence
Reward Message LP

The LP yields the reward messages Sk for the children
Its dual yields mixing weights pj, pk that enforce consistent frequencies

Computing Reward Messages

Rows of the local constraint matrices correspond to the visitation frequencies and values of each policy visited by Mj; for each child Mk, the corresponding rows are those frequencies marginalized to SepSet[Mk]

Messages: the dual of the reward message LP generates mixed policies; pj and pk are mixing parameters that force parents and children to agree on the visitation of the SepSet
Convergence Result

The planning algorithm is a special case of nested Benders decomposition:
- One Benders split for each internal node N of the subsystem tree
- One subproblem is N itself
- The remaining subproblems are the subtrees for N's children (decomposed recursively)
- The master problem is to determine the reward messages

The result follows from the correctness of Benders decomposition:
in a finite number of iterations, the algorithm produces the best possible value function (i.e., the same as a centralized planner)
Hierarchical Action Selection

[Diagram: subsystem tree — action choices flow down from parent Mj to children Mk, ..., Ml; values of conditional policies flow up from children to parent]

Distributed planning obtains the value function
Distributed message passing obtains the action choice (policy):
- Each subsystem sends the value of its conditional policy to its parent
- Each subsystem sends an action choice to its children

Limited observability, limited communication
Reusing Models and Computation
Classes of objects: basic subsystems with the same rewards and transitions

Reuse in knowledge representation: a library of subsystems

Reusing computation:
- Compute the policy (visitation frequencies) for one subsystem, use it in all subsystems of the same class
- Compute messages for one subtree, use them in all equivalent subtrees
Related Work
Serial decompositions: one subsystem "active" at a time
- Kushner & Chen '74 (rooms in a maze)
- Dean & Lin, IJCAI-95 (combines with abstraction)
- Hierarchical RL is similar (MAXQ, HAM, etc.)

Parallel decompositions: more expressive (exponentially larger state space)
- Singh & Cohn, NIPS-98 (enumerates states)
- Meuleau et al., AAAI-98 (heuristic for resources)
Related Work
Dantzig-Wolfe / Benders decomposition:
- Dantzig '65
- First used for MDPs in Kushner & Chen '74
- We are the first to apply it to parallel subsystems

Variable elimination:
- Well-known from Bayes nets
- Guestrin, Koller & Parr, NIPS-01
Summary – Hierarchical Factored MDPs
Parallel decomposition → exponential state space

Efficient distributed planning algorithm:
- Solve local stand-alone MDPs with any algorithm
- Reward sharing coordinates the subsystem plans
- A simple message passing algorithm computes the rewards

Hierarchical action selection: limited communication, limited observability

Reuse for knowledge representation and computation

A general approach for modeling and planning in large stochastic systems