A Principled Information Valuation for Communications During Multi-Agent Coordination
Simon A. Williamson, Enrico H. Gerding, Nicholas R. Jennings
School of Electronics & Computer Science
University of Southampton
Outline
Introduction
Decentralised Decision Processes
Communication Valuations
Policy Generation
RoboCupRescue as a dec_POMDP_Valued_com
Communication Policies
Unrestricted Communication
Restricted Communication
Future Work
Introduction
Communication is a restricted resource
Team members must evaluate the value of a communication, balanced against the cost of communicating
Decision-theoretic approach combined with information theory to value communications

RoboCup Rescue
A team of ambulance agents must rescue civilians
Uncertainties in the location of civilians and their status - they may be trapped and require many ambulances to dig them out
Uncertainties in the location of team mates and in their observations and activities
Communication is used to coordinate activities
In some regions of the map, agents cannot communicate
Decentralised Decision Processes
A decentralised extension of the standard POMDP formalisation with communication, as suggested in PT02, XLZ01 and ZG03
dec_POMDP_com (Decentralised POMDP with Communication), from ZG03
Utilises a communication substage - a solution defines an action policy and a communication policy
dec_POMDP_com for 2 agents
dec_POMDP_com = ⟨S, A1, A2, Σ, CΣ, P, R, Ω1, Ω2, O⟩

S is the state space.
A1 and A2 are the action spaces of each agent.
Σ is the communication alphabet, with σi a member sent by agent i; εσ is the null communication.
CΣ is the cost of communicating; it is 0 for εσ.
P is the transition function, P(s, a1, a2, s') ∈ [0, 1].
R is the reward function, R(s, a1, σ1, a2, σ2, s') ∈ ℝ.
Ω1 and Ω2 are the observation spaces of each agent.
O is the observation function, O(s, a1, a2, s', σ1, σ2) ∈ [0, 1].
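As a concrete (if schematic) reading of the tuple, here is a minimal Python sketch of a container for its components; the field types and callable signatures are our illustrative assumptions, not part of the formal definition.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class DecPOMDPCom:
    """Illustrative container for the dec_POMDP_com tuple (2 agents)."""
    states: Sequence[str]                 # S: the state space
    actions_1: Sequence[str]              # A1: agent 1's action space
    actions_2: Sequence[str]              # A2: agent 2's action space
    alphabet: Sequence[str]               # Sigma: communication alphabet (incl. null message)
    comm_cost: Callable[[str], float]     # C_Sigma: cost of a message, 0 for the null message
    transition: Callable[..., float]      # P(s, a1, a2, s') -> probability in [0, 1]
    reward: Callable[..., float]          # R(s, a1, sigma1, a2, sigma2, s') -> real reward
    observations_1: Sequence[str]         # Omega1: agent 1's observation space
    observations_2: Sequence[str]         # Omega2: agent 2's observation space
    observation_fn: Callable[..., float]  # O(s, a1, a2, s', sigma1, sigma2) -> prob in [0, 1]
```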
Communication Valuations
Our model calls for an explicit valuation of communications
The exact value of a communication can be calculated using the decentralised POMDP model
Other work models team knowledge in a Bayes net and uses that to generate a valuation
Teamwork models such as STEAM can also give a valuation based on the cost of mis-coordination
All of these approaches involve modelling the team and information propagation, leading to an explosion of state variables
dec_POMDP_Valued_com
Agents cannot always communicate in parallel with acting, so communication is treated as an action like any other
There is an explicit reward function for communications which approximates the change in expected reward from communicating
A weighting between the two reward functions controls this approximation
Avoids using the policy generation stage to calculate the exact value of communicating
dec_POMDP_Valued_com for 2 agents
dec_POMDP_Valued_com = ⟨S, Σ, A1, A2, CΣ, P, R, Rc, Rp, Ω1, Ω2, O⟩

S is the state space.
A1 and A2 are the action spaces of each agent.
Σ is the communication alphabet, with σi a member sent by agent i; Σ = Ω1 and εσ is the null communication.
CΣ is the cost of communicating; it is 0 for εσ.
P is the transition function, P(s, a1, a2, s') ∈ [0, 1].
Rp is the problem reward function, Rp(s, a1, σ1, a2, σ2, s') ∈ ℝ.
Rc is the communication reward function, Rc(bh, b1) ∈ ℝ.
R is the combined reward function, R = αRp + (1−α)Rc.
Ω1 and Ω2 are the observation spaces of each agent.
O is the observation function, O(s, a1, a2, s', σ1, σ2) ∈ [0, 1].
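A minimal sketch of the combined reward, assuming the two component rewards have already been computed for a decision point; the function name and the assertion are ours, not the paper's.

```python
def combined_reward(r_p: float, r_c: float, alpha: float) -> float:
    """R = alpha * Rp + (1 - alpha) * Rc.

    alpha = 1 ignores the communication reward entirely;
    alpha = 0 rewards communication only.
    """
    assert 0.0 <= alpha <= 1.0
    return alpha * r_p + (1.0 - alpha) * r_c
```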
Using Information Theory
We approximate this calculation using techniques from information theory
This approach follows from work in sensor networks, where valuations are derived from Fisher information and Kalman filters
It works easily in sensor networks because the problem function is itself information-theoretic, so individual agents can use it directly
In our problem, information theory is used as an approximation of the cost of mis-coordination (having different information)
Because of this, the value of information must be normalised with respect to the cost of mis-coordinating in the actual problem
Pros: efficient calculation; reduces the complexity of policy generation
Cons: the normalisation is not easy - we describe an empirical validation of this technique
The dec_POMDP_Valued_com again…
We use the KL divergence, as it is efficient when combined with Bayesian updating in the POMDP:

$$R_c(b_h, b_1) = \mathrm{Value}(b_h, b_1) = N \, D_{KL}(b_h \parallel b_1) = N \sum_i b_h(i) \log\frac{b_h(i)}{b_1(i)}$$

b1 is the agent's current belief state, bh is the belief state given the observation history, and N is a normalisation factor
This gives the information content of the observation history with respect to the agent's current beliefs
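A short Python sketch of this valuation, assuming both beliefs are discrete distributions over the same support; the epsilon guard against zero probabilities is our addition.

```python
import math

def communication_value(b_h, b_1, n_factor, eps=1e-12):
    """Rc(bh, b1) = N * D_KL(bh || b1) = N * sum_i bh(i) * log(bh(i) / b1(i)).

    b_h: belief implied by the as-yet-unsent observation history
    b_1: the agent's current (pre-communication) belief
    n_factor: the normalisation factor N
    """
    kl = sum(p * math.log((p + eps) / (q + eps)) for p, q in zip(b_h, b_1))
    return n_factor * kl

# The more the observation history would shift the belief,
# the more the communication is worth:
print(communication_value([0.5, 0.5], [0.5, 0.5], n_factor=1.0))  # ~0.0
print(communication_value([0.9, 0.1], [0.5, 0.5], n_factor=1.0))  # ~0.37
```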
Policy Generation
RoboCupRescue is a large problem with real-time constraints on action selection
We use an online policy generation algorithm based on RTBSS (Real-Time Belief State Search)
A tree is constructed at each action selection point:

$$\delta(b,d) = \begin{cases} U(b) & \text{if } d = 0 \\ \max_{a \in A}\left[ R_B(b,a) + \gamma \sum_{o \in \Omega} \Pr(o \mid b, a)\, \delta(\tau(b,a,o),\, d-1) \right] & \text{otherwise} \end{cases}$$
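A minimal Python sketch of the δ(b, d) recursion; U, R_B, Pr and τ are passed in as placeholders for the problem-specific utility estimate, belief-state reward, observation likelihood and belief update.

```python
def rtbss(b, d, actions, observations, U, R_B, Pr, tau, gamma=0.95):
    """delta(b, d): depth-bounded expectimax search over belief states.

    Returns the estimated value of belief b with d lookahead steps,
    using the utility estimate U(b) at the leaves.
    """
    if d == 0:
        return U(b)
    best = float("-inf")
    for a in actions:
        # Immediate reward plus expected discounted value over observations.
        value = R_B(b, a) + gamma * sum(
            Pr(o, b, a) * rtbss(tau(b, a, o), d - 1,
                                actions, observations, U, R_B, Pr, tau, gamma)
            for o in observations
        )
        best = max(best, value)
    return best
```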
Policy Generation
We modify the algorithm to consider joint actions
If the agents have the same knowledge then they will calculate coordinated actions
A second modification leverages the dual reward function model for communication actions
RoboCupRescue (RCR) as a dec_POMDP_Valued_com
State indicates whether each building contains trapped civilians or not, and the locations of the ambulances.
Actions are behaviours - movement, rescue, load/unload and communicate.
Reward is given for emptying buildings.
Observations are the local sensing capabilities of the ambulances.
Communication is the history of observations since the last communication action.
Example
Communication Policies
Full – communicates all the time when possible (with no cost).
Zero – never communicates.
Selective – communication is an action in the policy computation and has a valuation which increases with the time since the last communication.
Valued – communication is an action in the policy computation, and a reward is given based on the communication sent.
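As an illustration of how the four policies differ, the sketch below collapses each one to a simple decision rule; in the paper, Selective and Valued actually enter the policy computation as actions, and the helper arguments and threshold here are our assumptions, not values from the paper.

```python
def should_communicate(policy, can_communicate, steps_since_last,
                       value_of_message, threshold=0.5):
    """Illustrative (simplified) decision rules for the four policies."""
    if not can_communicate:
        return False
    if policy == "full":
        return True                                # always, when possible
    if policy == "zero":
        return False                               # never
    if policy == "selective":
        return steps_since_last * 0.1 > threshold  # valuation grows with silence
    if policy == "valued":
        return value_of_message > threshold        # KL-based reward drives the choice
    raise ValueError(f"unknown policy: {policy}")
```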
Valued Communication
The communication valuation uses the KL divergence; the problem reward uses a different function
The communication reward is mixed with the RCR reward function
This allows us to experiment with the relative importance of communicating:

R(s) = (1−α)Rc + αRp
Unrestricted Communication
Comparing the 4 communication policies: Full, Zero, Selective, Valued
Interested in the percentage of civilians saved over the course of the simulation and at the end
Averaged over 30 runs
No restrictions on communication: agents can always communicate at any point on the map
Results 1
Full does the best
Zero and Selective do similarly badly; neither can complete the problem
Valued Communication Results
We compare performance at the end of the simulation
Alpha is varied between 0 and 1
At α = 0 the agents only communicate; at α = 1 they never communicate
Restricted Communication
Comparing 3 communication policies: Full, Zero, Valued
Interested in the percentage of civilians saved over the course of the simulation and at the end
Averaged over 30 runs
Restrictions on communication: areas of the rescue map are defined as 'Blackout' regions where communication is not possible
0%, 25%, 50%, 75% and 99% blackouts
Results 2
Valued is only marginally affected by communication restrictions up to 75%
It can do better than a naive policy which only communicates when possible
The shape change reflects the different value of information now that it is more expensive
The biggest drop is at 99%, because of the much greater time needed to communicate
Conclusion
Information valuation is an efficient mechanism for valuing a communication resource
It can adapt to a restricted communication environment with only a minimal drop in performance
Future Work
Generalise the information-theoretic valuation: calculate/learn the normalisation and mixture
Investigate other types of communication restrictions – restricted bandwidth etc.
Expand to larger agent teams
References
PT02: David V. Pynadath and Milind Tambe. Multiagent Teamwork: Analyzing the Optimality and Complexity of Key Theories and Models. AAMAS 2002.
XLZ01: Ping Xuan, Victor Lesser and Shlomo Zilberstein. Communication Decisions in Multi-agent Cooperation: Model and Experiments. 5th International Conference on Autonomous Agents, 2001.
ZG03: Shlomo Zilberstein and Claudia V. Goldman. Optimizing Information Exchange in Cooperative Multi-agent Systems. AAMAS 2003.
Any Questions?