A Principled Information Valuation for Communications During Multi-Agent Coordination
Simon A. Williamson, Enrico H. Gerding, Nicholas R. Jennings
School of Electronics & Computer Science
University of Southampton
Outline
Introduction
Decentralised Decision Processes
Communication Valuations
Policy Generation
RoboCupRescue as a dec_POMDP_Valued_com
Communication Policies
Unrestricted Communication
Restricted Communication
Future Work
Introduction
Communication is a restricted resource
Team members must evaluate the value of a communication, balanced against the cost of communicating
Decision-theoretic approach combined with information theory to value communications

RoboCup Rescue
A team of ambulance agents must rescue civilians
Uncertainties in the location of civilians and their status - they may be trapped and require many ambulances to dig them out
Uncertainties in the location of team mates and in their observations and activities
Communication is used to coordinate activities
In some regions of the map, agents cannot communicate
Decentralised Decision Processes
A decentralised extension of the standard POMDP formalisation with communication, as suggested in PT02, XLZ01 and ZG03
dec_POMDP_com (Decentralised POMDP with Communication), from ZG03
Utilises a communication substage - a solution defines an action policy and a communication policy
dec_POMDP_com for 2 agents
dec_POMDP_com = ⟨S, A1, A2, Σ, CΣ, P, R, Ω1, Ω2, O⟩

S is the state space.
A1 and A2 are the action spaces of each agent.
Σ is the communication alphabet, with σi a member sent by agent i; εσ is the null communication.
CΣ is the cost of communicating; it is 0 for εσ.
P is the transition function, P(s, a1, a2, s') ∈ [0, 1].
R is the reward function, R(s, a1, σ1, a2, σ2, s') ∈ ℝ.
Ω1 and Ω2 are the observation spaces of each agent.
O is the observation function, O(s, a1, a2, s', σ1, σ2) ∈ [0, 1].
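As a concrete (if schematic) reading of the tuple, here is a minimal Python sketch of a container for its components; the field types and callable signatures are our illustrative assumptions, not part of the formal definition.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class DecPOMDPCom:
    """Illustrative container for the dec_POMDP_com tuple (2 agents)."""
    states: Sequence[str]                 # S: the state space
    actions_1: Sequence[str]              # A1: agent 1's action space
    actions_2: Sequence[str]              # A2: agent 2's action space
    alphabet: Sequence[str]               # Sigma: communication alphabet (incl. null message)
    comm_cost: Callable[[str], float]     # C_Sigma: cost of a message, 0 for the null message
    transition: Callable[..., float]      # P(s, a1, a2, s') -> probability in [0, 1]
    reward: Callable[..., float]          # R(s, a1, sigma1, a2, sigma2, s') -> real reward
    observations_1: Sequence[str]         # Omega1: agent 1's observation space
    observations_2: Sequence[str]         # Omega2: agent 2's observation space
    observation_fn: Callable[..., float]  # O(s, a1, a2, s', sigma1, sigma2) -> prob in [0, 1]
```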
Communication Valuations
Our model calls for an explicit valuation of communications
The exact value of a communication can be calculated using the decentralised POMDP model
Other work models team knowledge in a Bayes net and uses that to generate a valuation
Teamwork models such as STEAM can also give a valuation based on the cost of mis-coordination
All of these approaches involve modelling the team and information propagation, leading to an explosion of state variables
dec_POMDP_Valued_com
Agents cannot always communicate in parallel with acting, so communication is treated as an action like any other
There is an explicit reward function for communications which approximates the change in expected reward from communicating
A weighting between the two reward functions controls this approximation
Avoids using the policy generation stage to calculate the exact value of communicating
dec_POMDP_Valued_com for 2 agents
dec_POMDP_Valued_com = ⟨S, Σ, A1, A2, CΣ, P, R, Rc, Rp, Ω1, Ω2, O⟩

S is the state space.
A1 and A2 are the action spaces of each agent.
Σ is the communication alphabet, with σi a member sent by agent i; Σ = Ω1 and εσ is the null communication.
CΣ is the cost of communicating; it is 0 for εσ.
P is the transition function, P(s, a1, a2, s') ∈ [0, 1].
Rp is the problem reward function, Rp(s, a1, σ1, a2, σ2, s') ∈ ℝ.
Rc is the communication reward function, Rc(bh, b1) ∈ ℝ.
R is the combined reward function, R = αRp + (1−α)Rc.
Ω1 and Ω2 are the observation spaces of each agent.
O is the observation function, O(s, a1, a2, s', σ1, σ2) ∈ [0, 1].
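A minimal sketch of the combined reward, assuming the two component rewards have already been computed for a decision point; the function name and the assertion are ours, not the paper's.

```python
def combined_reward(r_p: float, r_c: float, alpha: float) -> float:
    """R = alpha * Rp + (1 - alpha) * Rc.

    alpha = 1 ignores the communication reward entirely;
    alpha = 0 rewards communication only.
    """
    assert 0.0 <= alpha <= 1.0
    return alpha * r_p + (1.0 - alpha) * r_c
```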
Using Information Theory
We approximate this calculation using techniques from information theory
This approach follows from work in sensor networks, where valuations are derived from Fisher information and Kalman filters
It works easily in sensor networks because the problem function is itself information-theoretic, so individual agents can use it directly
In our problem, information theory is used as an approximation of the cost of mis-coordination (having different information)
Because of this, the value of information must be normalised with respect to the cost of mis-coordinating in the actual problem
Pros: efficient calculation; reduces the complexity of policy generation
Cons: the normalisation is not easy - we describe an empirical validation of this technique
The dec_POMDP_Valued_com again…
We use the KL divergence, as it is efficient when combined with Bayesian updating in the POMDP:

$$R_c(b_h, b_1) = \mathrm{Value}(b_h, b_1) = N \, D_{KL}(b_h \parallel b_1) = N \sum_i b_h(i) \log\frac{b_h(i)}{b_1(i)}$$

b1 is the agent's current belief state, bh is the belief state given the observation history, and N is a normalisation factor
This gives the information content of the observation history with respect to the agent's current beliefs
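A short Python sketch of this valuation, assuming both beliefs are discrete distributions over the same support; the epsilon guard against zero probabilities is our addition.

```python
import math

def communication_value(b_h, b_1, n_factor, eps=1e-12):
    """Rc(bh, b1) = N * D_KL(bh || b1) = N * sum_i bh(i) * log(bh(i) / b1(i)).

    b_h: belief implied by the as-yet-unsent observation history
    b_1: the agent's current (pre-communication) belief
    n_factor: the normalisation factor N
    """
    kl = sum(p * math.log((p + eps) / (q + eps)) for p, q in zip(b_h, b_1))
    return n_factor * kl

# The more the observation history would shift the belief,
# the more the communication is worth:
print(communication_value([0.5, 0.5], [0.5, 0.5], n_factor=1.0))  # ~0.0
print(communication_value([0.9, 0.1], [0.5, 0.5], n_factor=1.0))  # ~0.37
```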
Policy Generation
RoboCupRescue is a large problem with real-time constraints on action selection
We use an online policy generation algorithm based on RTBSS (Real-Time Belief State Search)
A tree is constructed at each action selection point:

$$\delta(b,d) = \begin{cases} U(b) & \text{if } d = 0 \\ \max_{a \in A}\left[ R_B(b,a) + \gamma \sum_{o \in \Omega} \Pr(o \mid b, a)\, \delta(\tau(b,a,o),\, d-1) \right] & \text{otherwise} \end{cases}$$
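A minimal Python sketch of the δ(b, d) recursion; U, R_B, Pr and τ are passed in as placeholders for the problem-specific utility estimate, belief-state reward, observation likelihood and belief update.

```python
def rtbss(b, d, actions, observations, U, R_B, Pr, tau, gamma=0.95):
    """delta(b, d): depth-bounded expectimax search over belief states.

    Returns the estimated value of belief b with d lookahead steps,
    using the utility estimate U(b) at the leaves.
    """
    if d == 0:
        return U(b)
    best = float("-inf")
    for a in actions:
        # Immediate reward plus expected discounted value over observations.
        value = R_B(b, a) + gamma * sum(
            Pr(o, b, a) * rtbss(tau(b, a, o), d - 1,
                                actions, observations, U, R_B, Pr, tau, gamma)
            for o in observations
        )
        best = max(best, value)
    return best
```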
Policy Generation
We modify the algorithm to consider joint actions
If the agents have the same knowledge then they will calculate coordinated actions
A second modification leverages the dual reward function model for communication actions
RoboCupRescue (RCR) as a dec_POMDP_Valued_com
State indicates whether each building contains trapped civilians or not, and the locations of the ambulances.
Actions are behaviours - movement, rescue, load/unload and communicate.
Reward is given for emptying buildings.
Observations are the local sensing capabilities of the ambulances.
Communication is the history of observations since the last communication action.
Example
Communication Policies
Full – communicates all the time when possible (with no cost).
Zero – never communicates.
Selective – communication is an action in the policy computation and has a valuation which increases with the time since the last communication.
Valued – communication is an action in the policy computation, and a reward is given based on the communication sent.
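As an illustration of how the four policies differ, the sketch below collapses each one to a simple decision rule; in the paper, Selective and Valued actually enter the policy computation as actions, and the helper arguments and threshold here are our assumptions, not values from the paper.

```python
def should_communicate(policy, can_communicate, steps_since_last,
                       value_of_message, threshold=0.5):
    """Illustrative (simplified) decision rules for the four policies."""
    if not can_communicate:
        return False
    if policy == "full":
        return True                                # always, when possible
    if policy == "zero":
        return False                               # never
    if policy == "selective":
        return steps_since_last * 0.1 > threshold  # valuation grows with silence
    if policy == "valued":
        return value_of_message > threshold        # KL-based reward drives the choice
    raise ValueError(f"unknown policy: {policy}")
```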
Valued Communication
The communication valuation uses the KL divergence; the problem reward uses a different function
The communication reward is mixed with the RCR reward function
This allows us to experiment with the relative importance of communicating:

R(s) = (1−α)Rc + αRp
Unrestricted Communication
Comparing the 4 communication policies: Full, Zero, Selective, Valued
Interested in the percentage of civilians saved over the course of the simulation and at the end
Averaged over 30 runs
No restrictions on communication: agents can always communicate at any point on the map
Results 1
Full does the best
Zero and Selective do similarly badly; neither can complete the problem
Valued Communication Results
We compare performance at the end of the simulation
Alpha is varied between 0 and 1
At α = 0 the agents only communicate; at α = 1 they never communicate
Restricted Communication
Comparing 3 communication policies: Full, Zero, Valued
Interested in the percentage of civilians saved over the course of the simulation and at the end
Averaged over 30 runs
Restrictions on communication: areas of the rescue map are defined as 'Blackout' regions where communication is not possible
0%, 25%, 50%, 75% and 99% blackouts
Results 2
Valued is only marginally affected by communication restrictions up to 75%
It can do better than a naive policy which only communicates when possible
The shape change reflects the different value of information now that it is more expensive
The biggest drop is at 99%, because of the much greater time needed to communicate
Conclusion
Information valuation is an efficient mechanism for valuing a communication resource
It can adapt to a restricted communication environment with only a minimal drop in performance
Future Work
Generalise the information-theoretic valuation: calculate/learn the normalisation and mixture
Investigate other types of communication restrictions – restricted bandwidth etc.
Expand to larger agent teams
References
PT02: David V. Pynadath and Milind Tambe. Multiagent Teamwork: Analyzing the Optimality and Complexity of Key Theories and Models. AAMAS 2002.
XLZ01: Ping Xuan, Victor Lesser and Shlomo Zilberstein. Communication Decisions in Multi-agent Cooperation: Model and Experiments. 5th International Conference on Autonomous Agents, 2001.
ZG03: Shlomo Zilberstein and Claudia V. Goldman. Optimizing Information Exchange in Cooperative Multi-agent Systems. AAMAS 2003.
Any Questions?