Counterfactual Multi-Agent Policy Gradients
Shimon Whiteson
Dept. of Computer Science
University of Oxford
joint work with Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, and Nantas Nardelli
July 6, 2017
Shimon Whiteson (Oxford) Counterfactual Policy Gradients July 6, 2017 1 / 31
Single-Agent Paradigm
Multi-Agent Paradigm
Multi-Agent Systems are Everywhere
Types of Multi-Agent Systems
Cooperative:
  - Shared team reward
  - Coordination problem

Competitive:
  - Zero-sum games
  - Individual opposing rewards
  - Minimax equilibria

Mixed:
  - General-sum games
  - Nash equilibria
  - What is the question?
Coordination Problems are Everywhere
Multi-Agent MDP
All agents see the global state s
Individual actions: u^a ∈ U

State transitions: P(s′ | s, u) : S × U × S → [0, 1]

Shared team reward: r(s, u) : S × U → R
Equivalent to an MDP with a factored action space
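The factored joint action space is exactly what makes centralised control expensive: with n agents, each with |U| individual actions, there are |U|^n joint actions. A toy illustration (the numbers are hypothetical):

```python
# Joint action space size for n agents, each with |U| individual actions.
# Enumerating joint actions for centralised greedification quickly
# becomes infeasible as the number of agents grows.
n_actions = 10  # |U|, hypothetical per-agent action count
for n_agents in (2, 5, 8):
    print(n_agents, n_actions ** n_agents)
```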
Dec-POMDP
Observation function: O(s, a) : S × A → Z

Action-observation history: τ^a ∈ T ≡ (Z × U)*

Decentralised policies: π^a(u^a | τ^a) : T × U → [0, 1]
Natural decentralisation: communication and sensory constraints
Artificial decentralisation: coping with joint action space
Centralised learning of decentralised policies
Key Challenges
Curse of dimensionality in actions
Multi-agent credit assignment
Modelling other agents’ information state
Single-Agent Policy Gradient Methods
Optimise π_θ with gradient ascent on the expected return:

J(θ) = E_{s∼ρ^π(s), u∼π_θ(s,·)}[r(s, u)]
Good when:
  - Greedification is hard, e.g., continuous actions
  - The policy is simpler than the value function
Policy gradient theorem [Sutton et al. 2000]:
∇_θ J(θ) = E_{s∼ρ^π(s), u∼π_θ(s,·)}[∇_θ log π_θ(u | s) Q^π(s, u)]
REINFORCE [Williams 1992]:
∇_θ J(θ) ≈ g(τ) = Σ_{t=0}^{T} ∇_θ log π_θ(u_t | s_t) R_t
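A minimal numpy sketch of the REINFORCE estimator, assuming a linear-softmax policy over discrete actions (the feature representation, array shapes, and trajectory format are illustrative, not from the talk):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_gradient(theta, trajectory, gamma=0.99):
    """REINFORCE: sum_t grad log pi(u_t|s_t) * R_t for a linear-softmax policy.

    theta:      (n_features, n_actions) policy parameters
    trajectory: list of (features, action, reward) tuples
    """
    grad = np.zeros_like(theta)
    rewards = [r for (_, _, r) in trajectory]
    T = len(trajectory)
    # Discounted return-to-go R_t = sum_{k>=t} gamma^{k-t} r_k
    R, returns = 0.0, [0.0] * T
    for t in reversed(range(T)):
        R = rewards[t] + gamma * R
        returns[t] = R
    for t, (phi, u, _) in enumerate(trajectory):
        pi = softmax(theta.T @ phi)      # action probabilities under pi_theta
        # Score function for a linear softmax: phi (e_u - pi)^T
        dlog = np.outer(phi, -pi)
        dlog[:, u] += phi
        grad += dlog * returns[t]
    return grad
```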
Single-Agent Actor-Critic Methods [Sutton et al. 2000]

Reduce variance in g(τ) by learning a critic Q(s, u):
g(τ) = Σ_{t=0}^{T} ∇_θ log π_θ(u_t | s_t) Q(s_t, u_t)
Single-Agent Baselines
Further reduce variance with a baseline b(s):
g(τ) = Σ_{t=0}^{T} ∇_θ log π_θ(u_t | s_t) (Q(s_t, u_t) − b(s_t))
b(s) = V(s) ⟹ Q(s, u) − b(s) = A(s, u), the advantage function:

g(τ) = Σ_{t=0}^{T} ∇_θ log π_θ(u_t | s_t) A(s_t, u_t)
The TD error r_t + γ V(s_{t+1}) − V(s_t) is an unbiased estimate of A(s_t, u_t):

g(τ) = Σ_{t=0}^{T} ∇_θ log π_θ(u_t | s_t) (r_t + γ V(s_{t+1}) − V(s_t))
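As a concrete instance of the last line, the TD error can be computed from a learned state-value function; a sketch with a hypothetical tabular V:

```python
import numpy as np

def td_advantage(V, s, r, s_next, gamma=0.99, done=False):
    """TD error r + gamma*V(s') - V(s): an unbiased estimate of the
    advantage A(s, u) under the current policy."""
    target = r if done else r + gamma * V[s_next]
    return target - V[s]

# Toy tabular value function (values are hypothetical):
V = np.array([0.5, 1.0, 0.0])
adv = td_advantage(V, s=0, r=1.0, s_next=1, gamma=0.9)
# adv = 1.0 + 0.9 * 1.0 - 0.5, i.e. approximately 1.4
```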
Single-Agent Deep Actor-Critic Methods
Actor and critic are both deep neural networks:
  - Convolutional and recurrent layers
  - Actor and critic share layers

Both trained with stochastic gradient descent:
  - Actor trained on the policy gradient
  - Critic trained on TD(λ) or Sarsa(λ):
L_t(ψ) = (y^(λ) − C(·_t, ψ))²

y^(λ) = (1 − λ) Σ_{n=1}^{∞} λ^{n−1} G_t^{(n)}

G_t^{(n)} = Σ_{k=1}^{n} γ^{k−1} r_{t+k} + γ^n C(·_{t+n}, ψ)

where C(·, ψ) denotes the critic (Q or V) with parameters ψ.
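For a finite episode the infinite sum truncates, and the λ-return target satisfies a backward recursion; a sketch (the array layout and bootstrap handling are assumptions):

```python
import numpy as np

def lambda_returns(rewards, values, gamma=0.99, lam=0.8):
    """Backward recursion for TD(lambda) critic targets:
    y_t = r_{t+1} + gamma * ((1 - lam) * C(._{t+1}) + lam * y_{t+1}),
    equivalent to the lambda-weighted sum of n-step returns.

    rewards: r_{t+1} for t = 0..T-1
    values:  critic estimates C(._t) for t = 0..T; values[T] bootstraps
             the tail of the sum at the episode cut-off
    """
    T = len(rewards)
    y = np.zeros(T)
    next_y = values[T]
    for t in reversed(range(T)):
        y[t] = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * next_y)
        next_y = y[t]
    return y
```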
Independent Actor-Critic
Inspired by independent Q-learning [Tan 1993]:
  - Each agent learns independently with its own actor and critic
  - Treats other agents as part of the environment

Speed learning with parameter sharing:
  - Different inputs, including the agent index a, induce different behaviour
  - Still independent: critics condition only on τ^a and u^a

Variants:
  - IAC-V: TD-error gradient using V(τ^a)
  - IAC-Q: advantage-based gradient using A(τ^a, u^a) = Q(τ^a, u^a) − V(τ^a)

Limitations:
  - Nonstationary learning
  - Hard to learn to coordinate
  - Multi-agent credit assignment
Counterfactual Multi-Agent Policy Gradients
Centralised critic: stabilise learning to coordinate
Counterfactual baseline: tackle multi-agent credit assignment
Efficient critic representation: scale to large NNs
Centralised Critic
Centralisation → greedification over the joint action space is hard → actor-critic
g^a(τ) = Σ_{t=0}^{T} ∇_θ log π_θ(u^a_t | τ^a_t) (r_t + γ V(s_{t+1}) − V(s_t))
[Figure: centralised-critic architecture. Actor 1 and Actor 2 map local observations o^a_t (via hidden states h^a) to actions u^a_t in the environment; a single centralised critic takes the global state s_t and reward r_t and computes each agent's advantage A^a_t.]
Wonderful Life Utility [Wolpert & Tumer 2000]
Difference Rewards [Tumer & Agogino 2007]
Per-agent shaped reward:
D^a(s, u) = r(s, u) − r(s, (u^{−a}, c^a))

where c^a is a default action
Important property:
D^a(s, (u^{−a}, u̇^a)) > D^a(s, u) ⟹ r(s, (u^{−a}, u̇^a)) > r(s, u)
Limitations:
  - Needs extra simulations to estimate the counterfactual r(s, (u^{−a}, c^a))
  - Needs expertise to choose c^a
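A toy sketch of the difference-reward idea, with a stand-in team reward function (the reward function and the no-op default action are invented for illustration):

```python
def difference_reward(r, joint_u, agent, default_action):
    """D^a(s,u) = r(s,u) - r(s, (u^{-a}, c^a)): how much agent a's
    chosen action contributes relative to a default action c^a."""
    counterfactual = list(joint_u)
    counterfactual[agent] = default_action  # replace a's action with c^a
    return r(tuple(joint_u)) - r(tuple(counterfactual))

# Hypothetical team reward: number of distinct targets covered;
# None is a no-op action that covers nothing.
r = lambda u: float(len({t for t in u if t is not None}))

# Agent 0 covers a target nobody else covers: positive difference reward.
d0 = difference_reward(r, (0, 1, 1), agent=0, default_action=None)
# Agent 2 duplicates agent 1's target: zero difference reward.
d2 = difference_reward(r, (0, 1, 1), agent=2, default_action=None)
# d0 == 1.0, d2 == 0.0
```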
Counterfactual Baseline
Use Q(s,u) to estimate difference rewards:
g^a(τ) = Σ_{t=0}^{T} ∇_θ log π_θ(u^a_t | τ^a_t) A^a(s_t, u_t)

A^a(s, u) = Q(s, u) − Σ_{u^a} π^a(u^a | τ^a) Q(s, (u^{−a}, u^a))
The baseline marginalises out u^a

The learned critic obviates the need for extra simulations

Marginalising over actions obviates the need for a default action
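A numpy sketch of the counterfactual advantage: the critic's Q-values for all of agent a's actions (with u^{−a} held fixed) are marginalised under π^a to form the baseline (array shapes and names are assumptions):

```python
import numpy as np

def coma_advantage(q_values, pi_a, chosen_action):
    """A^a(s,u) = Q(s,u) - sum_{u'^a} pi^a(u'^a|tau^a) Q(s, (u^{-a}, u'^a)).

    q_values: (n_actions,) critic outputs, one per action of agent a,
              with the other agents' actions u^{-a} held fixed.
    pi_a:     (n_actions,) agent a's policy probabilities.
    """
    baseline = np.dot(pi_a, q_values)  # counterfactual baseline
    return q_values[chosen_action] - baseline

q = np.array([1.0, 2.0, 3.0])
pi = np.array([0.2, 0.5, 0.3])
adv = coma_advantage(q, pi, chosen_action=2)
# baseline = 0.2*1 + 0.5*2 + 0.3*3 = 2.1; advantage is approximately 0.9
```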
Efficient Critic Representation
[Figure: COMA architecture. (a) Actors and a centralised critic interact with the environment. (b) Each actor is a GRU taking (o^a_t, a, u^a_{t−1}) and hidden state h^a_t. (c) The critic takes (u^{−a}_t, s_t, o^a_t, a, u_{t−1}) and outputs {Q(u^a = 1, u^{−a}_t, ·), …, Q(u^a = |U|, u^{−a}_t, ·)}, i.e., Q-values for all of agent a's actions in a single forward pass, from which A^a_t is computed.]
StarCraft
StarCraft Micromanagement [Synnaeve et al. 2016]
Centralised Performance
                 Local Field of View (FoV)                          Full FoV, Central Control
map     heur.  IAC-V  IAC-Q  cnt-V  cnt-QV  COMA(mean)  COMA(best)  heur.  DQN   GMEZO
3m      .35    .47    .56    .83    .83     .87         .98         .74    -     -
5m      .66    .63    .58    .67    .71     .81         .95         .98    .99   1.
5w      .70    .18    .57    .65    .76     .82         .98         .82    .70   .74
2d 3z   .63    .27    .19    .36    .39     .47         .65         .68    .61   .90
Decentralised StarCraft Micromanagement
Heuristic Performance
(Results table repeated from the Centralised Performance slide.)
Baseline Algorithms
IAC-V: independent actor-critic with V(τ^a)

IAC-Q: independent actor-critic with A(τ^a, u^a) = Q(τ^a, u^a) − V(τ^a)

Central-V: centralised critic V(s) with a TD-error-based gradient

Central-QV:
  - centralised critics Q(s, u) and V(s)
  - advantage gradient A(s, u) = Q(s, u) − V(s)
  - COMA, but with b(s) = V(s)
Results (3m, 5m, 5w, 2d-3z)
[Figure: learning curves of average win % vs. number of training episodes on the 3m, 5m, 5w, and 2d-3z maps, comparing COMA, central-V, central-QV, IAC-V, IAC-Q, and the heuristic.]
Compared to Centralised Controllers
(Results table repeated from the Centralised Performance slide.)
Future Work
Factored centralised critics for many agents
Multi-agent exploration
StarCraft macromanagement
Paper
Counterfactual Multi-Agent Policy Gradients
Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras,Nantas Nardelli, and Shimon Whiteson
https://arxiv.org/abs/1705.08926
Thank You Microsoft!
This work was made possible thanks to a generous donation of Azure cloud credits from Microsoft.