Learning in networks (and other asides)
A preliminary investigation & some comments
Yu-Han Chang, joint work with Tracey Ho and Leslie Kaelbling
AI Lab, MIT
NIPS Multi-agent Learning Workshop, Whistler, BC, 2002
Networks: a multi-agent system
- Graphical games [Kearns, Ortiz, Guestrin, ...]
- Real networks, e.g. a LAN [Boyan, Littman, ...]
- "Mobile ad-hoc networks" [Johnson, Maltz, ...]
Mobilized ad-hoc networks
- Mobile sensors, tracking agents, ...
- Generally, a distributed system that wants to optimize some global reward function
Learning
- Nash equilibrium is the phrase of the day, but is it a good solution?
- Other equilibria, e.g. refinements of NE
1. Can we do better than Nash equilibrium? (Game-playing approach)
2. Perhaps we just want to learn some good policy in a distributed manner. Then what? (Distributed problem solving)
What are we studying?
              Single agent                  Multiple agents
Known world   Decision Theory, Planning     Game Theory
Learning      RL, NDP                       Stochastic games, Learning in games, ...
Part I: Learning
[Diagram: a learning algorithm updates a policy; the policy sends actions to the world/state and receives observations, sensations, and rewards.]
Learning to act in the world
[Diagram: the same learning loop, but the environment now contains other agents (possibly learning), so the effective "world" our policy faces is no longer fixed.]
A simple example
- The problem: Prisoner's Dilemma
- Possible solutions: the space of policies
- The solution metric: Nash equilibrium

Player 1's actions are rows, Player 2's are columns; entries are (reward to 1, reward to 2):

            Cooperate   Defect
Cooperate   1, 1        -2, 2
Defect      2, -2       -1, -1
That Folk Theorem

For discount factors close to 1, any individually rational payoffs are feasible (and are Nash) in the infinitely repeated game.

         Coop.    Defect
Coop.    1, 1     -2, 2
Defect   2, -2    -1, -1

[Figure: the feasible payoff region in the (R1, R2) plane, the convex hull of (1,1), (-1,-1), (2,-2), and (-2,2); the safety value sits at (-1,-1).]
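As a quick check of the "safety value" marked in the figure: the payoff each player can guarantee itself in this game (its minimax value) is

$$v_1 = \max_i \min_j r_{ij} = \max\{\min(1,-2),\ \min(2,-1)\} = -1,$$

and by symmetry $v_2 = -1$, so the individually rational region is the part of the feasible set with $R_1 \ge -1$ and $R_2 \ge -1$.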
Better policies: Tit-for-Tat
- Expand our notion of policies to include maps from past history to actions
- Our choice of action now depends on previous choices (i.e. it is non-stationary)

Tit-for-Tat policy, as a map from history (last period's play) to actions:

( · , Defect )    → Defect
( · , Cooperate ) → Cooperate
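A minimal sketch of this policy in code (the action constants and the history convention are our own illustration, not from the talk):

```python
COOPERATE, DEFECT = "C", "D"

def tit_for_tat(history):
    """Map the last period's play to an action.

    `history` is a list of (my_action, their_action) pairs;
    we condition only on the opponent's most recent action.
    """
    if not history:               # first round: cooperate
        return COOPERATE
    _, their_last = history[-1]
    return their_last             # echo their last move
```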
Types of policies & consequences
- Stationary: 1 → At
  At best, leads to the same outcome as single-shot Nash equilibrium against rational opponents
- Reactionary: { ( ht-1 ) } → At
  Tit-for-Tat achieves the "best" outcome in the Prisoner's Dilemma
- Finite memory: { ( ht-n , ... , ht-2 , ht-1 ) } → At
  May be useful against more complex opponents or in more complex games
- "Algorithmic": { ( h1 , h2 , ... , ht-2 , ht-1 ) } → At
  Makes use of the entire history of actions as it learns over time
Classifying our policy space
We can classify our learning algorithm's potential power by the amount of history its policies can use:

- Stationary: H0 = { 1 → At }
- Reactionary: H1 = { ( ht-1 ) → At }
- Behavioral / finite memory: Hn = { ( ht-n , ... , ht-2 , ht-1 ) → At }
- Algorithmic / infinite memory: H∞ = { ( h1 , h2 , ... , ht-2 , ht-1 ) → At }
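To make the taxonomy concrete, here is an illustrative sketch (all policy choices are ours, invented for illustration) of one policy from each class, represented as a function of the available history:

```python
COOPERATE, DEFECT = "C", "D"   # history: list of (ours, theirs) pairs

def h0_stationary(history):
    return COOPERATE                       # H0: ignores history entirely

def h1_reactionary(history):
    if not history:
        return COOPERATE
    return history[-1][1]                  # H1: last period only (Tit-for-Tat)

def hn_finite_memory(history, n=3):
    window = history[-n:]                  # Hn: last n periods only
    return DEFECT if any(t == DEFECT for _, t in window) else COOPERATE

def h_inf_algorithmic(history):
    defections = sum(t == DEFECT for _, t in history)   # H-infinity: all of it
    return DEFECT if defections > len(history) / 2 else COOPERATE
```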
Classifying our belief space
It's also important to quantify our belief space, i.e. our assumptions about what types of policies the opponent is capable of playing:

- Stationary: B0
- Reactionary: B1
- Behavioral / finite memory: Bn
- Infinite memory / arbitrary: B∞
A Simple Classification
        B0                            B1            Bn              B∞
H0      Minimax-Q, Nash-Q, Corr-Q                                   Bully
H1                                                                  Godfather
Hn
H∞      (WoLF) PHC, Fictitious Play,  Q1-learning   Qt-learning?    ???
        Q-learning (JAL)
H∞ × B0 : Stationary opponent

- Since the opponent is stationary, this case reduces the world to an MDP. Hence we can apply any traditional reinforcement learning method.
- Policy hill-climbing (PHC) [Bowling & Veloso, 02]: estimates the gradient in the action space and follows it towards the local optimum
- Fictitious play [Robinson, 51] [Fudenberg & Levine, 95]: plays a stationary best response to the statistical frequency of the opponent's play
- Q-learning (JAL) [Watkins, 89] [Claus & Boutilier, 98]: learns Q-values of states and possibly joint actions
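A minimal sketch of fictitious play for a single matrix game (the payoff-matrix layout and variable names are our illustration):

```python
import numpy as np

def fictitious_play_action(payoff, opponent_counts):
    """Best response to the empirical frequency of the opponent's play.

    payoff[i, j]: our reward when we play row i and they play column j.
    opponent_counts[j]: how often they have played column j so far.
    """
    freq = opponent_counts / opponent_counts.sum()   # empirical mixed strategy
    expected = payoff @ freq                         # expected reward per row
    return int(np.argmax(expected))                  # stationary best response

# e.g. the Prisoner's Dilemma above, after seeing 7 Cooperates and 3 Defects:
pd = np.array([[1, -2], [2, -1]])
print(fictitious_play_action(pd, np.array([7.0, 3.0])))   # -> 1 (Defect)
```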
H0 × B∞ : My enemy's pretty smart

"Bully" [Littman & Stone, 01]: tries to force the opponent to conform to the preferred outcome by choosing to play only some part of the game matrix.

The "Chicken" game (Hawk-Dove); rows are us, columns are them:

                      Cooperate "Swerve"   Defect "Drive"
Cooperate "Swerve"    1, 1                 -2, 2
Defect "Drive"        2, -2                -5, -5

The Nash equilibrium in which we swerve and they drive, with payoff (-2, 2), is the undesirable one for us; by committing to Drive, Bully steers play away from it.
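One way to sketch Bully's choice in code (assuming, as the slide says, that the opponent will best-respond to whatever part of the matrix we commit to; the function and variable names are ours):

```python
import numpy as np

def bully_action(ours, theirs):
    """Commit to the row that maximizes our payoff, assuming the
    opponent best-responds (picks its best column) to that row."""
    best_row, best_val = 0, -np.inf
    for i in range(ours.shape[0]):
        j = int(np.argmax(theirs[i]))        # their best response to row i
        if ours[i, j] > best_val:
            best_row, best_val = i, ours[i, j]
    return best_row

# Chicken from the slide: rows/columns are (Swerve, Drive)
ours   = np.array([[1, -2], [2, -5]])
theirs = np.array([[1,  2], [-2, -5]])
print(bully_action(ours, theirs))            # -> 1: commit to Drive
```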
Achieving “perfection”
- Can we design a learning algorithm that will perform well in all circumstances?
  - Prediction
  - Optimization
- But this is not possible!* [Nachbar, 95] [Binmore, 89]

* Universal consistency (Exp3 [Auer et al, 02], smoothed fictitious play [Fudenberg & Levine, 95]) does provide a way out, but it merely guarantees that we'll do almost as well as any stationary policy we could have used.
A reasonable goal?
Can we design an algorithm in H∞ × Bn, or in a subclass of H∞ × B∞, that will do well?

- It should always try to play a best response to any given opponent strategy
- Against a fully rational opponent, it should thus learn to play a Nash equilibrium strategy
- It should try to guarantee that we'll never do too badly

One possible approach: given knowledge about the opponent, model its behavior and exploit its weaknesses (play a best response). Let's start by constructing a player that plays well against PHC players in 2x2 games.
2x2 Repeated Matrix Games
        Left        Right
Up      r11, c11    r12, c12
Down    r21, c21    r22, c22

- We choose row i to play
- The opponent chooses column j to play
- We receive reward rij; they receive cij
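For mixed strategies, the expected reward is the bilinear form that the gradient-ascent analysis below works with; a small sketch (names ours):

```python
import numpy as np

def expected_reward(R, p, q):
    """Expected row-player reward when we play Up with probability p
    and the opponent plays Left with probability q, in the 2x2 game R."""
    x = np.array([p, 1 - p])
    y = np.array([q, 1 - q])
    return float(x @ R @ y)

# Matching pennies (used later in the talk): value 0 at p = q = 1/2
mp = np.array([[-1, 1], [1, -1]])
print(expected_reward(mp, 0.5, 0.5))   # -> 0.0
```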
Iterated gradient ascent
[Singh, Kearns & Mansour, 00]

System dynamics for 2x2 matrix games take one of two forms:

[Two phase plots: Player 2's probability for Action 1 against Player 1's probability for Action 1, one per form of the dynamics.]
Can we do better and actually win?
- Singh et al show that we can achieve Nash payoffs
- But is this a best response? We can do better...
  - Exploit while winning
  - Deceive and bait while losing

Matching pennies; rows are us, columns are them:

         Heads    Tails
Heads    -1, 1    1, -1
Tails    1, -1    -1, 1
A winning strategy against PHC
- If winning: play probability 1 for the current preferred action, in order to maximize rewards while winning
- If losing: play a deceiving policy until we are ready to take advantage of them again

[Plot: probability the opponent plays heads against the probability we play heads, cycling between winning and losing regions around (0.5, 0.5).]
Formally, PHC does:
Keeps and updates Q-values:

$$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\left(R + \gamma \max_{a'} Q(s',a')\right)$$

Updates its policy:

$$\pi(s,a) \leftarrow \pi(s,a) + \begin{cases} \delta & \text{if } a = \arg\max_{a'} Q(s,a') \\ -\dfrac{\delta}{|A_i|-1} & \text{otherwise} \end{cases}$$
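A minimal single-state sketch of these two updates (the hyperparameter values and the simplex projection step are our choices, not the talk's):

```python
import numpy as np

def phc_step(Q, pi, a, reward, alpha=0.1, delta=0.01, gamma=0.9):
    """One PHC update for a single-state (repeated-game) setting.

    Q:  per-action value estimates        pi: current mixed policy
    a:  action just played                reward: payoff received
    """
    # Q-value update: Q(a) <- (1-alpha) Q(a) + alpha (r + gamma max_a' Q(a'))
    Q[a] = (1 - alpha) * Q[a] + alpha * (reward + gamma * Q.max())

    # Policy hill-climbing: step toward the currently greedy action
    best = Q.argmax()
    for b in range(len(pi)):
        pi[b] += delta if b == best else -delta / (len(pi) - 1)

    # Project back onto the simplex (clip and renormalize; our choice)
    np.clip(pi, 0.0, 1.0, out=pi)
    pi /= pi.sum()
    return Q, pi
```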
PHC-Exploiter

Updates its policy differently when winning vs. losing. We are winning when our current policy earns more against the opponent's estimated policy than our equilibrium policy $\pi_1^*$ would:

$$\sum_{a'} \pi_1(s,a')\,Q(s,a') > R_1\!\left(\pi_1^*(s), \hat{\pi}_2(s)\right)$$

If we are winning, play the greedy action deterministically:

$$\pi_1(s,a) \leftarrow \begin{cases} 1 & \text{if } a = \arg\max_{a'} Q(s,a') \\ 0 & \text{otherwise} \end{cases}$$

Otherwise, we are losing, and we drift at a rate tied to the opponent's estimated learning rate $\hat{\delta}_2$:

$$\pi_1(s,a) \leftarrow \pi_1(s,a) + \begin{cases} \hat{\delta}_2 & \text{if } a = \arg\max_{a'} Q(s,a') \\ -\dfrac{\hat{\delta}_2}{|A_1|-1} & \text{otherwise} \end{cases}$$
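A sketch of this switch in code, following our reconstruction of the slide's updates above (the variable names, the tolerance `eps`, and the renormalization are our own):

```python
import numpy as np

def exploiter_policy(Q, pi1, pi1_nash, pi2_hat, R1, delta2_hat, eps=1e-3):
    """One PHC-Exploiter policy update in a single-state game.

    Winning test: expected value of our current policy pi1 versus what the
    equilibrium policy pi1_nash would earn against the estimated pi2_hat.
    """
    winning = pi1 @ Q > pi1_nash @ R1 @ pi2_hat + eps
    best = int(np.argmax(Q))
    if winning:
        # Exploit: play the greedy action deterministically
        pi1 = np.zeros_like(pi1)
        pi1[best] = 1.0
    else:
        # Deceive: drift at the opponent's estimated learning rate
        step = np.full_like(pi1, -delta2_hat / (len(pi1) - 1))
        step[best] = delta2_hat
        pi1 = np.clip(pi1 + step, 0.0, 1.0)
        pi1 /= pi1.sum()
    return pi1
```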
But we don’t have complete information
- Estimate the opponent's policy $\hat{\pi}_2$ at each time period
- Estimate the opponent's learning rate $\hat{\delta}_2$

[Timeline figure: estimates are formed over sliding windows of length w, comparing the window of periods t-2w ... t-w with the window t-w ... t.]
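An illustrative windowed estimator (the talk only specifies that both quantities are estimated over windows; the details here, including using the largest per-action drift, are our own):

```python
import numpy as np

def estimate_opponent(actions, num_actions, w):
    """Windowed estimates of the opponent's policy and learning rate.

    actions: list of the opponent's observed actions up to time t
    (assumes at least 2w observations).  Returns (pi2_hat, delta2_hat).
    """
    recent = np.bincount(actions[-w:], minlength=num_actions) / w
    older  = np.bincount(actions[-2*w:-w], minlength=num_actions) / w
    delta2_hat = np.abs(recent - older).max() / w   # policy drift per step
    return recent, delta2_hat
```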
Ideally we'd like to see this:

[Plot: the idealized policy trajectory, cycling between winning and losing phases.]
With our approximations:

[Plot: the observed trajectory and cumulative reward, again alternating winning and losing phases.]

And indeed we're doing well.
Knowledge (beliefs) are useful
- Using our knowledge about the opponent, we've demonstrated one case in which we can achieve better-than-Nash rewards
- In general, we'd like algorithms that can guarantee Nash payoffs against fully rational players but can exploit bounded players (such as a PHC)
So what do we want from learning?
- Best response / adaptive: exploit the opponent's weaknesses; essentially, always try to play a best response
- Regret minimization: we'd like to be able to look back and not regret our actions; we wouldn't say to ourselves, "Gosh, why didn't I choose to do that instead..."
A next step
- Expand the comparison class in universally consistent (regret-minimizing) algorithms to include richer spaces of possible strategies
- For example, the comparison class could include a best-response player to a PHC
- It could also include all t-period strategies
Part II: What if we're cooperating?

- Nash equilibrium is not the most useful concept in cooperative scenarios
- We simply want to find the global (perhaps approximately) optimal solution in a distributed manner. This happens to be a Nash equilibrium, but it's not really the point of NE to address this scenario
- Distributed problem solving rather than game playing
- May also deal with modeling emergent behaviors
Mobilized ad-hoc networks
- Ad-hoc networks are limited in connectivity
- Mobilized nodes can significantly improve connectivity

[Screenshot: network simulator]
Connectivity bounds
Static ad-hoc networks have loose bounds of the following form: given $n$ nodes uniformly distributed i.i.d. in a disk of area $A$, each with range

$$r_n = \sqrt{\frac{A(\log n + c_n)}{\pi n}},$$

the graph is connected almost surely as $n \to \infty$ iff $c_n \to \infty$.
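For a feel of the scaling, a quick numeric check of this threshold (the unit-area disk and the choice $c_n = 1$ are ours, just to plug in numbers):

```python
import math

def required_range(n, A=1.0, c=1.0):
    """Loose connectivity threshold r_n = sqrt(A (log n + c) / (pi n))."""
    return math.sqrt(A * (math.log(n) + c) / (math.pi * n))

# The required range shrinks roughly like sqrt(log n / n):
for n in (25, 100, 400):
    print(n, round(required_range(n), 3))
```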
Connectivity bounds
Allowing mobility can improve our loose bounds. With $r_n \sim \sqrt{\frac{\log n}{n}}$ denoting the static requirement:

Fraction mobile    Required range    # nodes
1/2                $r_n/2$           $n/2$
2/3                $r_n/3$           $n/3$
k/(k+1)            $r_n/(k+1)$       $n/(k+1)$

Can we achieve this, or even do significantly better than this?
Many challenges
Routing:
- Dynamic environment: neighbor nodes moving in and out of range; sources and receivers may also be moving
- Limited bandwidth: channel allocation, limited buffer sizes

Moving:
- What is the globally optimal configuration?
- What is the globally optimal trajectory of configurations?
- Can we learn a good policy using only local knowledge?
Routing
Q-routing [Boyan & Littman, 93] applied simple Q-learning to the static network routing problem under congestion:
- Actions: forward the packet to a particular neighbor node
- States: the current packet's intended receiver
- Reward: estimated time to arrival at the receiver
It performed well by learning to route packets around congested areas.

For our setting:
- Direct application of Q-routing to the mobile ad-hoc network case
- Adaptations to the highly dynamic nature of mobilized ad-hoc networks
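A minimal sketch of the Q-routing update rule (the nested-dict layout and parameter names are our illustration of [Boyan & Littman]'s scheme):

```python
def q_routing_update(Q, x, y, d, transit_time, queue_time, alpha=0.5):
    """Node x sent a packet bound for destination d to neighbor y.

    Q[x][d][y] estimates x's delivery time to d when routing via y.
    Neighbor y reports its own best remaining estimate, and x moves
    its estimate toward (local delay) + (y's remaining estimate).
    """
    remaining = min(Q[y][d].values())              # y's best estimate to d
    target = queue_time + transit_time + remaining
    Q[x][d][y] += alpha * (target - Q[x][d][y])
```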
Movement: An RL approach
What should our actions be?
- North, South, East, West, Stay Put
- Explore, Maintain connection, Terminate connection, etc.

What should our states be?
- Local information about nodes, locations, and paths
- Summarized local information
- Globally shared statistics

Policy search? Mixture of experts?
Macros, options, complex actions
- Allow the nodes (agents) to use complex actions rather than simple N, S, E, W movements
- Actions might take varying amounts of time
- Agents can re-evaluate at each time step whether to continue the action; if the state hasn't really changed, the same action will naturally be chosen again
Example action: “plug”
1. Sniff packets in the neighborhood
2. Identify the path (source, receiver pair) with the longest average hops
3. Move to that path
4. Move along this path until a long hop is encountered
5. Insert yourself into the path at this point, thereby decreasing the average hop distance
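A sketch of these five steps over a toy path model (everything here is our illustration: `paths` maps (source, receiver) pairs to lists of (x, y) hop positions, and `node.move_to` is an assumed movement interface):

```python
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def plug(node, paths):
    """Insert this node into the sniffed path with the longest hops."""
    def avg_hop(path):
        return sum(dist(a, b) for a, b in zip(path, path[1:])) / (len(path) - 1)

    # Steps 1-2: among overheard paths, pick the longest average hop
    path = max(paths.values(), key=avg_hop)

    # Steps 3-4: walk along that path until a longer-than-average hop appears
    threshold = avg_hop(path)
    for a, b in zip(path, path[1:]):
        if dist(a, b) > threshold:
            # Step 5: insert ourselves at the midpoint of the long hop,
            # shortening the path's average hop distance
            node.move_to(((a[0] + b[0]) / 2, (a[1] + b[1]) / 2))
            return
```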
Some notion of state
- The state space could be huge, so we choose certain features to parameterize it: connectivity, average hop distance, ...
- Actions should change the world state: exploring will hopefully lead to connectivity, plugging will lead to smaller average hops, ...
Experimental results
Number of nodes   Range        Theoretical fraction mobile   Empirical fraction mobile required
25                2 $r_n$
25                $r_n$        1/2                           0.21
50                1.7 $r_n$
50                0.85 $r_n$   1/2                           0.25
100               1.7 $r_n$
100               0.85 $r_n$   1/2                           0.19
200               1.6 $r_n$
200               0.8 $r_n$    1/2                           0.17
400               1.6 $r_n$
400               0.8 $r_n$    1/2                           0.14

(The second row of each pair halves the range of the first.)

Seems to work well.
Pretty pictures

[Simulation screenshots]
Many things to play with
- Lossy transmissions
- Transmission interference
- Existence of opponents, jamming signals
- Self-interested nodes
- More realistic simulations (ns2)
- Learning different agent roles, or optimizing the individual complex actions
- Interaction between route learning and movement learning
Three yardsticks

1. Non-cooperative case: we want to play our best response to the observed play of the world; we want to learn about the opponent (minimize regret, play our best response)
2. Cooperative case: approximate a global optimum in a distributed manner, using only local information or less computation
3. Skiing case: 17 cm of fresh powder last night and it's still snowing. More snow is better. Who can argue with that?
The End