
Learning in networks (and other asides)

A preliminary investigation & some comments

Yu-Han Chang
Joint work with Tracey Ho and Leslie Kaelbling

AI Lab, MIT

NIPS Multi-agent Learning Workshop, Whistler, BC 2002

Networks: a multi-agent system

Graphical games [Kearns, Ortiz, Guestrin, …]
Real networks, e.g. a LAN [Boyan, Littman, …]
"Mobile ad-hoc networks" [Johnson, Maltz, …]

Mobilized ad-hoc networks

Mobile sensors, tracking agents, …
Generally a distributed system that wants to optimize some global reward function

Learning

Nash equilibrium is the phrase of the day, but is it a good solution?

Other equilibria, i.e. refinements of NE

1. Can we do better than Nash equilibrium? (Game playing approach)

2. Perhaps we want to just learn some good policy in a distributed manner. Then what? (Distributed problem solving)

What are we studying?

                Single agent                 Multiple agents
Known world     Decision Theory, Planning    Game Theory
Learning        RL, NDP                      Stochastic games, Learning in games, …

Part I: Learning

[Diagram: the Learning Algorithm updates a Policy; the Policy emits Actions into the World/State; the agent receives back Observations/Sensations and Rewards.]

Learning to act in the world

[Diagram: the same agent-environment loop, but the Environment now contains other agents (possibly learning) alongside the World.]

A simple example

The problem: Prisoner's Dilemma
Possible solutions: space of policies
The solution metric: Nash equilibrium

                    Cooperate    Defect
    Cooperate        1,1         -2,2
    Defect           2,-2        -1,-1

(Rows: Player 1's actions; columns: Player 2's actions; entries: rewards to Players 1 and 2.)
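As a quick illustrative check (not from the original slides), the following sketch enumerates best responses in the payoff matrix above and confirms that mutual defection is its unique pure-strategy Nash equilibrium:

```python
# Hypothetical illustration: find the pure-strategy Nash equilibria of the
# Prisoner's Dilemma matrix above (row player first in each payoff pair).
ACTIONS = ["Cooperate", "Defect"]
PAYOFF = [[(1, 1), (-2, 2)],
          [(2, -2), (-1, -1)]]

def is_nash(i, j):
    """(i, j) is a Nash equilibrium if neither player gains by deviating alone."""
    row_ok = all(PAYOFF[i][j][0] >= PAYOFF[k][j][0] for k in range(2))
    col_ok = all(PAYOFF[i][j][1] >= PAYOFF[i][k][1] for k in range(2))
    return row_ok and col_ok

equilibria = [(ACTIONS[i], ACTIONS[j])
              for i in range(2) for j in range(2) if is_nash(i, j)]
print(equilibria)  # [('Defect', 'Defect')]
```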

That Folk Theorem

For discount factors close to 1, any individually rational payoffs are feasible (and are Nash) in the infinitely repeated game.

                Coop.     Defect
    Coop.        1,1       -2,2
    Defect       2,-2      -1,-1

[Figure: the feasible payoff region in the (R1, R2) plane, spanned by the payoff points (1,1), (-1,-1), (2,-2), (-2,2), with the safety value marked.]

Better policies: Tit-for-Tat

Expand our notion of policies to include maps from past history to actions.

Our choice of action now depends on previous choices (i.e. it is non-stationary).

Tit-for-Tat policy (history = last period's play):
    ( · , Defect )    → Defect
    ( · , Cooperate ) → Cooperate
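A minimal sketch of Tit-for-Tat as a history-dependent policy (the simulation harness and function names here are illustrative, not from the talk):

```python
# Tit-for-Tat: cooperate first, then copy the opponent's previous action.
PAYOFF = {("C", "C"): (1, 1), ("C", "D"): (-2, 2),
          ("D", "C"): (2, -2), ("D", "D"): (-1, -1)}

def tit_for_tat(history):
    return "C" if not history else history[-1][1]   # opponent's last action

def always_defect(history):
    return "D"

def play(policy1, policy2, rounds=10):
    history, totals = [], [0, 0]
    for _ in range(rounds):
        a1 = policy1(history)
        a2 = policy2([(b, a) for (a, b) in history])  # flip perspective for player 2
        r1, r2 = PAYOFF[(a1, a2)]
        totals[0] += r1
        totals[1] += r2
        history.append((a1, a2))
    return totals

print(play(tit_for_tat, always_defect))  # loses only the first round, then mutual defection
print(play(tit_for_tat, tit_for_tat))    # mutual cooperation throughout
```

Tit-for-Tat concedes only the first round to an always-defector and sustains cooperation against itself, which is the sense in which it beats any stationary policy in the repeated Prisoner's Dilemma.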

Types of policies & consequences

Stationary: 1 → A_t. At best, leads to the same outcome as the single-shot Nash equilibrium against rational opponents.

Reactionary: { (h_{t-1}) } → A_t. Tit-for-Tat achieves the "best" outcome in the Prisoner's Dilemma.

Finite memory: { (h_{t-n}, …, h_{t-2}, h_{t-1}) } → A_t. May be useful against more complex opponents or in more complex games.

"Algorithmic": { (h_1, h_2, …, h_{t-2}, h_{t-1}) } → A_t. Makes use of the entire history of actions as it learns over time.

Classifying our policy space

We can classify our learning algorithm's potential power by the amount of history its policies can use:

Stationary: H0, with 1 → A_t
Reactionary: H1, with { (h_{t-1}) } → A_t
Behavioral / finite memory: Hn, with { (h_{t-n}, …, h_{t-2}, h_{t-1}) } → A_t
Algorithmic / infinite memory: H∞, with { (h_1, h_2, …, h_{t-2}, h_{t-1}) } → A_t

Classifying our belief space

It's also important to quantify our belief space, i.e. our assumptions about what types of policies the opponent is capable of playing:

Stationary: B0
Reactionary: B1
Behavioral / finite memory: Bn
Infinite memory / arbitrary: B∞

A Simple Classification

        B0                                              B1            Bn              B∞
H0      Minimax-Q, Nash-Q, Corr-Q                                                     Bully
H1                                                                                    Godfather
Hn
H∞      (WoLF) PHC, Fictitious Play, Q-learning (JAL)   Q1-learning   Qt-learning?    ???

A Classification

(The classification table above is repeated here.)

H∞ x B0 : Stationary opponent

Since the opponent is stationary, this case reduces the world to an MDP, so we can apply any traditional reinforcement learning method. (A sketch of fictitious play is given after this slide.)

Policy hill-climbing (PHC) [Bowling & Veloso, 02]: estimates the gradient in the action space and follows it towards a local optimum.

Fictitious play [Robinson, 51] [Fudenberg & Levine, 95]: plays a stationary best response to the statistical frequency of the opponent's play.

Q-learning (JAL) [Watkins, 89] [Claus & Boutilier, 98]: learns Q-values of states and possibly joint actions.
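A minimal sketch of fictitious play against an assumed stationary opponent mixing (0.7, 0.3) in matching pennies (the opponent policy and round count are illustrative):

```python
import random

# Matching pennies payoffs for the row player.
R = [[-1, 1],
     [1, -1]]

def best_response(opponent_counts):
    """Best response to the empirical mixed strategy of the opponent."""
    total = sum(opponent_counts)
    p = [c / total for c in opponent_counts]
    expected = [sum(R[i][j] * p[j] for j in range(2)) for i in range(2)]
    return max(range(2), key=lambda i: expected[i])

def fictitious_play(opponent_policy=(0.7, 0.3), rounds=1000, seed=0):
    rng = random.Random(seed)
    counts = [1, 1]          # opponent action counts (uniform prior)
    reward = 0
    for _ in range(rounds):
        a = best_response(counts)
        b = 0 if rng.random() < opponent_policy[0] else 1
        reward += R[a][b]
        counts[b] += 1
    return reward / rounds

print(fictitious_play())
```

Against the fixed (0.7, 0.3) opponent, the average reward approaches the best-response payoff of 0.4, which is all that can be asked for in the H∞ x B0 cell.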

A Classification

(The classification table above is repeated here.)

H0 x B∞ : My enemy's pretty smart

"Bully" [Littman & Stone, 01]: tries to force the opponent to conform to the preferred outcome by choosing to play only some part of the game matrix.

The "Chicken" game (Hawk-Dove):

                              Them: Cooperate ("Swerve")    Them: Defect ("Drive")
    Us: Cooperate ("Swerve")          1,1                          -2,2
    Us: Defect ("Drive")              2,-2                         -5,-5

(One Nash equilibrium is marked "Undesirable" on the original slide.)

Achieving “perfection”

Can we design a learning algorithm that will perform well in all circumstances?
    Prediction
    Optimization

But this is not possible!* [Nachbar, 95] [Binmore, 89]

* Universal consistency (Exp3 [Auer et al, 02], smoothed fictitious play [Fudenberg & Levine, 95]) does provide a way out, but it merely guarantees that we'll do almost as well as any stationary policy that we could have used.
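For concreteness, a sketch of an Exp3-style exponential-weights bandit learner in the spirit of the algorithm cited above (the parameters, reward interface, and two-arm example are illustrative, not from the talk):

```python
import math
import random

def exp3(n_arms, reward_fn, rounds=1000, gamma=0.1, seed=0):
    """Exp3-style exponential-weights learner for bandit feedback.
    reward_fn(t, arm) must return a reward in [0, 1]."""
    rng = random.Random(seed)
    weights = [1.0] * n_arms
    total = 0.0
    for t in range(rounds):
        wsum = sum(weights)
        probs = [(1 - gamma) * w / wsum + gamma / n_arms for w in weights]
        # Sample an arm according to probs.
        r, acc, arm = rng.random(), 0.0, n_arms - 1
        for i, p in enumerate(probs):
            acc += p
            if r <= acc:
                arm = i
                break
        x = reward_fn(t, arm)
        total += x
        # Importance-weighted reward estimate keeps the update unbiased.
        weights[arm] *= math.exp(gamma * (x / probs[arm]) / n_arms)
    return total / rounds

# Illustrative bandit: arm 1 pays off 0.8 of the time, arm 0 only 0.2.
print(exp3(2, lambda t, a: 1.0 if random.random() < (0.8 if a == 1 else 0.2) else 0.0))
```

The guarantee is exactly the caveat above: the average reward approaches that of the best fixed arm (stationary policy), not necessarily a best response to an adaptive opponent.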

A reasonable goal?

Can we design an algorithm in H∞ x Bn, or in a subclass of H∞ x B∞, that will do well?

It should always try to play a best response to any given opponent strategy.

Against a fully rational opponent, it should thus learn to play a Nash equilibrium strategy.

It should try to guarantee that we'll never do too badly.

One possible approach: given knowledge about the opponent, model its behavior and exploit its weaknesses (play a best response).

Let's start by constructing a player that plays well against PHC players in 2x2 games.

2x2 Repeated Matrix Games

            Left          Right
Up          r11 , c11     r12 , c12
Down        r21 , c21     r22 , c22

• We choose row i to play
• Opponent chooses column j to play
• We receive reward rij, they receive cij

Iterated gradient ascent

[Singh, Kearns & Mansour, 00]

System dynamics for 2x2 matrix games take one of two forms:

[Two phase plots: Player 2's probability for Action 1 plotted against Player 1's probability for Action 1.]
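An illustrative discrete-step version of the gradient dynamics analyzed by Singh et al. (the step size eta and starting point are arbitrary choices, and this is a sketch rather than their infinitesimal analysis):

```python
def iga_trajectory(R, C, p=0.8, q=0.6, eta=0.01, steps=2000):
    """Simultaneous gradient ascent on expected payoff in a 2x2 game.
    p = P(row plays action 0), q = P(column plays action 0)."""
    clip = lambda x: min(1.0, max(0.0, x))
    traj = [(p, q)]
    for _ in range(steps):
        # Partial derivatives of each player's expected payoff w.r.t. its own probability.
        dp = q * (R[0][0] - R[0][1] - R[1][0] + R[1][1]) + (R[0][1] - R[1][1])
        dq = p * (C[0][0] - C[0][1] - C[1][0] + C[1][1]) + (C[1][0] - C[1][1])
        p, q = clip(p + eta * dp), clip(q + eta * dq)
        traj.append((p, q))
    return traj

# Matching pennies: the joint strategy cycles around the mixed Nash point (0.5, 0.5).
R = [[-1, 1], [1, -1]]
C = [[1, -1], [-1, 1]]
print(iga_trajectory(R, C)[-1])
```

For matching pennies the joint strategy cycles around the mixed equilibrium rather than converging, which is the cyclic form of the dynamics shown in the figure.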

Can we do better and actually win?

Singh et al. show that we can achieve Nash payoffs. But is this a best response? We can do better…

Exploit while winning; deceive and bait while losing.

Matching pennies:

                    Them: Heads    Them: Tails
    Us: Heads          -1,1           1,-1
    Us: Tails           1,-1         -1,1

A winning strategy against PHC

If winning: play probability 1 for the current preferred action, in order to maximize rewards while winning.

If losing: play a deceiving policy until we are ready to take advantage of them again.

[Plot: probability the opponent plays heads vs. probability we play heads, with reference lines at 0.5.]

Formally, PHC does:

Keeps and updates Q-values:

$$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\left(R + \gamma \max_{a'} Q(s',a')\right)$$

Updates the policy:

$$\pi(s,a) \leftarrow \pi(s,a) + \begin{cases} \delta & \text{if } a = \arg\max_{a'} Q(s,a') \\ -\dfrac{\delta}{|A_i|-1} & \text{otherwise} \end{cases}$$
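A single-state sketch of these PHC updates (variable names are mine; since a repeated matrix game has only one state, the bootstrap term is dropped here):

```python
def phc_step(Q, pi, action, reward, alpha=0.1, delta=0.01):
    """One Policy Hill-Climbing update for a single-state repeated game.
    Q: action -> value estimate, pi: action -> probability."""
    # Q-learning update (no next state in a repeated matrix game, so no bootstrap term).
    Q[action] = (1 - alpha) * Q[action] + alpha * reward
    # Move probability mass toward the greedy action by step size delta.
    best = max(Q, key=Q.get)
    n = len(pi)
    for a in pi:
        if a == best:
            pi[a] = min(1.0, pi[a] + delta)
        else:
            pi[a] = max(0.0, pi[a] - delta / (n - 1))
    # Renormalize so pi stays a valid distribution after clipping.
    total = sum(pi.values())
    for a in pi:
        pi[a] /= total
    return Q, pi
```

Calling phc_step after each round with the observed reward nudges the mixed policy toward the current greedy action at rate delta, which is the hill-climbing behavior the exploiter below takes advantage of.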

PHC-Exploiter

Updates the policy differently depending on whether we are winning or losing. We are winning when our current mixed policy earns more against the opponent's policy than the equilibrium strategy would:

$$\sum_{a'} \pi_1(s,a')\,Q(s,a') \;>\; R_1\!\left(\pi_1^*(s), \pi_2(s)\right)$$

If we are winning, exploit with the deterministic best response:

$$\pi_1(s,a) \leftarrow \begin{cases} 1 & \text{if } a = \arg\max_{a'} Q(s,a') \\ 0 & \text{otherwise} \end{cases}$$

Otherwise, we are losing, and we back off gradually at the opponent's learning rate $\delta_2$, playing the deceiving policy described above:

$$\pi_1(s,a) \leftarrow \pi_1(s,a) + \begin{cases} \delta_2 & \text{if } a = \arg\max_{a'} Q(s,a') \\ -\dfrac{\delta_2}{|A_1|-1} & \text{otherwise} \end{cases}$$

But we don’t have complete information

Estimate the opponent's policy $\hat{\pi}_2$ at each time period.
Estimate the opponent's learning rate $\hat{\delta}_2$.

[Timeline figure: the estimates are computed from windows of play ending at times t-2w, t-w, and t.]
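A rough single-state sketch combining the winning test, the two update modes, and the windowed opponent estimates just described (the window size, threshold, and names are illustrative simplifications for a 2x2 game such as matching pennies):

```python
def expected_reward(pi1, pi2_hat, R):
    return sum(pi1[i] * pi2_hat[j] * R[i][j] for i in range(2) for j in range(2))

def phc_exploiter_step(pi1, history, R, win_threshold=0.0, window=50):
    """One illustrative PHC-Exploiter policy update for a 2x2 game.
    history: list of opponent actions (0/1), most recent last."""
    recent, older = history[-window:], history[-2 * window:-window]
    freq = lambda h: [h.count(0) / len(h), h.count(1) / len(h)] if h else [0.5, 0.5]
    pi2_hat = freq(recent)
    # Estimated opponent learning rate: how fast its policy drifted between windows.
    delta2_hat = abs(freq(recent)[0] - freq(older)[0]) / max(window, 1)

    best = max(range(2), key=lambda i: sum(pi2_hat[j] * R[i][j] for j in range(2)))
    if expected_reward(pi1, pi2_hat, R) > win_threshold:
        # Winning: exploit with the deterministic best response.
        pi1 = [0.0, 0.0]
        pi1[best] = 1.0
    else:
        # Losing: drift toward the best response at roughly the opponent's pace,
        # baiting the opponent into a predictable adjustment.
        step = delta2_hat
        pi1 = [min(1.0, p + step) if i == best else max(0.0, p - step)
               for i, p in enumerate(pi1)]
        s = sum(pi1)
        pi1 = [p / s for p in pi1]
    return pi1
```

In matching pennies (game value 0), this alternates between a deterministic exploitation phase and a slow baiting phase, tracing out the winning/losing cycles shown on the following slides.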

Ideally we'd like to see this:

[Plot: the idealized joint-policy trajectory, with its winning and losing phases marked.]

With our approximations:

[Plot: the observed trajectory, with winning and losing phases marked.]

And indeed we're doing well.

Knowledge (beliefs) are useful

Using our knowledge about the opponent, we’ve demonstrated one case in which we can achieve better than Nash rewards

In general, we’d like algorithms that can guarantee Nash payoffs against fully rational players but can exploit bounded players (such as a PHC)

So what do we want from learning?

Best Response / Adaptive : exploit the opponent’s weaknesses, essentially always try to play a best response

Regret-minimization : we’d like to be able to look back and not regret our actions; we wouldn’t say to ourselves: “Gosh, why didn’t I choose to do that instead…”

A next step

Expand the comparison class in universally consistent (regret-minimization) algorithms to include richer spaces of possible strategies

For example, the comparison class could include a best-response player to a PHC

Could also include all t-period strategies

Part II

What if we’re cooperating?


Nash equilibrium is not the most useful concept in cooperative scenarios

We simply want to find, in a distributed way, a (perhaps approximately) globally optimal solution. This happens to be a Nash equilibrium, but it's not really the point of NE to address this scenario.

Distributed problem solving rather than game playing

May also deal with modeling emergent behaviors

Mobilized ad-hoc networks

Ad-hoc networks are limited in connectivity

Mobilized nodes can significantly improve connectivity

Network simulator

Connectivity bounds

Static ad-hoc networks have loose bounds of the following form:

Given $n$ nodes uniformly distributed i.i.d. in a disk of area $A$, each with range

$$r_n = \sqrt{\frac{A\,(\log n + c_n)}{\pi n}},$$

the graph is connected almost surely as $n \to \infty$ iff $c_n \to \infty$.
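A small empirical check of this kind of threshold (not from the talk; it assumes a unit-area disk and uses a naive union-find over all pairs):

```python
import math
import random

def connected(n, r, seed=0):
    """Drop n uniform points in a unit-area disk; is the range-r graph one component?"""
    rng = random.Random(seed)
    radius = math.sqrt(1.0 / math.pi)          # disk of area A = 1
    pts = []
    while len(pts) < n:                        # rejection-sample points inside the disk
        x, y = rng.uniform(-radius, radius), rng.uniform(-radius, radius)
        if x * x + y * y <= radius * radius:
            pts.append((x, y))
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(pts[i], pts[j]) <= r:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(n)}) == 1

n = 200
r_n = math.sqrt(math.log(n) / (math.pi * n))   # range near the connectivity threshold, A = 1
print(connected(n, r_n), connected(n, 2 * r_n))
```

At the threshold range the outcome is marginal from sample to sample, while doubling the range makes connectivity very likely, which is the flavor of the bound above.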

Connectivity bounds

Allowing mobility can improve these loose bounds to:

Fraction mobile     Required range      # nodes
1/2                 r_n / 2             n/2
2/3                 r_n / 3             n/3
k/(k+1)             r_n / (k+1)         n/(k+1)

where $r_n \sim \sqrt{\log n / n}$.

Can we achieve this, or even do significantly better than this?

Many challenges

Routing
    Dynamic environment: neighbor nodes moving in and out of range; sources and receivers may also be moving.
    Limited bandwidth: channel allocation, limited buffer sizes.

Moving
    What is the globally optimal configuration?
    What is the globally optimal trajectory of configurations?
    Can we learn a good policy using only local knowledge?

Routing

Q-routing [Boyan & Littman, 93] applied simple Q-learning to the static network routing problem under congestion:
    Actions: forward the packet to a particular neighbor node.
    States: the current packet's intended receiver.
    Reward: estimated time to arrival at the receiver.
It performed well by learning to route packets around congested areas. (A sketch of the update rule is given after this slide.)

Direct application of Q-routing to the mobile ad-hoc network case.
Adaptations to the highly dynamic nature of mobilized ad-hoc networks.
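A minimal sketch of the Q-routing update described above (the class, method, and parameter names are mine, not from Boyan & Littman's code):

```python
from collections import defaultdict

class QRouter:
    """Per-node table: Q[dest][neighbor] = estimated delivery time via that neighbor."""
    def __init__(self, neighbors, alpha=0.5):
        self.Q = defaultdict(lambda: {n: 0.0 for n in neighbors})
        self.alpha = alpha

    def choose_next_hop(self, dest):
        # Forward the packet to the neighbor with the lowest estimated delivery time.
        return min(self.Q[dest], key=self.Q[dest].get)

    def update(self, dest, neighbor, queue_time, transmit_time, neighbor_estimate):
        """neighbor_estimate = the minimum of the neighbor's own Q-values for dest,
        reported back when the packet is handed off."""
        target = queue_time + transmit_time + neighbor_estimate
        self.Q[dest][neighbor] += self.alpha * (target - self.Q[dest][neighbor])
```

On each handoff the downstream node reports its own best estimate, so delay information propagates backwards along routes; that is how Q-routing learns to steer packets around congestion.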

Movement: An RL approach

What should our actions be?
    North, South, East, West, Stay Put
    Explore, Maintain connection, Terminate connection, etc.

What should our states be?
    Local information about nodes, locations, and paths
    Summarized local information
    Globally shared statistics

Policy search? Mixture of experts?

Macros, options, complex actions

Allow the nodes (agents) to utilize complex actions rather than simple N, S, E, W movements.

Actions might take varying amounts of time.

Agents can re-evaluate at each time step whether to continue the action; if the state hasn't really changed, then naturally the same action will be chosen again.

Example action: “plug”

1. Sniff packets in the neighborhood
2. Identify the path (source, receiver pair) with the longest average hops
3. Move to that path
4. Move along this path until a long hop is encountered
5. Insert yourself into the path at this point, thereby decreasing the average hop distance
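A pseudocode-level sketch of this macro-action; every simulator call here (sniff_paths, average_hop_length, nearest_point, longest_hop, move_toward) is a hypothetical placeholder, not an API from the actual simulator:

```python
def plug_action(agent, network):
    """Illustrative 'plug' macro-action: insert this node into the longest hop
    of the most stretched-out active path it can observe locally."""
    paths = network.sniff_paths(agent.position)             # 1. sniff packets in the neighborhood
    if not paths:
        return
    path = max(paths, key=lambda p: p.average_hop_length)   # 2. path with the longest average hops
    agent.move_toward(path.nearest_point(agent.position))   # 3. move to that path
    hop = path.longest_hop()                                 # 4. follow it to a long hop
    agent.move_toward(hop.midpoint())                        # 5. insert ourselves, shortening the hop
```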

Some notion of state

State space could be huge, so we choose certain features to parameterize the state space: connectivity, average hop distance, …

Actions should change the world state: exploring will hopefully lead to connectivity, plugging will lead to smaller average hops, …

Experimental results

Number of nodes    Range                 Theoretical fraction mobile    Empirical fraction mobile required
25                 2 r_n → r_n           1/2                            0.21
50                 1.7 r_n → 0.85 r_n    1/2                            0.25
100                1.7 r_n → 0.85 r_n    1/2                            0.19
200                1.6 r_n → 0.8 r_n     1/2                            0.17
400                1.6 r_n → 0.8 r_n     1/2                            0.14

Seems to work well

Pretty pictures

[Simulation figures shown across several slides.]

Many things to play with

Lossy transmissions
Transmission interference
Existence of opponents, jamming signals
Self-interested nodes
More realistic simulations (ns2)
Learning different agent roles, or optimizing the individual complex actions
Interaction between route learning and movement learning

Three yardsticks

1. Non-cooperative case: we want to play our best response to the observed play of the world, i.e. we want to learn about the opponent.
    Minimize regret.
    Play our best response.

2. Cooperative case: approximate a global optimum in a distributed manner, using only local information or less computation.

3. Skiing case: 17 cm of fresh powder last night and it's still snowing. More snow is better. Who can argue with that?

The End
