Derivative Action Learning in Games
Review of: J. Shamma and G. Arslan, “Dynamic Fictitious Play, Dynamic Gradient Play, and Distributed Convergence to Nash Equilibria,” IEEE Transactions on Automatic Control, Vol. 50, No. 3, pp. 312-327, March 2005.


Page 1:

Derivative Action Learning in Games

Review of: J. Shamma and G. Arslan, “Dynamic Fictitious Play, Dynamic Gradient Play, and Distributed Convergence to Nash Equilibria,” IEEE Transactions on Automatic Control, Vol. 50, No. 3, pp. 312-327, March 2005.

Page 2:

Overview

• The authors propose an extension of fictitious play (FP) and gradient play (GP) in which strategy adjustment is a function of both the estimated opponent strategy and its time derivative

• They demonstrate that when the learning rules are well-calibrated, convergence (or near-convergence) to Nash equilibria and asymptotic stability in the vicinity of equilibria can be achieved in games where static FP and GP fail to do so

Page 3:

Game Setup

This paper addresses a class of two-player games in which, at each instance of the game, each player selects an action a^i from a finite set according to his mixed strategy p^i and receives utility U^i(p^1, p^2) equal to his expected payoff plus an additional entropy-based reward for playing a mixed strategy. The purpose of the entropy term is not discussed by the authors, but it may be there to avoid converging to inferior local maxima in the utility function.

Actual payoff depends on the combined player actions a1 and a2, each randomly selected according to the mixed strategies p1 and p2.

U^i(p^i(k), p^{-i}(k)) = p^i(k)^T M^i p^{-i}(k) + \tau H(p^i(k)),   \tau \ge 0

or, alternatively,

U^i(p^i(k), p^{-i}(k)) = E[ Payoff(a^i(k), a^{-i}(k)) ] + \tau H(p^i(k))

where the entropy term H(s) = -s^T \log(s) encourages mixed-strategy selection.
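To make the perturbed utility concrete, here is a small numerical sketch (my own, not from the paper); the payoff matrix M, the strategies, and the value of tau below are illustrative placeholders.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """H(p) = -p^T log(p); largest for fully mixed strategies."""
    return -np.sum(p * np.log(p + eps))

def perturbed_utility(M, p_i, p_opp, tau):
    """U^i(p^i, p^-i) = p^i' M^i p^-i + tau * H(p^i), with tau >= 0."""
    return p_i @ M @ p_opp + tau * entropy(p_i)

# Illustrative 2x2 payoff matrix and mixed strategies (placeholders, not from the paper).
M = np.array([[1.0, 0.0],
              [0.0, 1.0]])
print(perturbed_utility(M, np.array([0.5, 0.5]), np.array([0.5, 0.5]), tau=0.1))
```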

Page 4:

[Figure: the entropy function H(·) rewards mixed strategies; horizontal axis: probability of selecting a^1 in the 2-dimensional strategy space.]

Page 5:

Empirical Estimation and Best Response

Player i’s strategy p^i is, in general, mixed and lies in the simplex in R^{m_i} (where m_i is the number of actions available to player i) whose vertices correspond to the pure actions.

Further, he adjusts his strategy by observing his opponent’s actions, forming an empirical estimate q^{-i} of his opponent’s strategy, and calculating the best mixed strategy in response. The adjusted strategy then directs his next move.

q^{-i}(k+1) = ( k q^{-i}(k) + a^{-i}(k) ) / (k + 1)

p^i(k) = \beta^i(q^{-i}(k))  →  a^i(k)   (random selection according to the distribution p^i)
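A minimal sketch of the recursive empirical-frequency update (my own illustration): up to the vanishing weight on the initial estimate, the recursion reproduces the running average of the opponent's observed actions. The random pure actions below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 3                        # number of opponent actions (hypothetical)
q = np.ones(m) / m           # initial empirical-frequency estimate q^{-i}(1)
actions = []

for k in range(1, 500):
    a = np.eye(m)[rng.integers(m)]   # opponent's pure action as a simplex vertex
    actions.append(a)
    q = (k * q + a) / (k + 1)        # q^{-i}(k+1) = (k q^{-i}(k) + a^{-i}(k)) / (k + 1)

# q tracks the running average of the observed actions.
print(q, np.mean(actions, axis=0))
```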

Page 6:

Best Response Function

The best response is defined by the authors to be the mixed strategy that maximizes the utility above. The authors claim (without proof) that, for τ > 0, the maximizing strategy is given by the logit function.

\beta^i(q^{-i}(k)) = \arg\max_{p^i} U^i(p^i, q^{-i}(k))

for \tau > 0:   \beta^i(q^{-i}(k)) = e^{M^i q^{-i}(k)/\tau} / \sum_{n=1}^{m_i} [ e^{M^i q^{-i}(k)/\tau} ]_n

where the exponential is applied componentwise (the logit, or softmax, function).
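A hedged sketch of the claimed logit best response, written as a numerically stable softmax; the payoff matrix and τ value are illustrative.

```python
import numpy as np

def logit_best_response(M, q_opp, tau):
    """beta^i(q^{-i}) = softmax(M^i q^{-i} / tau), the claimed maximizer of payoff + tau*entropy."""
    v = M @ q_opp / tau
    v = v - v.max()          # shift for numerical stability; does not change the softmax
    e = np.exp(v)
    return e / e.sum()

# Illustrative 3x3 payoff matrix and opponent frequency (placeholders).
M = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
q = np.array([0.2, 0.5, 0.3])
print(logit_best_response(M, q, tau=0.1))   # approaches the pure best response as tau -> 0
```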

Page 7:

FP in Continuous Time

p^i(t) = \beta^i(q^{-i}(t)) = \arg\max_{p^i} U^i(p^i, q^{-i}(t))

for \tau > 0:   \beta^i(q^{-i}(t)) = e^{M^i q^{-i}(t)/\tau} / \sum_{n=1}^{m_i} [ e^{M^i q^{-i}(t)/\tau} ]_n

\dot q^i(t) = \beta^i(q^{-i}(t)) - q^i(t)

The remaining discussion of fictitious play is conducted in the continuous-time domain. This allows the authors to describe the system dynamics with smooth differential equations, and player actions are identified with their mixed strategies.

The discrete-time dynamics are then interpreted as stochastic approximations of solutions to the continuous-time differential equations. This transformation is discussed in [Benaim, Hofbauer and Sorin 2003] and, presumably, in [Benaim and Hirsch 1996], though I have not seen the latter myself.
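As a rough illustration of these continuous-time dynamics (not the authors' code), here is a forward-Euler integration of \dot q^i = β^i(q^{-i}) − q^i for a simple 2×2 identical-interest game; the step size, payoff matrices, and initial conditions are my own choices.

```python
import numpy as np

def beta(M, q_opp, tau=0.1):
    """Smoothed best response (softmax), as on the previous slides."""
    v = M @ q_opp / tau
    v = v - v.max()
    e = np.exp(v)
    return e / e.sum()

# Illustrative 2x2 identical-interest (coordination) game; not from the paper.
M1 = np.array([[2.0, 0.0],
               [0.0, 1.0]])
M2 = M1.T

q1 = np.array([0.9, 0.1])    # player 1's empirical frequency
q2 = np.array([0.1, 0.9])    # player 2's empirical frequency
dt = 0.01                    # forward-Euler step size

for _ in range(20000):
    dq1 = beta(M1, q2) - q1  # qdot^1 = beta^1(q^2) - q^1
    dq2 = beta(M2, q1) - q2  # qdot^2 = beta^2(q^1) - q^2
    q1, q2 = q1 + dt * dq1, q2 + dt * dq2

print(q1, q2)   # the frequencies should settle near a (smoothed) equilibrium of this game
```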

Page 8:

Achieving Nash Equilibrium

Nash equilibria are reached at fixed points of the Best Response function. Convergence to fixed points occurs as the empirical frequency estimates converge to the actual strategies played.

p^{i*}(t) = \beta^i(q^{-i*}(t)) = \arg\max_{p^i} U^i(p^i, q^{-i*}(t))  →  p^{i*}(t) = q^{i*}(t)

\dot q^i(t) = \beta^i(q^{-i}(t)) - q^i(t)  →  0

Page 9:

Derivative Action FP (DAFP): Idealized Case – Exact DAFP

Exact DAFP uses a directly measured first-order forecast of the opponent’s strategy, in addition to the observed empirical frequency, to calculate the best response.

p^i(t) = \beta^i( q^{-i}(t) + \gamma \dot q^{-i}(t) )

\dot q^i(t) = \beta^i( q^{-i}(t) + \gamma \dot q^{-i}(t) ) - q^i(t)

\gamma is the “derivative gain” and 1 is the “proportional gain” in the PD-controller analogy, where the input is the empirical frequency q(t) and the output is the play strategy p(t).

Page 10:

Derivative Action FP (DAFP): Approximate DAFP

Approximate DAFP uses an estimated first-order forecast of the opponent’s strategy, in addition to the observed empirical frequency, to calculate the best response.

p^i(t) = \beta^i( q^{-i}(t) + \gamma \dot r^{-i}(t) )

\dot q^i(t) = \beta^i( q^{-i}(t) + \gamma \dot r^{-i}(t) ) - q^i(t)

\dot r^{-i}(t) = \lambda ( q^{-i}(t) - r^{-i}(t) )

r^{-i}(t) is a filtered approximation of q^{-i}(t), and \dot r^{-i}(t) → \dot q^{-i}(t) for large \lambda (provided the Best Response function \beta is sufficiently continuous).
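A hedged numerical sketch of Approximate DAFP, assuming Euler integration and a cyclic 3×3 game of the kind used in the Shapley demonstration later in the slides; the gains γ and λ, the step size, and the initial conditions are illustrative choices, not the paper's.

```python
import numpy as np

def beta(M, x, tau=0.1):
    v = M @ x / tau
    v = v - v.max()
    e = np.exp(v)
    return e / e.sum()

# Cyclic 3x3 payoffs of the Shapley type (my reconstruction; see the demonstration later).
M1 = np.array([[0.0, 0.0, 1.0],
               [1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0]])
M2 = M1.copy()

q1 = np.array([0.6, 0.2, 0.2]); q2 = np.array([0.2, 0.6, 0.2])
r1 = q1.copy(); r2 = q2.copy()        # filtered copies of q1, q2
gamma, lam, dt = 1.0, 100.0, 0.002    # derivative gain, filter gain, Euler step (illustrative)

for _ in range(300000):
    r1dot = lam * (q1 - r1)           # rdot^1 = lambda (q^1 - r^1): approximates qdot^1 for large lambda
    r2dot = lam * (q2 - r2)
    q1dot = beta(M1, q2 + gamma * r2dot) - q1   # best-respond to the forecast q^{-i} + gamma*rdot^{-i}
    q2dot = beta(M2, q1 + gamma * r1dot) - q2
    q1, q2 = q1 + dt * q1dot, q2 + dt * q2dot
    r1, r2 = r1 + dt * r1dot, r2 + dt * r2dot

print(q1, q2)   # with a large filter gain, the frequencies should approach the mixed equilibrium (1/3, 1/3, 1/3)
```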

Page 11:

Exact DAFP, Special Case: γ = 1

System Inversion – Each player seeks to play best response against current opponent strategy

p^i(t) = \beta^i( q^{-i}(t) + \dot q^{-i}(t) )   --Laplace transform-->   q^{-i}(s) + s q^{-i}(s) = (s + 1) q^{-i}(s)

\dot q^{-i}(t) = \beta^{-i}(q^i(t)) - q^{-i}(t) = p^{-i}(t) - q^{-i}(t)   --Laplace transform-->   s q^{-i}(s) = p^{-i}(s) - q^{-i}(s)  ⇒  (s + 1) q^{-i}(s) = p^{-i}(s)

so q^{-i}(t) + \dot q^{-i}(t) = p^{-i}(t), and therefore

p^i(t) = \beta^i( p^{-i}(t) )

Page 12:

Convergence with Exact DAFP in Special Case (γ = 1)

recall   \dot q^i(t) + q^i(t) = \beta^i( q^{-i}(t) + \dot q^{-i}(t) )

now let

z(t) = [ z^1(t) ; z^2(t) ] = [ q^1(t) + \dot q^1(t) ; q^2(t) + \dot q^2(t) ]

and let T: R^{m_1} × R^{m_2} → Δ(m_1) × Δ(m_2) be the mapping

T(z) = [ \beta^1(z^2) ; \beta^2(z^1) ]

then the DAFP equations can be written as

z(t) = T(z(t)).

DAFP dynamics follow fixed points of T, which are Nash equilibria of the original game.

Page 13:

Convergence with Noisy Exact DAFP in Special Case (γ = 1)

Suppose

\dot q^i(t) + q^i(t) = \beta^i( q^{-i}(t) + \dot q^{-i}(t) + e^{-i}(t) ).

In words, the derivative of the empirical frequencies is measurable only to within some error.

The authors prove that for any arbitrarily small ε > 0, there exists a δ > 0 such that if the measurement error (e^1, e^2) eventually remains within a δ-neighborhood of the origin, then the empirical frequencies (q^1, q^2) will remain within an ε-neighborhood of a Nash equilibrium.

This suggests that, if a sufficiently accurate approximation of the empirical frequency derivative can be constructed, Approximate DAFP will converge to an arbitrary neighborhood of a Nash equilibrium.

Page 14:

Convergence with Approximate DAFP in Special Case (γ = 1)

recall

p^i(t) = \beta^i( q^{-i}(t) + \gamma \dot r^{-i}(t) )

\dot q^i(t) = \beta^i( q^{-i}(t) + \gamma \dot r^{-i}(t) ) - q^i(t)

\dot r^{-i}(t) = \lambda ( q^{-i}(t) - r^{-i}(t) )

The authors prove that, since q^i(t) and q^{-i}(t) evolve in the strategy simplex and are therefore uniformly bounded, the filtered versions r^i_\lambda(t) and \dot r^i_\lambda(t) are also uniformly bounded for any \lambda, with associated solutions q^i_\lambda(t) and \dot q^i_\lambda(t), since

| r^i_\lambda(t) - q^i_\lambda(t) | \le e^{-\lambda t} | r^i_\lambda(0) - q^i_\lambda(0) | + (1/\lambda) \sup_{\tau \ge 0} | \dot q^i_\lambda(\tau) |,

reminiscent of Lyapunov boundedness assurances for LTI systems with bounded input.

Furthermore, the authors prove that for arbitrarily increasing \lambda,

\dot r^i_\lambda(t) - \dot q^i_\lambda(t) → 0   for t ∈ [T, T_1], T > 0.

However, in order to show that Approximate DAFP converges to an arbitrary neighborhood of Nash equilibria, it is necessary to show that \dot r^i_\lambda(t) and \dot q^i_\lambda(t) converge to the same value.

Page 15:

Convergence with Approximate DAFP in Special Case (γ = 1) (CONTINUED)

Let q^i_\lambda(t) and \dot q^i_\lambda(t) converge to \bar q^i(t) and \dot{\bar q}^i(t), respectively, as \lambda → ∞, and define

b^i(t) = \bar q^i(t) + \dot{\bar q}^i(t).

If b^i(t) = \beta^i( \bar q^{-i}(t) + \dot{\bar q}^{-i}(t) ), then \bar q^i(t) and \bar q^{-i}(t) solve the Exact DAFP dynamic equations, and they are fixed points of the mapping T defined earlier.

Page 16:

Convergence with Approximate DAFP in Special Case (γ = 1) (CONTINUED)

Note that the equality condition given above guarantees convergence to a fixed point (Nash equilibrium) almost by definition, but it requires that the function β^i be weakly continuous.

In practice, Approximate DAFP with γ = 1 converges to an arbitrary neighborhood about a Nash equilibrium and is stable in the sense of Lyapunov.

Asymptotic stability is achievable with non-unity derivative gain, and the authors illustrate a method for setting the gain by linearizing the strategy evolution about a Nash equilibrium and using a Routh-Hurwitz procedure to ensure that all eigenvalues of the linearized system matrix are in the left half plane (for continuous-time systems).

Page 17:

Simulation Demonstration: Shapley Game

Consider the 2-player 3×3 game invented by Lloyd Shapley to illustrate non-convergence of fictitious play in general.

M^1 = M^2 =
  [ 0 0 1 ]
  [ 1 0 0 ]
  [ 0 1 0 ]

Standard FP in Discrete Time (top) and Continuous Time (bottom)
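To reproduce the flavor of this demonstration, here is a rough discrete-time FP simulation on the Shapley game (using a deterministic pure best response rather than the paper's smoothed β); the initial conditions are my own.

```python
import numpy as np

# Shapley game payoff matrices as reconstructed on this slide.
M1 = np.array([[0.0, 0.0, 1.0],
               [1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0]])
M2 = M1.copy()

m = 3
q1 = np.eye(m)[0]    # arbitrary asymmetric initial histories
q2 = np.eye(m)[1]

for k in range(1, 5000):
    a1 = np.eye(m)[np.argmax(M1 @ q2)]   # pure best response to the opponent's empirical frequency
    a2 = np.eye(m)[np.argmax(M2 @ q1)]
    q1 = (k * q1 + a1) / (k + 1)
    q2 = (k * q2 + a2) / (k + 1)

# The empirical frequencies keep cycling (Shapley's classic non-convergence)
# rather than settling at the mixed equilibrium (1/3, 1/3, 1/3).
print(q1, q2)
```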

Page 18:

Simulation Demonstration: Shapley Game

Shapley Game with Approximate DAFP in Continuous Time with increasing λ: λ = 1 (top), λ = 10 (middle), λ = 100 (bottom)

Another interesting thing here is that the players enter a correlated equilibrium, and their average payoff is higher than the expected Nash payoff.

For the “modified” game, where the players’ utility matrices are not identical, the strategies converge to theoretically unsupported values, illustrating a violation of the weak-continuity requirement for β^i. This steady-state error can be corrected by setting the derivative gain according to the linearization/Routh-Hurwitz procedure noted earlier.

Page 19:

GP Review for 2-player Games

Gradient Play: Player i adjusts his strategy by observing his own empirical action frequency and adding the gradient of his utility, as determined by his opponent’s empirical action frequency.

Page 20:

GP in Discrete Time

U^i(p^i(k), p^{-i}(k)) = p^i(k)^T M^i p^{-i}(k)

\nabla_{p^i} U^i(p^i(k), p^{-i}(k)) = M^i p^{-i}(k)

p^i(k) = \Pi_\Delta [ q^i(k) + M^i q^{-i}(k) ]

q^i(k+1) = ( k q^i(k) + a^i(k) ) / (k + 1)

\Pi_\Delta denotes projection onto the simplex.
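A sketch of a single discrete-time GP step. The Euclidean projection onto the probability simplex, which the slides only denote by Π_Δ, is written out explicitly using the standard sort-based algorithm; the game matrix and starting frequencies are illustrative.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (standard sort-based algorithm)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

rng = np.random.default_rng(1)
M1 = np.array([[2.0, 0.0],
               [0.0, 1.0]])     # illustrative payoff matrix for player 1
m = 2
k = 10                          # current stage (illustrative)
q1 = np.array([0.5, 0.5])       # own empirical frequency q^1(k)
q2 = np.array([0.3, 0.7])       # opponent's empirical frequency q^2(k)

# One GP step: p^1(k) = Proj_simplex[q^1(k) + M^1 q^2(k)], then play a^1(k) ~ p^1(k).
p1 = project_simplex(q1 + M1 @ q2)
a1 = np.eye(m)[rng.choice(m, p=p1)]
q1_next = (k * q1 + a1) / (k + 1)
print(p1, q1_next)
```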

Page 21:

GP in Continuous Time

p^i(t) = \Pi_\Delta [ q^i(t) + M^i q^{-i}(t) ]

\dot q^i(t) = \Pi_\Delta [ q^i(t) + M^i q^{-i}(t) ] - q^i(t)

Page 22:

Achieving Nash Equilibrium

Gradient Play:

p^{i*}(t) = \Pi_\Delta [ q^{i*}(t) + M^i q^{-i*}(t) ]

\dot q^i(t) = \Pi_\Delta [ q^{i*}(t) + M^i q^{-i*}(t) ] - q^{i*}(t) = 0

Page 23:

Derivative Action Gradient Play

Standard GP cannot converge asymptotically to completely mixed Nash equilibria because the linearized dynamics are unstable at mixed equilibria.

Exact DAGP always enables asymptotic stability at mixed equilibria with proper selection of derivative gain. Under some conditions, Approximate DAGP also enables asymptotic stability near mixed equilibria.

Approximate DAGP always ensures asymptotic stability in the vicinity of strict equilibria.

Exact:   \dot q^i(t) = \Pi_\Delta [ q^i(t) + M^i ( q^{-i}(t) + \gamma \dot q^{-i}(t) ) ] - q^i(t)

Approximate:   \dot q^i(t) = \Pi_\Delta [ q^i(t) + M^i ( q^{-i}(t) + \gamma \dot r^{-i}(t) ) ] - q^i(t)

\dot r^{-i}(t) = \lambda ( q^{-i}(t) - r^{-i}(t) )
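A rough Euler sketch of the Exact DAGP dynamics (my construction, not the authors' code); the previous step's derivative stands in for the exactly measured \dot q^{-i}, and the zero-sum game, gain γ, and step size are illustrative.

```python
import numpy as np

def project_simplex(v):
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

# Zero-sum 2x2 game with a completely mixed equilibrium at (0.5, 0.5) (illustrative).
M1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
M2 = -M1

q1 = np.array([0.8, 0.2]); q2 = np.array([0.3, 0.7])
q1dot = np.zeros(2); q2dot = np.zeros(2)
gamma, dt = 1.0, 0.01

for _ in range(50000):
    # Exact DAGP: qdot^i = Proj[q^i + M^i (q^{-i} + gamma*qdot^{-i})] - q^i;
    # the previous step's derivative stands in for the exactly measured qdot^{-i}.
    new_q1dot = project_simplex(q1 + M1 @ (q2 + gamma * q2dot)) - q1
    new_q2dot = project_simplex(q2 + M2 @ (q1 + gamma * q1dot)) - q2
    q1dot, q2dot = new_q1dot, new_q2dot
    q1, q2 = q1 + dt * q1dot, q2 + dt * q2dot

print(q1, q2)   # with suitable gains the frequencies should approach the mixed equilibrium (0.5, 0.5)
```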

Page 24:

DAGP Simulation: Modified Shapley Game

Page 25:

Multiplayer Games

Consider the 3-player Jordan game:

(The three players’ payoff matrices, parameterized by a_1 and a_2, are not legibly recoverable here.)

The authors demonstrate that DAGP converges to the mixed Nash equilibrium.

Page 26:

Jordan Game Demonstration