Reinforcement Learning Presentation
Markov Games as a Framework for Multi-agent Reinforcement Learning
Mike L. Littman
Jinzhong Niu
March 30, 2004
Overview
MDPs can describe only single-agent environments.
A new mathematical framework is needed to support multi-agent reinforcement learning: Markov games.
A single step in this direction is explored here: 2-player zero-sum Markov games.
Definitions
Markov Decision Process (MDP)
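The slide's definition did not survive extraction; a standard formalization (a sketch in my notation, consistent with the paper's setting) is:

```latex
\text{MDP} = \langle S, A, T, R \rangle, \qquad
T : S \times A \to \Delta(S), \qquad
R : S \times A \to \mathbb{R}
```

The agent seeks a policy maximizing the expected discounted return $\mathbb{E}\!\left[\sum_{t} \gamma^{t} r_{t}\right]$ with discount factor $0 \le \gamma < 1$.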
Definitions (cont.)
Markov Game (MG)
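A sketch of the general k-player definition (my notation): each agent i has its own action set and reward function, and transitions depend on the joint action:

```latex
\text{MG} = \langle S, A_1, \dots, A_k, T, R_1, \dots, R_k \rangle, \qquad
T : S \times A_1 \times \dots \times A_k \to \Delta(S)
```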
Definitions (cont.)
Two-player zero-sum Markov Game (2P-MG)
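A sketch (my notation): the agent chooses actions from A, the opponent from O, and the rewards sum to zero, so a single reward function R suffices:

```latex
\langle S, A, O, T, R \rangle, \qquad
R_{\text{opponent}}(s, a, o) = -R_{\text{agent}}(s, a, o)
```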
Is 2P-MG Capable Enough?
Yes, although it precludes cooperation!
It generalizes:
MDPs (when |O| = 1)
The opponent has constant behavior, which may be viewed as part of the environment.
Matrix games (when |S| = 1)
The environment holds no information; rewards are decided entirely by the actions.
Matrix Games
Example – “rock, paper, scissors”
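The row player's payoff matrix for this game (rows: rock, paper, scissors; columns: the opponent's same three moves):

```latex
R = \begin{pmatrix} 0 & -1 & 1 \\ 1 & 0 & -1 \\ -1 & 1 & 0 \end{pmatrix}
```

Each entry is the row player's reward; being zero-sum, the opponent receives its negation.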
What exactly does ‘optimality’ mean?
MDP: A stationary, deterministic, and undominated optimal policy always exists.
MG: The performance of a policy depends on the opponent’s policy, so policies cannot be evaluated out of context.
The new definition of ‘optimality’, from game theory: perform best in the worst case, compared with other policies.
At least one optimal policy exists, but it may or may not be deterministic, because the agent is uncertain of its opponent’s move.
Finding Optimal Policy - Matrix Games
The agent’s minimum expected reward should be as large as possible.
Use V to denote this minimum value, then consider how to maximize it.
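This maximization can be posed as a linear program. A minimal sketch using SciPy (variable names are mine, not the paper's; rock-paper-scissors serves as the example game):

```python
# Solve the matrix game max_pi min_o sum_a pi[a] * R[a, o] as a linear program.
import numpy as np
from scipy.optimize import linprog

# Rock-paper-scissors payoff for the row player.
R = np.array([[0, -1, 1],
              [1, 0, -1],
              [-1, 1, 0]], dtype=float)

n_a, n_o = R.shape
# Decision variables: pi[0..n_a-1] and the game value v; maximize v == minimize -v.
c = np.zeros(n_a + 1)
c[-1] = -1.0
# For each opponent action o: v - sum_a pi[a] * R[a, o] <= 0.
A_ub = np.hstack([-R.T, np.ones((n_o, 1))])
b_ub = np.zeros(n_o)
# Probabilities sum to 1; v is unconstrained in sign.
A_eq = np.array([[1.0] * n_a + [0.0]])
b_eq = np.array([1.0])
bounds = [(0, 1)] * n_a + [(None, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
pi, v = res.x[:n_a], res.x[-1]
print(pi, v)  # uniform (1/3, 1/3, 1/3) with value 0 for this game
```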
Finding Optimal Policy - MDP
Value of a state
Quality of a state-action pair
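These two quantities are related by the standard Bellman optimality equations:

```latex
V(s) = \max_{a \in A} Q(s, a), \qquad
Q(s, a) = R(s, a) + \gamma \sum_{s' \in S} T(s, a, s')\, V(s')
```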
Finding Optimal Policy – 2P-MG
Value of a state
Quality of a state-action-opponent-action (s, a, o) triple
(Figure: V(s) is obtained from the values Q(s, a1, o1), Q(s, a2, o2), Q(s, a3, o3) by mixing over the agent's actions (s, a1), (s, a2), (s, a3) and taking the minimum over the opponent's actions o1, o2, o3.)
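In formulas, matching Littman's minimax-Q formulation:

```latex
V(s) = \max_{\pi \in \Delta(A)} \min_{o \in O} \sum_{a \in A} \pi_a\, Q(s, a, o), \qquad
Q(s, a, o) = R(s, a, o) + \gamma \sum_{s' \in S} T(s, a, o, s')\, V(s')
```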
Learning Optimal Policies
Q-learning
minimax-Q learning
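A sketch of one minimax-Q update: it proceeds like Q-learning, but the backed-up state value is the minimax value of the matrix game Q[s], computed by linear programming (all names here are mine, not from the paper's code):

```python
import numpy as np
from scipy.optimize import linprog

def minimax_value(Q_s):
    """Value of the matrix game Q_s[a, o] for the maximizing row player."""
    n_a, n_o = Q_s.shape
    c = np.zeros(n_a + 1); c[-1] = -1.0            # maximize v
    A_ub = np.hstack([-Q_s.T, np.ones((n_o, 1))])  # v <= sum_a pi[a] * Q_s[a, o]
    b_ub = np.zeros(n_o)
    A_eq = np.array([[1.0] * n_a + [0.0]]); b_eq = np.array([1.0])
    bounds = [(0, 1)] * n_a + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1]

def minimax_q_update(Q, s, a, o, r, s2, alpha=0.1, gamma=0.9):
    """Q[s][a, o] <- (1 - alpha) * Q[s][a, o] + alpha * (r + gamma * V(s2))."""
    v_next = minimax_value(Q[s2])
    Q[s][a, o] = (1 - alpha) * Q[s][a, o] + alpha * (r + gamma * v_next)

# Tiny demo: one state whose Q happens to be the rock-paper-scissors matrix
# (minimax value 0), so this particular update leaves Q[0][0, 1] at
# (1 - 0.1) * (-1) + 0.1 * (-1 + 0.9 * 0) = -1.0.
Q = {0: np.array([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]])}
minimax_q_update(Q, s=0, a=0, o=1, r=-1.0, s2=0)
```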
Experiment - Training
Four agents trained through 10^6 steps:
minimax-Q learning
vs. random opponent - MR
vs. itself - MM
Q-learning
vs. random opponent - QR
vs. itself - QQ
Experiment - Testing
Test 1: QR > MR?
Test 2: QR << QQ?
Test 3: QR, QQ - 100% loser?
Contributions
A solution to 2-player zero-sum Markov games via a modified Q-learning method in which the max operator is replaced by minimax
Minimax can also be used in single-agent environments to avoid risky behavior.
Future work
Possible performance improvements for the minimax-Q learning method
Linear programming incurs high computational cost.
Iterative methods may be used to obtain approximate minimax solutions much faster, and an approximation is often sufficient.