
Strategic Interactions against Non-Stationary Agents

by

Pablo Francisco Hernández Leal

A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Science in the area of Computer Science at the Instituto Nacional de Astrofísica, Óptica y Electrónica

Supervised by:

Dr. José Enrique Muñoz de Cote Flores Luna,

INAOE

© INAOE 2015

The author grants INAOE permission to reproduce and distribute copies of this thesis in whole or in part.


Strategic Interactions against Non-Stationary Agents

By:

Pablo Francisco Hernández Leal

Advisor:

Dr. José Enrique Muñoz de Cote Flores Luna

Ph.D. Dissertation

Coordinación de Ciencias Computacionales

Instituto Nacional de Astrofísica, Óptica y Electrónica

December 2015

Sta. María Tonantzintla, Puebla, México


Agradecimientos

To my advisors, for their guidance, reviews, and support in carrying out this doctoral research. To Enrique Muñoz de Cote, who introduced me to the area of game theory and accepted me as his first doctoral student upon his arrival at INAOE. In particular, I thank him for the long brainstorming sessions (which sometimes left me with more questions than I started with), the extremely detailed reviews of my papers, and the support for doing a research stay abroad. From him I have learned and come to value the idea of doing high-quality research, of not settling, and of always aiming for the highest standards. I appreciate his direction and guidance during these four years, which combined advice with pushing me to reach my full potential. To Enrique Sucar, who accepted me as his student back in my master's studies and continued during my doctoral research even though the topic was different. I appreciate his personality, which is very similar to his way of doing science, at once agreeable and responsible. From them I have learned to be a better researcher, and I hope that in the future I can be as dedicated an advisor as they are.

To the many people at INAOE who, through seminars, courses, or conversations, helped me complete this thesis and make me a better researcher. To the robotics and intelligent systems groups, who listened to my presentations. To Felipe Orihuela, for his guidance on statistical matters and his tough criticism and suggestions. To Alma Rios, for her support with administrative matters.

To the people I met during my stay at CREATE-NET: Oscar Mayora, Venet Osmani, and Alban Maxhuni (who became a great friend). To Matthew E. Taylor and Yusen Zhan of Washington State University, with whom I spent seven months and with whom I hope to keep collaborating. To Benjamin Rosman, with whom I have started a collaboration even though he is on the other side of the world.

To my committee members, Eduardo Morales, Francisco Martínez, Angélica Muñoz, Aurelio López, and Prashant Doshi, for their observations and comments, which enriched and improved this thesis.

To my family, and in particular my parents, who have supported me in this life of doing research. When I finished my bachelor's degree, they supported me when I told them I wanted to study for a master's degree. Two years later I also wanted to study for a doctorate, and their answer was again positive. In these four years I have been fortunate to do research stays and attend several conferences, and my parents have always supported me, even knowing deep down that this takes me physically away from them. They have always been and will always be my guide, inspiration, and support throughout my life; I admire them and they are role models to follow. I thank my mother, Isabel Leal, for raising me, for loving me, for passing on to me her intelligence and her sense of responsibility in life and work. For teaching me to better myself every day. For teaching me to be punctual, respectful, and many other values that I hope will make me, besides a good researcher, a good person. Thank you for teaching me to do my work well, to be responsible, and to follow through as I should. To my father, Francisco Hernández, for always giving me his affection and advice, and for teaching me what it is to love your work. Thank you for teaching me to enjoy life, for sharing your joys with me, for supporting me and always celebrating my small achievements. Thank you for all your affection and your hugs. For encouraging me and teaching me not to doubt my abilities. Thank you for teaching me so many things in life that cannot be learned through books and equations. I thank my parents for the education they gave me, for their upbringing and the values they have taught me, for all their sacrifices, and above all for their unconditional support, always believing in me; without them I could not be who I am today. To Isabel Chavarría, for teaching me to overcome challenges and for making me a better person. For being by my side in difficult moments, thank you for giving me peace and certainty. Thank you for trusting in me and in this idea of doing science for life. Thank you for this time together and for all that still lies ahead, because with you I am happy.

To the Consejo Nacional de Ciencia y Tecnología (CONACyT) for the financial support granted through scholarship No. 234507 for doctoral studies. To the Instituto Nacional de Astrofísica, Óptica y Electrónica and the Coordinación de Ciencias Computacionales for the academic training and all the facilities provided.

Pablo Fco. Hernández Leal, Sta. María Tonantzintla, Pue., 2015


Resumen

Designing an agent capable of learning to interact with another agent is an open problem. An interaction occurs when two or more agents take an action in a given environment and obtain a utility that depends on the joint action. Current multiagent learning techniques usually do not obtain good results against agents that change their behavior during a repeated interaction. This is because they generally do not model the behavior of the other agents and instead make assumptions that are too restrictive for real scenarios. Moreover, considering that many applications require the interaction of different types of agents, this is an important problem to solve. Whether the domain is cooperative (where the agents have a common goal) or competitive (where the goals differ), there is one common aspect: agents must learn how the others are acting and react quickly to changes in behavior. This thesis focuses on how to act against agents that use different strategies during the interaction (which makes them non-stationary); in particular, it focuses on agents that face other agents (called opponents in this thesis) which may use different strategies and switch among them over time. Dealing with non-stationary opponents involves three different aspects: (i) learning a model of the opponent, (ii) computing a policy (plan) against the opponent (since the objective is to maximize the utility during the interaction), and (iii) detecting changes in the opponent's behavior. The main contributions of this thesis are:

• We proposed a framework for learning and planning against non-stationary strategies in repeated games. One algorithm that uses this framework is MDP4.5, which uses decision trees as opponent models. A second algorithm is MDP-CL, which learns a Markov decision process (MDP) to represent the opponent's strategy. MDP4.5 then transforms the decision tree into an MDP. Solving the MDP yields an optimal policy against the opponent. To detect changes of strategy, different models are learned during the game with different information.

• MDP-CL and MDP4.5 learn models from interaction, without prior information. Moreover, when these algorithms detect a switch, the current model is discarded and a new one is learned. To overcome these limitations, a priori MDP-CL and incremental MDP-CL were proposed. A priori MDP-CL starts with a set of models before the interaction and must detect which of them the opponent is using. Incremental MDP-CL does not discard the learned model when it detects a switch; in this way, if the opponent returns to a previously used strategy, the model is already known and the detection process is faster.

• We proposed a new type of exploration, called drift exploration, designed to detect changes in the opponent that would otherwise go unnoticed. In this context we proposed a new algorithm, R-max#, which visits parts of the state space that have not been updated recently, implicitly checking for changes in the opponent's behavior. Moreover, optimality guarantees are provided against certain types of non-stationary opponents.

• Finally, we proposed DriftER, a mechanism for detecting behavior changes based on monitoring the error rate of the opponent model. DriftER provides theoretical guarantees of change detection under certain assumptions.

The proposed algorithms were evaluated in several domains, some commonly used as benchmarks and other new domains closer to real situations: the iterated prisoner's dilemma, a negotiation task, a real-world application in energy markets, and random games from game theory. Comparisons were made against state-of-the-art algorithms from reinforcement learning and game theory, confirming that our proposals are able to detect behavior changes and obtain better results in terms of the quality of the learned model and average rewards.


Abstract

Designing an agent that is capable of interacting with another agent is an open problem. An interaction happens when two or more agents perform an action in an environment and obtain a utility based on the joint action. Current multiagent learning techniques do not fare well against agents that change their behavior during a repeated interaction. This happens because they usually do not model the other agents' behavior and instead make assumptions that are too restrictive for real scenarios. Furthermore, considering that many applications demand different types of agents to work together, this is an important problem to solve. It does not matter whether the domain is cooperative (where agents have a common goal) or competitive (where objectives differ); there is one common aspect: agents must learn how their counterpart is acting and react quickly to changes in behavior. Our research focuses on how to act against agents that use different strategies during an interaction (which makes them non-stationary); in particular, we focus on agents that use different strategies and switch among them over time. Dealing with non-stationary opponents involves three different aspects: (i) learning a model of the opponent, (ii) computing a policy (a plan to act) against the opponent (since the objective is to maximize the utility throughout the interaction), and (iii) detecting switches in the opponent's strategy. The main contributions of this thesis are:

• We propose a framework for learning and planning against non-stationary strategies in repeated games. One algorithm in this framework is MDP4.5, which uses decision trees as opponent models. The second algorithm is MDP-CL, which learns a Markov decision process (MDP) to represent the opponent's strategy. MDP4.5 transforms the decision tree into an MDP. Solving the MDP yields an optimal policy against that opponent. To assess changes of strategy, different models are learned throughout the game with different information.

• MDP-CL and MDP4.5 learn models from interaction without any prior information. Moreover, when they detect a switch, the learned model is discarded and a new one is learned. We propose two extensions of MDP-CL that overcome these limitations: a priori MDP-CL and incremental MDP-CL. A priori MDP-CL knows the set of models used by the opponent, and the problem is to detect which strategy from that set the opponent is using. Incremental MDP-CL does not discard the learned model once it detects a switch; in this way, if the opponent returns to a previously used strategy, the model is already known and detection is faster than relearning it.

• We propose a new type of exploration, called drift exploration, which is designed to detect switches in the opponent that would otherwise pass unnoticed. In this regard we propose a new algorithm, R-max#, which revisits parts of the state space that have not been updated recently, thus implicitly checking for opponent switches. Moreover, we provide theoretical guarantees under which R-max# behaves optimally against some types of non-stationary opponents.

• Finally, we propose DriftER, a switch detection mechanism that keeps track of the error rate of the opponent model. We also provide a theoretical guarantee of switch detection under certain assumptions.

Our proposals were evaluated on diverse domains, some used as standard references and others more realistic: the iterated prisoner's dilemma, a negotiation task, a real-world domain in energy markets, and random games from game theory. Comparisons against state-of-the-art algorithms from reinforcement learning and game theory show that our approaches are capable of detecting switches and obtaining better scores in terms of model quality and average rewards.


Contents

Agradecimientos i

Resumen iii

Abstract v

List of Figures xii

List of Tables xv

List of Algorithms xvii

List of Symbols xviii

1 Introduction 1

1.1 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.2.1 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.2.2 Research questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.3.1 Main objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.3.2 Specific objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.5 Thesis summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2 Preliminaries 9

2.1 Decision theoretic planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.1.1 Markov decision process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.1.2 Partially observable Markov decision process . . . . . . . . . . . . . . . . . . . 11

2.2 Machine learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12


2.2.1 Supervised learning: classification . . . . . . . . . . . . . . . . . . . . . . . . . 12

Decision trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.2.2 Reinforcement learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

Hidden mode - MDPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.2.3 Exploration vs. exploitation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

Sample complexity of exploration . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.3 Game theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.3.1 Nash equilibrium . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.3.2 Repeated and stochastic games . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.3.3 Behavioral game theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.4 Summary of the chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3 Related Work 23

3.1 Decision theoretic planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.1.1 Multiagent approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.2 Probabilistic Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.3 Machine learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

3.3.1 Concept drift . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

3.3.2 Reinforcement learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

3.3.3 Exploration vs. exploitation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.4 Game theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.4.1 Implicit negotiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.4.2 Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.4.3 Behavioral game theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.5 Opponent and teammate modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.5.1 Memory bounded learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.6 Hybrid approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.7 Summary of the chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

4 Acting against Non-Stationary Opponents 37

4.1 MDP-CL and MDP4.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

4.1.1 Modeling opponents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

4.1.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

4.1.3 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

4.1.4 Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

4.1.5 Overview of the framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40


4.1.6 Learning: opponent strategy assessment . . . . . . . . . . . . . . . . . . . . . . 42

MDP4.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

MDP-CL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

4.1.7 Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

Decision trees to MDPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

4.1.8 Detecting opponent switches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

Decision trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

MDPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

Running example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

4.1.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

4.2 MDP-CL with knowledge reuse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

4.2.1 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

4.2.2 Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

4.2.3 A priori MDP-CL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

4.2.4 Incremental MDP-CL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

4.2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

4.3 Drift exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

4.3.1 General drift exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

4.3.2 Efficient drift exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

R-MAX# . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

4.3.3 Running example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

4.3.4 Practical considerations of R-max# . . . . . . . . . . . . . . . . . . . . . . . . 56

4.4 Sample complexity of exploration for R-max# . . . . . . . . . . . . . . . . . . . . . . 56

4.4.1 Efficient drift exploration with switch detection . . . . . . . . . . . . . . . . 60

4.4.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

4.5 DriftER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

4.5.1 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

4.5.2 Model learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

4.5.3 Switch detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

4.5.4 Theoretical guarantee for switch detection . . . . . . . . . . . . . . . . . . . . . 64

4.5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

4.6 Summary of the chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

5 Experiments 67

5.1 Experimental domains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68


5.1.1 Iterated prisoner’s dilemma (iPD) . . . . . . . . . . . . . . . . . . . . . . . . . 68

5.1.2 Multiagent iterated prisoner’s dilemma . . . . . . . . . . . . . . . . . . . . . . . 69

5.1.3 Alternate-offers bargaining . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

5.1.4 Double auctions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

5.1.5 General-sum games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

Battle of the sexes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

5.2 MDP4.5 and MDP-CL against deterministic switching opponents . . . . . . . . . . . . 71

5.2.1 Setting and objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

5.2.2 HM-MDPs performance experiments . . . . . . . . . . . . . . . . . . . . . . . . 73

5.2.3 HM-MDPs vs MDP4.5 vs MDP-CL . . . . . . . . . . . . . . . . . . . . . . . . 74

5.2.4 Preliminary drift exploration for MDP4.5 and MDP-CL . . . . . . . . . . . . . 76

5.2.5 Learning speed of the opponent strategy . . . . . . . . . . . . . . . . . . . . . . 78

5.2.6 Increasing the number of opponents . . . . . . . . . . . . . . . . . . . . . . . . 79

5.2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

5.3 A priori and incremental MDP-CL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

5.3.1 Setting and objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

5.3.2 Model selection in a priori MDP-CL . . . . . . . . . . . . . . . . . . . . . . . . 82

5.3.3 Rewards and quality in a priori MDP-CL . . . . . . . . . . . . . . . . . . . . . 82

5.3.4 Incremental models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

5.3.5 A priori noisy models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

5.3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

5.4 Drift exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

5.4.1 Settings and objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

5.4.2 Drift and non-drift exploration approaches . . . . . . . . . . . . . . . . . . . . 87

5.4.3 Further analysis of MDP-CL(DE) . . . . . . . . . . . . . . . . . . . . . . . . . 89

5.4.4 Further analysis of R-max# . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

5.4.5 Efficient exploration + switch detection: R-max#CL . . . . . . . . . . . . . 96

5.4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

5.5 DriftER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

5.5.1 Setting and objectives (repeated games) . . . . . . . . . . . . . . . . . . . . . . 99

5.5.2 Switch detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

DriftER parameter behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

5.5.3 Setting and objectives (double auctions) . . . . . . . . . . . . . . . . . . . . . . 101

5.5.4 Fixed non-stationary opponents . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

5.5.5 Detecting switches in the opponent . . . . . . . . . . . . . . . . . . . . . . . . . 103


5.5.6 Noisy non-stationary opponents . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

5.5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

5.6 Non-stationary game theory strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

5.6.1 Setting and objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

5.6.2 Battle of the sexes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

5.6.3 General-sum games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

5.7 Summary of the chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

6 Conclusions and Future Research 111

6.1 Summary of the proposed algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

6.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

6.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

6.4 Open questions and future research ideas . . . . . . . . . . . . . . . . . . . . . . . . . 114

6.5 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

References 117

A PowerTAC 127

A.1 Energy markets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

A.2 PowerTAC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

A.3 Periodic double auctions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

A.4 TacTex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

B General-sum Games 131

C Extra Experiments 133

C.1 HM-MDPs training and performance experiments . . . . . . . . . . . . . . . . . . . . . 133

C.2 R-max exploration against pure and mixed strategies . . . . . . . . . . . . . . . . . . . 136


List of Figures

2.1 Interaction of an agent in an environment. . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.2 A Markov decision process (MDP) with four states and two actions. . . . . . . . . . . 10

2.3 A decision tree that models an opponent strategy. . . . . . . . . . . . . . . . . . . . . 13

2.4 An example of an HM-MDP with 3 modes and 4 states. . . . . . . . . . . . . . . . . . 15

2.5 The automata that describe TFT and Pavlov strategies. . . . . . . . . . . . . . . . . . 21

3.1 Related work to this thesis divided in areas. . . . . . . . . . . . . . . . . . . . . . . . . 24

3.2 An example of an Influence Diagram that represents the decision whether to take an

umbrella. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

4.1 Different sections of this thesis and how they relate to each other inside this chapter. . 38

4.2 The three main parts of the framework. . . . . . . . . . . . . . . . . . . . . . . . . . . 40

4.3 A decision tree and a learned MDP using the game matrix of the prisoner’s dilemma. 43

4.4 MDP obtained from decision tree in Figure 4.3 (a). . . . . . . . . . . . . . . . . . . . . 45

4.5 Example of highly dissimilar and similar decision trees. . . . . . . . . . . . . . . . . . . 47

4.6 Example of how the framework works against a TFT-Pavlov opponent. . . . . . . . . . 48

4.7 An example of a learning agent against a Bully-TFT switching opponent. . . . . . . . 53

4.8 An example of the learned models of R-max# against a Bully-TFT switching opponent. 55

4.9 An illustration for the running behavior of R-max#. . . . . . . . . . . . . . . . . . . . 59

4.10 Possible switch points in R-max#. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

5.1 Experimental domains used in this thesis and where they are used in this chapter. . . 68

5.2 The evaluation approach for the proposed framework and for HM-MDPs. . . . . . . . 72

5.3 Comparison of rewards of MDP-CL, MDP4.5 and HM-MDPs against different switching

opponents. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

5.4 Comparison between MDP4.5 and MDP-CL in terms of average rewards. . . . . . . . 78

5.5 MDP representation and rewards obtained in the multiagent version of the prisoner’s

dilemma with two opponents. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80


5.6 Total variation distance of the current learned model compared with each strategy given

as prior information using a priori MDP-CL. . . . . . . . . . . . . . . . . . . . . . . . 82

5.7 Comparison of MDP-CL and a priori MDP-CL in terms of immediate and cumulative

rewards. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

5.8 Model quality of MDP-CL and a priori MDP-CL against the opponent TFT-Bully. . . 84

5.9 Difference of cumulative rewards between incremental MDP-CL and MDP-CL. Total

variation distance of the learned model and the noisy representations. . . . . . . . . . 85

5.10 Cumulative rewards of MDP-CL with and without drift exploration, the opponent is

Bully-TFT switching at round. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

5.11 Cumulative rewards against the Bully-TFT opponent in the iPD using R-max# and R-max. 92

5.12 Immediate and cumulative rewards of R-max# and R-max in the alternating offers

domain. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

5.13 Immediate rewards of MDP-CL, R-max#, WOLF-PHC and R-max#CL in the iPD

domain. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

5.14 Error probabilities of a learning algorithm with no switch detection and DriftER against

an opponent that changes between two strategies. . . . . . . . . . . . . . . . . . . . . 99

5.15 Switch detection with different parameters of DriftER against a non-stationary opponent. 100

5.16 Error rate of TacTex-WM and MDP-CL while comparing with DriftER. . . . . . . . . 102

5.17 Profits of TacTex-WM, MDP-CL and DriftER against the non-stationary opponent in

a PowerTAC competition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

5.18 Rewards obtained by DriftER, MDP-CL, R-max# and WOLF-PHC in the BoS game

against a non-stationary opponent that uses pure and mixed Nash. . . . . . . . . . . . 106

A.1 Partial representation of the MDP broker in PowerTAC, ovals represent states (timeslots

for future delivery). Arrows represent transition probability and rewards. . . . . . . . 128

C.1 Fraction of updates when learning an opponent model using R-max exploration against

a pure strategy and a mixed strategy in the BoS game. . . . . . . . . . . . . . . . . . . 136


List of Tables

2.1 The bimatrix for the prisoners’ dilemma game. . . . . . . . . . . . . . . . . . . . . . . 18

3.1 A comparison of different algorithms from the state of the art. . . . . . . . . . . . . . 34

4.1 A description of the main parts of the approach using two different representations:

MDP4.5 and MDP-CL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

5.1 The bimatrix game known as the prisoners’ dilemma. . . . . . . . . . . . . . . . . . . . 68

5.2 A bimatrix game representing the battle of the sexes game. . . . . . . . . . . . . . . . 71

5.3 Average rewards for the HM-MDPs agent with std. deviation using different training

sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

5.4 Average rewards for the HM-MDPs agent and for the opponent with standard deviations. 73

5.5 Average rewards of MDP-CL, MDP4.5 and HM-MDPs against non-stationary opponents. 75

5.6 Comparison without exploration, an ε-exploration and a softmax exploration for MDP4.5

and MDP-CL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

5.7 Comparison in terms of average rewards of MDP-CL and a priori MDP-CL against a non-

stationary opponent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

5.8 Average rewards of the proposed algorithms against an opponent with a probability η

of changing to a different strategy at any round in the iPD domain. . . . . . . . . . . 88

5.9 Average rewards and percentage of successful negotiations in the negotiation domain of

the proposed algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

5.10 Comparison of MDP-CL and MDP-CL(DE) while varying the parameter ε (using ε-greedy as drift exploration). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

5.11 Comparison of R-max and R-max# with different τ values in terms of average rewards. 95

5.12 Average rewards of R-max#CL and R-max# with different τ values. . . . . . . . . . 96

5.13 Average timeslots for switch detection, accuracy, and traded energy of the learning

agents against a non-stationary opponent. . . . . . . . . . . . . . . . . . . . . . . . . . 104


5.14 Average profit of the learning agents against non-stationary opponents with and without

noise. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

5.15 Rewards of our proposed approaches and WOLF-PHC against non-stationary opponents

in four random repeated games. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

6.1 A comparison of our proposals in terms of advantages and limitations. . . . . . . . . . 112

B.1 Games used in the experiments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132

B.2 Pure and mixed Nash strategies for the selected games. . . . . . . . . . . . . . . . . . 132

C.1 Average rewards for the HM-MDPs agent and for the opponent with standard deviations. 135

C.2 Performance measures when solving the HM-MDP as a POMDP, average of different

opponents. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

C.3 Performance measures when solving the HM-MDP as a POMDP against different non-

stationary opponents. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

C.4 Average rewards of R-max learning against pure and mixed strategies in the battle of

the sexes game . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137


List of Algorithms

2.1 R-max algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.1 WOLF-PHC algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.1 Proposed framework algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4.2 Incremental MDP-CL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

4.3 R-max# algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

4.4 R-max#CL algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

4.5 DriftER algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63


List of Symbols

Symbol  Description

Pr  Probability
E  Expectation
Γ  A normal form game
T  Number of rounds
I  Number of agents
A  Learning agent
Oi  Opponents
O  Set of stationary strategies used by an opponent
A  Set of attributes used to describe an opponent strategy
B  Set of opponent actions
M  A Markov decision process (MDP)
M  A set of MDPs
S  Set of states
A  Set of actions
R  Reward function
T  Transition function
γ  Discount factor value
Z  Set of observations
Z  Observation function
π  Policy
rpd, spd, tpd, ppd  Values in the prisoner's dilemma matrix
D  A decision tree
w  Parameter that represents window size
TVD  Total variation distance
  Threshold value in MDP4.5 and MDP-CL
ρ  Threshold value in incremental MDP-CL
K  Set of known states in R-max# and R-max#CL
m  Parameter of R-max, R-max# and R-max#CL
τ  Parameter of R-max# and R-max#CL
  Parameter of DriftER
ninit  Parameter of DriftER


Chapter 1

Introduction

Multiagent systems (MAS) are systems that include multiple autonomous entities (agents) capable of independent action. Even though there is no universal definition of agent [Russell et al., 1995], most researchers in the community accept that an agent is an entity that: (i) has sensors that help it perceive the world, and (ii) can take actions that affect the world. An agent can be a simulated software agent, a physical robot, or even a person. An agent can figure out for itself what it needs to do in order to satisfy its design objectives. A multiagent system is one that consists of a number of agents which interact with one another [Wooldridge, 2009].

The MAS community has developed several algorithms for different needs (coordination, communication, learning, etc.), and a number of real applications have begun to appear, for example, a network for distributing electricity [Pipattanasomporn et al., 2009], a coordination mechanism for charging electric cars [Valogianni et al., 2015], and patrolling the Los Angeles airport [Pita et al., 2009]. There are also several applications from economics, such as energy markets [Ketter et al., 2014], auctions [Hu and Wellman, 1998], and negotiation [Jennings et al., 2001].

Despite these advances, most previous work has focused on homogeneous interactions, that is, all the agents have the same internal structure, including goals, domain knowledge, and possible actions [Stone and Veloso, 2000]. However, this assumption is not realistic. Thus, new algorithms must be designed for heterogeneous systems, which range from agents with different goals to agents with different models and actions (this includes the case of human-agent interactions). Moreover, it would be desirable for computer agents to take into account whom they are interacting with. This is related to the concept of strategic interaction: a situation in which agents share an environment, each with its own objective and trying to obtain the best result, even though each agent's outcome may depend on the behavior of the other agents [Risse, 2000].

However, one limitation of current technologies is that many of the underlying algorithms do not take into account against whom they are interacting; these algorithms do not model the other agent.


This is especially troublesome when there are different types of agents in the environment, because it means the algorithms act the same regardless of which agent they face, which is not optimal. Considering that many applications demand different agents to collaborate closely, this is an important problem to solve.

For example, in the medical area, rehabilitation systems can be used directly by the patient (without the therapist) [Sucar et al., 2010]. Even though these systems can produce important benefits for the patient's health, most of them lack the capacity to adapt to and learn the patient's needs. Nowadays, humans have to adapt to the system, so their motivation, and consequently the benefits, are usually curtailed. Humans tend to have changes in mood and motivation; we adapt and learn continuously. If a system were capable of learning these changes of behavior and adapting itself to the situation, the benefits for the person using the system would increase dramatically.

A similar problem, called knowledge tracing [Corbett and Anderson, 1994], occurs in intelligent tutoring systems, where the objective is to monitor changes in the students' knowledge state during the teaching process. In this case, if the system could learn this knowledge state and keep track of possible changes, it could optimize its behavior according to the student.

A recent commercial application is to monitor a user's music preferences in order to propose new tracks that the user may like. By continuously monitoring the listened tracks, an agent can learn a model of the user by means of reinforcement learning [Liebman et al., 2015]. However, even though the agent learns a model of the user in an online fashion, it does not take into account the user's changes of mood, which reduces the quality of its predictions and its usefulness for everyday use.

The previous examples show domains that involve a human and an agent working together in a cooperative way. However, this problem also appears in competitive domains such as security patrolling. In this scenario (and assuming repeated interactions), an intruder would change its strategy constantly in order to prevent a patroller from learning its behavior from past interactions. In conclusion, in both cooperative and competitive scenarios agents must learn how their counterpart is acting and react quickly to changes in its behavior.

1.1 Related work

Regardless of the task's nature, and whether the agent is interacting with another agent or is isolated in its environment, it is reasonable to expect that some conditions will change over time. These changes turn the environment into a non-stationary one and render many techniques futile.

Strategic interactions (where the best outcome for an agent depends on the joint actions of all the agents in the environment) are one of the fundamental problems in MAS. One way to tackle them is to assume that the agent is the only one in the environment, treating the rest of the agents as part of the environment and using single-agent techniques.


Decision theoretic planning is one area that has developed algorithms for a single agent to find optimal decisions in a sequential decision problem [Boutilier et al., 1999]. Recently, some works have tackled the problem of having more than one agent in the environment [Gmytrasiewicz and Doshi, 2005; Seuken and Zilberstein, 2008]. However, these multiagent extensions have limited use due to computational constraints; in some cases the complexity of the algorithms involved is NEXP-complete [Seuken and Zilberstein, 2008].1

Assuming there is only one agent in the environment is incorrect in situations in which the other agents do not have a fixed strategy [Shoham et al., 2007], since the environment then becomes non-stationary and single-agent techniques fail. Thus, another solution is to explicitly represent the other agents and their possible actions. This is what opponent modeling algorithms do [Gmytrasiewicz and Durfee, 2000; Stone, 2007]. Recent approaches in this area have shown the importance of adapting quickly to the opponent's behavior [Sykulski et al., 2010]. They have also provided ideas on how to model adversarial environments [Banerjee and Peng, 2005; Cote et al., 2010] where classical game theory solutions will not succeed.

Game theory studies the decision-making process among intelligent, rational decision makers [Myerson, 1991]. It has proposed models that can prescribe optimal strategies (Nash equilibria) in specific situations. However, it can do so only under the strong assumption that all agents are fully rational, i.e., a worst-case opponent or a perfectly rational teammate (one that always takes the best possible action). In reality, most situations involve agents that cannot be assumed to be perfectly rational, and in such situations the prescribed strategies will fare worse than optimal. The reasons for agents not being rational are diverse: humans make choices depending on different factors that violate axioms of the theory [Kahneman and Tversky, 1979]; in other cases it is impossible to know all the information needed to make the best decision; and computer agents or robots have limited capabilities (sensors, memory, and processing), which limit their reasoning.

Most game-theoretic models are designed for a single decision (one-shot games); however, extensions to repeated games and stochastic games overcome this limitation. Some approaches [Abdallah and Lesser, 2008; Conitzer and Sandholm, 2006] are designed to converge to a Nash equilibrium, while others switch between different game-theoretic strategies2 depending on the context [Crandall and Goodrich, 2011; Powers and Shoham, 2005]. Learning approaches have also been developed for repeated games [Brown, 1951]. Their limitation is that they assume the counterpart will use a stationary strategy during the complete interaction, which means that the rule for choosing an action is the same in every stage3 [Shoham and Leyton-Brown, 2008].

1 NEXP is the set of decision problems that can be solved by a non-deterministic Turing machine using time O(2^p(n)) for some polynomial p(n), and unlimited space.
2 Nash equilibrium and minimax strategies are examples of these strategies; they are described in Section 2.3.
3 Note that this does not imply that the action chosen in each stage will be the same.


In a different area, the machine learning community has developed a line of research on learning with changes over time: concept drift [Gama et al., 2014; Widmer and Kubat, 1996]. However, algorithms in this area cannot be used directly in multiagent scenarios. Approaches from reinforcement learning [Sutton and Barto, 1998] range from the basic Q-learning algorithm to its various multiagent extensions [Littman, 1994; Tesauro, 2003]. Another group of algorithms has been proposed with the objective of converging in self-play [Bowling, 2004; Bowling and Veloso, 2002]. In particular, WOLF-PHC is a variant of Q-learning designed to learn against a non-stationary opponent that slowly changes its behavior; the authors propose a variable learning rate, so that the agent learns quickly when losing and cautiously when winning. However, the approach had never been tested against switching opponents until this thesis. As we show in Sections 5.4.2 and 5.4.4, WOLF-PHC is not capable of adapting quickly to all opponent switches.

Other approaches, such as R-max [Brafman and Tennenholtz, 2003], address the problem of exploring efficiently and provide theoretical guarantees; however, their main limitation is that they do not handle switching opponents. A different approach, which is in fact designed for non-stationary opponents, is presented in [Choi et al., 1999]. The authors proposed an extension of MDPs for non-stationary environments called hidden-mode MDPs. This model needs an offline training phase and requires solving a partially observable MDP, which is intractable in general (PSPACE-complete) [Littman, 1996].

Recent approaches have proposed algorithms for "fast learning" in repeated and stochastic games [Elidrisi et al., 2012, 2014]. However, experiments were performed only on small problems, since the algorithms show an exponential increase in the number of hypotheses (in the size of the observation history), which may limit their use in larger domains. Moreover, these approaches do not employ exploration mechanisms for detecting opponent switches; as a result, they obtain longer detection times and suboptimal rewards.

In summary, there are different approaches that can be used only for single decisions (one-shot games) [Camerer et al., 2004a; Costa Gomes et al., 2001; Koller and Milch, 2001]. Game theory work focuses on finding Nash equilibria and on convergence in self-play (which does not guarantee optimal rewards) [Abdallah and Lesser, 2008; Bowling, 2004; Bowling and Veloso, 2002; Conitzer and Sandholm, 2006]. Planning algorithms are computationally intractable for larger multiagent domains [Gmytrasiewicz and Doshi, 2005; Seuken and Zilberstein, 2008]. Most learning approaches assume stationarity of the opponent [Brown, 1951], and those that deal with non-stationarity are either not designed for multiagent domains or need an offline training phase [Choi et al., 1999]. Recent approaches do not use exploration mechanisms for detecting switches.


1.2 Challenges

The proposed research aims to tackle most of the problems mentioned above; in particular, it focuses on learning against non-stationary opponents that use different stationary strategies and switch among them during the interaction. Dealing with non-stationary opponents involves three different aspects:

• Learning a model of the opponent. This model provides information on how the opponent behaves under different circumstances. Note that we do not have any model prior to the interaction with the opponent.

• Computing a policy (a plan to act) against the opponent. The objective of the learning agent is to maximize its rewards throughout the interaction.

• Detecting switches in the opponent's strategy. The opponent has a set of strategies and can switch among them during the interaction. An optimal policy against one strategy will be suboptimal if the opponent changes to a different strategy. Thus, the opponent model and the policy against it must be adapted.

Throughout this thesis we refer to the other agents in the environment as opponents, independently of the domain. However, our proposed methods can also be applied in collaborative systems, since the objective of our agent is to maximize its own rewards; our agent is thus a self-interested agent. Moreover, our methods do not need to know the reward function of the other agents (which is often complicated to obtain).

1.2.1 Problem statement

The problem setting is the following. One learning agent A and one or more opponents Oi share the environment. The agents take one action each (simultaneously) in a sequence of rounds/timesteps/timeslots.4 They all obtain a reward5 r at each round that depends on the actions of all agents. The objective of agent A is to maximize its cumulative reward over the entire interaction. Agent A observes its own reward at the end of each round, but not those of its opponents. Agent A does not have any initial policy or model of how the Oi act in the environment. Each Oi has a set Oi of possible stationary strategies to choose from and can switch from one to another at any round of the interaction. A strategy defines a probability distribution over actions given a history of interactions.

4 These terms will be used interchangeably throughout this document.
5 The interpretation of the reward depends on the domain; it could represent, for example, money. However, it is generally used as a value without units.
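To make the setting concrete, the following minimal Python sketch (an illustration only; the strategy names, payoff values, and the placeholder policy are assumptions and not part of the formal setting) shows the repeated interaction loop described above: both sides act simultaneously at each round, the learning agent observes only its own reward and the opponent's action, and the opponent may switch its stationary strategy at any round.

```python
import random

T = 1000  # number of rounds in the repeated interaction

def payoff(my_action, opp_action):
    """Reward of the learning agent for a joint action (illustrative prisoner's dilemma values)."""
    table = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 4, ("D", "D"): 1}
    return table[(my_action, opp_action)]

# Two illustrative stationary opponent strategies: Tit-for-Tat and Bully (always defect).
def tit_for_tat(history):
    return "C" if not history else history[-1][0]   # copy the learning agent's last action

def bully(history):
    return "D"

opponent_strategy = tit_for_tat
history = []            # list of joint actions (agent_action, opponent_action)
cumulative_reward = 0

for t in range(T):
    if t == T // 2:                        # the opponent may switch at any round;
        opponent_strategy = bully          # here it switches once, halfway through

    my_action = random.choice(["C", "D"])  # placeholder for the learning agent's policy
    opp_action = opponent_strategy(history)

    reward = payoff(my_action, opp_action)  # agent A observes only its own reward
    cumulative_reward += reward             # objective: maximize this sum over all rounds
    history.append((my_action, opp_action))
```

The proposals in this thesis replace the random placeholder policy with a learned model of the opponent and a policy computed against that model, together with a mechanism for detecting the switch.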


1.2.2 Research questions

We formulate the following questions:

1. How should we model a non-stationary agent so that a model can be learned in few interactions?

2. How to efficiently detect strategy switches in other agents?

3. How to learn a model of an opponent based on few interactions to derive an optimal policy for

interacting with other agents?

1.3 Objectives

This thesis has the following main objective.

1.3.1 Main objective

Develop algorithms for learning agent models accurately and in few interactions against non-stationary

opponents, along with planning algorithms that use the learned model to compute a strategy so that

interaction with the other agents is as profitable as possible.

1.3.2 Specific objectives

1. Define a model of an opponent (agent).

2. Define a suitable measure of distance between models of agents.

3. Develop an algorithm for learning models of non-stationary agents.

4. Identify a suitable planning algorithm to be used in conjunction with the learning algorithm.

5. Integrate the learning algorithm and the planning algorithm.

6. Test the proposed algorithm in a real-world application.

1.4 Contributions

The contributions of this thesis in the area of multiagent learning are listed below:

• A framework [Hernandez-Leal et al., 2014a] with two instantiations: MDP-CL and MDP4.5

[Hernandez-Leal et al., 2013a,b,c] which are designed to learn fast against non-stationary oppo-

nents in repeated games.


• Two extensions of MDP-CL [Hernandez-Leal et al., 2014b]. A priori MDP-CL assumes knowledge of the set of possible strategies used by the opponent and detects which one is in use while still checking for switches. Incremental MDP-CL keeps a history of learned models; once it detects a switch, it checks whether the new behavior matches any of the previous models or is a new one.

• A new type of exploration, called drift exploration, designed to detect switches by non-stationary opponents that would otherwise pass unnoticed.

• R-max# algorithm [Hernandez-Leal et al., 2014c], which performs an efficient drift exploration.

The algorithm is inspired by R-max [Brafman and Tennenholtz, 2003] but it keeps learning a

model continuously by relearning state-action pairs that have not been updated recently. R-

max# provides theoretical guarantees for obtaining optimal rewards under certain assumptions.

• The DriftER algorithm [Hernandez-Leal et al., 2015a], which detects switches by tracking the error rate of the learned model. It provides theoretical guarantees for switch detection with high probability.

1.5 Thesis summary

In Chapter 2, the models and concepts from planning, machine learning and game theory which are

related to this thesis are described.

In Chapter 3, the state of the art in multiagent decision theoretic planning, non-stationary rein-

forcement learning, repeated games in game theory and recent approaches in opponent modeling are

analyzed.

In Chapter 4, we present the main contributions of this thesis: different proposals for acting against non-stationary opponents. The thesis first proposes a framework for fast learning against switching opponents in repeated games. The framework learns opponent models from a series of interactions. Two different implementations of the framework were evaluated: one uses decision trees (MDP4.5) and the other uses MDPs (MDP-CL). The limitation of MDP4.5 is that decision trees are not a good model for handling stochasticity. The limitation of MDP-CL is that it fails to detect some types of switches.

Next, we consider what happens when prior information is available when facing non-stationary opponents. We propose a priori MDP-CL, which assumes the set of models used by the opponent is known from the start. The second extension is incremental MDP-CL, which learns new models from the history of interactions but does not discard them once it detects a switch; in this way, it keeps a record in case the opponent reuses a previous strategy. The limitation of both extensions is that they do not provide theoretical guarantees of switch detection.


Then we argue that classic exploration strategies (e.g., ε-greedy, softmax), which tend to decrease their exploration rate over time, are not sufficient against non-stationary opponents [Cote et al., 2010]. We take recent algorithms that perform efficient exploration (in terms of the number of suboptimal decisions made during learning [Brafman and Tennenholtz, 2003]) as a stepping stone to derive a new exploration strategy against non-stationary opponent strategies. We propose a new adversarial drift exploration, which efficiently explores the state space while detecting regions of the environment that have changed. We present, first, drift exploration as a strategy for switch detection and, second, a new algorithm called R-max# for learning and planning against non-stationary opponents. R-max# makes efficient use of exploration experiences, which results in rapid adaptation and efficient drift exploration, to deal with the non-stationary nature of the opponent's behavior.

Lastly, we present DriftER, an algorithm for detecting switches in opponent strategies by tracking the error rate of the learned model. DriftER provides a theoretical guarantee that a switch will be detected with high probability.

In Chapter 5 we present experiments in five domains: the iterated prisoner's dilemma, the multiagent prisoner's dilemma, the alternating-offers bargaining protocol, double auctions in energy markets and a general setting involving game-theoretic games and strategies. Experiments and results are discussed for each of our proposals in terms of the quality of the learned models and the average rewards obtained. Comparisons were performed against different state-of-the-art approaches.

Finally, in Chapter 6 we present the conclusions of this thesis with a summary of the contributions.

We enumerate some open questions that are left as future research and we conclude with the list of

derived publications.


Chapter 2

Preliminaries

The proposed thesis lies at the intersection of different areas: decision theoretic planning, game theory and machine learning. In this chapter, we present the models, concepts and algorithms of each area that are relevant for this work.

2.1 Decision theoretic planning

Decision theoretic planning is the study of sequential decisions under uncertainty. Many real problems depend on several decisions over time; for example, in a negotiation against an opponent, obtaining the best reward requires a sequence of actions. Such problems can be solved with models for sequential decision making. Two important models are Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs).

2.1.1 Markov decision process

A Markov decision process [Puterman, 1994] is a model for obtaining optimal decisions in an environment with a single agent, assuming the agent has perfect sensors. An MDP can be seen as a model of an agent interacting with the world. This process is depicted in Figure 2.1: the agent takes as input a state s of the world and generates as output an action a that affects the world. A transition function T describes how an action affects the environment in a given state. The component Z in the figure represents the agent's perception function, which transforms the state s into a perception z. In an MDP it is assumed there is no uncertainty about where the agent is. This implies that the agent has full and perfect perception capabilities and knows the true state of the environment (what it perceives is the real state, z = s); the next section presents a model where z ≠ s. The component R is the reward function; rewards help the agent to know which actions and states are good and which are bad [Littman, 1996].



Figure 2.1: Interaction of an agent with its environment. The agent performs an action a which affects the environment according to a function T, producing the state s. The agent perceives an observation z about the environment (given by a function Z) and obtains a reward r (given by a function R).


Figure 2.2: A Markov decision process (MDP) with four states S0, S1, S2, S3 and two actions a1, a2. The

arrows denote the tuple: action, transition probability and reward.

We first define a Markovian process and then an MDP.

Definition 2.1 (Markovian process). A stochastic process in which the transition probabilities depend

only on the current state.

Definition 2.2 (Markov decision process). An MDP is defined by the tuple ⟨S, A, R, T⟩, where S represents the world divided into a finite set of possible states; A represents a finite set of available actions; T : S × A → Δ(S), called the transition function, associates with each state and action a probability distribution over the possible successor states (Δ(S) denotes the set of all probability distributions over S). Thus, for each s, s' ∈ S and a ∈ A, the function T determines the probability of a transition from state s to state s' after executing action a. R : S × A → ℝ is the reward function, defining the immediate reward that an agent receives for being in state s and executing action a.

A common assumption about MDPs is that they are stationary, i.e., that the transition probabilities do not change over time. An example of an MDP with 4 states and 2 actions is depicted in Figure 2.2. Ovals represent states of the environment, and each arrow has a triplet (a_n, p, r) representing the action, the transition probability and the reward, respectively.

Solving an MDP yields a policy π : S → A, which is a mapping from states to actions. An


optimal policy π* is one that guarantees the maximum expected reward. There are different techniques for solving MDPs; one of the most common is the value iteration algorithm [Bellman, 1957]. The complexity of solving an MDP depends on the method, but several methods have been shown to be in P [Littman et al., 1995].
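As an illustration of how value iteration computes such a policy, the following minimal Python sketch solves a small MDP. The two-state MDP, discount factor and stopping threshold are our own illustrative choices (this is not the MDP of Figure 2.2).

# Minimal value iteration sketch on a small illustrative MDP.
# T[s][a] is a list of (probability, next_state, reward) triplets, like the arrows in Figure 2.2.
T = {
    "S0": {"a1": [(1.0, "S1", 5)], "a2": [(0.8, "S0", 1), (0.2, "S1", 1)]},
    "S1": {"a1": [(1.0, "S0", 10)], "a2": [(1.0, "S1", -1)]},
}
gamma, theta = 0.95, 1e-6            # discount factor and convergence threshold

V = {s: 0.0 for s in T}              # value function, initialized to zero
while True:
    delta = 0.0
    for s in T:
        q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a]) for a in T[s]}
        best = max(q.values())
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < theta:                # stop when values barely change
        break

policy = {s: max(T[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in T[s][a]))
          for s in T}
print(V, policy)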

The MDP assumption of perfect perception can be excessive in some domains. For example, a robot's sensors are not perfect, which limits its capabilities; in competitive domains, agents may not be able to observe the complete scenario. Thus, MDPs may not be the best choice for modeling such domains.

2.1.2 Partially observable Markov decision process

A POMDP [Kaelbling et al., 1998; Monahan, 1982] is a partially observable MDP; this means that the state of the agent is not known with certainty, and instead there is a probability distribution over the possible states. The model is similar to an MDP, and solving a POMDP also yields a policy, with the difference that this policy is now a mapping from probability distributions over states to actions.

A POMDP extends an MDP by adding:

• Observations Z - a finite set of observations of the state (which can be seen as responses, diagnoses, perceptions or views). In MDPs, the agent has full knowledge of the system state; therefore, Z = S. In partially observable environments, observations are only probabilistically dependent on the underlying environment state. Determining which state the agent is in becomes problematic, because the same observation can be obtained in different states.

• Observation function Z - captures the relationship between the state and the observations (and can be action dependent). It gives the probability that observation z' will be recorded at time t + 1 after the agent performs action a (at time t) and lands in state s' (at time t + 1):

Pr(Z_{t+1} = z' | S_{t+1} = s', A_t = a)    (2.1)

The observation function is also assumed to be Markovian and stationary.

Definition 2.3 (Partially observable Markov decision process). A POMDP is defined by the tuple ⟨S, A, Z, Z, T, R⟩ where: S is the set of states; A is the set of actions; Z is the set of observations; T : S × A → Δ(S), with T(s, a, s') = P(s'|s, a), is the state-transition function, giving for each world state and agent action a probability distribution over world states; Z : S × A → Δ(Z), with Z(s', a, z') = P(z'|s', a), is the observation function, i.e., the probability of making observation z' given that the agent took action a and landed in state s'; and R : S × A → ℝ is the reward function, giving the expected immediate reward gained by the agent for taking each action in each state.
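Because the state is not observed directly, a POMDP agent maintains a belief b(s), a probability distribution over S, and updates it after each action a and observation z with a Bayes filter. The following sketch illustrates that standard update; the function-based representation of T and Z is an assumption made only for this example.

def belief_update(b, a, z, states, T, Z):
    """One Bayes-filter step: b'(s') is proportional to Z(s', a, z) * sum_s T(s, a, s') * b(s)."""
    new_b = {}
    for s_next in states:
        new_b[s_next] = Z(s_next, a, z) * sum(T(s, a, s_next) * b[s] for s in states)
    norm = sum(new_b.values())
    if norm == 0.0:
        raise ValueError("observation has zero probability under the current belief")
    return {s: p / norm for s, p in new_b.items()}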


Solving a POMDP over a finite horizon (number of steps) is PSPACE-complete, and some variations have been proved NP-complete [Papadimitriou and Tsitsiklis, 1987], making POMDPs computationally more complex than MDPs. Thus, a number of approximate techniques for solving POMDPs have been developed [Monahan, 1982; Pineau et al., 2006].

2.2 Machine learning

The field of machine learning is concerned with the question of how to construct computer programs

that automatically improve with experience. Machine learning has two important subareas concerning

this thesis: supervised and reinforcement learning.

2.2.1 Supervised learning: classification

The objective of supervised learning is to infer a function from a set of labeled data [Mohri et al., 2012]. One important task of supervised learning is classification, where usually the data are known before the learning task starts, which is called offline learning. The data consist of a set of examples, each containing a feature vector x_i and a label (class) y_i. A supervised learning algorithm produces a function g : X → Y, with X and Y the input and output spaces, respectively. One widely used technique for classification is learning based on decision trees.

Decision trees

Decision tree learning is a method for approximating discrete-valued target functions, in which the

learned function is represented by a decision tree. Learned trees can also be re-represented as sets of

if-then rules to improve human readability [Mitchell, 1997].

The objective of a decision tree is to specify a model that predicts the value of a certain variable,

called class, given that some input information is provided.

Definition 2.4 (Decision tree). A decision tree D is composed of nodes, which represent tests to be carried out on variables known as attributes. Each test has different outcomes, which are the branches of the node. An outcome can be of two types: a leaf, which provides a value for the class (predicted variable) and represents a terminal node of the tree, or another test.

One of the best-known algorithms for learning decision trees from a batch of data is C4.5 [Quinlan, 1993].

Trees are useful for representing behavioral strategies. For example, Figure 2.3 depicts a decision tree with one decision node and two leaves. This tree specifies the behavior of an agent: when the opponent's last action is a1, it responds with action b1 100% of the time; when the last action is a2, it responds with b2 with 100% accuracy (these values will be used later to compute transition probabilities).

Figure 2.3: A decision tree that models an opponent strategy. It contains one decision node (LearnAgent last action) and two leaves that correspond to the actions of the opponent, b1 and b2; each leaf has a number of correctly classified/misclassified instances.
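To illustrate how such an opponent model could be learned from interaction data, the sketch below fits a scikit-learn decision tree that maps the learning agent's last action to the opponent's response, as in Figure 2.3. The numeric encoding, the synthetic history and the choice of scikit-learn are assumptions made only for this example.

# Illustrative only: learn a tree like the one in Figure 2.3 from a synthetic interaction history.
from sklearn.tree import DecisionTreeClassifier

ACTIONS = {"a1": 0, "a2": 1}     # learning agent's actions (the single attribute)
REPLIES = {"b1": 0, "b2": 1}     # opponent's responses (the class)

history = [("a1", "b1"), ("a2", "b2"), ("a1", "b1"), ("a2", "b2"), ("a1", "b1")]
X = [[ACTIONS[my_last]] for my_last, _ in history]   # feature vector: our last action
y = [REPLIES[reply] for _, reply in history]         # class: opponent's reply

tree = DecisionTreeClassifier()
tree.fit(X, y)
print(tree.predict([[ACTIONS["a2"]]]))               # -> [1], i.e., the opponent is predicted to play b2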

2.2.2 Reinforcement learning

Reinforcement learning (RL) [Sutton and Barto, 1998] addresses the question of how an autonomous

agent can learn to choose optimal actions to achieve its goals. Section 2.1 presented MDPs and how

to solve them given a complete set of states, actions, rewards and transitions. However, such a complete description may be difficult to obtain in several domains; for this reason, reinforcement learning algorithms learn optimal policies from experience without having a complete description of the MDP.

A reinforcement learning agent interacts with its environment in discrete time steps. At each

time, the agent chooses an action from the set of actions available, which is subsequently sent to the

environment. The environment moves to a new state and the reward associated with the transition is

determined (see Figure 2.1). The goal of a reinforcement learning agent is to collect as much reward

as possible. In this type of learning the learner is not told which actions to take, but instead must

discover which actions yield the best reward by trying them.

Q-learning [Watkins, 1989] is one well-known algorithm for RL. It is generally used in stationary, single-agent, fully observable environments. In its general form, a Q-learning agent can be in any state s ∈ S and can choose an action a ∈ A. It keeps a data structure Q(s, a) that represents its estimate of the expected payoff for starting in state s and taking action a. Each entry Q(s, a) is an estimate of the corresponding optimal Q* function. Each time the agent makes a transition from a state s to a state s' via action a and receives payoff r, the Q table is updated according to:

Q(s, a) = α(r + γ max_b Q(s', b)) + (1 − α)Q(s, a)    (2.2)

with α the learning rate and γ ∈ [0, 1] the discount factor. Q-learning will converge


toward the true Q function if each state-action pair is visited infinitely often; that is, Q(s, a) converges to the true value given sufficiently many visits.
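A tabular implementation of update (2.2) takes only a few lines. The sketch below assumes a generic environment object exposing reset() and step(action) (an interface assumed for the example, not one defined in this thesis) and uses ε-greedy exploration.

import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning using the update of Eq. (2.2); env.reset/env.step are assumed interfaces."""
    Q = defaultdict(float)                                   # Q[(s, a)], initialized to 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:                    # epsilon-greedy exploration
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            target = r + gamma * max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] = alpha * target + (1 - alpha) * Q[(s, a)]   # Eq. (2.2)
            s = s_next
    return Q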

By using Q-learning it is possible to learn an optimal policy without knowing T or R beforehand,

and even without learning these functions [Littman, 1996]. For this reason, this type of learning is

known as model-free RL. In contrast, in model-based RL the agent attempts to learn a model of its environment. Having such a model allows the agent to predict the consequences of actions before they are taken, as in Dyna-Q [Sutton and Barto, 1998].

Most RL algorithms consider the environment to be stationary; an exception is the hidden-mode MDP, which will be used as a comparison in the experiments of Section 5.2.

Hidden-mode MDPs

Hidden-mode Markov decision processes (HM-MDPs) [Choi et al., 1999] are a reinforcement learning approach designed for non-stationary environments. They assume the environment can be represented by a small number of modes. Each mode is a stationary environment, which has different dynamics and needs a different policy. It is assumed that at each time step there is only one active mode. The modes are hidden, which means they cannot be directly observed; they can only be estimated from past observations. Moreover, transitions between modes are stochastic events. Each mode is modeled as an MDP. The different MDPs, along with the mode transition probabilities, form the HM-MDP. HM-MDPs are a special case of POMDPs; therefore, it is always possible to reformulate the former in the form of the latter [Choi et al., 2001], which is useful since a number of methods for solving POMDPs are available [Cassandra, 1998].

Definition 2.5 (Hidden-mode Markov decision process). An HM-MDP is an 8-tuple ⟨Q, S, A, X, Y, R, Π, Ψ⟩, where Q, S and A represent the sets of modes, states and actions, respectively; the mode transition function X maps mode m to mode n with a fixed probability x_mn; the state transition function Y defines the transition probability y_m(s, a, s') from state s to state s' given mode m and action a; the stochastic reward function R returns rewards with mean value r_m(s, a); and Π and Ψ represent the prior probabilities of the modes and the states, respectively.

Figure 2.4 depicts an example of an HM-MDP with 3 modes and 4 states. Each of the three large circles represents a mode; shaded circles inside the modes represent states. Thick arrows indicate stochastic transitions between modes; for example, the arrow labeled x_mn represents the probability of transitioning from mode m to mode n. Thinner arrows represent state-action-next-state probabilities; for example, the arrow labeled y_m(s, a, s') represents the probability of transitioning to state s' when taking action a in state s in mode m.



Figure 2.4: An example of an HM-MDP with 3 modes (large circles) and 4 states (smaller shaded circles). The value x_mn represents a transition probability between modes m and n, and y_m(s, a, s') represents a state transition probability in mode m.
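A direct way to encode Definition 2.5 in code is to store one MDP per mode together with the mode-transition matrix X. The sketch below is only a minimal illustrative container (not the HM-MDP implementation used for the comparisons in Section 5.2), and it assumes each per-mode MDP exposes a step(s, a) method.

import random
from dataclasses import dataclass

@dataclass
class HMMDP:
    modes: list            # one MDP per mode, each exposing step(s, a) -> (s_next, reward)
    X: list                # X[m][n]: probability of switching from mode m to mode n
    mode: int = 0          # current (hidden) mode, not observable by the agent

    def step(self, s, a):
        s_next, r = self.modes[self.mode].step(s, a)   # dynamics of the active mode
        # the mode evolves stochastically according to X
        self.mode = random.choices(range(len(self.modes)), weights=self.X[self.mode])[0]
        return s_next, r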

2.2.3 Exploration vs. exploitation

One major difference between reinforcement learning and supervised learning is that an RL agent must explicitly explore its environment.

The simplest possible reinforcement-learning problem is known as the k-armed bandit problem [Robbins, 1985]: the agent is in a room with k gambling machines (each called a “one-armed bandit”). At each time step the agent pulls the arm of one of the machines and receives a reward. The agent is permitted a fixed number of pulls, and its purpose is to maximize its total reward over a sequence of trials. Since each arm is assumed to have a different distribution of rewards, the goal is to find the arm with the best expected return as early as possible, and then to keep gambling using that arm. This problem illustrates the fundamental tradeoff between exploration and exploitation. The agent might believe that a particular arm has a fairly high payoff probability; the questions are: 1) should it choose that arm all the time (exploit)? or 2) should it choose another arm, about which it has less information (explore)?

Sample complexity of exploration

In order to answer those questions, we need to define some related terms like the sample complexity.

Loosely speaking, sample complexity is the number of examples needed for the estimate of a target

function to be within a given error rate. Kakade [2003] studies the sample complexity as a function of

the sampling model.1 In particular, the sample complexity is considered to be the number of calls to

the sampling model required to satisfy a specified performance criterion, and we are interested in how this scales with the relevant problem-dependent parameters. In the reinforcement learning setting, these parameters are the size of the state space, the size of the action space, the number of decision steps and the variance of the reward function.

1. For example, the environment itself is the sampling model: an agent must follow one unbroken chain of experience for some number of decision epochs (time steps), in each of which a state is observed and an action is taken, so the number of decision epochs is equivalent to the amount of observed experience.

R-max [Brafman and Tennenholtz, 2003] is a well-known model-based RL algorithm (presented in Algorithm 2.1), which has an efficient built-in mechanism for resolving the exploration-exploitation dilemma. It uses an MDP to model the environment, which is initialized optimistically by assuming all actions return the maximum possible reward, r_max. With each experience of the form (s, a, s', r), R-max updates its model. Since there is only a polynomial number of parameters to learn, as long as learning is done efficiently we can ensure that the agent spends a polynomial number of steps exploring, and the rest of the time will be spent exploiting (Theorem 2.2.1). Thus, the policy efficiently leads the agent to less-known state-action pairs or exploits known ones with high utility. R-max promotes an efficient sample complexity of exploration. However, R-max alone will not work when the environment is non-stationary (e.g., when there are strategy switches during the interaction).

Algorithm 2.1: R-max algorithm
Input: state set S, fictitious state s0, action set A, threshold parameter m, value r_max, number of rounds T
Function: SolveMDP(), which receives a tuple corresponding to an MDP and obtains a policy

1   S = S ∪ {s0}
2   for all (s, a, s'): r(s, a) = n(s, a) = n(s, a, s') = 0
3   for all (s, a, s'): T(s, a, s') = 1
4   for all (s, a): R(s, a) = r_max
5   for t = 1, ..., T do
6       Observe state s
7       Execute action a using the current policy
8       Observe reward and next state s'
9       if n(s, a) < m then
10          Increment counters n(s, a) and n(s, a, s')
11          Update reward r(s, a)
12          if n(s, a) == m then
13              R(s, a) = r(s, a)/m
14              for s' ∈ S do
15                  T(s, a, s') = n(s, a, s')/m
16              end
17              SolveMDP(S, A, T, R)
18          end
19      end
20  end
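The core of Algorithm 2.1 is the optimistic model that is updated until a state-action pair has been visited m times. The following Python sketch captures that model-update logic under our own naming; the MDP solver (solve_mdp) is left abstract, and this is an illustration rather than the exact implementation used later in the thesis.

from collections import defaultdict

class RMaxModel:
    """Sketch of R-max's optimistic model learning (cf. Algorithm 2.1)."""

    def __init__(self, states, actions, m, r_max):
        self.S, self.A, self.m = list(states), list(actions), m
        self.n_sa = defaultdict(int)         # visit counts n(s, a)
        self.n_sas = defaultdict(int)        # transition counts n(s, a, s')
        self.r_sum = defaultdict(float)      # accumulated rewards r(s, a)
        self.R = {(s, a): r_max for s in self.S for a in self.A}   # optimistic initialization
        self.T = {}                          # learned transition probabilities

    def update(self, s, a, s_next, r, solve_mdp):
        if self.n_sa[(s, a)] >= self.m:
            return                           # the pair is already "known"; keep its learned model
        self.n_sa[(s, a)] += 1
        self.n_sas[(s, a, s_next)] += 1
        self.r_sum[(s, a)] += r
        if self.n_sa[(s, a)] == self.m:      # the pair just became known: fix its model and re-plan
            self.R[(s, a)] = self.r_sum[(s, a)] / self.m
            for s2 in self.S:
                self.T[(s, a, s2)] = self.n_sas[(s, a, s2)] / self.m
            solve_mdp(self.S, self.A, self.T, self.R)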

Below we present the main theorem for R-max, which guarantees near-optimal expected rewards.

Definition 2.6 (Approximation Condition). If R-max uses the set of states K and an MDP M_K, then for the optimal policy π for M_K it is assumed that, for all states s and times t < T,

U_{π,t,M_K}(s) > U*_{t,M_K}(s) − ε    (2.3)

The assumption states that the policy π that R-max derives from M_K is near-optimal in M_K.

Theorem 2.2.1 (Kakade, 2003). Let M = ⟨S, A, T, R⟩ be an L-epoch MDP. If c is an L-path sampled from Pr(·|R-MAX, M, s_0) and the approximation condition holds, then, with probability greater than 1 − δ, the R-MAX algorithm guarantees an expected return of U*(c_t) − 2ε within O((m|S||A|T/ε) log(|S||A|/δ)) timesteps t ≤ L.

The high-level idea of the proof is as follows. By the approximation condition we know that the learned MDP M_K is a good approximation of the real MDP, and the policy used by the algorithm is 2ε-near-optimal in M. By the pigeonhole principle, successful exploration can only occur m|S||A| times. Hence, whenever the escape probability (the probability of reaching an unknown state) is “large”, successful exploration occurs; since this can happen only a bounded number of times, exploration must “quickly” cease and exploitation must occur [Kakade, 2003].

Kakade's results will be used as a basis to provide a theoretical guarantee of near-optimal rewards for our R-max# approach, presented in Chapter 4.

2.3 Game theory

Game theory [Fudenberg and Tirole, 1991] is the area that studies decision problems in which several agents interact. The terminology in this area is different: agents are called players, and a single interaction between players is represented as a game.

The most common way of presenting a game is a matrix that denotes the utility obtained by each agent; this is the normal form.

Definition 2.7 (Normal-form game). A (finite, I-person) normal-form game Γ is a tuple ⟨N, A, u⟩, where:

N is a finite set of I players, indexed by i;

A = A_1 × ··· × A_I, where A_i is a finite set of actions available to player i; each vector a = (a_1, ..., a_I) ∈ A is called an action profile;

u = (u_1, ..., u_I), where u_i : A → ℝ is a real-valued utility or payoff function for player i.

For example, the game presented in Table 2.1 is given as a two-dimensional table, called a bimatrix. In general, each row corresponds to a possible action for player 1, each column corresponds to a possible action for player 2, and each cell corresponds to one possible outcome. Each player's utility for an outcome is written in the cell corresponding to that outcome, with player 1's utility listed first.


Table 2.1: The bimatrix for the prisoner's dilemma game. Each cell gives the utilities for the agents (the first for agent A, the second for agent O); r_pd, t_pd, s_pd and p_pd are numerical values that must satisfy t_pd > r_pd > p_pd > s_pd and 2r_pd > t_pd + s_pd.

                              Agent O
                        cooperate        defect
Agent A   cooperate     r_pd, r_pd       s_pd, t_pd
          defect        t_pd, s_pd       p_pd, p_pd

In the example, each player has two actions, {cooperate, defect}. This is the well-known prisoner's dilemma (PD), where the conditions t_pd > r_pd > p_pd > s_pd and 2r_pd > t_pd + s_pd must hold (the latter prevents alternating cooperation and defection from giving a higher payoff than mutual cooperation). When both players cooperate they both obtain the reward r_pd. If both defect, they get the punishment payoff p_pd. A player who cooperates with a defector receives the sucker's payoff s_pd, whereas the defecting player gains the temptation payoff t_pd.

A strategy specifies a method for choosing an action. One kind of strategy is to select a single action and play it; this is a pure strategy.

Definition 2.8 (Mixed strategy). Let ⟨N, A, u⟩ be a normal-form game and, for any set X, let Δ(X) be the set of all probability distributions over X; then the set of mixed strategies for player i is S_i = Δ(A_i).

In general, a mixed strategy specifies a probability distribution over actions.

Definition 2.9 (Best response). Player i's best response to the strategy profile s_{-i} is a mixed strategy s*_i ∈ S_i such that u_i(s*_i, s_{-i}) ≥ u_i(s_i, s_{-i}) for all strategies s_i ∈ S_i, where s_{-i} = (s_1, ..., s_{i-1}, s_{i+1}, ..., s_n) denotes the strategies of all players except i.

Thus, a best response for an agent is the strategy (or strategies) that produces the most favorable outcome for that player, taking the other players' strategies as given. Another common strategy is the minimax strategy.

Definition 2.10 (Minimax strategy). A strategy that maximizes the player's payoff assuming the opponent will act to make this payoff as small as possible.

Definition 2.11 (Security level). The security level is the expected payoff a player can guarantee itself using a minimax strategy.
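To make Definitions 2.9-2.11 concrete, the sketch below computes a pure best response of the row player to a mixed column strategy, and the row player's pure-strategy security level, in a bimatrix game. The payoff numbers form a prisoner's dilemma instance with t = 5, r = 3, p = 1, s = 0, chosen only for illustration.

# payoffs[(row, col)] = (row player's utility, column player's utility)
payoffs = {
    ("cooperate", "cooperate"): (3, 3), ("cooperate", "defect"): (0, 5),
    ("defect", "cooperate"): (5, 0),    ("defect", "defect"): (1, 1),
}
ROW_ACTIONS = COL_ACTIONS = ["cooperate", "defect"]

def best_response(col_mixed):
    """Row player's pure best response to a mixed strategy of the column player."""
    expected = {a: sum(col_mixed[b] * payoffs[(a, b)][0] for b in COL_ACTIONS) for a in ROW_ACTIONS}
    return max(expected, key=expected.get)

def security_level():
    """Best worst-case payoff over the row player's pure strategies (pure maximin value)."""
    return max(min(payoffs[(a, b)][0] for b in COL_ACTIONS) for a in ROW_ACTIONS)

print(best_response({"cooperate": 0.5, "defect": 0.5}))   # -> 'defect'
print(security_level())                                   # -> 1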

In single-agent decision theory, an optimal strategy is one that maximizes the agent's expected payoff in a given environment. In multiagent settings the situation is more complex, and the


notion of an optimal strategy for a given agent is not meaningful since the best strategy depends on

the choices of the other agents. To address this, game theory has identified certain subsets of outcomes, called solution concepts [Shoham and Leyton-Brown, 2008]; one of these is the Nash equilibrium, which is explained next.

2.3.1 Nash equilibrium

Suppose that all players play a fixed strategy profile in a given game; if no player can increase its utility by unilaterally changing its strategy, then the strategies are in Nash equilibrium. Formally, it is defined as follows:

Definition 2.12 (Nash equilibrium [Nash, 1950]). A set of strategies s = (s_1, ..., s_n) is a Nash equilibrium if, for all agents i, s_i is a best response to s_{-i}.

Even though it has been proved that every finite game has a Nash equilibrium, the concept has several limitations. One problem is that there may be multiple equilibria in a game, and deciding which one should be selected is not an easy task [Harsanyi and Selten, 1988].
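Definition 2.12 can be verified mechanically by checking unilateral deviations. The sketch below, using the same illustrative prisoner's dilemma payoffs as before (t = 5, r = 3, p = 1, s = 0), confirms that mutual defection is a pure Nash equilibrium while mutual cooperation is not.

payoffs = {
    ("cooperate", "cooperate"): (3, 3), ("cooperate", "defect"): (0, 5),
    ("defect", "cooperate"): (5, 0),    ("defect", "defect"): (1, 1),
}
ACTIONS = ["cooperate", "defect"]

def is_pure_nash(a_row, a_col):
    """True iff neither player gains by a unilateral pure-strategy deviation."""
    row_ok = all(payoffs[(a_row, a_col)][0] >= payoffs[(d, a_col)][0] for d in ACTIONS)
    col_ok = all(payoffs[(a_row, a_col)][1] >= payoffs[(a_row, d)][1] for d in ACTIONS)
    return row_ok and col_ok

print(is_pure_nash("defect", "defect"))        # True: mutual defection is a Nash equilibrium
print(is_pure_nash("cooperate", "cooperate"))  # False: each player would gain by defecting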

2.3.2 Repeated and stochastic games

All the concepts presented in the previous section were defined for one-shot games (a single interaction); however, it is often the case that more than one decision has to be made, for example when repeating the same game or when facing a set of possible games.

Definition 2.13 (Stochastic game). A stochastic game (also known as a Markov game) is a tuple ⟨Q, N, A, P, R⟩, where: Q is a finite set of games (states); N is a finite set of I players; A = A_1 × ··· × A_I, where A_i is a finite set of actions available to player i; P : Q × A × Q → [0, 1] is the transition probability function, with P(q, a, q') the probability of transitioning from state q to state q' after action profile a; and R = (r_1, ..., r_I), where r_i : Q × A → ℝ is a real-valued payoff function for player i.

In a stochastic game, the agents repeatedly play games from a collection. The particular game played at any given iteration depends probabilistically on the previously played game and on the actions taken by all agents in that game [Shoham and Leyton-Brown, 2008].

Definition 2.14 (Repeated game). A repeated game is a stochastic game in which there is only one

game (called stage game).

Before presenting examples of strategies for repeated and stochastic games, we formally define some concepts.


Definition 2.15 (History). Let h_t = (q_0, a_0, q_1, a_1, ..., a_{t-1}, q_t) denote the history of t stages of a stochastic game, and let H_t be the set of all possible histories of this length.

The set of deterministic strategies is the Cartesian product ∏_{t,H_t} A_i, which requires a choice for each possible history at each point in time. An agent's strategy can consist of any mixture over deterministic strategies. However, there are restricted classes of strategies; for example, requiring that the mixing take place at each history independently gives behavioral strategies.

Definition 2.16 (Behavioral strategy). A behavioral strategy s_i(h_t, a_{ij}) returns the probability of playing action a_{ij} for history h_t.

A Markov strategy restricts a strategy so that, for a given time t, the distribution over actions

depends only on the current state.

Definition 2.17 (Markov strategy). A Markov strategy s_i is a behavioral strategy in which s_i(h_t, a_{ij}) = s_i(h'_t, a_{ij}) if q_t = q'_t, where q_t and q'_t are the final states of h_t and h'_t, respectively.

If we additionally remove the dependency on the time t, we obtain the following.

Definition 2.18 (Stationary strategy). A stationary strategy s_i is a Markov strategy in which s_i(h_{t1}, a_{ij}) = s_i(h'_{t2}, a_{ij}) if q_{t1} = q'_{t2}, where q_{t1} and q'_{t2} are the final states of h_{t1} and h'_{t2}, respectively.

To exemplify a repeated game, recall the prisoner's dilemma presented previously. If we repeat the same game we get the iterated prisoner's dilemma (iPD), which has been the subject of different experiments and for which there are diverse well-known strategies. A successful strategy which won Axelrod's tournament² is called Tit-for-Tat (TFT) [Axelrod and Hamilton, 1981]; it starts by cooperating and then does whatever the opponent did in the previous round: it cooperates if the opponent cooperated and defects if the opponent defected. Another important strategy is called Pavlov, which cooperates if both players took the same action and defects whenever they took different actions in the previous round. Another strategy is called Bully (described in detail in Section 3.4.1), which behaves as a player who always defects in the iPD. The finite state machines describing TFT and Pavlov are depicted in Figure 2.5. It should be noticed that these strategies do not depend on the time index; they are stationary strategies.

Figure 2.5: (a) The automaton that describes the TFT strategy; depending on the opponent's action (c or d) it transitions between the two states C and D. (b) The automaton describing the Pavlov strategy; it consists of four states formed by the last actions of both agents (CC, CD, DC, DD).

2. Robert Axelrod held a tournament of various strategies for the iterated prisoner's dilemma. Strategies were run by computers; in the tournament, programs played games against each other and themselves repeatedly.
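The strategies of Figure 2.5 can be written directly as small functions of the last joint action. The sketch below implements TFT and Pavlov and plays them against each other in the iterated prisoner's dilemma; the loop structure is our own illustrative choice.

def tit_for_tat(my_last, opp_last):
    """Cooperate first, then copy the opponent's previous action."""
    return "C" if opp_last is None else opp_last

def pavlov(my_last, opp_last):
    """Cooperate first; afterwards cooperate iff both players chose the same action last round."""
    if opp_last is None:
        return "C"
    return "C" if my_last == opp_last else "D"

def play_ipd(strategy_a, strategy_b, rounds=10):
    a_last = b_last = None
    history = []
    for _ in range(rounds):
        a = strategy_a(a_last, b_last)   # each strategy sees its own and the opponent's last action
        b = strategy_b(b_last, a_last)
        history.append((a, b))
        a_last, b_last = a, b
    return history

print(play_ipd(tit_for_tat, pavlov))     # both strategies cooperate in every round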

2.3.3 Behavioral game theory

In recent years, some authors have claimed that Nash equilibrium is a solution concept that has

limitations. The reason is that in many experiments people would not follow the actions prescribed by


the theory [Goeree and Holt, 2001; Risse, 2000; Simon, 1955]. Another complaint is that there exist many games in which the Nash equilibrium does not guarantee the maximum utility for all players. Moreover, other limitations are the assumption of rational agents and the possibility of multiple equilibria.

Some experiments [Kahneman and Tversky, 1979] have shown that humans do not always act

rationally (in terms of following the prescribed Nash equilibrium). These interesting conclusions gave

birth to a branch known as behavioral game theory [Camerer, 1997]. The objective of this area is to

obtain more accurate predictions than with classic game theory. To fulfill this objective, a number of social factors, such as altruism, selfishness, reciprocity [Bolton and Ockenfels, 2000] and heuristics [Tversky and Kahneman, 1974], as well as insights from cognitive science, have proven useful for explaining people's play in games [Camerer, 2003].

2.4 Summary of the chapter

In this chapter we reviewed some of the most important concepts and models of decision theoretic planning, machine learning and game theory, which will be relevant for the approaches described in Chapter 4. In the next chapter we review recent work related to this thesis.


Chapter 3

Related Work

This research lies at the intersection of several areas. Figure 3.1 depicts a diagram with the state-of-the-art models of game theory, decision theoretic planning, opponent modeling and machine learning; it provides an overview of this chapter. Next, we review the relevant related work in detail.

Figure 3.1: Work related to this thesis, divided into areas (white boxes) and the most representative models and algorithms for each one (grey boxes).

3.1 Decision theoretic planning

In Section 2.1 the MDP and POMDP models were described. Now we present recent approaches which

try to overcome the limitation of having a single agent in the environment.

3.1.1 Multiagent approaches

An obvious approach is to extend single-agent solutions to multiagent settings while trying to obtain the actions that yield the best utility. In order to solve this problem, models such as decentralized MDPs (DEC-MDPs) and decentralized POMDPs (DEC-POMDPs) [Seuken and Zilberstein, 2008] have been proposed. These models are a generalization of MDPs and POMDPs to multiple cooperative agents. One limitation is that they are only useful when the agents share a common objective, i.e., they share the utility function; furthermore, the solution needs to be computed centrally and then distributed to each agent. Another important limitation is their complexity, which is nondeterministic exponential time (NEXP-complete); it is believed that solving these problems requires doubly exponential time in the worst case [Seuken and Zilberstein, 2008].

Hidden-mode Markov decision processes (HM-MDPs) [Choi et al., 1999], presented in Section 2.2.2, are designed for non-stationary environments. An HM-MDP is a special case of a POMDP. HM-MDPs assume the environment can be represented by a small number of modes. Each mode is a stationary environment (modeled as an MDP), which has different dynamics and needs a different


policy. Experiments were performed contrasting our approaches against HM-MDPs in Section 5.2.

A more general model is the interactive POMDP (I-POMDP) [Gmytrasiewicz and Doshi, 2005]. This model does not assume cooperativeness of the agents. I-POMDPs are called interactive because they consider what an agent knows and believes about what another agent knows and believes [Aumann, 1999]; this means that an agent has a model of how it believes another agent reasons. I-POMDPs extend POMDPs by incorporating models of other agents into the regular state space, thus building an interactive state space. A problem that occurs in I-POMDPs is infinite recursive reasoning: an agent A models another agent B that is in turn modeling A. To solve this problem, a reasoning threshold ℓ is defined, at which the base model cannot be a recursive model. The main limitation of these models is their inherent complexity, since solving one I-POMDP with M models considered at each level and ℓ maximum reasoning levels is equivalent to solving O(M^ℓ) POMDPs [Seuken and Zilberstein, 2008]. Some works have applied these models to real-world applications, such as analyzing money laundering [Ng et al., 2010] and playing a simplified version of chess [Del Giudice et al., 2009]. Recently, Ng et al. [2012] proposed an approach that can learn I-POMDPs online, called Bayes-adaptive I-POMDPs; these models have also been used to model populations (more than a thousand) of agents [Sonu et al., 2015].



3.2 Probabilistic Graphical Models

MDPs and POMDPs are enumerative models, which have their graphical counterparts in the area of probabilistic graphical models (PGMs). The objective of using probabilistic graphical models [Koller and Friedman, 2009] is to exploit their structure in order to find faster solutions than with the enumerative versions [Doshi and Gmytrasiewicz, 2009].

One basic type of PGM is the influence diagram (ID) [Howard and Matheson, 2005; Shachter, 1986], a compact graphical representation of a single decision. An ID is a directed acyclic graph with chance nodes, decision nodes and a utility node. Arcs coming into decision nodes represent the information that will be available when the decision is made; arcs coming into chance nodes represent probabilistic dependence; and arcs coming into the utility node represent what the utility depends on. Figure 3.2 presents an example of an ID: it corresponds to a decision (rectangle) of whether to take an umbrella, with two probabilistic nodes (ovals), Weather and Forecast, and one utility node (diamond).

Figure 3.2: An example of an influence diagram that represents the decision of whether to take an umbrella.

Another model is the dynamic influence diagram (DID), which can be seen as an ID repeated over time: at each step a single decision for a single agent has to be made. This is the graphical form of a POMDP.

Another relevant model for multiagent systems is the interactive dynamic influence diagram (I-DID) [Doshi et al., 2008], a generalization of DIDs to multiple agents. I-DIDs are the graphical counterpart of I-POMDPs. I-DIDs suffer from the curse of dimensionality, where the dimensionality of the planning problem is directly related to the number of states [Kaelbling et al., 1998], and the curse of history, where the number of belief-contingent plans increases exponentially with the planning horizon [Pineau et al., 2006]. Because the state space is interactive, I-DIDs include the models of other agents and, often, the number of candidate models grows exponentially [Doshi et al., 2008].


3.3 Machine learning

Now we review work in two different areas of learning. The first is called concept drift and is related to supervised learning with changing concepts. The second area is reinforcement learning; in Section 2.2.2 we presented the basic Q-learning algorithm, and here we review extensions for multiagent systems.

3.3.1 Concept drift

The machine learning community has developed an area related to non-stationary environments and

online learning, which is called concept drift [Widmer and Kubat, 1996]. The setting is similar to a supervised learning scenario where the relation between the input data and the target variable changes over time [Gama et al., 2014].

In particular, the work in [Gama et al., 2004] studies the problem of learning when the class-probability distribution that generates the examples changes over time. A central idea is the concept of context: a set of contiguous examples where the distribution is stationary. The idea behind the concept drift detection method is to monitor the online error rate of the algorithm. When a new training instance is available, it is classified using the current model. Statistical theory guarantees that, while the distribution is stationary, the error will decrease; when the distribution changes, the error will increase. Therefore, if the error becomes greater than a defined threshold, the concept is considered to have changed and it needs to be relearned. The method was tested on both artificial and real-world datasets. Even though concept drift ideas are related to our problem, they need to be adapted to a multiagent setting. Moreover, the approach does not provide any formal guarantees of context (switch) detection.
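To illustrate the general mechanism (and not the exact statistical test of [Gama et al., 2004]), the sketch below monitors the online error rate of a model over a sliding window and signals a possible concept change when the recent error exceeds a fixed threshold; the window size and threshold are arbitrary choices for the example.

from collections import deque

class ErrorRateDriftDetector:
    """Simplified error-rate-based change detector (illustrative, not the test of Gama et al. [2004])."""

    def __init__(self, window=50, threshold=0.3):
        self.errors = deque(maxlen=window)    # sliding window of 0/1 prediction errors
        self.threshold = threshold

    def add(self, predicted, actual):
        """Record one prediction; return True when a concept change is suspected."""
        self.errors.append(0 if predicted == actual else 1)
        if len(self.errors) < self.errors.maxlen:
            return False                      # not enough evidence yet
        return sum(self.errors) / len(self.errors) > self.threshold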

3.3.2 Reinforcement learning

Reinforcement learning, and in particular the Q-learning algorithm, has been widely studied, and several multiagent extensions have been proposed. We present the most important algorithms in this area; a more extensive survey is presented in [Busoniu et al., 2008].

Hyper-Q [Tesauro, 2003] is an extension of Q-learning designed specifically for multiagent systems.

The main difference is that the Q function depends on three parameters: the state, the estimated joint

mixed strategy of all other agents, and the current mixed strategy of the agent. The problem with

this approach is that in order to obtain an approximation of the mixed strategies a discretization has

to be performed. Thus, the Q-table grows exponentially in the number of discretization points, which

will also result in larger learning times.

In [Da Silva et al., 2006] the reinforcement learning with context detection (RL-CD) approach is

described. The idea is to learn several partial models and decide at each time step which one to use


depending on the context of the environment. The approach needs extensive parameter tuning (six parameters) for each domain, provides no formal guarantees, and does not use any form of exploration for detecting changes in the environment.

Minimax-Q [Littman, 1994] extends Q-learning to zero-sum games¹. In this case the value function is defined over the joint actions of the two players and, instead of maximizing, the algorithm computes the minimax operator in order to play its part of the Nash equilibrium strategy. This algorithm is guaranteed to converge in self-play. However, it is not guaranteed to obtain a best response, which means it will obtain suboptimal rewards.

1. In a zero-sum game, each participant's gain (or loss) of utility is exactly balanced by the losses (or gains) of utility of the other participant(s).

Algorithm 3.1: WoLF-PHC algorithm
Input: learning-rate parameters δ_l > δ_w ∈ (0, 1] and α ∈ (0, 1]; set of actions A

1   for all (s, a): Q(s, a) = 0                          // initialize Q-table, policy and counter
2   for all (s, a): π(s, a) = 1/|A|
3   for all s: C(s) = 0
4   foreach round do
5       Observe state s and select action a according to π(s) with suitable exploration
6       Obtain reward r and next state s'
7       Q(s, a) = (1 − α)Q(s, a) + α(r + γ max_{a'} Q(s', a'))
8       C(s) = C(s) + 1
9       for all a' ∈ A_i: π̄(s, a') = π̄(s, a') + (1/C(s)) (π(s, a') − π̄(s, a'))    // update average policy π̄
10      π(s, a) = π(s, a) + δ  if a = argmax_{a'} Q(s, a'), and π(s, a) = π(s, a) − δ/(|A| − 1) otherwise
11      where the learning rate δ is determined by:
12      δ = δ_w  if Σ_{a'} π(s, a') Q(s, a') > Σ_{a'} π̄(s, a') Q(s, a'), and δ = δ_l otherwise
13  end

The WOLF (win or learn fast) principle was introduced in the algorithm WOLF-IGA [Bowling and Veloso, 2002]. The algorithm was designed to fulfill two properties: (i) rationality, in the form of obtaining a best response against stationary policies, and (ii) convergence to a stationary policy. The algorithm uses gradient ascent, so in each round the player updates its strategy to increase its expected payoffs. However, the key of the approach is to use a variable learning rate for the gradient ascent; the intuition is to learn quickly when losing and cautiously when winning. WoLF-PHC (see Algorithm 3.1) is another algorithm, an extension of Q-learning. It uses two learning rates, one for winning and one for losing; to decide which to use, it compares the expected value of the current policy against that of the average policy. The algorithm has been empirically successful in self-play. Building on the WOLF


principle, the authors proposed another algorithm that also shows no regret² in the limit [Bowling, 2004]. Even though WOLF algorithms are designed to converge in self-play to the Nash equilibrium, this is not always optimal in terms of rewards. COLF (Change of Learn Fast) [Cote et al., 2006] is an algorithm inspired by the WOLF principle, but with the objective of promoting cooperation among self-interested agents to achieve a Pareto-efficient solution.

2. Regret is a measure of how much worse an algorithm performs compared to the best static strategy.

A recent algorithm, M-Qubed (Max or Minimax Q) [Crandall and Goodrich, 2011], is another reinforcement learning algorithm which balances cautious and best-response attitudes. M-Qubed typically selects actions based on its Q-values in the current state (best response), but switches to its minimax strategy when its total loss exceeds a predetermined threshold (cautious). However, it is not designed for switching opponents.

3.3.3 Exploration vs. exploitation

Exploration in multiagent systems has not received as much attention as in the single-agent setting. Some works have analyzed different exploration strategies in specific domains such as economic systems [Rejeb et al., 2005] or foraging [Mohan and Ponnambalam, 2011]. Carmel and Markovitch [1999] propose an exploration strategy for model-based learning, in which the opponent is modeled through a mixed strategy so as to reflect uncertainty about the opponent; experiments were performed in the iterated prisoner's dilemma. The main limitation of these approaches is that they do not handle exploration in non-stationary environments (e.g., with strategy switches).

In multiagent systems, one common assumption is to not explicitly model other agents, but instead treat them as part of the environment. However, this assumption presents problems; for example, when many learning agents interact in the environment and explore at the same time, they may create noise for the rest, which is called exploratory action noise [HolmesParker et al., 2014]. The authors propose coordinated learning exploratory action noise (CLEAN) rewards to cancel such noise. CLEAN assumes that agents have access to an accurate model of the reward function, which is used jointly with reward shaping [Ng et al., 1999] to promote coordination and scalability. Exploratory action noise will not appear in our setting since we focus on cooperative and competitive settings; in competitive settings this approach will not work, since it needs to know the reward function of all other agents.

Bayesian Policy Reuse (BPR) [Rosman et al., 2015] has been proposed as a framework for quickly determining the best policy to use when facing an unknown task. The agent is presented with an unknown task which must be solved within a limited, small number of trials. BPR has a set of policies Π and faces different tasks T. A BPR agent knows performance models P(U|T, Π) describing how each policy behaves on each task. Then, in each episode of the interaction, BPR selects a policy to


act and receives an observation signal which is used to update its belief. The main limitation of BPR is that it needs to know, before the interaction starts, the set of policies and how those policies behave under different tasks.

3.4 Game theory

We start by reviewing related work on implicit negotiation. Next, we review methods for learning

in repeated and stochastic games. Finally, we review approaches from the area of behavioral game

theory.

3.4.1 Implicit negotiation

In [Littman and Stone, 2001] the authors consider autonomous agents engaging in implicit negotiation via their tacit interactions. The authors propose two general “leader” strategies for playing repeated games; these strategies are called leader strategies since they assume that their opponents will follow their strategy by playing a best response. The first strategy is a deterministic, state-free policy called “Bully”, which chooses the action that maximizes its reward given that the opponent plays a best response. The second strategy is called “Godfather” since it makes its opponent an offer it can't refuse. They call a pair of deterministic policies a targetable pair if playing them results in each player receiving more than its security level. Godfather chooses a targetable pair (if there is one) and plays its half in the first stage. From then on, if the opponent plays its half of the targetable pair in one stage, Godfather plays its half in the next stage; otherwise, it plays the policy that forces its opponent down to its security level (the expected payoff a player can guarantee itself using a minimax strategy). These tactics are forms of implicit negotiation in that they aim to achieve a mutually beneficial outcome without using explicit communication outside the game, and they can be used as general strategies in repeated games.

3.4.2 Learning

One well-known algorithm for learning in repeated games is fictitious play [Brown, 1951]. The model simply maintains a count of the opponent's past plays. The opponent is assumed to be playing a stationary strategy, and the observed frequencies are taken to represent the opponent's mixed strategy.
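The core of fictitious play for one player is short: count the opponent's past actions, treat the empirical frequencies as its mixed strategy, and best-respond to it. The sketch below is an illustrative implementation for a repeated bimatrix game; the payoff function and example values are assumptions made for the example.

from collections import Counter

def fictitious_play_action(opponent_history, my_actions, opp_actions, my_payoff):
    """Best response to the empirical mixed strategy of the opponent.

    my_payoff(a, b) returns this player's utility for the joint action (a, b).
    """
    counts = Counter(opponent_history)
    total = sum(counts.values())
    if total == 0:
        return my_actions[0]                               # arbitrary first move
    freq = {b: counts[b] / total for b in opp_actions}     # empirical (assumed stationary) strategy
    expected = {a: sum(freq[b] * my_payoff(a, b) for b in opp_actions) for a in my_actions}
    return max(expected, key=expected.get)

# Example: against an opponent that has mostly defected in the PD, fictitious play defects.
pd = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}
print(fictitious_play_action(["D", "D", "C"], ["C", "D"], ["C", "D"], lambda a, b: pd[(a, b)]))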

Some works have considered how to play against particular classes of opponents. For example, Manipulator [Powers and Shoham, 2005] is designed against adaptive opponents with bounded memory in normal-form games. In particular, the opponent plays a conditional strategy where its actions can only depend


on the most recent k periods of past history; to exploit this, Manipulator alternates among the strategies fictitious play, minimax and a modified Bully strategy [Littman and Stone, 2001].

AWESOME (Adapt When Everybody is Stationary, Otherwise Move to Equilibrium) [Conitzer and Sandholm, 2006] is an algorithm that is guaranteed to converge to a Nash equilibrium in self-play. It also learns to play optimally against stationary opponents in games with an arbitrary number of players and actions.

Weighted Policy Learner (WPL) [Abdallah and Lesser, 2008] is similarly designed to converge to a Nash equilibrium; however, it can do so with limited knowledge (it assumes that an agent neither knows the underlying game nor observes the other agents).

Recent approaches have tried to reduce the learning time, thus proposing “fast” learning. One approach is the fast adaptive learner (FAL) [Elidrisi et al., 2012], which focuses on fast learning in two-person repeated games. To predict the next action of the opponent, it maintains a set of hypotheses over the history of observations using the method in [Jensen et al., 2005]. To obtain a strategy against the opponent, it uses a modified version of the Godfather strategy [Littman and Stone, 2001]. However, the Godfather strategy is not a general strategy that can be used against any opponent in any game. Also, FAL shows an exponential increase in the number of hypotheses (in the size of the observation history), which may limit its use in larger domains. Recently, a version of FAL designed for stochastic games, FAL-SG [Elidrisi et al., 2014], has been presented. The idea is to map the stochastic game into the setting used by FAL (a normal-form matrix game); for this, the algorithm generates a meta-game matrix by clustering the opponent's actions. After obtaining this matrix, the algorithm proceeds in the same way as FAL. Both FAL and FAL-SG use a modified Godfather strategy to act, which fails to promote exploration for detecting switches, and this results in a longer time to detect opponent switches.

3.4.3 Behavioral game theory

An important group of models that try to capture human behavior are those that use iterated strategic reasoning. The cognitive hierarchy [Camerer et al., 2004a] and level-k [Costa Gomes et al., 2001] models are the two most important of this group. In both, for an agent to make a decision it performs a series of reasoning steps over different levels. Level zero is formed by simple strategies (like a random choice across the possible actions), and each subsequent level is constructed as a best response to the lower levels. Even though these models tend to obtain good results in single-decision games with human populations [Wright and Leyton-Brown, 2010], there is no analysis of how they might perform in repeated games or sequential decision problems.

Self-tuning experience-weighted attraction (EWA) [Camerer et al., 2004b] is an algorithm that generalizes reinforcement learning and fictitious play and has shown good predictive power in short repeated games (fewer than 8 iterations). However, this model does not allow history-dependent strategies or other types of more general opponents.

3.5 Opponent and teammate modeling

Opponent modeling is the capacity to predict other agents' behavior when the environment is populated with adversaries [Stone, 2007]. In this area, some important works have been devoted to specific applications, for example playing poker [Bard et al., 2013] and Scrabble [Richards and Amir, 2006]. A similar situation appears when the environment is populated with teammates that cannot communicate; thus, the same approach of learning a model (in this case of a teammate) can be applied.

A general, domain-independent algorithm is the recursive modeling method (RMM) [Gmytrasiewicz and Durfee, 2000]. This algorithm takes into account the reasoning of other agents to obtain the best coordinated action. The model proposes that each agent creates a model of the other agents, which in turn can have a model of the first agent (recursively). In order to terminate the recursion there is one base level in which the other agents are ignored. The model of an agent is a utility matrix; the main limitation of this approach is how to obtain that matrix when agents are not cooperative.

In [Barrett et al., 2012] the problem is one where an agent is forced to work in a team with three other unknown agents in order to complete a specific task. Therefore, the agent needs to build a model of its teammates to plan its future behavior. To learn the models, decision trees [Quinlan, 1993] were used; for planning, Monte Carlo tree search [Kocsis and Szepesvari, 2006] is proposed. One limitation is that the learning of models is performed offline and only the belief update is online.

The Lemonade Stand Game [Zinkevich et al., 2011] is a repeated symmetric 3 player constant-sum

finite horizon game, in which a player chooses a location for their lemonade stand on an island with

the aim of being as far as possible from its opponents. Different tournaments were played each year and research groups from all over the world submitted their agents, which led to interesting ideas

related to opponent modeling. In particular the EA2 algorithm [Sykulski et al., 2010] was the winner

of the first tournament. The algorithm attempts to find a suitable partner with which to coordinate

and exploit the third player. To do this, it classifies the behavior of its opponents using the history

of joint interactions. Note that EA2 models the behaviors of its opponents, rather than situations

of the game. In [Cote et al., 2010] the authors presented the TeamUP agent, which was the winner of

the second tournament. They propose a special representation for adversarial (constant-sum) games.

Given the repeated nature of the interaction they frame the action selection problem as a planning

problem, where the unknown behavior of the opponents is learnt by repeatedly interacting with them.

This idea of the representation will be used throughout this thesis to model the opponent’s strategy.


3.5.1 Memory bounded learning

A special type of learning can happen by using a particular type of information. Bounded memory opponents are agents that use the opponent's $D$ past actions to assess the way they are behaving. For these agents the opponent's history of play defines the state of the learning agent. In [Banerjee and Peng, 2005] the authors propose the adversary induced MDP (AIM) model, which uses a vector $\phi_D$ that is a function of the past $D$ actions of the learning agent, i.e., of $(a_t, \dots, a_{t-D}) \in A^D$. Note that the agent, by just keeping track of $\phi_D$ (its own past moves), can infer the policy of the bounded memory opponent.

Definition 3.1 (AIMs). An adversary induced Markov decision process (AIM) is a tuple $\langle A, \Phi, T, U\rangle$ where:

• $A$ is the action space of the agent.

• $\Phi = \{\phi_D : \sum_{a\in A} \phi_D(a) = 1,\ \phi_D(a) \in [0,1]\ \forall a \in A\}$ is the state space.

• $T : A \times \Phi \to \Delta(\Phi)$ is the state-transition function that maps actions and states to probability measures on future states.

• $U : A \times \Phi \to \mathbb{R}$ is the function

$$U(\phi_D, a) = \sum_{b\in B} \rho(\phi_D, b)\, R_A(a, b) \qquad (3.1)$$

that maps state $\phi_D \in \Delta(A)$ and action $a \in A$ to a real number that represents the agent's expected reward. $R_A$ is the reward function of the learning agent. Here, $\rho(\cdot, b) \in [0, 1]$, subject to the constraint $\sum_{b'\in B} \rho(\cdot, b') = 1$ ($B$ is the action space of the opponent).

The learning agent, by knowing the MDP that the opponent induces, can compute an optimal policy $\pi^*$. These types of players can be thought of as finite automata that take the $D$ most recent actions of the opponent and use this history to compute their policy [Cote and Jennings, 2010].
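As a minimal illustration of Equation 3.1 (with illustrative names; rho would come from the learned opponent model and R_A from the game matrix), the expected reward of each action in a given state of the induced MDP can be computed as follows:

```python
def expected_rewards(rho, reward_A, actions_A, actions_B):
    """U(phi_D, a) = sum_b rho(phi_D, b) * R_A(a, b), for a fixed state phi_D.

    rho[b]: predicted probability that the bounded-memory opponent plays b
    in the current state; reward_A[(a, b)]: the learning agent's payoff.
    """
    return {a: sum(rho[b] * reward_A[(a, b)] for b in actions_B) for a in actions_A}

# Example in the prisoner's dilemma: the opponent is predicted to defect with
# probability 0.8 in the current state.
rho = {"C": 0.2, "D": 0.8}
reward_A = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 4, ("D", "D"): 1}
print(expected_rewards(rho, reward_A, ["C", "D"], ["C", "D"]))  # {'C': 0.6, 'D': 1.6}
```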

Since memory bounded opponents are a special case of opponents, different algorithms were specifically developed to be used against these agents. For example, the agent LoE-AIM [Chakraborty and Stone, 2008] is designed to play against a memory bounded player (without knowing the exact memory length). Moreover, the algorithm presented in [Cote and Jennings, 2010] is designed to play against an unbounded memory player.

3.6 Hybrid approaches

Recently, some works have adopted a hybrid approach using models and ideas from different areas; most of them use a behavioral game theory approach to model human behavior.


In [Wunder et al., 2009, 2011] an I-POMDP [Gmytrasiewicz and Doshi, 2005] model was used as a building block in combination with the cognitive hierarchy [Camerer et al., 2004a] model to form a Parametrized I-POMDP. This model presents different reasoning levels; one key aspect is that there is a distribution of strategies within each level. To construct the strategies, in each level there is a population. Level zero corresponds to simple behaviors that do not present any strategic reasoning. The upper levels (with strategic reasoning) are constructed by obtaining a policy that maximizes the score against either a distribution over lower levels, or a selection of agents from those levels, by solving the POMDP formed by them. One limitation of the approach is its complexity, so it might not scale to larger problems; another main difference with our work is that they learn against populations of agents rather than explicitly modeling the opponents.

Another hybrid model is the multiagent influence diagram (MAID) [Koller and Milch, 2001]. This model provides a graphical representation of a game and the objective is to exploit its graphical form to compute Nash equilibria. Moreover, the network of influence diagrams (NID) [Gal and Pfeffer, 2008] extends the MAID model to include uncertainty over the agent models. One limitation of MAIDs and NIDs is that they are designed for one-shot decision games. Also, they use Nash equilibrium as a solution concept, which is not always the best solution for different types of agents or scenarios.

3.7 Summary of the chapter

In this chapter, we reviewed recent works related to this thesis. In Table 3.1 we present the most important related works and compare them by their single-agent or multiagent nature, the type of learning they use, the theoretical guarantees provided and their complexity. A summary of the limitations found in the state of the art is the following:

• Approaches that can be used only for single decisions (one-shot) [Camerer et al., 2004a; Costa Gomes et al., 2001; Koller and Milch, 2001].

• Approaches that are computationally intractable for large scale problems [Choi et al., 1999; Gmytrasiewicz and Doshi, 2005; Seuken and Zilberstein, 2008; Tesauro, 2003].

• Approaches that assume stationarity of the opponent [Brown, 1951; Gmytrasiewicz and Durfee, 2000; Koller and Milch, 2001].

• Approaches that remove the stationarity assumption but need an offline training phase [Choi et al., 1999].

• Approaches that work in non-stationary environments but do not use exploration mechanisms for detecting switches [Elidrisi et al., 2014].

³Contains all decision problems that can be solved by a deterministic Turing machine using a polynomial amount of computation time.
⁴A decision problem is PSPACE-complete if it can be solved using a polynomial amount of space and if every other problem that can be solved in polynomial space can be transformed to it in polynomial time.


Table 3.1: A comparison of different algorithms in terms of the type of opponents they handle, type of learning, complexity and whether they provide guarantees for switch detection. The * indicates algorithms that will be used as comparison in the experimental chapter of this thesis. The ? symbol indicates there is no information about it. The last six rows (MDP-CL, MDP4.5, MDP-CL(DE), R-max#, R-max#CL and DriftER) are the algorithms proposed in this thesis.

| Algorithm | Multiagent approach | Type of learning | Theoretical guarantee for switch detection | Complexity | Exploration for switch detection |
|---|---|---|---|---|---|
| Hyper-Q, Minimax-Q, COLF, Manipulator, M-Qubed, AWESOME, WPL, EWA, LoE-AIM | Yes | Online | No | ? | No |
| MAID, NID, RMM | Yes | No | No | ? | No |
| RL-CD | No | Online | No | ? | No |
| BPR | No | Offline | No | P³ | No |
| Fictitious play | Yes | Online | No | P | No |
| I-POMDPs | Yes | Offline & Online | No | ? | No |
| FAL* | Yes | Online | No | ? | No |
| WoLF-PHC* | Yes | Online | No | ? | Partial |
| HM-MDPs* | No | Offline | No | PSPACE-complete⁴ | Partial |
| R-max* | Yes | Online | No | P | No |
| MDP-CL | Yes | Online | No | P | No |
| MDP4.5 | Yes | Online | No | P | No |
| MDP-CL(DE) | Yes | Online | No | P | Yes |
| R-max# | Yes | Online | Yes | P | Yes |
| R-max#CL | Yes | Online | No | P | Yes |
| DriftER | Yes | Online | Yes | P | Yes |



In the next chapter, we present our contributions in the area of non-stationary opponents. We start by presenting a framework (MDP-CL and MDP4.5) for learning and planning against switching opponents in repeated games. Then, we address different initial limitations such as adding an exploration mechanism for switch detection, using prior information and not forgetting previously learned models. We present another algorithm (R-max#) which provides efficient exploration and gives guarantees of optimality against switching opponents. We conclude with another algorithm (DriftER) for switch detection based on concept drift, which also provides theoretical guarantees for switch detection.


Chapter 4

Acting against Non-Stationary Opponents

In this chapter, we present our main proposals for dealing with non-stationary opponents (a graphical roadmap of this chapter is depicted in Figure 4.1). We start by describing a general framework for learning and planning against non-stationary opponents in repeated games. Two implementations of the framework are presented, MDP4.5 and MDP-CL; their difference lies in the model they use for learning the opponent strategy. The former uses decision trees and the latter uses MDPs.

Then we present two extensions of MDP-CL. In some domains it may be possible to know the set of strategies used by the opponent before starting the interaction. We adapt MDP-CL for those cases and name it a priori MDP-CL. In MDP-CL, once a switch has been detected the model is discarded and a new model is learned from scratch. However, the previous model can still be useful. Thus, we adapt MDP-CL to keep a record of previous models and, when a switch is detected, compare the new model to those previously seen; this is incremental MDP-CL.

The framework and its implementations were able to detect opponent switches; however, they did not apply explicit exploration for detecting them. To address this problem, we propose drift exploration for non-stationary opponents. First we propose adding drift exploration to MDP-CL. Then we propose R-max#, an algorithm for efficiently exploring the state space which is able to work against non-stationary opponents. This approach is based on the original R-max algorithm (Section 2.2.3), which is theoretically grounded to provide efficient exploration. Jointly using a switch detection method with R-max# exploration gives R-max#CL.

Our last proposal is DriftER, an algorithm for detecting switches inspired by concept drift techniques. The main idea is that once the agent has learned a model of the opponent it can track how that model is behaving with a measure of predictive error. That information is used as an indicator of a switch when the error starts increasing consistently.


Figure 4.1: The algorithms proposed in this chapter (Chapter 4) and how they relate to each other: the framework against non-stationary opponents (Section 4.1) with its two implementations, MDP4.5 and MDP-CL; the MDP-CL extensions a priori MDP-CL (Section 4.2.3) and incremental MDP-CL (Section 4.2.4); drift exploration (Section 4.3) with MDP-CL (DE) (Section 4.3.1) and R-max# (Section 4.3.2); R-max#CL (Section 4.4.1); and DriftER (Section 4.5).

4.1 MDP-CL and MDP4.5

Before presenting our framework, we present how to model opponents in repeated games.

4.1.1 Modeling opponents

Deciding how an agent should act in repeated adversarial games when the adversaries are unknown is a difficult task. Specifically, in these games, the capability of an agent to compute a good sequence of actions relies on its capacity to forecast the behavior of its opponents [Cote et al., 2010].

The repeated nature of the interaction allows the agent to learn the unknown behavior of the

opponents by interacting with them. For example, the interaction of a learning agent with a stationary

opponent can be modeled as an MDP. This occurs since the interaction between agents generates

Markovian observations which can be used to learn the opponent’s strategy. In this case, the learning

agent perceives a stationary environment whose dynamics can be learned. However, the learning task

may require a large number of repeated interactions to be effective. Therefore, an abstraction of the

complete interaction history is needed to reduce the number of samples. For this reason we need a set

of attributes A that can describe the opponent strategy and those will be used to construct the states.

Similarly we can build a model of the opponent strategy with another representation, for example

using a decision tree. In this case the decision nodes are the attributes that describe the opponent

strategy. The leaves’ values are the possible opponent actions. Thus a decision tree will provide a

model of how the opponent is behaving.

The set of attributes depends on the domain. It is common to use the history of past interactions (Section 2.3.2) to model the opponent strategy [Banerjee and Peng, 2005]. Note that this representation is capable of learning a variety of strategies [Chakraborty and Stone, 2008]. However, in order to avoid using the complete history of interaction it is common to use only the last step [Crandall and Goodrich, 2011] as attributes. This representation allows the agent to learn strategies such as TFT, Pavlov and Bully in the iterated prisoner's dilemma (Section 2.3.2). However, in more elaborate domains (for example, the PowerTAC domain; Appendix A) agents may use other attributes of the environment (which may have no relation to the interaction history) to select their actions.

4.1.2 Introduction

Now, we start by presenting a framework for learning and planning against non-stationary opponents

in repeated games. The approach is based on the comparison of learned models in order to detect

strategy switches. It consists of three main parts:

• Learning phase. A model of the opponent is learned.

• Planning phase. Uses the learned model along with information from the environment to compute

an optimal plan against the modeled opponent.

• Change detection process, that embeds the learning and planning phases to identify switches in

the opponent strategy. Here, different models are learned and comparisons among them reveal

when the opponent model has changed.

A model is learned with information obtained from a window of interactions.

Definition 4.1 (Interaction window). A window of interaction of size $k$ represents a sequence of interactions among the agents in the environment starting at round $t_i$ and ending at round $t_{i+k}$.

This model will reflect the opponent behavior, and then a policy to act against it can be computed. Different models of the opponent will be learned using different windows of interactions (in terms of size). If the opponent has not changed strategies, information from those windows of interaction will yield the same learned model (with enough samples). On the other hand, if the opponent has changed its strategy it will yield a different model. When the opponent changes strategy, the model and the respective policy are reset and the process restarts from scratch.

4.1.3 Assumptions

The proposed framework makes the following assumptions:

• The opponent will not change strategy during a number of interactions (learning phase).1

1This number of interactions can be passed as a parameter to the learning algorithm.


• Our agent knows the attributes that can capture the dynamics of the opponent.

The first assumption is important for the agent to learn an accurate model of the opponent strategy. We performed experiments where the second assumption does not hold and the approach still obtains good results (Sections 5.5.4 and 5.5.6).

Figure 4.2: The three main parts of the framework: (1) learning, using an exploration (random) strategy; (2) planning, using the computed policy π*; and (3) the switch detection process.

4.1.4 Setting

The problem’s setting is the following: Our learning agent A and one opponent O2 repeatedly play

a Repeated normal form game � (see Section 2.3). Both agents take one action (simultaneously) in a

sequence of rounds. A and B are the set of possible actions for our learning agent and the opponent

respectively. They both obtain a reward r that depends on the actions of both agents (defined by �).

The objective of our learning agent is to maximize its cumulative rewards over the entire interaction.

Agent O has a set of O possible stationary strategies (see Section 2.3 for definition) to choose from

and can switch from one to another in any round of the interaction, excluding periods called learning

phases. A strategy defines a probability distribution for taking an action given a history of interactions.

4.1.5 Overview of the framework

In Figure 4.2 a graphical depiction of the three parts of the framework is presented. Learning is the initial phase and uses a random strategy in order to learn the opponent's model. The next phase is planning, where an optimal policy π* is computed to play against the opponent. The computed policy is used and the switch detection process starts; if the opponent switches strategies then the learning phase is restarted, if not, the agent continues using the same policy.

In Algorithm 4.1 the pseudocode of the proposed framework is presented. It uses as parameters: the size of the window of interaction w needed to learn a model, the threshold which determines a value of comparison for deciding whether the models are different, and the number of rounds T in the repeated game.

²In most experiments in this thesis we assume only one opponent; however, in Section 5.2.6 we present experiments using MDP-CL with more opponents.


Algorithm 4.1: Proposed framework algorithm

Input: Size of the window of interactions w, the switch detection threshold, the number of rounds in the repeated game T.
Function: compareModels(), compare two opponent models
Function: planWithModel(), obtain a policy to act using the opponent model
Function: playWithPolicy(), play using the computed policy

1  j = 2                                              // initialize counter
2  model = π* = ∅                                     // initialize opponent model and policy
3  for t = 1 to T do
4      if t == i · w, (i ≥ 1) and model == ∅ then     // learn exploration model
5          Learn an exploration model with the past w interactions
6          π* = planWithModel(model)                  // compute a policy from the learned model
7      end
8      if t == j · w, where (j ≥ 2) then              // learn comparison model
9          Learn another model' with past interactions
10         d = compareModels(model, model')           // compare models
11         if d > threshold then                      // opponent strategy has changed?
12             π* = model = ∅, j = 2                  // reset models and restart exploration
13         else
14             j = j + 1
15         end
16     end
17     if π* == ∅ then
18         Play with random actions                   // no model, explore randomly
19     else
20         playWithPolicy(π*)                         // use learned policy
21     end
22 end
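The following Python sketch summarizes the control flow of Algorithm 4.1. It is only an illustrative skeleton under several assumptions: learn_model, plan_with_model and compare_models stand for the representation-specific components described below, and env is a hypothetical environment object that returns one experience tuple per round.

```python
import random

def framework_loop(env, actions, w, threshold, T,
                   learn_model, plan_with_model, compare_models):
    """Skeleton of Algorithm 4.1: explore, learn a model, plan, and watch for switches."""
    history = []                  # experience tuples, one per round
    model, policy = None, None
    j = 2                         # next comparison point, in multiples of w
    for t in range(1, T + 1):
        if model is None and t % w == 0:
            model = learn_model(history[-w:])        # exploration learned model
            policy = plan_with_model(model)
        elif model is not None and t == j * w:
            comparison = learn_model(history[-w:])   # comparison model
            if compare_models(model, comparison) > threshold:
                model, policy, j = None, None, 2     # switch detected: start over
            else:
                j += 1
        state = env.current_state()
        action = random.choice(actions) if policy is None else policy[state]
        history.append(env.step(action))             # e.g., a (s, a, r, s') tuple
    return history
```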

Table 4.1: A description of the main parts of the approach using two different representations: MDP4.5 and MDP-CL.

| Representation | Learning | Planning | Switch detection |
|---|---|---|---|
| MDP4.5 | Decision trees | DT converted to MDP | Compare trees by means of structural and predictive similarity. |
| MDP-CL | MDP | MDP | Compare MDPs using the total variation distance of the transition functions. |


The proposed framework is not tied to a specific representation of the opponents' models. To exemplify our approach, two different representations for learning an opponent model were considered: decision trees and MDPs. These two representations are quite different; a motivation to use MDPs is that they are the common representation in sequential decision problems, reasoning under uncertainty and RL. In fact, MDPs have been proposed to model opponents which use only past history [Banerjee and Peng, 2005] (see Section 3.5.1). On the other hand, decision trees are a technique used in machine learning, mostly in classification problems, and there are several algorithms for learning decision trees from a batch of information. The main parts of these two representations, called MDP4.5 and MDP-CL, are presented in Table 4.1 and described in the following sections.

Now we discuss each of the three main parts of the framework in more detail: learning, planning and switch detection.

4.1.6 Learning: opponent strategy assessment

Consider the case where the opponent uses a stationary strategy for a period of interactions. Our

learning agent starts with no prior model of the opponent and starts an exploration phase playing a

random strategy. After a certain number of interactions, w, the learning agent uses the information

from the past interactions to generate a model of the opponent which we call exploration learned

model.

Definition 4.2 (Exploration learned model). The model of the opponent strategy obtained when there

is no previous model or when a switch is detected using the information from the last w rounds.

Since the proposed framework is general, different techniques to learn a model of the opponent can be used. To exemplify our approach we present two implementations: MDP4.5, which uses decision trees to model the opponent, and MDP-CL, which uses MDPs.

MDP4.5

Given a window of interaction w between the agents, the C4.5 algorithm [Quinlan, 1993] returns a decision tree D which corresponds to the opponent strategy. The set of attributes is assumed to be given by an expert, and the class values are the opponent's actions B. Each path $p_l$ of D to a leaf $l$ is a unique decision rule; thus, there is a one-to-one correspondence between a path $p_l$ and the leaf $l$. Additionally, each leaf has two associated values, classified(l) and misclassified(l), which represent the accuracy of the leaf. In Figure 4.3 (a) a decision tree with just one decision node and two leaves is depicted. The attributes in the decision tree are previous plays from both agents. The leaves of the tree represent the opponent's next action and the edges represent the decision rules (later used to plan a strategy).


Figure 4.3: (a) A decision tree that corresponds to a model of an agent. It contains one decision node, LearnAgent last action, and two leaves that correspond to the opponent's actions C_opp and D_opp; each leaf has a number of correctly classified/misclassified instances. (b) A learned MDP using the game matrix of the prisoner's dilemma. It is composed of four states (ovals). Each state is formed by the last action (C or D) of the learning agent (learn) and the opponent (opp). The arrows represent the triplet: action, transition probability and immediate reward.

MDP-CL

The second approach is to learn an MDP model of the opponent. This MDP describes the dynamics

of the opponent. The set of attributes A used to construct the states is assumed to be given by an

expert. For example, the attribute Ai could take as values of the last play of agent i.

Formally, the MDP is composed of:

• The set of states $S := \times_i A_i$, i.e., each state is formed by the Cartesian product of the set of attributes.

• The set of actions $A$, which are the learning agent's actions in Γ.

• The transition function $T : S \times A \to S$, which is learned using counts:

$$T(s, a, s') = \frac{n(s, a, s')}{m(s, a)} \qquad (4.1)$$

with $n(s, a, s')$ the number of times the agent was in state $s$, used action $a$ and arrived at state $s'$; $m(s, a)$ is defined as the number of times the agent was in state $s$ and used action $a$.

• The reward function $R$, which is learned in a similar way:

$$R(s, a) = \frac{\sum r(s, a)}{m(s, a)} \qquad (4.2)$$

with $r(s, a)$ the reward obtained by the agent when being in state $s$ and performing action $a$.

In Figure 4.3 (b) a learned MDP with four states is depicted. This MDP represents the opponent strategy TFT in the iPD. Each state is formed by the last plays (C or D) of the learning agent (learn) and the opponent (opp); the arrows indicate the action, transition probability and reward.
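A minimal sketch of the count-based estimates in Equations 4.1 and 4.2, assuming the window of interaction is available as (s, a, r, s') tuples (function and variable names are illustrative):

```python
from collections import defaultdict

def learn_opponent_mdp(window):
    """Estimate T(s, a, s') and R(s, a) from a window of (s, a, r, s') tuples."""
    n = defaultdict(int)          # n(s, a, s'): transition counts
    m = defaultdict(int)          # m(s, a): state-action counts
    r_sum = defaultdict(float)    # accumulated reward per state-action pair
    for s, a, r, s_next in window:
        n[(s, a, s_next)] += 1
        m[(s, a)] += 1
        r_sum[(s, a)] += r
    T = {(s, a, s_next): c / m[(s, a)] for (s, a, s_next), c in n.items()}
    R = {(s, a): total / m[(s, a)] for (s, a), total in r_sum.items()}
    return T, R

# Two rounds of the iPD against a TFT-like opponent; states are (opponent, agent) last plays.
window = [(("C", "C"), "D", 4, ("C", "D")), (("C", "D"), "D", 1, ("D", "D"))]
T, R = learn_opponent_mdp(window)
# T[(("C", "C"), "D", ("C", "D"))] == 1.0 and R[(("C", "C"), "D")] == 4.0
```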


4.1.7 Planning

Once a model of the opponent is learned, a plan that provides the optimal way to play against such an opponent is needed. Using the opponent's model to compute the best action can be seen as a sequential decision problem, assuming the opponent will remain fixed. Thus, the obvious approach is to plan with MDPs, since we assume the state is fully observable (if the state were not fully observable, POMDPs could be used). Solving an MDP does not involve high complexity: computing the optimal policy is complete for P [Papadimitriou and Tsitsiklis, 1987], and a number of methods, from linear programming to dynamic programming, can be used.

Decision trees to MDPs

When learning decision trees, a transformation is needed to obtain an MDP (since the DT does not prescribe how to act), and solving that MDP provides an optimal policy to act. The MDP induced by a decision tree D and a bimatrix game Γ is composed of:

• $A$, the set of available actions for the learning agent.

• $R$, obtained from the game matrix Γ.

• The set of states $S := P \times B$, i.e., each state is formed by a path and an action of the opponent.

• The transition function $T : S \times A \to S$, generated as follows: for each $s \in S$ and $a \in A$,

$$T(s, a, s') = \begin{cases} classified(l) & \text{if } s' \in P \\ \frac{misclassified(l)}{|B| - 1} & \text{otherwise} \end{cases} \qquad (4.3)$$

As an example, the tree of Figure 4.3 (a) represents the strategy of a learning agent (Learn) against an opponent with two actions $B = \{b_1, b_2\}$. Converting this tree and using the values $t = 4, r = 3, p = 1, s = 0$ from the matrix of the prisoner's dilemma yields the MDP depicted in Figure 4.4 (a). This decision tree can be augmented to include new leaves (corresponding to misclassified instances) as depicted in Figure 4.4 (b). These new leaves share the path of the original leaf but replace the class with the other possible values. In the example, the (original) left leaf has the value $b_1$ and this leaf classified correctly 90% of the time. The other 10% corresponds to a different value, in this case $b_2$; therefore this new leaf is added to the tree and is depicted with a dotted arrow. Notice that the original leaves correspond to the set $P$ and the new leaves correspond to the set $P'$. Converting this augmented tree yields the MDP depicted in Figure 4.4 (c). Solving this MDP provides the optimal policy against the opponent.
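The following sketch illustrates Equation 4.3 for a single leaf, assuming the leaf stores its predicted opponent action and its classified/misclassified counts (the names are illustrative, not taken from an existing implementation):

```python
def leaf_transition_probs(predicted_action, classified, misclassified, opponent_actions):
    """Turn a decision-tree leaf into a distribution over the opponent's next action.

    The predicted class receives the leaf's accuracy; the remaining mass is split
    uniformly among the other |B| - 1 actions, as in Equation 4.3.
    """
    total = classified + misclassified
    others = [b for b in opponent_actions if b != predicted_action]
    probs = {b: (misclassified / total) / len(others) for b in others}
    probs[predicted_action] = classified / total
    return probs

# A leaf predicting C with 90 correctly and 10 incorrectly classified instances:
print(leaf_transition_probs("C", 90, 10, ["C", "D"]))  # {'D': 0.1, 'C': 0.9}
```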


Figure 4.4: (a) The MDP obtained from the decision tree in Figure 4.3 (a), composed of two states (ovals); the arrows represent transition probabilities using actions a_x, with rewards in parentheses. (b) The augmented decision tree of Figure 4.3 (a), which contains in dotted lines two added leaves representing classification errors. (c) The MDP obtained from (b) using the prisoner's dilemma game matrix. It is composed of four states. The dotted arrows and ovals correspond to the added actions.

As shown in the previous example, when the opponent shows a stochastic behavior, converting a DT into an MDP will increase the number of states. This is a limitation of DTs, since they are generally not well suited to handle these types of behaviors; a better fit is to directly use MDPs.

In cases where the learned model is already an MDP this phase is omitted. Whether the MDP was learned directly or transformed from a decision tree, any off-the-shelf dynamic programming algorithm such as value iteration [Bellman, 1957] can be used to solve it. If the model correctly represents the opponent's strategy, the solution will produce an optimal policy against that opponent, which will result in a maximization of the accumulated rewards.
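For completeness, a generic value iteration sketch over a learned model is shown below (a textbook implementation under the assumption that T and R are dictionaries such as those built above, with a discount factor gamma; it is not the exact code used in this thesis):

```python
def value_iteration(states, actions, T, R, gamma=0.95, eps=1e-6):
    """Solve the learned MDP and return a greedy policy and the value function."""
    V = {s: 0.0 for s in states}

    def q_value(s, a):
        # One-step lookahead: immediate reward plus discounted expected value.
        return R.get((s, a), 0.0) + gamma * sum(
            T.get((s, a, s2), 0.0) * V[s2] for s2 in states)

    while True:
        delta = 0.0
        for s in states:
            best = max(q_value(s, a) for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:
            break
    policy = {s: max(actions, key=lambda a: q_value(s, a)) for s in states}
    return policy, V
```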

4.1.8 Detecting opponent switches

The first two phases of the framework have been described: (i) assessing the strategy used by the opponent and (ii) generating a model of the opponent. Note that in the second phase, the learning agent switches to a strategy that optimizes against the newly learned exploration model. Notwithstanding, this switch could trigger a response from the opponent, and the agent needs to be able to detect such changes. With this in mind, another model is learned concurrently and compared with the exploration learned model periodically in order to detect possible changes in the opponent strategy.

Definition 4.3 (Short history model). A learned model of the opponent strategy used to perform comparisons with the exploration learned model, using the interactions of the past w rounds.


Short history models are used when comparing decision trees; they use the last window of interaction of size w to learn a model.

Definition 4.4 (Long history model). A learned model of the opponent strategy used to perform comparisons with the exploration learned model, using the interactions from the last switch detected to the current round.

Long history models are used when comparing MDPs. The exploration model and the history model (short or long) are compared every jw steps, where w is a parameter of the algorithm (the size of the window of interaction) and j = 2, 3, . . ., to evaluate their similarity. If the distance between models is greater than a given threshold parameter (whose value may be different for each representation), the opponent has changed strategy and the modeling agent must restart the learning phase, resetting both models and starting from scratch with a random (exploration) strategy. Otherwise, it means that the opponent has not switched strategies, so the same strategy is continued and j is incremented.

Decision trees

When using decision trees, a sensible measure of how similar the exploration learned model and the short history model are is needed. For this purpose, the dissimilarity measure presented in [Miglio and Soffritti, 2004] is used. This measure has the advantage that it can combine structural (the attributes of the nodes) and predictive (the predicted classes) similarities in a single value.

Let $D_i$ and $D_j$ be two different trees learned as presented in Section 4.1.6. We define $1, \dots, H$ as the leaves of $D_i$ and $1, \dots, K$ as the leaves of $D_j$; the value $m_{hk}$ is the number of instances which belong to both the $h$th leaf of $D_i$ and the $k$th leaf of $D_j$.

The dissimilarity measure is defined as:

$$d(D_i, D_j) = \sum_{h=1}^{H} \alpha_h (1 - s_h)\frac{m_{h0}}{n} + \sum_{k=1}^{K} \alpha_k (1 - s_k)\frac{m_{0k}}{n} \qquad (4.4)$$

where the $m_{xy}$ values measure the predictive similarity and the $\alpha_x$ and $s_x$ values measure the structural similarity. The coefficient $s_h$ is a similarity coefficient whose value synthesizes the similarities $s_{hk}$ between the $h$th leaf of $D_i$ and the $K$ leaves of $D_j$. The value $s_{hk}$ measures the similarity of two leaves taking into account their classes and the objects they classify. The coefficient $\alpha_x$ is a dissimilarity measure of the paths associated with two leaves. When these paths are not discrepant, the value is set equal to 0. If, on the contrary, the paths are discrepant, the value is greater than 0, depending on the path and the level at which the two paths differ from each other. The maximum value of $d(D_i, D_j)$ is reached when the difference between the structures of $D_i$ and $D_j$ is maximum and the similarity between their predictive powers is zero. This measure can be normalized to the range $[0, 1]$, where 0 indicates that the trees are very similar³ and 1 that they are totally dissimilar.

Figure 4.5: Example of highly dissimilar decision trees (a) and (b) according to the measure in Equation 4.4 (since their paths and predictions differ); in contrast, (c) and (d) depict highly similar trees, since the attributes in the nodes are the same and the predictions are similar.

To exemplify the distance presented in Equation 4.4, take the decision trees⁴ depicted in Figures 4.5 (a) and (b), which have a high dissimilarity value (d = 0.38). The reason is that their paths are discrepant (structural similarity) and their predictive classifications are different. In contrast, Figures 4.5 (c) and (d) depict highly similar trees (d = 0.0); note that the attributes in the nodes are the same (even when the split value is different they are considered the same).

MDPs

When using MDPs as models, the comparison is performed between the long history model and the exploration learned model. In particular, for comparing MDPs we use the total variation distance between transition functions. The total variation distance (TVD) compares probability distributions and for categorical distributions is defined as:

$$TVD(\mu, \nu) = \frac{1}{2}\sum_{x}|\mu(x) - \nu(x)| \qquad (4.5)$$

computed between the transition functions $\mu$ and $\nu$ of the two MDPs (each element in the transition function of $\mu$ is compared with the corresponding one in $\nu$).

³Nodes with numeric attributes where the same variable occurs but with different splitting values are considered totally similar.
⁴Example adapted from [Miglio and Soffritti, 2004].

Figure 4.6: Example of how the framework works against a TFT-Pavlov opponent that changes between strategies at time h. Every w rounds a tree is learned. In the upper part the exploration learned trees are presented, which represent the TFT and Pavlov strategies. In the lower part, from left to right, the first tree represents the TFT opponent strategy; the second one is learned when the opponent changed strategies, and because it is different from the exploration learned model a switch is detected (d > threshold). The third one represents the Pavlov strategy, learned after the opponent switch.
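A small sketch of Equation 4.5 applied to two learned transition functions, assuming both are stored as dictionaries mapping (s, a, s') to probabilities; how the per-(s, a) distances are aggregated into a single score is not fixed by the equation, so averaging them is just one illustrative choice:

```python
def transition_tvd(T1, T2, states, actions):
    """Average total variation distance between the next-state distributions of two
    learned MDPs: 0.5 * sum_{s'} |T1(s,a,s') - T2(s,a,s')| for each (s, a) pair."""
    distances = []
    for s in states:
        for a in actions:
            d = 0.5 * sum(abs(T1.get((s, a, s2), 0.0) - T2.get((s, a, s2), 0.0))
                          for s2 in states)
            distances.append(d)
    return sum(distances) / len(distances)
```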

Running example

The framework is exemplified in Figure 4.6 using decision trees as models. Models are shown in two groups to better exemplify the framework: in the upper part exploration learned models are shown and in the lower part the short history ones are depicted; comparisons between models of these two groups will reveal switches in the opponent.

Suppose an infinitely repeated iPD game played against an opponent that starts playing the TFT

strategy (see Section 2.3.2) and after h steps changes its strategy to Pavlov. The first exploration


learned model (upper left) is the initial learned tree and represents the TFT strategy, learned after

interacting with the opponent for w steps. After this initial interaction, the agent can compute a policy

against the learned model (TFT) and it will start using that policy. At 2w the first short history model

is learned (lower left) and a comparison between trees is performed every w steps, after which, the

short tree is reset. In this case the comparison reveals they are the same model and the agent keeps

playing the same policy. At step h, with jw < h < (j + 1)w, the opponent switches from TFT to

Pavlov. The short history model learned with information from jw to (j + 1)w (during the switch),

is the second one depicted in the lower part of the figure. This tree is different from the exploration learned model that represents TFT. Since the distance between these trees, d, is greater than the specified threshold, the opponent has changed strategies. The current learned

models and policies are reset and the exploration phase restarts. The second exploration learned tree

(upper right) is the learned model after the switch and represents the Pavlov strategy.

4.1.9 Summary

We have presented our first contribution, a framework for quickly learning non-stationary strategies in repeated games. The framework uses windows of interactions to learn a model of the opponent. The learned model is used to compute an optimal policy against that opponent. Different models are learned throughout the repeated game and comparisons between models indicate a switch in the opponent. Two different implementations of the framework are evaluated (experiments are presented in Section 5.2): the first one, called MDP4.5, uses decision trees to model the opponent; the second, called MDP-CL, uses only MDPs. This framework has a limitation: it discards the learned model (which may be useful for future interactions) once a switch is detected. Therefore, in the following section we propose how to overcome this limitation. The idea is to keep the learned models in memory and reuse them if they reappear in the interaction. A second extension is designed to take advantage of cases where prior knowledge (the set of possible strategies the opponent will use) can be obtained. We address both of these extensions of MDP-CL in the next section.

4.2 MDP-CL with knowledge reuse

In this section, we present two algorithms that extend MDP-CL (i.e., the framework presented in the previous section with the problem modeled as an MDP). The first one (a priori MDP-CL) uses prior information (the set of strategies used by the opponent) to quickly detect the opponent model while still checking for opponent switches. The second approach (incremental MDP-CL) learns new models but, in contrast to MDP-CL, does not discard them once it detects a switch. In this way it keeps a record of models in case the opponent reuses a previous strategy.


4.2.1 Assumptions

In this section, we make the following assumptions:

• The opponent will not change strategy during a number of interactions (learning phase).5

• Our agent uses a state space representation that can describe the opponent strategy.

With a priori MDP-CL we also assume:

• A priori MDP-CL knows the set of strategies used by the opponent before the interaction.

We performed experiments where this last assumption is removed (Section 5.3.5) and the algorithm

starts with a set of noisy models.

4.2.2 Setting

The problem’s setting is the same as in the previous section. Our learning agent and one opponent Orepeatedly play a bimatrix game �.

4.2.3 A priori MDP-CL

MDP-CL learns models of the opponents by exploring the entire state space, where the state space is as described in Section 4.1.6. However, in some settings an agent could have information about the set of strategies used by the opponents. A priori MDP-CL is designed to be used in those cases.

A priori MDP-CL assumes prior information in the form of a set M of MDPs that represent possible strategies used by the opponent. However, there is still the problem of quickly detecting which of these strategies is the one used by the opponent. In MDP-CL this was not a problem since there were no prior models to compare against. Now, the problem we face is one of model selection.

At each round of the repeated game the learning agent experiences a tuple (s, a, r, s'). In a similar way to MDP-CL, the a priori algorithm learns as if there were no prior information, using an exploration phase. Recall that MDP-CL needed to finish that phase to learn a model; in contrast, a priori MDP-CL learns a new model every round of the repeated game. This learned model is compared (at each round) with each model in M using the TVD (Equation 4.5). Since we assume that the strategy used by the opponent belongs to the set of models, we can guarantee that, with enough experience tuples, the learned model will have a perfect similarity (TVD = 0.0) with at least one of the models in M. When this happens we stop the exploration phase and change to the planning phase, setting the opponent model and computing a policy. Since this is only an asymptotic guarantee, in some domains a perfect similarity will not happen in finite time. For this reason, a priori MDP-CL has as a parameter a threshold, ρ, that defines how close a model should be in order to set that model as the current one. This parameter can be set to handle noisy opponents, like the ones presented in Section 5.3.5. The rest of the algorithm behaves as MDP-CL.

⁵This number of interactions can be passed as a parameter to the learning algorithm.

A priori MDP-CL takes advantage of knowing the set of strategies, which results in fewer interactions to detect the model than to learn it from scratch, as MDP-CL does (for experimental results please refer to Section 5.3). The limitation is the assumption of knowing the set of opponent strategies.
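A sketch of the model-selection step in a priori MDP-CL under the same assumptions as the previous sketches (candidate models stored as transition-function dictionaries; names are illustrative). While no candidate is within the threshold ρ, the agent keeps exploring:

```python
def select_opponent_model(learned_T, candidates, states, actions, rho):
    """Return the name of the closest candidate model if its distance to the learned
    model is within rho, otherwise None (meaning exploration should continue)."""
    def tvd(T1, T2):
        pairs = [(s, a) for s in states for a in actions]
        return sum(0.5 * sum(abs(T1.get((s, a, s2), 0.0) - T2.get((s, a, s2), 0.0))
                             for s2 in states)
                   for (s, a) in pairs) / len(pairs)

    best_name = min(candidates, key=lambda name: tvd(learned_T, candidates[name]))
    return best_name if tvd(learned_T, candidates[best_name]) <= rho else None
```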

4.2.4 Incremental MDP-CL

A priori MDP-CL makes use of an initial set of models; with incremental MDP-CL we relax the assumption of having the complete set of models M from the beginning. We assume there is a finite set of strategies used by the opponent and that these strategies will be used repeatedly during the interaction. Incremental MDP-CL includes both learning new models, if the opponent uses a new and unknown strategy, and maintaining a history of learned strategies, in case the opponent switches back to a previous one.

Algorithm 4.2: Incremental MDP-CL

Input: Size of the window of interactions w, switch detection threshold, threshold ρ
Function: TVD(), compare two opponent models using the total variation distance
Function: planWithModel(), obtain a policy to act using the opponent model
Function: playWithPolicy(), play using the computed policy

1  M = ∅, currentModel = ∅              // initialize set of learned models and current model
2  for each round of the repeated game do
3      if currentModel == ∅ then
4          Learn a model with past interactions
5          if less than i · w interactions (i ≥ 1) then
6              for each m ∈ M do
7                  if TVD(model, m) ≤ ρ then
8                      currentModel = model
9                      π* = planWithModel(currentModel)
10                 end
11             end
12         else
13             M = M ∪ {model}
14             currentModel = model
15             π* = planWithModel(currentModel)
16         end
17         Play with random actions
18     else
19         playWithPolicy(π*) and use switch detection as in MDP-CL
20     end
21 end

A high level view of incremental MDP-CL is described in Algorithm 4.2. It starts by initializing the


set of learned models, M = ∅, and setting the current model variable to null. Then, for every round of the repeated game, it learns an opponent model, currentModel, and compares it with those in M. If the TVD is lower than a threshold ρ then the model has been previously used and the algorithm computes a policy to act. Otherwise, we need w interactions to learn a new model and add it to the set M. Switch detection is performed as in MDP-CL.

4.2.5 Summary

In this section, we presented two extensions of MDP-CL: (i) a priori MDP-CL, which assumes knowledge of the set of strategies used by the opponent (so the problem is to detect which one the opponent is using), and (ii) incremental MDP-CL, which keeps a record of the previously learned models in case the opponent returns to one of them, in which case it is not necessary to learn it again from scratch.

Our proposed approaches are capable of detecting most of the changes in the opponent strategy. However, in some cases a shadowing behavior [Fulda and Ventura, 2006] appeared (experimental support is shown in Section 5.2.3), yielding suboptimal results. The next section explains this problem in detail and proposes a new type of exploration for detecting switches to overcome that limitation.

4.3 Drift exploration

Exploration in non-stationary environments has a special characteristic that is not present in stationary domains. If the opponent plays a stationary strategy, the learning agent perceives a stationary environment (an MDP) whose dynamics can be learned. However, if the opponent plays a stationary strategy ($strat_1$) that induces an MDP ($MDP_1$) and then switches to $strat_2$, inducing $MDP_2$, and $strat_1 \neq strat_2$, then $MDP_1 \neq MDP_2$ and the learned policy is probably no longer optimal.

In order to motivate drift exploration, take the example depicted in Figure 4.7, where the learning agent faces a switching opponent in the iterated prisoner's dilemma. Here, at time $t_1$ the opponent starts with a strategy that defects all the time, i.e., Bully (Section 2.3.2). The learning agent can recreate the underlying MDP that represents the opponent's strategy using counts (the learned Bully model) by trying out all actions in all states (exploring). At some time ($t_2$ in the figure), it can solve for the optimal policy against this opponent (because it has learned a correct model), which in the iPD is to defect, producing a sequence of visits to state $(D_{opp}, D_{learn})$. Now, at some time $t_3$ the opponent switches its selfish Bully strategy to a fair TFT strategy. But because

the transition $T((D_{opp}, D_{learn}), D) = (D_{opp}, D_{learn})$ in both MDPs, the switch in strategy (Bully → TFT) will not be perceived by the learning agent, thus resulting in not having the optimal strategy against the opponent. This effect is known as shadowing⁶ [Fulda and Ventura, 2006] and can only be avoided by continuously re-checking states that have not been visited recently. Drift exploration deals with such shadowing explicitly; in what follows we present drift exploration for switch detection, and then we propose a new algorithm called R-max# (since it is sharp to changes) for learning and planning against non-stationary opponents.

Figure 4.7: An example of the models learned against a Bully-TFT switching opponent. The models represent two MDPs: the opponent starts with Bully (at time t1); after some rounds (t2) a model of the opponent is completely learned and the agent can optimize against it. The opponent switches to TFT (at time t3) and the learning agent cannot detect the switch since it is performing the action D (thick arrow) and not exploring the rest of the state space.

4.3.1 General drift exploration

The problem with non-stationary environments is that opponent strategies may share similarities in

their induced MDP (specifically between transition functions). If the agent’s optimal policy produces

an ergodic set of states (e.g., the resulting ergodic set for defecting against Bully is the sole state

Dopp, Dlearn) and this part of the MDP is shared between opponent strategies, the agent will not

perceive such strategy change, which results in a suboptimal policy and performance. The solution to

this is to explore even when an optimal policy has been learned. Exploration schemes like ✏�greedyor softmax (e.g., a a Boltzmann distribution), can be used for such purpose and they will work

as drift exploration with the added cost of not e�ciently exploring the state space. Against this

background, we propose another approach for drift exploration that e�ciently explores the state space

and demonstrates this with a new algorithm called R-max#.

⁶Other authors have observed a related behavior, which is called observationally equivalent models [Doshi and Gmytrasiewicz, 2006].
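As noted above, a simple (though not sample-efficient) way to obtain drift exploration is to keep a small amount of undirected exploration even after a policy has been computed; a minimal ε-greedy sketch:

```python
import random

def drift_exploration_action(policy, state, actions, epsilon=0.05):
    """Follow the computed policy, but with probability epsilon try another action,
    so that rarely visited parts of the induced MDP keep being re-checked."""
    if random.random() < epsilon:
        return random.choice(actions)
    return policy[state]
```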


4.3.2 Efficient drift exploration

Efficient exploration strategies should take into account which parts of the environment remain uncertain; R-max is an example of such an approach (see Section 2.2.3). In this section we present R-max#, an algorithm inspired by R-max but designed for strategic interactions against non-stationary switching opponents. To handle such opponents, R-max# reasons and acts in terms of two objectives: 1) to maximize utilities in the short term while learning, and 2) to eventually detect opponent behavioral changes.

R-MAX#

The basic idea of R-max# is to forget long-gone state-action pairs. These pairs are those that 1) are considered known and 2) have not been updated in τ rounds, at which point the algorithm resets their reward value to $r_{max}$ in order to promote exploration of those pairs, which implicitly re-checks whether the opponent model has changed.

R-max# receives as parameters $(m, r_{max}, \tau)$, where $m$ and $r_{max}$ are used in the same way as in R-max, and $\tau$ is a threshold that defines when to reset a state-action pair. R-max# starts by initializing the counters $n(s, a) = n(s, a, s') = r(s, a) = 0$, rewards to $r_{max}$, transitions to a fictitious state $s_0$ (like R-max), and the set of pairs considered known, $K = \emptyset$. Then, in every round the algorithm checks, for each state-action pair $(s, a)$ that is considered known ($\in K$), how many rounds have passed since the last update. If this number is greater than the threshold $\tau$ then the reward for that pair is set to $r_{max}$; the counters $n(s, a)$, $n(s, a, s')$ and the transition function $T(s, a, s')$ are reset, and a new policy is computed. Then, the algorithm behaves as R-max. The pseudocode of R-max# is presented in Algorithm 4.3.

4.3.3 Running example

Now we will use the example in Figure 4.8 to show how R-max# interacts with an unknown switching opponent. The opponent starts with a Bully strategy ($t_1$). After learning the model, R-max# knows that the best response against such a strategy is to defect ($t_2$), and the interaction will be a cycle of defections. At time $t_3$ the opponent changes from Bully to TFT, and because some state-action pairs have not been updated for several rounds (more than the threshold $\tau$), R-max# resets the rewards and transitions for reaching such states, at which point a new policy is recomputed. This policy will encourage the agent to re-visit states that have not been visited for a long time. Now, R-max# will update its model, as shown in the transition model in Figure 4.8 (note the thick transitions, which are different from the Bully model). After a certain number of rounds ($t_4$), the complete TFT model will be learned and an optimal policy against it is computed. Note that R-max# will re-learn a model even when no change has occurred in the opponent strategy.


Algorithm 4.3: R-max# algorithm

Input: States S, actions A, threshold value m, reward value rmax, threshold τ
Function: SolveMDP(), receives a tuple which corresponds to an MDP and obtains a policy

1  ∀(s, a, s')  r(s, a) = n(s, a) = n(s, a, s') = 0
2  ∀(s, a)  T(s, a, s0) = 1                       // transitions point to the fictitious state s0
3  ∀(s, a)  R(s, a) = rmax
4  K = ∅
5  ∀(s, a)  lastUpdate(s, a) = ∅
6  π = SolveMDP(S, A, T, R)                        // initial policy
7  for t = 1, . . . , T do
8      Observe state s, execute action a from policy π(s)
9      for each (s, a) do
10         if (s, a) ∈ K and t − lastUpdate(s, a) > τ then
11             R(s, a) = rmax                      // reset reward to rmax
12             n(s, a) = 0
13             ∀s'  n(s, a, s') = 0                // reset counters
14             T(s, a, s0) = 1                     // reset transitions to the fictitious state
15             π = SolveMDP(S, A, T, R)            // solve MDP and get a new policy
16         end
17     end
18     if n(s, a) < m then
19         Increment counters n(s, a) and n(s, a, s')
20         Update reward r(s, a)
21         if n(s, a) == m then                    // pair is considered known
22             K = K ∪ {(s, a)}
23             lastUpdate(s, a) = t
24             R(s, a) = r(s, a)/m
25             for s'' ∈ S do
26                 T(s, a, s'') = n(s, a, s'')/m
27             end
28             π = SolveMDP(S, A, T, R)
29         end
30     end
31 end
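A condensed Python sketch of the reset logic that distinguishes R-max# from R-max (Algorithm 4.3). The MDP solver is abstracted behind solve_mdp and, instead of an explicit fictitious state, unknown pairs simply keep the optimistic reward r_max; this is an illustrative reading of the algorithm, not the thesis implementation.

```python
class RMaxSharp:
    """Model-based learner that forgets state-action pairs not updated for tau rounds."""

    def __init__(self, states, actions, m, r_max, tau, solve_mdp):
        self.states, self.actions = states, actions
        self.m, self.r_max, self.tau = m, r_max, tau
        self.solve_mdp = solve_mdp                     # (T, R) -> policy
        self.known, self.last_update = set(), {}
        self.n, self.n_trans, self.r_sum = {}, {}, {}
        self.R = {(s, a): r_max for s in states for a in actions}
        self.T = {}                                    # empty = optimistic / unknown
        self.policy = self.solve_mdp(self.T, self.R)

    def update(self, t, s, a, r, s_next):
        # Forget pairs that are known but have not been updated for tau rounds.
        for pair in list(self.known):
            if t - self.last_update[pair] > self.tau:
                self.known.discard(pair)
                self.R[pair] = self.r_max
                self.n[pair] = 0
                self.n_trans = {k: v for k, v in self.n_trans.items() if k[:2] != pair}
                self.T = {k: v for k, v in self.T.items() if k[:2] != pair}
                self.r_sum[pair] = 0.0
                self.policy = self.solve_mdp(self.T, self.R)
        # Standard R-max style update for the pair just experienced.
        if self.n.get((s, a), 0) < self.m:
            self.n[(s, a)] = self.n.get((s, a), 0) + 1
            self.n_trans[(s, a, s_next)] = self.n_trans.get((s, a, s_next), 0) + 1
            self.r_sum[(s, a)] = self.r_sum.get((s, a), 0.0) + r
            if self.n[(s, a)] == self.m:               # the pair becomes known
                self.known.add((s, a))
                self.last_update[(s, a)] = t
                self.R[(s, a)] = self.r_sum[(s, a)] / self.m
                for s2 in self.states:
                    self.T[(s, a, s2)] = self.n_trans.get((s, a, s2), 0) / self.m
                self.policy = self.solve_mdp(self.T, self.R)
        return self.policy
```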

Figure 4.8: An example of the models learned by R-max# against a Bully-TFT switching opponent. The models represent three learned MDPs: at the extremes, the opponent starts with Bully (at time t1) and switches to TFT (t4); in the middle, a transition model learned after the switch (t3) Bully → TFT.



4.3.4 Practical considerations of R-max#

In contrast to the case of stationary opponents, when acting against non-stationary opponents we need to perform drift exploration constantly. However, knowing when to explore is a more difficult question, especially if we know nothing about the possible switching behavior (switching periodicity). Even in this case we can still provide guidelines for setting τ: 1) τ should be large enough to learn a sufficiently good opponent model (possibly a partial model when the state space is large); in this situation the algorithm learns a partial model and optimizes against that model of the opponent. 2) τ should be small enough to enable exploration of the state space; an extremely large value for τ will decrease the exploration for longer periods of time and it will take longer to detect opponent switches. 3) If expected switching times can be inferred or learned, then τ can be set to a value that is related to those timings. This is explained by the fact that after the opponent switches strategies, the optimal action is to re-explore at that time. For experimental results that support these guidelines please refer to Section 5.4.4. Next we provide some theoretical guarantees for R-max# that prove that it is capable of detecting opponent switches and that its rewards are optimal with high probability, given certain assumptions.

4.4 Sample complexity of exploration for R-max#

In this section, we study the sample complexity of exploration for the R-max# algorithm. Before presenting our analysis, we first state our assumptions.

1. Complete (self) information: the agent knows its states, actions and received rewards.

2. Approximation condition (from Kakade, 2003): states that the policy derived by R-max is near-optimal in the MDP (Definition 4.6, see below).

3. The opponent will use a stationary strategy for some number of steps.

4. All state-action pairs will be marked known within some number of steps.

Given the aforementioned assumptions, we show that R-max# will eventually relearn a new model for the MDP after the opponent switches and will compute a near-optimal policy.

We first need some definitions and notation, taken from Kakade [2003], to formalize the proofs. Firstly, an L-round MDP $\langle S, A, T, R\rangle$ is an MDP with a set of decision rounds $\{0, 1, 2, \dots, L-1\}$, where L is either finite or infinite. In each round both agents choose actions concurrently. A deterministic

56

Page 80: Strategic Interactions against Non-Stationary Agents

4.4. SAMPLE COMPLEXITY OF EXPLORATION FOR R-MAX#

T -step policy ⇡ is a T -step sequences of decision rules of the form {⇡(s0),⇡(s1), . . . ,⇡(sT �1)}, wheresi 2 S. To measure the performance of a T -step policy in the L-round MDP, t-value is used.

Definition 4.5 (t-value, Kakade, 2003). Let M be an L-round MDP and π be a T-step policy for M. For a time t < T, the t-value U_{π,t,M}(s) of π at state s is defined as

U_{\pi,t,M}(s) = \frac{1}{T}\,\mathbb{E}_{(s_t, a_t, \ldots, s_{T-1}, a_{T-1}) \sim \Pr(\cdot \mid \pi, M, s_t = s)}\left[\sum_{i=t}^{T-1} R(s_i, a_i)\right],     (4.6)

where the T-path (s_t, a_t, . . . , s_{T−1}, a_{T−1}) runs from time t up to time T starting at s and following the sequence {π(s_t), π(s_{t+1}), . . . , π(s_{T−1})}; E denotes expectation and Pr probability.

The optimal t-value at state s is

U^*_{t,M}(s) = \sup_{\pi \in \Pi} U_{\pi,t,M}(s),     (4.7)

where Π is the set of all T-step policies for the MDP M. Finally, we define the Approximation Condition (assumption 2).

Definition 4.6 (Approximation Condition, Kakade, 2003). Let K be a set of known states and let the MDP M̂_K be an estimation of the true MDP M_K with known-state set K. Then, the optimal policy π that R-max derives from M̂_K satisfies, for all states s and times t ≤ T,

U_{\pi,t,M_K}(s) \geq U^*_{t,M_K}(s) - \epsilon.     (4.8)

This assumption states that the policy π derived by R-max from M̂_K is near-optimal in the true MDP M_K. For R-max#, we have the following main theorem:

Theorem 4.4.1. Let τ = (2m|S||A|T/ε) log(|S||A|/δ) and let M′ be the new L-round MDP after the opponent switches its strategy. The R-max# algorithm guarantees an expected return of U*_{M′}(c_t) − 2ε within O((m|S|²|A|T³/ε³) log²(|S||A|/δ)) timesteps with probability greater than 1 − δ, for timesteps t ≤ L.

The proof of Theorem 4.4.1 will be provided after we introduce some lemmas; here we only give a sketch. R-max# is more general than R-max, since it is capable of resetting the reward estimations of state-action pairs. However, the basic result for R-max# is derived from R-max: the proof relies on applying the R-max sample complexity theorem to R-max#, using R-max as the basic solver. Under the stated assumptions, R-max# can be viewed as R-max with periods, that is, the running timesteps of R-max# are separated into periods. In each period, R-max# behaves as the classic R-max, so that R-max# can learn the new state-action pairs through the R-max mechanism after the opponent switches its policy.


Theorem 4.4.2 (Kakade, 2003). Let M = ⟨S, A, T, R⟩ be an L-round MDP. If c is an L-path sampled from Pr(·|R-max, M, s_0) and assumptions 1 and 2 hold, then, with probability greater than 1 − δ, the R-max algorithm guarantees an expected return of U*(c_t) − 2ε within O((m|S||A|T/ε) log(|S||A|/δ)) timesteps t ≤ L.

The proof of Theorem 4.4.2 is given in Lemma 8.5.2 of Kakade [2003]. To simplify the notation, let C = O((m|S||A|T/ε) log(|S||A|/δ)).

Lemma 4.4.3. After C steps, each state-action pair (s, a) is visited m times with probability greater than 1 − δ/2.

Proof. This is an alternative reading of Theorem 4.4.2 via Hoeffding's bound,⁷ using δ/2 in place of δ. Since C hides all constants, within C steps all state-action pairs are visited m times with probability greater than 1 − δ/2.

Lemma 4.4.4. With a properly chosen τ, the R-max# algorithm resets and visits each state-action pair m times with probability greater than 1 − δ.

Proof. Suppose the R-max# algorithm has already learned a model. Lemma 4.4.3 states that within C steps, each state-action pair (s, a) is visited m times with probability greater than 1 − δ/2; that is, once we learn a model, all state-action pairs are marked as known. Remember that τ measures the difference between the current timestep and the timestep at which each state-action pair was visited m times (i.e., marked as known). To make sure R-max# does not reset a state-action pair before all state-action pairs have been visited, with probability greater than 1 − δ/2, τ must be at least C. Hoeffding's bound does not predict the order in which the m-th visits of the state-action pairs occur; the worst situation is that all state-action pairs are marked known near t = C. According to Lemma 4.4.3, we then need an extra interval of C steps to make sure that all state-action pairs are visited m times after the reset, again with probability greater than 1 − δ/2. In all, we need to set τ = 2C (one C for the resetting stage and another C for the learning stage). Note that, according to assumption 4, all state-action pairs will be learned between t = nC and t = (n + 1)C after resetting. Then, R-max# restarts the learning process.

To simplify the proof, we introduce the concept of a cycle to help us analyze the algorithm.

Definition 4.7. A cycle occurs when all reward estimations for each state-action pair (s, a) are reset and then marked as known again.

Intuitively, a cycle is the process in which R-max# forgets the old MDP and learns the new one.

According to Lemma 4.4.4, 2C steps are sufficient to reset and visit each state-action pair (s, a) m times, with probability at least 1 − δ. Thus, we set the length of one cycle to 2C. A cycle is a τ-window, so that we leave enough timesteps for R-max# to reset and relearn each state-action pair (see Figure 4.9).

⁷Hoeffding [1963] provides an upper bound on the probability that the sum of random variables deviates from its expected value.

Figure 4.9: An illustration of the running behavior of R-max#. Circles represent state-action pairs; a cycle consists of a reset phase and a learning phase. The length of the reset window is τ = 2C. In the learning stage, all state-action pairs may be marked known before t = C with high probability. After resetting, between [2C, 3C], we assume that all state-action pairs will be marked known in [3C, 4C].

Lemma 4.4.5. Let τ = 2C and let M′ be the new L-round MDP after the opponent switches its strategy. The R-max# algorithm guarantees an expected return of U*_{M′}(c_t) − 2ε within C timesteps with probability greater than 1 − 3δ.

Proof. Lemma 4.4.4 states that if τ = 2C, each state-action pair (s, a) is reset within 2C timesteps with probability greater than 1 − δ, since (1 − δ/2)(1 − δ/2) = 1 − δ + δ²/4 > 1 − δ. If an opponent switches its strategy at any timestep in cycle i (see Fig. 4.10 for details), there are three cases: 1) R-max# has not yet reset the corresponding state-action pairs (s, a), since they are not considered known ((s, a) ∉ K); 2) R-max# has reset the reward estimations of the corresponding state-action pairs (s, a) but has not yet learned new ones; 3) R-max# has already reset and relearned the state-action pairs (s, a).

Case 1 is safe for R-max#, since it learns the new state-action pairs with probability greater than 1 − δ. Between cases 2 and 3, the worst is case 3, since R-max# is not able to learn new state-action pairs within cycle i, whereas in case 2 R-max# may still have a chance to learn them (in the learning phase of the same cycle i). Assumption 3 states that the opponent adopts a stationary strategy for at least 4C steps, which is exactly 2 cycles between two switch points. Although R-max# cannot learn new state-action pairs within cycle i when case 3 happens, it can learn them in cycle i + 1 by Lemma 4.4.4.

In all, R-max# will eventually learn the new state-action pairs in either cycle i or cycle i + 1 with probability greater than 1 − 2δ, since (1 − δ)(1 − δ) = 1 − 2δ + δ² > 1 − 2δ. That is, R-max# requires 2 cycles, or 4C steps, to learn a new model that fits the new opponent policy. Applying the chain rule of probability and Theorem 4.4.2, the R-max# algorithm guarantees an expected return of U*(c_t) − 2ε within C timesteps with probability greater than (1 − 2δ)(1 − δ) = 1 − 3δ + 2δ² > 1 − 3δ.

Figure 4.10: Possible switch points in R-max#. Suppose an opponent switches in cycle i; there are two possible switch points. R-max# will learn the new state-action pairs within two cycles.

Note that we have not yet proposed a value for m. How should m be bounded? Kakade shows that m = O((|S|T²/ε²) log(|S||A|/δ)) is sufficient, given error ε and confidence δ (Lemma 8.5.6 in Kakade, 2003). With this result, we have the following proof of Theorem 4.4.1:

Proof of Theorem 4.4.1. Recall that C = O((m|S||A|T/ε) log(|S||A|/δ)). Combining Lemma 4.4.5 with m = O((|S|T²/ε²) log(|S||A|/δ)), Theorem 4.4.1 follows by setting δ := δ/4.

The proofs rely heavily on the assumptions stated at the beginning of this section, which may be too strong to capture the practical performance of R-max#. In particular, assumption 4 may not hold in some domains; nonetheless, the analysis provides a theoretical way to understand R-max#. It relates R-max# to R-max, since R-max is the basic solver used in the proof, and it gives some bounds on the parameters, e.g., τ = 2C and C = O((m|S||A|T/ε) log(|S||A|/δ)).

4.4.1 Efficient drift exploration with switch detection

As a final approach, we propose to use the efficient drift exploration of R-max# together with the MDP-CL framework for detecting switches. The idea is that these two approaches tackle the same problem in different ways and should therefore complement each other, at the expense of some extra exploration. We call this combination R-max#CL (Algorithm 4.4); it combines the synchronous updates of MDP-CL with the asynchronous rapid adaptation of R-max# (note that it uses the parameters of both approaches).

R-max#CL behaves as R-max#, efficiently revisiting only those states that have not been visited recently. Concurrently, the switch detection process of MDP-CL is performed, and if a switch is detected the current model and policy are reset and R-max# restarts. Naturally, such a combination of explorations will turn out to be profitable in some settings, while in others it is better to use only one of them. Experiments in Section 5.4.5 provide some insights about this.


Algorithm 4.4: R-max#CL algorithm
Input: window w, switch-detection threshold, m value, rmax value, reset threshold τ
Function: TVD(), compares two opponent models using the total variation distance
Function: R-max#(), calls the R-max# algorithm

 1 Initialize as R-max#
 2 model = ∅
 3 for t = 1, . . . , T do
 4     R-max#(m, rmax, τ)
 5     if t == i · w, (i ≥ 1) and model == ∅ then
 6         Learn a model with the past w interactions
 7     end
 8     if t == j · w, (j ≥ 2) then
 9         Learn a comparison model′ with past interactions
10         d = TVD(model, model′)
11         if d > threshold then   // Strategy switch?
12             Reset models, j = 2
13         else
14             j = j + 1
15         end
16     end
17 end
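As an illustration of the TVD() comparison used above, a minimal Python sketch could look as follows (the nested-dictionary representation of the opponent models and the averaging over state-action pairs are assumptions made for this example, not the implementation used in the thesis):

def tvd(model_a, model_b, states, actions):
    # Average total variation distance between two opponent transition models.
    # Each model maps (state, action) to a dict {next_state: probability}.
    total, count = 0.0, 0
    for s in states:
        for a in actions:
            p = model_a.get((s, a), {})
            q = model_b.get((s, a), {})
            support = set(p) | set(q)
            # TVD of one distribution pair is half the L1 distance.
            total += 0.5 * sum(abs(p.get(ns, 0.0) - q.get(ns, 0.0)) for ns in support)
            count += 1
    return total / count if count else 0.0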

4.4.2 Summary

In Section 4.3 we presented drift exploration, which is designed to overcome the shadowing problem that occurs when facing non-stationary opponents. We proposed R-max#, an algorithm that builds upon the theoretical results of R-max to provide switch detection guarantees under certain assumptions (Section 4.4). Finally, we proposed to use the efficient exploration approach of R-max# together with MDP-CL in R-max#CL. The next section presents DriftER, an algorithm for switch detection that uses the estimated error of the learned opponent model as an indicator of a possible change of strategies.

4.5 DriftER

Most machine learning work assumes that examples are generated according to some stationary probability distribution. Concept drift approaches (Section 3.3.1), in contrast, study the problem of learning when the class-probability distribution that generates the examples changes over time. However, this approach is not directly applicable to multiagent settings, since we need to interact with another agent and we need to plan an optimal policy. For that reason, we use the modeling idea proposed by MDP-CL (Section 4.1) to generate a model of the opponent that can be used both to play against the opponent and to estimate a possible strategy switch. The idea is to compute a predictive error for the opponent's model; when the error increases consistently, a new model is needed.


DriftER leverages insights from concept drift and MDP-CL to identify switches in an opponent’s

strategy. When facing non-stationary opponents whose model has been learned, an agent must bal-

ance exploitation (to perform optimally against that strategy) and exploration (to attempt to detect

switches in the opponent). DriftER treats the opponent as part of a stationary (Markovian) environ-

ment but tracks the quality of the learned model as an indicator of a possible change in the opponent’s

strategy. When a switch in the opponent strategy is detected, DriftER resets its learned model and

restarts learning. An additional virtue of DriftER is that it can check for switches at every timestep (as opposed to MDP-CL, which checks only at window boundaries). In contrast to R-max#, it detects switches explicitly, and it comes with a theoretical guarantee of high-probability switch detection.

The DriftER pseudocode is presented in Algorithm 4.5. It starts by learning an opponent model in the form of an MDP (lines 3-4). Once a model has been learned, a switch detection process starts by predicting the opponent's behavior (line 5). An error probability is computed, and keeping track of this error decides when a switch has happened (lines 6-13). When this happens, the learning phase is restarted (lines 15-16).

4.5.1 Assumptions

The proposed framework makes the following assumptions.

• The opponent will not change strategy during a number of interactions (learning phase).

If this assumption does not hold, our agent will not learn the correct dynamics of the opponent and its predictions will not be accurate; moreover, the theoretical guarantee (Section 4.5.4) will no longer hold.

4.5.2 Model learning

DriftER learns a model of the opponent in order to compute a policy to act against it. In all settings, after interacting with the environment/opponent for w rounds, the model is learned using the R-max exploration scheme [Brafman and Tennenholtz, 2003], and techniques such as value iteration [Bellman, 1957] can then be used to solve the resulting MDP.
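As a minimal illustration (the tabular arrays, the discount factor and the stopping tolerance are assumptions for this sketch, not values prescribed by the thesis), value iteration over a learned opponent MDP can be written as:

import numpy as np

def value_iteration(T, R, gamma=0.9, tol=1e-6):
    # Solve a tabular MDP.
    # T: transition probabilities, shape (|S|, |A|, |S|); R: rewards, shape (|S|, |A|).
    n_states, n_actions, _ = T.shape
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * (T @ V)        # Q[s, a] = R[s, a] + gamma * sum_s' T[s, a, s'] V[s']
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1), V_new   # greedy policy and state values
        V = V_new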

4.5.3 Switch detection

After learning a model of the opponent, DriftER must decide at each timestep whether to learn a new model or keep the existing one. Using the existing model, DriftER can predict the next state of the MDP (which depends on the opponent strategy) and then compare it with the experienced true state. This comparison can be binarized into correct/incorrect values, producing a Bernoulli process S_1, S_2, . . . , S_T, assumed to be a sequence of independent, identically distributed events where S_i ∈ {0, 1} and T is the last timestep.


Algorithm 4.5: DriftER algorithm
Input: learning window size w, initial error counter n_init, adjust value δ, window m
Function: predictAction(), predicts the next action of the opponent
Function: computeError(), computes the error of the prediction
Function: computeConfInterval(), computes confidence intervals
Function: adjustN(), adjusts n according to the error and δ

 1 model = ∅, countError = 0
 2 for t = 1, . . . , T do
 3     if t == i · w, (i ≥ 1) and model == ∅ then
 4         Learn a model with the past w interactions
 5     end
 6     if model ≠ ∅ then
 7         â = predictAction(model)
 8         Observe the real action a
 9         p = computeError(â, a)
10         [f_upper, f_lower] = computeConfInterval(p)
11         for i = t − 1, . . . , t − m do
12             Δ_i = f_upper(p_i) − f_upper(p_{i−1})
13             if Δ_i > 0 then
14                 countError = countError + 1
15             end
16         end
17         n = adjustN(n_init, p, δ)
18         if countError ≥ n then   // Switch detected
19             model = ∅
20         end
21         countError = 0
22     end
23 end


Let p_i be the estimated error probability (i.e., the probability of observing an incorrect prediction) over S_1 to S_i, for i = 1, . . . , T. Then, the 95% confidence interval [f_lower(p_i), f_upper(p_i)] over S_1, S_2, . . . , S_i is calculated at each timestep i using the Wilson score [Wilson, 1927], so that the confidence interval improves as the amount of data grows; f_lower(p_i) and f_upper(p_i) denote the lower and upper bounds of the confidence interval, respectively.

The estimated error probability and its associated confidence interval can increase for two reasons: i) the opponent is exploring or makes mistakes, or ii) the opponent changes its strategy. To detect the latter case, DriftER tracks the finite difference of the confidence interval using the upper bound f_upper(p_i) at each timestep i. The finite difference is defined by

\Delta_i = f_{upper}(p_i) - f_{upper}(p_{i-1}), \quad i = 1, \ldots, T.     (4.9)

DriftER restarts the learning phase if Δ_i > 0, Δ_{i−1} > 0, . . . , Δ_{i−n+1} > 0, where n is a parameter that should be set according to the domain. That is, if DriftER detects that the confidence interval has been increasing over the last n steps, it decides to restart the learning phase.
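A minimal Python sketch of this detection mechanism follows (the Wilson score formula is the standard one; the surrounding bookkeeping, function names and the use of error counts rather than another error estimate are illustrative assumptions):

import math

def wilson_upper(errors, trials, z=1.96):
    # Upper bound of the 95% Wilson score interval for the error probability.
    if trials == 0:
        return 1.0
    p = errors / trials
    denom = 1 + z * z / trials
    center = p + z * z / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    return (center + margin) / denom

def switch_detected(upper_bounds, n):
    # DriftER-style test: the upper bound increased in each of the last n steps.
    if len(upper_bounds) < n + 1:
        return False
    recent = upper_bounds[-(n + 1):]
    return all(b > a for a, b in zip(recent, recent[1:]))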

Once DriftER has a model of the opponent, it can start computing an error rate and confidence intervals. However, information from the learning phase is used to obtain an initial estimate of both terms (the error probability and its confidence interval). This avoids starting with no information and reduces peaks in the estimation.

Using a fixed n for all types of opponents may not be the best option. For example, against stochastic opponents there will be a non-zero probability of incorrectly predicting the opponent's next move. Since we still need to check when the error increases, in what follows we propose to adjust n according to the error probability p.

We set n = n_init assuming a perfect model can be learned; against stochastic opponents, n is adjusted to

n = n_{init} + \frac{\log(\delta)}{\log(p)},     (4.10)

where n ≤ n_init + C, with C a constant value and δ > 0 (described in the next section).
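A small sketch of the adjustment in Equation 4.10 (the parameter names, and the cap playing the role of the constant C, are assumptions for this illustration):

import math

def adjust_n(n_init, p, delta, cap):
    # Noisier opponents (larger error probability p) require more consecutive
    # increases before declaring a switch, capped at n_init + cap.
    if p <= 0.0 or p >= 1.0:
        return n_init
    return min(n_init + math.log(delta) / math.log(p), n_init + cap)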

4.5.4 Theoretical guarantee for switch detection

Now we provide a theoretical result to justify that this method is capable of detecting opponent switches with high probability. In doing so, we make the following assumptions: (i) the opponent does not switch strategies while DriftER is in the learning phase; (ii) the probability of exploration or mistakes by the opponent is at most ε at each timestep.

Theorem 4.5.1. Let ε > 0 and δ > 0 be small constants. If Δ_i > 0, Δ_{i−1} > 0, . . . , Δ_{i−n+1} > 0 and we set n = O(log δ / log ε), then DriftER detects the opponent switch with probability 1 − δ.


Proof. If Δ_i > 0, Δ_{i−1} > 0, . . . , Δ_{i−n+1} > 0, then DriftER decides to learn a new model. However, a positive Δ_i may also be caused by the opponent's exploration or mistakes. The worst case happens when DriftER incorrectly detects a switch while the opponent only made mistakes or explored, that is, Δ_{i−j} > 0 for all j = 0, . . . , n − 1 due to the opponent's exploration/mistakes. Let A denote this event. Given Δ_i > 0, Δ_{i−1} > 0, . . . , Δ_{i−n+1} > 0, the probability of event A is

P[A \mid \Delta_i > 0, \Delta_{i-1} > 0, \ldots, \Delta_{i-n+1} > 0] \leq \epsilon^{n},     (4.11)

since we assume that the probability of exploration or mistakes is at most ε at each timestep; by the chain rule (multiplying the probabilities of the individual events), the bound follows. Then, set ε^n = δ, where δ is the probability of incorrectly detecting the switch, so 1 − δ is the probability of detecting the switch correctly; finally, n = log δ / log ε and we have the result.
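As a concrete illustration of this bound (the numerical values are chosen here only as an example): for ε = 0.1 and δ = 0.001,

n = \frac{\log \delta}{\log \epsilon} = \frac{\log(0.001)}{\log(0.1)} = 3,

so observing three consecutive increases of the upper bound signals a switch correctly with probability 1 − δ = 0.999.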

This result shows that if DriftER decides to restart the learning phase, it does so because it detected an opponent switch with high probability (1 − δ), which makes the method robust.

4.5.5 Summary

In this section we introduced DriftER, an algorithm that learns a model of the opponent in the form of an MDP and keeps track of its error rate. When the error increases significantly, it is likely that the opponent has changed its strategy and DriftER must learn a new model. Theoretical results provide a guarantee of detecting switches with high probability. Section 5.5 will present results in repeated games and in the PowerTAC simulator against state-of-the-art approaches.

4.6 Summary of the chapter

This chapter presents three main approaches for dealing with non-stationary opponents. The first one is a framework for learning and planning in repeated games against non-stationary opponents, with two implementations: MDP-CL and MDP4.5. The framework detects switches by learning different models throughout the interaction; comparisons between models reveal switches in the opponent. Then two extensions of MDP-CL were presented: a priori MDP-CL assumes knowledge of the set of opponent models and adapts rapidly once it detects a switch, while incremental MDP-CL learns a model and does not discard it once a switch has occurred, which avoids relearning previously seen models. Since MDP-CL is not capable of detecting some types of switches, drift exploration was proposed to overcome this limitation. We proposed the R-max# algorithm, which provides efficient exploration against non-stationary opponents; R-max# offers theoretical guarantees for switch detection and optimal expected rewards under certain assumptions. Our last proposal is DriftER, a switch detection mechanism which learns a model of the opponent and uses it to keep track of the prediction error. DriftER


provides a theoretical guarantee of switch detection with high probability. These approaches have different characteristics, and in the next chapter we present the results of each of our proposals in different domains.


Chapter 5

Experiments

In this chapter, we present experiments performed in five experimental domains (see Figure 5.1): the iterated prisoner's dilemma (iPD), the multiagent prisoner's dilemma, the alternate-offers bargaining protocol, double auctions in the PowerTAC simulator, and general-sum repeated games. The iPD is a simple, well-known domain with different strategies available (Section 2.3.2). Its multiagent version is used to show that MDP-CL can be used in domains with more than one opponent. The alternating-offers protocol is a more complex domain with richer state and action spaces. Double auctions in PowerTAC are a more realistic setting where there is added uncertainty in the environment. Finally, by using general-sum games we show that our approaches generalize to game-theoretic strategies (Section 2.3).

Experiments follow the same order in which the approaches were presented:

• Section 5.2 compares the MDP4.5 and MDP-CL approaches against hidden-mode Markov decision processes (see Section 2.2.2), since HM-MDPs are designed to handle non-stationary environments.

• In Section 5.2.6, MDP-CL is evaluated in the multiagent version of the PD to test whether the approach generalizes to more than one opponent.

• In Section 5.3, a priori and incremental MDP-CL are compared against the original MDP-CL.

• In Section 5.4, drift exploration is evaluated in MDP-CL(DE), R-max# and R-max#CL against the state-of-the-art approaches R-max [Brafman and Tennenholtz, 2003] as a baseline, and FAL [Elidrisi et al., 2012] and WOLF-PHC [Bowling and Veloso, 2002], since these are algorithms that can learn in non-stationary environments.

• DriftER is evaluated in Section 5.5, in the PowerTAC domain, against MDP-CL and TacTex [Urieli and Stone, 2014] (Appendix A.4), the champion of the inaugural PowerTAC competition.

• Finally, in Section 5.6, MDP-CL, R-max# and DriftER are compared in the battle of the sexes game (to see their different behaviors in the same setting) and then on general-sum games against switching opponents that use strategies from the game theory literature (to show our approaches can be used against different strategies). Comparisons are performed with WOLF-PHC.

Figure 5.1: Experimental domains used in this thesis and the sections of this chapter in which they are used: the iterated prisoner's dilemma (Sections 5.2, 5.3, 5.4), the multiagent iPD (Section 5.2.6), negotiation (Section 5.4), normal-form games (Sections 5.5, 5.6), including the battle of the sexes (Sections 5.5.1, 5.6.2) and games with more than two actions (Section 5.6.3), and PowerTAC (Section 5.5.3).

5.1 Experimental domains

We used five domains for performing experiments. In all of them, the setting consists of one learning

agent and one (or more) opponent(s) that interact for several timesteps/timeslots/rounds. We start

by presenting each domain in detail.

5.1.1 Iterated prisoner’s dilemma (iPD)

Table 5.1: The bimatrix game known as the prisoner's dilemma. Each cell shows the utilities given to the agents (the first for agent A and the second for agent B).

                         Agent B
                    cooperate   defect
Agent A  cooperate    3,3        0,4
         defect       4,0        1,1

As an initial setting we used the iPD, since it is a well-known domain where we can easily use different strategies for the opponent. In Table 5.1 we present the values used for the iPD game (they fulfill the requirements presented in Section 2.3). As opponent strategies, we used the three most successful and best-known human-crafted strategies the literature has proposed: TFT, Pavlov and Bully (Section 2.3).

These three strategies have different behaviors in the iPD, and the optimal policy differs across them. For example, with the values in Table 5.1 and a discount factor¹ γ = 0.9, the optimal policy against a TFT opponent is always to cooperate, in contrast to the optimal policy against Bully, which is always to defect. The optimal policy against Pavlov is to play the Pavlov strategy itself.

¹In game theory, this value is commonly used to represent the time value of money; people usually prefer money (rewards) immediately rather than at some later date.
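For reference, a minimal Python sketch of these three opponent strategies (the 'C'/'D' action encoding and the uniform function signature are illustrative assumptions, not the implementation used in the experiments):

def tft(my_last, opp_last):
    # Tit-for-tat: cooperate first, then repeat the opponent's last action.
    return opp_last if opp_last is not None else 'C'

def bully(my_last, opp_last):
    # Bully: always defect.
    return 'D'

def pavlov(my_last, opp_last):
    # Pavlov (win-stay, lose-shift): cooperate whenever both players chose
    # the same action in the previous round, otherwise defect.
    if my_last is None:
        return 'C'
    return 'C' if my_last == opp_last else 'D'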

5.1.2 Multiagent iterated prisoner’s dilemma

Most experiments were performed with one opponent in the environment. However, it is also important to test the performance of the proposed algorithms in settings with more than two agents. The natural

extension was a multiagent version of the prisoner’s dilemma.

In [Stimpson and Goodrich, 2003] an extended version of the prisoner's dilemma was presented; it consists of I players and |A| actions and preserves the same structure (one Nash equilibrium and a dominated cooperative strategy). In the game, the I agents hold |A| resource units each. At each iteration, the i-th agent must choose how many of its |A| units will be allocated to a group goal G, while the remaining units are used for a self-interested goal S_i. Let a_i be the amount contributed by agent i towards goal G, and a = [a_1, ..., a_I] the joint action. The utility of agent i given the joint

action is:

U_i(a) = \frac{\frac{1}{I}\sum_{j=1}^{I} a_j - k\,a_i}{|A|\,(1-k)},     (5.1)

where k ∈ (1/I, 1) is a constant that indicates how much each agent values its contribution towards the selfish goal. The payoff function is such that when all the agents put their |A| units into the group goal, each agent is rewarded with 1; if nobody puts units into the group goal, a payoff of 0 is produced; and if each agent adopts a random strategy, the expected average payoff is 0.5. Here the state space is formed by the last actions of both opponents and of our learning agent; the action space remains with two possible actions.
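A minimal sketch of this payoff (implementing Eq. 5.1 as reconstructed above; the function signature, and the exact algebraic form of the utility, are assumptions checked only against the payoff properties stated in the text):

def masd_utility(a, i, k, units):
    # Utility of agent i: contributions raise the group average, while the
    # agent's own contribution a[i] is discounted by the selfishness weight k.
    group_average = sum(a) / len(a)
    return (group_average - k * a[i]) / (units * (1 - k))

# Sanity checks against the stated properties (I = 3, |A| = 1, k = 0.5):
assert masd_utility([1, 1, 1], 0, 0.5, 1) == 1.0   # full cooperation pays 1
assert masd_utility([0, 0, 0], 0, 0.5, 1) == 0.0   # full defection pays 0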

5.1.3 Alternate-offers bargaining

We also performed experiments on an alternative domain with different characteristics (richer state and action spaces): the alternate-offers bargaining protocol. This domain consists of two players, a buyer and a seller, whose offers alternate while trying to agree on a price. Their possible actions are offer(x) with x ∈ R, exit and accept. If either of the players accepts, the game finishes with rewards for the players. If one of them plays exit, the bargaining stops and the outcome is 0 for both of them. Each utility function U_i depends on three parameters of agent i: the reservation price RP_i (the maximum/minimum amount a player is willing to accept), the discount factor δ_i, and the deadline T_i (agents prefer an earlier agreement, and after the deadline they exit).

For this domain the state space is composed of the last action performed. The parameters used were T_i = 4, δ_i = 0.99, and offers in the range [0–10]² (integer values); therefore |S| = 13 and |A| = 13 (the iPD had |S| = 4, |A| = 2). The buyer values the item at 10. One complete interaction consisted of repeated negotiations. In the experiments, our learning agent was the buyer and the non-stationary opponent was the seller.

²These values emulate a scenario where the buyer wants to buy sooner rather than later, and after a number of rounds it will leave the negotiation.

5.1.4 Double auctions

We use the PowerTAC simulator (see Appendix A.2 for a detailed description) as a practical and real-

world setting. The wholesale market operates as a periodic double auction (Appendix A.3) in which

brokers are allowed to buy and sell quantities of energy for future delivery, typically between 1 and

24 hours in the future. At each timestep traders can place limit orders in the form of bids (buy

orders) and asks (sell orders). Orders are maintained in an orderbook. In a periodic double auction,

the clearing price is determined by the intersection of the inferred supply and demand functions.

Demand and supply curves are constructed from bids and asks to determine the clearing price of

each orderbook (one for each enabled timeslot) at the intersection of the two, which is the price that

maximizes turnover [Ketter et al., 2014].

Although we define a fixed limit price and there is only a single opponent (another buying broker),

PowerTAC includes seven wholesale energy providers as well as one wholesale buyer to ensure liq-

uidity of the market [Ketter et al., 2013], introducing additional uncertainty and randomness in the

simulation.

5.1.5 General-sum games

Previously we used the prisoner's dilemma with different strategies. In order to present a more general setting to test our proposed algorithms, we present experiments in general-sum games where the opponents may use game-theoretic strategies. The strategies derived from game-theoretic stability concepts that we found most relevant to test are: pure Nash equilibria (when available), mixed Nash equilibria, the minimax strategy and fictitious play [Brown, 1951] (see Section 2.3). Furthermore, to describe this more general game-theoretic setting we also include well-known games such as the battle of the sexes.

Table 5.2: A bimatrix game representing the battle of the sexes game. Two agents choose between two actions: going to the opera (O) or going to a football match (F). v1 and v2 represent numerical values.

              O          F
    O      v1, v2      0, 0
    F       0, 0      v2, v1

Battle of the sexes

Battle of the sexes (BoS) is a two-player coordination game. Suppose two persons want to meet in a specific place: at the opera (O) or at a football match (F). One prefers the opera and the other prefers the football match, and there is no possible communication. The game is presented in Table 5.2, where v1 > 0, v2 > 0 and v1 ≠ v2 yield different instantiations of the BoS game. This game has two pure Nash equilibria, (O,O) and (F,F). Both pure equilibria are unfair, since one player obtains a better score than the other. There is also one mixed Nash equilibrium, where players go more often to their preferred event.
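For concreteness, the mixed equilibrium follows directly from Table 5.2: if the row player chooses O with probability p, the column player is indifferent between O and F when

p \cdot v_2 = (1 - p) \cdot v_1 \quad\Longrightarrow\quad p = \frac{v_1}{v_1 + v_2},

which exceeds 1/2 whenever v_1 > v_2, so each player indeed attends its preferred event more often.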

We have presented the domains used in the following sections. Now, we start by presenting

experiments with MDP-CL and MDP4.5 in the iPD.

5.2 MDP4.5 and MDP-CL against deterministic switching opponents

In Section 4.1 we presented a framework that is general enough to accept several learning techniques for generating opponent models. This section presents the experimental results and comparisons with a reinforcement learning technique (hidden-mode MDPs) in terms of the average rewards obtained in repeated games.

The evaluation of the proposed approach was performed on the iPD with values t = 4, r = 3, p = 1, s = 0. We compared the proposed framework, with MDP4.5 and MDP-CL, against HM-MDPs. We chose HM-MDPs since they are a technique designed for non-stationary environments (see Section 2.2.2). Solving an HM-MDP yields a policy whose average rewards over a repeated game can be compared with the policies generated by our two implementations.


Figure 5.2: Graphical depiction of the experiments. In (a), the evaluation approach for the proposed framework (MDP4.5 and MDP-CL): a learning-and-test period of 500 rounds. In (b), the evaluation for HM-MDPs: note the extra learning period at the beginning, after which the learned model is solved as a POMDP, followed by a test period of 500 rounds. In both cases the opponent switches from strategy1 to strategy2.

5.2.1 Setting and objectives

In Figure 5.2 we depict the scheme of the experiments for the three compared approaches. In contrast to our proposed framework, HM-MDPs need an extensive off-line training phase. For this reason, there is an extra learning phase for HM-MDPs in which the agent behaves randomly and learns an HM-MDP (see Fig. 5.2 (b)). The opponent uses two strategies and switches from one to the other at a certain point of the game (in both training and evaluation). The learned HM-MDP consists of 4 states and 2 modes; this information about the opponent is enough to allow the HM-MDP to fully learn the opponent model. To solve the learned HM-MDP, we transform it into a POMDP and use the incremental pruning algorithm [Cassandra et al., 1997]. The POMDP is considered solved when the error between stages is less than 1e−9 or when it exceeds 500 iterations. Solving the POMDP yields a policy that can be used to play against an opponent.

There are three possible strategies to be used by the opponents. An opponent was constructed by taking two out of these three strategies (we tested all possible combinations). The opponent starts playing with one of these strategies and, at a certain point in the game (not known to the learning agent), switches to the other, continuing that way for the rest of the game.

The experiments are divided into four parts. The first one is devoted to analyzing the performance of HM-MDPs under different training options: for an HM-MDP the number of modes (strategies) is fixed in advance (they cannot learn new models online), so we tested different training schemes, varying the training data. The second part compares the three approaches MDP4.5, MDP-CL and HM-MDPs. Our implementations learn in an online fashion, in contrast to HM-MDPs, which need an offline training phase; this difference makes it hard to evaluate the approaches under the same conditions. We compare the average rewards in a series of repeated games; for HM-MDPs, the rewards are obtained only over the evaluation phase. Even though this is unfair to MDP4.5 and MDP-CL, since their learning is online, the complete interaction is used to compute their average rewards. The third part proposes an extension to our approach in the form of a drift exploration scheme. The last part provides results for MDP-CL in a multiagent version of the prisoner's dilemma.


Table 5.3: Average rewards for the HM-MDP agent using different training sizes in the iterated prisoner's dilemma. The opponent switched between strategies in the middle of the interaction. The evaluation phase consisted of 500 steps.

Opponent/Training size 100 500 2000 Average

TFT-Pavlov 2.81 2.90 2.92 2.88

TFT-Bully 1.77 1.55 1.56 1.63

Pavlov-TFT 2.90 2.85 2.74 2.83

Pavlov-Bully 1.97 1.94 1.86 1.92

Bully-TFT 1.58 1.89 1.50 1.66

Bully-Pavlov 1.88 1.92 1.87 1.89

Average 2.17 2.18 2.06 2.13

Table 5.4: Average rewards for the HM-MDP agent (AvgR(A)) and for the opponent (AvgR(Opp)) with standard deviations in the iterated prisoner's dilemma. The opponent switched between strategies in the middle of the interaction. The evaluation phase consisted of 500 steps. The Perfect column shows the maximum value that a learning agent can obtain.

                 Same model               Different model           Perfect
Opponent     AvgR(A)      AvgR(Opp)     AvgR(A)      AvgR(Opp)     AvgR(A)

TFT-Pavlov 2.87 ± 0.09 2.97 ± 0.10 2.05 ± 0.27 1.59 ± 0.99 3.0

TFT-Bully 1.63 ± 0.39 1.67 ± 0.45 1.56 ± 0.15 2.96 ± 0.43 2.0

Pavlov-TFT 2.84 ± 0.13 2.92 ± 0.24 2.29 ± 0.30 2.12 ± 0.68 3.0

Pavlov-Bully 1.92 ± 0.06 1.76 ± 0.41 1.55 ± 0.11 2.82 ± 0.62 2.0

Bully-TFT 1.65 ± 0.27 2.02 ± 0.48 1.17 ± 0.27 1.43 ± 0.88 2.0

Bully-Pavlov 1.89 ± 0.12 1.87 ± 0.53 1.72 ± 0.05 0.97 ± 0.14 2.0

Average 2.13 ± 0.18 2.20 ± 0.37 1.72 ± 0.19 1.98 ± 0.62 2.3

5.2.2 HM-MDPs performance experiments

In order to evaluate the robustness of HM-MDPs, we evaluated the learned policy under different switching times. We used round 250 as the moment when the opponent switches from strategy1 to strategy2; the duration of the games in the evaluation phase was 500 rounds. We evaluated different training sizes tsize = {100, 500, 2000} games. In Table 5.3 we present the average rewards of the HM-MDP agent with different training sizes against switching opponents. The results show that a training size of 500 obtained the best scores: a smaller size could reduce the processing times but may not be sufficient to learn the best policy, while a large training size takes longer and can overfit the model, yielding lower scores.

We present the average rewards with standard deviations (averaged over all tsize values) when the opponent switched between strategies in the middle of the interaction in Table 5.4, under the Same model


column.³ AvgR(A) presents the average rewards for the learning agent, and AvgR(Opp) presents the average rewards for the switching opponent.

As we mentioned earlier, HM-MDPs need to fix the number of modes when learning. In the previous experiment the opponents always used two strategies; since three different strategies are available, we modified the experiment so as to have different strategies in the training and evaluation phases. The motivation is that the opponent may not use all the strategies during the training phase, in which case the learned HM-MDP is incomplete; if during evaluation a new strategy is used (which is not known to the HM-MDP), this would affect the results. So, in the next experiment the training opponent consists of strategy1–strategy2 and the evaluation opponent consists of strategy1–strategy3. The results of this experiment are presented in Table 5.4 under the Different model column.

From the results it is easy to see that HM-MDPs consistently decrease their average reward when using a different model for evaluation and training (difference between the AvgR(A) columns of Same model and Different model in Table 5.4); on average the decrease is 0.56 ± 0.27. In conclusion, when HM-MDPs can explore all models in the training phase, they obtain good results. However, if they do not learn the complete set of opponent strategies, they cannot compute an optimal policy and thus receive lower rewards.

Having analyzed how HM-MDPs behave under different training conditions, we now present the comparison with our proposed implementations.

5.2.3 HM-MDPs vs MDP4.5 vs MDP-CL

Since HM-MDPs need an offline learning phase, the comparison with MDP-CL and MDP4.5 is not entirely direct. HM-MDPs have an off-line training phase in which a policy is computed, and this policy is then evaluated against the switching opponent. In contrast, MDP4.5 and MDP-CL learn and compute a policy continuously throughout the interaction; there are no clear training and evaluation phases. Comparing the average rewards of the evaluation phase for HM-MDPs with those of the complete interaction for MDP4.5 and MDP-CL is not totally fair, but it is a reasonable comparison in which HM-MDPs have an advantage. However, an HM-MDP may learn an incomplete model of the opponent, and during evaluation a strategy not present in the training phase could occur; this was evaluated in the previous section. So, for HM-MDPs we take the average of the Same model and Different model columns of Table 5.4. For MDP4.5 and MDP-CL the average rewards are obtained over the complete game, i.e., during both training and evaluation phases.

In Figure 5.3 we depict the average rewards of HM-MDPs, MDP4.5 and MDP-CL against switching opponents. We varied the switching time among 150, 250 and 350; we present only the results for 250, since they are consistent with the other values.

³A more detailed description of the experiments performed for HM-MDPs is presented in Appendix C.1.


Table 5.5: Average rewards with standard deviation of MDP-CL, MDP4.5 and HM-MDPs against non-stationary opponents. HM-MDPs have two columns, depending on whether the models in the evaluation phase were the same as or different from those in the training phase.

Opponent       MDP-CL        MDP4.5        HM-MDP (Same)    HM-MDP (Different)

TFT-Pavlov 2.87 ± 0.19 2.70 ± 0.08 2.87 ± 0.09 2.05 ± 0.27

TFT-Bully 1.79 ± 0.13 1.87 ± 0.02 1.63 ± 0.39 1.56 ± 0.15

Pavlov-TFT 2.88 ± 0.07 2.72 ± 0.08 2.84 ± 0.13 2.29 ± 0.30

Pavlov-Bully 1.87 ± 0.10 1.79 ± 0.07 1.92 ± 0.06 1.55 ± 0.11

Bully-TFT 0.96 ± 0.01 1.02 ± 0.02 1.65 ± 0.27 1.17 ± 0.27

Bully-Pavlov 1.83 ± 0.05 1.80 ± 0.06 1.89 ± 0.12 1.72 ± 0.05

Average 2.04 ± 0.09 1.98 ± 0.05 2.13 ± 0.18 1.72 ± 0.19

In Table 5.5 we present the comparison among the same algorithms but with the two versions of HM-MDPs (same and different models in learning and evaluation). The conclusions are the following:

• MDP-CL obtained the best results on average, followed by MDP4.5.

• The main difference between MDP4.5, MDP-CL and HM-MDPs appears against the opponent Bully-TFT. This happens because the policy of the HM-MDP maintains a good exploration process. In more detail, the HM-MDP's policy is to defect against Bully, but every 5 or 6 steps it explores the cooperation action in order to detect when the opponent changes to TFT. When this happens, HM-MDPs are capable of noticing this change of behavior and adapting to it, entering a cooperate-cooperate cycle, which results in an increase in rewards. In contrast, MDP4.5 and MDP-CL are not capable of detecting the switch in strategy from Bully to TFT and they keep defecting throughout the game. The behavior of HM-MDPs is a good example of exploration for detecting switches, which was the motivation for drift exploration (Section 4.3).

These experiments allow us to make the following remarks. MDP-CL obtained the best scores on average, and HM-MDPs obtained the worst against all switching opponents except Bully-TFT. HM-MDPs seem to have a good exploration scheme because of their off-line learning process: they obtain good results when they can learn the opponent strategies beforehand, but when facing an unknown strategy their score decreases. In summary, HM-MDPs have three main limitations: i) the need for a training phase, ii) the time to solve the resulting POMDP, and iii) the need to determine the number of training stages. On the other side, MDP4.5 and MDP-CL are online learning approaches which do not need to know beforehand the number of strategies. They also compute their policy faster, since solving an MDP is computationally much simpler than solving a POMDP. A limitation of MDP4.5 and MDP-CL is that their exploration is limited to certain periods of the game; for this reason we proposed a drift exploration scheme.

Figure 5.3: Average rewards of MDP-CL, MDP4.5 and HM-MDPs (100 trials) against different switching opponents that switch in the middle of the interaction in a game of 500 rounds.

5.2.4 Preliminary drift exploration for MDP4.5 and MDP-CL

One of the main conclusions drawn from the results presented in the previous section was that MDP4.5 and MDP-CL do not notice the opponent's switch in some cases, which results in suboptimal policies (low rewards). This is observed against the Bully-TFT switching opponent: since the optimal policy against Bully is to defect, when the opponent switches from Bully to TFT the agent keeps defecting and TFT behaves as Bully.

To solve this problem we promote exploration even when the agent already has a policy to act. This section presents an initial version of drift exploration in the form of ε-greedy (with probability ε the agent acts randomly, not following the prescribed policy). We evaluated different values ε = {0.05, 0.1, 0.15, 0.2, 0.25}. In Table 5.6 we present the comparison between the no-exploration approach and this ε-exploration (ε = 0.1) approach for MDP4.5 and MDP-CL.

We note that in all cases the performance drops except against Bully-TFT, which was expected. However, for MDP4.5 the results are on average worse with this naive exploration. This happens because the approach is too simple and explores with a fixed probability in all cases. A way to improve this situation is to explore more when the results are bad and explore less when the results are good (in terms of rewards). We propose a more intelligent approach that takes into account the average rewards of the last T interactions.


Table 5.6: Comparison of no exploration, ε-exploration and softmax exploration for MDP4.5 and MDP-CL. The Perfect column shows the maximum reward an agent can obtain.

             No exploration           ε-exploration            Softmax exploration       Perfect

MDP4.5
Opponent    AvgR(A)    AvgR(Opp)    AvgR(A)    AvgR(Opp)    AvgR(A)    AvgR(Opp)    AvgR(A)

TFT-Pavlov 2.70 ± 0.08 2.84 ± 0.08 2.51 ± 0.07 2.49 ± 0.07 2.67 ± 0.09 2.78 ± 0.08 3.0

TFT-Bully 1.87 ± 0.02 2.15 ± 0.02 1.71 ± 0.04 2.12 ± 0.07 1.83 ± 0.04 2.17 ± 0.05 2.0

Pavlov-TFT 2.72 ± 0.08 2.70 ± 0.19 2.53 ± 0.07 2.45 ± 0.08 2.68 ± 0.08 2.68 ± 0.15 3.0

Pavlov-Bully 1.79 ± 0.07 1.90 ± 0.23 1.70± 0.06 1.90 ± 0.07 1.76 ± 0.06 1.96 ± 0.16 2.0

Bully-TFT 1.02 ± 0.02 1.08 ± 0.03 1.63 ± 0.08 1.91 ± 0.08 1.51 ± 0.26 1.69 ± 0.28 2.0

Bully-Pavlov 1.80 ± 0.06 1.69 ± 0.12 1.68 ± 0.05 1.74 ± 0.08 1.77± 0.06 1.73 ± 0.13 2.0

Average 1.98 ± 0.05 2.06 ± 0.11 1.96 ± 0.06 2.10 ± 0.08 2.04 ± 0.10 2.17 ± 0.14 2.3

MDP-CL

Opponent AvgR(A) AvgR(Opp) AvgR(A) AvgR(Opp) AvgR(A) AvgR(Opp) AvgR(A)

TFT-Pavlov 2.87± 0.19 2.93 ± 0.06 2.72 ± 0.07 2.76 ±0.09 2.86 ± 0.15 2.89± 0.13 3.0

TFT-Bully 1.79 ±0.13 2.15 ± 0.03 1.79 ± 0.05 2.15± 0.06 1.83 ± 0.08 2.18 ±0.07 2.0

Pavlov-TFT 2.88 ±0.07 2.83 ±0.24 2.75 ± 0.06 2.64± 0.11 2.87 ± 0.11 2.80± 0.27 3.0

Pavlov-Bully 1.87 ±0.10 2.07 ±0.15 1.82 ± 0.02 2.04± 0.04 1.87 ± 0.02 2.13 ±0.04 2.0

Bully-TFT 0.96 ± 0.01 1.09 ±0.02 1.82± 0.07 2.04±0.08 1.74± 0.15 1.93 ±0.14 2.0

Bully-Pavlov 1.83 ± 0.05 1.94 ±0.21 1.79± 0.04 1.93 ±0.12 1.82 ± 0.05 1.97 ±0.19 2.0

Average 2.03 ± 0.09 2.17 ±0.12 2.12 ± 0.05 2.26 ± 0.08 2.17 ± 0.09 2.31 ± 0.14 2.3


Figure 5.4: Average rewards of MDP4.5, MDP-CL and the perfect agent (which knows how to best respond immediately) in the iPD against different stationary opponents. The window size w for learning the opponent's model was varied from 5 to 30.

Moreover, we give more importance to recent rewards than to old ones, so the rewards are weighted accordingly. The exploration probability is now given by:

P(\text{exploration}) = (\lambda\, e^{-\lambda\, r_{avg}}) \cdot x,     (5.2)

where

r_{avg} = \frac{\sum_{t=1}^{T} \gamma^{t}\, r_t}{\sum_{t=1}^{T} \gamma^{t}} \quad \text{and} \quad x \in (0, 1].     (5.3)

We evaluated different combinations of the parameters (γ = {0.7, 0.8, 0.9, 0.95, 0.99}, λ = {1, 1.25, . . . , 2}, x = {0.1, 0.15, . . . , 0.5}), and the best results were obtained with γ = 0.95, λ = 1.5, x = 0.25 for MDP4.5, and γ = 0.99, λ = 1.5, x = 0.2 for MDP-CL. These results are presented in Table 5.6 in the Softmax exploration column. We can observe that this exploration yields better results on average than using no exploration or the ε-exploration approach; this occurs because it does not explore blindly: when rewards are low it explores more than when rewards are high. Its limitation is that it needs three parameters to be tuned to the domain.
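A minimal Python sketch of this exploration rule (the symbol names γ and λ follow the reconstruction above, and the ordering convention of the reward list is an assumption made for this example):

import math

def exploration_probability(recent_rewards, gamma=0.95, lam=1.5, x=0.25):
    # recent_rewards: the last T rewards with the most recent first, so the
    # most recent reward receives the largest weight gamma^1.
    weights = [gamma ** t for t in range(1, len(recent_rewards) + 1)]
    r_avg = sum(w * r for w, r in zip(weights, recent_rewards)) / sum(weights)
    # A low weighted-average reward yields a higher probability of exploring.
    return lam * math.exp(-lam * r_avg) * x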

5.2.5 Learning speed of the opponent strategy

The previous experiment evaluated the approaches against switching opponents. However, it is also important to evaluate how the size of the window of interactions affects the rewards against stationary opponents. The motivation is that the learning times could be reduced, and we would like to see how our approaches behave in those conditions. For that reason, we evaluated MDP4.5 and MDP-CL with different values of the parameter w = 5, 10, . . . , 25, 30 in terms of average rewards against stationary opponents (average over TFT, Pavlov and Bully).


In Figure 5.4 we depict the comparison in terms of average rewards for different sizes of the window of interaction. The results show that MDP-CL outperforms MDP4.5 for all values of w. For small sizes, w = 5 and 10, the difference between the two approaches is more noticeable than for larger sizes of w. In summary, when the window of interaction is small MDP-CL obtains better results than MDP4.5; however, with an appropriate amount of interaction (more than 20 steps in this domain), the two approaches obtain similar results. This is explained by the fact that decision trees are not the best model to use with limited data.

Now we move towards more general domains; in particular, the next section presents experiments with more than one opponent.

5.2.6 Increasing the number of opponents

In previous experiments we used only one opponent in the environment. However, there are several domains with more than one opponent, and we now present experiments showing that our approach (MDP-CL) still works in those cases. The domain chosen for these experiments is the multiagent version of the prisoner's dilemma. The environment consisted of three agents: one learning agent and two opponents who used generalized versions of the Bully, TFT and Pavlov strategies. The agents were given 1 resource unit and k = 0.5 (this means that the best score is 1.0 and the worst is -1.0). The Bully strategy never gives its unit to the group. TFT gives its unit when at least one other player contributed one unit in the previous round; otherwise it keeps its unit. Pavlov contributes its unit whenever both remaining players performed the same action in the previous round; otherwise it keeps its unit.

The setting was designed to evaluate all possible combinations of different strategies given that two opponents are available. The game lasted 7200 rounds, and every 400 rounds there was a change of strategy in one of the two opponents. We decided to test MDP-CL(DE) (with softmax drift exploration), since it shows a faster learning time and better scores on average than MDP4.5. We compare against two algorithms: R-max (Section 2.2.3), used as a baseline, and the perfect agent, which best responds immediately (used as an upper bound).

best responds immediately (used as an upper bound).

Note that in this case the state space representation for MDP-CL is enlarged in order to take into account that there are two opponents in the environment. In more detail, the state space is now formed by the last action of both opponents and of our learning agent (see an example in Fig. 5.5 (a)). The action space remains with two actions: giving the resource unit to the group (similar to cooperate) or keeping the unit (similar to defect).

In Figure 5.5 (b) and (c) we depict the immediate and cumulative rewards for the learning agents (average of 100 iterations). In Figure 5.5 (b) we note that MDP-CL(DE) is capable of converging to the best score after detecting and learning the opponents' models. In contrast, R-max obtains suboptimal scores for different combinations of opponents, since it is not capable of adapting.


Figure 5.5: (a) MDP representation in the multiagent version of the prisoner's dilemma with two opponents; the state space increases according to the number of opponents. (b) Immediate and (c) cumulative rewards of MDP-CL(DE), R-max and the perfect agent (which best responds immediately to switches) in the multiagent prisoner's dilemma with 3 agents. At least one of the opponents changes its strategy every 400 rounds.


In Figure 5.5 (c) we can observe that with every switch MDP-CL needs a detection and learning phase, which results in not obtaining the optimal score for a certain period. However, after this period it obtains the optimal policy.

These results indicate that MDP-CL(DE) is useful in domains with more than one opponent in

the environment. However, one limitation is that it may not scale properly. For example, the state

space will increase exponentially with the number of agents; in particular, |S| = |A|^I, where A is the

action space and I is the number of agents in the environment. Some ideas about how to solve this

limitation are presented in Section 6.4 and are left as future work.

5.2.7 Summary

This section presented diverse experiments on MDP4.5 and MDP-CL against a technique for non-stationary environments. We also presented an initial version of drift exploration and promising results with MDP-CL in domains with more than one opponent. The conclusions are that MDP-CL and MDP4.5 are online approaches which do not need to know beforehand the number of opponent strategies and which compute their policy faster than HM-MDPs. Adding drift exploration to MDP-CL provides better scores, since it helps to detect some switches that would not be perceived without it.

5.3 A priori and incremental MDP-CL

We have presented results comparing MDP-CL and MDP4.5 against state-of-the-art algorithms in non-stationary reinforcement learning. Now we present experiments for the extensions of MDP-CL: a priori MDP-CL and incremental MDP-CL.

5.3.1 Setting and objectives

We compare the a priori and incremental versions against the original MDP-CL in terms of performance (average utility over the repeated game) and quality of the learned models (the prediction of the opponent's next action at each round compared with the real value, averaged over a number of repetitions). Experiments were performed on the iPD (one opponent).

First, we present how the TVD (equation 4.5) behaves under non-stationary opponents, showing that it can be used to efficiently compare models. Second, we present empirical results showing that prior information increases the cumulative rewards of the learning agent and provides a better prediction of the opponent model. Third, we show the advantages of incremental MDP-CL in case the opponent reuses a previous strategy. Finally, we relax the assumption of having the complete set of models used by the opponent and instead assume a set of noisy models that are an approximation


Figure 5.6: Total variation distance of different prior models compared with the real one (0 means they are the same model; totally dissimilar models give a value of 1) using a priori MDP-CL. The opponent is TFT-Bully, switching in the middle of the game.

of the real ones. We first provide qualitative results based on a simple example and then quantitative

results with comparative data.

5.3.2 Model selection in a priori MDP-CL

Here, we present how the TVD behaves against a switching opponent (TFT-Bully) that switches from one strategy (TFT) to another (Bully) in the middle of the game. The game consisted of 300 rounds, and our agent is given the set of opponent strategies {Bully, TFT, Pavlov} as prior information.

In Figure 5.6 the TVD of each strategy compared to the currently learned model is depicted for each round of the repeated game. From the figure we can observe that from round 5 the TFT model is the one with the lowest distance (a value of zero means the models are identical), which is in fact the one used by the opponent. At round 150 the opponent changes its strategy to Bully and two things happen: the TVD with respect to Bully decreases and the TVD with respect to TFT increases. Before round 200 the learned model reaches perfect similarity with the correct model. This figure shows how the TVD is able to efficiently provide a score that identifies which model is the one used by the opponent. The next section shows the improvement of using a priori models in quantitative terms against different switching opponents.
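As an illustration of this comparison step, the sketch below computes a TVD-style distance between a model learned online and two candidate prior models; the exact distance is the one defined by equation 4.5, while the model representation, helper names and numbers here are only assumptions for the example.

    # Minimal sketch, assuming each opponent model maps a state (the last joint
    # action) to a distribution over the opponent's next action, and that model
    # distance is the total variation distance averaged over the learned states.

    def tvd(p, q):
        """Total variation distance between two discrete distributions (dicts)."""
        actions = set(p) | set(q)
        return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in actions)

    def model_distance(learned, prior):
        """Average TVD over the states observed in the learned model."""
        return sum(tvd(learned[s], prior.get(s, {})) for s in learned) / len(learned)

    # Hypothetical prior models; states are (agent action, opponent action).
    tft = {("C", "C"): {"C": 1.0}, ("C", "D"): {"C": 1.0},
           ("D", "C"): {"D": 1.0}, ("D", "D"): {"D": 1.0}}
    bully = {s: {"D": 1.0} for s in tft}

    # A model estimated from play (made-up counts, close to TFT).
    learned = {("C", "C"): {"C": 0.95, "D": 0.05}, ("C", "D"): {"C": 0.9, "D": 0.1},
               ("D", "C"): {"D": 1.0}, ("D", "D"): {"D": 1.0}}

    for name, prior in (("TFT", tft), ("Bully", bully)):
        print(name, round(model_distance(learned, prior), 3))  # lowest distance wins

In this toy example the distance to TFT is close to zero while the distance to Bully is large, mirroring the behavior shown in Figure 5.6.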

5.3.3 Rewards and quality in a priori MDP-CL

In this section MDP-CL and a priori MDP-CL are compared in terms of cumulative rewards and quality of the current model. We illustrate the results with the same opponent as before.


Figure 5.7: (a) Immediate rewards and (b) cumulative rewards of MDP-CL and a priori MDP-CL against the opponent TFT-Bully, which switches in the middle of the iPD.

In Figure 5.7 we depict the (a) immediate and (b) cumulative rewards (averaged over 10 iterations) of MDP-CL and a priori MDP-CL against the TFT-Bully opponent. In (a) we note that in the first 15 rounds of the interaction the difference in rewards is not noticeable, since both approaches are exploring (learning/detecting the opponent model). However, from round 15 to 40 a priori MDP-CL increases its rewards, since it already knows which model is the correct one and can exploit it. In contrast, MDP-CL needs a longer exploration period to correctly determine the opponent model. This pattern is repeated when the opponent performs a switch (round 150). In (b) we can see that the gap in cumulative rewards grows each time the opponent switches, because of the faster detection of a priori MDP-CL.

With respect to the quality of the model, Figure 5.8 depicts the quality of the predictions made by MDP-CL and a priori MDP-CL (averaged over 10 iterations) against the TFT-Bully opponent. It is easy to note that, since MDP-CL needs to complete an exploration phase of a certain size (in this case 40 rounds), it does not have a correct model until that round (having a quality of 0). In contrast, a priori MDP-CL always achieves better quality in fewer interactions.

The previous example illustrates the benefits of a priori MDP-CL; we now compare MDP-CL and a priori MDP-CL against different switching opponents (that switch in the middle of a repeated game of 300 rounds). Results are shown in Table 5.7, where AvgR(A) represents the rewards of the learning agent and AvgR(Opp) those of the opponent. Each row averages 50 repetitions. From the table we can observe that for all opponents, a priori MDP-CL obtained statistically significantly better results than MDP-CL (using the Wilcoxon signed-rank test, 5% significance level), which means a faster detection and an earlier exploitation of the opponent model.


Figure 5.8: Model quality (a value of 1.0 means the model perfectly predicts the next state) of MDP-CL and a priori MDP-CL (50 trials) against the opponent TFT-Bully, which switches strategies in the middle of the interaction.

Table 5.7: Average rewards of the learning agents AvgR(A) (MDP-CL and a priori MDP-CL) and the non-stationary opponent AvgR(Opp). The symbol * indicates statistical significance using the Wilcoxon signed-rank test.

                       MDP-CL                  A priori MDP-CL
Opponent        AvgR(A)   AvgR(Opp)      AvgR(A)   AvgR(Opp)
Bully-Pavlov    1.74      2.03           1.89*     1.88
Bully-TFT       0.93      1.20           0.99*     1.03
Pavlov-Bully    1.79      2.12           1.89*     2.11
Pavlov-TFT      2.88      2.86           2.96*     2.95
TFT-Bully       1.76      2.17           1.86*     2.23
TFT-Pavlov      2.87      2.87           2.94*     2.94
Average         2.00      2.21           2.09      2.19


Figure 5.9: (a) Difference in cumulative rewards between incremental MDP-CL and MDP-CL against the opponent TFT-Bully-Pavlov-Bully; rewards increase against the second Bully. (b) Total variation distance between the learned model and the noisy representations of TFT and Bully (values close to zero represent more similarity with the real model) while using a priori MDP-CL.

5.3.4 Incremental models

Now we relax the assumption of starting the interaction with a set of known models. Thus, the algorithm needs to learn these models through interaction, which is the objective of incremental MDP-CL. Moreover, it should detect a switch to a known strategy faster than learning it from scratch. In Figure 5.9 (a) we depict the difference between the cumulative rewards of incremental MDP-CL and MDP-CL against the opponent TFT-Bully-Pavlov-Bully (which changes from one strategy to another every 150 rounds) in a game of 600 rounds. We selected this opponent since it uses the Bully strategy on two occasions during the interaction. From the figure we can observe that from round 0 to 450 the score stays around 0; this means there is no difference in rewards between the approaches, since both are learning the models TFT, Bully and Pavlov. Starting from round 450, the opponent returns to the Bully strategy, which it had used before (rounds 0 to 150); incremental MDP-CL therefore has this model in its memory, and detecting it (approximately 20 rounds) is faster than relearning it (as the original MDP-CL does). This is the reason why incremental MDP-CL increases its rewards after round 470. This example shows how keeping a record of models increases the rewards when the opponent reuses one of those previous models.
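A minimal sketch of this bookkeeping is given below, assuming models are stored as per-state action distributions and compared with an averaged total variation distance as in the earlier sketch; the class name, threshold value and structure are illustrative, not the thesis implementation.

    # Minimal sketch: keep every opponent model learned so far and, when a new
    # model has just been learned, check whether it matches a stored one before
    # treating the opponent strategy as unseen.

    def model_distance(m1, m2):
        def tvd(p, q):
            acts = set(p) | set(q)
            return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in acts)
        states = set(m1) | set(m2)
        return sum(tvd(m1.get(s, {}), m2.get(s, {})) for s in states) / len(states)

    class ModelMemory:
        """Record of previously learned opponent models (incremental idea)."""

        def __init__(self, threshold=0.1):   # acceptance threshold (illustrative)
            self.threshold = threshold
            self.models = []                 # list of (name, model) pairs

        def store(self, name, model):
            self.models.append((name, model))

        def match(self, fresh_model):
            """Return the name of a stored model close to the fresh one, if any."""
            best = min(self.models, default=None,
                       key=lambda nm: model_distance(fresh_model, nm[1]))
            if best and model_distance(fresh_model, best[1]) <= self.threshold:
                return best[0]   # reuse the stored model and its policy
            return None          # unknown strategy: keep the fresh model

When the opponent returns to Bully, the freshly learned model matches the stored one after only a few observations, which is consistent with the roughly 20 rounds reported above.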


5.3.5 A priori noisy models

As a final experiment we now assume a set of strategies that are approximately similar to the real ones. The objective is to analyze what happens when the given models are not perfect, which may happen due to error (noise) or because the opponent is a human or uses a hybrid (mixed) strategy. In order to include noise in the models, we changed two transitions of the MDPs (that represent each strategy) to random values. Now, a priori MDP-CL starts with a set of noisy models of {TFT, Pavlov, Bully}.

In Figure 5.9 (b) we depict the TVD of a priori MDP-CL against the TFT-Bully opponent that switches in the middle of the game. The TVD of the non-noisy and noisy models is depicted in the figure. We can observe that the TVD is still capable of detecting which model is the correct one (values closer to zero) even in the presence of noisy models. The difference is that in this case the TVD does not reach the perfect score, since our models are not exact; here the best score is close to 0.2 (instead of 0.0).

In order to use the a priori and incremental algorithms when starting with a set of noisy models, we only need to adjust the ρ parameter in Algorithm 4.2 to the desired value.
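Concretely, the adjustment amounts to relaxing the acceptance threshold used when matching models; the following minimal sketch uses the parameter name ρ from Algorithm 4.2, with purely illustrative numbers:

    # Minimal sketch: with exact priors, a learned model is accepted when its
    # distance to a prior is essentially zero; with noisy priors the threshold
    # rho is raised so that distances around 0.2 (as in Figure 5.9 (b)) still
    # count as a match. Values are illustrative.

    def matches_prior(distance, rho):
        """Accept a prior model whose distance to the learned model is at most rho."""
        return distance <= rho

    print(matches_prior(0.18, rho=0.05))   # False: too strict for noisy priors
    print(matches_prior(0.18, rho=0.25))   # True: relaxed threshold accepts it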

5.3.6 Summary

This section presented results for the case in which the set of opponent strategies is available before the interaction. Results show that a priori MDP-CL can reduce the learning time and increase the total rewards. We also showed the advantages of incremental MDP-CL when the opponent reuses a strategy. Finally, we showed that these algorithms still work when the initial set of models is noisy (not exactly the same as the real strategies). The next section presents experiments with respect to drift exploration and the R-max# algorithm.

5.4 Drift exploration

In Section 5.2.4 we presented initial experiments on drift exploration in MDP-CL. In this section we include new experiments with our R-max# and R-max#CL proposals. Comparisons are performed against FAL, WOLF-PHC and R-max.

5.4.1 Settings and objectives

In this section, we compare our proposals that use drift exploration, MDP-CL(DE) (Section 4.3.1), R-max# (Section 4.3.2) and R-max#CL (Section 4.4.1), against MDP4.5, MDP-CL, R-max [Brafman


and Tennenholtz, 2003], FAL [Elidrisi et al., 2012], WOLF-PHC [Bowling and Veloso, 2002] and an omniscient (perfect) agent that best responds immediately to switches. Results are compared in terms of average utility over the repeated game. Experiments were performed in two different scenarios.

• Iterated prisoner's dilemma (with the values presented in Table 5.1). We used TFT, Pavlov and Bully (Section 2.3.2) as opponent strategies. We emulate two different scenarios. In the first, the opponent switches strategies deterministically every 100 rounds. The second scenario proposes a more realistic opponent with non-deterministic switching times: we model a probabilistic switching opponent that can switch strategies at any round (a sketch of this switching process is given after this list). Here, each repeated game consists of 1000 rounds. At every round the opponent either continues using the current strategy with probability 1 − η, or with probability η switches to a new strategy (drawn randomly from the set of strategies mentioned before). We used switching probabilities η = 0.02, 0.015, 0.01, which translate to 10 to 20 strategy switches in expectation per repeated game; these values are enough to represent different behaviors during an interaction without being so frequent that the learning algorithms cannot learn the opponent models correctly.

• A negotiation task (Section 5.1.3). The opponent uses two different strategies: i) a fixed price strategy, where the seller uses a fixed price Pf for the complete negotiation, and ii) a flexible price strategy, where the seller initially values the object at Pf but after round 2 is more interested in selling it, so it will accept an offer Pl < Pf. We represent that strategy by Pl = {x → y}. For example, the optimal policy against Pf = {8} is to offer 8 in the first round (recall that the game is discounted by γ_i in every round, so it is better to buy/sell sooner rather than later, and the buyer values the object at 10), receiving a reward of 2. However, against Pl = {8 → 6} the optimal policy is to wait until the third round to offer 6, receiving a reward of 4γ_i^2.
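The sketch below illustrates the probabilistic switching process of the first scenario; it only simulates which strategy is active at each round (the strategies themselves are just names here), and the function and seed are illustrative assumptions.

    import random

    # Minimal sketch of the probabilistic switching opponent: at every round it
    # keeps its current strategy with probability 1 - eta, or switches to a new
    # one drawn at random with probability eta.
    STRATEGIES = ["TFT", "Pavlov", "Bully"]

    def switching_schedule(eta, rounds=1000, seed=0):
        """Return the strategy that is active at every round."""
        rng = random.Random(seed)
        current = rng.choice(STRATEGIES)
        schedule = []
        for _ in range(rounds):
            if rng.random() < eta:
                # switch to a different strategy, drawn uniformly at random
                current = rng.choice([s for s in STRATEGIES if s != current])
            schedule.append(current)
        return schedule

    for eta in (0.02, 0.015, 0.01):
        sched = switching_schedule(eta)
        switches = sum(a != b for a, b in zip(sched, sched[1:]))
        print("eta =", eta, "->", switches, "switches")  # roughly eta * 1000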

In all cases the opponents were non-stationary in the sense that they used different strategies within a single repeated interaction. We present experiments against deterministic switching opponents (Sections 5.4.2, 5.4.3 and 5.4.4) and probabilistic switching opponents (Sections 5.4.2 and 5.4.5). We compare how drift exploration in MDP-CL(DE) improves the results over MDP-CL (Section 5.4.3), and how the parameters affect R-max# (Section 5.4.4) and R-max#CL (Section 5.4.5). Results in bold denote the best scores in the tables. Statistical significance is denoted with the * and † symbols.

5.4.2 Drift and non-drift exploration approaches

In this section, we present a summary of the results for the two domains in terms of average rewards. The results show that approaches with drift exploration obtained better results than those without it.

(Note on WOLF-PHC: to simulate a form of drift exploration, a constant ε-greedy exploration was used and the learning rate was not decayed.)


Table 5.8: Average rewards of the proposed algorithms against an opponent with a probability η of changing to a different strategy at any round in the iPD domain. * and † represent statistical significance of R-max#CL with respect to MDP-CL and R-max#, respectively. The perfect agent best responds immediately after a switch and gives an upper bound on the maximum value that can be obtained.

Algorithm / η    0.02      0.015     0.01      Drift Exp.
Perfect          2.323     2.319     2.331     -
R-max#CL         2.051*†   2.079*†   2.086†    Yes
MDP-CL(DE)       1.944     1.988     2.046     Yes
R-max#           1.691     1.709     1.725     Yes
WOLF-PHC         1.628     1.627     1.629     Yes
MDP-CL           1.696     1.790     1.841     No
MDP4.5           1.681     1.782     1.839     No
FAL              1.625     1.658     1.725     No

In the iPD we compared our proposals that use drift exploration, R-max# (τ = 55), MDP-CL(DE) (w = 30, Boltzmann exploration) and R-max#CL (τ = 90, w = 50), against the state-of-the-art approaches MDP-CL (w = 30), MDP4.5 (w = 30), FAL and WOLF-PHC (δ_w = 0.3, δ_l = 2δ_w, α = 0.8, ε-greedy exploration), and against the perfect agent that knows exactly when the opponent switches and best responds immediately. In Table 5.8 we summarize the results, showing the average rewards obtained by each agent against the probabilistic switching opponent for different values of η (switch probability). All scores were obtained using the best parameters for each algorithm, and the results shown are averages over 100 iterations. In all cases, R-max#CL obtained better scores than the rest. An * indicates statistical significance against MDP-CL(DE) and a † against R-max#. MDP-CL(DE) obtains good results since it exploits the model and uses drift exploration; however, note that we fed it with the perfect window w so it could remain competitive. Using only R-max# is not as good, since it explores continuously but does not properly exploit the learned model; recall that R-max# will re-learn a model even when the opponent does not change its strategy (which may result in suboptimal rewards). WOLF-PHC shows almost the same performance for the different switch probabilities, but its results are far from the best. MDP-CL and MDP4.5 obtained better results than FAL, but since none of them uses drift exploration they are not as good as our proposed approaches with drift exploration.

We performed a similar analysis for the negotiation domain. Table 5.9 shows the average rewards and the percentage of successful negotiations obtained by each learning agent against a switching opponent. In this case each interaction consists of 500 negotiations. The opponent uses 4 strategies (Pf = {8}, Pf = {9}, Pl = {8 → 6}, Pl = {9 → 6}); the switching round and strategy were drawn from a uniform


Table 5.9: Average rewards and percentage of successful negotiations of the proposed algorithms against a probabilistic non-stationary opponent in the negotiation domain. The perfect agent best responds immediately after a switch and gives an upper bound on the maximum value that can be obtained.

Algorithm     AvgR(A)   SuccessRate   Drift Exp.
Perfect       2.70      100.0         -
R-max#CL      1.95      74.9          Yes
R-max#        1.91      70.5          Yes
MDP-CL(DE)    1.73      82.0          Yes
WOLF-PHC      1.71      88.5          Yes
MDP-CL        1.70      85.5          No
R-max         1.67      90.6          No

distribution. R-max#CL obtained the best scores in terms of reward, and R-max# obtained the second-best rewards. In this domain MDP-CL(DE) and WOLF-PHC take more time to detect the switch and adapt accordingly, obtaining lower rewards; however, they have a higher percentage of successful negotiations than the R-max# approaches. In this domain, we note that not using drift exploration (MDP-CL and R-max) results in failing to adapt to non-stationary opponents, which leads to suboptimal rewards.

These results show the importance of performing drift exploration in different domains. In the next sections, we present a detailed analysis of MDP-CL(DE), R-max# and R-max#CL.

5.4.3 Further analysis of MDP-CL(DE)

The MDP-CL framework does not use drift exploration, which results in failing to detect some types of switches. We present two examples where MDP-CL(DE) is capable of detecting those switches. In Figure 5.10 (a) we depict the cumulative rewards against a Bully opponent that switches to TFT at round 100 (deterministically). In the first 100 rounds, the figure shows a slight cost associated with the drift exploration of MDP-CL(DE). After round 100, MDP-CL(DE) increases its rewards considerably, since the agent has learned the new opponent strategy (TFT) and has updated its policy. In the negotiation domain a similar behavior happens when the opponent starts with a fixed price strategy (Pf = {8}) and switches at round 100 (deterministically) to a flexible price strategy (Pl = {8 → 6}). In Figure 5.10 (b) we depict the immediate rewards of MDP-CL with and without drift exploration, together with the rewards of a perfect agent that best responds at every round. The figure shows that MDP-CL is not capable of detecting the strategy switch: from rounds 50 to 400 it uses the same action and therefore obtains the same reward. In contrast, MDP-CL(DE) explores with different actions (and therefore seems unstable) and, due to the drift exploration, is capable of detecting


Figure 5.10: On top of each figure we depict how the opponent changes between strategies during the interaction. (a) Cumulative rewards of MDP-CL (w = 25) with and without drift exploration; the opponent is Bully-TFT, switching at round 100. (b) Immediate rewards of MDP-CL, MDP-CL(DE) and a perfect agent (that best responds immediately after the switch) against a non-stationary opponent in the alternating-offers bargaining domain.

the strategy switch. However, it needs several rounds to relearn the optimal policy, after which it starts increasing its rewards at approximately round 175. After this round, the rewards keep increasing and eventually converge around a value of 3.0. This occurs because, even when MDP-CL(DE) is using the optimal policy, it keeps exploring.

In Table 5.10 we present results in terms of average rewards (AvgR) and percentage of successful negotiations (SuccessRate) while varying the ε parameter (of ε-greedy used as drift exploration) from 0.1 to 0.9 in MDP-CL(DE) in the negotiation domain. We used w = 35 since it obtained the best scores (we evaluated w = {20, 25, ..., 50}), and a * represents statistical significance (using the Wilcoxon rank-sum test, 5% significance level) with respect to MDP-CL. These results indicate that using a moderate drift exploration (0.1 ≤ ε ≤ 0.5) increases the average rewards. A higher value of ε causes too much exploration; the agent then cannot exploit the optimal policy and the results are worse than not using drift exploration. On the one hand, increasing ε improves the average rewards only for the successful negotiations; this happens because drift exploration causes the switch to be detected, updating the immediate reward to 4.0 (after the switch). On the other hand, the number of successful negotiations is reduced with high values of ε. This is the common exploration-exploitation trade-off, which is why moderate ε values, in the range 0.3-0.5, obtain the best average rewards.
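A minimal sketch of ε-greedy used as drift exploration is shown below; the function and the toy policy are illustrative assumptions, the point being only that the agent keeps deviating from its learned policy with probability ε even after convergence.

    import random

    # Minimal sketch: follow the policy computed from the current opponent model,
    # but with probability epsilon pick a random action so that rarely visited
    # state-action pairs are re-sampled and silent opponent switches can surface.

    def drift_explore_action(state, greedy_policy, actions, epsilon, rng=random):
        if rng.random() < epsilon:
            return rng.choice(actions)   # undirected re-exploration
        return greedy_policy[state]      # exploit the learned opponent model

    # Toy policy that always defects against what it believes is a Bully opponent.
    policy = {("C", "D"): "D", ("D", "D"): "D"}
    print(drift_explore_action(("D", "D"), policy, ["C", "D"], epsilon=0.3))

As the table above indicates, the value of ε trades off detection speed against the cost of deviating from the optimal policy.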

These results show that adding a general drift exploration (for example, ε-greedy exploration) helps to detect switches in the opponent that would otherwise pass unnoticed. However, as


Table 5.10: Comparison of MDP-CL and MDP-CL(DE) while varying the parameter ε (using ε-greedy as drift exploration) in terms of average rewards (AvgR) and percentage of successful negotiations (SuccessRate). * indicates statistical significance of MDP-CL(DE) over MDP-CL using the Wilcoxon rank-sum test.

ε         AvgR(A)   SuccessRate
0.1       1.796*    92.1
0.2       1.782*    91.0
0.3       1.776*    88.9
0.4       1.801*    86.1
0.5       1.753     83.5
0.6       1.694     80.7
0.7       1.619     78.2
0.8       1.506     73.7
0.9       1.385     68.1
Average   1.679     82.4
MDP-CL    1.726     88.9

we said before, parameters such as the window size and threshold (of MDP-CL) and ε (for drift exploration) need to be tuned in order to efficiently detect switches. In the next section, we analyze R-max#.

5.4.4 Further analysis of R-max#

We proposed another way of performing drift exploration using an implicit approach. R-max# has two main parameters: m, which determines when a state-action pair is considered known (as in R-max), and τ, which controls how many rounds must pass without an update for a state-action pair to be considered forgotten. First we analyze the effect of τ. In Figure 5.11 (a) we present the cumulative rewards of R-max (dotted straight line) and R-max# with τ = 5 (thick line) and τ = 35 (solid line) against a Bully-TFT opponent; we used m = 2 for all the experiments since it obtained the best scores. For R-max#, τ = 5 makes the agent explore continuously, causing a decrease in rewards from rounds 20 to 100. However, from round 100 the rewards immediately increase, since the agent detects the strategy switch. Increasing the τ value (τ = 35) reduces the drift exploration and also reduces the cost in rewards before the switch (at round 100). However, it also impacts the total cumulative rewards, since it takes more time to detect the switch (and learn the new model). Here we see an important trade-off when choosing τ: a small value causes continuous exploration, which quickly detects switches but has a cost before the switch occurs; on the contrary, a large τ reduces the cost of exploration, so the switch takes more time to be noticed. It is important to note that R-max# is capable of detecting the switch in strategies, as opposed to R-max, which shows a linear result since it keeps


Figure 5.11: On top of each figure we depict how the opponent changes between strategies during the interaction. (a) Cumulative rewards against the Bully-TFT opponent in the iPD using R-max# and R-max. Immediate rewards of R-max# with (b) τ = 100, (c) τ = 60 and (d) τ = 140, together with a perfect agent which best responds to the opponent, in the negotiation domain.


acting against a Bully opponent when in fact the opponent is TFT.
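The following minimal sketch illustrates the forgetting mechanism behind this implicit drift exploration; it assumes, as in R-max, that a pair becomes known after m visits, and the class, optimistic value and bookkeeping details are illustrative rather than the exact thesis implementation.

    # Minimal sketch of tau-based forgetting: state-action pairs that have not
    # been updated for tau rounds are reset to "unknown", which makes the agent
    # value them optimistically and re-explore them (implicit drift exploration).

    R_MAX = 3.0   # optimistic reward for unknown pairs (illustrative value)

    class RmaxSharpCounts:
        def __init__(self, m=2, tau=55):
            self.m, self.tau = m, tau
            self.visits = {}        # (state, action) -> number of visits
            self.last_update = {}   # (state, action) -> round of last visit

        def observe(self, state, action, t):
            key = (state, action)
            self.visits[key] = self.visits.get(key, 0) + 1
            self.last_update[key] = t

        def forget_stale_pairs(self, t):
            """Reset pairs not updated in the last tau rounds."""
            for key, last in list(self.last_update.items()):
                if t - last >= self.tau:
                    self.visits.pop(key, None)     # pair becomes unknown again
                    del self.last_update[key]

        def is_known(self, state, action):
            return self.visits.get((state, action), 0) >= self.m

        def optimistic_reward(self, state, action, real_reward):
            return real_reward if self.is_known(state, action) else R_MAX

A small τ triggers this reset often (continuous re-exploration), while a large τ rarely does, which is exactly the trade-off discussed above.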

In Figure 5.11 we depict the immediate rewards of R-max# with (b) τ = 100, (c) τ = 60 and (d) τ = 140 against the opponent that changes from a fixed-price to a flexible-price strategy in the negotiation domain. In these figures we note that our agent starts with an exploratory phase which finishes at approximately round 25; it then uses the optimal policy to obtain the maximum possible reward. The opponent switches to a different strategy at round 100. τ = 100 (Figure 5.11 (b)) means that state-action pairs that have not been updated after 100 rounds will be reset. The agent then computes a new policy that re-explores; this phase occurs from rounds 105 to 145. Now the optimal policy is different (since the opponent has changed) and R-max# obtains a better reward. This case shows how τ = 100, which was deliberately chosen to match the switching time of the opponent, allows the agent to adapt to switching opponents and learn the optimal policy against each different strategy. In Figure 5.11 (c) and (d) we depict what happens when τ does not match the switching time. Using τ = 60 means that the agent re-explores the state space more frequently, which explains the decrease in rewards from rounds 65 to 95 and 125 to 160 (since both are exploratory phases). The first exploratory phase (starting at round 65) shows what happens when no change has occurred: the agent returns to the same optimal policy. The second exploration phase shows a different result: starting from round 160 the agent has updated its optimal policy and can exploit it. Using τ = 140, Figure 5.11 (d), means less exploration, which can be seen in the stability from rounds 25 to 145; from that point to round 190 the agent re-explores, updating its optimal policy at round 190.

In Table 5.11 we present different scores obtained with R-max# for different τ values against the switching opponent in the negotiation domain; an * indicates statistical significance with respect to R-max (using the Wilcoxon rank-sum test). A small τ value is not enough to learn an optimal policy, which results in a low number of successful negotiations. In contrast, a high τ provides too little exploration and takes too much time to detect the switch, which results in lower rewards. In this case a τ between 80 and 120 yields the best scores, because it is enough to learn an appropriate opponent model while providing enough exploration to detect switches. R-max obtained the best score in successful negotiations because it learns an optimal policy and uses it for the complete interaction, even though it is a suboptimal policy in terms of rewards for half of the interaction.

Comparison with R-max and WOLF-PHC. The previous experiments used only two strategies; we now increase to 4 different strategies in the negotiation domain to compare R-max#, R-max and WOLF-PHC. The strategies are (i) Pf = {8}, (ii) Pl = {8 → 6}, (iii) Pf = {9} and (iv) Pl = {9 → 6}. The opponent switches every 100 rounds to the next strategy. In Fig. 5.12 we depict (a) the immediate rewards and (b) the cumulative rewards of R-max#, R-max and the perfect agent over 400 negotiations. Each opponent strategy is represented by a numbered zone (I-IV). Each curve is the


Figure 5.12: On top of each figure we depict how the opponent changes among strategies during the interaction. (a) Immediate and (b) cumulative rewards of R-max# (τ = 90) and R-max in the alternating-offers domain. (c) Immediate and (d) cumulative rewards of R-max# and WOLF-PHC.


Table 5.11: Comparison of R-max and R-max# with different τ values in terms of average rewards (AvgR) and percentage of successful negotiations (SuccessRate). * indicates statistical significance with respect to R-max.

τ         AvgR(A)   SuccessRate
20        1.101     61.2
40        1.375     61.4
60        1.643     71.6
80        2.034*    77.1
90        2.163*    80.4
100       2.164*    80.9
110       2.043*    81.0
120       1.942*    81.0
140       1.746     80.8
160       1.657     83.5
180       1.736     88.9
Average   1.782     77.0
R-max     1.786     90.6

average of 50 iterations.

In zone I (Figure 5.12 (a) and (b)), from rounds 0 to 25, R-max# and R-max explore and obtain low rewards; after this exploration phase they have an optimal policy and use it to obtain the maximum possible reward at that time (2.0). In zone II (at round 100) the opponent switches its strategy; R-max does not detect this switch and keeps using the same policy. Since R-max# re-explores the state space (rounds 95 to 135), it obtains a new optimal policy, which in this case yields a reward of 4.0. We can see that during the second exploration phase of R-max# the cumulative rewards are lower than those of R-max. However, after updating its policy, the cumulative rewards start increasing, and from round 170 they are greater than those of R-max. At the start of zone III (round 200) the opponent switches its strategy and in this case R-max is capable of updating its policy; this happens because a new part of the state space is explored. But again, in zone IV (round 300), the opponent switches to a flexible strategy, which R-max fails to detect and exploit. In contrast, R-max# is capable of detecting all switches in the opponent strategy and reaching the maximum reward. This can easily be noted in the difference in cumulative rewards at the final round (534.9 for R-max and 789.2 for R-max#).

In Figure 5.12 (c) and (d) we depict the immediate and cumulative rewards of R-max# and WOLF-PHC. Even though WOLF-PHC is capable of adapting to non-stationary changes, in this case the action space is larger and it cannot adapt quickly to all changes. In terms of cumulative rewards WOLF-PHC obtained 647.3, which is better than R-max but still far from R-max#.


Table 5.12: Average rewards (AvgR) and percentage of successful negotiations (SuccessRate) of R-max#CL and R-max# with different τ values.

           R-max#                    R-max#CL (w = 50)
τ        AvgR(A)   SuccessRate    AvgR(A)   SuccessRate
60       1.791     70.1           1.697     70.1
80       1.833     69.5           1.919     70.7
90       1.716     65.9           1.898     70.3
100      1.910     70.5           1.948     71.9
110      1.875     67.3           1.947     71.1
120      1.767     66.9           1.943     72.8
140      1.847     68.3           2.110     75.1
Average  1.820     68.4           1.923     71.7

5.4.5 Efficient exploration + switch detection: R-max#CL

Previously we analyzed the effect of adding a general drift exploration to MDP-CL; now we experiment with R-max#CL, which provides an efficient drift exploration together with the MDP-CL framework. First, we test the approach against an opponent that switches strategies deterministically every 100 rounds in the iPD in the following way: TFT-Bully-Pavlov-Bully-TFT-Pavlov-TFT, which covers all possible ordered pairs of the three strategies used. The duration of the repeated game is 700 rounds. The immediate rewards (averaged over 10 iterations) are depicted in Figure 5.13. As a comparison, the rewards of a perfect agent that best responds immediately to the switches are depicted as a dotted line. In Figure 5.13 (a) MDP-CL shows a stable behavior, since it learns the opponent model and immediately exploits it; but since it lacks drift exploration it fails to detect some strategy switches (for example, the change from Bully to TFT at round 400 in Fig. 5.13 (a)). In contrast, R-max# shows peaks throughout the complete interaction since it is continuously exploring. Its advantage is that it detects all strategy switches, such as Bully-TFT. In Figure 5.13 (c) we depict the results for WOLF-PHC, which in this domain is capable of adapting to the changes, although slowly. In Figure 5.13 (d) we depict the immediate rewards of R-max#CL with w = 50 and τ = 90 (since these parameters obtained the best scores). This approach is capable of learning and exploiting the opponent model while keeping a drift exploration that enables the agent to detect switches. This experiment clearly shows the strengths of our approach: an efficient drift exploration and a rapid adaptation, which result in higher utilities.

Lastly, we show how R-max#CL fares against a randomly switching opponent in the negotiation domain. The interaction consists of 500 negotiations and the opponent uses 4 strategies (Pf = {8}, Pf = {9}, Pl = {8 → 6}, Pl = {9 → 6}); the switching round and strategy were drawn from a uniform


Figure 5.13: On top of each figure we depict the opponent TFT-Bully-Pavlov-Bully-TFT-Pavlov-TFT that

switches every 100 rounds in the iPD. Immediate rewards of (a) MDP-CL, (b) R-max#, (c) WOLF-PHC and

(d) R-max#CL.


distribution. We compared the R-max# and R-max#CL approaches while varying the τ parameter; for R-max#CL we used w = 50 since it obtained the best results. A summary of the experiments is presented in Table 5.12, where each value is the average of 50 iterations.

As a baseline we used R-max in this setting (not shown in the table). We varied the parameter m from 2 to 20 and selected the best score (m = 4); nonetheless its results were the worst, with average rewards of 1.675, far from the average rewards of 1.820 and 1.923 obtained by R-max# and R-max#CL, respectively. Moreover, almost all values gave statistically significantly better results than R-max. From Table 5.12 we can see that R-max#CL with almost any value of τ provides better results than R-max# against a randomly switching opponent (although the differences are not statistically significant, using the Wilcoxon rank-sum test, 5% significance level). This can be explained by the fact that R-max#CL provides an efficient drift exploration combined with a switch detection mechanism.

5.4.6 Summary

We tested two domains in which drift exploration is necessary to obtain an optimal policy, due to the non-stationary nature of the opponent's strategy. We presented scenarios where the use of switch detection mechanisms alone, as in MDP-CL, FAL or MDP4.5, was not enough to deal with switching opponents (Sections 5.4.2 and 5.4.3). When keeping a non-decreasing learning rate and exploration, WOLF-PHC is capable of adapting in some scenarios, but it does so slowly (Section 5.4.4). The general approach of drift exploration by means of ε-greedy or some softmax type of exploration solves the problem, since this exploration re-visits parts of the state space and eventually leads to detecting switches in the opponent strategy (Section 5.4.3). However, the main limitation is that these methods need to be tuned for each specific domain and are not very efficient, since they explore in an undirected way, without considering which parts of the state space need to be re-visited. Our approach, R-max#, which implicitly handles drift exploration, is generally better equipped to handle non-stationary opponents of different sorts. Its pitfall lies in its parameterization (parameter τ), which generally should be large enough to learn a correct opponent model, yet small enough to react promptly to strategy switches (Section 5.4.4). In realistic scenarios where we do not know the switching times of a non-stationary opponent, it is useful to combine both approaches, switch detection and implicit drift exploration, as in R-max#CL (Section 5.4.5).

5.5 DriftER

MDP-CL(DE) is capable of detecting switches, but it needs to wait for a window of w interactions. R-max# performs an implicit switch detection and similarly needs to wait for switches depending on its τ parameter. The joint approach of R-max#CL provides good empirical results; however, its


Figure 5.14: Upper value of the confidence interval over error probabilities (f_upper) of a learning algorithm with no switch detection (blue line) and DriftER (black line) against an opponent that changes between two strategies in the middle of the interaction (vertical bar); small arrows represent DriftER's learning phase after detecting the switch.

parameter tuning is time consuming. This section presents experiments performed with DriftER, an approach based on keeping track of the error of the opponent model. Additionally, it can check for switches at every step of the interaction. We present experiments in two domains: first in repeated games, such as the battle of the sexes, to analyze how DriftER behaves against non-stationary opponents; then DriftER is evaluated in the PowerTAC domain, particularly in the wholesale market, where it is compared against a previous champion of the competition and the MDP-CL approach.

5.5.1 Setting and objectives (repeated games)

A well-known game from GT is the battle of the sexes (BoS). This is a two-player coordination game, presented in Table 5.2 and Section 5.1.5. The opponent has different strategies to use, taken from the GT literature: pure Nash equilibria, mixed Nash equilibria and the minimax strategy (see Section 2.3). It can change from one to another during the interaction, and DriftER's goal is to adapt as fast as possible to these switches.

5.5.2 Switch detection

We start by comparing the behavior of a learning agent that does not include a switch detection mechanism and DriftER against a non-stationary opponent that switches in the middle of the interaction, in a repeated BoS game of 1500 rounds. The opponent has two possible actions (O, F); it starts with a mixed strategy of [0.9, 0.1] and changes to the pure strategy [0.0, 1.0] (which is the pure Nash equilibrium most beneficial to the opponent).

Figure 5.14 depicts the upper value of the confidence over the error (f_upper) against the switching opponent for a learning algorithm without a switch detection mechanism (thin line) and DriftER (thick



Figure 5.15: Fraction of times a switch was detected with different parameters of DriftER against a non-stationary opponent that changes from a maximin strategy to a mixed Nash strategy in round 750.

line). Assume the learning agents obtain an opponent model in the first 200 rounds; from that point they compute the error and confidence intervals. Since the opponent uses a stochastic policy before the switch, there is a prediction error throughout the interaction: the agent predicts the opponent will use one action (O), and with probability 0.1 the opponent chooses F. At round 750 (marked with a vertical line) the opponent changes to a pure Nash equilibrium action, which is to use F in every round. This consistently increases the prediction error of the agent without switch detection; in contrast, DriftER detects the switch and starts a learning phase (marked with small vertical arrows), after which its error decreases consistently.

DriftER parameter behavior

We now present how the parameter n (which controls the number of errors allowed) affects DriftER against switching opponents. The opponent starts with a mixed Nash equilibrium strategy [0.2, 0.8] and changes to a minimax strategy ([0.8, 0.2]) in the middle of a BoS game (with values v1 = 100, v2 = 25) of 1500 rounds. We performed experiments varying the parameter n with values {2, 4, 8, 20}; for each value of n we keep track of the round in which a switch was detected (100 iterations). In Figure 5.15 we depict a histogram showing the fraction of times a switch was detected in each interval of the game. From the figure we note that


choosing a small value (2 in this case) may cause switches to be detected erroneously (small red bars). A higher value reduces the errors and still obtains a fast switch detection. If we increase the value further, to 8, the errors are almost reduced to zero; however, it may take more rounds to detect the switch. A large value (20 in this case) increases the number of rounds required to detect a switch in the opponent's policy.
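To make the mechanism concrete, the sketch below tracks the empirical error of the opponent model together with an upper confidence bound and flags a switch after n consecutive increases of that bound; the Hoeffding-style bound, class name and exact test are assumptions for illustration, since the precise interval and rule are those defined in Chapter 4.

    import math

    # Minimal sketch in the spirit of DriftER: track the prediction error of the
    # opponent model plus an upper confidence bound on it, and declare a switch
    # after n consecutive rounds in which the bound keeps growing.

    def upper_error_bound(errors, total, delta=0.05):
        """Empirical error rate plus a Hoeffding-style confidence term."""
        return errors / total + math.sqrt(math.log(1.0 / delta) / (2.0 * total))

    class SwitchMonitor:
        def __init__(self, n=5):
            self.n = n
            self.errors = 0
            self.total = 0
            self.prev_bound = None
            self.increases = 0

        def update(self, prediction_was_wrong):
            """Feed one prediction result; return True if a switch is declared."""
            self.total += 1
            self.errors += int(prediction_was_wrong)
            bound = upper_error_bound(self.errors, self.total)
            if self.prev_bound is not None and bound > self.prev_bound:
                self.increases += 1      # the error estimate is drifting upwards
            else:
                self.increases = 0
            self.prev_bound = bound
            return self.increases >= self.n   # True -> relearn the opponent model

A small n reacts quickly but may raise false alarms, while a large n is conservative and slower, matching the behavior observed in Figure 5.15.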

In Section 5.6.3, we present a set of experiments using general-sum games and comparisons with

the rest of our approaches. However, we wanted to test our approach in a more realistic domain where

the environment is more complex and there is uncertainty. Thus, we first present DriftER in the

context of double auctions in PowerTAC.

5.5.3 Setting and objectives (double auctions)

First we present experiments showing how the policies need to adapt to non-stationary opponents in this domain. Then, experiments are presented against a non-stationary opponent that uses different fixed prices, and afterwards against noisy non-stationary opponents that use a probability distribution over an interval of possible limit prices. The opponent we designed is non-stationary in the sense that it uses two stationary strategies: it starts with a fixed limit price Pl and then, in the middle of the interaction, changes to a different (higher) fixed limit price Ph. The timestep at which the opponent switches is unknown to the other broker agent.

The MDP for all learning agents models the opponent with the following parameters: the number of states was set to |S| = 6, and the actions represent limit prices with values in {15, 20, 25, 30, 35}. The opponent started with Pl = 20 and then changed to Ph = 34. In the first case, the learning agent needs to bid using a price > 20 (25, 30 or 35) to win bids. Later, when the opponent uses a limit price of 34, the only bid that will be accepted by the producer is 35. Both the learning agent and the opponent have a fixed demand, which is greater than the average energy needed to supply all buyers.

We compare the performance of DriftER (with parameter n = 7) against TacTex-WM (the part of TacTex applied only to the wholesale market), the champion of the 2013 competition, and MDP-CL, which is not specific to PowerTAC but is designed for non-stationary opponents.

We present results in terms of average accuracy, confidence over the error rate, and profit. The learned MDP contains a transition function for each (s, a) pair; comparing the predicted next state with the real (experienced) state gives an accuracy value. At each timestep the agent submits nbids bids and its learned model predicts whether those bids will be cleared or not. When the timestep finishes, the agent receives feedback from the server and compares the predicted transactions with the real ones. The average of those nbids predictions is the average accuracy of the timestep; a value of 1.0 equates to perfect prediction. A similar measure is the confidence over the error rate, as described in Section 4.5.3. Finally, profit is defined




Figure 5.16: Upper value of the confidence interval over error probabilities (f_upper) of (a) TacTex-WM and (b) MDP-CL, compared with DriftER. The timeslot where the opponent switches strategies is denoted with a vertical line. DriftER is capable of detecting the switch and adapting its strategy to the new opponent strategy.

in PowerTAC as the income minus the costs (balancing, wholesale, and tariff markets). We used default parameters for all other settings in PowerTAC.
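The per-timeslot accuracy measure described above can be summarized with the following minimal sketch; the data and function name are illustrative, only the bookkeeping of predicted versus observed clearings is shown.

    # Minimal sketch: the agent predicts, for each of its submitted bids, whether
    # it will clear, and after the server feedback the fraction of correct
    # predictions is the accuracy of that timeslot (1.0 means perfect prediction).

    def timeslot_accuracy(predicted_cleared, actually_cleared):
        assert len(predicted_cleared) == len(actually_cleared)
        hits = sum(p == a for p, a in zip(predicted_cleared, actually_cleared))
        return hits / len(predicted_cleared)

    predicted = [True, True, False, True]     # model's guesses for 4 bids
    observed  = [True, False, False, True]    # transactions reported by the server
    print(timeslot_accuracy(predicted, observed))   # 0.75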

The experiments we designed focus only on the wholesale market of PowerTAC. However, PowerTAC also includes another type of market that we cannot disregard (the tariff market). We therefore fix a strategy for this market so that a single flat tariff is published and stays constant throughout the experiments.

5.5.4 Fixed non-stationary opponents

Now we compare the learning algorithms in terms of the confidence over the error rate against the switching opponent. In Figure 5.16 (a) the upper value of the confidence over the error rate for a single interaction of TacTex-WM and DriftER is shown. After round 100, when the opponent changes its strategy, the error rate of TacTex-WM increases because it does not adapt to the opponent. In contrast, DriftER stops using its learned policy at timeslot 110 and restarts the learning phase, which ends at timeslot 135. At timeslot 135, DriftER has high confidence over the error rate (since it is a new model) and the error rate shows a peak. At this point, DriftER has learned a new MDP and a new policy, which reduces the error rate. The upper values of the confidence over the error rate of MDP-CL and DriftER are shown in Figure 5.16 (b). Both algorithms can detect the opponent's strategy change, but MDP-CL performs comparisons to detect switches every w steps (w = 25 in this case) and must therefore wait n·w (n = 1, 2, ...) timeslots, in contrast to the timeslot-by-timeslot switch detection of DriftER.

Additional experiments were performed to tune the w parameter. However, optimizing these parameters is time consuming since w ∈ N and threshold ∈ R; w = 25 was selected as the best value


Figure 5.17: Profits (€), where higher is better, of (a) TacTex-WM, (b) MDP-CL and (c) DriftER against the non-stationary opponent in a PowerTAC competition of 250 timesteps. Neither TacTex-WM nor MDP-CL is capable of increasing its profits after the opponent switch (vertical line), since they do not adapt to switches as fast as DriftER does.

(based on accuracy) for the setting threshold = 0.05. In the next section we directly compare MDP-CL and DriftER against switching opponents.

5.5.5 Detecting switches in the opponent

Now we compare MDP-CL and DriftER since both approaches handle non-stationary opponents. We

measure the average number of timeslots needed to detect the switch, the average accuracy and the

traded energy as a measure of indirect cost provided by PowerTAC (the more time it takes to detect

the switch, the less energy the agent successfully buys). Table 5.13 reports the results for MDP-CL

(using w = 25) and DriftER. The competition lasted 250 timesteps and the opponent switched at


Table 5.13: Average timeslots for switch detection (Avg. S.D. Time), accuracy, and traded energy of the learning agents against a non-stationary opponent.

           Avg. S.D. Time   Accuracy        Traded E.
MDP-CL     85.0 ± 55.0      57.55 ± 28.56   2.9 ± 1.3
DriftER    33.2 ± 13.6      67.60 ± 21.21   4.4 ± 0.5

Table 5.14: Average profit of the learning agents against non-stationary opponents with and without noise.

            TacTex-WM                     MDP-CL                        DriftER
            Agent          Opp            Agent          Opp            Agent          Opp
Fixed NS    219.0 ± 7.5    228.7 ± 31.7   261.3 ± 65.8   270.1 ± 75.5   263.0 ± 38.9   228.7 ± 64.2
Noisy NS    198.0 ± 41.3   197.6 ± 24.78  260.1 ± 75.0   305.6 ± 41.18  265.9 ± 39.9   229.0 ± 38.2

timestep 100. Results are averaged over 10 independent trials. They show that DriftER needs less time to detect switches and obtains better accuracy (explained by the fast switch detection), and as a result it is capable of trading more energy.

To further evaluate the algorithms and perform a fair comparison in terms of profit, we implemented the same strategy in the tariff market for all learning algorithms. Figure 5.17 shows the cumulative profit of (a) TacTex-WM, (b) MDP-CL, and (c) DriftER against the same non-stationary opponent. The timeslot where the opponent switches strategies is indicated with a vertical line. From these figures we note that TacTex-WM's profits increase before the opponent switch and decrease after it; at the end of the experiment, TacTex-WM and the opponent obtain a similar profit. MDP-CL was capable of detecting switches but took more time and obtained less profit than TacTex-WM. In contrast, DriftER's profits increase again after the switch. In terms of cumulative profit, DriftER obtained 80k € more than the opponent.

5.5.6 Noisy non-stationary opponents

In the previous experiments the opponent switched between two fixed strategies. In this section we present a better approximation to real-world strategies. The opponent has a limit price Pl = 20.0 with a noise of ±2.5 (bids are in the range [18.5, 22.5]). Then it switches to Ph = 34.0, with bids in the range [31.5, 36.5]. The rest of the experiment remains the same as in the previous section.

Table 5.14 shows the total profits of the learning agents against the non-stationary opponents with and without noise, averaged over 10 independent trials. When the opponent uses a range of values to bid, TacTex-WM's profits are reduced while its standard deviation increases. MDP-CL shows competitive profit scores against the fixed opponent, but against the noisy opponent it obtained lower scores than the opponent. DriftER shows the best profit against fixed opponents and is the only algorithm of the three able to score better than the noisy, non-stationary opponent.


5.5.7 Summary

This section presented experiments with DriftER in repeated games and in the PowerTAC simulator. In PowerTAC the opponent is non-stationary in the sense that it changes its limit price during the interaction. DriftER was tested against the champion of the inaugural competition and against MDP-CL, obtaining better results in terms of profit against switching opponents. In the next section we present the last set of experiments, comparing our three main proposals and WOLF-PHC in the context of randomly generated repeated games.

5.6 Non-stationary game theory strategies

This section presents experiments using our proposals: MDP-CL, R-max# and DriftER on GT games.

5.6.1 Setting and objectives

Common strategies in GT include playing pure and mixed Nash equilibria, maximin strategies (Section 2.3) and fictitious play [Brown, 1951]. Thus, the proposed opponents will use those strategies and will switch among them during the interaction.
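As a reference for the last of these strategies, the sketch below shows a standard fictitious-play opponent that best responds to the empirical frequencies of the other player's past actions; the payoff matrix, initial counts and class are illustrative assumptions, not one of the games in Appendix B.

    from collections import Counter

    # Minimal sketch of a fictitious-play opponent: it keeps counts of the other
    # player's past actions and best responds to the resulting empirical mixture.

    class FictitiousPlay:
        def __init__(self, payoffs):
            # payoffs[my_action][their_action] -> my payoff
            self.payoffs = payoffs
            self.counts = Counter({a: 1 for a in next(iter(payoffs.values()))})

        def observe(self, their_action):
            self.counts[their_action] += 1

        def act(self):
            total = sum(self.counts.values())
            freqs = {a: c / total for a, c in self.counts.items()}
            expected = {mine: sum(freqs[theirs] * row[theirs] for theirs in row)
                        for mine, row in self.payoffs.items()}
            return max(expected, key=expected.get)   # best response to frequencies

    # Battle-of-the-sexes-like payoffs, for illustration only.
    bos = {"O": {"O": 100, "F": 0}, "F": {"O": 0, "F": 25}}
    opp = FictitiousPlay(bos)
    opp.observe("O"); opp.observe("O")
    print(opp.act())   # best responds with "O" to a mostly-O history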

The experiments are divided into two parts: first we analyze how MDP-CL, R-max# and DriftER behave in the BoS game against opponents that switch between pure and mixed strategies; then we use our approaches in general-sum games with more than two actions. As a comparison, we present results for the WOLF-PHC algorithm (Section 3.3.2). We performed experiments (presented in Appendix C.2) to select the parameter m = 10 for the rest of the section.

5.6.2 Battle of the sexes

We compare our approaches, DriftER, MDP-CL and R-max#, with WOLF-PHC in the BoS game with values v1 = 100, v2 = 25. The game consisted of 3000 rounds; the opponent switched strategies at rounds 1000 and 2000. The opponent starts with a pure Nash equilibrium [1.0, 0.0], then switches to a minimax strategy [0.8, 0.2] and finally changes to a different pure Nash equilibrium [0.0, 1.0]. Results are the average of 100 iterations.

First we present experiments with DriftER (n = 5) and WOLF-PHC. In Figure 5.18 (a) we depict the immediate rewards of both approaches against the non-stationary opponent in the BoS game. The opponent starts with a pure Nash strategy, which DriftER quickly learns (in fewer than 50 rounds). In contrast, WOLF-PHC needs more time to converge to the best action (approximately 120 rounds). Moreover, since it needs to keep exploring, it never reaches the best possible score (25). At round 1000 the opponent changes to a mixed strategy and both algorithms adapt correctly to the opponent,


Figure 5.18: Rewards obtained by (a) DriftER, (b) MDP-CL, (c) R-max# and WOLF-PHC in the BoS game

against a non-stationary opponent that uses pure Nash and mixed Nash in a game of 3000 rounds. Switches

happen every 1000 rounds.


even when DriftER does not play a mixed policy it is capable of obtaining rewards similar to those of WOLF-PHC. At round 2000 the opponent changes to a different pure strategy. DriftER quickly adapts its model and its policy, obtaining the best possible score. WOLF-PHC takes more rounds to adapt and does not completely exploit the opponent.

In Figure 5.18 (b) we depict the rewards of MDP-CL (w = 160) and WOLF-PHC. In contrast to DriftER, which uses R-max exploration, MDP-CL uses a fixed exploration window, and in this case it takes more time to exploit the model. When the opponent switches at round 1000, MDP-CL is capable of adapting. At round 2000 the opponent changes again and we can observe that MDP-CL adjusts in steps until finally converging to the new opponent model. These steps happen every w rounds, when the model comparison is performed to detect switches.

In Figure 5.18 (c) we depict the rewards of R-max# (τ = 1000). The approach learns the opponent model quickly (as DriftER does) and can also learn the new opponent model quickly. However, since it uses an implicit switch detection mechanism, in some cases it takes more time to adjust (as happens for the second switch of the opponent).

5.6.3 General-sum games

We tested our proposals and WOLF-PHC in repeated games with the following conditions: at least one pure and one mixed Nash equilibrium, different from each other, and more than two actions. The payoff values of the games are in the range [−100, 100] and the games are shown in Appendix B.

The setting is one opponent that starts playing a pure Nash equilibrium, then changes to a mixed Nash strategy and finally uses a fictitious play strategy. Switches happen every 1000 or 2000 rounds; the game consists of 3000 or 6000 rounds, respectively. An * indicates statistical significance of the algorithm with respect to WOLF-PHC (using the Wilcoxon rank-sum test, 5% significance level).

Average rewards of the algorithms against non-stationary opponents with two different switching frequencies are reported in Table 5.15. Results show that DriftER (n = 5, w = 200) obtained on average better results than WOLF-PHC. When the switching frequency was 1000 rounds, only one result is statistically significant. In contrast, when the switching frequency increases to 2000 rounds, DriftER can exploit the model for more rounds and the difference with WOLF-PHC increases.

We selected the best parameters for MDP-CL: w = 120 and threshold = 0.15 when the switch frequency is 1000, and w = 150 when the switch frequency is 2000. When the switch frequency is 1000 rounds, MDP-CL is comparable with WOLF-PHC in games 1-3. In game 4 the MDP-CL results decrease drastically because the threshold parameter is sensitive to the number of actions: MDP-CL computes a distance that averages over all actions, so the difference between MDPs becomes smaller (in game 4 there are 5 actions). When using a suitable value (threshold = 0.01) we obtain results (marked with †) comparable to WOLF-PHC. When the switch frequency increases to 2000 rounds, for games 1 to 3 MDP-CL obtained


Table 5.15: Average rewards of our proposed approaches and WOLF-PHC against non-stationary opponents in four random repeated games. * indicates statistical significance with respect to WOLF-PHC (using the Wilcoxon rank-sum test). † indicates that the value was obtained with a different parameter for MDP-CL.

Game Id   DriftER          MDP-CL           R-max#           WOLF-PHC         Switch freq.
1         35.69 ± 1.29     35.24 ± 1.51     38.00 ± 0.65*    35.14 ± 1.11     1000
2         58.03 ± 1.54*    57.30 ± 0.67*    47.27 ± 0.80     56.21 ± 1.71     1000
3         71.76 ± 2.05     75.34 ± 1.70*    73.85 ± 1.77*    71.68 ± 1.46     1000
4         68.22 ± 2.19     67.53 ± 5.74†    71.78 ± 0.98     68.03 ± 5.06     1000
Avg       58.42 ± 1.77     58.85 ± 2.40     57.72 ± 1.05     57.77 ± 2.33     1000
1         37.72 ± 2.19     37.76 ± 0.41*    40.74 ± 2.83*    35.60 ± 0.68     2000
2         60.32 ± 0.85*    59.78 ± 0.33*    45.46 ± 0.67     57.58 ± 1.06     2000
3         75.61 ± 1.95*    74.63 ± 1.95*    73.80 ± 1.77     72.87 ± 0.92     2000
4         74.19 ± 0.74*    70.05 ± 7.05†    69.89 ± 1.06     70.94 ± 2.58     2000
Avg       61.96 ± 0.74*    60.55 ± 2.43     57.47 ± 1.58     59.25 ± 1.31     2000

When the switch frequency increases to 2000 rounds, MDP-CL obtained better scores (statistically significant) than WOLF-PHC in games 1 to 3; this happens because MDP-CL is capable of detecting switches quickly and can exploit the model for more rounds. In game 4, however, the results decrease considerably.

R-max# was also compared in this setting using τ = 1000. When the switch frequency was 1000 steps it was capable of adapting to the switches and obtained on average comparable results to WOLF-PHC. When the switch frequency was increased to 2000 steps the R-max# results were not as good, since it performs exploration more frequently than the opponent changes, which results in lower rewards.

5.7 Summary of the chapter

This chapter presents experiments in five different domains: the prisoner's dilemma, the multiagent prisoner's dilemma, alternating-offers bargaining, double auctions in PowerTAC and randomly generated repeated games. First, the MDP4.5 and MDP-CL approaches were compared to a reinforcement learning algorithm for non-stationary environments. Results show that our proposals are capable of learning in repeated games with comparable results, with the advantages of faster computation and an online learning approach. Extensions that handle a priori information, as well as not forgetting previously learned models, were compared, showing the advantage of both extensions to MDP-CL. Later we tested two different domains in which drift exploration is necessary to obtain an optimal policy: the iPD problem and the negotiation task. In both scenarios, the use of switch detection mechanisms alone, such as MDP-CL, FAL or MDP4.5, was not enough to deal with switching opponents.


The general approach of drift exploration, by means of ε-greedy or some softmax type of exploration, solves this problem, since such exploration re-visits parts of the state space and eventually leads to detecting switches in the opponent strategy. Our approach R-max#, which handles drift exploration implicitly, is generally better equipped to handle non-stationary opponents of different sorts. Its pitfall lies in its parameterization (parameter τ), which generally should be large enough to learn a correct opponent model, yet small enough to react promptly to strategy switches. In realistic scenarios where we do not know the switching times of a non-stationary opponent, it is useful to combine both approaches, switch detection and implicit drift exploration, as can be seen in R-max#CL. Next we presented experiments showing that DriftER can be used in a realistic domain (double auctions inside the PowerTAC simulator). DriftER obtained better scores than MDP-CL and showed robust results against noisy opponents. We concluded with a set of experiments in repeated games comparing our different approaches with WOLF-PHC, the state-of-the-art algorithm for non-stationary strategies in repeated games. To summarize, our approaches were capable of exploiting the opponent model and reacting quickly to strategy changes. The next chapter concludes with the contributions of this research and ideas for future work.


Chapter 6

Conclusions and Future Research

In this chapter we present a summary of the proposed algorithms, our conclusions and the contributions of this thesis, and we outline open questions and ideas for future research. We conclude with the list of publications derived from this research.

6.1 Summary of the proposed algorithms

In Table 6.1 we present a summary of our proposals in terms of their theoretical guarantees, how switch detection is performed against non-stationary opponents, and their advantages and limitations. MDP4.5 uses decision trees to model the opponent. Its main advantage is that the learned model can be analyzed and interpreted easily; however, it may not be the best option against stochastic opponents. MDP-CL learns models using MDPs; its advantages are that it can use a priori models and that drift exploration can easily be added, but its threshold parameter is sensitive to the number of actions. R-max# is an algorithm with theoretical guarantees for switch detection and near-optimal rewards under certain conditions; however, it does not detect switches explicitly. R-max#CL is the approach that obtained the best scores empirically in two domains, but it requires setting 5 parameters and no guarantees have been proved for it. Finally, DriftER uses a single learned model as the basis for switch detection, which is simpler than the approaches that learn different models, and it provides guarantees for switch detection with high probability. However, a current limitation is that it cannot use a priori models nor keep models in case the opponent returns to a previous strategy. It is left as open work to extend DriftER with an approach similar to the one used for a priori MDP-CL.

6.2 Conclusions

Now, we provide some final remarks about this research.


Table 6.1: A comparison of our proposals in terms of guarantees, switch detection, advantages and limitations.

Algorithm: MDP4.5
  Theoretical guarantee of switch detection: No
  Switch detection: Periodical comparison between decision trees.
  Advantages: The learned model can be translated into rules.
  Limitations: The transformation from DTs to MDPs increases the state space. Not the best model against stochastic opponents.

Algorithm: MDP-CL
  Theoretical guarantee of switch detection: No
  Switch detection: Periodical comparison between MDPs.
  Advantages: More information to detect a switch. It is possible to add drift exploration and use a priori models.
  Limitations: Threshold parameter sensitive to the number of actions. No theoretical guarantees.

Algorithm: R-max#
  Theoretical guarantee of switch detection: Yes
  Switch detection: Implicit switch detection.
  Advantages: Few parameters, algorithm easy to understand. Performs efficient drift exploration.
  Limitations: Does not detect switches explicitly.

Algorithm: R-max#CL
  Theoretical guarantee of switch detection: No
  Switch detection: Implicitly by relearning models and explicitly by periodical comparison between MDPs.
  Advantages: Obtained the best scores empirically in two domains.
  Limitations: Parameter tuning.

Algorithm: DriftER
  Theoretical guarantee of switch detection: Yes
  Switch detection: Tracking the predictive error of the learned model.
  Advantages: Simple detection method, independent of the domain.
  Limitations: Cannot use a priori models or keep a history of models.


• Some assumptions about the nature of the non-stationarity of the opponents must be made; otherwise, it would be impossible to propose a general algorithm for all cases. Our assumption is that the opponent switches among several stationary strategies during the interaction. This makes it possible to learn how to act optimally against it.

• We proposed methods that are based on computing a model of the opponent and tracking that model for possible changes. Thus, an effective representation of the opponent model (the attributes used to describe the opponent) is of great importance to obtain: 1) a policy against it and 2) an efficient switch detection method.

• Knowing beforehand the set of strategies that the opponent may use is an assumption that may not hold in several domains. However, this research has shown that knowing these models accelerates the process of opponent strategy detection. Also, keeping models after a switch can be helpful for further use, for example to guide exploration. In particular, we are further exploring this issue using a Bayesian approach for tracking and identifying opponent switches [Hernandez-Leal et al., 2016].

• When an opponent switches between strategies, some switches will pass unnoticed (the shadowing effect) unless exploration is applied. This type of exploration (coined drift exploration) needs to explore with actions that differ from the computed optimal policy (a minimal sketch is given after this list). Thus, a tradeoff appears: exploring may reduce the immediate rewards, but detecting a switch to another opponent strategy may increase the rewards in the long term. A more extensive analysis of this situation providing theoretical guarantees is still an open problem.

• Theoretical results for opponent switch detection are important for building robust algorithms. We provided two results in this context, for R-max# and DriftER. However, their main limitation is that the assumptions made about the opponent behavior may not hold in some domains.
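The sketch below illustrates the drift exploration idea mentioned above with a simple ε-greedy rule: with a small probability the agent deviates from the policy computed against the current opponent model. The names (epsilon, optimal_policy, actions) are illustrative assumptions, not the thesis implementation.

    # Minimal sketch of drift exploration via epsilon-greedy action selection.
    import random

    def drift_exploration_action(state, optimal_policy, actions, epsilon=0.1):
        # Occasionally re-visiting off-policy parts of the state space is what
        # allows switches that would otherwise be shadowed to be detected.
        if random.random() < epsilon:
            return random.choice(actions)  # exploratory action: may reveal a switch
        return optimal_policy[state]       # exploit the current opponent model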

6.3 Contributions

We contribute to the state of the art with different algorithms for learning against non-stationary opponents, providing empirical results in five domains and theoretical guarantees for two algorithms. In detail, the contributions are:

• A framework for learning against non-stationary opponents in repeated games. This framework uses windows of interactions to learn a model of the opponent. The learned model is used to compute an optimal policy against that opponent. Different models are learned throughout the repeated game and comparisons between models detect a switch in the opponent. The approaches


were evaluated empirically with two different implementations, MDP4.5 and MDP-CL, against a reinforcement learning technique for non-stationary environments.

• Two extensions for MDP-CL were presented. A priori MDP-CL assumes that the set of strategies the opponent may use is known beforehand and detects which one the opponent is using. Incremental MDP-CL keeps a record of the learned models and does not discard them, in case the opponent returns to a previously used strategy. The approaches were evaluated empirically in the iterated prisoner's dilemma.

• Drift exploration is proposed as an exploration mechanism for detecting opponent switches that would otherwise pass unnoticed. We evaluated the approach experimentally by using drift exploration in MDP-CL on two domains: the iPD and a negotiation task.

• In the context of drift exploration, R-max# was proposed. Its roots come from R-max, but it differs in that it keeps learning a model continuously by forgetting state-action pairs that have not been updated recently. We provide theoretical results showing that R-max# will perform optimally under certain assumptions. Moreover, using R-max# with MDP-CL results in R-max#CL, which obtained the best results on two domains since it combines an efficient exploration with a switch detection mechanism.

• Finally, we proposed DriftER, an algorithm that learns a model of the opponent and keeps track of its error rate (sketched below). When the error rate increases for several timesteps, the opponent has changed strategy and we must learn a new model. We provide a theoretical result that ensures that DriftER will detect opponent switches with high probability when its parameters are set correctly. Results on repeated games and the PowerTAC simulator show that DriftER is capable of detecting switches in the opponent faster than state-of-the-art algorithms.
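As a rough illustration of the error-rate tracking idea behind DriftER, the following sketch flags a switch when the recent prediction error of the opponent model stays high for several consecutive steps; the window size, threshold and patience values are illustrative choices, not the thesis parameterization.

    # Minimal sketch of switch detection by tracking the model's error rate.
    from collections import deque

    class ErrorRateSwitchDetector:
        def __init__(self, window=30, threshold=0.4, patience=3):
            self.errors = deque(maxlen=window)  # recent prediction errors (0 or 1)
            self.threshold = threshold          # error rate that suggests drift
            self.patience = patience            # consecutive violations required
            self.violations = 0

        def update(self, predicted_action, observed_action):
            """Record one prediction and report whether a switch is flagged."""
            self.errors.append(int(predicted_action != observed_action))
            rate = sum(self.errors) / len(self.errors)
            self.violations = self.violations + 1 if rate > self.threshold else 0
            return self.violations >= self.patience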

In the course of this thesis we made some assumptions about our settings, and therefore several open questions remain; these are discussed in the next section.

6.4 Open questions and future research ideas

We propose five different ideas for future research that are worth pursuing.

• Learning opponents. Throughout this thesis we assumed that opponents used a set of strategies and switched among them during the interaction. However, we did not consider opponents that use learning strategies [Bowling and Veloso, 2002]. In this case care must be taken: if both agents are learning at the same time they may learn noise [HolmesParker et al., 2014].


• Not knowing the representation of the opponent. In order to learn the opponent model we assumed a representation which in most cases can correctly describe the opponent. To relax this assumption we must learn the model and, at the same time, the correct representation. A recent area whose aim is to learn without putting effort into designing the correct representation is deep learning [Deng and Yu, 2013]. Such approaches could be used in a preprocessing phase, since they have the limitation of needing long learning times.

• Stochastic games. The prisoner’s dilemma and double auctions can be seen as repeated games.

The alternating-o↵ers bargaining problem can be solved as an extensive form game (a tree repre-

sentation of a game). However, there are domains where the environment cannot be represented

as a single matrix. Thus a stochastic game could be the best fit, approaches such as [Elidrisi

et al., 2014] could be a good start to model non-stationary opponents in those cases.

• Increasing the number of opponents. In Section 5.2.6 we increased the number of opponents and showed that MDP-CL performed successfully. However, the main limitation is that increasing the number of opponents increases the size of the state space exponentially, which may limit its use. A possible way to handle a large number of opponents is to treat them as a population and best respond to classes of opponents [Bard et al., 2015; Wunder et al., 2011].

• Adapting related approaches for opponent modeling. Recent models such as Bayesian policy reuse [Rosman et al., 2015] have been proposed for fast learning in sequential decision tasks; the problem could be recast in an adversarial setting. Another approach worth analyzing is I-POMDPs, since there are new techniques for learning and solving them faster [Qu and Doshi, 2015].

6.5 Publications

Several parts of this thesis have been published. Below we provide the list of papers derived from this research. One journal paper:

• “A framework for learning and planning against switching strategies in repeated games” [Hernandez-

Leal et al., 2014a].

the following conference papers:

• “Bidding in Non-Stationary energy markets” (AAMAS 2015) [Hernandez-Leal et al., 2015a]

• “Opponent modeling against non-stationary strategies (Doctoral Consortium)” (AAMAS 2015)

[Hernandez-Leal et al., 2015c]


• “Using a priori information for fast learning against non-stationary opponents” (IBERAMIA

2014) [Hernandez-Leal et al., 2014b].

• “Modeling non-stationary opponents” (AAMAS 2013) [Hernandez-Leal et al., 2013b]

• “Strategic Interactions Among Agents with Bounded Rationality” (IJCAI 2013) [Hernandez-Leal

et al., 2013d]

and the following workshop papers:

• “Learning against non-stationary opponents in double auctions” (Workshop ALA 2015) [Hernandez-

Leal et al., 2015b]

• “Exploration strategies to detect strategy switches” (Workshop ALA 2014) [Hernandez-Leal et

al., 2014c]

• “Learning against non-stationary opponents.” (Workshop ALA 2013) [Hernandez-Leal et al.,

2013c]

• “Opponent modeling and planning against non-stationary strategies.” (Workshop MSDM 2013)

[Hernandez-Leal et al., 2013a]


References

Abdallah, Sherief and Victor Lesser (2008). “A multiagent reinforcement learning algorithm with

non-linear dynamics.” In: Journal of Artificial Intelligence Research 33.1, pp. 521–549.

Aumann, Robert J. (1999). “Interactive epistemology I: knowledge.” In: International Journal of Game

Theory 28.3, pp. 263–300.

Axelrod, Robert and William D. Hamilton (1981). “The evolution of cooperation.” In: Science 211.27,

pp. 1390–1396.

Banerjee, Bikramjit and Jing Peng (2005). “Efficient learning of multi-step best response.” In: Proceed-

ings of the 4th International Conference on Autonomous Agents and Multiagent Systems. Utretch,

Netherlands: ACM, pp. 60–66.

Bard, Nolan, Michael Johanson, Neil Burch, and Michael Bowling (2013). “Online implicit agent mod-

elling.” In: Proceedings of the 12th International Conference on Autonomous Agents and Multiagent

Systems. International Foundation for Autonomous Agents and Multiagent Systems, pp. 255–262.

Bard, Nolan, Deon Nicholas, Csaba Szepesvari, and Michael Bowling (2015). “Decision-theoretic Clus-

tering of Strategies.” In: Proceedings of the 14th International Conference on Autonomous Agents

and Multiagent Systems. Istanbul,Turkey, pp. 17–25.

Barrett, Samuel, Peter Stone, Sarit Kraus, and Avi Rosenfeld (2012). “Learning Teammate Models

for Ad Hoc Teamwork.” In: AAMAS Adaptive Learning Agents (ALA) Workshop.

Bellman, Richard (1957). “A Markovian decision process.” In: Journal of Mathematics and Mechanics

6.5, pp. 679–684.

Bolton, Gary E. and Axel Ockenfels (2000). “ERC: A theory of equity, reciprocity, and competition.”

In: American Economic Review, pp. 166–193.

Boutilier, Craig, Thomas L. Dean, and Steve Hanks (1999). “Decision-Theoretic Planning: Structural

Assumptions and Computational Leverage.” In: Journal of Artificial Intelligence Research, pp. 1–

94.

Bowling, Michael (2004). “Convergence and no-regret in multiagent learning.” In: Advances in Neural

Information Processing Systems. Vancouver, Canada, pp. 209–216.


Bowling, Michael and Manuela Veloso (2002). “Multiagent learning using a variable learning rate.”

In: Artificial Intelligence 136.2, pp. 215–250.

Brafman, Ronen I. and Moshe Tennenholtz (2003). “R-MAX a general polynomial time algorithm for

near-optimal reinforcement learning.” In: The Journal of Machine Learning Research 3, pp. 213–

231.

Brown, George W. (1951). “Iterative solution of games by fictitious play.” In: Activity analysis of

production and allocation 13.1, pp. 374–376.

Busoniu, Lucian, Robert Babuska, and Bart De Schutter (2008). “A Comprehensive Survey of Mul-

tiagent Reinforcement Learning.” In: IEEE Transactions on Systems, Man and Cybernetics, Part

C (Applications and Reviews) 38.2, pp. 156–172.

Camerer, Colin F. (1997). “Progress in behavioral game theory.” In: The Journal of Economic Per-

spectives 11.4, pp. 167–188.

— (2003). Behavioral Game Theory: Experiments in Strategic Interaction (Roundtable Series in Be-

havioral Economics). Princeton University Press.

Camerer, Colin F., Teck-Hua Ho, and Juin-Kuan Chong (2004a). “A cognitive hierarchy model of

games.” In: The Quarterly Journal of Economics 119.3, p. 861.

— (2004b). “Behavioural Game Theory: Thinking, Learning and Teaching.” In: Advances in Under-

standing Strategic Behavior. New York, pp. 120–180.

Carmel, David and Shaul Markovitch (1999). “Exploration strategies for model-based learning in

multi-agent systems.” In: Autonomous Agents and Multi-Agent Systems 2.2, pp. 141–172.

Cassandra, Anthony R. (1998). “Exact and approximate algorithms for partially observable Markov

decision processes.” PhD thesis. Computer Science Department, Brown University.

Cassandra, Anthony R., Michael L. Littman, and Nevin L. Zhang (1997). “Incremental pruning: a

simple, fast, exact method for partially observable Markov decision processes.” In: Proceedings

of the 13th Conference on Uncertainty in Artificial Intelligence. Providence, Rhode Island, USA:

Morgan Kaufmann Publishers Inc, pp. 54–61.

Chakraborty, Doran and Peter Stone (2008). “Online multiagent learning against memory bounded

adversaries.” In: Machine Learning and Knowledge Discovery in Databases, pp. 211–226.

Choi, Samuel P. M., Dit-Yan Yeung, and Nevin L. Zhang (1999). “An Environment Model for Non-

stationary Reinforcement Learning.” In: NIPS, pp. 987–993.

— (2001). “Hidden-mode markov decision processes for nonstationary sequential decision making.”

In: Sequence Learning, pp. 264–287.

Conitzer, Vincent and Tuomas Sandholm (2006). “AWESOME: A general multiagent learning algo-

rithm that converges in self-play and learns a best response against stationary opponents.” In:

Machine Learning 67.1-2, pp. 23–43.


Corbett, Albert T. and John R. Anderson (1994). “Knowledge tracing: Modeling the acquisition of

procedural knowledge.” In: User Modeling and User-Adapted Interaction 4.4, pp. 253–278.

Costa Gomes, Miguel, Vincent P. Crawford, and B. Broseta (2001). “Cognition and Behavior in

Normal–Form Games: An Experimental Study.” In: Econometrica 69.5, pp. 1193–1235.

Cote, Enrique Munoz de and Nicholas R. Jennings (2010). “Planning against fictitious players in

repeated normal form games.” In: Proceedings of the 9th International Conference on Autonomous

Agents and Multiagent Systems. International Foundation for Autonomous Agents and Multiagent

Systems, pp. 1073–1080.

Cote, Enrique Munoz de, Alessandro Lazaric, and Marcello Restelli (2006). “Learning to cooperate

in multi-agent social dilemmas.” In: Proceedings of the Fifth International Joint Conference on

Autonomous Agents and Multiagent Systems. ACM, pp. 783–785.

Cote, Enrique Munoz de, Archie C. Chapman, Adam M. Sykulski, and Nicholas R. Jennings (2010).

“Automated Planning in Repeated Adversarial Games.” In: Uncertainty in Artificial Intelligence,

pp. 376–383.

Crandall, Jacob W. and Michael A. Goodrich (2011). “Learning to compete, coordinate, and cooperate

in repeated games using reinforcement learning.” In: Machine Learning 82.3, pp. 281–314.

Da Silva, Bruno C, Eduardo W. Basso, Ana L.C. Bazzan, and Paulo M. Engel (2006). “Dealing with

non-stationary environments using context detection.” In: Proceedings of the 23rd International

Conference on Machine Learnig. Pittsburgh, Pennsylvania, pp. 217–224.

Del Giudice, A., Piotr J. Gmytrasiewicz, and J. Bryan (2009). “Towards strategic kriegspiel play

with opponent modeling.” In: Proceedings of the 8th International Conference on Autonomous

Agents and Multiagent Systems. International Foundation for Autonomous Agents and Multiagent

Systems, pp. 1265–1266.

Deng, Li and Dong Yu (2013). “Deep Learning Methods and Applications.” In: Foundations and

Trends in Signal Processing 7.3-4, pp. 197–387.

On the Difficulty of Achieving Equilibrium in Interactive POMDPs (2006). Boston, MA, USA.

Doshi, Prashant and Piotr J. Gmytrasiewicz (2009). “Monte Carlo sampling methods for approximat-

ing interactive POMDPs.” In: Journal of Artificial Intelligence Research 34.1, p. 297.

Doshi, Prashant, Yifeng Zeng, and Qiongyu Chen (2008). “Graphical models for interactive POMDPs:

representations and solutions.” In: Autonomous Agents and Multi-Agent Systems 18.3, pp. 376–

416.

Elidrisi, Mohamed, Nicholas Johnson, and Maria Gini (2012). “Fast Learning against Adaptive Ad-

versarial Opponents.” In: Adaptive Learning Agents Workshop at AAMAS. Valencia, Spain.


Elidrisi, Mohamed, Nicholas Johnson, Maria Gini, and Jacob W. Crandall (2014). “Fast adaptive learn-

ing in repeated stochastic games by game abstraction.” In: Proceedings of the 13th International

Joint Conference on Autonomous Agents and Multiagent Systems. Paris, France, pp. 1141–1148.

Fudenberg, Drew and Jean Tirole (1991). Game Theory. The MIT Press.

Fulda, Nancy and Dan Ventura (2006). “Predicting and Preventing Coordination Problems in Co-

operative Q-learning Systems.” In: IJCAI-07: Proceedings of the Twentieth International Joint

Conference on Artificial Intelligence, pp. 780–785.

Gal, Ya’akov and Avi Pfeffer (2008). “Networks of influence diagrams: A formalism for representing

agents’ beliefs and decision making processes.” In: Journal of Artificial Intelligence Research 33.1,

pp. 109–147.

Gama, Joao, Pedro Medas, Gladys Castillo, and Pedro Rodrigues (2004). “Learning with Drift Detec-

tion.” In: Advances in Artificial Intelligence–SBIA. Brazil, pp. 286–295.

Gama, Joao, Indre Zliobaite, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia (2014).

“A survey on concept drift adaptation.” In: ACM Computing Surveys (CSUR) 46.4.

Gmytrasiewicz, Piotr J. and Prashant Doshi (2005). “A framework for sequential planning in multia-

gent settings.” In: Journal of Artificial Intelligence Research 24.1, pp. 49–79.

Gmytrasiewicz, Piotr J. and Edmund H. Durfee (2000). “Rational Coordination in Multi-Agent Envi-

ronments.” In: Autonomous Agents and Multi-Agent Systems 3.4, pp. 319–350.

Goeree, Jacob K. and C.A. Holt (2001). “Ten little treasures of game theory and ten intuitive contra-

dictions.” In: American Economic Review, pp. 1402–1422.

Harsanyi, John C. and Reinhard Selten (1988). A general theory of equilibrium selection in games.

MIT Press.

Hernandez-Leal, Pablo, Enrique Munoz de Cote, and L. Enrique Sucar (2013a). “Learning against non-

stationary opponents.” In: Workshop on Adaptive Learning Agents (ALA). Saint Paul, Minnesota,

pp. 76–83.

— (2013b). “Modeling Non-Stationary Opponents.” In: Proceedings of the 12th International Con-

ference on Autonomous Agents and Multiagent Systems. Saint Paul, Minnesota, USA, pp. 1135–

1136.

— (2013c). “Opponent modeling and planning against non-stationary strategies.” In: The 8th Work-

shop on Multiagent Sequential Decision Making Under Uncertainty (MSDM) 2013. Saint Paul,

Minnesota, pp. 47–54.

— (2013d). “Strategic Interactions Among Agents with Bounded Rationality.” In: Proceedings of the

Twenty-Third International Joint Conference on Artificial Intelligence. Beijing, China, pp. 3219–

3220.


— (2014a). “A framework for learning and planning against switching strategies in repeated games.”

In: Connection Science 26.2, pp. 103–122.

— (2014b). “Exploration strategies to detect strategy switches.” In: AAMAS Workshop on Adaptive

Learning Agents. Paris, France.

— (2014c). “Using a priori information for fast learning against non-stationary opponents.” In: Ad-

vances in Artificial Intelligence – IBERAMIA 2014. Santiago de Chile, pp. 536–547.

Hernandez-Leal, Pablo, Matthew E. Taylor, L. Enrique Sucar, and Enrique Munoz de Cote (2015a).

“Bidding in Non-Stationary Energy Markets.” In: Proceedings of the 14th International Conference

on Autonomous Agents and Multiagent Systems. Istanbul, Turkey, pp. 1709–1710.

Hernandez-Leal, Pablo, Matthew E. Taylor, Enrique Munoz de Cote, and L. Enrique Sucar (2015b).

“Learning Against Non-Stationary Opponents in Double Auctions.” In: Workshop Adaptive Learn-

ing Agents ALA 2015. Istanbul, Turkey.

Hernandez-Leal, Pablo, Enrique Munoz de Cote, and L. Enrique Sucar (2015c). “Opponent Model-

ing against Non-stationary Strategies.” In: Proceedings of the 14th International Conference on

Autonomous Agents and Multiagent Systems. Istanbul, Turkey, pp. 1989–1990.

Hernandez-Leal, Pablo, Matthew E. Taylor, Benjamin Rosman, L. Enrique Sucar, and Enrique Munoz

de Cote (2016). “Identifying and Tracking Switching, Non-stationary Opponents: a Bayesian Ap-

proach.” In: Third Workshop on Multiagent Interaction without prior Coordination. Phoenix, AZ,

USA.

Hoeffding, Wassily (1963). “Probability inequalities for sums of bounded random variables.” In: Journal

of the American Statistical Association 58.301, pp. 13–30.

HolmesParker, Chris, Matthew E. Taylor, Adrian Agogino, and Kagan Tumer (2014). “CLEANing

the reward: counterfactual actions to remove exploratory action noise in multiagent learning.”

In: Proceedings of the 13th International Joint Conference on Autonomous Agents and Multiagent

Systems. Paris, France: International Foundation for Autonomous Agents and Multiagent Systems,

pp. 1353–1354.

Howard, Ronald A. and James E. Matheson (2005). “Influence Diagrams.” In: Decision Analysis 2.3,

pp. 127–143.

Hu, J. and M.P. Wellman (1998). “Online learning about other agents in a dynamic multiagent system.”

In: Proceedings of the Second International Conference on Autonomous Agents. ACM, pp. 239–246.

Jennings, Nicholas R. et al. (2001). “Automated negotiation: prospects, methods and challenges.” In:

Group Decision and Negotiation 10.2, pp. 199–215.

Jensen, Steven, Daniel Boley, Maria Gini, and Paul Schrater (2005). “Rapid on-line temporal se-

quence prediction by an adaptive agent.” In: Proceedings of the 4th International Conference on

Autonomous Agents and Multiagent Systems. Utrecht, The Netherlands: ACM, pp. 67–73.


Kaelbling, Leslie P., Michael L. Littman, and Anthony R. Cassandra (1998). “Planning and acting in

partially observable stochastic domains.” In: Artificial Intelligence 101.1-2, pp. 99–134.

Kahneman, Daniel and Amos Tversky (1979). “Prospect theory: An analysis of decision under risk.”

In: Econometrica, pp. 263–291.

Kakade, Sham Machandranath (2003). “On the sample complexity of reinforcement learning.” PhD

thesis. Gatsby Computational Neuroscience Unit, University College London.

Ketter, Wolfgang, John Collins, and Prashant P. Reddy (2013). “Power TAC: A competitive economic

simulation of the smart grid.” In: Energy Economics 39, pp. 262–270.

Ketter, Wolfgang, John Collins, Prashant P. Reddy, and Mathijs De Weerdt (2014). The 2014 Power

Trading Agent Competition. Rotterdam, The Netherlands: Department of Decision and Information

Sciencies, Erasmus University.

Kocsis, Levente and Csaba Szepesvari (2006). “Bandit based monte-carlo planning.” In: Proceedings

of the 17th European Conference on Machine Learning. Berlin, Germany: Springer, pp. 282–293.

Koller, D. and N. Friedman (2009). Probabilistic graphical models: principles and techniques. The MIT

Press.

Koller, Daphne and Brian Milch (2001). “Multi-agent influence diagrams for representing and solv-

ing games.” In: IJCAI’01: Proceedings of the 17th International Joint Conference on Artificial

Intelligence. Seattle, Washington: Morgan Kaufmann Publishers Inc, pp. 1027–1036.

Liebman, Elad, Maytal Saar-Tsechansky, and Peter Stone (2015). “DJ-MC: A Reinforcement-Learning

Agent for Music Playlist Recommendation.” In: Proceedings of the 14th International Conference

on Autonomous Agents and Multiagent Systems. Istanbul, Turkey, pp. 591–599.

Littman, Michael L. (1994). “Markov games as a framework for multi-agent reinforcement learning.”

In: Proceedings of the 11th International Conference on Machine Learning. New Brunswick, NJ,

USA, pp. 157–163.

— (1996). “Algorithms for sequential decision making.” PhD thesis. Department of Computer Science,

Brown University.

Littman, Michael L. and Peter Stone (2001). “Implicit Negotiation in Repeated Games.” In: ATAL

’01: Revised Papers from the 8th International Workshop on Intelligent Agents VIII.

Littman, Michael L., Thomas L. Dean, and Leslie P. Kaelbling (1995). “On the complexity of solving

Markov decision problems.” In: Proceedings of the 11th Conference on Uncertainty in Artificial

Intelligence. Montreal, Canada: Morgan Kaufmann Publishers Inc, pp. 394–402.

McKelvey, Richard D., Andrew M. McLennan, and Theodore L. Turocy (2014). Gambit: Software

Tools for Game Theory. url: http://www.gambit-project.org.

Miglio, Rossella and Gabriele Soffritti (2004). “The comparison between classification trees through

proximity measures.” In: Computational Statistics and Data Analysis 45.3, pp. 577–593.


Mitchell, Thomas M. (1997). Machine Learning. 1st. McGraw-Hill Higher Education.

Mohan, Yogeswaran and S G Ponnambalam (2011). “Exploration strategies for learning in multi-agent

foraging.” In: Swarm, Evolutionary, and Memetic Computing 2011. Springer, pp. 17–26.

Mohri, Mehryar, Afshin Rostamizadeh, and Ameet Talwalkar (2012). Foundations of Machine Learn-

ing. The MIT Press.

Monahan, George E. (1982). “A survey of partially observable Markov decision processes: Theory,

models, and algorithms.” In: Management Science 28, pp. 1–16.

Myerson, Roger B. (1991). Game theory: analysis of conflict. Harvard University Press.

Nash, John F. (1950). “Equilibrium points in n-person games.” In: Proceedings of the National Academy

of Sciences 36.1, pp. 48–49.

Ng, Andrew Y, Daishi Harada, and Stuart J. Russell (1999). “Policy invariance under reward transfor-

mations: Theory and application to reward shaping.” In: Proceedings of the Sixteenth International

Conference on Machine Learning. Bled, Slovenia, pp. 278–287.

Ng, Brenda, Carol Meyers, Kofi Boakye, and J. Nitao (2010). “Towards Applying Interactive POMDPs

to Real-World Adversary Modeling.” In: Twenty-Second IAAI Conference. Atlanta, Georgia, pp. 1814–

1820.

Ng, Brenda, Kofi Boakye, Carol Meyers, and Andrew Wang (2012). “Bayes-Adaptive Interactive

POMDPs.” In: Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence. Toronto,

Canada, pp. 1408–1414.

Nudelman, Eugene, Jennifer Wortman, Yoav Shoham, and Kevin Leyton-Brown (2004). “Run the

GAMUT: a comprehensive approach to evaluating game-theoretic algorithms.” In: Proceedings of

the Third International Joint Conference on Autonomous Agents and Multiagent System, pp. 880–

887.

Papadimitriou, Christos H. and John N. Tsitsiklis (1987). “The complexity of Markov decision pro-

cesses.” In: Mathematics of Operations Research 12.3, pp. 441–450.

Parsons, Simon, Marek Marcinkiewicz, Jinzhong Niu, and Steve Phelps (2006). Everything you wanted

to know about double auctions, but were afraid to (bid or) ask. New York, USA: Department of

Computer & Information Science, University of New York.

Pineau, Joelle, Geoffrey Gordon, and Sebastian Thrun (2006). “Anytime point-based approximations

for large POMDPs.” In: Journal of Artificial Intelligence Research 27.1, pp. 335–380.

Pipattanasomporn, M., H. Feroze, and Saifur Rahman (2009). “Multi-agent systems in a distributed

smart grid: Design and implementation.” In: Power Systems Conference and Exposition, 2009.

PSCE’09. IEEE/PES. IEEE, pp. 1–8.

Pita, James et al. (2009). “Using game theory for Los Angeles airport security.” In: AI Magazine 30.1,

pp. 43–57.


Powers, Rob and Yoav Shoham (2005). “Learning against opponents with bounded memory.” In: IJ-

CAI’05: Proceedings of the 19th International Joint Conference on Artificial Intelligence. Edinburg,

Scotland, UK: Morgan Kaufmann Publishers Inc, pp. 817–822.

Puterman, Martin L. (1994). Markov decision processes: Discrete stochastic dynamic programming.

John Wiley & Sons, Inc.

Qu, Xia and Prashant Doshi (2015). “Improved Planning for Infinite-Horizon Interactive POMDPs

Using Probabilistic Inference.” In: Proceedings of the 14th International Conference on Autonomous

Agents and Multiagent Systems. Istanbul, Turkey, pp. 1839–1840.

Quinlan, J. Ross (1993). C4. 5: programs for machine learning. Morgan Kaufmann.

Rejeb, Lilia, Zahia Guessoum, and Rym M’Hallah (2005). “An adaptive approach for the exploration-

exploitation dilemma for learning agents.” In: Proceedings of the 4th international Central and

Eastern European conference on Multi-Agent Systems and Applications. Springer, pp. 316–325.

Richards, Mark and Eyal Amir (2006). “Opponent Modeling in Scrabble.” In: IJCAI-07: Proceed-

ings of the Twentieth International Joint Conference on Artificial Intelligence. Hyderabad, India,

pp. 1482–1487.

Risse, Mathias (2000). “What is rational about Nash equilibria?” In: Synthese 124.3, pp. 361–384.

Robbins, Herbert (1985). “Some aspects of the sequential design of experiments.” In: Herbert Robbins

Selected Papers. Springer, pp. 527–535.

Rosman, Benjamin, Majd Hawasly, and Subramanian Ramamoorthy (2015). “Bayesian Policy Reuse.”

In: arXiv.org. arXiv: 1505.00284v1 [cs.AI].

Russell, Stuart J., Peter Norvig, J.F. Canny, J.M. Malik, and D.D. Edwards (1995). Artificial intelli-

gence: a modern approach. Vol. 2. Englewood Cliffs, NJ: Prentice Hall.

Seuken, Sven and Shlomo Zilberstein (2008). “Formal models and algorithms for decentralized decision

making under uncertainty.” In: Autonomous Agents and Multi-Agent Systems 17.2, pp. 190–250.

Shachter, Ross D. (1986). “Evaluating influence diagrams.” In: Operations Research 34.6.

Shoham, Yoav and Kevin Leyton-Brown (2008). Multiagent Systems: Algorithmic, Game-Theoretic,

and Logical Foundations. Cambridge University Press.

Shoham, Yoav, Rob Powers, and T. Grenager (2007). “If multi-agent learning is the answer, what is

the question?” In: Artificial Intelligence 171.7, pp. 365–377.

Simon, Herbert A. (1955). “A behavioral model of rational choice.” In: The Quarterly Journal of

Economics 69.1, p. 99.

Sonu, Ekhlas, Yingke Chen, and Prashant Doshi (2015). “Individual Planning in Agent Populations:

Exploiting Anonymity and Frame-Action Hypergraphs.” In: ICAPS, pp. 202–210.


Stimpson, Jeffrey L. and Michael A. Goodrich (2003). “Learning To Cooperate in a Social Dilemma:

A Satisficing Approach to Bargaining.” In: Proceedings of the Twentieth International Conference

on Machine Learning, pp. 728–735.

Stone, Peter (2007). “Learning and multiagent reasoning for autonomous agents.” In: The 20th Inter-

national Joint Conference on Artificial Intelligence. Hyderabad, India, pp. 13–30.

Stone, Peter and Manuela Veloso (2000). “Multiagent Systems: A Survey from a Machine Learning

Perspective.” In: Autonomous Robots 8.3.

Sucar, L. Enrique, Roger Luis, Ron Leder, Jorge Hernandez, and Israel Sanchez (2010). “Gesture

therapy: a vision-based system for upper extremity stroke rehabilitation.” In: Annual International

Conference of the IEEE Engineering in Medicine and Biology Society. Buenos Aires, Argentina:

IEEE, pp. 3690–3693.

Sutton, Richard S. and Andrew G. Barto (1998). Reinforcement Learning An Introduction. Cambridge,

MA: MIT Press.

Sykulski, Adam M., Archie C. Chapman, Enrique Munoz de Cote, and Nicholas R. Jennings (2010).

“EA2: The Winning Strategy for the Inaugural Lemonade Stand Game Tournament.” In: Proceed-

ing of the 19th European Conference on Artificial Intelligence. IOS Press, pp. 209–214.

Tesauro, Gerald (2003). “Extending Q-learning to general adaptive multi-agent systems.” In: Advances

in Neural Information Processing Systems 16, pp. 871–878.

Tesauro, Gerald and Jonathan L. Bredin (2002). “Strategic sequential bidding in auctions using dy-

namic programming.” In: Proceedings of the 1st International Joint Conference on Autonomous

Agents and Multiagent Systems. ACM Request Permissions.

Tversky, Amos and Daniel Kahneman (1974). “Judgment under uncertainty: Heuristics and biases.”

In: Science 185.4157, pp. 1124–1131.

Urieli, Daniel and Peter Stone (2014). “TacTex’13: A Champion Adaptive Power Trading Agent.” In:

Proceedings of the Twenty-Eighth Conference on Artificial Intelligence. Quebec, Canada, pp. 465–

471.

Valogianni, Konstantina, Wolfgang Ketter, and John Collins (2015). “A Multiagent Approach to

Variable-Rate Electric Vehicle Charging Coordination.” In: Proceedings of the 14th International

Conference on Autonomous Agents and Multiagent Systems. Istanbul, Turkey, pp. 1131–1139.

Watkins, John (1989). “Learning from delayed rewards.” PhD thesis. Cambridge, UK: King’s College.

Widmer, Gerhard and Miroslav Kubat (1996). “Learning in the presence of concept drift and hidden

contexts.” In: Machine Learning 23.1, pp. 69–101.

Wilson, Edwin B. (1927). “Probable Inference, the Law of Succesion, and Statistical Inference.” In:

Journal of the American Statistical Association 22.158, pp. 209–212.

Wooldridge, Michael (2009). An Introduction to MultiAgent Systems. 2nd. Wiley Publishing.


Wright, James Robert and Kevin Leyton-Brown (2010). “Beyond equilibrium: Predicting human be-

havior in normal-form games.” In: Twenty-Fourth Conference on Artificial Intelligence (AAAI-10).

Atlanta, Georgia, pp. 901–907.

Wunder, Michael, Michael L. Littman, and Matthew Stone (2009). “Communication, Credibility and

Negotiation Using a Cognitive Hierarchy Model.” In: AAMAS Workshop# 19: Multi-agent Sequen-

tial Decision Making 2009. Budapest, Hungary, pp. 73–80.

Wunder, Michael, Michael Kaisers, John Robert Yaros, and Michael L. Littman (2011). “Using iterated

reasoning to predict opponent strategies.” In: Proceedings of 10th Int. Conference on Autonomous

Agents and Multiagent Systems. Taipei, Taiwan, pp. 593–600.

Wurman, Peter R, William E. Walsh, and M.P. Wellman (1998). “Flexible double auctions for elec-

tronic commerce: theory and implementation.” In: Decision Support Systems 24.1, pp. 17–27.

Zinkevich, Martin A., Michael Bowling, and Michael Wunder (2011). “The lemonade stand game

competition: solving unsolvable games.” In: SIGecom Exchanges 10.1, pp. 35–38.


Appendix A

PowerTAC

In this appendix we review the PowerTAC competition, which has been used to perform research in multiagent systems and which we therefore used as a testbed for experiments in our research.

A.1 Energy markets

New trends in energy generation and distribution are being implemented around the world. This has led to the deregulation of energy supply and demand, allowing producers to sell energy to consumers by using a broker as an intermediary, effectively creating a market. Such markets have led to the development of diverse energy trading strategies, most of which remain difficult to optimize due to the inherent complexity of the markets (rich state spaces, high dimensionality, and partial observability [Urieli and Stone, 2014]), which means that straightforward game-theoretic, machine learning, and artificial intelligence techniques fall short.

A.2 PowerTAC

The PowerTAC simulator models a retail electrical energy market, where competing brokers (agents) are challenged to maximize their profits. Brokers take actions in different markets at each timestep, which simulates one hour of real time: (i) the tariff market, where brokers buy and sell energy by offering tariff contracts that specify a price and other characteristics such as an early withdrawal fee, a subscription bonus, and an expiration time; (ii) the wholesale market, where brokers buy and sell quantities of energy for future delivery; and (iii) the balancing market, which is responsible for the real-time balance of supply and demand on the distribution grid.


Figure A.1: Partial representation of the MDP broker in PowerTAC; ovals represent states (timeslots for future delivery: S24, ..., S1, S0, and a Success state), arrows represent transition probabilities and rewards.

A.3 Periodic double auctions

In this thesis we focus on the wholesale market, which operates as a periodic double auction (PDA)

[Wurman et al., 1998] and is similar to real world markets (e.g., Nord Pool in Scandinavia or FERC in

North America) [Ketter et al., 2014]. The wholesale market allows brokers to buy and sell quantities

of energy for future delivery, typically between 1 and 24 hours in the future. A PDA is a mechanism to

match buyers and sellers of a particular good, and to determine the prices at which trades are executed.

At any point in time, traders can place limit orders in the form of bids (buy orders) and asks (sell

orders) [Parsons et al., 2006]. Orders are maintained in an orderbook. In a PDA, demand and supply curves are inferred from the bids and asks of each orderbook (one for each enabled timeslot), and the clearing price is determined by the intersection of the two curves, which is the price that maximizes turnover [Ketter et al., 2014].
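To make the clearing rule concrete, here is a minimal sketch of uniform-price clearing for a single orderbook, assuming orders are given as simple (price, quantity) pairs; it is an illustration of the idea, not the PowerTAC market-clearing code.

    # Minimal sketch: the clearing price is where inferred supply and demand
    # cross, i.e. the price that maximizes the traded volume.
    def clear_orderbook(bids, asks):
        """bids/asks: lists of (limit_price, quantity). Returns (price, volume)."""
        candidate_prices = sorted({p for p, _ in bids} | {p for p, _ in asks})
        best_price, best_volume = None, 0.0
        for p in candidate_prices:
            demand = sum(q for price, q in bids if price >= p)  # buyers willing to pay p
            supply = sum(q for price, q in asks if price <= p)  # sellers accepting p
            volume = min(demand, supply)
            if volume > best_volume:
                best_price, best_volume = p, volume
        return best_price, best_volume  # (None, 0.0) if the book does not cross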

A.4 TacTex

The champion agent of the inaugural PowerTAC competition in 2013 was TacTex [Urieli and Stone, 2014], which uses an approach based on reinforcement learning for the wholesale market and prediction methods for the tariff market. TacTex uses a modified version of Tesauro's representation of the wholesale market [Tesauro and Bredin, 2002], where states represent the agent's holdings and transition probabilities are estimated from the market event history. TacTex models the bidding process as an MDP, starting each game with no data and learning to improve its estimates and bids online. At each timeslot, it uses dynamic programming to solve the MDP with all the data collected thus far, providing a limit price to bid for future timeslots. Even though TacTex learns to bid quickly, it is not capable of adapting to non-stationary opponents. This is a large drawback, as many real-life agents are non-stationary and change strategies over time; note that agents may change slowly or switch drastically from one strategy to another (either to confuse the opponent or simply as a best-response measure).

In PowerTAC, a wholesale broker can place an order for buying or selling energy by issuing a tuple ⟨t, e, p⟩: in timeslot t the broker makes a bid/ask for an amount of energy e (expressed in


megawatt-hours, MWh) at a limit price p for buying/selling. At each timeslot, PowerTAC provides (as public information) the market clearing prices and the cleared volume. It also provides, as private information, the broker's successful bids and asks. A bid/ask can be partially or fully cleared. When a bid is fully cleared the total amount of energy will be delivered at the requested timeslot; if a bid is partially cleared, the offer was accepted but there was not enough energy and only a fraction of the requested energy will be delivered. In order to maintain a clear view of the problem and its solution, similar to TacTex, we restrict our setting to brokers that only place bids for buying energy.

TacTex models the problem as an MDP, depicted in Figure A.1; in this work we adopt the same model for DriftER in order to compare the approaches fairly in the experimental section.

States s ∈ {0, 1, . . . , n_bids, success} represent the timeslots for future delivery of the bids in the market, with initial state s_0 = n_bids and terminal states s* ∈ {0, success}. Actions represent different limit prices for the buying offers in the wholesale market. Any state s_t ∈ {1, . . . , n_bids} transitions to one of two states: success if a bid is partially or fully cleared, or s_{t+1} = s_t − 1 otherwise. Note that transitions depend on the action chosen by DriftER but also on the actions (bids) chosen by the opponents, and these transitions are initially unknown to our agent. The reward is 0 in any state s ∈ {1, . . . , n_bids}, the limit price of the successful bid in the state success, and a balancing penalty in state s = 0 (i.e., when time runs out to secure the required energy).

The solution (a policy) to this MDP defines the best limit-price order for each of the n_bids states. TacTex solves the MDP once per timeslot, submitting n_bids limit prices, one to each of the n_bids auctions. DriftER leverages the TacTex MDP formulation but is designed to handle non-stationary opponents by explicitly accounting for drift.

During this learning phase, let CT_s be the set of past successfully cleared transactions at state s (i.e., timeslot for future delivery). Note that each element a' ∈ CT_s contains information about the cleared energy and the clearing price (a'_e and a'_p respectively). To compute the probability of reaching state success from state s with action a (with limit price a_p) we use:

P^{success}_{s,a} := \frac{\sum_{a' \in CT_s,\; a'_p < a_p} a'_e}{\sum_{a' \in CT_s} a'_e}     (A.1)

where (A.1) captures the ratio between all successful past transactions cleared below the limit price a_p and all successful past transactions. Using (A.1) we compute the empirical transition function as:

T(s, a, s') = \begin{cases} P^{success}_{s,a} & \text{if } s' = success \\ 1 - P^{success}_{s,a} & \text{otherwise} \end{cases}     (A.2)

The value P^{success}_{s,a} gives the probability of a cleared transaction. If the transaction is not cleared, we transition to state s − 1 with probability 1 − P^{success}_{s,a}. Rewards are not stochastic, so no statistics need to be collected to learn the reward function.
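The following sketch shows how (A.1) and (A.2) can be computed from a list of past cleared transactions; the function and variable names are illustrative, not the DriftER/TacTex code.

    # Minimal sketch of the empirical transition estimate of equations (A.1)-(A.2).
    def p_success(cleared_transactions, limit_price):
        """cleared_transactions: list of (clearing_price, cleared_energy) at state s."""
        total_energy = sum(e for _, e in cleared_transactions)
        if total_energy == 0:
            return 0.0  # no data yet for this timeslot
        below = sum(e for p, e in cleared_transactions if p < limit_price)
        return below / total_energy                       # equation (A.1)

    def transition_probability(cleared_transactions, limit_price, cleared):
        p = p_success(cleared_transactions, limit_price)
        return p if cleared else 1.0 - p                  # equation (A.2)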


Appendix B

General-sum Games

We tested our proposals in randomly generated general-sum games which have at least one pure and one mixed Nash equilibrium. The Gamut library [Nudelman et al., 2004] was used to generate the games and Gambit [McKelvey et al., 2014] was used to compute the Nash strategies. The values and characteristics of the games are shown in Tables B.1 and B.2.


Table B.1: Games used in the experiments. They have at least one pure and one mixed Nash equilibrium.

Learning agents will play rows and opponents will play columns.

(a) Game 1
         B1          B2          B3
A1   -29, -41     93, -56     56,  -4
A2   -17, -87    -70, -79    -44, -82
A3    50,  49    -75,  76     27, -56

(b) Game 2
         B1          B2          B3
A1    37,  35     45,  76     67,  43
A2    33,  94     38,  74    -94, -72
A3    83, -61    -95,  -5     99,  32

(c) Game 3
         B1          B2          B3          B4
A1   -50,  20     73,  -7     69, -45     83,  22
A2   -51,  89     88,  96    -55,  40    -26, -92
A3   -58,  58    -41,  14     66, -46      0, -80
A4   -62,  52    -94, -52    -40, -46    -94, -84

(d) Game 4
         B1          B2          B3          B4          B5
A1     5,   7     32,  78      1,   7    -55, -79     -1,   0
A2    89,  96     81, -45    -26,  61     73,  78    -45, -68
A3    29,  92     90, -53    -53, -46     45, -83     11,  20
A4   -89,  14     94, -99    -26, -10     89,  22     67, -19
A5    35,  84     67,  34     75,  35     -6,  33    -16, -62

Table B.2: Pure and mixed Nash strategies for column player of the games used.

Game Id   # actions   Pure Nash         Mixed Nash
Game 1    3           [0, 0, 1]         [0.680, 0.319, 0]
Game 2    3           [0, 1, 0]         [0.0, 0.186, 0.813]
Game 3    4           [0, 1, 0, 0]      [0.0, 0.879, 0.0, 0.120]
Game 4    5           [1, 0, 0, 0, 0]   [0.082, 0.0, 0.0, 0.917, 0.0]


Appendix C

Extra Experiments

C.1 HM-MDPs training and performance experiments

In order to evaluate the robustness of HM-MDPs, we evaluated the learned policy under different switching times. We used 150, 250 and 350 as the rounds at which the opponent switches from strategy1 to strategy2; the duration of the games in the evaluation phase was 500 stages. The training phase directly affects the quality of the model [Choi et al., 1999], so we evaluated to what extent the variation in training size could affect the total rewards. For this reason, the training size was varied with tsize = {100, 500, 2000} games.

We present the average results, with standard deviations over all values of tsize, for each switching time in Table C.1 under the "Same model" columns. AvgR(A) presents the average rewards for the learning agent, and AvgR(Opp) presents the average rewards for the switching opponent.

Some conclusions that can be drawn are:

• The round at which the opponent switches between strategies does not impact the results. There is only a variation of 0.04 on average between the best and worst results.

• The HM-MDP agent shows a high standard deviation against TFT-Bully. This happens because two types of behaviour appeared in the experiments: in some cases the cooperate-cooperate cycle happened with TFT and a defect-defect cycle with Bully. However, in other cases the agent defected against TFT and started the defect-defect cycle for the rest of the repeated game. This suboptimal behaviour appeared when the training phase was small (100 steps).

• In some cases a suboptimal cycle also appeared against TFT-Pavlov. Some policies get stuck in a C-D, D-C cycle against TFT, and this cycle reduced the accumulated rewards, since C-C was the optimal play against TFT. It was only when the opponent changed to Pavlov that the C-C cycle appeared.


In general, when the training size is small, suboptimal behaviour can appear, which affects the total rewards.

As we mentioned earlier, HM-MDPs need to fix the number of modes when learning. In the previous experiments opponents always use two strategies, but there are three different strategies available, so we modified the first experiment in order to have different strategies in the training and evaluation phases. The motivation is that it can happen that during the training phase the opponent does not use all the strategies, and therefore the learned HM-MDP is not complete. If during evaluation a new strategy is used (which is not known to the HM-MDP) this will affect the results. Thus, in the next experiment the training opponent consists of strategy1-strategy2 and the evaluation opponent consists of strategy1-strategy3. The results of this experiment are presented in Table C.1 under the "Different model" columns.

From the results it is easy to note that HM-MDPs consistently decrease their average reward when using a different model for evaluation than for training. On average the decrease is 0.56 ± 0.27. In conclusion, when HM-MDPs can explore all models in the training phase they obtain good results. However, if they do not learn the complete set of opponent strategies they cannot compute an optimal policy against the opponent and thus receive lower rewards.

To solve the learned HM-MDP, a transformation to a POMDP was performed. In Tables C.2 and C.3 we present data regarding the time and convergence statistics when solving the POMDPs (all experiments were performed on a MacBook Pro with an Intel Core 2 Duo at 2.16 GHz and 8 GB of RAM). In particular, we present the iteration at which the policy converged (final horizon), the number of α vectors generated and the time (in seconds) needed to solve the POMDP. Some conclusions from the experiments are:

• As the number of learning steps increased, the number of iterations needed to solve the POMDP also increased. In some cases (TFT-Pavlov, Pavlov-TFT, Bully-TFT and Bully-Pavlov) 500 iterations were not enough to converge.

• The policies that contain a high number of vectors (Bully-TFT or TFT-Bully) have a larger

solving time. Conversely, if the number of vectors is low (Pavlov-Bully) this yields lower solving

times.

• The number of vectors is mainly determined by the type of opponent: TFT-Bully and Bully-TFT have a high number of vectors in all cases. On the contrary, Pavlov-Bully and Bully-Pavlov have a low number of vectors for all training sizes. TFT-Pavlov and Pavlov-TFT started with a low number of vectors, but as the training phase size increased, so did the number of vectors.

• Too much interaction increases the time needed to solve the POMDP.


Table C.1: Average rewards for the HM-MDP agent (AvgR(A)) and for the opponent (AvgR(Opp)) with standard deviations. The "Change at" column presents the round where the opponent switches strategies. The evaluation phase consisted of 500 steps.

                          Same model                    Different model
Opponent       Change at  AvgR(A)       AvgR(Opp)       AvgR(A)       AvgR(Opp)
TFT-Pavlov     150        2.86 ± 0.10   2.97 ± 0.10     2.09 ± 0.13   1.46 ± 1.01
TFT-Bully      150        1.39 ± 0.22   1.44 ± 0.28     1.08 ± 0.10   3.01 ± 0.46
Pavlov-TFT     150        2.88 ± 0.06   2.94 ± 0.13     2.01 ± 0.37   1.89 ± 0.58
Pavlov-Bully   150        1.56 ± 0.05   1.50 ± 0.29     1.04 ± 0.14   3.02 ± 0.58
Bully-TFT      150        1.99 ± 0.35   2.23 ± 0.49     1.28 ± 0.50   1.45 ± 0.88
Bully-Pavlov   150        2.24 ± 0.09   2.06 ± 0.58     2.01 ± 0.04   0.83 ± 0.15
Average        -          2.15 ± 0.14   2.19 ± 0.31     1.59 ± 0.21   1.94 ± 0.61
TFT-Pavlov     250        2.87 ± 0.09   2.97 ± 0.10     2.05 ± 0.27   1.59 ± 0.99
TFT-Bully      250        1.63 ± 0.39   1.67 ± 0.45     1.56 ± 0.15   2.96 ± 0.43
Pavlov-TFT     250        2.84 ± 0.13   2.92 ± 0.24     2.29 ± 0.30   2.12 ± 0.68
Pavlov-Bully   250        1.92 ± 0.06   1.76 ± 0.41     1.55 ± 0.11   2.82 ± 0.62
Bully-TFT      250        1.65 ± 0.27   2.02 ± 0.48     1.17 ± 0.27   1.43 ± 0.88
Bully-Pavlov   250        1.89 ± 0.12   1.87 ± 0.53     1.72 ± 0.05   0.97 ± 0.14
Average        -          2.13 ± 0.18   2.20 ± 0.37     1.72 ± 0.19   1.98 ± 0.62
TFT-Pavlov     350        2.89 ± 0.09   2.97 ± 0.09     1.99 ± 0.52   1.71 ± 0.98
TFT-Bully      350        2.04 ± 0.37   1.89 ± 0.62     2.04 ± 0.24   2.90 ± 0.41
Pavlov-TFT     350        2.80 ± 0.18   2.89 ± 0.33     2.56 ± 0.26   2.33 ± 0.80
Pavlov-Bully   350        2.28 ± 0.08   2.03 ± 0.52     2.05 ± 0.09   2.65 ± 0.68
Bully-TFT      350        1.33 ± 0.19   1.80 ± 0.48     1.06 ± 0.05   1.43 ± 0.88
Bully-Pavlov   350        1.5 ± 0.10    1.62 ± 0.47     1.39 ± 0.05   1.11 ± 0.18
Average        -          2.11 ± 0.17   2.20 ± 0.42     1.85 ± 0.20   2.02 ± 0.66

Table C.2: Performance measures when solving the HM-MDP as a POMDP. The final horizon is the iteration at which the policy converged; the number of α vectors and the time in seconds are also reported. We show the results while varying the training phase size.

Final Horizon       # α Vectors       Time (s)       Learning steps

232.17 ± 21.17 71.17 ± 108.30 95.98 ± 195.10 100

320.47 ± 110.66 260.53 ± 189.76 434.74 ± 574.83 500

457.08 ± 60.69 717.83 ± 426.39 2297.27 ± 2193.54 2000


Table C.3: Performance measures when solving the HM-MDP as a POMDP against different non-stationary opponents. The final horizon is the iteration at which the policy converged; the number of α vectors and the time in seconds are also reported.

Opponent       Final Horizon       # α Vectors       Time (s)

TFT-Pavlov 297.24 ± 57.04 176.50 ± 34.36 464.61 ± 116.21

TFT-Bully 318.06 ± 118.34 500.74 ± 707.46 972.31 ± 1650.27

Pavlov-TFT 267.63 ± 74.74 180.06 ± 101.78 685.61 ± 749.39

Pavlov-Bully 338.29 ± 92.55 71.37 ± 54.64 24.81 ± 35.70

Bully-TFT 352.09 ± 78.66 1022.27 ± 607.91 2123.02 ± 2352.17

Bully-Pavlov 309.86 ± 59.62 72.64 ± 54.38 297.23 ± 413.67

Average 313.86 ± 80.16 337.26 ± 260.09 761.26 ± 886.23

Figure C.1: Fraction of updates when learning an opponent model using R-max exploration against (a) a pure strategy and (b) a mixed strategy in the BoS game. Results are the average of 10 interactions.


C.2 R-max exploration against pure and mixed strategies

Our approaches need a phase in which to learn the transition function that describes the opponent's dynamics. MDP-CL uses random exploration in a fixed window; in contrast, R-max# and DriftER are based on R-max, which guarantees an efficient exploration. When using R-max we need to set one extra parameter, m, which is the number of visits needed for a state to be considered known.
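As an illustration of this bookkeeping (not the thesis code), the sketch below keeps visit counts per state-action pair, treats a pair as known only after m visits, and returns the optimistic reward Rmax for pairs that are still unknown. The class name RmaxModel, and the idea of flagging the moment a pair becomes known as one plausible way to count the model updates discussed below, are illustrative assumptions.

from collections import defaultdict

class RmaxModel:
    # Minimal R-max-style bookkeeping: a (state, action) pair is "known" only
    # after it has been visited m times; until then it receives the optimistic
    # reward rmax, which drives the agent to explore it.
    def __init__(self, m, rmax):
        self.m = m
        self.rmax = rmax
        self.counts = defaultdict(int)         # visit counts n(s, a)
        self.trans_counts = defaultdict(int)   # transition counts n(s, a, s2)
        self.reward_sums = defaultdict(float)  # accumulated reward for (s, a)

    def known(self, s, a):
        return self.counts[(s, a)] >= self.m

    def update(self, s, a, r, s_next):
        # Record one experience. Returns True at the moment (s, a) becomes
        # known, i.e. when the learned model changes and the policy would be
        # recomputed (an assumption about how model updates could be counted).
        self.counts[(s, a)] += 1
        self.trans_counts[(s, a, s_next)] += 1
        self.reward_sums[(s, a)] += r
        return self.counts[(s, a)] == self.m

    def estimate(self, s, a, s_next):
        # Empirical transition probability and mean reward once (s, a) is known;
        # for unknown pairs the optimistic reward rmax is returned instead.
        n = self.counts[(s, a)]
        if n < self.m:
            return None, self.rmax
        return self.trans_counts[(s, a, s_next)] / n, self.reward_sums[(s, a)] / n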

We analyze the behavior of a learning agent that uses R-max exploration against opponents that use pure (deterministic) and mixed (stochastic) strategies. We measure how often the model is updated during the repeated game while keeping track of the rewards.


Table C.4: R-max learning against (a) pure and (b) mixed strategies in the battle of the sexes game. Results

are the average of 100 iterations. Each game consists of 1500 rounds.

(a) Pure strategy

m     Average Rewards

2 53.843 ± 0.017

5 53.646 ± 0.021

8 53.442 ± 0.024

10 53.318 ± 0.025

12 53.173 ± 0.028

15 52.970 ± 0.023

21 52.461 ± 0.015

(b) Mixed strategy

m     Average Rewards

2 57.712 ± 11.077

5 60.529 ± 7.077

8 61.832 ± 4.827

10 62.145 ± 4.139

12 61.859 ± 4.068

15 62.145 ± 2.354

21 60.998 ± 2.121


The BoS game was selected with v1 = 100 and v2 = 54. In this case the opponents are stationary and the game consists of 1500 rounds. Results are the average of 10 iterations. In Figure C.1 we depict the percentage of model updates while playing against (a) a pure Nash strategy [1.0, 0.0] and (b) a mixed Nash strategy [0.65, 0.35], using different values for the parameter m. In Table C.4 we present the respective average rewards for those values of m. From these results we note that learning against a pure strategy is much faster: using m = 2 yields the best scores, requiring about 20 rounds to learn a complete model. In contrast, when learning against a mixed strategy the best scores were obtained with m = 10 and m = 15, which means that it takes more than 200 rounds to learn a model that achieves the maximum reward.
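For reference, the mixed Nash strategy [0.65, 0.35] quoted above is consistent with the standard BoS parameterization in which coordinating on a player's preferred outcome pays that player v1, coordinating on the other outcome pays v2, and miscoordination pays 0; under that assumption (a sketch, not necessarily the exact payoff matrix defined earlier in this thesis) the indifference condition gives

\[
p^{*}\, v_2 = (1 - p^{*})\, v_1
\quad\Longrightarrow\quad
p^{*} = \frac{v_1}{v_1 + v_2} = \frac{100}{100 + 54} \approx 0.65,
\]

where p* is the probability that the mixing player places on its preferred action, which leaves the other player indifferent between its two actions.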

Note that the best value of m is different for pure and mixed strategies, which must be taken into account when facing non-stationary opponents that use both kinds of strategy. Also note that there is a trade-off: a large value of m ensures that a correct model is learned (thus obtaining a value closer to the maximum reward), but it takes more time before we can be certain the model has been learned correctly and can start exploiting it. For example, m = 15 provides the maximum reward with a lower standard deviation but needs approximately 300 rounds to learn a model completely, whereas m = 10 yields on average the same reward and needs only 200 rounds to learn the model.
