
Multi-Agent Reinforcement Learning: An Overview

Marcello Restelli

November 12, 2014


Outline

1 Introduction to Multi-Agent Reinforcement Learning
  - Reinforcement Learning
  - MARL vs RL
  - MARL vs Game Theory

2 MARL algorithms
  - Best-Response Learning
  - Equilibrium Learners
    - Team Games
    - Zero-sum Games
    - General-sum Games



Game Theory in Computer Science

[Diagram: Game Theory meets Computer Science] Topics at their intersection:

- Computing Solution Concepts
- Compact Game Representations
- Mechanism Design
- Multi-agent Learning


Some Naming Conventions

Player = Agent
Payoff = Reward
Value = Utility
Matrix = Strategic form = Normal form
Strategy = Policy
Pure strategy = Deterministic policy
Mixed strategy = Stochastic policy



What is Multi-Agent Learning?

A difficult question... one we will try to answer in these slides.

It involves:
- Multiple agents
- Self-interest
- Concurrent learning

It is strictly related to:
- Game Theory
- Reinforcement Learning
- Multi-agent Systems

Shoham et al., 2002-2007: "If multi-agent is the answer, what is the question?"
Stone, 2007: "Multi-agent learning is not the answer, it is the question!"


Which Applications?

- Distributed vehicle regulation
- Air traffic control
- Network management and routing
- Electricity distribution management
- Supply chains
- Job scheduling
- Computer games


Multi-agent Learning and RL

- We are interested in learning in situations where multiple decision makers repeatedly interact
- Among the different machine learning paradigms, reinforcement learning is the best suited to approach such problems
- We will mainly focus on multi-agent RL, even if other (game-theoretic) learning approaches will be mentioned:
  - Fictitious play
  - No-regret learning


History of RL


Psychology, trial and error:
- Pavlov (1903): Classical conditioning
- Thorndike (1905): Law of effect
- Minsky (1961): Credit-assignment problem

Optimal Control:
- Bellman (1957): Dynamic Programming
- Howard (1960): Policy Iteration

Reinforcement Learning:
- Samuel (1956): Checkers
- Sutton & Barto (1984): Temporal Difference
- Watkins (1989): Q-learning
- Tesauro (1992): TD-Gammon
- Littman (1994): minimax-Q


The Agent-Environment Interface

- The agent interacts at discrete time steps t = 0, 1, 2, ...
- Full observability: the agent directly observes the environment state
- Formally, this is a Markov Decision Process (MDP)


Markov Decision Processes

An MDP is formalized as a 4-tuple 〈S, A, P, R〉:
- S: the set of states, i.e., what the agent knows (complete observability)
- A: the set of actions, i.e., what the agent can do (it may depend on the state)
- P: the state transition model, P : S × A × S → [0, 1]
- R: the reward function, R : S × A × S → ℝ
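To make the tuple concrete, here is a minimal sketch of a small finite MDP as NumPy arrays. It is an illustration, not from the slides: the two-state chain and the names n_states, n_actions, P, R are all assumptions.

    import numpy as np

    n_states, n_actions = 2, 2
    # P[s, a, s'] = probability of reaching s' after taking action a in state s
    P = np.zeros((n_states, n_actions, n_states))
    P[0, 0] = [1.0, 0.0]    # action 0 in state 0: stay put
    P[0, 1] = [0.2, 0.8]    # action 1 in state 0: move to state 1 w.p. 0.8
    P[1, 0] = [0.9, 0.1]    # action 0 in state 1: usually fall back to state 0
    P[1, 1] = [0.0, 1.0]    # action 1 in state 1: stay put
    # R[s, a, s'] = reward received on the transition (s, a, s')
    R = np.zeros((n_states, n_actions, n_states))
    R[:, :, 1] = 1.0        # any transition that enters state 1 pays +1
    # sanity check: every P(.|s, a) must be a probability distribution
    assert np.allclose(P.sum(axis=2), 1.0)

The same three-axis array layout is reused in the sketches that follow.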


Markov Assumption

Let s_t be the random variable for the state at time t:

P(s_t | a_{t-1}, s_{t-1}, ..., a_0, s_0) = P(s_t | a_{t-1}, s_{t-1})

Markov is a special kind of conditional independence: the future is independent of the past given the current state.


The Goal: a Policy

Find a policy that maximizes some cumulative function of the rewards.

What is a policy?
- a mapping from states to distributions over actions
- deterministic vs stochastic
- stationary vs non-stationary

Cost criteria:
- finite horizon
- infinite horizon:
  - average reward
  - discounted reward: R_t = Σ_{k=0}^∞ γ^k r_{t+k} (see the sketch below)
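As a quick numeric check of the discounted criterion, a small sketch (an illustration, not from the slides; the function name is an assumption) that sums γ^k r_{t+k} over a finite reward sequence, treating all later rewards as zero:

    def discounted_return(rewards, gamma):
        """Compute sum_{k>=0} gamma^k * r_{t+k} for a finite reward list."""
        return sum((gamma ** k) * r for k, r in enumerate(rewards))

    # three rewards of 1 with gamma = 0.9: 1 + 0.9 + 0.81 = 2.71
    print(discounted_return([1.0, 1.0, 1.0], 0.9))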


Value Functions

- MDP + stationary policy ⇒ Markov chain
- Given a policy π, it is possible to define the utility of each state: Policy Evaluation (sketched below)
- Value function (Bellman equation):

  V^π(s) = Σ_{a∈A} π(a|s) Σ_{s'∈S} P(s'|s,a) (R(s,a,s') + γ V^π(s'))

- For control purposes, rather than the value of each state, it is easier to consider the value of each action in each state
- Action-value function (Bellman equation):

  Q^π(s,a) = Σ_{s'∈S} P(s'|s,a) (R(s,a,s') + γ V^π(s'))
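A minimal policy-evaluation sketch that iterates the Bellman expectation backup above until convergence. The arrays are the same illustrative two-state ones as in the MDP sketch; the uniform policy and the tolerance are likewise assumptions.

    import numpy as np

    # the same illustrative two-state MDP as in the earlier sketch
    P = np.array([[[1.0, 0.0], [0.2, 0.8]],
                  [[0.9, 0.1], [0.0, 1.0]]])   # P[s, a, s']
    R = np.zeros((2, 2, 2))
    R[:, :, 1] = 1.0                           # +1 for entering state 1

    def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
        """Iterate V(s) <- sum_a pi(a|s) sum_s' P(s'|s,a)(R(s,a,s') + gamma V(s'))."""
        V = np.zeros(P.shape[0])
        while True:
            V_new = np.einsum("sa,saq,saq->s", pi, P, R + gamma * V)
            if np.max(np.abs(V_new - V)) < tol:
                return V_new
            V = V_new

    pi_uniform = np.full((2, 2), 0.5)          # uniform random policy, pi(a|s) = 1/2
    print(policy_evaluation(P, R, pi_uniform))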


Optimal Value Functions

Optimal Bellman equations (Bellman, 1957):

V*(s) = max_a Σ_{s'∈S} P(s'|s,a) (R(s,a,s') + γ V*(s'))

Q*(s,a) = Σ_{s'∈S} P(s'|s,a) (R(s,a,s') + γ max_{a'} Q*(s',a'))

- For each MDP there is at least one deterministic optimal policy
- All optimal policies have the same value function V*
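A value-iteration sketch for the optimality equations above, again on the illustrative two-state arrays (the tolerance and variable names are assumptions). The greedy argmax at the end extracts one deterministic optimal policy:

    import numpy as np

    P = np.array([[[1.0, 0.0], [0.2, 0.8]],
                  [[0.9, 0.1], [0.0, 1.0]]])   # P[s, a, s']
    R = np.zeros((2, 2, 2))
    R[:, :, 1] = 1.0

    def value_iteration(P, R, gamma=0.9, tol=1e-8):
        """Iterate the Bellman optimality backup; return V* and a greedy policy."""
        V = np.zeros(P.shape[0])
        while True:
            # Q(s, a) = sum_s' P(s'|s,a) (R(s,a,s') + gamma V(s'))
            Q = np.einsum("saq,saq->sa", P, R + gamma * V)
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) < tol:
                return V_new, Q.argmax(axis=1)   # one deterministic optimal policy
            V = V_new

    V_star, pi_star = value_iteration(P, R)
    print(V_star, pi_star)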


Solving an MDP

- Policy search
  - brute force is infeasible (there are |A|^|S| deterministic policies)
  - policy gradient, stochastic optimization approaches
- Dynamic Programming (DP)
  - Value Iteration
  - Policy Iteration (see the sketch after this list)
- Linear Programming
  - LP worst-case convergence guarantees are better than those of DP methods
  - LP methods become impractical at a much smaller number of states than DP methods do
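Since value iteration was sketched above, here is the matching policy-iteration sketch: exact evaluation of the current deterministic policy by solving the linear system (I - γ P_π) V = R_π, followed by greedy improvement. The arrays are the same illustrative ones; the rest is an assumption for illustration.

    import numpy as np

    P = np.array([[[1.0, 0.0], [0.2, 0.8]],
                  [[0.9, 0.1], [0.0, 1.0]]])   # P[s, a, s']
    R = np.zeros((2, 2, 2))
    R[:, :, 1] = 1.0

    def policy_iteration(P, R, gamma=0.9):
        """Alternate exact policy evaluation and greedy improvement."""
        n_s = P.shape[0]
        pi = np.zeros(n_s, dtype=int)          # arbitrary deterministic start policy
        while True:
            # evaluation: V_pi solves (I - gamma P_pi) V = R_pi
            P_pi = P[np.arange(n_s), pi]                        # row s is P(.|s, pi(s))
            R_pi = (P_pi * R[np.arange(n_s), pi]).sum(axis=1)   # expected immediate reward
            V = np.linalg.solve(np.eye(n_s) - gamma * P_pi, R_pi)
            # improvement: act greedily with respect to the one-step lookahead
            pi_new = np.einsum("saq,saq->sa", P, R + gamma * V).argmax(axis=1)
            if np.array_equal(pi_new, pi):
                return pi, V                   # policy is stable, hence optimal
            pi = pi_new

    print(policy_iteration(P, R))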


Dynamic Programming

Dynamic Programming (DP) is a collection of algorithms for solving problems that exhibit optimal substructure.
When the transition model and the reward function are known, (offline) DP algorithms can be used to solve MDP problems.
DP assumes complete knowledge of the model and is computationally expensive.
RL algorithms have been derived from DP algorithms.
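
For instance, policy iteration alternates exact evaluation of the current policy (a linear solve, possible only because the model is known) with greedy improvement. A sketch under the same assumptions and array layout as the value-iteration example above:

    import numpy as np

    def policy_iteration(P, R, gamma=0.95):
        """Policy iteration for a tabular MDP with a known model.

        P: shape (S, A, S); R: shape (S, A). Returns (V, policy).
        """
        n_states = R.shape[0]
        policy = np.zeros(n_states, dtype=int)
        while True:
            # Exact policy evaluation: solve (I - gamma * P_pi) V = R_pi
            P_pi = P[np.arange(n_states), policy]        # shape (S, S)
            R_pi = R[np.arange(n_states), policy]        # shape (S,)
            V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
            # Greedy policy improvement
            new_policy = (R + gamma * (P @ V)).argmax(axis=1)
            if np.array_equal(new_policy, policy):
                return V, policy
            policy = new_policy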

RL vs DP

RL methods are used when the transition model or the reward function is unknown.
Through repeated interactions, the agent estimates the utility of each state.
Two approaches:
- Model-based
- Model-free (e.g., Q-learning)
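
A hedged sketch of the model-based route: estimate the transition model and the rewards from experience by maximum likelihood, then plan in the estimated model with a DP method such as value iteration. Names and data layout are illustrative.

    import numpy as np

    def estimate_model(transitions, n_states, n_actions):
        """Maximum-likelihood model estimate from (s, a, r, s') samples.

        Unvisited state-action pairs keep all-zero transition rows, so in
        practice they need an optimistic or uniform default before planning.
        """
        counts = np.zeros((n_states, n_actions, n_states))
        reward_sum = np.zeros((n_states, n_actions))
        for s, a, r, s_next in transitions:
            counts[s, a, s_next] += 1
            reward_sum[s, a] += r
        visits = np.maximum(counts.sum(axis=2), 1)   # avoid division by zero
        return counts / visits[:, :, None], reward_sum / visits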

Q-learning (Watkins, '89)

Q-learning is the most popular RL algorithm:

    Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α (r_t + γ max_a Q_t(s_{t+1}, a) − Q_t(s_t, a_t))

or, equivalently,

    Q_{t+1}(s_t, a_t) = (1 − α) Q_t(s_t, a_t) + α (r_t + γ max_a Q_t(s_{t+1}, a))

It is an off-policy TD algorithm and is simple to implement.
If all state-action pairs are tried infinitely often and the learning rate decreases appropriately, Q-learning converges to the optimal solution.
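
The update rule above translates directly into code; a minimal tabular sketch (the exploration policy that picks a_t is kept separate on purpose, since Q-learning is off-policy):

    import numpy as np

    def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
        """One tabular Q-learning step; Q has shape (S, A) and is updated in place."""
        td_target = r + gamma * Q[s_next].max()   # r_t + gamma * max_a Q(s_{t+1}, a)
        Q[s, a] += alpha * (td_target - Q[s, a])
        return Q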

Advanced Topics in RL

- High-dimensional problems
- Continuous MDPs
- Partially observable MDPs
- Multi-Objective MDPs
- Inverse RL
- Transfer of Knowledge
- Exploration vs Exploitation
- Multi-agent learning

Exploration vs Exploitation

To accumulate high rewards, an agent needs to exploit actions that have been tried in the past and are known to be effective...
... but it also has to explore other actions in order to improve.
The dilemma is that both exploration and exploitation are necessary.
Many techniques have been studied (two are sketched below):

- ε-greedy
- Boltzmann: π(s, a) = e^{Q(s,a)/T} / Σ_{a'∈A} e^{Q(s,a')/T}
- More efficient techniques (multi-armed bandits)
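
A hedged sketch of the two techniques in code (names illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    def epsilon_greedy(Q_row, epsilon=0.1):
        """With probability epsilon pick a random action, otherwise the greedy one.
        Q_row: action values for the current state, shape (A,)."""
        if rng.random() < epsilon:
            return int(rng.integers(len(Q_row)))
        return int(np.argmax(Q_row))

    def boltzmann(Q_row, T=1.0):
        """Sample from pi(a) proportional to exp(Q(s,a)/T); the temperature T trades
        exploration (high T) against exploitation (low T). Subtracting the max
        before exponentiating avoids numerical overflow."""
        probs = np.exp((Q_row - Q_row.max()) / T)
        probs /= probs.sum()
        return int(rng.choice(len(Q_row), p=probs))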

Outline

1 Introduction to Multi-Agent Reinforcement Learning
   - Reinforcement Learning
   - MARL vs RL
   - MARL vs Game Theory
2 MARL algorithms
   - Best-Response Learning
   - Equilibrium Learners
     - Team Games
     - Zero-sum Games
     - General-sum Games

How can RL be extended to MAS?

RL research is mainly focused on single-agent learning.
We need to extend the MDP framework to account for other agents with possibly different reward functions.
To do so, we will need to resort to game-theoretic concepts.

Multi-agent vs Single-agent

When do we have a MAL problem?
- When there are multiple concurrent learners
- More precisely, when some agents' policies depend on other agents' past actions

MAL is much more difficult than SAL. Why?
- Problem dimensions typically grow with the number of agents
- Non-stationarity
- "Optimal" policies can be stochastic
- Learning cannot be separated from teaching

Multi-agent vs Single-agent

What is the goal? What do the agents have to learn?
It depends on the learning strategies adopted by the other agents:
- Best response
- Equilibrium

No learning procedure is optimal against all possible opponent behaviors:
- Self-play
- Targeted optimality

Desirable properties for learning strategies:
- Safety
- Rationality
- Universal consistency / no-regret (sketched below)
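
The last property can be made concrete. A hedged sketch of how external regret is usually measured, assuming we can record, at every round, the reward each action would have earned given what the opponents actually played (names illustrative):

    import numpy as np

    def external_regret(reward_matrix, chosen):
        """External regret of a play sequence.

        reward_matrix: shape (T, A); entry [t, a] is the reward action a would
        have earned at round t. chosen: length-T array of actions actually taken.
        A no-regret learner keeps this quantity sublinear in T.
        """
        earned = reward_matrix[np.arange(len(chosen)), chosen].sum()
        best_fixed_action = reward_matrix.sum(axis=0).max()  # best in hindsight
        return best_fixed_action - earned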

Outline

1 Introduction to Multi-Agent Reinforcement Learning
   - Reinforcement Learning
   - MARL vs RL
   - MARL vs Game Theory
2 MARL algorithms
   - Best-Response Learning
   - Equilibrium Learners
     - Team Games
     - Zero-sum Games
     - General-sum Games

Matrix Games

A matrix (or strategic-form) game is a tuple ⟨n, A, R⟩:
- n: number of players
- A: joint action space, where A_i is the set of actions of player i
- R: vector of reward functions, where R_i is the reward function of player i

Matrix games are one-shot games, but learning requires repeated interactions:
- Repeated games
- Stochastic games
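
A small example of the tuple in code, using the prisoner's dilemma as the stage game (the payoff numbers are the usual illustrative ones, not from the slides):

    import numpy as np

    # Two players, action 0 = cooperate, action 1 = defect.
    R1 = np.array([[-1, -10],
                   [ 0,  -8]])   # row player's rewards, R1[a1, a2]
    R2 = R1.T                    # symmetric game: column player's rewards

    def best_response(R_own, opponent_mixed):
        """Pure best response to an opponent's mixed strategy.
        R_own[a_own, a_opp]; opponent_mixed: distribution over opponent actions."""
        return int(np.argmax(R_own @ opponent_mixed))

    print(best_response(R1, np.array([0.5, 0.5])))   # 1: defecting dominates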

A Special Case: Repeated Games

In repeated games, the same one-shot game (called the stage game) is played repeatedly.
E.g., the iterated prisoner's dilemma (a strategy for it is sketched below).
- Infinitely vs finitely repeated games
- Really, an extensive-form game

Subgame-perfect (SP) equilibria:
- One SP equilibrium is to repeatedly play some Nash equilibrium of the stage game (a stationary strategy)
- Are other equilibria possible?
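
As promised above, a sketch of a well-known non-stationary strategy for the iterated prisoner's dilemma, tit-for-tat: cooperate first, then repeat the opponent's previous move (the encoding of moves is illustrative):

    def tit_for_tat(opponent_history):
        """Cooperate on the first round, then mirror the opponent's last move."""
        return 'C' if not opponent_history else opponent_history[-1]

    def play(strategy_a, strategy_b, rounds=5):
        hist_a, hist_b = [], []              # each player's own past moves
        for _ in range(rounds):
            move_a = strategy_a(hist_b)      # each strategy sees the opponent's history
            move_b = strategy_b(hist_a)
            hist_a.append(move_a)
            hist_b.append(move_b)
        return hist_a, hist_b

    always_defect = lambda history: 'D'
    print(play(tit_for_tat, always_defect))  # (['C', 'D', 'D', 'D', 'D'], ['D', ...])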

Folk Theorem

The theorem characterizes the attainable payoffs, not the strategy profiles. Informally:

"In an infinitely repeated game, the set of average rewards attainable in equilibrium is precisely the set of pairs attainable under mixed strategies in the single stage game, with the constraint on the mixed strategies that each player's payoff is at least the amount he would receive if the other players adopted minimax strategies against him."
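
A hedged formalization of the statement (the notation is mine, not from the slides): let v_i be player i's minimax value in the stage game; the set of average-payoff vectors attainable in equilibrium in the infinitely repeated game is then the set of feasible payoffs that dominate the minimax point:

    v_i = \min_{\sigma_{-i} \in \Delta(A_{-i})} \; \max_{a_i \in A_i} R_i(a_i, \sigma_{-i}),
    \qquad
    \mathcal{E} = \Big\{ r \in \operatorname{conv}\{ R(a) : a \in A \} \;:\; r_i \ge v_i \ \text{for all } i \Big\}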

MDP + Matrix = Stochastic Games (SGs)

A stochastic (or Markov) game is a tuple ⟨n, S, A, P, R⟩:
- n: number of players
- S: set of states
- A: joint action space, A_1 × · · · × A_n
- P: state transition model
- R: vector of reward functions, one for each agent

An SG extends the MDP framework to multiple agents.
SGs with one state are called repeated games.
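
One hedged way to carry the tuple around in code (field names and array layout are illustrative):

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class StochasticGame:
        """The tuple <n, S, A, P, R> as plain arrays; actions are joint,
        one index per player."""
        n_players: int
        n_states: int
        n_actions: tuple       # (|A_1|, ..., |A_n|)
        P: np.ndarray          # shape (S, |A_1|, ..., |A_n|, S)
        R: np.ndarray          # shape (n, S, |A_1|, ..., |A_n|)

    # n_states == 1 recovers a repeated game; n_players == 1 recovers an MDP.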

Strategies in SG

Let ht = (s0, a0, s1, a1, . . . , st−1, at−1, st) denote a history of t stages of a stochastic game. The space of possible strategies is huge, but there are interesting restrictions:

Behavioral strategy: returns the probability of playing an action given a history ht
Markov strategy: a behavioral strategy in which the distribution over actions depends only on the current state
Stationary strategy: a time-independent Markov strategy
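
The three restrictions differ only in what a strategy is allowed to condition on, which is easy to see as type signatures. A sketch under an assumed representation (probabilities over action indices); all names are hypothetical:

```python
# Illustrative type signatures for the three strategy classes above.
from typing import Callable, Dict, List, Tuple

State = int
Action = int
History = List[Tuple[State, Tuple[Action, ...]]]  # (s_0, a_0), (s_1, a_1), ...

# Behavioral strategy: distribution over actions given the full history h_t
# (history so far plus the current state s_t)
BehavioralStrategy = Callable[[History, State], Dict[Action, float]]

# Markov strategy: depends only on the current state and the time step t
MarkovStrategy = Callable[[State, int], Dict[Action, float]]

# Stationary strategy: time-independent Markov strategy
StationaryStrategy = Callable[[State], Dict[Action, float]]
```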

Equilibria in SG

Markov-perfect equilibrium: a profile of Markov strategies that yields a Nash equilibrium in every proper subgame.
Every n-player, general-sum, discounted-reward stochastic game has a Markov-perfect equilibrium.

Stochastic Games: Example

[Figure: example environment, not recoverable from the extraction]
Reward: goal reached: +100; collision: −1; otherwise: 0
Some solutions: [figures not recoverable from the extraction]
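
The reward structure alone can still be written down. A minimal sketch assuming a two-agent grid world with one goal cell per agent; the positions, goals, and collision rule are assumptions, since the original figure is lost:

```python
# Per-agent reward for the example above: +100 on reaching the own goal,
# -1 on collision, 0 otherwise. Grid-world setup is an assumption.
from typing import Tuple

Pos = Tuple[int, int]


def reward(next_pos: Tuple[Pos, Pos], goals: Tuple[Pos, Pos]) -> Tuple[float, float]:
    p1, p2 = next_pos
    if p1 == p2:  # both agents in the same cell: collision
        return (-1.0, -1.0)
    r1 = 100.0 if p1 == goals[0] else 0.0
    r2 = 100.0 if p2 == goals[1] else 0.0
    return (r1, r2)
```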

Summary: Problems

                     Agents: Single      Agents: Multiple
States: Single       Optimization        Matrix Game
States: Multiple     MDP                 Stochastic Game

Summary: Learning

                     Agents: Single            Agents: Multiple
States: Single       Multi-Armed Bandit        Learning in Repeated Games
States: Multiple     Reinforcement Learning    Multi-Agent Learning

Learning vs Game Theory

Game theory predicts which strategies rational players will play. Unfortunately, in multi-agent learning, many agents are not able to behave rationally:
The problem is unknown
Real-time constraints
Human players

In some problems a non-equilibrium strategy is appropriate if one expects the others to play non-equilibrium strategies.

Why Learning?

If the game is known, the agent wants to learn the strategies employed by the other agents.
If the game is unknown, the agent also wants to learn the structure of the game:
Unknown payoffs
Unknown transition probabilities

Observability: do the agents see each others' actions, and/or each others' payoffs?

Desired Properties in MAL

Rationality: play a best response against stationary opponents
Convergence: play a Nash equilibrium in self-play
Safety: perform no worse than the minimax strategy
Targeted optimality: play an approximate best response against memory-bounded opponents
Cooperate and compromise: an agent must offer and accept compromises

Many algorithms have been proposed that exhibit some of these properties (WoLF, GIGA-WoLF, AWESOME, M-Qubed, . . . )
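
The safety baseline above is concrete: it is the security (minimax) value of the game. A sketch of computing it for a zero-sum matrix game by linear programming; the use of scipy here and the example payoff matrix are illustrative assumptions:

```python
# Security value max_x min_j sum_i x_i R[i, j] for the row player of a
# zero-sum matrix game, solved as an LP.
import numpy as np
from scipy.optimize import linprog


def minimax_value(R: np.ndarray) -> float:
    m, k = R.shape
    # Variables: x_1..x_m (mixed strategy) and v (game value); minimize -v.
    c = np.zeros(m + 1)
    c[-1] = -1.0
    # For every opponent column j: v - sum_i x_i R[i, j] <= 0
    A_ub = np.hstack([-R.T, np.ones((k, 1))])
    b_ub = np.zeros(k)
    # Probabilities sum to one
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, 1)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1]


# Matching pennies: the security value is 0, achieved by the uniform strategy
print(minimax_value(np.array([[1.0, -1.0], [-1.0, 1.0]])))  # ~0.0
```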

Outline

1 Introduction to Multi-Agent Reinforcement Learning
  Reinforcement Learning
  MARL vs RL
  MARL vs Game Theory

2 MARL algorithms
  Best-Response Learning
  Equilibrium Learners
    Team Games
    Zero-sum Games
    General-sum Games

Taxonomy of MARL Algorithms: Task Type

Fully cooperative
  Static: JAL, FMQ
  Dynamic: Team-Q, Distributed-Q, OAL

Fully competitive
  Minimax-Q

Mixed
  Static: Fictitious Play, MetaStrategy, IGA, WoLF-IGA, GIGA, GIGA-WoLF, AWESOME, Hyper-Q
  Dynamic: Single-agent RL, Nash-Q, CE-Q, Asymmetric-Q, NSCP, WoLF-PHC, PD-WoLF, EXORL

Taxonomy of MARL Algorithms: Field of Origin

[Figure: algorithms arranged between their fields of origin — Temporal-Difference RL, Game Theory, and Direct Policy Search. Near TD RL: single-agent RL, JAL, Distributed-Q, EXORL, Hyper-Q, FMQ, CE-Q, Nash-Q, Team-Q, minimax-Q, NSCP, Asymmetric-Q, OAL. Near Game Theory: Fictitious Play, AWESOME, MetaStrategy. Near Direct Policy Search: IGA, WoLF-IGA, GIGA, GIGA-WoLF. In between: WoLF-PHC, PD-WoLF.]

Equilibrium or not?

Why focus on equilibria?
An equilibrium identifies conditions under which learning can or should stop
It is easier to play an equilibrium than to keep computing

Why not focus on equilibria?
A Nash equilibrium strategy has no prescriptive force
There may be multiple equilibria
Using an oracle to uniquely identify an equilibrium is "cheating"
The opponent may not wish to play an equilibrium
Computing a Nash equilibrium for a large game can be intractable

Outline

1 Introduction to Multi-Agent Reinforcement Learning
  Reinforcement Learning
  MARL vs RL
  MARL vs Game Theory

2 MARL algorithms
  Best-Response Learning
  Equilibrium Learners
    Team Games
    Zero-sum Games
    General-sum Games

Independent Learners

Typical conditions for Independent Learning (IL):
The agent is unaware of the existence of other agents
It cannot identify the other agents' actions, or has no reason to believe that the other agents are acting strategically

Independent learners try to learn best responses.

Advantages
Straightforward application of single-agent techniques
Scales with the number of agents

Disadvantages
Convergence guarantees from the single-agent setting are lost
No explicit means for coordination

(Example domain: traffic)
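
A minimal sketch of the idea, assuming two stateless Q-learners that repeatedly play a 2x2 coordination game while ignoring each other; the payoffs, learning rate, and exploration rate are all illustrative assumptions:

```python
# Two independent Q-learners in a repeated 2x2 coordination game.
# Each agent updates only its own action value, as described above.
import random

R = {(0, 0): (5.0, 5.0), (1, 1): (2.0, 2.0),
     (0, 1): (0.0, 0.0), (1, 0): (0.0, 0.0)}   # shared-interest coordination game

alpha, eps, episodes = 0.1, 0.1, 5000
Q = [[0.0, 0.0], [0.0, 0.0]]                    # Q[i][a]: agent i's value of action a


def pick(q):
    # epsilon-greedy action selection
    return random.randrange(2) if random.random() < eps else max((0, 1), key=lambda a: q[a])


for _ in range(episodes):
    a = (pick(Q[0]), pick(Q[1]))
    r = R[a]
    for i in (0, 1):  # each agent sees only its own reward and action
        Q[i][a[i]] += alpha * (r[i] - Q[i][a[i]])

print(Q)  # the learners typically (but not always) coordinate on (0, 0)
```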

Independent Reinforcement Learners

Q-learning [Watkins '92]
Learning Automata [Narendra '74, Wheeler '86]
WoLF-PHC [Bowling '01]
FAQ-learning [Kaisers '10]
RESQ-learning [Hennes '10]

Joint Action Learners

A joint action learner (JAL) is an agent that learns Q-values for joint actions.
To estimate the opponents' actions, empirical distributions can be used (as in fictitious play):

f_i(a_{−i}) = ∏_{j≠i} φ_j(a_{−i})

The expected value of an individual action is the sum of joint Q-values, weighted by the estimated probability of the associated complementary joint-action profiles:

EV(a_i) = Σ_{a_{−i} ∈ A_{−i}} Q(a_i ∪ a_{−i}) · f_i(a_{−i})
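
A sketch of this computation for agent i in a two-player repeated game, with the opponent model f_i estimated from observed play; the function names and the counting scheme are illustrative assumptions:

```python
# JAL expected-value computation: f_i is the opponent's empirical action
# frequency (fictitious-play style); EV weights joint-action Q-values by f_i.
from collections import Counter
from typing import Dict, List, Tuple


def empirical_dist(opponent_history: List[int], n_actions: int) -> Dict[int, float]:
    # Frequency of each opponent action observed so far
    counts = Counter(opponent_history)
    total = max(len(opponent_history), 1)
    return {a: counts.get(a, 0) / total for a in range(n_actions)}


def expected_value(a_i: int, Q: Dict[Tuple[int, int], float],
                   f_i: Dict[int, float]) -> float:
    # EV(a_i) = sum over opponent actions a_-i of Q(a_i, a_-i) * f_i(a_-i)
    return sum(Q[(a_i, a_minus)] * p for a_minus, p in f_i.items())


# Usage: pick the action with the highest expected value under f_i
Q = {(0, 0): 5.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 2.0}
f = empirical_dist([0, 0, 1, 0], n_actions=2)
best = max((0, 1), key=lambda a: expected_value(a, Q, f))
print(best)  # 0, since the opponent has mostly played action 0
```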

Page 229:  · MARL Marcello Restelli Introduction to Multi-Agent Reinforcement Learning Reinforcement Learning MARL vs RL MARL vs Game Theory MARL algorithms Best-Response Learning Equilibrium

MARL

MarcelloRestelli

Introduction toMulti-AgentReinforcementLearningReinforcementLearning

MARL vs RL

MARL vs GameTheory

MARLalgorithmsBest-ResponseLearning

Equilibrium Learners

Team Games

Zero-sum Games

General-sumGames

Joint Action Learners

A joint action learner (JAL) is an agent that learnsQ-values for joint actionsTo estimate opponents’ actions empiricaldistributions can be used (as in fictitious play)

fi(a−i) = Πj 6=iφj(a−i)

The expected value of an individual action is the sum ofjoint Q-values, weighted by the estimated probability ofthe associated complementary joint action profiles:

EV (ai) =∑

a−i∈A−i

Q(ai ∪ a−i)fi(a−i)

Page 230:  · MARL Marcello Restelli Introduction to Multi-Agent Reinforcement Learning Reinforcement Learning MARL vs RL MARL vs Game Theory MARL algorithms Best-Response Learning Equilibrium

MARL

MarcelloRestelli

Introduction toMulti-AgentReinforcementLearningReinforcementLearning

MARL vs RL

MARL vs GameTheory

MARLalgorithmsBest-ResponseLearning

Equilibrium Learners

Team Games

Zero-sum Games

General-sumGames

Joint Action Learners

A joint action learner (JAL) is an agent that learnsQ-values for joint actionsTo estimate opponents’ actions empiricaldistributions can be used (as in fictitious play)

fi(a−i) = Πj 6=iφj(a−i)

The expected value of an individual action is the sum ofjoint Q-values, weighted by the estimated probability ofthe associated complementary joint action profiles:

EV (ai) =∑

a−i∈A−i

Q(ai ∪ a−i)fi(a−i)

Page 231:  · MARL Marcello Restelli Introduction to Multi-Agent Reinforcement Learning Reinforcement Learning MARL vs RL MARL vs Game Theory MARL algorithms Best-Response Learning Equilibrium

MARL

MarcelloRestelli

Introduction toMulti-AgentReinforcementLearningReinforcementLearning

MARL vs RL

MARL vs GameTheory

MARLalgorithmsBest-ResponseLearning

Equilibrium Learners

Team Games

Zero-sum Games

General-sumGames

Outline

1 Introduction to Multi-Agent Reinforcement LearningReinforcement LearningMARL vs RLMARL vs Game Theory

2 MARL algorithmsBest-Response LearningEquilibrium Learners

Team GamesZero-sum GamesGeneral-sum Games

Page 232:  · MARL Marcello Restelli Introduction to Multi-Agent Reinforcement Learning Reinforcement Learning MARL vs RL MARL vs Game Theory MARL algorithms Best-Response Learning Equilibrium

MARL

MarcelloRestelli

Introduction toMulti-AgentReinforcementLearningReinforcementLearning

MARL vs RL

MARL vs GameTheory

MARLalgorithmsBest-ResponseLearning

Equilibrium Learners

Team Games

Zero-sum Games

General-sumGames

Team Games

- Team games are fully cooperative games in which all the agents share the same reward function
- If learning is centralized, it is actually single-agent learning with multiple actuators
- Multi-agent learning arises in distributed problems


Coordination Equilibria

- In a coordination equilibrium all the agents achieve their maximum possible payoff
- If $\pi_1, \pi_2, \dots, \pi_n$ are in coordination equilibrium, we have that

$$\sum_{a_1,\dots,a_n} \pi_1(s,a_1) \cdots \pi_n(s,a_n)\, Q_i(s,a_1,\dots,a_n) = \max_{a_1,\dots,a_n} Q_i(s,a_1,\dots,a_n)$$

for all $1 \le i \le n$ and all states $s$
- If a game has a coordination equilibrium, then it has a deterministic coordination equilibrium


Independent vs Joint Action Learners
Example: Coordination game

        a0   a1
  b0    10    0
  b1     0   10

- The agents use Boltzmann exploration (sketched below)
- Both are able to converge to one of the optimal strategies
- JALs can distinguish the Q-values of different joint actions
- The difference in performance is small, due to the exploration strategy
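The Boltzmann exploration mentioned here can be sketched as follows (a minimal illustration; the temperatures are assumptions, and cooling the temperature over time makes the policy greedier, GLIE-style):

```python
import numpy as np

rng = np.random.default_rng(0)

def boltzmann(values, temperature):
    # Sample an action with probability proportional to exp(value / T).
    z = np.asarray(values) / temperature
    z -= z.max()                # subtract the max for numerical stability
    p = np.exp(z)
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

for T in (10.0, 1.0, 0.1):      # lower temperature -> closer to greedy
    print(T, boltzmann([7.0, 3.0], T))
```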


Independent vs Joint Action Learners
Example: Penalty game

        a0   a1   a2
  b0    10    0    k
  b1     0    2    0
  b2     k    0   10

- Considering k < 0, the game has 3 pure equilibria
- Suppose the penalty is k = -100
- Both ILs and JALs will converge to the self-confirming equilibrium ⟨a1, b1⟩
- The magnitude of the penalty k influences the probability of convergence to the optimal joint strategy


Independent vs Joint Action Learners
Example: Climbing game

        a0    a1   a2
  b0    11   -30    0
  b1   -30     7    6
  b2     0     0    5

- Agents start playing ⟨a2, b2⟩
- Agents converge to ⟨a1, b1⟩
- Convergence to pure equilibria is almost sure


Independent vs Joint Action Learners
Sufficient conditions

- The learning rate $\alpha$ decreases over time such that $\sum_t \alpha_t$ is divergent and $\sum_t \alpha_t^2$ is convergent (see the check below)
- Each agent samples each of its actions infinitely often
- The probability $P^i_t(a)$ of agent $i$ choosing action $a$ is nonzero
- Agents eventually become full exploiters with probability one:

$$\lim_{t \to \infty} P^i_t(X_t) = 0$$

where $X_t$ is a random variable denoting the event that $(f_i, g_i)$ prescribe a sub-optimal action

Let $E_t$ be a random variable denoting the probability of a (deterministic) equilibrium strategy profile being played at time $t$. Then, for both ILs and JALs, for any $\delta, \varepsilon > 0$, there is a $T(\delta, \varepsilon)$ such that

$$\Pr(|E_t - 1| < \varepsilon) > 1 - \delta$$

for all $t > T(\delta, \varepsilon)$.
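For instance, the classic schedule $\alpha_t = 1/t$ satisfies the first condition, as this small numerical check suggests (the horizon of $10^6$ steps is an arbitrary choice of mine):

```python
import numpy as np

t = np.arange(1, 1_000_001, dtype=float)
alpha = 1.0 / t            # Robbins-Monro style step sizes
print(alpha.sum())         # partial harmonic sum (~14.39); the series diverges
print((alpha**2).sum())    # converges towards pi^2 / 6 ~ 1.6449
```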


Independent vs Joint Action Learners
Myopic Heuristics

- Neither ILs nor JALs ensure convergence to an optimal equilibrium
- No hope with ILs, but JALs with a different exploration strategy... Myopic heuristics:
  - Optimistic Boltzmann (OB): for agent $i$ and action $a_i \in A_i$, let $\mathrm{MaxQ}(a_i) = \max_{a_{-i}} Q(a_i, a_{-i})$. Choose actions with Boltzmann exploration (another exploitative strategy would suffice), using $\mathrm{MaxQ}(a_i)$ as the value of $a_i$
  - Weighted OB (WOB): explore using Boltzmann with the factors $\mathrm{MaxQ}(a_i) \cdot \Pr_i(\text{optimal match } a_{-i} \text{ for } a_i)$
  - Combined: let $C(a_i) = \rho\, \mathrm{MaxQ}(a_i) + (1 - \rho)\, EV(a_i)$, for some $0 \le \rho \le 1$. Choose actions using Boltzmann exploration with $C(a_i)$ as the value of $a_i$ (a sketch follows below)
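A sketch of the Combined heuristic for a two-player JAL (ρ, the shapes, and the names are illustrative assumptions; the resulting values would feed the Boltzmann rule sketched earlier):

```python
import numpy as np

def combined_values(Q_joint, opp_counts, rho=0.5):
    """C(a_i) = rho * MaxQ(a_i) + (1 - rho) * EV(a_i)."""
    f = opp_counts / opp_counts.sum()
    max_q = Q_joint.max(axis=1)     # optimistic: best case over opponent actions
    ev = Q_joint @ f                # realistic: weighted by the opponent model
    return rho * max_q + (1.0 - rho) * ev

# Penalty game with k = -100, against an opponent believed to play mostly b1.
Q_joint = np.array([[10.0, 0.0, -100.0],
                    [0.0, 2.0, 0.0],
                    [-100.0, 0.0, 10.0]])
print(combined_values(Q_joint, np.array([1.0, 8.0, 1.0])))
```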


Independent vs Joint Action Learners
Myopic heuristics: Penalty game [results figure]


Distributed Q-learning [Lauer and Riedmiller, '01]

- Applies to deterministic cooperative SGs
- Non-negative reward functions
- Update rule:

$$Q_0(s, a) = 0$$
$$Q_{k+1}(s, a) = \max\Big(Q_k(s, a),\; R(s, a) + \gamma \max_{a' \in A} Q_k(s', a')\Big)$$

- This optimistic algorithm learns distributed Q-tables, provided that all state-action pairs occur infinitely often (a sketch follows below)
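A minimal single-stage sketch of the optimistic update (my own illustration; with Q initialized to 0 the negative payoffs of the climbing game are simply never recorded, which reproduces the distributed Q-tables shown in the next example; the coordination rule from the later slides, moving the local policy only when a Q-value improves, is included):

```python
import numpy as np

rng = np.random.default_rng(0)

R = np.array([[11.0, -30.0, 0.0],
              [-30.0, 7.0, 6.0],
              [0.0, 0.0, 5.0]])      # climbing game; agent 2's action is unobserved

q = np.zeros(3)                      # agent 1's distributed Q-table (own actions only)
policy = 0                           # agent 1's current greedy action

for _ in range(10_000):
    a, b = rng.integers(3), rng.integers(3)   # exploratory joint play
    target = R[a, b]                 # single-stage game: no discounted next-state term
    if target > q[a]:                # optimistic: keep the best outcome seen so far
        q[a] = target
        policy = int(np.argmax(q))   # update the policy only on an improvement

print(q, policy)                     # q -> [11. 7. 6.], the row maxima of R
```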


Distributed Q-learning
Example: Climbing game

        a0    a1   a2
  b0    11   -30    0
  b1   -30     7    6
  b2     0     0    5

Distributed Q-tables (each agent's values over its own actions):

              a0/b0   a1/b1   a2/b2
  Q1(s0, ·)     11       7       6
  Q2(s0, ·)     11       7       5


Distributed Q-learning
Example: Penalty game

        a0   a1   a2
  b0    10    0    k
  b1     0    2    0
  b2     k    0   10

Distributed Q-tables (each agent's values over its own actions):

              a0/b0   a1/b1   a2/b2
  Q1(s0, ·)     10       2      10
  Q2(s0, ·)     10       2      10

- It requires an additional coordination mechanism between the agents
- Update the current policy only if an improvement of the Q-value happens


Distributed Q-learning
Stochastic environments

- Distributed Q-learning works fine in deterministic cooperative environments
- The extension to stochastic environments is problematic
- The main difficulty is that the Q-values are affected by two kinds of uncertainty:
  - the behavior of the other agents
  - the influence of the stochastic environment
- Distinguishing these two uncertainties is a key point in multiagent learning

Page 272:  · MARL Marcello Restelli Introduction to Multi-Agent Reinforcement Learning Reinforcement Learning MARL vs RL MARL vs Game Theory MARL algorithms Best-Response Learning Equilibrium

MARL

MarcelloRestelli

Introduction toMulti-AgentReinforcementLearningReinforcementLearning

MARL vs RL

MARL vs GameTheory

MARLalgorithmsBest-ResponseLearning

Equilibrium Learners

Team Games

Zero-sum Games

General-sumGames

Distributed Q-learningStochastic environments

Distributed Q-learning works fine with deterministiccooperative environmentsExtension to stochastic environments is problematicThe main difficulty is that Q-values are affected by twokinds of uncertainty

behavior of other agentsinfluence of stochastic environments

Distinguish these two uncertainties is a key point inmultiagent learning

Page 273:  · MARL Marcello Restelli Introduction to Multi-Agent Reinforcement Learning Reinforcement Learning MARL vs RL MARL vs Game Theory MARL algorithms Best-Response Learning Equilibrium

MARL

MarcelloRestelli

Introduction toMulti-AgentReinforcementLearningReinforcementLearning

MARL vs RL

MARL vs GameTheory

MARLalgorithmsBest-ResponseLearning

Equilibrium Learners

Team Games

Zero-sum Games

General-sumGames

Distributed Q-learningStochastic environments

Distributed Q-learning works fine with deterministiccooperative environmentsExtension to stochastic environments is problematicThe main difficulty is that Q-values are affected by twokinds of uncertainty

behavior of other agentsinfluence of stochastic environments

Distinguish these two uncertainties is a key point inmultiagent learning

Page 274:  · MARL Marcello Restelli Introduction to Multi-Agent Reinforcement Learning Reinforcement Learning MARL vs RL MARL vs Game Theory MARL algorithms Best-Response Learning Equilibrium

MARL

MarcelloRestelli

Introduction toMulti-AgentReinforcementLearningReinforcementLearning

MARL vs RL

MARL vs GameTheory

MARLalgorithmsBest-ResponseLearning

Equilibrium Learners

Team Games

Zero-sum Games

General-sumGames

Distributed Q-learningStochastic environments

Distributed Q-learning works fine with deterministiccooperative environmentsExtension to stochastic environments is problematicThe main difficulty is that Q-values are affected by twokinds of uncertainty

behavior of other agentsinfluence of stochastic environments

Distinguish these two uncertainties is a key point inmultiagent learning

Page 275:  · MARL Marcello Restelli Introduction to Multi-Agent Reinforcement Learning Reinforcement Learning MARL vs RL MARL vs Game Theory MARL algorithms Best-Response Learning Equilibrium

MARL

MarcelloRestelli

Introduction toMulti-AgentReinforcementLearningReinforcementLearning

MARL vs RL

MARL vs GameTheory

MARLalgorithmsBest-ResponseLearning

Equilibrium Learners

Team Games

Zero-sum Games

General-sumGames

Distributed Q-learningStochastic environments

Distributed Q-learning works fine with deterministiccooperative environmentsExtension to stochastic environments is problematicThe main difficulty is that Q-values are affected by twokinds of uncertainty

behavior of other agentsinfluence of stochastic environments

Distinguish these two uncertainties is a key point inmultiagent learning

Page 276:  · MARL Marcello Restelli Introduction to Multi-Agent Reinforcement Learning Reinforcement Learning MARL vs RL MARL vs Game Theory MARL algorithms Best-Response Learning Equilibrium

MARL

MarcelloRestelli

Introduction toMulti-AgentReinforcementLearningReinforcementLearning

MARL vs RL

MARL vs GameTheory

MARLalgorithmsBest-ResponseLearning

Equilibrium Learners

Team Games

Zero-sum Games

General-sumGames

Distributed Q-learningStochastic environments

Distributed Q-learning works fine with deterministiccooperative environmentsExtension to stochastic environments is problematicThe main difficulty is that Q-values are affected by twokinds of uncertainty

behavior of other agentsinfluence of stochastic environments

Distinguish these two uncertainties is a key point inmultiagent learning

Page 277:  · MARL Marcello Restelli Introduction to Multi-Agent Reinforcement Learning Reinforcement Learning MARL vs RL MARL vs Game Theory MARL algorithms Best-Response Learning Equilibrium

MARL

MarcelloRestelli

Introduction toMulti-AgentReinforcementLearningReinforcementLearning

MARL vs RL

MARL vs GameTheory

MARLalgorithmsBest-ResponseLearning

Equilibrium Learners

Team Games

Zero-sum Games

General-sumGames

Team Q-learning [Littman, '01]

- Requires observing the actions of the other agents
- Update rule (sketched below):

$$Q_1(s, a_1, \dots, a_n) \leftarrow (1 - \alpha)\, Q_1(s, a_1, \dots, a_n) + \alpha \Big( r_1 + \gamma \max_{a'_1, \dots, a'_n} Q_1(s', a'_1, \dots, a'_n) \Big)$$

- It does not use an opponent model
- In a team game, team Q-learning will converge to the optimal Q-function with probability one
- If the limit equilibrium is unique and the agent follows a GLIE policy, it will also converge in behavior with probability one
- The main problem is to select an equilibrium when there are multiple coordination equilibria
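A sketch of this update for agent 1 in a two-agent team (tabular, with states and joint actions as indices; the sizes and names are assumptions):

```python
import numpy as np

n_states, n_a1, n_a2 = 4, 3, 3
Q = np.zeros((n_states, n_a1, n_a2))   # joint-action Q-table kept by agent 1
alpha, gamma = 0.1, 0.95

def team_q_update(s, a1, a2, r, s_next):
    # Both agents' actions are observed, so the max runs over joint actions.
    target = r + gamma * Q[s_next].max()
    Q[s, a1, a2] += alpha * (target - Q[s, a1, a2])
```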

Page 278:  · MARL Marcello Restelli Introduction to Multi-Agent Reinforcement Learning Reinforcement Learning MARL vs RL MARL vs Game Theory MARL algorithms Best-Response Learning Equilibrium

MARL

MarcelloRestelli

Introduction toMulti-AgentReinforcementLearningReinforcementLearning

MARL vs RL

MARL vs GameTheory

MARLalgorithmsBest-ResponseLearning

Equilibrium Learners

Team Games

Zero-sum Games

General-sumGames

Team Q-learning [Littman, ’01]

Requires to observe actions from other agentsUpdate rule

Q1(s, a1, . . . , an)← (1− α)Q1(s, a1, . . . , an) + α

(r1 + γmaxa′1,...,a

′n

Q1(s′, a′1, . . . , a′n)

)

It does not use an opponent modelIn a team game, team Q-learning will converge to theoptimal Q-function with probability one.If the limit equilibrium is unique and the agent follows aGLIE policy, it will converge in behavior withprobability oneThe main problem is to select an equilibrium whenthere are multiple coordination equilibria

Page 279:  · MARL Marcello Restelli Introduction to Multi-Agent Reinforcement Learning Reinforcement Learning MARL vs RL MARL vs Game Theory MARL algorithms Best-Response Learning Equilibrium

MARL

MarcelloRestelli

Introduction toMulti-AgentReinforcementLearningReinforcementLearning

MARL vs RL

MARL vs GameTheory

MARLalgorithmsBest-ResponseLearning

Equilibrium Learners

Team Games

Zero-sum Games

General-sumGames

Team Q-learning [Littman, ’01]

Requires to observe actions from other agentsUpdate rule

Q1(s, a1, . . . , an)← (1− α)Q1(s, a1, . . . , an) + α

(r1 + γmaxa′1,...,a

′n

Q1(s′, a′1, . . . , a′n)

)

It does not use an opponent modelIn a team game, team Q-learning will converge to theoptimal Q-function with probability one.If the limit equilibrium is unique and the agent follows aGLIE policy, it will converge in behavior withprobability oneThe main problem is to select an equilibrium whenthere are multiple coordination equilibria

Page 280:  · MARL Marcello Restelli Introduction to Multi-Agent Reinforcement Learning Reinforcement Learning MARL vs RL MARL vs Game Theory MARL algorithms Best-Response Learning Equilibrium

MARL

MarcelloRestelli

Introduction toMulti-AgentReinforcementLearningReinforcementLearning

MARL vs RL

MARL vs GameTheory

MARLalgorithmsBest-ResponseLearning

Equilibrium Learners

Team Games

Zero-sum Games

General-sumGames

Team Q-learning [Littman, ’01]

Requires to observe actions from other agentsUpdate rule

Q1(s, a1, . . . , an)← (1− α)Q1(s, a1, . . . , an) + α

(r1 + γmaxa′1,...,a

′n

Q1(s′, a′1, . . . , a′n)

)

It does not use an opponent modelIn a team game, team Q-learning will converge to theoptimal Q-function with probability one.If the limit equilibrium is unique and the agent follows aGLIE policy, it will converge in behavior withprobability oneThe main problem is to select an equilibrium whenthere are multiple coordination equilibria

Page 281:  · MARL Marcello Restelli Introduction to Multi-Agent Reinforcement Learning Reinforcement Learning MARL vs RL MARL vs Game Theory MARL algorithms Best-Response Learning Equilibrium

MARL

MarcelloRestelli

Introduction toMulti-AgentReinforcementLearningReinforcementLearning

MARL vs RL

MARL vs GameTheory

MARLalgorithmsBest-ResponseLearning

Equilibrium Learners

Team Games

Zero-sum Games

General-sumGames

Team Q-learning [Littman, ’01]

Requires to observe actions from other agentsUpdate rule

Q1(s, a1, . . . , an)← (1− α)Q1(s, a1, . . . , an) + α

(r1 + γmaxa′1,...,a

′n

Q1(s′, a′1, . . . , a′n)

)

It does not use an opponent modelIn a team game, team Q-learning will converge to theoptimal Q-function with probability one.If the limit equilibrium is unique and the agent follows aGLIE policy, it will converge in behavior withprobability oneThe main problem is to select an equilibrium whenthere are multiple coordination equilibria

Page 282:  · MARL Marcello Restelli Introduction to Multi-Agent Reinforcement Learning Reinforcement Learning MARL vs RL MARL vs Game Theory MARL algorithms Best-Response Learning Equilibrium

MARL

MarcelloRestelli

Introduction toMulti-AgentReinforcementLearningReinforcementLearning

MARL vs RL

MARL vs GameTheory

MARLalgorithmsBest-ResponseLearning

Equilibrium Learners

Team Games

Zero-sum Games

General-sumGames

Team Q-learning [Littman, ’01]

Requires to observe actions from other agentsUpdate rule

Q1(s, a1, . . . , an)← (1− α)Q1(s, a1, . . . , an) + α

(r1 + γmaxa′1,...,a

′n

Q1(s′, a′1, . . . , a′n)

)

It does not use an opponent modelIn a team game, team Q-learning will converge to theoptimal Q-function with probability one.If the limit equilibrium is unique and the agent follows aGLIE policy, it will converge in behavior withprobability oneThe main problem is to select an equilibrium whenthere are multiple coordination equilibria

Page 283:  · MARL Marcello Restelli Introduction to Multi-Agent Reinforcement Learning Reinforcement Learning MARL vs RL MARL vs Game Theory MARL algorithms Best-Response Learning Equilibrium

MARL

MarcelloRestelli

Introduction toMulti-AgentReinforcementLearningReinforcementLearning

MARL vs RL

MARL vs GameTheory

MARLalgorithmsBest-ResponseLearning

Equilibrium Learners

Team Games

Zero-sum Games

General-sumGames

Zero-sum Games

- Consider 2 players
- $R_1(i, j) = M(i, j)$
- $R_2(i, j) = -M(i, j)$
- Player 1 is the maximizer
- Player 2 is the minimizer
- Examples: matching pennies, rock-paper-scissors (see below)
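For instance, matching pennies in this notation (a small sketch; the Heads/Tails indexing is an assumption):

```python
import numpy as np

# Rows: player 1 plays H or T; columns: player 2 plays H or T.
M = np.array([[1.0, -1.0],
              [-1.0, 1.0]])
R1, R2 = M, -M                  # zero-sum: the payoffs always cancel out
assert np.all(R1 + R2 == 0)
```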


Minimax-Q [Littman, '94]

- In MDPs, a stationary, deterministic, and undominated optimal policy always exists
- In MGs, the performance of a policy depends on the opponent's policy, so policies cannot be evaluated without context
- New definition of optimality from game theory:
  - an optimal policy performs best in its worst case, compared with the others
  - at least one optimal policy exists, which may or may not be deterministic, because the agent is uncertain of its opponent's move

Page 290:  · MARL Marcello Restelli Introduction to Multi-Agent Reinforcement Learning Reinforcement Learning MARL vs RL MARL vs Game Theory MARL algorithms Best-Response Learning Equilibrium

MARL

MarcelloRestelli

Introduction toMulti-AgentReinforcementLearningReinforcementLearning

MARL vs RL

MARL vs GameTheory

MARLalgorithmsBest-ResponseLearning

Equilibrium Learners

Team Games

Zero-sum Games

General-sumGames

Minimax-Q [Littman, ’94]

In MDPs a stationary, deterministic, and undominatedoptimal policy always existsIn MGs The performance of a policy depends on theopponent’s policy, so we cannot evaluate them withoutcontext.New definition of optimality in game theory

Performs best at its worst case compared with othersAt least one optimal policy exists, which may or may notbe deterministic because the agent is uncertain of itsopponent’s move.

Page 291:  · MARL Marcello Restelli Introduction to Multi-Agent Reinforcement Learning Reinforcement Learning MARL vs RL MARL vs Game Theory MARL algorithms Best-Response Learning Equilibrium

MARL

MarcelloRestelli

Introduction toMulti-AgentReinforcementLearningReinforcementLearning

MARL vs RL

MARL vs GameTheory

MARLalgorithmsBest-ResponseLearning

Equilibrium Learners

Team Games

Zero-sum Games

General-sumGames

Minimax-Q [Littman, ’94]

In MDPs a stationary, deterministic, and undominatedoptimal policy always existsIn MGs The performance of a policy depends on theopponent’s policy, so we cannot evaluate them withoutcontext.New definition of optimality in game theory

Performs best at its worst case compared with othersAt least one optimal policy exists, which may or may notbe deterministic because the agent is uncertain of itsopponent’s move.

Page 292:  · MARL Marcello Restelli Introduction to Multi-Agent Reinforcement Learning Reinforcement Learning MARL vs RL MARL vs Game Theory MARL algorithms Best-Response Learning Equilibrium

MARL

MarcelloRestelli

Introduction toMulti-AgentReinforcementLearningReinforcementLearning

MARL vs RL

MARL vs GameTheory

MARLalgorithmsBest-ResponseLearning

Equilibrium Learners

Team Games

Zero-sum Games

General-sumGames

Minimax-Q [Littman, ’94]

In MDPs a stationary, deterministic, and undominatedoptimal policy always existsIn MGs The performance of a policy depends on theopponent’s policy, so we cannot evaluate them withoutcontext.New definition of optimality in game theory

Performs best at its worst case compared with othersAt least one optimal policy exists, which may or may notbe deterministic because the agent is uncertain of itsopponent’s move.

Page 293:  · MARL Marcello Restelli Introduction to Multi-Agent Reinforcement Learning Reinforcement Learning MARL vs RL MARL vs Game Theory MARL algorithms Best-Response Learning Equilibrium

MARL

MarcelloRestelli

Introduction toMulti-AgentReinforcementLearningReinforcementLearning

MARL vs RL

MARL vs GameTheory

MARLalgorithmsBest-ResponseLearning

Equilibrium Learners

Team Games

Zero-sum Games

General-sumGames

Minimax-Q [Littman, ’94]

In MDPs a stationary, deterministic, and undominatedoptimal policy always existsIn MGs The performance of a policy depends on theopponent’s policy, so we cannot evaluate them withoutcontext.New definition of optimality in game theory

Performs best at its worst case compared with othersAt least one optimal policy exists, which may or may notbe deterministic because the agent is uncertain of itsopponent’s move.

Page 294:  · MARL Marcello Restelli Introduction to Multi-Agent Reinforcement Learning Reinforcement Learning MARL vs RL MARL vs Game Theory MARL algorithms Best-Response Learning Equilibrium

MARL

MarcelloRestelli

Introduction toMulti-AgentReinforcementLearningReinforcementLearning

MARL vs RL

MARL vs GameTheory

MARLalgorithmsBest-ResponseLearning

Equilibrium Learners

Team Games

Zero-sum Games

General-sumGames

Minimax-Q
Learning the Optimal Policy

Q-learning:

$$Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \big( r + \gamma V(s') \big)$$
$$V(s) = \max_a Q(s, a)$$

Minimax-Q learning ($o$ is the opponent's action):

$$Q(s, a, o) := (1 - \alpha)\, Q(s, a, o) + \alpha \big( r_{s,a,o} + \gamma V(s') \big)$$
$$\pi(s, \cdot) := \arg\max_{\pi'(s, \cdot)} \min_{o'} \sum_{a'} \pi'(s, a')\, Q(s, a', o')$$
$$V(s) := \min_{o'} \sum_{a'} \pi(s, a')\, Q(s, a', o')$$
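The arg max / min step is a linear program over the mixed strategy at $s$. A sketch for a single state (using scipy.optimize.linprog; the Q-table indexing and the omission of error handling are simplifying assumptions of mine):

```python
import numpy as np
from scipy.optimize import linprog

def maximin_policy(Q):
    """Return (pi, v) maximizing min over o of sum_a pi[a] * Q[a, o]."""
    n_a, n_o = Q.shape
    # Variables: pi[0..n_a-1] and the value v; linprog minimizes, so use -v.
    c = np.zeros(n_a + 1)
    c[-1] = -1.0
    # One constraint per opponent action o: v - sum_a pi[a] * Q[a, o] <= 0.
    A_ub = np.hstack([-Q.T, np.ones((n_o, 1))])
    b_ub = np.zeros(n_o)
    A_eq = np.append(np.ones(n_a), 0.0).reshape(1, -1)   # probabilities sum to 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, 1)] * n_a + [(None, None)])
    return res.x[:n_a], res.x[-1]

# Rock-paper-scissors: the safe policy is uniform and the game value is 0.
Q = np.array([[0.0, -1.0, 1.0],
              [1.0, 0.0, -1.0],
              [-1.0, 1.0, 0.0]])
print(maximin_policy(Q))     # -> roughly [1/3, 1/3, 1/3], 0.0
```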

Page 295:  · MARL Marcello Restelli Introduction to Multi-Agent Reinforcement Learning Reinforcement Learning MARL vs RL MARL vs Game Theory MARL algorithms Best-Response Learning Equilibrium

MARL

MarcelloRestelli

Introduction toMulti-AgentReinforcementLearningReinforcementLearning

MARL vs RL

MARL vs GameTheory

MARLalgorithmsBest-ResponseLearning

Equilibrium Learners

Team Games

Zero-sum Games

General-sumGames

Minimax-QLearning Optimal Policy

Q-learning

Q(s,a)← (1− α)Q(s,a) + α(r + γV (s′)

)V (s) = max

aQ(s,a)

minimax-Q learning

Q(s,a,o) := (1− α)Q(s,a,o) + α(rs,a,o + γV (s′)

)π(s, . . . ) := arg max

π′(s,... )min

o′

∑a′

(π(s,a′) ·Q(s,a′,o′)

)V (s) := min

o′

∑a′π(s,a′) ·Q(s,a′,o′)

Page 296:  · MARL Marcello Restelli Introduction to Multi-Agent Reinforcement Learning Reinforcement Learning MARL vs RL MARL vs Game Theory MARL algorithms Best-Response Learning Equilibrium

MARL

MarcelloRestelli

Introduction toMulti-AgentReinforcementLearningReinforcementLearning

MARL vs RL

MARL vs GameTheory

MARLalgorithmsBest-ResponseLearning

Equilibrium Learners

Team Games

Zero-sum Games

General-sumGames

Minimax-Q
Considerations

- In a two-player zero-sum multiagent environment, an agent following minimax Q-learning will converge to the optimal Q-function with probability one. Furthermore, if it follows a GLIE policy and the limit equilibrium is unique, it will converge in behavior with probability one
- In zero-sum SGs, even if the limit equilibrium is not unique, it converges to a policy that always achieves the optimal value regardless of its opponent (safety)
- Minimax Q-learning achieves the largest value possible in the absence of knowledge of the opponent's policy
- Minimax-Q is quite slow to converge w.r.t. Q-learning (but the latter learns only deterministic policies)

Page 297:  · MARL Marcello Restelli Introduction to Multi-Agent Reinforcement Learning Reinforcement Learning MARL vs RL MARL vs Game Theory MARL algorithms Best-Response Learning Equilibrium

MARL

MarcelloRestelli

Introduction toMulti-AgentReinforcementLearningReinforcementLearning

MARL vs RL

MARL vs GameTheory

MARLalgorithmsBest-ResponseLearning

Equilibrium Learners

Team Games

Zero-sum Games

General-sumGames

Minimax-QConsiderations

In a two-player zero-sum multiagent environment, anagent following minimax Q-learning will converge tothe optimal Q-function with probability one.Furthermore, if it follows a GLIE policy and the limitequilibrium is unique, it will converge in behavior withprobability oneIn zero-sum SGs, even if the limit equilibrium is notunique, it converge to a policy that always achieve theoptimal value regardless of its opponent (safety)The minimax Q-learning achieves the largest valuepossible in the absence of knowledge of theopponent’s policyMinimax-Q is quite slow to converge w.r.t. Q-learning(but the latter learns only deterministic policies)

Page 298:  · MARL Marcello Restelli Introduction to Multi-Agent Reinforcement Learning Reinforcement Learning MARL vs RL MARL vs Game Theory MARL algorithms Best-Response Learning Equilibrium

MARL

MarcelloRestelli

Introduction toMulti-AgentReinforcementLearningReinforcementLearning

MARL vs RL

MARL vs GameTheory

MARLalgorithmsBest-ResponseLearning

Equilibrium Learners

Team Games

Zero-sum Games

General-sumGames

Minimax-QConsiderations

In a two-player zero-sum multiagent environment, anagent following minimax Q-learning will converge tothe optimal Q-function with probability one.Furthermore, if it follows a GLIE policy and the limitequilibrium is unique, it will converge in behavior withprobability oneIn zero-sum SGs, even if the limit equilibrium is notunique, it converge to a policy that always achieve theoptimal value regardless of its opponent (safety)The minimax Q-learning achieves the largest valuepossible in the absence of knowledge of theopponent’s policyMinimax-Q is quite slow to converge w.r.t. Q-learning(but the latter learns only deterministic policies)

Page 299:  · MARL Marcello Restelli Introduction to Multi-Agent Reinforcement Learning Reinforcement Learning MARL vs RL MARL vs Game Theory MARL algorithms Best-Response Learning Equilibrium

MARL

MarcelloRestelli

Introduction toMulti-AgentReinforcementLearningReinforcementLearning

MARL vs RL

MARL vs GameTheory

MARLalgorithmsBest-ResponseLearning

Equilibrium Learners

Team Games

Zero-sum Games

General-sumGames

Minimax-QConsiderations

In a two-player zero-sum multiagent environment, anagent following minimax Q-learning will converge tothe optimal Q-function with probability one.Furthermore, if it follows a GLIE policy and the limitequilibrium is unique, it will converge in behavior withprobability oneIn zero-sum SGs, even if the limit equilibrium is notunique, it converge to a policy that always achieve theoptimal value regardless of its opponent (safety)The minimax Q-learning achieves the largest valuepossible in the absence of knowledge of theopponent’s policyMinimax-Q is quite slow to converge w.r.t. Q-learning(but the latter learns only deterministic policies)

Page 300:  · MARL Marcello Restelli Introduction to Multi-Agent Reinforcement Learning Reinforcement Learning MARL vs RL MARL vs Game Theory MARL algorithms Best-Response Learning Equilibrium

MARL

MarcelloRestelli

Introduction toMulti-AgentReinforcementLearningReinforcementLearning

MARL vs RL

MARL vs GameTheory

MARLalgorithmsBest-ResponseLearning

Equilibrium Learners

Team Games

Zero-sum Games

General-sumGames

Minimax-QConsiderations

In a two-player zero-sum multiagent environment, anagent following minimax Q-learning will converge tothe optimal Q-function with probability one.Furthermore, if it follows a GLIE policy and the limitequilibrium is unique, it will converge in behavior withprobability oneIn zero-sum SGs, even if the limit equilibrium is notunique, it converge to a policy that always achieve theoptimal value regardless of its opponent (safety)The minimax Q-learning achieves the largest valuepossible in the absence of knowledge of theopponent’s policyMinimax-Q is quite slow to converge w.r.t. Q-learning(but the latter learns only deterministic policies)

Page 301:  · MARL Marcello Restelli Introduction to Multi-Agent Reinforcement Learning Reinforcement Learning MARL vs RL MARL vs Game Theory MARL algorithms Best-Response Learning Equilibrium

MARL

MarcelloRestelli

Introduction toMulti-AgentReinforcementLearningReinforcementLearning

MARL vs RL

MARL vs GameTheory

MARLalgorithmsBest-ResponseLearning

Equilibrium Learners

Team Games

Zero-sum Games

General-sumGames

Can we extend this approach to general-sum SGs?

Yes and no
Nash-Q learning is such an extension
However, it has much worse computational and theoretical properties

Nash-Q [Hu & Wellman, ’98-’03]

The max operator of Q-learning is replaced by the value of a Nash equilibrium of the stage game:

$\mathrm{NashQ}^i_t(s') = \pi^1(s') \cdot \ldots \cdot \pi^n(s') \cdot Q^i_t(s')$

where $(\pi^1(s'), \ldots, \pi^n(s'))$ is a Nash equilibrium of the stage game $(Q^1_t(s'), \ldots, Q^n_t(s'))$

Each agent needs to maintain the Q-functions of all the other agents
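To make the backup concrete, here is a two-player sketch under stated assumptions: the third-party nashpy library does the bimatrix equilibrium computation, and the name nash_backup is invented for illustration; for n > 2 agents no comparably simple solver exists.

```python
import numpy as np
import nashpy as nash  # third-party bimatrix-game solver: pip install nashpy

def nash_backup(Q1_s, Q2_s):
    """Two-player NashQ backup: expected payoffs (v1, v2) under a Nash
    equilibrium of the stage game given by the |A1| x |A2| payoff
    matrices Q1_s and Q2_s at the next state s'."""
    game = nash.Game(Q1_s, Q2_s)
    # Support enumeration yields all equilibria; Nash-Q must pick one,
    # and the convergence theorem below constrains which kind it may be.
    pi1, pi2 = next(game.support_enumeration())
    v1 = pi1 @ Q1_s @ pi2  # NashQ^1_t(s')
    v2 = pi1 @ Q2_s @ pi2  # NashQ^2_t(s')
    return v1, v2
```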

Nash-Q: Complexity

Space requirements: $n \cdot |S| \cdot |A|^n$

The algorithm's running time is dominated by the computation of Nash equilibria of the stage games, for which no polynomial-time algorithm is known
The minimax operator, by contrast, can be computed in polynomial time (linear programming)
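A quick back-of-the-envelope check of the space requirement (the sizes below are made up for illustration):

```python
# Each agent keeps n Q-functions, one value per state and *joint* action.
n, num_states, num_actions = 3, 1000, 5       # illustrative sizes
entries = n * num_states * num_actions ** n   # n * |S| * |A|^n
print(entries)                                # 375000 table entries per agent
```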

Nash-Q: Convergence

Assumptions

Every state and joint action are visited infinitely often
Learning rates decrease suitably
One of the following holds:

Every stage game $(Q^1_t(s), \ldots, Q^n_t(s))$, for all $t$ and $s$, has a global optimal point, and the agents' payoffs in this equilibrium are used to update their Q-functions
Every stage game $(Q^1_t(s), \ldots, Q^n_t(s))$, for all $t$ and $s$, has a saddle point, and the agents' payoffs in this equilibrium are used to update their Q-functions

Theorem
Under these assumptions, the sequence $Q_t = (Q^1_t, \ldots, Q^n_t)$, updated by

$Q^k_{t+1}(s, a^1, \ldots, a^n) = (1 - \alpha_t)\, Q^k_t(s, a^1, \ldots, a^n) + \alpha_t \left( r^k_t + \gamma\, \pi^1(s') \cdots \pi^n(s')\, Q^k_t(s') \right) \qquad k = 1, \ldots, n$

where $(\pi^1(s'), \ldots, \pi^n(s'))$ is the appropriate type of Nash equilibrium solution for the stage game $(Q^1_t(s'), \ldots, Q^n_t(s'))$, converges to the Nash Q-value $Q_* = (Q^1_*, \ldots, Q^n_*)$.
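Reading the update rule as code may help. The sketch below is a two-player instance under stated assumptions: it reuses the hypothetical nash_backup from the earlier sketch, and the table layout and step-size handling are illustrative, not from the slides.

```python
def nash_q_step(Q1, Q2, s, a1, a2, r1, r2, s_next, alpha, gamma):
    """One Nash-Q update for both agents (two-player instance).
    Q1[s] and Q2[s] are |A1| x |A2| arrays of joint-action values;
    each agent keeps *both* tables, as the algorithm requires."""
    v1, v2 = nash_backup(Q1[s_next], Q2[s_next])  # Nash value of stage game at s'
    # Q^k_{t+1} = (1 - alpha) Q^k_t + alpha * (r^k + gamma * NashQ^k_t(s'))
    Q1[s][a1, a2] += alpha * (r1 + gamma * v1 - Q1[s][a1, a2])
    Q2[s][a1, a2] += alpha * (r2 + gamma * v2 - Q2[s][a1, a2])
```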

Nash-Q: Convergence Result Analysis

The third assumption is really strong
It is unlikely that the stage games encountered during learning keep satisfying it
The global optimum assumption implies full cooperation between agents
The saddle point assumption implies no cooperation between agents
Nonetheless, empirically the algorithm converges even when the assumptions are violated
This suggests that the assumptions may be relaxed, at least for some classes of games

Friend-or-Foe [Littman, ’01]

Friend-or-Foe Q-learning (FFQ) aims at removing the requirements on the intermediate Q-values during learning
The idea is to let the algorithm know what kind of opponent to expect

friend: coordination equilibrium

$\mathrm{Nash}_1(s, Q_1, Q_2) = \max_{a_1 \in A_1,\, a_2 \in A_2} Q_1(s, a_1, a_2)$

foe: adversarial equilibrium

$\mathrm{Nash}_1(s, Q_1, Q_2) = \max_{\pi \in \Pi(A_1)} \min_{a_2 \in A_2} \sum_{a_1 \in A_1} \pi(a_1)\, Q_1(s, a_1, a_2)$

In FFQ the learner maintains only a Q-function for itself
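The two backups are easy to state in code. A minimal sketch under stated assumptions: foe_value reuses the hypothetical minimax_value LP sketch from earlier, and all names are illustrative.

```python
import numpy as np

def friend_value(Q1_s):
    """Friend-Q backup: coordination-equilibrium value, i.e. the best
    payoff over *joint* actions (the other agent is assumed to help)."""
    return Q1_s.max()

def foe_value(Q1_s):
    """Foe-Q backup: adversarial (maximin) value over mixed strategies;
    identical to the minimax operator sketched earlier."""
    v, _ = minimax_value(Q1_s)
    return v
```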

Friend-or-Foe Q-learning: Convergence Results

Friend-or-foe Q-learning always converges
In general, the values learned by FFQ will not converge to those of any Nash equilibrium
There are some special cases (independently of the opponent's behavior):

Foe-Q learns values for a Nash equilibrium policy if the game has an adversarial equilibrium
Friend-Q learns values for a Nash equilibrium policy if the game has a coordination equilibrium

Foe-Q learns a Q-function whose corresponding policy will achieve at least the learned values, regardless of the opponent's selected policy
