
An Introduction to COMPUTATIONAL REINFORCEMENT LEARNING

Andrew G. Barto, Department of Computer Science

University of Massachusetts – Amherst

UPF Lecture 1

Autonomous Learning Laboratory – Department of Computer Science


[Diagram: Computational Reinforcement Learning (RL) at the intersection of Psychology, Artificial Intelligence, Control Theory and Operations Research, Artificial Neural Networks, and Neuroscience]

A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998.

The Overall Plan

❐ Lecture 1: What is Computational Reinforcement Learning?
    Learning from evaluative feedback
    Markov decision processes

❐ Lecture 2: Dynamic Programming
    Basic Monte Carlo methods
    Temporal Difference methods
    A unified perspective
    Connections to neuroscience

❐ Lecture 3: Function approximation
    Model-based methods
    Dimensions of Reinforcement Learning


What is Reinforcement Learning?

❐ Learning from interaction
❐ Goal-oriented learning
❐ Learning about, from, and while interacting with an external environment
❐ Learning what to do—how to map situations to actions—so as to maximize a numerical reward signal


Supervised Learning

Inputs → Supervised Learning System → Outputs

Training Info = desired (target) outputs

Error = (target output – actual output)


Reinforcement Learning

Inputs → RL System → Outputs (“actions”)

Training Info = evaluations (“rewards” / “penalties”)

Objective: get as much reward as possible


Key Features of RL

❐ Learner is not told which actions to take
❐ Trial-and-error search
❐ Possibility of delayed reward: sacrifice short-term gains for greater long-term gains
❐ The need to explore and exploit
❐ Considers the whole problem of a goal-directed agent interacting with an uncertain environment


Complete Agent

❐ Temporally situated
❐ Continual learning and planning
❐ Object is to affect the environment
❐ Environment is stochastic and uncertain

[Diagram: agent-environment loop; the agent sends an action to the environment and receives back a state and a reward]


A Less Misleading View

[Diagram: the RL agent embedded in a larger system, with external sensations, internal sensations, memory, state, reward, and actions]


Elements of RL

❐ Policy: what to do
❐ Reward: what is good
❐ Value: what is good because it predicts reward
❐ Model: what follows what

[Diagram: policy, reward, value, and model of environment as the elements of an RL agent]


An Extended Example: Tic-Tac-Toe

[Diagram: a tree of tic-tac-toe positions, branching on x’s moves and o’s moves]

Assume an imperfect opponent: he or she sometimes makes mistakes.


An RL Approach to Tic-Tac-Toe

1. Make a table with one entry per state:

   State                      V(s) – estimated probability of winning
   (typical position)         0.5   ?
   (typical position)         0.5   ?
   . . .                      . . .
   (three x’s in a row)       1     win
   (three o’s in a row)       0     loss
   (full board, no winner)    0     draw

2. Now play lots of games. To pick our moves, look ahead one step:

   [Diagram: the current state, with branches to the various possible next states]

Just pick the next state with the highest estimated probability of winning — the largest V(s): a greedy move. But 10% of the time pick a move at random: an exploratory move.


RL Learning Rule for Tic-Tac-Toe

[Diagram: a sequence of positions with greedy moves and occasional “exploratory” moves]

s  – the state before our greedy move
s′ – the state after our greedy move

We increment each V(s) toward V(s′) – a backup:

   V(s) ← V(s) + α [ V(s′) − V(s) ]

where α, the step-size parameter, is a small positive fraction, e.g., α = 0.1
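A minimal sketch of this update and of greedy/exploratory move selection in Python; the function names, the dictionary V of value estimates, and the default of 0.5 for unseen states are illustrative assumptions, not code from the lecture.

```python
import random

def td_backup(V, s, s_next, alpha=0.1):
    """V(s) <- V(s) + alpha * [V(s') - V(s)]: move V(s) part way toward V(s')."""
    v_s = V.setdefault(s, 0.5)          # unseen states start at the 0.5 guess
    v_next = V.setdefault(s_next, 0.5)
    V[s] = v_s + alpha * (v_next - v_s)

def choose_move(V, next_states, epsilon=0.1):
    """Pick the next state with the largest V(s) (greedy), but explore 10% of the time."""
    if random.random() < epsilon:
        return random.choice(next_states)                  # exploratory move
    return max(next_states, key=lambda s: V.get(s, 0.5))   # greedy move
```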


How can we improve this T.T.T. player?

❐ Take advantage of symmetries: representation/generalization. How might this backfire?
❐ Do we need “random” moves? Why? Do we always need a full 10%?
❐ Can we learn from “random” moves?
❐ Can we learn offline? Pre-training from self play? Using learned models of opponent?

❐ . . .


How is Tic-Tac-Toe Too Easy?

❐ Finite, small number of states
❐ One-step look-ahead is always possible
❐ State completely observable
❐ . . .


Some Notable RL Applications

❐ TD-Gammon: Tesauro – world’s best backgammon program
❐ Elevator Control: Crites & Barto – high-performance down-peak elevator controller
❐ Inventory Management: Van Roy, Bertsekas, Lee & Tsitsiklis – 10–15% improvement over industry standard methods
❐ Dynamic Channel Assignment: Singh & Bertsekas, Nie & Haykin – high-performance assignment of radio channels to mobile telephone calls


TD-Gammon

Start with a random network.
Play very many games against self.
Learn a value function from this simulated experience.

This produces arguably the best player in the world.

[Diagram: board position fed to a neural network that outputs a value estimate; learning is driven by the TD error V_{t+1} − V_t; action selection by 2–3 ply search]

Tesauro, 1992–1995


Elevator Dispatching

10 floors, 4 elevator cars

STATES: button states; positions, directions, and motion states of cars; passengers in cars & in halls
ACTIONS: stop at, or go by, next floor
REWARDS: roughly, –1 per time step for each person waiting

Conservatively about 10^22 states

Crites and Barto, 1996


Autonomous Helicopter Flight
A. Ng, Stanford; H. Kim, M. Jordan, S. Sastry, Berkeley


Quadrupedal Locomotion
Nate Kohl & Peter Stone, Univ of Texas at Austin

[Videos: before learning; after 1000 trials, or about 3 hours]

All training done with physical robots: Sony Aibo ERS-210A


Learning Control for Dynamically Stable Walking Robots
Russ Tedrake, Teresa Zhang, H. Sebastian Seung, MIT

Start with a Passive Walker


http://hebb.mit.edu/~russt/robots


Grasp Control
R. Platt, A. Fagg, R. Grupen, Univ of Mass

UMass Torso: “Dexter”


Some RL History

[Diagram: three historical threads converging on modern RL: trial-and-error learning, temporal-difference learning, and optimal control/value functions. Contributors shown include Thorndike (Ψ, 1911), secondary reinforcement (Ψ), Hamilton (physics, 1800s), Minsky, Samuel, Shannon, Klopf, Witten, Bellman/Howard (OR), Holland, Barto et al., Sutton, Werbos, and Watkins.]


Arthur Samuel 1959, 1967

Samuel’s Checkers Player

❐ Score board configurations by a “scoring polynomial” (after Shannon, 1950)
❐ Minimax to determine “backed-up score” of a position
❐ Alpha-beta cutoffs
❐ Rote learning: save each board configuration encountered together with its backed-up score; needed a “sense of direction”: like discounting
❐ Learning by generalization: similar to TD algorithm


Samuel’s Backups


The Basic Idea

“. . . we are attempting to make the score, calculated for the current board position, look like that calculated for the terminal board positions of the chain of moves which most probably occur during actual play.”

A. L. Samuel, Some Studies in Machine Learning Using the Game of Checkers, 1959


MENACE (Michie 1961)

[Diagram: an array of tic-tac-toe board positions, one per matchbox]

“Matchbox Educable Noughts and Crosses Engine”


The Overall Plan

❐ Lecture 1: What is Computational Reinforcement Learning?
    Learning from evaluative feedback
    Markov decision processes

❐ Lecture 2: Dynamic Programming
    Basic Monte Carlo methods
    Temporal Difference methods
    A unified perspective
    Connections to neuroscience

❐ Lecture 3: Function approximation
    Model-based methods
    Dimensions of Reinforcement Learning


Lecture 1, Part 2: Evaluative Feedback

❐ Evaluating actions vs. instructing by giving correct actions
❐ Pure evaluative feedback depends totally on the action taken. Pure instructive feedback depends not at all on the action taken.
❐ Supervised learning is instructive; optimization is evaluative
❐ Associative vs. Nonassociative:
    Associative: inputs mapped to outputs; learn the best output for each input
    Nonassociative: “learn” (find) one best output
❐ The n-armed bandit (at least how we treat it) is: Nonassociative, Evaluative feedback


The n-Armed Bandit Problem

❐ Choose repeatedly from one of n actions; each choice is called a play
❐ After each play a_t, you get a reward r_t, where

   E{ r_t | a_t } = Q*(a_t)

   These are unknown action values. The distribution of r_t depends only on a_t.
❐ Objective is to maximize the reward in the long term, e.g., over 1000 plays

To solve the n-armed bandit problem, you must explore a variety of actions and then exploit the best of them.


The Exploration/Exploitation Dilemma

❐ Suppose you form action-value estimates Q_t(a) ≈ Q*(a)
❐ The greedy action at t is a_t* = argmax_a Q_t(a)

   a_t = a_t*  →  exploitation
   a_t ≠ a_t*  →  exploration

❐ You can’t exploit all the time; you can’t explore all the time
❐ You can never stop exploring; but you should always reduce exploring


Action-Value Methods

❐ Methods that adapt action-value estimates and nothing else, e.g.: suppose by the t-th play, action a had been chosen k_a times, producing rewards r_1, r_2, ..., r_{k_a}; then

   Q_t(a) = ( r_1 + r_2 + ... + r_{k_a} ) / k_a       (“sample average”)

   lim_{k_a → ∞} Q_t(a) = Q*(a)


ε-Greedy Action Selection

❐ Greedy action selection:

   a_t = a_t* = argmax_a Q_t(a)

❐ ε-greedy:

   a_t = a_t* with probability 1 − ε,
         a random action with probability ε

. . . the simplest way to try to balance exploration and exploitation
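A hedged sketch of ε-greedy selection over a list of action-value estimates; the function name and the list representation are my own choices, not from the slides.

```python
import random

def epsilon_greedy(Q, epsilon=0.1):
    """Q[a] is the current estimate Q_t(a); return an action index."""
    if random.random() < epsilon:
        return random.randrange(len(Q))                # explore: uniform random action
    return max(range(len(Q)), key=lambda a: Q[a])      # exploit: greedy action
```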


10-Armed Testbed

❐ n = 10 possible actions
❐ Each Q*(a) is chosen randomly from a normal distribution: N(0, 1)
❐ Each reward r_t is also normal: N(Q*(a_t), 1)
❐ 1000 plays
❐ Repeat the whole thing 2000 times and average the results
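One plausible way to reproduce this testbed, assuming NumPy; the seed, variable names, and the use of the incremental sample-average update are mine, not prescribed by the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n, plays, runs, epsilon = 10, 1000, 2000, 0.1
avg_reward = np.zeros(plays)

for _ in range(runs):
    q_star = rng.normal(0.0, 1.0, n)        # true action values Q*(a) ~ N(0, 1)
    Q = np.zeros(n)                         # estimates, initially 0
    counts = np.zeros(n)
    for t in range(plays):
        if rng.random() < epsilon:
            a = int(rng.integers(n))        # exploratory play
        else:
            a = int(np.argmax(Q))           # greedy play
        r = rng.normal(q_star[a], 1.0)      # reward r_t ~ N(Q*(a_t), 1)
        counts[a] += 1
        Q[a] += (r - Q[a]) / counts[a]      # sample-average estimate
        avg_reward[t] += r / runs           # average over the 2000 runs
```

Plotting avg_reward against t for a few values of ε gives curves like those summarized on the next slide.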


ε-Greedy Methods on the 10-Armed Testbed


Softmax Action Selection

❐ Softmax action selection methods grade action probabilities by estimated values.
❐ The most common softmax uses a Gibbs, or Boltzmann, distribution: choose action a on play t with probability

   e^{Q_t(a)/τ} / Σ_{b=1}^{n} e^{Q_t(b)/τ},

where τ is the “computational temperature”.
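A small sketch of Gibbs/Boltzmann selection; the function name and parameter names are illustrative assumptions. High τ makes the actions nearly equiprobable; low τ makes selection nearly greedy.

```python
import math
import random

def softmax_action(Q, tau=1.0):
    """Choose action a with probability proportional to exp(Q[a] / tau)."""
    weights = [math.exp(q / tau) for q in Q]          # unnormalized Gibbs weights
    return random.choices(range(len(Q)), weights=weights, k=1)[0]
```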


Linear Learning Automata

Let π_t(a) = Pr{ a_t = a } be the only adapted parameter.

L_R-I (linear, reward-inaction):
   On success: π_{t+1}(a_t) = π_t(a_t) + α (1 − π_t(a_t)),   0 < α < 1
   (the other action probabilities are adjusted to still sum to 1)
   On failure: no change

L_R-P (linear, reward-penalty):
   On success: π_{t+1}(a_t) = π_t(a_t) + α (1 − π_t(a_t)),   0 < α < 1
   (the other action probabilities are adjusted to still sum to 1)
   On failure: π_{t+1}(a_t) = π_t(a_t) + α (0 − π_t(a_t)),   0 < α < 1

For two actions, a stochastic, incremental version of the supervised algorithm.
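A sketch of the L_R-I update, under the assumption that “adjusting the other probabilities” means shrinking them proportionally; that renormalization choice is mine.

```python
def lr_i_update(pi, chosen, success, alpha=0.1):
    """pi is a list of action probabilities summing to 1; returns the updated list."""
    if not success:
        return list(pi)                               # reward-inaction: no change on failure
    new_pi = [p * (1.0 - alpha) for p in pi]          # shrink every probability ...
    new_pi[chosen] = pi[chosen] + alpha * (1.0 - pi[chosen])  # ... and move the chosen one toward 1
    return new_pi                                     # still sums to 1
```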


Incremental Implementation

Recall the sample-average estimation method. The average of the first k rewards is (dropping the dependence on a):

   Q_k = ( r_1 + r_2 + ... + r_k ) / k

Can we do this incrementally (without storing all the rewards)? We could keep a running sum and count, or, equivalently:

   Q_{k+1} = Q_k + (1/(k+1)) [ r_{k+1} − Q_k ]

This is a common form for update rules:

   NewEstimate = OldEstimate + StepSize [ Target − OldEstimate ]
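A quick check, assuming nothing beyond the formula above, that the incremental form reproduces the stored-rewards average; the reward list is made up for illustration.

```python
rewards = [1.0, 0.0, 2.0, 1.0]     # arbitrary illustrative rewards
Q, k = 0.0, 0
for r in rewards:
    Q = Q + (r - Q) / (k + 1)      # NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)
    k += 1
assert abs(Q - sum(rewards) / len(rewards)) < 1e-12
```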


Tracking a Nonstationary Problem

Choosing Q_k to be a sample average is appropriate in a stationary problem, i.e., when none of the Q*(a) change over time. But not in a nonstationary problem.

Better in the nonstationary case is:

   Q_{k+1} = Q_k + α [ r_{k+1} − Q_k ]     for constant α, 0 < α ≤ 1

           = (1 − α)^k Q_0 + Σ_{i=1}^{k} α (1 − α)^{k−i} r_i

an exponential, recency-weighted average
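The constant step-size version as a one-line helper; illustrative only, not from the slides.

```python
def constant_alpha_update(Q, r, alpha=0.1):
    """Exponential recency-weighted average: recent rewards carry more weight,
    which is what you want when the true Q*(a) drift over time."""
    return Q + alpha * (r - Q)
```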


Optimistic Initial Values

❐ All methods so far depend on Q_0(a), i.e., they are biased.
❐ Suppose instead we initialize the action values optimistically, i.e., on the 10-armed testbed, use Q_0(a) = 5 for all a.


Reinforcement Comparison

❐ Compare rewards to a reference reward, r̄_t, e.g., an average of observed rewards
❐ Strengthen or weaken the action taken depending on r_t − r̄_t
❐ Let p_t(a) denote the preference for action a
❐ Preferences determine action probabilities, e.g., by a Gibbs distribution:

   π_t(a) = Pr{ a_t = a } = e^{p_t(a)} / Σ_{b=1}^{n} e^{p_t(b)}

❐ Then:

   p_{t+1}(a_t) = p_t(a_t) + [ r_t − r̄_t ]   and   r̄_{t+1} = r̄_t + α [ r_t − r̄_t ]
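A sketch of one play’s bookkeeping, assuming a list of preferences p and a scalar reference reward; the names and step-size value are mine.

```python
import math

def gibbs_probs(p):
    """pi_t(a) = exp(p_t(a)) / sum_b exp(p_t(b))."""
    exps = [math.exp(x) for x in p]
    z = sum(exps)
    return [e / z for e in exps]

def reinforcement_comparison_step(p, r_bar, action, reward, alpha=0.1):
    """Shift the taken action's preference by (reward - reference), then update the reference."""
    p = list(p)
    p[action] += reward - r_bar          # p_{t+1}(a_t) = p_t(a_t) + [r_t - r_bar_t]
    r_bar += alpha * (reward - r_bar)    # r_bar_{t+1} = r_bar_t + alpha [r_t - r_bar_t]
    return p, r_bar
```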


Associative Search

[Diagram: several different n-armed bandits, each with its own actions]

Imagine switching bandits at each play


Conclusions

❐ These are all very simple methods, but they are complicated enough—we will build on them
❐ Ideas for improvements: estimating uncertainties . . . interval estimation; approximating Bayes optimal solutions; Gittins indices
❐ The full RL problem offers some ideas for solution . . .


The Overall Plan

❐ Lecture 1: What is Computational Reinforcement Learning?
    Learning from evaluative feedback
    Markov decision processes

❐ Lecture 2: Dynamic Programming
    Basic Monte Carlo methods
    Temporal Difference methods
    A unified perspective
    Connections to neuroscience

❐ Lecture 3: Function approximation
    Model-based methods
    Dimensions of Reinforcement Learning


Lecture 1, Part 3: Markov Decision Processes

Objectives of this part:

❐ describe the RL problem in terms of MDPs;
❐ present the idealized form of the RL problem for which we have precise theoretical results;
❐ introduce key components of the mathematics: value functions and Bellman equations;
❐ describe trade-offs between applicability and mathematical tractability.


The Agent-Environment Interface

Agent and environment interact at discrete time steps: t = 0, 1, 2, . . .

   Agent observes state at step t:   s_t ∈ S
   produces action at step t:        a_t ∈ A(s_t)
   gets resulting reward:            r_{t+1} ∈ ℝ
   and resulting next state:         s_{t+1}

   . . . s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, . . .
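A schematic of this loop in Python; env, agent, and their methods (reset, step, act, learn) are placeholder interfaces suggested by the diagram, not anything defined in these lectures.

```python
def run_episode(env, agent, max_steps=1000):
    s = env.reset()                       # initial state s_0
    for t in range(max_steps):
        a = agent.act(s)                  # a_t in A(s_t)
        r, s_next, done = env.step(a)     # r_{t+1} and s_{t+1}
        agent.learn(s, a, r, s_next)      # change the policy from experience
        s = s_next
        if done:                          # terminal state reached
            break
```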


The Agent Learns a Policy

Policy at step t, π_t: a mapping from states to action probabilities

   π_t(s, a) = probability that a_t = a when s_t = s

❐ Reinforcement learning methods specify how the agent changes its policy as a result of experience.
❐ Roughly, the agent’s goal is to get as much reward as it can over the long run.


Getting the Degree of Abstraction Right

❐ Time steps need not refer to fixed intervals of real time.
❐ Actions can be low level (e.g., voltages to motors), or high level (e.g., accept a job offer), “mental” (e.g., shift in focus of attention), etc.
❐ States can be low-level “sensations”, or they can be abstract, symbolic, based on memory, or subjective (e.g., the state of being “surprised” or “lost”).
❐ An RL agent is not like a whole animal or robot, which consist of many RL agents as well as other components.
❐ The environment is not necessarily unknown to the agent, only incompletely controllable.
❐ Reward computation is in the agent’s environment because the agent cannot change it arbitrarily.


Goals and Rewards

❐ Is a scalar reward signal an adequate notion of a goal?—maybe not, but it is surprisingly flexible.
❐ A goal should specify what we want to achieve, not how we want to achieve it.
❐ A goal must be outside the agent’s direct control—thus outside the agent.
❐ The agent must be able to measure success: explicitly; frequently during its lifespan.


Returns

Suppose the sequence of rewards after step t is:

   r_{t+1}, r_{t+2}, r_{t+3}, . . .

What do we want to maximize? In general, we want to maximize the expected return, E{R_t}, for each step t.

Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze.

   R_t = r_{t+1} + r_{t+2} + . . . + r_T,

where T is a final time step at which a terminal state is reached, ending an episode.


Returns for Continuing Tasks

Continuing tasks: interaction does not have natural episodes.

Discounted return:

   R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + . . . = Σ_{k=0}^{∞} γ^k r_{t+k+1},

where γ, 0 ≤ γ ≤ 1, is the discount rate.

   shortsighted  0 ← γ → 1  farsighted
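A direct translation of the discounted return for a finite list of future rewards; purely illustrative.

```python
def discounted_return(rewards, gamma):
    """R_t = sum_k gamma**k * r_{t+k+1} for rewards = [r_{t+1}, r_{t+2}, ...]."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# e.g. discounted_return([1, 1, 1], 0.9) == 1 + 0.9 + 0.81 == 2.71
```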


An Example

Avoid failure: the pole falling beyond a critical angle, or the cart hitting the end of the track.

As an episodic task where the episode ends upon failure:
   reward = +1 for each step before failure
   ⇒ return = number of steps before failure

As a continuing task with discounted return:
   reward = −1 upon failure; 0 otherwise
   ⇒ return is related to −γ^k, for k steps before failure

In either case, return is maximized by avoiding failure for as long as possible.


Another Example

Get to the top of the hill as quickly as possible.

   reward = −1 for each step where not at top of hill
   ⇒ return = −(number of steps before reaching top of hill)

Return is maximized by minimizing the number of steps taken to reach the top of the hill.


A Unified Notation

❐ In episodic tasks, we number the time steps of each episode starting from zero.
❐ We usually do not have to distinguish between episodes, so we write s_t instead of s_{t,j} for the state at step t of episode j.
❐ Think of each episode as ending in an absorbing state that always produces a reward of zero.
❐ We can cover all cases by writing

   R_t = Σ_{k=0}^{∞} γ^k r_{t+k+1},

where γ can be 1 only if a zero-reward absorbing state is always reached.


The Markov Property

❐ By “the state” at step t, we mean whatever information is available to the agent at step t about its environment.
❐ The state can include immediate “sensations,” highly processed sensations, and structures built up over time from sequences of sensations.
❐ Ideally, a state should summarize past sensations so as to retain all “essential” information, i.e., it should have the Markov Property:

   Pr{ s_{t+1} = s′, r_{t+1} = r | s_t, a_t, r_t, s_{t−1}, a_{t−1}, . . . , r_1, s_0, a_0 }
      = Pr{ s_{t+1} = s′, r_{t+1} = r | s_t, a_t }

   for all s′, r, and histories s_t, a_t, r_t, s_{t−1}, a_{t−1}, . . . , r_1, s_0, a_0.


Markov Decision Processes

❐ If a reinforcement learning task has the Markov Property, it is basically a Markov Decision Process (MDP).
❐ If state and action sets are finite, it is a finite MDP.
❐ To define a finite MDP, you need to give:
    state and action sets
    one-step “dynamics” defined by transition probabilities:

       P^a_{ss′} = Pr{ s_{t+1} = s′ | s_t = s, a_t = a }   for all s, s′ ∈ S, a ∈ A(s)

    reward expectations:

       R^a_{ss′} = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s′ }   for all s, s′ ∈ S, a ∈ A(s)
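One possible in-memory encoding of these quantities for a tiny finite MDP; the two states, the actions, and all numbers are invented purely for illustration.

```python
# P[s][a] is a list of (next_state, probability) pairs defining P^a_{ss'};
# R[s][a][s'] is the expected one-step reward R^a_{ss'}.
P = {
    "s0": {"go": [("s1", 0.8), ("s0", 0.2)], "stay": [("s0", 1.0)]},
    "s1": {"go": [("s0", 1.0)],              "stay": [("s1", 1.0)]},
}
R = {
    "s0": {"go": {"s1": 1.0, "s0": 0.0}, "stay": {"s0": 0.0}},
    "s1": {"go": {"s0": 0.0},            "stay": {"s1": 0.5}},
}
```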


Recycling Robot

An Example Finite MDP

❐ At each step, the robot has to decide whether it should (1) actively search for a can, (2) wait for someone to bring it a can, or (3) go to home base and recharge.
❐ Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad).
❐ Decisions are made on the basis of current energy level: high, low.
❐ Reward = number of cans collected


Recycling Robot MDP

S = { high, low }
A(high) = { search, wait }
A(low) = { search, wait, recharge }

R^search = expected no. of cans while searching
R^wait   = expected no. of cans while waiting
R^search > R^wait


Value Functions

❐ The value of a state is the expected return starting from that state; it depends on the agent’s policy.

   State-value function for policy π:

      V^π(s) = E_π{ R_t | s_t = s } = E_π{ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s }

❐ The value of taking an action in a state under policy π is the expected return starting from that state, taking that action, and thereafter following π.

   Action-value function for policy π:

      Q^π(s, a) = E_π{ R_t | s_t = s, a_t = a } = E_π{ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a }


Bellman Equation for a Policy π

The basic idea:

   R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + γ³ r_{t+4} + . . .
       = r_{t+1} + γ ( r_{t+2} + γ r_{t+3} + γ² r_{t+4} + . . . )
       = r_{t+1} + γ R_{t+1}

So:

   V^π(s) = E_π{ R_t | s_t = s }
          = E_π{ r_{t+1} + γ V^π(s_{t+1}) | s_t = s }

Or, without the expectation operator:

   V^π(s) = Σ_a π(s, a) Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V^π(s′) ]


More on the Bellman Equation

   V^π(s) = Σ_a π(s, a) Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V^π(s′) ]

This is a set of equations (in fact, linear), one for each state. The value function for π is its unique solution.
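A sketch of computing V^π by sweeping the equation until it stops changing (iterative policy evaluation, previewed here; dynamic programming proper is next lecture). It assumes the dict encoding of P and R sketched earlier, plus pi[s][a] giving π(s, a).

```python
def evaluate_policy(P, R, pi, gamma=0.9, tol=1e-8):
    """Return V with V[s] close to V^pi(s), the unique solution of the linear system above."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = sum(
                pi[s][a] * sum(prob * (R[s][a][s2] + gamma * V[s2])
                               for s2, prob in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V
```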

Backup diagrams: for V^π and for Q^π.


What about the Value Function for Q^π?

   Q^π(s, a) = Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ Σ_{a′} π(s′, a′) Q^π(s′, a′) ]


Gridworld

❐ Actions: north, south, east, west; deterministic.
❐ If an action would take the agent off the grid: no move, but reward = –1
❐ Other actions produce reward = 0, except actions that move the agent out of the special states A and B as shown.

State-value function for the equiprobable random policy; γ = 0.9

Note: A’s value is less than its immediate reward; B’s value is more than its immediate reward.


Optimal Value Functions

❐ For finite MDPs, policies can be partially ordered:

   π ≥ π′ if and only if V^π(s) ≥ V^π′(s) for all s ∈ S

❐ There is always at least one (and possibly many) policies that are better than or equal to all the others. This is an optimal policy. We denote them all π*.
❐ Optimal policies share the same optimal state-value function:

   V*(s) = max_π V^π(s)   for all s ∈ S

❐ Optimal policies also share the same optimal action-value function:

   Q*(s, a) = max_π Q^π(s, a)   for all s ∈ S and a ∈ A(s)

   This is the expected return for taking action a in state s and thereafter following an optimal policy.


Bellman Optimality Equation for V*

The value of a state under an optimal policy must equal the expected return for the best action from that state:

   V*(s) = max_{a ∈ A(s)} Q^{π*}(s, a)
         = max_{a ∈ A(s)} E{ r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a }
         = max_{a ∈ A(s)} Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V*(s′) ]

The relevant backup diagram:

V* is the unique solution of this system of nonlinear equations.
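A sketch of solving it by repeated sweeps (value iteration, which the dynamic programming lecture develops properly), reusing the same dict encoding of P and R assumed above.

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Return V with V[s] close to V*(s): V(s) <- max_a sum_{s'} P^a_{ss'} [R^a_{ss'} + gamma V(s')]."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = max(
                sum(prob * (R[s][a][s2] + gamma * V[s2]) for s2, prob in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V
```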


Bellman Optimality Equation for Q*

   Q*(s, a) = E{ r_{t+1} + γ max_{a′} Q*(s_{t+1}, a′) | s_t = s, a_t = a }
            = Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ max_{a′} Q*(s′, a′) ]

The relevant backup diagram:

Q* is the unique solution of this system of nonlinear equations.


Why Optimal State-Value Functions are Useful

Any policy that is greedy with respect to V* is an optimal policy.

Therefore, given V*, one-step-ahead search produces the long-term optimal actions.

E.g., back to the gridworld:


Car-on-the-Hill Optimal Value Function

Munos & Moore, “Variable resolution discretization for high-accuracy solutions of optimal control problems”, IJCAI 99.

Get to the top of the hill as quickly as possible (roughly).

Predicted minimum time to goal (negated)


What About Optimal Action-Value Functions?

Given Q*, the agent does not even have to do a one-step-ahead search:

   π*(s) = argmax_{a ∈ A(s)} Q*(s, a)

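As a sketch, with Q given as a dict of dicts Q[s][a] (my representation, not the lecture’s):

```python
def greedy_policy(Q):
    """pi*(s) = argmax_a Q(s, a); no model and no look-ahead needed."""
    return {s: max(Q[s], key=Q[s].get) for s in Q}
```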

Solving the Bellman Optimality Equation

❐ Finding an optimal policy by solving the Bellman Optimality Equation requires the following: accurate knowledge of environment dynamics; enough space and time to do the computation; the Markov Property.
❐ How much space and time do we need? Polynomial in the number of states (via dynamic programming methods; next lecture), BUT the number of states is often huge (e.g., backgammon has about 10^20 states).
❐ We usually have to settle for approximations.
❐ Many RL methods can be understood as approximately solving the Bellman Optimality Equation.


Semi-Markov Decision Processes (SMDPs)

❐ Generalization of an MDP where there is a waiting, or dwell, time τ in each state
❐ Transition probabilities generalize to P(s′, τ | s, a)
❐ Bellman equations generalize, e.g., for a discrete-time SMDP:

   V*(s) = max_{a ∈ A(s)} Σ_{s′, τ} P(s′, τ | s, a) [ R^a_{ss′} + γ^τ V*(s′) ]

where R^a_{ss′} is now the amount of discounted reward expected to accumulate over the waiting time in s upon doing a and ending up in s′


Summary

❐ Agent-environment interaction: states, actions, rewards
❐ Policy: stochastic rule for selecting actions
❐ Return: the function of future rewards the agent tries to maximize
❐ Episodic and continuing tasks
❐ Markov Property
❐ Markov Decision Process: transition probabilities, expected rewards
❐ Value functions: state-value function for a policy, action-value function for a policy, optimal state-value function, optimal action-value function
❐ Optimal value functions
❐ Optimal policies
❐ Bellman Equations
❐ The need for approximation
❐ Semi-Markov Decision Processes


Edward L. Thorndike (1874-1949)

puzzle box

Learning by “Trial-and-Error”


“Law of Effect”

“Of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal will, other things being equal, be more firmly connected with the situation, so that, when it recurs, they will be more likely to recur; those which are accompanied or closely followed by discomfort to the animal will, other things being equal, have their connections with that situation weakened, so that when it recurs, they will be less likely to occur.”

Edward Thorndike, 1911


Search + Memory

❐ Search: Trial-and-Error, Generate-and-Test, Variation-and-Selection

❐ Memory: remember what worked best for each situation and start from there next time


“Credit Assignment Problem”

Spatial

Temporal

Getting useful training information to the right places at the right times

Marvin Minsky, 1961


The Overall Plan

❐ Lecture 1: What is Computational Reinforcement Learning?
    Learning from evaluative feedback
    Markov decision processes

❐ Lecture 2: Basic Monte Carlo methods
    Dynamic Programming
    Temporal Difference methods
    A unified perspective
    Connections to neuroscience

❐ Lecture 3: Function approximation
    Model-based methods
    Dimensions of Reinforcement Learning