An Introduction to COMPUTATIONAL REINFORCEMENT LEARNING
Andrew G. Barto
Department of Computer Science
University of Massachusetts – Amherst
UPF Lecture 1
Autonomous Learning Laboratory – Department of Computer Science
[Diagram: Computational Reinforcement Learning (RL) at the intersection of Psychology, Artificial Intelligence, Control Theory and Operations Research, Artificial Neural Networks, and Neuroscience.]
A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.
The Overall Plan
❐ Lecture 1:
   - What is Computational Reinforcement Learning?
   - Learning from evaluative feedback
   - Markov decision processes
❐ Lecture 2:
   - Dynamic Programming
   - Basic Monte Carlo methods
   - Temporal Difference methods
   - A unified perspective
   - Connections to neuroscience
❐ Lecture 3:
   - Function approximation
   - Model-based methods
   - Dimensions of Reinforcement Learning
What is Reinforcement Learning?
❐ Learning from interaction
❐ Goal-oriented learning
❐ Learning about, from, and while interacting with an external environment
❐ Learning what to do: how to map situations to actions so as to maximize a numerical reward signal
Supervised Learning
[Diagram: Inputs → Supervised Learning System → Outputs]

Training info = desired (target) outputs
Error = (target output – actual output)
Reinforcement Learning
[Diagram: Inputs → RL System → Outputs (“actions”)]

Training info = evaluations (“rewards” / “penalties”)
Objective: get as much reward as possible
Key Features of RL
❐ Learner is not told which actions to take
❐ Trial-and-Error search
❐ Possibility of delayed reward: sacrifice short-term gains for greater long-term gains
❐ The need to explore and exploit
❐ Considers the whole problem of a goal-directed agent interacting with an uncertain environment
Complete Agent
❐ Temporally situated
❐ Continual learning and planning
❐ Object is to affect the environment
❐ Environment is stochastic and uncertain
[Diagram: the Agent sends actions to the Environment; the Environment returns state and reward to the Agent.]
A Less Misleading View
[Diagram: the RL agent embedded in a larger system: external sensations, internal sensations, memory, state, and reward feed the agent, which produces actions.]
Elements of RL
❐ Policy: what to do
❐ Reward: what is good
❐ Value: what is good because it predicts reward
❐ Model: what follows what
An Extended Example: Tic-Tac-Toe
[Figure: a partial tic-tac-toe game tree, branching alternately on x’s moves and o’s moves.]
Assume an imperfect opponent: he/she sometimes makes mistakes.
An RL Approach to Tic-Tac-Toe
1. Make a table with one entry per state:

       State          V(s) = estimated probability of winning
       . . .          0.5  (a guess)
       a win for x    1
       a loss for x   0
       a draw         0

2. Now play lots of games. To pick our moves, look ahead one step:

   [Diagram: current state branching to the various possible next states*]
Just pick the next state with the highest estimated probability of winning, i.e., with the largest V(s): a greedy move.

But 10% of the time pick a move at random: an exploratory move.
RL Learning Rule for Tic-Tac-Toe
[Diagram: a sequence of positions from a game, with greedy moves and one “exploratory” move marked.]

    s  – the state before our greedy move
    s′ – the state after our greedy move

We increment each V(s) toward V(s′): a backup.

    V(s) ← V(s) + α [ V(s′) − V(s) ]

where the step-size parameter α is a small positive fraction, e.g., α = 0.1.
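A minimal sketch of this player in Python follows. It assumes board positions are encoded as hashable values (e.g., strings), that terminal states are seeded with the 1/0 values from the table above, and that `alpha` and the 10% exploration rate are the numbers suggested in these slides.

```python
import random

alpha = 0.1    # step-size parameter
epsilon = 0.1  # fraction of exploratory moves

V = {}         # value table: state -> estimated probability of winning

def value(state):
    # Unseen states start at the 0.5 initial guess (terminal states
    # should be seeded with 1 for wins and 0 for losses/draws).
    return V.setdefault(state, 0.5)

def choose_move(next_states):
    # Exploratory move 10% of the time; otherwise greedy.
    if random.random() < epsilon:
        return random.choice(next_states)
    return max(next_states, key=value)

def backup(s, s_next):
    # V(s) <- V(s) + alpha * [V(s') - V(s)]
    V[s] = value(s) + alpha * (value(s_next) - value(s))
```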
How can we improve this T.T.T. player?
❐ Take advantage of symmetries
   - representation/generalization
   - How might this backfire?
❐ Do we need “random” moves? Why?
   - Do we always need a full 10%?
❐ Can we learn from “random” moves?
❐ Can we learn offline?
   - Pre-training from self play?
   - Using learned models of the opponent?
❐ . . .
How is Tic-Tac-Toe Too Easy?
❐ Finite, small number of states
❐ One-step look-ahead is always possible
❐ State completely observable
❐ . . .
Some Notable RL Applications
❐ TD-Gammon: Tesauro
   - world’s best backgammon program
❐ Elevator Control: Crites & Barto
   - high-performance down-peak elevator controller
❐ Inventory Management: Van Roy, Bertsekas, Lee & Tsitsiklis
   - 10–15% improvement over industry-standard methods
❐ Dynamic Channel Assignment: Singh & Bertsekas, Nie & Haykin
   - high-performance assignment of radio channels to mobile telephone calls
TD-Gammon
Tesauro, 1992–1995

Start with a random network.
Play very many games against self.
Learn a value function from this simulated experience.

This produces arguably the best player in the world.

[Diagram: a neural network maps a backgammon position to a value; trained on the TD error V_{t+1} − V_t; action selection by 2–3 ply search.]
Elevator Dispatching
Crites and Barto, 1996

10 floors, 4 elevator cars
STATES: button states; positions, directions, and motion states of cars; passengers in cars & in halls
ACTIONS: stop at, or go by, next floor
REWARDS: roughly, −1 per time step for each person waiting

Conservatively about 10^22 states
Autonomous Helicopter Flight
A. Ng, Stanford; H. Kim, M. Jordan, S. Sastry, Berkeley
Quadrupedal Locomotion
Nate Kohl & Peter Stone, Univ. of Texas at Austin

[Videos: before learning; after 1000 trials, or about 3 hours.]

All training done with physical robots: Sony Aibo ERS-210A
Learning Control for Dynamically Stable Walking Robots
Russ Tedrake, Teresa Zhang, H. Sebastian Seung, MIT

Start with a passive walker.
http://hebb.mit.edu/~russt/robots
Grasp Control
R. Platt, A. Fagg, R. Grupen, Univ. of Massachusetts

UMass torso: “Dexter”
Some RL History
[Diagram: three historical threads leading to modern RL:
   - Trial-and-error learning: Thorndike (Ψ, 1911), Minsky, Klopf, Barto et al., Holland
   - Temporal-difference learning: secondary reinforcement (Ψ), Samuel, Witten, Sutton
   - Optimal control, value functions: Hamilton (Physics, 1800s), Shannon, Bellman/Howard (OR), Werbos, Watkins]
Arthur Samuel 1959, 1967
Samuel’s Checkers Player
❐ Score board configurations by a “scoring polynomial” (after Shannon, 1950)
❐ Minimax to determine “backed-up score” of a position
❐ Alpha-beta cutoffs
❐ Rote learning: save each board configuration encountered together with its backed-up score
   - needed a “sense of direction”: like discounting
❐ Learning by generalization: similar to the TD algorithm
Samuel’s Backups
The Basic Idea
“. . . we are attempting to make the score, calculated for the current board position, look like that calculated for the terminal board positions of the chain of moves which most probably occur during actual play.”

A. L. Samuel, Some Studies in Machine Learning Using the Game of Checkers, 1959
MENACE (Michie 1961)
[Figure: an array of tic-tac-toe board positions, one for each of MENACE’s matchboxes.]
“Matchbox Educable Noughts and Crosses Engine”
The Overall Plan
❐ Lecture 1:
   - What is Computational Reinforcement Learning?
   - Learning from evaluative feedback
   - Markov decision processes
❐ Lecture 2:
   - Dynamic Programming
   - Basic Monte Carlo methods
   - Temporal Difference methods
   - A unified perspective
   - Connections to neuroscience
❐ Lecture 3:
   - Function approximation
   - Model-based methods
   - Dimensions of Reinforcement Learning
Lecture 1, Part 2: Evaluative Feedback
❐ Evaluating actions vs. instructing by giving correct actions
❐ Pure evaluative feedback depends totally on the action taken. Pure instructive feedback depends not at all on the action taken.
❐ Supervised learning is instructive; optimization is evaluative
❐ Associative vs. nonassociative:
   - Associative: inputs mapped to outputs; learn the best output for each input
   - Nonassociative: “learn” (find) one best output
❐ The n-armed bandit (at least how we treat it) is:
   - nonassociative
   - evaluative feedback
The n-Armed Bandit Problem
❐ Choose repeatedly from one of n actions; each choice is called a play
❐ After each play a_t, you get a reward r_t, where

    E{ r_t | a_t } = Q*(a_t)

  These are the unknown action values. The distribution of r_t depends only on a_t.
❐ Objective is to maximize the reward in the long term, e.g., over 1000 plays

To solve the n-armed bandit problem, you must explore a variety of actions and then exploit the best of them.
The Exploration/Exploitation Dilemma
❐ Suppose you form action-value estimates Q_t(a) ≈ Q*(a)
❐ The greedy action at t is

    a_t* = argmax_a Q_t(a)

    a_t = a_t*  →  exploitation
    a_t ≠ a_t*  →  exploration

❐ You can’t exploit all the time; you can’t explore all the time
❐ You can never stop exploring; but you should always reduce exploring
Action-Value Methods
❐ Methods that adapt action-value estimates and nothing else. E.g., suppose that by the t-th play, action a had been chosen k_a times, producing rewards r_1, r_2, . . . , r_{k_a}; then

    Q_t(a) = ( r_1 + r_2 + · · · + r_{k_a} ) / k_a        (the “sample average”)

❐ As k_a → ∞, Q_t(a) converges to Q*(a).
ε-Greedy Action Selection
❐ Greedy action selection:

    a_t = a_t* = argmax_a Q_t(a)

❐ ε-greedy:

    a_t = a_t*             with probability 1 − ε
    a_t = a random action  with probability ε
. . . the simplest way to try to balance exploration and exploitation
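As a code sketch, ε-greedy selection over a table of estimates looks like this (the names `Q` and `epsilon` are ours):

```python
import random

def epsilon_greedy(Q, epsilon):
    # Greedy with probability 1 - epsilon, uniform random otherwise.
    if random.random() < epsilon:
        return random.randrange(len(Q))
    best = max(Q)
    # Break ties among greedy actions at random.
    return random.choice([a for a, q in enumerate(Q) if q == best])
```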
10-Armed Testbed
❐ n = 10 possible actions
❐ Each Q*(a) is chosen randomly from a normal distribution N(0, 1)
❐ Each reward r_t is also normal: N(Q*(a_t), 1)
❐ 1000 plays
❐ Repeat the whole thing 2000 times and average the results (a runnable sketch follows)
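A self-contained sketch of the experiment, using the sample-average estimates defined earlier (the incremental form of the update used here is derived a few slides below):

```python
import random

def run_bandit(epsilon, n=10, plays=1000):
    q_star = [random.gauss(0, 1) for _ in range(n)]    # Q*(a) ~ N(0, 1)
    Q, counts, total = [0.0] * n, [0] * n, 0.0
    for _ in range(plays):
        if random.random() < epsilon:
            a = random.randrange(n)                    # exploratory play
        else:
            a = max(range(n), key=lambda i: Q[i])      # greedy play
        r = random.gauss(q_star[a], 1)                 # r_t ~ N(Q*(a_t), 1)
        counts[a] += 1
        Q[a] += (r - Q[a]) / counts[a]                 # incremental sample average
        total += r
    return total / plays

# Average over 2000 independent bandit tasks for each epsilon.
for eps in (0.0, 0.01, 0.1):
    mean = sum(run_bandit(eps) for _ in range(2000)) / 2000
    print(f"epsilon = {eps}: mean reward per play = {mean:.3f}")
```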
ε-Greedy Methods on the 10-Armed Testbed

[Figure: average reward and % optimal action over plays, comparing ε = 0 (greedy), ε = 0.01, and ε = 0.1.]
Softmax Action Selection
❐ Softmax action selection methods grade action probabilities by estimated values.
❐ The most common softmax uses a Gibbs, or Boltzmann, distribution: choose action a on play t with probability

    e^{Q_t(a)/τ} / Σ_{b=1}^{n} e^{Q_t(b)/τ},

where τ is the “computational temperature”.
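A sketch of Boltzmann selection (the max-shift is our own numerical precaution against overflow, not part of the lecture):

```python
import math
import random

def softmax_action(Q, tau):
    # Sample action a with probability proportional to exp(Q[a] / tau).
    m = max(q / tau for q in Q)                     # shift to avoid overflow
    weights = [math.exp(q / tau - m) for q in Q]
    return random.choices(range(len(Q)), weights=weights)[0]
```

High τ makes all actions nearly equiprobable; as τ → 0 the choice becomes greedy.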
Linear Learning Automata
Let π_t(a) = Pr{ a_t = a } be the only adapted parameter.

L_R-I (linear, reward-inaction):
    On success:  π_{t+1}(a_t) = π_t(a_t) + α (1 − π_t(a_t)),   0 < α < 1
                 (the other action probabilities are adjusted to still sum to 1)
    On failure:  no change

L_R-P (linear, reward-penalty):
    On success:  π_{t+1}(a_t) = π_t(a_t) + α (1 − π_t(a_t)),   0 < α < 1
                 (the other action probabilities are adjusted to still sum to 1)
    On failure:  π_{t+1}(a_t) = π_t(a_t) + α (0 − π_t(a_t)),   0 < α < 1

For two actions, this is a stochastic, incremental version of the supervised algorithm.
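A sketch of one update for a probability vector `pi` (the rescaling loop implements the “adjusted to still sum to 1” clause; `penalty` switches between L_R-I and L_R-P):

```python
def lr_update(pi, chosen, success, alpha, penalty=False):
    # One step of L_R-I (penalty=False) or L_R-P (penalty=True).
    if success:
        pi[chosen] += alpha * (1.0 - pi[chosen])     # move toward 1
    elif penalty:
        pi[chosen] += alpha * (0.0 - pi[chosen])     # move toward 0
    else:
        return pi                                    # reward-inaction: no change
    # Rescale the other probabilities so the vector still sums to 1.
    rest = 1.0 - pi[chosen]
    total = sum(p for i, p in enumerate(pi) if i != chosen)
    for i in range(len(pi)):
        if i != chosen:
            pi[i] = pi[i] * rest / total if total > 0 else rest / (len(pi) - 1)
    return pi
```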
Incremental Implementation
Recall the sample average estimation method: the average of the first k rewards (dropping the dependence on a) is

    Q_k = ( r_1 + r_2 + · · · + r_k ) / k

Can we do this incrementally (without storing all the rewards)? We could keep a running sum and count, or, equivalently:

    Q_{k+1} = Q_k + (1 / (k + 1)) [ r_{k+1} − Q_k ]

This is a common form for update rules:

    NewEstimate = OldEstimate + StepSize [ Target − OldEstimate ]
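A quick sketch checking that the incremental rule reproduces the stored-rewards average (the reward sequence is an arbitrary test case):

```python
rewards = [1.0, 0.0, 2.0, 1.0]          # hypothetical reward sequence
Q, k = 0.0, 0
for r in rewards:
    k += 1
    Q += (r - Q) / k                    # Q_{k+1} = Q_k + (1/(k+1)) [r_{k+1} - Q_k]
print(Q, sum(rewards) / len(rewards))   # both print 1.0
```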
Tracking a Nonstationary Problem
Choosing Q_k to be a sample average is appropriate in a stationary problem, i.e., when none of the Q*(a) change over time, but not in a nonstationary problem. Better in the nonstationary case is

    Q_{k+1} = Q_k + α [ r_{k+1} − Q_k ]    for constant α, 0 < α ≤ 1,

which gives

    Q_k = (1 − α)^k Q_0 + Σ_{i=1}^{k} α (1 − α)^{k−i} r_i,

an exponential, recency-weighted average.
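A sketch verifying the closed form against the recursive update (α, Q_0, and the rewards are arbitrary test values):

```python
alpha, Q0 = 0.3, 0.0
rewards = [1.0, 2.0, 0.5, 3.0]

# Recursive form: Q <- Q + alpha * (r - Q).
Q = Q0
for r in rewards:
    Q += alpha * (r - Q)

# Closed form: exponential, recency-weighted average.
k = len(rewards)
Q_closed = (1 - alpha) ** k * Q0 + sum(
    alpha * (1 - alpha) ** (k - i) * r
    for i, r in enumerate(rewards, start=1)
)
print(Q, Q_closed)    # equal up to floating-point error
```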
Optimistic Initial Values
❐ All methods so far depend on Q_0(a), i.e., they are biased.
❐ Suppose instead we initialize the action values optimistically, i.e., on the 10-armed testbed, use Q_0(a) = 5 for all a (sketched below).
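A sketch on the testbed: with Q_0(a) = 5, even pure greedy selection tries every arm, because early rewards disappoint the optimistic estimates and push the learner elsewhere:

```python
import random

n = 10
q_star = [random.gauss(0, 1) for _ in range(n)]    # true values, as on the testbed
Q = [5.0] * n                                      # optimistic: Q_0(a) = 5 for all a
counts = [0] * n

for _ in range(1000):
    a = max(range(n), key=lambda i: Q[i])          # pure greedy selection
    r = random.gauss(q_star[a], 1)
    counts[a] += 1
    Q[a] += (r - Q[a]) / counts[a]                 # sample-average update
print(counts)    # every arm gets tried early on
```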
Reinforcement Comparison
❐ Compare rewards to a reference reward, r̄_t, e.g., an average of observed rewards
❐ Strengthen or weaken the action taken depending on r_t − r̄_t
❐ Let p_t(a) denote the preference for action a
❐ Preferences determine action probabilities, e.g., by a Gibbs distribution:

    π_t(a) = Pr{ a_t = a } = e^{p_t(a)} / Σ_{b=1}^{n} e^{p_t(b)}

❐ Then (see the sketch below):

    p_{t+1}(a_t) = p_t(a_t) + [ r_t − r̄_t ]    and    r̄_{t+1} = r̄_t + α [ r_t − r̄_t ]
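A sketch of one reinforcement-comparison step (`softmax_probs` is our helper; α is the reference-reward step size from the update above):

```python
import math
import random

def softmax_probs(p):
    # Gibbs distribution over preferences p.
    m = max(p)
    e = [math.exp(x - m) for x in p]
    total = sum(e)
    return [x / total for x in e]

def reinforcement_comparison_step(p, r_bar, a, reward, alpha=0.1):
    p[a] += reward - r_bar              # strengthen/weaken the taken action
    r_bar += alpha * (reward - r_bar)   # track the reference reward
    return p, r_bar

# Usage: sample an action from the preferences, observe a reward, update.
p, r_bar = [0.0, 0.0, 0.0], 0.0
a = random.choices(range(3), weights=softmax_probs(p))[0]
p, r_bar = reinforcement_comparison_step(p, r_bar, a, reward=1.0)
```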
Associative Search
[Figure: an agent’s actions feeding one of several bandits, e.g., “Bandit 3”.]

Imagine switching bandits at each play.
Conclusions
❐ These are all very simple methods, but they are complicated enough; we will build on them
❐ Ideas for improvements:
   - estimating uncertainties . . .
   - interval estimation
   - approximating Bayes-optimal solutions
   - Gittins indices
❐ The full RL problem offers some ideas for solution . . .
The Overall Plan
❐ Lecture 1:
   - What is Computational Reinforcement Learning?
   - Learning from evaluative feedback
   - Markov decision processes
❐ Lecture 2:
   - Dynamic Programming
   - Basic Monte Carlo methods
   - Temporal Difference methods
   - A unified perspective
   - Connections to neuroscience
❐ Lecture 3:
   - Function approximation
   - Model-based methods
   - Dimensions of Reinforcement Learning
Lecture 1, Part 3: Markov Decision Processes
Objectives of this part:

❐ describe the RL problem in terms of MDPs;
❐ present the idealized form of the RL problem for which we have precise theoretical results;
❐ introduce key components of the mathematics: value functions and Bellman equations;
❐ describe trade-offs between applicability and mathematical tractability.
The Agent-Environment Interface
Agent and environment interact at discrete time steps t = 0, 1, 2, . . .

    Agent observes state at step t:   s_t ∈ S
    produces action at step t:        a_t ∈ A(s_t)
    gets resulting reward:            r_{t+1} ∈ ℝ
    and resulting next state:         s_{t+1}

[Diagram: trajectory s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, . . .]
The Agent Learns a Policy

Policy at step t, π_t: a mapping from states to action probabilities:

    π_t(s, a) = probability that a_t = a when s_t = s

❐ Reinforcement learning methods specify how the agent changes its policy as a result of experience.
❐ Roughly, the agent’s goal is to get as much reward as it can over the long run.
Getting the Degree of Abstraction Right

❐ Time steps need not refer to fixed intervals of real time.
❐ Actions can be low level (e.g., voltages to motors), or high level (e.g., accept a job offer), “mental” (e.g., shift in focus of attention), etc.
❐ States can be low-level “sensations”, or they can be abstract, symbolic, based on memory, or subjective (e.g., the state of being “surprised” or “lost”).
❐ An RL agent is not like a whole animal or robot, which may consist of many RL agents as well as other components.
❐ The environment is not necessarily unknown to the agent, only incompletely controllable.
❐ Reward computation is in the agent’s environment because the agent cannot change it arbitrarily.
Goals and Rewards
❐ Is a scalar reward signal an adequate notion of a goal? Maybe not, but it is surprisingly flexible.
❐ A goal should specify what we want to achieve, not how we want to achieve it.
❐ A goal must be outside the agent’s direct control, thus outside the agent.
❐ The agent must be able to measure success: explicitly, and frequently during its lifespan.
Returns
Suppose the sequence of rewards after step t is r_{t+1}, r_{t+2}, r_{t+3}, . . .

What do we want to maximize? In general, we want to maximize the expected return, E{ R_t }, for each step t.

Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze.

    R_t = r_{t+1} + r_{t+2} + · · · + r_T,

where T is a final time step at which a terminal state is reached, ending an episode.
Returns for Continuing Tasks
Continuing tasks: interaction does not have natural episodes.

Discounted return:

    R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + · · · = Σ_{k=0}^{∞} γ^k r_{t+k+1},

where γ, 0 ≤ γ ≤ 1, is the discount rate:

    shortsighted  0 ← γ → 1  farsighted
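A small sketch computing a truncated discounted return (the reward sequence is hypothetical):

```python
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0]   # r_{t+1}, r_{t+2}, ... (hypothetical)

# R_t = sum over k of gamma^k * r_{t+k+1}
R = sum(gamma ** k * r for k, r in enumerate(rewards))
print(R)   # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*1.0 = 3.349
```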
An Example
Avoid failure: the pole falling beyond a critical angle, or the cart hitting the end of the track.

As an episodic task, where an episode ends upon failure:
    reward = +1 for each step before failure
    ⇒ return = number of steps before failure

As a continuing task with discounted return:
    reward = −1 upon failure; 0 otherwise
    ⇒ return is related to −γ^k, for k steps before failure

In either case, return is maximized by avoiding failure for as long as possible.
Another Example
Get to the top of the hill as quickly as possible.

    reward = −1 for each step where not at top of hill
    ⇒ return = −(number of steps before reaching top of hill)

Return is maximized by minimizing the number of steps taken to reach the top of the hill.
A Unified Notation
❐ In episodic tasks, we number the time steps of each episode starting from zero.
❐ We usually do not have to distinguish between episodes, so we write s_t instead of s_{t,j} for the state at step t of episode j.
❐ Think of each episode as ending in an absorbing state that always produces a reward of zero.
❐ We can cover all cases by writing

    R_t = Σ_{k=0}^{∞} γ^k r_{t+k+1},

where γ can be 1 only if a zero-reward absorbing state is always reached.
The Markov Property
❐ By “the state” at step t, we mean whatever information is available to the agent at step t about its environment.
❐ The state can include immediate “sensations,” highly processed sensations, and structures built up over time from sequences of sensations.
❐ Ideally, a state should summarize past sensations so as to retain all “essential” information, i.e., it should have the Markov Property:

    Pr{ s_{t+1} = s′, r_{t+1} = r | s_t, a_t, r_t, s_{t−1}, a_{t−1}, . . . , r_1, s_0, a_0 }
        = Pr{ s_{t+1} = s′, r_{t+1} = r | s_t, a_t }

for all s′, r, and histories s_t, a_t, r_t, s_{t−1}, a_{t−1}, . . . , r_1, s_0, a_0.
Markov Decision Processes
❐ If a reinforcement learning task has the Markov Property, it is basically a Markov Decision Process (MDP).
❐ If state and action sets are finite, it is a finite MDP.
❐ To define a finite MDP, you need to give (one possible data layout is sketched below):
   - state and action sets
   - one-step “dynamics” defined by transition probabilities:

        P^a_{ss′} = Pr{ s_{t+1} = s′ | s_t = s, a_t = a }    for all s, s′ ∈ S, a ∈ A(s)

   - reward expectations:

        R^a_{ss′} = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s′ }    for all s, s′ ∈ S, a ∈ A(s)
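One possible way to carry these two tables in code, a minimal sketch (the dictionary layout is our choice, not the lecture’s):

```python
from typing import Dict, List, Tuple

State, Action = str, str

class FiniteMDP:
    """P[s][a] is a list of (s', probability); R[(s, a, s')] is R^a_{ss'}."""
    def __init__(self,
                 P: Dict[State, Dict[Action, List[Tuple[State, float]]]],
                 R: Dict[Tuple[State, Action, State], float],
                 gamma: float):
        self.P, self.R, self.gamma = P, R, gamma

    def states(self) -> List[State]:
        return list(self.P)

    def actions(self, s: State) -> List[Action]:
        return list(self.P[s])
```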
Recycling Robot
An Example Finite MDP
❐ At each step, the robot has to decide whether it should (1) actively search for a can, (2) wait for someone to bring it a can, or (3) go to home base and recharge.
❐ Searching is better, but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad).
❐ Decisions are made on the basis of the current energy level: high, low.
❐ Reward = number of cans collected
Recycling Robot MDP
    S = { high, low }
    A(high) = { search, wait }
    A(low) = { search, wait, recharge }

    R^search = expected number of cans while searching
    R^wait   = expected number of cans while waiting
    R^search > R^wait
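In the representation sketched above, the robot’s MDP might be written as follows. The probabilities `alpha`, `beta` and the reward numbers are illustrative placeholders only; the lecture leaves them symbolic:

```python
# Illustrative parameters, NOT from the lecture: alpha (beta) is the chance
# the battery stays high (low) while searching; rescue is penalized.
alpha, beta = 0.8, 0.6
R_search, R_wait, R_rescue = 2.0, 1.0, -3.0

P = {
    "high": {"search": [("high", alpha), ("low", 1 - alpha)],
             "wait":   [("high", 1.0)]},
    "low":  {"search": [("low", beta), ("high", 1 - beta)],  # depleted: rescued
             "wait":   [("low", 1.0)],
             "recharge": [("high", 1.0)]},
}
R = {("high", "search", "high"): R_search,
     ("high", "search", "low"):  R_search,
     ("high", "wait",   "high"): R_wait,
     ("low",  "search", "low"):  R_search,
     ("low",  "search", "high"): R_rescue,   # ran out of power while searching
     ("low",  "wait",   "low"):  R_wait,
     ("low",  "recharge", "high"): 0.0}
# P and R plug directly into the FiniteMDP sketch above, e.g.
# mdp = FiniteMDP(P, R, gamma=0.9).
```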
Value Functions
❐ The value of a state is the expected return starting from that state; it depends on the agent’s policy.

State-value function for policy π:

    V^π(s) = E_π{ R_t | s_t = s } = E_π{ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s }

❐ The value of taking an action in a state under policy π is the expected return starting from that state, taking that action, and thereafter following π.

Action-value function for policy π:

    Q^π(s, a) = E_π{ R_t | s_t = s, a_t = a } = E_π{ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a }
Bellman Equation for a Policy π
The basic idea:

    R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + γ³ r_{t+4} + · · ·
        = r_{t+1} + γ ( r_{t+2} + γ r_{t+3} + γ² r_{t+4} + · · · )
        = r_{t+1} + γ R_{t+1}

So:

    V^π(s) = E_π{ R_t | s_t = s } = E_π{ r_{t+1} + γ V^π(s_{t+1}) | s_t = s }

Or, without the expectation operator:

    V^π(s) = Σ_a π(s, a) Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V^π(s′) ]
More on the Bellman Equation
    V^π(s) = Σ_a π(s, a) Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V^π(s′) ]

This is a set of equations (in fact, linear), one for each state. The value function for π is its unique solution.

[Backup diagrams for V^π and Q^π.]
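Because the Bellman equation can be used as an update, here is a sketch of iterative policy evaluation for the equiprobable random policy over the FiniteMDP layout above (a preview; dynamic programming proper is next lecture’s topic):

```python
def evaluate_random_policy(mdp, theta=1e-8):
    # Sweep V(s) <- sum_a pi(s,a) sum_s' P [R + gamma V(s')] to convergence.
    V = {s: 0.0 for s in mdp.states()}
    while True:
        delta = 0.0
        for s in mdp.states():
            acts = mdp.actions(s)
            v = sum(
                (1.0 / len(acts)) * prob *
                (mdp.R.get((s, a, s2), 0.0) + mdp.gamma * V[s2])
                for a in acts
                for s2, prob in mdp.P[s][a]
            )
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V
```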
What about the Bellman Equation for Q^π?

    Q^π(s, a) = Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ Σ_{a′} π(s′, a′) Q^π(s′, a′) ]
Gridworld
❐ Actions: north, south, east, west; deterministic.
❐ If an action would take the agent off the grid: no move, but reward = −1.
❐ Other actions produce reward = 0, except actions that move the agent out of the special states A and B, as shown.

[Figure: 5×5 gridworld and its state-value function for the equiprobable random policy, γ = 0.9.]

Note: A’s value is less than its immediate reward; B’s value is more than its immediate reward.
Optimal Value Functions

❐ For finite MDPs, policies can be partially ordered:

    π ≥ π′ if and only if V^π(s) ≥ V^{π′}(s) for all s ∈ S

❐ There is always at least one policy (and possibly many) that is better than or equal to all the others. This is an optimal policy. We denote them all π*.
❐ Optimal policies share the same optimal state-value function:

    V*(s) = max_π V^π(s)    for all s ∈ S

❐ Optimal policies also share the same optimal action-value function:

    Q*(s, a) = max_π Q^π(s, a)    for all s ∈ S and a ∈ A(s)

  This is the expected return for taking action a in state s and thereafter following an optimal policy.
Bellman Optimality Equation for V*
The value of a state under an optimal policy must equal the expected return for the best action from that state:

    V*(s) = max_{a∈A(s)} Q^{π*}(s, a)
          = max_{a∈A(s)} E{ r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a }
          = max_{a∈A(s)} Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V*(s′) ]

[Backup diagram: max over the actions available in s.]

V* is the unique solution of this system of nonlinear equations.
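Turning the optimality equation into an update gives value iteration; a sketch over the same FiniteMDP layout (again previewing next lecture’s dynamic programming):

```python
def value_iteration(mdp, theta=1e-8):
    # Sweep V(s) <- max_a sum_s' P [R + gamma V(s')] to a fixed point.
    V = {s: 0.0 for s in mdp.states()}
    while True:
        delta = 0.0
        for s in mdp.states():
            v = max(
                sum(prob * (mdp.R.get((s, a, s2), 0.0) + mdp.gamma * V[s2])
                    for s2, prob in mdp.P[s][a])
                for a in mdp.actions(s)
            )
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V
```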
Bellman Optimality Equation for Q*
    Q*(s, a) = E{ r_{t+1} + γ max_{a′} Q*(s′, a′) | s_t = s, a_t = a }
             = Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ max_{a′} Q*(s′, a′) ]

[Backup diagram: expectation over next states, then max over next actions.]

Q* is the unique solution of this system of nonlinear equations.
Why Optimal State-Value Functions are Useful
Any policy that is greedy with respect to V* is an optimal policy. Therefore, given V*, one-step-ahead search produces the long-term optimal actions.

E.g., back to the gridworld:

[Figure: V* for the gridworld and a corresponding optimal policy π*.]
Car-on-the-Hill Optimal Value Function
Munos & Moore, “Variable resolution discretization for high-accuracy solutions of optimal control problems”, IJCAI 99.

Get to the top of the hill as quickly as possible (roughly).

[Figure: the optimal value function, i.e., the predicted minimum time to goal (negated).]
What About Optimal Action-Value Functions?
Given Q*, the agent does not even have to do a one-step-ahead search:

    π*(s) = argmax_{a∈A(s)} Q*(s, a)
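In code this is a one-liner over a tabular Q* (a sketch; here `Q` maps (state, action) pairs to values and `actions(s)` lists the legal actions):

```python
def greedy_policy(Q, actions, s):
    # pi*(s) = argmax over a of Q*(s, a): no model or look-ahead needed.
    return max(actions(s), key=lambda a: Q[(s, a)])
```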
Solving the Bellman Optimality Equation

❐ Finding an optimal policy by solving the Bellman Optimality Equation requires:
   - accurate knowledge of environment dynamics;
   - enough space and time to do the computation;
   - the Markov Property.
❐ How much space and time do we need?
   - polynomial in the number of states (via dynamic programming methods; next lecture),
   - BUT the number of states is often huge (e.g., backgammon has about 10^20 states).
❐ We usually have to settle for approximations.
❐ Many RL methods can be understood as approximately solving the Bellman Optimality Equation.
Semi-Markov Decision Processes (SMDPs)
❐ Generalization of an MDP where there is a waiting, or dwell, time τ in each state
❐ Transition probabilities generalize to P(s′, τ | s, a)
❐ Bellman equations generalize, e.g., for a discrete-time SMDP:

    V*(s) = max_{a∈A(s)} Σ_{s′,τ} P(s′, τ | s, a) [ R^a_{ss′} + γ^τ V*(s′) ]

where R^a_{ss′} is now the amount of discounted reward expected to accumulate over the waiting time in s upon doing a and ending up in s′.
Summary
❐ Agent-environment interaction
   - states
   - actions
   - rewards
❐ Policy: stochastic rule for selecting actions
❐ Return: the function of future rewards the agent tries to maximize
❐ Episodic and continuing tasks
❐ Markov Property
❐ Markov Decision Process
   - transition probabilities
   - expected rewards
❐ Value functions
   - state-value function for a policy
   - action-value function for a policy
   - optimal state-value function
   - optimal action-value function
❐ Optimal value functions
❐ Optimal policies
❐ Bellman Equations
❐ The need for approximation
❐ Semi-Markov Decision Processes
Edward L. Thorndike (1874-1949)
puzzle box
Learning by “Trial-and-Error”
“Law of Effect”
“Of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal will, other things being equal, be more firmly connected with the situation, so that, when it recurs, they will be more likely to recur; those which are accompanied or closely followed by discomfort to the animal will, other things being equal, have their connections with that situation weakened, so that when it recurs, they will be less likely to occur.”

Edward Thorndike, 1911
Search + Memory
❐ Search: Trial-and-Error, Generate-and-Test, Variation-and-Selection
❐ Memory: remember what worked best for each situation and start from there next time
“Credit Assignment Problem”
Spatial and temporal: getting useful training information to the right places at the right times.

Marvin Minsky, 1961
The Overall Plan
❐ Lecture 1:
   - What is Computational Reinforcement Learning?
   - Learning from evaluative feedback
   - Markov decision processes
❐ Lecture 2:
   - Basic Monte Carlo methods
   - Dynamic Programming
   - Temporal Difference methods
   - A unified perspective
   - Connections to neuroscience
❐ Lecture 3:
   - Function approximation
   - Model-based methods
   - Dimensions of Reinforcement Learning