An Introduction to COMPUTATIONAL REINFORCEMENT LEARNING
Andrew G. Barto
Department of Computer Science
University of Massachusetts – Amherst
Lecture 3
Autonomous Learning Laboratory – Department of Computer Science
A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction, MIT Press, 1998.
The Overall Plan
Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback Markov decision processes
Lecture 2: Dynamic Programming Basic Monte Carlo methods Temporal Difference methods A unified perspective Connections to neuroscience
Lecture 3: Function approximation Model-based methods Dimensions of Reinforcement Learning
Lecture 3, Part 1: Generalization and Function Approximation
Objectives of this part:
– Look at how experience with a limited part of the state set can be used to produce good behavior over a much larger part
– Overview of function approximation (FA) methods and how they can be adapted to RL
Value Prediction with FA
As usual: Policy Evaluation (the prediction problem): for a given policy π, compute the state-value function $V^\pi$
In earlier chapters, value functions were stored in lookup tables.
Here, the value function estimate at time t, $V_t$, depends on a parameter vector $\vec{\theta}_t$, and only the parameter vector is updated.
e.g., $\vec{\theta}_t$ could be the vector of connection weights of a neural network.
Adapt Supervised Learning Algorithms
Inputs → Supervised Learning System → Outputs
Training Info = desired (target) outputs
Error = (target output – actual output)
Training example = {input, target output}
Backups as Training Examples
e.g., the TD(0) backup:
$V(s_t) \leftarrow V(s_t) + \alpha\,[\,r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\,]$

As a training example:

input: description of $s_t$;  target output: $r_{t+1} + \gamma V(s_{t+1})$
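To make the correspondence concrete, here is a minimal sketch (not part of the original slides) that packages one TD(0) backup as a supervised (input, target) pair; the `features` helper and the transition values are hypothetical placeholders.

```python
import numpy as np

def td0_training_example(features, theta, s_t, r_next, s_next, gamma=0.9):
    """Package one TD(0) backup as a supervised (input, target) pair.

    features(s) -> feature vector describing state s (hypothetical helper).
    theta       -> current parameters of the value-function approximator.
    """
    x_t = features(s_t)                   # "description of s_t" is the input
    v_next = theta @ features(s_next)     # current estimate V_t(s_{t+1})
    target = r_next + gamma * v_next      # TD(0) target: r_{t+1} + gamma V_t(s_{t+1})
    return x_t, target                    # one training example {input, target output}
```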
Any FA Method?
In principle, yes:
– artificial neural networks
– decision trees
– multivariate regression methods
– etc.
But RL has some special requirements:
– usually want to learn while interacting
– ability to handle nonstationarity
– other?
Gradient Descent Methods
$\vec{\theta}_t = (\theta_t(1), \theta_t(2), \ldots, \theta_t(n))^T$

Assume $V_t$ is a (sufficiently smooth) differentiable function of $\vec{\theta}_t$, for all $s \in S$.

Assume, for now, training examples of this form:

$\{\text{description of } s_t,\; V^\pi(s_t)\}$
Performance Measures
Many are applicable but… a common and simple one is the mean-squared error
(MSE) over a distribution P :
Why P ? Why minimize MSE? Let us assume that P is always the distribution of states
with which backups are done. The on-policy distribution: the distribution created while
following the policy being evaluated. Stronger results are available for this distribution.
$MSE(\vec{\theta}_t) = \sum_{s \in S} P(s)\,[V^\pi(s) - V_t(s)]^2$
Gradient Descent
Let f be any function of the parameter space. Its gradient at any point $\vec{\theta}_t$ in this space is:

$\nabla_{\vec{\theta}} f(\vec{\theta}_t) = \left( \dfrac{\partial f(\vec{\theta}_t)}{\partial \theta(1)}, \dfrac{\partial f(\vec{\theta}_t)}{\partial \theta(2)}, \ldots, \dfrac{\partial f(\vec{\theta}_t)}{\partial \theta(n)} \right)^T$

e.g., in two dimensions: $\vec{\theta}_t = (\theta_t(1), \theta_t(2))^T$

Iteratively move down the gradient:

$\vec{\theta}_{t+1} = \vec{\theta}_t - \alpha \nabla_{\vec{\theta}} f(\vec{\theta}_t)$
Gradient Descent Cont.
For the MSE given above and using the chain rule:

$\vec{\theta}_{t+1} = \vec{\theta}_t - \tfrac{1}{2}\alpha \nabla_{\vec{\theta}}\, MSE(\vec{\theta}_t)$
$= \vec{\theta}_t - \tfrac{1}{2}\alpha \nabla_{\vec{\theta}} \sum_{s \in S} P(s)\,[V^\pi(s) - V_t(s)]^2$
$= \vec{\theta}_t + \alpha \sum_{s \in S} P(s)\,[V^\pi(s) - V_t(s)]\, \nabla_{\vec{\theta}} V_t(s)$
Gradient Descent Cont.
Use just the sample gradient instead:

$\vec{\theta}_{t+1} = \vec{\theta}_t - \tfrac{1}{2}\alpha \nabla_{\vec{\theta}}\, [V^\pi(s_t) - V_t(s_t)]^2$
$= \vec{\theta}_t + \alpha\, [V^\pi(s_t) - V_t(s_t)]\, \nabla_{\vec{\theta}} V_t(s_t)$

Since each sample gradient is an unbiased estimate of the true gradient, this converges to a local minimum of the MSE if α decreases appropriately with t:

$E\{[V^\pi(s_t) - V_t(s_t)]\, \nabla_{\vec{\theta}} V_t(s_t)\} = \sum_{s \in S} P(s)\,[V^\pi(s) - V_t(s)]\, \nabla_{\vec{\theta}} V_t(s)$
But We Don’t have these Targets
Suppose we just have targets $v_t$ instead:

$\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha\, [v_t - V_t(s_t)]\, \nabla_{\vec{\theta}} V_t(s_t)$

If each $v_t$ is an unbiased estimate of $V^\pi(s_t)$, i.e., $E\{v_t\} = V^\pi(s_t)$, then gradient descent converges to a local minimum (provided α decreases appropriately).

e.g., the Monte Carlo target $v_t = R_t$:

$\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha\, [R_t - V_t(s_t)]\, \nabla_{\vec{\theta}} V_t(s_t)$
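A minimal sketch of gradient-descent prediction with Monte Carlo targets, assuming a linear approximator and a hypothetical `generate_episode(policy)` that yields (feature vector, return) pairs for the visited states:

```python
import numpy as np

def gradient_mc_prediction(generate_episode, policy, n_features,
                           n_episodes=1000, alpha=0.01):
    """Gradient-descent value prediction with Monte Carlo targets v_t = R_t.

    generate_episode(policy) is assumed to yield, for each visited state,
    a pair (phi_t, R_t): the state's feature vector and the return that
    followed it.  For a linear approximator V_t(s) = theta . phi_s,
    grad_theta V_t(s) is just phi_s.
    """
    theta = np.zeros(n_features)
    for _ in range(n_episodes):
        for phi_t, R_t in generate_episode(policy):
            v_t = theta @ phi_t                   # current estimate V_t(s_t)
            theta += alpha * (R_t - v_t) * phi_t  # theta += alpha [R_t - V_t(s_t)] grad V_t(s_t)
    return theta
```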
What about TD(λ) Targets?

$\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha\, [R_t^\lambda - V_t(s_t)]\, \nabla_{\vec{\theta}} V_t(s_t)$

Not unbiased for λ < 1.

But we do it anyway, using the backwards view:

$\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha\, \delta_t\, \vec{e}_t$,

where:

$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$, as usual, and
$\vec{e}_t = \gamma\lambda\, \vec{e}_{t-1} + \nabla_{\vec{\theta}} V_t(s_t)$
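A minimal sketch of this backward-view update for a linear approximator; the environment and policy interfaces (`env.reset`, `env.step`, `policy`, `features`) are hypothetical placeholders:

```python
import numpy as np

def gradient_td_lambda(env, policy, features, n_features, n_episodes=500,
                       alpha=0.01, gamma=0.99, lam=0.8):
    """Backward-view gradient-descent TD(lambda) for a linear V (a sketch).

    Assumed (hypothetical) interfaces:
      env.reset() -> initial state; env.step(a) -> (next_state, reward, done)
      policy(s) -> action; features(s) -> feature vector, so grad V = features(s).
    """
    theta = np.zeros(n_features)
    for _ in range(n_episodes):
        s = env.reset()
        e = np.zeros(n_features)                  # eligibility trace e_t
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            v = theta @ features(s)
            v_next = 0.0 if done else theta @ features(s_next)
            delta = r + gamma * v_next - v        # delta_t = r_{t+1} + gamma V(s_{t+1}) - V(s_t)
            e = gamma * lam * e + features(s)     # e_t = gamma*lambda*e_{t-1} + grad V_t(s_t)
            theta += alpha * delta * e            # theta_{t+1} = theta_t + alpha delta_t e_t
            s = s_next
    return theta
```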
On-Line Gradient-Descent TD(λ)
Linear Methods
Represent states as feature vectors; for each $s \in S$:

$\vec{\phi}_s = (\phi_s(1), \phi_s(2), \ldots, \phi_s(n))^T$

$V_t(s) = \vec{\theta}_t^{\,T} \vec{\phi}_s = \sum_{i=1}^{n} \theta_t(i)\, \phi_s(i)$

$\nabla_{\vec{\theta}} V_t(s) = \;?$
Nice Properties of Linear FA Methods
The gradient is very simple: $\nabla_{\vec{\theta}} V_t(s) = \vec{\phi}_s$
For MSE, the error surface is simple: a quadratic surface with a single minimum.
Linear gradient-descent TD(λ) converges (Tsitsiklis & Van Roy, 1997), provided:
– the step size decreases appropriately
– states are sampled on-line from the on-policy distribution
It converges to a parameter vector $\vec{\theta}_\infty$ with the property

$MSE(\vec{\theta}_\infty) \le \dfrac{1 - \gamma\lambda}{1 - \gamma}\, MSE(\vec{\theta}^{\,*})$

where $\vec{\theta}^{\,*}$ is the best parameter vector.
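A tiny, self-contained check (illustrative numbers only, not from the lecture) that for a linear approximator the gradient with respect to the parameters is just the feature vector:

```python
import numpy as np

theta = np.array([0.5, -1.0, 2.0])          # illustrative parameters
phi_s = np.array([1.0, 0.0, 3.0])           # illustrative feature vector for some state s

V = lambda th: th @ phi_s                    # linear estimate V_t(s) = theta^T phi_s

grad_analytic = phi_s                        # analytic gradient: grad_theta V_t(s) = phi_s

# Finite-difference check of the same gradient
eps = 1e-6
grad_fd = np.array([(V(theta + eps * np.eye(3)[i]) - V(theta)) / eps for i in range(3)])

print(np.allclose(grad_analytic, grad_fd))   # True
```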
Coarse Coding
Learning and Coarse Coding
Tile Coding
– Binary feature for each tile
– Number of features present at any one time is constant
– Binary features mean the weighted sum is easy to compute
– Easy to compute the indices of the features present (a simplified sketch follows below)
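A simplified sketch of the idea, using uniform 2-D grid tilings offset from one another; the grid sizes and offsets here are illustrative assumptions, not the scheme from the slide's figure:

```python
import numpy as np

def tile_indices(state, n_tilings=8, tiles_per_dim=10, lows=(0.0, 0.0), highs=(1.0, 1.0)):
    """Return the indices of the active binary features for a 2-D state.

    Each of the n_tilings grids is offset by a different fraction of a tile
    width, so exactly n_tilings features are active for any state.
    """
    state = np.asarray(state, dtype=float)
    lows, highs = np.asarray(lows), np.asarray(highs)
    scaled = (state - lows) / (highs - lows) * tiles_per_dim   # position in "tile units"
    tiles_per_tiling = (tiles_per_dim + 1) ** 2                # +1 allows for the offset overhang
    active = []
    for k in range(n_tilings):
        offset = k / n_tilings                                 # shift each tiling slightly
        col, row = np.floor(scaled + offset).astype(int)
        active.append(k * tiles_per_tiling + row * (tiles_per_dim + 1) + col)
    return active                                              # constant number of active features

# Usage: the weighted sum V(s) is just a sum of the weights at the active indices, e.g.
# theta = np.zeros(8 * 11 * 11); V = sum(theta[i] for i in tile_indices((0.3, 0.7)))
```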
Tile Coding Cont.
Irregular tilings
Hashing
CMAC: "Cerebellar Model Arithmetic Computer" (Albus, 1971)
Radial Basis Functions (RBFs)
e.g., Gaussians:

$\phi_s(i) = \exp\left( -\dfrac{\| s - c_i \|^2}{2\sigma_i^2} \right)$
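A minimal sketch computing these Gaussian RBF features for a given state; the centers and widths below are illustrative assumptions:

```python
import numpy as np

def rbf_features(s, centers, sigma=0.2):
    """Gaussian RBF features: phi_s(i) = exp(-||s - c_i||^2 / (2 sigma_i^2)).

    centers is an (n, d) array of prototype states c_i; sigma may be a scalar
    (shared width) or an array of per-feature widths sigma_i.
    """
    s = np.asarray(s, dtype=float)
    dists_sq = np.sum((centers - s) ** 2, axis=1)        # ||s - c_i||^2 for every center
    return np.exp(-dists_sq / (2.0 * np.asarray(sigma) ** 2))

# Illustrative usage: a 1-D state space with centers spread over [0, 1].
centers = np.linspace(0.0, 1.0, 11).reshape(-1, 1)
phi = rbf_features([0.37], centers)                      # smooth, differentiable features
```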
Can you beat the “curse of dimensionality”?
Can you keep the number of features from going up exponentially with the dimension?
Function complexity, not dimensionality, is the problem.
Kanerva coding (a sketch follows below):
– Select a bunch of binary prototypes
– Use Hamming distance as the distance measure
– Dimensionality is no longer a problem, only complexity
"Lazy learning" schemes:
– Remember all the data
– To get a new value, find nearest neighbors and interpolate
– e.g., locally-weighted regression
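A minimal sketch of Kanerva coding under these assumptions: random binary prototypes and a fixed Hamming-distance threshold (both illustrative choices, not from the lecture):

```python
import numpy as np

def kanerva_features(state_bits, prototypes, threshold=3):
    """Kanerva coding (a sketch): binary features from Hamming distance.

    state_bits : binary vector describing the state
    prototypes : (n, d) array of randomly chosen binary prototypes
    Feature i is 1 if the state lies within `threshold` bits of prototype i.
    The number of features depends on the number of prototypes chosen,
    not on the dimensionality d of the state description.
    """
    hamming = np.sum(prototypes != np.asarray(state_bits), axis=1)
    return (hamming <= threshold).astype(float)

# Illustrative usage with 50 random prototypes over 16-bit state descriptions.
rng = np.random.default_rng(0)
prototypes = rng.integers(0, 2, size=(50, 16))
phi = kanerva_features(rng.integers(0, 2, size=16), prototypes)
```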
Control with FA
Learning state-action values. Training examples of the form:

$\{\text{description of } (s_t, a_t),\; v_t\}$

The general gradient-descent rule:

$\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha\, [v_t - Q_t(s_t, a_t)]\, \nabla_{\vec{\theta}} Q_t(s_t, a_t)$

Gradient-descent Sarsa(λ) (backward view):

$\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha\, \delta_t\, \vec{e}_t$,

where

$\delta_t = r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)$
$\vec{e}_t = \gamma\lambda\, \vec{e}_{t-1} + \nabla_{\vec{\theta}} Q_t(s_t, a_t)$
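A minimal linear, ε-greedy sketch of the backward view above; the `env.reset()`/`env.step()` interface and the `features(s, a)` helper are hypothetical assumptions:

```python
import numpy as np

def linear_sarsa_lambda(env, features, n_features, n_actions, n_episodes=500,
                        alpha=0.05, gamma=1.0, lam=0.9, epsilon=0.1):
    """Linear gradient-descent Sarsa(lambda), backward view (a sketch).

    Assumed (hypothetical) interfaces: env.reset(), env.step(a) -> (s', r, done),
    and features(s, a) -> feature vector for the pair (s, a), so that
    Q(s, a) = theta . features(s, a) and grad_theta Q(s, a) = features(s, a).
    """
    theta = np.zeros(n_features)

    def q(s, a):
        return theta @ features(s, a)

    def eps_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax([q(s, a) for a in range(n_actions)]))

    for _ in range(n_episodes):
        s = env.reset()
        a = eps_greedy(s)
        e = np.zeros(n_features)                       # eligibility trace
        done = False
        while not done:
            s_next, r, done = env.step(a)
            delta = r - q(s, a)                        # start of delta_t
            if not done:
                a_next = eps_greedy(s_next)
                delta += gamma * q(s_next, a_next)     # delta_t = r + gamma Q(s',a') - Q(s,a)
            e = gamma * lam * e + features(s, a)       # accumulate trace with grad Q(s,a)
            theta += alpha * delta * e                 # theta += alpha * delta_t * e_t
            if done:
                break
            s, a = s_next, a_next
    return theta
```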
Linear Gradient Descent Sarsa(λ)
Mountain-Car Task
Mountain-Car Results
Baird’s Counterexample
Baird’s Counterexample Cont.
Should We Bootstrap?
Summary
Generalization
Adapting supervised-learning function approximation methods
Gradient-descent methods
Linear gradient-descent methods:
– Radial basis functions
– Tile coding
– Kanerva coding
Nonlinear gradient-descent methods? Backpropagation?
Subtleties involving function approximation, bootstrapping, and the on-policy/off-policy distinction
The Overall Plan
Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback Markov decision processes
Lecture 2: Dynamic Programming Basic Monte Carlo methods Temporal Difference methods A unified perspective Connections to neuroscience
Lecture 3: Function approximation Model-based methods Dimensions of Reinforcement Learning
Lecture 3, Part 2: Model-Based Methods
Objectives of this part:
– Use of environment models
– Integration of planning and learning methods
Models
Model: anything the agent can use to predict how the environment will respond to its actions
Distribution model: description of all possibilities and their probabilities, e.g., $P_{ss'}^a$ and $R_{ss'}^a$ for all $s$, $s'$, and $a \in A(s)$
Sample model: produces sample experiences, e.g., a simulation model
Both types of models can be used to produce simulated experience
Often sample models are much easier to come by
Planning
Planning: any computational process that uses a model to create or improve a policy
Planning in AI:
– state-space planning
– plan-space planning (e.g., partial-order planner)
We take the following (unusual) view:
– all state-space planning methods involve computing value functions, either explicitly or implicitly
– they all apply backups to simulated experience
Planning Cont.
Classical DP methods are state-space planning methods
Heuristic search methods are state-space planning methods
A planning method based on Q-learning: Random-Sample One-Step Tabular Q-Planning (a sketch follows below)
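A minimal sketch of such a method; the `sample_model(s, a)` interface, which returns a sampled (next state, reward) pair, is an assumption standing in for whatever sample model is available:

```python
import random
from collections import defaultdict

def random_sample_q_planning(sample_model, states, actions, n_steps=100000,
                             alpha=0.1, gamma=0.95):
    """Random-sample one-step tabular Q-planning (a sketch).

    sample_model(s, a) is assumed to return a sampled (next_state, reward).
    """
    Q = defaultdict(float)                          # Q[(s, a)], initially zero
    for _ in range(n_steps):
        s = random.choice(states)                   # 1. pick a state-action pair at random
        a = random.choice(actions)
        s_next, r = sample_model(s, a)              # 2. query the sample model
        target = r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])   # 3. one-step tabular Q-learning backup
    return Q
```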
Learning, Planning, and Acting
Two uses of real experience:
– model learning: to improve the model
– direct RL: to directly improve the value function and policy
Improving value function and/or policy via a model is sometimes called indirect RL or model-based RL. Here, we call it planning.
Direct vs. Indirect RL
Indirect (model-based) methods:
– make fuller use of experience: get better policy with fewer environment interactions
Direct methods:
– simpler
– not affected by bad models
But they are very closely related and can be usefully combined:
planning, acting, model learning, and direct RL can occur simultaneously
and in parallel
The Dyna Architecture (Sutton 1990)
The Dyna-Q Algorithm
Its steps combine:
– direct RL
– model learning
– planning
(a tabular sketch follows below)
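The full algorithm box appears in the original slides as a figure; below is a minimal tabular sketch of the same idea, assuming a deterministic environment with a hypothetical `env.reset()`/`env.step()` interface:

```python
import random
from collections import defaultdict

def dyna_q(env, actions, n_episodes=50, n_planning=10,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Dyna-Q (a sketch): direct RL + model learning + planning."""
    Q = defaultdict(float)
    model = {}                                            # (s, a) -> (r, s'), deterministic model

    def greedy(s):
        return max(actions, key=lambda a: Q[(s, a)])

    def eps_greedy(s):
        return random.choice(actions) if random.random() < epsilon else greedy(s)

    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = eps_greedy(s)
            s_next, r, done = env.step(a)
            # direct RL: one-step Q-learning backup from real experience
            Q[(s, a)] += alpha * (r + gamma * Q[(s_next, greedy(s_next))] - Q[(s, a)])
            # model learning: remember what this action did
            model[(s, a)] = (r, s_next)
            # planning: n_planning backups from simulated experience
            for _ in range(n_planning):
                sp, ap = random.choice(list(model.keys()))
                rp, sp_next = model[(sp, ap)]
                Q[(sp, ap)] += alpha * (rp + gamma * Q[(sp_next, greedy(sp_next))] - Q[(sp, ap)])
            s = s_next
    return Q
```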
Dyna-Q on a Simple Maze
reward = 0 until the goal is reached, when it is 1
Dyna-Q Snapshots: Midway in 2nd Episode
Prioritized Sweeping
Which states or state-action pairs should be generated during planning?
Work backwards from states whose values have just changed (a sketch follows below):
– Maintain a queue of state-action pairs whose values would change a lot if backed up, prioritized by the size of the change
– When a new backup occurs, insert predecessors according to their priorities
– Always perform backups from first in queue
(Moore and Atkeson, 1993; Peng and Williams, 1993)
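A minimal tabular sketch of prioritized sweeping under the assumption of a deterministic learned model; the `model` and `predecessors` maps are hypothetical inputs built from prior experience:

```python
import heapq
import itertools
from collections import defaultdict

def prioritized_sweeping(model, predecessors, actions, start_pairs,
                         n_backups=10000, alpha=0.1, gamma=0.95, threshold=1e-4):
    """Prioritized sweeping over a learned deterministic model (a sketch).

    Assumed inputs (hypothetical placeholders):
      model[(s, a)]   = (r, s')          -- learned deterministic model
      predecessors[s] = set of (s_, a_)  -- pairs known to lead to state s
      start_pairs     = (s, a) pairs whose values have just changed
    """
    Q = defaultdict(float)
    pq, tie = [], itertools.count()            # max-priority queue via negated priorities

    def priority(s, a):
        r, s_next = model[(s, a)]
        return abs(r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])

    def push(s, a):
        p = priority(s, a)
        if p > threshold:                      # only queue pairs whose value would change a lot
            heapq.heappush(pq, (-p, next(tie), (s, a)))

    for s, a in start_pairs:
        push(s, a)

    for _ in range(n_backups):
        if not pq:
            break
        _, _, (s, a) = heapq.heappop(pq)       # always back up the first pair in the queue
        r, s_next = model[(s, a)]
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])
        for s_pred, a_pred in predecessors.get(s, ()):
            push(s_pred, a_pred)               # insert predecessors according to their priorities
    return Q
```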
Full and Sample (One-Step) Backups
Full vs. Sample Backups
b successor states, equally likely; initial error = 1;
assume all next states’ values are correct
Trajectory Sampling
Trajectory sampling: perform backups along simulated trajectories
This samples from the on-policy distribution
Advantages when function approximation is used
Focusing of computation: can cause vast uninteresting parts of the state space to be (usefully) ignored
(Figure: initial states, states reachable under optimal control, and irrelevant states)
Heuristic Search
Used for action selection, not for changing a value function (= heuristic evaluation function)
Backed-up values are computed, but typically discarded
Extension of the idea of a greedy policy — only deeper
Also suggests ways to select states to back up: smart focusing
Summary
Emphasized close relationship between planning and learning
Important distinction between distribution models and sample models
Looked at some ways to integrate planning and learning:
– synergy among planning, acting, model learning
Distribution of backups: focus of the computation
– trajectory sampling: backup along trajectories
– prioritized sweeping
– heuristic search
Size of backups: full vs. sample; deep vs. shallow
The Overall Plan
Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback Markov decision processes
Lecture 2: Dynamic Programming Basic Monte Carlo methods Temporal Difference methods A unified perspective Connections to neuroscience
Lecture 3: Function approximation Model-based methods Dimensions of Reinforcement Learning
Lecture 3, Part 3: Dimensions of Reinforcement Learning

Objectives of this part:
– Review the treatment of RL taken in this course
– What have we left out?
– What are the hot research areas?
Three Common Ideas
– Estimation of value functions
– Backing up values along real or simulated trajectories
– Generalized Policy Iteration: maintain an approximate optimal value function and an approximate optimal policy, and use each to improve the other
Backup Dimensions
Other Dimensions
Function approximation:
– tables
– aggregation
– other linear methods
– many nonlinear methods
On-policy/Off-policy:
– On-policy: learn the value function of the policy being followed
– Off-policy: try to learn the value function of the best policy, irrespective of what policy is being followed
Still More Dimensions
Definition of return: episodic, continuing, discounted, etc.
Action values vs. state values vs. afterstate values
Action selection/exploration: ε-greedy, softmax, more sophisticated methods
Synchronous vs. asynchronous
Replacing vs. accumulating traces
Real vs. simulated experience
Location of backups (search control)
Timing of backups: part of selecting actions or only afterward?
Memory for backups: how long should backed-up values be retained?
Frontier Dimensions
Prove convergence for bootstrapping control methods
Trajectory sampling
Non-Markov case:
– Partially Observable MDPs (POMDPs)
– Bayesian approach: belief states
– construct state from a sequence of observations
– Try to do the best you can with non-Markov states
Modularity and hierarchies:
– Learning and planning at several different levels
– Theory of options
More Frontier Dimensions
Using more structure:
– factored state spaces: dynamic Bayes nets
– factored action spaces
Still More Frontier Dimensions
Incorporating prior knowledge:
– advice and hints
– trainers and teachers
– shaping
– Lyapunov functions
– etc.