
An Introduction to COMPUTATIONAL REINFORCEMENT LEARNING

Andrew G. Barto
Department of Computer Science

University of Massachusetts – Amherst

Lecture 3

Autonomous Learning Laboratory – Department of Computer Science


A. G. Barto, Barcelona Lectures, April 2006. Based on R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.

The Overall Plan

Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback; Markov decision processes

Lecture 2: Dynamic Programming; basic Monte Carlo methods; Temporal Difference methods; a unified perspective; connections to neuroscience

Lecture 3: Function approximation; model-based methods; dimensions of Reinforcement Learning



Lecture 3, Part 1: Generalization and Function Approximation

Objectives of this part:

Look at how experience with a limited part of the state set can be used to produce good behavior over a much larger part

Overview of function approximation (FA) methods and how they can be adapted to RL


Value Prediction with FA

As usual: Policy Evaluation (the prediction problem): for a given policy $\pi$, compute the state-value function $V^\pi$.

In earlier chapters, value functions were stored in lookup tables. Here, the value function estimate at time $t$, $V_t$, depends on a parameter vector $\vec{\theta}_t$, and only the parameter vector is updated.

e.g., $\vec{\theta}_t$ could be the vector of connection weights of a neural network.


Adapt Supervised Learning Algorithms

Training Info = desired (target) outputs

Inputs → Supervised Learning System → Outputs

Error = (target output – actual output)

Training example = {input, target output}


Backups as Training Examples

e.g., the TD(0) backup:

$$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$$

As a training example:

$$\left\{ \text{description of } s_t,\;\; r_{t+1} + \gamma V(s_{t+1}) \right\}$$

where the first element is the input and the second is the target output.
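To make the correspondence concrete, here is a minimal sketch of turning one observed transition into a supervised (input, target) pair; the `features` function and the value of `gamma` are assumptions made for the example, not part of the lecture:

```python
def td0_training_example(s_t, r_tp1, s_tp1, V, features, gamma=0.9):
    """Convert one transition (s_t, r_{t+1}, s_{t+1}) into a supervised
    training pair: input = description of s_t,
    target output = r_{t+1} + gamma * V(s_{t+1})."""
    target = r_tp1 + gamma * V(s_tp1)   # TD(0) target
    return features(s_t), target
```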


Any FA Method?

In principle, yes: artificial neural networks, decision trees, multivariate regression methods, etc.

But RL has some special requirements: usually want to learn while interacting; ability to handle nonstationarity; other?


Gradient Descent Methods

$$\vec{\theta}_t = \left( \theta_t(1), \theta_t(2), \ldots, \theta_t(n) \right)^T$$

Assume $V_t$ is a (sufficiently smooth) differentiable function of $\vec{\theta}_t$, for all $s \in S$.

Assume, for now, training examples of this form:

$$\left\{ \text{description of } s_t,\; V^\pi(s_t) \right\}$$


Performance Measures

Many are applicable, but a common and simple one is the mean-squared error (MSE) over a distribution $P$:

$$\mathrm{MSE}(\vec{\theta}_t) = \sum_{s \in S} P(s) \left[ V^\pi(s) - V_t(s) \right]^2$$

Why $P$? Why minimize MSE? Let us assume that $P$ is always the distribution of states with which backups are done.

The on-policy distribution: the distribution created while following the policy being evaluated. Stronger results are available for this distribution.


Gradient Descent

Let $f$ be any function of the parameter space. Its gradient at any point $\vec{\theta}_t$ in this space is:

$$\nabla_{\vec{\theta}} f(\vec{\theta}_t) = \left( \frac{\partial f(\vec{\theta}_t)}{\partial \theta(1)}, \frac{\partial f(\vec{\theta}_t)}{\partial \theta(2)}, \ldots, \frac{\partial f(\vec{\theta}_t)}{\partial \theta(n)} \right)^T$$

Iteratively move down the gradient:

$$\vec{\theta}_{t+1} = \vec{\theta}_t - \alpha \nabla_{\vec{\theta}} f(\vec{\theta}_t)$$


Gradient Descent Cont.

For the MSE given above and using the chain rule:

$$\begin{aligned}
\vec{\theta}_{t+1} &= \vec{\theta}_t - \tfrac{1}{2} \alpha \nabla_{\vec{\theta}} \mathrm{MSE}(\vec{\theta}_t) \\
&= \vec{\theta}_t - \tfrac{1}{2} \alpha \nabla_{\vec{\theta}} \sum_{s \in S} P(s) \left[ V^\pi(s) - V_t(s) \right]^2 \\
&= \vec{\theta}_t + \alpha \sum_{s \in S} P(s) \left[ V^\pi(s) - V_t(s) \right] \nabla_{\vec{\theta}} V_t(s)
\end{aligned}$$


Gradient Descent Cont.

Use just the sample gradient instead:

$$\begin{aligned}
\vec{\theta}_{t+1} &= \vec{\theta}_t - \tfrac{1}{2} \alpha \nabla_{\vec{\theta}} \left[ V^\pi(s_t) - V_t(s_t) \right]^2 \\
&= \vec{\theta}_t + \alpha \left[ V^\pi(s_t) - V_t(s_t) \right] \nabla_{\vec{\theta}} V_t(s_t)
\end{aligned}$$

Since each sample gradient is an unbiased estimate of the true gradient, this converges to a local minimum of the MSE if $\alpha$ decreases appropriately with $t$:

$$E\left\{ \left[ V^\pi(s_t) - V_t(s_t) \right] \nabla_{\vec{\theta}} V_t(s_t) \right\} = \sum_{s \in S} P(s) \left[ V^\pi(s) - V_t(s) \right] \nabla_{\vec{\theta}} V_t(s)$$


But We Don’t have these Targets

Suppose we just have targets $v_t$ instead:

$$\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha \left[ v_t - V_t(s_t) \right] \nabla_{\vec{\theta}} V_t(s_t)$$

If each $v_t$ is an unbiased estimate of $V^\pi(s_t)$, i.e., $E\{v_t\} = V^\pi(s_t)$, then gradient descent converges to a local minimum (provided $\alpha$ decreases appropriately).

e.g., the Monte Carlo target $v_t = R_t$ (the return):

$$\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha \left[ R_t - V_t(s_t) \right] \nabla_{\vec{\theta}} V_t(s_t)$$
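For illustration, the following is a minimal sketch of this update with Monte Carlo targets and a linear approximator (linear methods are introduced a few slides below); the episode format, the `feature_vec` function, and the step size are assumptions made for the example:

```python
import numpy as np

def gradient_mc_prediction(episodes, feature_vec, n_features,
                           alpha=0.01, gamma=1.0):
    """Gradient-descent value prediction with Monte Carlo targets (linear FA).

    episodes: iterable of trajectories, each a list of (state, reward) pairs,
              where reward is the reward received after leaving that state.
    feature_vec: maps a state to a length-n_features numpy array.
    """
    theta = np.zeros(n_features)
    for episode in episodes:
        # Backward pass: compute the return R_t for every time step.
        G, returns = 0.0, []
        for _, r in reversed(episode):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        # Gradient step toward each Monte Carlo target R_t.
        for (s, _), R in zip(episode, returns):
            phi = feature_vec(s)
            V = theta @ phi                  # V_t(s) for linear FA
            theta += alpha * (R - V) * phi   # grad_theta V_t(s) = phi for linear FA
    return theta
```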


What about TD(λ) Targets?

$$\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha \left[ R_t^\lambda - V_t(s_t) \right] \nabla_{\vec{\theta}} V_t(s_t)$$

Not unbiased for $\lambda < 1$. But we do it anyway, using the backward view:

$$\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha \, \delta_t \, \vec{e}_t$$

where:

$$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t), \text{ as usual, and } \quad \vec{e}_t = \gamma \lambda \, \vec{e}_{t-1} + \nabla_{\vec{\theta}} V_t(s_t)$$


On-Line Gradient-Descent TD(λ)

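The algorithm box shown on this slide did not survive the transcript. Below is a minimal sketch of on-line gradient-descent TD(λ) with accumulating traces, assuming a linear approximator and a hypothetical `env.reset()` / `env.step(a)` / `policy(s)` interface:

```python
import numpy as np

def online_td_lambda(env, policy, feature_vec, n_features,
                     n_episodes=100, alpha=0.1, gamma=0.99, lam=0.8):
    """On-line gradient-descent TD(lambda) for policy evaluation
    (backward view, accumulating eligibility traces, linear FA)."""
    theta = np.zeros(n_features)
    for _ in range(n_episodes):
        s = env.reset()
        e = np.zeros(n_features)               # eligibility trace vector
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            phi = feature_vec(s)
            v = theta @ phi
            v_next = 0.0 if done else theta @ feature_vec(s_next)
            delta = r + gamma * v_next - v     # TD error delta_t
            e = gamma * lam * e + phi          # e_t = gamma*lambda*e_{t-1} + grad V_t(s_t)
            theta += alpha * delta * e         # theta_{t+1} = theta_t + alpha*delta_t*e_t
            s = s_next
    return theta
```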


Linear Methods

Represent states as feature vectors: for each $s \in S$:

$$\vec{\phi}_s = \left( \phi_s(1), \phi_s(2), \ldots, \phi_s(n) \right)^T$$

$$V_t(s) = \vec{\theta}_t^{\,T} \vec{\phi}_s = \sum_{i=1}^{n} \theta_t(i) \, \phi_s(i)$$

$$\nabla_{\vec{\theta}} V_t(s) = \;?$$
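A tiny numeric illustration of the linear form (the parameter and feature values below are made up for the example):

```python
import numpy as np

theta_t = np.array([0.5, -0.2, 1.0])   # parameter vector (illustrative values)
phi_s   = np.array([1.0,  0.0, 2.0])   # feature vector of some state s

V_t_s = theta_t @ phi_s                 # V_t(s) = theta_t^T phi_s = 2.5
```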


Nice Properties of Linear FA Methods

The gradient is very simple: $\nabla_{\vec{\theta}} V_t(s) = \vec{\phi}_s$

For MSE, the error surface is simple: a quadratic surface with a single minimum.

Linear gradient-descent TD(λ) converges (if the step size decreases appropriately, and with on-line sampling, i.e., states sampled from the on-policy distribution) to a parameter vector $\vec{\theta}_\infty$ with the property:

$$\mathrm{MSE}(\vec{\theta}_\infty) \le \frac{1 - \gamma\lambda}{1 - \gamma} \, \mathrm{MSE}(\vec{\theta}^*)$$

where $\vec{\theta}^*$ is the best parameter vector (Tsitsiklis & Van Roy, 1997).


Coarse Coding


Learning and Coarse Coding



Tile Coding

Binary feature for each tile

Number of features present at any one time is constant

Binary features mean the weighted sum is easy to compute

Easy to compute the indices of the features present

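As a concrete illustration of these points (not the CMAC implementation referenced on the next slide), here is a minimal sketch of tile coding for a two-dimensional state; the number of tilings, tiles per dimension, and offsets are arbitrary choices for the example:

```python
def tile_indices(x, y, n_tilings=4, tiles_per_dim=8, low=0.0, high=1.0):
    """Return the index of the one active tile in each tiling for state (x, y).

    Each tiling partitions [low, high) x [low, high) into tiles_per_dim^2
    tiles and is shifted by a fraction of a tile width, so nearby states
    share many (but not all) active tiles."""
    width = (high - low) / tiles_per_dim
    active = []
    for k in range(n_tilings):
        offset = k * width / n_tilings                        # per-tiling shift
        ix = int((x - low + offset) / width) % tiles_per_dim
        iy = int((y - low + offset) / width) % tiles_per_dim
        # Flatten (tiling, ix, iy) into a single binary-feature index.
        active.append(k * tiles_per_dim ** 2 + ix * tiles_per_dim + iy)
    return active

# With binary features, the approximate value is just the sum of the weights
# of the active tiles: V(s) = sum(theta[i] for i in tile_indices(x, y)).
```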


Tile Coding Cont.

Irregular tilings

Hashing

CMAC: "Cerebellar Model Arithmetic Computer" (Albus, 1971)


Radial Basis Functions (RBFs)

e.g., Gaussians:

$$\phi_s(i) = \exp\left( - \frac{\lVert s - c_i \rVert^2}{2 \sigma_i^2} \right)$$

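A minimal sketch of computing such Gaussian RBF features for a state vector; the centers and widths below are illustrative assumptions:

```python
import numpy as np

def rbf_features(s, centers, sigmas):
    """Gaussian RBF features: phi_s(i) = exp(-||s - c_i||^2 / (2*sigma_i^2))."""
    s = np.asarray(s, dtype=float)
    diffs = centers - s                        # shape: (n_features, state_dim)
    sq_dists = np.sum(diffs ** 2, axis=1)
    return np.exp(-sq_dists / (2.0 * sigmas ** 2))

# Example: 4 centers on a line, all with width 0.5 (arbitrary values).
centers = np.array([[0.0], [0.33], [0.66], [1.0]])
sigmas = np.full(4, 0.5)
phi = rbf_features([0.4], centers, sigmas)     # graded (non-binary) features
```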


Can you beat the “curse of dimensionality”?

Can you keep the number of features from going up exponentially with the dimension?

Function complexity, not dimensionality, is the problem.

Kanerva coding: select a bunch of binary prototypes; use Hamming distance as the distance measure; dimensionality is no longer a problem, only complexity.

"Lazy learning" schemes: remember all the data; to get a new value, find nearest neighbors and interpolate; e.g., locally weighted regression.


Control with FA

Learning state-action values

Training examples of the form:

$$\left\{ \text{description of } (s_t, a_t),\; v_t \right\}$$

The general gradient-descent rule:

$$\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha \left[ v_t - Q_t(s_t, a_t) \right] \nabla_{\vec{\theta}} Q_t(s_t, a_t)$$

Gradient-descent Sarsa(λ) (backward view):

$$\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha \, \delta_t \, \vec{e}_t$$

where

$$\delta_t = r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t) \quad\text{and}\quad \vec{e}_t = \gamma \lambda \, \vec{e}_{t-1} + \nabla_{\vec{\theta}} Q_t(s_t, a_t)$$


Linear Gradient Descent Sarsa(λ)

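The boxed algorithm from this slide is missing from the transcript; here is a minimal sketch of linear gradient-descent Sarsa(λ) with ε-greedy action selection, assuming a hypothetical `env.reset()` / `env.step(a)` interface and a `feature_vec(s, a)` function:

```python
import numpy as np

def sarsa_lambda_linear(env, feature_vec, n_features, n_actions,
                        n_episodes=200, alpha=0.1, gamma=1.0,
                        lam=0.9, epsilon=0.1):
    """Gradient-descent Sarsa(lambda), backward view, linear FA."""
    theta = np.zeros(n_features)

    def q(s, a):
        return theta @ feature_vec(s, a)

    def select_action(s):
        if np.random.rand() < epsilon:                  # explore
            return np.random.randint(n_actions)
        return int(np.argmax([q(s, a) for a in range(n_actions)]))

    for _ in range(n_episodes):
        s = env.reset()
        a = select_action(s)
        e = np.zeros(n_features)                        # eligibility traces
        done = False
        while not done:
            s_next, r, done = env.step(a)
            delta = r - q(s, a)
            e = gamma * lam * e + feature_vec(s, a)     # accumulate traces
            if not done:
                a_next = select_action(s_next)
                delta += gamma * q(s_next, a_next)
                theta += alpha * delta * e
                s, a = s_next, a_next
            else:
                theta += alpha * delta * e
    return theta
```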


Mountain-Car Task



Mountain-Car Results



Baird’s Counterexample



Baird’s Counterexample Cont.



Should We Bootstrap?



Summary

Generalization

Adapting supervised-learning function approximation methods

Gradient-descent methods

Linear gradient-descent methods: radial basis functions, tile coding, Kanerva coding

Nonlinear gradient-descent methods? Backpropagation?

Subtleties involving function approximation, bootstrapping, and the on-policy/off-policy distinction


The Overall Plan

Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback; Markov decision processes

Lecture 2: Dynamic Programming; basic Monte Carlo methods; Temporal Difference methods; a unified perspective; connections to neuroscience

Lecture 3: Function approximation; model-based methods; dimensions of Reinforcement Learning


Lecture 3, Part 2: Model-Based Methods

Objectives of this part:

Use of environment models

Integration of planning and learning methods


Models

Model: anything the agent can use to predict how the environment will respond to its actions

Distribution model: description of all possibilities and their probabilities, e.g., $P_{ss'}^{a}$ and $R_{ss'}^{a}$ for all $s$, $s'$, and $a \in A(s)$

Sample model: produces sample experiences, e.g., a simulation model

Both types of models can be used to produce simulated experience

Often sample models are much easier to come by


Planning

Planning: any computational process that uses a model to create or improve a policy

Planning in AI: state-space planning; plan-space planning (e.g., partial-order planners)

We take the following (unusual) view: all state-space planning methods involve computing value functions, either explicitly or implicitly, and they all apply backups to simulated experience



Planning Cont.

Classical DP methods are state-space planning methods

Heuristic search methods are state-space planning methods

A planning method based on Q-learning: Random-Sample One-Step Tabular Q-Planning (sketched below)
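The slide's algorithm box is not reproduced in this transcript; below is a minimal sketch of random-sample one-step tabular Q-planning, assuming a hypothetical `sample_model(s, a)` that returns a sampled next state and reward:

```python
import random
from collections import defaultdict

def q_planning(sample_model, states, actions, n_updates=10000,
               alpha=0.1, gamma=0.95):
    """Random-sample one-step tabular Q-planning.

    Repeatedly: pick a state-action pair at random, ask the sample model
    for (s', r), and apply a one-step tabular Q-learning backup."""
    Q = defaultdict(float)                       # Q[(s, a)] -> value
    for _ in range(n_updates):
        s = random.choice(states)
        a = random.choice(actions)
        s_next, r = sample_model(s, a)           # simulated experience
        best_next = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q
```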


Learning, Planning, and Acting

Two uses of real experience:

model learning: to improve the model

direct RL: to directly improve the value function and policy

Improving the value function and/or policy via a model is sometimes called indirect RL or model-based RL. Here, we call it planning.



Direct vs. Indirect RL

Indirect (model-based) methods: make fuller use of experience; get a better policy with fewer environment interactions

Direct methods: simpler; not affected by bad models

But they are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can occur simultaneously and in parallel


The Dyna Architecture (Sutton 1990)



The Dyna-Q Algorithm

The algorithm interleaves direct RL, model learning, and planning steps (see the sketch below).
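The Dyna-Q pseudocode box itself did not survive the transcript; here is a minimal tabular sketch in the same spirit, assuming a deterministic environment model and a hypothetical `env` interface, with the number of planning steps per real step left as a free parameter:

```python
import random
from collections import defaultdict

def dyna_q(env, actions, n_episodes=50, n_planning_steps=10,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Dyna-Q: direct RL + model learning + planning."""
    Q = defaultdict(float)
    model = {}                                    # (s, a) -> (r, s')  (deterministic model)

    def epsilon_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(s)
            s_next, r, done = env.step(a)
            # (a) direct RL: one-step Q-learning on real experience
            best_next = 0.0 if done else max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            # (b) model learning: remember what this transition did
            model[(s, a)] = (r, s_next)
            # (c) planning: replay randomly chosen remembered transitions
            for _ in range(n_planning_steps):
                (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
                p_best = max(Q[(ps_next, b)] for b in actions)
                Q[(ps, pa)] += alpha * (pr + gamma * p_best - Q[(ps, pa)])
            s = s_next
    return Q
```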


Dyna-Q on a Simple Maze


rewards = 0 until the goal, when the reward = 1


Dyna-Q Snapshots: Midway in 2nd Episode



Prioritized Sweeping

Which states or state-action pairs should be generated during planning?

Work backwards from states whose values have just changed:

Maintain a queue of state-action pairs whose values would change a lot if backed up, prioritized by the size of the change

When a new backup occurs, insert predecessors according to their priorities

Always perform backups from the first pair in the queue (Moore and Atkeson, 1993; Peng and Williams, 1993). A sketch of the queue mechanics appears below.
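The following is a minimal sketch of that queue mechanism, using Python's `heapq` as a max-priority queue (via negated priorities) and assuming a deterministic `model` plus a `predecessors` lookup, both hypothetical names introduced for this example:

```python
import heapq
import itertools
from collections import defaultdict

def prioritized_sweeping(model, predecessors, actions, seed_pairs,
                         n_updates=10000, alpha=0.5, gamma=0.95, theta=1e-4):
    """Planning driven by a priority queue of state-action pairs.

    model: dict (s, a) -> (r, s')                 (deterministic, for simplicity)
    predecessors: dict s -> iterable of (s_pred, a_pred) pairs leading to s
    seed_pairs: (s, a) pairs used to seed the queue (e.g., from real experience)."""
    Q = defaultdict(float)
    pq, tie = [], itertools.count()               # max-queue via negated priorities

    def priority(s, a):
        r, s_next = model[(s, a)]
        best_next = max(Q[(s_next, b)] for b in actions)
        return abs(r + gamma * best_next - Q[(s, a)])     # size of potential change

    def push(s, a):
        p = priority(s, a)
        if p > theta:
            heapq.heappush(pq, (-p, next(tie), (s, a)))

    for s, a in seed_pairs:
        push(s, a)

    for _ in range(n_updates):
        if not pq:
            break
        _, _, (s, a) = heapq.heappop(pq)          # pair with the largest expected change
        r, s_next = model[(s, a)]
        best_next = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        # The value at s changed, so its predecessors may now need backups too.
        for s_pred, a_pred in predecessors.get(s, ()):
            push(s_pred, a_pred)
    return Q
```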


Full and Sample (One-Step) Backups



Full vs. Sample Backups

$b$ successor states, equally likely; initial error = 1; assume all next states' values are correct



Trajectory Sampling

Trajectory sampling: perform backups along simulated trajectories

This samples from the on-policy distribution

Advantages when function approximation is used

Focusing of computation: can cause vast uninteresting parts of the state space to be (usefully) ignored (e.g., irrelevant states, as opposed to the states reachable from the initial states under optimal control)


Heuristic Search

Used for action selection, not for changing a value function (= heuristic evaluation function)

Backed-up values are computed, but typically discarded

Extension of the idea of a greedy policy, only deeper

Also suggests ways to select states to back up: smart focusing



Summary

Emphasized the close relationship between planning and learning

Important distinction between distribution models and sample models

Looked at some ways to integrate planning and learning: synergy among planning, acting, and model learning

Distribution of backups, i.e., the focus of the computation: trajectory sampling (backups along trajectories); prioritized sweeping; heuristic search

Size of backups: full vs. sample; deep vs. shallow


The Overall Plan

Lecture 1: What is Computational Reinforcement Learning? Learning from evaluative feedback; Markov decision processes

Lecture 2: Dynamic Programming; basic Monte Carlo methods; Temporal Difference methods; a unified perspective; connections to neuroscience

Lecture 3: Function approximation; model-based methods; dimensions of Reinforcement Learning


Lecture 3, Part 3: Dimensions of Reinforcement Learning

Objectives of this part:

Review the treatment of RL taken in this course

What have we left out?

What are the hot research areas?


Three Common Ideas

Estimation of value functions

Backing up values along real or simulated trajectories

Generalized Policy Iteration: maintain an approximate optimal value function and an approximate optimal policy, and use each to improve the other


Backup Dimensions



Other Dimensions

Function approximation: tables; aggregation; other linear methods; many nonlinear methods

On-policy/Off-policy:

On-policy: learn the value function of the policy being followed

Off-policy: try to learn the value function of the best policy, irrespective of which policy is being followed


Still More Dimensions

Definition of return: episodic, continuing, discounted, etc.

Action values vs. state values vs. afterstate values

Action selection/exploration: ε-greedy, softmax, more sophisticated methods

Synchronous vs. asynchronous

Replacing vs. accumulating traces

Real vs. simulated experience

Location of backups (search control)

Timing of backups: part of selecting actions or only afterward?

Memory for backups: how long should backed-up values be retained?


Frontier Dimensions

Prove convergence for bootstrapping control methods

Trajectory sampling

Non-Markov case:

Partially Observable MDPs (POMDPs)

– Bayesian approach: belief states

– construct state from the sequence of observations

Try to do the best you can with non-Markov states

Modularity and hierarchies: learning and planning at several different levels

– Theory of options


More Frontier Dimensions

Using more structure:

factored state spaces: dynamic Bayes nets

factored action spaces


Still More Frontier Dimensions

Incorporating prior knowledge: advice and hints; trainers and teachers; shaping; Lyapunov functions; etc.