
Page 1

Autonomous Learning Laboratory – Department of Computer Science

Perspectives on Computational Reinforcement Learning

Andrew G. Barto

Autonomous Learning Laboratory, Department of Computer Science

University of Massachusetts Amherst

[email protected]

Searching in the Right Space

Page 2

Psychology

Artificial Intelligence(machine learning)

Control Theory andOperations Research

Artificial Neural Networks

Computational Reinforcement Learning (RL)

Neuroscience

Computational Reinforcement Learning

“Reinforcement learning (RL) bears a tortuous relationship with historical and contemporary ideas in classical and instrumental conditioning.” —Dayan 2001

Page 3

The Plan

High-level intro to RL
Part I: The personal odyssey
Part II: The modern view
Part III: Intrinsically Motivated RL

Page 4

The View from Machine Learning

Unsupervised Learning
• recode data based on some given principle

Supervised Learning
• "Learning from examples", "learning with a teacher"; related to Classical (or Pavlovian) Conditioning

Reinforcement Learning
• "Learning with a critic"; related to Instrumental (or Thorndikian) Conditioning

Page 5

Classical Conditioning

Tone(CS: Conditioned Stimulus)

Food(US: Unconditioned Stimulus)


Salivation(UR: Unconditioned Response)

Anticipatory salivation(CR: Conditioned Response)

Pavlov, 1927

Page 6

Edward L. Thorndike (1874-1949)

puzzle box

Learning by “Trial-and-Error”

Page 7

Trial-and-Error = Error Correction

Artificial Neural Network:

learns from a set of examples via error-correction

Page 8

“Least-Mean-Square” (LMS) Learning Rule

"delta rule", Adaline; Widrow and Hoff, 1960

[Figure: Adaline unit with inputs $x_1, \ldots, x_n$, weights $w_1, \ldots, w_n$, actual output $V$, desired output $z$, and a weight-adjustment mechanism]

$\Delta w_i = \alpha \, (z - V) \, x_i$
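A minimal sketch of the delta rule (LMS) update described above. The function name, step size, and the two-input target used in the example are illustrative choices, not values from the slides.

```python
import numpy as np

def lms_update(w, x, z, alpha=0.1):
    """One delta-rule (LMS) step: w_i += alpha * (z - V) * x_i,
    where V = w . x is the actual output and z is the desired output."""
    V = np.dot(w, x)                 # actual output of the linear unit
    return w + alpha * (z - V) * x   # error-correction step

# Example: learn a 2-input linear target by repeated presentation
rng = np.random.default_rng(0)
w = np.zeros(2)
for _ in range(500):
    x = rng.uniform(-1, 1, size=2)
    z = 2.0 * x[0] - 0.5 * x[1]      # desired output
    w = lms_update(w, x, z)
print(w)                             # approaches [2.0, -0.5]
```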

Page 9

Trial-and-Error?

“The boss continually seeks a better worker by trial and error experimentation with the structure of the worker. Adaptation is a multidimensional performance feedback process. The `error’ signal in the feedback control sense is the gradient of the mean square error with respect to the adjustment.”

Widrow and Hoff, “Adaptive Switching Circuits”

1960 IRE WESCON Convention Record

Page 10

MENACE Michie 1961

“Matchbox Educable Noughts and Crosses Engine”

[Figure: example noughts-and-crosses (tic-tac-toe) board positions]

Page 11

Essence of RL (for me at least!): Search + Memory

Search: Trial-and-Error, Generate-and-Test, Variation-and-Selection, . . .

Memory: remember what worked best for each situation and start from there next time

RL is about caching search results (so you don't have to keep searching!)

Page 12

Generate-and-Test

Generator should be smart:
• Generate lots of things that are likely to be good, based on prior knowledge and prior experience
• But also take chances …

Tester should be smart too:
• Evaluate based on real criteria, not convenient surrogates
• But be able to recognize partial success

Page 13

The Plan

High-level intro to RL
Part I: The personal odyssey
Part II: The modern view
Part III: Intrinsically Motivated RL

Page 14

Key Players

Harry Klopf Rich Sutton Me

Page 15

Arbib, Kilmer, and Spinelli

in Neural Mechanisms of Learning and Memory, Rosenzweig and Bennett, 1974

“Neural Models and Memory”

Page 16: Autonomous Learning Laboratory – Department of Computer Science Perspectives on Computational Reinforcement Learning Andrew G. Barto Autonomous Learning

Autonomous Learning Laboratory – Department of Computer ScienceAndrew Barto, Okinawa Computational Neuroscience Course, July 2005

A. Harry Klopf

“Brain Function and Adaptive Systems -- A Heterostatic Theory” Air Force Cambridge Research Laboratories Technical Report 3 March 1972

"…it is a theory which assumes that living adaptive systems seek, as their primary goal, a maximal condition (heterostasis), rather than assuming that the primary goal is a steady-state condition (homeostasis). It is further assumed that the heterostatic nature of animals, including man, derives from the heterostatic nature of neurons. The postulate that the neuron is a heterostat (that is, a maximizer) is a generalization of a more specific postulate, namely, that the neuron is a hedonist."

Page 17: Autonomous Learning Laboratory – Department of Computer Science Perspectives on Computational Reinforcement Learning Andrew G. Barto Autonomous Learning

Autonomous Learning Laboratory – Department of Computer ScienceAndrew Barto, Okinawa Computational Neuroscience Course, July 2005

Klopf’s theory (very briefly!)

Inspiration: The nervous system is a society of self-interested agents.
• Nervous Systems = Social Systems
• Neuron = Man
• Man = Hedonist
• Neuron = Hedonist
• Depolarization = Pleasure
• Hyperpolarization = Pain

A neuronal model:
• A neuron "decides" when to fire based on comparing a spatial and temporal summation of weighted inputs with a threshold.
• A neuron is in a condition of heterostasis from time t to t + Δt if it maximizes the amount of depolarization and minimizes the amount of hyperpolarization over this interval.
• Two ways to adapt weights to do this:
  • Push excitatory weights to upper limits; zero out inhibitory weights
  • Make the neuron control its input.

Page 18: Autonomous Learning Laboratory – Department of Computer Science Perspectives on Computational Reinforcement Learning Andrew G. Barto Autonomous Learning

Autonomous Learning Laboratory – Department of Computer ScienceAndrew Barto, Okinawa Computational Neuroscience Course, July 2005

Heterostatic Adaptation

When a neuron fires, all of its synapses that were active during the summation of potentials leading to the response become eligible to undergo changes in their transmittances.

The transmittance of an eligible excitatory synapse increases if the generation of an action potential is followed by further depolarization for a limited time after the response.

The transmittance of an eligible inhibitory synapse increases if the generation of an action potential is followed by further hyperpolarization for a limited time after the response.

Add a mechanism that prevents synapses that participate in the reinforcement from undergoing changes due to that reinforcement (“zerosetting”).

Page 19: Autonomous Learning Laboratory – Department of Computer Science Perspectives on Computational Reinforcement Learning Andrew G. Barto Autonomous Learning

Autonomous Learning Laboratory – Department of Computer ScienceAndrew Barto, Okinawa Computational Neuroscience Course, July 2005

Key Components of Klopf’s Theory

• Eligibility
• Closed-loop control by neurons
• Extremization (e.g., maximization) as the goal, instead of zeroing something
• "Generalized Reinforcement": reinforcement is not delivered by a specialized channel

The Hedonistic Neuron: A Theory of Memory, Learning, and Intelligence
A. Harry Klopf, Hemisphere Publishing Corporation, 1982

Page 20: Autonomous Learning Laboratory – Department of Computer Science Perspectives on Computational Reinforcement Learning Andrew G. Barto Autonomous Learning

Autonomous Learning Laboratory – Department of Computer ScienceAndrew Barto, Okinawa Computational Neuroscience Course, July 2005

Eligibility Traces

Klopf, 1972

[Figure: presynaptic input $x_t$, postsynaptic activity $y_t$, weight $w_t$, and the eligibility trace $\overline{xy}_t$ of the pre/post product]

$\Delta w_t = \alpha \, y_t \, \overline{xy}_t$

Optimal ISI: the same curve as the reinforcement-effectiveness curve in conditioning: max at 400 ms; 0 after approx 4 s.

[Figure note: a histogram of the lengths of feedback pathways in which the neuron is embedded]

Page 21: Autonomous Learning Laboratory – Department of Computer Science Perspectives on Computational Reinforcement Learning Andrew G. Barto Autonomous Learning

Autonomous Learning Laboratory – Department of Computer ScienceAndrew Barto, Okinawa Computational Neuroscience Course, July 2005

Later Simplified Eligibility Traces

[Figure: visits to state s over time, with the corresponding accumulating trace vs. replacing trace]
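A small sketch of the two simplified trace types, assuming a per-state dictionary of traces and a generic decay factor; names and values are illustrative, not from the slides.

```python
def update_traces(trace, visited_state, decay, kind="accumulating"):
    """Decay all traces, then bump the trace of the visited state.

    accumulating: e[s] += 1 on a visit (traces can exceed 1)
    replacing:    e[s]  = 1 on a visit (reset to 1)
    """
    for s in trace:
        trace[s] *= decay              # e.g., decay = gamma * lambda
    if kind == "accumulating":
        trace[visited_state] += 1.0
    else:                              # "replacing"
        trace[visited_state] = 1.0
    return trace

trace = {"s": 0.0, "other": 0.0}
for _ in range(3):                     # three quick revisits to state "s"
    trace = update_traces(trace, "s", decay=0.9, kind="accumulating")
print(trace["s"])                      # exceeds 1 with an accumulating trace
```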

Page 22: Autonomous Learning Laboratory – Department of Computer Science Perspectives on Computational Reinforcement Learning Andrew G. Barto Autonomous Learning

Autonomous Learning Laboratory – Department of Computer ScienceAndrew Barto, Okinawa Computational Neuroscience Course, July 2005

Rich Sutton

BA Psychology, Stanford, 1978

As an undergrad, discovered Klopf's 1972 tech report

Two unpublished undergraduate reports:
• "Learning Theory Support for a Single Channel Theory of the Brain" 1978
• "A Unified Theory of Expectation in Classical and Instrumental Conditioning" 1978 (?)

Rich's first paper:
• "Single Channel Theory: A Neuronal Theory of Learning", Brain Theory Newsletter, 1978.

Page 23: Autonomous Learning Laboratory – Department of Computer Science Perspectives on Computational Reinforcement Learning Andrew G. Barto Autonomous Learning

Autonomous Learning Laboratory – Department of Computer ScienceAndrew Barto, Okinawa Computational Neuroscience Course, July 2005

Sutton’s Theory

$A_j$: level of activation of mode j at time t
$V_{ij}$: sign and magnitude of the association from mode i to mode j at time t
$E_{ij}$: eligibility of $V_{ij}$ for undergoing changes at time t; proportional to the average of the product $A_i(t)A_j(t)$ over some small past time interval (or an average of the logical AND)
$P_j$: expected level of activation of mode j at time t (a prediction of the level of activation of mode j)
$C_{ij}$: a constant depending on the particular association being changed

$\frac{d}{dt} V_{ij} = C_{ij} \,(A_j - P_j)\, E_{ij}$

Page 24: Autonomous Learning Laboratory – Department of Computer Science Perspectives on Computational Reinforcement Learning Andrew G. Barto Autonomous Learning

Autonomous Learning Laboratory – Department of Computer ScienceAndrew Barto, Okinawa Computational Neuroscience Course, July 2005

What exactly is Pj?

Based on recent activation of the mode: The higher the activation within the last few seconds, the higher the level expected for the present . . .

Pj(t) is proportional to the average of the activation level over some small time interval (a few seconds or less) before t.

$\Delta w_t = \alpha \,(y_t - \bar{y}_t)\, \overline{xy}_t$

($x_t$: input; $w_t$: weight; $\overline{xy}_t$: eligibility trace of the pre/post product; $\bar{y}_t$: trace of recent activation, i.e., $P_j$)

Page 25: Autonomous Learning Laboratory – Department of Computer Science Perspectives on Computational Reinforcement Learning Andrew G. Barto Autonomous Learning

Autonomous Learning Laboratory – Department of Computer ScienceAndrew Barto, Okinawa Computational Neuroscience Course, July 2005

Sutton’s theory

Contingent Principle: based on the reinforcement a neuron receives after firings and the synapses involved in those firings, the neuron modifies its synapses so that they will cause it to fire when the firing causes an increase in the neuron's expected reinforcement after the firing.
• Basis of Instrumental, or Thorndikian, conditioning

Predictive Principle: if a synapse's activity predicts (frequently precedes) the arrival of reinforcement at the neuron, then that activity will come to have an effect on the neuron similar to that of reinforcement.
• Basis of Classical, or Pavlovian, conditioning

Page 26: Autonomous Learning Laboratory – Department of Computer Science Perspectives on Computational Reinforcement Learning Andrew G. Barto Autonomous Learning

Autonomous Learning Laboratory – Department of Computer ScienceAndrew Barto, Okinawa Computational Neuroscience Course, July 2005

Sutton’s Theory

Main addition to Klopf's theory: the difference term, i.e., a temporal difference term

Showed the relationship to the Rescorla-Wagner model (1972) of Classical Conditioning
• Blocking
• Overshadowing

Sutton’s model was a real-time model of both classical and instrumental conditioning

Emphasized conditioned reinforcement

Page 27

Rescorla Wagner Model, 1972

$\Delta V_A = \alpha\,(\lambda - V_\Sigma)$

• $\Delta V_A$: change in associative strength of CS A
• $\alpha$: parameter related to CS intensity
• $\lambda$: parameter related to US intensity
• $V_\Sigma$: sum of associative strengths of all CSs present ("composite expectation")

“Organisms only learn when events violate their expectations.”

A “trial-level” model
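A minimal trial-level implementation of the update above, plus a tiny blocking demonstration; the learning-rate and trial counts are arbitrary choices for illustration, not values from the slides.

```python
def rescorla_wagner(V, present, alpha=0.3, lam=1.0):
    """One trial: for each CS present, dV = alpha * (lam - V_sigma),
    where V_sigma sums the strengths of all CSs on the trial."""
    V_sigma = sum(V[cs] for cs in present)
    for cs in present:
        V[cs] += alpha * (lam - V_sigma)
    return V

V = {"A": 0.0, "B": 0.0}
for _ in range(50):            # Phase 1: A alone, paired with the US
    V = rescorla_wagner(V, ["A"])
for _ in range(50):            # Phase 2: AB compound, paired with the US
    V = rescorla_wagner(V, ["A", "B"])
print(V)                       # B stays near 0: blocked by the pretrained A
```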

Page 28

Conditioned Reinforcement

Stimuli associated with reinforcement take on reinforcing properties themselves

Follows immediately from the predictive principle: “By the predictive principle we propose that the neurons of the brain are learning to have predictors of stimuli have the same effect on them as the stimuli themselves” (Sutton, 1978)

“In principle this chaining can go back for any length …” (Sutton, 1978)

Equated Pavlovian conditioned reinforcement with instrumental higher-order conditioning

Page 29

Where was I coming from?

Studied at the University of Michigan: at the time a hotbed of genetic algorithm activity due to John Holland’s influence (PhD in 1975)

Holland talked a lot about the exploration/exploitation tradeoff

But I studied dynamic system theory, the relationship between state-space and input/output representations of systems, convolution and harmonic analysis, and finally cellular automata

Fascinated by how simple local rules can generate complex global behavior:
• Dynamic systems
• Cellular automata
• Self-organization
• Neural networks
• Evolution
• Learning

Page 30

Sutton and Barto, 1981

“Toward a Modern Theory of Adaptive Networks: Expectation and Prediction” Psych Review 88, 1981

Drew on Rich’s earlier work, but clarified the math and simplified the eligibility term to be non-contingent: just a trace of x instead of xy.

Emphasized the anticipatory nature of the CR

Related to "Adaptive System Theory":
• Other neural models (Hebb, Widrow & Hoff's LMS, Uttley's "Informon", Anderson's associative memory networks)
• Pointed out the relationship between the Rescorla-Wagner model and Adaline, i.e., the LMS algorithm
• Studied algorithm stability
• Reviewed possible neural mechanisms: e.g., eligibility = intracellular Ca ion concentrations

Page 31

“SB Model” of Classical Conditioning

$\Delta w_t = c\,(y_t - \bar{y}_t)\,\bar{x}_t$, with $\bar{y}_t = y_{t-1}$

$\bar{x}_{t+1} = \alpha \bar{x}_t + x_t$

($x_t$: stimulus input; $\bar{x}_t$: stimulus trace, the eligibility; $w_t$: associative strength; $y_t$: output; $\bar{y}_t$: output trace)

Page 32

Temporal Primacy Overrides Blocking in SB model

Kehoe, Schreurs, and Graham, 1987

our simulation

Page 33

Intratrial Time Courses (part 2 of blocking)

Page 34

Adaline Learning Rule

LMS rule, Widrow and Hoff, 1960

[Figure: Adaline unit; $x_t$: input pattern, $z_t$: target output, $w_t$: weights]

$y_t = w_t^{\mathsf{T}} x_t$
$\Delta w_t = \alpha\,[z_t - y_t]\,x_t$

Page 35

“Rescorla–Wagner Unit”

[Figure: Rescorla-Wagner unit; CS inputs $x_t$, US input $z_t$; the unit's output drives the CR, the US drives the UR]

$y_t = w_t^{\mathsf{T}} x_t$   ("composite expectation")
$\Delta w_t = \alpha\,[z_t - y_t]\,x_t$

$w_t$: vector of "associative strengths"

Page 36

Important Notes

The “target output” of LMS corresponds to the US input to Rescorla-Wagner model

In both cases, this input is specialized in that it does not directly activate the unit but only directs learning

The SB model is different, with the US input activating the unit and directing learning

Hence, SB model can do secondary reinforcement

SB model stayed with Klopf’s idea of “generalized reinforcement”

Page 37

One Neural Implementation of S-B Model

Page 38

A Major Problem: US offset

e.g., if a CS has the same time course as the US, the weights would change so that the US is cancelled out.

[Figure: time courses of the US, the CS, and the final result]

Why? Because the rule is trying to zero out $y_t - y_{t-1}$

Page 39

Associative Memory Networks

Kohonen et al. 1976, 1977; Anderson et al. 1977

Page 40

Associative Search Network Barto, Sutton, & Brouwer 1981

Input vector $X(t) = (x_1(t), \ldots, x_n(t))$, randomly chosen from the set $\mathcal{X} = \{X_1, X_2, \ldots, X_k\}$

Output vector in response to X(t): $Y(t) = (y_1(t), \ldots, y_m(t))$

For each $X_\alpha$ the payoff is a scalar $Z_\alpha(Y(t))$

X(t) is the context vector at time t

$y(t) = 1$ if $s(t) + \mathrm{NOISE}(t) > 0$, and $0$ otherwise, where $s(t) = \sum_{i=1}^{n} w_i(t)\, x_i(t)$

Page 41

Associative Search Network (Barto, Sutton, & Brouwer, 1981)

Problem of context transitions: add a predictor

$\Delta w_i(t) = c\,[z(t) - z(t-1)]\,[y(t-1) - y(t-2)]\,x_i(t-1)$

$\Delta w_i(t) = c\,[z(t) - p(t-1)]\,[y(t-1) - y(t-2)]\,x_i(t-1)$

$\Delta w^{p}_i(t) = c_p\,[z(t) - p(t-1)]\,x_i(t-1)$   ("one-step-ahead LMS predictor")

Page 42

Relation to Klopf/Sutton Theory

$\Delta w_i(t) = c\,[z(t) - p(t-1)]\,[y(t-1) - y(t-2)]\,x_i(t-1)$, i.e., eligibility $\approx \dot{y}\,x$

Did not include generalized reinforcement, since z(t) is a specialized reward input

Associative version of the ALOPEX algorithm of Harth & Tzanakou, and later Unnikrishnan

Page 43

Associative Search Network

Page 44

“Landmark Learning” Barto & Sutton 1981

An illustration of associative search

Page 45

“Landmark Learning”

Page 46

“Landmark Learning”

swap E and W landmarks

Page 47

Note: Diffuse Reward Signal

[Figure: units with inputs x1, x2, x3 and outputs y1, y2, y3, all receiving the same scalar reward]

Units can learn different things despite receiving identical inputs . . .

Page 48

Provided there is variability

ASN just used noisy units to introduce variability

Variability drives the search

Needs to have an element of "blindness", as in "blind variation": i.e., the outcome is not completely known beforehand

BUT does not have to be random

IMPORTANT POINT: blind variation does not have to be random, or dumb

Page 49

Pole Balancing

Widrow & Smith, 1964, "Pattern Recognizing Control Systems"

Michie & Chambers, 1968, "Boxes: An Experiment in Adaptive Control"

Barto, Sutton, & Anderson 1984

Page 50

MENACE Michie 1961

“Matchbox Educable Noughts and Crosses Engine”

[Figure: example noughts-and-crosses (tic-tac-toe) board positions]

Page 51

The “Boxes” Idea

“Although the construction of this Matchbox Educable Noughts and Crosses Engine (Michie 1961, 1963) was undertaken as a ‘fun project’, there was present a more serious intention to demonstrate the principle that it may be easier to learn to play many easy games than one difficult one. Consequently it may be advantageous to decompose a game into a number of mutually independent sub-games even if much relevant information is put out of reach in the process.”

Michie and Chambers, “Boxes: An Experiment in Adaptive Control”

Machine Intelligence 2, 1968

Page 52

Boxes

Page 53

Actor-Critic Architecture

ACE = adaptive critic element; ASE = associative search element

Page 54

The Actor

ASE: associative search element

$y(t) = +1$ if $s(t) + \mathrm{NOISE}(t) \ge 0$, and $-1$ otherwise

$\Delta w_i(t) = \alpha\, r(t)\, e_i(t)$, where the eligibility $e_i(t)$ is a trace of $y(t)\,x_i(t)$

Note: 1) moved from changes in evaluation to just r; 2) moved from $\dot{y}$ to just y in the eligibility.

Page 55

The Critic

ACE: adaptive critic element

$p(t) = \sum_{i=1}^{n} v_i(t)\, x_i(t)$

$\Delta v_i(t) = \beta\,[\,r(t) + \gamma\, p(t) - p(t-1)\,]\, e_i(t)$, where the eligibility $e_i(t)$ is a trace of $x_i(t)$

Note the differences with the SB model:
1) The reward has been pulled out of the weighted sum
2) Discount factor $\gamma$: the decay rate of predictions if not sustained by external reinforcement

Page 56

Putting them Together

[Figure: taking action y in state s (lower reward prediction p(s)) leads to state s′ (higher reward prediction p(s′)); this makes taking action y in state s more likely]

$\hat{r}(t) = r(t) + \gamma\, p(t) - p(t-1)$

internal reinforcement / effective reinforcement: the temporal-difference (TD) error $\delta(t)$
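A tabular sketch of an actor-critic update driven by the TD error defined above. The softmax action selection, the state/action encoding, and all step sizes are assumptions made for this example, not the pole-balancing features used on the slides.

```python
import numpy as np

def actor_critic_step(p, theta, s, a, r, s_next, alpha=0.1, beta=0.1, gamma=0.95):
    """Tabular actor-critic update driven by the TD error
    delta = r + gamma * p(s') - p(s)."""
    delta = r + gamma * p[s_next] - p[s]   # effective (internal) reinforcement
    p[s] += beta * delta                   # critic: improve the reward prediction
    theta[s, a] += alpha * delta           # actor: make action a in s more/less likely
    return delta

def softmax_action(theta, s, rng):
    prefs = theta[s] - theta[s].max()
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return rng.choice(len(probs), p=probs)

# Tiny usage example: 3 states, 2 actions, one sampled transition
rng = np.random.default_rng(0)
p, theta = np.zeros(3), np.zeros((3, 2))
a = softmax_action(theta, 0, rng)
actor_critic_step(p, theta, s=0, a=a, r=1.0, s_next=1)
```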

Page 57

Actor & Critic learning almost identical

[Figure: Actor and Adaptive Critic elements side by side; the Actor adds noise and outputs the action, the Critic forms the prediction p from the primary reward, and both adapt using the effective reinforcement; the Critic's eligibility e is a trace of presynaptic activity only, while the Actor's eligibility e is a trace of the pre- and postsynaptic correlation]

Page 58

“Credit Assignment Problem”

Spatial

Temporal

Getting useful training information to the right places at the right times

Marvin Minsky, 1961

Page 59

"Associative Reward-Penalty Element" (A_R-P), Barto & Anandan 1985

$\Delta w_i(t) = \rho\,[\,r(t)\,y(t) - E\{y(t) \mid s(t)\}\,]\,x_i(t)$ if $r(t) = +1$
$\Delta w_i(t) = \lambda\rho\,[\,r(t)\,y(t) - E\{y(t) \mid s(t)\}\,]\,x_i(t)$ if $r(t) = -1$

$y(t) = +1$ if $s(t) + \mathrm{NOISE}(t) > 0$, and $-1$ otherwise   (same as the ASE)

where $s(t) = \sum_{i=1}^{n} w_i(t)\,x_i(t)$, with $\rho > 0$ and $0 \le \lambda \le 1$

Page 60

$\Delta w_i(t) = \rho\,[\,r(t)\,y(t) - E\{y(t) \mid s(t)\}\,]\,x_i(t)$ if $r(t) = +1$; $\Delta w_i(t) = \lambda\rho\,[\,\cdot\,]\,x_i(t)$ if $r(t) = -1$   (A_R-P)

If $\lambda = 0$: the "Associative Reward-Inaction Element" A_R-I

Think of r(t)y(t) as the desired response

Stochastic version of Widrow et al.'s "Selective Bootstrap Element" [Widrow, Gupta, & Maitra, "Punish/Reward: Learning with a Critic in Adaptive Threshold Systems", 1973]

Associative generalization of L_R-P, a "stochastic learning automaton" algorithm (with roots in Tsetlin's work and in mathematical psychology, e.g., Bush & Mosteller, 1955)

Where we got the term “Critic”

Page 61

AR-P Convergence Theorem

Input patterns linearly independent

Each input has a nonzero probability of being presented on a trial

NOISE has a cumulative distribution that is strictly monotonically increasing (excludes the uniform distribution and the deterministic case)

ρ has to decrease as usual….

For all stochastic reward contingencies, as λ approaches 0, the probability of each correct action approaches 1.

BUT, it does not work when λ = 0.

Page 62

Contingency Space: 2 actions (two-armed bandit)

Explore/Exploit Dilemma

Page 63

Interesting follow up to AR-P theorem

Williams' REINFORCE class of algorithms (1987) generalizes A_R-I (i.e., λ = 0).

He showed that the weights change according to an unbiased estimate of the gradient of the reward function

BUT NOTE: this is the case for which our theorem isn’t true!

Recent “policy gradient” methods generalize REINFORCE algorithms
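A minimal REINFORCE-style sketch for a two-armed Bernoulli bandit, following the gradient of expected reward with respect to action preferences. The bandit probabilities, step size, and episode count are invented for illustration.

```python
import numpy as np

def reinforce_bandit(arm_probs, alpha=0.1, episodes=5000, seed=0):
    """REINFORCE on a Bernoulli bandit: the update alpha * r * d/dprefs log pi(a)
    is an unbiased estimate of the gradient of expected reward."""
    rng = np.random.default_rng(seed)
    prefs = np.zeros(len(arm_probs))
    for _ in range(episodes):
        probs = np.exp(prefs - prefs.max())
        probs /= probs.sum()                        # softmax policy
        a = rng.choice(len(probs), p=probs)
        r = float(rng.random() < arm_probs[a])      # stochastic reward
        grad_log = -probs
        grad_log[a] += 1.0                          # gradient of log pi(a)
        prefs += alpha * r * grad_log               # REINFORCE update
    return probs

print(reinforce_bandit([0.2, 0.8]))  # probability mass shifts to the better arm
```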

Page 64

Learning by Statistical Cooperation Barto 1985

Feedforward networks of AR-P units

Most reward achieved when the network implements the identity map

each unit has an (unshown) constant input

Page 65

Identity Network Results

λ = .04

Page 66

XOR Network

Most reward achieved when the network implements XOR

Page 67

XOR Network Behavior

Visible element

Hidden element

λ = .08

Page 68

Notes on AR-P Nets

None of these networks work with λ = 0. They almost always converge to a local maximum.

Elements face non-stationary reward contingencies; they have to converge for all contingencies, even hard ones.

Rumelhart, Hinton, & Williams published the backprop paper shortly after this (in 1986).

AR-P networks and backprop networks do pretty much the same thing, BUT backprop is much faster.

Barto & Jordan “Gradient Following without Backpropagation in Layered Networks” First IEEE conference on NNs, 1987.

Page 69

On the speeds of various layered network algorithms

Backprop: slow
Boltzmann Machine: glacial
Reinforcement Learning: don't ask!

My recollection of a talk by Geoffrey Hinton c. 1988

Page 70

“Credit Assignment Problem”

Spatial

Temporal

Getting useful training information to the right places at the right times

Marvin Minsky, 1961

Page 71

Teams of Learning Automata

Tsetlin, M. L. Automata Theory and Modeling of Biological Systems, Academic Press NY, 1973

e.g. the “Goore Game”

Real games were studied too…

Page 72

Neurons and Bacteria

Koshland’s (1980) model of bacterial tumbling

Barto (1989) "From Chemotaxis to Cooperativity: Abstract Exercises in Neuronal Learning Strategies" in The Computing Neuron, Durbin, Miall, & Mitchison (eds.), Addison-Wesley, Wokingham, England

Page 73

TD Model of Pavlovian Conditioning

The adaptive critic (slightly modified) as a model of Pavlovian conditioning

Sutton & Barto 1990

$p(t) = \sum_{i=1}^{n} v_i(t)\,x_i(t)$

$\Delta v_i(t) = \beta\,[\,\lambda(t) + \gamma\lfloor p(t)\rfloor - \lfloor p(t-1)\rfloor\,]\,e_i(t)$, where the eligibility $e_i(t)$ is a trace of $x_i(t)$

($\lfloor\cdot\rfloor$ denotes a "floor"; the US level $\lambda$ appears instead of r)

Page 74

TD Model

Predictions of what?

"imminence-weighted sum of future USs"

i.e. discounting

Page 75

TD Model

"Complete Serial Compound": the CS is represented by a "tapped delay line"
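One common reading of this representation, sketched in code: each CS is expanded into a tapped delay line, with one component active per time step since CS onset. The array layout and function name are assumptions for illustration.

```python
import numpy as np

def csc_features(cs_onset_step, t, num_taps):
    """Complete-serial-compound ("tapped delay line") stimulus representation:
    component i is 1 only at the i-th step after CS onset."""
    x = np.zeros(num_taps)
    lag = t - cs_onset_step
    if 0 <= lag < num_taps:
        x[lag] = 1.0
    return x

# A CS starting at step 2, shown over 6 time steps with 4 taps:
for t in range(6):
    print(t, csc_features(cs_onset_step=2, t=t, num_taps=4))
```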

Page 76

Summary of Part I

• Eligibility
• Neurons as closed-loop controllers
• Generalized reinforcement
• Prediction
• Real-time conditioning models
• Conditioned reinforcement
• Adaptive system/machine learning theory
• Stochastic search
• Associative Reinforcement Learning
• Teams of self-interested units

Page 77

Key Computational Issues

• Trial-and-error
• Error-correction
• Essence of RL (for me): search + memory
• Variability is essential
• Variability needs to be somewhat blind, but not dumb
• Smart generator; smart tester
• The "Boxes" idea: break up a large search into many small searches
• Prediction is important
• What to predict: total future reward
• Changes in prediction are useful local evaluations
• Credit assignment problems

Page 78

The Plan

High-level intro to RL
Part I: The personal odyssey
Part II: The modern view
Part III: Intrinsically Motivated RL

Page 79

Part II: The Modern View

Shift from animal learning to sequential decision problems: stochastic optimal control

Markov Decision Processes (MDPs)
Dynamic Programming (DP)
RL as approximate DP
Give up the neural models…

Page 80

Samuel’s Checkers Player 1959

[Figure: current board → evaluation function (value function) V → score, e.g., +20]

Page 81

Arthur L. Samuel

“. . . we are attempting to make the score, calculated for the current board position, look like that calculated for the terminal board positions of the chain of moves which most probably occur during actual play.”

Some Studies in Machine Learning Using the Game of Checkers, 1959

Page 82

TD-Gammon

Start with a random network

Play very many games against self

Learn a value function from this simulated experience

This produces (arguably) the best player in the world

Value = estimated prob. of winning

Tesauro, 1992–1995

STATES: configurations of the playing board (about $10^{20}$)
ACTIONS: moves
REWARDS: win: +1; lose: 0

Page 83

Sequential Decision Problems

Decisions are made in stages.

The outcome of each decision is not fully predictable but can be observed before the next decision is made.

The objective is to maximize a numerical measure of total reward over the entire sequence of stages: called the return.

Decisions cannot be viewed in isolation: need to balance the desire for immediate reward with the possibility of high reward in the future.

Page 84

The Agent-Environment Interface

Agent and environment interact at discrete time steps: $t = 0, 1, 2, \ldots$

Agent observes state at step t: $s_t \in S$
produces action at step t: $a_t \in A(s_t)$
gets resulting reward: $r_{t+1} \in \Re$
and resulting next state: $s_{t+1}$

$\ldots\; s_t,\, a_t \rightarrow r_{t+1},\, s_{t+1},\, a_{t+1} \rightarrow r_{t+2},\, s_{t+2},\, a_{t+2} \rightarrow r_{t+3},\, s_{t+3},\, a_{t+3}\; \ldots$
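The interaction loop written out as a sketch; the Environment and Agent classes here are stand-in stubs invented for the example, not an API from the slides.

```python
class Environment:
    """Stub environment: two states, episode ends after 10 steps."""
    def reset(self):
        self.t, self.s = 0, 0
        return self.s
    def step(self, a):
        self.t += 1
        self.s = (self.s + a) % 2
        reward = 1.0 if self.s == 1 else 0.0
        return reward, self.s, self.t >= 10   # r_{t+1}, s_{t+1}, done

class Agent:
    def act(self, s):
        return 1                              # a_t (here: a fixed policy)

env, agent = Environment(), Agent()
s, done, total = env.reset(), False, 0.0
while not done:                               # s_t, a_t -> r_{t+1}, s_{t+1}, ...
    a = agent.act(s)
    r, s, done = env.step(a)
    total += r
print(total)
```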

Page 85

Markov Decision Processes

If a reinforcement learning task has the Markov Property, it is basically a Markov Decision Process (MDP).

If state and action sets are finite, it is a finite MDP.

To define a finite MDP, you need to give:
• state and action sets
• one-step "dynamics" defined by transition probabilities:
  $P^a_{ss'} = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}$ for all $s, s' \in S$, $a \in A(s)$
• reward expectations:
  $R^a_{ss'} = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\}$ for all $s, s' \in S$, $a \in A(s)$
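One way to write down the one-step dynamics and expected rewards of a finite MDP as data. The particular two-state MDP and the nested-dictionary layout are assumptions made for illustration.

```python
# A toy finite MDP: S = {0, 1}, A(s) = {"stay", "go"}.
# P[s][a][s_next] = Pr{s_{t+1}=s_next | s_t=s, a_t=a}
# R[s][a][s_next] = E{r_{t+1} | s_t=s, a_t=a, s_{t+1}=s_next}
P = {
    0: {"stay": {0: 1.0}, "go": {1: 0.9, 0: 0.1}},
    1: {"stay": {1: 1.0}, "go": {0: 1.0}},
}
R = {
    0: {"stay": {0: 0.0}, "go": {1: 1.0, 0: 0.0}},
    1: {"stay": {1: 2.0}, "go": {0: 0.0}},
}

# Sanity check: every transition distribution sums to 1.
for s, actions in P.items():
    for a, dist in actions.items():
        assert abs(sum(dist.values()) - 1.0) < 1e-9, (s, a)
```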

Page 86

Elements of the MDP view

Policies
Return: e.g., discounted sum of future rewards
Value functions
Optimal value functions
Optimal policies
Greedy policies
Models: probability models, sample models
Backups
Etc.

Page 87

[Figure: backup diagram rooted at state s, branching over actions and successor states down to terminal states T]

$V(s) \leftarrow\; ?$

Backups

Page 88

Stochastic Dynamic Programming

[Figure: full backup from state s over all actions, rewards r, and successor states s′]

$V(s) \leftarrow \max_a E\,[\,r + \gamma V(\text{successor of } s \text{ under } a)\,]$

Needs a probability model to compute all the required expected values

Page 89

e.g., Value Iteration

A sweep = update the value of each state once using the max backup

Lookup-table storage of V

$V_0 \rightarrow V_1 \rightarrow \cdots \rightarrow V_k \rightarrow V_{k+1} \rightarrow \cdots \rightarrow V^{*}$
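A sketch of one value-iteration sweep with lookup-table storage; the tiny two-state MDP, its rewards, and the discount factor are made up for the example.

```python
def value_iteration_sweep(V, P, R, gamma=0.9):
    """One sweep: update each state's value once with the max backup
    V(s) <- max_a sum_{s'} P[s][a][s'] * (R[s][a][s'] + gamma * V[s'])."""
    return {
        s: max(
            sum(p * (R[s][a][s1] + gamma * V[s1]) for s1, p in dist.items())
            for a, dist in P[s].items()
        )
        for s in V
    }

# Tiny two-state example (transition probabilities and rewards are invented)
P = {0: {"go": {1: 1.0}, "stay": {0: 1.0}},
     1: {"go": {0: 1.0}, "stay": {1: 1.0}}}
R = {0: {"go": {1: 1.0}, "stay": {0: 0.0}},
     1: {"go": {0: 0.0}, "stay": {1: 2.0}}}
V = {0: 0.0, 1: 0.0}
for _ in range(200):                      # V0 -> V1 -> ... -> V*
    V = value_iteration_sweep(V, P, R)
print(V)                                  # staying in state 1 pays best
```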

Page 90

Dynamic Programming

Bellman 195?

"… it's impossible to use the word, dynamic, in a pejorative sense. Try thinking of some combination which will possibly give it a pejorative meaning. It's impossible. … It was something not even a Congressman could object to."

Bellman

Page 91

Stochastic Dynamic Programming

COMPUTATIONALLY COMPLEX:
• Multiple exhaustive sweeps
• Complex "backup" operation
• Complete storage of the evaluation function

NEEDS ACCURATE PROBABILITY MODEL

Page 92

Approximating Stochastic DP

AVOID EXHAUSTIVE SWEEPS OF THE STATE SET
• To which states should the backup operation be applied?

SIMPLIFY THE BACKUP OPERATION
• Can one avoid evaluating all possible next states in each backup operation?

REDUCE DEPENDENCE ON MODELS
• What if details of the process are unknown or hard to quantify?

COMPACTLY APPROXIMATE V
• Can one avoid explicitly storing all of V?

Page 93

Avoiding Exhaustive Sweeps

Generate multiple sample paths: in reality or with a simulation (sample) model

FOCUS backups around the sample paths
Accumulate the results in V

Page 94

Simplifying Backups

[Figure: backup diagram rooted at state s, with terminal states T]

$V(s) \leftarrow\; ?$

Page 95

Simple Monte Carlo

[Figure: a complete sample path from state s down to a terminal state T]

$V(s) \leftarrow (1-\alpha)\,V(s) + \alpha\,\mathrm{REWARD}(\text{path})$

• no probability model needed
• real or simulated experience
• relatively efficient on very large problems

Page 96

Temporal Difference Backup

[Figure: one sample transition from state s, via reward r, to next state s′]

$V(s) \leftarrow (1-\alpha)\,V(s) + \alpha\,[\,r + \gamma V(s')\,]$

• no probability model needed
• real or simulated experience
• incremental
• but less informative than a DP backup

Page 97

Rewrite this:

$V(s) \leftarrow (1-\alpha)\,V(s) + \alpha\,[\,r + \gamma V(s')\,]$

to get:

$V(s) \leftarrow V(s) + \alpha\,[\,r + \gamma V(s') - V(s)\,]$

The bracketed term is our familiar TD error.
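The TD backup in its error-correction form, as code. The five-state random-walk environment used in the example is an assumption for illustration, not one of the slides' tasks.

```python
import random

def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0):
    """V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)];
    the bracketed quantity is the TD error."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V

# Example: random walk over states 0..6 (0 and 6 terminal),
# reward 1 only on reaching the right end.
V = {s: 0.0 for s in range(7)}
for _ in range(2000):
    s = 3
    while s not in (0, 6):
        s_next = s + random.choice((-1, 1))
        r = 1.0 if s_next == 6 else 0.0
        td0_update(V, s, r, s_next)
        s = s_next
print([round(V[s], 2) for s in range(1, 6)])   # roughly 1/6 .. 5/6
```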

Page 98

Why TD?

[Figure: example with states "New" and "Bad", and outcomes Loss (90%) and Win (10%)]

Page 99

Function Approximation Methods:

e.g., artificial neural networks

[Figure: a description of state s is fed to an ANN, which outputs the evaluation of s, V(s)]

Compactly Approximate V

Page 100

$Q^{*}(s,a)$: the expected return for taking action a in state s and following an optimal policy thereafter ("action values")

Let $Q(s,a)$ = the current estimate of $Q^{*}(s,a)$

For any state, any action with a maximal optimal action value is an optimal action:

$a^{*} = \arg\max_a Q^{*}(s,a)$   ($a^{*}$: an optimal action in s)

Q-Learning Watkins 1989; Leigh Tesfatsion

Page 101

The Q-Learning Backup

[Figure: one sample transition: state s, action a, reward r, next state s′]

$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\,[\,r + \gamma \max_b Q(s',b)\,]$

Does not need a probability model (for either learning or performance)
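The Q-learning backup as code, learning from sampled transitions only. The epsilon-greedy behavior policy and the four-state chain environment are assumptions made for the example.

```python
import random
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Q(s,a) <- (1-alpha) Q(s,a) + alpha [r + gamma * max_b Q(s',b)].
    Only a sampled transition is needed, not a probability model."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    return Q

# Toy chain over states 0..3: +1 moves right, -1 moves left,
# reward 1 on entering state 3, which ends the episode.
actions = (-1, +1)
Q = defaultdict(float)
for _ in range(3000):
    s = 0
    for _ in range(20):
        a = random.choice(actions) if random.random() < 0.1 else \
            max(actions, key=lambda b: Q[(s, b)])      # epsilon-greedy
        s_next = min(max(s + a, 0), 3)
        r = 1.0 if s_next == 3 else 0.0
        q_update(Q, s, a, r, s_next, actions)
        if s_next == 3:
            break
        s = s_next
print([round(Q[(s, +1)], 2) for s in range(3)])        # about 0.81, 0.9, 1.0
```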

Page 102

Another View: Temporal Consistency

$V_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$ and $V_{t-1} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots$

so:  $V_{t-1} = r_t + \gamma V_t$

or:  $r_t + \gamma V_t - V_{t-1} = 0$   ("TD error")

Page 103

Review

MDPs
Dynamic Programming
Backups
Bellman equations (temporal consistency)
Approximating DP:
• Avoid exhaustive sweeps
• Simplify backups
• Reduce dependence on models
• Compactly approximate V

A good case can be made for using RL to approximate solutions to large MDPs

Page 104

The Plan

High-level intro to RL
Part I: The personal odyssey
Part II: The modern view
Part III: Intrinsically Motivated RL

Page 105

[Diagram: the Agent sends actions to the Environment; the Environment returns state and reward to the Agent]

A Common View

Page 106

[Diagram: the RL agent is embedded within a larger organism; external sensations and internal sensations, together with memory, determine the state and reward signals, and the RL agent emits actions]

A Less Misleading Agent View…

Page 107

Motivation

“Forces” that energize an organism to act and that direct its activity.

Extrinsic Motivation: being moved to do something because of some external reward ($$, a prize, etc.).

Intrinsic Motivation: being moved to do something because it is inherently enjoyable.

Page 108

Intrinsic Motivation

An activity is intrinsically motivated if the agent does it for its own sake rather than as a step toward solving a specific problem

Curiosity, Exploration, Manipulation, Play, Learning itself . . .

Can an artificial learning system be intrinsically motivated? Specifically, can a Reinforcement Learning system be intrinsically motivated?

Working with Satinder Singh

Page 109

The Usual View of RL

Reward looks extrinsic

Page 110

The Less Misleading View

All reward is intrinsic.

Page 111

So What is IMRL?

Key distinction:
• Extrinsic reward = problem specific
• Intrinsic reward = problem independent

Learning phases:
• Developmental Phase: gain general competence
• Mature Phase: learn to solve specific problems

Why important: open-ended learning via hierarchical exploration

Page 112

Scaling Up: Abstraction

Ignore irrelevant details
• Learn and plan at a higher level
• Reduce search space size
• Hierarchical planning and control
• Knowledge transfer
• Quickly react to new situations
• c.f. macros, chunks, skills, behaviors, . . .

Temporal abstraction: ignore temporal details (as opposed to aggregating states)

Page 113

The “Macro” Idea

A sequence of operations with a name; can be invoked like a primitive operation
• Can invoke other macros . . . hierarchy
• But: an open-loop policy

Closed-loop macros
• A decision policy with a name; can be invoked like a primitive control action
• behavior (Brooks, 1986), skill (Thrun & Schwartz, 1995), mode (e.g., Grudic & Ungar, 2000), activity (Harel, 1987), temporally-extended action, option (Sutton, Precup, & Singh, 1997)

Page 114

Options (Precup, Sutton, & Singh, 1997)

A generalization of actions to include temporally-extended courses of action

An option is a triple o = ⟨ I, π, β ⟩
• I : initiation set — the set of states in which o may be started
• π : the policy followed during o
• β : termination condition — gives the probability of terminating in each state

Example: robot docking
• I : all states in which charger is in sight
• π : pre-defined controller
• β : terminate when docked or charger not visible
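
One way to make the ⟨I, π, β⟩ triple concrete is a small container like the sketch below; the docking-flavored predicates and dictionary state encoding are hypothetical stand-ins, not the controllers from the cited work.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Option:
    """An option o = <I, pi, beta> in the sense of Sutton, Precup, & Singh."""
    initiation: Callable[[Any], bool]    # I: may the option start in state s?
    policy: Callable[[Any], Any]         # pi: action to take in state s while o runs
    termination: Callable[[Any], float]  # beta: probability of terminating in state s

# Illustrative docking-style option over dictionary-valued states.
dock = Option(
    initiation=lambda s: s.get('charger_visible', False),
    policy=lambda s: 'move_toward_charger',
    termination=lambda s: 1.0 if s.get('docked') or not s.get('charger_visible') else 0.0,
)
print(dock.initiation({'charger_visible': True}), dock.termination({'docked': True}))
```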

Page 115

Options cont.

Policies can select from a set of options & primitive actions

Generalizations of the usual concepts:
• Transition probabilities (“option models”)
• Value functions
• Learning and planning algorithms

Intra-option off-policy learning:
• Can simultaneously learn policies for many options from same experience

Page 116

[Diagram: trajectories plotted over State × Time at three levels]
• MDP — discrete time, homogeneous discount
• SMDP — continuous time, discrete events, interval-dependent discount
• Options over MDP — discrete time, overlaid discrete events, interval-dependent discount

A discrete-time SMDP overlaid on an MDP; can be analyzed at either level

Options define a Semi-Markov Decision Process
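
Since options induce an SMDP, option values can be backed up with an SMDP-style Q-learning rule, Q(s, o) ← Q(s, o) + α [ R + γ^k max_{o′} Q(s′, o′) − Q(s, o) ], where the option ran for k steps and R is the discounted reward accumulated along the way. A hedged sketch follows; the function name, step size, and toy rooms-style names are assumptions.

```python
def smdp_q_update(Q, s, option, rewards, s_next, options, alpha=0.1, gamma=0.9):
    """SMDP Q-learning backup after an option runs for k steps:
    Q(s,o) <- Q(s,o) + alpha [ R + gamma^k * max_o' Q(s',o') - Q(s,o) ],
    where R is the discounted sum of the k rewards received while o executed."""
    k = len(rewards)
    R = sum((gamma ** i) * r for i, r in enumerate(rewards))
    best_next = max(Q.get((s_next, o2), 0.0) for o2 in options)
    target = R + (gamma ** k) * best_next
    Q[(s, option)] = Q.get((s, option), 0.0) + alpha * (target - Q.get((s, option), 0.0))

# Usage: an option that ran for 3 steps, earning reward only on the last step.
Q = {}
smdp_q_update(Q, s='room1', option='go_to_hallway', rewards=[0.0, 0.0, 1.0],
              s_next='hallway', options=['go_to_hallway', 'go_to_goal'])
print(Q[('room1', 'go_to_hallway')])   # 0.081
```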

Page 117

Where do Options come from?

Dominant approach: hand-crafted from the start

How can an agent create useful options for itself?

• Several different approaches (McGovern, Digney, Hengst, ….). All involve defining subgoals of various kinds.

Page 118

Canonical Illustration: Rooms Example

[Figure: a gridworld of 4 rooms connected by 4 hallways, with goal locations G and two example hallway options O1 and O2 shown]

4 rooms, 4 hallways
8 multi-step options (to each room's 2 hallways)
4 unreliable primitive actions (up, down, left, right) — fail 33% of the time
Goal states are given a terminal value of 1; γ = .9
All rewards zero
Given goal location, quickly plan shortest route

Page 119

Task-Independent Subgoals

“Bottlenecks”, “Hubs”, “Access States”, …
Surprising events
Novel events
Incongruous events
Etc. …

Page 120

A Developmental Approach

Subgoals: events that are “intrinsically interesting”; not in the service of any specific task

Create options to achieve them
Once an option is well learned, the triggering event becomes less interesting
Previously learned options are available as actions in learning new option policies
When facing a specific problem: extract a “working set” of actions (primitive and abstract) for planning and learning

Page 121

For Example:

Built-in salient stimuli: changes in lights and sounds

Intrinsic reward generated by each salient event:
• Proportional to the error in prediction of that event according to the option model for that event (“surprise”)

Motivated in part by novelty responses of dopamine neurons
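
A hedged sketch of intrinsic reward as prediction error: the system described here uses the option model's prediction of the salient event, whereas this toy version tracks a single scalar probability; the learning rate and scale factor are assumptions made for illustration.

```python
class SalientEventModel:
    """Tracks a predicted probability for a salient event and emits an intrinsic
    reward proportional to the prediction error ("surprise") when the event occurs."""
    def __init__(self, scale=1.0, lr=0.1):
        self.p_event = 0.0     # current prediction that the event will occur
        self.scale = scale
        self.lr = lr

    def observe(self, event_occurred: bool) -> float:
        error = float(event_occurred) - self.p_event
        intrinsic_reward = self.scale * abs(error) if event_occurred else 0.0
        self.p_event += self.lr * error       # update the (very simplified) predictor
        return intrinsic_reward

# Usage: the first occurrences are surprising; repeated occurrences become boring.
model = SalientEventModel()
print([round(model.observe(True), 2) for _ in range(5)])  # decreasing rewards
```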

Page 122

Creating Options

Upon first occurrence of salient event: create an option and initialize:
• Initiation set
• Policy
• Termination condition
• Option model

All options and option models updated all the time using intra-option learning
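
A minimal sketch of the bookkeeping described on this slide; the registry layout, factory function, and names are hypothetical, and the real initialization of I, π, β, and the option model is more involved than shown.

```python
def ensure_option_for_event(options, event, make_option):
    """On the first occurrence of a salient event, create and register a new option;
    its initiation set, policy, termination condition, and option model start out
    rough and are refined by intra-option learning as experience accumulates."""
    if event not in options:
        options[event] = {
            'option': make_option(event),   # initializes I, pi, beta for this event
            'model': {},                    # option model, filled in by learning
        }
    return options[event]

# Usage with a trivial stand-in factory.
registry = {}
ensure_option_for_event(registry, 'light_on', make_option=lambda e: f'option<{e}>')
print(list(registry))   # ['light_on']
```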

Page 123

The Playroom Domain

Agent has eye, hand, visual marker

Actions:

move eye to hand

move eye to marker

move eye N, S, E, or W

move eye to random object

move hand to eye

move hand to marker

move marker to eye

move marker to hand

If both eye and hand are on an object: turn on the light, push the ball, etc.

Page 124

The Playroom Domain cont.

Switch controls room lights
Bell rings and moves one square if ball hits it
Pressing blue/red block turns music on/off
Lights have to be on to see colors
Can push blocks
Monkey cries out if bell and music both sound in dark room

Page 125

Skills

To make monkey cry out:
• Move eye to switch
• Move hand to eye
• Turn lights on
• Move eye to blue block
• Move hand to eye
• Turn music on
• Move eye to switch
• Move hand to eye
• Turn light off
• Move eye to bell
• Move marker to eye
• Move eye to ball
• Move hand to ball
• Kick ball to make bell ring

Using skills (options):
• Turn lights on
• Turn music on
• Turn lights off
• Ring bell

Page 126

Reward for Salient Events

Page 127

Speed of Learning Various Skills

Page 128

Learning to Make the Monkey Cry Out

Page 129

Connects with Previous RL Work

Schmidhuber; Thrun and Möller; Sutton; Kaplan and Oudeyer; Duff; others…

But these did not have the option framework and related algorithms available

Page 130

Beware the “Fallacy of Misplaced Concreteness”

Alfred North Whitehead

We have a tendency to mistake our models for reality, especially when they are good models.

Page 131

Thanks to all my past PhD Students

Rich Sutton, Chuck Anderson, Stephen Judd, Robbie Jacobs, Jonathan Bachrach, Vijay Gullapalli, Satinder Singh, Bob Crites, Steve Bradtke, Mike Duff, Amy McGovern, Ted Perkins, Mike Rosenstein, Balaraman Ravindran

Page 132

And my current students

Colin Barringer, Anders Jonsson, George D. Konidaris, Ashvin Shah, Özgür Şimşek, Andrew Stout, Chris Vigorito, Pippin Wolfe

And the funding agencies

AFOSR, NSF, NIH, DARPA

Page 133

Whew!

Thanks!