
Reinforcement Learning’s Computational Theory of Mind

Rich Sutton

with thanks to: Andy Barto, Satinder Singh, Doina Precup

Outline

• Computational Theory of Mind
• Reinforcement Learning
  – Some vivid examples
• Intuition of RL’s Computational Theory of Mind
  – Reward, policy, value (prediction of reward)
• Some of the Math
  – Policy iteration (values & policies build on each other)
  – Discounted reward
  – TD error
• Speculative extensions
  – Reason
  – Knowledge

Honeybee Brain & VUM Neuron (Hammer, Menzel)

Dopamine Neurons Signal “Error/Change” in Prediction of Reward

Marr’s Three Levels, at which any information processing system can be understood

• Computational Theory Level (What and why?)
  – What are the goals of the computation?
  – What is being computed?
  – Why are these the right things to compute?
  – What overall strategy is followed?
• Representation and Algorithm Level (How?)
  – How are these things computed?
  – What representation and algorithms are used?
• Hardware Implementation Level (Really how?)
  – How is this implemented physically?

Cash Register

• Computational Theory (What and why?)
  – Adding numbers
  – Making change
  – Computing tax
  – Controlling access to the cash drawer
• Representations and Algorithms
  – Are numbers stored in decimal, binary, or BCD?
  – Is multiplication done by repeated adding?
• Hardware Implementation
  – Silicon or gears?
  – Motors or springs?

Word Processor

• Computational Theory (What and why?)
  – The whole “application” level
  – To display the document as it will appear on paper
  – To make it easy to change, enhance
• Representations and Algorithms
  – How is the document stored?
  – What algorithms are used to maintain its display?
  – The underlying C code
• Hardware Implementation
  – How does the display work?
  – How does the silicon implement the computer?

Flight

• Computational Theory (What and why?)
  – Aerodynamics
  – Lift, propulsion, airfoil shape
• Representations and Algorithms
  – Fixed wings or flapping?
• Hardware Implementation
  – Steel, wood, or feathers?

Importance of Computational Theory

• The levels are loosely coupled
• Each has an internal logic, coherence of its own
• But computational theory drives all lower levels
• We have so often gone wrong by mixing CT with the other levels
  – Imagined neuro-physiological constraints
  – The meaning of connectionism, neural networks
• Constraints within the CT level have more force
• We have little computational theory for AI
  – No clear problem definition
  – Many methods for knowledge representation, but no theory of knowledge

Outline

• Computational Theory of Mind
• Reinforcement Learning
  – Some vivid examples
• Intuition of RL’s Computational Theory of Mind
  – Reward, policy, value (prediction of reward)
• Some of the Math
  – Policy iteration (values & policies build on each other)
  – Discounted reward
  – TD error
• Speculative extensions
  – Reason
  – Knowledge

Reinforcement learning: Learning from interaction to achieve a goal

[Diagram: Agent and Environment exchanging state, action, and reward]

• complete agent
• temporally situated
• continual learning & planning
• object is to affect the environment
• environment stochastic & uncertain

Strands of History of RL

[Timeline figure: three strands converging on modern RL]
• Trial-and-error learning: Thorndike (1911), Minsky, Klopf, Barto et al.
• Temporal-difference learning: secondary reinforcement, Samuel, Witten, Sutton
• Optimal control, value functions: Hamilton (physics, 1800s), Shannon, Bellman/Howard (OR), Werbos, Watkins
• Holland

Hiroshi Kimura’s RL Robots

[Video clips: Before — After — Backward — New Robot, Same algorithm]

The RoboCup Soccer Competition

The Acrobot Problem

e.g., DeJong & Spong, 1994; Sutton, 1995

Minimum–Time–to–Goal:
  4 state variables: 2 joint angles, 2 angular velocities
  Tile coding with 48 layers
  Reward = –1 per time step
  Goal: raise the tip above the line

[Diagram: two-link acrobot with a fixed base; torque is applied at the joint between the links; the tip must be raised above the line]

Examples of Reinforcement Learning

• RoboCup Soccer Teams — Stone & Veloso, Riedmiller et al.
  – World’s best player of simulated soccer, 1999; runner-up 2000
• Inventory Management — Van Roy, Bertsekas, Lee & Tsitsiklis
  – 10–15% improvement over industry-standard methods
• Dynamic Channel Assignment — Singh & Bertsekas, Nie & Haykin
  – World’s best assigner of radio channels to mobile telephone calls
• Elevator Control — Crites & Barto
  – (Probably) world’s best down-peak elevator controller
• Many Robots
  – navigation, bipedal walking, grasping, switching between skills...
• TD-Gammon and Jellyfish — Tesauro, Dahl
  – World’s best backgammon player

New Applications of RL

• CMUnited RoboCup Soccer Team — Stone & Veloso
  – World’s best player of RoboCup simulated soccer, 1998
• KnightCap and TDLeaf — Baxter, Tridgell & Weaver
  – Improved chess play from intermediate to master in 300 games
• Inventory Management — Van Roy, Bertsekas, Lee & Tsitsiklis
  – 10–15% improvement over industry-standard methods
• Walking Robot — Benbrahim & Franklin
  – Learned critical parameters for bipedal walking

Real-world applications using on-line learning

Back-prop

[Video clip]

Backgammon

SITUATIONS: configurations of the playing board (about 10^20)
ACTIONS: moves
REWARDS: win: +1; lose: –1; else: 0

Pure delayed reward

TD-Gammon (Tesauro, 1992–1995)

[Diagram: board position fed to a neural network that outputs a value estimate; the TD error V_{t+1} − V_t drives learning; action selection by 2–3 ply search]

Start with a random network
Play millions of games against itself
Learn a value function from this simulated experience
This produces arguably the best player in the world
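A minimal sketch of this self-play scheme (hypothetical code, not Tesauro’s; `initial_position`, `legal_moves`, `is_terminal`, `outcome`, and a value network `V` with `predict`/`update` methods are assumed helpers, and the sign flip between the two players’ perspectives is ignored for clarity):

```python
# Sketch of TD-Gammon-style self-play value learning (hypothetical helpers assumed:
# initial_position(), legal_moves(pos), is_terminal(pos), outcome(pos) in {+1, -1},
# and a value network V with predict(pos) and update(pos, step) methods).

def self_play_episode(V, alpha=0.1):
    pos = initial_position()
    while not is_terminal(pos):
        # Greedy move selection: pick the successor the network values most highly
        # (TD-Gammon additionally searched 2-3 ply ahead).
        next_pos = max(legal_moves(pos), key=V.predict)
        # Bootstrapped target: the game's outcome at the end, otherwise the
        # network's own estimate of the next position.
        target = outcome(next_pos) if is_terminal(next_pos) else V.predict(next_pos)
        V.update(pos, alpha * (target - V.predict(pos)))  # move V(pos) toward the target
        pos = next_pos
```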

Outline

• Computational Theory of Mind
• Reinforcement Learning
  – Some vivid examples
• Intuition of RL’s Computational Theory of Mind
  – Reward, policy, value (prediction of reward)
• Some of the Math
  – Policy iteration (values & policies build on each other)
  – Discounted reward
  – TD error
• More speculative extensions
  – Reason
  – Knowledge

The Reward Hypothesis

    The mind’s goal is to maximize the cumulative sum of a received scalar signal (reward)

• “is”: can be conceived of, understood as
• “received”: must come from outside, not under the mind’s direct control
• “scalar”: a simple, single number (not a vector or symbol structure)

• Obvious? Brilliant? Demeaning? Inevitable? Trivial?
• Simple, but not trivial
• May be adequate, may be completely satisfactory
• A good null hypothesis

Policies

• A policy maps each state to an action to take
  – Like a stimulus–response rule
• We seek a policy that maximizes cumulative reward
• The policy is a subgoal to achieving reward

[Diagram: Reward → Policy]

Value Functions

• Value functions = predictions of expected reward following states
  – Value: States → Expected future reward
• Moment-by-moment estimates of how well it’s going
• All efficient methods for finding optimal policies first estimate value functions
  – RL methods, state-space planning methods, dynamic programming
• Recognizing and reacting to the ups and downs of life is an important part of intelligence

[Diagram: Reward → Value Function → Policy]

The Mountain Car Problem (Moore, 1990)

Minimum-time-to-goal problem

[Diagram: underpowered car in a valley with the Goal at the top of a hill — “Gravity wins”]

SITUATIONS: car’s position and velocity
ACTIONS: three thrusts: forward, reverse, none
REWARDS: always –1 until the car reaches the goal
No discounting
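For concreteness, a minimal sketch of the standard mountain-car dynamics (the constants follow the commonly published formulation, e.g. in Sutton & Barto, and may differ from the exact version used in the talk):

```python
import math

# Standard mountain-car dynamics (common published constants; a sketch, not the
# talk's exact setup). Action is -1 (reverse thrust), 0 (no thrust), or +1 (forward).
def step(position, velocity, action):
    velocity += 0.001 * action - 0.0025 * math.cos(3 * position)  # thrust vs. gravity
    velocity = max(-0.07, min(0.07, velocity))                    # velocity bounds
    position = max(-1.2, min(0.5, position + velocity))           # position bounds
    if position <= -1.2:
        velocity = 0.0                       # hitting the left wall kills velocity
    at_goal = position >= 0.5
    return position, velocity, -1, at_goal   # reward is -1 every step, no discounting
```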

Value Functions Learned while solving the Mountain Car problem

[Plots of learned value functions — lower is better]

Honeybee Brain & VUM Neuron (Hammer, Menzel)

Dopamine Neurons Signal “Error/Change” in Prediction of Reward

Outline

• Computational Theory of Mind
• Reinforcement Learning
  – Some vivid examples
• Intuition of RL’s Computational Theory of Mind
  – Reward, policy, value (prediction of reward)
• Some of the Math
  – Policy iteration (values & policies build on each other)
  – Discounted reward
  – TD error
• Speculative extensions
  – Reason
  – Knowledge

Notation

• Reward:  r : States → Pr(ℜ),  e.g., r_t ∈ ℜ
• Policies:  π : States → Pr(Actions);  π* is optimal
• Value Functions:  V^π : States → ℜ

$$V^\pi(s) \;=\; E_\pi\!\left\{ \sum_{k=1}^{\infty} \gamma^{k-1}\, r_{t+k} \right\} \quad \text{given that } s_t = s$$

where γ is the discount factor, ≈ 1 but < 1
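To make the definition concrete, here is a minimal Monte Carlo sketch (hypothetical code, not from the talk) that estimates V^π(s) by averaging discounted returns; `sample_rewards(s, pi)` is an assumed helper that runs one episode from s under π and returns the observed reward sequence r_{t+1}, r_{t+2}, …:

```python
# Monte Carlo estimate of V^pi(s): average the discounted return over sampled episodes.
# `sample_rewards(s, pi)` is a hypothetical helper returning [r_{t+1}, r_{t+2}, ...].

def discounted_return(rewards, gamma=0.99):
    # sum over k >= 1 of gamma^(k-1) * r_{t+k}
    return sum(gamma**k * r for k, r in enumerate(rewards))

def estimate_value(s, pi, sample_rewards, n_episodes=1000, gamma=0.99):
    returns = [discounted_return(sample_rewards(s, pi), gamma) for _ in range(n_episodes)]
    return sum(returns) / len(returns)
```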

Policy Iteration

Generalized Policy Iteration

[Diagram: the policy π and the value function V build on each other — policy evaluation takes the policy to its value function (π → V^π), greedification takes a value function to the greedy policy; alternating the two converges to π* and V*]
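A minimal tabular policy-iteration sketch (hypothetical code, not from the talk), assuming a known finite MDP where `P[s][a]` is a list of (probability, next_state, reward) triples:

```python
# Tabular policy iteration for a known, finite MDP (a sketch under the assumptions above).

def policy_iteration(states, actions, P, gamma=0.95, theta=1e-8):
    V = {s: 0.0 for s in states}
    pi = {s: actions[0] for s in states}

    def q(s, a):
        # Expected one-step return of taking a in s, then following the current values.
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    while True:
        # Policy evaluation: make V consistent with the current policy.
        while True:
            delta = 0.0
            for s in states:
                v_new = q(s, pi[s])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Greedification: make the policy greedy with respect to V.
        stable = True
        for s in states:
            best = max(actions, key=lambda a: q(s, a))
            if best != pi[s]:
                pi[s] = best
                stable = False
        if stable:
            return pi, V
```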

Reward is a time series

[Plots: a reward time series, its total future reward, and its discounted (imminent) reward]

Discounting Example

[Plot: a reward sequence and the discounted weighting applied to it]
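A hypothetical worked instance of discounting: with γ = 0.9 and rewards r_{t+1} = 0, r_{t+2} = 0, r_{t+3} = 1 (zero thereafter), the discounted sum at time t is

$$\sum_{k=1}^{\infty} \gamma^{k-1} r_{t+k} \;=\; \gamma^{2} \cdot 1 \;=\; 0.81,$$

so the same unit of reward counts for less the further in the future it arrives.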

What should the prediction error be at time t?

$$V_t \;=\; E\!\left\{ \sum_{k=1}^{\infty} \gamma^{k-1}\, r_{t+k} \right\}
      \;=\; E\!\left\{ r_{t+1} + \gamma \sum_{k=1}^{\infty} \gamma^{k-1}\, r_{t+1+k} \right\}
      \;\approx\; r_{t+1} + \gamma\, V_{t+1}$$

$$\text{TD error}_t \;=\; r_{t+1} + \gamma\, V_{t+1} - V_t$$
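This TD error is exactly the quantity a tabular TD(0) learner uses to adjust its estimates; a minimal sketch (hypothetical code, not from the talk):

```python
from collections import defaultdict

V = defaultdict(float)   # value estimates, one per state, initialized to zero

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.95):
    td_error = r + gamma * V[s_next] - V[s]   # r_{t+1} + gamma * V_{t+1} - V_t
    V[s] += alpha * td_error                  # move V(s) a step in the direction of the error
    return td_error

# Inside an interaction loop: td0_update(V, state, reward, next_state)
```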

Computation-Theoretical TD Errors

[Figure: traces of reward (or cue), value, and TD error in three cases — Reward Unexpected, Reward Expected, Reward Absent]

Honeybee Brain & VUM Neuron (Hammer, Menzel)

Dopamine Neurons Signal TD Error in Prediction of Reward

Representation & Algorithm Level: The many methods for approximating value

[Figure: a space of methods organized along two dimensions — sample backups vs. full backups, and shallow backups (bootstrapping) vs. deep backups — with temporal-difference learning, dynamic programming, Monte Carlo, and exhaustive search at the corners]

Outline

• Computational Theory of Mind
• Reinforcement Learning
  – Some vivid examples
• Intuition of RL’s Computational Theory of Mind
  – Reward, policy, value (prediction of reward)
• Some of the Math
  – Policy iteration (values & policies build on each other)
  – Discounted reward
  – TD error
• Speculative extensions
  – Reason
  – Knowledge

Tolman & Honzik, 1930: “Reasoning in Rats”

[Maze diagram: start box, food box, Path 1, Path 2, Path 3, Block A, Block B]

Reason as RL over Imagined Experience

1. Learn a model of the world’s transition dynamics
   – transition probabilities, expected immediate rewards
   – a “1-step model” of the world
2. Use the model to generate imaginary experience
   – internal thought trials, mental simulation (Craik, 1943)
3. Apply RL as if the experience had really happened (see the sketch below)

[Diagram: Reward → Value Function → Policy, now with a 1-Step Model added]
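A minimal Dyna-style sketch of this three-step scheme (hypothetical code, not from the talk; it uses a Q-learning update and a deterministic 1-step model for simplicity):

```python
import random
from collections import defaultdict

Q = defaultdict(float)   # action values, keyed by (state, action)
model = {}               # learned 1-step model: (state, action) -> (reward, next_state)

def q_update(s, a, r, s2, actions, alpha=0.1, gamma=0.95):
    best_next = max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def dyna_step(s, a, r, s2, actions, n_imagined=10):
    q_update(s, a, r, s2, actions)       # learn from the real transition
    model[(s, a)] = (r, s2)              # 1. update the 1-step model of the world
    for _ in range(n_imagined):          # 2. generate imaginary experience from the model
        (si, ai), (ri, s2i) = random.choice(list(model.items()))
        q_update(si, ai, ri, s2i, actions)   # 3. apply the same RL update to it
```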

Mind is About Predictions

Hypothesis: Knowledge is predictive
  – About what-leads-to-what, under what ways of behaving
  – What will I see if I go around the corner?
  – Objects: What will I see if I turn this over?
  – Active vision: What will I see if I look at my hand?
  – Value functions: What is the most reward I know how to get?
  – Such knowledge is learnable, chainable

Hypothesis: Mental activity is working with predictions
  – Learning them
  – Combining them to produce new predictions (reasoning)
  – Converting them to action (planning, reinforcement learning)
  – Figuring out which are most useful

An old, simple, appealing idea

• Mind as prediction engine!
• Predictions are learnable, combinable
• They represent cause and effect, and can be pieced together to yield plans
• Perhaps this old idea is essentially correct.
• Just needs
  – Development, revitalization in modern forms
  – Greater precision, formalization, mathematics
  – The computational perspective to make it respectable
  – Imagination, determination, patience
• Not rushing to performance