From Reflex to Reason
Rich Sutton
AT&T Labs
with thanks to Satinder Singh, Doina Precup, and Andy Barto
Overall Goal
A computational understanding of a broad span of the mind’s activities
– what it computes
– why it computes it
At a high level, without:
– specifics of sensory and motor systems
– specific representations and algorithms
– neural implementations
– language
What does the mind do?
Is there an overall, simple answer?
Marr’s 3 levels
Main Claims
• Mind is about predictions– making predictions– discovering what predictions can be made
• Knowledge is predictions– action-contingent and temporally-flexible predictions– agent-centric, grounded in experience from the bottom up
• The mind’s ultimate goal is to make reward-maximizing decisions– but most of its effort is devoted to subgoal of prediction
• A few simple mechanisms enable working flexibly with predictions– TD learning and Bellman backups
Prediction Semantics
• A prediction is a signal with meaning
• Knowing that one signal is a prediction of another enables it to do useful work for you
• When something new predicts X, you know what to do
• Prediction semantics constrains in two directions
[Diagram: a new link from a stimulus to "Pred. of X" joins the existing link from X to Response Y]
Outline/Steps
• Reflexes and their conditioning
• Learning to get reward
• Planning, by mental simulation
• Knowledge, as temporally flexible predictions
• Reason, as flexible use of knowledge
These together are much of what the mind does
Can we explain them all in a uniform way?
Pavlovian Conditioning, the Conditioning of Reflexes
[Diagram: CS (tone) and US (eye shock); UR (eyeblink) before learning; CR (eyeblink) after learning, even with no US]
Almost any reflex can be conditioned:
salivation
orienting
heart rate, blood pressure
gill withdrawal
nausea, taste aversion
fear, secondary reinforcers
CER: freezing, suppression
neutral stimuli
• Animal can be viewed as learning that the CS predicts the US
• And then responding in anticipation
• But Why? Why should a prediction of the US produce the same response as the US?
(Inadequate) Computational Theories of CC
• Instrumental theories -- the CR makes the US feel better
– Works well for eyeblink, salivation; not for secondary reinforcers
– Does not explain the similarity of CR and UR
– Does not explain apparent conflict of CC and instrumental learning
• Anticipation theories -- whatever you are going to do, CC causes you to do it earlier
– Why earlier? Earlier is not always better!
– How much earlier? CR tends to occur at the time of the US
• Prediction theories -- CC is learning to predict the US
– Works for fear, CER, secondary reinforcers
– Does not explain response to CR or to UR
– Explains "What" but not "Why"
Pred Rep’n Theory of Conditioning
The reflex is not US → Response:
[Diagram: rejected architectures — US → R via a fixed reflex link, with a learnable CS link]
But Prediction-of-US → Response:
[Diagram: Pred-of-US → R via the fixed reflex link; both the CS → Pred-of-US and US → Pred-of-US links are learnable]
+ USs could habituate!
Pred Rep’n Theory of Conditioning (2)
• Consider an innate, learnable association US → Response
– represents an innate guess, e.g., that a shock now is a good predictor of a shock coming up
– but could be wrong
– Predicts URs could habituate, change over time depending on their relationship to themselves
[Diagram: the US acts as a supervisory cue; CS1, CS2, and the US itself feed the learnable prediction of the supervisory US, which drives the Response. Long USs predict themselves; short USs are poor self-predictors.]
Pred Rep’n Theory of Conditioning (3)
• Implications for response topography/generation
– predicts maximal CR at time of US onset (correct)
– predicts CR onset only so early as to enable this
– predicts threshold phenomena in CR production
– predicts interaction of threshold with relative effectiveness of reinforced and unreinforced trials
[Diagram: CR response topography rising to a maximum at the time of the US]
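To make the mechanism concrete, here is a toy TD-style sketch of this theory; the variable names, threshold rule, and parameters are illustrative assumptions, not the model's exact form. The CR is emitted wherever the learned prediction of the US crosses a threshold, consistent with the threshold phenomena above.

```python
import numpy as np

# Toy TD sketch of the prediction-representation theory: the CR is driven by
# a learned prediction of the US, trained from differences between successive
# predictions. Names, the threshold rule, and constants are illustrative.

def td_conditioning(x, us, alpha=0.1, gamma=0.9, theta=0.5, n_trials=200):
    """x: (T, n_features) stimulus traces over a trial (CS and US features);
    us: (T,) US intensity at each step. Returns weights and a CR trace."""
    w = np.zeros(x.shape[1])                  # learnable links into Pred-of-US
    for _ in range(n_trials):
        for t in range(len(us) - 1):
            delta = us[t + 1] + gamma * (w @ x[t + 1]) - (w @ x[t])  # TD error
            w += alpha * delta * x[t]         # strengthen predictive cues
    cr = (x @ w) > theta                      # CR where prediction crosses threshold
    return w, cr
```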
Outline/Steps
• Reflexes and their conditioning
• Learning to get reward
• Planning, by mental simulation
• Knowledge, as temporally flexible predictions
• Reason, as flexible use of knowledge
The Reward Hypothesis
That purposes can be adequately represented as maximization of the cumulative sum of a scalar reward signal received from the environment
• Is this reasonable?
• Is it demeaning?
• Is there no other choice?
• It seems to be adequate and perhaps completely satisfactory
Reinforcement Learning Theory:
What to Compute and Why
• Policies: π: States → Pr(Actions)
• Value Functions: $V^{\pi}: States \to \Re$
$V^{\pi}(s) = E\left\{ \sum_{t=1}^{\infty} \gamma^{t-1}\, reward_t \;\middle|\; s_0 = s,\ \text{follow } \pi \right\}$
• 1-Step Models
$\Pr\{ s_{t+1} \mid s_t, a_t \}, \quad E\{ r_{t+1} \mid s_t, a_t \}$
Predictions!
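A minimal sketch of these three objects in code, on a toy chain world; the environment, policy, and constants are illustrative assumptions. The value function is estimated by TD(0), and the 1-step model is learned empirically from the same experience.

```python
from collections import defaultdict

# Toy chain MDP: states 0..4, reward 1 on reaching state 4 (illustrative).
N = 5
GAMMA, ALPHA = 0.9, 0.1

def step(s, a):                         # the world's true dynamics (hidden)
    s2 = min(s + 1, N - 1) if a == "right" else max(s - 1, 0)
    return s2, float(s2 == N - 1)

policy = lambda s: "right"              # pi: States -> Actions (deterministic here)
V = defaultdict(float)                  # value function V^pi(s)
model = {}                              # 1-step model: (s, a) -> (s', r)

for episode in range(500):
    s = 0
    while s != N - 1:
        a = policy(s)
        s2, r = step(s, a)
        V[s] += ALPHA * (r + GAMMA * V[s2] - V[s])  # TD(0) estimate of V^pi
        model[(s, a)] = (s2, r)                     # learned 1-step model
        s = s2
```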
Honeybee Brain & VUM Neuron
Hammer, Menzel
The Acrobot Problem
e.g., Dejong & Spong, 1994
Sutton, 1995
Minimum–Time–to–Goal:
4 state variables: 2 joint angles 2 angular velocities
CMAC of 48 layers
RL same as Mountain Car
Goal: Raise tip above line
Reward = -1 per time step
[Diagram: two-link acrobot with a fixed base; torque applied at the second joint; the tip must rise above the goal line]
Prediction Semantics of RL
Value, a prediction of reward
[Diagram: representations of state and action → (learned links) → value, a prediction of reward → (fixed link) → action selection: pick the highest-valued action]
An action that predicts reward in a state should, to that extent, be favored in that state
Examples of Reinforcement Learning
• Robocup Soccer Teams (Stone & Veloso; Riedmiller et al.)
– World's best player of simulated soccer, 1999; runner-up 2000
• Inventory Management (Van Roy, Bertsekas, Lee & Tsitsiklis)
– 10-15% improvement over industry-standard methods
• Dynamic Channel Assignment (Singh & Bertsekas; Nie & Haykin)
– World's best assigner of radio channels to mobile telephone calls
• Elevator Control (Crites & Barto)
– (Probably) world's best down-peak elevator controller
• Many Robots
– navigation, bipedal walking, grasping, switching between skills...
• TD-Gammon and Jellyfish (Tesauro; Dahl)
– World's best backgammon player
TD-Gammon
[Diagram: backgammon position fed through a multilayer neural network to produce Value; trained by the TD error $V_{t+1} - V_t$]
Action selection by 2-3 ply search
Tesauro, 1992-1995
Start with a random Network
Play millions of games against itself
Learn a value function from this simulated experience
This produces arguably the best player in the world
Prediction Semantics in TD-Gammon
• A prediction of winning can substitute for winning– the central idea of Temporal-Difference (TD) learning
• learning a prediction from a prediction!
– also key idea of dynamic programming– and all heuristic search
• In lookahead search, predictions are composed to produce longer-term predictions
– key to all state-space planning
– suggests prediction semantics is a key element of reasoning
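A minimal sketch of that composition, with illustrative names: a one-ply backup that evaluates a state by combining 1-step predictions (a learned model, as in the tabular sketch above) with longer-term predictions V.

```python
# Sketch of composing predictions in one-ply lookahead. `model` maps
# (s, a) -> (s', r) and V holds longer-term predictions; both illustrative.

def lookahead_value(s, actions, model, V, gamma=0.9):
    """Evaluate s by composing 1-step predictions with the predictions V(s')."""
    best = float("-inf")
    for a in actions:
        s2, r = model[(s, a)]                # 1-step prediction for action a
        best = max(best, r + gamma * V[s2])  # composed with prediction from s2
    return best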
Outline/Steps
• Reflexes and their conditioning
• Learning to get reward
• Planning, by mental simulation
• Knowledge, as temporally flexible predictions
• Reason, as flexible use of knowledge
Planning as RL over Mental Simulation
1. Learn a model of the world's transition dynamics
– transition probabilities, expected immediate rewards
– a "1-step model" of the world
2. Use model to generate imaginary experiences
– internal thought trials, mental simulation (Craik, 1943)
3. Apply RL as if experience had really happened
I.e., learning on model-generated experience:
[Diagram: Reward, Value Function, 1-Step Model, Policy]
Dyna Algorithm
1. s ← current state
2. Choose an action, a, and take it
3. Receive next state, s', and reward, r
4. Apply RL backup to s, a, s', r
   e.g., Q-learning update
5. Update Model(s, a) with s', r
6. Repeat k times:
   – select a previously seen state-action pair s, a
   – s', r ← Model(s, a)
   – Apply RL backup to s, a, s', r
7. Go to 1
[Diagram: the Dyna architecture — acting produces experience; direct RL and model learning both consume it; planning applies RL to model-generated experience to update the value/policy]
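A minimal tabular Dyna-Q sketch of the loop above; the toy chain environment, the constants, and the ε-greedy action rule are illustrative assumptions, not part of the algorithm statement itself.

```python
import random
from collections import defaultdict

# Tabular Dyna-Q on a toy chain world (illustrative environment and constants).
N, ACTIONS, GAMMA, ALPHA, K = 6, ["left", "right"], 0.95, 0.1, 10

def env_step(s, a):
    s2 = min(s + 1, N - 1) if a == "right" else max(s - 1, 0)
    return s2, float(s2 == N - 1)          # reward 1 at the right end

Q = defaultdict(float)                     # Q(s, a) estimates
model = {}                                 # Model(s, a) -> (s', r)

def backup(s, a, s2, r):                   # Q-learning update (steps 4 and 6)
    best = max(Q[(s2, b)] for b in ACTIONS)
    Q[(s, a)] += ALPHA * (r + GAMMA * best - Q[(s, a)])

s = 0
for t in range(2000):
    if random.random() < 0.1:              # epsilon-greedy action selection
        a = random.choice(ACTIONS)
    else:
        a = max(ACTIONS, key=lambda b: Q[(s, b)])
    s2, r = env_step(s, a)                 # real experience
    backup(s, a, s2, r)                    # direct RL
    model[(s, a)] = (s2, r)                # model learning
    for _ in range(K):                     # planning: k imagined backups
        ps, pa = random.choice(list(model))
        backup(ps, pa, *model[(ps, pa)])
    s = 0 if s2 == N - 1 else s2
```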
State-Space Search is based on a Prediction Semantics
[Diagram: a search tree — in seeking to evaluate this state, we use predictions from its successor states]
Prediction Semantics in Planning
is just like in TD-Gammon
• Predictions substitute for path outcomes
• Predictions are composed to predict consequences of arbitrary sequences of action
Naïve RL Theory of Reason
[Diagram: Reward, Value Function, 1-Step Model, Policy]
Reason is RL on model-generated experience
• Pro:
– Very simple, uniform, general
– Sufficient to reproduce, e.g., latent learning
• Con:
– Seems too low-level
– Represents only a limited kind of knowledge
Outline/Steps
• Reflexes and their conditioning
• Learning to get reward
• Planning, by mental simulation
• Knowledge, as temporally flexible predictions
• Reason, as flexible use of knowledge
Experience
[Diagram: Agent ↔ World, exchanging actions and observations]
A mind interacts with its world to produce two time series:
Actions: $\ldots, a_{t-3}, a_{t-2}, a_{t-1}, a_t, \ldots?$
Observations: $\ldots, o_{t-3}, o_{t-2}, o_{t-1}, o_t, \ldots?$
Experience is the data; it is all we really know.
Experience provides something for knowledge to be about.
Experience
• The world is a black box, known only by its I/O behavior (observations in response to actions)
• Therefore, all meaningful statements about the world are statements about the observations it generates
• The only observations worth talking about are future ones
The only meaningful things to say about the world are predictions
Therefore:
World Knowledge = Predictions
Predictions = statements about the joint distribution of future observations and actions
Non-predictive "Knowledge"
• Mathematical knowledge, theorems and proofs
– always true, but tell us nothing about the world
– not world knowledge
• Uninterpreted signals, e.g., useful representations
– real and useful, but not by themselves world knowledge, only an aid to acquiring it
• Knowledge of the past
• Policies
– could be viewed as predictions of value
– but by themselves are more like uninterpreted signals
Predictions capture “regular”, descriptive world knowledge
Every Prediction must be Grounded in Two Directions
"if I do action 1, then obs 12 will be 0 for three steps"
[Diagram: a prediction is grounded backward in the history of actions & observations (recognition grounding, "symbol grounding") and forward in future experience (prediction grounding, "prediction semantics")]
Both Recognition and Prediction Grounding are Needed
• "Classical" AI systems omit recognition grounding
– e.g., "Tweety is a bird", "John loves Mary"
– sometimes called the "symbol grounding problem"
• Modern AI systems tend to skimp on prediction grounding
– supervised learning, Bayes nets, robotics…
• It is not OK to leave prediction grounding to external, human observers– the information is just not in the machine– we don’t understand it; we haven’t done our job!
• Yet this is such an appealing shortcut that we have almost always done it
Prediction Semantics Formalized as Macro-Actions
Let π: States → Pr(Actions) be an arbitrary policy
Let β: States → Pr({0,1}) be a termination condition
Then macro-action ⟨π,β⟩ is a kind of experiment
– do π until β says "stop"
– measure something about the resulting experience
Suppose we measure
– the state at the end of the experiment
– the total reward during the experiment
Then the macro prediction for ⟨π,β⟩ would predict Pr(end-state), E{total reward} given start-state
Predictions of this form can represent a lot... possibly all world knowledge
Sutton, Precup & Singh, AIJ 1999, et al.
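A minimal sketch of this formalization (names and the env_step interface are illustrative assumptions): a macro-action as a policy plus termination condition, with a routine that runs the experiment and returns the two measured quantities.

```python
import random
from dataclasses import dataclass
from typing import Callable

# Sketch of a macro-action ("option") <pi, beta>; illustrative, not a fixed API.

@dataclass
class Option:
    pi: Callable        # policy: state -> action
    beta: Callable      # termination condition: state -> Pr(stop)

def run_option(option, s, env_step, gamma=1.0):
    """Run the experiment: follow pi until beta says stop. Returns the
    end state and total (discounted) reward -- the two measurements whose
    distribution the macro prediction describes."""
    total, disc = 0.0, 1.0
    while random.random() >= option.beta(s):   # continue until termination
        s, r = env_step(s, option.pi(s))
        total += disc * r
        disc *= gamma
    return s, total
```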
Rooms Example
[Diagram: the four-rooms gridworld — hallways, goals G1 and G2, hallway options o1 and o2, and the policy of one macro-action leading to its target hallway]
4 stochastic primitive actions: up, down, left, right (fail 33% of the time)
8 multi-step macro-actions (to each room's 2 hallways)
Sutton, Precup & Singh, 1999
Planning with Macro-Predictions
[Figure: value iteration snapshots (Iterations #0, #1, #2) with cell-to-cell primitive actions vs. with room-to-room macro-actions (options); V(goal) = 1]
Learning Path-to-Goal with and without Hallway Macro-Actions
[Figure: steps per episode (log scale, 10-1000) vs. episodes (1-10,000), comparing primitive actions only, macros only, and macros & actions]
Illustration: Reconnaissance Mission Planning (Problem)
• Mission: Fly over (observe) most valuable sites and return to base
• Stochastic weather affects observability (cloudy or clear) of sites
• Limited fuel
• Intractable with classical optimal control methods
• Temporal scales:
– Actions: which direction to fly now
– Options: which site to head for
• Options compress space and time
– Reduce steps from ~600 to ~6
– Reduce states from ~10^11 to ~10^6
$Q_O^*(s, o) = r_s^o + \sum_{s'} p_{ss'}^o \, V_O^*(s')$
[Diagram: mission map — sites of differing values and the base, ~100 decision steps per mission; options take planning from any state (~10^6) to sites only (6); site observability flips with the mean time between weather changes]
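A sketch of one sweep of value iteration using the SMDP backup above; the option-model tables r and p are assumed given, and all names are illustrative.

```python
# One sweep of SMDP value iteration with options (illustrative, not the
# paper's code). r[s][o] is the expected discounted reward of running option
# o from s; p[s][o] maps each termination state s2 to its discounted
# probability; V maps states to current value estimates.

def smdp_backup(V, r, p, states, options):
    """V_O*(s) = max_o [ r_s^o + sum_s' p_ss'^o V_O*(s') ] for every s."""
    newV = {}
    for s in states:
        newV[s] = max(
            r[s][o] + sum(prob * V[s2] for s2, prob in p[s][o].items())
            for o in options
        )
    return newV
```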
Illustration: Reconnaissance Mission Planning (Results)
• SMDP planner:
– Assumes options followed to completion
– Plans optimal SMDP solution
• SMDP planner with re-evaluation:
– Plans as if options must be followed to completion
– But actually takes them for only one step
– Re-picks a new option on every step
• Static planner:
– Assumes weather will not change
– Plans optimal tour among clear sites
– Re-plans whenever weather changes
[Chart: expected reward per mission, at low fuel and high fuel, for the SMDP planner, the static re-planner, and the SMDP planner with re-evaluation of options on each step]
Temporal abstraction finds better approximation than static planner, with little more computation than SMDP planner
Outline/Steps
• Reflexes and their conditioning
• Learning to get reward
• Planning, by mental simulation
• Knowledge, as temporally flexible predictions
• Reason, as flexible use of knowledge
Reason
Combining knowledge to obtain new knowledge, flexibly and generally
We must be able to reason about any event as a possible (sub)goal, not just about rewards
This is the final step
Subgoals
• Many natural macro-actions are goal-oriented– E.g., drive-to-work, open-the-door
• So replicate planning in miniature for each subgoal
• Macros can then be learned to achieve each subgoal
• Many can be learned at once, independently (see the sketch below)
– Solves classic problem of subgoal credit assignment
– Solves psychological puzzle of goal-oriented action
• Models of such macros are goal-oriented recognizers
– correspond to classical "concepts"
– e.g., a "chair" state is one where sitting is predicted to work
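A sketch of that independence in a toy setting (the chain world, subgoals, and constants are illustrative): several subgoal value functions are learned at once from a single stream of random behavior, with one TD update per subgoal per step.

```python
import random
from collections import defaultdict

# Learning many subgoal predictions at once from one stream of random
# behavior (toy chain world; names and pseudo-rewards are illustrative).

N, GAMMA, ALPHA = 10, 0.9, 0.1
subgoals = [0, N - 1]                           # e.g., the two hallway states
V = {g: defaultdict(float) for g in subgoals}   # one value table per subgoal

s = N // 2
for t in range(100_000):
    a = random.choice([-1, 1])                  # actions chosen totally at random
    s2 = min(max(s + a, 0), N - 1)
    for g in subgoals:                          # independent update per subgoal
        r = 1.0 if s2 == g else 0.0             # subgoal-specific pseudo-reward
        v2 = 0.0 if s2 == g else V[g][s2]       # prediction terminates at subgoal
        V[g][s] += ALPHA * (r + GAMMA * v2 - V[g][s])
    s = s2
```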
Rooms Example: Independent Learning of All 8 Subgoals
[Figure: RMS error in subgoal values vs. time steps (0-100,000), and learned vs. ideal values for the upper-hallway and lower-hallway subgoal states]
All 8 hallway macros and predictions are learned accurately and efficiently while actions are selected totally at random
Co-Existence of Hedonism and Exploration/Constructivism
• The ultimate goal is still reward
• Still one primary policy and set of values
• But many other policies, values, and predictions are learned, not directly in service of reward
• Most time is spent in exploration and discovery, gaining knowledge rather than reward:
– What possibilities does the world afford?
– How can I control and predict it in a variety of ways?
– What concepts can be learned that might help later?
• From hedonism to curiosity and constructivism
Main Claims
• Mind is about predictions– making predictions– discovering what predictions can be made
• Knowledge is predictions– action-contingent and temporally-flexible predictions– agent-centric, grounded in experience from the bottom up
• The mind’s ultimate goal is to make reward-maximizing decisions– but most of its effort is devoted to subgoal of prediction
• A few simple mechanisms enable working flexibly with predictions– TD learning and Bellman backups
Prediction Semantics
What is New?
• The formalization of macro-actions
– provide temporal abstraction
– as well as action contingency (experiments)
– mesh seamlessly with learning and planning methods
• Using the goal-oriented machinery of RL
– for knowledge construction
– for perceptual concepts
• Taking the discipline of predictive knowledge seriously
– speaking only in terms of the subjective, experiential data
Should Knowledge be Experiential?
Allowing only Predictions in terms of Data?
Loses:
• Expressiveness
– can't talk about objects, space, people; no "is-a" or "part-of"
• External (human) coherence
– verbal labels, interpretability, explainability, calibration
– the "shortcut" of entering knowledge directly into the agent
Gains:
• The knowledge will have meaning to the machine
• It can be mechanically learned/verified/extended
• It will be suited for a general reasoning process
– composition and backup of predictions to yield new predictions