action evaluation in appetitive and aversive learning
nathaniel daw
princeton university
leuven, 2019
learning for decisions
• how you compute has important consequences for what you choose
• e.g.: which of many possible outcomes you consider
• better worked out in the appetitive domain, but likely to extend to avoidance
algorithms for computing expected utility over candidate actions
1. habits vs deliberation: model-based vs. model-free RL
2. psychiatry: disorders involving compulsivity and avoidance
3. stress and opportunity cost
sequential decision tasks
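$$Q(s_t, a_t) = r(s_t) + \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t) \left[ r(s_{t+1}) + \sum_{s_{t+2}} \dots \right]$$
[Figure: decision tree with top-level states A and B, lower-level states C, D, E, F, and outcomes labeled $0, $25, and $10]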
Markov decision process: consequences of actions are delayed, contingent
• connect actions to consequences over space and time
• hard to estimate; hard to learn; “temporal credit assignment”
• maximizing utility unites both seeking reward and avoiding punishment
“model-based” learning
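$$Q(s_t, a_t) = r(s_t) + \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t) \left[ r(s_{t+1}) + \sum_{s_{t+2}} \dots \right]$$
[Figure: decision tree with top-level states A and B, lower-level states C, D, E, F, and outcomes labeled $0, $25, and $10]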
• learn one-step reward & transition “map”
• iterative, tree-structured computation
• hippocampal “preplay”? (Mattar & Daw 2018)
(Pfeiffer and Foster, 2013)
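A minimal sketch of the tree-structured computation, assuming a hypothetical two-level tree (a start state leading to A or B, then to C, D, E, F). The deterministic transitions and the assignment of the $0, $25, and $10 rewards to particular states are illustrative assumptions, not the values from the slide. The sketch also assumes future actions are chosen greedily (a max over actions), which the recursion above leaves implicit.

```python
# Hypothetical one-step "map": rewards r(s) and transitions P(s' | s, a).
REWARD = {"start": 0.0, "A": 0.0, "B": 10.0,
          "C": 0.0, "D": 25.0, "E": 0.0, "F": 0.0}
TRANSITIONS = {
    ("start", "left"): {"A": 1.0},
    ("start", "right"): {"B": 1.0},
    ("A", "left"): {"C": 1.0},
    ("A", "right"): {"D": 1.0},
    ("B", "left"): {"E": 1.0},
    ("B", "right"): {"F": 1.0},
}


def actions(state):
    """Actions available in a state (none at the leaves)."""
    return [a for (s, a) in TRANSITIONS if s == state]


def state_value(state):
    """Value of a state: its reward at a leaf, else the best action value."""
    available = actions(state)
    if not available:
        return REWARD[state]
    return max(q_value(state, a) for a in available)


def q_value(state, action):
    """Q(s, a) = r(s) + sum over s' of P(s' | s, a) * value(s')."""
    future = sum(p * state_value(next_state)
                 for next_state, p in TRANSITIONS[(state, action)].items())
    return REWARD[state] + future


print(q_value("start", "left"), q_value("start", "right"))
```

Because values are recomputed from the one-step map at choice time, changing a single entry (for example, setting `REWARD["D"] = 0.0` to mimic devaluation) changes the top-level values immediately.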
“model-free” learning
[Figure: states A and B labeled $25 and $10]
$$Q(s_t, a_t) = r(s_t) + \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t) \left[ r(s_{t+1}) + \sum_{s_{t+2}} \dots \right]$$
shortcut: store endpoints of computation (long-run action values)
• these can be learned directly, “model free” (TD learning)
• simplifies choice-time computation (just retrieve), but may not reflect all available information
• standard theory of dopamine, reward prediction errors etc.
(Schultz et al 1997)
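A minimal sketch of the model-free shortcut, assuming a tabular Q-learning-style TD update. The state names, the learning rate, and the convention that reward is delivered on arrival at the next state are illustrative assumptions. Long-run action values are cached and nudged by a reward prediction error, with no transition model.

```python
from collections import defaultdict


def td_update(q, state, action, reward, next_state, next_actions, alpha=0.1):
    """Apply one TD update to q in place and return the prediction error."""
    # Value of the next state: best cached action value, or 0 if terminal.
    next_value = max((q[(next_state, a)] for a in next_actions), default=0.0)
    delta = reward + next_value - q[(state, action)]  # reward prediction error
    q[(state, action)] += alpha * delta
    return delta


q = defaultdict(float)
# Hypothetical episode, experienced repeatedly: start -left-> A -right-> D ($25).
for _ in range(200):
    td_update(q, "A", "right", 25.0, "D", [])
    td_update(q, "start", "left", 0.0, "A", ["right"])

print(round(q[("start", "left")], 3))
```

At choice time the agent only retrieves `q[("start", "left")]`. If the outcome at D were devalued, the cached value would stay high until the path was experienced again, which is the sense in which the shortcut may not reflect all available information.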
model-based and model-free learning
• these ideas propose to formalize rodent work on the goal-directed vs. habitual distinction in instrumental behavior (Daw et al. 2005)
• these are most often studied in reward domain (e.g. via reward devaluation)
• but to the extent known, largely paralleled in avoidance (LeDoux & Daw 2018)
• lots to do (e.g.: Cain; Cano this meeting)
learned decision making in humans
[Figure: probability (0 to 0.5) plotted against trial (0 to 300); two panels]
“bandit” tasks
e.g. Daw et al 2006
Wittmann et al 2008
Gershman et al 2009
Schonberg et al 2010
Glascher et al 2010
Wimmer et al 2012
Seymour et al 2012
Kovach et al 2012
Madlon-Kay et al 2013
behavioral analysis: characterize the function relating outcomes to future choices (trial by trial learning model)
multinomial logistic regression: outcomes → choices
decay form characteristic of error-driven learning
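A minimal sketch of why a decaying weight over lags is characteristic of error-driven learning, assuming a simple delta rule with learning rate `alpha` and an initial value of zero (both assumptions for illustration). Unrolling the update shows that the current value is a weighted sum of past outcomes, with weight `alpha * (1 - alpha)**(k - 1)` on the outcome k trials back.

```python
def delta_rule_value(rewards, alpha):
    """Run the error-driven update v <- v + alpha * (r - v) over a history."""
    v = 0.0
    for r in rewards:
        v += alpha * (r - v)
    return v


def lag_weights(alpha, n_lags):
    """Weight on the outcome k trials back, for k = 1..n_lags."""
    return [alpha * (1 - alpha) ** (k - 1) for k in range(1, n_lags + 1)]


rewards = [1, 0, 0, 1, 1, 0]  # hypothetical outcome history, oldest first
alpha = 0.4
weights = lag_weights(alpha, len(rewards))
# Pair lag 1 with the most recent outcome, lag 2 with the one before, etc.
weighted_sum = sum(w * r for w, r in zip(weights, reversed(rewards)))

print(delta_rule_value(rewards, alpha), weighted_sum)
```

The two printed numbers agree up to floating-point error, so regressing choices on lagged outcomes should recover an exponentially decaying profile if learning is error-driven.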
[Figure: regression weights for reward, shock, and choice at lags of 1 to 6 trials back; y-axis runs from avoid (-5) to choose (+5)]
(Seymour et al. J Neuro 2012)
sequential decision task
with prob: 26% 57% 41% 28%
(all slowly changing)
(Daw et al Neuron 2011)
extend experiment to probe map learning:
is choice guided by anticipated states or previous actions?
idea
[Figure: task schematic; one transition is labeled 30%]
How does bottom-stage feedback affect top-stage choices?
Example: rare transition at top level, followed by win
• Which top-stage action is now favored?
predictions
model-free: ignores transition structure
model-based: respects transition structure
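A minimal sketch of the two predictions, assuming a single trial in which the chosen top-stage action commonly leads to bottom-stage state X and rarely to Y. The 0.7 common-transition probability, the learning rate, the 0.5 starting values, and the state names are illustrative assumptions. Each function returns the resulting preference for repeating the chosen top-stage action (positive means stay, negative means switch).

```python
COMMON = 0.7  # assumed probability of the common transition


def model_free_pref(rewarded, rare, alpha=0.5):
    """Model-free: credit the chosen action directly; `rare` is ignored."""
    q_chosen, q_other = 0.5, 0.5
    q_chosen += alpha * (float(rewarded) - q_chosen)
    return q_chosen - q_other


def model_based_pref(rewarded, rare, alpha=0.5):
    """Model-based: update the state reached, then plan through transitions."""
    v = {"X": 0.5, "Y": 0.5}
    reached = "Y" if rare else "X"  # the chosen action commonly leads to X
    v[reached] += alpha * (float(rewarded) - v[reached])
    q_chosen = COMMON * v["X"] + (1 - COMMON) * v["Y"]
    q_other = (1 - COMMON) * v["X"] + COMMON * v["Y"]
    return q_chosen - q_other


for rewarded in (True, False):
    for rare in (False, True):
        print(rewarded, rare,
              round(model_free_pref(rewarded, rare), 3),
              round(model_based_pref(rewarded, rare), 3))
```

For the example case (rare transition followed by a win), the model-free preference favors repeating the same top-stage action, while the model-based preference favors the other action, since that action is the one that commonly reaches the rewarded state.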
data
model-free model-based
individual subs x 201 trials each
(Daw et al Neuron 2011)
data
model-free model-based
17 subs x 201 trials each
(Daw et al Neuron 2011)
reward: p<1e-8
reward x rare: p<5e-5
(mixed effects logit)
results reject pure reinforcement models; suggest a mixture of planning and reinforcement processes
MB and MF with shock
[Figure: task schematic with left (L) and right (R) choices]
in prep w/ Neil Garrett, Marijn Kroes, Liz Phelps
shock: p<5e-7
shock x rare: p<.01
interference (Otto et al Psych Science, 2013)
• single task vs. dual task
• dual x model-based: p< .05
stress (Otto et al PNAS, 2013)
[Figure: model-based index (y-axis) vs. log cortisol delta (Z score) (x-axis)]
Also:
Individual differences
• Development (Decker ea, 2016)
• Aging (Eppinger ea 2013)
• IQ (Schad ea 2014; Gillan ea 2016)
• cognitive control (Otto ea 2015)
• Psychopathology (more later…)
Dopamine & PFC
• PFC TMS (Smittenaar ea 2013)
• COMT (PFC DA) genotype (Doll ea 2016)
• Parkinson’s disease & DA meds (Sharp ea 2016; Wunderlich ea 2012)
• dopamine PET (Deserno ea 2015)
what are the neural mechanisms underlying MB evaluation?
Is model-based learning really decision by simulation?
decodable stimuli
(Doll, Duncan, Simon, Shohamy & Daw Nature Neuroscience 2015)
[Figure: RPE (ventral putamen): putamen prediction error vs. behavior (MB to MF), P<.01]
(Doll, Duncan, Simon, Shohamy & Daw Nature Neuroscience 2015)
[Figure: prospection (category selective ctx): prospective activation vs. behavior (MB to MF), P=.02]
Signatures of two dissociable neural evaluation mechanisms
1. forward search
2. error-driven updating
which have the expected relationships to choice behavior
Is model-based learning related to disorders of compulsion?
Binge eating disorder, n=30
Healthy volunteers, n=106
OCD, n=35
Stimulant abusers, n=36 (methamphetamine/cocaine; abstinent at least 1 wk)
(Voon et al., Biological Psychiatry, 2014)
3 questions
1) what to make of inflexible goal-pursuit (like anorexia nervosa)?
2) are decision making effects actually acute due to illness?
3) are patients MB for object of compulsion?
what causes MB/MF imbalance in AN?
idea: food restriction behaviors are like avoidance habits
• 2-factor theory: avoidance habits can only be reinforced if safety is reframed as goal
• suggestion: AN are particularly prone to such reframing
• preliminary evidence from Palminteri et al. (2015) reframing task
prelim w/ Karin Foerde, Daphna Shohamy, Joanna Steinglass
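[Figure: reframing task with train and probe phases; the probe pairs an option that is objectively better but was worse in its training frame with an option that is objectively worse but was better in its training frame]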
anxiety: a puzzle and a model
• anxiety disorders are characterized by persistent and overgeneralized fear and avoidance
• why should this be, given that avoidance is protective?
• in models, approach propagates opportunity and avoidance contains danger, due to the assumption that you will avoid in future
in prep w/ Sam Zorowitz
suggestion
• in general, sequential evaluation requires assumptions about future events
• suggestion: a core dysfunction in anxiety is pessimistic expectations about future choices
in prep w/ Sam Zorowitz
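A minimal sketch of how the assumption about future choices shapes evaluation, using a hypothetical corridor that ends in a danger state. At each step the agent can move forward or retreat safely. The `pessimism` parameter (the assumed probability that the future self takes the worst action rather than the best) is an illustrative device, not the model from the work in preparation.

```python
def forward_values(n_states, punishment, pessimism):
    """Q(s, forward) for each non-terminal state of a corridor ending in danger.

    pessimism = assumed probability that the future self picks the worst
    action instead of the best one (0 recovers the standard max).
    Index 0 of the result is the state farthest from the danger.
    """
    value_next = punishment           # value of the terminal danger state
    q_forward = []
    for _ in range(n_states - 1):     # sweep backward from the danger state
        q_fwd = value_next            # forward: no immediate reward, next state
        q_retreat = 0.0               # retreat: always safe
        q_forward.append(q_fwd)
        best, worst = max(q_fwd, q_retreat), min(q_fwd, q_retreat)
        value_next = (1 - pessimism) * best + pessimism * worst
    return list(reversed(q_forward))


print(forward_values(5, -10.0, 0.0))
print(forward_values(5, -10.0, 0.5))
```

With `pessimism = 0`, only the step directly into the danger state has negative value, so danger is contained. With `pessimism = 0.5`, negative value spreads to earlier steps, which resembles overgeneralized avoidance.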
consequences
idea ties together many disparate aspects of anxiety
• overgeneralization of avoidance
• control & self-efficacy
• transition to depression
• unbalanced approach-avoid conflict (e.g. BART)
prediction
• attenuated (or reversed) free-choice bias (e.g. Leotti & Delgado 2011)
in prep w/ Sam Zorowitz
recap
• psychopathology may reflect dysfunction of the underlying evaluation and choice mechanisms
• compulsion & MB/MF imbalance
• anorexia and anxiety potentially reflecting more unique aspects of avoidance
stress and opportunity cost
Why does stress favor habits?
How can we reason formally about the range of effects of the stress response?
• so far: transient, action- or stimulus-linked evaluations
• also: more global evaluations
  • stress, mood, schemas, tonic neuromodulators
  • average reward, controllability, priors
[Figure: model-based index (y-axis) vs. log cortisol delta (Z score) (x-axis)]
(Otto et al PNAS, 2013)
opportunity cost of inaction
• deliberation can improve rewards (better choices)
• but takes time (delaying rewards, failing to avoid punishments)
• in appetitive circumstances, the opportunity cost of inaction is proportional to the average reward of the environment (Niv et al., 2007)
deliberation should be modulated by long-run average reward in the environment (Keramati et al., 2011)
also the average opportunity to avoid (Cools et al. 2011; Dayan 2012)
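A minimal sketch of the opportunity-cost argument in the appetitive case, assuming the cost of time is simply the average reward rate multiplied by the time spent deliberating. The function name, arguments, and numbers are hypothetical.

```python
def should_deliberate(expected_gain, deliberation_time, average_reward_rate):
    """Deliberate only if the expected improvement exceeds the reward forgone."""
    opportunity_cost = average_reward_rate * deliberation_time
    return expected_gain > opportunity_cost


# Same decision problem, evaluated in a poor and in a rich environment.
print(should_deliberate(expected_gain=1.0, deliberation_time=2.0,
                        average_reward_rate=0.2))
print(should_deliberate(expected_gain=1.0, deliberation_time=2.0,
                        average_reward_rate=0.8))
```

Under this rule, the same deliberation is worthwhile when the average reward rate is low and not worthwhile when it is high, so a richer environment shifts the balance toward fast, cached (habitual) responding.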
Same basic logic plays out across:
• Decisions
  • vigor (Niv et al 2007)
  • foraging (Charnov 1976)
  • speed-accuracy tradeoffs (Otto & Daw 2018)
  • time discounting (Kacelnik)
• Meta-decisions / control (Boureau et al. 2015)
  • deliberation (Keramati et al. 2011)
  • action chunking (Dezfouli & Balleine 2012)
  • cognitive effort, ego depletion (Kurzban; Shenhav)
  • thresholds for signal detection / DDMs (Gold & Shadlen 2003)
  • explore/exploit
… the average reward as a ubiquitous decision variable
Charnov (1976); Stephens & Krebs (1986)
• serially visit reward patches
• choose to harvest or exit
• harvesting earns diminishing rewards
• exiting leads to a new patch (takes time; no going back)
principle of lost opportunity
• balance between reward and opportunity cost of harvesting
• many problems can be expressed in this stay-switch form
patch foraging
[Figure: apples per harvest over time within a patch, with a line marking the average reward per harvest (opportunity cost of foraging)]
marginal value theorem: exit when $\text{next reward} < \text{average reward}$
Charnov (1976); Stephens & Krebs (1986)
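A minimal sketch of the marginal value theorem in a deterministic patch, assuming each harvest multiplies the reward by a fixed depletion factor and every patch starts at the same reward. The starting reward of 10 and the 3 s harvest time are illustrative assumptions; the brute-force search over the number of harvests per patch is used only to check the exit rule.

```python
def reward_rate(n_harvests, r0, depletion, harvest_time, travel_time):
    """Long-run reward per second when harvesting n times in every patch."""
    total = sum(r0 * depletion ** k for k in range(n_harvests))
    return total / (n_harvests * harvest_time + travel_time)


def best_policy(r0, depletion, harvest_time, travel_time, max_harvests=200):
    """Return (n, rate) maximizing the long-run reward rate by brute force."""
    rates = [(reward_rate(n, r0, depletion, harvest_time, travel_time), n)
             for n in range(1, max_harvests + 1)]
    rate, n = max(rates)
    return n, rate


R0, DEPLETION, HARVEST_TIME = 10.0, 0.89, 3.0  # illustrative values
n_short, rate_short = best_policy(R0, DEPLETION, HARVEST_TIME, travel_time=6.0)
n_long, rate_long = best_policy(R0, DEPLETION, HARVEST_TIME, travel_time=13.5)

for n, rate in ((n_short, rate_short), (n_long, rate_long)):
    next_reward_rate = R0 * DEPLETION ** n / HARVEST_TIME
    print(n, round(rate, 3), round(next_reward_rate, 3))
```

At the best policy, the reward rate of the next (unharvested) step falls below the average reward rate, which is the exit rule. A longer travel time lowers the average rate, so the exit threshold drops and the forager stays longer in each patch.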
patch foraging in undergraduates
[Figure: task timeline of stay/exit decisions separated by waits (3 s and 9 s shown)]
Constantino & Daw 2015
predictions: travel time
[Figure: predicted apples per harvest vs. time (two panels); sample subject's exit threshold (mean last reward) vs. time (minutes), for long (13.5 s) and short (6 s) travel time]
Constantino & Daw 2015
predictions: depletion rate
[Figure: predicted apples per harvest vs. time (two panels); sample subject's exit threshold (mean last reward) vs. time (minutes), for steep (.68) and shallow (.89) depletion rate]
Constantino & Daw 2015
group data
[Figure: exit threshold (mean last reward) by condition. Travel time: long vs. short (n=11); depletion rate: steep vs. shallow (n=11)]
Constantino & Daw 2015
chronic stress
(Lenow, Constantino, Daw & Phelps J Neurosci 2017)
(p < .01)
[Figure: y-axis runs from underharvesting to overharvesting]
acute stress
(p < .05)
[Figure: y-axis runs from underharvesting to overharvesting]
(Lenow, Constantino, Daw & Phelps J Neurosci 2017)
stress and evaluation
• global decision variables like average reward might provide a more fundamental interpretation for a range of stress effects
• including effects on MB/MF tradeoff, via opportunity cost of time
• not fully worked out for aversive events and avoidance (probably most relevant to stress)
conclusions
how we compute decision variables influences what we choose
two strategies for computing decision variables underlying goal-habit conflict
• distinct neural and behavioral signatures
• links to psychopathology, e.g. compulsion
principles and mechanisms likely equally applicable to avoidance
• though much still to explore
asymmetries between approach and avoidance also important
• famous: two-factor theory
• novel: different effects in sequential behavior, anxiety
Lab:
Ida Momennejad (now Columbia)
Ross Otto (now McGill)
Claire Gillan (now TCD)
Brad Doll (now about.com)
Sara Constantino (now Princeton)
Dylan Rich
Marcelo Mattar
Lindsay Hunter
Evan Russek
Oliver Vikbladh
Neil Garrett
Kevin Lloyd
Flora Bouchacourt
Sam Zorowitz
Collaborators:
Ken Norman
Liz Phelps
Sam Gershman
Daphna Shohamy
Valerie Voon
Jennifer Lenow
Joanna Steinglass
Peter Dayan
Yael Niv
Deborah Talmi
Ming Hsu
Mate Lengyel
Karin Foerde
Funding:
NIMH, NIDA, NINDS, NSF, McDonnell Foundation, Templeton Foundation, US DOD, Google DeepMind