action evaluation in appetitive and aversive learning

nathaniel daw

princeton university

leuven, 2019

learning for decisions

• how you compute has important consequences for what you choose
• eg: which of many possible outcomes you consider
• better worked out in appetitive domain but likely extend to avoidance

algorithms for computing expected utility over candidate actions

1. habits vs deliberation: model-based vs. model-free RL

2. psychiatry: disorders involving compulsivity and avoidance

3. stress and opportunity cost

sequential decision tasks

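\[ Q(s_t, a_t) = r(s_t) + \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t) \Big[ r(s_{t+1}) + \sum_{s_{t+2}} \cdots \Big] \]

[Figure: decision tree with top-level actions A and B, second-level actions C, D, E and F, and outcomes $0, $25 and $10]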

Markov decision process: consequences of actions are delayed, contingent
• connect actions to consequences over space and time
• hard to estimate; hard to learn; “temporal credit assignment”
• maximizing utility unites both seeking reward and avoiding punishment

“model-based” learning

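\[ Q(s_t, a_t) = r(s_t) + \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t) \Big[ r(s_{t+1}) + \sum_{s_{t+2}} \cdots \Big] \]

[Figure: decision tree with top-level actions A and B, second-level actions C, D, E and F, and outcomes $0, $25 and $10]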

• learn one-step reward & transition “map”
• iterative, tree-structured computation
• hippocampal “preplay”? (Mattar & Daw 2018)

(Pfeiffer and Foster, 2013)
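The tree-structured computation can be sketched as a recursive evaluation over a learned one-step map. This is a minimal illustration, not the talk's own implementation: the tree layout, the assignment of payoffs to leaves, the deterministic transitions, and the names `TRANSITIONS`, `REWARDS`, `actions` and `q_value` are all assumptions. Taking the maximum over next-stage actions is also an assumption, since the slide's equation elides the later terms.

```python
# Hypothetical one-step "map": transition probabilities and state rewards.
# Layout and payoffs are illustrative assumptions, not taken from the talk.
TRANSITIONS = {
    ("start", "A"): {"left": 1.0},
    ("start", "B"): {"right": 1.0},
    ("left", "C"): {"end_C": 1.0},
    ("left", "D"): {"end_D": 1.0},
    ("right", "E"): {"end_E": 1.0},
    ("right", "F"): {"end_F": 1.0},
}
REWARDS = {"end_C": 0.0, "end_D": 25.0, "end_E": 10.0, "end_F": 0.0}


def actions(state):
    """Actions available in a state (empty list for terminal states)."""
    return [a for (s, a) in TRANSITIONS if s == state]


def q_value(state, action):
    """Q(s, a) = r(s) + sum over s' of P(s' | s, a) * (value of s')."""
    total = REWARDS.get(state, 0.0)
    for next_state, prob in TRANSITIONS[(state, action)].items():
        next_actions = actions(next_state)
        if next_actions:
            # Assumption: the agent expects to pick the best action later on.
            future = max(q_value(next_state, a) for a in next_actions)
        else:
            future = REWARDS.get(next_state, 0.0)
        total += prob * future
    return total


q_a = q_value("start", "A")
q_b = q_value("start", "B")
```

Under these assumed payoffs the search favors A at the top level, because the best leaf under A is worth more than the best leaf under B. Every choice requires re-walking the tree, which is the cost that the cached alternative avoids.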

“model-free” learning

[Figure: actions A and B labeled $25 and $10]

\[ Q(s_t, a_t) = r(s_t) + \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t) \Big[ r(s_{t+1}) + \sum_{s_{t+2}} \cdots \Big] \]

shortcut: store endpoints of computation (long-run action values)
• these can be learned directly, “model free” (TD learning)
• simplifies choice-time computation (just retrieve), but may not reflect all available information
• standard theory of dopamine, reward prediction errors etc

(Schultz et al 1997)
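A minimal sketch of learning long-run action values directly from experience, using tabular Q-learning as one common TD variant (the talk does not specify which variant it has in mind). The environment, its payoffs, the learning rate, and the names `STEP`, `ACTIONS` and `td_learn` are illustrative assumptions; rewards are attached to transitions here for brevity.

```python
import random

# Hypothetical environment: (state, action) -> (reward, next_state).
# A next_state of None ends the episode. Payoffs are assumptions.
STEP = {
    ("start", "A"): (0.0, "left"),
    ("start", "B"): (0.0, "right"),
    ("left", "C"): (0.0, None),
    ("left", "D"): (25.0, None),
    ("right", "E"): (10.0, None),
    ("right", "F"): (0.0, None),
}
ACTIONS = {"start": ["A", "B"], "left": ["C", "D"], "right": ["E", "F"]}


def td_learn(episodes=2000, alpha=0.1, seed=0):
    """Learn cached action values from sampled experience, with no model."""
    rng = random.Random(seed)
    q = {key: 0.0 for key in STEP}
    for _ in range(episodes):
        state = "start"
        while state is not None:
            action = rng.choice(ACTIONS[state])  # random exploration
            reward, next_state = STEP[(state, action)]
            if next_state is None:
                target = reward
            else:
                target = reward + max(q[(next_state, a)] for a in ACTIONS[next_state])
            # The reward prediction error drives the update.
            delta = target - q[(state, action)]
            q[(state, action)] += alpha * delta
            state = next_state
    return q


q = td_learn()
```

At choice time the learner only looks up `q[("start", "A")]` and `q[("start", "B")]`. If a leaf payoff changed, the top-level values would stay stale until new experience propagated the change back, which is one sense in which cached values may not reflect all available information.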

model-based and model-free learning
• these ideas propose to formalize rodent work on the goal-directed vs habitual distinction in instrumental behavior (Daw et al. 2005)

• these are most often studied in reward domain (e.g. via reward devaluation)

• but to the extent known, largely paralleled in avoidance (LeDoux & Daw 2018)

• lots to do (e.g.: Cain; Cano this meeting)

learned decision making in humans

[Figure: probability (0 to 0.5) plotted against trial (0 to 300)]

“bandit” tasks, e.g.:

Daw et al 2006

Wittmann et al 2008

Gershman et al 2009

Schonberg et al 2010

Glascher et al 2010

Wimmer et al 2012

Seymour et al 2012

Kovach et al 2012

Madlon-Kay et al 2013

behavioral analysis: characterize the function relating outcomes to future choices (trial by trial learning model)

multinomial logistic regression: outcomes → choices

decay form characteristic of error-driven learning
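Why error-driven learning implies a decaying influence of past outcomes can be checked numerically. The sketch below assumes a simple delta rule with a fixed learning rate `alpha` and an initial value of zero; the function names and the example outcome sequence are hypothetical. Unrolling the update shows that the outcome at lag k (k = 1 is the most recent) carries weight `alpha * (1 - alpha) ** (k - 1)`.

```python
def delta_rule(outcomes, alpha):
    """Apply V <- V + alpha * (outcome - V) to each outcome, starting at V = 0."""
    value = 0.0
    for outcome in outcomes:
        value += alpha * (outcome - value)
    return value


def lag_weights(alpha, n_lags):
    """Weight on the outcome at lag k, for k = 1 (most recent) to n_lags."""
    return [alpha * (1 - alpha) ** (k - 1) for k in range(1, n_lags + 1)]


outcomes = [1, 0, 0, 1, 1, 0]  # oldest first; illustrative data
alpha = 0.3
weights = lag_weights(alpha, len(outcomes))
# Pair the most recent outcome with the lag-1 weight, and so on backwards.
weighted_sum = sum(w * o for w, o in zip(weights, reversed(outcomes)))
learned_value = delta_rule(outcomes, alpha)
```

The two quantities agree, so regression weights that fall off geometrically with lag are what a delta-rule learner would produce.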

[Figure: regression weights (avoid ← → choose, −5 to 5) by lag (trials, −1 to −6); panels: reward, shock, choice]

(Seymour et al. J Neuro 2012)

sequential decision task

with prob: 26% 57% 41% 28% (all slowly changing)

(Daw et al Neuron 2011)

extend experiment to probe map learning:

is choice guided by anticipated states or previous actions?

idea

30%

How does bottom-stage feedback affect top-stage choices?

Example: rare transition at top level, followed by win

• Which top-stage action is now favored?

predictions

model-free: ignores transition structure

model-based: respects transition structure
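The two predictions can be written down as the direction in which one trial's outcome shifts the tendency to repeat the top-stage choice. This is a simplified sketch rather than the fitted model from the paper: it assumes common transitions occur with probability 0.7 (so rare ones with 0.3), and `mf_stay_change` and `mb_stay_change` are hypothetical names.

```python
P_COMMON = 0.7  # assumption: rare transitions occur 30% of the time


def mf_stay_change(rewarded, rare):
    """Model-free: reward reinforces the chosen action, whatever the transition."""
    return 1.0 if rewarded else -1.0


def mb_stay_change(rewarded, rare):
    """Model-based: reward changes the value of the bottom-stage state reached.

    The chosen top-stage action gains only in proportion to how much more
    likely it is than the other action to lead back to that state.
    """
    sign = 1.0 if rewarded else -1.0
    p_chosen = (1 - P_COMMON) if rare else P_COMMON
    p_other = 1 - p_chosen
    return sign * (p_chosen - p_other)


table = {
    (rewarded, rare): (mf_stay_change(rewarded, rare), mb_stay_change(rewarded, rare))
    for rewarded in (True, False)
    for rare in (False, True)
}
```

For the example case, a rare transition followed by a win, the model-free entry is positive (repeat the same top-stage action) while the model-based entry is negative (switch to the other action, which more reliably reaches the rewarded state). This is the reward x rare interaction that separates the two accounts.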

data

model-free model-based

individual subs x 201 trials each

(Daw et al Neuron 2011)

data

model-free model-based

17 subs x 201 trials each

(Daw et al Neuron 2011)

reward: p<1e-8; reward x rare: p<5e-5 (mixed effects logit)

results reject pure reinforcement models; suggest mixture of planning and reinforcement processes


MB and MF with shock

[Figure: task schematic with left (L) and right (R) options]

in prep w/ Neil Garrett, Marijn Kroes, Liz Phelps

shock: p<5e-7; shock x rare: p<.01

interference (Otto et al Psych Science, 2013)

[Figure: single task vs. dual task; dual x model-based: p< .05]

stress (Otto et al PNAS, 2013)

[Figure: model-based (y-axis) vs. log cortisol delta (Z score) (x-axis)]

Also:

Individual differences
• Development (Decker ea, 2016)
• Aging (Eppinger ea 2013)
• IQ (Schad ea 2014; Gillan ea 2016)
• cognitive control (Otto ea 2015)
• Psychopathology (more later…)

Dopamine & PFC
• PFC TMS (Smittenaar ea 2013)
• COMT (PFC DA) genotype (Doll ea 2016)
• Parkinson’s disease & DA meds (Sharp ea 2016; Wunderlich ea 2012)
• dopamine PET (Deserno ea 2015)

what are the neural mechanisms underlying MB evaluation?

Is model-based learning really decision by simulation?

decodable stimuli

(Doll, Duncan, Simon, Shohamy & Daw Nature Neuroscience 2015)

[Figure: RPE (ventral putamen); putamen prediction error vs. behavior (MB, MF); P<.01]

(Doll, Duncan, Simon, Shohamy & Daw Nature Neuroscience 2015)

[Figure: prospection (category selective ctx); prospective activation vs. behavior (MB, MF); P=.02]

Signatures of two dissociable neural evaluation mechanisms

1. forward search
2. error-driven updating

which have the expected relationships to choice behavior

Is model-based learning related to disorders of compulsion?

Binge eating disorder, n=30

Healthy volunteers, n=106

OCD, n=35

Stimulant abusers, n=36 (methamphetamine/cocaine; abstinent at least 1 wk)

(Voon et al., Biological Psychiatry, 2014)

3 questions

1) what to make of inflexible goal-pursuit (like anorexia nervosa)?

2) are decision making effects actually acute due to illness?

3) are patients MB for object of compulsion?

what causes MB/MF imbalance in AN?

idea: food restriction behaviors are like avoidance habits
• 2-factor theory: avoidance habits can only be reinforced if safety is reframed as goal

• suggestion: AN are particularly prone to such reframing

• preliminary evidence from Palminteri et al. (2015) reframing task

prelim w/ Karin Foerde, Daphna Shohamy, Joanna Steinglass

train:

probe:

objectively better, worse in training frame

objectively worse, better in training frame

anxiety: a puzzle and a model
• anxiety disorders are characterized by persistent and overgeneralized fear and avoidance
• why should this be, given that avoidance is protective?
• in models, approach propagates opportunity and avoidance contains danger: due to assumption you will avoid in future

in prep w/ Sam Zorowitz

suggestion
• in general, sequential evaluation requires assumptions about future events
• suggestion: a core dysfunction in anxiety is pessimistic expectations about future choices

in prep w/ Sam Zorowitz
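One way to see how the assumption about future choices matters is a toy corridor with a shock at the far end. This is only an illustrative sketch, not the model from the work in preparation: the corridor, the shock size, the `pessimism` weight that mixes best-case and worst-case future choices, and the name `corridor_values` are all assumptions.

```python
def corridor_values(n_states=5, shock=-10.0, pessimism=0.0):
    """State values in a corridor where moving forward from the last state is shocked.

    In every state the agent can 'stop' (worth 0) or move 'forward'.
    pessimism = 0 assumes the best action will be taken in future states;
    pessimism = 1 assumes the worst action will be taken.
    """
    values = [0.0] * n_states
    next_value = 0.0  # value once the corridor has ended
    for s in reversed(range(n_states)):
        reward = shock if s == n_states - 1 else 0.0
        q_forward = reward + next_value
        q_stop = 0.0
        best = max(q_forward, q_stop)
        worst = min(q_forward, q_stop)
        values[s] = (1 - pessimism) * best + pessimism * worst
        next_value = values[s]
    return values


optimistic = corridor_values(pessimism=0.0)
pessimistic = corridor_values(pessimism=0.5)
```

With `pessimism=0.0` every state is worth zero: the danger is contained because the agent expects to stop in time. With `pessimism=0.5` negative value leaks backwards through the whole corridor, so even states far from the shock look worth avoiding, which resembles overgeneralized avoidance.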

consequences

idea ties together many disparate aspects of anxiety
• overgeneralization of avoidance
• control & self-efficacy
• transition to depression
• unbalanced approach-avoid conflict (eg BART)

prediction
• attenuated (or reversed) free-choice bias (eg Leotti & Delgado 2011)

in prep w/ Sam Zorowitz

recap

• psychopathology may reflect dysfunction of underlying evaluation choice mechanisms

• compulsion & MB/MF imbalance

• anorexia and anxiety potentially reflecting more unique aspects of avoidance

stress and opportunity cost

Why does stress favor habits?

How can we reason formally about the range of effects of the stress response?

• so far: transient, action- or stimulus-linked evaluations
• also: more global evaluations
  • stress, mood, schemas, tonic neuromodulators
  • average reward, controllability, priors

[Figure: model-based (y-axis) vs. log cortisol delta (Z score) (x-axis)]

(Otto et al PNAS, 2013)

opportunity cost of inaction

• deliberation can improve rewards (better choices)

• but takes time (delaying rewards, failing to avoid punishments)

• in appetitive circumstances, the opportunity cost of inaction is proportional to the average reward of the environment (Niv et al., 2007)

deliberation should be modulated by long-run average reward in the environment (Keramati et al., 2011)

also the average opportunity to avoid (Cools et al. 2011; Dayan 2012)
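The tradeoff can be sketched as a simple comparison: deliberate only when the expected improvement in outcome exceeds what the time spent would otherwise have earned. This is a hedged illustration of the logic, not the cited models themselves; `should_deliberate` and its parameters are hypothetical names, and the expected gain is assumed to be known.

```python
def should_deliberate(expected_gain, avg_reward_rate, deliberation_time):
    """Deliberate only if the expected gain beats the opportunity cost of time.

    expected_gain: assumed improvement in outcome from planning over habit.
    avg_reward_rate: long-run average reward per unit time in the environment.
    deliberation_time: time that planning would take.
    """
    opportunity_cost = avg_reward_rate * deliberation_time
    return expected_gain > opportunity_cost


poor_environment = should_deliberate(expected_gain=1.0, avg_reward_rate=0.5, deliberation_time=1.0)
rich_environment = should_deliberate(expected_gain=1.0, avg_reward_rate=2.0, deliberation_time=1.0)
```

The same expected gain justifies deliberation when the average reward rate is low but not when it is high, because time is worth more in the richer environment.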

Same basic logic plays out across:

• Decisions
  • vigor (Niv et al 2007)
  • foraging (Charnov 1976)
  • speed-accuracy tradeoffs (Otto & Daw 2018)
  • time discounting (Kacelnik)

• Meta-decisions / control (Boureau et al. 2015)
  • deliberation (Keramati et al. 2011)
  • action chunking (Dezfouli & Balleine 2012)
  • cognitive effort, ego depletion (Kurzban; Shenhav)
  • thresholds for signal detection / DDMs (Gold & Shadlen 2003)
  • explore/exploit

… the average reward as a ubiquitous decision variable

Charnov (1976); Stephens & Krebs (1986)

• serially visit reward patches
• choose to harvest or exit
• harvesting earns diminishing rewards
• exiting leads to a new patch (takes time; no going back)

principle of lost opportunity
• balance between reward and opportunity cost of harvesting
• many problems can be expressed in this stay-switch form

patch foraging

[Figure: apples per harvest vs. time, with a line marking the average reward per harvest (opportunity cost of foraging)]

marginal value theorem: exit when next reward < average reward

Charnov (1976); Stephens & Krebs (1986)
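The exit rule can be illustrated with a small deterministic patch model. This is a sketch under stated assumptions rather than the task as run: the starting yield of 10 apples, the deterministic geometric depletion, the 3 s harvest time, and the names `reward_rate`, `best_n` and `exit_threshold` are hypothetical. The travel times and depletion rate are borrowed from the values shown elsewhere in the deck.

```python
def reward_rate(n_harvests, travel_time, start=10.0, depletion=0.89, harvest_time=3.0):
    """Long-run apples per second when harvesting each patch n_harvests times."""
    total = sum(start * depletion ** i for i in range(n_harvests))
    return total / (n_harvests * harvest_time + travel_time)


def best_n(travel_time, max_n=50):
    """Number of harvests per patch that maximizes the long-run reward rate."""
    return max(range(1, max_n + 1), key=lambda n: reward_rate(n, travel_time))


def exit_threshold(n_harvests, start=10.0, depletion=0.89):
    """Yield of the first harvest that is skipped, i.e. the reward at which to exit."""
    return start * depletion ** n_harvests


n_short = best_n(travel_time=6.0)
n_long = best_n(travel_time=13.5)
```

Longer travel lowers the achievable average reward rate, so the best policy stays longer in each patch and exits at a lower threshold. In this sketch the next harvest at the optimal exit point yields less than the average reward earned over one harvest period, and the last harvest taken yields more, which is the marginal value theorem condition.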

patch foraging in undergraduates

[Figure: task timeline alternating decisions (stay or exit) with waits (3s, 9s)]

Constantino & Daw 2015

predictions: travel time

[Figure: predicted apples per harvest vs. time, for long (13.5s) and short (6s) travel time]

[Figure: sample subject; exit threshold (mean last reward) vs. time (minutes), by travel time: long (13.5s), short (6s)]

Constantino & Daw 2015

predictions: depletion rate

[Figure: predicted apples per harvest vs. time, for steep (.68) and shallow (.89) depletion rate]

[Figure: sample subject; exit threshold (mean last reward) vs. time (minutes), by depletion rate: steep (.68), shallow (.89)]

Constantino & Daw 2015

group data

[Figure: exit threshold (mean last reward) by travel time (long vs. short; n=11) and by depletion rate (steep vs. shallow; n=11)]

Constantino & Daw 2015

chronic stress

[Figure: overharvesting vs. underharvesting (p < .01)]

acute stress

[Figure: overharvesting vs. underharvesting (p < .05)]

(Lenow, Constantino, Daw & Phelps J Neurosci 2017)

stress and evaluation

• global decision variables like average reward might provide a more fundamental interpretation for a range of stress effects

• including effects on MB/MF tradeoff, via opportunity cost of time

• not fully worked out for aversive events and avoidance (probably most relevant to stress)

conclusions

how we compute decision variables influences what we choose

two strategies for computing decision variables underlying goal-habit conflict

• distinct neural and behavioral signatures
• links to psychopathology eg compulsion

principles and mechanisms likely equally applicable to avoidance

• though much still to explore

asymmetries between approach and avoidance also important

• famous: two-factor theory
• novel: different effects in sequential behavior, anxiety

Lab:

Ida Momennejad (now Columbia)

Ross Otto (now McGill)

Claire Gillan (now TCD)

Brad Doll (now about.com)

Sara Constantino (now Princeton)

Dylan Rich

Marcelo Mattar

Collaborators:

Ken Norman

Liz Phelps

Sam Gershman

Daphna Shohamy

Valerie Voon

Jennifer Lenow

Joanna Steinglass

Funding:

NIMH, NIDA, NINDS, NSF, McDonnell Foundation, Templeton Foundation, US DOD, Google DeepMind

Lindsay Hunter

Evan Russek

Oliver Vikbladh

Neil Garrett

Kevin Lloyd

Flora Bouchacourt

Sam Zorowitz

Peter Dayan

Yael Niv

Deborah Talmi

Ming Hsu

Mate Lengyel

Karin Foerde
