Actor-Critic models: from ventral striatal reward-related activity to robotics simulations

Intro | Electrophysiology | Modelling | Discussion

slide # 1 / 59

Actor-Critic models: from ventral striatal reward-related activity to robotics simulations.

Dr. Mehdi Khamassi (1, 2)

(1) LPPA, UMR CNRS 7152, Collège de France, Paris

(2) AnimatLab-LIP6 / SIMA-ISIR, Université Pierre et Marie Curie, Paris 6

slide # 2 / 59

OBJECTIVE

Help understand how mammals adapt their behavior to maximize the reward they obtain from the environment.

Help understand the brain mechanisms underlying these cognitive processes.

slide # 3 / 59

OBJECTIVE

Challenging goal: different levels of decision, different learning processes, different types of representation

Multidisciplinary approach:

• Behavioral neurophysiology

• Computational modelling

• Autonomous robotics

slide # 4 / 59

ACTOR-CRITIC MODEL

CRITIC: learns to predict reward

ACTOR: learns to select actions

• Developed in the AI community (reinforcement learning, RL)

• Explains some reward-seeking behaviors

• Resembles some parts of the brain (dopaminergic neurons and striatum)

slide # 5 / 59

Outline

1. Introduction: How does an Actor-Critic model work?

2. Electrophysiology: Reward predictions in the rat ventral striatum

3. Computational modelling: An Actor-Critic model in a simulated robot

4. Discussion

slide # 6 / 59

The Actor-Critic model

• Learning from reward

[Figure: a sequence of five actions (1 to 5) leading to a reward]

slide # 7 / 59


• Learning from reward

[Figure: a sequence of five actions (1 to 5) leading to a reward; labels: reward, reinforcement]

slide # 8 / 59


• Learning from reward

[Figure: a sequence of five actions (1 to 5) leading to a reward; labels: reward, reinforcement]

Reward prediction: P_{t-1}

Rescorla and Wagner (1972).
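
A minimal sketch of the Rescorla and Wagner (1972) rule cited on this slide, assuming the reinforcement is the reward minus the previous prediction P_{t-1}, and that the prediction then moves toward the reward by a fraction of that difference. The function name, the learning rate `alpha` and the trial sequence are illustrative assumptions, not taken from the slides.

```python
def rescorla_wagner(rewards, alpha=0.2, p0=0.0):
    """Return (predictions, reinforcements), one entry per trial."""
    p = p0
    predictions, reinforcements = [], []
    for r in rewards:
        delta = r - p              # reinforcement = reward - previous prediction
        p = p + alpha * delta      # move the prediction toward the reward
        reinforcements.append(delta)
        predictions.append(p)
    return predictions, reinforcements


# Hypothetical experiment: the same reward (1.0) is delivered on 50 trials.
preds, deltas = rescorla_wagner([1.0] * 50)
```

Under these assumptions the reinforcement starts at 1 on the first trial and shrinks as the prediction approaches the reward.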

slide # 9 / 59


• Temporal-Difference (TD) learning

[Figure: a sequence of five actions (1 to 5) leading to a reward; labels: reward, reinforcement]

Reward predictions: P_{t-1}, P_t

Reinforcement: ȓ

Sutton and Barto (1998).
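
A hedged sketch of TD learning on the five-action sequence, assuming the standard TD error of Sutton and Barto (1998), ȓ = reward + γ·P_t − P_{t-1}. The function name, learning rate, discount factor and number of episodes are assumptions chosen for illustration.

```python
def td_chain(n_steps=5, n_episodes=200, alpha=0.1, gamma=1.0):
    """Learn one reward prediction per step of a fixed action sequence."""
    # P[t] is the prediction held after t actions; P[n_steps] is terminal and stays 0.
    P = [0.0] * (n_steps + 1)
    for _ in range(n_episodes):
        for t in range(1, n_steps + 1):
            reward = 1.0 if t == n_steps else 0.0      # reward only after the last action
            delta = reward + gamma * P[t] - P[t - 1]   # TD reinforcement signal
            P[t - 1] += alpha * delta
    return P


after_one_episode = td_chain(n_episodes=1)
after_training = td_chain()
```

After a single episode only the step just before the reward carries a prediction; with more episodes the prediction propagates back toward the first action.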

slide # 10 / 59


• Analogy with dopaminergic neurons

[Figure: dopaminergic neuron recordings; labels: R, S, reward, reinforcement; reinforcement value: +1]

Romo & Schultz (1990); Houk et al. (1995); Schultz et al. (1997).

slide # 11 / 59


• Analogy with dopaminergic neurons

[Figure: dopaminergic neuron recordings; labels: R, S, reward, reinforcement; reinforcement value: +1]

slide # 12 / 59



[Figure: dopaminergic neuron recordings; labels: R, S, reward, reinforcement; reinforcement value: 0]

slide # 13 / 59



[Figure: dopaminergic neuron recordings; labels: R, S, reward, reinforcement; reinforcement value: -1]
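
The three reinforcement values shown on these slides (+1, 0, -1) can be reproduced with the TD error, assuming they correspond to an unpredicted reward, a fully predicted reward and an omitted predicted reward. That mapping and all names below are assumptions for illustration.

```python
def td_error(reward, p_prev, p_curr, gamma=1.0):
    """TD reinforcement: reward + gamma * current prediction - previous prediction."""
    return reward + gamma * p_curr - p_prev


# Reward arrives but was not predicted.
unpredicted = td_error(reward=1.0, p_prev=0.0, p_curr=0.0)   # 1.0
# Reward arrives and was fully predicted.
predicted = td_error(reward=1.0, p_prev=1.0, p_curr=0.0)     # 0.0
# Reward was predicted but is omitted.
omitted = td_error(reward=0.0, p_prev=1.0, p_curr=0.0)       # -1.0
```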

slide # 14 / 59


Actor-Critic models

[Figure: Actor-Critic architecture; label: dopaminergic neuron]

Barto (1995); Houk et al. (1995); Montague et al. (1996); Schultz et al. (1997); Berns and Sejnowski (1996); Suri and Schultz (1999); Doya (2000); Suri et al. (2001); Baldassarre (2002). See Joel et al. (2002) for a review.

slide # 15 / 59


Actor-Critic models

[Figure: Actor-Critic architecture with a dopaminergic neuron; reward predictions: P = 0, P = 0, P = 0, P = 0; rewards: r = 0, r = 1; other labels: L, E]

slide # 16 / 59


Actor-Critic models

[Figure: Actor-Critic architecture with a dopaminergic neuron; reward predictions: P = 0, P = 0, P = 0, P = 1; rewards: r = 0, r = 1; other labels: L, E, 1, 1]

slide # 17 / 59


Actor-Critic models

[Figure: Actor-Critic architecture with a dopaminergic neuron; reward predictions: P = 1, P = 0, P = 0, P = 1; rewards: r = 0, r = 1; other labels: L, E, 1, 1, 1, 1]
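
A minimal tabular Actor-Critic sketch, assuming a single choice state with two actions, one rewarded (r = 1) and one not (r = 0). The task, the softmax action selection, the learning rates and all names are assumptions for illustration; the slides do not specify the Actor's update rule.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into selection probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train_actor_critic(n_trials=500, alpha_critic=0.1, alpha_actor=0.1, seed=0):
    rng = random.Random(seed)
    rewards = [0.0, 1.0]   # hypothetical task: action 0 gives r = 0, action 1 gives r = 1
    P = 0.0                # Critic: reward prediction for the choice state
    prefs = [0.0, 0.0]     # Actor: one preference per action
    for _ in range(n_trials):
        probs = softmax(prefs)
        action = 0 if rng.random() < probs[0] else 1
        delta = rewards[action] - P             # TD error; the next state is terminal
        P += alpha_critic * delta               # Critic learns to predict reward
        prefs[action] += alpha_actor * delta    # Actor is trained by the same signal
    return P, prefs


P, prefs = train_actor_critic()
```

The design point is that one reinforcement signal trains both parts: it corrects the Critic's prediction and raises or lowers the Actor's preference for the action just taken.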

slide # 18 / 59

The rat brain

[Figure: the rat brain. Adapted from Tierney (2006)]

slide # 19 / 59

The striatum

[Figure: the striatum. Adapted from Voorn et al. (2004)]

slide # 20 / 59

The striatum

[Figure: ventral striatum, dorsal striatum, dopaminergic neurons (VTA / SNc), actions; ACTOR and CRITIC labels]

(Barto, 1995; Houk et al., 1995; Montague et al., 1996; Schultz et al., 1997; Doya et al., 2002; O’Doherty et al., 2004)

slide # 21 / 59

The striatum

Learning based on reward prediction in VS...

... on dopamine reinforcements.

... modelled by Temporal Difference (TD)-learning

In the monkey: (Hikosaka et al., 1989; Hollerman et al., 1998; Kawagoe et al., 1998; Hassani et al., 2001; Cromwell and Schultz, 2003)

In the rat: (Carelli et al., 2000; Daw et al., 2002; Setlow et al., 2003; Nicola et al., 2004; Wilson and Bowman, 2005)

(Barto, 1995; Houk et al., 1995; Schultz et al., 1997; Doya et al., 2002)

(Schultz et al., 1992; Satoh et al., 2003; Nakahara et al., 2004)

slide # 22 / 59

The striatum

... using precisely timed reward prediction in TD-learning

[Figure: two panels: simulation of a TD-learning model; activity recorded from the monkey striatum. Adapted from Suri and Schultz (2001)]

(Montague et al., 1996; Suri and Schultz, 2001; Perez-Uribe, 2001; Alexander and Sporns, 2002)

slide # 23 / 59

Electrophysiology: Methods

Recording in the rat VS

Simple electrodes

slide # 24 / 59

Electrophysiology: Behavioral methods

The plus-maze task

slide # 25 / 59

Electrophysiology: Behavioral methods

The plus-maze task

[Figure: trial timeline (axis: time); events: center departure, box arrival; behavioral states: immobile, running]

slide # 26 / 59

Electrophysiology: Results

170 neurons; 91 neurons with behavioral correlates

[Figure: neuronal activity over time; labels: Departure, Center, Arrival]

slide # 27 / 59

Electrophysiology: Results (reward anticipation)

Ventral striatal neuron.

Activity anticipating each reward droplet.

Independent of locomotor behavior.

Khamassi, Mulder et al. (in revision) J Neurophysiol.

slide # 28 / 59





Independent of locomotor behavior.

slide # 29 / 59





Independent of locomotor behavior.

Anticipation of an extra reward.

slide # 30 / 59

Modelling with TD-learning: Results

Temporal representation of stimuli (Montague et al., 1996).

Incomplete temporal representation

Ambiguous visual input

No spatial information

[Figure: four panels, each labelled TD-learning; columns: 7, 5, 3 and 1 droplets]
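
A hedged sketch of TD-learning with a temporal representation of stimuli, assuming a tapped-delay-line reading of Montague et al. (1996): one reward prediction per time step within the trial. The trial length, droplet times, parameters and names are hypothetical and are not the settings used for the simulations on this slide.

```python
def td_tapped_delay_line(droplet_times, n_steps=20, n_trials=500,
                         alpha=0.2, gamma=0.9):
    """Learn one prediction per time step; a droplet at time k rewards the step t -> k."""
    # w[t] is the prediction at time step t; w[n_steps] marks the end of the trial (0).
    w = [0.0] * (n_steps + 1)
    for _ in range(n_trials):
        for t in range(n_steps):
            reward = 1.0 if (t + 1) in droplet_times else 0.0
            delta = reward + gamma * w[t + 1] - w[t]   # TD error at this time step
            w[t] += alpha * delta
    return w


# Hypothetical trial with three droplets delivered at time steps 5, 10 and 15.
w = td_tapped_delay_line({5, 10, 15})
```

With these assumed settings the learned prediction rises over the steps leading up to a droplet and is zero once the last droplet has been delivered, which is the kind of anticipatory profile the model is compared against.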

slide # 31 / 59


TD-learning

Same context after the last drop as during droplet delivery.

[Figure: four panels, each labelled TD-learning; columns: 7, 5, 3 and 1 droplets]

slide # 32 / 59


TD-learning

[Figure: four panels, each labelled TD-learning; columns: 7, 5, 3 and 1 droplets]

slide # 33 / 59


TD-learning

[Figure: four panels, each labelled TD-learning; columns: 7, 5, 3 and 1 droplets]

slide # 34 / 59

TD-learning could reproduce neural anticipatory activity.

Can it reproduce the rat's locomotor behavior in the same task?

slide # 35 / 59

Autonomous robotics: Methods

Virtual plus-maze

[Figure: virtual plus-maze; labels: visual perceptions, actions, reward]

slide # 36 / 59


Virtual plus-maze

[Figure: virtual plus-maze; labels: visual perceptions, actions, reward; numbered steps 1 to 5]

slide # 37 / 59


Results expected

[Figure: numbered steps 1 to 5 leading to reward]

slide # 38 / 59


Actor-Critic models

Simplistic Actor. Most often: discrete environments.

[Figure: Actor-Critic architecture; label: dopaminergic neuron]

slide # 39 / 59


Actor-Critic models

Simplistic Actor. Most often: discrete environments.

Continuous environments: coordination of modules.

• Gating network: Baldassarre (2002); Doya et al. (2002).

• Hand-tuned (independent of modules' performance): Suri and Schultz (2001).

[Figure: Actor-Critic architecture; label: dopaminergic neuron]

slide # 40 / 59


Actor-Critic models

Simplistic Actor. Most often: discrete environments.

Continuous environments: coordination of modules.

• Gating network: Baldassarre (2002); Doya et al. (2002).

• Hand-tuned (independent of modules' performance): Suri and Schultz (2001).

Test principles within a common framework

[Figure: Actor-Critic architecture; label: dopaminergic neuron]

slide # 41 / 59


Implemented framework


slide # 42 / 59


Gurney, Prescott & Redgrave (2001). Adapted by Girard et al. (2002; 2003).

slide # 43 / 59


module coordination


slide # 44 / 59


1. Gating network (tests modules' capacity for state prediction)
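
One common way to gate modules by their capacity for state prediction, sketched under the assumption of a Gaussian-style responsibility signal in the spirit of Doya et al. (2002): each module predicts the next state, and modules with smaller prediction error receive more weight. The function name, `sigma` and the example values are hypothetical; the gating network actually used in this work may differ.

```python
import math


def responsibilities(observed_state, predicted_states, sigma=1.0):
    """Weight each module by how well it predicted the observed state."""
    errors = [sum((o - p) ** 2 for o, p in zip(observed_state, pred))
              for pred in predicted_states]
    scores = [-e / (2.0 * sigma ** 2) for e in errors]   # smaller error, higher score
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]             # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical case: module 0 predicts the observed state exactly, module 1 does not.
weights = responsibilities([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```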

slide # 45 / 59


2. Hand-tuned (independent of modules' performance)

[Figure labels: visual perceptions, categorization, reward]

slide # 46 / 59


3. Unsupervised categorization (Self-Organizing Maps)
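
A minimal Self-Organizing Map sketch showing how unsupervised categorization could assign each perception to a module, assuming a one-dimensional map with a fixed Gaussian neighborhood. Map size, learning rate, radius and the sample perceptions are hypothetical and do not reproduce the configuration used in the robot.

```python
import math
import random


def best_matching_unit(units, x):
    """Index of the unit whose weight vector is closest to x."""
    dists = [sum((xi - wi) ** 2 for xi, wi in zip(x, w)) for w in units]
    return dists.index(min(dists))


def train_som(samples, n_units=4, n_epochs=50, lr=0.3, radius=0.5, seed=0):
    rng = random.Random(seed)
    dim = len(samples[0])
    units = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for _ in range(n_epochs):
        for x in samples:
            winner = best_matching_unit(units, x)
            for i, w in enumerate(units):
                # Units close to the winner on the map move more than distant ones.
                h = math.exp(-((i - winner) ** 2) / (2.0 * radius ** 2))
                for d in range(dim):
                    w[d] += lr * h * (x[d] - w[d])
    return units


# Hypothetical perceptions forming two groups in a 2-D feature space.
samples = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]]
units = train_som(samples)
categories = [best_matching_unit(units, x) for x in samples]
```

In this reading, the winning unit's index plays the role of the category that decides which module is active for the current perception.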

slide # 47 / 59


4. Random robot

slide # 48 / 59

Autonomous robotics: Results

[Figure: results plot; label: average]

slide # 49 / 59


Nb of iterations required (average performance during the second half of the experiment):

1. gating network: 3,500

2. hand-tuned: 94

3. unsupervised categorization (SOM): 404

4. random robot: 30,000

slide # 50 / 59


Nb of iterations required (average performance during the second half of the experiment):

1. gating network: 3,500

2. hand-tuned: 94

3. unsupervised categorization (SOM): 404

4. random robot: 30,000

slide # 51 / 59

Discussion

Contributions:

• Critic-like reward anticipation in the ventral striatum

• Coordinating multiple modules with SOM

slide # 52 / 59

Discussion

Contributions:

• Critic-like reward anticipation in the ventral striatum

• Coordinating multiple modules with SOM

• Prediction: dopamine signal for missing final drop

slide # 53 / 59

Discussion

Contributions:

• Critic-like reward anticipation in the ventral striatum

• Coordinating multiple modules with SOM

• Prediction: dopamine signal for missing final drop

Perspectives:

• Vary intervals between droplet rewards

• Integrate action values (Samejima et al., 2005)

• Improve the model based on other robotics multi-module reinforcement learning methods (Uchibe et al., 2004; Brunskill et al., 2006)

slide # 54 / 59


Actor-Critic models

[Figure: Actor-Critic architecture with a dopaminergic neuron; reward predictions: P = 1, P = 0, P = 0, P = 1; rewards: r = 0, r = 1; other labels: L, E, 1, 1, 1, 1]

slide # 55 / 59

Model-based reinforcement learning

[Figure: reward predictions: P = 1, P = 0, P = 0, P = 1; rewards: r = 0, r = 1]

slide # 56 / 59

General discussion

Strategy dimension: Visual, Place

Action selection process: Model-free, Model-based (Daw et al., 2005)

Model-free: inflexible, slow to acquire (Stimulus-Response associations)

• Visual: cue-guided strategy

• Place: place recognition-triggered response, Trullier et al. (1997)

Model-based: flexible, rapidly learned (Action-outcome contingencies)

• Visual: cue-guided strategy, Dickinson and Balleine (1998)

• Place: place strategy (cognitive map)

slide # 57 / 59

General discussion

Reinterpret inconsistent behavioral results:

• spatial more rapidly acquired than cue-guided (Packard and McGaugh, 1996)

• cue-guided more rapidly acquired than spatial (Pych et al., 2005)

Evidence for involvement of the prefronto-striatal system in model-based strategies:

• In mPFC: A-O contingencies (Mulder et al., 2003), spatial goals (Hok et al., 2005)

• Lesions of the striatum impair model-based strategies (Kelley et al., 1997; Corbit et al., 2001; Yin et al., 2005)

slide # 58 / 59

Perspective

EC Project ICEA (Integrating Cognition, Emotion and Autonomy)

Bioinspired interfaces for assessing new hypotheses

Neurophysiological experiments, LPPA

Autonomous robotics, LIP6/ISIR

Webots software, (c) Wany Robotics

Klusters software, (c) L. Hazan in Buzsáki’s lab

slide # 59 / 59

Collaborators

Thesis advisors: Agnès Guillot, Sidney I. Wiener

LPPA Collège de France: Alain Berthoz, Benoît Girard, Adrien Peyrache, Karim Benchenane

IDIAP Research Institute: Ricardo Chavarriaga

ISIR, Université Paris 6: Jean-Arcady Meyer, Laurent Dollé, Louis-Emmanuel Martinet, Olivier Sigaud

Universiteit van Amsterdam: Francesco P. Battaglia, Antonius B. Mulder

Toyama Faculty of Food nutrition: Eichi Tabuchi