Intro | Electrophysiology | Modelling | Discussion
slide # 1 / 59
Actor-Critic models: from ventral striatal reward-related activity to robotics simulations.
Dr. Mehdi Khamassi (1,2)
(1) LPPA, UMR CNRS 7152, Collège de France, Paris
(2) AnimatLab-LIP6 / SIMA-ISIR, Université Pierre et Marie Curie, Paris 6
slide # 2 / 59
OBJECTIVE
Help to understand how mammals can adapt their behavior in order to maximize reward obtained from the environment.
Help to understand brain mechanisms underlying these cognitive processes.
slide # 3 / 59
OBJECTIVE
Challenging goal: different levels of decision, different learning processes, different types of representation.
Multidisciplinary approach: behavioral neurophysiology, computational modelling, autonomous robotics.
slide # 4 / 59
ACTOR-CRITIC MODEL
CRITIC: learns to predict reward
ACTOR: learns to select actions
• Developed in the AI community (RL)
• Explains some reward-seeking behaviors
• Resembles some parts of the brain (dopaminergic neurons & striatum)
slide # 5 / 59
Outline
1. Introduction: How does an Actor-Critic model work?
2. Electrophysiology: Reward predictions in the rat ventral striatum
3. Computational modelling: An Actor-Critic model in a simulated robot
4. Discussion
slide # 6 / 59
The Actor-Critic model
• Learning from reward
[Figure: sequence of steps 1 to 5 ending in reward; timelines of actions 1 to 5 and reward.]
slide # 7 / 59
The Actor-Critic model
• Learning from reward
[Figure: sequence of steps 1 to 5 ending in reward; timelines of actions 1 to 5, reward and reinforcement.]
slide # 8 / 59
The Actor-Critic model
• Learning from reward
[Figure: sequence of steps 1 to 5 ending in reward; timelines of actions 1 to 5, reward, reinforcement, and reward prediction P(t-1).]
Rescorla and Wagner (1972).
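The reward-prediction idea credited here to Rescorla and Wagner (1972) can be sketched as follows. This is a minimal illustration, not the presenter's implementation: the function name, the learning rate `alpha` and the reward sequence are assumptions made for the example. The reinforcement is taken as the difference between the reward received and the prediction P held before the outcome.

```python
def rescorla_wagner(rewards, alpha=0.2, p0=0.0):
    """Track a reward prediction P over successive trials.

    Hypothetical helper: `alpha` (learning rate) and `p0` (initial
    prediction) are illustrative values, not taken from the slides.
    """
    p = p0
    trace = []
    for r in rewards:
        reinforcement = r - p        # reward minus the previous prediction
        trace.append((p, reinforcement))
        p += alpha * reinforcement   # move the prediction toward the reward
    return p, trace


# Thirty rewarded trials: the prediction rises toward 1 and the
# reinforcement shrinks toward 0.
final_p, trace = rescorla_wagner([1.0] * 30)
```

On the first trial the prediction is 0 and the reinforcement equals the full reward; as the prediction improves, the same reward produces less and less reinforcement.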
slide # 9 / 59
The Actor-Critic model
• Temporal-Difference (TD) learning
[Figure: sequence of steps 1 to 5 ending in reward; timelines of actions 1 to 5, reward, reward predictions P(t-1) and P(t), and reinforcement ȓ.]
Sutton and Barto (1998).
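In TD learning the reinforcement compares two successive predictions rather than a single prediction with the reward. A minimal sketch in the standard form given by Sutton and Barto (1998) follows; the discount factor `gamma`, the learning rate `alpha` and the function names are assumptions, since the slide does not give them.

```python
def td_error(reward, p_prev, p_now, gamma=1.0):
    """Reinforcement signal of TD learning.

    p_prev is the prediction made at the previous step, P(t-1);
    p_now is the prediction at the current step, P(t).
    `gamma` is an assumed discount factor.
    """
    return reward + gamma * p_now - p_prev


def td_update(p_prev, delta, alpha=0.1):
    """Correct the earlier prediction in the direction of the error."""
    return p_prev + alpha * delta


# No reward yet, but the prediction has risen from 0.2 to 0.8:
# the reinforcement is positive before any reward is delivered.
delta = td_error(reward=0.0, p_prev=0.2, p_now=0.8)
new_p_prev = td_update(0.2, delta)
```

Because the signal depends on the change in prediction, an action that leads to a better-predicted situation is reinforced even when the reward itself comes later.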
slide # 10 / 59
The Actor-Critic model
• Analogy with dopaminergic neurons
[Figure: dopaminergic neuron activity around S and R; reinforcement: +1.]
Romo & Schultz (1990); Houk et al. (1995); Schultz et al. (1997).
slide # 12 / 59
The Actor-Critic model
• Analogy with dopaminergic neurons
[Figure: dopaminergic neuron activity around S and R; reinforcement: 0.]
Romo & Schultz (1990); Houk et al. (1995); Schultz et al. (1997).
slide # 13 / 59
The Actor-Critic model
• Analogy with dopaminergic neurons
[Figure: dopaminergic neuron activity around S and R; reinforcement: -1.]
Romo & Schultz (1990); Houk et al. (1995); Schultz et al. (1997).
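The values +1, 0 and -1 on these slides are commonly read as the TD reinforcement at the time of reward in three situations: a reward that was not predicted, a reward that was fully predicted, and a predicted reward that is omitted. The sketch below assumes a reward of magnitude 1, no discounting, and no further reward predicted after the outcome; the function name and case labels are illustrative.

```python
def reinforcement_at_reward(reward, predicted):
    """TD signal at reward time, assuming nothing is predicted afterwards."""
    prediction_after = 0.0
    return reward + prediction_after - predicted


cases = {
    "unpredicted reward": reinforcement_at_reward(1.0, 0.0),
    "fully predicted reward": reinforcement_at_reward(1.0, 1.0),
    "predicted reward omitted": reinforcement_at_reward(0.0, 1.0),
}
# cases == {"unpredicted reward": 1.0,
#           "fully predicted reward": 0.0,
#           "predicted reward omitted": -1.0}
```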
slide # 14 / 59
The Actor-Critic model
Actor-Critic models
Barto (1995); Houk et al. (1995); Montague et al. (1996); Schultz et al. (1997); Berns and Sejnowski (1996); Suri and Schultz (1999); Doya (2000); Suri et al. (2001); Baldassarre (2002). See Joel et al. (2002) for a review.
Dopaminergic neuron
slide # 15 / 59
The Actor-Critic model
Actor-Critic models
Dopaminergic neuron
[Figure: reward predictions P = 0, P = 0, P = 0, P = 0; rewards r = 0 and r = 1; labels L and E.]
slide # 16 / 59
The Actor-Critic model
Actor-Critic models
Dopaminergic neuron
[Figure: reward predictions P = 0, P = 0, P = 0, P = 1; rewards r = 0 and r = 1; labels L and E.]
slide # 17 / 59
The Actor-Critic model
Actor-Critic models
Dopaminergic neuron
[Figure: reward predictions P = 1, P = 0, P = 0, P = 1; rewards r = 0 and r = 1; labels L and E.]
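The way predictions spread backwards from the rewarded state can be illustrated on a small example. The sketch below does not reproduce the layout of the figure: it uses a hypothetical chain of two states ending in a reward of 1, with a learning rate and discount factor of 1 chosen so that each trial moves the prediction back by exactly one state.

```python
def run_trial(P, rewards, alpha=1.0, gamma=1.0):
    """One pass through a chain of states; P is updated in place.

    rewards[s] is the reward received on leaving state s.
    """
    n = len(P)
    for s in range(n):
        p_next = P[s + 1] if s + 1 < n else 0.0   # nothing after the last state
        delta = rewards[s] + gamma * p_next - P[s]
        P[s] += alpha * delta
    return P


P = [0.0, 0.0]
rewards = [0.0, 1.0]        # reward delivered on leaving the last state
history = []
for trial in range(2):
    run_trial(P, rewards)
    history.append(list(P))
# history == [[0.0, 1.0], [1.0, 1.0]]
```

After the first trial only the state next to the reward predicts it; after the second trial the preceding state does too.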
slide # 18 / 59
The rat brain
[Figure adapted from Tierney (2006).]
slide # 19 / 59
The striatum
[Figure adapted from Voorn et al. (2004).]
slide # 20 / 59
The striatum
Ventral Striatum: CRITIC
Dorsal Striatum: ACTOR (Actions)
Dopaminergic neurons (VTA / SNc)
(Barto, 1995; Houk et al., 1995; Montague et al., 1996; Schultz et al., 1997; Doya et al., 2002; O’Doherty et al., 2004)
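The division of labour named on this slide (a Critic that predicts reward, an Actor that selects actions, and a dopamine-like reinforcement signal shared by both) can be sketched on a toy problem. Everything specific here is an assumption for illustration: a single state, two actions of which only the second is rewarded, a softmax Actor, and the parameter values.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into selection probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic(n_trials=500, alpha=0.1, beta=0.5, seed=0):
    rng = random.Random(seed)
    value = 0.0               # Critic: reward prediction for the single state
    prefs = [0.0, 0.0]        # Actor: one preference per action
    reward_for = [0.0, 1.0]   # hypothetical task: only action 1 is rewarded
    for _ in range(n_trials):
        probs = softmax(prefs)
        action = 0 if rng.random() < probs[0] else 1
        reward = reward_for[action]
        delta = reward - value           # reinforcement (trial ends here)
        value += alpha * delta           # Critic update
        prefs[action] += beta * delta    # Actor update, same signal
    return value, prefs


value, prefs = actor_critic()
```

The same signal `delta` trains both parts: it corrects the Critic's prediction and strengthens or weakens the action the Actor just took.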
slide # 21 / 59
The striatum
Learning based on reward prediction in VS...
... on dopamine reinforcements.
... modelled by Temporal Difference (TD)-learning
In the monkey: (Hikosaka et al., 1989; Hollerman et al., 1998; Kawagoe et al., 1998; Hassani et al., 2001; Cromwell and Schultz, 2003)
In the rat: (Carelli et al., 2000; Daw et al., 2002; Setlow et al., 2003; Nicola et al., 2004; Wilson and Bowman, 2005)
(Barto, 1995; Houk et al., 1995; Schultz et al., 1997; Doya et al., 2002)
(Schultz et al., 1992; Satoh et al., 2003; Nakahara et al., 2004)
slide # 22 / 59
The striatum
... using precise timing reward prediction in TD-learning
(Montague et al., 1996; Suri and Schultz, 2001; Perez-Uribe, 2001; Alexander and Sporns, 2002)
[Figure adapted from Suri and Schultz (2001): simulation of a TD-learning model; activity recorded from the monkey striatum.]
slide # 23 / 59
Electrophysiology: Methods
Recording in the rat VS
Simple electrodes
slide # 24 / 59
Electrophysiology: Behavioral methods
The plus-maze task
slide # 25 / 59
Electrophysiology: Behavioral methods
The plus-maze task
[Figure: trial timeline (time axis) with events "Center departure" and "Box arrival" and behavioral phases "running" and "immobile".]
slide # 26 / 59
Electrophysiology: Results
170 neurons; 91 neurons with behavioral correlates
[Figure: neuronal activity over time, aligned on Departure, Center and Arrival.]
slide # 27 / 59
Electrophysiology: Results (reward anticipation)
Ventral striatal neuron.
Activity anticipating each reward droplet.
Independent of locomotor behavior.
Khamassi, Mulder et al. (in revision) J Neurophysiol.
slide # 29 / 59
Electrophysiology: Results (reward anticipation)
Ventral striatal neuron.
Activity anticipating each reward droplet.
Independent of locomotor behavior.
Anticipation of an extra reward.
Khamassi, Mulder et al. (in revision) J Neurophysiol.
slide # 30 / 59
Modelling with TD-learning: Results
TD-learning with:
• Temporal representation of stimuli (Montague et al., 1996).
• Incomplete temporal representation
• Ambiguous visual input
• No spatial information
[Figure: TD-learning simulations for rewards of 7, 5, 3 and 1 droplets.]
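How a temporal representation of the stimulus lets TD-learning produce anticipatory activity can be sketched as follows. This is a simplified illustration, not the model used in the talk: time since stimulus onset is coded with one weight per time step (a hypothetical stand-in for the temporal representation cited from Montague et al., 1996), a single reward arrives at `reward_step`, and `alpha`, `gamma` and the number of trials are assumed values.

```python
def train_tapped_delay_line(n_steps=10, reward_step=7, n_trials=200,
                            alpha=0.3, gamma=0.9):
    """TD learning with one weight per time step since stimulus onset.

    w[t] is the reward prediction at step t; the reward is received
    on leaving step `reward_step`.
    """
    w = [0.0] * n_steps
    for _ in range(n_trials):
        for t in range(n_steps):
            p_next = w[t + 1] if t + 1 < n_steps else 0.0
            reward = 1.0 if t == reward_step else 0.0
            delta = reward + gamma * p_next - w[t]
            w[t] += alpha * delta
    return w


predictions = train_tapped_delay_line()
```

With a discount factor below 1 the learned prediction ramps up as the reward approaches and stays at zero for the steps after it has been delivered.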
slide # 31 / 59
Modelling with TD-learning: Results
TD-learning with:
• Temporal representation of stimuli (Montague et al., 1996).
• Incomplete temporal representation
• Same context after the last drop as during droplet delivery.
• No spatial information
[Figure: TD-learning simulations for rewards of 7, 5, 3 and 1 droplets.]
slide # 34 / 59
TD-learning could reproduce neural anticipatory activity.
Can it reproduce the rat's locomotor behavior in the same task?
Khamassi, Mulder et al. (in revision) J Neurophysiol.
slide # 35 / 59
Autonomous robotics: Methods
Virtual plus-maze
[Figure labels: visual perceptions, reward, actions.]
slide # 36 / 59
Autonomous robotics: Methods
Virtual plus-maze
[Figure: numbered steps 1 to 5; labels: actions, visual perceptions, reward.]
slide # 37 / 59
Autonomous robotics: Methods
Results expected
[Figure: sequence of steps 1 to 5 ending in reward.]
slide # 40 / 59
Autonomous robotics: Methods
Actor-Critic models
Barto (1995); Houk et al. (1995); Montague et al. (1996); Schultz et al. (1997); Berns and Sejnowski (1996); Suri and Schultz (1999); Doya (2000); Suri et al. (2001); Baldassarre (2002). See Joel et al. (2002) for a review.
Simplistic Actor. Most often: discrete environments.
Continuous environments: coordination of modules.
• gating network: Baldassarre (2002); Doya et al. (2002).
• hand-tuned (independent of the modules' performance): Suri and Schultz (2001).
Test principles within a common framework
Dopaminergic neuron
slide # 41 / 59
Autonomous robotics: Methods
Implemented framework
slide # 42 / 59
Autonomous robotics: Methods
Gurney, Prescott & Redgrave (2001). Adapted by Girard et al. (2002; 2003).
slide # 43 / 59
Autonomous robotics: Methods
Module coordination
slide # 44 / 59
Autonomous robotics: Methods
1. gating network (tests modules' capacity for state prediction)
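One possible reading of "tests modules' capacity for state prediction" is that each module predicts the next state and is weighted by how small its prediction error is. The sketch below follows that reading with a Gaussian weighting in the spirit of the multiple-model approach of Doya et al. (2002); the function name, the width `sigma` and the example values are assumptions, and this is not the gating network implemented for the talk.

```python
import math


def responsibilities(predicted_states, observed_state, sigma=1.0):
    """Weight each module by the accuracy of its state prediction.

    predicted_states: one predicted state vector per module.
    Returns weights that sum to 1; better predictors get more weight.
    """
    errors = [sum((p - o) ** 2 for p, o in zip(pred, observed_state))
              for pred in predicted_states]
    scores = [math.exp(-e / (2 * sigma ** 2)) for e in errors]
    total = sum(scores)
    return [s / total for s in scores]


# Two hypothetical modules; the second one predicts the observed
# state far better and receives the larger weight.
weights = responsibilities([[0.0, 0.0], [1.0, 1.0]], [1.0, 0.9])
```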
slide # 45 / 59
Autonomous robotics: Methods
2. hand-tuned (independent of the modules' performance)
[Figure labels: visual perceptions, categorization, reward.]
slide # 46 / 59
Autonomous robotics: Methods
3. unsupervised categorization (Self-Organizing Maps)
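Unsupervised categorization with a self-organizing map can be sketched as below: each input is assigned to its closest unit, and that unit and its neighbours on the map move toward the input. This is a deliberately small illustration under stated assumptions: a one-dimensional map, a fixed learning rate and neighbourhood (both are usually shrunk over time), and made-up two-dimensional inputs standing in for visual perceptions. In the setting of the talk, the winning unit would indicate which module is in charge.

```python
import random


def best_unit(units, x):
    """Index of the unit closest to input x."""
    return min(range(len(units)),
               key=lambda i: sum((u - v) ** 2 for u, v in zip(units[i], x)))


def train_som(samples, n_units=4, n_epochs=50, lr=0.3, seed=0):
    rng = random.Random(seed)
    dim = len(samples[0])
    units = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for _ in range(n_epochs):
        for x in samples:
            winner = best_unit(units, x)
            for i, unit in enumerate(units):
                dist = abs(i - winner)
                # Neighbourhood: full update for the winner,
                # half for its direct neighbours, none beyond.
                h = 1.0 if dist == 0 else 0.5 if dist == 1 else 0.0
                for d in range(dim):
                    unit[d] += lr * h * (x[d] - unit[d])
    return units


# Hypothetical inputs forming two groups.
samples = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
units = train_som(samples)
category = best_unit(units, [0.05, 0.0])
```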
slide # 47 / 59
Autonomous robotics: Methods
4. random robot
slide # 48 / 59
Autonomous robotics: Results
[Figure: results plot, labelled "average".]
slide # 49 / 59
Autonomous robotics: Results
Nb of iterations required (average performance during the second half of the experiment):
1. gating network: 3,500
2. hand-tuned: 94
3. unsupervised categorization (SOM): 404
4. random robot: 30,000
slide # 53 / 59
Discussion
Contributions
• Critic-like reward anticipation in the ventral striatum
• Coordinating multiple modules with SOM
• Prediction: dopamine signal for missing final drop
Perspectives
• Vary intervals between droplet rewards
• Integrate action values (Samejima et al., 2005)
• Improve the model based on other robotics multi-module reinforcement learning methods (Uchibe et al., 2004; Brunskill et al., 2006)
slide # 54 / 59
The Actor-Critic model
Actor-Critic models
Dopaminergic neuron
[Figure: reward predictions P = 1, P = 0, P = 0, P = 1; rewards r = 0 and r = 1; labels L and E.]
slide # 55 / 59
Model-based reinforcement learning
[Figure: reward predictions P = 1, P = 0, P = 0, P = 1; rewards r = 0 and r = 1.]
slide # 56 / 59
General discussion
[Figure: navigation strategies arranged along two axes.]
Strategy dimension: Visual (cue-guided strategy) / Place (place strategy)
Action selection process (Daw et al., 2005):
• Model-free: inflexible, slow to acquire (Stimulus-Response associations)
• Model-based: flexible, rapidly learned (Action-outcome contingencies; cognitive map)
Place recognition-triggered response: Trullier et al. (1997)
Cue-guided strategy: Dickinson and Balleine (1998)
slide # 57 / 59
General discussion
Reinterpret inconsistent behavioral results
• spatial more rapidly acquired than cue-guided (Packard and McGaugh, 1996)
• cue-guided more rapidly acquired than spatial (Pych et al., 2005)
Evidence for involvement of the prefronto-striatal system in model-based strategies
• In mPFC: A-O contingencies (Mulder et al., 2003), spatial goals (Hok et al., 2005)
• Lesions of the striatum impair model-based strategies (Kelley et al., 1997; Corbit et al., 2001; Yin et al., 2005)
slide # 58 / 59
Perspective
EC Project ICEA (Integrating Cognition, Emotion and Autonomy)
Bioinspired interfaces for assessing new hypotheses
Neurophysiological experiments, LPPA
Autonomous robotics, LIP6/ISIR
Webots software, (c) Wany Robotics
Klusters software, (c) L. Hazan in Buzsáki’s lab
slide # 59 / 59
Collaborators
Thesis advisors: Agnès Guillot, Sidney I. Wiener
LPPA Collège de France: Alain Berthoz, Benoît Girard, Adrien Peyrache, Karim Benchenane
IDIAP Research Institute: Ricardo Chavarriaga
ISIR, Université Paris 6: Jean-Arcady Meyer, Laurent Dollé, Louis-Emmanuel Martinet, Olivier Sigaud
Universiteit van Amsterdam: Francesco P. Battaglia, Antonius B. Mulder
Toyama Faculty of Food nutrition: Eichi Tabuchi