Ghent University
Faculty of Psychology and Educational Sciences
Second year Master of Science in Psychology
Theoretical and Experimental Psychology
Second exam period
To Go or Not to Go: Differences in Cognitive Reinforcement
Learning
Master thesis to obtain a degree as master of science in Psychology,
in the field of Theoretical and Experimental Psychology.
Michiel Van Boxelaere 00700889
Promoter: Prof. Dr. Tom Verguts
Supervisor: Dr. Filip Van Opstal
Department of Experimental Psychology
13-08-2013
Abstract
Background. Psychologists have long suggested that the procedural learning system and the
declarative learning system are engaged under different circumstances (Poldrack & Packard,
2003). Researchers have indicated an important involvement of dopamine and the striatum in
procedural trial and error learning (Yin & Knowlton, 2006). It remains largely unclear whether
learning from feedback occurs similarly in more declarative memory tasks, thought to rely on
the medial temporal lobe and the hippocampus (Squire, 1992).
Objective. In the current study we investigate whether individual differences in
learning from positive or negative feedback differ between tasks that rely on declarative
memory regions and tasks that rely on regions involved in habit formation.
Methodology. To address this research question we adopted two well-established procedural
learning tasks (Frank, Seeberger, & O'Reilly, 2004) and compared decision making performance
on these tasks with feedback learning performance on two versions of a newly developed
explicit declarative memory task.
Hypothesis. We hypothesized that participants who learn better from positive feedback in one
task will also learn better from positive feedback in another task.
Results. We observed a general bias to learn better from positive feedback in the declarative
learning tasks, but not in the procedural learning tasks. Participants who learned better from
negative feedback during procedural tasks were more likely to learn better from positive
feedback in the explicit declarative memory tasks. These results suggest a different functional
role for the declarative and procedural memory system in learning from negative or positive
feedback.
Acknowledgements
First of all, my special thanks go to my promoter, Prof. Dr. Tom Verguts, for
introducing me to the research field of reinforcement learning and for providing the
necessary input and guidance to complete this master thesis.
Secondly, I would like to thank my supervisor, Dr. Filip Van Opstal, for
guiding me while programming the experiments.
Thirdly, special thanks go to Jan Van Boxem and my partner Fien Van
Boxem for providing the feedback needed to improve various stylistic aspects of
this thesis, and for their general support and guidance during the writing process.
Some thoughts go out to my stepbrother, Jochen Pichal, and my uncle, Geert
Van Boxelaere, who passed away while writing this thesis, teaching me to put the little
‘big’ problems in perspective.
General thanks go to my family and friends for giving me the opportunity
to develop myself, both personally and professionally, and for giving me the
support I needed over the years.
Table of Contents
1. Introduction
   1.1. Adaptive learning and memory
   1.2. Reinforcement Learning: Theoretical Background
      1.2.1. Classical and instrumental conditioning
      1.2.2. Theoretical models of learning: a normative framework
      1.2.3. Computational models of reinforcement learning: Classical Conditioning
         I. The Rescorla-Wagner Model
         II. Computing temporal relationships of reinforcement
         III. The Temporal Difference Model
      1.2.4. Computational models of reinforcement learning: Instrumental Conditioning
         I. What about active decision making?
         II. The Actor-Critic framework
         III. Model-free (habit) learning vs. model-based (goal-directed) learning
   1.3. Underlying neural mechanisms of reinforcement learning: Dopamine
      1.3.1. The neurobiology of dopamine
      1.3.2. Dopaminergic cell activity signals reward prediction errors
   1.4. Underlying neural mechanisms: The Striatal Habit Learning System
      1.4.1. Neurobiological features of the striatum and the basal ganglia
         I. The cortico-basal ganglia-thalamocortical circuitry modulates action selection
         II. The direct "Go" pathway
         III. The indirect "No-Go" pathway
      1.4.2. Striatal processing of predicted reinforcement outcomes
         I. Evidence from fMRI studies
         II. Evidence from pharmacological studies
         III. Evidence from patient studies
         IV. Neurocomputational accounts
      1.4.3. Psychology traditionally distinguishes habit formation from goal-directed learning
   1.5. Underlying neural mechanisms: The Goal-Directed Learning System
      1.5.1. Brain regions involved in goal-directed learning
      1.5.2. Interactions between habit and goal-directed learning?
      1.5.3. Different value-based decision making across learning systems?
2. Aim of this study
   2.1. Research question
   2.2. Rationale
   2.3. Hypothesis
3. Method
   3.1. Materials and Methods
      3.1.1. Participants
      3.1.2. Stimuli and Apparatus
      3.1.3. General Procedure
   3.2. Implicit procedural learning tasks
      3.2.1. Probabilistic Selection Task
         I. Stimuli
         II. Procedures
      3.2.2. Transitive Inference Task
         I. Stimuli
         II. Procedures
   3.3. Explicit episodic memory tasks
      3.3.1. One Shot Learning Task (versions 1 & 2)
         I. Stimuli
         II. Procedures
   3.4. Data Analysis
      3.4.1. Probabilistic Selection Task
         I. Data filtering
         II. Test Pair Analysis
         III. Training Analysis
         IV. Session Analysis
      3.4.2. Transitive Inference Task
         I. Data filtering
         II. Test Pair Analysis
         III. Training Analysis
         IV. Session Analysis
         V. Awareness Questionnaire
      3.4.3. One Shot Learning Task (versions 1 & 2)
         I. Data filtering
         II. Test Pair Analysis
         III. Session Analysis
      3.4.4. Cross-task Comparisons
4. Results
   4.1. Separate Task Analysis
      4.1.1. Probabilistic Selection Task
      4.1.2. Transitive Inference Task
      4.1.3. One Shot Learning Task (1)
      4.1.4. One Shot Learning Task (2)
   4.2. Cross-Task Analysis
      4.2.1. Relationships between session and tasks on general test performance
         I. Tasks within and between sessions
         II. Implicit and explicit tasks
      4.2.2. Inter-individual bias towards positive or negative learning across tasks?
         I. Tasks within and between sessions
         II. Implicit and explicit tasks
5. Discussion
   5.1. Learning within the procedural memory system
      5.1.1. The Probabilistic Selection Task
      5.1.2. The Transitive Inference Task
   5.2. Learning across the procedural and the declarative memory system
      5.2.1. Comparing results with the 'episodic-like' one shot learning tasks
6. References
List of Figures
FIGURE 1: General distinctions in reinforcement learning theory
FIGURE 2: Example of randomized task order
FIGURE 3: Design of the probabilistic selection task
FIGURE 4: Design of the transitive inference task
FIGURE 5: Design of the one shot learning task
FIGURE 6: Results of the probabilistic selection task
FIGURE 7: Results of the transitive inference task
FIGURE 8: Results of the transitive inference task, controlled for awareness
FIGURE 9: Results of the one shot learning tasks (versions 1 & 2)
FIGURE 10: Cross-task regression analysis
List of Tables
TABLE 1: Inter-individual variability across learning tasks
TABLE 2: Descriptive statistics of the probabilistic selection task
TABLE 3: Descriptive statistics of the transitive inference task
TABLE 4: Descriptive statistics of the transitive inference task, controlled for awareness
TABLE 5: Descriptive statistics of the one shot learning tasks (1 & 2)
TABLE 6: Rank correlations of test performance across tasks
TABLE 7: Rank correlations of bias rates across tasks
TABLE 8: Rank correlations of bias rates across tasks, controlled for awareness in the TI-task
1. Introduction
In order to increase the likelihood of survival and reproduction, organisms have to
flexibly adapt to a constantly changing environment. To flexibly interact with the environment
requires an adaptive learning system that dynamically distinguishes between good, bad, novel,
relevant and irrelevant stimuli in different environmental contexts (Sugrue, Corrado, &
Newsome, 2005). When confronted with an external stimulus, organisms not only have to
decide whether the stimulus is potentially harmful or beneficial to their preservation; they
also have to decide whether or not to act, based on the expected outcome of that action.
The brain processes relevant for decision making therefore not only have to encode signals
related to the values of alternative options, they also have to be able to recall past
experiences and store new experiences to guide future decision making. These processes involve adaptive trial
and error learning and flexible memory updating (Daw, Niv, & Dayan, 2005; Gläscher et al.,
2010).
1.1. Adaptive learning and memory.
In general, learning is defined as “a relatively permanent change in behavior based on
an organism’s interactional experiences with the environment” (Robbins, 1998, p.41).
Relatively permanent, because memories tend to get lost or changed over time. Therefore,
adaptive learning critically involves memory processes to make predictions concerning the
positive or negative outcomes following decision making behavior based on previous
experiences.
Memory is generally referred to as "the processes that are used to acquire, retain and
later on retrieve learned information" (Baddeley, Eysenck, & Anderson, 2009, p. 5). These
memory processes are traditionally categorized into different 'memory systems' (Squire, 1992)
according to how long information is retained (Baddeley, 2001) or whether information
concerning events or cognitive skills is recalled deliberately and consciously (explicitly) or
automatically and unconsciously (implicitly) (Graf & Schacter, 1985; Milner et al., 1998). Despite
the extensive amount of research dedicated to exploring the neural underpinnings of multiple
memory systems (reviewed in Squire, 2004), together with growing evidence from animal
(White & McDonald, 2002), fMRI (Poldrack et al., 2001) and patient studies (Knowlton,
Mangels, & Squire, 1996) concerning the important role of certain brain regions1 in specific
1 There is, for example, strong evidence for an important role of the striatum and connected basal ganglia (BG) structures in (implicit) procedural learning and habit formation (Yin & Knowlton, 2006).
memory subtypes, it is still poorly understood how value-related information is integrated with
stored knowledge about past experiences across different memory systems. These research
questions, concerning how value-related choice behaviors are tuned by past experiences, are
typically studied within the framework of reinforcement learning (Sutton & Barto, 1998).
1.2 Reinforcement Learning: Theoretical Background.
1.2.1 Classical and instrumental conditioning.
Behavioral psychologists have researched the above-mentioned question using
Pavlovian (or classical) and instrumental (or operant) conditioning paradigms. In a typical
classical conditioning procedure, animals learn to associate a neutral conditioned stimulus (CS;
e.g., a tone) with a motivationally significant rewarding or punishing unconditioned stimulus
(UCS; e.g., food), which elicits an unconditioned physiological response (UCR; e.g., salivation).
Over time, animals will demonstrate this physiological response (conditioned response; CR) to
the conditioned stimulus even when the unconditioned stimulus is omitted. In other words, the
animal successfully learns the predictive value of the tone for the motivationally significant
reward (food) or punishment (e.g., shock), resulting in a conditioned response (salivation)
towards the previously neutral tone (Pavlov, 1927).
Instrumental conditioning is distinguished from classical conditioning in that it focuses
on making associations between voluntary behavioral decision making (e.g., performing an
action or not) and its rewarding consequence. Classical conditioning, on the other hand, deals
with making associations between an involuntary response (e.g., salivation) and a stimulus
(e.g., tone). Thus, agents learn passively in classical conditioning, whereas, in instrumental
conditioning, agents actively perform an action to receive a reward or to avoid punishment. In
a typical instrumental conditioning procedure with animals, an operant cage is used: animals
are placed in the cage and trained to perform an action (e.g., a lever press) in order to obtain
a reward (e.g., food) or to avoid punishment (e.g., an electric shock) (Skinner, 1935, 1987;
Thorndike, 1911).
1.2.2 Theoretical models of learning: a normative framework.
Rooted in these psychological theories of learning in animals (Skinner, 1938;
Thorndike, 1911) and further developed in the field of machine learning (Sutton & Barto,
1998), reinforcement learning theory (RL) provides a theoretical framework to study choice
behavior by which humans and animals select actions in the light of expected reward or
punishment (Sutton & Barto, 1998). Since learning is essentially an unobservable process,
computational reinforcement learning models have focused on studying choice behavior as
the most advantageous adaptation to a given problem in a certain environment. Unlike
descriptive models (Skinner, 1935), which describe choice behavior as it presents itself,
computational models draw on a normative framework that describes behavior as an
optimal adaptation for reaching an agent's specific goals, based on its predicted future
consequences (Sutton & Barto, 1998).
According to this framework, decision making behaviors can thus be studied and
understood in the light of the efforts that most likely minimize or maximize future
punishments or rewards, respectively. Among computational reinforcement learning models,
there is a general consensus that the estimation of the likelihood that a given environmental
state or behavioral action will be followed by reward or punishment, together with the actual
experienced outcome, is the main engine that drives behavioral learning (Daw, Niv, & Dayan,
2005; Frank, 2005; Rescorla & Wagner, 1972; Sutton & Barto, 1990). The potential discrepancy
between predicted and actual reward outcome, formally known as “prediction error”, is the
fundamental basis of the learning rule as described by the Rescorla-Wagner model, which is
still one of the most influential models to understand and explain a wide range of animal (and
human) learning behaviors (Rescorla & Wagner, 1972).
1.2.3 Computational models of reinforcement learning: Classical Conditioning.
I. The Rescorla-Wagner Model.
At the basis of the Rescorla-Wagner model is the assumption that “learning occurs only
when events violate expectations" (Niv, 2009, p. 141; Rescorla & Wagner, 1972). This
assumption is postulated in a single learning rule which, simply put, states that associative
learning is strengthened when prediction errors are positive (i.e., the actual reward outcome is
better than expected) and weakened when prediction errors are negative
(i.e., the actual outcome is worse than expected) (Rescorla & Wagner, 1972). Using this
relatively simple learning rule, the Rescorla-Wagner model can successfully predict several
behavioral phenomena described in classical conditioning protocols such as blocking (Kamin,
1969), stimulus generalization (Rescorla, 1976b) or conditioned inhibition (Miller, Barnet, &
Grahame, 1995). However, the Rescorla-Wagner model suffers from two major shortcomings.
First, it fails to predict and explain second-order or higher-order conditioning (i.e., when a
second stimulus predicts an already conditioned stimulus: CS2 -> CS1 -> US) (Rashotte,
Marshall, & O'Connell, 1981), which is important because of its high prevalence in everyday
human life (e.g., money as a second-order predictor of food). Secondly, because the Rescorla-
Wagner model only calculates (and thus learns from) prediction errors after the outcome of a trial is
known (US or no-US presentation), it fails to capture how the intensity of conditioning depends on
the different temporal relationships between CSs and USs within a trial (Sutton, 1988).
would, later on, prove to be very relevant when computational models merged with
neurobiological theories to understand the neural underpinnings of reinforcement learning
(Suri & Schultz, 1999). Nevertheless, the Rescorla-Wagner model has proven, due to its
simplicity and ease of application, to provide many important predictions and insights in
classical conditioning studies and adaptive learning using one elegant learning rule (Miller,
Barnet, & Grahame, 1995).
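As an illustration, the delta rule at the heart of the model can be sketched in a few lines of Python. This is a minimal sketch; the function name, learning rate, and trial counts are illustrative choices, not taken from the original papers:

```python
# Minimal sketch of the Rescorla-Wagner delta rule.
# v     : current associative strength of the CS
# lam   : asymptote of conditioning the US supports (e.g., 1.0 when the US
#         is presented, 0.0 when it is omitted)
# alpha : learning rate (combining CS salience and US associability)
def rescorla_wagner(v, lam, alpha=0.1):
    prediction_error = lam - v           # positive: outcome better than expected
    return v + alpha * prediction_error  # learning is proportional to the error

# Acquisition: repeated CS-US pairings drive v toward lam = 1.0 ...
v = 0.0
for _ in range(50):
    v = rescorla_wagner(v, lam=1.0)
# ... extinction: the US is omitted (lam = 0.0), so negative prediction
# errors gradually weaken the association again.
for _ in range(50):
    v = rescorla_wagner(v, lam=0.0)
```

Because the change on each trial is proportional to the prediction error, learning is fast when expectations are strongly violated and slows down as the outcome becomes fully predicted, which is the property that lets the model account for phenomena such as blocking.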
II. Computing temporal relationships of reinforcement.
To overcome the above-mentioned issues, Richard Sutton (1988), together with
Andrew Barto (1990), proposed a real-time model using a temporal difference learning rule
(Sutton, 1988). This model is an extension of the Rescorla-Wagner model that takes the
different temporal relationships between events into account. Real-time models operate on a
moment-by-moment basis (Sutton & Barto, 1990); they are distinguished from trial-level models
(e.g., the Rescorla-Wagner model), which treat complete trials as a whole and therefore cannot
make predictions concerning temporal relationships between CSs and USs within trials. In
trial-level models, the prediction level of CSs is equally high for all times prior to the US, and
the degree of associative strength depends on the intensity and duration of the primary
reinforcement (US) (Sutton & Barto, 1990).
Studying the predictive value of associative strengths between CSs and US in this manner has
proven to be very successful in well controlled laboratory experiments on a trial by trial basis
(Miller, Barnet, & Grahame, 1995). However, studies have shown that animals seemingly show
weaker predictions for CSs presented long before the US (Sutton & Barto, 1981a). It is also far
less clear what constitutes the beginning and the end of a learning trial in real life. To provide a
theoretical simplification of complex real-world learning, models of reinforcement learning
thus need to specify how the associative weight given to primary reinforcement decays with
delay over time.
III. The Temporal Difference model.
The Temporal Difference learning model (Bellman, 1958; Sutton & Barto, 1990)
resolves this group of problems by quantifying the degree of delayed primary reinforcement
by a fraction2 (ɣ) over discrete ‘units’ of time. This fraction is implemented into algorithms of
future reward predictions that are divided into two parts (for a comprehensive overview of
these algorithms, see Sutton & Barto, 1998): a first part concerning the immediate
reinforcement following a given CS, and a second part that is the sum of all expected future
reinforcements. The desired prediction is stated in terms of the primary reinforcement and
the desired prediction for successive time units (Sutton, 1988).
quantities (formally known as the temporal difference prediction error) is comparable to the
prediction error term used in the Rescorla-Wagner model, with the exception that it takes the
different timing of successively predicted reward outcomes into account (Sutton & Barto,
1990). By implementing the temporal difference prediction error learning rule into the
Rescorla-Wagner model, the TD-model can make successful predictions concerning the effects
on learning of temporal relationships within trials (Sutton & Barto, 1981a) and higher order
conditioning (Sutton & Barto, 1990). Furthermore, the temporal difference model, though
developed on purely theoretical grounds, provides an excellent account for neural findings on
classical conditioning (McClure, Berns, & Montague, 2003; O'Doherty, et al., 2003; Schultz,
1998).
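The temporal difference update can likewise be sketched in Python. The state layout (discrete time steps within a trial), learning rate, and discount factor below are illustrative assumptions, not values from the literature:

```python
# Minimal TD(0) sketch for within-trial prediction learning.
# V[s] predicts the discounted future reinforcement from time step s;
# gamma < 1 makes long-delayed reinforcement worth less than immediate
# reinforcement, implementing the decay of associative weight with delay.
def td_update(V, s, s_next, r, alpha=0.1, gamma=0.9):
    td_error = r + gamma * V[s_next] - V[s]  # temporal difference prediction error
    V[s] += alpha * td_error
    return td_error

# One trial = 5 time steps: the CS occurs at step 0 and the US (reward)
# arrives only on the transition out of the last step.
V = [0.0] * 6  # values for steps 0..4 plus a terminal step
for trial in range(200):
    for s in range(5):
        r = 1.0 if s == 4 else 0.0
        td_update(V, s, s + 1, r)
# After training, the prediction has propagated back to the CS:
# V[0] approaches gamma**4, i.e., the reward discounted by its delay.
```

Unlike the trial-level Rescorla-Wagner error, the TD error compares successive predictions within the trial, which is what allows the reward prediction to transfer backward from the US to the earliest predictive cue.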
1.2.4 Computational models of reinforcement learning: Instrumental Conditioning.
I. What about active decision making?
The learning principles of the Rescorla-Wagner model and the Temporal Difference
model as described above hold true whenever associations are made between environmental
states that are fixed in such a way that agents do not influence them by voluntary actions (i.e.,
classical conditioning). But, in order to functionally adapt to the environment it is not only
important to predict rewarding outcomes of different environmental states, it is also essential
to decide whether to behaviorally act or not based on the expected outcome of this action
behavior (i.e., instrumental conditioning). Behavioral theories of optimal action-based decision
making have long suggested that organisms are more likely to perform specific actions when
the expected outcome is rewarding (Thorndike, 1911; Skinner B. F., 1935). On the other hand,
if expected outcomes are punishing, organisms are less likely to perform these actions
(Thorndike, 1911; Skinner B. F., 1935). Importantly, these early theories of decision making do
not address how we determine which particular action, from a series of previous sequential
2 In the case of immediate primary reinforcement, the associative weight between a CS and the US is
near its maximum (and ɣ is thus near 1), whereas in the case of long-delayed primary reinforcement, the associative weight between CS and US is near its minimum (and ɣ is thus near 0).
actions, should get credit for a positive or negative outcome; this issue is formally known as
the credit-assignment problem (Sutton & Barto, 1998).
II. The Actor-Critic framework.
Models of reinforcement learning efficiently solved the credit-assignment problem by
providing a two-process Actor-Critic learning system of instrumental conditioning (Barto,
Sutton, & Anderson, 1983; Barto, 1995). According to this framework one component, the
critic, uses a temporal difference prediction error signal to evaluate and update possible
actions and environmental states in terms of predictions of future rewards (Barto, 1995). The
other component, the actor, uses a similar prediction error signal to learn preferences for each
action in each environmental state and, based on the evaluations provided by the critic,
selects those actions that are associated with greater long-term reward (Barto, 1995). In
other terms the critic learns and stores values concerning the surrounding environmental
states (i.e., temporal-difference learning), which allows the actor to select and update
preferred actions (Sutton & Barto, 1998). In the actor, an action is strengthened (or weakened)
when immediately followed by a positive (or negative) prediction error (Barto, 1995).
Accordingly, the critic is involved in both classical and instrumental conditioning, whereas the
actor only applies to instrumental conditioning (O'Doherty et al., 2004).
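The division of labor between the two components can be illustrated with a deliberately reduced Python sketch of a one-state, two-action problem. The task, parameter values, and function names are hypothetical and chosen only to expose the shared prediction error:

```python
import math
import random

# Illustrative actor-critic sketch for a one-state task with two actions.
# The critic maintains the state value V; the actor maintains action
# preferences. Both are updated by the same prediction error (delta).
def softmax_choice(prefs):
    weights = [math.exp(p) for p in prefs]
    return random.choices(range(len(prefs)), weights=weights)[0]

def actor_critic_trial(prefs, V, rewards, alpha=0.1):
    a = softmax_choice(prefs)   # actor selects an action
    r = rewards[a]              # environment delivers the outcome
    delta = r - V[0]            # critic's prediction error
    V[0] += alpha * delta       # critic updates the state value
    prefs[a] += alpha * delta   # actor strengthens/weakens the chosen action
    return a

random.seed(0)
prefs, V = [0.0, 0.0], [0.0]
rewards = [1.0, 0.0]            # action 0 is always rewarded, action 1 never
for _ in range(500):
    actor_critic_trial(prefs, V, rewards)
# The actor comes to prefer the rewarded action: prefs[0] > prefs[1].
```

Note that an action is strengthened only when it is followed by a positive prediction error; once the critic's value estimate matches the obtained reward, delta shrinks and the preferences stabilize.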
III. Model-free (habit) learning vs. model-based (goal-directed) learning.
The actor-critic architecture of action selection in instrumental conditioning is closely
related to what psychology calls 'habit' (procedural) learning (Dickinson & Balleine, 2002) or,
in computational terms, 'model-free' learning (Daw, Niv, & Dayan, 2005). In model-free (habit)
learning, associations between an organism's actions and outcomes are acquired through trial
and error and stored as cached values summarizing long-term reward, without specifying the
nature of the outcome. This approach has the advantage of computational simplicity, at the
cost of inflexibility: the cached values are not sensitive to outcome devaluation (Daw, Niv, &
Dayan, 2005). Model-free learning approaches closely interact with, but are distinguished from, more
flexible ‘model based’ or goal directed learning approaches (Balleine & Dickinson, 1998;
Gläscher, Daw, Dayan, & O'Doherty, 2010). Model-based (goal-directed) learning methods also
make predictions of long-term value outcomes, but do so by learning a 'cognitive model' of the
environment in which actions are guided by explicit knowledge of action-outcome contingencies
(Daw, Niv, & Dayan, 2005; Gläscher, Daw, Dayan, & O'Doherty, 2010). These learning
methods are, contrary to model-free approaches, very sensitive to outcome devaluation, which
makes them more suitable for flexibly adapting to a changing environment.
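The behavioral signature of this distinction, sensitivity to outcome devaluation, can be illustrated with a deliberately simplified sketch. All action and outcome names and values here are hypothetical:

```python
# Simplified contrast between model-free (habit) and model-based
# (goal-directed) choice after outcome devaluation.

outcome_of = {"press": "food", "wait": "nothing"}  # the learned world model
utility = {"food": 1.0, "nothing": 0.0}            # current outcome values

# Model-free: a cached value per action, acquired during training;
# the identity of the outcome is not represented.
q_cached = {"press": 1.0, "wait": 0.0}

# Devaluation: the food is now aversive (e.g., after pairing with illness).
utility["food"] = -1.0

def model_free_choice():
    # Consults only the stale cached values: still chooses "press".
    return max(q_cached, key=q_cached.get)

def model_based_choice():
    # Re-evaluates each action through the world model: chooses "wait".
    return max(outcome_of, key=lambda a: utility[outcome_of[a]])
```

A real model-based learner would, of course, also have to learn the action-outcome mapping from experience; the point of the sketch is only that it can re-evaluate its choices the moment outcome values change, without any new trials, whereas the cached habit must be unlearned through further trial and error.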
Overall these computational models have contributed vastly to the research field of
reinforcement learning at a behavioral, neural and molecular level:
(1) By providing simple computational terms that led to new predictions and insights into
understanding under what specific circumstances individual organisms differ in choosing
one specific action over another;
(2) By distinguishing between different learning and memory systems which allows
psychologists and behavioral neuroscientists to set up experiments that directly test
how the processing of value-related information and optimal decision making differs
across these memory and learning systems (Frank, O'Reilly, & Curran, 2006; Frank,
D'Lauro, & Curran, 2007; Poldrack, et al., 2001).
(3) By aiding to understand the underlying neural bases of conditioning (Montague,
Dayan, & Sejnowski, 1996; Schultz , 1998; Schultz, 2007).
[Figure 1 (diagram): reinforcement learning theory divides into classical conditioning, modeled by the Rescorla-Wagner and temporal difference models (the ‘critic’), and instrumental conditioning (the ‘actor’), with habit learning (model-free) contrasted against goal-directed learning (model-based).]
Figure 1. General distinctions in reinforcement learning theory (Sutton & Barto, 1998)
Specifically, work merging the computational and algorithmic levels of theory with their neural implementation has revealed a key learning signal in the mammalian brain that closely resembles the temporal difference prediction error signal proposed by the TD model (Schultz, Dayan, & Montague, 1997). It is widely accepted that this prediction error signal is encoded by phasic bursts of dopamine, a neurotransmitter crucially involved in the midbrain reward circuitry (Holroyd & Coles, 2002; Montague, Dayan, & Sejnowski, 1996; O'Doherty, et al., 2003; Schultz, 1998; Suri & Schultz, 1999; Sutton & Barto, 1998).
1.3 Underlying neural mechanisms of reinforcement learning: Dopamine.
1.3.1. The neurobiology of dopamine.
Dopamine (DA) is a well-studied catecholamine neurotransmitter involved in attention, movement and various cognitive processes (reviewed in Schultz, 2007). Dopaminergic cell groups are predominantly located in the substantia nigra pars compacta (SNc) and the ventral tegmental area (VTA), from which they project to different brain regions involved in learning and memory such as the striatum (Andén, et al., 1966), the amygdala (Fuxe, et al., 1974), the hippocampus (Legault & Wise, 2001) and frontal cortices (Emson & Koob, 1978). Evidence for the functional role of dopamine in reinforcement learning emerged from electrophysiological studies with behaving monkeys conducted in the lab of Wolfram Schultz in the 1990s. Before these studies, the dominant view on the function of dopamine was that it is crucially involved in the brain’s ‘pleasure centre’ and that it might serve as the brain’s reward signal (Olds & Milner, 1954; Wise, Spindler, & Legault, 1978). According to this hypothesis, dopaminergic cell firing corresponds with the feeling of pleasure experienced when a rewarding stimulus is presented in the external environment (Wise, Spindler, & Legault, 1978).
1.3.2. Dopaminergic cell activity signals reward prediction errors.
Schultz and colleagues famously demonstrated that dopaminergic cell activity tracks reward expectancy rather than reward itself, comparable to the prediction error learning signals proposed by computational models (reviewed in Schultz, 2002). Specifically, the classical conditioning studies conducted in the Schultz lab demonstrated that (1) phasic dopaminergic cell firing disappeared over time as rewards in the external environment became highly predictable to the agent (Romo & Schultz, 1990), (2) after a number of trials, phasic dopaminergic cell firing was observed in response to stimuli that predict reward, and thus before the presentation of the rewarding stimulus itself (Mirenowicz & Schultz, 1994), (3)
dopaminergic cell firing increased when actual reward outcomes were better than predicted
(Schultz, Dayan, & Montague, 1997) and (4) phasic dopaminergic cell firing shortly dropped
below baseline when expected rewards were omitted (Hollerman & Schultz, 1998). Studies in
humans, using fMRI and event-related potentials, have reported prediction-error-like activation in areas known to be richly innervated by dopaminergic signals, such as the anterior cingulate
cortex (Brown & Braver, 2005; Holroyd & Coles, 2002), the striatum (Bray & O'Doherty, 2007)
and the orbitofrontal cortex (O'Doherty, et al., 2003).
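These four observations are exactly the behavior of a temporal difference learner. The following simulation (with illustrative parameters and trial structure of our own choosing, not a fit to the neural data) trains a TD(0) learner on a cue followed by a reward: the prediction error migrates from the reward to the cue, and omitting an expected reward yields a negative error.

```python
# TD(0) simulation of a classical conditioning trial: cue, delay, then reward.
# Parameters and trial structure are illustrative assumptions, not a data fit.

alpha, gamma = 0.3, 1.0
V = [0.0, 0.0, 0.0]   # values of the cue step, delay step and reward step

def run_trial(V, reward=True, learn=True):
    # The cue arrives unpredictably, so the pre-cue expectation is always 0;
    # the first entry of `deltas` is the prediction error at cue onset.
    deltas = [gamma * V[0] - 0.0]
    for t in range(3):
        r = 1.0 if (t == 2 and reward) else 0.0
        next_v = V[t + 1] if t < 2 else 0.0   # no value beyond the trial
        delta = r + gamma * next_v - V[t]
        if learn:
            V[t] += alpha * delta
        deltas.append(delta)
    return deltas

first = run_trial(V)                       # errors on the very first trial
for _ in range(300):                       # training
    run_trial(V)
learned = run_trial(V, learn=False)        # errors after learning
omitted = run_trial(V, reward=False, learn=False)

print(round(first[-1], 2))    # 1.0  : unexpected reward -> positive error
print(round(learned[0], 2))   # 1.0  : the error has migrated to the cue
print(round(learned[-1], 2))  # 0.0  : fully predicted reward -> no error
print(round(omitted[-1], 2))  # -1.0 : omitted reward -> dip below baseline
```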
These results from animal and human studies have led to the widely accepted idea
that phasic dopaminergic activity is the neural substrate of processing positive and negative
prediction errors (Holroyd & Coles, 2002; Schultz, Dayan, & Montague, 1997). Many
researchers have demonstrated that activity in ventral and dorsal parts of the striatum
corresponds with prediction error signals, suggesting an important role of the striatum in
reinforcement learning (Delgado, et al., 2000; O'Doherty, et al., 2004; Yin & Knowlton, 2006).
1.4 Underlying neural mechanisms: The Striatal Habit Learning System.
1.4.1. Neurobiological features of the striatum and the basal ganglia.
The striatum is a subcortical part of the forebrain formed by the putamen, the caudate nucleus and the nucleus accumbens (Gerfen & Wilson, 1996). It is a major input structure of the basal ganglia and receives direct inputs from the SNc, the VTA and many neocortical frontal structures involved in motor actions (Gerfen, 2000; Mink, 2003). The striatum projects through the globus pallidus (GP) and the substantia nigra pars reticulata (SNr) to the thalamus, which in turn projects back to the neocortex (Gerfen & Surmeier, 2011). Most likely due to its major involvement in movement disorders like Parkinson’s disease and Huntington’s disease, research concerning the basal ganglia has mainly focused on its functional role as a motor control unit (Redgrave, Prescott, & Gurney, 1999).
I. The Cortico-Basal Ganglia-Thalamocortical Circuitry modulates action selection
Early studies on the neurobiological function of the basal ganglia have suggested that
the basal ganglia facilitate the selection of a specific motor action while inhibiting other motor
actions (Alexander & Crutcher, 1990; Mink, 2003). The basal ganglia facilitate the selection of a specific action by modulating the execution of a motor response rather than encoding its specific details (Mink, 1996). This modulation occurs by signaling the most appropriate “Go” or “No-Go” response among competing motor actions represented in the motor cortex (Alexander, DeLong, & Strick, 1986; Hikosaka, 1989). It is generally assumed that
selecting the appropriate “Go”/”No-Go” motor responses relies on striatal synaptic changes
which are modulated by dopaminergic cell activity via D1 and D2 receptors (Gerfen, et al.,
1990). These effects of dopaminergic modulation occur in the basal ganglia via two main
pathways, i.e., a direct and an indirect pathway (Gerfen, et al., 1990). These pathways are
thought to oppositely excite or inhibit the thalamus, through the basal ganglia circuitry (Gerfen
& Wilson, 1996).
II. The direct “Go” pathway.
In the direct “Go” pathway, striatal neurons3 project to the internal segment of the
globus pallidus (GPi) which, without striatal firing, tonically inhibits the thalamus (Gerfen &
Wilson, 1996). The striatal projection neurons in the direct pathway are characterized by a
predominant expression of D1 receptors (Gerfen, et al., 1990). D1 receptors are primarily
excited by dopamine. Phasic dopaminergic cell firing thus excites striatal D1 receptors (Nicola,
Surmeier, & Malenka, 2000). Researchers have demonstrated that the excitation of D1 receptors aids the depolarization of the inhibitory striatal projections to the GPi (Gerfen, 2000). The resulting inhibition of the GPi suppresses its tonic inhibitory projection to the thalamus, thereby “disinhibiting” the thalamus and allowing it to be excited by other excitatory inputs (Hernandez-Lopez, et al., 1997; Mink, 2003). The circuitry of the direct pathway can thus be compared to proverbially “releasing the brakes” on the thalamus to select the most appropriate action.
III. The indirect “No-Go” pathway.
In the indirect “No-Go” pathway, striatal inhibitory neurons project to the external
segment of the globus pallidus (GPe) which tonically inhibits the internal segment of the
globus pallidus (GPi) (Gerfen & Wilson, 1996). The striatal projection neurons in the indirect
pathway are characterized by a predominant expression of D2 receptors (Gerfen, 2000). It has
been shown that the activation of D2 receptors suppresses depolarization of inhibitory striatal
projections (Hernandez-Lopez, et al., 2000). As a consequence, during phasic bursts of dopaminergic cell firing, activity in the indirect pathway is suppressed via D2 receptors (Gerfen, 2000). During dips in dopaminergic cell firing, however, the inhibitory striatal projections to the GPe become more active, with a net effect of further inhibiting the thalamus (Joel & Weiner, 1997). Cell activity in the indirect pathway can thus be compared to proverbially “pressing the brakes”.
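The opponent logic of the two pathways can be sketched in a few lines. This is a heavily simplified illustration inspired by, but much simpler than, the Frank (2005) model; the update rules, the `dopamine_gain` parameter and all values are our own assumptions.

```python
# Opponent Go/NoGo learning sketch (heavily simplified; inspired by but not
# equivalent to the Frank (2005) model; all parameters are assumptions).
# dopamine_gain scales positive prediction errors (phasic bursts); lowering it
# mimics striatal dopamine depletion in unmedicated Parkinson's disease.

def train(prediction_errors, dopamine_gain=1.0, alpha=0.2):
    go, nogo = 0.0, 0.0
    for delta in prediction_errors:       # +1 for reward, -1 for punishment
        if delta > 0:
            go += alpha * dopamine_gain * delta   # D1 / direct pathway learning
        else:
            nogo += alpha * (-delta)              # D2 / indirect pathway learning
    return go, nogo

errors = [+1, -1] * 20                    # equal numbers of rewards and punishments

go_healthy, nogo_healthy = train(errors, dopamine_gain=1.0)
go_depleted, nogo_depleted = train(errors, dopamine_gain=0.2)

print(round(go_healthy - nogo_healthy, 6))    # balanced Go/NoGo learning
print(round(go_depleted - nogo_depleted, 6))  # NoGo dominates under depletion
```

With intact phasic bursts, positive and negative prediction errors are weighted equally; scaling the bursts down leaves learning dominated by the indirect pathway, foreshadowing the Parkinson's disease findings discussed below.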
The integration of these findings concerning the basal ganglia’s modulatory role in motor action selection, together with the hypothesis that dopamine activity signals prediction error, led to the broader view that the basal ganglia might serve as a more general functional cognitive unit (Frank, 2005). According to this hypothesis, the basal ganglia modulate optimal action selection by processing value-related information in the striatum to predict the outcomes that follow actions (Frank, Seeberger, & O'Reilly, 2004; Frank, 2005).

3 The majority (90%-95%) of striatal neurons are GABAergic medium spiny neurons that send inhibitory projections to other nuclei in the basal ganglia circuitry (Gerfen, 2000).
1.4.2. Striatal processing of predicted reinforcement outcomes.
Evidence for the hypothesis that the basal ganglia modulate optimal action selection by processing value-related information in the striatum came from (1) behavioral studies using fMRI, (2) pharmacological intervention studies, (3) studies of patients with Parkinson’s disease and (4) computational neural network models.
I. Evidence from fMRI-studies.
First, it has been shown that activity in the striatum during classical and instrumental
conditioning tasks closely resembles dopaminergic prediction error signals (Delgado, et al.,
2000). Evidence for the behavioral relevance of the correlation between striatal activity and prediction error signals came from an fMRI study (Schönberg, et al., 2007). This study demonstrated that the magnitude of prediction-error-related dopaminergic activity in the striatum could distinguish subjects who learned to make optimal decisions4 from those who did not (Schönberg, et al., 2007). Furthermore, patterns of striatal activity could be mapped onto the computational actor-critic framework of instrumental conditioning, with the dorsal and ventral striatum dissociably corresponding to the actor and the critic, respectively (O'Doherty, et al., 2004).
II. Evidence from pharmacological studies.
Second, researchers have shown that the magnitude of the reward prediction error expressed in the striatum was modulated by administering dopaminergic drugs in an instrumental conditioning task (Pessiglione, et al., 2006). The dopamine-enhancing drug in this study, L-DOPA, is a dopamine precursor that increases dopaminergic function (Huang & Kandel, 1995); the dopamine antagonist, haloperidol, reduces dopaminergic function by blocking dopamine receptors (Frey, 1990). Results showed that subjects on L-DOPA were more likely to choose the most rewarding action than subjects treated with haloperidol (Pessiglione, et al., 2006).
4 Optimal, since reinforcement in this task was probabilistic: through trial and error learning over trials, participants had to learn which of four choices was most likely to be the most rewarding (Schönberg, Daw, Joel, & O'Doherty, 2007).
III. Evidence from patient studies.
Third, researchers have demonstrated that whether patients with Parkinson’s disease learn better from positive feedback (i.e., positive prediction errors) or from negative feedback (i.e., negative prediction errors) depends on the degree of dopamine dysfunction in the basal ganglia (Frank, Seeberger, & O'Reilly, 2004). Patients with Parkinson’s disease (PD) are characterized by a degenerating nigro-striatal dopamine system (Jankovic, 2008). As a result of this dopamine depletion in the striatum, PD patients typically show impaired planning, initiation and control of movements, and a wide variety of cognitive deficits (Albin, Young, & Penney, 1989; Dubois, et al., 1994; Maddox & Filoteo, 2001; Swainson, et al., 2000). Frank and colleagues (2004) used two implicit procedural cognitive learning tasks, a probabilistic selection task and a transitive inference task, to test how a depleted dopamine system affects value-related decision making in cognitive tasks.
In the probabilistic selection task (designed by Frank, Seeberger, & O'Reilly, 2004), three different stimulus pairs (AB, CD and EF) are shown in random order on the screen. Participants learn, through trial and error, to choose or avoid one stimulus in a given pair based on the probabilistic feedback contingencies over multiple trials. In a typical transitive inference task (Dusek & Eichenbaum, 1997), participants learn a hierarchical structure over a sequence of stimuli (A > B > C > D > E) based on positive (+) or negative (-) feedback following the individual adjacent pairs in the sequence (A+B-, B+C-, C+D- and D+E-). Importantly, in the implicit version of this task, it is assumed that participants have no explicit awareness of the underlying hierarchical relationships across stimuli (Frank, et al., 2005).
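The training phase of the probabilistic selection task can be made concrete with a short simulation. The reward probabilities below (80/20 for AB, 70/30 for CD, 60/40 for EF) are the values commonly reported for this task; the simple value-averaging learner and its parameters are our own illustrative assumptions, not the analysis used by Frank and colleagues.

```python
import random

# Training phase of the probabilistic selection task. The reward probabilities
# are the commonly reported values for this task; the value-averaging learner
# and its parameters are illustrative assumptions.
random.seed(1)
p_reward = {"A": 0.8, "B": 0.2, "C": 0.7, "D": 0.3, "E": 0.6, "F": 0.4}
pairs = [("A", "B"), ("C", "D"), ("E", "F")]

value = {s: 0.5 for s in p_reward}   # initial value estimate per stimulus
alpha = 0.1

for trial in range(600):
    left, right = random.choice(pairs)
    if random.random() < 0.1:                     # occasional exploration
        choice = random.choice((left, right))
    else:                                         # otherwise pick the higher value
        choice = left if value[left] >= value[right] else right
    feedback = 1.0 if random.random() < p_reward[choice] else 0.0
    value[choice] += alpha * (feedback - value[choice])

# After training, the learner should prefer the richer stimulus within each
# pair, the basis for the "choose A" / "avoid B" test-phase measures.
print(value["A"] > value["B"], value["C"] > value["D"], value["E"] > value["F"])
```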
Results demonstrated that PD patients off medication (i.e., with low levels of dopamine) were biased to learn better from negative feedback, whereas PD patients on medication (i.e., with higher levels of dopamine) were biased to learn better from positive feedback (Frank, Seeberger, & O'Reilly, 2004). Frank and colleagues (2004) suggested that the observed learning biases were directly related to higher or lower levels of dopamine in the basal ganglia. PD patients off medication have systematically low levels of dopamine, which biases the basal ganglia’s indirect “No-Go” pathway to be very active, resulting in better learning from negative feedback (Frank, Seeberger, & O'Reilly, 2004). PD patients on L-DOPA medication showed the opposite effect. Higher levels of dopamine in the basal ganglia facilitate “Go” learning by increasing the signal-to-noise ratio in the direct “Go” pathway, thought to aid the selection of the most appropriate response (Nicola, Surmeier, & Malenka, 2000). Consequently, PD patients on medication show better learning from positive feedback than from negative feedback (Frank, Seeberger, & O'Reilly, 2004).
IV. Neurocomputational accounts.
Fourth, PD patients are not only characterized by various movement impairments; they also show seemingly unrelated cognitive impairments, with a discrepancy between implicit learning impairments on the one hand and ‘frontal-like’ impairments on the other (Dubois, et al., 1994; Woodward, Bub, & Hunter, 2002). It has been shown that these two kinds of cognitive processing can be dissociated to a certain degree, since patients with frontal lesions do not show implicit learning deficits (Knowlton, Mangels, & Squire, 1996). To tie these seemingly unrelated cognitive deficits together, Frank and colleagues suggested that differences between PD patients in the value-related processing of cognitive ‘frontal-like’ tasks might be modulated by the Go/No-Go pathways in the basal ganglia, themselves modulated by dopaminergic input (Frank, 2005; Frank, Seeberger, & O'Reilly, 2004). These researchers implemented the neurobiological structure of the basal ganglia, together with the cellular dynamics of dopamine (see section 1.3), in a reinforcement learning model that could test the double modulatory role of the dopamine-basal ganglia circuitry in executing different cognitive tasks (Frank, 2005).
The ‘Go-NoGo’ model of cognitive reinforcement learning successfully replicated differences in behavioral performance between medicated and non-medicated PD patients in simulated versions of the ‘weather prediction’ and the ‘probabilistic reversal’ task (Frank, 2005). It also successfully replicated results in simulated versions of the above-mentioned ‘probabilistic selection’ and ‘transitive inference’ tasks (Frank, Seeberger, & O'Reilly, 2004). Later on, this neural network model of Go/No-Go learning in implicit cognitive tasks proved to be an excellent framework for examining individual variability in prediction error processing in healthy subjects (Frank, Woroch, & Curran, 2005; Frank, O'Reilly, & Curran, 2006; Frank, D'Lauro, & Curran, 2007; Simons, Howard, & Howard, 2010).
Overall, these researchers directly and indirectly provided evidence for a major role of the striatum in habit formation, which involves learning associations between stimuli (or contexts) and responses (S-R associations) over consecutive trials. These studies also provided convincing evidence for the hypothesis that the basal ganglia modulate optimal action selection by processing value-related information in the striatum. Furthermore, these investigations showed that value-related processing is dynamically modulated by the degree of dopaminergic input to the striatum.
1.4.3 Psychology traditionally distinguishes habit formation from goal-directed learning.
Psychologists who study the underlying mechanisms of instrumental conditioning have long distinguished habits from goal-directed actions (Balleine & Dickinson, 1998; Squire, 1992; Tolman, 1932). Habit formation is a prototypical instance of procedural memory and involves learning S-R associations without any explicit ‘conscious’ knowledge of how one’s actions determine the nature of the (rewarding) outcome (Yin & Knowlton, 2006). Habits can thus be seen as automatic ‘reflex-like’ behaviors elicited by the stimuli (or contexts) with which they are most positively associated. As discussed above, many studies have indicated that habit formation involves the striatum and the basal ganglia (Frank, 2005; O'Doherty, et al., 2004; Yin & Knowlton, 2006).
In contrast, goal-directed learning corresponds more to declarative (episodic) memory and involves learning an explicit ‘cognitive model’ of the environment in which actions are guided by explicit knowledge of action-outcome contingencies (Balleine & Dickinson, 1998; Daw, Niv, & Dayan, 2005; Squire, 1992). Goal-directed behavior can thus be seen as the integration of novel information into an already established cognitive model of the environment to flexibly guide behavioral adaptation (Balleine & Dickinson, 1998; Tolman, 1932).
1.5 Underlying neural mechanisms: The Goal-Directed Learning System.
1.5.1. Brain regions involved in goal-directed learning.
In contrast with the cortico-basal ganglia-thalamocortical circuitry of habit learning, it is far less well understood which specific brain regions are involved in goal-directed learning. Different studies have suggested different candidate regions, such as the prefrontal cortex (Daw, Niv, & Dayan, 2005), the orbitofrontal cortex (Valentin, Dickinson, & O'Doherty, 2007) and the dorsomedial striatum (Yin & Knowlton, 2006). Note that the dorsomedial striatum, like the dorsolateral striatum in habit learning, is embedded in a cortico-basal ganglia-thalamocortical loop; the difference is that the dorsomedial striatum belongs to the loop involving prefrontal associative cortices, whereas the dorsolateral striatum belongs to the loop involving sensorimotor cortices (Yin & Knowlton, 2006). These researchers suggested that the sensorimotor loop and the associative loop might constitute the underlying neural circuitry of habits and goal-directed behavior, respectively (Yin & Knowlton, 2006).
Furthermore, researchers have suggested an important role of the hippocampus and
its surrounding regions in the medial temporal lobe in goal directed learning (Packard &
McGaugh, 1996; Johnson & Redish, 2007; Shohamy & Adcock, 2010; Squire, Stark, & Clark,
2004). First, it has been shown that goal-directed decision making strategies are suppressed
following hippocampal lesions in rodents (Packard & McGaugh, 1996). Second, rodent maze
studies have observed hippocampal neural firing, not only during reward itself but also before
key decision points in the maze (Johnson & Redish, 2007). Similar hippocampal firing before
decision making has also been observed in monkeys (Wirth, 2009). Third, it has been shown that
humans with bilateral hippocampal damage fail to mentally represent new experiences, which
is a crucial feature of goal directed decision making (Hassabis, et al., 2007). These studies are in
line with recent neurobiological models of an important role of the hippocampus in novelty
processing, modulated by dopamine, to flexibly update already established knowledge
concerning the environment (Lisman & Grace, 2005; Lisman, Grace, & Duzel, 2011).
Taken together these studies suggest an important role of the hippocampus and the
medial temporal lobe (MTL), together with the prefrontal cortex and the dorsomedial
striatum, in flexible goal-directed decision making. In the previous sections we discussed that both the goal-directed learning system and the habit learning system can guide actions based on explicit or implicit knowledge about their consequences. Despite various methodological efforts to unravel the dynamics of these learning systems, many questions remain unanswered. Two major research questions we will further address are: (1) under what specific circumstances is each system used, or in other words, how does the habit learning system interact with the goal-directed learning system? (2) To what extent does learning from positive or negative reinforcement differ between these learning systems?
1.5.2. Interactions between habit and goal-directed learning?
As a result of extensive research there is now a consensus that the habit (or
procedural) learning system and the goal directed (or declarative) learning system are engaged
under different circumstances (Balleine & Dickinson, 1998; Daw, Niv, & Dayan, 2005; Gläscher, Daw, Dayan, & O'Doherty, 2010; Poldrack, et al., 2001). A crucial factor determining the shift between habitual and goal-directed behavioral control is the level of (reward) uncertainty that follows choice behavior, which in turn depends on the amount of training5 an agent has received (Daw, Niv, & Dayan, 2005). It is beneficial for organisms to rationally evaluate alternative action-outcomes (e.g., goal-directed learning) early in training or when confronted with a novel context (Gilbert & Wilson, 2007; Shohamy & Adcock, 2010). This allows them to rapidly and flexibly adapt to changing reinforcement contingencies, but comes at the cost of being time consuming and inefficient (Balleine & Dickinson, 1998). It can therefore be beneficial to shift to a less demanding system (e.g., habit learning) that learns slowly through repeated experience over many training trials (Barnes, et al., 2005; Yin & Knowlton, 2006). Relying on this system, however, comes at the cost of being inflexible to changes in reinforcement contingencies (Daw, Niv, & Dayan, 2005).

5 The more experience or training an organism has with the relevant parameters in a given environmental context, the more certainty it will have concerning the reinforcing aspects of the parameters in that context.
Evidence for this account came from animal studies, showing that a goal-directed
strategy, sensitive to outcome devaluation, is used when animals are trained moderately.
When trained extensively, this strategy is shifted to a response-based habit learning strategy
which is insensitive to outcome devaluation (Packard & McGaugh, 1996). Moreover,
physiological recording studies have demonstrated that firing patterns in the dorsolateral
striatum, an area crucially involved in habit learning, develop rather slowly (Barnes, et al.,
2005; Graybiel, 1998).
The concept of competing learning systems in humans was developed by Poldrack and
Packard (2003), following an influential fMRI study on how learning systems may compete
during classification learning (Poldrack, et al., 2001). In this study participants had to perform
a procedural ‘weather prediction’ task with probabilistic feedback which is thought to rely on
the implicit habit learning system (Knowlton, Mangels, & Squire, 1996). Performance on this
task was compared with an alternative ‘paired association’ task that emphasized explicit
declarative memory processes thought to rely on the medial temporal lobe and the
hippocampus (Squire, 1992). Participants had to alternate between these tasks and a baseline
task. As expected, results showed activation of the basal ganglia during the probabilistic
categorization task. An interesting finding was that the hippocampus was deactivated during
this task. To test whether this finding was task specific, activation patterns during the
procedural task and the declarative task were directly compared. Results suggested that
activation in the basal ganglia and hippocampal activation were negatively related (Poldrack,
et al., 2001).
Intrigued by their previous findings, Poldrack and colleagues (2001) tested a new group
of participants using the same procedural categorization task. In this experiment, event-related fMRI was used. Results initially demonstrated hippocampal activity and a lack of basal ganglia activity, but after a number of trials the hippocampus quickly became deactivated, whereas the basal ganglia became activated (Poldrack, et al., 2001). These
researchers suggested that the observed ‘competition’ between the striatal-based memory system and the hippocampal-based memory system might serve to arbitrate between two incompatible requirements of learning: the need for flexibly accessible knowledge on the one hand and the need to learn fast automatic responses in specific situations on the other. These
results suggest that the former is supported by the medial temporal lobe and the
hippocampus, whereas the latter is supported by the striatum (Poldrack, et al., 2001; Poldrack
& Packard, 2003).
1.5.3. Different value-based decision making across learning systems?
Given the premise that the procedural habit learning system relies on different neural
processes when compared to the declarative goal-directed learning system, the question
remains whether value-based decision making differs across these systems. A body of evidence suggests that value-based decision making during implicit procedural learning tasks is modulated by dopaminergic input to the striatum, which facilitates ‘Go’ and ‘No-Go’ learning in cognitive tasks (Frank, et al., 2004, 2005, 2007). A commonly used task to study
individual variability in learning from positive and negative feedback is the probabilistic
selection task designed by Frank, Seeberger, and O'Reilly in 2004 (see section 1.4.2, III). Using this task, researchers have demonstrated (1) that the degree of nigro-striatal dopamine depletion directly influences whether learning is better from positive or from negative feedback (Frank, Seeberger, & O'Reilly, 2004), (2) that separating positive learners from negative learners based on performance in this task successfully distinguishes the magnitude of event-related potentials (ERPs) related to error processing (Frank, Woroch, & Curran, 2005) and (3) that there is an important genetic factor contributing to biased reinforcement learning, with participants carrying the A1 allele of the D2 receptor gene polymorphism DRD2-TAQ-IA6 learning less efficiently to avoid negative feedback (Klein, et al., 2007). These findings regarding biased feedback learning in implicit cognitive tasks led Frank
and colleagues to the hypothesis that individual learning biases might result from
dopaminergic striatal changes rather than prefrontal dopaminergic changes (Frank, et al.,
2004, 2005, 2007). This hypothesis proved to be very successful in explaining under what
circumstances PD-patients and healthy humans differ in learning from positive and negative
feedback during procedural learning tasks (Frank, et al., 2004, 2005).
However, it is still unclear how decision making might be biased by feedback during ‘non-procedural’ tasks. Frank and colleagues (2007) addressed this question by creating ‘positive-learner’ and ‘negative-learner’ subgroups based on performance on the probabilistic selection task. Next, these subgroups were compared on an unrelated recognition memory task, using error-related negativity signals7 (ERN) as a dependent measure. Results showed that negative learners, classified on the basis of probabilistic selection task performance, had larger ERNs in the recognition memory task than positive learners (Frank, D'Lauro, & Curran, 2007). According to Frank and colleagues, these results suggest a common underlying mechanism for error processing across these tasks, thought to be modulated by striatal ‘Go-NoGo’ learning, with common frontal dopamine levels as a result (Frank, D'Lauro, & Curran, 2007).

6 People who carry the A1 allele of the D2 receptor gene polymorphism DRD2-TAQ-IA show a reduction in D2 receptor density of up to 30%, which has been linked to multiple addictive and compulsive behaviors (Ritchie & Noble, 2003).
Still, it could be argued that the recognition memory task might not have been all that ‘non-procedural’. The task was designed in such a way that it might rely on the same striatal processing as the probabilistic selection task: participants were instructed to make ‘speeded responses’ within 700 ms to promote errors, and throughout the task they received feedback on their reaction times, reminding them to make rapid judgments. As a consequence, during the recognition memory task participants may have relied on the same striatal habit processing system as during the probabilistic selection task, producing fast responses instead of ‘declaratively’ reflecting upon the stimuli. In the current study we applied the same cross-task comparison
methodology as used by Frank and colleagues (2007) to investigate more explicitly how
decision making following positive or negative feedback occurs across different learning
systems.
2. Aim of this study
In the current study we want to investigate whether decision making following positive
and negative feedback differs across procedural and declarative memory systems. Previous
research has suggested that there is a competition between the procedural striatal-based
memory system and the declarative hippocampal-based memory system (Poldrack, et al.,
2001). Recent insights derived from patient studies and neurocomputational models have
indicated that individual differences in value-related processing during feedback-learning tasks
are modulated by striatal synaptic changes through the ‘Go’ and the ‘No-Go’ pathway in the
basal ganglia (Frank, et al., 2004, 2005, 2007). These pathways are modulated by dopaminergic
cell firing (Gerfen, 2000).
7 Error-related negativity (ERN) is an event-related brain potential thought to originate in the anterior cingulate cortex, a brain area crucially involved in monitoring errors (Ridderinkhof, et al., 2004).
2.1. Research question
Until now, it has remained unclear whether organisms learn differently from positive or negative feedback in tasks that rely on declarative memory cortices (e.g., explicit declarative tasks) compared to tasks that rely on striatal processing (e.g., implicit procedural tasks). To address this research question, we adopted two well-established procedural learning tasks (Frank, Seeberger, & O'Reilly, 2004) and compared decision making performance on these tasks with feedback learning performance on two versions of an explicit declarative memory task.
2.2. Rationale
Previous research has demonstrated an important genetic factor in learning better
from either positive or negative feedback during implicit procedural tasks (Frank, D'Lauro, &
Curran, 2007; Klein, et al., 2007). Given this premise, we reasoned that the implicit
procedural learning tasks would provide a normative scale of individual value-based processing in
the striatum. Using this normative scale, we can further compare whether individual
participants show the same feedback learning bias towards positive or negative feedback
during the more declarative memory tasks. We designed these tasks so that they would most
likely rely on medial temporal cortices by (1) implementing a cue-stimulus contingency to
promote and facilitate declarative associative learning (Buckner, et al., 1995; Squire, Knowlton,
& Musen, 1993), (2) providing only a single learning trial to promote hippocampal activation,
previously observed early in learning (Poldrack & Packard, 2003; Poldrack, et al., 2001) and (3)
omitting time constraints to promote explicit rational reflection upon the presented stimuli.
2.3. Hypothesis
Since many researchers have stated that value-based decision making is directly
modulated by dopamine (Huang & Kandel, 1995; McClure, Berns, & Montague, 2003;
Pessiglione, et al., 2006; Schultz, Dayan, & Montague, 1997), we expect that participants who
learn better from positive feedback in one task, will also learn better from positive feedback in
another task. We hypothesize that this is more so for tasks that rely on the same memory
cortices (implicit procedural learning tasks) when compared to tasks that rely on different
memory cortices (implicit vs explicit learning task).
3. Method
3.1 Materials and Methods
3.1.1. Participants
Thirty healthy first year (5 male and 25 female) bachelor students in psychology
participated in this study on two separate testing sessions (2 tasks per session). Participants
received partial credits for participation in the experiment after they completed both sessions.
Two participants (1 male and 1 female) were excluded from analysis because they did not
attend one or both sessions and thus did not complete the full experiment.
3.1.2. Stimuli and Apparatus
We made use of Dell computers (Windows XP) with 17 inch monitors. Participants
faced the monitor at an approximate distance of 50 cm. E-prime 1.1 software was used for
programming the different tasks in the experiment and developing the stimuli (Schneider,
Eschman, & Zuccolotto, 2002). During all four tasks participants had to choose between stimuli
appearing in pairs (left and right) on the screen. During the two implicit procedural memory
tasks stimuli consisted of Japanese Hiragana characters (Frank, Seeberger & O'reilly, 2004),
whereas standardized pictures of well known objects, tools and fruits were used during the
two explicit memory tasks (Brodeur, et al., 2010). Stimuli were randomized across subjects and
tasks. Responses were registered using the keyboard. Participants had to press key 1 or key 0
to select the left or right stimulus, respectively. All stimuli appeared in color (pictures) or in black
font (Hiragana) on a white background.
3.1.3. General Procedure
Each participant performed four learning tasks over two separate testing sessions (two
in the first session and two in the second session). There were at least 72 hours between
testing sessions to avoid potential learning effects across sessions. All four tasks had a two-
alternative forced choice procedure, where participants had to choose one of two stimuli on
the computer screen by pressing one of two keys on the keyboard. Some stimuli had a
negative reinforcement value, whereas others had a positive reinforcement value. There were
two implicit learning tasks (i.e., a probabilistic selection task and a transitive inference task)
and two explicit learning tasks (two versions of a one shot learning task). The order of the tasks
was randomized both within and between sessions, but each session contained one implicit
learning task and one explicit learning task (Fig.2).
Figure 2. Example of randomized task order for a single subject. Tasks were randomized within and between sessions across subjects. There was always one implicit task and one explicit task in each session.
3.2 Implicit procedural learning tasks
3.2.1. Probabilistic Selection Task
I. Stimuli
During the probabilistic selection task (adopted from Frank, Seeberger, & O'reilly,
2004), pairs of visual stimuli that are not easily verbalized were used (Japanese Hiragana
characters, Fig.3). Following a fixation cross (duration 1000ms), Hiragana stimuli were shown
in black on a white background in 72 pt font. Responses were registered using key “1” (left on
the keyboard) to select the left stimulus or key “0” (right on the keyboard) to select the
stimulus on the right. Visual feedback appeared (duration 1.5 sec) following a choice. Either
the word “Correct!” or the word “Incorrect!” appeared centrally on the screen in green or red,
respectively (Courier New, pt 48). If no response was registered within six seconds, the words
“no response detected” appeared centrally in black (Courier New, pt 18).
II. Procedures
The probabilistic selection task consisted of two phases. Following a practice block of
10 trials, the learning phase began. During the learning phase
three different pairs of stimuli (AB, CD and EF) appeared randomly on the screen. Feedback
was given after each trial in a probabilistic manner (Fig.3A). Choosing stimulus A in the AB pair
led to positive feedback in 80% of AB trials, whereas choosing stimulus B led to negative
feedback in these trials. The CD and EF pairs were less reliable. Choosing stimulus C led to
positive feedback in 70% of CD trials and choosing stimulus E led to positive feedback in 60 %
of EF trials. Over the course of the learning phase, participants learned to choose A, C and E
above B,D and F. To make sure participants learned the correct associations between stimuli
and feedback, a performance evaluation had to be met before advancing to the next phase.
Evaluation occurred following each learning block of 60 trials. Because of the different
probabilistic nature of each stimulus pair, different performance criteria were chosen. In the
AB pair, participants had to choose A above B at least in 65% of the trials. In the CD pair, C had
to be chosen above D in 60% of the trials. In the last pair, stimulus E had to be chosen in 50%
of the trials8. Participants advanced to the test phase if all these criteria were met, or after six
learning blocks (360 trials). During the test phase (Fig.3B), training pairs and new pairs were
randomly shown on the screen without feedback. New pairs contained all other possible
combinations of stimuli (AC, DF, BE, …). Participants were instructed to choose instinctively
when confronted with novel pairs. Each test pair was presented six times.
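The probabilistic feedback schedule described above can be sketched as follows (an illustrative Python sketch, not the E-Prime code used in the experiment; the function name and the probability table are ours, with the probabilities taken from the text):

```python
import random

# Probability that choosing the optimal stimulus of each pair yields "Correct!"
FEEDBACK_PROBS = {"AB": 0.80, "CD": 0.70, "EF": 0.60}

def give_feedback(pair, chose_optimal, rng=random):
    """Return True for positive feedback, False for negative feedback.

    On e.g. 80% of AB trials choosing A is rewarded; on the remaining
    20% the same choice yields negative feedback (and vice versa for B).
    """
    p = FEEDBACK_PROBS[pair]
    reliable_trial = rng.random() < p
    return chose_optimal if reliable_trial else not chose_optimal
```

Over many AB trials, a participant who always chooses A would thus receive positive feedback on roughly 80% of trials, which is what makes A and B learnable only in a probabilistic, trial-and-error fashion.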
3.2.2 Transitive Inference Task
I. Stimuli
During the (implicit) transitive inference task (Frank, Rudy, Levy, & O'Reilly, 2005), the
same type of visual stimuli (Japanese Hiragana characters) were used as in the probabilistic
selection task. To avoid confusion and confounding learning effects different characters were
used across the probabilistic selection task and the transitive inference task. Both the order
and the content of the Hiragana characters were counterbalanced. Fixation, stimulus
presentation and feedback presentation were exactly the same as in the probabilistic selection
task.
8 Note that stimulus E is correct in only 60% of EF trials, which makes this pair particularly difficult to learn. We
implemented a 50% (chance level) performance criterion to ensure that participants who consistently chose the slightly less rewarded stimulus F over E could not go through to the testing phase.
Figure 3. (A) Example of the stimulus pairs (Hiragana) used during the probabilistic selection task, with their feedback probabilities: AB (80%/20%), CD (70%/30%) and EF (60%/40%). One pair was shown per trial; in actuality, stimuli were randomized across participants. (B) During test, all other combinations of pairs (choose A: AC, AD, AE, AF; avoid B: BC, BD, BE, BF), together with all training pairs, appeared randomly. During test no feedback was given. Performance was analyzed on all new pairs containing A (positive learning) or B (negative learning).
II. Procedures
During the (implicit) transitive inference task, participants had to learn an underlying
ordinal sequence of stimuli (A>B>C>D>E) based on separate pairs of adjacent elements in the
sequence (AB, BC, …). During this task four pairs of stimuli (Fig.4A) were presented (A+B-, B+C-,
C+D- and D+E-). The + and - signs represent positive and negative feedback, respectively.
Again, as in the probabilistic selection task, participants had to get through a learning segment
before advancing to the testing segment. In the learning segment there were three phases of
blocked trials. The first phase consisted of eight (random) blocks of four trials; in each block a
single stimulus pair was shown for four trials. Thus, the first block could, for example, consist of four
A+B- trials, the second block of four C+D- trials, and so on. Phase two consisted of
16 (random) blocks of two trials. The third phase was the performance evaluation phase, in
which 32 trials of pairs were randomly shown on the screen, still with feedback.
Figure 4. (A) Example of the adjacent stimulus pairs (Hiragana) used during the transitive inference task: A+B-, B+C-, C+D- and D+E-. Stimuli were randomized across participants and differed from those used in the probabilistic selection task. (B) Example of the associative strength hypothesis (Rudy, Frank, & O'Reilly, 2003): during training participants implicitly learn to strongly associate A with positive reinforcement, while E becomes associated with a lack of positive reinforcement, previously shown to induce dopamine dips (Schultz, 2002). These net associative values then “bleed over” to the other adjacent pairs, such that B in the BC pair acquires a stronger positive association, whereas D in the CD pair acquires a stronger negative association. (C) During test, all training pairs (top: AB, BC; bottom: CD, DE) and two novel ‘transitive pairs’ (AE, BD) were randomly presented eight times each. Performance was analyzed on top pairs AB & BC (positive learning) and bottom pairs CD & DE (negative learning).
To make sure that participants learned the correct associations between stimuli and
feedback, we set a performance criterion of 75% accuracy before participants could advance
to the test segment. In this segment all training pairs and two new transitive pairs
(AE and BD) were randomly shown eight times each, without feedback (Fig.4C). Following the
transitive inference task, participants were given a questionnaire to assess their explicit
awareness of the logical hierarchy of the stimuli, and to determine whether strategies were
used to respond to the novel test pairs.
3.3 Explicit episodic memory tasks
3.3.1 One Shot Learning Task (version 1 & 2)
I. Stimuli
In the one-shot learning tasks, a different set of stimuli was used. Instead of using
unknown Japanese Hiragana characters that are relatively difficult to verbalize, highly
recognizable standardized pictures of well known objects, tools and fruits were used (Brodeur,
et al., 2010). Following fixation (duration 1 sec), a cue9 (A) appeared (160x160 pixels) centrally
on the screen above the fixation cross. After 2 seconds a pair of target stimuli (BC) appeared
(160x160 pixels) left and right underneath the cue (A) on the screen. Responses were
registered using key “1” to select the left stimulus or key “0” to select the stimulus on the
right10. Because participants had only a single trial to learn the correct stimulus-stimulus
association, time constraints to make a choice were omitted. Visual feedback was provided
(duration 1.5 sec) after a choice was made. Either the word “Correct!” or “Incorrect!” was
printed centrally on the screen in green or red, respectively (Courier New, pt 48). There were
144 unique pictures of objects, tools or fruits used per task. Both stimulus content and order
were counterbalanced.
II. Procedures
During the one shot learning tasks there was a learning segment (Fig.5A), which
consisted of two learning blocks of 24 trials, followed by a memory retrieval test phase that
also consisted of two blocks of 24 trials. During the learning segment, participants only had
one shot at learning to match the cue (A) with one of two target stimuli (B or C). Following
their choice, positive or negative feedback appeared randomly on the screen. After 24 learning
trials there was a break, before the next learning block of 24 trials started. We presented the
9 Cues were implemented in the one shot learning tasks to promote and facilitate (explicit) declarative
associative learning and retrieval (Buckner, et al., 1995; Squire, Knowlton, & Musen, 1993)
10 Because we wanted to make sure that participants learned (to avoid or approach) the chosen stimuli, no opportunity was given to look back at the correct stimulus when subjects responded incorrectly.
same two blocks of 24 trials to the participants in the memory retrieval phase, without
feedback (Fig.5B). The order of trials, as well as the location of the target stimuli, was
randomized within each block. Both one shot learning tasks were exactly the same, though
different sets of stimuli were used in each. The two tasks were never administered in the same
session (Fig.2).
3.4 Data Analysis
3.4.1 Probabilistic Selection Task
I. Data filtering
Since we were mainly interested in the extent to which subjects learned from positive and
negative feedback following their choices, we first had to make sure participants had learned the
basic task. Although performance criteria were implemented to address this issue, some
participants performed worse on the training pairs during the testing segment than during
the learning segment. To overcome this issue, we excluded participants who did not
Figure 5. (A) Example of a learning trial during training (two blocks of 24 trials). Each trial was presented only once. When a choice was made, stimuli disappeared and feedback was given. (B) Example of a correctly solved test trial (two blocks of 24 trials); the procedure was the same as during training, except that feedback was omitted.
perform better than chance during the test phase in the easiest training pair conditions (AB
pair). We rationalized that if these participants could not reliably choose A above B in this pair,
results in novel pairs were meaningless. By filtering the data in this manner, three participants
were excluded because they did not perform better than chance level (50%) in the easiest
training pair.
II. Test Pair Analysis
We first wanted to test whether there were systematic differences across subjects in
learning from positive reinforcement (choose A) versus learning from negative reinforcement
(avoid B) in this task. To test for the differences described above, we performed a paired
sample Student t-test. The degree to which participants learned from
positive reinforcement (choosing A) was operationalized as the performance level on the novel
pairs involving A (AC, AD, AE and AF)(Fig.3B). Comparatively, the degree to which participants
learned from negative reinforcement (avoiding B) was operationalized as the performance
level on the novel pairs involving B (BC, BD, BE and BF) (Fig.3B). We measured effect sizes using
Cohen’s d, where d ≥ 0.2 represents a small effect, d ≥ 0.5 a medium effect and d ≥
0.8 a large effect.
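As a sketch of this comparison, the paired t statistic and Cohen's d could be computed as follows (an illustrative helper of our own; d is standardized here by the SD of the difference scores, sometimes written d_z, since the thesis does not specify its standardizer):

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_and_d(pos, neg):
    """Paired Student t statistic and Cohen's d for two matched accuracy vectors.

    pos, neg: per-participant accuracies in the positive- and negative-learning
    conditions (choose A vs avoid B). d = mean difference / SD of differences.
    """
    diffs = [p - n for p, n in zip(pos, neg)]
    m, sd, n = mean(diffs), stdev(diffs), len(diffs)
    t = m / (sd / sqrt(n))
    d = m / sd
    return t, d
```

For example, with made-up data, `paired_t_and_d([0.8, 0.7, 0.9, 0.6], [0.6, 0.7, 0.8, 0.5])` gives t ≈ 2.45 and d ≈ 1.22, which by the conventions above would be a large effect.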
III. Training Analysis
In the learning phase of the probabilistic selection task, different performance criteria
were implemented to make sure participants learned the correct stimulus-reinforcement
associations. If participants failed to reach these performance criteria, they had to do the
learning phase again until they finally reached the performance criteria. Thus, some
participants performed more learning trials than others. We therefore checked, using general
linear model regression analysis with continuous measures, whether general test performance,
performance on the easiest training pair (AB) and performance on either choosing A or
avoiding B could be explained by the number of learning trials. For regression analysis we
measured effect sizes using eta squared (η2), where η2 = 0.01 represents a small effect, η2 =
0.06 a medium effect and η2 ≥ 0.14 a large effect.
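For a single-predictor regression of this kind, η2 can be obtained from the regression sums of squares (a minimal least-squares sketch of our own, not the analysis software used in the thesis):

```python
from statistics import mean

def eta_squared(x, y):
    """Eta squared for a one-predictor linear regression: SS_model / SS_total."""
    mx, my = mean(x), mean(y)
    # Ordinary least-squares slope and intercept
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    pred = [intercept + slope * a for a in x]
    ss_model = sum((p - my) ** 2 for p in pred)
    ss_total = sum((b - my) ** 2 for b in y)
    return ss_model / ss_total
```

With a single continuous predictor this η2 equals R², the squared Pearson correlation between predictor and outcome.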
IV. Session Analysis
Some participants performed the probabilistic selection task in the first session and the
transitive inference task in the second session, whereas other participants did it the other way
around. To minimize learning effects between these two visually similar tasks, we made sure
there were 72 hours between sessions. Nevertheless, it is possible that there were some non-
specific transfer effects across sessions that are unrelated to the particular nature of each task.
We therefore checked, using general linear model regression analysis with the categorical
between-subjects variable session as predictor, whether general test performance,
performance on the easiest training pair (AB) and performance on either choosing A or
avoiding B could be explained by which session participants were in.
3.4.2 Transitive Inference Task
I. Data filtering
As in the probabilistic selection task (PS11), we excluded participants who did not
perform better than chance on the easiest training pairs (AB & DE) during the testing segment.
As a result, we filtered out two participants who did not, on average, perform better than 50%
across anchor pairs AB & DE. The analyses described below apply to the remaining
participants (n = 26).
II. Test Pair Analysis
Similarly to the PS-task, we investigated whether there were systematic differences
across subjects in learning from positive feedback versus learning from negative feedback.
Similar to previous studies using an implicit version of the transitive inference task and in
accordance with the associative strength hypothesis12, we rationalized that stimuli at the top
of the hierarchy (A, B) have net positive associations, whereas stimuli at the bottom of the
hierarchy (D, E) have net negative associations (Frank, O'Reilly, & Curran, 2006; Frank,
Seeberger, & O'reilly, 2004; Rudy, Frank, & O'Reilly, 2003). As a result, learning from positive
reinforcement ameliorates performance on the AB and BC pairs, while learning from negative
reinforcement ameliorates performance on the CD and DE pairs. Therefore, we
operationalized the degree to which participants learned from positive feedback as the
performance level on AB & BC pairs during test. Similarly, the degree to which participants
learned from negative feedback was operationalized as the performance level on CD & DE
pairs during test. We also checked whether subjects performed better on the ‘easier’ anchor
pairs (AB & DE) than on inner pairs (BC & CD). Moreover, pairwise analyses between
separate training pairs during test were conducted to gain more detailed insight into the
11 For clarity we will use the following abbreviations to refer to the different tasks: PS-task for the probabilistic selection task, TI-task for the transitive inference task and OSL1/OSL2-task for the two versions of the one shot learning task.
12 According to the associative strength hypothesis, the top and bottom pairs, respectively AB and DE, “anchor” the development of associative values: during training agents implicitly learn to strongly associate A with positive reinforcement (since choosing A always yields positive feedback), while E becomes associated with strong negative reinforcement (since choosing E always yields negative feedback). These net associative values then “bleed over” to the other adjacent pairs, such that B in the BC pair has a stronger positive association, whereas D in the CD pair has a stronger negative association, even though B and D are positively (negatively) reinforced during half of the trials (see Rudy, Frank, & O'Reilly, 2003 for a detailed description of how this occurs).
pattern of previous results. To test differences between conditions, a paired sample Student t-
test was conducted. Novel test pairs AE and BD were analyzed separately since these pairs
could be solved by either learning to choose A and B or by avoiding D and E.
III. Training Analysis
As in the PS task, a performance criterion was implemented. As a consequence,
some participants performed more learning trials than others. We
checked, using general linear model regression analysis with continuous measures, whether
general test performance and performance on top (AB & BC) or bottom (CD & DE) pairs could be
explained by the number of learning trials.
IV. Session Analysis
As in the PS task, we checked for possible confounding effects of session using
general linear model regression analysis with categorical between-subjects variables. General
test performance and performance on top and bottom pairs were tested using the factor
session as predictor.
V. Awareness Questionnaire
After completing the transitive inference task participants were asked to fill in a
questionnaire (translated from Frank, Seeberger, & O'reilly, 2004) asking about the familiarity
with Japanese Hiragana characters and, importantly, the degree to which participants explicitly
became aware of the underlying hierarchy in the task. None of the participants indicated
having any experience with the Hiragana characters. Surprisingly, when asked whether
participants “had the impression that there was some kind of logical rule, order or hierarchy
between the symbols” (Frank, Seeberger, & O'reilly, 2004), 11 out of 28 participants indicated
becoming aware of the underlying order or hierarchy between the symbols. The remaining 17
participants did not notice the underlying hierarchy in the task. Because we are generally
interested in differences between learning from reinforcement across more implicit and more
explicit learning tasks, it would be interesting to see whether there are any differences in
learning performance between implicit and more explicit learning within one task. We
therefore reanalyzed the data checking whether the degree of awareness (implicit or explicit
learning) predicts differences in performance level in the transitive inference task. We tested,
using one-way between subjects ANOVAs, whether implicit learners differed significantly from
explicit learners on general test performance, performance on top pairs, bottom pairs and
novel pairs. Furthermore we carried out 2x2 ANOVAs for the between subjects factor group
(implicit, explicit) and the within subjects factor hierarchy (Top, Bottom) to check for
interaction effects between explicit/ implicit learners and positive (top) or negative (bottom)
learning. We also performed pairwise Student’s t-tests to check whether the previously
observed differences between anchor pairs and inner pairs remained significant for the implicit
and the explicit subgroup.
3.4.3 One Shot Learning Task (version 1 & 2)
I. Data filtering
We excluded participants who did not perform better than chance (50%) during the
testing phase. No participant performed at or below chance level, in either the first or the
second OSL task. Participants were instructed to associate
a cue-object and one of the target-objects based on feedback and to do this “as accurately as
possible”. As a consequence there were no specific time constraints during the learning or
testing blocks during the one shot learning tasks. To make sure participants did not take
advantage of this lack of time constraints to use all sorts of strategies during learning, we
checked reaction times during the learning segment. We excluded participants with an
outlying average reaction time during the learning segment (> M + 2SD). This was the case for
three participants in the first one shot learning task and for two participants in the second.
The analyses described below apply to the remaining participants (n = 25 for the first OSL task
and n = 26 for the second OSL task).
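The reaction-time filter described above (excluding participants whose mean learning-phase RT exceeded the group mean plus two standard deviations) can be sketched as follows (a hypothetical helper of our own, with illustrative RTs in milliseconds):

```python
from statistics import mean, stdev

def rt_outliers(mean_rts):
    """Return indices of participants whose mean RT exceeds M + 2*SD of the group.

    mean_rts: one average learning-segment reaction time per participant.
    """
    m, sd = mean(mean_rts), stdev(mean_rts)
    cutoff = m + 2 * sd  # the M + 2SD criterion from the text
    return [i for i, rt in enumerate(mean_rts) if rt > cutoff]
```

Note that this is a group-level cutoff: the threshold is computed once from all participants' means, so a single very slow participant raises the cutoff for everyone.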
II. Test pair Analysis
We used the same methodology in the one shot learning tasks as in the probabilistic
selection task and the transitive inference task. Again, we examined whether there were
systematic differences across subjects in learning from positive reinforcement versus learning
from negative reinforcement. To do so we tested, using paired sample Student t-tests,
whether there were differences across subjects between recognition accuracy following
positive feedback and recognition accuracy following negative feedback.
III. Session Analysis
Similar to previous tasks, we tested for possible confounding effects of session using
general linear model regression analysis with categorical between-subjects variables. General
test performance, recognition accuracy after positive feedback and negative feedback were
analyzed using the factor session as predictor.
3.4.4 Cross-task Comparisons
In general, we wanted to investigate how participants performed on learning from
positive and negative feedback across tasks. Previous studies have suggested that there is an
important genetic factor that determines the inter-individual variability in learning better from
either positive or negative feedback (Frank, D'Lauro, & Curran, 2007; Klein, et al., 2007).
Therefore, we hypothesized that participants who learn better from positive feedback in one
task should also learn better from positive feedback in another task. We expected this to be
more the case for tasks that rely on the same memory cortices (implicit procedural learning
tasks) than for tasks that rely on different memory cortices (for example, the PS-task
compared to the OSL1 task). To test this assumption we divided participants into two
subgroups (positive and negative learners, see Table 1) for each separate task, comparable to
Frank, D'Lauro, & Curran (2007). By doing so, we could check whether subgroups (positive
learners and negative learners) in one task, could predict value-related differences in other
tasks. We performed analyses across tasks to answer two main questions: (1) Do participants
who perform well on one task also perform well on another task? (2) Is the inter-individual
variability in biased feedback learning robust across tasks? To answer these questions we
examined how between-task performance was related within sessions, between sessions,
and both within and between implicit and explicit tasks.
1. Do participants who perform well on one task also perform well on another task?
To investigate whether the individual performance rank of participants was consistent
across tasks we conducted Spearman’s Rank Correlations on general test performance
between tasks in session 1 and 2, between tasks across sessions 1 and 2 and across implicit
and explicit tasks, regardless of session. To keep analyses across tasks as comparable as
possible, participants who were excluded from the separate task analyses were also
excluded from cross-task analyses. Results described below apply to the remaining
participants13 (n = 21).
2. Is the inter-individual variability in learning better from either positive or negative feedback
robust across tasks?
To investigate whether the positive or negative learning bias within subjects was
robust across tasks, we had to transform performance rates on positive and negative learning
conditions to a single score. We did so by simply subtracting performance rates of the negative
learning condition from performance rates of the positive learning condition. The range of this
bias rate is between (-1, 1). Negative bias rates represent subjects that learned better from
13 Three participants were excluded from the PS-task and two participants from the TI task. Three and two participants were excluded from the first and the second OSL task, respectively. Three participants were excluded in more than one task, resulting in a total of seven participants excluded from cross-task comparisons on general test performance.
Task | Subgroup | Criterion | Sample size
Probabilistic Selection Task | Positive learners | Choose A performance > Avoid B performance | n = 12
Probabilistic Selection Task | Negative learners | Avoid B performance > Choose A performance | n = 12
Transitive Inference Task | Positive learners | Top pair performance (AB & BC) > Bottom pair performance (CD & DE) | n = 10
Transitive Inference Task | Negative learners | Bottom pair performance (CD & DE) > Top pair performance (AB & BC) | n = 11
One Shot Learning Task (1) | Positive learners | Performance after positive feedback > Performance after negative feedback | n = 16
One Shot Learning Task (1) | Negative learners | Performance after negative feedback > Performance after positive feedback | n = 8
One Shot Learning Task (2) | Positive learners | Performance after positive feedback > Performance after negative feedback | n = 18
One Shot Learning Task (2) | Negative learners | Performance after negative feedback > Performance after positive feedback | n = 4
Table 1. Inter-individual variability across learning tasks in learning better from positive feedback than from negative feedback (positive learners) and in learning better from negative feedback than from positive feedback (negative learners). Participants who scored equally well on positive and negative feedback were excluded, since they do not add relevant information to the analyses derived from this classification.
negative feedback, whereas positive bias rates represent subjects that learned better from
positive feedback. Again we conducted Spearman’s Rank Correlations on the bias rate
between tasks in session 1 & 2, between tasks across sessions 1 & 2 and across implicit and
explicit tasks, regardless of session. Furthermore, we wanted to check whether biased learners
in one task could predict biased learning in the other tasks. To do so we examined whether the
factor group (Positive/Negative Learners), derived from the implicit PS task, could predict the
degree of bias rate in the other tasks, using general linear model regression analysis.
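The bias rate and its cross-task comparison can be sketched as follows (illustrative helpers of our own; the rank-difference formula for Spearman's rho used here is the simple untied-ranks form, and the thesis does not name the software it actually used):

```python
def bias_rate(pos_acc, neg_acc):
    """Bias toward positive-feedback learning: positive minus negative accuracy.

    Accuracies are proportions in [0, 1], so the bias rate lies in (-1, 1);
    negative values indicate better learning from negative feedback.
    """
    return pos_acc - neg_acc

def spearman_rho(x, y):
    """Spearman's rank correlation via 1 - 6*sum(d^2)/(n*(n^2-1)); assumes no ties."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

With tied scores (plausible with accuracy data), the rank-difference formula is only approximate; a tie-corrected implementation would be needed for exact values.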
To avoid losing valuable information about participants' learning bias in the implicit and
explicit cross-task analyses, we were less conservative in excluding participants. Contrary to
the previous cross-task comparisons, we excluded only those participants relevant to the
specific between-task comparison14. This gave us more power in the separate between-task
comparisons, although it makes comparisons across all tasks more difficult. Nevertheless, since
we are mainly interested in the relatedness of the bias rate between specific tasks, we argue
that it is beneficial to have more power.
4. Results
4.1 Separate Task Analysis
4.1.1 Probabilistic Selection Task
Analysis across all subjects (Fig. 6A) on choose A performance compared to avoid B
performance (Table 2) did not show significant differences [t(24) = -0.104, p = 0.918, two-tailed, d
= 0.03]. Training analysis did not show significant effects of the number of learning trials on
general test performance [F(1,23) = 0.058, p = 0.812, η2 < 0.001 ], AB pair performance [F(1,23)
= 0.238, p = 0.630, η2 = 0.01], general choose A performance [F(1,23) = 0.171, p = 0.683, η2 =
0.07] (Fig. 6B) or general avoid B performance [F(1,23) = 0.017, p = 0.898, η2 < 0.001](Fig. 6C).
Session analysis did not show significant effects of session on general test performance
[F(1,23) = 1.010, p = 0.325, η2 = 0.04], AB pair performance [F(1,23) = 0.446, p = 0.511, η2 =
0.01], general choose A performance [F(1,23) = 1.314, p = 0.263, η2 = 0.05] or general avoid B
performance [F(1,23) = 2.278, p = 0.145, η2 = 0.09].
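The paired comparisons reported above (e.g., choose A vs. avoid B) follow the textbook paired t statistic; the sketch below is a generic implementation, not the analysis script used in the study, and the Cohen's d convention (mean difference over the SD of the differences) is one of several in use.

```python
import math
import statistics

def paired_t(a, b):
    """Paired t statistic (df = n - 1) and Cohen's d for matched scores.

    d is computed as mean difference / SD of the differences,
    one common convention for within-subject designs.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)          # sample SD (n - 1 denominator)
    t = mean_d / (sd_d / math.sqrt(n))
    d = mean_d / sd_d
    return t, d
```

With equal choose-A and avoid-B accuracies, the t statistic is near zero, matching the null result reported above.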
14 For example, we did not exclude those participants previously excluded due to possible confounding strategy use in one of the OSL tasks when correlating the PS task with the TI task.
                   n    Mean    SD
AB pairs           25   97.3%   5.63
Choose A           25   72%     23.79
Avoid B            25   72.6%   16.2
# Training trials  25   160     93.06
Table 2. Descriptive statistics of test performance during the probabilistic selection task on AB pairs, choose A, avoid B and training trial frequency.
[Figure 6: (A) “Probabilistic Selection Task”: choose A vs. avoid B test performance (%), n.s.; (B) “PS Regression Analysis - Choose A”: test performance (%) against training trial frequency, F(1,23) = 0.171, p = 0.683; (C) “PS Regression Analysis - Avoid B”: test performance (%) against training trial frequency, F(1,23) = 0.017, p = 0.898.]
Figure 6. (A) No significant differences were observed across subjects between choose A and avoid B test performance. Data are presented as Mean ± SEM (n.s. = non-significant). (B) The number of learning trials did not significantly predict choose A performance. (C) Learning trial frequency did not significantly predict avoid B performance.
4.1.2 Transitive Inference Task
Across all subjects comparisons between top pairs AB & BC and bottom pairs CD & DE
(see Table 3 for descriptive statistics) did not show significant differences [t25 = 0.191, p =
0.850, two-tailed, d = 0.05 ] (Fig. 7A). Separate pair analysis on the performance level between
“anchor” pairs AB and DE showed no significant differences [t25 = 0.611, p = 0.547, two-tailed,
d = 0.17]. Accordingly, no significant difference was observed between “inner” pairs BC and CD
[t25 = -0.113, p = 0.811, two-tailed, d < 0.01]. Pairwise comparisons
between anchor pairs and inner pairs (Fig. 7B) did demonstrate a significant difference [t25
=3.757, p = 0.001, two-tailed, d = 0.7]. These results suggest that, on average, differences in
solving pairs correctly are mainly due to whether it is an inner or anchor pair rather than its
place (higher or lower) in the hierarchy.
Cross-subjects comparisons between novel pairs AE and BD (Fig. 7C) did not show
significant differences [t25 < 0.001, p > 0.05 , two-tailed, d < 0.01]. Furthermore, training
analysis did not demonstrate significant effects of the number of learning trials on general test
performance [F(1,24) = 3.573, p = 0.122, η2 = 0.096], general performance on top pairs AB &
BC [F(1,24) = 0.942, p = 0.341, η2 = 0.04] or general performance on bottom pairs CD & DE
[F(1,24) = 0.173, p = 0.681, η2 < 0.001]. Session analysis did not show significant effects across
subjects of the factor session on general test performance[F(1,24) = 0.887, p = 0.356, η2 =
0.036], performance on top pairs AB & BC [F(1,24) = 0.381, p = 0.543, η2 = 0.015] or on
performance on bottom pairs CD & DE [F(1,24) = 1.012, p = 0.325, η2 = 0.04].
n Mean SD
Top pairs AB & BC 26 87.9% 15.69
Bottom pairs CD & DE 26 87% 17.61
Anchor pairs 26 95.91% 14.13
Inner pairs 26 78.85% 21.32
Novel pair AE 26 93.75% 20.39
Novel pair BD 26 93.75% 19.76
# Training trials 26 118.15 19.77
Table 3. Descriptive statistics of test performance during the transitive inference task on top pairs, bottom pairs, anchor pairs, inner pairs, novel pairs and training trial frequency.
[Figure 7: (A) “Top-Bottom Pairs TI”: AB & BC vs. CD & DE, n.s.; (B) “Anchor-Inner Pairs TI”: anchor vs. inner pairs, ***; (C) “Novel Pairs TI”: AE vs. BD, n.s.; y-axis: % performance test.]
Figure 7. (A) No significant differences were observed across subjects between top and bottom pair test performance. (B) Participants scored significantly better (p = 0.001) on anchor pairs when compared to inner pairs. (C) No significant differences were observed across subjects between novel pairs AE and BD. Data are presented as mean ± SEM (n.s. = non-significant, *** = p < 0.001).
When we divided participants15 into an implicit and explicit subclass based on the
awareness questionnaire, results indicated a significant difference between implicit and
explicit learners on general test performance [F(1,26) = 7.379, p = 0.012, η2 = 0.22].
Participants who explicitly learned the hierarchy between symbols generally performed better
during test relatively to participants who learned implicitly (see Table 4 for descriptive
statistics). This effect was still significant without the two, previously excluded, weak
performers16 [F(1,24) = 5.722, p = 0.025, η2 = 0.19]. Further analysis (Fig.8) suggests that the
above mentioned differences are driven by positive learning, since explicit learners perform
significantly better when compared to implicit learners on top pairs AB & BC [F(1,24) = 8.937, p
< 0.01, η2 = 0.27], but not on bottom pairs CD & DE [F(1,24) = 1.635, p = 0.213, η2 = 0.06] or
novel pairs AE & BD [F(1,24) = 0.973, p = 0.334, η2 = 0.04]. However, both the explicit learning
group and the implicit learning group did not perform significantly better on top pairs when
compared to bottom pairs [t10 = 1.341, p = 0.209, d = 0.48 and t14= -0.246, p = 0.809, d = 0.11,
respectively]. Furthermore, no significant interaction effect for group x hierarchy was observed
[F(1,24) = 0.553, p = 0.464, η2 = 0.02 ].
15 Note that we first included all participants in the awareness analysis to check whether there was a general difference in performance between implicit and explicit learners. Both participants who were initially excluded from analysis indicated that they were not aware (implicit group) of the underlying order in the task. Since we do not know whether the low performance of these participants is related to implicit learning or to general confusion during the testing phase, we tested differences between implicit and explicit learners both with and without them.
16 These participants were again excluded from further analysis, to make sure that effects are driven by differences between implicit and explicit learning, not by outlying values (possibly due to test confusion) in the implicit group.
Group comparisons between implicit and explicit learners (Fig .8) did not show a
significant difference on anchor pairs [F(1,24) = 1.045, p = 0.317, η2 = 0.043]. However, we did
observe a significant difference between implicit and explicit learners on inner pairs [F(1,24) =
8.555, p < 0.01, η2 = 0.26]. Similar to the previous analysis, implicit learners performed
significantly better on anchor pairs when compared to inner pairs [t14 = 3.757, p = 0.002, two-
tailed, d = 1.47], but no significant difference was observed between performance on anchor
pairs and inner pairs for the explicit learning group [t10 = 1.627, p = 0.135, two-tailed, d = 0.65].
                                                  n    Mean     SD
General performance
  Explicit learners (aware of the hierarchy)      11   95.8%    7.34
  Implicit learners (not aware of the hierarchy)  17   80.98%   16.90
Performance on top pairs AB & BC
  Explicit learners                               11   97.16%   4.30
  Implicit learners                               15   81.22%   17.18
Performance on inner pairs
  Explicit learners                               11   90.91%   15.65
  Implicit learners                               15   70.83%   18.25
Table 4. Descriptive statistics of the transitive inference task, controlled for underlying hierarchy awareness.
[Figure 8: “Underlying Hierarchy Awareness TI-task”: % performance test on pairs AB, BC, CD, DE for explicit vs. implicit learners.]
Figure 8. Test performance on training pairs for participants who were aware (explicit learners) or unaware (implicit learners) of the underlying hierarchy in the transitive inference task. Data are presented as mean ± SEM.
Taken together these results clearly indicate that participants who explicitly learned
the underlying hierarchy between symbols generally perform better than participants who
implicitly learned the hierarchical relationship between symbols. This effect seems driven by
learning performance following positive feedback (top pairs), rather than learning following
negative feedback. The main difference between implicit and explicit learners in this task lies
in performance on the inner pairs (BC & CD): explicit learners, like implicit learners, perform
well on anchor pairs, but they additionally maintain a high performance rate on inner pairs,
where implicit learners do not.
Although these results seem very clear, they should be interpreted with caution since
observed effects are mostly due to a ceiling effect on test performance in the relatively small
explicit group. This high performance rate of explicit learners on all testing pairs could also
explain the lack of an interaction effect between group and top/bottom pair performance.
4.1.3. One shot learning Task (1)
Test pair analysis across all subjects demonstrated a significant difference between
recognition accuracy following positive feedback and recognition accuracy following negative
feedback [t24 = 2.152, p = 0.042, two-tailed, d = 0.47], with a bias to learn better from positive
feedback compared to learning following negative feedback (Table 5, Fig. 9A). Session analysis
did not show significant effects of session on general recognition accuracy [F(1,23) = 0.374, p =
0.547, η2 = 0.016], recognition accuracy following positive feedback [F(1,23) = 0.018, p = 0.896,
η2 < 0.001] or recognition accuracy following negative feedback [F(1,23) = 0.487, p = 0.492, η2
= 0.02].
One Shot Learning Task (1) n Mean SD
Performance after positive feedback 25 86.5% 11.17
Performance after negative feedback 25 80.4% 14.41
One Shot Learning Task (2)
Performance after positive feedback 26 92.3% 8.77
Performance after negative feedback 26 78.2% 15.42
Table 5. Descriptive statistics for the one shot learning tasks (1 and 2).
[Figure 9: (A) “One Shot Learning Task (1)”: % performance test after positive vs. negative feedback, *; (B) “One Shot Learning Task (2)”: % performance test after positive vs. negative feedback, ***.]
Figure 9. (A) In the first version of the one shot learning task, participants performed, on average, significantly better following positive feedback when compared to performance following negative feedback. (B) This bias effect across subjects towards learning better following positive feedback was also observed in the second version of the one shot learning task, using different stimuli. Data are presented as Mean ± SEM (* = p < 0.05, *** = p < 0.001).
4.1.4. One shot learning Task (2)
Across-subjects analysis of test pairs confirmed the previously observed results (Fig.
9B). In the second OSL task, participants again showed higher recall accuracy following positive
feedback relative to recall accuracy following negative feedback (Table 5) during test [t25 =
4.886, p < 0.001, two-tailed, d = 1.12]. Session analysis did not show significant effects of
session on general recognition accuracy [F(1,24) = 2.560, p = 0.123, η2 = 0.095] and recognition
accuracy following negative feedback [F(1,24) = 0.165, p = 0.689, η2 < 0.001]. We did observe a
significant effect of session on performance following positive feedback [F(1,24) = 5.661, p =
0.026, η2 = 0.19]: performance following positive feedback was better for participants who did
the second OSL task in the first session (M = 95.76%, SD = 5.05) compared to participants who
did the second OSL task in the second session (M = 88.23%, SD = 10.55).
4.2 Cross-Task Analysis
4.2.1. Relationships between session and tasks on general test performance?
I. Tasks Within and Between Sessions
General test performance between the implicit and explicit tasks conducted in the first
session showed no significant correlation (n = 21, rs = 0.19, p = 0.410). Also, no significant
correlation was observed between general test performance on the implicit and explicit tasks in
the second session (n = 21, rs = -0.03, p = 0.883). There were no significant correlations between
the implicit task performance rate in the first session and the explicit task performance rate in
the second session (n = 21, rs = -0.01, p = 0.964), which is also the case for the relationship
between the implicit task in the second session and the explicit task in the first session (n = 21,
rs = -0.03, p = 0.436). There was a marginally non-significant negative correlation between
implicit task performance rates across sessions (n = 21, rs = -0.41, p = 0.062). We did observe a
significant positive correlation between explicit task performance rates across sessions (n = 21,
rs = 0.52, p = 0.015). These results suggest that better test performers, when compared to the other
subjects, on one task within a session do not consistently perform better on the other task
within this session. However participants who perform better on an implicit task in one session
seemingly perform worse, when compared to the other participants, on the implicit task in the
other session. On the other hand, better performers on one explicit task in a given session are
better performers on the explicit task in the other session.
II. Implicit and Explicit tasks
Correlation coefficients between test performance rates across tasks are shown in
Table 6. There was a significant correlation between OSL tasks (n = 21, rs = 0.51, p = 0.018). No
significant correlations across tasks were observed between PS-OSL1 (p = 0.586), PS-OSL2 (p =
0.175), TI-OSL1 (p = 0.943), TI-OSL2 (p = 0.489) and TI-PS (p = 0.620). When we controlled
for awareness in the TI task, the correlation between both implicit tasks became smaller for
participants who were unaware of the underlying hierarchy in the TI task (n = 11, rs = -0.05, p =
0.989). Correlation coefficients became larger, though still non-significant, between the TI task
and OSL (1 & 2) tasks when we restricted analysis to participants who were explicitly aware of
the underlying hierarchy in the TI task (n = 10, rs = 0.22, p = 0.540 and rs = 0.23, p = 0.520,
respectively). These results suggest that participants who perform better, compared to the
other subjects, on one OSL task will also perform better on the other OSL task. No such
relationship was observed between the other tasks.
          PS     TI     OSL (1)  OSL (2)
PS        1
TI        .12    1
OSL (1)   .13    .02    1
OSL (2)   -.31   .16    .51*     1
Table 6. Spearman rank correlation coefficients between test performance rate across tasks. A significant positive correlation was observed between OSL tasks, indicating that participants who performed well on one OSL-task are likely to score well on the other OSL task (* = p < 0.05).
4.2.2. Inter-individual bias towards positive or negative learning across tasks?
To examine whether positive or negative learners are consistently biased towards
learning better from positive or negative feedback, we first tested whether positive (or
negative) learners do in fact learn better from positive (or negative) feedback when compared
to negative (or positive) learners, using one-way between subjects ANOVAs. Indeed, positive
learners did learn better from positive feedback, when compared to negative learners during
the PS-task [F(1,22) = 22.510, p < 0.001, η2 = 0.51], the TI-task [F(1,19) = 7.788, p = 0.012, η2 =
0.029] and the OSL2 task [F(1,20) = 13.124, p = 0.002, η2 = 0.39], but not during the OSL1
task[F(1,22) = 3.786, p = 0.065, η2 = 0.15].
Similarly, negative learners learned better from negative feedback, when compared to
positive learners during the PS-task [F(1,22) = 11.467, p = 0.003, η2 = 0.34] and the TI-
task[F(1,19) = 6.44, p = 0.02, η2 = 0.25], but not during the OSL1 [F(1,22) = 4.104, p = 0.055, η2
= 0.16] and the OSL2 task [F(1,20) = 2.747, p = 0.131, η2 = 0.11]. However, the between-group
tests in the OSL tasks do show a clear trend towards significance, and the lack of significance is
most likely due to the smaller sample size of the negative learning group within the explicit
memory tasks (see Table 1). We should therefore be cautious about drawing any broad
conclusions from these analyses. Nevertheless, we used the implicit procedural learning task
subgroup classifications for further analysis to check whether they could predict value-related
differences in the other tasks.
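The subgroup check above can be sketched as follows: classify each participant by the rule of Frank et al. (2005) (positive learner if choose-A accuracy exceeds avoid-B accuracy, ties excluded) and compare groups with a one-way between-subjects ANOVA. This is a generic illustration; the variable names and data layout are assumptions, not the study's code.

```python
def classify(choose_a, avoid_b):
    """Positive/negative learner split; ties return None and are excluded."""
    if choose_a > avoid_b:
        return "positive"
    if choose_a < avoid_b:
        return "negative"
    return None

def anova_f(groups):
    """One-way between-subjects ANOVA F statistic for a list of groups."""
    all_vals = [v for g in groups for v in g]
    grand = sum(all_vals) / len(all_vals)
    # Between-groups sum of squares: group sizes times squared mean deviations.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    # Within-groups sum of squares: squared deviations from each group mean.
    ss_within = sum(sum((v - sum(g) / len(g)) ** 2 for v in g) for g in groups)
    df_between = len(groups) - 1
    df_within = len(all_vals) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)
```

With two groups, this F equals the square of the corresponding independent-samples t statistic.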
I. Tasks Within and Between Sessions
Correlation coefficients on learning-bias rates between implicit and explicit tasks were
not significant for the first session (n = 21, rs = -0.12, p = 0.601) or the second session (n = 21,
rs = -0.10, p = 0.656). We did observe a significant negative correlation between the implicit
task bias rates in the first session and the explicit task bias rates in the second session (n = 21,
rs = -0.56, p = 0.009). This was not the case for the relationship between the implicit task in the
second session and the explicit task in the first session (n = 21, rs = -0.03, p = 0.895).
Furthermore, no significant correlations were observed between explicit task bias rates
across sessions (n = 21, rs = -0.17, p = 0.463) or between implicit task bias rates across sessions
(n = 21, rs = 0.08, p = 0.722). These results suggest that participants who learned better from
negative feedback on the implicit task in the first session, would more likely learn better from
positive feedback on the explicit task in the second session. No such relationships were
observed within and between sessions for the other tasks.
II. Implicit and Explicit tasks
Correlation coefficients between bias rates across tasks are shown in Table 7. No
significant correlations were observed across tasks. However, we did observe trends towards a
significant negative correlation between the TI task and both the first (n = 23, rs = -0.36, p =
0.09) and second (n = 24, rs = -0.38, p = 0.068) OSL task. Results of the correlation analysis
between the PS task and both the first (n = 23, rs = -0.02, p = 0.973) and second (n = 23, rs =
-0.31, p = 0.148) OSL tasks were less clear. The correlation between implicit tasks was small,
positive and non-significant (n = 23, rs = 0.10, p = 0.644), whereas the correlation between
explicit tasks was small, negative and non-significant (n = 25, rs = -0.13, p = 0.539).
          PS     TI     OSL (1)  OSL (2)
PS        1
TI        .10    1
OSL (1)   -.02   -.37   1
OSL (2)   -.31   -.38   -.13     1
Table 7. Spearman rank correlation coefficients between bias rates across tasks. No significant correlations were observed between tasks. A positive correlation was observed between implicit procedural tasks and a relatively high negative correlation was observed between implicit procedural tasks and the second one shot learning task.
Again we controlled for awareness in the TI task. Correlation coefficients are
presented in Table 8. However, the large inter-individual variation in bias rates and the
low sample sizes across tasks make it very hard to validly interpret these results.
Table 8. Spearman rank correlation coefficients between bias rates across tasks, controlled for awareness in the TI-task. No significant correlations were observed between tasks. The relatively low sample sizes make it hard to draw valid conclusions following this analysis.
                PS       OSL (1)   OSL (2)
TI-Aw    rs =   .59      -.37      -.37
         p =    0.072    0.260     0.263
         n =    10       11        11
TI-nAw   rs =   -.25     -.41      -.52
         p =    0.417    0.191     0.068
         n =    13       12        13
General linear model regression analysis using the predictor groupPS (Positive and
Negative learners), derived from the implicit PS-task (see Table 1), showed no significant
effects on bias rates in the TI-task (Fig. 10A) [F(1,23) = 0.006, p = 0.939, η2 < 0.001] or the
OSL1 task [F(1,22) = 0.036, p = 0.852, η2 < 0.001]. Interestingly, we did observe a significant
effect of the predictor groupPS on bias rates in the OSL2 task [F(1,23) = 5.908, p = 0.023, η2 =
0.2], where positive PS-learners (Fig. 10C) had, on average, lower positive bias rates (n = 13,
MBR = 0.07, SD = 0.116) compared to negative PS-learners (n = 12, MBR = 0.21, SD = 0.155).
When we changed the predictor groupPS (Positive and Negative learners) to the
predictor groupTI (Positive and Negative learners), derived from the implicit TI-task (see Table
1), results showed broadly the same pattern. No significant effects were observed of the
predictor groupTI on bias rates in the PS-task (Fig. 10B) [F(1,16) = 0.596, p = 0.451, η2 = 0.036]
or the OSL1-task [F(1,17) = 0.655, p = 0.429, η2 = 0.038]. We did observe a significant effect of
the predictor groupTI on bias rates in the OSL2 task [F(1,18) = 5.054, p = 0.037, η2 = 0.22].
Again, on average, positive TI-learners (Fig. 10D) had significantly lower positive bias rates (n =
10, MBR = 0.08, SD = 0.111) compared to negative TI-learners (n = 10, MBR = 0.22, SD = 0.162).
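With a single dummy-coded group predictor, the general linear model used here reduces to ordinary least squares in which the slope equals the between-group difference in mean bias rate. The sketch below uses made-up numbers purely to illustrate that equivalence; it is not the study's data or analysis code.

```python
def ols(x, y):
    """Ordinary least squares with one predictor: returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

# Dummy-code group membership: 0 = positive PS-learner, 1 = negative PS-learner.
group = [0, 0, 1, 1]
bias = [0.10, 0.00, 0.20, 0.30]   # illustrative OSL2 bias rates, not real data
slope, intercept = ols(group, bias)
# slope equals mean(bias | group 1) - mean(bias | group 0);
# intercept equals the group-0 mean, so the model recovers the two group means.
```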
Taken together, these results suggest that participants who perform better following
positive feedback than following negative feedback in one implicit learning task will more
likely show the same pattern of results in the other procedural learning task than in the more
explicit memory tasks. In the explicit memory tasks, participants are more likely to show the
opposite pattern: biased positive learners in one of the implicit procedural tasks show a lower
positive bias rate in the more explicit memory learning tasks, when compared to biased
negative learners. However, most of our cross-task comparisons show no or only very small
significant effects, most likely due to our relatively small sample sizes across tasks. Further
investigations using bigger sample sizes and perhaps different implicit tasks (i.e., how implicit
is the transitive inference task?) are necessary to confirm this pattern of results.
[Figure 10: four panels of bias rates. (A) “TI Bias rate” for positive vs. negative PS-learners; (B) “PS Bias rate” for positive vs. negative TI-learners; (C) “OSL (2) Bias rate” for positive vs. negative PS-learners; (D) “OSL (2) Bias rate” for positive vs. negative TI-learners.]
Figure 10. Regression analysis on bias rates using subgroups (positive and negative learners) derived from the probabilistic selection task and the transitive inference task as predictor. (A) Positive and negative PS-learners did not predict differences on inter-individual bias rates in the transitive inference task. (B) Positive and negative TI-learners also did not predict differences on inter-individual bias rates in the probabilistic selection task. (C) Positive PS-learners had significantly lower positive bias rates when compared to negative PS-learners’ bias rates in the second OSL task. (D) Positive TI-learners had significantly lower positive bias rates when compared to the bias rates of negative TI-learners in the second OSL task.
5. Discussion
In the current study we investigated whether individual differences in learning from
positive or negative feedback differs between tasks that rely on declarative memory cortices
and tasks that rely on cortices involved in habit formation. Recent research on the neural bases
of making choices following feedback has mainly focused on the role of dopamine and the
striatum. Collectively, these studies pointed out a crucial role of midbrain dopamine neurons
and their striatal targets for learning to predict reward (Daw, Niv, & Dayan, 2005; Delgado,
et al., 2000; Frank, et al., 2004; Holroyd & Coles, 2002; Pessiglione, et al., 2006; Schultz, et al.,
1997). These findings concerning the role of the dopaminergic-striatal circuitry in
reinforcement learning are formalized in a prediction-error signal that guides choices by
updating value representations following repeated experience of feedback (Schultz, et al.,
1997; Hollerman & Schultz, 1998; Holroyd & Coles, 2002; Sutton & Barto, 1998). This allows an
organism to use previous experiences to optimize choices when confronted with a similar
situation.
However, since organisms are rarely confronted with the same environment, decisions
made in the past may not repeat themselves. Instead, novel choices mostly involve new
options and contexts which requires a flexible integration of knowledge from the past with
novel information. Previous investigations on flexibly generalizing knowledge from the past to
guide choices in novel situations have illuminated an important role of the declarative memory
system in the medial temporal lobe (Eichenbaum, 2000; Hassabis, et al., 2007; Shohamy &
Adcock, 2010; Squire, 1992). To investigate how individuals differently learn to make optimal
decisions following feedback, we adopted the probabilistic selection task designed by Frank et
al. (2004). We compared optimal decision making performance following positive and negative
feedback in this task with (1) performances on an implicit version of the transitive inference
task and (2) performances on two versions of a declarative memory task.
5.1 Learning within the procedural memory system.
5.1.1. The Probabilistic Selection Task
Besides its usage to investigate the underlying mechanisms of procedural learning
processes, the probabilistic selection task has frequently been used to research inter-individual
variability in learning more from good choices than from bad choices (Frank, et al., 2004;
Lighthall, et al., 2013; Simons, Howard, & Howard, 2010). Our results from the probabilistic
selection task showed no differences in learning performance following positive feedback
relative to learning performance following negative feedback, when averaged across subjects.
As expected, when distinguishing between a positive learner and negative learner subgroup17
similar to Frank and colleagues (2005, 2007), we did observe that positive learners learned
significantly better following positive feedback relative to negative learners. Accordingly,
negative learners learned significantly better following negative feedback when compared to
positive learners. These results are in line with previous investigations that used the
probabilistic selection task with young and healthy subjects (Frank, et al., 2005, 2007; Klein, et
al., 2007; Simons, Howard, & Howard, 2010).
A previous study used the probabilistic selection task and the implicit transitive
inference task with PD-patients to investigate individual differences in reinforcement learning
(Frank, Seeberger, & O'Reilly, 2004). In this experiment it was assumed that the probabilistic
selection task and the implicit transitive inference task rely on the same neural processes,
namely the basal ganglia. Results suggested that inter-individual biases in feedback-learning
are a direct consequence of higher or lower levels of dopamine that differently affect striatal
synaptic changes (Frank, et al., 2004).
Phasic bursts of dopamine, following positive feedback, excite D1 receptors in the direct
pathway, which induces long-term potentiation (LTP18) in striatal Go cells (Holroyd &
Coles, 2002; Frank, 2005; Nishi, Snyder, & Greengard, 1997). Furthermore, phasic
bursts of dopamine inhibit the indirect pathway via D2 receptors, which induces long-
term depression (LTD19) in striatal No-Go cells (Calabresi, et al., 1997). Short
dopaminergic drops below baseline that follow negative feedback have the opposite
effect, i.e., dissuading LTP and LTD in striatal Go and No-Go cells, respectively
(Calabresi, et al., 1997; Frank, 2005; Holroyd & Coles, 2002; Nishi, et al., 1997; Schultz,
2002). Consequently, high or low levels of dopamine bias the Go or No-Go pathway to
be more active, with better learning from positive or negative feedback as a behavioral
output (Frank, et al., 2004).
17 We adopted this distinction from Frank, Woroch, & Curran (2005). Positive learners were operationalized as those participants who performed better on choose A trials than on avoid B trials, whereas negative learners were operationalized as those participants who performed better on avoid B trials than on choose A trials (Frank et al., 2005; 2007).
18 Long-term potentiation (LTP) is an activity-dependent change in the strength of synapses, mediated by NMDA receptors. As a result of pre- and postsynaptic co-activation, a wide range of local biochemical changes strengthen the synaptic connectivity. It is widely accepted that the LTP process can be interpreted as the cellular correlate of associative learning and memory formation in general (Fedulov, et al., 2007; Whitlock, et al., 2006).
19 Long-term depression (LTD) is the cellular mechanism of synaptic weakening. The difference between LTP and LTD, although they are not simply mirror processes, lies in the magnitude of calcium signals in the postsynaptic cell. LTD can, to some extent, be seen as the cellular correlate of forgetting (Foy, 2001).
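The opponent Go/No-Go dynamic summarized above can be caricatured with two bounded weights pushed in opposite directions by dopamine bursts and dips. This is a deliberately loose sketch of the idea, not Frank's (2005) neural-network model; the learning rate and the bounding rule are arbitrary choices.

```python
def dopamine_update(go, nogo, reward, lr=0.1):
    """One trial of the opponent Go/No-Go update.

    Positive feedback (phasic dopamine burst): LTP in the Go weight
    via D1, LTD in the No-Go weight via D2. Negative feedback (dip
    below baseline): the reverse. Weights stay in [0, 1].
    """
    if reward:
        go += lr * (1 - go)      # strengthen Go (LTP)
        nogo -= lr * nogo        # weaken No-Go (LTD)
    else:
        go -= lr * go            # weaken Go
        nogo += lr * (1 - nogo)  # strengthen No-Go
    return go, nogo
```

Repeated positive feedback drives the Go weight toward 1 (a "positive learner" profile), while repeated negative feedback drives the No-Go weight toward 1.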
These assumptions of Frank’s Go-NoGo model, together with findings regarding a
genetic factor of biased feedback learning, led us to the hypothesis that the inter-individual
range of learning better from positive or negative feedback, observed in the probabilistic
selection task, will closely relate to the pattern of results in the implicit transitive inference
task (Frank, et al., 2004; Klein, et al., 2007). In our experiment, results showed only a very small
(rs = 0.10) non-significant correlation between bias rates from the probabilistic selection task and
bias rates from the implicit transitive inference task, suggesting that learning from positive and
negative feedback across these tasks might not be modulated by the same underlying
mechanisms.
5.1.2. The Transitive Inference task.
The transitive inference task has previously been used to study higher-order reasoning,
where organisms have to learn a hierarchical structure of stimuli based on the inferences that
are drawn from adjacent pairs in an ordinal sequence (Dusek & Eichenbaum, 1997; Van
Opstal, Verguts, Orban, & Fias, 2007). In common terms, this means that participants learn to
logically infer that Vincent is taller than Eden, based on the premises that Vincent is taller than
Kevin and Kevin is taller than Eden. Studies, using different modifications of the transitive
inference task with both animals and humans, have suggested that this task importantly
involves the hippocampus (Dusek & Eichenbaum, 1997; Greene, et al., 2006; Van Opstal, et al.,
2007). However, a recent study challenged the assumption of a necessary involvement of the
hippocampus in transitive inference tasks. This study demonstrated that participants with a
temporarily disrupted hippocampus, due to the benzodiazepine midazolam, showed enhanced
transitive inference performance by fully recruiting the dopamine-striatal learning system
(Frank, O'Reilly, & Curran, 2006). These results were in line with their previously proposed
associative strength hypothesis, which explains how organisms transitively infer associations
through implicit reinforcement learning mechanisms (Frank, Rudy, Levy, & O'Reilly, 2005;
Rudy, Frank, & O'Reilly, 2003).
According to the associative strength hypothesis, outer pairs (AB,DE) at the top
or bottom of an underlying hierarchy “anchor” the development of associative values.
Over consecutive trials, agents implicitly learn to associate A with positive
reinforcement, because choosing A always leads to positive feedback. In contrast,
choosing E becomes associated with negative reinforcement, because choosing E
always induces negative feedback. These net associative values than ‘transfer’ these
associative values to the inner adjacent pairs (BC,CD). As a result, B in the BC pair has a
[55]
stronger positive association, whereas D in the CD pair has a stronger negative
association, even though B and D are each reinforced positively and negatively on half
of the trials (Rudy, Frank, & O'Reilly, 2003).
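To make this transfer mechanism concrete, the anchoring of values by the outer items can be illustrated with a minimal value-learning simulation. This is our own toy sketch, not Frank and colleagues' published neural network model; the ε-greedy policy, learning rate, and trial counts are arbitrary illustrative assumptions:

```python
import random

def train(n_trials=4000, alpha=0.05, eps=0.10, seed=0):
    """Toy sketch of the associative strength account: items A>B>C>D>E,
    only adjacent pairs are trained, and the higher item is rewarded."""
    rng = random.Random(seed)
    value = dict.fromkeys("ABCDE", 0.0)            # one net value per item
    pairs = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
    for _ in range(n_trials):
        hi, lo = rng.choice(pairs)                 # 'hi' is the correct item
        if rng.random() < eps:                     # occasional exploration
            choice = rng.choice((hi, lo))
        else:                                      # otherwise pick the higher value
            choice = max((hi, lo), key=value.__getitem__)
        reward = 1.0 if choice == hi else -1.0
        value[choice] += alpha * (reward - value[choice])
    return value

def accuracy(value, test_pairs):
    """Greedy test: fraction of pairs where the correct item has the higher value."""
    return sum(value[hi] > value[lo] for hi, lo in test_pairs) / len(test_pairs)

runs = [train(seed=s) for s in range(20)]
outer = sum(accuracy(v, [("A", "B"), ("D", "E")]) for v in runs) / len(runs)
inner = sum(accuracy(v, [("B", "C"), ("C", "D")]) for v in runs) / len(runs)
print(f"outer-pair accuracy {outer:.2f}, inner-pair accuracy {inner:.2f}")
```

Because only A and E receive unambiguous feedback, their values anchor the scale, and in such a sketch outer-pair accuracy exceeds inner-pair accuracy, the same pattern our implicit learners showed.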
Although other researchers have questioned some of the pharmacological assumptions of
Frank’s midazolam study (see Greene, 2007, and Frank et al., 2008, for comment and reply), it is
agreed that the transitive inference task can be solved by both explicit ‘declarative’
strategies and implicit ‘procedural’ strategies (Greene, 2007; Van Elzakker et al., 2003; Rudy,
Frank, & O'Reilly, 2003). This might explain the lack of a significant correlation in biased
feedback learning between the probabilistic selection task and the transitive inference task in
our study.
Indeed, when we controlled for explicit and implicit strategies in the TI-task,20 a
substantial part of the participants (more than one third) indicated using an explicit strategy to
solve the transitive inference task. When we took the questionnaire results into account,
explicit learners performed better than implicit learners. Our data
suggest that this effect could largely be explained by better performances of explicit learners
on inner pairs. Furthermore, results from the implicit learner group suggest that implicitly
learning the underlying hierarchy is driven by positive and negative associative values in the
outer pairs, since implicit learners significantly scored better on ‘outer pairs’ (AB,DE) relative to
‘inner pairs’ (BC,CD). This pattern of results is in line with the associative strength hypothesis
(Rudy, Frank, & O'Reilly, 2003).
Regarding our research question, no significant differences were observed across
subjects between performance following positive feedback (top pairs, AB & BC) relative to
performance following negative feedback (bottom pairs, CD & DE), neither for explicit learners,
nor implicit learners. However, one interesting finding was that explicit learners performed
significantly better on top pairs, but not on bottom pairs, when compared to implicit learners.
This finding suggests that the performance level of explicit learners is more likely driven by
learning following positive feedback rather than by learning following negative feedback.
5.2 Learning across the procedural and the declarative memory system.
In line with the results of Frank and colleagues’ midazolam study, animal
studies using spatial navigation tasks have indicated that inactivating one learning system
(e.g., the hippocampus) improves performance on tasks related to the other learning system (e.g.,
20
We adopted a translated version of the questionnaire used in Frank and colleagues’ (2004) study to control for explicit awareness of the underlying hierarchy in the transitive inference task.
striatal processing). These results have provided strong evidence for a bidirectional
dissociation between the ‘declarative’ memory system and the ‘procedural’ memory system,
indicating that both memory systems, respectively supported by the hippocampus and the
striatum, competitively interact under some circumstances (Frank, O'Reilly, & Curran, 2006;
Lee, et al., 2008; Poldrack & Packard, 2003).
However, a very recent lesion study, using an associative reinforcement learning task
rather than a spatial navigation task, suggested that striatal processing might be a prerequisite
for declarative associative learning following reinforcement. Results of this study showed that
rodents with striatal lesions had impaired procedural and declarative-like memories, whereas
rodents with hippocampal lesions had impaired declarative-like memories, but spared
procedural memories (Jacquet, et al., 2013). These results suggest that striatal processing
might be necessary for decision making following feedback in declarative memory tasks.
5.2.1. Comparing results with the ‘episodic-like’ one-shot learning tasks.
In our study we wanted to investigate whether decision making from feedback differed
between procedural learning tasks and declarative learning tasks within subjects. We
presumed that participants who learn better from positive feedback in a procedural task will
show the same learning bias in the declarative memory tasks. This prediction rested on four
lines of evidence. First, dopamine modulates learning from positive or negative feedback (Holroyd &
Coles, 2002; Schultz, et al., 1997). Second, it has been shown that dopaminergic projections to
the striatum and the hippocampus modulate cellular learning in both regions (Frank, et al.,
2004; Frey, 1990; Huang & Kandel, 1995). Third, there is strong evidence that learning from
reinforcement is directly or indirectly modulated by striatal processing, which appears to be a
prerequisite for learning declarative memory tasks (Frank, 2005; Pessiglione et al., 2006;
Jacquet et al., 2013). Fourth, previous investigations using performance following
probabilistic feedback to examine error-processing in a recognition memory task found results
consistent with our hypothesis (Frank, et al., 2007).
Surprisingly, when we directly compared implicit and explicit tasks, results indicated
that participants who learned better from negative feedback in the implicit procedural tasks
were more likely to learn better from positive feedback in the explicit declarative memory
tasks, when compared to participants who learned better from positive feedback in the
implicit procedural task. This opposite learning bias across implicit and explicit learning tasks
was especially true for the second version of the explicit memory task and was seemingly
driven by a general bias towards learning from positive feedback during explicit memory tasks.
In both one-shot learning tasks we observed that subjects had significantly better recognition
accuracy on trials previously followed by positive feedback when compared to trials previously
followed by negative feedback. It could be argued that these effects are task-specific.
However, a similar learning bias for explicit learners in the transitive inference task suggests
otherwise.
One possible explanation for these results could be that dopamine plays a functionally
different modulatory role in hippocampal-based associative learning, compared to its role in
striatal value-related processing (Frey & Morris, 1998a; Horvitz, 2000; Li, 2003). Recent
methodological developments, together with a growing interest in dopamine’s
neuromodulatory role in the brain, have led researchers to believe that dopamine might have a
more heterogeneous functional role in processing motivationally relevant information, closely
related to encoding value-related information (Lisman, Grace, & Duzel, 2011; Shohamy &
Adcock, 2010).
In accordance with this hypothesis, studies have demonstrated (1) dopaminergic
activity when non-rewarding salient stimuli are presented (Horvitz, 2000), (2) activity in some
dopamine cells when aversive stimuli are presented (Matsumoto & Hikosaka, 2009) and (3)
dopaminergic cell firing when novel stimuli are presented (Ljungberg, et al., 1992).
Furthermore, emerging findings have indicated that dopaminergic cell firing to novel and
salient stimuli directly corresponds with hippocampal activity (Fyhn, et al., 2002; Jenkins, et al.,
2004; Kumaran & Maguire, 2006). Inspired by these findings, Lisman and Grace (2005)
proposed a neurobiological model concerning the functional interaction between the midbrain
dopamine system and the declarative memory system in processing motivationally relevant
novel information.
According to this model, novel sensory information is compared with stored
information about the environment in the CA1 output region of the hippocampus. If
the novel information does not match the already established information, a novelty
signal (comparable to a prediction-error signal) is sent, through parts of the
cortico-basal ganglia-thalamocortical loop, to midbrain dopamine cells, where it
increases dopaminergic cell firing. Because the midbrain dopamine system projects
directly back to the hippocampus, this increased activity facilitates hippocampal
associative learning via D1/D5 receptors (Legault & Wise, 2001; Lisman & Grace, 2005).
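As a rough caricature of this loop (our own schematic sketch, not Lisman and Grace's biophysical circuit; the similarity measure, learning rates, and the `novelty_gated_encoding` helper are all illustrative assumptions), the model amounts to a comparator whose mismatch signal scales the size of the plasticity step:

```python
import math
import random

def novelty_gated_encoding(sequence, dim=32, base_lr=0.05, gain=0.5, seed=0):
    """Schematic sketch: a CA1-like comparator yields a novelty (mismatch)
    signal, which boosts dopamine, which in turn gates the plasticity step."""
    rng = random.Random(seed)
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    patterns = {}            # one random unit pattern per stimulus label
    memory = [0.0] * dim     # crude running store of what has been seen
    trace = {}               # accumulated memory-trace strength per label
    log = []
    for label in sequence:
        if label not in patterns:
            p = [rng.gauss(0, 1) for _ in range(dim)]
            norm = math.sqrt(dot(p, p))
            patterns[label] = [x / norm for x in p]
        p = patterns[label]
        novelty = 1.0 - max(0.0, dot(memory, p))   # CA1-style mismatch signal
        dopamine = base_lr + gain * novelty        # novelty boosts DA firing
        trace[label] = trace.get(label, 0.0) + dopamine
        memory = [m + dopamine * x for m, x in zip(memory, p)]
        norm = math.sqrt(dot(memory, memory))
        if norm > 1.0:                             # keep the store bounded
            memory = [m / norm for m in memory]
        log.append((label, round(novelty, 2)))
    return trace, log

trace, log = novelty_gated_encoding(list("XYXZX"))
print(log)
```

In such a sketch, the first presentation of a stimulus yields the largest dopamine-gated update, so novel associations acquire stronger memory traces than repeated ones.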
Later on, these researchers, among others, suggested that this dopamine-dependent
facilitation of LTP in the hippocampus is relevant for all events related to increased
dopaminergic cell firing (Frey & Morris, 1998a; Huang & Kandel, 1995; Lisman, Grace, & Duzel,
2011; Shohamy & Adcock, 2010). When we apply the hypothesis derived from this model to
the one-shot learning tasks, novel associations followed by increased dopaminergic cell firing
(i.e., related to positive feedback) should result in better memory traces, which is exactly what
we observed.
Taken together, many studies have indicated an important involvement of dopamine
and the striatum in procedural learning based on positive and negative feedback. However, it
remains largely unclear how learning from feedback occurs in more declarative memory tasks,
thought to rely on the medial temporal lobe and the hippocampus. Previous investigations have
suggested that the procedural (habit) and declarative (goal-directed) learning systems
competitively interact under some circumstances. In this study we investigated how learning following
feedback differs in both systems by directly comparing individual learning biases across
procedural and declarative learning tasks. Results showed a general tendency to learn better
from positive feedback in the declarative learning tasks, but not in the procedural learning
tasks. Participants who learned better from negative feedback during procedural tasks were
more likely to learn better from positive feedback in the explicit declarative memory tasks.
These results suggest a different functional role for the declarative and procedural memory
system in learning from negative or positive feedback.
6. References
Albin, R., Young, A., & Penney, J. (1989). The functional anatomy of basal ganglia disorders.
Trends in neurosciences 12, 366-375.
Alexander, G. E., DeLong, M. R., & Strick, P. L. ( 1986). Parallel organization of functionally
segregated circuits linking basal ganglia and cortex. Annual Review of Neuroscience 9,
357-381.
Alexander, G., & Crutcher, M. (1990). Preparation for movement: Neural representations of
intended direction in three motor areas of the monkey. Journal of neurophysiology 64,
133-150.
Andén, N. E., Fuxe, K., Hamberger, B., & Hökfelt, T. (1966). A quantitative study on the nigro-
neostriatal dopamine neurons. Acta Physiologica Scandinavica, 67, 306-312.
Baddeley, A. (2001). The concept of episodic memory. Philosophical Transactions of the Royal
Society B: Biological Sciences, 356, 1345-1350.
Baddeley, A., Eysenck, M. W., & Anderson, M. C. (2009). Memory. New York: Psychology Press.
Balleine, B. W., & Dickinson, A. (1998). Goal-directed instrumental action: Contingency and
incentive learning and their cortical substrates. Neuropharmacology 37, 407-419.
Barnes, T. D., Kubota, Y., Hu, D., Jin, D., & Graybiel, A. (2005). Activity of striatal neurons
reflects dynamic encoding and recoding of procedural memories. Nature, 437, 1158-
1161.
Barto, A. G. (1995). Adaptive critics and the basal ganglia. In J. C. Houk, J. L. Davis, & D. G.
Beiser, Models of information processing in the basal ganglia (pp. 215-232).
Cambridge,MA: MIT Press.
Barto, A., Sutton, R., & Anderson, C. (1983). Neuronlike adaptive elements that can solve
difficult learning control problems. IEEE Transaction on Systems, Man & Cybernetics,
13, 834-846.
Bellman, R. (1958). On a routing problem. Quart. J. Appl. Math. 16, 87-90.
Bray, S., & O'Doherty, J. (2007). Neural coding of reward-prediction error signals during
classical conditioning with attractive faces. Journal of Neurophysiology, 97, 3036-3045.
Brodeur, M. B., Dionne-Dostie, E., Montreuil, T., & Lepage, M. (2010). The bank of
standardized stimuli (BOSS), a new set of 480 normative photos of objects to be used
as visual stimuli in cognitive research. PloS ONE, 5, e10773.
Brown, J., & Braver, T. (2005). Learned predictions of error likelihood in the anterior cingulate
cortex. Science, 307, 1118-1121.
Buckner, R. L., Petersen, S. E., Ojemann, J. G., Miezin, F. M., Squire, L. R., & Raichle, M. E.
(1995). Functional Anatomical Studies of Explicit and Implicit Memory Retrieval tasks.
Journal of Neuroscience 15(1), 12-29.
Calabresi, P., Saiardi, A., Pisani, A., Baik, J., Centonze, D., Mercuri, N., . . . Borelli, E. (1997).
Abnormal synaptic plasticity in the striatum of mice lacking dopamine D2 receptors.
Journal of Neuroscience, 17, 4536-4544.
Daw, N. D., Niv, Y., & Dayan, P. (2005). Uncertainty-based competition between prefrontal
and dorsolateral striatal systems for behavioral control. Nature Neuroscience, 8(12),
1704-1711.
Delgado, M., Nystrom, L., Fissell, C., Noll, D., & Fiez, J. (2000). Tracking the hemodynamic
responses to reward and punishment in the striatum. Journal of Neurophysiology, 84,
3072-3077.
Dickenson, A., & Balleine, B. (2002). The role of learning in motivation. In C. R. Gallistel,
Stevens' Handbook of Experimental Psychology Vol.3 Learning, Motivation and
Emotion (pp. 497-533). New York: John Wiley & Sons.
Dubois, B., Malapani, C., Verin, C., Rogelet, P., Deweer, B., & Pillon, B. (1994). Cognitive
functions in the basal ganglia: the model of Parkinson disease. Revue Neurologique
(Paris), 150, 763-770.
Dusek, J. A., & Eichenbaum, H. (1997). The hippocampus and memory for orderly stimulus
relations. Proceedings of the National Academy of Sciences, 94, 7109-7114.
Eichenbaum, H. (2000). A cortical-hippocampal system for declarative memory. Nature Review
Neuroscience 1, 41-50.
Emson, P., & Koob, G. (1978). The origin and distribution of dopamine-containing afferents to the
rat frontal cortex. Brain Research, 142, 249-267.
Fedulov, V., Rex, C., Simmons, D., Palmer, L., Gall, C., & Lynch, G. (2007). Evidence that long-
term potentiation occurs within individual hippocampal synapses during learning.
Journal of Neuroscience, 27, 8031-8039.
Foy, M. R. (2001). Long-term depression (hippocampus). International Encyclopedia of the
Social & Behavioral Sciences, 9047-9078.
Frank, M. J. (2005). Dynamic dopamine modulation in the basal ganglia: A
neurocomputational account of cognitive deficits in medicated and nonmedicated
Parkinsonism. Journal of Cognitive Neuroscience, 51-72.
Frank, M., D'Lauro, C., & Curran, T. (2007). Cross-task individual differences in error processing:
Neural, electrophysiological and genetic components. Cognitive, Affective & Behavioral
Neuroscience, 7, 297-308.
Frank, M., O'Reilly, R., & Curran, T. (2006). When memory fails, intuition reigns: Midazolam
enhances implicit inference in humans. Psychological Science, 16, 700-707.
Frank, M., O'Reilly, R., & Curran, T. (2008). Midazolam, hippocampal function, and transitive
inference: Reply to Greene. Behavioral and Brain Functions, 4, 5.
Frank, M., Rudy, J., Levy, W., & O'Reilly, R. (2005). When logic fails: Implicit transitive inference
in humans. Memory and Cognition, 742-750.
Frank, M., Seeberger, L. C., & O'Reilly, R. (2004). By carrot or by stick: Cognitive reinforcement
learning in Parkinsonism. Science, 306, 1940-1943.
Frank, M., Woroch, B., & Curran, T. (2005). Error-related negativity predicts reinforcement
learning and conflict biases. Neuron, 47, 495-501.
Frey, U. (1990). Dopaminergic antagonists prevent long-term maintenance of posttetanic LTP
in the CA1 region of rat hippocampal slices. Brain Research 522, 69-75.
Frey, U., & Morris, R. (1998a). Synaptic tagging and long term potentiation.
Neuropharmacology 37, 545-552.
Fuxe, K., Hökfelt, T., Johansson, O., Jonsson, G., Lidbrink, P., & Ljungdahl, A. (1974). Origin of
dopamine nerve terminals in limbic and frontal cortex: Evidence for meso-cortico
dopamine neurons. Brain Research, 82, 349-355.
Fyhn, M., Molden, S., Hollup, S., Moser, M., & Moser, E. (2002). Hippocampal neurons
responding to first-time dislocation of a target object. Neuron, 555-566.
Gerfen, C. (2000). Molecular effects of dopamine on striatal projection pathways. Trends in
neurosciences 23, 64-70.
Gerfen, C. R., & Surmeier, J. (2011). Modulation of striatal projection systems by dopamine.
Annual review Neuroscience,34, 441-466.
Gerfen, C. R., Engber, T. M., Mahan, L. C., Susel, Z., Chase, T. N., Monsma, F., & Sibley, D. R.
(1990). D1 and D2 dopamine receptor-regulated gene expression of striatonigral and
striatopallidal neurons. Science, 250, 1429-1432.
Gerfen, C., & Wilson, C. (1996). The basal ganglia. In L. Swanson, A. Bjorkland, & T. Hokfelt,
Handbook of chemical neuroanatomy Vol 12: Integrated systems of the CNS (pp. 371-
468). Amsterdam: Elsevier.
Gilbert, D. T., & Wilson, T. D. (2007). Prospection: experiencing the future. Science, 317, 1351-
1354.
Gläscher, J., Daw, N., Dayan, P., & O'Doherty, J. P. (2010). States versus Rewards: Dissociable
Neural prediction error signals underlying model-based and model-free reinforcement
learning. Neuron, 585-595.
Graf, P., & Schacter, D. L. (1985). Implicit and explicit memory for new associations in normal
and amnesic subjects. Journal of Experimental Psychology: Learning, Memory, and
Cognition, 11, 501-518.
Graybiel, A. (1998). The basal ganglia and chunking of action repertoires. Neurobiology of
Learning and Memory, 70, 119-136.
Greene, A. J. (2007). Implicit transitive inference and the human hippocampus: does intravenous
midazolam function as a reversible hippocampal lesion? Behavioral and Brain Functions, 3,
51.
Greene, A. J., Gross, W., Elsinger, C., & Rao, S. M. (2006). An fMRI analysis of the human
hippocampus: inference, context and task awareness. Journal of Cognitive
Neuroscience, 18, 1156-1173.
Hassabis, D., Kumaran, D., Vann, S. D., & Maguire, E. A. (2007). Patients with hippocampal
amnesia cannot imagine new experiences. PNAS, 104, 1726-1731.
Hernandez-Lopez, S., Bargas, J., Surmeier, D., Reyes, A., & Galarraga, E. (1997). D1 receptor
activation enhances evoked discharge in neostriatal medium spiny neurons by
modulating an L-type Ca conductance. Journal of Neuroscience, 17, 3334-3342.
Hernandez-Lopez, S., Tkatch, T., Perez-Garci, E., Galarraga, E., Bargas, J., Hamm, H., &
Surmeier, D. (2000). D2 dopamine receptors in striatal medium spiny neurons reduce
L-type Ca currents and excitability via a novel PLC1-IP-calcineurin signaling cascade.
Journal of Neuroscience, 20, 8987-8995.
Hikosaka, O. (1989). Role of basal ganglia in initiation of voluntary movements. In M. A. Arbib,
& S. Amari, Dynamic interactions in neural networks: Models and data (pp. 153-167).
Berlin: Springer-Verlag.
Hollerman, J., & Schultz, W. (1998). Dopamine neurons report an error in the temporal
prediction of reward during learning. Nature Neuroscience, 304-309.
Holroyd, C., & Coles, M. (2002). The neural basis of human error processing: Reinforcement
learning,dopamine and the error-related negativity. Psychological Review 109, 679-
709.
Horvitz, J. (2000). Mesolimbocortical and nigrostriatal dopamine responses to salient non-
reward events. Neuroscience, 96, 651-656.
Huang, Y., & Kandel, E. (1995). D1/D5 receptor agonists induce a protein synthesis-dependent
late potentiation in the CA1 region of the hippocampus. Proceedings of the National
Academy of Sciences, 2446-2450.
Jacquet, M., Lecourtier, L., Cassel, R., Loureiro, M., Cosquer, B., Escoffier, G., & Marchetti, E.
(2013). Dorsolateral striatum and dorsal hippocampus: a serial contribution to
acquisition of cue-reward associations in rats. Behavioural Brain Research, 239, 94-103.
Jankovic, J. (2008). Parkinson's disease: clinical features and diagnosis. Journal of Neurology,
Neurosurgery and Psychiatry, 79, 368-376.
Jenkins, T., Amin, E., Pearce, J., Brown, M., & Aggleton, J. (2004). Novel spatial arrangements of
familiar visual stimuli promote activity in the rat hippocampal formation but not the
parahippocampal cortices: a c-fos expression study. Neuroscience 124, 43-52.
Joel, D., & Weiner, I. (1997). The connections of the primate subthalamic nucleus: indirect
pathways and the open-interconnected scheme of basal ganglia-thalamocortical
circuitry. Brain research review,23, 62-78.
Johnson, A., & Redish, A. D. (2007). Neural ensembles in CA3 transiently encode paths forward
of the animal at a decision point. Journal Neuroscience 27, 12176-12189.
Kamin, L. (1969). Predictability, surprise, attention and conditioning. In B. A. Campbell & R. M.
Church (Eds.), Punishment and aversive behavior (pp. 242-259). New York: Appleton-
Century-Crofts.
Klein, T., Neumann, J., Reuter, M., Hennig, J., von Cramon, Y., & Ullsperger, M. (2007).
Genetically Determined Differences in Learning from Errors. Science, 318, 1642-1645.
Klopf, A. H. (1982). The Hedonistic Neuron: A Theory of Memory, Learning, and Intelligence.
Washington, DC: Hemisphere.
Knowlton, B. J., Mangels, J. A., & Squire, L. R. (1996). A neostriatal habit learning system in
humans. Science, 273, 1399-1402.
Kumaran, D., & Maguire, E. (2006). An unexpected sequence of events: mismatch detection in
the human hippocampus. PLoS Biology, 4(12), e424.
Lee, A. S., Duman, R. S., & Pittenger, C. (2008). A double dissociation revealing bidirectional
competition between striatum and hippocampus during learning. PNAS, 105, 241-249.
Legault, M., & Wise, R. (2001). Novelty-evoked elevations of nucleus accumbens dopamine:
dependence on impulse flow from the ventral subiculum and glutamatergic
neurotransmission in the ventral tegmental area. European journal of Neuroscience,
819-828.
Li, S. (2003). Dopamine-dependent facilitation of LTP induction in hippocampal CA1 by
exposure to spatial novelty. Nature Neuroscience 6, 1407-1417.
Lighthall, N. R., Gorlick, M. A., Schoeke, A., Frank, M., & Mather, M. (2013). Stress modulates
reinforcement learning in younger and older adults. Psychology and Aging, 28, 35-46.
Lisman, J., & Grace, A. (2005). The Hippocampal-VTA Loop: Controlling the Entry of Information
into Long-Term Memory. Neuron, 703-712.
Lisman, J., Grace, A., & Duzel, E. (2011). A neoHebbian framework for episodic memory; role
of dopamine-dependent late LTP. Trends in Neurosciences, 34, 536-547.
Ljungberg, T., Apicella, P., & Schultz, W. (1992). Responses of monkey dopamine neurons
during learning of behavioral reactions. Journal of Neurophysiology 67, 145-163.
Maddox, W., & Filoteo, J. (2001). Striatal contributions to category learning: Quantitative
modeling of simple linear and complex non-linear rule learning in patients with
Parkinson's disease. Journal of the international Neuropsychological Society 7, 710-
727.
Matsumoto, M., & Hikosaka, O. (2009). Two types of dopamine neuron distinctly convey
positive and negative motivational signals. Nature, 837-841.
McClure, S. M., Berns, G. S., & Montague, P. R. (2003). Temporal prediction errors in a passive
learning task activate human striatum. Neuron,38, 339-346.
Miller, R. R., Barnet, R. C., & Grahame, N. J. (1995). Assessment of the Rescorla-Wagner model.
Psychological Bulletin, 117(3), 363-386.
Milner, B., Squire, L. R., & Kandel, E. (1998). Cognitive Neuroscience and the study of memory.
Neuron 20, 445-468.
Mink, J. (1996). The basal ganglia: Focused selection and inhibition of competing motor
programs. Progress in Neurobiology 50, 381-425.
Mink, J. (2003). The basal ganglia and involuntary movements: impaired inhibition of
competing motor patterns. Archives of neurology, 60, 1365-1368.
Mirenowicz, J., & Schultz, W. (1994). Importance of unpredictability for reward responses in
primate dopamine neurons. Journal of Neurophysiology, 72, 1024-1027.
Montague, P., Dayan, P., & Sejnowski, T. (1996). A framework for mesencephalic dopamine
systems based on predictive Hebbian learning. Journal of Neuroscience 16, 1936-1947.
Nicola, S., Surmeier, J., & Malenka, R. (2000). Dopaminergic modulation of neuronal
excitability in the striatum and nucleus accumbens. Annual Review of Neuroscience,
23, 185-215.
Nishi, A., Snyder, G., & Greengard, P. (1997). Bidirectional regulation of DARPP-32
phosphorylation by dopamine. Journal of Neuroscience, 17, 8147-8155.
Niv, Y. (2009). Reinforcement learning in the brain. Journal of mathematical psychology 53,
139-154.
O'Doherty, J., Dayan, P., Friston, K., Critchley, H., & Dolan, R. J. (2003). Temporal difference
models and reward-related learning in the human brain. Neuron, 38, 329-337.
O'Doherty, J., Dayan, P., Schultz, J., Deichmann, R., Friston, K., & Dolan, R. (2004). Dissociable
roles of ventral and dorsal striatum in instrumental conditioning. Science, 304, 452-
454.
Olds, J., & Milner, P. (1954). Positive reinforcement produced by electrical stimulation of septal
area and other regions of the rat brain. Journal of comparative and physiological
psychology 47, 419-427.
Packard, M. G., & McGaugh, J. L. (1996). Inactivation of hippocampus or caudate nucleus with
lidocaine differentially affects expression of place and response learning. Neurobiology
of Learning and Memory, 65, 65-72.
Pavlov, I. (1927). Conditioned Reflexes: An Investigation of the physiological Activity of the
Cerebral Cortex. Translated and Edited by G.V. Anrep. London: Oxford University Press.
Pessiglione, M., Seymour, B., Flandin, G., Dolan, R. J., & Frith, C. D. (2006). Dopamine-
dependent prediction errors underpin reward-seeking behaviour in humans.
Nature, 442, 1042-1045.
Poldrack, R., & Packard, M. (2003). Competition among multiple memory systems: converging
evidence from animal and human brain studies. Neuropsychologia, 245-251.
Poldrack, R., Clark, J., Paré-Blagoev, E., Shohamy, D., Creso Moyano, J., Myers, C., & Gluck, M.
(2001). Interactive memory systems in the human brain. Nature, 546-550.
Rashotte, M. E., Marshall, B. S., & O'Connell, J. M. (1981). Signaling functions of the second-
order CS: Partial reinforcement during second-order conditioning of the pigeon's
keypeck. Animal learning & Behavior 9, 253-260.
Redgrave, P., Prescott, T., & Gurney, K. (1999). The basal ganglia: a vertebrate solution to the
selection problem? Neuroscience, 89, 1009-1023.
Rescorla, R. A. (1976b). Stimulus generalization: Some predictions from a model of Pavlovian
conditioning. Journal of Experimental Psychology: Animal Behavior Processes,
2, 88-96.
Rescorla, R., & Wagner, A. (1972). A theory of Pavlovian conditioning: Variations in the
effectiveness of reinforcement and nonreinforcement. In A. Black & W. Prokasy (Eds.),
Classical Conditioning II: Current Research and Theory (pp. 64-99). New York: Appleton-
Century-Crofts.
Ridderinkhof, K. R., Ullsperger, M., Crone, E. A., & Nieuwenhuis, S. (2004). The role of the
medial frontal cortex in cognitive control. Science, 306, 443-447.
Ritchie, T., & Noble, E. P. (2003). Association of seven polymorphisms of the D2 dopamine
receptor gene with brain receptor binding characteristics. Neurochemical research, 28,
73-82.
Robbins, S. (1998). Organizational Behavior. Upper Saddle River, NJ: Prentice Hall.
Romo, R., & Schultz, W. (1990). Dopamine neurons of the monkey midbrain: Contingencies of
responses to active touch during self-initiated arm movements. Journal of
Neurophysiology 63, 592-606.
Rudy, J., Frank, M., & O'Reilly, R. (2003). Transitivity, flexibility, conjunctive representations
and the hippocampus: II. A computational analysis. Hippocampus, 13, 341-354.
Schneider, W., Eschman, A., & Zuccolotto, A. (2002). E-Prime User's Guide. Pittsburgh:
Psychology Software Tools Inc.
Schönberg, T., Daw, N., Joel, D., & O'Doherty, J. (2007). Reinforcement learning signals in
the human striatum distinguish learners from nonlearners during reward-based
decision making. The Journal of Neuroscience, 27(47), 12860-12867.
Schultz, W. (1998). Predictive reward signal of dopamine neurons. Journal of Neurophysiology,
80, 1-27.
Schultz, W. (2002). Getting Formal with Dopamine and Reward. Neuron, 241-263.
Schultz, W. (2007). Multiple Dopamine functions at different time courses. Annual review of
neuroscience, 259-288.
Schultz, W., Dayan, P., & Montague, R. (1997). A neural substrate of prediction and reward.
Science, 275, 1593-1599.
Shohamy, D., & Adcock, A. (2010). Dopamine and adaptive memory. Trends in Cognitive
Sciences, 464-472.
Simons, J., Howard, J. H., & Howard, D. (2010). Adult Age Differences in Learning From Positive
and Negative Probabilistic Feedback. Neuropsychology, 24, 534-541.
Skinner, B. F. (1938). The Behavior of Organisms: An Experimental Analysis. New York: Appleton-
Century.
Skinner, B. (1987). In B. Skinner, Upon further reflection (pp. 105-108). Englewood Cliffs, NJ:
Prentice-Hall.
Skinner, B. F. (1935). Two types of conditioned reflex and a pseudo type. Journal of General
Psychology 12, 66-77.
Squire, L. R. (1992). Declarative and non-declarative memory: multiple brain systems
supporting learning and memory. Journal of Cognitive Neuroscience, 232-243.
Squire, L. R. (2004). Memory systems of the brain: A brief History and current perspective.
Neurobiology of Learning and Memory 82, 171-177.
Squire, L., Knowlton, B., & Musen, G. (1993). The structure and organization of memory.
Annual Reviews of Psychology 44, 453-495.
Squire, L., Stark, C., & Clark, R. (2004). The medial temporal lobe. Annual review neuroscience
27, 279-306.
Sugrue, L. P., Corrado, G. S., & Newsome, W. T. (2005). Choosing the greater of two goods:
neural currencies for valuation and decision making. Nature Review Neuroscience 6,
363-375.
Suri, R., & Schultz, W. (1999). A neural network with dopamine-like reinforcement signal that
learns a spatial delayed response task. Neuroscience 91, 871-890.
Sutton, R. (1988). Learning to Predict by the methods of Temporal Differences. Machine
Learning, 3, 9-44.
Sutton, R. S., & Barto, A. G. (1981a). Toward a modern theory of adaptive networks:
Expectation and prediction. Psychological Review 88, 135-171.
Sutton, R. S., & Barto, A. G. (1990). Time-derivative models of Pavlovian reinforcement. In M.
Gabriel & J. Moore (Eds.), Learning and computational neuroscience: Foundations of adaptive networks
(pp. 497-537). Cambridge, MA: MIT Press.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning: An Introduction. Cambridge (MA):
MIT Press.
Swainson, R., Rogers, R. D., Sahakian, B., Summers, B., Polkey, C., & Robbins, T. W. (2000).
Probabilistic learning and reversal deficits in patients with Parkinson's disease or
frontal or temporal lobe lesions: possible adverse effects of dopaminergic medication.
Neuropsychologia, 38, 596-612.
Thorndike, E. (1905). The elements of psychology. New York: A.G. Seiler.
Thorndike, E. (1911). Animal Intelligence: Experimental Studies. New York: Macmillan.
Tolman, E. C. (1932). Purposive Behavior in Animals and Men. New York: Century.
Valentin, V., Dickinson, A., & O'Doherty, J. (2007). Determining the neural substrates of goal-
directed learning in the human brain. The Journal of Neuroscience, 27, 4019-4026.
Van Elzakker, M., O'Reilly, R. C., & Rudy, J. W. (2003). Transitivity, flexibility, conjunctive
representations, and the hippocampus: I. An empirical analysis. Hippocampus, 13, 292-
298.
Van Opstal, F., Verguts, T., Orban, G., & Fias, W. (2007). A hippocampal-parietal network for
learning an ordered sequence. NeuroImage, 333-341.
White, N. M., & McDonald, R. J. (2002). Multiple parallel memory systems in the brain of the
rat. Neurobiology of Learning & Memory 77, 125-184.
Whitlock, J., et al. (2006). Learning induces long-term potentiation in the hippocampus.
Science, 313, 1093-1097.
Wirth, S., et al. (2009). Trial outcome and associative learning signals in the monkey hippocampus.
Neuron, 930-940.
Wise, R., Spindler, J., & Legault, L. (1978). Major attenuation of food reward with performance-
sparing doses of pimozide in the rat. Canadian journal of Psychology 32, 77-85.
Woodward, T. S., Bub, D. N., & Hunter, M. A. (2002). Task switching deficits associated with
Parkinson's disease reflect depleted attentional resources. Neuropsychologia, 40,
1948-1955.
Yin, H. H., & Knowlton, B. J. (2006). The role of the basal ganglia in habit formation. Nature
Review Neuroscience, 464-476.