Ghent University

Faculty of Psychology and Educational Sciences

Second year Master of Science in Psychology

Theoretical and Experimental Psychology

Second exam period

To Go or Not to Go: Differences in Cognitive Reinforcement

Learning

Master thesis to obtain a degree as master of science in Psychology,

in the field of Theoretical and Experimental Psychology.

Michiel Van Boxelaere 00700889

Promoter: Prof. Dr. Tom Verguts

Supervisor: Dr. Filip Van Opstal

Department of Experimental Psychology

13-08-2013

Abstract

Background. Psychologists have long suggested that the procedural learning system and the

declarative learning system are engaged under different circumstances (Poldrack & Packard,

2003). Researchers have indicated an important involvement of dopamine and the striatum in

procedural trial and error learning (Yin & Knowlton, 2006). It remains largely unclear whether

learning from feedback occurs similarly in more declarative memory tasks, thought to rely on

the medial temporal lobe and the hippocampus (Squire, 1992).

Objective. In the current study we want to investigate whether individual differences in learning from positive or negative feedback differ between tasks that rely on declarative memory structures and tasks that rely on brain structures involved in habit formation.

Methodology. To address this research question we adopted two well-established procedural learning tasks (Frank, Seeberger, & O'Reilly, 2004) and compared decision making performance

on these tasks with feedback learning performance on two versions of a newly developed

explicit declarative memory task.

Hypothesis. We hypothesized that participants who learn better from positive feedback in one task will also learn better from positive feedback in another task.

Results. We observed a general bias to learn better from positive feedback in the declarative

learning tasks, but not in the procedural learning tasks. Participants who learned better from

negative feedback during procedural tasks were more likely to learn better from positive

feedback in the explicit declarative memory tasks. These results suggest a different functional

role for the declarative and procedural memory system in learning from negative or positive

feedback.

Acknowledgements

First of all my special thanks go out to my promoter, Dr. Tom Verguts, for

introducing me to the research field of reinforcement learning and providing me the

necessary input and guidance to complete this master thesis.

Secondly, I would also like to thank my supervisor, Dr. Filip van Opstal, for

guiding me while programming the experiments.

Thirdly, special thanks go out to Jan Van Boxem and my partner Fien Van

Boxem for providing me the necessary feedback to improve various stylistic aspects of

this thesis and for general support and guidance during the process of writing.

Some thoughts go out to my stepbrother, Jochen Pichal, and my uncle, Geert

Van Boxelaere, who passed away while I was writing this thesis; their loss taught me to put the little

‘big’ problems in perspective.

General thanks go out to my family and friends for giving me the opportunity

to develop myself as a person, both personally and professionally, and giving me the

support I needed over the years.

Table of Contents

1. Introduction............................................................................................................................... 9

1.1. Adaptive learning and memory. ......................................................................................... 9

1.2 Reinforcement Learning: Theoretical Background. .......................................................... 10

1.2.1. Classical and instrumental conditioning. .............................................................. 10

1.2.2. Theoretical models of learning: a normative framework. ................................... 10

1.2.3. Computational models of reinforcement learning: Classical Conditioning. ........ 11

I. The Rescorla-Wagner Model. ......................................................................... 11

II. Computing temporal relationships of reinforcement. .................................. 12

III. The Temporal Difference Model. ................................................................. 12

1.2.4. Computational models of reinforcement learning: Instrumental Conditioning. . 13

I. What about active decision making? ............................................................. 13

II. The Actor-Critic framework ........................................................................... 14

III. Model-free (habit) learning vs. Model-based (goal-directed) learning ........ 14

1.3 Underlying neural mechanisms of reinforcement learning: Dopamine. .......................... 16

1.3.1. The neurobiology of dopamine ............................................................................ 16

1.3.2. Dopaminergic cell activity signals reward prediction errors. ............................... 16

1.4 Underlying neural mechanisms: The Striatal Habit Learning System. .............................. 17

1.4.1. Neurobiological features of the striatum and the basal ganglia. ......................... 17

I. The Cortico-Basal Ganglia-Thalamocortical circuitry modulates action

selection ............................................................................................................ 17

II. The direct “Go” Pathway ............................................................................... 18

III. The indirect “No-Go” Pathway .................................................... 18

1.4.2. Striatal processing of predicted reinforcement outcomes .................................. 19

I. Evidence from fMRI-studies ........................................................................... 19

II. Evidence from pharmacological studies ........................................................ 19

III. Evidence from patient studies ...................................................................... 20

IV. Neurocomputational accounts ..................................................................... 21

1.4.3. Psychology traditionally distinguishes habit formation from goal-directed

learning………………………………………………………………………………………………………………………22

1.5 Underlying neural mechanisms: The Goal-Directed Learning System. ............................. 22

1.5.1. Brain regions involved in goal-directed learning .................................................. 23

1.5.2. Interactions between habit and goal-directed learning? .................................... 23

1.5.3. Different value-based decision making across learning systems? ....................... 25

2. Aim of this study ...................................................................................................................... 26

2.1. Research question ........................................................................................................... 27

2.2. Rationale.......................................................................................................................... 27

2.3. Hypothesis ....................................................................................................................... 27

3. Method .................................................................................................................................... 28

3.1 Materials and Methods ..................................................................................................... 28

3.1.1. Participants ........................................................................................................... 28

3.1.2. Stimuli and Apparatus .......................................................................................... 28

3.1.3. General Procedure ............................................................................................... 28

3.2 Implicit procedural learning tasks ..................................................................................... 29

3.2.1. Probabilistic Selection Task .................................................................................. 29

I. Stimuli ............................................................................................................. 29

II. Procedures ..................................................................................................... 29

3.2.2. Transitive Inference Task ...................................................................................... 30

I. Stimuli ............................................................................................................. 30

II. Procedures ..................................................................................................... 31

3.3 Explicit episodic memory tasks ......................................................................................... 32

3.3.1. One Shot Learning Task (version 1 & 2) ............................................................... 32

I. Stimuli ............................................................................................................. 32

II. Procedures ..................................................................................................... 32

3.4 Data Analysis ..................................................................................................................... 33

3.4.1 Probabilistic Selection Task ................................................................................... 33

I. Data filtering ................................................................................................... 33

II. Test Pair Analysis ........................................................................................... 34

III. Training Analysis ........................................................................................... 34

IV. Session Analysis ............................................................................................ 34

3.4.2 Transitive Inference Task ....................................................................................... 35

I. Data filtering ................................................................................................... 35

II. Test Pair Analysis ........................................................................................... 35

III. Training Analysis ........................................................................................... 36

IV. Session Analysis ............................................................................................ 36

V. Awareness Questionnaire ............................................................................. 36

3.4.3. One Shot Learning Task (version 1 & 2) ............................................................... 37

I. Data filtering ................................................................................................... 37

II. Test Pair Analysis ........................................................................................... 37

III. Session Analysis ............................................................................................ 37

3.4.4 Cross-task Comparisons........................................................................................ 38

4. Results ..................................................................................................................................... 40

4.1. Separate Task Analysis .................................................................................................... 40

4.1.1 Probabilistic Selection Task ................................................................................... 40

4.1.2 Transitive Inference Task ....................................................................................... 42

4.1.3. One shot learning Task (1) .................................................................................... 45

4.1.4. One shot learning Task (2) .................................................................................... 46

4.2 Cross-Task Analysis ........................................................................................................... 46

4.2.1. Relationships between session and tasks on general test performance ............. 46

I. Tasks within and between sessions ................................................................ 46

II. Implicit and explicit tasks............................................................................... 47

4.2.2. Inter-individual bias towards positive or negative learning across tasks? ........... 48

I. Tasks within and between sessions ................................................................ 48

II. Implicit and explicit tasks............................................................................... 49

5. Discussion ................................................................................................................................ 52

5.1. Learning within the procedural memory system ............................................................. 52

5.1.1 The Probabilistic Selection Task ............................................................................ 52

5.1.2 The Transitive Inference Task ................................................................................ 54

5.2. Learning across the procedural and the declarative memory system ............................. 55

5.2.1 Comparing results with the ‘episodic-like’ one shot learning tasks ...................... 56

6. References ............................................................................................................................... 59

List of Figures

FIGURE 1: General distinctions in reinforcement learning theory……………………………………………15

FIGURE 2: Example of randomized task order…………………………………………………………………………29

FIGURE 3: Design Probabilistic selection task………………………………………………………………………….30

FIGURE 4: Design Transitive inference task…………………………………………………………………………….31

FIGURE 5: Design One shot learning task………………………………………………………………………………..33

FIGURE 6: Results Probabilistic selection task ……………………………………………..…………………………41

FIGURE 7: Results Transitive inference task …………………………………………………………………………...43

FIGURE 8: Results Transitive inference task, controlled for awareness……………………………………44

FIGURE 9: Results One shot learning task (version 1 & 2)…………………………………………………….…46

FIGURE 10: Cross-task regression analysis ……………………………………………………………………………..51

List of Tables

TABLE 1: Inter-individual variability across learning tasks……………………………………………………….39

TABLE 2: Descriptive statistics of the probabilistic selection task……………………………………………41

TABLE 3: Descriptive statistics of the transitive inference task ……………………………………………….42

TABLE 4: Descriptive statistics of the transitive inference task, controlled for awareness…….…44

TABLE 5: Descriptive statistics of the one shot learning tasks (1 & 2)…………………………………… 45

TABLE 6: Rank correlations of test performance across tasks………………………………………………….47

TABLE 7: Rank correlations of bias rates across tasks………………………………………………………….....49

TABLE 8: Rank correlations of bias rates across tasks, controlled for awareness TI-task……….…49

1. Introduction

In order to increase the likelihood of survival and reproduction, organisms have to

flexibly adapt to a constantly changing environment. To flexibly interact with the environment

requires an adaptive learning system that dynamically distinguishes between good, bad, novel,

relevant and irrelevant stimuli in different environmental contexts (Sugrue, Corrado, &

Newsome, 2005). When confronted with an external stimulus, organisms not only have to decide whether this stimulus is potentially harmful or beneficial for their preservation. They also have to decide whether or not to act, based on the expected outcome of that action. The brain processes relevant for decision making therefore not only have to encode signals related to the values of alternative options; they also have to be able to recall past experiences and store new experiences to guide future decision making behavior. These processes involve adaptive trial

and error learning and flexible memory updating (Daw, Yael, & Dayan, 2005; Gläscher, et al.,

2010).

1.1. Adaptive learning and memory.

In general, learning is defined as “a relatively permanent change in behavior based on

an organism’s interactional experiences with the environment” (Robbins, 1998, p.41).

The change is only relatively permanent because memories tend to get lost or changed over time. Adaptive learning therefore critically involves memory processes to make predictions concerning the positive or negative outcomes of decision making behavior, based on previous experiences.

Memory is generally referred to as “the processes that are used to acquire, retain and

later on retrieve learned information” (Baddeley, Eysenck, & Anderson, 2009, p.5). These

memory processes are traditionally categorized into different ‘memory systems’ (Squire, 1992) according to how long information is retained (Baddeley, 2001) or whether information concerning events or cognitive skills is recalled deliberately and consciously (explicitly) or automatically and unconsciously (implicitly) (Graf & Schacter, 1985; Milner, et al., 1998). Despite

the extensive amount of research dedicated to exploring the neural underpinnings of multiple

memory systems (reviewed in Squire, 2004), together with growing evidence from animal

(White & McDonald, 2002), fMRI (Poldrack, et al., 2001) and patient studies (Knowlton,

Mangels, & Squire, 1996) concerning the important role of certain brain regions1 in specific

1 There is, for example, strong evidence for an important role of the striatum and connected basal ganglia (BG) structures in (implicit) procedural learning and habit formation (Yin & Knowlton, 2006)

memory subtypes, it is still poorly understood how value-related information is integrated with

stored knowledge about past experiences across different memory systems. These research

questions concerning how value-related choice behaviors are tuned by past experiences are typically studied by reinforcement learning theories (Sutton & Barto, 1998).

1.2 Reinforcement Learning: Theoretical Background.

1.2.1 Classical and instrumental conditioning.

Behavioral psychologists have researched the above-mentioned question using

Pavlovian (or classical) and instrumental (or operant) conditioning paradigms. In a typical

classical conditioning procedure, animals learn to associate a neutral conditioned stimulus (CS;

e.g., a tone) with a motivationally significant rewarding or punishing unconditioned stimulus

(UCS; e.g., food), which elicits an unconditioned physiological response (UCR; e.g., salivation). Over time, animals will demonstrate this physiological response (conditioned response; CR) to the conditioned stimulus even when the unconditioned stimulus is omitted. In this way, the animal successfully learns the predictive value of the tone for the motivationally significant reward (food) or punishment (shock), resulting in a conditioned response (salivation) towards

the, previously neutral, tone (Pavlov, 1927).

Instrumental conditioning is distinguished from classical conditioning in that it focuses

on making associations between voluntary behavioral decision making (e.g., performing an

action or not) and its rewarding consequence. Classical conditioning, on the other hand, deals

with making associations between an involuntary response (e.g., salivation) and a stimulus

(e.g., tone). Thus, agents learn passively in classical conditioning, whereas, in instrumental

conditioning, agents actively perform an action to receive a reward or to avoid punishment. In

a typical instrumental animal conditioning procedure, a modifiable operant cage is used.

Animals are put in this operant cage and trained to perform an action (e.g., lever press) in

order to obtain a reward (e.g., food) or to avoid punishment (e.g., electric shock) (Skinner, 1935; Skinner, 1987; Thorndike, 1911).

1.2.2 Theoretical models of learning: a normative framework.

Rooted in these psychological theories of learning in animals (Skinner, 1938;

Thorndike, 1911) and further developed in the field of machine learning (Sutton & Barto,

1998), reinforcement learning theory (RL) provides a theoretical framework to study choice

behavior by which humans and animals select actions in the light of expected reward or

punishment (Sutton & Barto, 1998). Since learning is essentially an unobservable process,

computational reinforcement learning models have focused on studying choice behavior as

the most advantageous adaptation to a given problem in a certain environment. Unlike

descriptive models (Skinner, 1935), which describe choice behavior as it presents itself, computational models draw from a normative framework that describes behavior as an optimal adaptation to reach an agent's specific goals, based on its predicted future

consequences (Sutton & Barto, 1998).

According to this framework, decision making behaviors can thus be studied and

understood in the light of the efforts that most likely minimize or maximize future

punishments or rewards, respectively. Among computational reinforcement learning models,

there is a general consensus that the estimation of the likelihood that a given environmental

state or behavioral action will be followed by reward or punishment, together with the actual

experienced outcome, is the main engine that drives behavioral learning (Daw, Yael, & Dayan,

2005; Frank, 2005; Rescorla & Wagner, 1972; Sutton & Barto, 1990). The potential discrepancy

between predicted and actual reward outcome, formally known as “prediction error”, is the

fundamental basis of the learning rule as described by the Rescorla-Wagner model, which is

still one of the most influential models to understand and explain a wide range of animal (and

human) learning behaviors (Rescorla & Wagner, 1972).

1.2.3 Computational models of reinforcement learning: Classical Conditioning.

I. The Rescorla-Wagner model.

At the basis of the Rescorla-Wagner model is the assumption that “learning occurs only

when events violate expectations” (Niv, 2009, p. 141; Rescorla & Wagner, 1972). This assumption is postulated in a single learning rule which, simply put, states that associative learning is strengthened when prediction errors are positive (i.e., the actual reward outcome is better than expected) and weakened when prediction errors are negative (i.e., the actual outcome is worse than expected) (Rescorla & Wagner, 1972). Using this

relatively simple learning rule, the Rescorla-Wagner model can successfully predict several

behavioral phenomena described in classical conditioning protocols such as blocking (Kamin,

1969), stimulus generalization (Rescorla, 1976b) or conditioned inhibition (Miller, Barnet, &

Grahame, 1995). However, the Rescorla-Wagner model suffers from two major shortcomings.

First, it fails to predict and explain second-order or higher-order conditioning (i.e., when a

second stimulus predicts an already conditioned stimulus (CS2 -> CS1 -> US)) (Rashotte,

Marshall, & O'Connell, 1981), which is important because of its high prevalence in everyday

human life (e.g. money as a second-order predictor for food). Secondly, because the Rescorla-

Wagner model only calculates (and thus learns) prediction errors after the outcome of a trial is

known (US or no-US presentation), it generally fails to capture how the intensity of conditioning depends on the different temporal relationships between CSs and USs within a trial (Sutton, 1988). This issue

would, later on, prove to be very relevant when computational models merged with

neurobiological theories to understand the neural underpinnings of reinforcement learning

(Suri & Schultz, 1999). Nevertheless, the Rescorla-Wagner model has proven, due to its

simplicity and ease of application, to provide many important predictions and insights in

classical conditioning studies and adaptive learning using one elegant learning rule (Miller,

Barnet, & Grahame, 1995).
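For concreteness, the Rescorla-Wagner learning rule sketched above can be written in its standard form (the notation below is illustrative and follows common textbook presentations of Rescorla & Wagner, 1972, rather than a formula given elsewhere in this thesis):

\[ \Delta V_{X} = \alpha_{X}\,\beta\,(\lambda - \Sigma V) \]

Here $V_{X}$ is the associative strength of conditioned stimulus $X$, $\Sigma V$ is the summed associative strength of all stimuli present on the trial, $\lambda$ is the maximum associative strength the US can support (zero when the US is omitted), and $\alpha_{X}$ and $\beta$ are learning-rate parameters reflecting the salience of the CS and the US. The term $(\lambda - \Sigma V)$ is the prediction error: once the US is fully predicted, the error is zero and learning stops.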

II. Computing temporal relationships of reinforcement.

To overcome the above-mentioned issues, Richard Sutton (1988) together with

Andrew Barto (1990) proposed a real-time model using a temporal difference learning rule

(Sutton, 1988). This model is an extension of the Rescorla-Wagner model that takes the

different temporal relationships between events into account. Real-time models are

continuous models that apply on a moment by moment basis (Sutton & Barto, 1990). These

models are distinguished from trial-level models (e.g., the Rescorla-Wagner model) that treat complete trials as a whole, which does not allow them to make predictions concerning temporal relationships between CSs and USs within trials. In trial-level models, the prediction level of CSs is equally high for all times prior to the US, and the degree of associative strength depends on the intensity and duration of the primary reinforcement (US) (Sutton & Barto, 1990).

Studying the predictive value of associative strengths between CSs and USs in this manner has proven to be very successful in well-controlled laboratory experiments on a trial-by-trial basis

(Miller, Barnet, & Grahame, 1995). However, studies have shown that animals seemingly show

weaker predictions for CSs presented long before the US (Sutton & Barto, 1981a). It is also far

less clear what constitutes the beginning and the end of a learning trial in real life. To provide a

theoretical simplification of complex real-world learning, models of reinforcement learning

thus need to specify how the associative weight given to primary reinforcement decays with delay over time.

III. The Temporal Difference model.

The Temporal Difference learning model (Bellman, 1958; Sutton & Barto, 1990)

resolves this group of problems by quantifying the degree of delayed primary reinforcement

by a fraction2 (ɣ) over discrete ‘units’ of time. This fraction is implemented into algorithms of

future reward predictions that are divided into two parts (for a comprehensive overview of

these algorithms, see Sutton & Barto, 1998). The first part concerns the immediate reinforcement following a given CS; the second part is the sum of all expected future reinforcements. The desired prediction is thus stated in terms of the primary reinforcement and the desired prediction for successive time units (Sutton, 1988). The discrepancy between these

quantities (formally known as the temporal difference prediction error) is comparable to the

prediction error term used in the Rescorla-Wagner model, with the exception that it takes the

different timing of successively predicted reward outcomes into account (Sutton & Barto,

1990). By implementing the temporal difference prediction error learning rule into the

Rescorla-Wagner model, the TD-model can make successful predictions concerning the effects

on learning of temporal relationships within trials (Sutton & Barto, 1981a) and higher order

conditioning (Sutton & Barto, 1990). Furthermore, the temporal difference model, though

developed on purely theoretical grounds, provides an excellent account for neural findings on

classical conditioning (McClure, Berns, & Montague, 2003; O'Doherty, et al., 2003; Schultz,

1998).
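To illustrate how the model quantifies this (again in standard notation, not notation taken from this thesis), the temporal difference prediction error at time step $t$ and the corresponding value update can be written as

\[ \delta_{t} = r_{t} + \gamma\,V(s_{t+1}) - V(s_{t}), \qquad V(s_{t}) \leftarrow V(s_{t}) + \alpha\,\delta_{t} \]

where $V(s_{t})$ is the predicted value of the current state, $r_{t}$ is the reinforcement received at time $t$, $\gamma$ is the discount fraction described above, and $\alpha$ is a learning rate. Because each prediction is compared against the immediate reinforcement plus the discounted prediction one time step later, value estimates propagate backwards to earlier predictive stimuli, which is what allows the model to capture within-trial timing and higher-order conditioning.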

1.2.4 Computational models of reinforcement learning: Instrumental Conditioning.

I. What about active decision making?

The learning principles of the Rescorla-Wagner model and the Temporal Difference

model as described above hold true whenever associations are made between environmental

states that are fixed in such a way that agents do not influence them by voluntary actions (i.e.,

classical conditioning). But in order to functionally adapt to the environment, it is not only important to predict the rewarding outcomes of different environmental states; it is also essential to decide whether or not to act based on the expected outcome of that action (i.e., instrumental conditioning). Behavioral theories of optimal action-based decision

making have long suggested that organisms are more likely to perform specific actions when

the expected outcome is rewarding (Thorndike, 1911; Skinner, 1935). On the other hand, if expected outcomes are punishing, organisms are less likely to perform these actions (Thorndike, 1911; Skinner, 1935). Importantly, these early theories of decision making do

not address how we determine which particular action, from a series of previous sequential

2 In the case of immediate primary reinforcement, the associative weight between a CS and the US is

near its maximum (and ɣ is thus near 1), whereas in the case of long-delayed primary reinforcement, the associative weight between CS and US is near its minimum (and ɣ is thus near 0).

actions, should get credit for a positive or negative outcome; this issue is formally known as

the credit-assignment problem (Sutton & Barto, 1998).

II. The Actor-Critic framework.

Models of reinforcement learning efficiently solved the credit-assignment problem by

providing a two-process Actor-Critic learning system of instrumental conditioning (Barto,

Sutton, & Anderson, 1983; Barto, 1995). According to this framework one component, the

critic, uses a temporal difference prediction error signal to evaluate and update possible

actions and environmental states in terms of predictions of future rewards (Barto, 1995). The

other component, the actor, uses a similar prediction error signal to learn preferences for each action in each environmental state and, based on the evaluations provided by the critic, selects those actions that are associated with greater long-term reward (Barto, 1995). In other words, the critic learns and stores values concerning the surrounding environmental states (i.e., temporal-difference learning), which allows the actor to select and update

preferred actions (Sutton & Barto, 1998). In the actor, an action is strengthened (or weakened)

when immediately followed by a positive (or negative) prediction error (Barto, 1995).

Accordingly, the critic is involved in both classical and instrumental conditioning, whereas the

actor only applies to instrumental conditioning (O'Doherty, et al., 2004).
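A minimal formal sketch of this two-process architecture (notation illustrative, following standard actor-critic formulations in Sutton & Barto, 1998) separates the critic's value update from the actor's preference update:

\[ \delta_{t} = r_{t} + \gamma\,V(s_{t+1}) - V(s_{t}), \qquad V(s_{t}) \leftarrow V(s_{t}) + \alpha_{c}\,\delta_{t}, \qquad p(s_{t},a_{t}) \leftarrow p(s_{t},a_{t}) + \alpha_{a}\,\delta_{t} \]

where $V$ is the critic's state value, $p(s,a)$ is the actor's preference for action $a$ in state $s$, and $\alpha_{c}$ and $\alpha_{a}$ are separate learning rates. Actions are then selected with a probability that increases with their preference (e.g., through a softmax rule), so that actions followed by positive prediction errors become more likely to be repeated and actions followed by negative prediction errors become less likely.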

III. Model-free (habit) learning vs. Model-based (goal-directed) learning.

The actor-critic architecture of action selection in instrumental conditioning is closely

related to what is known in psychology as ‘habit’ (procedural) learning (Dickinson & Balleine, 2002) or, in computational terms, ‘model-free’ learning (Daw, Yael, & Dayan, 2005). In model-free (habit) learning approaches, associations between an organism’s actions and outcomes are learned through trial and error and stored as a prediction-error-driven estimate of long-term value, without specifying the nature of the outcome. This learning approach has the advantage of being insensitive to outcome devaluation, but at the cost of inflexibility (Daw, Yael, & Dayan,

2005). Model free learning approaches closely interact with, but are distinguished from, more

flexible ‘model based’ or goal directed learning approaches (Balleine & Dickinson, 1998;

Gläscher, Daw, Dayan, & O'Doherty, 2010). Model based (goal-directed) learning methods also

make predictions of long-term value outcomes by learning a ‘cognitive model’ of the

environment where actions are guided by explicit knowledge of action-outcome contingencies

(Daw, Yael, & Dayan, 2005; Gläscher, Daw, Dayan, & O'Doherty, 2010). These learning

methods are, contrary to model-free approaches, very sensitive to outcome devaluation, which makes them more suitable for flexibly adapting to a changing environment.
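The computational contrast can be summarized schematically (again in illustrative notation): a model-free learner caches action values directly from prediction errors,

\[ Q(s,a) \leftarrow Q(s,a) + \alpha\,\delta, \]

whereas a model-based learner recomputes action values from a learned model of the environment, for example

\[ Q(s,a) = \sum_{s'} T(s' \mid s,a)\,\big[ R(s,a,s') + \gamma \max_{a'} Q(s',a') \big], \]

with $T$ the learned state-transition function and $R$ the learned outcome function. Because the model-based value is recomputed from $T$ and $R$, a change in the value of an outcome (devaluation) immediately changes behavior, whereas the cached model-free value only changes once new prediction errors have been experienced.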

Overall these computational models have contributed vastly to the research field of

reinforcement learning at a behavioral, neural and molecular level:

(1) By providing simple computational terms that led to new predictions and insights into understanding under what specific circumstances individual organisms differ in choosing one specific action over another.

(2) By distinguishing between different learning and memory systems which allows

psychologists and behavioral neuroscientists to set up experiments that directly test

how the processing of value-related information and optimal decision making differs

across these memory and learning systems (Frank, O'Reilly, & Curran, 2006; Frank,

D'Lauro, & Curran, 2007; Poldrack, et al., 2001).

(3) By helping to understand the underlying neural bases of conditioning (Montague, Dayan, & Sejnowski, 1996; Schultz, 1998; Schultz, 2007).

[Figure 1: schematic overview of classical conditioning (Rescorla-Wagner model, Temporal Difference model) and instrumental conditioning (actor and critic; habit/model-free learning and goal-directed/model-based learning).]

Figure 1. General distinctions in reinforcement learning theory (Sutton & Barto, 1998).

Specifically, work merging the theoretical computational and algorithmic levels with neural implementations has revealed a key learning signal in the mammalian brain that closely resembles the temporal difference prediction error signal as proposed by the TD-model

(Schultz, Dayan, & Montague, 1997). It is widely accepted that this prediction error signal is

encoded by phasic bursts of dopamine, a neurotransmitter crucially involved in the midbrain

reward circuitry (Holroyd & Coles, 2002; Montague, Dayan, & Sejnowski, 1996; O'Doherty, et

al., 2003; Schultz , 1998; Suri & Schultz, 1999; Sutton & Barto, 1998).

1.3 Underlying neural mechanisms of reinforcement learning: Dopamine.

1.3.1. The neurobiology of dopamine.

Dopamine (DA) is a well-studied catecholamine neurotransmitter involved in attention, movement and various cognitive processes (reviewed in Schultz, 2007). Dopaminergic cell

groups are predominantly located in the substantia nigra pars compacta (SNc) and the ventral

tegmental area (VTA) from which they project to different brain regions involved in learning

and memory such as the striatum (Andén, et al. 1966), the amygdala (Fuxe, et al., 1974), the

hippocampus (Legault & Wise, 2001) and frontal cortices (Emson & Koob, 1978). Evidence for

the functional role of dopamine in reinforcement learning emerged from electrophysiological

studies with behaving monkeys conducted in the lab of Wolfram Schultz in the 1990s. Before these studies, the dominant view on the function of dopamine was that it is crucially involved in the brain’s ‘pleasure centre’ and that it might serve as the brain’s reward signal (Olds &

Milner, 1954; Wise, Spindler, & Legault, 1978). According to this hypothesis, dopaminergic cell

firing corresponds with the ‘pleasure feeling’ that is experienced when a rewarding stimulus is

presented in the external environment (Wise, Spindler, & Legault, 1978).

1.3.2. Dopaminergic cell activity signals reward prediction errors.

Schultz and colleagues famously demonstrated that dopaminergic cell activity

resembles reward expectancy rather than reward itself, comparable to the prediction error

learning signals proposed by computational models (reviewed in Schultz, 2002). Specifically,

the classical conditioning studies conducted in the Schultz lab demonstrated that (1) phasic

dopaminergic cell firing disappeared over time when rewards in the external environment

became highly predictable to the agent (Romo & Schultz, 1990), (2) after a couple of trials,

phasic dopaminergic cell firing was observed during stimuli that predict reward, and thus,

before the presentation of a rewarding stimulus (Mirenowicz & Schultz, 1994), (3)

dopaminergic cell firing increased when actual reward outcomes were better than predicted

(Schultz, Dayan, & Montague, 1997) and (4) phasic dopaminergic cell firing briefly dropped

below baseline when expected rewards were omitted (Hollerman & Schultz, 1998). Studies in

humans, using fMRI and event-related potentials, have reported prediction-error-like activation in areas known to be richly innervated by dopaminergic projections, such as the anterior cingulate

cortex (Brown & Braver, 2005; Holroyd & Coles, 2002), the striatum (Bray & O'Doherty, 2007)

and the orbitofrontal cortex (O'Doherty, et al., 2003).

These results from animal and human studies have led to the widely accepted idea

that phasic dopaminergic activity is the neural substrate of processing positive and negative

prediction errors (Holroyd & Coles, 2002; Schultz, Dayan, & Montague, 1997). Many

researchers have demonstrated that activity in ventral and dorsal parts of the striatum

corresponds with prediction error signals, suggesting an important role of the striatum in

reinforcement learning (Delgado, et al., 2000; O'Doherty, et al., 2004; Yin & Knowlton, 2006).

1.4 Underlying neural mechanisms: The Striatal Habit Learning System.

1.4.1. Neurobiological features of the striatum and the basal ganglia.

The striatum is a subcortical part of the forebrain formed by the putamen, caudate and

nucleus accumbens (Gerfen & Wilson, 1996). It serves as a major input structure for the basal ganglia and receives direct inputs from the SNc, VTA and many neocortical frontal structures involved in motor actions (Gerfen, 2000; Mink, 2003). The striatum projects through the globus pallidus (GP) and substantia nigra pars reticulata (SNr) to the thalamus, which in turn projects back to the neocortex (Gerfen & Surmeier, 2011). Most likely due to its major

involvement in movement disorders like Parkinson’s disease and Huntington’s disease,

research concerning the basal ganglia has mainly focused on its functional role as a motor

control unit (Redgrave, Prescott, & Gurney, 1999).

I. The Cortico-Basal Ganglia-Thalamocortical Circuitry modulates action selection

Early studies on the neurobiological function of the basal ganglia have suggested that

the basal ganglia facilitate the selection of a specific motor action while inhibiting other motor

actions (Alexander & Crutcher, 1990; Mink, 2003). The basal ganglia facilitate a specific action

selection by modulating the execution of a certain motor response rather than encoding its

specific details (Mink, 1996). This motor action modulation occurs by signaling the most

appropriate “Go” or “No-Go” response for competing motor actions represented in the motor

cortex (Alexander, DeLong, & Strick, 1986; Hikosaka, 1989). It is generally assumed that

selecting the appropriate “Go”/”No-Go” motor responses relies on striatal synaptic changes

which are modulated by dopaminergic cell activity via D1 and D2 receptors (Gerfen, et al.,

1990). These effects of dopaminergic modulation occur in the basal ganglia via two main

pathways, i.e., a direct and an indirect pathway (Gerfen, et al., 1990). These pathways are

thought to oppositely excite or inhibit the thalamus, through the basal ganglia circuitry (Gerfen

& Wilson, 1996).

II. The direct “Go” pathway.

In the direct “Go” pathway, striatal neurons3 project to the internal segment of the

globus pallidus (GPi) which, without striatal firing, tonically inhibits the thalamus (Gerfen &

Wilson, 1996). The striatal projection neurons in the direct pathway are characterized by a

predominant expression of D1 receptors (Gerfen, et al., 1990). D1 receptors are primarily

excited by dopamine. Phasic dopaminergic cell firing thus excites striatal D1 receptors (Nicola,

Surmeier, & Malenka, 2000). Researchers have demonstrated that the excitation of D1

receptors aids the depolarization of inhibitory striatal projections to the GPi (Gerfen, 2000).

This inhibition of the GPi suppresses its tonically inhibitory projections to the thalamus (“disinhibition”), which allows the thalamus to be excited by other excitatory projections (Hernandez-Lopez,

et al., 1997; Mink, 2003). The basal ganglia circuitry of the direct pathway can be compared to

the proverbial “releasing the brakes” of the thalamus to select the most appropriate action.

III. The indirect “No-Go” pathway.

In the indirect “No-Go” pathway, striatal inhibitory neurons project to the external

segment of the globus pallidus (GPe) which tonically inhibits the internal segment of the

globus pallidus (GPi) (Gerfen & Wilson, 1996). The striatal projection neurons in the indirect

pathway are characterized by a predominant expression of D2 receptors (Gerfen, 2000). It has

been shown that the activation of D2 receptors suppresses depolarization of inhibitory striatal

projections (Hernandez-Lopez, et al., 2000). As a consequence, during phasic bursts of

dopaminergic cell firing, the activity in the indirect pathway is suppressed via D2 receptors

(Gerfen, 2000). However, during dips of dopaminergic cell firing, the inhibitory striatal

projections to the GPe are activated, which results in a net effect of further inhibiting the

thalamus (Joel & Weiner, 1997). Cell activity in the indirect pathway can thus be compared to

the proverbial “pressing the brakes”.

The integration of these findings concerning the basal ganglia’s modulatory role in

motor action selection, together with the hypothesis that dopamine activity signals prediction

3 The majority (90%-95%) of striatal neurons are GABAergic medium spiny neurons that send inhibitory projections to other nuclei in the basal ganglia circuitry (Gerfen, 2000).

error, led to the broader view that the basal ganglia might serve as a more general functional

cognitive unit (Frank, 2005). According to this hypothesis, the basal ganglia modulate optimal action selection by processing value-related information in the striatum to predict the future outcomes of actions (Frank, Seeberger, & O'Reilly, 2004; Frank, 2005).

1.4.2. Striatal processing of predicted reinforcement outcomes.

Evidence for the hypothesis that the basal ganglia modulate optimal action selection

by processing value-related information in the striatum came from (1) behavioral studies using

fMRI, (2) pharmacological intervention studies, (3) studies of patients with

Parkinson’s disease and (4) computational neural network models.

I. Evidence from fMRI-studies.

First, it has been shown that activity in the striatum during classical and instrumental

conditioning tasks closely resembles dopaminergic prediction error signals (Delgado, et al.,

2000). Evidence for the behavioral relevance of the correlation between striatal activity and

prediction error signals came from a study using fMRI (Schönberg, et al., 2007). This study

demonstrated that the magnitude of prediction-error-related dopaminergic activity in the striatum could distinguish subjects who learned to make optimal decisions4 from those who did not (Schönberg, et al., 2007). Furthermore, patterns of striatal activity could be mapped onto the computational actor-critic framework of instrumental conditioning, where the dorsal and ventral striatum dissociably correspond with the actor and critic, respectively

(O'Doherty, et al., 2004).

II. Evidence from pharmacological studies.

Second, researchers have shown that the magnitude of the reward prediction error, expressed in the striatum, was modulated by giving dopamine agonists and antagonists in an instrumental conditioning task (Pessiglione, et al., 2006). A dopamine agonist, in this study L-

DOPA, enhances dopaminergic function by activating dopamine receptors (Huang & Kandel,

1995). A dopamine antagonist, in this study haloperidol, reduces dopaminergic function by

blocking dopamine receptors (Frey, 1990). Results showed that subjects on a dopamine

agonist were more likely to choose the appropriate rewarding action than subjects

treated with a dopamine antagonist (Pessiglione, et al., 2006).

4 Optimal, since reinforcement in this task was probabilistic. Through trial and error learning over trials, participants had to learn which of four choices was most likely to be the most rewarding (Schönberg, Daw, Joel, & O'Doherty, 2007).

III. Evidence from patient studies.

Third, researchers have demonstrated that the extent to which individual patients with Parkinson’s disease learn better from either positive feedback (i.e., positive prediction errors) or negative feedback (i.e., negative prediction errors) depends on the degree of dopamine dysfunction in the basal ganglia (Frank, Seeberger, & O'Reilly, 2004). Patients with

Parkinson’s disease (PD) are characterized by a degenerating nigro-striatal dopamine system

(Jankovic, 2008). As a result of this dopamine depletion in the striatum, PD-patients typically

show impaired planning, initiation and control of movements and a wide variety of cognitive

deficits (Albin, Young, & Penney, 1989; Dubois, et al., 1994; Maddox & Filoteo, 2001; Swainson,

et al., 2000). Frank and colleagues (2004) used two cognitive implicit procedural learning tasks,

a probabilistic selection task and a transitive inference task, to test how a depleted dopamine

system affects value-related decision making in cognitive tasks.

In a probabilistic selection task (designed by Frank, Seeberger, & O'Reilly, 2004), three different stimulus pairs (AB, CD and EF) are shown randomly on the screen. Participants learn,

through trial and error processes, to choose or avoid one stimulus in a given pair based on the

probabilistic feedback contingencies over multiple trials. In a typical transitive inference task

(Dusek & Eichenbaum, 1997), participants learn a hierarchical structure of a sequence of

stimuli (A > B > C > D > E) based on positive (+) or negative (-) feedback following separate

individual adjacent pairs in the sequence (A+B-, B+C-, C+D- and D+E-). Importantly, in the

implicit version of this task, it is assumed that participants have no explicit awareness of the

underlying hierarchical relationships across stimuli (Frank, et al., 2005).
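To make the feedback structure of the probabilistic selection task concrete, in the original design (Frank, Seeberger, & O'Reilly, 2004; probabilities cited here for illustration) the three pairs were reinforced with the probabilities

\[ P(\text{A correct} \mid AB) = 0.80, \qquad P(\text{C correct} \mid CD) = 0.70, \qquad P(\text{E correct} \mid EF) = 0.60, \]

so that optimal performance can be reached either by learning to choose A (positive feedback) or by learning to avoid B (negative feedback). In a subsequent test phase, the trained stimuli are recombined into novel pairs, and accuracy on choose-A versus avoid-B pairs indicates whether a participant learned relatively more from positive or from negative feedback.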

Results demonstrated that PD-patients off medication (i.e., low levels of dopamine)

were biased to learn better from negative feedback, whereas PD-patients on medication (i.e.,

higher levels of dopamine) were biased to learn better from positive feedback (Frank,

Seeberger, & O'Reilly, 2004). Frank and colleagues (2004) suggested that the observed learning

biases were directly related to higher or lower levels of dopamine in the basal ganglia. PD-

patients off medication have systematically low levels of dopamine, which biases the basal

ganglia’s indirect “No-Go” pathway to be very active, with better learning performance

following negative feedback as a result (Frank, Seeberger, & O'Reilly, 2004). PD-patients on L-

DOPA medication showed the opposite effect. Higher levels of dopamine in the basal ganglia

facilitate “Go” learning by increasing the signal to noise ratio in the direct “Go” pathway,

thought to aid the selection of the most appropriate response (Nicola, Surmeier, & Malenka,

2000). Consequently, PD-patients on medication show better learning performance following

positive feedback relative to their learning performance following negative feedback (Frank,

Seeberger, & O'Reilly, 2004).

IV. Neurocomputational accounts.

Fourth, PD-patients are not only characterized by various movement impairments; they also show seemingly unrelated cognitive impairments, with a discrepancy between implicit learning impairments on the one hand and ‘frontal-like’ impairments on the other (Dubois, et al., 1994; Woodward, Bub, & Hunter, 2002). It has been shown that these two

kinds of cognitive processing can be dissociated to a certain degree, since patients with frontal

lesions do not show implicit learning deficits (Knowlton, Mangels, & Squire, 1996). To tie these

seemingly unrelated cognitive deficits together, Frank and colleagues suggested that

differences between PD-patients in value-related processing of cognitive ‘frontal-like’ tasks

might be modulated by the Go/No-Go pathways in the basal ganglia, which are themselves modulated by dopaminergic input (Frank, 2005; Frank, Seeberger, & O'Reilly, 2004). These researchers implemented the neurobiological structure of the basal ganglia together with the cellular dynamics of dopamine (see section 1.3) into a reinforcement learning model that could test the

double modulatory role of the dopamine-basal ganglia circuitry in executing different cognitive

tasks (Frank, 2005).

The ‘Go-NoGo’ model of cognitive reinforcement learning could successfully replicate

different behavioral performances between medicated and non-medicated PD-patients in

simulated versions of the ‘weather-prediction’ and the ‘probabilistic-reversal’ tasks (Frank, 2005). It could also successfully replicate results of simulated versions of the above-mentioned ‘probabilistic-selection’ and ‘transitive-inference’ tasks (Frank, Seeberger, & O'Reilly, 2004). Later on, this neural network model of Go/No-Go learning in implicit cognitive tasks proved to

be an excellent framework for examining individual variability in processing prediction errors

with healthy subjects (Frank, Woroch, & Curran, 2005; Frank, O'Reilly, & Curran, 2006; Frank,

D'Lauro, & Curran, 2007; Simons, Howard, & Howard, 2010).

Overall these researchers directly and indirectly provided evidence for a major role of

the striatum in habit formation, which involves learning associations between stimuli (or contexts) and responses (S-R associations) over consecutive trials. These studies also provided convincing evidence for the hypothesis that the basal ganglia modulate optimal action selection by processing value-related information in the striatum. Furthermore, these

investigations showed that value-related processing is dynamically modulated by the degree

of dopaminergic input in the striatum.

1.4.3 Psychology traditionally distinguishes habit formation from goal-directed learning.

Psychologists who study the underlying mechanisms of instrumental conditioning have

distinguished habits from goal-directed actions (Balleine & Dickinson, 1998; Squire, 1992;

Tolman, 1932). Habit formation is a prototypical instance of procedural memory and involves

learning S-R associations without any explicit ‘conscious’ knowledge of how these actions relate to

the nature of the (rewarding) outcome (Yin & Knowlton, 2006). Habits could thus be seen as

automatic ‘reflex-like’ behaviors that respond to those stimuli (or contexts) they are most

positively associated with. As discussed above, many studies have indicated that habit

formation involves the striatum and the basal ganglia (Frank, 2005; O'Doherty , et al., 2004; Yin

& Knowlton, 2006).

In contrast, goal-directed learning corresponds more to declarative (episodic) memory

and involves learning an explicit ‘cognitive model’ of the environment where actions are

guided by explicit knowledge of action-outcome contingencies (Balleine & Dickinson, 1998;

Daw, Yael, & Dayan, 2005; Squire, 1992). Goal-directed learning behaviors could thus be seen

as the integration of novel information into an already established cognitive model of the

environment to flexibly guide behavioral adaptations (Balleine & Dickinson, 1998; Tolman,

1932).

1.5 Underlying neural mechanisms: The Goal-Directed Learning System.

1.5.1. Brain regions involved in goal-directed learning.

In contrast to the cortico-basal ganglia-thalamocortical circuitry in habit learning, it is far less understood which specific brain regions are involved in goal-directed learning. Different studies have suggested various brain regions to be involved in goal-directed learning, such as

the prefrontal cortex (Daw, Yael, & Dayan, 2005), the orbitofrontal cortex (Valentin, Dickinson,

& O'Doherty, 2007) and the dorsomedial striatum (Yin & Knowlton, 2006). Note that the

dorsomedial striatum is also part of a cortico-basal ganglia-thalamocortical loop, comparable to the dorsolateral striatum in habit learning; the difference is that the dorsomedial striatum belongs to the loop that involves prefrontal associative cortices, whereas the dorsolateral striatum belongs to the loop that involves sensorimotor cortices (Yin & Knowlton, 2006).

These researchers suggested that the sensorimotor loop and the associative loop might

function as the underlying neural circuitry of habits and goal-directed behavior, respectively

(Yin & Knowlton, 2006).

Furthermore, researchers have suggested an important role of the hippocampus and

its surrounding regions in the medial temporal lobe in goal directed learning (Packard &

McGaugh, 1996; Johnson & Redish, 2007; Shohamy & Adcock, 2010; Squire, Stark, & Clark,

2004). First, it has been shown that goal-directed decision making strategies are suppressed

following hippocampal lesions in rodents (Packard & McGaugh, 1996). Second, rodent maze

studies have observed hippocampal neural firing not only during reward itself but also before key decision points in the maze (Johnson & Redish, 2007). Similar hippocampal firing before decision making has also been observed in monkeys (Wirth, 2009). Third, it has been shown that

humans with bilateral hippocampal damage fail to mentally represent new experiences, which

is a crucial feature of goal directed decision making (Hassabis, et al., 2007). These studies are in

line with recent neurobiological models of an important role of the hippocampus in novelty

processing, modulated by dopamine, to flexibly update already established knowledge

concerning the environment (Lisman & Grace, 2005; Lisman, Grace, & Duzel, 2011).

Taken together these studies suggest an important role of the hippocampus and the

medial temporal lobe (MTL), together with the prefrontal cortex and the dorsomedial

striatum, in flexible goal-directed decision making. In the previous sections we discussed that

both the goal-directed learning system and the habit learning system can guide actions based

on explicit or implicit knowledge about their consequences. Despite different methodological efforts to unravel the dynamics of these learning systems, there are still many unanswered questions regarding this research topic. Two major research questions we will further address are: (1) Under what specific circumstances is each system used, or in other words, how does the habit learning system interact with the goal-directed learning system? (2) To what extent does learning from positive or negative reinforcement differ between these learning systems?

1.5.2. Interactions between habit and goal-directed learning?

As a result of extensive research there is now a consensus that the habit (or

procedural) learning system and the goal directed (or declarative) learning system are engaged

under different circumstances (Balleine & Dickinson, 1998; Daw, Yael, & Dayan, 2005;

Gläscher, Daw, Dayan, & O'Doherty, 2010; Poldrack, et al., 2001). Crucial aspects of

determining the shift between habit or goal-directed behavioral control is the level of

(rewarding) uncertainty there is following choice behavior which indirectly involves the level of

training5 an agent receives (Daw, Yael, & Dayan, 2005). It is beneficial for organisms to

rationally evaluate alternative action-outcomes (e.g. goal-directed learning) early in training or

5 The more experience or training an organism has with the relevant parameters in a given

environmental context, the more certainty it will have concerning the reinforcing aspects of the parameters in this context.


when confronted with a novel context (Gilbert & Wilson, 2007; Shohamy & Adcock, 2010). This

allows them to rapidly and flexibly adapt to changing reinforcement contingencies, but comes

with the cost of being time consuming and inefficient (Balleine & Dickinson, 1998). It could

therefore be beneficial to shift to a less demanding system (e.g. habit learning) that slowly

learns after repeated experience over many training trials (Barnes, et al., 2005; Yin & Knowlton,

2006). Relying on this system however, comes with the cost of being inflexible to changes in

reinforcement contingencies (Daw, Yael, & Dayan, 2005).

Evidence for this account came from animal studies, showing that a goal-directed

strategy, sensitive to outcome devaluation, is used when animals are trained moderately.

When trained extensively, this strategy is shifted to a response-based habit learning strategy

which is insensitive to outcome devaluation (Packard & McGaugh, 1996). Moreover,

physiological recording studies have demonstrated that firing patterns in the dorsolateral

striatum, an area crucially involved in habit learning, develop rather slowly (Barnes, et al.,

2005; Graybiel, 1998).

The concept of competing learning systems in humans was developed by Poldrack and

Packard (2003), following an influential fMRI study on how learning systems may compete

during classification learning (Poldrack, et al., 2001). In this study participants had to perform

a procedural ‘weather prediction’ task with probabilistic feedback which is thought to rely on

the implicit habit learning system (Knowlton, Mangels, & Squire, 1996). Performance on this

task was compared with an alternative ‘paired association’ task that emphasized explicit

declarative memory processes thought to rely on the medial temporal lobe and the

hippocampus (Squire, 1992). Participants had to alternate between these tasks and a baseline

task. As expected, results showed activation of the basal ganglia during the probabilistic

categorization task. An interesting finding was that the hippocampus was deactivated during

this task. To test whether this finding was task specific, activation patterns during the

procedural task and the declarative task were directly compared. Results suggested that

activation in the basal ganglia and hippocampal activation were negatively related (Poldrack,

et al., 2001).

Intrigued by their previous findings, Poldrack and colleagues (2001) tested a new group

of participants using the same procedural categorization task. During this experiment, event-related fMRI was used. Results initially demonstrated hippocampal activity and lack of

basal ganglia activity, but after a couple of trials, the hippocampus quickly became

deactivated, whereas the basal ganglia became activated (Poldrack, et al., 2001). These

researchers suggested that the observed ‘competition’ between the striatal-based memory

system and the hippocampal-based memory system might serve as a mechanism between two


incompatible requirements of learning: the need for flexibly accessible knowledge on the one

hand and the need to learn fast automatic responses in specific situations on the other. These

results suggest that the former is supported by the medial temporal lobe and the

hippocampus, whereas the latter is supported by the striatum (Poldrack, et al., 2001; Poldrack

& Packard, 2003).

1.5.3. Different value-based decision making across learning systems?

Given the premise that the procedural habit learning system relies on different neural

processes when compared to the declarative goal-directed learning system, the question

remains whether value-based decision making differs across these systems. A bundle of

evidence has suggested that value-based decision making during implicit procedural learning

tasks is modulated by dopaminergic input into the striatum which facilitates ‘Go’ and ‘No-Go’

learning in cognitive tasks (Frank, et al., 2004, 2005, 2007). A commonly used task to study

individual variability in learning from positive and negative feedback is the probabilistic

selection task designed by Frank, Seeberger, & O'reilly in 2004 (see section 1.4.2. III.). Using

this task researchers have demonstrated that: (1) the degree of nigro-striatal dopamine

depletion has a direct influence on whether learning is better from positive or from negative

feedback (Frank, Seeberger, & O'reilly, 2004), (2) that the separation of positive learners and

negative learners based on performances in this task could successfully distinguish the

magnitude of event-related-potentials (ERP) related to error processing (Frank, Woroch, &

Curran, 2005) and (3) that there is an important genetic factor that contributes to biased

reinforcement learning, where participants carrying the A1 allele of the D2 receptor gene

polymorphism DRD2-TAQ-IA6 learn less efficiently to avoid negative feedback (Klein, et al.,

2007). These findings, regarding biased feedback learning in implicit cognitive tasks, led Frank

and colleagues to the hypothesis that individual learning biases might result from

dopaminergic striatal changes rather than prefrontal dopaminergic changes (Frank, et al.,

2004, 2005, 2007). This hypothesis proved to be very successful in explaining under what

circumstances PD-patients and healthy humans differ in learning from positive and negative

feedback during procedural learning tasks (Frank, et al., 2004, 2005).

However, it is still unclear how decision making might be biased following feedback

during ‘non-procedural’ tasks. Frank and colleagues (2007) addressed this question by making

‘positive-learner’ and ‘negative-learner’ subgroups, based on performances on the

6 The A1 allele of the D2 receptor gene polymorphism DRD2-TAQ-IA has been associated with a reduction in D2 receptor density of up to 30%, which is linked to multiple addictive and compulsive behaviors (Ritchie & Noble, 2003).


probabilistic selection task. Next, these subgroups were compared on an unrelated recognition

memory task, using error-related negativity signals7 (ERN) as a dependent measurement.

Results showed that negative learners, based on probabilistic selection task performance, had

larger ERNs in the recognition memory task compared to positive learners (Frank, D'Lauro, &

Curran, 2007). According to Frank and colleagues these results suggest a common underlying

mechanism for error-processing across these tasks, thought to be modulated by striatal ‘Go-

NoGo’ learning with common frontal dopamine levels as a result (Frank, D'Lauro, & Curran,

2007).

Still, it could be argued that the recognition memory task might not be that ‘non-procedural’. The recognition memory task was designed in such a way that it might rely on the

same striatal processing as the probabilistic selection task. During the recognition-memory

task, participants were instructed to make ‘speeded responses’ within 700ms to promote

errors. Throughout the task participants also got feedback on response reaction times,

reminding them to make rapid judgments. As a consequence, it could be that, during the

recognition memory task, participants relied on the same striatal habit processing system as

during the probabilistic selection task to come up with fast responses instead of ‘declaratively’

reflecting upon the stimuli. In the current study we applied the same cross-task comparisons

methodology as used by Frank and colleagues (2007) to investigate more explicitly how

decision making following positive or negative feedback occurs across different learning

systems.

2. Aim of this study

In the current study we want to investigate whether decision making following positive

and negative feedback differs across procedural and declarative memory systems. Previous

research has suggested that there is a competition between the procedural striatal-based

memory system and the declarative hippocampal-based memory system (Poldrack, et al.,

2001). Recent insights derived from patient studies and neurocomputational models have

indicated that individual differences in value-related processing during feedback-learning tasks

are modulated by striatal synaptic changes through the ‘Go’ and the ‘No-Go’ pathway in the

basal ganglia (Frank, et al., 2004, 2005, 2007). These pathways are modulated by dopaminergic

cell firing (Gerfen, 2000).

7 Error related negativity (ERN) is an event-related brain potential which is thought to originate in the anterior cingulate cortex, a brain area crucially involved in monitoring errors (Ridderinkhof, et al., 2004).


2.1. Research question

Up until now, it has remained unclear whether organisms learn differently from positive or

negative feedback in tasks that rely on declarative memory cortices (e.g. explicit declarative

tasks) compared to tasks that rely on striatal processing (e.g. implicit procedural tasks). To

address this research question we adopted two well established procedural learning tasks

(Frank, Seeberger, & O'reilly, 2004) and compared decision making performance on these tasks

with feedback learning performance on two versions of an explicit declarative memory task.

2.2. Rationale

Previous researchers have demonstrated an important genetic factor in learning better

from either positive or negative feedback during implicit procedural tasks (Frank, D'Lauro, & Curran, 2007; Klein, et al., 2007). Given this premise, we reasoned that the implicit

procedural learning tasks will provide a normative scale of individual value-based processing in

the striatum. Using this normative scale, we can further compare whether individual

participants show the same feedback learning bias towards positive or negative feedback

during the more declarative memory tasks. We designed these tasks so that they most

probably rely on medial temporal cortices by (1) employing a cue-stimulus contingency to

promote and facilitate declarative associative learning (Buckner, et al., 1995; Squire, Knowlton,

& Musen, 1993), (2) providing only a single learning trial to promote hippocampal activation,

previously observed early in learning (Poldrack & Packard, 2003; Poldrack, et al., 2001) and (3)

omitting time constraints to promote explicit rational reflection upon the presented stimuli.

2.3. Hypothesis

Since many researchers have stated that value-based decision making is directly

modulated by dopamine (Huang & Kandel, 1995; McClure, Berns, & Montague, 2003;

Pessiglione, et al., 2006; Schultz, Dayan, & Montague, 1997), we expect that participants who

learn better from positive feedback in one task, will also learn better from positive feedback in

another task. We hypothesize that this is more so for tasks that rely on the same memory

cortices (implicit procedural learning tasks) when compared to tasks that rely on different

memory cortices (implicit vs explicit learning task).


3. Method

3.1 Materials and Methods

3.1.1. Participants

Thirty healthy first-year bachelor students in psychology (5 male and 25 female) participated in this study over two separate testing sessions (2 tasks per session). Participants

received partial credits for participation in the experiment after they completed both sessions.

Two participants (1 male and 1 female) were excluded from analysis because they did not show up for the first and/or the second session, and thus did not complete the full

experiment.

3.1.2. Stimuli and Apparatus

We made use of Dell computers (Windows XP) with 17 inch monitors. Participants

faced the monitor at an approximate distance of 50 cm. E-prime 1.1 software was used for

programming the different tasks in the experiment and developing the stimuli (Schneider,

Eschman, & Zuccolotto, 2002). During all four tasks participants had to choose between stimuli

appearing in pairs (left and right) on the screen. During the two implicit procedural memory

tasks stimuli consisted of Japanese Hiragana characters (Frank, Seeberger & O'reilly, 2004),

whereas standardized pictures of well known objects, tools and fruits were used during the

two explicit memory tasks (Brodeur, et al., 2010). Stimuli were randomized across subjects and

tasks. Responses were registered using the keyboard. Participants had to press key 1 or key 0

to select the left or right stimulus, respectively. All stimuli appeared in color (pictures) or in black

font (Hiragana) on a white background.

3.1.3. General Procedure

Each participant performed four learning tasks over two separate testing sessions (two

in the first session and two in the second session). There were at least 72 hours between

testing sessions to avoid potential learning effects across sessions. All four tasks had a two-

alternative forced choice procedure, where participants had to choose one of two stimuli on

the computer screen by pressing one of two keys on the keyboard. Some stimuli had a

negative reinforcement value, whereas others had a positive reinforcement value. There were

two implicit learning tasks (i.e., a probabilistic selection task and a transitive inference task)

and two explicit learning tasks (two versions of a one shot learning task). The order of the tasks

was randomized both within and between sessions, but each session always contained one implicit learning task and one explicit learning task (Fig. 2).


Figure 2. Example of randomized task order for a single subject. Tasks were randomized within and between sessions across subjects. There was always one implicit task and one explicit task in each session.

3.2 Implicit procedural learning tasks

3.2.1. Probabilistic Selection Task

I. Stimuli

During the probabilistic selection task (adopted from Frank, Seeberger, & O'reilly,

2004), pairs of visual stimuli that are not easily verbalized were used (Japanese Hiragana

characters, Fig.3). Following a fixation cross (duration 1000ms), Hiragana stimuli were shown

in black on a white background in 72 pt font. Responses were registered using key “1” (left on

the keyboard) to select the left stimulus or key “0” (right on the keyboard) to select the

stimulus on the right. Visual feedback appeared (duration 1.5 sec) following a choice. Either

the word “Correct!” or the word “Incorrect!” appeared centrally on the screen in green or red,

respectively (Courier New, pt 48). If no response was registered within six seconds, the words

“no response detected” appeared centrally in black (Courier New, pt 18).

II. Procedures

The probabilistic selection task consisted of two phases. Following a practice block,

which consisted of 10 trials, the learning phase started. During the learning phase

three different pairs of stimuli (AB, CD and EF) appeared randomly on the screen. Feedback

was given after each trial in a probabilistic manner (Fig.3A). Choosing stimulus A in the AB pair

led to positive feedback in 80% of AB trials, whereas choosing stimulus B led to negative

feedback in these trials. The CD and EF pairs were less reliable. Choosing stimulus C led to

positive feedback in 70% of CD trials and choosing stimulus E led to positive feedback in 60 %

of EF trials. Over the course of the learning phase, participants learned to choose A, C and E

above B, D and F. To make sure participants learned the correct associations between stimuli

and feedback, a performance evaluation had to be met before advancing to the next phase.

Evaluation occurred following each learning block of 60 trials. Because of the different

probabilistic nature of each stimulus pair, different performance criteria were chosen. In the


AB pair, participants had to choose A above B at least in 65% of the trials. In the CD pair, C had

to be chosen above D in 60% of the trials. In the last pair, stimulus E had to be chosen in 50%

of the trials8. Participants advanced to the test phase if all these criteria were met or after six

learning blocks (360 trials). During the test phase (Fig.3B), training pairs and new pairs were

randomly shown on the screen without feedback. New pairs contained all other possible

combinations of stimuli (AC, DF, BE, …). Participants were instructed to instinctively choose

when confronted with novel pairs. Each test pair was presented six times.
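Purely as an illustration of the feedback schedule and the block-wise promotion rule described above, the following minimal sketch shows how probabilistic feedback and the performance check could be generated. It is written in Python for readability; the actual experiment was programmed in E-prime, and all names used here are hypothetical.

    import random

    # Reward probability of the 'optimal' stimulus in each training pair (80/20, 70/30, 60/40).
    PAIRS = {"AB": 0.8, "CD": 0.7, "EF": 0.6}
    # Accuracy criteria evaluated after each learning block of 60 trials.
    CRITERIA = {"AB": 0.65, "CD": 0.60, "EF": 0.50}

    def give_feedback(pair, chose_optimal):
        """Return True for 'Correct!' feedback, False for 'Incorrect!'."""
        p = PAIRS[pair]
        if chose_optimal:                  # e.g. the participant chose A in the AB pair
            return random.random() < p     # positive feedback on 80/70/60% of those trials
        return random.random() < (1 - p)   # feedback probabilities are reversed otherwise

    def passes_criteria(block_accuracy):
        """block_accuracy: dict mapping pair -> proportion of optimal choices in the last block."""
        return all(block_accuracy[pair] >= CRITERIA[pair] for pair in PAIRS)

    # Example: choosing A in an AB trial yields positive feedback on roughly 80% of trials.
    feedback = give_feedback("AB", chose_optimal=True)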

3.2.2 Transitive Inference Task

I. Stimuli

During the (implicit) transitive inference task (Frank, Rudy, Levy, & O'Reilly, 2005), the

same type of visual stimuli (Japanese Hiragana characters) were used as in the probabilistic

selection task. To avoid confusion and confounding learning effects different characters were

used across the probabilistic selection task and the transitive inference task. Both the order

and the content of the Hiragana characters were counterbalanced. Fixation, stimulus

presentation and feedback presentation was exactly the same as in the probabilistic selection

task.

8 Note that stimulus E is correct in 60% of EF trials, which is particularly difficult to learn. We implemented a 50% (chance level) performance criterion to ensure that participants who consistently choose the slightly more incorrect stimulus F over E cannot go through to the testing phase.

Figure 3. (A) Example of the stimulus pairs (Hiragana) used during the probabilistic selection task, with reward probabilities of 80/20% (AB), 70/30% (CD) and 60/40% (EF). One pair was shown per trial; in actuality, stimuli were randomized across participants. (B) During test, all other combinations of pairs (AC, AD, AE, AF, BC, BD, BE, BF), together with all training pairs, appeared randomly. During test no feedback was given (not shown in this example). Performance was analyzed on all new pairs containing A (positive learning) or B (negative learning).


II. Procedures

During the (implicit) transitive inference task, participants had to learn an underlying

ordinal sequence of stimuli (A>B>C>D>E) based on separate pairs of adjacent elements in the

sequence (AB, BC, …). During this task four pairs of stimuli (Fig.4A) are presented (A+B-, B+C-,

C+D- and D+E-). The + and - characters represent positive and negative feedback, respectively.

Again, as in the probabilistic selection task, participants had to get through a learning segment

before advancing to the testing segment. In the learning segment there were three phases of

blocked trials. The first phase consisted of eight (random) blocks of four trials; per block, one stimulus pair was shown for four consecutive trials. Thus, the first block could, for example, consist of four A+B- trials, the second block of four C+D- trials, and so on. Phase two consisted of

16 (random) blocks of two trials. The third phase was the performance evaluation phase, in

which 32 trials of pairs were randomly shown on the screen, still with feedback.

Figure 4. (A) Example of the adjacent stimulus pairs (Hiragana) used during the transitive inference task. Stimuli were randomized across participants and differed from the probabilistic selection task. (B) Illustration of the associative strength hypothesis (Rudy, Frank, & O'Reilly, 2003): during training, participants implicitly learn to strongly associate A with positive reinforcement, whereas E becomes associated with a lack of positive reinforcement, previously shown to induce dopamine dips (Schultz, 2002). These net associative values then “bleed over” to the other adjacent pairs, such that B in the BC pair has a stronger positive association, whereas D in the CD pair has a stronger negative association (Rudy, Frank, & O'Reilly, 2003). (C) During test, all training pairs and two novel ‘transitive pairs’ were randomly presented eight times each. Performance was analyzed on top pairs AB & BC (positive learning) and bottom pairs CD & DE (negative learning).


To make sure that participants learned the correct associations between stimuli and

feedback, we set a performance criterion at an accuracy level of 75% before participants

advanced to the test segment. In this segment all training pairs and two new transitive pairs

(AE and BD) were randomly shown eight times each, without feedback (Fig.4C). Following the

transitive inference task, participants were given a questionnaire to assess their explicit

awareness of the logical hierarchy of the stimuli, and to determine whether strategies were

used to respond to the novel test pairs.
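For readability, the blocked structure of the learning segment can be summarized in the following sketch (Python for illustration only; the task itself was programmed in E-prime). The assumption that each pair fills two of the eight phase-1 blocks and four of the sixteen phase-2 blocks is ours, as the exact block composition is not stated above.

    import random

    PAIRS = ["A+B-", "B+C-", "C+D-", "D+E-"]

    def build_training_schedule():
        """Return the trial list for the three phases of the learning segment."""
        schedule = []
        # Phase 1: eight randomly ordered blocks, each showing one pair for four consecutive trials
        # (assuming each pair occupies two of the eight blocks).
        blocks = PAIRS * 2
        random.shuffle(blocks)
        for pair in blocks:
            schedule += [pair] * 4
        # Phase 2: sixteen randomly ordered blocks of two trials each (each pair in four blocks).
        blocks = PAIRS * 4
        random.shuffle(blocks)
        for pair in blocks:
            schedule += [pair] * 2
        # Phase 3 (performance evaluation): 32 fully randomized trials, still with feedback.
        phase3 = PAIRS * 8
        random.shuffle(phase3)
        schedule += phase3
        return schedule   # 32 + 32 + 32 = 96 trials per run of the learning segment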

3.3 Explicit episodic memory tasks

3.3.1 One Shot Learning Task (version 1 & 2)

I. Stimuli

In the one-shot learning tasks, a different set of stimuli was used. Instead of using

unknown Japanese Hiragana characters that are relatively difficult to verbalize, highly

recognizable standardized pictures of well known objects, tools and fruits were used (Brodeur,

et al., 2010). Following fixation (duration 1 sec), a cue9 (A) appeared (160x160 pixels) centrally

on the screen above the fixation cross. After 2 seconds a pair of target stimuli (BC) appeared

(160x160 pixels) left and right underneath the cue (A) on the screen. Responses were

registered using key “1” to select the left stimulus or key “0” to select the stimulus on the

right10. Because participants had only a single trial to learn the correct stimulus-stimulus

association, time constraints to make a choice were omitted. Visual feedback was provided

(duration 1.5 sec) after a choice was made. Either the word “Correct!” or “Incorrect!” was

printed centrally on the screen in green or red, respectively (Courier New, pt 48). There were 144 unique pictures of objects, tools or fruits used per task. Both stimulus content and order were counterbalanced.

II. Procedures

During the one shot learning tasks there was a learning segment (Fig.5A), which

consisted of two learning blocks of 24 trials, followed by a memory retrieval test phase that

also consisted of two blocks of 24 trials. During the learning segment, participants only had

one shot at learning to match the cue (A) with one of two target stimuli (B or C). Following

their choice, positive or negative feedback appeared randomly on the screen. After 24 learning

trials there was a break, before the next learning block of 24 trials started. We presented the

9 Cues were implemented in the one shot learning tasks to promote and facilitate (explicit) declarative associative learning and retrieval (Buckner, et al., 1995; Squire, Knowlton, & Musen, 1993).
10 Because we wanted to make sure that participants learned about the chosen stimuli (to approach or avoid them), no opportunity was given to look back at the correct stimulus when subjects responded incorrectly.


same two blocks of 24 trials to the participants in the memory retrieval phase, without

feedback (Fig. 5B). The order of trials, as well as the location of the target stimuli, was

randomized within each block. Both one shot learning tasks followed exactly the same procedure, though different sets of stimuli were used across the two tasks, and the two tasks were never administered in the same session (Fig. 2).

3.4 Data Analysis

3.4.1 Probabilistic Selection Task

I. Data filtering

Since we were mainly interested in the extent to which subjects learned from positive and

negative feedback following their choices, we firstly had to make sure participants learned the

basic task. Although the performance criteria were implemented to resolve this issue, some

participants performed worse on the training pairs during the testing segment in comparison

with the learning segment.

Figure 5. (A) Example of a learning trial during training. Each trial was only presented once. When a choice was made, stimuli disappeared and feedback was given. (B) Example of a correctly solved test trial; the procedure was the same as during training, with the exception that feedback was omitted.

To overcome this issue, we excluded participants who did not perform better than chance during the test phase in the easiest training pair conditions (AB

pair). We rationalized that if these participants could not reliably choose A above B in this pair,

results in novel pairs were meaningless. By filtering the data in this manner, three participants

were excluded because they did not perform better than chance level (50%) in the easiest

training pair.

II. Test Pair Analysis

We firstly wanted to test whether there were systematic differences across subjects in

learning from positive reinforcement (choose A) versus learning from negative reinforcement

(avoid B) in this task. To test whether there were any differences as described above, we

performed a paired sample Student t-test. The degree to which participants learned from

positive reinforcement (choosing A) was operationalized as the performance level on the novel

pairs involving A (AC, AD, AE and AF) (Fig. 3B). Comparatively, the degree to which participants learned from negative reinforcement (avoiding B) was operationalized as the performance level on the novel pairs involving B (BC, BD, BE and BF) (Fig. 3B). We measured effect sizes using

Cohen’s d, where d ≥ 0.2 represents a small effect, d ≥0.5 represents a medium effect and d ≥

0.8 represents a large effect.
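As a concrete illustration of this analysis, a minimal sketch follows (Python, with hypothetical example accuracies). Note that Cohen's d can be computed in several ways for paired designs; the variant based on the paired differences shown here is only one common convention, and the thesis does not state which variant was used.

    import numpy as np
    from scipy import stats

    # Hypothetical per-participant test accuracies (proportion correct) on the novel pairs.
    choose_A = np.array([0.75, 0.60, 0.90, 0.55, 0.80])  # pairs involving A (positive learning)
    avoid_B  = np.array([0.70, 0.65, 0.85, 0.60, 0.75])  # pairs involving B (negative learning)

    # Paired-sample Student t-test across subjects.
    t_stat, p_value = stats.ttest_rel(choose_A, avoid_B)

    # Cohen's d for paired data: mean difference divided by the SD of the differences.
    diff = choose_A - avoid_B
    cohens_d = diff.mean() / diff.std(ddof=1)

    print(f"t = {t_stat:.3f}, p = {p_value:.3f}, d = {cohens_d:.2f}")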

III. Training Analysis

In the learning phase of the probabilistic selection task, different performance criteria

were implemented to make sure participants learned the correct stimulus-reinforcement

associations. If participants failed to reach these performance criteria, they had to do the

learning phase again until they finally reached the performance criteria. Thus, some

participants performed more learning trials than others. We therefore checked, using general

linear model regression analysis with continuous measures, whether general test performance,

performance on the easiest training pair (AB) and performance on either choosing A or

avoiding B could be explained by the number of learning trials. For regression analysis we

measured effect sizes using eta squared (η2), where η2 = 0.01 represents a small effect, η2 =

0.06 represents a medium effect and η2 = 0.14 or larger represents a large effect.
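A minimal sketch of such a regression check, computing eta squared from the sums of squares, could look like this (Python, with hypothetical data; for a single continuous predictor, eta squared coincides with R²):

    import numpy as np
    import statsmodels.api as sm

    # Hypothetical data: number of learning trials and test performance per participant.
    n_training_trials = np.array([60, 120, 120, 180, 240, 300, 60, 180])
    test_performance  = np.array([0.82, 0.75, 0.90, 0.70, 0.85, 0.78, 0.88, 0.72])

    X = sm.add_constant(n_training_trials)   # predictor plus intercept
    model = sm.OLS(test_performance, X).fit()

    # Eta squared for the predictor: explained sum of squares over total sum of squares.
    ss_total = ((test_performance - test_performance.mean()) ** 2).sum()
    ss_resid = (model.resid ** 2).sum()
    eta_squared = 1 - ss_resid / ss_total     # equals R^2 for a single-predictor model

    print(model.f_pvalue, eta_squared)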

IV. Session Analysis

Some participants performed the probabilistic selection task in the first session and the

transitive inference task in the second session, whereas other participants did it the other way

around. To minimize learning effects between these two visually similar tasks, we made sure

there were 72 hours between sessions. Nevertheless, it is possible that there were some non-

specific transfer effects across sessions that are unrelated to the particular nature of each task.


We therefore checked, using general linear model regression analysis with the categorical

between-subjects variable session as predictor, whether general test performance,

performance on the easiest training pair (AB) and performance on either choosing A or

avoiding B could be explained by which session participants were in.

3.4.2 Transitive Inference Task

I. Data filtering

As in the probabilistic selection task (PS11), we excluded participants who did not

perform better than chance on the easiest training pairs (AB & DE) during the testing segment.

As a result, we filtered out two participants who did not perform better than, on average, 50%

across anchor pairs AB & DE. Analyses described below apply to the remainder of the

participants (n = 26).

II. Test Pair Analysis

Similarly to the PS-task, we investigated whether there were systematic differences

across subjects in learning from positive feedback versus learning from negative feedback.

Similar to previous studies using an implicit version of the transitive inference task and in

accordance with the associative strength hypothesis12, we rationalized that stimuli at the top

of the hierarchy (A, B) have net positive associations, whereas stimuli at the bottom of the

hierarchy (D, E) have net negative associations (Frank, O'Reilly, & Curran, 2006; Frank,

Seeberger, & O'reilly, 2004; Rudy, Frank, & O'Reilly, 2003). As a result, learning from positive

reinforcement ameliorates performance on the AB and BC pairs, while learning from negative

reinforcement ameliorates performance on the CD and DE pairs. Therefore, we

operationalized the degree to which participants learned from positive feedback as the

performance level on AB & BC pairs during test. Similarly, the degree to which participants

learned from negative feedback was operationalized as the performance level on CD & DE

pairs during test. We also checked whether subjects performed better on the ‘easier’ anchor

pairs (AB & DE) in comparison with inner pairs (BC & CD). Moreover, pairwise analyses between separate training pairs during test were conducted to get a more detailed insight into the

11 For clarity reasons we will use the following abbreviations to refer to the different tasks: PS-task for the probabilistic selection task, TI-task for the transitive inference task and OSL1/OSL2-task for the two versions of the one shot learning task.
12 According to the associative strength hypothesis, the top and bottom pairs, respectively AB and DE, “anchor” the development of associative values: during training, agents implicitly learn to strongly associate A with positive reinforcement (since choosing A always induces positive feedback), while E becomes associated with strong negative reinforcement (since E always induces negative feedback). These net associative values then “bleed over” to the other adjacent pairs, such that B in the BC pair has a stronger positive association, whereas D in the CD pair has a stronger negative association, even though B and D are positively (negatively) reinforced during half of the trials (see Rudy, Frank, & O'Reilly, 2003 for a detailed description of how this occurs).


pattern of previous results. To test differences between conditions, a paired sample Student t-

test was conducted. Novel test pairs AE and BD were analyzed separately since these pairs

could be solved by either learning to choose A and B or by avoiding D and E.

III. Training Analysis

As in the PS-task, a performance criterion was implemented in this task. As a

consequence, some participants performed more learning trials when compared to others. We

checked, using general linear model regression analysis with continuous measures, whether

general test performance and performance on top (AB & BC) or bottom (CD & DE) pairs could be

explained by the number of learning trials.

IV. Session Analysis

Comparatively to the PS task, we checked possible confounding effects of session using

general linear model regression analysis with categorical between-subjects variables. General

test performance and performance on top and bottom pairs were tested using the factor

session as predictor.

V. Awareness Questionnaire

After completing the transitive inference task participants were asked to fill in a

questionnaire (translated from Frank, Seeberger, & O'reilly, 2004) asking about the familiarity

with Japanese Hiragana characters and, importantly, the degree to which participants explicitly

became aware of the underlying hierarchy in the task. None of the participants indicated

having any experience with the Hiragana characters. Surprisingly, when asked whether

participants “had the impression that there was some kind of logical rule, order or hierarchy

between the symbols” (Frank, Seeberger, & O'reilly, 2004), 11 out of 28 participants indicated

becoming aware of the underlying order or hierarchy between the symbols. The remaining 17

participants did not notice the underlying hierarchy in the task. Because we are generally

interested in differences between learning from reinforcement across more implicit and more

explicit learning tasks, it would be interesting to see whether there are any differences in

learning performance between implicit and more explicit learning within one task. We

therefore reanalyzed the data checking whether the degree of awareness (implicit or explicit

learning) predicts differences in performance level in the transitive inference task. We tested,

using one-way between subjects ANOVAs, whether implicit learners differed significantly from

explicit learners on general test performance, performance on top pairs, bottom pairs and

novel pairs. Furthermore we carried out 2x2 ANOVAs for the between subjects factor group

(implicit, explicit) and the within subjects factor hierarchy (Top, Bottom) to check for

interaction effects between explicit/implicit learners and positive (top) or negative (bottom)


learning. We also performed pairwise Student’s t-tests to check whether the previously

observed differences between anchor pairs and inner pairs remained significant for the implicit

and the explicit subgroup.

3.4.3 One Shot Learning Task (version 1 & 2)

I. Data filtering

We excluded participants who did not perform better than chance (50%) during the

testing phase. Not a single participant performed worse than chance level in either the first or the second OSL task. Participants were instructed to associate

a cue-object and one of the target-objects based on feedback and to do this “as accurately as

possible”. As a consequence there were no specific time constraints during the learning or

testing blocks during the one shot learning tasks. To make sure participants did not take

advantage of this lack of time constraints to use all sorts of strategies during learning, we

checked reaction times during the learning segment. We excluded participants who had an

outlying average reaction time during the learning segment (M + 2SD). This was the case for

three participants in the first one shot learning task and for two participants in the second one

shot learning task. Analyses described below apply to the remainder of the

participants (n = 25 for the first OSL task and n = 26 for the second OSL task).
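The exclusion rule can be sketched as follows (Python, with hypothetical reaction times; whether the standard deviation used n or n-1 in the denominator is not specified above, so the sample estimate is assumed here):

    import numpy as np

    # Hypothetical mean reaction times (in ms) per participant during the learning segment.
    mean_rts = np.array([2100, 1850, 2400, 9800, 2050, 2300, 8700, 1950])

    cutoff = mean_rts.mean() + 2 * mean_rts.std(ddof=1)   # M + 2SD exclusion criterion
    included = mean_rts <= cutoff                          # boolean mask of retained participants

    print(f"cutoff = {cutoff:.0f} ms, excluded = {(~included).sum()} participants")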

II. Test pair Analysis

We used the same methodology in the one shot learning tasks as in the probabilistic

selection task and the transitive inference task. Again, we researched whether there were

systematic differences across subjects in learning from positive reinforcement versus learning

from negative reinforcement. To do so we tested, using paired sample Student t-tests,

whether there were differences across subjects between recognition accuracy following

positive feedback and recognition accuracy following negative feedback.

III. Session Analysis

Similar to previous tasks, we tested for possible confounding effects of session using

general linear model regression analysis with categorical between-subjects variables. General

test performance, recognition accuracy after positive feedback and negative feedback were

analyzed using the factor session as predictor.

3.4.4 Cross-task Comparisons

In general, we wanted to investigate how participants performed on learning from

positive and negative feedback across tasks. Previous studies have suggested that there is an

important genetic factor that determines the inter-individual variability in learning better from


either positive or negative feedback (Frank, D'Lauro, & Curran, 2007; Klein, et al., 2007).

Therefore, we hypothesized that participants who learn better from positive feedback in one task should also learn better from positive feedback in another task. We expect this to be more the case for tasks that rely on the same memory cortices (implicit procedural learning tasks) when compared to tasks that rely on different memory cortices (for example, the PS-task compared to the OSL1 task). To test this assumption we divided participants into two subgroups (positive and negative learners, see Table 1) for each separate task, comparable to

Frank, D'Lauro, & Curran (2007). By doing so, we could check whether subgroups (positive

learners and negative learners) in one task could predict value-related differences in other tasks. We performed analyses across tasks to answer two main questions: (1) Do participants

who perform well on one task also perform well on another task? (2) Is the inter-individual

variability in biased feedback learning robust across tasks? To do so we examined how

between task performance was related within-sessions, between sessions and both within and

between implicit and explicit tasks.

1. Do participants who perform well on one task also perform well on another task?

To investigate whether the individual performance rank of participants was consistent

across tasks we conducted Spearman’s Rank Correlations on general test performance

between tasks in session 1 and 2, between tasks across sessions 1 and 2 and across implicit

and explicit tasks, regardless of session. To keep analysis across tasks as comparable as

possible, participants who were excluded from analysis in the separate task analyses were also excluded from the cross-task analyses. Results described below apply to the remainder of the

participants13 (n = 21).
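For illustration, a minimal sketch of such a rank correlation between the general test performance of two tasks is given below (Python, hypothetical data):

    import numpy as np
    from scipy import stats

    # Hypothetical general test performance (proportion correct) for the same participants.
    perf_implicit_task = np.array([0.81, 0.92, 0.76, 0.88, 0.69, 0.95, 0.73, 0.84])
    perf_explicit_task = np.array([0.90, 0.85, 0.78, 0.91, 0.72, 0.88, 0.80, 0.79])

    # Spearman's rank correlation between the two tasks' performance rates.
    rho, p_value = stats.spearmanr(perf_implicit_task, perf_explicit_task)
    print(f"rs = {rho:.2f}, p = {p_value:.3f}")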

2. Is the inter-individual variability in learning better from either positive or negative feedback

robust across tasks?

To investigate whether the positive or negative learning bias within subjects was

robust across tasks, we had to transform performance rates on positive and negative learning

conditions to a single score. We did so by simply subtracting performance rates of the negative

learning condition from performance rates of the positive learning condition. The range of this

bias rate is between (-1, 1). Negative bias rates represent subjects that learned better from

13 Three participants were excluded from the PS-task, two participants were excluded from the TI-task. Three and two participants were excluded from the first and the second OSL task, respectively. Three participants were excluded in more than one task, which results in a total of seven participants that were excluded from cross-task comparisons on general test performance.


Subgroup: Criterion (Sample size)

Probabilistic Selection Task
  Positive learners: Choose A performance > Avoid B performance (n = 12)
  Negative learners: Avoid B performance > Choose A performance (n = 12)
Transitive Inference Task
  Positive learners: Top pair performance (AB & BC) > Bottom pair performance (CD & DE) (n = 10)
  Negative learners: Bottom pair performance (CD & DE) > Top pair performance (AB & BC) (n = 11)
One Shot Learning Task (1)
  Positive learners: Performance after positive feedback > Performance after negative feedback (n = 16)
  Negative learners: Performance after negative feedback > Performance after positive feedback (n = 8)
One Shot Learning Task (2)
  Positive learners: Performance after positive feedback > Performance after negative feedback (n = 18)
  Negative learners: Performance after negative feedback > Performance after positive feedback (n = 4)

Table 1. Inter-individual variability across learning tasks in learning better from positive feedback than from negative feedback (positive learners) and learning better from negative feedback than from positive feedback (negative learners). Participants who scored equally well following positive and negative feedback were excluded since they do not add relevant information to the analyses derived from this classification.


negative feedback, whereas positive bias rates represent subjects that learned better from

positive feedback. Again we conducted Spearman’s Rank Correlations on the bias rate

between tasks in session 1 & 2, between tasks across sessions 1 & 2 and across implicit and

explicit tasks, regardless of session. Furthermore, we wanted to check whether biased learners

in one task could predict biased learning in the other tasks. To do so we examined whether the

factor group (Positive/Negative Learners), derived from the implicit PS task, could predict the

degree of bias rate in the other tasks, using general linear model regression analysis.
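A compact sketch of the bias rate, the subgroup classification of Table 1, and the cross-task correlation of the bias rate is given below (Python, with hypothetical data and variable names):

    import numpy as np
    from scipy import stats

    # Hypothetical per-participant accuracies in the positive and negative learning conditions
    # of two tasks (for example, the PS-task and one of the OSL tasks).
    pos_task1 = np.array([0.80, 0.65, 0.90, 0.70, 0.85])
    neg_task1 = np.array([0.70, 0.75, 0.80, 0.60, 0.90])
    pos_task2 = np.array([0.88, 0.72, 0.95, 0.78, 0.83])
    neg_task2 = np.array([0.80, 0.78, 0.85, 0.70, 0.88])

    # Bias rate: positive-learning accuracy minus negative-learning accuracy, range (-1, 1).
    bias_task1 = pos_task1 - neg_task1
    bias_task2 = pos_task2 - neg_task2

    # Subgroup classification as in Table 1 (ties are dropped from the classification).
    positive_learners = bias_task1 > 0
    negative_learners = bias_task1 < 0

    # Is the learning bias related across the two tasks?
    rho, p_value = stats.spearmanr(bias_task1, bias_task2)
    print(f"rs = {rho:.2f}, p = {p_value:.3f}")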

To avoid losing valuable information in the implicit and explicit cross-task analysis from

participants’ learning bias, we were less conservative in excluding participants. Contrary to

previous cross-task comparisons, for each between-task comparison we only excluded participants whose exclusion was relevant for the two tasks involved14. This allowed us to have more power in separate between-task comparisons, although it makes comparisons across all tasks more difficult. Nevertheless, since we are mainly interested in the relatedness of the bias rate between specific tasks, we argue that it is beneficial to have more power.

4. Results

4.1 Separate Task Analysis

4.1.1 Probabilistic Selection Task

Analysis across all subjects (Fig. 6A) on choose A performance compared to avoid B

performance (Table 2) did not show significant differences [t24 = -0.104, p = 0.918, two-tailed, d

= 0.03]. Training analysis did not show significant effects of the number of learning trials on

general test performance [F(1,23) = 0.058, p = 0.812, η2 < 0.001 ], AB pair performance [F(1,23)

= 0.238, p = 0.630, η2 = 0.01], general choose A performance [F(1,23) = 0.171, p = 0.683, η2 =

0.07] (Fig. 6B) or general avoid B performance [F(1,23) = 0.017, p = 0.898, η2 < 0.001](Fig. 6C).

Session analysis did not show significant effects of session on general test performance

[F(1,23) = 1.010, p = 0.325, η2 = 0.04], AB pair performance [F(1,23) = 0.446, p = 0.511, η2 =

0.01], general choose A performance [F(1,23) = 1.314, p = 0.263, η2 = 0.05] or general avoid B

performance [F(1,23) = 2.278, p = 0.145, η2 = 0.09].

14 For example, we did not exclude participants previously excluded due to possible confounding strategy use in one of the OSL tasks when correlating the PS-task with the TI-task.


                      n     Mean    SD
AB pairs              25    97.3%   5.63
Choose A              25    72%     23.79
Avoid B               25    72.6%   16.2
# Training trials     25    160     93.06

Table 2. Descriptive statistics of test performance during the probabilistic selection task on AB pairs, choose A, avoid B and training trial frequency.

Figure 6. (A) No significant differences were observed across subjects between choose A and avoid B test performance. Data are presented as mean ± SEM (n.s. = non-significant). (B) The number of learning trials did not significantly predict choose A performance [F(1,23) = 0.171, p = 0.683]. (C) Learning trial frequency did not significantly predict avoid B performance [F(1,23) = 0.017, p = 0.898].


4.1.2 Transitive Inference Task

Across all subjects comparisons between top pairs AB & BC and bottom pairs CD & DE

(see Table 3 for descriptive statistics) did not show significant differences [t25 = 0.191, p =

0.850, two-tailed, d = 0.05 ] (Fig. 7A). Separate pair analysis on the performance level between

“anchor” pairs AB and DE showed no significant differences [t25 = 0.611, p = 0.547, two-tailed,

d = 0.17]. Similarly, no significant difference was observed between “inner” pairs BC and CD [t25 = -0.113, p = 0.811, two-tailed, d < 0.01]. Pairwise comparisons

between anchor pairs and inner pairs (Fig. 7B) did demonstrate a significant difference [t25

=3.757, p = 0.001, two-tailed, d = 0.7]. These results suggest that, on average, differences in

solving pairs correctly are mainly due to whether it is an inner or anchor pair rather than its

place (higher or lower) in the hierarchy.

Cross-subjects comparisons between novel pairs AE and BD (Fig. 7C) did not show

significant differences [t25 < 0.001, p > 0.05 , two-tailed, d < 0.01]. Furthermore, training

analysis did not demonstrate significant effects of the number of learning trials on general test

performance [F(1,24) = 3.573, p = 0.122, η2 = 0.096], general performance on top pairs AB &

BC [F(1,24) = 0.942, p = 0.341, η2 = 0.04] or general performance on bottom pairs CD & DE

[F(1,24) = 0.173, p = 0.681, η2 < 0.001]. Session analysis did not show significant effects across

subjects of the factor session on general test performance[F(1,24) = 0.887, p = 0.356, η2 =

0.036], performance on top pairs AB & BC [F(1,24) = 0.381, p = 0.543, η2 = 0.015] or on

performance on bottom pairs CD & DE [F(1,24) = 1.012, p = 0.325, η2 = 0.04].

                         n     Mean      SD
Top pairs AB & BC        26    87.9%     15.69
Bottom pairs CD & DE     26    87%       17.61
Anchor pairs             26    95.91%    14.13
Inner pairs              26    78.85%    21.32
Novel pair AE            26    93.75%    20.39
Novel pair BD            26    93.75%    19.76
# Training trials        26    118.15    19.77

Table 3. Descriptive statistics of test performance during the transitive inference task on top pairs, bottom pairs, anchor pairs, inner pairs, novel pairs and training trial frequency.


Figure 7. (A) No significant differences were observed across subjects between top and bottom pair test performance. (B) Participants scored significantly better (p = 0.001) on anchor pairs when compared to inner pairs. (C) No significant differences were observed across subjects between novel pairs AE and BD. Data are presented as mean ± SEM (n.s. = non-significant, *** = p < 0.001).

When we divided participants15 into an implicit and explicit subclass based on the

awareness questionnaire, results indicated a significant difference between implicit and

explicit learners on general test performance [F(1,26) = 7.379, p = 0.012, η2 = 0.22].

Participants who explicitly learned the hierarchy between symbols generally performed better

during test relatively to participants who learned implicitly (see Table 4 for descriptive

statistics). This effect was still significant without the two, previously excluded, weak

performers16 [F(1,24) = 5.722, p = 0.025, η2 = 0.19]. Further analysis (Fig.8) suggests that the

above mentioned differences are driven by positive learning, since explicit learners perform

significantly better when compared to implicit learners on top pairs AB & BC [F(1,24) = 8.937, p

< 0.01, η2 = 0.27], but not on bottom pairs CD & DE [F(1,24) = 1.635, p = 0.213, η2 = 0.06] or novel pairs AE & BD [F(1,24) = 0.973, p = 0.334, η2 = 0.04]. However, both the explicit learning

group and the implicit learning group did not perform significantly better on top pairs when

compared to bottom pairs [t10 = 1.341, p = 0.209, d = 0.48 and t14= -0.246, p = 0.809, d = 0.11,

respectively]. Furthermore, no significant interaction effect for group x hierarchy was observed

[F(1,24) = 0.553, p = 0.464, η2 = 0.02 ].

15 Note that we firstly included all participants in the awareness analysis to check whether there was a general difference in performance between implicit and explicit learners. Both participants who were initially excluded from analysis indicated that they were not aware (implicit group) of the underlying order in the task. Since we do not know whether the low performance of these participants is related to either implicit learning or to a general confusion during the testing phase, we tested differences between implicit and explicit learners both with and without them.
16 These participants were again excluded for further analysis, to make sure that effects are driven by differences between implicit and explicit learning, not by outlying values (possibly due to test confusion) in the implicit group.



Group comparisons between implicit and explicit learners (Fig. 8) did not show a

significant difference on anchor pairs [F(1,24) = 1.045, p = 0.317, η2 = 0.043]. However, we did

observe a significant difference between implicit and explicit learners on inner pairs [F(1,24) =

8.555, p < 0.01, η2 = 0.26]. Similar to previous analysis, implicit learners performed significantly

better on anchor pairs when compared to inner pairs [t14 = 3.757, p = 0.002, two-tailed, d=

1.47], but no significant difference was observed between performance on anchor pairs and

inner pairs for the explicit learning group [t10 = 1.627, p = 0.135, two-tailed, d = 0.65].

                                                    n     Mean      SD
General performance
  Explicit learners (aware of the hierarchy)        11    95.8%     7.34
  Implicit learners (not aware of the hierarchy)    17    80.98%    16.90
Performance on top pairs AB & BC
  Explicit learners                                 11    97.16%    4.30
  Implicit learners                                 15    81.22%    17.18
Performance on inner pairs
  Explicit learners                                 11    90.91%    15.65
  Implicit learners                                 15    70.83%    18.25

Table 4. Descriptive statistics of the transitive inference task, controlled for underlying hierarchy awareness.

Figure 8. Test performance on the training pairs (AB, BC, CD, DE) for participants who were aware (explicit learners) or unaware (implicit learners) of the underlying hierarchy in the transitive inference task. Data are presented as mean ± SEM.


Taken together these results clearly indicate that participants who explicitly learned

the underlying hierarchy between symbols generally perform better than participants who

implicitly learned the hierarchical relationship between symbols. This effect seems driven by

learning performance following positive feedback (top pairs), rather than learning following

negative feedback. The main difference between implicit and explicit learners in this task is a difference in performance on the inner pairs (BC & CD): explicit learners, like implicit learners, perform well on anchor pairs, but in addition they maintain a high performance rate on inner pairs relative to implicit learners.

Although these results seem very clear, they should be interpreted with caution since

observed effects are mostly due to a ceiling effect on test performance in the relatively small

explicit group. This high performance rate of explicit learners on all testing pairs could also

explain the lack of an interaction effect between group and top/bottom pair performance.

4.1.3. One shot learning Task (1)

Test pair analysis across all subjects demonstrated a significant difference between

recognition accuracy following positive feedback and recognition accuracy following negative

feedback [t24 = 2.152, p = 0.042, two-tailed, d = 0.47], with a bias to learn better from positive

feedback compared to learning following negative feedback (Table 5, Fig. 9A). Session analysis

did not show significant effects of session on general recognition accuracy [F(1,23) = 0.374, p =

0.547, η2 = 0.016], recognition accuracy following positive feedback [F(1,23) = 0.018, p = 0.896,

η2 < 0.001] or recognition accuracy following negative feedback [F(1,23) = 0.487, p = 0.492, η2

= 0.02].

One Shot Learning Task (1)                 n     Mean     SD
Performance after positive feedback        25    86.5%    11.17
Performance after negative feedback        25    80.4%    14.41
One Shot Learning Task (2)
Performance after positive feedback        26    92.3%    8.77
Performance after negative feedback        26    78.2%    15.42

Table 5. Descriptive statistics for the one shot learning tasks (1 and 2).


Figure 9. (A) In the first version of the one shot learning task, participants performed, on average, significantly better following positive feedback when compared to performance following negative feedback. (B) This bias across subjects towards learning better following positive feedback was also observed in the second version of the one shot learning task, using different stimuli. Data are presented as mean ± SEM (* = p < 0.05, *** = p < 0.001).

4.1.4. One shot learning Task (2)

Across-subjects analysis of test pairs confirmed the previously observed results (Fig.

9B). In the second OSL task participants again showed higher recall accuracy following positive

feedback relative to recall accuracy following negative feedback (Table 5) during test [t25 =

4.886, p < 0.001, two-tailed, d = 1.12]. Session analysis did not show significant effects of

session on general recognition accuracy [F(1,24) = 2.560, p = 0.123, η2 = 0.095] and recognition

accuracy following negative feedback [F(1,24) = 0.165, p = 0.689, η2 < 0.001]. We did observe a

significant effect of session on performance following positive feedback [F(1,24) = 5.661, p =

0.026, η2 = 0.19]: performance following positive feedback was higher for participants who did the second OSL task in the first session (M = 95.76%, SD = 5.05) than for participants who did it in the second session (M = 88.23%, SD = 10.55).

4.2 Cross-Task Analysis

4.2.1. Relationships between session and tasks on general test performance?

I. Tasks Within and Between Sessions

General test performance between the implicit and explicit task conducted in the first

session showed no significant correlation (n=21, rs= 0.19, p= 0.410). Also, no significant

correlation was observed between general test performance on the implicit and explicit task in

the second session (n=21, rs = -0.03, p= 0.883). There were no significant correlations between


the implicit task performance rate in the first session and the explicit task performance rate in the second session (n = 21, rs = -0.01, p = 0.964), nor between the implicit task in the second session and the explicit task in the first session (n = 21, rs = -0.03, p = 0.436). There was a marginally non-significant negative correlation between implicit task performance rates across sessions (n = 21, rs = -0.41, p = 0.062). We did observe a significant positive correlation between explicit task performance rates across sessions (n = 21, rs = 0.52, p = 0.015). These results suggest that participants who perform better than the other subjects on one task within a session do not consistently perform better on the other task within that session. However, participants who perform better on the implicit task in one session tend to perform worse, relative to the other participants, on the implicit task in the other session. On the other hand, better performers on the explicit task in a given session are also better performers on the explicit task in the other session.

II. Implicit and Explicit tasks

Correlation coefficients between test performance rates across tasks are shown in Table 6. There was a significant correlation between the OSL tasks (n = 21, rs = 0.51, p = 0.018). No significant correlations across tasks were observed between PS-OSL1 (p = 0.586), PS-OSL2 (p = 0.175), TI-OSL1 (p = 0.943), TI-OSL2 (p = 0.489) or TI-PS (p = 0.620). When we controlled for awareness in the TI task, the correlation between the two implicit tasks became smaller for participants who were unaware of the underlying hierarchy in the TI task (n = 11, rs = -0.05, p = 0.989). Correlation coefficients between the TI task and the OSL (1 & 2) tasks became larger, but remained non-significant, when we restricted the analysis to participants who were explicitly aware of the underlying hierarchy in the TI task (n = 10, rs = 0.22, p = 0.540 and rs = 0.23, p = 0.520, respectively). These results suggest that participants who perform better than the other subjects on one OSL task also perform better on the other OSL task. No such relationship was observed between the other tasks.

PS TI OSL (1) OSL (2)
PS 1
TI .12 1
OSL (1) .13 .02 1
OSL (2) -.31 .16 .51* 1

Table 6. Spearman rank correlation coefficients between test performance rate across tasks. A significant positive correlation was observed between OSL tasks, indicating that participants who performed well on one OSL-task are likely to score well on the other OSL task (* = p < 0.05).
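
The cross-task correlations in Table 6 can be reproduced along these lines; this is a minimal sketch assuming a pandas DataFrame with one row per participant and one test performance column per task (the column names are illustrative assumptions):

    import pandas as pd
    from scipy.stats import spearmanr

    def cross_task_correlations(df, tasks=("PS", "TI", "OSL1", "OSL2")):
        """Spearman rank correlations between all pairs of task performance columns."""
        rows = []
        for i, a in enumerate(tasks):
            for b in tasks[i + 1:]:
                pair = df[[a, b]].dropna()           # keep participants with data on both tasks
                rs, p = spearmanr(pair[a], pair[b])
                rows.append({"tasks": f"{a}-{b}", "n": len(pair), "rs": rs, "p": p})
        return pd.DataFrame(rows)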


4.2.2. Inter-individual bias towards positive or negative learning across tasks?

To examine whether positive or negative learners are consistently biased towards learning better from positive or negative feedback, we first tested whether positive (or negative) learners do in fact learn better from positive (or negative) feedback when compared to negative (or positive) learners, using one-way between-subjects ANOVAs. Indeed, positive learners learned better from positive feedback than negative learners during the PS task [F(1,22) = 22.510, p < 0.001, η2 = 0.51], the TI task [F(1,19) = 7.788, p = 0.012, η2 = 0.029] and the OSL2 task [F(1,20) = 13.124, p = 0.002, η2 = 0.39], but not during the OSL1 task [F(1,22) = 3.786, p = 0.065, η2 = 0.15].

Similarly, negative learners learned better from negative feedback than positive learners during the PS task [F(1,22) = 11.467, p = 0.003, η2 = 0.34] and the TI task [F(1,19) = 6.44, p = 0.02, η2 = 0.25], but not during the OSL1 task [F(1,22) = 4.104, p = 0.055, η2 = 0.16] or the OSL2 task [F(1,20) = 2.747, p = 0.131, η2 = 0.11]. The between-group tests in the OSL tasks do, however, show a clear trend towards significance, and the lack of significance is most likely due to the smaller sample size of the negative-learner group in the explicit memory tasks (see Table 1). We should therefore be cautious in drawing broad conclusions from these analyses. Nevertheless, we used the subgroup classifications from the implicit procedural learning tasks for further analysis, to check whether they could predict value-related differences in the other tasks.
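
For clarity, the subgroup classification and bias-rate measure used in the cross-task analyses below can be summarized as follows; the classification mirrors the operationalization in footnote 17, whereas the exact bias-rate formula shown here is an assumption made for illustration only:

    def classify_learner(choose_a_acc, avoid_b_acc):
        """Positive learner if choose-A accuracy exceeds avoid-B accuracy, otherwise negative learner."""
        return "positive" if choose_a_acc > avoid_b_acc else "negative"

    def bias_rate(acc_after_positive, acc_after_negative):
        """Signed learning bias: values above zero indicate better learning from positive feedback."""
        return acc_after_positive - acc_after_negative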

I. Tasks Within and Between Sessions

Correlation coefficients for learning-bias rates between the implicit and explicit tasks were not significant for the first session (n = 21, rs = -0.12, p = 0.601) or the second session (n = 21, rs = -0.10, p = 0.656). We did observe a significant negative correlation between the implicit task bias rates in the first session and the explicit task bias rates in the second session (n = 21, rs = -0.56, p = 0.009). This was not the case for the relationship between the implicit task in the second session and the explicit task in the first session (n = 21, rs = -0.03, p = 0.895).

Furthermore, no significant correlations were observed between explicit task bias rates across sessions (n = 21, rs = -0.17, p = 0.463) or between implicit task bias rates across sessions (n = 21, rs = 0.08, p = 0.722). These results suggest that participants who learned better from negative feedback on the implicit task in the first session were more likely to learn better from positive feedback on the explicit task in the second session. No such relationships were observed within and between sessions for the other tasks.


II. Implicit and Explicit tasks

Correlation coefficients between bias rates across tasks are shown in Table 7. No significant correlations were observed across tasks. However, we did observe trends towards a significant negative correlation between the TI task and both the first (n = 23, rs = -0.36, p = 0.09) and the second (n = 24, rs = -0.38, p = 0.068) OSL task. Correlations between the PS task and the first (n = 23, rs = -0.02, p = 0.973) and second (n = 23, rs = -0.31, p = 0.148) OSL tasks were less clear. The correlation between the implicit tasks was small, positive and non-significant (n = 23, rs = 0.10, p = 0.644), whereas the correlation between the explicit tasks was small, negative and non-significant (n = 25, rs = -0.13, p = 0.539).

PS TI OSL (1) OSL (2)
PS 1
TI .10 1
OSL (1) -.02 -.37 1
OSL (2) -.31 -.38 -.13 1

Table 7. Spearman rank correlation coefficients between bias rates across tasks. No significant correlations were observed between tasks. A small positive correlation was observed between the implicit procedural tasks, and relatively high negative correlations were observed between the implicit procedural tasks and the second one shot learning task.

Again we controlled for awareness in the TI task. Correlation coefficients are presented in Table 8. However, the large inter-individual variation in bias rates and the low sample sizes across tasks make it very hard to interpret these results validly.

Table 8. Spearman rank correlation coefficients between bias rates across tasks, controlled for awareness in the TI-task. No significant correlations were observed between tasks. The relatively low sample sizes make it hard to draw valid conclusions following this analysis.

         PS                            OSL (1)                        OSL (2)
TI-Aw    rs = .59, p = 0.072, n = 10   rs = -.37, p = 0.260, n = 11   rs = -.37, p = 0.263, n = 11
TI-nAw   rs = -.25, p = 0.417, n = 13  rs = -.41, p = 0.191, n = 12   rs = -.52, p = 0.068, n = 13


General linear model regression analysis using the predictor groupPS (positive vs. negative learners), derived from the implicit PS task (see Table 1), showed no significant effects on bias rates in the TI task (Fig. 10A) [F(1,23) = 0.006, p = 0.939, η2 < 0.001] or the OSL1 task [F(1,22) = 0.036, p = 0.852, η2 < 0.001]. Interestingly, we did observe a significant effect of the predictor groupPS on bias rates in the OSL2 task [F(1,23) = 5.908, p = 0.023, η2 = 0.2], where positive PS-learners (Fig. 10C) had, on average, lower positive bias rates (n = 13, MBR = 0.07, SD = 0.116) than negative PS-learners (n = 12, MBR = 0.21, SD = 0.155).

When we changed the predictor groupPS to the predictor groupTI (positive vs. negative learners), derived from the implicit TI task (see Table 1), the results showed broadly the same pattern. No significant effects of the predictor groupTI were observed on bias rates in the PS task (Fig. 10B) [F(1,16) = 0.596, p = 0.451, η2 = 0.036] or the OSL1 task [F(1,17) = 0.655, p = 0.429, η2 = 0.038]. We did observe a significant effect of the predictor groupTI on bias rates in the OSL2 task [F(1,18) = 5.054, p = 0.037, η2 = 0.22]. Again, positive TI-learners (Fig. 10D) had, on average, significantly lower positive bias rates (n = 10, MBR = 0.08, SD = 0.111) than negative TI-learners (n = 10, MBR = 0.22, SD = 0.162).
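
These group analyses can be expressed along the following lines; this is a minimal sketch using statsmodels for illustration, with assumed column names ('group_ps' coded as positive/negative, 'bias_osl2' holding the OSL2 bias rates), and it is not the software actually used for the analyses reported here:

    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    def group_predicts_bias(df):
        """General linear model with a categorical subgroup predictor for OSL2 bias rates."""
        model = smf.ols("bias_osl2 ~ C(group_ps)", data=df).fit()
        return sm.stats.anova_lm(model, typ=2)       # F-test for the group effect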

Taken together, these results suggest that participants who perform better following positive feedback than following negative feedback in one implicit learning task are more likely to show the same pattern of results in the other procedural learning task, whereas they are more likely to show the opposite pattern in the explicit memory tasks. Biased positive learners in one of the implicit procedural tasks show a lower positive bias rate in the more explicit memory learning tasks, when compared to biased negative learners. However, most of our cross-task comparisons show no or only small significant effects, most likely due to our relatively small sample sizes across tasks. Further investigations using larger samples and perhaps different implicit tasks (e.g., how implicit is the transitive inference task?) are necessary to confirm this pattern of results.


[Figure 10: bias rates for positive vs. negative learners; panel A: TI bias rate by PS-learner subgroup, panel B: PS bias rate by TI-learner subgroup, panel C: OSL (2) bias rate by PS-learner subgroup, panel D: OSL (2) bias rate by TI-learner subgroup.]

Figure 10. Regression analysis on bias rates using subgroups (positive and negative learners) derived from the probabilistic selection task and the transitive inference task as predictors. (A) Positive and negative PS-learners did not predict differences in inter-individual bias rates in the transitive inference task. (B) Positive and negative TI-learners also did not predict differences in inter-individual bias rates in the probabilistic selection task. (C) Positive PS-learners had significantly lower positive bias rates when compared to the bias rates of negative PS-learners in the second OSL task. (D) Positive TI-learners had significantly lower positive bias rates when compared to the bias rates of negative TI-learners in the second OSL task.



5. Discussion

In the current study we investigated whether individual differences in learning from positive or negative feedback differ between tasks that rely on declarative memory cortices and tasks that rely on cortices involved in habit formation. Recent research on the neural bases of making choices following feedback has mainly focused on the role of dopamine and the striatum. Collectively, these studies pointed out a crucial role of midbrain dopamine neurons and their striatal targets in learning to predict reward (Daw, Niv, & Dayan, 2005; Delgado, et al., 2000; Frank, et al., 2004; Holroyd & Coles, 2002; Pessiglione, et al., 2006; Schultz, et al., 1997). These findings concerning the role of the dopaminergic-striatal circuitry in reinforcement learning are formalized in a prediction-error signal that guides choices by updating value representations over repeated experiences of feedback (Schultz, et al., 1997; Hollerman & Schultz, 1998; Holroyd & Coles, 2002; Sutton & Barto, 1998). This allows an organism to use previous experiences to optimize choices when confronted with a similar situation.
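
In its simplest form, such a prediction-error update can be written as a delta rule; the sketch below is a generic illustration with an arbitrary learning rate, not the specific model fitted to our data:

    def update_value(value, reward, alpha=0.1):
        """Shift a stored value representation toward the obtained outcome."""
        delta = reward - value        # prediction error: obtained minus expected outcome
        return value + alpha * delta  # repeated updates let past feedback bias future choices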

However, since organisms are rarely confronted with exactly the same environment, decisions made in the past may not simply repeat themselves. Instead, novel choices mostly involve new options and contexts, which requires a flexible integration of knowledge from the past with novel information. Previous investigations on flexibly generalizing knowledge from the past to guide choices in novel situations have highlighted an important role of the declarative memory system in the medial temporal lobe (Eichenbaum, 2000; Hassabis, et al., 2007; Shohamy & Adcock, 2010; Squire, 1992). To investigate how individuals differ in learning to make optimal decisions following feedback, we adopted the probabilistic selection task designed by Frank et al. (2004). We compared optimal decision making performance following positive and negative feedback in this task with (1) performance on an implicit version of the transitive inference task and (2) performance on two versions of a declarative memory task.

5.1 Learning within the procedural memory system.

5.1.1. The Probabilistic Selection Task

Besides its use in investigating the underlying mechanisms of procedural learning processes, the probabilistic selection task has frequently been used to study inter-individual variability in learning more from good choices than from bad choices (Frank, et al., 2004; Lighthall, et al., 2013; Simons, Howard, & Howard, 2010). Our results from the probabilistic selection task showed no difference in learning performance following positive feedback relative to learning performance following negative feedback, when averaged across subjects.


As expected, when distinguishing between a positive learner and a negative learner subgroup17, similar to Frank and colleagues (2005, 2007), we did observe that positive learners learned significantly better following positive feedback relative to negative learners. Accordingly, negative learners learned significantly better following negative feedback when compared to positive learners. These results are in line with previous investigations that used the probabilistic selection task with young and healthy subjects (Frank, et al., 2005, 2007; Klein, et al., 2007; Simons, Howard, & Howard, 2010).

A previous study used the probabilistic selection task and the implicit transitive inference task with PD patients to investigate individual differences in reinforcement learning (Frank, Seeberger, & O'Reilly, 2004). In that experiment it was assumed that the probabilistic selection task and the implicit transitive inference task rely on the same neural substrate, namely the basal ganglia. The results suggested that inter-individual biases in feedback learning are a direct consequence of higher or lower levels of dopamine that differently affect striatal synaptic changes (Frank, et al., 2004).

Phasic bursts of dopamine, following positive feedback, excite D1 receptors in the direct pathway, which induces long-term potentiation (LTP18) in striatal Go cells (Holroyd & Coles, 2002; Frank, 2005; Nischi, Snyder, & Greengard, 1997). Furthermore, phasic bursts of dopamine inhibit the indirect pathway via D2 receptors, which induces long-term depression (LTD19) in striatal No-Go cells (Calabresi, et al., 1997). Brief dopaminergic dips below baseline that follow negative feedback have the opposite effect, i.e., they prevent LTP in striatal Go cells and LTD in striatal No-Go cells, respectively (Calabresi, et al., 1997; Frank, 2005; Holroyd & Coles, 2002; Nischi, et al., 1997; Schultz, 2002). Consequently, high or low levels of dopamine bias the Go or No-Go pathway to be more active, with better learning from positive or negative feedback as the behavioral output (Frank, et al., 2004).
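
As a toy abstraction of this asymmetry (and not Frank's (2005) full neural network model), the opposing weight changes can be sketched as follows, with an illustrative learning rate:

    def update_go_nogo(w_go, w_nogo, positive_feedback, alpha=0.1):
        """Opposite weight changes in the Go and No-Go pathways after feedback."""
        if positive_feedback:        # phasic dopamine burst: strengthen Go, weaken No-Go
            w_go += alpha
            w_nogo -= alpha
        else:                        # dopamine dip: weaken Go, strengthen No-Go
            w_go -= alpha
            w_nogo += alpha
        return w_go, w_nogo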

17 We adopted this distinction from Frank, Woroch, and Curran (2005). Positive learners were operationalized as those participants who performed better on choose-A trials than on avoid-B trials, whereas negative learners were operationalized as those participants who performed better on avoid-B trials than on choose-A trials (Frank et al., 2005, 2007).

18 Long-term potentiation (LTP) is an activity-dependent change in the strength of synapses, mediated by NMDA receptors. As a result of pre- and postsynaptic co-activation, a wide range of local biochemical changes strengthen the synaptic connectivity. It is widely accepted that the LTP process can be interpreted as the cellular correlate of associative learning and memory formation in general (Fedulov, et al., 2007; Whitlock, et al., 2006).

19 Long-term depression (LTD) is the cellular mechanism of synaptic weakening. The difference between LTP and LTD, although they are not simply mirror images, lies in the magnitude of the calcium signals in the postsynaptic cell. LTD can, to some extent, be seen as the cellular correlate of forgetting (Foy, 2001).


These assumptions of Frank's Go-NoGo model, together with findings regarding a genetic factor in biased feedback learning, led us to the hypothesis that the inter-individual range of learning better from positive or negative feedback, observed in the probabilistic selection task, would closely relate to the pattern of results in the implicit transitive inference task (Frank, et al., 2004; Klein, et al., 2007). In our experiment, results showed only a very small (rs = 0.10) non-significant correlation between bias rates from the probabilistic selection task and bias rates from the implicit transitive inference task, suggesting that learning from positive and negative feedback across these tasks might not be modulated by the same underlying mechanisms.

5.1.2. The Transitive Inference task.

The transitive inference task has previously been used to study higher-order reasoning, where organisms have to learn a hierarchical structure of stimuli based on inferences drawn from adjacent pairs in an ordinal sequence (Dusek & Eichenbaum, 1997; Van Opstal, Verguts, Orban, & Fias, 2007). In everyday terms, this means that participants learn to logically infer that Vincent is taller than Eden, based on the premises that Vincent is taller than Kevin and Kevin is taller than Eden. Studies using different modifications of the transitive inference task with both animals and humans have suggested that this task importantly involves the hippocampus (Dusek & Eichenbaum, 1997; Greene, et al., 2006; Van Opstal, et al., 2007). However, a recent study challenged the assumption of a necessary involvement of the hippocampus in transitive inference tasks. This study demonstrated that participants with a temporarily disrupted hippocampus, due to the benzodiazepine midazolam, showed enhanced transitive inference performance by fully recruiting the dopamine-striatal learning system (Frank, O'Reilly, & Curran, 2006). These results were in line with their previously proposed associative strength hypothesis, which explains how organisms transitively infer associations through implicit reinforcement learning mechanisms (Frank, Rudy, Levy, & O'Reilly, 2005; Rudy, Frank, & O'Reilly, 2003).

According to the associative strength hypothesis, the outer pairs (AB, DE) at the top and bottom of the underlying hierarchy "anchor" the development of associative values. Over consecutive trials, agents implicitly learn to associate A with positive reinforcement, because choosing A always leads to positive feedback. In contrast, choosing E becomes associated with negative reinforcement, because choosing E always induces negative feedback. These net associative values then transfer to the inner adjacent pairs (BC, CD). As a result, B in the BC pair has a stronger positive association, whereas D in the CD pair has a stronger negative association, even though B and D are each positively (or negatively) reinforced on only half of the trials (Rudy, Frank, & O'Reilly, 2003).
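
The anchoring part of this account can be illustrated with a toy simulation; the sketch below implements only elemental value learning on the chosen item (with illustrative parameters) and not the transfer mechanism itself, but it shows how A drifts toward the top of the value range and E toward the bottom, because choosing A is always rewarded and choosing E never is:

    import random

    def simulate_elemental_values(n_trials=400, alpha=0.1, noise=0.1):
        """Toy value learning on the premise pairs of the hierarchy A > B > C > D > E."""
        V = {s: 0.5 for s in "ABCDE"}                              # initial associative values
        pairs = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]   # first item of each pair is rewarded
        for _ in range(n_trials):
            win, lose = random.choice(pairs)
            pick = max((win, lose), key=lambda s: V[s] + random.gauss(0, noise))  # noisy value-based choice
            reward = 1.0 if pick == win else 0.0                   # feedback follows the hierarchy
            V[pick] += alpha * (reward - V[pick])                  # delta-rule update of the chosen item
        return V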

Although other researchers have questioned some of the pharmacological assumptions of Frank's midazolam study (see Green, 2007, and Frank, et al., 2008, for comment and reply), it is generally agreed that the transitive inference task can be solved by both explicit 'declarative' strategies and implicit 'procedural' strategies (Green, 2007; Van Elzakker, et al., 2003; Rudy, Frank, & O'Reilly, 2003). This might explain the lack of a significant correlation in biased feedback learning between the probabilistic selection task and the transitive inference task in our study.

Indeed, when we controlled20 for explicit and implicit strategies in the TI task, a substantial proportion of the participants (more than one third) indicated using an explicit strategy to solve the transitive inference task. When we took the results of our questionnaire into account, the results indicated that explicit learners performed better than implicit learners. Our data suggest that this effect can largely be explained by better performance of explicit learners on the inner pairs. Furthermore, results from the implicit learner group suggest that implicitly learning the underlying hierarchy is driven by positive and negative associative values in the outer pairs, since implicit learners scored significantly better on 'outer pairs' (AB, DE) relative to 'inner pairs' (BC, CD). This pattern of results is in line with the associative strength hypothesis (Rudy, Frank, & O'Reilly, 2003).

With regard to our research question, no significant differences were observed across subjects between performance following positive feedback (top pairs, AB & BC) and performance following negative feedback (bottom pairs, CD & DE), neither for explicit learners nor for implicit learners. However, one interesting finding was that explicit learners performed significantly better on top pairs, but not on bottom pairs, when compared to implicit learners. This finding suggests that the performance level of explicit learners is more likely driven by learning following positive feedback than by learning following negative feedback.

5.2 Learning across the procedural and the declarative memory system.

In line with the results of Frank and colleagues' midazolam study, results of animal studies using spatial navigation tasks have indicated that inactivating one learning system (e.g., the hippocampus) improves performance on tasks related to the other learning system (e.g., striatal processing). These results have provided strong evidence for a bidirectional dissociation between the 'declarative' memory system and the 'procedural' memory system, indicating that both memory systems, respectively supported by the hippocampus and the striatum, interact competitively under some circumstances (Frank, O'Reilly, & Curran, 2006; Lee, et al., 2008; Poldrack & Packard, 2003).

20 We adopted a translated version of the questionnaire used by Frank and colleagues (2004) to control for explicit awareness of the underlying hierarchy in the transitive inference task.

However, a very recent lesion study, using an associative reinforcement learning task rather than a spatial navigation task, suggested that striatal processing might be a prerequisite for declarative associative learning following reinforcement. Results of this study showed that rodents with striatal lesions had impaired procedural and declarative-like memories, whereas rodents with hippocampal lesions had impaired declarative-like memories but spared procedural memories (Jacquet, et al., 2013). These results suggest that striatal processing might be necessary for decision making following feedback in declarative memory tasks.

5.2.1. Comparing results with the ‘episodic-like’ one-shot learning tasks.

In our study we wanted to investigate whether decision making from feedback differed between procedural learning tasks and declarative learning tasks within subjects. We presumed that participants who learn better from positive feedback in a procedural task would show the same learning bias in the declarative memory tasks. This hypothesis was based on several considerations.

First, dopamine modulates learning from positive or negative feedback (Holroyd & Coles, 2002; Schultz, et al., 1997). Second, it has been shown that dopaminergic projections to the striatum and the hippocampus modulate cellular learning in both regions (Frank, et al., 2004; Frey, 1990; Huang & Kandel, 1995). Third, there is strong evidence that learning from reinforcement is directly or indirectly modulated by striatal processing, which has been shown to be a prerequisite for learning declarative memory tasks (Frank, 2005; Pessiglione, et al., 2006; Jacquet, et al., 2013). Fourth, previous investigations that used performance following probabilistic feedback to examine error processing in a recognition memory task found results consistent with our hypothesis (Frank, et al., 2007).

Surprisingly, when we directly compared the implicit and explicit tasks, the results indicated that participants who learned better from negative feedback in the implicit procedural tasks were more likely to learn better from positive feedback in the explicit declarative memory tasks, when compared to participants who learned better from positive feedback in the implicit procedural tasks. This opposite learning bias across implicit and explicit learning tasks was especially pronounced for the second version of the explicit memory task and was seemingly driven by a general bias towards learning from positive feedback during the explicit memory tasks. In both one shot learning tasks we observed that subjects had significantly better recognition


accuracy on trials previously followed by positive feedback when compared to trials previously

followed by negative feedback. It could be argued that these effects are task specific.

However, a similar learning bias for explicit learners in the transitive inference task suggests

otherwise.

One possible explanation for these results could be that dopamine plays a functionally different modulatory role in hippocampal-based associative learning than in striatal value-related processing (Horvitz, 2000; Frey & Morris, 1998a; Li, 2003). Recent methodological developments, together with a growing interest in dopamine's neuromodulatory role in the brain, have led researchers to believe that dopamine might have a more heterogeneous functional role in processing motivationally relevant information, closely related to encoding value-related information (Lisman, Grace, & Duzel, 2011; Shohamy & Adcock, 2010).

In accordance with this hypothesis, studies have demonstrated (1) dopaminergic

activity when non-rewarding salient stimuli are presented (Horvitz, 2000), (2) activity in some

dopamine cells when aversive stimuli are presented (Matsumoto & Hikosaka, 2009) and (3)

dopaminergic cell firing when novel stimuli are presented (Ljungberg, et al., 1992).

Furthermore, emerging findings have indicated that dopaminergic cell firing to novel and

salient stimuli directly corresponds with hippocampal activity (Fyhn, et al., 2002; Jenkins, et al.,

2004; Kumaran & Maguire, 2006). Inspired by these findings, Lisman and Grace (2005)

proposed a neurobiological model concerning the functional interaction between the midbrain

dopamine system and the declarative memory system in processing motivationally relevant

novel information.

According to this model, novel sensory information is compared with stored information about the environment in the CA1 output region of the hippocampus. If the novel information does not accord with the already established information, a novelty signal (comparable to the prediction-error signal) is sent, through parts of the cortico-basal ganglia-thalamocortical loop, towards midbrain dopamine cells, where it increases dopaminergic cell firing. Because the midbrain dopamine system projects directly to the hippocampus, this increased cell activity in turn facilitates hippocampal associative learning via D1/D5 receptors (Legault & Wise, 2001; Lisman & Grace, 2005).

Later on these researchers, among others, suggested that this dopamine-dependent

facilitation of LTP in the hippocampus is relevant for all events related to increased


dopaminergic cell firing (Frey & Morris, 1998a; Huang & Kandel, 1995; Lisman, Grace, & Duzel,

2011; Shohamy & Adcock, 2010). When we apply the hypothesis derived from this model to

the one shot learning tasks, novel associations followed by increased dopaminergic cell firing

(i.e., related to positive feedback) should result in better memory traces, which is exactly what

we observed.

Taken together, many studies have indicated an important involvement of dopamine and the striatum in procedural learning based on positive and negative feedback. However, it remains largely unclear how learning from feedback occurs in more declarative memory tasks, thought to rely on the medial temporal lobe and the hippocampus. Previous investigations have suggested that the procedural (habit) and declarative (goal-directed) learning systems interact competitively under some circumstances. In this study we investigated how learning following feedback differs between both systems by directly comparing individual learning biases across procedural and declarative learning tasks. The results showed a general tendency to learn better from positive feedback in the declarative learning tasks, but not in the procedural learning tasks. Participants who learned better from negative feedback during the procedural tasks were more likely to learn better from positive feedback in the explicit declarative memory tasks. These results suggest a different functional role for the declarative and the procedural memory system in learning from negative or positive feedback.


6. References

Albin, R., Young, A., & Penney, J. (1989). The functional anatomy of basal ganglia disorders.

Trends in neurosciences 12, 366-375.

Alexander, G. E., DeLong, M. R., & Strick, P. L. ( 1986). Parallel organization of functionally

segregated circuits linking basal ganglia and cortex. Annual Review of Neuroscience 9,

357-381.

Alexander, G., & Crutcher, M. (1990). Preparation for movement: Neural representations of

intended direction in three motor areas of the monkey. Journal of neurophysiology 64,

133-150.

Andén, N. E., Fuxe, K., Hamberger, B., & Hökfelt, T. (1966). A quantitative study on the nigro-

neostriatal dopamine neurons. Acta physiology scandinavia 67, 306-312.

Baddeley, A. (2001). The concept of episodic memory. Philosophical Transactions of the Royal Society B: Biological Sciences, 356, 1345-1350.

Baddeley, A., Eysenck, M. W., & Anderson, M. C. (2009). Memory. New York: Psychology Press.

Balleine, B. W., & Dickinson, A. (1998). Goal-directed instrumental action: Contingency and

incentive learning and their cortical substrates. Neuropharmacology 37, 407-419.

Barnes, T. D., Kubota, Y., Hu, D., Jin, D., & Graybiel, A. (2005). Activity of striatal neurons

reflects dynamic encoding and recoding of procedural memories. Nature,437, 1158-

1161.

Barto, A. G. (1995). Adaptive critics and the basal ganglia. In J. C. Houk, J. L. Davis, & D. G.

Beiser, Models of information processing in the basal ganglia (pp. 215-232).

Cambridge,MA: MIT Press.

Barto, A., Sutton, R., & Anderson, C. (1983). Neuronlike adaptive elements that can solve

difficult learning control problems. IEEE Transaction on Systems, Man & Cybernetics,

13, 834-846.

Bellman, R. (1958). On a routing problem. Quart. J. Appl. Math. 16, 87-90.

Bray, S., & O'Doherty, J. (2007). Neural coding of reward-prediction error signals during

classical conditioning with attractive faces. Journal of Neurophysiology, 97, 3036-3045.

Brodeur, M. B., Dionne-Dostie, E., Montreuil, T., & Lepage, M. (2010). The bank of

standardized stimuli (BOSS), a new set of 480 normative photos of objects to be used

as visual stimuli in cognitive research. PloS ONE, 5, e10773.

Brown, J., & Braver, T. (2005). Learned predictions of error likelihood in the anterior cingulate

cortex. Science, 307, 1118-1121.


Buckner, R. L., Petersen, S. E., Ojemann, J. G., Miezin, F. M., Squire, L. R., & Raichle, M. E.

(1995). Functional Anatomical Studies of Explicit and Implicit Memory Retrieval tasks.

Journal of Neuroscience 15(1), 12-29.

Calabresi, P., Saiardi, A., Pisani, A., Baik, J., Centonze, D., Mercuri, N., . . . Borelli, E. (1997).

Abnormal synaptic plasticity in the striatum of mice lacking dopamine D2 receptors.

Journal of Neuroscience, 17, 4536-4544.

Daw, N. D., Niv, Y., & Dayan, P. (2005). Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nature Neuroscience, 8(12), 1704-1711.

Delgado, M., Nystrom, L., Fissell, C., Noll, D., & Fiez, J. (2000). Tracking the hemodynamic

responses to reward and punishment in the striatum. Journal of Neurophysiology, 84,

3072-3077.

Dickinson, A., & Balleine, B. (2002). The role of learning in motivation. In C. R. Gallistel (Ed.), Stevens' Handbook of Experimental Psychology, Vol. 3: Learning, Motivation and Emotion (pp. 497-533). New York: John Wiley & Sons.

Dubois, B., Malapani, C., Verin, C., Rogelet, P., Deweer, B., & Pillon, B. (1994). Cognitive

functions in the basal ganglia: the model of Parkinson disease. Revue Neurologique

(Paris), 150, 763-770.

Dusek, J. A., & Eichenbaum, H. (1997). The hippocampus and memory for orderly stimulus

relations. Proceedings of the National Academy of Sciences, 94, 7109-7114.

Eichenbaum, H. (2000). A cortical-hippocampal system for declarative memory. Nature Review

Neuroscience 1, 41-50.

Emson, P., & Koob, G. (1978). The origin distribution of dopamine-containing afferents to the

rat frontal cortex. Brain research, 142, 249-267.

Fedulov, V., Rex, C., Simmons, D., Palmer, L., Gall, C., & Lynch, G. (2007). Evidence that long-term potentiation occurs within individual hippocampal synapses during learning. Journal of Neuroscience, 27, 8031-8039.

Foy, M. R. (2001). Long-term Depression (Hippocampus). International Encyclopedia of the Social & Behavioral Sciences, 9047-9078.

Frank, M. J. (2005). Dynamic Dopamine modulation in the Basal Ganglia: A

neurocomputational Account of cognitive Deficits in Medicated and Nonmedicated

Parkinsonism. Journal of Cognitive Neuroscience, 51-72.

Frank, M., D'Lauro, C., & Curran, T. (2007). Cross-task individual differences in error processing: Neural, electrophysiological and genetic components. Cognitive, Affective & Behavioral Neuroscience, 7, 297-308.

Frank, M., O'Reilly, R., & Curran, T. (2006). When memory fails, intuition reigns: Midazolam

enhances implicit inference in humans. Psychological Science ,16, 700-707.


Frank, M., O'Reilly, R., & Curran, T. (2008). Midazolam, hippocampal function, and transitive

inference: Reply to Green. Behavioral and Brain Functions, 4:5.

Frank, M., Rudy, J., Levy, W., & O'Reilly, R. (2005). When logic fails: Implicit transitive inference

in humans. Memory and Cognition, 742-750.

Frank, M., Seeberger, L. C., & O'Reilly, R. (2004). By Carrot or by Stick: Cognitive reinforcement learning in Parkinsonism. Science, 1940-1943.

Frank, M., Woroch, B., & Curran, T. (2005). Error-Related Negativity Predicts Reinforcement

learning and Conflict Biases. Neuron, 47, 495-501.

Frey, U. (1990). Dopaminergic antagonists prevent long-term maintenance of posttetanic LTP

in the CA1 region of rat hippocampal slices. Brain Research 522, 69-75.

Frey, U., & Morris, R. (1998a). Synaptic tagging and long term potentiation.

Neuropharmacology 37, 545-552.

Fuxe, K., Hokfelt, T., Johansso, O., Jhonson, G., Lidbrink, P., & Ljungdah, A. (1974). Origin of

dopamine nerve-terminals in limbic and frontal cortex- evidence for meso-cortico

dopamine neurons. Brain Research, 82, 349-355.

Fyhn, M., Molden, S., Hollup, S., Moser, M., & Moser, E. (2002). Hippocampal neurons

responding to first-time dislocation of a target object. Neuron, 555-566.

Gerfen, C. (2000). Molecular effects of dopamine on striatal projection pathways. Trends in

neurosciences 23, 64-70.

Gerfen, C. R., & Surmeier, J. (2011). Modulation of striatal projection systems by dopamine.

Annual review Neuroscience,34, 441-466.

Gerfen, C. R., Engber, T. M., Mahan, L. C., Susel, Z., Chase, T. N., Monsama, F., & Sibley, D. R.

(1990). D1 and D2 dopamine recepto-regulated gene expression of striatonigral and

striatopallidal neurons. Science, 250 , 1429-1432.

Gerfen, C., & Wilson, C. (1996). The basal ganglia. In L. Swanson, A. Bjorkland, & T. Hokfelt,

Handbook of chemical neuroanatomy Vol 12: Integrated systems of the CNS (pp. 371-

468). Amsterdam: Elsevier.

Gilbert, D. T., & Wilson, T. D. (2007). Prospection: experiencing the future. Science, 317, 1351-

1354.

Gläscher, J., Daw, N., Dayan, P., & O'Doherty, J. P. (2010). States versus Rewards: Dissociable

Neural prediction error signals underlying model-based and model-free reinforcement

learning. Neuron, 585-595.

Graf, P., & Shacter, D. L. (1985). Implicit and explicit memory for new associations in normal

and amnesic subjects. Journal of Experimental Psychology Learning Memory and

Cognition 11, 501-518.


Graybiel, A. (1998). The basal ganglia and chunking of action repertoires. Neurobiology of

Learning and Memory, 70, 119-136.

Green, A. J. (2007). Implicit transitive inference and the human hippocampus: does intravenous midazolam function as a reversible hippocampal lesion? Behavioral and Brain Functions, 3, 51.

Greene, A. J., Gross, W., Elsinger, C., & Rao, S. M. (2006). An fMRI analysis of the human

hippocampus: inference, context and task awareness. Journal of Cognitive

Neuroscience, 18, 1156-1173.

Hassabis, D., Kumaran, D., Vann, S. D., & Maguire, E. A. (2007). Patients with hippocampal

amnesia cannot imagine new experiences. PNAS, 104, 1726-1731.

Hernandez-Lopez, S., Bargas, J., Surmeier, D., Reyes, A., & Galarraga, E. (1997). D1 receptor activation enhances evoked discharge in neostriatal medium spiny neurons by modulating an L-type Ca conductance. Journal of Neuroscience, 17, 3334-3342.

Hernandez-Lopez, S., Tkatch, T., Perez-Garci, E., Galarraga, E., Bargas, J., Hamm, H., & Surmeier, D. (2000). D2 dopamine receptors in striatal medium spiny neurons reduce L-type Ca currents and excitability via a novel PLC1-IP-calcineurin-signaling cascade. Journal of Neuroscience, 20, 8987-8995.

Hikosaka, O. (1989). Role of basal ganglia in initiation of voluntary movements. In M. A. Arbib,

& S. Amari, Dynamic interactions in neural networks: Models and data (pp. 153-167).

Berlin: Springer-Verlag.

Hollerman, J., & Schultz, W. (1998). Dopamine neurons report an error in the temporal prediction of reward during learning. Nature Neuroscience, 304-309.

Holroyd, C., & Coles, M. (2002). The neural basis of human error processing: Reinforcement

learning,dopamine and the error-related negativity. Psychological Review 109, 679-

709.

Horvitz, J. (2000). Mesolimbocortical and nigrostriatal dopamine responses to salient non-reward events. Neuroscience, 96, 651-656.

Huang, Y., & Kandel, E. (1995). D1/D5 receptor agonists induce a protein synthesis-dependent late potentiation in the CA1 region of the hippocampus. Proceedings of the National Academy of Sciences, 2446-2450.

Jacquet, M., Lecourtier, L., Cassel, R., Loureiro, M., Cosquer, B., Escoffier, G., Marchetti, E.

(2013). Dorsolateral striatum and dorsal hippocampus : a serial contribution to

acquisition of cue-reward associations in rats. Behavioral Brain Research, 239, 94-103.

Jankovic, J. (2008). Parkinson's disease: clinical features and diagnosis. Journal of Neurology, Neurosurgery and Psychiatry, 79, 368-376.


Jenkins, T., Amin, E., Pearce, J., Brown, M., & Aggleton, J. (2004). Novel spatial arrangements of

familiar visual stimuli promote activity in the rat hippocampal formation but not the

parahippocampal cortices: a c-fos expression study. Neuroscience 124, 43-52.

Joel, D., & Weiner, I. (1997). The connections of the primate subthalamic nucleus: indirect

pathways and the open-interconnected scheme of basal ganglia-thalamocortical

circuitry. Brain research review,23, 62-78.

Johnson, A., & Redish, A. D. (2007). Neural ensembles in CA3 transiently encode paths forward

of the animal at a decision point. Journal Neuroscience 27, 12176-12189.

Kamin, L. (1969). Predictability, surprise, attention and conditioning. In A. Campbell & R. M. Church (Eds.), Punishment and aversive behavior (pp. 242-259). New York: Appleton-Century-Crofts.

Klein, T., Neumann, J., Reuter, M., Hennig, J., von Cramon, Y., & Ullsperger, M. (2007).

Genetically Determined Differences in Learning from Errors. Science, 318, 1642-1645.

Klopf, A. H. (1982). The Hedonistic Neuron: A theory of Memory Learning, and Intelligence.

Hemisphere.

Knowlton, B. J., Mangels, J. A., & Squire, L. R. (1996). A neostriatal habit learning system in

humans. Science, 273, 1399-1402.

Kumaran, D., & Maguire, E. (2006). An unexpected sequence of events: mismatch detection in

the human hippocampus. Plos Biology (4) 12, e424.

Lee, A. S., Duman, R. S., & Pittenger, C. (2008). A double dissociation revealing bidirectional

competition between striatum and hippocampus during learning. PNAS, 105, 241-249.

Legault, M., & Wise, R. (2001). Novelty-evoked elevations of nucleus accumbens dopamine:

dependence on impulse flow from the ventral subiculum and glutamatergic

neurotransmission in the ventral tegmental area. European journal of Neuroscience,

819-828.

Li, S. (2003). Dopamine-dependent facilitation of LTP induction in hippocampal CA1 by

exposure to spatial novelty. Nature Neuroscience 6, 1407-1417.

Lighthall, N. R., Gorlick, M. A., Schoeke, A., Frank, M., & Mather, M. (2013). Stress modulates

reinforcement learning in younger and older adults. Psychology and Aging, 28, 35-46.

Lisman, J., & Grace, A. (2005). The Hippocampal-VTA Loop: Controlling the Entry of Information

into Long-Term Memory. Neuron, 703-712.

Lisman, J., Grace, A., & Duzel, E. (2011 ). A neoHebbian framework for episodic memory; role

of dopamine-dependent late LTP. Trends in Neurosciences 34, 536-547.

Ljungberg, T., Apicella, P., & Schultz, W. (1992). Responses of monkey dopamine neurons

during learning of behavioral reactions. Journal of Neurophysiology 67, 145-163.


Maddox, W., & Filoteo, J. (2001). Striatal contributions to category learning: Quantitative

modeling of simple linear and complex non-linear rule learning in patients with

Parkinson's disease. Journal of the international Neuropsychological Society 7, 710-

727.

Matsumoto, M., & Hikosaka, O. (2009). Two types of dopamine neuron distinctly convey

positive and negative motivational signals. Nature, 837-841.

McClure, S. M., Berns, G. S., & Montague, P. R. (2003). Temporal prediction errors in a passive

learning task activate human striatum. Neuron,38, 339-346.

Miller, R. R., Barnet, R. C., & Grahame, N. J. (1995). Assessment of the Rescorla-Wagner model. Psychological Bulletin, 117(3), 363-386.

Milner, B., Squire, L. R., & Kandel, E. (1998). Cognitive Neuroscience and the study of memory.

Neuron 20, 445-468.

Mink, J. (1996). The basal ganglia: Focused selection and inhibition of competing motor

programs. Progress in Neurobiology 50, 381-425.

Mink, J. (2003). The basal ganglia and involuntary movements: impaired inhibition of

competing motor patterns. Archives of neurology, 60, 1365-1368.

Mirenowicz, J., & Schultz, W. (1994). Importance of unpredictability for reward responses in primate dopamine neurons. Journal of Neurophysiology, 72, 1024-1027.

Montague, P., Dayan, P., & Sejnowski, T. (1996). A framework for mesencephalic dopamine

systems based on predictive Hebbian learning. Journal of Neuroscience 16, 1936-1947.

Nicola, S., Surmeier, J., & Malenka, R. (2000). Dopaminergic modulation of neuronal excitability in the striatum and nucleus accumbens. Annual Review of Neuroscience, 23, 185-215.

Nischi, A., Snyder, G., & Greengard, P. (1997). Bidirectional regulation of DARPP-32

phosphorylation by dopamine. Journal of Neuroscience 17, 8147-8155.

Niv, Y. (2009). Reinforcement learning in the brain. Journal of mathematical psychology 53,

139-154.

O'Doherty, J., Dayan, P., Friston, K., Critchley, H., & Dolan, R. J. (2003). Temporal difference

models and reward-related learning in the human brain. Neuron, 38, 329-337.

O'Doherty, J., Dayan, P., Schultz, J., Deichmann, R., Friston, K., & Dolan, R. (2004). Dissociable

Roles of Ventral and Dorsal Striatum in Instumental Conditioning. Science, 304, 452-

454.

Olds, J., & Milner, P. (1954). Positive reinforcement produced by electrical stimulation of septal

area and other regions of the rat brain. Journal of comparative and physiological

psychology 47, 419-427.


Packard, M. G., & McGaugh, J. L. (1996). Inactivation of hippocampus or caudate nucleus with

lidocaine differentially affects expression of place and response learning. Neurobiology

of Learning and Memory, 65, 65-72.

Pavlov, I. (1927). Conditioned Reflexes: An Investigation of the physiological Activity of the

Cerebral Cortex. Translated and Edited by G.V. Anrep. London: Oxford University Press.

Pessiglione, M., Seymour, B., Flandin, G., Dolan, R. J., & Frith, C. D. (2006). Dopamine-dependent prediction errors underpin reward-seeking behaviour in humans. Nature, 442, 1042-1045.

Poldrack, R., & Packard, M. (2003). Competition among multiple memory systems: converging

evidence from animal and human brain studies. Neuropsychologia, 245-251.

Poldrack, R., Clark, J., Paré-Blagoev, E., Shohamy, D., Creso Moyano, J., Myers, C., & Gluck, M.

(2001). Interactive memory systems in the human brain. Nature, 546-550.

Rashotte, M. E., Marshall, B. S., & O'Connell, J. M. (1981). Signaling functions of the second-

order CS: Partial reinforcement during second-order conditioning of the pigeon's

keypeck. Animal learning & Behavior 9, 253-260.

Redgrave, P., Prescott, T., & Gurney, K. (1999). The basal ganglia: a vertebrate solution to the selection problem? Neuroscience, 89, 1009-1023.

Rescorla, R. A. (1976b). Stimulus generalization: Some predictions from a Pavlovian conditioned behavior. Journal of Experimental Psychology: Animal Behavior Processes, 2, 88-96.

Rescorla, R., & Wagner, A. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. In A. Black & W. Prokasy (Eds.), Classical Conditioning II: Current Research and Theory (pp. 64-99). New York: Appleton-Century-Crofts.

Ridderinkhof, K. R., Ullsperger, M., Crone, E. A., & Nieuwenhuis, S. (2004). The role of the

medial frontal cortex in cognitive control. Science, 306, 443-447.

Ritchie, T., & Noble, E. P. (2003). Association of seven polymorphisms of the D2 dopamine

receptor gene with brain receptor binding characteristics. Neurochemical research, 28,

73-82.

Robbins, S. (1998). Organization Behavior. NJ: Prentice Hall.

Romo, R., & Schultz, W. (1990). Dopamine neurons of the monkey midbrain: Contingencies of

responses to active touch during self-initiated arm movements. Journal of

Neurophysiology 63, 592-606.

Rudy, J., Frank, M., & O'Reilly, R. (2003). Transitivity, Flexibility, Conjunctive Representations

and the hippocampus:II. A Computational Analysis . Hippocampus 13, 341-354.


Schneider, W., Eschman, A., & Zuccolotto, A. (2002). E-Prime User's Guide. Pittsburgh:

Psychology Software Tools Inc.

Schönenberg, T., Daw, N., Joel, D., & O'Doherty, J. (2007). Reinforcement Learning Signals in the Human Striatum Distinguish Learners from Nonlearners during Reward-Based Decision Making. The Journal of Neuroscience, 27(47), 12860-12867.

Schultz, W. (1998). Predictive reward signal of dopamine neurons. Journal of Neurophysiology, 80, 1-27.

Schultz, W. (2002). Getting Formal with Dopamine and Reward. Neuron, 241-263.

Schultz, W. (2007). Multiple Dopamine functions at different time courses. Annual review of

neuroscience, 259-288.

Schultz, W., Dayan, P., & Montague, R. (1997). A neural substrate of prediction and reward .

Science 275, 1593-1599.

Shohamy, D., & Adcock, A. (2010). Dopamine and adaptive memory. Trends in Cognitive

Sciences, 464-472.

Simons, J., Howard, J. H., & Howard, D. (2010). Adult Age Differences in Learning From Positive

and Negative Probabilistic Feedback. Neuropsychology, 24, 534-541.

Skinner, B. F. (1938). The Behavior of Organisms: An Experimental Analysis. New York: Appleton-Century.

Skinner, B. (1987). In B. Skinner, Upon further reflection (pp. 105-108). Englewood Cliffs, NJ:

Prentice-Hall.

Skinner, B. F. (1935). Two types of conditioned reflex and a pseudo type. Journal of General

Psychology 12, 66-77.

Squire, L. R. (1992). Declarative and non-declarative memory: multiple brain systems

supporting learning and memory. Journal of Cognitive Neuroscience, 232-243.

Squire, L. R. (2004). Memory systems of the brain: A brief History and current perspective.

Neurobiology of Learning and Memory 82, 171-177.

Squire, L., Knowlton, B., & Musen, G. (1993). The structure and organization of memory.

Annual Reviews of Psychology 44, 453-495.

Squire, L., Stark, C., & Clark, R. (2004). The medial temporal lobe. Annual review neuroscience

27, 279-306.

Sugrue, L. P., Corrado, G. S., & Newsome, W. T. (2005). Choosing the greater of two goods:

neural currencies for valuation and decision making. Nature Review Neuroscience 6,

363-375.

Suri, R., & Schultz, W. (1999). A neural network with dopamine-like reinforcement signal that

learns a spatial delayed response task. Neuroscience 91, 871-890.


Sutton, R. (1988). Learning to Predict by the methods of Temporal Differences. Machine

Learning, 3, 9-44.

Sutton, R. S., & Barto, A. G. (1981a). Toward a modern theory of adaptive networks:

Expectation and prediction. Psychological Review 88, 135-171.

Sutton, R. S., & Barto, A. G. (1990). Time-derivative models of Pavlovian reinforcement. In M. Gabriel & J. Moore (Eds.), Learning and computational neuroscience: Foundations of adaptive networks (pp. 497-537). Cambridge, MA: MIT Press.

Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning: An Introduction. Cambridge (MA):

MIT Press.

Swainson, R., Rogers, R. D., Sahakian, B., Summers, B., Polkey, C., & Robbins, T. W. (2000). Probabilistic learning and reversal deficits in patients with Parkinson's disease or frontal or temporal lobe lesions: possible adverse effects of dopaminergic medication. Neuropsychologia, 38, 596-612.

Thorndike, E. (1905). The elements of psychology. New York: A.G. Seiler.

Thorndike, E. (1911). Animal Intelligence: Experimental Studies. New York: MacMillian.

Tolman, E. L. (1932). Purposive behavior in Animals and Man. New York: Century.

Valentin, V., Dickinson, A., & O'Doherty, J. (2007). Determining the Neural Substrates of Goal-Directed Learning in the Human Brain. The Journal of Neuroscience, 27, 4019-4026.

Van Elzakker, M., O'Reilly, R. C., & Rudy, J. W. (2003). Transitivity, flexibility, conjunctive representations, and the hippocampus. I. An empirical analysis. Hippocampus, 13, 292-298.

Van Opstal, F., Verguts, T., Orban, G., & Fias, W. (2007). A hippocampal-parietal network for

learning an ordered sequence. NeuroImage, 333-341.

White, N. M., & McDonald, R. J. (2002). Multiple parallel memory systems in the brain of the

rat. Neurobiology of Learning & Memory 77, 125-184.

Whitlock, J., et al. (2006). Learning induces long-term potentiation in the hippocampus. Science, 313, 1093-1097.

Wirth, S., et al. (2009). Trial outcome and associative learning signals in the monkey hippocampus. Neuron, 930-940.

Wise, R., Spindler, J., & Legault, L. (1978). Major attenuation of food reward with performance-

sparing doses of pimozide in the rat. Canadian journal of Psychology 32, 77-85.

Woodward, T. S., Bub, D. N., & Hunter, M. A. (2002). Task switching deficits associated with Parkinson's disease reflect depleted attentional resources. Neuropsychologia, 40, 1948-1955.


Yin, H. H., & Knowlton, B. J. (2006). The role of the basal ganglia in habit formation. Nature

Review Neuroscience, 464-476.