
REINFORCEMENT LEARNING BY HEBBIAN SYNAPSES WITH ADAPTIVE THRESHOLDS

C. M. A. PENNARTZ*
Physics of Computation, California Institute of Technology, Pasadena, U.S.A.

Abstract––A central problem in learning theory is how the vertebrate brain processes reinforcing stimuli in order to master complex sensorimotor tasks. This problem belongs to the domain of supervised learning, in which errors in the response of a neural network serve as the basis for modification of synaptic connectivity in the network and thereby train it on a computational task. The model presented here shows how a reinforcing feedback can modify synapses in a neuronal network according to the principles of Hebbian learning. The reinforcing feedback steers synapses towards long-term potentiation or depression by critically influencing the rise in postsynaptic calcium, in accordance with findings on synaptic plasticity in mammalian brain. An important feature of the model is the dependence of modification thresholds on the previous history of reinforcing feedback processed by the network. The learning algorithm trained networks successfully on a task in which a population vector in the motor output was required to match a sensory stimulus vector presented shortly before. In another task, networks were trained to compute coordinate transformations by combining different visual inputs. The model continued to behave well when simplified units were replaced by single-compartment neurons equipped with several conductances and operating in continuous time.

This novel form of reinforcement learning incorporates essential properties of Hebbian synaptic plasticity and thereby shows that supervised learning can be accomplished by a learning rule similar to those used in physiologically plausible models of unsupervised learning. The model can be crudely correlated to the anatomy and electrophysiology of the amygdala, prefrontal and cingulate cortex and has predictive implications for further experiments on synaptic plasticity and learning processes mediated by these areas. © 1997 IBRO. Published by Elsevier Science Ltd.

Key words: amygdala, long-term depression, long-term potentiation, NMDA receptor, prefrontal cortex, reinforcement.

Learning in neural networks can be accomplished by two fundamentally different strategies, namely supervised and unsupervised training. Unsupervised learning is guided by correlations in the inputs to the network. This training strategy has been widely applied to models of, for example, associative memory42 and self-organizing maps.51 From a neurobiological point of view, some methods for unsupervised learning have been relatively successful in that they offer an efficient learning strategy and incorporate biologically plausible principles for synaptic modification. One of the most attractive models in this respect is the Bienenstock–Cooper–Munro (BCM) model for the development of orientation-selective cells in the visual system. The Hebbian learning rule of this model has received considerable support from experiments on long-term potentiation (LTP) and long-term depression (LTD).9,12,21,25,29,49,67

In supervised learning, the output of a network is evaluated with respect to a desired response. This evaluation results in the feedback of error values to the network, which serve to modify the synaptic connections between neurons. Models for supervised learning have generally been met with considerable scepticism in the neurobiological community. This may be largely due to the widespread application of one particular algorithm for supervised learning, i.e. backpropagation of errors.81,94 There are convincing reasons to reject backpropagation as a procedure by which interconnected brain structures may learn.20,71 However, other algorithms have been developed that are closely related to supervised learning. These are captured under the term ‘‘reinforcement learning’’ and do not violate the basic laws of neurophysiology.

In reinforcement learning, a sensory cue is presented to a network, which subsequently gives rise to an output pattern and, as a consequence of this response, a reinforcing feedback from the environment (Fig. 1). In a biological context, this feedback can be interpreted as a rewarding (appetitive) or punishing (aversive) signal emitted by an animal’s environment.

*Present address: Graduate School of Neurosciences Amsterdam, Netherlands Institute for Brain Research, Meibergdreef 33, 1105 AZ Amsterdam, The Netherlands.

Abbreviations: AMPA, α-amino-3-hydroxy-5-methylisoxazole-4-propionic acid; ARP, associative reward–penalty; BCM, Bienenstock, Cooper and Munro; HSAT, Hebbian synapses with adaptive thresholds; LTD, long-term depression; LTP, long-term potentiation; NMDA, N-methyl-D-aspartate; RPM, reward-processing module.

Neuroscience Vol. 81, No. 2, pp. 303–319, 1997. Copyright © 1997 IBRO. Published by Elsevier Science Ltd. Printed in Great Britain. All rights reserved. 0306–4522/97 $17.00+0.00. PII: S0306-4522(97)00118-8

The feedback reflects an evaluation of network performance and consists of only one scalar value. Although many reinforcement learning procedures have been constructed, two classical models deserve mentioning in particular. The associative reward–penalty algorithm (ARP)6b,39,59 has been applied in engineering as well as neurobiology. One application in neuroscience has been to train a neural network in converting retinal and eye position information into head-centred coordinates.59 The feedback signal used in this algorithm is a reward value, reflecting the similarity between actual network output and desired output. Another successful training procedure is temporal difference learning,7,87b which has found its way to neurobiological models of classical conditioning,87c development of sensory maps63 and bee foraging behaviour.23 The key idea behind temporal difference learning is to let a network generate predictions about reward delivery in the future and to use the error in reward prediction as a reinforcing feedback to modify connectivity.

Classical reinforcement learning algorithms for neural networks are characterized by a learning rule which computes synaptic weight changes by multiplication of three variables: values related to presynaptic activity, postsynaptic activity and reinforcing feedback.6b,59,63 The general form of these learning rules is:

Δwij = η r ai aj   (1)

with Δwij the weight change of the synapse from neuron j to neuron i, η a rate constant, r a reward-related quantity, and aj and ai the activity of neurons j and i, respectively. The reward-related quantity r has been hypothesized to correspond to a diffusible neuroactive substance which directs incremental as well as decremental changes in synaptic efficacy.23,24,30,59,63
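To make the structure of eq. 1 concrete, the following minimal Python sketch applies the three-factor update to a toy weight matrix; the variable names and numerical values are illustrative assumptions of mine, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

eta = 0.05                 # rate constant (eta in eq. 1), illustrative value
r = 0.8                    # scalar reward-related quantity
a_pre = rng.random(4)      # presynaptic activities a_j
a_post = rng.random(3)     # postsynaptic activities a_i

# Eq. 1: every synapse j -> i changes by eta * r * a_i * a_j
delta_w = eta * r * np.outer(a_post, a_pre)   # shape (3, 4)
```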

On account of the hypothesis that noradrenaline, acetylcholine and dopamine might act as reinforcement signals, some investigators have qualified classical reinforcement learning as neurobiologically plausible. However, as pointed out elsewhere,71,73 this hypothesis fails to draw essential support from experiments. In brief, evidence is generally lacking for the requirement that acetylcholine, noradrenaline or dopamine, depending on their concentration in the target area, direct weight changes bidirectionally. That the release of these neuromodulators should be tightly correlated to the reward-related quantity figuring in eq. 1 is not generally supported by experiments either, although a subset of mesencephalic dopamine neurons may present an exception to this.61 Finally, behavioural experiments have generally failed to reveal specific effects of the respective receptor antagonists or transmitter depletions on reward-processing or reward-dependent task acquisition. Thus, it would be premature to consider the existing reinforcement learning algorithms to be neurobiologically plausible at the present time.

I therefore sought to establish a novel approach that combines the principle of reinforcing feedback with experimentally validated findings on Hebbian synaptic plasticity. The new algorithm introduced here utilizes a simple reinforcing feedback for synaptic modification and relies on Hebbian synapses with adaptive thresholds (HSAT). Since the model shares a number of experimentally validated assumptions with the BCM model, it may help to bridge the existing gap between unsupervised and supervised learning algorithms.

MODEL DESCRIPTION

Definitions and overview of the model

A conventional feed-forward network was used, i.e. a net consisting of three layers, termed the sensory (or input), hidden and motor (or output) layer. The connections between these three layers are designated here as ‘‘sensorimotor synapses’’. The inclusion of a hidden layer is not obligatory but enables networks to solve a much wider range of tasks than is possible with two layers.39 Inputs to the units in the sensory layer (not drawn in Fig. 1B) are assumed to be non-modifiable, whereas the connections from sensory to hidden layer and from hidden to motor layer are modifiable. The presentation of an input pattern (cue) marks the beginning of each learning trial. After a delay has elapsed, the network is required to produce an output pattern which interacts with the environment and thereby evokes a reward signal feeding back to the reward-processing module (RPM). Apart from this environmental feedback, the RPM also receives modifiable inputs from the sensory units. The output of the RPM is a single scalar, termed ‘‘reward value’’ or ‘‘reinforcement signal’’, and is transmitted to all hidden and motor units by non-modifiable connections. Equations 2–11 are all specified at the level of single learning trials. Training with sets of different cues (‘‘batches’’) was not employed in this study. In the Results section, attention will be paid to the question of whether the model can be made to operate in real time. The ‘‘units’’ referred to throughout the text can be considered equivalent to either single neurons or groups of similarly behaving neurons. Table 1 provides an overview of all symbols used to explain the model. Table 2 summarizes the physiological properties of the various types of synapses in relation to learning.
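As a reading aid, the trial structure just described can be summarized in schematic Python; all object and method names below are hypothetical placeholders of mine, not part of the published model:

```python
def run_trial(net, rpm, env, cue):
    """Schematic HSAT learning trial (all names are placeholders)."""
    sensory = net.activate_sensory(cue)       # cue presentation starts the trial
    r_mean = rpm.recall_mean_reward(sensory)  # RPM recalls the stored mean reward
    output = net.respond(sensory)             # hidden/motor layers form the response
    r = env.evaluate(output)                  # environment returns one scalar reward
    rpm.update(sensory, r)                    # sensory-to-RPM weights track mean reward
    net.update_weights(r, r_mean)             # sensorimotor weights change via r - r_mean
    return r
```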

Before describing how the various types of synapses accomplish learning, it is instructive to recall some general properties of glutamatergic synapses, since these are assumed to constitute the connections between the units. Ionotropic glutamatergic transmission is mediated by two main types of receptor, N-methyl-D-aspartate (NMDA) and non-NMDA receptors. Whereas the non-NMDA (α-amino-3-hydroxy-5-methylisoxazole-4-propionic acid [AMPA]/kainate) receptor channels have a linear current–voltage relationship and depolarize the postsynaptic cell by means of a mixed Na+/K+ current, NMDA receptor channels open at depolarized levels of membrane potential and are permeable to Ca2+.12,40,45,46 NMDA receptors play a dominant role in the induction of changes in synaptic efficacy by regulating the postsynaptic calcium concentration ([Ca2+]i).12,13,21,25,67

It should be noted that synaptic weights were allowed to assume both positive (excitatory) and negative (inhibitory) values in the simulations. To justify the occurrence of negative weights, each connection can be thought of as being composed of two components: a positive, modifiable (glutamatergic) component and a constant negative offset (GABAergic). Although GABAergic circuitry was not explicitly included in the model, the assumption of such a negative offset can be tentatively justified by considering the widespread occurrence of feed-forward inhibition throughout the CNS.2,27,54,74

It is assumed that the various types of synapses in the model differ in their ability to regulate [Ca2+]i or to influence the activation of biochemical mechanisms for synaptic weight changes in other ways. The output from RPM to hidden and motor units is assumed to be able to regulate [Ca2+]i, whereas sensorimotor synapses leave [Ca2+]i unaffected. In addition, a glutamatergic spontaneous input is assumed to be present on each hidden and motor unit. This input introduces a random component into the activity of the postsynaptic cell, which is important to guarantee sufficient variability in the output of the network and thus to explore which responses are optimal given a particular stimulus.6b,39 The spontaneous input is assumed to be able to enhance [Ca2+]i and to affect synaptic weight changes.

The network learns by modifications of the sensorimotor synapses and the synapses connecting sensory units to RPM. In contrast, plasticity is absent in the synapses conveying reward values from RPM to the hidden and motor layer and in those transmitting spontaneous inputs. In the efferents from RPM, plasticity is unnecessary because they are required to inform the sensorimotor network about reward without rescaling or recalibration during learning. As spontaneous inputs merely serve to generate randomness in the firing activity of neurons, they lack plasticity as well.

The main difference with previously proposed reinforcement learning models is that transmission of reward values is not mediated by a diffusible substance, which interacts with pre- and postsynaptic variables in a multiplicative fashion (eq. 1), but directly by glutamatergic synapses. The glutamatergic synapses signalling reward not only depolarize the postsynaptic unit but also elevate [Ca2+]i.

Table 1. Overview of symbols used in the Hebbian synapses with adaptive thresholds model

γ               rate constant for modifying sensory-to-RPM synapses
ai              firing activity of unit i
Aσ              fixed positive value of spontaneous input, if different from 0
β               steepness of the activation function
cmin            minimally allowed difference between θd and θp
Cm              membrane capacitance
[Ca2+]i         postsynaptic calcium concentration in unit i
D               rate constant for depression of sensorimotor synapses
ε               euclidean distance error of the population vector
Ee              reversal potential of excitatory conductances
El              reversal potential of leak conductance
f               constant relating error magnitude to reward value
f([Ca2+]i)      function defining the relationship between [Ca2+]i and synaptic weight change (presynaptic contribution to weight change not included)
fd([Ca2+]i)     function defining the relationship between [Ca2+]i and synaptic depression
fp([Ca2+]i)     function defining the relationship between [Ca2+]i and synaptic potentiation
Gb              excitatory background conductance
Ge              sum of excitatory conductances
Gl              leak conductance
µ               index for input pattern
νi              preferred direction of motor unit i
νpop            population vector representing ensemble output of motor layer
p               probability that spontaneous input σi assumes value Aσ
P               rate constant for potentiation of sensorimotor synapses
r               reward value
r̄µ              mean reward value specific to input pattern µ
σi              spontaneous input to unit i, assuming a value of either 0 or Aσ
t               time
θ               threshold of the activation function
θd              postsynaptic threshold in [Ca2+]i for induction of synaptic depression
θp              postsynaptic threshold in [Ca2+]i for induction of synaptic potentiation
Vm              membrane potential
wij             weight of synapse from unit j to i
wk              weight of synapse from sensory unit k to RPM
xvc (yvc)       x (y) coordinate of the cue vector
xvpop (yvpop)   x (y) coordinate of the population vector

Table 2. Functional properties of synapses in the Hebbian synapses with adaptive thresholds model

Type of synapse                                    Modifiable?   Affecting synaptic weight changes?
Sensorimotor                                       Yes           No
Sensory synapse on RPM                             Yes           Yes
Reinforcement synapse (on hidden or motor unit)    No            Yes
Spontaneous input (on hidden or motor unit)        No            Yes

Refer to Fig. 1B for the positioning of the different types of synapse in the network. Modifiable synapses can be subject to weight changes according to the rules given in eq. 11 (for sensorimotor synapses) or eq. 7 (for sensory synapses on the RPM). While reinforcement synapses on sensorimotor units and spontaneous inputs are not modifiable, they can affect the modification of sensorimotor synapses according to eq. 11.


If a high reward value correlates with a large excitatory spontaneous input to a postsynaptic neuron, [Ca2+]i will be sufficiently elevated to induce LTP in the sensorimotor synapses to that neuron. If, however, spontaneous input does not correlate with high reward, only a small rise in [Ca2+]i ensues and LTD is induced.

Activation of individual units and formation of ensemble output

When a cue is presented to the input layer, forward propagation occurs according to the following activation function:

ai = 1 / (1 + exp(−β(Σj=0…m wij aj + σi − θ)))   (2)

where ai is the firing activity of postsynaptic unit i, wij is the weight of the synapse from unit j to i, m is the total number of units presynaptic to i, β is a constant representing the steepness of the activation function and θ a threshold factor. In addition to the integrated synaptic input from the previous layer, Σ wij aj, a spontaneous component σi may be added in order to generate stochastic variation in the activity of neuron i:

σi = Aσ with probability p, where Aσ > 0   (3)
σi = 0 with probability (1 − p)

In most simulations Aσ assumed a fixed value for all units.
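A minimal Python sketch of eqs 2 and 3 follows; the default parameter values echo the settings reported in the legend of Fig. 3 but are otherwise arbitrary:

```python
import numpy as np

def unit_activity(w, a_pre, sigma, beta=1.0, theta=3.0):
    """Eq. 2: logistic firing activity of unit i from its weights w,
    presynaptic activities a_pre, spontaneous input sigma,
    steepness beta and threshold theta."""
    net_input = np.dot(w, a_pre) + sigma
    return 1.0 / (1.0 + np.exp(-beta * (net_input - theta)))

def spontaneous_input(p=0.3, a_sigma=3.0, rng=None):
    """Eq. 3: sigma_i equals A_sigma with probability p, otherwise 0."""
    if rng is None:
        rng = np.random.default_rng()
    return a_sigma if rng.random() < p else 0.0
```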

In many contemporary models of supervised learning, network output is represented by activity values for individual units coding, for instance, parameters of movement such as force or angular velocity. Although this approach allows a straightforward calculation of errors for the individual units and is convenient for engineering purposes, it is important to confront the issue as to how reinforcing feedback arises within a natural ecological setting. External agents or circumstances that regulate the availability of primary reinforcers interact with the global output of an animal’s motor system, not with the activity of its individual motor units. Therefore, I chose to represent network output at the population level. Specifically, network responses were modelled as a population vector representing the direction of an impending motor action.

In a majority of single cells in several motor-related areas (i.e. superior colliculus, primary motor cortex, pontine and mesencephalic reticular formation33,36,76), firing activity changes in advance of a motor response. In motor cortex neurons, the magnitude of this change is broadly tuned to the movement direction of the upcoming response. The direction corresponding to peak activity is termed ‘‘preferred direction’’ and firing activity decreases as the direction of movement progressively deviates from the preferred direction. In large populations of motor neurons, preferred directions are distributed across the entire continuum of possible movement directions, be it in a two- or three-dimensional space. If the product of a neuron’s change in firing rate (a scalar) and its preferred direction (a vector) is considered as a vectorial contribution or ‘‘vote’’, the sum of the individual vectors constitutes a population vector.15,34–36,75,76 In motor cortex, the population vector points in a direction similar to that of an impending arm movement.34,35 Thus, population vectors are computed as follows:

νpop = Σi=0…N νi ai   (4)

where νpop is the population vector, N the total number of motor units and νi the preferred direction of motor unit i. The activity of unit i, ai (eq. 2), will equal the change in firing activity with respect to pre-stimulus levels if one adjusts the parameters of eq. 2 so as to produce zero background activity.

A conditioning task can be regarded as a problem of sensorimotor mapping. The task is to find an efficient conversion from activity patterns of cue-activated sensory units onto a target set of motor units producing a population vector. For example, a monkey can be instructed to make a saccadic eye movement to the position where a light spot appeared shortly before on a screen (cf. 31). Computationally, this particular instance of sensorimotor mapping is similar to an n-copy problem,1 since the motor system is required to copy the direction and length of a sensory vector in its output. The output is encoded kinematically, i.e. in terms of locations of target movements. The model is not concerned with the intrinsic dynamics of movement generation, e.g., changes in muscle activity affecting joint torques.
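Eq. 4 can be written compactly in Python; the even spacing of preferred directions on the unit circle is an illustrative assumption of mine (the paper only requires that they cover the continuum of directions):

```python
import numpy as np

def population_vector(activities):
    """Eq. 4: nu_pop = sum_i a_i * nu_i, with preferred directions nu_i
    assumed here to be evenly spaced on the unit circle."""
    n = len(activities)
    angles = 2.0 * np.pi * np.arange(n) / n
    preferred = np.column_stack([np.cos(angles), np.sin(angles)])  # nu_i
    return activities @ preferred  # 2-D population vector
```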

Evaluation of motor performance and learning rule

A useful measure for the error in a sensorimotor copying task is the euclidean distance between the coordinates of the population vector and cue vector:

ε = ½ √((xvpop − xvc)² + (yvpop − yvc)²)   (5)

where ε is the euclidean distance error in two dimensions and xvpop, yvpop, xvc and yvc represent the x and y coordinates of the population vector and cue vector, respectively. The cue is presented on one of N positions on a unit circle; the corresponding cue vector is the arrow from the origin to that position on the circle. Furthermore, ε is normalized and assumes values between 0 and 1. With random performance, individual output units become active in an uncoordinated fashion; the resulting population vectors are uncorrelated to the cue vector and are usually of small amplitude, because contributions by output units with opposite preferred directions cancel out in the population vector (eq. 4). Hence, random performance results in a mean error of 0.5, corresponding to a mean population vector of zero amplitude. The reward value is calculated as:

r = 1 − εf   (6)

where r is the reward value and f a constant, usually between 0.3 and 2.0 (cf. 59).
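Eqs 5 and 6 amount to a few lines of Python; the factor 0.5 normalizes the error to the interval [0, 1] for cues on the unit circle, and the default f = 0.83 mirrors the Fig. 3 setting:

```python
import numpy as np

def reward_value(v_pop, v_cue, f=0.83):
    """Eq. 5: normalized euclidean distance error epsilon;
    eq. 6: reward r = 1 - epsilon * f."""
    eps = 0.5 * np.hypot(v_pop[0] - v_cue[0], v_pop[1] - v_cue[1])
    return 1.0 - eps * f
```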

Each sensory unit projects not only to the hidden units but also to the reward-processing module RPM (Fig. 1B). The function of these synaptic connections is to store a running average of reward that is specific for a particular input pattern. Consider a simple case in which an input pattern µ activates only one sensory unit k (with ak = 1; for a general rule, see Appendix). No other input patterns will activate unit k. In that case sensory-to-RPM synapse wk is uniquely associated with input pattern µ. Weight changes in synapse wk are computed according to:

Δwk = γ(r − ak wk) ak   (7a)

where γ is a rate constant. Equation 7a can be considered a Hebbian update rule on account of the following. The presynaptic component of the rule, ak, prescribes that modifications of wk can only occur when unit k is activated upon presentation of µ; otherwise the activity of k is zero and no weight change occurs. Thus, the information stored by wk is specific to input pattern µ.

The postsynaptic component of the rule, (r − ak wk), determines the sign and amount of change in wk. In addition to sensory input ak wk, RPM also receives a reward input r which arrives at some later point in the trial (Fig. 1). In agreement with many other variations of Hebb’s rule, it is assumed that changes in wk are computed with respect to a certain threshold marking the transition from depression to potentiation (e.g., Refs 9,11). For the sensory connections onto RPM, this threshold is set equal to the current level of sensory input ak wk. Potentiation of wk is induced when the incoming reward signal r overshoots the threshold, and depression in the opposite case. By this mechanism, wk will be updated at each trial in such a manner that it will come to reflect a running average of the reward (recall that ak = 1).

Fig. 1. Temporal structure of learning trials and network architecture used in the HSAT algorithm. (A) The learning trial is structured according to operant conditioning. After a sensory cue has been presented to the animal, a delay elapses before the animal generates a response. A reward of a magnitude depending on the response error becomes available to the animal shortly after the response. In practice, the onset of the response is often timed by an auxiliary cue, which was not explicitly included in the model. (B) A small network of three sensory, two hidden and three motor units exemplifies the general structure of a HSAT architecture. The different types of synapses and their modifiability are indicated below the diagram. RPM, reward-processing module.


Because this postsynaptic rule is effectuated only when input pattern µ is presented, the running average is input-specific, i.e. associated with one particular input pattern:

r̄µ = wk   (7b)

where r̄µ is the running average of reward specific to input pattern µ. In practice, input-specificity of reward values turns out to be useful especially in tasks where the network learns some input–output transformations more readily than others.
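The running average stored by a sensory-to-RPM synapse (eqs 7a, 7b) can be sketched as follows; with ak = 1, repeated application yields an exponentially weighted mean of past rewards for that cue:

```python
def update_rpm_synapse(w_k, a_k, r, gamma=0.1):
    """Eq. 7a: Hebbian update of a sensory-to-RPM weight. With a_k = 1,
    w_k comes to track a running average of reward (eq. 7b: r_mean = w_k)."""
    return w_k + gamma * (r - a_k * w_k) * a_k
```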

The reasoning behind the learning rule for the sensorimotor connections is to associate a change in reward magnitude to an activity change due to a spontaneous input to one or several neurons. Changes in reward are conveyed to the synaptic modification mechanisms in sensorimotor layers by, firstly, transmitting the mean reward value r̄µ to the network, which acts to set postsynaptic thresholds for weight changes in an early stage of the learning trial. Secondly, at the end of the trial the novel reward value r is transmitted to the network. The level of potentiation or depression induced in the sensorimotor synapses depends critically on the extent to which r over- or undershoots the mean reward.

Synaptic weight changes in the sensorimotor synapses are computed according to the following Hebbian scheme:

Δwij = aj f([Ca2+]i)   (8)

where aj is the activity of the presynaptic unit and f([Ca2+]i) defines the relationship between [Ca2+]i and synaptic weight change. [Ca2+]i is proportional to the magnitude of summed inputs which are assumed to cause a rise in postsynaptic calcium levels:

[Ca2+]i = C(σi + r)   (9)

where C is a constant that can be dispensed with later on (see Appendix). Let it be noted that r is assumed to raise [Ca2+]i by an excitatory action on the postsynaptic neuron. Although this excitation will generally enhance the firing rate, r was not included in eq. 2 because it arrives at the neuron after a response has been generated and the firing activity in eq. 2 refers to the activity achieved shortly before and during the response. Sensorimotor synapses are assumed not to bias [Ca2+]i. For convenience, f([Ca2+]i) is split into two components representing the dependence of synaptic depression and potentiation on [Ca2+]i:

fd([Ca2+]i) = −D([Ca2+]i − θd),  with fd([Ca2+]i) ≤ 0   (10a)

fp([Ca2+]i) = P([Ca2+]i − θp),  with fp([Ca2+]i) ≥ 0   (10b)

where D and P represent learning rate constants and θd and θp postsynaptic thresholds in [Ca2+]i for induction of synaptic depression and potentiation, respectively (Fig. 2). f([Ca2+]i) is equal to the sum of fd([Ca2+]i) and fp([Ca2+]i) (denoted as fd+p([Ca2+]i) in Fig. 2C). When the thresholds are specified in terms of mean reward value and magnitude of spontaneous input (see Appendix), eq. 8 can be rewritten as a set of two simple learning rules:

if σi = 0:  Δwij = −D(r − r̄µ) aj,  with Δwij ≤ 0   (11a)

if σi = Aσ:  Δwij = (P − D)(r − r̄µ) aj,  with P ≥ D   (11b)

Thus, if a postsynaptic unit is more active than average due to the presence of a spontaneous component Aσ and, furthermore, r > r̄µ, LTP will be induced. In Fig. 2, [Ca2+]i, reflecting the sum of σi and r, is then located to the right of θp so that LTP will be induced. LTD is elicited when σi = Aσ and r < r̄µ. In that case the sum of σi and r is insufficient to raise [Ca2+]i above θp, but still high enough to shift [Ca2+]i into the ‘‘LTD-trough’’ (Fig. 2). If a postsynaptic unit is less active than average due to σi = 0, [Ca2+]i will remain below the threshold for LTP because even high values of r cannot raise [Ca2+]i above this threshold. However, in the absence of spontaneous input LTD can still be induced if r > r̄µ, shifting [Ca2+]i into the LTD-trough. No weight change occurs when σi = 0 and r < r̄µ, because under these conditions the sum of σi and r is not sufficiently high to raise [Ca2+]i above θd (Fig. 2). In summary, the correlation between activity changes due to σi and changes in reward is crucial in determining the sign and amount of weight changes.
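Expressed in Python, the two branches of eq. 11 read as below; the clamp in the first branch implements the constraint Δwij ≤ 0 of eq. 11a, so that σi = 0 together with r < r̄µ produces no weight change, as described above. The default P and D echo the Fig. 3 settings:

```python
def sensorimotor_weight_change(a_pre, sigma, r, r_mean, P=2.0, D=0.25):
    """Eq. 11: weight change of a sensorimotor synapse, given presynaptic
    activity a_pre, spontaneous input sigma (0 or A_sigma), reward r and
    the stored mean reward r_mean for the current cue (requires P >= D)."""
    if sigma == 0.0:
        # eq. 11a: depression only, clamped so that delta_w <= 0
        return min(0.0, -D * (r - r_mean) * a_pre)
    # eq. 11b: LTP when r > r_mean, LTD when r < r_mean
    return (P - D) * (r - r_mean) * a_pre
```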

In a broad sense, the learning rule (eq. 11) can be related to two important physiological principles governing synaptic plasticity, viz. heterosynaptic depression and associative potentiation. When a neuron receives two non-overlapping synaptic inputs, one strongly active and the other one inactive, the strong input may depress synaptic efficacy in the inactive pathway.14,57b In the HSAT model, a similar situation arises when r > r̄µ and σi = 0, leading to LTD in the sensorimotor synapses. Associative LTP is induced when both inputs are strongly activated at the same time, whereas activation of a single input would not be sufficient to enhance synaptic strength.12,14 The corresponding configuration in the model is that r > r̄µ and σi = Aσ, resulting in LTP in the sensorimotor synapses.

Fig. 2. Relationship between the postsynaptic calcium concentration ([Ca2+]i) and synaptic weight change. In (A) and (B) the functions fd([Ca2+]i) and fp([Ca2+]i) describe the relationship between [Ca2+]i and synaptic depression and potentiation separately (see eq. 10). θd and θp represent the postsynaptic thresholds in [Ca2+]i for induction of synaptic depression and potentiation, respectively. In (C) the function fd+p([Ca2+]i) equals the sum of fd([Ca2+]i) and fp([Ca2+]i) and represents the total postsynaptic contribution to the synaptic weight change (if the presynaptic component aj equals 1, this contribution corresponds to the actual weight change; see eq. 8).


It should be borne in mind, however, that in the HSAT model the activity level in the sensorimotor input itself is not decisive for the postsynaptic contribution to LTD or LTP, because this input does not affect [Ca2+]i. It is the magnitude of the spontaneous bias in postsynaptic activity σi, in combination with r, that determines the direction of weight change. This indicates an important difference with conventional rules for heterosynaptic interaction.

The probability p of having σi = Aσ was generally low (0.2–0.4) and D was usually much smaller than P in the simulations described below.

RESULTS

Sensorimotor mapping task

The HSAT algorithm was first tested on the sensorimotor mapping task described above: the population vector generated by the output layer was required to copy the cue vector in both direction and length. Figure 3 shows how network performance improved across trials for networks of different size. As the number of sensory and motor units increased from four to eight in each layer, the learning speed decreased, but the network reliably converged to near-zero error. The examples given in Fig. 3 are representative for 20 different random seeds for both network sizes. It was possible to enhance the learning speed with higher P and D values than used in Fig. 3, but in these cases changes in weight or neural activity were not guaranteed to be distributed across the network. In general, the model was tolerant to parameter changes; correct performance was maintained when most parameters varied between about 40 and 200% of the values used to obtain Fig. 3 (an exception was γ, which usually assumed values between 0.05 and 0.15). Convergence to near-zero error was also achieved when the hidden layer contained half as many units as the sensory or motor layers. Furthermore, the network continued to learn well if the magnitude of spontaneous input was drawn from a gaussian distribution instead of being fixed at Aσ. Learning remained intact with the coefficient of variation of these gaussians increasing up to 3.0.

Comparison to other algorithms

The performance of the algorithm was compared to backpropagation81,94 and ARP.6b,39,59 For backpropagation, I used the version as described in Hertz et al.39 (pp. 115–120) without a momentum term or other modifications. Parameter settings were optimized for each of these algorithms. Table 3 shows the averaged results (n=20) for simulations with networks having four and eight units in the sensory and motor layers. As expected, training by backpropagation proceeded more rapidly due to the availability of an error value for each output unit instead of one error value for the entire network. However, it can also be noted that the HSAT algorithm trained the network significantly faster than the associative reward–penalty algorithm.

Application to larger networks: usefulness of ‘‘patches’’

It has been realized before that learning by reinforcing feedback slows down dramatically as the output space expands.1,6b,39 Reinforcement learning proceeds rapidly when the output space is composed of only two possible responses, e.g., ‘‘go left’’ and ‘‘go right’’ or ‘‘respond’’ and ‘‘do not respond’’. In these binary choice tasks, the absence or presence of reward signals unambiguously whether or not a correct response was made. Since many biological problems are characterized by a high-dimensional output space, it is of interest to examine how well the HSAT algorithm stands up to an expansion of this space.

Training large networks with HSAT may slow down considerably because of the following problem. Since units having a spontaneous input of size Aσ have randomly distributed preferred directions across a full circle, spontaneously enhanced activities of output units will tend to cancel each other out, resulting in a population vector of small amplitude. Lowering p remedies this problem somewhat, but this will make fewer units amenable to large changes in synaptic input per learning trial (eqs 3 and 11), thus slowing down learning.

Fig. 3. Learning curves for a sensorimotor mapping task performed by fully connected networks of two different sizes. The solid line illustrates the drop in population vector error for a network of four sensory, four hidden and four motor units, whereas the dotted line applies to a network with eight units in each layer. The height of each histogram bin corresponds to the error averaged over all trials included in that bin. In the curves shown here and below, plots of errors for individual trials were noisier but followed the same curve without excursions to unusually large errors. Note that the number of cues to be matched in the dotted learning curve was twice as large as in the solid curve. Parameter settings for solid line: P=2.0, D=0.25, p=0.5, θ=3.0, β=1.0, Aσ=3.0, f=0.83, γ=0.1; for dotted line: P=2.75, D=0.25, p=0.2, θ=3.0, β=0.8, Aσ=3.0, f=0.83, γ=0.1.


Therefore, I examined whether the introduction of ‘‘patches’’ may be useful for these larger networks. A patch is defined as a subgroup of contiguous motor units, each of which has a probability p of receiving a spontaneous input Aσ. No units lying outside the patch receive a spontaneous input, so the neural activity in the output layer can be spontaneously enhanced only in a restricted band of units having similar preferred directions. As a result, motor units receiving Aσ will be able to cooperate in producing a robust population vector, which occasionally will point in the correct direction and have the correct amplitude (i.e. copies the cue vector). In each trial the centre of the patch is determined randomly, whereas the size of the patch remains fixed throughout the training period.

Patchwise allocation of spontaneous inputs was applied to networks of 24, 16 and 24 units in the three consecutive layers (Fig. 4). The error in euclidean distance between cue and population vector dropped more rapidly with patchwise allocation of spontaneous input as compared to completely random allocation. While the mean ‘‘steady-state’’ errors were similar for both methods (0.088±0.004 with and 0.086±0.002 without patches after 30,000 trials; n=5), the number of trials needed to reach an error of 0.25 decreased from 17.8±0.1 (×10³) trials without patches to 13.2±0.1 (×10³) trials with patches (a reduction of 26%). These results indicate that patchwise generation of spontaneous biases in activity may to some extent remedy the general problem of slow learning by reinforcement in larger networks.
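A possible implementation of patchwise allocation is sketched below; the wrap-around at the edge of the motor layer is my assumption, chosen so that preferred directions within a patch stay contiguous on the circle:

```python
import numpy as np

def patch_spontaneous_input(n_motor, patch_size, p, a_sigma, rng):
    """Spontaneous input restricted to a randomly centred patch of
    contiguous motor units; units outside the patch receive none."""
    centre = rng.integers(n_motor)
    patch = (centre + np.arange(patch_size) - patch_size // 2) % n_motor
    sigma = np.zeros(n_motor)
    sigma[patch] = np.where(rng.random(patch_size) < p, a_sigma, 0.0)
    return sigma
```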

Forced cooperation between motor units

The sensorimotor task used thus far required mapping of a population vector to discrete points distributed evenly on a circle. The amplitude of the desired vector was constant in this task. A network of 25 sensory, 16 hidden and 24 motor units was subsequently trained to map population vectors to cues projected in a two-dimensional field. This configuration required the network to produce population vectors of variable direction and amplitude. In this task the euclidean distance error of the population vector reliably decreased to 0.084±0.023 in 30,000 trials (n=5). Although this result is not strikingly different from the circular mapping task, it is worth noting that in the former task only one motor unit could suffice for generating a correct population vector. This is because the target points were all removed by one unit length from the origin and a single motor unit generates a vector of the same length if its activity is maximal (eq. 4).

Table 3. Performance of different learning algorithms in training neural networks on a sensorimotor mapping task

                   Network size: number of sensory-hidden-motor units
                   4-4-4                                8-8-8
Algorithm          1000 trials      5000 trials         5000 trials      20,000 trials
Backpropagation    0.022±0.001*     0.009±0.000         0.008±0.000*     0.004±0.000*
ARP                0.277±0.016*     0.011±0.002         0.320±0.016*     0.047±0.010
HSAT               0.163±0.014      0.008±0.003         0.085±0.004      0.033±0.004

The values shown in the table represent population vector errors (±S.E.M.) obtained after the number of learning trials indicated and averaged across 20 simulations for each network configuration. Care was taken to optimize the parameters in each of the algorithms. Asterisks indicate values differing significantly (P<0.0001) from errors obtained with the HSAT algorithm (Student’s t-test; Mann–Whitney U-test gave similar results).

Fig. 4. Usefulness of ‘‘patches’’ in a sensorimotor mapping task for larger networks. The solid line illustrates a typical learning curve for a network of 24 sensory, 16 hidden and 24 motor units with patches that were seven units in size; the dotted line applies to the same configuration without patches. Although the initial and final portions of the curves were similar, the population vector error dropped more rapidly during the mid-phase of training when using patches. Parameters were optimized separately for both series of simulations. See Fig. 3 for further explanations. Parameter settings for solid line: P=1.3, D=0.05, p=0.35, θ=5.0, β=0.45, Aσ=1.3, f=0.33, γ=0.1; for dotted line: P=1.5, D=0.05, p=0.15, θ=5.0, β=0.45, Aσ=1.3, f=0.50, γ=0.1.


In the two-dimensional field task, peripheral targets were removed from the origin by at most three unit lengths, requiring cooperation of at least three motor units to generate the correct population vector. Thus, the HSAT algorithm can force motor units to cooperate in order to produce the required sensorimotor mapping.

Cooperation between motor units was also tested in a network configuration having 16 sensory, 12 hidden and eight motor units. In each trial, a cue was randomly allocated to one of 16 positions lying on a circle. Because there were twice as many sensory as motor units, at least two motor units were required to cooperate in order to produce a population vector having the same direction as a cue vector lying in between the preferred directions of two adjacent motor units. Again, training by the HSAT algorithm forced motor units to cooperate in order to produce a consistently low error in the population vector.

Vector summation task

To assess the wider applicability of the HSAT algorithm, it was used in a task requiring integration of two distinct sensory inputs in order to produce an output. Specifically, the network was required to produce an output equal to the vector sum of two cue vectors. An example illustrating the neurobiological relevance of this task is the coordinate transformation from retinal and eye-position information into craniotopic coordinates, which may involve area 7a of parietal cortex.59 Andersen4 has argued that groups of area 7a units may encode distributed representations of craniotopic coordinates of objects, as their firing behaviour is modulated by variations in both retinal position and angle of gaze. This is only one of many tasks requiring computational processing of multiple inputs in order to produce a common output.

A network consisting of two separate sensory layers, one hidden and one motor layer was designed with full connectivity of both sensory layers to hidden units (Fig. 5A). Vectors representing retinal positions of cues were used as input to an R-layer, while eye position information was processed by an E-layer. The output layer was required to produce a population vector having the same direction as the vector sum of both inputs. A population vector equal to or larger than unitary length was accepted as correct unless the retinal and eye input vector pointed in opposite directions, a case which required a population vector of zero amplitude. While Fig. 5B reveals a rather low learning speed for this task, it should be noted that the network reliably converged to an average final error of 0.057±0.001 within 20,000 trials and in 100 simulations. Furthermore, the network was not trapped in local minima far from optimum, as indicated by a maximal error of 0.08.

Model implementation with single-compartment neurons operating in continuous time


Fig. 5. Performance of the HSAT algorithm in a vector summation task. Specifically, the task was to compute a coordinate transformation from retinal and eye position information to craniotopic coordinates. (A) Network architecture consisting of a sensory layer encoding the position of a visual cue on the retina (R-layer), a sensory layer transmitting eye position (E-layer), a hidden and output layer and the reward-processing module. Although the layers were fully connected, only the connections to the outermost units in each layer are drawn here. (B) The learning curve was typically characterized by a rapid decrease in population vector error followed by a much slower improvement. The height of each histogram bin corresponds to the error averaged over all trials included in that bin. Parameter settings: P=2.5, D=0.125, p=0.25, θ=5.0, β=0.35, Aσ=1.0, f=1.18, γ=0.6.


A neurobiologically plausible model of learning is required to be operational not only at the level of discrete learning trials but also in continuous time. The algorithm described above was therefore implemented in a program specifying within-trial changes in membrane potential and firing rates of a single-compartment neuron. Trials were constructed following the temporal sequence of operant conditioning (Fig. 1A). After a cue was first presented, a delay elapsed before the response was initiated. The onset of spontaneous input, which marks the initiation of population vector formation, preceded the response by a short amount of time (Fig. 6). The arrival of reward input was timed shortly after the response. This sequence of events led to the activation of the following synaptic inputs in order of occurrence: (i) long-lasting sensory inputs to hidden units and to the RPM; (ii) inputs from the hidden units to the motor units and transient inputs from the RPM to hidden and motor units representing the mean reward value for the presented input pattern; (iii) spontaneous input; (iv) input from the RPM to hidden and motor units representing the reward value for the current motor response. At the cellular level, each of these inputs was modelled as a synaptic conductance having a reversal potential of 0 mV, in agreement with findings on AMPA/kainate receptors.40,45,46

These synaptic conductances interacted with an excitatory background conductance Gb having a reversal potential of 0 mV and a leak conductance Gl reversing at −70 mV. Membrane potential changes were calculated according to the following equation:

dVm/dt = (1/Cm)[(Ee − Vm)Ge + (El − Vm)Gl]   (12)

where Vm is the membrane potential, t is time, Cm the membrane capacitance of the neuron, Ge the sum of excitatory conductances (i.e. synaptic inputs and background conductance) and Ee and El the reversal potentials for the excitatory and leak conductances, respectively. The instantaneous firing rate was computed according to eq. 2, except that for each neuron the sum of spontaneous input σi and the integrated synaptic input from the preceding layer (Σ wij aj) was replaced by Vm. The learning rule (eq. 11) remained unchanged, but it should be noted that in the real-time model intracellular calcium concentrations, synaptic weight changes and presynaptic activities were computed on a millisecond time-scale (generally time-steps of 5 ms were used). Consequently, the constants D and P were about two orders of magnitude smaller than in the trial-level model. The thresholds for depression (θd) and potentiation (θp) were defined as above with respect to r̄µ, which changes only slowly across trials (eq. 7).
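A forward-Euler step of eq. 12, consistent with the 5 ms time-step mentioned above, might look like this; the membrane capacitance and unit conventions (mV for potentials, seconds for time) are illustrative assumptions, not values from the paper:

```python
def step_membrane_potential(v_m, g_e, g_l, dt=0.005, c_m=1.0,
                            e_e=0.0, e_l=-70.0):
    """Eq. 12: dVm/dt = (1/Cm)[(Ee - Vm)Ge + (El - Vm)Gl], one Euler step.
    g_e sums all excitatory conductances (synaptic plus background)."""
    dv_dt = ((e_e - v_m) * g_e + (e_l - v_m) * g_l) / c_m
    return v_m + dt * dv_dt
```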

As the spontaneous input arrives earlier at the neuron than the reward signal (Fig. 6) and Aσ may be sufficient to induce LTD in the absence of reward (eq. 11 and Appendix), depression of synaptic weights may occur before an evaluation of motor performance takes place. However, across many trials this small amount of LTD unrelated to reward was predictable and could be compensated for by mildly lowering D. Small networks performing the sensorimotor vector-copying task were able to achieve 98–99% correct performance within 3000–4000 learning trials. This learning rate is similar to that achieved in the trial-level model.

DISCUSSION

The present model intends to demonstrate that it is possible to implement reinforcement learning by linearly additive synapses modified by Hebbian weight-changing mechanisms. In a minimal sense, the model provides an existence proof combined with evidence for the effectiveness and well-behaved performance of the algorithm. As the algorithm has not been proven to be optimal in the sense of learning speed or convergence to minimal error, it may be amenable to further improvement. Within the class of error-driven learning models that have some probability of being neurobiologically relevant, the model presents an alternative to classical reinforcement learning, which relies on the controversial assumption of a diffusible reinforcement signal.71 As such, it brings models for reinforcement learning closer to plausible models of unsupervised learning.9,11

Fig. 6. Temporal sequence of neuronal membrane potential changes in the course of a single learning trial. (A) illustrates how external (B) and internal (C) events cause a cascade of membrane potential changes in a single hidden unit. The sensory cue, shown in (B), elicits a depolarizing response that continues after the actual cue has disappeared (some neurons in high-level sensory areas show such persistent activity during the delay period of learning trials). Propagation of activity from the sensory layer to the RPM and hence onwards to the hidden and motor layers results in an additional, transient depolarization representing the recall of mean previous reward (A,C). The motor response is initiated after the onset of spontaneous input (C). The reward signal (B), arriving slightly later in time, causes an additional depolarization. Although the model does not specify how the onset and termination of the motor response are regulated, it is relatively straightforward to envisage control mechanisms external to this model that may allow or prohibit motor responses during time intervals locked to environmental stimuli.


First I shall discuss the computational performance of the model and then point out possible neurobiological correlates of its components.

Computational performance of the model

Neural networks trained by the HSAT algorithm learned the sensorimotor vector-copying task efficiently, in particular when the sensory and motor layers were of limited size. Output units could be forced to cooperate in producing an appropriate population vector in particular variations on the task, e.g., when targets were distributed across a two-dimensional field. Training by the HSAT algorithm proceeded more slowly than by backpropagation but faster than by the ARP algorithm. One should be cautious in extrapolating these comparative results to other tasks. It is known that training by reinforcement learning slows down greatly with increasing size of the output layer, and in this respect the HSAT algorithm does not present an exception. Three arguments will be given to argue in favour of an undiminished importance of reinforcement learning in general and the HSAT algorithm in particular. Firstly, in addition to processing reward, rule-based decisions and visuospatial short-term memory may be used by an animal to facilitate learning in sensorimotor tasks (cf. Ref. 31). Secondly, when in real-world tasks reinforcement provides the only feedback to modify an animal’s behaviour, the number of response options is often quite limited. As pointed out earlier, the term ‘‘unit’’ may refer to a single neuron or to a group of similarly behaving neurons. In the motor layer, such groups can be considered the functional entities for executing these response options and according to this interpretation their number can be quite limited. Thirdly, even if the output space is extensive, the use of randomly selected patches enables networks to learn faster (Fig. 4). This useful tool awaits further exploration that may lead to reliable convergence to lower errors without loss of improved learning speed.

The HSAT algorithm proved to be able to train networks on a vector summation task, previously applied to the problem of converting retinal and eye position information into craniotopic coordinates by parietal cortex area 7a.4,59 This observation shows that the HSAT algorithm can be applied to more complex problems than vector-copying tasks and is functionally versatile. When modelling learning in cortical regions such as area 7a, it is particularly difficult to maintain a high confidence in the plausibility of three-factor reinforcement learning rules (eq. 1), since, for instance, these areas are only very sparsely innervated by putative reinforcement-signalling dopaminergic fibres.10,13 Reinforcement learning by HSAT provides a different and attractive framework for representing error transmission between cortical and subcortical structures.

In a similar vein, this model may be of use in understanding how glutamatergic inputs to cerebellar Purkinje cells transmit error signals and thereby may guide learning.56

Finally, the HSAT algorithm was shown to function in networks with single-compartment neurons whose activity changes were specified in continuous time. These neurons had a membrane capacitance and resistance, and their analogue firing rates were governed by synaptic interactions obeying Ohm’s law, modified for ionic currents with specific reversal potentials. Since the learning speed was not significantly reduced in this model, the HSAT algorithm does not fail to behave properly when realistic neurobiological constraints are more rigorously applied than in trial-level models.

While these results illustrate the well-behaved performance of the model, little attention has been paid so far to the neurobiological justification of its underlying set of assumptions. Below I briefly indicate putative neurobiological correlates of the network architecture and the learning rule. In almost any respect this architecture should be regarded as a highly simplified model of the reward-processing structures of the vertebrate brain.

Neurobiological correlates of the network architecture

A key element of the model is the glutamatergic output of RPM. By means of NMDA receptor-mediated calcium influx into postsynaptic neurons, the reward value biases synaptic modification of sensorimotor synapses towards potentiation or depression. The motivation to focus on glutamatergic synapses stems from the observation that reinforcement-related unit responses have been found, amongst other areas, in the prefrontal cortex, cingulate cortex and amygdala.58,69,70,80,83,86,89,93 These limbic areas are all thought to contain primarily glutamatergic projection neurons.22,28,60,72,87a,88,95,97

Tracing studies have shown that these brain regions project to a multitude of motor and sensorimotor areas, e.g., the premotor cortex and supplementary motor area,6a,65 the nucleus accumbens and caudate–putamen,3,38,53,90,97 the visceromotor cortex situated in the infralimbic cortex,53,84 the medial agranular cortex involved in oculomotor control90

and several brainstem motor nuclei.22 The model predicts that reinforcement signals emitted by prefrontal/cingulate cortex and amygdala orchestrate learning in these sensorimotor networks by directing weight modifications at the synaptic interfaces between the sensorimotor structures. It is recognized that reinforcement-related responses are found in some other areas, e.g., the ventral tegmental area and substantia nigra pars compacta,61 nucleus accumbens5 and lateral hypothalamus.78 These areas are not included as possible correlates of RPM since they probably use transmitters other than glutamate.

Output signals from RPM may represent instantaneous reward as well as input-specific mean reward received previously. Instantaneous reward signals are generated in the RPM due to excitation by primary appetitive or aversive stimuli. Positively or negatively reinforcing inputs may be relayed to the amygdala, prefrontal and cingulate cortex by way of the midline and intralaminar thalamic nuclei processing visceral sensory information and pain and arousal signals,37,64,90 the olfactory bulb and piriform cortex relaying olfactory information3 and the insular cortex processing gustatory inputs.52,96 Signals representing input-specific mean reward are generated in the RPM when units in the sensory layer of the model excite RPM neurons (Fig. 1B). Owing to weight modifications of these connections (eq. 7), a running average of the reward value pertaining to a particular cue is stored and used for threshold-setting in the learning rule for sensorimotor synapses. Anatomically, many non-reward-related sensory areas are known to project to the amygdala and prefrontal/cingulate cortex. Highly processed visual information reaches these areas by way of area TE/inferotemporal cortex, area 46/dorsolateral prefrontal cortex and parietal cortex. Auditory information reaches the amygdala directly from the temporal cortex and via the medial geniculate nucleus. Parts of the insular cortex are likely to convey somatosensory information to the amygdala and orbitofrontal cortex.3,17,65,79,84,90

Tentative support for the storage of mean reward values of particular sensory cues can be derived from unit recording studies in the amygdala and prefrontal/cingulate cortex. Nishijo et al.70 manipulated the rewarding quality of visual stimuli experimentally and found a fast modification of amygdala unit responses to these stimuli following changes in their appetitiveness. These results are in agreement with several lesion studies on the amygdala demonstrating profound deficits in learning to associate distinct stimuli with reward.32,47,84 In primate prefrontal neurons, Watanabe93 recorded anticipatory firing activity that was specific to a primary reinforcer (such as a visible food item) or to an initially neutral stimulus associated with food. Because the anticipatory firing activity is thought to result from processing and storage of previous reward experiences, such responses come very close to the type of information processed by RPM in the current model.

Neurobiological correlates of the learning rule

When considering putative physiological correlates of the learning rule for sensorimotor synapses (eqs 8–11 and Appendix), it is noted first that the bidirectional dependence of synaptic plasticity on [Ca2+]i has been corroborated by studies in both neocortex and hippocampus12,21,25,48,67 (see, however, Ref. 68). It is important to note that this bidirectional dependence on [Ca2+]i can be used for both supervised learning, as demonstrated here, and unsupervised learning tasks such as the development of stimulus selectivity (BCM model11). Evidence has been raised suggesting that similar modification principles may hold, at least partially, in brain structures implicated by the present model. Results of Hirsch and Crepel41 strongly suggest that moderate calcium entry through NMDA receptor channels leads to LTD in prefrontal cortical neurons, whereas strong calcium influx induces LTP. NMDA receptor-dependent LTP has been demonstrated in the basolateral complex of the amygdala (for review see Ref. 22), nucleus accumbens,72 cingulate cortex82 and motor-related cortical structures.16,44 Furthermore, plasticity in a pathway relaying auditory inputs to the amygdala was found by Rogan and LeDoux77 during conditioning of fear responses to acoustic stimuli. Pharmacological blockade of NMDA receptors in the basolateral amygdaloid nucleus caused an impairment in the acquisition but not expression of conditioned fear.62

For comparing the current model and the BCM learning rule in more detail, it is useful to recall the essential principles of the latter model. Bienenstock et al.11 described how visual patterns entering the visual cortex via modifiable afferents may compete for synaptic strengthening and thereby shape stimulus (e.g., orientation) selectivity. An important novel feature of this model was that Hebbian weight changes are not only governed by instantaneous pre- and postsynaptic activities but also by a time-averaged value of the postsynaptic activity, providing a postsynaptic modification threshold. Essentially, a set of temporally coherent synaptic inputs would excite the postsynaptic neuron above its average level of activity and thus above its threshold for synaptic strengthening. As a consequence, the postsynaptic response to that set of inputs would be enhanced, thereby improving the neuron's selectivity, whereas weights transmitting uncorrelated inputs would be depressed. The general BCM learning rule can be expressed as follows:

\Delta w_{ij}(t) = f(a_i(t))\, a_j(t) - \varepsilon\, w_{ij}(t)    (13a)

where f(a_i(t)) is a function of the postsynaptic activity a_i(t) at time t, a_j(t) is the presynaptic activity and ε is a decay constant. The function f(a_i(t)) is similar to the function in eq. 8 and illustrated in Fig. 2, albeit that the constraints are different. Specifically, when the postsynaptic activity a_i(t)=0, then f(a_i(t)) must be zero. Furthermore, the depression-potentiation crossover point or modification threshold Θ_M is defined as a nonlinear function of the average postsynaptic activity ā_i. One example of a threshold function satisfying the latter constraint is:

\Theta_M(\bar a_i) = (\bar a_i / a_0)^p\, \bar a_i    (13b)

where a_0 is a positive constant and p is a power constant (p>1). When these demands are met, stimulus selectivity develops under a broad range of conditions in the sensory environment, with good noise tolerance.
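For concreteness, a minimal sketch of one BCM update step follows. It assumes the commonly used quadratic form f(a_i) = a_i(a_i − Θ_M), which satisfies the constraint f(0)=0; the formulation above leaves the exact shape of f open, so this choice, like the parameter values, is illustrative.

import numpy as np

def bcm_update(w, a_pre, a_post, a_bar, a0=1.0, p=2.0, lr=0.01, eps=1e-4):
    """One BCM step for the weights w onto a single postsynaptic unit.

    a_pre : presynaptic activity vector a_j(t)
    a_post: postsynaptic activity a_i(t)
    a_bar : running average of postsynaptic activity
    """
    theta_M = (a_bar / a0) ** p * a_bar      # sliding modification threshold, eq. 13b
    f = a_post * (a_post - theta_M)          # assumed quadratic shape of f(a_i(t))
    return w + lr * f * a_pre - eps * w      # Hebbian term minus weight decay, eq. 13a

w = np.zeros(4)                              # weights from four presynaptic units
a_pre = np.array([1.0, 0.5, 0.0, 0.2])
w = bcm_update(w, a_pre, a_post=0.8, a_bar=0.4)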

When comparing the BCM learning rule with the current model, the following differences can be noted. Whereas the modification threshold in the BCM learning rule depends on the average postsynaptic activity, the present model assumes a dependence of the threshold on an average input-specific reward value. The latter adaptation process may be implemented physiologically as a priming effect of cue-evoked firing activity in RPM neurons on the postsynaptic modification threshold in sensorimotor neurons. Electrophysiological evidence for such priming effects and related forms of use-dependent adaptation has been raised by, e.g., Huang et al.,43 Christie and Abraham,19 Selig et al.85 and Kirkwood et al.49 Additional studies, reporting regulation of NMDA receptor activity by relatively fast-acting kinases50,92 and phosphatases,55,91,92 exist to support this suggestion.

A second difference with the BCM model is that some types of synapse affect weight changes whereas others do not (eq. 9). One as yet uncorroborated configuration allowing such a differential influence is that sensorimotor synapses would be unable to cause significant changes in [Ca2+]i owing to a lack of functional NMDA receptors. Experimental evidence for postsynaptic elements devoid of NMDA receptors has been raised for instance by Christenson and Grillner18 in primary afferent neurons in the lamprey spinal cord. Furthermore, in the lateral and basal amygdala NMDA receptors were reported to be preferentially located on spines, whereas AMPA receptor subunits were abundant on dendritic shafts.26 This study suggests that glutamatergic afferents to postsynaptic target cells in some brain areas may preferentially activate AMPA or NMDA receptors, but not both. Alternatively, sensorimotor synapses might be unable to activate another second messenger system besides [Ca2+]i that would be necessary for persistent modification, e.g., cGMP98 or a metabotropic glutamate receptor-activated cascade.8

CONCLUSIONS

The HSAT algorithm presented here may be used in modelling learning by reinforcement as an alternative to classical three-factor reinforcement learning rules. Multi-layer neural networks can be trained on different computational tasks using this algorithm, and the algorithm remains functional in more realistic simulations with single-compartment neurons. While several features of the model remain to be tested experimentally, both the learning rule and components of the network architecture can be crudely correlated to physiological principles underlying LTP/LTD induction and to the connectivity and electrophysiology of several limbic structures. Importantly, the model indicates that both supervised and unsupervised learning can be accomplished by algorithms sharing essential properties common to Hebbian learning.

Acknowledgements—I am grateful for the stimulating advice of John J. Hopfield during this research project and thank Arjen van Ooyen, Fernando Lopes da Silva and Carlos Brody for their helpful comments on the manuscript. This project was supported by a Talent Fellowship of the Netherlands Organization for Scientific Research.

REFERENCES

1. Ackley D. H. and Littman M. L. (1990) Generalization and scaling in reinforcement learning. In Advances in Neural Information Processing Systems 2, pp. 550–557. Kaufmann, San Mateo, CA.
2. Alger B. E. and Nicoll R. A. (1982) Feed-forward dendritic inhibition in rat hippocampal pyramidal cells studied in vitro. J. Physiol. 328, 105–123.
3. Amaral D. G., Price J. L., Pitkanen A. and Carmichael S. T. (1992) Anatomical organization of the primate amygdaloid complex. In The Amygdala: Neurobiological Aspects of Emotion, Memory, and Mental Dysfunction (ed. Aggleton J. P.), pp. 1–66. Wiley–Liss, London.
4. Andersen R. A. (1995) Encoding of intention and spatial location in the posterior parietal cortex. Cerebr. Cort. 5, 457–469.
5. Apicella P., Ljungberg T., Scarnati E. and Schultz W. (1991) Responses to reward in monkey dorsal and ventral striatum. Expl Brain Res. 85, 491–500.
6a. Avendano C., Price J. L. and Amaral D. G. (1983) Evidence for an amygdaloid projection to premotor cortex but not to motor cortex in the monkey. Brain Res. 264, 111–117.
6b. Barto A. G. and Jordan M. I. (1987) Gradient following without back-propagation in layered networks. In IEEE First Int. Conf. Neural Networks, Vol. II, pp. 629–636.
7. Barto A. G., Sutton R. S. and Anderson C. W. (1983) Neuron-like adaptive elements that can solve difficult learning control problems. IEEE Trans. Syst. Man Cybern. 13, 834–846.
8. Bashir Z. I., Bortolotto Z. A., Davies C. H., Berretta N., Irving A. J., Seal A. J., Henley J. M., Jane D. E., Watkins J. C. and Collingridge G. L. (1993) Induction of LTP in the hippocampus needs synaptic activation of glutamate metabotropic receptors. Nature 363, 347–350.
9. Bear M. F., Cooper L. N. and Ebner F. F. (1987) A physiological basis for a theory of synaptic modification. Science 237, 42–48.
10. Berger B., Trottier S., Verney C., Gaspar P. and Alvarez C. (1988) Regional and laminar distribution of the dopamine and serotonin innervation in the macaque cerebral cortex: a radioautographic study. J. comp. Neurol. 273, 99–119.
11. Bienenstock E. L., Cooper L. N. and Munro P. W. (1982) Theory for the development of neuron selectivity: orientation specificity and binocular interaction in visual cortex. J. Neurosci. 2, 32–48.

12. Bliss T. V. P. and Collingridge G. L. (1993) A synaptic model of memory: long-term potentiation in the hippocampus. Nature 361, 31–39.
13. Brown R. M., Crane A. M. and Goldman P. S. (1979) Regional distribution of monoamines in the cerebral cortex and subcortical structures of the rhesus monkey: concentrations and in vivo synthesis rates. Brain Res. 168, 133–150.
14. Brown T. H., Kairiss E. W. and Keenan C. L. (1990) Hebbian synapses: biophysical mechanisms and algorithms. A. Rev. Neurosci. 13, 475–511.
15. Caminiti R., Johnson P. B. and Urbano A. (1990) Making arm movements within different parts of space: dynamic aspects in the primate motor cortex. J. Neurosci. 10, 2039–2058.
16. Castro-Alamancos M. A., Donoghue J. P. and Connors B. W. (1995) Different forms of synaptic plasticity in somatosensory and motor areas of the neocortex. J. Neurosci. 15, 5324–5333.
17. Cavada C. and Goldman-Rakic P. S. (1989) Posterior parietal cortex in rhesus monkey: II. Evidence for segregated corticocortical networks linking sensory and limbic areas with the frontal lobe. J. comp. Neurol. 287, 422–445.
18. Christenson J. and Grillner S. (1991) Primary afferents evoke excitatory amino acid receptor-mediated EPSPs that are modulated by presynaptic GABAb receptors in lamprey. J. Neurophysiol. 66, 2141–2149.
19. Christie B. R. and Abraham W. C. (1992) Priming of associative long-term depression in the dentate gyrus by theta frequency synaptic activity. Neuron 9, 79–84.
20. Crick F. (1989) The recent excitement about neural networks. Nature 337, 129–132.
21. Cummings J. A., Mulkey R. M., Nicoll R. A. and Malenka R. C. (1996) Ca2+ signaling requirements for long-term depression in the hippocampus. Neuron 16, 825–833.
22. Davis M., Rainnie D. and Cassell M. (1994) Neurotransmission in the rat amygdala related to fear and anxiety. Trends Neurosci. 17, 208–214.
23. Dayan P., Montague P. R. and Sejnowski T. J. (1994) Foraging in an uncertain environment using predictive Hebbian learning. In Advances in Neural Information Processing Systems (eds Cowan J. D., Tesauro G. and Alspector J.), Vol. 6. Morgan Kaufmann, San Mateo, CA.
24. Dehaene S. and Changeux J. P. (1991) The Wisconsin card sorting test: theoretical analysis and modeling in a neuronal network. Cerebr. Cort. 1, 62–79.
25. Dudek S. M. and Bear M. F. (1992) Homosynaptic long-term depression in area CA1 of hippocampus and effects of N-methyl-D-aspartate receptor blockade. Proc. natn. Acad. Sci. U.S.A. 89, 4363–4367.
26. Farb C. R., Aoki C. and LeDoux J. E. (1995) Differential localization of NMDA and AMPA receptor subunits in the lateral and basal nuclei of the amygdala: a light and electron microscopic study. J. comp. Neurol. 362, 86–108.
27. Finch D., Tan A. M. and Isokawa-Akesson M. (1988) Feed-forward inhibition of the rat entorhinal cortex and subicular complex. J. Neurosci. 8, 2213–2226.
28. Fonnum F., Storm-Mathisen J. and Divac I. (1981) Biochemical evidence for glutamate as neurotransmitter in corticostriatal and corticothalamic fibres in rat brain. Neuroscience 6, 863–873.
29. Fregnac Y., Shulz D., Thorpe S. and Bienenstock E. (1988) A cellular analogue of visual cortical plasticity. Nature 333, 367–370.
30. Friston K. J., Tononi G., Reeke G. N., Sporns O. and Edelman G. M. (1994) Value-dependent selection in the brain: simulation in a synthetic neural model. Neuroscience 59, 229–243.
31. Funahashi S., Bruce C. J. and Goldman-Rakic P. S. (1991) Neuronal activity related to saccadic eye movements in the monkey's dorsolateral prefrontal cortex. J. Neurophysiol. 65, 1464–1483.
32. Gaffan D. (1992) Amygdala and the memory of reward. In The Amygdala: Neurobiological Aspects of Emotion, Memory and Mental Dysfunction (ed. Aggleton J. P.), pp. 471–483. Wiley–Liss, London.
33. Georgopoulos A. P. (1990) Neural coding of the direction of reaching and a comparison with saccadic eye movements. In Cold Spring Harbor Symposia on Quantitative Biology, Vol. LV, pp. 849–859. Cold Spring Harbor Laboratory Press, Cold Spring Harbor.
34. Georgopoulos A. P. (1994) New concepts in generation of movement. Neuron 13, 257–268.
35. Georgopoulos A. P., Lurito J. T., Petrides M., Schwartz A. B. and Massey J. T. (1989) Mental rotation of the neuronal population vector. Science 243, 234–236.
36. Van Gisbergen J. A. M., van Opstal A. J. and Tax A. A. M. (1987) Collicular ensemble coding of saccades based on vector summation. Neuroscience 21, 541–555.
37. Groenewegen H. J. and Berendse H. W. (1994) The specificity of the ''nonspecific'' midline and intralaminar thalamic nuclei. Trends Neurosci. 17, 52–57.
38. Groenewegen H. J., Berendse H. W., Wolters J. G. and Lohman A. H. M. (1990) The anatomical relationship of the prefrontal cortex with the striatopallidal system, the thalamus and the amygdala: evidence for a parallel organization. In Progress in Brain Research (eds Uylings H. B. M., Van Eden C. G., De Bruin J. P. C., Corner M. A. and Feenstra M. G. P.), Vol. 85, pp. 95–118. Elsevier, Amsterdam.
39. Hertz J., Krogh A. and Palmer R. G. (1991) Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City.
40. Hestrin S., Nicoll R. A., Perkel D. J. and Sah P. (1990) Analysis of excitatory synaptic action in pyramidal cells using whole-cell recording from rat hippocampal slices. J. Physiol. 422, 203–225.
41. Hirsch J. C. and Crepel F. (1992) Postsynaptic calcium is necessary for the induction of LTP and LTD of monosynaptic EPSPs in prefrontal neurons: an in vitro study in the rat. Synapse 10, 173–175.
42. Hopfield J. J. (1984) Neurons with graded response have collective computational properties like those of two-state neurons. Proc. natn. Acad. Sci. U.S.A. 81, 3088–3092.
43. Huang Y.-Y., Colino A., Selig D. K. and Malenka R. C. (1992) The influence of prior synaptic activity on the induction of long-term potentiation. Science 255, 730–733.
44. Iriki A., Pavlides C., Keller A. and Asanuma H. (1989) Long-term potentiation in the motor cortex. Science 245, 1385–1387.
45. Keinanen K., Wisden W., Sommer B., Werner P., Herb A., Verdoorn T. A., Sakmann B. and Seeburg P. H. (1990) A family of AMPA-selective glutamate receptors. Science 249, 556–560.
46. Keller B. U., Konnerth A. and Yaari Y. (1991) Patch clamp analysis of excitatory synaptic currents in granule cells of rat hippocampus. J. Physiol. 435, 275–293.

47. Kesner R. P. and Williams J. M. (1995) Memory for magnitude of reinforcement: dissociation between the amygdala and hippocampus. Neurobiol. Learning Memory 64, 237–244.
48. Kirkwood A., Dudek S. M., Gold J. T., Aizenman C. D. and Bear M. F. (1993) Common forms of synaptic plasticity in the hippocampus and neocortex in vitro. Science 260, 1518–1521.
49. Kirkwood A., Rioult M. G. and Bear M. F. (1996) Experience-dependent modification of synaptic plasticity in visual cortex. Nature 381, 526–528.
50. Kitamura Y., Miyazaki A., Yamanaka Y. and Nomura Y. (1993) Stimulatory effects of protein kinase C and calmodulin kinase II on N-methyl-D-aspartate receptor/channels in the postsynaptic density of rat brain. J. Neurochem. 61, 100–109.
51. Kohonen T. (1989) Self-organization and Associative Memory. Springer, Berlin.
52. Kosar E., Grill H. J. and Norgren R. (1986) Gustatory cortex in the rat. I. Physiological properties and cytoarchitecture. Brain Res. 379, 329–341.
53. Krettek J. E. and Price J. L. (1977) Projections from the amygdaloid complex to the cerebral cortex and thalamus in the rat and cat. J. comp. Neurol. 172, 687–722.
54. Kriegstein A. R. (1987) Synaptic responses of cortical pyramidal neurons to light stimulation in the isolated turtle visual system. J. Neurosci. 7, 2488–2492.
55. Lieberman D. N. and Mody I. (1994) Regulation of NMDA channel function by endogenous Ca2+-dependent phosphatase. Nature 369, 235–239.
56. Lisberger S. G. (1988) The neural basis for learning of simple motor skills. Science 242, 728–735.
57a. Lisman J. (1989) A mechanism for the Hebb and the anti-Hebb processes underlying learning and memory. Proc. natn. Acad. Sci. U.S.A. 86, 9574–9578.
57b. Lynch G. S., Dunwiddie T. and Gribkoff V. (1977) Heterosynaptic depression: a postsynaptic correlate of long-term potentiation. Nature 266, 737–739.
58. Markowitsch H. J. and Pritzel M. (1976) Reward related neurons in cat association cortex. Brain Res. 111, 185–188.
59. Mazzoni P., Andersen R. A. and Jordan M. I. (1991) A more biologically plausible learning rule for neural networks. Proc. natn. Acad. Sci. U.S.A. 88, 4433–4437.
60. McDonald A. J. (1996) Glutamate and aspartate immunoreactive neurons of the rat basolateral amygdala: colocalization of excitatory amino acids and projections to the limbic circuit. J. comp. Neurol. 365, 367–379.
61. Mirenowicz J. and Schultz W. (1996) Preferential activation of midbrain dopamine neurons by appetitive rather than aversive stimuli. Nature 379, 449–451.
62. Miserendino M. J. D., Sananes C. B., Melia K. R. and Davis M. (1990) Blocking of acquisition but not expression of conditioned fear-potentiated startle by NMDA antagonists in the amygdala. Nature 345, 716–718.
63. Montague P. R., Dayan P., Nowlan S. J., Pouget A. and Sejnowski T. J. (1993) Using aperiodic reinforcement for directed self-organization during development. In Advances in Neural Information Processing Systems (eds Hanson S. J., Cowan J. D. and Giles C. L.), Vol. 5. Morgan Kaufmann, San Mateo, CA.
64. Morecraft R. J., Geula C. and Mesulam M.-M. (1992) Cytoarchitecture and neural afferents of orbitofrontal cortex in the brain of the monkey. J. comp. Neurol. 323, 341–358.
65. Morecraft R. J. and Van Hoesen G. W. (1992) Cingulate input to the primary and supplementary motor cortices in the rhesus monkey: evidence for somatotopy in areas 24c and 23c. J. comp. Neurol. 322, 471–489.
66. Mulkey R. M., Herron C. E. and Malenka R. C. (1993) An essential role for protein phosphatases in hippocampal long-term depression. Science 261, 1051–1055.
67. Mulkey R. M. and Malenka R. C. (1992) Mechanisms underlying induction of homosynaptic long-term depression in area CA1 of the hippocampus. Neuron 9, 967–975.
68. Neveu D. and Zucker R. S. (1996) Postsynaptic levels of [Ca2+]i needed to trigger LTD and LTP. Neuron 16, 619–629.
69. Niki H. and Watanabe M. (1979) Prefrontal and cingulate unit activity during timing behavior in the monkey. Brain Res. 171, 213–224.
70. Nishijo H., Ono T. and Nishino H. (1988) Single neuron responses in amygdala of alert monkey during complex sensory stimulation with affective significance. J. Neurosci. 8, 3570–3583.
71. Pennartz C. M. A. (1996) The ascending neuromodulatory systems in learning by reinforcement: comparing computational conjectures with experimental findings. Brain Res. Rev. 21, 219–245.
72. Pennartz C. M. A., Ameerun R. F., Groenewegen H. J. and Lopes da Silva F. H. (1993) Synaptic plasticity in an in vitro slice preparation of the rat nucleus accumbens. Eur. J. Neurosci. 5, 107–117.
73. Pennartz C. M. A., Groenewegen H. J. and Lopes da Silva F. H. (1994) The nucleus accumbens as a complex of functionally distinct neuronal ensembles: an integration of behavioural, electrophysiological and anatomical data. Prog. Neurobiol. 42, 719–761.
74. Pennartz C. M. A. and Kitai S. T. (1991) Hippocampal inputs to identified neurons in an in vitro slice preparation of the rat nucleus accumbens: evidence for feed-forward inhibition. J. Neurosci. 11, 2838–2847.
75. Pitts W. and McCulloch W. S. (1947) How we know universals. The perception of auditory and visual forms. Bull. math. Biophys. 9, 127.
76. Robinson D. A. (1972) Eye movements evoked by collicular stimulation in the alert monkey. Vision Res. 12, 1795.
77. Rogan M. T. and LeDoux J. E. (1995) LTP is accompanied by commensurate enhancement of auditory-evoked responses in a fear conditioning circuit. Neuron 15, 127–136.
78. Rolls E. T., Sanghera M. K. and Roper-Hall A. (1979) The latency of activation of neurones in the lateral hypothalamus and substantia innominata during feeding in the monkey. Brain Res. 164, 121–135.
79. Romanski L. M. and LeDoux J. E. (1993) Information cascade from primary auditory cortex to the amygdala: corticocortical and corticoamygdaloid projections of temporal cortex in the rat. Cerebr. Cort. 3, 515–532.
80. Rosenkilde C. E., Bauer R. H. and Fuster J. M. (1981) Single cell activity in ventral prefrontal cortex of behaving monkeys. Brain Res. 209, 375–394.
81. Rumelhart D. E., Hinton G. E. and Williams R. J. (1986) Learning internal representations by error propagation. In Parallel Distributed Processing (eds Rumelhart D. E., McClelland J. L. and the PDP research group), Vol. 1, chapter 8. MIT Press, Cambridge, MA.

82. Sah P. and Nicoll R. A. (1991) Mechanisms underlying potentiation of synaptic transmission in rat anterior cingulate cortex in vitro. J. Physiol. 433, 615–630.
83. Sanghera M. K., Rolls E. T. and Roper-Hall A. (1979) Visual responses of neurons in the dorsolateral amygdala of the alert monkey. Expl Neurol. 63, 610–626.
84. Sarter M. and Markowitsch H. J. (1985) Involvement of the amygdala in learning and memory: a critical review, with emphasis on anatomical relations. Behav. Neurosci. 99, 342–380.
85. Selig D. K., Hjelmstad G. O., Herron C. E., Nicoll R. A. and Malenka R. C. (1995) Independent mechanisms for long-term depression of AMPA and NMDA responses. Neuron 15, 417–426.
86. Sikes R. W. and Vogt B. A. (1992) Nociceptive neurons in area 24 of rabbit cingulate cortex. J. Neurophysiol. 68, 1720–1732.
87a. Streit P. (1984) Glutamate and aspartate as transmitter candidates for systems of the cerebral cortex. In Cerebral Cortex (eds Peters A. and Jones E. G.), Vol. 2, pp. 119–143. Plenum, New York.
87b. Sutton R. S. (1988) Learning to predict by the methods of temporal differences. Machine Learning 3, 9–44.
87c. Sutton R. S. and Barto A. G. (1987) A temporal-difference model of classical conditioning. In Ninth Annual Conf. Cogn. Society, pp. 355–378. Lawrence Erlbaum Assoc.
88. Takayama K. and Miura M. (1991) Glutamate-immunoreactive neurons of the central amygdaloid nucleus projecting to the subretrofacial nucleus of SHR and WKY rats: a double-labeling study. Neurosci. Lett. 134, 62–66.
89. Thorpe S. J., Rolls E. T. and Maddison S. (1983) The orbitofrontal cortex: neuronal activity in the behaving monkey. Expl Brain Res. 49, 93–115.
90. Vogt B. A. (1985) Cingulate cortex. In Cerebral Cortex (eds Peters A. and Jones E. G.), Vol. 4, pp. 89–149. Plenum, New York.
91. Wang L. Y., Orser B. A., Brautigan D. L. and McDonald J. F. (1994) Regulation of NMDA receptors in cultured hippocampal neurons by protein phosphatases 1 and 2A. Nature 369, 230–232.
92. Wang Y. T. and Salter M. W. (1994) Regulation of NMDA receptors by tyrosine kinases and phosphatases. Nature 369, 233–235.
93. Watanabe M. (1996) Reward expectancy in primate prefrontal neurons. Nature 382, 629–632.
94. Werbos P. (1974) Beyond regression: new tools for prediction and analysis in the behavioral sciences. Ph.D. thesis, Harvard University.
95. White E. L. (1989) Cortical Circuits. Synaptic Organization of the Cerebral Cortex. Structure, Function, and Theory. Birkhauser, Basel.
96. Yasui Y., Breder C. D., Saper C. B. and Cechetto D. F. (1991) Autonomic responses and efferent pathways from the insular cortex in the rat. J. comp. Neurol. 303, 355–374.
97. Yim C. Y. and Mogenson G. J. (1988) Neuromodulatory action of dopamine in the nucleus accumbens: an in vivo intracellular study. Neuroscience 26, 403–415.
98. Zhuo M., Hu Y., Schultz C., Kandel E. R. and Hawkins R. D. (1994) Role of guanylyl cyclase and cGMP-dependent protein kinase in long-term potentiation. Nature 368, 635–639.

(Accepted 10 March 1997)

APPENDIX

Computation of mean input-specific reward

Equation (7a) presents the learning rule for sensory synapses on RPM for cases where an input pattern µ activates only one sensory unit k with a_k=1. According to this rule the weight of these synapses represents a running average of the reward (r̄^µ) specifically pertaining to µ. A general rule for storing averaged and input-specific reward values is:

\bar r^{\mu}_{q+1} = \alpha\, \bar r^{\mu}_{q} + (1-\alpha)\, r_q    (A1)

where r_q is the actual reward in trial q and α is a constant promoting the retention of previously stored reward values (in fact α = 1 − γ; for γ see eq. 7a). According to this general scheme r̄^µ_q is stored in the combination of weights of synapses from activated sensory units to RPM.
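In code, eq. A1 is a standard exponential moving average. The sketch below is a minimal illustration, with variable names chosen for readability rather than taken from the original model.

def update_mean_reward(r_bar, r, alpha=0.9):
    """Exponential moving average of the reward for one input pattern (eq. A1)."""
    return alpha * r_bar + (1.0 - alpha) * r

# Example: the stored value drifts toward the reward delivered for pattern mu.
r_bar = 0.0
for r in [1.0, 1.0, 0.0, 1.0]:      # rewards observed on successive trials
    r_bar = update_mean_reward(r_bar, r)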

Specification of modification thresholds and derivation of learning rule (eq. 11)

The general function describing the influence of postsynaptic calcium concentration on weight changes, f([Ca2+]i), is shown in Fig. 2 and equals the sum of the functions f_d([Ca2+]i) and f_p([Ca2+]i), expressing the dependence of synaptic depression and potentiation on [Ca2+]i, respectively:

f([\mathrm{Ca}^{2+}]_i) = P([\mathrm{Ca}^{2+}]_i - \Theta_p) - D([\mathrm{Ca}^{2+}]_i - \Theta_d)    (A2)

with P([Ca2+]i − Θ_p) ≥ 0 and −D([Ca2+]i − Θ_d) ≤ 0. The aim of this section is to specify the depression and potentiation thresholds Θ_d and Θ_p in such a way that functional, balanced weight changes are induced that reflect the correlation principles explained in the main text. Θ_d and Θ_p can be taken proportional to the mean reward value for an input pattern µ:

\Theta_d = C\, \bar r^{\mu}    (A3a)

\Theta_p = C(\bar r^{\mu} + c)    (A3b)

where c is a constant to be specified below and C is the same auxiliary constant as used in eq. 9. Let it be recalled that r̄^µ is being stored in the weights of sensory-to-RPM synapses. Presentation of pattern µ will lead to excitation of RPM, which is assumed to transmit a transient barrage of inputs to hidden and motor neurons and thus cause threshold resetting in those neurons. When during learning r̄^µ is gradually enhanced, the sensory excitation of RPM, the transient barrage of inputs and the modification thresholds will increase proportionally. Equations 9 and A2–3 can be combined to:

f([\mathrm{Ca}^{2+}]_i) = P(\sigma_i + r - \bar r^{\mu} - c) - D(\sigma_i + r - \bar r^{\mu})    (A4)

in which C has been set to unity and can be left out. The constant c, corresponding to the difference between Θ_d and Θ_p, can be further specified by considering that ideally no weight changes should be induced if, for a given input pattern µ, the actual reward value equals the mean reward value for that pattern acquired over previous trials. Thus, requiring that f([Ca2+]i) be zero when for a given unit r = r̄^µ and σ_i = A_σ, eqs A3–4 give the following result:

\Theta_p = \bar r^{\mu} + A_{\sigma}\, \frac{P-D}{P}    (A5)

Combination of A2–5 leads to the following overall equation:

f([\mathrm{Ca}^{2+}]_i) = P\left[\sigma_i + r - \bar r^{\mu} - A_{\sigma}\, \frac{P-D}{P}\right] - D(\sigma_i + r - \bar r^{\mu})    (A6)

with the neurobiological constraint that the constant A_σ(P−D)/P ≥ c_min. Here c_min represents the minimally allowed difference between Θ_d and Θ_p, corresponding to the difference in Ca2+ affinity of the enzymes mediating depression and potentiation, respectively (cf. Refs 25, 57a, 66). In practice, c_min was fixed at 1 and 0<r<1, so that no potentiation occurred in the absence of spontaneous input. When the presynaptic factor is taken into account (eq. 8) and the cases σ_i=0 and σ_i=A_σ are considered separately, eqs 11a and b are obtained.
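To make the construction concrete, the following sketch implements eqs A2–A5 with C=1 and with P and D treated as linear gains that are rectified below their thresholds. This linear, rectified form is an illustrative assumption consistent with the qualitative shape of Fig. 2, not the exact functions of the original simulations.

def f_calcium(ca, r_mean, P=1.0, D=0.5, A_sigma=1.0):
    """Weight-change function f([Ca2+]_i) with reward-dependent thresholds."""
    c = A_sigma * (P - D) / P          # threshold separation, from eq. A5
    theta_d = r_mean                   # depression threshold, eq. A3a with C=1
    theta_p = r_mean + c               # potentiation threshold, eq. A3b with C=1
    pot = P * max(ca - theta_p, 0.0)   # potentiation component, >= 0
    dep = D * max(ca - theta_d, 0.0)   # depression component
    return pot - dep                   # eq. A2: depression at moderate, potentiation at high calcium

# Example: with calcium proportional to sigma_i + r (eq. 9, C=1), f vanishes
# when the actual reward r matches the stored mean reward and sigma_i = A_sigma.
assert abs(f_calcium(1.0 + 0.5, r_mean=0.5)) < 1e-12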
