
Reward Hierarchical Temporal Memory Model for Memorizing and Computing Reward Prediction Error by Neocortex

Hansol Choi, Jun-Cheol Park, Jae Hyun Lim, Jae Young Jun and Dae-Shik Kim Korea Advanced Institute of Science and Technology

Daejeon, Republic of Korea [email protected]

Abstract—In humans and animals, reward prediction error encoded by dopamine systems is thought to be important in the temporal difference learning class of reinforcement learning (RL). With RL algorithms, many brain models have described the function of dopamine and related areas, including the basal ganglia and frontal cortex. In spite of this importance, how the reward prediction error itself is computed is not well understood, including how current states are assigned to memorized states and how the values of states are memorized. In this paper, we describe a neocortical model for memorizing state space and computing reward prediction error, the 'reward hierarchical temporal memory' (rHTM). In this model, the temporal relationships among events are hierarchically stored. Using this memory, rHTM computes reward prediction errors by associating the memorized sequences with rewards and inhibiting the predicted rewards. In a simulation, our model behaved similarly to dopaminergic neurons. We suggest that our model can provide a hypothetical framework for the interaction between cortex and dopamine neurons.

Keywords—reward; reinforcement learning; reward prediction error; temporal difference; HTM; rHTM; reward-HTM

I. BACKGROUND

A. Reward Prediction Error

Reinforcement learning (RL) has been one of the most influential computational theories in neuroscience [1] since Sutton and Barto proposed and developed it [2]. In the RL systems of the brain, dopaminergic neurons are thought to encode the reward prediction error (RPE) [8–10], which is the core of RL. In particular, Schultz and colleagues [3–5] presented neurophysiological data suggesting that dopaminergic neurons encode the temporal difference error. A type of RL known as temporal difference (TD) learning has been thoroughly studied since. Moreover, many computational models have proposed functional mechanisms of dopamine (DA) in the basal ganglia and related areas based on the TD learning model [6], [7]. Despite its wide acceptance, the TD model has several limitations in explaining the behavior of dopaminergic neurons (as explained below).

The TD error is the difference between the expected value of the previous state, V(S_t), and the value actually delivered: the currently delivered reward r_{t+1} plus the value of the current state V(S_{t+1}) discounted by γ [11]:

δ_t = r_{t+1} + γ·V(S_{t+1}) - V(S_t)    (1)

To compute the TD error, several steps are required. First, the current and previous neural patterns must be assigned to states (S_t and S_{t+1}). If a pattern is similar enough to a previously memorized state, that memorized state must be assigned to the pattern; otherwise, a new state must be assigned and memorized. Second, the values of the previous and current states must be retrieved from memory (V(S_t) and V(S_{t+1})). Third, these values must be combined with the current reward r_{t+1} according to (1).
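For concreteness, the following is a minimal sketch of these three steps in a tabular setting; the cosine-similarity matching, the threshold, and the variable names are illustrative assumptions rather than part of any particular brain model.

```python
import numpy as np

GAMMA = 0.9             # discount factor (assumed value)
MATCH_THRESHOLD = 0.9   # similarity required to reuse a memorized state (assumed)

states = []             # memorized state patterns
values = []             # V(S) for each memorized state

def assign_state(pattern):
    """Step 1: map a neural pattern onto a memorized state, or memorize a new one."""
    for i, s in enumerate(states):
        sim = np.dot(pattern, s) / (np.linalg.norm(pattern) * np.linalg.norm(s))
        if sim >= MATCH_THRESHOLD:
            return i
    states.append(np.asarray(pattern, dtype=float))
    values.append(0.0)  # a newly memorized state starts with zero value
    return len(states) - 1

def td_error(prev_state, cur_state, reward):
    """Steps 2 and 3: retrieve V(S_t) and V(S_t+1) and combine them with r_t+1 as in (1)."""
    return reward + GAMMA * values[cur_state] - values[prev_state]
```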

In most TD learning models of the brain related to dopamine function, the values of states are stored in a numerical value matrix [1], [12]. When a state pattern is delivered, the numerical values are retrieved from the matrix and used, together with the current reward, to compute the TD error. This requires brain systems to store the absolute values of states and to buffer those values for the computation, for which no biological evidence exists. Moreover, when a new reward-predicting event is associated with a reward event in TD learning, the value of the reward-predicting event is updated via the values of the intermediate patterns between the two events (see Fig. 6A for the Pavlovian conditioning case). TD learning predicts that this is done by the TD error of the intermediate states between the two events during learning [13]. In contrast to this prediction, there is evidence that DA responses occur only to the reward at the beginning of learning and only to the reward-predicting signal once learning has saturated; no intermediate dopamine signal is observed [14]. The lack of DA activity in the intermediate states creates what is known as the “distal reward problem”: the brain must determine which state is responsible for the reward when the relevant patterns are no longer present, and the values of those patterns must be updated directly, without the help of intermediate patterns.

Another problem with current dopamine models is the lack of a mechanism for controlling the activity of dopaminergic neurons. Computing the reward prediction error is the core of RL, through which the values of states and the actions taken in them are guided and learned. Despite this importance, only a few models describe how DA activity is controlled with respect to memory. More precisely, it is not well understood how the states that carry reward information and the structure of the relationships among those states are memorized, or how that memory is used to map the current state onto previous knowledge. Moreover, how the states are associated with reward values was not dealt with in previous studies.

This project was supported by the KI brand project, KI, KAIST in 2012 and by the next generation haptic user interface project funded by Kolon Industries in 2012.

In this paper, we suggest that the hierarchical structure of the neocortical memory system is related to the problems above: state assignment, the storage of numerical values, and the distal reward problem. We developed a novel algorithm for computing the RPE, the reward hierarchical temporal memory (rHTM).

Figure 1. Memorizing Pavlovian conditioning with an HTM. (A) Sensory input: CS is followed by US with a constant temporal gap; this sequence is repeated with a random temporal delay. (B) CS and US give rise to the neural correlates cs and us in the brain. The cs state generates the intermediate neural states s1 and s2. These task sequences are connected by diverse uncorrelated patterns (random). (C) The cs-to-us sequence is learned by the HTM. In level 1, the cs-to-us state sequence is represented as a single state (pavlovseq). (D) The structure of the HTM: large boxes denote HTM regions.

B. Hierarchical Temporal Memory

A hierarchical temporal memory (HTM) is a functional model of the neocortex recently developed by Hawkins and colleagues [15], [16]. It builds a spatio-temporal model of the world from sequences of sensory input patterns and uses this model to predict subsequent inputs and infer their causes.

The HTM model consists of a tree-shaped hierarchy of memory regions. Every region shares the same algorithm for building a spatio-temporal model of the input pattern sequences it receives, regardless of its position in the hierarchy. The algorithm is a two-step Bayesian process involving spatial pooling and temporal pooling. When an input pattern is submitted to the HTM, each region assigns Bayesian beliefs to the spatial patterns that the region has previously learned. Each belief is the similarity between a spatial pattern and the input pattern. This process is known as spatial pooling; the spatial patterns input to the region are categorized in this step. Next, from the beliefs of spatial patterns, the beliefs of temporal patterns are updated (temporal pooling). A temporal pattern is a group of spatial patterns that frequently occur together. The main idea in this procedure is that patterns which frequently occur together in time share a common cause. A temporal pattern is a spatio-temporal model of an independent entity projected onto an HTM region. Beliefs about temporal patterns are the current state of a region and are fed forward to the region above.
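A toy sketch of the two pooling steps just described, assuming the spatial prototypes and temporal groups have already been learned; the real HTM algorithm is a richer Bayesian procedure, so this only illustrates how spatial beliefs turn into temporal-group beliefs that are fed forward to the parent region.

```python
import numpy as np

class ToyHTMRegion:
    """Illustrative region: spatial pooling followed by temporal pooling."""

    def __init__(self, spatial_prototypes, temporal_groups):
        # spatial_prototypes: list of previously learned spatial patterns (vectors)
        # temporal_groups: list of lists of prototype indices that tend to
        #                  occur together in time (e.g. a cs-s1-s2-us sequence)
        self.prototypes = [np.asarray(p, dtype=float) for p in spatial_prototypes]
        self.groups = temporal_groups

    def spatial_pooling(self, x):
        """Belief over learned spatial patterns, based on similarity to the input."""
        x = np.asarray(x, dtype=float)
        sims = np.array([np.exp(-np.sum((x - p) ** 2)) for p in self.prototypes])
        return sims / sims.sum()

    def temporal_pooling(self, spatial_beliefs):
        """Belief over temporal groups: summed belief of each group's members."""
        group_beliefs = np.array([sum(spatial_beliefs[i] for i in g) for g in self.groups])
        return group_beliefs / group_beliefs.sum()

    def feed_forward(self, x):
        """The output passed to the parent region is the temporal-group belief."""
        return self.temporal_pooling(self.spatial_pooling(x))
```

In a hierarchy, the feed_forward output of one region becomes the input pattern of the region above it, which is how a slow higher-level state such as pavlovseq can sit on top of a fast lower-level sequence.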

In this hierarchical structure, a higher region forms a more stable and more abstract model of the input patterns, with a larger spatio-temporal receptive field. Temporally, a stable state in a higher region can unfold into fast changes in the lower levels of the hierarchy, and the unfolded information predicts the next state of the lower regions. As a result, an HTM can learn the structure of sequences, infer the causes of the sequences, and recall them through the hierarchy [15].

The rHTM exploits this spatial and temporal pattern-recognition capability of HTM to memorize the structure of event sequences, and it uses this memory to assign the current input pattern to memorized states and to compute the reward prediction error. In an HTM, the input pattern activates several states in multiple regions of the hierarchy. By assigning a delivered reward to the currently active states, the reward becomes associated with the sequences that are the causes or contexts of the current event. In this paper, we suggest that the rHTM can solve the distal reward problem and produce activity resembling DA activity. In the following sections, we explain the algorithm in detail with Pavlovian conditioning and instrumental learning tasks.

Figure 2. The activity of the rHTM in a Pavlovian conditioning task before learning rewards. Each gray cell shows an active pattern over time (y-axis). CS and US mark the times at which the input patterns are given. US is associated with a reward (cross); as it is unpredictable, an RPE is given when us is activated (white cross under RPE). Here, the level 1 pattern pavlovseq is active when the RPE is given. This RPE associates pavlovseq with the reward.

Figure 3. The activity of the rHTM after learning the rewards. Here, pavlovseq is associated with a reward. In a trial, when the current pattern is CS, cs and pavlovseq are activated; as pavlovseq is associated with the reward and the reward is not predicted, an RPE is given (RPE on the right). When the current state becomes US, the rHTM can predict us from the preceding state s2; s2 inhibits the reward prediction error function. As a result, no RPE is produced.

II. ALGORITHM OF REWARD HIERARCHICAL TEMPORAL MEMORY

A. Computing the RPE for a Pavlovian Conditioning Task

The main algorithm of the rHTM consists of three mechanisms: sequence structure learning, reward association, and reward prediction. We explain the main algorithm of the rHTM in terms of Pavlovian conditioning [17], one of the most important tasks in reward learning. In Pavlovian conditioning, conditioned stimuli (CS) are repeatedly followed by unconditioned stimuli (US) with a constant temporal delay (Fig. 1A). CSs are biologically meaningless stimuli (e.g., a bell sound), and USs are biologically relevant rewarding stimuli (e.g., sugar water). Trials are repeated with a random delay between them. At the beginning of learning, dopamine is activated only when a US is given. As learning proceeds and the USs become predictable, the DA activity for the US diminishes; instead, DA becomes active for the CS. After learning, DA fires only for the CS, which remains unexpected because trials occur after random delays. DA neurons do not fire for the delivery of USs fully predicted by CSs. Omitting a US generates a negative TD error, as the delivered value is smaller than the expected value.

1) Inferring the Causes or Context States of the Input Sequence of Events in Pavlovian Conditioning

We hypothesized that CSs and USs produce the corresponding neural patterns cs and us in the neocortex. We also hypothesized that the neural correlate of CS (cs in Fig. 1B) produces stereotypical intermediate neural patterns (s1 and s2 in Fig. 1B). The cs, s1, s2, us sequence is repeated across Pavlovian trials. These sequences are input to the base region of the rHTM (level 0 in Fig. 1B). This region groups the four spatial patterns into a temporal pattern. The temporal pattern is input to a higher region (Fig. 1C, D; pavlovseq is the temporal pattern and level 1 is the higher region). In level 1, pavlovseq is active while a CS-to-US sequence is processed.

The activity of the HTM during a CS-to-US sequence is summarized in Fig. 2. As the input pattern changes from noisy random patterns (random1 in Fig. 2) to cs, pavlovseq becomes active in level 1. The activity of level 1 feeds back a prediction of the next step to level 0. Pavlovseq is deactivated at the end of the learned sequence.

2) Associating Rewards to Active States when a Reward is Given

USs are the primary reward stimuli (Fig. 2, white cross in the table). Without prior knowledge, an RPE is given when the state transits to us (Fig. 2, white cross on the right). When the state moves to us and an RPE is given by the activation of us, pavlovseq is active at level 1. This means that a reward is given during the activation of pavlovseq. To memorize this, we associate pavlovseq with the reward (Fig. 3, white cross at level 1). In the rHTM, the states that are active when a positive RPE is given update their association with the reward by formula (2). Fig. 3 shows the behavior of the rHTM after learning the reward. The delivery of a CS activates pavlovseq, which is associated with a reward. This activation of pavlovseq produces a reward value that is not predicted and therefore gives an RPE (Fig. 3, RPE on the right).

reward_t(S) = reward_{t-1}(S) + γ·RPE·active(S)

active(S) = 1, if pattern S is active
          = 0, if pattern S is inactive    (2)

3) Inhibition of the Predicted Reward Signal

The us (Fig. 2) always follows s2 in the level 0 sequence. In the rHTM, we can use this information for predicting a reward (Fig. 3, black arrow from s2 to us). The deactivation of a state that precedes a reward-associated state inhibits the RPE. The inhibition function of a state is updated with the RPE by formula (3). In our example, the deactivation of s2 inhibits the RPE, as it predicts the reward that is about to arrive with us. With this inhibition, the reward delivered by us is canceled and no RPE is given (black arrow in Fig. 3).

prediction(S) = prediction(S) - α·(reward(S) - prediction(S))    (3)
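To make the interplay of (2) and (3) concrete, below is a minimal sketch of one rHTM time step. The set-based bookkeeping, the learning-rate values, and two interpretive choices are ours rather than the authors': the prediction of a deactivated state is trained toward the reward value it precedes, and rule (2) is applied only when the positive RPE coincides with a primary reward (as in the Pavlovian example above), since otherwise the association of an unpredictable CS would grow without bound.

```python
from collections import defaultdict

GAMMA = 0.05   # learning rate for the reward association in (2) (assumed value)
ALPHA = 0.2    # learning rate for the prediction in (3); ALPHA > GAMMA keeps updates stable

reward_assoc = defaultdict(float)   # reward(S): reward associated with pattern S
prediction = defaultdict(float)     # prediction(S): inhibition exerted when S deactivates

def rpe_step(active, prev_active, primary_reward):
    """One time step: compute the RPE from the change in active states, then learn.

    active / prev_active -- sets of state labels active at this and the previous step.
    """
    activated = active - prev_active      # states that switched on at this step
    deactivated = prev_active - active    # states that switched off at this step

    # Reward value arriving now: the primary reward plus the reward associated
    # with states that have just become active (e.g. pavlovseq at CS onset).
    delivered = primary_reward + sum(reward_assoc[s] for s in activated)
    # Inhibition from states whose deactivation predicts the incoming reward (e.g. s2).
    inhibition = sum(prediction[s] for s in deactivated)
    rpe = delivered - inhibition

    # (2): states active while a (primary-reward-driven) positive RPE occurs
    # strengthen their reward association.
    if rpe > 0 and primary_reward > 0:
        for s in active:
            reward_assoc[s] += GAMMA * rpe
    # Our reading of (3): a deactivated state moves its prediction toward the
    # reward value it precedes, so a fully trained predecessor cancels the RPE.
    for s in deactivated:
        prediction[s] += ALPHA * (delivered - prediction[s])

    return rpe
```

For example, rpe_step({'cs', 'pavlovseq'}, {'random1'}, 0.0) would be a CS-onset step with no primary reward.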

Figure 4. Go-Nogo instrumental conditioning. (A) The beginning of the task (Cue) is followed by an action target (GoCue). Animals follow a policy of selecting one of two options (p = 0.5 for Go and NoGo). After some delay (Wait), a reward or punishment is given. (B) The sensory signals and the actions give rise to neural correlates, which are linked by intermediate sequences. (C) The common intermediate patterns w1 and w2 can be split into separate patterns according to the context of the sequences. (D) The level 1 region learns the split patterns and finds the temporal sequences of the patterns, which are the inputs to level 2. (E) As a higher region recognizes sequences of less correlated patterns, the level 2 region learns the entire task as a single pattern.


Figure 5 shows the activity of the rHTM in instrumental conditioning. The rHTM learns the pattern sequences and then associates states with rewards. In temporal difference learning without a hierarchy, the beginning of the task and 'Go' would produce a partial reward prediction error, and no reward prediction error for the reward signal itself. The rHTM generates the same RPE signals through a partial prediction of a reward-associated pattern and a partial pairing of a reward with patterns (right).

B. Computing the RPE for Instrumental Conditioning

Next, we explain the behavior of the RPE in a more complex setting, instrumental conditioning [18]. In contrast to Pavlovian conditioning, instrumental conditioning has probabilistic transitions between states that arise from the probabilistic policy of the subject. In instrumental conditioning, a reward-neutral action of an animal is associated with a reward; as the animal learns this contingency, it comes to repeat the reward-giving action. Here, a subset of instrumental conditioning, a go-nogo discrimination task, is used. After a cue for the beginning of a task, one of two task cues is presented to the subject (Fig. 4A; GoCue). The subject has to choose one of two actions (Fig. 4A; Go or NoGo). After some delay (Fig. 4A; Wait), a reward is given for only one choice (Fig. 4A; Rew); no reward is delivered for the other choice (Fig. 4A; Pur). For simplicity, we use a fixed policy that chooses each action with probability 0.5.

As in the Pavlovian task, we hypothesized that each behavioral event (Fig. 4A) produces a neural correlate and that these are followed by intermediate neural patterns (Fig. 4B). One problem is that some intermediate states may be indistinguishable from each other (in Fig. 4B, both go and nogo are followed by w1). Hawkins et al. solved this problem with states that are distinguished depending on the context. This method diversifies a common state into different states based on the previous states (Fig. 4C shows w1 diversified into g1 and n1 based on the context of go and nogo).
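A sketch of this context-splitting idea, using only the immediately preceding state as context; the real HTM mechanism is a variable-order sequence memory that can carry context through longer chains, so the one-step version below (with hypothetical names such as 'w1|go') is only meant to show how a shared state is diversified.

```python
def split_by_context(sequences):
    """Rename states that follow more than one distinct predecessor, so that a
    shared intermediate state (e.g. 'w1' after either 'go' or 'nogo') becomes
    context-specific (e.g. 'w1|go' and 'w1|nogo')."""
    predecessors = {}                      # state -> set of states seen before it
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            predecessors.setdefault(cur, set()).add(prev)

    split = []
    for seq in sequences:
        new_seq = [seq[0]]
        for prev, cur in zip(seq, seq[1:]):
            if len(predecessors[cur]) > 1:     # ambiguous state: split by context
                new_seq.append(f"{cur}|{prev}")
            else:
                new_seq.append(cur)
        split.append(new_seq)
    return split

# The shared 'w1' splits into two context-specific states:
print(split_by_context([["cue", "go", "w1", "rew"],
                        ["cue", "nogo", "w1", "pur"]]))
# [['cue', 'go', 'w1|go', 'rew'], ['cue', 'nogo', 'w1|nogo', 'pur']]
```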

Highly correlated sequences of states are recognized at level 1 (Fig. 4D; cueseq, goseq, and nogoseq). The states in level 1 are grouped in a higher region (Fig. 4E; gonogotask). As the policy over the choice is fixed at 0.5, we can summarize the behavior of this rHTM as shown in Fig. 5. Moreover, gonogotask is only partially associated with a reward, as no reward arrives when NoGo is selected. As no state predicts a reward for gonogotask, a partial RPE is given when Cue is given. Also, goseq is fully associated with a reward but is partially inhibited by the end of cueseq, so that only half of its reward value appears as an RPE. In addition, rew delivers a reward and is inhibited by g2. As a result, after learning the rHTM produces a partial RPE for Cue and Go and a partial negative RPE for a NoGo selection.

III. SIMULATION

A. Method

1) Pavlovian Conditioning

We trained an rHTM to compute the reward prediction error in Pavlovian conditioning. A training trial consisted of a cs state, followed by 18 mutually distinguishable intermediate states, and then a us accompanied by a reward signal. Trials were separated by a random-length run of noise states (0 to 100 time steps, drawn from a uniform distribution). Each state was represented as a unique text pattern.
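A sketch of how such a training stream could be generated, following the description above (cs, 18 distinguishable intermediate states, us carrying the reward, then 0–100 uniformly drawn noise steps); the labels and generator interface are illustrative assumptions.

```python
import random

def pavlovian_stream(n_trials, n_intermediate=18, max_noise=100, seed=0):
    """Yield (state_label, reward) pairs: repeated cs -> s1..s18 -> us trials,
    separated by a random-length run of unique noise states."""
    rng = random.Random(seed)
    noise_id = 0
    for _ in range(n_trials):
        yield "cs", 0.0
        for i in range(n_intermediate):
            yield f"s{i + 1}", 0.0
        yield "us", 1.0                        # the primary reward arrives with us
        for _ in range(rng.randint(0, max_noise)):
            noise_id += 1
            yield f"noise{noise_id}", 0.0      # unique inter-trial noise states

stream = pavlovian_stream(n_trials=140)        # e.g. 100 structure-learning + 40 reward trials
print([next(stream) for _ in range(3)])        # [('cs', 0.0), ('s1', 0.0), ('s2', 0.0)]
```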

A two-level rHTM was trained to simulate the development of the reward prediction error during the Pavlovian task. Training was performed in two steps. In the first step, the rHTM learned the structure of the event sequences: the task sequence was given to the rHTM without reward information for 100 trials. In the second step, the development of the reward prediction error was simulated: the state sequence was fed to the rHTM and the primary reward was given when the state was us. Trials were repeated 40 times, and the development of the association between states and reward, as well as the reward prediction error, was observed. On the 30th trial, the reward was omitted to observe the behavior of the system when a predicted reward is withheld.

To compare our method with previous reports, the development of the TD error and the state values were simulated with a conventional TD reinforcement learning model [13]. In this simulation, the values of 20 states (cs, intermediate states, and us) and the reward prediction error were observed for the same Pavlovian conditioning situation.
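A sketch of this conventional tabular TD(0) baseline over the 20-state chain, with the reward omitted on one trial as in the rHTM run; the discount and learning rate are assumed values, and treating the cs onset as an unpredictable transition out of noise is our reading of the setup.

```python
import numpy as np

GAMMA, ALPHA = 0.98, 0.1      # discount factor and learning rate (assumed values)
N_STATES = 20                 # cs, 18 intermediate states, us

def run_td_baseline(n_trials=40, omit_trial=29):
    """Tabular TD(0) over cs -> s1..s18 -> us; reward arrives on the transition
    into us except on the omission trial (0-based index 29 = the 30th trial).
    Returns the values and per-trial TD errors."""
    V = np.zeros(N_STATES)                       # V[0]=cs, V[1..18]=s1..s18, V[19]=us
    cs_onset_error = np.zeros(n_trials)          # TD error when cs appears out of noise
    transition_errors = np.zeros((n_trials, N_STATES - 1))
    for trial in range(n_trials):
        # cs arrives unpredictably, so its discounted value is itself a TD error
        cs_onset_error[trial] = GAMMA * V[0]
        for t in range(N_STATES - 1):
            r = 1.0 if (t == N_STATES - 2 and trial != omit_trial) else 0.0
            delta = r + GAMMA * V[t + 1] - V[t]  # equation (1)
            V[t] += ALPHA * delta
            transition_errors[trial, t] = delta
    # V[19] (us) is never updated and acts as a terminal state.
    return V, cs_onset_error, transition_errors

values, cs_errors, errors = run_td_baseline()
```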

2) Go-Nogo Task

To verify that our model can be applied to a more complex system, we used it to simulate the RPE in a go-nogo task. The overall structure of the task is identical to the example explained earlier (Fig. 4B). In a trial, a cue signal is followed by a task cue, which is either a go or a no-go signal. Based on the task cue, the subject selects either the go or the no-go action, and a reward is given for only one of the two actions. In our simulation, the task cue was fixed to the go signal and the policy of the agent was to choose the go or no-go action, each with probability 0.5. The time between the states was filled with intermediate states, as shown in Fig. 4B. A reward was given for the reward signal, and no other rewards or punishments were given. We trained a three-layer rHTM on the structure of the event sequences for 350 trials (half go and half nogo) embedded in random noise states. After that, another 350 trials with reward information were given, and the development of the reward prediction error was monitored.
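A sketch of a matching go-nogo trial generator, using the fixed p = 0.5 policy and the shared intermediate states of Fig. 4B; the state labels, the noise-gap length, and the reward magnitude are illustrative assumptions.

```python
import random

def gonogo_stream(n_trials, max_noise=20, seed=1):
    """Yield (state_label, reward) pairs for go/nogo trials separated by noise.
    The action is chosen with the fixed p = 0.5 policy, and only the go branch
    ends in a reward; the nogo branch ends in 'pur' with no reward or punishment."""
    rng = random.Random(seed)
    noise_id = 0
    for _ in range(n_trials):
        yield "cue", 0.0
        yield "gocue", 0.0                      # the task cue is fixed to go
        go_chosen = rng.random() < 0.5          # fixed p = 0.5 policy
        yield ("go" if go_chosen else "nogo"), 0.0
        yield "w1", 0.0                         # intermediate states shared by both
        yield "w2", 0.0                         # branches before context splitting
        yield ("rew", 1.0) if go_chosen else ("pur", 0.0)
        for _ in range(rng.randint(0, max_noise)):
            noise_id += 1
            yield f"noise{noise_id}", 0.0
```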


Figure 6. Simulation of Pavlovian conditioning with TD learning and the reward HTM. cs and us were connected by 18 intermediate states. The value (A) and reward prediction error (B) for the state transitions in each trial, as computed by TD learning (upper). In TD learning, positive TD errors occur for the intermediate states, which is not observed in biological systems, and the values of the intermediate states approach one over the course of learning. In contrast to the traditional temporal difference model, the rHTM associates rewards with patterns (C, inset) and delivers a reward value when the pattern is activated (the shape of the left ridge in D is identical to that of the inset of C). The prediction of a reward sequence (C) inhibits the predicted reward prediction error signal, producing the reward prediction error shown in (D). No RPE appears in the intermediate states. A negative RPE occurs in trial 30, where the reward was omitted.

B. Result

The reward prediction errors for the CS and US states were similar for the TD learning and reward HTM models. In both models, the RPE at US delivery diminishes as learning progresses, and an RPE appears at the delivery of the CS (Fig. 6, right). Both models show a negative RPE when delivery of the US is omitted, that is, when the predicted reward is not delivered, consistent with the activity of dopaminergic neurons in the basal ganglia. The main difference between the TD learning model and the rHTM is the reward prediction error of the intermediate states during learning (Fig. 6B, arrow, and Fig. 6D). In the TD learning model, the values of the intermediate patterns propagate back toward the CS (Fig. 6A), driven by positive temporal difference errors in the intermediate patterns (Fig. 6B, arrow). In contrast, the rHTM computes only the reward association of a state and the prediction of the reward by the preceding state. The representative state of the CS-to-US sequence at a higher level (the equivalent of pavlovseq in Figs. 1-3) is associated with a reward; this is shown in the inset of Fig. 6C. By finding the sequence structure within the state transition patterns, our model solves the distal reward problem. Moreover, a reward is associated with the CS without intermediate TD values. As the reward delivered by the US becomes predictable from the preceding state (Fig. 6C), the reward prediction error at the US decreases through the cancellation of the reward delivered by the US against the reward prediction from the HTM.

Figure 7. Development of the reward prediction error during instrumental conditioning. The reward prediction error produced when each state is presented is shown. “others” refers to all events other than those specified; their RPE remained 0 throughout the simulation.

In the go-nogo task, RPEs were observed only when the cue, go, nogo, and reward states were activated. These states are either the first states of temporal patterns in higher regions or the state carrying the primary reward information (Fig. 7), i.e., the states directly involved in determining the RPE. The reward prediction error in the other states did not change from zero during the entire simulation (Fig. 7; others), as for the intermediate states in Pavlovian conditioning. The result also shows that the RPE at the reward state diminishes to zero as the reward becomes predictable (Fig. 7). The cue state, which marks the beginning of a trial, produces an RPE throughout the simulation, as the signal arrives within the noise sequence and is therefore never predictable. The behavior of the RPE at go and nogo is particularly interesting. Because the probability of go and nogo is 0.5 in each case, the go and nogo states never become fully predictable. As the go state predicts that a reward signal is coming, selecting go produces an RPE; in contrast, selecting nogo produces a negative reward prediction error, as the reward predicted at the start of the trial drops to zero. This behavior is similar to that of the TD error, except that in TD learning the intermediate states also produce TD errors.

IV. DISCUSSION

In this paper, we introduced a hierarchical model for learning sequences of events and used this information to compute reward prediction errors. The HTM framework can memorize patterns and the relationships between patterns to identify the structure of the events it is given. By associating a reward with the representation of memorized sequences and using this memory to inhibit predictable rewards, we developed a model that computes the RPE similarly to DA neurons in Pavlovian conditioning. Our model resembles DA activity in showing no wave of signal through the intermediate states, which previous TD learning methods falsely predicted to exist. Moreover, our model could mimic the behavior of DA in instrumental conditioning, in particular the negative RPE for the selection of non-reward-giving actions.

As our sequence memory can find the beginning of a sequence, the distal reward problem can be solved by associating a reward with the representation of the sequence in higher regions. Previously, Izhikevich suggested a model for solving the distal reward problem based on the interaction between STDP and dopamine neurons [19]. His model showed the movement of DA activity from the US to the CS. The problem was that the DA response to the US disappeared entirely; yet when the CS context is not given, a US should by itself elicit a DA signal. Additionally, Izhikevich's model could not show the reduced dopamine activity upon omission of the US, which our model does show.

Our model addresses the critic part of reinforcement learning in the actor-critic framework, which signals whether the current policy is good or bad. For the actor part, we used a fixed policy in instrumental conditioning, which helped us describe the activity of our model as a critic. In full reinforcement learning, the reward prediction error from the critic should update the policy of the actor. How our memory system would be updated by the RPE signal and the chosen action is left as an important issue for future study. With such an algorithm, a complete actor-critic model based on HTM might learn the state space in a more active manner, but this is beyond the scope of this paper.

Recently, a study reported direct control of the activity of dopaminergic neurons by mPFC activity [24]: DA neurons were both activated and inhibited by stimulation of mPFC areas. Applied to our model, the activation of DA neurons may arise from state-representing neurons that are associated with a reward, and the inhibition of DA neurons may arise from the reward prediction by the preceding states. With a well-developed hierarchical memory model of the neocortical system, our model can explain the behavior of DA with a simple potentiation rule between the neocortex and the DA cells. Notably, our model does not require explicit numerical values to be assigned to every state.

To verify the feasibility of our model, biological experiments are required. As our model depends on hierarchical event-sequence memory, it predicts that two independent events belonging to the same higher-level state can affect each other's reward value by changing the reward value of the common higher node. This could be tested in human or animal behavioral experiments.

V. CONCLUSION

We proposed a novel hierarchical memory model for the behavior of dopaminergic neurons. In this model, the dopamine system and the neocortex memorize and compute the reward prediction error using hierarchical event memory. Our model addresses several limitations that the existing TD model of DA behavior cannot explain. First, it is in good agreement with biological observations of dopaminergic neurons. Second, it provides a model of the interaction between neocortical memory and the reward prediction error. Finally, the state assignment and the event sequence structure can be learned by the rHTM.

References

[1] M. Kawato and K. Samejima, “Efficient reinforcement learning: computational theories, neuroscience and robotics,” Current Opinion in Neurobiology, vol. 17, no. 2, pp. 205-212, Apr. 2007.

[2] R. S. Sutton and A. G. Barto, Reinforcement learning. MIT Press, 1998.

[3] W. Schultz, P. Dayan, and P. R. Montague, “A Neural Substrate of Prediction and Reward,” Science, vol. 275, no. 5306, pp. 1593-1599, Mar. 1997.

[4] P. N. Tobler, C. D. Fiorillo, and W. Schultz, “Adaptive Coding of Reward Value by Dopamine Neurons,” Science, vol. 307, no. 5715, pp. 1642-1645, Mar. 2005.

[5] C. D. Fiorillo, P. N. Tobler, and W. Schultz, “Discrete coding of reward probability and uncertainty by dopamine neurons,” Science, vol. 299, no. 5614, pp. 1898-1902, Mar. 2003.

[6] M. X. Cohen and M. J. Frank, “Neurocomputational models of basal ganglia function in learning, memory and choice,” Behavioural Brain Research, vol. 199, no. 1, pp. 141-156, Apr. 2009.

[7] T. V. Maia and M. J. Frank, “From reinforcement learning models to psychiatric and neurological disorders,” Nat Neurosci, vol. 14, no. 2, pp. 154-162, Feb. 2011.

[8] H. Nakahara, H. Itoh, R. Kawagoe, Y. Takikawa, and O. Hikosaka, “Dopamine Neurons Can Represent Context-Dependent Prediction Error,” Neuron, vol. 41, no. 2, pp. 269-280, Jan. 2004.

[9] N. D. Daw, Y. Niv, and P. Dayan, “Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control,” Nature Neuroscience, vol. 8, no. 12, pp. 1704-11, Dec. 2005.

[10] K. A. Zaghloul et al., “Human Substantia Nigra Neurons Encode Unexpected Financial Rewards,” Science, vol. 323, no. 5920, pp. 1496-1499, Mar. 2009.

[11] P. R. Montague, P. Dayan, C. Person, and T. J. Sejnowski, “Bee foraging in uncertain environments using predictive Hebbian learning,” Nature, vol. 377, no. 6551, pp. 725-8, Oct. 1995.

[12] M. Silvetti, R. Seurinck, and T. Verguts, “Value and prediction error in medial frontal cortex: integrating the single-unit and systems levels of analysis,” Frontiers in Human Neuroscience, vol. 5, p. 75, 2011.

[13] W. Schultz, P. Dayan, and P. R. Montague, “A neural substrate of prediction and reward,” Science (New York, N.Y.), vol. 275, no. 5306, pp. 1593-9, Mar. 1997.

[14] C. L. Hull, Principles of behavior: an introduction to behavior theory. Oxford, England: Appleton-Century, 1943.

[15] J. Hawkins, D. George, and J. Niemasik, “Sequence memory for prediction, inference and behavior,” Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, vol. 364, no. 1521, pp. 1203-1209, May 2009.

[16] D. George and J. Hawkins, “Towards a mathematical theory of cortical micro-circuits,” PLoS Computational Biology, vol. 5, no. 10, p. e1000532, Oct. 2009.


[17] R. A. Rescorla, “Behavioral Studies of Pavlovian Conditioning,” Annual Review of Neuroscience, vol. 11, no. 1, pp. 329-352, Mar. 1988.

[18] B. W. Balleine, M. Liljeholm, and S. B. Ostlund, “The integrative function of the basal ganglia in instrumental conditioning,” Behavioural Brain Research, vol. 199, no. 1, pp. 43-52, Apr. 2009.

[19] E. M. Izhikevich, “Solving the Distal Reward Problem through Linkage of STDP and Dopamine Signaling,” Cerebral Cortex, vol. 17, no. 10, pp. 2443-2452, Oct. 2007.

[20] J. Friedrich, R. Urbanczik, and W. Senn, “Spatio-Temporal Credit Assignment in Neuronal Population Learning,” PLoS Comput Biol, vol. 7, no. 6, p. e1002092, Jun. 2011.

[21] J. J. F. Ribas-Fernandes et al., “A Neural Signature of Hierarchical Reinforcement Learning,” Neuron, vol. 71, no. 2, pp. 370-379, Jul. 2011.

[22] M. Schembri, M. Mirolli, and G. Baldassarre, “Evolving internal reinforcers for an intrinsically motivated reinforcement-learning robot,” in IEEE 6th International Conference on Development and Learning, 2007. ICDL 2007, 2007, pp. 282-287.

[23] J. Gläscher, N. Daw, P. Dayan, and J. P. O’Doherty, “States versus Rewards: Dissociable Neural Prediction Error Signals Underlying Model-Based and Model-Free Reinforcement Learning,” Neuron, vol. 66, no. 4, pp. 585-595, May 2010.

[24] D. E. Moorman and G. Aston-Jones, “Orexin/hypocretin modulates response of ventral tegmental dopamine neurons to prefrontal activation: diurnal influences,” The Journal of Neuroscience: The Official Journal of the Society for Neuroscience, vol. 30, no. 46, pp. 15585-15599, Nov. 2010.
