
Connection Science, Vol. 18, No. 3, September 2006, 265–285

Lexical and syntactic structures in a connectionist model of reading multi-digit numbers

TOM VERGUTS* and WIM FIAS

Department of Psychology, Ghent University, H. Dunantlaan 2, 9000 Ghent, Belgium

(Received 30 August 2005; revised 5 January 2006; accepted 30 January 2006)

A connectionist model of reading aloud multi-digit numbers is proposed. Unlike earlier models, a model that has no prior knowledge is trained on this task; it is found that the model develops both a lexical route and a route that implements (syntactic) rules. Lesion studies of the model show that it can exhibit the double dissociation between patients with either lexical or syntactic problems in number naming. Results are discussed in terms of the rules-versus-connections debate in cognitive science.

Keywords: Multi-digit number reading; Lexical and syntactic structures; Recurrent connectionist models

1. Introduction

Numbers come in a number of formats, the most important being Arabic (written) and verbal (primarily spoken). Adults can easily switch between these formats; this process is called transcoding. How do we do this? A number of theories have been proposed in the literature, for both Arabic-to-verbal transcoding (e.g. Power and Dal Martello 1997) and verbal-to-Arabic transcoding (e.g. Barrouillet et al. 2004). Whatever the transcoding direction, the major theoretical dividing line runs between asemantic transcoding routes (e.g. Power and Dal Martello 1997) and semantic routes. A semantic route entails that the meaning of a number is necessarily encoded. One implementation of a semantic route is one in which a number is decomposed into its base-ten structure (e.g. McCloskey 1992). For example, in transcoding from Arabic to verbal format in McCloskey’s (1992) model, the input string ‘543’ is represented as $5 \times 10^2 + 4 \times 10^1 + 3 \times 10^0$. In other words, the cognitive system assigns a value of 5, 4 and 3 to the hundreds, tens and units, respectively. Transcoding rules act on this (semantic) representation. On the other hand, in an asemantic route lexical and syntactic rules operate directly on the given input string (e.g. a string of Arabic digits such as ‘65’) and transform the number into a different notation (e.g. ‘sixty-five’). For example, one transcoding rule in Power and Dal Martello’s (1997) system postulates that a string ‘ABC’ should be reformulated as ‘eng(A00) and eng(BC)’ (with eng representing ‘the English pronunciation of’).

*Corresponding author. Email: [email protected]

Connection Science ISSN 0954-0091 print/ISSN 1360-0494 online © 2006 Taylor & Francis
http://www.tandf.co.uk/journals
DOI: 10.1080/09540090600639396


Other transcoding rules then operate on the constituents of the resultant string (eng(A00) and eng(BC)) until a complete pronunciation is obtained. This system is called asemantic because a correct (base-ten) interpretation of the input string is not constructed and not required.
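To make the flavour of such a rule system concrete, the following is a minimal sketch (our illustration in Python, not Power and Dal Martello's actual rule set) of an asemantic transcoder for 1–999; the recursion mirrors the ‘ABC’ → ‘eng(A00) and eng(BC)’ style of rewrite rule, with the connective ‘and’ omitted.

```python
# Minimal sketch of an asemantic rules-and-frames transcoder, loosely
# following Power and Dal Martello (1997). The rule set is a simplification
# for illustration, not the authors' full grammar.

UNITS = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
         'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen',
         'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen']
# ('zero' is never produced; it is kept only for index alignment.)
TENS = {2: 'twenty', 3: 'thirty', 4: 'forty', 5: 'fifty',
        6: 'sixty', 7: 'seventy', 8: 'eighty', 9: 'ninety'}

def eng(n: int) -> str:
    """Rewrite an integer 1-999 into English words by recursive rules."""
    if n < 20:                       # lexical primitives are retrieved directly
        return UNITS[n]
    if n < 100:                      # rule: 'AB' -> eng(A0) [eng(B)]
        tens, unit = divmod(n, 10)
        return TENS[tens] + (' ' + eng(unit) if unit else '')
    hundreds, rest = divmod(n, 100)  # rule: 'ABC' -> eng(A) hundred [eng(BC)]
    out = eng(hundreds) + ' hundred'
    return out + (' ' + eng(rest) if rest else '')

print(eng(543))   # 'five hundred forty three'
```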

The debate in this domain has centred on the number of routes required for number reading and the nature of these routes. Over the last two decades, evidence has accumulated suggesting that a purely semantic approach is not sufficient to account for the available range of data obtained from brain-damaged patients. For example, some authors (e.g. Cohen and Dehaene 1991, Cipolotti 1995, Cipolotti and Butterworth 1995) have described patients with intact number comprehension (suggesting that number semantics is intact), yet suffering from transcoding problems. In our opinion, however, the commonalities between the two routes and their characteristics have been overlooked. Both semantic and asemantic routes adhere to a rules-and-frames approach,† according to which a set of rules generates a number frame and fills the slots of this frame with (Arabic or verbal) numbers. In fact, such an approach seems very well suited for transcoding: a relatively small set of rules can be formulated that generates all legal, but no illegal, verbal numbers (Power and Dal Martello 1997).

†We have adapted this notation from Dell et al. (1999), who called this the ‘frame-and-slot’ approach in the context of language production.

We raise two general arguments against such an approach. The first is that none of the current transcoding models specifies where exactly these rules come from. It has been argued that these rules arise from declarative knowledge (Barrouillet et al. 2004), but this still leaves a number of questions open. How is the relevant declarative knowledge itself acquired? What kinds of declarative knowledge are recruited for the task? And how is declarative knowledge transformed into task-appropriate rules? A second, more empirical argument concerns the ‘gradedness’ of errors in number naming. For example, patient HY described by McCloskey et al. (1986) exhibited problems retrieving the correct number within a lexical class (e.g. instead of 5, he may say 7, or instead of 30 he may say 40). According to a rule-based approach, a chunk of declarative knowledge or a retrieval rule would be deleted. However, the patient does not always make such errors: in fact, his error rate is quite low (8.9% for one- to three-digit numbers). To remedy this problem, rule-based models could introduce gradedness, for example, by assuming variable storage strength of declarative elements. Computationally, the incorporation of gradedness is actually a step toward an inherently graded, e.g. connectionist, model (cf. McClelland et al. 2003).

Rather than patching up a rule-based model with graded rules or representations, we shall follow an alternative strategy: train an inherently graded (connectionist) model to transcode from Arabic to verbal notation and evaluate to what extent, and how, the model develops rule-like principles. The model starts from a set of 29 lexical primitives (e.g. ‘one’, ‘two’) at the output level, but otherwise has no prior knowledge concerning the task (i.e. all connection weights are set to random values). Training consists of error minimization with a variant of the backpropagation algorithm. After training, the model can read all numbers from 1 to 999. It generalizes very well to new numbers because the nodes in the network come to encode approximate rules, and these approximate rules apply also to numbers that the model has never seen before. The basic principles behind this model are described in the next section.

2. Model: basic principles

This work builds on an earlier model of numerical cognition (Verguts et al. 2005) in which human performance on three tasks was integrated: number comparison, parity judgement and number naming. One of the central concepts in this earlier paper is the mental number line, a popular metaphor in numerical cognition (e.g. Dehaene 1992).



Most researchers assume that, after identification, numbers are projected onto this mental number line, from which point on further processing takes place (e.g. number comparison). Part of the evidence in favour of a line-like mental representation of number, and its involvement in number naming, comes from priming studies. In particular, it has been observed that naming a single-digit number is faster if it is primed by a numerically close digit rather than by a numerically far digit (e.g. Reynvoet et al. 2002).

In Verguts et al. (2005), we implemented the mental number line in a connectionist model as a set of nodes with the property that numbers that are numerically close activate overlapping sets of nodes. For example, numbers 1 and 2 partially activate the same nodes on the mental number line, but 1 and 9 activate different nodes. Yet, number coding was localist in the sense that each number activated exactly one node most strongly, so each number had its dedicated node on the mental number line. It was further argued that, because of the high frequency of small numbers and number naming in daily life (Dehaene and Mehler 1992), the number-specific positions on the number line for these small numbers had acquired strong connections to the relevant responses. For example, for number 1, the relevant response is ‘one’, so the connection from the 1-position on the mental number line to the response ‘one’ is strong. We consider this a semantic naming route, because the mental number line encodes the meaning of (small) numbers. Note, however, that the underlying number semantics is different from that in McCloskey’s (1992) model: in that model, number semantics refers to a base-ten decomposition of a number, whereas in the present case, number semantics refers to a mental number line that mediates naming.

The Verguts et al. (2005) model was trained on number naming, and its connection weights were set by error minimization (using the delta rule). Therefore, repeated presentation of a particular number is required before the connection weight is appropriate for that number. However, large numbers are so infrequent in daily life that number-specific positions cannot acquire strong connections to their corresponding response. To obtain an approximation of the distribution of numbers in daily life, we typed each number from 1 to 999 in a search engine (Google) on the Internet and recorded each number’s frequency.† The resulting probability distribution is shown in figure 2. Clearly, numbers become very infrequent as soon as they are larger than about 100. Hence, because the formation of direct associations between individual semantic number representations (on the mental number line) and output is not feasible or plausible, what is required instead is a model with considerable generalization power. The semantic (in the sense of mental-number-line-mediated) approach as used in our earlier model of small numbers is inappropriate for larger numbers.

†Strictly speaking, Google does not return the frequency of a particular number, but rather the frequency of documents containing that number. For word frequencies, it has been found that these two variables are highly correlated (Blair et al. 2002). Further, the model is quite robust to exactly which frequency distribution is used (see simulation 2). Therefore, we think the distinction between a number’s frequency and the number of documents that contain a given number can be safely ignored.

To overcome this problem, a decompositional approach, in which a large number is broken down into smaller constituent parts, is needed. In particular, we propose to consider the production of a large number as a sequential behaviour, in the sense that it consists of a sequence of different responses. For example, the Arabic number 645 is read as ‘six hundred forty five’, a sequence of four lexical primitives. The low-frequency problem for large numbers can then be solved because the lexical primitives (‘six’, ‘hundred’, . . .) themselves are sufficiently frequent, so reliable output connections can be formed for them. Pronouncing a large number then reduces to pronouncing a string of lexical primitives. A class of models that is suited for generating responses with a sequential structure is that of recurrent networks.



Starting from the seminal work of Elman (1990) and Jordan (1986), recurrent models have been widely applied to other tasks that require sequence production (e.g. sentence production: Dell et al. 1997; word production: Plaut 1999; routine behaviour: Botvinick and Plaut 2004).

In the present context, what is required is a recurrent model that takes into account the present stimulus (which consists of a (spatial) sequence of Arabic digits, e.g. 546) and keeps track of the position in the sequence; in other words, it should distinguish between previously uttered (partial) responses and to-be-uttered responses. For example, if one has already pronounced ‘hundred’, the next response, given the previous response and given the stimulus 546, should be ‘forty’. However, it is not just the previous response that is required: with stimulus 545, after pronouncing just ‘five’, the next response should be ‘hundred’, but after pronouncing ‘forty’ followed by ‘five’, one should stop talking.

In the remainder of this paper, first the model and its training procedure are described in more detail. After training, the model was able to pronounce all numbers from 1 to 999 without error. We investigate how it solves this task by analysing the model’s hidden node activity and connection weight patterns (simulation 1). In simulation 2, we test the generalization power of the model by omitting a large set of numbers from the training set and evaluating performance on these numbers in the testing phase. In simulations 3–5, we lesion the model reported in simulation 1 to investigate whether the lesioned model reproduces patient data reported in the literature. In section 9, we discuss a number of extensions and bring our work into contact with the rules-versus-connections controversy (e.g. McClelland and Patterson 2002, Pinker and Ullman 2002).

3. Model: implementation

A model that can cope with the requirements outlined in the previous section is shown in figure 1(a). Processing in this model consists of sequences of two alternating time steps. On the first time step, hidden nodes receive activation from the input, output and hidden fields (arrows 2, 4 and 5, respectively). On the second time step, the response field receives activation from the input field and the hidden field (arrows 1 and 3, respectively). The response whose node attains the highest activation value is regarded as the response produced by the model. Hence, hidden node values were changed at odd time steps (1, 3, . . .) and response node values at even time steps (2, 4, . . .). This two-step process of alternating hidden node and response node activation continues until a ‘done’ response is given (cf. Plaut 1999). Inclusion of this ‘done’ response is probably the simplest way to ascertain whether the model considers its response sequence to be finished or not. The relevant information on exactly which response is needed at a given point in the processing sequence is encoded in the weights, and these weights are updated with a variant of backpropagation (see later).

Although the model is sequential in the sense that it generates a sequence of responses (e.g. six–hundred–forty–five), the model is not sequential at the input level: the input string is constant on any given trial. In the example, the model receives the input string ‘645’ immediately rather than sequentially. At this point, it is unclear whether this is a realistic input scheme or not; in any case, it is a convenient starting point for a first attempt at modelling the process of naming multi-digit numbers.

3.1 Input–output coding

We assume that numbers are coded ‘right-aligned’ in a set of three banks of input nodes (see figures 1(b) and (c)).


Figure 1. Schematic overview of implemented model.

The rightmost digit is assigned to the rightmost bank, the second digit from the right (if any) to the middle bank, and the third digit (if any) to the leftmost bank. The three banks can accordingly be considered hundreds, tens and units banks, respectively (H, T and U in figures 1(b) and (c)). In each bank, there is one input node for each of the digits 0 to 9, making a total of 3 × 10 = 30 input nodes. Hence, a one-digit number activates only one node in the units bank, a two-digit number additionally activates a node in the tens bank, and a three-digit number activates an additional node in the hundreds bank. In the example of figure 1(b), number 291 is presented; in the example of figure 1(c), number 2 is presented. Note that input digits (e.g. 2, 9, 1) were not presented sequentially to the model but simultaneously.
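As an illustration, here is a minimal sketch of this right-aligned coding scheme (our reading of figures 1(b) and (c); the function name and array layout are assumptions, not the authors' code):

```python
import numpy as np

# Right-aligned input coding: three banks (hundreds, tens, units) of ten
# one-hot nodes each, 30 input nodes in total. Absent digit positions
# (e.g. the hundreds bank for '2') simply receive no activation.
def encode(number: int) -> np.ndarray:
    x = np.zeros(30)                # [hundreds bank | tens bank | units bank]
    digits = [int(d) for d in str(number)]
    for bank, digit in enumerate(reversed(digits)):    # right-aligned
        x[(2 - bank) * 10 + digit] = 1.0               # bank 0 = units
    return x

assert encode(291)[2] == 1      # '2' in the hundreds bank
assert encode(2).sum() == 1     # a single-digit number activates one node
```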

We used the lexical primitives of the English number naming system on the response side (‘phonology’ in figure 1(d)). We slightly simplified this system, in the sense that the connective word ‘and’ (e.g. ‘five hundred and forty six’) was not used explicitly. We do not think that incorporating this connective would bring about extra difficulties. Further, each number from 1 to 19 had its own single response (node). Each tens word (twenty, thirty, . . .) also had its own node, as did hundred. Finally, there was one response that indicated that the sequence had ended (‘done’: see figure 1(d)). This resulted in a total of 29 response nodes. We required the model to generate a ‘done’ response at the end of each response sequence so we could unambiguously determine whether the model considered the response sequence to be finished or not.

Response sequences could range from two to at most five words long. For example, the response sequence required for stimulus 2 is ‘two’–‘done’; the response sequence for stimulus 546 is ‘five’–‘hundred’–‘forty’–‘six’–‘done’.


The model received information about the required length only implicitly: after a stimulus such as 546, target values were provided for five consecutive responses, whereas for stimulus 2, target values were given for only two responses. In this way, the model extracted the required length of each number.
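The mapping from a number to its target sequence of lexical primitives can be sketched as follows (again an illustrative reconstruction, not the authors' code; the word lists follow the 29 response nodes described above):

```python
# Target response sequences: each number maps to a sequence of lexical
# primitives terminated by 'done'. The connective 'and' is omitted,
# as in the simulations.
UNITS = ['one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight',
         'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen', 'fifteen',
         'sixteen', 'seventeen', 'eighteen', 'nineteen']
TENS = {2: 'twenty', 3: 'thirty', 4: 'forty', 5: 'fifty', 6: 'sixty',
        7: 'seventy', 8: 'eighty', 9: 'ninety'}

def targets(n: int) -> list[str]:
    words = []
    if n >= 100:
        words += [UNITS[n // 100 - 1], 'hundred']
        n %= 100
    if n >= 20:
        words.append(TENS[n // 10])
        n %= 10
    if n > 0:
        words.append(UNITS[n - 1])
    return words + ['done']

assert targets(546) == ['five', 'hundred', 'forty', 'six', 'done']
assert targets(2) == ['two', 'done']
```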

3.2 Connectivity pattern

The connectivity pattern used in the model is depicted in figure 1(a). Hidden nodes receive activation from input, hidden and output nodes (figure 1(a), arrows 2, 5 and 4, respectively). Ten hidden nodes were used. Output nodes receive activation from input nodes and hidden nodes (arrows 1 and 3, respectively). If there is a connection between two fields, there is full interconnection from the one field to the other.

3.3 Activation equations and implementation parameters

Activation of a hidden node was a sigmoid function of the input to that node. In particular,

$$x_i^{\text{hidden}} = \frac{1}{1 + \exp(-g \times \text{net}_i + b_i)}, \qquad (1)$$

where $\text{net}_i$ is the net input to that hidden node. The net input is the sum of the activation that the hidden node receives from the input, hidden and response fields (arrows 2, 5 and 4, respectively). The parameter $b_i$ is a bias parameter. A similar equation holds for a response node, except that it receives input from the input and hidden fields (arrows 1 and 3, respectively). Although it is beyond the scope of this investigation, an extension of the model to simulate response times is straightforward: in that case, one assumes that activation of the hidden nodes builds up gradually, with a rate parameter equal to the right-hand side of equation (1). A similar logic holds for the response nodes. If any of the hidden (respectively, response) nodes reaches a fixed threshold, the response (respectively, hidden) nodes become activated until one of these nodes reaches a fixed threshold, and so on. Response time can then be modelled as the total time required for completing the response sequence.
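A sketch of equation (1) together with the alternating update schedule might look as follows; the weight-matrix names and shapes are our assumptions, not part of the published model:

```python
import numpy as np

g = 4.0   # gain parameter (see below)

def sigmoid(net, b):
    # Equation (1): activation is a squashed function of the net input.
    return 1.0 / (1.0 + np.exp(-g * net + b))

def step(x_in, hidden, response, W):
    """One hidden/response update cycle. W holds the five weight matrices
    of figure 1(a); the key names and layouts are our assumptions."""
    # Odd time step: hidden nodes integrate the input, response and hidden
    # fields (arrows 2, 4 and 5).
    net_h = W['in_hid'] @ x_in + W['res_hid'] @ response + W['hid_hid'] @ hidden
    hidden = sigmoid(net_h, W['b_hid'])
    # Even time step: response nodes integrate the input and hidden fields
    # (arrows 1 and 3).
    net_r = W['in_res'] @ x_in + W['hid_res'] @ hidden
    response = sigmoid(net_r, W['b_res'])
    return hidden, response   # argmax of `response` is the produced word
```

The cycle would be repeated until the ‘done’ node attains the highest response activation.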

Connections and bias parameters were updated simultaneously using a generalization of the backpropagation algorithm developed for recurrent network models (backpropagation through time; Rumelhart et al. 1986). The error measure used for training was the standard mean squared error. An error tolerance level of 0.05 was introduced, meaning that deviations smaller than this level were not treated as errors. For example, if the target (i.e. required) value for a given output node was 1, all values larger than 0.95 were considered to be correct and did not lead to further weight adaptation. This ensured that the model would not keep adapting its weights merely to make small and unnecessary corrections (cf. Seidenberg and McClelland 1989).

The gain parameter g determines the extent of separation between activation values after transformation with equation (1). In particular, higher values of g push nodes with net input above zero ($\text{net}_i > 0$) toward one, and nodes with net input below zero ($\text{net}_i < 0$) toward zero. Its value was set at g = 4. This value is not critical; other values led to similar results. The distribution of numbers presented to the network is shown in figure 2; in particular, on each training trial, a number was sampled from this distribution. After training, all numbers were presented consecutively to the network. For each number, the model was considered to stop its response sequence when the ‘done’ response was given. We trained the model for a total of 500 000 trials in simulations 1 and 2. In simulations 3–5, lesions were applied to the model obtained from simulation 1.
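The two training details just described, sampling each trial from the empirical frequency distribution and ignoring errors below the tolerance of 0.05, can be sketched as follows (the counts array is a hypothetical placeholder, not the recorded Google data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder for the recorded frequencies of numbers 1-999; the real
# training distribution is the Google-based one shown in figure 2.
counts = 1.0 / np.arange(1, 1000) ** 2
p = counts / counts.sum()

def sample_trial():
    # On each training trial, a number is sampled from the distribution.
    return rng.choice(np.arange(1, 1000), p=p)

def tolerant_error(target, output, tol=0.05):
    err = target - output
    err[np.abs(err) < tol] = 0.0   # deviations below tolerance are not errors
    return err                     # drives the backpropagation update
```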


Figure 2. Distribution of numbers as used in training corpus.

4. Simulation 1: lexical and syntactic processes

After training, the model was 100% correct. We now describe in more detail how the model manages to do this.

4.1 Lexicon

Consider the pattern of connections in the direct input–output path (arrow 1 in figure 1(a)). In figure 3(a), we depict the connections from each of the three input banks to the unit responses (‘one’, ‘two’, . . .). We only depict connections from a particular digit at the input banks to its corresponding response. For example, the connection from units digit 1 at the input side to the response ‘one’ is shown on the graph labelled ‘units digit’ in (a). Also, the connection from the tens digit 1 at the input side to the same response is shown on the graph labelled ‘tens digit’. The connection from units digit 1 to response ‘two’ is not shown. Such cross-connections (in which input number and response correspond to different numbers) were generally negative (inhibitory) after training (not shown in figure 3(a)).

As can be seen in figure 3(a), both hundreds-position and units-position digits map to the unit responses, but the tens-position digits do not. Note that this pattern of weights is learned, not imposed. The reason the learning algorithm chooses this configuration is that the digits in the tens position do not discriminate between unit responses. For example, if the number presented is 3x2, with x any number larger than 1, the relevant unit responses in this number are ‘three’ and ‘two’, and the value of x is irrelevant (other than that it is larger than 1). This is why the connections from tens-position digits to unit responses are approximately zero.

The plot in figure 3(b) shows connection weights to the ‘teen’ responses. In this case, connections from the units-position digits are generally large, because the value at this position indicates which teen response should be generated. In addition, the first tens-position digit (1) is also strongly connected to all teen responses; this is because a value of 1 at the tens position indicates that a teen response is required (figure 3(b) plots only the connection from the tens-position digit 1 toward response ‘eleven’).


Figure 3. Connections from Arabic input field to (a) unit responses, (b) teen responses and (c) tens responses. (d) Summary of connection pattern from input field to phonology field (dotted arrows).

Finally, figure 3(c) shows that only the tens digits are strongly connected to their corresponding tens words. Figure 3(d) schematically summarizes the pattern of connections (dashed arrows) from the three input banks (hundreds, tens, units) to the three response banks (units, teens, tens).

Two general principles apply to the resultant connection weights. First, all digit-to-corresponding-response connections (e.g. digit 1, response ‘one’ or ‘eleven’) are strong, and about equally strong within a field. Second, all cross-connections (e.g. digit 1, response ‘two’ or ‘twelve’) are inhibitory (not shown in figure 3). These two findings allow us to interpret this direct path as a lexical path: there is a transparent mapping from each individual input node to its corresponding response(s), so that presentation of, for example, stimulus 2 leads to response ‘two’.
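These two principles can be checked mechanically on a trained weight matrix; the sketch below assumes a 29 × 30 direct-route matrix with the input and response layouts used in the earlier coding sketches (the variable names are ours):

```python
# Check the lexical pattern for the units bank (assumed input columns 20-29,
# digits 0-9) and the unit responses (assumed rows 0-8, 'one'-'nine').
def check_lexical_pattern(W_in_res):
    for digit in range(1, 10):
        row = digit - 1                        # response node for this digit
        own = W_in_res[row, 20 + digit]        # digit -> corresponding response
        cross = [W_in_res[row, 20 + d] for d in range(1, 10) if d != digit]
        assert own > 0 and all(c < 0 for c in cross), \
            'corresponding connection excitatory, cross-connections inhibitory'
```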


4.2 Syntax

After training, the hidden nodes represented syntactic rules. To illustrate this, we depict in figure 4 the behaviour of the first hidden node. Figure 4(a) depicts the cumulative distribution function (CDF) of activation values for this hidden node upon presentation of each number from 1 to 999. Of course, the activation value of a hidden node changes during the response sequence (e.g. ‘thirty’–‘seven’–‘done’); we plot the activation on the first two time steps (i.e. before the first response; because hidden node and response node activations were changed alternately, their activation values always remained the same for two consecutive time steps). More concretely, we presented numbers 1 to 999 to the trained model, recorded the (999) activation values of the first hidden node after the first time step, and plotted the CDF of these activation values. Hence, given that each activation value occurs only once, the CDF increases at exactly 999 points along the abscissa.

As figure 4(a) shows, this hidden node has very low activation (below 0.1) for approximately 100 numbers (point 0.1, or 10%, on the ordinate), but has strong activation (from 0.75 upward) for all other numbers (0.1 to 1, or 10% to 100%, on the ordinate). Figure 4(b) shows the weights originating from this hidden node and projecting to the response layer (response nodes are shown on the abscissa; for clarity, the projection to the ‘done’ response is not shown). This hidden node selectively activates unit responses (‘one’ to ‘nine’) and the ‘hundred’ response, and inhibits teen and tens responses. Figures 4(c)–(e) characterize the 100 numbers that lead to the lowest hidden node activation (i.e. the 100 leftmost numbers in (a)). Figure 4(c) shows the distribution of hundreds-position values for these 100 numbers. Figures 4(d) and (e) show the same distribution but for tens-position and units-position digits, respectively. From these plots, it is clear that this hidden node is least responsive to numbers that have a zero at the hundreds position; that is, it does not respond to numbers 1–99.

Figure 4. (a) Cumulative distribution of activation values of hidden node 1 after the first time step. (b) Connections from hidden node 1 to response (phonology) nodes. (c) Histogram of hundreds-position values of the 100 numbers for which hidden node 1 was least responsive. (d, e) Similar histograms for tens- and units-position values, respectively. (f–h) Similar histograms of the 100 numbers for which hidden node 1 was most responsive.


In fact, numbers 1–99 are the only numbers at the low end of the CDF in figure 4(a). To sum up, the hidden node is unresponsive to numbers 1–99 (activation < 0.1) but responds strongly (activation > 0.75) to all larger numbers. Given this, together with the fact that the hidden node activates unit words and inhibits teen and tens words (as shown in figure 4(b)), this hidden node can be said to implement the following rule:

IF number is larger than 99
THEN activate Unit responses and inhibit Teen and Tens responses.

This hidden node encodes a useful rule in number naming, because if a number larger than 99 is presented (e.g. 513 or 543), the first response should be a unit response (‘five’), and teen responses (‘thirteen’) and tens responses (‘forty’) should be momentarily inhibited.

Finally, figures 4(f)–(h) show the same distributions but for the 100 numbers to which this hidden node is most responsive. These panels show that this hidden node is most responsive to numbers containing a zero at the tens position, such as 504. Indeed, in numbers such as these, teen and tens responses should never be uttered. Hence, although the hidden node appears to encode a strict rule, it does, nevertheless, distinguish between the numbers within the two categories (smaller than or equal to 99 versus larger than 99) it has created.
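The analysis behind figure 4 can be sketched as follows (first_step_activation is an assumed helper that runs the trained model up to the first hidden update; this is our reconstruction of the procedure, not the authors' code):

```python
import numpy as np

def analyse_hidden_node(first_step_activation, node=0):
    numbers = np.arange(1, 1000)
    # First-step activation of one hidden node for every number 1-999.
    act = np.array([first_step_activation(n)[node] for n in numbers])
    order = np.argsort(act)            # sorted values = empirical CDF support
    least = numbers[order[:100]]       # the 100 least responsive numbers
    hundreds = least // 100            # digit composition, cf. figure 4(c)
    return act, np.bincount(hundreds, minlength=10)
```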

Until now, we have only looked at the hidden node’s behaviour at the first time step, when there is only input from the input field. On later time steps, there is also input from the hidden and response fields. To see what happens to this hidden node after the first time step, consider figure 5. This figure shows the activation values of the hidden nodes (left column) and response nodes (right column) at all time steps when ‘145’ is presented. Each row corresponds to two time steps. Network node activations are depicted by a black-to-white code (black: activation = 1; white: activation = 0). As can be seen, the network produces the correct response sequence because the different response primitives are activated sequentially (figure 5, second column; response ‘one’ at time step 2, ‘hundred’ at time step 4, and so on, until a ‘done’ response is given at the last time step), and incorrect responses are completely inactivated.

The first hidden node is activated at all time steps except the third one. At the third time step, the required response is ‘forty’, so the tens responses should not be inhibited at this point.

Figure 5. Evolution of the hidden node (first column) and response node (second column) activation values across different time steps (two steps on each row) when input string ‘145’ is presented. The first row of nodes in the phonology field stands for responses ‘one’ to ‘nineteen’. The second row represents responses ‘twenty’ to ‘hundred’ and ‘done’. Grey scale of a node represents its activation value, ranging from activation = 0 (white) to activation = 1 (black).


Consequently, the inhibition toward the tens responses originating from this hidden node is itself blocked (because the hidden node is inactivated). Hence, the hidden node itself, and the rule it implements, can also be overruled. This overruling must be accomplished by a combination of the hidden and response nodes, because the activation of the input nodes is constant throughout the response sequence.

A summary plot for another hidden node (hidden node 2) is shown in figure 6. Figure 6(b) shows that this node activates the unit responses ‘two’ and ‘four’, but inhibits ‘one’, ‘three’, ‘five’, ‘seven’ and ‘nine’. Figures 6(c)–(e) show that it is least responsive to numbers that contain 2 or 4 (or 8) at the units position. This makes sense because, if a number contains 2 or 4 at the units position, responses ‘two’ and ‘four’ should be inhibited as the first response (in almost all cases). On the contrary, this hidden node is very responsive to numbers containing 2 or 4 at the hundreds position (see figure 6(f)), because the corresponding unit response should be given at the first step in the response sequence. At the same time, the node is very unresponsive to numbers containing 1, 3 or 5 at the hundreds position (figure 6(c)), but very responsive to numbers containing 1, 3, 5, 7 or 9 at the units position (figure 6(h)). As an example, suppose number 235 is presented. In this case, response ‘five’ needs to be inhibited, because it is activated via the (direct) lexical path but is not appropriate as a first response. This hidden node accomplishes this. Summarizing this node’s behaviour on the first time step, this hidden node can be considered to implement the rule:

IF number contains 2 or 4 at the hundreds position
OR number contains 1, 3, 5, 7, or 9 at the units position
THEN activate responses ‘two’ and ‘four’
AND inhibit responses ‘one’, ‘three’, ‘five’, ‘seven’, and ‘nine’.

Figure 6. (a) Cumulative distribution of activation values of hidden node 2 after the first time step. (b) Connections from hidden node 2 to response (phonology) nodes. (c) Histogram of hundreds-position values of the 100 numbers for which hidden node 2 was least responsive. (d, e) Similar histograms for tens- and units-position values, respectively. (f–h) Similar histograms of the 100 numbers for which hidden node 2 was most responsive.


In a number of ways, the behaviour of this hidden node is less rule-like than that of the first hidden node. First, the rule it implements is considerably more complex. Second, as can be seen in figure 6(a), the CDF of this hidden node is not ‘ramp-like’ like that of the previous one. As a result, there is no clear distinction between numbers to which it does and does not respond; rather, it responds to all numbers to some degree. Third, whereas the first hidden node was either active or inactive on consecutive time steps (upon presentation of number 145; see figure 5), the current hidden node is active on time steps 2–6 but only partially active on time step 1 (see figure 5, first column, second node).

Response profiles for the other eight hidden nodes are similar to these two. In general, each hidden node implements a rule that is either strict (as for hidden node 1), not strict at all (as for hidden node 2), or somewhere in between. Because these hidden nodes encode such approximate rules, which ensure that number words are used in the proper order, it makes sense to call the pathway via the hidden nodes a syntactic pathway.

To sum up, the direct route (arrow 1 in figure 1(a)) implements a lexical route, and the indirect route (arrows 2–5 in figure 1(a)) implements syntax. Although only one simulation has been described here, this result was obtained over a number of parameter specifications (see, e.g. simulation 2). Note that we did not force the model to make this lexicon/syntax distinction: the model was free to spread the labour of lexical and syntactic processing across the two pathways in any way. Instead, the lexicon/syntax distinction emerged as a result of trying to minimize pronunciation error in multi-digit naming. Such a division of labour between routes has been reported in a number of studies (e.g. Rueckl et al. 1989, Zorzi et al. 1998, Chang 2002, Gordon and Dell 2002, Harm and Seidenberg 2004). We shall come back to this issue in section 9.

The analysis of hidden node behaviour also clarifies how the model solves the task of number naming. It does not construct a syntactic frame into which successively more words are inserted. Instead, all digits in the input field try to activate appropriate responses via the direct route; for example, a 2 in the units position will try to activate responses ‘two’ and ‘twelve’, because these are the two possible responses given this digit at this position (see figure 1(d)). The role of the hidden nodes is selectively to inhibit and excite response nodes, depending on the other digits in the input string and depending on the position within the response sequence. This behaviour is similar to the syntactic path in the model described by Gordon and Dell (2002, 2003). They trained a model to produce sentences when messages were presented at the input layer. The input layer consisted of syntactic nodes (determiner, noun, verb) and content nodes (e.g. wing, female). The authors found that the syntactic nodes came to function as a kind of ‘traffic cop’: they did not determine the particular word that was needed at a particular point within a sentence, but only restricted the class of allowed words (e.g. only nouns). This finding, in conjunction with the results reported in this paper, shows that a division of labour between a content-based route and a syntactic route that places restrictions on the set of allowed words is a natural solution to the production of structured sequences such as sentences or multi-digit numbers.

5. Simulation 2: generalization

Given that the model encodes (approximate) rules in its hidden nodes, at least one requirement for high generalization power seems to be fulfilled. To illustrate the generalization properties of the model, we depict in figure 7 the response sequence generated by the model upon presentation of the input string ‘45’. Comparing figures 5 and 7, one can see that the hidden node pattern is quite similar at corresponding responses (i.e. responses ‘forty’, ‘five’ and ‘done’).


Figure 7. Hidden and response node activation upon presentation of input string ‘45’. See figure 5 for further details.

To quantify this similarity, we calculated the correlations between the hidden node patterns for these three responses: they were 0.41, 0.94 and 0.83, respectively. Hence, although the response sequence is considerably different for 145 and 45, there is generalization from one pattern to the other, as indicated by the hidden node correlational patterns.
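The similarity measure here is simply the Pearson correlation between the two 10-dimensional hidden patterns at corresponding response steps, e.g.:

```python
import numpy as np

# h145 and h45 are assumed 10-dimensional arrays of recorded hidden
# activations for the two trials at one corresponding response step.
def pattern_correlation(h145, h45):
    return np.corrcoef(h145, h45)[0, 1]
```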

To examine the model’s generalization ability more directly, we ran the model anew but on a training set from which numbers 330–389 (60 numbers) were removed. Otherwise, the procedure was the same as in simulation 1.
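Constructing such a holdout training set amounts to zeroing the sampling probability of the held-out numbers and renormalizing, e.g. (a sketch, with p the training distribution of figure 2):

```python
def holdout_distribution(p):
    p = p.copy()           # p[i] is the probability of number i + 1
    p[329:389] = 0.0       # numbers 330-389 are never sampled during training
    return p / p.sum()     # but they are still presented at test
```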

5.1 Results

After training, accuracy was 100% on numbers 1–999. The model solved the problem of number naming in the same way as in simulation 1: the direct path came to function as a lexical path (cf. figure 3 for simulation 1) and the hidden nodes came to implement (approximate) rules. Because a lexical path and syntactic rules are constructed, it should not matter if a set of (non-critical) numbers is left out of the training set: if numbers 330–389 are later presented at test, naming them can be achieved with the pattern of weights generated by the other numbers. As a result, the model generalizes very well.

6. Simulation 3: lesion to the direct lexical pathway

In the following three lesion studies, we start from the trained model described in simulation 1. First, our aim was to see what would happen if the model’s lexical route were impaired. We performed a lesion to the direct route (arrow 1 in figure 1(a)) of the trained network described in simulation 1. We did this by adding zero-mean Gaussian noise (standard deviation 1/4) to the weights in the direct route. Given that these values represent noise, different values were injected for each number that was presented to the network. In order to guarantee the generality of the conclusions, this procedure was repeated three times, with new noise values added to the weights for each simulation round. Results were very similar across the three simulations, and we report only the mean results (aggregated over the three simulation rounds).
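The lesioning procedure can be sketched as follows (a minimal illustration; the matrix name is ours). Fresh noise is drawn for every presented number, leaving the underlying intact weights unchanged:

```python
import numpy as np

rng = np.random.default_rng(1)

def lesioned_weights(W_direct, sd=0.25):
    # Zero-mean Gaussian noise (sd = 1/4 for the direct route) is added to
    # the intact weights; a new sample is drawn on every trial.
    return W_direct + rng.normal(0.0, sd, size=W_direct.shape)
```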

We distinguish between legal and illegal errors, where illegal errors are those in which an illegal string is formed (e.g. ‘hundred hundred six’). Regarding legal errors, we follow earlier conventions (Deloche and Seron 1982, Cipolotti 1995) and distinguish between lexical and syntactic errors.


Table 1. Error percentages in the lesion simulations.†

                                            Illegal   Syntactic   Lexical-       Lexical-        Lexical-
                                            errors    errors      within stack   between stack   other     Total
Direct route lesion (simulation 3)            15.7      18.0         62.9            3.4            0        3.0
Hidden to response lesion (simulation 4)      14.6      26.0         58.3            1.0            0        3.2
Input to hidden lesion (simulation 5)         10.3      81.2          8.5            0              0        3.9

†The rightmost column shows the total percentage of errors; the other columns report the relative percentages of the different error types.

Lexical errors are those in which the correct syntactic structure is preserved, but one or more responses are incorrect. Further, within lexical errors, we distinguish between errors of the within-stack type, in which the response is incorrect but from the correct stack (e.g. 542 → ‘five hundred sixty two’; 542 → ‘five hundred forty seven’), and errors of the between-stack type, in which the position in a stack is correct but the wrong stack is used (e.g. 40 → ‘fourteen’). Other lexical errors are simply classified as ‘other lexical errors’ (e.g. 40 → ‘sixteen’). Finally, errors in which the syntactic frame is incorrect (but the response sequence is legal) are called syntactic errors. For example, if 543 is transcoded as ‘five hundred three’, the response sequence is legal, but the number of words is incorrect (three instead of four).
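This taxonomy can be made concrete with a small classifier (a simplified sketch of the conventions above, not the exact scoring code; ‘ten’, ‘hundred’ and ‘done’ are treated as frame words here, and legal is an assumed well-formedness predicate):

```python
# Words are organized in three stacks, each with a digit position.
STACKS = {
    'unit': ['one', 'two', 'three', 'four', 'five',
             'six', 'seven', 'eight', 'nine'],
    'teen': ['eleven', 'twelve', 'thirteen', 'fourteen', 'fifteen',
             'sixteen', 'seventeen', 'eighteen', 'nineteen'],
    'tens': ['twenty', 'thirty', 'forty', 'fifty', 'sixty',
             'seventy', 'eighty', 'ninety'],
}
OFFSET = {'unit': 1, 'teen': 1, 'tens': 2}   # maps a word to its digit

def stack_pos(word):
    for stack, words in STACKS.items():
        if word in words:
            return stack, words.index(word) + OFFSET[stack]
    return None, None                        # frame word

def classify(target, produced, legal):
    if produced == target:
        return 'correct'
    if not legal(produced):                  # e.g. 'hundred hundred six'
        return 'illegal'
    if len(produced) != len(target):         # word omitted or added
        return 'syntactic'
    kinds = set()
    for t, p in zip(target, produced):
        if t == p:
            continue
        (ts, ti), (ps, pi) = stack_pos(t), stack_pos(p)
        if ts is not None and ts == ps:
            kinds.add('lexical-within')      # 542 -> 'five hundred sixty two'
        elif ti is not None and ti == pi:
            kinds.add('lexical-between')     # 40 -> 'fourteen'
        else:
            kinds.add('lexical-other')       # 40 -> 'sixteen'
    return kinds.pop() if len(kinds) == 1 else 'lexical-other'
```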

6.1 Results

The error percentages of the different error types (over the three simulation rounds) are shown in the first row of table 1. The overall error rate is quite low (3.0%). The majority of the errors are of the lexical within-stack type (62.9%). Typical examples are 357 → ‘three hundred fifty three’ and 992 → ‘nine hundred thirty two’. The most common syntactic errors are omissions (e.g. 921 → ‘nine hundred one’, or 921 → ‘nine hundred’).

The performance of the lesioned model can be compared with that of patient HY, described by McCloskey et al. (1986). Like the model, his overall error percentage was rather low (8.9% for one- to three-digit numbers). Further, most of his errors were within-stack lexical errors (94.2%, aggregated over one- to six-digit numbers). Hence, one possible interpretation of this patient’s behaviour is that his lexical path was damaged.

7. Simulation 4: lesion to the indirect syntactic pathway—impaired connections from hidden to response nodes

We performed a lesion to the indirect pathway (via the hidden nodes) of the network described in simulation 1. We did this by adding zero-mean Gaussian noise (standard deviation 1/4.6) to the weights in the hidden-layer-to-response pathway. The standard deviation was chosen so as to maintain approximately the same error percentage as in simulation 3. This procedure was again repeated three times, and data were aggregated over the three simulation rounds.

7.1 Results

As can be seen in table 1, the error pattern was quite similar to that in simulation 3: within-stack errors were the most frequent error type.


Hence, a second possible interpretation of patient HY’s behaviour is that it was not so much his lexical path that was damaged, but rather the pathway from the syntactic nodes to output.

7.2 Discussion

To explain why within-stack errors are generated when noise is injected in this path, we visualized the hidden node activation values in a different manner than before. First, a principal components analysis (PCA) was performed on the hidden node activation values after the first time step. We then calculated the ‘factor scores’ of all numbers 1–999 on the first two principal components. In this way, an approximation was obtained of the original 10-dimensional hidden node space. The resulting plot is shown in figure 8(a). For clarity, only the hidden node space representations of numbers 11–19 and 21–29 are shown. As can be seen, there is a tendency for the hidden nodes to differentiate between the two sets of numbers, but no linear separation is possible. Figure 8(b) shows the factor plot after the third time step (i.e. just before the second response; hidden node activation after the second time step is equal to that after the first time step). Now the two sets of numbers can be linearly separated. Further, the two sets of numbers form distinct ‘clouds’ (with the exception of numbers 16 and 19; this is because the PCA factors are only an approximation to the underlying hidden node space). This separation does not reflect number similarity in input space. For example, in the input space, the similarity between 11 and 12 is the same as the similarity between 11 and 21. Yet, in hidden node space, 11 and 12 are closer than 11 and 21. The reason for this differentiation is that, for example, 11 and 12 require the same second response (‘done’), but 11 and 21 do not (for 21, ‘one’ is the appropriate second response). From this analysis, it follows that noise added to these representations leads to lexical errors. For example, numbers 21–29 will be mutually confused, and numbers 11–19 will be mutually confused, leading to within-stack errors.
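The visualization can be reproduced in a few lines (a sketch; H is an assumed 999 × 10 array of recorded hidden activations, one row per number):

```python
import numpy as np

def first_two_components(H):
    Hc = H - H.mean(axis=0)                  # centre each hidden dimension
    U, s, Vt = np.linalg.svd(Hc, full_matrices=False)
    scores = Hc @ Vt[:2].T                   # factor scores on PC1 and PC2
    return scores                            # rows 10-18 hold numbers 11-19
```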

Based on the finding that the errors made by patient HY preserved lexical class, McCloskey et al. (1986) concluded that these different response types were stored in three different stacks: one for the units, one for the teens and one for the tens. The present model provides an implementation of the stack concept. It also illustrates why a stack structure would be formed in a number naming system: all numbers within a stack require a similar response sequence (cf. 11 and 12 on the one hand and 21 on the other). The role of the hidden nodes is to make sure that the response sequence is syntactically appropriate for the input number (while the lexical path fills in the appropriate verbal primitives), and therefore the hidden nodes implement the stack structure.

Figure 8. (a) Position of numbers 11–19 and 21–29 in hidden node space, on time step 1, as approximated by PCA. (b) Similar plot on time step 3.


8. Simulation 5: lesion to the indirect syntactic pathway—impaired connections from input to hidden nodes

In the next simulation, we added Gaussian noise to the input-to-hidden node connections (standard deviation 1/8).

8.1 Results

The last row of table 1 shows the results. In this case, most errors were of the syntactic type. Most of these syntactic errors were omissions (e.g. 722 → ‘seven hundred twenty’), but there were also a few additions (e.g. 30 → ‘thirty four’).

In this case, errors were not of the within-stack lexical type. This is because the mapping from input nodes to hidden nodes was damaged, and consequently the hidden node stack structure, as implemented by a pattern of activation values over the hidden nodes, was disrupted before it could be constructed. Instead, because the required syntactic restrictions on the output system could not be imposed correctly by the hidden nodes, errors were of the syntactic type.

8.2 Discussion

In conclusion, simulation studies 3–5 show that lexical and syntactic errors can be dissociated, in line with the patient literature (e.g. Cohen and Dehaene 1991, Cipolotti 1995). The model also allows us to make sense of the dissociation reported in the literature between naming single- and multi-digit numbers (e.g. Cohen and Dehaene 1995, 2000). Patients suffering from pure alexia perform relatively well in single-digit naming, but are much worse at multi-digit naming. The problem is not so much a syntactic one, because syntactic structure is usually correct; most errors are digit substitutions (within-stack errors). Hence, a patient may transcode ‘4’ as ‘four’, but ‘34’ as ‘thirty seven’. Our framework predicts this finding because it posits two separate naming routes, one semantic (mental-number-line based; cf. the earlier discussion of the Verguts et al. (2005) model) and one asemantic (for large numbers; figure 1(a)). Since the semantic route can read small numbers (but not large numbers), a lesion to either the direct lexical path (cf. simulation 3) or the hidden-to-response path (cf. simulation 4) produces the dissociation. Interestingly, our framework predicts that the difference between the two types of number naming does not necessarily coincide with the difference between single- and multi-digit numbers: small two-digit numbers (10, 11, . . .) are relatively frequent also, so direct semantic connections can be formed for these numbers, and they may be relatively spared in pure alexia. Hence, we predict a gradual deficit in number naming as one proceeds toward larger numbers, rather than a strict dissociation between single- and multi-digit number naming. This account remains to be tested.

9. General discussion

We have described a neural network model that is able to read multi-digit numbers. The model was given digit input strings (e.g. ‘546’) and was trained to generate the correct sequence of words corresponding to the input string. The model constructed appropriate (but approximate) rules that were used in generating the correct word on each time step. We interpreted a number of patient data reported in the literature in terms of this model. Exactly which aspects of the model are required by the data and which are not remains to be seen, however. For example, it could well be that a model trained without a direct route would also account for the double dissociation between lexical and syntactic errors.


Yet, we do argue that the present model constitutes a cogent interpretation of the available number naming data.

In the context of our discussion of the Barrouillet et al. (2004) transcoding model, we pointed out that no specific hypotheses have yet been raised as to how children learn to transcode. The model presented here provides such a hypothesis: children learn from extensive practice with, and immediate feedback on, concrete transcoding experiences. Explicit rule instruction is not necessary: the child will extract and store a set of rules sufficient for solving the task. One other prediction that follows immediately is that the set of rules extracted is not necessarily the same across individuals—any set of rules that solves the task (and is reachable in weight space) will do. For example, the two rules described in the context of simulation 1 were not extracted in the model of simulation 2. Further empirical work is required to test this hypothesis. Because the model generated such rules, it was able to cope with a very skewed (but realistic) number distribution (cf. figure 2), and could generalize to input strings that were never presented in the training corpus. The model achieved the task by activating all possibly appropriate responses in parallel via the input-to-response path (lexical pathway), and then selectively inhibiting all but one response via the hidden nodes (syntactic pathway). The stack concept that was introduced earlier in the context of symbolic models of number transcoding (e.g. McCloskey et al. 1986) was given a new interpretation in terms of the model. A number of lesion studies showed that a dissociation between lexical and syntactic errors could be obtained depending on which part of the model was lesioned.

9.1 Semantic and asemantic transcoding

In conjunction with the Verguts et al. (2005) naming model for small numbers, the model presented here can be considered a two-route model, because there is a semantic and an asemantic route, both of which can name small numbers, but only the asemantic route can also name large numbers. Terminologically, this corresponds to many earlier proposals in the literature that have also proposed a two-route (semantic/asemantic) naming system (e.g. Cohen et al. 1994, Cipolotti and Butterworth 1995). The asemantic route is conceptually similar across the different models (except that ours is explicit), but the semantic system is different. Cipolotti and Butterworth’s semantic system involves a base-ten decomposition of each number to be transcoded (in line with McCloskey 1992). In our view, the similarities between their semantic and asemantic systems outweigh the differences: both are essentially rule-based systems. The semantic route proposed by Cohen and Dehaene is a route that is used for familiar numbers only (e.g. 1789 in French culture). While we do not question the existence of such a semantic route, it is unclear whether it is important in ordinary number reading. The evidence for such an extra route comes from a patient who named familiar numbers with higher accuracy than unfamiliar numbers, but most of these familiar numbers were named very slowly and only after a long chain of semantic associations. Numbers that were named fast were of similar frequency among familiar and unfamiliar numbers. To conclude, we propose that in ordinary number naming there are (at least) two routes: one semantic, which is used for small and very high-frequency numbers, and one asemantic, which is itself composed of a lexical and a syntactic pathway.

9.2 Language and number

Another prediction from the model is that syntactic processes in language and in number naming can be dissociated. Earlier authors have suggested that numerical abilities requiring structure sensitivity (e.g. number naming, or operating with bracketed calculations such as 5 × (6 + 1)) use abstract-level syntactic properties such as recursion that are also used in language processing (e.g. Hauser et al. 2002, Semenza et al. 2004).


The problem with this argument is that it has never been spelled out in sufficient detail; no one has shown how exactly an abstract property like ‘recursion’ may be implemented and recruited usefully across both language and numerical skills (although there have been exact proposals about how recursion is implemented for sentence production; Christiansen and Chater 1999). Our model suggests, instead, that training on a specific task that requires structure sensitivity, such as number naming, may recruit a set of nodes that exhibits this property, without these nodes being used (or useful) in other domains. In this way, structure sensitivity in sentence production and number processing can be dissociated. In fact, such a dissociation has been reported recently (Varley et al. 2005). These authors described three patients who were severely impaired in grammar, but all were essentially perfect in Arabic-to-verbal number transcoding and vice versa. As noted by Gelman and Butterworth (2005) in a more general context, it may well be that language facilitates number processing, but it is something different altogether to claim that language and number processing derive from a common (as yet underspecified) source.

9.3 Division of labour in neural networks

As mentioned earlier, this is not the first paper to report a division of labour between different paths in neural network models. Rather, it has been shown a number of times that a neural network model has the tendency to split a particular task into different components (e.g. Chang 2002, Harm and Seidenberg 2004). At least two reasons can be distinguished for why this would be so. First, error-correction learning tends to force different parts of the network to learn different things (Gordon and Dell 2003). As a simple example, consider the delta rule, the one-layer-network variant of backpropagation. This learning rule predicts the blocking phenomenon of classical conditioning due to its error-driven learning property. This is because, according to this rule, once a given cue is already associated with a given response, a second cue will not be associated with that same response, because there is no more error left to be ‘accounted for’. More generally, a given part of a network trained by error-driven learning will develop weights only to the extent that there is error left to be accounted for by other parts of the network. At least in this sense, error-driven learning does not tolerate redundancy. A second reason for division of labour is architectural constraints. If one part of a network is particularly good at one task and another part of the network at another, then error-driven learning will exploit these differences. As an example, Shillcock et al. (2000) proposed that because the left-side letters of a word are projected to the right hemisphere and vice versa, this hemispheric distinction is preserved throughout the visual word processing stream. In the model described in this paper, there are also clear architectural differences between the two paths: one passes through a hidden node layer, and the other does not. In at least some situations, the combination of error-driven learning and broad architectural differences appears to be sufficient to generate a modular (cognitive) system.
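The blocking argument can be demonstrated in a few lines with the delta rule (a self-contained toy example, not taken from the paper):

```python
import numpy as np

w = np.zeros(2)                # weights for cues A and B
lr, target = 0.2, 1.0

for _ in range(100):           # phase 1: cue A alone predicts the outcome
    x = np.array([1.0, 0.0])
    w += lr * (target - w @ x) * x

for _ in range(100):           # phase 2: compound A+B predicts the same outcome
    x = np.array([1.0, 1.0])
    w += lr * (target - w @ x) * x

# A already accounts for the error, so B acquires almost no weight:
print(w)                       # w[0] close to 1, w[1] close to 0 (blocking)
```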

9.4 Rules and connections

One of the most vigorous debates in cognitive science of the last 20 years has been the rules-versus-connections debate (e.g. McClelland and Plaut 1999, Pinker and Ullman 2002). We think there are some definite arguments in favour of rules determining at least some domains of (human) behaviour. First, empirically, Wallis et al. (2001) found evidence for the existence of abstract rules with single-unit recordings in monkey prefrontal cortex. Second, conceptually, humans and other (higher) animals are able to perform 'far' generalization, in the sense of generalization to stimuli that are not obviously similar to stimuli in the training set. For example, students of physics must be able to generalize the concept of gravity from apples falling from a tree to planets orbiting the sun. For this purpose, abstract rules are very well suited.

These and other arguments have led many authors to conclude that a connectionist system is insufficient and that a hybrid system is required to account for (human) cognition. In this view, one subsystem would be associative (possibly connectionist) and a second one rule-based (e.g. Sloman 1996, Marcus et al. 1999, Pinker 1999). However, an alternative conclusion is that connectionist systems can acquire rule-like capabilities. In this respect, the ability of the Elman recurrent model (ERM) to extract structure from sequences of input has been well documented. Elman (1990) showed that an ERM trained on an artificial language was able to extract the grammatical categories from this language and represent them in its hidden nodes. Servan-Schreiber et al. (1991) trained an ERM on finite state grammars and found that the network represented the underlying states in its hidden nodes, irrespective of the current input pattern. Hence, both models encoded abstract (rather than stimulus-dependent) features in their hidden nodes. Later work with ERMs trained on finite state grammars showed that they could transfer knowledge learned in a particular domain to different domains and to different grammars, again stressing the symbolic capacities of this type of model (Dienes et al. 1999, Hanson and Negishi 2002). Hence, the fact that humans use rules does not imply that a connectionist system is insufficient to describe human rule-like behaviour. One might even conclude that the rules-versus-connections debate loses at least some of its force through these rules-from-connections demonstrations.
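
For readers unfamiliar with the architecture, the following sketch (Python with NumPy; the toy sequence, network sizes and learning parameters are our own illustrative choices, not those of any of the cited studies) implements the core of an ERM: the hidden layer receives the current input together with a copy of its own previous state, and error is backpropagated through the current time step only, as in Elman (1990). The training sequence repeats 'abb', so the successor of a 'b' is predictable only from context, never from the current input alone.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # In the repeating sequence 'abb', the successor of 'b' depends on whether
    # it is the first or the second 'b'; the current input alone cannot tell.
    seq = "abb" * 100
    onehot = {"a": np.array([1.0, 0.0]), "b": np.array([0.0, 1.0])}

    n_in, n_hid, n_out = 2, 6, 2
    Wxh = rng.normal(0.0, 0.5, (n_hid, n_in))    # input-to-hidden weights
    Whh = rng.normal(0.0, 0.5, (n_hid, n_hid))   # context-to-hidden weights
    Who = rng.normal(0.0, 0.5, (n_out, n_hid))   # hidden-to-output weights
    lr = 0.5

    for _ in range(100):                         # passes through the corpus
        h = np.zeros(n_hid)                      # context units start at rest
        for t in range(len(seq) - 1):
            x, target = onehot[seq[t]], onehot[seq[t + 1]]
            h_prev = h
            h = sigmoid(Wxh @ x + Whh @ h_prev)  # hidden = f(input, context copy)
            y = sigmoid(Who @ h)                 # prediction of the next token
            dy = (y - target) * y * (1.0 - y)    # output error signal
            dh = (Who.T @ dy) * h * (1.0 - h)    # backprop through this step only
            Who -= lr * np.outer(dy, h)
            Wxh -= lr * np.outer(dh, x)
            Whh -= lr * np.outer(dh, h_prev)

    # After training, the hidden state typically separates the two 'b' contexts,
    # so that 'b' is predicted after the first 'b' and 'a' after the second.

The crucial design feature is that the hidden nodes receive the previous hidden state as an extra input, so that they can come to encode abstract temporal context rather than the current stimulus alone.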

The present paper is squarely in this last tradition and, in fact, the model can be seen as a variant of the ERM. Yet it goes well beyond earlier demonstrations by showing rule emergence in a domain (multi-digit number reading) that is less artificial than in most earlier studies, and with a training corpus that is tightly linked to the empirical frequencies. It also clarifies how a strict rule/non-rule distinction is obsolete; hidden nodes can behave very rule-like (e.g. figure 4), very non-rule-like (e.g. figure 6), or anything in between. Yet, despite their apparent rule-like character, the rules generated by the model were never of the pure form postulated in traditional rule-based systems. For example, the rule

IF number is larger than 99
THEN activate Unit responses and inhibit Teen and Tens responses

encoded by the first hidden node applies more strongly to numbers of the form x0y (e.g. 504) than to other numbers larger than 99. Rules, according to the current framework, can be strict under certain circumstances, but are approximate by default.
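
The contrast can be made explicit in a few lines of code. The sketch below (Python; the weights and the digit features are invented for illustration and are not the trained weights of the model's first hidden node) contrasts an all-or-none symbolic rule with a graded version implemented as a sigmoid node, whose activation is stronger for numbers of the form x0y than for other numbers above 99.

    import numpy as np

    def strict_rule(n):
        # classical symbolic rule: all-or-none
        return 1.0 if n > 99 else 0.0

    def graded_rule(n, w_hundreds=3.0, w_tens=-1.5, bias=-1.0):
        # hypothetical sigmoid node: its activation (rule 'strength') depends
        # on the digit pattern, not just on membership of the category n > 99
        has_hundreds = float(n // 100 > 0)
        has_tens = float((n // 10) % 10 > 0)
        net = w_hundreds * has_hundreds + w_tens * has_tens + bias
        return 1.0 / (1.0 + np.exp(-net))

    for n in (57, 547, 504):
        print(n, strict_rule(n), round(graded_rule(n), 2))
    # 57 -> 0.0 and 0.08; 547 -> 1.0 and 0.62; 504 (form x0y) -> 1.0 and 0.88

Whereas the strict rule treats 547 and 504 identically, the graded node applies more strongly to 504, mirroring the approximate character of the rules that emerged in the model.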

To conclude, we have shown how a neural network model can solve the large-number/low-frequency problem in number naming by adopting a decompositional approach. Without being forced to do so, the model constructed a direct lexical path and an indirect syntactic path. The hidden nodes in the latter path came to encode (approximate) rules that captured functionally useful syntactic regularities. By lesioning either of the two pathways, the model was shown to account for some dissociations reported in the neuropsychological patient literature. Extending the model to make contact with response time data remains an open challenge.

Acknowledgements

The contribution of Wim Fias is supported by grant P5/04 from the Interuniversity Attraction Poles Programme (Belgian Science Policy).


References

P. Barrouillet, V. Camos, P. Perruchet and X. Seron, "ADAPT: a developmental, asemantic, and procedural model for transcoding from verbal to Arabic numerals", Psychol. Rev., 111, pp. 368–394, 2004.

I.V. Blair, G.R. Urland and J.E. Ma, "Using Internet search engines to estimate word frequency", Behav. Res. Meth., Instrum. Comput., 34, pp. 286–290, 2002.

M. Botvinick and D.C. Plaut, "Doing without schema hierarchies: a recurrent connectionist approach to normal and impaired routine sequential action", Psychol. Rev., 111, pp. 395–429, 2004.

F. Chang, "Symbolically speaking: a connectionist model of sentence production", Cognitive Sci., 26, pp. 609–651, 2002.

M.H. Christiansen and N. Chater, "Toward a connectionist model of recursion in human linguistic performance", Cognitive Sci., 23, pp. 157–205, 1999.

L. Cipolotti, "Multiple routes for reading words, why not numbers? Evidence from a case of Arabic numerical dyslexia", Cognitive Neuropsychol., 12, pp. 313–342, 1995.

L. Cipolotti and B. Butterworth, "Toward a multiroute model of number processing: impaired number transcoding with preserved calculation skills", J. Exp. Psychol.: Gen., 124, pp. 375–390, 1995.

L. Cohen and S. Dehaene, "Neglect dyslexia for numbers? A case report", Cognitive Neuropsychol., 8, pp. 39–58, 1991.

L. Cohen and S. Dehaene, "Number processing in pure alexia: the effect of hemispheric asymmetries and task demands", NeuroCase, 1, pp. 121–137, 1995.

L. Cohen and S. Dehaene, "Calculating without reading: unsuspected residual abilities in pure alexia", Cognitive Neuropsychol., 17, pp. 563–583, 2000.

L. Cohen, S. Dehaene and P. Verstichel, "Number words and number non-words—a case of deep dyslexia extending to Arabic numerals", Brain, 117, pp. 267–279, 1994.

S. Dehaene, "Varieties of numerical abilities", Cognition, 44, pp. 1–42, 1992.

S. Dehaene and J. Mehler, "Cross-linguistic regularities in the frequency of number words", Cognition, 43, pp. 1–29, 1992.

G.S. Dell, L.K. Burger and W.R. Svec, "Language production and serial order: a functional analysis and a model", Psychol. Rev., 104, pp. 123–147, 1997.

G.S. Dell, F. Chang and Z.M. Griffin, "Connectionist models of language production: lexical access and grammatical encoding", Cognitive Sci., 23, pp. 517–542, 1999.

G. Deloche and X. Seron, "From one to 1: an analysis of a transcoding process by means of neuropsychological data", Cognition, 12, pp. 119–149, 1982.

Z. Dienes, G.T.M. Altmann and S.J. Gao, "Mapping across domains without feedback: a neural network model of transfer of implicit knowledge", Cognitive Sci., 23, pp. 53–82, 1999.

J.L. Elman, "Finding structure in time", Cognitive Sci., 14, pp. 179–211, 1990.

R. Gelman and B. Butterworth, "Number and language: How are they related?", Trends Cognitive Sci., 9, pp. 6–10, 2005.

J.K. Gordon and G.S. Dell, "Learning to divide the labor between syntax and semantics: a connectionist account of deficits of heavy and light verb production", Brain Cognit., 48, pp. 376–381, 2002.

J.K. Gordon and G.S. Dell, "Learning to divide the labor: an account of deficits in light and heavy verb production", Cognitive Sci., 27, pp. 1–40, 2003.

S.J. Hanson and M. Negishi, "On the emergence of rules in neural networks", Neural Comput., 14, pp. 2245–2268, 2002.

M.W. Harm and M.S. Seidenberg, "Computing the meanings of words in reading: cooperative division of labor between visual and phonological processes", Psychol. Rev., 111, pp. 662–720, 2004.

M.D. Hauser, N. Chomsky and W.T. Fitch, "The faculty of language: What is it, who has it, and how did it evolve?", Science, 298, pp. 1569–1579, 2002.

M.I. Jordan, "Serial order: a parallel distributed processing approach", Institute for Cognitive Science Report 8604, University of California, San Diego, 1986.

G.F. Marcus, S. Vijayan, S.B. Rao and P.M. Vishton, "Rule learning by seven-month-old infants", Science, 283, pp. 77–80, 1999.

J.L. McClelland and K. Patterson, "Rules or connections in past-tense inflections: What does the evidence rule out?", Trends Cognitive Sci., 6, pp. 465–472, 2002.

J.L. McClelland and D.C. Plaut, "Does generalization in infant learning implicate abstract algebra-like rules?", Trends Cognitive Sci., 3, pp. 166–168, 1999.

J.L. McClelland, D.C. Plaut, S.J. Gotts and T.V. Maia, "Developing a domain-general framework for cognition: What is the best approach?", Behav. Brain Sci., 26, pp. 611–614, 2003.

M. McCloskey, "Cognitive mechanisms in numerical processing: evidence from acquired dyscalculia", Cognition, 44, pp. 107–157, 1992.

M. McCloskey, S.M. Sokol and R.A. Goodman, "Cognitive processes in verbal-number production: inferences from the performance of brain-damaged subjects", J. Exp. Psychol.: Gen., 115, pp. 307–330, 1986.

S. Pinker, "Cognition—out of the minds of babes", Science, 283, pp. 40–41, 1999.

S. Pinker and M.T. Ullman, "The past and future of the past tense", Trends Cognitive Sci., 6, pp. 456–463, 2002.

D.C. Plaut, "A connectionist approach to word reading and acquired dyslexia: extension to sequential processing", Cognitive Sci., 23, pp. 543–568, 1999.


R. Power and M.F. Dal Martello, "From 834 to eighty thirty four: the reading of Arabic numerals by seven-year-old children", Math. Cognit., 3, pp. 63–85, 1997.

B. Reynvoet, M. Brysbaert and W. Fias, "Semantic priming in number naming", Q. J. Exp. Psychol. A, 55, pp. 1127–1139, 2002.

J.G. Rueckl, K.R. Cave and S.M. Kosslyn, "Why are 'What' and 'Where' processed by separate cortical visual systems? A computational investigation", J. Cognitive Neurosci., 1, pp. 171–186, 1989.

D. Rumelhart, G. Hinton and R. Williams, "Learning internal representations by error propagation", in Parallel Distributed Processing, D.E. Rumelhart, J.L. McClelland and the PDP Research Group, Eds, Cambridge, MA: MIT Press, 1986, pp. 318–362.

M.S. Seidenberg and J.L. McClelland, "A distributed, developmental model of word recognition and naming", Psychol. Rev., 96, pp. 523–568, 1989.

C. Semenza, M. Delazer, L. Bartha, F. Domahs, L. Bertella, A. Grana, I. Mori, R. Pignatti and F.M. Conti, "Mathematics in right hemisphere aphasia: a case series study", Brain Lang., 91, pp. 164–165, 2004.

D. Servan-Schreiber, A. Cleeremans and J.L. McClelland, "Graded state machines—the representation of temporal contingencies in simple recurrent networks", Machine Learn., 7, pp. 161–193, 1991.

R. Shillcock, T.M. Ellison and P. Monaghan, "Eye-fixation behavior, lexical storage and visual word processing in a split processing model", Psychol. Rev., 107, pp. 824–851, 2000.

S.A. Sloman, "The empirical case for two systems of reasoning", Psychol. Bull., 119, pp. 3–22, 1996.

R.A. Varley, N.J.C. Klessinger, C.A.J. Romanowski and M. Siegal, "Agrammatic but numerate", Proc. Natl Acad. Sci. USA, 102, pp. 3519–3524, 2005.

T. Verguts, W. Fias and M. Stevens, "A model of exact small-number representation", Psychon. Bull. Rev., 12, pp. 66–80, 2005.

J.D. Wallis, K.C. Anderson and E.K. Miller, "Single neurons in prefrontal cortex encode abstract rules", Nature, 411, pp. 953–956, 2001.

M. Zorzi, G. Houghton and B. Butterworth, "Two routes or one in reading aloud? A connectionist dual-process model", J. Exp. Psychol.: Hum. Percept. Perform., 24, pp. 1131–1161, 1998.