
www.elsevier.com/locate/COGNIT

Cognition 103 (2007) 413–459

Re-thinking stages of cognitive development: An appraisal of connectionist models of the balance scale task

Philip T. Quinlan a,*, Han L.J. van der Maas b, Brenda R.J. Jansen b, Olaf Booij b, Mark Rendell a

a Department of Psychology, University of York, Heslington, York YO10 5DD, UK
b University of Amsterdam, The Netherlands

Received 21 March 2005; revised 31 January 2006; accepted 9 February 2006

Abstract

The present paper re-appraises connectionist attempts to explain how human cognitive development appears to progress through a series of sequential stages. Models of performance on the Piagetian balance scale task are the focus of attention. Limitations of these models are discussed and replications and extensions to the work are provided via the Cascade-Correlation algorithm. An application of multi-group latent class analysis for examining performance of the networks is described and these results reveal fundamental functional characteristics of the networks. Evidence is provided that strongly suggests that the networks are unable to acquire a mastery of torque and, although they do recover certain rules of operation that humans do, they also show a propensity to acquire rules never previously seen.

© 2006 Elsevier B.V. All rights reserved.

Keywords: Connectionist models; Balance scale task; Latent class analysis

0010-0277/$ - see front matter © 2006 Elsevier B.V. All rights reserved.

doi:10.1016/j.cognition.2006.02.004

This manuscript was accepted under the editorship of Jacques Mehler.
* Corresponding author. Tel.: +44 1904 433 135; fax: +44 1904 433 181.

E-mail address: [email protected] (P.T. Quinlan).


414 P.T. Quinlan et al. / Cognition 103 (2007) 413–459

1. Introduction

The primary purpose of this paper is to provide an in-depth appraisal of a particular connectionist approach to modelling the acquisition of knowledge about the operation of a simple balance beam. Despite this rather narrow focus, there is a much wider context to the work that touches on several fundamental issues about the nature of human knowledge and human knowledge acquisition. Discussion of these key issues is contained in the introductory sections of the paper. A more particular introduction to the relevant empirical literature is provided next, and having set out the theoretical and empirical basis of the research, a detailed discussion of connectionist modelling is then included. A description of the Cascade-Correlation (CC: Fahlman, 1988) algorithm is given and previous CC simulations concerning performance on the balance scale task are considered. A replication of this work is reported and the capabilities of the CC algorithm, when learning about the balance scale, are discussed. Next a less well-known statistical technique, namely latent class analysis (LCA: Clogg, 1995; Goodman, 1975; McCutcheon, 1987), is described and two independent sets of simulations involving the balance scale task are reported. The data from these simulations are analysed via LCA and a much clearer picture of the CC algorithm’s behaviour on the balance scale task emerges. It will be concluded that, although the algorithm does show evidence of acquiring some rules of operation of the balance scale that humans also acquire, it categorically fails to acquire a proper mastery of torque. Moreover, in acquiring human-like rules it also acquires rules of operation that have never been previously seen.

2. General theoretical concerns

Of the many issues that surround debates concerning the validity of connectionist models of cognitive function, one that is central concerns the status of mental rules (see Marcus, 2001; in particular, Chapter 3). Opposing views on this issue may be characterized as the establishment and the connectionist positions, respectively (as discussed by Fodor & Pylyshyn, 1988). Some insight into both positions can be gleaned from something that has come to be known as “The Past Tense Debate” (see Pinker & Ullman, 2002).¹ Central to this debate are certain basic facts about English morphology and the morphological relations between past- and present-tense forms of verbs. Historically it has been accepted that English verbs can simply be divided into regular and irregular forms, but even this general classification scheme has been the focus of some debate (McClelland & Patterson, 2002). The distinction between these verb forms was justified on the following grounds. For regular verbs, past-tense forms are generated from the present-tense form via the rule of adding the suffix ed (e.g., walk → walked). In contrast no such generic rule applies to irregular verbs, and because of this, the particular morphological transformations must be specified on, essentially, an item-by-item basis (e.g., go → went). In adopting this framework, and according to the establishment view, there is an important division between a form of grammatical system that specifies the rules of morphology that apply to regular verbs and a memory system or lexicon that specifies “the thousands of arbitrary sound-meaning pairings that underlie morphemes and simple words of a language” (Pinker & Ullman, 2002, p. 456). Separate cognitive systems are posited for the generation of past tense regular and irregular verbs, respectively.

¹ This is not the place to provide an in-depth review of the extensive, and at times, acrimonious past-tense debate. The interested reader is directed towards the various interchanges between Pinker and colleagues and McClelland and colleagues – some of which are referenced in the main body of the text. The only intention here is to provide a brief discussion of a concrete example of how the notion of a mental rule has been used in certain theoretical approaches to cognition.
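The division between a rule system and a lexicon, as the establishment view describes it, can be made concrete with a toy sketch. The three-entry lexicon and the function names are hypothetical, and the rule operates on spelling only, so it ignores orthographic adjustments such as consonant doubling or a stem-final e. It is intended as an illustration of the idea of a rule defined over a variable, not as a model taken from the past-tense literature.

```python
# Hypothetical illustration: a tiny lexicon of irregular forms plus a default rule.
IRREGULAR_PAST = {"go": "went", "sing": "sang", "take": "took"}


def regular_past(stem: str) -> str:
    """Apply the rule 'add the suffix ed' to any stem (the variable)."""
    return stem + "ed"


def past_tense(stem: str) -> str:
    """Use the stored item if one exists; otherwise fall back on the rule."""
    return IRREGULAR_PAST.get(stem, regular_past(stem))


print(past_tense("walk"))   # regular verb: handled by the rule
print(past_tense("go"))     # irregular verb: handled by the lexicon
print(past_tense("blick"))  # novel stem: the rule still applies
```

Because the rule is stated over a variable, it extends to stems that were never stored, which is the property at issue in the debate over whether such rules play a causal role.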

Of the many (apparently inflammatory) claims being made here, there is one that stands out as being particularly contentious and this is the one concerning mental rules. The claim is that, fundamentally, mental operations comprise combinatorial operations defined over variables. This has been succinctly discussed by Marcus (2001) in terms of something known as universally quantified one-to-one mappings. According to him “A function is universally quantified when it applies to all instances in its domain” (e.g., “For all x such that x is a verb stem”) and it is “one-to-one if each output maps onto a single input in its domain” (e.g., “f(x) = 2x, the output six corresponds to the input 3”) (p. 36). In terms of the past-tense debate, the establishment view is that the mental processes that reflect competence with regular verb morphology critically consist of combinatorial operations involving variables (where any given variable relates to a present tense verb form). The fundamental issue though, is whether cognitive rules play a causal role in determining this form of language behaviour and a contrary view is that espoused by connectionist modellers. They claim that, irrespective of the system of rules that may be used to describe the language, such rules are incidental to how the language is produced and understood. By this view the linguistic rules, which are encapsulated in terms of syntactic and morphological structures, merely reflect the statistical regularities that exist in the linguistic input. What the brain has done, through exposure to the language, is to recover the statistical regularities by adapting weighted connections between its myriad component processing units. So although it may be possible to describe the language in terms of its grammatical rules, such rules play no causal role in language behaviour (see Quinlan, 1991, pp. 232–236).

Important support for such a connectionist view is the provision of a computer programme which provides the correct outputs for a given set of inputs but which does not contain the rules of operation that are otherwise assumed to be fundamental. Neural network models also, typically, provide accounts of how the acquisition of the input–output mapping proceeds. In this regard, there are connectionist accounts showing how the mapping between present- and past-tense forms of English verb forms may be acquired (Plunkett & Juola, 1999; Plunkett & Marchmann, 1996; Rumelhart & McClelland, 1986). The fact that such models are successful in producing the correct past-tense forms when presented with corresponding present tense forms is taken as a demonstration that rule-based accounts of language are not demanded by the data. The network models do not acquire a rule-based system that (in some sense) operationalises the rules of grammar. What they acquire is a set of weighted connections that capture the statistical regularities that exist in the mapping between the present and past tenses of the verbs.

To be clear, it would be inappropriate to attempt to appraise the different points of view on the issue of past-tense learning in detail here. Such an exercise would take us far from the main topic of interest. The example provides a relatively simple illustration of rule-based and connectionist accounts of how to approach a central issue concerning mental rules. In brief though, there is now a large body of connectionist evidence in favour of single mechanism accounts of past-tense learning couched in terms of computer models (see above citations and Joanisse & Seidenberg, 1999). Symbolic simulations also exist (Ling & Marinov, 1993; Taatgen & Anderson, 2002). In parallel with this modelling work there is an emerging literature on provocative neuropsychological cases (see e.g., Miozzo, 2003; Tyler et al., 2002; Ullman et al., 1997) that seem to suggest the operation of two different routes to past tense generation. It suffices to state that the debate continues, the fundamental issues have yet to be resolved, and the evidence remains equivocal.

In the current context the focus of attention is not on language but on cognitive development and the central issue revolves around consideration of the claim that “the highly consistent behavior patterns often observed on Piagetian problem-solving tasks are the result of relatively simple rules that are stored in the child’s long-term memory and sampled from memory to solve the problem at hand.” (Kerkman & Wright, 1988, p. 325). The opposing connectionist view is that cognitive development in general can be adequately explained by the same sorts of statistical learning procedures that have been examined in the neural network models of past-tense learning.

Fundamental though is the distinction between rule-following and rule-governed devices. Rule-governed mechanisms may be said to act in accordance with a set of rules. By contrast, the notion of rule-following applies only to cases where “a representation of the rules they follow constitutes one of the causal determinants of their behavior” (Fodor, 1975, p. 74). In addressing the rule-following/governed distinction two problems immediately become apparent. First, it is important to be clear about the nature of the rules that are assumed to govern the behaviour, and, second, it is crucial to be able to show that there is convincing empirical support for the use of such rules. Although, in principle, it may well be easy to provide a rule-based account of some aspect of human performance, it is quite another matter to show that those same rules play a causal role in the behaviour (see Kerkman & Wright, 1988, p. 349, for further evidence on this point). As will become clear, much of what follows concerns (a) how best to characterise those rules, and (b) the evidence that has been used to argue for the existence and use of particular cognitive rules.

3. The acquisition of an understanding of the principle of torque: The balance scale task

Of primary concern is the problem of the acquisition of knowledge about the operation of a balance scale. This topic has a long history in human developmental psychology and it stems from the seminal work of Piaget (Inhelder & Piaget, 1958; Piaget & Inhelder, 1969). Piaget instigated this line of research by allowing children to explore and manipulate various kinds of simple balance scales and by noting their responses when engaged in a structured verbal protocol with an experimenter. The eventual qualitative data from these studies were used to support the idea that children progress through an invariant sequence of conceptual stages in which they successively approximate to a full understanding of proportions and the principle of torque. Each of the stages is characterised by utilisation of a particular rule that is taken to reflect the child’s current state of knowledge about the operation of the balance scale.

This particular line of research was pursued, most notably, by Siegler (1976, 1981) and again the central idea to emerge from this work is that each of the different stages of development can be characterised by use of a particular rule. This was established in the following manner. Siegler (1976) gave his participants (young children) various tests with an actual wooden balance scale that comprised loading up the scale with different configurations of weights located somewhere on four equidistant pegs on each side of the fulcrum. On a given trial the child was presented with a balance scale configured with weights placed on pegs, and the ends of the scale were supported by wooden blocks. The child had to say whether the scale would balance, whether it would tip to the right, or whether it would tip to the left when the blocks were removed. In assessing a child’s competence, several different problem types were defined. The test problem types were known as balance, weight and distance, and three types of conflict problems were also defined. Specifically, the types were:

1. Balanced patterns in which the same weight is positioned at the same distance from the fulcrum on both sides.

2. Weight patterns in which different weights are placed at the same distance from the fulcrum on both sides.

3. Distance patterns in which the same amount of weight is placed at different positions from the fulcrum.

4. Conflict patterns in which weight and distance are placed in conflict on the different sides of the fulcrum. These can be further subdivided into three sub-types defined relative to the nature of torque (i.e., the product of the weight and the distance from the fulcrum):

(a) Conflict-weight patterns in which the side with the greater weight has the greater torque.
(b) Conflict-distance patterns in which the side with the greater distance has the greater torque.
(c) Conflict-balance patterns in which there is equivalent torque on both sides of the fulcrum.

Performance was assessed with a test set comprising four balance, four weight, four distance, six conflict-weight, six conflict-distance and six conflict-balance problems (Siegler, 1976). Using this approach, each stage of conceptual development was associated with a characteristic response profile across these 30 balance scale problems – so the assignment of an individual to a particular stage of development was governed by that individual’s responses to the 30 test problems.
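The taxonomy of problem types can be expressed as a short classification routine. The sketch below is illustrative only: the function name, the argument order (left weight, left distance, right weight, right distance) and the label "unlisted" are assumptions introduced here. The label "unlisted" covers configurations in which weight and distance both differ but favour the same side, a case that falls outside the six types listed above.

```python
def problem_type(lw: int, ld: int, rw: int, rd: int) -> str:
    """Classify a balance scale configuration into one of the six problem types.

    lw, rw: number of weights on the left and right side.
    ld, rd: distance of those weights from the fulcrum (peg number).
    """
    if lw == rw and ld == rd:
        return "balance"
    if ld == rd:
        return "weight"
    if lw == rw:
        return "distance"
    heavier = "left" if lw > rw else "right"
    farther = "left" if ld > rd else "right"
    if heavier == farther:
        # Both cues favour the same side: not one of the listed types.
        return "unlisted"
    left_torque, right_torque = lw * ld, rw * rd
    if left_torque == right_torque:
        return "conflict-balance"
    winner = "left" if left_torque > right_torque else "right"
    return "conflict-weight" if winner == heavier else "conflict-distance"


print(problem_type(4, 1, 1, 3))  # torque 4 vs 3: the heavier side goes down
print(problem_type(2, 1, 1, 4))  # torque 2 vs 4: the farther side goes down
print(problem_type(2, 1, 1, 2))  # torque 2 vs 2: the scale balances
```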


Siegler discussed four such stages and these are perhaps best conveyed in the manner described by Raijmakers, van Koten, and Molenaar (1996):

Rule I: Only consider one dimension (i.e., the so-called dominant dimension – either weight or distance) and respond according to an assessment of this dimension. (The distinction between dominant and subordinate dimensions is to take account of the bias that children may initially exhibit towards either weight or distance.)
Rule II: Consider the subordinate dimension if and only if the values on the dominant dimension are equal.
Rule III: Consider both dimensions and in the case of conflict guess.
Rule IV: Consider both dimensions in an appropriate fashion and combine the numerical values according to the correct multiplication rule.
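As a rough sketch, the four rules can be written as response functions. Two assumptions are made here and are not given in the text: weight is taken as the dominant dimension, and guessing under Rule III is implemented as a uniform choice among the three possible responses. The function names are hypothetical.

```python
import random

RESPONSES = ("left", "right", "balance")


def compare(left: float, right: float) -> str:
    """Return the side with the larger value, or 'balance' if the values are equal."""
    if left > right:
        return "left"
    if right > left:
        return "right"
    return "balance"


def rule_1(lw, ld, rw, rd):
    # Dominant dimension only (assumed here to be weight).
    return compare(lw, rw)


def rule_2(lw, ld, rw, rd):
    # Consult distance if and only if the weights are equal.
    return compare(lw, rw) if lw != rw else compare(ld, rd)


def rule_3(lw, ld, rw, rd, rng=random):
    # Consider both dimensions; guess when they point in different directions.
    by_weight, by_distance = compare(lw, rw), compare(ld, rd)
    if by_weight == "balance":
        return by_distance
    if by_distance == "balance" or by_weight == by_distance:
        return by_weight
    return rng.choice(RESPONSES)  # assumed form of guessing


def rule_4(lw, ld, rw, rd):
    # Correct multiplication rule: compare torques.
    return compare(lw * ld, rw * rd)


item = (3, 1, 2, 4)  # three weights on peg 1 versus two weights on peg 4
print(rule_1(*item), rule_2(*item), rule_4(*item))
```

On this conflict item Rules I and II both follow the weight cue, whereas Rule IV follows the torques (3 against 8), so the rules are separated by their response profiles across item types.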

To determine which of the rules a child used to solve the balance scale problems, Siegler (1976) chose the following criteria which collectively define the rule assessment methodology:

1. To be classified as a Rule I user: at least 26 of 30 responses must be based on the weight cue; in addition, at least three Balance responses must be made to the distance problems;

2. To be classified as a Rule II user: at least 26 of the 30 responses must conform to the following rule “If there are an unequal number of weights consider only weight – if the number of weights is equal also consider distance”; in addition, three of the four distance problems must be correct;

3. To be classified as a Rule III user: 10 of the 12 responses to the non-conflict problems and three of the four distance problems must be correct. In addition, there must be at least five departures in 18 trials from complete reliance on the weight (distance) cue as indicating the correct answer on the conflict problems;

4. To be classified as a Rule IV user: at least 26 of 30 responses must be correct.
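The criteria can be turned into a scoring routine along the following lines. This is a sketch under several assumptions that the criteria themselves leave open: weight is the dominant cue, every threshold is read as "at least", and the checks are applied in the order Rule IV, I, II, III. The data layout, in which each of the 30 items is a tuple of problem type, configuration and response, is hypothetical.

```python
def compare(left, right):
    """Return the side with the larger value, or 'balance' if equal."""
    if left > right:
        return "left"
    if right > left:
        return "right"
    return "balance"


def classify_child(items):
    """items: list of (ptype, lw, ld, rw, rd, response) for the 30 test problems."""
    n_correct = n_weight_cue = n_rule2 = 0
    dist_balance = dist_correct = nonconflict_correct = departures = 0
    for ptype, lw, ld, rw, rd, resp in items:
        weight_resp = compare(lw, rw)                    # pure weight cue
        rule2_resp = weight_resp if lw != rw else compare(ld, rd)
        correct = compare(lw * ld, rw * rd)              # torque comparison
        n_correct += resp == correct
        n_weight_cue += resp == weight_resp
        n_rule2 += resp == rule2_resp
        if ptype == "distance":
            dist_balance += resp == "balance"
            dist_correct += resp == correct
        if ptype in ("balance", "weight", "distance"):
            nonconflict_correct += resp == correct
        else:
            departures += resp != weight_resp            # conflict items only
    # The order of the checks below is an assumption of this sketch.
    if n_correct >= 26:
        return "Rule IV"
    if n_weight_cue >= 26 and dist_balance >= 3:
        return "Rule I"
    if n_rule2 >= 26 and dist_correct >= 3:
        return "Rule II"
    if nonconflict_correct >= 10 and dist_correct >= 3 and departures >= 5:
        return "Rule III"
    return "unclassified"
```

A response set that meets none of the criteria is returned as "unclassified", which corresponds to the participants who could not be assigned to any of the four rules.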

By using these criteria, almost 90% of all participants were classified as following one of the four rules (Siegler, 1976). Moreover, in accordance with Siegler’s predictions, the youngest children used Rule I, and the oldest children used Rule IV.

Despite the early success in the derivation and application of the rule assessment methodology, serious concerns have been raised subsequently over its usefulness (for a more detailed discussion of its shortcomings, see Dawson & Zimmerman, 2003; Jansen & van der Maas, 2002; Kerkman & Wright, 1988). Now it seems that this simple framework for thinking fails to account adequately for various subtleties in performance that have been uncovered in more recent experiments.

3.1. Further evidence bearing on rules and rule-use in humans

Central here are the findings of the existence of rules in addition to those described by Siegler (1976). Important examples are (a) a rule of addition (Ferretti, Butterfield, Cahn, & Kerkman, 1985; Normandeau, Larivee, Roulin, & Longeot, 1989), (b) a buggy-rule (Van Maanen, Been, & Sitjsma, 1989), and (c) a qualitative proportionality (QP) rule (Boom, Hoijtink, & Kunnen, 2001; Normandeau et al., 1989). The rule of addition consists of simply comparing the sum of the weight and distance values on either side of the fulcrum. In contrast, Van Maanen et al. (1989) defined the “buggy-rule” as “If side X has more weights and the weights on side X have the smaller distance to the fulcrum then shift the weights on side X away from the fulcrum until the distances on both sides are equal and remove for every shift on side X one weight on side X” (p. 72). Discussion of the rule of addition and the buggy rule is normally conflated because, as Jansen and van der Maas (1997, p. 326) have remarked, the same profile of responding fits with both. Against this though it is possible to discriminate between these two rules when response times (RTs) are considered (see van der Maas & Jansen, 2003) and from this evidence it seems that the buggy rule prevails over the addition rule.

Regarding the QP rule, Jansen and van der Maas (1997) noted that “Rule QP users consider both weight and distance and conclude that a heavy weight at a small distance on one side of the fulcrum compensates for a light weight at a greater distance on the other side of the fulcrum.” (p. 325). Evidence for this kind of rule is that participants classify all the conflict problems as being balanced.

In addition to the fact that the original rule assessment methodology provides no account for the existence of these newer rules, the procedure has also been shown to suggest falsely the presence of rules (see e.g., Jansen & van der Maas, 2002). Part of the problem here stems from the fact that the criteria for the identification of rules and rule-use lack statistical foundations (cf. Kerkman & Wright, 1988). In order to address this shortcoming, Jansen and van der Maas (1997) proposed to bolster the rule assessment methodology by adopting a psychometric technique, namely, LCA. As will be demonstrated, a strength of this technique is that it arrives at the most economical system of rules to describe performance (at any point during acquisition) as given by statistical measures of goodness-of-fit. A rough analogy here is with factor analysis: following data analysis a factor structure is revealed that must then be interpreted.

In attempting to understand such a “factor-structure” the theorist is free to fit any profile that is deemed plausible. As will be shown below, as a first step the theorist may wish to consider the different profiles of responses that are, respectively, commensurate with Rules I to IV as discussed by Siegler (1976) (see Jansen & van der Maas, 2002). In this way the application of LCA becomes a statistical extension of Siegler’s rule assessment methodology – it offers a statistical method for assessing Siegler’s theoretical framework. An advantage over the standard rule assessment methodology though, is that now the theorist is able to consider the statistical fit of the data with each of the profiles of responses that defines competence at one of the four stages on the task. More generally, with LCA it is possible to establish, on the basis of a firm statistical footing, the degree to which any pre-existing rule set can be said to account for the data.
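The core of a latent class model can be sketched in a few lines. In its standard form for binary (correct/incorrect) items, the probability of a response pattern is a mixture over classes, with item responses assumed independent within each class. The code below computes that probability and the posterior class memberships for given parameters. All numbers are made up for illustration, the function names are hypothetical, and the estimation of parameters and the goodness-of-fit statistics used in a real (multi-group) analysis are not shown.

```python
def pattern_likelihood(pattern, class_sizes, cond_probs):
    """P(pattern) = sum over classes c of size_c * prod_j P(y_j | c).

    pattern: list of 0/1 scores, one per item.
    class_sizes: proportion of the population in each latent class.
    cond_probs[c][j]: probability of a correct answer to item j in class c.
    """
    total = 0.0
    for size, probs in zip(class_sizes, cond_probs):
        p = size
        for y, q in zip(pattern, probs):
            p *= q if y == 1 else 1.0 - q
        total += p
    return total


def class_posteriors(pattern, class_sizes, cond_probs):
    """Posterior probability of each class given the response pattern."""
    joint = [
        pattern_likelihood(pattern, [size], [probs])
        for size, probs in zip(class_sizes, cond_probs)
    ]
    total = sum(joint)
    return [j / total for j in joint]


# Two illustrative classes answering a weight item and a conflict-distance item:
# class 0 follows the weight cue, class 1 answers by torque.
sizes = [0.5, 0.5]
probs = [[0.9, 0.1], [0.9, 0.9]]
print(class_posteriors([1, 0], sizes, probs))
```

A response profile predicted by a rule corresponds to a class whose conditional probabilities are close to 0 or 1 on each item type; the statistical question is how many such classes are needed to fit the observed patterns.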

Another aspect of this approach is that the analysis can reveal profiles of responses that suggest rules other than those discussed by Siegler (1976). For instance in their study of performance on the balance scale task, Jansen and van der Maas (2002) reported LCA evidence of children using a smallest distance down rule (p. 400). That is, some responses revealed a propensity to respond that the scale would tilt to the side with the smallest distance. Although discussion of this sort of profile was already present in the literature (see Siegler & Chen, 1998), it indicates that LCA can reveal patterns of performance that may otherwise be unexpected. In such circumstances LCA has been productive in forcing theorists to consider possibilities outside the bounds of current thinking, for instance in taking seriously the possibility of rules in addition to those discussed originally by Siegler (1976). Indeed, it is also quite possible that LCA may reveal evidence of profiles of responses that are novel and are therefore not predicted by any current theory. When such unexpected patterns emerge, the theorist is, by definition, placed in the position of having to interpret the data in a post hoc fashion. However, this should not be construed as being a problem for the method. It simply shows that, despite the theorists’ preconceptions, LCA can reveal evidence for profiles of responding that have not hitherto been considered. What then is the most adequate theoretical interpretation may become the focus of future research, especially if interpreter bias is suspected.

Along with such methodological advances, what would be desirable is a general framework for arriving at a consensus on rules and rule-use in the balance scale task. One such framework that has been discussed (see Jansen & van der Maas, 1997) is provided by the criteria for rules set out by Reese (1989). These are that rule-like behaviour should be regular, consistent, transferable, evidenced by data of different sources, discontinuous, and (to some degree) conscious. Sound evidence for each of these criteria can be found in the literature, but whether application of the rule assessment methodology alone can provide such evidence is contentious. It is therefore useful to consider briefly what the evidence for rules and rule-use in humans is on the balance scale task.

3.2. Criteria for rules and rule-use

3.2.1. Regular and consistent rule-use

Evidence for regular and consistent rule-use on the balance scale task can be found in the studies of Boom et al. (2001) and Jansen and van der Maas (1997, 2002) in their applications of LCA. These studies replicated Siegler’s (1981) original finding that children consistently use rules with the balance scale items. Across these three studies a large majority of the children (i.e., 71%, 80%, and 81%, respectively) were classified as using Rules I–IV or the addition rule. The behaviour of the remaining children did not conform to any of these rules but, nevertheless, many showed consistent answer patterns in accordance with alternative rules (such as the QP rule, Boom et al., 2001; Normandeau et al., 1989). Indeed in analysing the data for each item type separately, Jansen and van der Maas (1997, 2002) reported some violations of the expected rules in children’s data. The latent classes observed differed from those expected. However, such violations were minor and often attributable to atypical rules.


Particular patterns of rule inconsistency have also been observed, however (Jansen & van der Maas, 1997, 2002; Siegler, 1981; van der Maas & Jansen, 2003). Rule inconsistency refers to cases where participants’ responses deviate from those predicted by a given rule. Such deviations have been explored in some detail by Jansen and van der Maas (1997; see also van der Maas & Jansen, 2003), and they concluded that these patterns of behaviour are understandable in terms of the rule-switching that typically occurs around the transitions between stages. During critical periods (for instance in the transition between Stages 1 and 2; Jansen & van der Maas, 2001) the children may vacillate in applying different rules in a bid to solve problems not covered by a less complex rule. An important caveat here though is that no such inconsistency in rule use was found in children who had achieved Stage 4 (Jansen & van der Maas, 2001). Having achieved a mastery of the principle of torque there is no need to switch from Rule IV because it applies to all cases.

3.2.2. Transferability of rules

Siegler (1981) reported data showing the transferability of proportional reasoning rules over different kinds of proportional reasoning tasks, that is, children behaved consistently across a range of different reasoning tasks. Interestingly, he also documented consistency between answer patterns and verbal explanations in the vast majority of cases (incidences of between 70% and 80%), that is, participants showed consistency across different measures on the same tasks. Indeed, the consistency between answer patterns and verbal reports, described by Siegler (1981) and others (Chletsos, De Lisi, Turner, & McGillicuddy-De Lisi, 1989), points to participants having some awareness of the rules they use in the tasks. Other forms of consistency across different measures have also been reported in the RT study of van der Maas and Jansen (2003). Van der Maas and Jansen derived expected RTs from the rule model of Siegler and showed that the responses of children were remarkably consistent with their answer patterns.

3.2.3. The discontinuous character of rule-use

In considering the discontinuous character of rule-use, Jansen and van der Maas (2001) applied catastrophe theory to examine the transition from Rule I to Rule II. This theoretical framework was used to support the idea of phases of stable performance interspersed with abrupt changes in performance indicative of definite shifts between qualitatively different stages. Jansen and van der Maas compared various models in terms of how well they accounted for progression from Rule I to II on the balance scale task. They contended that only the cusp model (an elementary embodiment of catastrophe theory) truly explains discontinuous (stage-like) behaviour: four of the five relevant catastrophe flags (bimodality, inaccessible region, sudden jump and hysteresis) associated with discontinuity were observed in the transition from Rule I to II. Critically, the last of these flags, hysteresis, is sufficient for concluding that change is abrupt and stage-like and this cannot be explained by non-transitional models.
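The flags of bimodality and hysteresis can be illustrated with the canonical cusp potential V(x) = x⁴/4 − b·x²/2 − a·x, whose equilibria satisfy x³ − b·x − a = 0. This is the textbook form of the cusp and is offered as a sketch only; it is not the model fitted by Jansen and van der Maas, and how the control parameters a and b would map onto task variables is left open here. The function names and parameter values are assumptions.

```python
def n_stable_states(a: float, b: float) -> int:
    """Number of stable equilibria of V(x) = x**4/4 - b*x**2/2 - a*x.

    Two minima (bimodality) exist only inside the cusp region 4*b**3 > 27*a**2.
    """
    return 2 if 4 * b**3 > 27 * a**2 else 1


def settle(x: float, a: float, b: float, step: float = 0.01, iters: int = 2000) -> float:
    """Follow the gradient of V downhill from x to a nearby stable state."""
    for _ in range(iters):
        x -= step * (x**3 - b * x - a)
    return x


def sweep(a_values, b: float, x0: float):
    """Change a gradually, letting the state settle after each change."""
    x, path = x0, []
    for a in a_values:
        x = settle(x, a, b)
        path.append(x)
    return path


a_up = [i * 0.5 for i in range(-6, 7)]          # a from -3 to 3
up = sweep(a_up, b=3.0, x0=-2.0)                # increasing a
down = sweep(a_up[::-1], b=3.0, x0=2.0)         # decreasing a
# At a = 0 (index 6 in both sweeps) the state depends on the direction of change.
print(round(up[6], 3), round(down[6], 3))
```

With b = 3 the jump between branches occurs beyond a = 2 when a is increased but beyond a = −2 when it is decreased, so the same value of a is associated with different states depending on the history of the system. A model without a transition cannot produce this pattern.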

Preliminary evidence also suggests that whereas the transition to Rule IV is also rather sudden (Jansen & van der Maas, 2002; van der Maas & Jansen, 2003), the rule transitions involving Rule III are probably not discontinuous. As Siegler (1981) has pointed out, Rule III includes a host of idiosyncratic strategies – such as guessing. Therefore variability in responding is only to be expected and is present in the data. Such variability may arise for a number of different reasons relating to measurement error. Indeed, the idea that the presence of such variability is peculiarly problematic for rule-based accounts (Elman, 2005, p. 112) is not compelling. Such variability is bound to arise when intensive testing of young children is undertaken. As van der Maas and Jansen (2003) noted, lack of concentration and other motivational problems can be significant determiners of performance in such cases.

3.2.4. Further evidence against rule-based accounts

Although the foregoing sets out supporting evidence for the use of rules by children as they attempt to master the principle of torque, the rule-like character of children’s responses has been criticised on the basis of the so-called torque difference effect (Ferretti & Butterfield, 1986, 1992; Ferretti et al., 1985). Ferretti and Butterfield (1992) changed the problem set so that the distance dimension was increased in order that differences between the weight/distance products could be amplified. Now children who had previously been classified as using Rule I were re-classified as using Rules II and III when solving the distance and conflict cases. On a superficial reading this apparently breaches the stricture about invariant rule-use. (Indeed such a result sits uncomfortably with rule-use because, on the assumption that the children are reasoning with rules comprising variables, different values of weight and distance should not affect the application of the rules.) However, following a more thorough examination, Jansen and van der Maas (1997) demonstrated that this conclusion about inconsistency in rule use was restricted to cases where the most extreme level of the product difference existed (i.e., level 4 where the product differences were larger than 18).

Indeed, further analysis showed that when the very extreme cases were deleted (on the grounds that these never occur in any of the balance scale tests), Siegler’s assumption of insensitivity to quantitative variations within item types was an adequate description of performance. In the tests used by Siegler (1976), Van Maanen et al. (1989), and Jansen and van der Maas (2002), the average torque difference on non-balance items was only 3.04 with a maximum of 12. The available evidence now shows that children’s behaviour is consistent across moderate torque difference levels and that the torque difference effect is limited to cases where the torque difference is large. It is however notable that children in transition from Rule I to Rule II will be sensitive to extreme values in product or distance difference (Jansen & van der Maas, 1997, 2001) and the more recent evidence is more in keeping with this restricted torque difference effect (see van Rijn, van Someren, & van der Maas, 2003).

This is of some import given that McClelland (1995) has reported simulations using a connectionist network (discussed below) that mimic, to some degree, the effects of torque difference reported by Ferretti and Butterfield (1986). The correspondences between the network and human data were interpreted as suggesting “that the mechanisms used in the model and mechanisms used by the children have something in common” (McClelland, 1995, p. 183). However, such a conclusion needs to be examined in the light of more recent work by van Rijn et al. (2003). Van Rijn et al. (2003) have reported a rule-based computer model that also simulates the torque difference effect, and importantly, the restricted torque difference effect. Both connectionist and rule-based models therefore pass these tests of sufficiency. More generally though any account of the data must address why “childrens’ behaviour is homogenous for moderate torque difference levels and that a torque difference effect is limited to large torque differences.” (van Rijn et al., 2003, p. 232). Regardless, what the simulation work of van Rijn et al. (2003) suggests is that the presence of the torque difference effect will be most evident “in the vicinity of transitions” (van Rijn et al., 2003, p. 253).

The idea that rule-like performance may be artificially induced by the testing situation has been alluded to by McClelland (1995). He commented that the tendency of children to conform to an explicit rule is much higher in experiments, like those of Siegler (1976, 1981), in which they are required to justify their answers and explain their strategies, than in studies like those of Ferretti and colleagues (Ferretti & Butterfield, 1986; Ferretti et al., 1985), in which they merely make judgments of balance scale problems without being required to verbalise the basis for their answers. Clearly any model of performance will eventually have to account for such a state of affairs, but at present no model does. More critically though, it should not be concluded that rule-use in the task is necessarily induced by having the participant engage in verbal dialogue. Jansen and van der Maas (1997, 2002) have reported statistically significant levels of consistent rule-use when paper-and-pencil tests were employed and the children did not have to verbalise their responses. Although verbalisation may encourage rule-like performance, it is neither necessary nor sufficient to account for it.

Finally, rule-like performance has been questioned on the reinterpretation offered by adopting the information integration account of the balance scale task put forward by Wilkening and Anderson (1982). The information integration account assumes continuity of developmental change, and as Kerkman and Wright (1988) noted, by this account information from both weight and distance dimensions is pooled via a single, continuous function for integrating these dimensional values into responses for all of the balance scale problems. By this view, there is no point in generating a taxonomy of different problem types because all are covered by the same integrating function. According to the theory, performance on the balance scale task can be understood in terms of the application of a kind of weighted addition rule in which differential importance is attached to the weight and distance dimensions across patterns and across developmental progression.
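To make the contrast with torque concrete, the following minimal sketch compares a weighted addition rule with the torque (product) rule on a single item. The functional form and the coefficients `a` and `b` are illustrative assumptions only; they are not taken from Wilkening and Anderson (1982).

```python
def adding_rule(wl, dl, wr, dr, a=1.0, b=1.0):
    """Weighted addition: compare a*weight + b*distance on each side.

    a and b are hypothetical importance weights for the two dimensions.
    """
    left = a * wl + b * dl
    right = a * wr + b * dr
    if left > right:
        return "left"
    if right > left:
        return "right"
    return "balance"


def torque_rule(wl, dl, wr, dr):
    """Torque: compare weight * distance on each side."""
    left = wl * dl
    right = wr * dr
    if left > right:
        return "left"
    if right > left:
        return "right"
    return "balance"


# A conflict item: 4 weights at distance 1 (left) versus 2 weights at distance 3 (right).
item = (4, 1, 2, 3)
print(adding_rule(*item))         # equal sums (5 vs 5)
print(torque_rule(*item))         # unequal products (4 vs 6)
print(adding_rule(*item, a=2.0))  # extra importance attached to weight
```

With equal coefficients the addition rule predicts balance on this item whereas torque predicts a fall to the right; changing the relative importance of weight alters the prediction without any change of rule.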

Following on from the description of this theory, Kerkman and Wright (1988) provided an in-depth appraisal of it and how it attempted to explain performance on the balance scale and other forms of compensation tasks. Without repeating the details of the critique here, Kerkman and Wright (1988) concluded that, in compensation tasks such as the balance scale task, the forms of algebraic integration discussed in the information integration theory are exceptional in human data.2 Indeed, a notable shortcoming of the theory is that it fails to provide any mechanism for switching between different algebraic rules as children’s competence at the tasks improves (Kerkman & Wright, 1988; McClelland, 1995, p. 168). In this regard there is no account of the transitional behaviour found in children.

2 The points being made relate to performance on particular compensation tasks discussed in the developmental literature. The intention is not to discount all models that posit processes of information integration. Indeed as the work of Massaro (1989) has shown, such models do fare much better in terms of accounting for critical aspects of speech perception, for example.

In summary therefore, rule-based accounts are not without their critics, but the human data are generally consistent with such accounts (a) after careful examination of the extant data, and (b) when the particular rules are derived from rigorous statistical methods. More particularly, variability in responding and deviations from particular rules are revealing in important ways and are generally consistent with stage theories of conceptual development and contingent rule-based accounts (van Rijn et al., 2003).

4. Computer modelling of the balance scale task

In addition to the large empirical literature on the balance scale task, there has been a considerable amount of research effort expended in developing computer models of performance on the task. A useful starting point here is the work reported by McClelland (1989). McClelland configured the network shown in schematic form in Fig. 1 and then trained this network on solutions to the balance scale problem (Inhelder & Piaget, 1958). The particular balance scale simulated had five left and right weight positions and five different weights for the left and right sides of the fulcrum. The network comprised separate banks of left and right input units for the weight and distance dimensions. These weight and distance units were selectively connected to a hidden layer of units. Two hidden units were connected to the weight input units and the remaining two hidden units were connected to the distance input units. All of the hidden units were fully connected to two output units.

Fig. 1. (a) Schematic form of the backpropagation network used by McClelland (1989) to simulate the acquisition of knowledge about the balance beam whose structure is also schematised in (b). The balance contained five weight locations on each side of the fulcrum and a set of five weights for each side. By permission of Oxford University Press. [Figure labels: output units (Fall to the Left, Fall to the Right); weight input units and distance input units, each with left and right banks.]

During training each input pattern corresponded to a possible configuration of weights placed at particular distances on the balance scale. Each teaching signal represented whether the configuration gave rise to a fall to the right (i.e., an output value greater than 0.333 on the right output unit), a fall to the left (i.e., an output value greater than 0.333 on the left output unit), or a balanced state of affairs (neither output value greater than 0.333). The distinction between the dominant and subordinate dimensions was contrived by setting up a bias towards the weight dimension. The network experienced more patterns in which the weight cue varied than patterns in which the distance cue varied. In this regard, weight became the dominant dimension for the networks and they learnt more quickly about this cue than the distance cue.

In order to draw direct comparisons with the human data, McClelland (1989) used the same testing procedure as Siegler (1976) to classify the developmental progression of his neural network across its training. The network’s performance was assessed after each epoch during training and the results revealed that although the network progressed through the first three stages, it was unable to achieve stable Rule 4 performance. In order to convey this, the data were cast in the form of what can be termed “stages of development graphs” as shown in Fig. 2. The abscissa shows the number of training epochs undertaken by the network and the ordinate maps out the four stages of development defined above. Such a graph provides a rough indication of the developmental sequence that the network progressed through during the training.

Fig. 2. A stages of development graph re-drawn from McClelland (1989). The graph depicts the results of a single run with the network shown in Fig. 1a. The ordinate reflects the stage of development defined by Siegler (1981) and the abscissa reflects the training epoch. According to McClelland (1989, p. 29) Stage 0 performance reflects the network always outputting a balanced response on each output unit. An asterisk (*) pinpoints an epoch where a missed rule was due to a distance bias. The data were also scored so as to allow borderline cases that failed to comply with two adjacent rules. By permission of Oxford University Press. [Axes: Epoch (0–100) against Rule (0–4).]

Clearly such a demonstration fundamentally undermines any account of development that posits discrete cognitive stages of development that implicate distinct periods when particular hypotheses about the operation of the balance scale are generated and tested by the children. The connectionist account demonstrates that much of human developmental performance can be mirrored by a device that embodies general principles of learning based on quasi-statistical methods that operate in a continuous rather than discrete manner. On such grounds, therefore, it is important to be clear about what exactly has been shown and, in this regard, the work by Raijmakers et al. (1996) and Jansen and van der Maas (1997) is germane.

4.1. Critical assessment of the early connectionist modelling work

Raijmakers et al. (1996) re-visited the work of McClelland (1989; McClelland & Jenkins, 1991) and, following a much more detailed statistical analysis of the outputs of the networks, showed that there was no evidence of discontinuous jumps in the learning exhibited by the networks. The data from the networks appeared to reflect continuously incremental knowledge acquisition (cf. McClelland, 1995). In contrast, Raijmakers et al. discussed the evidence in the human data to suggest notable discrete improvements in performance consistent with a progression through a sequential series of cognitive stages – for further converging evidence for this see the more recent work of Jansen and van der Maas (2001).

More critical, perhaps, is the research reported by Jansen and van der Maas (1997). Here the networks’ responses to the items used with a human sample were analysed, but it proved impossible to provide a clear summary of performance in terms of any rules of classification. It was difficult to fit any latent class model to these data, and when such a fit was found, interpretation was difficult and very few (i.e., 19%) of the response patterns were associated with Siegler’s rules or any known alternatives. In stark contrast, in analysing the human data many latent classes were associated with Siegler’s rules (about 80%) and only some classes were impossible to interpret. The general conclusion was that the connectionist networks failed to conform to any system of rules uncovered for humans.

It is important to try to be clear here. From the characterisation of connectionist models provided in the introductory section, it should be evident that there is no sense in which the networks of McClelland and Jenkins (1991) acquired explicit mental rules that determined their behaviour. The networks acquired weighted connections that collectively gave rise to patterns of behaviour that may be described by rules when the rule assessment method is applied – the difference between rule-following and rule-governed devices is critical. The issue therefore is whether the developmental pattern of performance of the networks mirrors that of humans. The general view that emerged from the more intensive examinations of the networks (Jansen & van der Maas, 1997; Raijmakers et al., 1996) is that although there are general correspondences between the behaviour of the networks and that of humans, there are also substantial differences that cast doubt on the degree to which the networks and humans follow the same developmental trajectories. In this regard, it is interesting to note that Munakata and McClelland (2003, p. 417) are in agreement with this conclusion in stating that the McClelland (1989) model failed to capture important aspects of stage transitions seen in the developmental data.

Overall therefore the reappraisals of the early connectionist models of the balance scale tasks appear to have uncovered fundamental problems and limitations of the work. The networks’ performance simply does not correspond with that exhibited by humans when careful and detailed analyses are carried out. Clearly the models do mimic some aspects of the human data, but they do so in ways that do not correspond with how children are mastering the task. There is now good evidence in the human empirical literature for stages and stage-like transitions (Jansen & van der Maas, 2001; Van der Maas & Hopkins, 1998), and these remain to be convincingly accounted for by the continuous, statistical learning mechanisms embodied in the type of connectionist networks discussed so far. Against this rather bleak backdrop, alternative network models have been put forward and it is to these that discussion now turns.

4.2. Cascade-Correlation

In the network accounts of the balance scale just described the backpropagation learning algorithm was central. Despite the fact that this algorithm has featured heavily in many connectionist accounts of psychological processes, reasonably early on after its description by Rumelhart, Hinton, and Williams (1986), Fahlman (1988) pinpointed a major problem with its speed of learning: Backpropagation training could be inordinately slow even for small problems. In attempting to overcome this problem, Fahlman developed the Quickprop algorithm. This algorithm is an extension of the backpropagation method for updating weights, but now, at each stage in learning, the aim is to increase the size of the weight changes to the greatest possible degree without causing the network to converge to a sub-optimal solution. The details of the method need not be repeated here (see Fahlman, 1988): it suffices to note that the extensions introduced by Fahlman (1988) were successful in producing dramatic improvements in learning speeds over comparable simulations using the standard backpropagation algorithm.

4.2.1. Gradient descent and the CC algorithm

In the now familiar multi-layered perceptrons that feature in many connectionist accounts of human performance (e.g., see Fig. 1), the architecture is fixed at the outset and remains so throughout training. In extending the Quickprop algorithm Fahlman and Lebiere (1991) introduced the CC algorithm as a means to allow a network to recover its own structure during the course of training. The initial CC network structure comprises fully interconnected input and output layers of units only: it does not possess any hidden units. In very general terms initial training in CC is the same as in Quickprop; however, the CC algorithm is distinctive in allowing a network to grow over time. It is well known that networks with no hidden units are limited in being only capable of solving linearly separable problems (Minsky & Papert, 1988) so if the training problem is non-linearly separable the network will “stagnate”. That is, the net will converge on a sub-optimal solution and will never recover the desired input–output mapping. Therefore the network only has a chance of solving such non-linearly separable problems if it has access to hidden units (see Lippmann, 1987; for a more formal approach to this issue).

Normally it is appropriate to define a training epoch as being a single presentation of every member of a set of input stimuli. Each epoch then defines a step in the learning process. In standard backpropagation, for instance, it is possible to curtail training either when the global error has fallen below a certain value or when a certain number of epochs have transpired. Similar constraints can be applied to training with the CC algorithm. More importantly, epochs also feature in the way in which the CC algorithm automatically adds hidden units. For instance, one alternative is to add a new hidden unit after a given number of epochs has transpired. A second alternative is to monitor the network’s output error and wait until learning “stagnates”. This occurs when the change in error over successive epochs is less than some proportion of the previous error (as determined by a free parameter known as the output change threshold). The number of times this check is invoked is governed by the output patience parameter, hence the number of epochs over which the stagnation state is gauged can be varied. Both the output patience parameter and the output change threshold are free parameters that the experimenter is at liberty to vary.
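The stagnation check just described can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the use of consecutive epochs for counting, and the default patience of 8 are hypothetical, and the sketch is not a transcription of any published CC implementation.

```python
def has_stagnated(errors, change_threshold=0.01, patience=8):
    """Return True once `patience` consecutive epochs each show a change in
    error smaller than `change_threshold` times the previous epoch's error.

    errors: list of non-negative global error values, one per epoch.
    """
    quiet_epochs = 0
    for previous, current in zip(errors, errors[1:]):
        if abs(current - previous) < change_threshold * previous:
            quiet_epochs += 1
            if quiet_epochs >= patience:
                return True
        else:
            # A sufficiently large change resets the count.
            quiet_epochs = 0
    return False


# A slowly improving error curve counts as stagnant under a lenient threshold
# but not under a strict one.
curve = [10.0, 9.5, 9.1, 8.7]
print(has_stagnated(curve, change_threshold=0.1, patience=2))
print(has_stagnated(curve, change_threshold=0.01, patience=2))
```

Raising the output change threshold makes the same error curve count as stagnant sooner, so a unit-adding step would be triggered earlier in training.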

In very general terms CC training can be divided into two distinct kinds of training and the algorithm alternates between them. They are known, respectively, as (a) output, and (b) input training. Output training refers to modification of the weights on all connections to output units. This kind of training is undertaken initially prior to the addition of any hidden units, but, more specifically, it refers to weight updates that are made to all connections to the output units. In contrast, input training refers to the processes concerning the addition of hidden units. Although some so-called constructivist algorithms simply graft on an arbitrary new hidden unit when training stagnates (e.g., the dynamic node creation algorithm described by Ash, 1989; see Quinlan, 1998, for a review), CC is more sophisticated than this and attempts to add a unit that will benefit the net’s overall performance (see also Prechelt, 1997).

When the net possesses no hidden units the input units are fully connected to a pool of (typically) eight candidate units. Now input training is undertaken to attempt to maximise the magnitude of the correlation between a given candidate unit’s activation and the residual error found at the original output units by modifying the input connections to the candidate units. The best candidate unit is then chosen and installed as the new hidden unit into the original network. As Mareschal and Shultz (1996) note this “new unit has been specifically trained to encode . . . some feature of the input . . . that still leads to error in the network’s performance” (p. 588). This new unit will ensure that some of the current error will be reduced.
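The quantity that input training seeks to maximise can be sketched as below. The score follows the form usually given for CC (Fahlman & Lebiere, 1991): the magnitude of the covariance between a candidate's activation and the residual error at each output unit, summed over output units. The function names and the toy data are illustrative assumptions.

```python
def candidate_score(activations, residual_errors):
    """Summed magnitude of covariance between one candidate unit's activation
    and the residual error at each output unit.

    activations: one activation value per training pattern.
    residual_errors: one list per training pattern, holding the residual
        error at each output unit for that pattern.
    """
    n_patterns = len(activations)
    mean_activation = sum(activations) / n_patterns
    n_outputs = len(residual_errors[0])
    score = 0.0
    for o in range(n_outputs):
        mean_error = sum(row[o] for row in residual_errors) / n_patterns
        covariance = sum(
            (v - mean_activation) * (row[o] - mean_error)
            for v, row in zip(activations, residual_errors)
        )
        score += abs(covariance)  # magnitude only: the sign is irrelevant
    return score


def best_candidate(pool, residual_errors):
    """Index of the candidate in the pool with the highest score."""
    return max(range(len(pool)), key=lambda i: candidate_score(pool[i], residual_errors))


# Toy example: four patterns, one output unit.
errors = [[1.0], [0.0], [1.0], [0.0]]
constant = [0.5, 0.5, 0.5, 0.5]   # carries no information about the error
tracking = [0.9, 0.1, 0.8, 0.2]   # rises and falls with the error
print(best_candidate([constant, tracking], errors))
```

Because only the magnitude matters, a candidate whose activation is strongly anti-correlated with the residual error is as useful as one that is positively correlated; the sign is absorbed by the output weights trained afterwards.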

The first new hidden unit is installed into the network by linking it to all of the input units and all of the output units. All of the links to the new hidden unit are frozen and therefore do not change for the remainder of training. Output connections from the new hidden unit though are free to vary. Now output training is re-invoked and all of the connections (new and old) to the output units are modified. If training continues to stagnation again, then the process of re-training the pool of eight new candidate units is carried out with the intent of adding a new hidden unit as before. Fig. 3 provides, in schematic form, the stages of growth of a CC network and Fig. 4 provides in blueprint form the basic CC algorithm.

4.3. CC and the balance scale task

Taking McClelland’s (1989) work as a starting point Shultz, Mareschal, and Schmidt (1994) re-visited the balance scale task and began to explore it with CC. Fig. 5 shows a stages of development graph that Shultz et al. (1994) gave as an example of the performance of one of their networks. Following the work of McClelland (1989), the networks were analyzed with a variant of Siegler’s (1981) rule assessment methodology. As can be seen from Fig. 5, the network was able to progress through the ordered stages of development and finally achieved stable performance at Stage 4. In this regard Shultz et al. (1994) had overcome a limitation of the networks reported by McClelland (1989) and demonstrated that stages of development could apparently emerge from a relatively simple incremental and statistical learning mechanism.

Fig. 3. Schematic representation of the growth of the network governed by the CC algorithm. The figure shows three separate stages of development of the network. [Panels: Initial State: No hidden units; Stage 1: Installation of first hidden unit; Stage 2: Installation of second hidden unit. Each panel shows the input units, a bias unit (+1) and the output units, with frozen and variable connections distinguished.]

Fig. 4. Flow diagram of the component stages of the CC algorithm. [Boxes in the diagram:
1. Calculate the errors over all patterns in the training set.
2. Is the performance of the network satisfactory? If so, stop training.
3. Has the learning stagnated?
4. Update the weights connected to the output units.
5. Configure the candidate network.
6. Calculate the candidate unit’s activations and the correlation between this and the output error (over all patterns).
7. Calculate the error with respect to the correlation.
8. Has the correlation increase stagnated?
9. Update the links in the candidate network.
10. Choose the best candidate unit and insert it into the original network.]

4.3.1. Network training in the Shultz et al. (1994) simulations

Following the careful re-appraisals that have been carried out on McClelland’s modelling of performance on the balance scale (Jansen & van der Maas, 1997; Raijmakers et al., 1996), it seemed important to re-consider the work of Shultz et al. (1994). The extant evidence is that the CC networks examined by Shultz et al. (1994) do provide an adequate and plausible account of the acquisition of understanding the operation of a balance scale.

Fig. 5. A stages of development graph re-drawn from Shultz et al. (1994, p. 67). The positioning of a letter “H” indicates an epoch at which a hidden unit was added. With kind permission of Springer Science and Business Media. [Axes: Epoch (0–250) against Rule (0–4); two “H” markers are shown.]

First consider the network training undertaken by Shultz et al. (1994). Given the constraints of the balance scale problem described by McClelland (1989), there are only 625 possible training patterns. In the simulations carried out by Shultz et al. (1994) they initially selected (without replacement) 100 patterns from the 625 and these constituted the first training pattern set. Across this set there was a 0.9 bias in favour of equal distance problems. After each epoch, the training set was expanded by adding a new pattern. Again the same 0.9 bias was retained and sampling was carried out with replacement. Shultz et al. (1994) referred to this method as expansion training because the size of the training set increased by one pattern after every epoch. The particular bias in pattern sampling was justified on the grounds that “children have plenty of experience lifting differing numbers of objects but relatively little experience placing objects at discrete different distances from a fulcrum” (p. 64).
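A minimal sketch of expansion training as described above is given next. Two points are assumptions rather than details from Shultz et al. (1994): the 0.9 bias in the initial set is realised by drawing exactly 90 equal-distance patterns and 10 others, and each later pattern is drawn by first choosing a pool with probability 0.9 or 0.1 and then sampling uniformly within it. All names are hypothetical.

```python
import random
from itertools import product


def all_patterns(n=5):
    """Every (left weight, left distance, right weight, right distance)
    combination with values 1..n: 5**4 = 625 patterns when n = 5."""
    return list(product(range(1, n + 1), repeat=4))


def expansion_training_sets(n_epochs, rng, initial_size=100, bias=0.9):
    """Yield the training set used at each epoch; it grows by one pattern per epoch."""
    patterns = all_patterns()
    equal_distance = [p for p in patterns if p[1] == p[3]]
    others = [p for p in patterns if p[1] != p[3]]

    # Initial set: sampled without replacement (assumed 90/10 split).
    n_equal = round(initial_size * bias)
    training = rng.sample(equal_distance, n_equal)
    training += rng.sample(others, initial_size - n_equal)

    for _ in range(n_epochs):
        yield list(training)
        # Expansion step: one new pattern, sampled with replacement.
        pool = equal_distance if rng.random() < bias else others
        training.append(rng.choice(pool))


sets = list(expansion_training_sets(5, random.Random(0)))
print([len(s) for s in sets])
```

The training set therefore starts at 100 patterns and contains one more pattern at each successive epoch.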

On the basis of the details provided by Shultz et al. (1994) the actual probabilities of sampling each of the constituent pattern types are given in Fig. 6. This clearly reveals the bias that operates against selecting a conflict pattern type. Indeed, within the pattern set, the largest category corresponds to the 200 instances where both weight and distance agree. Although this may seem a minor point it appears to be the major reason why the networks needed extended training to reach Stage 4.
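The pattern counts and selection probabilities can be checked by enumerating all 625 configurations. The sketch below assumes five weights and five distances per side and the 0.9 bias towards equal-distance problems; the type labels and function names are illustrative.

```python
from collections import Counter
from itertools import product


def pattern_type(wl, dl, wr, dr):
    """Classify one configuration (left weight, left distance, right weight, right distance)."""
    if dl == dr:
        return "balance" if wl == wr else "weight"
    if wl == wr:
        return "distance"
    # Both weight and distance differ between the two sides.
    heavier_left = wl > wr
    farther_left = dl > dr
    if heavier_left == farther_left:
        return "weight-and-distance-agree"
    torque_left, torque_right = wl * dl, wr * dr
    if torque_left == torque_right:
        return "conflict-balance"
    falls_left = torque_left > torque_right
    # Conflict-weight: the side with the greater weight goes down.
    return "conflict-weight" if falls_left == heavier_left else "conflict-distance"


counts = Counter(pattern_type(*p) for p in product(range(1, 6), repeat=4))


def selection_probability(kind, bias=0.9):
    """Probability of drawing a pattern of the given type on one sampling step."""
    n_equal_distance = counts["weight"] + counts["balance"]
    if kind in ("weight", "balance"):
        return bias * counts[kind] / n_equal_distance
    n_other = sum(counts.values()) - n_equal_distance
    return (1 - bias) * counts[kind] / n_other


for kind in sorted(counts):
    print(kind, counts[kind], round(selection_probability(kind), 4))
```

On this enumeration the three conflict types together account for 200 of the 625 patterns but for a combined selection probability of only 0.04.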

Selection bias             Pattern type          Proportion   Probability of selection
P(0.9) Weight/Balance      Weight                100/125      0.72
                           Balance               25/125       0.18
P(0.1) Others              Distance              100/500      0.02
                           Weight or Distance    200/500      0.04
                           Conflict-distance     88/500       0.0176
                           Conflict-weight       88/500       0.0176
                           Conflict-balance      24/500       0.0048

Fig. 6. The selection biases that operate in the training regime used by Shultz et al. (1994) and replicated here. The figure shows the probabilities associated with selecting a particular type of pattern prior to each new epoch.

4.3.2. Network testing in the Shultz et al. (1994) simulations

Now consider the testing of the networks that was undertaken. Here, the testing was said to be “inspired” by Siegler’s (1976, 1981) work with children. A 24-item test set was used and the following rule assessment was made of a given network’s stage of development.

Stage 4 development was indicated by correct classification on 20 or more of the 24 patterns.

Stage 2 development was indicated by correct classification on 13 or more of the 16 balance, weight, distance and conflict-weight patterns together with correct classification of 3 or fewer of the eight conflict-distance and conflict-balance patterns.

Stage 3 development was indicated by correct classification on 10 or more of the 12 balance, weight and distance patterns together with correct classification of fewer than 10 of the 12 conflict patterns.

Stage 1 development was indicated by correct classification on 10 or more of the 12 balance, weight and conflict-weight patterns together with correct classification of 3 or fewer of the 12 distance, conflict-distance and conflict-balance patterns.

Crucially, the stages were assessed in an ordered fashion whereby Stage 4 was assessed first, then Stage 2, then Stage 3 and finally Stage 1, and this is not part of the standard rule assessment procedure (Siegler, 1976, 1981). Fig. 5 shows the stages of development graph presented by Shultz et al. (1994) for one network that progressed through the four stages of development in an orderly fashion and also exhibited stable Stage 4 performance after 200 epochs of training. However, having used such a strict scoring scheme, the graph fails to reflect any Rule 0 performance and borderline cases are also disallowed (cf. McClelland, 1989). Although such issues may seem minor they are critical in defining what a given stage of development is taken to be.
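The consequences of assessing the stages in a fixed order can be illustrated with a short sketch of the scoring scheme. It assumes four test items of each of the six problem types (24 in all) and codes the Stage 3 conflict criterion as fewer than 10 correct; the function names and the `order` parameter are hypothetical.

```python
def stages_satisfied(correct):
    """Return the set of stages whose criteria are met.

    correct: dict mapping each problem type to the number of items (0..4)
    classified correctly.
    """
    c = correct
    simple = c["balance"] + c["weight"] + c["distance"]
    conflict = c["conflict_weight"] + c["conflict_distance"] + c["conflict_balance"]
    met = set()
    if simple + conflict >= 20:
        met.add(4)
    if simple >= 10 and conflict < 10:
        met.add(3)
    if (simple + c["conflict_weight"] >= 13
            and c["conflict_distance"] + c["conflict_balance"] <= 3):
        met.add(2)
    if (c["balance"] + c["weight"] + c["conflict_weight"] >= 10
            and c["distance"] + c["conflict_distance"] + c["conflict_balance"] <= 3):
        met.add(1)
    return met


def ordered_stage(correct, order=(4, 2, 3, 1)):
    """Return the first stage in `order` whose criteria are met, else None."""
    met = stages_satisfied(correct)
    for stage in order:
        if stage in met:
            return stage
    return None


# A network that answers exactly as Rule II would.
rule_2_profile = {
    "balance": 4, "weight": 4, "distance": 4,
    "conflict_weight": 4, "conflict_distance": 0, "conflict_balance": 0,
}
print(stages_satisfied(rule_2_profile))
print(ordered_stage(rule_2_profile))
print(ordered_stage(rule_2_profile, order=(4, 3, 2, 1)))
```

A response profile that follows Rule II exactly meets the criteria for both Stage 2 and Stage 3, so the stage attributed to it depends on which of the two is checked first.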

So in summary, problems concerning both training and testing of the network models described by Shultz et al. (1994) have been identified in the absence of undertaking any further simulations. However, a much more thorough appraisal of the work is now provided in terms of new simulations of the balance scale task.

4.4. Replicating the work of Shultz et al. (1994)

Initially a report of work carried out in York will be set out although in actual fact two quite independent sets of simulations known, respectively, as the York and Amsterdam (Y and A) simulations were carried out. (The researchers in the different laboratories were pursuing very similar lines of inquiry with no knowledge of each other.) Initially therefore attention will focus on the work carried out in York. Later comparisons will be drawn with the A simulations.

The Y implementation of the various networks to be discussed was based on the programmed examples provided by Rogers (1997). The Y simulator was implemented on a Win2000 PC in C++ and the CC algorithm was developed with reference to Fahlman’s (1988) original LISP implementation and the C code written by Crowder (1990). Various benchmarking simulations were carried out in a bid to “tune” the free CC parameters to achieve “optimal” performance. As a result of this, it was found that learning speed was increased by allowing the networks to stagnate more quickly. In the Y simulations therefore the output change threshold was set at 0.1. (In the original reports of CC this value was set at 0.01.) This brought about the addition of the hidden units earlier on and this did facilitate the consequent reduction in error. Although this general pattern is robust, some cases were discovered in which after extended periods of training the eventual error was marginally greater for networks that added units early on when compared against counterparts that added units later on. Regardless, in all cases the addition of hidden units early on during training facilitated learning. (The complete set of parameter values in the simulations is provided in Appendix A.)

Given the fundamental theoretical consequences that follow from the results of Shultz et al. (1994), it was deemed important to replicate the particular pattern of results shown in Fig. 5. To this end CC networks were configured and run on the balance scale problems just described. Barring minor changes3 the basic results reported by Shultz et al. (1994) were found to be replicable using the Y simulator. Fig. 7a shows the results of one network that generated five hidden units during the course of training on the balance scale task using the York instantiation of the CC algorithm. Points where the hidden units were added are shown on the graph as vertical lines. Although the graph does not exactly mirror the example provided by Shultz et al. (1994; compare their Figure 4 and the current Figure 7a), it does reveal all of the same important characteristics: Namely, (a) stable Stage 4 performance at around 200 epochs, (b) a strictly sequential progression through the three precursor stages, and (c) the addition of hidden units during training. The only major difference is that, in the present example, the network generated three hidden units prior to Stage 4; in contrast, the network described by Shultz et al. (1994) generated only two hidden units. Despite some differences between the current CC simulations and those reported by Shultz et al. (1994) there is very good agreement across the two studies.

3 Shultz et al. (1994) reported simulations in which the input parameter values were defined in integers in the range 1–5. Pilot studies in York revealed that such input values tended to result in the networks appearing to fall into a cycle in which the error oscillated wildly. In the original CC algorithm a new hidden unit could be added after a certain number of epochs if the networks had failed to stagnate. This condition was removed in the York simulations and because of this the networks sometimes failed to learn. It was discovered that one way to overcome this limitation was to simply decrease the inputs to values less than one.

Fig. 7. Two different renditions of a stages of development graph. (a) An example of a network that closely replicates the findings of Shultz et al. (1994: see Figure 5 here). (b) The same data but scored without order restriction as in the original procedure of Siegler. The vertical lines indicate the epochs at which new hidden units were added. [Axes in both panels: Epoch (0–300) against Rule (0–4).]


4.4.1. Further examination of the networks’ performance

During the course of this investigation it became apparent that the assessment of the networks tended to reflect the hierarchical nature of the testing method adopted by Shultz et al. (1994). In this reversed hierarchical scoring method, the most theoretically complex rule is tested first, and, importantly, the four rules are not mutually exclusive despite the rather strict scoring regimen defined and used by Shultz et al. (1994). Indeed, when the simultaneous assessment of each of the four stages was undertaken after each training epoch a rather different picture emerged.

Fig. 7b shows how the results in Fig. 7a can be re-drawn when all four stages were assessed after each epoch. Comparing the two parts of Fig. 7 it is clear that the original description of the network’s performance is crucially dependent on the means by which rule-use has been assessed. Using the less constrained means (see Fig. 7b) it becomes apparent that the serial progression through the ordered stages of development is no longer so obvious – for instance between 80 and 100 epochs the network appears to be able to mimic, simultaneously, Stage 1, 2 and 3 rule-use. Moreover, an ambiguity remains over the degree to which the network has achieved stable Stage 4 rule use. To be clear, Rules III and IV can simultaneously be diagnosed using the scoring procedure adopted by Shultz et al. (1994). For Rule IV to be diagnosed 20 or more of the 24 patterns need to be correct. For Rule III to be diagnosed 10 or more of the 12 non-conflict patterns need to be correct and fewer than 10 of the 12 conflict patterns need to be correct. The simultaneous diagnosis occurs in one of three possible ways:

with 11 non-conflict patterns and 9 conflict patterns correct (total = 20),
with 12 non-conflict patterns and 9 conflict patterns correct (total = 21),
with 12 non-conflict patterns and 8 conflict patterns correct (total = 20).
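That these are the only three joint outcomes can be confirmed by exhaustive enumeration. The sketch below uses the two criteria exactly as just stated (Rule IV: 20 or more of 24 correct; Rule III: 10 or more of the 12 non-conflict patterns and less than 10 of the 12 conflict patterns correct); the function name is hypothetical.

```python
def joint_rule_3_and_4(n_non_conflict=12, n_conflict=12):
    """List every (non-conflict correct, conflict correct) pair that satisfies
    the Rule III and Rule IV criteria at the same time."""
    ways = []
    for non_conflict in range(n_non_conflict + 1):
        for conflict in range(n_conflict + 1):
            rule_4 = non_conflict + conflict >= 20
            rule_3 = non_conflict >= 10 and conflict < 10
            if rule_3 and rule_4:
                ways.append((non_conflict, conflict))
    return ways


print(joint_rule_3_and_4())  # [(11, 9), (12, 8), (12, 9)]
```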

Such examples as these call into question the validity of using the particular scheme advocated by Shultz et al. (1994) in attributing rule-use to the networks. Given such concerns it was felt important to examine alternative means for establishing rule-use and this exercise forms the basis of the final sections of the paper.

As a further important aside, during the course of running the simulations, cases were found where networks were able to achieve stable Stage 4 performance without having generated any hidden units at all (see Fig. 8; this is one example taken from the Y simulations but similar cases were also uncovered in running standard CC networks using the A simulations). Again, though, it must be borne in mind that the Shultz et al. (1994) rule assessment method was used here. Nevertheless, this was a quite unexpected result, but it accords well with sentiments expressed by Fahlman (personal communication) in that many problems turn out to be linearly separable when examined with fast gradient descent procedures such as Quickprop.

The present finding conveyed in Fig. 8 is of interest and is not one that is mentioned by Shultz et al. (1994). In the Y simulations hidden units were only added if and when learning stagnated. Shultz et al. (1994) also allowed their networks to generate a new hidden unit after a given number of epochs regardless of whether or not the network had stagnated. Hence their chances of discovering a network without hidden units that attained Stage 4 were minimal. That such networks exist, however, provides a demonstration proof that hidden units play no necessary role in the acquisition of Stage 4 classification when diagnosed by the methods advocated by Shultz et al. (1994). More importantly such simulations as illustrated in Fig. 8 reveal that the apparent abrupt shifts in performance are clearly not necessarily linked to the addition of hidden units.

Fig. 8. A stages of development graph that shows an example of a network whose results have been scored using the Shultz et al. (1994) method, but this network was able to attain Rule 4 without the addition of any hidden units. The vertical lines indicate the epochs at which new hidden units were added and as can be seen hidden units were only added after Stage 4 had been reached. Axes: Rule (0–4) plotted against Epoch (0–300).

Given the problems that have been uncovered with the previous methods for attributing rule-use to the networks, some initial work was undertaken to consider how the networks might be solving the problem. For instance, there is evidence that some children hit on the notion that balance scale problems may be solved by simply comparing the sums of the weight and distance values on each side of the fulcrum or use a buggy-rule. They consequently respond with the side with the greater sum (Ferretti et al., 1985; Jansen & van der Maas, 2002; Normandeau et al., 1989). To explore whether the current networks also recovered and operated with respect to such an addition rule, the following examinations were undertaken.

A proper understanding of torque is based on comparing the product of the weight and distance on one side of the fulcrum with the corresponding product on the other side. Despite this, a surprising number of the current pattern set can be solved by a rule of addition – 573 of the 625 patterns. Just 52 patterns (henceforth the torque patterns) from the total 625 cannot be solved by application of the addition rule. These patterns – 24 conflict distance, 24 conflict weight, and, 4 conflict balance patterns – provide a critical means for discovering whether or not the networks are merely operationalising the rule of addition. That is, an ability to respond correctly to these patterns would be strong evidence against the idea that the networks had simply discovered the rule of addition. Therefore, ten different networks were trained, as before, but training was halted after 300 epochs. The previous work had established that this length of training was sufficient to ensure that the networks had achieved Stage 4 performance as assessed by the Shultz et al. (1994) method of assessment.
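The pattern counts quoted above can be reproduced by exhaustive enumeration. The sketch below assumes that weights and distances each take the integer values 1 to 5 on both sides (giving 5^4 = 625 patterns), and it labels a disagreeing pattern as conflict-weight when the heavier side goes down and as conflict-distance otherwise. The variable names are ours.

```python
from itertools import product

def sign(x):
    """Return +1, 0 or -1 according to the sign of x."""
    return (x > 0) - (x < 0)

values = range(1, 6)  # assumption: five weight values and five distance values
agree = 0
torque_patterns = {"conflict-weight": 0, "conflict-distance": 0, "conflict-balance": 0}

for wl, dl, wr, dr in product(values, repeat=4):
    torque = sign(wl * dl - wr * dr)          # +1 left down, -1 right down, 0 balance
    addition = sign((wl + dl) - (wr + dr))    # prediction of the addition rule
    if torque == addition:
        agree += 1                            # addition rule gives the correct answer
    elif torque == 0:
        torque_patterns["conflict-balance"] += 1
    elif torque == sign(wl - wr):
        torque_patterns["conflict-weight"] += 1    # side with more weight goes down
    else:
        torque_patterns["conflict-distance"] += 1  # side with more distance goes down

print(agree, torque_patterns)
```

Under these assumptions the enumeration yields 573 patterns on which addition and torque agree and 52 on which they do not (24 conflict-weight, 24 conflict-distance and 4 conflict-balance).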


Following this training, the networks were tested with the 52 torque patterns. The average number of hidden units generated during the course of training across the ten networks was 5.4 (range 3–8). The average correct performance over these 52 torque patterns was 10.7%, which falls below even the level expected from chance responding. To examine performance further, a more detailed exercise was undertaken in a bid to see how the networks were dealing with the different types of torque patterns. Attention will focus on the conflict weight and conflict distance patterns, as the number of conflict balance patterns was small. Fig. 9 provides a schematic breakdown of performance with these two pattern types. Comparing across the two parts of Fig. 9 it is possible to see that, when the networks were correct, they tended to exhibit either a bias towards the weight dimension or a bias towards the distance dimension. Two of the networks exhibited a response bias towards the weight dimension (networks 4 and 9). Of the eight remaining networks, network 6 exhibited a response bias towards the distance dimension and the rest appeared to mimic the rule of addition.

In summary, the picture that emerges from this analysis is that there is an overwhelming tendency for the networks to use addition even in cases where such a rule produces the wrong response. When the problems cannot be solved by the rule of addition (as is the case with the torque problems) the networks tend to fail. Some networks did produce a correct response purely as a function of responding on the basis of either weight information or distance information. It is therefore difficult to avoid the conclusion that there is no evidence that the networks reveal any mastery of the principle of torque. Understanding the principle demands a proper consideration of both weight and distance information.4

5. Further Examination of the Nature of Learning and Rule-use by the CC Algorithm by means of LCA

In order to examine the performance of CC on the balance scale task further, the approach set out by Jansen and van der Maas (1997, 2002) was adopted. Details are provided below of these methods, but, in brief, the intention was to take a sample of CC networks and examine their performance on the simulated balance scale task via LCA. The primary aims were (a) to see whether CC networks do recover rules of operation of the balance scale, (b) to uncover the nature of these rules, and, (c) to then draw comparisons with the extant human data.

4 This is not to argue that, in the limit, the networks would be unable to learn the correct classification response for all patterns. As should be obvious, by allowing learning to continue indefinitely it is feasible to assume that enough hidden units would be generated to capture information about all 625 patterns (Marcus, 1998a; Mareschal & Shultz, 1996). In the limit the network would develop pathways specific to particular patterns. Indeed from further simulations in which the networks were allowed to run for 1000 epochs demonstrable reductions in the global error were still being observed. Nevertheless, the networks were still incorrect on the majority of the critical 52 torque patterns. It remains possible that some form of CC network could be discovered that was adept at learning the complete pattern set. However, the psychological relevance of such a result remains to be seen.

Fig. 9. Histograms showing performance for ten different networks on the 20 conflict-weight and conflict-distance torque patterns. The upper panel shows the performance on the conflict-weight patterns and the lower shows performance on the conflict-distance patterns. The responses have been broken down according to correct responses (white), incorrect responses consistent with the application of the rule of addition (grey), and, incorrect responses that are uninterpretable (black).


5.1. Application of LCA to responses on balance scale items

In LCA a distinction is made between manifest and latent variables. At a very general level the idea is to discover the latent variables from data derived from the manifest variables. With respect to the balance scale task, the manifest variables are the actual balance scale patterns and values on these are the responses. In the task, legitimate responses are “fall to the left”, “fall to the right” or “balance”, but in the analysis reported, the data were simply scored as correct or incorrect. A latent variable is the unobserved or underlying variable that is generating the response and, in the present application, a given latent variable refers to a putative cognitive rule of proportional reasoning. In LCA both manifest and latent variables are taken to be categorical.

The analysis eventuates in a description of the data set across individuals in terms of a set of latent classes whereby each individual is assumed to belong to one and only one latent class. By this characterisation, a given individual is assumed to apply a particular rule of classification consistently across all test items. The parameters of a latent class model are (a) the unconditional probabilities (ucps) that specify the sizes of the latent classes, and (b) the conditional probabilities (cps) of giving a certain response for each balance scale item for each latent class. The ucps sum to one across the different classes. In this case, the cps correspond to either responding correctly or incorrectly to an item. By way of example, the class that corresponds to Rule I has high cps for responding correctly to weight and conflict-weight items together with low cps for responding correctly to the distance, balance and the remaining conflict items. Other profiles are consistent with the other rules and the corresponding detailed predictions are set out in Table 1. The cps in Table 1 are not expected to be exact because children may err or accidentally guess the correct answer. So even though the data may reflect various forms of measurement error, LCA can accommodate such variability.
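The role of the two parameter sets can be made concrete with a minimal sketch. It assumes the standard latent class formulation in which responses to different items are independent within a class (local independence), an assumption that is not spelled out in the text above. The two classes and all numerical values are purely illustrative and are not estimates from any data set.

```python
from itertools import product
from math import prod

def pattern_probability(pattern, ucps, cps):
    """Probability of a 0/1 response pattern under a latent class model.

    P(pattern) = sum over classes c of ucp_c * prod over items i of
                 cp_ci if item i is correct, else (1 - cp_ci).
    """
    total = 0.0
    for ucp, class_cps in zip(ucps, cps):
        total += ucp * prod(p if x else 1.0 - p for x, p in zip(pattern, class_cps))
    return total

# Illustrative two-class model over three items (weight, distance, conflict-balance).
ucps = [0.6, 0.4]               # class sizes; they sum to one
cps = [[0.9, 0.1, 0.1],         # a Rule I-like profile: only the weight item is solved
       [0.9, 0.9, 0.9]]         # a profile that solves all three items

p_weight_only = pattern_probability((1, 0, 0), ucps, cps)
p_all = sum(pattern_probability(pat, ucps, cps) for pat in product((0, 1), repeat=3))
print(p_weight_only, p_all)
```

Because the cps are high or low rather than exactly one or zero, every response pattern receives some probability under every class, which is how the model accommodates occasional errors and lucky guesses.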

Although Siegler’s (1981) rule assessment methodology has the same purpose as LCA (that of dividing the response patterns into a finite number of qualitatively different rules), LCA does so in a statistically advanced way that offers a number of important advantages over the traditional rule assessment methodology. First, the technique does not depend on any a priori knowledge of putative rules. It groups similar response patterns together and segregates these from dissimilar ones by statistical means. Contrary to the rule classifications of the rule assessment methodology adopted by Siegler (1976; see Jansen & van der Maas, 1997), the latent classes reflect the structure in the observed data and arise independently from the rules postulated in any particular theory. Second, the technique does not use an arbitrary criterion to assign response patterns to latent classes, but an objective, statistical one. Third, the criterion is easily accommodated across item sets. Fourth, measures of statistical fit indicate whether a latent class model accurately describes the data or whether a different number of classes is needed. Finally, by using different types of restrictions, confirmatory analyses can be performed.

Table 1
Ideal values for expected latent class models for responses to the various balance scale items based on particular rule-use

                  Conditional probabilities of answering an item correctly
Rule/model        W1–5    D1–5    CB1–5     CW1,2,3   CW4,5   CD1–4   CD5
RI                1 (a)   0 (b)   0         1         1       0       0
RII               1       1       0         1         1       0       0
RIII              1       1       0.33 (c)  0.33      0.33    0.33    0.33
RIV               1       1       1         1         1       1       1
Buggy/addition    1       1       1         0         1       0       1

Note. W, weight items; D, distance items; CB, conflict-balance items; CW, conflict-weight items; and CD, conflict-distance items. A complete definition of the pattern set is provided by Jansen and van der Maas (1997, Table 2).
(a) Corresponds to a high conditional probability, near 1.
(b) Corresponds to a low conditional probability.
(c) Corresponds to a conditional probability indicative of guessing (i.e., approximately 0.33).


Although LCA is not particularly well known in experimental psychology, and has yet to be incorporated into any student textbook of psychological statistics, it is, nevertheless, now a well-established statistical technique since being developed in the 1960s (see Goodman, 1975; Lazarsfeld & Henry, 1968). Hundreds of applications have been reported in the last five years in areas as diverse as medicine, economics, sociology, psychiatry, marketing, and epidemiology. More critical, in the present context, are the applications in the developmental psychology literature; particularly those dealing with the balance scale task (see Boom et al., 2001; Jansen & van der Maas, 1997, 2002).

5.2. Analysis of the current CC networks

As was noted previously, two sets of quite independent simulations were carried out. The Y simulations used the CC parameters as already discussed and the A simulations used the standard CC parameters as defined by Fahlman (1988). Pilot work suggested that, with respect to the current five distance/five weight balance scale, the A networks could achieve Stage 4 performance by 300 epochs. As described previously, in the Y simulations the output change threshold was set at 0.1 and this allowed the networks to stagnate earlier. Pilot work showed that these networks could achieve Stage 4 performance by 250 epochs according to the criteria adopted by Shultz et al. (1994).

Therefore, the “age range” of the networks examined was defined as being between 1 and 300 epochs for the A simulations and 1 and 250 epochs for the Y simulations. Each network was initialised with a different set of random weights and allowed to run for a random number of epochs within the specified age range. Performance of each network was then examined by assessing responses to a 25-item test set comprising five each of the weight, distance, conflict-balance, conflict-weight and conflict-distance configurations. This test set is the same as that used previously by Van Maanen et al. (1989) in collecting and analyzing an empirical data set; by Jansen and van der Maas (1997; see their Table 2) in analyzing both empirical and simulated data sets; and by Boom et al. (2001). For every network each response was scored as correct or incorrect across the 25 test items. Both the A and Y data sets comprised the responses of 500 different networks. This sample size is of the same order as used by Jansen and van der Maas (1997, 2002) in their empirical analyses. These data sets were then initially analyzed separately by LCA.

LCA was applied in a rather exploratory way. The procedure was started with fitting a latent class model comprising only two latent classes. The LCA program (here: PANMARK as developed by Van de Pol, Langeheine, & De Jong, 1996) estimated values for the parameters in the two-class model by maximising the log-likelihood by means of an EM-algorithm (details of this method can be found in Clogg, 1995; McCutcheon, 1987). In this particular application the ucps reflect the proportion of networks that exhibit a particular type of classification response. The fit measure Loglikelihood Ratio (G²) indicates the fit of the model, and is an index of the difference between the expected frequencies from the model and the observed frequencies of the response patterns in the data set. This index is compared to the degrees of freedom to decide whether the difference is too large to fit the data. A p-value larger than α (here .05) indicates that the difference between expected and observed frequencies is small enough to decide that the model accurately describes the data. As the data set is relatively small, compared with the number of possible response patterns, an empirical distribution was used in addition to the theoretical distribution of G². The empirical distribution is obtained by parametric bootstrapping (Langeheine, Pannekoek, & Van de Pol, 1995). The number of classes is increased until a model is found with a non-significant G².
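For readers unfamiliar with the fit index, the following minimal sketch computes the loglikelihood ratio statistic in its usual form, G² = 2 Σ O ln(O/E), summed over response patterns. The observed and expected frequencies are invented for illustration only, and the sketch does not attempt to reproduce PANMARK or the bootstrap procedure.

```python
from math import log

def g_squared(observed, expected):
    """Loglikelihood ratio statistic G^2 = 2 * sum(O * ln(O / E)).

    Cells with an observed frequency of zero contribute nothing to the sum.
    """
    return 2.0 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)

# Hypothetical frequencies for four response patterns (both sum to 100).
observed = [30, 10, 10, 50]
expected = [25, 15, 15, 45]

g2 = g_squared(observed, expected)
g2_perfect = g_squared(observed, observed)  # a model that reproduces the data exactly
print(g2, g2_perfect)
```

A larger G² means a larger discrepancy between model and data. With sparse tables, as here, its reference distribution is better approximated by simulating data sets from the fitted model than by the theoretical chi-square distribution, which is the motivation for the bootstrap mentioned above.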

Restrictions make a model more parsimonious. For instance, one may test whether the conditional probabilities for two items can be restricted to the same value. Models with non-significant G², which differ in the number of restrictions, are compared by means of the Bayesian Information Criterion (BIC: Schwartz, 1978). This combines fit and parsimony and the model with the lowest value is selected.
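The trade-off between fit and parsimony can be illustrated as follows. The sketch assumes the common definition BIC = -2 ln L + k ln N, where k is the number of free parameters and N the number of observations; whether PANMARK reports exactly this variant is not stated here. The candidate models and their values are hypothetical.

```python
from math import log

def bic(log_likelihood, n_parameters, n_observations):
    """Bayesian Information Criterion under the usual definition (an assumption here)."""
    return -2.0 * log_likelihood + n_parameters * log(n_observations)

# Hypothetical candidates: (log-likelihood, number of free parameters, sample size).
candidates = {
    "restricted": (-3300.0, 20, 1000),
    "unrestricted": (-3290.0, 40, 1000),
}

scores = {name: bic(*spec) for name, spec in candidates.items()}
best = min(scores, key=scores.get)  # the model with the lowest BIC is selected
print(scores, best)
```

In this invented example the unrestricted model fits slightly better, but the penalty for its additional parameters outweighs the gain, so the restricted model is preferred.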

Following application of the method, the classes of the model need to be interpreted. This was accomplished with the expected cps of Table 1 in mind. If a latent class does not match any of the expected latent classes, an alternative interpretation may explain the cps. Finally, response patterns can be assigned to the latent classes by means of posterior probabilities based on the selected model. Jansen and van der Maas (2002) were able to relate the children’s ages to particular latent classes. In the current application, the intention was to relate the network’s “age” (i.e., last training epoch) to the latent classes.
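Assignment by posterior probability amounts to an application of Bayes’ rule to the fitted ucps and cps. The sketch below assumes local independence of items within a class; the two classes, their labels and all parameter values are hypothetical and serve only to show the calculation.

```python
from math import prod

def assign_class(pattern, ucps, cps):
    """Return (index of most probable class, list of posterior probabilities)."""
    # Joint probability of the pattern and each class: ucp * likelihood within class.
    joint = [u * prod(p if x else 1.0 - p for x, p in zip(pattern, c))
             for u, c in zip(ucps, cps)]
    total = sum(joint)
    posteriors = [j / total for j in joint]
    best = max(range(len(posteriors)), key=posteriors.__getitem__)
    return best, posteriors

# Hypothetical model over three items (distance, conflict-weight, conflict-distance).
ucps = [0.5, 0.5]
cps = [[0.05, 0.95, 0.05],   # class 0: a Rule I-like profile
       [0.95, 0.95, 0.05]]   # class 1: a Rule II-like profile

# A network that fails the distance item but solves the conflict-weight item.
best, posteriors = assign_class((0, 1, 0), ucps, cps)
print(best, posteriors)
```

Here the response pattern is assigned to class 0 with a posterior probability of 0.95, because only the distance item discriminates between the two hypothetical profiles.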

LCA is applied to data expressed in a frequency table where the number of cells in the frequency table equals categories^items (the number of response categories raised to the power of the number of items). For the present data set, the frequency table contains 2^25 = 33,554,432 cells and there are many more cells than observations in the frequency table. Although Boom et al. (2001) have provided promising results by analysing all items simultaneously, we prefer a different method to deal with the large number of items in the balance scale test. Here each subset of five items of each of the five types was analyzed separately.
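The sparseness problem, and the gain from analysing five items at a time, follows from simple arithmetic. The sketch uses only figures given in the text (two response categories, 25 items, 500 networks per data set); the variable names are ours.

```python
n_categories = 2     # each response is scored as correct or incorrect
n_items = 25         # full test set
n_networks = 500     # size of each of the A and Y data sets

full_table_cells = n_categories ** n_items   # cells when all items are analysed together
subset_cells = n_categories ** 5             # cells when one item type (five items) is analysed

print(full_table_cells, subset_cells)
print(n_networks / full_table_cells, n_networks / subset_cells)
```

With all 25 items the table has far more cells than there are networks, whereas each five-item table has only 32 cells, so that the 500 observations per data set cover it much more densely.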

This form of analysis also makes it possible to test item homogeneity. Item homogeneity refers to an assumption in Siegler’s (1981) rule theory that children solve items of the same type in the same way. Item homogeneity is revealed when the cps across a given type of item are equal and this indicates that each of the items is being responded to in the same way. The expected numbers of latent classes for each item type can be derived from Table 1 and because item homogeneity was found in most cases, only a summary of these analyses is included in the initial part of the Results section. Given that item homogeneity was generally high, we attach most importance to our analyses of the responses to a selected set of items of different item types. By such analyses it was possible to recover a better understanding of how the networks were responding to the range of patterns that make up the balance scale task. Next, the selected model of the combination of items was used to assign the response patterns to the most probable latent class by means of a posteriori probabilities. Finally, by using the epoch number, it was possible to show the propensity of use of the different rules across the sample of networks tested.


5.3. Results

5.3.1. Testing for item homogeneity

The A and Y data sets were analysed separately with LCA and the similarities between the two data sets are striking and are conveyed in Table 2. Table 2 provides, in summary form, the latent classes recovered for both data sets together with interpretations. (A more thorough summary of the analyses is included in Appendix B.) It is of some interest to note the points of contact with human data. For instance, both A and Y data sets show clear evidence for the use of Rule I, especially on the distance items. There is also some evidence for other rules (i.e., Rule II, and the addition rule), but this evidence is not conclusive. In sum, several latent classes are consistent with rule use but these cannot uniquely be attributed to a single rule. The analyses presented in the next section are more informative in this respect, but an important preliminary conclusion can be drawn: the present analyses have revealed evidence for rule-use in CC networks and this aspect of performance accords with similar evidence found in humans.

Table 2
Summary of behaviour of the A and Y networks on the test items as shown by the item homogeneity analyses

        Data set A                            Data set Y
LC      ucp     Interpretation                ucp     Interpretation

Weight items
1       0.89    Any rule                      0.81    Any rule
2       0.01    ?                             0.05    Left bias
3       0.02    Right bias                    0.08    ?
4       0.07    ?                             0.06    ?

Distance items
1       0.25    Rule I                        0.42    Rule I
2       0.67    Any rule except Rule I        0.49    Any rule except Rule I
3       0.04    Left bias                     0.03    Left bias
4       0.05    Right bias                    0.05    Right bias
5 (a)   -                                     0.01    ?

Conflict-balance items
1       0.38    Rule I or II                  0.27    Rule I or II
2       0.37    Rule IV or addition/buggy     0.08    Rule IV or addition/buggy
3       0.22    Addition bias                 0.47    Addition bias
4       0.02    Integration rule              0.17    Integration rule?

Conflict-weight items
1       0.31    Rule I, II, or IV             0.34    Rule I, II, or IV
2       0.63    Addition/buggy                0.27    Addition/buggy
3       0.02    ?                             0.36    ?
4       0.04    ?                             0.03    Right bias

Conflict-distance items
1       0.93    Rule I, or II                 0.85    Rule I, or II
2       0.004   Rule IV                       0.03    Rule IV
3       0.02    Left Bias                     0.04    Left Bias
4       0.04    ?                             0.08    ?

Note. “LC” is an abbreviation for “latent class”. A “?” indicates a case where the class is uninterpretable.
(a) Whereas a model with four latent classes was sufficient for responses to the distance items in the A data set, a five class model was needed for the Y data set.

Nevertheless, there are also obvious and important disparities between the human and network data. For instance, some small latent classes in the network data were associated with a bias for the left or the right side of the scale and such biases have not been found in humans. Moreover, it is questionable whether the networks show true Rule IV behaviour. Latent classes that showed correct responses to all conflict-distance items were very small (i.e., 0.004) and correct responses to the remaining conflict items can also be achieved with Rule I, Rule II (for the conflict-weight items) or the addition/buggy rule (for most conflict-weight items, and all conflict-balance items). These data therefore fail to provide any firm evidence that the networks had mastered the principle of torque. Finally, in several cases (see Table 2), the latent classes were uninterpretable. Such uninterpretable patterns reveal that certain CC networks were responding in ways that deviate from all known forms of human classification. Indeed, it is generally the case that the large majority of classes identified in human data can be readily understood.

5.3.2. Combination of item types

In the previous analyses, item homogeneity was shown for most latent classes: the cps across items were highly similar in most cases. Where there were clear violations of homogeneity these arose because of a bias to respond to one side of the scale. Given that the networks were generally consistent in their performance with items of a particular type, it was now important to examine performance across combinations of items of different types. To assess this aspect of performance, we tested the networks’ performance with the same combination of items as used by Jansen and van der Maas (1997). The critical list of test items comprised two distance items; a conflict-balance item that requires three buggies (i.e., CB3; see Table 3); a conflict-balance item that requires only one buggy (i.e., CB1); a conflict-weight item that can be solved correctly with the addition rule (i.e., CW4); a conflict-weight item that cannot be solved correctly with the addition rule (i.e., CW1); a conflict-distance item that can be solved correctly with the addition rule (i.e., CD5); and a conflict-distance item that cannot be solved correctly with the addition rule (i.e., CD1).

Following Jansen and van der Maas (1997) weight items were not included in this test set because success with these items does not provide discriminatory evidence in favour of any particular rule. The predictions concerning the latent classes in the expected latent class model can be derived from Table 1.

The A and Y data sets were analysed simultaneously in a multi-group analysis. All cps were restricted to be equal between the two data sets whereas all ucps were estimated freely. The nine-class model was selected because the bootstrapped p-value of the loglikelihood ratio did not reach statistical significance (G² = 61.24, p = .54). The BIC was also lower for the nine-class model than for any other model (BIC = 6761.01), hence, nine classes were needed to model the data accurately. This strongly suggests that the analyses have recovered general patterns of performance that can be ascribed to the CC algorithm with some confidence. The results provide clear indications of how the CC networks learn the balance scale task. Clearly the fit is not exactly the same for both data sets, but differences in the ucps of the models may be traced to differences in some of the free parameter values adopted in the two sets of simulations. Table 3 contains the estimated values for the ucps and cps of the nine-class model.

Table 3
Estimated values of the parameters of the nine-class model for the responses to the critical combination of eight items

      ucp (A) ucp (Y) D1      D2      CB1     CB3     CW1     CW4     CD1     CD5     Interpretation
                      #b = 0  #b = 0  #b = 3  #b = 1  #b = 1  #b = 2  #b = 2  #b = 2
                      A = C   A = C   A = C   A = C   A = F   A = C   A = F   A = C
                      C = 1   C = 1   C = 2   C = 2   C = 3   C = 1   C = 3   C = 1
LC1   0.17    0.17    0.01    0.01    0.01    0.00    1.00    1.00    0.02    0.00    RI
LC2   0.10    0.04    1.00    1.00    0.04    0.00    1.00    1.00    0.00    0.00    RII
LC3   0.05    0.14    1.00    1.00    0.07    0.07    0.70    0.99    0.00    0.99    RII/Add
LC4   0.38    0.25    1.00    1.00    0.55    0.98    0.01    1.00    0.00    0.95    Addition
LC5   0.14    0.03    0.91    0.96    0.00    0.97    0.02    1.00    0.00    0.08    Addw
LC6   0.08    0.28    0.01    0.00    0.00    0.94    0.03    1.00    0.00    0.00    AddW
LC7   0.03    0.07    1.00    1.00    0.33    0.79    0.04    0.89    1.00    0.94    AddD
LC8   0.02    0.02    0.00    0.00    0.25    0.00    1.00    0.00    0.90    0.00    Right bias
LC9   0.03    0.00    0.00    0.00    0.79    1.00    0.00    0.44    0.06    0.00    Always balance

Note. ucp (A), ucp (Y), unconditional probabilities for data sets A and Y, respectively. D1, distance item 1; D2, distance item 2; CB1, conflict-balance item 1; CB3, conflict-balance item 3; CW1, conflict-weight item 1; CW4, conflict-weight item 4; CD1, conflict-distance item 1; and CD5, conflict-distance item 5 (see Jansen and van der Maas, 1997, Table 2). #b, number of buggies; A = C, addition rule results in correct response; A = F, addition rule results in false response; C = 1, 2, and 3, correct response is “left side down”, “balance”, “right side down”, respectively. Addw signifies “addition, with a small preference for weight”, AddW signifies “addition with a large preference for weight”, AddD signifies “addition with a preference for distance”. RII/Add signifies variant Rule II performance: Rule II accounts for everything apart from the responses to CD5.

5.3.3. CC latent classes in detail

The cps in the first latent class (i.e., LC1 where ucp = .17 in both data sets, see Table 3) were low for all items, except for the conflict-weight items. This pattern of cps was to be expected if Rule I was being used. The cps in LC2 were high for the distance items and the conflict-weight items, but low for the conflict-balance and the conflict-distance items. This matched the expected cps for Rule II. In contrast to findings to emerge from the LCA of data of the McClelland model carried out by Jansen and van der Maas (1997), the LCA of the CC model indicates good evidence of Rule I and Rule II behaviour. Although this pattern of results is interesting it is perhaps not so surprising given the bias in the training set towards weight items.

The cps in LC3 were very similar to those in LC2, except for the cp relating to answering CD5 correctly, which was close to one instead of zero. It is possible to solve CD5 correctly with the addition rule. The best we can make of this class is a mix of Rule II and Addition, which is not a mix of rules seen in children’s data. LC4 was quite large and the cps generally matched the expected pattern for the addition rule. There were high probabilities of giving the correct responses to the distance and conflict-balance items as well as to items CW4 and CD5. The only anomaly is the cp of .55 of CB1, which should have a value of near one.

Interpretation of the next three classes is even less clear-cut. For LC5 performance was relatively high on the two distance items and also on CB3 and CW4, but not CD5. This seems to suggest a pattern consistent with the addition rule together with some preference for using weight information. With LC6 performance was particularly poor with the distance and conflict distance items but again performance with CB3 and CW4 was high. This suggests evidence of the addition rule with a very strong preference for weight information. Evidence for the acquisition and application of the addition rule in its pure form is further questioned by the pattern of cps for LC7. The main anomaly in this pattern of cps is the high performance for CD1. This suggests the use of addition with some preference for distance information. Overall, the patterns that have emerged for these three latent classes (i.e., LC5, LC6 and LC7) are suggestive of the adding-type/integration rules proposed by Wilkening and Anderson (1982). For example, in such cases it may be hypothesized that weight is twice as important as distance. Anderson and Cuneo (1978) have presented evidence for such rules in several judgment of quantity tasks, but no such evidence has been found in studies of the standard balance scale task (Jansen & van der Maas, 1997, 2002). The evidence here is that the networks do apply such rules whilst learning about the balance scale. In sum it may be concluded that although four or five classes resemble the addition rule none of the classes matches the addition rule exactly.

The remaining two classes are relatively easy to understand. The cps in LC8 were high only for giving the correct response to CW1 and CD1. These were the only items for which the correct response was “scale tips to the right side”. Hence, this pattern may be explained by means of a bias for the right side of the scale. Such a bias has never been found in human data. Finally, LC9 showed high cps of answering the conflict-balance items correctly. The cps for the remaining items were very low (except in the case of CW4). This pattern is indicative of responding “Balance” to all of the items. Collectively, LC3 and LC5 to LC9 account for 35% and 54% of the A and Y data sets, respectively. In this regard, the networks do reveal a propensity to solve the balance scale problems in ways never before seen in humans.

Finally, note that in accordance with the results reported in Fig. 9, no evidence for Rule IV was found. On these grounds it seems that the claim that “cascade correlation networks learned to perform on balance scale problems as if they were following rules, including clear performance at the level of the rule that characterizes stage 4” (Shultz et al., 1994), is simply incorrect.

5.3.4. Rule development

Although the critical results have been conveyed via the foregoing statistical analyses of the networks’ data, further insights into the behaviour of the networks can be gained by an alternative approach involving graphing the results. Fig. 10 is redrawn from Jansen and van der Maas (2002). In that study the ages of the participants ranged from 5 to 19 years, and following LCA of the children’s responses it was possible to estimate the proportion of children of a given age who could be assigned to a particular latent class. Fig. 10 provides a cross-sectional (not longitudinal) illustration of the data and as such all the graphs reveal is the propensity to use a particular rule at a given point in developmental time across the target sample. In this regard the data reveal developmental trends across, but not within, individuals.

In this data set, Jansen and van der Maas (2002) observed classes that corresponded to the use of Rule I, Rule II, Rule III, Rule IV, the addition rule and two deviant classes. The cps for one of these deviant classes suggested the use of a combination of the addition rule and Rule III, whereas the cps for the second deviant class suggested that the children were always responding that the scale would tip to the side on which the weights were nearest to the fulcrum (this is the evidence for the use of the smallest distance down rule).

Fig. 10. This figure is, in part, replicated from Jansen and van der Maas (2002) and shows data collected from a human sample. Each graph reflects the proportion of children at each age that is applying the particular rule identified.

The reason for including Fig. 10 here is that it allows further comparisons to be drawn across human and network data sets. In this regard, the data from the networks were assigned to the most probable latent class in the nine-class model, identified above, by means of the a posteriori probabilities. For the A and Y data sets performance was assessed across the entire “age range” of the networks. This was achieved by dividing the time line into intervals of 25 epochs. Next the frequency of each rule was counted for each of the 25-epoch intervals and was then divided by the total number of response patterns in that interval. The resulting proportions were plotted against the epoch numbers. For any given time point the graphed data points sum to one. Fig. 11 shows the developmental trends for the Y data set, whereas Fig. 12 shows the developmental trends for the A data set.
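The two steps just described (modal assignment to a latent class, then proportions per 25-epoch interval) can be sketched as follows. This is only an illustrative sketch and not the analysis code used for the simulations: the function names, the binary correct/incorrect scoring of items, the interval boundaries (0–24, 25–49, and so on) and the toy numbers are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def most_probable_class(response, class_sizes, cond_probs):
    """Return (index of the modal latent class, posterior probabilities).

    response    -- list of 0/1 scores (1 = item answered correctly)
    class_sizes -- unconditional probability of each latent class
    cond_probs  -- cond_probs[c][i] = P(correct on item i | class c)
    """
    joint = []
    for size, probs in zip(class_sizes, cond_probs):
        likelihood = size
        for score, p in zip(response, probs):
            likelihood *= p if score else 1.0 - p
        joint.append(likelihood)
    total = sum(joint)
    if total == 0.0:
        raise ValueError("response pattern has zero probability under every class")
    posterior = [value / total for value in joint]
    best = max(range(len(posterior)), key=posterior.__getitem__)
    return best, posterior


def rule_proportions(assignments, interval=25):
    """Proportion of response patterns assigned to each rule per epoch interval.

    assignments -- iterable of (epoch, rule_label) pairs, one per response pattern
    """
    counts = defaultdict(Counter)
    for epoch, rule in assignments:
        counts[(epoch // interval) * interval][rule] += 1
    proportions = {}
    for start in sorted(counts):
        n_patterns = sum(counts[start].values())
        proportions[start] = {
            rule: n / n_patterns for rule, n in counts[start].items()
        }
    return proportions


if __name__ == "__main__":
    # Hypothetical two-class, three-item model (values invented for illustration).
    sizes = [0.6, 0.4]
    cps = [[0.9, 0.9, 0.1], [0.2, 0.2, 0.8]]
    print(most_probable_class([1, 1, 0], sizes, cps))
    demo = [(3, "Rule I"), (12, "Rule I"), (24, "addition"), (30, "addition")]
    print(rule_proportions(demo))
```

Because each interval's counts are divided by that interval's own total, the proportions within an interval necessarily sum to one, which matches the property of the graphed data points noted above.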

The use of Rule I shows an inverted U-shape: very few networks used Rule I initially; the incidence grew to a peak but then diminished over time. In contrast, the human data reveal a large propensity for children to use Rule I early on, but to do so decreasingly as they age. It is plausible that some of the discrepancy in Rule I use between the simulated and human data arises because the very “young” networks may not have been comparable with the youngest children tested. If very young children had been tested by Jansen and van der Maas (2002), then the human and simulated data sets may have revealed a stronger degree of concordance. As a consequence it would be incautious to attach too much importance to the apparent disparities between the model and human performance at the earliest stages. Whereas the models essentially bring a blank slate to the task, the children are several years old before being tested on the task.

Fig. 11. Each graph reflects the proportion of networks from the York simulations at each “age” that is applying the particular rule identified. The rules are defined in Table 3.


There are no striking consistencies across the data sets in the use of Rule II. Indeed the incidence of Rule II in the simulated data sets is notably less than in the human data, especially early on in development. Also Rule II use tends to die out with increasing age in the human participants but this trend is apparently reversed in the A data set.

Agreement across the human and simulated data sets is most striking for use of the addition rule (although, as has been noted, the corresponding latent class in the CC network data does not exactly correspond with the addition rule in its pure form). Both simulated data sets revealed a gradual increase in the use of the addition rule with time and the majority of networks end up operationalising this particular rule. Indeed visual inspection of the respective graphs reveals that all three data sets produced the same developmental trend.

Fig. 12. Each graph reflects the proportion of networks from the Amsterdam simulations at each “age” that is applying the particular rule identified. The rules are defined in Table 3.

More troubling perhaps are the several rules that have not been identified previously; namely, a combination of Rule II and addition, addition with a small preference for weight, addition with a large preference for weight, addition with a preference for distance, the right bias and “always balance” strategies. The incidences of these rules are collected together in a single panel of Figs. 11 and 12 and it is important to note that these rules are not found in human data. This is of some significance because whereas it might be argued that children also generate strange rules for solving the task – such as the smallest distance down rule – the models generate strange rules of a different sort.

6. General discussion

At the centre of the present paper is an in-depth analysis of how CC networks simulate the acquisition of knowledge about the operation of a balance scale. This exercise has provided a reasonably clear picture of the manner in which these networks respond to the various kinds of balance scale patterns that define the problem space associated with learning to master the principle of torque. One positive result to emerge is that the current analyses have revealed for the first time clear statistical evidence that a class of connectionist network models does simulate rule-governed behaviour on the balance scale task. Examination of the current CC networks revealed behaviour variously commensurate with Rule I, Rule II and to some degree additive rules.

In terms of the basic issues discussed at the beginning of this article the present evidence shows rather convincingly that a particular type of connectionist network can mimic a rule-based system even though there is no sense in which such rules play any causal role in the behaviour of the device. In this regard the networks discussed here behave as though they are rule-following even though they are merely rule-governed. More particularly the present analyses provide very precise descriptions of the rules that the networks capture. In this regard the present simulations provide an example of what has been referred to as eliminative connectionism (Fodor & Pylyshyn, 1988; Marcus, 1998a, 1998b; Pinker & Prince, 1988) – “a school of connectionism that denies the existence of symbol-manipulation primitives (i.e., cognitive rules)” (Marcus, 1998b; material in parentheses added; for a defence of eliminativism in general see Churchland, 1981). Whereas in a symbolic model cognitive rules are made explicit and causally determine the operation of the system, the current networks acquired a set of weighted connections that mimic such a system in the absence of possessing any such rules.

It is important to be clear about just what exactly such a result reveals. In the abstract it is possible to specify the input–output pairings that correspond to the inputs to and outputs from the networks by a system of rules. It has been shown that there are instances where both humans and networks produce the same outputs for the balance scale test patterns. In the cases where the results for the latent class analyses of the human and network data sets agree, then there is evidence that the same rules have been acquired by both humans and networks. However, this statement needs careful consideration. From an eliminative perspective the critical point is that there is no sense in which the networks have actually acquired any cognitive rules and so if the networks do not have rules, then, by extension, neither do humans. It can therefore be concluded that it is no longer necessary to posit that the process of understanding the principle of torque reveals the acquisition of a system of cognitive rules.

This eliminativist argument would be particularly compelling if it could be shown that network simulations behave in the same manner as humans and exhibit the same developmental trends as humans, but caution is warranted here. What the present simulations have shown is that there are some very general similarities in the ways in which the networks and humans approach the problem of mastering the operation of a balance scale. Both proceed from an initial stage of ignorance to a stage in which some understanding has been achieved (see e.g., Fig. 7a). However, despite this measure of success, what is striking is that the network models fail to provide comprehensive accounts of the mastery of the principle of torque in humans. There is no evidence of Rule IV performance in any of the CC networks examined here: none of the networks achieved competence at Stage 4, whereas a significant number of human adolescents do.⁵

The more speculative conclusion is that the currently proposed networks are actually incapable of mastering the principle of torque. Across the various CC accounts of the balance scale task discussed here it seems that the networks are particularly adept at homing in on the addition rule. The overwhelming majority of the balance scale configurations can be solved by the rule of addition and it is this that the networks are particularly sensitive to. The same conclusion also (apparently) applies to the standard sort of multi-layered (backpropagation) perceptrons examined by McClelland (1989, 1995). In examining the balance scale task with this kind of perceptron, Dawson and Zimmerman (2003) analysed the operation of the hidden units and concluded that these were computing some variant of the addition rule such as [right weight + right distance] − [left weight + left distance] (cf. Wilkening & Anderson, 1982). Now it might be argued that Dawson and Zimmerman (2003) used a slightly different kind of hidden unit, a slightly different kind of updating rule and a different architecture to that used by McClelland (1989), but the central point still remains – connectionist networks fail to acquire a mastery of the principle of torque because they attempt to represent the problem as a problem concerning the comparison of two sums. They are apparently unable to represent the problem as a problem concerning comparisons between two products.
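The contrast between comparing two sums and comparing two products can be made concrete with a short sketch. It is an illustration only: the function names are invented here, and the range of one to five weights and one to five distance positions on each arm is an assumption about the problem space rather than something taken from the simulations themselves.

```python
from itertools import product


def outcome(left, right):
    """Map two magnitudes to a balance scale prediction."""
    if left > right:
        return "left"
    if right > left:
        return "right"
    return "balance"


def torque_rule(lw, ld, rw, rd):
    """Compare two products: weight x distance on each arm."""
    return outcome(lw * ld, rw * rd)


def addition_rule(lw, ld, rw, rd):
    """Compare two sums: weight + distance on each arm."""
    return outcome(lw + ld, rw + rd)


def agreement(max_weight=5, max_distance=5):
    """Count configurations on which the two rules give the same answer."""
    weights = range(1, max_weight + 1)
    distances = range(1, max_distance + 1)
    agree = total = 0
    for lw, ld, rw, rd in product(weights, distances, weights, distances):
        total += 1
        agree += torque_rule(lw, ld, rw, rd) == addition_rule(lw, ld, rw, rd)
    return agree, total


if __name__ == "__main__":
    print(agreement())
    # One weight at distance 4 against two weights at distance 2:
    # the products are equal (4 = 4) but the sums are not (5 > 4).
    print(torque_rule(1, 4, 2, 2), addition_rule(1, 4, 2, 2))
```

With these assumed ranges the two rules agree on 573 of the 625 configurations, so a device that compares sums is right most of the time while still misjudging items such as the one printed last.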

In this regard the discussion of the data shown in Fig. 9 is also particularly relevant. From that analysis it was concluded that the CC networks essentially muddled through with the patterns that do not conform to the addition rule. Here there was evidence of mis-applying the addition rule or alternatively responding to the side with the greater distance or the greater weight.

Now the argument might be the following. What has been shown is a bias towards the rule of addition and in adhering to this bias the networks were unable to achieve a proper understanding of the principle of torque. In other words, the networks do model the same bias and limitations found in humans – many humans never acquire this knowledge. Again caution is warranted here.

First, although it is true that there is evidence for the rule of addition in both networks and humans, the networks also reveal ways of solving the balance scale that have never before been seen in humans. Several of these idiosyncrasies are simply uninterpretable. So alongside some aspects of human performance the networks produce responses quite unlike anything seen in humans – indeed it might be argued that it is as natural for the networks to exhibit human-like responding as it is for them to exhibit responding that bears no resemblance to that of humans.⁶

⁵ In this regard, it is important to note the relevant but alternative views that exist in the connectionist literature. Whereas Shultz et al. (1994, p. 84) conclude that “In our networks, reaching stage 4 is essentially a matter of learning the . . . problem to sufficient depth” (a point that has no support in the foregoing), McClelland (1989, p. 39) believes that “Rule 4 (unlike the other rules) can only be adhered to strictly as an explicit (arithmetic) rule”. Moreover, McClelland (1989) does point out that the “conscious, verbally accessible component to the problem-solving activity” is not addressed by his simulations. His view is that “the model captures the gradual acquisition mechanisms which establish the possible contents of these conscious processes”.

Second, it is important to address the point about Rule IV performance – the networks fail and so do humans (see Shultz et al., 1994, p. 71 for more on this point). In reply the actual evidence is that many children and adults master Rule IV, and the rule can be easily taught to those who do not discover it spontaneously. In contrast, there is no evidence that any of the networks considered here did master the principle. This is not to say that the CC networks are incapable of learning how to map the complete set of input patterns that define the present balance scale onto the corresponding outputs. As is noted in Footnote 2, in the limit, and in principle, the CC networks are capable of generating new hidden units until an acceptable level of performance is achieved. Nevertheless, it would be difficult to claim that the networks have mastered the principle of torque. The evidence here suggests that they would have essentially acquired a default of responding according to the addition rule together with an item-by-item set of mappings for the exceptional cases. Critically, although such a solution would work with the particular balance scale used in training, the network would be unable to deal appropriately with a more extended range of weight and distance values.
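The generalisation point can be illustrated with a toy responder that uses the addition rule by default and memorises the training items on which that default is wrong. This is a hypothetical caricature of the kind of solution conjectured above, not a description of what any trained network actually computes; the names and the training range of one to five for both weight and distance are assumptions made for the example.

```python
from itertools import product


def outcome(left, right):
    """Map two magnitudes to a balance scale prediction."""
    if left > right:
        return "left"
    if right > left:
        return "right"
    return "balance"


def torque(lw, ld, rw, rd):
    """Correct answer: compare weight x distance on each arm."""
    return outcome(lw * ld, rw * rd)


def addition(lw, ld, rw, rd):
    """Default answer: compare weight + distance on each arm."""
    return outcome(lw + ld, rw + rd)


def memorise_exceptions(max_value=5):
    """Store the correct answer for every training item the default gets wrong."""
    values = range(1, max_value + 1)
    exceptions = {}
    for item in product(values, repeat=4):  # item = (lw, ld, rw, rd)
        if addition(*item) != torque(*item):
            exceptions[item] = torque(*item)
    return exceptions


def respond(item, exceptions):
    """Use a memorised mapping if one exists, otherwise the addition default."""
    return exceptions.get(item, addition(*item))


def error_count(exceptions, max_value):
    """Number of items in the given range that the responder gets wrong."""
    values = range(1, max_value + 1)
    return sum(
        respond(item, exceptions) != torque(*item)
        for item in product(values, repeat=4)
    )


if __name__ == "__main__":
    table = memorise_exceptions(5)
    print(len(table), error_count(table, 5), error_count(table, 6))
```

The responder is perfect on the range it was built from, but once a sixth weight or distance value is allowed it meets new exceptional items (for example one weight at distance 6 against two weights at distance 3) for which it has no stored mapping.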

6.1. Rules, rule-use and the CC networks

Prime concern here has been with evidence for rules as provided by LCA. With this technique, the regularity and consistency of response patterns generated with the CC model have been tested and found in the main to be wanting. Such an approach though can be bolstered by appealing to other criteria for rule use such as those set out by Reese (1989), and we can now ask how well the performance of the CC models fulfils these criteria. Clearly it is inappropriate to expect the CC networks to explain their responses, so it is not possible to address the criterion concerning awareness of rule use.

It is though possible to ask about the criterion concerning discontinuity. This is an important criterion (see Raijmakers, 1997) and there is a class of connectionist models that certainly do have the potential to show real phase transitions in their behaviour (see the recent description of the Exact ART networks by Raijmakers & Molenaar, 2004). In this regard, Shultz (2003) has recently argued that the discontinuity criterion is also met by the CC model. He has shown, with respect to conservation learning, that sudden increases in performance are a signature of the CC acquisition process (p. 150). Although this could be true, the sudden jumps in improvement in learning shown by the networks are not as steep and as dramatic as reported for some children (from completely incorrect to correct, see Raijmakers et al., 1996; Van der Maas & Molenaar, 1996). More troubling though is that in the present simulations no causal link between the addition of hidden units and stage progression has been established (see e.g., the discussion of Fig. 8).

⁶ In this regard it is interesting to note the other examples of idiosyncratic responding in the network simulations of past-tense learning never before seen in humans (see Pinker & Prince, 1988).

It is also pertinent to ask whether any of the findings from latent class analyses provide cast-iron evidence that can settle arguments about the rule-governed/rule-following nature of cognition. Perhaps the least contentious conclusion is that such evidence merely suggests the operation of cognitive mechanisms of a certain type. Converging evidence for such mechanisms though can be gleaned from other sources. For instance, van der Maas and Jansen (2003) reported patterns of RTs for children and adolescents that were generally consistent with the mental rules proposed by Siegler (1976). Older subjects were much slower on conflict items than younger subjects, but more interesting perhaps was the evidence relating to performance with scale items in which the larger weight configuration was furthest from the fulcrum. According to the rule-based framework proposed by Siegler (1976), participants who use Rule III and higher rules will make at least two more sequential steps in their reasoning than would be the case if they were reasoning with the less complex rules. Indeed the RT data revealed that the older participants were on average about one second slower than the young children who were identified as applying either Rule I or Rule II. Explaining these RT data constitutes a clear challenge for any account of performance on the balance scale task and it will be interesting to see how future connectionist models address this challenge.

Before closing it is important to be clear about what general points have emerged from this exercise. In caricature there are two very different schools of thought about how best to account for performance on the balance scale task. By the connectionist view the developmental trend is a gradual journey from complete ignorance to mastery of the task. Such a developmental trajectory is underpinned by small incremental changes in connection strengths in the central nervous system. Collectively these molecular changes can lead to stage-like molar changes in the behaviour. According to this caricature the child is essentially a passive statistical counting machine whose central nervous system is molded by the environment. On the other hand there is the rule-based view. Here the child is seen to be nothing other than some form of computational device whose behaviour is determined by a system of mental rules implemented in some form of abstract cognitive architecture. By this view the child slavishly adopts one rule without exception and at some later date drops this rule and immediately adopts another. Here the developmental trajectory is described as being the rigid adherence to a strict sequence of stages of cognitive competence that is followed without exception.

The picture that has emerged from the experimental literature is far more complex than either caricature can hope to account for. Nevertheless there are very good grounds for arguing about generally consistent patterns of responding that are well explained by rules. There are also qualitative changes in behaviour such that different consistent patterns of responding emerge over time. Such changes in behaviour follow a general lawful progression and are revealed by sharp transitions. One signature of such transitions is the inconsistent application of the rules. A central point that has been argued here is that the extant network models fail to adequately account for these general patterns of performance.

7. Conclusions

In closing it is clear that some success has been achieved in the present modelling exercises but such achievements must be considered against the clear discrepancies between the network and human data sets. Briefly: (a) the networks frequently recovered rules never previously seen in humans; (b) the networks failed to recover some of the rules that have been established in human samples; and (c) there was no evidence that any of the networks acquired the principle of torque.

All of these conclusions must be borne in mind when assessing the worth of the connectionist modelling work. At the very least they underscore the fact that there are, at present, no adequate connectionist accounts of acquiring knowledge of the principle of torque. Indeed, we feel that the current body of evidence is such as to cast doubt on whether the overall connectionist approach is adequate for the task in hand. Others however may take the evidence provided here merely to suggest problems with particular connectionist models and that such limitations may be overcome with future work on more advanced network architectures. In this regard it would be an oversight not to point out the degree of success that has been achieved using more traditional computational modelling techniques. One such example has been recently put forward by van Rijn et al. (2003). This production system model is based around the ACT-R architecture (Anderson, 1990; Anderson & Lebiere, 1998) and accounts for a large body of empirical data. Unlike the connectionist models discussed above, the production system model acquires rules of operation of the balance scale that are both explicitly represented and play a causal role in its behaviour. It is therefore tempting to suggest that any attempts at modelling human reasoning will be most successful if they posit underlying rules that are explicitly represented and play a causal role in the operation of the device.

Quite apart from the marked theoretical differences that exist over how best to frame issues in human cognitive development, any account of the mastery of the principle of torque must be able to explain the particular patterns of behaviour that are now well documented in the developmental literature and have been covered in detail here. Ultimately the goal may then be to provide an account couched in terms of biologically plausible mechanisms that can be seen to operate in neuronal networks. What the present paper has attempted to show is that current connectionist models fall somewhat short of this ideal, and questions have been raised over their suitability for this purpose. It remains tantalising to speculate that human cognition may be captured adequately by network models in which rule-like behaviour emerges from the myriad interactions between simple processing units, and nothing in the foregoing excludes this possibility. A challenge now is how best to proceed given the limitations and shortcomings of the extant modelling work that we have exposed. We trust that this will act as the spur for future research.

Acknowledgements

We express our sincere thanks to Scott Fahlman for his extremely helpful and swift responses to our enquiries during the development of the CC simulator and to Maartje Raijmakers for her comments on various drafts of the paper. Thomas Shultz and Sylvain Sirois also provided support during our simulations of the balance scale task. The research was in part funded by an ESRC grant (Grant No. R000223827) to the first author.

Appendix A

Free parameters that are included in the CC algorithm

Parameter                                               Value
Number of candidate units in each candidate network     8
Output training max-epochs                              Not limited
Input training max-epochs                               Not limited
Output training patience                                8
Input training patience                                 8
Output change-threshold                                 0.1
Input change-threshold                                  0.03
Output training μ                                       2.0
Input training μ                                        2.0
Output training ε                                       0.4
Input training ε                                        1.0
Output training weight-decay                            0.0001
Input training weight-decay                             0.0001
Weight-initial-range (r)                                −1.0 to +1.0
Sigmoid-prime-offset                                    0.1

Note. Unless otherwise stated the values given were adopted throughout for all of the York simulations reported here. These values are the same as those reported by Fahlman (1988) except for the output change threshold that he set at 0.01 (see text for details). The original Fahlman values were used in the Amsterdam simulations.


Appendix B

Estimated values of the parameters of the LCA models of responses to balance scale problems broken down by item type and item

Conditional probabilities of giving a correct response

            Data set A                               Data set Y
LC   ucp    I1    I2    I3    I4    I5       ucp    I1    I2    I3    I4    I5

Weight items
1    0.89   0.99  1.00  1.00  0.98  1.00     0.81   0.99  0.99  1.00  1.00  1.00
2    0.01   0.00  0.00  0.00  0.00  0.00     0.05   0.00  0.00  0.54  1.00  1.00
3    0.02   0.91  1.00  0.91  0.00  0.00     0.08   0.83  0.85  1.00  0.00  0.65
4    0.07   0.16  0.00  0.75  0.59  0.94     0.06   0.00  0.00  0.76  0.00  0.80

Distance items
1    0.25   0.01  0.00  0.47  0.00  0.02     0.42   0.00  0.00  0.29  0.00  0.00
2    0.67   0.99  1.00  1.00  1.00  0.98     0.49   0.99  1.00  1.00  1.00  1.00
3    0.04   0.88  0.93  1.00  0.00  0.00     0.03   1.00  0.89  1.00  0.00  0.00
4    0.05   0.00  0.00  0.26  1.00  0.83     0.05   0.00  0.00  0.42  0.92  1.00
5    0.01   0.33  0.00  1.00  0.80  0.00

Conflict-balance items
1    0.38   0.00  0.00  0.11  0.14  0.01     0.27   0.00  0.00  0.00  0.00  0.00
2    0.37   0.63  1.00  0.99  0.92  0.97     0.08   1.00  1.00  1.00  1.00  1.00
3    0.22   0.05  0.04  1.00  1.00  0.06     0.47   0.08  0.00  0.75  0.81  0.02
4    0.02   1.00  0.10  0.00  0.70  0.00     0.17   0.28  1.00  0.93  0.84  0.81

Conflict-weight items
1    0.31   0.93  1.00  0.95  1.00  1.00     0.34   0.87  0.95  0.99  1.00  1.00
2    0.63   0.00  0.43  0.00  1.00  0.95     0.27   0.00  0.00  0.01  0.97  0.98
3    0.02   0.00  0.00  0.00  0.00  0.00     0.36   0.06  1.00  0.00  1.00  0.89
4    0.04   1.00  0.00  0.85  0.64  1.00     0.03   0.93  0.00  0.92  0.00  1.00

Conflict-distance items
1    0.93   0.00  0.00  0.00  0.02  0.44     0.85   0.00  0.00  0.00  0.00  0.38
2    0.004  1.00  1.00  1.00  1.00  1.00     0.03   0.81  0.69  1.00  1.00  1.00
3    0.02   0.00  0.91  0.91  0.00  1.00     0.04   0.00  0.92  0.94  0.00  1.00
4    0.04   1.00  0.00  0.00  0.73  0.55     0.08   0.88  0.00  0.00  0.65  0.54

Note. “I” is an abbreviation for “Item”. “LC” is an abbreviation for “latent class”. Definitions of the actual items are provided in Table 2, Jansen and van der Maas (1997).

Appendix C. Supplementary data

Supplementary data associated with this article can be found in the online version at doi:10.1016/j.cognition.2006.02.004.



References

Anderson, J. R. (1990). The adaptive character of thought. Hillsdale, NJ: Lawrence Erlbaum.Anderson, N. H., & Cuneo, D. O. (1978). The height + width rule in children’s judgments of quantity.

Journal of Experimental Psychology: General, 107, 335–378.Anderson, J. R., & Lebiere, C. (1998). The atomic components of thought. Mahwah, NJ: Lawrence

Erlbaum.Ash, T. (1989). Dynamic node creation in back propagation networks. Connection Science, 1, 365–375.Boom, J., Hoijtink, H., & Kunnen, S. (2001). Rules in the balance: Classes, strategies, or rules for the

balance scale task. Cognitive Development, 16, 717–735.Chletsos, P. N., De Lisi, R., Turner, G., & McGillicuddy-De Lisi, A. V. (1989). Cognitive assessment of

proportional reasoning strategies. Journal of Research and Development in Education, 23, 18–27.Churchland, P. M. (1981). Eliminative materialism and propositional attitudes. The Journal of Philosophy,

LXXVIII, 67–90.Clogg, C. C. (1995). Latent class models. In G. Arminger, C. C. Clogg, & M. E. Sobel (Eds.), Handbook of

statistical modeling for the social and behavioral sciences (pp. 311–359). New York: Plenum Press.Crowder, S. (1990). C implementation of the Cascade-Correlation learning algorithm. [Computer

software]. <http://www-2.cs.cmu.edu/~sef/sefSoft.htm>.Dawson, M. R. W., & Zimmerman, C. (2003). Interpreting the internal structure of a connectionist model

of the balance scale task. Brain and Mind, 4, 129–149.Elman, J. L. (2005). Connectionist models of cognitive development: Where next?. Trends in Cognitive

Sciences 9, 111–117.Fahlman, S. E. (1988). An empirical study of learning speed in back-propagation networks. (Tech. Rep.

CMU-CS-88-162). Pittsburgh, USA: Carnegie-Mellon University, Department of Computer Science.Fahlman, S. E., & Lebiere, C. (1991) The cascade-correlation learning architecture. (Tech. Rep. CMU-CS-

90-100). Pittsburgh, USA: Carnegie-Mellon University, Department of Computer Science.Ferretti, R. P., & Butterfield, E. C. (1986). Are childrens’ rule-assessment classifications invariant across

instances of problem types?. Child Development 57, 1419–1428.Ferretti, R. P., & Butterfield, E. C. (1992). Intelligence-related differences in the learning, maintenance,

and transfer of problem-solving strategies. Intelligence, 16, 207–223.Ferretti, R. P., Butterfield, E. C., Cahn, A., & Kerkman, D. (1985). The classification of children’s

knowledge: Development on the balance-scale and inclined-plane tasks. Journal of Experimental Child

Psychology, 9, 131–160.Fodor, J. A. (1975). The language of thought. Cambridge, MA: Harvard University Press.Fodor, J. A., & Pylyshyn, Z. W. (1988). Connectionism and cognitive architecture: A critical analysis.

Cognition, 28, 3–71.Goodman, L. A. (1975). A new model for scaling response patterns: An application of the quasi-

independence concept. Journal of the American Statistical Society, 70, 755–768.Inhelder, B., & Piaget, J. (1958). The growth of logical thinking from childhood to adolescence. New York:

Basic Books.Jansen, B. R. J., & van der Maas, H. L. J. (1997). Statistical test of the rule assessment methodology by

latent class analysis. Developmental Review, 17, 321–357.Jansen, B. R. J., & van der Maas, H. L. J. (2001). Evidence for the phase transition from Rule I to Rule II

on the balance scale task. Developmental Review, 21, 450–494.Jansen, B. R. J., & van der Maas, H. L. J. (2002). The development of children’s rule use on the balance

scale task. Journal of Experimental Child Psychology, 81, 383–416.Joanisse, M. F., & Seidenberg, M. S. (1999). Impairments in verb morphology after brain injury: a

connectionist model. Proceedings of the National Academy of Sciences USA, 96, 7592–7597.Kerkman, D. D., & Wright, J. C. (1988). An exegesis of two theories of compensation development:

Sequential decision theory and information integration theory. Developmental Review, 8, 323–360.Langeheine, R., Pannekoek, J., & Van de Pol, F. (1995). Bootstrapping goodness-of-fit measures in

categorical data analysis. The Netherlands: CBS Statistics.Lazarsfeld, P. F., & Henry, N. W. (1968). Latent structure analysis. Boston: Houghton Mifflin.

http://www-2.cs.cmu.edu/~sef/sefSoft.htm


Ling, C. X., & Marinov, M. (1993). Answering the connectionist challenge: A symbolic model of learningthe past tenses of English. Cognition, 49, 235–290.

Lippmann, R. (1987). An introduction to computing with neural nets. ASSP Magazine IEEE, 4, 4–22.Marcus, G. F. (1998a). Can connectionism save constructivism?. Cognition 66, 153–182.Marcus, G. F. (1998b). Rethinking eliminative connectionism. Cognitive Psychology, 37, 243–282.Marcus, G. F. (2001). The algebraic mind: Integrating connectionism and cognitive science. Cambridge,

MA: The MIT press.Mareschal, D., & Shultz, T. R. (1996). Generative connectionist networks and constructivist cognitive

development. Cognitive Development, 11, 571–603.Massaro, D. W. (1989). Testing between the TRACE model and the Fuzzy Logical model of speech

perception. Cognitive Psychology, 21, 398–421.McClelland, J. L. (1989). Parallel distributed processing: Implications for cognition and development. In

R. G. M. Morris (Ed.), Parallel distributed processing: Implications for psychology and neurobiology

(pp. 8–45). Oxford: Clarendon Press.McClelland, J. L. (1995). A connectionist perspective on knowledge and development. In T. J. Simon & G.

S. Halford (Eds.), Developing cognitive competence: New approaches to process modelling (pp. 157–204).Hillsdale, NJ: LEA.

McClelland, J. L., & Jenkins, E. (1991). Nature, nurture, and connections: Implications of connectionist models of cognitive development. In K. van Lehn (Ed.), Architectures for intelligence: The twenty-second Carnegie-Mellon Symposium on cognition (pp. 41–73). Hillsdale, NJ: LEA.

McClelland, J. L., & Patterson, K. (2002). ’Words or Rules’ cannot exploit the regularity in exceptions. Trends in Cognitive Sciences, 6, 464–465.

McCutcheon, A. L. (1987). Latent class analysis. Newbury Park, CA: Sage.

Minsky, M. L., & Papert, S. A. (1988). Perceptrons: An introduction to computational geometry. Cambridge, MA: The MIT Press (Expanded ed.).

Miozzo, M. (2003). On the processing of regular and irregular forms of verbs and nouns: Evidence from neuropsychology. Cognition, 87, 101–127.

Munakata, Y., & McClelland, J. L. (2003). Connectionist models of development. Developmental Science, 6, 413–429.

Normandeau, S., Larivee, S., Roulin, J. L., & Longeot, F. (1989). The balance-scale dilemma: Either the subject or the experimenter muddles through. Journal of Genetic Psychology, 150, 237–250.

Piaget, J., & Inhelder, B. (1969). The psychology of the child. London: Routledge.

Pinker, S., & Prince, A. (1988). On language and connectionism: Analysis of a parallel distributed processing model of language acquisition. Cognition, 28, 73–193.

Pinker, S., & Ullman, M. T. (2002). The past and future of the past tense. Trends in Cognitive Sciences, 6, 456–463.

Plunkett, K., & Juola, P. (1999). A connectionist model of English past tense and plural morphology. Cognitive Science, 23, 463–490.

Plunkett, K., & Marchman, V. A. (1996). Learning from a connectionist model of the English past tense. Cognition, 61, 299–308.

Prechelt, L. (1997). Investigation of the CasCor family of learning algorithms. Neural Networks, 10, 885–896.

Quinlan, P. T. (1991). Connectionism and psychology. A psychological perspective on new connectionist research. Hemel Hempstead: Harvester Wheatsheaf.

Quinlan, P. T. (1998). Structural change and development in real and artificial neural networks. Neural Networks, 11, 577–599.

Raijmakers, M. E. J. (1997). Is the learning paradox resolved? Behavioral and Brain Sciences, 20, 573–574.

Raijmakers, M. E. J., & Molenaar, P. C. M. (2004). Modelling developmental transitions in adaptive resonance theory. Developmental Science, 7, 149–157.

Raijmakers, M. E. J., van Koten, S., & Molenaar, P. C. M. (1996). On the validity of simulating stagewise development by means of PDP networks: Application of catastrophe analysis and an experimental test of rule-like network performance. Cognitive Science, 20, 101–136.


Reese, H. W. (1989). Rules and rule-governance: Cognitive and behavioristic views. In S. C. Hayes (Ed.), Rule governed behavior: Cognition, contingencies, and instructional control (pp. 3–84). New York: Plenum Press.

Rogers, J. (1997). Object-oriented neural networks in C++. San Diego, CA: Academic Press.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland and the PDP Research Group (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition. Vol. 1: Foundations (pp. 318–362). Cambridge, MA: The MIT Press.

Rumelhart, D., & McClelland, J. L. (1986). On learning the past tense of English verbs: implicit rules or parallel distributed processing? In J. L. McClelland, D. E. Rumelhart, and the PDP Research Group (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition. Vol. 2: Psychological and biological models (pp. 216–271). Cambridge, MA: The MIT Press.

Shultz, T. R. (2003). Computational developmental psychology. Cambridge, MA: MIT Press.

Shultz, T. R., Mareschal, D., & Schmidt, W. (1994). Modeling cognitive development on balance scale phenomena. Machine Learning, 16, 57–86.

Siegler, R. S. (1976). Three aspects of cognitive development. Cognitive Psychology, 8, 481–520.

Siegler, R. S. (1981). Developmental sequences between and within concepts. Monographs of the Society for Research in Child Development, 46, Whole No. 189.

Siegler, R. S., & Chen, Z. (1998). Developmental differences in rule learning: a microgenetic analysis. Cognitive Psychology, 36, 273–310.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.

Taatgen, N. A., & Anderson, J. R. (2002). Why do children learn to say “Broke”? A model of learning the past tense without feedback. Cognition, 86, 123–155.

Tyler, L. K., de Mornay-Davies, P., Anokhina, R., Longworth, C., Randell, B., & Marslen-Wilson, W. D. (2002). Dissociations in processing past tense morphology: Neuropathology and behavioural studies. Journal of Cognitive Neuroscience, 14, 79–94.

Ullman, M. T., Corkin, S., Coppola, M., Hickok, G., Growdon, J. H., Koroshetz, W. J., & Pinker, S. (1997). A neural dissociation within language: Evidence that the mental dictionary is part of declarative memory, and that grammatical rules are processed by the procedural system. Journal of Cognitive Neuroscience, 9, 266–276.

Van de Pol, R., Langeheine, R., & De Jong, W. (1996). PANMARK 3. Panel Analysis Using Markov Chains. A Latent Class Analysis Program. The Netherlands: Voorburg, [User manual].

Van der Maas, H. L. J., & Hopkins, B. (1998). Developmental transitions: So what’s new? British Journal of Developmental Psychology, 16, 1–13.

Van der Maas, H. L. J., & Jansen, B. R. J. (2003). What response times tell of children’s behavior on the balance scale task. Journal of Experimental Child Psychology, 85, 141–177.

Van der Maas, H. L. J., & Molenaar, P. C. M. (1996). Catastrophe analysis of discontinuous development. In A. A. van Eye & C. C. Clogg (Eds.), Categorical variables in developmental research. Methods of analysis (pp. 77–105). San Diego: Academic Press.

Van Maanen, L., Been, P., & Sijtsma, K. (1989). The linear logistic test model and heterogeneity of cognitive strategies. In E. E. Roskam (Ed.), Mathematical psychology in progression (pp. 267–287). Berlin: Springer-Verlag.

van Rijn, H., van Someren, M., & van der Maas, H. (2003). Modeling developmental transitions on the balance scale task. Cognitive Science, 27, 227–257.

Wilkening, F., & Anderson, N. H. (1982). Comparison of the two rule-assessment methodologies for studying cognitive development and structure. Psychological Bulletin, 92, 215–237.
