A derivation of a rote learning curve from the total uncertainty of a task

BULLETIN OF MATHEMATICAL BIOPHYSICS

VOLUME 22, 1960

A DERIVATION OF A ROTE LEARNING CURVE FROM THE TOTAL UNCERTAINTY OF A TASK

A N A T O L R A P O P O R T UNIVERSITY OF MICHIGAN

The der iva t ion of H. D. L a n d a h l ' s learn ing curve (1941, Bull. Math. Biophysics, 3, 71-77) from a s ing le informat ion- theore t ica l assumpt ion obta ined p rev ious ly (Rapoport, 1956, Bull. Math. Biophysics, 18, 317-21) i s ex tended to obtain the ent ire family of such cu rves wi th the number of s t imul i M ( t o each of which one of N r e s p o n s e s i s to be a s s o c i a t e d ) as a parameter . No addi t ional assumpt ions are required. The ent i re family thus appears as a funct ion of a s ing le free parameter , k, all other param- e te rs be ing exper imenta l ly determined. The theory is compared with a s e t of exper iments involving the learn ing of a r t i f i c ia l l anguages . An a l t e rna t ive quas i -neuro log ica l model l ead ing to the same equat ion is offered.

In a previous paper (Rapoport, 1956), we showed how a learning curve, derived by H. D. Landahl (1941) from a neurological model, could be derived as a consequence of a single assumption, namely that the rate of decrease of uncertainty of response per error is cons tan t . That is , we derived an equation formally identical to L~ndahl 's in which our single parameter, k, corresponded to his (essent ia l ly) single parameter, kb.

We were not able, however, to show how this parameter de- pended on M, the number of independent stimuli with each of which one of N responses was to be assoc ia ted . Hence to get a family of curves, formally ident ical to Landahl ' s , we were obl iged to make the same addit ional (ad hoc) assumption, namely, k = ye -~M, where 77 and ~ were now two free parameters.

Here we will drop this extra assumption and will re-derive our learning curve from a slightly different hypothesis . The form of each single curve of the family will then remain the same. But the family will not contain the experimentally control led parameter M

85

86 ANATOL RAPOPORT

and the resulting dependence of our parameter (now k') upon M will be similar (though not identical) to Landahl ' s . We will then fit our family of curves to some experimental data.

~e will assume this time that the rate of i ncrease of the total

in format ion gained by the learner per error is constant . Le t h(t) be the total information gained after t stimulus presenta-

tions, the stimuli being chosen at random from a se t of M. Then, by our hypothesis

d h / d w = k, a constant, (1)

where w is the cumulated number of wrong responses . As in the previous paper (Rapoport, 1956), we define the uncertainty of the response to a given stimulus in " n i t s " as U = - l o g e ( 1 - dw/d t ) ,

that is, the negative logarithm of the probability of correct response to any one stimulus.*

Then the total uncertainty remaining will be U - - M l o g e ( 1 - d w / d t ) . If H i s the initiaI total uncertainty of the task, the amount of information gained will be

A = H - U = H + M log e ( 1 - d w / d t ) . (2)

Integrating (1), we get

h = k w , (3)

where the constant of integration must vanish, since h(0) -- w(0) -- 0.

Eence, substi tuting (3) into (9),

k w = H + M log e (1 - d w / d t ) , (4)

Exp I (kw - H)/MI = 1 - d w / d t , (5)

dw = d t . (6)

1 - E x p { ( k w - H)/M}

Integrating (6), we obtain, again considering the initial condition w(0) -- 0,

M Exp [H/M]

- - ff log~ [Exp[H/M] - 1] . E x p l - k t / M } + I (7) qlJ I

*Assuming equi-probability of possible responses, this definition coincides with Shannon's uncertainty (Shannon and Weaver, 1949). A ' ~ n i t ~ i s a " n a t u r a l b i t ~ " i . e . , l o g e 2 = . 6 9 3 . . . .

ROTE LEARNING CURVE 87

If the responses to the M stimuli are independently ass igned, then the total initial uncertainty of the task must be

H --- U log N. ( 8 )

Substituting (8) in to (7), we obtain

e N u: = ~- log , (N - 1 ) F x p l - k t / M I + 1' (9)

which is Landahl ' s equation (11), where his parameter kb Coincides with our parameter k ' = k/M (Landahl, 1941). In other words in- stead of Landahl ' s assumed exponential dependence upon M (kb = 71e-~M), we obtain a hyperbolic relation as a consequence of our hypothesis . ~hen only a few curves are avai lable (in Landahl ' s exper iment~ th ree ) , the dist inction between the two relat ions can. not be made with confidence.

If the responses are not independently ass igned, we cannot assume (8). However, the quantity e H/M is s t i l l a constant in a given experiment. If we designate this constant by N, our equation (7) is st i l l formally identical with Landahl ' s equation (11). However N'now becomes the " e f f e c t i v e " N, which will, in general, be smaller than the actual number of poss ib le responses to each stimulus. The dec rease i s a resul t of the redundancy introduced by non-independent assignments of responses to stimuli.

Application to an Experiment. Several groups of sub jec t s (10- 50 in each group) were taught several artificial languages respec t ive ly by assoc ia t ing irregular geometric figures with their names in the languages. The referents were the same in all c a s e s , namely, a s e t of figures of five different shapes and five different colors. Out of a se t of 25 poss ib le combinations, 20 were used as the referents, the remaining five as t e s t stimuli wherever generali- zation of the naming principle could be made. In the course of the training, each of the 20 referents chosen in random order was presented to the subject~ who was required to make a verbal r e s p o n s e - - a n o n s e n s e sy l lab le or a pair of nonsense sy l lab les depending on what language was being learned. Only a completely correct response ( i .e . , correct p ronunc i a t i on )was counted as correct. All others were counted a s errors. Whether the s u b j e c t ' s response was correct or not, the experimenter countered with the correct response . Each ses s ion las ted about an hour. Success ive

88 ANATOL RAPOPORT

ses s ions were held on consecu t ive days. It took about 4 to 30 s e s s i o n s to complete the learning to criterion (three perfect consecu t ive scores on the 20 responses in random order), depending on the language and, of course, on the subject . The languages were structured a s follows:

LI : With each of the 20 figures a s ingle syl lable was assoc ia ted . L2: With each of the 20 figures a pair of dis t inct sy l l ab les were

assoc ia ted , no single syl lable being assoc ia ted with more than one figure.

La: With each of the five shapes and with each of the five colors a single syl lable was as soc ia ted respect ively . The sy l lab les were combined in pairs to name each figure by shape and color. The shape syl lable a lways came first.

L4: The vocabulary of this language was the same as that of L a (10 sy l lab les) , but there was no sys temat ic correspondence between the single s y l l a b l e s and the aspec t s (color and s h a p e ) o f the referents. Thus to each referent there corresponded a unique ordered pair of sy l l ab le s , but there were incons is tenc ies in corre- spondences between single sy l lab les and the a spec t s of the reference figures.

To tes t our theory, we turn our attention f irs t to the asymptotic values of w. Setting t = oo in equation (7) we get the theoret ical asymptotic value,

w ( ~ ) -- W = H / k . (10)

If now k is indeed independent of the complexity of the task, then H/W ought to be constant for all languages or, at leas t , its fluctuations ought to be attributable to population sample f luctuations.

To tes t this hypothesis , we need to ca lcula te the total uncertainty H in each language. Note that the assignments of responses to stimuli are not made independently, except in L 4. In L1, for example, having ass igned a syl lable to one figure, we have only 19 sy l lab les to choose from to ass ign one to the next figure. Obviously the total number of assignments is 20! Therefore the total uncertainty in L 1 is

H (L 1) -- l~ 20! -- 42.3. (11)

The task of learning L 1 was given to 20 sub jec t s . The average total number of errors per subjec t was observed to be 304. Hence, the value of k was calculated as 42.3/304 -- 0.139.

Turning to L 2 and taking the syl lable as our " u n i t , " we find that the amount of uncertainty in that task is jus t twice that of L 1


or 84.6 nits. We should add another bit (=0.7 nits) to represent the uncertainty of which group of 20 sy l lab les represents the repertoire from which the first syl lable of the two is chosen, since the two sy l lab les must be pronounced in the correct order. Taking 85.3 nits as our total uncertainty and observing the total average (10 sub iec t s ) number of errors to be W--540, we obtain k(L2) -- 85.3/540 -- 0.154, which agrees fairly well with the values previously obtained~

In L a we had data on 50 subiec ts . The average W was observed to be 85.8. To calculate H, note that the units are now syl lab les to be assoc ia ted with the " a s p e c t s " of the referents, namely the five shapes and five colors. Therefore, the amount of uncertainty i s n o w

/ / (La) = 2 log e 5! + 0.7 = 10.3. (12)

Hence, k ( L a ) = 10.3/85.8 0.120. We see that this value fluc- tuates, but so far not excess ive ly . Anticipating s ta t i s t i ca l analy- s is , we conjecture that the fluctuations of k observed do net exceed those expected in samples of populations of this s ize .

Finally, turning to L4, we note that here the responses are as- signed independently, each of the two sy l lab les being se lec ted at random (with replacement) from its own repertoire of 5. Therefore

/ / (L4) -- 2 log e 5 x 20 + 0.7 = 65.1. (13)

Observing W = 720 (10 subiects) , we obtain k (L4) - - 65.1/720 = 0.09. This is the largest deviation of k we have observed. It can perhaps be attributed to the confusion in the language due to the non-unique assoc ia t ions between the sy l lables and the shape-color a spec t s of the referents.

We now wish to compare the theoretically predicted learning curves with our data. We can proceed in two ways.

1. Determine k from the data on one of the languages. I4aving thus fixed the k, use it in all the experiments to ca lcula te the value of H(H kW), which will then be the free parameter in each c u r v e .

2. Calculate the value of/-/ in advance, as we have done above, and use it and W to determine k, which will then be the free parameter.

Comparisons of theory and experiment are shown in Figures 1-4. Curves I resul t if k is kept fixed throughout at 0.139 obtained from data on L1, and H is treated as a free parameter. In LI , L2,

90 ANATOL RAPOPORT

55C

30(

Z5(

20q

W

JSC

5(

0 0

LI

o / k --.139 k =.139 OBSERVED o / H=42.3 H=42.3

I I , I I [ I I I I00 200 300 400 500 600 700 800 900 1000

t

F I G U R E 1. C o m p a r i s o n o f t h e o r y and e x p e r i m e n t fo r L 1 . N u m b e r o f t r i a l s ( h o r i z o n t a l ) vs . cumu la ted er rors curve c a l c u l a t e d from e q u a t i o n (7) fitted with k as a free parameter against observed values (circles). Here curves I and II are identical (see text).

and L4, M -- 20 (twenty referents), but in Ls, M ~ 10 (the referents are five shapes and five colors). Curves II result if H is treated as an experimentally controlled parameter, calculated for each language as shown above, and k is treated as a free parameter. In L1, L2, and La, the two curves are barely dist inguishable; but in L 4 the discrepancy is considerable. Since in the latter case curve II gives definitely the better fit, we prefer to use method 2, which incidental ly is also more at tractive from the point of view of applying information theory to the learning model, since in this method the " informat ion" of a language is theoret ical ly calculated.

We note, further, that the theory fits the curves obtained in L 1 a n d La very well with relat ively slight fluctuation of the parameter. It also fits L 4 very well but with a considerable fluctuation of the parameter k in the direction expected because of the "inter- fe rence" inherent in that language.

In the case of L2, however, there seems to be a discrepancy between the theoretical curve and the data.

To discuss the possible reasons for the discrepancy, it will be useful to prove the fol lowin~

Lemma. >~ 0; => 0. W

ROTE LEARNING C U R V E 91

Since H = kW, it i s s u f f i c i e n t to prove the f irst o f t h e s e in- e q u a l i t i e s .

Proof. Subst i tu t ing k = H/W into (7), we obtain

~/W u = W - log [ ( e H / M - 1 ) e -Ht/r~l~l § 1]. (14)

H

Dif fe ren t ia t ing with r e s p e c t to t, we get

dw (e H/M - 1) e - H t / W M

dt (e H/M - 1 ) e -Ht /~ 'M + 1

For t = 0 , (15) b e c o m e s 1 - E x p I - H / M I . L e t now w 1

w 2 = w ( H 2 ) , w h e r e H 2 > H I. Then

(dj2 dw,I -H,/~ H2/M t d r ~ = e - e - > 0 .

t=O

( 1 5 )

= w ( / / 1 ) ,

(16)

600

500

400

W 3 0 0

20C

IO0

L2

Z 22" J~" I( =. 139 k=.158 OBSERVED H= 7"5 .0 H=85.3 W=540 W=541

~1 , , , I , , ~ 1 , , , 1 ~ , , I , j , l , , , I , , , I , , , I , , , 80 2440 400 560 720 880 1040 1200 1360 1520

t

FIGURE 2. Compar i son of theory and e x p e r i m e n t for L 2. Curve I i s o b t a i n e d by t ak ing the va lue of k from L 1 and t r ea t i ng H as a free param- e ter ; curve II by c a l c u l a t i n g H t h e o r e t i c a l l y and t r e a t i n g k a s a free pa ramete r . The d i s c r e p a n c y b e t w e e n t h e o r e t i c a l curve and da ta is d i s c u s s e d in the tex t . The d i s c r e p a n c y b e t w e e n the v a l u e s of W is due to round ing -o f f error .

92 ANATOL RAPOPORT

I 0 0

9 0

W

L 3

6O ~.

5 0 k =. 139 k# .120 40 H= It.. 9 H . 10.3

W=85~ W=858 30 ~ r

20 OIg,.qi~i~ED

I O - - I I I r I f I I I f I I

Oq 40 80 120 200 280 360 440 500 t

FIGURE 3. Comparison of theory and exper iment for L3. Here M -- 10.

In other words, w 2 increases faster than w 1 at t = 0. Further, if W is fixed, wl(~ ) = w 2 ( ~ ) = W , and of course wl(0 ) = w 2 ( 0 ) = 0 . Our lemma will, therefore, be proved if we can show that w 2 - w 1 has a single maximum, i .e . , (dw2/dt- dwl/dt ) has a single posi- tive root. Setting this express ion equal to zero, we obtain from (15), after s implif icat ions

( e I t , / M - 1) e - l t 2 t / v / M - ( e H , / M - 1 ) e - H l t / W M = O. (17)

80C

70G

60G

50C

4O(: W

30r

20C

lOG

L4

/ ,--o9o ~1 H= I00.1 H=65.1

~/ W= 2"20 W= 723 nz

I 1 J I l I J t ~ [ I I I I I 1 I 2oo 400 6O0 8oo IOOO I~)o 1400 ~DO 1800

t

FIGURE 4. Comparison of theory and exper iment for L4. Here the difference between curves I and II is cons ide rab le . P re fe rence , therefore , is g iven to ca lcu la t ing H theo re t i ca l l y and taking a lower value of k.


But this is of the form

ae-X'- be- ' t=ae-~t [ 1-abe('-')'t~ =0, (18)

where a > b, x > y. The express ion in the brackets vanishes for exact ly one posi t ive value of t, and its coeff ic ient never; so our lemma is proved.

The resul t can now be applied to our theory as follows. If w(0) is pegged a t 0, and ~z(~) at some observed asymptotic value W, then if // (and consequent ly k) increases , the entire curve shifts upward and vice versa.

Observe now the d iscrepancy between theoret ical and experimental curves in L 2. This d iscrepancy can be reduced (perhaps pract ical ly eliminated) if the value of k (and of H with it) is taken lower than 0.154. We note that this value is the highest value of k obtained. Consequently a lower value will be more in accord with the other values of k obtained in the other experiments. But is there a iust i f icat ion for reducing H, which, after all, was calculated by the same method as in the other experiments, where the agreement was good? It can be argued that spec i f ica l ly in the case of Lu, H can indeed be taken lower than we have taken it. The assumption in c a l c u l a t i n g / / ( L 2 ) was that the single sy l lab les were the units of which the repertoire is composed. But in this lan~guage a pair of such sy l lab les is inseparable , t lence, logically speaking, a pair of sy l l ab les could be taken as a unit. If this were done, L u would have exact ly the same H as L 1, namely 42.3 nits. In that case , it can be shown that the theoret ical curve for L 2 will lie entirely below the data. The conjecture is that the pairs of sy l lab les " c o a l e s c e " but not completely. An intermediate value of H between 42.3 and 85.3 should give a fit of accuracy comparable to that obtained in the other languages.

Note that the argument about the inseparable sy l l ab les does not apply either to L a or to L4, the other two-syl lable languages.

Theoretical Relation to Landahl's model. It might be construed that our bas ic assumption, dh/d~c ~ k implies that the subjec t " l e a r n s " (i.e., gains information) only from his wrong responses , in contradiction to the assumption underlying Landahl 's equation (11), which s ta tes that increments of excitat ion at the synapse leading to the correct response result only from the s u b j e c t ' s correct responses .

94 ANATOL RAPOPORT

In view of the formal identity of the equations derived from each of these assumptions, this seems anomalous. However, it is not true that our assumption implies that the sub jec t gains information on ly from wrong responses . To show this let us calculate the rate of change of information with respec t to correct responses c. This is

~qow

d h / d e = ( d h / d w ) ( d w / d c ) = k d w / d c . (19)

dw d w / de

do d t d t

d w / d t - . ( ~ 0 )

1 - d w / d t

Combining (15), (10), (20), and (19) we get

dh - k ( e n / ~ - 1) e -kt/M. (21)

de

In other words , our model implies that the rate of gain of information per correct response is an exponentially decaying func- tion of t, the number of trials. This resul t is intuit ively plausible. At the start of the learning process , when very few correct responses have become fixed, the rate of gain of information per correct response is large because of the possibi l i ty that some responses , correct for the first time, will become fixed. But at the end of the process where most of the correct responses are already fixed, the gain of information from correct responses is small. In particular no information is gained from a correct response which had already been fixed.

Another way of looking at the model is to find the express ion for d h / d t . ~ e note that d h / d w = ( d h / d t ) / ( d w / d t ) = ( d h / d t ) / ( 1 - Pc ), where Pc is the probability of correct response. If dl~/dw = k , a

constant , we have d h / d t = k (1 - Pc), i .e . , the information increases proportionately to the number of trials and the amount " l e f t to learn ." This seems reasonable for the case of 100% prompting, which is the case in our experiment.

R e l a t i o n to E s t e s - B u s h - M o s t e l l e r m o d e l s . 2:he s imples t form of the type of rate-learning models proposed by !~. K. F s t e s (1950),


R. R. Bush and F. Mosteller (1955) is expressed by the following recursive formula:

P t + l = % + ( 1 - ~ ) X . ( 2 2 )

mere p is the probability of some response, ~ and )t are con- stants. If p represents the probability of the correct response, and if the process ends in perfect learning, )t must equal unity. t ience, in this case, subtracting Pt from both s ides ,

P,+I - P , = ( 1 - cr - ( 1 - ~ ) p . ( 9 ~ )

The emphasis in the theoretical development of the authors mentioned is on s tochast ic aspects of the learning process. That is, the center of attention is on the frequency distributions of the responses on the success ive stimulus presentations.

Our model, on the other hand, is deterministic. ~e take for our time dependent variable, the average number of responses (say over a large population) at each stimulus presentation.

Pass ing from the difference to a differential equation, we obtain from (23)

dp - (1 - ~ ) ( 1 - p ) . ( ~ 4 )

dt

Integrating,

p = 1 - (1 - Po)e - ( l - ~ ) t , (25)

where Po is the init ial probability of the response in question. Yow in our notation p (the probability of correct response is

de/dr, and 1 - p is dw/dt . Substituting into (25), we obtain

dw = Ae -(1-c~)t (96)

dt

where A = dw/d t at t = 0 is the initial probability of the wrong response. Integrating once again, we obtain the expression for based on the deterministic version of Fstes-P.ush-~ostel ler model, namely

A ~c = - - ( 1 - e - ( ' - ~ ) ~ ) . ( 2 7 )

l -C( .

96 ANATOL RAPOPORT

Since A is experimentally controlled, equation (24) contains one free parameter ex, as does our equation (7) and Landahl's equivalent equation (11). The parameter can be fixed by observing the asymptotic value of ~c in the learning curve. All attempts to fit the experimental data described above by equation (27) show a large discrepancy, all in the same direction: the theoretical curves are all considerably below the experimental. We suspect that our model and Landahl's, which are formally equivalent although couched in different terms, are more suitable for the learning curves of human subjects, since they take into consideration (as our analysis above has shown) the characteristic fixating of the responses as they are learned in the process.

Of course in the theory of stochastic models, equation (27) applies only when there are only two possible responses, and moreover t he effect upon the probability of the correct response is the same whatever response is given. Because of the very special assumptions which underlie equation (27), the failure of this equation to predict our learning curves in no way invalidates the previous models. It only shows the inapplicability of the very special assumptions in this case.

It will nevertheless be instructive to compare the equation derived from our model, which is analogous to equation (27), that is, the expression o f d p / d t in terms of p, with (27).

From (7), we get by differentiation

dw (e HIM - 1 )e - k t / M

d t - (e H/M - 1 ) e - k t / M + 1 (28)

In the notation of Bush and Mosteller, d w / d t is 1 - p . Also e H/M is (1 - A)- 1 Hence, translating (28) into that notation,

p = ( B e "k t lM + 1)-1 (99)

where B = A / ( 1 - A) . Differentiating,

dp B k e - k t /M - . (ao)

d t M [ B e - k t lM + 1] 2

Substituting (29) into (30)~ we obtain

dp k - p ( 1 - p ) . ( 3 1 )

dt M


t~ut (31) is the well-known logistic equation. It represents the simplest case of a contagion process. Based on this analogy, one can now give our model a quasi-neurological interpretation, similar to the one given by Bush and Mosteller (op. cir., Chap. 2).

Suppose the probability of correct response is given by the fraction of "properly condi t ioned" neural elements, i .e. , a correct response occurs if the response happens to be channeled into a "properly condi t ioned" pathway, but not otherwise. Then our equation (31) says that the properly conditioned pathways behave like " i n f e c t e d " individuals who infect others until every pathway is "properly condi t ioned."

Equation (24), on the other hand, derived from the s implest assumption of the Estes-Bush-Mostel ler type, says that the " i n f e c t i o n " strikes randomly from the outside, i .e . , the fraction of the non- infected is exponentially decaying. Our conjecture therefore that equation (7), or equivalently, (31), is more characterist ic of human subjects may well be a conjecture about a higher interdependency among conditioned pathways in the human brain.

~[his work was supported in part by a research grant from the Rackham Fund of the Horace ti. Rackham School of Graduate Studies, University of Michigan, and in part by a grant-in-aid by the International Society for General Seman t i c s .

LITERATURE

Bush, R. R. and F. Mosteller. 1955. Stochastic Models /or Learning. New York: John Wiley and Sons.

Estes, W.K. 1950. "Toward a Statistical Theory of Learning." Psychol. Rev., 57, 94-107.

Landahl, H. D. 1941. "Studies in the Mathematical Biophysics of Dis- crimination and Conditioning II: Special Case: Errors, Trials, and Number of Possible Responses." Bull. Math. Biophysics, 3, 71-77.

Rapoport, A. 1956. "On the Application of the Information Concept to Learning Theory." BUZZ. Math. Biophysics, 18, 317-321.

Shannon~ C. E. and W. Weaver. 1949. The Mathematical Theory o/ Communication. Urbana: University of Illinois Press.

R EC EI V ED 9-15-59

Documents

A derivation of a rote learning curve from the total uncertainty of a task