A derivation of a rote learning curve from the total uncertainty of a task


<ul><li><p>BULLETIN OF MATHEMATICAL BIOPHYSICS </p><p>VOLUME 22, 1960 </p><p>A DERIVATION OF A ROTE LEARNING CURVE FROM THE TOTAL UNCERTAINTY OF A TASK </p><p>ANATOL RAPOPORT, UNIVERSITY OF MICHIGAN </p><p>The derivation of H. D. Landahl's learning curve (1941, Bull. Math. Biophysics, 3, 71-77) from a single information-theoretical assumption obtained previously (Rapoport, 1956, Bull. Math. Biophysics, 18, 317-21) is extended to obtain the entire family of such curves with the number of stimuli M (to each of which one of N responses is to be associated) as a parameter. No additional assumptions are required. The entire family thus appears as a function of a single free parameter, k, all other parameters being experimentally determined. The theory is compared with a set of experiments involving the learning of artificial languages. An alternative quasi-neurological model leading to the same equation is offered. </p><p>In a previous paper (Rapoport, 1956), we showed how a learning curve, derived by H. D. Landahl (1941) from a neurological model, could be derived as a consequence of a single assumption, namely that the rate of decrease of uncertainty of response per error is constant. That is, we derived an equation formally identical to Landahl's, in which our single parameter, k, corresponded to his (essentially) single parameter, kb. </p><p>We were not able, however, to show how this parameter depended on M, the number of independent stimuli with each of which one of N responses was to be associated. Hence, to get a family of curves formally identical to Landahl's, we were obliged to make the same additional (ad hoc) assumption, namely, k = η exp{−εM}, where η and ε were now two free parameters. </p><p>Here we will drop this extra assumption and will re-derive our learning curve from a slightly different hypothesis. The form of each single curve of the family will then remain the same.
But the family will now contain the experimentally controlled parameter M, and the resulting dependence of our parameter (now k') upon M will be similar (though not identical) to Landahl's. We will then fit our family of curves to some experimental data. </p><p>We will assume this time that the rate of increase of the total information gained by the learner per error is constant. </p><p>Let h(t) be the total information gained after t stimulus presentations, the stimuli being chosen at random from a set of M. </p><p>Then, by our hypothesis, </p><p>dh/dw = k, a constant, (1) </p><p>where w is the cumulated number of wrong responses. As in the previous paper (Rapoport, 1956), we define the uncertainty of the response to a given stimulus in "nits" as −log_e(1 − dw/dt), that is, the negative logarithm of the probability of a correct response to any one stimulus.* </p><p>Then the total uncertainty remaining will be U = −M log_e(1 − dw/dt). If H is the initial total uncertainty of the task, the amount of information gained will be </p><p>h = H − U = H + M log_e(1 − dw/dt). (2) </p><p>Integrating (1), we get </p><p>h = kw, (3) </p><p>where the constant of integration must vanish, since h(0) = w(0) = 0. Hence, substituting (3) into (2), </p><p>kw = H + M log_e(1 − dw/dt), (4) </p><p>exp{(kw − H)/M} = 1 − dw/dt, (5) </p><p>dw / (1 − exp{(kw − H)/M}) = dt. (6) </p><p>Integrating (6), we obtain, again considering the initial condition w(0) = 0, </p><p>w = (M/k) log_e [ exp{H/M} / ((exp{H/M} − 1) exp{−kt/M} + 1) ]. (7) </p><p>*Assuming equi-probability of possible responses, this definition coincides with Shannon's uncertainty (Shannon and Weaver, 1949). A "nit" is a "natural bit"; one bit = log_e 2 = 0.693... nits. </p><p>If the responses to the M stimuli are independently assigned, then the total initial uncertainty of the task must be </p><p>H = M log_e N.
(8) </p><p>Substituting (8) into (7), we obtain </p><p>w = (M/k) log_e [ N / ((N − 1) exp{−kt/M} + 1) ], (9) </p><p>which is Landahl's equation (11), where his parameter kb coincides with our parameter k' = k/M (Landahl, 1941). In other words, instead of Landahl's assumed exponential dependence upon M (kb = η exp{−εM}), we obtain a hyperbolic relation as a consequence of our hypothesis. When only a few curves are available (in Landahl's experiment, three), the distinction between the two relations cannot be made with confidence. </p><p>If the responses are not independently assigned, we cannot assume (8). However, the quantity exp{H/M} is still a constant in a given experiment. If we designate this constant by N', our equation (7) is still formally identical with Landahl's equation (11). However, N' now becomes the "effective" N, which will, in general, be smaller than the actual number of possible responses to each stimulus. The decrease is a result of the redundancy introduced by non-independent assignments of responses to stimuli. </p><p>Application to an Experiment. Several groups of subjects (10-50 in each group) were taught several artificial languages, respectively, by associating irregular geometric figures with their names in the languages. The referents were the same in all cases, namely, a set of figures of five different shapes and five different colors. Out of the set of 25 possible combinations, 20 were used as the referents, the remaining five as test stimuli wherever generalization of the naming principle could be made. In the course of the training, each of the 20 referents, chosen in random order, was presented to the subject, who was required to make a verbal response: a nonsense syllable or a pair of nonsense syllables, depending on what language was being learned. Only a completely correct response (i.e., correct pronunciation) was counted as correct. All others were counted as errors.
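</p><p>Before turning to the data, the closed form (7), its special case (9), and the differential equation (6) can be cross-checked numerically. The sketch below is illustrative only: the values of M, N, and k are assumptions chosen for the check, not values from the experiments.</p>

```python
import math

def w_eq7(t, H, M, k):
    # Equation (7): w = (M/k) log_e[ exp{H/M} / ((exp{H/M} - 1) exp{-kt/M} + 1) ]
    A = math.exp(H / M)
    return (M / k) * math.log(A / ((A - 1) * math.exp(-k * t / M) + 1))

def w_eq9(t, N, M, k):
    # Equation (9): the same curve with exp{H/M} = N, i.e. H = M log_e N
    return (M / k) * math.log(N / ((N - 1) * math.exp(-k * t / M) + 1))

def w_ode(t_end, H, M, k, dt=0.001):
    # Forward-Euler integration of equation (6): dw/dt = 1 - exp{(kw - H)/M}
    w, t = 0.0, 0.0
    while t < t_end:
        w += dt * (1.0 - math.exp((k * w - H) / M))
        t += dt
    return w

M, N, k = 20, 5, 0.139           # illustrative values, not experimental data
H = M * math.log(N)              # equation (8)
assert abs(w_eq7(0.0, H, M, k)) < 1e-12                       # w(0) = 0
assert abs(w_eq7(100.0, H, M, k) - w_eq9(100.0, N, M, k)) < 1e-9
assert abs(w_eq7(100.0, H, M, k) - w_ode(100.0, H, M, k)) < 0.1
```

The curve rises from w(0) = 0 toward the asymptote H/k, in agreement with equation (10) below.
<p>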
Whether the subject's response was correct or not, the experimenter countered with the correct response. Each session lasted about an hour. Successive sessions were held on consecutive days. It took about 4 to 30 sessions to complete the learning to criterion (three perfect consecutive scores on the 20 responses in random order), depending on the language and, of course, on the subject. The languages were structured as follows: </p><p>L1: With each of the 20 figures a single syllable was associated. </p><p>L2: With each of the 20 figures a pair of distinct syllables was associated, no single syllable being associated with more than one figure. </p><p>L3: With each of the five shapes and with each of the five colors a single syllable was associated, respectively. The syllables were combined in pairs to name each figure by shape and color. The shape syllable always came first. </p><p>L4: The vocabulary of this language was the same as that of L3 (10 syllables), but there was no systematic correspondence between the single syllables and the aspects (color and shape) of the referents. Thus to each referent there corresponded a unique ordered pair of syllables, but there were inconsistencies in correspondences between single syllables and the aspects of the reference figures. </p><p>To test our theory, we turn our attention first to the asymptotic values of w. Setting t = ∞ in equation (7), we get the theoretical asymptotic value </p><p>w(∞) = W = H/k. (10) </p><p>If now k is indeed independent of the complexity of the task, then H/W ought to be constant for all languages or, at least, its fluctuations ought to be attributable to population sample fluctuations. </p><p>To test this hypothesis, we need to calculate the total uncertainty H in each language. Note that the assignments of responses to stimuli are not made independently, except in L4.
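</p><p>The total uncertainties derived for the four languages in the paragraphs that follow reduce to a few lines of natural-logarithm arithmetic; a sketch (all values in nits):</p>

```python
import math

bit = math.log(2)   # one bit = 0.693... nits

H_L1 = math.log(math.factorial(20))            # log_e 20!        -> 42.3 nits
H_L2 = 2 * H_L1 + bit                          # twice L1, plus one bit for syllable order
H_L3 = 2 * math.log(math.factorial(5)) + bit   # 2 log_e 5! + 0.7 -> 10.3 nits
H_L4 = 20 * 2 * math.log(5) + bit              # 20 independent ordered pairs -> 65.1 nits

# Note: H_L2 evaluates to 85.36; the text's figure of 85.3 comes from
# adding the rounded values 84.6 + 0.7.
for name, H in (("L1", H_L1), ("L2", H_L2), ("L3", H_L3), ("L4", H_L4)):
    print(name, round(H, 1))
```

<p>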
In L1, for example, having assigned a syllable to one figure, we have only 19 syllables to choose from to assign one to the next figure. Obviously the total number of assignments is 20!. Therefore the total uncertainty in L1 is </p><p>H(L1) = log_e 20! = 42.3. (11) </p><p>The task of learning L1 was given to 20 subjects. The average total number of errors per subject was observed to be 304. Hence, the value of k was calculated as 42.3/304 = 0.139. </p><p>Turning to L2 and taking the syllable as our "unit," we find that the amount of uncertainty in that task is just twice that of L1, or 84.6 nits. We should add another bit (= 0.7 nits) to represent the uncertainty of which group of 20 syllables represents the repertoire from which the first syllable of the two is chosen, since the two syllables must be pronounced in the correct order. Taking 85.3 nits as our total uncertainty and observing the total average (10 subjects) number of errors to be W = 540, we obtain k(L2) = 85.3/540 = 0.158, which agrees fairly well with the value previously obtained. </p><p>In L3 we had data on 50 subjects. The average W was observed to be 85.8. To calculate H, note that the units are now syllables to be associated with the "aspects" of the referents, namely the five shapes and five colors. Therefore, the amount of uncertainty is now </p><p>H(L3) = 2 log_e 5! + 0.7 = 10.3. (12) </p><p>Hence, k(L3) = 10.3/85.8 = 0.120. We see that this value fluctuates, but so far not excessively. Anticipating statistical analysis, we conjecture that the fluctuations of k observed do not exceed those expected in samples of populations of this size. </p><p>Finally, turning to L4, we note that here the responses are assigned independently, each of the two syllables being selected at random (with replacement) from its own repertoire of 5. Therefore </p><p>H(L4) = 20 × 2 log_e 5 + 0.7 = 65.1.
(13) </p><p>Observing W = 720 (10 subjects), we obtain k(L4) = 65.1/720 = 0.09. This is the largest deviation of k we have observed. It can perhaps be attributed to the confusion in the language due to the non-unique associations between the syllables and the shape-color aspects of the referents. </p><p>We now wish to compare the theoretically predicted learning curves with our data. We can proceed in two ways. </p><p>1. Determine k from the data on one of the languages. Having thus fixed the k, use it in all the experiments to calculate the value of H (H = kW), which will then be the free parameter in each curve. </p><p>2. Calculate the value of H in advance, as we have done above, and use it and W to determine k, which will then be the free parameter. </p><p>Comparisons of theory and experiment are shown in Figures 1-4. Curves I result if k is kept fixed throughout at 0.139, obtained from data on L1, and H is treated as a free parameter. </p><p>FIGURE 1. Comparison of theory and experiment for L1 (curves I and II: k = .139, H = 42.3). Number of trials (horizontal) vs. cumulated errors: the curve calculated from equation (7), fitted with k as a free parameter, against observed values (circles). Here curves I and II are identical (see text). </p><p>In L1, L2, and L4, M = 20 (twenty referents), but in L3, M = 10 (the referents are five shapes and five colors). Curves II result if H is treated as an experimentally controlled parameter, calculated for each language as shown above, and k is treated as a free parameter. In L1, L2, and L3, the two curves are barely distinguishable; but in L4 the discrepancy is considerable.
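</p><p>Method 2 reduces to one division per language, k = H/W, with H calculated theoretically and W observed; a sketch using the values quoted above:</p>

```python
# Method 2: H is calculated theoretically; k = H/W is then determined from
# the observed average total number of errors W (values quoted in the text).
data = {            # language: (H in nits, observed W)
    "L1": (42.3, 304),
    "L2": (85.3, 540),
    "L3": (10.3, 85.8),
    "L4": (65.1, 720),
}
k = {lang: H / W for lang, (H, W) in data.items()}
for lang in ("L1", "L2", "L3", "L4"):
    print(lang, round(k[lang], 3))   # 0.139, 0.158, 0.12, 0.09
```

<p>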
Since in the latter case curve II gives definitely the better fit, we prefer to use method 2, which incidentally is also more attractive from the point of view of applying information theory to the learning model, since in this method the "information" of a language is theoretically calculated. </p><p>We note, further, that the theory fits the curves obtained in L1 and L3 very well with relatively slight fluctuation of the parameter. It also fits L4 very well but with a considerable fluctuation of the parameter k in the direction expected because of the "interference" inherent in that language. </p><p>In the case of L2, however, there seems to be a discrepancy between the theoretical curve and the data. </p><p>To discuss the possible reasons for the discrepancy, it will be useful to prove the following </p><p>Lemma. ∂w/∂H ≥ 0; ∂w/∂k ≥ 0 (W held fixed). </p><p>Since H = kW, it is sufficient to prove the first of these inequalities. </p><p>Proof. Substituting k = H/W into (7), we obtain </p><p>w = (WM/H) log_e [ exp{H/M} / ((exp{H/M} − 1) exp{−Ht/WM} + 1) ]. (14) </p><p>Differentiating with respect to t, we get </p><p>dw/dt = (exp{H/M} − 1) exp{−Ht/WM} / [ (exp{H/M} − 1) exp{−Ht/WM} + 1 ]. (15) </p><p>For t = 0, (15) becomes 1 − exp{−H/M}. Let now w1 = w(H1), w2 = w(H2), where H2 &gt; H1. Then </p><p>(dw2/dt − dw1/dt) at t = 0 equals exp{−H1/M} − exp{−H2/M} &gt; 0. (16) </p><p>FIGURE 2. Comparison of theory and experiment for L2 (curve I: k = .139, H = 75.0, W = 540; curve II: k = .158, H = 85.3, W = 541). Curve I is obtained by taking the value of k from L1 and treating H as a free parameter; curve II by calculating H theoretically and treating k as a free parameter.
The discrepancy between the theoretical curve and the data is discussed in the text. The discrepancy between the values of W is due to rounding-off error. </p><p>FIGURE 3. Comparison of theory and experiment for L3 (curve I: k = .139, H = 11.9; curve II: k = .120, H = 10.3; W = 85.8). Here M = 10. </p><p>In other words, w2 increases faster than w1 at t = 0. Further, if W is fixed, w1(∞) = w2(∞) = W, and of course w1(0) = w2(0) = 0. Our lemma will, therefore, be proved if we can show that w2 − w1 has a single maximum, i.e., that (dw2/dt − dw1/dt) has a single positive root. Setting this expression equal to zero, we obtain from (15), after simplifications, </p><p>(exp{H2/M} − 1) exp{−H2 t/WM} − (exp{H1/M} − 1) exp{−H1 t/WM} = 0. (17) </p><p>FIGURE 4. Comparison of theory and experiment for L4 (curve I: k = .139, H = 100.1, W = 720; curve II: k = .090, H = 65.1, W = 723). Here the difference between curves I and II is considerable. Preference, therefore, is given to calculating H theoretically and taking a lower value of k. </p><p>But this is of the form </p><p>a exp{−xt} − b exp{−yt} = a exp{−xt} [1 − (b/a) exp{(x − y)t}] = 0, (18) </p><p>where a &gt; b, x &gt; y. The expression in the brackets vanishes for exactly one positive value of t, and its coefficient never; so our lemma is proved. </p><p>The result can now be applied to our theory as follows. If w(0) is pegged at 0, and w(∞) at some observed asymptotic value W, then if H (and consequently k) increases, the entire curve shifts upward, and vice versa.
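</p><p>The lemma can be illustrated numerically: holding W and M fixed in (14) and increasing H raises the whole curve. The values of H1, H2, W, and M below are illustrative assumptions, not experimental:</p>

```python
import math

def w(t, H, W, M):
    # Equation (14): equation (7) with k = H/W substituted
    A = math.exp(H / M)
    return (W * M / H) * math.log(A / ((A - 1) * math.exp(-H * t / (W * M)) + 1))

W_fixed, M = 300.0, 20.0
H1, H2 = 30.0, 60.0            # H2 > H1, same asymptote W
diffs = [w(t, H2, W_fixed, M) - w(t, H1, W_fixed, M) for t in range(0, 5001, 25)]
assert all(d >= -1e-9 for d in diffs)                 # the larger-H curve lies above
assert abs(w(1e6, H1, W_fixed, M) - W_fixed) < 1e-6   # both curves approach W
```

<p>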
</p><p>Observe now the discrepancy between the theoretical and experimental curves in L2. This discrepancy can be reduced (perhaps practically eliminated) if the value of k (and of H with it) is taken lower than 0.158. We note that this value is the highest value of k obtained...</p></li></ul>

