A stochastic model for free recall

PSYCHOMETRIKA--VOL. 27, NO. 2 JUNE, 1962

A STOCHASTIC MODEL FOR FREE RECALL

NANCY C. WAUGH*

HARVARD UNIVERSITY

AND

J. E. KEITH SMITH

LINCOLN LABORATORY†

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

A statistical model for verbal learning is presented and tested against experimental data. The model describes a Markov process with a realizable absorbing state, allowing complete learning on some finite trial as well as imperfect retention prior to this trial.

This paper describes a probabilistic model for verbal learning. The reason for adding another such model to the number already available [2] is that none of the earlier models is adequate to describe some recent experimental data on free recall. In our experiment, subjects would look at a series of 48 words, presented one at a time, then attempt to recall them in any order they chose. Most of the nine subjects recited 12 such lists, each for six trials. The total number of lists recited was 105, and the total number of subject-words was therefore 5040. The data were self-consistent and reliable enough to make us dissatisfied with the models that would not fit them, and to motivate us to find a better alternative.

The experiment, described in detail in [9], was similar to one reported by Bruner, Miller, and Zimmerman [1]. It was in fact performed partly in order to replicate their data, which have been fitted with linear-operator and set-theoretical models by Bush and Mosteller [3] and by Miller and McGill [7], respectively. Our experimental procedure differed from that of Bruner, Miller, and Zimmerman in two ways: our subjects each learned several lists of words rather than just one, and instead of listening to the words they looked at them. We have not attempted to discover which of these variations may be responsible for certain differences in our data.

Most conspicuous among these differences is the proportion of words recalled for the first time on each trial. The stochastic models mentioned above predict a geometric distribution for these data. The variance of the distribution that we observed, however, was so large as to render a geometric distribution implausible. At first we thought that this unexpected finding might stem from the differences (i) between our subjects in their ability to learn words or (ii) between our words in their ability to be learned. The excessive variance remained, however, even when each subject's data were analyzed separately. Moreover, no relation was found between the average number of the trial on which a word was first recalled and its frequency of usage in printed English, as estimated by Thorndike and Lorge [8].

*This work was carried out while the author was at Lincoln Laboratory, Massachusetts Institute of Technology.

†Operated with support from the U. S. Army, Navy, and Air Force.

We were thus unable to find any obvious artifactual basis for the discrepancy between our distribution of trials to first recall and that predicted by the earlier models. Therefore we decided to suppose that our distribution was in fact generated by a process different from the one-parameter process assumed by the latter. We did not have to search far in order to find an alternative hypothesis: a straightforward two-stage process was sufficient to account for the distribution of first recalls. A third parameter was subsequently found necessary to describe retention after initial recall.

The following description will be expressed in terms of three hypothetical processes that we have found helpful in understanding the data. These processes were suggested by the three parameters of the model. No attempt has been made to identify them with any classical psychological functions: they cannot be differentiated empirically until we discover how small variations in the experimental procedure affect the data and thus the parameters of the model. The reader should bear in mind, however, that the empirical significance of the model's parameters is not tested by how well the model describes the present set of data. We shall here restrict ourselves entirely to the descriptive problem and leave open the question of the model's generality. We present the following interpretation of the three parameters principally for its heuristic value.

One of the three processes, which we call labeling, occurs with probability $\lambda$ on any trial, and is irreversible. Labeling, in other words, need occur only once in order for a word to be recalled for the first time. Another process, selecting, is assumed to occur with probability $\sigma$ on each trial. It is as though selecting a word were to rehearse it, and labeling it, to find a mnemonic association for it. Blind rehearsal is ineffective, but once a word has acquired a mnemonic tag it is recalled after every trial on which it is rehearsed (or attended to, or selected). A word may be either labeled, or selected, or both, on any trial. In order to be recalled for the first time after a given trial, the word must have been selected on that trial. It must also have been labeled on that trial, or it must have been labeled (but not yet selected) on some previous trial. A word that is selected on trial t, with probability $\sigma$, but not yet labeled, with probability $(1 - \lambda)^t$, will not yet be recalled. On the other hand, a word that is labeled on trial t, with probability $\lambda(1 - \lambda)^{t-1}$, will not be recalled until it is selected, with probability $\sigma$, either on that trial or on some subsequent trial. This word is then recalled again after every trial on which it is selected.

TABLE 1

Relative Frequency of Recall as a Function of the Number of Previous Consecutive Recalls

Previous consecutive recalls    1       2       3       4       5
Proportion recalled             0.797   0.879   0.914   0.968   0.958

This formulation accounts for the distribution of trials to first recall. The model as it stands, however, implies that the conditional probability of recalling an item once it has been recalled should simply be $\sigma$, the probability of its being selected, no matter how often this particular item has been recalled. (The model so far implies in addition that the total proportion of items recalled on each successive trial should approach a value not of unity but also of $\sigma$.) From Table 1 it is clear, however, that the probability of recall is an increasing function of the number of previous consecutive recalls. Consequently, a third process has to be invoked. This process we call fixing. It is assumed that, on any trial on which an item is recalled, it is fixed with probability $\phi$. Once fixed, this item will be recalled on every subsequent trial, regardless of whether it is selected. Before it has been fixed, on the other hand, it must be selected in order to be recalled. Thus it is as though each item in a list will sooner or later become permanently fixed in the learner's memory. It will then always be recalled even when it has not been rehearsed on a particular trial.
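As a brief check on the parenthetical remark about the asymptote, consider the model with labeling and selecting only. Writing $\lambda$ and $\sigma$ for the labeling and selecting probabilities, a word is recalled on trial t exactly when it has been labeled by trial t and is selected on trial t. The proportion recalled under this two-process version, written $\tilde{R}_t$ here (a symbol introduced only for this sketch), would then be

$$\tilde{R}_t = \sigma\left[1 - (1 - \lambda)^t\right] \longrightarrow \sigma \quad \text{as } t \to \infty,$$

which suggests why, without a further process, the learning curve would level off at $\sigma$ rather than at unity.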

A word may accordingly be in any one of five states after a given trial.

1. It has not yet been labeled.
2. It has been labeled but not yet selected.
3. It has been labeled and was selected (and therefore recalled) on this trial, but has not yet been fixed.
4. It was recalled but not fixed on some previous trial, and it was not selected (and therefore not recalled) on this trial.
5. It has been fixed, either on this trial or on some previous trial.

All words are initially in state 1. All of them eventually end up in state 5. As far as the formal properties of the model are concerned, state 4 is exactly equivalent to state 2. In applying the model, however, it will be necessary to distinguish between a word that has been forgotten and one that has not yet been recalled. We have therefore distinguished between these two formally identical states.

FIGURE 1

Directed Graph of the Hypothetical Learning Process

The states are defined as follows: (1) not labeled (not yet recalled), (2) labeled, not processed (not yet recalled), (3) processed, not stored (recalled), (4) not processed, not stored (forgotten), (5) stored (recalled).

The five states are represented by the numbered circles in Fig. 1. The arrows here denote the paths open to a word on a particular trial. According to this diagram, a word may go from state 1 (not labeled) to state 2 (labeled but not selected), to state 3 (recalled but not fixed), or to state 5 (fixed). It may similarly go from state 2 to state 3 or state 5. A word in state 3 (recalled but not fixed) may go to state 4 (forgotten) or to state 5 (fixed). A word in state 4 (forgotten) may move to state 3 (recalled) or to state 5 (recalled and fixed). A word in any one of states 1 through 4, furthermore, may also remain in that state on a given trial. A word in state 5 always remains in this state. The present model, then, describes a five-state Markov process with an absorbing state (state 5). All words eventually reach this state, which represents perfect retention. For a general discussion of Markovian models in psychology, see Miller [6].
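The paths just described can be sketched as a single-trial update for one word. The following Python fragment is only an illustration of the verbal rules, not the authors' procedure; the function names `next_state` and `recall_pattern` and the argument names `lam`, `sigma`, and `phi` (standing for the labeling, selecting, and fixing probabilities) are assumptions made for this sketch.

```python
import random

def next_state(state, lam, sigma, phi, rng):
    """Advance one word by one trial; states are numbered 1-5 as in the text."""
    if state == 5:                       # fixed: absorbing state
        return 5
    # Labeling is irreversible: only a word in state 1 still needs to be labeled.
    labeled = state != 1 or rng.random() < lam
    if not labeled:
        return 1
    selected = rng.random() < sigma
    if not selected:
        # Labeled but not selected: "not yet recalled" (2) or "forgotten" (4).
        return 2 if state in (1, 2) else 4
    # Labeled and selected, hence recalled; fixed with probability phi.
    return 5 if rng.random() < phi else 3

def recall_pattern(n_trials, lam, sigma, phi, rng):
    """Return a list of booleans: was the word recalled on each trial?"""
    state, pattern = 1, []
    for _ in range(n_trials):
        state = next_state(state, lam, sigma, phi, rng)
        pattern.append(state in (3, 5))  # states 3 and 5 are the recalled states
    return pattern

if __name__ == "__main__":
    rng = random.Random(1)
    print(recall_pattern(6, 0.42, 0.60, 0.49, rng))
```

A word is counted as recalled on a trial whenever it ends that trial in state 3 or state 5, which is the only place the distinction between states 2 and 4 matters in the sketch.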

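According to the present hypothesis, the proportion of words that have not yet been labeled (and have therefore not yet been recalled) on trial i is given by

$$P_{i,1} = (1 - \lambda) P_{i-1,1}. \tag{1}$$

The proportion that is labeled but not yet selected, and thus not yet recalled, by this trial is

$$P_{i,2} = (1 - \sigma) P_{i-1,2} + \lambda(1 - \sigma) P_{i-1,1}. \tag{2}$$

The proportion recalled on this trial but not yet fixed is

$$P_{i,3} = \sigma(1 - \phi)(P_{i-1,2} + P_{i-1,3} + P_{i-1,4}) + \sigma\lambda(1 - \phi) P_{i-1,1}. \tag{3}$$

The proportion forgotten on this trial (recalled at least once before but not yet fixed, and not selected on this trial) is

$$P_{i,4} = (1 - \sigma)(P_{i-1,3} + P_{i-1,4}). \tag{4}$$

Finally, the proportion fixed by this trial is

$$P_{i,5} = P_{i-1,5} + \sigma\phi(P_{i-1,2} + P_{i-1,3} + P_{i-1,4}) + \lambda\sigma\phi P_{i-1,1}. \tag{5}$$

This system of equations may be written in matrix notation as follows:

$$
\begin{bmatrix} P_{i,5} \\ P_{i,4} \\ P_{i,3} \\ P_{i,2} \\ P_{i,1} \end{bmatrix}
=
\begin{bmatrix}
1 & \sigma\phi & \sigma\phi & \sigma\phi & \lambda\sigma\phi \\
0 & 1 - \sigma & 1 - \sigma & 0 & 0 \\
0 & \sigma(1 - \phi) & \sigma(1 - \phi) & \sigma(1 - \phi) & \lambda\sigma(1 - \phi) \\
0 & 0 & 0 & 1 - \sigma & \lambda(1 - \sigma) \\
0 & 0 & 0 & 0 & 1 - \lambda
\end{bmatrix}
\begin{bmatrix} P_{i-1,5} \\ P_{i-1,4} \\ P_{i-1,3} \\ P_{i-1,2} \\ P_{i-1,1} \end{bmatrix}
\tag{6}
$$

Let T denote this matrix of transitional probabilities, and let $p_i$ denote the column vector of state probabilities on trial i. Therefore $T p_{i-1} = p_i$. Before the learning trials begin, all words are in state 1. The initial distribution of probabilities $p_0$ is thus the column vector $[0, 0, 0, 0, 1]'$. Given this initial vector, the state probabilities on trial i are

$$
\begin{aligned}
P_{i,5} &= 1 - (1 - \sigma\phi)^{i+1} - \frac{\sigma\phi}{\lambda - \sigma\phi}\left[(1 - \sigma\phi)^{i+1} - (1 - \lambda)^{i+1}\right], \\
P_{i,4} &= \frac{\lambda(1 - \sigma)}{\lambda - \sigma\phi}(1 - \sigma\phi)^{i} + \frac{\lambda(1 - \sigma)}{\sigma - \lambda}(1 - \sigma)^{i} - \frac{\sigma(1 - \sigma)\lambda(1 - \phi)}{(\sigma - \lambda)(\lambda - \sigma\phi)}(1 - \lambda)^{i}, \\
P_{i,3} &= \frac{\lambda\sigma(1 - \phi)}{\lambda - \sigma\phi}\left[(1 - \sigma\phi)^{i} - (1 - \lambda)^{i}\right], \\
P_{i,2} &= \frac{\lambda(1 - \sigma)}{\sigma - \lambda}\left[(1 - \lambda)^{i} - (1 - \sigma)^{i}\right], \\
P_{i,1} &= (1 - \lambda)^{i}.
\end{aligned}
\tag{7}
$$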
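The relation between the recurrences (1) through (5), their matrix form, and the closed-form state probabilities can be checked numerically. The sketch below is not taken from the paper: it assumes the states are ordered 5, 4, 3, 2, 1 (as the initial vector $[0, 0, 0, 0, 1]'$ implies), reads the second term of (2) as $\lambda(1 - \sigma)P_{i-1,1}$, and uses closed forms reconstructed from the recurrences. The function names and the arguments `lam`, `sigma`, `phi` are illustrative only.

```python
def transition_matrix(lam, sigma, phi):
    """Column-stochastic T with rows and columns ordered as states 5, 4, 3, 2, 1."""
    s, f = sigma, phi
    return [
        [1.0, s * f,       s * f,       s * f,       lam * s * f],        # into state 5
        [0.0, 1 - s,       1 - s,       0.0,         0.0],                # into state 4
        [0.0, s * (1 - f), s * (1 - f), s * (1 - f), lam * s * (1 - f)],  # into state 3
        [0.0, 0.0,         0.0,         1 - s,       lam * (1 - s)],      # into state 2
        [0.0, 0.0,         0.0,         0.0,         1 - lam],            # into state 1
    ]

def apply(T, p):
    """Matrix-vector product T p for plain Python lists."""
    return [sum(T[r][c] * p[c] for c in range(len(p))) for r in range(len(T))]

def state_probabilities(i, lam, sigma, phi):
    """Iterate T p_{i-1} = p_i starting from all words in state 1."""
    p = [0.0, 0.0, 0.0, 0.0, 1.0]
    T = transition_matrix(lam, sigma, phi)
    for _ in range(i):
        p = apply(T, p)
    return p

def closed_form(i, lam, sigma, phi):
    """Reconstructed closed forms; assumes sigma != lam and lam != sigma * phi."""
    a, b, c = 1 - lam, 1 - sigma * phi, 1 - sigma
    d1, d2 = lam - sigma * phi, sigma - lam
    p1 = a ** i
    p2 = lam * c / d2 * (a ** i - c ** i)
    p3 = lam * sigma * (1 - phi) / d1 * (b ** i - a ** i)
    p4 = lam * c * (b ** i / d1 + c ** i / d2
                    - sigma * (1 - phi) * a ** i / (d1 * d2))
    p5 = 1 - b ** (i + 1) - sigma * phi / d1 * (b ** (i + 1) - a ** (i + 1))
    return [p5, p4, p3, p2, p1]

if __name__ == "__main__":
    for i in range(1, 7):
        print(i, [round(x, 4) for x in state_probabilities(i, 0.42, 0.60, 0.49)])
```

Iterating the matrix avoids the divisions by $\sigma - \lambda$ and $\lambda - \sigma\phi$, so it remains usable when two parameters coincide, as in the simulations reported later with all parameters equal to .50.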

Estimation of the Parameters

The main consideration that led to this model was the distribution of trials to first recall, which is shown in Fig. 2. According to the model, the probability of first recall on trial i, $F_i$, is given by $P_{i-1,2} + P_{i-1,1} - P_{i,2} - P_{i,1}$, which by (7) is

$$
F_i =
\begin{cases}
\dfrac{\sigma\lambda}{\sigma - \lambda}\left[(1 - \lambda)^{i} - (1 - \sigma)^{i}\right] & \text{when } \sigma \neq \lambda, \\[2ex]
i\lambda^{2}(1 - \lambda)^{i-1} & \text{when } \sigma = \lambda.
\end{cases}
\tag{8}
$$

Equation (8) describes a negative binomial distribution in the special case that $\sigma = \lambda$. The minimum chi-square estimators of $\lambda$ and $\sigma$ based on the data shown in Fig. 2 are $\lambda = \sigma = 0.495$. This is the center of a confidence ellipse within which $\chi^2$ is less than 9.5, the 5-percent significance level. The extreme values of the ellipse are reached at $\sigma = 0.60$, $\lambda = 0.42$, or vice versa, since (8) is symmetric in $\sigma$ and $\lambda$.

FIGURE 2

Proportion of Words Recalled for the First Time as a Function of the Number of Trials

(The theoretical curve is $F_i = 1.4[(.580)^i - (.400)^i]$.)

TABLE 2

Proportion of Words Forgotten on Trial t - 1 but Recalled on Trial t as a Function of the Number of Previous Recalls (j)

(Number of occurrences in parentheses)

The final estimates of $\lambda$ and $\sigma$ were chosen so as to be maximally consistent with the data shown in Table 2. Here each entry represents the transitional probability for the recall of a word on trial t, given that it was forgotten on trial t - 1 after having been recalled j times previously ($j \geq 1$). For the various combinations of t and j, these relative frequencies range from .54 to .79. Now, the present model predicts that a word which has been recalled at least once, but has then been forgotten on one or more consecutive trials, will be recalled again on the next trial with probability $\sigma$. A word that has been forgotten is in state 4, and its chances of moving either to state 3 or to state 5 on the next trial are $\sigma(1 - \phi)$ and $\sigma\phi$, respectively. The entries in Table 2, then, are estimates of $\sigma$. They are all larger than .495, which was found to be the minimum chi-square estimator for this parameter. Therefore, we chose the largest estimate of $\sigma$ consistent with the first-recall data, or .60. The corresponding value of $\lambda$ is in this case .42.

According to (3) and (5), the entries in Table 2 should be independent of j, the number of times a word has been recalled previously, as well as of t, the trial on which it is recalled again. It is evident, however, that these proportions are greater for $j \geq 2$ than for $j = 1$. The data are in this respect at variance with the model.

The next characteristic of the data that we examined was the learning curve, the proportion of recalls as a function of the number of trials. This function appears in Fig. 3. From the model, the proportion of recalls on trial i, $R_i$, is given by $P_{i,3} + P_{i,5}$, which by (7) is

$$R_i = 1 - (1 - \sigma\phi)^{i} - \frac{\sigma(\phi - \lambda)}{\lambda - \sigma\phi}\left[(1 - \sigma\phi)^{i} - (1 - \lambda)^{i}\right]. \tag{9}$$

Note that if $\phi = \lambda$ this yields a negative exponential function; and even when $\phi \neq \lambda$ the third term of (9) is likely to be rather small. Using the estimates of $\lambda$ and $\sigma$ previously obtained, the least-squares estimate of $\phi$, the fixing parameter, was found to be .49. Equation (9) with these values is plotted as the theoretical curve in Fig. 3.

FIGURE 3

Total Proportion of Words Recalled as a Function of the Number of Trials

(The theoretical curve is $R_i = 1 - (.706)^i - \tfrac{1}{3}[(.706)^i - (.580)^i]$.)

The goodness of fit of (8) depends primarily on the average of the estimates of $\lambda$ and $\sigma$. The data restrict this average to lie between .485 and .510. The fit of (9) depends primarily on the product $\sigma\phi$, which can take on values between .25 and .30, depending on the other parameter.
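The dependence of (8) on the average of $\lambda$ and $\sigma$ can be illustrated by evaluating the first-recall distribution at the two parameter pairs mentioned in the text. The Python sketch below is illustrative only. The function names are assumptions, and the closed form used for the learning curve $R_i$ is a reconstruction from the recurrences (1) through (5), chosen to agree with the theoretical curve quoted under Fig. 3.

```python
def first_recall(i, lam, sigma):
    """Equation (8): probability that a word is first recalled on trial i."""
    if abs(sigma - lam) < 1e-12:
        # Special case sigma == lam (negative binomial form).
        return i * lam ** 2 * (1 - lam) ** (i - 1)
    return sigma * lam / (sigma - lam) * ((1 - lam) ** i - (1 - sigma) ** i)

def recalled(i, lam, sigma, phi):
    """Reconstructed learning curve R_i; assumes lam != sigma * phi."""
    a, b = 1 - lam, 1 - sigma * phi
    coeff = sigma * (phi - lam) / (lam - sigma * phi)
    return 1 - b ** i - coeff * (b ** i - a ** i)

if __name__ == "__main__":
    print("trial  F(.42,.60)  F(.495,.495)  R(.42,.60,.49)")
    for i in range(1, 7):
        print(i,
              round(first_recall(i, 0.42, 0.60), 4),
              round(first_recall(i, 0.495, 0.495), 4),
              round(recalled(i, 0.42, 0.60, 0.49), 4))
```

Printing the two first-recall columns side by side shows how little the distribution changes when the two parameters are moved apart while their average stays near .50.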

It is clear that the present model enables us to calculate the probability that a single word will or will not be recalled on each of n successive trials. In our experiment there were six trials, and the number of possible patterns of recall and non-recall was therefore 64. The ideal test of the model would be to compare the 64 frequencies observed in the experiment with those predicted by the model. Many of these patterns, however, were too rare, either in the model or in the data, to provide a valid comparison. We therefore decided to pool them in two ways: according to the trial of first recall and according to the number of previous recalls. Thus we tabulated the frequencies with which (i) words recalled for the first time on trial i were also recalled on trial t and (ii) words recalled i times previously were recalled on trial t. The total number recalled on any trial is, of course, obtained in either case by summing over i. These frequencies can be derived from the model as follows.


(i) Let us first determine what proportion of the items that were recalled for the first time on trial i will be recalled again on trial i + j. The model states that an item that is initially recalled on trial i has moved on this trial from state 1 or state 2 into state 3 or state 5. Once it has done so, it cannot revert to state 1 or state 2. Thus the nine transitional probabilities that appear in the intersection of the first three columns and rows of T, the matrix operator in (6), form a closed set. Therefore, in order to predict what state a word will be in on the jth trial after it was first recalled, we have simply to apply this set of nine transitional probabilities to the vector which represents a set of state probabilities on the previous trial, where these states are now 3, 4, and 5.

Let us call the new matrix operator U, and let us designate by $R_{j,k}$ the probability that a word which was recalled for the first time on trial i will be in state k on trial i + j. Let $r_j$ denote the column vector of these state probabilities. Then $U r_{j-1} = r_j$:

$$
\begin{bmatrix}
1 & \sigma\phi & \sigma\phi \\
0 & 1 - \sigma & 1 - \sigma \\
0 & \sigma(1 - \phi) & \sigma(1 - \phi)
\end{bmatrix}
\begin{bmatrix} R_{j-1,5} \\ R_{j-1,4} \\ R_{j-1,3} \end{bmatrix}
=
\begin{bmatrix} R_{j,5} \\ R_{j,4} \\ R_{j,3} \end{bmatrix},
\qquad j \geq 1.
\tag{10}
$$

The column vector of initial probabilities, $r_0$, is $[\phi, 0, 1 - \phi]'$, since a proportion $\phi$ of the words are fixed (go into state 5) on the trial on which they are first recalled, while a proportion $1 - \phi$ are selected but not fixed (go into state 3). Therefore the solution for $r_j$ is

$$
\begin{aligned}
R_{j,5} &= 1 - (1 - \phi)(1 - \sigma\phi)^{j}, \\
R_{j,4} &= (1 - \sigma)(1 - \phi)(1 - \sigma\phi)^{j-1}, \\
R_{j,3} &= \sigma(1 - \phi)^{2}(1 - \sigma\phi)^{j-1},
\end{aligned}
\qquad j \geq 1.
$$

The probability that a word which was recalled for the first time on trial i will be recalled again on trial i + j is therefore $1 - R_{j,4}$, or

$$1 - (1 - \sigma)(1 - \phi)(1 - \sigma\phi)^{j-1}.$$

Each of these probabilities, expressed as a frequency, is shown in Table 3 along with the corresponding observed frequency.

TABLE 3

Number of Words Recalled on Trial t That Were Recalled for the First Time on Trial i

(The expected frequencies are shown in parentheses. Along the main diagonal is the number of words recalled for the first time on each trial.)

t \ i       1        2        3        4        5        6    Total
1       (1270)                                                (1270)
         1255                                                  1255
2       (1011)   (1245)                                       (2256)
         1065     1203                                         2268
3       (1087)    (991)    (925)                              (3003)
         1065      979      941                                2985
4       (1141)   (1065)    (736)    (618)                     (3561)
         1119      992      743      660                       3514
5       (1179)   (1118)    (792)    (492)    (391)            (3972)
         1111     1013      788      516      417              3845
6       (1206)   (1155)    (831)    (529)    (311)    (240)   (4272)
         1133     1068      819      541      330      241     4143

TABLE 4

Number of Words Recalled for the ith Time on Trial t

(The expected frequencies are shown in parentheses. The first column is identical with the main diagonal of Table 3.)

t \ i       1        2        3        4        5        6    Total
1       (1270)                                                (1270)
         1255                                                  1255
2       (1245)   (1011)                                       (2256)
         1203     1065                                         2268
3        (925)   (1146)    (932)                              (3003)
          941     1097      947                                2985
4        (618)    (951)   (1084)    (907)                     (3561)
          660      904     1046      904                       3514
5        (391)    (691)    (934)   (1056)    (900)            (3972)
          417      717      904      967      840              3845
6        (240)    (466)    (706)    (918)   (1044)    (898)   (4272)
          241      489      730      908      970      805     4143
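The expected frequencies of Table 3 can be approximated by combining the first-recall distribution (8) with repeated application of the three-state operator of (10). The following Python sketch is illustrative rather than the authors' computation: it assumes the final estimates .42, .60, and .49 for the labeling, selecting, and fixing parameters, 5040 subject-words, and hypothetical function names.

```python
N_WORDS = 5040                       # subject-words in the experiment
LAM, SIGMA, PHI = 0.42, 0.60, 0.49   # final parameter estimates from the text

def first_recall(i):
    """Equation (8) for sigma != lam."""
    return SIGMA * LAM / (SIGMA - LAM) * ((1 - LAM) ** i - (1 - SIGMA) ** i)

def recall_again(j):
    """P(recalled on trial i + j | first recalled on trial i), by iterating U."""
    r5, r4, r3 = PHI, 0.0, 1 - PHI   # r_0, with states ordered 5, 4, 3
    for _ in range(j):
        unfixed = r4 + r3            # states 4 and 3 share the same transitions
        r5, r4, r3 = (r5 + SIGMA * PHI * unfixed,
                      (1 - SIGMA) * unfixed,
                      SIGMA * (1 - PHI) * unfixed)
    return 1 - r4                    # recalled unless the word is in state 4

def expected_table3(n_trials=6):
    """Expected counts keyed by (t, i): recalled on trial t, first recalled on i."""
    return {(t, i): N_WORDS * first_recall(i) * recall_again(t - i)
            for t in range(1, n_trials + 1) for i in range(1, t + 1)}

if __name__ == "__main__":
    table = expected_table3()
    for t in range(1, 7):
        print(t, [round(table[(t, i)]) for i in range(1, t + 1)])
```

Rounded to whole words, these values can be compared directly with the parenthesized entries in the body of Table 3.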

(ii) We now wish to determine the probability with which a word will be recalled for the jth time on trial t. There exists no simple expression for obtaining this value. In order to estimate it, we first had to determine the probability of each pattern of recall and non-recall possible over the course of t trials. There are, of course, $2^t$ such patterns, half of which specify recall on trial t. The probability of each such pattern was obtained through repeated applications of the matrix given in (6). The next step was to sum the probabilities associated with those patterns which specify a total of j recalls, up to and including trial t. These expected values are shown in Table 4, along with those actually observed.
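The pooling over patterns described in (ii) can be sketched without listing every pattern, by carrying the number of recalls so far alongside the state of the word. This is a swapped-in dynamic-programming shortcut, not the pattern-by-pattern procedure the text describes, although it sums the same probabilities. The function names and the arguments `lam`, `sigma`, `phi` are assumptions made for illustration.

```python
from collections import defaultdict

def transitions(state, lam, sigma, phi):
    """Return {next_state: probability} for one trial (states 1-5 as in the text)."""
    if state == 5:
        return {5: 1.0}                          # fixed words stay fixed
    labeled = lam if state == 1 else 1.0         # chance the word is labeled by now
    unselected = 2 if state in (1, 2) else 4     # where a labeled, unselected word sits
    out = defaultdict(float)
    if state == 1:
        out[1] += 1 - lam                        # still not labeled
    out[unselected] += labeled * (1 - sigma)
    out[3] += labeled * sigma * (1 - phi)        # recalled, not fixed
    out[5] += labeled * sigma * phi              # recalled and fixed
    return out

def jth_recall_probabilities(n_trials, lam, sigma, phi):
    """P(word is recalled for the j-th time on trial t), keyed by (t, j)."""
    dist = {(1, 0): 1.0}                         # (state, recalls so far) -> probability
    result = {}
    for t in range(1, n_trials + 1):
        new = defaultdict(float)
        for (state, k), p in dist.items():
            for nxt, q in transitions(state, lam, sigma, phi).items():
                recalled = nxt in (3, 5)         # states 3 and 5 mean recall this trial
                new[(nxt, k + recalled)] += p * q
        dist = new
        for (state, k), p in dist.items():
            if state in (3, 5):
                result[(t, k)] = result.get((t, k), 0.0) + p
    return result

if __name__ == "__main__":
    probs = jth_recall_probabilities(6, 0.42, 0.60, 0.49)
    for t in range(1, 7):
        print(t, [round(5040 * probs[(t, j)]) for j in range(1, t + 1)])
```

Multiplying each probability by 5040 gives values comparable with the parenthesized entries of Table 4.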


Evaluation of the Model

In order to evaluate the agreement between the model and the data, a series of Monte Carlo runs was performed on an IBM 709. This computer determined, on the basis of a random number, (i) whether a "word" would be labeled on trial n, (ii) whether or not it would be selected on each of six trials, and (iii) whether it would be fixed after it had been recalled n times (n = 1, 2, ..., 6). A "word" was thus either "recalled" or "not recalled" by the computer on each of six trials. This process was repeated 5040 times to simulate an experiment. A total of 1000 such experiments was carried out.
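A simulated experiment of the kind described can be sketched in a few lines of Python. This is a modern illustration under stated assumptions, not a reconstruction of the IBM 709 program: the function name `simulate_experiment`, the use of Python's `random` module, the seed, and the default parameter values of .50 are all choices made for this sketch.

```python
import random

def simulate_experiment(n_words=5040, n_trials=6,
                        lam=0.5, sigma=0.5, phi=0.5, seed=0):
    """Return the number of simulated words recalled on each trial."""
    rng = random.Random(seed)
    counts = [0] * n_trials
    for _ in range(n_words):
        labeled = fixed = False
        for t in range(n_trials):
            if not labeled:
                labeled = rng.random() < lam     # labeling is irreversible
            selected = rng.random() < sigma      # selection is redrawn every trial
            recalled = fixed or (labeled and selected)
            if recalled and not fixed:
                fixed = rng.random() < phi       # fixing can occur only on a recall
            counts[t] += recalled
    return counts

if __name__ == "__main__":
    print(simulate_experiment())
```

Repeating the call with different seeds gives a set of simulated experiments whose per-trial totals can be compared with the right-hand margins of Tables 3 and 4.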

In order to generate the simulated data, the parameters were all set equal to .50 (rather than .42, .60, and .49 for $\lambda$, $\sigma$, and $\phi$, respectively) for two reasons. First, these were the values obtained earlier with a rather inefficient estimation procedure. Furthermore, these values greatly simplified the computer program. The expected values obtained with the two sets of parameters were so similar that we felt justified in following the more convenient procedure.

The data obtained in each simulated experiment were pooled to give the following sets of frequencies: (i) the number of words recalled for the first time on trial i and also on trial t, (ii) the number recalled for the ith time on trial t, and (iii) the total number recalled on trial t. The corresponding frequencies observed in the actual experiment, along with those predicted by the model, are presented in Tables 3 and 4 and in the right-hand margins of these tables, respectively.

FIGURE 4

Distribution of REPs Calculated for the Data Shown in the Body of Table 3

FIGURE 5

Distribution of REPs Calculated for the Data Shown in the Body of Table 4

As an index of goodness of fit, we then computed for each table of actual data a relative error of prediction (REP). This index is simply the absolute difference between an expected and an observed value, divided by the expected, that is, the absolute error expressed as a proportion of the expected value. The REPs obtained for the various entries in a table were averaged to give an index for the entire table. An average REP was also computed for each table of simulated data. The expected frequencies were in this case calculated on the basis of $\lambda = \sigma = \phi = .50$, which were the parameter values used in generating the simulated data. Naturally, we should not expect the simulated data to match the actual data as closely as if the more efficiently estimated parameters had been used. The REPs calculated for each set of data, actual and simulated, should nevertheless be similar if the model is adequate.
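The index just defined is simple to compute. The Python sketch below applies it to the right-hand margins (the per-trial totals) of Tables 3 and 4; the function name `average_rep` is an assumption made for illustration.

```python
def average_rep(expected, observed):
    """Mean of |expected - observed| / expected over paired entries."""
    reps = [abs(e - o) / e for e, o in zip(expected, observed)]
    return sum(reps) / len(reps)

# Right-hand margins (total recalled per trial) of Tables 3 and 4.
expected_totals = [1270, 2256, 3003, 3561, 3972, 4272]
observed_totals = [1255, 2268, 2985, 3514, 3845, 4143]

if __name__ == "__main__":
    print(round(average_rep(expected_totals, observed_totals), 3))
```

For these totals the average comes to about .016, the value the text reports for the right-hand margins.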

Finally, we compiled frequency distributions of the average REPs obtained in the 1000 simulated experiments. These distributions are shown in Figs. 4, 5, and 6. Each of them provides an estimate of the probability of an average REP equal to or greater than the REP calculated for the actual data. The REPs for the data that appear in Tables 3 and 4, and in their right-hand margins, respectively, are .038 (P > .117), .042 (P > .043), and .016 (P > .089). None of these REPs is sufficiently high to warrant our rejecting the hypothesis that the same process generated the actual and the simulated data. Interestingly enough, the earlier estimates ($\lambda = \sigma = \phi = .50$) provided an almost equally good fit to the observed data.

FIGURE 6

Distribution of REPs Calculated for the Data Shown in the Right-hand Margins of Tables 3 and 4

These data, as well as the model, obviously bear upon the currently popular issue of incremental versus discontinuous learning. Most stochastic models have reflected an assumption that responses are learned gradually rather than on a single trial. According to them, the function describing probability of recall over trials is asymptotic to unity, and perfect retention of a particular item therefore cannot occur except after an infinite number of trials. The present model, on the other hand, describes a Markov process with a realizable absorbing state, and thereby allows complete learning to occur on some finite trial. In this respect it resembles Estes' "simple pattern" model [5] and the "Krechevsky" model described by Bush and Mosteller [4]. Unlike them, however, it assumes that initial recall depends on a two-stage process, rather than a unitary one.

REFERENCES

[1] Bruner, J. S., Miller, G. A., and Zimmerman, C. Discriminative skill and discriminative matching in perceptual recognition. J. exp. Psychol., 1955, 49, 187-192.

[2] Bush, R. R. and Estes, W. K. (Eds.) Studies in mathematical learning theory. Stanford: Stanford Univ. Press, 1959.

[3] Bush, R. R. and Mosteller, F. Stochastic models for learning. New York: Wiley, 1955.

[4] Bush, R. R. and Mosteller, F. A comparison of eight models. In Bush, R. R. and Estes, W. K. (Eds.), Studies in mathematical learning theory. Stanford: Stanford Univ. Press, 1959.

[5] Estes, W. K. New developments in statistical behavior theory: differential tests of axioms for associative learning. Psychometrika, 1961, 26, 73-84.

[6] Miller, G. A. Finite Markov processes in psychology. Psychometrika, 1952, 17, 149-168.

[7] Miller, G. A. and McGill, W. J. A statistical description of verbal learning. Psychometrika, 1952, 17, 369-396.

[8] Thorndike, E. L. and Lorge, I. The teacher's word book of 30,000 words. New York: Teachers College, Columbia Univ., 1955.

[9] Waugh, N. C. Free versus serial recall. J. exp. Psychol., 1961. (In press)

Manuscript received 7/5/60

Revised manuscript received 12/2/61
