
Proc. Natl. Acad. Sci. USA Vol. 81, pp. 1284-1286, February 1984 Statistics

Optimal sequential sampling from two populations (two-armed bandit/adaptive allocation)

T. L. LAI AND HERBERT ROBBINS

Department of Statistics, Columbia University, New York, NY 10027

Contributed by Herbert Robbins, October 31, 1983

ABSTRACT  Given two statistical populations with unknown means, we consider the problem of sampling $x_1, x_2, \ldots$ sequentially from these populations so as to achieve the greatest possible expected value of the sum $S_n = x_1 + \cdots + x_n$. In particular, for normal populations, we obtain the optimal rule and study its properties when the average of the two population means is assumed known, and we exhibit an asymptotically optimal rule without assuming any prior knowledge about the population means.

Introduction

Let A and B denote two statistical populations with means $\mu_A \neq \mu_B$. How should we sample $x_1, x_2, \ldots$ sequentially from these populations if our objective is to achieve the greatest possible expected value of the sum $S_n = x_1 + \cdots + x_n$ as $n \to \infty$? At each stage the choice of A or B is allowed to depend on the previous observations. Let $f$ ($g$) denote the density, with respect to some measure $\nu$, of the population with the larger (smaller) mean, and let $T_n(f)$ ($T_n(g)$) denote the number of observations taken from this population through stage $n$, so that $T_n(f) + T_n(g) = n$. Since

$$ES_n = \Big(\int x f(x)\, d\nu(x)\Big) ET_n(f) + \Big(\int x g(x)\, d\nu(x)\Big) ET_n(g) = n \max(\mu_A, \mu_B) - |\mu_A - \mu_B|\, ET_n(g),$$

the problem of maximizing $ES_n$ is equivalent to that of minimizing the expected sample size $ET_n(g)$ from the inferior population.

The Optimal Sampling Rule When f and g Are Known

Suppose that we know $f$ and $g$. Let $\phi_A$ and $\phi_B$ denote the density functions of the populations A and B, respectively. At each stage it seems natural to test the hypothesis $H_0: (\phi_A, \phi_B) = (f, g)$ versus the alternative $H_1: (\phi_A, \phi_B) = (g, f)$ on the basis of the previous observations, and to sample from A if $H_0$ is (tentatively) accepted and to sample from B otherwise. This suggests the following rule, denoted by $\rho^*$. Let $y_1, y_2, \ldots$ denote successive observations from A, and let $z_1, z_2, \ldots$ denote successive observations from B. At the first stage choose A or B with probability 1/2 each. Let

$$L_n = \left(\prod_{i=1}^{T(n,A)} f(y_i) \prod_{j=1}^{T(n,B)} g(z_j)\right) \Big/ \left(\prod_{i=1}^{T(n,A)} g(y_i) \prod_{j=1}^{T(n,B)} f(z_j)\right),$$

where $T(n,A)$ and $T(n,B)$ denote the number of observations through stage $n$ from A and B, respectively, and sample at stage $n + 1$ from A or B according as $L_n > 1$ or $L_n < 1$, choosing A or B with probability 1/2 when $L_n = 1$. Note that $L_n$ is the usual likelihood ratio statistic in favor of $H_0$ (versus $H_1$).
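As a concrete illustration, the following is a minimal simulation sketch of $\rho^*$ for a known pair $(f, g)$. The function names and the normal example at the end are ours, not part of the original; only the decision rule itself is from the text.

```python
# A minimal simulation sketch of rho*, assuming known densities f and g.
import math
import random

def rho_star(sample_A, sample_B, f, g, n_stages, seed=0):
    """Sample from A if L_n > 1, from B if L_n < 1, a fair coin if L_n = 1."""
    rng = random.Random(seed)
    log_L = 0.0                          # log L_n, the likelihood ratio for H0
    counts = {"A": 0, "B": 0}
    for _ in range(n_stages):
        if log_L > 0:
            arm = "A"
        elif log_L < 0:
            arm = "B"
        else:                            # L_n = 1: choose with probability 1/2
            arm = rng.choice(["A", "B"])
        if arm == "A":
            y = sample_A()
            log_L += math.log(f(y) / g(y))   # A-observations contribute f(y)/g(y)
        else:
            z = sample_B()
            log_L += math.log(g(z) / f(z))   # B-observations contribute g(z)/f(z)
        counts[arm] += 1
    return counts

# Normal example, unit variance, means +/- 0.5 (so H0 holds, with A superior)
delta = 0.5
f = lambda x: math.exp(-(x - delta) ** 2 / 2) / math.sqrt(2 * math.pi)
g = lambda x: math.exp(-(x + delta) ** 2 / 2) / math.sqrt(2 * math.pi)
rng = random.Random(1)
print(rho_star(lambda: rng.gauss(delta, 1), lambda: rng.gauss(-delta, 1), f, g, 1000))
```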

The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. §1734 solely to indicate this fact.

For $\rho^*$, $E_0 T_n(g) = E_1 T_n(g)$ by symmetry, and as shown by Feldman (1), $E_i T_n(g)$ is bounded in $n$. Feldman (1) also showed that $\rho^*$ is optimal, in the sense of minimizing (over the class of all allocation rules)

$$E_0 T_n(g) + E_1 T_n(g)$$

for every $n < \infty$. Hence, $\rho^*$ is Bayes with respect to the prior distribution of equal probabilities for $H_0$ and $H_1$. We now compute $E_i T_\infty(g)$ for this optimal rule in two cases of interest.

Example 1. Suppose that $f$ and $g$ are normal densities with unit variance and means $\theta_1 > \theta_2$, and let $\theta = (\theta_1 + \theta_2)/2$. Then $\log L_n = (\theta_1 - \theta_2) W_n$, where

$$W_n = \sum_{i=1}^{T(n,A)} (y_i - \theta) + \sum_{j=1}^{T(n,B)} (\theta - z_j). \quad [1]$$

The rule $\rho^*$, therefore, samples at stage $n + 1$ from A or B according as $W_n > 0$ or $W_n < 0$. Under $H_0$, $y_1 - \theta, y_2 - \theta, \ldots, \theta - z_1, \theta - z_2, \ldots$ are i.i.d. (independent, identically distributed) normal random variables with unit variance and mean $\delta = (\theta_1 - \theta_2)/2$, and therefore,

$$E_0 T_n(g) = \tfrac{1}{2} + \sum_{i=1}^{n-1} P_0\{W_i < 0\} = \tfrac{1}{2} + \sum_{i=1}^{n-1} \Phi(-\delta\sqrt{i}),$$

where $\Phi$ denotes the standard normal distribution function. Hence, for $\rho^*$,

$$E_0 T_\infty(g) = E_1 T_\infty(g) = h(\delta), \quad \text{where } h(\delta) = \tfrac{1}{2} + \sum_{i=1}^{\infty} \Phi(-\delta\sqrt{i}).$$

Clearly, $h(\delta) \to 1/2$ as $\delta \to \infty$, while as $\delta \to 0$,

$$h(\delta) \sim \int_0^\infty \Phi(-\delta\sqrt{t})\, dt = 2\delta^{-2} \int_0^\infty u\, \Phi(-u)\, du = \tfrac{1}{2}\delta^{-2}. \quad [2]$$

Some numerical values of $h(\delta)$ are shown in Table 1.

Table 1. Numerical values of $h(\delta)$

$\delta$              3      2      1      0.7    0.5    0.3    0.2    0.1    0.05
$h(\delta)$           0.50   0.52   0.83   1.32   2.28   5.82   12.8   50.2   200.2
$\delta^2 h(\delta)$  4.51   2.09   0.83   0.65   0.57   0.52   0.51   0.502  0.5006
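Table 1 can be checked directly from the series for $h(\delta)$. A short sketch (the truncation tolerance is ours, far beyond the table's precision):

```python
# Recompute Table 1 from h(delta) = 1/2 + sum_{i>=1} Phi(-delta*sqrt(i)).
from math import erf, sqrt

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def h(delta, tol=1e-12):
    s, i = 0.5, 1
    while True:
        term = Phi(-delta * sqrt(i))
        s += term
        if term < tol:               # Gaussian tail: safe to stop here
            return s
        i += 1

for d in (3, 2, 1, 0.7, 0.5, 0.3, 0.2, 0.1, 0.05):
    print(f"delta={d:5}:  h={h(d):8.4f}   delta^2*h={d * d * h(d):7.4f}")
```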


Example 2. Suppose that $f$ and $g$ are Bernoulli densities (with respect to counting measure) such that

$$f(1) = g(0) = p, \qquad f(0) = g(1) = q,$$

where $1/2 < p < 1$ and $q = 1 - p$. Then $\log L_n = (\log p/q) W_n$, where

$$W_n = \sum_{i=1}^{T(n,A)} (2y_i - 1) + \sum_{j=1}^{T(n,B)} (1 - 2z_j). \quad [3]$$

Under $H_1$, $2y_1 - 1, 2y_2 - 1, \ldots, 1 - 2z_1, 1 - 2z_2, \ldots$ are i.i.d. random variables assuming the values $1$ (with probability $q$) and $-1$ (with probability $p$). For $k = 0, \pm 1, \ldots$, let $N(k)$ denote the number of $n$ such that $W_n = k$. The renewal measure of the simple random walk $\{W_n\}$ is given by

$$1 + E_1 N(0) = 1/(p - q) = E_1 N(k) \ \text{ for } k = -1, -2, \ldots; \qquad E_1 N(k) = (q/p)^k (1 + E_1 N(0)) \ \text{ for } k = 1, 2, \ldots, \quad [4]$$

noting that (cf. ref. 2)

$$P_1\{W_n = k \text{ for some } n \geq 1\} = 1 \ \text{ if } k < 0, \qquad = (q/p)^k \ \text{ if } k > 0.$$

Since at stage $n + 1$ the rule $\rho^*$ samples from A or B according as $W_n > 0$ or $W_n < 0$ and chooses A or B with probability 1/2 each if $W_n = 0$, it follows from Eq. 4 that

$$E_1 T_\infty(g) = \tfrac{1}{2}(1 + E_1 N(0)) + \sum_{k=1}^{\infty} E_1 N(k) = \frac{1}{2(p - q)^2}. \quad [5]$$
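As a sanity check on Eq. 5, one can simulate $\rho^*$ directly in the Bernoulli case. The parameters, horizon, and replication count below are arbitrary choices of ours:

```python
# Monte Carlo sanity check of Eq. 5 under H0 (Bernoulli pair of Example 2).
import random

def bernoulli_rho_star_T_B(p, n, rng):
    """One run of rho* for Example 2 under H0; returns the number of B-draws."""
    q = 1.0 - p
    W = 0                                      # log L_n = (log p/q) * W_n
    t_B = 0
    for _ in range(n):
        if W > 0 or (W == 0 and rng.random() < 0.5):
            y = 1 if rng.random() < p else 0   # draw from A, density f
            W += 2 * y - 1
        else:
            z = 1 if rng.random() < q else 0   # draw from B, density g
            W += 1 - 2 * z
            t_B += 1
    return t_B

rng = random.Random(2)
p, n, reps = 0.7, 2000, 4000
est = sum(bernoulli_rho_star_T_B(p, n, rng) for _ in range(reps)) / reps
print(est, 1.0 / (2.0 * (p - (1 - p)) ** 2))   # estimate vs. exact value 3.125
```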

Therefore,

$$E_0 T_\infty(g) = E_1 T_\infty(g) = \tfrac{1}{2}\delta^{-2}, \quad \text{where } \delta = p - q,$$

in agreement with the asymptotic result (see Eq. 2).

When $L_n = 1$, the posterior probability in favor of $H_0$ is the same as that in favor of $H_1$, so the Bayes rule can choose A or B arbitrarily at stage $n + 1$ (cf. ref. 1). In particular, the following modification of $\rho^*$ is also Bayes. At the first stage, choose A or B with probability 1/2. If A is chosen, continue sampling from A until stage $n(1) = \inf\{n: L_n < 1\}$, and sample from B at stage $n(1) + 1$. Continue sampling from B until stage $n(2) = \inf\{n > n(1): L_n \geq 1\}$, and then switch back to A until stage $n(3) = \inf\{n > n(2): L_n < 1\}$, etc. If B is chosen at the first stage, continue sampling from B until stage $n'(1) = \inf\{n: L_n > 1\}$, and then sample from A until stage $n'(2) = \inf\{n > n'(1): L_n \leq 1\}$, etc. This rule will be denoted by $\hat{\rho}$. By symmetry, $E_0 T_n(g) = E_1 T_n(g)$ for both $\hat{\rho}$ and $\rho^*$. Since $\hat{\rho}$ and $\rho^*$ are Bayes rules, they have the same value of $E_i T_n(g)$.

In Examples 1 and 2, under both $H_0$ and $H_1$, $\{\log L_n\}$ is a random walk whose distribution does not depend on the sampling rule, and this enabled us to obtain explicit formulas for $E_i T_\infty(g)$. In general, in analogy with Eqs. 1 and 3, we have

$$\log L_n = \sum_{i=1}^{T(n,A)} h(y_i) - \sum_{j=1}^{T(n,B)} h(z_j), \quad [6]$$

where $h = \log(f/g)$. However, $h(y_i)$ and $-h(z_j)$ need no longer have the same distribution under $H_0$ or $H_1$. An upper bound for $E_0 T_\infty(g) = E_1 T_\infty(g)$ under the optimal rule $\hat{\rho}$ is obtained by considering a simpler rule $\tilde{\rho}$, as described below.


At the first stage, choose A or B with probability 1/2 each. If A is chosen, continue sampling from A until stage $\tau_1 = \inf\{n: \sum_{i=1}^n h(y_i) < 0\}$ ($\inf \emptyset = \infty$), and then sample from B at stage $\tau_1 + 1$. Continue sampling from B up to $\sigma_1$ observations, where $\sigma_1 = \inf\{n: \sum_{i=1}^n h(z_i) < 0\}$, and switch to A at stage $\tau_1 + \sigma_1 + 1$. Proceeding inductively in this way, the rule $\tilde{\rho}$ is defined by "switching times" at stages $\tau_1 + 1$, $\tau_1 + \sigma_1 + 1$, $\tau_1 + \sigma_1 + \tau_2 + 1$, ..., where

$$\tau_{j+1} = \inf\Big\{n: \sum_{i=\tau_j+1}^{n} h(y_i) < 0\Big\}, \qquad \sigma_{j+1} = \inf\Big\{n: \sum_{i=\sigma_j+1}^{n} h(z_i) < 0\Big\}.$$

If B is chosen at the first stage, then the switching times of $\tilde{\rho}$ occur at stages $\sigma_1 + 1$, $\sigma_1 + \tau_1 + 1$, $\sigma_1 + \tau_1 + \sigma_2 + 1$, ....
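The rule $\tilde{\rho}$ is easy to simulate. The sketch below uses the normal pair of Example 1, for which $h(x) = 2\delta x$; $\delta$, the truncation horizon, and the replication count are arbitrary choices of ours. Its estimate of $E_0 T_\infty(g)$ should sit at or slightly above the optimal value $h(0.5) \approx 2.28$ from Table 1:

```python
# Simulation sketch of rho~ for the normal pair of Example 1 under H0.
import random

def rho_tilde_T_B(delta, n, rng):
    """One truncated run of rho~; returns the number of draws from B (density g)."""
    on_A = rng.random() < 0.5        # first stage: fair coin
    s = 0.0                          # running sum of h since the last switch
    t_B = 0
    for _ in range(n):
        if on_A:
            s += 2 * delta * rng.gauss(delta, 1.0)    # h(y) = 2*delta*y, y ~ N(delta, 1)
        else:
            t_B += 1
            s += 2 * delta * rng.gauss(-delta, 1.0)   # h(z) = 2*delta*z, z ~ N(-delta, 1)
        if s < 0:                    # partial sum went negative: switch at the next stage
            on_A = not on_A
            s = 0.0
    return t_B

rng = random.Random(3)
reps = 2000
print(sum(rho_tilde_T_B(0.5, 4000, rng) for _ in range(reps)) / reps)  # compare to 2.28
```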

For the Bernoulli distributions of Example 2, since $h(y_i)$ and $h(z_j)$ assume the values $\pm \log(p/q)$, we have $\log L_{n(1)} = \log L_{n(3)} = \cdots = -\log(p/q)$, $\log L_{n(2)} = \log L_{n(4)} = \cdots = 0$, $\log L_{n'(1)} = \log L_{n'(3)} = \cdots = \log(p/q)$, and $\log L_{n'(2)} = \log L_{n'(4)} = \cdots = 0$. Consequently, the optimal rule $\hat{\rho}$ in this case is the same as the rule $\tilde{\rho}$. In general, the rule $\tilde{\rho}$ ignores the effect of "overshoots" in the Bayes rule $\hat{\rho}$ at stages $n(1), n(2), \ldots, n'(1), n'(2), \ldots$. By analyzing these overshoots, we can modify the argument to obtain a lower bound for $E_0 T_\infty(g) = E_1 T_\infty(g)$ under the optimal rule $\hat{\rho}$. This is the content of Theorem 1.

THEOREM 1. Let $u, u_1, u_2, \ldots$ be i.i.d. random variables with density $f$, and let $v, v_1, v_2, \ldots$ be i.i.d. random variables with density $g$. Let $h = \log(f/g)$. Then $Eh(u) > 0$ and $Eh(v) < 0$. Let $U_m = \sum_{i=1}^m h(u_i)$ and $V_m = \sum_{i=1}^m h(v_i)$. Define

$$e_v = E(\inf\{m \geq 1: V_m < 0\}),$$
$$p_u(a) = P\{U_m \leq a \text{ for some } m \geq 1\}, \qquad p_u = p_u(0),$$
$$\pi_v(a) = \inf_{t<0} P\{a + t \leq h(v) < t \mid h(v) < t\}.$$

Then for the optimal rule $\rho^*$,

$$\frac{1 + p_u}{2(1 - p_u)}\, e_v \;\geq\; E_0 T_\infty(g) = E_1 T_\infty(g) \;\geq\; \frac{1 + p_u(a)\pi_v(a)}{2(1 - p_u(a)\pi_v(a))}\, e_v$$

for every $a < 0$.

Proof: To derive the upper bound, note that for the suboptimal rule $\tilde{\rho}$,

$$E_0 T_\infty(g) = \tfrac{1}{2}\Big(E_0\sigma_1 + \int_{\{\sigma_1+\tau_1<\infty\}} \sigma_2\, dP_0 + \cdots\Big) + \tfrac{1}{2}\Big(\int_{\{\tau_1<\infty\}} \sigma_1\, dP_0 + \cdots\Big) \leq \tfrac{1}{2} e_v + (p_u + p_u^2 + \cdots)\, e_v = \frac{1 + p_u}{2(1 - p_u)}\, e_v.$$

Since $E_0 T_\infty(g) = E_1 T_\infty(g)$ for both the rule $\tilde{\rho}$ and the Bayes rule $\hat{\rho}$, the upper bound is established.

To derive the lower bound, suppose that $H_0$ is true. Then $u_i = y_i$ and $v_i = z_i$. Define $\sigma_1 = \inf\{n: V_n < 0\}$ as before, and let $R_1 = V_{\sigma_1}\ (<0)$. Define $\tau_1 = \inf\{n: U_n - R_1 < 0\}$ and let $r_1 = U_{\tau_1} - R_1\ (\leq 0)$. Letting $V_{m,n} = \sum_{i=m+1}^{m+n} h(v_i)$ and $U_{m,n} = \sum_{i=m+1}^{m+n} h(u_i)$, define $\sigma_2 = \inf\{n: V_{\sigma_1,n} - r_1 < 0\}$, $R_2 = V_{\sigma_1,\sigma_2} - r_1\ (<0)$, $\tau_2 = \inf\{n: U_{\tau_1,n} - R_2 < 0\}$, etc.


We note that for any $a < 0$,

$$E_0[T_\infty(g) \mid \text{B is chosen at the first stage}] = E_0\sigma_1 + \int_{\{\sigma_1+\tau_1<\infty\}} \sigma_2\, dP_0 + \int_{\{\sigma_1+\tau_1+\sigma_2+\tau_2<\infty\}} \sigma_3\, dP_0 + \cdots$$
$$\geq e_v + P_0\{R_1 \geq a\}\, P_0\{U_n \leq a \text{ for some } n\}\, e_v + P_0\{R_1 \geq a\}\, P_0\{R_2 \geq a\}\, P_0\{U_n \leq a \text{ for some } n\}^2\, e_v + \cdots$$
$$\geq e_v + \pi_v(a) p_u(a)\, e_v + \pi_v^2(a) p_u^2(a)\, e_v + \cdots.$$

A similar argument also shows that for any $a < 0$,

$$E_0[T_\infty(g) \mid \text{A is chosen at the first stage}] \geq p_u e_v + p_u \pi_v(a) p_u(a)\, e_v + p_u \pi_v^2(a) p_u^2(a)\, e_v + \cdots.$$

Hence, averaging the two conditional bounds and using $p_u \geq \pi_v(a) p_u(a)$, we obtain $E_0 T_\infty(g) \geq \tfrac{1}{2} e_v + \{\pi_v(a)p_u(a) + \pi_v^2(a)p_u^2(a) + \cdots\}\, e_v$, and the lower bound is established.

For the Bernoulli distributions of Example 2, $e_v = (p - q)^{-1}$ and $p_u = q/p$ (cf. ref. 2), and therefore the upper bound of Theorem 1 reduces to

$$\frac{1 + p_u}{2(1 - p_u)}\, e_v = \frac{1}{2(p - q)^2},$$

which is in agreement with Eq. 5.
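This reduction is a few lines of arithmetic; a quick check (the value $p = 0.7$ is an arbitrary choice of ours):

```python
# Direct arithmetic check that the upper bound reduces to Eq. 5.
p = 0.7
q = 1.0 - p
e_v = 1.0 / (p - q)                         # expected first-passage time below 0
p_u = q / p                                 # P{U_m <= 0 for some m >= 1}
upper = (1.0 + p_u) / (2.0 * (1.0 - p_u)) * e_v
print(upper, 1.0 / (2.0 * (p - q) ** 2))    # both print 3.125
```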

Normal Populations with Unknown Means

Suppose that A and B are normal populations with unit variance and means $\mu_A \neq \mu_B$. If $\theta = (\mu_A + \mu_B)/2$ is known, then we can use the rule $\rho^*$, which chooses A or B at stage $n + 1$ according as $W_n > 0$ or $W_n < 0$, where

$$W_n = \sum_{i=1}^{T(n,A)} (y_i - \theta) + \sum_{j=1}^{T(n,B)} (\theta - z_j).$$

When $\theta$ is unknown, we can take one observation from A and one from B at the first two stages and then estimate $\theta$ at stage $n\ (\geq 2)$ by either

$$\hat{\theta}_n = \tfrac{1}{2}\big(\bar{y}_{T(n,A)} + \bar{z}_{T(n,B)}\big)$$

(where $\bar{a}_n$ denotes the arithmetic mean of $n$ numbers $a_1, \ldots, a_n$), or

$$\theta_n^* = \Big(\sum_{i=1}^{T(n,A)} y_i + \sum_{j=1}^{T(n,B)} z_j\Big) \Big/ n.$$

If we replace $\theta$ by $\hat{\theta}_n$ in $W_n$, we get

$$\sum_{i=1}^{T(n,A)} (y_i - \hat{\theta}_n) + \sum_{j=1}^{T(n,B)} (\hat{\theta}_n - z_j) = \frac{n}{2}\,\big(\bar{y}_{T(n,A)} - \bar{z}_{T(n,B)}\big).$$

If we replace $\theta$ by $\theta_n^*$ in $W_n$, we get

$$\sum_{i=1}^{T(n,A)} (y_i - \theta_n^*) + \sum_{j=1}^{T(n,B)} (\theta_n^* - z_j) = \frac{2}{n}\, T(n,A)\, T(n,B)\,\big(\bar{y}_{T(n,A)} - \bar{z}_{T(n,B)}\big).$$

Hence, replacing $\theta$ in $W_n$ by $\hat{\theta}_n$ or $\theta_n^*$ in the rule $\rho^*$ leads to the "sample from the leader" rule: at stage $n + 1$, sample from A or B according as $\bar{y}_{T(n,A)} > \bar{z}_{T(n,B)}$ or $\bar{y}_{T(n,A)} < \bar{z}_{T(n,B)}$.

The difficulty with the "sample from the leader" rule is that we may have sampled too little from the apparently inferior population to get a reliable estimate of its mean, and we may thereby miss the actually superior population. In fact, the expected sample size from the inferior population is of order $n$: for $\mu_B > \mu_A$,

$$E_{\mu_A,\mu_B} T(n,A) \geq (n - 1)\pi, \quad \text{where } \pi = P_{\mu_A,\mu_B}\{\bar{y}_i > \bar{z}_i \text{ for all } i \geq 1\} > 0.$$
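A small simulation makes this order-$n$ deficiency visible; everything below (the means, horizons, and replication count) is illustrative:

```python
# "Sample from the leader" keeps a non-vanishing fraction of draws on the
# inferior arm; normal example with mu_A = 0 < mu_B = 1.
import random

def leader_T_inferior(mu_A, mu_B, n, rng):
    """One draw from each arm first, then always follow the larger sample mean;
    returns the number of draws taken from the inferior arm A."""
    sums = {"A": rng.gauss(mu_A, 1.0), "B": rng.gauss(mu_B, 1.0)}
    cnts = {"A": 1, "B": 1}
    for _ in range(n - 2):
        arm = "A" if sums["A"] / cnts["A"] >= sums["B"] / cnts["B"] else "B"
        sums[arm] += rng.gauss(mu_A if arm == "A" else mu_B, 1.0)
        cnts[arm] += 1
    return cnts["A"]

rng = random.Random(4)
for n in (500, 2000, 8000):
    frac = sum(leader_T_inferior(0.0, 1.0, n, rng) for _ in range(500)) / (500 * n)
    print(n, round(frac, 3))    # the fraction does not tend to 0 as n grows
```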

Some insight into how much we need to sample from the apparently inferior population is provided by studying the finite-horizon problem, where a preassigned total of $N$ observations are to be taken, and we know one of the means, say $\mu_B = \mu^*$, but not the other. In this case, since sampling from B does not add any information about $\mu_B$, we need only sample for information from A. However, because our objective is to maximize $ES_N$, we should stop sampling from A once we are reasonably confident that $\mu_A < \mu^*$ and then switch to B, whereupon no more information is gained about $\mu_A$ or $\mu_B$. Hence, we can reduce the allocation problem to a stopping problem involving A alone, restricting ourselves to rules that sample from A until some stopping time $T_N\ (\leq N)$, whereupon $H: \mu_A \geq \mu^*$ is rejected in favor of $K: \mu_A < \mu^*$ and the remaining $N - T_N$ observations are taken from B (cf. ref. 3). Because $\{\bar{y}_i\}$ is a sequence of sufficient statistics, we can also restrict ourselves to stopping rules of the form

$$T_N = \inf\{i < N: \bar{y}_i - \mu^* \leq -a_{Ni}\} \quad (\inf \emptyset = N), \quad [7]$$

where $a_{Ni}$ ($i = 1, \ldots, N$) are positive constants.
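In code, the stopping rule of Eq. 7 is a one-pass scan over the running mean. The particular boundary $a(N, i)$ below is only a placeholder of ours, since Eq. 7 leaves the constants unspecified:

```python
# Sketch of the stopping rule T_N of Eq. 7 (the boundary a(N, i) is a placeholder).
import math
import random

def T_N(mu_star, draw_A, a, N):
    """Stop at the first i < N with ybar_i - mu_star <= -a(N, i); else return N."""
    total = 0.0
    for i in range(1, N):
        total += draw_A()
        if total / i - mu_star <= -a(N, i):
            return i
    return N

rng = random.Random(5)
stop = T_N(mu_star=0.5,
           draw_A=lambda: rng.gauss(0.0, 1.0),               # true mu_A = 0 < mu*
           a=lambda N, i: math.sqrt(2.0 * math.log(N) / i),  # placeholder boundary
           N=1000)
print(stop)   # typically stops long before N
```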

We now consider the case where both $\mu_A$ and $\mu_B$ are unknown and there is no preassigned horizon $N$. Start with one observation from A and one from B. At the conclusion of stage $n \geq 2$, if $T(n,B) > T(n,A)$, then $T(n,B) > n/2$, so the amount of information from B is relatively adequate, while there is more need to sample for information from A. Replacing $N$ by $n$ and $\mu^*$ by $\bar{z}_{T(n,B)}$ in Eq. 7 suggests sampling at stage $n + 1$ from A if

$$\bar{y}_{T(n,A)} \geq \bar{z}_{T(n,B)} - a_{n,T(n,A)},$$

and sampling from B otherwise. Likewise, in the case $T(n,A) > T(n,B)$, we sample from B at stage $n + 1$ if

$$\bar{z}_{T(n,B)} \geq \bar{y}_{T(n,A)} - a_{n,T(n,B)},$$

and we sample from A otherwise. Finally, in the case $T(n,A) = T(n,B) = n/2$, we sample from B only if $\bar{z}_{T(n,B)} \geq \bar{y}_{T(n,A)}$. This sampling rule will be denoted by $\rho_0$.

If the constants $a_{ni}$ are such that, for every fixed $i$, $a_{ni}$ is nondecreasing in $n \geq i$, and if there exist $\varepsilon_n \to 0$ for which

$$\big| a_{ni} - (2(\log n)/i)^{1/2} \big| \leq \varepsilon_n (\log n)^{1/2} / i^{1/2} \quad \text{for all } i < n, \quad [8]$$

then it can be shown for the rule $\rho_0$ that, as $n \to \infty$,

$$E_{\mu_A,\mu_B} T(n,A) \sim (2 \log n)/(\mu_A - \mu_B)^2 \quad \text{if } \mu_B > \mu_A,$$
$$E_{\mu_A,\mu_B} T(n,B) \sim (2 \log n)/(\mu_A - \mu_B)^2 \quad \text{if } \mu_A > \mu_B. \quad [9]$$

Moreover, the rule $\rho_0$ is asymptotically optimal in the sense that $\liminf E_{\mu_A,\mu_B} T_n(g)/\log n \geq 2(\mu_A - \mu_B)^{-2}$ for every rule $\rho$ such that $E_{\mu_A,\mu_B} T_n(g) = O(\log n)$ at all parameter values $\mu_A, \mu_B$. The proof of these assertions is given in ref. 4.
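To make the construction concrete, here is a sketch of $\rho_0$ taking $a_{ni} = (2(\log n)/i)^{1/2}$, one choice consistent with Eq. 8 as reconstructed above; the means, horizon, and replication count of the demonstration are ours:

```python
# Sketch of the adaptive rule rho0 with a_{ni} = sqrt(2*log(n)/i).
import math
import random

def rho0_counts(mu_A, mu_B, n_total, rng):
    """Normal populations with unit variance; returns draw counts per arm."""
    sums = {"A": rng.gauss(mu_A, 1.0), "B": rng.gauss(mu_B, 1.0)}  # stages 1 and 2
    cnts = {"A": 1, "B": 1}
    for n in range(2, n_total):
        mean_A, mean_B = sums["A"] / cnts["A"], sums["B"] / cnts["B"]
        if cnts["B"] > cnts["A"]:       # B well sampled: take A unless clearly worse
            a = math.sqrt(2.0 * math.log(n) / cnts["A"])
            arm = "A" if mean_A >= mean_B - a else "B"
        elif cnts["A"] > cnts["B"]:     # A well sampled: take B unless clearly worse
            a = math.sqrt(2.0 * math.log(n) / cnts["B"])
            arm = "B" if mean_B >= mean_A - a else "A"
        else:                           # equal sample sizes: follow the leader
            arm = "B" if mean_B >= mean_A else "A"
        sums[arm] += rng.gauss(mu_A if arm == "A" else mu_B, 1.0)
        cnts[arm] += 1
    return cnts

rng = random.Random(6)
n = 10_000
avg_T_A = sum(rho0_counts(0.0, 1.0, n, rng)["A"] for _ in range(200)) / 200
print(avg_T_A, 2.0 * math.log(n) / (0.0 - 1.0) ** 2)   # compare with Eq. 9 (~18.4)
```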

This research was supported by the National Science Foundation and the National Institutes of Health.

1. Feldman, D. (1962) Ann. Math. Stat. 33, 847-856.
2. Feller, W. (1966) An Introduction to Probability Theory and Its Applications (Wiley, New York), Vol. 1, 2nd Ed.
3. Bradt, R. N., Johnson, S. M. & Karlin, S. (1956) Ann. Math. Stat. 27, 1060-1070.
4. Lai, T. L. & Robbins, H. (1984) Adv. Appl. Math., in press.
