
Journal of Statistical Planning and Inference 139 (2009) 425 -- 440


Convergence properties of sequential Bayesian D-optimal designs

Anindya Roy a,*, Subhashis Ghosal b, William F. Rosenberger c

a Department of Mathematics and Statistics, University of Maryland, Baltimore Co., Baltimore, MD 21250, USA
b Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA
c Department of Statistics, George Mason University, Fairfax, VA 22030, USA

ARTICLE INFO

Article history:
Received 13 February 2007
Received in revised form 11 April 2008
Accepted 25 April 2008
Available online 8 May 2008

Keywords:
Adaptive designs
Asymptotic normality
Discrete optimal design
Dose--response
Posterior convergence

ABSTRACT

We establish convergence properties of sequential Bayesian optimal designs. In particular, for sequential D-optimality under a general nonlinear location-scale model for binary experiments, we establish posterior consistency, consistency of the design measure, and the asymptotic normality of the posterior following the design. We illustrate our results in the context of a particular application in the design of phase I clinical trials, namely a sequential design of Haines et al. [2003. Bayesian optimal designs for phase I clinical trials. Biometrics 59, 591--600] that incorporates an ethical constraint on overdosing.

© 2008 Elsevier B.V. All rights reserved.

1. Introduction

1.1. Statement of the problem

We motivate our problem with an example from clinical trials. Let $\mathcal{X} = \{d_1, d_2, \dots, d_K\}$ be a discrete space of $K$ treatments and let $X_1, X_2, \dots, X_n \in \mathcal{X}$ be the treatments assigned to a sequence of $n$ patients. Let $Y_1, Y_2, \dots, Y_n$ be patient responses. The responses are assumed to follow a location-scale model:

$$Y_i \mid X_i = d_k \sim \mathrm{Bernoulli}(F_\theta(d_k)), \qquad F_\theta(d_k) = F((d_k - \mu)/\sigma), \qquad k = 1, 2, \dots, K, \qquad (1)$$

where $F(\cdot)$ is a known distribution function with density $f(\cdot)$ with respect to Lebesgue measure, $0 < f(x) < M$ for all $x \in \mathbb{R}$ and some positive constant $M$, and the parameter $\theta = (\mu, \sigma)$ belongs to $\mathbb{R} \times \mathbb{R}^+$. A design for this model is a probability measure $\nu = (\rho_1, \rho_2, \dots, \rho_K)'$ belonging to the $K$-dimensional simplex
$$\Xi = \Big\{\nu : 0 \le \rho_k \le 1 \text{ and } \sum_{k=1}^K \rho_k = 1\Big\},$$
which puts weight $\rho_k$ on treatment $d_k$. An optimal design is one which optimizes a function of the Fisher information matrix for the parameters of the model over all possible choices of designs.

* Corresponding author.
E-mail addresses: [email protected] (A. Roy), [email protected] (S. Ghosal), [email protected] (W.F. Rosenberger).

0378-3758/$ - see front matter © 2008 Elsevier B.V. All rights reserved.
doi:10.1016/j.jspi.2008.04.025


More specifically, suppose $I(\theta; d_k)$ is the information at a single dose level $d_k \in \mathcal{X}$. For model (1) the information (cf. Silvey, 1980) is given by
$$I(\theta; d_k) = a_k(\theta)\begin{pmatrix} 1 & z_k \\ z_k & z_k^2 \end{pmatrix}, \qquad (2)$$
where $a_k(\theta) = f^2(z_k)/[\sigma^2 F(z_k)(1 - F(z_k))]$, $z_k = (d_k - \mu)/\sigma$ and $\theta = (\mu, \sigma)'$. The information matrix is given by
$$M_\nu(\theta) = \sum_{k=1}^K \rho_k I(\theta; d_k).$$

An optimal design will then maximize an appropriately chosen concave function of the information matrix $M_\nu(\theta)$. However, for such nonlinear problems one needs knowledge of the unknown parameter $\theta$. One solution is to substitute a best guess for $\theta$ in the optimization criterion. This leads to a locally optimal design. Another choice is to optimize an expectation of the criterion function with respect to some prior probability distribution of the parameters. This leads to a Bayesian optimal design. Let $\pi(\theta)$ be a compactly supported prior probability distribution on $\mathbb{R} \times \mathbb{R}^+$, i.e., $\pi(\Theta) = 1$, where $\Theta$ is a compact subset of $\mathbb{R} \times \mathbb{R}^+$. Let the true value of the parameter, $\theta_0$, be an interior point of $\Theta$. The criterion we will consider in this paper is D-optimality, where the design is found as a solution to
$$\nu^* = \arg\max_{\nu \in \Xi} \int [\log \det(M_\nu(\theta))]\, \pi(\theta)\, d\theta.$$
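As an illustration of the Bayesian D-criterion (not code from the paper), the following sketch evaluates $\int [\log\det(M_\nu(\theta))]\,\pi(\theta)\,d\theta$ by Monte Carlo for a logistic choice of $F$; the dose grid, the box-uniform prior, and all function names are assumptions made for the example.

```python
import numpy as np

def info_matrix(theta, doses, weights):
    """Information matrix M_nu(theta) for the binary location-scale
    model, built from the 2x2 blocks in Eq. (2), with logistic F."""
    mu, sigma = theta
    z = (np.asarray(doses) - mu) / sigma
    F = 1.0 / (1.0 + np.exp(-z))            # logistic cdf at z
    a = (F * (1.0 - F)) / sigma**2          # f^2/(sigma^2 F(1-F)), logistic case
    M = np.zeros((2, 2))
    for wk, zk, ak in zip(weights, z, a):
        M += wk * ak * np.array([[1.0, zk], [zk, zk**2]])
    return M

def bayes_D_criterion(weights, doses, prior_draws):
    """Monte Carlo estimate of E_pi[ log det M_nu(theta) ]."""
    vals = [np.linalg.slogdet(info_matrix(th, doses, weights))[1]
            for th in prior_draws]
    return float(np.mean(vals))

rng = np.random.default_rng(0)
doses = [-2.0, -1.0, 0.0, 1.0, 2.0]
# illustrative compactly supported prior: uniform on a box for (mu, sigma)
prior = np.column_stack([rng.uniform(-0.5, 0.5, 200),
                         rng.uniform(0.8, 1.2, 200)])
uniform = np.full(5, 0.2)
two_point = np.array([0.5, 0.0, 0.0, 0.0, 0.5])
print(bayes_D_criterion(uniform, doses, prior),
      bayes_D_criterion(two_point, doses, prior))
```

Comparing a uniform allocation with a symmetric 2-point allocation in this way mimics how candidate designs would be ranked under the criterion.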

Now consider a sequential experiment where patients arrive sequentially in time and each is assigned one of the $K$ treatments, conditional on all previous treatment assignments and responses. Let
$$\mathcal{D}_r = \{X_1, X_2, \dots, X_r, Y_1, Y_2, \dots, Y_r\}$$
be the $\sigma$-algebra generated by the first $r$ treatments and responses. Also let $N(r) = (N_1(r), N_2(r), \dots, N_K(r))'$, where $N_k(r)$ is the number of patients assigned to dose level $d_k$ after $r$ treatments have been allocated, and let $\rho_{r,k} = N_k(r)/r$ be the observed allocation proportion for dose level $d_k$. Then after $r$ patients we can define
$$M_r(\theta) = \frac{1}{r} \sum_{k=1}^K N_k(r)\, I(\theta; d_k). \qquad (3)$$

Silvey (1980, Chapter 7) expressed unease about what $M_r(\theta)$ actually represents. It is technically not the Fisher information because of the dependence structure of the sequential procedure. We consider that treatments are allocated through a sequential Bayesian D-optimal design by maximizing the posterior expectation of the log determinant of the information at each stage of allocation. Let $\pi(\theta \mid \mathcal{D}_r)$ denote the posterior distribution at stage $r$ for $r = 1, 2, \dots, n$. The $(r+1)$th patient is then assigned to the dose level $X_{r+1}$, where
$$X_{r+1} = \arg\max_{d \in \mathcal{X}} \int [\log \det(r M_r(\theta) + I(\theta; d))]\, \pi(\theta \mid \mathcal{D}_r)\, d\theta. \qquad (4)$$
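A single step of the allocation rule (4) can be sketched as follows; the posterior integral is replaced by an average over hypothetical posterior draws, and the logistic link, dose grid, and helper names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def single_dose_info(theta, d):
    """2x2 information block I(theta; d) from Eq. (2), logistic F."""
    mu, sigma = theta
    z = (d - mu) / sigma
    F = 1.0 / (1.0 + np.exp(-z))
    a = (F * (1.0 - F)) / sigma**2   # f^2/(sigma^2 F(1-F)) for logistic F
    return a * np.array([[1.0, z], [z, z**2]])

def next_dose(doses, counts, posterior_draws):
    """Rule (4): pick the dose d maximizing the posterior mean of
    log det( r*M_r(theta) + I(theta; d) )."""
    def score(d):
        vals = []
        for th in posterior_draws:
            rM = sum(n * single_dose_info(th, dk)
                     for n, dk in zip(counts, doses))   # equals r * M_r(theta)
            vals.append(np.linalg.slogdet(rM + single_dose_info(th, d))[1])
        return np.mean(vals)
    return max(doses, key=score)

rng = np.random.default_rng(1)
doses = [-2.0, -1.0, 0.0, 1.0, 2.0]
counts = [2, 1, 1, 1, 0]             # allocations among the first r = 5 patients
draws = np.column_stack([rng.normal(0.0, 0.3, 100),
                         rng.uniform(0.8, 1.2, 100)])
print(next_dose(doses, counts, draws))
```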

At the end of the experiment, we are interested in computing the Bayes estimator $\tilde{\theta}_n = E_\pi(\theta \mid \mathcal{D}_n) = \int \theta\, \pi(\theta \mid \mathcal{D}_n)\, d\theta$.

Before we pose the main questions that are answered in this paper, we introduce the following notation. Let $Q_\theta(\nu)$ denote the determinant of $M_\nu(\theta)$. Also let
$$Q(\nu) := \det(M_\nu(\theta_0)) = Q_{\theta_0}(\nu). \qquad (5)$$
Simple algebra shows that $Q(\nu) = \nu' Q \nu$, where the elements of $Q$ are defined as
$$Q = ((q_{ij})), \qquad q_{ij} = 0.5\,\sigma^{-2}(d_i - d_j)^2\, a_i(\theta_0)\, a_j(\theta_0), \qquad (6)$$
and the $a_i(\theta)$ are defined following (2). In this paper, we answer the following questions:

1. Does the sequential Bayesian D-optimal design converge to the local D-optimal design at the true parameter value $\theta_0$, given by
$$\nu_0 = \arg\max_{\nu \in \Xi} \log(Q(\nu))\,? \qquad (7)$$

We refer to this problem as convergence of the design measure.


2. Are the Bayes estimators consistent and asymptotically normal, and if so, what is the correct asymptotic variance?
3. How do we characterize the limiting design (a problem which takes on additional subtlety since $\mathcal{X}$ is discrete)?

Optimal designs in nonlinear experiments have a rich literature. Earlier work includes that of Box and Hunter (1965), Fedorov (1972), Ford and Silvey (1980), Abdelbasit and Plackett (1983), Ford et al. (1985), Minkin (1987) and many others. Ford et al. (1989) provide a comprehensive review of the optimal design literature in nonlinear experiments. Also see Chaudhuri and Mykland (1993, 1995) for other references.

Sequential Bayesian optimal designs stem from earlier work by Tsutakawa (1972, 1980), Zacks (1977), Leonard (1982) and Chaloner (1989). The form of the allocation function in (4) is in Haines (1998), and Haines et al. (2003) use the form for a particular application to phase I clinical trials. The allocation function induces a response-adaptive allocation procedure, in which the response sequences are dependent random variables. Convergence and estimation must then be viewed in this context.

Sequential D-optimal designs have a long history in the non-Bayesian context, where at each stage the local optimal design is computed at the current maximum likelihood estimates of the parameters. The problem was explored by White (1975) in her doctoral thesis. Silvey (1980, p. 64) notes the difficulty of proving convergence of the sequentially constructed design to the locally optimal design measure determined at $\theta_0$. He suggests that techniques used to prove convergence of standard algorithms to sequentially compute D-optimal designs (cf. Wynn, 1970; Fedorov, 1972) could be useful in proving convergence in the sequential design context also. In the frequentist case the problem is largely solved by Wu (1985), Lai (1994), and Chaudhuri and Mykland (1993, 1995), who prove convergence of the sequentially computed maximum likelihood estimators (MLEs) or least squares estimators, give their asymptotic distribution, and show convergence of the sequentially computed information matrices under various assumptions about the regularity of the problem, the asymptotic behavior of the eigenvalues of the observed information matrix, and the sample size of the pilot design.

Our goals here are to show similar asymptotic properties of the posterior distribution of the parameter $\theta$ and of the estimators in a Bayesian context, and also to show convergence of the information matrices. Most of our assumptions are basic regularity assumptions similar to those in Wu (1985), Lai (1994) and Chaudhuri and Mykland (1993, 1995). However, to apply the results of Chaudhuri and Mykland (1993, 1995) we would have to assume that the size of the initial sample goes to infinity. This is not the case in our problem, specifically for the clinical trial example considered here. We prove our results when the pilot sample size is fixed. Due to our specific binary structure with compact parameter space we are able to show boundedness of the eigenvalues of the observed information rather than imposing conditions on them. This leads to a stronger form of convergence of the optimality criterion.

A critical component in the analysis of sequential Bayesian designs is proving posterior consistency. Hu (1997) proves posterior consistency under the product measure, which is weaker than consistency under the prior measure, and therefore cannot be extended to derive any further asymptotic properties of estimators. Schwartz (1965) studies regular posterior consistency in a general context. Her result has been extended in various ways; see Ghosh and Ramamoorthi (2003) for a detailed account. Although our case is parametric, it is nontrivial because of the dependence structure of the sequentially computed posterior distributions. Indeed, we shall use some techniques that are normally used to establish Bayesian consistency in nonparametric problems. We also show asymptotic normality of the posterior distribution and of the Bayes estimators. We now discuss our particular application to phase I clinical trials and review the literature on convergence of sequential designs.

1.2. Application to phase I clinical trials

In phase I clinical trials, patients enter sequentially in time, each is assigned one of $K$ predetermined dose levels, and the patient is observed for either a toxic or nontoxic response to the dose level. One of the goals of a phase I study is to find a "maximum tolerated dose," which in the parametric case is defined as a quantile of the dose--response curve. Rosenberger and Haines (2002) present a comprehensive review of the literature on phase I clinical trial designs. There has been some controversy in the literature as to whether accurate identification of that quantile through a stochastic approximation-type approach or efficient estimation of the quantile through an optimal design-type approach is more appropriate. The former stochastic approximation-type approach is the basis for the continual reassessment method (CRM; O'Quigley et al., 1990) and escalation with overdose control (EWOC; Babb et al., 1998). There has been some theoretical work dealing with asymptotic properties of these procedures (see Shen and O'Quigley, 1996 for the CRM; Zacks et al., 1998 for the EWOC procedure). The estimation and optimal design-type approach has been described by Whitehead and Brunier (1995) and generalized by Haines et al. (2003). In this paper, we also concentrate on estimation issues in an optimal design setup.

When the response is toxicity and experimentation is on human beings, it is unethical to assign patients to highly toxic doses. Thus the procedure in (4) may not be appropriate. Haines et al. (2003) introduced an overdosing constraint (similar to one described by Babb et al., 1998), given by
$$\Pr(\Gamma_R(\theta) > d) \le \epsilon \qquad (8)$$
for $\epsilon > 0$ small, where $\Gamma_R$ is some quantile corresponding to a probability of toxicity $R$; that is, $\Gamma_R(\theta) = \mu + \sigma F^{-1}(R)$. This leads to their suggested procedure that solves the optimization given in (4), subject to the constraint (8). Note that the constraint in (8) is computed with respect to the distribution of $\Gamma_R(\theta)$ induced by $\pi$. In order to ensure nonsingularity of the design resulting


from application of the sequential scheme under the constraint (8), we need the following assumption. Suppose the constraint $\Gamma_R(\theta_0) < d$ is satisfied by $K^*$ ($\le K$) doses, i.e.,
$$K^* = \#\{d_i \in \mathcal{X} : \Gamma_R(\theta_0) < d_i\}, \qquad (9)$$
where $\#$ denotes the cardinality of the set. Without loss of generality let these doses be $d_1, \dots, d_{K^*}$. We will assume throughout that $K^* \ge 2$. Any sensible choice of the dose levels and of the value of $\epsilon$ in the constraint (8) will satisfy the condition $K^* \ge 2$.
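In practice the constraint (8) can be checked against posterior draws by estimating $\Pr(\Gamma_R(\theta) > d)$ empirically. The sketch below assumes a logistic $F$ (so $F^{-1}(R) = \log(R/(1-R))$) and made-up draws; it is an illustration, not the authors' procedure.

```python
import numpy as np

def gamma_R(theta, R):
    """Gamma_R(theta) = mu + sigma * F^{-1}(R), with logistic F."""
    mu, sigma = theta
    return mu + sigma * np.log(R / (1.0 - R))   # logistic quantile

def overdose_prob(d, draws, R):
    """Monte Carlo estimate of Pr(Gamma_R(theta) > d) under the draws."""
    g = np.array([gamma_R(th, R) for th in draws])
    return float(np.mean(g > d))

rng = np.random.default_rng(2)
# hypothetical posterior draws for (mu, sigma)
draws = np.column_stack([rng.normal(0.0, 0.2, 2000),
                         rng.uniform(0.9, 1.1, 2000)])
doses = [-2.0, -1.0, 0.0, 1.0, 2.0]
eps = 0.05
admissible = [d for d in doses if overdose_prob(d, draws, R=1/3) <= eps]
print(admissible)
```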

Practical problems that need to be addressed in a sequential design procedure include initial patient assignments based solelyon the prior distribution and appropriate prior elicitation. These and computational issues are addressed in Rosenberger et al.(2005).

1.3. Organization of the paper

In Section 2, we prove consistency of the posterior distributions computed under the sequential design framework. We also comment that all results hold for the clinical trials problem with an overdosing constraint. In Section 3, we prove convergence of the design measures. In Section 4, we derive posterior asymptotic normality. Because we are operating on a discrete design space, characterization of the limiting design measure is different from the usual consideration of continuous designs. In Section 5, we outline a characterization of the limiting design measure for the clinical trial application. We draw conclusions in Section 6.

2. Consistency of the posterior

As given in Silvey (1980, Section 7.3), the general heuristic in deriving the asymptotic justification for a sequential procedure is the following. First, one shows strong consistency of the estimator; then, for large enough samples, the stochastic procedure for choosing the next design point can be essentially replaced by the corresponding deterministic algorithm for finding the optimal design for a known value of the parameter. Then one needs to show convergence of the deterministic algorithm to the optimal design, and hence the convergence of the information corresponding to a design measure based on $n$ observations to the Fisher information for the optimal design. This in turn will guarantee asymptotic normality of the estimators. Wu (1985) followed this approach in the non-Bayesian case, and we will also follow this approach. In this section we discuss strong consistency of the estimator. First we give conditions for strong consistency under the unconstrained sequential Bayesian design. Remark 1 argues how strong consistency continues to hold when the sequential Bayesian design is performed with the overdosing constraint (8). The proof of the main theorem in this section relies on the following lemma, which is a variant of a theorem by Schwartz (1965).

Lemma 1. Let $X_1, \dots, X_n$ be random variables with arbitrary joint distribution $P^n_\theta$, $\theta \in \Theta \subset \mathbb{R}^d$. Assume that every $P^n_\theta$ admits a joint density $l_n(x_1, \dots, x_n; \theta)$ with respect to a $\sigma$-finite measure. Let $\pi$ be a prior on $\Theta$ and let $\theta_0 \in \Theta$ be the true value of $\theta$. Suppose that

1. for every $\epsilon > 0$, there exists a $\delta > 0$ such that whenever $\|\theta - \theta_0\| < \delta$, $\theta \in \Theta$, we have
$$\liminf_{n\to\infty}\, n^{-1} \log \frac{l_n(X_1, \dots, X_n; \theta)}{l_n(X_1, \dots, X_n; \theta_0)} \ge -\epsilon \quad \text{a.s. } P^\infty_{\theta_0},$$

2. for every $\epsilon > 0$, there exists a test function $\Phi_n = \Phi_n(X_1, \dots, X_n)$ for testing $H_0 : \theta = \theta_0$ against $H : \|\theta - \theta_0\| > \epsilon$ such that for some $B, b > 0$,
$$P_{\theta_0}\Phi_n \le B e^{-bn}, \qquad \sup_{\theta \in \Theta : \|\theta - \theta_0\| > \epsilon} P_\theta(1 - \Phi_n) \le B e^{-bn},$$

3. $\pi(\theta \in \Theta : \|\theta - \theta_0\| < \delta) > 0$ for all $\delta > 0$.

Then for every $\epsilon > 0$, $\pi(\theta \in \Theta : \|\theta - \theta_0\| > \epsilon \mid X_1, \dots, X_n) \to 0$ a.s. $P^\infty_{\theta_0}$.

We now state our main theorem.

Theorem 1. Let $\mathcal{D}_n$ be the data generated by the sequential Bayesian scheme (4). Let $\theta_0 \in \Theta$ and let $U$ be an arbitrary neighborhood of $\theta_0$ in $\Theta$, where $\Theta$ is a compact subset of $\mathbb{R} \times \mathbb{R}^+$. Let $\theta$ have prior $\pi$ such that $\pi(\Theta) = 1$ and $\pi$ has a positive density on some neighborhood of $\theta_0$. Suppose there exists $M_1 > 0$ such that $\sup_{\theta \in \Theta} \max_{x \in \mathcal{X}} \|\partial F_\theta(x)/\partial\theta\| = M_1 < \infty$, where $\partial F_\theta(x)/\partial\theta$ is the vector of partial derivatives $\partial F_\theta(x)/\partial\theta_i$. Then the posterior probability $\pi(\theta \in U \mid \mathcal{D}_n) \to 1$ a.s.

Remark 1. Theorem 1 also holds under the overdosing constraint given in (8). As will be clear from the proof of Theorem 1, consistency of the posterior is achieved as long as there are at least two design points that have positive allocation proportion. The following proposition establishes that, even under the overdosing constraint, the allocation proportions of at least two design points stay bounded away from zero.


Proposition 1. Let $\nu_n = (\rho_{1,n}, \rho_{2,n}, \dots, \rho_{K,n})'$ be the sequence of observed proportion vectors for design points $d_1, d_2, \dots, d_K$ under the sequential Bayesian design with the overdosing constraint (8). Let the initial allocation be an interior point of the $K$-dimensional simplex $\Xi$. Then there exist a positive constant $0 < \eta < 1$ and a positive integer $N$, possibly depending on $\eta$, such that if $n > N$ then for all $\theta \in \Theta$ we have
$$\|\nu_n - e_k\| > \eta/2, \qquad k = 1, 2, \dots, K, \qquad (10)$$
where $e_k$ is the vertex of $\Xi$ with all zeros except a one at the $k$th place.

The following corollary of Theorem 1 provides insight into the limiting structure of the design vector for the sequential Bayesian design with an overdosing constraint.

Corollary 1. Suppose $2 \le K^* < K$, where $K^*$ is defined in (9). Then, under the sequential Bayesian optimal design with overdosing constraint (8), the allocation proportions $\rho_{n,i}$, $i = K^* + 1, \dots, K$, converge to zero almost surely.

3. Consistency of the design measure

Given that the posterior is consistent under the Bayesian sequential design, the limiting properties of the design can be studied under deterministic allocation, which is essentially a quadratic optimization problem. In this section we show that the optimality criterion under the sequential design converges to a limiting value and that the corresponding design measure has limit points that are all locally D-optimal. In view of Corollary 1, the limiting values of the design measure all lie in a $K^*$-dimensional simplex, i.e., at least $K - K^*$ components are zero.

We will prove the result in slightly more generality, as it may be of interest as an optimization result. Suppose the goal is to maximize a continuous function $Q(\nu)$ which admits bounded second partial derivatives over the simplex $\Xi$, i.e., $\|\partial^2 Q(\nu)/\partial\nu\,\partial\nu'\| < M''$ for all $\nu \in \Xi$, for all $\theta \in \Theta$ and for some constant $M''$. In our special case, $Q(\nu) = \nu'Q\nu$ over the $K$-dimensional simplex, where $Q$ is a symmetric matrix with zeros on the diagonal and arbitrary positive off-diagonal entries. Suppose the algorithm chooses to move from a point $\nu_n$ at the $n$th iteration to
$$\nu^i_{n+1} = (1 - \alpha_n)\nu_n + \alpha_n e_i, \qquad (11)$$
where $e_i$ is the vertex of the simplex chosen in the direction in which $Q(\nu)$ is maximized. Suppose the diminishing step sizes $\alpha_n$ satisfy the following assumption.

Assumption (SS). $\sum_{n=1}^\infty \alpha_n = \infty$ and $\sum_{n=1}^\infty \alpha_n^2 < \infty$.

The following theorem then shows that the criterion values $Q(\nu_n)$ converge and that the iterates approach the set of maximizing designs.

Theorem 2. Let $\{\nu_n\}$ be the sequence of iterates defined by (11) and let $\alpha_n$ satisfy Assumption (SS). Then there exists $\nu^* \in \Xi$ such that
$$\lim_{n\to\infty} Q(\nu_n) = Q(\nu^*)$$
and $\lim_{n\to\infty} d(\nu_n, A(\nu^*)) = 0$, where $A(\nu^*) = \{\nu : Q(\nu) = Q(\nu^*)\}$ and, for any set $A$ and point $x$, $d(x, A) = \inf\{\|x - y\| : y \in A\}$.
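Iteration (11) can be sketched numerically as follows; the step-size choice $\alpha_n = 1/(n+2)$ (which satisfies Assumption (SS)), the symmetric matrix $Q$ with zero diagonal, and the starting point are all illustrative choices, not taken from the paper.

```python
import numpy as np

def vertex_exchange(Q, n_iter=5000):
    """Maximize Q(nu) = nu' Q nu over the simplex via iteration (11):
    at each step move toward the vertex e_i of steepest ascent."""
    K = Q.shape[0]
    nu = np.full(K, 1.0 / K)              # interior starting point
    for n in range(1, n_iter + 1):
        grad = 2.0 * Q @ nu               # gradient of nu' Q nu
        i = int(np.argmax(grad))          # best vertex direction
        alpha = 1.0 / (n + 2)             # sum = inf, sum of squares < inf
        nu = (1.0 - alpha) * nu + alpha * np.eye(K)[i]
    return nu

# symmetric Q with zero diagonal and positive off-diagonal entries
Q = np.array([[0.0, 3.0, 2.0],
              [3.0, 0.0, 2.5],
              [2.0, 2.5, 0.0]])
nu = vertex_exchange(Q)
print(nu, nu @ Q @ nu)
```

The criterion value settles near its maximum while the weights hover around the stationary design, consistent with Theorem 2's convergence-in-criterion statement.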

Remark 2. We have proven convergence of the design criterion to an optimal value. The design measure need not converge if there are multiple optima, i.e., if the set of optimal values $A(\nu^*)$ is not a singleton. If there is a ridge in the design space whose height is $Q(\nu^*)$, then the iterates can travel along the ridge without converging. Thus, the design measure need not converge to a single optimal value in the presence of such ridges.

4. Asymptotic normality of the posterior distribution

Next we derive asymptotic normality of the posterior and of the Bayes estimators. Let $l_n(\theta)$ be the log-likelihood obtained from the data $(X_1, Y_1), \dots, (X_n, Y_n)$. Let the observed information matrix be defined as
$$M^*_n(\theta) = -n^{-1} \frac{\partial^2}{\partial\theta\,\partial\theta'}\, l_n(\theta) = -n^{-1} \sum_{i=1}^n \frac{\partial^2}{\partial\theta\,\partial\theta'}\, [Y_i \log F_\theta(X_i) + (1 - Y_i) \log(1 - F_\theta(X_i))]. \qquad (12)$$
Let $\hat{\theta}_n$ be the MLE, i.e.,
$$\hat{\theta}_n = \arg\max_{\theta \in \Theta}\, l_n(\theta). \qquad (13)$$
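The objects in (12) and (13) can be illustrated numerically: simulate binary responses under a logistic $F$, maximize $l_n$ over a compact grid as a crude global maximizer, and approximate the observed information at the MLE by a finite-difference Hessian. The simulated sample, grid, and step size are assumptions made for this sketch.

```python
import numpy as np

def loglik(theta, x, y):
    """Log-likelihood l_n(theta) for binary responses with logistic F."""
    mu, sigma = theta
    z = (np.asarray(x) - mu) / sigma
    F = 1.0 / (1.0 + np.exp(-z))
    return float(np.sum(y * np.log(F) + (1 - y) * np.log(1 - F)))

def mle_grid(x, y):
    """Crude global maximizer of l_n over a compact grid, as in (13)."""
    grid = [(m, s) for m in np.linspace(-1.0, 1.0, 81)
                   for s in np.linspace(0.5, 2.0, 61)]
    return max(grid, key=lambda th: loglik(th, x, y))

def observed_info(theta, x, y, h=1e-4):
    """Finite-difference Hessian giving M*_n(theta) = -n^{-1} d^2 l_n, Eq. (12)."""
    n = len(x)
    H = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            e_i, e_j = np.eye(2)[i] * h, np.eye(2)[j] * h
            H[i, j] = (loglik(theta + e_i + e_j, x, y)
                       - loglik(theta + e_i - e_j, x, y)
                       - loglik(theta - e_i + e_j, x, y)
                       + loglik(theta - e_i - e_j, x, y)) / (4 * h * h)
    return -H / n

rng = np.random.default_rng(3)
x = np.tile([-2.0, -1.0, 0.0, 1.0, 2.0], 80)     # n = 400 allocations
p = 1.0 / (1.0 + np.exp(-(x - 0.2) / 1.1))       # true theta0 = (0.2, 1.1)
y = rng.binomial(1, p)
theta_hat = np.array(mle_grid(x, y))
print(theta_hat, observed_info(theta_hat, x, y))
```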


Note that there is a direct relationship between the "observed" information matrix $M^*_n(\theta)$ and the matrix $M_n(\theta)$ in (3). If we write the expression (12) as $M^*_n(\theta) = n^{-1}\sum_{i=1}^n I^*_i(\theta)$, then
$$M_n(\theta) = n^{-1} \sum_{i=1}^n E(I^*_i(\theta) \mid \mathcal{D}_{i-1}).$$
Thus, the matrix $M_n(\theta)$ is not the actual Fisher information matrix, but rather an intermediate quantity between the observed information matrix and the Fisher information matrix. See Silvey (1980, Chapter 7) for more discussion of the matrices $M_n(\theta)$, $M^*_n(\theta)$ and the Fisher information matrix.

The proof of asymptotic normality of the posterior uses the fact that the MLE is asymptotically normal. There are many results proving asymptotic normality of the MLE in similar adaptive sequential designs. The most relevant are by Wu (1985) and Chaudhuri and Mykland (1995). Also, Lai (1994) provides conditions for asymptotic normality of least squares estimators in nonlinear regression models with adaptive sequential designs. Chaudhuri and Mykland (1995) show that the MLE is asymptotically normal if the data are generated by an adaptive sequential design in a nonlinear experiment. In order to use their result we would need to assume that the initial allocation size $n_0$ tends to infinity as the sample size increases. This is not directly applicable to the particular application of phase I clinical trials considered here. Also, Chaudhuri and Mykland show a weaker form of consistency for the MLE by assuming a weaker condition on the eigenvalues of the observed information. Due to the binary form of our experiment we are able to show boundedness of the eigenvalues of the observed information directly in Proposition 2. A key step toward the proof of asymptotic normality of the MLE is proving its consistency. Consistency of some suitable solution of the likelihood equation may be shown using Hall and Heyde (1980, Section 6.2). However, unless the solution is unique, the convergent solution is unknown, a well-known problem associated with the solution of the likelihood equation (cf. Serfling, 1981). To avoid this difficulty, we use a global maximum. Our approach is useful even in the absence of a martingale structure. Due to the differences in our set of assumptions from those considered in the literature, and due to the general appeal of our proof in situations where a martingale structure is not applicable, we present the proof of consistency of the MLE as a separate theorem in the appendix. Once consistency is established, asymptotic normality follows relatively easily by a Taylor expansion, a martingale central limit theorem and Proposition 2.

Hu et al. (2005) show asymptotic normality of a sequence of solutions of the likelihood equations in a response-adaptive randomization procedure under an exponential family model and the assumption that the design measure converges almost surely to an optimal value. Most proofs require the convergence of the design measure to a unique value. The convergence of the design measure is not necessarily guaranteed for our problem (cf. Remark 2). We circumvent the problem of convergence of the design measure by normalizing the MLE sequence with the observed information. First we prove the boundedness of the eigenvalues of the observed information matrix.

Proposition 2. Let $M^*_n(\theta)$ be the observed information matrix as defined in (12). Let $\lambda_{\min,n}(\theta)$ and $\lambda_{\max,n}(\theta)$ be the smallest and the largest eigenvalues of $M^*_n(\theta)$, respectively. Then there exist constants $0 < L_{\min} < L_{\max} < \infty$ and $\delta > 0$ such that $L_{\min} < \lambda_{\min,n}(\theta) < \lambda_{\max,n}(\theta) < L_{\max}$ for all $n$ and for all $\theta \in N_\delta(\theta_0)$, where $N_\delta(\theta_0) = \{\theta : \|\theta - \theta_0\| < \delta\}$.

Define
$$w_n(\theta) = l_n(\theta) - l_n(\hat{\theta}_n). \qquad (14)$$
Then by a Taylor expansion,
$$w_n(\hat{\theta}_n + (nM^*_n)^{-1/2}u) = -\tfrac{1}{2}u'u + R_n(u), \qquad (15)$$
where $u = \sqrt{n}\, M^{*1/2}_n (\theta - \hat{\theta}_n)$, $M^*_n = M^*_n(\hat{\theta}_n)$ is the observed information matrix (12) evaluated at the MLE $\hat{\theta}_n$, and $R_n(u)$ is the remainder term. We shall show that the $L_1$-distance between the posterior density of $u$,
$$\pi^*_n(u) = \pi(u \mid \mathcal{D}_n) = \frac{e^{w_n(u)}\,\pi(\hat{\theta}_n + (nM^*_n)^{-1/2}u)}{\int_{\mathbb{R}^2} e^{w_n(t)}\,\pi(\hat{\theta}_n + (nM^*_n)^{-1/2}t)\, dt}, \qquad (16)$$
and the multivariate normal density $\phi(u) = e^{-(1/2)u'u}\big/\int_{\mathbb{R}^2} e^{-(1/2)t't}\, dt$ converges to zero a.s. $[P_{\theta_0}]$.

Theorem 3. If $\mathcal{D}_n$ satisfies all the assumptions of Theorem 1 and Lemma 1 and if, in addition, Assumption (SS) holds, then
$$\lim_{n\to\infty} \int_{\mathbb{R}^2} |\pi^*_n(u) - \phi(u)|\, du = 0 \quad \text{a.s. } [P_{\theta_0}]. \qquad (17)$$

The asymptotic normality of the Bayes estimator follows directly from the asymptotic normality of the posterior and the asymptotic normality of the MLE (cf. Bickel and Yahav, 1969).


Corollary 2. Let $\tilde{\theta}_n = E_\pi(\theta \mid \mathcal{D}_n)$ be the Bayes estimator of $\theta$ with respect to a prior $\pi$. Then
$$(nM^*_n)^{1/2}(\tilde{\theta}_n - \theta_0) \longrightarrow N(0, I), \qquad (18)$$
where $M^*_n = M^*_n(\hat{\theta}_n)$ is defined in (12).

5. Characterization of the limiting design

In this section we characterize the limiting design obtained following the sequential Bayesian optimal procedure. Kiefer and Wolfowitz (1960) give the general equivalence theorem for finding the global D-optimal design. For example, one can show that if the design space is large enough, the D-optimal design for the logistic link is a 2-point design, with equal weights at two symmetric quantiles on the $z$ scale. If the design space is not wide enough, the design is still a 2-point design with at least one of the points falling on the boundary of the design space. However, when the design space is discrete, the optimal designs for the continuous design space are no longer optimal. In order to find the design that is D-optimal over the design space $\mathcal{X} = \{d_1, d_2, \dots, d_K\}$ one needs to solve the following optimization problem:

$$\max_{\nu \in \Xi} Q(\nu) := \max_{\nu \in \Xi} \nu'Q\nu, \qquad (19)$$

where $Q$ is defined in (6). The Lagrangian of the problem can then be written as
$$L(\nu, \lambda, \gamma) := -\nu'Q\nu - \lambda c(\nu) - \sum_{i=1}^K \gamma_i c_i(\nu), \qquad (20)$$
where $c(\nu) = J'\nu - 1$, $c_i(\nu) = \rho_i$, $i = 1, 2, \dots, K$, and $J$ is the vector of ones. Because $Q$ is a matrix with arbitrary positive off-diagonal entries, (19) is not a convex optimization problem. Thus, the Karush--Kuhn--Tucker (KKT) conditions are only necessary, provided the linear independence constraint qualification (LICQ) conditions are satisfied. However, clearly not all $\rho_i$ can be in the inactive set (inequality constraints), as $J'\nu = 1$. Suppose, without loss of generality, that the first $l\,(<K)$ components of $\nu$ are in the active set. Then the LICQ conditions require the vectors in $\{\nabla c(\nu), \nabla c_1(\nu), \dots, \nabla c_l(\nu)\} = \{J, e_1, \dots, e_l\}$ to be linearly independent. Thus, the LICQ conditions are clearly satisfied. The KKT first-order necessary conditions can then be written as
$$\nabla_\nu L(\nu, \lambda, \gamma) = 0, \quad J'\nu = 1, \quad \nu \ge 0, \quad \gamma \ge 0, \quad \gamma \cdot \nu = 0, \qquad (21)$$

where $a \cdot b$ is the Hadamard (componentwise) product of two vectors $a$ and $b$, and $a \ge 0$ means that all components of $a$ are nonnegative. The last condition in (21) is the complementary slackness condition, which implies that at the solution $(\nu^*, \lambda^*, \gamma^*)$, for each $i$, either $\rho^*_i = 0$ or $\gamma^*_i = 0$ or both are zero. This in turn implies that there exists a permutation matrix $P$ such that
$$P\nu^* = \begin{pmatrix} \nu^*_1 \\ 0 \end{pmatrix} \quad \text{and} \quad P\gamma^* = \begin{pmatrix} 0 \\ \gamma^*_2 \end{pmatrix},$$

where $\gamma^*_2$ could be identically zero. From the first condition in (21) we see that a solution must necessarily satisfy $PQP'P\nu = \lambda^* J$, or
$$Q_1 \nu^*_1 = \lambda^* J, \qquad (22)$$

where $Q_1$ is the upper left block of $PQP'$ corresponding to $\nu^*_1$. The set of points satisfying the necessary conditions is then
$$A = \Big\{\nu \in \Xi : \exists\ \text{a permutation } P \text{ such that } P\nu = \begin{pmatrix} \nu_1 \\ 0 \end{pmatrix} \text{ and } PQ\nu = \begin{pmatrix} \lambda J \\ x \end{pmatrix} \text{ for some } \lambda > 0 \Big\}.$$

Let $\mathcal{M}$ be the set of all $r \times r$ principal minors of $Q$ for $r \ge 2$. Since the diagonal elements of $Q$ are zero, the principal minors for $r = 1$ are all zero. In order to look at the solution more explicitly we make the following assumption:

Assumption (C). All $r \times r$ ($r \ge 2$) principal minors of $Q$ are nonsingular.

Consider the set $\mathcal{Q} = \{M : M \in \mathcal{M};\ M^{-1}J > 0\}$. Then from (22) we have
$$\nu^* = (J'M^{*-1}J)^{-1} M^{*-1} J, \qquad (23)$$
where $M^* = \arg\min\{J'M^{-1}J : M \in \mathcal{Q}\}$. Thus the D-optimal design has positive weights for the components corresponding to the principal minor $M \in \mathcal{Q}$ for which the sum of all the entries in the inverse is minimum among all minors in $\mathcal{Q}$. Note that $\mathcal{Q}$ has at most $2^K - K - 1$ elements. Of course, even for moderate $K$ this could be excessively large for the characterization (23) to be useful. However, for some simple cases it can be insightful.
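The characterization (23) translates directly into a brute-force search over principal submatrices. The following sketch (with a made-up $Q$) is illustrative, not from the paper.

```python
import numpy as np
from itertools import combinations

def d_optimal_discrete(Q):
    """Find the discrete D-optimal weights via (23): among principal
    submatrices M (size >= 2) with M^{-1}J > 0, pick the one minimizing
    J'M^{-1}J; the weights are (J'M^{-1}J)^{-1} M^{-1} J on its support."""
    K = Q.shape[0]
    best, best_val, best_idx = None, np.inf, None
    for r in range(2, K + 1):
        for idx in combinations(range(K), r):
            M = Q[np.ix_(idx, idx)]
            if abs(np.linalg.det(M)) < 1e-12:
                continue                        # Assumption (C) fails
            w = np.linalg.solve(M, np.ones(r))  # M^{-1} J
            if np.all(w > 0):                   # M belongs to the set Q
                val = w.sum()                   # J' M^{-1} J
                if val < best_val:
                    best_val, best, best_idx = val, w / w.sum(), idx
    nu = np.zeros(K)
    nu[list(best_idx)] = best
    return nu

Q = np.array([[0.0, 3.0, 2.0],
              [3.0, 0.0, 2.5],
              [2.0, 2.5, 0.0]])
print(d_optimal_discrete(Q))
```

For moderate $K$ the $2^K - K - 1$ candidate submatrices make this enumeration feasible only for small problems, which is the point made above.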


Proposition 3. Suppose $K = 3$ and suppose the entries $q_{12}$, $q_{13}$ and $q_{23}$ of $Q$ are all distinct. Then the D-optimal design is a 3-point design if $q_{12}$, $q_{13}$ and $q_{23}$ are the three sides of a triangle, i.e., the sum of any two is greater than the remaining third. Otherwise, the D-optimal design is a 2-point design with equal weights at $d_i$ and $d_j$, where $q_{ij} = \max\{q_{12}, q_{13}, q_{23}\}$.

Proof. For simplicity, let $q_{12} = a$, $q_{13} = b$ and $q_{23} = c$. Then

$$Q^{-1}J = (2abc)^{-1}\begin{pmatrix} c(a+b-c)\\ b(c+a-b)\\ a(b+c-a)\end{pmatrix}.$$

Clearly, if $a$, $b$, $c$ are the sides of a triangle then $Q \in \mathcal{Q}$. The inverses of the $2\times2$ minors are $a^{-1}R$, $b^{-1}R$ and $c^{-1}R$, where $R = \begin{pmatrix}0&1\\1&0\end{pmatrix}$. All $2\times2$ minors belong to $\mathcal{Q}$. Then the design will be a 3-point design if

$$J'Q^{-1}J < 2\min(a^{-1}, b^{-1}, c^{-1}).$$

Equivalently, the D-optimal design is a 3-point design if

$$\frac{2(ab+bc+ca) - a^2 - b^2 - c^2}{2abc} < \frac{2}{\max\{a,b,c\}}.$$

Without loss of generality, let $a = \max\{a,b,c\}$; then the above claim holds because $(b+c-a)^2 > 0$. If the triangle inequality is not satisfied with $a = \max\{a,b,c\}$, then the maximum value of the criterion is $a/2$, attained by the design $(\frac12, \frac12, 0)$. $\Box$
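The rule of Proposition 3 is a one-liner to implement. A minimal sketch (Python with numpy; input values below are illustrative and assume the three entries are distinct, as the proposition requires):

```python
import numpy as np

def d_optimal_3pt(a, b, c):
    """Proposition 3 rule for K = 3 with q12 = a, q13 = b, q23 = c (distinct).
    Returns the weight vector of the D-optimal design on (d1, d2, d3)."""
    if a + b > c and b + c > a and c + a > b:        # triangle: 3-point design
        w = np.array([c*(a + b - c), b*(c + a - b), a*(b + c - a)])
        return w / w.sum()                           # (J'Q^{-1}J)^{-1} Q^{-1}J
    w = np.zeros(3)
    pairs = {a: (0, 1), b: (0, 2), c: (1, 2)}        # largest entry -> equal weights
    i, j = pairs[max(a, b, c)]
    w[[i, j]] = 0.5
    return w
```

For sides $(2, 3, 4)$ this gives the 3-point design $(4/23, 9/23, 10/23)$; for $(4, 1, 2)$, where $1 + 2 < 4$, it gives equal weights on $d_1$ and $d_2$.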

Proposition 4. Let $Q$ be as defined in (6). Then $\mathrm{rank}(Q) = \min(K, 3)$ and all $2\times2$ and $3\times3$ principal minors of $Q$ are nonsingular.

Proof. Note that $Q = \Lambda D \Lambda$, where $\Lambda$ is the diagonal matrix with $i$th diagonal entry $\lambda_i = 2^{-1/2}\sigma^{-1}a_i(\theta)$, $D$ is the matrix with $(i,j)$th entry $d_{ij} = (d_i - d_j)^2$, $i,j = 1,2,\ldots,K$, and the $a_i(\theta)$ are defined in (6). Because $\lambda_i > 0$ for all $i$, $\mathrm{rank}(Q) = \mathrm{rank}(D)$. Now,

$$D = d_2 J' + J d_2' - 2dd', \qquad \text{(24)}$$

where $d_r = (d_1^r, d_2^r, \ldots, d_K^r)'$. Because all matrices on the right side of (24) are of rank one, we have $\mathrm{rank}(Q) \le 3$. Algebraic computation shows

$$\det(D) = \begin{cases} -d_{12}^2 & \text{if } K = 2,\\ 2d_{12}d_{23}d_{13} & \text{if } K = 3.\end{cases}$$

Since the design points are all distinct, we have the result. $\Box$

Proposition 5. The D-optimal design is either a 2-point or a 3-point design, where the vector $\nu_1$ of nonzero weights is of the form $\nu_1 = (J'M^{-1}J)^{-1}M^{-1}J$ and $M$ is the principal minor of $Q$ corresponding to the components with nonzero weights.

Proof. By the KKT conditions, the D-optimal design is necessarily of the form $\nu = (\nu_1' : 0')'$ with $Q\nu = (\lambda J' : x')'$ (after a rearrangement of the doses to put the doses with nonzero weights at the beginning) for some positive constant $\lambda$ and some vector $x$. But since $\mathrm{rank}(Q) = \min(K,3)$ by Proposition 4, if $Q_{11}$ is the upper left $3\times3$ principal minor of $Q_1$, where

$$Q = \begin{pmatrix} Q_1 & Q_2\\ Q_2' & Q_3 \end{pmatrix},$$

then $Q_1 = \binom{I}{B'}Q_{11}(I : B)$ for some matrix $B$. Then, solving $Q_1\nu_1 = \lambda J$, we obtain

$$\nu_{11} + B\nu_{12} = \lambda Q_{11}^{-1}J, \qquad B'Q_{11}(\nu_{11} + B\nu_{12}) = \lambda J. \qquad \text{(25)}$$

Therefore $J'B = J'$. This implies $\lambda J'Q_{11}^{-1}J = J'\nu_{11} + J'B\nu_{12} = J'\nu_{11} + J'\nu_{12} = 1$. Thus the maximum value over an $r$-point design can be achieved over a 3-point design. However, because the choice of the three doses defining $Q_{11}$ within $Q_1$ was rather arbitrary, (25) implies that $\lambda = (J'Q_{11}^{-1}J)^{-1}$ for any choice of $3\times3$ principal minor of $Q_1$. Simple algebra shows that this is not possible unless all elements of $Q_1$ are zero. Hence no $r > 3$ point design satisfies the KKT necessary conditions for a local maximum. By Proposition 4, if for some $3\times3$ principal minor the elements above the diagonal form the three sides of a triangle, then the design with positive weights corresponding to the rows of that principal minor is a 3-point candidate for the D-optimal design, and one can eliminate the three 2-point subsets of the 3-point design. The form of the design given by the KKT conditions is $\nu_1 = (J'M^{-1}J)^{-1}M^{-1}J$, where $\nu_1$ is the subvector of the design vector with positive weights and $M$ is the corresponding principal minor of $Q$. $\Box$


Proposition 3 characterizes the candidate set of D-optimal designs. In the case of a design space with $K$ design points, one needs to search over only $\binom{K}{3}$ possible designs. Typically the search will be over an even smaller set, as the optimal 2-point design among all possible 2-point designs is the one that puts equal weights on the two design points corresponding to the indices of the maximum element of $Q$. Because the link function is quite arbitrary, there are cases where a 3-point design gives a solution with a higher value of the optimality criterion even though no principal minor involving the maximum element of $Q$ satisfies the criterion for a 3-point optimal design.

6. Discussion

We have shown that the sequential Bayesian D-optimal design approach generates data that allow the parameters of the model to be estimated consistently. Also, the design criterion converges over the iterations. A pertinent question is the rate of posterior convergence. Given the finite dimensional parametric structure, one can expect the classical $\sqrt{n}$ rate of convergence. However, the link distribution function is often unknown. Consistency and rate questions become more challenging if one assumes only such a semiparametric model. Specifically, if the distribution function $F$ is assumed to be arbitrary except for some conditions on the quantiles to make the model identifiable, then one can mimic the approach of Haines et al. (2003) by adding some process prior on $F$. Error rates for estimation of specific quantiles will also be of interest. It seems that in the semiparametric case the design space must be continuous in order to retrieve full information about $F$, and the discrete design space at each stage must gradually become dense everywhere in the overall design space. Haines et al. (2003) demonstrate by simulation that the procedure works quite well with $n = 35$ patients, even with a misspecified prior distribution.

Acknowledgments

Anindya Roy's and William Rosenberger's research was partially supported by Grant R01-CA87746 from the US National Cancer Institute. Subhashis Ghosal's research was supported by Grant DMS-0349111 from the US National Science Foundation. We would like to thank the referees for their thorough and helpful comments on this paper.

Appendix

Proof of Lemma 1. The result follows by proceeding as in the proof of Schwartz's theorem (cf. Schwartz, 1965), observing that for any $b > 0$,

$$\liminf_{n\to\infty} e^{bn}\int \frac{l_n(X_1,\ldots,X_n;\theta)}{l_n(X_1,\ldots,X_n;\theta_0)}\,d\Pi(\theta) \ge \liminf_{n\to\infty} e^{bn}\int_{\|\theta-\theta_0\|<\epsilon} \frac{l_n(X_1,\ldots,X_n;\theta)}{l_n(X_1,\ldots,X_n;\theta_0)}\,d\Pi(\theta)$$
$$\ge \int_{\|\theta-\theta_0\|<\epsilon} \liminf_{n\to\infty} e^{bn}\,\frac{l_n(X_1,\ldots,X_n;\theta)}{l_n(X_1,\ldots,X_n;\theta_0)}\,d\Pi(\theta) = \infty$$

for $\epsilon$ sufficiently small, by Fatou's lemma, since

$$n\left(b + n^{-1}\log\frac{l_n(X_1,\ldots,X_n;\theta)}{l_n(X_1,\ldots,X_n;\theta_0)}\right) \to \infty$$

for all $\theta$ sufficiently close to $\theta_0$ by Condition 1. $\Box$

We also require the following lemma in the proof of Theorem 1. The proof of the lemma follows immediately from a Taylor expansion.

Lemma 2. Let $0 < \epsilon_0 < \frac12$ and $\epsilon_0 < \eta, \zeta < 1 - \epsilon_0$. Then there exists a constant $L$ depending only on $\epsilon_0$ such that

$$\eta\left(\log\frac{\eta}{\zeta}\right)^m + (1-\eta)\left(\log\frac{1-\eta}{1-\zeta}\right)^m \le L(\eta-\zeta)^2, \qquad m = 1, 2.$$
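As a quick numerical sanity check of the lemma (not part of the proof; assumes numpy, with $\epsilon_0 = 0.1$ chosen arbitrarily), one can evaluate the left side over a grid and bound its ratio to $(\eta-\zeta)^2$:

```python
import numpy as np

# Grid check of Lemma 2 with eps0 = 0.1: the left side divided by
# (eta - zeta)^2 stays bounded for m = 1, 2, so a constant L depending
# only on eps0 exists (at least on the grid).
eps0 = 0.1
grid = np.linspace(eps0, 1 - eps0, 201)
eta, zeta = np.meshgrid(grid, grid)
mask = np.abs(eta - zeta) > 1e-12          # exclude eta == zeta (both sides 0)
sq = (eta - zeta)[mask] ** 2
ratios = []
for m in (1, 2):
    lhs = eta * np.log(eta / zeta) ** m + (1 - eta) * np.log((1 - eta) / (1 - zeta)) ** m
    ratios.append(np.max(lhs[mask] / sq))
L = max(ratios)   # empirical constant; any L at least this large works on the grid
```

For $m = 1$ the left side is the Kullback--Leibler divergence between Bernoulli($\eta$) and Bernoulli($\zeta$), which a mean-value argument bounds by $(\eta-\zeta)^2/(\epsilon_0(1-\epsilon_0))$ on this range.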

Proof of Theorem 1. First, we show the existence of a test function $\Phi_n$ satisfying Condition 2 of Lemma 1. Consider a pair of hypotheses $H_0: \theta = \theta_0$ versus $H_A: \theta = \theta_1$. Partition the design space as $\mathcal{X}_0 = \mathcal{X}_0(\theta_1) = \{x \in \mathcal{X} : F_{\theta_0}(x) \le F_{\theta_1}(x)\}$ and $\mathcal{X}_1 = \mathcal{X}_1(\theta_1) = \mathcal{X}\setminus\mathcal{X}_0(\theta_1)$. Consider the test $\Phi_n = 1\{T_n > n\epsilon_*\}$, where

$$T_n = T_n(\theta_1) = \sum_{i: X_i\in\mathcal{X}_0} [Y_i - E_{\theta_0}(Y_i\mid\mathcal{D}_{i-1})] - \sum_{i: X_i\in\mathcal{X}_1} [Y_i - E_{\theta_0}(Y_i\mid\mathcal{D}_{i-1})], \qquad \text{(A.1)}$$

$\epsilon_* = \inf\{\epsilon(\theta) : \|\theta-\theta_0\| > \delta\}$, $\epsilon(\theta) = \min\{|F_\theta(x) - F_{\theta_0}(x)| : x\in\mathcal{X}\}$, and $1\{A\}$ denotes the indicator of the set $A$. Note that $\epsilon_* > 0$ by the identifiability of the family $F_\theta$, the continuity of $\theta\mapsto F_\theta$, and the compactness of $\Theta$.


Let

$$Z_i(\theta^*,\theta_1) = \begin{cases} Y_i - E_{\theta^*}(Y_i\mid\mathcal{D}_{i-1}), & X_i\in\mathcal{X}_0(\theta_1),\\ (1-Y_i) - E_{\theta^*}(1-Y_i\mid\mathcal{D}_{i-1}), & X_i\in\mathcal{X}_1(\theta_1).\end{cases}$$

Then $\{Z_i\}$ is a martingale difference sequence under $P_{\theta^*}$ and $-1 \le Z_i \le 1$. By Azuma's inequality (cf. Ross, 1996, p. 307),

$$P_{\theta_0}(T_n > n\epsilon_*) \le e^{-n\epsilon_*^2/2}.$$
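The bound just applied is easy to see in simulation. A minimal Monte Carlo sketch (Python standard library only; the values of `p`, `n`, `eps` and the number of replications are arbitrary illustrative choices, with $Z_i = Y_i - p$, $Y_i \sim \mathrm{Bernoulli}(p)$, so the differences lie in $[-1, 1]$):

```python
import math
import random

# Monte Carlo illustration of Azuma's inequality for bounded martingale
# differences: P(sum_i Z_i > n*eps) <= exp(-n*eps^2/2).
random.seed(0)
p, n, eps, reps = 0.3, 200, 0.1, 2000
exceed = 0
for _ in range(reps):
    s = sum((1.0 if random.random() < p else 0.0) - p for _ in range(n))
    if s > n * eps:
        exceed += 1
empirical = exceed / reps            # observed tail frequency
azuma = math.exp(-n * eps ** 2 / 2)  # Azuma upper bound, here e^{-1}
```

The empirical tail frequency sits far below the bound, as expected: Azuma's inequality is conservative because it uses only the range of the differences, not their variance.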

To bound $P_{\theta_1}(T_n \le n\epsilon_*)$, note that

$$T_n = \sum_{i=1}^n Z_i(\theta_1,\theta_1) + \sum_{i:X_i\in\mathcal{X}_0}(F_{\theta_1}(X_i) - F_{\theta_0}(X_i)) - \sum_{i:X_i\in\mathcal{X}_1}(F_{\theta_1}(X_i) - F_{\theta_0}(X_i)).$$

Therefore,

$$P_{\theta_1}(T_n \le n\epsilon_*) = P_{\theta_1}\left(\sum_{i=1}^n Z_i(\theta_1,\theta_1) \le n\epsilon_* - \sum_{i:X_i\in\mathcal{X}_0}(F_{\theta_1}(X_i) - F_{\theta_0}(X_i)) + \sum_{i:X_i\in\mathcal{X}_1}(F_{\theta_1}(X_i) - F_{\theta_0}(X_i))\right)$$
$$\le P_{\theta_1}\left(\sum_{i=1}^n Z_i(\theta_1,\theta_1) < -n\epsilon_*\right) \le e^{-n\epsilon_*^2/2}.$$

In order to remove the dependence of the test on $\theta_1$, where $\|\theta_1-\theta_0\| > \delta$, consider $\theta^*$ such that $\|\theta^*-\theta_1\| < \eta$. Then $\sum_{x\in\mathcal{X}}|F_{\theta^*}(x) - F_{\theta_1}(x)| < MK\eta$, and thus

$$T_n(\theta_1) = \sum_{i=1}^n Z_i(\theta^*,\theta_1) + \sum_{i:X_i\in\mathcal{X}_0}(F_{\theta^*}(X_i) - F_{\theta_0}(X_i)) - \sum_{i:X_i\in\mathcal{X}_1}(F_{\theta^*}(X_i) - F_{\theta_0}(X_i))$$
$$\ge \sum_{i=1}^n Z_i(\theta^*,\theta_1) - \sum_{i=1}^n |F_{\theta^*}(X_i) - F_{\theta_1}(X_i)| \ge \sum_{i=1}^n Z_i(\theta^*,\theta_1) - nMK\eta,$$

using that $F_{\theta_1}(X_i) - F_{\theta_0}(X_i) \ge 0$ on $\mathcal{X}_0$ and $\le 0$ on $\mathcal{X}_1$. Thus, if $\eta = \epsilon_*/(2MK)$, another application of Azuma's inequality gives

$$P_{\theta^*}(T_n(\theta_1) \le n\epsilon_*) \le P_{\theta^*}\left(\sum_{i=1}^n Z_i(\theta^*,\theta_1) \le -n\epsilon_*/2\right) \le e^{-n\epsilon_*^2/8}.$$

Since $\Theta$ is compact, we can cover $\{\theta : \|\theta-\theta_0\| > \delta\}$ by finitely many balls of radius $\epsilon_*/(2MK)$. Construct tests as in (A.1) with $\theta_1$ equal to the center of each ball. Each such test has type I error probability bounded by $e^{-n\epsilon_*^2/2}$ and type II error probability on the respective ball bounded by $e^{-n\epsilon_*^2/8}$. The maximum of these finitely many tests then satisfies Condition 2 of Lemma 1.

To verify Condition 1 of Lemma 1, note that the log-likelihood ratio is given by $\Lambda^{(n)}(\theta_0,\theta) = \sum_{i=1}^n \lambda_i(\theta_0,\theta)$, where

$$\lambda_i(\theta_0,\theta) = Y_i\log\frac{F_{\theta_0}(X_i)}{F_\theta(X_i)} + (1-Y_i)\log\frac{1-F_{\theta_0}(X_i)}{1-F_\theta(X_i)}.$$

Then

$$E(\lambda_i(\theta_0,\theta)\mid\mathcal{D}_{i-1}) = F_{\theta_0}(X_i)\log\frac{F_{\theta_0}(X_i)}{F_\theta(X_i)} + (1-F_{\theta_0}(X_i))\log\frac{1-F_{\theta_0}(X_i)}{1-F_\theta(X_i)},$$

which is bounded above by a multiple of $\|\theta-\theta_0\|^2$ by Lemma 2 and the assumption that $\sup\{\|\partial F_\theta(x)/\partial\theta\| : \theta\in\Theta,\ x\in\mathcal{X}\} = M_1 < \infty$. Because

$$\Lambda^{(n)}(\theta_0,\theta) = \sum_{i=1}^n \{\lambda_i(\theta_0,\theta) - E(\lambda_i(\theta_0,\theta)\mid\mathcal{D}_{i-1})\} + \sum_{i=1}^n E(\lambda_i(\theta_0,\theta)\mid\mathcal{D}_{i-1})$$


and

$$n^{-1}\sum_{i=1}^n \{\lambda_i(\theta_0,\theta) - E(\lambda_i(\theta_0,\theta)\mid\mathcal{D}_{i-1})\} \to 0 \quad\text{a.s. } [P_{\theta_0}]$$

by the strong law of large numbers for martingales, it follows that for some constant $C_0$ we have $n^{-1}\Lambda^{(n)}(\theta_0,\theta) \le C_0\delta^2$ a.s. for all sufficiently large $n$ and all $\|\theta-\theta_0\| < \delta$. Thus Condition 1 holds. Condition 3 of Lemma 1 is clearly satisfied. Theorem 1 then follows. $\Box$

Proof of Proposition 1. We first show that the design vector is bounded away from the vertices in the unconstrained case. It is enough to prove (10) for $k = 1$. Fix $\theta\in\Theta$. Consider $Q_\theta(\nu^1_{n+1}) - Q_\theta(\nu^j_{n+1})$ for $j > 1$, where

$$\nu^j_{n+1} = \frac{n}{n+1}\nu_n + \frac{1}{n+1}e_j, \qquad \text{(A.2)}$$

and $Q_\theta(\cdot)$ is defined just prior to (5). By simple algebra,

$$Q_\theta(\nu^1_{n+1}) - Q_\theta(\nu^j_{n+1}) = \frac{2n\sigma^{-2}}{(n+1)^2}\,a_1 a_j(d_1-d_j)^2(\nu_{j,n} - \nu_{1,n}) + \frac{2n\sigma^{-2}}{(n+1)^2}\sum_{i=2,\,i\ne j}^K a_i\nu_{i,n}\,[a_1(d_1-d_i)^2 - a_j(d_j-d_i)^2]$$
$$= \frac{2n\sigma^{-2}}{(n+1)^2}\sum_{i=1}^K a_i\nu_{i,n}\,[a_1(d_1-d_i)^2 - a_j(d_j-d_i)^2] := \frac{2n\sigma^{-2}}{(n+1)^2}\,l_j(\nu_n). \qquad \text{(A.3)}$$

The linear function $l_j(\nu_n)$ converges to a negative limit as $\nu_n$ tends to $e_1$, i.e.,

$$\lim_{\nu_n\to e_1} l_j(\nu_n) = -a_1 a_j(d_1-d_j)^2 < 0 \quad\text{for } j = 2, 3, \ldots, K.$$

Thus there exists $0 < \delta_j(\theta) < 1$ such that for $\nu_{1,n} > 1 - \delta_j(\theta)$ we have

$$Q_\theta(\nu^1_{n+1}) < Q_\theta(\nu^j_{n+1}).$$

Let $\delta = \inf\{\delta_j(\theta) : \theta\in\Theta,\ 2\le j\le K\}$. Because $\Theta$ is compact, we have $0 < \delta < 1$. Now if $\nu_{1,n} > 1 - \delta$ then $\nu_{1,n+1} = n\nu_{1,n}/(n+1)$, so that $\nu_{1,n} - \nu_{1,n+1} = \nu_{1,n}/(n+1) > (1-\delta)/(n+1)$. Because the harmonic series $\sum n^{-1}$ diverges, we have $\nu_{1,n} < 1 - \delta$ infinitely often. Choose $N_{1,\delta}$ such that $N_{1,\delta}^{-1} < \delta/2$. Let $N_\delta = \inf\{n : n > N_{1,\delta} \text{ and } \nu_{1,n} < 1-\delta\}$. Then for $n > N_\delta$ we have

$$\nu_{1,n} \le \begin{cases} (1-n^{-1})\nu_{1,n-1} + n^{-1} < 1 - \delta/2 & \text{if } \nu_{1,n-1} < 1-\delta,\\ (1-n^{-1})\nu_{1,n-1} < 1 - \delta/2 & \text{if } 1-\delta \le \nu_{1,n-1} < 1-\delta/2.\end{cases}$$

Clearly, the constraint bounds the design vector away from the vertices $e_{K^*+1},\ldots,e_K$, where $K^*$ is defined in (9). To see that the design vector is bounded away from the vertices $e_1,\ldots,e_{K^*}$, one needs only consider $j\in\{1,2,\ldots,K^*\}$ in the above proof. Thus, (10) continues to hold even when the iterations in the design space are performed in conjunction with the constraint (8). $\Box$
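The vertex-direction updates analyzed above are easy to simulate. The sketch below (Python with numpy; the matrix `Q` and the implicit fixed $\theta$ are illustrative — in the actual procedure the criterion would be recomputed from the posterior at each stage) iterates $\nu_{n+1} = (n\nu_n + e_j)/(n+1)$ with $j$ chosen to maximize the quadratic criterion $\nu'Q\nu$:

```python
import numpy as np

# Sequential vertex-direction iteration: at step n, move toward the vertex e_j
# whose candidate (n*nu + e_j)/(n+1) maximizes nu'Q nu.  Q is an illustrative
# symmetric matrix with zero diagonal, of the form in (6).
Q = np.array([[0.0, 2.0, 3.0],
              [2.0, 0.0, 4.0],
              [3.0, 4.0, 0.0]])
K = Q.shape[0]
nu = np.full(K, 1.0 / K)                  # start at the barycenter
for n in range(1, 5001):
    cands = [(n * nu + np.eye(K)[j]) / (n + 1) for j in range(K)]
    vals = [c @ Q @ c for c in cands]
    nu = cands[int(np.argmax(vals))]      # greedy vertex choice
# nu now approximates the D-optimal weights; for this Q, the triangle rule of
# Proposition 3 (sides 2, 3, 4) gives (4/23, 9/23, 10/23).
```

The iterates stay bounded away from the vertices of the simplex, in line with Proposition 1, and the criterion value approaches its maximum $48/23$, in line with Theorem 2.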

Proof of Corollary 1. By continuity, there exists a neighborhood $U^*$ of $\theta_0$ on which the constraint admits the doses $d_1,\ldots,d_{K^*}$ and excludes $d_{K^*+1},\ldots,d_K$. By Theorem 1, there exists $N^*$, possibly depending on $\epsilon$, where $\epsilon$ is defined in (8), such that for $n \ge N^*$ the posterior probability $\Pi(U^*\mid\mathcal{D}_n)$ is greater than $1-\epsilon/2$. Hence, by the constraint (8), for $n \ge N^*$ the allocation will be restricted to $d_1,\ldots,d_{K^*}$. $\Box$

To prove Theorem 2, we require the following lemma.

Lemma 3. Let $\{s_n\}$ be a real sequence with $s_n \le M_2 < \infty$ for all $n$. Let $\eta_n = s_{n+1} - s_n$. If $\sum_n \eta_n 1(\eta_n < 0) > -\infty$, then $\lim_{n\to\infty} s_n = s$ for some $s\in(-\infty,\infty)$.


Proof. Let $M_3 = \sum_n \eta_n 1(\eta_n < 0)$. Then $s_n = s_1 + \sum_{k=1}^{n-1}\eta_k \ge s_1 + M_3$. Therefore $\{s_n\}$ is a bounded sequence and hence has accumulation points. Suppose $\lim_{n\to\infty} s_n$ does not exist. Then $\{s_n\}$ has at least two distinct accumulation points; let $t_1 < t_2$ be two of them. For each $n$, let $m_n$ denote the number of crossings from $[t_2-(t_2-t_1)/4, \infty)$ to $(-\infty, t_1+(t_2-t_1)/4]$ up to time $n$. Then $\sum_{k=1}^n \eta_k 1(\eta_k < 0) < -m_n(t_2-t_1)/2$. Because $t_1$ and $t_2$ are accumulation points, $\lim_{n\to\infty} m_n = \infty$. Hence $\lim_{n\to\infty}\sum_{k=1}^n \eta_k 1(\eta_k < 0) = -\infty$, and we have a contradiction. $\Box$
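A concrete instance of the lemma (illustrative only; Python standard library): the alternating sums below are bounded above, their negative increments are summable, and so the sequence converges.

```python
import math

# Illustration of Lemma 3: the partial sums s_n of (-1)^k / k^2 are bounded
# above, and the sum of the negative increments (the odd-k terms) is finite,
# so s_n converges.
s = 0.0
neg_total = 0.0
for k in range(1, 20001):
    inc = (-1) ** k / k ** 2
    s += inc
    if inc < 0:
        neg_total += inc
limit = -math.pi ** 2 / 12   # known value of sum_{k>=1} (-1)^k / k^2
```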

Proof of Theorem 2. Define the partition $\mathbb{N} = I \cup J$ of the set of natural numbers, where $I = \{i : Q(\nu_i) - Q(\nu_{i-1}) \ge 0\}$ and $J = I^c$ in $\mathbb{N}$. Because $Q(\cdot)$ is bounded above, by Lemma 3 it is enough to show that the sequence $\{Q(\nu_{j+1}) - Q(\nu_j)\}_{j\in J}$ is summable.

Let $n\in J$. Let $\nu^l_{n+1} = (1-\epsilon_n)\nu_n + \epsilon_n e_l$, $l = 1, 2, \ldots, K$, be the $K$ possible candidates at the $n$th iteration. Let $\Delta^l_n = Q(\nu^l_{n+1}) - Q(\nu_n)$ and $D_n = (\Delta^1_n, \ldots, \Delta^K_n)'$. Because $Q$ admits bounded second mixed derivatives, we have $D_n = \epsilon_n(I - \mathbf{1}\nu_n')\,\partial Q(\nu)/\partial\nu|_{\nu=\nu_n} + O(\epsilon_n^2)$, where $\mathbf{1}$ is the vector of ones and $I$ is the identity matrix. Therefore $\sum_{l=1}^K \nu_{n,l}\Delta^l_n = O(\epsilon_n^2)$. By assumption there exists a positive constant $M_4$ such that $Q(\nu) < M_4$ for $\nu\in\Delta$. Let $-M_4 < \Delta^{(1)}_n \le \Delta^{(2)}_n \le \cdots \le \Delta^{(K)}_n < 0$ be the ordered values of $\Delta^l_n$, $l = 1, 2, \ldots, K$. Also $\max\{\nu_{n,i} : 1\le i\le K\} > (K+1)^{-1}$ for all $n$. Let $i'$ be the index of the ordered $\Delta^l_n$ corresponding to $\max\{\nu_{n,i} : 1\le i\le K\}$. Then $|Q(\nu_{n+1}) - Q(\nu_n)| = |\Delta^{(K)}_n| \le |\Delta^{(i')}_n|$, so that $|Q(\nu_{n+1}) - Q(\nu_n)| < (K+1)|\sum_{l=1}^K \nu_{n,l}\Delta^l_n| = O(\epsilon_n^2)$. Because $\{\epsilon_n^2\}_{n\in J}$ is a subsequence of the summable sequence $\{\epsilon_n^2\}_{n\in\mathbb{N}}$, we have

$$\lim_{n\to\infty} Q(\nu_n) = Q(\nu^*).$$

Clearly, by continuity of $Q(\cdot)$, $d(B_\epsilon(\nu^*), A(\nu^*)) \to 0$ as $\epsilon\to0$, where $d(\cdot,\cdot)$ is a set distance and $B_\epsilon(\nu^*) = \{\nu\in\Delta : |Q(\nu) - Q(\nu^*)| < \epsilon\}$. Hence the result follows. Suppose we define connectivity on the set $A(\nu^*)$ as follows: two points $\nu_1$ and $\nu_2$ are connected if there exists a path $P_\nu$ from $\nu_1$ to $\nu_2$ such that $Q(\nu) = Q(\nu_1)$ for all $\nu\in P_\nu$. Then the sequence of iterations must converge to a connected subset of $A(\nu^*)$. $\Box$

Proof of Proposition 2. The observed information can be written as

$$M^*_n(\theta) = \sum_{i=1}^K \nu_{ni}\left[\frac{(\hat F(z_i) - F(z_i))\,\sigma^{-2}}{F(z_i)(1-F(z_i))}\right]\begin{pmatrix} f'(z_i) & f(z_i) + z_i f'(z_i)\\ f(z_i) + z_i f'(z_i) & 2z_i f(z_i) + z_i^2 f'(z_i)\end{pmatrix}$$
$$\quad + \sum_{i=1}^K \nu_{ni}\left[\frac{\hat F(z_i)}{F^2(z_i)} + \frac{1-\hat F(z_i)}{(1-F(z_i))^2}\right]\sigma^{-2}f^2(z_i)\begin{pmatrix}1 & z_i\\ z_i & z_i^2\end{pmatrix} := B_n(\theta) + D_n(\theta), \qquad \text{(A.4)}$$

where $z_i = (d_i - \mu)/\sigma$, $\hat F(z_i) = T_i(n)/N_i(n)$ is the observed proportion of 1's at dose level $d_i$, and $T_i(n)$ and $N_i(n)$ are the number of 1's and the total number of allocations at design point $d_i$, respectively.

We show that $B_n(\theta_0) \to 0$ a.s. $[P_{\theta_0}]$. It suffices to show that for each $i$, $\nu_{ni}(\hat F(z_i) - F(z_i)) \to 0$ a.s. $[P_{\theta_0}]$. Suppose $\nu_{ni}$ does not go to zero almost surely for some design point $d_i$. Then $N_i(n) \to \infty$ almost surely. Because, given the design point $d_i$, the observations are i.i.d. Bernoulli$(F(z_i))$, the strong law gives $\hat F(z_i) - F(z_i) \to 0$ almost surely. Hence $B_n(\theta_0) \to 0$ almost surely. The eigenvalues of $B_n(\theta)$ and $D_n(\theta)$ are continuous functions of $\theta$. Thus, for large $n$ we can find a neighborhood of $\theta_0$ where $B_n(\theta)$ is arbitrarily small and the eigenvalues of $B_n(\theta) + D_n(\theta)$ are uniformly close to those of $D_n(\theta)$. Specifically, given any $\epsilon > 0$, we can find a $\delta > 0$ such that $|\lambda_{\max}(M^*_n(\theta)) - \lambda_{\max}(D_n(\theta))| < \epsilon$ and $|\lambda_{\min}(M^*_n(\theta)) - \lambda_{\min}(D_n(\theta))| < \epsilon$ for all $\theta\in N_\delta(\theta_0)$. Thus, it is enough to show that the eigenvalues of $D_n(\theta)$ are bounded and bounded away from zero. Now, $D_n(\theta)$ is a convex combination of finitely many rank one matrices, i.e., $D_n(\theta) = \sum_{i=1}^K \nu_{ni}c_{ni}(\theta)A_i$, where $A_i$ are the rank one matrices $(1, z_i)'(1, z_i)$ and

$$c_{ni}(\theta) = \left[\frac{\hat F(z_i)}{F^2(z_i)} + \frac{1-\hat F(z_i)}{(1-F(z_i))^2}\right]\sigma^{-2}f^2(z_i).$$

The nonzero eigenvalue of $c_{ni}(\theta)A_i$ is $c_{ni}(\theta)(1+z_i^2)$. Thus, the largest eigenvalue of $D_n(\theta)$ satisfies

$$\lambda_{\max}(D_n(\theta)) \le \max_{1\le i\le K} c_{ni}(\theta)(1+z_i^2).$$

Since $\Theta$ is compact, it follows that $\lambda_{\max}(D_n(\theta)) \le L_{\max} < \infty$ for some constant $L_{\max}$. The matrix $D_n(\theta)$ is singular if and only if one of the allocation proportions is one and the rest are zero. Thus, the minimum eigenvalue $\lambda_{\min}(D_n(\theta))$ of $D_n(\theta)$ is a continuous nonnegative function of the design vector $\nu_n$ over the compact set $\Delta$, taking the value zero only at the vertices $e_i$ of the simplex. By Proposition 1, the vectors $\nu_n$ are bounded away from the vertices. Thus, there exists a constant $L_{\min,1}$ such that $0 < L_{\min,1} < \lambda_{\min}(D_n(\theta))$ for all $n$. $\Box$
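The decomposition (A.4) can be coded directly. A sketch for a logistic link (Python with numpy; the doses, weights, $\theta = (\mu, \sigma)$ and observed proportions below are illustrative values, not from the paper; for the logistic, $f = F(1-F)$ and $f' = f(1-2F)$):

```python
import numpy as np

def observed_information(d, nu, Fhat, mu, sigma):
    """Observed information M*_n = B_n + D_n as in (A.4), logistic link."""
    z = (d - mu) / sigma
    F = 1.0 / (1.0 + np.exp(-z))
    f = F * (1 - F)                       # logistic density
    fp = f * (1 - 2 * F)                  # its derivative
    B = np.zeros((2, 2))
    D = np.zeros((2, 2))
    for zi, Fi, fi, fpi, nui, Fh in zip(z, F, f, fp, nu, Fhat):
        A1 = np.array([[fpi, fi + zi * fpi],
                       [fi + zi * fpi, 2 * zi * fi + zi ** 2 * fpi]])
        B += nui * (Fh - Fi) / (sigma ** 2 * Fi * (1 - Fi)) * A1
        A2 = np.array([[1.0, zi], [zi, zi ** 2]])
        D += nui * (Fh / Fi ** 2 + (1 - Fh) / (1 - Fi) ** 2) * fi ** 2 / sigma ** 2 * A2
    return B + D, B, D

d = np.array([1.0, 2.0, 3.0])
nu = np.array([0.2, 0.5, 0.3])
mu, sigma = 2.0, 1.0
Fhat = 1.0 / (1.0 + np.exp(-(d - mu) / sigma))  # Fhat = F makes B vanish
M, B, D = observed_information(d, nu, Fhat, mu, sigma)
```

With $\hat F = F$ the term $B_n$ is exactly zero and $M^*_n = D_n$, which is positive definite here since the weights are interior to the simplex, mirroring the eigenvalue argument above.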


Proof of Theorem 3. Observe that, with $\pi_n(u) = e^{w_n(u)}/\int_{\mathbb{R}^2} e^{w_n(t)}\,dt$, we have

$$\int|\pi^*_n(u) - \pi_n(u)|\,du \le \int\left|\frac{e^{w_n(u)}\pi(\hat\theta_n + (nM^*_n)^{-1/2}u)}{\int e^{w_n(t)}\pi(\hat\theta_n + (nM^*_n)^{-1/2}t)\,dt} - \frac{e^{w_n(u)}\pi(\hat\theta_n + (nM^*_n)^{-1/2}u)}{\int e^{w_n(t)}\pi(\hat\theta_n)\,dt}\right| du$$
$$\quad + \int\left|\frac{e^{w_n(u)}\pi(\hat\theta_n + (nM^*_n)^{-1/2}u)}{\int e^{w_n(t)}\pi(\hat\theta_n)\,dt} - \frac{e^{w_n(u)}\pi(\hat\theta_n)}{\int e^{w_n(t)}\pi(\hat\theta_n)\,dt}\right| du$$
$$\le \frac{2\int e^{w_n(u)}\,|\pi(\hat\theta_n) - \pi(\hat\theta_n + (nM^*_n)^{-1/2}u)|\,du}{\int e^{w_n(u)}\pi(\hat\theta_n)\,du}.$$

Since $\hat\theta_n \to \theta_0$ a.s. $[P_{\theta_0}]$ and $\pi(\theta)$ is continuous and positive at $\theta_0$, the left-hand side of the above display goes to zero. Thus we may replace $\pi^*_n(u)$ by $\pi_n(u)$ in (17). Let $f_n(u) = \exp(w_n(u))$. Consider the ratio $\int_{\|u\|>\epsilon\sqrt{n}} f_n(u)\,du\big/\int_{\mathbb{R}^2} f_n(u)\,du$. The ratio is the posterior probability $P(\theta\notin N_{\epsilon,n}(\hat\theta_n)\mid\mathcal{D}_n)$, where $N_{\epsilon,n}(\theta) = \{\theta' : (\theta'-\theta)'M^*_n(\theta'-\theta) \le \epsilon^2\}$. Since $\hat\theta_n \to \theta_0$ a.s. $[P_{\theta_0}]$, there exists $\epsilon' > 0$ such that for all sufficiently large $n$, $N_{\epsilon',n}(\theta_0) \subseteq N_{\epsilon,n}(\hat\theta_n)$ a.s. $[P_{\theta_0}]$. By Proposition 2, there exists $\epsilon'' > 0$ such that $N_{\epsilon''}(\theta_0) \subseteq N_{\epsilon',n}(\theta_0)$, where $N_{\epsilon''}(\theta_0) = \{\theta : (\theta-\theta_0)'(\theta-\theta_0) \le (\epsilon'')^2\}$ is a Euclidean ball. Therefore,

$$P(\theta\notin N_{\epsilon,n}(\hat\theta_n)\mid\mathcal{D}_n) \le P(\theta\notin N_{\epsilon',n}(\theta_0)\mid\mathcal{D}_n) \le P(\theta\notin N_{\epsilon''}(\theta_0)\mid\mathcal{D}_n).$$

The uniform prior is proper because of the compactness of the parameter space. Thus, by Theorem 1, $P(\theta\notin N_{\epsilon''}(\theta_0)\mid\mathcal{D}_n) \to 0$. Therefore,

$$\frac{\int_{\|u\|\le\epsilon\sqrt{n}} f_n(u)\,du}{\int_{\mathbb{R}^2} f_n(u)\,du} \to 1. \qquad \text{(A.5)}$$

We can choose $\epsilon$ such that for large enough $n$ the region $\|u\|\le\epsilon\sqrt{n}$ is contained in a neighborhood of $\theta_0$ where the third derivatives of the log-likelihood with respect to the elements of $\theta$ are uniformly bounded, i.e., there exists $M_5$ such that if $\theta\in N_{\epsilon,n}(\hat\theta_n)$ then

$$n^{-1}\left|\frac{\partial^3 l_n(\theta)}{\partial\theta_i\partial\theta_j\partial\theta_k}\right| < M_5, \qquad i, j, k = 1, 2,$$

where $\theta_1 = \mu$ and $\theta_2 = \sigma$. Also, we can choose $\epsilon$ such that, by Proposition 2, $\|\theta - \hat\theta_n\| \le n^{-1/2}L_{\min}^{-1/2}\|u\|$ for all $\theta\in N_{\epsilon,n}(\hat\theta_n)$. Then, for $\theta\in N_{\epsilon,n}(\hat\theta_n)$ we have

$$|R_n(u)| \le 2M_5 L_{\min}^{-3/2} n^{-1/2}\|u\|^3 \le 2M_5 L_{\min}^{-3/2}\epsilon\,(u'u).$$

We can choose $\epsilon$ such that $2M_5 L_{\min}^{-3/2}\epsilon < 1/2$. Then

$$\int_{\|u\|\le\epsilon\sqrt{n}} \exp\left\{-\tfrac12 u'u + R_n(u)\right\} du \le \int \exp\left\{-\left(\tfrac12 - 2M_5 L_{\min}^{-3/2}\epsilon\right)u'u\right\} du = 2\pi(1 - 4M_5 L_{\min}^{-3/2}\epsilon)^{-1} < \infty.$$

Thus the sequence of functions $1(\|u\|\le\epsilon\sqrt{n})\exp\{-\frac12 u'u + R_n(u)\}$ is uniformly integrable, and hence

$$\int_{\|u\|\le\epsilon\sqrt{n}} \exp\left\{-\tfrac12 u'u + R_n(u)\right\} du \to \int \exp\left\{-\tfrac12 u'u\right\} du = 2\pi. \qquad \text{(A.6)}$$

Combining (A.5) and (A.6), we have $\int f_n(u)\,du \to 2\pi$. Therefore $\pi_n(u) \to \phi(u)$ a.s. $[P_{\theta_0}]$. Thus, (17) follows from Scheffé's theorem. $\Box$

Next we prove the asymptotic normality of the MLE $\hat\theta_n$, which is required in the proof of asymptotic normality of the Bayes estimator (Corollary 2). The proof of the theorem on asymptotic normality of the MLE (Theorem 4) relies on the following proposition, a straightforward generalization of Theorem 5.7 of van der Vaart (1998, p. 45). The proof of the proposition follows immediately from the proof of Theorem 5.7 of van der Vaart (1998, p. 46) and is omitted.

Let $\Theta$ be a metric space with a metric $d$ and let $\theta_0$ be an interior point of $\Theta$.

Proposition 6. Let $G_n$ and $H_n$ be random functions of $\theta$ such that the following conditions hold:

1. $\sup\{|G_n(\theta) - H_n(\theta)| : \theta\in\Theta\} \stackrel{P}{\to} 0$;
2. for every $\epsilon > 0$, $\sup\{H_n(\theta) : d(\theta,\theta_0) > \epsilon\} < H_n(\theta_0)$ for each $n \ge 1$.

Then any sequence of estimators $\hat\theta_n$ with $G_n(\hat\theta_n) \ge G_n(\theta_0) - o_P(1)$ converges in probability to $\theta_0$.


The generalization allows one to consider a sequence of random functions $H_n$ instead of a fixed deterministic function of $\theta$. This is possible only for sequences of functions satisfying Condition 2, which is quite restrictive in the sense that it requires every member of the sequence to have the same maximizer, $\theta_0$. However, in our application Condition 2 is satisfied.

Theorem 4. Suppose the data are obtained following the sequential scheme in which, conditional on the first $r$ observations, the $(r+1)$th design point is chosen according to (4) and the corresponding response arises from (1). Let $\hat\theta_n$ be the MLE of $\theta$. Then $\hat\theta_n$ is consistent at $\theta_0$ and $(nM^*_n(\theta_0))^{1/2}(\hat\theta_n - \theta_0) \to N(0, I)$ in law.

Proof. We will first show that the MLE is consistent; the asymptotic normality then follows almost immediately from the consistency result. To prove consistency of the MLE, we verify the conditions of Proposition 6. Note that $\hat\theta_n = \arg\max\{G_n(\theta) : \theta\in\Theta\}$, where

$$G_n(\theta) = n^{-1}\sum_{i=1}^n\left[Y_i\log\frac{F_\theta(X_i)}{F_{\theta_0}(X_i)} + (1-Y_i)\log\frac{1-F_\theta(X_i)}{1-F_{\theta_0}(X_i)}\right].$$

Let

$$H_n(\theta) = n^{-1}\sum_{i=1}^n\left[E_{\theta_0}(Y_i\mid\mathcal{D}_{i-1})\log\frac{F_\theta(X_i)}{F_{\theta_0}(X_i)} + (1-E_{\theta_0}(Y_i\mid\mathcal{D}_{i-1}))\log\frac{1-F_\theta(X_i)}{1-F_{\theta_0}(X_i)}\right]$$
$$= n^{-1}\sum_{i=1}^n\left[F_{\theta_0}(X_i)\log\frac{F_\theta(X_i)}{F_{\theta_0}(X_i)} + (1-F_{\theta_0}(X_i))\log\frac{1-F_\theta(X_i)}{1-F_{\theta_0}(X_i)}\right]$$
$$= -\sum_{i=1}^K \nu_{n,i}\,KL(\theta_0,\theta;d_i),$$

where $KL(\theta_0,\theta;d_i)$ is the Kullback--Leibler divergence between the Bernoulli distributions with means $F_{\theta_0}(d_i)$ and $F_\theta(d_i)$, respectively, and $\nu_{n,i}$ is the observed proportion of allocations at dose level $d_i$. By the properties of the Kullback--Leibler divergence, Condition 2 of Proposition 6 is satisfied for each function $-KL(\theta_0,\theta;d_i)$, and hence for any finite convex combination $-\sum_{i=1}^K \nu_{n,i}KL(\theta_0,\theta;d_i)$. Let $Z_n(\theta)$ be the stochastic process (indexed by $\theta$) defined as

$$Z_n(\theta) := \sqrt{n}\,(G_n(\theta) - H_n(\theta)) = n^{-1/2}\sum_{i=1}^n (Y_i - F_{\theta_0}(X_i))\,R(\theta,\theta_0;X_i), \qquad \text{(A.7)}$$

where

$$R(\theta,\theta_0;X_i) = \log\frac{F_\theta(X_i)/(1-F_\theta(X_i))}{F_{\theta_0}(X_i)/(1-F_{\theta_0}(X_i))}$$

is the log odds ratio between $F_\theta(X_i)$ and $F_{\theta_0}(X_i)$. Condition 1 requires that $\sup\{n^{-1/2}|Z_n(\theta)| : \theta\in\Theta\} \stackrel{P}{\to} 0$. We apply Theorem 2.2.4 of van der Vaart and Wellner (1996) to verify this assertion. Since $0 < F_\theta(x) < 1$ for all $\theta$ and $x$, and $\Theta$ is compact, a Taylor expansion shows that there exists a constant $L_1$ such that

$$|R(\theta_1,\theta_2;d)| \le L_1|F_{\theta_1}(d) - F_{\theta_2}(d)|$$

for all $\theta_1,\theta_2\in\Theta$ and all $d\in\mathcal{X}$. Another Taylor expansion then shows

$$|R(\theta_1,\theta_2;d)| \le L_1|F_{\theta_1}(d) - F_{\theta_2}(d)| \le \sqrt{2}\,L_1 M_1\|\theta_1-\theta_2\| := C_1\|\theta_1-\theta_2\|$$

for any $d\in\mathcal{X}$, where $M_1$ is defined in Theorem 1. Fix $\theta_1,\theta_2\in\Theta$. Define

$$V_i := (Y_i - F_{\theta_0}(X_i))\,R(\theta_1,\theta_2;X_i), \qquad i = 1, 2, \ldots, n.$$

Then,

$$E(V_i^4\mid\mathcal{D}_{i-1}) \le C_2 C_1^4\|\theta_1-\theta_2\|^4 := C_3\|\theta_1-\theta_2\|^4, \qquad \text{(A.8)}$$

where $C_2 = \max\{E(W_j - F_{\theta_0}(d_j))^4 : 1\le j\le K\}$ for $W_j \sim \mathrm{Bernoulli}(F_{\theta_0}(d_j))$, $j = 1, 2, \ldots, K$. Also, $\{V_i\}$ is a martingale difference sequence with respect to the filtration $\{\mathcal{D}_{i-1}\}$. Hence, by Burkholder's inequality (cf. Hall and Heyde, 1980, p. 23) and the fact that $(\sum_{i=1}^n a_i)^2 \le n\sum_{i=1}^n a_i^2$ for any $a_1,\ldots,a_n$, we have, for some absolute constant $A$,

$$A\,E\left(\sum_{i=1}^n V_i\right)^4 \le E\left(\sum_{i=1}^n V_i^2\right)^2 \le n\sum_{i=1}^n E(V_i^4).$$


By (A.8), we have $E(\sum_{i=1}^n V_i)^4 \le n^2 C_4\|\theta_1-\theta_2\|^4$ for some constant $C_4$. Let $\psi(x) = x^4$. Then the Orlicz norm $\|\cdot\|_\psi$ coincides with the $L_4$-norm, and we have

$$\|Z_n(\theta_1) - Z_n(\theta_2)\|_\psi = \left(E\left(n^{-1/2}\sum_{i=1}^n V_i\right)^4\right)^{1/4} \le C_4^{1/4}\|\theta_1-\theta_2\|.$$

By Theorem 2.2.4 of van der Vaart and Wellner (1996), the comment following its proof, and the fact that $Z_n(\theta_0) \equiv 0$ for all $n$, we have

$$\left\|\sup_{\theta\in\Theta}|Z_n(\theta)|\right\|_\psi \le K_1\int_0^{\mathrm{diam}\,\Theta}\psi^{-1}(D(\epsilon, \|\cdot\|, \Theta))\,d\epsilon,$$

where $K_1$ is a constant depending only on $\psi$ and $C_3$, $\psi^{-1}(x) = x^{1/4}$, and $D(\cdot, \|\cdot\|, \Theta)$ is the packing number of $\Theta$. Since $\Theta$ is a compact subset of $\mathbb{R}^2$, there exists a constant $L_3$ such that $D(\epsilon, \|\cdot\|, \Theta) \le L_3\epsilon^{-2}$. Thus,

$$\left\|\sup_{\theta\in\Theta}|Z_n(\theta)|\right\|_\psi \le K_1 L_3^{1/4}\int_0^{\mathrm{diam}\,\Theta}\epsilon^{-1/2}\,d\epsilon < \infty.$$

Then, by Markov's inequality we have

$$\sup_{\theta\in\Theta}|G_n(\theta) - H_n(\theta)| = \sup_{\theta\in\Theta} n^{-1/2}|Z_n(\theta)| = O_P(n^{-1/2}).$$

By definition, $G_n(\hat\theta_n) \ge G_n(\theta_0)$. Hence, by Proposition 6, we have consistency of the MLE. By Taylor expansion of the derivative of the log-likelihood function at the MLE, we have

$$0 = \sum_{i=1}^n u_i(\theta_0) - nM^*_n(\theta_0)(\hat\theta_n - \theta_0) + O_P(\|\hat\theta_n - \theta_0\|), \qquad \text{(A.9)}$$

where $u_i(\theta_0) = \partial/\partial\theta\,[L_i(\theta) - L_{i-1}(\theta)]|_{\theta=\theta_0}$ and $L_i(\theta)$ is the log-likelihood of the first $i$ observations. Then, by the fact that $\|\hat\theta_n - \theta_0\| = o_P(1)$ and that $(nM^*_n(\theta_0))^{-1/2}\sum_{i=1}^n u_i(\theta_0) \to N(0, I)$ (cf. Hall and Heyde, 1980, Section 6.2), we have the result. $\Box$

References

Abdelbasit, K.M., Plackett, R.L., 1983. Experimental designs for binary data. J. Amer. Statist. Assoc. 78, 90--98.
Babb, J., Rogatko, A., Zacks, S., 1998. Cancer phase I clinical trials: efficient dose escalation with overdose control. Statist. Med. 17, 1103--1120.
Bickel, P., Yahav, J., 1969. Some contributions to the asymptotic theory of Bayes solutions. Z. Wahrsch. Verw. Gebiete 11, 257--275.
Box, G.E.P., Hunter, W.G., 1965. Sequential design for experiments for nonlinear models. In: Proceedings of the IBM Scientific Computing Symposium on Statistics, October 21--23, 1963, pp. 113--137.
Chaloner, K., 1989. Bayesian design for estimating the turning point of a quadratic regression. Comm. Statist. Theory Methods 18, 1385--1400.
Chaudhuri, P., Mykland, P.A., 1993. Nonlinear experiments: optimal design and inference based on likelihood. J. Amer. Statist. Assoc. 88, 538--546.
Chaudhuri, P., Mykland, P.A., 1995. On efficient designing of nonlinear experiments. Statist. Sinica 5, 421--440.
Fedorov, V.V., 1972. Theory of Optimal Experiments. Academic Press, New York.
Ford, I., Silvey, S.D., 1980. A sequentially constructed design for estimating a non-linear parametric function. Biometrika 67, 381--388.
Ford, I., Titterington, D.M., Wu, C.F.J., 1985. Inference and sequential design. Biometrika 72, 545--551.
Ford, I., Titterington, D.M., Kistos, C.P., 1989. Recent advances in nonlinear experimental design. Technometrics 31, 49--60.
Ghosh, J.K., Ramamoorthi, R.V., 2003. Bayesian Nonparametrics. Springer, New York.
Haines, L.M., 1998. Optimal design for neural networks. In: Flournoy, N., Rosenberger, W.F., Wong, W.K. (Eds.), New Developments and Applications in Experimental Design. Institute of Mathematical Statistics, Hayward, pp. 152--162.
Haines, L.M., Perovozskaya, I., Rosenberger, W.F., 2003. Bayesian optimal designs for phase I clinical trials. Biometrics 59, 591--600.
Hall, P., Heyde, C.C., 1980. Martingale Limit Theory and its Application. Academic Press, New York.
Hu, F., Rosenberger, W.F., Zhang, L.-X., 2005. Asymptotically best response-adaptive randomization procedures. J. Statist. Plann. Inference 136, 1911--1922.
Hu, I., 1997. Strong consistency in stochastic regression models via posterior covariance matrices. Biometrika 84, 744--749.
Kiefer, J., Wolfowitz, J., 1960. The equivalence of two extremum problems. Canad. J. Math. 12, 363--366.
Lai, T.Z., 1994. Asymptotic properties of nonlinear least squares estimates in stochastic regression models. Ann. Statist. 22, 1917--1930.
Leonard, T., 1982. An inferential approach to the bioassay design problem. Technical Summary Report 2416, Mathematics Research Center, University of Wisconsin-Madison.
Minkin, S., 1987. On optimal design for binary data. J. Amer. Statist. Assoc. 82, 1098--1103.
O'Quigley, J., Pepe, M., Fisher, L., 1990. Continual reassessment method: a practical design for phase 1 clinical trials in cancer. Biometrics 46, 33--48.
Rosenberger, W.F., Haines, L.M., 2002. Competing designs for phase I clinical trials: a review. Statist. Med. 21, 2757--2770.
Rosenberger, W.F., Canfield, G.C., Perevozskaya, I., Haines, L.M., Hausner, P., 2005. Development of interactive software for Bayesian optimal phase I clinical trial design. Drug Inform. J. 39, 89--98.
Ross, S.M., 1996. Stochastic Processes. Wiley, New York.
Schwartz, L., 1965. On Bayesian procedures. Z. Wahrsch. Verw. Gebiete 4, 10--26.
Serfling, R., 1981. Approximation Theorems of Mathematical Statistics. Wiley, New York.
Shen, L.Z., O'Quigley, J., 1996. Consistency of continual reassessment method under model misspecification. Biometrika 83, 395--405.


Silvey, S.D., 1980. Optimal Design. Chapman & Hall, London.
Tsutakawa, R.K., 1972. Design of experiment for bioassay. J. Amer. Statist. Assoc. 67, 584--590.
Tsutakawa, R.K., 1980. Selection of dose levels for estimating a percentage point of a logistic quantal response curve. Appl. Statist. 29, 25--33.
van der Vaart, A., 1998. Asymptotic Statistics. Cambridge University Press, Cambridge.
van der Vaart, A., Wellner, J., 1996. Weak Convergence and Empirical Processes. Springer, New York.
White, L.V., 1975. Optimal design of experiments for non-linear models. Unpublished Ph.D. Thesis, Imperial College, London.
Whitehead, J., Brunier, H., 1995. Bayesian decision procedures for dose determining experiments. Statist. Med. 14, 885--893.
Wu, C.F.J., 1985. Asymptotic inference from sequential design in a nonlinear situation. Biometrika 72, 553--558.
Wynn, H., 1970. The sequential generation of D-optimal experimental designs. Ann. Math. Statist. 41, 1655--1664.
Zacks, S., 1977. Problems and approaches in design of experiments for estimation and testing in non-linear problems. In: Krishnaiah, P.R. (Ed.), Multivariate Analysis IV. North-Holland, Amsterdam, pp. 209--223.
Zacks, S., Rogatko, A., Babb, J., 1998. Optimal Bayesian-feasible dose escalation for cancer phase I trials. Statist. Probab. Lett. 38, 215--220.