
Journal of Biopharmaceutical Statistics, 22: 737–757, 2012
Copyright © Taylor & Francis Group, LLC
ISSN: 1054-3406 print/1520-5711 online
DOI: 10.1080/10543406.2012.678234

DOUBLY RANDOMIZED DELAYED-START DESIGN FOR ENRICHMENT STUDIES WITH RESPONDERS OR NONRESPONDERS

Qing Liu, Pilar Lim, Jaskaran Singh, David Lewin, Barry Schwab, and Justine Kent
Janssen Research & Development, LLC, Raritan, New Jersey, USA

Received October 31, 2011; accepted February 15, 2012. Address correspondence to Qing Liu, Janssen Research & Development, LLC, Route 202, PO Box 300, Raritan, NJ 08869, USA; E-mail: [email protected]

High placebo response has been a major source of bias and is difficult to deal with in many central nervous system (CNS) clinical trials. This bias has led to a high failure rate in mood disorder trials even with known effective drugs. For cancer trials, the traditional parallel group design, combined with the standard time-to-treatment-failure analysis, biases the inference on the maintenance effect of a new drug. To minimize bias, we propose a doubly randomized delayed-start design for clinical trials with enrichment. The design consists of two periods. In the first period, patients can be randomized to receive several doses of a new drug or a control. In the second period, control patients of the first period in an enriched population can be rerandomized to receive the same or fewer doses of the new drug or to continue on the control. Depending on the clinical needs, different randomization ratios can be applied to the two periods. The essential feature is that the design is naturally adaptive because of the randomization for the second period. As a result, other aspects of the second period, such as the sample size, can be modified adaptively when an interim analysis is set up for the first period. At the end of the trial, response data from both randomizations are combined in an integrated analysis. Because of the enrichment in the second period, the design increases the probability of trial success and, in addition, reduces the required sample size. Thus, for clinical development, the design offers greater efficiency.

Key Words: Adaptive design; Enrichment design; Maintenance effect; Proof-of-concept trials; Randomized start design; Sequential parallel design.

1. INTRODUCTION

High placebo response in many central nervous system (CNS) clinical trials leads to reduced sensitivity to distinguish an effective therapeutic agent from a placebo control. For example, clinical trials employing effective antidepressants are widely known for their high failure rates (Khin and Chen, 2011). There are many causes for high placebo response, not all well understood. One of the key factors is an expectation bias regarding the likelihood of receiving placebo (Papakostas and Fava, 2009; Sinyor et al., 2010; Tedeschini et al., 2010). Therefore, there is a need to develop new study designs that aim to reduce the placebo response. For cancer trials, it is often important to study the maintenance effect of a new treatment regimen for patients who have responded to an effective control induction therapy. However, with the traditional parallel group design, the analysis of the maintenance effect based on the standard time-to-treatment-failure endpoint may be biased by the differential induction response rate.

To address the issue of bias, we propose a doubly randomized delayed-start design that is accomplished in two periods. In the first period, patients are randomized to receive several doses of a new drug or placebo. At the end of the first period, an assessment is made to determine whether patients who had been randomized to receive placebo meet the reentry criteria for randomization in the second period. Those who meet the reentry criteria are rerandomized into the second period to receive the same or fewer doses of the new drug, depending on results of the first period, or to continue on placebo. The reentry criteria can consist of key enrichment elements, in terms of either efficacy or safety. For example, for antidepressant trials, enrichment is often made with placebo nonresponders from the first period. For oncology trials, enrichment can be defined to include treatment responders or patients whose disease has not progressed. To ensure interpretability of statistical inference, enrichment criteria using specific cutoffs on predetermined outcome measures or an adjudication process need to be specified in advance in the study protocol. Depending on the needs of the trial design or clinical considerations, different randomization ratios can be applied to the two periods. The essential feature is that the design is naturally adaptive because of the randomization for the second period. As a result, other aspects of the second period, such as the sample size, choice of the doses of the new drug, duration of the treatment, and follow-up, can be modified adaptively when an interim analysis is set up for the first period. At the end of the trial, efficacy data from both randomizations are integrated via an optimal combination test.

There is a rich history of employing designs with double randomizations for clinical trials in different therapeutic areas (see the survey article by Mills et al., 2007). An early application was described by Heyn et al. (1974) for pediatric acute lymphocytic leukemia, where patients who were initially randomized to receive a control regimen and who continued without CNS or marrow relapse were rerandomized to receive a new treatment regimen or one of the two controls. In modern cancer trials, double randomization schemes are often used by cooperative oncology groups funded by the National Cancer Institute (NCI) for the development of new drugs from biotechnology companies. For example, the Eastern Cooperative Oncology Group (ECOG) trial ECOG 4494 on the maintenance effect of rituximab for patients with diffuse large B-cell lymphoma employed a double randomization scheme, in which patients who responded to induction treatment with either an investigational regimen or a standard regimen were rerandomized to different maintenance treatments (Habermann et al., 2006). The results of this trial, along with supportive data from other trials, led to the maintenance indication for rituximab. According to the clinical trial registry www.ClinicalTrials.gov, the NCI provided trial details in November 1999 (with identifier NCT00003150), indicating that the trial was initiated in December 1997.

The idea of randomized delayed start was initiated by Dr. Leber of the U.S. Food and Drug Administration (FDA) Division of Neuropharmacological Drug Products during the period between 1994 and 1996 for degenerative neurologic diseases. Leber (1996) describes a randomized start design in which patients are randomized to receive a given treatment sequence (i.e., drug/drug, placebo/drug, or placebo/placebo) at baseline but are not actively rerandomized before entering period 2. More details on the regulatory background, scientific rationale and motivation, and further description of the design are provided in the discussion article by Leber (1997). The sequential parallel design was proposed by Fava et al. (2003); in comparison to the randomized start design of Leber (1996), it uses enrichment with the placebo nonresponders of the first period for initiation of treatment with the new drug in the second period. However, enrichment designs with nonresponders critically require randomization to ensure valid statistical inference and unbiased clinical conclusions (Temple, 1994). This design feature is not included in the sequential parallel design.

The randomized start design also motivated the concept of what we now call the (adaptive) doubly randomized delayed-start design in late 1997 for use in the study of a variety of neurologic and psychiatric drugs. During this time, weighted combination tests were developed for the simpler problem of sample size adjustment (Chi and Liu, 1999; Cui et al., 1999). The details for the doubly randomized delayed-start design, however, including sample size calculation and statistical analysis, have not been developed and reported as originally intended. A slight modification of the randomized start design was described by McDermott et al. (2002), for which the placebo patients in the first period are rerandomized to receive the drug or to continue on placebo. McDermott et al. (2002) provide a weighted method for statistical analysis for a general two-period factorial design, which includes the special case of the randomized start design of Leber (1996). In 2009, we proposed to the FDA at an end-of-phase-2 meeting the idea of rerandomizing placebo nonresponders in a confirmatory phase 3 clinical trial to receive one of two doses of an antidepressant or the placebo control, due to concerns of potential bias in the sequential parallel design with possibly unbalanced known or unknown prognostic factors at baseline. In an article by FDA staff, Chen et al. (2011) note that in sequential parallel designs, treatment assignment in period 2 is predetermined at the randomization of period 1 and raise concerns of bias “due to randomness and possible unbalanced dropouts among placebo non-responders” prior to period 2. They also show through theoretical development and simulation work that there is no need for the complex seemingly unrelated regression analysis proposed by Tamura and Huang (2007) when rerandomization of placebo nonresponders for the second period is in place.

As mentioned earlier, the proposed doubly randomized delayed-start design is by nature an adaptive design, which is different from the nonadaptive design by Chen et al. (2011). For applications in nonadaptive trial settings, the proposed design is also distinct in terms of the statistical method, sample size calculation, and flexibility in randomization.

This article uses antidepressant drug development to illustrate the basic construction of clinical trials with the doubly randomized delayed-start design. Therefore, throughout the remaining article, we only consider enrichment with placebo nonresponders. The general theory and methods apply to both clinical settings. We show in section 3.1 that enrichment prior to the second randomization can greatly enhance the usefulness of the clinical outcomes for the resulting placebo nonresponders. Specifically, compared to patients randomized in the first period, clinical outcomes of the placebo nonresponders for the second period are not only far less variable but also have a higher correlation. As a result, the placebo nonresponders are expected to be more sensitive to treatment with an effective new drug, or to be nonresponsive if they are randomized to continue with placebo. Central to the proposed design is the second randomization. The enrichment process can introduce unexpected selection bias in comparing the effect of the new drug with placebo for the placebo nonresponders if the second randomization is not put in place. Therefore, this randomized delayed-start enrichment for the second period would effectively address the placebo response issue and, consequently, increase the effect size while avoiding the potential bias in the statistical analysis and clinical conclusions. In section 2, we develop an optimal combination test statistic, as well as a simple closed-form formula for sample size calculation. Both fully take advantage of the improved sensitivity and predictability of the enrichment. In section 3.2, we describe a conditional analysis of covariance (ANCOVA) model for each period. We construct the test statistics for the two periods, which are then used in the optimal combination test statistic for hypothesis testing. The proposed test method only relies on a standard ANCOVA, and therefore no special statistical procedure (e.g., the seemingly unrelated regression analysis) is needed. For settings with binary or time-to-event endpoints, we use the standard logistic regression model or the Cox proportional hazards model for analysis.

As alluded to earlier, the second randomization also brings additional benefits that are not possible with the sequential parallel design. As shown in section 2.2, the proposed doubly randomized delayed-start design is naturally adaptive and utilizes the combination test statistic proposed by Cui et al. (1999). Following the general measure-theoretic framework of Liu et al. (2002) for adaptive designs, we provide the justification for the proposed design in the appendix for any type of clinical endpoint, not just for continuous endpoints following normal distributions. For clinical development in general, this approach can be used for early clinical programs such as (Phase 2a) proof-of-concept trials or (Phase 2b) dose-finding trials, as well as confirmatory Phase 2b/3 combination or Phase 3 trials, which may include adaptive features such as sample size modification or adaptive dose finding. The design can increase the probability of trial success while requiring a reduced sample size as compared to a standard parallel group design. In addition, the design can be further expanded to include a randomized withdrawal for patients who are randomized to receive the new drug in the first period (see the brief description in section 6). Because of all these features, the doubly randomized delayed-start design can substantially increase the efficiency (i.e., in cost and resources) and effectiveness (i.e., in probability of technical and regulatory success) of antidepressant clinical development.

In section 4.1 we provide an illustrative example of a proof-of-concept trial where a single dose of a new drug is compared to placebo. We also compare the required sample size to that of the sequential parallel design, as well as the traditional parallel group design. In section 4.2, we present a Phase 2b trial with two doses of a new drug and illustrate how to apply the proposed sample size procedure as well as existing multiple trend test and closed testing procedures. In section 5, we provide simulation studies to confirm the adaptive measure-theoretic theory for the combination test and, in addition, show that the weights in the combination test can also depend on the actual sample size randomized to each period without inflating the type 1 error rates.

The proposed design avoids the potential bias that may be introduced if the placebo nonresponders in the placebo–drug and placebo–placebo sequences are not comparable. At a minimum, the proposed design is more efficient and offers greater inferential interpretability. Because of its adaptive nature, the design also offers greater flexibility that benefits clinical development.

2. DESIGN

2.1. Description

To develop the theory and method, we consider the simple setting where patients are randomized in both periods to receive either one dose of the new drug or placebo. We then illustrate in section 4.2 how to apply the theory and method in an application with two doses of a new drug. The design consists of two periods. At the beginning of period 1, patients are assessed for their baseline variables and then randomized to receive either placebo or treatment(s) with the new drug. Patients are then treated during the first period. During the first period, patients are allowed to drop out due to lack of efficacy or safety concerns. Patients who are treated with the new drug in the first period may continue with their treatment(s) in the second period. At the end of the first period, patients are evaluated for their response to the assigned treatment and are classified as responders or nonresponders. For the second period, patients who received placebo, have not dropped out during the first period, and are nonresponders at the end of the first period are randomized to receive the treatment with the new drug or to continue on placebo in the second period. The rerandomized placebo patients are then treated during the second period, and at the end of the second period they are evaluated for their final clinical outcomes. Note that the durations of the two periods may be the same, but this is not required. Also, the randomization ratios need not be balanced.

2.2. General Theory

For each period, patients are evaluated with respect to a clinical endpoint. The endpoints for the two periods are not required to be the same for the proposed design. Let $\theta_1$ be the parameter comparing the onset effect of the new drug to placebo with the period 1 endpoint. For the second period, let $\theta_2$ be the parameter with the period 2 endpoint for comparing the delayed-start effect of the new drug to placebo. We are interested in testing against the global null hypothesis

$$\mathcal{H}_0: \theta_1 \le 0 \ \text{ and } \ \theta_2 \le 0$$

in favor of the alternative hypothesis

$$\mathcal{H}_A: \theta_1 > 0 \ \text{ or } \ \theta_2 > 0.$$

For $\theta_1$ and $\theta_2$, the test statistics against the individual null hypotheses $\mathcal{H}_{01}: \theta_1 \le 0$ and $\mathcal{H}_{02}: \theta_2 \le 0$ are denoted by $Z_1$ and $Z_2$, respectively. To establish the efficacy of the new drug, we combine $Z_1$ and $Z_2$ via

$$Z = \omega_1^{1/2} Z_1 + \omega_2^{1/2} Z_2 \qquad (1)$$

for certain prespecified weights $\omega_1$ and $\omega_2$ such that $\omega_1 + \omega_2 = 1$.

Note that the combination test statistic is widely used for two-stage adaptive designs in the literature (see Cui et al., 1999) with a single randomization. The difference here is that $Z_1$ and $Z_2$ correspond to two different randomizations. In the following, we define $Z_1$ and $Z_2$ and establish that the test $Z \ge z_\alpha$, where $z_\alpha$ is the critical value of the standard normal distribution at the significance level $\alpha$, controls the type 1 error rate at $\alpha$.

Let $\mathcal{A}$ be the $\sigma$-field of the first period data from all randomized patients. Assume that for testing against the null hypothesis $\theta_1 \le 0$ in favor of the alternative hypothesis $\theta_1 > 0$ there is a p-value $p_1$, which is $\mathcal{A}$ measurable, such that $P_{H_{01}}(p_1 \le \alpha) \le \alpha$ for all $\alpha \in (0, 1)$, where $H_{01}: \theta_1 = 0$. Let $\Phi(\cdot)$ be the cumulative distribution function of the standard normal distribution. Define $Z_1$ to be the normal inverse of $p_1$, that is, $Z_1 = \Phi^{-1}(1 - p_1)$. Then under the null hypothesis $H_{01}$, the normal inverse test statistic $Z_1$ is not stochastically larger than the standard normal distribution.

The construction of $Z_2$ is more involved, as the trial involves a selection process for patients who are randomized to receive placebo in the first period. Let $\mathcal{F}$ be the $\sigma$-field of the first period data of all patients who are randomized to receive placebo in the first period. Let $g$ represent the process for selecting patients, from those who are randomized to receive placebo in the first period, to be randomized for the second period. Following the description of the design, $g$ involves excluding patients for various reasons of dropout or patients whose outcomes meet the criteria for treatment responders at the end of the first period. In general, $g$ cannot be fully specified. However, it can be assumed that $g$ is an $\mathcal{F}$ measurable function with range $M$, whose elements represent various choices of subsets of patients to be rerandomized. In Liu et al. (2002), $g$ is known as an adaptation rule. For each $m \in M$, let $p_{2m}$ be a p-value for testing against the null hypothesis $\mathcal{H}_{02}: \theta_2 \le 0$ in favor of the alternative hypothesis $\mathcal{H}_{A2}: \theta_2 > 0$, such that $P_{H_{02}}(p_{2m} \le C \mid \mathcal{F}) \le C$ for any $\mathcal{F}$ measurable function $C \in (0, 1)$, where $H_{02}: \theta_2 = 0$. Following Liu et al. (2002), the adaptive p-value is given by

$$p_{2g} = \sum_{m \in M} p_{2m} I_{\{g = m\}} \qquad (2)$$

where $I_{\{g = m\}}$ is the indicator of the event $\{g = m\}$. By the adaptation theory of Liu et al. (2002), we establish the following theorem.

Theorem.

(a) $P_{H_{02}}(p_{2g} \le C \mid \mathcal{F}) \le C$.

(b) Let $Z_2 = \Phi^{-1}(1 - p_{2g})$. Then under the null hypothesis $H_0: \theta_1 = \theta_2 = 0$, the test statistic

$$Z = \omega_1^{1/2} Z_1 + \omega_2^{1/2} Z_2 \qquad (3)$$

is not stochastically larger than the standard normal distribution.


The proof of the theorem is given in the appendix. Following the theorem, the combination test statistic $Z$ can be used to test against the null hypothesis $\mathcal{H}_0$ in favor of a more specific alternative hypothesis at a specified significance level $\alpha$. By construction, the type 1 error rate of the combination test is controlled even though the sample size for the second period is random, and irrespective of whether dropout at the end of period 1 is informative or not. The results of the theorem remain valid even if the selection process $g$ is expanded to depend on comparative first period data, that is, if $g$ is expanded to be an $\mathcal{A}$ measurable function. The theorem relies on the p-values $p_1$, and $p_{2m}$ for $m \in M$, not being stochastically smaller than uniform distributions. This allows various types of endpoints, including continuous, binary, and time-to-event endpoints. Therefore, for clinical investigations in general, this approach can be used for confirmatory Phase 3 trials, as well as for early clinical development such as (Phase 2a) proof-of-concept trials or (Phase 2b) dose-finding trials. Either an asymptotic or a randomization-based exact justification for this assumption rests on the randomizations at the beginning of each period. Without the second randomization, as in the design of Fava et al. (2003), there is no guarantee of uniformity of $p_{2m}$ for $m \in M$.

For continuous endpoints following normal distributions, Chen et al. (2011) develop a test procedure based on a weighted average of estimates for designs incorporating double randomizations. They show that the estimates from the two periods have (asymptotically) zero correlation under a constancy assumption on the correlation between the continuous endpoint measures. The proof also requires that the sample size for each period be fixed in advance. With the proposed doubly randomized delayed-start design, the lack of the constancy assumption on the correlation, as well as the lack of the normality assumption, can be easily handled with exact randomization tests.

Also note that the theorem assumes the weights $\omega_k$ for $k = 1, 2$ are prespecified. This assumption, however, can be relaxed to allow the weights to depend on blinded first period data. We can easily justify this following the blinded adaptation theory by Liu and Chi (2010a). The prespecified weights $\omega_k$ for $k = 1, 2$ are given in section 2.3.

Standard analysis for two-group comparisons provides estimates of the onset and delayed-start effects, $\theta_1$ and $\theta_2$.
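To make the combination test concrete, here is a minimal sketch (Python; not from the article, and the function name and defaults are illustrative assumptions) that turns the two one-sided p-values into $Z_1$, $Z_2$, and the combined statistic of equation (3).

```python
# Sketch of the combination test Z = w1^(1/2) Z1 + w2^(1/2) Z2 from two
# one-sided p-values; w1 is a prespecified weight and w2 = 1 - w1.
from scipy.stats import norm

def combination_test(p1, p2g, w1, alpha=0.05):
    """Return the combined statistic and the decision Z >= z_alpha."""
    z1 = norm.ppf(1.0 - p1)    # normal inverse of the period 1 p-value
    z2 = norm.ppf(1.0 - p2g)   # normal inverse of the adaptive period 2 p-value
    z = w1 ** 0.5 * z1 + (1.0 - w1) ** 0.5 * z2
    return z, z >= norm.ppf(1.0 - alpha)

# Example: equal weights, one-sided alpha = .05.
z, reject = combination_test(p1=0.03, p2g=0.02, w1=0.5)
```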

2.3. Sample Size

Let $r_k$ be the randomization ratio of patients receiving placebo to patients receiving the treatment with the new drug for period $k$, where $k = 1, 2$. The numbers of patients for the treatment group with the new drug and the placebo group are $n_k$ and $r_k n_k$, respectively. Assume that for $k = 1$ or 2, $Z_k$ asymptotically follows a normal distribution with mean

$$E[Z_k] = (n_k R_k)^{1/2}\gamma_k$$

for $R_k = r_k/(1 + r_k)$ and $\gamma_k = \delta_k/\sigma_k$, where $\delta_k$ is the treatment difference and $\sigma_k$ some standard deviation, and with variance $\mathrm{Var}[Z_k] = 1$.

With this canonical form (see Jennison and Turnbull, 2000, p. 49), we provide a uniform approach to sample size calculation. For a normally distributed endpoint, $\sigma_k$ is the standard deviation of the normal distribution. For a binary endpoint, $\sigma_k$ is the standard deviation based on the pooled success rate of the two treatment groups. Fava et al. (2003) provide a sample size calculation for the sequential parallel design with binary endpoints; however, there is no sample size calculation method for continuous endpoints, and determination of the sample size has to rely on time-consuming simulation studies (Tamura and Huang, 2007).

Under an alternative hypothesis, $Z_1$ and $Z_2$ may not be independent in general. To simplify the sample size calculation, we follow Chen et al. (2011) and assume that the correlation coefficients between measurements at the first and second periods are identical for both treatment groups. Then $Z_1$ and $Z_2$ are asymptotically independent. As a result, the test statistic $Z$ given in equation (3) follows an asymptotic normal distribution with mean

$$E[Z] \approx \omega_1^{1/2}(n_1 R_1)^{1/2}\gamma_1 + \omega_2^{1/2}(n_2^* R_2)^{1/2}\gamma_2 \qquad (4)$$

and variance $\mathrm{Var}[Z] = 1$, where $n_2^*$ is the expected number of placebo nonresponders who are randomized to receive the treatment with the new drug.

Now assume further that $\gamma_k \ge 0$ for $k = 1, 2$. Then maximizing the power of the test $Z \ge z_\alpha$ is equivalent to maximizing the expectation $E[Z]$ given in equation (4) with respect to $\omega_k$ for $k = 1, 2$, subject to the constraint $\omega_1 + \omega_2 = 1$. This leads to optimal weights $\omega_k^*$ for $k = 1, 2$. By the method of Lagrange multipliers, we obtain

$$\omega_1^* = \frac{n_1 R_1 \gamma_1^2}{n_1 R_1 \gamma_1^2 + n_2^* R_2 \gamma_2^2} \qquad (5)$$

and $\omega_2^* = 1 - \omega_1^*$. Let $\psi$ be the rate of attrition of the placebo patients from period 1 due to dropout or exclusion of responders. Then, for the second period, the expected number of placebo nonresponders who are randomized to receive the treatment with the new drug is

$$n_2^* = r_1 n_1 (1 - \psi)/(1 + r_2),$$

and the expected number of placebo nonresponders who are randomized to continue on placebo is

$$r_2 n_2^* = r_1 r_2 n_1 (1 - \psi)/(1 + r_2).$$

Using the optimal weights given in equation (5), $E[Z]$ in equation (4) becomes

$$E[Z] = (n_1 R)^{1/2}\gamma_1 \qquad (6)$$

for

$$R = R_1\bigl[1 + \{R_2(1 - R_2)/(1 - R_1)\}(1 - \psi)\lambda^2\bigr],$$

where $\lambda = \gamma_2/\gamma_1$. For given type 1 and 2 error rates $\alpha$ and $\beta$, the required sample size $n_1$ follows the equation

$$n_1 R = (z_\alpha + z_\beta)^2/\gamma_1^2, \qquad (7)$$

where $z_\alpha$ and $z_\beta$ are critical values of the standard normal distribution at $\alpha$ and $\beta$.

Note that $n_2^*$ and $r_2 n_2^*$ are expected numbers of placebo nonresponders to be randomized to the new drug or placebo. The actual number of placebo nonresponders to be randomized, that is, $n_2(1 + r_2)$, is a random variable. As this variability is not accounted for in the sample size calculation procedure, the actual power is expected to be less than the nominal level. This is illustrated in section 5.3 through simulation studies. An ad hoc fix for this problem is to identify the amount of power loss through simulation studies under different designs and then use an adjusted power, which is also illustrated in section 5.3.
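The closed-form calculation in equations (5) to (7) is easy to implement. The sketch below (Python; the function name and the ceiling convention are our own assumptions, not the authors' program) returns the period 1 sample size per drug-treated arm, the expected number of rerandomized nonresponders assigned to the drug, and the optimal first-period weight.

```python
# Sample size per equations (5)-(7): r1, r2 are placebo-to-drug randomization
# ratios, psi is the attrition rate of period 1 placebo patients (dropouts plus
# responders), and sigma1, sigma2 are the standard deviations used for gamma_k.
import math
from scipy.stats import norm

def sample_size(delta1, delta2, sigma1, sigma2, r1, r2, psi, alpha, beta):
    g1, g2 = delta1 / sigma1, delta2 / sigma2        # gamma_k = delta_k / sigma_k
    R1, R2 = r1 / (1 + r1), r2 / (1 + r2)            # R_k = r_k / (1 + r_k)
    lam = g2 / g1
    R = R1 * (1 + (R2 * (1 - R2) / (1 - R1)) * (1 - psi) * lam ** 2)
    n1 = math.ceil((norm.ppf(1 - alpha) + norm.ppf(1 - beta)) ** 2 / (g1 ** 2 * R))
    n2_star = r1 * n1 * (1 - psi) / (1 + r2)         # expected rerandomized to drug
    w1_opt = n1 * R1 * g1 ** 2 / (n1 * R1 * g1 ** 2 + n2_star * R2 * g2 ** 2)
    return n1, n2_star, w1_opt
```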

3. STATISTICAL MODELS

3.1. Enrichment with Nonresponders

Enrichment with nonresponders is an appealing clinical concept, as patients who are responders tend to remain responders and, as a result, reduce the ability to detect a treatment difference between an effective drug and placebo (Temple, 1994). However, less is known about the characteristics of the nonresponders in terms of their clinical outcome trajectory, the variability, and the correlation between clinical outcome measures.

We present results of an analysis of the 17-item Hamilton Depression Rating Scale (HAMD17) total scores of placebo patients from a randomized, double-blind, parallel group trial to study the efficacy and safety of a new drug in patients with major depressive disorder (MDD). A balanced randomization was planned with 75 patients allocated to each treatment group. Study duration included 1 week of screening, 1 week of washout, 7 weeks of treatment with patients monitored weekly, and 1 week of follow-up. While the primary endpoint for the study was the week 7 change from baseline in the Montgomery–Asberg Depression Rating Scale (MADRS) total score, the week 7 change from baseline in the HAMD17 total score was a key secondary endpoint. This dataset is used to illustrate the design of the proof-of-concept trial in section 4.1.

In Table 1, we provide the means and standard deviations of changes in HAMD17 total scores and correlations between HAMD17 total scores for all randomized placebo patients as well as for placebo nonresponders at week 4. For all randomized patients, a change in the HAMD17 total score is calculated as the difference between the weekly HAMD17 total score and baseline. A treatment nonresponder is defined as a patient with an HAMD17 total score greater than 18 at week 4. For treatment nonresponders, a change in the HAMD17 total score is the difference of the HAMD17 total score at week 5, 6, or 7 from week 4. Pearson's correlation coefficient $\rho$ is used to assess potential relationships between the baseline or week 4 HAMD17 total score and the subsequent week's HAMD17 total score. A full analysis of this dataset was performed, as mentioned in sections 4.1 and 5.2; due to its scale, it is not presented here.


Table 1 Properties of enrichment with placebo nonresponders

                    All patients                     Nonresponders
Week      N     Mean      SD      ρ          N     Mean      SD      ρ
1         69    −4.1      4.2    .504
2         71    −7.1      6.3    .452
3         71    −9.3      7.3    .356
4         71    −10.5     7.6    .245
5         71    −11.7     8.1    .062       33    −1.76     4.1    .582
6         71    −11.3     7.7    .191       33    −.91      3.5    .554
7         71    −12.4     8.3    .184       33    −1.42     3.8    .628

It is seen that for all placebo patients the mean change score decreases over time. This observation is consistent with widely known reports of analyses of databases conducted by the FDA. A reference for these reports is provided in Chen et al. (2011). In addition to the decrease of the mean change scores, we also see that the standard deviation increases over time while the correlation decreases. More than 50% of the patients responded to treatment with placebo at week 4. By enriching with placebo nonresponders, the mean change scores are stabilized, the standard deviations are reduced by more than 50%, and the correlations are increased by roughly 200% or more.

While this dramatic result testifies to the essence of enrichment with placebo nonresponders, it also raises concerns about selection bias in designs (e.g., the sequential parallel design) where rerandomization to treatment with the drug or placebo is not in place for placebo nonresponders. This is the case when the initial randomization fails to balance known or unknown prognostic factors, which is exacerbated during period 1 by differential and informative dropouts (Chen et al., 2011). We note that an assessment of the extent of bias is not available, as simulation studies of the sequential parallel design are limited to settings without an imbalance of the prognostic factors.

3.2. Conditional ANCOVA Model

The general theory developed in section 2.2 does not impose a specific distributional requirement on the endpoint in question. To reflect the trial examples given in section 4, we construct statistical models that allow continuous endpoints whose measures can be analyzed via a traditional ANCOVA. We show how to use the models to derive the standard deviations $\sigma_k$ for $k = 1, 2$ in section 2.3 for sample size calculation. The models also provide the basis for simulation studies to evaluate the operating characteristics of the doubly randomized design. The ANCOVA approach can easily be extended to settings with more sophisticated random effect models for continuous endpoints, logistic regression models for binary endpoints, or Cox's proportional hazards model for time-to-event endpoints.

Let the random vector $(Y_0, Y_1, Y_2)$ represent the patient measurements at baseline, period 1, and period 2, respectively. The baseline measure $Y_0$ follows a normal distribution with mean $\mu_0$ and standard deviation $\sigma_0$. To reflect patient entry criteria, $Y_0$ is truncated by a lower cutoff $y_L$ from below. Thus, for patients who are enrolled into the trial, the resulting baseline measure $Y_0$ follows a truncated normal density. For periods 1 and 2, we use conditional normal distributions for $Y_1$ given $Y_0 = y_0$ and $Y_2$ given $Y_1 = y_1$. This serves two purposes. First, patients must meet entry criteria (i.e., baseline $Y_0$ above $y_L$, or being nonresponders) to be randomized in each period. The conditional distributions are unaffected by these entry criteria. Second, as the analysis for the first or second period is based on the change score from the respective baseline, the conditional normal distributions ensure that the change scores are also conditionally normally distributed. As shown below, the conditional normal models motivate the traditional ANCOVA.

Let $Y_{1i}$ be the response for period 1, where $i = 0$ or 1 indicates that the patient receives placebo or treatment with the new drug. Assume that, without the truncation on $Y_0$ by the entry criteria, $Y_0$ and $Y_{1i}$ follow a bivariate normal distribution with mean vector $(\mu_0, \mu_{1i})$, correlation $\rho_1$, and variances $\sigma_0^2$ and $\sigma_1^2$. It is well known that the conditional distribution of $Y_{1i}$ given $Y_0 = y_0$ is

$$Y_{1i} \mid y_0 \sim N\{\mu_{1i} + \rho_1 (y_0 - \mu_0)\sigma_1/\sigma_0,\ \sigma_1^2(1 - \rho_1^2)\}. \qquad (8)$$

Let $\Delta Y_{1i} = Y_{1i} - Y_0$; then

$$\Delta Y_{1i} \mid y_0 \sim N\{\mu_{1i} + \rho_1 (y_0 - \mu_0)\sigma_1/\sigma_0 - y_0,\ \sigma_1^2(1 - \rho_1^2)\}. \qquad (9)$$

During period 1, patients can drop out due to lack of efficacy or safety concerns. At the end of period 1, the placebo nonresponders are rerandomized to receive the new drug or placebo for period 2. To ensure that this process does not introduce selection bias in the inference, we consider conditional distributions of $Y_{2j}$ given $Y_{10} = y_{10}$, where $j = 1$ or 0 indicates whether patients are randomized to receive the new drug or placebo. Similarly, assume that, without the truncation on $Y_{10}$ by the reentry criteria, $Y_{10}$ and $Y_{2j}$ follow a bivariate normal distribution with mean vector $(\mu_{10}, \mu_{2j})$, correlation $\rho_2$, and variances $\sigma_1^2$ and $\sigma_2^2$. Then

$$Y_{2j} \mid y_{10} \sim N\{\mu_{2j} + \rho_2 (y_{10} - \mu_{10})\sigma_2/\sigma_1,\ \sigma_2^2(1 - \rho_2^2)\}. \qquad (10)$$

For $\Delta Y_{2j} = Y_{2j} - Y_{10}$,

$$\Delta Y_{2j} \mid y_{10} \sim N\{\mu_{2j} + \rho_2 (y_{10} - \mu_{10})\sigma_2/\sigma_1 - y_{10},\ \sigma_2^2(1 - \rho_2^2)\}. \qquad (11)$$

The conditional models given by equation (9) and equation (11) provide the basis for conditional ANCOVA models for constructing the test statistics $Z_1$ and $Z_2$ for the combination test in equation (3). For the first period, $\theta_1 = \mu_{11} - \mu_{10}$ is the treatment effect between the new drug and placebo. Equation (9) suggests the conditional ANCOVA model for $\Delta Y_{1i}$, including a treatment contrast for $\theta_1$, the baseline measure $y_0$, and other potential factors as model predictors. It is immediately obvious that the conditional ANCOVA model, following the conditional model in equation (9), reduces the standard deviation to $\tau_1 = \sigma_1(1 - \rho_1^2)^{1/2}$, resulting in a more powerful analysis. In case the randomization for the first period does not balance out the baseline values $y_0$, the conditional ANCOVA model reduces the bias in the inference. The test statistic $Z_1$ for $\theta_1$ can be based on the Wald statistic for the treatment contrast, which is easily constructed from its reported point estimate and standard error.

The parameter of interest for the second period is $\theta_2 = \mu_{21} - \mu_{20}$. The conditional ANCOVA model on $\Delta Y_{2j}$ includes the treatment contrast for $\theta_2$, the baseline measure $y_{10}$, and other potential factors. It is for this conditional ANCOVA model that the need for the second randomization stands out. Any imbalance in either $y_0$ or $y_{10}$ is adjusted, providing the basis for an unbiased inference on $\theta_2$. The sequential parallel design by Fava et al. (2003) lacks this critical feature. It is unclear from Tamura and Huang (2007) whether a change score from the period 1 baseline is used for both periods or whether the seemingly unrelated regression model necessarily includes any baseline values. Chen et al. (2011) do not explicitly define the endpoint; however, it is clear that their ANCOVA model includes the baseline values of period 2 when placebo nonresponders are rerandomized. An important benefit of the enrichment with placebo nonresponders is also reflected by the conditional ANCOVA model for period 2, for which the standard deviation is dramatically reduced to $\tau_2 = \sigma_2(1 - \rho_2^2)^{1/2}$ due to the large increase of the correlation from $\rho_1$ to $\rho_2$ (see section 4).

It is noted that the standard deviations for the conditional ANCOVA models of periods 1 and 2 are $\tau_1 = \sigma_1(1 - \rho_1^2)^{1/2}$ and $\tau_2 = \sigma_2(1 - \rho_2^2)^{1/2}$, respectively. These standard deviations, rather than those from unadjusted change-from-baseline scores, play the role of $\sigma_1$ and $\sigma_2$ in the sample size formula of section 2.3 and are used to calculate the sample size for the examples in section 4.
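As an illustration of how the per-period test statistics can be obtained in practice, the following sketch (Python with statsmodels; the data frame layout and column names are assumptions for illustration) fits the conditional ANCOVA on the change score and extracts the Wald statistic for the treatment contrast, which plays the role of $Z_1$ or $Z_2$ in equation (3).

```python
# Conditional ANCOVA for one period: change score regressed on the period
# baseline and a 0/1 treatment indicator; the Wald statistic (estimate divided
# by its standard error) for the treatment term serves as Z_k.
import pandas as pd
import statsmodels.formula.api as smf

def ancova_wald_z(df: pd.DataFrame) -> float:
    """df has columns 'change' (outcome minus baseline), 'baseline', 'treat'."""
    fit = smf.ols("change ~ baseline + treat", data=df).fit()
    return fit.params["treat"] / fit.bse["treat"]
```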

4. APPLICATION

4.1. Proof of Concept

Clinical trials in mood disorders are known to have large failure rates, mainly because a large percentage of patients in the trial respond to placebo treatment. With a doubly randomized delayed-start design, the effect of a new drug can be further evaluated in placebo nonresponders. This design consists of a 2-week placebo lead-in phase and a drug testing phase with two 4-week periods, where both phases are double-blind. The purpose of the placebo lead-in phase is to screen out potential placebo responders. This is important, as without the placebo lead-in, the placebo response rate at the end of period 1 (or week 4), as seen from Table 1, can be as high as 50%. In the drug testing phase, eligible patients are randomized with a one-to-one ratio in period 1 to receive either a new test drug or placebo. Treatment in the second period is based on the patient's period 1 treatment and response status. At the end of period 1, all patients are evaluated for efficacy and safety. In particular, patients are classified as responders if their HAMD17 total scores are 18 or less. For period 2, the one-to-one ratio is also used to rerandomize placebo nonresponders to receive the new test drug or to continue on placebo.

Efficacy will be based on the change from baseline in the HAMD17 total score with the new test drug compared with placebo after 4 weeks of treatment in each period. This is carried out with the conditional ANCOVA models with the respective baseline HAMD17 total score and the treatment contrast between the new test drug and the placebo. The combination test in equation (3) is then used to detect a possible efficacy signal in the HAMD17 total score at the one-sided type 1 error rate $\alpha = .1$.


The design parameters for the sample size calculation are based on a full analysis of the MDD dataset used in section 3.1, followed by hierarchical longitudinal disease modeling. The resulting models provide, among other things, the mean and standard deviation of the HAMD17 total scores for each week and the correlations of the HAMD17 total scores between weeks. Through this modeling, it is determined that 30% of the patients would respond by the end of 2 weeks after enrollment, which suggests a 2-week placebo lead-in phase. This would reduce the responder rate for the drug testing phase. The placebo responder rate for period 1 is also 30%. We assume for the first period a treatment difference $\delta_1 = 3$ in the mean change from baseline in HAMD17 total score between the new test drug and placebo. The full analysis also suggests an increased effect size of at least 1 point (i.e., $\Delta\delta = 1$) in placebo nonresponders in period 2. Thus, we choose $\delta_2 = \delta_1 + \Delta\delta = 4$. With $\sigma_1 = 8$ and $\rho_1 = .25$, $\tau_1 = \sigma_1(1 - \rho_1^2)^{1/2}$ is approximately 7.75. For the second period, $\sigma_2 = 9$ and $\rho_2 = .8$, which leads to $\tau_2 = \sigma_2(1 - \rho_2^2)^{1/2} \approx 5.5$. Based on a literature review, we use 10% as the dropout rate during period 1. With the responder rate of 30% for period 1, the attrition rate is $\psi = .4$. To achieve a minimum 90% power, we use a 95% nominal power, for which the required total number of patients to be randomized for period 1 is 112. Based on a simulation study with 20,000 runs, the simulated power is .92925 with the combination test that uses adaptive weights (see section 5). The optimal weight for the first period in the combination test is .4857.

To determine the sample size of the sequential parallel design of Fava et al. (2003), we performed a series of simulation studies as suggested by Tamura and Huang (2007). For the seemingly unrelated regression analysis, we used the change from the respective baseline HAMD17 total score as the dependent variable. Because of the rerandomization, the correlation between the dependent variables of the two periods is set to zero following Chen et al. (2011). The required sample size is 128 using the midpoint .7 of the weight range .6 to .8. In comparison, the required total sample size of a parallel group design with a treatment difference of 3 points and a standard deviation of 7.75 would be 225 patients after adjusting for 10% dropouts.
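The reported sample size and first-period weight can be checked against the formulas of section 2.3. The short calculation below (Python; a numerical check under the rounded inputs $\tau_1 = 7.75$ and $\tau_2 = 5.5$, not part of the article) reproduces the 112 patients and a first-period weight of about .486.

```python
# Numerical check of the proof-of-concept sample size: one-sided alpha = .1,
# 95% nominal power, delta1 = 3, delta2 = 4, ANCOVA standard deviations 7.75
# and 5.5, balanced randomization in both periods, attrition psi = .4.
import math
from scipy.stats import norm

g1, g2 = 3 / 7.75, 4 / 5.5
R1 = R2 = 0.5
psi = 0.4
R = R1 * (1 + (R2 * (1 - R2) / (1 - R1)) * (1 - psi) * (g2 / g1) ** 2)
n1 = math.ceil((norm.ppf(0.90) + norm.ppf(0.95)) ** 2 / (g1 ** 2 * R))  # per-arm size
n2_star = n1 * (1 - psi) / 2                     # expected nonresponders on drug
w1 = n1 * R1 * g1 ** 2 / (n1 * R1 * g1 ** 2 + n2_star * R2 * g2 ** 2)
print(2 * n1, round(w1, 4))                      # 112 patients in period 1, weight ~.486
```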

4.2. Dose Finding

Consider a Phase 2b dose-finding trial for patients with MDD where two doses of a new test drug are compared to a placebo control. The design again consists of a 2-week placebo lead-in phase and a drug testing phase with two 4-week periods, and both phases are double-blind. In the drug testing phase, eligible patients are randomized in a 1:1:2 ratio into period 1 to receive either of the two doses of the new test drug or placebo. Treatment in the second period is based on the patient's period 1 treatment and response status. At the end of period 1, all patients are evaluated for efficacy and safety. In particular, patients are classified as responders if there is a 50% or greater reduction in their MADRS total scores. Placebo nonresponders from period 1 will be rerandomized in a 1:1:1 ratio to receive either of the two doses of the new test drug or to continue with placebo.

A key feature of this design is the utilization of an unequal randomization ratio during the first period of the double-blind phase. Based on past clinical trial experience, designs using an equal randomization ratio for multiple doses tend to increase the placebo response rate, largely because of the high probability for a patient to receive the potentially effective new test drug. For the current trial, the primary endpoint for each period is the change from the respective baseline in the MADRS total score. With the 1:1:2 randomization ratio for the first period, the treatment difference with respect to the primary endpoint between a dose of the new test drug and placebo is assumed to be 4.5 points. In contrast, the smaller treatment difference of 4 points would be used if the randomization ratio were 1:1:1 (Sinyor et al., 2010). For the second period, we still use the treatment difference of 4.5, because it is expected that the randomization ratio would play a much smaller role in the placebo response rate for the enriched placebo nonresponders. The standard deviations used for periods 1 and 2 are 9 and 6, respectively. Again, we assume that 30% of placebo patients are responders at the end of period 1 and that 10% of patients are expected to drop out during period 1.

For the Phase 2b trial, we choose the one-sided type 1 error rate $\alpha = .05$ in order to control the probability of failure of the Phase 3 program. An objective of the Phase 2b trial is to identify the dose(s) to carry forward into a Phase 3 clinical development program. Thus, an analysis comparing each individual dose to placebo is necessary. Therefore, the sample size is based on a pairwise comparison of a dose of the new test drug with the placebo at the type 1 error rate $\alpha = .05$ with 95% nominal power. The resulting sample size for the first randomization is 33 patients for each dose of the new test drug and 66 patients for the placebo. The weight for the first period is .4969.

To control the multiple type 1 error rates at the specified $\alpha = .05$ level, we use the following closed testing procedure. We first perform an overall test against the global null hypothesis that neither dose is effective (compared to placebo) at the $\alpha = .05$ level. If the global null hypothesis is rejected, we then test against the individual null hypothesis that a particular dose of the new test drug is not effective, also at the $\alpha = .05$ level. The overall test employs the combination test given in equation (3), for which triple trend test statistics (Capizzi et al., 1992) are used for both periods. For each triple trend test statistic, the same trend scores are derived from three different sigmoid Emax models; each trend statistic is based on the ANCOVA model with the baseline MADRS total score and the trend score. As a result, we use the same weight .4969, which is derived from the pairwise sample size calculation already shown, to combine the two triple trend test statistics. For the pairwise comparison, the ANCOVA model includes the baseline MADRS total score and the treatment contrast between the dose in question and placebo. The pairwise test statistics from the two periods are then combined via equation (3) with the first period weight .4969 to test against the individual null hypothesis.
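The closed testing logic described above can be sketched as follows (Python; the per-period trend and pairwise statistics are taken as given inputs, and all names are illustrative assumptions): dose-specific hypotheses are examined only if the global combination test based on the trend statistics rejects.

```python
# Closed testing with the combination test: combine the per-period trend
# statistics for the global null hypothesis, then combine the per-period
# pairwise statistics for each dose only if the global test rejects.
from scipy.stats import norm

def closed_test(trend_z, pairwise_z, w1=0.4969, alpha=0.05):
    """trend_z = (z1, z2); pairwise_z = {dose: (z1, z2)} for each dose."""
    combine = lambda z1, z2: w1 ** 0.5 * z1 + (1 - w1) ** 0.5 * z2
    crit = norm.ppf(1 - alpha)
    if combine(*trend_z) < crit:                  # global null not rejected
        return {dose: False for dose in pairwise_z}
    return {dose: combine(z1, z2) >= crit for dose, (z1, z2) in pairwise_z.items()}
```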

This example illustrates the flexibility of the doubly randomized delayed-start design, which is not present in the previous example. There are two important design features. First, there are only three treatment groups for each period. Second, different randomization ratios for the two periods are employed to take full advantage of the enrichment with the placebo nonresponders. Analytically, we can apply the closed testing procedure to control the multiple type 1 error rates, and we can easily apply the existing triple trend test to test against the global null hypothesis.


In comparison, the sequential parallel design by Fava et al. (2003) does not have these features. Five treatment sequences would be needed, which requires a large randomization block. There is also a limitation in choosing the randomization ratio. There is no existing procedure for triple trend tests using the seemingly unrelated regression analysis.

5. SIMULATION STUDIES

5.1. Principle

Monte Carlo simulation plays an important role in clinical trial design and analysis. It can be used to verify results established by a statistical theory or to evaluate operating characteristics that are difficult to study theoretically. The latter includes either investigations of the robustness of the trial design and analysis or an acknowledgment of the limitations of the simulation studies. Unfortunately, there have been many instances of abuse or fraud in practice, which are documented in great detail by Liu and Chi (2010b).

Prior to carrying out simulation studies, the first task is to construct a simulation model, which consists of two critical components: (i) a set of procedures that mimic the trial design and strictly follow the prespecified statistical analysis of the proposed methods, and (ii) a set of mathematical models to approximate the state of nature. Without (i), it is difficult to regard the simulation results as relevant to the trial design or analysis in question. As the trial design and analysis are specified, it is possible and necessary to construct the set of procedures in great detail. For (ii), it is easy to specify distribution models that follow the assumptions required by the statistical theory; the challenge is to come up with models for the state of nature, which is not fully understood and difficult to predict. With this conviction in mind, we construct our simulation model for the simplest design described in section 2.1, comparing the treatment with a single dose of the new drug to placebo.

5.2. Simulation Model

For the doubly randomized delayed-start design, our simulation model is constructed in the following steps, reflecting the actual programming code.

1. As seen in section 3.1 and alluded to in section 4.1, we first performed a full analysis of a historical dataset from a double-blind, placebo-controlled trial consisting of a new test drug and an active control with an existing drug. We then fit the data with a set of hierarchical longitudinal models, from which important parameters for the doubly randomized delayed-start design are derived. These parameters include the placebo means ($\mu_0$, $\mu_{10}$, and $\mu_{20}$), the standard deviations ($\sigma_0$, $\sigma_1$, and $\sigma_2$), the correlation coefficients ($\rho_1$ and $\rho_2$), and the increased treatment effect $\Delta\delta$ for placebo nonresponders.

2. For the placebo lead-in phase, we generate a large number of baseline values of $y_0$ from a normal distribution with mean $\mu_0$ and standard deviation $\sigma_0$. To qualify for randomization into period 1 of the drug testing phase, we subset the baseline values according to the nonresponder criterion $y_0 \ge y_L$. From the subset, we randomly select $n_{11}$ patients for the treatment group and $n_{10} = r_1 n_{11}$ for the placebo group of period 1.

3. For the active treatment group in period 1, we generate $n_{11}$ scores of $y_{11}$ from the conditional normal distribution given in equation (8) with $\mu_{11} = \mu_{10} + \delta_1$ for a specified value of the effect size $\delta_1$; for the placebo group, we generate $n_{10}$ scores of $y_{10}$ from the conditional normal distribution given in equation (8) with $\mu_{10}$.

4. For the period 1 analysis, we use the conditional ANCOVA model for the change scores $y_{11} - y_0$ or $y_{10} - y_0$, with the baseline value $y_0$ and the treatment contrast for the new test drug versus placebo as the predictors. From the ANCOVA analysis, we extract the estimate and standard error and compute the Wald test statistic $Z_1$.

5. For period 2, we select the period 1 subset of placebo scores of $y_{10}$ by the nonresponder criterion $y_{10} \ge y_L$. This subset and its size are denoted by $S_{10}^*$ and $n_{10}^*$, respectively. Let $d$ be the dropout rate of the placebo nonresponders. We generate a number of dropouts, say $n_d$, from a binomial distribution with size $n_{10}^*$ and probability $d$. We then randomly select $n_{10}^* - n_d$ scores of $y_{10}$ from $S_{10}^*$, which are used as baseline values for period 2. The resulting subset is denoted by $S_{10}$. From $S_{10}$ we randomly select $n_{21}$ scores of $y_{10}$ for the treatment group with the new drug for period 2, and the remaining $n_{20}$ scores of $y_{10}$ are for the placebo group; $n_{21}$ and $n_{20}$ are chosen according to the randomization ratio $r_2 = n_{20}/n_{21}$ for period 2.

6. For the active treatment group in period 2, we generate $n_{21}$ scores of $y_{21}$ from the conditional normal distribution given in equation (10) with $\mu_{21} = \mu_{20} + \delta_2$, for which $\delta_2 = \delta_1 + \Delta\delta$; for the placebo group, we generate $n_{20}$ scores of $y_{20}$ from the conditional normal distribution given in equation (10) with $\mu_{20}$.

7. For the period 2 analysis, we use the conditional ANCOVA model for the change scores $y_{21} - y_{10}$ or $y_{20} - y_{10}$, with the baseline value $y_{10}$ and the treatment contrast for the new test drug versus placebo as the predictors. From the ANCOVA analysis, we extract the estimate and standard error and compute the Wald test statistic $Z_2$.

8. We compute the combination test statistic in equation (3) with three choices of weights. The first is the optimal weight given by equation (5) with the expected sample size for the second period. The second is the weight given by equation (5) with the actual second period sample size $n_{21}$. The third is an arbitrarily specified weight, say .6, .7, or .8, as suggested by Tamura and Huang (2007).

Steps 2 to 8 are repeated up to a specified simulation size under the null hypothesis that $\delta_1 = \delta_2 = 0$ and under various alternative hypotheses with different values of the standard deviations and correlation coefficients. A condensed sketch of one such simulation replicate is given below.
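The following condensed sketch of one simulation replicate (steps 2 through 8) is written in Python with numpy and statsmodels. The parameter values, the equal-weight combination, and the helper names are illustrative assumptions for exposition; it is a sketch of the procedure described above, not the authors' program.

```python
# One replicate of the simulation model: generate period 1 data from the
# conditional normal model (8), fit the period 1 ANCOVA, rerandomize the
# placebo nonresponders, generate period 2 data from model (10), fit the
# period 2 ANCOVA, and combine the two Wald statistics as in equation (3).
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(2012)

def wald_z(change, baseline, treat):
    """Wald statistic for the treatment contrast in the conditional ANCOVA."""
    X = sm.add_constant(np.column_stack([baseline, treat]))
    fit = sm.OLS(change, X).fit()
    return fit.params[2] / fit.bse[2]

def one_replicate(n11=56, n10=56, r2=1.0, mu0=25.0, mu10=15.0, mu20=13.0,
                  sigma0=5.0, sigma1=8.0, sigma2=9.0, rho1=0.25, rho2=0.8,
                  delta1=3.0, ddelta=1.0, y_cut=18.0, d=0.1, w1=0.5):
    # Step 2: truncated normal baselines (entry criterion y0 >= y_cut).
    y0 = rng.normal(mu0, sigma0, size=10 * (n11 + n10))
    y0 = y0[y0 >= y_cut][: n11 + n10]
    treat1 = np.r_[np.ones(n11), np.zeros(n10)]
    # Step 3: period 1 scores from the conditional normal model (8).
    mean1 = mu10 + delta1 * treat1 + rho1 * (y0 - mu0) * sigma1 / sigma0
    y1 = rng.normal(mean1, sigma1 * np.sqrt(1 - rho1 ** 2))
    # Step 4: period 1 conditional ANCOVA on the change score.
    z1 = wald_z(y1 - y0, y0, treat1)
    # Step 5: placebo nonresponders, dropouts, and rerandomization.
    y10 = y1[treat1 == 0]
    y10 = rng.permutation(y10[y10 >= y_cut])
    y10 = y10[: len(y10) - rng.binomial(len(y10), d)]
    n21 = int(round(len(y10) / (1 + r2)))
    treat2 = np.r_[np.ones(n21), np.zeros(len(y10) - n21)]
    # Step 6: period 2 scores from the conditional normal model (10).
    mean2 = mu20 + (delta1 + ddelta) * treat2 + rho2 * (y10 - mu10) * sigma2 / sigma1
    y2 = rng.normal(mean2, sigma2 * np.sqrt(1 - rho2 ** 2))
    # Step 7: period 2 conditional ANCOVA on the change score.
    z2 = wald_z(y2 - y10, y10, treat2)
    # Step 8: combination test statistic of equation (3).
    return np.sqrt(w1) * z1 + np.sqrt(1 - w1) * z2

# Rejection rate at one-sided alpha = .1 over a few illustrative replicates.
power = np.mean([one_replicate() >= norm.ppf(0.9) for _ in range(200)])
```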

5.3. Results

We consider a phase 3 trial with the same doubly randomized delayed-start design as in the proof-of-concept example in section 4.1. The parameter settings are identical except for the type 1 error rate $\alpha = .025$, power 90%, effect sizes $\delta_1 = 2.5$ and $\delta_2 = 3.5$, and the dropout rate $d = .2$. The required total sample size is 206 patients. The number of simulation runs is 20,000. We simulate the type 1 error rates and values of power for different dropout rates $d = .1$, .2, and .3. The results are summarized in Tables 2 and 3, where $\omega_{\mathrm{opt}} = .5069$ is the optimal fixed weight, $\omega_{\mathrm{adp}}$ is the adaptive weight, and $\omega_{\mathrm{fix}}^1 = .6$, $\omega_{\mathrm{fix}}^2 = .7$, and $\omega_{\mathrm{fix}}^3 = .8$ are arbitrary fixed weights.


Table 2  Type 1 error rate and power with β = .1

                                 ωopt      ωadp      ω1,fix    ω2,fix    ω3,fix
δ1 = 0,   δ2 = 0,   d = .1      .02640    .02695    .02650    .02675    .02600
δ1 = 0,   δ2 = 0,   d = .2      .02655    .02680    .02650    .02610    .02625
δ1 = 0,   δ2 = 0,   d = .3      .02570    .02680    .02625    .02615    .02545
δ1 = 2.5, δ2 = 3.5, d = .1      .88445    .88680    .88525    .87835    .86195
δ1 = 2.5, δ2 = 3.5, d = .2      .87290    .87680    .87350    .86985    .85245
δ1 = 2.5, δ2 = 3.5, d = .3      .84710    .85460    .85255    .84830    .83545

From Table 2, it is seen that the simulated type 1 error rates are very close to the theoretical value α = .025. None of the simulated type 1 error rates is statistically different from α = .025 at the two-sided significance level .05. In particular, the combination test using the adaptive weight does not yield type 1 error rates that are significantly different from .025. This is consistent with the unblinded trial modification theory of Liu and Chi (2010a). As expected, the proposed combination test, using either the optimal weight or the adaptive weight, is more powerful than the test using the prefixed weight .8. According to the simulation studies conducted by Chen et al. (2011), the weighted estimate procedure with rerandomization of the placebo nonresponders has nearly the same power as the sequential parallel design when the weight for the estimate, based on Tamura and Huang (2007), is chosen between .6 and .8. When a small period 2 standard deviation is used to reflect the enrichment, the resulting optimal weight of the combination test in equation (3) is substantially smaller than .7 or .8. From this, we can infer that the proposed method is more powerful than the sequential parallel design of Fava et al. (2003). This conclusion is also consistent with our calculations of the required sample sizes of both designs given in section 4.1.
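As a back-of-the-envelope check of this statement (not part of the original analysis), the Monte Carlo standard error of a simulated rejection rate at the nominal level is √(.025 × .975/20000) ≈ .0011, so simulated null rates within roughly (.0228, .0272) are compatible with α = .025; the largest null entry in Table 2, .02695, corresponds to a two-sided z of about 1.8.

```python
import math

runs, alpha0 = 20000, 0.025
se = math.sqrt(alpha0 * (1 - alpha0) / runs)       # Monte Carlo SE, about 0.0011
lo, hi = alpha0 - 1.96 * se, alpha0 + 1.96 * se    # about (0.0228, 0.0272)
z_max = (0.02695 - alpha0) / se                    # largest null entry in Table 2, z about 1.8
print(round(se, 5), (round(lo, 4), round(hi, 4)), round(z_max, 2))
```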

It is seen from Table 2 that the values of power are all below the required 90%. As explained in section 2.3, this is due to the use of the expected sample size for the second period, which ignores the variability of the actual number of placebo nonresponders who can be rerandomized. To fix this problem for the design, we calculate the final sample size with 93% power; the resulting total sample size is 230. Table 3 provides the simulation results with this new sample size. From both Table 2 and Table 3, it is clear that a loss of power also occurs when the dropout rate is higher than the expected dropout rate. This problem could be resolved by using a larger dropout rate in the sample size calculation. However, in the actual trial the dropout rate can be lower than expected, and using a larger dropout rate in the sample size calculation may therefore unnecessarily increase the cost and duration of the trial. A possible solution is to adjust the sample size to be randomized for the first stage according to the actually observed dropout rate.
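To illustrate the variability that the expected-sample-size calculation ignores, the short sketch below simulates the number of placebo nonresponders available for rerandomization into period 2, using assumed placeholder values for the period 1 placebo group size and the probability of meeting the nonresponder criterion, at the dropout rates d = .1, .2, and .3 of Tables 2 and 3.

```python
import numpy as np

rng = np.random.default_rng(1)

n10 = 100        # hypothetical period 1 placebo group size (placeholder)
p_nr = 0.5       # hypothetical probability of meeting y10 >= yL (placeholder)
reps = 20000

for d in (0.1, 0.2, 0.3):
    n_star = rng.binomial(n10, p_nr, reps)        # placebo nonresponders in period 1
    n2 = n_star - rng.binomial(n_star, d)         # remaining after binomial dropouts
    q5, q95 = np.percentile(n2, [5, 95])
    print(f"d={d}: mean={n2.mean():.1f}, sd={n2.std():.1f}, 5th-95th pct=({q5:.0f}, {q95:.0f})")
```

The realized period 2 sample size spreads well around, and often well below, its mean, which is precisely the variability that a power calculation based on the expected size does not capture.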


Table 3  Type 1 error rate and power with β = .07

                                 ωopt      ωadp      ω1,fix    ω2,fix    ω3,fix
δ1 = 0,   δ2 = 0,   d = .1      .02705    .02760    .02700    .02700    .02645
δ1 = 0,   δ2 = 0,   d = .2      .02775    .02835    .02785    .02765    .02740
δ1 = 0,   δ2 = 0,   d = .3      .02740    .02785    .02705    .02675    .02670
δ1 = 2.5, δ2 = 3.5, d = .1      .92270    .92335    .92145    .91390    .89945
δ1 = 2.5, δ2 = 3.5, d = .2      .90600    .90910    .90800    .90245    .88675
δ1 = 2.5, δ2 = 3.5, d = .3      .88735    .89145    .88930    .88480    .87455

6. DISCUSSION

6.1. Summary

We propose an adaptive doubly randomized delayed-start design that offers greater flexibility and interpretability than the sequential parallel design of Fava et al. (2003). The key differences are the addition of randomization for placebo patients who meet the enriched entry criteria for the second period and the use of an adaptive combination test. The design allows the adaptive weight to be based on the actual number of patients randomized to the second period. Through simulation studies, we show that the combination test with the adaptive weight is more powerful than a test using a fixed weight, which must be prespecified for the sequential parallel design. The design can handle the simple setting of comparing one dose of a new drug against placebo, as well as a complex setting of multiple doses, possibly including an active control with an existing approved drug. The latter is important to allow for comparative effectiveness research.

For drug development in mood disorders, the proposed design consists of a placebo lead-in phase and a drug testing phase with two periods. The placebo nonresponders from either the placebo lead-in phase or the first period are randomized to receive the new test drug or placebo.

Chen et al. (2011) raise concerns about rerandomization, stating that patients may feel the drug effect soon after receiving the new drug, which may introduce bias by undermining the integrity of blinding. But if patients must not feel the drug effect, how has the FDA managed to approve many effective drugs whose clinical endpoints rely on patients' own subjective evaluations? A major reason for conducting a randomized, double-blind trial, especially with an enrichment design (Temple, 1994), as opposed to an open-label trial, is precisely so that patients may feel the drug effect when the drug is in fact efficacious (under the alternative hypothesis). Difficulties arise only when there are noticeable differential side effects in favor of an ineffective new drug. Issues with partially unblinding the data are addressed by Liu and Pledger (2005) and Liu and Chi (2010b).

6.2. Future Research

There are several ways to improve the simulation model. First, there is evidence from existing data that clinical endpoint measures can follow a distribution with different characteristics (e.g., standard deviation or correlation) for patients receiving an active drug. Thus, a simulation model that reflects these differences is useful for evaluating the robustness of the underlying statistical analysis method (e.g., ANCOVA). Although the combination test is shown to control the type 1 error rates, it must be realized that the underlying assumption is that the p-value for a given realization is not stochastically smaller than the uniform distribution. Thus, an inappropriate analysis method for deriving the period-wise test statistics Z1 or Z2 can still lead to an inflation of the type 1 error rate. Chen et al. (2011) specifically require that the correlation coefficient be constant. When it is not, their procedure may inflate the type 1 error rate as well. However, we point out that the problem is not particular to the design in question. The concern is with the statistical method that is used to derive the test statistic or p-value, irrespective of the design employed.

Another improvement is to incorporate longitudinal models in both the simulation model and the statistical analysis. This has the advantage of addressing dropouts in both periods as well as further increasing the power of the test. Chen et al. (2011) evaluated the robustness of the analysis based on a mixed-effects model when dropouts are missing at random (MAR). It is also necessary to evaluate the mixed-effects model analysis when there are differential and informative dropouts (i.e., missing not at random, or MNAR). We notice that Chen et al. (2011) state that there is no evidence of MNAR from the 25 New Drug Applications (NDAs). However, it would not be possible to establish evidence of MNAR unless these trials had actually collected the missing data.

Chen et al. (2011) showed that the sequential parallel design of Fava et al. (2003) does not inflate the unconditional type 1 error rates over 2,000 hypothetical replications of their simulation studies. In reality, only a few trials are conducted in a single clinical development program. For trials with imbalances in known and unknown prognostic factors, only conditional type 1 error rates are relevant. We are not aware of methods that allow evaluation of the conditional type 1 error rates.

There are many areas for adding adaptive features to the doubly randomized delayed-start design. A simple setting is sample size adjustment based on a blinded data review of the dropout rate. As the design is naturally adaptive, it also allows other adaptive features, such as sample size adjustment based on an unblinded review of the period 1 variability, effect size, and so on, or adaptive dose-finding. The design can also be expanded to include randomized withdrawal for patients who respond to the new test drug in period 1. This addition allows study of the new test drug's effect in many ways, including a reduced dose or frequency for maintenance of the response.


APPENDIX: PROOF OF THE THEOREM

(a) Since 𝒢 ⊂ ℱ, g is also ℱ-measurable. Thus PH02{p2(g) ≤ C | ℱ} ≤ C holds, following the Adaptation Lemma of Liu et al. (2002).

(b) Let X be an ℱ-measurable random variable that follows a standard normal distribution under the null hypothesis H01: δ1 = 0. Let

Z̃1(Z2) = (zα − √ω2 Z2)/√ω1.

Then PH(Z ≥ zα) = PH(Z1 ≥ Z̃1(Z2)) ≤ PH(X ≥ Z̃1(Z2)). The inequality for the probability PH(·) holds because, by construction, Z1 is not stochastically larger than X. Now let

Z̃2(X) = (zα − √ω1 X)/√ω2.

Then X ≥ Z̃1(Z2) if and only if Z2 ≥ Z̃2(X), which is equivalent to p2(g) ≤ C(X), where C(X) = 1 − Φ(Z̃2(X)). Thus,

PH(X ≥ Z̃1(Z2)) = E[PH{p2(g) ≤ C(X) | ℱ}] ≤ E[C(X)],

where the last inequality follows from (a). Because X follows a standard normal distribution, it is easy to show that E[C(X)] = α. Chaining all the inequalities together, we have

PH(Z ≥ zα) ≤ α.
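The identity E[C(X)] = α used in the final step can be checked numerically. The sketch below is a quick Monte Carlo verification (not part of the original proof) under the convention that the weights sum to one; the particular value ω1 = .5069 is borrowed from section 5.3 purely for illustration, and any weight in (0, 1) gives the same result.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
alpha, w1 = 0.025, 0.5069                 # any w1 in (0, 1) works; w2 = 1 - w1
w2 = 1 - w1
zalpha = norm.ppf(1 - alpha)

X = rng.standard_normal(2_000_000)                            # X ~ N(0, 1)
C = 1 - norm.cdf((zalpha - np.sqrt(w1) * X) / np.sqrt(w2))    # C(X) = 1 - Phi(Z~2(X))
print(C.mean())                                               # ~ 0.025 = alpha
```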

ACKNOWLEDGMENTS

The authors thank Trevor McMullan for the analysis of the MDD dataset. They are also grateful to Dr. Yevgen Tymofyeyev for reviewing the article and suggesting the use of the adaptive weight.

REFERENCES

Capizzi, T., Survill, T. T., Heyes, J. F. (1992). An empirical and simulated comparison of some tests for detecting progressiveness of response with increasing doses of a compound. Biometrical Journal 3:275–289.

Chen, Y. F., Yang, Y., Hung, H. M. J., Wang, S. J. (2011). Evaluation of performance of some enrichment designs dealing with high placebo response in psychiatric clinical trials. Contemporary Clinical Trials 32:592–604.

Chi, Y. G., Liu, Q. (1999). The attractiveness of the concept of a prospectively designed two-stage clinical trial. Journal of Biopharmaceutical Statistics 9(4):537–547.

Cui, L., Hung, H. M. J., Wang, S. J. (1999). Modification of sample size in group sequential clinical trials. Biometrics 55:853–857.

Fava, M., Evins, A. E., Dorer, D. J., Schoenfeld, D. A. (2003). The problem of the placebo response in clinical trials for psychiatric disorders: culprits, possible remedies, and a novel study design approach. Psychotherapy and Psychosomatics 72:115–127.

Habermann, T. M., Weller, E. A., Morrison, V. A., et al. (2006). Rituximab-CHOP versus CHOP alone or with maintenance rituximab in older patients with diffuse large B-cell lymphoma. Journal of Clinical Oncology 24:3121–3127.

Heyn, R. M., Joo, P., Karon, M., et al. (1974). BCG in the treatment of acute lymphocytic leukemia. Blood 46:431–442.


Jennison, C., Turnbull, B. W. (2000). Group Sequential Methods with Applications to Clinical Trials. Boca Raton, FL: Chapman & Hall.

Khin, N. A., Chen, Y. (2011). Exploratory analyses of efficacy data from major depressive disorder trials submitted to the US Food and Drug Administration in support of new drug applications. Journal of Clinical Psychiatry 72:464–472.

Leber, P. (1996). Observations and suggestions for antidementia drug development. Alzheimer Disease and Associated Disorders 10(suppl. 1):31–34.

Leber, P. (1997). Slowing the progression of Alzheimer disease: methodological issues. Alzheimer Disease and Associated Disorders 11(suppl. 5):10–21.

Liu, Q., Chi, G. Y. H. (2010a). Fundamental theory of adaptive designs with unplanned design change in clinical trials with blinded data. In: Pong, A., Chow, S. C., eds. Handbook of Adaptive Designs in Pharmaceutical and Clinical Development. Boca Raton, FL: Chapman & Hall, pp. 2-1–2-8.

Liu, Q., Chi, G. Y. H. (2010b). Understanding the FDA guidance on adaptive designs: historical, legal and statistical perspectives. Journal of Biopharmaceutical Statistics 20(special issue):1178–1219.

Liu, Q., Pledger, G. W. (2005). Interim analysis and bias in clinical trials: the adaptive design perspective. In: Buncher, R., Tsay, J.-Y., eds. Statistics in the Pharmaceutical Industry. 3rd ed., revised and expanded. New York: Taylor & Francis, pp. 231–244.

Liu, Q., Proschan, M. A., Pledger, G. W. (2002). A unified theory of two-stage adaptive designs. Journal of the American Statistical Association 97:1034–1041.

McDermott, M. P., Hall, W. J., Oakes, D., Eberly, S. (2002). Design and analysis of two-period studies of potentially disease-modifying treatments. Controlled Clinical Trials 23:635–649.

Mills, E. J., Kelly, S., Wu, P., Guyatt, G. H. (2007). Epidemiology and reporting of randomized trials employing rerandomization of patient groups: a systematic survey. Contemporary Clinical Trials 28:268–275.

Papakostas, G. I., Fava, M. (2009). Does the probability of receiving placebo influence clinical trial outcome? A meta-regression of double-blind, randomized clinical trials in MDD. European Neuropsychopharmacology 19:34–40.

Sinyor, M., Levitt, A. J., et al. (2010). Does inclusion of a placebo arm influence response to active antidepressant treatment in randomized controlled trials? Results from pooled and meta-analyses. Journal of Clinical Psychiatry 71:270–279.

Tamura, R., Huang, X. (2007). An examination of the efficiency of the sequential parallel design in psychiatric clinical trials. Clinical Trials 4:309–317.

Tedeschini, E., Fava, M., Goodness, T. M., Papakostas, G. I. (2010). Relationship between probability of receiving placebo and probability of prematurely discontinuing treatment in double-blind, randomized clinical trials for MDD: a meta-analysis. European Neuropsychopharmacology 20:562–567.

Temple, R. J. (1994). Special study designs: early escape, enrichment, studies in non-responders. Communications in Statistics – Theory and Methods 23:499–531.
