26
165 CHAPTER Clinical Trials DOI: © 2012 Elsevier Inc. All rights reserved. 2012 10.1016/B978-0-12-391911-3.00009-8 I. INTRODUCTION Statisticians contribute an essential intellectual component to most clinical trials (1,2). The following introduces statistical concepts and formulas that enable an understand- ing of information occurring at later points in this textbook. These concepts and for- mulas include the Kaplan-Meier plot, the Z statistic, the t statistic, P values, the hazard ratio, the concept of sample group versus population group, and the concept of superiority analysis versus non-inferiority analysis. An earlier chapter in this book will explain the concept of intent-to-treat analysis versus per protocol analysis. a. Kaplan-Meier plot Kaplan-Meier plots are used to represent deaths occurring during the course of clini- cal trials in oncology, and hence are often called survival plots (3). But Kaplan-Meier plots are also used to represent other types of events in clinical trials, such as time to metastasis, time to a non-fatal heart attack, time to disappearance of pain in studies of arthritis drugs, and time to recovery from an infection after antibiotic treatment (4,5). A Kaplan-Meier plot is a curve or, more accurately, a step-function. In this curve, the X axis is time and the Y axis is the cumulative proportion of study subjects experi- encing the event of interest at any given time. In clinical trials in oncology, the event of interest is often death. Where a study subject dies, this death is shown by a down- ward (vertical) step in the curve. Intervals of time, during the clinical trial where there are no deaths, are shown by horizontal lines (no downward steps). Typically, horizontal components of the plot are shorter near the beginning of the clinical trial, because many subjects are participating in the trial (they have not yet died) and thus many sub- jects are at risk for triggering the event of interest, while horizontal components of the plot are longer near the end of the clinical trial, because relatively few subjects are still participating in the trial (6). Biostatistics 9 1 The author thanks Dr. Jenna Elder, Ph.D., of PharPoint Research, Inc., Wilmington, NC, for reviewing the draft chapter and for providing insightful suggestions. 2 The author thanks Dr. Harvey Motulsky, M.D., of GraphPad Software, Inc., La Jolla, CA, for reviewing the draft chapter and for responding with perceptive suggestions. 3 Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Am Stat Assoc. 1958;53:457–481. 4 Duerden M. What are Hazard Ratios? What is? Series. Hayward Medical Communications. Hayward Group, Ltd.; 2009;8. 5 Machin D, Gardner MJ. Calculating confidence intervals for survival time analyses. Brit Med J. 1988;296:1369–1371. 6 Kirkwood BR, Sterne JA. Essential Medical Statistics. 2nd ed. Malden, MA: Blackwell Science Ltd.; 2003;278.

Clinical Trials || Biostatistics

  • Upload
    tom

  • View
    220

  • Download
    2

Embed Size (px)

Citation preview

Page 1: Clinical Trials || Biostatistics

165

CHAPTER

Clinical TrialsDOI:

© 2012 Elsevier Inc. All rights reserved.2012

10.1016/B978-0-12-391911-3.00009-8

I. INTRODUCTION

Statisticians contribute an essential intellectual component to most clinical trials (1,2). The following introduces statistical concepts and formulas that enable an understand-ing of information occurring at later points in this textbook. These concepts and for-mulas include the Kaplan-Meier plot, the Z statistic, the t statistic, P values, the hazard ratio, the concept of sample group versus population group, and the concept of superiority analysis versus non-inferiority analysis. An earlier chapter in this book will explain the concept of intent-to-treat analysis versus per protocol analysis.

a. Kaplan-Meier plotKaplan-Meier plots are used to represent deaths occurring during the course of clini-cal trials in oncology, and hence are often called survival plots (3). But Kaplan-Meier plots are also used to represent other types of events in clinical trials, such as time to metastasis, time to a non-fatal heart attack, time to disappearance of pain in studies of arthritis drugs, and time to recovery from an infection after antibiotic treatment (4,5).

A Kaplan-Meier plot is a curve or, more accurately, a step-function. In this curve, the X axis is time and the Y axis is the cumulative proportion of study subjects experi-encing the event of interest at any given time. In clinical trials in oncology, the event of interest is often death. Where a study subject dies, this death is shown by a down-ward (vertical) step in the curve. Intervals of time, during the clinical trial where there are no deaths, are shown by horizontal lines (no downward steps). Typically, horizontal components of the plot are shorter near the beginning of the clinical trial, because many subjects are participating in the trial (they have not yet died) and thus many sub-jects are at risk for triggering the event of interest, while horizontal components of the plot are longer near the end of the clinical trial, because relatively few subjects are still participating in the trial (6).

Biostatistics9

1 The author thanks Dr. Jenna Elder, Ph.D., of PharPoint Research, Inc., Wilmington, NC, for reviewing the draft chapter and for providing insightful suggestions.

2 The author thanks Dr. Harvey Motulsky, M.D., of GraphPad Software, Inc., La Jolla, CA, for reviewing the draft chapter and for responding with perceptive suggestions.

3 Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. J Am Stat Assoc. 1958;53:457–481.4 Duerden M. What are Hazard Ratios? What is…? Series. Hayward Medical Communications. Hayward Group, Ltd.;

2009;8.5 Machin D, Gardner MJ. Calculating confidence intervals for survival time analyses. Brit Med J. 1988;296:1369–1371.6 Kirkwood BR, Sterne JA. Essential Medical Statistics. 2nd ed. Malden, MA: Blackwell Science Ltd.; 2003;278.

Page 2: Clinical Trials || Biostatistics

Clinical Trials166

In Kaplan-Meier plots, the event must be a one-time event. In other words, if relief from arthritis pain is found at two months into the trial, and found again in the same subject at three months into the trial, the event is only counted once (only the relief from pain occurring at two months is counted). Alternatively, where the event of inter-est is “relief from pain without relapse,” and where relief is first detected at a scheduled assessment at two months, and where relief is again detected at a scheduled assessment at three months, then relief at both of these scheduled endpoints is mandated to trigger the endpoint (7). Where the goal of the investigator is to make a graph of events of a recurring nature, Kaplan-Meier plots are not used (8).

Where the Kaplan-Meier plot contains two curves, data used for plotting these curves are also used for calculating the P value and the hazard ratio (HR).

b. Examples of Kaplan-Meier plots – the Holm studyHolm et al. (9) conducted a clinical trial on breast cancer patients. The trial enrolled 564 subjects. All of the subjects received surgery followed by radiation in an attempt to eliminate the cancer. Of these, 276 subjects were then enrolled in arm A of the clinical trial and received tamoxifen, while 288 subjects were enrolled in arm B and received only placebo. The two arms are summarized below:l Arm A. Tamoxifen.l Arm B. Placebo.Tamoxifen and placebo were administered for a period of two years. Subjects were observed for 15 years in all, starting from the day of assignment to arm A or arm B. During these years, subjects were periodically tested for recurrence of the cancer. Data that were collected took the form of an endpoint called recurrence-free survival.

For each of the subjects, the event of recurrence of the breast cancer, or the event of death from recurrence of the breast cancer, triggered this endpoint.

For each subject, when the event was triggered, the investigators placed a point on the survival curve, also known as the Kaplan-Meier plot (Fig. 9.1). This plot includes two survival curves, one for arm A and the other for arm B. The dots on the curves represent subjects who were censored. Each dot represents one subject. The Kaplan-Meier plot shown in Fig. 9.1 was simplified somewhat from the original diagram, for clarity in presentation. Subjects who are censored are usually represented by dots or tick marks, though some investigators choose not to indicate censored subjects. Censoring is defined below.

Visual inspection of the separation between the two curves can indicate the effi-cacy of the experimental treatment relative to the control treatment. While it is hoped

7 The author thanks Dr. Jenna Elder for this suggestion.8 Motulsky H. Intuitive Biostatistics. New York, NY: Oxford Univ. Press; 1995;54.9 Holm C, Rayala S, Jirström K, Stål O, Kumar R, Landberg G. Association between Pak1 expression and subcellular

localization and tamoxifen resistance in breast cancer patients. J Natl Cancer Inst. 2006;98:671–680.

Page 3: Clinical Trials || Biostatistics

Biostatistics 167

that the experimental treatment results in better survival than the control treatment, it is often the case that there is no discernible difference, and sometimes the case that the experimental treatment reduces survival relative to the control. A measure of the difference in efficacy of the experimental and control treatments is conventionally expressed, in numbers, by way of the hazard ratio. Statisticians define the hazard ratio as the hazard rate of the event of interest in arm A, compared to the hazard rate of the same event in arm B (10). In clinical trials in oncology, an event that is typically plotted on the Kaplan-Meier plot is an increase in tumor burden beyond a predeter-mined standard. This increase is called “progression.” Another event typically plotted on Kaplan-Meier plots is the event of death. In clinical trials on other disorders, the event can be the occurrence of an eye disorder (macular degeneration) (11) occur-rence of relapse in patients with multiple sclerosis (12) appearance of a bacterial infec-tion in patients with cirrhosis (13) or a heart attack in patients with atherosclerosis

Arm A. Tamoxifen.

Arm B. Placebo.

0 5 10 15

Years after randomization

Pro

port

ion

of s

ubje

cts

still

exp

erie

ncin

gre

curr

ence

-free

sur

viva

l

1.0

0

0.2

0.8

0.6

0.4

Figure 9.1 Kaplan-Meier plot from Holm study. The Kaplan-Meier plot contains two curves enabling the comparison of events experienced by subjects in two study arms over the course of time. The degree of separation of the two curves is measured by the hazard ratio, and the significance of this separation is measured by the P value

10 Dawson B, Trapp RG. Basic and Clinical Biostatistics. 4th ed. New York, NY: Lange Medical Books; 2004;229–235.11 Cukras C, Agrón E, Klein ML, et al. Natural history of drusenoid pigment epithelial detachment in age-related

macular degeneration: Age-Related Eye Disease Study Report No. 28. Ophthalmology. 2010;117:489–499.12 Kappos L, Radue EW, O’Connor P, et al. A placebo-controlled trial of oral fingolimod in relapsing multiple sclerosis.

New Engl J Med. 2010;362:387–401.13 Papp M, Norman GL, Vitalis Z, et al. Presence of anti-microbial antibodies in liver cirrhosis – a tell-tale sign of

compromised immunity? PLoS One. 2010;5:e12957.

Page 4: Clinical Trials || Biostatistics

Clinical Trials168

(14). Where a Kaplan-Meier plot is used, it provides information on two things, the cumulative percentage of subjects’ experience of the event, and timing of the event.

For the Kaplan-Meier plot from the cancer study of Holm et al. (15) the hazard ratio was HR 0.502. In calculating the hazard ratio, it is always the case that HR 1.0 means that there is no difference in the underlying hazard rates of the two groups.

The P-value corresponding to the significance of separation of the curves from arm A and arm B from this particular Kaplan-Meier plot was P .001. The P value is applied for interpreting the experiment as follows. It means that the probability of observing a result as extreme as, or more extreme than, the one actually observed from chance alone is one in one thousand. If the P value had been P .01, it would have meant that the probability of observing a result as extreme as, or more extreme than, the one actually observed from chance alone is one in one hundred. Because the value of 0.001 is less than 0.05, this means that we can reject the null hypothesis. Holm et al. (16) expressly stated that, “P less than .05 was considered statistically significant.” By convention, in the context of handling P values, this number (0.05) is called the alpha value.

c. Censoring dataData on any given subject are “censored” when a subject drops out of the clinical trial. Data on any subjects still alive when a clinical trial on cancer has come to its end, and when the clinical trial has been formally concluded, are also censored. For subjects still alive at the end of an oncology clinical trial, the event that is usually of interest (death) has not occurred, and for this reason the subject is censored.

The term censored means that the exact date of the subject’s death is not marked by a downward step on the Kaplan-Meier plot, and is not used for calculating the fraction of surviving study subjects. Where a study subject is censored, this may be indicated on the Kaplan-Meier plot by way of a tick mark or dot. Bland and Altman (17) described the Kaplan-Meier plot as, “[t]he ‘curve’ is a step function, with sudden changes in the estimated probability corresponding to times at which an event was observed. The times of the censored data are indicated by short vertical lines.” Where a subject is censored, a tick mark or dot is shown on the graph.

Generally, subjects are censored when they are lost to the study, but it is not advis-able to censor subjects for problems that are less severe, for example failure to adhere to the drug schedule. If a subject is too ill to travel to the clinic for an infusion of an anti-cancer drug, and if that subject is censored, the act of censoring that particular

17 Bland JM, Altman DG. Survival probabilities (the Kaplan-Meier method). Brit Med J. 1998;317:1572.

14 Stone GW, Maehara A, Lansky AJ, et al. A prospective natural-history study of coronary atherosclerosis. New Engl J Med. 2011;364:226–235.

15 Holm C, Rayala S, Jirström K, Stål O, Kumar R, Landberg G. Association between Pak1 expression and subcellular localization and tamoxifen resistance in breast cancer patients. J Natl Cancer Inst. 2006;98:671–680.

16 Holm C, Rayala S, Jirström K, Stål O, Kumar R, Landberg G. Association between Pak1 expression and subcellular localization and tamoxifen resistance in breast cancer patients. J Natl Cancer Inst. 2006;98:671–680.

Page 5: Clinical Trials || Biostatistics

Biostatistics 169

subject may introduce bias into the calculations and analysis. Please consider the following hypothetical. In this hypothetical, an experimental drug is ineffective against breast cancer, except for a minority of people in the general population with a rare mutation in the epidermal growth factor gene. Now, please imagine that the health of most of the study subjects deteriorates to the point where they can no longer come to the clinic, and where the investigator decides to censor data from the subjects. In this hypothetical, only a fraction of the subjects – perhaps 5% of the total subjects enrolled in the study – having the rare mutation will feel good enough to come to the clinic for more treatment. The result of the censoring will be that the drug is discovered to be dramatically effective against breast cancer. But this will be a misleading and artifac-tual result because, in fact, the drug is only effective in 5% of the subjects.

In reviewing data from a clinical trial, the statistician can analyze the data from the total population of study subjects, as well as from specific subgroups. These subgroups typically include subjects between 18 and 65 years of age versus subjects over 65 years, subjects previously treated with chemotherapy versus those who are treatment-naive, and subjects with wild-type genes, for example epidermal growth factor gene versus those with a mutated gene. If there is reason to suspect that expression of a given gene is relevant to response to a study drug, or that a mutation in the gene is relevant to response, then subgroup analysis can be performed when the study is completed, and when all of the data are collected. Dr. Harvey Motulsky (18) has emphasized that good methodology in study design requires the definition of subgroups before initiating the clinical trial, and not after the clinical trial when the data are available, and that defin-ing subgroups after the clinical trial can raise the issue of “data mining.” Data mining has been described as, “data dredging or fishing and…the process of trawling through data in the hope of identifying patterns” (19).

d. Hazard ratioThe hazard ratio is the ratio of: [chance of an event occurring in the treatment arm]/[chance of an event occurring in the control arm] (20). The hazard ratio has also been defined as the ratio of [risk of outcome in one group]/[risk of outcome in another group], occurring at a given interval of time (21). In the situation where the hazard for an outcome is exactly twice in Group A than in Group B, the value of the haz-ard ratio can be either 2.0 or 0.5. The result of the calculation (whether HR 2.0 or 0.5) depends on whether the investigator chooses to calculate the ratio of hazards for

18 Motulsky H. E-mail of May 9, 2011.19 Hand DJ. Data mining: statistics and more? Am. Stat. 1998;52:112–118.20 Duerden M. What are Hazard Ratios? What is…? Series. Hayward Medical Communications, Hayward Group, Ltd.;

2009;8.21 Dawson B, Trapp RG. Basic and Clinical Biostatistics. 4th ed. New York, NY: Lange Medical Books; 2004;407.

Page 6: Clinical Trials || Biostatistics

Clinical Trials170

[Group A]/[Group B] or, alternatively, to calculate the ratio of hazards for [Group B]/[Group A] (22,23).

The term “hazard” refers to the probability that an individual, under observation in a clinical trial at time t, has an event at that time (24). It represents the instantaneous event rate for an individual who has already survived to the time “t.”

The two arms of a clinical trial can be compared by way of the hazard ratio and the P value. The following serves as a starting point for defining hazard ratio and P value, as it applies to two curves in a Kaplan-Meier plot. The hazard ratio is a measure of the magnitude of the difference between the two curves in the Kaplan-Meier plot, while the P value measures the statistical significance of this difference. These two definitions serve only as starting points for our present goal in arriving at correct definitions. The follow-ing are the correct definitions. The numerical value of the hazard ratio expresses the rela-tive hazard reduction achieved by the study drug compared to the hazard reduction by the control treatment. The numerical value can be a fraction of 1.0 or it can be greater than 1.0. For example, a hazard ratio of 0.70 means that the study drug provides 30% risk reduction compared to the control treatment (25). A hazard ratio of exactly 1.0 means that the study drug provides zero risk reduction, compared to the control treatment. The P value gives the probability of observing an event by chance alone, if the null hypothesis is true. The P value expresses the probability of observing a difference as extreme as that observed, if in fact the null hypothesis is true (26). If the P value from the study results is smaller than the alpha value, it is concluded that the observed difference is unlikely to be from chance, and that it arose from the treatment used in the clinical trial.

A Kaplan-Meier plot can be used to plot results from only one group. The Kaplan-Meier plot can also be used to plot results from two groups, for example study drug group and control group. The Kaplan-Meier plot can also be used for data from more than two groups. But a hazard ratio is used to represent the relative difference between only two groups. Please also note that when the hazard ratio is used as a measure for the difference between two survival curves (on one Kaplan-Meier plot), the hazard ratio can be calculated from data collected from the entire study period or, alter-natively, from an early time interval or from a late time interval (27). According to Dr. Harvey Motulsky (28) the hazard ratio is only meaningful if you assume that the hazard ratio is the same at all time points.

27 Kestenbaum B. Epidemiology and Biostatistics: An Introduction to Clinical Research. New York, NY: Springer; 2009;227–228.

28 Motulsky H. E-mail of May 9, 2011.

22 Machin D, Cheung YB. Survival Analysis: A Practical Approach. 2nd ed. Hoboken, NJ: John Wiley & Sons, Inc.; 2006;62.23 Crowley J. Handbook of Statistics in Clinical Oncology. New York, NY: Marcel Dekker; 2001;541.24 Duerden M. What are Hazard Ratios? What is…? Series. Hayward Medical Communications, Hayward Group, Ltd.;

2009;8.25 Kane RC. The clinical significance of statistical significance. The Oncologist. 2008;13:1129–1133.26 Kane RC. The clinical significance of statistical significance. The Oncologist. 2008;13:1129–1133.

Page 7: Clinical Trials || Biostatistics

Biostatistics 171

II. DEFINITIONS AND FORMULAS

The following definitions and formulas are used to calculate the hazard ratio (29).O1 is the observed number of deaths at time t for group 1.O2 is the observed number of deaths at time t for group 2.E1 is the expected number of deaths at time t for group 1, where this expectation

is based on the number of deaths occurring in this group for the immediately previous time point (the time just before time t).

E2 is the expected number of deaths at time t for group 2, where this expectation is based on the number of deaths occurring in this group for the immediately previous time point (the time just before time t).

E1 is calculated from the following formula (Eq. (1)):

Er d

r1i i

i

1 =

∑ [( )( )]

E2 is calculated from the following formula (Eq. (2)):

Er d

ri i

i

2 2=

∑ [( )( )]

The term r1i is the number of subjects alive and not censored in group 1, just before time ti.

The term r2i is the number of subjects alive and not censored in group 2, just before time ti.

The term ri, which appears in the denominator, means: r1i r2i. In other words, ri is the total number of subjects alive in both groups and not censored, just before time t.

The term di is the total number of subjects who died at time ti, in both groups combined. In other words, di d1i d2i.

The symbol ∑ (summation sign) indicates the addition over each time of death up to and including time t. The summation sign indicates that the following calculation must be made. Assume that the clinical study has six time periods. This type of clinical study can be represented by a Kaplan-Meier plot where each curve has six points.

The hazard ratio is calculated from the following formula (Eq. (3)) (30):

hO E

O E=

[ / ]

[ / ]1 1

2 2

(1)

(2)

(3)

29 Machin D, Gardner MJ. Calculating confidence intervals for survival time analyses. Brit Med J. 1988;296:1369–1371.30 Machin D, Gardner MJ. Calculating confidence intervals for survival time analyses. Brit Med J. 1988;296:1369–1371.

Page 8: Clinical Trials || Biostatistics

Clinical Trials172

III. DATA FROM THE STUDY OF MACHIN AND GARDNER

Machin and Gardner (31) provide an example of data from a clinical study of 49 sub-jects with colorectal cancer. Twenty-five of the subjects were treated with study drug, while 24 were controls. The following discloses the times when a subject died, and times when a subject was censored. The time a subject died is the “survival time.”

The raw data provided were as follows:Events of death. Subjects died on the following months. Repeated numbers mean

that more than one subject died on that month. In Group 1, subjects died on months: 6, 6, 10, 10, 12, 12, 12, 12, 24, and 32. In Group 2, subjects died on months: 6, 6, 6, 6, 8, 8, 12, 12, 20, 24, 30, and 42.

Censored subjects. Subjects were censored on the following months. Repeated numbers mean that more than one subject was censored on that month. In Group 1, subjects were censored on month: 1, 5, 9, 10, 12, 13, 15, 16, 20, 24, 27, 34, 36, 36, 44. In Group 2, subjects were censored on month: 3, 12, 15, 16, 18, 18, 22, 28, 28, 28, 30, 33.In all cases, a subject experiencing the event of interest (death) was a different

human being than a subject who was censored. Events and censoring are entered into the calculations, but at different points in the mathematical formulas.

Table 9.1 provides numbers (r1i, r2i, and di) that occur as intermediates in the calcu-lation of the hazard ratio.

In approaching the conclusion of the calculation of the hazard ratio, it is found that O1 10, E1 11.37, O2 12, and E2 10.63. In arriving at the conclusion of this calculation, it is seen that the hazard ratio is as follows (Eq. (4)):

Hazard ratio = = =[ / ]

[ / ]

[ / . ]

[ / . ].

O E

O E1 1

2 2

10 11 37

12 10 630 78

This means that treatment with study drug is associated with a reduction in deaths to 78% compared to that found with the control treatment. Dawson and Trapp (32) provide another example of calculating the hazard ratio, along with intermediate num-bers that were used during the course of the calculation.

IV. DATA USED FOR CONSTRUCTING THE KAPLAN-MEIER PLOT ARE FROM SUBJECTS ENROLLING AT DIFFERENT TIMES

In a typical clinical trial, the sequence of events for each subject involves responding to an advertisement, contacting the sponsor, and undergoing screening and enrollment.

(4)

32 Dawson B, Trapp RG. Basic and Clinical Biostatistics. 4th ed. New York, NY: Lange Medical Books; 2004;229–235.

31 Machin D, Gardner MJ. Calculating confidence intervals for survival time analyses. Brit Med J. 1988;296:1369–1371.

Page 9: Clinical Trials || Biostatistics

Biostatistics 173

It is usually not the case that the sponsor enrolls the desired number of subjects, and then administers study drug or control treatment to all of the subjects on exactly the same day. What is indicated at “day 0” on the Kaplan-Meier plot may correspond, in actuality, to hundreds of different days spread out over the course of a year. The fact that study drug and control treatments are started for each subject, shortly after the subject becomes available and immediately after the subject is properly enrolled in the trial, has prompted some investigators to compare differences in the subjects’ baseline characteristics, for subjects enrolled early in the trial with subjects enrolled late in the trial. Jacobs et al. (33) provide a good example of this comparison.

Table 9.1 Data from a clinical trial for calculating the hazard ratioa

Months (i) Group 1 (study drug) Group 2 (control) di

r1i (r1i is number of subjects alive and not censored, just before time ti)

r2i (r2i is number of subjects alive and not censored, just before time ti)

Total deaths at time i

1 25 24 03 24 24 05 24 23 06 23 23 68 21 19 29 21 17 010 20 17 212 17 17 613 13 14 015 11 14 016 10 13 018 9 12 020 9 10 122 8 9 024 8 8 227 6 7 028 5 7 030 5 4 132 5 2 133 4 2 034 4 1 036 3 1 042 1 1 144 1 0 0a Machin D, Gardner MJ. Calculating confidence intervals for survival time analyses. Brit Med J. 1988; 296:1369–1371.

33 Jacobs LD, Cookfair DL, Rudick RA, et al. Intramuscular interferon beta-1a for disease progression in relapsing multiple sclerosis. The Multiple Sclerosis Collaborative Research Group (MSCRG). Ann Neurol. 1996;39:285–294.

Page 10: Clinical Trials || Biostatistics

Clinical Trials174

In nutritional studies, all of the enrolled subjects are typically started on experi-mental and control treatments on exactly the same day. The ample supply of healthy subjects willing to participate in a nutritional study enables this kind of study design. Moreover, the requirement for keeping nutritional studies subjects confined in a “met-abolic unit” during the course of the study, to prevent subjects from consuming non-study foods, necessitates that all subjects begin the trial on the same date (34).

It is also the case that some clinical trials are concluded before the event of interest has occurred in every single one of the subjects. This situation can lead to bias, where the physiological properties of patients enrolled early in the trial differ from those enrolled late in the trial. As articulated by Bland and Altman (35) “we assume that the survival probabilities are the same for subjects recruited early and late in the study. In a long term observational study of patients with cancer, for example, the case mix may change over the period of recruitment.”

V. SAMPLE VERSUS POPULATION

The terms sample and population are standard terms in statistics. The term sample refers to data acquired by actual measurements. The investigator has the option of testing one sample, taken from a population, or of testing more than one sample, taken from the population. In discussions of statistics, it is the case that the term sample refers to a group of objects, for example 50 drug tablets, while the term population refers to the entire batch of 10,000 drug tablets that was manufactured. In statistics, it is the case that the term sample refers to 100 subjects enrolled in a clinical trial, while the term population refers to the entire world’s population of people with the disease of interest. The sample needs to be representative of the population.

The term population can refer to a hypothesized, underlying value or to an imagi-nary, idealized value. In some situations, it is possible for the researcher to measure a parameter of interest from all members of the population. But often, it is impracti-cal or impossible to measure the parameter in all members of the population. Data acquired by analyzing a sample are subject to variations in the properties of the sample and to variations in the techniques used by the investigator. For example, in a study of 50 human subjects, 15 of the subjects may have a mutation in a growth factor receptor gene, while the other 35 subjects have the wild-type gene. Or, in a study of 50 human subjects, 5 of the subjects may have forgotten to take two of their drug doses, while 45 of the subjects had remembered to take all of the drug doses. But data acquired by analyzing a population take into account these and all other variations, and data acquired by

34 Margen S, Chu JY, Kaufmann NA, Calloway DH. Studies in calcium metabolism. I. The calciuretic effect of dietary protein. Am J Clin Nutr. 1974;27:584–589.

35 Bland JM, Altman DG. Survival probabilities (the Kaplan-Meier method). Brit Med J. 1998;317:1572.

Page 11: Clinical Trials || Biostatistics

Biostatistics 175

analyzing only a 15-person sample taken from this population of 50 human subjects may be drastically biased, for example where all of the subjects in the 15-person sam-ple have the genetic mutation.

Researchers may be interested in measuring a parameter from a sample, where the goal is to predict the same parameter in the entire population, that is, where it is not practical or not possible to determine that parameter in the population. An example can be found in the manufacture of tablets or pills. Where the goal is to determine the weight of the tablets, and to determine whether the range of weights is within manu-facturing specifications, the analyst can measure the weights of 100 tablets, taken from a population of 1 million tablets that was manufactured in a specific batch. In this situ-ation, the 100 tablets constitutes a “sample,” while the 1 million tablets manufactured in a specific batch constitutes the “population.” Batchwise manufacture is distinguished in that each component has a specific lot number, and by the fact that the machinery was cleaned and calibrated specifically for the manufacture of that batch.

In this scenario, the various statistical parameters of the sample (mean, standard deviation) are known, and the statistical parameters of the population (mean, standard deviation) are also known. Instead of measuring a parameter of 1 million tablets, the researcher can refer to standards set forth by the pharmaceutical industry. These stan-dards may relate to mean weight and standard deviation.

Researchers may also want to compare a parameter from a first sample with the same parameter of a second sample. This situation occurs in clinical trials where there are two study arms, that is, an experimental drug group and a control group. In the context of a clinical trial, the relevant parameters (mean value of death rate, standard deviation) are collected from the two samples. But the relevant population parameters (mean value of death rate, standard deviation) would usually be impossible to collect, because this population would consist of all of the people in the world having the disease of interest, and satisfying the particular inclusion criteria and exclu-sion criteria mandated by the trial design.

VI. WHAT CAN BE COMPARED

Tests in drug manufacturing, or comparisons made in clinical trials, often take one of the following three forms (36). First, the mean value from a sample can be compared with a hypothetical value. The hypothetical value can be a standard (manufacturing specification) set forth by the manufacturing industry. The hypothetical can be a value from a census, or from an epidemiological study, involving every person in a country. For this type of study, there is one sample group and one population group.

36 Whitley E, Bell J. Statistics review 5: Comparison of means. Crit Care. 2002;6:424–428.

Page 12: Clinical Trials || Biostatistics

Clinical Trials176

A second type of comparison can involve paired data. For each subject, a parameter is measured before treatment and after treatment. Thus, each subject serves as his own control. For this type of study, there are two samples (but no population group). The statistical analysis compares the mean value of the “before” measurements, with the mean value of the “after” measurements. Disis et al. (37) provide an excellent example of the statistical analysis of paired data, where immune response in cancer patients was measured before and after vaccination.

Third, the mean value from a first sample can be compared with the mean value of a second sample. With this type of comparison, in the context of clinical trials, the human subjects in the first sample are not the same people as the human subjects in the second sample. This third type of comparison is the most common trial design that is used in randomized clinical trials.

VII. ONE-TAILED TEST VERSUS TWO-TAILED TEST

The terms one-tailed test and two-tailed test are encountered, for example, when con-ducting analytical studies on manufactured tablets and when conducting clinical trials. When doing calculations, these terms are encountered when plugging a Z value into a table of areas under the standard normal curve, and acquiring a P value. One-tailed test is also called one-sided test, and two-tailed test is also called two-sided test.

This standard table has been called “Standard Normal Distribution Areas” (38) “Areas in Tail of the Standard Normal Distribution” (39) and “Areas Under the Standard Normal Curve” (40).

The heading of the table of areas under the standard normal curve typically directs the reader to one column of numbers, which is to be used for one-tailed tests, and to another column of numbers, which is to be used for two-tailed tests (41).

A one-tailed test is used to determine if the mean of group 1 is greater than the mean of group 2, while a two-tailed test is used to determine if the mean of group 1 is different than the mean of group 2 (42). By “different,” what is meant here is whether there is a statistically significant difference. More accurately, by “different,” what is meant is if the difference is plausible within an acceptable degree of error (43).

38 Durham TA, Turner JR. Introduction to Statistics in Pharmaceutical Clinical Trials. Chicago, IL: PhP Pharmaceutical Press; 2008;195–203.

39 Kirkwood BR, Sterne JA. Essential Medical Statistics. 2nd ed. Malden, MA: Blackwell Science Ltd.; 2003;470–471.40 Dawson B, Trapp RG. Basic and Clinical Biostatistics. 4th ed. New York, NY: Lange Medical Books; 2004;364–365.41 Dawson B, Trapp RG. Basic and Clinical Biostatistics. 4th ed. New York, NY: Lange Medical Books/McGraw-Hill;

2004;364–365.42 Norman GR, Streiner DL. Biostatistics. 3rd ed. Hamilton, Ontario: B.C. Decker, Inc.; 2008;56.43 Elder J. E-mail of May 12, 2011.

37 Disis ML, Wallace DR, Gooley TA, et al. Concurrent trastuzumab and HER2/neu-specific vaccination in patients with metastatic breast cancer. J Clin Oncol. 2009;27:4685–4692.

Page 13: Clinical Trials || Biostatistics

Biostatistics 177

As explained by Dawson and Trapp (44) the one-tailed test is a directional test, while the two-tailed test is a non-directional test.

A one-tailed test should be used where the goal is to determine if the value of a mean of a sample is significantly greater than the value of the mean for the corre-sponding population. The one-tailed test is also used where the goal is to determine if the value of a mean of a sample is significantly greater than the value of the mean of another sample.

Thus, a one-tailed test is used where the goal is to determine if a new, improved pill dissolves faster in water than an older formulation of the pill. Also, a one-tailed test is used where the goal is to determine if a drug having expected curative properties results in a better cure than an inactive placebo.

To provide another example, a one-tailed test is used where the goal is to deter-mine if vials containing a vaccine are contaminated with ten or more bacteria (45). In this case, the analyst is only interested in whether the vials contain ten or more bacteria, in view of industry-wide specifications requiring that vials must contain less than ten bacteria. Generally, the one-tailed test is used to determine if sample A is sig-nificantly greater than sample B, in the situation where it would not be reasonable to expect sample A to be significantly less than sample B.

But a two-tailed test should be used where the goal is to determine the percentage of tablet weights that are greater or lesser (the sum of the percentage of tablets that are greater plus the sum of the percentage of tablets that are lesser) than the required specifi-cation, when comparing tablets made by manufacturer 1 with tablets made by manufac-turer 2. Two-tailed tests are more widely used in clinical trials than the one-tailed test, in view of the fact that the two-tailed test is more stringent and more conservative (46,47).

VIII. P VALUE

The P value is used in a procedure called hypothesis testing. P, which stands for prob-ability, can be any number between 0.0 and 1.0. According to Whitley and Ball (48) “[v]alues close to 0 indicate that the observed difference is unlikely to be due to chance, whereas a P value close to 1 suggests there is no difference between groups other than that due to random variation.” According to Motulsky (49) “P value is simply a

44 Dawson B, Trapp RG. Basic and Clinical Biostatistics. 4th ed. New York, NY: Lange Medical Books/McGraw-Hill; 2004;104.

45 Example derived from page 108 of Jones D. Pharmaceutical Statistics. Chicago, IL: Pharmaceutical Press; 2002.46 Norman GR, Streiner DL. Biostatistics. 3rd ed. Hamilton, Ontario: B.C. Decker, Inc.; 2008;56.47 Motulsky H. Intuitive Biostatistics: A Nonmathematical Guide to Statistical Thinking. 2nd ed. New York, NY: Oxford

Univ. Press; 2010;99.48 Whitley E, Ball J. Statistics review 3: Hypothesis testing and P values. Crit Care. 2002;6:222–225.49 Motulsky H. Intuitive Biostatistics: A Nonmathematical Guide to Statistical Thinking. 2nd ed. New York, NY: Oxford

Univ. Press; 2010;104.

Page 14: Clinical Trials || Biostatistics

Clinical Trials178

probability that answers the following question: If the null hypothesis were true…what is the probability that random sampling…would result in a difference as big as or big-ger than the one observed?”

Whitley and Ball (50) explain why hypothesis testing is needed, using the example of a drug (nitrate) for preventing deaths from heart disease. Even if there is no real effect of nitrate on mortality, sampling variation makes it unlikely that exactly the same proportion of patients in each group will die. Thus, any observed difference between the two groups may be due to the treatment or it may simply be due to chance. The aim of hypothesis testing is to establish which of these two explanations, treatment ver-sus chance, is more likely.

Hypothesis testing can involve asking if the mean of a sample has a statistically sig-nificant difference from the mean of a population (51). The question of whether there is any difference takes the form of the “null hypothesis.” The null hypothesis is that there is no statistically significant difference between the sample and the population.

Where a clinical study involves comparing a study drug group and a placebo group (or study drug sample group and an entire population), the null hypothesis is that there is no statistically significant difference between the study drug group and the placebo group (or no difference between the study drug sample group and the entire population).

The sample mean from the study drug group and the population mean (or the sam-ple mean from the study drug group and the sample mean from the placebo group) can be used to calculate a P value. This P value is then applied to the null hypothesis.

In hypothesis testing that involves the null hypothesis, the researcher does not ask, “Does the study drug work better than the placebo?”

The null hypothesis only asks, “Does the study drug work the same as the placebo?”The question asked by the null hypothesis is the more conservative of these two

questions. According to a number of authors (52,53) the null hypothesis is a “straw man” hypothesis.

Failure to reject the null hypothesis can arise from several sources, including: (1) random scatter or noise in the data; (2) use of too few subjects in the clinical trial; (3) lack of a true, underlying difference between efficacy of the study drug and efficacy of the control treatment. Random scatter can arise from several sources. These include failure of study subjects to take pills according to the required schedule, genetic vari-ability of the infecting virus in clinical trials on anti-viral drugs, genetic variability of the tumor in clinical trials using anti-cancer drugs, and genetic variability of nor-mal tissues in study subjects. Genetic variability of study subjects include differences

50 Whitley E, Ball J. Statistics review 3: Hypothesis testing and P values. Crit Care. 2002;6:222–225.51 Jones D. Pharmaceutical Statistics. Chicago, IL: Pharmaceutical Press; 2002;154–156.52 Durham TA, Turner JR. Introduction to Statistics in Pharmaceutical Clinical Trials. Chicago, IL: PhP Pharmaceutical Press;

2008;76.53 Hulley SB, Cummings SR, Browner WS, Grady DG, Newman TB. Designing Clinical Research. 3rd ed. New York, NY:

Lippincott, Williams, and Wilkins; 2006;58.

Page 15: Clinical Trials || Biostatistics

Biostatistics 179

in cytochrome P-450 (54) and differences in a component of the immune system called major histocompatibility complex (MHC) (55,56). Cytochrome P-450 is a class of enzymes capable of degrading and inactivating a wide variety of drugs. MHC is a membrane-bound protein of white blood cells that is required for antigen presentation.

After the P value is calculated, the researcher compares it with the number 0.05. This number (0.05) is called the alpha value. In making this comparison, if it is evident that P is equal to or less than 0.05, then the difference between the sample group and the population group (or the study drug group and the control group) is considered significant (and the null hypothesis is rejected).

According to the editorial board of the American Journal of Physiology, if the achieved significance level P is less than the critical significance level alpha value, defined before any data are collected, then the experimental effect is likely to be real. Most researchers define alpha to be 0.05. Where a researcher chooses an alpha of 0.05, this means that 5% of the time the researcher is willing to declare than an effect exists when (in fact) the effect does not exist (57).

Thus, according to this editorial, the number 0.05 is used, by researchers, for com-paring the P value derived from their calculations. Researchers can choose other alpha values, for example an alpha value of 0.01 or an alpha value of 0.001. Using an alpha value of 0.01 provides a more stringent test of statistical significance. An alpha value of 0.0001 provides an even more stringent test.

Regarding the alpha value of 0.05, Healy (58) finds that the “issue is where to draw the line between significant and non-significant. No such line exists and any distinction of this kind is completely arbitrary. A convention has grown up which places the divid-ing line at a significance level of 0.05.” In the instances where a smaller alpha value is used, these tend to involve experiments that are more under the control of the investiga-tor (when compared to clinical trials on human subjects), such as experiments in bacteri-ology (59) or genetics (60,61,62). In these cited articles, alpha values of 0.001 were used.

54 De Gregori M, Allegri M, De Gregori S, et al. How and why to screen for CYP2D6 interindividual variability in patients under pharmacological treatments. Curr Drug Metab. 2010;11:276–282.

55 Sidney J, Steen A, Moore C, et al. Divergent motifs but overlapping binding repertoires of six HLA-DQ molecules frequently expressed in the worldwide human population. J Immunol. 2010;185:4189–4198.

56 de Araujo Souza PS, Sichero L, Maciag PC. HPV variants and HLA polymorphisms: the role of variability on the risk of cervical cancer. Future Oncol. 2009;5:359–370.

57 Curran-Everett D, Benos DJ. Guidelines for reporting statistics in journals published by the American Physiological Society. 2004;287:E189–E191.

58 Healy MJR. Significance tests. Arch. Dis. Childhood. 1991; 66:1457–1458.

60 Mutch DM, Simmering R, Donnicola D, et al. Impact of commensal microbiota on murine gastrointestinal tract gene ontologies. Physiol Genomics. 2004;19:22–31.

61 Travers SA, Tully DC, McCormack GP, Fares MA. A study of the coevolutionary patterns operating within the env gene of the HIV-1 group M subtypes. Mol Biol Evol. 2007;24:2787–2801.

62 Znaidi S, Weber S, Al-Abdin OZ, et al. Genomewide location analysis of Candida albicans Upc2p, a regulator of sterol metabolism and azole drug resistance. Eukaryot Cell. 2008;7:836–847.

59 Kinder SA, Holt SC. Characterization of coaggregation between Bacteroides gingivalis T22 and Fusobacterium nucleatum T18. Infect Immun. 1989;57:3425–3433.

Page 16: Clinical Trials || Biostatistics

Clinical Trials180

Statistical significance needs to be distinguished from clinical significance. Kaul and Diamond (63) Kane (64) Bhardwaj et al. (65) and Houle and Stump (66) warn of the situation where data are statistically significant but are not clinically significant and have no real-world value. A number of publications have reported that a parameter was sta-tistically significant, but not clinically significant, for example Jeffrey et al. (67) and van Maldegem et al. (68). Fethney (69) pointed out that the P value on its own provides no information about the overall importance or meaning of the results to clinical practice.

IX. CALCULATING THE P VALUE – A WORKING EXAMPLE

Table 9.2 lists the parameters needed for calculating the P value.Only one example will be shown for calculating the P value. This example involves

comparing the mean of a first sample (study drug group) with the mean of a second sample (control group). The data are from Machin and Gardner (70).

In Group 1 (study drug group), subjects died on months: 6, 6, 10, 10, 12, 12, 12, 12, 24, and 32.

In Group 0 (control group), subjects died on months: 6, 6, 6, 6, 8, 8, 12, 12, 20, 24, 30, and 42.

When faced with the need to calculate a P value, the researcher must choose between various different statistical tests. One of these tests involves an intermedi-ate step where the Z statistic is calculated, while another commonly used test has an intermediate step where the t statistic is calculated. According to Pocock (71) the sim-plest test is the one using the Z statistic. But it should be noted that calculations using the Z statistic may be misleading when analyzing small samples, and that the t statistic is more appropriate with small samples (72). With large samples the t statistic and Z statistic are equivalent to each other (73).

63 Kaul S, Diamond GA. Trial and error. How to avoid commonly encountered limitations of published clinical trials. J Am Coll Cardiol. 2010;55:415–427.

64 Kane RC. The clinical significance of statistical significance. Oncologist. 2008;13: 1129–1133.65 Bhardwaj SS, Camacho F, Derrow A, Fleischer Jr AB, Feldman SR. Statistical significance and clinical relevance: the

importance of power in clinical trials in dermatology. Arch Dermatol. 2004;140:1520–1523.66 Houle TT, Stump DA. Statistical significance versus clinical significance. Semin. Cardiothorac. Vasc Anesth.

2008;12:5–6.67 Jeffery NN, Douek N, Guo DY, Patel MI. Discrepancy between radiological and pathological size of renal masses.

BMC Urol. 2011;11(2):9.68 van Maldegem BT, Duran M, Wanders RJ, et al. Clinical, biochemical, and genetic heterogeneity in short-chain

acyl-coenzyme A dehydrogenase deficiency. J Am Med Assoc. 2006;296:943–952.69 Fethney J. Statistical and clinical significance, and how to use confidence intervals to help interpret both. Aust Crit

Care. 2010;23:93–97.70 Machin D, Gardner MJ. Calculating confidence intervals for survival time analyses. Brit Med J. 1988;296:1369–1371.71 Pocock SJ. The simplest statistical test: how to check for a difference between treatments. Brit Med J. 2006;

332:1256–1258.72 Motulsky H. E-mail of May 9, 2011.73 Motulsky H. E-mail of May 9, 2011.

Page 17: Clinical Trials || Biostatistics

Biostatistics181

Table 9.2 Definitions and formulasSymbol Equation Formula or definition

Sample number n – This is the number of individuals in a given sample. This can be the number of tablets taken for sampling from a larger, defined manufacturing batch. This can also be the number of people (meeting enrollment criteria) with a specific disease who are actually enrolled in a clinical trial, and allocated to a specific arm of the trial.

Population number N – This can be the total number of tablets in a large, defined manufacturing batch. This can also be the number of people (meeting enrollment criteria) with a specific disease, in the entire world.

Sample mean (pronounced “x-bar”)a

_x Eq. (5) x n xi= ∑( / )1

Population mean μ Eq. (6) µ = ∑( / )1 N xi

Sample standard deviationa,b s Eq. (7) s x x nj= − −∑[( ( ) /( )]2 1

Population standard deviationb σ Eq. (8) σ = −∑[( ( ) /( )]x x Nj2

Square root – The symbol requires taking the square root of everything occurring to the right of the symbol.

Z score (calculated using population standard deviation) c

Z Eq. (9) Z x= −( )/µ σ

Z score used when comparing two samples (study drug group and control group).c x1 is the mean of the parameter of interest from the study drug group, while x0 is the mean of the corresponding parameter of interest from the control group.c

Z Eq. (10)

Eq. (11)

Where σ is known: Z x x n n= − +( )/[ (( / ) ( / ))]1 0 12

1 12

0σ σ

Where σ is not known:d

Z x x s n s n= − +( )/[ (( / ) ( / ))]1 0 12

1 02

0

a Whitley E, Ball J. Statistics review 1: Presenting and summarising data. Crit Care. 2002; 6: pp. 66–71.b Jones D. Pharmaceutical Statistics. Chicago, IL: Pharmaceutical Press; 2002; p. 21.c Kirkwood BR, Sterne JA. Essential Medical Statistics. 2nd ed. Malden, MA: Blackwell Science Ltd.; 2003; pp. 39, 45, 51, 53–54, 58, 61–62.d Where the value of the population standard deviation (σ) is not known, the value of the sample standard deviation(s) may be used instead (Kirkwood BR, Sterne JA. Essential Medical Statistics. 2nd ed. Malden, MA: Blackwell Science Ltd.; 2003; p. 39; Norman GR, Streiner DL. Biostatistics. 3rd ed. Hamilton, Ontario: B.C. Decker, Inc.; 2008; p. 50).

Page 18: Clinical Trials || Biostatistics

Clinical Trials182

74 Durham TA, Turner JR. Introduction to Statistics in Pharmaceutical Clinical Trials. Chicago, IL: PhP Pharmaceutical Press; 2008;195–203.

75 Kirkwood BR, Sterne JA. Essential Medical Statistics. 2nd ed. Malden, MA: Blackwell Science Ltd.; 2003;470–471.76 Dawson B, Trapp RG. Basic and Clinical Biostatistics. 4th ed. New York, NY: Lange Medical Books; 2004;364–365.77 http://www.stat.lsu.edu/exstweb/statlab/Tables/TABLES98-Z.html 78 This table, which was from Louisiana State University, has numbers that are rounded off differently from, but

contains numbers that are essentially identical to, corresponding tables in Kirkwood BR, Sterne JA. Essential Medical Statistics. 2nd ed. Malden, MA: Blackwell Science Ltd.; 2003;470–471, Dawson B, Trapp RG. Basic and Clinical Biostatistics. 4th ed. New York, NY: Lange Medical Books; 2004;364–365, and Durham TA, Turner JR. Introduction to Statistics in Pharmaceutical Clinical Trials. Chicago, IL: PhP Pharmaceutical Press; 2008;195–203.

The relevant formulas are from Table 9.2. These formulas are as follows:

l Sample standard deviation (Eq. (12)): s x x nj= − −

∑ ( ) /( )2 1

l Z value when comparing two samples: study drug (group1) and control (group 0) (Eq. (13)):

Zx x

s n s n=

+

( )

[ (( / ) ( / ))]

1 0

12

1 02

0

Intermediate steps in making the calculation appear in Table 9.3, which shows the calculation of sample standard deviation for the two samples (study drug group, control group), for the goal of determining the Z value, and for the eventual goal of determin-ing the P value.

The following continues with the calculations (Eq. (14)):

Zx x

s n s n=

+=

−+

( )

[ (( / ) ( / ))]

( . . )

[ ( . / ) (1 0

12

1 02

0

13 6 15 0

66 49 10 136.. / )]

.

( . . ).

35 12

1 4

6 649 11 360 330=

−+

= −

Plugging Z 0.330 into a standard table (Table 9.4) that converts Z values to probabilities (to areas under a normal curve) results in P 0.3707. The relevant num-bers are shown in BOLD in Table 9.4. This probability is a one-tailed P value. As men-tioned above, this standard table has names, such as “Standard Normal Distribution Areas” (74) “Areas in Tail of the Standard Normal Distribution” (75) and “Areas Under the Standard Normal Curve” (76). Comments on the nature of this table appear in the cited footnotes (77,78). The Y axis (left border) shows the number immediately before the decimal, and immediately after the decimal, for the Z value. The X axis (top border) shows the second decimal place number of the Z value. The body of the table (Table 9.4) shows the P values (probabilities, areas).

Page 19: Clinical Trials || Biostatistics

Biostatistics183

Table 9.3 Steps for calculating the Z valueColumn 1 Column 2 Column 3 Column 4 Column 5 Column 6 Column 7 Column 8 Column 9 Column 10

Group 1 Group 0

Study drug group (survival time, months)

Survival time minus mean (mean 13.6 months)

Squared value from column 2

Sum of all squared values

Sample standard deviation (n 10; n minus 1  9)

Control group (survival time, months)

Survival time minus mean (mean 15.0 months)

Squared value from column 7

Sum of all squared values

Sample standard deviation (n 12; n minus 1 11)

6 7.6 57.76 598.4 8.154 months

6 9 81 1500 11.677 months

6 7.6 57.76 6 9 8110 3.6 12.96 6 9 8110 3.6 12.96 6 9 8112 1.6 2.56 8 7 4912 1.6 2.56 8 7 4912 1.6 2.56 12 3 912 1.6 2.56 12 3 924 10.4 108.16 20 5 2532 18.4 338.56 24 9 81– – – – – 30 15 225– – – – – 42 27 729

Page 20: Clinical Trials || Biostatistics

Clinical Trials184

Table 9.4 Areas in tail of the standard normal distribution (right-tail areas from Z to infinity)Z value 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09

0.00 0.5000 0.4960 0.4920 0.4880 0.4840 0.4801 0.4761 0.4721 0.4681 0.46410.10 0.4602 0.4562 0.4522 0.4483 0.4443 0.4404 0.4364 0.4325 0.4286 0.42470.20 0.4207 0.4168 0.4129 0.4090 0.4052 0.4013 0.3974 0.3936 0.3897 0.38590.30 0.3821 0.3783 0.3745 0.3707 0.3669 0.3632 0.3594 0.3557 0.3520 0.34830.40 0.3446 0.3409 0.3372 0.3336 0.3300 0.3264 0.3228 0.3192 0.3156 0.31210.50 0.3085 0.3050 0.3015 0.2981 0.2946 0.2912 0.2877 0.2843 0.2810 0.27760.60 0.2743 0.2709 0.2676 0.2643 0.2611 0.2578 0.2546 0.2514 0.2483 0.24510.70 0.2420 0.2389 0.2358 0.2327 0.2296 0.2266 0.2236 0.2206 0.2177 0.21480.80 0.2119 0.2090 0.2061 0.2033 0.2005 0.1977 0.1949 0.1922 0.1894 0.18670.90 0.1841 0.1814 0.1788 0.1762 0.1736 0.1711 0.1685 0.1660 0.1635 0.16111.00 0.1587 0.1562 0.1539 0.1515 0.1492 0.1469 0.1446 0.1423 0.1401 0.13791.10 0.1357 0.1335 0.1314 0.1292 0.1271 0.1251 0.1230 0.1210 0.1190 0.11701.20 0.1151 0.1131 0.1112 0.1093 0.1075 0.1056 0.1038 0.1020 0.1003 0.09851.30 0.0968 0.0951 0.0934 0.0918 0.0901 0.0885 0.0869 0.0853 0.0838 0.08231.40 0.0808 0.0793 0.0778 0.0764 0.0749 0.0735 0.0721 0.0708 0.0694 0.06811.50 0.0668 0.0655 0.0643 0.0630 0.0618 0.0606 0.0594 0.0582 0.0571 0.05591.60 0.0548 0.0537 0.0526 0.0516 0.0505 0.0495 0.0485 0.0475 0.0465 0.04551.70 0.0446 0.0436 0.0427 0.0418 0.0409 0.0401 0.0392 0.0384 0.0375 0.03671.80 0.0359 0.0351 0.0344 0.0336 0.0329 0.0322 0.0314 0.0307 0.0301 0.02941.90 0.0287 0.0281 0.0274 0.0268 0.0262 0.0256 0.0250 0.0244 0.0239 0.02332.00 0.0228 0.0222 0.0217 0.0212 0.0207 0.0202 0.0197 0.0192 0.0188 0.01832.10 0.0179 0.0174 0.0170 0.0166 0.0162 0.0158 0.0154 0.0150 0.0146 0.01432.20 0.0139 0.0136 0.0132 0.0129 0.0125 0.0122 0.0119 0.0116 0.0113 0.01102.30 0.0107 0.0104 0.0102 0.0099 0.0096 0.0094 0.0091 0.0089 0.0087 0.00842.40 0.0082 0.0080 0.0078 0.0075 0.0073 0.0071 0.0069 0.0068 0.0066 0.00642.50 0.0062 0.0060 0.0059 0.0057 0.0055 0.0054 0.0052 0.0051 0.0049 0.00482.60 0.0047 0.0045 0.0044 0.0043 0.0041 0.0040 0.0039 0.0038 0.0037 0.00362.70 0.0035 0.0034 0.0033 0.0032 0.0031 0.0030 0.0029 0.0028 0.0027 0.00262.80 0.0026 0.0025 0.0024 0.0023 0.0023 0.0022 0.0021 0.0021 0.0020 0.00192.90 0.0019 0.0018 0.0018 0.0017 0.0016 0.0016 0.0015 0.0015 0.0014 0.00143.00 0.0013 0.0013 0.0013 0.0012 0.0012 0.0011 0.0011 0.0011 0.0010 0.00103.10 0.0010 0.0009 0.0009 0.0009 0.0008 0.0008 0.0008 0.0008 0.0007 0.00073.20 0.0007 0.0007 0.0006 0.0006 0.0006 0.0006 0.0006 0.0005 0.0005 0.00053.30 0.0005 0.0005 0.0005 0.0004 0.0004 0.0004 0.0004 0.0004 0.0004 0.00033.40 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.0003 0.00023.50 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002 0.0002 0.00023.60 0.0002 0.0002 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.00013.70 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.00013.80 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.00013.90 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.00004.00 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000

Page 21: Clinical Trials || Biostatistics

Biostatistics 185

Because 0.3707 is greater than 0.05, it is concluded that the difference between Group 1 and Group 0 is not statistically significant. Because the difference between Group 1 and Group 0 is not significant, the null hypothesis is accepted.

The reader needs to use extreme caution before using the standard table because the format is not particularly standard. Depending on the statistics book, different for-mats are used. Norman and Streiner (79) expressly warn that one should be careful reading tables of the normal curve in statistics books. The problem is that in some books, the area equivalent to Z 0.0 is assigned as being 0.5, while in other books, the area equivalent to Z 0.0 is assigned as 0.0. In using these tables, the reader needs to determine if the area (the probability) corresponds to the area above the Z value, or to the area below the Z value.

Kirkwood and Sterne (80) provide the same sort of calculation in an example involving two samples. The first sample is an experimental group (Group 1; 36 smokers), while the second sample is a control group (Group 0; 64 non-smokers). In this example, the mean value (lung volume) for the smokers was x1 4 7= . liters, while the mean value (lung volume) for the non-smokers was x0 5 0= . liters. The standard deviations for the two groups were s1 (smokers) 0.6 and s0 (non-smokers) 0.6. (It was only a coincidence that the standard deviations were both the same.) In working through the calculations, one finds that (Eq. (15)):

Zx x

s n s n=

+=−

= −( )

[ (( / ) ( / ))]

.

..1 0

12

1 02

0

0 3

0 1252 4

Plugging the Z value of 2.4 into the standard table provides P 0.0082. This probability is a one-tailed P value. Since 0.0082 is less than 0.05, the difference is found to be significant, and the null hypothesis is rejected. What is rejected is the notion that smoking does not significantly influence lung volume.

In configuring the smoking study, the statistician can ask, “Does smoking signifi-cantly reduce lung volume?” This involves using a one-sided P value. Alternatively, the statistician can ask, “Does smoking significantly change lung volume?” This involves using a two-sided P value. The value of the two-sided P value is twice 0.0082, that is, P .0164.

Please note the negative sign in the above Z value, that is, Z 2.4. For the pur-poses of plugging in the Z value and obtaining a probability, the negative sign can be

79 Norman GR, Streiner DL. Biostatistics. 3rd ed. Hamilton, Ontario: B.C. Decker, Inc.; 2008;35.80 Kirkwood BR, Sterne JA. Essential Medical Statistics. 2nd ed. Malden, MA: Blackwell Science Ltd.; 2003;61–63.

Page 22: Clinical Trials || Biostatistics

Clinical Trials186

ignored. In the words of Norman and Streiner (81) “[w]hat we do is ignore the sign, but keep it in our minds.”

Pocock (82) provides additional examples from actual clinical trials. What are provided

are numbers for plugging into the formula, Z x x s= − +( )/[ (( / ) ( / ))]1 0 12

1 02

0n s n , the calculated number for the Z value, instructions on plugging the Z value into the standard table, the value for P that was determined from this table, and instructions for interpreting the P value.

X. SUMMARY

To calculate the P value, the investigator starts with data regarding an event of interest from a first sample, and data regarding the same event but from a second sample. From these data, the investigator then calculates the means, standard deviations, and the Z value. The Z value is then plugged into a standard table with numbers corresponding to areas under a normal distribution curve. In plugging in the Z value, the investigator then arrives at the P value. This particular routine is used where the distribution of values in the first sample (study drug) follows a normal distribution, and where the distribution in the second sample (control treatment) also follows a normal distribution. Where the distribution of values is not normal, that is, where the distribution is skewed or contains two peaks, the investigator should use a different statistical tool, that is, a statistical tool that is a non-parametric test.

Daniel (83) provides a flow chart (decision tree) for determining which statisti-cal formula to use. The decision tree asks if the population is normally distributed, if the sample is large or if the sample is small, and if the population variance (or standard deviation) is known or unknown. Depending on the answers, the researcher may need to use, or may prefer to use, the Z statistic, the t statistic, or a non-parametric test such as the Wilcoxon rank sum test.

XI. THEORY BEHIND THE Z VALUE AND THE TABLE OF AREAS IN THE TAIL OF THE STANDARD NORMAL DISTRIBUTION

In brief, calculating the Z value converts the raw data into a normalized value. The normalized value, when plugged into the standard table, provides an area under a curve that depicts the normal distribution. This area is, in effect, identical with the probability

81 Norman GR, Streiner DL. Biostatistics. 3rd ed. Hamilton, Ontario: B.C. Decker, Inc.; 2008;35.82 Pocock SJ. The simplest statistical test: how to check for a difference between treatments. Brit Med J.

2006;332:1256–1258.83 Daniel WW. Biostatistics. 9th ed. Hoboken, NJ: John Wiley & Sons, Inc.; 2009;176.

Page 23: Clinical Trials || Biostatistics

Biostatistics 187

(P value). The goal of this chapter is to serve as a starting point in biostatistics, and to provide a reference point for use in navigating through textbooks on biostatistics.

XII. STATISTICAL ANALYSIS BY SUPERIORITY ANALYSIS VERSUS BY NON-INFERIORITY ANALYSIS

The two arms found in a typical Kaplan-Meier plot can be compared or analyzed by various statistical methods, including superiority analysis and non-inferiority analysis. Where the study drug is compared to a placebo, superiority analysis is used. But where the study drug is compared with an active control drug, or with the standard or tradi-tional treatment, both superiority analysis and non-inferiority analysis are used (84).

While sponsors and investigators prefer that their drug be superior to the control treatment, the difference in efficacy may be insignificant. Where the difference is insignifi-cant, the clinical trial can be rescued, at least in some situations, by non-inferiority analysis.

Following the clinical trial, the statistician analyzes the results to determine if the study drug is superior to the active control drug. The statistician also analyzes the results to determine if the study drug is not significantly inferior to the active control.

With non-inferiority analysis, the goal of the investigator is to prove that the efficacy of the study drug is better than, equivalent to, or only trivially worse than, the active con-trol in terms of efficacy (85). D’Agostino et al. (86) emphasize that, in designing a non-inferiority clinical trial, the comparator drug should be the best available comparator drug.

In addition to the superiority trial design, and the non-inferiority trial design, another type of trial design is the equivalence trial. The goal of this type of trial is to demonstrate that the study drug is both insignificantly better than and insignificantly worse than an active control drug. Paggio et al. (87) document the fact that published reports of clinical trials frequently confuse the concepts of non-inferiority and equivalence.

In conducting a clinical trial, the sponsor prefers to show that its study drug is superior to an active control drug in terms of efficacy. However, if superiority in terms of efficacy cannot be shown, and where the investigator is not willing to scrap the results from the clinical trial, the results can be salvaged by using non-inferiority analy-sis. In practice, statisticians conduct the non-inferiority analysis first, and once this is complete, they conduct the superiority analysis (88). The following situation concerns

84 U.S. Dept. of Health and Human Services. Food and Drug Administration. Guidance for Industry. Non-inferiority clinical trials. March 2010; (66 pages).

85 Piaggio G, Elbourne DR, Altman DG, Pocock SJ, Evans SJ, CONSORT Group. Reporting of noninferiority and equivalence randomized trials: an extension of the CONSORT statement. J Am Med Assoc. 2006;295:1152–1160.

86 D’Agostino Sr RB, Massaro JM, Sullivan LM. Non-inferiority trials: design concepts and issues – the encounters of academic consultants in statistics. Stat Med. 2003;22:169–186.

87 Piaggio G, Elbourne DR, Altman DG, Pocock SJ, Evans SJ, CONSORT Group. Reporting of noninferiority and equivalence randomized trials: an extension of the CONSORT statement. J Am Med Assoc. 2006;295:1152–1160.

88 The author thanks Dr. Jenna Elder for this fact.

Page 24: Clinical Trials || Biostatistics

Clinical Trials188

a finding of non-inferiority where the efficacy of the study drug is found to be not statistically better than that of the active control drug (comparator drug). In this case, regulatory approval can be granted based on the fact that the study drug is safer, cheaper to produce, easier to administer (injected vs. oral), or where compliance by patients outside of the clinical trial is expected to be better (89,90). Better compliance, for example, will occur with a pill taken once a day compared with a pill that must be taken three times per day at strictly timed intervals.

Non-inferiority analysis is relevant only where the study design compares a study drug with an active control drug (comparator drug). This type of analysis is neither rel-evant nor appropriate where the study design compares study drug with placebo (91).

Where the investigator foresees performing a non-inferiority analysis of the efficacy results, the best trial design is a 3-arm study, involving study drug, active control drug (comparator drug), and placebo. According to the ICH Guidelines, “non-inferiority trials may also incorporate a placebo, thus pursuing multiple goals in one trial; for example, they may establish superiority to placebo and hence validate the trial design and simultaneously evaluate the degree of similarity of efficacy and safety to the active comparator” (92).

D’Agostino et al. (93) provide a diagram with tick marks showing efficacy of vari-ous treatments, where the distance between tick marks represents statistically signifi-cant differences and statistically insignificant differences. The take-home lesson of this diagram is as follows. To arrive at a conclusion that the study drug is non-inferior to an active control drug, the efficacy of the study drug (which takes the form of a range called the confidence interval) must be greater than the efficacy of the placebo (defined as zero efficacy). Also, the efficacy of the study drug must reside in an open-ended range, where the lower end of the range is defined as insignificantly less than the efficacy of the active control, and the upper range is defined as being greater than the efficacy of the active control. This diagram is shown in Fig. 9.2.

For comparison, what is then shown is a diagram of a superiority analysis (Fig. 9.3).Non-inferiority trials are different from superiority trials in terms of how the data

are analyzed, but also in terms of study design. Non-inferiority trials require more study subjects. In the words of one commentator, “[b]ecause the acceptable difference between the two arms is typically smaller by half than what was gained moving from a

91 Durham TA, Turner JR. Introduction to Statistics in Pharmaceutical Clinical Trials. Chicago, IL: PhP Pharmaceutical Press; 2008;187.

92 European Medicines Agency (EMEA) ICH Topic E9. Statistical principles for clinical trials. September 1998; (37 pages).

93 D’Agostino Sr RB, Massaro JM, Sullivan LM. Non-inferiority trials: design concepts and issues – the encounters of academic consultants in statistics. Stat Med. 2003;22:169–186.

89 Lesaffre E. Superiority, equivalence, and non-inferiority trials. Bulletin NYU Hospital for Joint Diseases. 2008;66:150–154.

90 Durham TA, Turner JR. Introduction to Statistics in Pharmaceutical Clinical Trials. Chicago, IL: PhP Pharmaceutical Press; 2008;28:130–131.

Page 25: Clinical Trials || Biostatistics

Biostatistics 189

placebo to a standard treatment, the number of patients required to maintain the same statistical power increases substantially. On average, fourfold more patients are required for a non-inferiority design than for a superiority one” (94).

Le Henanff et al. (95) provide guidance on designing and reporting of non-infe-riority trials. For example, it was recommended that, “[i]t is surely preferable to use standard vocabulary such as treatment A ‘is not inferior to’…treatment B ‘with regard to the margin pre-specified at Δ’ rather than stating a treatment is ‘not substantially lower than’…a control treatment” (96). For reporting the results of non-inferiority tri-als, these authors also recommend disclosing the results from per protocol (PP) analysis,

Efficacy ofplacebo (definedas zero efficacy)

Efficacy of activecontrol

Efficacy of studydrug

Figure 9.2 Non-inferiority trial. In a non-inferiority trial, the efficacy of the study drug must not be significantly less than the efficacy of the active control, and it also must be significantly greater than the apparent efficacy of the placebo group. The open-ended bracket indicates the range of what is significantly greater

Efficacy of studydrug

Efficacy of active control,or of placebo

(placebo defined as zeroefficacy)

Figure 9.3 Superiority trial. In a superiority trial, the efficacy of the study drug must be significantly greater than the efficacy of the active control (or placebo). The open-ended bracket indicates the range of what is significantly greater

94 Tuma RS. Trend toward noninferiority trials mean more difficult interpretation of trial results. J Natl Cancer Inst. 2007;99:1746–1748.

95 Le Henanff A, Giraudeau B, Baron G, Ravaud P. Quality of reporting of noninferiority and equivalence randomized trials. J Am Med Assoc. 2006;295:1147–1151.

96 Le Henanff A, Giraudeau B, Baron G, Ravaud P. Quality of reporting of noninferiority and equivalence randomized trials. J Am Med Assoc. 2006;295:1147–1151.

Page 26: Clinical Trials || Biostatistics

Clinical Trials190

as well as from intent-to-treat (ITT) analysis, and reporting the confidence interval. For non-inferiority trials, per protocol analysis is preferred, while ITT analysis is sec-ondary (97,98). In the context of a non-inferiority trial, per protocol analysis may pro-vide conclusions that are more conservative or careful, than conclusions provided by ITT analysis (99,100).

The confidence interval defines the difference between the efficacy of the study drug and the active control drug. Moreover, these authors recommend reporting the number of subjects dropping out of the trial.

In view of the fact that non-inferiority trials are conducted with the expectation that efficacy of the study drug and active control drug are the same, Dignam (101) has warned against the early termination of clinical trials where the available data implicate the study drug as superior to the active control. Thus, during the early phases of any clinical trial, the available data may show that the study drug clearly works better than the control treatment. But often, data available during the first weeks or months of a clinical trial are of a sporadic nature, that is, early indications of remarkable efficacy or unusual toxicity typically disappear as more and more data are collected. Thus, in the context of a non-inferiority trial, where there is an expectation of no difference in efficacy, investigators should refrain from deciding that the study drug is more effective than the active control, where the clinical trial is only partly completed.

99 Matsuyama Y. A comparison of the results of intent-to-treat, per-protocol, and g-estimation in the presence of non-random treatment changes in a time-to-event non-inferiority trial. Stat Med. 2010;29:2107–2116.

100 Matilde Sanchez M, Chen X. Choosing the analysis population in non-inferiority studies: per protocol or intent-to-treat. Stat Med. 2006;25:1169–1181.

101 Dignam JJ. Early viewing of noninferiority trials in progress. J Clin Oncol. 2005;23:5461–5463.

97 Sanjay Mitter, personal communication of May 13, 2011. 98 The author thanks Dr. Jenna Elder for this advice.