RACE 615 Introduction to Medical Statistics

Sample size for Estimation

Assoc.Prof.Dr.Ammarin Thakkinstian [email protected] www.ceb-rama.org

Doctor of Philosophy Program in Clinical Epidemiology, Section for Clinical Epidemiology & Biostatistics, Faculty of Medicine Ramathibodi Hospital, Mahidol University

Semester 1, 2015




CONTENTS

INTRODUCTION

SAMPLE SIZE FOR ESTIMATION
  One proportion
  Diagnostic study

SAMPLE SIZE FOR TEST FOR DIFFERENCE
  One proportion
  Two proportions with independent samples
  Two proportions with dependent samples
  More than two groups of proportions
  Two independent means
  Two dependent means
  More than two groups of means

TEST FOR EQUIVALENCE
  Continuous data
  Two independent means
  Two dependent samples
  Cross-over trial

NON-INFERIORITY
  Continuous data
  Two independent means
  Cross-over design
  Dichotomous data

ASSIGNMENT VI


OBJECTIVES

Students should be able to:

1. Appreciate the importance of a priori sample size estimation

2. Know what information and background knowledge are required before estimating sample size

3. Appropriately estimate the sample size corresponding to the primary objective and study design, which aims to:

a. Estimate prevalence (or incidence)

b. Test for differences

c. Test for equivalence or non-inferiority

REFERENCES

1. Kish L. Survey Sampling. New York: John Wiley & Sons, Inc.; 1965.

2. Ingsathit A, Thakkinstian A, Chaiprasert A, et al. Prevalence and risk factors of chronic kidney disease in the Thai adult population: Thai SEEK study. Nephrology, Dialysis, Transplantation. 2010;25(5):1567-1575.

3. Anothaisintawee T, Rattanasiri S, Ingsathit A, et al. Prevalence of chronic kidney disease: a systematic review and meta-analysis. Clinical Nephrology. 2009;71(3):244-254.

4. Julious SA, Campbell MJ. Tutorial in biostatistics: sample sizes for parallel group clinical trials with binary data. Statistics in Medicine. 2012;31(24):2904-2936.

5. Dupont WD, Plummer WD. Power and sample size calculations: a review and computer program. Controlled Clinical Trials. 1990;11:116-128.

6. Dupont WD. Power calculations for matched case-control studies. Biometrics. 1988;44(4):1157-1168.

7. Sample size. In: Schlesselman JJ, ed. Case-control studies: Design, conduct, analysis. Oxford: Oxford University Press; 1982:144-165.

8. Barthel FMS, Royston P, Babiker A. Menu-driven facility for complex sample size calculation in randomized controlled trials with survival or a binary outcome: Update. STATA Journal. 2005;5(1):123-129.

9. Kamanamool N, McEvoy M, Attia J, et al. Efficacy and adverse events of mycophenolate mofetil versus cyclophosphamide for induction therapy of lupus nephritis: systematic review and meta-analysis. Medicine. 2010;89(4):227-235.

10. Bruin J. newtest: command to compute new test. UCLA: Statistical Consulting Group. 2006. (http://www.ats.ucla.edu/stat/stata/ado/analysis/). (Accessed 04/09 2013).

11. Julious SA. Sample sizes for clinical trials with normal data. Statistics in Medicine. 2004;23(12):1921-1986.

12. Julious SA. SampSize. In: White R, Wroblewski D, Julious SA, et al., eds. Sheffield, UK: EpiGenesys, 2012.

13. D'Agostino RB, Sr., Massaro JM, Sullivan LM. Non-inferiority trials: design concepts and issues - the encounters of academic consultants in statistics. Statistics in Medicine. 2003;22(2):169-186.

14. Dann RS, Koch GG. Methods for one-sided testing of the difference between proportions and sample size considerations related to non-inferiority clinical trials. Pharmaceutical Statistics. 2008;7(2):130-141.


READING SECTION

Appendix I: Schulz KF, Grimes DA. Sample size calculations in randomised trials: mandatory and mystical. Lancet. 2005 Apr 9-15;365(9467):1348-53.

Appendix II: Dupont WD. Power calculations for matched case-control studies. Biometrics. 1988 Dec;44(4):1157-68.

Appendix III: Julious SA. Sample sizes for clinical trials with normal data. Stat Med. 2004 Jun 30;23(12):1921-86.

Appendix IV: Julious SA, Campbell MJ. Tutorial in biostatistics: sample sizes for parallel group clinical trials with binary data. Statistics in Medicine. 2012;31(24):2904-36.

FURTHER READING

Appendix V: Stat Med. 2002 Oct 15;21(19):2807-14.

Appendix VI: JAMA 2006; 295: 1152

Appendix VII: Statistics in Medicine 2003; 22: 169

Appendix VIII: Pharmaceut Statist 2008; 7: 130

ASSIGNMENT VI (25%)

P. 65, Due: October 15, 2015


INTRODUCTION

Sample size estimation must be planned before research is conducted, and the method of estimation should be clearly described in the research proposal. Investigators commonly ask why the sample size needs to be estimated at all. There are two main reasons. First, it tells investigators how large or small an effect size the study will be able to detect with the estimated sample size. Second, once the sample size is estimated, investigators can assess feasibility in terms of the time required, the estimated budget, the frequency of the event of interest, and the manpower needed to conduct the research.

Before estimating the sample size, the statistician and the investigator need to be clear about the following:

- What are the primary and secondary objectives?

- What is the study design?

- Will the sample size be estimated from the primary objective only, or will it cover both the primary and secondary objectives?

- What information is required for the estimation, and where and how can it be obtained? For instance, we may need the prevalence/incidence of the disease of interest, the expected number of patients per month or year in each setting, the effect size that the investigators want to detect, etc.

- How should the following values be set?

  - Type I (false positive) and type II (false negative) errors

  - Size of the difference (or equivalence) that the investigator wants to detect. The research team should discuss how large or small the difference needs to be to be clinically significant.

This module describes how to estimate sample size in health science research, which primarily

aims for estimation and hypothesis testing. For hypothesis testing, tests for difference and

equivalence/non-inferiority are covered for both continuous and dichotomous outcomes.


SAMPLE SIZE FOR ESTIMATION

One proportion

Here the primary aim of the study is to estimate the prevalence of a disease or an event of interest, for instance the prevalence of steroid abuse, chronic kidney disease, obesity, or metabolic syndrome. The study design can be either a hospital-based or a community-based cross-sectional study. The study can be single-centre or multi-centre, depending on how common or rare the disease is and on how representative the investigators want the results to be. For instance, a study that estimates the prevalence of complications (e.g., nephropathy, retinopathy) in diabetic patients at Ramathibodi Hospital is a single-hospital-based study. These complications are not too rare, so a study in only one centre is possible. However, if the investigators want a prevalence estimate that represents the population of Bangkok, the study should be expanded to other tertiary hospitals in Bangkok.

The equation for estimation for a single site is as follows:

$$ n = \frac{Z_{\alpha/2}^{2}\,\hat{p}\,\hat{q}}{\Delta^{2}} \qquad (1) $$

where

$\hat{p}$ = prevalence that we want to estimate

$\hat{q} = 1 - \hat{p}$

$\Delta$ = confidence interval (CI) width

The required information is the magnitude of the prevalence from previous studies, which can be retrieved by reviewing the literature or the medical statistics of the setting. The type I error and the confidence interval width also need to be set. As the examples below show, Δ is used as the half-width of the interval (the estimate ± Δ). The sample size depends mostly on the magnitude of the prevalence and on the confidence interval width. With the confidence interval width and type I error fixed, a prevalence close to 0.5 yields the largest sample size. With the prevalence fixed, the narrower the confidence interval, the larger the required sample size.
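As a rough illustration of equation (1), the Python sketch below shows how the required sample size changes with the prevalence and with Δ. The function name `n_for_proportion` is an assumption made for this illustration, not part of the course material, and Z is taken from the standard normal distribution instead of being rounded to 1.96.

```python
import math
from statistics import NormalDist

def n_for_proportion(p, delta, alpha=0.05):
    """Equation (1): n = Z^2_{alpha/2} * p * (1 - p) / delta^2, rounded up."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return math.ceil(z ** 2 * p * (1 - p) / delta ** 2)

# With delta fixed at 0.05, the required n is largest at p = 0.5.
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"p = {p:.1f}  n = {n_for_proportion(p, delta=0.05)}")

# With p fixed at 0.5, halving delta roughly quadruples n.
print(n_for_proportion(0.5, 0.05), n_for_proportion(0.5, 0.025))
```

Because p(1 - p) is symmetric around 0.5, prevalences of 0.1 and 0.9 require the same sample size.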

Example 1

Investigators aim to estimate the prevalence of hepatitis B virus positivity in blood donors at one hospital. Previous reports in the literature showed that the prevalence of hepatitis B in blood donors varied from 5% to 15%. According to the hospital's medical statistics, there have been about 25,000 donors per year over the last five years. The prevalence used should be 0.15 because it yields a larger sample size than a prevalence of 0.05. The investigators expect the estimate of the true prevalence to be precise (i.e., not to vary too much), so they first set the CI at ± 5%. However, this CI width is not sensible if the true prevalence is as low as 5%, because the estimated prevalence would then range from 0% to 10%! They therefore reset the CI width to ± 3% and the type I error to 5%, and the estimated sample size is:

$$ n = \frac{Z_{\alpha/2}^{2}\,\hat{p}\,(1-\hat{p})}{\Delta^{2}} = \frac{1.96^{2}\times 0.15\,(1-0.15)}{0.03^{2}} = 544.23 \approx 545 $$

That is, specimens from 545 donors are needed for laboratory testing. This estimate is suitable for a single-centre study that applies simple random sampling to select subjects.

A multi-centre study that applies stratified or cluster sampling needs further adjustment, as in Example 2.


Example 2

The investigators conducted a hospital-based cross-sectional survey to estimate the prevalence of steroid abuse in Thailand. Stratified cluster random sampling was applied: the strata were 5 regions (i.e., northern, southern, north-eastern, eastern, and central) and the clusters were the regional/provincial hospitals within each stratum. A previous study showed that the proportion of steroid abuse was 575/100,000 patients/year. The investigators wanted a precise estimate and thus set the 95% confidence width at 300/100,000 patients/year. As a result, the estimate could range from 275/100,000 to 875/100,000 patients/year. The expected sample size was:

$$ n = \frac{Z_{\alpha/2}^{2}\,\hat{p}\,(1-\hat{p})}{\Delta^{2}} = \frac{1.96^{2}\times 0.00575\,(1-0.00575)}{0.003^{2}} = 2440.2 \approx 2441 \text{ subjects} $$

With this sample size, the expected number of subjects with steroid abuse is about 14. A 95%

CI of this estimation is as below.

cii 2441 14

-- Binomial Exact --

Variable | Obs Mean Std. Err. [95% Conf. Interval]

-------------+---------------------------------------------------------------

| 2441 .0057354 .0015284 .003139 .0096042

Since the design was a multi-centre hospital-based study, in which hospitals were randomly selected from the lists of regional/provincial hospitals across the country, the sample size estimation should take the “design effect” into account. That is, subjects within the same hospital are correlated with (similar to) each other, but they differ between hospitals. In this module the design effect is taken as 1/intra-class correlation, and it is usually set at about 1/(0.2-0.5) (1).

The sample size for multi-stage sampling therefore equals the sample size from simple random sampling multiplied by the design effect. For instance, this study would need to enroll 2441 x 2 = 4882 subjects if the intra-class correlation and design effect were 0.5 and 2, respectively.
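The adjustment described above can be sketched in a few lines of Python. This is only an illustration under the text's own assumptions (Z rounded to 1.96, design effect applied as a simple multiplier); the helper name `inflate_for_design` and the alternative design effects in the loop are hypothetical.

```python
import math

Z = 1.96               # Z_{alpha/2} for a 5% type I error, as used in the text
p = 575 / 100_000      # expected proportion from the previous study
delta = 300 / 100_000  # half-width of the 95% CI

# Sample size under simple random sampling (Example 2).
n_srs = math.ceil(Z ** 2 * p * (1 - p) / delta ** 2)

def inflate_for_design(n, design_effect):
    """Multiply the simple-random-sampling size by the design effect, rounding up."""
    return math.ceil(n * design_effect)

print("simple random sampling:", n_srs)
for deff in (1.5, 2.0, 3.0):  # illustrative values; the text uses 2
    print(f"design effect {deff}: {inflate_for_design(n_srs, deff)} subjects")
```

With a design effect of 2 this reproduces the 4882 subjects quoted in the text.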

Sometimes investigators already have an idea of how many subjects they can enroll, based on the manpower and budget in hand. In that case, range estimates (e.g., the 95% CI) based on the expected sample size can be more practical: they show how wide the estimate will be and whether this is acceptable. If it is not, the sample size can be re-justified by changing the design effect and/or the level of error. This approach can be worked out using EPI-INFO version 6.2b (EPITABLE) or STATA's ‘cii’ command. For instance, if this study could enroll 5,000-6,000 subjects with an expected proportion of 575/100,000, the expected point estimates and their 95% CIs can be calculated as follows:

*n = 5000

disp .00575*5000

cii 5000 29

-- Binomial Exact --

Variable | Obs Mean Std. Err. [95% Conf. Interval]

-------------+---------------------------------------------------------------

| 5000 .0058 .0010739 .0038877 .0083192

*n = 6000

disp 0.00575*6000 /*34.5*/

cii 6000 35

-- Binomial Exact --

Variable | Obs Mean Std. Err. [95% Conf. Interval]

-------------+---------------------------------------------------------------

| 6000 .0058333 .0009831 .0040664 .0081035
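For readers without STATA, the exact binomial intervals above can be approximated with the standard library alone. The sketch below assumes that the "Binomial Exact" interval is the Clopper-Pearson interval and finds its limits by bisection on the binomial distribution function; the function names are hypothetical and the code is an illustration, not the algorithm STATA uses internally.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with each term computed in log space."""
    if k < 0:
        return 0.0
    total = 0.0
    for i in range(k + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_term)
    return min(total, 1.0)

def solve_decreasing(f, target, iterations=80):
    """Bisection on (0, 1) for f(p) = target, where f decreases as p grows."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid  # f is still too large, so the root lies at a larger p
        else:
            hi = mid
    return (lo + hi) / 2

def exact_ci(k, n, alpha=0.05):
    """Clopper-Pearson interval for k events among n subjects."""
    lower = 0.0 if k == 0 else solve_decreasing(
        lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    upper = 1.0 if k == n else solve_decreasing(
        lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper

for k, n in ((14, 2441), (29, 5000), (35, 6000)):
    lower, upper = exact_ci(k, n)
    print(f"{k}/{n}: {k / n:.7f}  95% CI {lower:.7f} to {upper:.7f}")
```

The printed limits can be compared directly with the `cii` output shown above.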


Diagnostic study

Example 3

In a diagnostic study, investigators aim to estimate the performance of a test of interest compared with the standard test. Performance can be measured using sensitivity, specificity, positive/negative predictive values, or likelihood ratios. The study design is mostly cross-sectional, in which subjects undergo both the test of interest and the standard test within the same period. The decision to perform the standard test should not depend on the result of the test of interest, or vice versa.

In this example, the investigators aimed to estimate the sensitivity of visual inspection with acetic acid (VIA) for cervical cancer screening, for which the Papanicolaou smear (Pap smear) has generally been used. However, the Pap smear needs to be performed by an obstetrician/gynecologist and the result has to be reported by a pathologist. VIA is claimed to be more convenient because general practitioners can perform it and the result can be read straight away. The investigators expected that the sensitivity of the VIA test should not be less than that of the Pap smear, whose reported sensitivity is 70%. The estimate is allowed to vary within ± 5%, and the prevalence of cervical cancer reported by previous studies was about 14%. The sample size calculation is as follows:

$$ n = \frac{Z_{\alpha/2}^{2}\,\hat{p}\,\hat{q}}{\Delta^{2}} = \frac{1.96^{2}\times 0.70\,(1-0.70)}{0.05^{2}} = 322.69 \approx 323 $$

That is, 323 subjects with cervical cancer are needed to estimate the sensitivity with the precision the investigators expect. Given the prevalence of 14%, 323/0.14 = 2307.14, i.e., 2308 subjects need to be enrolled.
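The two-step logic of Example 3 (first the number of diseased subjects, then the total to enroll) can be sketched as below. The function name `n_diagnostic` is hypothetical and Z is rounded to 1.96 as in the text.

```python
import math

def n_diagnostic(sensitivity, delta, prevalence, z=1.96):
    """Return (diseased subjects needed, total subjects to enroll)."""
    # Step 1: subjects WITH the disease needed to estimate sensitivity within +/- delta.
    n_diseased = math.ceil(z ** 2 * sensitivity * (1 - sensitivity) / delta ** 2)
    # Step 2: only a fraction `prevalence` of enrolled subjects will have the disease.
    n_total = math.ceil(n_diseased / prevalence)
    return n_diseased, n_total

print(n_diagnostic(sensitivity=0.70, delta=0.05, prevalence=0.14))
```

The same function could be used for specificity by passing the expected specificity and dividing by (1 - prevalence) instead, which is not covered in the example above.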

SAMPLE SIZE FOR TEST FOR DIFFERENCE

One proportion

We usually compare the prevalence found in a study with the prevalence in a reference population or with a prevalence previously reported in the literature. For instance:

- compare the prevalence of diabetes in Thailand with the prevalence reported in China,

- compare the prevalence of chronic kidney disease in the Thai population with that in Caucasians,

- compare the prevalence of the Gln and Glu alleles of beta-2 adrenoreceptor polymorphisms in the Thai population with that found in Caucasians.

All of these examples involve only one study group, and most of the study designs are cross-sectional. The hypotheses and the equation for sample size calculation are as follows:

Ho: P = Po

Ha: P ≠ Po (= Pa)

As for the figure above, the critical value c can be solved under the null and alternative hypotheses as follows:

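Under H0:

$$ c = P_0 + Z_{\alpha/2}\sqrt{\frac{P_0(1-P_0)}{n}} $$

Under Ha:

$$ c = P_a - Z_{1-\beta}\sqrt{\frac{P_a(1-P_a)}{n}} $$

Equating the two expressions for c and solving for n:

$$
\begin{aligned}
P_0 + Z_{\alpha/2}\sqrt{\frac{P_0(1-P_0)}{n}} &= P_a - Z_{1-\beta}\sqrt{\frac{P_a(1-P_a)}{n}} \\
P_a - P_0 &= \frac{1}{\sqrt{n}}\left[ Z_{\alpha/2}\sqrt{P_0(1-P_0)} + Z_{1-\beta}\sqrt{P_a(1-P_a)} \right] \\
\sqrt{n} &= \frac{Z_{\alpha/2}\sqrt{P_0(1-P_0)} + Z_{1-\beta}\sqrt{P_a(1-P_a)}}{P_a - P_0} \\
n &= \frac{\left[ Z_{\alpha/2}\sqrt{P_0(1-P_0)} + Z_{1-\beta}\sqrt{P_a(1-P_a)} \right]^{2}}{(P_a - P_0)^{2}}
\end{aligned}
$$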

Example 4

Investigators wanted to conduct a cross-sectional survey whose primary aim was to estimate the prevalence of CKD across Thailand (2). They also asked whether CKD in Thailand was as common as in other Asian countries, and therefore set a secondary objective of comparing the CKD prevalence in Thailand with the prevalence in the Asian population. From a systematic review of previous studies, the pooled prevalence of CKD stage III or higher in the Asian population was 8.3% (95% CI: 4.3%, 12.4%) (3). They wondered how many subjects needed to be enrolled to answer the secondary objective. Type I and II errors were set at 5% and 20%, respectively, and the size of the difference that they wanted to detect was ± 5%. The sample size could be estimated as follows:

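$$ n = \frac{\left[ Z_{\alpha/2}\sqrt{P_0(1-P_0)} + Z_{1-\beta}\sqrt{P_a(1-P_a)} \right]^{2}}{(P_a - P_0)^{2}} = \frac{\left[ 1.96\sqrt{0.083(1-0.083)} + 0.84\sqrt{0.133(1-0.133)} \right]^{2}}{(0.133-0.083)^{2}} = 272.9 \approx 273 $$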

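Thus, at least 273 subjects were required to compare the current prevalence with the previous one, provided that the difference was 5% or higher.

This example can be estimated using STATA as follows (STATA reports 274 rather than 273 because it does not round the normal quantiles to 1.96 and 0.84):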

power oneproportion 0.083 0.133, power(.8)

Estimated sample size for a one-sample proportion test

Score z test

Ho: p = p0 versus Ha: p != p0

Study parameters:

alpha = 0.0500

power = 0.8000

delta = 0.0500

p0 = 0.0830

pa = 0.1330

Estimated sample size:

N = 274

*Vary delta 0.05-0.10

power oneproportion 0.083 (0.133(0.01)0.183), power(0.8) table(_all) graph

Ho: p = p0 versus Ha: p != p0

+-----------------------------------------------------------------+

| alpha power beta N delta p0 pa diff |

|-----------------------------------------------------------------|

| .05 .8 .2 274 .05 .083 .133 .05 |

| .05 .8 .2 194 .06 .083 .143 .06 |

| .05 .8 .2 146 .07 .083 .153 .07 |

| .05 .8 .2 114 .08 .083 .163 .08 |

| .05 .8 .2 92 .09 .083 .173 .09 |

| .05 .8 .2 75 .1 .083 .183 .1 |

| .05 .8 .2 63 .11 .083 .193 .11 |

+-----------------------------------------------------------------+
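The hand calculation and the first row of the STATA table can both be reproduced with the one-proportion formula. The sketch below is an illustration only; the function name `n_one_proportion` is hypothetical, and the difference between the two results comes solely from whether the normal quantiles are rounded.

```python
import math
from statistics import NormalDist

def n_one_proportion(p0, pa, z_alpha, z_power):
    """n = [z_alpha*sqrt(p0(1-p0)) + z_power*sqrt(pa(1-pa))]^2 / (pa - p0)^2, rounded up."""
    numerator = z_alpha * math.sqrt(p0 * (1 - p0)) + z_power * math.sqrt(pa * (1 - pa))
    return math.ceil((numerator / (pa - p0)) ** 2)

# Rounded quantiles, as in the hand calculation of Example 4.
n_rounded = n_one_proportion(0.083, 0.133, z_alpha=1.96, z_power=0.84)

# Unrounded quantiles for alpha = 0.05 (two-sided) and power = 0.80.
nd = NormalDist()
n_unrounded = n_one_proportion(0.083, 0.133, nd.inv_cdf(0.975), nd.inv_cdf(0.80))

print(n_rounded, n_unrounded)  # 273 and 274
```

In practice the larger of the two figures is the safer choice, since rounding Z downward slightly understates the required sample size.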

Graphical user interface (GUI)


Two proportions with independent samples

A clinical trial or an observational study sometimes aims to compare proportions between two independent groups. For instance:

- Compare the proportions of Glu alleles between asthma and non-asthma groups

- Compare the proportions of chronic kidney disease between patients whose hemoglobin A1c is < 6.5% and ≥ 6.5%

- Compare the incidence of cardiovascular events between patients who do and do not receive Rosiglitazone

- Compare the incidence of micro- and macro-albuminuria between patients who receive an Angiotensin Converting Enzyme inhibitor (ACEI) and those who receive other antihypertensive drugs

- Compare the proportion of remission between patients who receive Mycophenolate Mofetil (MMF) and Cyclophosphamide

The test for difference can be a two-sided test if the direction of the difference is not specified, or a one-sided test (also called a superiority test) if the direction is specified. In the latter case, evidence is required to support the expected direction. The null and alternative hypotheses for a two-sided test are as follows (4):

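Ho: P1 - P2 = 0

Ha: P1 - P2 ≠ 0

Based on H0: P1 - P2 = 0, with P1 = P2 = $\bar{P}$

$$ \mathrm{Var}(P_1 - P_2) = \mathrm{Var}(P_1) + \mathrm{Var}(P_2) = \frac{P_1(1-P_1)}{n_1} + \frac{P_2(1-P_2)}{n_2} = \frac{2\bar{P}(1-\bar{P})}{n}; \quad \text{if } n_1 = n_2 = n $$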

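If the ratio of treatment to control group is assigned as 1:1, the equation for sample size calculation is derived as follows.

Under Ho, the critical value c is defined as

$$ c = 0 + Z_{\alpha/2}\sqrt{\frac{2\bar{P}(1-\bar{P})}{n}} $$

Under Ha, it is defined as

$$ c = (P_1 - P_2) - Z_{1-\beta}\sqrt{\frac{P_1(1-P_1)}{n} + \frac{P_2(1-P_2)}{n}} $$

Equating the two expressions for c and solving for n:

$$
\begin{aligned}
Z_{\alpha/2}\sqrt{\frac{2\bar{P}(1-\bar{P})}{n}} &= (P_1 - P_2) - Z_{1-\beta}\sqrt{\frac{P_1(1-P_1)}{n} + \frac{P_2(1-P_2)}{n}} \\
P_1 - P_2 &= \frac{1}{\sqrt{n}}\left[ Z_{\alpha/2}\sqrt{2\bar{P}(1-\bar{P})} + Z_{1-\beta}\sqrt{P_1(1-P_1) + P_2(1-P_2)} \right] \\
\sqrt{n} &= \frac{Z_{\alpha/2}\sqrt{2\bar{P}(1-\bar{P})} + Z_{1-\beta}\sqrt{P_1(1-P_1) + P_2(1-P_2)}}{P_1 - P_2} \\
n &= \frac{\left[ Z_{\alpha/2}\sqrt{2\bar{P}(1-\bar{P})} + Z_{1-\beta}\sqrt{P_1(1-P_1) + P_2(1-P_2)} \right]^{2}}{(P_1 - P_2)^{2}}, \qquad \bar{P} = \frac{P_1 + P_2}{2}
\end{aligned}
$$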

The ratio of treatment to control (n1:n2) can differ from 1:1. If the new treatment is much more expensive than the standard treatment, or is more likely to cause harm from side effects, the investigator may assign subjects to the new versus standard treatment groups in a ratio of 1:2, 1:3, or 1:4. This also applies to an observational case-control study in which the disease is very rare and it is difficult to achieve equal numbers of cases and controls; the investigator then designs the study to have more controls than cases (say 1:2, 1:3, or even 1:4). The same holds for a cohort study in which exposure is rare compared with non-exposure.

The information needed for the calculation (e.g., the event proportion in the control group, the size of the difference to be detected, and the false positive (type I) and false negative (type II) rates) should be set and clearly described in the proposal. Sources of information should be cited if possible. The false positive and false negative rates are usually fixed, whereas the size of the difference to be detected (P1-P2, also called the detectable difference or effect size) can vary, and this component mainly determines the sample size: the smaller the detectable difference, the larger the sample size. Setting this effect size means balancing clinical significance against the feasibility of conducting the research. The effect size should be the smallest difference that is still clinically significant, but in practice this is sometimes not feasible given limitations in time, cost, and manpower. Discussion within the team will help to settle on this size.

The type I error (α), or false positive rate, is the error of rejecting the null hypothesis when it is true (i.e., there is no treatment effect in the population). It is usually set at 5% or lower in clinical trials and medical/health research. This means that, when the null hypothesis is true, the investigators accept a 5% chance of rejecting it.

The type II error (β), or false negative rate, occurs when the study concludes that there is no treatment effect although a treatment effect in fact exists in the population, i.e., the null hypothesis is false. It is usually set at 0.20 or lower, so the power of the test (i.e., 1 - β) is 80% or higher; power is the probability of detecting the treatment effect if it is in fact present.

READ more detail in Appendix I
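To make the link between sample size and power concrete, the two-proportion formula above can be rearranged to give the approximate power for a given n per group. The sketch below is an illustration under the normal approximation with equal group sizes; the function name `power_two_proportions` and the proportions 0.10 and 0.05 are illustrative assumptions, and the negligible probability of rejecting in the wrong direction is ignored.

```python
import math
from statistics import NormalDist

def power_two_proportions(p1, p2, n, alpha=0.05):
    """Approximate power of a two-sided test with n subjects per group."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2
    sd_null = math.sqrt(2 * p_bar * (1 - p_bar))         # spread under Ho
    sd_alt = math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))    # spread under Ha
    # Solve the sample size formula for Z_{1-beta}, then convert it to a probability.
    z_power = (abs(p1 - p2) * math.sqrt(n) - z_alpha * sd_null) / sd_alt
    return nd.cdf(z_power)

for n in (100, 200, 300, 435, 600):
    print(f"n per group = {n}: power = {power_two_proportions(0.10, 0.05, n):.3f}")
```

Power rises with n, so a study that is too small leaves a high type II error even when the type I error is held at 5%.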

Example 5

Investigators wanted to assess whether a calcium supplement of 500 mg/day could reduce osteoporotic fractures compared with placebo. The incidence of fracture in the general population was 0.1, and reducing the incidence to 0.05 would be clinically significant. Type I and II errors were set at 5% and 20%, respectively. The sample size calculation was as follows:

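$$ \bar{P} = \frac{P_1 + P_2}{2} = \frac{0.1 + 0.05}{2} = 0.075 $$

$$ n = \frac{\left[ Z_{\alpha/2}\sqrt{2\bar{P}(1-\bar{P})} + Z_{1-\beta}\sqrt{P_1(1-P_1) + P_2(1-P_2)} \right]^{2}}{(P_1 - P_2)^{2}} = \frac{\left[ 1.96\sqrt{2\times 0.075(1-0.075)} + 0.84\sqrt{0.10(1-0.10) + 0.05(1-0.05)} \right]^{2}}{(0.10-0.05)^{2}} = 433.9 \approx 434 $$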

They needed to enroll at least 434 subjects per group in order to detect a difference in fracture rate of 5% between groups. In a follow-up study it is common for subjects to be lost to follow-up, and the sample size should allow for this. If previous studies by colleagues in the same setting showed a loss to follow-up rate of about 20%, the sample size should be 434 + 434 x 0.2 = 521 subjects/group.

This example can be worked out using statistical software such as STATA or PS as follows:

STATA 13: GUI


power twoproportions .1 .05 , power(.8) test(chi2)

Performing iteration ...

Estimated sample sizes for a two-sample proportions test

Pearson's chi-squared test

Ho: p2 = p1 versus Ha: p2 != p1

Study parameters:

alpha = 0.0500

power = 0.8000

delta = -0.0500 (difference)

p1 = 0.1000

p2 = 0.0500

Estimated sample sizes:

N = 870

N per group = 435
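Example 5 can also be checked with a short Python sketch of the equal-allocation formula. The function name `n_two_proportions` is hypothetical; with Z rounded to 1.96 and 0.84 it gives the hand-calculated 434 per group, and with unrounded quantiles it gives the 435 per group reported by STATA. The loss to follow-up allowance follows the text's additive rule.

```python
import math
from statistics import NormalDist

def n_two_proportions(p1, p2, z_alpha, z_power):
    """Per-group n for comparing two independent proportions with 1:1 allocation."""
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return math.ceil((numerator / (p1 - p2)) ** 2)

n_hand = n_two_proportions(0.10, 0.05, z_alpha=1.96, z_power=0.84)

nd = NormalDist()
n_exact_z = n_two_proportions(0.10, 0.05, nd.inv_cdf(0.975), nd.inv_cdf(0.80))

# Allow for 20% loss to follow-up as in the text: n + n x 0.2.
n_with_loss = math.ceil(n_hand * (1 + 0.20))

print(n_hand, n_exact_z, n_with_loss)  # 434, 435, 521
```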


Vary size of detectable with ratio 1:4

power twoproportions .1 (.07(.01).05), test(chi2) nratio(.25) table(alpha:"Type I error" N:"Total N" N1:"n0" N2:"n1" p1:"p0" p2:"p1" diff:"size of detectable")

Performing iteration ...

Estimated sample sizes for a two-sample proportions test

Pearson's chi-squared test

Ho: p2 = p1 versus Ha: p2 != p1

+-------------------------------------------------------------------------+

| Type I error Total N n0 n1 p0 p1 size of detectable |

|-------------------------------------------------------------------------|

| .05 4388 3510 878 .1 .07 -.03 |

| .05 2365 1892 473 .1 .06 -.04 |

| .05 1445 1156 289 .1 .05 -.05 |

+-------------------------------------------------------------------------+
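The module does not give a formula for unequal allocation, so the sketch below uses a common extension of the equal-allocation formula as an assumption: the pooled proportion is weighted by group size and the variance terms are scaled by the allocation ratio r = n2/n1. The function name `n_unequal_allocation` is hypothetical. Under this assumption the results agree with the STATA table above.

```python
import math
from statistics import NormalDist

def n_unequal_allocation(p1, p2, ratio, alpha=0.05, power=0.80):
    """Return (n1, n2) for a two-sided test, where ratio = n2 / n1."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    z_power = nd.inv_cdf(power)
    # Pooled proportion weighted by group sizes (assumed extension, not from the module).
    p_bar = (p1 + ratio * p2) / (1 + ratio)
    sd_null = math.sqrt(p_bar * (1 - p_bar) * (1 + 1 / ratio))
    sd_alt = math.sqrt(p1 * (1 - p1) + p2 * (1 - p2) / ratio)
    n1 = math.ceil(((z_alpha * sd_null + z_power * sd_alt) / (p1 - p2)) ** 2)
    n2 = math.ceil(ratio * n1)
    return n1, n2

for p2 in (0.07, 0.06, 0.05):
    n1, n2 = n_unequal_allocation(0.10, p2, ratio=0.25)
    print(f"p1 = 0.10, p2 = {p2}: n1 = {n1}, n2 = {n2}, total = {n1 + n2}")
```

With ratio = 1 the function reduces to the equal-allocation formula, which is a quick way to check it against Example 5.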

The Power and Sample size calculation (PS) program is another free, publicly available software package that can be used in medical research (5). The program covers dichotomous, continuous, and time-to-event outcomes, which appear as icons for the corresponding statistical tests.

We need to choose the type of (primary) outcome of our study before the calculation. In addition, three other parts need to be filled in, i.e., output, design, and input. The output refers to what we want to know: the sample size, the power of the test, or the detectable alternative (a range of detectable differences). The design part asks about our study design (matched vs unmatched), case-control setting (retrospective vs prospective), expression of the hypothesis (two proportions vs relative risk), and statistical test (Fisher's exact vs chi-square test). Finally, the input requires the type I error, type II error, p1 and p2 (for a dichotomous outcome), and the ratio of A/B treatments. Please note that the PS program uses ‘retrospective’ for a case-control study and ‘prospective’ for a cohort study, which reflects an older concept of study design. Under the current concept, retrospective or prospective refers only to the direction of the study, not to the study design. A randomized controlled trial or a retrospective cohort should use the ‘prospective’ option. The sample size for the above example can be estimated with PS as below:


Both STATA and PS provide options with and without continuity correction for rare events. With continuity correction, Fisher's exact test is applied instead of the chi-square test, and thus the sample size is larger and more conservative than without continuity correction (for details of the differences between Fisher's exact and chi-square tests, see the RACE 615 course: statistical inference).

Sample size can also be estimated by testing the magnitude of an effect size or treatment effect (e.g., relative risk, odds ratio, or hazard ratio) instead of the difference in proportions. For instance, the null hypothesis is that the odds ratio equals 1 in a case-control study, and that the risk ratio (or hazard ratio) equals 1 in a cohort study or clinical trial. The sample size can be estimated as follows:

For a case-control study, the null hypothesis is

Ho: OR = 1

$$ P_2 = \frac{OR \times P_1}{OR \times P_1 + (1 - P_1)} $$

For a cohort study or randomized controlled trial, the null hypothesis is

Ho: RR = 1

$$ RR = \frac{I_1}{I_0} $$

P2 or I1 can then be estimated from the above equations and substituted into the equation for two proportions. Alternatively, the calculation can be done in STATA, or in PS with the ‘relative risk’ option for hypothesis expression.

Example 6

In a case-control study of risk factors for steroid abuse, investigators wanted to assess whether using traditional medicine was associated with adrenal insufficiency or adrenal crisis. Previous reports showed that the prevalence of traditional medicine use in the general population was about 15%. The odds ratio to be detected was set at 1.5. Since cases (i.e., adrenal insufficiency or adrenal crisis) are quite rare, the ratio of cases to controls was set at 1:4. False positive and false negative rates were set at 5% and 20%, respectively. P2 can be estimated as:

$$ P_2 = \frac{OR \times P_1}{OR \times P_1 + (1 - P_1)} = \frac{1.5 \times 0.15}{1.5 \times 0.15 + (1 - 0.15)} = 0.21 $$
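The conversion from an odds ratio (or risk ratio) to the second proportion is easy to get wrong by hand, so a small sketch may help. The function names are hypothetical; the round trip at the end simply confirms that the odds of P2 divided by the odds of P1 returns the odds ratio that was supplied.

```python
def p2_from_odds_ratio(p1, odds_ratio):
    """P2 = OR x P1 / (OR x P1 + (1 - P1))."""
    return odds_ratio * p1 / (odds_ratio * p1 + (1 - p1))

def p2_from_risk_ratio(p1, risk_ratio):
    """For a cohort study or trial, RR = I1 / I0, so I1 = RR x I0."""
    return risk_ratio * p1

p1 = 0.15
p2 = p2_from_odds_ratio(p1, odds_ratio=1.5)
print(f"P2 = {p2:.4f}")  # about 0.21, as in Example 6

# Round trip: recover the odds ratio from the two proportions.
recovered_or = (p2 / (1 - p2)) / (p1 / (1 - p1))
print(f"recovered OR = {recovered_or:.2f}")
```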


We can use PS to calculate the sample size straight away as follows:

This study needs to enroll 397 cases and 397 x 4 = 1,588 controls to be able to detect an odds ratio of 1.5. If about 5% of the data are expected to be incomplete or missing, taking this into account requires 2,085 subjects in total.

This can be estimated by STATA 13 as follows:

power twoproportions .15, test(chi2) oratio(1.5) nratio(.25) table(alpha:"Type I error" power:"Power of test" N:"Total N" N1:"n0" N2:"n1" delta:"effect size" oratio:"OR")

Performing iteration ...

Estimated sample sizes for a two-sample proportions test

Pearson's chi-squared test

Ho: p2 = p1 versus Ha: p2 != p1


+------------------------------------------------------------------------+

| Type I error Power of test Total N n0 n1 effect size OR |

|------------------------------------------------------------------------|

| .05 .8 1984 1587 397 1.5 1.5 |

+------------------------------------------------------------------------+

Vary OR from 1.2-1.5

power twoproportions .15, test(chi2) oratio(1.2(.1)1.5) nratio(.25) table(alpha:"Type I error" power:"Power of test" N:"Total N" N1:"n0" N2:"n1" delta:"effect size" oratio:"OR")

Performing iteration ...

Estimated sample sizes for a two-sample proportions test

Pearson's chi-squared test

Ho: p2 = p1 versus Ha: p2 != p1

+------------------------------------------------------------------------+

| Type I error Power of test Total N n0 n1 effect size OR |

|------------------------------------------------------------------------|

| .05 .8 10717 8573 2144 1.2 1.2 |

| .05 .8 5010 4008 1002 1.3 1.3 |

| .05 .8 2958 2366 592 1.4 1.4 |

| .05 .8 1984 1587 397 1.5 1.5 |

+------------------------------------------------------------------------+

Example 7

A clinical trial of glycemic control compared cardiovascular risk between Rosiglitazone and other medicines. Rosiglitazone is claimed to increase the risk of cardiovascular events, but it might give better glycemic control than other oral regimens. Investigators wanted to make sure that their sample size was able to detect not only a difference in HbA1c level but also a difference in cardiovascular events. The latter outcome needs a larger sample size than the former because it is categorical. Previous studies reported that the incidence of cardiovascular events was about 5% in patients who received Rosiglitazone. The false positive rate, power, and ratio of Rosiglitazone to control were set at 0.05, 0.80, and 1:1, respectively. The risk ratio that the investigators wanted to detect was set at 1.5, the value used in the STATA command below. The sample size calculation using PS is displayed below:


Therefore, 1,471 subjects needed to be enrolled into each group. If a loss to follow-up rate of 20% was taken into account, each group needed 1,766 subjects.
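The adjustment for loss to follow-up can be written out explicitly. The text multiplies the required size by (1 + loss rate); a common, slightly more conservative alternative (an assumption here, not taken from the text) divides by (1 - loss rate). Both helper names are hypothetical.

```python
from math import ceil


def inflate_by_multiplication(n: int, loss_rate: float) -> int:
    """Adjustment used in the text: n x (1 + loss rate), rounded up."""
    return ceil(n * (1 + loss_rate))


def inflate_by_division(n: int, loss_rate: float) -> int:
    """Alternative adjustment: n / (1 - loss rate), rounded up."""
    return ceil(n / (1 - loss_rate))


per_group = 1471
print(inflate_by_multiplication(per_group, 0.20))  # 1766, as in the text
print(inflate_by_division(per_group, 0.20))        # 1839
```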

STATA

power twoproportion 0.05, rrisk(1.5) test(chi2)

Performing iteration ...

Estimated sample sizes for a two-sample proportions test

Pearson's chi-squared test

Ho: p2 = p1 versus Ha: p2 != p1

Study parameters:

alpha = 0.0500

power = 0.8000

delta = 1.5000 (relative risk)

p1 = 0.0500

p2 = 0.0750

rrisk = 1.5000

Estimated sample sizes:

N = 2942

N per group = 1471


Two proportions with dependent samples

Sometimes investigators may want to match cases and controls on a few characteristics (e.g., age, sex, ethnicity) that are known to be associated with the outcome. Thus, the selection of controls is designed to depend on these characteristics of the cases. This type of sample is called a dependent/related sample, since controls depend on a few characteristics of cases. The data layout for a matched-pair case-control study is:

                    Control
Case        Exp+    Exp-    Total
Exp+        n11     n10     R1
Exp-        n01     n00     R2
Total       C1      C2      M

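The null hypothesis that we want to test for this design is that the two types of discordant pairs are equally likely:

H0: P = n10/(n10 + n01) = 0.5

The equation for sample size calculation for m discordant pairs is (6, 7):

$$m = \frac{\left[Z_{\alpha/2}/2 + Z_{\beta}\sqrt{P(1-P)}\right]^2}{(P - 0.5)^2}, \qquad P = \frac{\psi}{1+\psi} \approx \frac{R}{1+R}$$

ψ (pronounced psi) = odds ratio, R = risk ratio,

P = proportion (probability) of discordant pairs in which the case is exposed = n10/(n10 + n01) = n10/m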

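We need a total of M pairs on average to yield m discordant pairs:

$$M = \frac{m}{p_e}, \qquad p_e = p_0 q_1 + p_1 q_0, \qquad p_1 = \frac{OR \times p_0}{OR \times p_0 + (1 - p_0)}$$

where p0 and p1 are the exposure proportions in controls and cases, and q = 1 - p.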

Example 8

In a case-control study of factors associated with hip fracture, the authors wanted to pair-match cases and controls on age. One factor of interest was a history of taking traditional medicine, for which previous literature reported a prevalence of about 30% in the general population. The ratio of cases per control, type I error, type II error, and OR were set at 1:1, 0.05, 0.2, and 2.5, respectively. The sample size could be calculated as follows (7):

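$$m = \frac{\left[Z_{\alpha/2}/2 + Z_{\beta}\sqrt{P(1-P)}\right]^2}{(P - 0.5)^2} = \frac{\left[1.96/2 + 0.84\sqrt{(2.5/3.5)(1 - 2.5/3.5)}\right]^2}{(2.5/3.5 - 0.5)^2} = 40.2$$

$$p_1 = \frac{OR \times p_0}{OR \times p_0 + (1 - p_0)} = \frac{2.5 \times 0.3}{2.5 \times 0.3 + (1 - 0.3)} \approx 0.5$$

$$p_e = p_0 q_1 + p_1 q_0 = 0.3(1 - 0.5) + 0.5(1 - 0.3) = 0.5$$

$$M = \frac{m}{p_e} = \frac{40.2}{0.5} \approx 80$$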
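The hand calculation for Example 8 can be reproduced as follows. This is a sketch of the discordant-pair formula given above using unrounded normal quantiles and an unrounded p1, so the results differ slightly from the rounded figures in the text; the function name `matched_pairs_needed` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def matched_pairs_needed(p0, odds_ratio, alpha=0.05, power=0.80):
    """Return (m, M): discordant pairs needed and total pairs needed."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p = odds_ratio / (1 + odds_ratio)            # P = psi / (1 + psi)
    m = (z_alpha / 2 + z_beta * sqrt(p * (1 - p))) ** 2 / (p - 0.5) ** 2
    p1 = odds_ratio * p0 / (odds_ratio * p0 + (1 - p0))
    p_e = p0 * (1 - p1) + p1 * (1 - p0)          # probability of a discordant pair
    return m, m / p_e


m, total_pairs = matched_pairs_needed(p0=0.3, odds_ratio=2.5)
print(round(m, 1), round(total_pairs, 1))
```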

STATA 13:


power pairedproportions .3, corr(0) oratio(2.5)

Performing iteration ...

Estimated sample size for a two-sample paired-proportions test

Large-sample McNemar's test

Ho: p+1 = p1+ versus Ha: p+1 != p1+

Study parameters:

alpha = 0.0500

power = 0.8000

delta = 2.5000 (odds ratio)

p1+ = 0.3000

p+1 = 0.5172

corr = 0.0000

odds ratio = 2.5000

Estimated sample size:

N = 82

However, this equation ignores the correlation between samples: controls were selected based on a few characteristics of the cases, so the exposures in cases and controls are not independent, and this should be taken into account when estimating the sample size. The correlation depends on the proportion of exposure in controls and the OR that we want to detect. The correlation (φ, pronounced phi) is defined as (5):

$$\phi = \frac{cov(x_k, y_k)}{\sigma_x \sigma_y}, \qquad x_k = 0/1;\; y_k = 0/1$$

and is evaluated from OR, p0, q0, p1, and q1.

For example 8, φ = 0.07.

The sample size can be calculated using the PS program as follows:


READ more detail in Appendix II


More than two groups of proportions

Clinical research sometimes has more than two groups of interest. For instance, investigators may aim to compare the incidence of gastro-intestinal ulcer between arthritis patients who received Celecoxib, Valdecoxib, and Naproxen; recovery rates between Acyclovir plus Prednisolone, Acyclovir alone, and Prednisolone alone; or remission rates between MMF plus Prednisolone, Cyclophosphamide plus Prednisolone, and MMF alone. If investigators treat the outcome as binary data, sample sizes can be estimated as in the example below:


Example 9

Investigators wished to estimate the sample size for a randomized controlled trial of treatment regimens in Bell’s Palsy. There were 3 treatments of interest: Acyclovir plus Prednisolone, Acyclovir alone, and Prednisolone alone. The ratio between treatment groups was 1:1:1, and alpha and power were set at 0.05 and 0.8, respectively. A review of previous studies showed that the recovery rate within 3 months after receiving Prednisolone was about 40%-60%. If Acyclovir works well, it should provide at least a 10% higher recovery rate compared with Prednisolone. The calculation can be done in STATA with the following command (8):

artbin, pr(.5 .6 .6) ngroups(3) aratios(1 1 1) distant(0) alpha(0.05) power(0.8)

ART - ANALYSIS OF RESOURCES FOR TRIALS (version 1.0.0, 3 March 2004)

--------------------------------------------------------------------------

A sample size program by Abdel Babiker, Patrick Royston & Friederike Barthel,

MRC Clinical Trials Unit, London NW1 2DA, UK.

--------------------------------------------------------------------------

Type of trial Superiority - binary outcome

Statistical test assumed Unconditional comparison of 3

binomial proportions

Number of groups 3

Allocation ratio Equal group sizes

Anticipated event probabilities 0.500, 0.600, 0.600

Alpha 0.050 (two-sided)

Power (designed) 0.800

Total sample size (calculated) 1065

Expected total number of events 604

--------------------------------------------------------------------------

Investigators were required to enroll 1065 subjects and then randomly allocate 355 subjects to each group. Recovery within 3 months was expected to occur in 604 patients, which was estimated by (0.5+0.6+0.6)/3x1065. Since Acyclovir was much more expensive than Prednisolone, investigators might not be able to afford equal allocation, and thus reset the ratio of Acyclovir plus Prednisolone: Acyclovir: Prednisolone to 1:1:2. In addition, increasing the recovery rate by only 10% was not considered clinically significant, so this was reset to 15%. Re-estimation of the sample size is displayed below. This came up with 136 subjects for each of the Acyclovir plus Prednisolone and Acyclovir groups, and 272 subjects for the Prednisolone group. Since patients would be followed up for at least 12 months, and previous experience suggested that loss to follow-up was about 15%-20%, the sample size should be adjusted accordingly, so a total of 652 subjects needed to be enrolled.

artbin, pr(.5 .65 .65) ngroups(3) aratios(1 1 2) distant(0) alpha(0.05) power(0.8)

ART - ANALYSIS OF RESOURCES FOR TRIALS (version 1.0.0, 3 March 2004)

--------------------------------------------------------------------------

A sample size program by Abdel Babiker, Patrick Royston & Friederike Barthel,

MRC Clinical Trials Unit, London NW1 2DA, UK.

--------------------------------------------------------------------------

Type of trial Superiority - binary outcome

Statistical test assumed Unconditional comparison of 3

binomial proportions

Number of groups 3

Allocation ratio 1.00:1.00.00:2.00

Anticipated event probabilities 0.500, 0.650, 0.650

Alpha 0.050 (two-sided)

Power (designed) 0.800

Total sample size (calculated) 543

Expected total number of events 332

--------------------------------------------------------------------------
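The bookkeeping around the artbin totals (splitting a total by allocation ratio, expected number of events, and inflation for loss to follow-up) can be sketched as below. The helper names are hypothetical, and rounding each group up is an assumption that reproduces the group sizes quoted in the text.

```python
from math import ceil


def split_by_ratio(total, ratios):
    """Split a total sample size across groups by allocation ratio, rounding up."""
    return [ceil(total * r / sum(ratios)) for r in ratios]


def expected_events(total, ratios, probs):
    """Expected number of events given allocation ratios and event probabilities."""
    return total * sum(r * p for r, p in zip(ratios, probs)) / sum(ratios)


# First design: 1:1:1 allocation, total 1065
print(split_by_ratio(1065, [1, 1, 1]))                              # [355, 355, 355]
print(ceil(expected_events(1065, [1, 1, 1], [0.5, 0.6, 0.6])))      # 604

# Second design: 1:1:2 allocation, total 543, then 20% loss to follow-up
print(split_by_ratio(543, [1, 1, 2]))                               # [136, 136, 272]
print(ceil(543 * 1.2))                                              # 652
```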

Suppose that investigators could later enroll only 500 subjects because time and budget were running out. What would the power of the test be with a sample size of 500? This can be estimated as

artbin, pr(0.5 0.65 0.65) ngroup(3) aratios(1 1 2) alpha(.05) n(500)

ART - ANALYSIS OF RESOURCES FOR TRIALS (version 1.0.0, 3 March 2004)

------------------------------------------------------------------------------

A sample size program by Abdel Babiker, Patrick Royston & Friederike Barthel,

MRC Clinical Trials Unit, London NW1 2DA, UK.

------------------------------------------------------------------------------

Type of trial Superiority - binary outcome

Statistical test assumed Unconditional comparison of 3

binomial proportions

Number of groups 3

Allocation ratio 1.00:1.00.00:2.00

Anticipated event probabilities 0.500, 0.650, 0.650

Alpha 0.050 (two-sided)

Power (calculated) 0.765

Total sample size (designed) 500

Expected total number of events 307

------------------------------------------------------------------------------


Two independent means

The outcome of interest can be continuous data, such as

- Bone mineral density between calcium supplement versus placebo

- Estimated GFR (or serum creatinine) between controlled and un-controlled

blood sugar groups in diabetic patients

- Systolic/diastolic blood pressure between angiotensin-receptor blocker (ARB) and

ACEI in diabetic patients

- Level of HbA1C between patients who received Rosiglitazone versus other

glycemic drugs

- Pain scores of arthritis patients who receive Celecoxib and Ibuprofen

These outcomes are mostly intermediate or surrogate outcomes for the final outcomes. Their drawbacks are covered in more detail in the RCT course, but the benefit is that they usually need a smaller sample size than comparisons of dichotomous (proportion) or time-to-event outcomes. When investigators do not have much time for follow-up, the clinical endpoint of interest takes a long time to occur, and/or investigators do not have enough budget to run a longer project, they usually resort to comparing continuous outcomes. The concept of sample size estimation is the same as for proportions. That is, the false positive and false negative rates need to be assigned before conducting the study. The information we need to gather from previous studies is the mean and standard deviation of the outcome of interest in the control or standard treatment group. Finally, the size of the difference to be detected needs to be justified considering clinical significance and the feasibility of conducting the study. The null hypothesis and equation used for sample size calculation are as follows:

Ho: μ1 - μ2 = 0

Ha: μ1 - μ2 ≠ 0

$$n = \frac{(r+1)(Z_{\alpha/2} + Z_{\beta})^2\,\sigma^2}{r(\mu_1 - \mu_2)^2}$$

where r is the allocation ratio between the two groups.
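As a rough check of the formula for two independent means, the sketch below evaluates it with the normal approximation. It assumes that n refers to group 1 and that group 2 has size r x n; the function name `two_means_n` is hypothetical, and the inputs mirror the serum creatinine example that follows (mean 0.902 vs 0.802, SD 0.179, ratio 1:3).

```python
from math import ceil
from statistics import NormalDist


def two_means_n(mean1, mean2, sd, ratio=1.0, alpha=0.05, power=0.80):
    """Return (n1, n2) with n2 = ratio * n1, using normal quantiles (not the t test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n1 = (ratio + 1) * (z_alpha + z_beta) ** 2 * sd ** 2 / (ratio * (mean1 - mean2) ** 2)
    return ceil(n1), ceil(ratio * n1)


n_large, n_small = two_means_n(0.902, 0.802, sd=0.179, ratio=1 / 3)
print(n_large, n_small, n_large + n_small)
```

Because this uses normal quantiles, the total is a few subjects smaller than a t-test-based calculation such as STATA's `power twomeans`.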

Example 10

Investigators would like to conduct a randomized controlled trial to compare the efficacy of Mycophenolate mofetil (MMF) and Cyclophosphamide (CYC) in induction therapy of Lupus Nephritis. One of their outcomes of interest is serum creatinine within 6 months. From a previous systematic review (9), serum creatinine after receiving CYC was 0.902 (SD=0.179). Since MMF was much more expensive than CYC, the investigators set the ratio of MMF versus CYC groups at 1:3. If MMF is efficacious, it should decrease serum creatinine by at least 0.1 compared with CYC. Type I and II errors were set at 5% and 20%, respectively. The sample size was estimated as below, suggesting that 136 subjects were required to detect a difference in serum creatinine between groups of at least 0.1. However, about 20% of subjects might be lost to follow-up or withdraw from the study. Taking this into account would require 164 subjects (41 versus 123) to be enrolled in this study.

power twomeans .902 .802, sd(.179) nratio(.333)

Performing iteration ...

Estimated sample sizes for a two-sample means test

t test assuming sd1 = sd2 = sd

Ho: m2 = m1 versus Ha: m2 != m1

Study parameters:

alpha = 0.0500

power = 0.8000

delta = -2.8554

m1 = 0.9020

m2 = 0.8020

sd = 0.1790

N2/N1 = 0.3330

Estimated sample sizes:

N = 138

N1 = 103

N2 = 35

N2/N1 = 0.3398

Vary power from 0.8 to 0.9

power twomeans .902 .802, sd(.179) power(0.8(.05).9) nratio(.333) table graph

+---------------------------------------------------------------------------------+

| alpha power N N1 N2 nratio delta m1 m2 sd |

|---------------------------------------------------------------------------------|

| .05 .8 138 103 35 .333 -.1 .902 .802 .179 |

| .05 .85 156 117 39 .333 -.1 .902 .802 .179 |

| .05 .9 183 137 46 .333 -.1 .902 .802 .179 |

+---------------------------------------------------------------------------------+


Suppose that investigators can support only 30 to 40 subjects on MMF. How many subjects are required for the CYC group, given that the other parameters are fixed? This can be estimated using the GUI, which generates the following command:

power twomeans .902 .802, sd(.179) power(.8) n2(30 35 40) compute(n1)

Performing iteration ...

Estimated sample sizes for a two-sample means test

t test assuming sd1 = sd2 = sd

Ho: m2 = m1 versus Ha: m2 != m1

+-------------------------------------------------------------------------+

| alpha power N N1 N2 delta m1 m2 sd |

|-------------------------------------------------------------------------|

| .05 .8 196 166 30 -.1 .902 .802 .179 |

| .05 .8 130 95 35 -.1 .902 .802 .179 |

| .05 .8 112 72 40 -.1 .902 .802 .179 |

+-------------------------------------------------------------------------+

This can be done using PS program as follows:

[Figure: Estimated total sample size for a two-sample means test. x-axis: Power (1-β), .8 to .9; y-axis: Total sample size (N), 140 to 180. t test assuming σ1 = σ2 = σ; H0: μ2 = μ1 versus Ha: μ2 ≠ μ1. Parameters: α = .05, N2/N1 = .33, μ1 = .9, μ2 = .8, σ = .18]

Two dependent means

Some studies may wish to compare means before and after receiving treatment (or

intervention). For instance,

- mean systolic blood pressure before and after receiving treatment

- mean BMD before/after receiving calcium supplement

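The design actually has only one group of subjects, but each subject is measured twice, i.e., before and after receiving treatment. Sample size estimation can be performed as follows:

Ho: μbefore = μafter

Ha: μbefore ≠ μafter

$$n = \frac{(Z_{\alpha/2} + Z_{\beta})^2\,\sigma^2}{\Delta^2}$$

where σ is the standard deviation of the paired differences and Δ is the mean difference to be detected.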
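A minimal sketch of the paired-means formula, using the normal approximation. The function name `paired_means_n` is hypothetical; the inputs (difference 0.1, SD of differences 0.179) are those of the serum creatinine example that follows.

```python
from math import ceil
from statistics import NormalDist


def paired_means_n(delta, sd_diff, alpha=0.05, power=0.80):
    """Number of subjects for a before/after comparison (normal approximation)."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return ceil((z * sd_diff / delta) ** 2)


print(paired_means_n(delta=0.1, sd_diff=0.179))  # 26
```

A paired t test based calculation gives a slightly larger number because it uses t rather than normal quantiles.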

Example 11

As in example 10, suppose that investigators were interested in comparing serum creatinine within the CYC group only, before and after receiving treatment for 6 months. The sample size calculation is then as follows:

power pairedmeans .902 .802, sddiff(.179)

Performing iteration ...

Estimated sample size for a two-sample paired-means test

Paired t test

Ho: d = d0 versus Ha: d != d0

Study parameters:

alpha = 0.0500 ma1 = 0.9020

power = 0.8000 ma2 = 0.8020

delta = -0.5587

d0 = 0.0000

da = -0.1000

sd_d = 0.1790

Estimated sample size:

N = 28

*estimate power for n varying from 10 to 40

power pairedmeans .902 .802, sddiff(.179) n(10 20 30 40) table graph

+-------------------------------------------------------------------------+

| alpha power N delta d0 da ma1 ma2 sd_d |

|-------------------------------------------------------------------------|

| .05 .3521 10 -.5587 0 -.1 .902 .802 .179 |

| .05 .6594 20 -.5587 0 -.1 .902 .802 .179 |

| .05 .8407 30 -.5587 0 -.1 .902 .802 .179 |

| .05 .9312 40 -.5587 0 -.1 .902 .802 .179 |

+-------------------------------------------------------------------------+

Graphical user interface

The GUI can estimate power for varying sample sizes and plot a graph as follows:

power pairedmeans .902 .802, sddiff(.179) n(10 20 30 40) table graph

[Figure: Estimated power for a two-sample paired-means test. x-axis: Sample size (N), 10 to 40; y-axis: Power (1-β), .4 to 1. Paired t test; H0: d = d0 versus Ha: d ≠ d0. Parameters: α = .05, δ = -.56, d0 = 0, da = -.1, μa1 = .9, μa2 = .8, σd = .18]

PS

Twenty-six subjects needed to be enrolled in order to detect a difference in serum creatinine of 0.1 between before and after receiving CYC.

More than two groups of means

We sometimes would like to compare means between 3 or more groups. For instance:

- Fetuin-A levels in TT, AT, and AA genotypes of the Alpha2-Heremans-Schmid glycoprotein (AHSG) polymorphism

- BMI between AA, TA, and TT genotypes of the fat mass and obesity-associated (FTO) polymorphism

- Pain, function, and quality of life scores between glucosamine sulphate, diacerein, glucosamine sulphate plus diacerein, and placebo in knee osteoarthritis patients

- Estimated GFR in patients with type 2 diabetes, hypertension, both conditions, and neither condition

There are STATA user-written commands by the UCLA group (10) (i.e., fpower and simpower) that can estimate a sample size for this purpose. This is demonstrated in the example below:

Example 12

Investigators wished to conduct a randomized controlled trial in order to compare the mean VAS score (range 0-100) between Celecoxib, Valdecoxib, and Naproxen after receiving treatment for 7 days. Previous studies showed that the mean and SD of the VAS score in the Naproxen group were 85 and 25, respectively. Investigators thought that the effects of Celecoxib and Valdecoxib should be similar and that they should reduce pain scores by at least 15 units compared with the Naproxen group. Thus the possible means were 70, 70, and 85 for the Celecoxib, Valdecoxib, and Naproxen groups. The sample size can be estimated using the ‘power oneway’ command, but this requires the overall error variance (i.e., the mean square error) for performing the overall F-test. This can be approximated by SD^2 = 25^2 = 625.

power oneway 85 70 70, varerror(625)

Performing iteration ...

Estimated sample size for one-way ANOVA

F test for group effect

Ho: delta = 0 versus Ha: delta != 0

Study parameters:

alpha = 0.0500

power = 0.8000

delta = 0.2828

N_g = 3

m1 = 85.0000

m2 = 70.0000

m3 = 70.0000

Var_m = 50.0000

Var_e = 625.0000

Estimated sample sizes:

N = 126

N per group = 42

With an allocation ratio of Naproxen: Celecoxib: Valdecoxib = 2:1:1

power oneway 85 70 70, varerror(625) grweight(2 1 1)


Performing iteration ...

Estimated sample size for one-way ANOVA

F test for group effect

Ho: delta = 0 versus Ha: delta != 0

Study parameters:

alpha = 0.0500

power = 0.8000

delta = 0.3000

N_g = 3

m1 = 85.0000

m2 = 70.0000

m3 = 70.0000

Var_m = 56.2500

Var_e = 625.0000

Estimated sample sizes:

N = 112

Average N = 37.3333

N1 = 56

N2 = 28

N3 = 28
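The Var_m and delta values reported in the two outputs above can be reproduced directly from the group means. The sketch assumes that Var_m is the (weighted) variance of the group means around the grand mean and that delta = sqrt(Var_m / Var_e), which matches the reported figures; the function name `anova_effect_size` is hypothetical.

```python
from math import sqrt


def anova_effect_size(means, error_variance, weights=None):
    """Return (Var_m, delta) for a one-way layout with optional group weights."""
    if weights is None:
        weights = [1.0] * len(means)
    total_weight = sum(weights)
    grand_mean = sum(w * m for w, m in zip(weights, means)) / total_weight
    var_means = sum(w * (m - grand_mean) ** 2 for w, m in zip(weights, means)) / total_weight
    return var_means, sqrt(var_means / error_variance)


print(anova_effect_size([85, 70, 70], 625))                     # equal allocation
print(anova_effect_size([85, 70, 70], 625, weights=[2, 1, 1]))  # 2:1:1 allocation
```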

READ more detail in Appendix III & IV

Suppose that the investigator could enroll only 30 subjects for each group

TEST FOR EQUIVALENCE

Continuous data

READ Appendix III (Statist Med 2004; 23: 1921)

Some clinical researchers aim to determine whether a new treatment has the same clinical effect as the standard treatment. In this case, the concepts of hypothesis testing, type I and II errors, and sample size estimation differ from those of studies which aim to test for difference or superiority. The null and alternative hypotheses for equivalence studies are the opposite of those for difference/superiority studies. For instance,

Ho: Mean values are different between groups (Ho: µA ≠ µB)

Ha: Mean values are not different between groups (Ha: µA = µB)

However, when we say the two treatments are equivalent, they are not exactly equal, which leads us to define a region or margin of equivalence (11). For instance, the two treatments are claimed to be similarly effective if the difference (µA - µB) falls between -d and +d, and they are different if (µA - µB) ≤ -d or (µA - µB) ≥ +d, as described in the figure below. Thus, the null and alternative hypotheses are:

Ho: µA - µB ≤ -d or µA - µB ≥ +d

Ha: -d < µA - µB < +d

The null hypothesis consists of a pair of one-sided hypotheses, i.e., treatment A is superior to treatment B (µA - µB ≥ +d), or treatment A is inferior to treatment B (µA - µB ≤ -d). The alternative hypothesis states that treatment A is equivalent to treatment B if the difference falls within the margins. In order to accept that the two treatment effects are equivalent, we need to reject both of the one-sided tests in the null hypothesis. As with any test, this decision is subject to errors, i.e., type I and II errors.


The confidence intervals for tests of difference (e.g., superiority), equivalence, and non-inferiority can be illustrated as shown in figure 2. Superiority testing asks whether the treatment of interest has a better effect than the standard one, i.e., the difference in effects between treatment groups should be greater than 0. If the difference lies within ±d, we would assume that the two treatments are equivalent. However, the new treatment may or may not be better than the standard treatment, but it is at least as good as the standard one, i.e., the difference is ≥ -d; this is known as a non-inferiority test, which is discussed in more detail in the next section.

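A general approach for hypothesis testing is:

$$\text{Statistical value} = \frac{(\hat{\mu}_A - \hat{\mu}_B) - f(\mu_A - \mu_B)}{\sqrt{Var(\hat{\mu}_A - \hat{\mu}_B)}} \sim N(0,1)$$

Let $D = \hat{\mu}_A - \hat{\mu}_B$ and $f(\mu_A - \mu_B) = \Delta$.

The (1-α)100 percent confidence interval for a non-zero mean difference Δ is

$$D \pm Z_{1-\alpha/2}\sqrt{Var(D)}$$

The upper and lower limits of equivalence should be within ±d, i.e.,

$$D - Z_{1-\alpha/2}\sqrt{Var(D)} > -d \quad \text{and} \quad D + Z_{1-\alpha/2}\sqrt{Var(D)} < +d$$

Here Z(1-α/2) = 1.96 for an overall α of 0.05.

There are two ways in which we can fail to reject the null hypothesis and thus commit a type II error, which are:

$$d - \Delta = Z_{1-\beta_1}\sqrt{Var(D)} + Z_{1-\alpha/2}\sqrt{Var(D)} \quad \text{and} \quad d + \Delta = Z_{1-\beta_2}\sqrt{Var(D)} + Z_{1-\alpha/2}\sqrt{Var(D)}$$

Thus

$$Z_{1-\beta_1} = \frac{d - \Delta}{\sqrt{Var(D)}} - Z_{1-\alpha/2} \quad \text{and} \quad Z_{1-\beta_2} = \frac{d + \Delta}{\sqrt{Var(D)}} - Z_{1-\alpha/2}$$

where β = β1 + β2. If Δ = 0, then β1 = β2 = β/2 and

$$Z_{1-\beta/2} = \frac{d}{\sqrt{Var(D)}} - Z_{1-\alpha/2}, \qquad Var(D) = \frac{d^2}{(Z_{1-\beta/2} + Z_{1-\alpha/2})^2}$$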

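Two independent means

As for

$$Var(D) = \frac{(\Delta - d)^2}{(Z_{1-\beta/2} + Z_{1-\alpha/2})^2}$$

with

$$Var(D) = \frac{\sigma^2}{n_1} + \frac{\sigma^2}{n_2}; \quad \text{let } n_2 = r\,n_1, \text{ then } Var(D) = \frac{r+1}{r}\,\frac{\sigma^2}{n_1}$$

$$\frac{r+1}{r}\,\frac{\sigma^2}{n_1} = \frac{(\Delta - d)^2}{(Z_{1-\beta/2} + Z_{1-\alpha/2})^2}$$

$$n_1 = \frac{r+1}{r}\,\frac{(Z_{1-\beta/2} + Z_{1-\alpha/2})^2\,\sigma^2}{(\Delta - d)^2}$$

Power of test is

$$1 - \beta = 2 \times \Phi\left(\sqrt{\frac{r\,n_1\,(\Delta - d)^2}{(r+1)\,\sigma^2}} - Z_{1-\alpha/2}\right) - 1$$

Special case is if ∆ = 0

$$n_1 = \frac{r+1}{r}\,\frac{(Z_{1-\beta/2} + Z_{1-\alpha/2})^2\,\sigma^2}{d^2}$$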

Example 13

Researchers aimed to compare pain scores, measured by VAS (range 0-10), between Rofecoxib and Ibuprofen in arthritic patients. The research question was whether Rofecoxib was as efficacious as Ibuprofen in reducing pain. They considered a difference in score within ±1 (d) to be clinically equivalent. A previous study reported that the mean pain score for an Ibuprofen group was 6 and the variance was 10. Assigning alpha = 0.05 (0.025 in each tail), beta = 0.1 (β/2 = 0.05), and ratio = 1:1, how many subjects should be enrolled into the study?

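$$n_1 = \frac{r+1}{r}\,\frac{(Z_{1-\beta/2} + Z_{1-\alpha/2})^2\,\sigma^2}{d^2} = \frac{2\,(1.645 + 1.96)^2 \times 10}{1^2} = 259.9 \approx 260 \text{ per group}$$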
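The equivalence calculation for the special case of a zero true difference can be sketched as follows. The function name `equivalence_n_per_group` is hypothetical, and the normal approximation is used, so the result is one subject per group smaller than the t-test-based STATA figure.

```python
from math import ceil
from statistics import NormalDist


def equivalence_n_per_group(margin, variance, alpha=0.05, beta=0.10, ratio=1.0):
    """Group 1 size for an equivalence test when the true difference is 0."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(1 - beta / 2)     # beta is split over both sides
    return (ratio + 1) / ratio * (z_beta + z_alpha) ** 2 * variance / margin ** 2


n1 = equivalence_n_per_group(margin=1, variance=10)
print(round(n1, 1), ceil(n1))
```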

This estimate can also be obtained using STATA or SampSize (12). For STATA, we can apply the ‘power’ command, which was originally designed for tests of difference, but the power must be entered as (1-β/2) to obtain an overall type II error of β. For instance, enter a power of 0.95 if the overall power is 0.90.

power twomeans 6 5, sd(3.16) power(0.95)

Estimated sample sizes for a two-sample means test

t test assuming sd1 = sd2 = sd

Ho: m2 = m1 versus Ha: m2 != m1

Study parameters:

alpha = 0.0500

power = 0.9500

delta = -1.0000

m1 = 6.0000

m2 = 5.0000

sd = 3.1600

Estimated sample sizes:

N = 522

N per group = 261

SampSize was developed specifically for clinical trials and works on the iPad only. It provides all sorts of tests (i.e., superiority, non-inferiority, equivalence, bioequivalence, and precision) for both binary and continuous outcomes, but only for 2 treatment arms. The program was developed based on a tutorial by Julious and Campbell (4). The program requires input similar to STATA, but α/2 must be entered to yield an overall type I error of α. In addition, not only the equivalence limit but also the difference between treatments is required. For the example above, it can be estimated as follows:

Power: 90%

Significance Level: 0.025

Difference Between Treatments: 0

Equivalence Limit: 1

Population Standard Deviation: 3.162

Allocation Ratio: 1.000

Sample Size Group 1: 261

Sample Size Group 2: 261

Total Sample Size: 522

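Suppose that investigators could enroll only about 400 subjects in total (200 per group); the power of the test for this sample size can be estimated as follows:

$$\sqrt{\frac{n_1}{2}} \times \frac{d}{\sigma} = Z_{1-\alpha/2} + Z_{1-\beta/2}$$

$$\sqrt{\frac{200}{2}} \times \frac{1}{\sqrt{10}} = 1.96 + Z_{1-\beta/2}$$

$$Z_{1-\beta/2} = 1.202$$

$$\beta/2 = 0.115$$

$$\beta = 2 \times 0.115$$

$$1 - \beta = 1 - 2 \times 0.115 = 0.77$$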
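The power calculation above, and the conversion of a one-sided power reported by a difference test into the overall equivalence power, can be sketched as below. Both function names are hypothetical; the value 0.8843 is the power quoted from the STATA output in the text.

```python
from math import sqrt
from statistics import NormalDist


def equivalence_power(n_per_group, margin, variance, alpha=0.05):
    """Power of an equivalence test (true difference 0, 1:1 allocation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z = sqrt(n_per_group / 2) * margin / sqrt(variance) - z_alpha
    return 2 * NormalDist().cdf(z) - 1


def overall_power_from_one_sided(one_sided_power):
    """1 - beta, where beta/2 = 1 - (power reported for one side)."""
    return 1 - 2 * (1 - one_sided_power)


print(round(equivalence_power(200, margin=1, variance=10), 2))   # 0.77
print(round(overall_power_from_one_sided(0.8843), 4))            # 0.7686
```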

The power of the test can also be estimated from STATA as detailed below. The output reports a power of 0.8843, thus the error for this one-sided test is 1-0.8843, and the two-sided error is 2x(1-0.8843). As a result, the power of the test is 1-[2x(1-0.8843)] = 0.7686. This can also be estimated using SampSize on the iPad; the result is slightly different, see below.

power twomeans 6 5, sd(3.16) n(400)

Estimated power for a two-sample means test

t test assuming sd1 = sd2 = sd

Ho: m2 = m1 versus Ha: m2 != m1

Study parameters:

alpha = 0.0500


N = 400

N per group = 200

delta = -1.0000

m1 = 6.0000

m2 = 5.0000

sd = 3.1600

Estimated power:

power = 0.8843

*power = 0.8843

*STATA works with the left-tail area of the normal distribution. For instance,

*at a critical value of 1.96, p = 0.975, thus alpha/2 = 1-0.975

disp normprob(1.96) /*left-region area*/

disp 1- normprob(1.96)

****

disp (1-0.8843) /*one-side error*/

disp 2*(1-0.8843) /*two-sided error*/

disp 1-2*(1-0.8843) /*power of test*/

.7686

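Two dependent samples

$$n = \frac{(Z_{1-\beta/2} + Z_{1-\alpha/2})^2\,\sigma_d^2}{(\Delta - d)^2}$$

σd is the standard deviation of the pair differences.

Cross-over trial

$$n = \frac{2\sigma_w^2\,(Z_{1-\beta/2} + Z_{1-\alpha/2})^2}{(\Delta - d)^2}$$

For the special case Δ = 0:

$$n = \frac{2\sigma_w^2\,(Z_{1-\beta/2} + Z_{1-\alpha/2})^2}{d^2}$$

σw = within-subject standard deviation (from ANOVA)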

Example 14

Investigators would like to estimate the sample size for an equivalence test in a pharmacokinetic study. The primary aim was to assess whether the maximum concentrations of original Simvastatin and local Simvastatin are similar or equivalent after administration. They decided to conduct a cross-over trial. Information from a pilot study provided a within-subject variance of 2.80, and mean maximum concentrations after 24 hours were 9.83 and 9.60 for the original and generic drugs, respectively. They thought that the true difference in concentration between the two drugs should be 0, and that a difference of up to 2 should be considered clinically equivalent. Setting type I and II errors at 0.05 and 0.1, respectively, how many subjects needed to be enrolled into this study?

$$n = \frac{2\sigma_w^2\,(Z_{1-\beta/2} + Z_{1-\alpha/2})^2}{d^2} = \frac{2 \times 2.80\,(1.65 + 1.96)^2}{2^2} = 18.2, \text{ i.e., 19 in total}$$

This can be estimated in STATA using the ‘power repeated’ command. Because the correlation was unknown, it was assumed to equal 0.

power repeated 9.83 11.83, varerror(2.8) power(.95) corr(0)

Performing iteration ...

Estimated sample size for repeated-measures ANOVA

F test for within subject with Greenhouse-Geisser correction

Ho: delta = 0 versus Ha: delta != 0

Study parameters:

alpha = 0.0500

power = 0.9500

delta = 0.8452

N_g = 1

N_rep = 2

means = <matrix>

Var_w = 1.0000

Var_we = 1.4000

Var_e = 2.8000

rho = 0.0000

Estimated sample sizes:

N = 21

N per group = 21


Investigators had planned to conduct the study by recruiting healthy subjects. Later on, the budget ran out and they could enroll only 14 subjects. They wondered what the power of the test was:

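As for

$$n = \frac{2\sigma_w^2\,(Z_{1-\beta/2} + Z_{1-\alpha/2})^2}{d^2}$$

$$Z_{1-\beta/2} = \sqrt{\frac{n\,d^2}{2\sigma_w^2}} - Z_{1-\alpha/2} = \sqrt{\frac{14 \times 2^2}{2 \times 2.80}} - 1.96 = 1.20$$

$$\beta/2 = 1 - \Phi(1.20) = 0.115, \text{ thus } \beta = 2 \times 0.115 = 0.230$$

$$1 - \beta = 1 - 0.230 = 0.770$$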
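The cross-over equivalence calculations for the Simvastatin example (sample size, then power for 14 subjects) can be reproduced with the sketch below. The function names are hypothetical, the true difference is assumed to be 0, and normal quantiles are used throughout.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def crossover_equivalence_n(within_var, margin, alpha=0.05, beta=0.10):
    """Total subjects for a cross-over equivalence test (true difference 0)."""
    z = STD_NORMAL.inv_cdf(1 - beta / 2) + STD_NORMAL.inv_cdf(1 - alpha / 2)
    return 2 * within_var * z ** 2 / margin ** 2


def crossover_equivalence_power(n, within_var, margin, alpha=0.05):
    """Power for a given total n: 1 - beta, where beta/2 = 1 - Phi(z)."""
    z = sqrt(n * margin ** 2 / (2 * within_var)) - STD_NORMAL.inv_cdf(1 - alpha / 2)
    half_beta = 1 - STD_NORMAL.cdf(z)
    return 1 - 2 * half_beta


n_total = crossover_equivalence_n(within_var=2.80, margin=2)
print(round(n_total, 1), ceil(n_total))                                      # 18.2 19
print(round(crossover_equivalence_power(14, within_var=2.80, margin=2), 2))  # 0.77
```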

NON – INFERIORITY

Continuous data

The new treatment A is said to be non-inferior to treatment B if its effect is clinically similar to, or not worse than, that of treatment B, given that treatment B is a standard active control (11, 13). Suppose that the outcome of interest is continuous and a higher value indicates a better treatment effect.

The null and alternative hypotheses are:

H0: The treatment of interest is inferior to the standard treatment

$$H_0: \mu_A - \mu_B \le -d$$

Ha: The treatment of interest is as good as or better than (non-inferior to) the standard treatment

$$H_a: \mu_A - \mu_B > -d$$

Here, -d is the non-inferiority margin, which indicates how much worse treatment A can be than B while still being considered non-inferior. The most difficult part of a non-inferiority design is deciding how close the effect of treatment A should be to that of treatment B in order to claim that treatment A is not inferior to treatment B. The margin d should be set based on statistical and clinical judgment, given that it should be smaller than the effect size of active control B versus placebo. For instance, if the effect size for B vs placebo is 1, the margin d can be any value between 0 and 1 (usually 10-20%), but should not exceed 1. The best way to get information on the effect size of B vs placebo is to perform a systematic review and apply a meta-analysis to pool effect sizes across studies. The interval estimate of the pooled effect size (i.e., 95% confidence interval) will help investigators to justify the margin d properly; usually the lower limit is used (13).

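The lower limit of the (1-α)100 percent confidence interval, or the limit of the rejection area, is given below and described in figure 3.

Thus the limit of non-inferiority should be:

$$D - Z_{1-\alpha/2}\sqrt{Var(D)} > -d$$

Failing to reject the null hypothesis when treatment A is truly non-inferior is a type II error, which results in:

$$\Delta + d = Z_{1-\beta}\sqrt{Var(D)} + Z_{1-\alpha/2}\sqrt{Var(D)}$$

$$Z_{1-\beta} = \frac{\Delta + d}{\sqrt{Var(D)}} - Z_{1-\alpha/2}$$

$$Var(D) = \frac{(\Delta + d)^2}{(Z_{1-\beta} + Z_{1-\alpha/2})^2}$$

where Δ = µA - µB is the anticipated true difference.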

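Two independent means

$$n_1 = \frac{(r+1)\,\sigma^2\,(Z_{1-\beta} + Z_{1-\alpha/2})^2}{r\,(\Delta + d)^2}$$

$$1 - \beta = \Phi\left(\sqrt{\frac{r\,n_1\,(\Delta + d)^2}{(r+1)\,\sigma^2}} - Z_{1-\alpha/2}\right)$$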

As when estimating the sample size for superiority or equivalence, the type I error, type II error, and margin d need to be chosen. Setting the margin d is difficult, but some guidance on choosing it was described previously. Although the test is one-sided, the type I error still needs to be half of the total type I error, e.g., 0.025 if 0.05 is the overall type I error. This is because we still need to estimate the lower or upper limit of the 95% CI in order to reject the null hypothesis. In order to claim that treatment A is non-inferior to treatment B (as for the above null hypothesis), the lower limit should be greater than -d.

READ more detail in Appendix IV (Statistics in Medicine 2012; 31:2904) and VI (Statistics in Medicine 2003; 22: 169).

Example 15

A trial aimed to determine whether switching treatment from Atorvastatin to Simvastatin would control cholesterol no worse than remaining on Atorvastatin. A previous cohort study reported that mean serum cholesterol for an Atorvastatin group was 6.7 mmol/l with a variance of 5. Thus, the margin d should not exceed 6.7. Although the efficacy of Simvastatin was lower than that of Atorvastatin, it was claimed to be clinically non-inferior if its effect was about 15% lower than Atorvastatin. Thus, the margin d was set at 6.7x0.15 ≈ 1. The true difference of means was supposed to be 0. Type I error, type II error, and ratio were set at 0.05, 0.1, and 1:1, respectively. How many subjects needed to be enrolled into this study?

$$n_1 = \frac{(r+1)\,\sigma^2\,(Z_{1-\beta} + Z_{1-\alpha/2})^2}{r\,(\Delta + d)^2} = \frac{(1+1) \times 5 \times (1.282 + 1.96)^2}{1 \times (1)^2} = 105.1 \approx 106$$

STATA

power twomeans 6.7 7.7, sd(2.24) power(.9)

Performing iteration ...

Estimated sample sizes for a two-sample means test

t test assuming sd1 = sd2 = sd

Ho: m2 = m1 versus Ha: m2 != m1

Study parameters:

alpha = 0.0500

power = 0.9000

delta = 1.0000

m1 = 6.7000

m2 = 7.7000

sd = 2.2400

Estimated sample sizes:

N = 214

N per group = 107

SampSize

Power: 90%

Significance Level: 0.025

Anticipated Difference Between Treatments: 0

Non-Inferiority Limit: 1

Population Standard Deviation: 2.240

Allocation Ratio: 1.000

Sample Size Group 1: 107

Sample Size Group 2: 107

Total Sample Size: 214

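That is, 107 subjects needed to be enrolled in each group. Investigators could enroll only about 120 to 160 subjects (60 to 80 per group). How large is the type II error with this sample size? For 80 subjects per group:

$$80 = \frac{(1+1) \times 5 \times (Z_{1-\beta} + 1.96)^2}{1 \times (1)^2}$$

$$Z_{1-\beta} = \sqrt{\frac{80 \times 1^2}{2 \times 5}} - 1.96$$

$$1 - \beta = \Phi\left(\sqrt{\frac{80 \times 1^2}{2 \times 5}} - 1.96\right) = 0.807$$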
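The non-inferiority sample size and power for the statin example can be checked with the sketch below. The function names are hypothetical; the formulas assume a single one-sided test carried out at α/2, a margin d given as a positive number, and normal quantiles rather than the t distribution.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def noninferiority_n(variance, margin, true_diff=0.0, ratio=1.0, alpha=0.05, beta=0.10):
    """Group 1 size for a non-inferiority test of two independent means."""
    z = STD_NORMAL.inv_cdf(1 - beta) + STD_NORMAL.inv_cdf(1 - alpha / 2)
    return (ratio + 1) * variance * z ** 2 / (ratio * (true_diff + margin) ** 2)


def noninferiority_power(n1, variance, margin, true_diff=0.0, ratio=1.0, alpha=0.05):
    """Power for a given group 1 size n1 (group 2 has ratio * n1 subjects)."""
    z = sqrt(ratio * n1 * (true_diff + margin) ** 2 / ((ratio + 1) * variance))
    return STD_NORMAL.cdf(z - STD_NORMAL.inv_cdf(1 - alpha / 2))


n1 = noninferiority_n(variance=5, margin=1)
print(round(n1, 1), ceil(n1))                                   # 105.1 106
print(round(noninferiority_power(80, variance=5, margin=1), 3)) # 0.807
```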

Vary n from 120 to 160

power twomeans 6.7 7.7, n(120(20)160) sd(2.24) table

Estimated power for a two-sample means test

t test assuming sd1 = sd2 = sd

Ho: m2 = m1 versus Ha: m2 != m1

+-------------------------------------------------------------------------+

| alpha power N N1 N2 delta m1 m2 sd |

|-------------------------------------------------------------------------|

| .05 .6791 120 60 60 1 6.7 7.7 2.24 |

| .05 .7462 140 70 70 1 6.7 7.7 2.24 |

| .05 .8013 160 80 80 1 6.7 7.7 2.24 |

+-------------------------------------------------------------------------+

SampSize

Power: 80%

Significance Level: 0.025

Anticipated Difference Between Treatments: 0

Non-Inferiority Limit: 1

Population Standard Deviation: 2.240

Allocation Ratio: 1.000

Sample Size Group 1: 80

Sample Size Group 2: 80

Total Sample Size: 160


Cross-over design

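$$
n = \frac{2\,\sigma_w^2\,(Z_{1-\beta} + Z_{1-\alpha/2})^2}{(\Delta - d)^2}
$$

$$
1-\beta = \Phi\left(\sqrt{\frac{n\,(\Delta - d)^2}{2\,\sigma_w^2}} - Z_{1-\alpha/2}\right)
$$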

Exercise 16

Similar to exercise 14, investigators suspected that generic Simvastatin might be of lower quality than the original Simvastatin, but not to a clinically inferior degree. A difference of 2 in the maximum concentration would be the largest difference that they would still declare as non-inferiority. With the same type I and type II errors, how many subjects need to be enrolled?

$$
n = \frac{2\,\sigma_w^2\,(Z_{1-\beta} + Z_{1-\alpha/2})^2}{(\Delta - d)^2}
  = \frac{2 \times 2.80\,(1.282 + 1.96)^2}{(2)^2} = 14.7 \approx 15
$$
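The cross-over calculation can be checked in the same way. This is a minimal sketch assuming the within-subject variance of 2.80 and the margin of 2 used in the hand calculation, with a true difference of 0; `ni_n_crossover_means` is a hypothetical helper name.

```python
from math import ceil
from statistics import NormalDist


def ni_n_crossover_means(sigma2_w, d, delta=0.0, alpha=0.05, power=0.90):
    """Unrounded total n for a 2-period cross-over non-inferiority trial.

    Hypothetical helper: sigma2_w is the within-subject variance,
    d the margin, delta the assumed true difference.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return 2 * sigma2_w * (z_beta + z_alpha) ** 2 / (delta - d) ** 2


# Exercise 16: within-subject variance 2.80, margin d = 2, 90% power
n_raw = ni_n_crossover_means(sigma2_w=2.80, d=2.0)
print(round(n_raw, 1), ceil(n_raw))  # 14.7 15
```

The normal approximation gives 15 subjects, while the t-based and F-based software results below give 17; with samples this small the approximation tends to underestimate the requirement.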

Performing iteration ...

Estimated sample size for repeated-measures ANOVA

F test for within subject with Greenhouse-Geisser correction

Ho: delta = 0 versus Ha: delta != 0

Study parameters:

alpha = 0.0500

power = 0.9000

delta = 0.8452

N_g = 1

N_rep = 2

means = <matrix>

Var_w = 1.0000

Var_we = 1.4000

Var_e = 2.8000

rho = 0.0000

Estimated sample sizes:

N = 17

N per group = 17


SampSize

*estimate n

Power: 90%

Significance Level: 0.025

Anticipated Difference Between Treatments: 0

Non-Inferiority Limit: 2

Population Standard Deviation: 1.670

Total Sample Size: 17

*estimate power for n=14

Power: 82%

Significance Level: 0.025

Anticipated Difference Between Treatments: 0

Non-Inferiority Limit: 2

Population Standard Deviation: 1.670

Total Sample Size: 14

Dichotomous data

Read Appendix IV (Statist Med 2012;31:2904-36)

Most of the time the outcome of interest is dichotomous. Suppose that C and T are the standard and new treatments, respectively, and PC and PT are the corresponding proportions of the events of interest after receiving treatment. A higher proportion reflects higher treatment efficacy. If the effect of T is not much worse than that of C, T is said to be non-inferior to C. How large a margin d should be set before conducting the study? As mentioned previously, a systematic review and meta-analysis would give an idea of how to set the margin d properly. The null and alternative hypotheses are as follows (4, 14):

$$
H_0: P_T - P_C \le -d \qquad\qquad H_a: P_T - P_C > -d
$$


If the difference between PT and PC is > -d, the null hypothesis is rejected. Type I error for a non-inferiority (and also an equivalence) study is usually set at half of the type I error for a difference (superiority) study (4). For an equivalence study, the reason is that we need to reject the hypotheses on both sides in order to claim that the 2 treatments are equivalent. For a non-inferiority study, although we need to reject only a one-sided test, a 95% CI also needs to be estimated. For the above hypothesis, we need to show that the lower limit (2.5%) of the CI exceeds –d. The equation used for sample size estimation is as follows (4, 14):

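$$
1-\beta = \Phi\left(\sqrt{\frac{n_T\,(P_T - P_C - d)^2}{P_T(1-P_T) + r\,P_C(1-P_C)}} - Z_{1-\alpha/2}\right)
$$

$$
Z_{1-\beta} = \sqrt{\frac{n_T\,(P_T - P_C - d)^2}{P_T(1-P_T) + r\,P_C(1-P_C)}} - Z_{1-\alpha/2}
$$

$$
n_T = \frac{(Z_{1-\beta} + Z_{1-\alpha/2})^2\,[P_T(1-P_T) + r\,P_C(1-P_C)]}{(P_T - P_C - d)^2}
$$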

Exercise 17

Investigators would like to conduct a non-inferiority RCT to compare the incidence of complete remission between MMF and Cyclophosphamide. It was claimed that the efficacy of MMF in reaching disease remission might be a little worse than that of Cyclophosphamide, but adverse events from use of this drug (e.g., infection, leucopenia, or ovarian failure) occurred less often. The investigators think that if MMF's efficacy is not inferior to Cyclophosphamide, it should be worth prescribing. A previous systematic review and meta-analysis reported that the incidence of complete remission with Cyclophosphamide was 0.194 (9). If the incidence of complete remission with MMF is no more than about 20% lower in relative terms (i.e., d = 0.194 x 0.2 = 3.88%), this should be clinically non-inferior. They allowed a true difference of 1%, type I and type II errors of 0.05 and 0.2, respectively (80% power, as in the SampSize and artbin settings below), and a ratio of 1:1. How large was the sample size?


$$
n_T = \frac{(Z_{1-\beta} + Z_{1-\alpha/2})^2\,[P_T(1-P_T) + r\,P_C(1-P_C)]}{(P_T - P_C - d)^2}
    = \frac{(0.842 + 1.96)^2\,[0.184(1-0.184) + 0.194(1-0.194)]}{(0.184 - 0.194 - 0.0388)^2}
    = 1010.5049 \approx 1011
$$
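The arithmetic for Exercise 17 can be verified with a short script. This is a minimal sketch that assumes 1:1 allocation, 80% power (Z = 0.842), and the denominator (PT - PC - d)² as used in the hand calculation; `ni_n_two_proportions` is a hypothetical helper name.

```python
from math import ceil
from statistics import NormalDist


def ni_n_two_proportions(p_t, p_c, d, alpha=0.05, power=0.80):
    """Unrounded per-group n for a 1:1 non-inferiority trial of two proportions.

    Hypothetical helper: p_t and p_c are the anticipated proportions under
    the new and standard treatments, d is the non-inferiority margin.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96
    z_beta = NormalDist().inv_cdf(power)           # about 0.842 for 80% power
    variance = p_t * (1 - p_t) + p_c * (1 - p_c)
    return (z_beta + z_alpha) ** 2 * variance / (p_t - p_c - d) ** 2


# Exercise 17: PT = 0.184, PC = 0.194, d = 0.0388
n_raw = ni_n_two_proportions(p_t=0.184, p_c=0.194, d=0.0388)
print(ceil(n_raw))  # 1011
```

Using unrounded normal quantiles gives a raw value slightly below the hand-calculated 1010.5, but the rounded-up sample size per group is the same.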

This can be estimated using SampSize as follows:

Power: 80%

Significance Level: 0.025

Non-Inferiority Limit: 0.0388

Response Anticipated On Treatment A: 0.184

Response Anticipated On Treatment B: 0.194

Sample Size Group 1: 1011

Sample Size Group 2: 1011

Total Sample Size: 2022

STATA can be used to estimate sample size for non-inferiority using Analysis of Resources for Trials (ART), written by Patrick Royston and colleagues, as follows (8). However, there is no specific option for (PT-PC) and the margin 'd', so we need to combine these values and build them into PT. For this example, (PT-PC) is set at 0.01 and 'd' = 0.0388, thus PT = 0.194-(0.01+0.0388) = 0.145.

artmenu on

Once the menu is activated, we can run the command from the drop-down menu via user\art\binary outcomes. The number of groups, allocation ratio, alpha, power, and proportions need to be filled in, see below. In addition, the options for a non-inferiority design and a one-sided test should be selected. A sample size of 2009 subjects was required for this study.


artbin, pr(.194 .145) ngroups(2) aratios(1 1) distant(0) alpha(0.05) power(0.8) onesided(0) ni(1)

ART - ANALYSIS OF RESOURCES FOR TRIALS (version 1.0.0, 3 March 2004)

------------------------------------------------------------------------------

A sample size program by Abdel Babiker, Patrick Royston & Friederike Barthel,

MRC Clinical Trials Unit, London NW1 2DA, UK.

------------------------------------------------------------------------------

Type of trial Non-inferiority - binary outcome

Statistical test assumed Unconditional comparison of 2

binomial proportions

Number of groups 2

Allocation ratio Equal group sizes

Anticipated event probabilities 0.194, 0.145

Alpha 0.050 (two-sided)

Power (designed) 0.800

Total sample size (calculated) 2009

Expected total number of events 341

------------------------------------------------------------------------------


ASSIGNMENT VI

1. A case-control study will be conducted to assess the association between diabetes mellitus (DM) and cardiovascular diseases. The hypothesis is that patients with diabetes have a higher risk of cardiovascular events than non-diabetic patients. Previous literature reported that the prevalence of diabetes in the general population was ~16%. The authors thought that an odds ratio of at least 3 would have an impact on clinical practice.

a) Estimate the sample size for this case-control study.

b) If the difference in diabetes prevalence between cases and controls is considered instead of the OR, with the size of the difference set at 10%, how many cases and controls would you suggest?

c) If your colleagues can conduct the study with only 100 subjects in each group, what power will they have and how large a difference will they be able to detect? Your colleagues have also asked you to write up this calculation for inclusion in a proposal.

2. A randomized controlled trial aims to test whether the efficacy of an ACE inhibitor is equivalent to that of an ARB in reno-protective effect. The outcome of interest is serum creatinine, for which the mean and SD in type 2 diabetic patients who received ARB are 1.25 and 0.2, respectively. If the difference in serum creatinine between the 2 drugs is within ±0.05, the 2 drugs would be claimed to be clinically equivalent.

a) How many subjects will need to be enrolled to test this hypothesis?

b) Since ARB is more expensive than the ACE inhibitor, what is your suggestion to minimize the budget of the project?

c) From their previous experience of conducting clinical trials in their setting, the loss to follow-up rate could reach 25%. Adjust the sample size for loss to follow-up.

d) Write a report for this part.