Sample size and analytical issues for cluster trials David Torgerson Director, York Trials Unit [email protected]


www.rcts.org

Background

• For any trial, we want the sample to be large enough that, if there were a ‘true’ difference between the groups, this difference would be statistically significant.

• A Type II error occurs when we wrongly conclude there is no difference when there actually is one.

Sample size calculations

• “Most hand calculations diabolically strain human limits, even for the easiest formula,..” (Schulz & Grimes, Lancet 2005)

Sample size formulae

• Usually need a computer to calculate. However, there is a simple approximation for a two-armed randomised trial with a 1:1 ratio and a continuous variable (e.g., blood pressure), where d = effect size (difference/standard deviation):

$$\text{Approximate } N = \frac{32}{d^2}$$

Example

• We want to investigate a treatment for back pain. The measure is the Roland and Morris back pain scale with a standard deviation of 4. If we want to detect a 2 point difference how many do we need?

• 2/4 = 0.5 = effect size (d). 0.5 x 0.5 = 0.25.

• 32/0.25 = 128 in total for 80% power, 5% significance (use 42 for 90% power).

• NB using computer software answer = 126
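The rule of thumb above is easy to check in code. The following is a minimal sketch, not something from the original slides: the function name `approx_total_n` and its parameters are illustrative assumptions, and the multipliers 32 (80% power) and 42 (90% power) are the ones quoted in the example.

```python
def approx_total_n(difference, sd, power=0.80):
    """Approximate total sample size (both arms) for a 1:1 two-arm trial
    with a continuous outcome and 5% significance."""
    multipliers = {0.80: 32, 0.90: 42}  # rule-of-thumb constants quoted on the slide
    if power not in multipliers:
        raise ValueError("this rule of thumb only covers 80% and 90% power")
    d = difference / sd  # standardised effect size
    return multipliers[power] / d ** 2


# Back pain example: 2-point difference, standard deviation of 4
n_80 = approx_total_n(2, 4)
n_90 = approx_total_n(2, 4, power=0.90)
print(n_80, n_90)
```

For the back pain example this gives 128 participants at 80% power and 168 at 90% power.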

Binary variables

• For a dichotomous variable (cured, not cured) the following is useful, where d = difference between the two proportions and a = average of the two proportions:

$$\text{Approximate } N = \frac{32}{d^2 / (a - a^2)}$$

Example

• Breast feeding rates are only 50% and we have an educational intervention where we think this will increase to 60%; how many do we need?

• d² = (0.6 - 0.5)² = 0.1² = 0.01

• a = (0.6 + 0.5)/2 = 0.55

• a² = 0.55² = 0.3025

• 0.01/(0.55 - 0.3025) = 0.040

• 32/0.040 = 792

• Need 792 to have 80% power to show a 10% difference in breast feeding rates if it were present (use 42 for 90% power).

• NB using computer software the answer is: 774
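The binary-outcome approximation can be sketched the same way. This is an illustrative helper rather than part of the original talk: `approx_total_n_binary` is an assumed name, and the expression is the slide's formula rearranged algebraically as 32 × (a − a²) / d².

```python
def approx_total_n_binary(p1, p2, power=0.80):
    """Approximate total sample size for a 1:1 two-arm trial with a
    dichotomous outcome, using the slide's rule of thumb."""
    multipliers = {0.80: 32, 0.90: 42}  # constants quoted on the slide
    d_squared = (p2 - p1) ** 2          # squared difference in proportions
    a = (p1 + p2) / 2                   # average of the two proportions
    return multipliers[power] * (a - a ** 2) / d_squared


# Breast feeding example: 50% in control, 60% expected with the intervention
n_binary = approx_total_n_binary(0.5, 0.6)
print(round(n_binary))
```

Rounded to the nearest whole number, the breast feeding example reproduces the 792 worked out by hand.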

Approximations

• The formulae slightly overestimate the true sample size needed. But they can be done on a hand calculator and you can impress the statisticians.

• What about cluster trials?

Cluster Sample Size

• Usual sample size estimates assume independence of observations. When people are members of the same cluster (e.g., classroom, GP surgery) their outcomes are more alike than we would expect by chance.

• This similarity is measured by the intra-cluster correlation coefficient (ICC).

ICC

• The ICC needs to be incorporated into the sample size calculations. The formula is as follows: Design effect = 1 + (m – 1) × ICC. The design effect is the factor by which the sample size needs to be inflated; m is the number of people in the cluster.

Sample size example.

• Let’s assume that for an individually randomised trial we need 128 people to detect an effect size of 0.5 with 80% power (two-sided p = 0.05). Now assume we have 24 groups with 7 members. The ICC is 0.05, which is quite high.

• 1 + (7 – 1) × 0.05 = 1.3, so we need to increase the sample size by 30%. Therefore, we will need 128 × 1.3 ≈ 166 participants.

What happens if cluster gets bigger?

• If our cluster size is twice as big (14), things begin to get really interesting.

• 1 + (14 – 1) × 0.05 = 1.65.

• What about 30? 1 + (30 – 1) × 0.05 = 2.45 (i.e., 314 participants).

• Say we randomise a larger cluster, such as a school (n = 500): 1 + (500 – 1) × 0.05 = 25.95 (i.e., 3322 participants).
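The cluster-size examples above all apply the same design effect formula, which can be sketched as follows. The function names are illustrative assumptions; the starting sample size of 128 and the ICC of 0.05 are the values used in the examples.

```python
def design_effect(cluster_size, icc):
    """Design effect = 1 + (m - 1) x ICC, where m is the cluster size."""
    return 1 + (cluster_size - 1) * icc


def inflated_n(individual_n, cluster_size, icc):
    """Sample size for a cluster trial, given the size needed under
    individual randomisation."""
    return individual_n * design_effect(cluster_size, icc)


# Cluster sizes from the examples, with ICC = 0.05 and 128 participants
# needed under individual randomisation
for m in (7, 14, 30, 500):
    deff = design_effect(m, 0.05)
    print(f"cluster size {m}: design effect {deff:.2f}, "
          f"participants {inflated_n(128, m, 0.05):.0f}")
```

Holding the ICC fixed, the required sample size grows linearly with cluster size, which is why a school-level cluster of 500 is so costly.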

ICC size

• ICCs can be large for some things. ICCs for educational outcomes, for example, are often around 0.4 to 0.5.

• A class-based RCT with n = 30 and an ICC of 0.4 would need 1,612 participants or 54 classes with n = 30 in each class.
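To turn an inflated sample size into a number of clusters, divide by the cluster size and round up. A small sketch, assuming the same starting point of 128 participants as in the earlier example (the helper name `clusters_needed` is hypothetical):

```python
import math


def clusters_needed(individual_n, cluster_size, icc):
    """Return (number of clusters, total participants) after applying
    the design effect 1 + (m - 1) x ICC."""
    deff = 1 + (cluster_size - 1) * icc
    total = individual_n * deff
    return math.ceil(total / cluster_size), total


# Class-based trial: 30 pupils per class, ICC = 0.4
classes, total = clusters_needed(128, 30, 0.4)
print(classes, round(total, 1))
```

This reproduces the 54 classes quoted above; the unrounded total is 1,612.8 participants.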

What makes the ICC large?

• If the treatment is applied to the health care provider (e.g., guidelines), this will increase ICCs for patients.

• If cluster relates to outcome variable (e.g., smoking cessation and schools)

• If members of cluster are expected to influence each other (e.g., households).

Reviews of Cluster Trials

| Authors | Source | Years | Clustering allowed for in sample size | Clustering allowed for in analysis |
|---|---|---|---|---|
| Donner et al. (1990) | 16 non-therapeutic intervention trials | 1979 – 1989 | <20% | <50% |
| Simpson et al. (1995) | 21 trials from American Journal of Public Health and Preventive Medicine | 1990 – 1993 | 19% | 57% |
| Isaakidis and Ioannidis (2003) | 51 trials in Sub-Saharan Africa | 1973 – 2001 (half post 1995) | 20% | 37% |
| Puffer et al. (2003) | 36 trials in British Medical Journal, Lancet, and New England Journal of Medicine | 1997 – 2002 | 56% | 92% |
| Eldridge et al. (Clinical Trials 2004) | 152 trials in primary health care | 1997 – 2000 | 20% | 59% |

Sample Size Problems

Cluster and Sample Size

Cluster Size 1 20 50 100

Sample Size 116 160 230 346

Cluster Trials Demand Larger Sample Sizes

Conditional ICC

• The key ICC is the conditional ICC, but usually we only have access to estimates of the unconditional ICC.

• If we know, and can measure, characteristics that cause the ICC, we can adjust for these and lower the ICC.

• Cook claims that using covariates allows a school-based RCT to reduce the number of schools from about 50 to around 22.

Summary of sample size

• The KEY thing is the size of the cluster. It is nearly always better to have lots of small clusters than a few large ones (e.g., a trial with small hospital wards, GP practices or classrooms will, ceteris paribus, be better than one with large clusters).

• BUT if the ICC is tiny, clustering may not affect the sample size too much.

Cluster Trials: Should I do one?

• If possible, avoid like the plague. BUT although they are difficult to do properly, they WILL give more robust answers than other methods (e.g., observational data) when done properly.

• Is it possible to avoid doing them and do an individually randomised trial?

Contamination

• An important justification for their use is SUPPOSED ‘contamination’ between participants allocated to the intervention and participants allocated to the control.

Spurious Contamination?

• Trial proposal to cluster randomise practices for a breast feeding study – new mothers might talk to each other!

• Trial for reducing cardiac risk factors – patients again might talk to each other.

• Trial for removing allergens from homes of asthmatic children.

Contamination

• Contamination occurs when some of the control patients receive the novel intervention.

• It is a problem because it reduces the effect size, which increases the risk of a Type II error (concluding there is no effect when there actually is).

Patient level contamination

• In a trial of counselling adults to reduce their risk of cardiovascular disease general practices were randomised to avoid contamination of control participants by intervention patients.

Steptoe. BMJ 1999;319:943.

Accepting Contamination

• We should accept some contamination and deal with it through individual randomisation and by boosting the sample size rather than going for cluster randomisation

Torgerson BMJ 2001;322:355.

Counselling Trial

• Steptoe et al. wanted to detect a 9% reduction in smoking prevalence with a health promotion intervention. They needed 2000 participants (rather than 1282) because of clustering.

• If they had randomised 2000 individuals, the trial would have been able to detect a 7% reduction, allowing for 20% CONTAMINATION.

Steptoe. BMJ 1999;319:943.

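Comparison of Sample Sizes

Contamination and Sample Size

| Contamination | 0 | 10% | 20% | 30% |
|---|---|---|---|---|
| Sample size | 116 | 144 | 182 | 236 |

Cluster and Sample Size

| Cluster size | 1 | 20 | 50 | 100 |
|---|---|---|---|---|
| Sample size | 116 | 160 | 230 | 346 |

NB: Assuming an ICC of 0.02.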
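The two rows of figures can be roughly reproduced in a few lines. The cluster row is the design effect formula with an ICC of 0.02. The slides do not state how the contamination row was derived, so the sketch below ASSUMES that a contamination fraction c shrinks the detectable difference by a factor (1 − c), which inflates the sample size by 1/(1 − c)². That assumption gives values within one or two participants of the slide's figures, so the original calculation may have differed slightly. Function names are illustrative.

```python
def n_with_contamination(base_n, contamination):
    """ASSUMPTION: contamination c dilutes the effect by (1 - c), so the
    sample size is inflated by 1 / (1 - c)^2."""
    return base_n / (1 - contamination) ** 2


def n_with_clustering(base_n, cluster_size, icc=0.02):
    """Design effect inflation: 1 + (m - 1) x ICC."""
    return base_n * (1 + (cluster_size - 1) * icc)


base_n = 116  # sample size with no contamination and no clustering
for c in (0.0, 0.1, 0.2, 0.3):
    print(f"contamination {c:.0%}: {n_with_contamination(base_n, c):.0f}")
for m in (1, 20, 50, 100):
    print(f"cluster size {m}: {n_with_clustering(base_n, m):.0f}")
```

On these figures, 30% contamination costs about as much sample size as clusters of around 50 when the ICC is 0.02.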

Misplaced contamination

• The ONLY health study I’m aware of to date that directly compared an individually randomised design with a cluster design showed no evidence of contamination.

• In an RCT of nurse led cardiovascular risk factor screening some ‘intervention’ clusters had participants allocated to no treatment. NO contamination was observed.

What about dilution bias?

• If, in the presence of contamination, we use individual allocation we might observe a difference that is statistically significant but is not clinically or economically significant.

• Dilution has biased the estimate towards the null (i.e., towards no difference between the groups).

Dealing with contamination

• Sometimes there may be substantial contamination, and this will dilute the treatment effects. It may, however, still be best to individually randomise if you can measure contamination.

Per-protocol analysis?

• We cannot adjust for contamination using either per-protocol or on-treatment analysis: these popular analytical methods are plainly wrong as they violate the random allocation.

CACE analysis: a solution?

• If we can measure contamination we can use a statistical approach known as Complier Average Causal Effect (CACE) analysis.

Assumptions of CACE

• Assumption 1 – if the control group had been offered treatment the same proportion would comply with treatment – this must be true as random allocation ensures that it is.

• Assumption 2 – merely being offered treatment has no effect on outcomes.

Example CRC screening

• In an RCT of bowel cancer screening only 53% of people invited for screening attended.

• ITT relative risk = 0.85. BUT what happened to those who were screened? The per-protocol RR was 0.62. THIS IS WRONG.

• What is the true estimate?

Randomisation

| Group | Subgroup | n | Outcome |
|---|---|---|---|
| Intervention group (n = 75,253) | Observed adherers | 40,214 (53%) | 138 (0.34%) |
| Intervention group (n = 75,253) | Observed non-adherers | 35,039 (47%) | 222 (0.63%) |
| Control group (n = 74,998) | Potential adherers | 40,078 (53%) | 199 (0.50%), unobserved |
| Control group (n = 74,998) | Potential non-adherers | 34,920 (47%) | 221 (0.63%), unobserved |

True differences

• For ITT, i.e. the policy of offering screening to the whole community, the RR = 0.85, that is a 15% reduction in CRC deaths.

• For those who accepted screening their RR was 0.68 – a 32% reduction in deaths, NOT a 38% reduction.
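The arithmetic behind the screening example can be sketched as follows. This is an illustration under the two CACE assumptions stated earlier, not the original trial's analysis code: the variable names are made up, and the control-arm total of 420 events is taken as the sum of the two unobserved counts in the randomisation breakdown (199 + 221). The "per-protocol" comparison here is ASSUMED to be intervention adherers against the whole control group.

```python
def risk(events, n):
    """Proportion of participants with the outcome."""
    return events / n


# Intervention arm, as observed
adh_n, adh_events = 40214, 138   # attended screening
non_n, non_events = 35039, 222   # invited but did not attend
int_n = adh_n + non_n

# Control arm: only the totals are actually observed
ctl_n, ctl_events = 74998, 420   # 420 = 199 + 221 from the breakdown

# Intention to treat: everyone as randomised
itt_rr = risk(adh_events + non_events, int_n) / risk(ctl_events, ctl_n)

# Naive per-protocol: adherers vs the whole control group (biased)
pp_rr = risk(adh_events, adh_n) / risk(ctl_events, ctl_n)

# CACE: assume the control arm contains the same share of would-be
# adherers, and that would-be non-adherers have the same risk as the
# observed non-adherers in the intervention arm
p_adhere = adh_n / int_n
ctl_adh_n = p_adhere * ctl_n
ctl_non_events = (ctl_n - ctl_adh_n) * risk(non_events, non_n)
ctl_adh_events = ctl_events - ctl_non_events
cace_rr = risk(adh_events, adh_n) / risk(ctl_adh_events, ctl_adh_n)

print(f"ITT RR {itt_rr:.2f}, per-protocol RR {pp_rr:.2f}, CACE RR {cace_rr:.2f}")
```

With unrounded rates this sketch gives an ITT RR of about 0.85, a per-protocol RR of about 0.61 and a CACE RR of about 0.69. These are close to the 0.62 and 0.68 quoted above; the small differences may reflect rounding of the percentages.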

Individuals are best

• Using CACE we can get the best of both worlds: retain individual randomisation and get unbiased estimates.

Sample size simulation

• CACE analysis generally produces wider confidence intervals as there are two sources of variance.

• Therefore, it is possible that cluster allocation may actually have a lower standard error in some circumstances.

• To assess whether this is true we undertook a simulation exercise.

Sample size trade-off between cluster and individual allocation

| Cluster Size | ICC = 0.04, Cluster trial | Contamination (%) | Individual RCT with CACE | Contamination effect |
|---|---|---|---|---|
| 10 | 1080 | 0 | 630 | 1 |
| 30 | 1740 | 10 | 756 | 1.20 |
| 50 | 2400 | 20 | 890 | 1.41 |
| 100 | 4000 | 30 | 1090 | 1.73 |

NB 80% power to detect an effect size of 0.2

Source: Hewitt PhD thesis.

Sample size

• CACE performs better than cluster allocation in a range of sample size scenarios

• Because of the difficulties of doing a cluster trial, an individual trial design with CACE analysis might be best.

Limitations

• The assumption that being offered treatment has no effect is a weakness as some may appear not to comply but actually access some of the treatment.

Still need to do a cluster trial?

• If a cluster trial is to be undertaken, it is important that, once the trial has been completed, it is analysed correctly and the effect of the clustering is accounted for. This has been known since 1940, when Lindquist advocated that educational trials should use the class as the natural unit of allocation.

What did Lindquist propose?

• Each class should be treated both as the unit of allocation and the unit of analysis.

• Put simply, a trial with 20 classes of 30 children is NOT a trial of 600 children; it is a trial of 20 classes.

• The simplest approach is to calculate the mean score of each cluster and do a t-test comparing the two means.

Example

• A randomised trial of 28 adult literacy classes sought to ascertain whether or not paying participants an incentive to attend would improve adherence.

• 14 classes were randomised for students to get an incentive; 14 were controls.

• Students were paid £5 per class attended.

• There were 150 students in total; the ICC was 0.39.

See Martin Bland’s website http://www-users.york.ac.uk/~mb55/ for a worked example.

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
 Group X |      70    6.685714    .4177941    3.495516    5.852238    7.519191
 Group Y |      82    5.280488    .2991881    2.709263    4.685197    5.875778
---------+--------------------------------------------------------------------
combined |     152    5.927632    .2566817    3.164585     5.42048    6.434783
---------+--------------------------------------------------------------------
    diff |            1.405226    .5037841                .4097968    2.400656
------------------------------------------------------------------------------
    diff = mean(Group X) - mean(Group Y)                          t =   2.7893
Ho: diff = 0                                     degrees of freedom =      150

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9970         Pr(|T| > |t|) = 0.0060          Pr(T > t) = 0.0030

Wrong

• This analysis is wrong: it treats all of the students as independent individuals and ignores the clustering of outcomes within classes.

• Let us try Lindquist’s approach to the analysis.

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       1 |      14     6.69932    .7457716    2.790422    5.088178    8.310461
       2 |      14    5.189229    .3974616    1.487165    4.330565    6.047893
---------+--------------------------------------------------------------------
combined |      28    5.944274     .439363     2.32489    5.042776    6.845773
---------+--------------------------------------------------------------------
    diff |            1.510091    .8450746                -.226985    3.247166
------------------------------------------------------------------------------
    diff = mean(1) - mean(2)                                      t =   1.7869
Ho: diff = 0                                     degrees of freedom =       26

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9572         Pr(|T| > |t|) = 0.0856          Pr(T > t) = 0.0428
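The contrast between the two analyses can be checked from the summary statistics alone. The sketch below recomputes the pooled-variance t statistic from the means, standard deviations and group sizes shown in the two outputs. It is an illustration using only the Python standard library, not the Stata commands that produced the output, and `pooled_t` is an assumed helper name.

```python
import math


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic with equal variances, from summary statistics.
    Returns (t, degrees of freedom)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se, df


# Wrong analysis: the unit of analysis is the individual student
t_students, df_students = pooled_t(6.685714, 3.495516, 70,
                                   5.280488, 2.709263, 82)

# Cluster-level analysis: the unit of analysis is the class mean
t_classes, df_classes = pooled_t(6.69932, 2.790422, 14,
                                 5.189229, 1.487165, 14)

print(f"students: t = {t_students:.4f} on {df_students} df")
print(f"classes:  t = {t_classes:.4f} on {df_classes} df")
```

Moving from students to classes as the unit of analysis cuts the degrees of freedom from 150 to 26 and the t statistic from about 2.79 to about 1.79, which is why the result is no longer significant at the 5% level.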

T-test method

• This is correct in the sense that it takes clustering into account; however, it does not take into account chance differences in cluster size or powerful predictors of outcome.

• We have information on cluster size and pre-test literacy score that we can use to improve the precision of our estimate (i.e., reduce the width of the confidence intervals). We can use summary statistics in a regression approach.

      Source |       SS       df       MS              Number of obs =      28
-------------+------------------------------           F(  2,    25) =   22.97
       Model |  88.6762362     2  44.3381181           Prob > F      =  0.0000
    Residual |   48.252853    25  1.93011412           R-squared     =  0.6476
-------------+------------------------------           Adj R-squared =  0.6194
       Total |  136.929089    27  5.07144775           Root MSE      =  1.3893

------------------------------------------------------------------------------
    sessions |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       group |  -1.778653   .5301429    -3.36   0.003    -2.870503   -.6868038
      midscl |  -.0945941    .015181    -6.23   0.000    -.1258598   -.0633283
       _cons |   13.13811   1.175841    11.17   0.000     10.71642     15.5598
------------------------------------------------------------------------------

Other methods

• There are other, more complex statistical methods that may yield slightly different results. However, simple methods are approximately correct and easier to do.

Summary

• Cluster trials need larger sample sizes than individually randomised studies.

• Clustering needs to be taken into account both in the sample size and the analysis.

• There are simple methods that can do this.
