Page 1:

Are Power Calculations

a Waste of Time?

Presented by A/Prof Gunter Hartel

6 September 2017

Page 2:

Who Am I?

A/Prof Gunter Hartel, Head of the QIMR Berghofer Statistics Unit

• 10 statisticians, a PK/PD modeller, a data manager

• Statistical consultation, training, collaboration

• Statistical methodology development / research

• QIMRB researchers and clinical researchers from 6 local hospitals

Previously

• Principal Statistician at CNS (a CRO; Phase I clinical trials)

• Global Director of Statistics at CSL Behring (all clinical trials)

• UQ Population Health, QUT Statistics (Population Health)

• 30 years of statistical consulting

Page 3:

Sample Size Power Calculation

• #1 request for a statistical consult

• Typical scenario:

– Protocol or grant proposal near final draft

– Feasibility done to determine how many samples or patients are available / affordable

– Request to statistician – please fill in:

With N=__ subjects this study has 80% power for an α=0.05 test of the primary hypothesis.

*note: we can only get 60 subjects at the most, is that ok?

Page 4:

Sample Size Power Calculation

• Typical questions from statistician:

– What is your primary hypothesis / endpoint?

– What size of effect are you trying to detect, i.e. what is scientifically, clinically, or practically important?

– What are the expected background rates or variances in your study population?

• Typical answers:

– Keeping hypotheses flexible because we’re not sure how it’ll turn out

– If I knew all this info I wouldn’t have to do the study!

– Or they use information from a different study, with a different population and endpoints, that was probably underpowered too

Page 5:

Sample Size Power Calculation

• Typical outcome:

– “With N=60 patients our study has 80% power to detect an effect size of 0.7 with an α=0.05 test.” (see the sketch below)

– The power calculation has no effect on the design of the study

– The study ends up finding a significant 3-way interaction in a sub-group when adjusted for significant covariates, and gets published.

– Future research uses this study to help with power calculations
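
A statement like the one above only pins down a number once the design behind it is specified. As a minimal sketch — assuming, since the slide does not say, a two-sided two-sample t-test with 30 patients per arm and Cohen's d as the effect size — the calculation looks like this in Python (statsmodels); under these particular assumptions, N=60 actually gives a little under 80% power:

```python
# Minimal power-calculation sketch; the design (two-sample, two-sided,
# equal arms) is an assumption, not something stated on the slide.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Power for N=60 total (30 per arm) at effect size d = 0.7, alpha = 0.05
power = analysis.power(effect_size=0.7, nobs1=30, alpha=0.05, ratio=1.0)
print(f"Power with 30 per arm: {power:.2f}")             # roughly 0.76

# Per-arm sample size needed for 80% power under the same assumptions
n_per_arm = analysis.solve_power(effect_size=0.7, power=0.80, alpha=0.05)
print(f"Needed per arm for 80% power: {n_per_arm:.1f}")  # about 33
```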

Page 6:

Sample Size Power Calculation

• More optimistic scenario:

– Researcher involves statistician in design of study and analysis plan

– Bases the effect size target on the minimum clinically significant difference (MCSD) or a previously observed effect size

– Bases assumptions on the literature or previous research

– Adjusts the sample size or study design to achieve appropriate power if N is close to the feasible range

• Perfect?

– Publication bias => parameters may be over-estimated

– Pilot studies may themselves be underpowered

– Using point estimates leads to overestimation of power (Uebersax, 2007) – see the sketch below

– MCSD can lead to overpowered studies
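
On the point-estimate problem (Uebersax, 2007), a rough simulation sketch — all numbers here (point estimate, pilot size, main-study size) are illustrative assumptions of mine, not figures from the talk — shows why plugging a pilot's point estimate into a power calculation tends to overstate the power you will actually achieve:

```python
# Illustrative only: power at a point estimate vs. power averaged over the
# uncertainty around that estimate ("unconditional" power).
import numpy as np
from statsmodels.stats.power import TTestIndPower

rng = np.random.default_rng(1)
analysis = TTestIndPower()

point_d = 0.5    # pilot's point estimate of the standardized effect (assumed)
pilot_n = 20     # per arm in the hypothetical pilot
main_n = 64      # per arm planned for the main study

se_d = np.sqrt(2 / pilot_n)                          # approx. uncertainty in the estimate
plausible_d = rng.normal(point_d, se_d, size=2000)   # plausible true effects

power_at_point = analysis.power(effect_size=point_d, nobs1=main_n, alpha=0.05)
power_each = np.array([analysis.power(effect_size=abs(d), nobs1=main_n, alpha=0.05)
                       for d in plausible_d])

print(f"Power at the point estimate:    {power_at_point:.2f}")    # ~0.80
print(f"Expected (unconditional) power: {power_each.mean():.2f}") # noticeably lower
```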

Page 7:

Why do Power Calculations?

• Google “Why do Power Calculations?”

– yields “How to…”

Page 8:

Why do Power Calculations?

• Google “Why do Power Calculations?”

– yields “How to…”

– And stuff like this:

Page 9:

Why do Power Calculations?

• Top Reasons:

1. Needed for a grant application

2. Under-powering risks a failed study

3. Over-powering wastes resources

4. Either is unethical if human or animal subjects are involved

Page 10:

Why do Power Calculations?

• Top Reasons:

1. Needed for a grant application

2. Under-powering risks a failed study

3. Over-powering wastes resources

4. Either is unethical if human or animal subjects are involved

• EMERGING REASON:

– Reproducibility Crisis!

– Many (most) significant results are false positives!

• e.g. Ioannidis 2005

– Publication Bias

– Underpowered studies yield inflated results

Page 11:

Reproducibility of Research


• How do you know what your type I error rate is?

– Shouldn’t it be α = 5%?!

Page 12:

Reproducibility of Research


• How do you know what your type I error rate is?

• By replicating research:

• The Open Science Collaboration (2015) replicated 100 studies published in the top 3 psychology journals:

• 97% of the original results were significant

• Only 36% of the replications were significant

• The average effect size fell from 0.4 to 0.2

Similar results have been found in many other fields

Page 13:

Reproducibility of Research


• Reproducibility Research in Neuroscience:

– Button et al (2013) compared results from 49 meta-analyses comprising 730 primary studies to estimate power

– median power was 20%, ranging from 8% to 31%

Page 14:

Reproducibility of Research


• Reproducibility Research in Neuroscience:

– Button et al (2013) compared results from 49 meta-analyses comprising 730 primary studies to estimate power

– median power was 20%, ranging from 8% to 31%

• Ioannidis (2005), “Why most published research findings are false”:

– The probability that a research claim is true is a function of study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships (which is unknown)

– Related to the screening problem

Page 15:

Reproducibility of Research


• Misunderstanding alpha

– Alpha is NOT the probability that your result is a false positive

– α = 5% does NOT mean that you are 95% confident your result is true

– α is the probability you reject your H0 IF it is TRUE

• We don’t know if it is true

– 1-β is the probability you reject your H0 IF it is FALSE

• We also don’t know if it is false

Page 16:

Reproducibility of Research


• Probability a published result is truly significant?

• Prob(finding is true)×(1-β)×Prob(it gets published)

• Probability a published result is falsely significant?

• Prob(finding is false)×(α)×Prob(it gets published)

Page 17:

Reproducibility of Research


• Probability a published result is truly significant?

• Prob(finding is true)×(1-β)×Prob(it gets published)

• Probability a published result is falsely significant?

• Prob(finding is false)×(α)×Prob(it gets published)

• Depends on these unknowns, and on power and the significance level

• Novel findings are less likely to be true – but maybe more likely to be published – than replications of previous studies

• If Prob(finding is true) = 5%, then Prob(a significant result is false) = (95% × 5%) / (5% × 80% + 95% × 5%) ≈ 54% – more likely false than true!
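
That arithmetic generalizes to a one-line function. A small sketch of it (assuming, as the slide's calculation implicitly does, that true and false positives are equally likely to be published, so the publication terms cancel):

```python
# Chance that a significant result is false, given the prior probability the
# hypothesis is true, the study's power, and alpha. Publication probabilities
# are taken as equal for true and false positives, so they cancel out.
def prob_false_given_significant(p_true, power=0.80, alpha=0.05):
    true_pos = p_true * power          # real effect, correctly significant
    false_pos = (1 - p_true) * alpha   # no real effect, significant by chance
    return false_pos / (true_pos + false_pos)

for p_true in (0.50, 0.20, 0.05):
    print(f"P(true) = {p_true:.2f}: "
          f"P(false | significant) = {prob_false_given_significant(p_true):.0%}")
# P(true) = 0.05 with 80% power and alpha = 0.05 reproduces the ~54% above;
# lower power makes it worse.
```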

Page 18:

Alpha Inflation


• So, how do under-powered studies get published?

• Publication bias? Partly

• Inflation of the type I error rate increases power

– Multiple testing – control the FWER (family-wise error rate): Bonferroni, FDR (false discovery rate)

Page 19:

Alpha Inflation


• So, how do under-powered studies get published?

• Publication bias? Partly

• Inflation of the type I error rate increases power

– Multiple testing – control the FWER (family-wise error rate): Bonferroni, FDR (false discovery rate) – see the sketch after this list

– Other researcher degrees of freedom:

1. Definition of endpoint(s)

2. Inclusion/exclusion of subjects, outliers

3. Definition of subgroups

4. Analysis methods

5. Transformations, dichotomization, scoring/factors

6. Model selection procedures, selection of covariates, interaction terms, etc

7. Choice of which data/experiment to report on

(Simmons et al 2011)
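
For the multiple-testing item above, a quick sketch (my own numbers) of how fast the familywise error rate grows with the number of independent null tests, and the Bonferroni threshold that pulls it back to 5%:

```python
# Alpha inflation under multiple testing: probability of at least one
# "significant" result among k independent tests of true null hypotheses.
alpha = 0.05
for k in (1, 5, 10, 20):
    fwer = 1 - (1 - alpha) ** k        # familywise error rate
    bonferroni = alpha / k             # per-test threshold keeping FWER <= 5%
    print(f"k = {k:>2}: FWER = {fwer:.2f}, Bonferroni threshold = {bonferroni:.4f}")
# k = 10 already gives a ~40% chance of at least one spurious "discovery".
```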

Page 20:

Alpha Inflation


Example: CART to select biomarkers to predict malignant vs benign

N=100 subjects, 350 biomarkers, 6 biomarkers selected

Using 5-fold cross-validation: AUROC = 0.9550 (95.5%) for outcome M

              Predicted
  Actual     M      B
    M       35      7
    B        3     55

Page 21:

Alpha Inflation

Example: CART to select biomarkers to predict malignant vs benign

N=100 subjects, 350 biomarkers, 6 biomarkers selected

Using 5-fold cross-validation: AUROC = 0.9550 (95.5%) for outcome M

              Predicted
  Actual     M      B
    M       35      7
    B        3     55

[Figure: Receiver Operating Characteristic curve]

Admission:

All predictors are random normal

Outcome is random binomial

All independent – no relationships
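
The slide does not spell out how pure noise produced an AUROC of 95.5%, but one plausible mechanism is scoring the tree on the same subjects it was grown on (with cross-validation used only for tuning). A rough re-creation of that trap, sketched with scikit-learn rather than the original CART software:

```python
# Pure-noise "biomarker" study: a tree grown and scored on the same data looks
# spectacular, while an honest cross-validated estimate sits near chance.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 350))      # 350 "biomarkers", all random noise
y = rng.integers(0, 2, size=100)     # random binary outcome, unrelated to X

tree = DecisionTreeClassifier(max_leaf_nodes=8, random_state=0)

# Optimistic: fit on all 100 subjects and score those same subjects
auc_resub = roc_auc_score(y, tree.fit(X, y).predict_proba(X)[:, 1])

# Honest: refit within each of 5 folds and score only held-out subjects
proba_cv = cross_val_predict(tree, X, y, cv=5, method="predict_proba")[:, 1]
auc_cv = roc_auc_score(y, proba_cv)

print(f"Same-data (resubstitution) AUROC: {auc_resub:.2f}")   # impressively high
print(f"Cross-validated AUROC:            {auc_cv:.2f}")      # near 0.5 (chance)
```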

Page 22:

Alpha Inflation


Example: CART to select biomarkers to predict malignant vs benign

REAL DATA

N=125 subjects, 217 biomarkers, 13 biomarkers selected

Using 5-fold cross-validation: AUROC = 0.9828 (98.3%) for outcome M

[Figure: Receiver Operating Characteristic curve]

Fit the same set of biomarkers to 2 other datasets: AUROC > 90%

Page 23:

Alpha Inflation


Example: CART to select biomarkers to predict malignant vs benign

REAL DATA

N=125 subjects, 217 biomarkers, 13 biomarkers selected

Using 5-fold cross-validation: AUROC = 0.9828 (98.3%) for outcome M

Fit the same set of biomarkers to 2 other datasets: AUROC > 90%

[Figure: Receiver Operating Characteristic curve]

Randomly shuffled the outcome and refit the CART 10,000 times: 66.4% of the permuted datasets got larger AUROCs! (permutation p-value = 0.664)
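
The permutation check described above is straightforward to script. A sketch (scikit-learn again; `pipeline_auc` is a stand-in for whatever analysis produced the original AUROC, so treat it as illustrative):

```python
# Permutation test for an analysis pipeline: shuffle the outcome, rerun the
# whole pipeline, and count how often chance does as well as the real data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

def pipeline_auc(X, y):
    # Stand-in pipeline: a CART scored on its own training data, as in the
    # noise example above; replace with the real selection + fitting steps.
    tree = DecisionTreeClassifier(max_leaf_nodes=16, random_state=0).fit(X, y)
    return roc_auc_score(y, tree.predict_proba(X)[:, 1])

def permutation_pvalue(X, y, n_perm=1000, seed=0):
    rng = np.random.default_rng(seed)
    observed = pipeline_auc(X, y)
    perm_aucs = np.array([pipeline_auc(X, rng.permutation(y))
                          for _ in range(n_perm)])
    return observed, float((perm_aucs >= observed).mean())
```

A permutation p-value near 0.66, as on the slide, says the headline AUROC is no better than what the same pipeline extracts from shuffled, information-free outcomes.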

Page 24:

Recommendations


• How Can We Improve the Situation?

1. Perform Larger Studies

2. Registration of Studies

Reduce publication bias by surfacing negative results

3. Pre-specification of hypotheses, study design, analyses

4. Reporting standards, e.g. CONSORT, CAMARADES

5. Publish Raw Data to allow re-analysis and review

6. Incentivize replication of research

Page 25:

Recommendations


• How Can We Improve the Situation?

1. Perform Larger Studies

2. Registration of Studies

Reduce publication bias by surfacing negative results

3. Pre-specification of hypotheses, study design, analyses

4. Reporting standards, e.g. CONSORT, CAMARADES

5. Publish Raw Data to allow re-analysis and review

6. Incentivize replication of research

7. Reduce reliance on hypothesis testing and P-values – focus more on estimation and precision (i.e. confidence intervals) – see the sketch after this list

8. Focus on a program of research / staging

9. Bayesian thinking – evaluate prior knowledge and treat each study as improving the precision of that knowledge
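
Point 7 is concrete enough to compute. A minimal sketch (my construction, in the spirit of Maxwell, Kelley & Rausch 2008): plan the sample size for a target confidence-interval half-width rather than for power, here for a two-group difference in means under a normal approximation with a common, known SD:

```python
# Sample size per arm so the 95% CI for a two-group mean difference has a
# chosen half-width, expressed in SD units (normal approximation, known SD).
import numpy as np
from scipy.stats import norm

def n_per_arm_for_halfwidth(halfwidth_sd, confidence=0.95):
    z = norm.ppf(1 - (1 - confidence) / 2)
    # half-width ~= z * sqrt(2/n) SDs  =>  n >= 2 * (z / halfwidth)^2
    return int(np.ceil(2 * (z / halfwidth_sd) ** 2))

print(n_per_arm_for_halfwidth(0.50))   # ~31 per arm for a ±0.50 SD interval
print(n_per_arm_for_halfwidth(0.25))   # ~123 per arm for a ±0.25 SD interval
```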

Page 26:

Conclusion


– NO MORE Power Calculations? No.

• Rather – evaluate the statistical properties of the entire study in the context of existing research and the overall research plan

– Power and the real α – i.e. the probability of finding spurious results

• Involve Statistician in planning of research and study

• Statisticians:

– understand the Science behind the research

– understand the maths behind the statistics – too many recipe followers

Page 27:

Conclusion


– NO MORE Power Calculations?

• Rather – evaluate the statistical properties of the entire study in the context of existing research and the overall research plan

– Power and the real α – i.e. the probability of finding spurious results

• Involve Statistician in planning of research and study

• Statisticians:

– understand the Science behind the research

– understand the maths behind the statistics – too many recipe followers

Consider alternative study designs, e.g. adaptive designs, platform / umbrella studies

Triangulation – using multiple approaches in a research program, including qualitative studies, observational studies, meta-analyses, big data / registry studies, and natural experiments, e.g. Mendelian randomisation

Page 28:

References

• Ioannidis JPA (2005) Why most published research findings are false. PLoS Med 2(8): e124.

• Ioannidis & Bossuyt (2017) Waste, Leaks, and Failures in the Biomarker Pipeline. Clinical Chemistry 63(5): 963–972.

• Button KS et al (2013) Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience 14: 365.

• Open Science Collaboration (2015) Estimating the reproducibility of psychological science. Science 349: aac4716. DOI: 10.1126/science.aac4716

• Colquhoun D (2014) An investigation of the false discovery rate and the misinterpretation of p-values. R. Soc. Open Sci. 1: 140216. http://dx.doi.org/10.1098/rsos.140216

• Simmons et al (2011) False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychological Science 22(11): 1359–1366.

• Hartel (2015) A tale of two errors – why most significant results can't be replicated. https://www.linkedin.com/pulse/tale-two-errors-why-most-significant-results-cant-hartel-phd

• Maxwell, Kelley & Rausch (2008) Sample Size Planning for Statistical Power and Accuracy in Parameter Estimation. Annu. Rev. Psychol. 59: 537–563. doi: 10.1146/annurev.psych.59.103006.093735

• Uebersax JS (2007) Bayesian Unconditional Power Analysis. http://www.john-uebersax.com/stat/bpower.htm

Page 29:

Thank you

Gunter Hartel

www.qimrberghofer.edu.au