Are Power Calculations a Waste of Time?
Presented by A/Prof Gunter Hartel
6 September 2017
Who Am I?
A/Prof Gunter Hartel, Head of QIMR Berghofer Statistics Unit
• 10 statisticians, a PK/PD modeller, a data manager
• Statistical consultation, training, and collaboration
• Statistical methodology development / research
• Serving QIMRB researchers and clinical researchers from 6 local hospitals
Previously
• Principal Statistician at CNS (CRO) (Phase I clinical trials)
• Global Director of Statistics at CSL Behring (all clinical trials)
• UQ Population Health, QUT Statistics (Population Health)
• 30 years of statistical consulting
Sample Size Power Calculation
• The #1 request for a statistical consult
• Typical scenario:
– Protocol or grant proposal is near final draft
– Feasibility work done to determine how many samples or patients are available / affordable
– Request to statistician – please fill in:
"With N=__ subjects this study has 80% power for an α=0.05 test of the primary hypothesis."
*Note: we can only get 60 subjects at most – is that OK?
Sample Size Power Calculation
• Typical questions from the statistician:
– What is your #1 hypothesis / endpoint?
– What size of effect are you trying to detect? I.e., what effect is scientifically, clinically, or practically important?
– What are the expected background rates or variances in your study population?
• Typical answers:
– We're keeping the hypotheses flexible because we're not sure how it'll turn out
– If I knew all this info I wouldn't have to do the study!
– Or: info is borrowed from a different study, with a different population and endpoints, and probably underpowered too
Sample Size Power Calculation
• Typical outcome:
– "With N=60 patients our study has 80% power to detect an effect size of 0.7 with an α=0.05 test." (see the sketch below)
– The power calculation has no effect on the design of the study
– The study ends up finding a significant 3-way interaction in a sub-group when adjusted for significant covariates, and gets published
– Future research uses this study to help with power calculations
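Statements like the one above can be checked with standard software. A minimal sketch in Python using statsmodels, assuming the primary test is a two-sample t-test with equal group sizes (an assumption for illustration; the actual test is rarely stated in such requests):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Per-group N needed for 80% power at effect size d = 0.7, alpha = 0.05
n_per_group = analysis.solve_power(effect_size=0.7, alpha=0.05, power=0.8)
print(f"n per group: {n_per_group:.1f}")  # about 33 per group, i.e. ~66 subjects total

# Conversely, the power actually achieved with the feasible N = 60 (30 per group)
achieved = analysis.solve_power(effect_size=0.7, nobs1=30, alpha=0.05)
print(f"power at N = 60: {achieved:.2f}")  # about 0.76 - slightly short of 80%
```

Note the direction of the calculation: with N fixed at 60 by feasibility, the "detectable effect size" is back-solved to make the 80% statement come out, rather than N being chosen to detect a meaningful effect.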
Sample Size Power Calculation
• More optimistic scenario:
– Researcher involves the statistician in the design of the study and the analysis plan
– Bases the effect size target on the minimum clinically significant difference (MCSD) or a previously observed effect size
– Bases assumptions on the literature or previous research
– Adjusts the sample size or study design to achieve appropriate power if N is close to the feasible range
• Perfect?
– Publication bias => parameters may be over-estimated
– Pilot studies may be underpowered too
– Using point estimates leads to overestimation of power (Uebersax, 2007)
– MCSD can lead to overpowered studies
Why do Power Calculations?
• Googling "Why do power calculations?"
– mostly yields "How to..." guides
Why do Power Calculations?
• Top reasons:
1. Needed for grant applications
2. Under-power and you risk a failed study
3. Over-powering is a waste of resources
4. Unethical if human or animal subjects are involved
• EMERGING REASON:
– The reproducibility crisis!
– Many (most) significant results are false positives! (e.g. Ioannidis 2005)
– Publication bias
– Underpowered studies yield inflated results
Reproducibility of Research
• How do you know what your type I error rate is?
– Should be α = 5%!?
• By replicating research:
– Open Science Collaboration (2015) replicated 100 studies published in the top 3 psychology journals:
• 97% of the original results were significant
• Only 36% of the replications were significant
• The average effect size fell from 0.4 to 0.2
• Similar results have been found in many other fields
Reproducibility of Research
• Reproducibility research in neuroscience:
– Button et al (2013) compared results from 49 meta-analyses, comprising 730 primary studies, to estimate power
– Median power was 20%, ranging from 8% to 31%
• Ioannidis (2005), "Why most published research findings are false":
– The probability a research claim is true is a function of study power and bias, the number of other studies on the same question, and, importantly, the (unknown) ratio of true to no relationships
– Related to the screening problem
Reproducibility of Research
• Misunderstanding alpha
– Alpha is NOT the probability that your result is a false positive
– α = 5% does NOT mean that you are 95% confident your result is true
– α is the probability you reject your H0 IF it is TRUE
• We don't know whether it is true
– 1−β is the probability you reject your H0 IF it is FALSE
• We also don't know whether it is false (see the simulation sketch below)
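To make the conditional nature of α and power concrete, here is a minimal simulation sketch (my own illustration, assuming two-sample t-tests on normal data with 30 subjects per group): when H0 is true, about 5% of tests reject; when H0 is false, the rejection rate is the power.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, reps = 30, 10_000

# Case 1: H0 true - both groups drawn from the same distribution
rejections_null = sum(
    stats.ttest_ind(rng.normal(0, 1, n), rng.normal(0, 1, n)).pvalue < 0.05
    for _ in range(reps)
)
print(f"Rejection rate when H0 is true:  {rejections_null / reps:.3f}")  # ~0.05 = alpha

# Case 2: H0 false - true effect size d = 0.7
rejections_alt = sum(
    stats.ttest_ind(rng.normal(0, 1, n), rng.normal(0.7, 1, n)).pvalue < 0.05
    for _ in range(reps)
)
print(f"Rejection rate when H0 is false: {rejections_alt / reps:.3f}")  # ~0.76 = power
```

Neither rate tells you, on its own, the probability that a given significant result is true; that depends on how often H0 is actually false, as the next slide shows.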
Reproducibility of Research
• Probability a published result is truly significant?
– Prob(finding is true) × (1−β) × Prob(it gets published)
• Probability a published result is falsely significant?
– Prob(finding is false) × α × Prob(it gets published)
• Depends on unknowns as well as on power and significance level
• Novel findings are less likely to be true – but maybe more likely to be published – than replications of previous studies
• If Prob(finding is true) = 5%, then
Prob(significant result is false) = (95% × 5%) / (5% × 80% + 95% × 5%) ≈ 54%
– more likely false than true!
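The same arithmetic as a short sketch, so the inputs can be varied (power = 80% and α = 5% are the assumptions used above; true and false findings are assumed equally likely to be published, so that term cancels):

```python
def prob_false_given_significant(p_true, power=0.80, alpha=0.05):
    """P(finding is false | result is significant), assuming true and false
    findings are equally likely to be published (so that term cancels)."""
    true_pos = p_true * power           # true findings that reach significance
    false_pos = (1 - p_true) * alpha    # null findings that reach significance
    return false_pos / (true_pos + false_pos)

print(f"{prob_false_given_significant(0.05):.0%}")  # 54% - only 5% of tested hypotheses true
print(f"{prob_false_given_significant(0.50):.0%}")  # 6%  - half of tested hypotheses true
```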
Alpha Inflation
• So, how do under-powered studies get published?
• Publication bias? Partly
• Inflation of the type I error rate increases power:
– Multiple testing – control the FWER (family-wise error rate, e.g. Bonferroni) or the FDR (false discovery rate) (see the sketch after this list)
– Other researcher degrees of freedom:
1. Definition of endpoint(s)
2. Inclusion/exclusion of subjects and outliers
3. Definition of subgroups
4. Analysis methods
5. Transformations, dichotomization, scoring/factors
6. Model selection procedures, selection of covariates, interaction terms, etc.
7. Choice of which data/experiment to report on
(Simmons et al 2011)
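A minimal sketch of the FWER and FDR corrections mentioned above, using statsmodels' multipletests; the p-values are hypothetical, purely for illustration:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from 10 tests within one study
pvals = np.array([0.001, 0.008, 0.012, 0.024, 0.041, 0.049, 0.11, 0.27, 0.58, 0.94])

# Bonferroni controls the family-wise error rate (FWER) - strict
reject_bonf, _, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")

# Benjamini-Hochberg controls the false discovery rate (FDR) - less strict
reject_fdr, _, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

print("Significant uncorrected: ", (pvals < 0.05).sum())  # 6 of 10
print("Significant (Bonferroni):", reject_bonf.sum())     # 1 of 10
print("Significant (FDR, BH):   ", reject_fdr.sum())      # 3 of 10
```

Reporting six "significant" findings without any correction is exactly the kind of α inflation that makes an under-powered study look successful.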
Alpha Inflation
Example: CART to select biomarkers to predict malignant vs benign
• N=100 subjects, 350 biomarkers; 6 biomarkers selected
• Using 5-fold cross-validation: AUROC = 95.5%

                Predicted
Actual outcome   M    B
M               35    7
B                3   55

[Figure: Receiver Operating Characteristic curve; area under the curve (outcome M) = 0.9550]

• Admission:
– All predictors are random normal
– The outcome is random binomial
– All independent – no relationships
Alpha Inflation
Example: CART to select biomarkers to predict malignant vs benign – REAL DATA
• N=125 subjects, 217 biomarkers; 13 biomarkers selected
• Using 5-fold cross-validation: AUROC = 98.3%
• Fit the same set of biomarkers to 2 other datasets: AUROC > 90%

[Figure: Receiver Operating Characteristic curve; area under the curve (outcome M) = 0.9828]

• Randomly shuffled the outcome and refit the CART 10,000 times:
– 66.4% of the permuted datasets got larger AUROCs! (permutation p-value = 0.664)
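This kind of permutation check can be reproduced in outline. A minimal sketch with synthetic data and scikit-learn (my own reconstruction; the speaker's exact CART pipeline isn't shown): biomarker selection is done on the full dataset (the leak), cross-validation happens only afterwards, and the permutation test then reveals that shuffled outcomes do just as well.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 350))   # 350 "biomarkers", pure noise
y = rng.integers(0, 2, size=100)  # random binary outcome - no relationships

def leaky_cv_auc(X, y):
    """The flawed workflow: pick the 6 'best' markers on ALL the data,
    then cross-validate a CART on only those markers."""
    X_sel = SelectKBest(f_classif, k=6).fit_transform(X, y)  # <-- the leak
    tree = DecisionTreeClassifier(random_state=0)
    return cross_val_score(tree, X_sel, y, cv=5, scoring="roc_auc").mean()

observed = leaky_cv_auc(X, y)  # inflated above 0.5 despite pure noise

# Permutation check: shuffle the outcome and rerun the ENTIRE workflow
perm_aucs = [leaky_cv_auc(X, rng.permutation(y)) for _ in range(200)]
p_value = np.mean([auc >= observed for auc in perm_aucs])

print(f"Leaky cross-validated AUROC: {observed:.3f}")
print(f"Permutation p-value:         {p_value:.3f}")  # nowhere near significant
```

The key design point is that each permutation reruns the selection step too; permuting only the final model's inputs would understate the optimism.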
Recommendations
• How can we improve the situation?
1. Perform larger studies
2. Registration of studies – reduce publication bias by surfacing negative results
3. Pre-specification of hypotheses, study design, and analyses
4. Reporting standards, e.g. CONSORT, CAMARADES
5. Publish raw data to allow re-analysis and review
6. Incentivize replication of research
7. Reduce reliance on hypothesis testing and P-values – focus more on estimation and precision (i.e. CIs)
8. Focus on a program of research / staging
9. Bayesian thinking – evaluate prior knowledge and treat each study as improving the precision of prior knowledge
Conclusion
– NO MORE power calculations? No.
• Rather – evaluate the statistical properties of the entire study in the context of existing research and the overall research plan
– Power and the real α – i.e. the probability of finding spurious results
• Involve a statistician in the planning of research and studies
• Statisticians:
– understand the science behind the research
– understand the maths behind the statistics – too many recipe followers
• Consider alternate study designs, e.g. adaptive designs, platform / umbrella studies
• Triangulation – using multiple approaches in a research program, including qualitative studies, observational studies, meta-analyses, big data / registry studies, and natural experiments (e.g. Mendelian randomisation)
References
• Ioannidis JPA (2005) Why most published research findings are false. PLoS Med 2(8): e124.
• Ioannidis & Bossuyt (2017) Waste, leaks, and failures in the biomarker pipeline. Clinical Chemistry 63(5): 963–972.
• Button KS et al (2013) Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience 14: 365–376.
• Open Science Collaboration (2015) Estimating the reproducibility of psychological science. Science 349: aac4716. DOI: 10.1126/science.aac4716
• Colquhoun D (2014) An investigation of the false discovery rate and the misinterpretation of p-values. Royal Society Open Science 1: 140216. http://dx.doi.org/10.1098/rsos.140216
• Simmons JP, Nelson LD & Simonsohn U (2011) False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science 22(11): 1359–1366.
• Hartel G (2015) A tale of two errors – why most significant results can't be replicated. https://www.linkedin.com/pulse/tale-two-errors-why-most-significant-results-cant-hartel-phd
• Maxwell SE, Kelley K & Rausch JR (2008) Sample size planning for statistical power and accuracy in parameter estimation. Annual Review of Psychology 59: 537–563. doi: 10.1146/annurev.psych.59.103006.093735
• Uebersax JS (2007) Bayesian unconditional power analysis. http://www.john-uebersax.com/stat/bpower.htm
Thank you
Gunter Hartel
www.qimrberghofer.edu.au