Are Power Calculations a Waste of Time?
Presented by A/Prof Gunter Hartel
6 September 2017
Who Am I?
A/Prof Gunter Hartel, Head of QIMR Berghofer Statistics Unit
• 10 statisticians, a PK/PD modeller, a data manager
• Statistical consultation, training, and collaboration
• Statistical methodology development / research
• Serving QIMRB researchers and clinical researchers from 6 local hospitals
Previously
• Principal Statistician at CNS (CRO) (Phase I clinical trials)
• Global Director of Statistics at CSL Behring (all clinical trials)
• UQ Population Health, QUT Statistics (Population Health)
• 30 years of statistical consulting
Sample Size Power Calculation
• The #1 request for a statistical consult
• Typical scenario:
– Protocol or grant proposal is near final draft
– Feasibility work done to determine how many samples or patients are available / affordable
– Request to statistician – please fill in:
"With N=__ subjects this study has 80% power for an α=0.05 test of the primary hypothesis."
*Note: we can only get 60 subjects at most – is that OK?
Sample Size Power Calculation
• Typical questions from the statistician:
– What is your #1 hypothesis / endpoint?
– What size of effect are you trying to detect? I.e., what effect is scientifically, clinically, or practically important?
– What are the expected background rates or variances in your study population?
• Typical answers:
– We're keeping the hypotheses flexible because we're not sure how it'll turn out
– If I knew all this info I wouldn't have to do the study!
– Or: info is borrowed from a different study, with a different population and endpoints, and probably underpowered too
Sample Size Power Calculation
• Typical outcome:
– "With N=60 patients our study has 80% power to detect an effect size of 0.7 with an α=0.05 test." (see the sketch below)
– The power calculation has no effect on the design of the study
– The study ends up finding a significant 3-way interaction in a sub-group when adjusted for significant covariates, and gets published
– Future research uses this study to help with power calculations
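Statements like the one above can be checked with standard software. A minimal sketch in Python using statsmodels, assuming the primary test is a two-sample t-test with equal group sizes (an assumption for illustration; the actual test is rarely stated in such requests):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Per-group N needed for 80% power at effect size d = 0.7, alpha = 0.05
n_per_group = analysis.solve_power(effect_size=0.7, alpha=0.05, power=0.8)
print(f"n per group: {n_per_group:.1f}")  # about 33 per group, i.e. ~66 subjects total

# Conversely, the power actually achieved with the feasible N = 60 (30 per group)
achieved = analysis.solve_power(effect_size=0.7, nobs1=30, alpha=0.05)
print(f"power at N = 60: {achieved:.2f}")  # about 0.76 - slightly short of 80%
```

Note the direction of the calculation: with N fixed at 60 by feasibility, the "detectable effect size" is back-solved to make the 80% statement come out, rather than N being chosen to detect a meaningful effect.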
Sample Size Power Calculation
• More optimistic scenario:
– Researcher involves the statistician in the design of the study and the analysis plan
– Bases the effect size target on the minimum clinically significant difference (MCSD) or a previously observed effect size
– Bases assumptions on the literature or previous research
– Adjusts the sample size or study design to achieve appropriate power if N is close to the feasible range
• Perfect?
– Publication bias => parameters may be over-estimated
– Pilot studies may be underpowered too
– Using point estimates leads to overestimation of power (Uebersax, 2007)
– MCSD can lead to overpowered studies
Why do Power Calculations?
• Googling "Why do power calculations?"
– mostly yields "How to..." guides
Why do Power Calculations?
• Top reasons:
1. Needed for grant applications
2. Under-power and you risk a failed study
3. Over-powering is a waste of resources
4. Unethical if human or animal subjects are involved
• EMERGING REASON:
– The reproducibility crisis!
– Many (most) significant results are false positives! (e.g. Ioannidis 2005)
– Publication bias
– Underpowered studies yield inflated results
Reproducibility of Research
• How do you know what your type I error rate is?
– Should be α = 5%!?
• By replicating research:
– Open Science Collaboration (2015) replicated 100 studies published in the top 3 psychology journals:
• 97% of the original results were significant
• Only 36% of the replications were significant
• The average effect size fell from 0.4 to 0.2
• Similar results have been found in many other fields
Reproducibility of Research
• Reproducibility research in neuroscience:
– Button et al (2013) compared results from 49 meta-analyses, comprising 730 primary studies, to estimate power
– Median power was 20%, ranging from 8% to 31%
• Ioannidis (2005), "Why most published research findings are false":
– The probability a research claim is true is a function of study power and bias, the number of other studies on the same question, and, importantly, the (unknown) ratio of true to no relationships
– Related to the screening problem
Reproducibility of Research
• Misunderstanding alpha
– Alpha is NOT the probability that your result is a false positive
– α = 5% does NOT mean that you are 95% confident your result is true
– α is the probability you reject your H0 IF it is TRUE
• We don't know whether it is true
– 1−β is the probability you reject your H0 IF it is FALSE
• We also don't know whether it is false (see the simulation sketch below)
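To make the conditional nature of α and power concrete, here is a minimal simulation sketch (my own illustration, assuming two-sample t-tests on normal data with 30 subjects per group): when H0 is true, about 5% of tests reject; when H0 is false, the rejection rate is the power.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, reps = 30, 10_000

# Case 1: H0 true - both groups drawn from the same distribution
rejections_null = sum(
    stats.ttest_ind(rng.normal(0, 1, n), rng.normal(0, 1, n)).pvalue < 0.05
    for _ in range(reps)
)
print(f"Rejection rate when H0 is true:  {rejections_null / reps:.3f}")  # ~0.05 = alpha

# Case 2: H0 false - true effect size d = 0.7
rejections_alt = sum(
    stats.ttest_ind(rng.normal(0, 1, n), rng.normal(0.7, 1, n)).pvalue < 0.05
    for _ in range(reps)
)
print(f"Rejection rate when H0 is false: {rejections_alt / reps:.3f}")  # ~0.76 = power
```

Neither rate tells you, on its own, the probability that a given significant result is true; that depends on how often H0 is actually false, as the next slide shows.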
Reproducibility of Research
• Probability a published result is truly significant?
– Prob(finding is true) × (1−β) × Prob(it gets published)
• Probability a published result is falsely significant?
– Prob(finding is false) × α × Prob(it gets published)
• Depends on unknowns as well as on power and significance level
• Novel findings are less likely to be true – but maybe more likely to be published – than replications of previous studies
• If Prob(finding is true) = 5%, then
Prob(significant result is false) = (95% × 5%) / (5% × 80% + 95% × 5%) ≈ 54%
– more likely false than true!
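The same arithmetic as a short sketch, so the inputs can be varied (power = 80% and α = 5% are the assumptions used above; true and false findings are assumed equally likely to be published, so that term cancels):

```python
def prob_false_given_significant(p_true, power=0.80, alpha=0.05):
    """P(finding is false | result is significant), assuming true and false
    findings are equally likely to be published (so that term cancels)."""
    true_pos = p_true * power           # true findings that reach significance
    false_pos = (1 - p_true) * alpha    # null findings that reach significance
    return false_pos / (true_pos + false_pos)

print(f"{prob_false_given_significant(0.05):.0%}")  # 54% - only 5% of tested hypotheses true
print(f"{prob_false_given_significant(0.50):.0%}")  # 6%  - half of tested hypotheses true
```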
Alpha Inflation
• So, how do under-powered studies get published?
• Publication bias? Partly
• Inflation of the type I error rate increases power:
– Multiple testing – control the FWER (family-wise error rate, e.g. Bonferroni) or the FDR (false discovery rate) (see the sketch after this list)
– Other researcher degrees of freedom:
1. Definition of endpoint(s)
2. Inclusion/exclusion of subjects and outliers
3. Definition of subgroups
4. Analysis methods
5. Transformations, dichotomization, scoring/factors
6. Model selection procedures, selection of covariates, interaction terms, etc.
7. Choice of which data/experiment to report on
(Simmons et al 2011)
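A minimal sketch of the FWER and FDR corrections mentioned above, using statsmodels' multipletests; the p-values are hypothetical, purely for illustration:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from 10 tests within one study
pvals = np.array([0.001, 0.008, 0.012, 0.024, 0.041, 0.049, 0.11, 0.27, 0.58, 0.94])

# Bonferroni controls the family-wise error rate (FWER) - strict
reject_bonf, _, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")

# Benjamini-Hochberg controls the false discovery rate (FDR) - less strict
reject_fdr, _, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

print("Significant uncorrected: ", (pvals < 0.05).sum())  # 6 of 10
print("Significant (Bonferroni):", reject_bonf.sum())     # 1 of 10
print("Significant (FDR, BH):   ", reject_fdr.sum())      # 3 of 10
```

Reporting six "significant" findings without any correction is exactly the kind of α inflation that makes an under-powered study look successful.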
Alpha Inflation
Example: CART to select biomarkers to predict malignant vs benign
• N=100 subjects, 350 biomarkers; 6 biomarkers selected
• Using 5-fold cross-validation: AUROC = 95.5%

                Predicted
Actual outcome   M    B
M               35    7
B                3   55

[Figure: Receiver Operating Characteristic curve; area under the curve (outcome M) = 0.9550]

• Admission:
– All predictors are random normal
– The outcome is random binomial
– All independent – no relationships
Alpha Inflation
Example: CART to select biomarkers to predict malignant vs benign – REAL DATA
• N=125 subjects, 217 biomarkers; 13 biomarkers selected
• Using 5-fold cross-validation: AUROC = 98.3%
• Fit the same set of biomarkers to 2 other datasets: AUROC > 90%

[Figure: Receiver Operating Characteristic curve; area under the curve (outcome M) = 0.9828]

• Randomly shuffled the outcome and refit the CART 10,000 times:
– 66.4% of the permuted datasets got larger AUROCs! (permutation p-value = 0.664)
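This kind of permutation check can be reproduced in outline. A minimal sketch with synthetic data and scikit-learn (my own reconstruction; the speaker's exact CART pipeline isn't shown): biomarker selection is done on the full dataset (the leak), cross-validation happens only afterwards, and the permutation test then reveals that shuffled outcomes do just as well.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 350))   # 350 "biomarkers", pure noise
y = rng.integers(0, 2, size=100)  # random binary outcome - no relationships

def leaky_cv_auc(X, y):
    """The flawed workflow: pick the 6 'best' markers on ALL the data,
    then cross-validate a CART on only those markers."""
    X_sel = SelectKBest(f_classif, k=6).fit_transform(X, y)  # <-- the leak
    tree = DecisionTreeClassifier(random_state=0)
    return cross_val_score(tree, X_sel, y, cv=5, scoring="roc_auc").mean()

observed = leaky_cv_auc(X, y)  # inflated above 0.5 despite pure noise

# Permutation check: shuffle the outcome and rerun the ENTIRE workflow
perm_aucs = [leaky_cv_auc(X, rng.permutation(y)) for _ in range(200)]
p_value = np.mean([auc >= observed for auc in perm_aucs])

print(f"Leaky cross-validated AUROC: {observed:.3f}")
print(f"Permutation p-value:         {p_value:.3f}")  # nowhere near significant
```

The key design point is that each permutation reruns the selection step too; permuting only the final model's inputs would understate the optimism.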
Recommendations
• How can we improve the situation?
1. Perform larger studies
2. Registration of studies – reduce publication bias by surfacing negative results
3. Pre-specification of hypotheses, study design, and analyses
4. Reporting standards, e.g. CONSORT, CAMARADES
5. Publish raw data to allow re-analysis and review
6. Incentivize replication of research
7. Reduce reliance on hypothesis testing and P-values – focus more on estimation and precision (i.e. CIs)
8. Focus on a program of research / staging
9. Bayesian thinking – evaluate prior knowledge and treat each study as improving the precision of prior knowledge
Conclusion
– NO MORE power calculations? No.
• Rather – evaluate the statistical properties of the entire study in the context of existing research and the overall research plan
– Power and the real α – i.e. the probability of finding spurious results
• Involve a statistician in the planning of research and studies
• Statisticians:
– understand the science behind the research
– understand the maths behind the statistics – too many recipe followers
• Consider alternate study designs, e.g. adaptive designs, platform / umbrella studies
• Triangulation – using multiple approaches in a research program, including qualitative studies, observational studies, meta-analyses, big data / registry studies, and natural experiments (e.g. Mendelian randomisation)
References
• Ioannidis JPA (2005) Why most published research findings are false. PLoS Med 2(8): e124.
• Ioannidis & Bossuyt (2017) Waste, leaks, and failures in the biomarker pipeline. Clinical Chemistry 63(5): 963–972.
• Button KS et al (2013) Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience 14: 365–376.
• Open Science Collaboration (2015) Estimating the reproducibility of psychological science. Science 349: aac4716. DOI: 10.1126/science.aac4716
• Colquhoun D (2014) An investigation of the false discovery rate and the misinterpretation of p-values. Royal Society Open Science 1: 140216. http://dx.doi.org/10.1098/rsos.140216
• Simmons JP, Nelson LD & Simonsohn U (2011) False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science 22(11): 1359–1366.
• Hartel G (2015) A tale of two errors – why most significant results can't be replicated. https://www.linkedin.com/pulse/tale-two-errors-why-most-significant-results-cant-hartel-phd
• Maxwell SE, Kelley K & Rausch JR (2008) Sample size planning for statistical power and accuracy in parameter estimation. Annual Review of Psychology 59: 537–563. doi: 10.1146/annurev.psych.59.103006.093735
• Uebersax JS (2007) Bayesian unconditional power analysis. http://www.john-uebersax.com/stat/bpower.htm
Thank you
Gunter Hartel
www.qimrberghofer.edu.au