The folly of believing positive findings from underpowered intervention studies

Too Good to Be True: Health Psychology’s Dependence on Underpowered Positive Studies

James C. Coyne, Ph.D.

University of Groningen, University Medical Center

Groningen, The Netherlands

Twitter @CoyneoftheRealm

Long a pervasive problem…

Lack of sufficient resources to conduct well-designed, amply powered studies

Confusion about pilot studies: Cannot be the basis for evaluating efficacy or estimating effect sizes!

“We are grateful to the Society of Behavioral Medicine (SBM) for selecting the authorship group. This article is one of three meta-analyses that have been undertaken under the aegis of the SBM Evidence-Based Behavioral Medicine Committee; the other two meta-analyses examine the effects of psychosocial interventions on depression and fatigue among patients with cancer.”

SBM Initiative

Meta-analyses generated by professional organizations should receive special critical scrutiny because of their tendency to gloss over the limits of the literature in order to promote the services of their membership.

Small Studies

Suffer strong publication bias.

Negative findings go unpublished because the studies are too small.

Positive findings celebrated because they were obtained despite the smallness.


Require a larger effect size for statistical significance.

Published results tend to be exaggerated and not to be replicated in larger and better quality later studies.
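The two points above can be illustrated with a small simulation. This is only a sketch under stated assumptions: a two-arm trial with normally distributed outcomes, a normal approximation to the significance test, and a hypothetical true effect of d = 0.2. The function names and parameter values are illustrative and do not come from any of the studies discussed.

```python
import math
import random
import statistics

Z_CRIT = 1.959963984540054  # two-sided 5% critical value of the standard normal


def critical_d(n_per_arm):
    """Smallest observed standardized difference reaching p < .05.

    Normal approximation for two equal arms; a t-test would demand slightly more.
    """
    return Z_CRIT * math.sqrt(2.0 / n_per_arm)


def simulate_published_effect(true_d, n_per_arm, n_trials=2000, seed=1):
    """Simulate trials and 'publish' only those significant in favor of treatment.

    Returns (mean d over all trials, mean d over published trials, share published).
    """
    rng = random.Random(seed)
    threshold = critical_d(n_per_arm)
    all_d, published = [], []
    for _ in range(n_trials):
        treat = [rng.gauss(true_d, 1.0) for _ in range(n_per_arm)]
        ctrl = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
        pooled_sd = math.sqrt(
            (statistics.variance(treat) + statistics.variance(ctrl)) / 2
        )
        d = (statistics.mean(treat) - statistics.mean(ctrl)) / pooled_sd
        all_d.append(d)
        if d > threshold:
            published.append(d)
    mean_published = statistics.mean(published) if published else float("nan")
    return statistics.mean(all_d), mean_published, len(published) / n_trials


if __name__ == "__main__":
    for n in (20, 35, 200):
        print(n, round(critical_d(n), 3))
    print(simulate_published_effect(true_d=0.2, n_per_arm=20))
```

With 20 patients per arm, an observed effect must exceed roughly d = 0.62 to reach significance under this approximation, so every "published" trial in the simulation necessarily reports an effect far above the hypothetical true value of 0.2.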

Small Trials Likely to Have Outliers, and With Publication Bias, Yield Results That Won’t Replicate

Hospital A has 10 births per month on average. Hospital B has 100 births per month on average.

In January, one of the hospitals reported 70% of the births were girls. Is it more likely in A, B, or equally likely to be in either?
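The hospital question can be checked with a binomial calculation. The sketch below assumes each birth is independently a girl with probability 0.5, treats the monthly averages as fixed counts, and reads "70% girls" as "at least 70%"; all three are simplifying assumptions made for illustration.

```python
from math import comb


def prob_at_least_percent(n_births, percent=70, p_girl=0.5):
    """P(at least `percent`% of n_births are girls) under a binomial model."""
    # Integer ceiling of n_births * percent / 100, avoiding float rounding.
    k_min = (n_births * percent + 99) // 100
    return sum(
        comb(n_births, k) * p_girl**k * (1 - p_girl) ** (n_births - k)
        for k in range(k_min, n_births + 1)
    )


if __name__ == "__main__":
    print("Hospital A (10 births): ", prob_at_least_percent(10))
    print("Hospital B (100 births):", prob_at_least_percent(100))
```

Under these assumptions Hospital A sees such a month with probability 176/1024, about 17%, while for Hospital B the probability is below 1 in 10,000. The extreme result is far more likely to come from the small hospital, just as extreme effect sizes are more likely to come from small trials.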


Small studies are particularly vulnerable to selective loss of patients to follow-up and to investigators and outcome raters knowing which condition patients are assigned to.

Investigators can naïvely or deliberately monitor incoming data and stop the trial when a positive finding has been obtained, even when it is a chance finding that would be undone with continued accumulation of patients.
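The cost of monitoring incoming data and stopping on the first significant result can be sketched with a simulation in which the treatment has no effect at all. The setup is an assumption chosen for simplicity: outcomes with known unit variance, a z-test at each look, and a hypothetical trial of up to 100 patients per arm inspected after every 5 patients per arm.

```python
import math
import random

Z_CRIT = 1.959963984540054  # two-sided 5% critical value of the standard normal


def trial_stops_positive(rng, max_n_per_arm, look_every):
    """Run one null trial; return True if any interim look shows |z| > Z_CRIT."""
    sum_diff = 0.0
    for n in range(1, max_n_per_arm + 1):
        # One new patient per arm; both arms share the same true mean (no effect).
        sum_diff += rng.gauss(0.0, 1.0) - rng.gauss(0.0, 1.0)
        if n % look_every == 0:
            z = sum_diff / math.sqrt(2.0 * n)  # Var(sum of n differences) = 2n
            if abs(z) > Z_CRIT:
                return True
    return False


def false_positive_rate(max_n_per_arm, look_every, n_sims=2000, seed=7):
    """Share of null trials declared 'positive' under the given monitoring plan."""
    rng = random.Random(seed)
    hits = sum(
        trial_stops_positive(rng, max_n_per_arm, look_every) for _ in range(n_sims)
    )
    return hits / n_sims


if __name__ == "__main__":
    print("single final analysis:", false_positive_rate(100, 100))
    print("look every 5 patients:", false_positive_rate(100, 5))
```

A single analysis at the planned sample size keeps the false positive rate near the nominal 5%, whereas repeated uncorrected looks push it well above that, even though no true effect exists.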

Sample Size

Sample size is the best proxy for other sources of bias in trials.

Sample size negatively predicts overall effect size.

In the presence of small-study effects, restricting analyses to large trials, or using predictions of the treatment benefits observed in large trials, might provide more valid estimates than overall analyses of trials irrespective of sample size.

Sheinfeld Gorin, et al. "Meta-analysis of psychosocial interventions to reduce pain in patients with cancer." Journal of Clinical Oncology 30:5 (2012): 539-547.

Forest plot of effect sizes (g) for studies measuring pain severity (k = 38).

Sheinfeld Gorin S et al. JCO 2012;30:539-547

©2012 by American Society of Clinical Oncology

What the SBM Authors Claimed about Psychosocial Interventions for Cancer Pain

“Robust findings” of “substantial rigor” and “strong evidence for psychosocial pain management approaches.”

Claimed findings supported the “systematic implementation” of these techniques.

Estimated that it would take 812 unpublished studies lurking in file drawers to change their assessment.

19 of 38 studies had fewer than 35 patients in the intervention or control group. Two of the other largest trials should have been excluded for other reasons.

Of 13 studies individually having significant effects on pain severity, 8 would have been excluded because they were too small, 1 because it should not have been included in the first place.

For 4 studies having the largest effect sizes, 1 had only 20 patients receiving relaxation; the next largest had 10 patients who were hypnotized; the next, 20 patients listening to the relaxation tape versus 20 patients getting live instructions, but these numbers were obtained by replacing patients who dropped out.

Study with the fourth largest effect size had 15 patients receiving training in self-hypnosis.

Some of the studies were quite small:

7 patients receiving pain education

10 patients receiving hypnosis

16 patients getting pain education

16 patients getting self-hypnosis

8 patients getting relaxation plus 8 patients getting CBT plus relaxation

What is Left

Study               g       Lower   Upper   Intervention
Montgomery          0.9     0.61    2.67    Hypnosis
Lang                0.42    0.08    0.78    Hypnosis
Allard              0.32    -0.05   0.68    Nursing/Patient Ed
Rimer               0.31    -0.02   0.63    Nursing/Patient Ed
Yates               0.3     -0.03   0.63    Nursing/Patient Ed
DeWit               0.14    -0.08   0.36    Nursing/Patient Ed
DeWit Van Dam       -0.19   -0.61   0.24    Nursing/Patient Ed
Gaston-Johansson    -0.28   -0.65   0.46    Comprehensive Coping

Synthesis: pooling the results
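One common way to pool the remaining studies is fixed-effect inverse-variance weighting. The sketch below is an illustration of that technique, not a reconstruction of the calculation behind this slide. It assumes the three numbers listed for each study under "What is Left" are the effect size g and the bounds of a 95% confidence interval, and it backs out each standard error from the interval width. Two rows (Montgomery and Gaston-Johansson) have intervals that are not centered on the listed estimate, so any result for them is rough at best.

```python
import math

Z_95 = 1.959963984540054  # multiplier for a 95% normal confidence interval

# (study, g, lower, upper) as listed on the "What is Left" slide.
ROWS = [
    ("Montgomery", 0.90, 0.61, 2.67),
    ("Lang", 0.42, 0.08, 0.78),
    ("Allard", 0.32, -0.05, 0.68),
    ("Rimer", 0.31, -0.02, 0.63),
    ("Yates", 0.30, -0.03, 0.63),
    ("DeWit", 0.14, -0.08, 0.36),
    ("DeWit Van Dam", -0.19, -0.61, 0.24),
    ("Gaston-Johansson", -0.28, -0.65, 0.46),
]


def se_from_ci(lower, upper):
    """Standard error implied by the width of a symmetric 95% interval."""
    return (upper - lower) / (2.0 * Z_95)


def pool_fixed_effect(rows):
    """Inverse-variance weighted mean effect with its 95% interval."""
    weights = [1.0 / se_from_ci(lo, hi) ** 2 for _, _, lo, hi in rows]
    pooled = sum(w * g for w, (_, g, _, _) in zip(weights, rows)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled - Z_95 * se, pooled + Z_95 * se


if __name__ == "__main__":
    print(pool_fixed_effect(ROWS))
```

A fixed-effect model ignores heterogeneity between hypnosis, patient education, and coping interventions; a random-effects model would give wider intervals.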

Hart, et al. "Meta-analysis of efficacy of interventions for elevated depressive symptoms in adults diagnosed with cancer." Journal of the National Cancer Institute 104:13 (2012): 990-1004.


Hart S L et al. JNCI J Natl Cancer Inst 2012;104:990-1004

© The Author 2012. Published by Oxford University Press. All rights reserved.

3 studies classified as “psychotherapeutic” were complex collaborative care interventions for depression emphasizing medication management.

These studies provided the bulk (527) of the patients in the authors' calculation of the effect size for psychotherapeutic intervention.

Of the 2 remaining studies, 1 randomly assigned 45 patients to either problem-solving or waitlist control and retained only 37 patients for analyses.

Final study contributed 2 effect sizes based on comparisons of 29 patients receiving CBT and 23 receiving supportive therapy to the same 26-patient no-treatment control group, thus violating the assumption of independence of effect sizes.
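The size of the dependence created by a shared control group can be approximated. The sketch below assumes raw mean differences with a common outcome variance in all three groups, in which case the two comparisons share the sampling error of the control mean. For standardized effect sizes the figure is only an approximation. The function name is illustrative.

```python
import math


def shared_control_correlation(n_arm1, n_arm2, n_control):
    """Approximate correlation between two mean differences that share a control.

    With common variance s^2: Cov = s^2 / n_control and
    Var_i = s^2 * (1 / n_arm_i + 1 / n_control); s^2 cancels in the ratio.
    """
    cov = 1.0 / n_control
    var1 = 1.0 / n_arm1 + 1.0 / n_control
    var2 = 1.0 / n_arm2 + 1.0 / n_control
    return cov / math.sqrt(var1 * var2)


if __name__ == "__main__":
    # Group sizes reported above: 29 CBT, 23 supportive therapy, 26 shared control.
    print(shared_control_correlation(29, 23, 26))
```

With groups of roughly equal size the correlation is close to 0.5, so treating the two effect sizes as independent overstates the amount of information they contribute.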

With Removal of Small and Inappropriately Classified Studies

No Eligible Studies Were Left

Fail-safe N of 106 confirms the relative stability of the observed effect size.

“Our findings advance this literature by demonstrating that psychological and pharmacologic approaches, evaluated in RCTs, can be targeted productively toward cancer patients in need of intervention by virtue of clinical depression or elevated depressive symptoms.”

Fail Safe N is Pseudo-Precise Nonsense

Don’t Be Intimidated by Exaggerated Estimates of Number of Unpublished Studies Needed to Unseat Conclusions Based on Meta Analysis of Underpowered Studies.

Deficiencies of Failsafe N

Combining Z scores does not directly account for sample sizes of the studies.

Choice of zero for the average effect of the unpublished studies is arbitrary, almost certainly biased.

Allowing for unpublished negative studies substantially reduces failsafe N.
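These points can be made concrete with Rosenthal's version of the failsafe N, which combines study Z scores by Stouffer's method and asks how many unpublished studies would pull the combined Z below the one-tailed .05 threshold. The sketch below assumes that formulation. The Z scores are hypothetical values chosen for illustration and are not taken from either meta-analysis.

```python
import math

Z_ALPHA = 1.6448536269514722  # one-tailed .05 critical value of the standard normal


def stouffer_z(z_scores, n_unpublished=0, mean_unpublished_z=0.0):
    """Combined Z after adding n_unpublished studies with the given mean Z.

    Note that sample sizes appear nowhere: every study counts equally.
    """
    total = sum(z_scores) + n_unpublished * mean_unpublished_z
    return total / math.sqrt(len(z_scores) + n_unpublished)


def failsafe_n(z_scores, mean_unpublished_z=0.0):
    """Smallest number of unpublished studies making the combined Z nonsignificant."""
    if mean_unpublished_z > 0:
        raise ValueError("this sketch only handles null or negative unpublished studies")
    n = 0
    while stouffer_z(z_scores, n, mean_unpublished_z) >= Z_ALPHA:
        n += 1
    return n


if __name__ == "__main__":
    hypothetical_z = [2.1, 2.5, 1.9, 2.8, 2.2, 2.0, 2.6, 2.3]  # illustrative only
    print("unpublished studies average Z = 0:   ", failsafe_n(hypothetical_z))
    print("unpublished studies average Z = -0.2:", failsafe_n(hypothetical_z, -0.2))
```

Letting the unpublished studies lean even slightly negative, rather than averaging exactly zero, cuts the failsafe N to a fraction of its original value.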

Deficiencies of Failsafe N

Estimates of failsafe N are not influenced by evidence of bias in the data.

Guesswork to estimate the magnitude of unpublished studies in the area.

Heterogeneity among the studies is ignored.

Method is not influenced by the shape of the funnel graph.

Are Small, Unpowered Studies Good for Anything?

Leon, Andrew C., Lori L. Davis, and Helena C. Kraemer. The role and interpretation of pilot studies in clinical research. Journal of Psychiatric Research 45:5 (2011): 626-629.

A pilot study is not a hypothesis testing study.

Efficacy and effectiveness are not evaluated in a pilot.

A pilot study does not provide a meaningful effect size estimate for planning subsequent studies due to the imprecision inherent in data from small samples.

Feasibility results do not necessarily generalize beyond the inclusion and exclusion criteria of the pilot design.
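The imprecision of a pilot effect size can be quantified with the usual large-sample approximation to the variance of a standardized mean difference. The sketch below assumes that approximation and uses hypothetical sample sizes; with very small groups the true uncertainty is, if anything, somewhat larger.

```python
import math

Z_95 = 1.959963984540054  # multiplier for a 95% normal confidence interval


def d_confidence_interval(d, n1, n2):
    """Approximate 95% interval for a standardized mean difference d.

    Uses Var(d) ~ (n1 + n2) / (n1 * n2) + d^2 / (2 * (n1 + n2)).
    """
    var = (n1 + n2) / (n1 * n2) + d**2 / (2.0 * (n1 + n2))
    half_width = Z_95 * math.sqrt(var)
    return d - half_width, d + half_width


if __name__ == "__main__":
    print("pilot, 10 per arm: ", d_confidence_interval(0.5, 10, 10))
    print("trial, 200 per arm:", d_confidence_interval(0.5, 200, 200))
```

An observed d of 0.5 from a pilot with 10 patients per arm is compatible with anything from a harmful effect to a very large benefit, which is why it cannot anchor a power calculation for the subsequent trial.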
