Title - Critical Evaluation of Clinical Trial Data
Erick Turner, M.D.
Oregon Health & Science University
Dept of Psychiatry; Dept of Pharmacology
Portland VA Medical Center
Mood Disorders Center
Disclosure
No trade names, advertising, or product-group messages
Recovering promotional speaker
– Last “slip” was in fall of 2005
Objectives
Things to watch for in evaluating medical information
Heighten your level of skepticism and paranoia
May or may not apply to today’s talks
More about clinical trials in general, esp. industry-sponsored
Studies Presented Today
CATIE
STAR*D
STEP-BD
BOLDER
The A*C*R*O*N*Y*M Study
Effect of Acronym Name
Doubled the citation rate
Independent of study size, quality, outcome
Source
– Poster: What's in a NAME?
– Peer Review Congress 2005 (AMA)
Standard Clinical Trials vs. Large Simple Clinical Trials
Signal-to-noise
Small & clean N (standard clinical trials)
Big & dirty N
– “Dirt” “comes out in the wash”
Efficacy vs. Effectiveness
Patients: “squeaky clean” vs. “real world”
Comorbidities
– EtOH, other drugs
– Depression + anxiety
“The clinical evidence”
Whose evidence?
– Intellectual COI
• “I was right! I’ve been vindicated!”
• Attracting grant money - “the Midas touch”
Which evidence?
– Available evidence-based medicine
Selective Publication
Nonsignificant studies tend not to get published
Some studies never see the light of day
Among studies that are published
– Selective presentation of endpoints within those studies
– “Outcome reporting bias”
Why the Need for Selective Publication?
Unimpressive effect sizes in psychiatry
Many NS antidepressant trials
– 47/92 (51%) active tx arms NS
• Khan 2003 Neuropsychopharm
• Later-approved drugs and dosages
“The Emperor’s New Drugs”
80% of drug effect duplicated by placebo
2-point difference between drug and placebo
– HAMD-17-item max = 50 points
– 21-item max = 62 points
Kirsch I. Prevention & Treatment, Volume 5, Article 23, posted July 15, 2002
There Must Be 50 Ways . . .
…to put lipstick on a pig
Splice the Y-Axis
Depakote and Lithium
[Figure: Mania Rating Scale scores (y-axis, 0 to 30) vs. time on protocol in days (x-axis, 0 to 21) for placebo, divalproex, and lithium; *p < 0.05]
(Bowden et al, JAMA, 271:12, March 1994)
Show Change from Baseline (not Absolute) Scores
[Figure: Mania Rating Scale score (y-axis, 0 to 40) vs. study day (x-axis, days 0 to 21)]
(Keck et al, Am J. Psychiatry, 160:4, April 2003)
Non-Psychiatric Example
Graph in PDR: change scores
Same numbers: absolute scores
Don’t Show Variability in Data
Noise in data
– Random variability
– Interindividual differences
• Perhaps your patient isn’t “Mr. Mean”
Showing just means can be misleading
– Liquid N2
Prefer error bars (or even raw data points)
But how much/little overlap do you want the error bars to show?
Have it Your Way
Small error bars: standard error
Medium error bars: confidence interval (95%)
Large error bars: standard deviation
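A rough sketch of why the choice matters: for one and the same sample, the standard error gives the narrowest bars, the 95% confidence interval is about twice as wide, and the standard deviation is widest once N is more than a handful. The scores below are made up for illustration, the function name is hypothetical, and the 1.96 multiplier assumes a normal approximation.

```python
import math
import statistics

def error_bar_half_widths(scores):
    """Return the half-width of three common error bars for one sample."""
    n = len(scores)
    sd = statistics.stdev(scores)   # sample standard deviation
    se = sd / math.sqrt(n)          # standard error of the mean
    ci95 = 1.96 * se                # normal-approximation 95% CI half-width
    return {"SE": se, "95% CI": ci95, "SD": sd}

# Hypothetical rating-scale scores, not taken from any study in this talk
scores = [12, 18, 25, 9, 30, 22, 15, 27, 11, 20]
bars = error_bar_half_widths(scores)
for name, half_width in bars.items():
    print(f"{name:>6}: mean +/- {half_width:.2f}")
```

The same data can therefore look tight or noisy depending only on which statistic the author chose to plot, so check the figure legend.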
Overpower Your Study
Unnecessarily large N
Clinically insignificant result becomes statistically significant
Candidate A vs. Candidate B
Effect of the Number of Voters
Disclaimer: Assumes that popular vote matters
The split:
Total No. Voters    P value         News Headline
1,000               0.95            tie
10,000              0.84            tie
100,000             0.53            tie
1,000,000           0.046 (<.05)    A wins
10,000,000          <.0001          A wins by a landslide!!
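The table can be reproduced approximately with a two-sided z test against a 50/50 split. The split itself is not shown in this text, so the 50.1% vs. 49.9% value below is an assumption, chosen because it yields p values close to those in the table; the function name is illustrative.

```python
import math

def two_sided_p(share_a, n_voters):
    """Two-sided p value for H0: a 50/50 split, via the normal approximation."""
    se = math.sqrt(0.25 / n_voters)           # standard error of a proportion under H0
    z = (share_a - 0.5) / se
    return math.erfc(abs(z) / math.sqrt(2))   # P(|Z| >= |z|) for a standard normal

SHARE_A = 0.501  # assumed split (50.1% vs. 49.9%); not stated on the slide

p_values = {n: two_sided_p(SHARE_A, n) for n in (10**3, 10**4, 10**5, 10**6, 10**7)}
for n, p in p_values.items():
    print(f"{n:>10,} voters: p = {p:.4f}")
```

The margin between the candidates never changes; only the number of voters does, and that alone moves the result from "tie" to "landslide".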
Limitation of P Values
P values confounded by sample size
– Clinically insignificant difference can be statistically very significant
P values tell about precision
– How likely the difference observed could have occurred by chance
Clinicians and pts also interested in magnitude of effect
– Effect size
– Confidence intervals
– Reading: Jacob Cohen: The Earth is Round, P<.05
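As a minimal sketch of the distinction, the function below reports a standardized effect size (Cohen's d) and a 95% confidence interval for the drug-placebo difference. All numbers are hypothetical, and the calculation assumes two equal-sized arms with a shared standard deviation and a normal approximation.

```python
import math

def cohens_d_and_ci(mean_drug, mean_placebo, sd, n_per_arm):
    """Cohen's d and a normal-approximation 95% CI for the mean difference.

    Assumes two arms of equal size that share one standard deviation.
    """
    diff = mean_drug - mean_placebo
    d = diff / sd                              # standardized effect size
    se_diff = sd * math.sqrt(2 / n_per_arm)    # SE of a difference of two means
    ci = (diff - 1.96 * se_diff, diff + 1.96 * se_diff)
    return d, ci

# Hypothetical trial: 2-point difference in improvement, SD of 8 points
small_d, small_ci = cohens_d_and_ci(10.0, 8.0, 8.0, n_per_arm=50)
large_d, large_ci = cohens_d_and_ci(10.0, 8.0, 8.0, n_per_arm=5000)
print(f"N=50/arm:   d = {small_d:.2f}, CI = ({small_ci[0]:.2f}, {small_ci[1]:.2f})")
print(f"N=5000/arm: d = {large_d:.2f}, CI = ({large_ci[0]:.2f}, {large_ci[1]:.2f})")
```

The effect size is identical in both cases; a larger N only narrows the interval around the same 2-point difference, which is why magnitude has to be judged separately from the P value.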
Underpowered Studies
Could have clinically significant difference
N too small to reach statistical significance
Michael Jordan free-throw shootout
MJ vs. ET -- 7 free throws each
MJ makes 7, I make 3
P = .07 (NS, Fisher Exact test)
Conclusions
– There was “no difference” between us.
– I’m as good as Michael Jordan!
Vickers A, Medscape 2006. Michael Jordan Won't Accept the Null Hypothesis: Notes on Interpreting High P Values
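The P value on this slide can be checked by hand. The sketch below computes a two-sided Fisher exact test directly from the hypergeometric distribution for the 7-of-7 vs. 3-of-7 result; the function name and its equal-attempts restriction are choices made for this illustration.

```python
from math import comb

def fisher_exact_two_sided(made_a, made_b, shots_each):
    """Two-sided Fisher exact p value for two shooters with equal attempts."""
    total_made = made_a + made_b
    total_shots = 2 * shots_each

    def prob(k):
        # Hypergeometric probability that shooter A made exactly k of the total makes
        return (comb(total_made, k) * comb(total_shots - total_made, shots_each - k)
                / comb(total_shots, shots_each))

    observed = prob(made_a)
    k_min = max(0, total_made - shots_each)
    k_max = min(shots_each, total_made)
    # Sum every possible table that is no more likely than the observed one
    return sum(prob(k) for k in range(k_min, k_max + 1)
               if prob(k) <= observed * (1 + 1e-9))

p = fisher_exact_two_sided(made_a=7, made_b=3, shots_each=7)
print(f"P = {p:.2f}")  # P = 0.07
```

With only 7 shots each, even a perfect score against 3 makes fails to reach P < .05, which is the point of the slide: the test is underpowered, not the shooter.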
Lack of a significant difference does not mean equality!
If it’s not black, it’s not necessarily white, either… could be gray
Study could be underpowered
Beware claims of equivalence
But what if Ns are adequate?
Claims of Equivalence
Example: Two drugs performed “the same”.
Were both medications really equally effective?
Or were they equally ineffective?
St. John’s Wort vs. Sertraline
[Figure: HAM-D scores (y-axis, 0 to 30) vs. study week (x-axis, weeks 0 to 8) for hypericum and sertraline]
Mean decrease = 47% for Zoloft (vs. 38%) p = .06
JAMA Apr 10, 2002 -- Vol 287, No. 14, 1807-1814
. . . and with Placebo in the Picture
[Figure: HAM-D scores (y-axis, 0 to 30) vs. study week (x-axis, weeks 0 to 8) for hypericum, placebo, and sertraline]
Comparison      p
Hyp vs. Pbo     .59
Ser vs. Pbo     .18
Ser vs. Hyp     .06
St. John’s Wort vs. Sertraline
Analysis of other primary efficacy endpoint
[Figure: % full responders: hypericum 24%, sertraline 25%]
p = .99
Chi-squared test, Yates corrected
. . . with Placebo in the Picture
[Figure: % full responders: hypericum 24%, sertraline 25%, placebo 32%]
Comparative Claims
FDA leery
– …of equivalence claims
– …of superiority claims
FDA does not allow them in labeling (package insert, advertising)
Efficacy advantage
– Underdose competing drug
Safety advantage
– Dose competing drug too high and/or too fast
Transitivity
Am J Psychiatry 163:185-194, February 2006
Consider the Source
RESULTS: Of the 42 reports identified by the authors, 33 were sponsored by a pharmaceutical company. In 90.0% of the studies, the reported overall outcome was in favor of the sponsor’s drug. This pattern resulted in contradictory conclusions across studies when the findings of studies of the same drugs but with different sponsors were compared.
Beware the Comparison to Nothing!
Open-label study - pts know what they are getting
– Voice alteration in VNS trials
Often single-arm w/ no placebo control
Anyone ever seen an open-label study in which pts did not get better compared to baseline?
(How do they get published?)
Single-Blind Studies
A step above open-label in rigor
Investigators know what tx the study pt is getting
Examples:
– Acupuncture studies
– Many device studies (e.g. rTMS)
The Problem with Single-Blind Studies: Clever Hans
Use Lots of Scales
Don’t Put All Your Eggs in One Basket
Observer-based
– MADRS
– CGI
• CGI-I (improvement)
• CGI-S (severity)
– HAMD in all its flavors
• 17-item
• 21-item
• 28-item
• 33-item
Self-report
– BDI (Beck)
– QIDS-SR (STAR*D)
– Quality of life scales
Pros and Cons of Many Scales
The upside of multiple endpoints:
– Internal replication
– Robustness (vs. fragile finding)
The downside
– Increased probability of chance finding
– Multiplicity, aka multiple comparisons
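The downside can be put in numbers. As a sketch, if each endpoint is tested at alpha = .05 and the endpoints are treated as independent (an assumption; scales within one trial are usually correlated, which lowers the figure), the chance of at least one spurious "significant" result grows quickly with the number of endpoints.

```python
def familywise_error(n_endpoints, alpha=0.05):
    """Chance of at least one false positive across independent endpoints."""
    return 1 - (1 - alpha) ** n_endpoints

for k in (1, 5, 10, 14, 20):
    print(f"{k:>2} endpoints: {familywise_error(k):.0%} chance of a spurious 'hit'")
```

Under these assumptions, about 14 endpoints are enough to make a chance finding more likely than not.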
Put Enough Monkeys at Enough Typewriters . . .
…and sooner or later you’ll have the complete works of William Shakespeare
Multiple Subscales
HAMD-33 item, you also get . . .
– 28-item
– 21-item
– 17-item
– 6-item (“core items”)
Anxiety subscale of the HAMD
Depression subscale of the PANSS
But was it in the original protocol?
What Can You Do With All These Scales?
Continuous measure
– Use each score as-is (absolute score)
– Change from baseline
Transform into categorical measure
– Cutoffs: patients either above or below
– Remitters
– Responders
Responders
Just “responders”
– >= 50% decrease from baseline
• Ex. Baseline score 40 -> endpoint score = 20
– < 50% ==> “nonresponder”
• Baseline = 40, endpoint score = 21
Gradations of responders
– Partial responders (25-50% decrease from baseline)
– Full responders (>50% decrease)
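The cutoff rules above can be written out as a short sketch. The function names are made up for illustration, and how the exact boundaries are handled in the graded version (exactly 25% counted as partial, exactly 50% as full) is an assumption, since the slide does not settle it.

```python
def percent_decrease(baseline, endpoint):
    """Percent decrease in score from baseline to endpoint."""
    return 100 * (baseline - endpoint) / baseline

def is_responder(baseline, endpoint):
    """Simple rule from the slide: at least a 50% decrease counts as response."""
    return percent_decrease(baseline, endpoint) >= 50

def response_grade(baseline, endpoint):
    """Graded version; the handling of exact boundary values is an assumption."""
    drop = percent_decrease(baseline, endpoint)
    if drop >= 50:
        return "full responder"
    if drop >= 25:
        return "partial responder"
    return "nonresponder"

print(is_responder(40, 20))    # True: 50% decrease
print(is_responder(40, 21))    # False: 47.5% decrease
print(response_grade(40, 29))  # partial responder: 27.5% decrease
```

A single point on the scale separates the "responder" from the "nonresponder" in the first two calls, which is what dichotomizing a continuous measure costs.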
Remitters
“Remission” usually = absolute score (HAMD < 8)
STAR*D defines remission as 75% decrease from baseline
Advantage - set threshold deemed clinically significant
But % remitters may still differ between groups to an extent that is only statistically significant (remember the “election” slide)
Handling Dropouts
LOCF
– Last observation carried forward
OC
– Observed cases
– aka completers
MMRM
– Mixed model repeated measures
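A small sketch of how the first two approaches can give different answers from the same data. The patient scores are hypothetical, lower scores mean improvement, and `None` marks a visit missed after dropout; MMRM is not shown because it requires fitting a mixed model.

```python
def locf(visits):
    """Fill missing visits (None) with the last observed score."""
    filled, last = [], None
    for score in visits:
        if score is not None:
            last = score
        filled.append(last)
    return filled

def endpoint_mean(patients, method):
    """Mean final-visit score under LOCF or observed cases (OC)."""
    if method == "LOCF":
        finals = [locf(v)[-1] for v in patients]
    elif method == "OC":
        finals = [v[-1] for v in patients if v[-1] is not None]  # completers only
    else:
        raise ValueError("method must be 'LOCF' or 'OC'")
    return sum(finals) / len(finals)

# Hypothetical scores at baseline and three follow-up visits
patients = [
    [30, 24, 18, 12],
    [28, 22, 16, 10],
    [32, 31, None, None],   # dropped out early with little improvement
    [29, 30, 30, None],     # dropped out late with no improvement
]
print(endpoint_mean(patients, "LOCF"))  # 20.75
print(endpoint_mean(patients, "OC"))    # 11.0
```

In this made-up data the patients who were not improving dropped out, so the completers-only analysis looks far better than LOCF. Which method a paper reports is worth checking.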
HARKing
Hypothesizing After the Results are Known
A priori vs. post hoc
How the FDA Guards Against This
FDA gets protocol before study begins
Sponsors can’t “censor” studies that don’t go well
Drugs approved based on all studies
It’s the Protocol, Stupid!
“If the Devil is in the Details, Salvation is in the Protocol”
– Talk by Paul Andreason, FDA
Primary endpoints
– A priori hypothesis
– Where you’re placing your bet
Secondary endpoints
– Exploratory
– If you make it, fine, but don’t make a big deal about it.
– Repeat study, designate it as primary, see if it replicates
Off-Label Use
Drug used for something FDA has not approved it for
(FDA does not regulate prescribing)
Often appropriate to prescribe off-label
– No approved drugs for condition (but why not?)
– You’ve exhausted approved drugs
Ask: why isn’t the drug approved for this condition?
– Could they have submitted and gotten it rejected?
– If they haven’t submitted an application, why not?
How do you Know Whether a Drug is FDA-Approved for the Condition You’re Treating?
Beware of sources that talk about “uses”
– AHFS Drug Information (“The Red Book”)
– Fluoxetine uses: obesity, bipolar d/o, myoclonus, cataplexy, EtOH dependence
Gabapentin has never been approved for any psych indication
Just look in the package insert or PDR
– Indications & Usage section
– More details in Clinical Trials section
The End