Multiple testing adjustments
European Molecular Biology Laboratory
Predoc Bioinformatics Course
17th Nov 2009
Tim Massingham, [email protected]
Motivation
We have already come across several cases where we need to correct p-values.

        Exp 1   Exp 2   Exp 3   Exp 4   Exp 5   Exp 6
Exp 1           0.027   0.033   0.409   0.330   0.784
Exp 2                   0.117   0.841   0.985   0.004
Exp 3                           0.869   0.927   0.001
Exp 4                                   0.245   0.021
Exp 5                                           0.004
Exp 6

Pairwise gene expression data
What happens if we perform several vaccine trials?
10 new vaccines are trialled.
Declare a vaccine a success if its test has a p-value of less than 0.05.
If none of the vaccines work, what is our chance of a "success"?
Each trial has probability 0.05 of "success" (false positive).
Each trial has probability 0.95 of "failure" (true negative).
Assuming the trials are independent:
Probability of at least one "success" = 1 - Probability of none
= 1 - (Probability a trial is unsuccessful)^10
= 1 - 0.95^10
≈ 0.4
Rule of thumb
Multiply the size of the test by the number of tests (here 10 × 0.05 = 0.5, an upper bound on the 0.4 above).
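The vaccine calculation extends to any number of tests. The following is a minimal sketch in R, assuming independent tests with every null hypothesis true; the function name `prob_any_false_positive` is illustrative and not from the slides. It sets the exact probability beside the rule of thumb.

```r
# Chance of at least one false positive among n independent tests,
# each performed at the given size (all null hypotheses true).
prob_any_false_positive <- function(n, size = 0.05) {
  1 - (1 - size)^n
}

n <- c(1, 5, 10, 20)
exact <- prob_any_false_positive(n)
rule_of_thumb <- pmin(1, n * 0.05)  # number of tests times size, capped at 1

print(round(data.frame(n, exact, rule_of_thumb), 3))
```

The rule of thumb is never smaller than the exact value, so it errs on the side of caution.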
More extreme example: test the entire population for a disease.
Mixture: some of the population have the disease, some don't.
Find the individuals with the disease.

                 Test report
True status      Healthy           Diseased
Healthy          True negative     False positive
Diseased         False negative    True positive

Family-Wise Error Rate (FWER)
Control the probability that any false positive occurs.
False Discovery Rate (FDR)
Control the proportion of false positives discovered.

\[ \mathrm{FDR} = \frac{\#\,\text{false positives}}{\#\,\text{positives}} = \frac{\#\,\text{false positives}}{\#\,\text{true positives} + \#\,\text{false positives}} \]
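As a small illustration of the FDR definition, here is a sketch in R on a made-up population of eight individuals. The vectors `diseased` and `reported` and the function name `false_discovery_rate` are hypothetical, chosen only for this example.

```r
# FDR = false positives / (true positives + false positives)
false_discovery_rate <- function(diseased, reported) {
  false_pos <- sum(reported & !diseased)
  true_pos  <- sum(reported & diseased)
  false_pos / (true_pos + false_pos)
}

# Hypothetical data: TRUE means diseased / reported as diseased.
diseased <- c(TRUE, TRUE, TRUE,  FALSE, FALSE, FALSE, FALSE, FALSE)
reported <- c(TRUE, TRUE, FALSE, TRUE,  FALSE, FALSE, FALSE, FALSE)

# Three positives are reported, one of which is false.
fdr <- false_discovery_rate(diseased, reported)
print(fdr)
```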
Cumulative distribution
Simple examination by eye: the cumulative distribution should be approximately linear.
• Rank the data
• Plot rank against p-value
[Figure: rank (1 to n) plotted against p-value (0 to 1). The curve starts at (0,1), ends at (1,n) and never decreases.]
N.B. Ranks are often scaled to (0,1] by dividing by the largest rank.
[Figure: five sets of uniformly distributed p-values.]
[Figure: non-uniformly distributed data, with an excess of extreme (small) p-values.]
The examples are for 910 p-values.
Could use a one-sided Kolmogorov test if desired.
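The plot and test described here can be sketched in R as follows. The p-values are simulated (the seed is arbitrary), and the choice of `alternative = "greater"` assumes that the departure of interest is an excess of small p-values.

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
uniform_p <- runif(910)           # simulated p-values, uniform under the null
skewed_p  <- runif(910)^3         # simulated excess of small p-values

# Rank the data, scale ranks to (0,1], and plot rank against p-value.
scaled_rank <- rank(uniform_p) / length(uniform_p)
plot(uniform_p, scaled_rank, xlab = "P-value", ylab = "Rank / n")
abline(0, 1, lty = 2)             # straight line expected for uniform p-values

# One-sided Kolmogorov test against the uniform distribution.
# alternative = "greater" looks for an empirical CDF lying above the
# uniform CDF, which is what an excess of small p-values produces.
ks_uniform <- ks.test(uniform_p, "punif", alternative = "greater")
ks_skewed  <- ks.test(skewed_p,  "punif", alternative = "greater")
print(c(uniform = ks_uniform$p.value, skewed = ks_skewed$p.value))
```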
A little set theory
Represent all possible outcomes of three tests in a Venn diagram.
Areas are probabilities of events happening.
[Figure: Venn diagram of three overlapping circles labelled "Test 1 false positive", "Test 2 false positive" and "Test 3 false positive". The region outside all circles is "No test gives false positive"; the region common to all three is "All tests give false positive".]
The area covered by the circles together is at most the sum of their separate areas:
\[ P(\text{any test gives a false positive}) \le \sum_i P(\text{test } i \text{ gives a false positive}) \]
Bonferroni adjustment
\[ P(\text{any test gives a false positive}) \le \sum_i P(\text{test } i \text{ gives a false positive}) \]
We want to control the left-hand side, and we know how to control each term on the right-hand side (the size of each test).
Keep things simple: do all tests at the same size.
If we have n tests, each at size α/n, then
\[ P(\text{any test gives a false positive}) \le n \times \frac{\alpha}{n} = \alpha \]
Family-Wise Error Rate
Bonferroni adjustment (correction)
For a FWER of no more than α, perform all tests at size α/n.
Equivalently: multiply the p-values of all tests by n (maximum of 1) to give adjusted p-values.
Example 1
Look at deviations from Chargaff's 2nd parity rule.
A and T content of genomes for 910 bugs.
Many show significant deviations.

First 9 p-values
3.581291e-66 3.072432e-12 1.675474e-01 6.687600e-01 1.272040e-05 1.493775e-23 2.009552e-26 1.024890e-14 1.519195e-24

Unadjusted p-values
p-value < 0.05    764
p-value < 0.01    717
p-value < 1e-5    559

Bonferroni adjusted p-values
p-value < 0.05    582
p-value < 0.01    560
p-value < 1e-5    461

First 9 adjusted p-values
3.258975e-63 2.795913e-09 1.000000e+00 1.000000e+00 1.157556e-02 1.359335e-20 1.828692e-23 9.326496e-12 1.382467e-21
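The first nine adjusted values above can be checked with a short R sketch. It multiplies the nine quoted p-values by the full number of tests (910) by hand and compares the result with `p.adjust`, whose `n` argument lets the number of comparisons exceed the number of p-values supplied. The variable names are illustrative.

```r
# First nine of the 910 Chargaff p-values quoted above.
pvalues <- c(3.581291e-66, 3.072432e-12, 1.675474e-01, 6.687600e-01,
             1.272040e-05, 1.493775e-23, 2.009552e-26, 1.024890e-14,
             1.519195e-24)
n_tests <- 910

# Bonferroni by hand: multiply by the number of tests, capped at 1.
by_hand <- pmin(1, pvalues * n_tests)

# The same adjustment using the built-in function.
built_in <- p.adjust(pvalues, method = "bonferroni", n = n_tests)

print(signif(by_hand, 7))
```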
Aside: p-values measure evidence
Have we shown that many bugs deviate substantially from Chargaff's 2nd rule? No: the p-values tell us only that there is significant evidence for a deviation, not that the deviation is large.
[Figure: box plot marking the lower quantile, median and upper quantile.]
Lots of bases and so ability to detect small deviations from 50%: a powerful test.

1st Qu.   Median   3rd Qu.
0.4989    0.4999   0.5012
Bonferroni is conservative
Conservative: the actual size of the test is less than the bound
\[ P(\text{any test gives a false positive}) \le n \times \frac{\alpha}{n} = \alpha \]
Not too bad for independent tests.
Worst when tests are positively correlated:
• Applying same test to subsets of data
• Applying similar tests to same data
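To put numbers on how conservative the bound is, the sketch below compares two idealised cases after Bonferroni adjustment: fully independent tests, and the extreme of positive correlation where the same test is simply repeated n times. Both formulas assume every null hypothesis is true; the variable names are illustrative.

```r
alpha <- 0.05
n <- c(2, 10, 100, 910)

# Independent tests, each done at size alpha / n: the chance of at
# least one false positive is 1 - (1 - alpha / n)^n.
fwer_independent <- 1 - (1 - alpha / n)^n

# The same test repeated n times: all tests agree, so the chance of
# any false positive is just the size of a single test, alpha / n.
fwer_identical <- alpha / n

print(data.frame(n, fwer_independent, fwer_identical))
```

For independent tests the actual family-wise error rate stays close to α, whereas for repeated tests it falls to α/n.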
More subtle problem
Mixture of blue and red circles.
Null hypothesis: is blue.
Red circles are never false positives.
If an experiment really is different from the null, then
\[ P(\text{test gives false positive}) = 0 \]
so its term contributes nothing to the Bonferroni sum, and the p-values are over-adjusted.
The number of potential false positives may be less than the number of tests.
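Holm's method
Holm (1979) suggests repeatedly applying Bonferroni.
Step 1: the initial Bonferroni adjustment splits the tests into an insignificant set and a significant set.
No false positive? We have been overly strict, so apply Bonferroni only to the insignificant set.
False positive? More won't hurt, so we may as well test again.
Step 2: apply Bonferroni to the insignificant set alone and move any newly significant tests to the significant set.
Step 3: repeat on the remaining insignificant set.
Stop when the "insignificant" set does not shrink further.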
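The repeated-Bonferroni description can be sketched directly in R. The function `holm_reject` and the five p-values are hypothetical, chosen for illustration. The result is compared with the built-in `p.adjust(..., method = "holm")`, which reaches the same decisions through adjusted p-values.

```r
# Repeatedly apply Bonferroni to the tests that are still insignificant.
holm_reject <- function(p, alpha = 0.05) {
  significant <- rep(FALSE, length(p))
  repeat {
    n_left <- sum(!significant)
    if (n_left == 0) break
    # Bonferroni on the insignificant set only.
    newly <- !significant & p <= alpha / n_left
    if (!any(newly)) break          # the insignificant set did not shrink
    significant <- significant | newly
  }
  significant
}

pvalues <- c(0.001, 0.012, 0.02, 0.03, 0.5)   # hypothetical p-values

bonferroni_only <- pvalues <= 0.05 / length(pvalues)
iterated <- holm_reject(pvalues)
built_in <- p.adjust(pvalues, method = "holm") <= 0.05

print(rbind(bonferroni_only, iterated, built_in))
```

On this example the second pass picks up one test that the single Bonferroni pass misses.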
Example 2
Return to the Chargaff data.
910 bugs, but more than half are significantly different after adjustment.
There is strong evidence that we've over-corrected.

Bonferroni adjusted p-values
p-value < 0.05    582
p-value < 0.01    560
p-value < 1e-5    461

First 9 Bonferroni adjusted p-values
3.258975e-63 2.795913e-09 1.000000e+00 1.000000e+00 1.157556e-02 1.359335e-20 1.828692e-23 9.326496e-12 1.382467e-21

Holm adjusted p-values (gain over Bonferroni in brackets)
p-value < 0.05    606 (+24)
p-value < 0.01    574 (+14)
p-value < 1e-5    472 (+11)

First 9 Holm adjusted p-values
2.915171e-63 1.591520e-09 1.000000e+00 1.000000e+00 4.452139e-03 9.903730e-21 1.390610e-23 5.623765e-12 1.019380e-21

Gained a couple of percent more, but notice that the gains tail off.
Hochberg's method
Consider a pathological case: apply the same test to the same data multiple times.

    # Ten identical p-values
    pvalues <- rep(0.01, 10)
    # None are significant with Bonferroni
    p.adjust(pvalues, method = "bonferroni")
    # 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
    # None are significant with Holm
    p.adjust(pvalues, method = "holm")
    # 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
    # Hochberg recovers correctly adjusted p-values
    p.adjust(pvalues, method = "hochberg")
    # 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01

First 9 Hochberg adjusted p-values
2.915171e-63 1.591520e-09 9.972469e-01 9.972469e-01 4.452139e-03 9.903730e-21 1.390610e-23 5.623765e-12 1.019380e-21

Hochberg adjusted p-values
p-value < 0.05    606
p-value < 0.01    574
p-value < 1e-5    472

For the Chargaff data, the Hochberg adjustment gives the same counts as Holm, but it requires additional assumptions.
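For readers who want to see what the adjustment does, here is a sketch of a step-up calculation in R. It works from the largest p-value downwards and takes a running minimum, and it is checked against `p.adjust(..., method = "hochberg")`. The function name `hochberg_adjust` and the second test vector are illustrative only.

```r
hochberg_adjust <- function(p) {
  n <- length(p)
  o <- order(p, decreasing = TRUE)   # largest p-value first
  multiplier <- seq_len(n)           # 1 for the largest, n for the smallest
  adjusted <- pmin(1, cummin(multiplier * p[o]))
  adjusted[order(o)]                 # return in the original order
}

identical_p <- rep(0.01, 10)                   # the pathological case
mixed_p <- c(0.001, 0.012, 0.02, 0.03, 0.5)    # hypothetical p-values

print(hochberg_adjust(identical_p))
print(hochberg_adjust(mixed_p))
```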
False Discovery Rates
New methods, dating back to 1995.
Gaining popularity in the literature but mainly used for large data sets.
Useful for enriching data sets for further analysis.

Recap
FWER: control the probability of any false positive occurring.
FDR: control the proportion of false positives that occur.

The "q-value" is the proportion of significant tests expected to be false positives.
q-value times number significant = expected number of false positives

Methods
Benjamini & Hochberg (1995)
Benjamini & Yekutieli (2001)
Storey (2002, 2003), aka "positive false discovery rate"
Example 3
Returning once more to the Chargaff data.

First 9 FDR q-values
3.359768e-65 7.114283e-12 1.891664e-01 6.931340e-01 2.063380e-05 5.481191e-23 8.350193e-26 2.569283e-14 5.760281e-24

FDR q-values
q-value < 0.05    759
q-value < 0.01    713
q-value < 1e-5    547

Q-values have a different interpretation from p-values.
Use q-values to get the expected number of false positives:
q-value = 0.05: expect 38 false positives (759 × 0.05)
q-value = 0.01: expect 7 false positives (713 × 0.01)
q-value = 1e-5: expect about 1/200 of a false positive (547 × 1e-5)
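A sketch of how such q-values might be computed in R follows, using the Benjamini & Hochberg (1995) method listed among the methods; the slides do not say which method produced the values quoted, so this choice is an assumption. The five p-values are hypothetical, because the quoted q-values depend on all 910 tests and cannot be reproduced from the nine shown. The last lines repeat the expected-false-positive arithmetic from the counts above.

```r
# Benjamini & Hochberg q-values by hand: scale each sorted p-value by
# n / rank, then take a running minimum from the largest downwards.
bh_adjust <- function(p) {
  n <- length(p)
  o <- order(p, decreasing = TRUE)   # largest p-value first
  rank_desc <- n:1                   # rank of each sorted p-value
  q <- pmin(1, cummin(n / rank_desc * p[o]))
  q[order(o)]                        # return in the original order
}

pvalues <- c(0.001, 0.012, 0.02, 0.03, 0.5)   # hypothetical p-values
qvalues <- bh_adjust(pvalues)
print(qvalues)

# Expected false positives = q-value threshold times number significant.
threshold     <- c(0.05, 0.01, 1e-5)
n_significant <- c(759, 713, 547)
expected_false_positives <- threshold * n_significant
print(expected_false_positives)
```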
Summary
Holm is always at least as powerful as Bonferroni.
Hochberg can be better but has additional assumptions.
FDR is a more powerful approach: it finds more things significant.
• controls a different criterion
• more useful for exploratory analyses than publications

A little question
Suppose results are published if the p-value is less than 0.01. What proportion of the scientific literature is wrong?