Q-Vals (and False Discovery Rates) Made Easy Dennis Shasha Based on the paper "Statistical significance for genomewide studies" by John Storey and Robert Tibshirani PNAS August 5, 2003 9440-9445


Q-Vals (and False Discovery Rates) Made Easy

Dennis Shasha

Based on the paper "Statistical significance for genomewide studies" by John Storey and Robert Tibshirani

PNAS August 5, 2003 9440-9445


Challenge

• You test plants/patients/… in two settings (or from different populations).

• You want to know which genes are differentially expressed (alternate).

• You don’t want to make too many mistakes (declaring a gene to be alternate when in fact it’s null – not differentially expressed).


First Idea

• You take p-vals of the differences in expression.

• P-val(g) is the probability that gene g would show a difference at least as large as the one observed if g were null.

• You choose a cutoff, say 0.05.

• You say all genes that differ with p-val <= 0.05 are truly different.

• What’s the problem?


Thought Experiment

• Suppose that no genes are truly differentially expressed.

• With a cutoff of 0.05, you will still call about 5% of all the genes significant, even though none of them really are.

• Your false discovery rate (number null among those predicted to be alternate/number predicted to be alternate) = 100%.

• Bad.
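The thought experiment can be checked with a small simulation. This is only a sketch: the gene count, the seed, and the function name `simulate_all_null` are made up for illustration, and it relies on the fact that a null gene's p-value is a uniform draw between 0 and 1.

```python
import random

def simulate_all_null(num_genes=10_000, cutoff=0.05, seed=0):
    """Simulate the case where no gene is truly differentially expressed."""
    rng = random.Random(seed)
    # Every gene is null, so each p-value is a uniform draw on [0, 1).
    p_values = [rng.random() for _ in range(num_genes)]
    called = [p for p in p_values if p <= cutoff]
    # All called genes are null, so every call is a false discovery.
    false_discoveries = len(called)
    fdr = false_discoveries / len(called) if called else 0.0
    return len(called), fdr

num_called, fdr = simulate_all_null()
print(f"called significant: {num_called} of 10000")
print(f"false discovery rate among calls: {fdr:.0%}")
```

Roughly one gene in twenty passes the cutoff, and every one of those calls is wrong.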


A Fundamental Insight

• All truly null genes (i.e. not truly differentially expressed) are equally likely to have any p-val.

• That is by construction of the p-val: under the null hypothesis, 1% of the genes will have p-vals in the lowest percentile (0 to 0.01), 1% will fall between the 89th and 90th percentiles (0.89 to 0.90), and so on. A p-val is just a way of stating a gene's percentile under the null condition.
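A quick way to see this uniformity is to simulate null genes and bin their p-values. The sketch below assumes a hypothetical assay in which each null gene's measured differences are pure standard normal noise, and it uses a two-sided z-test; neither choice comes from the slides.

```python
import math
import random
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()

def null_p_value(rng, replicates=5):
    """Two-sided z-test p-value for one null gene (assumed noise: N(0, 1))."""
    diffs = [rng.gauss(0.0, 1.0) for _ in range(replicates)]
    # The sum of n standard normals divided by sqrt(n) is again standard normal.
    z = sum(diffs) / math.sqrt(replicates)
    return 2.0 * (1.0 - STANDARD_NORMAL.cdf(abs(z)))

rng = random.Random(1)
p_values = [null_p_value(rng) for _ in range(20_000)]

# Count how many p-values land in each tenth of the [0, 1] axis.
bins = [0] * 10
for p in p_values:
    bins[min(int(p * 10), 9)] += 1
fractions = [count / len(p_values) for count in bins]

for i, fraction in enumerate(fractions):
    print(f"p in [{i / 10:.1f}, {(i + 1) / 10:.1f}): {fraction:.3f}")
```

Each tenth of the axis should hold close to 10% of the null genes.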


What Do We Do With That?

• Mixture model: imagine null genes as light blue marbles and truly different genes as red ones.

• If the assay is decent, red marbles should be concentrated at the low p-values.


[Figure: marbles (genes) stacked along a p-value axis running from 0 to 1.]


Method We Can Use

• We don’t of course know the colors of the marbles/we don’t know which genes are true alternates.

• However, we know that null marbles are equally likely to have any p-value.

• So, at the p-value where the height of the marbles levels off, we have primarily light blue marbles/null genes.

• Why?


[Figure: marbles along a p-value axis from 0 to 1, annotated "Flat region starts here" and "Level of flat region".]


Answer

• Because if all genes/marbles were null, the heights would be about uniform.

• Provided the reds are concentrated near the low p-vals, the flat regions will be primarily light blues.
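The level of the flat region can be turned into an estimate of the fraction of null genes. The sketch below runs on a simulated mixture; the 80/20 split, the Beta(0.3, 6) shape for the reds, and the choice of 0.5 as the start of the flat region are all assumptions made for illustration.

```python
import random

def simulate_mixture(num_null=8000, num_alt=2000, seed=2):
    """P-values for a made-up mixture of null and truly different genes."""
    rng = random.Random(seed)
    # Light blue marbles: uniform p-values.
    nulls = [rng.random() for _ in range(num_null)]
    # Red marbles: assumed Beta(0.3, 6) shape, which piles up near p = 0.
    alts = [rng.betavariate(0.3, 6.0) for _ in range(num_alt)]
    return nulls + alts

def estimate_null_fraction(p_values, flat_start=0.5):
    """Estimate the share of null genes from the flat region (p > flat_start)."""
    in_flat = sum(1 for p in p_values if p > flat_start)
    # If every gene were null, a fraction (1 - flat_start) would land in the flat region.
    return min(1.0, in_flat / (len(p_values) * (1.0 - flat_start)))

p_values = simulate_mixture()
null_fraction = estimate_null_fraction(p_values)
print(f"estimated fraction of null genes: {null_fraction:.3f} (simulated truth: 0.800)")
```

Because almost no reds reach the flat region, counting marbles there and scaling up recovers the null fraction.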


Example: all null

• Consider the all null case.

• All marbles are light blue.

• False discovery rate in the region to the left of the flat region is (estimated number of light blue marbles there, based on the level of the flat region)/(number of marbles to the left of the flat region).

• This will be close to 100%.


[Figure: all-null case along a p-value axis from 0 to 1, annotated "Flat region starts here" and "Level of flat region".]


Example: all non-null

• Consider the all non-null case.

• All marbles are red and they are highly skewed toward low p-values.

• Flat region is essentially zero.

• False discovery rate in the region to the left of the flat region is (estimated number of light blue marbles there, based on the level of the flat region)/(number of marbles to the left of the flat region).

• This will be close to 0.


[Figure: all non-null case along a p-value axis from 0 to 1, annotated "Flat region starts here".]


Example: mixed case

• Get a distribution of p-values.

• Find flat region.

• Estimate number of nulls in the left-of-flat region by extending the flat line.

• This gives the false discovery rate.
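The four steps above can be sketched on simulated data, where the true colors are known and the estimate can be checked. The mixture, the threshold of 0.05, and the flat-region start of 0.5 are illustrative assumptions, and `estimated_fdr` is a hypothetical helper name.

```python
import random

def simulate_mixture(num_null=8000, num_alt=2000, seed=3):
    """Return (p_value, is_null) pairs for a made-up mixture of genes."""
    rng = random.Random(seed)
    genes = [(rng.random(), True) for _ in range(num_null)]
    # Assumed shape for truly different genes: Beta(0.3, 6), concentrated near 0.
    genes += [(rng.betavariate(0.3, 6.0), False) for _ in range(num_alt)]
    return genes

def estimated_fdr(p_values, threshold, flat_start=0.5):
    """Extend the flat line to the left of the threshold to count expected nulls."""
    m = len(p_values)
    in_flat = sum(1 for p in p_values if p > flat_start)
    null_fraction = min(1.0, in_flat / (m * (1.0 - flat_start)))
    # Nulls are uniform, so a share `threshold` of them falls left of the threshold.
    expected_nulls_called = null_fraction * m * threshold
    num_called = sum(1 for p in p_values if p <= threshold)
    return min(1.0, expected_nulls_called / num_called) if num_called else 0.0

genes = simulate_mixture()
p_values = [p for p, _ in genes]
threshold = 0.05

called = [(p, is_null) for p, is_null in genes if p <= threshold]
true_fdr = sum(1 for _, is_null in called if is_null) / len(called)
est_fdr = estimated_fdr(p_values, threshold)
print(f"estimated FDR: {est_fdr:.3f}, true FDR in this simulation: {true_fdr:.3f}")
```

In real data the true rate is unknown; the simulation only shows that the flat-line estimate tracks it when the reds stay out of the flat region.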


[Figure: histogram of the number of genes having each p-value, with the p-value axis running from 0 to 1. Annotations: "Flat line; base level of nulls" and "Possible p-value threshold".]


Example: mixed case

• What would you estimate the false discovery rate to be in the case that we declare the entire area to the left of the possible p-value threshold to be significant?

• 10%, 25%, 50%?


[Figure: histogram of the number of genes having each p-value, with the p-value axis running from 0 to 1. Annotations: "Flat line; base level of nulls" and "Possible p-value threshold".]


Obtaining q-values from False Discovery Rate

• Suppose we order genes from least p-value to greatest.

• That corresponds to moving from left to right along the p-value axis of one of these graphs.

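• The q-value of a gene having p-value p is the estimated False Discovery Rate if the declared significance region had a threshold of p. (Storey and Tibshirani additionally take the smallest such rate over all thresholds at or above p, so q-values never decrease as p-values grow.)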
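A minimal sketch of the q-value calculation follows. It uses the flat-region estimate of the null fraction with an assumed flat-region start of 0.5, and it applies the running minimum from the Storey and Tibshirani paper so that q-values do not decrease as p-values grow. The function name `q_values` and the toy p-values are made up for illustration.

```python
def q_values(p_values, flat_start=0.5):
    """Return one q-value per gene, in the same order as the input p-values."""
    m = len(p_values)
    in_flat = sum(1 for p in p_values if p > flat_start)
    null_fraction = min(1.0, in_flat / (m * (1.0 - flat_start)))

    # Visit genes from the largest p-value to the smallest.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    result = [0.0] * m
    running_min = 1.0
    for position, gene in enumerate(order):
        rank = m - position  # number of genes with p-value <= this one
        # Estimated FDR if this gene's p-value were the significance threshold.
        fdr_here = null_fraction * m * p_values[gene] / rank
        running_min = min(running_min, fdr_here)
        result[gene] = running_min
    return result

toy_p_values = [0.6, 0.01, 0.8, 0.03, 0.02]
toy_q_values = q_values(toy_p_values)
for p, q in sorted(zip(toy_p_values, toy_q_values)):
    print(f"p = {p:.2f}  q = {q:.2f}")
```

Calling every gene with q-value at most some target, say 0.05, then gives a list whose estimated False Discovery Rate is at most that target.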


[Figure: histogram of the number of genes having each p-value, with the p-value axis running from 0 to 1. Annotations: "Flat line; base level of nulls" and "Q-value of a gene having this p-val is the FDR if this is the significance threshold."]


Lessons for Research

• Mushy p-values (large error bars/few replicates) may force us to the far left in order to get a low False Discovery Rate.

• This may eliminate genes of interest.

• If testing out a gene is not too expensive, then we can accept a higher False Discovery Rate – nothing magical about 0.01.
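The cost of mushy p-values can be illustrated by simulating the same genes with few and with many replicates and counting how many survive a fixed False Discovery Rate. Everything here is an assumption made for the sketch: the z-test, the effect size of 1 standard deviation, the 20% share of truly different genes, and the helper names.

```python
import math
import random
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()

def simulate_p_values(replicates, rng, num_genes=5000, alt_fraction=0.2, effect=1.0):
    """Two-sided z-test p-values; the first alt_fraction of genes truly differ."""
    p_values = []
    for gene in range(num_genes):
        shift = effect if gene < alt_fraction * num_genes else 0.0
        diffs = [rng.gauss(shift, 1.0) for _ in range(replicates)]
        z = sum(diffs) / math.sqrt(replicates)
        p_values.append(2.0 * (1.0 - STANDARD_NORMAL.cdf(abs(z))))
    return p_values

def count_at_fdr(p_values, fdr_target=0.05, flat_start=0.5):
    """Largest number of genes we can call while the estimated FDR stays <= target."""
    m = len(p_values)
    in_flat = sum(1 for p in p_values if p > flat_start)
    null_fraction = min(1.0, in_flat / (m * (1.0 - flat_start)))
    best = 0
    for rank, p in enumerate(sorted(p_values), start=1):
        if null_fraction * m * p / rank <= fdr_target:
            best = rank
    return best

few = count_at_fdr(simulate_p_values(2, random.Random(5)))
many = count_at_fdr(simulate_p_values(8, random.Random(5)))
print(f"genes called at FDR 0.05 with 2 replicates: {few}")
print(f"genes called at FDR 0.05 with 8 replicates: {many}")
```

With few replicates the threshold is pushed far to the left and most truly different genes are lost; raising `fdr_target` recovers some of them at the price of more false leads.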


[Figure: histogram of the number of genes having each p-value, with the p-value axis running from 0 to 1. Annotations: "Flat line; base level of nulls" and "Better p-values avoid loss of genes, for small False Discovery Rate."]