Bioinformatics, statistics and multiple testing
Alan Moses
ML4bio
With slides from Quaid Morris
Outline for Today
• Bioinformatics
– GO and other annotations
– The annoying thing about bioinformatics
• Review of hypothesis testing
– Parametric vs. non-parametric tests
– Exact tests
– Multivariate hypothesis testing
• Multiple hypothesis testing
– Bonferroni, FDR
– Application to gene set enrichment analysis
ConDens kinase substrate prediction
• Andy Lai was an MSc student in my lab who developed a cool new way to predict kinase substrates based on amino acid sequence alignments.
• He predicted new lists of substrates for some kinases, and wanted to show that the predictions were good, without doing any experiments.
• Gene Set Enrichment Analysis is the answer
List of predicted Cbk1 targets in yeast: BOI1, SEC3, MPT5, SSD1, DSF2, FIR1, YNL058C, KIN1, YGR117C, KIN2, IRC8, YJL016W, ACE2, RGA2
List of predicted Cbk1 targets in Drosophila: CG8617, Oatp30B, CG9467, ecpan
Where Do Gene Lists Come From?
• Molecular profiling e.g. mRNA, protein
– Identification → Gene list
– Quantification → Gene list + values
– Ranking, Clustering (biostatistics)
• Interactions: Protein interactions, microRNA targets, transcription factor binding sites (ChIP)
• Genetic screen e.g. of knock out library
• Association studies (Genome-wide)
– Single nucleotide polymorphisms (SNPs)
– Copy number variants (CNVs)
Quaid Morris
What is the Gene Ontology (GO)?
• Set of biological phrases (terms) which are applied to genes:
– protein kinase
– apoptosis
– membrane
• Dictionary: term definitions
• Ontology: A formal system for describing knowledge
www.geneontology.org
Jane Lomax @ EBI
GO Structure
• Terms are related within a hierarchy
– is-a
– part-of
• Describes multiple levels of detail of gene function
• Terms can have more than one parent or child
What GO Covers?
• GO terms divided into three aspects:
– cellular component
– molecular function
– biological process (important pathway source)
Example terms shown: glucose-6-phosphate isomerase activity; Cell division
Terms
• Where do GO terms come from?
– GO terms are added by editors at EBI and gene annotation database groups
– Terms added by request
– Experts help with major development
– 32029 terms, >99% with definitions.
• 19639 biological_process
• 2859 cellular_component
• 9531 molecular_function
• As of July 15, 2010
Annotations
• Genes are linked, or associated, with GO terms by trained curators at genome databases
– Known as ‘gene associations’ or GO annotations
– Multiple annotations per gene
• Some GO annotations created automatically (without human review)
Annotation Sources
• Manual annotation
– Curated by scientists
• High quality
• Small number (time-consuming to create)
– Reviewed computational analysis
• Electronic annotation
– Annotation derived without human validation
• Computational predictions (accuracy varies)
• Lower ‘quality’ than manual codes
• Key point: be aware of annotation origin
Evidence Types
• Experimental Evidence Codes
– EXP: Inferred from Experiment
– IDA: Inferred from Direct Assay
– IPI: Inferred from Physical Interaction
– IMP: Inferred from Mutant Phenotype
– IGI: Inferred from Genetic Interaction
– IEP: Inferred from Expression Pattern
• Author Statement Evidence Codes
– TAS: Traceable Author Statement
– NAS: Non-traceable Author Statement
• Curator Statement Evidence Codes
– IC: Inferred by Curator
– ND: No biological Data available
• Computational Analysis Evidence Codes
– ISS: Inferred from Sequence or Structural Similarity
– ISO: Inferred from Sequence Orthology
– ISA: Inferred from Sequence Alignment
– ISM: Inferred from Sequence Model
– IGC: Inferred from Genomic Context
– RCA: Inferred from Reviewed Computational Analysis
• IEA: Inferred from Electronic Annotation
See http://www.geneontology.org
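Since the key point is to be aware of annotation origin, one practical step is to group annotations by evidence code and set aside the unreviewed electronic ones. The following is a minimal sketch: the code groups follow the slide, while the function names and the example (gene, GO term, evidence code) triples are placeholders invented for illustration, not real annotations.

```python
# Evidence-code groups as listed on the Evidence Types slide.
EVIDENCE_GROUPS = {
    "experimental": {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"},
    "author_statement": {"TAS", "NAS"},
    "curator_statement": {"IC", "ND"},
    "computational": {"ISS", "ISO", "ISA", "ISM", "IGC", "RCA"},
    "electronic": {"IEA"},
}

def evidence_group(code):
    """Return the group name for an evidence code, or 'unknown'."""
    for group, codes in EVIDENCE_GROUPS.items():
        if code in codes:
            return group
    return "unknown"

def drop_electronic(annotations):
    """Keep only annotations that are not electronic (IEA)."""
    return [a for a in annotations if evidence_group(a[2]) != "electronic"]

# Hypothetical (gene, GO term, evidence code) triples, for illustration only.
example_annotations = [
    ("GENE_A", "GO:0000001", "IDA"),
    ("GENE_A", "GO:0000002", "IEA"),
    ("GENE_B", "GO:0000002", "TAS"),
]
kept = drop_electronic(example_annotations)
```

Whether to drop IEA annotations is a judgment call: they give much wider coverage, at lower reliability.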
Wide & Variable Species Coverage
Lomax J. Get ready to GO! A biologist's guide to the Gene Ontology. Brief Bioinform. 2005 Sep;6(3):298-304.
Accessing GO: QuickGO
http://www.ebi.ac.uk/ego/
See also AmiGO: http://amigo.geneontology.org/cgi-bin/amigo/go.cgi
Biomart 0.7
Ensembl BioMart
• Convenient access to gene list annotation
Select genome
Select filters
Select attributes to download
Sources of Gene Attributes
• Ensembl BioMart (eukaryotes)
– http://www.ensembl.org
• Entrez Gene (general)
– http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene
• Model organism databases
– E.g. SGD: http://www.yeastgenome.org/
• Also available through R
Why is it all such a mess?
• Naming of molecules was done by whoever found it first.
o Proteins and Genes do not always have consistent names.
o More important genes that were studied by many groups have many names. Competing research groups may purposefully omit the name(s) used by other groups
• Database identifiers (IDs) are unique, stable names or numbers that help track database records, but…
o Each database will typically use its own internal IDs and naming conventions
o The more important a gene/protein is, the more databases will have information for it, so it will have many IDs
o Databases are frequently updated, so we always have to keep track of the database version that was used
• Records for: Gene, DNA, RNA, Protein
o Important to recognize the correct record type
o Different data sources pertain to different data types (e.g., Pfam only has proteins)
o The relationship between Genes, DNA, RNA and Proteins is not 1 to 1
Common Identifiers

Gene
– Ensembl: ENSG00000139618
– Entrez Gene: 675
– Unigene: Hs.34012

RNA transcript
– GenBank: BC026160.1
– RefSeq: NM_000059
– Ensembl: ENST00000380152

Protein
– Ensembl: ENSP00000369497
– RefSeq: NP_000050.2
– UniProt: BRCA2_HUMAN or A1YBP1_HUMAN
– IPI: IPI00412408.1
– EMBL: AF309413
– PDB: 1MIU

Species-specific
– HUGO HGNC: BRCA2
– MGI: MGI:109337
– RGD: 2219
– ZFIN: ZDB-GENE-060510-3
– FlyBase: CG9097
– WormBase: WBGene00002299 or ZK1067.1
– SGD: S000002187 or YDL029W

Annotations
– InterPro: IPR015252
– OMIM: 600185
– Pfam: PF09104
– Gene Ontology: GO:0000724
– SNPs: rs28897757

Experimental Platform
– Affymetrix: 208368_3p_s_at
– Agilent: A_23_P99452
– CodeLink: GE60169
– Illumina: GI_4502450-S

Red = Recommended
ID Mapping Services
• Synergizer
– http://llama.med.harvard.edu/synergizer/translate/
• Ensembl BioMart
– http://www.ensembl.org
• PICR (proteins only)
– http://www.ebi.ac.uk/Tools/picr/
• R language annotation databases
– http://www.bioconductor.org
ID Mapping Challenges
• Avoid errors: map IDs correctly
• Gene name ambiguity – not a good ID
– e.g. FLJ92943, LFS1, TRP53, p53
– Better to use the standard gene symbol: TP53
• Excel error-introduction
– OCT4 is changed to October-4
• Problems reaching 100% coverage
– E.g. due to version issues
– Use multiple sources to increase coverage
Zeeberg BR et al. Mistaken identifiers: gene name errors can be introduced inadvertently when using Excel in bioinformatics. BMC Bioinformatics. 2004 Jun 23;5:80
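To make the mapping problem concrete, here is a small sketch that maps gene aliases to a standard symbol and reports what could not be mapped, so coverage problems are visible. The alias table only contains the TP53 aliases named on the slide; in practice the table would come from one of the mapping services, which this sketch does not call. The function name is an assumption for illustration.

```python
# Alias table built from the TP53 aliases on the slide. A real table would be
# downloaded from an ID mapping service; that step is not modelled here.
ALIAS_TO_SYMBOL = {
    "FLJ92943": "TP53",
    "LFS1": "TP53",
    "TRP53": "TP53",
    "P53": "TP53",
    "TP53": "TP53",
}

def map_to_standard(names, alias_table=ALIAS_TO_SYMBOL):
    """Return (mapped symbols, unmapped input names)."""
    mapped, unmapped = [], []
    for name in names:
        symbol = alias_table.get(name.strip().upper())
        if symbol is None:
            unmapped.append(name)  # keep the original text for inspection
        else:
            mapped.append(symbol)
    return mapped, unmapped

# "October-4" stands in for a symbol that a spreadsheet turned into a date.
queries = ["p53", "LFS1", "October-4"]
mapped, unmapped = map_to_standard(queries)
coverage = len(mapped) / len(queries)
```

Reporting the unmapped names, rather than silently dropping them, is what lets you notice date-mangled symbols and version mismatches.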
Summary so far
• GO (and other functional annotations) are a great way to tell us about the functions of a list of genes
• In order to use these, we need to compare our gene list to what’s in the GO database…
– Genes and their products and attributes have many identifiers (IDs)
– Bioinformatics often means converting or mapping IDs from one type to another
– ID mapping services are available
– Use standard, commonly used IDs to reduce ID mapping challenges
Outline for Today
• Bioinformatics
– GO and other annotations
– The annoying thing about bioinformatics
• Review of hypothesis testing
– Parametric vs. non-parametric tests
– Exact tests
– Multivariate hypothesis testing
• Multiple hypothesis testing
– Bonferroni, FDR
– Application to gene set enrichment analysis
What is a P-value?
• A) The probability that the null hypothesis is true
• B) Probability of a test statistic under the null distribution
• C) Probability of an incorrect rejection of the null hypothesis
• D) Some subset of the above
Modified from Quaid Morris
None of these!
What is a P-value?
• Probability of observing something as extreme or more under the null hypothesis
What is this thing?
• Usually it’s a “test statistic” but it can be any summary of the data…
• Always a sum or integral over the “tail” or “tails” of a distribution.
Hypothesis testing
• Random variables:
– H: H0 (null hypothesis) or H1 (alternative hypothesis)
– Data: X1, X2, … XN (independent and identically distributed – IID)
– t is a test statistic, t = f(X)
– t* observed value of test statistic
• Parameters:
– α: significance level
– Reject H0 if P-value < α
• P-value is:
– Pr[ t is “as or more extreme” than t* | H0 is true ]
P-value versus false rejections
• P-value is:
– Pr[ t is “as or more extreme” than t* | H0 is true ]
• False rejection probability:
– Pr[ H0 is true | H0 is rejected ]
– aka “False discovery rate”
P-value facts
• Note that: Pr[P-value < p | H0 is true] = p
• So under the null distribution, the P-value is a random variable that is uniformly distributed between 0 and 1.
• Given different tests with P-values p1, p2, …, pN you can combine them into a single P-value: “Fisher’s method”
• Fisher figured out that the test statistic $X^2 = -2 \sum_i \ln p_i$ is chi-square with 2N degrees of freedom if the p’s are independent and uniform on [0,1]
• Sometimes called “meta analysis” because you can combine the results of many analyses this way
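Fisher’s method can be sketched in a few lines. The version below is a minimal illustration that assumes the P-values are independent; it uses the fact that a chi-square variable with an even number of degrees of freedom (2N) has a closed-form upper tail, so no statistics library is needed. The function name is made up for this example.

```python
import math

def fisher_combined_pvalue(p_values):
    """Combine independent P-values with Fisher's method.

    Returns (X2, combined P-value), where X2 = -2 * sum(ln p_i) is compared
    to a chi-square distribution with 2N degrees of freedom.
    """
    n = len(p_values)
    x2 = -2.0 * sum(math.log(p) for p in p_values)
    half = x2 / 2.0
    # Upper tail of chi-square with 2N d.o.f.:
    #   exp(-x/2) * sum_{k=0}^{N-1} (x/2)^k / k!
    term, total = 1.0, 1.0
    for k in range(1, n):
        term *= half / k
        total += term
    return x2, math.exp(-half) * total

x2, combined = fisher_combined_pvalue([0.05, 0.05])
```

Two tests that are each only borderline at 0.05 combine to a smaller P-value, which is the sense in which the method pools evidence across analyses.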
E.g., 2-sample tests
• We make several observations under two situations, and we want to find out whether there is a statistical difference.
• Which genes have differential expression in the different tumor types?
[Figure: expression heatmap with genes as rows and samples as columns]
Sorlie T, Tibshirani R, Parker J, Hastie T, Marron JS, Nobel A, Deng S, Johnsen H, Pesich R, Geisler S, Demeter J, Perou CM, Lønning PE, Brown PO, Børresen-Dale AL, Botstein D. Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc Natl Acad Sci U S A. 2003 Jul 8;100(14):8418-23.
2 sample test
[Figure: gene expression levels for a single gene in two groups of samples (Normal Breast-like and Basal subtype), shown next to the distribution of gene expression levels (probability density vs. gene expression)]
Question: How likely is it that the difference between the two samples is due to chance?
2 sample t-test
Summarize the data with the so-called “t-statistic”
Normal Breast-like: N1 = 10, Mean: m1 = 5.6, Std: s1 = 1.6
Basal subtype: N2 = 13, Mean: m2 = 0.3, Std: s2 = 1.0
$$T\text{-statistic} = \frac{m_1 - m_2}{\sqrt{\dfrac{s_1^2}{N_1} + \dfrac{s_2^2}{N_2}}}$$
[Figure: distribution of gene expression levels (probability density vs. gene expression) for the two groups]
H0: Black and red scores are drawn from a distribution with the same mean
H1: The two means are not equal
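As a worked example, the t-statistic can be computed directly from the summary numbers on the slide (N1 = 10, m1 = 5.6, s1 = 1.6; N2 = 13, mean 0.3, std 1.0). This sketch only computes the statistic; turning it into a P-value needs the t-distribution, which is left to a statistics package. The function name is an assumption.

```python
import math

def two_sample_t(m1, s1, n1, m2, s2, n2):
    """T-statistic from group means, standard deviations and sizes."""
    return (m1 - m2) / math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)

# Summary statistics from the slide: Normal Breast-like vs. Basal subtype.
t_stat = two_sample_t(5.6, 1.6, 10, 0.3, 1.0, 13)
```

The result is roughly 9.2, far out in the tail of the null distribution, so these two groups clearly differ in mean expression.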
2 sample t-test
P-value = shaded area * 2
[Figure: T-distribution (probability density vs. T-statistic) with the tail beyond the observed T-statistic shaded, next to the distribution of gene expression levels]
$$T\text{-statistic} = \frac{m_1 - m_2}{\sqrt{\dfrac{s_1^2}{N_1} + \dfrac{s_2^2}{N_2}}}$$
H0: Black and red scores are drawn from a distribution with the same mean
H1: The two means are not equal
Distribution of this statistic is known under the null hypothesis
Examples of inappropriate distributions for T-tests
T-test assumes data are (approximately) normally distributed
T-test detects differences between means, not necessarily between distributions
Values are positive and have increasing density near zero, e.g. sequence counts
Distributions with outliers, or “heavy-tailed” distributions
Bimodal “two-bumped” distributions.
[Figure: three example probability density plots over gene expression, one for each case listed above]
Solutions: “non-parametric two-sample tests”
1) Robust test for difference of medians (WMW)
2) Direct test of difference of distributions (K-S)
Enrichment analysis with two-sample, not paired Wilcoxon Rank Sum
aka Mann-Whitney U test or simply “WMW”
1) Rank gene scores, calculate RB, sum of ranks of black values
N1 black values: 3.2, 1.7, 6.5, 4.5, 0.1
N2 red values: 2.1, 5.6, -1.1, -2.5, -0.5
Ranked scores (ranks 1 to 10): 6.5, 5.6, 4.5, 3.2, 2.1, 1.7, 0.1, -0.5, -1.1, -2.5
RB = 21
H0: Probability that a red rank is greater than a black rank is 0.5
H1: red ranks are greater than black ranks
Wilcoxon-Mann-Whitney (WMW) test
aka Mann-Whitney U-test, Wilcoxon rank-sum test
2) Calculate Z-score (RB = 21):
$$Z = \frac{R_B - \dfrac{N_1 (N_1 + N_2 + 1)}{2}}{\sigma_U} = -1.4$$
where $N_1 (N_1 + N_2 + 1)/2$ is the mean rank sum
3) Calculate P-value: P-value = shaded area * 2
[Figure: normal distribution (probability density vs. Z) with the tails beyond Z = -1.4 shaded]
H0: Probability that a random sample from distribution of red score is > than one from black is 0.5
H1: Otherwise
WMW test details
• Described method is only applicable for large N1 and N2 and when there are no tied scores
• WMW test is robust to (a few) outliers
$$\sigma_U = \sqrt{N_1 N_2 (N_1 + N_2 + 1) / 12}$$
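The three WMW steps (rank, Z-score, P-value) can be sketched as follows. The two score lists are the values as they could be read back from the slide figure, so treat them as illustrative; the function name is an assumption. As the slide notes, the normal approximation is only appropriate for large samples without ties, so the P-value here is for illustration rather than a recommendation for ten data points.

```python
import math

def wmw_z(black, red):
    """Rank-sum Z-score and two-sided normal-approximation P-value.

    Rank 1 is the largest score, as on the slide. Assumes no tied scores.
    """
    ordered = sorted(black + red, reverse=True)
    rank = {value: i + 1 for i, value in enumerate(ordered)}
    r_b = sum(rank[value] for value in black)            # sum of black ranks
    n1, n2 = len(black), len(red)
    mean_rank_sum = n1 * (n1 + n2 + 1) / 2
    sigma_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (r_b - mean_rank_sum) / sigma_u
    p_value = math.erfc(abs(z) / math.sqrt(2))           # both tails
    return r_b, z, p_value

black = [3.2, 1.7, 6.5, 4.5, 0.1]
red = [2.1, 5.6, -1.1, -2.5, -0.5]
r_b, z, p_value = wmw_z(black, red)
```

With these values the rank sum is 21 and Z is about -1.4, matching the numbers on the slides.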
Kolmogorov-Smirnov (K-S) test for difference of distributions
[Figure: left, probability density vs. gene expression for red and black; right, empirical (cumulative) distribution, cumulative probability (0 to 1.0) vs. gene expression, with the largest gap between the curves marked as Distance = 0.4]
1) Calculate cumulative distributions of red and black
Test statistic: Maximum vertical difference between the two cumulative distributions
Distribution of test statistic is known regardless of the underlying distributions
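The K-S test statistic itself is easy to compute: build the two empirical cumulative distributions and take the largest vertical gap. The sketch below does only that; the P-value comes from the known null distribution of the statistic, which is not implemented here. The function name and the two small samples are illustrative assumptions, not the data behind the slide figure.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Maximum vertical distance between two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    distance = 0.0
    # The gap can only change at observed values, so check each of them.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)   # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)
        distance = max(distance, abs(cdf_a - cdf_b))
    return distance

# Illustrative samples only.
black = [3.2, 1.7, 6.5, 4.5, 0.1]
red = [2.1, 5.6, -1.1, -2.5, -0.5]
d = ks_statistic(black, red)
```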
WMW and K-S test caveats
• Neither test is as sensitive as the T-test, i.e. they require more data points to detect the same amount of difference, so use the T-test whenever it is valid.
• K-S test and WMW can give you different answers: K-S detects difference of distributions, WMW detects whether samples from one tend to be higher than those from the other (or vice versa)
• Technical issue: Tied scores and/or small # of observations can be a problem for some implementations of the WMW or K-S test
Central limit theorem
• If you have a moderately large sample, you can do statistical tests that don’t depend on assumptions about the distribution of the data
[Figure: probability density vs. gene expression for black and red data, with many values at 0]
E.g., black data mean is almost certainly greater than red mean, but there are a lot of tied ‘0’ values that might mess up K-S and WMW tests.
Central Limit Theorem: Distribution of the estimate of the mean is Gaussian. (Assuming your sample is big enough, i.i.d., and that the variance is finite)
Under the null hypothesis, average red = average black and is N(μ,σ2), where μ is the mean and σ2 is the variance.
What is the distribution of my data?
• Because of the central limit theorem and permutation tests, you don’t usually have to worry about it
• A good way to check is using a “qq-plot”.
– This compares the “theoretical quantiles” of a particular distribution to the quantiles in your data.
– If they don’t disagree too badly, you can usually be safe assuming your data are consistent with that distribution
• With large genomics data sets, you will have enough power to reject the hypothesis that your data “truly” come from any distribution
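The points of a qq-plot against a normal distribution can be computed without any plotting library. This is a minimal sketch under two assumptions that the slides do not specify: the normal is fitted with the sample mean and standard deviation, and the plotting positions are (i + 0.5) / n. The function name is made up for illustration.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(data):
    """Return (theoretical quantile, observed quantile) pairs."""
    observed = sorted(data)
    n = len(observed)
    fitted = NormalDist(mean(observed), stdev(observed))
    theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, observed))

points = normal_qq_points([1.0, 2.0, 3.0, 4.0, 5.0])
```

If the data are consistent with the fitted normal, the pairs fall close to the line y = x; systematic bends at the ends point to heavy tails or skew.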
Permutation tests
• Often, the null distribution of the test statistic is unclear or not analytical.
• In these cases, you can generate an empirical distribution by sampling from the null distribution and then evaluating your test statistic against this distribution.
• In many genomic applications it is often possible to get a sample from the null distribution by randomizing (i.e. permuting) the association between genes and corresponding data.
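A permutation test for a two-group comparison can be sketched as below. The choices here are assumptions for illustration: the test statistic is the absolute difference of means, group labels are shuffled to sample from the null, and one is added to both counts so the estimated P-value is never exactly zero. The example scores are invented.

```python
import random

def permutation_pvalue(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation P-value for a difference of means."""
    rng = random.Random(seed)

    def mean_diff(a, b):
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = mean_diff(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # break the association between score and group
        if mean_diff(pooled[:n_a], pooled[n_a:]) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)

# Invented scores for two clearly separated groups.
p_value = permutation_pvalue([5.1, 6.2, 5.8, 6.5, 4.9, 6.0],
                             [0.2, -0.5, 1.1, 0.4, -0.2, 0.9])
```

What gets shuffled defines the null hypothesis: permuting group labels tests "no association between group and score", which may or may not be the null you care about.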
When permuting, you have to think deep thoughts about what your null hypothesis really is.
Janusz Dutkowski, Michael Kramer, Michal A Surma, Rama Balakrishnan, J Michael Cherry, Nevan J Krogan & Trey Ideker Nature Biotechnology 31, 38–45 (2013) A gene ontology inferred from molecular networks
Exact tests
• Sometimes, the probability of an observation as extreme or more can be calculated directly under the H0
• In this case there is no “test statistic”
• E.g., “binomial test”, “Fisher’s Exact Test” and “hypergeometric test” (use for Gene Set Enrichment Analysis)
• These tests are feasible now because computers calculate these probabilities
E.g., Binomial test
• You did a poll where you get “yes” or “no” answers each time, and you have some prior belief about the frequency of “yes” or “no” under the null hypothesis. E.g., if people don’t care, then p should be 50%
P-value = Pr(73 or more “yes” | 102 total, p=50%)
$$P\text{-value} = \sum_{X=73}^{102} \binom{102}{X} (0.5)^X (1 - 0.5)^{102 - X}, \quad \text{where } \binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$
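The binomial sum above is short enough to evaluate directly. A minimal sketch, with a function name chosen for illustration:

```python
from math import comb

def binomial_upper_tail(k, n, p=0.5):
    """Pr(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x)
               for x in range(k, n + 1))

# The poll from the slide: 73 or more "yes" out of 102, null p = 50%.
p_value = binomial_upper_tail(73, 102)
```

This is the one-sided P-value, matching the "73 or more" wording; a two-sided version would also add the mirror-image lower tail.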
E.g., Fisher’s Exact test
• You developed a prediction method where you got a 2 x 2 table as the result

                      Predicted positive    Predicted negative
Observed negative             14                   178
Observed positive              7                    31

I won’t bother you with the formula, but the probability of the “configuration” of the 2 x 2 table can be calculated exactly
Doesn’t make any assumption about the distribution of positives and negatives
P-value = Pr(a “configuration” as extreme or more | no association)
To calculate this, you need to sum up a lot of possible tables
According to R, in this case P-value = 0.05666
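For readers who want to see what "summing up a lot of possible tables" means, here is a sketch of a two-sided Fisher’s exact test. It assumes the common convention of summing the probabilities of all tables (with the same row and column totals) that are no more likely than the observed one; other software may define "as extreme" differently, so this is not claimed to reproduce any particular package’s output. The function name is an assumption.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]]."""
    row1, col1, total = a + b, a + c, a + b + c + d
    lowest = max(0, row1 + col1 - total)
    highest = min(row1, col1)

    def weight(x):
        # Number of ways to get x in the top-left cell with fixed margins.
        return comb(col1, x) * comb(total - col1, row1 - x)

    observed = weight(a)
    extreme = sum(weight(x) for x in range(lowest, highest + 1)
                  if weight(x) <= observed)
    return extreme / comb(total, row1)

# The 2 x 2 table as read from the slide (observed positive row first).
p_slide_table = fisher_exact_two_sided(7, 31, 14, 178)
# A small table that is easy to check by hand: 280 / 8008.
p_small = fisher_exact_two_sided(8, 2, 1, 5)
```

Working with integer counts and dividing once at the end avoids rounding problems when comparing table probabilities.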
The hypergeometric test
Gene list: RRP6, MRD1, RRP7, RRP43, RRP42
H0: List is a random sample from population
H1: More black genes than expected
Background population: 500 black genes, 4500 red genes
The hypergeometric function
Probability a random sample of k genes contains q black genes when the background population contains m black genes out of n total genes:
$$\frac{\binom{m}{q} \binom{n - m}{k - q}}{\binom{n}{k}} = \frac{(\text{\# ways to choose } q \text{ out of } m \text{ genes}) \times (\text{\# ways to choose } k - q \text{ out of } n - m \text{ genes})}{\text{\# ways to choose } k \text{ out of } n \text{ genes}}$$
$\binom{n}{k} = \frac{n!}{(n - k)!\,k!}$ is called “n choose k”; for details see http://www.khanacademy.org/video/combinations
The hypergeometric test
Gene list: RRP6, MRD1, RRP7, RRP43, RRP42
Null distribution
$$P\text{-value} = \frac{\binom{500}{4} \binom{4500}{1}}{\binom{5000}{5}} + \frac{\binom{500}{5} \binom{4500}{0}}{\binom{5000}{5}} = 4.6 \times 10^{-4}$$
Background population: 500 black genes, 4500 red genes
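The P-value above can be checked with a few lines of code. This sketch uses the same symbols as the hypergeometric function slide (q black genes in a list of k, drawn from n genes of which m are black); the function name is an assumption.

```python
from math import comb

def hypergeom_upper_tail(q, k, m, n):
    """Pr(at least q black genes in a random sample of k genes)."""
    ways = sum(comb(m, x) * comb(n - m, k - x)
               for x in range(q, min(k, m) + 1))
    return ways / comb(n, k)

# 4 or more black genes in a list of 5; 500 black genes out of 5000 total.
p_value = hypergeom_upper_tail(q=4, k=5, m=500, n=5000)
```

The result is about 4.6 x 10^-4, in agreement with the slide.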
Important details
• One way to test for under-enrichment of “black” is to test for over-enrichment of “red”
• Same as a “One-tailed Fisher’s Exact Test”
• Need to choose “background population” appropriately, e.g., if only a portion of the total gene complement is queried (or available for annotation), only use that population as background.
• To test for enrichment of more than one independent type of annotation (red vs black and circle vs square), we need to apply the hypergeometric test separately for each type. ***multivariate hypothesis testing***
Multivariate hypothesis tests
P-value is the “probability of observing something as extreme or more under the null hypothesis”
• Basic problem is the “or more”
We would have to do the sum in all dimensions.
Instead there are two major strategies for multivariate hypothesis testing:
1. Likelihood ratio test – summarizes the multivariate hypothesis with a single test statistic, and then does the sum in a single dimension
2. Test each dimension independently – very conservative because it ignores the potential correlation between dimensions.
When we want to know which dimensions are causing the rejection of the null hypothesis, we typically use #2
Gene set enrichment analysis
• Which (if any) annotations are enriched in our gene list?
• Test each annotation independently using the hypergeometric test
• Need to correct P-values because there are so many annotations tested…
Outline for Today
• Bioinformatics
– GO and other annotations
– The annoying thing about bioinformatics
• Review of hypothesis testing
– Parametric vs. non-parametric tests
– Exact tests
– Multivariate hypothesis testing
• Multiple hypothesis testing
– Bonferroni, FDR
– Application to gene set enrichment analysis
Multiple test correction: Bonferroni and False Discovery Rate
Mark Gerstein P-value paradox
– His lab publishes about 30 research papers/year. E.g., published 33 papers in 2011 (>300 in the last 10 years)
– At P-value=0.05, how many significant results/year are expected from his lab under the null hypothesis?
Random draws
… 7,834 draws later …Expect a random draw with observed
i h t,
enrichment once every 1 / P-value draws
Background population:500 bl k500 black genes, 5000 red genes
58 Quaid Morris
How to win the P-value lottery, part 2
Keep the gene list the same, evaluate different annotations
Observed draw: RRP6, MRD1, RRP7, RRP43, RRP42
Different annotations: RRP6, MRD1, RRP7, RRP43, RRP42
ORA tests need correction
From the Gene Ontology website:
Current ontology statistics: 25206 terms
• 14825 biological process
• 2101 cellular component
• 8280 molecular function
Buying 1 or 2 or even 10 lottery tickets, you still have a small chance of winning. However, if you buy 25,000 tickets, your chances of winning start to improve.
Simple P-value correction: Bonferroni
If M = # of annotations tested:
Corrected P-value = M x original P-value
Corrected P-value is greater than or equal to the probability that one or more of the observed enrichments could be due to random draws. The jargon for this correction is “controlling for the Family-Wise Error Rate (FWER)”
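In code, the Bonferroni correction is a single multiplication. The sketch below also caps the result at 1, which is a common convention rather than something stated on the slide; the function name and the optional `m` argument are assumptions for illustration.

```python
def bonferroni(p_values, m=None):
    """Multiply each P-value by the number of tests, capped at 1."""
    if m is None:
        m = len(p_values)  # assume the list contains every test performed
    return [min(1.0, p * m) for p in p_values]

corrected = bonferroni([0.01, 0.04, 0.5])
# The earlier hypergeometric P-value, if all 25206 GO terms had been tested.
go_corrected = bonferroni([4.6e-4], m=25206)
```

With 25206 annotations tested, an uncorrected P-value of 4.6 x 10^-4 no longer counts as evidence of enrichment, which is the lottery-ticket point made earlier.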
Bonferroni correction caveats
• Bonferroni correction is very stringent and can “wash away” real enrichments.
• Often users are willing to accept a less stringent condition, the “false discovery rate” (FDR), which leads to a gentler correction when there are real enrichments.
False discovery rate (FDR)
• FDR is the expected proportion of the observed enrichments due to random chance.
• Compare to Bonferroni correction which is a bound on the probability that any one of the observed enrichments could be due to random chance.
• Typically FDR corrections are calculated using the Benjamini-Hochberg procedure.
• FDR threshold is often called the “q-value”
Controlling FDR using the Benjamini-Hochberg procedure I
• Say you want to bound the FDR at α; you need to calculate the corresponding P-value threshold t
• First, calculate the P-values for all the tests, and then sort them so that p1 is the smallest (i.e. most significant) P-value, and pm is the largest (least significant).
Benjamini, Y. & Hochberg, Y. (1995) J. R. Stat. Soc. B 57, 289–300
Controlling FDR using the Benjamini-Hochberg procedure II
• t = pr where r is the max value for which:
$$p_r \le \frac{r \alpha}{m}$$
where α is the FDR threshold, r is the rank and m is the # of tests
Caveat: Assumes independent or positively correlated tests.
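The two Benjamini-Hochberg steps translate into a short function: sort the P-values, find the largest rank r with p_r ≤ rα/m, and use p_r as the threshold t. This is a minimal sketch with made-up function names and example P-values; it returns `None` when no rank satisfies the bound.

```python
def benjamini_hochberg_threshold(p_values, alpha=0.05):
    """Return the P-value threshold t, or None if nothing passes."""
    ordered = sorted(p_values)
    m = len(ordered)
    threshold = None
    for r, p in enumerate(ordered, start=1):
        if p <= r * alpha / m:
            threshold = p  # keep going: we want the largest such rank
    return threshold

def significant(p_values, alpha=0.05):
    """Flag each P-value that is at or below the BH threshold."""
    t = benjamini_hochberg_threshold(p_values, alpha)
    return [t is not None and p <= t for p in p_values]

# Invented P-values: the smallest one fails its own bound (0.02 > 0.05/3),
# but rank 3 passes (0.03 <= 0.05), so all three are called significant.
t = benjamini_hochberg_threshold([0.02, 0.021, 0.03])
flags = significant([0.02, 0.021, 0.03])
```

The example shows why the procedure takes the maximum rank rather than stopping at the first failure.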
Reducing multiple test correction stringency
• The correction to the P-value threshold α depends on the # of tests that you do, so, no matter what, the more tests you do, the more sensitive the test needs to be
• Can control the stringency by reducing the number of tests: e.g. use GO slim; restrict testing to the appropriate GO annotations; or select only larger GO categories.
Summary
• Multiple test correction
– Bonferroni: stringent, controls probability of at least one false positive
– FDR: more forgiving, controls expected proportion of false positives -- typically use Benjamini-Hochberg