Page 1

Improving Interpretation in Gene Set Enrichment Analysis

Alex Lewin (Imperial College Centre for Biostatistics)
Ian Grieve (IC Microarray Centre)
Elena Kulinskaya (IC Statistical Advisory Service)

Page 2

Introduction

• Microarray experiment → list of differentially expressed (DE) genes

• Genes belong to categories of Gene Ontology (GO)

• Are some GO categories (groups of genes) over-represented amongst the DE genes?

Page 3

Contents

• Grouping Gene Ontology categories can improve interpretation of gene set enrichment analysis

• Fuzzy decision rules for multiple testing with discrete data

Page 4

Gene Ontology (GO)

Database of biological terms

Arranged in graph connecting related terms: links from more general to more specific terms

For each node, can define ancestor and descendant terms

Directed Acyclic Graph

~16,000 terms

from QuickGO website (EBI)

Page 5

Gene Annotations

• Genes/proteins annotated to relevant GO terms
    – Gene may be annotated to several GO terms
    – GO term may have 1000s of genes annotated to it (or none)

• A gene annotated to term A is also annotated to all ancestors of A
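
The propagation of annotations to ancestor terms can be sketched as below. This is a minimal illustration, not the GO tooling itself: the term names (`root`, `A`, `B`, `C`), the gene names, and the helper functions `ancestors` and `propagate` are all hypothetical, and the graph is assumed to be given as a child-to-parents map.

```python
from collections import defaultdict

def ancestors(term, parents):
    """Return every ancestor of `term` in a DAG given as a child -> parents map."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, ()))
    return seen

def propagate(direct, parents):
    """Map each term to the genes annotated to it directly or via a descendant."""
    full = defaultdict(set)
    for gene, terms in direct.items():
        for term in terms:
            full[term].add(gene)
            for anc in ancestors(term, parents):
                full[anc].add(gene)
    return dict(full)

# Hypothetical toy graph: C has two parents (A and B), both children of root.
parents = {"C": ["A", "B"], "A": ["root"], "B": ["root"]}
# Hypothetical direct annotations.
direct = {"gene1": ["C"], "gene2": ["A"]}

annotated = propagate(direct, parents)
for term in sorted(annotated):
    print(term, sorted(annotated[term]))
```

Because a term can have several parents, a gene annotated to one specific term may end up counted in many more general terms, which is one reason neighbouring terms give correlated test results.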

Page 6

Find GO terms over-represented amongst differentially expressed genes

For each GO term, compare: proportion of differentially expressed genes annotated to that term

vs. proportion of non-differentially expressed genes annotated to that term

Fisher’s test p-value for each GO term.

Multiple testing considerations → threshold below which p-values are declared significant.

Many websites do this type of analysis, e.g. FatiGO website http://fatigo.bioinfo.cnio.es/

             DE     not DE   Total
GO           22              467
not GO
Total        173             7847
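
A one-sided Fisher test for this table can be sketched with the hypergeometric distribution, using only the Python standard library. The function names are illustrative, and the upper tail P(X ≥ x) is an assumption made here because the question is over-representation; swap the direction of the sum if the lower-tail convention P(X ≤ x) is intended.

```python
from math import comb

def hypergeom_pmf(x, n_de, n_total, n_go):
    """P(X = x): x DE genes among the n_go genes annotated to the term,
    when n_de of the n_total genes are DE (all margins held fixed)."""
    return comb(n_de, x) * comb(n_total - n_de, n_go - x) / comb(n_total, n_go)

def upper_tail_p(x, n_de, n_total, n_go):
    """One-sided p-value P(X >= x | null) for over-representation (assumed direction)."""
    x_max = min(n_de, n_go)
    return sum(hypergeom_pmf(k, n_de, n_total, n_go) for k in range(x, x_max + 1))

# Counts from the 2x2 table: 22 DE genes in the GO term,
# 173 DE genes, 467 genes in the term, 7847 genes in total.
p = upper_tail_p(22, n_de=173, n_total=7847, n_go=467)
print(f"P(X >= 22 | null) = {p:.3g}")
```

Under the null the expected count in the top-left cell is 173 × 467 / 7847, roughly 10, so an observed count of 22 lies well into the upper tail.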

Page 7

Difficulties in Testing GO terms

Interpretation: many terms close in the graph may be found significant – or not significant but many low p-values close together in the graph

Statistical Power: many terms have few genes annotated

Discrete statistics: p-values not Uniform under null

Page 8

Grouping GO terms

Use the Poset Ontology Categorizer (POSOC), Joslyn et al. 2004

Software which groups terms based on
  - pseudo-distance between terms
  - ‘coverage’ of genes

Example: for data used here, reduces ~16,000 terms to 76 groups

Page 9

Example: genes associated with the insulin-resistance gene Cd36

Knock-out and wildtype mice

Bayesian hierarchical model gives posterior probabilities (p_g) of being differentially expressed

Most differentially expressed: p_g > 0.5 (280 genes)

Least differentially expressed: p_g < 0.2 (11171 genes)

Page 10

Example Results

Individual term tests
  - Used FatiGO website
  - Multiple testing corrections (Benjamini and Hochberg FDR) done separately for each ‘level’
  - Found no GO terms significant when FDR controlled at 5%

Group tests
  - POSOC on all genes on U74A chip gives 76 groups
  - 3 groups found significant when controlling FDR at 5%

Page 11

Comparison of Individual and Group Tests

Rank in FatiGO (smallest p-values)        Membership of POSOC group                   Significant
1: response to external stimulus          IA                                          IA
2: resp. to pest, pathogen or parasite    response to p.p.p.                          yes
3: response to wounding                   response to wounding                        yes
4: organismal movement                    IA                                          IA
5: response to biotic stimulus            IA                                          IA
6: neurophysiological process             -                                           -
7: response to stress                     IA                                          IA
8: inflammatory response                  immune resp, resp. to ppp, resp to wound    yes
9: transmission of nerve impulse          -                                           -
10: neuromuscular physiological proc.     -                                           -
11: defense response                      IA                                          IA
12: immune response                       immune resp, resp. to ppp, resp to wound    yes
13: chemotaxis                            immune resp, resp. to ppp, resp to wound    yes
                                          chemotaxis, cell-migration                  no (at 5%), no
14: nucleobase, nucleoside, nuc …         -                                           -
15: cell-cell signalling                  -                                           -

IA = immediate ancestor of significant POSOC group

Page 12

Comparison of Individual and Group Tests

[Figure: part of the GO graph below "Biological process", showing the terms Physiological process, Organismal movement, Response to stimulus, Response to external stimulus, Response to biotic stimulus, Response to other organism, Response to stress, Response to wounding, Defense response, Response to pest, pathogen or parasite, Immune response and Inflammatory response.]

Legend:
  - Ranks high individually (smallest p-values)
  - Significant in group tests (and ranks high individually)

Page 13

Discrete test statistics

Null hypothesis determined by margins of 2x2 table

Often very small no. possible values for cells → small no. possible p-values

             DE     not DE   Total
GO           X               467
not GO
Total        173             7847

Null Hypothesis: X ~ HyperGeom(173, 7847-173, 467)
X = 0,…,173

Page 14

Discrete test statistics

             DE     not DE   Total
GO           X               467
not GO
Total        173             7847

p-value p(x) = P( X ≤ x | null )

P( p ≤ α | null) ≠ α for most α
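
The effect of discreteness can be checked numerically. The sketch below, with illustrative function names, lists every p-value p(x) = P(X ≤ x | null) attainable under the hypergeometric null with the margins above, and computes the actual size P(p ≤ α | null) as the largest attainable p-value not exceeding α.

```python
from itertools import accumulate
from math import comb

def null_pmf(n_de, n_total, n_go):
    """Hypergeometric null probabilities P(X = x) for x = 0..min(n_de, n_go)."""
    denom = comb(n_total, n_go)
    x_max = min(n_de, n_go)
    return [comb(n_de, x) * comb(n_total - n_de, n_go - x) / denom
            for x in range(x_max + 1)]

def attainable_p_values(pmf):
    """All values that p(x) = P(X <= x | null) can take, in increasing order."""
    return list(accumulate(pmf))

def actual_size(alpha, p_values):
    """P(p <= alpha | null): the largest attainable p-value not exceeding alpha."""
    below = [p for p in p_values if p <= alpha]
    return max(below) if below else 0.0

p_values = attainable_p_values(null_pmf(n_de=173, n_total=7847, n_go=467))
for alpha in (0.01, 0.05, 0.10):
    print(f"alpha = {alpha:.2f}  actual size = {actual_size(alpha, p_values):.4f}")
```

The size only equals α when α happens to coincide with an attainable p-value. For terms with few annotated genes the attainable values are sparser, so the gap between nominal and actual size is larger.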

Page 15

Randomised Test

Observe X=x0

pobs = observed p-value = P( X ≤ x0 | null )

pprev = next smallest possible p-value = P( X ≤ x0-1 | null )

Randomised p-value

P(x0) = P( X < x0 | null ) + u*P( X = x0 | null ), where u ~ Unif(0,1)
      = pprev + u*(pobs - pprev)

conditionally, P | x0 ~ Unif(pprev, pobs)
unconditionally, P ~ Unif(0,1)

[Figure: number line from 0 to 1 with pprev and pobs marked]
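
A small simulation can illustrate that the randomised p-value is uniform under the null. This is a sketch under assumptions: the toy margins (8 genes in the term, 10 DE genes out of 50), the seed and the function name `randomised_p` are hypothetical and not taken from the Cd36 data.

```python
import random
from math import comb

def randomised_p(x0, pmf, rng):
    """Draw P = pprev + u*(pobs - pprev) with u ~ Unif(0,1)."""
    p_prev = sum(pmf[:x0])        # P(X <= x0 - 1 | null)
    p_obs = p_prev + pmf[x0]      # P(X <= x0 | null)
    return p_prev + rng.random() * (p_obs - p_prev)

# Toy hypergeometric null (hypothetical numbers): x = 0..8.
pmf = [comb(10, x) * comb(40, 8 - x) / comb(50, 8) for x in range(9)]

rng = random.Random(1)
n_sim = 20000
xs = rng.choices(range(len(pmf)), weights=pmf, k=n_sim)  # X drawn from the null
ps = [randomised_p(x, pmf, rng) for x in xs]

frac_below = sum(p <= 0.05 for p in ps) / n_sim
print(f"fraction of randomised p-values <= 0.05: {frac_below:.3f}")
```

The printed fraction should be close to 0.05 up to Monte Carlo error, whereas the non-randomised p-value for the same toy null can only take nine distinct values.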

Page 16

Fuzzy Decision Rule

Idea is to use all possible realisations of randomised test.

Summarise evidence by critical function of randomised test:

τ_α(pprev, pobs) =
    1                             if pobs < α
    (α - pprev)/(pobs - pprev)    if pprev < α < pobs
    0                             if pprev > α

[Figure: number line from 0 to 1 with pprev and pobs marked]

Use τ_α as a fuzzy measure of evidence against the null hypothesis.

(Fuzzy decision rule considered by Cox & Hinkley, 1974 and developed by Geyer and Meeden, 2005)
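
The critical function for a single test can be written directly from its three cases. In this sketch the function name `tau` is illustrative, and the boundary cases pobs = α and pprev = α, which the slide leaves open, are assigned to the reject and accept branches respectively as an assumption.

```python
def tau(alpha, p_prev, p_obs):
    """Probability that a randomised p-value drawn from Unif(p_prev, p_obs)
    falls at or below alpha."""
    if p_obs <= alpha:        # whole interval below alpha: always rejected
        return 1.0
    if p_prev >= alpha:       # whole interval above alpha: never rejected
        return 0.0
    # alpha falls inside (p_prev, p_obs): fraction of the interval below alpha
    return (alpha - p_prev) / (p_obs - p_prev)

# Hypothetical values: alpha lies 40% of the way from p_prev to p_obs.
print(tau(0.05, p_prev=0.03, p_obs=0.08))   # approximately 0.4
print(tau(0.05, p_prev=0.001, p_obs=0.02))  # 1.0
print(tau(0.05, p_prev=0.06, p_obs=0.20))   # 0.0
```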

Page 17

Fuzzy Decision Rules for Multiple Testing

We have developed fuzzy decision rules for multiple tests (i = 1,…,m)

Use Benjamini and Hochberg false discovery rate (BH FDR)

τ_α^BH(pprev_i, pobs_i) = P( randomised p-value i is rejected | null ), using BH FDR procedure

For small no. tests we can calculate these exactly.

Page 18

Fuzzy Decision Rules for Multiple Testing

τ_α^BH(pprev_i, pobs_i) = P( randomised p-value i is rejected | null )

For large no. tests use simulations:

for j = 1,…,n {
    generate randomised p-values (i = 1,…,m): P_ij ~ Unif(pprev_i, pobs_i)
    perform BH FDR procedure: I_ij = 1 if P_ij rejected, 0 else
}

estimate: τ̂_α^BH(pprev_i, pobs_i) = 1/n Σ_j I_ij
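
The simulation loop above can be sketched in Python as follows. This is not the authors' implementation: the function names `bh_reject` and `fuzzy_bh`, the seed, the number of simulations and the three example intervals are hypothetical, and the standard Benjamini-Hochberg step-up rule is assumed for the rejection step.

```python
import random

def bh_reject(p_values, alpha):
    """Benjamini-Hochberg step-up rule: list of booleans, True = rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank whose ordered p-value is <= alpha * rank / m
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            k = rank
    rejected = [False] * m
    for i in order[:k]:
        rejected[i] = True
    return rejected

def fuzzy_bh(p_prev, p_obs, alpha, n_sim=2000, seed=0):
    """Monte Carlo estimate of tau^BH for each test i."""
    rng = random.Random(seed)
    counts = [0] * len(p_obs)
    for _ in range(n_sim):
        # One realisation of the randomised p-values, P_ij ~ Unif(pprev_i, pobs_i)
        p_rand = [rng.uniform(lo, hi) for lo, hi in zip(p_prev, p_obs)]
        for i, was_rejected in enumerate(bh_reject(p_rand, alpha)):
            counts[i] += was_rejected
    return [c / n_sim for c in counts]

# Hypothetical example with m = 3 tests.
p_prev = [0.0001, 0.01, 0.2]
p_obs = [0.0004, 0.06, 0.5]
tau_hat = fuzzy_bh(p_prev, p_obs, alpha=0.05)
print([round(t, 3) for t in tau_hat])
```

In this toy example the first test is always rejected and the third never is. The second is rejected whenever its randomised p-value falls below the rank-2 threshold 0.05 × 2/3, which happens with probability about 0.47, so its estimate should land near that value.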

Page 19

Results for Cd36 Example

[1] "alpha = 0.05"
   pprev   pval    i.bonf  i.bh  tau    POSOC group
1  1e-04   3e-04   1       1     1      response to pest, pathogen or parasite
2  1e-04   4e-04   1       1     1      response to wounding
3  2e-04   6e-04   1       1     1      immune response
4  7e-04   0.0079  0       0     0.297  digestion
5  0.003   0.0122  0       0     0.021  chemotaxis
6  0.0039  0.0209  0       0     0.002  organic acid biosynthesis
7  0.0092  0.0306  0       0     0      synaptic transmission
8  5e-04   0.0436  0       0     0.059  response to fungi

[1] "alpha = 0.15"
   pprev   pval    i.bonf  i.bh  tau    POSOC group
1  1e-04   3e-04   1       1     1      response to pest, pathogen or parasite
2  1e-04   4e-04   1       1     1      response to wounding
3  2e-04   6e-04   1       1     1      immune response
4  7e-04   0.0079  0       1     1      digestion
5  0.003   0.0122  0       0     0.943  chemotaxis
6  0.0039  0.0209  0       0     0.661  organic acid biosynthesis
7  0.0092  0.0306  0       0     0.375  synaptic transmission
8  5e-04   0.0436  0       0     0.391  response to fungi

Page 20

Results for Cd36 Example

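Order of fuzzy decisions is not the same as order of observed p-values
(e.g. at alpha = 0.05, response to fungi has pval 0.0436 and tau 0.059, while synaptic transmission has the smaller pval 0.0306 but tau 0)

Depends on amount of discreteness of the null distribution, i.e. on the gap between pprev and pobs

[Figure: pprev and pobs marked on a line]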

Page 21

Conclusions

• Grouping Gene Ontology categories can help find significant regions of the GO graph

• Fuzzy decision rules for multiple testing with discrete data can provide more candidates for rejection

Page 22

Acknowledgements

Cliff Joslyn (Los Alamos National Laboratory)
Tim Aitman (IC Microarray Centre)
Sylvia Richardson (IC Centre for Biostatistics)

BBSRC ‘Exploiting Genomics’ grant (AL)
Wellcome Trust grant (IG)

References

Joslyn CA, Mniszewski SM, Fulmer A and Heaton G (2004), The Gene Ontology Categorizer, Bioinformatics 20, 169-177.

Geyer and Meeden (2005), Fuzzy Confidence Intervals and P-values, Statistical Science, to appear.