Gene Set Analysis using R and Bioconductor, by Daniel Gusenleitner

Page 1: Gene Set  Analysis  using R and  Bioconductor

Gene Set Analysis using R and Bioconductor

Daniel Gusenleitner

Page 2: Gene Set  Analysis  using R and  Bioconductor

Why Gene Sets?

• Phenotypic characteristics and clinical diseases can rarely be attributed to a single gene

• Most diseases are complex and involve multiple genes

• Genes usually do not work independently; they work as parts of a functional unit

Page 3: Gene Set  Analysis  using R and  Bioconductor

Genes and Proteins talk in Pathways

Page 4: Gene Set  Analysis  using R and  Bioconductor

Definition of Gene Sets

• Gene sets are loosely defined as groups of genes that share a biological mechanism or other characteristic

• They represent the distilled base of biological knowledge and act as an aid for theoretical and experimental research

Page 5: Gene Set  Analysis  using R and  Bioconductor

There are different kinds of gene sets

• Data-driven gene sets are usually derived from high-throughput experiments that identify sets of related genes.

• Knowledge-driven gene sets require expert knowledge to construct. These are usually specific to domains of interest.

Page 6: Gene Set  Analysis  using R and  Bioconductor

Resources for Gene Sets

Page 7: Gene Set  Analysis  using R and  Bioconductor

1. Search PubMed with pre-defined search criteria

2. Extract gene signatures from tables, figures or supplements

3. Annotate each gene signature

4. Map all mappable identifiers to the genome to create standardized gene lists

Name: Description

PMID: PubMed identifier
Tissue: Name of search term set used to search PubMed
Organism: Species common name (human, mouse, etc.)
Platform: Name of microarray or other experimental technique used to derive gene signature
Platform Description: Description of platform
Genes: Number of genes in gene signature
Sig ID: Signature identifier, in the format PMID-XXX, where XXX is the gene signature table, figure or supplementary file, e.g. 18490921-Table3
Sig Name: Name of gene signature, in the format Tissue_AuthorYear_NumberofGenes_Description. Description is optional. e.g. Breast_Bertucci08_75genes
Sig Description: Description of gene signature, typically extracted from table or figure legend (free text)
File Associated: Name of tab-delimited gene signature file. Format is SigID.txt
URL: URL from where gene signature was downloaded
Column Mappings: Content of each column in gene signature file (selection from constrained list in Table 1b)
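The Sig ID and Sig Name formats described above can be taken apart mechanically. The following is a minimal sketch in plain Python, not part of any database tooling; the function names are hypothetical, and the assumption that the gene count starts with digits (as in "75genes") rests only on the single example given in the table.

```python
import re

def parse_sig_id(sig_id):
    """Split a 'PMID-XXX' identifier into the PubMed ID and the source label."""
    pmid, _, source = sig_id.partition("-")
    pmid, source = pmid.strip(), source.strip()
    if not pmid.isdigit() or not source:
        raise ValueError(f"not a PMID-XXX identifier: {sig_id!r}")
    return {"pmid": pmid, "source": source}

def parse_sig_name(sig_name):
    """Split 'Tissue_AuthorYear_NumberofGenes[_Description]' into its parts."""
    parts = sig_name.split("_", 3)
    if len(parts) < 3:
        raise ValueError(f"expected at least three '_'-separated parts: {sig_name!r}")
    tissue, author_year, n_genes = parts[:3]
    # Assumption: the gene count field begins with the number, e.g. '75genes'.
    match = re.match(r"\d+", n_genes)
    return {
        "tissue": tissue,
        "author_year": author_year,
        "n_genes": int(match.group()) if match else None,
        "description": parts[3] if len(parts) == 4 else None,  # optional field
    }

sig_id = parse_sig_id("18490921-Table3")
sig_name = parse_sig_name("Breast_Bertucci08_75genes")
```

Both examples are the ones quoted in the table; the optional description comes back as None when it is absent.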

Page 8: Gene Set  Analysis  using R and  Bioconductor

Gene Ontology (GO)

• An initiative to unify the representation of gene and gene product attributes across all species

• Maintains a controlled vocabulary of gene and gene product attributes

• Provides an annotation for genes and gene products

• Provides tools for easy access to all aspects of the data provided by the project

http://www.geneontology.org/

Page 9: Gene Set  Analysis  using R and  Bioconductor

The ontology covers three domains:

• Cellular component: the parts of a cell or its extracellular environment

• Molecular function: the elemental activities of a gene product at the molecular level, such as binding or catalysis

• Biological process: operations or sets of molecular events with a defined beginning and end, pertinent to the functioning of integrated living units

Page 10: Gene Set  Analysis  using R and  Bioconductor

KEGG (Kyoto Encyclopedia of Genes and Genomes)

• Connects known information on molecular interaction networks

• It contains:

– genes and proteins

– biochemical compounds and reactions

– pathways and complexes

http://www.genome.jp/kegg/

Page 11: Gene Set  Analysis  using R and  Bioconductor

Molecular Signatures Database (MSigDB), version 3.0 (September 2010)

A warehouse of 6769 annotated gene sets

http://www.broadinstitute.org/gsea/index.jsp

Page 12: Gene Set  Analysis  using R and  Bioconductor

• Divided into five major collections:

– C1: positional gene sets for each human chromosome and cytogenetic band (326 gene sets)

– C2: curated gene sets from online pathway databases, publications in PubMed, and knowledge of domain experts (3272 gene sets)

– C3: motif gene sets based on conserved cis-regulatory motifs (836 gene sets)

– C4: computational gene sets defined by mining large collections of cancer-oriented microarray data (881 gene sets)

– C5: GO gene sets, each consisting of genes annotated by the same GO term (1454 gene sets)

Page 13: Gene Set  Analysis  using R and  Bioconductor

Gene Set Analysis

Page 14: Gene Set  Analysis  using R and  Bioconductor

Statistical Methods

• Fisher’s Exact Test

• EASE: the Expression Analysis Systematic Explorer
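As an illustration of the first method, the sketch below computes a one-sided Fisher's exact test for the overlap between a gene set and a list of selected (for example, differentially expressed) genes. It is a minimal plain-Python sketch using only the standard library, not the R or Bioconductor implementation; the function name and the example counts are hypothetical.

```python
from math import comb

def fisher_enrichment_p(n_universe, n_in_set, n_selected, n_overlap):
    """One-sided Fisher's exact test.

    Probability of seeing at least n_overlap set members among n_selected
    genes drawn at random from n_universe genes, n_in_set of which belong
    to the gene set (upper tail of the hypergeometric distribution).
    """
    largest = min(n_in_set, n_selected)
    favourable = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_selected - k)
        for k in range(n_overlap, largest + 1)
    )
    return favourable / comb(n_universe, n_selected)

# Hypothetical counts: 10 genes measured, 5 in the set, 5 selected, all 5 overlap.
p_value = fisher_enrichment_p(10, 5, 5, 5)
```

With these toy counts only one of the 252 possible selections contains the whole set, so the p-value is 1/252.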

Page 15: Gene Set  Analysis  using R and  Bioconductor

Gene Set Analysis (GSA) using Gene Expression Data

• shifts the analyses more towards biology-driven approaches

• uses functionally related groups of genes to analyze gene expression datasets

• more robust than single gene analyses

Page 16: Gene Set  Analysis  using R and  Bioconductor

Competitive vs. Self-Contained Hypothesis

• GSA methods differ in the definition of the null hypothesis:

– Self-contained tests consider only the genes within the gene set and test whether their expression differs across the given samples

– Competitive tests compare the differential expression of the gene set with that of either all genes or the complement of the gene set on the microarray
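To make the competitive null hypothesis concrete, the sketch below compares the mean per-gene score of a gene set with the means of randomly drawn gene sets of the same size. It is a minimal plain-Python sketch under stated assumptions: the per-gene scores (for example absolute t-statistics) are given, and the function name, gene names and scores are hypothetical. A self-contained test would instead permute the sample labels and recompute the scores for the genes in the set only.

```python
import random
from statistics import mean

def competitive_p_value(gene_scores, gene_set, n_resamples=2000, seed=0):
    """Competitive test by gene sampling.

    gene_scores: {gene: score}, higher meaning stronger differential expression.
    Returns the fraction of random same-size gene sets whose mean score is at
    least as high as that of gene_set (with a +1 correction).
    """
    rng = random.Random(seed)
    genes = sorted(gene_scores)  # fixed order keeps the result reproducible
    members = [g for g in genes if g in gene_set]
    observed = mean(gene_scores[g] for g in members)
    hits = 0
    for _ in range(n_resamples):
        sample = rng.sample(genes, len(members))
        if mean(gene_scores[g] for g in sample) >= observed:
            hits += 1
    return (hits + 1) / (n_resamples + 1)

# Hypothetical scores: gene g0 has score 0, g1 has score 1, and so on.
scores = {f"g{i}": float(i) for i in range(50)}
top_set = {"g45", "g46", "g47", "g48", "g49"}
bottom_set = {"g0", "g1", "g2", "g3", "g4"}
p_top = competitive_p_value(scores, top_set)
p_bottom = competitive_p_value(scores, bottom_set)
```

The set made of the highest-scoring genes gets a small p-value, while the set of lowest-scoring genes does not, because every random set scores at least as high.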

Page 17: Gene Set  Analysis  using R and  Bioconductor

Gene Set Enrichment Analysis (GSEA)

Mootha et al. PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes, Nature Genetics, 2003, 34-3

Subramanian et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles, PNAS, 2005, 102-43

Oron et al. Gene set enrichment analysis using linear models and diagnostics, Bioinformatics, 2008, 24-22

Bioconductor package: GSEAlm - Linear Model Toolset for Gene Set Enrichment Analysis

Page 18: Gene Set  Analysis  using R and  Bioconductor

Aims of a Gene Set Enrichment Analysis

• Looking for up- or down-regulated gene sets between two tested classes

• Testing if a gene set of interest is differentially regulated between two tested phenotypes

Page 19: Gene Set  Analysis  using R and  Bioconductor

Testing different Phenotypes

Gene Expression Data and Clinical Data

Sample: Disease Type
S1: Normal breast tissue
S2: Normal breast tissue
S3: Normal breast tissue
S4: Low grade cancer (Luminal A)
S5: Low grade cancer (Luminal A)
S6: Low grade cancer (Luminal A)
S7: High grade cancer (Basal)
S8: High grade cancer (Basal)
S9: High grade cancer (Basal)

Pair-wise tests:
• Normal versus Low grade
• Normal versus High grade
• Low grade versus High grade

Combined tests:
• Normal versus Low/High grade
• Normal/Low grade versus High grade
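The pair-wise and combined tests listed above all reduce to assigning each sample to one of two classes. A minimal plain-Python sketch of that step follows; the function name and the short class names ("Normal", "Low grade", "High grade") are hypothetical shorthand for the disease types in the sample table.

```python
# Sample annotation taken from the table above, with shortened type names.
sample_types = {
    "S1": "Normal", "S2": "Normal", "S3": "Normal",
    "S4": "Low grade", "S5": "Low grade", "S6": "Low grade",
    "S7": "High grade", "S8": "High grade", "S9": "High grade",
}

def two_class_labels(types, group_a, group_b):
    """Label samples in group_a as 0 and samples in group_b as 1.

    Samples whose type is in neither group are left out of the comparison.
    """
    labels = {}
    for sample, disease_type in types.items():
        if disease_type in group_a:
            labels[sample] = 0
        elif disease_type in group_b:
            labels[sample] = 1
    return labels

# Pair-wise test: Normal versus Low grade (High grade samples are dropped).
pairwise = two_class_labels(sample_types, {"Normal"}, {"Low grade"})

# Combined test: Normal versus Low/High grade (all nine samples are used).
combined = two_class_labels(sample_types, {"Normal"}, {"Low grade", "High grade"})
```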

Page 20: Gene Set  Analysis  using R and  Bioconductor

Gene Set Enrichment Analysis (GSEA)

I.) Rank the genes according to differential expression using a t-test or linear models
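The ranking step can be sketched as follows. This is a minimal plain-Python illustration that uses Welch's t-statistic as the measure of differential expression; it is not the GSEAlm implementation, and the function names and expression values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t-statistic for two groups of expression values."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

def rank_genes(expression, labels):
    """Rank genes from most up-regulated to most down-regulated in class 1.

    expression: {gene: [value per sample]}
    labels:     list of 0/1 class labels, one per sample
    Returns a list of (gene, t_statistic) sorted by decreasing t.
    """
    stats = {}
    for gene, values in expression.items():
        class0 = [v for v, label in zip(values, labels) if label == 0]
        class1 = [v for v, label in zip(values, labels) if label == 1]
        stats[gene] = welch_t(class1, class0)
    return sorted(stats.items(), key=lambda item: item[1], reverse=True)

# Hypothetical data: three genes measured in three samples per class.
labels = [0, 0, 0, 1, 1, 1]
expression = {
    "flat": [2.0, 2.1, 1.9, 2.0, 1.9, 2.1],
    "down": [3.0, 3.2, 2.9, 1.0, 1.1, 0.8],
    "up":   [1.0, 1.2, 0.9, 3.0, 3.1, 2.8],
}
ranked = rank_genes(expression, labels)
```

The resulting list places the up-regulated gene first and the down-regulated gene last, which is the ordering the later steps walk through.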

Page 21: Gene Set  Analysis  using R and  Bioconductor

Gene Set Enrichment Analysis (GSEA)

II.) Include gene set membership information

Page 22: Gene Set  Analysis  using R and  Bioconductor

Enrichment Score (ES)

• reflects the degree to which a set S is overrepresented at the extremes of the entire ranked list L.

• The score is calculated by walking down the list L, increasing a running sum when a gene in S is encountered and decreasing it otherwise

• The enrichment score is the maximum deviation from zero encountered in the random walk

• It corresponds to a weighted Kolmogorov–Smirnov-like statistic
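The running-sum calculation described in these bullets can be sketched as below. This is a minimal plain-Python illustration of the weighted Kolmogorov-Smirnov-like statistic, assuming a ranked list of (gene, score) pairs; the function name and the toy list are hypothetical. With weight=1 each hit is weighted by the absolute value of its score, and with weight=0 all hits count equally.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """Enrichment score of gene_set for a ranked list of (gene, score) pairs.

    Walks down the list, stepping up at genes in the set (in proportion to
    |score| ** weight) and down at all other genes, and returns the signed
    maximum deviation of the running sum from zero.
    """
    hits = [gene in gene_set for gene, _ in ranked]
    n_miss = len(ranked) - sum(hits)
    hit_total = sum(abs(score) ** weight
                    for (_, score), is_hit in zip(ranked, hits) if is_hit)
    if n_miss == 0 or hit_total == 0:
        raise ValueError("need set members with non-zero weight and non-members")
    running, best = 0.0, 0.0
    for (_, score), is_hit in zip(ranked, hits):
        if is_hit:
            running += abs(score) ** weight / hit_total
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

# Hypothetical ranked list, sorted by decreasing score.
ranked = [("a", 3.0), ("b", 2.0), ("c", 1.0), ("d", -1.0), ("e", -2.0), ("f", -3.0)]
es_top = enrichment_score(ranked, {"a", "b"})     # set at the top of the list
es_bottom = enrichment_score(ranked, {"e", "f"})  # set at the bottom of the list
```

A set concentrated at the top of the list yields a positive score and a set concentrated at the bottom yields a negative one.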

Page 23: Gene Set  Analysis  using R and  Bioconductor

Gene Set Enrichment Analysis (GSEA)

Page 24: Gene Set  Analysis  using R and  Bioconductor

Subramanian A et al. PNAS 2005;102:15545-15550

Page 25: Gene Set  Analysis  using R and  Bioconductor

Permutation test to estimate the significance

• The significance of the ES has to be estimated

• Class label permutation versus gene label permutation

• Calculate the ES of the gene set for each permutation of the data; together these values form a null distribution for the ES

• The empirical, nominal P value of the observed ES is then calculated relative to this null distribution
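Class label permutation can be sketched as follows. This is a minimal plain-Python illustration, not the GSEA implementation: the toy score function (a difference of class means) merely stands in for the ES, the p-value is taken against the whole null distribution rather than only the part with the same sign as the observed score, and all names and data are hypothetical.

```python
import random

def mean_shift(data, labels):
    """Toy stand-in for the ES: mean set expression in class 1 minus class 0."""
    per_sample = [sum(column) / len(column) for column in zip(*data)]
    class0 = [v for v, label in zip(per_sample, labels) if label == 0]
    class1 = [v for v, label in zip(per_sample, labels) if label == 1]
    return sum(class1) / len(class1) - sum(class0) / len(class0)

def permutation_p_value(score_fn, data, labels, n_perm=1000, seed=0):
    """Nominal p-value of score_fn(data, labels) under class label permutation."""
    rng = random.Random(seed)
    observed = score_fn(data, labels)
    shuffled = list(labels)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)                     # permute the class labels
        null.append(score_fn(data, shuffled))     # score under the null
    if observed >= 0:
        extreme = sum(1 for s in null if s >= observed)
    else:
        extreme = sum(1 for s in null if s <= observed)
    return observed, (extreme + 1) / (n_perm + 1)

# Hypothetical expression of two gene set members across eight samples.
set_expression = [
    [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1],
    [2.0, 2.1, 1.8, 2.2, 6.0, 6.3, 5.8, 6.1],
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
observed, p_value = permutation_p_value(mean_shift, set_expression, labels)
```

Gene label permutation would instead keep the class labels fixed and draw random gene sets of the same size.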

Page 26: Gene Set  Analysis  using R and  Bioconductor

Gene Set Enrichment Analysis (GSEA)

Page 27: Gene Set  Analysis  using R and  Bioconductor

Interpretation of the results

Page 28: Gene Set  Analysis  using R and  Bioconductor

Correction for Multiple Testing

• When an entire database of gene sets is evaluated, we have to adjust the estimated significance level to account for multiple hypothesis testing

• Control the false discovery rate (FDR)

• The FDR is the estimated probability that a set with a given ES represents a false positive finding
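One common way to control the FDR across many gene sets is the Benjamini-Hochberg procedure, sketched below in plain Python. This is a swapped-in, simpler technique, not the GSEA procedure, which estimates the FDR from normalized enrichment scores and the permutation null distribution. The function name and p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by increasing p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted values
    # never decrease as the raw p-values increase.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

# Hypothetical nominal p-values for four gene sets.
nominal = [0.01, 0.04, 0.03, 0.20]
fdr = benjamini_hochberg(nominal)
```

Gene sets whose adjusted value falls below a chosen threshold are then reported as significant.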

Page 29: Gene Set  Analysis  using R and  Bioconductor

Interpretation of the Results

Page 30: Gene Set  Analysis  using R and  Bioconductor

Tutorial