Gene Set Enrichment Analysis Dr. Vered Caspi Head, Bioinformatics Core Facility Ben-Gurion University of the Negev Advanced Bioinformatics Course, Weizmann Institute of Science April 14th, 2010

Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter



Gene Set Enrichment Analysis

Dr. Vered Caspi

Head, Bioinformatics Core Facility

Ben-Gurion University of the Negev

Advanced Bioinformatics Course, Weizmann Institute of Science

April 14th, 2010


Gene Expression Matrix


Figure: experimental design. Samples 01 to 25, grouped by tissue (cerebrum, cerebellum, astrocyte, heart) and by type (normal, Down syndrome).

Usually, pairwise comparisons between groups of samples are performed, e.g. sick vs. healthy, or sick cerebellum vs. healthy cerebellum.

Slide by Vered Caspi


This is done with a statistical test, e.g. ANOVA, which yields a fold change and a significance p-value for each pairwise comparison (also called a “contrast”).
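A minimal sketch of one contrast for a single gene. The slides name ANOVA; to stay within the Python standard library, this sketch swaps in an exact two-group permutation test on the difference of means, which is a different test. The function name `contrast` and the log2 intensity values are hypothetical.

```python
from itertools import combinations
from statistics import mean

def contrast(group_a, group_b):
    """Return (fold change, exact permutation p-value) for log2 expression values."""
    observed = mean(group_a) - mean(group_b)   # difference of log2 means
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = total = 0
    # Enumerate every way of relabelling the pooled samples into two groups.
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Two-sided: count relabellings at least as extreme as the observed one.
        if abs(mean(a) - mean(b)) >= abs(observed) - 1e-12:
            extreme += 1
    fold_change = 2 ** observed                # back from log2 to a ratio
    return fold_change, extreme / total

# Hypothetical log2 intensities for one gene: 4 sick vs. 4 healthy samples.
sick = [9.1, 9.3, 9.0, 9.2]
healthy = [8.0, 8.1, 7.9, 8.2]
fc, p = contrast(sick, healthy)
print(f"fold change = {fc:.2f}, p = {p:.3f}")
```

With only a few samples per group the number of possible relabellings is small, so the smallest attainable p-value is limited (here 2 out of 70 relabellings).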


Generation of gene lists

Cutoff: p-value < 0.01 AND |fold change| > 1.3

List                                No. of genes
Down - Normal                       196
Cerebrum: Down - Normal             193
Cerebellum: Down - Normal           564
Astrocyte: Down - Normal            296
Heart: Down - Normal                269
No. of genes in any of the lists    1228
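The cutoff can be sketched as a simple filter over per-gene results. This sketch assumes results are stored as a log2 ratio and a p-value per gene, and that fold changes are signed (e.g. -2.0 for a halving), which is why the cutoff uses the absolute value. Gene names, values and function names are hypothetical. Genes that pass in more than one contrast are counted once in the union, which is why the number of genes in any of the lists (1228) is smaller than the sum of the individual list sizes.

```python
def signed_fold_change(log2_ratio):
    """Convert a log2 ratio into a signed fold change (e.g. -2.0 for a halving)."""
    ratio = 2 ** abs(log2_ratio)
    return ratio if log2_ratio >= 0 else -ratio

def gene_list(results, p_cutoff=0.01, fc_cutoff=1.3):
    """results maps gene -> (log2 ratio, p-value); keep genes passing both cutoffs."""
    return {
        gene
        for gene, (log2_ratio, p) in results.items()
        if p < p_cutoff and abs(signed_fold_change(log2_ratio)) > fc_cutoff
    }

# Hypothetical results for two contrasts.
cerebrum = {"GENE_A": (0.9, 0.001), "GENE_B": (0.2, 0.0005), "GENE_C": (-1.2, 0.004)}
heart = {"GENE_A": (0.1, 0.5), "GENE_C": (-0.8, 0.002), "GENE_D": (1.5, 0.03)}

lists = {"Cerebrum": gene_list(cerebrum), "Heart": gene_list(heart)}
in_any_list = set().union(*lists.values())   # genes shared by lists count once
print({name: sorted(genes) for name, genes in lists.items()})
print("No. of genes in any of the lists:", len(in_any_list))
```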


Venn diagram


• If the experiment contains more than two treatment groups, an alternative way to get lists of “interesting genes” is clustering analysis.


Clustering

• Partitioning (K-means, Self-Organizing Maps)

• Hierarchical

Source: Ben Bolstad, Biostatistics, University of California, Berkeley, www.stat.berkeley.edu/~bolstad


Figure: expression profiles of a group of genes. Y axis: normalized expression (log scale, ticks at 100 and 1000). X axis: samples WT.1 WT.2 WT.3 KO.1 KO.2 KO.3 OE.1 OE.2 OE.3.

Genes that show a similar expression pattern across the treatments are called “coexpressed genes”.
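One common way to quantify "similar expression pattern" is the Pearson correlation between log-scale profiles. The sketch below uses that measure as an assumption (the slides do not name one). The expression values and gene names are hypothetical, ordered as WT.1 to OE.3.

```python
from math import log2, sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical normalized expression: WT.1-3, KO.1-3, OE.1-3.
profiles = {
    "GENE_A": [100, 110, 105, 50, 55, 52, 400, 420, 410],
    "GENE_B": [200, 210, 190, 95, 100, 105, 820, 800, 790],
    "GENE_C": [300, 310, 305, 600, 590, 610, 150, 160, 155],
}
# Work on the log scale, as in the plotted profiles.
log_profiles = {g: [log2(v) for v in vals] for g, vals in profiles.items()}

r_ab = pearson(log_profiles["GENE_A"], log_profiles["GENE_B"])
r_ac = pearson(log_profiles["GENE_A"], log_profiles["GENE_C"])
print(f"A vs B: r = {r_ab:.2f}; A vs C: r = {r_ac:.2f}")
```

GENE_A and GENE_B follow the same pattern at different absolute levels, so they correlate strongly; GENE_C follows the opposite pattern and correlates negatively.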


The gene expression pattern may be viewed as a line graph (profile), as in the previous slide, or as a heat map, as shown below.

Left – two treatments.

Right – more than two treatments.


• Given a group of coexpressed genes, we may now ask: what is the common biological theme shared among the genes in the group?

• Examples of biological themes:

– Metabolic pathway

– Signal transduction pathway

– Subcellular localization

– Protein complex

– Modulated by the same transcription factor(s)

– Targets of the same miRNA

– Chromosomal location

– And more…


• Do you expect genes in the same biological pathway, or the same complex, to all change in the same direction (up or down regulation) upon a certain treatment?


Figure: behavior of Arabidopsis genes of two GO terms upon dedifferentiation. Panels: “Response to light intensity” and “Anatomical structure formation involved in morphogenesis”.


Figure: expression profiles. Y axis: normalized expression (log scale, ticks at 100 and 1000). X axis: samples WT.1 WT.2 WT.3 KO.1 KO.2 KO.3 OE.1 OE.2 OE.3.

Reminder: a group of coexpressed genes


The problem: given a group of genes and a collection of gene sets, find the enriched gene sets.


Two questions

• How shall the enrichment be calculated?

• Which gene set collections shall be used?


http://www.broadinstitute.org/gsea/


Gene set collection: MSigDB


Example of a curated gene set:


You may also create your own gene sets…


Enrichment analysis using hypergeometric test: “compute overlaps”


MSigDB Analysis: “compute overlaps”

251 genes

Example


Another example:


Another example


Overlap matrix by gene and gene set


Enrichment analysis

• The hypergeometric test is appropriate when we have a group of genes (out of all genes) for which we wish to test enrichment of a certain gene set.

• These may be co-expressed genes from microarray cluster analysis.
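A minimal sketch of the hypergeometric enrichment p-value: the probability of seeing at least the observed overlap between a gene group and a gene set if the group were drawn at random from all genes. The function and parameter names and the example numbers are hypothetical, not taken from the slides.

```python
from math import comb

def hypergeom_enrichment_p(n_total, n_in_set, n_group, n_overlap):
    """P(overlap >= n_overlap) when n_group genes are drawn at random
    from n_total genes, of which n_in_set belong to the gene set."""
    upper = min(n_in_set, n_group)   # largest overlap that is possible
    tail = sum(
        comb(n_in_set, k) * comb(n_total - n_in_set, n_group - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(n_total, n_group)

# Hypothetical numbers: 20000 genes in total, a gene set of 100 genes,
# a co-expressed group of 250 genes, 8 of which fall in the gene set.
p = hypergeom_enrichment_p(n_total=20000, n_in_set=100, n_group=250, n_overlap=8)
print(f"enrichment p-value = {p:.2e}")
```

By chance, about 250 × 100 / 20000 = 1.25 genes of the group would be expected in the set, so an overlap of 8 gives a small p-value. When many gene sets are tested, the p-values still need a multiple-testing adjustment.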


Hypergeometric test:


The problem: given a group of genes and a collection of gene sets, find the enriched gene sets.


DAVID Bioinformatics


Enrichment analysis considering continuous data

• But now consider the following case:


Enrichment analysis

• In this case we only compared two treatments, and defining a group of “differentially expressed genes” requires setting an arbitrary cutoff.

• GSEA allows us to overcome this limitation.


Rank the gene list according to:

• Genes’ differential expression with respect to two phenotypes

or

• Genes’ correlation with a predefined expression profile
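A sketch of the ranking step for the two-phenotype case. It assumes a signal-to-noise metric (difference of class means divided by the sum of class standard deviations), which is one common choice; other metrics such as a t-statistic could be used instead. The gene names and log2 expression values are hypothetical.

```python
from statistics import mean, stdev

def signal_to_noise(class_a, class_b):
    """Difference of class means scaled by the sum of class standard deviations."""
    return (mean(class_a) - mean(class_b)) / (stdev(class_a) + stdev(class_b))

# Hypothetical log2 expression: (phenotype A samples, phenotype B samples).
expression = {
    "GENE_A": ([9.0, 9.2, 9.1], [7.0, 7.1, 6.9]),
    "GENE_B": ([8.0, 8.3, 7.9], [8.1, 8.0, 8.2]),
    "GENE_C": ([6.0, 6.1, 5.9], [8.0, 8.2, 8.1]),
}

scores = {gene: signal_to_noise(a, b) for gene, (a, b) in expression.items()}
# Ranked list L: most up in phenotype A first, most up in phenotype B last.
ranked = sorted(scores, key=scores.get, reverse=True)
for gene in ranked:
    print(f"{gene}\t{scores[gene]:+.2f}")
```

No cutoff is applied: every gene keeps its place in the ranked list, which is what lets GSEA avoid the arbitrary threshold discussed earlier.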


For each gene set (S), mark the location of the genes from set S within the sorted gene list.


Plot the running sum for S in the dataset, including the location of the maximum enrichment score (ES) and the leading-edge subset.


Calculation of an Enrichment Score.

We calculate an enrichment score (ES) that reflects the degree to which a set S is overrepresented at the extremes (top or bottom) of the entire ranked list L.

The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter genes not in S.

The magnitude of the increment depends on the correlation of the gene with the phenotype.

The enrichment score is the maximum deviation from zero encountered in the random walk; it corresponds to a weighted Kolmogorov–Smirnov-like statistic.
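The walk described above can be sketched as follows. The sketch assumes that each hit adds |score|^p divided by the sum of |score|^p over all genes in S, and that each miss subtracts 1 divided by the number of genes not in S, so the running sum returns to zero at the end of the list. The exponent `p`, the function name and the toy scores are assumptions for illustration.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Return (ES, running sum) for gene_set over a ranked gene list.

    ranked_genes: genes ordered from most positive to most negative score.
    scores: gene -> correlation with the phenotype.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_miss = len(ranked_genes) - sum(hits)
    hit_norm = sum(abs(scores[g]) ** p for g, h in zip(ranked_genes, hits) if h)
    if n_miss == 0 or hit_norm == 0:
        raise ValueError("need genes both inside and outside S, with non-zero scores")
    running, total = [], 0.0
    for gene, is_hit in zip(ranked_genes, hits):
        if is_hit:
            total += abs(scores[gene]) ** p / hit_norm   # weighted step up
        else:
            total -= 1.0 / n_miss                        # uniform step down
        running.append(total)
    es = max(running, key=abs)   # maximum deviation from zero, sign kept
    return es, running

# Toy ranked list with hypothetical scores.
scores = {"g1": 3.0, "g2": 2.0, "g3": 1.0, "g4": -1.0, "g5": -2.0, "g6": -3.0}
ranked = sorted(scores, key=scores.get, reverse=True)

es_top, walk_top = enrichment_score(ranked, scores, {"g1", "g2"})
es_bottom, walk_bottom = enrichment_score(ranked, scores, {"g5", "g6"})
print("set at the top:   ", es_top)
print("set at the bottom:", es_bottom)
```

A set concentrated at the top of the list gives a positive ES and a set concentrated at the bottom gives a negative ES. With `p=0` every hit counts equally, which reduces the statistic to the unweighted Kolmogorov–Smirnov form.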


Estimation of Significance Level of ES.

We permute the phenotype labels and recompute the ES of the gene set for the permuted data, which generates a null distribution for the ES. The empirical, nominal P value of the observed ES is then calculated relative to this null distribution. Importantly, the permutation of class labels preserves gene-gene correlations and, thus, provides a more biologically reasonable assessment of significance than would be obtained by permuting genes among the gene sets.
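A simplified sketch of the label-permutation procedure. It makes several assumptions not stated on the slide: genes are scored by the difference of class means, the ES uses the weighted running sum with exponent 1, and the nominal p-value is taken from the part of the null distribution that has the same sign as the observed ES. The data, names and the number of permutations are hypothetical.

```python
import random
from statistics import mean

def rank_scores(expr, labels):
    """Score each gene by the difference of class means (class 1 minus class 0)."""
    scores = {}
    for gene, values in expr.items():
        ones = [v for v, lab in zip(values, labels) if lab == 1]
        zeros = [v for v, lab in zip(values, labels) if lab == 0]
        scores[gene] = mean(ones) - mean(zeros)
    return scores

def enrichment_score(scores, gene_set):
    """Weighted running-sum ES over genes sorted by decreasing score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Fall back to 1.0 if every gene in the set scores exactly zero.
    hit_norm = sum(abs(scores[g]) for g in ranked if g in gene_set) or 1.0
    n_miss = sum(1 for g in ranked if g not in gene_set)
    total = best = 0.0
    for g in ranked:
        total += abs(scores[g]) / hit_norm if g in gene_set else -1.0 / n_miss
        if abs(total) > abs(best):
            best = total
    return best

def nominal_p_value(expr, labels, gene_set, n_perm=200, seed=0):
    """Permute phenotype labels to build a null distribution for the ES."""
    rng = random.Random(seed)
    observed = enrichment_score(rank_scores(expr, labels), gene_set)
    null = []
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)   # samples stay intact, so gene-gene correlation is kept
        null.append(enrichment_score(rank_scores(expr, shuffled), gene_set))
    same_sign = [es for es in null if (es >= 0) == (observed >= 0)]
    if not same_sign:
        return observed, float("nan")
    extreme = sum(1 for es in same_sign if abs(es) >= abs(observed) - 1e-12)
    return observed, extreme / len(same_sign)

# Hypothetical log2 expression for 8 samples: 4 of class 1, then 4 of class 0.
expr = {
    "S1": [5.0, 5.2, 5.1, 4.9, 3.0, 3.1, 2.9, 3.2],
    "S2": [6.1, 6.0, 6.3, 5.9, 4.0, 4.2, 3.9, 4.1],
    "S3": [7.0, 7.2, 6.9, 7.1, 5.5, 5.4, 5.6, 5.3],
    "N1": [4.0, 4.3, 3.9, 4.1, 4.2, 4.0, 4.1, 3.8],
    "N2": [8.0, 7.9, 8.2, 8.1, 8.1, 8.0, 7.8, 8.3],
    "N3": [2.0, 2.2, 1.9, 2.1, 2.1, 2.0, 2.3, 1.8],
    "N4": [6.5, 6.4, 6.6, 6.5, 6.4, 6.7, 6.5, 6.6],
}
labels = [1, 1, 1, 1, 0, 0, 0, 0]
observed_es, p_value = nominal_p_value(expr, labels, {"S1", "S2", "S3"})
print(f"ES = {observed_es:.2f}, nominal p = {p_value:.3f}")
```

Only whole samples are relabelled, so each permuted dataset keeps the same correlations between genes as the real one. With 4 samples per class there are only 70 distinct labellings, so the p-value in such a small design is necessarily coarse.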

Adjustment for Multiple Hypothesis Testing.

We first normalize the ES for each gene set to account for the size of the set, yielding a normalized enrichment score (NES).

We then control the proportion of false positives by calculating the false discovery rate (FDR) corresponding to each NES. The FDR is the estimated probability that a set with a given NES represents a false positive finding; it is computed by comparing the tails of the observed and null distributions for the NES.
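A simplified sketch of the two steps. It assumes that an ES is normalized by dividing it by the mean of the same-signed null ES values of its own gene set, and that the FDR for a given NES is the ratio of the null tail to the observed tail beyond that NES, with positive and negative scores handled separately. The ES values, null values and function names are hypothetical.

```python
from statistics import mean

def normalize(es, null_es):
    """Divide an ES by the mean magnitude of the same-signed null ES values
    of the same gene set (requires at least one such null value)."""
    same_sign = [abs(x) for x in null_es if (x >= 0) == (es >= 0)]
    return es / mean(same_sign)

def fdr(nes_star, observed_nes, null_nes):
    """Ratio of null to observed tail fractions beyond nes_star.
    nes_star is expected to be one of the values in observed_nes."""
    positive = nes_star >= 0

    def tail(values):
        side = [v for v in values if (v >= 0) == positive]
        beyond = [v for v in side if (v >= nes_star if positive else v <= nes_star)]
        return len(beyond) / len(side) if side else 0.0

    return min(1.0, tail(null_nes) / tail(observed_nes))

# Hypothetical observed ES and permutation null ES values for three gene sets.
observed_es = {"SET_A": 0.8, "SET_B": 0.3, "SET_C": -0.4}
null_es = {
    "SET_A": [0.2, -0.3, 0.4, 0.3, -0.2, 0.3],
    "SET_B": [0.3, -0.2, 0.2, 0.4, -0.3, 0.3],
    "SET_C": [0.2, -0.4, 0.3, -0.2, 0.1, -0.3],
}

observed_nes = {s: normalize(es, null_es[s]) for s, es in observed_es.items()}
# Normalize every null ES the same way, then pool them across gene sets.
pooled_null_nes = [normalize(x, null_es[s]) for s in null_es for x in null_es[s]]

for s, nes in observed_nes.items():
    q = fdr(nes, list(observed_nes.values()), pooled_null_nes)
    print(f"{s}: NES = {nes:+.2f}, FDR = {q:.2f}")
```

Normalizing by the null mean puts gene sets of different sizes on a common scale, which is what makes it meaningful to pool their null scores when comparing tails.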
