Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter

Gene Set Enrichment Analysis

Dr. Vered Caspi

Head, Bioinformatics Core Facility

Ben-Gurion University of the Negev

Advanced Bioinformatics Course, Weizmann Institute of Science

April 14th , 2010

Gene Expression Matrix

Astrocyte

Normal

DownSynd.

Type

Tissue

01 02

Cerebellum Cerebrum

17 18 19

03 04 0506 07 08 09

10 11 12

Slide by Vered Caspi

Usually, pairwise comparisons between groups of samples are performed. e.g. sick vs. healthy, or sick cerebellum vs. healthy

cerebellum, etc.

Heכ"בדart

13 14

24 2515 16 20 21 22 23

This is done by a statistical test, e.g. ANOVA, and yields fold of change and significance p-value for each pairwise comparison (also called “contrast”).

List No. of genes

Down - Normal 196

Cerebrum: Down - Normal 193

Cerebellum: Down - Normal 564

Astrocyte: Down - Normal 296

Heart: Down - Normal 269

Generation of gene lists

Cutoff: p-value < 0.01 AND |fold change| > 1.3

No. of genes in any of the lists

1228

Venn diagram

• If the experiment contains more than two

treatment groups, an alternative way to get

lists of “interesting genes” is clustering

analysis.

Clustering

Partitioning(K-means

Self Organizing MapsClick)

Hierarchical

www.stat.berkeley.edu/~bolstadBen Bolstad, Biostatistics, University of California, Berkeley,

http://www.stat.berkeley.edu/~bolstad

100

1000

No

rma

lize

d e

xp

ressio

n (

log

sca

le)

WT.1 WT.2 WT.3 KO.1 KO.2 KO.3 OE.1 OE.2 OE.3

Genes which show a similar expression pattern across the treatments are calld

“coexpressed genes”

The gene expression pattern may be viewed as line graph (profile), as in the previous

slide, or as a heat map as shown below.

Left – two treatments.

Right – more than two treatments

http://www.pnas.org/content/vol102/issue43/images/large/zpq0370595180001.jpeg

• Given a group of coexpressed genes, we may now ask: what is the common

biological theme shared among the genes in the group?

• Examples for biological themes:

– Metabolic pathway

– Signal transduction pathway

– Subcellular localization

– Protein complex

– Modulated by the same transcription factor(s)

– Are targets for the same miRNA

– Chromosomal location

– And more…

http://images.google.com/imgres?imgurl=http://www.hofesh.org.il/articles/science/ecosphere/2/12_chromosome.jpg&imgrefurl=http://www.hofesh.org.il/articles/science/ecosphere/2/ecosphere_2.html&h=322&w=363&sz=32&hl=iw&start=27&um=1&tbnid=cQJgaBLVlluW5M:&tbnh=107&tbnw=121&prev=/images%3Fq%3Dchromosome%26start%3D20%26ndsp%3D20%26svnum%3D10%26um%3D1%26hl%3Diw%26rls%3Dcom.microsoft:en-US%26sa%3DN

• Do you expect genes in the same biological

pathway, or same complex, to all change in

the same direction (up or down regulation)

upon a certain treatment?

Response to light intensityAnatomical structure

formation involved in

morphogenesis

Behavior of Arabidopsis genes of two GO terms upon dedifferentiation

100

1000

No

rmal

ized

ex

pre

ssio

n (

log

sca

le)

WT.1 WT.2 WT.3 KO.1 KO.2 KO.3 OE.1 OE.2 OE.3

Reminder: a group of coexpressed genes

The problem: Given a group of genes and a collection of gene sets: find enriched gene sets

Two questions

• How shall the enrichment be calculated?

• Which data set collections shall be used?

http://www.broadinstitute.org/gsea/

http://www.broadinstitute.org/gsea/

Data set collection: MSigDB

Example for a curated gene set:

You may also create your own gene sets…

Enrichment analysis using hypergeometric test: “compute overlaps”

MSigDB Analysis: “compute overlaps”

251 genes

Example

Another example:

Another example

Overlap matrix by gene and gene set

Enrichment analysis

• Hypergeometric test is adequate when we have a

group of genes (out of all genes) for which we

wish to test enrichment of a certain gene set.

• These may be co-expressed genes from microarray

cluster analysis

Hypergeometric test:

genes

The problem: Given a group of genes and a collection of gene sets: find enriched gene sets

David Bioinformatics))

Enrichment analysis considering

continuous data

• But now consider the following case:

Enrichment analysis

• In this case we only compared two treatments, and

defining a “differentially expressed genes” group

requires setting an arbitrary cutoff.

• GSEA allow us to overcome this limitation.

Rank the gene list according to:

Genes’ differential expression with respect to two

phenotypes

Or

Genes’ correlation with a predefined expression

profile


For each gene set (S),

Mark the location of the genes from set S within

the sorted gene list


Plot the running sum for S in the dataset, including the location of

the maximum enrichment score (ES) and the leading-edge subset


Calculation of an Enrichment Score.

We calculate an enrichment score (ES) that reflects the

degree to which a set S is overrepresented at the extremes

(top or bottom) of the entire ranked list L.

The score is calculated by walking down the list L,

increasing a running-sum statistic when we encounter a

gene in S and decreasing it when we encounter genes not in

S.

The magnitude of the increment depends on the correlation

of the gene with the phenotype.

The enrichment score is the maximum deviation from zero

encountered in the random walk; it corresponds to a

weighted Kolmogorov–Smirnov-like statistic


Estimation of Significance Level of ES.

We permute the phenotype labels and recompute the ES of the gene set for the

permuted data, which generates a null distribution for the ES. The empirical,

nominal P value of the observed ES is then calculated relative to this null

distribution. Importantly, the permutation of class labels preserves gene-gene

correlations and, thus, provides a more biologically reasonable assessment of

significance than would be obtained by permuting genes among the gene sets.

Adjustment for Multiple Hypothesis Testing.

We first normalize the ES for each gene set to account for the size of the set,

yielding a normalized enrichment score (NES).

We then control the proportion of false positives by calculating the false discovery

rate (FDR) corresponding to each NES. The FDR is the estimated probability that a

set with a given NES represents a false positive finding; it is computed by

comparing the tails of the observed and null distributions for the NES.

Documents

Gene Set Enrichment Analysis · The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter