Upload
others
View
5
Download
0
Embed Size (px)
Citation preview
Gene Set Enrichment Analysis
Dr. Vered Caspi
Head, Bioinformatics Core Facility
Ben-Gurion University of the Negev
Advanced Bioinformatics Course, Weizmann Institute of Science
April 14th , 2010
Gene Expression Matrix
Astrocyte
Normal
DownSynd.
Type
Tissue
01 02
Cerebellum Cerebrum
17 18 19
03 04 0506 07 08 09
10 11 12
Slide by Vered Caspi
Usually, pairwise comparisons between groups of samples are performed. e.g. sick vs. healthy, or sick cerebellum vs. healthy
cerebellum, etc.
Heכ"בדart
13 14
24 2515 16 20 21 22 23
This is done by a statistical test, e.g. ANOVA, and yields fold of change and significance p-value for each pairwise comparison (also called “contrast”).
List No. of genes
Down - Normal 196
Cerebrum: Down - Normal 193
Cerebellum: Down - Normal 564
Astrocyte: Down - Normal 296
Heart: Down - Normal 269
Generation of gene lists
Cutoff: p-value < 0.01 AND |fold change| > 1.3
No. of genes in any of the lists
1228
Venn diagram
• If the experiment contains more than two
treatment groups, an alternative way to get
lists of “interesting genes” is clustering
analysis.
Clustering
Partitioning(K-means
Self Organizing MapsClick)
Hierarchical
www.stat.berkeley.edu/~bolstadBen Bolstad, Biostatistics, University of California, Berkeley,
100
1000
No
rma
lize
d e
xp
ressio
n (
log
sca
le)
WT.1 WT.2 WT.3 KO.1 KO.2 KO.3 OE.1 OE.2 OE.3
Genes which show a similar expression pattern across the treatments are calld
“coexpressed genes”
The gene expression pattern may be viewed as line graph (profile), as in the previous
slide, or as a heat map as shown below.
Left – two treatments.
Right – more than two treatments
• Given a group of coexpressed genes, we may now ask: what is the common
biological theme shared among the genes in the group?
• Examples for biological themes:
– Metabolic pathway
– Signal transduction pathway
– Subcellular localization
– Protein complex
– Modulated by the same transcription factor(s)
– Are targets for the same miRNA
– Chromosomal location
– And more…
• Do you expect genes in the same biological
pathway, or same complex, to all change in
the same direction (up or down regulation)
upon a certain treatment?
Response to light intensityAnatomical structure
formation involved in
morphogenesis
Behavior of Arabidopsis genes of two GO terms upon dedifferentiation
100
1000
No
rmal
ized
ex
pre
ssio
n (
log
sca
le)
WT.1 WT.2 WT.3 KO.1 KO.2 KO.3 OE.1 OE.2 OE.3
Reminder: a group of coexpressed genes
The problem: Given a group of genes and a collection of gene sets: find enriched gene sets
Two questions
• How shall the enrichment be calculated?
• Which data set collections shall be used?
Data set collection: MSigDB
Example for a curated gene set:
You may also create your own gene sets…
Enrichment analysis using hypergeometric test: “compute overlaps”
MSigDB Analysis: “compute overlaps”
251 genes
Example
Another example:
Another example
Overlap matrix by gene and gene set
Enrichment analysis
• Hypergeometric test is adequate when we have a
group of genes (out of all genes) for which we
wish to test enrichment of a certain gene set.
• These may be co-expressed genes from microarray
cluster analysis
Hypergeometric test:
genes
The problem: Given a group of genes and a collection of gene sets: find enriched gene sets
David Bioinformatics))
Enrichment analysis considering
continuous data
• But now consider the following case:
Enrichment analysis
• In this case we only compared two treatments, and
defining a “differentially expressed genes” group
requires setting an arbitrary cutoff.
• GSEA allow us to overcome this limitation.
Rank the gene list according to:
Genes’ differential expression with respect to two
phenotypes
Or
Genes’ correlation with a predefined expression
profile
For each gene set (S),
Mark the location of the genes from set S within
the sorted gene list
Plot the running sum for S in the dataset, including the location of
the maximum enrichment score (ES) and the leading-edge subset
Calculation of an Enrichment Score.
We calculate an enrichment score (ES) that reflects the
degree to which a set S is overrepresented at the extremes
(top or bottom) of the entire ranked list L.
The score is calculated by walking down the list L,
increasing a running-sum statistic when we encounter a
gene in S and decreasing it when we encounter genes not in
S.
The magnitude of the increment depends on the correlation
of the gene with the phenotype.
The enrichment score is the maximum deviation from zero
encountered in the random walk; it corresponds to a
weighted Kolmogorov–Smirnov-like statistic
Estimation of Significance Level of ES.
We permute the phenotype labels and recompute the ES of the gene set for the
permuted data, which generates a null distribution for the ES. The empirical,
nominal P value of the observed ES is then calculated relative to this null
distribution. Importantly, the permutation of class labels preserves gene-gene
correlations and, thus, provides a more biologically reasonable assessment of
significance than would be obtained by permuting genes among the gene sets.
Adjustment for Multiple Hypothesis Testing.
We first normalize the ES for each gene set to account for the size of the set,
yielding a normalized enrichment score (NES).
We then control the proportion of false positives by calculating the false discovery
rate (FDR) corresponding to each NES. The FDR is the estimated probability that a
set with a given NES represents a false positive finding; it is computed by
comparing the tails of the observed and null distributions for the NES.