Pathway models in cancer genomics
SIAM 2008
Sayan Mukherjee
Department of Statistical Science, Institute for Genome Sciences & Policy, Department of Computer Science
Duke University
August 6, 2008
Outline: Motivation; Gene sets or pathways; Statistical methods; Modeling tumor progression; Conclusion; Thanks
Sayan Mukherjee Pathway models in cancer genomics
Genetic and molecular basis of complex traits
Model complex traits based on high-throughput genomic data using statistical and computational methods.
Complex traits are controlled by variation across many genes and often express variation in phenotype. This heterogeneity is the crux of the difficulty in modeling these traits.
Two sources of heterogeneity:
variation across phenotypes or disease stages;
variation across genes in individual samples.
Cluster or stratify
An obvious solution is to build models
1. Separately for each stage.
2. On genes that are mutated or varying across all individuals.
This gives a huge loss of power, especially in genes:
1. Genes mutated across most individuals: KRAS, TP53, and APC (gene mountains).
2. Most variable genes are mutated in 5% of individuals (gene hills).
Integration of information
Borrow strength
Model in gene set space: integrate genes into a priori known gene sets with putative functional or structural annotation.
Conjoint models across phenotypes: borrow strength across phenotypes.
Modeling principles
Biological systems are often assayed by thousands of variables.
These data lie on or near a low-dimensional manifold and there are strong dependencies between variables.
We build models that are predictive with respect to phenotype but also infer the strong dependencies, which we hope can help infer mechanism.
Questions
1. Which groups of genes (pathways) are involved in all or some stages of progression?
2. What are the pathway dependencies (inferring pathway networks)?
3. For each relevant pathway, what is the gene network (inferring the gene network for the pathway)?
Progression in colon cancer
Pathway dependencies in prostate cancer: benign to primary
[Figure: panels A and B.]
Gene network for ERK pathway
[Figure: gene network for the ERK pathway. Legible node labels include DPM2, NGFB, SOS1, ELK1, RAF1, SHC1, EGFR, MYC, GNB1, MAPK1, GNAS, MAP2K1, MAP2K2, PPP2CA, GRB2, MKNK1, and MKNK2; other labels are truncated in the source.]
Mountains and hills
KRAS, TP53, and APC are hard to target with drugs, and using them as the genetic basis of treatment has proven unsuccessful.
Organize gene hills into biological pathways or sets of genes relevant to cancer initiation: make gene hills into a pathway mountain.
Gene Set Enrichment Analysis (GSEA)
1. Given an expression data set and phenotypes (labels), order the genes by correlation with phenotype.
2. Estimate the gene set's enrichment score with respect to the ordered genes.
3. Assess statistical significance and adjust for multiple hypothesis testing.
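The three GSEA steps can be sketched in a few lines of Python. This is a simplified illustration, not the published GSEA procedure: it assumes an unweighted Kolmogorov-Smirnov style running sum and a permutation of the gene order (rather than of phenotype labels), and it omits the multiple-testing adjustment. The function names `rank_genes`, `enrichment_score`, and `permutation_p_value` are hypothetical.

```python
import random

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (assumes neither is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def rank_genes(expression, labels):
    """Step 1: order genes by correlation with the phenotype, most positive first.

    expression maps a gene name to its expression values across samples.
    """
    return sorted(expression, key=lambda g: pearson(expression[g], labels), reverse=True)

def enrichment_score(ranked_genes, gene_set):
    """Step 2: unweighted running-sum score; returns the extreme value of the walk."""
    in_set = [g in gene_set for g in ranked_genes]
    n_hit = sum(in_set)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for hit in in_set:
        running += 1.0 / n_hit if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

def permutation_p_value(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Step 3 (simplified): compare the observed score with scores of shuffled orderings."""
    rng = random.Random(seed)
    observed = abs(enrichment_score(ranked_genes, gene_set))
    shuffled = list(ranked_genes)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(enrichment_score(shuffled, gene_set)) >= observed:
            count += 1
    # Add-one correction keeps the estimate strictly positive.
    return (count + 1) / (n_perm + 1)
```

A gene set concentrated at the top of the ranking yields a score near +1 and one concentrated at the bottom a score near -1; with many gene sets, the resulting p-values would still need a false discovery rate adjustment.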
Gene set database
The gene sets in the database are defined by
1. Positional gene sets: cytogenetic bands, 3 megabase windows;
2. Motif gene sets: TRANSFAC motifs, representative motifs;
3. Curated gene sets: pathways, literature reviews, animal models, clinical phenotypes, expert curations, chemical or genetic perturbations.
Diabetes
[Figure: GSEA on normals vs. diabetics. Ordered genes are scored against a database of gene sets; the OXPHOS pathway has p-value = 0.007 and pFDR = 4%.]
Analysis of Sample Set Enrichment Scores (ASSESS)
1. GSEA: provides a summary for the enrichment of each gene set with respect to differential expression across the data set.
2. ASSESS: provides a summary statistic $S_{ij}$ for the enrichment of each gene set with respect to differential expression for each observation: $S_{ij}$ is the constitutive differential enrichment in the $i$-th sample of genes in gene set $j$.
Gender
Primary to metastatic
[Figure: gene sets (vertical axis) across primary (p) and metastatic (m) samples.]
Validation of gene sets
Two questions
1. Accuracy of gene sets annotated according to known perturbations: confidence in annotation and in interpretations made based on annotations.
2. Comparison of gene sets defined by experimental studies vs. expert knowledge: importance of context for the interpretation of results.
Experiments or designs with known pathways driving the phenotype are used to start answering these questions.
Subsections: Hierarchical models; Graphical models
Regression or single-task learning
The regression problem:
(1) explanatory variables $X_n = (x_1, \dots, x_n)$ over $n$ samples, each $x_i \in \mathbb{R}^d$ the enrichment of the sample over $d$ pathways;
(2) categorical label variables $Y_n = (y_1, \dots, y_n)$ where $y_i \in \{0, 1\}$, with 0 = less serious stage and 1 = more serious stage.
Find a function $f$ that accurately predicts $y$ given $x$:
$f(x_{new}) \approx y_{new}$.
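One concrete instance of such an $f$ is a ridge-penalized logistic regression on the pathway enrichment vectors. The sketch below is only an illustration under that assumption; the talk does not say which classifier is used at this point, and the names `fit_logistic` and `predict` as well as the step size and penalty are hypothetical choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lam=0.1, step=0.5, n_iter=2000):
    """Fit weights w and intercept b by gradient descent on the mean
    logistic loss plus (lam / 2) * ||w||^2. X is a list of feature rows."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w = [lam * wk for wk in w]   # gradient of the ridge penalty
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wk * xk for wk, xk in zip(w, xi))) - yi
            grad_b += err / n
            for k in range(d):
                grad_w[k] += err * xi[k] / n
        w = [wk - step * g for wk, g in zip(w, grad_w)]
        b -= step * grad_b
    return w, b

def predict(w, b, x):
    """Return 1 (more serious stage) when the fitted probability is at least 0.5."""
    return 1 if sigmoid(b + sum(wk * xk for wk, xk in zip(w, x))) >= 0.5 else 0

# Toy data: two samples per stage, two pathway enrichment scores each.
X_toy = [[0.8, -0.1], [0.6, 0.2], [-0.7, 0.1], [-0.5, -0.3]]
y_toy = [1, 1, 0, 0]
w_toy, b_toy = fit_logistic(X_toy, y_toy)
```

With $d$ much larger than $n$, some form of penalty such as the ridge term here is needed for the fit to be well defined.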
Hierarchical modeling or multi-task learning
Definition (single-task notation): $n_t$ samples $(x_i, y_i)$ with $x_i \in \mathbb{R}^d$ and $y_i \in \{0, 1\}$ for classification. Assume we are working in the $d \gg n_t$ paradigm.
Definition (multi-task learning (MTL) formulation): given $T$ tasks with $t \in \{1, \dots, T\}$,
$F_t(x) = f_0(x) + f_t(x) + \varepsilon$, $\varepsilon \overset{iid}{\sim} \mathrm{No}(0, \sigma^2)$.
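To make the decomposition $F_t = f_0 + f_t$ concrete, here is a minimal sketch that assumes linear functions of a single scalar feature, squared loss, and a ridge penalty on the task-specific parts only. These are illustrative assumptions, not the model fitted in the talk, and `fit_shared_plus_task` is a hypothetical name.

```python
def fit_shared_plus_task(tasks, lam=0.1, step=0.1, n_iter=5000):
    """Fit F_t(x) = (w0 + v[t]) * x by gradient descent.

    tasks is a list of (xs, ys) pairs, one per task. The shared slope w0
    plays the role of f_0 and is unpenalized; each task-specific slope
    v[t] plays the role of f_t and is shrunk toward zero by lam * v[t]**2.
    """
    w0 = 0.0
    v = [0.0] * len(tasks)
    for _ in range(n_iter):
        grad_w0 = 0.0
        grad_v = [0.0] * len(tasks)
        for t, (xs, ys) in enumerate(tasks):
            n = len(xs)
            for x, y in zip(xs, ys):
                residual = (w0 + v[t]) * x - y
                grad_w0 += 2.0 * residual * x / n
                grad_v[t] += 2.0 * residual * x / n
            grad_v[t] += 2.0 * lam * v[t]
        w0 -= step * grad_w0
        v = [vt - step * g for vt, g in zip(v, grad_v)]
    return w0, v

# Two toy tasks with true slopes 1.5 and 0.5 on the same inputs.
xs_toy = [-1.0, -0.5, 0.5, 1.0]
tasks_toy = [(xs_toy, [1.5 * x for x in xs_toy]), (xs_toy, [0.5 * x for x in xs_toy])]
w0_toy, v_toy = fit_shared_plus_task(tasks_toy)
```

Because only the task-specific slopes are penalized, the shared slope absorbs what the tasks have in common and the `v[t]` terms carry the shrunken deviations, which is the sense in which the tasks borrow strength from each other.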
Multi-task learning for tumor progression
Context of tumor progression: $b \mapsto p \mapsto m$:
1. Pathways predictive across all stages $b \mapsto p \mapsto m$: $f_0(x)$;
2. Pathways predictive in the first stage $b \mapsto p$ (Task 1): $f_1(x)$;
3. Pathways predictive in the second stage $p \mapsto m$ (Task 2): $f_2(x)$.
Models of variation in genetics
[Figure: path diagram relating weight at 33 days, gain 0-33 days, external conditions, heredity, weight at birth, rate of growth, condition of dam, size of litter, gestation period, and heredity of dam.]
Inverse regression
Regression is modeling $y = f(x)$ or $p(Y \mid X)$.
Inverse regression is modeling $p(X \mid Y)$, or often $\mathrm{cov}(X \mid Y)$.
Given $D = \{(X_1, Y_1), \dots, (X_n, Y_n)\}$ we estimate a variation of the covariance of the inverse regression,
$\Gamma = h[\mathrm{cov}(X \mid Y)]$, a $d \times d$ matrix.
1. $\Gamma_{ii}$: relevance of pathway $i$ with respect to the label.
2. $\Gamma_{ij}$: covariation of pathways $i$ and $j$ with respect to the label.
More formally
Given $X \in \mathcal{X} \subset \mathbb{R}^p$ and $Y \in \{0, 1\}$,
$D = \{(x_1, y_1), \dots, (x_n, y_n)\} \overset{iid}{\sim} \rho(X, Y)$ with typically $p \gg n$.
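Gradients and outer products
Given a smooth function $f$ the gradient is
$$\nabla f(X) = \left( \frac{\partial f(X)}{\partial X_1}, \dots, \frac{\partial f(X)}{\partial X_d} \right)'.$$
Define the gradient outer product matrix $\Gamma$
$$\Gamma_{ij} = \int_{\mathcal{X}} \frac{\partial f}{\partial x_i}(X) \, \frac{\partial f}{\partial x_j}(X) \, d\rho_X(X), \qquad \Gamma = E[(\nabla f) \otimes (\nabla f)].$$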
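The definition of $\Gamma$ can be checked numerically when $f$ is known. The sketch below assumes $f$ can be evaluated directly, approximates the gradient by central finite differences, and replaces the integral over $\rho_X$ by an average over samples. It is not the penalized-likelihood gradient estimator of the talk, and the names `numerical_gradient` and `gradient_outer_product` are hypothetical.

```python
def numerical_gradient(f, x, h=1e-5):
    """Central finite-difference approximation of the gradient of f at x."""
    grad = []
    for k in range(len(x)):
        up, down = list(x), list(x)
        up[k] += h
        down[k] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad

def gradient_outer_product(f, samples):
    """Average of (grad f)(grad f)' over the samples: an empirical Gamma."""
    d = len(samples[0])
    gamma = [[0.0] * d for _ in range(d)]
    for x in samples:
        g = numerical_gradient(f, x)
        for i in range(d):
            for j in range(d):
                gamma[i][j] += g[i] * g[j] / len(samples)
    return gamma

# Toy check: f depends on the first two of three variables only.
samples_toy = [[0.1 * i, -0.2 * i, 0.3] for i in range(10)]
gamma_toy = gradient_outer_product(lambda x: 2.0 * x[0] - x[1], samples_toy)
```

In the toy check the third row and column of the matrix vanish because $f$ does not depend on the third variable, which is how the diagonal of $\Gamma$ flags irrelevant coordinates.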
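Statistical interpretation
Linear case: $y = \beta' x + \varepsilon$, $\varepsilon \overset{iid}{\sim} \mathrm{No}(0, \sigma^2)$.
$\Omega = \mathrm{cov}(E[X \mid Y])$, $\Sigma_X = \mathrm{cov}(X)$, $\sigma_Y^2 = \mathrm{var}(Y)$.
$$\Gamma = \sigma_Y^2 \left( 1 - \frac{\sigma^2}{\sigma_Y^2} \right)^2 \Sigma_X^{-1} \Omega \Sigma_X^{-1} \approx \sigma_Y^2 \, \Sigma_X^{-1} \Omega \Sigma_X^{-1}.$$
$\Gamma$ and $\Omega$ contain equivalent information.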
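As a worked step that follows directly from the definition of the gradient outer product (it is added here for orientation and is not taken from the slides): in the linear case the gradient is constant, so $\Gamma$ has rank one and points along $\beta$.

```latex
f(x) = \beta' x
\;\Longrightarrow\;
\nabla f(x) = \beta
\;\Longrightarrow\;
\Gamma = E[(\nabla f) \otimes (\nabla f)] = \beta \beta'.
```

The diagonal entry $\Gamma_{ii} = \beta_i^2$ is then the squared regression coefficient of variable $i$, which matches the reading of $\Gamma_{ii}$ as the relevance of pathway $i$.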
Statistical interpretation
For smooth $f(x)$:
$y = f(x) + \varepsilon$, $\varepsilon \overset{iid}{\sim} \mathrm{No}(0, \sigma^2)$.
The interpretation of $\Omega = \mathrm{cov}(X \mid Y)$ is not so clear.
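Nonlinear case
Partition into sections and compute local quantities
$$\mathcal{X} = \bigcup_{i=1}^{I} \chi_i$$
$\Omega_i = \mathrm{cov}(X_{\chi_i} \mid Y_{\chi_i})$
$\Sigma_i = \mathrm{cov}(X_{\chi_i})$
$\sigma_i^2 = \mathrm{var}(Y_{\chi_i})$
$m_i = \rho_X(\chi_i)$.
$$\Gamma \approx \sum_{i=1}^{I} m_i \, \sigma_i^2 \, \Sigma_i^{-1} \Omega_i \Sigma_i^{-1}.$$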
Gauss-Markov graphical models
Given a multivariate normal distribution, $x \in \mathbb{R}^p$,
$$p(x) \propto \exp\left( -\tfrac{1}{2} (x - \mu) \, C^{-1} (x - \mu)' \right).$$
The precision matrix $P = C^{-1}$ is also the conditional independence matrix:
$P_{ij}$ = dependence of $i \leftrightarrow j$ | all other variables.
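A small numerical example shows the difference between marginal and conditional dependence. The sketch assumes a three-variable chain whose precision matrix is tridiagonal, and uses the standard identity that the partial correlation of $i$ and $j$ given the rest is $-P_{ij} / \sqrt{P_{ii} P_{jj}}$. The helper names `invert` and `partial_correlations` are hypothetical.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [value / scale for value in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def partial_correlations(precision):
    """Off-diagonal partial correlations -P_ij / sqrt(P_ii * P_jj); diagonal set to 1."""
    n = len(precision)
    return [[1.0 if i == j else
             -precision[i][j] / (precision[i][i] * precision[j][j]) ** 0.5
             for j in range(n)] for i in range(n)]

# Chain 1 - 2 - 3: variables 1 and 3 interact only through variable 2.
P_chain = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
C_chain = invert(P_chain)                    # covariance matrix
pc_chain = partial_correlations(P_chain)
```

Here the covariance between variables 1 and 3 is nonzero while their partial correlation is zero, so the zero pattern of $P$, not of $C$, gives the edges of the graphical model.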
Gauss-Markov graphical models
By construction $\Gamma$ is a covariance matrix.
So the (pseudo) inverse
$J = \mathrm{inv}(\Gamma)$
is a conditional independence matrix with
$J_{ij}$ = dependence of $i \leftrightarrow j$ | all other $X$ variables and $Y$.
Multi-task learning or hierarchical models
Definition (multi-task learning (MTL) formulation): given $T$ tasks with $t \in \{1, \dots, T\}$,
$F_t(x) = f_0(x) + f_t(x) + \varepsilon$, $\varepsilon \overset{iid}{\sim} \mathrm{No}(0, \sigma^2)$.
Multi-task gradient learning
Estimate not just the functions
$f_0, f_1, \dots, f_T$,
but the gradients as well
$(f_0, \nabla f_0), \{(f_t, \nabla f_t)\}_{t=1}^{T}$.
This provides us with $T + 1$ matrices
1. $\Gamma_0$ is the gradient outer product (GOP) estimate across all the tasks.
2. $\Gamma_1, \dots, \Gamma_T$ are the task-specific GOP estimates.
Estimating the gradient
A penalized likelihood model is used to estimate the gradient given data in the single-task or multi-task setting, $\vec{f}_D$ or $\vec{f}_{D,t}$.
The gradient outer product estimate $\Gamma$ is computed from $\vec{f}_D$.
The gradient inference algorithm requires fewer than $n^2$ parameters and is $O(n^6)$ time and $O(pn)$ memory.
Convergence of gradient
Assume the data is concentrated on an $m$-dimensional manifold $M$ and there exists an isometric embedding $\varphi : M \to \mathbb{R}^d$.
Theorem
Under mild regularity conditions on the distribution and corresponding density, with probability $1 - \delta$
$$\left\| (d\varphi)^* \vec{f}_D - \nabla_M f \right\|_{\rho_X} \le C \log\left( \frac{1}{\delta} \right) n^{-1/m},$$
where $(d\varphi)^*$ is the dual of the map $d\varphi$.
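The practical content of the rate $n^{-1/m}$ is that it depends on the intrinsic dimension $m$ and not on the ambient dimension $d$. The short sketch below only evaluates the sample-size factor of the bound; the constant $C$ and the $\log(1/\delta)$ term are left out, and the function names are hypothetical.

```python
def bound_rate(n, m):
    """Sample-size factor n**(-1/m) of the bound (C and log(1/delta) omitted)."""
    return n ** (-1.0 / m)

def sample_multiplier_to_halve(m):
    """Factor by which n must grow to halve n**(-1/m): (2**m * n)**(-1/m) = n**(-1/m) / 2."""
    return 2 ** m

# The same 1000 samples give a much smaller factor on a 3-dimensional
# manifold than a rate driven by 30 dimensions would.
rate_low_dim = bound_rate(1000, 3)
rate_high_dim = bound_rate(1000, 30)
```

Halving the factor costs a multiplicative $2^m$ in samples, which is manageable for a handful of intrinsic dimensions but would be hopeless if $m$ were replaced by thousands of genes.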
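Convergence of graphical model
Theorem
Under mild conditions, with probability $1 - \delta$
$$\| \hat{J} - J \| \le C \log\left( \frac{1}{\delta} \right) n^{-1/m},$$
where $\hat{J}$ denotes the estimate of $J$ computed from the data.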
Observations on convergence
Typical theoretical results on convergence of graphical models are in terms of sparsity $s = \|J\|_0$.
Our results are in terms of the rank of $J$, reflective of intrinsic dimension.
Our belief is that the covariance structure is not necessarily sparse but is low rank.
The genetic basis of complex traits may be sparse in pathways even when not in genes.
Subsections: Prostate cancer; Colon cancer
Tumor progression as multi-task learning
Progression: $b \mapsto p \mapsto m$.
Task 1: $b \mapsto p$
Task 2: $p \mapsto m$
Tasks correspond to progression from less serious (0) to more serious (1).
Data set: Tomlins et al.
Gene expression from 22,283 genes. 71 people:
22 benign (b) prostate epithelium;
32 primary (p) prostate cancer;
17 metastatic (m) prostate cancer.
Progression: $b \mapsto p \mapsto m$.
523 pathway-defined gene sets.
Pathways relevant in progression
Stage | Set ID | Gene Set Name | Source
b ↦ p ↦ m | TRANS | CR Transport | PNAS 2014
b ↦ p ↦ m | CCC | Cell Cycle Checkpoint | GO
b ↦ p | GHD | MAP00361 Gamma Hexachlorocyclohexane Degradation | GenMAPP
b ↦ p | KREB | kreb Pathway | BioCarta
p ↦ m | HORM | CR Hormonal Functions | PNAS 2008
p ↦ m | GLY | GLYCOL | Manually Curated
[Figure: Pathways relevant in progression. Panels A, B, and C; rows are the gene sets TRANS, CCC, GHD, KREB, HORM, and GLY; horizontal axis from 10 to 100; color scale from −0.8 to 0.8; groups labeled b, p, and m.]
Pathway dependencies: benign to primary
Figure (image not recovered): panels A and B.
Gene set refinement
1 Not all genes in a gene set are relevant in the specific context studied.
2 Genes not included in the gene set may be relevant to the specific context studied. Still working on this.
3 The refinement procedure adapts the gene set to the context of the data. It infers gene dependencies and substructure in the gene set.
4 A gene network model of dependencies is inferred.
Gene network for ERK pathway
Figure (image not recovered): gene network for the ERK pathway. Node labels recoverable from the extraction include DPM2, NGFB, SOS1, ELK1, RAF1, SHC1, EGFR, TGFB, MYC, GNB1, MAPK1, GNAS, MAP2K1, MAP2K2, PPP2CA, GRB2, MKNK1, and MKNK2; several other labels are truncated.
Genetic models of colorectal cancer
Colorectal cancer is a complex trait that progresses from normal colon epithelium to adenoma and then through stages of carcinoma:
n → a → c1 → c2 → c3 → c4.
Tremendous heterogeneity exists across
1 Time or stage
2 Individuals.
Genetic model of colorectal cancer (Fearon and Vogelstein, 1990)
Mountains and hills
KRAS, TP53, and APC are hard to target with drugs, and using them as a genetic basis of treatment has proven unsuccessful.
Organize gene hills into biological pathways or sets of genes relevant to cancer initiation: make gene hills into a pathway mountain.
Inferred system
Open problems
Nonlinear progression through stages – more general phenotypic substructure.
Decomposition into sub-pathways.
Integration across genomic data.
Acknowledgements
Coworkers
Learning gradients: Q Wu, D-X Zhou, J Guinney, M Maggioni, K Mao
Multi-task learning: J Guinney, Q Wu
Tumor progression: E Edelman, J Guinney, A Potti
Funding:
IGSP
Center for Systems Biology at Duke
NSF DMS-0732260
Relevant papers
Learning Coordinate Covariances via Gradients. Sayan Mukherjee, Ding-Xuan Zhou; Journal of Machine Learning Research, 7(Mar):519–549, 2006.
Estimation of Gradients and Coordinate Covariation in Classification. Sayan Mukherjee, Qiang Wu; Journal of Machine Learning Research, 7(Nov):2481–2514, 2006.
Learning Gradients and Feature Selection on Manifolds. Sayan Mukherjee, Qiang Wu, Ding-Xuan Zhou; Bernoulli, submitted.
Learning Gradients: predictive models that infer geometry and dependence. Qiang Wu, Justin Guinney, Mauro Maggioni, Sayan Mukherjee; Journal of Machine Learning Research, submitted.
Estimating variable structure and dependence in Multi-task learning via gradients. Justin Guinney, Q. Wu, Sayan Mukherjee; Journal of Machine Learning Research, submitted.
Modeling Cancer Progression via Pathway Dependencies. E. Edelman, J. Guinney, J-T. Chi, P.G. Febbo, S. Mukherjee; PLoS Computational Biology, 4(2): e28.
Validation of gene sets
Main questions
Two questions
1 Accuracy of gene sets annotated according to known perturbations – confidence in annotation and interpretations made based on annotations.
2 Comparison of gene sets defined by experimental studies vs. expert knowledge – importance of context on interpretation of results.
Experiments or designs with known pathways driving phenotype are used to start answering these questions.
P53
Transcription factor inhibiting cell growth and stimulating cell death. Point mutation (p17) inactivates capacity to bind specifically to recognition sequence.
P53 experiment
Database contains 5 gene sets defining P53 pathways, defined experimentally and from expert studies.
1 Direct context: enrichment of known p53 mutants in the NCI-60 collection?
2 Indirect context: enriched in recurrent breast cancer?
Gene Set                  Source
Stress p53 Specific Up    Amundson et al
p53 Genes All             Inga et al
p53 Up                    Kennan et al
p53 Pathway               BioCarta
p53 Hypoxia               BioCarta

Table: The five p53 gene sets defined by experimental and expert sources.
Enrichment in P53 mutants
Rank   Gene Set                  NES     Pval   FDR
Enriched in Wild Type
1      p53 Pathway               -2.38   0      0
2      Stress p53 Specific Up    -2.33   0      0
3      p53 Hypoxia               -2.19   0      0
4      p53 Genes All             -2.1    0      0.01
10     p53 Up                    -1.73   0.02   0.20

Table: GSEA results for the p53 gene sets in the wild type/p53 mutant dataset. The ranks are out of 220 gene sets which are enriched in the wild type phenotype.
Enrichment in breast cancer recurrence
Rank   Gene Set                  NES     Pval   FDR
Enriched in Nonrecurrent
10     p53 Pathway               -1.27   0.15   1
15     p53 Genes All             -1.23   0.2    1
19     Stress p53 Specific Up    -1.18   0.23   1
69     p53 Hypoxia               -0.83   0.70   1
Enriched in Recurrent
248    p53 Up                    0.72    0.90   0.96

Table: GSEA results for the p53 gene sets in the recurrent/nonrecurrent breast cancer dataset. The ranks are out of 123 gene sets which are enriched in the nonrecurrent phenotype and 278 gene sets which are enriched in the recurrent phenotype.
Hypoxia experiment
Database contains 7 gene sets defining hypoxia pathways, defined experimentally and from expert studies.
Direct context: enrichment under hypoxic conditions – 3 human astrocytes and 3 HeLa cells under hypoxic conditions, and 3 human astrocytes and 3 HeLa cells under normoxic conditions.
Gene Set            Source
Hypoxia Down        Manalo et al
Hypoxia Up          Manalo et al
Hypoxia Fibro Up    Kim et al
Hypoxia Reg Up      Leonard et al
Hypoxia Review      Harris
HIF Pathway         BioCarta
VEGF Pathway        BioCarta

Table: The seven hypoxia gene sets defined by experimental and expert sources.
Enrichment in hypoxia
Rank   Gene Set            NES     Pval    FDR
Enriched in Hypoxic Cells
3      Hypoxia Up          -1.96   0.008   0.026
4      Hypoxia Review      -1.95   0       0.027
6      Hypoxia Fibro Up    -1.84   0.004   0.088
9      Hypoxia Reg Up      -1.73   0.02    0.191
10     HIF Pathway         -1.73   0.02    0.176
53     VEGF Pathway        -1.39   0.055   0.553
Enriched in Normal Cells
17     Hypoxia Down        1.48    0.167   0.596

Table: GSEA results for the hypoxia gene sets in the hypoxia/normal dataset. The ranks are out of 323 gene sets which are enriched in the hypoxia phenotype and 205 gene sets which are enriched in the normal phenotype.
RAS experiment – context
5 RAS gene sets considered:
(1) pathway signature for activated H-RAS from PREC (Bild et al);
(2) pathway signature for kRAS mouse model mutant (Sweet-Cordero et al);
(3) biochemical interactions for RAS from BioCarta.
Pathways will be compared in two different settings:
(1) an animal model with mutated kRAS – very specific;
(2) heterogeneous, non-small cell lung cancers (Potti et al).
RAS enrichment in mouse model
Gene Set             NES     Pval   FDR
Enriched in Tumor
RAS Up BioCarta      1.51    0      0.05
SRC Down             1.41    0.09   0.32
MYC Up               1.25    0.15   0.35
SRC Up               1.25    0.15   0.21
H-RAS Up             1.12    0.26   0.49
E2F3 Up              1.12    0.25   0.33
BCAT Up              0.81    0.74   0.80
Enriched in Normal
RAS Down BioCarta    -1.51   0.12   0.06
E2F3 Down            -1.29   0.10   0.49
H-RAS Down           -1.18   0.19   0.52
BCAT Down            -1.14   0.29   0.32
MYC Down             -0.99   0.55   0.52

Table: GSEA results for the oncogenic gene sets defined from Bild et al
RAS enrichment in lung adenocarcinoma
RAS activation linked with lung adenocarcinomas (Rodenhuis et al, Salgia et al).
45 adenocarcinoma lung cancer samples and 48 squamous lung cancer samples from Potti et al.
ASSESS used to compute enrichment scores for three RAS pathways across all 93 samples.
Leave-one-out error: 69.9% for H-RAS, 75.3% for kRAS, and 79.6% for BioCarta.
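The slide does not say which classifier was applied to the per-sample enrichment scores, so the following is only a minimal sketch of a leave-one-out evaluation under an assumed nearest-centroid rule. The function name `loo_error`, the classifier, and the toy scores are all hypothetical and stand in for ASSESS output, which is not reproduced here.

```python
import numpy as np

def loo_error(scores, labels):
    """Leave-one-out error rate of a nearest-centroid rule on 1-D scores.

    Assumption: one enrichment score per sample and a nearest-centroid
    classifier; the talk does not specify the classifier actually used.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    n = len(scores)
    mistakes = 0
    for i in range(n):
        keep = np.ones(n, dtype=bool)
        keep[i] = False                      # hold out sample i
        classes = np.unique(labels[keep])
        centroids = np.array(
            [scores[keep & (labels == c)].mean() for c in classes]
        )
        predicted = classes[np.argmin(np.abs(centroids - scores[i]))]
        mistakes += int(predicted != labels[i])
    return mistakes / n

# Toy, made-up scores for two well-separated groups (not data from the talk).
toy_scores = [0.1, 0.2, 0.3, 0.9, 1.0, 1.1]
toy_labels = [0, 0, 0, 1, 1, 1]
toy_error = loo_error(toy_scores, toy_labels)
```

Each sample is held out in turn, the class centroids are recomputed from the remaining samples, and the held-out sample is assigned to the closest centroid.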
Organization of results
Used “Hallmarks of Cancer” (Hanahan and Weinberg) to organize analysis:
Self sufficiency of growth signals
Insensitivity to anti-growth signals
Evasion of apoptosis
Limitless replicative potential
Angiogenesis
Invasion and metastasis
Some results
Self sufficiency in growth signals (all stages):
Cell cycle gene sets, ErbB4, EGF, Sprouty, ERK.
Evidence for insensitivity to anti-growth signals (all stages):
PTEN down-regulation, PTDINS up-regulation.
Evasion of apoptosis (all stages):
IGF1R up-regulation, ROS down-regulation.
Energy production:
Glycolysis gene set up-regulation (late stage); ATP synthesis gene set up-regulation and oxidative phosphorylation up-regulation (early stage).
Multiscale graphical models
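Define $J = \Gamma^{-1}$ and the partial correlation coefficient
\[ r_{ij} = -\frac{J_{ij}}{\sqrt{J_{ii} J_{jj}}}. \]
Define $R_{ij} = r_{ij}$ if $i \neq j$ and $0$ otherwise.
Define $D_{ii} = J_{ii}$, with $D$ diagonal.
Then $J = D^{1/2}(I - R)D^{1/2}$, and so
\[ \Gamma = D^{-1/2}(I - R)^{-1}D^{-1/2}. \]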
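The definitions above can be checked numerically. The following is a minimal NumPy sketch, assuming a small made-up covariance matrix (the numbers are hypothetical and chosen only to be positive definite); the function name `partial_correlations` is likewise illustrative and not from the talk.

```python
import numpy as np

def partial_correlations(cov):
    """Return (J, R, D) for a covariance matrix `cov`.

    J is the precision matrix, R holds the partial correlations
    r_ij = -J_ij / sqrt(J_ii J_jj) off the diagonal (zero on it),
    and D is the diagonal matrix with D_ii = J_ii.
    """
    J = np.linalg.inv(cov)
    d = np.diag(J)
    R = -J / np.sqrt(np.outer(d, d))
    np.fill_diagonal(R, 0.0)
    return J, R, np.diag(d)

# Hypothetical 3-variable covariance matrix, for illustration only.
cov = np.array([[1.0, 0.5, 0.2],
                [0.5, 1.0, 0.3],
                [0.2, 0.3, 1.0]])

J, R, D = partial_correlations(cov)
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))

# Rebuild the covariance from D and R: D^{-1/2} (I - R)^{-1} D^{-1/2}.
recon = D_inv_sqrt @ np.linalg.inv(np.eye(3) - R) @ D_inv_sqrt
```

Here `recon` should agree with `cov` up to floating-point error, which is the identity stated on the slide.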
The following expansions hold
\[ \Gamma = D^{-1/2}\left(\sum_{k=0}^{\infty} R^{k}\right)D^{-1/2}, \]
\[ \Gamma = D^{-1/2}\left(\prod_{k=0}^{\infty}\left(I + R^{2^{k}}\right)\right)D^{-1/2}. \]
Here $k$ is path-length in the first equality. The second equality factorizes the covariance matrix into low rank matrices with fewer entries.
This is the idea of diffusion wavelets.
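Both expansions are instances of the Neumann series for $(I - R)^{-1}$, which converges when the spectral radius of $R$ is below one. The sketch below checks this on a small made-up covariance matrix (hypothetical values); the helper names are illustrative only. It relies on the fact that the product of the first $m$ factors equals the sum of the first $2^m$ powers of $R$.

```python
import numpy as np

def normalized_partial_correlations(cov):
    """Return (R, D^{-1/2}) built from the precision matrix of `cov`."""
    J = np.linalg.inv(cov)
    d = np.diag(J)
    R = -J / np.sqrt(np.outer(d, d))
    np.fill_diagonal(R, 0.0)
    return R, np.diag(1.0 / np.sqrt(d))

def power_series(R, n_terms):
    """Partial sum I + R + ... + R^(n_terms - 1)."""
    total = np.zeros_like(R)
    power = np.eye(R.shape[0])
    for _ in range(n_terms):
        total += power
        power = power @ R
    return total

def product_form(R, n_factors):
    """Partial product (I + R)(I + R^2)(I + R^4)... with n_factors factors."""
    identity = np.eye(R.shape[0])
    result = identity.copy()
    power = R.copy()
    for _ in range(n_factors):
        result = result @ (identity + power)
        power = power @ power        # squares the exponent: R^(2^k)
    return result

# Hypothetical covariance matrix, for illustration only.
cov = np.array([[1.0, 0.5, 0.2],
                [0.5, 1.0, 0.3],
                [0.2, 0.3, 1.0]])
R, D_inv_sqrt = normalized_partial_correlations(cov)

# Six factors already cover path lengths 0..63.
approx_cov = D_inv_sqrt @ product_form(R, 6) @ D_inv_sqrt
```

The product form reaches a given path length with logarithmically many matrix multiplications, which is the computational appeal of the dyadic factorization.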
Figure (image not recovered): panels (a) Covariance, (b) PCA, (c) Coarse scale, (d) Fine scale; one panel is titled "Basis at level 3"; horizontal axes run from 50 to 300.
Consistency
Assume the data is concentrated on a $d$-dimensional manifold $M \subset \mathbb{R}^p$ and there exists an isometric embedding $\varphi : M \to \mathbb{R}^p$.
Theorem
Under mild regularity conditions on the distribution and corresponding density, with probability $1 - \delta$
\[ \left\| (d\varphi)^{*}\vec{f}_D - \nabla_M f \right\|_{\rho_X} \le C \log\left(\frac{1}{\delta}\right) n^{-1/d}, \]
where $(d\varphi)^{*}$ is the dual of the map $d\varphi$.
Estimating the gradient
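Taylor expansion
\[ y_i \approx f(x_i) \approx f(x_j) + \langle \nabla f(x_j),\, x_i - x_j \rangle \approx y_j + \langle \nabla f(x_j),\, x_i - x_j \rangle \quad \text{if } x_i \approx x_j. \]
Let $\vec{f} \approx \nabla f$; then the following should be small:
\[ \sum_{i,j} w_{ij}\left(y_i - y_j - \langle \vec{f}(x_j),\, x_i - x_j \rangle\right)^2, \]
where $w_{ij} = \frac{1}{s^{p+2}} \exp\left(-\|x_i - x_j\|^2 / (2s^2)\right)$ enforces $x_i \approx x_j$.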
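The algorithm implements
\[ \vec{f}_D = \arg\min_{\vec{f} \in \mathcal{H}^p} \left[ \frac{1}{n^2} \sum_{i,j=1}^{n} w_{ij}\left(y_i - y_j - \langle \vec{f}(x_j),\, x_i - x_j \rangle\right)^2 + J(\|\vec{f}\|) \right], \]
where $J(\|\vec{f}\|)$ penalizes the smoothness of the estimate.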
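To make the weighted least-squares idea concrete, here is a deliberately simplified sketch. It is not the RKHS estimator described on the slide: instead of a single vector-valued function in $\mathcal{H}^p$ with a smoothness penalty, it fits an independent ridge-penalized gradient vector at each sample point, using the Gaussian weights $w_{ij}$ and the displacement $x_i - x_j$. The names `gaussian_weights`, `local_gradients`, the ridge parameter `lam`, and the synthetic linear data are all assumptions made for illustration.

```python
import numpy as np

def gaussian_weights(X, s):
    """w_ij = exp(-||x_i - x_j||^2 / (2 s^2)) / s^(p + 2)."""
    n, p = X.shape
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * s ** 2)) / s ** (p + 2)

def local_gradients(X, y, s, lam=1e-10):
    """Per-point ridge solution of the weighted first-order Taylor fit.

    For each j, minimizes over g:
        sum_i w_ij (y_i - y_j - <g, x_i - x_j>)^2 + lam * ||g||^2.
    Simplified stand-in for the RKHS formulation, not the talk's algorithm.
    """
    n, p = X.shape
    W = gaussian_weights(X, s)
    G = np.zeros((n, p))
    for j in range(n):
        delta = X - X[j]                     # rows are x_i - x_j
        dy = y - y[j]                        # entries are y_i - y_j
        weighted = delta * W[:, j:j + 1]
        A = weighted.T @ delta + lam * np.eye(p)
        b = weighted.T @ dy
        G[j] = np.linalg.solve(A, b)
    return G

# Synthetic check: for a linear function the gradient is constant.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
true_gradient = np.array([2.0, -3.0])
y = X @ true_gradient

G = local_gradients(X, y, s=1.0)
gop = G.T @ G / len(X)   # empirical gradient outer product of the local fits
```

On this noise-free linear example every row of `G` should match `true_gradient`, and `gop` should match its outer product with itself. Real expression data would need a bandwidth `s` and penalty chosen by validation.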
Computational efficiency and consistency
The computation requires fewer than $n^2$ parameters and is $O(n^6)$ time and $O(pn)$ memory.
This is due to the following representer property
\[ \vec{f}_D(x) = \sum_{i=1}^{n} c_{i,D}\, K(x_i, x), \]
where $K(\cdot,\cdot)$ is a symmetric positive definite kernel function and $c_D = (c_{1,D}, \dots, c_{n,D})' \in \mathbb{R}^{np}$.
In addition the gradient outer product estimate is
\[ \Gamma = c_D K c_D', \]
where $K$ is a matrix with $K_{ij} = K(x_i, x_j)$.
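The shapes involved in the representer property are easy to lose track of, so the sketch below spells them out. It assumes a Gaussian kernel and arranges the coefficients as a $p \times n$ matrix whose $i$-th column plays the role of $c_{i,D}$, so that $c_D K c_D'$ is $p \times p$. The coefficients here are random placeholders, not the output of the fitting procedure; `gaussian_kernel`, `f_D`, and `sigma` are hypothetical names.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Kernel matrix with entries exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq_dist = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
n, p = 5, 3
X = rng.normal(size=(n, p))          # n samples in p dimensions
C = rng.normal(size=(p, n))          # placeholder coefficients, column i ~ c_{i,D}
K = gaussian_kernel(X, X, sigma=1.0)

def f_D(x):
    """Evaluate sum_i c_{i,D} K(x_i, x) at a single point x of length p."""
    k_x = gaussian_kernel(X, x[None, :], sigma=1.0)[:, 0]
    return C @ k_x

# Gradient outer product estimate from the slide: a p x p matrix.
Gamma = C @ K @ C.T
```

Because `K` is positive semidefinite, `Gamma` is symmetric positive semidefinite, so its eigenvectors can be used to rank directions in gene space.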