Pathway models in cancer genomics
SIAM 2008

Sayan Mukherjee

Department of Statistical Science
Institute for Genome Sciences & Policy
Department of Computer Science
Duke University

August 6, 2008

Motivation · Gene sets or pathways · Statistical methods · Modeling tumor progression · Conclusion · Thanks

Genetic and molecular basis of complex traits

Model complex traits based on high-throughput genomic data using statistical and computational methods.

Complex traits are controlled by variation across many genes and often express variation in phenotype. This heterogeneity is the crux of the difficulty in modeling these traits.

Two sources of heterogeneity:

- variation across phenotypes or disease stages;

- variation across genes in individual samples.

Cluster or stratify

An obvious solution is to build models

1. Separately across each stage

2. On genes that are mutated or varying across all individuals.

Huge loss of power, especially in genes

1. Genes mutated across most individuals: KRAS, TP53, and APC (gene mountains).

2. Most variable genes are mutated in 5% of individuals (gene hills).

Integration of information

Borrow strength

Model in gene set space: integrate genes into a priori known gene sets with putative functional or structural annotation.

Conjoint models across phenotypes: borrow strength across phenotypes.

Modeling principles

Biological systems are often assayed by thousands of variables.

These data lie on or near a low-dimensional manifold, and there are strong dependencies between variables.

We build models that are predictive with respect to phenotype but that also infer the strong dependencies, which we hope can help infer mechanism.

Questions

1. Which groups of genes (pathways) are involved in all or some stages of progression?

2. What are the pathway dependencies (inferring pathway networks)?

3. For each relevant pathway, infer the gene network for that pathway.

Progression in colon cancer

Pathway dependencies in prostate cancer: benign to primary

[Figure: panels A and B.]

Gene network for ERK pathway

[Figure: gene network for the ERK pathway. Legible node labels include DPM2, NGFB, SOS1, ELK1, RAF1, SHC1, EGFR, MYC, GNB1, MAPK1, GNAS, MAP2K1, PPP2CA, GRB2, MKNK1, MKNK2, and MAP2K2.]

Mountains and hills

KRAS, TP53, and APC are hard to target with drugs, and using them as the genetic basis of treatment has proven unsuccessful.

Organize gene hills into biological pathways or sets of genes relevant to cancer initiation: make gene hills into a pathway mountain.

Gene Set Enrichment Analysis (GSEA)

1. Given an expression data set and phenotypes (labels), order the genes by correlation with phenotype.

2. Estimate the gene set's enrichment score with respect to the ordered genes.

3. Assess statistical significance and adjust for multiple hypothesis testing.
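
The three steps above can be sketched in a few lines. This is a minimal illustration, not the published GSEA implementation: it assumes an unweighted running-sum (Kolmogorov-Smirnov style) enrichment score, Pearson correlation with a 0/1 label for the ranking, and label permutation for significance. The function names, the toy data, and the 15-gene "pathway" are hypothetical, and the multiple-testing adjustment over many gene sets is omitted.

```python
import numpy as np

def rank_genes(expr, labels):
    """Step 1: order genes by Pearson correlation with a 0/1 phenotype (most positive first)."""
    x = expr - expr.mean(axis=1, keepdims=True)
    y = labels - labels.mean()
    corr = (x @ y) / (np.linalg.norm(x, axis=1) * np.linalg.norm(y) + 1e-12)
    return np.argsort(-corr)

def enrichment_score(order, gene_set):
    """Step 2: unweighted running sum; step up on set members, down on non-members."""
    hits = np.isin(order, list(gene_set))
    n, n_hit = len(order), hits.sum()
    steps = np.where(hits, 1.0 / n_hit, -1.0 / (n - n_hit))
    running = np.cumsum(steps)
    return running[np.argmax(np.abs(running))]   # largest deviation from zero

def permutation_pvalue(expr, labels, gene_set, n_perm=200, seed=0):
    """Step 3: compare the observed score with scores under permuted labels."""
    rng = np.random.default_rng(seed)
    observed = enrichment_score(rank_genes(expr, labels), gene_set)
    null = np.array([
        enrichment_score(rank_genes(expr, rng.permutation(labels)), gene_set)
        for _ in range(n_perm)
    ])
    p = (1 + np.sum(np.abs(null) >= abs(observed))) / (1 + n_perm)
    return observed, p

# Toy data (hypothetical): 200 genes, 10 samples per class.
rng = np.random.default_rng(0)
n_genes, n_per_class = 200, 10
labels = np.array([0] * n_per_class + [1] * n_per_class)
expr = rng.normal(size=(n_genes, 2 * n_per_class))
gene_set = set(range(15))                 # hypothetical pathway: genes 0..14
expr[:15, labels == 1] += 3.0             # raise the pathway's genes in class 1
es, p_value = permutation_pvalue(expr, labels, gene_set)
print(es, p_value)
```

In practice the same score would be computed for every gene set in the database, and the resulting p-values adjusted for multiple testing.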

Gene set database

The gene sets in the database are defined by

1. Positional gene sets: cytogenetic bands, 3 megabase windows;

2. Motif gene sets: TRANSFAC motifs, representative motifs;

3. Curated gene sets: pathways, literature reviews, animal models, clinical phenotypes, expert curations, chemical or genetic perturbations.

Diabetes

[Figure: GSEA applied to normals vs. diabetics. Labels: ordered genes; database of gene sets; OXPHOS pathway, p-value = 0.007, pFDR = 4%.]

Analysis of Sample Set Enrichment Scores (ASSESS)

1. GSEA: provides a summary for the enrichment of each gene set with respect to differential expression across the data set.

2. ASSESS: provides a summary statistic $S_{ij}$ for the enrichment of each gene set with respect to differential expression for each observation: $S_{ij}$ is the constitutive differential enrichment in the $i$-th sample of genes in gene set $j$.

Gender

Primary to metastatic

[Figure: gene sets (vertical axis) against primary (p) and metastatic (m) samples.]

Validation of gene sets

Two questions

1. Accuracy of gene sets annotated according to known perturbations: confidence in annotation and in interpretations made based on annotations.

2. Comparison of gene sets defined by experimental studies vs. expert knowledge: importance of context on interpretation of results.

Experiments or designs with known pathways driving phenotype are used to start answering these questions.

Hierarchical models · Graphical models

Regression or single-task learning

The regression problem:

(1) explanatory variables $X_n = (x_1, \dots, x_n)$ over $n$ samples, each $x_i \in \mathbb{R}^d$ the enrichment of the sample over $d$ pathways;

(2) categorical label variables $Y_n = (y_1, \dots, y_n)$, where $y_i \in \{0, 1\}$, with 0 = less serious stage and 1 = more serious stage.

Find a function $f$ that accurately predicts $y$ given $x$:

$$f(x_{\text{new}}) \approx y_{\text{new}}.$$

Hierarchical modeling or multi-task learning

Definition (single-task notation). $n_t$ samples $(x_i, y_i)$ with $x_i \in \mathbb{R}^d$ and $y_i \in \{0, 1\}$ for classification. Assume we are working in the $d \gg n_t$ paradigm.

Definition (multi-task learning (MTL) formulation). Given $T$ tasks with $t \in \{1, \dots, T\}$,

$$F_t(x) = f_0(x) + f_t(x) + \varepsilon, \qquad \varepsilon \overset{iid}{\sim} \mathrm{No}(0, \sigma^2).$$
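
To make the decomposition $F_t = f_0 + f_t$ concrete, here is a minimal sketch under strong simplifying assumptions that are not taken from the talk: both $f_0$ and $f_t$ are linear, the loss is squared error on a continuous response, and the split between shared and task-specific parts is controlled by two ridge penalties. The function name `fit_multitask_ridge` and the parameters `lam_shared` and `lam_task` are hypothetical.

```python
import numpy as np

def fit_multitask_ridge(tasks, lam_shared=1.0, lam_task=10.0):
    """Fit F_t(x) = (w0 + v_t)' x jointly over all tasks.

    tasks: list of (X_t, y_t), X_t of shape (n_t, d).
    Returns w0 (shared weights, the role of f_0) and V (T x d task-specific
    weights, the role of f_t). A heavier penalty on V pushes common signal
    into w0.
    """
    T = len(tasks)
    d = tasks[0][0].shape[1]
    rows, targets = [], []
    for t, (X, y) in enumerate(tasks):
        Z = np.zeros((X.shape[0], d * (T + 1)))
        Z[:, :d] = X                              # shared block
        Z[:, d * (t + 1): d * (t + 2)] = X        # block owned by task t
        rows.append(Z)
        targets.append(y)
    Z, y = np.vstack(rows), np.concatenate(targets)
    penalty = np.concatenate([np.full(d, lam_shared), np.full(d * T, lam_task)])
    theta = np.linalg.solve(Z.T @ Z + np.diag(penalty), Z.T @ y)
    return theta[:d], theta[d:].reshape(T, d)

# Toy data (hypothetical): feature 0 matters in both tasks,
# feature 1 has opposite effects in the two tasks.
rng = np.random.default_rng(0)
d, n_t = 5, 200
w0_true = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
v_true = np.array([[0.0, 1.0, 0.0, 0.0, 0.0],
                   [0.0, -1.0, 0.0, 0.0, 0.0]])
tasks = []
for t in range(2):
    X = rng.normal(size=(n_t, d))
    y = X @ (w0_true + v_true[t]) + 0.1 * rng.normal(size=n_t)
    tasks.append((X, y))

w0, V = fit_multitask_ridge(tasks)
print(np.round(w0, 2))
print(np.round(V, 2))
```

With these penalties the effect common to both tasks is absorbed mostly by the shared weights, while the opposing effects appear in the task-specific weights.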

Multi-task learning for tumor progression

Context of tumor progression: b ↦ p ↦ m:

1. Pathways predictive across all stages b ↦ p ↦ m: $f_0(x)$;

2. Pathways predictive in the first stage b ↦ p (Task 1): $f_1(x)$;

3. Pathways predictive in the second stage p ↦ m (Task 2): $f_2(x)$.

Models of variation in genetics

[Figure: path diagram relating weight at 33 days, gain 0-33 days, external conditions, heredity, weight at birth, rate of growth, condition of dam, size of litter, gestation period, and heredity of dam.]

Inverse regression

Regression is modeling $y = f(x)$ or $p(Y \mid X)$.

Inverse regression is modeling $p(X \mid Y)$ or often $\operatorname{cov}(X \mid Y)$.

Given $D = \{(X_1, Y_1), \dots, (X_n, Y_n)\}$ we estimate a variation of the covariance of the inverse regression

$$\Gamma = h[\operatorname{cov}(X \mid Y)], \quad \text{a } d \times d \text{ matrix.}$$

1. $\Gamma_{ii}$: relevance of pathway $i$ with respect to the label.

2. $\Gamma_{ij}$: covariation of pathways $i$ and $j$ with respect to the label.

More formally

Given $X \in \mathcal{X} \subset \mathbb{R}^p$ and $Y \in \{0, 1\}$,

$$D = \{(x_1, y_1), \dots, (x_n, y_n)\} \overset{iid}{\sim} \rho(X, Y), \quad \text{with typically } p \gg n.$$

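Gradients and outer products

Given a smooth function $f$ the gradient is

$$\nabla f(X) = \left( \frac{\partial f(X)}{\partial X_1}, \dots, \frac{\partial f(X)}{\partial X_d} \right)'.$$

Define the gradient outer product matrix $\Gamma$:

$$\Gamma_{ij} = \int_{\mathcal{X}} \frac{\partial f}{\partial x_i}(X) \, \frac{\partial f}{\partial x_j}(X) \, d\rho_X(X), \qquad \Gamma = \mathbb{E}[(\nabla f) \otimes (\nabla f)].$$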
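
A small numerical illustration of the gradient outer product may help. The sketch below assumes a known test function (hypothetical, chosen so that it depends on $x$ only through one direction), standard normal samples for $\rho_X$, and central differences for the gradient; in the setting of the talk $f$ is unknown and the gradient must be estimated from data.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-5):
    """Central-difference gradient of a scalar function f at the point x."""
    grad = np.zeros_like(x)
    for k in range(x.size):
        step = np.zeros_like(x)
        step[k] = eps
        grad[k] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad

def gradient_outer_product(f, samples):
    """Monte Carlo estimate of Gamma = E[(grad f)(grad f)'] over the samples."""
    grads = np.array([numerical_gradient(f, x) for x in samples])
    return grads.T @ grads / len(samples)

rng = np.random.default_rng(0)
d = 5
direction = np.array([1.0, 2.0, 0.0, 0.0, 0.0])

def f(x):
    # Depends on x only through the single direction (1, 2, 0, 0, 0).
    return np.sin(x @ direction)

samples = rng.normal(size=(500, d))
gamma = gradient_outer_product(f, samples)

eigvals, eigvecs = np.linalg.eigh(gamma)   # eigenvalues in ascending order
top = eigvecs[:, -1]                       # leading eigenvector
print(np.round(gamma, 3))
print(np.round(top, 3))
```

The diagonal of `gamma` is nonzero only for the two coordinates that matter, the off-diagonal entry couples them, and the leading eigenvector recovers the relevant direction up to sign.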

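Statistical interpretation

Linear case:

$$y = \beta' x + \varepsilon, \qquad \varepsilon \overset{iid}{\sim} \mathrm{No}(0, \sigma^2).$$

$$\Omega = \operatorname{cov}(\mathbb{E}[X \mid Y]), \qquad \Sigma_X = \operatorname{cov}(X), \qquad \sigma_Y^2 = \operatorname{var}(Y).$$

$$\Gamma = \sigma_Y^2 \left( 1 - \frac{\sigma^2}{\sigma_Y^2} \right)^2 \Sigma_X^{-1} \, \Omega \, \Sigma_X^{-1} \approx \sigma_Y^2 \, \Sigma_X^{-1} \, \Omega \, \Sigma_X^{-1}.$$

$\Gamma$ and $\Omega$ contain equivalent information.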
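
The linear-case relation can be checked numerically, at least up to scale. The sketch below is an illustration under assumptions that are not from the talk: Gaussian predictors with a fixed, well-conditioned covariance, and a slicing estimator of $\Omega$ in the style of sliced inverse regression. It only verifies that $\sigma_Y^2 \Sigma_X^{-1} \Omega \Sigma_X^{-1}$ points along $\beta$, as $\Gamma = \beta\beta'$ does; it does not check the constant factor.

```python
import numpy as np

def sliced_inverse_covariance(X, y, n_slices=10):
    """Estimate Omega = cov(E[X | Y]) from the means of X within slices of sorted y."""
    order = np.argsort(y)
    mean_all = X.mean(axis=0)
    omega = np.zeros((X.shape[1], X.shape[1]))
    for idx in np.array_split(order, n_slices):
        diff = X[idx].mean(axis=0) - mean_all
        omega += (len(idx) / len(y)) * np.outer(diff, diff)
    return omega

rng = np.random.default_rng(1)
n, d = 5000, 4
beta = np.array([2.0, -1.0, 0.0, 0.0])
A = np.eye(d) + 0.5 * np.ones((d, d))       # mixing matrix: correlated predictors
X = rng.normal(size=(n, d)) @ A
y = X @ beta + 0.1 * rng.normal(size=n)

omega = sliced_inverse_covariance(X, y)
sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))
gamma_like = y.var() * sigma_inv @ omega @ sigma_inv

top = np.linalg.eigh(gamma_like)[1][:, -1]  # leading eigenvector
alignment = abs(top @ beta) / np.linalg.norm(beta)
print(round(alignment, 4))
```

An alignment close to 1 indicates that the inverse-regression quantity and the gradient outer product identify the same predictive direction in this toy setting.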

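Statistical interpretation

For smooth $f(x)$,

$$y = f(x) + \varepsilon, \qquad \varepsilon \overset{iid}{\sim} \mathrm{No}(0, \sigma^2).$$

The interpretation of $\Omega = \operatorname{cov}(X \mid Y)$ is not so clear.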

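Nonlinear case

Partition into sections and compute local quantities:

$$\mathcal{X} = \bigcup_{i=1}^{I} \chi_i$$

$$\Omega_i = \operatorname{cov}(X_{\chi_i} \mid Y_{\chi_i}), \qquad \Sigma_i = \operatorname{cov}(X_{\chi_i}), \qquad \sigma_i^2 = \operatorname{var}(Y_{\chi_i}), \qquad m_i = \rho_X(\chi_i).$$

$$\Gamma \approx \sum_{i=1}^{I} m_i \, \sigma_i^2 \, \Sigma_i^{-1} \, \Omega_i \, \Sigma_i^{-1}.$$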

Gauss-Markov graphical models

Given a multivariate normal distribution, $x \in \mathbb{R}^p$,

$$p(x) \propto \exp\left( -\tfrac{1}{2} (x - \mu) \, C^{-1} \, (x - \mu)' \right).$$

The precision matrix $P = C^{-1}$ is also the conditional independence matrix:

$P_{ij}$ = dependence of $i \leftrightarrow j$ | all other variables.
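
The reading of the precision matrix as a conditional independence matrix can be seen on a three-variable chain. This is a toy sketch with simulated data (the chain coefficients and sample size are arbitrary choices): the first and third variables are marginally correlated, yet the corresponding precision entry is close to zero because they are independent given the middle variable.

```python
import numpy as np

# Chain x1 -> x2 -> x3: x1 and x3 are dependent, but independent given x2.
rng = np.random.default_rng(0)
n = 20000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)
x3 = 0.8 * x2 + rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

C = np.cov(X, rowvar=False)        # covariance matrix
P = np.linalg.inv(C)               # precision matrix

# Partial correlation of i and j given the rest: -P_ij / sqrt(P_ii * P_jj)
scale = np.sqrt(np.diag(P))
partial = -P / np.outer(scale, scale)
np.fill_diagonal(partial, 1.0)

marginal_13 = np.corrcoef(x1, x3)[0, 1]
print(round(marginal_13, 3))       # clearly nonzero
print(np.round(partial, 3))        # entry (0, 2) is near zero
```

A zero (or near-zero) off-diagonal entry of $P$ therefore corresponds to a missing edge in the graphical model.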

Gauss-Markov graphical models

By construction $\Gamma$ is a covariance matrix.

So the (pseudo) inverse

$$J = \operatorname{inv}(\Gamma)$$

is a conditional independence matrix with

$J_{ij}$ = dependence of $i \leftrightarrow j$ | all other $X$ variables and $Y$.
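
The pseudo-inverse matters because a gradient outer product built from a function that varies along only a few directions is rank deficient. The following sketch uses a hypothetical rank-2 matrix standing in for $\Gamma$ and the Moore-Penrose pseudo-inverse from NumPy; it is an illustration of the linear algebra only, not of the estimator used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(6, 2))
gamma = B @ B.T                     # 6 x 6, rank 2: no ordinary inverse exists

rank = np.linalg.matrix_rank(gamma)
J = np.linalg.pinv(gamma)           # Moore-Penrose pseudo-inverse

print(rank)
print(np.round(J, 3))
```

The pseudo-inverse inverts $\Gamma$ on the subspace where it has signal and is zero elsewhere, so $J$ has the same (low) rank as $\Gamma$.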

Multi-task learning or hierarchical models

Definition (multi-task learning (MTL) formulation). Given $T$ tasks with $t \in \{1, \dots, T\}$,

$$F_t(x) = f_0(x) + f_t(x) + \varepsilon, \qquad \varepsilon \overset{iid}{\sim} \mathrm{No}(0, \sigma^2).$$

Multi-task gradient learning

Estimate not just the functions

$$f_0, f_1, \dots, f_T,$$

but the gradients as well:

$$(f_0, \nabla f_0), \; \{(f_t, \nabla f_t)\}_{t=1}^{T}.$$

This provides us with $T + 1$ matrices:

1. $\Gamma_0$ is the GOP estimate across all the tasks;

2. $\Gamma_1, \dots, \Gamma_T$ are the task-specific GOP estimates.

Estimating the gradient

A penalized likelihood model is used to estimate the gradient given data in the single-task or multi-task setting, $\vec{f}_D$ or $\vec{f}_{D,t}$.

The gradient outer product estimate $\Gamma$ is computed from $\vec{f}_D$.

The gradient inference algorithm requires fewer than $n^2$ parameters and takes $O(n^6)$ time and $O(pn)$ memory.
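
For intuition about estimating gradients from data, here is a deliberately simple stand-in. It is not the penalized likelihood estimator referred to above: it swaps in kernel-weighted local linear regression around each sample, with a small ridge term, and then averages outer products of the estimated gradients. The function name, bandwidth, ridge value, and toy data are all assumptions made for illustration.

```python
import numpy as np

def local_linear_gradients(X, y, bandwidth=1.0, ridge=1e-3):
    """Estimate the gradient at each sample by kernel-weighted local linear regression."""
    n, d = X.shape
    grads = np.zeros((n, d))
    for i in range(n):
        diff = X - X[i]                                        # x_j - x_i for all j
        w = np.exp(-np.sum(diff ** 2, axis=1) / (2 * bandwidth ** 2))
        A = diff.T @ (w[:, None] * diff) + ridge * np.eye(d)   # weighted normal equations
        b = diff.T @ (w * (y - y[i]))
        grads[i] = np.linalg.solve(A, b)
    return grads

# Toy data (hypothetical): y depends on the first two of three variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] + 2.0 * X[:, 1] + 0.01 * rng.normal(size=200)

grads = local_linear_gradients(X, y)
gamma_hat = grads.T @ grads / len(grads)     # gradient outer product estimate
top = np.linalg.eigh(gamma_hat)[1][:, -1]    # leading eigenvector

print(np.round(grads.mean(axis=0), 2))
print(np.round(gamma_hat, 2))
```

The loop solves one small $d \times d$ system per sample, which is workable only when $d$ is modest; the $p \gg n$ regime discussed in the talk is exactly where a regularized estimator is needed instead.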

Convergence of gradient

Assume the data are concentrated on an $m$-dimensional manifold $M$ and that there exists an isometric embedding $\varphi : M \to \mathbb{R}^d$.

Theorem. Under mild regularity conditions on the distribution and corresponding density, with probability $1 - \delta$,

$$\left\| (d\varphi)^* \vec{f}_D - \nabla_M f \right\|_{\rho_X} \le C \log\left( \frac{1}{\delta} \right) n^{-1/m},$$

where $(d\varphi)^*$ is the dual of the map $d\varphi$.
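
The practical content of the rate $n^{-1/m}$ is that the sample size needed to halve the bound grows as $2^m$, where $m$ is the intrinsic dimension rather than the ambient dimension $d$. The short sketch below just evaluates the right-hand side of the bound; the constant $C$ is not given in the talk, so `C=1.0` is a placeholder, and the function names are hypothetical.

```python
import math

def gradient_error_bound(n, m, delta=0.05, C=1.0):
    """Evaluate C * log(1/delta) * n**(-1/m); C = 1.0 is only a placeholder."""
    return C * math.log(1.0 / delta) * n ** (-1.0 / m)

def sample_growth_to_halve(m):
    """Factor by which n must grow to halve the bound, namely 2**m."""
    return 2 ** m

for m in (3, 10, 50):
    print(m, gradient_error_bound(1000, m), sample_growth_to_halve(m))
```

With a few hundred samples, as in typical expression studies, the bound is informative only if the data really do concentrate near a manifold of small dimension.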

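Convergence of graphical model

Theorem. Under mild conditions, with probability $1 - \delta$,

$$\| \hat{J} - J \| \le C \log\left( \frac{1}{\delta} \right) n^{-1/m},$$

where $\hat{J}$ denotes the estimate of $J$ computed from the data.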

Observations on convergence

Typical theoretical results on convergence of graphical models are in terms of sparsity $s = \|J\|_0$.

Our results are in terms of the rank of $J$, reflective of intrinsic dimension.

Our belief is that the covariance structure is not necessarily sparse but is low rank.

The genetic basis of complex traits may be sparse in pathways even when not in genes.

Prostate cancer · Colon cancer

Tumor progression as multi-task learning

Progression: b ↦ p ↦ m.

Task 1: b ↦ p

Task 2: p ↦ m

Tasks correspond to progression from less serious (0) to more serious (1).

Data set – Tomlins et al.

Gene expression from 22,283 genes. 71 people:
22 benign (b) prostate epithelium;
32 primary (p) prostate cancer;
17 metastatic (m) prostate cancer.

Progression: b ↦ p ↦ m.

523 pathway-defined gene sets.

Pathways relevant in progression

b ↦ p ↦ m
Set ID  Gene Set Name                                     Source
TRANS   CR Transport                                      PNAS 2014
CCC     Cell Cycle Checkpoint                             GO

b ↦ p
Set ID  Gene Set Name                                     Source
GHD     MAP00361 Gamma Hexachlorocyclohexane Degradation  GenMAPP
KREB    kreb Pathway                                      BioCarta

p ↦ m
Set ID  Gene Set Name                                     Source
HORM    CR Hormonal Functions                             PNAS 2008
GLY     GLYCOL                                            Manually Curated

[Figure: heat map with rows TRANS, CCC, GHD, KREB, HORM, GLY; horizontal axis from 10 to 100; color scale from −0.8 to 0.8; panels A, B, C; stage labels b, p, m.]

Pathway dependencies: benign to primary

[Figure: panels A and B.]

Gene set refinement

1 Not all genes in a gene set are relevant in the specific context studied.

2 Genes not included in the gene set may be relevant to the specific context studied. Still working on this.

3 The refinement procedure adapts the gene set to the context of the data. It infers gene dependencies and substructure in the gene set.

4 A gene network model of dependencies is inferred.

Gene network for ERK pathway

[Figure: inferred gene network for the ERK pathway. Legible node labels include DPM2, NGFB, SOS1, ELK1, RAF1, SHC1, EGFR, MYC, GNB1, MAPK1, GNAS, MAP2K1, MAP2K2, PPP2CA, GRB2, MKNK1, and MKNK2.]

Genetic models of colorectal cancer

Colorectal cancer is a complex trait that progresses from normal colon epithelium to adenoma and then through stages of carcinoma:

n → a → c1 → c2 → c3 → c4.

Tremendous heterogeneity exists across

1 Time or stage

2 Individuals.

Genetic model of colorectal cancer (Fearon and Vogelstein, 1990)

Mountains and hills

KRAS, TP53, and APC are hard to target with drugs, and using them as a genetic basis of treatment has proven unsuccessful.

Organize gene hills into biological pathways or sets of genes relevant to cancer initiation, turning gene hills into a pathway mountain.

Inferred system

Open problems

Nonlinear progression through stages – more general phenotypic substructure.

Decomposition into sub-pathways.

Integration across genomic data.

Acknowledgements

Coworkers

Learning gradients: Q Wu, D-X Zhou, J Guinney, M Maggioni, K Mao

Multi-task learning: J Guinney, Q Wu

Tumor progression: E Edelman, J Guinney, A Potti

Funding:

IGSP

Center for Systems Biology at Duke

NSF DMS-0732260

Relevant papers

Learning Coordinate Covariances via Gradients. Sayan Mukherjee, Ding-Xuan Zhou; Journal of Machine Learning Research, 7(Mar):519–549, 2006.

Estimation of Gradients and Coordinate Covariation in Classification. Sayan Mukherjee, Qiang Wu; Journal of Machine Learning Research, 7(Nov):2481–2514, 2006.

Learning Gradients and Feature Selection on Manifolds. Sayan Mukherjee, Qiang Wu, Ding-Xuan Zhou; Bernoulli, submitted.

Learning Gradients: predictive models that infer geometry and dependence. Qiang Wu, Justin Guinney, Mauro Maggioni, Sayan Mukherjee; Journal of Machine Learning Research, submitted.

Estimating variable structure and dependence in Multi-task learning via gradients. Justin Guinney, Q. Wu, Sayan Mukherjee; Journal of Machine Learning Research, submitted.

Modeling Cancer Progression via Pathway Dependencies. E. Edelman, J. Guinney, J-T. Chi, P.G. Febbo, S. Mukherjee; PLoS Computational Biology, 4(2): e28.

Main questions

Two questions

1 Accuracy of gene sets annotated according to known perturbations – confidence in the annotation and in interpretations made based on annotations.

2 Comparison of gene sets defined by experimental studies vs. expert knowledge – importance of context for the interpretation of results.

Experiments or designs with known pathways driving the phenotype are used to start answering these questions.

P53

Transcription factor inhibiting cell growth and stimulating cell death. A point mutation (p17) inactivates its capacity to bind specifically to its recognition sequence.

P53 experiment

Database contains 5 gene sets defining P53 pathways, derived experimentally and from expert studies.

1 Direct context: enrichment in known p53 mutants in the NCI-60 collection?

2 Indirect context: enrichment in recurrent breast cancer?

Gene Set                Source
Stress p53 Specific Up  Amundson et al
p53 Genes All           Inga et al
p53 Up                  Kennan et al
p53 Pathway             BioCarta
p53 Hypoxia             BioCarta

Table: The five p53 gene sets defined by experimental and expert sources.

Enrichment in P53 mutants

Rank  Gene Set                NES    Pval  FDR
Enriched in Wild Type
1     p53 Pathway             -2.38  0     0
2     Stress p53 Specific Up  -2.33  0     0
3     p53 Hypoxia             -2.19  0     0
4     p53 Genes All           -2.1   0     0.01
10    p53 Up                  -1.73  0.02  0.20

Table: GSEA results for the p53 gene sets in the wild type/p53 mutant dataset. The ranks are out of 220 gene sets which are enriched in the wild type phenotype.
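
For readers unfamiliar with the statistic behind these tables, the following is a minimal sketch of the classic unweighted, Kolmogorov-Smirnov-style running-sum enrichment score. It is not the full GSEA procedure: the NES, p-values, and FDR reported above additionally rely on weighting and permutation-based normalization, which are not shown. The function name and the toy gene names are illustrative assumptions.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score for one gene set.

    Walk down the ranked list, stepping up by 1/N_hit at each gene in the set
    and down by 1/N_miss otherwise; return the running sum with the largest
    absolute value (its sign shows which end of the ranking is enriched).
    """
    hits = [g in gene_set for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    running, best = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

# Toy ranking of six genes by differential expression (illustrative only).
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
es_top = enrichment_score(ranked, {"g1", "g2"})     # set concentrated at the top
es_bottom = enrichment_score(ranked, {"g5", "g6"})  # set concentrated at the bottom
```

A set concentrated at one end of the ranking gives a score near +1 or −1; the sign convention determines which phenotype the set is called enriched in.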

Enrichment in breast cancer recurrence

Rank  Gene Set                NES    Pval  FDR
Enriched in Nonrecurrent
10    p53 Pathway             -1.27  0.15  1
15    p53 Genes All           -1.23  0.2   1
19    Stress p53 Specific Up  -1.18  0.23  1
69    p53 Hypoxia             -0.83  0.70  1
Enriched in Recurrent
248   p53 Up                  0.72   0.90  0.96

Table: GSEA results for the p53 gene sets in the recurrent/nonrecurrent breast cancer dataset. The ranks are out of 123 gene sets which are enriched in the nonrecurrent phenotype and 278 gene sets which are enriched in the recurrent phenotype.

Hypoxia experiment

Database contains 7 gene sets defining hypoxia pathways, derived experimentally and from expert studies.

Direct context: enrichment under hypoxic conditions – 3 human astrocyte and 3 HeLa cell samples under hypoxic conditions, and 3 human astrocyte and 3 HeLa cell samples under normoxic conditions.

Gene Set          Source
Hypoxia Down      Manalo et al
Hypoxia Up        Manalo et al
Hypoxia Fibro Up  Kim et al
Hypoxia Reg Up    Leonard et al
Hypoxia Review    Harris
HIF Pathway       BioCarta
VEGF Pathway      BioCarta

Table: The seven hypoxia gene sets defined by experimental and expert sources.

Enrichment in hypoxia

Rank  Gene Set          NES    Pval   FDR
Enriched in Hypoxic Cells
3     Hypoxia Up        -1.96  0.008  0.026
4     Hypoxia Review    -1.95  0      0.027
6     Hypoxia Fibro Up  -1.84  0.004  0.088
9     Hypoxia Reg Up    -1.73  0.02   0.191
10    HIF Pathway       -1.73  0.02   0.176
53    VEGF Pathway      -1.39  0.055  0.553
Enriched in Normal Cells
17    Hypoxia Down      1.48   0.167  0.596

Table: GSEA results for the hypoxia gene sets in the hypoxia/normal dataset. The ranks are out of 323 gene sets which are enriched in the hypoxia phenotype and 205 gene sets which are enriched in the normal phenotype.

RAS experiment – context

Three RAS gene sets considered:
(1) pathway signature for activated H-RAS from PREC (Bild et al);
(2) pathway signature for kRAS mouse model mutant (Sweet-Cordero et al);
(3) biochemical interactions for RAS from BioCarta.

Pathways will be compared in two different settings:
(1) an animal model with mutated kRAS – very specific;
(2) heterogeneous non-small cell lung cancers (Potti et al).

RAS enrichment in mouse model

Gene Set           NES    Pval  FDR
Enriched in Tumor
RAS Up BioCarta    1.51   0     0.05
SRC Down           1.41   0.09  0.32
MYC Up             1.25   0.15  0.35
SRC Up             1.25   0.15  0.21
H-RAS Up           1.12   0.26  0.49
E2F3 Up            1.12   0.25  0.33
BCAT Up            0.81   0.74  0.80
Enriched in Normal
RAS Down BioCarta  -1.51  0.12  0.06
E2F3 Down          -1.29  0.10  0.49
H-RAS Down         -1.18  0.19  0.52
BCAT Down          -1.14  0.29  0.32
MYC Down           -0.99  0.55  0.52

Table: GSEA results for the oncogenic gene sets defined from Bild et al.

RAS enrichment in lung adenocarcinoma

RAS activation is linked with lung adenocarcinomas (Rodenhuis et al, Salgia et al).

45 adenocarcinoma lung cancer samples and 48 squamous lung cancer samples from Potti et al.

ASSESS is used to compute enrichment scores for the three RAS pathways across all 93 samples.

Leave-one-out error: 69.9% for H-RAS, 75.3% for kRAS, and 79.6% for BioCarta.
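
The slide does not say which classifier was applied to the per-sample enrichment scores. As an illustration of the leave-one-out protocol only, here is a sketch that uses a nearest-class-mean rule on one-dimensional scores; the rule, the function name, and the toy scores are all assumptions.

```python
def loo_error(scores, labels):
    """Leave-one-out error of a nearest-class-mean rule on 1-D scores.

    For each held-out sample, class means are recomputed from the remaining
    samples and the held-out score is assigned to the closest mean.
    Assumes every class has at least two samples.
    """
    n = len(scores)
    classes = sorted(set(labels))
    errors = 0
    for held in range(n):
        means = {}
        for cls in classes:
            vals = [s for k, (s, l) in enumerate(zip(scores, labels))
                    if k != held and l == cls]
            means[cls] = sum(vals) / len(vals)
        predicted = min(classes, key=lambda c: abs(scores[held] - means[c]))
        errors += predicted != labels[held]
    return errors / n

# Toy per-sample enrichment scores for one pathway (illustrative values only).
scores = [0.9, 0.7, 0.8, -0.6, -0.4, 0.6]
labels = ["adeno", "adeno", "adeno", "squamous", "squamous", "squamous"]
err = loo_error(scores, labels)  # only the last sample is misclassified
```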

Organization of results

Used “Hallmarks of Cancer” (Hanahan and Weinberg) to organize the analysis:

Self-sufficiency in growth signals

Insensitivity to anti-growth signals

Evasion of apoptosis

Limitless replicative potential

Angiogenesis

Invasion and metastasis

Some results

Self-sufficiency in growth signals (all stages):
Cell cycle gene sets, ErB4, EGF, Sprouty, ERK.

Evidence for insensitivity to anti-growth signals (all stages):
PTEN down-regulation, PTDINS up-regulation.

Evasion of apoptosis (all stages):
IGF1R up-regulation, ROS down-regulation.

Energy production:
Glycolysis gene set up-regulation (late stage), ATP synthesis gene set up-regulation and oxidative phosphorylation up-regulation (early stage).

Multiscale graphical models

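Define $J = \Gamma^{-1}$ and the partial correlation coefficient

$$r_{ij} = -\frac{J_{ij}}{\sqrt{J_{ii} J_{jj}}}.$$

Define $R_{ij} = r_{ij}$ if $i \neq j$ and $0$ otherwise.
Define $D$ as the diagonal matrix with $D_{ii} = J_{ii}$.
Then $J = D^{1/2}(I - R)D^{1/2}$, so

$$\Gamma = D^{-1/2}(I - R)^{-1}D^{-1/2}.$$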
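
The definitions above can be checked numerically. The sketch below builds $R$ and the diagonal of $D$ from a precision matrix and verifies the identity $J = D^{1/2}(I - R)D^{1/2}$, which is equivalent to the stated formula for $\Gamma$. The 3×3 matrix and the function names are illustrative assumptions.

```python
import math

def partial_correlations(J):
    """Return (R, d): off-diagonal partial correlations and the diagonal of J."""
    p = len(J)
    d = [J[i][i] for i in range(p)]
    R = [[0.0 if i == j else -J[i][j] / math.sqrt(d[i] * d[j])
          for j in range(p)] for i in range(p)]
    return R, d

def rebuild_precision(R, d):
    """Recompute J = D^{1/2} (I - R) D^{1/2} entry by entry."""
    p = len(d)
    return [[math.sqrt(d[i] * d[j]) * ((1.0 if i == j else 0.0) - R[i][j])
             for j in range(p)] for i in range(p)]

# Toy precision matrix (illustrative values, not from the talk).
J = [[2.0, -0.5, 0.0],
     [-0.5, 1.0, -0.3],
     [0.0, -0.3, 1.5]]
R, d = partial_correlations(J)
J_back = rebuild_precision(R, d)  # should reproduce J up to rounding
```

A zero entry $J_{ij}$ gives a zero partial correlation, which is the conditional-independence reading of the graphical model.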

The following expansions hold

$$\Gamma = D^{-1/2}\left(\sum_{k=0}^{\infty} R^{k}\right)D^{-1/2}$$

$$\Gamma = D^{-1/2}\left(\prod_{k=0}^{\infty}\left(I + R^{2^{k}}\right)\right)D^{-1/2}.$$

Here $k$ is the path length in the first equality. The second equality factorizes the covariance matrix into low-rank matrices with fewer entries.

This is the idea of diffusion wavelets.
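
A small numerical check of the two expansions, focusing on the inner factor $(I - R)^{-1}$ and leaving out the $D^{-1/2}$ scaling. It assumes the spectral radius of $R$ is below 1 so that both expansions converge; the matrix $R$ and the helper names are illustrative. Taking $K$ factors of the product reproduces the first $2^K$ terms of the sum, which is why the product form needs far fewer matrix multiplications.

```python
def identity(p):
    return [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def neumann_sum(R, terms):
    """sum_{k=0}^{terms-1} R^k."""
    total = identity(len(R))
    power = identity(len(R))
    for _ in range(terms - 1):
        power = matmul(power, R)
        total = add(total, power)
    return total

def dyadic_product(R, factors):
    """prod_{k=0}^{factors-1} (I + R^{2^k}), squaring R at each step."""
    prod = identity(len(R))
    power = R  # holds R^{2^k}
    for _ in range(factors):
        prod = matmul(prod, add(identity(len(R)), power))
        power = matmul(power, power)
    return prod

# Toy symmetric R with zero diagonal and spectral radius below 1 (illustrative).
R = [[0.0, 0.3, 0.0],
     [0.3, 0.0, 0.2],
     [0.0, 0.2, 0.0]]
S = neumann_sum(R, 64)     # 64 terms of the series
P = dyadic_product(R, 6)   # 6 factors cover the same 2^6 = 64 terms
I3 = identity(3)
I_minus_R = [[I3[i][j] - R[i][j] for j in range(3)] for i in range(3)]
gap = max_abs_diff(S, P)
residual = max_abs_diff(matmul(I_minus_R, S), I3)  # S approximates (I - R)^{-1}
```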

[Figure: four panels: (a) Covariance, (b) PCA, (c) Coarse scale, (d) Fine scale. One plot is titled "Basis at level 3"; horizontal axes run from 50 to 300.]

Consistency

Assume the data is concentrated on a manifold $M \subset \mathbb{R}^p$ of dimension $d$ and there exists an isometric embedding $\varphi : M \to \mathbb{R}^p$.

Theorem

Under mild regularity conditions on the distribution and corresponding density, with probability $1 - \delta$

$$\left\| (d\varphi)^{*} \vec{f}_D - \nabla_M f \right\|_{\rho_X} \le C \log\left(\frac{1}{\delta}\right) n^{-1/d}$$

where $(d\varphi)^{*}$ is the dual of the map $d\varphi$.

Estimating the gradient

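Taylor expansion

$$y_i \approx f(x_i) \approx f(x_j) + \langle \nabla f(x_j), x_i - x_j \rangle \approx y_j + \langle \nabla f(x_j), x_i - x_j \rangle \quad \text{if } x_i \approx x_j.$$

Let $\vec{f} \approx \nabla f$; the following should be small

$$\sum_{i,j} w_{ij}\left(y_i - y_j - \langle \vec{f}(x_j), x_i - x_j \rangle\right)^2,$$

where $w_{ij} = \frac{1}{s^{p+2}} \exp\left(-\|x_i - x_j\|^2 / 2s^2\right)$ enforces $x_i \approx x_j$.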

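The algorithm implements

$$\vec{f}_D = \arg\min_{\vec{f} \in \mathcal{H}^p} \left[ \frac{1}{n^2} \sum_{i,j=1}^{n} w_{ij}\left(y_i - y_j - \langle \vec{f}(x_j), x_i - x_j \rangle\right)^2 + J(\|\vec{f}\|) \right]$$

where $J(\|\vec{f}\|)$ penalizes non-smooth estimates.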
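
To make the data-fit term concrete, the sketch below evaluates the weighted first-order Taylor residual for a candidate gradient field, using Gaussian weights with bandwidth $s$. The penalty term and the minimization itself are omitted, and the function names and toy data are assumptions. For data generated by a linear function the true gradient gives zero loss, since the Taylor expansion is then exact.

```python
import math

def gaussian_weight(xi, xj, s):
    """w_ij = s^-(p+2) * exp(-||x_i - x_j||^2 / (2 s^2))."""
    p = len(xi)
    sq = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq / (2 * s * s)) / s ** (p + 2)

def gradient_loss(X, y, grad_at, s):
    """(1/n^2) * sum_ij w_ij * (y_i - y_j - <grad_at(x_j), x_i - x_j>)^2."""
    n = len(X)
    total = 0.0
    for i in range(n):
        for j in range(n):
            diff = [a - b for a, b in zip(X[i], X[j])]
            inner = sum(g * d for g, d in zip(grad_at(X[j]), diff))
            total += gaussian_weight(X[i], X[j], s) * (y[i] - y[j] - inner) ** 2
    return total / n ** 2

# Toy data from the linear function f(x) = 2*x1 - x2, whose gradient is (2, -1).
X = [[0.0, 0.0], [1.0, 0.5], [0.3, -0.2], [-0.7, 1.1]]
y = [2 * a - b for a, b in X]
loss_true = gradient_loss(X, y, lambda x: [2.0, -1.0], s=1.0)  # essentially zero
loss_zero = gradient_loss(X, y, lambda x: [0.0, 0.0], s=1.0)   # strictly positive
```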

Computational efficiency and consistency

The computation requires fewer than $n^2$ parameters and is $O(n^6)$ time and $O(pn)$ memory.

This is due to the following representer property

$$\vec{f}_D(x) = \sum_{i=1}^{n} c_{i,D} K(x_i, x)$$

where $K(\cdot, \cdot)$ is a symmetric positive definite kernel function and $c_D = (c_{1,D}, \ldots, c_{n,D})' \in \mathbb{R}^{np}$.

In addition the gradient outer product estimate is

$$\Gamma = c_D K c_D'$$

where $K$ is a matrix with $K_{ij} = K(x_i, x_j)$.
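
A minimal sketch of how the representer form is used once coefficients are available. It assumes a Gaussian kernel and stores the coefficients as a list `C` whose entry `C[i]` is the vector $c_{i,D} \in \mathbb{R}^p$; the coefficient values are placeholders rather than a fitted solution, and the function names are hypothetical.

```python
import math

def gaussian_kernel(u, v, sigma=1.0):
    sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq / (2 * sigma ** 2))

def evaluate_gradient(x, X, C):
    """f_D(x) = sum_i c_i K(x_i, x), returned as a length-p list."""
    p = len(C[0])
    out = [0.0] * p
    for xi, ci in zip(X, C):
        k = gaussian_kernel(xi, x)
        for a in range(p):
            out[a] += ci[a] * k
    return out

def gradient_outer_product(X, C):
    """Gamma = c K c' with K_ij = K(x_i, x_j); returns a p x p nested list."""
    n, p = len(X), len(C[0])
    K = [[gaussian_kernel(X[i], X[j]) for j in range(n)] for i in range(n)]
    return [[sum(C[i][a] * K[i][j] * C[j][b] for i in range(n) for j in range(n))
             for b in range(p)] for a in range(p)]

# Three samples in R^2 with placeholder coefficient vectors (not fitted).
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
C = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.7]]
Gamma = gradient_outer_product(X, C)
g0 = evaluate_gradient([0.0, 0.0], X, C)
```

The resulting matrix is $p \times p$ and symmetric, so it can be treated as a covariance-like object over the variables.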
