Large scale genomic data mining Curtis Huttenhower 10-23-09 Harvard School of Public Health...

Large-scale genomic data mining

Curtis Huttenhower

10-23-09
Harvard School of Public Health
Department of Biostatistics

Mining Biological Data

~100 GB

More than 100GB

How can we ask and answer specific biomedical questions using thousands of genome-scale datasets?

5

Outline

1. Methodology: Algorithms for mining genome-scale datasets

2. Applications: Human molecular data and clinical cancer cohorts

3. Next steps: Methods for microbial communities and functional metagenomics

6

A Definition of Functional Genomics

Inputs: genomic data and prior knowledge

Data → Function

Function → Function

Gene → Gene

Gene → Function

7

MEFIT: A Framework for Functional Genomics

Related Gene Pairs
BRCA1 BRCA2 0.9
BRCA1 RAD51 0.8
RAD51 TP53 0.85
…

Unrelated Gene Pairs
BRCA2 SOX2 0.1
RAD51 FOXP2 0.2
ACTR1 H6PD 0.15
…

[Figure: MEFIT; histograms of frequency against correlation (low to high) for related and unrelated gene pairs]
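One way to read the related/unrelated histograms is as the two class-conditional distributions of a naive Bayes classifier. The following is a minimal sketch under that assumption, not MEFIT's actual implementation: the function names, the four equal-width bins, the pseudocount, and the 0.5 prior are all hypothetical choices made for illustration.

```python
from collections import Counter


def bin_index(corr, n_bins=4):
    """Map a correlation in [-1, 1] to one of n_bins equal-width bins."""
    clipped = max(-1.0, min(1.0, corr))
    return min(int((clipped + 1.0) / 2.0 * n_bins), n_bins - 1)


def learn_histogram(correlations, n_bins=4, pseudocount=1.0):
    """Estimate P(bin | class) from example correlations, with smoothing."""
    counts = Counter(bin_index(c, n_bins) for c in correlations)
    total = len(correlations) + pseudocount * n_bins
    return [(counts[b] + pseudocount) / total for b in range(n_bins)]


def posterior_related(corr, related_hist, unrelated_hist, prior=0.5):
    """Bayes' rule: P(related | observed correlation bin)."""
    b = bin_index(corr, len(related_hist))
    numerator = related_hist[b] * prior
    return numerator / (numerator + unrelated_hist[b] * (1.0 - prior))


# Gold-standard pairs taken from the slide.
related = learn_histogram([0.9, 0.8, 0.85])
unrelated = learn_histogram([0.1, 0.2, 0.15])
print(posterior_related(0.88, related, unrelated))
```

With several datasets, one such histogram pair would be learned per dataset and the per-dataset likelihoods multiplied before normalizing.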

9

MEFIT: A Framework for Functional Genomics

[Figure: network linking a Functional Relationship node to each dataset, with Biological Context as an additional node]

Datasets: Golub 1999, Butte 2000, Whitfield 2002, Hansen 1998

Biological Context: functional area, tissue, disease, …

11

Functional Interaction Networks

MEFIT

Global interaction network

Autophagy network

Vacuolar transport network

Translation network

Currently have data from 30,000 human experimental results (15,000 expression conditions + 15,000 diverse others), analyzed for 200 biological functions and 150 diseases

12

Predicting Gene Function

[Figure: network of predicted relationships between genes, with edges shaded from low confidence to high confidence; cell cycle genes highlighted]

These edges provide a measure of how likely a gene is to specifically participate in the process of interest.

15

Comprehensive Validation of Computational Predictions

Inputs: genomic data and prior knowledge

Computational predictions of gene function: MEFIT, SPELL (Hibbs et al 2007), bioPIXIE (Myers et al 2005)

Genes predicted to function in mitochondrion organization and biogenesis

Laboratory experiments: petite frequency, growth curves, confocal microscopy

New known functions for correctly predicted genes

Retraining

With David Hess, Amy Caudy

16

Evaluating the Performance of Computational Predictions

Genes involved in mitochondrion organization and biogenesis

106 original GO annotations

135 under-annotations

82 novel confirmations, first iteration

17 novel confirmations, second iteration

340 total: >3x previously known genes in ~5 person-months

17

Evaluating the Performance of Computational Predictions

Genes involved in mitochondrion organization and biogenesis

106 original GO annotations

95 under-annotations

40 confirmed under-annotations

80 novel confirmations, first iteration

17 novel confirmations, second iteration

340 total: >3x previously known genes in ~5 person-months

Computational predictions from large collections of genomic data can be accurate despite incomplete or misleading gold standards, and they continue to improve as additional data are incorporated.

18

Functional Associations Between Contexts

[Figure: predicted relationships between genes, with edges shaded from low confidence to high confidence; cell cycle genes highlighted]

The average strength of these relationships indicates how cohesive a process is.

19

Functional Associations Between Contexts

[Figure: predicted relationships between genes, with edges shaded from low confidence to high confidence; cell cycle genes and DNA replication genes highlighted]

The average strength of these relationships indicates how associated two processes are.

21

Functional mapping: Scoring functional associations

How can we formalize these relationships?

Any sets of genes G1 and G2 in a network can be compared using four measures:

• Edges between their genes
• Edges within each set
• The background edges incident to each set
• The baseline of all edges in the network

$$FA_{G_1,G_2} = \frac{\mathrm{between}(G_1,G_2)}{\mathrm{background}(G_1,G_2)} \cdot \frac{\mathrm{baseline}}{\mathrm{within}(G_1,G_2)}$$

Stronger connections between the sets increase association.

Stronger within self-connections or nonspecific background connections decrease association.
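The four measures can be made concrete with a small sketch. It assumes each measure is a mean edge weight over the relevant edges, that the network is a complete weighted graph stored as a dictionary keyed by gene pairs, and that both gene sets have at least two members; these choices and the names `mean_weight` and `fa_score` are illustrative assumptions, not taken from the slides.

```python
from itertools import combinations
from statistics import mean


def mean_weight(weights, pairs):
    """Average weight over the given gene pairs."""
    return mean(weights[frozenset(p)] for p in pairs)


def fa_score(weights, g1, g2):
    """Functional association of gene sets g1 and g2.

    weights maps frozenset({gene_a, gene_b}) to an edge weight and is
    assumed to hold every pair of genes in the network.
    """
    touched = g1 | g2
    between = [(a, b) for a in g1 for b in g2 if a != b]
    within = list(combinations(sorted(g1), 2)) + list(combinations(sorted(g2), 2))
    # Background: every edge with at least one endpoint in either set.
    background = [tuple(e) for e in weights if touched.intersection(e)]
    baseline = mean(weights.values())
    return (mean_weight(weights, between) / mean_weight(weights, background)) * (
        baseline / mean_weight(weights, within)
    )
```

In a network where every edge has the same weight, all four measures coincide and the score is 1, which matches the intuition that a score near 1 means no association beyond background.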

22

Functional mapping: Bootstrap p-values

• Scoring functional associations is great…
  …how do you interpret an association score?
  – For gene sets of arbitrary sizes?
  – In arbitrary graphs?
  – Each with its own bizarre distribution of edges?

Empirically!

[Figure: histograms of FAs for random sets, for gene set sizes 1, 5, 10, and 50]

For any graph, compute FA scores for many randomly chosen gene sets of different sizes.

Null distribution is approximately normal with mean 1.

Standard deviation is asymptotic in the sizes of both gene sets.

Maps FA scores to p-values for any gene sets and underlying graph.

[Figure: null distribution σs for one graph, plotted against |G1| and |G2| on logarithmic axes]

$$\hat{\mu}_{FA}(G_i, G_j) = 1$$

$\hat{\sigma}_{FA}(G_i, G_j)$ is a fitted function of $|G_i|$ and $|G_j|$ with parameters A, B, and C.

$$P(FA_{G_1,G_2} > x) = 1 - \Phi_{\hat{\mu}(G_1,G_2),\,\hat{\sigma}(G_1,G_2)}(x)$$
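The empirical procedure can be sketched as follows. The sketch draws random disjoint gene sets, scores them with whatever association function is supplied, fits a normal distribution to the scores, and reads a one-sided p-value from it. The names `null_distribution` and `p_value`, the sample count, and the fixed seed are assumptions for illustration; the slides describe fitting the standard deviation as a function of set sizes, which is not reproduced here.

```python
import random
from statistics import NormalDist, mean, stdev


def null_distribution(score, genes, size1, size2, n_samples=200, seed=0):
    """Fit a normal null to scores of random disjoint gene sets.

    score is any function taking two gene sets and returning a number.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        picked = rng.sample(genes, size1 + size2)
        samples.append(score(set(picked[:size1]), set(picked[size1:])))
    return NormalDist(mean(samples), stdev(samples))


def p_value(null, observed):
    """One-sided p-value: probability of a score above the observed one."""
    return 1.0 - null.cdf(observed)
```

A separate null would be fitted for each pair of set sizes, since the spread of scores for random sets depends on how many genes each set holds.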

23

Functional Associations Between Processes

[Figure: functional map of biological processes. Edges: associations between processes (moderately strong to very strong). Nodes: cohesiveness of processes (below baseline, baseline (genomic background), very cohesive). Borders: data coverage of processes (sparsely covered to well covered).]

Processes shown: Hydrogen Transport, Electron Transport, Cellular Respiration, Protein Processing, Peptide Metabolism, Cell Redox Homeostasis, Aldehyde Metabolism, Energy Reserve Metabolism, Vacuolar Protein Catabolism, Negative Regulation of Protein Metabolism, Organelle Fusion, Protein Depolymerization, Organelle Inheritance

Gene callouts: AHP1, DOT5, GRX1, GRX2, …; APE3, LAP4, PAI3, PEP4, …

26

Functional Maps: Focused Data Summarization

Data integration summarizes an impossibly huge amount of experimental data into an impossibly huge number of predictions; what next?

27

Functional Maps: Focused Data Summarization

How can a biologist take advantage of all this data to study his/her favorite gene/pathway/disease without losing information?

Functional mapping
• Very large collections of genomic data
• Specific predicted molecular interactions
• Pathway, process, or disease associations
• Underlying experimental results and functional activities in data

28

Outline

1. Methodology: Algorithms for mining genome-scale datasets

2. Applications: Human molecular data and clinical cancer cohorts

3. Next steps: Methods for microbial communities and functional metagenomics

29

HEFalMp: Predicting human gene function

HEFalMp

30

HEFalMp: Predicting human genetic interactions

HEFalMp

31

HEFalMp: Analyzing human genomic data

HEFalMp

32

HEFalMp: Understanding human disease

HEFalMp

33

Validating Human Predictions

Autophagy

[Figure: panels for Luciferase (negative control), ATG5 (positive control), LAMP2, and RAB11A under not starved and starved (autophagic) conditions]

Predicted novel autophagy proteins

5½ of 7 predictions currently confirmed

With Erin Haley, Hilary Coller

34

Current Work: Molecular Mechanisms in a Colon Cancer Cohort

With Shuji Ogino, Charlie Fuchs

Health Professionals Follow-Up Study, Nurses' Health Study

~3,100 gastrointestinal subjects
~3,800 tissue samples
~1,450 colon cancer samples
~1,150 CpG island methylation
~1,200 LINE-1 methylation
~700 TMA immunohistochemistry
~2,100 cancer mutation tests
~775 gene expression

LINE-1 Methylation
• Repetitive element making up ~20% of mammalian genomes
• Very easy to assay methylation level (%)
• Good proxy for whole-genome methylation level

DASL Gene Expression
• Gene expression analysis from paraffin blocks
• Thanks to Todd Golub, Yujin Hoshida

35

Colon Cancer: LINE-1 methylation levels

[Figure: LINE-1 Methylation in Multiple Tumors from the Same Subject; x-axis: Methylation %, Tumor #1; y-axis: Methylation %, Tumor #2; both axes 30 to 80; ρ = 0.718, p < 0.01; Ogino et al, 2008]

Lower LINE-1 methylation associates with poor colon cancer prognosis.

LINE-1 methylation varies remarkably between individuals…

…but it is highly correlated within individuals.

What does it all mean?

This suggests a genetic effect.

This suggests a copy number variation.

This suggests linkage to a cancer-related pathway.

Is anything different about these outliers?

What is the biological mechanism linking LINE-1 methylation to colon cancer?

With Shuji Ogino, Charlie Fuchs

37

Colon Cancer: LINE-1 methylation levels

What is the biological mechanism linking LINE-1 methylation to colon cancer?

Preliminary Data
• Six genes differentially expressed even using naïve methods
• One uncharacterized, one oncogene, three malignancy, one histone
• 1/3 are from a family with known variable GI expression, prognostic value
• 2/3 fall in same cytogenetic band, which is also a known CNV hotspot
• HEFalMp links to a set of transmembrane receptors/channels
• Better analysis pulls out mostly one-carbon metabolism and a few more signaling pathways (neurotransmitters??)

Check back in a couple of months!

38

Outline

1. Methodology: Algorithms for mining genome-scale datasets

2. Applications: Human molecular data and clinical cancer cohorts

3. Next steps: Methods for microbial communities and functional metagenomics

39

Next Steps: Microbial Communities

• Data integration is off to a great start in humans
  – Complex communities of distinct cell types
  – Very sparse prior knowledge
    • Concentrated in a few specific areas
  – Variation across populations
  – Critical to understand mechanisms of disease

40

Next Steps: Microbial Communities

• What about microbial communities?
  – Complex communities of distinct species/strains
  – Very sparse prior knowledge
    • Concentrated in a few specific species/strains
  – Variation across populations
  – Critical to understand mechanisms of disease

41

Next Steps: Functional Metagenomics

• Metagenomics: data analysis from environmental samples
  – Microflora: environment includes us!
• Another data integration problem
  – Must include datasets from multiple organisms
• Another context-specificity problem
  – Now “context” can also mean “species”
• What questions can we answer?
  – How do human microflora interact with diabetes, obesity, oral health, antibiotics, aging, …
  – What’s shared within community X? What’s different? What’s unique?
  – What’s perturbed in disease state Y? One organism, or many? Host interactions?
  – Current methods annotate ~50% of synthetic data, <5% of environmental data

[Figure: network of genes from multiple organisms: PKH1, PKH2, PKH3, LPD1, CAR1, W04B5.5, pdk-1, R04B3.2, LLC1.3, T21F4.1, PDPK1, ARG1, ARG2, DLD, AGA]

42

Next Steps: Microbial Communities

~120 available expression datasets

~70 species

[Figure: network of genes from multiple organisms: PKH1, PKH2, PKH3, LPD1, CAR1, W04B5.5, pdk-1, R04B3.2, LLC1.3, T21F4.1, PDPK1, ARG1, ARG2, DLD, AGA]

Weskamp et al 2004

Flannick et al 2006

Kanehisa et al 2008

Tatusov et al 1997

• Data integration works just as well in microbes as it does in humans
• We know an awful lot about some microorganisms and almost nothing about others
• Purely sequence-based and purely network-based tools for function transfer both fall short
• We need data integration to take advantage of both and mine out useful biology!

43

Functional Maps for Functional Metagenomics

[Figure: genes YG1 to YG17 mapped to groups KO1 to KO9]

KO1: YG1, YG2, YG3
KO2: YG4
KO3: YG6
…

ECG1, ECG2
PAG1
ECG3, PAG2
…

44

Functional Maps for Functional Metagenomics

45

Validating Orthology-Based Functional Mapping

Does unweighted data integration predict functional relationships?

What is the effect of “projecting” through an orthologous space?

[Figure: four panels, two evaluated against KEGG and two against GO; x-axis: Recall; y-axis: log(Precision/Random); curves: individual datasets and unsupervised integration]

46

Validating Orthology-Based Functional Mapping

[Figure: genes YG1 to YG17 split into a holdout set (uncharacterized “genome”) and random subsets (characterized “genomes”)]

47

Validating Orthology-Based Functional Mapping

48

Validating Orthology-Based Functional Mapping

Can subsets of the yeast genome predict a heldout subset’s functional maps?

Can subsets of the yeast genome predict a heldout subset’s interactome?

[Figure: panels evaluated against KEGG and GO; values shown: 0.68, 0.48, 0.39, 0.25, 0.30, 0.37, 0.27, 0.39, 0.43, 0.40]

What have we learned?
• Yeast is incredibly well-curated
• KEGG tends to be more specific than GO
• Predicting interactomes by projecting through functional maps works decently in the absolute best case

49

Functional Maps for Functional Metagenomics

Now, what happens if you do this for characterized microbes?

• ~20 (somewhat) well-characterized species
• 1-35 datasets each
• Integrate within species
• Evaluate using KEGG
• Then cross-validate by holding out species

[Figure: unsupervised integrations evaluated against KEGG; x-axis: Recall; y-axis: log(Precision/Random)]

50

Next Steps: Missing Methodology, Mining

• Most machine learning algorithms are optimized for one of two cases:
  – Small, dense data
  – Large, sparse data
• HEFalMp integrates ~300M records using ~1K features, relatively few of which are missing, in ~200 contexts

Feature selection

Regularization

Dimension reduction

Simple models, efficient algorithms

Slightly less

51

Next Steps: Missing Methodology, Models

[Figure: network linking a Functional Relationship node to Dataset #1, Dataset #2, Dataset #3, …, with Biological Context as an additional node]

Additional nodes: Cellular Processes, Tissue/Cell Lineage, Disease State, Developmental Stage, Cross-Species Orthology, Types of Interactions, Regulation

This is clearly not a sustainable system; novel large-scale hierarchical modeling is needed to capture the complex biology of metazoan and metagenomic interaction networks.

54

Efficient Computation For Biological Discovery

Massive datasets and genomes require efficient algorithms and implementations.

• Sleipnir C++ library for computational functional genomics
• Data types for biological entities
  • Microarray data, interaction data, genes and gene sets, functional catalogs, etc.
• Network communication, parallelization
• Efficient machine learning algorithms
  • Generative (Bayesian) and discriminative (SVM)
• And it’s fully documented!

It’s also speedy: improves on Bayes Net Toolbox by ~22x in memory usage and up to >100x in runtime.

[Figure: original vs. current processing times; values shown: 8 hours, 1 minute, 30 years, 2 months, 18 hours, 2-3 hours]

56

Outline

1. Methodology: Algorithms for mining genome-scale datasets
• Bayesian system for genomic data integration
• Sleipnir software for efficient large scale data mining
• Functional mapping to statistically summarize large data collections

2. Applications: Human molecular data and clinical cancer cohorts
• HEFalMp system for human data analysis and integration
• Six confirmed predictions in autophagy
• Ongoing analysis of LINE-1 methylation in colon cancer

3. Next steps: Methods for microbial communities and functional metagenomics
• Data integration applied to microbial communities and functional metagenomics
• Efficient machine learning for large, dense feature spaces

57

Thanks!

Olga Troyanskaya, Matt Hibbs, Chad Myers, David Hess, Edo Airoldi, Florian Markowetz

Hilary Coller, Erin Haley, Tsheko Mutungu

Shuji Ogino, Charlie Fuchs

NIGMS

http://function.princeton.edu/sleipnir

http://function.princeton.edu/hefalmp

Interested? We’re looking for students and postdocs!
Biostatistics Department
http://huttenhower.sph.harvard.edu

59

Colon Cancer: Immunohistochemistry

The world’s smallest, cheapest microarray!

        Tumor #1   Tumor #2   …   Tumor #700
AKT1    0          11             55
AURKA   0          5              0
CCND1   25         0              30
…       …

Rows are genes, columns are conditions, and cell values are quantities.

What is the biological mechanism linking LINE-1 methylation to colon cancer?

What does the IHC data tell us about LINE-1 hypomethylation?

60

Colon Cancer: Immunohistochemistry

The world’s smallest, cheapest microarray!

~700 Tumor Samples: LINE-1 hypomethylated outliers vs. LINE-1 methylation “normal”

[Figure: IHC pseudoexpression (0 to 80) for tumors with low vs. normal LINE-1 methylation; genes: STAT3, VDR, HIF1A, CDKN1B, AURKA, MAPK, CDX2, DNMT1, PPARG, CDK8, CTSB, PTEN, CCND1]

What is the biological mechanism linking LINE-1 methylation to colon cancer?

What does the IHC data tell us about LINE-1 hypomethylation?

Can existing microarrays amplify the LINE-1 hypomethylation signal?

61

Colon Cancer: Mining Microarrays

[Figure: log2(Low / Normal), from -0.6 to 1.2, for STAT3, VDR, HIF1A, CDKN1B, AURKA, MAPK, CDX2, DNMT1, PPARG, CDK8, CTSB, PTEN, CCND1]

~650 datasets, ~15,000 expression conditions, ~24,000 genes

26 genes in signature

[Figure: datasets ranked from most like our 26-gene LINE-1 differential methylation signature to least like the signature]

What is the biological mechanism linking LINE-1 methylation to colon cancer?

What does the IHC data tell us about LINE-1 hypomethylation?

Can existing microarrays amplify the LINE-1 hypomethylation signal?

Identify microarray datasets with conditions enriched for LINE-1 hypomethylation.

62

Colon Cancer: Mining Microarrays

“The goal of GSEA is to determine whether members of a gene set S tend to occur toward the top (or bottom) of the list L.”

Subramanian et al, 2005

[Figure: datasets ranked from most like our 26-gene LINE-1 differential methylation signature to least like the signature]

Datasets shown:
• Bleomycin effect on mutagen-sensitive lymphoblastoid cells
• Folic acid deficiency effect on colon cancer cells
• Bladder tumor stage classification
• Normal tissue of diverse types
• Muscle function and aging
• Non-diseased lung tissue

What is the biological mechanism linking LINE-1 methylation to colon cancer?

What does the IHC data tell us about LINE-1 hypomethylation?

Can existing microarrays amplify the LINE-1 hypomethylation signal?

Identify microarray datasets with conditions enriched for LINE-1 hypomethylation.

What CNV-linked genes are differentially expressed in these datasets?

Dataset 1: Condition X, Condition Y, Condition Z

Dataset 2: Condition A, Condition B, Condition C, Condition D, Condition E
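The quoted GSEA goal can be illustrated with a running-sum statistic. The sketch below is the unweighted (Kolmogorov-Smirnov-like) variant, a simplification of the weighted score in Subramanian et al; the function name `enrichment_score` is an assumption for illustration, and no significance testing is included.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score of gene_set in a ranked list.

    Walks down the list, stepping up at members of the set and down at
    non-members, and returns the largest deviation from zero. Positive
    values mean the set sits near the top, negative near the bottom.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running, extreme = 0.0, 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(extreme):
            extreme = running
    return extreme


ranked = list("ABCDEFGH")
print(enrichment_score(ranked, {"A", "B"}))
print(enrichment_score(ranked, {"G", "H"}))
```

In the setting of these slides, the ranked list would be the genes of one expression condition ordered by differential expression, and the gene set would be the 26-gene signature.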

63

Colon Cancer: Mining Microarrays

“The goal of GSEA is to determine whether members of a gene set S tend to occur toward the top (or bottom) of the list L.”

Subramanian et al, 2005

What is the biological mechanism linking LINE-1 methylation to colon cancer?

What does the IHC data tell us about LINE-1 hypomethylation?

Can existing microarrays amplify the LINE-1 hypomethylation signal?

Identify microarray datasets with conditions enriched for LINE-1 hypomethylation.

What CNV-linked genes are differentially expressed in these datasets?

CNV 1: Gene X, Gene Y, Gene Z

CNV 2: Gene A, Gene B, Gene C, Gene D, Gene E

Most upregulated in significantly enriched datasets

Most downregulated

PSGs (11 genes on 19q13.3)

PCDHs (~50 genes on 5q31.3)

Misc. ~12 genes on 16p13.3

Iafrate et al, 2005

?

64

Colon Cancer: Mining Microarrays

What is the biological mechanism linking LINE-1 methylation to colon cancer?

What does the IHC data tell us about LINE-1 hypomethylation?

Can existing microarrays amplify the LINE-1 hypomethylation signal?

Identify microarray datasets with conditions enriched for LINE-1 hypomethylation.

What CNV-linked genes are differentially expressed in these datasets?

Iafrate et al, 2005

Pregnancy specific β glycoproteins

Salahshor et al, 2005: Differential gene expression profile reveals deregulation of pregnancy specific β1 glycoprotein 9 early during colorectal carcinogenesis

“PSG9 is not found in the non-pregnant adult except in association with cancer, and it appears to be an early molecular event associated with colorectal cancer.”

65

Colon Cancer: Generating a Hypothesis

What is the biological mechanism linking LINE-1 methylation to colon cancer?

What does the IHC data tell us about LINE-1 hypomethylation?

Can existing microarrays amplify the LINE-1 hypomethylation signal?

Identify microarray datasets with conditions enriched for LINE-1 hypomethylation.

What CNV-linked genes are differentially expressed in these datasets?

Pregnancy specific β glycoproteins

66

Colon Cancer: Using All the Data

What is the biological mechanism linking LINE-1 methylation to colon cancer?

What does the IHC data tell us about LINE-1 hypomethylation?

Can existing microarrays amplify the LINE-1 hypomethylation signal?

Identify microarray datasets with conditions enriched for LINE-1 hypomethylation.

What CNV-linked genes are differentially expressed in these datasets?

Pregnancy specific β glycoproteins

GI cancers and chemotherapy

Yes (caveat investigator)

Get back to me in a couple of months…

What’s the state of the data?
• Extremely hypomethylated colon cancer carries a significantly poor prognosis
• In our cohort, these ~20 tumors are weakly enriched for a protein activity signature based on IHC
• The expression datasets most enriched for the same signature represent mainly GI cancer and chemotherapy conditions
• The PSG gene family is upregulated in these datasets and is linked to a known CNV
• HEFalMp associates the PSGs with cancer based on correlation with known colorectal cancer genes in a variety of expression datasets

Nothing definite – yet.

68

Human Regulatory Networks

Quiescence: reversible exit from the cell cycle

[Figure: expression heatmap of 6,829 genes in clusters G0 and I to X across serum starved and serum re-stimulated time courses (hrs: 1, 2, 4, 8, 24, 96 and 1, 2, 4, 8, 24, 48); cluster annotations: Development, Development, Cholesterol, Protein localization, Cell cycle, RNA processing, Metabolism]

FIRE: Elemento et al. 2007

Elk-1, Sp1, NF-Y, YY1

• Of only five regulators found, four have generic cell cycle/proliferation targets
• Just five basic regulators for ~7,000 genes?
• These motifs only appear upstream of ~half of the genes

69

Regulatory Modules: Expression Biclusters + Sequence Motifs

[Figure: expression matrix with genes CRG1 to CRG4 and RND1 to RND8 as rows and conditions 1 to 8 as columns, reordered to reveal a bicluster]

Bicluster: Coregulated subset of genes and conditions

…any gene or condition can participate in multiple biclusters…

…any dataset can contain many overlapping biclusters…

…do all that, and simultaneously find (under)enriched sequence motifs!

72

COALESCE: Combinatorial Algorithm for Expression and Sequence-based Cluster Extraction

Inputs
• Gene Expression
• DNA Sequence (5’ UTR, 3’ UTR, upstream flank, downstream flank)
• Evolutionary Conservation
• Nucleosome Positions

Steps (from the flowchart)
• Identify conditions where genes coexpress
• Identify motifs enriched in genes’ sequences
• Create a new module
• Select genes based on conditions and motifs
• Subtract mean from all data

Feature selection: Tests for differential expression/frequency

Bayesian integration

Regulatory modules
• Coregulated genes
• Conditions where they’re coregulated
• Putative regulating motifs

73

COALESCE: Selecting Coexpressed Conditions

• For each gene expression condition…
  – Compare distributions of values for
    • Genes in the module versus
    • Genes not in the module
  – If significantly different, include the condition

Preserving data structure:
• If multiple conditions derive from the same dataset, can be included/excluded as a unit
• For example, time course vs. deletion collection
• Test using multivariate z-test
• Precalculate covariance matrix; still very efficient
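The per-condition comparison can be sketched with an ordinary two-sample z-test. This is a simplified univariate illustration, not COALESCE's implementation: it tests each condition on its own rather than grouping conditions from one dataset into a multivariate test, and the names `condition_p_value` and `select_conditions`, the nested-dictionary layout, and the 0.05 threshold are assumptions.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def condition_p_value(values_in, values_out):
    """Two-sided z-test p-value for a difference in means.

    Needs at least two values per group and nonzero combined variance.
    """
    standard_error = sqrt(
        variance(values_in) / len(values_in) + variance(values_out) / len(values_out)
    )
    z = (mean(values_in) - mean(values_out)) / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def select_conditions(expression, module, alpha=0.05):
    """Keep conditions whose values differ between module and non-module genes.

    expression maps condition name -> {gene name: expression value}.
    """
    selected = []
    for condition, by_gene in expression.items():
        inside = [v for g, v in by_gene.items() if g in module]
        outside = [v for g, v in by_gene.items() if g not in module]
        if condition_p_value(inside, outside) < alpha:
            selected.append(condition)
    return selected
```

Testing related conditions as a unit, as the slide describes, would replace the scalar variance here with a covariance matrix across the grouped conditions.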

74

COALESCE: Selecting Significant Motifs

• COALESCE looks for three kinds of motifs:
  – K-mers
  – Reverse complement pairs
  – Probabilistic Suffix Trees (PSTs)
• For every possible motif…
  – Compare distributions of values for
    • Genes in the module versus
    • Genes not in the module
  – If significantly different, include the motif

Examples: ACGACGT (k-mer); ACGACAT | ATGTCGT (reverse complement pair)

[Figure: example probabilistic suffix tree over the nucleotides A, C, G, T]

• This can distinguish flanks from UTRs
• Fast!
• Efficient enough to search coding sequence (e.g. exons/introns)
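The first two motif types can be illustrated by counting k-mers in a sequence and merging each k-mer with its reverse complement. This is a minimal sketch; the function names are hypothetical, and using the alphabetically smaller member of each pair as its key is a choice made here for illustration. Probabilistic suffix trees are not covered.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def rc_pair_counts(seq, k):
    """Count k-mers with each one merged into its reverse complement pair."""
    merged = Counter()
    for kmer, n in kmer_counts(seq, k).items():
        merged[min(kmer, reverse_complement(kmer))] += n
    return merged


print(reverse_complement("ACGACAT"))
print(rc_pair_counts("ACGACGT", 3))
```

Per-gene counts like these would then be compared between genes inside and outside the module, in the same way as expression values.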

75

COALESCE: Selecting Probable Genes

• For each gene in the genome…
  For each significant condition…
  For each significant motif…
  – What’s the probability the gene came from the module’s distribution?
  – What’s the probability that it came from outside the module?

The probability of a gene being in the module given some data…

$$P(g \in M \mid D) = \frac{P(D \mid g \in M)\,P(g \in M)}{P(D \mid g \in M)\,P(g \in M) + P(D \mid g \notin M)\,P(g \notin M)}$$

Distributions of each feature in and out of the developing module are observed from the data.

Prior is used to stabilize module convergence; genes already in the module are more likely to stay there next iteration.
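Bayes' rule for module membership can be sketched numerically. The example below assumes the selected features are independent and normally distributed inside and outside the module, which is an illustrative simplification rather than a statement of how COALESCE models its data; `posterior_in_module` and its parameters are hypothetical names.

```python
from statistics import NormalDist


def posterior_in_module(values, in_dists, out_dists, prior=0.5):
    """P(gene in module | feature values) under a naive Bayes model.

    values holds one number per selected feature; in_dists and out_dists
    hold the matching NormalDist for genes inside and outside the module.
    """
    like_in, like_out = prior, 1.0 - prior
    for value, d_in, d_out in zip(values, in_dists, out_dists):
        like_in *= d_in.pdf(value)
        like_out *= d_out.pdf(value)
    return like_in / (like_in + like_out)


inside = [NormalDist(1.0, 1.0)]
outside = [NormalDist(-1.0, 1.0)]
print(posterior_in_module([1.0], inside, outside))
# A higher prior for genes already in the module makes them harder to evict.
print(posterior_in_module([0.0], inside, outside, prior=0.9))
```

With many features, the products of densities can underflow, so a practical version would sum log-densities instead.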

76

COALESCE: Integrating Additional Data Types

Nucleosome placement, evolutionary conservation

• Can be included as additional datasets and feature selected just like expression conditions/motifs.
• Or can be used as a prior or weight on the values of individual motifs.

     N    C
G1   2.5  0.0
G2   0.6  0.5
G3   1.2  0.9
…    …    …

TCCGGTAGAACTACTGGTATTGTTTTGGATTCCGGTGATG

77

COALESCE Results: S. cerevisiae Modules

The haystack: ~2,200 conditions, ~6,000 genes

A needle: 100 genes, 80 conditions

78

COALESCE Results: Yeast TF/Target Accuracy

[Figure: Z-score (-0.3 to 1.3) per transcription factor for COALESCE, cMonkey, FIRE, and Weeder; factors: Bas1p, Hap4p, Met32p, Cup2p, Met31p, Zap1p, Upc2p, Mbp1p, Hsf1p, Gln3p, Hap3p, Gcn4p, Uga3p, Gis1p, Hap5p]

79

COALESCE Results: Yeast Clustering Accuracy

• ~2,200 yeast conditions
  – Recapitulation of known biology from Gene Ontology

ASCL1 in 5’ flank, unch. sequences underenriched in 3’ UTR

M. musculus: Up in callosal and motor neurons

C. elegans: Up in larvae, down in adults

GATA in 5’ flank, miR-788 seed in 3’ UTR

AAGGGGC (zf?) and enriched in 5’ flank

H. sapiens: Up in normal muscle, down in diabetic

81

COALESCE: Coregulated Quiescence Modules

Down during quiescence entry, up during quiescence exit, down with adenoviral infection

Specific predicted uncharacterized reverse complement motif

Up during quiescence entry, down during quiescence exit

Many known related (proliferation) motifs: Pax4, Staf, NFKB1, Gfi, ESR1, Runx1, Su(H)

Down during quiescence entry, enriched for transport/trafficking

miR-297 motif predicted in 3’ UTR (CACATAC)

Down with let-7 exposure

let-7 motifs predicted in 3’ UTR (UACCUC)

82

Summary

• COALESCE algorithm for regulatory module prediction
  – Biclustering + putative de novo motifs
  – Optimized for complex organisms (fast!)
    • Large genomes, large data collections
  – High accuracy, low false positives
  – Leverage prior knowledge, multiple data types
