
Joint analysis of regulatory networks and expression profiles

Ron Shamir

School of Computer Science, Tel Aviv University

April 2013

Sources:
Igor Ulitsky and Ron Shamir. Identification of Functional Modules using Network Topology and High-Throughput Data. BMC Systems Biology 1:8 (2007).
Igor Ulitsky and Ron Shamir. Identifying functional modules using expression profiles and confidence-scored protein interactions. Bioinformatics 25(9):1158-1164 (2009).

http://www.tau.ac.il/

Outline
• Background
• Joint network and expression profiles
– Matisse
– Cezanne


Background

DNA → RNA → protein
(DNA to RNA: transcription; RNA to protein: translation)

• DNA: the hard disk
• RNA: one program
• Protein: its output


DNA Microarrays / RNA-seq

• Simultaneous measurement of expression levels of all genes / transcripts.

• Perform 10^5-10^9 measurements in one experiment

• Allow global view of cellular processes.

• Among the most important biotechnological breakthroughs of the last/current decade

http://www.biomedcentral.com/1471-2105/12/323/figure/F2


The Raw Data

Matrix layout: rows = genes, columns = experiments.

Entries of the Raw Data matrix: expression levels (ratios / absolute values / …).

• Expression pattern for each gene
• Profile for each experiment/condition/sample/chip

Needs normalization!
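A minimal sketch of the matrix layout above: each row is a gene's expression pattern, each column is an experiment's profile. The toy values and the function names are hypothetical, and z-scoring each gene is only one possible normalization; the slides do not say which one is used.

```python
import math

# Hypothetical toy matrix: rows are genes, columns are experiments.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}

def standardize(pattern):
    """Shift a gene's pattern to mean 0 and scale it to standard deviation 1.

    Assumes the pattern is not constant (a constant pattern has sd 0).
    """
    n = len(pattern)
    mean = sum(pattern) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in pattern) / n)
    return [(x - mean) / sd for x in pattern]

def experiment_profile(matrix, column):
    """Read one column: the profile of a single experiment across all genes."""
    return {gene: values[column] for gene, values in matrix.items()}

normalized = {gene: standardize(values) for gene, values in expression.items()}
```

After standardization, geneA and geneB have the same pattern even though their absolute levels differ, which is the property a co-expression analysis relies on.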



EXPANDER: EXPression ANalyzer and DisplayER

http://acgt.cs.tau.ac.il/expander

• Clustering: identify clusters of co-expressed genes (CLICK, KMeans, SOM, hierarchical)
• Biclustering: identify homogeneous submatrices (SAMBA)
• Functional enrichment (GO, TANGO)
• Promoter analysis: analyze TF binding sites of co-regulated genes (PRIMA)
• microRNA function inference (FAME)
• Visualization

A. Maron, R. Sharan Bioinformatics 03
A. Maron-Katz, A. Tanay, C. Linhart, I. Steinfeld, R. Sharan, Y. Shiloh, R. Elkon BMC Bioinformatics 05
Ulitsky et al. Nature Protocols 10


Networks of Protein-protein interactions (PPIs)

• Large, readily available resource
• Representation: network with nodes = proteins/genes, edges = interactions

Analysis methods:
• Global properties
• Motif content analysis
• Complex extraction
• Cross-species comparison
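A minimal sketch of the representation above, storing the network as an adjacency map and reading off one simple global property (node degree). The protein names and the `build_network` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical interaction list: each pair is one undirected PPI edge.
interactions = [("P1", "P2"), ("P2", "P3"), ("P3", "P1"), ("P3", "P4")]

def build_network(edges):
    """Build an undirected adjacency map: node -> set of interaction partners."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return dict(adjacency)

network = build_network(interactions)

# Degree of each protein, a basic ingredient of global-property analyses.
degree = {protein: len(partners) for protein, partners in network.items()}
```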


The hairball syndrome



• Potential inroad into pathways and function

• Can the network help to improve the analysis?



Analysis of gene expression profiles + a network

Goal

• Challenge: detect active functional modules, i.e. connected subnetworks of proteins whose genes are co-expressed

• “Where is the action in the network in a particular experiment?”



Ulitsky & Shamir, BMC Systems Biology 07


MATISSE: Modular Analysis for Topology of Interactions and Similarity SEts

http://acgt.cs.tau.ac.il/matisse

• Input: Expression data and a PPI network
• Output: a collection of modules
– Connected PPI subnetworks
– Correlated expression profiles

Figure legend: Interaction; High expression similarity


Probabilistic model
• $S_{ij}$: expression similarity of genes i and j
• Event $M_{ij}$: i, j are mates = highly co-expressed
• $P(S_{ij} \mid M_{ij}) \sim N(\mu_m, \sigma_m^2)$
• $P(S_{ij} \mid \overline{M}_{ij}) \sim N(\mu_n, \sigma_n^2)$
• $H_0$: U is a set of unrelated genes
• $H_1$: U is a module = connected subnetwork with high internal similarity
• $R_i$: gene i is transcriptionally regulated
• $\beta_m$: fraction of mates out of module gene pairs that are transcriptionally regulated
• $\beta_m = P(M_{ij} \mid R_i \wedge R_j, H_1)$
• $p_m$: fraction of mates out of all gene pairs that are transcriptionally regulated


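Probabilistic model (2)
• Is connected gene set U a module?
• Assuming pair independence:
• Define $\beta^m_{ij} = \beta_m P(R_i) P(R_j)$
• Define $\beta^n_{ij} = p_m P(R_i) P(R_j)$
• Likelihood ratio: $\Pr(\text{Data} \mid H_1) / \Pr(\text{Data} \mid H_0)$
• Taking log: a sum of per-pair terms $w_{ij}$:

$$\log \frac{\Pr(\text{Data} \mid H_1)}{\Pr(\text{Data} \mid H_0)} = \sum_{(i,j) \in U \times U} w_{ij}$$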


Probabilistic model - summary

• Similarities: mixture of two Gaussians
• For a candidate group U, the log-likelihood ratio of originating from a module ($H_M^U$) or from the background ($H_N^U$) is

$$W_U = \log \frac{P(S_U \mid H_M^U)}{P(S_U \mid H_N^U)} = \sum_{(i,j) \in U \times U} \log \frac{P(S_{ij} \mid H_M^U)}{P(S_{ij} \mid H_N^U)} = \sum_{(i,j) \in U \times U} w_{ij}$$

• Module score = gene group likelihood ratio = sum over all the gene pairs
• Find connected subgraphs U with high $W_U$
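The score can be illustrated with a short sketch. It assumes one reading of the model: under the module hypothesis a pair is mates with probability β_m·P(R_i)·P(R_j), under the background hypothesis with probability p_m·P(R_i)·P(R_j), and the similarity density is the corresponding mixture of the mates and non-mates Gaussians. The function names and the parameter values in the test are hypothetical, and the sum runs over unordered pairs.

```python
import math
from itertools import combinations

def normal_pdf(x, mean, sd):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def pair_weight(s_ij, r_i, r_j, beta_m, p_m, mates, non_mates):
    """Log-likelihood ratio w_ij of one similarity value (assumed mixture form).

    mates / non_mates are (mean, sd) tuples of the two Gaussians;
    r_i, r_j are the probabilities that genes i and j are regulated.
    """
    f_mates = normal_pdf(s_ij, *mates)
    f_other = normal_pdf(s_ij, *non_mates)
    b_module = beta_m * r_i * r_j      # mate probability inside a module
    b_background = p_m * r_i * r_j     # mate probability in the background
    numerator = b_module * f_mates + (1.0 - b_module) * f_other
    denominator = b_background * f_mates + (1.0 - b_background) * f_other
    return math.log(numerator / denominator)

def module_score(genes, similarity, regulated, beta_m, p_m, mates, non_mates):
    """W_U: sum of w_ij over all unordered gene pairs in the candidate set."""
    return sum(
        pair_weight(similarity[frozenset((i, j))], regulated[i], regulated[j],
                    beta_m, p_m, mates, non_mates)
        for i, j in combinations(genes, 2)
    )
```

Under this reading, a pair involving a gene with P(R) = 0 contributes exactly zero, so genes without meaningful expression data neither help nor hurt a module's score.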


Complexity

• Finding the heaviest connected subgraph: NP-hard even without connectivity constraints (+/- edge weights)

• Devised a heuristic for the problem


MATISSE workflow

• Seed generation
• Greedy optimization
• Significance filtering

Finding seeds

• Three seeding alternatives tested
• All alternatives build a seed and delete it from the network
• Building small seeds around single nodes:
– Best neighbors
– All neighbors
• Approximating the heaviest subgraph:
– Delete low-degree nodes and record the heaviest subnetwork found
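The third seeding alternative can be sketched as a peeling procedure. This is an illustration under assumptions, not the MATISSE implementation: "low-degree" is read as lowest weighted degree with respect to the pair weights w_ij, connectivity is ignored, and the function name is hypothetical.

```python
def peel_heaviest_subgraph(nodes, weight):
    """Repeatedly delete the node of lowest weighted degree; keep the best set seen.

    weight maps frozenset({u, v}) -> w_uv (may be negative); missing pairs count as 0.
    Returns (best_node_set, best_total_weight).
    """
    current = set(nodes)
    degree = {
        u: sum(weight.get(frozenset((u, v)), 0.0) for v in current if v != u)
        for u in current
    }
    total = sum(degree.values()) / 2.0      # every pair was counted twice
    best_set, best_score = set(current), total
    while len(current) > 1:
        # Lowest weighted degree first; ties broken by node name for determinism.
        u = min(current, key=lambda x: (degree[x], x))
        current.remove(u)
        total -= degree[u]                  # drop all pairs that involve u
        for v in current:
            degree[v] -= weight.get(frozenset((u, v)), 0.0)
        del degree[u]
        if total > best_score:
            best_set, best_score = set(current), total
    return best_set, best_score
```

A full seeding step would also have to enforce connectivity in the PPI network and delete the chosen seed from the network before looking for the next one, as the bullets above describe.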

Greedy optimization

• Simultaneous optimization of all the seeds

• The following steps are considered:
– Node addition
– Node removal
– Assignment change
– Module merge
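A simplified sketch of two of the moves listed above (node addition and node removal) for a single module. It assumes the seed is connected, that the score is the sum of pair weights inside the module, and that a move is accepted only if it raises the score and keeps the module connected in the network. Assignment changes, module merges and the simultaneous handling of several seeds are omitted; all names are hypothetical.

```python
from collections import deque

def is_connected(members, adj):
    """Breadth-first search restricted to `members`."""
    members = set(members)
    if not members:
        return False
    start = next(iter(members))
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v in members and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen == members

def attachment(node, members, weight):
    """Sum of pair weights between `node` and the other members."""
    return sum(weight.get(frozenset((node, v)), 0.0) for v in members if v != node)

def greedy_improve(seed, adj, weight):
    """Apply the best score-increasing add/remove move until none is left."""
    module = set(seed)
    while True:
        best_gain, best_move = 0.0, None
        # Candidates for addition: network neighbours of the module.
        neighbours = {v for u in module for v in adj.get(u, ())} - module
        for v in neighbours:
            gain = attachment(v, module, weight)
            if gain > best_gain:
                best_gain, best_move = gain, ("add", v)
        # Candidates for removal: members whose removal keeps the module connected.
        for u in module:
            rest = module - {u}
            gain = -attachment(u, rest, weight)
            if gain > best_gain and rest and is_connected(rest, adj):
                best_gain, best_move = gain, ("remove", u)
        if best_move is None:
            return module
        action, node = best_move
        if action == "add":
            module.add(node)
        else:
            module.remove(node)
```

Because every accepted move strictly increases the score and there are finitely many node sets, the loop always terminates, though only at a local optimum.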

Front vs. Back nodes

• Only a fraction of the genes (front nodes) have meaningful similarity values

• MATISSE can link them using other genes (back nodes).

• Back nodes correspond to:
– Unmeasured transcripts
– Post-translational regulation
– Partially regulated pathways


Test case: Yeast osmotic shock

• Network: 65,990 PPIs & protein-DNA interactions among 6,246 genes

• Expression: 133 experimental conditions – response of perturbed strains to osmotic shock (O’Rourke & Herskowitz 04)

• Front nodes: 2,000 genes with the highest variance
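A minimal sketch of the front-node choice described above: rank genes by the variance of their expression pattern and keep the top k (k = 2,000 in the test case). The function name and the toy data in the test are hypothetical, and population variance is an assumption.

```python
import statistics

def select_front_nodes(expression, k):
    """Return the k genes whose expression patterns have the highest variance.

    expression maps gene -> list of expression levels across conditions.
    """
    ranked = sorted(
        expression,
        key=lambda gene: statistics.pvariance(expression[gene]),
        reverse=True,
    )
    return set(ranked[:k])
```

All remaining genes in the network are then treated as back nodes.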



Pheromone response subnetwork

[Figure: the pheromone response subnetwork, with back nodes and front nodes marked]


Performance comparison

[Figure: two bar charts comparing Matisse, Co-Clustering, CLICK and Random on four annotation sets (GO-Process, GO-Compartment, MIPS Phenotypes, KEGG Pathways). First chart: % of modules with category enrichment at p < 10^-3. Second chart: % of annotations enriched at p < 10^-3 in modules.]


GO and promoter analysis

| Subnetwork | Size | Front | Enriched GO terms | P-value | TFs | P-value |
|---|---|---|---|---|---|---|
| 1 | 120 | 119 | processing of 20S pre-rRNA | < 0.001 | Fhl1 | 4.82E-16 |
| | | | rRNA processing | < 0.001 | Rap1 | 2.89E-11 |
| | | | 35S primary transcript processing | < 0.001 | Sfp1 | 2.98E-08 |
| | | | ribosomal large subunit assembly and maintenance | 0.019 | | |
| | | | rRNA modification | < 0.001 | | |
| | | | ribosome biogenesis | 0.029 | | |
| 2 | 120 | 118 | translational elongation | < 0.001 | Fhl1 | 1.03E-05 |
| 3 | 120 | 118 | processing of 20S pre-rRNA | < 0.001 | | |
| | | | rRNA processing | 0.03 | | |
| | | | 35S primary transcript processing | 0.011 | | |
| | | | ribosomal large subunit assembly and maintenance | 0.019 | | |
| | | | ribosomal large subunit biogenesis | < 0.001 | | |
| 5 | 120 | 112 | signal transduction during filamentous growth | 0.01 | Ste12 | 5.41E-13 |
| | | | conjugation with cellular fusion | < 0.001 | Dig1 | 5.41E-13 |
| 6 | 120 | 99 | transcription from RNA polymerase III promoter | < 0.001 | | |
| | | | transcription from RNA polymerase I promoter | 0.006 | | |
| 7 | 120 | 107 | ergosterol biosynthesis | < 0.001 | | |
| | | | hexose transport | 0.019 | | |
| 8 | 114 | 85 | chromatin remodeling | 0.05 | | |
| 11 | 120 | 114 | pseudohyphal growth | 0.01 | Msn2 | 3.17E-04 |
| | | | response to stress | < 0.001 | Msn4 | 1.82E-12 |
| 14 | 120 | 102 | ubiquitin-dependent protein catabolism | 0.047 | | |
| 15 | 120 | 96 | nuclear mRNA splicing, via spliceosome | < 0.001 | | |
| 16 | 89 | 61 | ubiquitin-dependent protein catabolism | < 0.001 | Rpn4 | 6.44E-06 |
| 17 | 120 | 109 | response to stress | < 0.001 | Msn4 | 1.74E-03 |
| | | | mitochondrial electron transport | < 0.001 | | |
| 18 | 87 | 59 | nuclear mRNA splicing, via spliceosome | 0.012 | | |
| 20 | 46 | 35 | pyridoxine metabolism | 0.045 | | |


Application to stem cells
• ~150 human stem cell lines of diverse types profiled using microarrays
• Clustered profiles into groups
• Adjusted Matisse to seek subnetworks that are characteristic of each group
• Focused analysis on pluripotent stem cells

F. Müller, L. Laurent, D. Kostka, I. Ulitsky, R. Williams, C. Lu, I. Park, M. Rao, P. Schwartz, N. Schmidt, J. Loring Nature 08


Pluripotent stem cells network

Highlights the key protein machinery underlying pluripotency



Ulitsky & Shamir Bioinformatics 2009



Accounting for PPI confidence
• PPI-based analysis is made difficult by abundant false positive / negative interactions
• Various methods can assign confidence (probability) to individual edges
• Idea: seek modules that are connected with high probability

Ulitsky & Shamir Bioinformatics, 2009

CEZANNE: (Co-Expression Zone ANalysis using NEtworks)

• Edge probability p(e) → edge weight -log(1-p(e))

• For any W ⊂ U, at least one edge connects W with U\W with probability ≥ q (e.g. 0.95) ⇔ the weight of the minimum cut of U is at least -log(1-q)

• Algorithm: among the subnets whose minimum cut weight is at least -log(1-q), find the one with the maximum co-expression score

Example (figure: nodes A, B, C, D; edge probabilities 0.7, 0.9, 0.7, 0.8; the minimum cut is marked). Probability that at least one edge crosses each cut:

P({A},{B,C,D}) = 1 - 0.3*0.3 = 0.91
P({A,C,D},{B}) = 0.94
P({A,B},{C,D}) = 0.94
P({A,B,D},{C}) = 0.994
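The confidence condition can be checked directly on small node sets. The sketch below assumes independent edges with probabilities strictly below 1, and finds the minimum cut by brute force over all bipartitions, which is only practical for a handful of nodes; a real implementation would use a proper minimum-cut algorithm. The three-node graph in the usage lines is hypothetical and is not the four-node figure from the slide.

```python
import math
from itertools import combinations

def edge_weight(p):
    """Weight of an edge that exists with probability p (p < 1)."""
    return -math.log(1.0 - p)

def min_cut_weight(nodes, prob):
    """Minimum total weight of edges crossing any bipartition of `nodes`.

    prob maps (u, v) tuples to edge probabilities; edges touching nodes
    outside the set are ignored. Brute force: exponential in len(nodes).
    """
    ordered = sorted(nodes)
    inside = set(ordered)
    anchor, rest = ordered[0], ordered[1:]
    best = math.inf
    for k in range(len(rest)):              # the anchor's side never takes all of `rest`
        for extra in combinations(rest, k):
            side = {anchor, *extra}
            crossing = sum(
                edge_weight(p)
                for (u, v), p in prob.items()
                if u in inside and v in inside and (u in side) != (v in side)
            )
            best = min(best, crossing)
    return best

def confidently_connected(nodes, prob, q):
    """True if every cut is crossed by at least one edge with probability >= q."""
    return min_cut_weight(nodes, prob) >= edge_weight(q)

# Hypothetical triangle: the weakest cut isolates A (two edges of probability 0.7).
triangle = {("A", "B"): 0.7, ("A", "C"): 0.7, ("B", "C"): 0.8}
weakest = 1.0 - math.exp(-min_cut_weight("ABC", triangle))
```

Here `weakest` reproduces the slide's arithmetic for a node attached by two 0.7 edges (1 - 0.3*0.3), so the triangle passes at q = 0.9 but not at q = 0.95.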


DNA damage response in S. cerevisiae
• 47 DNA damage response expression profiles (Gasch et al., 01)
• Front nodes: 2,074 genes with at least two-fold expression change
• Network and confidence values: purification enrichment (PE) scores (Collins et al. 07)

| Module size | GO biological process | p-value | GO-slim protein complexes | p-value |
|---|---|---|---|---|
| 346 | ribosome biogenesis and assembly | 1.2·10^-117 | ribosome | 5.9·10^-91 |
| | translation | 1.0·10^-85 | eukaryotic 43S preinitiation complex | 3.8·10^-49 |
| | rRNA processing | 7.5·10^-79 | small nucleolar ribonucleoprotein complex | 1.5·10^-41 |
| | 35S primary transcript processing | 4.6·10^-44 | DNA-directed RNA polymerase III complex | 3.1·10^-17 |
| | ribosome assembly | 4.3·10^-39 | exosome (RNase complex) | 4.4·10^-15 |
| | ribosomal large subunit biogenesis | 9.2·10^-14 | DNA-directed RNA polymerase I complex | 5.7·10^-14 |
| | rRNA modification | 4.4·10^-12 | Noc complex | 3.2·10^-6 |
| 38 | protein catabolism | 1.8·10^-46 | proteasome complex (sensu Eukaryota) | 5.7·10^-71 |
| | proteolysis | 9.0·10^-44 | proteasome core complex (sensu Eukaryota) | 9.4·10^-32 |
| | ubiquitin cycle | 1.1·10^-42 | | |
| 12 | histone acetylation | 3.6·10^-13 | histone acetyltransferase complex | 2.1·10^-12 |
| | chromatin modification | 5.9·10^-11 | | |
| | transcription from RNA polymerase II promoter | 1.4·10^-6 | | |
| 12 | translation | 1.1·10^-14 | ribosome | 1.4·10^-15 |
| 12 | nuclear mRNA splicing, via spliceosome | 3.5·10^-21 | spliceosome complex | 3.5·10^-17 |
| | | | small nuclear ribonucleoprotein complex | 2.5·10^-15 |
| 10 | barbed-end actin filament capping | 4.8·10^-6 | F-actin capping protein complex | 4.8·10^-6 |
| | endocytosis | 1.1·10^-5 | | |
| | cytoskeleton organization and biogenesis | 2.8·10^-5 | | |
| 8 | establishment and/or maintenance of chromatin architecture | 1.1·10^-5 | chromatin remodeling complex | 4.6·10^-6 |
| 7 | glycogen metabolism | 3.0·10^-8 | protein phosphatase type 1 complex | 3.3·10^-5 |
| | sporulation (sensu Fungi) | 2.0·10^-6 | | |
| 6 | translation | 1.1·10^-7 | ribosome | 4.0·10^-8 |
| 6 | tRNA processing | 2.5·10^-14 | ribonuclease P complex | 9.2·10^-8 |
| | rRNA processing | 2.2·10^-9 | | |
| 4 | trehalose biosynthesis | 6.8·10^-14 | alpha,alpha-trehalose-phosphate synthase complex (UDP-forming) | 6.8·10^-14 |
| 4 | ubiquitin-dependent protein catabolism | 5.2·10^-7 | | |
| 3 | pseudohyphal growth | 9.8·10^-7 | cAMP-dependent protein kinase complex | 9.6·10^-7 |
| 3 | proteasome assembly | 3.2·10^-6 | | |
| | protein folding | 3.9·10^-6 | | |

DNA damage response modules

[Figure: map of the detected modules. Labeled modules: Cytoplasmic ribosome biogenesis; Proteasome; Mitochondrial ribosome – small subunit; Mitochondrial ribosome – large subunit; Spliceosome; Novel actin-localized pathway?; Hsp90; PKA; Trehalose biosynthesis; Ribonuclease P]

Figure annotations:
• Suggests SWS2 as a novel member
• Novel pathway enriched with actin-localized proteins; supported in other datasets; similar deletion phenotypes

Comparison with prior work

[Figure: combined measure of sensitivity (% of annotations enriched) and specificity (% of modules enriched) with p<0.001, for four approaches:]
• Clustering of only expression data
• Clustering expression & network (Hanisch et al., 2002)
• Expression similarity + network connectivity
• Expression similarity + confident network connectivity

Summary
• Algorithms using co-expression + networks to detect functionally coherent modules
• Accommodate both sparse and dense subnetworks
• Subnetworks linked to osmotic shock and DNA damage
• A general framework for confident connectivity in PPI networks
• The next steps:
▫ Co-expression is not the only interesting way to utilize GE data
▫ Scaling to complex human datasets
