Network motifs: discovery and applications

Guy Zinman

Seminar in Bioinformatics

Technion, Spring 2005

Outline

• Theory of network motifs
  • Definition, algorithm

• Application to the E. coli transcription network
  • The dynamic behavior of the motifs

• Finding active subnetworks
  • Simulated annealing
  • Experiments

Network

• Dictionary definition:
  • A group or system of (electric) components and connecting circuitry designed to function in a specific manner.

• Network is the backbone of a complex system

• Studies of networks are similar to paleontology: learning about an animal from its backbone

Network motifs

• The notion of motif, widely used for sequence analysis, is generalized to the level of networks.

• Network Motifs are defined as patterns of interconnections that recur in many different parts of a network at frequencies much higher than those found in randomized networks.

Network motifs (cont.)

Such motifs are found in networks from:

• Biochemistry
  • Transcriptional regulation networks

• Neurobiology
  • Neuron connectivity

• Ecology
  • Food webs

• Engineering
  • Electronic circuits
  • World Wide Web

Network motifs (cont.)

Schematic view of motif detection

• Occurrence of the FFL motif:

Random vs designed/evolved features

• Large networks may contain information about design principles and/or evolution of the complex system

• Which features are there for a reason:
  • design principles (e.g. feed-forward loops)
  • constraints (e.g. all nodes on the Internet must be connected to each other)
  • evolution, growth dynamics (e.g. network growth is mainly due to gene duplication)

Network motifs

• Milo R. et al.: “Network Motifs: Simple Building Blocks of Complex Networks”; Science, 2002.

• Different motifs were found in different classes of networks.

• The motifs reflect the underlying processes that generate each type of network.

Motifs detected

• Two significant motifs:

Both appeared numerous times in non-homologous gene systems that perform diverse biological functions

Main tasks for detecting network motifs

There are two main tasks in detecting network motifs:

(1) generating an ensemble of proper random networks

(2) counting the subgraphs in the real network and in random networks.

The algorithm

• Starting point: graph with directed edges

• Scan for n-node subgraphs (n = 3, 4) and count the number of occurrences

• Compare to randomized graphs
  • Unlike a plain Erdős-Rényi graph, the randomization preserves the in-, out- and in+out-degree of each node

All 3-node connected subgraphs

• 13 non-isomorphic types of 3-node connected subgraphs

• There are 199 4-node subgraphs, 9364 5-node subgraphs, ...

Generation of randomized network

• Algorithm A
  • Employ a Markov-chain algorithm that starts with the real network and repeatedly swaps randomly chosen pairs of connections (X1 => Y1, X2 => Y2 is replaced by X1 => Y2, X2 => Y1) until the network is well randomized.

• Switching is prohibited if either of the connections X1 => Y2 or X2 => Y1 already exists.
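The switching step of Algorithm A can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `switch_randomize`, the `n_swaps` parameter and the attempt cap are assumptions, and so is the extra rule that rejects swaps creating self-loops (the slide only prohibits swaps that recreate an existing connection).

```python
import random

def switch_randomize(edges, n_swaps, seed=0):
    """Degree-preserving randomization by repeated edge switching.

    edges: list of distinct directed edges (x, y).
    Returns a new edge list with the same in- and out-degree per node.
    """
    rng = random.Random(seed)
    edge_list = list(edges)
    edge_set = set(edge_list)
    done, attempts = 0, 0
    # The attempt cap (an assumption) avoids looping forever on tiny graphs.
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        (x1, y1), (x2, y2) = edge_list[i], edge_list[j]
        # Prohibited: the swap would duplicate an existing connection.
        if (x1, y2) in edge_set or (x2, y1) in edge_set:
            continue
        # Assumption: also reject swaps that would create a self-loop.
        if x1 == y2 or x2 == y1:
            continue
        edge_set -= {(x1, y1), (x2, y2)}
        edge_set |= {(x1, y2), (x2, y1)}
        edge_list[i], edge_list[j] = (x1, y2), (x2, y1)
        done += 1
    return edge_list
```

Because each swap only exchanges the targets of two edges, every node keeps its number of outgoing and incoming connections.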

Generation of randomized network

• Algorithm B
  • Each network is represented as a connectivity matrix $M$, such that $M_{ij} = 1$ if there is a connection directed from node i to node j, and 0 otherwise.

• The goal is to create a randomized connectivity matrix $M^{rand}$, which has the same number of nonzero elements in each row and column as the corresponding row and column of the real connectivity matrix.

Generation of randomized network

• $R_i = \sum_j M^{rand}_{ij} = \sum_j M_{ij}$ and $C_j = \sum_i M^{rand}_{ij} = \sum_i M_{ij}$.

• To generate the randomized networks, we start with an empty matrix $M^{rand}$.

• We then repeatedly choose at random a row $m$ according to the weights $p_i = R_i / \sum_i R_i$ and a column $n$ according to the weights $q_j = C_j / \sum_j C_j$.

• If $M^{rand}_{mn} = 0$, we set $M^{rand}_{mn} = 1$, and then set $R_m = R_m - 1$ and $C_n = C_n - 1$.

• If the entry $(m, n)$ was already entered into the randomized matrix, that is, if $M^{rand}_{mn} = 1$, or if $m = n$, we choose a new $(m, n)$.

• This process is repeated until all $R_i = 0$ and $C_j = 0$.
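A rough Python sketch of Algorithm B follows. The name `matrix_randomize`, the `max_tries` guard and the convention of returning `None` when the procedure gets stuck are assumptions made for illustration. A stuck state can occur when the only remaining row and column quotas belong to an already filled or diagonal entry, in which case the caller simply retries with another seed.

```python
import random

def matrix_randomize(M, seed=0, max_tries=10000):
    """Fill an empty matrix so that row and column sums match those of M."""
    rng = random.Random(seed)
    n = len(M)
    R = [sum(row) for row in M]                               # remaining row quotas
    C = [sum(M[i][j] for i in range(n)) for j in range(n)]    # remaining column quotas
    Mrand = [[0] * n for _ in range(n)]
    tries = 0
    while sum(R) > 0:
        tries += 1
        if tries > max_tries:
            return None          # stuck: caller should retry with a new seed
        # Rows and columns are drawn in proportion to their remaining quotas.
        row = rng.choices(range(n), weights=R)[0]
        col = rng.choices(range(n), weights=C)[0]
        if row == col or Mrand[row][col] == 1:
            continue             # choose a new (row, col)
        Mrand[row][col] = 1
        R[row] -= 1
        C[col] -= 1
    return Mrand
```

Rows or columns whose quota has reached zero get weight zero and are never drawn again, so the loop ends exactly when all quotas are used up.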

Network motif detection

• For each nonzero element (i, j) of M:

Loop through all connected elements $M_{ik} = 1$, $M_{ki} = 1$, $M_{jk} = 1$ and $M_{kj} = 1$. This is repeated recursively with elements (i, k), (k, i), (j, k) and (k, j) until an n-node subgraph is obtained.

• A table is formed that counts the number of appearances of each type of subgraph in the network, correcting for the fact that multiple submatrices of M can correspond to one isomorphic architecture owing to symmetries.
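As a concrete special case of the counting step, the sketch below counts one subgraph type only, the feed-forward loop (X => Y, Y => Z, X => Z). It is a simplified illustration under stated assumptions: the helper name `count_ffl` is hypothetical, self-loops are ignored, and extra edges among the three nodes are not excluded, so with mutual edges the same node triple may be counted in more than one role. The general algorithm instead tabulates every isomorphism class.

```python
def count_ffl(edges):
    """Count feed-forward loops X->Y, Y->Z, X->Z in a directed edge list."""
    succ = {}
    for x, y in edges:
        if x != y:                      # ignore self-loops (autoregulation)
            succ.setdefault(x, set()).add(y)
    count = 0
    for x, targets in succ.items():
        for y in targets:               # edge X -> Y
            for z in succ.get(y, ()):   # edge Y -> Z
                if z != x and z in targets:   # closing edge X -> Z
                    count += 1
    return count
```

For example, `count_ffl([("X", "Y"), ("Y", "Z"), ("X", "Z")])` finds one feed-forward loop, while a directed 3-cycle contains none.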

Network motif detection

• This process is repeated for each of the randomized networks. The number of appearances of each type of subgraph in the random ensemble is recorded, to assess its statistical significance.

• The present concepts and algorithms are easily generalized to nondirected or directed graphs with several “colors” of edges and nodes, multipartite graphs, and so forth.

Criteria for Network Motif Selection

• The probability that the subgraph appears in a randomized network an equal or greater number of times than in the real network is smaller than P = 0.01.

Reminder: the p-value is the probability of obtaining a result at least as extreme as the observed one when the null hypothesis holds (here: the network is random).

If the p-value < 0.01, the null hypothesis is rejected and the subgraph is considered a motif.
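The selection criterion can be estimated directly from the random ensemble. The snippet below is a minimal sketch; `empirical_p_value` is a hypothetical helper name, and the counts in the usage note are made-up numbers used only to show the call.

```python
def empirical_p_value(real_count, random_counts):
    """Fraction of randomized networks in which the subgraph appears
    at least as often as in the real network."""
    hits = sum(1 for c in random_counts if c >= real_count)
    return hits / len(random_counts)
```

With 1000 randomized networks the smallest nonzero estimate is 0.001, so an ensemble of that size is enough to resolve the P = 0.01 threshold. For instance, `empirical_p_value(40, [5, 7, 40, 3])` gives 0.25, because one of the four toy random counts reaches the real count.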

Run time complexity

• The performance of this algorithm scales with the total number of n-node subgraphs in the network.

• The number of subgraphs and the algorithm runtime also increase dramatically for subgraphs with n ≥ 5.

Sampling method for subgraph counting

• Kashtan et al.: “Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs”; Bioinformatics, 2004.

• This algorithm samples subgraphs in order to estimate their relative frequency.

• The runtime of the algorithm asymptotically does not depend on the network size.

• Surprisingly, few samples are needed to detect network motifs reliably.

Subgraph sampling

Procedure description:

• Pick a random edge from the network and then expand the subgraph iteratively by picking random neighboring edges until the subgraph reaches n nodes.

• For each random choice of an edge, in order to pick an edge that will expand the subgraph size by one, prepare a list of all such candidate edges and then randomly choose an edge from the list.

Subgraph sampling

• Finally, the sampled subgraph is defined by the set of n nodes and all the edges that connect between these nodes in the original network.
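The sampling procedure can be sketched as follows. This is an illustrative version only: the function name `sample_subgraph` is an assumption, the candidate list is rebuilt by a full scan of the edges for clarity rather than speed, and the correction for the unequal probability of drawing different subgraphs, which a concentration estimate needs, is left out.

```python
import random

def sample_subgraph(edges, n, rng):
    """Sample one connected n-node subgraph by random edge expansion.

    Returns (nodes, induced_edges), or None if the subgraph cannot grow.
    """
    edges = list(edges)
    nodes = set(rng.choice(edges))       # start from a random edge
    while len(nodes) < n:
        # Candidate edges touch the current nodes and add exactly one new node.
        candidates = [(u, v) for u, v in edges if (u in nodes) != (v in nodes)]
        if not candidates:
            return None
        nodes.update(rng.choice(candidates))
    # The sample is the node set plus all original edges among those nodes.
    induced = frozenset((u, v) for u, v in edges if u in nodes and v in nodes)
    return frozenset(nodes), induced
```

A call such as `sample_subgraph(edges, 3, random.Random(0))` returns one 3-node sample; repeating the call many times and classifying the samples by type gives the estimated relative frequencies.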

• Finding n-node subgraphs for n ≥ 5 is much easier now.

Comparing sampling method results with exhaustive enumeration

Transcriptional Regulation Network of Escherichia coli

• Operon – a group of contiguous genes that are transcribed into a single mRNA molecule.

• The transcriptional network is represented as a directed graph: each operon represents a node and edges represent direct transcriptional interactions.

Application to E. Coli

Shen-Orr S. et al.: “Network motifs in the transcriptional regulation network of Escherichia coli”; Nature Genetics, 2002.

• Database: RegulonDB contains interactions between transcription factors (TFs) and the operons they regulate
  • Contains 577 interactions, 424 operons and 116 TFs
  • 35 more TFs were added from the literature
  • The previously described algorithm was run on this data (1000 random networks)

Significant motifs

Feedforward loop

Found in 22 different systems, 10 TFs and 40 operons

P-value = 0.001

Concentration of FFL

Same in the yeast regulatory network

• Lee et al. (Young lab): “Transcriptional Regulatory Networks in Saccharomyces cerevisiae”; Science, 2002.

• Can you think of a possible role for this motif?

Dynamics for the FFL

• Mangan et al., “Structure and function of the feed-forward loop”; PNAS, 2003.

Consider Sx and Sy as input signals: small molecules that activate or inhibit the activity of X and Y.

Coherency of FFLs

• The FFL is ‘coherent’ if the direct effect of the general TF on the effector has the same sign as its net indirect effect through the specific TF.

• 85% of the FFLs found were coherent.
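The coherency rule reduces to a sign comparison. In the sketch below, activation is encoded as +1 and repression as -1; this encoding and the name `is_coherent` are assumptions for illustration. X is the general TF, Y the specific TF and Z the effector.

```python
def is_coherent(sign_xy, sign_yz, sign_xz):
    """True if the direct effect X->Z has the same sign as the
    indirect effect through Y, i.e. the product of X->Y and Y->Z."""
    return sign_xz == sign_xy * sign_yz
```

For example, an FFL with three activating edges is coherent, and so is one where X represses Y and Y represses Z while X activates Z, since two repressions in a row give a net positive indirect effect.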

Significant motif

Single Input Motif (SIM)

• A single transcription factor controls a set of operons.

• All operons in a SIM are regulated with the same sign.

• Appeared in 24 different systems

Dynamics for the SIM

Significant motif

Dense Overlapping Regulon (DOR): a layer of overlapping interactions between operons and a group of TFs, much denser than such a structure would appear in an Erdős-Rényi random graph

E. Coli network

DOR detection

Briefly…

• Define a (nonmetric) distance measure between operon k and j.

• The operons were clustered.

• DORs corresponded to clusters with more than C = 10 connections and a ratio of connections to TFs greater than R = 2.

mFinder

• A software tool for estimating subgraph concentrations and detecting network motifs.

• www.weizmann.ac.il/mcb/UriAlon/

Discussion

• The concept of homology between genes based on sequence motifs has been crucial for understanding the function of uncharacterized genes.

• Likewise, the notion of similarity between connectivity patterns in networks, based on network motifs, may be helpful in gaining insight into the dynamic behavior of newly identified gene circuits.

Discussion

• Until now we considered only transcription interactions specifically manifested by transcription factors that bind regulatory sites.

• This transcriptional network can be thought of as ‘slow’ part of the cellular regulation network (time scale of minutes).

Discussion

• An additional layer of faster interactions, which include interaction between proteins (often subsecond timescale), contributes to the full regulatory behavior.

Finding active subnetworks

• Ideker, T.: “Discovering regulatory and signaling circuits in molecular interaction networks”; Bioinformatics, 2002.

• Integrates protein-protein and protein-DNA interactions with mRNA expression data, with the goal of better understanding the molecular mechanisms behind the observed gene expression.

• Uses a method of searching the network to find ‘active subnetworks’, i.e., connected sets of genes with unexpectedly high levels of differential expression, under one or more perturbations.

Methodology

• Using a molecular interaction network to analyze changes in expression over 20 perturbations to the yeast galactose utilization (GAL) pathway.

• Determining which conditions significantly affected the gene expression in each active subnetwork.

The means

• Combining a rigorous statistical measure for scoring subnetworks with a search algorithm for identifying subnetworks with high score.

• To rate the biological activity of a particular subnetwork, begin with assessing the significance of differential expression for each gene.

• The error model is provided by the VERA (Variability and ERror Assessment) program.
  • VERA estimates the parameters of a statistical model using the method of maximum likelihood.

• Output: p-values (pi), representing the significance of expression change.

Basic z-score calculation

• Each $p_i$ is converted to a z-score: $z_i = \Phi^{-1}(1 - p_i)$
  • $\Phi^{-1}$ is the inverse normal CDF (cumulative distribution function).
  • Smaller p-values correspond to larger z-scores.

• A z-score quantifies how far a given value $x$ lies from the mean, in units of the standard deviation: $Z = \frac{x - \bar{x}}{\sigma_x}$
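The conversion from p-value to z-score can be tried out with the Python standard library, whose `statistics.NormalDist` class provides the inverse normal CDF as `inv_cdf`. The wrapper name `p_to_z` is an assumption; the input must lie strictly between 0 and 1.

```python
from statistics import NormalDist

def p_to_z(p):
    """Convert a p-value (0 < p < 1) to a z-score via the inverse normal CDF."""
    return NormalDist().inv_cdf(1.0 - p)
```

A p-value of 0.5 maps to a z-score of about 0, and a p-value of 0.025 maps to roughly 1.96, so smaller p-values indeed give larger z-scores.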

• Aggregate z-score for an entire subnetwork A of k genes: $z_A = \frac{1}{\sqrt{k}} \sum_{i \in A} z_i$

Notice:

• If the $z_i$ are independent standard normal variables, $z_A$ is also distributed according to the standard normal.

• Subnetworks of all sizes are comparable under this scoring system, independent of k.

• A high $z_A$ indicates a biologically active subnetwork.

Scoring of Subnetworks

Calibrating z against background distribution

• Randomly sample gene sets of size k using a Monte Carlo approach, compute their scores $z_A$, and estimate the mean $\mu_k$ and standard deviation $\sigma_k$ for each k.

• The corrected subnetwork score $S_A$ is: $S_A = \frac{z_A - \mu_k}{\sigma_k}$
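The aggregate score and its calibration against random gene sets can be sketched together. The function names, the default of 2000 Monte Carlo samples and the fixed seed are assumptions chosen for illustration; the background is drawn from the z-scores of all genes, ignoring network connectivity.

```python
import math
import random
import statistics

def aggregate_z(z_scores):
    """Aggregate z-score of k genes: sum of z_i divided by sqrt(k)."""
    return sum(z_scores) / math.sqrt(len(z_scores))

def calibrated_score(subnet_z, all_z, n_samples=2000, seed=0):
    """Corrected score: (z_A - mu_k) / sigma_k, with mu_k and sigma_k
    estimated from random gene sets of the same size k."""
    rng = random.Random(seed)
    k = len(subnet_z)
    background = [aggregate_z(rng.sample(all_z, k)) for _ in range(n_samples)]
    mu_k = statistics.fmean(background)
    sigma_k = statistics.stdev(background)
    return (aggregate_z(subnet_z) - mu_k) / sigma_k
```

For four genes that each have z = 1, `aggregate_z` returns 2.0. The calibrated score then states how many background standard deviations this lies above the average random gene set of size four.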

Scoring an example subnetwork

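[Figure: example subnetwork with gene scores $z_a$, $z_b$, $z_c$, $z_d$, the aggregate score $z_A$ and the corrected score $S_A$]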

Scoring over multiple conditions

• Starting with a matrix of p-values (genes vs. conditions) and corresponding z-scores.

• Producing m different aggregate scores, one for each condition, and sorting them.

• Finding the probability that at least j of the m conditions have scores above $z_{A(j)}$, the j-th highest aggregate score.

• A Monte Carlo technique is used to estimate the mean and the standard deviation from random gene sets of size k.
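One way to read the multi-condition step is as a binomial tail: if a single condition exceeds $z_{A(j)}$ by chance with probability $1 - \Phi(z_{A(j)})$, the chance that at least j of m conditions do so is a sum of binomial terms. The sketch below follows that reading; it is an interpretation of the slide, not a verified copy of the published method, and the name `multi_condition_score`, the choice of taking the best j, and the clamping constant are assumptions.

```python
from math import comb
from statistics import NormalDist

def multi_condition_score(z_by_condition):
    """Best z-score over j for 'at least j of m conditions exceed z_A(j)'."""
    nd = NormalDist()
    m = len(z_by_condition)
    ranked = sorted(z_by_condition, reverse=True)   # z_A(1) >= ... >= z_A(m)
    best = float("-inf")
    for j, z in enumerate(ranked, start=1):
        tail = 1.0 - nd.cdf(z)          # chance one condition exceeds z
        # Binomial probability that at least j of the m conditions exceed z.
        p_j = sum(comb(m, h) * tail**h * (1.0 - tail)**(m - h)
                  for h in range(j, m + 1))
        # Clamp so that inv_cdf stays inside its open domain (0, 1).
        q = min(max(1.0 - p_j, 1e-12), 1.0 - 1e-12)
        best = max(best, nd.inv_cdf(q))
    return best
```

With a single condition the score reduces to the original z-score, which is a quick sanity check on the formula.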

Finding the maximal scoring

• Problem:

Finding the maximal scoring connected subgraph is NP-hard.

The Difficulty in Searching Global Optima

[Figure: rugged landscapes and the local maxima problem. Significance score (y-axis) plotted over subnetworks (x-axis), with one global maximum and several local maxima.]

Monte Carlo random search

• Also known as the ‘Metropolis algorithm’
  • A simulation technique for conformational sampling and optimization based on a random search for energetically favourable conformations

• Finding a global (or at least “good” local) maximum by a biased random walk may take some luck.

Climbing mountains easier: simulated annealing

In order to get out of a local maximum, one needs to allow locally unfavorable moves.

Introduction to simulated annealing

Simulated annealing (Kirkpatrick et al., 1983):

• A mathematical method developed together with Monte Carlo techniques to avoid false maxima.

• The method simulates the slow cooling of a solidifying solution to form a single crystal.

Origin: The annealing process of heated solids

Intuition: By allowing occasional descent in the search process, we might be able to escape the trap of local maxima.

In our context:

Allow nodes to be removed from the subsets, even if the resulting subnetwork’s score is a (little) lower.

• What can be an adverse effect of this method?

Consequences of the Occasional Descents

• Desired effect: helps escape local optima.

• Adverse effect: the search might move past the global optimum after reaching it.

So the result is not guaranteed to be optimal. But here we don’t care: any high-scoring subnetwork is suspected to be biologically significant.

Climbing mountains easier: simulated annealing

• Defining a “temperature” function.

• Increasing the effective “temperature” means a higher probability of accepting unfavorable moves (moves that increase the energy or, in our context, lower the score). Thus, the likelihood of escaping from a local maximum may be tuned.

Control of Annealing Process

Acceptance of a search step (Metropolis criterion):

Assume the change in score produced by a search step is $\Delta$.

• Accept a descending step ($\Delta < 0$) only if it passes a random test, i.e. with probability $p = e^{\Delta / T}$.

• Always accept an ascending step, i.e. $\Delta \ge 0$.
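The acceptance test fits in a few lines. In this sketch `delta` is the change in score produced by the step (positive means improvement) and `rng` is any object with a `random()` method returning a uniform number in [0, 1); both names are assumptions.

```python
import math
import random

def accept_move(delta, temperature, rng):
    """Metropolis criterion for a maximization problem."""
    if delta >= 0:
        return True                       # ascending steps are always accepted
    # Descending steps pass with probability exp(delta / T), which is below 1.
    return rng.random() < math.exp(delta / temperature)
```

At T = 1 a step that lowers the score by 1 is accepted about 37% of the time (e^-1), while at T = 0.01 the same step is practically never accepted.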

Control of Annealing Process

Cooling Schedule:

T, the annealing temperature, is the parameter that controls how often descending steps are accepted.

We gradually reduce the temperature T(k) from 1 toward 0. The probability of accepting a descending step is $e^{\Delta / T}$, where $\Delta < 0$ is the change in score, so it shrinks as T falls.
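A geometric cooling schedule multiplies the temperature by a constant ratio at every step. The sketch below is one possible way to build it; the helper name `geometric_schedule` is hypothetical, and the end temperature must be strictly positive because a geometric sequence never reaches exactly 0. The value 0.01 in the usage note is only an example.

```python
def geometric_schedule(t_start, t_end, n_steps):
    """Temperatures decreasing geometrically from t_start to t_end.

    Requires t_start > t_end > 0 and n_steps >= 2.
    """
    ratio = (t_end / t_start) ** (1.0 / (n_steps - 1))
    return [t_start * ratio**i for i in range(n_steps)]
```

For example, `geometric_schedule(1.0, 0.01, 5)` yields five temperatures that start at 1.0, shrink by the same factor each step, and end at about 0.01.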

In our context

• Input:

Graph G = (V,E) of molecular interactions,

N – number of iterations

Ti – temperature function which decreases from Tstart to Tend

• Output:

Gw – Subgraph of G

• Initialize Gw by setting each node to an ‘active/inactive’ state randomly (with p = ½).

Simulated Annealing Algorithm

• For i = 1 to N do:
  • Randomly pick a node v from V and toggle its state.
  • Compute the score $s_i$ for the working subgraph $G_w$.
  • If $s_i > s_{i-1}$, keep v toggled;
  • else keep v toggled with probability $p = e^{(s_i - s_{i-1}) / T_i}$.
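The loop above can be put together into a small runnable sketch. Several simplifications are assumptions made here for brevity: the score of the working subgraph is taken as the highest uncalibrated aggregate z-score among its connected components, the temperature follows a geometric schedule ending at 0.01, and the names `top_component_score` and `anneal` are hypothetical.

```python
import math
import random

def top_component_score(active, adj, z):
    """Highest aggregate z-score over connected components of active nodes."""
    seen, best = set(), float("-inf")
    for start in active:
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:                      # depth-first search of one component
            u = stack.pop()
            comp.append(u)
            for w in adj[u]:
                if w in active and w not in seen:
                    seen.add(w)
                    stack.append(w)
        best = max(best, sum(z[u] for u in comp) / math.sqrt(len(comp)))
    return best if active else 0.0

def anneal(adj, z, n_iter=5000, t_start=1.0, t_end=0.01, seed=0):
    """Simulated annealing over active/inactive node states."""
    rng = random.Random(seed)
    nodes = list(adj)
    active = {v for v in nodes if rng.random() < 0.5}   # random initial state
    score = top_component_score(active, adj, z)
    ratio = (t_end / t_start) ** (1.0 / max(n_iter - 1, 1))
    temp = t_start
    for _ in range(n_iter):
        v = rng.choice(nodes)
        active ^= {v}                                   # toggle v
        new_score = top_component_score(active, adj, z)
        delta = new_score - score
        if delta > 0 or rng.random() < math.exp(delta / temp):
            score = new_score                           # keep v toggled
        else:
            active ^= {v}                               # undo the toggle
        temp *= ratio                                   # geometric cooling
    return active, score
```

The returned set holds every active node, not only the best component, so a real application would extract the top-scoring component afterwards. Rescoring the whole graph at every iteration is also wasteful and is done here only to keep the sketch short.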

Heuristics for improved annealing

• Look for M active subnetworks simultaneously.

• M is a user-defined variable.

• Maintaining multiple components can improve the efficiency of annealing.

• Can be done by:
  • multiple annealing runs, or
  • extending the annealing approach to maintain a graph state vector of the top M component scores.

Galactose metabolic flow

Results:

Experiment #1

Small network of 362 interactions. 2 conditions of the expression data: gal80 deletion vs. WT.

5 significant subnetworks were found, including 41 out of 77 significant genes.

Score and temperature vs. number of iteration

• The temperature decreases geometrically from 1 toward 0.

• $N = 10^5$

• By the end of the run, each of the 5 subnetworks reaches a (local) maximum.

Evaluation of the subnetworks

Z-score distribution with real data

Z-score distribution with random data ( scrambled nodes z-scores )

Z-score distribution of the top 5 active networks.

Experiment #2

• Network consists of all known interactions:
  • 7145 protein-protein interactions from BIND
  • 317 regulation interactions from TRANSFAC

• Expression data includes 20 perturbations to genes in the Galactose pathway.

• 7 active subnetworks found. The biggest consists of 340 genes.

• Repeating the annealing on the subnetwork above generated 5 significant sub-subnetworks.

• All results were evaluated with methods similar to what we have seen.

Results:

Discussion

Cytoscape

• www.cytoscape.org

Summary

• Theory of network motifs
  • Definition, algorithm

• Application to the E. coli transcription network
  • The dynamic behavior of the motifs

• Finding active subnetworks
  • Simulated annealing
  • 2 experiments

References

• S. Shen-Orr, R. Milo, S. Mangan and U. Alon. Network motifs in the transcriptional regulation network of Escherichia coli. Nature Genetics 31:64-68 (2002).

• R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii and U. Alon. Network motifs: simple building blocks of complex networks. Science 298:824-827 (2002).

• T. Ideker, O. Ozier, B. Schwikowski and A. Siegel. Discovering regulatory and signaling circuits in molecular interaction networks. Bioinformatics 18:S233-S240 (2002).

• S. Mangan and U. Alon. Structure and function of the feed-forward loop network motif. PNAS 100:11980-11985 (2003).

• N. Kashtan, S. Itzkovitz, R. Milo and U. Alon. Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics 20:1746-1758 (2004).

• S. Kirkpatrick, C. D. Gelatt and M. P. Vecchi. Optimization by simulated annealing. Science 220:671-680 (1983).

Thank you