1

Seminar in Bioinformatics

An efficient algorithm for detecting frequent subgraphs in biological networks

Paper by: M. Koyuturk, A. Grama and W. Szpankowski

Appeared in: Bioinformatics, Vol. 20, Sup. 1, 2004, pages i200-i207.

Presented by: Royi Ronen

2

Abstract

• Motivation

– Network interaction data is abundant

– Analyzing this data is important

– The problems are close to the subgraph isomorphism problem, which is hard

• Results

– An efficient algorithm for detecting frequently occurring patterns in biological networks

– The algorithm simplifies the subgraph isomorphism problem to a different, tractable problem with biological applications

– Mining the KEGG database yields positive empirical results

3

Outline

• Introduction

• Model

• Approach: Graph Mining

– Related Work

– Formalism for metabolic pathways

– The Algorithm

• Discussion and Empirical Results

• Conclusion

• Future Work

4

Introduction

• Experimental data on biological sequences, which are highly available and accessible, play an important role in tasks such as discovering common sequences and motifs

• Biomolecular interaction data are abstracted as graphs

– Example: A hypergraph can represent a metabolic pathway where nodes represent compounds

– Can be reduced to a directed graph where nodes are enzymes and edges relate them

5

Introduction

• Key problems in this context:

– Aligning multiple graphs

– Finding frequently occurring subgraphs in a collection of graphs

• A solution can lead to the understanding of

– Motifs of cellular interactions

– Evolutionary relationships

– Differences between networks in different organisms

– Patterns of gene regulation

6

Introduction

• In the paper

– Finding frequently occurring subgraphs in a collection of graphs, each representing a metabolic pathway

– Close to the NP-hard subgraph isomorphism problem

– End of story?

• No!

– The problem can be simplified and made tractable while still capturing the biological information

– Nodes will be “uniquely labeled” according to the enzyme they represent

– Experimental results: discovering “interesting” patterns from KEGG takes seconds

7

Outline

• Introduction ☺

• Model

• Approach: Graph Mining

– Related Work

– Formalism for metabolic pathways

– The Algorithm

• Discussion and Empirical Results

• Conclusion

• Future Work

8

Metabolic Pathways

• Oldest kind of biological network

• Group the reactions that belong to a process

• Publicly available (e.g., KEGG)

• Chemical compounds are linked to each other by a product-substrate relationship

• In a hypergraph

– Nodes are compounds

– A hyperedge is a reaction (or an enzyme)

– Hyperedge direction is important to distinguish between substrates and products

[Figure: example hypergraph with nodes labeled a, b, c]

9

Metabolic Pathways

• Simplification:

– Regular graph in which nodes represent enzymes; an edge connects enzyme a to enzyme b iff some product of a is a substrate of b

– Edges may be labeled by the compound that relates a to b

– A specific enzyme may appear more than once in the same pathway, but we merge its occurrences into a single node, at the price of losing temporal information

• Many problems in understanding molecular interactions in the cell can be addressed with graph-based frameworks, mostly as a means to investigate units with well-defined functionality

• Paper focus: Mining pathways for frequent connected subgraphs, which is important because functional modules are expected to repeat among several pathways or organisms (or both)

[Figure: enzyme node a connected to enzyme node b by an edge labeled with a compound (com.)]

10

Outline

• Introduction ☺

• Model ☺

• Approach: Graph Mining

– Related Work

– Formalism for metabolic pathways

– The Algorithm

• Discussion and Empirical Results

• Conclusion

• Future Work

11

Related Work

• Subgraph isomorphism

– Unlabeled version. Hardness is usually “tackled” by ordering nodes and edges for efficient processing

– Labeled version. Easier, suitable for biological networks

• Frequent itemset mining

– Multiple sets of items (transactions) from a domain D are given

– Itemset X implies itemset Y with confidence c if c% of the sets containing X also contain Y

– The rule X→Y has support s if s% of the sets contain both X and Y
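
The two measures above can be illustrated with a minimal Python sketch. The toy transactions and the function names `support` and `confidence` are made up for illustration and do not come from the paper; values are returned as fractions rather than percentages.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y, transactions):
    """Fraction of the transactions containing X that also contain Y.

    Assumes X occurs in at least one transaction.
    """
    return support(x | y, transactions) / support(x, transactions)

# Hypothetical toy data: four transactions over the items a, b, c.
transactions = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("bc")]
x, y = frozenset("a"), frozenset("b")

rule_support = support(x | y, transactions)       # 2 of the 4 sets contain a and b
rule_confidence = confidence(x, y, transactions)  # 2 of the 3 sets with a also contain b
```

Here the rule a→b has support 2/4 and confidence 2/3.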

12

Graph Formalism for Metabolic Pathways

• A metabolic pathway is a triplet P(M,Z,R)

– M, a set of metabolites

– Z, a set of enzymes

– R, a set of reactions, where each reaction r is associated with

• A set of enzymes Z(r) from Z

• A set of substrates S(r) from M

• A set of products T(r) from M

[Figure legend: metabolite, enzyme]

13

Graph Formalism for Metabolic Pathways

• A graph G(V,E) for P(M,Z,R) is defined as follows

– For every enzyme zi in Z, a node vi exists

– (vi,vj) is in E iff zj consumes a product of zi

• Example: [Figure: example graph; the legend distinguishes enzyme and metabolite nodes]
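
The construction of G(V,E) from P(M,Z,R) can be sketched in Python as follows. The `Reaction` class, the `enzyme_graph` function and the two-reaction example are illustrative assumptions, not code or data from the paper.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Reaction:
    enzymes: frozenset      # Z(r), the enzymes of the reaction
    substrates: frozenset   # S(r), the metabolites consumed
    products: frozenset     # T(r), the metabolites produced

def enzyme_graph(reactions):
    """Directed edges (zi, zj) such that zj consumes a product of zi."""
    edges = set()
    for r1, r2 in product(reactions, repeat=2):
        # Some product of r1 is a substrate of r2, so link their enzymes.
        if r1.products & r2.substrates:
            edges.update(product(r1.enzymes, r2.enzymes))
    return edges

# Hypothetical pathway: E1 turns m1 into m2, and E2 turns m2 into m3.
reactions = [
    Reaction(frozenset({"E1"}), frozenset({"m1"}), frozenset({"m2"})),
    Reaction(frozenset({"E2"}), frozenset({"m2"}), frozenset({"m3"})),
]
edges = enzyme_graph(reactions)
```

In this toy pathway the only edge is (E1, E2), because m2 is produced by E1 and consumed by E2. Merging repeated enzymes into one node happens implicitly, since nodes are identified by their enzyme label.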

14

Mining Metabolic Pathways

• The problem: Given a collection of n graphs and a support threshold ε, find all maximal connected subgraphs that are contained in at least εn of the graphs

• The support of a subgraph that appears in n’ graphs is n’/n

• A frequent subgraph is maximal if it is not contained in another frequent subgraph

15

Subgraph Isomorphism Simplified

• Nodes are labeled by enzyme identifiers

• Only edges are needed to define a graph. Their labels conceptually identify the nodes

• Edges are items, uniquely specified by labels which refer to enzymes

• The problem can therefore be reduced to mining frequent itemsets

• The graph G1 here is {ab,ac,de}

• Connectivity still has to be considered, because a frequent set of edges need not form a connected subgraph
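
With unique node labels, checking whether a pattern occurs in a graph becomes a plain subset test on edge sets. The sketch below illustrates this under two assumptions: edge direction is ignored (an edge is the set of its two labels) and enzyme labels are single characters. G1 is the graph named on the slide; G2 and the helper `edge` are hypothetical.

```python
def edge(label: str) -> frozenset:
    """An edge named by its two single-character enzyme labels, e.g. "ab"."""
    return frozenset(label)

G1 = {edge("ab"), edge("ac"), edge("de")}   # the graph from the slide
G2 = {edge("ab"), edge("ac"), edge("bd")}   # hypothetical second graph

pattern = {edge("ab"), edge("ac")}

# Subgraph containment reduces to set inclusion: no node mapping is searched.
in_g1 = pattern <= G1
in_g2 = pattern <= G2
de_in_g2 = {edge("de")} <= G2
```

The pattern {ab,ac} is contained in both graphs, while {de} is contained only in G1.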

16

Subgraph Homeomorphism Simplified

• A connected edgeset corresponds to a connected subgraph

– A unique edge is a set of two node labels

– A set of unique edges ES={e1, e2, …, ek} is called connected iff every nonempty proper subset ES’ of ES shares at least one node with the remaining edges ES\ES’

• Connection to frequent itemset mining

– Input graphs correspond to transactions

– Connected edgesets correspond to itemsets

– Approach: build frequent sets bottom up (small to large)

– Adding an edge that shares a node with the current edgeset preserves connectivity
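
The connectivity condition can be checked without enumerating subsets. The sketch below grows a reached set from one edge through shared nodes, which is equivalent to the subset-based definition; the names `edge` and `is_connected` are illustrative, and treating the empty edgeset as not connected is an assumption.

```python
def edge(label: str) -> frozenset:
    """An edge as the set of its two single-character node labels."""
    return frozenset(label)

def is_connected(edgeset) -> bool:
    """True iff every edge can be reached from any other through shared nodes."""
    edges = set(edgeset)
    if not edges:
        return False  # assumption: an empty edgeset is not a subgraph of interest
    start = next(iter(edges))
    reached = {start}
    nodes = set(start)
    grew = True
    while grew:
        grew = False
        for e in edges - reached:
            if nodes & e:          # e shares a node with the part reached so far
                reached.add(e)
                nodes |= e
                grew = True
    return reached == edges

connected_pair = is_connected({edge("ab"), edge("ac")})
split_pair = is_connected({edge("ab"), edge("de")})
chain = is_connected({edge("ab"), edge("bd"), edge("de")})
```

Here {ab,ac} and {ab,bd,de} are connected, while {ab,de} is not because the two edges share no node.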

17

Subgraph Homeomorphism Simplified

• Throughout the search, only connected edgesets are considered

– Captures the connected nature of pathways

• It is important to avoid the redundancy that comes from considering the same sets in a different order

18

The Algorithm

19

The Algorithm

• The procedure is invoked for each frequent edge ei

– Mine({}, {ei}, N(ei), {e1,e2,…,ek})

• The support threshold is enforced by the “if frequent” statement

• Example: consider 5 enzymes, a, b, c, d and e, which participate (vacuously or not) in 4 pathways G1, G2, G3, G4

• We mine with support = ¾

20

Example

ab, ac and de are the only frequent edges

Mine({}, {ab}, N(ab), {ab,ac,bd,de,ce})

Mine({}, {ac}, N(ac), {ab,ac,bd,de,ce})

Mine({}, {de}, N(de), {ab,ac,bd,de,ce})

{ab,ac},{de} are the frequent subgraphs
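
The bottom-up search can be sketched in Python as follows. This is a simplified illustration, not the paper's Mine procedure: redundancy is avoided with a `seen` set of already visited edgesets, edge direction is ignored, and the graphs G2, G3 and G4 are hypothetical, chosen so that the outcome matches the one stated on the slide.

```python
def edge(label: str) -> frozenset:
    """An edge as the set of its two single-character enzyme labels."""
    return frozenset(label)

def support(edgeset, graphs) -> float:
    """Fraction of the graphs that contain every edge of `edgeset`."""
    return sum(1 for g in graphs if edgeset <= g) / len(graphs)

def mine_maximal(graphs, eps):
    """Maximal frequent connected edgesets at support threshold `eps`."""
    all_edges = set().union(*graphs)
    frequent_edges = {e for e in all_edges if support({e}, graphs) >= eps}
    seen, frequent = set(), set()

    def neighbors(edgeset):
        # Frequent edges outside the edgeset that share a node with it.
        nodes = set().union(*edgeset)
        return {e for e in frequent_edges if e not in edgeset and e & nodes}

    def grow(edgeset):
        if edgeset in seen:        # same set reached through another edge order
            return
        seen.add(edgeset)
        frequent.add(edgeset)
        for e in neighbors(edgeset):
            candidate = edgeset | {e}
            if support(candidate, graphs) >= eps:   # the "if frequent" check
                grow(candidate)

    for e in frequent_edges:
        grow(frozenset({e}))
    # Keep only edgesets that are not contained in a larger frequent edgeset.
    return {s for s in frequent if not any(s < t for t in frequent)}

# G1 is the graph named on the slides; G2 to G4 are hypothetical.
G1 = {edge("ab"), edge("ac"), edge("de")}
G2 = {edge("ab"), edge("ac"), edge("bd"), edge("de")}
G3 = {edge("ab"), edge("ac"), edge("ce")}
G4 = {edge("ab"), edge("ac"), edge("bd"), edge("de")}

maximal = mine_maximal([G1, G2, G3, G4], eps=0.75)
```

With these graphs, ab, ac and de are the only frequent edges, and the maximal frequent connected edgesets are {ab,ac} and {de}. Because only neighboring edges are ever added, every candidate stays connected by construction.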

21

Example

{ab,ac},{de} are the maximal frequent subgraphs

Mining development:

22

Polynomial Bound

• The paper does not prove a complexity bound; it only justifies “efficiency” empirically

• We show a polynomial bound for time complexity

– Determining which edges are frequent can be done using sorting

– Determining the neighbors of an edge is linear (requires one pass)

– At every level of the recursion, the algorithm extends a frequent subgraph with a new frequent edge. This amounts to a linear number of procedure calls

– Each such call can be done in time polynomial in n, where n is the number of edges in the input

23

Outline

• Introduction ☺

• Model ☺

• Approach: Graph Mining ☺

– Related Work ☺

– Formalism for metabolic pathways ☺

– The Algorithm ☺

• Discussion and Empirical Results

• Conclusion

• Future Work

24

Empirical Results

• The bold subgraph was mined and appears in 29% of the organisms in KEGG

• The solid subgraph appears in 19.3%

• The entire graph appears in 14.2%

[Figure: Glutamate]

25

Empirical Results

[Figures: Alanine-aspartate and Pyrimidine; frequencies shown: 32.1%, 19.2%, 11.5% and 25.6%, 21.8%, 15.4%]

26

Empirical Results

• Run-time results for a Pentium 4, 2 GHz, 0.5 GB of RAM

• A subpathway of 16 edges was discovered in 3 sec.

• The entire graph appears in 14.2%

27

Outline

• Introduction ☺

• Model ☺

• Approach: Graph Mining ☺

– Related Work ☺

– Formalism for metabolic pathways ☺

– The Algorithm ☺

• Discussion and Empirical Results ☺

• Conclusion

• Future Work

28

Conclusion

• Framework for mining biological networks

• Graph simplification without losing biological meaning

• Efficient graph mining

• Good response times

29

Outline

• Introduction ☺

• Model ☺

• Approach: Graph Mining ☺

– Related Work ☺

– Formalism for metabolic pathways ☺

– The Algorithm ☺

• Discussion and Empirical Results ☺

• Conclusion ☺

• Future Work

30

Future Work

• Adding flexibility for capturing biologically meaningful information and concepts, for example through probabilistic methods

• Probabilistic models for investigating the significance of discovered patterns (unlike the previous case, here probability does not model the biology)

• Approximate rather than exact matching

– What is an approximation in this case? A suitable definition is needed

31

NEXT PAPER (IN BRIEF)…

32

Seminar in Bioinformatics

Pairwise Local Alignment of Protein Interaction Networks Guided by Models of Evolution

Paper by: M. Koyuturk, A. Grama and W. Szpankowski

Appeared in: Journal of Comp. Biology, 13(2), 182-199, 2006.

Presented by: Royi Ronen

33

The Problem

• Protein-protein interaction (PPI) networks are modeled as graphs

• A PPI network is an undirected graph (V,E)

– Elements in V represent proteins

– Elements in E represent pairs of proteins that interact

• The paper solves the problem of aligning two graphs (rather than many)

34

Homology Function S(•,•)

• Consider two graphs: G(U,E), H(V,F)

• For each pair of proteins from the union of U and V, S assigns a score:

– If the two proteins belong to the same species, the score is the confidence that they are paralogous; if they belong to different species, the confidence that they are orthologous. 0 is the lowest value

– Values of S are determined by an algorithm outside the scope of the paper (INPARANOID)

• Some definitions:

– Match: A conserved interaction between orthologous pairs

– Mismatch: A lack of interaction between a pair whose orthologs interact

– Duplication: Paralogous proteins (tend to diverge in the long run)

35

Proposed Solution

• Every pair of node subsets induces an alignment {M,N,D}, which is associated with a score

• M: pairs of edges whose endpoints have positive S values and which exist in both graphs. Each is associated with a positive score

• N: pairs of edges whose endpoints have positive S values and which exist in one graph but not in the other. Each is associated with a negative score

• D: pairs of nodes from the same graph with positive S. Each is associated with a negative score

• The total score is the sum of all these scores, and we wish to find alignments with locally maximal scores
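
The scoring of an alignment can be sketched in Python under strong simplifying assumptions. Each match, mismatch and duplication contributes a fixed amount given by the hypothetical weights `mu`, `nu` and `delta`, whereas the slides do not state how the paper weighs individual scores. The function name, the dictionary encoding of S and the toy networks are all illustrative.

```python
from itertools import combinations

def alignment_score(E, F, U_sub, V_sub, S, mu=1.0, nu=0.5, delta=0.5):
    """Score of the alignment induced by node subsets U_sub of G and V_sub of H.

    E and F are sets of undirected edges (frozensets of two proteins);
    S maps frozenset({p, q}) to a homology score, and missing pairs count as 0.
    """
    def sim(p, q):
        return S.get(frozenset((p, q)), 0.0)

    matches = mismatches = 0
    for u1, u2 in combinations(sorted(U_sub), 2):
        for v1, v2 in combinations(sorted(V_sub), 2):
            # The two pairs correspond if their endpoints are homologous
            # in one of the two possible orientations.
            homologous = (sim(u1, v1) > 0 and sim(u2, v2) > 0) or \
                         (sim(u1, v2) > 0 and sim(u2, v1) > 0)
            if not homologous:
                continue
            in_g = frozenset((u1, u2)) in E
            in_h = frozenset((v1, v2)) in F
            if in_g and in_h:
                matches += 1        # interaction conserved in both networks
            elif in_g != in_h:
                mismatches += 1     # interaction present in only one network
    # Duplications: homologous pairs inside the same network.
    duplications = sum(
        1 for sub in (U_sub, V_sub)
        for p, q in combinations(sorted(sub), 2) if sim(p, q) > 0
    )
    return mu * matches - nu * mismatches - delta * duplications

# Hypothetical toy networks: h* are proteins of G, m* are proteins of H.
E = {frozenset(("h1", "h2")), frozenset(("h2", "h3"))}
F = {frozenset(("m1", "m2"))}
S = {frozenset(("h1", "m1")): 1.0,
     frozenset(("h2", "m2")): 1.0,
     frozenset(("h3", "m3")): 1.0}

score = alignment_score(E, F, {"h1", "h2", "h3"}, {"m1", "m2", "m3"}, S)
```

In this toy case the interaction h1-h2 is matched by m1-m2, the interaction h2-h3 has no counterpart m2-m3, and there are no duplications, so the score is 1.0 - 0.5 = 0.5.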

36

Proposed Solution

• An algorithm is proposed in order to avoid considering all possible subsets

• The heuristic tries to expand a set so that its score increases

• Rings a bell?

37

Experimental Results

• Using this alignment method and a scoring algorithm for S(•,•) called INPARANOID, PPI networks of human and mouse were aligned

• Data taken from the DIP database

• Details:

– Homo sapiens: 1369 interactions between 1065 proteins

– Mus musculus: 286 interactions between 329 proteins

38

Experimental Results

• INPARANOID discovered 237 ortholog clusters

• 305 matched interactions were discovered; 205 mismatches and 536 duplications in human; 149 mismatches and 384 duplications in mouse

• Examples:

– Conserved subnet with one-way mismatches

– Conserved subnet with two-way mismatches

– Duplications

39

Example 1

• Graphs aligned

• Biological meaning

– Similarity and differences between the species

– Insight on evolutionary events

40

Example 2

• Another graph alignment result with a locally maximal score

41

Example 3

• Instance of duplication between mouse and human

• The regulator regulates homologs
