Information Processing Letters 93 (2005) 239–244
www.elsevier.com/locate/ipl
Approximating the Maximum Agreement Forest on k trees
Frédéric Chataigner
LIAFA, université Denis Diderot, 2 place Jussieu, Paris cedex 05 75251, France
Received 9 September 2004; received in revised form 12 November 2004
Available online 13 December 2004
Communicated by L. Boasson
Abstract
The Maximum Agreement Forest problem (MAF) asks for the largest common subforest of a set of binary trees. This problem is known to be MAXSNP-complete for instances consisting of 2 trees. We show that it remains MAXSNP-complete for k ≥ 2 trees. © 2004 Elsevier B.V. All rights reserved.
Keywords: Computational complexity; Approximation algorithms
1. Introduction

The Maximum Agreement Forest problem (MAF) asks for the largest common subforest of a set of trees. It was introduced as a tool for studying an editing distance on trees: the Subtree Prune and Regraft (SPR), which itself is motivated by the problem of recombinations in phylogenetic trees. The number of connected components in the MAF of two trees is indeed roughly equal to the SPR distance between them, and in turn the SPR distance between the two trees is the number of recombination events required in an evolution model explaining the two trees. The study of recombinations in phylogenetic reconstruction started with a paper from Hein [3], who also introduced the idea of MAF in a subsequent paper [4] in which the MAXSNP-completeness of MAF for 2 binary trees was proved, and an approximation algorithm given. Some proofs in this article were later corrected by Sagot et al. [5], while the study of the SPR distance progressed in papers like [1]. The lemma from [3] stating the equivalence of the MAF problem and the SPR distance was also corrected in [1], who showed a counterexample and ties to a slightly different distance, the Tree Bisection and Regraft distance (TBR).

This leaves open the complexity of the MAF problem for more than 2 trees. In this paper, we use a novel approach to compute maximum agreement forests, and show that there is no increase in complexity when dealing with a finite number of input trees.

E-mail address: [email protected] (F. Chataigner).
0020-0190/$ – see front matter © 2004 Elsevier B.V. All rights reserved. doi:10.1016/j.ipl.2004.11.004
Fig. 1. Three trees (in black) and an agreement forest for them (in gray).
2. Definitions
A phylogenetic tree, or tree in this paper, is a leaf-labeled rooted binary tree. Leaf labels are unique. A phylogenetic forest is a leaf-labeled forest of phylogenetic trees. We note L(T) or L(F) the set of leaves of trees and forests, and r(T) or r(F) their root(s). We denote by T_v the subtree of T rooted at its node v. For a tree T and a subset L of L(T), the subtree induced by L, noted T|L, is the tree defined as the smallest connected subgraph of T containing L. A forest F is a subforest of a tree T if the following conditions are met:

• L(F) = L(T).
• F is a directed subgraph of T (edge orientations must be respected).

Subdividing an edge e = (u, v) is the operation consisting in adding a node w and replacing e by the pair of edges (u, w) and (w, v) (again, edge orientation must be respected). A graph H is a subdivision of a graph G if H can be obtained from G by a series of edge subdivisions. An agreement forest F between trees T_1, ..., T_k is a forest such that each T_i contains a subdivision of F as subforest. A maximum agreement forest is an agreement forest with a maximum number of edges, or equivalently, with a minimum number of connected components. We note c(F) the number of connected components in a graph F. The topology of a tree T is the unique binary tree of which T is a subdivision. For example, if T is a tree with three leaves {a, b, c}, there are 3 possible topologies for T, which we abbreviate in ((a, b), c), ((b, c), a) and ((a, c), b), respectively.
Fig. 2. Topologies of a tree with 3 leaves.
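The notions of induced subtree and topology can be made concrete with a small sketch. The nested-tuple encoding (a leaf is a string, an internal node is a pair of subtrees) and the helper names `restrict` and `cherry` are assumptions made for this illustration only; they do not come from the paper.

```python
def restrict(tree, keep):
    """Topology of the subtree induced by the leaves in `keep` (None if no leaf is kept)."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    left, right = restrict(tree[0], keep), restrict(tree[1], keep)
    if left is None:
        return right  # degree-2 node suppressed: only one side survives
    if right is None:
        return left
    return (left, right)


def cherry(tree, triple):
    """Return the two leaves of `triple` that are grouped together in T|triple."""
    left, right = restrict(tree, set(triple))
    pair = left if isinstance(left, tuple) else right
    return frozenset(pair)


T = ((("a", "b"), "c"), ("d", "e"))
print(sorted(cherry(T, ("a", "b", "c"))))  # prints ['a', 'b'], i.e. topology ((a, b), c)
```

Identifying the topology of a triple by its grouped pair works because a rooted binary tree on three leaves is determined by which two leaves are siblings.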
2.1. Supporting propositions
We will need to compare tree topologies, more precisely to guarantee that different (sub)trees have the same topology. This will be done by means of conflicts: given two trees A and B, a triple of leaves U = {a, b, c} is a conflict if the induced subtrees A|U and B|U have different topologies. Looking for a partition without conflicts is motivated by the following:

Proposition 1. Two trees with the same set of leaves have the same topology iff they do not have any conflict.
Proof. If trees A and B do not have any conflicts, we consider the roots r_A and r_B of A and B. Let r_A have 2 children nodes a1 and a2, and let x be a leaf in A_a1 and y be a leaf in A_a2. Let b1 and b2 be the 2 children of r_B. If x and y are both in B_b1 (or both in B_b2), then let z be a leaf in B_b2 (respectively B_b1). z being a leaf in A, it lies either in A_a1 or A_a2. Thus the triple {x, y, z} has topology ((x, z), y) or (x, (y, z)) in A, and topology ((x, y), z) in B. This is a conflict, contradicting our hypothesis. Therefore x is in B_b1 and y in B_b2. Then for each z ∉ {x, y}, we have either z ∈ A_a1 and topology ((x, z), y) for the triple {x, y, z}, or z ∈ A_a2 and topology (x, (y, z)). Since there are no conflicts, z ∈ A_a1 ⇒ A|{x,y,z} = ((x, z), y) = B|{x,y,z} ⇒ z ∈ B_b1, and for the same reason, z ∈ A_a2 ⇒ z ∈ B_b2. This means that L(A_a1) = L(B_b1); since A_a1 is a subtree of A and B_b1 a subtree of B, the set of conflicts between A_a1 and B_b1 is a subset of the conflicts between A and B, i.e., it is empty. An induction on the size of the trees completes the proof. □
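Proposition 1 can be checked by brute force on small instances. The sketch below is an illustration under assumptions: trees are encoded as nested tuples (a leaf is a string, an internal node is a pair), both trees carry the same leaf set, and the names `leaves`, `restrict`, `canonical` and `conflicts` are hypothetical helpers. It enumerates all triples, so it is only meant for toy examples.

```python
from itertools import combinations


def leaves(tree):
    """Set of leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return leaves(tree[0]) | leaves(tree[1])


def restrict(tree, keep):
    """Topology of the subtree induced by the leaves in `keep` (None if empty)."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    left, right = restrict(tree[0], keep), restrict(tree[1], keep)
    if left is None:
        return right
    if right is None:
        return left
    return (left, right)


def canonical(tree):
    """Order-independent form of a topology, so that (a, b) equals (b, a)."""
    if isinstance(tree, str):
        return tree
    return frozenset(canonical(child) for child in tree)


def conflicts(A, B):
    """All triples of leaves whose induced topologies differ in A and B."""
    return [U for U in combinations(sorted(leaves(A)), 3)
            if canonical(restrict(A, set(U))) != canonical(restrict(B, set(U)))]


A = (("a", "b"), ("c", "d"))
B = (("d", "c"), ("b", "a"))   # same topology as A, drawn differently
C = (("a", "c"), ("b", "d"))   # different topology
print(conflicts(A, B), len(conflicts(A, C)))
```

On this example the two drawings of the same topology have no conflict, while every triple is a conflict between A and C.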
One step of the algorithm will involve finding a maximal disjoint set of subtrees in a tree T, more precisely subtrees induced by conflicts, i.e., of the form T|{a,b,c}, with a, b, c ∈ L(T). This involves some dynamic programming:

Proposition 2. Let T be a rooted binary tree, and let 𝒰 be a set of triples in L(T). A collection with maximal size of edge-disjoint subtrees T|U, where U ∈ 𝒰, can be found in polynomial time.
Proof. This is done by means of bottom-up dynamic programming in the tree T. For each node v ∈ T, we construct a table f(v, U_i) with one entry per triple U_i ∈ 𝒰, plus one entry f(v, ∅) defined below. The entry for the triple U_i is the maximum number of edge-disjoint T|U_j, j ≠ i, whose edge set can fit entirely in T_v \ T|U_i. Entry f(v, U_i) is not filled if v ∉ T|U_i. The entry f(v, ∅) is the number of edge-disjoint T|U_j whose edge set can fit entirely in T_v. Let x and y be the children nodes of v, and consider an entry in f(v, ·).

• f(v, ∅): consider a maximal set C of T|U_i fitting entirely in T_v. If v is a node of one of the T|U_i in this maximal set, then this T|U_i must fit entirely in T_v, meaning v is the root of T|U_i, and x, y ∈ T|U_i. By maximality of C, the set C − T|U_i is a maximal set of T|U_j whose edge sets fit in T_v \ T|U_i. Since T_v \ T|U_i can be split in 2 parts T_x \ T|U_i and T_y \ T|U_i, we can write f(v, ∅) = f(x, U_i) + f(y, U_i) + 1. The other possibility is that no T|U_i contains v. In this case, neither (v, x) nor (v, y) are on a T|U_j, and the maximal number of subtrees is the sum of the maximal number of subtrees fitting entirely in T_x and T_y, that is f(x, ∅) + f(y, ∅):

\[
f(v,\emptyset) = \max\Bigl\{\, f(x,\emptyset) + f(y,\emptyset);\ \max_{j \,:\, v \text{ root of } T|_{U_j}} \bigl( f(x,U_j) + f(y,U_j) + 1 \bigr) \Bigr\}.
\]

• f(v, U_i) and both edges (v, x) and (v, y) are in the edge set of T|U_i: then a maximal set of triples in T_v \ T|U_i can be constructed by merging a maximal set for T_x \ T|U_i and a maximal set for T_y \ T|U_i, so f(v, U_i) = f(x, U_i) + f(y, U_i).
• f(v, U_i) and (v, x) ∈ T|U_i and (v, y) ∉ T|U_i: a maximal set of triples in T_v \ T|U_i can be constructed by merging a maximal set for T_x \ T|U_i and a maximal set for T_y, so f(v, U_i) = f(x, U_i) + f(y, ∅).

Entries f(v, U) are computed in constant time, and entries f(v, ∅) require the computation of a maximum of at most |𝒰| values. Since there are n − 1 internal nodes v, and at most n³ different triples in 𝒰 (where n is the number of leaves in the tree), this algorithm uses O(n⁴) time and space.

The entry f(r(T), ∅) gives a maximal number of disjoint triples fitting entirely in T. A maximal set of triples can then be constructed by backtracking. □
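To make the object computed in Proposition 2 tangible, the sketch below finds a largest edge-disjoint collection by exhaustive search. It is deliberately not the dynamic program of the proof: it is exponential in the number of triples and only serves as a reference on toy inputs. The encoding is an assumption of this illustration: trees are nested tuples, and the edge entering a node is identified by the leaf set of that node; `node_leafsets`, `induced_edges` and `max_disjoint` are hypothetical names.

```python
from itertools import combinations


def node_leafsets(tree):
    """Leaf set of every node, children before parents; the root comes last."""
    if isinstance(tree, str):
        return [frozenset([tree])]
    left, right = node_leafsets(tree[0]), node_leafsets(tree[1])
    return left + right + [left[-1] | right[-1]]


def induced_edges(tree, triple):
    """Edges of T|triple: nodes holding some, but not all, leaves of the triple."""
    u = frozenset(triple)
    return {v for v in node_leafsets(tree) if v & u and not u <= v}


def max_disjoint(tree, triples):
    """Largest sub-collection of triples with pairwise edge-disjoint induced subtrees."""
    edge_sets = [induced_edges(tree, t) for t in triples]
    for size in range(len(triples), 0, -1):
        for chosen in combinations(range(len(triples)), size):
            if all(edge_sets[i].isdisjoint(edge_sets[j])
                   for i, j in combinations(chosen, 2)):
                return [triples[i] for i in chosen]
    return []


T = ((("a", "b"), "c"), (("d", "e"), "f"))
triples = [("a", "b", "c"), ("a", "c", "d"), ("d", "e", "f")]
print(max_disjoint(T, triples))
```

Here the triple (a, c, d) crosses the root and shares edges with both other triples, so the search keeps the two triples that sit in separate halves of the tree.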
The algorithm for the MAF problem uses the following subproblem, which we call minimum conflict cut: let T be a binary rooted tree, and 𝒰 a set of triples in L(T); find a minimal set M of edges in T cutting T|U for each U ∈ 𝒰, i.e., a set such that for each U ∈ 𝒰, there exists at least one edge e ∈ M ∩ E(T|U). This problem is very similar to the problem of finding a minimum multi-cut in a tree, and the approximation algorithm we present for it is inspired from the algorithm in [2].

Proposition 3. Minimum conflict cut can be approximated within a ratio 4 in polynomial time.

Proof. The algorithm constructs a conflict cut M from a subset F of the triples, which we will call a flow. While the min cut/max flow theorem does not apply here, we have the following inequality: maxflow ≤ mincut.

This can be shown by means of integer and linear programming. We start by expressing the problem in terms of integer programming:
\[
\begin{aligned}
\min\ & \sum_{e \in E(T)} d_e \\
& \sum_{e \,:\, e \in T|_U} d_e \ge 1 && \forall U \in \mathcal{U} \\
& d_e \in \{0; 1\}
\end{aligned}
\]

In this integer program, d_e = 1 ⇔ e ∈ M. Relaxing this to a linear program, and then taking its dual, one gets:

\[
\begin{aligned}
\max\ & \sum_{U \in \mathcal{U}} f_U \\
& \sum_{U \,:\, e \in T|_U} f_U \le 1 && \forall e \in E(T) \\
& f_U \ge 0
\end{aligned}
\]

This latter linear program can then be strengthened to an integer program, and since all f_U appear in inequalities bounded by 1, in that integer program, we have f_U ∈ {0, 1}. By analogy with the multicut problem in trees, we call the dual variables f_U "flows". The set M of edges for which d_e = 1 will also be called a multicut. The relations between the integer programs and their relaxations, and the equality of the two dual linear programs gives the following:

\[
\text{max integral flow} \;\le\; \text{max fractional flow} \;=\; \text{min fractional multi-cut} \;\le\; \text{min multi-cut}
\]
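The outer inequality, max flow ≤ min multi-cut, can also be verified directly from the constraints of the two programs. The short derivation below is an added illustration rather than part of the original proof; it holds for any feasible d and any feasible f, fractional or integral.

```latex
\[
\sum_{U \in \mathcal{U}} f_U
\;\le\; \sum_{U \in \mathcal{U}} f_U \sum_{e \,:\, e \in T|_U} d_e
\;=\; \sum_{e \in E(T)} d_e \sum_{U \,:\, e \in T|_U} f_U
\;\le\; \sum_{e \in E(T)} d_e
\]
```

The first step uses the cut constraint (each inner sum over the edges of T|U is at least 1) together with f_U ≥ 0, the middle step exchanges the order of summation, and the last step uses the flow constraint (each inner sum over triples is at most 1) together with d_e ≥ 0.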
Proposition 2 indicates that it is possible to compute the maximal integral flow, i.e., a set 𝒱 ⊂ 𝒰 of triples. We now produce a multicut based on this maximal integral flow.

For each triple U = {a, b, c}, with topology ((a, b), c) for T|U, we note r_U the root of T|U, defined as lca(a, c) in T, and by s_U the node lca(a, b). Let M1 be the set of edges entering the nodes r_U and s_U, for all triples U ∈ 𝒱. If r(T) = r_U for some U, it has no entering edge, so we pick one of its two outgoing edges instead. Let 𝒰_1 be the set of triples U such that T|U does not contain an edge of M1. We need to find a set M2 of edges such that each T|U, U ∈ 𝒰_1, contains at least one edge from M2. Since triples from 𝒰_1 must avoid M1, they have some properties. These first 2 properties are direct consequences of the choice of M1:

• P1: a triple U ∈ 𝒰_1 sharing nodes with a triple V ∈ 𝒱 satisfies r(T|U) ∈ T|V.
• P2: a triple U ∈ 𝒰_1 shares nodes with at most one triple from 𝒱.
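The construction of M1 and of the residual set of uncut triples can be sketched as follows. This is an illustration under assumptions, not the paper's implementation: trees are nested tuples with at least three leaves, the edge entering a node is named by that node's leaf set, and `first_cut` and `uncut_triples` are hypothetical helper names. When r_U is the root of T, the sketch arbitrarily picks the edge to the root's right child.

```python
def node_leafsets(tree):
    """Leaf set of every node, children before parents; the root comes last."""
    if isinstance(tree, str):
        return [frozenset([tree])]
    left, right = node_leafsets(tree[0]), node_leafsets(tree[1])
    return left + right + [left[-1] | right[-1]]


def induced_edges(nodes, triple):
    """Edges of T|triple: nodes holding some, but not all, leaves of the triple."""
    u = frozenset(triple)
    return {v for v in nodes if v & u and not u <= v}


def first_cut(tree, flow):
    """Edges entering r_U and s_U for every triple U of the flow."""
    nodes = node_leafsets(tree)
    root, root_child = nodes[-1], nodes[-2]
    m1 = set()
    for triple in flow:
        u = frozenset(triple)
        r = min((n for n in nodes if u <= n), key=len)           # lca of all three leaves
        s = min((n for n in nodes if len(n & u) >= 2), key=len)  # lca of the grouped pair
        m1.add(s)
        m1.add(root_child if r == root else r)  # the root has no entering edge
    return m1


def uncut_triples(tree, triples, m1):
    """Triples whose induced subtree contains no edge of m1."""
    nodes = node_leafsets(tree)
    return [u for u in triples if not induced_edges(nodes, u) & m1]


T = ((("a", "b"), "c"), (("d", "e"), "f"))
M1 = first_cut(T, [("a", "b", "c"), ("d", "e", "f")])
print(uncut_triples(T, [("a", "c", "d")], M1))  # prints []: the edge entering lca(a, b) lies on T|{a,c,d}
```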
First observe that U ∈ 𝒰_1 shares at least one edge with some V ∈ 𝒱. Indeed, if a triple U ∉ 𝒱 is edge-disjoint from all the triples in 𝒱, it could be added to 𝒱, thus contradicting the maximality of 𝒱. For U ∈ 𝒰_1 and V ∈ 𝒱 such that U ∩ V ≠ ∅, let then x be the node in T|U ∩ T|V closest to the root r(T). If x = r(T), then one of its outgoing edges is in M1 and M1 ∩ T|U ≠ ∅, a contradiction on the choice of U. If the parent y of x is in T|U, then y ∉ T|V, and the edge (y, x) ∈ T|U. Since (y, x) ∉ T|V, x is the root of T|V, and (y, x) ∈ M1, a contradiction on the choice of U again. The only remaining possibility is y ∈ T|V and y ∉ T|U. In this case, x = r(T|U) and x ∈ T|V, so r(T|U) ∈ T|V as required. P2 stems from P1 and the fact that the T|V for V ∈ 𝒱 are edge-disjoint: P1 states that r(T|U) lies on each T|V with nodes in common with T|U; because of the degree 3 for the internal nodes of T, edge-disjoint conflicts are also node-disjoint; since r(T|U) is unique, it can lie on only one T|V.

For each triple V = {a, b, c} ∈ 𝒱, let 𝒰_V be the set of triples U ∈ 𝒰_1 such that T|U shares nodes with T|V = ((a, b), c). According to P1 and P2, 𝒰_1 is a disjoint union of the 𝒰_V, and it is sufficient to find, for each V ∈ 𝒱, edges of T|V to intersect all the triples in 𝒰_V. Let us focus on a particular V ∈ 𝒱. The fact that T is a tree induces this important property:

• P3: if U1, U2 ∈ 𝒰_V are such that T|U1 ∩ T|U2 ≠ ∅, then there exists w ∈ T|V such that w ∈ T|U1 ∩ T|U2.

If U1, U2 ∈ 𝒰_V, let w be a node in T|U1 ∩ T|U2. Since T is a tree, the path between w and r(T) is unique. (r(T|U1) → w) and (r(T|U2) → w) both being subintervals of (r(T) → w), one is contained in the other. Suppose (r(T|U1) → w) contains (r(T|U2) → w). Then P1 implies that r(T|U2) ∈ T|V, so r(T|U2) ∈ T|V ∩ T|U2 ∩ T|U1, i.e., P3 is true.

Since we added the edge entering s_V to M1, we need to split 𝒰_V in two parts. Let d be the parent of s_V, let W1 be the set of triples with edges in common with T|{a,b}, and W2 the set of triples with edges in common with T|{d,c}. By construction, 𝒰_V = W1 ∪ W2. These two sets of triples W1 and W2 have a nice property: either W1 = ∅ or W2 = ∅.
We start by proving that W1 ∩ W2 = ∅. If U ∈ W1 ∩ W2, then T|U contains one node from T|{a,b} and one node from T|{c,d}. By the connectivity of T and T|U, this means (d, s_V) ∈ T|U, contradicting U ∈ 𝒰_1. Therefore U does not exist and W1 ∩ W2 = ∅. Suppose there exists U1 ∈ W1 and U2 ∈ W2. If U1, U2 are not disjoint, P3 implies there exists a node w ∈ T|U1 ∩ T|U2 ∩ T|V. If w ∈ T|{a,b}, then by the connectivity T|U2 would contain the edge (d, s_V), which is a contradiction; and if w ∈ T|{c,d}, then T|U1 would contain (d, s_V), again a contradiction. So U1, U2, if they exist, must be edge-disjoint. But then U1, U2 are edge-disjoint from each other, and because of P2 are also edge-disjoint from the triples in 𝒱 − V, so one could replace V by {U1, U2} in 𝒱, thus contradicting the maximality of 𝒱.

Fig. 3. The subtree induced by V ∈ U and two triples U1, U2 not in U. The algorithm needs to put an edge from K in the multicut.

From now on, we suppose W1 is the empty set, and W2 ≠ ∅. Let us write the nodes of the path T|{c,d} as c = x_0; x_1; ···; x_n = d. Then for each U ∈ W2, the non-empty subpath T|V ∩ T|U is of the form x_{c_U}; x_{c_U+1}; ···; x_{d_U}. Since 2 edge-disjoint triples in W2 would also be edge-disjoint from the triples in 𝒱 − V, they could be used to replace V and contradict the maximality of 𝒱. So any 2 triples in W2 have some edges in common (not necessarily edges of T|V). Then P3 implies that each pair of triples in W2 has at least one node of T|{c,d} in common, so no two intervals [c_U .. d_U] are disjoint. As a consequence, the intersection K of all [c_U, d_U] is non-empty. If K is a single node, we add in the set M2 the 2 edges of T|{c,d} incident to that node. If K contains at least one edge, we add that edge. By construction, the edge(s) we pick are contained in all triples of W2. M1 and M2 together place at least one edge on each triple of 𝒰 and both contain at most 2 edges per triple of 𝒱. Therefore we construct a multicut M whose size is bounded by 4|𝒱|. Combined with the inequalities from the linear and integer programs, this proves we can approximate the minimum conflict cut within a ratio 4. □

3. Algorithm

We present a polynomial time algorithm that achieves a constant ratio approximation of the MAF problem for k ≥ 2 binary trees. The algorithm has 2 steps, each one ensuring a needed property of the agreement forest; in turn, the combination of the two properties will ensure the produced forest is an agreement forest for all the trees of the instance.
Theorem 4. The maximum agreement forest on k trees can be approximated within a ratio 8 in polynomial time.

Proof. Let us note L the common set of leaves of the input trees. A forest F obtained from a tree T, with trees F_1 ... F_m, can be defined by the partition of the set of leaves L(T) induced by the subtrees F_i, i.e., by L(F_1)/L(F_2)/···/L(F_m). Conversely, a partition of the leaves induces a set of subtrees, but those may not form a forest obtained from T by removing edges if the subtrees share edges or nodes. The algorithm seeks a partition L_1 ... L_m of L such that:

• Ptopo: for each j ∈ [1..m], the subtrees T_i|L_j induced by L_j have the same topology, and the partition induces a subforest on T_1.
• Psep: for each i ∈ [1..k], the subtrees T_i|L_j are vertex-disjoint (and thus edge-disjoint too).

The second property ensures that such a partition defines a subforest for each T_i, and the first property ensures that all the induced subforests have the same topology, i.e., they form an agreement forest. The MAF obviously satisfies both Ptopo and Psep. We denote by m* the size of the MAF, that is m* = c(MAF) − 1.

The first step is meant to produce a partition of the leaves satisfying Ptopo. The partition will then be refined in the second step to ensure Psep. We proceed as follows:
• we single out one of the k trees, say T_1, and consider the set 𝒰 of conflicts between the pair of trees (T_1, T_i), for each i ≠ 1.
• we compute an approximate minimum conflict cut M for 𝒰 and T_1.
• we take the partition defined by the forest T_1 − M.
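The last bullet, reading the leaf partition off the forest T_1 − M, can be sketched as below. The encoding is an assumption of this illustration: T_1 is a nested tuple, and each edge of M is named by the leaf set of the node it enters; `partition` is a hypothetical helper name. Components of T_1 − M that contain no leaf are ignored, since only the partition of the leaves matters here.

```python
def partition(tree, cut):
    """Blocks of leaves of the connected components of `tree` minus the edges in `cut`."""
    blocks = []

    def walk(node):
        # Returns (leaf set of node, leaves still connected to node after cutting below it).
        if isinstance(node, str):
            full = attached = frozenset([node])
        else:
            (lfull, latt), (rfull, ratt) = walk(node[0]), walk(node[1])
            full, attached = lfull | rfull, latt | ratt
        if full in cut:  # the edge entering this node is removed
            if attached:
                blocks.append(attached)
            return full, frozenset()
        return full, attached

    _, rest = walk(tree)
    if rest:
        blocks.append(rest)
    return blocks


T1 = ((("a", "b"), "c"), (("d", "e"), "f"))
M = {frozenset("ab"), frozenset("def")}
print([sorted(block) for block in partition(T1, M)])
```

On this example, removing the edges entering lca(a, b) and lca(d, f) leaves three blocks: {a, b}, {d, e, f} and {c}.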
Fig. 4. The forest F is the topology of the union of the bold gray subtrees on T_1 and T_2. F has no conflicts, but is not an agreement forest yet: the dashed edges are overlapping edges of its subdivision in T_2.
By construction, a set L_j in the resulting partition will not contain any conflict, and Proposition 1 implies that the tree X of T_1 − M such that L_j = L(X) has the same topology as all the trees T_i|L_j, for each i ≠ 1. This ensures Ptopo is satisfied. The MAF satisfies Ptopo, so any set of edges whose removal from T_1 produces the MAF is a conflict cut. For any subforest F of T_1, only c(F) − 1 edges need to be removed from T_1 to produce the same partition of the leaves, so a set M of minimal size producing the MAF has size |M| ≤ c(MAF) − 1 = m*. This in turn implies that the forest produced by this step has at most 4m* connected components.

The second step adds up to 4m* edges of T_1 to M, to refine the partition from the first step and satisfy Psep. Let F be the topology of the forest T_1 − M produced in the first step. Thus internal nodes of F have outdegree 2, and can be identified by means of lowest common ancestors.

Each node v of F corresponds to a unique node of each T_i, namely the lowest common ancestor of a node in the left subtree of v in F and a node in the right subtree. Then for each edge a of F, let P_i(a) be the path corresponding to this edge in T_i, i.e., the path between the nodes of T_i in correspondence with the ends of a in F. These paths can have length greater than 1; indeed, on the previous figure, for the edge x = (lca(c, e); e), the path P_2(x) has length 2. We proceed as follows:
• build the graph G whose nodes represent the edges of F and an edge (a, b) ∈ E(G) represents a collision, i.e., some i such that P_i(a) ∩ P_i(b) ≠ ∅.
• compute a vertex cover of G.
• for each node a of the vertex cover, remove one edge from P_1(a).
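The first two bullets can be sketched as follows. The vertex cover routine is the standard maximal-matching heuristic, which yields a cover at most twice the minimum size; the paper does not name a specific method, so this choice is an assumption. The input format is also hypothetical: `paths[i][a]` is taken to be the set of elements of the path P_i(a), and how shared endpoints of adjacent edges of F are treated is left to whoever builds these sets.

```python
from itertools import combinations


def collision_graph(paths):
    """Edges (a, b) of G: some tree i in which the paths of a and b intersect."""
    edges = set()
    for per_tree in paths:
        for a, b in combinations(sorted(per_tree), 2):
            if per_tree[a] & per_tree[b]:
                edges.add((a, b))
    return edges


def vertex_cover_2approx(edges):
    """Take both endpoints of every edge of a greedily built maximal matching."""
    cover = set()
    for a, b in sorted(edges):
        if a not in cover and b not in cover:
            cover.update((a, b))
    return cover


# Toy data: three edges x, y, z of F and their paths in two trees.
paths = [
    {"x": {1, 2}, "y": {2, 3}, "z": {4}},
    {"x": {5}, "y": {6}, "z": {6}},
]
G = collision_graph(paths)
print(sorted(G), sorted(vertex_cover_2approx(G)))
```

In this toy instance the minimum cover is {y}, and the heuristic returns {x, y}, within the factor 2.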
Since the MAF satisfies Psep, the edges defining the same partition of the leaves as the MAF can be removed from any subforest of T_1 to ensure Psep. This means that the minimum number of connected components to create from T_1 − M to ensure Psep is bounded by m*. The number of edges in a rooted binary forest with n leaves and p connected components is 2(n − 1 − p). Creating m* connected components in T_1 − M produces a subforest with 2(n − 1 − p − m*) edges, so at most 2m* edges need to be removed to ensure Psep. Removing the edge represented by a node a in G removes all the collisions it was causing, corresponding to all the edges incident to a in G, and conversely. Thus a minimum vertex cover of G has size at most 2m*. Since it can be easily approximated within a ratio 2, we can find a set of edges of size at most 4m* to remove from T_1 − M so that Psep is satisfied.

We remove up to 8m* edges from T_1, thus producing a subforest of T_1 of size bounded by 8m*. Since Ptopo and Psep are satisfied, this is an agreement forest approximating the MAF within a ratio 8. □

4. Conclusion

We have shown that the problem MAF on k binary trees can be approximated with a ratio 8 by a polynomial time algorithm independently of k. For small values of k, this ratio is significantly worse than the k + 1 achieved by the algorithm of [4]. Work is needed to improve the approximation ratios of the various steps in our algorithm. One can also wonder whether this complexity result can be extended to the case of bounded degree trees; preliminary research seems to indicate this is the case.
References
[1] B.L. Allen, M. Steel, Subtree transfer operations and their induced metrics on evolutionary trees, Ann. Combin. 5 (2001) 1–13.
[2] N. Garg, V.V. Vazirani, M. Yannakakis, Approximate max-flow min-(multi)cut theorems and their applications, SIAM J. Comput. 25 (1996) 235–251.
[3] J. Hein, A heuristic method to reconstruct the history of sequences subject to recombination, J. Mol. Evol. 32 (1993) 369–405.
[4] J. Hein, T. Jiang, L. Wang, K. Zhang, On the complexity of comparing evolutionary trees, Discrete Appl. Math. 71 (1996) 153–169.
[5] E.M. Rodrigues, M.F. Sagot, Y. Wakabayashi, Some Approximation Results for the Maximum Agreement Forest Problem, in: Lecture Notes in Comput. Sci., vol. 2129, Springer, Berlin, 2001, pp. 159–163.