Theoretical Computer Science 562 (2015) 496-512

Contents lists available at ScienceDirect

Theoretical Computer Science

www.elsevier.com/locate/tcs

Parameterized and approximation algorithms for maximum agreement forest in multifurcating trees

Jianer Chen a,*, Jia-Hao Fan a, Sing-Hoi Sze a,b

a Department of Computer Science & Engineering, Texas A&M University, College Station, TX 77843, USA
b Department of Biochemistry & Biophysics, Texas A&M University, College Station, TX 77843, USA

Article info

Article history:
Received 11 August 2013
Received in revised form 1 October 2014
Accepted 21 October 2014
Available online 27 October 2014
Communicated by G. Ausiello

Keywords:
Approximation algorithm
Fixed-parameter tractability
Phylogenetic tree
Maximum agreement forest

Abstract

We study parameterized algorithms and approximation algorithms for the maximum agreement forest problem, which, for two given leaf-labeled trees, is to find a maximum forest that is a subgraph of both trees. The problem was motivated by research in phylogenetics. For parameterized algorithms, while the problem is known to be fixed-parameter tractable for binary trees, it was an open problem whether the problem is still fixed-parameter tractable for general trees. We resolve this open problem by developing an $O(3^k n)$-time parameterized algorithm for general trees. Our techniques on tree structures also lead to a polynomial-time approximation algorithm of ratio 3 for the problem, giving the first constant-ratio approximation algorithm for general trees.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

The evolutionary relationships between a set of species are usually represented by a phylogenetic tree in which each leaf is labeled by a distinct species. While a popular way to construct a phylogenetic tree is to start from multiple alignments of genes over a given set of species, different methods often lead to different trees. The use of different sets of aligned genes in the given species can also lead to gene trees that are different from the species tree.

To facilitate the comparison of different phylogenetic trees, several distance metrics have been proposed for measuring their similarity [1,8,10,12,13]. A graph-theoretical model, the maximum agreement forest (abbr. MAF), has also been proposed; it provides a combinatorial structure for studying the comparison of phylogenetic trees. In particular, the tree-bisection-and-reconnection (TBR) and the subtree-prune-and-regraft (SPR) distances [2,11,20] have direct correspondences to the size of a maximum agreement forest on unrooted trees [1] and on rooted trees [6], respectively.

While most previous work on MAF is restricted to bifurcating (i.e., binary) trees, the problem and related problems on multifurcating (i.e., general) trees have drawn attention recently. Multifurcating phylogenetic trees have appeared quite often in the research of evolutionary biology [3,14,15,19]. Moreover, the relationship between MAF and tree distance metrics on binary trees can be naturally extended to that on multifurcating trees (e.g., see Theorem 2.1 in the next section). The focus of the current paper is on algorithms for the MAF problem on unrooted general trees, which corresponds to the TBR distance on general phylogenetic trees [1].

* Corresponding author.
E-mail addresses: chen@cs.tamu.edu (J. Chen), grantfan@cs.tamu.edu (J.-H. Fan), sze@cs.tamu.edu (S.-H. Sze).
http://dx.doi.org/10.1016/j.tcs.2014.10.031
0304-3975/© 2014 Elsevier B.V. All rights reserved.

Review on related research. The problem of constructing an MAF for two unrooted trees is NP-hard and MAX SNP-hard, even when it is restricted to binary trees [1,5].

Approximation algorithms have been studied for the problem, mainly on binary trees. An approximation algorithm of ratio 3 for the problem on rooted binary trees was claimed by Hein et al. [12], who also claimed that the MAF problem on rooted binary trees corresponds to the SPR distance. Allen and Steel [1] showed that the claim in [12] on the relationship between MAF and SPR was not true, and, on the other hand, proved that the MAF problem on unrooted binary trees corresponds to the TBR distance. Rodrigues et al. [17] found a subtle error in [12] and showed that the algorithm in [12] has ratio at least 4. Rodrigues et al. [17] then presented a new approximation algorithm and claimed that their algorithm has ratio 3. Bonet et al. [4] provided a counterexample and showed that for the TBR distance, the algorithm in [12] has approximation ratio at least 5 while the algorithm in [17] has approximation ratio at least 4. Using very different methods, Chataigner [7] developed an approximation algorithm of ratio 8 for the TBR distance for two or more binary trees. Recently, Whidden et al. [20,21] presented a linear-time approximation algorithm of ratio 3 for the TBR distance on unrooted binary trees. This is the best known approximation algorithm for the TBR distance on binary trees. We note that there is also a line of research on another metric, the rSPR distance, on binary trees [4,21], for which the best approximation algorithm has ratio 3 and runs in linear time [20,21].

For general trees, to our knowledge, there are currently no known approximation algorithms for the TBR distance. For the SPR distance on rooted general trees, Rodrigues et al. [18] developed an approximation algorithm of ratio $d + 1$, where $d$ is the maximum number of children a node in the input trees may have. There is also a line of research on the maximum acyclic agreement forest problem on general trees [16].

Parameterized algorithms for the MAF problem, parameterized by the number $k$ of trees in the MAF, have also been studied. We say that a problem is fixed-parameter tractable [9] if it is solvable in time $f(k) n^{O(1)}$, where $k$ is the parameter and $f(k)$ is a function independent of the input size $n$. Allen and Steel [1] showed that the MAF problem on unrooted binary trees, which corresponds to the TBR distance, is fixed-parameter tractable. By branching based on inconsistent structures in quartets, Hallett and McCartin [11] developed an algorithm of time $O(4^k k^5 + n^{O(1)})$ for the problem. Whidden and Zeh [20,21] further improved the time complexity to $O(4^k n)$, which is currently the best known parameterized algorithm for the MAF problem on unrooted binary trees. For the MAF problem on rooted binary trees, Bordewich et al. [5] proposed a parameterized algorithm of time $O(4^k k^4 + n^3)$, and Whidden et al. [20,21] improved the time complexity to $O(2.42^k n)$. While there has been significant work that shows the fixed-parameter tractability for the MAF problem and related problems on binary trees, it was unknown whether the MAF problem on general trees is fixed-parameter tractable or not. This has been posed specifically as an open problem by a number of researchers [11,20].

Our contributions. Our focus is on parameterized algorithms and approximation algorithms for the MAF problem on unrooted general trees. Our method is based on a careful study of the graph structures that takes advantage of special relationships among sibling leaves in the given trees. We develop an $O(3^k n)$-time parameterized algorithm for the MAF problem on unrooted general trees, thus showing the fixed-parameter tractability of the problem and resolving the open problem posed in [11,20]. In fact, our algorithm is even faster than the previous best parameterized algorithm for the problem on binary trees, which runs in time $O(4^k n)$ [21]. We also present a polynomial-time approximation algorithm of ratio 3 for the MAF problem on unrooted general trees. The ratio matches the best known approximation ratio for the problem on unrooted binary trees [20,21], but our algorithm keeps the same constant ratio and works for general trees. The only previously known approximation algorithm for the MAF problem on general trees [18] is on rooted trees and has a ratio of $d + 1$, where $d$ is the maximum number of children a node in the trees may have. Our algorithm, which works on unrooted trees, is the first constant-ratio approximation algorithm for the MAF problem on general trees.

2. Preliminaries and problem reformulations

In this paper, all graphs are undirected. For a vertex $v$, an edge $e$, and an edge subset $E$ in a graph $G$, denote by $G - v$, $G - e$, and $G - E$ the graphs obtained from $G$ with $v$, $e$, and the edges in $E$ removed, respectively. All trees in our discussion are unrooted. A leaf of a tree is a vertex of degree less than 2. A forest is a collection of disjoint trees. A nonempty forest $F$ is leaf-labeled over a label-set $L$ if there is a one-to-one mapping from the leaves of $F$ to the elements of $L$ (with all non-leaf vertices unlabeled). The label for a leaf $v$ is denoted by $\ell(v)$. More generally, for a subforest $F'$ of $F$, denote by $\ell(F')$ the set of labels for the leaves in $F'$.

Two leaf-labeled forests $F_1$ and $F_2$ over the same label-set $L$ are isomorphic if there is a bijection $f$ between the vertex sets of $F_1$ and $F_2$ such that any two vertices $u$ and $v$ of $F_1$ are adjacent if and only if $f(u)$ and $f(v)$ are adjacent in $F_2$, and corresponding leaves have the same label. The forests $F_1$ and $F_2$ are homeomorphic if they become isomorphic after contracting all degree-2 vertices (contracting a degree-2 vertex $v$ is to replace the vertex $v$ and its incident edges with a new edge connecting the two neighbors of $v$). Note that if a leaf-labeled forest $F_1$ is homeomorphic to a subforest of a leaf-labeled forest $F_2$, then there is a unique subforest of $F_2$ that is homeomorphic to $F_1$. Therefore, in this case, without any confusion, we can simply say that the forest $F_1$ is a subforest of $F_2$. An agreement forest for two leaf-labeled forests $F_1$ and $F_2$ over the same label-set $L$ is a leaf-labeled forest $F$ over the label-set $L$ such that $F$ is a subforest of both $F_1$ and $F_2$. A maximum agreement forest $F$ (abbr. MAF) for $F_1$ and $F_2$ is an agreement forest for $F_1$ and $F_2$ such that the size of (i.e., the number of trees in) $F$ is minimized over all agreement forests for $F_1$ and $F_2$ [11].
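As an illustration of the definitions above, the following Python sketch tests whether a partition of the label-set gives an agreement forest for two trees. It is not taken from the paper. The adjacency-dictionary representation, the convention that each leaf vertex is named by its label, and the function names spanning_subtree, splits, and is_agreement_forest are all assumptions made for this example. The test relies on two facts that the text does not state explicitly: the minimal subtrees connecting the blocks must be vertex-disjoint in each tree, and two leaf-labeled trees without degree-2 vertices are isomorphic exactly when their edges induce the same set of leaf bipartitions (splits).

```python
from itertools import combinations


def spanning_subtree(adj, keep):
    """Minimal subtree of the tree `adj` that connects the leaves in `keep`."""
    sub = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    stack = [v for v in sub if len(sub[v]) <= 1 and v not in keep]
    while stack:
        v = stack.pop()
        if v not in sub:
            continue  # already pruned
        for u in sub.pop(v):
            sub[u].discard(v)
            if len(sub[u]) <= 1 and u not in keep:
                stack.append(u)
    return sub


def splits(sub, block):
    """Bipartitions of `block` induced by the edges of the subtree `sub`."""
    result = set()
    for u in sub:
        for v in sub[u]:
            # Collect the labels on v's side of the edge (u, v).
            side, stack, seen = set(), [v], {u, v}
            while stack:
                w = stack.pop()
                if w in block:
                    side.add(w)
                for x in sub[w]:
                    if x not in seen:
                        seen.add(x)
                        stack.append(x)
            result.add(frozenset([frozenset(side), frozenset(block - side)]))
    return result


def is_agreement_forest(t1, t2, blocks):
    """True if the label partition `blocks` yields an agreement forest of t1 and t2."""
    labels = {v for v, ns in t1.items() if len(ns) <= 1}  # leaves have degree < 2
    blocks = [frozenset(b) for b in blocks]
    if sorted(x for b in blocks for x in b) != sorted(labels):
        return False  # the blocks must partition the label-set
    for t in (t1, t2):
        subs = [spanning_subtree(t, b) for b in blocks]
        if any(s1.keys() & s2.keys() for s1, s2 in combinations(subs, 2)):
            return False  # two blocks would need to share a vertex
    return all(
        splits(spanning_subtree(t1, b), b) == splits(spanning_subtree(t2, b), b)
        for b in blocks
    )


# Hypothetical example: the quartet trees ab|cd and ac|bd.
T1 = {"a": {"x"}, "b": {"x"}, "c": {"y"}, "d": {"y"},
      "x": {"a", "b", "y"}, "y": {"c", "d", "x"}}
T2 = {"a": {"x"}, "c": {"x"}, "b": {"y"}, "d": {"y"},
      "x": {"a", "c", "y"}, "y": {"b", "d", "x"}}
print(is_agreement_forest(T1, T2, [{"a", "b", "c"}, {"d"}]))  # True
print(is_agreement_forest(T1, T2, [{"a", "b"}, {"c", "d"}]))  # False
```

In this example the two trees differ, so no agreement forest of size 1 exists, and the size-2 forest with blocks {a, b, c} and {d} is therefore an MAF in the sense defined above.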

Fig. 1. TBR operation and MAF.

The two versions of the MAF problem studied in this paper are:

para-maf. Given two leaf-labeled trees $T_1$ and $T_2$ over the same label-set $L$, and a parameter $k$, is there an agreement forest of size at most $k$ for $T_1$ and $T_2$?

max-maf. Given two leaf-labeled trees $T_1$ and $T_2$ over the same label-set $L$, construct an MAF for $T_1$ and $T_2$.

An unlabeled vertex in a leaf-labeled forest $F$ may have degree 2. Moreover, our algorithms for the para-maf and max-maf problems will delete edges, which may cause an unlabeled vertex to have degree even less than 2. Contraction is an operation on an unlabeled vertex $v$ of degree less than 3, defined as follows: (1) if $v$ has degree 2, then the contraction replaces $v$ and its two incident edges with a new edge connecting the two neighbors of $v$; and (2) if $v$ has degree less than 2, then the contraction simply removes the vertex (and the incident edge if there is one). In particular, contraction enables us to keep the leaves of our forests always labeled.
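The contraction operation just defined can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the forest is stored as an adjacency dictionary, the set `labeled` holds the labeled leaves, and the helper names delete_edge and contract_all are hypothetical. The sketch also applies the operation repeatedly until no unlabeled vertex of degree less than 3 remains, which is a design choice of this example.

```python
def delete_edge(adj, u, v):
    """Remove the edge (u, v) from the forest `adj` in place."""
    adj[u].discard(v)
    adj[v].discard(u)


def contract_all(adj, labeled):
    """Contract unlabeled vertices of degree < 3 until none remain (in place)."""
    pending = [v for v in adj if v not in labeled]
    while pending:
        v = pending.pop()
        if v not in adj or len(adj[v]) >= 3:
            continue  # already removed, or nothing to contract
        nbrs = list(adj.pop(v))
        for u in nbrs:
            adj[u].discard(v)
        if len(nbrs) == 2:
            # Case (1): replace v and its two edges by one edge between its neighbors.
            a, b = nbrs
            adj[a].add(b)
            adj[b].add(a)
        else:
            # Case (2): v had degree < 2 and is simply removed; an unlabeled
            # neighbor may now have degree < 3, so it is examined again.
            pending.extend(u for u in nbrs if u not in labeled)
    return adj


# Hypothetical example: cut the middle edge of the quartet tree ab|cd.
forest = {"a": {"x"}, "b": {"x"}, "c": {"y"}, "d": {"y"},
          "x": {"a", "b", "y"}, "y": {"c", "d", "x"}}
delete_edge(forest, "x", "y")
contract_all(forest, labeled={"a", "b", "c", "d"})
print(forest)
```

After the cut, both x and y have degree 2, so each is replaced by a single edge, and every remaining vertex is a labeled leaf.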

2.1. MAF and TBR distance

Allen and Steel [1] proved that the TBR distance between two leaf-labeled unrooted binary trees is equal to the size of their MAF minus 1. The proof can be modified to give the same result for leaf-labeled unrooted general trees. For completeness, we present details for this proof on general trees.

Recall that a TBR operation on a leaf-labeled binary tree $T$ removes an edge in $T$ and then adds a new edge between the midpoints of two edges, one in each of the two resulting subtrees [1]. We first present a natural extension of this definition to leaf-labeled general trees.

Definition. A tree-bisection-and-reconnection (TBR) operation on a leaf-labeled tree $T$ is to remove an edge $e_1$ in $T$, resulting in two disjoint trees $T_1$ and $T_2$, then reconnect $T_1$ and $T_2$ by a new edge $e_2$, where each end of the edge $e_2$ can be on either a non-leaf vertex or the midpoint of an edge in $T_1$ and $T_2$.

Fig. 1 gives an intuitive illustration of how a TBR operation transforms a leaf-labeled tree $T$ into another leaf-labeled tree $T'$, assuming $T$ has no degree-2 vertices. Starting with the leaf-labeled tree $T$ as shown on the left side of Fig. 1, the TBR operation first removes an edge $e_1$ in $T$, giving a forest $F$ of two trees, as shown in the middle of Fig. 1. Note that if removing edge $e_1$ results in degree-2 vertices, then the degree-2 vertices are contracted. The TBR operation then inserts a new edge $e_2$ between the two trees in $F$, resulting in the tree $T'$. Note that each of the two ends of the edge $e_2$ can be either a non-leaf vertex in $F$ or the midpoint of an edge in $F$. In the latter case, the midpoint of the edge becomes a new vertex in the tree $T'$, as shown on the right side of Fig. 1. It is obvious that the forest $F$ is a subforest of both $T$ and $T'$. Thus, $F$ is an agreement forest for $T$ and $T'$. Moreover, since $T \neq T'$, $F$ is actually an MAF for $T$ and $T'$.

In the original definition of the TBR operation on binary trees [1], the condition that the ends of the new edge e2 must be the midpoints of edges is to ensure that the resulting tree is a binary tree. For general trees, we relax this condition and allow the ends of e2 to be non-leaf vertices.
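To make the relaxed definition concrete, here is a hedged Python sketch of a single TBR operation on a general tree. The function name tbr, the adjacency-dictionary representation, and the default names "m1" and "m2" for new midpoint vertices are hypothetical. The sketch assumes that the input tree has no degree-2 vertices and that the two ends are named in terms of the forest obtained after the cut and contraction. It checks only that the ends lie in different trees, and leaves it to the caller to ensure that a vertex end is a non-leaf vertex.

```python
def _component(t, start):
    """Vertices reachable from `start` in the forest `t`."""
    seen, stack = {start}, [start]
    while stack:
        for n in t[stack.pop()]:
            if n not in seen:
                seen.add(n)
                stack.append(n)
    return seen


def tbr(adj, cut, end1, end2, new_names=("m1", "m2")):
    """Return the tree obtained from `adj` by one TBR operation.

    cut        : the edge (u, v) to remove (e1 in the definition).
    end1, end2 : ends of the new edge e2; each is either a vertex name or an
                 edge (p, q), meaning the midpoint of that edge.
    """
    t = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    u, v = cut
    if v not in t[u]:
        raise ValueError("cut must be an edge of the tree")
    t[u].discard(v)
    t[v].discard(u)
    # Contract an endpoint of the removed edge if it now has degree 2.
    for w in (u, v):
        if len(t[w]) == 2:
            a, b = t.pop(w)
            t[a].discard(w)
            t[b].discard(w)
            t[a].add(b)
            t[b].add(a)

    def attach(end, name):
        if not isinstance(end, tuple):
            return end  # attach directly to an existing vertex
        p, q = end
        if q not in t[p]:
            raise ValueError("midpoint end must refer to an existing edge")
        # Subdivide the edge (p, q) with a new vertex at its midpoint.
        t[p].discard(q)
        t[q].discard(p)
        t[name] = {p, q}
        t[p].add(name)
        t[q].add(name)
        return name

    x = attach(end1, new_names[0])
    y = attach(end2, new_names[1])
    if y in _component(t, x):
        raise ValueError("the two ends must lie in different trees")
    t[x].add(y)
    t[y].add(x)
    return t


# Hypothetical example on six leaves: cut (y, z), then join the midpoint of
# the edge (x, a) to the non-leaf vertex w.
tree = {"a": {"x"}, "b": {"x"}, "c": {"y"}, "d": {"z"}, "e": {"w"}, "f": {"w"},
        "x": {"a", "b", "y"}, "y": {"x", "c", "z"},
        "z": {"y", "d", "w"}, "w": {"z", "e", "f"}}
print(tbr(tree, ("y", "z"), ("x", "a"), "w"))
```

The example uses one end of each kind allowed for general trees. Attaching at the vertex w gives it degree 4, which is why this form of reconnection is unavailable when the result must stay binary.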

As for binary trees, the TBR distance $d_{\mathrm{tbr}}(T_1, T_2)$ between two leaf-labeled general trees $T_1$ and $T_2$ is defined to be the minimum number of TBR operations that transform $T_1$ into $T_2$. Clearly, $d_{\mathrm{tbr}}(T_1, T_2) = d_{\mathrm{tbr}}(T_2, T_1)$.

Theorem 2.1. For any two leaf-labeled unrooted general trees $T_1$ and $T_2$ over the same label-set, $d_{\mathrm{tbr}}(T_1, T_2) = \mathrm{maf}(T_1, T_2) - 1$, where $\mathrm{maf}(T_1, T_2)$ is the size of an MAF for $T_1$ and $T_2$.

Proof. We prove the theorem by induction on $d_{\mathrm{tbr}}(T_1, T_2)$ and $\mathrm{maf}(T_1, T_2)$.

For $d_{\mathrm{tbr}}(T_1, T_2) = 0$, we have $T_1 = T_2$, and $T_1$ (or $T_2$) itself clearly makes an MAF for $T_1$ and $T_2$. Thus, $\mathrm{maf}(T_1, T_2) = |T_1| = 1$ (viewing $T_1$ as a forest consisting of a single tree), and the theorem holds true. For the case $d_{\mathrm{tbr}}(T_1, T_2) = 1$, $T_1 \neq T_2$, we can remove an edge $e_1$ in $T_1$, resulting in two disjoint trees $T'$ and $T''$, then reconnect $T'$ and $T''$ by a new edge $e_2$ to obtain $T_2$. This implies that $\{T', T''\}$ is an agreement forest for $T_1$ and $T_2$. Since $T_1 \neq T_2$, $\{T', T''\}$ must be an MAF for $T_1$ and $T_2$; thus, $\mathrm{maf}(T_1, T_2) = 2$. The theorem thus again holds true.

Now assume $d_{\mathrm{tbr}}(T_1, T_2) = d > 1$. Then there is a leaf-labeled tree $T_3$ such that $d_{\mathrm{tbr}}(T_1, T_3) = d - 1$ and $d_{\mathrm{tbr}}(T_3, T_2) = 1$. By the inductive hypothesis, $\mathrm{maf}(T_1, T_3) = d$, so $T_1$ and $T_3$ have an MAF $F = \{T'_1, \ldots, T'_d\}$, where $T'_i$...