Plgw03, 17/12/07 1 On the Hardness of Inferring Phylogenies from Triplet- Dissimilarities Ilan Gronau Shlomo Moran Technion – Israel Institute of Technology Haifa, Israel


On the Hardness of Inferring Phylogenies from Triplet-Dissimilarities

Ilan Gronau, Shlomo Moran

Technion – Israel Institute of Technology, Haifa, Israel


Pairwise-Distance Based Reconstruction

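Input sequences:

Butterfly (B)  …AAGT…
Eagle (E)      …CAGA…
Gorilla (G)    …CCGT…
Human (H)      …AACG…
Lion (L)       …AATA…
Mouse (M)      …CGCG…

Step 1 (calculate): pairwise distance matrix D

     B   E   G   H   L   M
B    0  13  17  15  10  12
E        0  14  13  17  11
G            0   2  10   9
H                0  15   8
L                    0   6
M                        0

Step 2 (reconstruct): edge-weighted tree T

[Figure: edge-weighted tree T with leaves B, E, G, H, M, L]

Tree metric DT induced by T:

     B   E   G   H   L   M
B    0  14  15  16  14  13
E        0  15  16  14  13
G            0   3  11  10
H                0  12  11
L                    0   7
M                        0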


Optimization Criteria

We wish the tree metric DT to approximate simultaneously the $\binom{n}{2}$ pairwise distances in D: DT should be "close" to D.

Two "closeness" measures studied here:

• Maximal Difference (l∞):

$$\mathrm{MaxDiff}(D_1, D_2) = \max_{i,j} \left| D_1(i,j) - D_2(i,j) \right|$$

• Maximal Distortion:

$$\mathrm{MaxDist}(D_1, D_2) = \max_{i,j} \frac{D_1(i,j)}{D_2(i,j)} \cdot \max_{i,j} \frac{D_2(i,j)}{D_1(i,j)}$$


Maximal Difference (l∞) vs. Maximal Distortion

For the example matrices D and DT:

$$\mathrm{MaxDiff}(D, D_T) = |10 - 14| = 4$$

$$\mathrm{MaxDist}(D, D_T) = \frac{3}{2} \cdot \frac{17}{14} = 1.821\ldots$$

Goal: find an optimal T, which minimizes the maximal difference/distortion between D and DT.
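As a rough illustration of the two measures, the sketch below evaluates them on three of the fifteen entries of the example matrices D and DT: the pairs (G,H), (B,L) and (E,L). The dictionary layout and the function names `max_diff` and `max_dist` are assumptions made for this sketch, not part of the slides.

```python
# Three entries copied from the example matrices D and D_T. Restricting the
# comparison to these three pairs is an assumption made to keep the sketch short.
D = {("G", "H"): 2, ("B", "L"): 10, ("E", "L"): 17}
D_T = {("G", "H"): 3, ("B", "L"): 14, ("E", "L"): 14}


def max_diff(d1, d2):
    """Largest absolute difference over all pairs (the l-infinity distance)."""
    return max(abs(d1[p] - d2[p]) for p in d1)


def max_dist(d1, d2):
    """Largest ratio d1/d2 multiplied by the largest ratio d2/d1."""
    stretch = max(d1[p] / d2[p] for p in d1)
    shrink = max(d2[p] / d1[p] for p in d1)
    return stretch * shrink


print(max_diff(D, D_T))            # 4
print(round(max_dist(D, D_T), 3))  # 1.821
```

Note that MaxDist equals 1 exactly when one matrix is a constant multiple of the other, whereas MaxDiff is 0 only when the two matrices coincide.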


Previous works on Approximating Dissimilarities by Tree Distances

Negative results (NP-hardness):

• Closest tree metric (even ultrametric) to a dissimilarity matrix under l1 and l2 [Day '87]
• Closest tree metric to a dissimilarity matrix under l∞ [ABFPT99]
  - Hard to approximate better than 1.125
  - Implicit: hard to approximate the closest MaxDist tree within any constant factor

Positive results:

• Closest ultrametric to a dissimilarity matrix under l∞ [Krivanek '88]
• 3-approximation of the closest additive metric to a given metric [ABFPT99] (implicit 6-approximation for general dissimilarity matrices)


This Work: Triplet-Distances – Distances to Triplets Midpoints

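[Figure: a tree T with taxa i, j, k and the center C(i,j,k) of the triplet; τT (i ; jk) is the length of the path from i to C(i,j,k)]

• τT (i ; jk) = τT (i ; kj)
• τT (i ; ij) = 0
• τT (i ; jj) = DT (i, j)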


Triplet-Distances Defined by 2-Distances

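• Each distance matrix D defines $\binom{n}{3}$ 3-trees, with

τ(i ; jk) = ½[D(i,j) + D(i,k) − D(j,k)].

Any metric on 3 taxa is realizable by a 3-tree.

[Figure: a metric on three taxa i, j, k with pairwise distances 7, 8, 9 is realized by a 3-tree with center C(i,j,k) and edge weights 3, 4, 5]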
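The formula above can be sketched in a few lines. The nested-dictionary layout and the assignment of the distances 7, 8, 9 to specific pairs are assumptions made for illustration; the figure only shows the three values.

```python
def tau(D, i, j, k):
    """Distance from i to the center of the 3-tree on {i, j, k}."""
    return (D[i][j] + D[i][k] - D[j][k]) / 2


# Hypothetical labeling: which pair gets 7, 8 or 9 is an assumption.
D = {
    "i": {"i": 0, "j": 7, "k": 8},
    "j": {"i": 7, "j": 0, "k": 9},
    "k": {"i": 8, "j": 9, "k": 0},
}

# Edge weight of each taxon in the 3-tree realizing the metric.
edges = {x: tau(D, x, *[y for y in D if y != x]) for x in D}
print(edges)  # {'i': 3.0, 'j': 4.0, 'k': 5.0}
```

The same function reproduces the degenerate cases listed earlier: `tau(D, "i", "i", "j")` is 0 and `tau(D, "i", "j", "j")` is D(i,j).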


Triplet-Distance Based Reconstruction

[Figure: from the sequences of B, E, G, H, L, M (…AAGT…, …CAGA…, …CCGT…, …AACG…, …AATA…, …CGCG…), a table τ of triplet dissimilarities is calculated, with one row per taxon and one column per taxon pair BB, BE, BG, …, LL, LM, MM; an edge-weighted tree T with its own triplet-distance table τT is then reconstructed. The pairwise matrix D is shown alongside, together with the formula relating it to triplet distances.]

τ(i ; jk) = ½[D(i,j) + D(i,k) − D(j,k)].


Why use Triplet-Distances?

1. They enable more accurate estimations of 2-distances.

2. They are used (de facto) by known reconstruction algorithms.


Improved Estimations of Pairwise Distances:

Calculate D(H,E) from the two sequences alone (maximum likelihood):

Human …AACG…
Eagle …CAGA…

This gives the entry D(H,E) = 13 of the matrix D.

"Information loss": in calculating D(H,E), all other taxa are ignored.


Improved Estimations (cont):

Estimate D(H,E) by calculating all the 3-trees on {H, E, X}, X ≠ H, E.

(Or: calculate just one 3-tree, for a "trusted" 3rd taxon X:
• V. Ranwez, O. Gascuel, Improvement of distance-based phylogenetic methods by a local maximum likelihood approach using triplets, Mol. Biol. Evol. 19(11), 1952–1963 (2002).)

[Figure: four 3-trees, each joining H = (..AACG..), E = (..CAGA..) and a third taxon at an unknown center sequence (..****..); the third taxa are B = (..AAGT..), M = (..CGCG..), G = (..CCGT..) and L = (..AATA..). Edge weights shown in the four panels: 3, 2; 3, 3; 1, 5; 2, 4.]


(Implicit) use of Triplet-Distances in 2-Distance Reconstruction Algorithms

[Figure: the pairwise matrix D is converted into a table τ of triplet distances (rows B, E, G, H, L, M; columns BB, BE, BG, …, LL, LM, MM), from which the tree T is reconstructed]

τ(i ; jk) = ½[D(i,j) + D(i,k) − D(j,k)].


1st use: "Triplet Distances from a Single Source"

Fix a taxon r, and construct a tree T which minimizes:

$$\max \{\, |\tau_T(r ; ij) - \tau(r ; ij)| : i, j \neq r \,\}$$

[Figure: taxa i and j and the fixed source taxon r]

An optimal solution can be computed in O(n²) time, and is used e.g. in:

• (FKW95): Optimal approximation of distances by ultrametric trees.
• (ABFPT99): The best known approximation of distances by general trees.
• (BB99): Fast construction of Buneman trees.


2nd use: Saitou & Nei Neighbour Joining

The neighbor-selection criterion of NJ selects a taxon pair i, j which maximizes the sum:

$$Q(i,j) = D(i,j) + \sum_{r \neq i,j} \tau(r ; ij)$$

[Figure: taxa i and j at distance D(i,j), with every other taxon r contributing τ(r ; ij)]
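To see why the triplet form of the criterion matches the usual pairwise form of NJ, the sketch below checks numerically, on the example matrix D, that R(i) + R(j) − (n−2)·D(i,j) equals twice D(i,j) + Σ τ(r ; ij), where R(x) is the sum of row x of D and τ is induced by D through τ(r ; ij) = ½[D(r,i)+D(r,j)−D(i,j)]. The function names are assumptions made for this sketch.

```python
from itertools import combinations

TAXA = "BEGHLM"
# Upper triangle of the example matrix D.
UPPER = {
    ("B", "E"): 13, ("B", "G"): 17, ("B", "H"): 15, ("B", "L"): 10, ("B", "M"): 12,
    ("E", "G"): 14, ("E", "H"): 13, ("E", "L"): 17, ("E", "M"): 11,
    ("G", "H"): 2, ("G", "L"): 10, ("G", "M"): 9,
    ("H", "L"): 15, ("H", "M"): 8,
    ("L", "M"): 6,
}


def dist(i, j):
    """Symmetric lookup into the upper-triangular table."""
    if i == j:
        return 0
    return UPPER[(i, j)] if (i, j) in UPPER else UPPER[(j, i)]


def tau(r, i, j):
    """Triplet distance induced by D: from r to the center of {r, i, j}."""
    return (dist(r, i) + dist(r, j) - dist(i, j)) / 2


def q_triplet(i, j):
    """Selection value written with triplet distances."""
    return dist(i, j) + sum(tau(r, i, j) for r in TAXA if r not in (i, j))


def q_pairwise(i, j):
    """Usual NJ selection value, written as a quantity to maximize."""
    def row_sum(x):
        return sum(dist(x, r) for r in TAXA)
    return row_sum(i) + row_sum(j) - (len(TAXA) - 2) * dist(i, j)


pairs = list(combinations(TAXA, 2))
print(all(q_pairwise(i, j) == 2 * q_triplet(i, j) for i, j in pairs))  # True
print(max(pairs, key=lambda p: q_triplet(*p)))  # pair selected by NJ
```

Since the two values differ only by a factor of 2, both forms select the same pair.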


Previous Works on Triplet-Dissimilarities/Distances

• I. Gronau, S. Moran, Neighbor Joining Algorithms for Inferring Phylogenies via LCA-Distances, Journal of Computational Biology 14(1), pp. 1-15 (2007).

Works which use the total weights of 3-trees:

• S. Joly, G. Le Calvé, Three-Way Distances, Journal of Classification 12, pp. 191-205 (1995).
• L. Pachter, D. Speyer, Reconstructing Trees from Subtree Weights, Applied Mathematics Letters 17, pp. 615-621 (2004).
• D. Levy, R. Yoshida, L. Pachter, Beyond pairwise distances: Neighbor-joining with phylogenetic diversity estimates, Mol. Biol. Evol. 23(3), 491–498 (2006).


Summary of Results

Results for Maximal Difference (l∞):

1. The decision problem is NP-hard:
   Is there a tree T s.t. ||τ,τT ||∞ ≤ Δ ?

2. Hardness of approximation of the optimization problem: it is NP-hard to find a tree T s.t. ||τ,τT ||∞ ≤ 1.4 ||τ,τOPT||∞.

3. A 15-approximation algorithm, using the 6-approximation algorithm for 2-dissimilarities from [ABFPT99].

Result for Maximal Distortion:

• Hardness of approximation within any constant factor.


NP Hardness of the Decision Problem

We use a reduction from 3SAT (the problem of determining whether a 3CNF formula is satisfiable).

[Example: a 3CNF formula of four clauses, on the variable triples (x1, x2, x3), (x1, x2, x4), (x1, x2, x4), (x1, x3, x4), with one clause and its literals marked]

Satisfying assignment: x1 = T, x2 = F, x3 = F, x4 = F

We show: if one can determine for (τ,Δ) whether there exists a tree T s.t. ||τ,τT ||∞ ≤ Δ, then one can determine for every 3CNF formula φ whether it is satisfiable.


The Reduction

Given a 3CNF formula φ we define triplet distances and an error bound Δ which force the output tree to imply a satisfying assignment to φ.

The set of taxa:

• Taxa T , F.
• A taxon for every literal (x_i , ¬x_i).
• 3 taxa for every clause Cj (y_j^1 , y_j^2 , y_j^3).
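A minimal sketch of the taxon set used by the reduction: two taxa T and F, two taxa per variable and three per clause, i.e. 2 + 2n + 3m taxa for n variables and m clauses. The function name and the string labels (`~x1` for a negated literal, `y2^3` for the third taxon of clause 2) are naming assumptions made here.

```python
def reduction_taxa(num_vars, num_clauses):
    """List the taxa of the reduction for a 3CNF formula."""
    taxa = ["T", "F"]
    for i in range(1, num_vars + 1):
        taxa += [f"x{i}", f"~x{i}"]                   # one taxon per literal
    for j in range(1, num_clauses + 1):
        taxa += [f"y{j}^{k}" for k in (1, 2, 3)]      # three taxa per clause
    return taxa


# A formula with 4 variables and 4 clauses yields 2 + 8 + 12 taxa.
taxa = reduction_taxa(4, 4)
print(len(taxa))  # 22
```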


Properties Enforced by the Input (τ,Δ)

One of the following can be enforced on each taxa triplet (u,v,w):

1. taxon u is close to Path(v,w), or
2. taxon u is far from Path(v,w)

[Figure: taxon u and the path between v and w]


Enforcing Truth Assignment

A truth assignment to φ is implied by the following:

1. T is far from F
2. For each i, x_i is far from ¬x_i, and both x_i and ¬x_i are close to Path(T ,F)

[Figure: taxa T and F, with the literal taxa x_i and ¬x_i]

Thus we set x_i = T iff x_i is close to T.


Enforcing Clauses-Satisfaction

A clause C = (l^1 ∨ l^2 ∨ l^3) is satisfied iff at least one literal l^i is true, i.e. is close to T.

[Figure: the taxon F together with the three literal taxa l^1, l^2, l^3]

(l^1 ∨ l^2 ∨ l^3) is satisfied iff it is not like this.

We need to guarantee that all clauses avoid the above by the close/far relations.


Clauses-Satisfaction (cont)

(l^1 ∨ l^2 ∨ l^3) is satisfied iff out of the three paths

Path(l^1 , l^2), Path(l^1 , l^3), Path(l^2 , l^3),

at least two paths are close to T.

[Figure: taxa T and F with the literal taxa l^1, l^2, l^3]

But we don't know which two paths.


Clauses-Satisfaction (cont)

We attach a taxon to each such path:

y^1 is close to Path(l^2 , l^3)
y^2 is close to Path(l^1 , l^3)
y^3 is close to Path(l^1 , l^2)

(l^1 ∨ l^2 ∨ l^3) is satisfied iff at least two y^i's can be located close to T. …

[Figure: taxa T and F, the literal taxa l^1, l^2, l^3, and the attached taxa y^1, y^2, y^3]


Clauses-Satisfaction (end)

… and at least two of the y^i's can be located close to T iff

Path(y^2 , y^3), Path(y^1 , y^3), Path(y^1 , y^2) are close to T.

So, (l^1 ∨ l^2 ∨ l^3) is satisfied iff all the above paths are close to T.

[Figure: taxa T and F, the literal taxa l^1, l^2, l^3, and the attached taxa y^1, y^2, y^3]


Construction Example

[Example: the four-clause 3CNF formula over x1, …, x4, with the satisfying assignment x1 = T, x2 = F, x3 = F, x4 = F]

[Figure: the constructed tree, with taxa T and F, internal vertices vT and vF, edge weights α and 2β, the literal taxa x1, ¬x1, …, x4, ¬x4 and the clause taxa y_1^1, y_1^2, y_1^3, y_2^1, y_2^2, y_2^3]

φ is satisfiable ⟺ there is a tree T which satisfies all bounds:

A1  τT (T , F ) ≥ 2α+2β

A2  for i=1..n : τT (T ; x_i ¬x_i ) ≤ α ; τT (F ; x_i ¬x_i ) ≤ α

B1  for j=1..m : τT (y_j^1 ; l_j^2 l_j^3 ) ≤ α ; τT (y_j^2 ; l_j^1 l_j^3 ) ≤ α ; τT (y_j^3 ; l_j^1 l_j^2 ) ≤ α

B2  for j=1..m : τT (y_j^1 ; T F ) ≥ α ; τT (y_j^2 ; T F ) ≥ α ; τT (y_j^3 ; T F ) ≥ α

B3  for j=1..m : τT (T ; y_j^2 y_j^3 ) ≤ α ; τT (T ; y_j^1 y_j^3 ) ≤ α ; τT (T ; y_j^1 y_j^2 ) ≤ α


Hardness of Approximation Results

By "stretching" the close/far restrictions, the following problems are also shown NP-hard:

Approximating Maximal Difference:
• Finding a tree T s.t. ||τ,τT ||∞ ≤ 1.4 ||τ,τOPT||∞

Approximating Maximal Distortion:
• Finding a tree T s.t. MaxDist(τ,τT ) ≤ C · MaxDist(τ,τOPT), for any constant C

Details in: I. Gronau and S. Moran, On the Hardness of Inferring Phylogenies from Triplet-Dissimilarities, Theoretical Computer Science 389(1-2), December 2007, pp. 44-55.


Open Problems/Further Research

• Extending the hardness results to 3-diss tables induced by 2-diss matrices (τ(i ; jk) = ½[D(i,j)+D(i,k)−D(j,k)]).

• Extending the hardness results to "naturally looking" trees (binary trees with constant-bounded edge weights).

• Checking the performance of NJ when the neighbor-selection formula is computed from "real" 3-distances.

• Devising algorithms which use 3-distances as input.

• Does optimization of 3-diss lead to good topological accuracy (under accepted models of sequence evolution)? (It is known that optimization of 2-diss doesn't lead to good topological accuracy.)


Thank You


Distance-Based Phylogenetic Reconstruction

• Compute distances between all taxon pairs

• Find an edge-weighted tree best describing the distances

D =

 0
 3   0
 9   8   0
15  14  18   0
17  16  20  22   0
16  15  19  21   9   0

[Figure: edge-weighted tree realizing D, with edge weights 4, 5, 7, 2, 1, 2, 10, 6, 1]


The Reduction – τ(φ)

Input dissimilarities τ(φ), with Δ = β:

A1  τ(T , F ) = 2α+3β

A2  for i=1..n : τ(T ; x_i ¬x_i ) = α-β ; τ(F ; x_i ¬x_i ) = α-β

B1  for j=1..m : τ(y_j^1 ; l_j^2 l_j^3 ) = α-β ; τ(y_j^2 ; l_j^1 l_j^3 ) = α-β ; τ(y_j^3 ; l_j^1 l_j^2 ) = α-β

B2  for j=1..m : τ(y_j^1 ; T F ) = α+β ; τ(y_j^2 ; T F ) = α+β ; τ(y_j^3 ; T F ) = α+β

B3  for j=1..m : τ(T ; y_j^2 y_j^3 ) = α-β ; τ(T ; y_j^1 y_j^3 ) = α-β ; τ(T ; y_j^1 y_j^2 ) = α-β

Other 2-distances: τ(s , t ) = 2α+2β

Other 3-distances: τ(s ; t u ) = α+2β

Bounds on the output tree T:

A1  τT (T , F ) ≥ 2α+2β

A2  for i=1..n : τT (T ; x_i ¬x_i ) ≤ α ; τT (F ; x_i ¬x_i ) ≤ α

B1  for j=1..m : τT (y_j^1 ; l_j^2 l_j^3 ) ≤ α ; τT (y_j^2 ; l_j^1 l_j^3 ) ≤ α ; τT (y_j^3 ; l_j^1 l_j^2 ) ≤ α

B2  for j=1..m : τT (y_j^1 ; T F ) ≥ α ; τT (y_j^2 ; T F ) ≥ α ; τT (y_j^3 ; T F ) ≥ α

B3  for j=1..m : τT (T ; y_j^2 y_j^3 ) ≤ α ; τT (T ; y_j^1 y_j^3 ) ≤ α ; τT (T ; y_j^1 y_j^2 ) ≤ α

[Figure: the constructed tree, with taxa T and F, internal vertices vT and vF, edge weights α and 2β, the literal taxa x1, ¬x1, …, x4, ¬x4 and the clause taxa y_1^1, y_1^2, y_1^3, y_2^1, y_2^2, y_2^3]

In our constructed tree:
• All 2-distances are in [2α , 2α+2β].
• All 3-distances are in [α , α+2β].
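The table τ(φ) can be sketched as two dictionaries holding the explicitly constrained entries (A1 to B3), with lookups that fall back to the "other" values. This is an illustrative sketch only: the signed-integer clause encoding, the taxon labels and all function names are assumptions made here, degenerate entries with repeated taxa are not handled, and each clause is assumed to use three distinct variables.

```python
def lit(v):
    """Name of a literal taxon: 2 -> 'x2', -2 -> '~x2' (signed-integer encoding)."""
    return f"x{v}" if v > 0 else f"~x{-v}"


def build_tau(num_vars, clauses, alpha, beta):
    """Return (pairs, triples): the explicitly constrained entries of tau(phi)."""
    pairs = {frozenset(("T", "F")): 2 * alpha + 3 * beta}        # A1
    triples = {}

    def put(s, t, u, value):
        # tau(s ; t u) is symmetric in t and u, hence the frozenset.
        triples[(s, frozenset((t, u)))] = value

    for i in range(1, num_vars + 1):                              # A2
        put("T", lit(i), lit(-i), alpha - beta)
        put("F", lit(i), lit(-i), alpha - beta)
    for j, clause in enumerate(clauses, start=1):
        lits = [lit(v) for v in clause]
        ys = [f"y{j}^{k}" for k in (1, 2, 3)]
        for k in range(3):
            a, b = [m for m in range(3) if m != k]
            put(ys[k], lits[a], lits[b], alpha - beta)            # B1
            put(ys[k], "T", "F", alpha + beta)                    # B2
            put("T", ys[a], ys[b], alpha - beta)                  # B3
    return pairs, triples


def tau2(pairs, s, t, alpha, beta):
    """2-dissimilarity, defaulting to the 'other 2-distances' value."""
    return pairs.get(frozenset((s, t)), 2 * alpha + 2 * beta)


def tau3(triples, s, t, u, alpha, beta):
    """3-dissimilarity, defaulting to the 'other 3-distances' value."""
    return triples.get((s, frozenset((t, u))), alpha + 2 * beta)


# Hypothetical two-clause formula: (x1 v ~x2 v x3) ^ (~x1 v x2 v x4).
pairs, triples = build_tau(4, [(1, -2, 3), (-1, 2, 4)], alpha=10, beta=1)
print(len(triples))                            # 26
print(tau3(triples, "y1^1", "F", "T", 10, 1))  # 11
print(tau3(triples, "x1", "x2", "x3", 10, 1))  # 12
```

With n variables and m clauses, 2n + 9m triplet entries and one pairwise entry are set explicitly; every other entry takes its default value.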