Phylogenetic networks: recent questions and results (or: constructing a level-2 phylogenetic network from a dense set of input triplets in polynomial time)

Leo van Iersel (1), Judith Keijsper (1), Steven Kelk (2), Leen Stougie (1, 2)

(1) Technische Universiteit Eindhoven (TU/e)
(2) Centrum voor Wiskunde en Informatica (CWI), Amsterdam

Email: [email protected]
Web: http://homepages.cwi.nl/~kelk


Part 1: Context

Phylogenetic tree reconstruction

[Figure: an example phylogenetic tree on the leaves Orangutan, Gorilla, Chimpanzee and Human. This tree is borrowed from a presentation by Tandy Warnow.]

Phylogenetic tree reconstruction is essentially the science of efficiently inferring and constructing plausible evolutionary trees when we have only limited input data about the ‘species’ concerned.

It sits at the intersection of biology, bioinformatics, computer science and mathematics.

Dominant methods in phylogenetic reconstruction

• Character-based methods: Maximum Parsimony (= Minimum Steiner Tree), Maximum Likelihood, Bayesian methods (Markov Chain Monte Carlo)

• Distance-based methods: Neighbour Joining, UPGMA

• Quartet/triplet-based methods


Triplet-based methods (1)

• Quartet-based methods are used for constructing unrooted evolutionary trees: there is no root (= most distant ancestor) and edges have no direction (e.g. an edge between species X and Y does not say whether X evolved into Y, or vice versa).

• Triplet-based methods are used for constructing rooted evolutionary trees: there is a root and edges are directed.

• The central idea: build a single, ‘big’ evolutionary tree for a set L of species by combining smaller evolutionary trees on subsets of L such that the big tree respects the structure of the smaller trees.

• In triplet-based methods, the small input trees are always defined on size-3 subsets of the species set L (and are called rooted triplets.)

Triplet-based methods (2)

• For example, suppose I want to reconstruct a plausible evolution for the species set {w,x,y,z}.

• I am given a set of rooted triplets zw|x, yx|w, xy|z, wz|y. (Note zw|x = wz|x.)

[Figure: the four input triplets zw|x, xy|z, yx|w and wz|y are fed to the algorithm, which outputs a solution tree with leaves w, z, x, y.]
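Since zw|x and wz|x denote the same rooted triplet, it can help to store triplets in a canonical form. The following is a minimal sketch; the helper name `triplet` and the tuple representation `(a, b, c)` for ab|c are illustrative assumptions, not something taken from the talk.

```python
def triplet(a, b, c):
    """Return a canonical tuple for the rooted triplet ab|c.

    The pair {a, b} is unordered, so it is stored in sorted order;
    c is the leaf that branches off first. (Hypothetical helper.)
    """
    if len({a, b, c}) != 3:
        raise ValueError("a rooted triplet needs three distinct leaves")
    return (min(a, b), max(a, b), c)


# The input set from the example: zw|x, yx|w, xy|z, wz|y.
example = {
    triplet("z", "w", "x"),
    triplet("y", "x", "w"),
    triplet("x", "y", "z"),
    triplet("w", "z", "y"),
}
```

With this convention, `triplet("z", "w", "x")` and `triplet("w", "z", "x")` compare equal, so a Python set of triplets never stores the same triplet twice.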


From trees to networks…

• The algorithm of Aho et al. (1981) can be used to construct trees from rooted triplets.

• But…what if the algorithm fails? Why might the algorithm fail?

• Possible reason 1: The underlying evolution is tree-like, but the input triplets contain errors.

• Possible reason 2: The triplets are correct, but the underlying evolution is not tree-like. Biological phenomena such as hybridization, horizontal gene transfer, recombination and gene duplication can lead to evolutionary scenarios that are not tree-like!

• Response: try to construct not phylogenetic trees, but phylogenetic networks.
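For readers who have not seen the tree-building step, here is a compact sketch in the spirit of the algorithm of Aho et al.: group leaves that some triplet forces together, fail if everything ends up in one group, and otherwise recurse on each group. The function name `build`, the tuple encoding `(a, b, c)` for ab|c and the nested-tuple output are assumptions made for this illustration.

```python
def build(leaves, triplets):
    """Return a nested-tuple tree consistent with all triplets, or None.

    Each triplet (a, b, c) stands for ab|c. A leaf is returned as-is;
    an internal vertex is a tuple of its child subtrees.
    """
    leaves = set(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))
    # Keep only triplets whose three leaves all lie in the current leaf set.
    local = [t for t in triplets if set(t) <= leaves]
    # Union-find: a and b must stay together whenever ab|c is present.
    parent = {x: x for x in leaves}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, _ in local:
        parent[find(a)] = find(b)
    blocks = {}
    for x in leaves:
        blocks.setdefault(find(x), set()).add(x)
    if len(blocks) == 1:
        return None  # the leaves cannot be split: no tree exists
    children = []
    for block in sorted(blocks.values(), key=min):
        subtree = build(block, local)
        if subtree is None:
            return None
        children.append(subtree)
    return tuple(children)


tree = build("wxyz", [("z", "w", "x"), ("y", "x", "w"),
                      ("x", "y", "z"), ("w", "z", "y")])
conflict = build("xyz", [("x", "y", "z"), ("x", "z", "y")])
```

On the four triplets of the earlier example, `tree` is `(('w', 'z'), ('x', 'y'))`. On the conflicting pair xy|z, xz|y the function returns `None`, which is the failure case that motivates networks.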

From trees to networks (2)

• For example, suppose the input is {xy|z, xz|y}.

[Figure: the two input triplets xy|z and xz|y, followed by a network on the leaves z, x and y.]

(Note that there are cases where, even if there is at most one triplet per 3 species, a tree is not possible.)


Level-k phylogenetic networks

[Figure: a network on the leaves z, x and y, with labels marking the root (only one!), a leaf-vertex, a split-vertex and a recombination-vertex.]

A level-k phylogenetic network is a rooted, directed acyclic graph where every biconnected component (in the underlying undirected graph) contains at most k recombination vertices.

The network shown here is a very simple example of a level-1 network.

In a level-1 network, the ‘cycles’ are vertex-disjoint, hence the alternative name “galled tree”.
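The definition can be checked mechanically on small examples. The sketch below assumes a network is given as a list of directed edges `(u, v)` with no parallel edges, treats any vertex with indegree at least 2 as a recombination vertex, and finds biconnected components of the underlying undirected graph with a standard depth-first search. The function name `network_level` and the example vertex names are hypothetical.

```python
from collections import defaultdict


def network_level(edges):
    """Largest number of recombination vertices in one biconnected component."""
    adj = defaultdict(set)
    indeg = defaultdict(int)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
        indeg[v] += 1
    recomb = {v for v, d in indeg.items() if d >= 2}

    disc, low = {}, {}
    stack = []        # undirected edges of components still being explored
    components = []   # vertex sets of the finished biconnected components

    def dfs(u, parent):
        disc[u] = low[u] = len(disc)
        for v in adj[u]:
            if v == parent:
                continue
            if v not in disc:
                stack.append((u, v))
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] >= disc[u]:
                    # u separates the subtree of v: pop one component.
                    comp = set()
                    while True:
                        a, b = stack.pop()
                        comp.update((a, b))
                        if (a, b) == (u, v):
                            break
                    components.append(comp)
            elif disc[v] < disc[u]:
                stack.append((u, v))  # back edge to an ancestor
                low[u] = min(low[u], disc[v])

    for start in list(adj):
        if start not in disc:
            dfs(start, None)
    return max((len(comp & recomb) for comp in components), default=0)


# A small network on leaves z, x, y with one recombination vertex h.
level1 = [("r", "s1"), ("r", "s2"), ("s1", "z"), ("s1", "h"),
          ("s2", "h"), ("s2", "y"), ("h", "x")]
```

The recursion is fine for slide-sized examples; a larger input would call for an iterative search or a graph library.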


What Jansson & Sung (& Nguyen) did…

• A set of input triplets is dense iff, for every subset of 3 species, there is at least one triplet corresponding to those 3 species.

• A dense set of input triplets for n species thus contains O(n^3) triplets.

• Jansson & Sung (2006) showed the following: given a dense set of triplets T for a set L of species, it is possible to determine in polynomial time whether a level-1 phylogenetic network N exists such that all the triplets in T are consistent with N (and, if so, to construct such a network).

• They later showed, together with Nguyen, how to do this in time linear in |T|. They also showed that, in the non-dense case, the problem is NP-hard.

• But what about level-2 networks, and higher?
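The density condition is easy to test directly. In this sketch a triplet ab|c is the tuple `(a, b, c)`; the names `is_dense` and `size_bounds` are illustrative. The size bounds use the fact that each set of 3 species admits exactly three rooted triplets, so a dense set on n leaves has between C(n,3) and 3·C(n,3) triplets.

```python
from itertools import combinations
from math import comb


def is_dense(leaves, triplets):
    """True iff every 3-subset of the leaves is covered by some triplet."""
    covered = {frozenset(t) for t in triplets}
    return all(frozenset(s) in covered for s in combinations(leaves, 3))


def size_bounds(n):
    """Smallest and largest possible size of a dense triplet set on n leaves."""
    return comb(n, 3), 3 * comb(n, 3)


example = [("z", "w", "x"), ("y", "x", "w"), ("x", "y", "z"), ("w", "z", "y")]
```

Here `is_dense("wxyz", example)` holds, because the four triplets cover all four 3-subsets of {w,x,y,z}; dropping any one of them breaks density.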

[Figure: an example of a level-2 network.]

Main result: Given a dense set of triplets T for a set L of species, it is possible to determine in time O(|T|^3) whether a level-2 phylogenetic network N exists such that all the triplets in T are consistent with N. (And if so, to construct such a network.)

Part 2: The algorithm


Algorithm, high-level idea

• The algorithm is conceptually (fairly) simple, but the proof of correctness and the technical details are rather complex.

• The high-level idea is as follows:

1. PARTITION the set of leaves (i.e. species) into a ‘correct’ partition P;

2. INDUCE a new set of triplets T’ where every block of the partition P becomes a single leaf (a kind of ‘meta-leaf’ if you like)

3. SOLVE a simpler version of the problem for T’ to get a network N’

4. RECURSE inside each leaf of N’

• Step 3 is the critical part of the algorithm. It brings together two issues:

(a) why is it sufficient to only solve a simpler version of the problem?

(b) how do we solve this simpler version of the problem?

Definition: inducing new triplet sets from partitions of the leaf set

• Suppose I have a partition P = {P1, P2, …, Pt} of the leaf set L.

• Suppose I have a dense set of triplets T on the leaf set L.

• Let T’ be a new triplet set on leaf set {q1, q2, …, qt} defined as follows:

• qiqj|qk is in T’ if and only if i, j and k are pairwise distinct and there exists a triplet xy|z in T such that x is in Pi, y is in Pj and z is in Pk.

• Then we say that T’ is the triplet set induced by the partition P of L.

• Critically: if T is dense, then T’ is also dense.

• In some sense this can be perceived as a ‘coarsening’ of the input set.
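The induced set can be computed in one pass over T. A minimal sketch, assuming triplets are tuples `(x, y, z)` for xy|z and that meta-leaf qi is represented simply by the index i of its block (both are choices made for this illustration):

```python
def induce(triplets, partition):
    """Return the triplet set induced by a partition of the leaf set.

    `partition` is a list of disjoint leaf sets; block i becomes meta-leaf i.
    """
    block_of = {leaf: i for i, block in enumerate(partition) for leaf in block}
    induced = set()
    for x, y, z in triplets:
        i, j, k = block_of[x], block_of[y], block_of[z]
        if len({i, j, k}) == 3:  # the three leaves lie in three different blocks
            # Store the unordered pair in sorted order so ij|k equals ji|k.
            induced.add((min(i, j), max(i, j), k))
    return induced


T = [("w", "z", "x"), ("x", "y", "w"), ("x", "y", "z"), ("w", "z", "y")]
T_induced = induce(T, [{"w", "z"}, {"x"}, {"y"}])
```

With the blocks {w,z}, {x}, {y}, the triplets xy|w and xy|z both collapse to the single induced triplet `(1, 2, 0)`, while wz|x and wz|y disappear because w and z share a block.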

Definition: simple level-2 networks

[Figure: the basic structures from which simple level-2 networks are built.]

A simple level-2 network is any network obtained by “hanging leaves” off one of the above structures.

Simple level-2 networks capture in some sense the essence of the complexity of general level-2 networks.

An example of a simple level-2 network

Here the leaves {a,b,c,d,e,f,g,h} have been ‘hung’ from structure 8a, to yield a simple level-2 network.

Definition: SN-set

• Jansson & Sung introduced the idea of the SN-set.

• SN-sets are special subsets of the leaves L, and are defined with respect to triplet sets.

• All sets containing just a single leaf are SN-sets.

• More generally, an SN-set is any subset of leaves obtained by taking the closure of the following operation on some subset S of the leaves L:

[Figure: some subset S of the leaves, with leaves x, y and z.]

In other words, if there is some pair of leaves x, z in the set S such that xy|z is a triplet and y is not in the set S, add y to S, and repeat until no more leaves can be added. An SN-set is any set that can be constructed this way.
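The closure operation translates almost word for word into a loop. In this sketch a triplet ab|c is the tuple `(a, b, c)`; because ab|c = ba|c, the rule fires whenever c and exactly one of a, b are already in the set. The function name `sn_closure` is an assumption for illustration.

```python
def sn_closure(seed, triplets):
    """Close `seed` under: x, z in S and xy|z in T implies y joins S."""
    closed = set(seed)
    changed = True
    while changed:
        changed = False
        for a, b, c in triplets:  # (a, b, c) stands for ab|c
            # c is in S and exactly one of a, b is in S: add the other one.
            if c in closed and (a in closed) != (b in closed):
                closed.update((a, b))
                changed = True
    return closed


T = [("z", "w", "x"), ("y", "x", "w"), ("x", "y", "z"), ("w", "z", "y")]
```

For the four-triplet example, starting from {w,z} adds nothing, so {w,z} is an SN-set, whereas starting from {w,x} pulls in z (via zw|x) and then y (via yx|w), giving the whole leaf set.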

Definition: maximal SN-set

• The SN-set that is equal to the total leaf set L is called the trivial SN-set.

• An SN-set that is non-trivial, and is not a strict subset of any other non-trivial SN-set, is called a maximal SN-set.

• Jansson and Sung proved that the maximal SN-sets partition the leaf set L. So no two maximal SN-sets overlap, and they completely cover the set of input leaves.

• All the SN-sets, and all the maximal SN-sets, can be found in polynomial time.

• Jansson & Sung solved the level-1 problem by observing that they could treat the maximal SN-sets like ‘meta-leaves’, thus reducing the problem to recursively solving the problem on the triplets induced by the maximal SN-sets.

• Our idea is similar, but SN-sets in level-2 networks are (unfortunately) rather more complex creatures than in level-1 networks.
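A brute-force way to list maximal SN-sets on a small input is sketched below. It only tries seeds of size one and two; treating these seeds as sufficient for a dense input is an assumption of this sketch, not a claim made on the slides. The names `sn_closure` and `maximal_sn_sets` are hypothetical, and a triplet ab|c is the tuple `(a, b, c)`.

```python
from itertools import combinations


def sn_closure(seed, triplets):
    """Close `seed` under: x, z in S and xy|z in T implies y joins S."""
    closed = set(seed)
    changed = True
    while changed:
        changed = False
        for a, b, c in triplets:
            if c in closed and (a in closed) != (b in closed):
                closed.update((a, b))
                changed = True
    return closed


def maximal_sn_sets(leaves, triplets):
    """Non-trivial SN-sets (from 1- and 2-leaf seeds) that are inclusion-maximal."""
    leaves = set(leaves)
    candidates = {frozenset([x]) for x in leaves}
    for x, y in combinations(sorted(leaves), 2):
        candidates.add(frozenset(sn_closure({x, y}, triplets)))
    nontrivial = {s for s in candidates if s != leaves}
    # Keep a set only if no other non-trivial candidate strictly contains it.
    return {s for s in nontrivial if not any(s < t for t in nontrivial)}


T = [("z", "w", "x"), ("y", "x", "w"), ("x", "y", "z"), ("w", "z", "y")]
result = maximal_sn_sets("wxyz", T)
```

On the running example, `result` consists of {w,z} and {x,y}, which do partition the leaf set.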

Definition: (highest) cut-edges

• In a phylogenetic network N, a cut-edge (x,y) is an edge whose removal disconnects the (underlying) graph.

• A cut-edge (x,y) is said to be a trivial cut-edge iff y is a leaf.

• A cut-edge (x,y) is said to be highest iff there is no cut-edge (p,q) such that there is a directed path from q to x in N.
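Both definitions can be tested naively on a small network: delete each edge in turn and check connectivity, then discard any cut-edge that lies below another one. The sketch assumes the network is a list of directed edges `(u, v)`, and it counts a path of length zero (q = x) as a directed path; the function names and the example network are illustrative.

```python
def cut_edges(edges):
    """Edges whose removal disconnects the underlying undirected graph."""
    vertices = {v for e in edges for v in e}

    def connected(remaining):
        adj = {v: set() for v in vertices}
        for u, v in remaining:
            adj[u].add(v)
            adj[v].add(u)
        start = next(iter(vertices))
        seen, todo = {start}, [start]
        while todo:
            for w in adj[todo.pop()] - seen:
                seen.add(w)
                todo.append(w)
        return seen == vertices

    return [e for e in edges if not connected([f for f in edges if f != e])]


def highest_cut_edges(edges):
    """Cut-edges (x, y) with no cut-edge (p, q) such that q reaches x."""
    cuts = cut_edges(edges)
    children = {}
    for u, v in edges:
        children.setdefault(u, set()).add(v)

    def reachable(src):
        seen, todo = {src}, [src]
        while todo:
            for w in children.get(todo.pop(), set()) - seen:
                seen.add(w)
                todo.append(w)
        return seen

    return [(x, y) for (x, y) in cuts
            if not any(x in reachable(q) for (_, q) in cuts)]


# One recombination vertex h; a two-leaf subtree hangs below the edge (s1, t).
net = [("r", "s1"), ("r", "s2"), ("s1", "t"), ("s1", "h"), ("s2", "h"),
       ("s2", "y"), ("h", "x"), ("t", "z1"), ("t", "z2")]
```

In `net`, the edges (t,z1) and (t,z2) are cut-edges but not highest, because they sit below the cut-edge (s1,t).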

• Fact. Let (x,y) be a highest cut-edge and let L’ be the set of leaves reachable from y. Let L* be a strict subset of L’. Then L* is not a maximal SN-set.

• Proof: the set of leaves reachable from a highest cut-edge (x,y) is itself an SN-set. Why? Because it is not possible for there to be leaves p, q in L’ and r outside L’ such that pr|q is in the set of triplets: the edge (x,y) forms a bottleneck and would have to be used twice.

[Figure: a cut-edge (x,y) with the leaf set L’ below it, the leaves p, q and r, and the triplet pr|q.]

So: each maximal SN-set can be expressed as the union of the leaves reachable from one or more highest cut-edges.

A first attempt at reducing the problem to simple level-2 networks

• Now, suppose we have a dense set of triplets T and there exists a level-2 network N such that all the triplets in T are consistent with N. (Of course we don’t know what N is yet…)

• Suppose we construct a partition P of L as follows. The blocks of P are the sets of leaves reachable from highest cut-edges in N. (Each maximal SN-set of T thus corresponds to one or more blocks in P.)

• Let T’ be the new set of triplets induced by the partition P. In other words, if we collapse the set of leaves below highest cut-edges into ‘meta-leaves’, T’ is the new set of triplets we get. (Nice property: the maximal SN-sets of T’ are in one-to-one correspondence with the maximal SN-sets of T.)

• Critical fact 1: the only level-2 networks where all cut-edges are trivial are simple level-2 networks.

• Critical fact 2: there exists some simple level-2 network N’ such that the triplets in T’ are consistent with N’. Furthermore, if we find such an N’, and then recursively construct networks within each meta-leaf, we obtain a network consistent with T!

But… that’s a non-deterministic argument

• So, it looks like we can indeed reduce the problem – in some sense – to finding simple level-2 networks.

• But that analysis was based on knowing where the highest cut-edges are in a hypothetical solution N. And we don’t know N…this is precisely what we’re looking for!

• We can, however, compute the maximal SN-sets of the input triplet set T.

• We need to be able to say something more about how maximal SN-sets of T relate to highest cut-edges in hypothetical solutions. Then we can base the recursion on maximal SN-sets, instead of highest cut-edges.

Central Theorem (simplified). Suppose there is a dense triplet set T consistent with some simple level-2 network N. Then there exists a level-2 network N’ (not necessarily simple) such that, with the exception of perhaps one maximal SN-set with respect to T, every maximal SN-set appears below a single cut-edge in N’. The remaining, ‘odd-one-out’ maximal SN-set (if it exists) will be equal to the union of leaves below two cut-edges.

[Figure: a network before and after the transformation.]

Observe how SN-set {C,G,F} has been ‘pushed’ below a single cut-edge.


An existence argument

• If some solution N exists for T, then a simple level-2 solution N’ exists for T’ (induced by the highest cut-edges of N) where the maximal SN-sets of T’ are tightly correlated with the maximal SN-sets of T. Finding N’ gives the starting point for a solution to T.

• But by the Central Theorem, all (except maybe one) of the maximal SN-sets of N’ can be ‘pushed’ below highest cut-edges to give a solution N’’ for T’.

• If we re-expand all the meta-leaves of N’’, we obtain a new solution N* for T. Crucially, all (except maybe one) of the maximal SN-sets of T will be beneath single cut-edges in N*. The odd-one-out will be beneath two cut-edges.

• So if we substitute N* as N in the first step, we come to the following conclusion:

• We can find a solution for T by finding a simple level-2 solution for the set of triplets induced by the maximal SN-sets of T, and recursing. We need to correctly guess the ‘odd-one-out’ maximal SN-set, however, and split that into two meta-leaves. Fortunately we can just try splitting each maximal SN-set in turn.

[Figure sequence: the subnetwork below a highest cut-edge is transformed step by step for the SN-set S = {C,G,F}, containing the leaves F, G and C. After the transformation the whole maximal SN-set is below a cut-edge.]


Finding simple level-2 networks

• So we know that, if we analyse the maximal SN-sets carefully, and construct an appropriate new set of triplets, we can recursively reduce the entire problem to finding simple level-2 networks.

• But how do we algorithmically construct a simple level-2 network that is consistent with a given dense set of triplets?

Suppose we can correctly ‘guess’ that leaf g hangs directly below a recombination node.

If we remove g, and all triplets that contain g, then we know that a level-1 network must be possible on this new set of triplets (because now fewer recombination nodes are needed).

Suppose we subsequently guess that leaf h now hangs below a recombination node in the new network.

If we remove h, and all triplets that contain h, then we know that a level-0 network (that is, a tree) must be possible on this new set of triplets (because now even fewer recombination nodes are needed). In such a case the resulting tree is UNIQUE (J&S).

So now we have a tree. We are going to guess how to add leaf h back in, and then guess how to add leaf g back in.

This guessing is not a problem because we can simply try all possibilities.
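The "try all possibilities" step for the two removed leaves can be pictured as a loop over ordered pairs (g, h). The sketch below only enumerates the guesses and the triplets that survive each one; building the tree on what remains and re-inserting h and g are left out. The names `remove_leaf` and `guesses` are hypothetical, and a triplet ab|c is the tuple `(a, b, c)`.

```python
from itertools import permutations


def remove_leaf(triplets, leaf):
    """Drop every triplet that mentions `leaf`."""
    return [t for t in triplets if leaf not in t]


def guesses(leaves, triplets):
    """Yield (g, h, remaining triplets) for every ordered pair of distinct leaves."""
    for g, h in permutations(sorted(leaves), 2):
        yield g, h, remove_leaf(remove_leaf(triplets, g), h)


T = [("z", "w", "x"), ("y", "x", "w"), ("x", "y", "z"), ("w", "z", "y")]
all_guesses = list(guesses("wxyz", T))
```

For n leaves there are n(n-1) ordered guesses, so this enumeration stays polynomial.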

[Figure: adding leaf h back in.]

[Figure: adding leaf h back in, and finally adding leaf g back in.]


Conclusions & open problems

• So we know how to efficiently construct level-2 networks from dense triplet sets. What’s next?

• Applicability: how useful is it?

• Initial implementation: programming and fine-tuning

• Improving running time: in the spirit of the “SN-tree” of J&S&N

• Complexity: what about level-3 and higher?

• Bounds: worst-case, best-case scenarios

• Building all networks

• Properties of output networks as function of input

• Different triplet restrictions

• Confidence: how good are the solutions?

• Exponential-time exact algorithms for NP-hard problems


Thank you for your attention!