Species Trees & Constraint Programming


The Tree of Life

A central goal of systematics

• construct the tree of life

• a tree that represents the relationship between all living things
  • including constraint programmers

• The leaf nodes of the tree are species

• The interior nodes are hypothesized species
  • extinct, where species diverged


Science 300


To date, biologists have cataloged about 1.7 million species, yet estimates of the total number of species range from 4 to 100 million.

“Of the 1.7 million species identified only about 80,000 species have been placed in the tree of life”

E. Pennisi “Modernizing the Tree of Life” Science 300:1692-1697 2003


Properties of a Species Tree

• We have a set of leaf nodes, each labelled with a species
• the interior nodes have no labels
• each interior node has 2 children and one parent
  • except the root (it has no parent)
• if we have n leaf nodes we then have n - 1 interior nodes
• it is a bifurcating tree


Super Trees

• We are given two trees, T1 and T2

• T1 has leaf set S1 and T2 has leaf set S2
  • remember, leaves are species!

• But S1 and S2 have a non-empty intersection
  • why? How can that happen?

• We want to combine T1 and T2
  • so, why is that a problem?


Tree: (f,((d,e),((c,(a,b)),g)))
Triples: {((a,b),c), ((d,e),c), ((c,b),e), ((e,b),f), ((a,g),f)}


Most Recent Common Ancestors (mrca)

[Figure: a rooted tree in which a and b are siblings and c joins them at the root]

We have 3 species, a, b, and c

Species a and b are more closely related to each other than they are to c

The most recent common ancestor of a and b is further from the root than the most recent common ancestor of a and c (and of b and c)

• mrca(a,b) ≠ mrca(a,c)
• mrca(a,b) ≠ mrca(b,c)
• mrca(a,c) = mrca(b,c)

We write this rooted triple as ab|c


Triples (and Fans)

[Figure: two rooted triples, ab|c and bc|d]

Species trees are frequently presented as a set of triples (and fans)

{ab|c, bc|d}


Triples (and Fans)

[Figure: the triples ab|c and bc|d, and the tree (((a,b),c),d) that combines them]


BreakUp & OneTree (circa 1996)

Algorithm BreakUp takes a species tree and produces a set of rooted triples R that define that tree.

Algorithm OneTree takes a set of species and a set of rooted triples, and builds a tree that respects those triples, or reports that no tree exists (in polytime).

OneTree is a specialisation of Build, an algorithm proposed by Aho, Sagiv, Szymanski, and Ullman in 1981.
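To give a feel for what a tree-to-triples step produces, here is a minimal sketch that lists every rooted triple displayed by a bifurcating tree. It assumes trees are nested Python tuples with strings as leaves, and the function names are hypothetical. It is not the published BreakUp: it returns all triples and makes no attempt to find a small defining set.

```python
from itertools import combinations

def leaves(tree):
    """Return the set of leaf labels below a node (a str is a leaf)."""
    if isinstance(tree, str):
        return {tree}
    left, right = tree
    return leaves(left) | leaves(right)

def all_triples(tree):
    """Yield (x, y, z), meaning xy|z, for every rooted triple in a binary tree."""
    if isinstance(tree, str):
        return
    left, right = tree
    for inner, outer in ((left, right), (right, left)):
        # x and y meet strictly below this node; z hangs off the other child
        for x, y in combinations(sorted(leaves(inner)), 2):
            for z in sorted(leaves(outer)):
                yield (x, y, z)
    yield from all_triples(left)
    yield from all_triples(right)

# The seven-species tree used elsewhere in these slides
tree = ("f", (("d", "e"), (("c", ("a", "b")), "g")))
triples = set(all_triples(tree))
```

Each set of three leaves contributes exactly one triple in a bifurcating tree, so the output grows cubically with the number of species.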


The Flavour of OneTree

Given a set of species S and rooted triples R

• produce a node N
• construct a graph G
  • with vertices in S
  • and edge (x,y) if triple xy|z is in R

• if G is a single component fail
• else recursively build
  • on the left with one component
    • with S’ and R’ (the set of species and triples in that component)
  • on the right, with the other components
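The recursion above can be sketched in a few lines. This is an illustration under stated assumptions, not the published OneTree: triples are tuples (x, y, z) meaning xy|z, trees are nested tuples, one or two species are treated as base cases, only triples whose three species all lie in the current set contribute edges, and the name `one_tree` is hypothetical.

```python
def one_tree(species, triples):
    """Return a tree (nested tuples) respecting all triples, or None if none exists."""
    species = sorted(species)
    if len(species) == 1:
        return species[0]
    if len(species) == 2:
        return (species[0], species[1])

    # Union-find over the species: join x and y for every triple xy|z in this set.
    parent = {s: s for s in species}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    inside = [t for t in triples if all(s in parent for s in t)]
    for x, y, _ in inside:
        parent[find(x)] = find(y)

    components = {}
    for s in species:
        components.setdefault(find(s), []).append(s)
    if len(components) == 1:
        return None  # G is a single component: fail

    first, *rest = components.values()
    others = [s for comp in rest for s in comp]
    left = one_tree(first, inside)    # one component on the left
    right = one_tree(others, inside)  # the other components on the right
    if left is None or right is None:
        return None
    return (left, right)

tree = one_tree("abcd", [("a", "b", "c"), ("b", "c", "d")])
# tree == ((("a", "b"), "c"), "d")
```

With the cyclic set {ab|c, bc|d, cd|a} the graph is a single component at the top level, so the same call returns None.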


The Flavour of OneTree

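S = {a, b, c, d}
R = {ab|c, bc|d}
[Figure: graph G on vertices a, b, c, d with edges a-b and b-c; d is isolated]

S = {a, b, c}
R = {ab|c}
[Figure: graph on vertices a, b, c with edge a-b; c is isolated]

S = {d}
R = {}
[Figure: graph with the single vertex d]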


Min-cut Super Trees

• What happens if OneTree fails?

• Give the best tree we can
  • by breaking some triples (resulting in fans)
  • by excluding some species

• There are polytime algorithms for this
  • but they are greedy and biased
  • minCut supertrees


Constraint Programming solutions to building a species tree from a set of rooted triples


A naïve constraint encoding (footnotes 756, 789, 794, 796)

• n-1 variables as interior nodes
  • v[i] = j ⇔ parent(v[i]) = v[j]
  • no loops/cycles

• Barbara used set variables (ILOG)
• Patrick used a specialised constraint (Choco)

• Francois then encoded set variables!
• n variables as leaf nodes
  • each takes a value respecting triples

• I am sparing you (and me) the details


Why was this a naïve constraint encoding?

• It produced the right number of trees when no triples
  • the Catalan number
  • symmetry breaking

• It would produce a tree if one existed

A 2 stage process

• (1) build a tree from the interior nodes
  • there are Catalan many of these

• (2) given an “interior tree” place the leaf nodes
  • there are n! ways to do this

• if step (2) fails generate the next interior tree in (1)

Yikes! That’s expensive.
Imagine {ab|c,bc|d,cd|a}
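A rough sense of how expensive: the sketch below multiplies the two counts quoted above, a Catalan number of interior trees and n! leaf placements. Reading "Catalan many" as the Catalan number for n - 1 interior nodes, and taking the product as the worst-case number of candidates, are both assumptions made for illustration; the function names are hypothetical.

```python
from math import comb, factorial

def catalan(m):
    """The m-th Catalan number: shapes of a binary tree with m interior nodes."""
    return comb(2 * m, m) // (m + 1)

def naive_search_space(n):
    """Interior-tree shapes times leaf placements for n species."""
    return catalan(n - 1) * factorial(n)

for n in (4, 6, 8, 10):
    print(n, naive_search_space(n))
```

For an unsatisfiable instance such as {ab|c, bc|d, cd|a}, a two-stage search may have to exhaust all of these candidates before it can report failure.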


Ultrametric Trees & Species Trees (footnotes 803,804,805,810,819)

What is an ultrametric tree?

• We are given a 2d symmetric matrix D
  • D[i,j] is the time of divergence of species i and j.

• mrca(i,j) is labeled with the time of divergence D[i,j]
  • D[i,j] is the value of mrca(i,j)

• Build a bifurcating tree
  • n leaves and n - 1 interior nodes
  • interior nodes labeled with entries from D
  • any path from the root is a strictly decreasing sequence


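Ultrametric Trees: here’s one I (well, Dan Gusfield actually) prepared earlier

     A  B  C  D  E
A    0  8  8  5  3
B       0  3  8  8
C          0  8  8
D             0  5
E                0

[Figure: the ultrametric tree for this matrix. The root is labeled 8 and its two children are labeled 5 and 3. Node 5 has leaf D and a node labeled 3 whose leaves are A and E; the other node 3 has leaves B and C.]

Note: if the sequence increases, we have a min-ultrametric tree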


Ultrametric Matrix: necessary & sufficient conditions

• cannot have more than n - 1 distinct values
  • because there are n - 1 interior nodes

• For every 3 indices i,j,k
  • there is a tie for the maximum between D[i,j], D[i,k], D[j,k]

Given an ultrametric matrix, an ultrametric tree can be constructed in O(n^2)

… see Dan Gusfield’s book “Algorithms on Strings, Trees, and Sequences”
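The two conditions can be checked directly. The sketch below is a plain O(n^3) test, not the O(n^2) construction from the book; the function name is hypothetical, and the matrix is my reading of the five-species example (A to E) from the earlier slide, filled out symmetrically.

```python
from itertools import combinations

def is_ultrametric(D):
    """Check the distinct-value bound and the tie for the maximum on every 3 indices."""
    n = len(D)
    values = {D[i][j] for i, j in combinations(range(n), 2)}
    if len(values) > n - 1:
        return False  # more distinct values than interior nodes
    for i, j, k in combinations(range(n), 3):
        lo, mid, hi = sorted((D[i][j], D[i][k], D[j][k]))
        if mid != hi:
            return False  # the maximum is not tied
    return True

# Rows and columns in the order A, B, C, D, E
D = [
    [0, 8, 8, 5, 3],
    [8, 0, 3, 8, 8],
    [8, 3, 0, 8, 8],
    [5, 8, 8, 0, 5],
    [3, 8, 8, 5, 0],
]
print(is_ultrametric(D))
```

Changing a single symmetric pair, for example setting D[A,D] to 6, leaves the triple A, D, E with values 6, 3, 5 and the check fails.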


Why are our species trees ultrametric?


• Take any rooted tree
• Mark the interior nodes with their depth/height in the tree
• Any path to a leaf is an increasing/decreasing sequence
• The tree is ultrametric


So?

If you take any 3 leaf nodes x,y,z

- the deepest ancestor of x and y is deeper than the deepest ancestor of x (y) and z, OR
- the deepest ancestor of x and z is deeper than the deepest ancestor of x (z) and y, OR
- the deepest ancestor of y and z is deeper than the deepest ancestor of y (z) and x
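A small sketch makes this concrete: label each interior node with its depth, record the depth of the mrca for every pair of leaves, and check the three-way condition on every set of three leaves. It assumes trees are nested tuples with string leaves and that the root has depth 1; both function names are hypothetical.

```python
from itertools import combinations

def depth_matrix(tree):
    """Map each unordered pair of leaves to the depth of its mrca (root = 1)."""
    D = {}

    def walk(node, depth):
        if isinstance(node, str):
            return [node]
        left = walk(node[0], depth + 1)
        right = walk(node[1], depth + 1)
        for x in left:
            for y in right:
                D[frozenset((x, y))] = depth  # x and y first meet at this node
        return left + right

    walk(tree, 1)
    return D

def is_min_ultrametric(D, leaf_names):
    """For every 3 leaves: one pair is strictly deepest, the other two pairs tie."""
    for x, y, z in combinations(leaf_names, 3):
        lo, mid, hi = sorted(
            (D[frozenset((x, y))], D[frozenset((x, z))], D[frozenset((y, z))])
        )
        if not (lo == mid < hi):
            return False
    return True

tree = ("f", (("d", "e"), (("c", ("a", "b")), "g")))
D = depth_matrix(tree)
print(D[frozenset("ab")], D[frozenset("ac")], D[frozenset("af")])
```

Because depth grows away from the root, the tie is for the smaller value, which is the min-ultrametric form of the condition.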


A CP encoding of D

• We have a 2 dimensional matrix of constrained integer variables D
• We must ensure that for any i,j,k the following holds
  • for any 3 indices, there is a tie for the minimum (the min-ultrametric form of the condition)

D[i,j] > D[i,k] = D[j,k]
∨ D[i,k] > D[i,j] = D[j,k]
∨ D[j,k] > D[i,j] = D[i,k]

[Figure: triangle with vertices i, j, k and sides D[i,j], D[j,k], D[i,k]]

Think isosceles triangles, allowing equilateral

An ultrametric space, composed of isosceles triangles


A CP encoding of D

D[i,j] > D[i,k] = D[j,k]
∨ D[i,k] > D[i,j] = D[j,k]
∨ D[j,k] > D[i,j] = D[i,k]

[Figure: triangle with vertices i, j, k and sides D[i,j], D[j,k], D[i,k]]

Any instantiation of the variables in D is now guaranteed to be min-ultrametric

We get a Catalan number of min-ultrametric solutions


A geometric view: choose one of these isosceles triangles, each corresponding to a rooted triple

[Figure: three isosceles triangles on a, b, c, one for each of the rooted triples ab|c, ac|b and bc|a]


How can we exploit this?

• We are given triples and fans, but not distances!
• But we can consider a triple ij|k as a constraint

[Figure: rooted triple with i and j as siblings and k as the outgroup]

D[i,k] = D[j,k] ∧ D[i,j] > D[i,k] ∧ D[i,j] > D[j,k]

Note: our tree is min-ultrametric!

This over-rides the disjunctions posted across the matrix


The CP encoding (contd)

• we have the “blanket” disjunctive constraints to ensure min-ultrametric
  • ∀i∀j∀k (i < j < k → triple(i,j,k) ∨ triple(i,k,j) ∨ triple(j,k,i))
  • triple(i,j,k) ⇔ (D[i,k] = D[j,k] ∧ D[i,j] > D[i,k] ∧ D[i,j] > D[j,k])
  • O(n^3) ternary constraints!

• triples are constraints that break the disjunctions

• a solution (if one exists) is min-ultrametric respecting triples

• we can then produce tree from the matrix, as a post process

• NOTE: we need a pre-process to break up trees into triples
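The encoding can be exercised on tiny instances without a solver. In the sketch below, exhaustive enumeration stands in for CP search, which is a deliberate simplification and only feasible for a handful of species. It assumes the strict form of triple(i,j,k), in which D[i,j] is strictly greater than the tied pair D[i,k] = D[j,k], and a domain of depths 1 to n - 1; the names `triple` and `solve` are hypothetical.

```python
from itertools import combinations, product

def triple(D, i, j, k):
    """ij|k: mrca(i,j) is strictly deeper than the shared mrca of (i,k) and (j,k)."""
    ij = D[frozenset((i, j))]
    ik = D[frozenset((i, k))]
    jk = D[frozenset((j, k))]
    return ik == jk and ij > ik and ij > jk

def solve(species, triples):
    """Return a min-ultrametric matrix (as a dict) respecting the triples, or None."""
    species = sorted(species)
    pairs = [frozenset(p) for p in combinations(species, 2)]
    depths = range(1, len(species))  # n - 1 possible depths
    for values in product(depths, repeat=len(pairs)):
        D = dict(zip(pairs, values))
        # "blanket" disjunction on every 3 species
        blanket = all(
            triple(D, i, j, k) or triple(D, i, k, j) or triple(D, j, k, i)
            for i, j, k in combinations(species, 3)
        )
        # the given triples break the disjunctions
        if blanket and all(triple(D, x, y, z) for x, y, z in triples):
            return D
    return None

sol = solve("abcd", [("a", "b", "c"), ("b", "c", "d")])
none = solve("abcd", [("a", "b", "c"), ("b", "c", "d"), ("c", "d", "a")])
```

For {ab|c, bc|d} the only matrix that survives has D[a,b] = 3, D[a,c] = D[b,c] = 2 and the three pairs involving d at 1, which is the tree (((a,b),c),d). For {ab|c, bc|d, cd|a} the triple constraints alone force D[a,c] = D[b,c] > D[c,d] > D[a,c], so no matrix exists.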


Some real data sets follow
Thorley’s thesis & Rod Page’s birds


Tree: (f,((d,e),((c,(a,b)),g)))
Triples: {((a,b),c), ((d,e),c), ((c,b),e), ((e,b),f), ((a,g),f)}


     a  b  c  d  e  f  g
a    0  4  3  2  2  1  4
b    4  0  3  2  2  1  5
c    3  3  0  2  2  1  3
d    2  2  2  0  3  1  2
e    2  2  2  3  0  1  2
f    1  1  1  1  1  0  1
g    4  5  3  2  2  1  0

The tree is ultrametric & has an ultrametric matrix
Depth is our measure: the entry for X and Y is the depth of mrca(X,Y)


What’s the performance like?
- time?
- space?


Why bother with a constraint encoding?

• the challenge of it
• add side constraints,
  • such as data on interior nodes
  • distances between species

• optimisation
• use variable & value ordering heuristics
• state of the art CP search algorithms

• explanations
  • why are some species closer to each other?
  • Why is species X excluded from the tree?
  • Etc?


What are the (initial) challenges?

• make triple(i,j,k) more efficient
  • I think this is easy (and so does Peter Nightingale)

• something better than the O(n^3) ternary constraints
• an encoding that provably answers the decision in polytime


So where are we?

Good question:

• we have tried real data
• we have a number of different micro-encodings
• Are we in P for decision?
  • Not sure yet
• How about optimisation?
  • We can see a way, by introducing penalties


Questions?