Perl for Phyloinformatics

What did we learn?

What did we do?

Tree Concepts

What are phylogenetic trees?

Phylogenetic Trees

Describe the historical relationships among lineages of organisms or their parts, such as their genes.

Tree terminology

[Figure: a tree with tips A to F, labeling the terminal nodes (tips), internal nodes, operational taxonomic units (OTUs) / taxa, sisters, the root, and branches.]

Interpreting phylogenies

These trees are the same shape

Rooted vs. unrooted trees

[Figure: the same six taxa, A to F, drawn as a rooted tree and as an unrooted tree.]

Rooted tree: has a root that denotes common ancestry.

Unrooted tree: specifies only the degree of kinship among taxa, not the evolutionary path.

The number of rooted (N_R) and unrooted (N_U) trees for n species is

N_R = \frac{(2n - 3)!}{2^{n-2} (n - 2)!}

N_U = \frac{(2n - 5)!}{2^{n-3} (n - 3)!}

Species   Rooted trees                     Unrooted trees
2         1                                1
3         3                                1
4         15                               3
5         105                              15
10        34,459,425                       2,027,025
15        213,458,046,676,875              7,905,853,580,625
20        8,200,794,532,637,891,559,375    221,643,095,476,699,771,875

A simple example

Why more rooted than unrooted?

On an unrooted tree, the root can be placed on any of the branches.
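The counts in the table can be reproduced with a short sketch. It assumes strictly bifurcating trees with labeled tips; the function names are illustrative only. An unrooted tree with n tips has 2n - 3 branches, and each branch is one possible root position, which is why the rooted count is the unrooted count multiplied by 2n - 3.

```python
def num_unrooted(n):
    """Unrooted bifurcating trees for n >= 3 labeled tips: 1 * 3 * 5 * ... * (2n - 5)."""
    if n < 3:
        raise ValueError("the product form needs at least 3 tips")
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd numbers 3, 5, ..., 2n - 5
        count *= k
    return count


def num_rooted(n):
    """Rooted trees: place the root on any of the 2n - 3 branches of an unrooted tree."""
    if n == 2:
        return 1  # two tips can only be joined in one way
    return num_unrooted(n) * (2 * n - 3)


if __name__ == "__main__":
    for n in (3, 4, 5, 10, 15, 20):
        print(n, num_rooted(n), num_unrooted(n))
```

Python integers have arbitrary precision, so the 20-species values are computed exactly.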

Trees and classification

Monophyletic

A monophyletic group is a group of organisms that forms a clade, meaning that it consists of an ancestor and all its descendants.

(Most clades on our Supertree are monophyletic.)

Paraphyletic

A paraphyletic group consists of an ancestor and some, but not all, of its descendants: it excludes species that share a common ancestor with its members.

Polyphyletic

A polyphyletic group is one whose members' most recent common ancestor is not a member of the group.

Example: birds and reptiles

Reptiles, without the birds, form a paraphyletic group.
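A minimal sketch of how such a test could be automated, assuming the tree is stored as nested tuples with tip names as strings. The representation, the function names, and the four-taxon example tree are simplifications chosen for illustration. The check only asks whether some node has exactly the given tips as its descendants, so a negative answer does not distinguish paraphyly from polyphyly.

```python
def tips(tree):
    """Return the set of tip names below a node (a tip is a plain string)."""
    if isinstance(tree, str):
        return {tree}
    result = set()
    for child in tree:
        result |= tips(child)
    return result


def is_monophyletic(tree, group):
    """True if some node in the tree has exactly `group` as its set of tips."""
    group = set(group)
    if tips(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)


# Simplified example tree: crocodiles are closer to birds than to lizards.
amniotes = (("lizard", ("crocodile", "bird")), "mammal")

if __name__ == "__main__":
    print(is_monophyletic(amniotes, {"lizard", "crocodile"}))          # reptiles without birds
    print(is_monophyletic(amniotes, {"lizard", "crocodile", "bird"}))  # reptiles with birds
```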

Change and time

Phylograms: Branch lengths are proportional to the amount of change that occurred on that branch (these are the gene trees before r8s).

Cladograms: Branch lengths are not proportional to the amount of change (this is the Supertree from Monday).

Cladograms and phylograms

Ultrametric trees

If the distance from the root represents time (not change) we can use trees to study how fast new species form.

(This is our final tree after we put it all together.)
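Whether a tree is ultrametric can be checked by comparing root-to-tip distances, which must all be equal when branch lengths represent time. The sketch below assumes a hypothetical representation in which an internal node is a list of (child, branch length) pairs and a tip is a string; the example trees are made up.

```python
def tip_depths(node, depth=0.0):
    """Collect the distance from the root to every tip below `node`."""
    if isinstance(node, str):
        return [depth]
    depths = []
    for child, length in node:
        depths.extend(tip_depths(child, depth + length))
    return depths


def is_ultrametric(tree, tol=1e-9):
    """True if all tips are equally far from the root (within a tolerance)."""
    depths = tip_depths(tree)
    return max(depths) - min(depths) <= tol


# Made-up examples: a clock-like tree and a phylogram with unequal tip depths.
clock_tree = [([("A", 1.0), ("B", 1.0)], 2.0), ("C", 3.0)]
phylogram = [([("A", 0.5), ("B", 1.5)], 2.0), ("C", 3.0)]

if __name__ == "__main__":
    print(is_ultrametric(clock_tree), is_ultrametric(phylogram))
```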

Types of data

What evidence are phylogenetic trees based on?

Distance data

Example: DNA-DNA hybridization. The more closely related two species are, the more similar their DNA. The more similar the DNA, the stronger the bond between the two strands, and the shorter the distance.

Morphological characters

Example: the shape of spider webs.

Molecular sequence data

I am sure you have all heard about DNA sequencing. Amino acid sequences are often used for more distantly related species.

Types of Data

Two categories:

Numerical data
  Evolutionary distance between two species
  Usually derived from sequence data

Character data
  Each character has a finite number of states
  E.g. number of legs = 1, 2, 4; DNA = {A, C, T, G}

Tree reconstruction

Distance methods

Types of data

Distance matrices:
  DNA-DNA hybridization
  Computed from sequences

Examples

UPGMA is the oldest distance matrix method

Neighbor-joining is more commonly used

Distance data

When using sequences, distance-based methods must transform the sequence data into a pairwise distance matrix for use during tree inference

Species   A   B   C   D
B         2   -   -   -
C         4   5   -   -
D         7   9   5   -
E         3   5   7   8
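One simple way to turn aligned sequences into such a matrix is the p-distance, the proportion of sites at which two sequences differ. The sketch below assumes the sequences are already aligned to equal length; the function names and the toy alignment are made up for illustration, and real analyses usually apply a substitution-model correction on top of this raw proportion.

```python
def p_distance(seq1, seq2):
    """Proportion of aligned sites at which the two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(1 for x, y in zip(seq1, seq2) if x != y)
    return diffs / len(seq1)


def distance_matrix(alignment):
    """Map each unordered pair of taxon names to its p-distance."""
    names = sorted(alignment)
    return {
        (a, b): p_distance(alignment[a], alignment[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Toy alignment, invented for this example.
toy_alignment = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACGAACTA"}

if __name__ == "__main__":
    for pair, d in distance_matrix(toy_alignment).items():
        print(pair, d)
```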

Neighbor-Joining Methods

1. Maintain a pairwise distance matrix

2. Find the closest two taxa

3. Collapse them into one row (internal node) and recompute distance from the merged row to every other row

4. Loop to 2

5. Build tree as you go
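The loop above can be sketched as follows, using the distance matrix from the previous slide. This is a simplified illustration and not a full neighbor-joining implementation: it merges the pair with the smallest raw distance and averages distances weighted by cluster size, as UPGMA does, whereas neighbor-joining proper selects the pair with a rate-corrected criterion. All names are illustrative.

```python
from itertools import combinations


def cluster(names, dist):
    """Agglomerative clustering; `dist` maps frozenset({a, b}) to a distance."""
    d = dict(dist)
    sizes = {name: 1 for name in names}
    active = list(names)
    while len(active) > 1:
        # Step 2: find the closest pair among the remaining rows.
        a, b = min(combinations(active, 2), key=lambda p: d[frozenset(p)])
        merged = f"({a},{b})"  # Step 5: build the tree as a nested string
        sizes[merged] = sizes[a] + sizes[b]
        # Step 3: recompute distances from the merged row to every other row.
        for other in active:
            if other in (a, b):
                continue
            d[frozenset((merged, other))] = (
                sizes[a] * d[frozenset((a, other))]
                + sizes[b] * d[frozenset((b, other))]
            ) / sizes[merged]
        active = [x for x in active if x not in (a, b)] + [merged]
    return active[0]


# Distances copied from the matrix on the previous slide.
slide_matrix = {
    frozenset("AB"): 2, frozenset("AC"): 4, frozenset("BC"): 5,
    frozenset("AD"): 7, frozenset("BD"): 9, frozenset("CD"): 5,
    frozenset("AE"): 3, frozenset("BE"): 5, frozenset("CE"): 7,
    frozenset("DE"): 8,
}

if __name__ == "__main__":
    print(cluster(["A", "B", "C", "D", "E"], slide_matrix))
```

On this matrix the sketch first joins A and B, then adds E, then joins C and D.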

Character methods

Types of data

Any homologized data:
  Morphological data
  Molecular sequences

Examples

Optimality-criterion methods:
  Maximum parsimony
  Maximum likelihood

Bayesian methods: MCMC

What is homology?

Example: forelimbs

Definition

Homology means any similarity between characters that is due to their shared ancestry.

Anatomical structures that evolved from the same structure in some ancestor species are homologous.

In genetics, homology can be observed in aligned DNA sequences.

What is an “optimality criterion”?

An optimality criterion is simply a way to quantify, using a number, how well a tree fits the data relative to other trees.

Examples are parsimony tree length (this is how the Supertree was optimized on the CIPRES cluster) and likelihood score.

The posterior probability can also be seen as an optimality criterion.

Parsimony tree length

Tree length is the minimum number of reconstructed changes.

The most parsimonious tree is the tree with the fewest changes.
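For a single character on a fixed, rooted binary tree, the minimum number of changes can be counted with Fitch's algorithm. The sketch below assumes unordered, equally weighted character states and a tree stored as nested pairs with tip names as strings; the names and the example data are illustrative, and the tree length for a whole alignment would be the sum over all characters.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum changes) for one character."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change needed.
        return common, left_changes + right_changes
    # The children disagree: one change is required on this part of the tree.
    return left_set | right_set, left_changes + right_changes + 1


def tree_length(tree, states):
    """Parsimony length of one character on the given tree."""
    return fitch(tree, states)[1]


# Made-up character: one site observed in four taxa.
site = {"A": "G", "B": "G", "C": "T", "D": "T"}

if __name__ == "__main__":
    print(tree_length((("A", "B"), ("C", "D")), site))
    print(tree_length((("A", "C"), ("B", "D")), site))
```

The first tree groups taxa with the same state together and needs fewer changes than the second, so it is the more parsimonious of the two for this character.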

Finding the optimal tree

Under an optimality criterion, trees need to be compared with one another to find the one with the best score: the shortest tree under parsimony, the highest likelihood under maximum likelihood.

When we talk about MP and ML trees, this is usually done with hill-climbing algorithms.

…but this is not the whole story!

Maximum Parsimony assumes a very simple model for evolutionary change – namely that change is rare.

Molecular evolution in particular can be modeled in more realistic ways, using substitution models.

There are more complex ways to explore tree space than just hill-climbing (such as the Parsimony Ratchet).

We can also sample different areas of tree space to see how optimality is distributed, using MCMC.

Substitution models

Base frequencies and substitution rates

Additional parameters

Gamma distribution

Invariant sites

Perhaps some sites never change.

Maybe specify their proportion?

Likelihood and the number of parameters

More parameters always lead to a higher value of the likelihood, whether or not the additional parameters provide a ‘significantly’ better fit to the data

Are the extra parameters justified?

Likelihood ratio statistic: \Lambda = 2 \log \left( \frac{\text{Maximum Likelihood} \mid H_1}{\text{Maximum Likelihood} \mid H_0} \right)

For nested models, the statistic has (approximately) a chi-squared distribution

dof = number of additional parameters

(We did this with ModelTest)
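The test can be sketched in a few lines, assuming the two models are nested and their maximized log-likelihoods are already known. The function name is hypothetical and the log-likelihood values in the usage example are invented; the critical values are the standard 5% points of the chi-squared distribution for 1 to 4 degrees of freedom.

```python
# 5% critical values of the chi-squared distribution, indexed by degrees of freedom.
CHI2_CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def likelihood_ratio_test(lnl_null, lnl_alt, extra_params):
    """Return (statistic, reject_null) for two nested models at the 5% level.

    lnl_null and lnl_alt are maximized log-likelihoods, so the statistic
    2 * log(L1 / L0) becomes 2 * (lnl_alt - lnl_null).
    """
    statistic = 2.0 * (lnl_alt - lnl_null)
    return statistic, statistic > CHI2_CRITICAL_05[extra_params]


if __name__ == "__main__":
    # Invented log-likelihoods: the richer model has one extra parameter.
    print(likelihood_ratio_test(-1000.0, -995.0, 1))
    print(likelihood_ratio_test(-1000.0, -999.0, 1))
```

In the first call the extra parameter improves the log-likelihood by 5 units, which gives a statistic of 10 and exceeds the critical value; in the second call the improvement is too small to justify the parameter.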

How did we use the substitution models?

Each substitution has an associated likelihood given a branch of a certain length and the estimated model parameters.

A function is derived to represent the likelihood of the data given the tree, branch-lengths and additional parameters.

Optimise the branch lengths to get the maximum likelihood estimate.
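As a hedged illustration of these three steps, the sketch below uses the Jukes-Cantor model, chosen here only because it is the simplest substitution model (equal base frequencies, one rate) and not because it is the model selected in the course. It computes the likelihood of two aligned sequences separated by a single branch and finds the best branch length by a plain grid search; real programs optimise over whole trees with numerical optimisers. All names and the example sequences are made up.

```python
import math


def jc69_prob(same, t):
    """Jukes-Cantor probability of observing the same or a different base
    after a branch of length t (expected substitutions per site)."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if same else 0.25 - 0.25 * decay


def log_likelihood(seq1, seq2, t):
    """Log-likelihood of two aligned sequences joined by a branch of length t."""
    total = 0.0
    for x, y in zip(seq1, seq2):
        # Base frequency 1/4 times the probability of the observed site pattern.
        total += math.log(0.25 * jc69_prob(x == y, t))
    return total


def best_branch_length(seq1, seq2, grid):
    """Grid search for the branch length with the highest likelihood."""
    return max(grid, key=lambda t: log_likelihood(seq1, seq2, t))


if __name__ == "__main__":
    grid = [i / 1000.0 for i in range(1, 1001)]
    print(best_branch_length("ACGTACGTAC", "ACGTACGTAA", grid))
```

With one difference in ten sites, the search should land close to the analytical Jukes-Cantor estimate, -3/4 * ln(1 - 4p/3) with p = 0.1.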

Estimating node ages

Rate smoothing

r8s methods attempt to simultaneously estimate unknown divergence times and smooth the rapidity of rate change along lineages.

This is done by invoking some function that penalizes rates that change too quickly from branch to neighboring branch.
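One possible form of such a penalty, shown only as an assumption-laden sketch, is the sum of squared differences between the rate on each branch and the rate on its parent branch. The function name, the dictionary-based representation, and the example rates are hypothetical, and the penalty functions actually used by r8s may differ in detail.

```python
def roughness_penalty(rates, parent_of):
    """Sum of squared rate differences between each branch and its parent branch.

    rates maps a branch name to its rate; parent_of maps a branch name to the
    name of its parent branch, or None for a branch attached to the root.
    """
    penalty = 0.0
    for branch, parent in parent_of.items():
        if parent is not None:
            penalty += (rates[branch] - rates[parent]) ** 2
    return penalty


# Hypothetical three-branch example: one stem branch with two daughter branches.
parent_of = {"stem": None, "a": "stem", "b": "stem"}

if __name__ == "__main__":
    print(roughness_penalty({"stem": 1.0, "a": 1.0, "b": 1.0}, parent_of))
    print(roughness_penalty({"stem": 1.0, "a": 3.0, "b": 0.5}, parent_of))
```

A clock-like assignment of rates gives a penalty of zero, while rates that jump between neighboring branches are penalized, which is the behavior the smoothing relies on.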

Given a cladogram, how do we infer the divergence dates of the true tree?

[Figure: the supertree, drawn as a cladogram; its axis is labeled "NOT time".]

The relative lengths of some branches can be obtained from genes that fit an MLK model.

[Figure: gene trees for taxa (A, C, E) and (A, B, D, E) next to the "true tree" for taxa A, B, C, D, E; labels: Simmons, Hackman; axis: time.]

Estimates from multiple molecular sequences can subsequently be combined by calibrating the gene trees on a common node, and applying the resulting node depths to the supertree.

Where did we get the other dates?

If there is no extinction and constant speciation (!), the expected waiting time from one speciation event to the next is 1/n, where n = number of lineages and time is measured in units of the expected time for a single lineage to split.

This is a little more complicated if we take multiple labeled histories into account…

…but we can come up with expected ages this way.
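A rough sketch of that calculation, assuming a pure-birth process with the same speciation rate on every lineage and ignoring the labeled-history correction mentioned above: while there are k lineages, the expected wait until the next speciation is 1/(k * rate), so expected node ages follow by adding these waits backwards from the present. The function name and the `rate` parameter are illustrative.

```python
def expected_node_ages(n_tips, rate=1.0):
    """Expected ages of the n_tips - 1 speciation events, oldest (root) first.

    The interval during which k lineages exist has expected length
    1 / (k * rate); ages are accumulated from the present backwards.
    """
    ages = []
    age = 0.0
    for k in range(n_tips, 1, -1):  # k = n_tips, n_tips - 1, ..., 2
        age += 1.0 / (k * rate)
        ages.append(age)
    return ages[::-1]


if __name__ == "__main__":
    print(expected_node_ages(4))
```

For four tips this gives a root age of 1/2 + 1/3 + 1/4 in units of 1/rate, with the two younger nodes at 1/3 + 1/4 and 1/4.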

PhyloInformatics

What is PhyloInformatics?

A made-up word!

We’ve seen we have to deal with data of different types (trees, sequences, alignments, metadata).

These are part of complex workflows or pipelines.

We “do” phyloinformatics when we come up with repeatable ways to automate these pipelines.

The power of UNIX

UNIX is very useful for phyloinformatics:

Everything is text-based

Everything can be scripted and called from other programs

Many programs for phylogenetics are available on UNIX platforms

Everything can be piped together to create larger workflows

The power of Perl

Perl allows us to chain other UNIX tools together

Many Perl libraries exist for dealing with biological data

Easy to learn, quick to develop

Join us!

We do a lot more phyloinformatics:
  Hackathons
  Google Summer of Code
  Ongoing projects

Stay in touch, we can help each other!

谢谢！Thank you!
