
Perl for Phyloinformatics


Course slides for computational phyloinformatics, an annual course organized by NESCent in collaboration with hosting organizations across the world. I teach the Perl section of the course; these are the slides I presented in 2010 at BGI, Shenzhen, PRC.


Page 1: Perl for Phyloinformatics

Perl for PhyloInformatics

What did we learn? What did we do?

Page 2: Perl for Phyloinformatics

Tree Concepts

What are phylogenetic trees?

Page 3: Perl for Phyloinformatics

Phylogenetic Trees

Describe the historical relationships among lineages of organisms or their parts, such as their genes.

Page 4: Perl for Phyloinformatics

Tree terminology

[Figure: a tree with tips A to F, labeling the terminal nodes or tips (operational taxonomic units (OTUs) / taxa), internal nodes, branches, sisters, and the root.]

Page 5: Perl for Phyloinformatics

Interpreting phylogenies

These trees are the same shape

Page 6: Perl for Phyloinformatics

Tree terminology

Rooted vs. unrooted trees

[Figure: a rooted tree (root labeled) and an unrooted tree on the same taxa A to F.]

Rooted tree: Has a root that denotes common ancestry

Unrooted tree: Only specifies the degree of kinship among taxa but not the evolutionary path

Page 7: Perl for Phyloinformatics

Rooted and unrooted trees

The number of rooted trees ($N_R$) and unrooted trees ($N_U$) for $n$ species is

$$N_R = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}, \qquad N_U = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$$

Species  Rooted trees                    Unrooted trees
2        1                               1
3        3                               1
4        15                              3
5        105                             15
10       34,459,425                      2,027,025
15       213,458,046,676,875             7,905,853,580,625
20       8,200,794,532,637,891,559,375   221,643,095,476,699,771,875
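The counts on this slide can be reproduced directly from the two factorial formulas. The following is a minimal sketch; the function names are illustrative and not part of the course material.

```python
from math import factorial


def num_rooted_trees(n):
    """Number of rooted, bifurcating, labeled trees for n >= 2 species."""
    if n < 2:
        raise ValueError("need at least 2 species")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def num_unrooted_trees(n):
    """Number of unrooted, bifurcating, labeled trees for n >= 3 species."""
    if n < 3:
        raise ValueError("the formula needs at least 3 species")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


if __name__ == "__main__":
    for n in (3, 4, 5, 10, 15, 20):
        print(n, f"{num_rooted_trees(n):,}", f"{num_unrooted_trees(n):,}")
```

Integer division is exact here because both formulas always yield whole numbers, and Python integers do not overflow for the larger rows.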

Page 8: Perl for Phyloinformatics

A simple example

Page 9: Perl for Phyloinformatics

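Why more rooted than unrooted?

On an unrooted tree, the root can be placed on any of the branches. An unrooted tree of n species has 2n - 3 branches, so each unrooted tree corresponds to 2n - 3 rooted trees.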

Page 10: Perl for Phyloinformatics

Trees and classification

Page 11: Perl for Phyloinformatics

Monophyletic

A monophyletic group is a group of organisms which forms a clade, meaning that it consists of an ancestor and all its descendants.

(Most clades on our Supertree are monophyletic.)

Page 12: Perl for Phyloinformatics

Paraphyletic

A paraphyletic group consists of an ancestor and some, but not all, of its descendants: it excludes species that share a common ancestor with its members.

Page 13: Perl for Phyloinformatics

Polyphyletic

A polyphyletic group is one whose members' most recent common ancestor is not a member of the group.

Page 14: Perl for Phyloinformatics

Example: birds and reptiles

Reptiles, without the birds, form a paraphyletic group.
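The birds-and-reptiles example can be checked mechanically: a group is monophyletic if its members are exactly the tips below some node. The sketch below assumes a tree written as nested tuples; the toy topology is a deliberate simplification for illustration and is not taken from the slides.

```python
def collect_clades(tree, found):
    """Return the tip set below this node and record it in `found`."""
    if isinstance(tree, str):
        tips = frozenset([tree])
    else:
        tips = frozenset().union(*(collect_clades(child, found) for child in tree))
    found.add(tips)
    return tips


def is_monophyletic(tree, group):
    """True if `group` is exactly the set of tips below some node of `tree`."""
    found = set()
    collect_clades(tree, found)
    return frozenset(group) in found


# Hypothetical, simplified topology used only for illustration.
toy_tree = ("mammals", ("turtles", ("lizards", ("crocodiles", "birds"))))

reptiles = {"turtles", "lizards", "crocodiles"}
print(is_monophyletic(toy_tree, reptiles))              # reptiles without birds
print(is_monophyletic(toy_tree, reptiles | {"birds"}))  # reptiles plus birds
```

On this toy tree the reptiles alone do not match any node, because the node that contains all of them also contains the birds.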

Page 15: Perl for Phyloinformatics

Change and time

Page 16: Perl for Phyloinformatics

Cladograms and phylograms

[Figure: two trees on taxa A to F, one drawn as a phylogram and one as a cladogram.]

Phylograms: Branch lengths are proportional to the amount of change that occurred on that branch (these are the gene trees before r8s).

Cladograms: Branch lengths are not proportional to the amount of change (this is the Supertree from Monday).

Page 17: Perl for Phyloinformatics

Ultrametric trees

If the distance from the root represents time (not change) we can use trees to study how fast new species form.

(This is our final tree after we put it all together.)

Page 18: Perl for Phyloinformatics

Types of data

What evidence are phylogenetic trees based on?

Page 19: Perl for Phyloinformatics

Distance data

Example: DNA-DNA hybridization. The more closely related two species are, the more similar their DNA. The more similar the DNA, the stronger the bond between the two strands, and the shorter the distance.

Page 20: Perl for Phyloinformatics

Morphological characters

Example: the shape of spider webs.

Page 21: Perl for Phyloinformatics

Molecular sequence data

I am sure you have all heard about DNA sequencing. Amino acid sequences are often used for more distantly related species.

Page 22: Perl for Phyloinformatics

Types of Data

Two categories:

- Numerical data
  - Evolutionary distance between two species
  - Usually derived from sequence data
- Character data
  - Each character has a finite number of states
  - E.g. number of legs = 1, 2, 4; DNA = {A, C, T, G}

Page 23: Perl for Phyloinformatics

Tree reconstruction

Page 24: Perl for Phyloinformatics

Distance methods

Types of data

- Distance matrices:
  - DNA-DNA hybridization
  - Computed from sequences

Examples

- UPGMA is the oldest distance matrix method
- Neighbor-joining is more commonly used

Page 25: Perl for Phyloinformatics

Distance data

When using sequences, distance-based methods must first transform the sequence data into a pairwise distance matrix for use during tree inference

Species  A  B  C  D
B        2  -  -  -
C        4  5  -  -
D        7  9  5  -
E        3  5  7  8
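As a rough illustration of how sequences become a distance matrix, the sketch below computes the uncorrected proportion of differing sites (the p-distance) for every pair of aligned sequences. The choice of p-distance, the function names, and the toy alignment are assumptions made for this example; real analyses usually apply a correction based on a substitution model.

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return differences / len(seq_a)


def distance_matrix(alignment):
    """Map each unordered pair of names to the p-distance of their sequences."""
    return {
        frozenset((a, b)): p_distance(alignment[a], alignment[b])
        for a, b in combinations(sorted(alignment), 2)
    }


# Made-up alignment, for illustration only.
toy_alignment = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTAA",
    "C": "ACGAACGTTA",
}

for pair, distance in distance_matrix(toy_alignment).items():
    print("-".join(sorted(pair)), distance)
```

Keying the matrix on unordered pairs stores each distance once, which mirrors the lower-triangular layout of the table.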

Page 26: Perl for Phyloinformatics

Neighbor-Joining Methods

1. Maintain a pairwise distance matrix

2. Find the closest two taxa

3. Collapse them into one row (internal node) and recompute distance from the merged row to every other row

4. Loop to 2

5. Build tree as you go
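The five steps can be sketched in a few lines. Note the assumption: this sketch follows the steps literally, joining the pair with the smallest raw distance and averaging distances on merge, which is essentially UPGMA-style clustering. Full neighbor-joining instead picks the pair using a rate-corrected criterion and also estimates branch lengths. The input is the matrix from the distance data slide; all names are illustrative.

```python
def cluster(distances):
    """Greedy average-linkage clustering; returns a Newick topology string."""
    dist = dict(distances)  # step 1: maintain a pairwise distance matrix
    size = {name: 1 for pair in dist for name in pair}
    while len(size) > 1:  # step 4: loop until one cluster remains
        # step 2: find the closest pair (ties broken alphabetically)
        pair = min(dist, key=lambda p: (dist[p], sorted(p)))
        a, b = sorted(pair)
        merged = f"({a},{b})"  # step 5: build the tree as we go
        # step 3: collapse a and b, recompute distances to every other row
        for other in size:
            if other in (a, b):
                continue
            d_a = dist.pop(frozenset((a, other)))
            d_b = dist.pop(frozenset((b, other)))
            total = size[a] + size[b]
            dist[frozenset((merged, other))] = (size[a] * d_a + size[b] * d_b) / total
        del dist[pair]
        size[merged] = size.pop(a) + size.pop(b)
    return next(iter(size)) + ";"


slide_matrix = {
    frozenset("AB"): 2, frozenset("AC"): 4, frozenset("AD"): 7, frozenset("AE"): 3,
    frozenset("BC"): 5, frozenset("BD"): 9, frozenset("BE"): 5,
    frozenset("CD"): 5, frozenset("CE"): 7, frozenset("DE"): 8,
}

print(cluster(slide_matrix))
```

Cluster names double as Newick fragments, so the last remaining name is the finished tree.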

Page 27: Perl for Phyloinformatics

Character methods

Types of data

Any homologized data:

- Morphological data
- Molecular sequences

Examples

- Optimality-criterion methods: Maximum parsimony, Maximum likelihood
- Bayesian methods: MCMC

Page 28: Perl for Phyloinformatics

What is homology?

Example: forelimbs

Definition

Homology means any similarity between characters that is due to their shared ancestry.

Anatomical structures that evolved from the same structure in some ancestor species are homologous.

In genetics, homology can be observed in aligned DNA sequences.

Page 29: Perl for Phyloinformatics

What is an “optimality criterion”?

An optimality criterion is simply a way to quantify, using a number, how well a tree fits the data relative to other trees.

Examples are parsimony tree length (this is how the Supertree was optimized on the CIPRES cluster) and likelihood score.

The posterior probability can also be seen as an optimality criterion.

Page 30: Perl for Phyloinformatics

Parsimony tree length

Tree length is the minimum number of reconstructed changes.

The most parsimonious tree is the tree with the fewest changes.
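To make tree length concrete, here is a minimal sketch that counts the minimum number of changes for one character on a fixed tree. It uses Fitch's algorithm, which is an assumption on my part (the slides do not name an algorithm), and it expects unordered character states and a bifurcating tree written as nested tuples. The tip states are invented.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum number of changes)."""
    if isinstance(tree, str):  # a tip: its observed state, no changes
        return {states[tree]}, 0
    left_set, left_changes = fitch(tree[0], states)
    right_set, right_changes = fitch(tree[1], states)
    shared = left_set & right_set
    if shared:  # children can agree: no extra change needed
        return shared, left_changes + right_changes
    # children disagree: one change is required on this node's branches
    return left_set | right_set, left_changes + right_changes + 1


def tree_length(tree, states):
    """Parsimony length of a single character on `tree`."""
    return fitch(tree, states)[1]


# Invented tip states for a single DNA site.
site = {"A": "G", "B": "G", "C": "T", "D": "T"}

print(tree_length((("A", "B"), ("C", "D")), site))
print(tree_length((("A", "C"), ("B", "D")), site))
```

The first tree groups tips with equal states and needs fewer changes than the second, so it is the more parsimonious of the two for this site. Summing over all sites of an alignment gives the total tree length.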

Page 31: Perl for Phyloinformatics

Finding the optimal tree

Under an optimality criterion, trees need to be compared with one another to find the one with the best score: the shortest tree length under parsimony, or the highest likelihood under maximum likelihood.

When we talk about MP and ML trees, this is usually done with hill-climbing algorithms.

Page 32: Perl for Phyloinformatics

…but this is not the whole story!

Maximum Parsimony assumes a very simple model for evolutionary change – namely that change is rare.

Molecular evolution in particular can be modeled in more realistic ways, using substitution models.

There are more complex ways to explore tree space than just hill-climbing (such as the Parsimony Ratchet).

We can also sample different areas of tree space to see how optimality is distributed, using MCMC.

Page 33: Perl for Phyloinformatics

Substitution models

Page 34: Perl for Phyloinformatics

Base frequencies and substitution rates

Page 35: Perl for Phyloinformatics

Additional parameters

Gamma distribution

Invariant sites

Perhaps some sites never change.

Maybe specify their proportion?

Page 36: Perl for Phyloinformatics

Likelihood and the number of parameters

More parameters always lead to a better fit to the data

Page 37: Perl for Phyloinformatics

Likelihood and the number of parameters

More parameters always lead to a higher value of the likelihood, whether or not the additional parameters provide a ‘significantly’ better fit to the data

Page 38: Perl for Phyloinformatics

Are the extra parameters justified?

Likelihood ratio statistic:

$$2 \log\left(\frac{\text{Maximum Likelihood} \mid H_1}{\text{Maximum Likelihood} \mid H_0}\right)$$

Has chi-squared distribution

dof = number of additional parameters

(We did this with ModelTest)
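A likelihood ratio test of this kind can be sketched with the standard library alone. The log-likelihood values below are invented for illustration, and the function names are my own. The chi-squared tail probability is computed from its closed form for integer degrees of freedom; this is not how ModelTest itself is implemented.

```python
import math


def chi2_sf(x, dof):
    """Upper tail probability of a chi-squared variable with integer dof >= 1."""
    y = x / 2.0
    if dof % 2 == 0:
        a, q = 1.0, math.exp(-y)  # closed form for dof = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))  # closed form for dof = 1
    while a < dof / 2.0:  # add one term per extra two degrees of freedom
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def likelihood_ratio_test(log_l0, log_l1, extra_parameters):
    """Return (statistic, p-value) for nested models H0 (simpler) and H1."""
    statistic = 2.0 * (log_l1 - log_l0)
    return statistic, chi2_sf(statistic, extra_parameters)


# Hypothetical log-likelihoods: H1 has one parameter more than H0.
statistic, p_value = likelihood_ratio_test(-1234.5, -1230.0, 1)
print(statistic, p_value)
```

Because the inputs are log-likelihoods, the ratio inside the logarithm becomes a difference. A small p-value suggests that the extra parameter improves the fit by more than chance alone would explain.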

Page 39: Perl for Phyloinformatics

How did we use the substitution models?

Each substitution has an associated likelihood given a branch of a certain length and the estimated model parameters.

A function is derived to represent the likelihood of the data given the tree, branch-lengths and additional parameters.

Optimise the branch lengths to get the maximum likelihood estimate.
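For the smallest possible case, two sequences joined by a single branch, these three steps can be sketched as follows. The Jukes-Cantor (JC69) model is assumed here only because it is the simplest substitution model; the slides do not specify one. The sequences are invented, and the ternary search is a stand-in for the numerical optimisers that real programs use.

```python
import math


def jc69_log_likelihood(seq_a, seq_b, t):
    """Log-likelihood of two aligned sequences separated by branch length t."""
    decay = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * decay  # probability a site shows no change
    p_diff = 0.25 - 0.25 * decay  # probability of one specific change
    total = 0.0
    for x, y in zip(seq_a, seq_b):
        # 0.25 is the equilibrium frequency of the starting base
        total += math.log(0.25 * (p_same if x == y else p_diff))
    return total


def optimise_branch_length(seq_a, seq_b, low=1e-6, high=5.0, iterations=200):
    """Ternary search for the branch length that maximises the likelihood."""
    for _ in range(iterations):
        m1 = low + (high - low) / 3.0
        m2 = high - (high - low) / 3.0
        if jc69_log_likelihood(seq_a, seq_b, m1) < jc69_log_likelihood(seq_a, seq_b, m2):
            low = m1
        else:
            high = m2
    return (low + high) / 2.0


# Invented sequences that differ at 2 of 10 sites.
first, second = "ACGTACGTAC", "ACGTACGTTA"
print(optimise_branch_length(first, second))
```

For this model the optimum can also be written in closed form as -3/4 ln(1 - 4p/3), where p is the proportion of differing sites, which gives a convenient check on the numerical search. On a full tree the same idea applies to every branch at once.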

Page 40: Perl for Phyloinformatics

Estimating node ages

Page 41: Perl for Phyloinformatics

Rate smoothing

r8s methods attempt to simultaneously estimate unknown divergence times and smooth the rapidity of rate change along lineages.

This is done by invoking some function that penalizes rates that change too quickly from branch to neighboring branch.
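One simple form such a penalty could take is the sum of squared rate differences between each branch and its parent branch. This is a sketch of the general idea only, not the exact objective function used by r8s; the branch names and rates are hypothetical.

```python
def roughness_penalty(rates, parent_of):
    """Sum of squared rate differences between each branch and its parent."""
    return sum(
        (rates[child] - rates[parent]) ** 2
        for child, parent in parent_of.items()
    )


# Hypothetical rates (substitutions per site per unit time) on three branches.
rates = {"stem": 1.0, "left": 1.0, "right": 3.0}
parent_of = {"left": "stem", "right": "stem"}

print(roughness_penalty(rates, parent_of))
```

A smoothing method would search for rates and divergence times that fit the branch lengths well while keeping a penalty like this one small.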

Page 42: Perl for Phyloinformatics

Given a cladogram, how do we infer the divergence dates of the true tree?

The relative lengths of some branches can be obtained from genes that fit an MLK model.

[Figure: the supertree (vertical axis labeled "NOT time") shown alongside trees with tips A B C D E; A C E; and A B D E.]

Page 43: Perl for Phyloinformatics

Estimates from multiple molecular sequences can subsequently be combined by calibrating the gene trees on a common node, and applying the resulting node depths to the supertree.

[Figure: the “true tree” (vertical axis labeled "time"; label "Simmons Hackman") shown alongside trees with tips A C E; A B D E; and A B E D C.]

Page 44: Perl for Phyloinformatics

Where did we get the other dates?

If there is no extinction and constant speciation (!), the expected waiting time from one speciation event to the next is 1/n, where n=number of lineages.

This is a little more complicated if we take multiple labeled histories into account…

…but we can come up with expected ages this way.
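The 1/n rule can be turned into expected times with a short calculation. The sketch below assumes a pure-birth process and adds an explicit per-lineage speciation rate, which the slide implicitly sets to 1. It ignores the complication of labeled histories mentioned above, and the function names are illustrative.

```python
from fractions import Fraction


def expected_waiting_time(n_lineages, rate=1):
    """Expected time to the next speciation event with n lineages present."""
    return Fraction(1, n_lineages) / rate


def expected_time_to_grow(start, end, rate=1):
    """Expected time for the tree to grow from `start` to `end` lineages."""
    return sum(
        (expected_waiting_time(n, rate) for n in range(start, end)),
        Fraction(0),
    )


# Waiting times shrink as the number of lineages grows.
for n in range(2, 6):
    print(n, expected_waiting_time(n))

print(expected_time_to_grow(2, 4))  # 1/2 + 1/3
```

Using `Fraction` keeps the sums exact, so the shrinking intervals between successive speciation events are easy to read off.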

Page 45: Perl for Phyloinformatics

PhyloInformatics

Page 46: Perl for Phyloinformatics

What is PhyloInformatics?

A made-up word!

We’ve seen we have to deal with data of different types (trees, sequences, alignments, metadata).

These are part of complex workflows or pipelines.

We “do” phyloinformatics when we come up with repeatable ways to automate these pipelines.

Page 47: Perl for Phyloinformatics

The power of UNIX

UNIX is very useful for phyloinformatics:

Everything is text-based

Everything can be scripted and called from other programs

Many programs for phylogenetics are available on UNIX platforms

Everything can be piped together to create larger workflows

Page 48: Perl for Phyloinformatics

The power of Perl

Perl allows us to chain other UNIX tools together

Many perl libraries exist for dealing with biological data

Easy to learn, quick to develop

Page 49: Perl for Phyloinformatics

Join us!

We do a lot more phyloinformatics:

- Hackathons
- Google Summer of Code
- Ongoing projects

Stay in touch, we can help each other!

Page 50: Perl for Phyloinformatics

谢谢! Thank you!