rutger-vos
Course slides for computational phyloinformatics, an annual course organized by NESCent in collaboration with hosting organizations across the world. I teach the Perl section of the course; these are the slides I presented in 2010 at BGI, Shenzhen, PRC.
Perl for PhyloInformatics
What did we learn? What did we do?
Tree Concepts
What are phylogenetic trees?
Phylogenetic Trees
Describe the historical relationships among lineages of organisms or their parts, such as their genes.
Tree terminology
[Figure: tree with tips A to F, labeled with terminal nodes or tips, internal nodes, operational taxonomic units (OTU) / taxa, sisters, root, and branches]
Interpreting phylogenies
These trees are the same shape
Rooted vs. unrooted trees
[Figure: a rooted tree (root marked) and an unrooted tree, each with tips A to F]
Rooted tree: Has a root that denotes common ancestry
Unrooted tree: Only specifies the degree of kinship among taxa but not the evolutionary path
Rooted and unrooted trees
The number of rooted and unrooted trees for n species is $N_R = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$ and $N_U = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$
Species   Rooted Trees                     Unrooted Trees
2         1                                1
3         3                                1
4         15                               3
5         105                              15
10        34,459,425                       2,027,025
15        213,458,046,676,875              7,905,853,580,625
20        8,200,794,532,637,891,559,375    221,643,095,476,699,771,875
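The counts can be checked by evaluating the two formulas directly. This is a minimal sketch; the function names are made up for illustration, and returning 1 for fewer than three species in the unrooted case is an assumption that follows the table, since the formula itself needs n >= 3.

```python
from math import factorial


def rooted_trees(n):
    """Number of rooted binary trees for n species: (2n-3)! / (2^(n-2) (n-2)!)."""
    if n < 2:
        raise ValueError("need at least 2 species")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def unrooted_trees(n):
    """Number of unrooted binary trees for n species: (2n-5)! / (2^(n-3) (n-3)!)."""
    if n < 2:
        raise ValueError("need at least 2 species")
    if n < 3:
        # Assumption: the formula is undefined here, the table lists 1 tree.
        return 1
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


if __name__ == "__main__":
    for n in (2, 3, 4, 5, 10, 15, 20):
        print(n, rooted_trees(n), unrooted_trees(n))
```

Python integers do not overflow, so the 20-species values are computed exactly.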
A simple example
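Why more rooted than unrooted?
On an unrooted tree, the root can be placed on any of the branches. An unrooted tree of n species has 2n - 3 branches, so each unrooted tree corresponds to 2n - 3 rooted trees.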
Trees and classification
Monophyletic
A monophyletic group is a group of organisms which forms a clade, meaning that it consists of an ancestor and all its descendants.
(Most clades on our Supertree are monophyletic.)
Paraphyletic
A paraphyletic group consists of an ancestor and some, but not all, of its descendants. It is not a clade, because it excludes species that share the common ancestor with its members.
Polyphyletic
A polyphyletic group is one whose members' most recent common ancestor is not a member of the group.
Example: birds and reptiles
Reptiles, without the birds, form a paraphyletic group.
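Whether a group of tips is monophyletic on a given tree can be tested mechanically: the group is monophyletic if some node has exactly those tips below it. The sketch below assumes a tree written as nested Python tuples with strings as tips. The toy topology is a simplified assumption made up for illustration, not a tree from the course.

```python
def tips(node):
    """Return the set of tip labels below a node (a tip is a plain string)."""
    if isinstance(node, str):
        return {node}
    return set().union(*(tips(child) for child in node))


def is_monophyletic(tree, group):
    """True if some node in the tree has exactly `group` as its set of tips."""
    group = set(group)
    if tips(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)


# Hypothetical, simplified topology used only to illustrate the test.
toy_tree = ("mammals", ("turtles", (("lizards", "snakes"), ("crocodiles", "birds"))))

reptiles_without_birds = {"turtles", "lizards", "snakes", "crocodiles"}
reptiles_with_birds = reptiles_without_birds | {"birds"}

if __name__ == "__main__":
    print(is_monophyletic(toy_tree, reptiles_without_birds))
    print(is_monophyletic(toy_tree, reptiles_with_birds))
```

The test only says whether a group is a clade. Telling paraphyletic from polyphyletic groups additionally needs a statement about which ancestors belong to the group.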
Change and time
[Figure: example tree(s) with tips A to F]
Phylograms: Branch lengths are proportional to the amount of change that occurred on that branch (these are the gene trees before r8s).
Cladograms: Branch lengths are not proportional to the amount of change (this is the Supertree from Monday).
Cladograms and phylograms
Ultrametric trees
If the distance from the root represents time (not change) we can use trees to study how fast new species form.
(This is our final tree after we put it all together.)
Types of data
What evidence are phylogenetic trees based on?
Distance data
Example: DNA-DNA hybridization. The more closely related two species are, the more similar their DNA. The more similar the DNA, the stronger the bond between the two strands, and the shorter the distance.
Morphological characters
Example: the shape of spider webs.
Molecular sequence data
I am sure you have all heard about DNA sequencing. Amino acid sequences are often used for more distantly related species.
Types of Data
Two categories:
Numerical data
  Evolutionary distance between two species
  Usually derived from sequence data
Character data
  Each character has a finite number of states
  E.g. number of legs = 1, 2, 4; DNA = {A, C, T, G}
Tree reconstruction
Distance methods
Types of data
Distance matrices:
  DNA-DNA hybridization
  Computed from sequences
Examples
  UPGMA is the oldest distance matrix method
  Neighbor-joining is more commonly used
Distance data
When using sequences, distance-based methods must first transform the sequence data into a pairwise distance matrix, which is then used during tree inference
Species   A   B   C   D
B         2   -   -   -
C         4   5   -   -
D         7   9   5   -
E         3   5   7   8
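One simple way to get such a matrix from aligned sequences is the proportion of sites that differ (the p-distance). The sketch below assumes that measure; the slides do not say which distance was used, and the sequences are made up for illustration.

```python
def p_distance(seq1, seq2):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for a, b in zip(seq1, seq2) if a != b)
    return differences / len(seq1)


def distance_matrix(alignment):
    """Lower-triangle distances, keyed by (row_name, column_name)."""
    names = list(alignment)
    return {
        (row, col): p_distance(alignment[row], alignment[col])
        for i, row in enumerate(names)
        for col in names[:i]
    }


# Hypothetical aligned sequences, for illustration only.
example_alignment = {
    "A": "ACGTACGT",
    "B": "ACGTACGA",
    "C": "ACGAACGA",
}

if __name__ == "__main__":
    for pair, distance in distance_matrix(example_alignment).items():
        print(pair, distance)
```

In practice the raw proportion is usually corrected with a substitution model, because multiple changes at the same site hide each other.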
Neighbor-Joining Methods
1. Maintain a pairwise distance matrix
2. Find the closest two taxa (neighbor-joining first adjusts each distance for how far the two taxa are from all other taxa)
3. Collapse them into one row (internal node) and recompute the distance from the merged row to every other row
4. Loop to 2
5. Build tree as you go
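The loop above can be sketched in a few lines. To keep it short, this sketch joins the pair with the smallest raw distance and averages distances when merging, which is UPGMA-style clustering rather than full neighbor-joining (neighbor-joining picks the pair using rate-corrected distances). The function name and the tuple representation of the tree are assumptions made for illustration; the input is the example matrix from the distance data slide.

```python
def cluster(distances):
    """Agglomerative clustering; `distances` maps frozenset({x, y}) to a distance."""
    dist = dict(distances)
    size = {name: 1 for pair in dist for name in pair}
    while len(size) > 1:
        # Step 2: find the closest two clusters.
        x, y = sorted(min(dist, key=dist.get), key=str)
        merged = (x, y)  # Step 5: the tree grows as nested tuples.
        new_size = size[x] + size[y]
        # Step 3: distance from the merged cluster to every other cluster,
        # averaged over the members of the two merged clusters.
        for other in size:
            if other in (x, y):
                continue
            d_x = dist[frozenset((x, other))]
            d_y = dist[frozenset((y, other))]
            dist[frozenset((merged, other))] = (size[x] * d_x + size[y] * d_y) / new_size
        # Drop every entry that still refers to x or y on its own.
        dist = {pair: d for pair, d in dist.items() if x not in pair and y not in pair}
        del size[x], size[y]
        size[merged] = new_size
    return next(iter(size))


# The example matrix from the distance data slide.
rows = {
    "B": {"A": 2},
    "C": {"A": 4, "B": 5},
    "D": {"A": 7, "B": 9, "C": 5},
    "E": {"A": 3, "B": 5, "C": 7, "D": 8},
}
example = {frozenset((r, c)): d for r, cols in rows.items() for c, d in cols.items()}

if __name__ == "__main__":
    print(cluster(example))
```

On this matrix A and B are joined first, then E joins them, and C and D form the other cluster.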
Character methods
Types of data
Any homologized data:
  Morphological data
  Molecular sequences
Examples
  Optimality-criterion methods: Maximum parsimony, Maximum likelihood
  Bayesian methods: MCMC
What is homology?
Example: forelimbs
Definition
Homology means any similarity between characters that is due to their shared ancestry.
Anatomical structures that evolved from the same structure in some ancestor species are homologous.
In genetics, homology can be observed in aligned DNA sequences.
What is an “optimality criterion”?
An optimality criterion is simply a way to quantify, using a number, how well a tree fits the data relative to other trees.
Examples are parsimony tree length (this is how the Supertree was optimized on the CIPRES cluster) and likelihood score.
The posterior probability can also be seen as an optimality criterion.
Parsimony tree length
Tree length is the minimum number of reconstructed changes.
The most parsimonious tree is the tree with the fewest changes.
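For a single tree, the minimum number of changes can be counted with Fitch's algorithm, which works upward from the tips and adds one change whenever two sister subtrees share no state. The sketch below assumes a strictly binary tree written as nested tuples and unordered characters; the function names and the toy data are made up for illustration.

```python
def fitch(node, states):
    """Return (possible states at this node, changes needed below it)."""
    if isinstance(node, str):
        return {states[node]}, 0
    left, right = node
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    shared = left_set & right_set
    if shared:
        # The two subtrees can agree on a state: no extra change.
        return shared, left_changes + right_changes
    # No state in common: one change is needed on this pair of branches.
    return left_set | right_set, left_changes + right_changes + 1


def tree_length(tree, characters):
    """Sum of the minimum number of changes over all characters."""
    return sum(fitch(tree, character)[1] for character in characters)


# Hypothetical data: one character scored for four taxa.
character = {"A": "G", "B": "G", "C": "T", "D": "T"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))

if __name__ == "__main__":
    print(tree_length(tree_1, [character]))
    print(tree_length(tree_2, [character]))
```

Here tree_1 needs one change and tree_2 needs two, so tree_1 is the more parsimonious of the two for this character.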
Finding the optimal tree
Under an optimality criterion, trees need to be compared with one another to find the one with the best score (the shortest tree length under parsimony, the highest likelihood under maximum likelihood).
When we talk about MP and ML trees, this is usually done with hill-climbing algorithms.
…but this is not the whole story!
Maximum Parsimony assumes a very simple model for evolutionary change – namely that change is rare.
Molecular evolution in particular can be modeled in more realistic ways, using substitution models.
There are more complex ways to explore tree space than just hill-climbing (such as the Parsimony Ratchet).
We can also sample different areas of tree space to see how optimality is distributed, using MCMC.
Substitution models
Base frequencies and substitution rates
Additional parameters
Gamma distribution
Invariant sites
Perhaps some sites never change.
Maybe specify their proportion?
Likelihood and the number of parameters
More parameters always lead to a better fit to the data
More parameters always lead to a likelihood that is at least as high, whether or not the additional parameters are providing a ‘significantly’ better fit to the data
Are the extra parameters justified?
Likelihood ratio statistic: $2 \log\left(\frac{\text{Maximum Likelihood} \mid H_1}{\text{Maximum Likelihood} \mid H_0}\right)$
Has chi-squared distribution
dof = number of additional parameters
(We did this with ModelTest)
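A minimal sketch of the test itself, using only the standard library. The function names and the log-likelihood values are made up for illustration; this is not how ModelTest is implemented. Working with log-likelihoods, the statistic is twice the difference between the two maximized values.

```python
from math import erfc, exp, gamma, sqrt


def chi2_sf(x, dof):
    """P(X > x) for a chi-squared variable with an integer number of dof."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if dof % 2 == 0:
        q, a = exp(-half), 1.0         # closed form for 2 dof
    else:
        q, a = erfc(sqrt(half)), 0.5   # closed form for 1 dof
    # Each pass adds 2 dof: Q(a + 1, h) = Q(a, h) + h^a e^(-h) / Gamma(a + 1).
    while a < dof / 2.0:
        q += half ** a * exp(-half) / gamma(a + 1.0)
        a += 1.0
    return q


def likelihood_ratio_test(lnl_h0, lnl_h1, extra_parameters):
    """Return (statistic, p-value) for a simpler model H0 nested in H1."""
    statistic = 2.0 * (lnl_h1 - lnl_h0)
    return statistic, chi2_sf(statistic, extra_parameters)


if __name__ == "__main__":
    # Hypothetical log-likelihoods for a simpler and a richer model.
    statistic, p_value = likelihood_ratio_test(-1234.5, -1230.2, 1)
    print(statistic, p_value)
```

A small p-value suggests the extra parameters are justified; a large one suggests the simpler model is adequate.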
How did we use the substitution models?
Each substitution has an associated likelihood given a branch of a certain length and the estimated model parameters.
A function is derived to represent the likelihood of the data given the tree, branch-lengths and additional parameters.
Optimise the branch lengths to get the maximum likelihood estimate.
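The idea can be shown on the smallest possible case: two sequences joined by a single branch. The sketch below assumes the Jukes-Cantor model (equal base frequencies, one substitution rate), which is chosen here only because it is the simplest substitution model, and uses a ternary search as a stand-in optimizer. The function names and site counts are made up for illustration.

```python
from math import exp, log


def jc_log_likelihood(t, n_sites, n_diff):
    """Log-likelihood of two aligned sequences separated by branch length t."""
    decay = exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * decay   # site shows the same base at both ends
    p_diff = 0.25 - 0.25 * decay   # site shows one specific different base
    # Each site also contributes the base frequency 1/4 of its first base.
    return (n_sites - n_diff) * log(0.25 * p_same) + n_diff * log(0.25 * p_diff)


def ml_branch_length(n_sites, n_diff, low=1e-6, high=5.0, iterations=200):
    """Ternary search for the branch length that maximizes the likelihood."""
    for _ in range(iterations):
        third = (high - low) / 3.0
        left, right = low + third, high - third
        if jc_log_likelihood(left, n_sites, n_diff) < jc_log_likelihood(right, n_sites, n_diff):
            low = left
        else:
            high = right
    return (low + high) / 2.0


if __name__ == "__main__":
    # Hypothetical data: 100 aligned sites, 10 of which differ.
    print(ml_branch_length(100, 10))
```

For this two-sequence case the answer is also known in closed form, t = -(3/4) ln(1 - 4p/3) with p the proportion of differing sites, which makes a handy check on the optimizer. On a real tree all branch lengths and model parameters are optimized together.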
Estimating node ages
Rate smoothing
r8s methods attempt to simultaneously estimate unknown divergence times and smooth the rapidity of rate change along lineages.
This is done by invoking some function that penalizes rates that change too quickly from branch to neighboring branch.
Given a cladogram, how do we infer the divergence dates of the true tree?
[Figure: supertree; axis: time]
The relative lengths of some branches can be obtained from genes that fit an MLK model.
[Figure: trees with tips A B C D E, A C E, and A B D E; labels: “true tree”, Simmons, Hackman; axis: time]
Estimates from multiple molecular sequences can subsequently be combined by calibrating the gene trees on a common node, and applying the resulting node depths to the supertree.
[Figure: trees with tips A C E, A B D E, and A B E D C]
Where did we get the other dates?
If there is no extinction and constant speciation (!), the expected waiting time from one speciation event to the next is 1/n, where n = number of lineages (measured in time units in which each lineage speciates once per unit of time).
This is a little more complicated if we take multiple labeled histories into account…
…but we can come up with expected ages this way.
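A minimal sketch of the simple version, before labeled histories are taken into account. It takes the 1/n waiting time at face value and assumes that the most recent interval, with all tips present, also lasts its full expected time; both are simplifying assumptions. The function name and the `rate` parameter (speciation rate per lineage) are made up for illustration.

```python
def expected_node_ages(n_tips, rate=1.0):
    """Expected age of each split, keyed by the number of lineages before it.

    ages[k] is the expected age of the speciation event that takes the
    tree from k to k + 1 lineages, counting backwards from the present.
    """
    ages = {}
    age = 0.0
    for lineages in range(n_tips, 1, -1):
        # While `lineages` lineages exist, the expected wait is 1 / (lineages * rate).
        age += 1.0 / (lineages * rate)
        ages[lineages - 1] = age
    return ages


if __name__ == "__main__":
    for lineages_before, age in sorted(expected_node_ages(5).items()):
        print(lineages_before, round(age, 4))
```

The entry with key 1 is the expected age of the root. Waiting times get shorter toward the present because more lineages are available to speciate.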
PhyloInformatics
What is PhyloInformatics?
A made-up word!
We’ve seen we have to deal with data of different types (trees, sequences, alignments, metadata).
These are part of complex workflows or pipelines.
We “do” phyloinformatics when we come up with repeatable ways to automate these pipelines.
The power of UNIX
UNIX is very useful for phyloinformatics:
Everything is text-based
Everything can be scripted and called from other programs
Many programs for phylogenetics are available on UNIX platforms
Everything can be piped together to create larger workflows
The power of Perl
Perl allows us to chain other UNIX tools together
Many Perl libraries exist for dealing with biological data
Easy to learn, quick to develop
Join us!
We do a lot more phyloinformatics:
  Hackathons
  Google Summer of Code
  Ongoing projects
Stay in touch, we can help each other!
谢谢! Thank you!