• Phylogeny is the inference of evolutionary relationships
• All forms of life share a common origin.
– to deduce the correct trees for all species of life
– to estimate the time of divergence between organisms since they last shared a common ancestor
Terminology
• Phylogenetic trees are used to assess the relationships of homologous proteins (or nucleotide sequences) in a family
OTU or external node
Internal node
Branch
Bifurcating node
Clade
Phylogram
Terminology
Species tree versus gene tree
• In a species tree an internal node represents a speciation event
• In a gene tree an internal node represents the divergence of an ancestral gene into two new genes with distinct sequences
• Species tree ≠ gene tree
– horizontal gene transfer
– gene duplications
Phylogenetic inference
1. Selection of sequences for analysis
2. Multiple sequence alignment
3. Tree building
4. Tree evaluation
1. Selection of sequences for analysis
DNA:
– Higher phylogenetic signal:
• Synonymous vs nonsynonymous substitutions (detect negative and positive selection)
Protein:
– Phylogenetic signal less predominant than in DNA
– Better suited to constructing a tree for evolutionarily distant species or genes
RNA: rRNA often used for constructing species trees
Phylogenetic inference
2. Multiple sequence alignment
• This is a critical step in the analysis, as in many cases the alignment of amino acids or nucleotides in a column implies that they share a common ancestor
• If you misalign a group of sequences you will still be able to produce a tree. However, it is not likely to be biologically meaningful.
Crap in is crap out!
• Inspect the alignment to be sure that all sequences are homologous
• Sometimes ClustalW does not align distantly related sequences well. Try different gap opening and gap extension penalties to improve the alignment
• Only use those columns of the multiple alignment for which you have data for all organisms or sequences. Delete the columns for which this is not the case.
• Delete columns with gaps
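The column filtering described above can be sketched in a few lines of Python. This is a minimal illustration, not a replacement for an alignment editor: the function name `drop_gap_columns`, the choice of `-` and `.` as gap symbols, and the toy alignment are all assumptions made for the example.

```python
def drop_gap_columns(alignment, gap_chars="-."):
    """Return the alignment with every column that contains a gap removed.

    `alignment` is a list of equal-length aligned sequences (strings).
    """
    if len({len(seq) for seq in alignment}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    # zip(*alignment) walks the alignment column by column.
    kept = [col for col in zip(*alignment)
            if not any(ch in gap_chars for ch in col)]
    if not kept:
        return ["" for _ in alignment]
    # Transpose the surviving columns back into sequences.
    return ["".join(chars) for chars in zip(*kept)]


toy = ["ACG-TA",
       "AC-GTA",
       "ACGGTA"]
print(drop_gap_columns(toy))  # ['ACTA', 'ACTA', 'ACTA']
```

Columns 3 and 4 each contain a gap in one sequence, so only four columns survive.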
Phylogenetic inference
3. Tree building
Character-based methods versus non-character based methods
• Methods based on an explicit model of evolution
– Character-based: Maximum Likelihood Methods/Bayesian Phylogeny
– Non-character based: Pairwise distance methods
• Methods not based on an explicit model of evolution
– Character-based: Maximum Parsimony Methods
Distance based methods
Distance based methods:
– calculate the distances between molecular sequences using some distance metric
– A clustering method (UPGMA, neighbour joining) is used to infer the tree from the pairwise distance matrix
– treat the sequence from a horizontal perspective, by calculating a single distance between entire sequences
Advantage:
• Fast
• Allow using evolutionary models
Disadvantage:
• sequences reduced to one number
Character based methods
Character based methods:
– treat the sequences from a vertical perspective
– for each column of the alignment, they search for the simplest explanation of how the characters evolved
– For instance, MP involves a search for the tree with the fewest amino acid (or nucleotide) character changes that account for the observed differences between the protein (gene) sequences.
Phylogenetic inference
4. Tree evaluation: bootstrapping
• sampling technique for estimating the statistical error in situations where the underlying sampling distribution is unknown
• evaluating the reliability of the inferred tree - or better the reliability of specific branches
How to proceed:
• Columns of the original sequence alignment are chosen at random (sampling with replacement)
• a new alignment is constructed with the same size as the original one
• a tree is constructed
This process is repeated hundreds of times
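The resampling step can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the fixed random seed and the toy alignment are invented for the example, and the tree-building step that would follow each replicate is left out.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Build one pseudo-replicate: draw columns with replacement until the
    new alignment has as many columns as the original one."""
    n_cols = len(alignment[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(seq[i] for i in picks) for seq in alignment]


def bootstrap_replicates(alignment, n_replicates=100, seed=0):
    """Return a list of resampled alignments (seed fixed for reproducibility)."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


toy = ["ACGTAC",
       "ACGTTC",
       "ATGTAC"]
replicates = bootstrap_replicates(toy, n_replicates=100)
print(len(replicates), len(replicates[0][0]))  # 100 6
```

Each replicate would then be passed to the same tree-building method as the original alignment. The bootstrap value of a branch is the percentage of replicate trees in which that grouping reappears.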
Phylogenetic inference
Show bootstrap values on phylogenetic trees
• majority-rule consensus tree
• map bootstrap values on the original tree
Maximum parsimony
Principle
• Select the tree that minimizes the total tree length, i.e. the number of nucleic acid substitutions or amino acid replacements required to explain a given set of data.
Method
• a particular topology is considered
• for this topology, the ancestral sequences at each branching point are reconstructed
• the minimum number of events to explain the sequence differences over the whole tree is computed: the minimum number of substitutions is computed for each nucleotide (or amino acid) site, and the numbers for all sites are added.
• another tree topology is chosen
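As an illustration of scoring one topology, the sketch below uses Fitch's algorithm, a standard way to obtain the minimum number of changes per site on a fixed bifurcating tree. Fitch's algorithm is not named in the slides and is brought in here as an assumption. The nested-tuple tree encoding, the function names and the toy alignment are also invented for the example.

```python
def fitch_site(tree, states):
    """Return (candidate ancestral states, minimum changes) for one site.

    `tree` is a leaf name (str) or a pair (left, right) of subtrees;
    `states` maps each leaf name to its character at this site.
    """
    if isinstance(tree, str):                 # leaf: the observed character
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:                                # children agree: no extra change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def tree_length(tree, alignment):
    """Sum the per-site minimum number of changes over all sites."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(n_sites)
    )


toy = {"A": "AAG", "B": "AAG", "C": "GAT", "D": "GAT"}
print(tree_length((("A", "B"), ("C", "D")), toy))  # 2
print(tree_length((("A", "C"), ("B", "D")), toy))  # 4
```

The first topology needs fewer changes, so it is the more parsimonious of the two.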
Maximum parsimony
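Number of possible bifurcating tree topologies for n OTUs:

Rooted: $N_R = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$
Unrooted: $N_U = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$

OTUs    rooted tree topologies    unrooted tree topologies
3       3                         1
4       15                        3
5       105                       15
6       945                       105
7       10395                     945
8       135135                    10395
9       2027025                   135135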
• Exhaustive search impossible
• Heuristics needed
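How quickly the number of topologies grows can be checked numerically. The sketch below uses the standard closed-form counts for strictly bifurcating trees, (2n-3)! / (2^(n-2) (n-2)!) for rooted and (2n-5)! / (2^(n-3) (n-3)!) for unrooted trees. The function names are chosen for this example only.

```python
from math import factorial


def n_rooted(n):
    """Number of rooted bifurcating topologies for n >= 2 OTUs."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def n_unrooted(n):
    """Number of unrooted bifurcating topologies for n >= 3 OTUs."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in range(3, 11):
    print(n, n_rooted(n), n_unrooted(n))
```

For 10 OTUs there are already 34459425 rooted topologies, which is why an exhaustive search quickly becomes impractical.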
Maximum parsimony
• Find different tree topologies that are 'equally parsimonious'
• Represent results as a consensus tree.
– 'strict' consensus tree
– 'majority-rule' consensus tree
Maximum parsimony
Only informative sites of the alignment are used in the construction of the tree: a site is informative when it contains at least two different kinds of characters, each represented at least two times.
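The rule for informative sites translates directly into code. The following is a minimal sketch that assumes a gap-free alignment. The function names and the toy alignment are invented for the example.

```python
from collections import Counter


def is_informative(column):
    """True when at least two different characters each occur at least twice."""
    counts = Counter(column)
    return sum(1 for c in counts.values() if c >= 2) >= 2


def informative_sites(alignment):
    """Return the 0-based indices of the parsimony-informative columns."""
    return [i for i, col in enumerate(zip(*alignment)) if is_informative(col)]


toy = ["AAGA",
       "AAGT",
       "AGTC",
       "AGTG"]
print(informative_sites(toy))  # [1, 2]
```

Column 0 is constant and column 3 has four different characters that each occur once, so neither can favour one topology over another.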
Maximum parsimony
Parsimony trees are usually represented only as a tree topology (cladogram): sometimes the parsimony program cannot decide on which branches the substitutions have taken place, so it cannot calculate branch lengths.
Maximum parsimony
Assumptions
• Equal rate of evolution in all branches
Advantages
• sequence information is not reduced to one number (such as for example in pairwise distance methods)
Disadvantages of maximum parsimony methods
• can be slow for very large datasets
• no correction for multiple mutations, i.e. no substitution model can be applied (see further)
• sensitive to unequal rates of evolution in different lineages (see further) => long branch attraction (an example of this?)
Pairwise distance methods
Approach:
• align pairs of sequences and count the number of differences (Hamming distance).
• For an alignment of length N with n sites at which there are differences: D = (n/N) × 100.
Problem:
• observed differences ≠ actual genetic distances between the sequences.
=> the observed dissimilarity underestimates the true evolutionary distance, because some sequence positions have undergone multiple substitution events
Solution:
• Use an evolutionary model that corrects for multiple mutations
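The approach and the correction can be sketched together. The slides do not name a specific evolutionary model, so the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3), is used here as an assumption, being the simplest nucleotide model. The function names and the two toy sequences are also invented for the example.

```python
from math import log


def p_distance(seq1, seq2):
    """Proportion of aligned sites that differ (n / N)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    n_diff = sum(a != b for a, b in zip(seq1, seq2))
    return n_diff / len(seq1)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance in substitutions per site.

    Only defined for p < 0.75, the saturation level for four nucleotides.
    """
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor correction")
    return -0.75 * log(1 - 4 * p / 3)


s1 = "ACGTACGTAC"
s2 = "ACGTTCGTAA"
p = p_distance(s1, s2)
print(f"D = {100 * p:.0f}%")                    # D = 20%
print(f"corrected d = {jukes_cantor(p):.4f}")  # corrected d = 0.2326
```

The corrected distance is larger than the observed proportion of differences, and the gap widens as the sequences become more divergent.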
Distance calculation
Pairwise distance methods
• Ultrametric trees are rooted trees in which all the end nodes are equidistant from the root of the tree
• This assumes a molecular clock, i.e. that all sequences evolve at a similar rate
Tree inference: UPGMA
Pairwise distance methods
• when two OTUs are grouped, we treat them as a new single OTU
• when OTUs A, B (which have been grouped before) and C are grouped into a new node ‘u’, then the distance from node ‘u’ to any other node ‘k’ (e.g. grouping D and E) is simply computed as follows:
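A compact sketch of the grouping procedure is given below. It assumes the usual UPGMA update, in which the distance from a new node to any other node is the average of the two merged groups' distances, weighted by the number of OTUs in each group. The function name, the `frozenset`-keyed distance dictionary and the toy distances are assumptions made for the example.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster OTUs by UPGMA.

    `dist` maps frozenset({a, b}) to the distance between OTUs a and b.
    Returns (tree, merges): `tree` is a nested tuple and `merges` lists
    (new node, node height) in the order the groupings were made.
    """
    sizes = {name: 1 for name in names}    # cluster -> number of OTUs inside
    d = dict(dist)
    merges = []
    while len(sizes) > 1:
        # Group the two closest clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        height = d[frozenset((a, b))] / 2
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        u = (a, b)
        # Size-weighted average distance from the new node u to every other node k.
        for k in sizes:
            d[frozenset((u, k))] = (
                size_a * d[frozenset((a, k))] + size_b * d[frozenset((b, k))]
            ) / (size_a + size_b)
        sizes[u] = size_a + size_b
        merges.append((u, height))
    (tree,) = sizes
    return tree, merges


pairs = {("A", "B"): 2, ("A", "C"): 4, ("B", "C"): 6,
         ("A", "D"): 8, ("B", "D"): 8, ("C", "D"): 8}
toy = {frozenset(p): v for p, v in pairs.items()}
tree, merges = upgma("ABCD", toy)
print(tree)                      # ('D', ('C', ('A', 'B')))
print([h for _, h in merges])    # [1.0, 2.5, 4.0]
```

Node heights are half the distance between the merged groups, which places all end nodes at the same distance from the root, as expected for an ultrametric tree.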
Tree inference: UPGMA
Pairwise distance methods
Advantages:
• Fast
• Allows incorporation of evolutionary models
Disadvantages:
• Assumption of a molecular clock
Tree inference: UPGMA
Pairwise distance methods
• Additive distances can be fitted to an unrooted tree such that the evolutionary distance between a pair of OTUs equals the sum of the lengths of the branches connecting them, rather than being an average as in the case of cluster analysis
• Tree construction method: minimum evolution (ME), in which the tree that minimizes the sum of the lengths of the branches is regarded as the best estimate of the phylogeny
• Drawback of the ME method: in principle all different tree topologies have to be investigated in order to find the ‘minimum’ tree.
• The neighbour joining (NJ) method, developed by Saitou and Nei (1987) offers a heuristic approach to solve this problem
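A minimal sketch of neighbour joining is shown below. It assumes the commonly used formulation: at each step, join the pair (i, j) that minimizes Q(i, j) = (n - 2) d(i, j) - r_i - r_j, where r_i is the sum of distances from i to all other nodes. The function name, the edge-list output and the made-up additive toy distances are assumptions for the example.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Return the unrooted NJ tree as a list of (node, neighbour, branch length).

    `dist` maps frozenset({a, b}) to the distance between OTUs a and b.
    """
    d = dict(dist)
    nodes = list(names)
    edges = []
    while len(nodes) > 2:
        n = len(nodes)
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) for i in nodes}
        # Pick the pair with the smallest Q value.
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        d_ij = d[frozenset((i, j))]
        len_i = d_ij / 2 + (r[i] - r[j]) / (2 * (n - 2))
        u = (i, j)                                   # new internal node
        edges += [(i, u, len_i), (j, u, d_ij - len_i)]
        nodes = [k for k in nodes if k not in (i, j)]
        for k in nodes:
            d[frozenset((u, k))] = (
                d[frozenset((i, k))] + d[frozenset((j, k))] - d_ij
            ) / 2
        nodes.append(u)
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges


pairs = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
         ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
         ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
toy = {frozenset(p): v for p, v in pairs.items()}
for node, neighbour, length in neighbor_joining("abcde", toy):
    print(node, "-", neighbour, length)
```

Because the toy distances are additive, the branch lengths along the path between any two OTUs sum to their input distance, for example a to b: 2 + 3 = 5. Unlike UPGMA, no molecular clock is assumed, so the two branches leading to a joined pair may differ in length.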
Tree inference: neighbor joining