Phylogenetic Analysis
2
Phylogenetic Analysis Overview
• Insight into evolutionary relationships
• Inferring or estimating these evolutionary relationships shown as branches of a tree
• Length and nesting reflects degree of similarity between any two items (in our case, sequences)
3
Phylogenetics and Cladistics
• Clade = a set of descendants from a single ancestor (from the Greek word for branch)
• Three basic assumptions
– Any group of organisms is related by descent from a common ancestor
– There is a bifurcating pattern of cladogenesis
– Change in characteristics occurs in lineages over time
4
More default assumptions
1. Correct sequences and origins
2. Shared ancestral origin
3. Homologous sequences
4. No mixtures of nuclear and organellar sequences
5. Large enough taxon sampling
6. Contains representative sequence variations
7. Sufficient sequence variation
5
Basic Terminology
• Clade: a group of organisms or genes that includes the most recent common ancestor of all of its members and all of the descendants of that most recent common ancestor.
• Taxon: any named group of organisms; not necessarily a clade.
• Branches: branch lengths sometimes correspond to the degree of divergence
• Node: a bifurcating branch point
[Figure: two example trees, one in which branch lengths are not significant and one in which branch lengths are significant]
6
Basic Definitions
• Homologous: sequences inferred to share a common ancestor; in practice, sequences that share an arbitrary threshold level of similarity determined by alignment of matching bases
• Similarity: a quantifiable term that refers to a degree of relatedness between sequences, but does not necessarily reflect ancestry.
• Orthologs: homologs produced by speciation; derived from a common ancestor; tend to have similar function
• Paralogs: homologs produced by gene duplication; derived within an organism; tend to have differing functions
• Xenologs: homologs resulting from horizontal gene transfer between two organisms; difficult to verify; variable function but tends to be similar.
7
Phylogenetic Analysis Overview
• Objective:
– Determine branch lengths and figure out how the tree should be drawn
– Sequences that are most closely related are drawn as neighboring branches
8
Phylogenetic Analysis Overview
• Dependent upon good multiple sequence alignment programs
• Group sequences with similar patterns of substitutions in order to reconstruct a phylogenetic tree
9
Phylogenetic Analysis Overview
• Consider two sequences that are related
– Ancestral sequence can be (partially) derived
– With additional sequences, more information can be gathered to add to a correct derivation
10
Phylogenetic Analysis Overview
• Example: C-Terminal Motor Kinesin sequences
– http://www.proweb.org/kinesin/BE4_Cterm.html
11
Practical use of phylogenetic analysis
• To prioritize the analysis of genes in the target family
– Gives insight into protein functions
12
• P. aeruginosa, a bacterium that is one of the top 3 causes of opportunistic infections, is noted for its antimicrobial resistance and resistance to detergents.
• 3 homologous outer membrane proteins, OprJ, OprM and OprN, were identified as playing a role in this antimicrobial resistance.
13
Figure 14.2 Example of a phylogenetic tree based on genes that does not match organismal phylogeny, suggesting horizontal gene transfer has occurred.
Possible horizontal gene transfer
14
Uses of Phylogenetic Analysis
• Given a set of genes, determine which genes are likely to have equivalent functions
• Follow changes occurring in a rapidly changing species such as a virus
– Example: influenza
– Study of rapidly changing genes
– Next year’s strain can be predicted
– Flu vaccination can be developed
16
Tree of Life
• Phylogenies describe how the evolution of species has occurred
• Image: http://microbialgenome.org/primer/tree.html
17
Tree of Life
• Traditionally, morphological characters (visible features) have been used to classify organisms
– Living organisms
– Fossil records
• Sequence data beginning to take larger role
18
Tree of Life
• Many different resources including:
– NCBI taxonomy web sites
– University of Arizona’s tree of life project
19
NCBI Taxonomy Web Site
• http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/
Taxonomy: the method of classification; the science of classification
21
Evolutionary Trees
• Two dimensional graph showing evolutionary relationship among a set of items
• can be organisms, genes, or sequences
• Each unit is defined by a distinct branch on the tree
22
Evolutionary Trees
• leaves represent the units (taxa) being studied
• nodes and branches representing the relationships among the taxa
• Two taxa derived from the same common ancestor will share a node in the graph
23
Evolutionary Trees
• length of each branch may be drawn according to the number of sequence level changes that occurred
• distance may not be in direct relation to evolutionary time
• analyses that assume a uniform rate of mutation rely on the molecular clock hypothesis
24
Rooted Trees
• One sequence (root) defined to be common ancestor of all of the other sequences
• A unique path leads from the root node to any other node
• Direction of path indicates evolutionary time
• Root chosen as a sequence thought to have branched off earliest
25
Rooted Trees
• If molecular clock hypothesis holds, it is possible to predict a root
• As the number of sequences increases, the number of possible rooted trees increases very rapidly
• In most cases, a bifurcating binary tree is the best model to simulate evolutionary events
26
Example Rooted Tree
SYSTEMATICS AND MOLECULAR PHYLOGENETICS
Image source: http://www.ncbi.nlm.nih.gov/About/primer/phylo.html
27
Unrooted Tree (Star)
• Indicates evolutionary relationship without revealing the location of the oldest ancestor
• For the same set of sequences, there are fewer possible unrooted trees than rooted trees
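The growth in the number of possible trees can be illustrated with a short sketch. It assumes strictly bifurcating trees with labelled leaves and uses the standard counting formulas, (2n - 5)!! for unrooted and (2n - 3)!! for rooted trees; the function names are illustrative and not taken from the slides.

```python
def odd_double_factorial(k: int) -> int:
    """Return 1 * 3 * 5 * ... * k for odd k (returns 1 when k is 1 or less)."""
    result = 1
    for factor in range(3, k + 1, 2):
        result *= factor
    return result


def count_unrooted_trees(n: int) -> int:
    """Number of unrooted bifurcating trees for n >= 3 labelled sequences."""
    if n < 3:
        raise ValueError("need at least three sequences")
    return odd_double_factorial(2 * n - 5)


def count_rooted_trees(n: int) -> int:
    """Number of rooted bifurcating trees for n >= 2 labelled sequences."""
    if n < 2:
        raise ValueError("need at least two sequences")
    return odd_double_factorial(2 * n - 3)


for n in (3, 4, 5, 10):
    print(n, count_unrooted_trees(n), count_rooted_trees(n))
```

For four sequences this gives 3 unrooted and 15 rooted trees, which is why the parsimony example later in the deck only has to compare three candidates.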
28
Example Unrooted Tree
Image source: http://www.shef.ac.uk/english/language/quantling/images/quantling1.jpg
29
• Image: http://www.ncbi.nlm.nih.gov/About/primer/phylo.html
30
Methods for Determining Trees
• Three main methods:
– Maximum parsimony
– Distance
– Maximum likelihood
31
Maximum Parsimony
• Predicts evolutionary tree minimizing number of steps required to generate observed variation
• Multiple sequence alignment must first be obtained
32
Maximum Parsimony
• For each position, phylogenetic trees requiring the smallest number of evolutionary changes to produce the observed sequence changes are identified
• Trees that produce the smallest number of changes for all sequence positions are identified
33
Maximum Parsimony
• Time consuming algorithm
• Only works well if the sequences have a strong sequence similarity
34
Maximum Parsimony Example
1 A A G A G T G C A
2 A G C C G T G C G
3 A G A T A T C C A
4 A G A G A T C C G
• four sequences, three possible unrooted trees
35
Maximum Parsimony Example
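Possible trees (unrooted):
Tree 1: sequences 1 and 2 grouped, sequences 3 and 4 grouped
Tree 2: sequences 1 and 3 grouped, sequences 2 and 4 grouped
Tree 3: sequences 1 and 4 grouped, sequences 2 and 3 grouped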
36
Maximum Parsimony Example
• Some sites are informative, and other sites are not
• An informative site has at least two different sequence characters, each present in at least two different sequences
• Only the informative sites need to be considered
37
Maximum Parsimony Example
1 A A G A G T G C A
2 A G C C G T G C G
3 A G A T A T C C A
4 A G A G A T C C G
Three informative columns (positions 5, 7 and 9)
38
Maximum Parsimony Example
Informative columns only:
1 G G A
2 G G G
3 A C A
4 A C G
[Figure: the three possible trees drawn under each of Column 1, Column 2 and Column 3, with the branches on which a substitution is required marked]
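The parsimony example can be reproduced with a small sketch. It assumes the four sequences from the example slide, writes the three unrooted trees as nested pairs, and counts changes per column with Fitch's small-parsimony procedure, which is a standard technique but is not named in the slides. All function and variable names are illustrative.

```python
from collections import Counter

SEQUENCES = ["AAGAGTGCA", "AGCCGTGCG", "AGATATCCA", "AGAGATCCG"]

# The three unrooted trees for four sequences, as nested pairs of 0-based
# indices (sequence 1 in the slides is index 0 here).
TREES = {
    "(1,2),(3,4)": ((0, 1), (2, 3)),
    "(1,3),(2,4)": ((0, 2), (1, 3)),
    "(1,4),(2,3)": ((0, 3), (1, 2)),
}


def is_informative(column):
    """True if at least two characters each occur in at least two sequences."""
    counts = Counter(column)
    return sum(1 for count in counts.values() if count >= 2) >= 2


def fitch_changes(tree, column):
    """Minimum number of substitutions needed on this tree for one column."""
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, int):          # leaf: its observed character
            return {column[node]}
        left, right = (visit(child) for child in node)
        shared = left & right
        if shared:                         # children agree: no change needed
            return shared
        changes += 1                       # children disagree: one change
        return left | right

    visit(tree)
    return changes


columns = list(zip(*SEQUENCES))
informative = [i + 1 for i, col in enumerate(columns) if is_informative(col)]
scores = {
    name: sum(fitch_changes(tree, columns[i - 1]) for i in informative)
    for name, tree in TREES.items()
}
best_tree = min(scores, key=scores.get)
print(informative, scores, best_tree)
```

On these sequences the informative positions are 5, 7 and 9, and the tree grouping sequence 1 with 2 and sequence 3 with 4 needs the fewest changes (4, against 5 and 6 for the other two).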
39
Distance Method
• Looks at the number of changes between each pair in a group of sequences
• Goal is to identify a tree that positions neighbors correctly and that also has branch lengths which reproduce the original data as closely as possible
40
Distance Method
• CLUSTALW uses the neighbor-joining method to build the guide tree for multiple sequence alignments
• The PHYLIP suite of programs includes neighbor-joining methods
– http://evolution.genetics.washington.edu/phylip.html
41
Distance Programs in Phylip
• NEIGHBOR: estimates phylogenies using either:
– neighbor-joining (no molecular clock assumed)
– unweighted pair group method with arithmetic mean (UPGMA) (molecular clock assumed)
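To make the UPGMA option concrete, here is a minimal sketch of the clustering idea. It is not PHYLIP's NEIGHBOR program and does not reproduce its input format; the function name and data layout are assumptions, and the distances are the mismatch counts for sequences A to D from the distance analysis example in this deck.

```python
def upgma(distances):
    """Cluster leaves by UPGMA.

    distances maps (x, y) pairs of leaf names to pairwise distances.
    Returns a list of (cluster_x, cluster_y, node_height) merges, where the
    node height is half the distance between the merged clusters, as implied
    by a molecular clock.
    """
    d = {frozenset(pair): float(value) for pair, value in distances.items()}
    size = {leaf: 1 for pair in distances for leaf in pair}
    merges = []
    while len(size) > 1:
        closest = min(d, key=d.get)        # pair of clusters with smallest distance
        x, y = closest
        merged = (x, y)
        merges.append((x, y, d[closest] / 2.0))
        for z in size:
            if z != x and z != y:
                d_xz = d.pop(frozenset((x, z)))
                d_yz = d.pop(frozenset((y, z)))
                # Size-weighted average = arithmetic mean over all leaf pairs.
                d[frozenset((merged, z))] = (
                    size[x] * d_xz + size[y] * d_yz
                ) / (size[x] + size[y])
        del d[closest]
        size[merged] = size.pop(x) + size.pop(y)
    return merges


example = {
    ("A", "B"): 3, ("A", "C"): 7, ("A", "D"): 8,
    ("B", "C"): 6, ("B", "D"): 7, ("C", "D"): 3,
}
merge_heights = [height for _, _, height in upgma(example)]
print(merge_heights)
```

Because UPGMA places every leaf at the same distance from the root, it cannot give A and B the unequal branch lengths that an unrooted, clock-free method can.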
42
Distance Analysis
• Distance score counted as
– number of mismatched positions in the alignment
– number of sequence positions that must be changed to generate the second sequence
• Success depends on the degree to which the distances among a set of sequences can be made additive on a predicted evolutionary tree
43
Example of Distance Analysis
• Consider the alignment:
A ACGCGTTGGGCGATGGCAAC
B ACGCGTTGGGCGACGGTAAT
C ACGCATTGAATGATGATAAT
D ACACATTGAGTGATAATAAT
44
Example of Distance Analysis
• Distances can be shown as a table
A ACGCGTTGGGCGATGGCAAC
B ACGCGTTGGGCGACGGTAAT
C ACGCATTGAATGATGATAAT
D ACACATTGAGTGATAATAAT
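The distance table for this alignment can be recomputed with a short sketch that counts mismatched positions for every pair of sequences. The sequences are the four from the slide; the function and variable names are illustrative.

```python
from itertools import combinations

ALIGNMENT = {
    "A": "ACGCGTTGGGCGATGGCAAC",
    "B": "ACGCGTTGGGCGACGGTAAT",
    "C": "ACGCATTGAATGATGATAAT",
    "D": "ACACATTGAGTGATAATAAT",
}


def mismatch_count(seq_x, seq_y):
    """Number of aligned positions at which two sequences differ."""
    if len(seq_x) != len(seq_y):
        raise ValueError("aligned sequences must have equal length")
    return sum(1 for a, b in zip(seq_x, seq_y) if a != b)


distance_table = {
    (x, y): mismatch_count(ALIGNMENT[x], ALIGNMENT[y])
    for x, y in combinations(ALIGNMENT, 2)
}

for (x, y), distance in distance_table.items():
    print(f"{x}-{y}: {distance}")
```

The resulting counts are A-B 3, A-C 7, A-D 8, B-C 6, B-D 7 and C-D 3.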
45
Example of Distance Analysis
• Using this information, a tree can be drawn:
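[Figure: unrooted tree in which A (branch length 2) and B (branch length 1) join at one node, C (branch length 1) and D (branch length 2) join at another node, and the two nodes are connected by a branch of length 4]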
46
Fitch and Margoliash Algorithm (3 sequences)
• Distance table used
• Sequences combined in threes
– define the branches of the predicted tree
– calculate the branch lengths of the tree
47
Fitch and Margoliash Algorithm (3 sequences)
• 1) Draw unrooted tree with three branches originating from common node:
[Figure: unrooted tree in which branches a, b and c lead from a common node to A, B and C respectively]
48
Fitch and Margoliash Algorithm (3 sequences)
2) Calculate lengths of tree branches algebraically:
• distance from A to B = a + b = 22 (1)
• distance from A to C = a + c = 39 (2)
• distance from B to C = b + c = 41 (3)
• subtracting (2) from (3) yields:
  (b + c) - (a + c) = 41 - 39
  b - a = 2 (4)
• adding (1) and (4) yields 2b = 24; b = 12
• so a + 12 = 22; a = 10
• 10 + c = 39; c = 29
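The same algebra can be written as a closed-form solution of the three equations. The sketch below assumes only the three pairwise distances from the slide; the function name is illustrative.

```python
def three_point_branches(d_ab, d_ac, d_bc):
    """Solve a + b = d_ab, a + c = d_ac, b + c = d_bc for a, b and c."""
    a = (d_ab + d_ac - d_bc) / 2
    b = (d_ab + d_bc - d_ac) / 2
    c = (d_ac + d_bc - d_ab) / 2
    return a, b, c


a, b, c = three_point_branches(22, 39, 41)
print(a, b, c)
```

With the distances 22, 39 and 41 this returns a = 10, b = 12 and c = 29, matching the step-by-step elimination.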
49
Fitch and Margoliash Algorithm (3 sequences)
• 3) Resulting tree:
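[Figure: unrooted tree with branch a = 10 leading to A, branch b = 12 leading to B and branch c = 29 leading to C]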
50
Fitch and Margoliash Algorithm (5 sequences)
• Algorithm can be extended to more sequences. Consider the distances:
[Figure: tree relating sequences A, B, C, D and E, with branches labelled a through g]
51
Summary of Fitch-Margoliash
1) Find the most closely related pair of sequences (A, B).
2) Treat the rest of the sequences as a composite. Calculate the average distance from A to all others; and from B to all others.
3) Use these values to calculate the length of the edges a and b.
52
Summary of Fitch-Margoliash
4) Treat A and B as a composite. Calculate the average distances between AB and each of the other sequences. Create a new distance table.
5) Identify next pair of related sequences and begin as with step 1.
6) Subtract extended branch lengths to calculate lengths of intermediate branches.
53
Summary of Fitch-Margoliash
7) Repeat the entire process with all possible pairs of sequences.
8) Calculate predicted distances between each pair of sequences for each tree to find the best tree.
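Steps 1 to 4 of the summary can be sketched as a single pass over a distance table. This is a simplified illustration, not a full Fitch-Margoliash implementation: it assumes at least three sequences, does not repeat the process over all possible pairs or score alternative trees, and uses the mismatch counts for sequences A to D from the distance analysis example. All names are illustrative.

```python
from itertools import combinations


def lookup(table, x, y):
    """Distance between x and y regardless of key order."""
    return table[(x, y)] if (x, y) in table else table[(y, x)]


def fitch_margoliash_step(names, table):
    # Step 1: most closely related pair (the first pair wins on ties).
    x, y = min(combinations(names, 2), key=lambda pair: lookup(table, *pair))
    others = [name for name in names if name not in (x, y)]
    # Step 2: average distance from each member of the pair to the rest.
    x_to_rest = sum(lookup(table, x, o) for o in others) / len(others)
    y_to_rest = sum(lookup(table, y, o) for o in others) / len(others)
    # Step 3: three-point formula, treating the rest as one composite.
    d_xy = lookup(table, x, y)
    branch_x = (d_xy + x_to_rest - y_to_rest) / 2
    branch_y = (d_xy + y_to_rest - x_to_rest) / 2
    # Step 4: new table in which x and y form a single composite.
    composite = x + y
    new_table = {
        (composite, o): (lookup(table, x, o) + lookup(table, y, o)) / 2
        for o in others
    }
    for p, q in combinations(others, 2):
        new_table[(p, q)] = lookup(table, p, q)
    return (x, y), (branch_x, branch_y), new_table


table = {
    ("A", "B"): 3, ("A", "C"): 7, ("A", "D"): 8,
    ("B", "C"): 6, ("B", "D"): 7, ("C", "D"): 3,
}
pair, branches, reduced = fitch_margoliash_step(["A", "B", "C", "D"], table)
print(pair, branches, reduced)
```

On this table the first pass pairs A with B, assigns them branch lengths 2 and 1, and produces a reduced table with the composite AB at distance 6.5 from C and 7.5 from D.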
54
Neighbor Joining
• Similar to Fitch-Margoliash
• Sequences chosen to give best least-squares estimate of branch length
55
Maximum Likelihood
• Calculates likelihood of a tree given an alignment
• Trees requiring the fewest changes will generally be the most likely
56
Maximum Likelihood (ML)
• Probability of each tree is product of mutation rates in each branch
• Likelihoods given by each column multiplied to give the likelihood of the tree
57
Maximum Likelihood (ML)
• Disadvantages:
– Computationally intensive
– Can only be done for a handful of sequences
58
Which Method to Choose?
• Depends upon the sequences that are being compared
– Strong sequence similarity: maximum parsimony
– Clearly recognizable sequence similarity: distance methods
– All others: maximum likelihood
59
Distance, Parsimony and ML
• Distance matrix: simply count the number of differences between two sequences.
• Maximum Parsimony: search for a tree that requires the smallest number of changes to explain the differences observed among the taxa.
• ML: evaluates the probability that the chosen evolutionary model has generated the observed data. A simple model is that changes between all nucleotides (or amino acids) are equally probable. The probabilities for all possible reconstructions are summed to yield the likelihood for one particular site. The likelihood for the tree is the product of the likelihoods for all alignment positions in the dataset.
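The ML description can be illustrated with a small sketch under strong simplifying assumptions: the equal-rates nucleotide model (Jukes-Cantor), the same length for every branch (0.1 expected substitutions per site, an arbitrary choice), and the four sequences from the parsimony example. Real ML programs also optimise branch lengths, which this sketch does not. All names are illustrative.

```python
import math

BASES = "ACGT"
SEQUENCES = ["AAGAGTGCA", "AGCCGTGCG", "AGATATCCA", "AGAGATCCG"]
TREES = {
    "(1,2),(3,4)": ((0, 1), (2, 3)),
    "(1,3),(2,4)": ((0, 2), (1, 3)),
    "(1,4),(2,3)": ((0, 3), (1, 2)),
}


def jc_probability(start, end, branch_length):
    """Jukes-Cantor probability of observing `end` given `start`."""
    decay = math.exp(-4.0 * branch_length / 3.0)
    return 0.25 + 0.75 * decay if start == end else 0.25 - 0.25 * decay


def site_likelihood(tree, column, branch_length):
    """Likelihood of one column, summed over all internal reconstructions."""

    def conditional(node):
        if isinstance(node, int):          # leaf: observed base is certain
            return {b: 1.0 if b == column[node] else 0.0 for b in BASES}
        children = [conditional(child) for child in node]
        return {
            parent: math.prod(
                sum(jc_probability(parent, s, branch_length) * child[s]
                    for s in BASES)
                for child in children
            )
            for parent in BASES
        }

    root = conditional(tree)
    return sum(0.25 * root[b] for b in BASES)   # equal base frequencies


def log_likelihood(tree, sequences, branch_length=0.1):
    """Sum of log site likelihoods (log of the product over positions)."""
    return sum(
        math.log(site_likelihood(tree, column, branch_length))
        for column in zip(*sequences)
    )


log_likelihoods = {
    name: log_likelihood(tree, SEQUENCES) for name, tree in TREES.items()
}
best_tree = max(log_likelihoods, key=log_likelihoods.get)
print(log_likelihoods, best_tree)
```

Working in log space turns the product over alignment positions into a sum, which avoids numerical underflow on longer alignments.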
60
Which Method to Choose?
• Best to choose at least two approaches
• Compare the results – if they are similar, you can have more confidence
61
Difficulties With Phylogenetic Analysis
• Horizontal or lateral transfer of genetic material (for instance through viruses) makes it difficult to determine phylogenetic origin of some evolutionary events.
• Genes under selective pressure can evolve rapidly, masking earlier changes that had occurred phylogenetically.
62
Difficulties With Phylogenetic Analysis
• Two sites within comparative sequences may be evolving at different rates.
• Re-arrangements of genetic material can lead to false conclusions.
• Duplicated genes can evolve along separate pathways, leading to different functions
63
Here are some 264 phylogeny packages and 30 free servers
64
Exercise
• Multiple Sequence Alignment
– Sequence Alignment: CLUSTALW
– Sample sequences: found on E-learning system
65
Explanation of the parameters
66
Exercise