Inferring phylogenetic trees
Prof. William Stafford Noble
Department of Genome Sciences
Department of Computer Science and Engineering
University of Washington
One-minute responses
• I did not understand anything in the Gibbs sampling and the second method.
• The class was quite OK now. Understood most important things.
• I understood 50% of the Python part. But I am a bit confused about the goal of the programs.
• Please send us the slides immediately after lecture.
  – I put the slides on the website during the Python half of the class. Hit “refresh” on the web browser to see them.
• I didn’t understand clearly converting scores to p-values, more especially putting 1 and 2. Otherwise everything was clear.
• I think we should go a little bit slower.
• I didn’t understand the EM and Gibbs.
• The concept of EM and Gibbs sampling are really very important. Please go in depth on them.
• Python sessions are still fine as usual.
• These algorithms are complex. Could you please explain them with a bit of some examples?
• I didn’t understand the second Python problem.
• Emile must not mark our assessment on the programming part.
Revision - Gibbs
Sequences → motif occurrences: randomly select
Motif occurrences → PSSM:
1. Randomly discard one sequence
2. Build PSSM from remaining sequences
   • Counts
   • Add pseudocounts
   • Normalize
PSSM → motif occurrences:
1. Scan discarded sequence with PSSM
2. Choose new occurrence according to resulting probabilities
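The Gibbs cycle above can be sketched in Python roughly as follows. This is a minimal illustration, not the course's reference solution: the function names, the pseudocount of 1, and the DNA alphabet are all assumptions, and the matrix here holds plain probabilities (counts, pseudocounts, normalize) exactly as listed on the slide, without a background correction.

```python
import random

ALPHABET = "ACGT"  # assumption: DNA sequences without gaps or ambiguity codes


def build_profile(occurrences, pseudocount=1.0):
    """Column-wise probabilities: counts, plus pseudocounts, then normalize."""
    width = len(occurrences[0])
    profile = []
    for col in range(width):
        counts = {base: pseudocount for base in ALPHABET}
        for occ in occurrences:
            counts[occ[col]] += 1
        total = sum(counts.values())
        profile.append({base: counts[base] / total for base in ALPHABET})
    return profile


def gibbs_step(sequences, starts, width, rng=random):
    """One Gibbs update; returns a new list of motif start positions."""
    # 1. Randomly discard one sequence.
    held_out = rng.randrange(len(sequences))
    # 2. Build the matrix from the occurrences in the remaining sequences.
    occurrences = [seq[s:s + width]
                   for i, (seq, s) in enumerate(zip(sequences, starts))
                   if i != held_out]
    profile = build_profile(occurrences)
    # 3. Scan the discarded sequence: one probability per candidate start.
    seq = sequences[held_out]
    weights = []
    for start in range(len(seq) - width + 1):
        prob = 1.0
        for col in range(width):
            prob *= profile[col][seq[start + col]]
        weights.append(prob)
    # 4. Choose the new occurrence in proportion to those probabilities.
    new_start = rng.choices(range(len(weights)), weights=weights)[0]
    updated = list(starts)
    updated[held_out] = new_start
    return updated
```

Calling `gibbs_step` repeatedly, starting from randomly selected positions, gives the sampling loop. Because the choice in step 4 is random, two runs will generally visit different occurrences.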
Revision - EM
Sequences → motif occurrences: randomly select
Motif occurrences → PSSM:
1. Counts
2. Add pseudocounts
3. Normalize
4. Divide by background
5. Take log2
PSSM → motif occurrences:
1. Scan each sequence with PSSM
2. Take top-scoring occurrence
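The five PSSM steps and the "take top-scoring occurrence" scan can be sketched as below. The helper names, the pseudocount of 1, and the uniform background (0.25 per base) are assumptions made for illustration only.

```python
import math

ALPHABET = "ACGT"  # assumption: DNA sequences without gaps


def build_log_odds_pssm(occurrences, background=None, pseudocount=1.0):
    """Counts, pseudocounts, normalize, divide by background, take log2."""
    if background is None:
        background = {base: 0.25 for base in ALPHABET}  # assumed uniform
    width = len(occurrences[0])
    pssm = []
    for col in range(width):
        # Start every count at the pseudocount, then add observed counts.
        counts = {base: pseudocount for base in ALPHABET}
        for occ in occurrences:
            counts[occ[col]] += 1
        total = sum(counts.values())
        pssm.append({
            base: math.log2((counts[base] / total) / background[base])
            for base in ALPHABET
        })
    return pssm


def best_occurrence(sequence, pssm):
    """Scan one sequence; return (start, score) of the top-scoring window."""
    width = len(pssm)
    scores = [sum(pssm[col][sequence[start + col]] for col in range(width))
              for start in range(len(sequence) - width + 1)]
    best_start = max(range(len(scores)), key=scores.__getitem__)
    return best_start, scores[best_start]


pssm = build_log_odds_pssm(["ACGT", "ACGT", "ACGA"])
print(best_occurrence("TTACGTTT", pssm))
```

Repeating the two stages (rebuild the PSSM from the current occurrences, then rescan every sequence) until the occurrences stop changing gives the loop drawn on the slide.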
Phylogenetic inference
Rabbit, Dove, Lion, Donkey
?
Outline
• Parsimony
• Distance methods
  – Computing distances
  – Finding the tree
• Maximum likelihood
Selecting a method
1. Choose set of related sequences
2. Obtain multiple sequence alignment
3. Is there strong sequence similarity?
   – Yes: Maximum parsimony methods
   – No: go to step 4
4. Is there clearly recognizable sequence similarity?
   – Yes: Distance methods
   – No: Maximum likelihood methods
Maximum parsimony
for each possible tree
    compute the parsimony score
return the tree with the best score

– "for each possible tree": Enumerating these trees can take a very long time
– "compute the parsimony score": Computing this score is straightforward
How many trees?
• With four sequences: 3 unrooted trees.
• With five sequences: 15 unrooted trees.
• With seven sequences: 945 unrooted trees.
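[Figure: the three unrooted trees for four sequences, pairing (1, 2) with (3, 4), (1, 3) with (2, 4), and (1, 4) with (3, 2)]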
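These counts follow a standard result that the slide does not spell out: an unrooted binary tree with k - 1 leaves has 2k - 5 branches, and the k-th sequence can be attached to any one of them, so the number of trees for n sequences is 3 × 5 × ... × (2n - 5). A small sketch (the function name is made up for this example):

```python
def count_unrooted_trees(n_taxa):
    """Number of unrooted binary trees on n_taxa labeled leaves."""
    if n_taxa < 3:
        raise ValueError("need at least three sequences")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (one factor per added taxon).
    for branches in range(3, 2 * n_taxa - 4, 2):
        count *= branches
    return count


for n in (4, 5, 6, 7, 10, 20):
    print(n, count_unrooted_trees(n))
```

The rapid growth of this product is why exhaustive enumeration is only practical for small numbers of sequences.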
Computing parsimony scores
Scer   A G A A A A A T A A C T T T C T C A T G
Spar   G G A A A A A T A A C T T T C T G A C A
Smik   A A A A T A A C T T C T C A A C A A T A
Skud   A T C T T G A T C C C T T G T G T T G A

Site 1: Scer = A, Spar = G, Smik = A, Skud = A
Tree 1, grouping (Scer, Spar) and (Smik, Skud): internal nodes A, A; Score = 1
Tree 2, grouping (Scer, Smik) and (Spar, Skud): internal nodes A, A; Score = 1
Tree 3, grouping (Scer, Skud) and (Smik, Spar): internal nodes A, A; Score = 1
This site is uninformative, because all the trees have the same score.
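A common shortcut for spotting such sites, which the slides do not state explicitly, is that a site can only favor one tree over another if at least two different nucleotides each occur in at least two sequences. The sketch below applies that criterion to the four-sequence alignment; the dictionary layout and function name are assumptions for this example.

```python
from collections import Counter

ALIGNMENT = {
    "Scer": "AGAAAAATAACTTTCTCATG",
    "Spar": "GGAAAAATAACTTTCTGACA",
    "Smik": "AAAATAACTTCTCAACAATA",
    "Skud": "ATCTTGATCCCTTGTGTTGA",
}


def is_informative(column):
    """True if at least two states each appear in at least two sequences."""
    counts = Counter(column)
    return sum(1 for n in counts.values() if n >= 2) >= 2


# zip(*...) turns the rows of the alignment into columns (sites).
columns = list(zip(*ALIGNMENT.values()))
informative_sites = [i + 1 for i, col in enumerate(columns)
                     if is_informative(col)]
print(informative_sites)
```

Only informative sites need to be scored when comparing trees, since every other site contributes the same amount to each tree's total.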
Computing parsimony scores
Site 2: what is the score of each tree?
Tree 1, grouping (Scer, Spar) and (Smik, Skud): Score = ?
Tree 2, grouping (Scer, Smik) and (Spar, Skud): Score = ?
Tree 3, grouping (Scer, Skud) and (Smik, Spar): Score = ?
Computing parsimony scores
Site 2: Scer = G, Spar = G, Smik = A, Skud = T
Tree 1, grouping (Scer, Spar) and (Smik, Skud): internal nodes G, A; Score = 2
Tree 2, grouping (Scer, Smik) and (Spar, Skud): internal nodes G, G; Score = 2
Tree 3, grouping (Scer, Skud) and (Smik, Spar): internal nodes G, G; Score = 2
Computing parsimony scores
Site 5: what is the score of each tree?
Tree 1, grouping (Scer, Spar) and (Smik, Skud): Score = ?
Tree 2, grouping (Scer, Smik) and (Spar, Skud): Score = ?
Tree 3, grouping (Scer, Skud) and (Smik, Spar): Score = ?
Computing parsimony scores
Site 5: Scer = A, Spar = A, Smik = T, Skud = T
Tree 1, grouping (Scer, Spar) and (Smik, Skud): internal nodes A, T; Score = 1. This tree is best.
Tree 2, grouping (Scer, Smik) and (Spar, Skud): internal nodes A, A; Score = 2
Tree 3, grouping (Scer, Skud) and (Smik, Spar): internal nodes A, A; Score = 2
Computing parsimony scores
Tree grouping (Scer, Spar) and (Smik, Skud):
Scer   A G A A A A A T A A C T T T C T C A T G
Spar   G G A A A A A T A A C T T T C T G A C A
Smik   A A A A T A A C T T C T C A A C A A T A
Skud   A T C T T G A T C C C T T G T G T T G A
Score  1 2 1 1 1 1 0 1 2 2 0 0 1 2 2 2 3 1 2 1
Total = 26
Computing parsimony scores
Tree grouping (Scer, Smik) and (Spar, Skud):
Scer   A G A A A A A T A A C T T T C T C A T G
Spar   G G A A A A A T A A C T T T C T G A C A
Smik   A A A A T A A C T T C T C A A C A A T A
Skud   A T C T T G A T C C C T T G T G T T G A
Score  1 2 1 1 2 1 0 1 2 2 0 0 1 2 2 2 3 1 2 1
Total = 27
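The per-site scores can be computed mechanically with Fitch's algorithm, which the slides use implicitly: work from the leaves toward an arbitrary root, keep the intersection of the two child state sets when it is non-empty, and otherwise take the union and count one change. The sketch below is one way to write it; the nested-tuple tree encoding and all names are assumptions for this example. Placing the root on the internal branch of an unrooted tree does not change the score.

```python
ALIGNMENT = {
    "Scer": "AGAAAAATAACTTTCTCATG",
    "Spar": "GGAAAAATAACTTTCTGACA",
    "Smik": "AAAATAACTTCTCAACAATA",
    "Skud": "ATCTTGATCCCTTGTGTTGA",
}

# Each tree is a nested tuple rooted on its internal branch.
TREES = {
    "(Scer, Spar) | (Smik, Skud)": (("Scer", "Spar"), ("Smik", "Skud")),
    "(Scer, Smik) | (Spar, Skud)": (("Scer", "Smik"), ("Spar", "Skud")),
    "(Scer, Skud) | (Smik, Spar)": (("Scer", "Skud"), ("Smik", "Spar")),
}


def fitch(tree, site):
    """Return (possible states, number of changes) for one site."""
    if isinstance(tree, str):  # a leaf is just a sequence name
        return {site[tree]}, 0
    left_states, left_cost = fitch(tree[0], site)
    right_states, right_cost = fitch(tree[1], site)
    common = left_states & right_states
    if common:  # children agree on some state: no change needed here
        return common, left_cost + right_cost
    return left_states | right_states, left_cost + right_cost + 1


def site_scores(tree, alignment):
    """Parsimony score of every alignment column for one tree."""
    length = len(next(iter(alignment.values())))
    return [fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
            for i in range(length)]


totals = {label: sum(site_scores(tree, ALIGNMENT))
          for label, tree in TREES.items()}
best = min(totals, key=totals.get)  # most parsimonious = lowest total
for label, total in totals.items():
    print(label, total)
print("best:", best)
```

Looping over `TREES` and keeping the lowest total is the exhaustive search from the "Maximum parsimony" slide, applied to the only three trees that exist for four sequences.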
Parsimony software
• In general, the most widely used programs for phylogenetic analysis are
  – Phylip (Joe Felsenstein)
  – PAUP (Jim Swofford)
  – MacClade (David and Wayne Maddison)
• All three do parsimony. Only Phylip is free.
Previous one-minute responses
• How many sequences are usually analyzed by parsimony methods?
  – Exhaustively, probably tens of sequences. With heuristic search methods, you can analyze arbitrarily many, but you lose the guarantee that you’re finding the most parsimonious tree.
• What do good parsimony scores look like?
  – It depends upon how many sequences are involved, and how divergent they are.
• Why doesn’t the parsimony method take into account transitions versus transversions?
  – It can; I presented the simplest version.
Jukes-Cantor model
• Assume the same probability of change at all positions and all times.
• dAB is the proportion of changed sites in the alignment.
• KAB is the distance between sequences A and B.

$$K_{AB} = -\frac{3}{4} \ln\left(1 - \frac{4}{3}\, d_{AB}\right)$$
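As a worked example of the formula (the input values are chosen here for illustration), take two aligned sequences that differ at 2 of 4 sites, so that the proportion of changed sites is 0.5:

```latex
\begin{aligned}
d_{AB} &= \tfrac{2}{4} = 0.5 \\
K_{AB} &= -\tfrac{3}{4} \ln\left(1 - \tfrac{4}{3} \cdot 0.5\right)
        = -\tfrac{3}{4} \ln\tfrac{1}{3}
        = \tfrac{3}{4} \ln 3 \approx 0.824
\end{aligned}
```

The distance is larger than the observed proportion because the correction accounts for sites that changed more than once. The logarithm is undefined once the proportion of changed sites reaches 3/4, so the distance cannot be computed for sequences that divergent.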
Problem #1
• Write a program jukes-cantor.py that takes as input a pairwise sequence alignment and prints the Jukes-Cantor distance. Skip sites that contain gaps.
> cat twoseqs.txt
ACGT
ACTG
> python jukes-cantor.py twoseqs.txt
0.823959
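One possible shape for such a program is sketched below; it is not the official solution. It assumes the input file holds one aligned sequence per line, that gaps are written as "-", and that a proportion of changed sites of 3/4 or more should be reported as an infinite distance. The function names are made up for this example.

```python
import math
import sys


def jukes_cantor(seq_a, seq_b, gap_chars="-"):
    """Jukes-Cantor distance between two aligned sequences, skipping gaps."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only the sites where neither sequence has a gap.
    pairs = [(a, b) for a, b in zip(seq_a, seq_b)
             if a not in gap_chars and b not in gap_chars]
    if not pairs:
        raise ValueError("no gap-free sites to compare")
    d = sum(a != b for a, b in pairs) / len(pairs)
    if d == 0:
        return 0.0
    if d >= 0.75:  # the logarithm is undefined here
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * d / 3.0)


def read_alignment(path):
    """Read one sequence per non-empty line (assumed file format)."""
    with open(path) as handle:
        return [line.strip().upper() for line in handle if line.strip()]


if __name__ == "__main__" and len(sys.argv) == 2:
    first, second = read_alignment(sys.argv[1])[:2]
    print("%.6f" % jukes_cantor(first, second))
```

Separating the distance function from the file handling makes it easy to reuse the same function on every pair of sequences in a larger alignment.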
Problem #2
• Generalize your previous program to work for a multiple sequence alignment.
> cat threeseqs.txt
ACGT
ACTG
ACGG
> python jukes-cantor-matrix.py threeseqs.txt
0.000 0.824 0.304
0.824 0.000 0.304
0.304 0.304 0.000
> jukes-cantor-multiple.py moreseqs.txt
0.000 0.233 0.383 0.233
0.233 0.000 0.824 0.572
0.383 0.824 0.000 0.107
0.233 0.572 0.107 0.000
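A sketch of the generalization, again only one possible approach: compute the pairwise distance for every pair of rows and fill a symmetric matrix. The function names and the three-decimal output format are assumptions based on the sample output above; file reading is left out to keep the example short.

```python
import math


def jukes_cantor(seq_a, seq_b):
    """Pairwise Jukes-Cantor distance, skipping sites with a '-' gap."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no gap-free sites to compare")
    d = sum(a != b for a, b in pairs) / len(pairs)
    if d == 0:
        return 0.0
    if d >= 0.75:  # the logarithm is undefined here
        return math.inf
    return -0.75 * math.log(1 - 4 * d / 3)


def distance_matrix(sequences):
    """Symmetric matrix of pairwise distances with zeros on the diagonal."""
    n = len(sequences)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Distance is symmetric, so compute each pair once.
            matrix[i][j] = matrix[j][i] = jukes_cantor(sequences[i],
                                                       sequences[j])
    return matrix


def format_matrix(matrix):
    """One row per line, values rounded to three decimals."""
    return "\n".join(" ".join(f"{value:.3f}" for value in row)
                     for row in matrix)


print(format_matrix(distance_matrix(["ACGT", "ACTG", "ACGG"])))
```

Gaps are skipped pair by pair here, so a gap in one sequence does not remove that column from the comparison of two other sequences. Dropping gapped columns from the whole alignment first would be an equally defensible reading of the problem.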