Input Sensitive Algorithms for Multiple Sequence Alignment
Pankaj Agarwal @Duke
Yonatan Bilu @Hebrew University
Rachel Kolodny @Stanford
Multiple Sequence Alignment
• Quantifies similarities among [DNA, Protein] sequences
• Detects highly conserved motifs & remote homologues
– Evolutionary insights
– Transfer of annotation
– Representation of protein families
Multiple Sequence Alignment
• Input: k sequences
• Output: optimal alignment
– Gap-infused sequences (-), one per row
– Restrictions on the column pattern
(1) GARFIELD MET NERMAL
(2) ODIE AND HIS ASSOCIATE NERMAL MET GARFIELD AND HIS ASSOCIATE
(3) GARFIELD AND HIS ASSOCIATE NERMAL
----GARFIELD MET----------------- NERMAL ------------------------------
ODIE------------AND HIS ASSOCIATE NERMAL MET GARFIELD AND HIS ASSOCIATE
----GARFIELD ---AND HIS ASSOCIATE NERMAL ------------------------------
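The output format described above can be checked mechanically. The following is a minimal sketch, not the authors' code: the function name `is_valid_alignment` is hypothetical, and the rule that no column may consist only of gaps is an assumed reading of the column restriction, which the slide does not spell out.

```python
GAP = "-"

def is_valid_alignment(sequences, rows):
    """Return True if `rows` is a gap-infused version of `sequences`, one per row."""
    if len(rows) != len(sequences):
        return False
    # All rows must have the same width.
    if len({len(row) for row in rows}) > 1:
        return False
    # Removing the gaps from a row must give back its input sequence.
    for seq, row in zip(sequences, rows):
        if row.replace(GAP, "") != seq:
            return False
    # Assumed restriction: no column is made up of gaps only.
    return all(any(c != GAP for c in column) for column in zip(*rows))
```

For example, `is_valid_alignment(["AB", "B"], ["AB", "-B"])` holds, while rows of unequal width or with an all-gap column are rejected.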
Multiple Sequence Alignment
• Input: k sequences
• Output: optimal alignment
– Minimal width
– Score function
  • Summation over columns
  • e.g. sum of pairs
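A sum-of-pairs score adds, for every column, the scores of all pairs of rows in that column. The sketch below illustrates this; the match, mismatch, and gap values are assumptions chosen for illustration, not values from the talk.

```python
from itertools import combinations

GAP = "-"

def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score one pair of aligned symbols (illustrative values only)."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return gap
    return match if a == b else mismatch

def sum_of_pairs(rows):
    """Sum pair_score over all row pairs in each column, then over all columns."""
    total = 0
    for column in zip(*rows):
        total += sum(pair_score(a, b) for a, b in combinations(column, 2))
    return total
```

With these values, the rows `AC-`, `A-G`, `ACG` score 3 for the first column and -1 for each of the other two, for a total of 1.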
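DP solves MSA
– Build a score matrix
• k-dimensional hypercube
– An alignment is a path
– Time: (num of nodes) × (num of neighbors per node), i.e. O(n^k · 2^k) for k sequences of length n
[Figure: pairwise example with GARFIELDANDHISASSOCIATENERMAL and GARFIELDMETNERMAL on the axes; the path through the matrix gives the alignment]
GARFIELDMET---------------NERMAL
GARFIELD---ANDHISASSOCIATENERMAL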
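The textbook dynamic program behind this slide can be sketched as follows. This is a generic illustration rather than the authors' implementation: the sum-of-pairs scoring values are assumed, and `msa_dp` is a hypothetical name. Each node of the hypercube is an index tuple, and each of the 2^k − 1 nonzero 0/1 vectors is a move that emits one alignment column.

```python
from itertools import combinations, product

GAP = "-"

def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    # Illustrative scoring values (assumed, not from the talk).
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return gap
    return match if a == b else mismatch

def column_score(column):
    return sum(pair_score(a, b) for a, b in combinations(column, 2))

def msa_dp(seqs):
    """Exact MSA by DP over the k-dimensional grid of index tuples."""
    k = len(seqs)
    # Every nonzero 0/1 vector is a move: 2^k - 1 neighbors per node.
    moves = [m for m in product((0, 1), repeat=k) if any(m)]
    origin = (0,) * k
    end = tuple(len(s) for s in seqs)
    best = {origin: 0}
    back = {}
    # product() yields tuples in lexicographic order, so every
    # predecessor of a node is processed before the node itself.
    for node in product(*(range(len(s) + 1) for s in seqs)):
        if node == origin:
            continue
        for move in moves:
            prev = tuple(i - d for i, d in zip(node, move))
            if min(prev) < 0:
                continue
            column = tuple(s[i - 1] if d else GAP
                           for s, i, d in zip(seqs, node, move))
            cand = best[prev] + column_score(column)
            if node not in best or cand > best[node]:
                best[node] = cand
                back[node] = (prev, column)
    # Trace the optimal path back from the far corner to the origin.
    columns = []
    node = end
    while node != origin:
        node, column = back[node]
        columns.append(column)
    columns.reverse()
    rows = ["".join(col[r] for col in columns) for r in range(k)]
    return best[end], rows
```

Calling `msa_dp(["AC", "AG", "A"])` returns a score together with three equal-width rows. The running time grows as the number of nodes times the number of moves, which is why the approach is practical only for very small k.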
Previous Work
MSA Heuristics
• [Carrillo, Lipman 88]
• MACAW [Schuler, Altschul, Lipman 91]
• ClustalW [Thompson et al. 94]
• DIAlign [Werner, Morgenstern, Dress 96]
• T-Coffee [Notredame et al. 00]
• POA [Lee et al. 02]
• …
MSA Complexity Analysis
• Optimizing over the space of all possible inputs is NP hard [Jiang, Wang 94]
• NP hard for SP [Just 01]
• NP hard for SP that is a metric [Bonizzoni, Della Vedova 01]
Faster pairwise SA
• Assuming many common subsequences [Wilbur, Lipman 83]
• Convex/Concave score functions [Eppstein et al. 92]
• Exploiting compressibility of sequences [Landau, Crochemore, Ziv-Ukelson 02]
• …
• Review: Biological Sequence Analysis [Durbin et al.]
Pairwise Restriction
• The “true” information: the aligned subsequences and their relative positioning
• Study pairwise alignment first and restrict the alignment
– Time:
• Focus efforts on “true” tradeoffs
GARFIELDMETNERMAL
GARFIELDANDHISASSOCIATENERMAL
Segments Matching Graph (SMG)
• Sequences are partitioned into segments
• Nodes: the segments
GARFIELD ANDHISASSOCIATE NERMAL
GARFIELD MET NERMAL
ODIE ANDHISASSOCIATE NERMAL MET GARFIELD ANDHISASSOCIATE
• Edges:
– self edges
– between two equal-length segments of different sequences
– have scores
• Defines allowed paths and their score
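One way to hold such a graph in memory is sketched below. It is a simplification under stated assumptions: `build_smg` is a hypothetical helper, edges are added between identical segments of different sequences (in the talk the edges come from pairwise alignments, which this sketch does not compute), an edge's score is taken to be the segment length, and self edges get a placeholder score of 0.

```python
from itertools import combinations

def build_smg(partitions):
    """partitions[i] lists the segments of sequence i, in order.

    Nodes are (sequence index, start offset, segment text) triples.
    Edges map a pair of nodes to a score.
    """
    nodes = []
    for seq_id, segments in enumerate(partitions):
        start = 0
        for text in segments:
            nodes.append((seq_id, start, text))
            start += len(text)
    edges = {}
    for u in nodes:
        edges[(u, u)] = 0  # self edge; the score 0 is a placeholder
    for u, v in combinations(nodes, 2):
        # Assumed rule: connect identical segments of different sequences.
        if u[0] != v[0] and u[2] == v[2]:
            edges[(u, v)] = len(u[2])  # assumed score: segment length
    return nodes, edges

# The three example sequences, partitioned as on the slide.
partitions = [
    ["GARFIELD", "MET", "NERMAL"],
    ["ODIE", "ANDHISASSOCIATE", "NERMAL", "MET", "GARFIELD", "ANDHISASSOCIATE"],
    ["GARFIELD", "ANDHISASSOCIATE", "NERMAL"],
]
nodes, edges = build_smg(partitions)
```

On this example the graph has 12 nodes. The two ANDHISASSOCIATE segments of the second sequence are each linked to the one in the third sequence but not to each other, since edges only join different sequences.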
[Figure, shown over several animation frames: alignment matrix of GARFIELDANDHISASSOCIATENERMAL (segments GARFIELD, ANDHISASSOCIATE, NERMAL) against ODIEANDHISASSOCIATENERMALMETGARFIELDANDHISASSOCIATE (segments ODIE, ANDHISASSOCIATE, NERMAL, MET, GARFIELD, ANDHISASSOCIATE)]
Extreme paths:
[Figure: diagram of the sets of all paths, extreme paths, and optimal paths]
Lemma: there is an optimal path that is extreme
Improved algorithm: DP on the segments
Transitive PR-MSA
More restrictions:
• Transitivity
• Scoring function is shortest path
Faster algorithms
DNA sequences
*no scores in SMG, only matches
Maximal Directions
• Transitivity implies that for any point in the hypercube, the directions are partitioned into cliques
– Defines maximal directions
• The shortest path can be taken over maximal directions.
• Pushes down the work per node
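The partition into cliques can be illustrated on single characters. The sketch below assumes that two sequences match at a point exactly when their next characters are equal, which is a transitive relation and fits the DNA setting with matches only; in the talk the matches come from the SMG. The function name `maximal_directions` is hypothetical.

```python
def maximal_directions(seqs, point):
    """Return one 0/1 direction vector per clique of matching sequences.

    point[i] is the current position in seqs[i]. Sequences whose next
    characters are equal form a clique and advance together.
    """
    k = len(seqs)
    groups = {}
    for i, (s, p) in enumerate(zip(seqs, point)):
        if p < len(s):  # sequence i can still advance
            groups.setdefault(s[p], []).append(i)
    return [tuple(1 if i in members else 0 for i in range(k))
            for members in groups.values()]
```

For `["ACG", "ATG", "CCG"]` at point `(0, 0, 0)` this yields the two directions `(1, 1, 0)` and `(0, 0, 1)`, compared with the 2^3 − 1 = 7 directions a plain DP would try at that node.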
Obvious Directions
[Figure: two copies of the segmented example sequences, one labeled "Obvious" and one labeled "Non-Obvious" with a question mark]
Obvious Directions
• Lemma: Optimal path is found, even when making obvious decisions
• Not all nodes are relevant
• Work for every node increases to
Special Vertices
(0,0)
Straight junction
Corner junction
Thank you