Computational Biology, Part C Family Pairwise Search and Cobbling Robert F. Murphy Copyright 2000, 2001. All rights reserved

Overall Goals

Find previously unrecognized members of a family

Develop a model of a family

Possible Approaches

Model-based
  Motif-based (MEME/MAST)
  Hidden Markov model-based (HMMER)

Non-model-based
  Family Pairwise Search (FPS)

PSSMs

Motifs can be summarized and searched for using Position-Specific Scoring Matrices (PSSMs)

Calculated from a multiple alignment of a conserved region for members of a family
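
As a rough illustration of how a PSSM can be calculated from an alignment and then used to search a sequence, here is a minimal Python sketch. It assumes an ungapped alignment, a uniform background distribution, and a simple add-one pseudocount; the function names (`build_pssm`, `score_window`, `best_hit`) are hypothetical and not taken from any of the tools named in these slides.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_pssm(alignment, alphabet=ALPHABET, pseudocount=1.0):
    """Return one log-odds column per alignment position.

    Assumptions (not from the slides): ungapped alignment, uniform
    background frequencies, and a flat pseudocount for unseen residues.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(alphabet)
    total = len(alignment) + pseudocount * len(alphabet)
    pssm = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in alignment)
        column = {}
        for symbol in alphabet:
            freq = (counts[symbol] + pseudocount) / total
            column[symbol] = math.log2(freq / background)
        pssm.append(column)
    return pssm


def score_window(pssm, window):
    """Sum the per-position scores of a window as wide as the PSSM."""
    return sum(column[symbol] for column, symbol in zip(pssm, window))


def best_hit(pssm, sequence):
    """Slide the PSSM along the sequence; return (best score, start index)."""
    width = len(pssm)
    return max(
        (score_window(pssm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )
```

For example, `best_hit(build_pssm(["HKDL", "HRDL", "HKDI"]), "AAHKDLAA")` locates the motif at offset 2 of the searched sequence.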

Learning PSSMs

Unsupervised learning methods can be used to find motifs in unaligned sequences

Best characterized algorithm is MEME
  T.L. Bailey & C. Elkan (1995) Unsupervised Learning of Multiple Motifs in Biopolymers Using Expectation Maximization. Machine Learning J. 21:51-83

Problems with PSSMs

Some families are characterized by two or more “sub”-motifs with variable spacing between them

Deciding upon motif boundaries difficult

Possible information in intervening sequences lost if only motifs are used

Cobbling

Pick “most representative” protein sequence from a family

Convert it to a profile by replacing each amino acid by the corresponding column from a similarity matrix

Cobbling

For each recognized “motif” in the family, replace the corresponding section of the profile with the profile of the motif
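
The two cobbling steps (sequence to profile, then motif embedding) can be sketched in Python as follows. This is only an illustration under stated assumptions: `toy_similarity` is a made-up stand-in for a real similarity matrix such as BLOSUM62, motif positions are assumed to be known as 0-based offsets in the representative sequence, and all function names are hypothetical.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def toy_similarity(a, b):
    """Stand-in for a real similarity matrix (assumption, not BLOSUM62)."""
    return 4 if a == b else -1


def sequence_to_profile(sequence, similarity=toy_similarity):
    """Replace each residue by its column of similarity scores."""
    return [
        {aa: similarity(residue, aa) for aa in ALPHABET}
        for residue in sequence
    ]


def cobble(sequence, motifs, similarity=toy_similarity):
    """Embed motif profiles into the profile of a representative sequence.

    motifs: list of (start, motif_profile) pairs, where start is the
    0-based offset of the motif in the sequence and motif_profile is a
    list of per-position score dictionaries (for example, PSSM columns).
    """
    profile = sequence_to_profile(sequence, similarity)
    for start, motif_profile in motifs:
        end = start + len(motif_profile)
        if start < 0 or end > len(profile):
            raise ValueError("motif does not fit inside the sequence")
        # Overwrite the single-sequence columns with the motif's columns.
        profile[start:end] = [dict(column) for column in motif_profile]
    return profile
```

Positions outside the motifs keep the single-sequence similarity scores, which is how some information between motifs survives in the cobbled profile.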

Cobbling

Advantage: At least some sequence information between motifs is retained.

S. Henikoff & J.G. Henikoff (1997) Embedding strategies for effective use of information from multiple sequence alignments. Protein Science 6:698-705

Cobbler Illustration

[Figure labels: sequence of “most representative” family member; similarity scores for sequence from “most representative” family member; scores from profiles of conserved motifs]

Family Pairwise Search

For all known members of family, calculate (pairwise) homology to each sequence in database (using BLAST) and sum those scores
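
The summation can be sketched in a few lines of Python. To keep the example self-contained, BLAST is not called here: `shared_kmer_score` is a toy stand-in (an assumption for illustration only) and would be replaced by a real pairwise BLAST score in practice. All names are hypothetical.

```python
def shared_kmer_score(a, b, k=3):
    """Toy pairwise score: number of distinct k-mers two sequences share.

    This is NOT BLAST; it only stands in for a pairwise similarity score.
    """
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return len(kmers_a & kmers_b)


def family_pairwise_scores(family, database, pairwise_score=shared_kmer_score):
    """Score each database sequence by summing its pairwise scores
    against every known family member; return (name, total) pairs
    sorted with the best-scoring sequence first.

    family: list of sequences; database: dict mapping name -> sequence.
    """
    totals = {
        name: sum(pairwise_score(member, target) for member in family)
        for name, target in database.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Because the score is a plain sum, a database sequence that resembles several family members moderately can outrank one that resembles a single member strongly.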

Family Pairwise Search

Does not generate a model of the motif

Analogous to k nearest neighbor classification

Which method is best?

Compare BLAST using a randomly chosen family member, BLAST FPS, MEME, HMMER

W.N. Grundy (1998) Homology Detection via Family Pairwise Search. J. Comput. Biol. 5:479-492

Comparison Protocol

For each method
  For each known protein family
    Train with family members
    Search database for matches
    Rank by score from search
    Determine how many known family members are ranked highly

Comparison Protocol

Evaluation metric
  average ROC50
    ROC50 is the fraction of true positives detected at a threshold giving 50 false positives
    average over all families
  Bigger is better!
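
A minimal Python sketch of this metric follows. It takes the threshold to be the point in the ranked list where the 50th false positive appears, and computes the fraction of true positives found up to that point. Published definitions of ROC50 are usually stated as an area under the ROC curve up to the first 50 false positives, so treat this as an illustration of the simpler fraction described on the slide. Function names are hypothetical.

```python
def tp_fraction_at_n_false_positives(ranked_labels, n_false_positives=50):
    """Fraction of all true positives ranked above the n-th false positive.

    ranked_labels: booleans ordered best score first; True marks a known
    family member, False marks any other database sequence.
    """
    total_true = sum(ranked_labels)
    if total_true == 0:
        raise ValueError("ranked list contains no true positives")
    true_seen = 0
    false_seen = 0
    for is_member in ranked_labels:
        if is_member:
            true_seen += 1
        else:
            false_seen += 1
            if false_seen == n_false_positives:
                break  # threshold reached; stop counting
    return true_seen / total_true


def average_over_families(per_family_labels, n_false_positives=50):
    """Average the per-family fraction over all families."""
    values = [
        tp_fraction_at_n_false_positives(labels, n_false_positives)
        for labels in per_family_labels
    ]
    return sum(values) / len(values)
```

If a ranked list contains fewer than `n_false_positives` false positives, the whole list is scanned and the function returns 1.0.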

Comparison Protocol

Caution!
  True positive defined as being listed as a member of the family in the PROSITE compilation
  Some false positives could be actual family members that were missed during PROSITE compilation!
  (Should be minor effect)

Results

[Figure legend: BLAST FPS, BLAST, HMMER, MAST]

Conclusion

FPS better than single sequence BLAST

FPS better than model-based methods

Which is best (part 2)?

Compare BLAST, BLAST FPS, cobbled BLAST, cobbled BLAST FPS

W.N. Grundy and T.L. Bailey (1999) Family pairwise search with embedded motif models. Bioinformatics 15:463-470

Comparison Protocol

Evaluation metric
  rank sum
    calculate difference in ROC50 for two methods for a given family
    sort by absolute value of difference
    sum ranks of families for which one method is better than the other
  Bigger is better!
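
The rank-sum steps above can be sketched in Python as follows. The handling of edge cases is an assumption, since the slide does not specify it: families where both methods score identically are dropped, and tied absolute differences share their average rank (as in the usual signed-rank procedure). The function name is hypothetical.

```python
def signed_rank_sums(scores_a, scores_b):
    """Return (rank sum for method A, rank sum for method B).

    scores_a, scores_b: per-family ROC50 values for two methods, in the
    same family order. Rank 1 goes to the smallest absolute difference.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both methods need one score per family")
    # Per-family differences, ignoring families with identical scores.
    diffs = sorted(
        (a - b for a, b in zip(scores_a, scores_b) if a != b),
        key=abs,
    )
    sum_a = 0.0
    sum_b = 0.0
    i = 0
    while i < len(diffs):
        # Find the run of tied absolute differences starting at i.
        j = i
        while j < len(diffs) and abs(diffs[j]) == abs(diffs[i]):
            j += 1
        mean_rank = (i + 1 + j) / 2  # average of ranks i+1 .. j
        for diff in diffs[i:j]:
            if diff > 0:
                sum_a += mean_rank  # method A was better for this family
            else:
                sum_b += mean_rank  # method B was better for this family
        i = j
    return sum_a, sum_b
```

The method with the larger rank sum wins on more families, or wins by larger margins, than the other.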

Results

Conclusion

For task of finding members of a family given a reasonable number of known members of that family, cobbled FPS is best currently available method!