Speaker:
(Peter) Xiaoyong Wu, Bioinformatics, 4/28/03
Topic
Including Biological Literature Improves Homology Search
Jeffrey T. Chang, Soumya Raychaudhuri, and Russ B. Altman
(Paper Source: http://www.jeffchang.com/)
Problem
Goal of bio-sequence study: annotate the huge volume of sequence data through accurate homology recognition (e.g., reveal the possible functions and relationships of sequences for medical research)
Current approach: sequence similarity search, such as PSI-BLAST
Problem: sequence similarity is not the same as sequence homology
Idea of this paper
How do expert biologists solve this problem?
Supplementing sequence similarity with biomedical literature information
Modify PSI-BLAST: in each iteration, use literature similarity to keep the search within a sensible set of sequences
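A minimal sketch of such a per-iteration filter, assuming each hit comes with a literature document and that some similarity function returning a value between 0 and 1 is available. The function name `filter_hits_by_literature`, the `min_similarity` threshold, and the keep-or-drop rule are illustrative assumptions, not the paper's actual procedure.

```python
def filter_hits_by_literature(query_doc, hits, similarity, min_similarity=0.2):
    """Keep only the hits whose literature resembles the query's literature.

    query_doc      -- literature text associated with the query sequence
    hits           -- dict mapping sequence id -> literature text of that hit
    similarity     -- callable (doc_a, doc_b) -> float in [0, 1] (assumed)
    min_similarity -- hypothetical cutoff; hits below it are dropped
    """
    return [
        seq_id
        for seq_id, doc in hits.items()
        if similarity(query_doc, doc) >= min_similarity
    ]
```

Only the surviving sequence ids would be passed on to the next search iteration.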
Methodology
Collect the sequence information and associated literature for each sequence into one concatenated document, then remove the so-called “stop words”
Calculate document similarity (Wilbur and Yang)
A and B are the word vectors of two documents. cos(A, B) = 1 indicates similar documents; cos(A, B) = 0 indicates different documents.

$$\cos(A, B) = \frac{A \cdot B}{|A|\,|B|}$$
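The cosine measure can be sketched in a few lines of Python. This assumes A and B are already numeric vectors of equal length; returning 0 for an all-zero vector is an assumption made here to avoid division by zero.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: a document with no words is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)
```

Two documents that share no words give 0, and two documents with proportional word counts give a value of 1 (up to floating-point rounding).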
Methodology
Construct the word vectors A and B of two documents.
A = (a1, a2, a3, …, am)
B = (b1, b2, b3, …, bm)
ai and bi represent the same attribute (word)
The total set of attributes is the union of the words in documents A and B
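A small sketch of building the two word vectors over the union of words, assuming whitespace tokenization, lowercasing, and raw word counts as attribute values. The tiny `STOP_WORDS` set is a hypothetical placeholder, and the weighting actually used by the paper may differ.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "of", "and", "in", "is"}

def word_vectors(doc_a, doc_b, stop_words=STOP_WORDS):
    """Return (vocabulary, vector_a, vector_b) for two documents.

    Position i of both vectors refers to the same word, vocabulary[i].
    """
    counts_a = Counter(w for w in doc_a.lower().split() if w not in stop_words)
    counts_b = Counter(w for w in doc_b.lower().split() if w not in stop_words)
    # The attribute set is the union of the words of both documents.
    vocabulary = sorted(set(counts_a) | set(counts_b))
    vector_a = [counts_a[w] for w in vocabulary]
    vector_b = [counts_b[w] for w in vocabulary]
    return vocabulary, vector_a, vector_b
```

A word that appears in only one document gets a count of 0 in the other document's vector, so both vectors always have the same length m.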
Methodology-validation & test
Superfamily of proteins: SCOP (http://scop.berkeley.edu/) contains over 1000 protein superfamilies; proteins in one superfamily have the same function, but one protein may belong to more than one superfamily.
Gold standard: every protein belongs to exactly one superfamily; all proteins with multiple functions are removed.
Results
Recall: number of homologous sequences passing a fixed e-value cutoff (sequences in the gold standard retrieved by the modified PSI-BLAST) / total number of homologous sequences (in the gold standard)
Precision: number of homologous sequences detected / total number of sequences detected (reported by PSI-BLAST)
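The two measures can be computed as follows. This is a sketch that assumes the detected sequences (those passing the e-value cutoff) and the gold-standard homologs are given as collections of sequence ids; the function name and the handling of empty inputs are assumptions.

```python
def precision_recall(detected, homologs):
    """Return (precision, recall) for one query.

    detected -- ids of sequences reported by the search
    homologs -- ids of the true homologs according to the gold standard
    """
    detected = set(detected)
    homologs = set(homologs)
    true_hits = detected & homologs  # homologous sequences that were detected
    precision = len(true_hits) / len(detected) if detected else 0.0
    recall = len(true_hits) / len(homologs) if homologs else 0.0
    return precision, recall
```

For example, if 4 sequences are reported and 2 of them are among 3 true homologs, precision is 2/4 and recall is 2/3.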
Questions?
Thanks!