Speaker: (Peter) Xiaoyong Wu, Bioinformatics, 4/28/03


Page 1: Speaker:

Speaker:

(Peter) Xiaoyong Wu, Bioinformatics, 4/28/03

Page 2: Speaker:

Topic

Including Biological Literature Improves Homology Search

Jeffrey T. Chang, Soumya Raychaudhuri, and Russ B. Altman

(Paper Source: http://www.jeffchang.com/)

Page 3: Speaker:

Problem

Goal of biological sequence analysis: annotate the vast amount of sequence data through accurate homology recognition (e.g., reveal the possible functions and relationships of sequences for medical research)

Current approach: sequence similarity search, such as PSI-BLAST

Problem: sequence similarity ≠ sequence homology

Page 4: Speaker:

Idea of this paper

How do expert biologists solve this problem?

Supplementing sequence similarity with biomedical literature information

Modify PSI-BLAST so that, in each iteration, literature similarity restricts the sequence search to a sensible scope
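The per-iteration filtering idea can be sketched roughly as follows. This is only an illustration, not the authors' implementation: the function name `filter_hits_by_literature`, the fixed `min_similarity` threshold, the sequence identifiers, and the choice to keep hits that have no literature score are all assumptions made for the example.

```python
from typing import Dict, List


def filter_hits_by_literature(
    hits: List[str],
    literature_similarity: Dict[str, float],
    min_similarity: float = 0.1,  # hypothetical threshold, not from the paper
) -> List[str]:
    """Keep hits whose literature similarity to the query meets the threshold.

    Assumption: a hit with no literature score is kept, because there is
    no literature evidence against it.
    """
    return [
        hit
        for hit in hits
        if literature_similarity.get(hit, min_similarity) >= min_similarity
    ]


# Hypothetical hits from one search iteration and their literature scores.
hits = ["seqA", "seqB", "seqC", "seqD"]
scores = {"seqA": 0.82, "seqB": 0.03, "seqC": 0.40}

# Only the surviving hits would feed the next iteration.
kept = filter_hits_by_literature(hits, scores)
print(kept)
```

In this sketch `seqB` is dropped because its literature score falls below the threshold, while `seqD` survives only because of the "keep unscored hits" assumption.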

Page 5: Speaker:

Methodology

Page 6: Speaker:

Methodology

Concatenate the literature associated with each sequence into a single document and remove the so-called “stop words”

Calculate document similarity (Wilbur and Yang)

A and B are the word vectors of two documents. cos(A, B) = 1 means the documents are similar; cos(A, B) = 0 means they are different (no words in common).

$$\cos(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$$
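As a minimal sketch, the cosine measure cos(A, B) = (A · B) / (||A|| ||B||) could be computed like this. The function name `cosine_similarity` and the convention of returning 0 for an all-zero vector are assumptions for the example, not something the slides specify.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length word vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a document with no words is treated as similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


# Vectors pointing in the same direction give a value close to 1.
print(cosine_similarity([1, 2, 0], [2, 4, 0]))
# Vectors with no shared non-zero attribute give 0.
print(cosine_similarity([1, 0, 0], [0, 3, 0]))
```

Because word counts are never negative, the value always falls between 0 and 1 for document vectors.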

Page 7: Speaker:

Methodology

Construct the word vectors A and B of two documents.

A = (a1, a2, a3, …, am)

B = (b1, b2, b3, …, bm)

ai and bi (same index i) represent the same attribute (word). The full set of m attributes is the union of the words in documents A and B.
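One way to build the two aligned word vectors is sketched below. The tiny `STOP_WORDS` list, the simple whitespace tokenizer, and the use of raw word counts as vector entries are assumptions for illustration; the slides do not say how words are weighted.

```python
from collections import Counter
from typing import List, Tuple

# Tiny illustrative stop-word list (assumption); a real list is much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "is", "to", "by"}


def tokenize(text: str) -> List[str]:
    """Lowercase, strip simple punctuation, and drop stop words."""
    words = (w.strip(".,;:()").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]


def build_word_vectors(doc_a: str, doc_b: str) -> Tuple[List[str], List[int], List[int]]:
    """Return the shared vocabulary and one count vector per document."""
    counts_a = Counter(tokenize(doc_a))
    counts_b = Counter(tokenize(doc_b))
    # The attributes are the union of the words of both documents.
    vocabulary = sorted(set(counts_a) | set(counts_b))
    # Position i in both vectors refers to the same word, vocabulary[i].
    vec_a = [counts_a[w] for w in vocabulary]
    vec_b = [counts_b[w] for w in vocabulary]
    return vocabulary, vec_a, vec_b


vocab, A, B = build_word_vectors(
    "The kinase binds ATP in the active site.",
    "ATP hydrolysis is catalyzed by the kinase.",
)
print(vocab)
print(A)
print(B)
```

A word that appears in only one document gets a 0 in the other document's vector, which is what keeps the two vectors the same length.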

Page 8: Speaker:

Methodology-validation & test

Superfamily of proteins

SCOP (http://scop.berkeley.edu/) contains over 1000 protein superfamilies. Proteins in the same superfamily share the same function, but one protein may belong to more than one superfamily.

Gold Standard

Every protein belongs to exactly one superfamily. All proteins with multiple functions are removed.
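The gold-standard filtering step might look like the sketch below, assuming the superfamily assignments are already available as a mapping from protein to a set of superfamilies. The function name `build_gold_standard` and all identifiers in the sample data are hypothetical.

```python
from typing import Dict, Set


def build_gold_standard(superfamilies_by_protein: Dict[str, Set[str]]) -> Dict[str, str]:
    """Keep only proteins assigned to exactly one superfamily."""
    return {
        protein: next(iter(families))
        for protein, families in superfamilies_by_protein.items()
        if len(families) == 1
    }


# Hypothetical annotations: protein2 spans two superfamilies and is removed.
annotations = {
    "protein1": {"sfA"},
    "protein2": {"sfA", "sfB"},
    "protein3": {"sfB"},
}
gold = build_gold_standard(annotations)
print(gold)
```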

Page 9: Speaker:

Results

Page 10: Speaker:

Results

Recall: number of homologous sequences passing a fixed e-value cutoff (Gold Standard sequences retrieved by the modified PSI-BLAST) / total number of homologous sequences (Gold Standard)

Precision: number of homologous sequences detected / total number of sequences detected (reported by PSI-BLAST)
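A small sketch of the two measures follows, assuming the detected sequences (those passing the e-value cutoff) and the gold-standard homologs are given as sets of identifiers. The function name `precision_recall` and the sample identifiers are hypothetical, and returning 0 when a denominator is empty is a convention chosen for the example.

```python
from typing import Set, Tuple


def precision_recall(detected: Set[str], homologs: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) for one query.

    detected: sequences reported by the search at the chosen e-value cutoff.
    homologs: true homologs of the query according to the gold standard.
    """
    true_positives = len(detected & homologs)
    # Assumption: an empty denominator yields 0.0 rather than an error.
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(homologs) if homologs else 0.0
    return precision, recall


# Hypothetical result: 4 sequences reported, 2 of them among the 3 true homologs.
detected = {"s1", "s2", "s3", "s4"}
homologs = {"s1", "s2", "s5"}
p, r = precision_recall(detected, homologs)
print(p, r)
```

Here precision is 2 of 4 reported sequences and recall is 2 of 3 true homologs.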

Page 11: Speaker:

Results

Page 12: Speaker:

Questions?

Thanks!