Finding Functional Gene Relationships Using the Semantic Gene Organizer (SGO)

Kevin Heinrich

Master’s Defense

July 16, 2004

Outline

• Problem / Goals

• Related Work

• Information Retrieval
  – Vector Space Model
  – Latent Semantic Indexing (LSI)

• Biological Databases

• SGO Use & Results

Problem

• Biological tools are creating vast amounts of data.

• Current techniques are time-consuming and expensive.

• Want to know phenotype (function) from genotype (structure/sequence).

Goals

• Develop a tool to aid researchers in finding and understanding functional gene relationships.

• Use information that covers whole genome, e.g. literature.

Related Work

• Jenssen et al. (2001) developed PubGene.
  – Literature network
  – Assigns functional association if there is a co-occurrence of gene symbols

• Wilkinson and Huberman (2004) expanded this idea to find communities of related genes.

• Yandell and Majoros (2002) use natural language processing techniques to identify nature of relationships.

Related Work

• Almost all literature-based techniques rely on term co-occurrence.

• What about gene aliases?

• Solution: Apply a more robust technique.

Information Retrieval: Vector Space Model

• Documents are parsed into tokens.

• Each token is assigned a weight $w_{ij}$, the weight of the $i$th token in the $j$th document.

• An $m \times n$ term-by-document matrix $A = [w_{ij}]$ is created, where
  – Documents are $m$-dimensional vectors (the columns of $A$).
  – Tokens are $n$-dimensional vectors (the rows of $A$).
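As a rough illustration of the term-by-document matrix described above, the sketch below builds a raw-count matrix from a few toy strings. The whitespace tokenizer, the function name `term_document_matrix`, and the three example "documents" are assumptions made for this example only; they are not the parser or data used by SGO.

```python
from collections import Counter


def term_document_matrix(docs):
    """Return (terms, A) where A[i][j] is the raw count of term i in document j."""
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in docs]
    # Rows are indexed by the sorted vocabulary, columns by document order.
    terms = sorted({tok for toks in tokenized for tok in toks})
    counts = [Counter(toks) for toks in tokenized]
    A = [[c[t] for c in counts] for t in terms]
    return terms, A


# Toy documents, invented for illustration.
docs = [
    "reelin signaling pathway",
    "reelin receptor binding",
    "receptor signaling",
]
terms, A = term_document_matrix(docs)

# m = number of distinct tokens (rows), n = number of documents (columns).
m, n = len(A), len(A[0])
reelin_row = A[terms.index("reelin")]
```

Each row of `A` is a token vector of length n and each column is a document vector of length m; the raw counts would normally be replaced by weighted values before retrieval.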

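Information Retrieval: Term Weights

• Term weights are the product of a local and a global component:

$$w_{ij} = l_{ij}\, g_i\, d_j$$

• tf: $l_{ij} = f_{ij}$

• idf: $g_i = \dfrac{\sum_j f_{ij}}{\sum_j \chi(f_{ij})}$

• idf2: $g_i = \log_2\!\left(\dfrac{n}{\sum_j \chi(f_{ij})}\right) + 1$

Here $f_{ij}$ is the frequency of token $i$ in document $j$, $\chi(f) = 1$ if $f > 0$ and $0$ otherwise, and $d_j$ is a document normalization factor.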

Information Retrieval: Term Weights (cont’d)

• log-entropy:

$$l_{ij} = \log(1 + f_{ij}), \qquad g_i = 1 + \frac{\sum_j p_{ij} \log_2 p_{ij}}{\log_2 n}, \qquad p_{ij} = \frac{f_{ij}}{\sum_j f_{ij}}$$

• Goal is to give distinguishing terms more weight.
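The log-entropy scheme can be sketched in a few lines. This is a minimal illustration, not the weighting code used by SGO: the function name `log_entropy_weights` is hypothetical, the base-2 logarithm in the local weight is an assumption, and the document normalization factor is left out.

```python
import math


def log_entropy_weights(F):
    """F[i][j] is the raw frequency of term i in document j (n > 1 documents).

    Returns W with W[i][j] = l_ij * g_i, where
      l_ij = log2(1 + f_ij)                         (base 2 is an assumption)
      g_i  = 1 + sum_j p_ij * log2(p_ij) / log2(n)
      p_ij = f_ij / sum_j f_ij
    """
    n = len(F[0])
    W = []
    for row in F:
        total = sum(row)
        if total == 0:
            # A term that never occurs carries no weight.
            W.append([0.0] * n)
            continue
        # Zero frequencies contribute nothing to the entropy sum.
        entropy = sum((f / total) * math.log2(f / total) for f in row if f > 0)
        g = 1 + entropy / math.log2(n)
        W.append([math.log2(1 + f) * g for f in row])
    return W


# Toy counts: term 0 is spread evenly over 4 documents, term 1 occurs in one.
F = [
    [1, 1, 1, 1],
    [4, 0, 0, 0],
]
W = log_entropy_weights(F)
```

In this toy case the evenly spread term gets a global weight of 0 and the concentrated term gets a global weight of 1, which is the sense in which distinguishing terms receive more weight.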

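Information Retrieval: Query & Similarity

• Queries are represented by a pseudo-document vector

$$q_0 = (g_1, g_2, \ldots, g_m)$$

• Similarity is the cosine of the angle between the query and document vectors:

$$\mathrm{sim}(q_0, d_j) = \cos\theta_j = \frac{q_0 \cdot d_j}{\|q_0\|\,\|d_j\|} = \frac{\sum_{k=1}^{m} g_k w_{kj}}{\sqrt{\sum_{k=1}^{m} g_k^2}\,\sqrt{\sum_{k=1}^{m} w_{kj}^2}}$$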
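The cosine measure above translates directly into code. The sketch below is illustrative only: the helper names `cosine_similarity` and `rank_documents` and the toy vectors are assumptions, and returning 0.0 for a zero-length vector is a convention chosen here rather than something stated in the slides.

```python
import math


def cosine_similarity(q, d):
    """Cosine of the angle between query vector q and document vector d."""
    dot = sum(g * w for g, w in zip(q, d))
    norm_q = math.sqrt(sum(g * g for g in q))
    norm_d = math.sqrt(sum(w * w for w in d))
    if norm_q == 0 or norm_d == 0:
        # Convention for this sketch: an empty vector matches nothing.
        return 0.0
    return dot / (norm_q * norm_d)


def rank_documents(q, documents):
    """Return (score, index) pairs sorted from most to least similar."""
    scores = [(cosine_similarity(q, d), j) for j, d in enumerate(documents)]
    return sorted(scores, reverse=True)


# Toy example with m = 3 terms and three documents.
query = [1, 0, 1]
documents = [
    [1, 0, 1],  # same direction as the query
    [0, 1, 0],  # no shared terms
    [1, 1, 0],  # shares one term
]
ranking = rank_documents(query, documents)
ranked_indices = [j for _, j in ranking]
```

Because only the angle matters, document length does not affect the score; a document with no terms in common with the query scores 0 in this plain vector space model.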

Information Retrieval: Latent Semantic Indexing (LSI)

LSI performs a truncated singular value decomposition (SVD) on $A$:

$$A = U \Sigma V^T$$

• $U$ is the $m \times r$ matrix of eigenvectors of $AA^T$

• $V^T$ is the $r \times n$ matrix of eigenvectors of $A^TA$

• $\Sigma$ is the $r \times r$ diagonal matrix containing the $r$ nonnegative singular values of $A$

• $r$ is the rank of $A$

A rank-$k$ approximation is given by $A_k = U_k \Sigma_k V_k^T$.

Information Retrieval: LSI (cont’d)

• Document-to-document similarity is

$$A_k^T A_k = V_k \Sigma_k^2 V_k^T$$

• Queries are projected into the low-rank approximation space:

$$q = q_0^T U_k \Sigma_k^{-1}$$
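For readers who want the intermediate steps, the short derivation below sketches why the document-to-document product reduces to an expression in $V_k$ and $\Sigma_k$, and why the query projection lands in the same coordinates as the documents. It relies only on the standard SVD facts that $U_k$ has orthonormal columns ($U_k^T U_k = I$) and that $\Sigma_k$ is diagonal; the symbol $a_j$ for the $j$th column of $A_k$ is introduced here for illustration.

```latex
\begin{align*}
A_k^T A_k
  &= (U_k \Sigma_k V_k^T)^T (U_k \Sigma_k V_k^T) \\
  &= V_k \Sigma_k \, U_k^T U_k \, \Sigma_k V_k^T \\
  &= V_k \Sigma_k^2 V_k^T
   = (V_k \Sigma_k)(V_k \Sigma_k)^T .
\end{align*}

% Applying the query projection to a document column a_j of A_k:
\begin{align*}
A_k^T U_k \Sigma_k^{-1}
  &= V_k \Sigma_k \, U_k^T U_k \, \Sigma_k^{-1}
   = V_k ,
\qquad\text{so}\qquad
a_j^T U_k \Sigma_k^{-1} = \text{row } j \text{ of } V_k .
\end{align*}
```

Under these assumptions, each document corresponds to a row of $V_k$ (scaled by $\Sigma_k$ when computing similarities), and a projected query $q = q_0^T U_k \Sigma_k^{-1}$ can be compared against those rows directly.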

Information Retrieval: LSI (cont’d)

• Scaled document vectors can be computed once and stored for quick retrieval.

• The lower-dimensional space forces queries and documents to be compared in a more conceptual manner and saves storage.

• Choice of number of factors is an open question.

• End Effect: LSI can find similarities between documents that have no term co-occurrence.

Information Retrieval: Evaluation Measures

• Precision – ratio of relevant returned documents to the total number of returned documents.

• Recall – ratio of relevant returned documents to the total number of relevant documents.

• Goal is to have high precision at all levels of recall.

• Systems are often evaluated by average precision (AP), which is the average of the 11 interpolated precision values at recall levels of 0%, 10%, ..., 100%.
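The 11-point measure can be made concrete with a small sketch. It assumes the common interpolation rule (precision at a recall level is the highest precision observed at any recall at or above that level); the function name and the gene identifiers are made up for the example and do not reproduce the evaluation code or data behind the reported results.

```python
def eleven_point_average_precision(ranked, relevant):
    """11-point interpolated average precision for one ranked result list.

    ranked:   item ids in rank order (best first)
    relevant: collection of ids judged relevant (must be non-empty)
    """
    relevant = set(relevant)
    hits = 0
    points = []  # (recall, precision) measured at each relevant item
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))

    interpolated = []
    for i in range(11):
        level = i / 10  # recall levels 0.0, 0.1, ..., 1.0
        # Interpolated precision: best precision at any recall >= level.
        candidates = [p for r, p in points if r >= level]
        interpolated.append(max(candidates, default=0.0))
    return sum(interpolated) / 11


# Toy ranking: two relevant genes, returned at ranks 1 and 3.
ap = eleven_point_average_precision(["g1", "g2", "g3", "g4"], {"g1", "g3"})
perfect = eleven_point_average_precision(["g1", "g3", "g2", "g4"], {"g1", "g3"})
```

In the toy ranking, precision is 1.0 up to 50% recall and 2/3 beyond it, so the score is (6 × 1.0 + 5 × 2/3) / 11 = 28/33; a ranking that returns every relevant item first scores 1.0.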

Biological Databases: MEDLINE

• MEDLINE (NLM)
  – Contains 14+ million references to journal articles with a concentration in medicine
  – Spans over 4,600 journals worldwide
  – 1966 to present
  – ~500,000 citations added annually
  – Each citation is manually indexed with MeSH terms.

Biological Databases: PubMed

• PubMed
  – Retrieves articles from MEDLINE and other journals.
  – Can be queried via any combination of attributes.

Biological Databases: LocusLink

• NCBI human-curated database

• Single query interface to a comprehensive directory for genes and gene reference sequences for key genomes.

• Provides links to related records in PubMed and other citations when applicable.

• Provides RefSeq Summary of gene function and links to key MEDLINE citations relevant to each gene.

Biological Databases: Overview

• MEDLINE has lots of information
  – Not all articles relate to genes
  – Gene terminology problem

• LocusLink does not cover all relevant citations, only a representative few.

Biological Databases: Gene Document Construction

• Concatenate titles and abstracts of MEDLINE citations cross-referenced in Human, Rat, and Mouse LocusLink entries.

• Sequencing abstracts are included, which adds noise.

• LocusLink references are not comprehensive, so recall of all relevant abstracts is not guaranteed.

SGO

• Primarily uses LSI to rank genes.

• Enables user to specify query method
  – Gene query
  – Keyword query
  – Number of factors
  – Show latent matches

• Saves previous query sessions.

SGO: Interface

SGO: Interface (cont’d)

SGO: Trees

• Unfortunately, ranked lists mean little to biologists.

• Pairwise distances can be formed into a matrix

$$D = [d_{ij}], \qquad d_{ij} = 1 - \cos\theta_{ij}$$

where $\cos\theta_{ij}$ is the similarity between documents $i$ and $j$.
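A minimal sketch of forming the distance matrix from a set of document vectors follows. The function names and the two-dimensional toy vectors are assumptions for illustration; in SGO the vectors would be the gene documents in the reduced LSI space, and treating a zero vector as having similarity 0 is a convention chosen here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Convention for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def distance_matrix(vectors):
    """Return D with D[i][j] = 1 - cos(theta_ij) for every pair of vectors."""
    n = len(vectors)
    return [[1.0 - cosine(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]


# Toy document vectors, invented for the example.
vectors = [[1, 0], [0, 1], [1, 1]]
D = distance_matrix(vectors)
```

The result is symmetric with a zero diagonal (up to floating-point rounding), which is the form a distance-based tree-building method expects as input.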

SGO: Trees (cont’d)

• The Fitch-Margoliash (1967) method in PHYLIP is applied to D to generate hierarchical trees.

• Thresholds can be applied to the self-similarity matrix to produce graphs.
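Thresholding a self-similarity matrix into a graph can be sketched as follows. The function name `threshold_graph`, the gene labels, the similarity values, and the 0.5 cutoff are all invented for illustration; the slides do not state which threshold SGO uses.

```python
def threshold_graph(labels, sim, threshold):
    """Return edges (label_i, label_j, similarity) for pairs at or above threshold.

    sim is a symmetric self-similarity matrix; only the upper triangle is
    scanned so each undirected edge appears once and self-loops are skipped.
    """
    edges = []
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if sim[i][j] >= threshold:
                edges.append((labels[i], labels[j], sim[i][j]))
    return edges


# Hypothetical labels and similarities.
labels = ["geneA", "geneB", "geneC"]
sim = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.4],
    [0.1, 0.4, 1.0],
]
edges = threshold_graph(labels, sim, threshold=0.5)
```

Raising the threshold yields a sparser graph with only the strongest associations; lowering it connects more genes at the cost of weaker links.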

SGO: Hierarchical Tree

SGO: Graph or Nodal Tree

SGO: Coding Issues

• Web interface must be interactive
  – Queries are processed on click
  – Document collections are parsed offline
  – Trees are constructed offline

• Storage will eventually become an issue.

Results: Test Data Set

• A 50-gene test data set was constructed.
  – Alzheimer’s Disease
  – Cancer
  – Development

• Reelin signaling pathway used as the basis for evaluation
  – 5 primary genes (directly associated)
  – 7 secondary genes (indirectly associated)

Results: Primary AP

• AP for 5 primary genes
  – 61% for 5 factors
  – 84% for 25 factors
  – 84% for 50 factors

Results: Secondary AP

• AP for 12 secondary genes
  – 53% for 5 factors
  – 59% for 25 factors
  – 61% for 50 factors

Results: Comparison

• LSI comparable to tf-idf for 5 primary genes

• Far superior to tf-idf for 12 secondary genes
  – PubMed co-citation identifies 2 of the 7 indirectly related genes
  – Abstract overlap of LocusLink citations fails to identify any indirectly related genes

• tf-idf fails on many keyword queries

• Tested on Gene Ontology classifications (not shown)
  – Similar tendencies are observed

Results: Abstract Representation

• To simulate scaling up, decrease representation of reelin-related genes

• AP of 47% on 20,856 Human LocusLink abstracts

Results: Hierarchical Tree

Conclusions

• SGO allows genes to be compared to each other and to keywords (functions).

• SGO identifies latent relationships with promising accuracy.

• SGO is not meant to replace existing technologies, but to assist researchers
  – Verify current results
  – Direct future exploration

Future Work

• Scale up to entire genome

• Document construction

• Incorporate structural or other information for multi-modal similarity

• Test other models, e.g., NMF and QR.

• Interactive tree building

• Keep collections current
