Finding Functional Gene Relationships Using the Semantic Gene Organizer (SGO) Kevin Heinrich Master’s Defense July 16, 2004

Page 1: Finding Functional Gene Relationships Using the Semantic Gene Organizer (SGO) Kevin Heinrich Master’s Defense July 16, 2004

Finding Functional Gene Relationships Using the Semantic Gene Organizer (SGO)

Kevin Heinrich

Master’s Defense

July 16, 2004

Page 2

Outline

• Problem / Goals

• Related Work

• Information Retrieval
  – Vector Space Model
  – Latent Semantic Indexing (LSI)

• Biological Databases

• SGO Use & Results

Page 3

Problem

• Biological tools are creating vast amounts of data.

• Current techniques are time-consuming and expensive.

• Want to know phenotype (function) from genotype (structure/sequence).

Page 4

Goals

• Develop a tool to aid researchers in finding and understanding functional gene relationships.

• Use information that covers whole genome, e.g. literature.

Page 5

Related Work

• Jenssen et al. (2001) developed PubGene.
  – Literature network
  – Assigns functional association if there is a co-occurrence of gene symbols

• Wilkinson and Huberman (2004) expanded this idea to find communities of related genes.

• Yandell and Majoros (2002) use natural language processing techniques to identify the nature of relationships.

Page 6

Related Work

• Almost all literature-based techniques rely on term co-occurrence.

• What about gene aliases?

• Solution: Apply a more robust technique.

Page 7

Information Retrieval: Vector Space Model

• Documents are parsed into tokens.

• Each token is assigned a weight $w_{ij}$, the weight of the $i$th token in the $j$th document.

• An $m \times n$ term-by-document matrix $A = [w_{ij}]$ is created, where
  – Documents are $m$-dimensional vectors (the columns of $A$).
  – Tokens are $n$-dimensional vectors (the rows of $A$).
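A minimal sketch of this construction, assuming a simple lowercase alphanumeric tokenizer and raw counts as the weights $w_{ij}$. The function name `term_document_matrix` and the sample sentences are illustrative only and are not taken from SGO:

```python
import re
from collections import Counter

import numpy as np


def term_document_matrix(docs):
    """Return (vocab, A), where A[i, j] is the count of token i in document j."""
    # Assumed tokenizer: lowercase runs of letters and digits.
    tokenized = [re.findall(r"[a-z0-9]+", d.lower()) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    index = {t: i for i, t in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))  # m tokens by n documents
    for j, toks in enumerate(tokenized):
        for tok, count in Counter(toks).items():
            A[index[tok], j] = count
    return vocab, A


docs = [
    "reelin binds the receptor",
    "the receptor signals",
    "reelin reelin pathway",
]
vocab, A = term_document_matrix(docs)
print(vocab)
print(A)
```

Each column of `A` is one document and each row is one token, matching the $m \times n$ layout on the slide.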

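Page 8

Information Retrieval: Term Weights

• Term weights are the product of a local component $l_{ij}$ and a global component $g_i$, together with a document normalization factor $d_j$:

$$w_{ij} = l_{ij}\, g_i\, d_j$$

• tf (local), where $f_{ij}$ is the frequency of token $i$ in document $j$:

$$l_{ij} = f_{ij}$$

• idf (global):

$$g_i = \frac{\sum_j f_{ij}}{\sum_j \chi(f_{ij})}$$

• idf2 (global), where $n$ is the number of documents:

$$g_i = \log_2\!\left(\frac{n}{\sum_j \chi(f_{ij})}\right) + 1$$

Here $\chi(f_{ij})$ is taken to be 1 when $f_{ij} > 0$ and 0 otherwise.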

Page 9

Information Retrieval: Term Weights (cont’d)

• log-entropy

$$l_{ij} = \log(1 + f_{ij})$$

$$g_i = 1 + \sum_j \frac{p_{ij} \log_2 p_{ij}}{\log_2 n}, \qquad p_{ij} = \frac{f_{ij}}{\sum_j f_{ij}}$$

• Goal is to give distinguishing terms more weight.
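The log-entropy scheme can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the SGO implementation: the local logarithm is taken as the natural log (the slide does not give a base), $0 \log 0$ is treated as 0, there is no document normalization, and the collection is assumed to hold more than one document so that $\log_2 n \neq 0$. The name `log_entropy` is hypothetical.

```python
import numpy as np


def log_entropy(F):
    """Weight an m x n raw frequency matrix F (tokens x documents)."""
    F = np.asarray(F, dtype=float)
    n = F.shape[1]  # number of documents, assumed > 1

    # Local weight l_ij = log(1 + f_ij); natural log is an assumption.
    local = np.log1p(F)

    # p_ij = f_ij / sum_j f_ij, leaving all-zero rows at zero.
    row_sums = F.sum(axis=1, keepdims=True)
    P = np.divide(F, row_sums, out=np.zeros_like(F), where=row_sums > 0)

    # p_ij * log2(p_ij), with the convention 0 * log 0 = 0.
    plogp = np.zeros_like(P)
    mask = P > 0
    plogp[mask] = P[mask] * np.log2(P[mask])

    # Global weight g_i = 1 + sum_j p_ij log2 p_ij / log2 n.
    g = 1.0 + plogp.sum(axis=1) / np.log2(n)
    return local * g[:, None]


F = np.array([[1, 1, 1, 1],   # token spread evenly over all documents
              [4, 0, 0, 0]])  # token concentrated in one document
print(log_entropy(F))
```

A token spread evenly over every document gets a global weight of 0, while a token confined to a single document gets a global weight of 1, which is the "distinguishing terms" behavior the slide describes.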

Page 10

Information Retrieval: Query & Similarity

• Queries are represented by a pseudo-document vector, where $g_k$ is the weight of the $k$th token in the query:

$$q_0 = (g_1, g_2, \ldots, g_m)$$

• Similarity is the cosine of the angle between document vectors.

$$\mathrm{sim}(q_0, d_j) = \cos\theta_j = \frac{q_0 \cdot d_j}{\|q_0\|\,\|d_j\|} = \frac{\sum_{k=1}^{m} w_{kj}\, g_k}{\sqrt{\sum_{k=1}^{m} w_{kj}^2}\;\sqrt{\sum_{k=1}^{m} g_k^2}}$$
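As a small illustration of the cosine measure, the sketch below scores every column of a term-by-document matrix against a query vector. The function name `cosine_scores` and the toy matrix are assumptions made for the example; zero-length vectors are given a score of 0 by convention.

```python
import numpy as np


def cosine_scores(q, A):
    """Cosine of the angle between query q (length m) and each column of A (m x n)."""
    q = np.asarray(q, dtype=float)
    A = np.asarray(A, dtype=float)
    numerator = q @ A  # sum_k w_kj * g_k for every document j
    denominator = np.linalg.norm(q) * np.linalg.norm(A, axis=0)
    # Documents (or queries) with zero norm get a score of 0.
    return np.divide(numerator, denominator,
                     out=np.zeros_like(numerator), where=denominator > 0)


A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
q = np.array([1.0, 0.0])
print(cosine_scores(q, A))
```

Sorting documents by these scores in decreasing order gives the ranked list returned for a query.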

Page 11

Information Retrieval: Latent Semantic Indexing (LSI)

LSI performs a truncated SVD on the term-by-document matrix $A$:

$$A = U \Sigma V^T$$

• $U$ is the $m \times r$ matrix of eigenvectors of $A A^T$

• $V^T$ is the $r \times n$ matrix of eigenvectors of $A^T A$

• $\Sigma$ is the $r \times r$ diagonal matrix containing the $r$ nonnegative singular values of $A$

• $r$ is the rank of $A$

A rank-$k$ approximation is given by $A_k = U_k \Sigma_k V_k^T$
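The rank-$k$ approximation can be sketched with NumPy's dense SVD. This is only an illustration: a collection of realistic size would normally use a sparse or iterative solver, and the helper name `truncated_svd` is hypothetical.

```python
import numpy as np


def truncated_svd(A, k):
    """Return U_k, the k largest singular values, and V_k^T for matrix A."""
    # full_matrices=False gives the reduced ("economy") factorization.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


# Toy matrix of rank 2: the third row is the sum of the first two.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0]])

Uk, sk, Vtk = truncated_svd(A, 2)
Ak = Uk @ np.diag(sk) @ Vtk  # A_k = U_k Sigma_k V_k^T
print(np.round(Ak, 6))
```

Because the toy matrix has rank 2, keeping two factors reproduces it up to rounding; choosing $k$ below the rank discards the smallest singular values.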

Page 12

Information Retrieval: LSI (cont’d)

• Document-to-document similarity is

$$A_k^T A_k = V_k \Sigma_k^2 V_k^T = (V_k \Sigma_k)(V_k \Sigma_k)^T$$

• Queries are projected into low-rank approximation space

$$q = q_0^T U_k \Sigma_k^{-1}$$
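The two LSI formulas can be checked numerically with a short sketch. The names `lsi_fit` and `project_query` and the toy matrix are assumptions for illustration, and the code assumes the $k$ retained singular values are nonzero.

```python
import numpy as np


def lsi_fit(A, k):
    """Return U_k, the k largest singular values, and V_k (documents as rows)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :].T


def project_query(q0, Uk, sk):
    """Project a pseudo-document q0 into the k-dimensional space: q0^T U_k Sigma_k^{-1}."""
    return (q0 @ Uk) / sk


A = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])
Uk, sk, Vk = lsi_fit(A, 2)

# Scaled document vectors: the rows of V_k Sigma_k.
scaled_docs = Vk * sk
# Document-to-document similarity A_k^T A_k = (V_k Sigma_k)(V_k Sigma_k)^T.
doc_sim = scaled_docs @ scaled_docs.T

# Using document 2 itself as the query recovers its row of V_k.
q_hat = project_query(A[:, 2], Uk, sk)
print(np.round(q_hat, 6), np.round(Vk[2], 6))
```

Projecting an existing document column returns that document's row of $V_k$, which is why a projected query can be compared directly with the stored document vectors.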

Page 13

Information Retrieval: LSI (cont’d)

• Scaled document vectors can be computed once and stored for quick retrieval.

• The lower-dimensional space forces queries and documents to be compared in a more conceptual manner and saves storage.

• Choice of number of factors is an open question.

• End Effect: LSI can find similarities between documents that have no term co-occurrence.

Page 14

Information Retrieval: Evaluation Measures

• Precision – ratio of relevant returned documents to the total number of returned documents.

• Recall – ratio of relevant returned documents to the total number of relevant documents.

• Goal is to have high precision at all levels of recall.

• Systems are often evaluated by average precision (AP), which is the average of 11 interpolated precision values taken at recall levels of 0%, 10%, ..., 100%.
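A minimal sketch of 11-point interpolated average precision, assuming the usual convention that the interpolated precision at a recall level is the highest precision observed at that recall or any higher one. The function name `eleven_point_ap` and the example ranking are illustrative.

```python
def eleven_point_ap(ranked_relevance, total_relevant):
    """ranked_relevance: booleans in rank order, True where the item is relevant."""
    points = []  # (recall, precision) after each rank
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
        points.append((hits / total_relevant, hits / rank))

    interpolated = []
    for level in (i / 10 for i in range(11)):  # recall 0.0, 0.1, ..., 1.0
        # Best precision at this recall level or beyond; 0 if never reached.
        reachable = [p for recall, p in points if recall >= level]
        interpolated.append(max(reachable) if reachable else 0.0)
    return sum(interpolated) / len(interpolated)


# Relevant items returned at ranks 1 and 3, out of 2 relevant items in total.
print(eleven_point_ap([True, False, True, False], total_relevant=2))
```

In this example the interpolated precision is 1 at the six recall levels up to 50% and 2/3 at the five levels above it, so the score is (6 + 10/3) / 11.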

Page 15

Biological Databases: MEDLINE

• MEDLINE (NLM)
  – Contains 14+ million references to journal articles with a concentration in medicine
  – Spans over 4,600 journals worldwide
  – 1966 to present
  – ~500,000 citations added annually
  – Each citation is manually indexed with MeSH terms.

Page 16

Biological Databases: PubMed

• PubMed
  – Retrieves articles from MEDLINE and other journals.
  – Can be queried via any combination of attributes.

Page 17

Biological Databases: LocusLink

• NCBI human-curated database

• Single query interface to a comprehensive directory for genes and gene reference sequences for key genomes.

• Provides links to related records in PubMed and other citations when applicable.

• Provides RefSeq Summary of gene function and links to key MEDLINE citations relevant to each gene.

Page 18

Biological Databases: Overview

• MEDLINE has a lot of information
  – Not all articles relate to genes
  – Gene terminology problem

• LocusLink does not cover all relevant citations, but a representative few.

Page 19

Biological Databases: Gene Document Construction

• Concatenate titles and abstracts of MEDLINE citations cross-referenced in Human, Rat, and Mouse LocusLink entries.

• Sequencing abstracts are included, which adds noise.

• LocusLink references are not comprehensive, so recall of all relevant abstracts is not guaranteed.
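The concatenation step can be sketched as below. The data layout is entirely hypothetical: `gene_to_pmids` maps a gene symbol to the citation identifiers cross-referenced for it, and `citations` maps an identifier to its title and abstract. The placeholder records are invented for the example and are not real LocusLink or MEDLINE data.

```python
def build_gene_documents(gene_to_pmids, citations):
    """Concatenate titles and abstracts of each gene's cross-referenced citations."""
    documents = {}
    for gene, pmids in gene_to_pmids.items():
        parts = []
        for pmid in pmids:
            record = citations.get(pmid)
            if record is None:
                continue  # cross-reference with no citation text available
            parts.append(record["title"])
            parts.append(record["abstract"])
        documents[gene] = " ".join(parts)
    return documents


# Placeholder data, invented for illustration only.
gene_to_pmids = {"GENE_A": ["1", "2"], "GENE_B": ["3", "404"]}
citations = {
    "1": {"title": "T1", "abstract": "A1"},
    "2": {"title": "T2", "abstract": "A2"},
    "3": {"title": "T3", "abstract": "A3"},
}
print(build_gene_documents(gene_to_pmids, citations))
```

Each resulting string is one "gene document", which then becomes one column of the term-by-document matrix.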

Page 20

SGO

• Primarily uses LSI to rank genes.

• Enables user to specify query method
  – Gene query
  – Keyword query
  – Number of factors
  – Show latent matches

• Saves previous query sessions.

Page 21

SGO: Interface

Page 22

SGO: Interface (cont’d)

Page 23

SGO: Trees

• Unfortunately, ranked lists mean little to biologists.

• Pairwise distances can be formed into a matrix

$$D = [d_{ij}], \qquad d_{ij} = 1 - \cos\theta_{ij}$$

where $\cos\theta_{ij}$ is the similarity between documents $i$ and $j$

Page 24

SGO: Trees (cont’d)

• Fitch-Margoliash (1967) method in PHYLIP is applied to D to generate hierarchical trees.

• Thresholds can be applied to self-similarity matrix to produce graphs.
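The distance matrix and the thresholding step can be sketched as follows. The tree-building step is not shown, since it is delegated to PHYLIP. The function names, the toy vectors, and the choice to keep an edge when the similarity meets or exceeds the threshold are assumptions for illustration; every document vector is assumed to be nonzero.

```python
import numpy as np


def similarity_and_distance(X):
    """X holds one document (gene) vector per row; returns (S, D) with D = 1 - S."""
    X = np.asarray(X, dtype=float)
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)  # rows assumed nonzero
    S = unit @ unit.T  # S[i, j] = cos(theta_ij)
    return S, 1.0 - S


def threshold_graph(S, min_similarity):
    """Edges (i, j), i < j, whose similarity is at least min_similarity."""
    n = S.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if S[i, j] >= min_similarity]


X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])
S, D = similarity_and_distance(X)
print(np.round(D, 3))
print(threshold_graph(S, 0.5))
```

Raising the threshold removes weaker edges, so the same similarity matrix can yield graphs of different density.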

Page 25

SGO: Hierarchical Tree

Page 26

SGO: Graph or Nodal Tree

Page 27

SGO: Coding Issues

• Web interface – must be interactive
  – Queries are processed on click
  – Document collections are parsed offline
  – Trees are constructed offline

• Storage will eventually become an issue.

Page 28

Results: Test Data Set

• A 50-gene test data set was constructed.
  – Alzheimer’s Disease
  – Cancer
  – Development

• Reelin signaling pathway used as basis for evaluation
  – 5 primary genes (directly associated)
  – 7 secondary genes (indirectly associated)

Page 29

Results: Primary AP

• AP for 5 primary genes
  – 61% for 5 factors
  – 84% for 25 factors
  – 84% for 50 factors

Page 30

Results: Secondary AP

• AP for 12 secondary genes
  – 53% for 5 factors
  – 59% for 25 factors
  – 61% for 50 factors

Page 31

Results: Comparison

• LSI comparable to tf-idf for 5 primary genes

• Far superior to tf-idf for 12 secondary genes
  – PubMed co-citation identifies 2 of the 7 indirectly related genes
  – Abstract overlap of LocusLink citations fails to identify any indirectly related genes

• tf-idf fails on many keyword queries

• Tested on Gene Ontology classifications (not shown)
  – Similar tendencies are observed

Page 32

Results: Abstract Representation

• To simulate scaling up, decrease representation of reelin-related genes

• AP of 47% on 20,856 Human LocusLink abstracts

Page 33

Results: Hierarchical Tree

Page 34

Results: Hierarchical Tree

Page 35

Results: Hierarchical Tree

Page 36

Conclusions

• SGO allows genes to be compared to each other and to keywords (functions).

• SGO identifies latent relationships with promising accuracy.

• SGO is not meant to replace existing technologies, but to assist researchers
  – Verify current results
  – Direct future exploration

Page 37

Future Work

• Scale up to entire genome

• Document construction

• Incorporate structural or other information for multi-modal similarity

• Test other models e.g. NMF, QR, etc.

• Interactive tree building

• Keep collections current