Three Approaches to GO-Tagging Biomedical Abstracts. Neil Davis, Henk Harkema, Rob Gaizauskas, Yikun Guo (University of Sheffield); Jon Ratcliffe (InforSense); Moustafa Ghanem, Tom Barnwell, Yike Guo (Imperial College London). Symposium on Semantic Mining in Biomedicine 2006, 12/4/6


Three Approaches to GO-Tagging Biomedical Abstracts

Neil Davis, Henk Harkema, Rob Gaizauskas, Yikun Guo
University of Sheffield

Jon Ratcliffe
InforSense

Moustafa Ghanem, Tom Barnwell, Yike Guo
Imperial College London

Symposium on Semantic Mining in Biomedicine 2006

12/4/6


SMBM 2006

Introduction

• On-going explosive growth of biomedical literature

• Text Mining techniques can help through:
  • Extractive processes: extracting terms or facts from papers for searching and linking
  • Structuring processes: grouping papers based on content for conceptual navigation of large document collections

• GO-tag project:
  • Annotating biomedical papers with terms from the Gene Ontology


Gene Ontology

• Provides common descriptive framework for genes and gene products across species

• Consists of three structured, controlled vocabularies (ontologies) that describe genes and gene products in terms of:
  • Biological processes
  • Cellular components
  • Molecular functions


Gene Ontology

• Contains almost 20,000 terms

• GO Slim (87 terms): subset of all GO terms
  • Aims to give broad overview of ontology content
  • Can be species-specific

• Typical GO term

  Term name: isotropic cell growth
  Accession: GO:0051210
  Ontology: biological_process
  Synonyms: related: uniform cell growth
  Definition: “The process by which a cell irreversibly increases in size uniformly in all directions. In general, a rounded cell morphology reflects isotropic cell growth.”
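A GO term record like the one above maps naturally onto a small data structure. The following is only an illustrative sketch: the class name `GOTerm` and its field names are assumptions made for this example, not part of GO or of the authors' system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GOTerm:
    """Hypothetical container for one GO term record."""
    accession: str          # e.g. "GO:0051210"
    name: str               # term name
    ontology: str           # biological_process, cellular_component or molecular_function
    synonyms: tuple = ()    # synonym strings, if any
    definition: str = ""    # free-text definition


# The example term shown on the slide.
example = GOTerm(
    accession="GO:0051210",
    name="isotropic cell growth",
    ontology="biological_process",
    synonyms=("uniform cell growth",),
    definition=(
        "The process by which a cell irreversibly increases in size "
        "uniformly in all directions. In general, a rounded cell "
        "morphology reflects isotropic cell growth."
    ),
)
```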


Common Use of GO

• Associations of genes and gene products with GO terms in model organism and protein databases
  • FlyBase, SGD, MGD

• For example (from SGD):

Gene  GO Annotation                           References                                                    Evidence Code
ACT1  Structural constituent of cytoskeleton  Botstein D, et al. (1997) The yeast cytoskeleton              Traceable Author Statement
ACT1  Exocytosis                              Pruyne D and Bretscher (2000) Polarization of … in yeast      Traceable Author Statement
                                              Botstein D, et al. (1997) The yeast cytoskeleton              Traceable Author Statement
ACT1  Histone acetyltransferase complex       Galarneau L, et al. (2000) Multiple links between the NuA4 …  Inferred from Direct Assay


GO-Tagging

• Task: given a text (PubMed abstract) and GO/GO Slim, assign 0 or more GO terms to the text if the text is “about” the process/component/function identified by the GO term

• Only most specific terms are assigned

• No association of GO term with specific genes or gene products

• User scenarios:
  • Research scientists: clustering of PubMed search results
  • Database curators: identifying texts that may support Gene-GO term associations
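The "only most specific terms" rule can be sketched as a filter over the ontology's parent links. This is a minimal illustration under assumptions: the function name `most_specific`, the `parents` mapping (term ID to its direct parents), and the placeholder IDs in the usage note are all hypothetical, and the slides do not say how the authors implemented this step.

```python
def most_specific(terms, parents):
    """Keep only terms that are not an ancestor of another assigned term.

    terms:   iterable of assigned GO term IDs
    parents: dict mapping a term ID to an iterable of its direct parent IDs
    """
    def ancestors(term):
        # Collect all ancestors by walking the parent links upward.
        seen = set()
        stack = list(parents.get(term, ()))
        while stack:
            parent = stack.pop()
            if parent not in seen:
                seen.add(parent)
                stack.extend(parents.get(parent, ()))
        return seen

    assigned = set(terms)
    covered = set()
    for term in assigned:
        covered |= ancestors(term)
    return assigned - covered
```

For example, with placeholder IDs where `GO:c` is a child of `GO:b` and `GO:b` a child of `GO:a`, assigning both `GO:a` and `GO:c` would leave only `GO:c`.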


Outline of Rest of Talk

• Data sets / Gold standards
  • SGD Gold Standard
  • IC Gold Standard

• Three approaches to GO-tagging
  • Lexical look-up
  • Information retrieval approach
  • Machine learning

• Evaluation results

• Conclusions


SGD Gold Standard

• Derive Gold Standard from SGD model organism database (yeast)

• Given the annotated genes in SGD, assign a GO term T to a paper P if the paper P is referenced in support of a Gene-GO term association involving T

• SGD Gold Standard
  • 4922 PMIDs
  • 2455 GO terms
  • 10485 PMID-GO term pairs
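The derivation rule above (assign term T to paper P if P is cited in support of a Gene-GO term association involving T) can be sketched as a simple grouping step. The input layout, a sequence of `(gene, go_term, pmid)` tuples, and the function name are assumptions for illustration; the actual SGD export format is not described in the slides.

```python
from collections import defaultdict


def derive_gold_standard(annotations):
    """Group GO terms by the paper cited as evidence.

    annotations: iterable of (gene, go_term, pmid) tuples (assumed layout)
    returns:     dict mapping each PMID to the set of GO terms it supports
    """
    gold = defaultdict(set)
    for _gene, go_term, pmid in annotations:
        # The gene is deliberately dropped: GO-tagging assigns terms to
        # papers, not to specific genes or gene products.
        gold[pmid].add(go_term)
    return dict(gold)
```

The number of PMID-GO term pairs is then the sum of the set sizes over all PMIDs.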


SGD Gold Standard

• Advantages
  • SGD data already exists – no further annotation work required
  • More Gold Standard data from other model organism databases

• Disadvantage
  • List of Gene-GO term assignments in SGD is incomplete for our task
    • Each paper is associated with GO terms whose assignment to specific genes it supports, but the paper may be missing other GO terms which can also be legitimately attached to it
    • List does not contain all papers supporting a given assignment

• Consequence
  • SGD Gold Standard is “GO-term incomplete”
    • Weak measure of Recall
    • Precision figures difficult to interpret


SGD Gold Standard

• Further issue:
  • SGD Gene-GO term assignments are based on full papers, whereas system only has access to abstracts

• Consequence:
  • Limit on maximum Recall obtainable by system


IC Gold Standard

• Manually extend SGD Gold Standard to obtain GO-term complete annotation

• Select SGD papers for which all GO term assignments are supported by abstract or title

• Semi-automatically add further GO terms by fuzzy term matching + post-editing

• IC Gold Standard
  • 785 PMIDs
  • 1006 GO terms
  • 5170 PMID-GO term pairs


IC Gold Standard

• Advantage
  • Closer to GO-term complete Gold Standard

• Disadvantages
  • Still not GO-term complete
    • Direct mentions of GO terms vs. semantically inferred GO terms
  • Gold Standard creation method favors lexical look-up approach to GO-tagging
  • Data set is small




Lexical Look-Up

• (Task: given a text (PubMed abstract) and GO/GO Slim, assign 0 or more GO terms to the text if the text is “about” the process/component/function identified by the GO term)

• GO term T is assigned to a paper if term T occurs in the abstract of the paper

• Simple & fast baseline

• GO terms recognized in text can be used as features in Machine Learning approach


Lexical Look-Up

• Web service calls to Termino term tagger

• Term classes in Termino
  • GO terms
  • GO term synonyms
  • SGD yeast gene names

• Lexical look-up method
  • Case-insensitive
  • Simple morphological analysis
    • Cells mapped onto cell
    • Mitochondrial, mitochondria not mapped onto mitochondrion
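The look-up behaviour described above can be approximated in a few lines. This is a hedged sketch, not Termino's implementation or API: `normalize` and `lexical_lookup` are hypothetical names, and the plural-stripping rule is a crude stand-in for "simple morphological analysis". It reproduces the two examples on the slide: "Cells" is mapped onto "cell", while "mitochondria" is not mapped onto "mitochondrion".

```python
import re


def normalize(text):
    """Lower-case, tokenize, and strip a plural 's' (cells -> cell)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [
        t[:-1] if len(t) > 3 and t.endswith("s") and not t.endswith("ss") else t
        for t in tokens
    ]


def lexical_lookup(abstract, go_terms):
    """Return the IDs of GO terms whose name occurs in the abstract.

    go_terms: dict mapping a GO ID to its term name (assumed layout)
    """
    doc = normalize(abstract)
    hits = set()
    for go_id, name in go_terms.items():
        term = normalize(name)
        n = len(term)
        # Match the term as a contiguous token sequence in the abstract.
        if n and any(doc[i:i + n] == term for i in range(len(doc) - n + 1)):
            hits.add(go_id)
    return hits
```

Because matching is purely literal, a term that is only paraphrased or implied in the abstract is missed, which is the recall limitation the talk attributes to this approach.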


Lexical Look-Up Results

• Recall
  • Full text (SGD) vs. abstracts only (IC)
  • Inherent drawbacks of lexical look-up: term variation, literal mentions
  • Effects of Gold Standard creation method (IC)

• Precision
  • Effects of Gold Standard creation method (IC)

• GO vs. GO Slim
  • Recognizing GO Slim terms is easier than recognizing GO terms


Lexical Look-Up

• Extensions
  • GO term T is assigned to a paper if synonym of term T occurs in the abstract of the paper
  • GO term T is assigned to a paper if yeast gene name associated with term T occurs in the abstract of the paper

• Effects on performance
  • Adding synonyms: slight decrease in Precision, substantial increase in Recall
  • Adding yeast terms: substantial decrease in Precision, substantial increase in Recall
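One way to picture the two extensions is as extra surface forms in a single lexicon that maps strings to GO IDs. The sketch below is an assumption-laden illustration: `build_lexicon`, `tag`, and the input dictionaries are hypothetical, and matching is plain case-insensitive whole-word search rather than whatever the authors' tagger does.

```python
import re


def build_lexicon(names, synonyms=None, gene_terms=None):
    """Map lower-cased surface forms to the GO IDs they trigger.

    names:      dict GO ID -> term name
    synonyms:   optional dict GO ID -> list of synonym strings
    gene_terms: optional dict gene name -> list of associated GO IDs
    """
    lexicon = {}
    for go_id, name in names.items():
        lexicon.setdefault(name.lower(), set()).add(go_id)
    for go_id, syns in (synonyms or {}).items():
        for syn in syns:
            lexicon.setdefault(syn.lower(), set()).add(go_id)
    for gene, go_ids in (gene_terms or {}).items():
        lexicon.setdefault(gene.lower(), set()).update(go_ids)
    return lexicon


def tag(abstract, lexicon):
    """Assign every GO ID whose surface form occurs as whole words."""
    text = abstract.lower()
    hits = set()
    for surface, go_ids in lexicon.items():
        if re.search(r"\b" + re.escape(surface) + r"\b", text):
            hits |= go_ids
    return hits
```

A gene name pulls in every GO term associated with that gene whether or not the abstract discusses it, which is consistent with the substantial drop in Precision reported for that extension.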


IR-Based Approach

• Document collection
  • For each GO term, create a document consisting of the GO term, its synonyms, and its definition

• Query
  • For each paper, create a query consisting of the words in the abstract of the paper

• Given a query (i.e., abstract), retrieve relevant documents (i.e., GO terms) from the document collection

• Assign top-ranked 5, 10, … GO terms to abstract


IR-Based Approach

• Index documents using Lucene search engine

• Standard IR preprocessing: tokenization, stop word removal, case normalization, stemming

• Similarity measure: vector space model

• Two kinds of document
  • Flat document = GO term + synonyms + definition
  • Hierarchical document = GO term + synonyms + definition + terms, synonyms, and definitions of parent GO nodes
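The retrieval set-up can be sketched without a search engine: each GO term becomes a flat document, the abstract becomes the query, and terms are ranked by cosine similarity of TF-IDF vectors. This is a simplified stand-in, not Lucene and not its scoring formula. The function names, the tiny stop list, the IDF smoothing, and the dictionary layout of `go_terms` are all assumptions, and stemming is omitted for brevity.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop list; a real system would use a fuller one.
STOP = {"the", "a", "an", "of", "in", "by", "which", "and", "is", "to"}


def tokens(text):
    """Lower-case, tokenize, and drop stop words (no stemming here)."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP]


def flat_document(term):
    """Flat document = GO term name + synonyms + definition."""
    parts = [term["name"], *term.get("synonyms", []), term.get("definition", "")]
    return " ".join(parts)


def rank_go_terms(abstract, go_terms, top_k=5):
    """Return up to top_k GO IDs ranked by TF-IDF cosine similarity."""
    docs = {go_id: Counter(tokens(flat_document(t))) for go_id, t in go_terms.items()}
    n = len(docs)
    df = Counter(w for tf in docs.values() for w in tf)
    idf = {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}

    def vec(tf):
        # Words outside the document collection's vocabulary are ignored.
        return {w: f * idf[w] for w, f in tf.items() if w in idf}

    def cosine(u, v):
        dot = sum(x * v.get(w, 0.0) for w, x in u.items())
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    query = vec(Counter(tokens(abstract)))
    scored = [(cosine(query, vec(tf)), go_id) for go_id, tf in docs.items()]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [go_id for score, go_id in scored[:top_k] if score > 0]
```

A hierarchical document could be built in the same way by appending the names, synonyms, and definitions of the parent nodes before tokenizing.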


IR-Based Results

• Better performance on IC abstracts than on SGD abstracts

• Hierarchical documents do slightly worse than flat documents

• Discriminatory effect of specific GO terms may be reduced by occurrence of general terms such as cell and protein


Machine Learning

• Variety of text classification algorithms: Naïve Bayes, Decision Tree, SVM classifier, …

• Naïve Bayes predicts only one GO term per abstract

• SGD GS: 2.1 GO terms/abstract; IC GS: 6.6 GO terms/abstract

• Features: words, frequent phrases

• Preprocessing steps: tokenization, removal of stop words, stemming

• Training on 66% of annotated data, evaluation on remainder of data

• GO term assignments vis-à-vis generic GO Slim to mitigate data sparsity problems
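The training set-up can be illustrated with a 66/34 split and a small multinomial Naïve Bayes over word features. This is a toy sketch under assumptions: the `split` helper, the `NaiveBayes` class, the add-one smoothing, and the example labels are all hypothetical, and the slides do not specify which implementations the authors used. Like the Naïve Bayes classifier described above, it predicts exactly one label per abstract.

```python
import math
import random
from collections import Counter, defaultdict


def split(examples, train_fraction=0.66, seed=0):
    """Shuffle deterministically and split into training and evaluation sets."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (single label)."""

    def fit(self, examples):
        # examples: iterable of (list_of_words, label) pairs
        self.label_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for words, label in examples:
            self.label_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, words):
        total = sum(self.label_counts.values())
        v = len(self.vocab)

        def score(label):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + v
            s = math.log(self.label_counts[label] / total)
            for w in words:
                if w in self.vocab:  # unseen words carry no information
                    s += math.log((counts[w] + 1) / denom)
            return s

        return max(sorted(self.label_counts), key=score)
```

With an average of 2.1 (SGD) or 6.6 (IC) GO terms per abstract in the gold standards, a single-label predictor of this kind cannot recover most of the assigned terms, which is why the one-term limitation matters.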


Machine Learning Results

• One GO term vs. multiple GO terms per abstract makes a difference

• Higher precision scores than lexical look-up (SGD): GO terms directly mentioned in text will not be assigned if the GO terms are not present in the training set

• Oracle Text Decision Tree (IC): classifier learns systematic, strong correlation between words in text and words in GO terms


Comparison of Approaches

• Best F scores for GO Slim

• SGD Gold Standard

       R     P     F
  LLU  51.0  29.9  37.7
  IR   51.5  26.2  34.7
  ML   36.8  51.6  43.0

• IC Gold Standard

       R     P     F
  LLU  79.5  98.5  88.0
  IR   59.5  37.6  46.1
  ML   76.5  83.0  79.6
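The F column in these tables is consistent with the balanced F score, the harmonic mean of Recall and Precision. The slides do not state the formula, so treating F as the balanced measure is an assumption; the short check below recomputes F from the reported R and P values.

```python
def f_score(recall, precision):
    """Balanced F score: harmonic mean of recall and precision."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# (R, P) pairs as reported on the slide, keyed by approach.
reported = {
    "LLU": (79.5, 98.5),
    "IR": (59.5, 37.6),
    "ML": (76.5, 83.0),
}

for name, (r, p) in reported.items():
    print(name, round(f_score(r, p), 1))
```

For the row with R = 79.5 and P = 98.5 this gives 2 × 79.5 × 98.5 / 178.0 ≈ 88.0, matching the reported F.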


Conclusions

• GO-tagging is an interesting task
  • NLP challenges
  • Benefits of functional GO-tagger for researchers and curators

• Creating valid Gold Standard
  • Completeness of annotation


Conclusions

• Methods for GO-tagging
  • Lexical look-up
    • Fast, simple
    • Term variation, relevant GO terms inferred from text
  • Information retrieval approach
    • Novel perspective
    • Noise from general biomedical terms
  • Machine Learning
    • Able to capture generalizations
    • Feature selection


Future Work

• Enhancements to each of the three simple approaches

• Combining three approaches into a hybrid system

• Improving resources and methodology for evaluating the technology

• Building and evaluating end-user applications employing this technology

• Look at other tasks:
  • Extracting GO term-gene/gene product pairs
  • Assigning evidence codes


Navigating GO-Tagged Document Collections

[Figure: navigation interface with panels labeled GO Hierarchy, Abstract Titles, Abstract Bodies, and GO Terms/Gene Names]