Improving Interoperability of Text Mining Tools with BioC

Ritu Khare, Chih-Hsuan Wei, Yuqing Mao, Robert Leaman, Zhiyong Lu
National Center for Biotechnology Information (NCBI)
National Institutes of Health


¡  Motivation
¡  Our Text Mining Tools
¡  Building BioC Compatible Tools
¡  Results and Conclusions


¡  Building complex text mining applications requires combining different tools developed by different groups

¡  Each tool is developed independently
   §  Group conventions: data representation, programming, execution environments

¡  Heterogeneity in data/text representations limits and slows down
   §  tool interoperability, application development, and research and innovation.


EXISTING SOLUTIONS
¡  Unstructured Information Management Architecture (UIMA), 2004
¡  General Architecture for Text Engineering (GATE), 2009
¡  Steep learning curve
¡  Substantial development and re-development time

BIOC
¡  Minimal change requirement to existing applications and datasets
¡  BioC family
   §  XML formats to present text documents and annotations
   §  Functions (C++, Java) to read/write documents in BioC format


Concept Recognition and Annotation Toolkit

Input: PubMed abstracts or full-text articles
¡  DNorm: disease mentions with MEDIC IDs (F-measure = 80.90%)
¡  tmVar: mutation mentions (F-measure = 91.39%)
¡  SR4GN: species mentions with Taxonomy IDs (F-measure = 85.42%)
¡  tmChem: chemical mentions (F-measure = 88.27%)
¡  GenNorm: gene mentions with Entrez IDs (F-measure = 92.89%)
Output: annotations for various bioconcepts

Formats considered: PubMed/PMC XML, Free Text, PubTator Format, GenNorm Format

NER tool             Programming language   Method        Formats supported
tmChem (Chemical)    Java, Perl, C++        CRF           √ √
DNorm (Disease)      Java                   CRF           √ √
tmVar (Mutation)     Perl, C++              CRF           √ √ √
SR4GN (Species)      Perl                   Rule-based    √ √ √
GenNorm (Gene)       Perl                   Statistical   √ √ √
PubTator             Perl, JavaScript       Web server    √ √


¡  Official corpus for BioCreative IV GO Task
¡  200 full-text articles along with their Gene Ontology (GO) annotations
   §  evidence sentences
   §  gene/protein entities, GO terms, GO evidence codes
¡  Developed by expert GO curators via a web-based annotation tool.


¡  Motivation
¡  The NCBI Text Mining Toolkit
¡  Building BioC Compatible Tools
¡  Results and Conclusions


¡  The BioC family
   §  XML DTD
      ▪  how to present text documents and annotations (higher-level semantics)
   §  C++ and Java libraries
      ▪  functions/classes to read/write documents in BioC format

¡  BioC recommendations
   §  Full-text articles and annotations
      ▪  Present in BioC XML format
      ▪  Keep in separate files
   §  Key file
      ▪  describes how data should be interpreted in the annotation file (lower-level semantics)
      ▪  needs to be created for a specific type of data.
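
To make the format concrete, here is a minimal sketch of reading annotations from a BioC XML collection. It uses Python's standard `xml.etree.ElementTree` as a stand-in for the C++ and Java BioC libraries named above; the sample collection, its PMID, its key file name, and the function name `read_annotations` are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical BioC collection: the id, date, key file name, and text are invented.
BIOC_XML = """<collection>
  <source>PubMed</source>
  <date>2013-10-01</date>
  <key>example.key</key>
  <document>
    <id>12345</id>
    <passage>
      <infon key="type">title</infon>
      <offset>0</offset>
      <text>BRCA1 mutations in breast cancer</text>
      <annotation id="0">
        <infon key="type">Disease</infon>
        <location offset="19" length="13"/>
        <text>breast cancer</text>
      </annotation>
    </passage>
  </document>
</collection>"""


def read_annotations(xml_string):
    """Return one dict per annotation found in a BioC collection string."""
    root = ET.fromstring(xml_string)
    rows = []
    for document in root.iter("document"):
        doc_id = document.findtext("id")
        for passage in document.iter("passage"):
            for annotation in passage.iter("annotation"):
                location = annotation.find("location")
                rows.append({
                    "doc_id": doc_id,
                    "type": annotation.findtext("infon[@key='type']"),
                    "offset": int(location.get("offset")),
                    "length": int(location.get("length")),
                    "text": annotation.findtext("text"),
                })
    return rows


for row in read_annotations(BIOC_XML):
    print(row)
```

The meaning of each `infon` key (here only `type`) is what a key file would document; the parser itself does not interpret it.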


¡  Steps taken to make our tools BioC-compliant
   §  Created the key file
   §  Modified the input/output formats of the tools
      ▪  Added the BioC format as a new option for input/output

¡  Challenges
   §  Defining an appropriate key file
   §  Offset calculation
   §  Translating the web-based annotation file to a BioC annotation file (Unicode-to-ASCII conversion)
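
The offset and conversion challenges interact: if a Unicode-to-ASCII conversion changes the length of the text, every stored offset after that point is wrong. The sketch below is one possible way to handle both, not the method used in the tools. It assumes passages are separated by a single character when computing document-level offsets, and it maps every non-ASCII character to exactly one ASCII character. The function names are hypothetical.

```python
import unicodedata


def passage_offsets(passages):
    """Document-level start offset of each passage.

    Assumption: consecutive passages are separated by one character.
    """
    offsets = []
    position = 0
    for text in passages:
        offsets.append(position)
        position += len(text) + 1
    return offsets


def to_ascii_same_length(text):
    """Replace each non-ASCII character with a single ASCII character.

    Accented letters fall back to their base letter; anything without an
    ASCII base becomes '?'. The output has the same length as the input,
    so annotation offsets computed on the original text remain valid.
    """
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
            continue
        decomposed = unicodedata.normalize("NFKD", ch)
        out.append(next((c for c in decomposed if ord(c) < 128), "?"))
    return "".join(out)


print(passage_offsets(["A short title", "The abstract text."]))
print(to_ascii_same_length("Doğan"))
```

The one-to-one mapping is lossy (a ligature keeps only its first letter, for example), which is the price of keeping offsets stable. A length-changing conversion would instead need an offset remapping table.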


¡  Common key file for all tools since they are designed for similar types of data


id: PubMed id.

Passage: e.g., title, abstract

Offset of the passage

Id of the bioconcept

Offset of the bioconcept

Length of the bioconcept

Mention of the bioconcept

date: the time the annotation was created

Formats considered: PubMed/PMC XML, BioC, Free Text, PubTator, GenNorm

NER tool    Bioconcept   Formats supported
tmChem      Chemical     √ √ √
DNorm       Disease      √ √ √
tmVar       Mutation     √ √ √ √
SR4GN       Species      √ √ √ √
GenNorm     Gene         √ √ √ √
PubTator    N/A          √ √ √


Our Text Mining Toolkit available for public access: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/


Figure: a BioC article file is processed by the toolkit to produce a BioC annotation file.
¡  DNorm: identifying diseases
¡  tmVar: identifying mutations
¡  tmChem: identifying chemicals
¡  SR4GN: identifying species
¡  GenNorm: identifying genes
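
Because every tool reads and writes the same representation, the tools can be chained without format converters in between. The sketch below illustrates only that idea. The document is a plain dictionary rather than a real BioC object, and `dictionary_tagger` is a hypothetical stand-in that does exact string lookup, whereas the actual tools use CRF, rule-based, and statistical methods.

```python
def dictionary_tagger(concept_type, terms):
    """Build a toy tagger that annotates exact matches of the given terms."""
    def tag(document):
        text = document["text"]
        for term in terms:
            start = text.find(term)
            while start != -1:
                document["annotations"].append({
                    "type": concept_type,
                    "offset": start,
                    "length": len(term),
                    "text": term,
                })
                start = text.find(term, start + 1)
        return document
    return tag


def run_pipeline(document, taggers):
    """Pass one shared document through each tagger in turn."""
    for tagger in taggers:
        document = tagger(document)
    return document


# Hypothetical input: the id and sentence are invented for illustration.
doc = {"id": "12345", "text": "BRCA1 mutations in human breast cancer", "annotations": []}
pipeline = [
    dictionary_tagger("Disease", ["breast cancer"]),
    dictionary_tagger("Species", ["human"]),
]
for annotation in run_pipeline(doc, pipeline)["annotations"]:
    print(annotation)
```

Each tagger only appends to the shared annotation list, so adding or reordering tools does not require changes to the others.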


id: PubMed id.

passage: title

date: the time the file was downloaded

passage: abstract


Id of the bioconcept

Offset of the bioconcept

Length of the bioconcept

Mention of the bioconcept

Type of the bioconcept

Time: the time the annotation was created.
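
The annotation fields listed above map onto a BioC `annotation` element roughly as sketched below. This is an illustration built with Python's standard library, not the tools' own output code. The infon key `identifier`, the concept ID `EXAMPLE:0001`, and the function name `make_annotation` are placeholders; the real key names are whatever the key file defines.

```python
import xml.etree.ElementTree as ET


def make_annotation(annotation_id, concept_type, concept_id,
                    passage_text, passage_offset, mention):
    """Build a BioC-style <annotation> element for the first match of mention."""
    start = passage_text.index(mention)  # raises ValueError if the mention is absent
    annotation = ET.Element("annotation", id=str(annotation_id))
    ET.SubElement(annotation, "infon", key="type").text = concept_type
    # "identifier" is an assumed key name, not taken from the actual key file.
    ET.SubElement(annotation, "infon", key="identifier").text = concept_id
    ET.SubElement(
        annotation,
        "location",
        offset=str(passage_offset + start),  # offset relative to the document
        length=str(len(mention)),
    )
    ET.SubElement(annotation, "text").text = mention
    return annotation


element = make_annotation(
    0, "Disease", "EXAMPLE:0001",
    "BRCA1 mutations in breast cancer", 0, "breast cancer",
)
print(ET.tostring(element, encoding="unicode"))
```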

ID: PMID of the article.

GO term: e.g., receptor-‐mediated endocytosis

GO evidence code: e.g., Inferred from Mutant Phenotype (IMP)

Curatable entity: i.e., gene or gene product

Text: GO evidence text

¡  Our experience with BioC
   §  Minimal changes required to prepare BioC versions
   §  Easy to learn and use
   §  Improved interoperability within the toolkit

¡  Implications
   §  Improved interoperability
      ▪  With other tools to build sophisticated applications
   §  The key file could evolve into a standard for concept recognition and normalization tasks
   §  Anticipate broader usage of our tools as BioC gains popularity


¡  BioC developers
   §  W. John Wilbur
   §  Rezarta Islamaj Doğan
   §  Donald Comeau

¡  Intramural Research Program of the NIH, National Library of Medicine


¡  Chih-Hsuan Wei
   §  [email protected]
   §  +1 301-594-5290
