Ritu Khare, Chih-Hsuan Wei, Yuqing Mao, Robert Leaman, Zhiyong Lu
National Center for Biotechnology Information (NCBI), National Institutes of Health

Improving Interoperability of Text Mining Tools with BioC


Page 1: Improving Interoperability of Text Mining Tools with BioC


Page 2: Improving Interoperability of Text Mining Tools with BioC

• Motivation
• Our Text Mining Tools
• Building BioC Compatible Tools
• Results and Conclusions

Page 3: Improving Interoperability of Text Mining Tools with BioC

• Building complex text mining applications requires combining different tools developed by different groups
• Each tool is developed independently
  ◦ Group conventions: data representation, programming, execution environments
• Heterogeneity in data/text representations limits and slows down
  ◦ tool interoperability, application development, and research and innovation.

Page 4: Improving Interoperability of Text Mining Tools with BioC

EXISTING SOLUTIONS
• Unstructured Information Management Architecture (UIMA), 2004
• General Architecture for Text Engineering (GATE), 2009
• Steep learning curve
• Substantial development and re-development time

BIOC
• Minimal change requirement to existing applications and datasets
• BioC family
  ◦ XML formats to present text documents and annotations
  ◦ Functions (C++, Java) to read/write documents in BioC format

Page 5: Improving Interoperability of Text Mining Tools with BioC

• Motivation
• Our Text Mining Tools
• Building BioC Compatible Tools
• Results and Conclusions

Page 6: Improving Interoperability of Text Mining Tools with BioC

Concept Recognition and Annotation Toolkit

Input: PubMed abstracts or full-text articles
• DNorm: disease mentions with MEDIC IDs (F-measure = 80.90%)
• tmVar: mutation mentions (F-measure = 91.39%)
• SR4GN: species mentions with Taxonomy IDs (F-measure = 85.42%)
• tmChem: chemical mentions (F-measure = 88.27%)
• GenNorm: gene mentions with Entrez IDs (F-measure = 92.89%)
Output: annotations for various bioconcepts

Page 7: Improving Interoperability of Text Mining Tools with BioC

Formats considered: PubMed/PMC XML, Free Text, PubTator Format, GenNorm Format

NER tool | Programming language | Method | Formats supported (one √ each)
tmChem (Chemical) | Java, Perl, C++ | CRF | √ √
DNorm (Disease) | Java | CRF | √ √
tmVar (Mutation) | Perl, C++ | CRF | √ √ √
SR4GN (Species) | Perl | Rule-based | √ √ √
GenNorm (Gene) | Perl | Statistical | √ √ √
PubTator | Perl, JavaScript | Web server | √ √

Page 8: Improving Interoperability of Text Mining Tools with BioC


Page 9: Improving Interoperability of Text Mining Tools with BioC

• Official corpus for BioCreative IV GO Task
• 200 full-text articles along with their Gene Ontology (GO) annotations
  ◦ evidence sentences
  ◦ gene/protein entities, GO terms, GO evidence codes
• Developed by expert GO curators via a web-based annotation tool.

Page 10: Improving Interoperability of Text Mining Tools with BioC

• Motivation
• The NCBI Text Mining Toolkit
• Building BioC Compatible Tools
• Results and Conclusions

Page 11: Improving Interoperability of Text Mining Tools with BioC

• The BioC family
  ◦ XML DTD
    ▪ how to present text documents and annotations (higher-level semantics)
  ◦ C++ and Java libraries
    ▪ functions/classes to read/write documents in BioC format
• BioC recommendations
  ◦ Full-text articles and annotations
    ▪ Present in BioC XML format
    ▪ Keep in separate files
  ◦ Key file
    ▪ describes how data should be interpreted in the annotation file (lower-level semantics)
    ▪ needs to be created for a specific type of data.
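As a rough illustration of what reading a BioC document involves, the sketch below parses a small BioC-style XML string with Python's standard library, not with the C++ or Java libraries named on the slide. The element names (collection, source, date, key, document, id, passage, infon, offset, text) follow the BioC DTD layout as commonly documented. The PMID, key file name, and passage text are invented for the example, and `read_passages` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Invented sample document; only the element layout is meant to mirror BioC.
SAMPLE = """<collection>
  <source>PubMed</source>
  <date>20130801</date>
  <key>example.key</key>
  <document>
    <id>12345678</id>
    <passage>
      <infon key="type">title</infon>
      <offset>0</offset>
      <text>A made-up title about asthma.</text>
    </passage>
    <passage>
      <infon key="type">abstract</infon>
      <offset>30</offset>
      <text>A made-up abstract sentence.</text>
    </passage>
  </document>
</collection>"""


def read_passages(xml_string):
    """Return (document id, passage type, passage offset, text) tuples."""
    root = ET.fromstring(xml_string)
    passages = []
    for document in root.iter("document"):
        doc_id = document.findtext("id")
        for passage in document.iter("passage"):
            # The passage type (title, abstract, ...) is stored in an infon.
            passage_type = passage.findtext("infon[@key='type']")
            offset = int(passage.findtext("offset"))
            text = passage.findtext("text")
            passages.append((doc_id, passage_type, offset, text))
    return passages


for row in read_passages(SAMPLE):
    print(row)
```

The abstract offset in the sample is the title length plus one separator character, which is an assumption of this example and not something the slides specify.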

Page 12: Improving Interoperability of Text Mining Tools with BioC

• Steps taken to make our tools comply with BioC
  ◦ Created the key file
  ◦ Modified the input/output formats of the tools
    ▪ Added the BioC format as a new option for input/output
• Challenges
  ◦ Defining an appropriate key file
  ◦ Offset calculation
  ◦ Translating the web-based annotation file to a BioC annotation file (Unicode to ASCII conversion)
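The offset and Unicode-to-ASCII challenges interact: if conversion changes the length of the text, offsets computed on the original text no longer point at the right characters. The sketch below shows one possible way to handle this, by converting first and computing offsets afterwards. The slides do not say how the conversion was actually done. The NFKD-and-drop strategy, the single separator character between passages, and the helper names `to_ascii`, `passage_offsets`, and `locate` are all assumptions made for illustration.

```python
import unicodedata


def to_ascii(text):
    """Decompose accented characters and drop whatever is still non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def passage_offsets(passages):
    """Offset of each passage, assuming one separator character between passages."""
    offsets = []
    position = 0
    for text in passages:
        offsets.append(position)
        position += len(text) + 1
    return offsets


def locate(mention, passage_text, passage_offset):
    """Return (offset, length) of the first match in document coordinates, or None."""
    index = passage_text.find(mention)
    if index < 0:
        return None
    return passage_offset + index, len(mention)


# Invented title containing two non-ASCII characters (i-diaeresis and Greek beta).
original = "Na\u00efve \u03b2-catenin signalling"
converted = to_ascii(original)

print(len(original), len(converted))      # the converted text is one character shorter
print(locate("catenin", original, 0))     # offset in the original text
print(locate("catenin", converted, 0))    # offset after conversion differs by one
```

Dropping unmapped characters is lossy; a curated replacement table (for example, mapping the Greek letter to the word "beta") would preserve more meaning but would shift offsets in a different way, so the same convert-then-measure order still applies.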

Page 13: Improving Interoperability of Text Mining Tools with BioC

• Motivation
• Our Text Mining Tools
• Building BioC Compatible Tools
• Results and Conclusions

Page 14: Improving Interoperability of Text Mining Tools with BioC

• Common key file for all tools, since they are designed for similar types of data

id: PubMed ID

passage: e.g., title, abstract

Offset of the passage

ID of the bioconcept

Offset of the bioconcept

Length of the bioconcept

Mention of the bioconcept

date: the time the annotation was created
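To make the listed fields concrete, here is a minimal sketch that writes them out as BioC-style XML using Python's standard library. The mapping is an assumption: the bioconcept ID is placed in the `id` attribute of `annotation`, and the bioconcept offset and length in a `location` element, following the BioC DTD layout as commonly documented. The slides do not confirm that the authors' key file uses exactly this mapping. The PMID, date, key file name, and text are invented, and `build_collection` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def build_collection(pmid, date, key_file, passages):
    """Build a BioC-style collection holding one document and its annotations."""
    collection = ET.Element("collection")
    ET.SubElement(collection, "source").text = "PubMed"
    ET.SubElement(collection, "date").text = date        # time the annotation was created
    ET.SubElement(collection, "key").text = key_file     # name of the key file
    document = ET.SubElement(collection, "document")
    ET.SubElement(document, "id").text = pmid            # PubMed ID
    for p in passages:
        passage = ET.SubElement(document, "passage")
        ET.SubElement(passage, "infon", key="type").text = p["type"]
        ET.SubElement(passage, "offset").text = str(p["offset"])
        ET.SubElement(passage, "text").text = p["text"]
        for a in p["annotations"]:
            annotation = ET.SubElement(passage, "annotation", id=a["id"])
            ET.SubElement(
                annotation,
                "location",
                offset=str(a["offset"]),
                length=str(len(a["mention"])),
            )
            ET.SubElement(annotation, "text").text = a["mention"]
    return collection


# Invented example: one title passage with one annotated mention.
collection = build_collection(
    pmid="12345678",
    date="20130801",
    key_file="example.key",
    passages=[
        {
            "type": "title",
            "offset": 0,
            "text": "Asthma in children",
            "annotations": [{"id": "0", "offset": 0, "mention": "Asthma"}],
        }
    ],
)
print(ET.tostring(collection, encoding="unicode"))
```

Deriving the length from the mention text, instead of accepting it as a separate input, keeps the two fields from drifting apart.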

Page 15: Improving Interoperability of Text Mining Tools with BioC

Formats considered: PubMed/PMC XML, BioC, Free Text, PubTator, GenNorm

NER tool | Bioconcept | Formats supported (one √ each)
tmChem | Chemical | √ √ √
DNorm | Disease | √ √ √
tmVar | Mutation | √ √ √ √
SR4GN | Species | √ √ √ √
GenNorm | Gene | √ √ √ √
PubTator | N/A | √ √ √

Our Text Mining Toolkit is available for public access: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/

Page 16: Improving Interoperability of Text Mining Tools with BioC

Input: BioC article file
• DNorm: identifying disease
• tmVar: identifying mutation
• tmChem: identifying chemical
• SR4GN: identifying species
• GenNorm: identifying gene
Output: BioC annotation file
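The value of a shared format here is that every tool can read the same passage text and contribute annotations with the same fields. The sketch below illustrates only that idea. The toy dictionary taggers are placeholders and have nothing to do with how DNorm, tmVar, tmChem, SR4GN, or GenNorm work internally (the slides describe those as CRF, rule-based, or statistical). The slides also do not state whether the tools run in sequence or independently. All names and lexicon entries are hypothetical.

```python
def make_dictionary_tagger(concept_type, lexicon):
    """Return a toy tagger that marks exact, case-sensitive lexicon matches."""
    def tag(text):
        found = []
        for term in lexicon:
            start = text.find(term)
            while start >= 0:
                found.append({
                    "type": concept_type,
                    "offset": start,
                    "length": len(term),
                    "text": term,
                })
                start = text.find(term, start + 1)
        return found
    return tag


def annotate(passage_text, passage_offset, taggers):
    """Run every tagger on one passage and merge results in document coordinates."""
    merged = []
    for tagger in taggers:
        for ann in tagger(passage_text):
            # Shift passage-relative offsets by the passage offset.
            merged.append({**ann, "offset": ann["offset"] + passage_offset})
    merged.sort(key=lambda a: (a["offset"], a["type"]))
    return merged


# Stand-ins for three of the tools; the lexicons are invented.
taggers = [
    make_dictionary_tagger("Disease", ["breast cancer"]),
    make_dictionary_tagger("Species", ["human"]),
    make_dictionary_tagger("Gene", ["BRCA1"]),
]

result = annotate("BRCA1 mutations in human breast cancer", 0, taggers)
for ann in result:
    print(ann["type"], ann["offset"], ann["length"], ann["text"])
```

Because each tagger returns the same dictionary shape, adding another tool only means appending to the `taggers` list.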

Page 17: Improving Interoperability of Text Mining Tools with BioC

id: PubMed ID

passage: title

date: the time the file was downloaded

passage: abstract

Page 18: Improving Interoperability of Text Mining Tools with BioC

ID of the bioconcept

Offset of the bioconcept

Length of the bioconcept

Mention of the bioconcept

Type of the bioconcept

Page 19: Improving Interoperability of Text Mining Tools with BioC

Time: time the annotation was created.

ID: PMID of the article.

GO term: e.g., receptor-mediated endocytosis

GO evidence code: e.g., Inferred from Mutant Phenotype (IMP)

Curatable entity: i.e., gene or gene product

Text: GO evidence text

Page 20: Improving Interoperability of Text Mining Tools with BioC

• Our experience with BioC
  ◦ Minimal changes required to prepare BioC versions
  ◦ Easy to learn and use
  ◦ Improved interoperability within the toolkit
• Implications
  ◦ Improved interoperability
    ▪ With other tools to build sophisticated applications
  ◦ The key file could evolve as a standard for concept recognition and normalization tasks
  ◦ Anticipate broader usage of our tools as BioC gains popularity

Page 21: Improving Interoperability of Text Mining Tools with BioC

• BioC Developers
  ◦ W. John Wilbur
  ◦ Rezarta Islamaj Doğan
  ◦ Donald Comeau
• Intramural Research Program of the NIH, National Library of Medicine

Page 22: Improving Interoperability of Text Mining Tools with BioC

• Chih-Hsuan Wei
  ◦ [email protected]
  ◦ +1 301-594-5290