Ambiguity and Variability of Database and Software Names in Bioinformatics
SMBM 2012
Geraint Duck (1), Robert Stevens (1), David Robertson (2) and Goran Nenadic (1)
(1) School of Computer Science, (2) Faculty of Life Sciences, The University of Manchester, Manchester, UK


Page 1: SMBM 2012: Ambiguity and Variability of Database and Software Names in Bioinformatics


Page 2: SMBM 2012: Ambiguity and Variability of Database and Software Names in Bioinformatics

Named Entity Recognition (NER)

• Variety of NER uses
  – Species
  – Gene/protein names
  – Chemical names
• Variety of NER accuracy
  – 95% F-score species (LINNAEUS)
  – 73% F-score (strict) gene name (ABNER)
  – Over 70% F-score chemical names (OSCAR3)
• Draw parallels for database and software NER

Page 3: SMBM 2012: Ambiguity and Variability of Database and Software Names in Bioinformatics

Example  

PMC1660556;  M.  Watson    

Page 4: SMBM 2012: Ambiguity and Variability of Database and Software Names in Bioinformatics

Challenges - Ambiguity

• leg
• white
• cab
• C. elegans
  – 41 NCBI taxonomy species
• HIV
  – Human immunodeficiency virus
  – Human immunovirus
• analysis
• Network
• graph
• DIP
  – distal interphalangeal
  – Database of Interacting Proteins

Page 5: SMBM 2012: Ambiguity and Variability of Database and Software Names in Bioinformatics

Challenges - Variability

• NF-kappaB
• NF-kappa B
• NF-kappa-B
• NF-κB
• Case variants
• Spelling variants
• ClustalW
• Clustal W
• Clustal-W
• CLUSTAL W
• ClustalX (GUI)?
• Now: Clustal Omega
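The orthographic variants on this slide can be grouped by a simple normalisation step. The following is a minimal sketch, not the authors' method: the function names and the choice to fold case, whitespace, hyphens and the Greek letter κ are assumptions made for illustration.

```python
import re
from collections import defaultdict

def normalise(name: str) -> str:
    """Fold case, spell out Greek kappa, and drop whitespace and hyphens."""
    name = name.lower().replace("κ", "kappa")
    # Remove ASCII hyphens, Unicode hyphens/dashes (U+2010..U+2015) and spaces.
    return re.sub(r"[\s\-\u2010-\u2015]+", "", name)

def group_variants(mentions):
    """Map each normalised key to the set of surface forms seen for it."""
    groups = defaultdict(set)
    for mention in mentions:
        groups[normalise(mention)].add(mention)
    return dict(groups)

mentions = ["ClustalW", "Clustal W", "Clustal-W", "CLUSTAL W",
            "NF-kappaB", "NF-kappa B", "NF-kappa-B", "NF-κB"]
groups = group_variants(mentions)
for key, forms in groups.items():
    print(key, sorted(forms))
```

This kind of folding would not relate ClustalX or Clustal Omega to ClustalW; deciding whether those are variants or separate resources needs more than string normalisation.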

Page 6: SMBM 2012: Ambiguity and Variability of Database and Software Names in Bioinformatics

Preliminary

• Annotation guidelines
  – Database, software, package, ontology names
  – Not file formats, algorithms, tasks, methods, database identifiers, programming languages, operating systems, etc.
• Gold standard corpus
  – 25 from BMC Bioinformatics and PLoS Computational Biology; 5 from Genome Biology
• Dictionary of resource names
  – 4,879 unique entries from 10 online resources

Page 7: SMBM 2012: Ambiguity and Variability of Database and Software Names in Bioinformatics

Preliminary

• Inter-annotator agreement
  – F-score: 86%
• 30 documents
  – 1319 total mentions
  – 224 unique mentions

             Databases      Software       Combined
Precision    0.79 (0.66)    0.99 (0.96)    0.93 (0.87)
Recall       0.67 (0.56)    0.84 (0.82)    0.80 (0.74)
F-measure    0.73 (0.61)    0.91 (0.88)    0.86 (0.80)

Total Number of Documents                 30
Total Database and Software Mentions      1319
Total Unique Resource Mentions            224
Percentage of Database Mentions           36%
Percentage of Unique DB Mentions          26%
Average Mentions per Document             44
Average Unique Mentions per Document      8.2
Max Mentions in a Single Document         227
Max Unique Mentions in a Document         33
Resources with only a Single Mention      117
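Assuming the F-measure reported on this slide is the balanced F1 (the harmonic mean of precision and recall; the slides do not state the weighting), the Combined column can be checked from its rounded precision and recall:

```latex
F = \frac{2PR}{P + R}
  = \frac{2 \times 0.93 \times 0.80}{0.93 + 0.80}
  = \frac{1.488}{1.73}
  \approx 0.86
```

Under the same assumption, the formula also reproduces the Databases (0.73) and Software (0.91) values, as well as the parenthesised figures.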

Page 8: SMBM 2012: Ambiguity and Variability of Database and Software Names in Bioinformatics

Ambiguity and Variability

• Compared names to
  – Acronym Dictionary: 1,933
  – English Dictionary: 86,308
• Ambiguity in corpus:
  – ≈ 2% (case-sensitive)
  – ≈ 12% (case-insensitive)
• Ambiguity in names dictionary:
  – ≈ 0.1% (case-sensitive)
  – ≈ 0.5% (case-insensitive)
• 224 unique names
  – 45 were variants
    • 15 acronyms
    • Orthographics
    • Spellings
  – 179 different resources
    • 79% one variant
    • 17% two variants
    • 4% three variants
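The gap between case-sensitive and case-insensitive ambiguity can be illustrated with a small lookup. This is a sketch under stated assumptions: the slides do not say whether ambiguity was counted per name or per mention, and the word lists below are made-up toy data, not the acronym or English dictionaries used in the study.

```python
def ambiguity_rate(names, vocabulary, case_sensitive=True):
    """Fraction of resource names that also occur in a general vocabulary."""
    if case_sensitive:
        vocab = set(vocabulary)
        candidates = list(names)
    else:
        # Fold both sides to lower case before comparing.
        vocab = {word.lower() for word in vocabulary}
        candidates = [name.lower() for name in names]
    if not candidates:
        return 0.0
    hits = sum(1 for name in candidates if name in vocab)
    return hits / len(candidates)

# Toy data (hypothetical): five resource-like names and a tiny vocabulary.
resource_names = ["DIP", "Network", "BLAST", "ClustalW", "Graph"]
general_words = ["dip", "network", "graph", "analysis", "DIP"]

print(ambiguity_rate(resource_names, general_words, case_sensitive=True))
print(ambiguity_rate(resource_names, general_words, case_sensitive=False))
```

Ignoring case makes more names collide with ordinary words, which matches the direction of the figures on the slide.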

Page 9: SMBM 2012: Ambiguity and Variability of Database and Software Names in Bioinformatics

Name Composition

• Majority are single nouns
  – includes acronyms
• 6% lowercase common nouns
  – affy, bioconductor
• A few contained numbers
  – S4, t2prhd
• A few misclassified as verbs
  – …each query protein is first BLASTed with…
  – …held near their equilibrium values using SHAKE.
  – …graphical representations were achieved using dot v1.10…

POS pattern              Share
NNP                      68.0%
NNP NNP                  8.8%
NN                       5.7%
NNP NNP NNP              5.3%
NNP CD                   3.1%
NNP CD . CD              1.8%
NNP NNP NNP NNP NNP      1.3%
NNP LS                   0.9%
NNP NNP NNP NNP          0.9%
Other Patterns           4.4%
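A distribution of part-of-speech patterns like the one on this slide can be tallied once each mention has been tagged. The sketch below assumes the tagging has already been done; the slides do not name the tagger, and the tagged mentions here are hand-written, hypothetical examples.

```python
from collections import Counter

def pattern_distribution(tagged_mentions):
    """Return {POS pattern: percentage}, most frequent pattern first.

    tagged_mentions is a list of mentions, each a list of (token, tag) pairs.
    """
    counts = Counter(
        " ".join(tag for _, tag in mention) for mention in tagged_mentions
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pattern: 100.0 * n / total for pattern, n in counts.most_common()}

# Hypothetical, hand-tagged mentions for illustration only.
tagged = [
    [("BLAST", "NNP")],
    [("MethMarker", "NNP")],
    [("Clustal", "NNP"), ("W", "NNP")],
    [("affy", "NN")],
]
dist = pattern_distribution(tagged)
for pattern, share in dist.items():
    print(f"{pattern:10s} {share:.1f}%")
```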

Page 10: SMBM 2012: Ambiguity and Variability of Database and Software Names in Bioinformatics

Name Composition

• Longest Names (most tokens)
  – Corpus: 5 – Gene Expression Profile Analysis Suite
  – Dictionary: 12 – Prediction of Protein Sorting Signals and Localisation Sites in Amino Acid Sequences
• Evaluated (stemmed) token frequencies within the dictionary
  – Long-tail curve
  – 87% used only once
  – High frequency words suggest common heads and bioinformatics related terms
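The token-frequency analysis can be approximated in a few lines. This is a rough sketch, assuming whitespace tokenisation and lower-casing only: the slides mention stemming, which is omitted here, so "protein" and "proteins" are counted separately. The name list is a toy sample, not the 4,879-entry dictionary.

```python
from collections import Counter

def token_frequencies(names):
    """Count lower-cased whitespace tokens across all resource names."""
    return Counter(tok.lower() for name in names for tok in name.split())

def share_used_once(freqs):
    """Fraction of distinct tokens that occur exactly once."""
    if not freqs:
        return 0.0
    once = sum(1 for count in freqs.values() if count == 1)
    return once / len(freqs)

# Toy sample of names (illustrative only).
names = [
    "Gene Expression Profile Analysis Suite",
    "Gene Ontology",
    "Protein Data Bank",
    "Database of Interacting Proteins",
]
freqs = token_frequencies(names)
print(freqs.most_common(3))
print(f"{share_used_once(freqs):.0%} of distinct tokens occur once")
```

A long tail shows up as a high share of tokens used once, with a handful of frequent tokens at the head of the distribution.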

Page 11: SMBM 2012: Ambiguity and Variability of Database and Software Names in Bioinformatics

[Figure: Token Frequency within Dictionary Database and Software Names. Y-axis: Token Frequency (0 to 120); x-axis: Top Tokens (Words) (0 to 150). Labelled points: genome, protein, gene, sequence, human.]

Page 12: SMBM 2012: Ambiguity and Variability of Database and Software Names in Bioinformatics

Dictionary Matching

• F-score under 55%
  – Low precision
    • GO (GO:0007089)
    • cycle
    • genomes
  – Low recall: the dictionary is not comprehensive
    • i Linker
    • xPedPhase
• 95% of mentions could be matched…

           TP     FP     FN     P      R      F
Lenient    729    633    590    54%    55%    54%
Strict     695    667    624    51%    53%    52%

Dictionary matches           55.3%
Heads and Hearst patterns    9.7%
Title appearances            0.6%
References and URLs          1.9%
Version information          1.2%
Noun/Verb associations       20.3%
Comparisons                  5.8%
Remaining                    5.2%
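The precision, recall and F-score figures in the dictionary-matching table follow from the TP/FP/FN counts. The sketch below assumes the standard definitions (P = TP/(TP+FP), R = TP/(TP+FN), balanced F1); the function name `prf` is illustrative.

```python
def prf(tp, fp, fn):
    """Precision, recall and balanced F-score from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Counts taken from the table on this slide.
counts = {"Lenient": (729, 633, 590), "Strict": (695, 667, 624)}
for label, (tp, fp, fn) in counts.items():
    p, r, f = prf(tp, fp, fn)
    print(f"{label}: P={p:.0%} R={r:.0%} F={f:.0%}")
```

In both rows TP + FN equals 1319, the total number of annotated mentions in the corpus.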

Page 13: SMBM 2012: Ambiguity and Variability of Database and Software Names in Bioinformatics

Potential Clues

• Heads
  – the stochastic simulator Dizzy allows ...
  – The MethMarker software was ...
  – ... system, PSPE, specifically to ...
  – tools: CLUSTALW, ..., and MUSCLE.
  – ... programs such as Simlink, ..., and SimPed.
• Titles
  – CoXpress: differential co-expression in gene expression data
  – TABASCO: A single molecule, base-pair resolved gene expression simulator
  – SimHap GUI: An intuitive graphical user interface for genetic association analysis
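Head-noun and Hearst-style clues of the kind listed on this slide can be approximated with regular expressions. The following is a minimal sketch: the list of head nouns, the capitalisation heuristic and the function name are assumptions for illustration, and the test sentences are simplified versions of the slide examples with the elided parts dropped.

```python
import re

# Hypothetical head nouns; a real list would be derived from the corpus.
HEADS = r"(?:software|tool|program|system|simulator|package|database)s?"
# Crude candidate shape: a capitalised alphanumeric token.
CAP = r"[A-Z][A-Za-z0-9]+"

NAME_THEN_HEAD = re.compile(rf"\b({CAP})\s+{HEADS}\b")   # "The MethMarker software"
HEAD_THEN_NAME = re.compile(rf"\b{HEADS}\s+({CAP})\b")   # "the stochastic simulator Dizzy"
SUCH_AS = re.compile(rf"\b{HEADS}\s+such\s+as\s+([^.]+)")  # "programs such as X and Y"

def candidates(sentence):
    """Return candidate resource names suggested by head/Hearst patterns."""
    found = NAME_THEN_HEAD.findall(sentence)
    found += HEAD_THEN_NAME.findall(sentence)
    match = SUCH_AS.search(sentence)
    if match:
        # Every capitalised token in the enumerated list is a candidate.
        found += re.findall(CAP, match.group(1))
    return found

print(candidates("The MethMarker software was used."))
print(candidates("the stochastic simulator Dizzy allows simulation"))
print(candidates("We used programs such as Simlink and SimPed."))
```

Patterns this loose will also fire on non-resources, so they are better treated as evidence to combine with other clues than as a classifier on their own.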

Page 14: SMBM 2012: Ambiguity and Variability of Database and Software Names in Bioinformatics

Potential Clues

• References
  – Galaxy [18] and EpiGRAPH [19]
  – The learning metrics principle [14,15] (FP)
• Versions
  – using dot v1.10 and Graphviz 1.13(v16).
  – CLUSTAL W version 1.83
  – Dynalign 4.5, and LocARNA 0.99
• Comparisons
  – xPedPhase did better than i Linker
  – Cofolga2 with this cutoff PSVM gives a better false positive rate compared to RNAz
  – Foldalign was much slower than Cofolga2 except for
  – Like Moleculizer, Tabasco dynamically generates
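The version and reference clues can likewise be sketched as regular expressions. These patterns are illustrative assumptions, not the authors' rules: the version pattern accepts an optional "version" or "v" prefix before a dotted number, and the citation pattern deliberately takes any word before a bracketed citation, so it also picks up non-resources such as "principle" in the learning metrics example.

```python
import re

# Name (optionally followed by a single capital, as in "CLUSTAL W"),
# then an optional "version"/"v" marker and a dotted version number.
VERSION = re.compile(r"\b([A-Za-z]\w*(?:\s[A-Z])?)\s+(?:version\s+|v)?(\d+\.\d+)")
# Any word directly followed by a numeric citation such as [18] or [14,15].
CITATION = re.compile(r"\b(\w+)\s*\[\d+(?:\s*,\s*\d+)*\]")

def version_clues(text):
    """Return (name, version) pairs found in the text."""
    return VERSION.findall(text)

def citation_clues(text):
    """Return the word immediately preceding each bracketed citation."""
    return CITATION.findall(text)

print(version_clues("using dot v1.10 and Graphviz 1.13(v16)."))
print(version_clues("CLUSTAL W version 1.83"))
print(citation_clues("Galaxy [18] and EpiGRAPH [19]"))
print(citation_clues("The learning metrics principle [14,15]"))
```

The version pattern would also match strings like a section or table number after a word, so both clues need filtering before they can be trusted.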

Page 15: SMBM 2012: Ambiguity and Variability of Database and Software Names in Bioinformatics

Potential Clues

• the SimHap GUI installation.
• implemented within PedPhase
• Our motivations for creating Tabasco
• MethMarker therefore provides
• A typical screenshot of MethMarker
• MethMarker's user interface reflects

• Tested effect on precision
• Ran regular expression
• Percentage of sentences with resource name and that matched regex:
  – ran|run(ning|s)?: 48%
  – RAM: 50%
  – Website: 77%
• … so are plausible clues.
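One possible reading of the percentages on this slide is: of the sentences that match a clue regex, what share also contains a resource name. The sketch below implements that reading only; the interpretation, the added word boundaries around the slide's `ran|run(ning|s)?` pattern, the substring check for names, and the example sentences are all assumptions.

```python
import re

def share_with_resource(sentences, pattern, resource_names):
    """Among sentences matching `pattern`, the fraction mentioning a resource."""
    regex = re.compile(pattern, re.IGNORECASE)
    matched = [s for s in sentences if regex.search(s)]
    if not matched:
        return 0.0
    # Naive check: a resource counts as mentioned if its name is a substring.
    with_name = [s for s in matched if any(name in s for name in resource_names)]
    return len(with_name) / len(matched)

# Made-up sentences for illustration.
sentences = [
    "We ran BLAST against the database.",
    "The run took two hours.",
    "MethMarker runs on Windows.",
    "Results are shown in Table 1.",
]
names = ["BLAST", "MethMarker"]
share = share_with_resource(sentences, r"\b(?:ran|run(?:ning|s)?)\b", names)
print(f"{share:.0%}")
```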

Page 16: SMBM 2012: Ambiguity and Variability of Database and Software Names in Bioinformatics

Scope

• Database
• Software
• Method
• Approach
• Algorithm
• Task
• Programming Language
• Records/Identifiers
• File Formats

• Authors mix vocabulary
• Fuzzy distinction
• R language, R software
  – Distinction?
• Microsoft Excel
  – Lots of statistics
• Student's t-test
  – Lots of statistics tools

Page 17: SMBM 2012: Ambiguity and Variability of Database and Software Names in Bioinformatics

Summary

• Annotation guidelines
• Annotated gold corpus
• Evaluated resource name mentions
  – Composition
  – Ambiguity
  – Variability
• Dictionary match: < 55%
• Provide potential clues for capture

• Acknowledgments
  – BBSRC
  – Dan Jamieson – IAA

• http://sourceforge.net/projects/bionerds/

• Thank you!
• Questions?