SMBM 2012: Ambiguity and Variability of Database and Software Names in Bioinformatics

Ambiguity and Variability of Database and Software Names in Bioinformatics

SMBM 2012

Geraint Duck¹, Robert Stevens¹, David Robertson² and Goran Nenadic¹

¹School of Computer Science, ²Faculty of Life Sciences, The University of Manchester

Manchester, UK

Named Entity Recognition (NER)

•  Variety of NER uses
   –  Species
   –  Gene/protein names
   –  Chemical names

•  Variety of NER accuracy
   –  95% F-score species (LINNAEUS)
   –  73% F-score (strict) gene name (ABNER)
   –  Over 70% F-score chemical names (OSCAR3)

•  Draw parallels for database and software NER

Example


PMC1660556; M. Watson

Challenges - Ambiguity

•  leg
•  white
•  cab
•  C. elegans
   –  41 NCBI taxonomy species
•  HIV
   –  Human immunodeficiency virus
   –  Human immunovirus

•  analysis
•  Network
•  graph
•  DIP
   –  distal interphalangeal
   –  Database of Interacting Proteins

Challenges - Variability

•  NF-kappaB
•  NF-kappa B
•  NF-kappa-B
•  NF-κB

•  Case variants
•  Spelling variants

•  ClustalW
•  Clustal W
•  Clustal-W
•  CLUSTAL W

•  ClustalX (GUI)?
•  Now: Clustal Omega
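Many of the orthographic variants listed above can be collapsed onto a single lookup key. The following is a minimal sketch, assuming that lowercasing, spelling out Greek letters, and dropping spaces and hyphens is acceptable; the function name `variant_key` and the small Greek mapping are illustrative and do not come from the slides.

```python
import re

# Illustrative mapping only; a fuller system would cover the whole Greek alphabet.
GREEK = {"κ": "kappa", "α": "alpha", "β": "beta"}

def variant_key(name: str) -> str:
    """Collapse case, spacing, hyphenation and Greek-letter variants to one key."""
    for letter, spelled in GREEK.items():
        name = name.replace(letter, spelled)
    # Lowercase, then drop whitespace and hyphen-like characters.
    return re.sub(r"[\s\-\u2010\u2011]+", "", name.lower())

clustal = ["ClustalW", "Clustal W", "Clustal-W", "CLUSTAL W"]
nfkb = ["NF-kappaB", "NF-kappa B", "NF-kappa-B", "NF-κB"]

print({variant_key(n) for n in clustal})
print({variant_key(n) for n in nfkb})
```

ClustalX and Clustal Omega keep their own keys under this scheme, so deciding whether they count as the same resource is still left to the annotation guidelines. Aggressive normalisation of this kind may also increase ambiguity with common words.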


Preliminary

•  Annotation guidelines
   –  Database, software, package, ontology names
   –  Not file formats, algorithms, tasks, methods, database identifiers, programming languages, operating systems, etc.

•  Gold standard corpus
   –  25 documents from BMC Bioinformatics and PLoS Computational Biology; 5 from Genome Biology

•  Dictionary of resource names
   –  4,879 unique entries from 10 online resources

Preliminary

•  Inter-annotator agreement
   –  F-score: 86%

•  30 documents
   –  1319 total mentions
   –  224 unique mentions

             Databases     Software      Combined
Precision    0.79 (0.66)   0.99 (0.96)   0.93 (0.87)
Recall       0.67 (0.56)   0.84 (0.82)   0.80 (0.74)
F-measure    0.73 (0.61)   0.91 (0.88)   0.86 (0.80)

Total Number of Documents                 30
Total Database and Software Mentions      1319
Total Unique Resource Mentions            224
Percentage of Database Mentions           36%
Percentage of Unique DB Mentions          26%
Average Mentions per Document             44
Average Unique Mentions per Document      8.2
Max Mentions in a Single Document         227
Max Unique Mentions in a Document         33
Resources with only a Single Mention      117


Ambiguity and Variability

•  Compared names to
   –  Acronym Dictionary: 1,933
   –  English Dictionary: 86,308

•  Ambiguity in corpus:
   –  ≈ 2% (case-sensitive)
   –  ≈ 12% (case-insensitive)

•  Ambiguity in names dictionary:
   –  ≈ 0.1% (case-sensitive)
   –  ≈ 0.5% (case-insensitive)

•  224 unique names
   –  45 were variants
      •  15 acronyms
      •  Orthographics
      •  Spellings
   –  179 different resources
      •  79% one variant
      •  17% two variants
      •  4% three variants
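The ambiguity figures above come from comparing resource names against an acronym dictionary and an English dictionary. The sketch below shows one plausible form of that lookup, assuming each dictionary is available as a plain set of strings. The small sets are placeholders for illustration, not the 1,933-entry and 86,308-entry dictionaries, and `ambiguity_rate` is a hypothetical helper name.

```python
def ambiguity_rate(names, lexicon, case_sensitive=True):
    """Fraction of names that also appear in the lexicon."""
    if not case_sensitive:
        names = [n.lower() for n in names]
        lexicon = {w.lower() for w in lexicon}
    hits = sum(1 for n in names if n in lexicon)
    return hits / len(names) if names else 0.0

# Placeholder data for illustration only.
english = {"network", "graph", "analysis", "cycle", "dip"}
resource_names = ["Network", "DIP", "ClustalW", "BLAST", "cycle"]

print(ambiguity_rate(resource_names, english, case_sensitive=True))
print(ambiguity_rate(resource_names, english, case_sensitive=False))
```

On the toy data the case-insensitive rate is higher than the case-sensitive one, which is the same direction as the 2% versus 12% gap reported for the corpus.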


Name Composition

•  Majority are single nouns
   –  includes acronyms
•  6% lowercase common nouns
   –  affy, bioconductor
•  A few contained numbers
   –  S4, t2prhd
•  A few misclassified as verbs
   –  …each query protein is first BLASTed with…
   –  …held near their equilibrium values using SHAKE.
   –  …graphical representations were achieved using dot v1.10…

POS tag pattern            Share
NNP                        68.0%
NNP NNP                    8.8%
NN                         5.7%
NNP NNP NNP                5.3%
NNP CD                     3.1%
NNP CD . CD                1.8%
NNP NNP NNP NNP NNP        1.3%
NNP LS                     0.9%
NNP NNP NNP NNP            0.9%
Other Patterns             4.4%


Name Composition

•  Longest Names (most tokens)
   –  Corpus: 5
      •  Gene Expression Profile Analysis Suite
   –  Dictionary: 12
      •  Prediction of Protein Sorting Signals and Localisation Sites in Amino Acid Sequences

•  Evaluated (stemmed) token frequencies within the dictionary
   –  Long-tail curve
   –  87% used only once
   –  High frequency words suggest common heads and bioinformatics related terms
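The token-frequency evaluation can be approximated in a few lines. This sketch assumes whitespace tokenisation and uses lowercasing plus a naive plural-stripping rule as a stand-in for a real stemmer, since the slides do not say which stemmer was used. The four names are illustrative entries, not the actual dictionary.

```python
from collections import Counter

def crude_stem(token: str) -> str:
    """Stand-in for a real stemmer: lowercase and strip a trailing plural 's'."""
    token = token.lower()
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token

def token_frequencies(names):
    """Count stemmed tokens across all dictionary names."""
    counts = Counter()
    for name in names:
        counts.update(crude_stem(t) for t in name.split())
    return counts

# Illustrative entries only.
names = [
    "Gene Expression Profile Analysis Suite",
    "Database of Interacting Proteins",
    "Protein Data Bank",
    "Gene Ontology",
]
freqs = token_frequencies(names)
singletons = sum(1 for c in freqs.values() if c == 1)

print(freqs.most_common(2))
print(f"{singletons} of {len(freqs)} stems occur only once")
```

Even on four names the distribution is skewed: a couple of domain terms repeat while most stems appear once, which is the long-tail shape described above.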


[Figure: Token Frequency within Dictionary Database and Software Names. x-axis: Top 150 Tokens (Words), 0 to 150; y-axis: Token Frequency, 0 to 120. Labelled high-frequency tokens: Genome, Protein, Gene, Sequence, Human.]


Dictionary Matching

•  F-score under 55%
   –  Low precision
      •  GO (GO:0007089)
      •  cycle
      •  genomes
   –  Low recall: the dictionary is not comprehensive
      •  i Linker
      •  xPedPhase

•  95% of mentions could be matched…
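Both failure modes can be reproduced with a very small matcher. This is a sketch only, assuming exact, case-sensitive, single-token lookup; the slides do not describe the matching strategy in this detail. The four-entry dictionary is hypothetical and deliberately omits xPedPhase and i Linker.

```python
import re

def dictionary_match(sentence: str, dictionary: set) -> list:
    """Return the tokens of the sentence that exactly match a dictionary entry."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    return [t for t in tokens if t in dictionary]

# Hypothetical dictionary: includes ambiguous entries, lacks some real tools.
dictionary = {"GO", "cycle", "genomes", "BLAST"}

# False positives: a common word and the prefix of a database identifier.
print(dictionary_match("The cell cycle term GO:0007089 was assigned.", dictionary))

# False negatives: neither tool is in the dictionary, so nothing is found.
print(dictionary_match("xPedPhase did better than i Linker", dictionary))
```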


           TP    FP    FN    P     R     F
Lenient    729   633   590   54%   55%   54%
Strict     695   667   624   51%   53%   52%
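The lenient and strict scores follow from the raw counts using the standard definitions of precision, recall and balanced F-measure. A short check, with `prf` as an illustrative helper name:

```python
def prf(tp: int, fp: int, fn: int):
    """Precision, recall and balanced F-measure from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

counts = {"Lenient": (729, 633, 590), "Strict": (695, 667, 624)}
for label, (tp, fp, fn) in counts.items():
    p, r, f = prf(tp, fp, fn)
    print(f"{label}: P={p:.0%} R={r:.0%} F={f:.0%}")
```

In both rows TP + FN equals 1319, the total number of mentions in the gold corpus.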

Dictionary matches             55.3%
Heads and Hearst patterns       9.7%
Title appearances               0.6%
References and URLs             1.9%
Version information             1.2%
Noun/Verb associations         20.3%
Comparisons                     5.8%
Remaining                       5.2%

Potential Clues

•  Heads
   –  the stochastic simulator Dizzy allows ...
   –  The MethMarker software was ...
   –  ... system, PSPE, specifically to ...
   –  tools: CLUSTALW, ..., and MUSCLE.
   –  ... programs such as Simlink, ..., and SimPed.

•  Titles
   –  CoXpress: differential co-expression in gene expression data
   –  TABASCO: A single molecule, base-pair resolved gene expression simulator
   –  SimHap GUI: An intuitive graphical user interface for genetic association analysis
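Head and title clues of this kind lend themselves to simple patterns. The sketch below is one possible encoding, not the authors' implementation: the list of head nouns, both regular expressions, and the function names are assumptions, and the first test sentence is constructed from the names on the slide.

```python
import re

# "HEAD such as A, B, and C": capture the enumerated names after a head noun.
SUCH_AS = re.compile(
    r"\b(programs|tools|databases|packages)\s+such\s+as\s+([^.]+)",
    re.IGNORECASE,
)
# "Name: description" at the start of an article title.
TITLE = re.compile(r"^([A-Za-z][\w\- ]{0,30}?):\s+\S")

def hearst_candidates(sentence: str) -> list:
    """Names enumerated after a 'such as' Hearst pattern, if any."""
    m = SUCH_AS.search(sentence)
    if not m:
        return []
    parts = re.split(r",\s*(?:and\s+)?|\s+and\s+", m.group(2))
    return [p.strip() for p in parts if p.strip()]

def title_candidate(title: str):
    """Text before the first colon of a title, or None if there is no colon."""
    m = TITLE.match(title)
    return m.group(1) if m else None

print(hearst_candidates("We compared programs such as Simlink and SimPed."))
print(title_candidate("CoXpress: differential co-expression in gene expression data"))
print(title_candidate("SimHap GUI: An intuitive graphical user interface"))
```

The title pattern will also fire on ordinary titles that happen to contain a colon, so it is best treated as weak supporting evidence rather than a standalone rule.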


Potential Clues

•  References
   –  Galaxy [18] and EpiGRAPH [19]
   –  The learning metrics principle [14,15] (FP)

•  Versions
   –  using dot v1.10 and Graphviz 1.13(v16).
   –  CLUSTAL W version 1.83
   –  Dynalign 4.5, and LocARNA 0.99

•  Comparisons
   –  xPedPhase did better than i Linker
   –  Cofolga2 with this cutoff PSVM gives a better false positive rate compared to RNAz
   –  Foldalign was much slower than Cofolga2 except for
   –  Like Moleculizer, Tabasco dynamically generates
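Version information is one of the easier clues to capture with a pattern. A minimal sketch, assuming a name is a single token (optionally followed by one capital letter, as in "CLUSTAL W") and a version is a dotted number with an optional "v" or "version" marker; the regular expression and function name are illustrative.

```python
import re

# Name token (plus an optional single capital, e.g. "CLUSTAL W"),
# an optional "version"/"v" marker, then a dotted version number.
VERSION = re.compile(
    r"\b([A-Za-z][A-Za-z0-9]*(?: [A-Z])?)\s+(?:version\s+|v)?(\d+(?:\.\d+)+)"
)

def versioned_names(text: str) -> list:
    """Return (name, version) pairs found in the text."""
    return [(m.group(1), m.group(2)) for m in VERSION.finditer(text)]

print(versioned_names("using dot v1.10 and Graphviz 1.13(v16)."))
print(versioned_names("CLUSTAL W version 1.83"))
print(versioned_names("Dynalign 4.5, and LocARNA 0.99"))
```

A pattern this loose would also match strings such as section or table numbers, so in practice it would need to be combined with other evidence.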

Potential Clues

•  the SimHap GUI installation.
•  implemented within PedPhase
•  Our motivations for creating Tabasco
•  MethMarker therefore provides
•  A typical screenshot of MethMarker
•  MethMarker’s user interface reflects

•  Tested effect on precision
•  Ran regular expression
•  Percentage of sentences with resource name and that matched regex:
   –  ran|run(ning|s)?
      •  48%
   –  RAM
      •  50%
   –  Website
      •  77%
•  … so are plausible clues.
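The percentages above can be computed with a short script. The slide wording leaves the exact denominator open; this sketch assumes the figure is the share of regex-matching sentences that also contain a resource mention. The annotated sentences are invented toy data and `clue_hit_rate` is a hypothetical helper.

```python
import re

def clue_hit_rate(sentences, pattern: str) -> float:
    """Among sentences matching `pattern`, the fraction containing a resource mention.

    `sentences` is a list of (text, has_resource_mention) pairs.
    """
    regex = re.compile(pattern)
    matched = [has_mention for text, has_mention in sentences if regex.search(text)]
    return sum(matched) / len(matched) if matched else 0.0

# Toy annotated sentences, invented for illustration.
sentences = [
    ("We ran Dynalign on each pair.", True),
    ("The simulation was run twice.", False),
    ("Each job needs 2 GB of RAM.", False),
    ("MethMarker runs on any platform.", True),
]

print(clue_hit_rate(sentences, r"\b(?:ran|run(?:ning|s)?)\b"))
print(clue_hit_rate(sentences, r"\bRAM\b"))
```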


Scope

•  Database
•  Software
•  Method
•  Approach
•  Algorithm
•  Task
•  Programming Language
•  Records/Identifiers
•  File Formats

•  Authors mix vocabulary
•  Fuzzy distinction
•  R language, R software
   –  Distinction?
•  Microsoft Excel
   –  Lots of statistics
•  Student's t-test
   –  Lots of statistics tools


Summary

•  Annotation guidelines
•  Annotated gold corpus
•  Evaluated resource name mentions
   –  Composition
   –  Ambiguity
   –  Variability

•  Dictionary match: F-score < 55%
•  Provide potential clues for capture

•  Acknowledgments
   –  BBSRC
   –  Dan Jamieson – IAA

•  http://sourceforge.net/projects/bionerds/

•  Thank you!
•  Questions?
