Ambiguity and Variability of Database and Software Names in Bioinformatics
SMBM 2012
Geraint Duck1, Robert Stevens1, David Robertson2 and Goran Nenadic1
1School of Computer Science, 2Faculty of Life Sciences The University of Manchester
Manchester, UK
Named Entity Recognition (NER)
• Variety of NER uses
  – Species
  – Gene/protein names
  – Chemical names
• Variety of NER accuracy
  – 95% F-score species (LINNAEUS)
  – 73% F-score (strict) gene name (ABNER)
  – Over 70% F-score chemical names (OSCAR3)
• Draw parallels for database and software NER
Example
PMC1660556; M. Watson
Challenges - Ambiguity
• leg • white • cab
• C. elegans – 41 NCBI taxonomy species
• HIV – Human immunodeficiency virus
– Human immunovirus
• analysis • Network • graph
• DIP
  – distal interphalangeal
  – Database of Interacting Proteins
Challenges - Variability
• NF-kappaB • NF-kappa B • NF-kappa-B • NF-κB
• Case variants • Spelling variants
• ClustalW • Clustal W • Clustal-W • CLUSTAL W
• ClustalX (GUI)? • Now: Clustal Omega
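The variants listed above differ only in case, spacing, hyphenation and the spelling of a Greek letter, so one way to group them is to map each mention to a normalised key. The following is a minimal sketch of that idea, not the method used in the talk; the function name `normalise_name` and the specific rules (case folding, dropping separators, spelling out κ) are assumptions made for illustration.

```python
import re
import unicodedata


def normalise_name(name: str) -> str:
    """Collapse case, hyphen and spacing differences in a resource name."""
    # NFKC folds compatibility characters; it does not transliterate Greek.
    text = unicodedata.normalize("NFKC", name)
    # Assumption: spell out kappa so "NF-κB" lines up with "NF-kappaB".
    text = text.replace("κ", "kappa")
    # Drop whitespace and hyphen-like separators, then ignore case.
    text = re.sub(r"[\s\-\u2010\u2011]+", "", text)
    return text.casefold()


clustal = ["ClustalW", "Clustal W", "Clustal-W", "CLUSTAL W"]
nfkb = ["NF-kappaB", "NF-kappa B", "NF-kappa-B", "NF-κB"]

clustal_keys = {normalise_name(n) for n in clustal}
nfkb_keys = {normalise_name(n) for n in nfkb}
print(clustal_keys, nfkb_keys)
```

Each group collapses to a single key, while ClustalX keeps its own key. Aggressive normalisation of this kind is likely to raise ambiguity with common words, which is the trade-off the ambiguity slide describes.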
Preliminary
• Annotation guidelines
  – Database, software, package, ontology names
  – Not file formats, algorithms, tasks, methods, database identifiers, programming languages, operating systems, etc.
• Gold standard corpus
  – 25 documents from BMC Bioinformatics and PLoS Computational Biology; 5 from Genome Biology
• Dictionary of resource names
  – 4,879 unique entries from 10 online resources
Preliminary
• Inter-annotator agreement
  – F-score: 86%
• 30 documents
  – 1319 total mentions
  – 224 unique mentions

            Databases     Software      Combined
Precision   0.79 (0.66)   0.99 (0.96)   0.93 (0.87)
Recall      0.67 (0.56)   0.84 (0.82)   0.80 (0.74)
F-measure   0.73 (0.61)   0.91 (0.88)   0.86 (0.80)

Total Number of Documents               30
Total Database and Software Mentions    1319
Total Unique Resource Mentions          224
Percentage of Database Mentions         36%
Percentage of Unique DB Mentions        26%
Average Mentions per Document           44
Average Unique Mentions per Document    8.2
Max Mentions in a Single Document       227
Max Unique Mentions in a Document       33
Resources with only a Single Mention    117
Ambiguity and Variability
• Compared names to
  – Acronym Dictionary: 1,933
  – English Dictionary: 86,308
• Ambiguity in corpus:
  – ≈ 2% (case-sensitive)
  – ≈ 12% (case-insensitive)
• Ambiguity in names dictionary:
  – ≈ 0.1% (case-sensitive)
  – ≈ 0.5% (case-insensitive)
• 224 unique names
  – 45 were variants
    • 15 acronyms
    • Orthographics
    • Spellings
  – 179 different resources
    • 79% one variant
    • 17% two variants
    • 4% three variants
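The case-sensitive versus case-insensitive ambiguity figures above can be illustrated with a small lookup against a general word list. This is a sketch under stated assumptions: the function `ambiguity_rate` is hypothetical, the names and word list are toy data rather than the corpus or dictionaries from the talk, and ambiguity is assumed to mean "the name also appears in the comparison dictionary".

```python
def ambiguity_rate(names, common_words, case_sensitive=True):
    """Fraction of names that also appear in a general-purpose word list."""
    if case_sensitive:
        lookup = set(common_words)
        hits = [n for n in names if n in lookup]
    else:
        lookup = {w.casefold() for w in common_words}
        hits = [n for n in names if n.casefold() in lookup]
    return len(hits) / len(names) if names else 0.0


# Toy data for illustration only.
names = ["BLAST", "Network", "DIP", "SHAKE", "ClustalW", "dot", "Galaxy", "R"]
common_words = ["network", "shake", "dot", "galaxy", "dip", "blast", "analysis"]

strict = ambiguity_rate(names, common_words, case_sensitive=True)
relaxed = ambiguity_rate(names, common_words, case_sensitive=False)
print(f"case-sensitive: {strict:.1%}, case-insensitive: {relaxed:.1%}")
```

As on the slide, ignoring case increases the share of names that collide with ordinary words, because capitalisation is often the only thing separating a resource name from a common noun.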
Name Composition
• Majority are single nouns
  – includes acronyms
• 6% lowercase common nouns
  – affy, bioconductor
• A few contained numbers
  – S4, t2prhd
• A few misclassified as verbs
  – …each query protein is first BLASTed with…
  – …held near their equilibrium values using SHAKE.
  – …graphical representations were achieved using dot v1.10…

Pattern                    Share
NNP                        68.0%
NNP NNP                    8.8%
NN                         5.7%
NNP NNP NNP                5.3%
NNP CD                     3.1%
NNP CD . CD                1.8%
NNP NNP NNP NNP NNP        1.3%
NNP LS                     0.9%
NNP NNP NNP NNP            0.9%
Other Patterns             4.4%
Name Composition
• Longest Names (most tokens)
  – Corpus: 5 – Gene Expression Profile Analysis Suite
  – Dictionary: 12 – Prediction of Protein Sorting Signals and Localisation Sites in Amino Acid Sequences
• Evaluated (stemmed) token frequencies within the dictionary
  – Long-tail curve
  – 87% used only once
  – High frequency words suggest common heads and bioinformatics related terms
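The token-frequency evaluation above can be sketched with a simple counter. This is only an approximation of what the slide describes: the talk used stemmed tokens, whereas this sketch merely lower-cases whitespace-separated tokens (an assumption, to stay within the standard library). The function names and the toy name list are hypothetical, apart from the one corpus name quoted on the slide.

```python
from collections import Counter


def token_frequencies(names):
    """Count lower-cased whitespace tokens across all names."""
    counts = Counter()
    for name in names:
        counts.update(token.lower() for token in name.split())
    return counts


def singleton_share(counts):
    """Share of distinct tokens that occur exactly once."""
    if not counts:
        return 0.0
    once = sum(1 for c in counts.values() if c == 1)
    return once / len(counts)


# Toy names for illustration; only the first is quoted on the slide.
names = [
    "Gene Expression Profile Analysis Suite",
    "Protein Database",
    "Gene Database",
    "Protein Sequence Viewer",
]
counts = token_frequencies(names)
print(counts.most_common(3))
print(f"tokens used once: {singleton_share(counts):.0%}")
```

On a real dictionary, a high singleton share combined with a few very frequent tokens is what produces the long-tail curve, and the frequent tokens are candidates for head words.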
[Figure: Token Frequency within Dictionary Database and Software Names. Token frequency (axis 0 to 120) plotted against the top tokens (words) by rank (axis 0 to 150); labelled tokens include Genome, Protein, Gene, Sequence and Human.]
Dictionary Matching
• F-score under 55%
  – Low precision
    • GO (GO:0007089)
    • cycle
    • genomes
  – Low recall: dictionary is not comprehensive
    • i Linker
    • xPedPhase
• 95% of mentions could be matched…
          TP    FP    FN    P     R     F
Lenient   729   633   590   54%   55%   54%
Strict    695   667   624   51%   53%   52%
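The percentages in the dictionary-matching table follow from the TP, FP and FN counts using the standard definitions of precision, recall and F-measure. The sketch below recomputes them as a consistency check; the function name `precision_recall_f1` is chosen for illustration, and the balanced F1 is assumed to be the F-measure reported.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from raw match counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts copied from the table above.
results = {"Lenient": (729, 633, 590), "Strict": (695, 667, 624)}

for label, (tp, fp, fn) in results.items():
    p, r, f = precision_recall_f1(tp, fp, fn)
    print(f"{label}: P={p:.0%} R={r:.0%} F={f:.0%}")
```

In both rows TP + FN equals 1319, the total number of mentions in the corpus, so recall is measured against every annotated mention.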
Dictionary matches           55.3%
Heads and Hearst patterns    9.7%
Title appearances            0.6%
References and URLs          1.9%
Version information          1.2%
Noun/Verb associations       20.3%
Comparisons                  5.8%
Remaining                    5.2%
Potential Clues
• Heads
  – the stochastic simulator Dizzy allows ...
  – The MethMarker software was ...
  – ... system, PSPE, specifically to ...
  – tools: CLUSTALW, ..., and MUSCLE.
  – ... programs such as Simlink, ..., and SimPed.
• Titles
  – CoXpress: differential co-expression in gene expression data
  – TABASCO: A single molecule, base-pair resolved gene expression simulator
  – SimHap GUI: An intuitive graphical user interface for genetic association analysis
Potential Clues
• References
  – Galaxy [18] and EpiGRAPH [19]
  – The learning metrics principle [14,15] (FP)
• Versions
  – using dot v1.10 and Graphviz 1.13(v16).
  – CLUSTAL W version 1.83
  – Dynalign 4.5, and LocARNA 0.99
• Comparisons
  – xPedPhase did better than i Linker
  – Cofolga2 with this cutoff PSVM gives a better false positive rate compared to RNAz
  – Foldalign was much slower than Cofolga2 except for
  – Like Moleculizer, Tabasco dynamically generates
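The version clue lends itself to a regular expression: a candidate name followed by an optional "version" or "v" marker and a dotted number. The pattern below is a rough sketch written for the three examples on this slide, not the rule used in the talk; `VERSION_CLUE` and `find_versions` are hypothetical names, and the pattern will also fire on non-resource text such as "Figure 1.2".

```python
import re

# Assumption: a name is one token, optionally followed by a single capital
# letter (to cover "CLUSTAL W"); a version needs at least one dot.
VERSION_CLUE = re.compile(
    r"\b([A-Za-z][\w-]*(?:\s[A-Z])?)\s+(?:version\s+|v)?(\d+(?:\.\d+)+)"
)


def find_versions(sentence):
    """Return (candidate name, version string) pairs found in a sentence."""
    return VERSION_CLUE.findall(sentence)


examples = [
    "using dot v1.10 and Graphviz 1.13(v16).",
    "CLUSTAL W version 1.83",
    "Dynalign 4.5, and LocARNA 0.99",
]
for sentence in examples:
    print(find_versions(sentence))
```

A clue like this is useful for names such as dot, which a dictionary or part-of-speech tagger would otherwise treat as a common noun.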
Potential Clues
• the SimHap GUI installation.
• implemented within PedPhase
• Our motivations for creating Tabasco
• MethMarker therefore provides
• A typical screenshot of MethMarker
• MethMarker’s user interface reflects
• Tested effect on precision
• Ran regular expression
• Percentage of sentences that matched the regex and contain a resource name:
  – ran|run(ning|s)?
    • 48%
  – RAM
    • 50%
  – Website
    • 77%
• … so are plausible clues.
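The regex test above can be sketched as follows. The interpretation is an assumption: of the sentences that match a clue regex, the share that also contain a resource name. The slide's pattern `ran|run(ning|s)?` is wrapped in word boundaries and made case-insensitive here, which are additions for this sketch; `clue_share`, the toy sentences and the substring-based name check are all hypothetical.

```python
import re

# The slide's pattern with word boundaries and case-insensitivity added.
RUN_CLUE = re.compile(r"\b(?:ran|run(?:ning|s)?)\b", re.IGNORECASE)


def clue_share(sentences, resource_names, clue):
    """Of the sentences matching `clue`, return the share naming a resource."""
    matched = [s for s in sentences if clue.search(s)]
    if not matched:
        return 0.0
    # Naive check: a resource counts as present if its name is a substring.
    with_name = [s for s in matched if any(n in s for n in resource_names)]
    return len(with_name) / len(matched)


# Toy sentences invented for illustration; not taken from the corpus.
sentences = [
    "We ran BLAST against the full protein set.",
    "The pipeline runs on a standard desktop.",
    "Alignments were produced with MUSCLE.",
    "Running Tabasco takes a few minutes.",
]
resource_names = ["BLAST", "MUSCLE", "Tabasco"]

share = clue_share(sentences, resource_names, RUN_CLUE)
print(f"{share:.0%} of matching sentences contain a resource name")
```

A share well above the base rate of resource-bearing sentences would suggest the clue is worth using as a feature, which is the argument the slide makes for "ran", "RAM" and "Website".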
Scope
• Database • Software • Method • Approach • Algorithm • Task • Programming Language • Records/Identifiers • File Formats
• Authors mix vocabulary
• Fuzzy distinction
• R language, R software
  – Distinction?
• Microsoft Excel
  – Lots of statistics
• Student's t-test
  – Lots of statistics tools
Summary
• Annotation guidelines
• Annotated gold corpus
• Evaluated resource name mentions
  – Composition
  – Ambiguity
  – Variability
• Dictionary match: < 55%
• Provide potential clues for capture
• Acknowledgments
  – BBSRC
  – Dan Jamieson – IAA
• http://sourceforge.net/projects/bionerds/
• Thank you!
• Questions?