Grounding Gene Mentions with Respect to Gene Database Identifiers Ben Hachey BioNLP Reading Group 18.07.2005 Overview of BioCreAtIvE Task 1B

Grounding Gene Mentions with Respect to Gene Database Identifiers

Ben Hachey, BioNLP Reading Group, 18.07.2005

Overview of BioCreAtIvE Task 1B

Outline

• BioCreAtIvE Task 1B

• Approaches to Task 1B

• Related Work

• Conclusions

BioCreAtIvE

• Critical Assessment of Information Extraction Systems in Biology

– Task 1A: Named Entity Recognition

• Given a single sentence from an abstract, identify all mentions of genes

• “(or proteins where there is ambiguity)”

– Task 1B: Entity Normalisation

• Given an NER’d abstract, associate a list of unique identifiers

– Task 2: Automatic GO code annotation

Example Abstract from Fly

Input:

Since Dpp and Gbb levels are not detectably higher in the early phases of cross vein development, other factors apparently account for this localized activity. Our evidence suggests that the product of the crossveinless 2 gene is a novel member of the BMP-like signaling pathway required to potentiate Gbb of Dpp signaling in the cross veins. crossveinless 2 is expressed at higher levels in the developing cross veins and is necessary for local BMP-like activity.

Output:

FBgn0000395 crossveinless 2

FBgn0000490 Dpp

FBgn0024234 Gbb

Analogous Tasks

• Can be seen as…

– Grounding: tying a textual mention of an entity to its identifier in a gene database/ontology. Provides a list, without repetition, of the entities referred to in the sentence (Information Extraction)

– Coreference: identifying which textual mentions refer to the same entity

– Lexical Entailment: whether a term is substitutable in a given context

Resources

• Synonym database provided for each organism:

– Fly (Drosophila melanogaster)

– Yeast (Saccharomyces cerevisiae)

– Mouse (Mus musculus)

• These list a number of different textual realisations for each unique gene identifier

Fly Synonym DB Examples

ID Synonyms

FBgn0000395 CG15671, CT35855, crossveinless 2, cv 2, cv-2

FBgn0000490 CG9885, DPP, DPP C, DPP-C, Dpp, Haplo insufficient, Haplo-insufficient, Hin d: Haplo insufficient, Hind: Haplo-insufficient, M(2)23AB, M(2)LS1, Tegula, Tg: Tegula, blink, blk: blink, decapentaplegic, dpp, heldout, ho, ho: heldout, l(2)10638, l(2)22Fa, l(2)k17036, shortvein, Shv

FBgn0001105 CG10545, G beta, G betab, G protein &bgr; subunit, G protein &bgr;-subunit 13F, G protein beta 13F, G protein beta subunit, G protein beta subunit 13F, G protein beta-subunit 13F, G&bgr;, G&bgr;13F, G-&bgr;b, Gbeta, G-betab, G-protein &bgr; 13F, G-protein beta 13F, G<down>&bgr;</down> brain, Gb13F, Gbb, Gbeta, Gbeta brain, Gbeta13F, anon EST:Liang 1.22, anon-EST:Liang-1.22, clone 1.22, dg&bgr;, dgbeta

FBgn0017531 Spal\crossveinless 2, Spal\crossveinless-2, crossveinless 2, crossveinless-2

FBgn0018552 Dpse\cv2, crossveinless, crossveinless 2, crossveinless-2, cv

FBgn0024200 CG9936, Pap/Trap, Scad78, Suppressor of constitutively activated Dpp signaling 78, TRAP240, bli, blind spot, bls, dTRAP240, flytrap, l(3)L7062, l(3)rK760, pap, pap/dTRAP240, poils aux pattes

FBgn0024234 60A, CG5562, Gbb, Gbb 60A, Gbb-60A, SixtyA, TGF&bgr;-60A, TGFbeta 60A, TGFbeta-60A, Tgf&bgr;-60A, Tgfb 60, Tgfb-60, Tgfbeta 60A, Tgfbeta-60A, Transforming growth factor &bgr; at 60A, Transforming growth factor beta at 60A, gbb, gbb 60A, gbb-60A, gcn, gcn: gonial cell neoplasm, gcn: gonial-cell-neoplasm, glass bottom boat, glass bottom boat 60A, glass bottom boat-60A, l(2)60A J, l(2)60A-J, tgfb 60A, tgfb-60A, vgr/60A

FBgn0044017 Scad67, Suppressor of constitutively activated Dpp signaling 67
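
The rows above show why the synonym database alone cannot ground a mention: the same string can be listed under several identifiers. As a minimal sketch, the Python below builds an inverted index from synonym to gene IDs over a hand-picked subset of those rows and prints the ambiguous strings. The names `SYNONYMS` and `build_index` are illustrative only and are not part of the BioCreAtIvE materials.

```python
from collections import defaultdict

# Hand-picked subset of the fly synonym rows shown above.
SYNONYMS = {
    "FBgn0000395": ["CG15671", "crossveinless 2", "cv 2", "cv-2"],
    "FBgn0000490": ["CG9885", "Dpp", "dpp", "decapentaplegic"],
    "FBgn0001105": ["CG10545", "Gbb", "Gbeta", "G protein beta 13F"],
    "FBgn0017531": ["crossveinless 2", "crossveinless-2"],
    "FBgn0018552": ["crossveinless", "crossveinless 2", "cv"],
    "FBgn0024234": ["CG5562", "Gbb", "gbb", "glass bottom boat"],
}

def build_index(synonyms):
    """Map each synonym string to the set of gene IDs that list it."""
    index = defaultdict(set)
    for gene_id, names in synonyms.items():
        for name in names:
            index[name].add(gene_id)
    return index

index = build_index(SYNONYMS)

# Synonyms listed under more than one identifier need disambiguation.
ambiguous = {s: sorted(ids) for s, ids in index.items() if len(ids) > 1}
for synonym, ids in sorted(ambiguous.items()):
    print(synonym, "->", ", ".join(ids))
```

In this subset, "crossveinless 2" and "Gbb" each map to more than one identifier, which is exactly the case the example abstract exercises.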

Data Preparation

• (Start with documents whose full text has been manually curated)

• Noisy Training Data

1. Automatically eliminate gene IDs not found in abstract

– Fly: 0.83, Mouse: 0.71, Yeast: 0.92 (quality)

• Testing Gold Standard

2. Hand check for over-zealous elimination

3. Add genes mentioned “in passing” (so task is same across organisms)

– Fly: 0.93, Mouse: 0.87, Yeast: 0.96 (agreement)

– 250 abstracts/organism

Evaluation

• Precision, recall, and balanced f-score automatically calculated with respect to gold standard gene ID lists

• 8 teams total

– Various numbers of submissions (0-3) on each organism

• Number of submissions

– Fly: 11

– Mouse: 16

– Yeast: 15
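
To make the scoring concrete, here is a minimal sketch of precision, recall and balanced F-score computed over gene ID lists for a single abstract. The function name `score` is illustrative, and how the official scorer aggregates over abstracts is not stated here, so this only shows the per-list arithmetic. The gold list is the one from the fly example abstract; the predicted list is an invented system output.

```python
def score(predicted, gold):
    """Return (precision, recall, balanced F) for two collections of gene IDs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

gold = ["FBgn0000395", "FBgn0000490", "FBgn0024234"]
# Invented output: the system grounded "Gbb" to the wrong identifier.
predicted = ["FBgn0000395", "FBgn0000490", "FBgn0001105"]

p, r, f = score(predicted, gold)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

Because the lists are treated as sets, repeating an identifier neither helps nor hurts, which matches the "list, without repetition" framing of the task.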

Top Systems Performance (P/R/F, F-score rank in parentheses)

Focus team                                     Fly                 Mouse               Yeast
16 – Hanisch et al. (Fraunhofer, LMU Munich)   83.1/80.0/81.5 (1)  76.5/81.9/79.1 (1)  96.6/84.0/89.9 (3)
8 – Crim et al. (UPenn)                        69.2/76.5/72.6 (2)  82.8/67.6/74.4 (4)  95.0/89.4/92.1 (1)
24 – Fundel et al. (LMU Munich)                –                   76.4/78.7/77.6 (2)  91.7/87.8/89.7 (4)
18 – No paper (???)                            46.3/38.0/41.7 (5)  72.8/64.8/68.6 (6)  94.0/87.1/90.4 (2)
5 – Hachey et al. (Edinburgh, Stanford)        59.2/74.8/66.1 (3)  81.1/67.6/73.7 (5)  91.5/79.0/84.8 (6)
6 – Tamames (BioAlma)                          –                   78.5/70.9/74.5 (3)  90.7/81.4/85.8 (5)

Outline

• BioCreAtIvE Task 1B

• Approaches to Task 1B

• Related Work

• Conclusions

Approaches

1. Use synonyms for simple matching against text

– Difficult to ID false positives

• Especially mouse and fly, where synonyms include common words (e.g. with, at, yellow, …)

2. ID gene text, then ground

– Leverage NER system from Task 1A

– Limited by performance of NER

• 78.8% precision, 73.5% recall, 76.1 balanced F

• 37% of FPs and 39% of FNs due to boundary problems

Synonym lists not exhaustive
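
The false-positive problem with approach 1 is easy to reproduce. The sketch below does plain whole-token synonym matching against one sentence of the fly example abstract. The synonym table is a toy: the entries for "at" and "with" stand in for the common-word synonyms mentioned above, and their IDs (`GENE_AT`, `GENE_WITH`) are placeholders rather than real identifiers.

```python
import re

# Toy synonym table. The IDs for "at" and "with" are hypothetical placeholders.
SYNONYMS = {
    "Dpp": "FBgn0000490",
    "crossveinless 2": "FBgn0000395",
    "at": "GENE_AT",
    "with": "GENE_WITH",
}

def match_synonyms(text, synonyms):
    """Return the IDs whose synonym occurs in text on word boundaries."""
    found = set()
    for synonym, gene_id in synonyms.items():
        # Lookarounds stop "at" from matching inside a longer word.
        pattern = r"(?<!\w)" + re.escape(synonym) + r"(?!\w)"
        if re.search(pattern, text):
            found.add(gene_id)
    return found

text = ("crossveinless 2 is expressed at higher levels in the developing "
        "cross veins and is necessary for local BMP-like activity.")
print(sorted(match_synonyms(text, SYNONYMS)))
```

The preposition "at" is returned alongside the genuine crossveinless 2 match, so some filtering or synonym-list editing step is needed after matching.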

Information Sources

• Edit synonym list?

– Add other specific and frequently used synonyms

– Remove problematic synonyms

• String similarity

– Matching against synonym list

– Fuzzy matching (spelling variations, abbreviations, …)

• Coreference

– Synonym in same text

• Other contextual evidence

– Gene co-occurrence in same text

– Word context around entity…

• Probabilistic/Statistical models

– Pr(geneID), Pr(geneID|synonym)

Top Systems Overview

Columns: Team; Approach (1, 2); Inf. Sources (EdSyn, StrSim, Coref, OthrC, Prob)

Teams: user16 (h), user8 (a), user24 (h), user5, user6 (?)

Approach: 1 – match synonyms to text; 2 – NER, then match entities to synonyms

Inf. Sources: EdSyn – edited synonym list; StrSim – string similarity/fuzzy match; Coref – coreference; OthrC – other contextual evidence; Prob – probabilistic/statistical models

Systems

• Team: user24 (mouse, yeast)

– Katrin Fundel, Daniel Güttler, Ralf Zimmer, and Joannis Apostolakis

– Ludwig-Maximilians-Universität München

• Approach: No NER, match synonyms to text

– Rule-based generation and curation of synonym lists

• Remove unspecific and inappropriate synonyms

• Expanded to include additional, frequently used synonyms

– Automatic rule-based edit system

– Human curation to assure quality

• Tuned using training data

– Select all matches

– Post-filter: Remove matches with non-gene context (e.g. ‘cells’, ‘domains’, ‘cell type’, ‘DNA binding site’)

Semi-automatic syn list curation, could be used for gazetteers!
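
The post-filter step can be illustrated with a small sketch. It assumes that "non-gene context" means the match is immediately followed by one of the listed phrases; the real system's rules are not spelled out here, so this is only one plausible reading. The helper names and the `GENE_SH3` identifier are hypothetical.

```python
# Context phrases taken from the examples above.
NON_GENE_CONTEXT = ("cells", "domains", "cell type", "DNA binding site")

def post_filter(text, matches):
    """Drop matches directly followed by a non-gene context phrase.

    matches: list of (start, end, gene_id) character spans into text.
    """
    kept = []
    for start, end, gene_id in matches:
        following = text[end:].lstrip()
        if not following.startswith(NON_GENE_CONTEXT):
            kept.append((start, end, gene_id))
    return kept

def span(text, surface, gene_id):
    """Build a (start, end, gene_id) span for the first occurrence of surface."""
    start = text.index(surface)
    return (start, start + len(surface), gene_id)

text = "SH3 domains bind proline-rich motifs, while Dpp is expressed in the wing."
matches = [span(text, "SH3", "GENE_SH3"), span(text, "Dpp", "FBgn0000490")]
kept = post_filter(text, matches)
print([gene_id for _, _, gene_id in kept])
```

Here "SH3" is discarded because it is followed by "domains", while the Dpp match survives.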

Systems

• Team: user16 (fly, mouse, yeast)

– Daniel Hanisch, Katrin Fundel, Heinz-Theodor Mevissen, Ralf Zimmer, and Juliane Fluck

– Fraunhofer Institute & Ludwig-Maximilians-Universität München

• Approach: No NER, match synonyms to text

– Synonym list expanded (offline)

• Automatic rule-based edit system with human curation to assure quality

– Rule-based classification of synonyms

• Class I: Case-insensitive near-synonyms

• Class II: Case-sensitive near-synonyms

• Class III: Questionable synonyms (high frequency, inexact match)

– Select n highest scoring matches (Hanisch et al., 2003)

• Focus on matching multi-word terms

Syn list curation, multi-word term matching!

Systems

• Team: user8 (fly, mouse, yeast)

– Jeremiah Crim, Ryan McDonald, and Fernando Pereira

– University of Pennsylvania

• Approach: No NER, match synonyms to text

– Pattern Matching

• Synonym list pruned by threshold on conditional probability of a gene ID (g) being a label for a document given that a synonym (s) matches

• List of candidate gene IDs compiled by selecting 1000 training documents with highest token-level cosine similarity

– Match Classification

• Binary maximum entropy classifier trained to predict whether gene IDs selected by pattern matching should be kept

• Fly: +7.7, Mouse: +1.5, Yeast: -0.4

Prob models (pruning, disambiguation), no human curation!
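
The synonym-pruning idea can be sketched as a relative-frequency estimate of Pr(g | s matches) over labelled training documents. Everything concrete below is an assumption made for illustration: the 0.5 threshold, the single-token matching, the toy documents, and the placeholder ID `GENE_AT`. The team's actual estimator and threshold are not given here.

```python
from collections import Counter

def prune_synonyms(training_docs, synonyms, threshold=0.5):
    """Keep (synonym, gene_id) pairs with estimated Pr(g | s matches) >= threshold.

    training_docs: list of (text, set of gold gene IDs)
    synonyms: list of (synonym, gene_id) pairs
    Pairs whose synonym never matches are dropped, since nothing can be estimated.
    """
    matched = Counter()  # documents in which the synonym matches
    correct = Counter()  # ... and the gene ID is one of the document's labels
    for text, gold_ids in training_docs:
        tokens = text.split()
        for synonym, gene_id in synonyms:
            if synonym in tokens:  # single-token match, for brevity
                matched[(synonym, gene_id)] += 1
                if gene_id in gold_ids:
                    correct[(synonym, gene_id)] += 1
    return [pair for pair in synonyms
            if matched[pair] and correct[pair] / matched[pair] >= threshold]

# Toy training documents with invented gold labels.
docs = [
    ("Dpp signaling at the cross veins", {"FBgn0000490"}),
    ("expressed at higher levels", set()),
    ("Dpp and Gbb levels", {"FBgn0000490", "FBgn0024234"}),
]
pairs = [("Dpp", "FBgn0000490"), ("Gbb", "FBgn0024234"), ("at", "GENE_AT")]
kept = prune_synonyms(docs, pairs)
print(kept)
```

The common word "at" matches two documents but is never a correct label, so it is pruned without any human curation.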

Systems

• Team: user5 (fly, mouse, yeast)

– Ben Hachey, Huy Nguyen, Malvina Nissim, Bea Alex, and Claire Grover

– University of Edinburgh & Stanford

• Approach: NER, match entities to synonyms

1. Build organism-specific named entity recognition

• Noisy training data obtained from Task 1B materials

2. Match gene entities to synonym lists (fuzzy)

• Incorporates various edit operations (e.g. case folding, optional dashes and other punctuation, British/American spellings)

• Tuned per-organism to select and order edit operations

3. Disambiguate each entity to a single gene ID

• Various heuristic and statistical approaches (e.g. gene ID co-occurrence, IR query term weighting, repetition in synonym list)

• Again, optimised per-organism

Bootstrapping NE data, IR term weighting, no human curation!
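
As an illustration of step 2, the sketch below applies two of the edit operations named above (case folding, and treating dashes and other punctuation as interchangeable with spaces) to both the synonym list and the recognised mention before lookup. It is a simplified stand-in, not the team's implementation: it does not handle British/American spellings, deleted dashes ("cv2" for "cv-2"), or the per-organism ordering of operations. The function names are illustrative.

```python
import re

def normalise(name):
    """Case-fold, turn punctuation into spaces, and collapse whitespace."""
    name = name.lower()
    name = re.sub(r"[^\w\s]", " ", name)  # dashes, slashes, colons, ...
    name = name.replace("_", " ")
    return " ".join(name.split())

def build_lookup(synonyms):
    """Map each normalised synonym to the set of gene IDs that list it."""
    lookup = {}
    for gene_id, names in synonyms.items():
        for name in names:
            lookup.setdefault(normalise(name), set()).add(gene_id)
    return lookup

# Synonyms taken from the fly examples earlier in the talk.
SYNONYMS = {
    "FBgn0000395": ["crossveinless 2", "cv-2"],
    "FBgn0024234": ["Gbb-60A", "Tgfbeta-60A", "glass bottom boat"],
}
lookup = build_lookup(SYNONYMS)

for mention in ["Crossveinless-2", "gbb 60A", "TGFbeta 60A"]:
    print(mention, "->", sorted(lookup.get(normalise(mention), [])))
```

A mention that still maps to several IDs after this step is what the disambiguation stage (step 3) has to resolve.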

Systems

• Team: user6 (fly, mouse, yeast)

– Javier Tamames

– BioAlma SL

• Approach: NER, match entities to synonyms

1. NER for various biological entities (e.g. genes, proteins, compounds)

• Also biomedical semantic tagging of words, e.g. core terms (receptor, kinase, …) and types (alpha, a1, …)

2. Match gene entities to synonym lists (fuzzy)

• Use BioCreAtIvE lists and other relevant databases

• Match and weighting based on semantic labels

3. Disambiguate each entity to a single gene ID

• Use of key words extracted from databases (e.g. HUGO, MGI, SGD)

Semantic tagging module, key word context from organism DBs!

Outline

• BioCreAtIvE Task 1B

• Approaches to Task 1B

• Related Work

• Conclusions

Related Work

• Ben Wellner (2005). Weakly Supervised Learning Methods for Improving the Quality of Gene Name Normalization Data. In: Proceedings of BioLink-2005.

• …

Outline

• BioCreAtIvE Task 1B

• Approaches to Task 1B

• Related Work

• Conclusions

Conclusions

• Model that can be automatically tuned to e.g. domain, organism

• Proper modelling of:

– Abbreviations

– Spelling variants

– Coreference in abstracts

– Textual context, key words

– Entity co-occurrence

– Entity and term distributions

– Token semantic roles

Thank you

References

Lynette Hirschman, Marc Colosimo, Alexander Morgan, Jeffrey Colombe, and Alexander Yeh (2004). Task 1B: Gene list task. In: Proceedings BioCreAtIvE Workshop.

Daniel Hanisch, Katrin Fundel, Heinz-Theodor Mevissen, Ralf Zimmer, and Juliane Fluck (2004). ProMiner: Organism-specific protein name detection using approximate string matching. In: Proceedings BioCreAtIvE Workshop. [user16]

Katrin Fundel, Daniel Güttler, Ralf Zimmer, and Joannis Apostolakis (2004). Exact versus approximate string matching for protein name identification. In: Proceedings BioCreAtIvE Workshop. [user24]

Jeremiah Crim, Ryan McDonald, and Fernando Pereira (2004). Automatically annotating documents with normalized gene lists. In: Proceedings BioCreAtIvE Workshop. [user8]

Ben Hachey, Huy Nguyen, Malvina Nissim, Bea Alex, and Claire Grover (2004). Grounding gene mentions with respect to gene database identifiers. In: Proceedings BioCreAtIvE Workshop. [user5]

Javier Tamames (2004). Text detective: BioAlma’s gene annotation tool. In: Proceedings BioCreAtIvE Workshop. [user6]

Daniel Hanisch, Juliane Fluck, Heinz-Theodor Mevissen, and Ralf Zimmer (2003). Playing biology’s name game: Identifying protein names in scientific text.

The SEER Project Team

Edinburgh: Bea Alex, Shipra Dingare, Claire Grover, Ben Hachey, Ewan Klein, Yuval Krymolowski, Malvina Nissim

Stanford: Jenny Finkel, Chris Manning, Huy Nguyen

Top Systems Performance (P/R/F, team number in parentheses)

F rank  Fly                  Mouse                Yeast
1       83.1/80.0/81.5 (16)  76.5/81.9/79.1 (16)  95.0/89.4/92.1 (8)
2       69.2/76.5/72.6 (8)   76.4/78.7/77.6 (24)  94.0/87.1/90.4 (18)
3       59.2/74.8/66.1 (5)   78.5/70.9/74.5 (6)   96.6/84.0/89.9 (16)
4       31.5/73.2/44.0 (23)  82.8/67.6/74.4 (8)   91.7/87.8/89.7 (24)
5       46.3/38.0/41.7 (18)  81.1/67.6/73.7 (5)   90.7/81.4/85.8 (6)
6       22.4/38.9/28.4 (19)  72.8/64.8/68.6 (18)  91.5/79.0/84.8 (5)

Focus:

16 – Hanisch et al. (Fraunhofer, LMU Munich)

8 – Crim et al. (UPenn)

24 – Fundel et al. (LMU Munich)

18 – No paper (???)

5 – Hachey et al. (Edinburgh, Stanford)

6 – Tamames (BioAlma)

Top Systems Rank

F rank  Fly  Mouse  Yeast  Focus
1       16   16     8      16 – Hanisch et al. (Fraunhofer, LMU Munich)
2       8    24     18     8 – Crim et al. (UPenn)
3       5    6      16     24 – Fundel et al. (LMU Munich)
4       23   8      24     18 – ???
5       18   5      6      5 – Hachey et al. (Edinburgh, Stanford)
6       19   18     5      6 – Tamames (BioAlma)