LeadMine: A grammar and dictionary driven approach to chemical entity recognition

2. Workflow

1. Abstract

7. Abbreviation Detection

NextMove Software Limited
Innovation Centre (Unit 23), Cambridge Science Park
Milton Road, Cambridge, England CB4 0EY
www.nextmovesoftware.co.uk www.nextmovesoftware.com

Daniel Lowe and Roger Sayle

NextMove Software Ltd, Cambridge

LeadMine is a system for recognizing entities, especially chemical entities, using large grammars and dictionaries[1]. Entities are identified without an explicit tokenization step. To recognize terms that fall slightly outside the coverage of these resources, spelling correction, entity extension and entity merging are used. Abbreviation detection improves recall, and removing abbreviations of non-entities improves precision. When the training data was used to produce further dictionaries of terms to recognize or ignore, LeadMine achieved 86.2% precision and 85.0% recall on a development set that had not been used to build those dictionaries.

10. Bibliography

1. Sayle R, Xie PH, Muresan S. Improved Chemical Text Mining of Patents with Infinite Dictionaries and Automatic Spelling Correction. Journal of Chemical Information and Modeling. 2011;52(1):51–62.

2. Degtyarenko K, de Matos P, Ennis M, Hastings J, Zbinden M, McNaught A, Alcantara R, Darsow M, Guedj M, Ashburner M. ChEBI: a database and ontology for chemical entities of biological interest. Nucleic Acids Research. 2008;36:D344–350.

3. Schwartz A, Hearst M. A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text. In: Proceedings of the Pacific Symposium on Biocomputing. Kauai; 2003. pp. 451–462.

The rules for chemical nomenclature are written as formal grammars, e.g.

    alkanStem : ‘meth’ | ‘eth’ | ‘prop’…
    alkane : alkanStem ‘ane’

485 rules are used in the systematic chemical name grammar, and many are inherited by the derived grammars.

The 2.94 million term PubChem dictionary is the primary source of trivial names. It was produced by running a series of filters against the ~94 million synonyms provided by PubChem. These included removing terms that are English words or start with an English word. The records for structures that contained tetrasaccharides (or longer) or hexadecapeptides (or longer) were excluded.
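As a rough illustration of how such grammar rules can drive recognition, the two rules quoted above can be expanded into a regular expression and scanned over raw text. This is only a sketch: the `GRAMMAR` dictionary, `to_regex` and `find_alkanes` are hypothetical names, only the three stems shown above are included, and nothing here is meant to describe how LeadMine compiles its grammars.

```python
import re

# Toy grammar mirroring the two rules quoted above. Only the three stems
# shown in the text are included; the rest of the stem list is elided there.
GRAMMAR = {
    "alkanStem": [["'meth'"], ["'eth'"], ["'prop'"]],
    "alkane": [["alkanStem", "'ane'"]],
}

def to_regex(symbol, grammar):
    """Expand a non-recursive grammar symbol into a regex fragment."""
    if symbol.startswith("'"):  # quoted terminal
        return re.escape(symbol.strip("'"))
    alternatives = [
        "".join(to_regex(part, grammar) for part in sequence)
        for sequence in grammar[symbol]
    ]
    return "(?:" + "|".join(alternatives) + ")"

ALKANE = re.compile(to_regex("alkane", GRAMMAR), re.IGNORECASE)

def find_alkanes(text):
    """Return (start, end, matched_text) for each alkane name in the text."""
    return [(m.start(), m.end(), m.group()) for m in ALKANE.finditer(text)]

print(find_alkanes("Methane and propane"))
```

Because the scan runs over characters rather than tokens, this toy version also matches inside longer words (for example the "ethane" in "urethane"); a real system needs additional boundary handling.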

4. LeadMine Annotation

5. Entity extension and entity merging

8. Evaluation

9. Conclusions

Entities are extended until they reach whitespace, a mismatched bracket or an English word. Entities are then trimmed of non-essential parts. Finally adjacent entities are merged unless they are distinct molecules or one is an instance of the other according to ChEBI[2] (e.g. genistein is an isoflavone).
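The merging rule can be sketched as follows. The poster does not say how "distinct molecules" are detected, so the `is_molecule` flag, the tiny `IS_A` table standing in for a ChEBI lookup, and the `max_gap` parameter are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Entity:
    start: int
    end: int
    text: str
    is_molecule: bool  # assumed flag: the name describes a complete molecule

# Hypothetical stand-in for a ChEBI "is a" lookup: (instance, class) pairs.
IS_A = {("genistein", "isoflavone")}

def is_instance_of(a, b):
    return (a.text.lower(), b.text.lower()) in IS_A

def should_merge(left, right):
    """Merge unless both are distinct molecules or one is an instance of the other."""
    if left.is_molecule and right.is_molecule:
        return False
    if is_instance_of(left, right) or is_instance_of(right, left):
        return False
    return True

def merge_adjacent(entities, source, max_gap=1):
    """Merge neighbouring entities separated by at most max_gap characters."""
    merged = []
    for ent in sorted(entities, key=lambda e: e.start):
        prev = merged[-1] if merged else None
        if (prev is not None and ent.start - prev.end <= max_gap
                and should_merge(prev, ent)):
            merged[-1] = Entity(prev.start, ent.end, source[prev.start:ent.end],
                                prev.is_molecule or ent.is_molecule)
        else:
            merged.append(ent)
    return merged

src = "Glycine ester"
found = [Entity(0, 7, "Glycine", True), Entity(8, 13, "ester", False)]
print([e.text for e in merge_adjacent(found, src)])
```

Under these assumptions "Glycine" and "ester" merge into one entity, while "Hexane-benzene" (two molecules) and "Genistein isoflavone" (instance and class) stay separate.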

LeadMine combines the capabilities of grammars to recognize regular entities with the coverage of dictionaries. The results are readily understandable and can be iteratively improved.

The Schwartz and Hearst algorithm[3] was adapted to recognize abbreviations of the following forms:
• Tetrahydrofuran (THF)
• THF (tetrahydrofuran)
• Tetrahydrofuran (THF;
• Tetrahydrofuran (THF,
• (tetrahydrofuran, THF)
• THF = tetrahydrofuran

A list of domain-specific abbreviations is also used to cover cases where the full name does not contain the characters of the abbreviation, e.g. mercury (Hg) or estrone (E1).
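A simplified sketch of this kind of abbreviation detection is given below. `best_long_form` follows the right-to-left character matching step of the published algorithm[3]; the `PAREN` pattern, the `find_abbreviations` helper and the two-entry `DOMAIN_ABBREVIATIONS` table are assumptions made for illustration. The sketch covers only the parenthesised forms with the abbreviation or the full name inside the brackets, not the "(tetrahydrofuran, THF)" or "THF = tetrahydrofuran" forms, and it omits the length limits of the original algorithm.

```python
import re

# Abbreviations whose letters do not occur in the full name (examples from the text).
DOMAIN_ABBREVIATIONS = {"Hg": "mercury", "E1": "estrone"}

# "outer (inner" closed by ")", ";" or ",".
PAREN = re.compile(r"([^()]+?)\s*\(([^();,]+)[);,]")

def best_long_form(short, candidate):
    """Match the short form's characters right to left inside the candidate."""
    s, l = len(short) - 1, len(candidate) - 1
    while s >= 0:
        ch = short[s].lower()
        if not ch.isalnum():
            s -= 1
            continue
        # The first character of the short form must start a word.
        while (l >= 0 and candidate[l].lower() != ch) or (
            s == 0 and l > 0 and candidate[l - 1].isalnum()
        ):
            l -= 1
        if l < 0:
            return None
        l -= 1
        s -= 1
    start = candidate.rfind(" ", 0, l + 1) + 1
    return candidate[start:]

def find_abbreviations(sentence):
    """Return (short form, long form) pairs found in the sentence."""
    pairs = []
    for m in PAREN.finditer(sentence):
        outer, inner = m.group(1).strip(), m.group(2).strip()
        if not outer or not inner:
            continue
        last_word = outer.split()[-1]
        # Try "long form (SHORT" first, then "SHORT (long form)".
        for short, candidate in ((inner, outer), (last_word, inner)):
            if " " in short or len(short) >= len(candidate):
                continue
            long_form = best_long_form(short, candidate)
            known = DOMAIN_ABBREVIATIONS.get(short)
            if long_form is None and known and candidate.lower().endswith(known):
                long_form = candidate[-len(known):]
            if long_form:
                pairs.append((short, long_form))
                break
    return pairs

print(find_abbreviations("We used tetrahydrofuran (THF) as solvent"))
```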

The training set was used to automatically identify gaps in coverage and common false positives. From these, a dictionary of terms to include (WhiteList) and a dictionary of terms to exclude (BlackList) were derived. The workflow was then evaluated on the development set for the task of identifying all chemical entity mentions.

3. Normalization

Configuration          Precision  Recall  F-score
Baseline               0.869      0.820   0.844
WhiteList              0.862      0.850   0.856
BlackList              0.882      0.803   0.841
WhiteList + BlackList  0.873      0.832   0.852
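The F-score column is consistent with the balanced F-score, i.e. the harmonic mean of precision and recall, F = 2PR / (P + R). The poster does not state the formula, so this is an assumption, but the short check below reproduces all four reported values to three decimal places. The `f_score` helper and `RESULTS` table are illustrative names.

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the results table.
RESULTS = {
    "Baseline": (0.869, 0.820),
    "WhiteList": (0.862, 0.850),
    "BlackList": (0.882, 0.803),
    "WhiteList + BlackList": (0.873, 0.832),
}

for name, (p, r) in RESULTS.items():
    print(f"{name}: {f_score(p, r):.3f}")
```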

8. Non-entity abbreviation removal

The Schwartz and Hearst algorithm[3] is also used to find abbreviations that are recognised as entities but whose unabbreviated form is not an entity. Such abbreviations are then ignored, e.g.

current good manufacturing practice (cGMP)
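This filtering step can be sketched as a set difference over detected abbreviation pairs. The `is_entity` predicate and the toy `KNOWN_ENTITIES` dictionary are assumptions standing in for the full recognizer.

```python
# Toy stand-in for the entity recognizer: a lower-cased set of known entity names.
KNOWN_ENTITIES = {"cgmp", "thf", "tetrahydrofuran"}

def is_entity(term):
    return term.lower() in KNOWN_ENTITIES

def blocked_abbreviations(pairs):
    """Short forms that look like entities but whose long form is not one."""
    return {
        short for short, long_form in pairs
        if is_entity(short) and not is_entity(long_form)
    }

def filter_mentions(mentions, blocked):
    """Drop mentions whose text is a blocked abbreviation."""
    return [m for m in mentions if m not in blocked]

pairs = [
    ("cGMP", "current good manufacturing practice"),
    ("THF", "tetrahydrofuran"),
]
blocked = blocked_abbreviations(pairs)
print(filter_mentions(["cGMP", "THF", "cGMP"], blocked))
```

Here "cGMP" is dropped because its definition in the text is not a chemical entity, while "THF" is kept because its long form is one.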

LeadMine works internally on a normalized string with mappings back to the original input. Normalization allows XML tags to be ignored and requires fewer lexical varieties to be recognised.

Input                                            Normalized
œstradiol                                        oestradiol
5` or 5’ or 5′ (backtick/quotation mark/prime)   5'
<p>H<sub>2</sub>O</p>                            H2O
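One way to normalize while keeping a mapping back to the original input is to record, for every normalized character, the index of the original character it came from. The sketch below handles only the three cases in the table; the `REPLACEMENTS` table, the crude `TAG` pattern and the function names are assumptions, not LeadMine's implementation.

```python
import re

# Character-level rewrites taken from the examples in the table.
REPLACEMENTS = {"œ": "oe", "`": "'", "\u2019": "'", "\u2032": "'"}

# Crude tag pattern: anything between "<" and ">" is skipped.
TAG = re.compile(r"<[^>]+>")

def normalize(text):
    """Return (normalized string, original index of each normalized character)."""
    out, offsets = [], []
    i = 0
    while i < len(text):
        tag = TAG.match(text, i)
        if tag:
            i = tag.end()  # drop the tag entirely
            continue
        for ch in REPLACEMENTS.get(text[i], text[i]):
            out.append(ch)
            offsets.append(i)
        i += 1
    return "".join(out), offsets

def original_span(offsets, start, end):
    """Map a [start, end) span in the normalized string back to the input."""
    return offsets[start], offsets[end - 1] + 1

raw = "<p>H<sub>2</sub>O</p>"
norm, offsets = normalize(raw)
start, end = original_span(offsets, 0, len(norm))
print(norm, raw[start:end])
```

An entity found in the normalized string can then be reported against the original text, including any markup that falls inside its span.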

Input                  Found entities             After extension/merging
α-Santalol             Santalol                   α-Santalol
Allura Red AC dye      Allura Red AC dye          Allura Red AC
Glycine ester          Glycine AND ester          Glycine ester
Hexane-benzene         Hexane AND benzene         Hexane AND benzene
Genistein isoflavone   Genistein AND isoflavone   Genistein AND isoflavone

[Workflow figure. Legend: blue = grammars; green = traditional dictionaries; orange = blocking dictionaries. Other labels: Optional, Technology.]
