Lawrence Hunter & K. Bretonnel Cohen Center for Computational Pharmacology UCHSC School of Medicine http://compbio.uchsc.edu [email protected] Using ontologies for text processing


Lawrence Hunter & K. Bretonnel Cohen
Center for Computational Pharmacology
UCHSC School of Medicine

http://compbio.uchsc.edu
[email protected]

Using ontologies for text processing

Overview

Thesis: Ontologies (or even more elaborated knowledge-bases) are required to solve the lexical ambiguity problem

Describe the lexical ambiguity problem and its central importance in natural language processing

Demonstrate how GO, combined with Direct Memory Access Parsing, provides a simple solution to some instances of this problem

Argue no alternative is likely to work as well

Lexical Ambiguity

A word (character string) means different things in different contexts – How can a program disambiguate (tell which is meant)?

Widespread problem even in “simple” bioNLP
– DNA vs. mRNA vs. protein [Hatzivassiloglou et al. 2001]
– Gene symbol vs. non-gene acronym [Pustejovsky et al. 2001], [Chang et al. 2002], [Liu and Friedman 2003], [Schwartz and Hearst 2003]
– Gene/product vs. any other noun [Tanabe and Wilbur, 2002]

A particular example

“Hunk” can be a
– Cell type: human natural killer
– Gene: hormonally upregulated Neu-associated kinase
– Medical abbreviation: radiographic/orthopedic joint classification system
– Non-technical English: a large lump, piece, or portion

All occur in Medline documents (e.g. “hunk of metal” in an article on ambulance design)

How do ontologies help?

The idea that knowledge is relevant to understanding words in context is controversial only among linguists, but…

The Direct Memory Access Parsing (DMAP) technique [Martin, 1991] [Fitzgerald, 2000] demonstrates the power of knowledge-based methods for disambiguation

GO & similar efforts make DMAP (or other knowledge-based methods) practical today

What is DMAP?

Conceptual parser
– Maps from text to conceptual representations organized in packaging and abstraction hierarchies (like GO)
– In contrast to: pure syntactic parsers, pattern matching and machine learning systems

Conceptual representations include lexical patterns that specify how to recognize the concept in text
– Patterns consist of text literals and/or references to other concepts
– Organized around concepts, not words; no independent lexicon

Recognition creates expectations for related concepts

A real example

ID: cell-type-HUNK
IS-A: cell-type
lex: human natural killer
     HUNK

RESULTS

ID: gene-26559
IS-A: gene
lex: hormonally upregulated Neu-associated kinase
     HUNK
     hormonally upregulated neu tumor-associated kinase

ID: GO-0006350
lex: transcription
     expression

ID: gene-expression
slots: expressed-item: gene
       mechanism: expression
lex: (gene) (expression)

“…Hunk expression is restricted to subsets of cells…” [Gardner et al. 2000]

(parse '(Hunk))
e-gene-26559 begin: 1 end: 1
e-cell-type-HUNK begin: 1 end: 1

(parse '(Hunk expression))
c-gene-expression-1 begin: 1 end: 2
  expressed-item: e-gene-26559 begin: 1 end: 1
  mechanism: GO:0006350 begin: 2 end: 2

DMAP output with and without context

Hunk alone: ambiguous

Hunk expression: not ambiguous
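The behaviour on this slide can be approximated with a small bottom-up matcher. This is only a sketch under stated assumptions, not the DMAP implementation: the concept IDs and lexical patterns come from the slide, while the dictionary layout, the `("slot", type)` notation for concept references, and the function names `is_a`, `match_from`, `parse`, and `readings` are hypothetical. Real DMAP is expectation-driven; this version simply re-scans until no new concept is recognized.

```python
# Toy knowledge base. IDs and patterns follow the slide; the layout is assumed.
KB = {
    "cell-type-HUNK": {"isa": "cell-type",
                       "lex": [["human", "natural", "killer"], ["hunk"]]},
    "gene-26559": {"isa": "gene",
                   "lex": [["hormonally", "upregulated", "neu-associated", "kinase"],
                           ["hunk"],
                           ["hormonally", "upregulated", "neu",
                            "tumor-associated", "kinase"]]},
    "GO-0006350": {"isa": None,
                   "lex": [["transcription"], ["expression"]]},
    # A pattern element that is a tuple refers to another concept by type.
    "gene-expression": {"isa": None,
                        "lex": [[("expressed-item", "gene"),
                                 ("mechanism", "GO-0006350")]]},
}


def is_a(cid, target):
    """Walk the IS-A chain upward from cid looking for target."""
    while cid is not None:
        if cid == target:
            return True
        cid = KB.get(cid, {}).get("isa")
    return False


def match_from(pattern, toks, pos, found):
    """Yield (end, slots) for each way pattern matches starting at 0-based pos."""
    if not pattern:
        yield pos, ()
        return
    head, rest = pattern[0], pattern[1:]
    if isinstance(head, str):  # text literal
        if pos < len(toks) and toks[pos] == head:
            yield from match_from(rest, toks, pos + 1, found)
    else:  # reference to an already recognized concept
        slot, wanted = head
        for cid, begin, end, _ in list(found):
            if begin == pos + 1 and is_a(cid, wanted):
                for stop, slots in match_from(rest, toks, end, found):
                    yield stop, ((slot, cid),) + slots


def parse(tokens):
    """Return all recognized (concept, begin, end, slots), 1-based inclusive."""
    toks = [t.lower() for t in tokens]
    found, changed = set(), True
    while changed:  # repeat until no new concept is recognized
        changed = False
        for cid, frame in KB.items():
            for pattern in frame["lex"]:
                for start in range(len(toks)):
                    for stop, slots in list(match_from(pattern, toks, start, found)):
                        item = (cid, start + 1, stop, slots)
                        if item not in found:
                            found.add(item)
                            changed = True
    return found


def readings(tokens):
    """Keep only the interpretations that span the whole input."""
    return sorted(r for r in parse(tokens) if r[1] == 1 and r[2] == len(tokens))


print(readings(["Hunk"]))                # two competing readings
print(readings(["Hunk", "expression"]))  # a single gene-expression reading
```

In this sketch the cell-type reading of "Hunk" drops out of the two-word input because the `gene-expression` pattern only accepts a filler that is a `gene`.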

DMAP can handle much more complex constructions

“Hunk is expressed in mouse epithelial cells during cell proliferation.”

c-localized-gene-expression

expressed-item: e-gene-26559

mechanism: GO:0006350

where: c-epithelial-cell

taxon: ncbi_10090

when: GO:0008283

But this uses our enriched knowledge base, not just GO

Even just DMAP/GO is a big win

Recall 7,042 ambiguous symbols for 9,723 genes

Straightforward to disambiguate symbols that map to 2 or more genes when:
– Each ambiguous gene referent has GO annotations, and
– There is no overlap between the annotations for the genes

3,333 of the symbols (for 4,715 of the genes) have this feature – nearly half the problem is solved!
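The two conditions above amount to a set-disjointness test over GO annotations. The following is a minimal sketch, assuming annotations are available as Python sets of GO IDs; the gene names `gene-A` and `gene-B`, their annotations, and the function names `separable` and `resolve` are hypothetical and chosen only for illustration. The final lines check the slide's arithmetic: 3,333 of 7,042 symbols and 4,715 of 9,723 genes are both a little under half.

```python
from itertools import combinations


def separable(candidates):
    """True if every candidate gene is annotated and no two share a GO term."""
    if any(not terms for terms in candidates.values()):
        return False
    return all(a.isdisjoint(b) for a, b in combinations(candidates.values(), 2))


def resolve(candidates, context_terms):
    """Pick the one gene whose annotations intersect the GO terms seen in context."""
    hits = [gene for gene, terms in candidates.items() if terms & context_terms]
    return hits[0] if len(hits) == 1 else None


# Hypothetical ambiguous symbol with two gene referents (made-up annotations).
candidates = {
    "gene-A": {"GO:0006350"},
    "gene-B": {"GO:0016477", "GO:0008283"},
}

print(separable(candidates))              # annotations do not overlap
print(resolve(candidates, {"GO:0016477"}))  # context mentions cell migration

# Fractions quoted on the slide.
print(round(3333 / 7042, 3), round(4715 / 9723, 3))
```

When the annotation sets overlap, or a candidate has no annotations at all, `separable` returns `False` and the symbol falls outside the easily solved half.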

Compare the alternatives

Statistical or machine learning approaches
– Must avoid being fooled by the word “cells” in the example
– Scalability: need statistics for many covariates of every ambiguous word; doesn’t exploit the abstraction hierarchy

Full syntactic parse doesn’t disambiguate at all!

Cascaded FSTs, pattern-matching, etc.
– Where is the source of knowledge for these?
– Much DMAP lexical information can be taken directly from GO (and LocusLink, etc.)

Acknowledgments

Philip V. Ogren

Daniel J. McGoldrick

Christoffer S. Crosby

Jens Eberlein

George K. Acquaah-Mensah

I/NET’s (http://inetmi.com) CM / CMP software

Support from Wyeth Genetics Institute, NIAAA

http://compbio.uchsc.edu

Biognosticopoea representation of the hunk gene

Attachment ambiguity

Attachment ambiguity
– These findings suggest that FAK functions in the regulation of cell migration and cell proliferation. (Gilmore and Romer 1996:1209)
– What does FAK do?
  • ALMOST RIGHT:
    • FAK functions in the regulation of cell migration
    • FAK functions in cell proliferation
  • RIGHT:
    • FAK functions in the regulation of cell migration
    • FAK functions in the regulation of cell proliferation

Attachment ambiguity

GO-0016477 isA go-process
lex: cell migration

GO-0008283 isA go-process
lex: cell proliferation

GO-0042127 isA go-process
lex: regulation of cell proliferation
     regulation of ((go-process) and)* cell proliferation

GO-0030334
lex: regulation of cell migration
     regulation of ((go-process) and)* cell migration

Attachment ambiguity

(parse '(These findings suggest that FAK functions in the regulation of cell migration and cell proliferation))

GO:0030334 begin: 9 end: 12

GO:0042127 begin: 9 end: 15
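The coordination pattern `regulation of ((go-process) and)* <target>` can be imitated with an ordinary regular expression. This is a sketch under assumptions rather than how DMAP matches patterns: the two lookup tables reuse the GO IDs and phrases from the slide, and the function name `find_regulation` and the regex translation are hypothetical. Token positions are 1-based and inclusive, as in the parse output.

```python
import re

# Plain processes that may fill the (go-process) reference in the pattern.
GO_PROCESS = {"GO:0016477": "cell migration", "GO:0008283": "cell proliferation"}
# Regulation concepts and the process phrase each one ends with.
REGULATION = {"GO:0030334": "cell migration", "GO:0042127": "cell proliferation"}


def find_regulation(tokens):
    """Return {GO id: (begin, end)} for each regulation concept found."""
    text = " ".join(tokens)  # assumes tokens contain no spaces
    any_process = "|".join(re.escape(p) for p in GO_PROCESS.values())
    results = {}
    for go_id, target in REGULATION.items():
        # Zero or more "<process> and " groups may precede the target phrase.
        pattern = rf"\bregulation of (?:(?:{any_process}) and )*{re.escape(target)}\b"
        for m in re.finditer(pattern, text):
            begin = text[:m.start()].count(" ") + 1
            end = begin + m.group().count(" ")
            results[go_id] = (begin, end)
    return results


sentence = ("These findings suggest that FAK functions in the regulation "
            "of cell migration and cell proliferation")
print(find_regulation(sentence.split()))
# {'GO:0030334': (9, 12), 'GO:0042127': (9, 15)}
```

Because the second match spans tokens 9 to 15, "cell proliferation" is attached to "regulation of", which corresponds to the RIGHT reading rather than the ALMOST RIGHT one.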

What do we have so far?

Gene Ontology

UMLS

MeSH

What more do we need?

Family

Location
– Macroanatomical
– Subcellular localization

Structure

Function
– Disease associations
– Protein/protein interactions
– …

Where can we get it?

GO definitions

UMLS definitions

MeSH notes

Biomedical literature

If you don’t like DMAP….

full syntactic parse first

cascaded FSTs

“a little syntax, a little semantics”

machine learning

pattern-matching

All can benefit from ontology/KB