Amit Satsangi [email protected]

Faculty of Computer Science

CMPUT 605, December 06, 2007 / March 31, 2008

© 2006

Towards Applying Text Mining and Natural Language Processing for Biomedical Ontology Acquisition

Inniss T., Light M., Thomas G., Lee J., Grassi M., Williams A. TMBIO (2006)

Department of Computing Science

Focus

Ontology for describing age-related macular degeneration (AMD)

Comparison of the accuracy of three methods for ontology acquisition:

– Natural Language Processing (NLP)

– Text Mining (SAS Text Miner)

– Human Expert

Manual and ad hoc knowledge acquisition

IDOCS (Intelligent Distributed Ontology Consensus System)


Introduction

No common, standardized vocabulary exists for classifying disease types for certain eye diseases

Clinicians, dispersed geographically, may use different terms to describe the same condition

Research aimed at extracting the feature and attribute descriptions for the vocabulary of AMD and building an ontology from them


Related Work

Much research since the 1990s on applying NLP techniques in medicine, biomedicine, etc.

NLP & Text Data Mining are recognized as playing an important role in this endeavor

Research focused on online repositories such as Medline & PubMed

NLP systems developed: MedLEE, UMLS, GENIES, etc.


IDOCS


Methodology

Four clinical experts in retinal diseases enlisted to view 100 sample eye images of AMD

Experts in different geographic locations

Described the observations using digital voice recorders – no artificially imposed vocabulary constraints

Another retinal expert manually parsed the transcribed text – extracting keywords, organizing keywords into categories, etc.


Results: Human Experts


Methodology: NLP

NLP: Used for information extraction and automatic summarization.

Identify collocations: short sequences of words whose meaning goes beyond what can be composed directly from their parts, e.g. “extreme programming”

Ngram Statistics Package (NSP) used for collocation discovery over bigrams

Word-pair associations measured by pointwise mutual information (PMI)


Methodology: NLP

A larger PMI indicates a stronger association between the words
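The PMI score for a bigram can be sketched in a few lines. This is a plain Python illustration, not the NSP implementation: it assumes the usual definition PMI(w1, w2) = log2(P(w1, w2) / (P(w1) P(w2))), estimated from bigram counts, and the function name `bigram_pmi` and the sample tokens are made up for this example.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(w1, w2): PMI in bits} for every adjacent word pair in tokens."""
    bigrams = list(zip(tokens, tokens[1:]))
    total = len(bigrams)
    pair_counts = Counter(bigrams)
    first_counts = Counter(w1 for w1, _ in bigrams)
    second_counts = Counter(w2 for _, w2 in bigrams)

    scores = {}
    for (w1, w2), joint in pair_counts.items():
        # Expected count of the pair if w1 and w2 occurred independently.
        expected = first_counts[w1] * second_counts[w2] / total
        scores[(w1, w2)] = math.log2(joint / expected)
    return scores


# Hypothetical transcript fragment, invented for illustration only.
tokens = "soft drusen near the macula and soft drusen near the fovea".split()
for pair, score in sorted(bigram_pmi(tokens).items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

PMI tends to reward rare pairs, so in practice a minimum pair count is often applied before ranking.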


Results: NLP


Methodology: Text Mining (SAS Text Miner)

Collection of documents (corpus) used as input to any text mining algorithm

Corpus broken into tokens or terms (tokens in a particular language)

Term weighting measures: Entropy, Inverse Document Frequency (IDF), Global Frequency-IDF (GF-IDF), None (global weight of 1) and Normal
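To make the weighting measures concrete, here is a minimal sketch of three of them. The slides do not give the formulas, so the definitions below are assumptions based on commonly cited formulations (IDF = log2(n / d) + 1, GF-IDF = g / d, entropy = 1 + Σ p log2 p / log2 n, where n is the number of documents, d the number of documents containing the term, g the term's total frequency and p its share of g in each document). SAS Text Miner's exact definitions may differ; the function name `term_weights` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def term_weights(documents):
    """Compute assumed IDF, GF-IDF and entropy weights for each term.

    documents is a list of token lists, one list per document.
    """
    n_docs = len(documents)
    per_doc = [Counter(doc) for doc in documents]
    vocabulary = set().union(*documents)

    weights = {}
    for term in vocabulary:
        # Frequency of the term in each document that contains it.
        freqs = [counts[term] for counts in per_doc if term in counts]
        global_freq = sum(freqs)
        doc_freq = len(freqs)

        idf = math.log2(n_docs / doc_freq) + 1
        gf_idf = global_freq / doc_freq
        if n_docs > 1:
            spread = sum((f / global_freq) * math.log2(f / global_freq) for f in freqs)
            entropy = 1 + spread / math.log2(n_docs)
        else:
            entropy = 1.0  # a single document gives no spread to measure
        weights[term] = {"idf": idf, "gf_idf": gf_idf, "entropy": entropy}
    return weights


# Hypothetical mini-corpus, invented for illustration only.
docs = [["drusen", "soft", "drusen"], ["drusen", "hemorrhage"], ["scar"]]
for term, w in sorted(term_weights(docs).items()):
    print(term, {name: round(value, 3) for name, value in w.items()})
```

Under these assumed definitions, a term concentrated in one document gets an entropy weight of 1, while a term spread evenly over all documents gets 0.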


Results: Text Miner

Frequency weight: None

Term weight: Normal


Common Terms


Comparison

Thus text mining is a viable and effective method for determining vocabulary to describe a particular disease

Text mining found many of the terms that NLP found

The human expert remains the best ground truth
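One simple way to quantify such a comparison is to treat the expert's term list as the reference and measure the overlap with an automated method's list. The sketch below is only an illustration of that idea, not the evaluation used in the paper; the function name `compare_to_reference` and the term lists are hypothetical.

```python
def compare_to_reference(candidate_terms, reference_terms):
    """Compare a method's terms against a reference list (e.g. the expert's)."""
    candidate = {term.strip().lower() for term in candidate_terms}
    reference = {term.strip().lower() for term in reference_terms}
    shared = candidate & reference

    return {
        "shared": shared,                 # found by both
        "missed": reference - candidate,  # in the reference only
        "extra": candidate - reference,   # found by the method only
        "precision": len(shared) / len(candidate) if candidate else 0.0,
        "recall": len(shared) / len(reference) if reference else 0.0,
    }


# Hypothetical term lists, invented for illustration only.
expert_terms = ["drusen", "hemorrhage", "scar", "pigment"]
mined_terms = ["Drusen", "scar", "fluid"]
print(compare_to_reference(mined_terms, expert_terms))
```

The "extra" set is worth reviewing rather than discarding, since it may contain descriptors that the reference list overlooked.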


Ontology Generation


Conclusion and Future Work

Human experts are the best, but they did miss some key descriptors

Text mining and NLP can enhance the generation of feature descriptions by catching descriptors that the experts miss

As a consequence, a more robust vocabulary can be generated

Extension – evaluate the effectiveness of the automated tools, text mining & NLP

Different weighting schemes to be tried in the future


Thank You For Your Attention!
