Biomedical Named Entity Recognition

Biomedical Informatics: Data ➜ Information ➜ Knowledge

BMI

Biomedical Named Entity Recognition

Ramakanth Kavuluru

NLP Seminar – 8/21/2012

What are named entities?

• The benefits of taking cholesterol lowering statin drugs outweigh the risks even among people who are likely to develop diabetes.

• Acute exposure to resveratrol inhibits AMPK activity in human skeletal muscle cells

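Entity types for the highlighted mentions:

• cholesterol: Biologically Active Substance

• statin: Drug

• diabetes: Disorder

• resveratrol: Organic Chemical

• AMPK: Enzyme

• skeletal muscle cells: Cell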

Cholesterol lowering drugs

Drug

Biological Function

Why do we need to extract them?

• To provide effective semantic search

– Find all discharge summaries of patients that have a history of diabetes and obesity and have taken statins as part of their treatment. (Clinical trial recruitment)

– Find all biomedical articles that discuss the dopamine neurotransmitter in the context of depressive disorders. (Literature review)
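
A minimal sketch of why extracted entities make such queries possible, assuming every document has already been tagged with a set of normalized concept names by an NER step. The documents, concept labels, and the `search` helper are hypothetical and only illustrate the idea.

```python
# Hypothetical discharge summaries, each tagged with concepts found by NER.
annotated_docs = {
    "summary_1": {"diabetes", "obesity", "statins"},
    "summary_2": {"diabetes", "hypertension"},
    "summary_3": {"obesity", "diabetes", "statins", "asthma"},
}


def search(docs, required_concepts):
    """Return ids of documents whose concept set contains every required concept."""
    required = set(required_concepts)
    return sorted(doc_id for doc_id, concepts in docs.items() if required <= concepts)


# "History of diabetes and obesity, and treated with statins"
matches = search(annotated_docs, ["diabetes", "obesity", "statins"])
print(matches)
```

Because the query runs over concepts rather than raw words, a summary that mentions a synonym would still match, provided the NER step normalized it to the same concept.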

Why do we need to extract them?

• To use as features in machine learning for effective text classification

• To build semantic clusters of textual documents to understand evolving themes

• To reduce noise by avoiding keywords that are not indicative of the classes or clusters

• Recently, as a first step in relation extraction and hence in knowledge discovery

A major task in text mining

• Extract information from textual data

• Use this information to solve problems

• What type of information?

– Relevant concepts: a medical condition or finding, a drug, a gene or protein, an emotion (hope, love, …)

– Relevant (binary) relations: drug TREATS a condition, protein CAUSES a disease

• What are the typical questions?

– Does a pathology report indicate a reportable case?

– Which patients satisfy the criteria for a clinical trial?

Knowledge Discovery

• VIP Peptide – increases – Catecholamine Biosynthesis (in cattle)

• Catecholamines – induce – β-adrenergic receptor activity (in rats)

• β-adrenergic receptors – are involved in – fear conditioning (in humans)

Hypothesis: VIP Peptide – affects – fear conditioning?
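
The chaining of relations above can be sketched as a path search over extracted triples. This is only an illustration: the `find_chain` helper is hypothetical, and merging "Catecholamine Biosynthesis" with "Catecholamines" (and receptor activity with the receptors) into single nodes is a simplifying assumption that a real system would have to justify through entity normalization.

```python
from collections import defaultdict

# Relations from the slide as (subject, predicate, object) triples.
# Node names are normalized by hand for this illustration.
triples = [
    ("VIP Peptide", "increases", "Catecholamines"),
    ("Catecholamines", "induce", "β-adrenergic receptors"),
    ("β-adrenergic receptors", "are involved in", "fear conditioning"),
]


def find_chain(triples, start, goal):
    """Depth-first search for a chain of relations linking start to goal."""
    graph = defaultdict(list)
    for subj, pred, obj in triples:
        graph[subj].append((pred, obj))
    stack = [(start, [])]
    visited = set()
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for pred, obj in graph[node]:
            stack.append((obj, path + [(node, pred, obj)]))
    return None  # no chain connects the two entities


chain = find_chain(triples, "VIP Peptide", "fear conditioning")
for subj, pred, obj in chain:
    print(f"{subj} - {pred} - {obj}")
```

A chain found this way is only a candidate hypothesis; the individual relations may come from different studies and settings, so it still needs expert review.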

Clinical NER

Concept types:

• Disorder/Symptom

• Medication

• Procedures

Attributes:

Present/historical/absent, Acute? Uncertain?

Present/historical/future

Why is NER Hard?

Linguistic Variation

• Derivational variation: cranial, cranium

• Inflectional variation: coughed, coughing

• Synonymy

– neurofibromin 2, merlin, NF2 protein, and schwannomin

– Addison’s disease, adrenal insufficiency, hypocortisolism, bronzed disease

– Feeding problems in newborn – The mother said she was having trouble feeding the baby.

Polysemy

• Merlin – both a bird and a protein in UMLS

• Discharge

– Patient was prescribed codeine upon discharge

– The discharge was yellow and purulent

• Abbreviations

– APC: activated protein C, adenomatosis polyposis coli, antigen presenting cell, aerobic plate count, advanced pancreatic cancer, age period cohort, antibody producing cells, atrial premature complex
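
One simple way to approach such ambiguity is to compare the words around the term with cue words for each sense. The sketch below applies that idea to the two "discharge" sentences; the cue-word lists and the `disambiguate` helper are assumptions made up for this illustration, not part of any tool mentioned in these slides.

```python
import re

# Hand-written cue words per sense; a real system would derive these from a
# sense inventory or from annotated text.
SENSE_CUES = {
    "release of the patient": {"prescribed", "upon", "home", "instructions", "patient"},
    "bodily secretion": {"yellow", "purulent", "wound", "drainage", "nasal"},
}


def disambiguate(sentence, sense_cues=SENSE_CUES):
    """Pick the sense whose cue words overlap most with the sentence."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    return max(sense_cues, key=lambda sense: len(tokens & sense_cues[sense]))


print(disambiguate("Patient was prescribed codeine upon discharge"))
print(disambiguate("The discharge was yellow and purulent"))
```

The same overlap idea extends to abbreviations such as APC, with one cue set per expansion, although short clinical sentences often provide too little context for it to work reliably.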

Negation

• Nearly half of all clinical concepts in dictated narratives are negated

– There is no maxillary sinus tenderness

• Implied absence without negation

– Lungs are clear upon auscultation

So,

– Rales: Absent

– Rhonchi: Absent

– Wheezing: Absent
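
Explicit negation can be approximated with a trigger-word heuristic: look for a negation cue shortly before the concept mention. The sketch below is a toy version of that idea; the trigger list, the window size, and the `is_negated` helper are assumptions for illustration.

```python
import re

# A few negation triggers; this short list is an assumption for illustration.
NEGATION_TRIGGERS = ("no", "not", "without", "denies")


def is_negated(sentence, concept, window=5):
    """Return True if a trigger occurs within `window` tokens before the concept."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    concept_tokens = re.findall(r"[a-z]+", concept.lower())
    n = len(concept_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == concept_tokens:
            preceding = tokens[max(0, i - window):i]
            return any(tok in NEGATION_TRIGGERS for tok in preceding)
    return False  # concept is not mentioned in the sentence at all


print(is_negated("There is no maxillary sinus tenderness", "maxillary sinus tenderness"))
print(is_negated("Lungs are clear upon auscultation", "rales"))
```

The second call returns False because rales are never mentioned, which is exactly the implied-absence case: a trigger-based rule cannot infer that clear lungs mean no rales, rhonchi, or wheezing.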

Controlled Terminologies

Controlled vocabularies or taxonomies

– Gene Ontology (gene products)

• Most cited, 450 per year in PubMed

• Total of 33,000+ terms

– SNOMED CT (300K+ concepts)

– NCI Thesaurus, ICD-9/10, ICD-O-3, LOINC, MedlinePlus

– UMLS Metathesaurus (integration of 140+ vocabularies)

• 2.3 million concepts

Semantic Types and Relations

• NLM Semantic Network, the type system behind the UMLS Metathesaurus

– Semantic Types (135)

• Semantic Groups (15)

– Semantic Relations (54)

• SPECIALIST Lexicon

– Malaria, malarial

– Hyperplasia, hyperplastic

How do we extract named entities?

MetaMap from NLM

Identify phrases: Use SPECIALIST parser

Map to CUIs: Use SPECIALIST Lexicon, Metathesaurus and Semantic Network

Output of syntactic analysis

• Syntactic analysis of “ocular complications of myasthenia gravis”

– Ocular (adj), complications (noun), of (prep), myasthenia (noun), gravis (noun)

– Gives noun phrases (NP): “Ocular complications” and “Myasthenia gravis”

– Prepositions are ignored

– In a given NP, you have a head and modifiers:

• Ocular (mod) and complications (head)

• How about “male pattern baldness”?
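
The head/modifier split can be sketched with a common heuristic for English noun phrases: take the last noun as the head and everything else as modifiers. This is an assumption for illustration and not necessarily how the SPECIALIST parser decides; the `split_noun_phrase` helper and the tag names are hypothetical.

```python
def split_noun_phrase(tagged_np):
    """Split a POS-tagged noun phrase into (modifiers, head).

    Assumes the head is the last noun in the phrase.
    """
    noun_positions = [i for i, (_, tag) in enumerate(tagged_np) if tag == "noun"]
    if not noun_positions:
        raise ValueError("noun phrase contains no noun")
    head_index = noun_positions[-1]
    head = tagged_np[head_index][0]
    modifiers = [word for i, (word, _) in enumerate(tagged_np) if i != head_index]
    return modifiers, head


print(split_noun_phrase([("ocular", "adj"), ("complications", "noun")]))
print(split_noun_phrase([("male", "noun"), ("pattern", "noun"), ("baldness", "noun")]))
```

Under this heuristic, "male pattern baldness" gets "baldness" as its head with "male" and "pattern" as modifiers, even though all three words can be tagged as nouns.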

Variant Generation

Candidate identification

• Look for all variants in Metathesaurus strings and identify those candidate concepts (CUIs) that contain at least one variant as a substring

• Example: For “ocular complication”, obtain all Metathesaurus strings that contain any of the following as substrings

– Optic complication

– Eyes complication

– Ophthalmic complicated

– …
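
The substring lookup described above can be sketched over a toy dictionary. The identifiers and strings below are made up and are not real UMLS CUIs or Metathesaurus entries; `find_candidates` is a hypothetical helper, and a real implementation would use an index rather than a linear scan.

```python
# Toy stand-in for Metathesaurus strings; ids and strings are invented.
toy_metathesaurus = {
    "TOY001": "Ocular complication of diabetes",
    "TOY002": "Eyes complication following surgery",
    "TOY003": "Cardiac complication",
    "TOY004": "Ocular hypertension",
}


def find_candidates(variants, concept_strings):
    """Return ids whose string contains at least one variant as a substring."""
    lowered = [v.lower() for v in variants]
    return sorted(
        cui
        for cui, text in concept_strings.items()
        if any(v in text.lower() for v in lowered)
    )


variants = [
    "ocular complication",
    "optic complication",
    "eyes complication",
    "ophthalmic complicated",
]
candidates = find_candidates(variants, toy_metathesaurus)
print(candidates)
```

The lookup deliberately over-generates: it only gathers candidates, and choosing among them is left to the ranking step.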

Mapping and Evaluation

• So now we have a bunch of candidate CUIs based on the presence of variants of the given phrase in Metathesaurus strings. How do we select the best candidate?

• Use several measures to compute a rank

– Centrality (involvement of head)

– Variation (average of inverse distance scores)

– Coverage

– Cohesiveness

Final Score
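
A minimal sketch of how the four measures might be combined into one rank. It assumes each measure is already scaled to the range 0 to 1, and that the final score is a weighted average in which coverage and cohesiveness count twice as much as centrality and variation, scaled to 0 to 1000. That weighting follows published descriptions of MetaMap as best recalled here and should be checked against the references; the `final_score` helper is hypothetical.

```python
def final_score(centrality, variation, coverage, cohesiveness):
    """Combine four measures in [0, 1] into a single score in [0, 1000].

    Assumed weighting: coverage and cohesiveness weigh twice as much as
    centrality and variation.
    """
    measures = (centrality, variation, coverage, cohesiveness)
    if not all(0.0 <= m <= 1.0 for m in measures):
        raise ValueError("each measure must lie between 0 and 1")
    weighted = centrality + variation + 2 * coverage + 2 * cohesiveness
    return round(1000 * weighted / 6)


# A candidate that involves the head and covers the phrase well,
# but matches only through variants and is not fully cohesive.
print(final_score(centrality=1.0, variation=0.5, coverage=1.0, cohesiveness=0.5))
```

Candidates are then sorted by this score, and the highest-scoring ones are offered as the mapping for the phrase.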

MetaMap Options

• Types of variants: include or exclude derivational variants

• Word sense disambiguation

– Discharge (bodily secretion vs. release of the patient)

• Concept gaps

– “Obstructive apnea” mapping to “obstructive sleep apnea” or “obstructive neonatal apnea”

• Term processing

– Process the input string as a single concept, that is, don’t split it into noun phrases

Output options

• Human readable format

• XML format

• Restrictions based on certain vocabularies: consider only ICD-9

• Restrictions based on certain types: consider only pharmacological substances (i.e., drugs)

DEMO TIME: Daniel Harris

References

• An Overview of MetaMap: Historical Perspective and Recent Advances, Alan Aronson and Francois Lang

• Effective Mapping of Biomedical Text to the UMLS Metathesaurus: The MetaMap Program, Alan Aronson

• Comparison of LVG and MetaMap Functionality, Alan Aronson

• Lexical, Terminological, and Ontological Resources for Biological Text Mining, Olivier Bodenreider
