A.I. in health informatics lectures 9&10 natural language processing and biomedical texts kevin small & byron wallace

today

•  natural language processing (NLP) – applications – techniques

•  clinical and biomedical language

natural language

•  humans prefer – scientific literature – technical reports – administrative reports – patient charts – spoken language transcriptions

structured data

•  computers prefer – measurements in a spreadsheet – predefined lists (e.g., diseases, genes) – patient data – billing/administrative information

natural language processing (NLP)

•  narrative data vs. structured data – narrative data inefficient to (re)process – narrative data rife with ambiguity and variance in expression – structured data not always sufficient

•  NLP abstracts narrative information into a structured form

•  NLP is hard

NLP goals

•  time – PubMed has ~20M documents – patient documents

•  consistency / objectivity –  rules can be updated by experts –  classifiers generalize to new data

•  cost –  labor with specialized training

information extraction (IE)

•  locates important structures – named entities – relations

named entities: Myf-5, Pax-3, Troponin, Myod, Polymerase, Wnt, Pax-7, Shh

Dole's wife, Elizabeth, is a native of Salisbury, N.C.

entity labels: person, person, location, location

relation labels: spouse_of, born_in, born_in, located_in
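The output of information extraction on the example sentence can be held in a small data structure. The sketch below is illustrative only: the slide lists the entity and relation labels but not which entities each relation connects, so the argument assignments here are an assumed reading, and the class names `Entity` and `Relation` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str
    label: str


@dataclass(frozen=True)
class Relation:
    name: str
    head: Entity
    tail: Entity


# Entities and their labels, in the order they appear in the sentence.
dole = Entity("Dole", "person")
elizabeth = Entity("Elizabeth", "person")
salisbury = Entity("Salisbury", "location")
nc = Entity("N.C.", "location")

# ASSUMPTION: which entities each relation links is inferred, not given.
relations = [
    Relation("spouse_of", dole, elizabeth),
    Relation("born_in", elizabeth, salisbury),
    Relation("born_in", elizabeth, nc),
    Relation("located_in", salisbury, nc),
]


def relations_about(entity, rels):
    """Return every relation in which the entity is either argument."""
    return [r for r in rels if entity in (r.head, r.tail)]


for r in relations:
    print(f"{r.name}({r.head.text}, {r.tail.text})")
```

Once the narrative sentence is in this form, questions such as "what do we know about Elizabeth?" become a simple filter (`relations_about(elizabeth, relations)`).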

information retrieval (IR)

•  access documents in large corpora

•  satisfy information need of query

•  term indexing

•  phrase/entity/relation indexing
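Term indexing is commonly implemented as an inverted index that maps each term to the documents containing it. The following is a minimal sketch, not the method of any particular retrieval system; the three toy documents reuse phrases from these slides, and the function names `build_index` and `search` are hypothetical.

```python
import re
from collections import defaultdict


def terms_of(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in terms_of(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every query term."""
    terms = terms_of(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Toy corpus (hypothetical documents).
docs = {
    1: "severe chest pain",
    2: "X-ray shows patches in lung",
    3: "pain in lower extremities",
}
index = build_index(docs)
print(search(index, "chest pain"))
print(search(index, "pain"))
```

Phrase, entity, or relation indexing follows the same pattern, with the index keys being extracted phrases, entities, or relations rather than single terms.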

machine translation

•  converts text in one language to another language

•  study enrollment

•  scientific literature

Morgen fliege ich nach Kanada zur Konferenz

Tomorrow I will fly to the conference in Canada

et cetera

•  text generation

•  summarization

•  automatic editing

•  user interfaces

•  speech transcription

•  use your imagination…

levels of knowledge

•  morphology – morphemes generate words

•  lexicography – global properties of words

•  syntax –  structure of phrases and sentences

•  semantics –  interpretation of linguistic structures

•  pragmatics – understanding discourse

NLP requirements

•  specifying representation

•  method of acquiring knowledge to generate representation

•  algorithms to support applications based on specified representation

NLP techniques

•  symbolic/logical methods –  finite state machines –  context-free grammars (CFGs) –  informative, brittle

•  statistical methods – Markov models – probabilistic context-free grammars (PCFGs) – often less interpretable, robust – discriminative methods

NLP paradigms

•  parsing – linguistic analysis from latently structured information to explicit structure

•  generation – use linguistic/statistical models to generate natural language

•  extraction –  extract relevant information –  doesn’t necessarily require full analysis

morphology

•  morphemes (roots, prefixes, suffixes) are used to generate words –  free vs. bound –  inflectional (e.g. bigg-er) – derivational (e.g. judg-ment)

•  biomedical data morphologically richer than general English – hydr-oxy-nitro-di-hydro-thym-ine – hepatico-cholangio-jejuno-stom-y

tokenization

•  parsing string into tokens – sentence splitting often first step –  includes words, numbers, symbols

q.i.d.      four times a day
M03F4.2A    gene name
(w)adh-2    biological named entity

tokenization

•  regular expressions

•  Markov model

•  hybrid systems

[a-z]+('s)?|[0-9]+|[.]

5 mg. given.

p(5*mg.*given*.)  versus  p(5*mg*.*given*.)

         5     mg    mg.   given  .
5        0.1   0.8   0.9   0.4    0.6
mg       0.3   0.1   0.1   0.9    0.4
mg.      0.3   0.1   0.1   0.9    0.2
given    0.7   0.6   0.6   0.2    0.7
.        0.6   0.4   0.4   0.8    0.1
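The two approaches on this slide can be compared on the example "5 mg. given." The sketch below first applies the regular expression, then scores the two candidate tokenizations with the table. It assumes each table row is the previous token and each column the next token (the slide does not say which way round it is), and it multiplies the table values as they stand; the names `regex_tokenize` and `sequence_score` are hypothetical.

```python
import re
from math import prod

# The slide's regular expression: words (optionally with 's), numbers, periods.
TOKEN_RE = re.compile(r"[a-z]+('s)?|[0-9]+|[.]")


def regex_tokenize(text):
    """Return every non-overlapping match of the token pattern."""
    return [m.group(0) for m in TOKEN_RE.finditer(text)]


# Table from the slide. ASSUMPTION: SCORE[previous][next].
STATES = ["5", "mg", "mg.", "given", "."]
ROWS = {
    "5": [0.1, 0.8, 0.9, 0.4, 0.6],
    "mg": [0.3, 0.1, 0.1, 0.9, 0.4],
    "mg.": [0.3, 0.1, 0.1, 0.9, 0.2],
    "given": [0.7, 0.6, 0.6, 0.2, 0.7],
    ".": [0.6, 0.4, 0.4, 0.8, 0.1],
}
SCORE = {prev: dict(zip(STATES, values)) for prev, values in ROWS.items()}


def sequence_score(tokens):
    """Multiply the table entries for each adjacent pair of tokens."""
    return prod(SCORE[a][b] for a, b in zip(tokens, tokens[1:]))


abbreviation = ["5", "mg.", "given", "."]   # "mg." kept as one token
split = ["5", "mg", ".", "given", "."]      # period split off from "mg"

print(regex_tokenize("5 mg. given."))
print(sequence_score(abbreviation), sequence_score(split))
```

The regular expression always splits the period off "mg", while under the stated assumption the table favours keeping "mg." together as an abbreviation. A hybrid system would use rules to propose candidates and a model like this to choose between them.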

lexicography

•  atomic elements of language – multi-word expressions (MWE): foreign phrases (ad hoc), prepositions (along with), idioms (follow up), clinical MWE (congestive heart failure)

•  parts of speech (POS) – inflectional morphemes: number, person, case

rule-based POS tagging

•  based on transformation rules

•  can be learned (TBL)

NN → VB if previous tag is TO
NN → JJ if following tag is NN

Before rule application                  After rule application
total/NN hip/NN replacement/NN           total/JJ hip/NN replacement/NN
a/DT total/NN of/IN four/NN units/NNS    no change
refused/VBD to/TO stay/NN                refused/VBD to/TO stay/VB
her/PP$ hospital/NN stay/NN              no change
allergy/NN to/IN penicillin/NN           no change
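The two transformation rules can be applied mechanically, as sketched below. One detail is an assumption: applied literally, the second rule would also retag hip/NN and hospital/NN, which the table lists as unchanged, so the sketch adds a small lexicon of tags each word is allowed to take (a common constraint in transformation-based taggers). The lexicon entries and the names `ALLOWED_TAGS` and `apply_rules` are hypothetical.

```python
# Each rule: (from_tag, to_tag, neighbour offset, required neighbour tag).
RULES = [
    ("NN", "VB", -1, "TO"),   # NN -> VB if previous tag is TO
    ("NN", "JJ", +1, "NN"),   # NN -> JJ if following tag is NN
]

# ASSUMPTION: words not listed here may only keep their current tag.
ALLOWED_TAGS = {
    "total": {"NN", "JJ"},
    "stay": {"NN", "VB"},
}


def apply_rules(tagged):
    """Apply each rule in order; a rule reads the tags as they were before it ran."""
    words = [w for w, _ in tagged]
    tags = [t for _, t in tagged]
    for from_tag, to_tag, offset, context_tag in RULES:
        before = list(tags)
        for i, word in enumerate(words):
            j = i + offset
            if before[i] != from_tag or not 0 <= j < len(before):
                continue
            if before[j] != context_tag:
                continue
            if to_tag in ALLOWED_TAGS.get(word, {from_tag}):
                tags[i] = to_tag
    return list(zip(words, tags))


def parse(text):
    """Turn 'word/TAG word/TAG' into a list of (word, tag) pairs."""
    return [tuple(token.rsplit("/", 1)) for token in text.split()]


def render(tagged):
    return " ".join(f"{word}/{tag}" for word, tag in tagged)


for example in [
    "total/NN hip/NN replacement/NN",
    "refused/VBD to/TO stay/NN",
    "her/PP$ hospital/NN stay/NN",
]:
    print(example, "=>", render(apply_rules(parse(example))))
```

In transformation-based learning (TBL), the rule list itself is learned from tagged data rather than written by hand; only the application step is shown here.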

Markov model POS tagging

•  can also be lexicalized (HMM)

       NN     VB     VBD    VBN    TO     IN
NN     0.34   0.00   0.22   0.02   0.01   0.40
VB     0.28   0.01   0.02   0.27   0.04   0.39
VBD    0.12   0.01   0.01   0.62   0.05   0.19
VBN    0.21   0.00   0.00   0.03   0.11   0.65
TO     0.02   0.98   0.00   0.00   0.00   0.00
IN     0.85   0.00   0.02   0.05   0.00   0.08

NN NN VBD TO VB NN
NN VB VBN TO VB NN
NN VB VBN IN VB NN
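A Markov model tagger prefers the tag sequence with the highest product of transition probabilities. The sketch below scores the three candidate sequences against the table, assuming rows are the previous tag and columns the next tag (consistent with TO being followed by VB with probability 0.98). It ignores start probabilities and word emissions, which a lexicalized HMM would add; the function name `sequence_probability` is hypothetical.

```python
from math import prod

TAGS = ["NN", "VB", "VBD", "VBN", "TO", "IN"]

# Transition table from the slide. ASSUMPTION: TRANS[previous][next].
ROWS = {
    "NN": [0.34, 0.00, 0.22, 0.02, 0.01, 0.40],
    "VB": [0.28, 0.01, 0.02, 0.27, 0.04, 0.39],
    "VBD": [0.12, 0.01, 0.01, 0.62, 0.05, 0.19],
    "VBN": [0.21, 0.00, 0.00, 0.03, 0.11, 0.65],
    "TO": [0.02, 0.98, 0.00, 0.00, 0.00, 0.00],
    "IN": [0.85, 0.00, 0.02, 0.05, 0.00, 0.08],
}
TRANS = {prev: dict(zip(TAGS, probs)) for prev, probs in ROWS.items()}


def sequence_probability(tags):
    """Product of transition probabilities over adjacent tag pairs."""
    return prod(TRANS[a][b] for a, b in zip(tags, tags[1:]))


CANDIDATES = [
    "NN NN VBD TO VB NN".split(),
    "NN VB VBN TO VB NN".split(),
    "NN VB VBN IN VB NN".split(),
]

for candidate in CANDIDATES:
    print(" ".join(candidate), sequence_probability(candidate))

best = max(CANDIDATES, key=sequence_probability)
print("best:", " ".join(best))
```

Under these assumptions the second and third candidates score zero because the table gives NN followed by VB a probability of 0.00, so the first sequence is selected. A full tagger would search all sequences with the Viterbi algorithm rather than enumerate a few candidates.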

discriminative POS tagging

•  based on structured classification – most common tag –  tag distribution –  previous tag –  previous word –  two words previous –  two tags previous –  next word –  bigram previous…

•  is the state of the art

syntax

•  structure of phrases and sentences – lexemes form phrases: noun phrases (severe chest pain), adjectival phrases (painful to touch), verb phrases (has increased) – phrases form sentences

•  clinical text is telegraphic – constitutes a sublanguage

symbolic parsing

•  regular expressions

DT? JJ* NN* (NN|NNS)

•  context-free grammars (CFGs)

S → NP VP .
NP → DT? JJ* (NN|NNS) CONJ* PP* | NP and NP
VP → (VBZ|VBP) NP? PP*
PP → IN NP
CONJ → and (NN|NNS)

parse trees

The/DT patient/NN had/VBD pain/NN in/IN lower/JJ extremities/NNS ./.

(S (NP The patient) (VP had (NP pain (PP in (NP lower extremities)))) .)
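The regular-expression approach to symbolic parsing can be tried on this sentence. The sketch below matches the noun-phrase pattern DT? JJ* NN* (NN|NNS) against the sequence of POS tags and maps each match back to the words. It finds only flat noun-phrase chunks, not the nested tree; the function name `np_chunks` is hypothetical.

```python
import re

# DT? JJ* NN* (NN|NNS), written over a string of space-terminated tags.
# The lookbehind keeps a match from starting in the middle of a tag.
NP_RE = re.compile(r"(?<!\S)(?:DT )?(?:JJ )*(?:NN )*(?:NNS? )")


def np_chunks(tagged):
    """Return the word spans whose tags match the noun-phrase pattern."""
    words = [word for word, _ in tagged]
    tag_str = "".join(tag + " " for _, tag in tagged)
    chunks = []
    for match in NP_RE.finditer(tag_str):
        start = tag_str[:match.start()].count(" ")   # index of first token
        length = match.group(0).count(" ")           # number of tokens matched
        chunks.append(" ".join(words[start:start + length]))
    return chunks


sentence = [
    ("The", "DT"), ("patient", "NN"), ("had", "VBD"), ("pain", "NN"),
    ("in", "IN"), ("lower", "JJ"), ("extremities", "NNS"), (".", "."),
]
print(np_chunks(sentence))
```

Attaching "in lower extremities" to "pain" requires the recursive PP rule of the context-free grammar, which a regular expression cannot express.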

probabilistic CFGs (PCFGs)

S → NP VP .
NP → DT?[0.9] JJ*[0.8] (NN[0.6]|NNS) PP*[0.8]
VP → (VBZ[0.4]|VBP) NP?[0.9] PP*[0.7]
PP → IN NP

X-ray shows patches in lung.

state of the art parsing

•  discriminative – structured prediction algorithm

•  lexicalized

•  active research area

•  limited to newswire (Penn Treebank)

semantics

•  interpretation of linguistic structures

•  word sense (e.g., bank, capsule)

•  semantic types – medication, gene, disease, etc.

•  semantic roles – medication-treats-disease, etc.

•  biomedical vs. general language

semantics

•  external lexicon (e.g., UMLS)

•  morphological analysis – “-itis” and “-osis” are diseases – “-otomy” and “-ectomy” are procedures

•  word sense disambiguation – same routine as syntax

•  semantic role labeling (SRL) – regexp, semantic grammar
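The morphological cues above translate directly into a suffix lookup. This is a minimal sketch of that idea only; the function name `semantic_type` and the example terms are illustrative, and a real system would combine such rules with an external lexicon such as UMLS.

```python
# Suffix rules from the slide: (suffix, semantic type).
SUFFIX_TYPES = [
    ("itis", "disease"),
    ("osis", "disease"),
    ("otomy", "procedure"),
    ("ectomy", "procedure"),
]


def semantic_type(term):
    """Guess a semantic type from the word ending, or None if no rule fires."""
    word = term.lower()
    for suffix, label in SUFFIX_TYPES:
        if word.endswith(suffix):
            return label
    return None


for term in ["appendicitis", "fibrosis", "tracheotomy", "appendectomy",
             "pain", "diagnosis"]:
    print(term, "->", semantic_type(term))
```

The last example shows why such rules are brittle: "diagnosis" ends in "-osis" and is labelled a disease, which is the kind of error an external lexicon or word sense disambiguation has to catch.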

semantic parse

S → Finding .
Finding → DegreePhrase? ChangePhrase? SYMP
ChangePhrase → NEG? CHNG
DegreePhrase → DEGR | NEG

No/NEG increased/CHNG tenderness/SYMP .

parse 1: (S (Finding (DegreePhrase NEG) (ChangePhrase CHNG) SYMP) .)

parse 2: (S (Finding (ChangePhrase NEG CHNG) SYMP) .)
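The ambiguity of this semantic grammar can be checked by enumerating every way the rules cover a tag sequence. Because the grammar is tiny and non-recursive, the sketch below simply lists the possible expansions of each optional phrase; it is not a general parser, and the function name `parses` is hypothetical.

```python
from itertools import product

# Possible expansions, with () standing for an omitted optional phrase.
DEGREE_PHRASE = [(), ("DEGR",), ("NEG",)]          # DegreePhrase -> DEGR | NEG
CHANGE_PHRASE = [(), ("CHNG",), ("NEG", "CHNG")]   # ChangePhrase -> NEG? CHNG


def parses(tags):
    """Return every (DegreePhrase, ChangePhrase) split that derives the tags."""
    results = []
    for degree, change in product(DEGREE_PHRASE, CHANGE_PHRASE):
        # S -> Finding .   and   Finding -> DegreePhrase? ChangePhrase? SYMP
        if degree + change + ("SYMP", ".") == tuple(tags):
            results.append({"DegreePhrase": degree, "ChangePhrase": change})
    return results


# "No increased tenderness ."
for analysis in parses(["NEG", "CHNG", "SYMP", "."]):
    print(analysis)
```

The sentence gets two analyses: NEG either fills the DegreePhrase or belongs to the ChangePhrase. The two readings differ in meaning (no tenderness that has increased versus tenderness that has not increased), which is why a purely symbolic grammar needs some way to rank its parses.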

state of the art SRL

•  discriminative – structured prediction algorithm – pipelined (attempts to collapse)

•  lexicalized

•  external knowledge

I left my pearls to my daughter-in-law in my will.

[A0 I ][V left ][A1 my pearls ][A2 to my daughter-in-law ][AM-LOC in my will ]

A0: agent (leaver)
A1: patient (thing left)
A2: beneficiary (left to)
AM-LOC: location adjunct (where)

pragmatics

•  structure of discourse – word and phrase senses: “mass” (mammography vs. radiology), “drinks” (health care vs. life) – reference attachment: co-reference – narrative centering

An  infiltrate  was  noted  in  right  upper  lobe;  it  was  patchy.  

clinical language

•  requires high sensitivity & specificity

•  lacks contextual features

•  telegraphic morphological features

•  global context necessary for significant ambiguity resolution

•  lack of direct measurement

•  standards

biomedical language

•  time-varying

•  morphological nesting

•  syntactical and semantic nesting

•  syntax wildly different than “standard” English

resources

•  controlled vocabularies / lexicons – UMLS – SNOMED, ICD-9 – biological databases (Flybase) – GENIA

take away

•  representation is crucial – “deeper” analysis useful – “deeper” analysis noisy

•  symbolic methods – relies on expert information – brittle, but state of the art for clinical

•  statistical methods – relies on labeled data – robust, seemingly the future

next lecture

•  information extraction – named entities – relation extraction