Building pipeline-based NLP systems for your applications

Hua Xu

School of Biomedical Informatics, University of Texas Health Science Center at Houston

Disclosure

• I receive grant funding from:
  – NIH: NLM, NIGMS, NCI
  – CPRIT (Cancer Prevention and Research Institute of Texas)
• I have been a consultant for:
  – Hebta LLC

What Is NLP?

• Broad definition: any system that manipulates text or speech. It could involve various degrees of linguistic knowledge.
• NLP systems
  – Natural language understanding
  – Natural language extraction
  – Natural language generation
  – Machine translation
  – NLP-based information retrieval
  – NLP interfaces

Study of Natural Language

• Human language (vs. formal and computer language)
• Linguistics: a description of language, used by theoretical linguists.
• Psycholinguistics: a cognitive model of how people understand and generate language.
• Computational linguistics: build computational models to understand and generate language.

Computational Linguistics

♦ An interdisciplinary field dealing with the statistical and/or rule-based modeling of natural language from a computational perspective
  – Driven by the need to process natural language
    – Convert it to structured form for further computerized processes
  – The computational model is not necessarily the same as the human model; we don't understand much about the human language faculty

Overview of Linguistic Levels

• Phonology: units of sound combine to produce words (will not cover)
• Morphology: basic units combine to produce words
• Lexicography: syntactic (part of speech) and semantic categories of words
• Syntax: structures combine to produce sentences
• Semantics: meaning/interpretations
• Discourse: previous information affects the interpretation of the current information
• Pragmatics: context or world knowledge affects the interpretation of meaning

Morphology

• Definition: The study of how words are composed from smaller, meaning-bearing units (morphemes)
  § Inflection: Word stem + grammatical morpheme
    ○ like → likes, liked, liking
  § Derivation: Word stem + syntactic/grammatical morpheme
    ○ generalize → generalization
  § Compounding: Two base forms join to form a new word
    ○ bedtime
• Application: spelling check, stemming, POS tagging, speech recognition
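Stemming, one of the applications listed above, can be sketched in a few lines of Python. The suffix list and the `stem` function are illustrative assumptions, not a published algorithm such as Porter's; the aim is only to show inflected forms of "like" collapsing to one stem.

```python
def stem(word: str) -> str:
    """Toy stemmer: strip one inflectional suffix, then a trailing 'e'."""
    w = word.lower()
    # Illustrative suffix rules only; real stemmers use many more conditions.
    for suffix in ("ing", "ed", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            w = w[: -len(suffix)]
            break
    # Drop a final 'e' so that "like" and "liking" share a stem.
    if w.endswith("e") and len(w) > 3:
        w = w[:-1]
    return w


if __name__ == "__main__":
    for form in ["like", "likes", "liked", "liking"]:
        print(form, "->", stem(form))
```

All four forms map to the same (non-word) stem, which is acceptable for indexing but shows why stemming is cruder than full morphological analysis.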

Lexicography - Words

♦ Recognize word
  – Tokenization (determine the word boundary)
♦ Identify word
  – Lookup (map to dictionary entry)
♦ Categorize word
  – Tagging
    – Syntactic: assign part-of-speech tags
    – Semantic: assign semantic categories
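The three steps above (recognize, identify, categorize) can be sketched as follows. This is a minimal Python illustration: the regular expression, the tiny `LEXICON`, and the function names are assumptions made for the example, not part of any system named in these slides.

```python
import re

# Hypothetical mini-lexicon: word -> (part of speech, semantic category).
LEXICON = {
    "patient": ("noun", "person"),
    "denies": ("verb", None),
    "fever": ("noun", "symptom"),
    "abdomen": ("noun", "body location"),
}

# Words (with optional apostrophe suffix), numbers, or single punctuation marks.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:'[a-z]+)?|\d+(?:\.\d+)?|[^\sA-Za-z\d]")


def tokenize(text):
    """Recognize words: decide where the word boundaries are."""
    return TOKEN_PATTERN.findall(text)


def tag_words(text):
    """Identify each token by lexicon lookup and attach its categories."""
    tagged = []
    for token in tokenize(text):
        pos, semantic = LEXICON.get(token.lower(), ("unknown", None))
        tagged.append((token, pos, semantic))
    return tagged


if __name__ == "__main__":
    for row in tag_words("Patient denies fever."):
        print(row)
```

A lookup table cannot resolve words with more than one category, which is where statistical taggers and the ambiguity discussion later in the deck come in.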

Syntax - Sentences

♦ Definition: study of the structure of a sentence.
  – Categories combine with others to produce a well-formed structure with underlying relations
♦ Difficulties: ambiguous, nested, and omitted structures
  – pain in (hands and feet) vs. (pain in hands) and fever
♦ Parsing: determining syntax
  – Formalisms: regular expressions vs. context-free grammar
  – Partial vs. full parsing
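The bracketing ambiguity above can be made concrete by counting parses under a small context-free grammar. The sketch below uses a CKY-style chart in Python; the grammar, the lexicon, and the helper symbol `X` (a conjunction followed by a noun phrase) are assumptions chosen for this example only.

```python
from collections import defaultdict

# Hypothetical lexicon: every noun is treated directly as a noun phrase.
LEXICON = {"pain": "NP", "hands": "NP", "feet": "NP", "fever": "NP",
           "in": "P", "and": "CC"}

# Binary rules as (parent, left child, right child).
RULES = [
    ("NP", "NP", "PP"),  # pain + [in hands]
    ("NP", "NP", "X"),   # hands + [and feet]
    ("X", "CC", "NP"),   # and + feet
    ("PP", "P", "NP"),   # in + hands
]


def count_parses(tokens, goal="NP"):
    """Return how many distinct parse trees derive `goal` over all tokens."""
    n = len(tokens)
    chart = defaultdict(int)  # (start, end, symbol) -> number of derivations
    for i, token in enumerate(tokens):
        chart[(i, i + 1, LEXICON[token])] += 1
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                for parent, left, right in RULES:
                    chart[(start, end, parent)] += (
                        chart[(start, mid, left)] * chart[(mid, end, right)]
                    )
    return chart[(0, n, goal)]


if __name__ == "__main__":
    for phrase in ["pain in hands and feet", "pain in hands and fever"]:
        print(phrase, "->", count_parses(phrase.split()), "parses")
```

Syntax alone yields the same two bracketings for both phrases; preferring one over the other requires semantic knowledge (feet are a body location, fever is not).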

Semantics

♦ Lexical level: to determine the meaning of a word
  – Semantic categories of a word
    • Abdomen – body location
    • Fever – symptom
    • pt – lab test (prothrombin time assay) vs. treatment (physical therapy)
  – Word sense disambiguation
♦ Grammatical level: word senses in a structure combine to form a meaning of the whole structure
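Word sense disambiguation for "pt" can be sketched as a vote over context words. This is a simplified Python illustration; the cue lists and the `disambiguate_pt` function are hypothetical, and a practical system would learn such evidence from annotated notes rather than hard-code it.

```python
import re

# Hypothetical context cues for the two senses of "pt" shown above.
SENSE_CUES = {
    "lab test": {"inr", "seconds", "elevated", "prolonged"},
    "treatment": {"referred", "sessions", "exercises", "rehabilitation"},
}


def disambiguate_pt(sentence):
    """Pick the sense whose cue words overlap most with the sentence."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {sense: len(cues & words) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    # With no supporting cue at all, decline to choose a sense.
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(disambiguate_pt("PT and INR were elevated"))
    print(disambiguate_pt("Referred to PT for gait exercises"))
```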

Discourse

♦ Previous information in text affects current text
  – Correct reference for pronouns, definite noun phrases, bridging noun phrases.
    • Mass noted in left upper lobe. It was well-marginated.
  – Time of events
  – Determining topic
  – Coherence of text

Pragmatics

♦ Context affects meaning
  – Domain: A mass was observed
  – Section of report: past history vs. hospital course
  – Prior information
♦ World knowledge affects interpretation
  – He couldn't do any trading on the past Monday. (Market was closed on Presidents' Day, a Monday.)

It's all about Ambiguity!

• POS tagging: saw (noun vs. verb)
• Semantic tagging: pt (patient, physical therapy, prothrombin time assay)
• Syntactic parsing: The patient had pain in lower extremities. vs. The patient had pain in emergency room.

Parse tree 1 (the prepositional phrase attaches to the noun phrase):
[S [np [det the] [n patient]] [vp [v had] [np [n pain] [pp [p in] [np [adj lower] [n extremities]]]]]]

Parse tree 2 (the prepositional phrase attaches to the verb phrase):
[S [np [det the] [n patient]] [vp [v had] [n pain] [pp [p in] [np [n emergency] [n room]]]]]

Most current clinical NLP systems are information extraction systems

• General-purpose
  – MedLEE
  – MetaMap
  – cTAKES
  – KnowledgeMap Concept Identifier
  – …
• Specific-purpose
  – MIST: the MITRE Identification Scrubber Toolkit
  – MedEx: medication information extraction
  – …

Pipeline-based architecture

Figure: cTAKES (clinical Text Analysis and Knowledge Extraction System) UIMA (Unstructured Information Management Architecture) annotation flow of side effect pipeline.

Source: Sohn et al. JAMIA, 2011
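The general idea of a pipeline, in which each component reads the annotations left by earlier components and adds its own, can be sketched in Python as below. This is not the cTAKES or UIMA API; the `Document` class, the component functions, and `run_pipeline` are hypothetical names used only to illustrate the pattern.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Document:
    """Shared container: raw text plus annotations added by each component."""
    text: str
    annotations: dict = field(default_factory=dict)


def detect_sentences(doc):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", doc.text)
    doc.annotations["sentences"] = [p.strip() for p in parts if p.strip()]
    return doc


def tokenize(doc):
    # Depends on the output of detect_sentences, so order matters.
    doc.annotations["tokens"] = [
        re.findall(r"\w+|[^\w\s]", sentence)
        for sentence in doc.annotations["sentences"]
    ]
    return doc


def run_pipeline(doc, components):
    """Apply each component in order to the same document."""
    for component in components:
        doc = component(doc)
    return doc


if __name__ == "__main__":
    note = Document("Mass noted in left upper lobe. It was well-marginated.")
    result = run_pipeline(note, [detect_sentences, tokenize])
    print(result.annotations["sentences"])
    print(result.annotations["tokens"])
```

Because components only communicate through the shared document, one of them can be swapped (for example, a different tokenizer) without touching the rest, which is the main appeal of the architecture.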

Demo of building clinical NLP pipelines using CLAMP

• Clinical Language Annotation, Modeling, and Processing Toolkit (CLAMP)
• Demo 1: determine smoking status using rule-based approaches
• Demo 2: extract lab names using a hybrid approach that combines machine learning and rules

Introduction to CLAMP

• A general purpose clinical NLP system built on proven methods
• An IDE (integrated development environment) for building customized clinical NLP pipelines via GUIs
  – Annotating/analyzing clinical text
  – Training of ML-based modules
  – Specifying rules

NLP task                   Challenge                                   Ranking
Named entity recognition   2009 i2b2, medication                       #2
Named entity recognition   2010 i2b2 problem, treatment, test          #2
Named entity recognition   2013 SHARe/CLEF abbreviation                #1
UMLS encoding              2014 SemEval, disorder                      #1
Relation extraction        2012 i2b2 Temporal                          #1
Relation extraction        2015 SemEval Disease-modifier               #1
Relation extraction        2015 BioCREATIVE Chemical-induced disease   #1

What does CLAMP address?

• The Transportability Problem of NLP
  – From one type of clinical notes to another
  – From one institute to another
  – From one application to another
• Need a solution for non-NLP experts to efficiently build high-performance NLP modules for individual applications!

CLAMP Demo 1

• Build a rule-based system to extract smoking status from clinical text
• Input: sentences containing patient smoking information
• Output: three types of status for each smoking mention:
  – Current Smoker: She is continuing to smoke
  – Past Smoker: She has a prior history of smoking although not currently
  – Non-Smoker: She denies any tobacco use, alcohol use
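A rule-based classifier for this task can be sketched with ordered regular expressions. The sketch below is plain Python, not CLAMP's rule engine; the patterns and the `classify_smoking` function are assumptions written to cover the three example sentences on this slide, and real notes would need a much larger rule set.

```python
import re

# Ordered rules: the first matching pattern decides the status.
# Negation is checked first, then past-tense cues, then current-use cues.
SMOKING_RULES = [
    ("Non-Smoker",
     re.compile(r"\b(denies|no|never)\b.*\b(tobacco|smok\w*)", re.I)),
    ("Past Smoker",
     re.compile(r"\b(prior|former|quit|history of)\b.*\bsmok\w*|\bex-smoker\b", re.I)),
    ("Current Smoker",
     re.compile(r"\b(continu\w*|current\w*|still)\b.*\bsmok\w*|\bsmokes\b", re.I)),
]


def classify_smoking(sentence):
    """Return the status of the first rule that matches, else 'Unknown'."""
    for status, pattern in SMOKING_RULES:
        if pattern.search(sentence):
            return status
    return "Unknown"


if __name__ == "__main__":
    examples = [
        "She is continuing to smoke",
        "She has a prior history of smoking although not currently",
        "She denies any tobacco use, alcohol use",
    ]
    for sentence in examples:
        print(classify_smoking(sentence), "<-", sentence)
```

Rule order encodes priority: a sentence such as the second example mentions "currently" but is caught by the past-smoker rule first.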

CLAMP Demo 2

• Build a hybrid (machine learning + rules) system for extracting lab test concepts from clinical text
• Input: discharge summaries
• Output: lab test concepts mentioned in the text with attributes of:
  – Offsets
  – Negation
  – UMLS CUIs
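The shape of the output (mention text, offsets, negation flag, CUI slot) can be sketched in Python as below. Several things here are assumptions: a dictionary match stands in for the machine-learning tagger, the negation cue list is a simplified rule, and the `cui` field is left empty because no UMLS lookup is included. None of the names come from CLAMP.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LabMention:
    text: str
    start: int                 # character offset where the mention begins
    end: int                   # character offset just past the mention
    negated: bool
    cui: Optional[str] = None  # placeholder: UMLS encoding is not implemented


# Hypothetical term list; a trained sequence tagger would replace this.
LAB_TERMS = ["white blood cell count", "hemoglobin", "creatinine", "glucose"]
LAB_PATTERN = re.compile(
    "|".join(re.escape(t) for t in sorted(LAB_TERMS, key=len, reverse=True)),
    re.I,
)
NEGATION_CUES = re.compile(r"\b(no|not|without|denies)\b", re.I)


def candidate_spans(text):
    """Stand-in for the ML component: propose (start, end) lab spans."""
    return [(m.start(), m.end()) for m in LAB_PATTERN.finditer(text)]


def extract_labs(text) -> List[LabMention]:
    """Rule component: flag a mention if a cue precedes it in the sentence."""
    mentions = []
    for start, end in candidate_spans(text):
        # Only look back to the previous sentence or clause boundary.
        left_context = re.split(r"[.;\n]", text[:start])[-1]
        negated = bool(NEGATION_CUES.search(left_context))
        mentions.append(LabMention(text[start:end], start, end, negated))
    return mentions


if __name__ == "__main__":
    for mention in extract_labs("No glucose was drawn. Creatinine was 1.2."):
        print(mention)
```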

CLAMP Availability

• CLAMP is available in two versions:
  – CLAMP CMD (free)
  – CLAMP GUI (depends on the license)
  https://sbmi.uth.edu/ccb/resources/clamp.htm
• It is not open source software, but source code is available for collaborators with appropriate licenses.
• We are looking for collaborators to co-develop the system! If interested, please contact: [email protected]

Thank you! Questions?

[email protected]
