Upload
others
View
5
Download
0
Embed Size (px)
Citation preview
Building pipeline-based NLP systems for your applications
Hua Xu
School of Biomedical Informatics, University of Texas Health Science Center at Houston
1
Disclosure
• Ireceivegrantfundingfrom:– NIH:NLM,NIGMS,NCI– CPRIT(CancerPreven?onandResearchIns?tuteofTexas)
• Ihavebeenaconsultantfor:– HebtaLLC
2
What Is NLP? • Broad Definition – any system that
manipulates text or speech. It could involve various degrees of linguistic knowledge.
• NLP Systems – Natural Language understanding – Natural Language extraction – Natural Language generation – Machine translation – NLP-based information retrieval – NLP-interfaces
3
Study of Natural Language
• Humanlanguage(vs.formalandcomputerlanguage)
• Linguis?cs-adescrip?onoflanguage-usedbytheore?callinguists.
• Psycholinguis?cs-acogni?vemodelofhowpeopleunderstandandgeneratelanguage.
• Computa?onallinguis?cs-buildcomputa?onalmodelstounderstandandgeneratelanguage.
4
Computa)onalLinguis)cs
♦ An interdisciplinary field dealing with the statistical and/or rule-based modeling of natural language from a computational perspective – Driven by need to process natural language –
convert to structured form for further computerized processes
– Computational model is not necessarily same as human model - we don’t understand much about human language facility
5
Overview of Linguistic Levels • Phonology: units of sound combine to produce
words (will not cover) • Morphology: basic units combine to produce
words • Lexicography: syntactic (part of speech) and
semantic categories of words • Syntax: structures combine to produce
sentences • Semantics: meaning/interpretations • Discourse – previous information affects the
interpretation of the current information • Pragmatic: context or world knowledge affects
the interpretation of meaning
6
Morphology
• Defini?on:Thestudyofhowwordsarecomposedfromsmaller,meaning-bearingunits(morphemes)§ Inflec?on:Wordstem+gramma?calmorpheme
○ likeàlikes,liked,liking§ Deriva?on:Wordstem+syntac?c/gramma?calmorpheme○ generalizeàgeneraliza?on
§ Compounding:Twobaseformsjointoformanewword○ bed?me
• Applica?on:spellingcheck,stemming,POStagging,speechrecogni?on
7
Lexicography-Words
♦ Recognize word – Tokenization (determine the word boundary)
♦ Identify word – Lookup (map to dictionary entry)
♦ Categorize word – Tagging – Syntactic – Assign Part-of-Speech Tags – Semantic – Assign semantic categories
8
Syntax-Sentences♦ Definition: study of the structure of a
sentence. – Categories combine with others to produce a well-formed
structure with underlying relations ♦ Difficulties: ambiguous, nesting, omitted
structures – pain in (hands and feet) vs. (pain in hands) and fever
♦ Parsing – determining syntax - Formalisms: regular expressions vs. context-free
grammar - Partial vs. full parsing
9
Seman)cs♦ Lexical level – to determine the meaning of
a word ♦ Semantic categories of a word
• Abdomen – body location • Fever – symptom • pt – labtest (prothrombin?meassay) vs. treatment
(physical therapy) ♦ Word sense disambiguation
♦ Grammatical level - word senses in a structure combine to form a meaning of the whole structure
10
Discourse
♦ Previous information in text affects current text – Correct reference for pronouns, definite noun
phrases, bridging noun phrases. • Mass noted in left upper lobe. It was well-
marginated. – Time of events – Determining topic – Coherence of text
11
Pragma)cs
♦ Context affect meaning – Domain: A mass was observed – Section of Report: past history vs. hospital
course – Prior information
♦ World knowledge affects interpretation - He couldn’t do any trading on the past
Monday. (Market was closed on President Day - Monday.)
12
It’s all about Ambiguity!• POStagging-saw (noun vs. verb) • Semantic tagging - pt (patient, physical therapy, prothrombin?meassay) • Syntactic parsing - The patient had pain in lower extremities. vs.
The patient had pain in emergency room.
13
S
np vp
det the
n patient
v had
np
n pain
pp
p in
np
adj lower
n extremities
S
np vp
det the
n patient
v had
n pain
pp
p in
np
n emergency
n room
Most of current clinical NLP systems are information extraction systems
• General-purpose – MedLEE – MetaMap – cTAKES – KnowledgeMap Concept Identifier – ….
• Specific-purpose – MIST – the MITRE identification scrubber toolkit – MedEx – medication information extraction – ……
14
Pipeline-based architecture
15
cTAKES(clinicalTextAnalysisandKnowledgeExtrac?onSystem)UIMA(UnstructuredInforma?onManagementArchitecture)annota?onflowofsideeffectpipeline.
Source:Sohnetal.JAMIA,2011
Demo of building clinical NLP pipelines using CLAMP
• ClinicalLanguageAnnota?on,Modeling,andProcessingToolkit(CLAMP)
• Demo1–determinesmokingstatususingrule-basedapproaches
• Demo2–extractlabnamesusingahybridapproachthatcombinesmachinelearningandrules
16
Introduction to CLAMP• AgeneralpurposeclinicalNLPsystembuiltonproven
methods
• AnIDE(integrateddevelopmentenvironment)forbuilding
customizedclinicalNLPpipelinesviaGUIs– Annota?ng/analyzingclinicaltext– TrainingofML-basedmodules– Specifyingrule
17
NLPTasks Ranking
Nameden?tyrecogni?on
2009i2b2,medica?on #2
2010i2b2problem,treatment,test #2
2013SHARe/CLEFabbrevia?on #1
UMLSencoding 2014SemEval,disorder #1
Rela?onextrac?on
2012i2b2Temporal #1
2015SemEvalDisease-modifier #1
2015BioCREATIVEChemical-induceddisease #1
What does CLAMP address?
• TheTransportabilityProblemofNLP– Fromonetypeofclinicalnotestoanother– Fromoneins?tutetoanother– Fromoneapplica?ontoanother
• Needasolu?onfornon-NLPexpertstoefficientlybuildhigh-performanceNLPmodulesforindividualapplica?ons!
18
CLAMP Demo 1
• Buildarule-basedsystemtoextractsmokingstatusfromclinicaltext
• Input:sentencescontainingpa?entsmokinginforma?on
• Output:threetypesofstatusforeachsmokingmen?on:– CurrentSmoker:Shehasapriorhistoryofsmokingalthoughnotcurrently
– PastSmoker:Sheiscon?nuingtosmoke– Non-Smoker:Shedeniesanytobaccouse,alcoholuse
19
CLAMP Demo 2
• Buildahybrid(machinelearning+rules)systemforextrac?nglabtestconceptsfromclinicaltext
• Input:dischargesummaries• Output:labtestconceptsmen?onedinthetextwithakributesof:– Offsets– Nega?on– UMLSCUIs
20
CLAMP Availability
• CLAMPisavailableintwoversions:– CLAMPCMD(free)– CLAMPGUI(dependsonthelicense)hkps://sbmi.uth.edu/ccb/resources/clamp.htm
• Itisnotanopensourcesolware,butsourcecodesareavailableforcollaboratorswithappropriatelicenses.
• Wearelookingforcollaboratorstoco-developthesystem!Ifinterested,pleasecontact:[email protected]