CRFs and Joint Inference in NLP
Andrew McCallum
Computer Science Department
University of Massachusetts Amherst
Joint work with Charles Sutton, Aron Culotta, Xuerui Wang, Ben Wellner, Fuchun Peng, Michael Hay.
From Text to Actionable Knowledge
[Figure: pipeline. Spider → Filter → Document collection → IE (Segment, Classify, Associate, Cluster) → Database → Data Mining (Discover patterns: entity types, links / relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support)]
Uncertainty Info
Emerging Patterns
Joint Inference
An HLT Pipeline
SNA, KDD, Events, TDT, Summarization
Coreference
Relations
NER
Parsing
MT
ASR
Errors cascade & accumulate
An HLT Pipeline
SNA, KDD, TDT, Summarization
Coreference
Relations
NER
Parsing
MT
ASR
Unified, joint inference.
[Figure: the same pipeline with the Database replaced by a Probabilistic Model]
Solution:
Conditional Random Fields [Lafferty, McCallum, Pereira]
Conditional PRMs [Koller…], [Jensen…], [Getoor…], [Domingos…]
Discriminatively-trained undirected graphical models
Complex Inference and Learning
Just what we researchers like to sink our teeth into!
Unified Model
(Linear Chain) Conditional Random Fields
[Lafferty, McCallum, Pereira 2001]
Undirected graphical model, trained to maximize
conditional probability of output sequence given input sequence
[Figure: finite state model drawn as a graphical model; FSM states yt-1, yt, yt+1, yt+2, yt+3, . . . (output seq) over observations xt-1, xt, xt+1, xt+2, xt+3, . . . (input seq)]
input seq: said Jones a Microsoft VP …
output seq: OTHER PERSON OTHER ORG TITLE …
Wide-spread interest, positive experimental results in many applications.
Noun phrase, Named entity [HLT’03], [CoNLL’03]
Protein structure prediction [ICML’04]
IE from Bioinformatics text [Bioinformatics ’04], …
Asian word segmentation [COLING’04], [ACL’04]
IE from Research papers [HLT’04]
Object classification in images [CVPR ’04]
$$p(y \mid x) = \frac{1}{Z_x} \prod_t \Phi(y_t, y_{t-1}, x, t)$$
where
$$\Phi(y_t, y_{t-1}, x, t) = \exp\Big( \sum_k \lambda_k f_k(y_t, y_{t-1}, x, t) \Big)$$
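A minimal sketch of the linear-chain formula, assuming a toy two-label tagger. The feature functions, weights, and label set below are hypothetical, chosen only for illustration; the normalizer Z_x is computed with the standard forward algorithm.

```python
import itertools
import math

LABELS = ["OTHER", "PERSON"]  # hypothetical label set

# Hypothetical feature functions f_k(y_t, y_prev, x, t); y_prev is None at t = 0.
def f_capitalized_person(y_t, y_prev, x, t):
    return 1.0 if x[t][0].isupper() and y_t == "PERSON" else 0.0

def f_lowercase_other(y_t, y_prev, x, t):
    return 1.0 if x[t][0].islower() and y_t == "OTHER" else 0.0

def f_person_after_person(y_t, y_prev, x, t):
    return 1.0 if y_prev == "PERSON" and y_t == "PERSON" else 0.0

FEATURES = [f_capitalized_person, f_lowercase_other, f_person_after_person]
WEIGHTS = [2.0, 1.5, -0.5]  # hypothetical lambda_k, not learned values

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_potential(weights, features, y_t, y_prev, x, t):
    """log Phi(y_t, y_{t-1}, x, t) = sum_k lambda_k * f_k(y_t, y_{t-1}, x, t)."""
    return sum(w * f(y_t, y_prev, x, t) for w, f in zip(weights, features))

def sequence_score(weights, features, y, x):
    """Unnormalized log-score: sum over positions t of log Phi."""
    total = 0.0
    for t in range(len(x)):
        y_prev = y[t - 1] if t > 0 else None
        total += log_potential(weights, features, y[t], y_prev, x, t)
    return total

def log_partition(weights, features, labels, x):
    """log Z_x via the forward algorithm, O(T * |labels|^2)."""
    alpha = {y: log_potential(weights, features, y, None, x, 0) for y in labels}
    for t in range(1, len(x)):
        alpha = {
            y: logsumexp([alpha[yp] + log_potential(weights, features, y, yp, x, t)
                          for yp in labels])
            for y in labels
        }
    return logsumexp(list(alpha.values()))

def conditional_prob(weights, features, labels, y, x):
    """p(y | x) for one label sequence y."""
    return math.exp(sequence_score(weights, features, y, x)
                    - log_partition(weights, features, labels, x))

X = ["said", "Jones", "a"]
print(conditional_prob(WEIGHTS, FEATURES, LABELS, ["OTHER", "PERSON", "OTHER"], X))
```

Summing `conditional_prob` over every label sequence should give 1, which is a quick sanity check on the normalizer.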
Outline
• Motivating Joint Inference for NLP.
• Brief introduction of Conditional Random Fields
• Joint inference: Motivation and examples
– Joint Labeling of Cascaded Sequences (Belief Propagation)
– Joint Labeling of Distant Entities (BP by Tree Reparameterization)
– Joint Co-reference Resolution (Graph Partitioning)
– Joint Segmentation and Co-ref (Sparse BP)
– Joint Extraction and Data Mining (Iterative)
• Topical N-gram models
Jointly labeling cascaded sequences: Factorial CRFs
Part-of-speech
Noun-phrase boundaries
Named-entity tag
English words
[Sutton, Rohanimanesh, McCallum, ICML 2004]
But errors cascade: a pipeline must be perfect at every stage to do well.
Joint prediction of part-of-speech and noun-phrase in newswire, matching accuracy with only 50% of the training data.
Inference: Loopy Belief Propagation
2. Jointly labeling distant mentions: Skip-chain CRFs
Senator Joe Green said today … . Green ran for …
…
[Sutton, McCallum, SRL 2004]
In a linear chain, the dependency among similar, distant mentions is ignored.
14% reduction in error on most repeated field in email seminar announcements.
Inference: Tree reparameterization BP [Wainwright et al, 2002]
See also [Finkel, et al, 2005]
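A small sketch of how skip edges might be added on top of the linear chain. It assumes that "similar" mentions are identical capitalized tokens, which is one simple choice and not necessarily the exact criterion used in the cited work; the function names are hypothetical.

```python
def linear_edges(tokens):
    """Edges between adjacent positions, as in a linear-chain CRF."""
    return [(t, t + 1) for t in range(len(tokens) - 1)]

def skip_edges(tokens):
    """Extra edges (i, j), i < j, linking identical capitalized tokens.

    Assumption: 'similar' means exact string match on a capitalized token.
    """
    seen = {}    # token -> earlier positions where it occurred
    edges = []
    for j, tok in enumerate(tokens):
        if tok[:1].isupper():
            for i in seen.get(tok, []):
                edges.append((i, j))
            seen.setdefault(tok, []).append(j)
    return edges

tokens = ["Senator", "Joe", "Green", "said", "today", ".", "Green", "ran", "for"]
print(linear_edges(tokens))
print(skip_edges(tokens))
```

The skip edge joins the two occurrences of "Green", so evidence near the first mention ("Senator Joe") can influence the label of the second. The resulting graph has loops, which is why approximate inference is needed.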
3. Joint co-reference among all pairs: Affinity Matrix CRF
“Entity resolution”, “Object correspondence”
[McCallum, Wellner, IJCAI WS 2003, NIPS 2004]
[Figure: mentions “. . . Mr Powell . . .”, “. . . Powell . . .”, “. . . she . . .” joined pairwise by Y/N co-reference variables; edge values 45, 99, 11]
~25% reduction in error on co-reference of proper nouns in newswire.
Inference: Correlational clustering, graph partitioning [Bansal, Blum, Chawla, 2002]
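A toy sketch of co-reference as graph partitioning: choose the partition of mentions that maximizes the sum of pairwise affinities inside clusters. The weights below loosely echo the numbers on the slide, but their signs are an assumption, and the exhaustive search here only works for a handful of mentions; the talk relies on approximate partitioning instead.

```python
def partitions(items):
    """Yield every partition of `items` as a list of clusters (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Place `first` into each existing cluster in turn...
        for k in range(len(part)):
            yield part[:k] + [[first] + part[k]] + part[k + 1:]
        # ...or into a new cluster of its own.
        yield [[first]] + part

def partition_score(part, weight):
    """Sum of affinities over all pairs placed in the same cluster."""
    score = 0.0
    for cluster in part:
        for a in range(len(cluster)):
            for b in range(a + 1, len(cluster)):
                score += weight[frozenset((cluster[a], cluster[b]))]
    return score

def best_partition(items, weight):
    """Exhaustive search; exponential in len(items), illustration only."""
    return max(partitions(items), key=lambda p: partition_score(p, weight))

mentions = ["Mr Powell", "Powell", "she"]
weight = {  # hypothetical affinities; positive = likely co-referent
    frozenset(("Mr Powell", "Powell")): 45.0,
    frozenset(("Powell", "she")): -99.0,
    frozenset(("Mr Powell", "she")): 11.0,
}
print(best_partition(mentions, weight))
```

Because the decision is made over the whole partition, the weak positive affinity between "Mr Powell" and "she" is overridden by the strong negative one between "Powell" and "she".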
4. Joint segmentation and co-reference
[Wellner, McCallum, Peng, Hay, UAI 2004]
[Figure: graphical model with node labels o, s, c, y, and p; captions: Database field values, Citation attributes, Segmentation, Co-reference decisions, World Knowledge]
Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990.
Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.
Inference: Sparse Generalized Belief Propagation [Pal, Sutton, McCallum, 2005]
Extraction from and matching of research paper citations.
35% reduction in co-reference error by using segmentation uncertainty.
6-14% reduction in segmentation error by using co-reference.
See also [Marthi, Milch, Russell, 2003]
Joint IE and Coreference from Research Paper Citations
Textual citation mentions (noisy, with duplicates)
Paper database, with fields, clean, duplicates collapsed
AUTHORS               TITLE        VENUE
Cowell, Dawid…        Probab…      Springer
Montemerlo, Thrun…    FastSLAM…    AAAI…
Kjaerulff             Approxi…     Technic…
Citation Segmentation and Coreference
Laurel, B. Interface Agents: Metaphors with Character , in
The Art of Human-Computer Interface Design , T. Smith (ed) ,
Addison-Wesley , 1990 .
Brenda Laurel . Interface Agents: Metaphors with Character , in
Smith , The Art of Human-Computr Interface Design , 355-366 , 1990 .
1) Segment citation fields
2) Resolve coreferent citations (Y/N?)
3) Form canonical database record, resolving conflicts
AUTHOR = Brenda Laurel
TITLE = Interface Agents: Metaphors with Character
PAGES = 355-366
BOOKTITLE = The Art of Human-Computer Interface Design
EDITOR = T. Smith
PUBLISHER = Addison-Wesley
YEAR = 1990
Perform jointly.
IE + Coreference Model
[Figure, built up across slides, with node labels x, s, c]
Observed citation (x): J Besag 1986 On the…
CRF Segmentation (s): AUT AUT YR TITL TITL
Citation mention attributes (c): AUTHOR = “J Besag”, YEAR = “1986”, TITLE = “On the…”
Structure for each citation mention:
J Besag 1986 On the…
Smyth . 2001 Data Mining…
Smyth , P Data mining…
Binary coreference variables for each pair of mentions: y, n, n
Research paper entity attribute nodes: AUTHOR = “P Smyth”, YEAR = “2001”, TITLE = “Data Mining…”, ...
Inference by Sparse “Generalized BP”
[Pal, Sutton, McCallum 2005]
J Besag 1986 On the…
Smyth . 2001 Data Mining…
Smyth , P Data mining…
1) Exact inference on these linear-chain regions. From each chain pass an N-best list into coreference.
2) Approximate inference by graph partitioning, integrating out uncertainty in samples of extraction. Make scale to 1M citations with Canopies [McCallum, Nigam, Ungar 2000].
3) Exact (exhaustive) inference over entity attributes.
4) Revisit exact inference on IE linear chain, now conditioned on entity attributes.
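A rough sketch of the canopies idea referenced above for scaling co-reference: a cheap distance groups mentions into overlapping canopies, and the expensive pairwise comparison is then run only on pairs that share a canopy. The token-overlap distance and both thresholds are assumptions made for this example.

```python
def jaccard_distance(a, b):
    """Cheap distance: 1 - Jaccard overlap of lowercased whitespace tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return 1.0 - len(sa & sb) / len(union) if union else 0.0

def canopies(items, loose, tight, distance=jaccard_distance):
    """Return canopies as lists of item indices.

    Items within `loose` of a center join its canopy; items within `tight`
    of a center can no longer become centers themselves.
    """
    assert 0.0 <= tight <= loose
    remaining = list(range(len(items)))
    result = []
    while remaining:
        center = remaining[0]
        canopy = [i for i in range(len(items))
                  if distance(items[center], items[i]) <= loose]
        result.append(canopy)
        remaining = [i for i in remaining
                     if distance(items[center], items[i]) > tight]
    return result

def candidate_pairs(canopy_list):
    """Pairs of indices that share at least one canopy."""
    pairs = set()
    for canopy in canopy_list:
        for a in range(len(canopy)):
            for b in range(a + 1, len(canopy)):
                pairs.add((canopy[a], canopy[b]))
    return pairs

CITATIONS = [  # the three mentions from the slides, with the elisions dropped
    "J Besag 1986 On the",
    "Smyth . 2001 Data Mining",
    "Smyth , P Data mining",
]
print(candidate_pairs(canopies(CITATIONS, loose=0.7, tight=0.3)))
```

Only the two Smyth mentions end up in a shared canopy, so the Besag citation never needs an expensive comparison against them.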
Parameter Estimation: Piecewise Training
[Sutton & McCallum 2005]
Divide-and-conquer parameter estimation
• IE Linear-chain: Exact MAP
• Coref graph edge weights: MAP on individual edges
• Entity attribute potentials: MAP, pseudo-likelihood
In all cases: Climb MAP gradient with quasi-Newton method
4. Joint segmentation and co-reference
[Wellner, McCallum, Peng, Hay, UAI 2004]
[Figure: graphical model with node labels o, s, c, y, and p; captions: Database field values, Citation attributes, Segmentation, Co-reference decisions, World Knowledge]
Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990.
Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.
Inference: Variant of Iterated Conditional Modes [Besag, 1986]
Extraction from and matching of research paper citations.
35% reduction in co-reference error by using segmentation uncertainty.
6-14% reduction in segmentation error by using co-reference.
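For readers unfamiliar with Iterated Conditional Modes, a generic sketch follows. It shows plain ICM on a hypothetical three-variable toy score, not the variant used in the cited work: each variable in turn is set to the value that maximizes the joint score with all other variables held fixed, until a full sweep changes nothing.

```python
def icm(variables, domains, score, init, max_sweeps=100):
    """Plain ICM: coordinate-wise maximization of `score` over assignments.

    Converges to a local optimum, which need not be the global one.
    """
    assignment = dict(init)
    for _ in range(max_sweeps):
        changed = False
        for v in variables:
            best_value = max(domains[v],
                             key=lambda val: score({**assignment, v: val}))
            # Move only on strict improvement, so the loop terminates.
            if score({**assignment, v: best_value}) > score(assignment):
                assignment[v] = best_value
                changed = True
        if not changed:
            break
    return assignment

def toy_score(a):
    """Hypothetical score: a unary preference plus two agreement terms."""
    return (3 * (a["a"] == 1)
            + 2 * (a["a"] == a["b"])
            + 1 * (a["b"] == a["c"]))

variables = ["a", "b", "c"]
domains = {v: [0, 1] for v in variables}
result = icm(variables, domains, toy_score, init={v: 0 for v in variables})
print(result, toy_score(result))
```

In the joint model, the variables would be segmentation, attribute, and co-reference decisions, and the score would be the unnormalized log-probability; those details are not reproduced here.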
“George W. Bush’s father is George H. W. Bush (son of Prescott Bush).”
Relation Extraction as Sequence Labeling
George W. Bush
…George H. W. Bush (son of Prescott Bush) …
Father Grandfather
Learning Relational Database Features
George W. Bush
…George H. W. Bush (son of Prescott Bush) …
Father Grandfather
Name                 Son
Prescott Bush        George H. W. Bush
George H. W. Bush    George W. Bush
Search DB for “relational paths” between subject and token
Subject_Is_SonOf_SonOf_Token=1.0
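A sketch of the relational-path search, assuming the database is just the two-column Name/Son table shown above. The function name, the hop limit, and the single `SonOf` relation are assumptions for illustration; a real database would hold many relation types.

```python
from collections import deque

ROWS = [  # (name, son) rows from the table above
    ("Prescott Bush", "George H. W. Bush"),
    ("George H. W. Bush", "George W. Bush"),
]

def relational_paths(rows, subject, token, max_hops=3):
    """Breadth-first search for SonOf paths leading from subject to token.

    Returns one feature name per path found, e.g.
    'Subject_Is_SonOf_SonOf_Token' for a two-hop path.
    """
    son_of = {}  # son -> list of people he is a son of
    for name, son in rows:
        son_of.setdefault(son, []).append(name)

    features = []
    queue = deque([(subject, [])])
    while queue:
        entity, path = queue.popleft()
        if entity == token and path:
            features.append("Subject_Is_" + "_".join(path) + "_Token")
            continue
        if len(path) < max_hops:
            for parent in son_of.get(entity, []):
                queue.append((parent, path + ["SonOf"]))
    return features

print(relational_paths(ROWS, "George W. Bush", "Prescott Bush"))
```

Each path found becomes a binary feature for the token being labeled, so the CRF can learn, for instance, that a two-hop `SonOf` path is strong evidence for the Grandfather label.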
Highly weighted relational paths
• Many Family equivalences
– Sibling=Parent_Offspring
– Cousin=Parent_Sibling_Offspring
• College=Parent_College
• Religion=Parent_Religion
• Ally=Opponent_Opponent
• Friend=Person_Same_School
• Preliminary results: nice performance boost using relational features (~8% absolute F1)
Testing on Unknown Entities
John F. Kennedy
… son of Joseph P. Kennedy, Sr. and Rose Fitzgerald
Name                 Son
Joseph P. Kennedy    John F. Kennedy
Rose Fitzgerald      John F. Kennedy
Father Mother
Fill DB with “first-pass” CRF
Use relational features with “second-pass” CRF
Next Steps
• Feature induction to discover complex rules
• Measure relational features’ sensitivity to noise in DB
• Collective inference among related relations
Topical N-gram Model - Our first attempt
[Figure: plate diagram over z1…z4, w1…w4, y1…y4; plate labels D, T, W, TW; set label {0, 1, 1:2, 2:2, 1:3, 2:3, 3:3}]
Wang & McCallum
Beyond bag-of-words
[Figure: plate diagram over z1…z4, w1…w4; plate labels D, TW]
Wallach
LDA-COL (Collocation) Model
[Figure: plate diagram over z1…z4, w1…w4, y1…y4; plate labels D, T, W]
Griffiths & Steyvers
Topical N-gram Model
[Figure: plate diagram over z1…z4, w1…w4, y1…y4; plate labels D, T, W, TW]
Wang & McCallum
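A small sketch of how phrases could be read off such a model, assuming the y variables act as bigram-status indicators (y = 1 meaning the word continues an n-gram with the previous word). That reading is an assumption, since the slides show only the node names, and the indicator values below are made up for illustration.

```python
def assemble_ngrams(words, bigram_status):
    """Group words into phrases.

    bigram_status[i] == 1 attaches words[i] to the phrase ending at
    words[i - 1]; any other value starts a new phrase.
    """
    phrases = []
    for word, y in zip(words, bigram_status):
        if y == 1 and phrases:
            phrases[-1].append(word)
        else:
            phrases.append([word])
    return [" ".join(p) for p in phrases]

words = ["hidden", "markov", "model", "training", "data"]
status = [0, 1, 1, 0, 1]  # hypothetical indicator values
print(assemble_ngrams(words, status))
```

Chaining consecutive indicators is what lets a bigram-level variable produce longer phrases such as trigrams.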
Topic Comparison
LDA: learning, optimal, reinforcement, state, problems, policy, dynamic, action, programming, actions, function, markov, methods, decision, rl, continuous, spaces, step, policies, planning
Topical N-grams (2+): reinforcement learning, optimal policy, dynamic programming, optimal control, function approximator, prioritized sweeping, finite-state controller, learning system, reinforcement learning RL, function approximators, markov decision problems, markov decision processes, local search, state-action pair, markov decision process, belief states, stochastic policy, action selection, upright position, reinforcement learning methods
Topical N-grams (1): policy, action, states, actions, function, reward, control, agent, q-learning, optimal, goal, learning, space, step, environment, system, problem, steps, sutton, policies
Topic Comparison
LDA: motion, visual, field, position, figure, direction, fields, eye, location, retina, receptive, velocity, vision, moving, system, flow, edge, center, light, local
Topical N-grams (2+): receptive field, spatial frequency, temporal frequency, visual motion, motion energy, tuning curves, horizontal cells, motion detection, preferred direction, visual processing, area mt, visual cortex, light intensity, directional selectivity, high contrast, motion detectors, spatial phase, moving stimuli, decision strategy, visual stimuli
Topical N-grams (1): motion, response, direction, cells, stimulus, figure, contrast, velocity, model, responses, stimuli, moving, cell, intensity, population, image, center, tuning, complex, directions
Topic Comparison
LDA: word, system, recognition, hmm, speech, training, performance, phoneme, words, context, systems, frame, trained, speaker, sequence, speakers, mlp, frames, segmentation, models
Topical N-grams (2+): speech recognition, training data, neural network, error rates, neural net, hidden markov model, feature vectors, continuous speech, training procedure, continuous speech recognition, gamma filter, hidden control, speech production, neural nets, input representation, output layers, training algorithm, test set, speech frames, speaker dependent
Topical N-grams (1): speech, word, training, system, recognition, hmm, speaker, performance, phoneme, acoustic, words, context, systems, frame, trained, sequence, phonetic, speakers, mlp, hybrid
Summary
• Joint inference can avoid accumulating errors in a pipeline from extraction to data mining.
• Examples
– Factorial finite state models
– Jointly labeling distant entities
– Coreference analysis
– Segmentation uncertainty aiding coreference & vice-versa
– Joint Extraction and Data Mining
• Many examples of sequential topic models.