CRFs and Joint Inference in NLP Andrew McCallum Computer Science Department University of Massachusetts Amherst Joint work with Charles Sutton, Aron Culotta,

CRFs and Joint Inference in NLP

Andrew McCallum

Computer Science Department

University of Massachusetts Amherst

Joint work with Charles Sutton, Aron Culotta, Xuerui Wang, Ben Wellner, Fuchun Peng, Michael Hay.

From Text to Actionable Knowledge

[Figure: Spider, Filter, Document collection, IE, Database, Data Mining, Actionable knowledge]

IE: Segment, Classify, Associate, Cluster

Data Mining: Discover patterns
- entity types
- links / relations
- events

Actionable knowledge: Prediction, Outlier detection, Decision support


[Figure: the same pipeline (Spider, Filter, Document collection, IE, Database, Data Mining, Actionable knowledge), with "Uncertainty Info" and "Emerging Patterns" exchanged between IE and Data Mining]

Joint Inference

An HLT Pipeline

SNA, KDD, Events
TDT, Summarization
Coreference
Relations
NER
Parsing
MT
ASR

Errors cascade & accumulate

An HLT Pipeline

SNA, KDD
TDT, Summarization
Coreference
Relations
NER
Parsing
MT
ASR

Unified, joint inference.




[Figure: the same pipeline with the Database replaced by a Probabilistic Model: Spider, Filter, Document collection, IE, Probabilistic Model, Data Mining, Actionable knowledge]

Solution: Unified Model

• Conditional Random Fields [Lafferty, McCallum, Pereira]
• Conditional PRMs [Koller…], [Jensen…], [Getoor…], [Domingos…]

Discriminatively-trained undirected graphical models

Complex Inference and Learning
Just what we researchers like to sink our teeth into!

(Linear Chain) Conditional Random Fields

Undirected graphical model, trained to maximize conditional probability of output sequence given input sequence.

[Figure: finite state model drawn as a graphical model. FSM states y_{t-1}, y_t, y_{t+1}, y_{t+2}, y_{t+3}, . . . (output seq) sit above observations x_{t-1}, x_t, x_{t+1}, x_{t+2}, x_{t+3}, . . . (input seq)]

input seq:   said   Jones   a      Microsoft  VP     …
output seq:  OTHER  PERSON  OTHER  ORG        TITLE  …

Wide-spread interest, positive experimental results in many applications.

• Noun phrase, Named entity [HLT’03], [CoNLL’03]
• Protein structure prediction [ICML’04]
• IE from Bioinformatics text [Bioinformatics ‘04], …
• Asian word segmentation [COLING’04], [ACL’04]
• IE from Research papers [HLT’04]
• Object classification in images [CVPR ‘04]

[Lafferty, McCallum, Pereira 2001]

$$p(y \mid x) = \frac{1}{Z_x} \prod_{t} \Phi(y_t, y_{t-1}, x, t)$$

where

$$\Phi(y_t, y_{t-1}, x, t) = \exp\left( \sum_{k} \lambda_k f_k(y_t, y_{t-1}, x, t) \right)$$
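The linear-chain formula can be checked numerically with a small sketch. The code below is an illustration only, not the authors' implementation: the three feature functions, their weights, and the `<s>` start symbol are hypothetical choices made for the example. It computes log Z_x with the forward recursion and then the conditional probability of one labeling of the example sentence.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_potential(weights, features, y_t, y_prev, x, t):
    """log Phi(y_t, y_{t-1}, x, t) = sum_k lambda_k * f_k(y_t, y_{t-1}, x, t)."""
    return sum(w * f(y_t, y_prev, x, t) for w, f in zip(weights, features))

def log_partition(weights, features, x, labels, start="<s>"):
    """log Z_x by the forward recursion over label positions."""
    alpha = {y: log_potential(weights, features, y, start, x, 0) for y in labels}
    for t in range(1, len(x)):
        alpha = {
            y: logsumexp([alpha[yp] + log_potential(weights, features, y, yp, x, t)
                          for yp in labels])
            for y in labels
        }
    return logsumexp(list(alpha.values()))

def log_prob(weights, features, x, y, labels, start="<s>"):
    """log p(y | x): unnormalized path score minus log Z_x."""
    score, prev = 0.0, start
    for t, y_t in enumerate(y):
        score += log_potential(weights, features, y_t, prev, x, t)
        prev = y_t
    return score - log_partition(weights, features, x, labels, start)

# Hypothetical features and weights, chosen only for illustration.
LABELS = ["OTHER", "PERSON", "ORG", "TITLE"]
FEATURES = [
    lambda y, yp, x, t: 1.0 if x[t][0].isupper() and y != "OTHER" else 0.0,
    lambda y, yp, x, t: 1.0 if x[t] == "said" and y == "OTHER" else 0.0,
    lambda y, yp, x, t: 1.0 if yp == "ORG" and y == "TITLE" else 0.0,
]
WEIGHTS = [1.5, 2.0, 1.0]

x = ["said", "Jones", "a", "Microsoft", "VP"]
y = ["OTHER", "PERSON", "OTHER", "ORG", "TITLE"]
p_example = math.exp(log_prob(WEIGHTS, FEATURES, x, y, LABELS))
print(p_example)
```

The forward recursion costs O(T·|labels|²) per sentence, which is what makes exact normalization practical for linear chains.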

Outline

• Motivating Joint Inference for NLP.

• Brief introduction of Conditional Random Fields

• Joint inference: Motivation and examples

– Joint Labeling of Cascaded Sequences (Belief Propagation)

– Joint Labeling of Distant Entities (BP by Tree Reparameterization)

– Joint Co-reference Resolution (Graph Partitioning)

– Joint Segmentation and Co-ref (Sparse BP)

– Joint Extraction and Data Mining (Iterative)

• Topical N-gram models

Jointly labeling cascaded sequences: Factorial CRFs

[Sutton, Rohanimanesh, McCallum, ICML 2004]

[Figure: stacked label chains over the English words: Part-of-speech, Noun-phrase boundaries, Named-entity tag]

But errors cascade: must be perfect at every stage to do well.

Joint prediction of part-of-speech and noun-phrase in newswire, matching accuracy with only 50% of the training data.

Inference: Loopy Belief Propagation

2. Jointly labeling distant mentions: Skip-chain CRFs

Senator Joe Green said today … . Green ran for …

…

[Sutton, McCallum, SRL 2004]

Dependency among similar, distant mentions ignored.


14% reduction in error on most repeated field in email seminar announcements.

Inference: Tree reparameterization BP [Wainwright et al, 2002]

See also [Finkel, et al, 2005]
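As a rough sketch of how skip edges between similar, distant mentions might be chosen, the code below links every pair of identical capitalized tokens. Treating "similar" as "identical and capitalized" is an assumption made for this example, and `skip_edges` is a hypothetical helper name. The elided parts of the slide's sentence are kept as literal "…" tokens.

```python
from collections import defaultdict

def skip_edges(tokens):
    """Return index pairs (i, j), i < j, linking identical capitalized tokens."""
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        if tok[:1].isupper():          # assumption: only capitalized words get skip edges
            positions[tok].append(i)
    edges = []
    for idxs in positions.values():
        for a in range(len(idxs)):
            for b in range(a + 1, len(idxs)):
                edges.append((idxs[a], idxs[b]))
    return sorted(edges)

tokens = ["Senator", "Joe", "Green", "said", "today", "…", ".",
          "Green", "ran", "for", "…"]
print(skip_edges(tokens))  # [(2, 7)]: the two occurrences of "Green"
```

These extra edges sit on top of the usual linear-chain edges, which is why inference is no longer exact and an approximate method is used.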

3. Joint co-reference among all pairs: Affinity Matrix CRF

[McCallum, Wellner, IJCAI WS 2003, NIPS 2004]

“Entity resolution”, “Object correspondence”

[Figure: mentions ". . . Mr Powell . . .", ". . . Powell . . ." and ". . . she . . ." joined pairwise by Y/N co-reference variables, with edge weights 45, 99, and 11]

~25% reduction in error on co-reference of proper nouns in newswire.

Inference: Correlational clustering graph partitioning [Bansal, Blum, Chawla, 2002]
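To illustrate partitioning mentions by pairwise affinity, here is a minimal greedy sketch. It is a simple agglomerative heuristic, not the algorithm of Bansal, Blum, and Chawla, and the signed weights are assumptions: the extracted figure shows only the magnitudes 45, 99, and 11, so which pair carries which weight and the negative sign on one edge are chosen here for illustration.

```python
def greedy_partition(nodes, weight):
    """Greedily merge clusters while some pair has positive total cross-affinity.

    weight maps frozenset({a, b}) to a signed affinity; missing pairs count as 0.
    """
    clusters = [{n} for n in nodes]

    def cross(c1, c2):
        return sum(weight.get(frozenset((a, b)), 0.0) for a in c1 for b in c2)

    while True:
        best_gain, best_pair = 0.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                gain = cross(clusters[i], clusters[j])
                if gain > best_gain:
                    best_gain, best_pair = gain, (i, j)
        if best_pair is None:          # no merge improves the objective
            return clusters
        i, j = best_pair
        clusters[i] |= clusters[j]
        del clusters[j]

# Hypothetical signed affinities for the three mentions on the slide.
mentions = ["Mr Powell", "Powell", "she"]
affinity = {
    frozenset(("Mr Powell", "Powell")): 45.0,
    frozenset(("Powell", "she")): -99.0,
    frozenset(("Mr Powell", "she")): 11.0,
}
clusters = greedy_partition(mentions, affinity)
print(clusters)
```

With these assumed weights, "Mr Powell" and "Powell" merge first, and "she" stays separate because its combined affinity to the merged cluster is negative even though one of its edges is positive. That is the point of deciding all pairs jointly rather than one edge at a time.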

4. Joint segmentation and co-reference

[Wellner, McCallum, Peng, Hay, UAI 2004]; see also [Marthi, Milch, Russell, 2003]

Extraction from and matching of research paper citations.

[Figure: graphical model with node types o, s, c, y, and p; labeled layers: Segmentation, Citation attributes, Co-reference decisions, Database field values, World Knowledge]

Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990.

Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.

Inference: Sparse Generalized Belief Propagation [Pal, Sutton, McCallum, 2005]

• 35% reduction in co-reference error by using segmentation uncertainty.
• 6-14% reduction in segmentation error by using co-reference.

Joint IE and Coreference from Research Paper Citations

Textual citation mentions (noisy, with duplicates)

Paper database, with fields, clean, duplicates collapsed

AUTHORS              TITLE       VENUE
Cowell, Dawid…       Probab…     Springer
Montemerlo, Thrun…   FastSLAM…   AAAI…
Kjaerulff            Approxi…    Technic…


Citation Segmentation and Coreference

Laurel, B. Interface Agents: Metaphors with Character , in The Art of Human-Computer Interface Design , T. Smith (ed) , Addison-Wesley , 1990 .

Brenda Laurel . Interface Agents: Metaphors with Character , in Smith , The Art of Human-Computr Interface Design , 355-366 , 1990 .

1) Segment citation fields

2) Resolve coreferent citations (Y?N)

3) Form canonical database record, resolving conflicts

AUTHOR = Brenda Laurel
TITLE = Interface Agents: Metaphors with Character
PAGES = 355-366
BOOKTITLE = The Art of Human-Computer Interface Design
EDITOR = T. Smith
PUBLISHER = Addison-Wesley
YEAR = 1990

Perform jointly.

IE + Coreference Model

[Figure, built up over several slides:]

• Observed citation (x): J Besag 1986 On the…
• CRF Segmentation (s): AUT AUT YR TITL TITL
• Citation mention attributes (c): AUTHOR = “J Besag”, YEAR = “1986”, TITLE = “On the…”
• Structure for each citation mention: “J Besag 1986 On the…”, “Smyth . 2001 Data Mining…”, “Smyth , P Data mining…”
• Binary coreference variables for each pair of mentions (y / n)
• Research paper entity attribute nodes: AUTHOR = “P Smyth”, YEAR = “2001”, TITLE = “Data Mining…”, ...

Inference by Sparse “Generalized BP”

• Exact inference on these linear-chain regions
• From each chain pass an N-best List into coreference [Pal, Sutton, McCallum 2005]
• Approximate inference by graph partitioning, integrating out uncertainty in samples of extraction
• Make scale to 1M citations with Canopies [McCallum, Nigam, Ungar 2000]
• Exact (exhaustive) inference over entity attributes
• Revisit exact inference on IE linear chain, now conditioned on entity attributes
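The Canopies step mentioned above can be sketched roughly as follows. This is a simplified reading of the canopy idea (a cheap distance, a loose and a tight threshold, overlapping groups), not the published implementation. The token-overlap distance, the threshold values, and the function names are assumptions chosen for this example.

```python
def jaccard_distance(a, b):
    """Cheap distance: 1 minus token-set overlap (an assumed choice of metric)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    if not union:
        return 0.0
    return 1.0 - len(ta & tb) / len(union)

def canopies(items, distance, loose, tight):
    """Group items into possibly overlapping canopies.

    Every item within `loose` of a center joins that center's canopy;
    items within `tight` of the center stop being candidate centers.
    """
    assert 0.0 <= tight <= loose
    candidates = list(range(len(items)))
    result = []
    while candidates:
        center = candidates[0]
        dists = [distance(items[center], it) for it in items]
        result.append([items[i] for i in range(len(items)) if dists[i] <= loose])
        # The center itself is at distance 0, so the candidate list always shrinks.
        candidates = [i for i in candidates if dists[i] > tight]
    return result

citations = [
    "Laurel, B. Interface Agents: Metaphors with Character",
    "Brenda Laurel. Interface Agents: Metaphors with Character",
    "J Besag 1986 On the…",
]
groups = canopies(citations, jaccard_distance, loose=0.6, tight=0.5)
print(groups)
```

Expensive pairwise coreference decisions then only need to be made between mentions that share a canopy, which is what lets the approach scale to large citation sets.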

Parameter Estimation: Piecewise Training

[Sutton & McCallum 2005]

Divide-and-conquer parameter estimation

• IE Linear-chain: Exact MAP
• Coref graph edge weights: MAP on individual edges
• Entity attribute potentials: MAP, pseudo-likelihood

In all cases: Climb MAP gradient with quasi-Newton method

[Wellner, McCallum, Peng, Hay, UAI 2004]

Extraction from and matching of research paper citations.

[Figure: graphical model with node types o, s, c, y, and p; labeled layers: Segmentation, Citation attributes, Co-reference decisions, Database field values, World Knowledge]

Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990.

Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.

Inference: Variant of Iterated Conditional Modes [Besag, 1986]

• 35% reduction in co-reference error by using segmentation uncertainty.
• 6-14% reduction in segmentation error by using co-reference.











“George W. Bush’s father is George H. W. Bush (son of Prescott Bush).”




?

Relation Extraction as Sequence Labeling

George W. Bush

…George H. W. Bush (son of Prescott Bush) …

Father Grandfather

Learning Relational Database Features

George W. Bush

…George H. W. Bush (son of Prescott Bush) …

Father Grandfather

Name                 Son
Prescott Bush        George H. W. Bush
George H. W. Bush    George W. Bush

Search DB for “relational paths” between subject and token

Subject_Is_SonOf_SonOf_Token=1.0
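A minimal sketch of the relational-path search, assuming the database is stored as (relation, source, target) triples read as "source is relation of target". The triple layout, the `relational_paths` name, and the path length limit are assumptions for illustration. Only the feature name format and the value 1.0 come from the slide.

```python
from collections import deque

def relational_paths(db, subject, token, max_len=3):
    """Breadth-first search for relation paths from subject to token.

    Returns a dict mapping a feature name such as
    'Subject_Is_SonOf_SonOf_Token' to the feature value 1.0.
    """
    outgoing = {}
    for rel, source, target in db:
        outgoing.setdefault(source, []).append((rel, target))

    features = {}
    queue = deque([(subject, [])])
    while queue:
        node, rels = queue.popleft()
        if node == token and rels:
            features["Subject_Is_" + "_".join(rels) + "_Token"] = 1.0
            continue
        if len(rels) < max_len:        # the length bound also guards against cycles
            for rel, nxt in outgoing.get(node, []):
                queue.append((nxt, rels + [rel]))
    return features

# The two rows of the Name / Son table, rewritten as SonOf triples.
DB = [
    ("SonOf", "George H. W. Bush", "Prescott Bush"),
    ("SonOf", "George W. Bush", "George H. W. Bush"),
]
print(relational_paths(DB, "George W. Bush", "Prescott Bush"))
```

Each path found becomes one binary feature for the sequence labeler at the position of the token.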

Highly weighted relational paths

• Many Family equivalences
– Sibling=Parent_Offspring
– Cousin=Parent_Sibling_Offspring
• College=Parent_College
• Religion=Parent_Religion
• Ally=Opponent_Opponent
• Friend=Person_Same_School

• Preliminary results: nice performance boost using relational features (~8% absolute F1)

Testing on Unknown Entities

John F. Kennedy

… son of Joseph P. Kennedy, Sr. and Rose Fitzgerald

Name                 Son
Joseph P. Kennedy    John F. Kennedy
Rose Fitzgerald      John F. Kennedy

Father Mother

Fill DB with “first-pass” CRF
Use relational features with “second-pass” CRF

Next Steps

• Feature induction to discover complex rules

• Measure relational features’ sensitivity to noise in DB

• Collective inference among related relations











Topical N-gram Model - Our first attempt

[Wang & McCallum]

[Figure: plate diagram over variables z1…z4, w1…w4, y1…y4, with plates T, D, W, and TW; annotated with the set {0, 1, 1:2, 2:2, 1:3, 2:3, 3:3}]

Beyond bag-of-words

[Wallach]

[Figure: plate diagram over variables z1…z4 and w1…w4, with plates TW and D]

LDA-COL (Collocation) Model

[Griffiths & Steyvers]

[Figure: plate diagram over variables z1…z4, w1…w4, y1…y4, with plates T, D, W, and W]

Topical N-gram Model

[Wang & McCallum]

[Figure: plate diagram over variables z1…z4, w1…w4, y1…y4, with plates T, D, W, and TW]


Topic Comparison

LDA: learning, optimal, reinforcement, state, problems, policy, dynamic, action, programming, actions, function, markov, methods, decision, rl, continuous, spaces, step, policies, planning

Topical N-grams (2+): reinforcement learning, optimal policy, dynamic programming, optimal control, function approximator, prioritized sweeping, finite-state controller, learning system, reinforcement learning RL, function approximators, markov decision problems, markov decision processes, local search, state-action pair, markov decision process, belief states, stochastic policy, action selection, upright position, reinforcement learning methods

Topical N-grams (1): policy, action, states, actions, function, reward, control, agent, q-learning, optimal, goal, learning, space, step, environment, system, problem, steps, sutton, policies

Topic Comparison

LDA: motion, visual, field, position, figure, direction, fields, eye, location, retina, receptive, velocity, vision, moving, system, flow, edge, center, light, local

Topical N-grams (2+): receptive field, spatial frequency, temporal frequency, visual motion, motion energy, tuning curves, horizontal cells, motion detection, preferred direction, visual processing, area mt, visual cortex, light intensity, directional selectivity, high contrast, motion detectors, spatial phase, moving stimuli, decision strategy, visual stimuli

Topical N-grams (1): motion, response, direction, cells, stimulus, figure, contrast, velocity, model, responses, stimuli, moving, cell, intensity, population, image, center, tuning, complex, directions


Topic Comparison

LDA: word, system, recognition, hmm, speech, training, performance, phoneme, words, context, systems, frame, trained, speaker, sequence, speakers, mlp, frames, segmentation, models

Topical N-grams (2+): speech recognition, training data, neural network, error rates, neural net, hidden markov model, feature vectors, continuous speech, training procedure, continuous speech recognition, gamma filter, hidden control, speech production, neural nets, input representation, output layers, training algorithm, test set, speech frames, speaker dependent

Topical N-grams (1): speech, word, training, system, recognition, hmm, speaker, performance, phoneme, acoustic, words, context, systems, frame, trained, sequence, phonetic, speakers, mlp, hybrid


Summary

• Joint inference can avoid accumulating errors in a pipeline from extraction to data mining.

• Examples
– Factorial finite state models
– Jointly labeling distant entities
– Coreference analysis
– Segmentation uncertainty aiding coreference & vice-versa
– Joint Extraction and Data Mining

• Many examples of sequential topic models.
