Probabilistic Graphical Models: Lagrangian Relaxation Algorithms for Natural Language Processing

Alexander M. Rush (based on joint work with Michael Collins, Tommi Jaakkola, Terry Koo, David Sontag)

people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdf

Page 1

Probabilistic Graphical Models: Lagrangian Relaxation Algorithms for Natural Language Processing

Alexander M. Rush

(based on joint work with Michael Collins, Tommi Jaakkola, Terry Koo, David Sontag)

Page 2

Uncertainty in language

natural language is notoriously ambiguous, even in toy sentences

United flies some large jet

problem becomes very difficult in real-world sentences

Lin made the first start of his surprising N.B.A. career Monday just as the Knicks lost Amar’e Stoudemire and Carmelo Anthony, turning the roster upside down and setting the scene for another strange and wondrous evening at Madison Square Garden.

Page 3

NLP challenges

Lin made the first start of his surprising N.B.A. career Monday just as the Knicks lost Amar’e Stoudemire and Carmelo Anthony, turning the roster upside down and setting the scene for another strange and wondrous evening at Madison Square Garden.

• who lost Stoudemire?

• what is turning the roster upside down?

• are “strange and wondrous” paired?

• what about “down and setting”?

Page 4

NLP applications

Page 5

This lecture

1. introduce models for natural language processing in the context of decoding in graphical models

2. describe research in Lagrangian relaxation methods for inference in natural language problems.

Page 6

Decoding in NLP

focus: decoding as a combinatorial optimization problem

y* = arg max_{y ∈ Y} f(y)

where f is a scoring function and Y is a set of structures

also known as the MAP problem

f(y; x) = p(y | x)

Page 7

Algorithms for exact decoding

coming lectures will explore algorithms for decoding

y* = arg max_{y ∈ Y} f(y)

for some problems, simple combinatorial algorithms are exact

• dynamic programming - e.g. tree-structured MRFs

• min cut - for certain types of weight structures

• minimum spanning tree - certain NLP problems

can depend on structure of the graphical model

Page 8

Tagging

United flies some large jet

United1 flies2 some3 large4 jet5

N V D A N

Page 9

Parsing

United flies some large jet

(S (NP (N United)) (VP (V flies) (NP (D some) (A large) (N jet))))

*0 United1 flies2 some3 large4 jet5

Page 10

Decoding complexity

issue: simple combinatorial algorithms do not scale to richer models

y* = arg max_{y ∈ Y} f(y)

need decoding algorithms for complex natural language tasks

motivation:

• richer model structure often leads to improved accuracy

• exact decoding for complex models tends to be intractable

Page 11

Phrase-based translation

das muss unsere sorge gleichermaßen sein

this must our concern also be

Page 12

Word alignment

the ugly dog has red fur

le chien laid a fourrure rouge

Page 13

Decoding tasks

high complexity

• combined parsing and part-of-speech tagging (Rush et al., 2010)

• “loopy” HMM part-of-speech tagging

• syntactic machine translation (Rush and Collins, 2011)

NP-Hard

• symmetric HMM alignment (DeNero and Macherey, 2011)

• phrase-based translation (Chang and Collins, 2011)

• higher-order non-projective dependency parsing (Koo et al., 2010)

in practice:

• approximate decoding methods (coarse-to-fine, beam search, cube pruning, Gibbs sampling, belief propagation)

• approximate models (mean field, variational models)

Page 14

Motivation

cannot hope to find exact algorithms (particularly when NP-Hard)

aim: develop decoding algorithms with formal guarantees

method:

• derive fast algorithms that provide certificates of optimality

• show that for practical instances, these algorithms often yield exact solutions

• provide strategies for improving solutions or finding approximate solutions when no certificate is found

Lagrangian relaxation helps us develop algorithms of this form

Page 15

Lagrangian relaxation

a general technique for constructing decoding algorithms

solve complicated models

y* = arg max_y f(y)

by decomposing into smaller problems.

upshot: can utilize a toolbox of combinatorial algorithms.

• dynamic programming

• minimum spanning tree

• shortest path

• min cut

• ...

Page 16

Lagrangian relaxation algorithms

Simple - uses basic combinatorial algorithms

Efficient - faster than solving exact decoding problems

Strong guarantees

• gives a certificate of optimality when exact

• direct connections to linear programming relaxations

Page 17

Outline

1. background on exact NLP decoding

2. worked algorithm for combined parsing and tagging

3. important theorems and formal derivation

4. more examples from parsing and alignment

Page 18

1. Background: Part-of-speech tagging

United flies some large jet

United1 flies2 some3 large4 jet5

N V D A N

Page 19

Graphical model formulation

given:

• a sentence of length n and a tag set T

• one variable for each word, takes values in T

• edge potentials θ(i−1, i, t′, t) for all i ∈ {2, …, n}, t, t′ ∈ T

example:

United1 flies2 some3 large4 jet5

T = {A, D, N, V}

note: for a probabilistic HMM, θ(i−1, i, t′, t) = log(p(w_i | t) p(t | t′))

Page 20

Decoding problem

let Y = T^n be the set of assignments to the variables

decoding problem: arg max_{y ∈ Y} f(y)

f(y) = Σ_{i=2}^{n} θ(i−1, i, y(i−1), y(i))

example:

United1 flies2 some3 large4 jet5

N V D A N

Page 21

Combinatorial solution: Viterbi algorithm

Set c(i, t) = −∞ for all i ≠ 1, t ∈ T
Set c(1, t) = 0 for all t ∈ T
For i = 2 to n
  For t ∈ T
    c(i, t) ← max_{t′ ∈ T} c(i−1, t′) + θ(i−1, i, t′, t)
Return max_{t′ ∈ T} c(n, t′)

Note: run time O(n|T|²), can greatly reduce |T| in practice

a special case of belief propagation (next week)
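A minimal Python sketch of this recursion, with backpointers so the best path can be recovered; the interface is an assumption, with theta(i, t_prev, t) standing in for θ(i−1, i, t′, t):

```python
def viterbi(n, tags, theta):
    """Exact decoding for a chain model by dynamic programming.

    n     -- sentence length (positions 1..n)
    tags  -- the tag set T
    theta -- theta(i, t_prev, t): edge potential theta(i-1, i, t', t)
    Returns (best score, one best tag sequence).
    """
    c = {(1, t): 0.0 for t in tags}   # c(1, t) = 0
    back = {}
    for i in range(2, n + 1):
        for t in tags:
            prev = max(tags, key=lambda tp: c[(i - 1, tp)] + theta(i, tp, t))
            c[(i, t)] = c[(i - 1, prev)] + theta(i, prev, t)
            back[(i, t)] = prev
    last = max(tags, key=lambda t: c[(n, t)])
    seq = [last]
    for i in range(n, 1, -1):          # follow backpointers
        seq.append(back[(i, seq[-1])])
    return c[(n, last)], list(reversed(seq))
```

For the running example, viterbi(5, {"A", "D", "N", "V"}, theta), with theta built from the HMM log-probabilities of page 19, recovers the path N V D A N.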

Pages 22–27

Example run

(one animation frame per page; the chart c(i, t) is filled in column by column)

POS Tag | United1 | flies2 | some3 | large4 | jet5
A       | 0       | 1      | 2     | 4      | 4
D       | 0       | 1      | 3     | 3      | 4
N       | 0       | 2      | 2     | 3      | 6
V       | 0       | 2      | 2     | 3      | 5

Best Path: N V D A N

Page 28

Parsing

United flies some large jet

(S (NP (N United)) (VP (V flies) (NP (D some) (A large) (N jet))))

Page 29

Parse decoding

parsing has potentials:

• θ(A → B C) for each internal rule

• θ(A → w) for pre-terminal (bottom) rules

not easily represented as a graphical model

• complicated independence structure

(S (NP (N United)) (VP (V flies) (NP (D some) (A large) (N jet))))

luckily, similar decoding algorithms to Viterbi

• CKY algorithm for context-free parsing - O(G³n³)

• Eisner algorithm for dependency parsing - O(n³)

Page 30

2. Worked example

aim: walk through a Lagrangian relaxation algorithm for combined parsing and part-of-speech tagging

• introduce formal notation for parsing and tagging

• give assumptions necessary for decoding

• step through a run of the Lagrangian relaxation algorithm

Page 31

Combined parsing and part-of-speech tagging

(S (NP (N United)) (VP (V flies) (NP (D some) (A large) (N jet))))

goal: find parse tree that optimizes

score(S → NP VP) + score(VP → V NP) + … + score(N → United) + …

Page 32

Constituency parsing

notation:

• Y is set of constituency parses for input

• y ∈ Y is a valid parse

• f(y) scores a parse tree

goal: arg max_{y ∈ Y} f(y)

example: a context-free grammar for constituency parsing

(S (NP (N United)) (VP (V flies) (NP (D some) (A large) (N jet))))

Page 33

Part-of-speech tagging

notation:

• Z is set of tag sequences for input

• z ∈ Z is a valid tag sequence

• g(z) scores a tag sequence

goal: arg max_{z ∈ Z} g(z)

example: an HMM for part-of-speech tagging

United1 flies2 some3 large4 jet5

N V D A N

Page 34

Identifying tags

notation: identify the tag labels selected by each model

• y(i, t) = 1 when parse y selects tag t at position i

• z(i, t) = 1 when tag sequence z selects tag t at position i

example: a parse and tagging with y(4, A) = 1 and z(4, A) = 1

y: (S (NP (N United)) (VP (V flies) (NP (D some) (A large) (N jet))))

z: United1 flies2 some3 large4 jet5 tagged N V D A N

Page 35

Combined optimization

goal: arg max_{y ∈ Y, z ∈ Z} f(y) + g(z)

such that for all i = 1 … n, t ∈ T,

y(i, t) = z(i, t)

i.e. find the best parse and tagging pair that agree on tag labels

equivalent formulation:

arg max_{y ∈ Y} f(y) + g(l(y))

where l : Y → Z extracts the tag sequence from a parse tree

Page 36

Exact method: Dynamic programming intersection

can solve by solving the product of the two models

example:

• parsing model is a context-free grammar

• tagging model is a first-order HMM

• can solve as CFG and finite-state automata intersection

replace VP → V NP with VP_{N,V} → V_{N,V} NP_{V,N}

(S (NP (N United)) (VP (V flies) (NP (D some) (A large) (N jet))))

Page 37

Intersected parsing and tagging complexity

let G be the number of grammar non-terminals

parsing a CFG requires O(G³n³) time with rules VP → V NP

(S (NP_{*,N} (N United)) (VP_{N,N} (V_{N,V} flies) (NP_{V,N} (D_{V,D} some) (A large) (N_{A,N} jet))))

with intersection, O(G³n³|T|³) with rules VP_{N,V} → V_{N,V} NP_{V,N}

becomes O(G³n³|T|⁶) time for a second-order HMM

Page 38

Tagging assumption

assumption: optimization with u can be solved efficiently

arg max_{z ∈ Z} g(z) − Σ_{i,t} u(i, t) z(i, t)

example: an HMM with edge potentials θ

g(z) = Σ_{i=2}^{n} θ(i−1, i, z(i−1), z(i))

where

arg max_{z ∈ Z} g(z) − Σ_{i,t} u(i, t) z(i, t)
  = arg max_{z ∈ Z} Σ_{i=2}^{n} (θ(i−1, i, z(i−1), z(i)) − u(i, z(i))) − u(1, z(1))

Page 39

(Modified) Viterbi algorithm

Set c(i, t) = −∞ for all i ≠ 1, t ∈ T
Set c(1, t) = −u(1, t) for all t ∈ T
For i = 2 to n
  For t ∈ T
    c(i, t) ← max_{t′ ∈ T} c(i−1, t′) + θ(i−1, i, t′, t) − u(i, t)
Return max_{t′ ∈ T} c(n, t′)

Note: same algorithm as before with modified potentials
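In code this is just the earlier viterbi with a wrapped potential function; a sketch, where the penalty u(i, t) is assumed to be callable like theta:

```python
def viterbi_with_penalties(n, tags, theta, u):
    # Fold the penalty -u(i, t) into the edge potentials and reuse viterbi
    # unchanged; the lone -u(1, z(1)) term of the previous slide is charged
    # exactly once, on the transition into position 2.
    def theta_mod(i, t_prev, t):
        score = theta(i, t_prev, t) - u(i, t)
        if i == 2:
            score -= u(1, t_prev)
        return score
    return viterbi(n, tags, theta_mod)
```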

Page 40

Parsing assumption

assumption: optimization with u can be solved efficiently

arg max_{y ∈ Y} f(y) + Σ_{i,t} u(i, t) y(i, t)

example: CFG with rule scoring function θ

f(y) = Σ_{X→Y Z ∈ y} θ(X → Y Z) + Σ_{(i,X) ∈ y} θ(X → w_i)

where

arg max_{y ∈ Y} f(y) + Σ_{i,t} u(i, t) y(i, t)
  = arg max_{y ∈ Y} Σ_{X→Y Z ∈ y} θ(X → Y Z) + Σ_{(i,X) ∈ y} (θ(X → w_i) + u(i, X))
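Concretely, the penalties touch only the pre-terminal scores, so an off-the-shelf CKY parser can be reused with an adjusted score table; a sketch, assuming a hypothetical dict keyed by (i, X):

```python
def adjust_preterminal_scores(preterm_scores, u):
    """preterm_scores[(i, X)] = theta(X -> w_i); u[(i, X)] = penalty u(i, X).

    Returns the modified scores theta(X -> w_i) + u(i, X); the binary rule
    scores theta(X -> Y Z) are left untouched.
    """
    return {key: s + u.get(key, 0.0) for key, s in preterm_scores.items()}
```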

Page 41

Lagrangian relaxation algorithm

Set u^(1)(i, t) = 0 for all i, t ∈ T
For k = 1 to K
  y^(k) ← arg max_{y ∈ Y} f(y) + Σ_{i,t} u^(k)(i, t) y(i, t)   [Parsing]
  z^(k) ← arg max_{z ∈ Z} g(z) − Σ_{i,t} u^(k)(i, t) z(i, t)   [Tagging]
  If y^(k)(i, t) = z^(k)(i, t) for all i, t Return (y^(k), z^(k))
  Else u^(k+1)(i, t) ← u^(k)(i, t) − α_k (y^(k)(i, t) − z^(k)(i, t))
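Putting the pieces together, a sketch of this loop in Python; parse_argmax(u) and tag_argmax(u) are assumed wrappers around the modified CKY and Viterbi solvers above, each returning its solution together with an indicator dict over (i, t) pairs:

```python
def dual_decomposition(parse_argmax, tag_argmax, max_iters=50, alpha0=1.0):
    """Lagrangian relaxation for combined parsing and tagging (a sketch)."""
    u = {}  # penalties u(i, t); missing keys mean 0
    for k in range(1, max_iters + 1):
        y, y_ind = parse_argmax(u)  # [Parsing]
        z, z_ind = tag_argmax(u)    # [Tagging]
        if y_ind == z_ind:          # agreement: certificate of optimality
            return y, z, True
        alpha = alpha0 / k          # step size alpha_k
        for key in set(y_ind) | set(z_ind):
            diff = y_ind.get(key, 0) - z_ind.get(key, 0)
            u[key] = u.get(key, 0.0) - alpha * diff
    return y, z, False  # no certificate within K iterations
```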

Pages 42–54

(worked example, one animation frame per page; condensed here)

CKY Parsing

y* = arg max_{y ∈ Y} (f(y) + Σ_{i,t} u(i, t) y(i, t))

across the frames the parser alternates between an incorrect parse whose pre-terminals read A N D A V,

(S (NP (A United) (N flies)) (D some) (A large) (VP (V jet)))

and the correct parse whose pre-terminals read N V D A N,

(S (NP (N United)) (VP (V flies) (NP (D some) (A large) (N jet))))

Viterbi Decoding

United1 flies2 some3 large4 jet5

z* = arg max_{z ∈ Z} (g(z) − Σ_{i,t} u(i, t) z(i, t))

the tagger's outputs alternate between N V D A N and A N D A N

Penalties

u(i, t) = 0 for all i, t

Iteration 1: u(1, A) −1, u(1, N) 1, u(2, N) −1, u(2, V) 1, u(5, V) −1, u(5, N) 1

Iteration 2: u(5, V) −1, u(5, N) 1

Converged: y* = arg max_{y ∈ Y} f(y) + g(l(y))

Key: f(y) ⇐ CFG; g(z) ⇐ HMM; Y ⇐ Parse Trees; Z ⇐ Taggings; y(i, t) = 1 if y contains tag t at position i

Page 55

Main theorem

theorem: if at any iteration, for all i, t ∈ T

y^(k)(i, t) = z^(k)(i, t)

then (y^(k), z^(k)) is the global optimum

proof: focus of the next section

Page 56

Convergence

(plot: % of examples converged vs. number of iterations, bucketed at ≤1, ≤2, ≤3, ≤4, ≤10, ≤20, ≤50)

Page 57

3. Formal properties

aim: formal derivation of the algorithm given in the previous section

• derive Lagrangian dual

• prove three properties

  - upper bound

  - convergence

  - optimality

• describe subgradient method

Page 58

Lagrangian

goal:

arg max_{y ∈ Y, z ∈ Z} f(y) + g(z) such that y(i, t) = z(i, t)

Lagrangian:

L(u, y, z) = f(y) + g(z) + Σ_{i,t} u(i, t) (y(i, t) − z(i, t))

redistribute terms

L(u, y, z) = (f(y) + Σ_{i,t} u(i, t) y(i, t)) + (g(z) − Σ_{i,t} u(i, t) z(i, t))

Page 59

Lagrangian dual

Lagrangian:

L(u, y, z) = (f(y) + Σ_{i,t} u(i, t) y(i, t)) + (g(z) − Σ_{i,t} u(i, t) z(i, t))

Lagrangian dual:

L(u) = max_{y ∈ Y, z ∈ Z} L(u, y, z)
     = max_{y ∈ Y} (f(y) + Σ_{i,t} u(i, t) y(i, t)) + max_{z ∈ Z} (g(z) − Σ_{i,t} u(i, t) z(i, t))

Page 60

Theorem 1. Upper bound

define:

• y*, z* is the optimal combined parsing and tagging solution with y*(i, t) = z*(i, t) for all i, t

theorem: for any value of u

L(u) ≥ f(y*) + g(z*)

L(u) provides an upper bound on the score of the optimal solution

note: upper bound may be useful as input to branch and bound or A* search
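The bound is cheap to evaluate with the same subproblem solvers, so it can also be monitored during the loop; a sketch, with f_score and g_score as assumed scoring functions for the two models:

```python
def dual_value(u, parse_argmax, tag_argmax, f_score, g_score):
    """Evaluate L(u), an upper bound on f(y*) + g(z*) for any u."""
    y, y_ind = parse_argmax(u)
    z, z_ind = tag_argmax(u)
    bound = f_score(y) + g_score(z)
    for key, val in u.items():  # penalty terms u(i,t)(y(i,t) - z(i,t))
        bound += val * (y_ind.get(key, 0) - z_ind.get(key, 0))
    return bound
```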

Page 61

Theorem 1. Upper bound (proof)

theorem: for any value of u, L(u) ≥ f(y*) + g(z*)

proof:

L(u) = max_{y ∈ Y, z ∈ Z} L(u, y, z)                (1)
     ≥ max_{y ∈ Y, z ∈ Z : y = z} L(u, y, z)         (2)
     = max_{y ∈ Y, z ∈ Z : y = z} f(y) + g(z)        (3)
     = f(y*) + g(z*)                                  (4)

Page 62

Formal algorithm (reminder)

Set u^(1)(i, t) = 0 for all i, t ∈ T
For k = 1 to K
  y^(k) ← arg max_{y ∈ Y} f(y) + Σ_{i,t} u^(k)(i, t) y(i, t)   [Parsing]
  z^(k) ← arg max_{z ∈ Z} g(z) − Σ_{i,t} u^(k)(i, t) z(i, t)   [Tagging]
  If y^(k)(i, t) = z^(k)(i, t) for all i, t Return (y^(k), z^(k))
  Else u^(k+1)(i, t) ← u^(k)(i, t) − α_k (y^(k)(i, t) − z^(k)(i, t))

Page 63

Theorem 2. Convergence

notation:

• u^(k+1)(i, t) ← u^(k)(i, t) − α_k (y^(k)(i, t) − z^(k)(i, t)) is the update

• u^(k) is the penalty vector at iteration k

• α_k > 0 is the update rate at iteration k

theorem: for any sequence α_1, α_2, α_3, … such that

lim_{k→∞} α_k = 0 and Σ_{k=1}^{∞} α_k = ∞,

we have

lim_{k→∞} L(u^(k)) = min_u L(u)

i.e. the algorithm converges to the tightest possible upper bound

proof: by subgradient convergence (next section)
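For instance, the schedule α_k = c/k used in the sketches above satisfies both conditions (α_k → 0 while the partial sums diverge); as a generator, with c an assumed constant:

```python
def step_sizes(c=1.0):
    """Yield alpha_k = c / k: alpha_k -> 0 while sum_k alpha_k diverges."""
    k = 1
    while True:
        yield c / k
        k += 1
```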

Page 64

Dual solutions

define:

• for any value of u

y_u = arg max_{y ∈ Y} f(y) + Σ_{i,t} u(i, t) y(i, t)

and

z_u = arg max_{z ∈ Z} g(z) − Σ_{i,t} u(i, t) z(i, t)

• y_u and z_u are the dual solutions for a given u

Page 65

Theorem 3. Optimality

theorem: if there exists u such that

y_u(i, t) = z_u(i, t)

for all i, t then

f(y_u) + g(z_u) = f(y*) + g(z*)

i.e. if the dual solutions agree, we have an optimal solution (y_u, z_u)

Page 66

Theorem 3. Optimality (proof)

theorem: if u such that y_u(i, t) = z_u(i, t) for all i, t then

f(y_u) + g(z_u) = f(y*) + g(z*)

proof: by the definitions of y_u and z_u

L(u) = f(y_u) + g(z_u) + Σ_{i,t} u(i, t)(y_u(i, t) − z_u(i, t)) = f(y_u) + g(z_u)

since L(u) ≥ f(y*) + g(z*) for all values of u,

f(y_u) + g(z_u) ≥ f(y*) + g(z*)

but y* and z* are optimal, so

f(y_u) + g(z_u) ≤ f(y*) + g(z*)

Page 67

Dual optimization

Lagrangian dual:

L(u) = max_{y ∈ Y, z ∈ Z} L(u, y, z)
     = max_{y ∈ Y} (f(y) + Σ_{i,t} u(i, t) y(i, t)) + max_{z ∈ Z} (g(z) − Σ_{i,t} u(i, t) z(i, t))

goal: dual problem is to find the tightest upper bound

min_u L(u)

Page 68

Dual subgradient

L(u) = max_{y ∈ Y} (f(y) + Σ_{i,t} u(i, t) y(i, t)) + max_{z ∈ Z} (g(z) − Σ_{i,t} u(i, t) z(i, t))

properties:

• L(u) is convex in u (no local minima)

• L(u) is not differentiable (because of the max operator)

handle non-differentiability by using subgradient descent

define: a subgradient of L(u) at u is a vector g_u such that for all v

L(v) ≥ L(u) + g_u · (v − u)

Page 69

Subgradient algorithm

L(u) = max_{y ∈ Y} (f(y) + Σ_{i,t} u(i, t) y(i, t)) + max_{z ∈ Z} (g(z) − Σ_{i,t} u(i, t) z(i, t))

recall, y_u and z_u are the argmax's of the two terms

subgradient:

g_u(i, t) = y_u(i, t) − z_u(i, t)

subgradient descent: move along the subgradient

u′(i, t) = u(i, t) − α (y_u(i, t) − z_u(i, t))

guaranteed to find a minimum with the conditions given earlier for α

Page 70

4. More examples

aim: demonstrate similar algorithms that can be applied to other decoding applications

• context-free parsing combined with dependency parsing

• combined translation alignment

Page 71

Combined constituency and dependency parsing (Rush et al., 2010)

setup: assume separate models trained for constituency and dependency parsing

problem: find constituency parse that maximizes the sum of the two models

example:

• combine lexicalized CFG with second-order dependency parser

Page 72

Lexicalized constituency parsing

notation:

• Y is set of lexicalized constituency parses for input

• y ∈ Y is a valid parse

• f(y) scores a parse tree

goal: arg max_{y ∈ Y} f(y)

example: a lexicalized context-free grammar

(S(flies) (NP(United) (N United)) (VP(flies) (V flies) (NP(jet) (D some) (A large) (N jet))))

Page 73

Dependency parsing

define:

• Z is set of dependency parses for input

• z ∈ Z is a valid dependency parse

• g(z) scores a dependency parse

example:

*0 United1 flies2 some3 large4 jet5

Page 74

Identifying dependencies

notation: identify the dependencies selected by each model

• y(i, j) = 1 when word i modifies word j in constituency parse y

• z(i, j) = 1 when word i modifies word j in dependency parse z

example: a constituency and dependency parse with y(3, 5) = 1 and z(3, 5) = 1

y: (S(flies) (NP(United) (N United)) (VP(flies) (V flies) (NP(jet) (D some) (A large) (N jet))))

z: *0 United1 flies2 some3 large4 jet5

Page 75

Combined optimization

goal: arg max_{y ∈ Y, z ∈ Z} f(y) + g(z)

such that for all i = 1 … n, j = 0 … n,

y(i, j) = z(i, j)

Pages 76–84

(worked example, one animation frame per page; condensed here)

CKY Parsing

y* = arg max_{y ∈ Y} (f(y) + Σ_{i,j} u(i, j) y(i, j))

across the frames the parser alternates between a parse containing the dependency (2, 3) (flies2–some3),

(S(flies) (NP (N United)) (VP(flies) (V flies) (D some) (NP(jet) (A large) (N jet))))

and a parse containing the dependency (5, 3) (jet5–some3),

(S(flies) (NP (N United)) (VP(flies) (V flies) (NP(jet) (D some) (A large) (N jet))))

Dependency Parsing

*0 United1 flies2 some3 large4 jet5

z* = arg max_{z ∈ Z} (g(z) − Σ_{i,j} u(i, j) z(i, j))

Penalties

u(i, j) = 0 for all i, j

Iteration 1: u(2, 3) −1, u(5, 3) 1

Converged: y* = arg max_{y ∈ Y} f(y) + g(l(y))

Key: f(y) ⇐ CFG; g(z) ⇐ Dependency Model; Y ⇐ Parse Trees; Z ⇐ Dependency Trees; y(i, j) = 1 if y contains dependency i, j

Page 85

Convergence

(plot: % of examples converged vs. number of iterations, bucketed at ≤1, ≤2, ≤3, ≤4, ≤10, ≤20, ≤50)

Page 86

Integrated Constituency and Dependency Parsing: Accuracy

(bar chart: F1 scores on a scale of 87–92 for the three systems)

• Collins (1997) Model 1

• Fixed, First-best Dependencies from Koo (2008)

• Dual Decomposition

Page 87

Corpus-level tagging

setup: given a corpus of sentences and a trained sentence-level tagging model

problem: find best tagging for each sentence, while at the same time enforcing inter-sentence soft constraints

example:

• test-time decoding with a trigram tagger

• constraint that each word type prefer a single POS tag

Page 88

Corpus-level tagging

English is my first language

He studies language arts now

Language makes us human beings

(the occurrences of the word type "language" are linked and prefer a single tag, here N)

Page 89

Sentence-level decoding

notation:

• Y_i is set of tag sequences for input sentence i

• Y = Y_1 × … × Y_m is set of tag sequences for the input corpus

• Y ∈ Y is a valid tag sequence for the corpus

• F(Y) = Σ_i f(Y_i) is the score for tagging the whole corpus

goal: arg max_{Y ∈ Y} F(Y)

example: decode each sentence with a trigram tagger

English/N is/V my/P first/A language/N

He/P studies/V language/A arts/N now/R

Page 90

Inter-sentence constraints

notation:

• Z is set of possible assignments of tags to word types

• z ∈ Z is a valid tag assignment

• g(z) is a scoring function for assignments to word types

example: an MRF model that encourages words of the same type to choose the same tag

z1: language/N language/N language/N (type node: N)

z2: language/N language/A language/N (type node: N)

g(z1) > g(z2)

Page 91

Identifying word tags

notation: identify the tag labels selected by each model

• Y_s(i, t) = 1 when the tagger for sentence s selects tag t at position i

• z(s, i, t) = 1 when the constraint model assigns tag t at sentence s, position i

example: a tagging and assignment with Y_1(5, N) = 1 and z(1, 5, N) = 1

Y: English is my first language / He studies language arts now

z: language language language

Page 92

Combined optimization

goal: arg max_{Y ∈ Y, z ∈ Z} F(Y) + g(z)

such that for all s = 1 … m, i = 1 … n, t ∈ T,

Y_s(i, t) = z(s, i, t)
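The same subgradient loop applies with penalties indexed by (s, i, t); a sketch, assuming a hypothetical per-sentence tagger tag_argmax(s, u) and an MRF solver mrf_argmax(u) over word-type assignments, each returning an indicator dict over (s, i, t) triples:

```python
def corpus_dual_decomposition(num_sents, tag_argmax, mrf_argmax,
                              max_iters=50, alpha0=1.0):
    """Corpus-level tagging with inter-sentence constraints (a sketch)."""
    u = {}  # penalties u(s, i, t); missing keys mean 0
    for k in range(1, max_iters + 1):
        y_ind = {}
        for s in range(1, num_sents + 1):  # sentences decode independently
            y_ind.update(tag_argmax(s, u))
        z_ind = mrf_argmax(u)
        if y_ind == z_ind:
            return y_ind, True  # taggers and MRF agree: optimal
        alpha = alpha0 / k
        for key in set(y_ind) | set(z_ind):
            diff = y_ind.get(key, 0) - z_ind.get(key, 0)
            u[key] = u.get(key, 0.0) - alpha * diff
    return y_ind, False
```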

Page 93: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

Tagging

English

N

is

V

my

P

first

A

language

N

He

P

studies

V

language

A

arts

N

now

R

Language

N

makes

V

us

P

human

N

beings

N

English

N

is

V

my

P

first

A

language

N

He

P

studies

V

language

A

arts

N

now

R

Language

N

makes

V

us

P

human

N

beings

N

English

N

is

V

my

P

first

A

language

N

He

P

studies

V

language

A

arts

N

now

R

Language

N

makes

V

us

P

human

N

beings

N

English

N

is

V

my

P

first

A

language

N

He

P

studies

V

language

A

arts

N

now

R

Language

N

makes

V

us

P

human

N

beings

N

English

N

is

V

my

P

first

A

language

N

He

P

studies

V

language

N

arts

N

now

R

Language

N

makes

V

us

P

human

N

beings

N

MRF

language

A

language

A

language

A

A

language

A

language

A

language

A

A

language

N

language

N

language

N

N

language

N

language

N

language

N

N

language

N

language

N

language

N

N

Penaltiesu(s, i , t) = 0 for all s,i ,t

Iteration 1

u(1, 5,N) -1

u(1, 5,A) 1

u(3, 1,N) -1

u(3, 1,A) 1

Iteration 2

u(1, 5,N) -1

u(1, 5,A) 1

u(3, 1,N) -1

u(3, 1,A) 1

u(2, 3,N) 1

u(2, 3,A) -1

Key

F (Y ) ⇐ Tagging model g(z) ⇐ MRFY ⇐ Sentence-level tagging Z ⇐ Inter-sentence constraintsYs(i , t) = 1 if sentence s has tag t at position i

Page 105: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

Combined alignment (DeNero and Macherey, 2011)

setup: assume separate models trained for English-to-French and French-to-English alignment

problem: find an alignment that maximizes the score of both models

example:

• HMM models for both directional alignments (assume the correct alignment is one-to-one for simplicity)

Page 106: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

Alignment

the ugly dog has red fur

le chien laid a fourrure rouge

Page 107: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

Graphical model formulation

given:

• French sentence of length n and English sentence of length m

• one variable for each English word, takes values in {1, . . . , n}

• edge potentials θ(i − 1, i, f′, f) for all i, with f, f′ ∈ {1, . . . , n}

example:

The1 ugly2 dog3 has4 red5 fur6

Page 108: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

English-to-French alignmentdefine:

• Y is the set of all possible English-to-French alignments
• y ∈ Y is a valid alignment
• f(y) is the score of an alignment

example: HMM alignment

The1 ugly2 dog3 has4 red5 fur6

Le1 laid3 chien2 a4 rouge6 fourrure5

Page 109: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

French-to-English alignmentdefine:

• Z is the set of all possible French-to-English alignments
• z ∈ Z is a valid alignment
• g(z) is the score of an alignment

example: HMM alignment

Le1 chien2 laid3 a4 fourrure5 rouge6

The1 ugly2 dog3 has4 fur6 red5

Page 110: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

Identifying word alignments

notation: identify the alignment decisions selected by each model

• y(i, j) = 1 when e-to-f alignment y selects French word i to align with English word j

• z(i, j) = 1 when f-to-e alignment z selects French word i to align with English word j

example: two HMM alignment models with y(6, 5) = 1 and z(6, 5) = 1

Page 111: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

Combined optimization

goal:

arg max_{y∈Y, z∈Z} f(y) + g(z)

such that for all i = 1 . . . n, j = 1 . . . n,

y(i, j) = z(i, j)

Page 112: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

English-to-French:

y∗ = arg max_{y∈Y} ( f(y) + Σ_{i,j} u(i, j) y(i, j) )

French-to-English:

z∗ = arg max_{z∈Z} ( g(z) − Σ_{i,j} u(i, j) z(i, j) )

Penalties: u(i, j) = 0 for all i, j

Iteration 1:
u(3, 2) = -1    u(2, 2) = 1
u(2, 3) = -1    u(3, 3) = 1

Key:
f(y) ⇐ HMM Alignment          g(z) ⇐ HMM Alignment
Y ⇐ English-to-French model   Z ⇐ French-to-English model
y(i, j) = 1 if French word i aligns to English word j
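To make the alternation concrete, here is a minimal Python sketch of the subgradient loop for the two alignment subproblems. The decoders decode_e2f and decode_f2e are hypothetical callbacks standing in for Viterbi decoding of the two directional HMMs under the penalty-adjusted objectives.

import numpy as np

def agree_alignments(decode_e2f, decode_f2e, n, m, rounds=100, alpha=0.5):
    """Dual decomposition sketch for two directional alignment models.

    decode_e2f(u) must return a 0/1 matrix y maximizing f(y) + <u, y>;
    decode_f2e(u) must return a 0/1 matrix z maximizing g(z) - <u, z>.
    Both are assumed helpers (HMM Viterbi with modified potentials)."""
    u = np.zeros((n, m))              # one penalty u(i, j) per word pair
    y = None
    for _ in range(rounds):
        y = decode_e2f(u)             # English-to-French subproblem
        z = decode_f2e(u)             # French-to-English subproblem
        if np.array_equal(y, z):      # agreement: provably exact solution
            return y
        u -= alpha * (y - z)          # subgradient step on the penalties
    return y                          # no agreement: approximate answer

On agreement the returned alignment solves the combined problem exactly; the penalty values shown in the iteration table above are exactly what u would contain after the first step.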

Page 122: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

MAP problem in Markov random fields

given: binary variables x1 . . . xn

goal: MAP problem

arg max_{x1...xn} Σ_{(i,j)∈E} f_{i,j}(x_i, x_j)

where each f_{i,j}(x_i, x_j) is a local potential for variables x_i, x_j

Page 123: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

Dual decomposition for MRFs (Komodakis et al., 2010)

[Figure: the MRF's edge set E is split into two tree-structured subgraphs T1 and T2 whose sum is the original graph]

goal:

arg max_{x1...xn} Σ_{(i,j)∈E} f_{i,j}(x_i, x_j)

equivalent formulation:

arg max_{x1...xn, y1...yn} Σ_{(i,j)∈T1} f′_{i,j}(x_i, x_j) + Σ_{(i,j)∈T2} f′_{i,j}(y_i, y_j)

such that for i = 1 . . . n, x_i = y_i

Lagrangian:

L(u, x, y) = Σ_{(i,j)∈T1} f′_{i,j}(x_i, x_j) + Σ_{(i,j)∈T2} f′_{i,j}(y_i, y_j) + Σ_i u_i (x_i − y_i)
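A compact sketch of this tree decomposition, assuming a hypothetical exact solver solve_tree(edges, unary, pairwise) for one tree-structured subproblem (for instance, max-product belief propagation):

import numpy as np

def mrf_dual_decomposition(n, T1, T2, pairwise, solve_tree, rounds=200, c=1.0):
    """Komodakis-style decomposition sketch for a binary MRF.

    T1, T2 are edge lists whose union is E; pairwise[(i, j)] holds the
    split potentials f'_{i,j}. solve_tree is an assumed exact max-product
    solver that returns a 0/1 assignment for its tree."""
    u = np.zeros(n)                                        # multipliers for x_i = y_i
    x = None
    for k in range(rounds):
        x = solve_tree(T1, unary=+u, pairwise=pairwise)    # adds +u_i to x_i = 1
        y = solve_tree(T2, unary=-u, pairwise=pairwise)    # adds -u_i to y_i = 1
        if np.array_equal(x, y):
            return x                                       # copies agree: exact MAP
        u -= (c / (k + 1)) * (x - y)                       # subgradient step on L(u)
    return x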

Page 124: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

4. Practical issues

aim: overview of practical dual decomposition techniques

• tracking the progress of the algorithm

• choice of update rate αk

• lazy update of dual solutions

• extracting solutions if algorithm does not converge

Page 125: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

Optimization tracking

at each stage of the algorithm there are several useful values

track:

• y^(k), z^(k) are the current dual solutions

• L(u^(k)) is the current dual value

• y^(k), l(y^(k)) is a potential primal feasible solution

• f(y^(k)) + g(l(y^(k))) is the potential primal value

Page 126: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

Tracking example

[Plot: current primal and current dual value vs. round]

example run from syntactic machine translation (later in talk)

• current primal: f(y^(k)) + g(l(y^(k)))

• current dual: L(u^(k))

Page 127: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

Optimization progress

useful signals:

• L(u^(k)) − L(u^(k−1)) is the dual change (may be positive)

• min_k L(u^(k)) is the best dual value (tightest upper bound)

• max_k f(y^(k)) + g(l(y^(k))) is the best primal value

the optimal value must be between the best dual and primal values
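These quantities fit naturally into the main loop. The sketch below assumes a hypothetical subproblem(u) returning the dual value L(u), the dual solution y, and a sparse subgradient, plus a callback to_feasible playing the role of l(·):

def run_with_tracking(subproblem, to_feasible, f, g, rounds=60, alpha=0.05):
    """Subgradient loop tracking best dual (upper bound) and best primal
    (lower bound); stops once the gap closes. All callbacks are assumed."""
    u = {}                                       # sparse Lagrange multipliers
    best_dual = float("inf")
    best_primal, best_y = float("-inf"), None
    for k in range(rounds):
        dual_value, y, subgrad = subproblem(u)
        best_dual = min(best_dual, dual_value)   # tightest upper bound so far
        primal = f(y) + g(to_feasible(y))        # feasible, hence a lower bound
        if primal > best_primal:
            best_primal, best_y = primal, y
        if best_dual - best_primal < 1e-9:       # gap closed: optimum certified
            break
        for key, gval in subgrad.items():        # u <- u - alpha * subgradient
            u[key] = u.get(key, 0.0) - alpha * gval
    return best_y, best_primal, best_dual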

Page 128: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

Progress example

[Plots: best primal and best dual value vs. round; primal-dual gap vs. round]

best primal: max_k f(y^(k)) + g(l(y^(k)))

best dual: min_k L(u^(k))

gap: min_k L(u^(k)) − max_k ( f(y^(k)) + g(l(y^(k))) )

Page 129: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

Update rate

choice of αk has important practical consequences

• αk too high causes dual value to fluctuate

• αk too low means slow progress

[Plot: dual value vs. round for update rates 0.01, 0.005, and 0.0005]

Page 130: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

Update rate

practical: find a rate that is robust to varying inputs

• α_k = c (constant rate) can be very fast, but it is hard to find a constant that works for all problems

• α_k = c / k (decreasing rate) often cuts the rate too aggressively, lowering the value even when making progress

• rate based on dual progress: α_k = c / (t + 1), where t < k is the number of iterations on which the dual value increased; robust in practice, since the rate is reduced only when the dual value is fluctuating (see the sketch below)
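A sketch of the progress-based schedule from the last bullet (the constant c is a tuning parameter, not something the slides fix):

def progress_based_rate(c):
    """alpha_k = c / (t + 1), where t counts iterations on which the dual
    value went up; the rate is cut only when the dual fluctuates."""
    state = {"t": 0, "prev": float("inf")}
    def alpha(dual_value):
        if dual_value > state["prev"]:   # dual increased: we overshot
            state["t"] += 1
        state["prev"] = dual_value
        return c / (state["t"] + 1)
    return alpha

rate = progress_based_rate(0.5)
# inside the subgradient loop:  u = u - rate(current_dual) * subgradient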

Page 131: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

Lazy decoding

idea: don't recompute y^(k) or z^(k) from scratch each iteration

lazy decoding: if the subgradient at u^(k) is sparse, then y^(k) may be very easy to compute from y^(k−1)

use:

• helpful if y or z factor naturally into independent components

• can be important for fast decompositions

Page 132: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

Lazy decoding example

recall the corpus-level tagging example

at this iteration, only sentence 2 receives a weight update

with lazy decoding:

Y_1^(k) ← Y_1^(k−1)

Y_3^(k) ← Y_3^(k−1)
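A sketch of that reuse for the corpus-level tagging setting; decode_one(s, u_s) is an assumed per-sentence Viterbi decoder:

def lazy_decode(decode_one, Y_prev, u_prev, u):
    """Re-run the per-sentence decoder only where penalties changed.
    Y_prev maps sentence id -> previous tagging; u_prev and u map
    sentence id -> that sentence's penalty vector."""
    Y = {}
    for s, y_prev in Y_prev.items():
        if u[s] == u_prev[s]:        # sparse subgradient: nothing changed here
            Y[s] = y_prev            # reuse Y_s^(k-1)
        else:
            Y[s] = decode_one(s, u[s])
    return Y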

Page 133: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

Lazy decoding results

lazy decoding is critical for the efficiency of some applications

[Plot: % of head automata recomputed vs. iterations of dual decomposition, for the grandparent+sibling (g+s) and sibling (sib) decompositions]

recomputation statistics for non-projective dependency parsing

Page 134: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

Approximate solution

upon agreement the solution is exact, but this may not occur

otherwise, there is an easy way to find an approximate solution

choose: the structure y^(k′) where

k′ = arg max_k f(y^(k)) + g(l(y^(k)))

is the iteration with the best primal score

guarantee: the solution y^(k′) is non-optimal by at most

( min_k L(u^(k)) ) − ( f(y^(k′)) + g(l(y^(k′))) )

there are other methods to estimate solutions, for instance by averaging solutions (see ?)
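A two-line helper expressing this choice and its guarantee, given the per-iteration values tracked earlier:

def best_approximate(primal_values, dual_values):
    """Return the index k' with the best primal score and a bound on how
    far y^(k') can be from optimal (best dual minus best primal)."""
    k_best = max(range(len(primal_values)), key=primal_values.__getitem__)
    bound = min(dual_values) - primal_values[k_best]
    return k_best, bound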

Page 135: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

Choosing best solution

[Plot: current primal, current dual, and best primal value vs. round]

non-exact example from syntactic translation

best approximate primal solution occurs at iteration 63

Page 136: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

Early stopping results

early stopping results for constituency and dependency parsing

[Plot: f-score, % certificates, and % match (K=50) vs. maximum number of dual decomposition iterations]

Page 137: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

Early stopping results

early stopping results for non-projective dependency parsing

[Plot: % validation UAS, % certificates, and % match (K=5000) vs. maximum number of dual decomposition iterations]

Page 138: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

Phrase-Based Translation

define:

• source-language sentence words x1, . . . , xN
• phrase translation p = (s, e, t)
• translation derivation y = p1, . . . , pL

example:

das muss unsere sorge gleichermaßen sein   (source words x1 . . . x6, covered by phrases p1 . . . p4)

y = {(1, 2, this must), (5, 5, also), (6, 6, be), (3, 4, our concern)}

Page 145: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

Scoring Derivations

derivation:

y = {(1, 2, this must), (5, 5, also), (6, 6, be), (3, 4, our concern)}

das muss unsere sorge gleichermaßen sein   (English phrases over the source positions: this must / our concern / also / be)

objective:

f(y) = h(e(y)) + Σ_{k=1}^{L} g(p_k) + Σ_{k=1}^{L−1} η |t(p_k) + 1 − s(p_{k+1})|

• language model score h
• phrase translation score g
• distortion penalty η
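As a worked check of the objective, here is a minimal sketch that scores a derivation; h (language model) and g (phrase score) are assumed callbacks, and eta is the (typically negative) distortion weight. In the slide's notation t(p) is the end position of a phrase, while the third tuple element below is the translated text.

def derivation_score(y, h, g, eta):
    """f(y) = h(e(y)) + sum_k g(p_k) + sum_k eta*|end(p_k)+1 - start(p_{k+1})|."""
    english = " ".join(text for (start, end, text) in y)     # e(y)
    total = h(english) + sum(g(p) for p in y)
    for (_, e1, _), (s2, _, _) in zip(y, y[1:]):
        total += eta * abs(e1 + 1 - s2)                      # distortion penalty
    return total

y = [(1, 2, "this must"), (5, 5, "also"), (6, 6, "be"), (3, 4, "our concern")]

With this derivation order the first distortion term is |2 + 1 − 5| = 2, matching the annotation on the slide.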

Page 149: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

Relaxed Problem

Y′: only requires the total number of words translated to be N

Y′ = { y : Σ_{i=1}^{N} y(i) = N and the distortion limit d is satisfied }

example:

y(i): 0 1 2 2 0 1   (sums to 6)

das muss unsere sorge gleichermaßen sein

y = {(3, 4, our concern), (2, 2, must), (6, 6, be), (3, 4, our concern)}

Page 153: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

Lagrangian Relaxation Method

original:

arg max_{y∈Y} f(y)   (exact DP is NP-hard)

Y = {y : y(i) = 1 ∀i = 1 . . . N}   (each source word translated exactly once: 1 1 . . . 1)

rewrite:

arg max_{y∈Y′} f(y) such that y(i) = 1 ∀i = 1 . . . N

(the maximization over Y′ can be solved efficiently by DP; the y(i) = 1 constraints are enforced using Lagrangian relaxation)

Y′ = {y : Σ_{i=1}^{N} y(i) = N}   (counts need only sum to N: e.g. 2 0 . . . 1)
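In this rewritten form the Lagrangian is easy to state. A sketch, assuming a hypothetical decode_relaxed(u) DP over Y′ that returns the value of f(y) + Σ_i u(i) y(i) together with the translation counts y(i):

def lagrangian(u, decode_relaxed):
    """L(u) = max_{y in Y'} f(y) + sum_i u(i) * (y(i) - 1)."""
    value, counts = decode_relaxed(u)   # maximizes f(y) + sum_i u(i)*y(i)
    return value - sum(u), counts       # subtract the constant sum_i u(i)

def update(u, counts, alpha):
    """Subgradient step: u(i) <- u(i) - alpha * (y(i) - 1)."""
    return [ui - alpha * (ci - 1) for ui, ci in zip(u, counts)]

This update rule is exactly the one used in the algorithm trace that follows.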

Page 158: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

Algorithm

update rule: u(i) ← u(i) − α (y(i) − 1)

das muss unsere sorge gleichermaßen sein   (x1 . . . x6)

Iteration 1 (α = 1):
u(i): 0 0 0 0 0 0
y(i): 0 1 2 2 0 1   (decode: {(3, 4, our concern), (2, 2, must), (6, 6, be), (3, 4, our concern)})
updated u(i): 1 0 −1 −1 1 0

Iteration 2 (α = 0.5):
u(i): 1 0 −1 −1 1 0
y(i): 1 2 0 0 2 1   (decode: "this must be equally must equally")
updated u(i): 1 −0.5 −0.5 −0.5 0.5 0

Iteration 3 (α = 0.5):
u(i): 1 −0.5 −0.5 −0.5 0.5 0
y(i): 1 1 1 1 1 1   (decode: {(1, 2, this must), (5, 5, also), (6, 6, be), (3, 4, our concern)}; all y(i) = 1, so the constraints are satisfied and the solution is exact)

Page 166: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

Tightening the Relaxation

In some cases, we never reach y(i) = 1 for i = 1 . . . N.

If the dual L(u) is not decreasing fast enough:

• run for 10 more iterations
• count the number of times each constraint is violated
• add the 3 most often violated constraints as hard constraints

example: the relaxation oscillates, with x5 and x6 alternating between over- and under-translation:

Iteration 41: count(i) = 0 0 0 0 1 1,   y(i) = 1 1 1 1 2 0   ("this must also our concern equally")
Iteration 42: count(i) = 0 0 0 0 2 2,   y(i) = 1 1 1 1 0 2   ("this must be our concern is")
Iteration 43: count(i) = 0 0 0 0 3 3,   y(i) = 1 1 1 1 2 0
Iteration 44: count(i) = 0 0 0 0 4 4,   y(i) = 1 1 1 1 0 2
. . .
Iteration 50: count(i) = 0 0 0 0 10 10, y(i) = 1 1 1 1 2 0

add 2 hard constraints (x5, x6) to the dynamic program

Iteration 51: y(i) = 1 1 1 1 1 1   (y = {(1, 2, this must), (5, 5, also), (6, 6, be), (3, 4, our concern)}: converged)
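A sketch of that tightening heuristic, reusing the hypothetical decode_relaxed and update helpers sketched earlier:

def pick_hard_constraints(u, decode_relaxed, update, extra=10, n_add=3, alpha=0.5):
    """Run a few more iterations, count violations of y(i) = 1, and return
    the positions to enforce as hard constraints in the dynamic program."""
    violations = None
    for _ in range(extra):
        _, counts = decode_relaxed(u)
        if violations is None:
            violations = [0] * len(counts)
        for i, c in enumerate(counts):
            violations[i] += int(c != 1)     # constraint i violated this round
        u = update(u, counts, alpha)
    worst = sorted(range(len(violations)), key=violations.__getitem__, reverse=True)
    return worst[:n_add]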

Page 174: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

Experiments: German to English

• Europarl data: German to English

• Test on 1,824 sentences of length 1-50 words

• Converged: 1,818 sentences (99.67%)

Page 175: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

Experiments: Number of Iterations

[Plot: percentage of sentences solved vs. maximum number of Lagrangian relaxation iterations, broken down by sentence length (1-10, 11-20, 21-30, 31-40, 41-50 words, and all)]

Page 176: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

Experiments: Number of Hard Constraints Required

[Plot: percentage of sentences vs. number of hard constraints added (0-9), broken down by sentence length (1-10, 11-20, 21-30, 31-40, 41-50 words, and all)]

Page 177: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

Experiments: Mean Time in Seconds

# words    1-10   11-20   21-30   31-40   41-50     All
mean        0.8    10.9    57.2   203.4   679.9   120.9
median      0.7     8.9    48.3   169.7   484.0    35.2

Page 178: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

Comparison to ILP Decoding

# words     (sec.)     (sec.)
1-10         275.2      132.9
11-15      2,707.8    1,138.5
16-20     20,583.1    3,692.6

Page 179: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

Higher-order non-projective dependency parsing

setup: given a model for higher-order non-projective dependency parsing (sibling features)

problem: find the non-projective dependency parse that maximizes the score of this model

difficulty:

• the model is NP-hard to decode

• the complexity of the model comes from enforcing combinatorial constraints

strategy: design a decomposition that separates the combinatorial constraints from the direct implementation of the scoring function

Page 180: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

Non-Projective Dependency Parsing

[Figure: dependency parses of "*0 John1 saw2 a3 movie4 today5 that6 he7 liked8", including crossing (non-projective) arcs]

Important problem in many languages.

Problem is NP-Hard for all but the simplest models.

Page 181: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

Dual Decomposition

A classical technique for constructing decoding algorithms.

Solve complicated models

y∗ = arg max_y f(y)

by decomposing into smaller problems.

Upshot: Can utilize a toolbox of combinatorial algorithms.

• Dynamic programming

• Minimum spanning tree

• Shortest path

• Min-Cut

• ...

Page 182: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

Non-Projective Dependency Parsing

[Figure: non-projective dependency parse of "*0 John1 saw2 a3 movie4 today5 that6 he7 liked8"]

• Starts at the root symbol *

• Each word has exactly one parent word

• Produces a tree structure (no cycles)

• Dependencies can cross

Page 183: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

Arc-Factored

*0 John1 saw2 a3 movie4 today5 that6 he7 liked8

f(y) = score(head = ∗0, mod = saw2) + score(saw2, John1)
       + score(saw2, movie4) + score(saw2, today5)
       + score(movie4, a3) + ...

e.g. score(∗0, saw2) = log p(saw2 | ∗0)   (generative model)
or   score(∗0, saw2) = w · φ(saw2, ∗0)    (CRF/perceptron model)

y∗ = arg max_y f(y) ⇐ Minimum Spanning Tree Algorithm
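A sketch of arc-factored decoding via a maximum spanning arborescence, using networkx's Chu-Liu/Edmonds implementation; score(head, mod) is an assumed arc scorer.

import networkx as nx

def decode_arc_factored(score, n):
    """Build a complete arc graph over {0..n} (node 0 = root symbol *) and
    return the highest-scoring dependency tree as (head, modifier) arcs."""
    G = nx.DiGraph()
    for head in range(n + 1):
        for mod in range(1, n + 1):          # the root never gets a parent
            if head != mod:
                G.add_edge(head, mod, weight=score(head, mod))
    tree = nx.maximum_spanning_arborescence(G)   # Chu-Liu/Edmonds
    return sorted(tree.edges())

Because no arc enters node 0, any spanning arborescence of this graph is automatically rooted at *0.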

Page 184: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

Arc-Factored

*0 John1 saw2 a3 movie4 today5 that6 he7 liked8

f (y) = score(head =∗0,mod =saw2)

+score(saw2, John1)

+score(saw2,movie4) +score(saw2, today5)

+score(movie4, a3) + ...

e.g. score(∗0, saw2) = log p(saw2|∗0) (generative model)

or score(∗0, saw2) = w · φ(saw2, ∗0) (CRF/perceptron model)

y∗ = arg maxy

f (y) ⇐ Minimum Spanning Tree Algorithm

Page 185: Probabilistic Graphical Models: Lagrangian Relaxation ...people.csail.mit.edu/dsontag/courses/pgm12/slides/lecture3.pdfcombined parsing and part-of-speech tagging (Rush et al., 2010)

Arc-Factored

*0 John1 saw2 a3 movie4 today5 that6 he7 liked8

f (y) = score(head =∗0,mod =saw2) +score(saw2, John1)

+score(saw2,movie4) +score(saw2, today5)

+score(movie4, a3) + ...

e.g. score(∗0, saw2) = log p(saw2|∗0) (generative model)

or score(∗0, saw2) = w · φ(saw2, ∗0) (CRF/perceptron model)

y∗ = arg maxy

f (y) ⇐ Minimum Spanning Tree Algorithm

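To make the factorization concrete, here is a minimal Python sketch of arc-factored scoring (illustrative only; the score table and the tree encoding are assumptions, not the lecture's code). The arg max over trees is then a maximum spanning tree over these arc scores:

    # Arc-factored scoring: a tree is a map mod -> head over words 1..n,
    # with head 0 standing for the root symbol *0.
    def f_arc_factored(tree, score):
        """Sum of independent score(head, mod) terms, one per dependency."""
        return sum(score[(head, mod)] for mod, head in tree.items())

    # Toy example for "*0 John1 saw2 a3 movie4" (scores are made up):
    score = {(0, 2): 1.5, (2, 1): 2.0, (2, 4): 1.0, (4, 3): 1.2}
    tree = {2: 0, 1: 2, 4: 2, 3: 4}
    print(f_arc_factored(tree, score))   # 5.7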

Sibling Models

*0 John1 saw2 a3 movie4 today5 that6 he7 liked8

f (y) = score(head = ∗0, prev = NULL, mod = saw2)
       + score(saw2, NULL, John1) + score(saw2, NULL, movie4)
       + score(saw2, movie4, today5) + ...

e.g. score(saw2, movie4, today5) = log p(today5 | saw2, movie4)

or score(saw2, movie4, today5) = w · φ(saw2, movie4, today5)

y∗ = arg max_{y∈Y} f (y) ⇐ NP-Hard

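For contrast, a sketch of sibling-model scoring for a single head word (again illustrative; score(head, prev, mod) is the assumed three-argument scorer, with None playing the role of NULL):

    # Sibling scoring for one head: modifiers are listed in order moving
    # outward from the head, and each is scored against its previous sibling.
    def score_modifiers(head, modifiers, score):
        total, prev = 0.0, None
        for mod in modifiers:
            total += score(head, prev, mod)
            prev = mod
        return total

    # e.g. score_modifiers(2, [4, 5], score) sums
    # score(saw2, NULL, movie4) + score(saw2, movie4, today5)

Because each arc's score now depends on the neighboring arc, the objective no longer decomposes over single edges, which is what makes exact decoding NP-hard.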

Thought Experiment: Individual Decoding

*0 John1 saw2 a3 movie4 today5 that6 he7 liked8

Candidate modifier sequences for saw2, for example:

score(saw2, NULL, John1) + score(saw2, NULL, movie4) + score(saw2, movie4, today5)

score(saw2, NULL, John1) + score(saw2, NULL, that6)

score(saw2, NULL, a3) + score(saw2, a3, he7)

2^(n−1) possibilities

Under the Sibling Model, we can solve for each head word with Viterbi decoding (see the sketch after this slide).

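A sketch of the per-head Viterbi idea (one direction from the head; the names and encoding are assumptions). The state is the previously attached sibling, so the best of the 2^(n−1) modifier subsets is found in polynomial time:

    # Individual decoding for one head over candidate positions, ordered
    # outward from the head. best[prev] = (score, mods) of the best partial
    # sequence whose last attached modifier is prev (None plays NULL).
    def decode_head(head, positions, score):
        best = {None: (0.0, [])}
        for mod in positions:
            # Option 1: skip mod -- every existing state survives unchanged.
            # Option 2: attach mod after the best previous sibling prev.
            new_state = None
            for prev, (s, mods) in list(best.items()):
                s2 = s + score(head, prev, mod)
                if new_state is None or s2 > new_state[0]:
                    new_state = (s2, mods + [mod])
            best[mod] = new_state
        return max(best.values(), key=lambda v: v[0])   # (score, modifiers)

Running this for every head word (and both directions) gives the individually decoded modifier sets; nothing yet forces their union to form a tree, which is exactly the issue the next slides address.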

Thought Experiment Continued

*0 John1 saw2 a3 movie4 today5 that6 he7 liked8

[figure: the modifier set decoded independently for each head word, overlaid on the sentence]

Idea: Do individual decoding for each head word using dynamic programming.

If we’re lucky, we’ll end up with a valid final tree.

But we might violate some constraints.


Dual Decomposition Structure

Goal: y∗ = arg max_{y∈Y} f (y)

Rewrite as: arg max_{z∈Z, y∈Y} f (z) + g(y)   such that z = y

Here f is the Sibling model and Z has no constraints (all possible structures); g is the Arc-Factored model and Y is the set of valid trees; z = y is the coupling constraint.

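Attaching Lagrange multipliers u(i, j) to the constraint z = y (a standard construction; these u are exactly the penalties used on the following slides) gives a dual objective that upper-bounds the constrained optimum for every u:

    L(u) = max_{z∈Z} (f (z) + Σ_{i,j} u(i, j) z(i, j)) + max_{y∈Y} (g(y) − Σ_{i,j} u(i, j) y(i, j))

Each max is one of the tractable subproblems, and subgradient steps on u (the penalty updates below) drive y and z toward agreement.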


Algorithm Sketch

Set penalty weights u(i, j) equal to 0 for all edges.

For k = 1 to K

   z(k) ← Decode (f (z) + penalty) by Individual Decoding

   y(k) ← Decode (g(y) − penalty) by Minimum Spanning Tree

   If y(k)(i, j) = z(k)(i, j) for all i, j Return (y(k), z(k))

   Else Update penalty weights based on y(k)(i, j) − z(k)(i, j)
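A minimal Python sketch of this loop, assuming two black-box subproblem solvers with hypothetical names: decode_siblings(u) returns the arc set maximizing f (z) plus penalties, and mst(u) returns the arc set maximizing g(y) minus penalties, each arc encoded as a (head, mod) pair:

    from collections import defaultdict

    def dual_decomposition(decode_siblings, mst, K=5000, alpha=1.0):
        u = defaultdict(float)            # penalties u(i, j), all zero initially
        for k in range(1, K + 1):
            z = decode_siblings(u)        # individual decoding subproblem
            y = mst(u)                    # minimum spanning tree subproblem
            if y == z:                    # agreement: certified global optimum
                return y, True
            step = alpha / k              # a common decaying step-size choice
            for arc in y - z:             # arc chosen by y only: raise penalty
                u[arc] += step
            for arc in z - y:             # arc chosen by z only: lower penalty
                u[arc] -= step
        return y, False                   # no certificate: approximate answer

Raising u(i, j) makes the individual decoder more likely to pick arc (i, j) and the spanning tree less likely to, so each update pushes the two solutions toward each other.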

Individual Decoding

*0 John1 saw2 a3 movie4 today5 that6 he7 liked8

z∗ = arg max_{z∈Z} (f (z) + Σ_{i,j} u(i, j) z(i, j))

Minimum Spanning Tree

*0 John1 saw2 a3 movie4 today5 that6 he7 liked8

y∗ = arg max_{y∈Y} (g(y) − Σ_{i,j} u(i, j) y(i, j))

Penalties: u(i, j) = 0 for all i, j

Iteration 1:  u(8, 1) = −1   u(4, 6) = −1   u(2, 6) = 1   u(8, 7) = 1

Iteration 2:  u(8, 1) = −1   u(4, 6) = −2   u(2, 6) = 2   u(8, 7) = 1

Converged: y∗ = arg max_{y∈Y} f (y) + g(y)

Key:
f (z) ⇐ Sibling Model        g(y) ⇐ Arc-Factored Model
Z ⇐ No Constraints           Y ⇐ Tree Constraints
y(i, j) = 1 if y contains dependency (i, j)


Guarantees

Theorem: If at any iteration y(k) = z(k), then (y(k), z(k)) is the global optimum. (Agreement means the dual upper bound L(u) is attained, which certifies optimality.)

In experiments, we find the global optimum on 98% of examples.

If we do not converge to a match, we can still return an approximate solution (more in the paper).


Extensions

• Grandparent Models

  *0 John1 saw2 a3 movie4 today5 that6 he7 liked8

  f (y) = ... + score(gp = ∗0, head = saw2, prev = movie4, mod = today5)

• Head Automata (Eisner, 2000)

  Generalization of Sibling models: allow arbitrary automata as local scoring functions (see the sketch below).
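A sketch of the head-automaton view (illustrative, not Eisner's notation; trans(state, mod) is an assumed function returning the next state and an arc score). The sibling model is the special case where the state is simply the previous modifier:

    # A head automaton scores one head's modifier sequence by running it
    # through a finite-state machine; the state can carry arbitrary history.
    def score_head_automaton(start, modifiers, trans):
        state, total = start, 0.0
        for mod in modifiers:
            state, arc_score = trans(state, mod)
            total += arc_score
        return total

Individual decoding then becomes Viterbi over automaton states instead of over previous siblings.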

Experiments

Properties:

• Exactness
• Parsing Speed
• Parsing Accuracy
• Comparison to Individual Decoding
• Comparison to LP/ILP

Training:

• Averaged Perceptron (more details in paper)

Experiments on:

• CoNLL Datasets
• English Penn Treebank
• Czech Dependency Treebank

How often do we exactly solve the problem?

[bar chart, one bar per language (Cze, Eng, Dan, Dut, Por, Slo, Swe, Tur); y-axis 90–100]

• Percentage of examples where the dual decomposition finds an exact solution.

Parsing Speed

[bar charts, one bar per language (Cze, Eng, Dan, Dut, Por, Slo, Swe, Tur): Sibling model, y-axis 0–50; Grandparent model, y-axis 0–25]

• Number of sentences parsed per second
• Comparable to dynamic programming for projective parsing

Accuracy

       Arc-Factored   Prev Best   Grandparent
Dan    89.7           91.5        91.8
Dut    82.3           85.6        85.8
Por    90.7           92.1        93.0
Slo    82.4           85.6        86.2
Swe    88.9           90.6        91.4
Tur    75.7           76.4        77.6
Eng    90.1           —           92.5
Cze    84.4           —           87.3

Prev Best - Best reported results for the CoNLL-X data set, including:

• Approximate search (McDonald and Pereira, 2006)
• Loopy belief propagation (Smith and Eisner, 2008)
• (Integer) Linear Programming (Martins et al., 2009)

Comparison to Subproblems

[bar chart, English: F1 for dependency accuracy of Individual decoding, MST, and Dual decomposition; y-axis 88–93]

Comparison to LP/ILP

Martins et al. (2009) propose two representations of non-projective dependency parsing as linear programming relaxations, as well as an exact ILP:

• LP (1)
• LP (2)
• ILP

Each uses an LP/ILP solver for decoding.

We compare:

• Accuracy
• Exactness
• Speed

Both the LP/ILP and dual decomposition methods use the same model, features, and weights w.

Comparison to LP/ILP: Accuracy

[bar chart: dependency accuracy of LP(1), LP(2), ILP, and Dual; y-axis 80–100]

• All decoding methods have comparable accuracy

Comparison to LP/ILP: Exactness and Speed

[bar charts for LP(1), LP(2), ILP, and Dual: percentage with exact solution, y-axis 80–100; sentences per second, y-axis 0–14]

Summary

presented Lagrangian relaxation as a method for decoding in NLP

formal guarantees

• gives certificate or approximate solution
• can improve approximate solutions by tightening the relaxation

efficient algorithms

• uses fast combinatorial algorithms
• can improve speed with lazy decoding

widely applicable

• demonstrated algorithms for a wide range of NLP tasks (parsing, tagging, alignment, MT decoding)

References I

Y. Chang and M. Collins. Exact Decoding of Phrase-based Translation Models through Lagrangian Relaxation. In Proc. EMNLP, 2011 (to appear).

J. DeNero and K. Macherey. Model-Based Aligner Combination Using Dual Decomposition. In Proc. ACL, 2011.

J. Duchi, D. Tarlow, G. Elidan, and D. Koller. Using Combinatorial Optimization within Max-Product Belief Propagation. In NIPS, pages 369–376, 2007.

D. Klein and C.D. Manning. Factored A* Search for Models over Sequences and Trees. In Proc. IJCAI, volume 18, pages 1246–1251, 2003.

N. Komodakis, N. Paragios, and G. Tziritas. MRF Energy Minimization and Beyond via Dual Decomposition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010. ISSN 0162-8828.

References II

Terry Koo, Alexander M. Rush, Michael Collins, Tommi Jaakkola, and David Sontag. Dual Decomposition for Parsing with Non-Projective Head Automata. In EMNLP, 2010. URL http://www.aclweb.org/anthology/D10-1125.

B.H. Korte and J. Vygen. Combinatorial Optimization: Theory and Algorithms. Springer Verlag, 2008.

A.M. Rush and M. Collins. Exact Decoding of Syntactic Translation Models through Lagrangian Relaxation. In Proc. ACL, 2011.

A.M. Rush, D. Sontag, M. Collins, and T. Jaakkola. On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing. In Proc. EMNLP, 2010.

D.A. Smith and J. Eisner. Dependency Parsing by Belief Propagation. In Proc. EMNLP, pages 145–156, 2008. URL http://www.aclweb.org/anthology/D08-1016.