
Probabilistic Models of Nonprojective Dependency Trees


Page 1: Probabilistic Models of Nonprojective Dependency Trees

28 June 2007, EMNLP-CoNLL

Probabilistic Models of Nonprojective Dependency Trees

David A. Smith
Center for Language and Speech Processing
Computer Science Dept.
Johns Hopkins University

Noah A. Smith
Language Technologies Institute
Machine Learning Dept.
School of Computer Science
Carnegie Mellon University

Page 2: Probabilistic Models of Nonprojective Dependency Trees

See Also

On the Complexity of Non-Projective Data-Driven Dependency Parsing. R. McDonald and G. Satta. IWPT 2007.

Structured Prediction Models via the Matrix-Tree Theorem. T. Koo, A. Globerson, X. Carreras and M. Collins. EMNLP-CoNLL 2007. (Coming up next!)

Page 3: Probabilistic Models of Nonprojective Dependency Trees

Nonprojective Syntax

ROOT  ista      meam    norit     gloria     canitiem
      that-NOM  my-ACC  may-know  glory-NOM  going-gray-ACC
      'That glory shall last till I go gray'

ROOT  I 'll give a talk tomorrow on bootstrapping

How would we parse these? In both sentences the dependency arcs cross: the NOM and ACC pairs interleave in the Latin, and "on bootstrapping" modifies "talk" across "tomorrow" in the English.

Page 4: Probabilistic Models of Nonprojective Dependency Trees

Edge-Factored Models (McDonald et al., 2005)

Each edge gets a non-negative score in isolation:

$$s(i,j) = \exp[\mathbf{w} \cdot \mathbf{f}(i,j)]$$

Collect the scores in a matrix whose rows index parents j and columns index children i (column 0 is zero because the root has no parent):

$$\begin{bmatrix}
0 & s(1,0) & s(2,0) & \cdots & s(n,0) \\
0 & 0      & s(2,1) & \cdots & s(n,1) \\
0 & s(1,2) & 0      & \cdots & s(n,2) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & s(1,n) & s(2,n) & \cdots & 0
\end{bmatrix}$$

Find the best product of edge scores among legal trees (a sum in log space):

$$\max_{y \in \mathcal{T}} \prod_{(i,j) \in y} s(i,j)$$

• Score edges in isolation
• Find the maximum spanning tree with Chu-Liu-Edmonds
• NP-hard to add sibling or degree constraints, or hidden node variables
• What about training? Unlabeled parsing for now; a scoring sketch follows below.
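A minimal NumPy sketch of edge-factored scoring (my illustration, not the authors' code): the dense feature array `F` and the Chu-Liu-Edmonds decoder `cle_decode` are hypothetical stand-ins.

```python
import numpy as np

def edge_scores(F, w):
    """s(i, j) = exp(w . f(i, j)) for every parent j / child i pair.

    F: (n+1, n+1, d) array with F[j, i] = feature vector of the edge from
       parent j to child i (hypothetical layout); w: (d,) weight vector.
    Returns S with S[j, i] = s(i, j).
    """
    return np.exp(F @ w)

# Decoding hands S to any maximum-spanning-tree routine, e.g.:
#   heads = cle_decode(edge_scores(F, w))   # cle_decode: hypothetical Chu-Liu-Edmonds decoder
```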

Page 5: Probabilistic Models of Nonprojective Dependency Trees

If Only It Were Projective…

ROOT  I 'll give a talk tomorrow on bootstrapping

The Inside-Outside algorithm would give us:

• the normalizing constant for globally normalized models, $p(y \mid x) = Z_x^{-1} e^{\mathbf{w} \cdot \mathbf{f}(x,y)}$
• posterior probabilities of edges
• sums over hidden variables, $p(y \mid x) = \sum_z p(y, z \mid x)$

But we can't use Inside-Outside for nonprojective parsing!

Page 6: Probabilistic Models of Nonprojective Dependency Trees

Graph Theory to the Rescue!

Tutte's Matrix-Tree Theorem (1948): The determinant of the Kirchhoff (aka Laplacian) adjacency matrix of a directed graph G, with row and column r struck out, is equal to the sum of scores of all directed spanning trees of G rooted at node r.

Exactly the Z we need, in O(n³) time!

Page 7: Probabilistic Models of Nonprojective Dependency Trees

Building the Kirchhoff (Laplacian) Matrix

Negate the edge scores:

$$\begin{bmatrix}
0 & -s(1,0) & -s(2,0) & \cdots & -s(n,0) \\
0 & 0 & -s(2,1) & \cdots & -s(n,1) \\
0 & -s(1,2) & 0 & \cdots & -s(n,2) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & -s(1,n) & -s(2,n) & \cdots & 0
\end{bmatrix}$$

Put each column sum (each child's total incoming score) on the diagonal:

$$\begin{bmatrix}
0 & -s(1,0) & -s(2,0) & \cdots & -s(n,0) \\
0 & \sum_{j \neq 1} s(1,j) & -s(2,1) & \cdots & -s(n,1) \\
0 & -s(1,2) & \sum_{j \neq 2} s(2,j) & \cdots & -s(n,2) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & -s(1,n) & -s(2,n) & \cdots & \sum_{j \neq n} s(n,j)
\end{bmatrix}$$

Strike out the root row and column, and take the determinant:

$$\begin{vmatrix}
\sum_{j \neq 1} s(1,j) & -s(2,1) & \cdots & -s(n,1) \\
-s(1,2) & \sum_{j \neq 2} s(2,j) & \cdots & -s(n,2) \\
\vdots & \vdots & \ddots & \vdots \\
-s(1,n) & -s(2,n) & \cdots & \sum_{j \neq n} s(n,j)
\end{vmatrix}$$

N.B.: This allows multiple children of the root, but see Koo et al. 2007.
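The construction in NumPy, continuing the sketch above under the same `S[j, i] = s(i, j)` layout (my illustration, not the paper's code):

```python
def partition_function(S):
    """Z = total score of all directed spanning trees rooted at node 0,
    via the Matrix-Tree Theorem.  S[j, i] = s(i, j) >= 0; diagonal ignored.
    """
    n = S.shape[0] - 1
    A = S.copy()
    np.fill_diagonal(A, 0.0)                       # no self-loops
    K = -A[1:, 1:]                                 # negated scores between words 1..n
    K[np.diag_indices(n)] = A[:, 1:].sum(axis=0)   # diagonal: each child's total incoming score
    return np.linalg.det(K)                        # root row/column already struck

# Sanity check: two words, all scores 1 -> exactly 3 rooted spanning trees.
assert np.isclose(partition_function(np.ones((3, 3))), 3.0)
```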

Page 8: Probabilistic Models of Nonprojective Dependency Trees

Why Should This Work?

Chu-Liu-Edmonds analogy: every node selects its best parent; if cycles result, contract them and recur.

The determinant obeys the same contract-or-delete recursion (with K as on the previous slide). Let $K'$ be $K$ with the edge $(1,2)$ contracted, and $K'' \equiv K(\{1,2\} \mid \{1,2\})$ be $K$ with rows and columns 1 and 2 struck out. Then

$$|K| = s(1,2)\,|K'| + |K''|$$

Clear for a 1×1 matrix; use induction. (This argument covers the undirected case; the directed case needs special handling at the root.)
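A quick concrete check with n = 2 (my worked example): expanding the 2×2 determinant reproduces exactly the three spanning trees rooted at 0.

$$\begin{vmatrix} s(1,0)+s(1,2) & -s(2,1) \\ -s(1,2) & s(2,0)+s(2,1) \end{vmatrix} = s(1,0)\,s(2,0) + s(1,0)\,s(2,1) + s(2,0)\,s(1,2)$$

The three terms score the trees $\{0\to1,\,0\to2\}$, $\{0\to1,\,1\to2\}$, and $\{0\to2,\,2\to1\}$; the cross term $s(1,2)\,s(2,1)$, the illegal 2-cycle, cancels out.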

Page 9: Probabilistic Models of Nonprojective Dependency Trees

When You Have a Hammer…

Applications of the Matrix-Tree Theorem:

• Sequence-normalized log-linear models (Lafferty et al. '01)
• Minimum Bayes-risk parsing (cf. Goodman '96)
• Hidden-variable models
• O(n) inference with length constraints (cf. N. Smith & Eisner '05)
• Minimum-risk training (D. Smith & Eisner '06)
• Tree (Rényi) entropy (Hwa '01; S & E '07)

Page 10: Probabilistic Models of Nonprojective Dependency Trees

Analogy to Other Models

Where does global log-linear training exist for each class of structures?

• Sequence models: standard (e.g., conditional log-linear training of CRFs).
• Projective trees: PCFGs; shift-reduce (action-based) parsers; max-margin or error-driven training (e.g. McDonald, Collins).
• Nonprojective trees: parent-predicting, action-based models (K. Hall '07); global log-linear training is this work.

Page 11: Probabilistic Models of Nonprojective Dependency Trees

More Machinery: The Gradient

Since

$$\frac{\partial \log |A|}{\partial [A]_{j,i}} = \left[A^{-1}\right]_{i,j},$$

the gradient of $\log Z$ with respect to each (log) edge score is

$$\frac{\partial \log Z_\theta(x)}{\partial \log s(i,j)} = s(i,j)\left(\left[K^{-1}\right]_{i,i} - \left[K^{-1}\right]_{i,j}\right), \qquad 1 \le i \le n,\ 0 \le j \le n$$

(for $j = 0$ only the first term applies, since the root row and column were struck).

• Invert the Kirchhoff matrix K in O(n³) time via LU factorization.
• The edge gradient is also the edge posterior probability.
• Use the chain rule to backpropagate into s(i,j), whatever its internal structure may be.
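Continuing the NumPy sketch (still my illustration), the posterior of every edge from the inverse Kirchhoff matrix:

```python
def edge_posteriors(S):
    """P[j, i] = posterior probability that the edge parent j -> child i
    appears in the tree, given the sentence.  Same layout as above."""
    n = S.shape[0] - 1
    A = S.copy()
    np.fill_diagonal(A, 0.0)
    K = -A[1:, 1:]
    K[np.diag_indices(n)] = A[:, 1:].sum(axis=0)
    Kinv = np.linalg.inv(K)                        # O(n^3), via LU factorization
    P = np.zeros_like(A)
    for i in range(1, n + 1):                      # child i
        P[0, i] = A[0, i] * Kinv[i - 1, i - 1]     # root edge: diagonal term only
        for j in range(1, n + 1):                  # parent j
            if j != i:
                P[j, i] = A[j, i] * (Kinv[i - 1, i - 1] - Kinv[i - 1, j - 1])
    return P

# Each word's possible parents' posteriors sum to one.
assert np.allclose(edge_posteriors(np.ones((3, 3)))[:, 1:].sum(axis=0), 1.0)
```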

Page 12: Probabilistic Models of Nonprojective Dependency Trees

Nonprojective Conditional Log-Linear Training

  train   Arabic   Czech   Danish   Dutch
  MIRA    79.9     81.4    86.6     90.0
  CL      80.4     80.2    87.5     90.0

• Data: CoNLL 2006 Danish and Dutch; CoNLL 2007 Arabic and Czech
• Features from McDonald et al. 2005
• Compared with MSTParser's MIRA max-margin training
• Conditional log-linear (CL) weights for $p(y \mid x) = Z_x^{-1} e^{\mathbf{w} \cdot \mathbf{f}(x,y)}$ trained with stochastic gradient descent
• Same number of iterations and stopping criteria as MIRA
• Significance assessed with a paired permutation test
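The SGD update this training uses needs the log-likelihood gradient, observed minus expected features; a sketch building on the functions above (the dense feature layout `F` and gold-head array `heads` are hypothetical):

```python
def loglik_gradient(F, heads, w):
    """Gradient of log p(y | x) for one sentence: gold features minus
    posterior-expected features.  heads[i] = gold parent of child i (i >= 1)."""
    S = np.exp(F @ w)                                  # s(i, j) = exp(w . f(i, j))
    P = edge_posteriors(S)
    observed = sum(F[heads[i], i] for i in range(1, len(heads)))
    expected = np.tensordot(P, F, axes=([0, 1], [0, 1]))
    return observed - expected                         # take a small SGD step along this
```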

Page 13: Probabilistic Models of Nonprojective Dependency Trees

Minimum Bayes-Risk Parsing

Select not the tree with the highest probability, but the tree with the most expected correct edges:

$$\hat{y} = \operatorname*{argmin}_{y} \; \mathbb{E}_{p(y' \mid x)}\!\left[-\sum_{i=1}^{n} \delta(y_i, y'_i)\right] = \operatorname*{argmax}_{y} \sum_{i=1}^{n} p(y_i \in \text{parse of } x \mid x)$$

  parse   train   Arabic   Czech   Danish   Dutch
  map     MIRA    79.9     81.4    86.6     90.0
  map     CL      80.4     80.2    87.5     90.0
  mBr     MIRA    79.4     80.3    85.0     87.2
  mBr     CL      80.5     80.4    87.5     90.0

• Plug the posteriors into MST decoding.
• MIRA doesn't estimate probabilities.
• N.B. One could do mBr inside MIRA.
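As a sketch (reusing the hypothetical `cle_decode` from earlier), minimum Bayes-risk decoding is just maximum-spanning-tree decoding over posteriors rather than raw scores:

```python
def mbr_parse(S, cle_decode):
    """Tree maximizing the expected number of correct edges: run any
    Chu-Liu-Edmonds decoder on the posterior matrix instead of the scores."""
    return cle_decode(edge_posteriors(S))
```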

Page 14: Probabilistic Models of Nonprojective Dependency Trees

Edge Clustering

[Figure: "Franz loves Milena" parsed twice: once with supervised dependency labels (SUBJ, OBJ), once with latent edge clusters (A, B, C or X, Y, Z).]

Simple idea: conjoin each model feature with a cluster.

Sum out all possible edge labelings if we don't care about labels per se.
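A sketch of the summed-out clustered score (my illustration): conjoining every feature with a cluster is equivalent to giving each cluster its own weight vector over the same features, then summing the resulting edge scores over clusters.

```python
def clustered_edge_scores(F, W):
    """s(i, j) = sum over clusters c of exp(w_c . f(i, j)).
    W: (k, d) matrix, one (hypothetical) weight row per cluster."""
    return np.exp(F @ W.T).sum(axis=-1)
```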

Page 15: Probabilistic Models of Nonprojective Dependency Trees

Edge Clustering

[Bar chart: unlabeled attachment accuracy (y-axis 74 to 92) on Arabic, Danish, and Dutch with 1, 2, 16, and 32 edge clusters.]

No significant gains or losses from clustering.

Page 16: Probabilistic Models of Nonprojective Dependency Trees

What's Wrong with Edge Clustering?

Edge labels don't interact with one another, unlike clusters on PCFG nonterminals (e.g. Matsuzaki et al. '05): in a rewrite rule like NP-A → NP-B NP-A, the clusters constrain each other, whereas the clusters on the two edges of "Franz loves Milena" are chosen independently.

Cf. the small or no gains in unlabeled accuracy from supervised labeled parsers.

Page 17: Probabilistic Models of Nonprojective Dependency Trees

Constraints on Link Length

• Maximum left/right child distances L and R (cf. Eisner & N. Smith '05)
• The Kirchhoff matrix is band-diagonal once the root row and column are removed
• Inversion in O(min(L³R², L²R³) · n) time (see the sketch below)

Example with L = 1, R = 2, all allowed edges scored 1:

$$\begin{bmatrix}
0 & -1 & -1 & -1 & -1 & -1 & -1 & -1 & -1 & -1 \\
0 & 2 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & -1 & 3 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & -1 & -1 & 4 & -1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & -1 & -1 & 4 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -1 & -1 & 4 & -1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & -1 & -1 & 4 & -1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & -1 & -1 & 4 & -1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & -1 & -1 & 3 & -1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 & -1 & 2
\end{bmatrix}$$
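A sketch of the banded construction (my illustration; the exact convention by which L and R cap attachment distance is my assumption, so treat the `allowed` test accordingly). Because the struck matrix is zero outside a band of width L + R + 1, a banded LU factorization, e.g. scipy.linalg.solve_banded, handles it in time linear in n:

```python
import numpy as np

def banded_kirchhoff(n, L, R, score=1.0):
    """Kirchhoff matrix over words 1..n, root row/column already struck,
    with one shared score on every allowed edge, as in the example above.
    Assumed convention: a child may sit at most L positions left of its
    parent or R positions right of it; the root may attach to any word."""
    K = np.zeros((n, n))
    for child in range(1, n + 1):
        K[child - 1, child - 1] += score           # edge from the root
        for parent in range(1, n + 1):
            allowed = parent != child and -R <= parent - child <= L
            if allowed:
                K[child - 1, child - 1] += score   # column sums on the diagonal
                K[parent - 1, child - 1] -= score  # negated edge score
    return K  # nonzero only within L + R + 1 central diagonals
```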

Page 18: Probabilistic Models of Nonprojective Dependency Trees


Conclusions

• O(n3) inference for edge-factored nonprojective dependency models

• Performance closely comparable to MIRA

• Learned edge clustering doesn’t seem to help unlabeled parsing

• Many other applications to hit

Page 19: Probabilistic Models of Nonprojective Dependency Trees

Thanks

Jason Eisner
Keith Hall
Sanjeev Khudanpur
The Anonymous Reviewers
Ryan McDonald, Michael Collins, and colleagues, for sharing drafts