
Probabilistic Models of Nonprojective Dependency Trees


Page 1: Probabilistic Models of Nonprojective Dependency Trees

28 June 2007, EMNLP-CoNLL

Probabilistic Models of Nonprojective Dependency Trees

David A. Smith
Center for Language and Speech Processing
Computer Science Dept.
Johns Hopkins University

Noah A. Smith
Language Technologies Institute
Machine Learning Dept.
School of Computer Science
Carnegie Mellon University

Page 2: Probabilistic Models of Nonprojective Dependency Trees

See Also

On the Complexity of Non-Projective Data-Driven Dependency Parsing. R. McDonald and G. Satta. IWPT 2007.

Structured Prediction Models via the Matrix-Tree Theorem. T. Koo, A. Globerson, X. Carreras and M. Collins. EMNLP-CoNLL 2007. (Coming up next!)

Page 3: Probabilistic Models of Nonprojective Dependency Trees

Nonprojective Syntax

ROOT  ista      meam    norit     gloria     canitiem
      that-NOM  my-ACC  may-know  glory-NOM  going-gray-ACC
      'That glory shall last till I go gray'

ROOT  I 'll give a talk tomorrow on bootstrapping

How would we parse these? In both sentences the dependency arcs cross: the NOM and ACC pairs interleave in the Latin, and "on bootstrapping" modifies "talk" across "tomorrow" in the English.

Page 4: Probabilistic Models of Nonprojective Dependency Trees

Edge-Factored Models (McDonald et al., 2005)

Each edge gets a non-negative score in isolation:

$$s(i,j) = \exp[\mathbf{w} \cdot \mathbf{f}(i,j)]$$

Collect the scores in a matrix whose rows index parents j and columns index children i (column 0 is zero because the root has no parent):

$$\begin{bmatrix}
0 & s(1,0) & s(2,0) & \cdots & s(n,0) \\
0 & 0      & s(2,1) & \cdots & s(n,1) \\
0 & s(1,2) & 0      & \cdots & s(n,2) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & s(1,n) & s(2,n) & \cdots & 0
\end{bmatrix}$$

Find the best product of edge scores among legal trees (a sum in log space):

$$\max_{y \in \mathcal{T}} \prod_{(i,j) \in y} s(i,j)$$

• Score edges in isolation
• Find the maximum spanning tree with Chu-Liu-Edmonds
• NP-hard to add sibling or degree constraints, or hidden node variables
• What about training? Unlabeled parsing for now; a scoring sketch follows below.
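A minimal NumPy sketch of edge-factored scoring (my illustration, not the authors' code): the dense feature array `F` and the Chu-Liu-Edmonds decoder `cle_decode` are hypothetical stand-ins.

```python
import numpy as np

def edge_scores(F, w):
    """s(i, j) = exp(w . f(i, j)) for every parent j / child i pair.

    F: (n+1, n+1, d) array with F[j, i] = feature vector of the edge from
       parent j to child i (hypothetical layout); w: (d,) weight vector.
    Returns S with S[j, i] = s(i, j).
    """
    return np.exp(F @ w)

# Decoding hands S to any maximum-spanning-tree routine, e.g.:
#   heads = cle_decode(edge_scores(F, w))   # cle_decode: hypothetical Chu-Liu-Edmonds decoder
```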

Page 5: Probabilistic Models of Nonprojective Dependency Trees

If Only It Were Projective…

ROOT  I 'll give a talk tomorrow on bootstrapping

The Inside-Outside algorithm would give us:

• the normalizing constant for globally normalized models, $p(y \mid x) = Z_x^{-1} e^{\mathbf{w} \cdot \mathbf{f}(x,y)}$
• posterior probabilities of edges
• sums over hidden variables, $p(y \mid x) = \sum_z p(y, z \mid x)$

But we can't use Inside-Outside for nonprojective parsing!

Page 6: Probabilistic Models of Nonprojective Dependency Trees

Graph Theory to the Rescue!

Tutte's Matrix-Tree Theorem (1948): The determinant of the Kirchhoff (aka Laplacian) adjacency matrix of a directed graph G, with row and column r struck out, is equal to the sum of scores of all directed spanning trees of G rooted at node r.

Exactly the Z we need, in O(n³) time!

Page 7: Probabilistic Models of Nonprojective Dependency Trees

Building the Kirchhoff (Laplacian) Matrix

Negate the edge scores:

$$\begin{bmatrix}
0 & -s(1,0) & -s(2,0) & \cdots & -s(n,0) \\
0 & 0 & -s(2,1) & \cdots & -s(n,1) \\
0 & -s(1,2) & 0 & \cdots & -s(n,2) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & -s(1,n) & -s(2,n) & \cdots & 0
\end{bmatrix}$$

Put each column sum (each child's total incoming score) on the diagonal:

$$\begin{bmatrix}
0 & -s(1,0) & -s(2,0) & \cdots & -s(n,0) \\
0 & \sum_{j \neq 1} s(1,j) & -s(2,1) & \cdots & -s(n,1) \\
0 & -s(1,2) & \sum_{j \neq 2} s(2,j) & \cdots & -s(n,2) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & -s(1,n) & -s(2,n) & \cdots & \sum_{j \neq n} s(n,j)
\end{bmatrix}$$

Strike out the root row and column, and take the determinant:

$$\begin{vmatrix}
\sum_{j \neq 1} s(1,j) & -s(2,1) & \cdots & -s(n,1) \\
-s(1,2) & \sum_{j \neq 2} s(2,j) & \cdots & -s(n,2) \\
\vdots & \vdots & \ddots & \vdots \\
-s(1,n) & -s(2,n) & \cdots & \sum_{j \neq n} s(n,j)
\end{vmatrix}$$

N.B.: This allows multiple children of the root, but see Koo et al. 2007.
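The construction in NumPy, continuing the sketch above under the same `S[j, i] = s(i, j)` layout (my illustration, not the paper's code):

```python
def partition_function(S):
    """Z = total score of all directed spanning trees rooted at node 0,
    via the Matrix-Tree Theorem.  S[j, i] = s(i, j) >= 0; diagonal ignored.
    """
    n = S.shape[0] - 1
    A = S.copy()
    np.fill_diagonal(A, 0.0)                       # no self-loops
    K = -A[1:, 1:]                                 # negated scores between words 1..n
    K[np.diag_indices(n)] = A[:, 1:].sum(axis=0)   # diagonal: each child's total incoming score
    return np.linalg.det(K)                        # root row/column already struck

# Sanity check: two words, all scores 1 -> exactly 3 rooted spanning trees.
assert np.isclose(partition_function(np.ones((3, 3))), 3.0)
```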

Page 8: Probabilistic Models of Nonprojective Dependency Trees

Why Should This Work?

Chu-Liu-Edmonds analogy: every node selects its best parent; if cycles result, contract them and recur.

The determinant obeys the same contract-or-delete recursion (with K as on the previous slide). Let $K'$ be $K$ with the edge $(1,2)$ contracted, and $K'' \equiv K(\{1,2\} \mid \{1,2\})$ be $K$ with rows and columns 1 and 2 struck out. Then

$$|K| = s(1,2)\,|K'| + |K''|$$

Clear for a 1×1 matrix; use induction. (This argument covers the undirected case; the directed case needs special handling at the root.)
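A quick concrete check with n = 2 (my worked example): expanding the 2×2 determinant reproduces exactly the three spanning trees rooted at 0.

$$\begin{vmatrix} s(1,0)+s(1,2) & -s(2,1) \\ -s(1,2) & s(2,0)+s(2,1) \end{vmatrix} = s(1,0)\,s(2,0) + s(1,0)\,s(2,1) + s(2,0)\,s(1,2)$$

The three terms score the trees $\{0\to1,\,0\to2\}$, $\{0\to1,\,1\to2\}$, and $\{0\to2,\,2\to1\}$; the cross term $s(1,2)\,s(2,1)$, the illegal 2-cycle, cancels out.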

Page 9: Probabilistic Models of Nonprojective Dependency Trees

When You Have a Hammer…

Applications of the Matrix-Tree Theorem:

• Sequence-normalized log-linear models (Lafferty et al. '01)
• Minimum Bayes-risk parsing (cf. Goodman '96)
• Hidden-variable models
• O(n) inference with length constraints (cf. N. Smith & Eisner '05)
• Minimum-risk training (D. Smith & Eisner '06)
• Tree (Rényi) entropy (Hwa '01; S & E '07)

Page 10: Probabilistic Models of Nonprojective Dependency Trees

Analogy to Other Models

Where does global log-linear training exist for each class of structures?

• Sequence models: standard (e.g., conditional log-linear training of CRFs).
• Projective trees: PCFGs; shift-reduce (action-based) parsers; max-margin or error-driven training (e.g. McDonald, Collins).
• Nonprojective trees: parent-predicting, action-based models (K. Hall '07); global log-linear training is this work.

Page 11: Probabilistic Models of Nonprojective Dependency Trees

More Machinery: The Gradient

Since

$$\frac{\partial \log |A|}{\partial [A]_{j,i}} = \left[A^{-1}\right]_{i,j},$$

the gradient of $\log Z$ with respect to each (log) edge score is

$$\frac{\partial \log Z_\theta(x)}{\partial \log s(i,j)} = s(i,j)\left(\left[K^{-1}\right]_{i,i} - \left[K^{-1}\right]_{i,j}\right), \qquad 1 \le i \le n,\ 0 \le j \le n$$

(for $j = 0$ only the first term applies, since the root row and column were struck).

• Invert the Kirchhoff matrix K in O(n³) time via LU factorization.
• The edge gradient is also the edge posterior probability.
• Use the chain rule to backpropagate into s(i,j), whatever its internal structure may be.
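Continuing the NumPy sketch (still my illustration), the posterior of every edge from the inverse Kirchhoff matrix:

```python
def edge_posteriors(S):
    """P[j, i] = posterior probability that the edge parent j -> child i
    appears in the tree, given the sentence.  Same layout as above."""
    n = S.shape[0] - 1
    A = S.copy()
    np.fill_diagonal(A, 0.0)
    K = -A[1:, 1:]
    K[np.diag_indices(n)] = A[:, 1:].sum(axis=0)
    Kinv = np.linalg.inv(K)                        # O(n^3), via LU factorization
    P = np.zeros_like(A)
    for i in range(1, n + 1):                      # child i
        P[0, i] = A[0, i] * Kinv[i - 1, i - 1]     # root edge: diagonal term only
        for j in range(1, n + 1):                  # parent j
            if j != i:
                P[j, i] = A[j, i] * (Kinv[i - 1, i - 1] - Kinv[i - 1, j - 1])
    return P

# Each word's possible parents' posteriors sum to one.
assert np.allclose(edge_posteriors(np.ones((3, 3)))[:, 1:].sum(axis=0), 1.0)
```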

Page 12: Probabilistic Models of Nonprojective Dependency Trees

Nonprojective Conditional Log-Linear Training

  train   Arabic   Czech   Danish   Dutch
  MIRA    79.9     81.4    86.6     90.0
  CL      80.4     80.2    87.5     90.0

• Data: CoNLL 2006 Danish and Dutch; CoNLL 2007 Arabic and Czech
• Features from McDonald et al. 2005
• Compared with MSTParser's MIRA max-margin training
• Conditional log-linear (CL) weights for $p(y \mid x) = Z_x^{-1} e^{\mathbf{w} \cdot \mathbf{f}(x,y)}$ trained with stochastic gradient descent
• Same number of iterations and stopping criteria as MIRA
• Significance assessed with a paired permutation test
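The SGD update this training uses needs the log-likelihood gradient, observed minus expected features; a sketch building on the functions above (the dense feature layout `F` and gold-head array `heads` are hypothetical):

```python
def loglik_gradient(F, heads, w):
    """Gradient of log p(y | x) for one sentence: gold features minus
    posterior-expected features.  heads[i] = gold parent of child i (i >= 1)."""
    S = np.exp(F @ w)                                  # s(i, j) = exp(w . f(i, j))
    P = edge_posteriors(S)
    observed = sum(F[heads[i], i] for i in range(1, len(heads)))
    expected = np.tensordot(P, F, axes=([0, 1], [0, 1]))
    return observed - expected                         # take a small SGD step along this
```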

Page 13: Probabilistic Models of Nonprojective Dependency Trees

Minimum Bayes-Risk Parsing

Select not the tree with the highest probability, but the tree with the most expected correct edges:

$$\hat{y} = \operatorname*{argmin}_{y} \; \mathbb{E}_{p(y' \mid x)}\!\left[-\sum_{i=1}^{n} \delta(y_i, y'_i)\right] = \operatorname*{argmax}_{y} \sum_{i=1}^{n} p(y_i \in \text{parse of } x \mid x)$$

  parse   train   Arabic   Czech   Danish   Dutch
  map     MIRA    79.9     81.4    86.6     90.0
  map     CL      80.4     80.2    87.5     90.0
  mBr     MIRA    79.4     80.3    85.0     87.2
  mBr     CL      80.5     80.4    87.5     90.0

• Plug the posteriors into MST decoding.
• MIRA doesn't estimate probabilities.
• N.B. One could do mBr inside MIRA.
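As a sketch (reusing the hypothetical `cle_decode` from earlier), minimum Bayes-risk decoding is just maximum-spanning-tree decoding over posteriors rather than raw scores:

```python
def mbr_parse(S, cle_decode):
    """Tree maximizing the expected number of correct edges: run any
    Chu-Liu-Edmonds decoder on the posterior matrix instead of the scores."""
    return cle_decode(edge_posteriors(S))
```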

Page 14: Probabilistic Models of Nonprojective Dependency Trees

Edge Clustering

[Figure: "Franz loves Milena" parsed twice: once with supervised dependency labels (SUBJ, OBJ), once with latent edge clusters (A, B, C or X, Y, Z).]

Simple idea: conjoin each model feature with a cluster.

Sum out all possible edge labelings if we don't care about labels per se.
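A sketch of the summed-out clustered score (my illustration): conjoining every feature with a cluster is equivalent to giving each cluster its own weight vector over the same features, then summing the resulting edge scores over clusters.

```python
def clustered_edge_scores(F, W):
    """s(i, j) = sum over clusters c of exp(w_c . f(i, j)).
    W: (k, d) matrix, one (hypothetical) weight row per cluster."""
    return np.exp(F @ W.T).sum(axis=-1)
```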

Page 15: Probabilistic Models of Nonprojective Dependency Trees

Edge Clustering

[Bar chart: unlabeled attachment accuracy (y-axis 74 to 92) on Arabic, Danish, and Dutch with 1, 2, 16, and 32 edge clusters.]

No significant gains or losses from clustering.

Page 16: Probabilistic Models of Nonprojective Dependency Trees

What's Wrong with Edge Clustering?

Edge labels don't interact with one another, unlike clusters on PCFG nonterminals (e.g. Matsuzaki et al. '05): in a rewrite rule like NP-A → NP-B NP-A, the clusters constrain each other, whereas the clusters on the two edges of "Franz loves Milena" are chosen independently.

Cf. the small or no gains in unlabeled accuracy from supervised labeled parsers.

Page 17: Probabilistic Models of Nonprojective Dependency Trees

Constraints on Link Length

• Maximum left/right child distances L and R (cf. Eisner & N. Smith '05)
• The Kirchhoff matrix is band-diagonal once the root row and column are removed
• Inversion in O(min(L³R², L²R³) · n) time (see the sketch below)

Example with L = 1, R = 2, all allowed edges scored 1:

$$\begin{bmatrix}
0 & -1 & -1 & -1 & -1 & -1 & -1 & -1 & -1 & -1 \\
0 & 2 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & -1 & 3 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & -1 & -1 & 4 & -1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & -1 & -1 & 4 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -1 & -1 & 4 & -1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & -1 & -1 & 4 & -1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & -1 & -1 & 4 & -1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & -1 & -1 & 3 & -1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 & -1 & 2
\end{bmatrix}$$
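A sketch of the banded construction (my illustration; the exact convention by which L and R cap attachment distance is my assumption, so treat the `allowed` test accordingly). Because the struck matrix is zero outside a band of width L + R + 1, a banded LU factorization, e.g. scipy.linalg.solve_banded, handles it in time linear in n:

```python
import numpy as np

def banded_kirchhoff(n, L, R, score=1.0):
    """Kirchhoff matrix over words 1..n, root row/column already struck,
    with one shared score on every allowed edge, as in the example above.
    Assumed convention: a child may sit at most L positions left of its
    parent or R positions right of it; the root may attach to any word."""
    K = np.zeros((n, n))
    for child in range(1, n + 1):
        K[child - 1, child - 1] += score           # edge from the root
        for parent in range(1, n + 1):
            allowed = parent != child and -R <= parent - child <= L
            if allowed:
                K[child - 1, child - 1] += score   # column sums on the diagonal
                K[parent - 1, child - 1] -= score  # negated edge score
    return K  # nonzero only within L + R + 1 central diagonals
```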

Page 18: Probabilistic Models of Nonprojective Dependency Trees


Conclusions

• O(n3) inference for edge-factored nonprojective dependency models

• Performance closely comparable to MIRA

• Learned edge clustering doesn’t seem to help unlabeled parsing

• Many other applications to hit

Page 19: Probabilistic Models of Nonprojective Dependency Trees

Thanks

Jason Eisner
Keith Hall
Sanjeev Khudanpur
The Anonymous Reviewers
Ryan McDonald, Michael Collins, and colleagues, for sharing drafts