Integrated Supertagging and Parsing
Michael Auli, University of Edinburgh


Page 1

Integrated Supertagging and Parsing

Michael Auli, University of Edinburgh

Page 2

Marcel proved completeness

Parsing

Page 3

Marcel proved completeness

Parsing

NP   VBD   NP

VP

S

Page 4

Marcel proved completeness

CCG Parsing

NP   (S\NP)/NP   NP

S\NP

S

Combinatory Categorial Grammar (CCG; Steedman 2000)

Page 5

Marcel proved completeness

CCG Parsing

NP   (S\NP)/NP   NP

S\NP

S

<proved, (S\NP)/NP, completeness>

Page 6

Marcel proved completeness

CCG Parsing

NP   (S\NP)/NP   NP

S\NP

S

<proved, (S\NP)/NP, completeness>
<proved, (S\NP)/NP, Marcel>

Page 7

Why CCG Parsing?

• MT: Can analyse nearly any span in a sentence (Auli ’09; Mehay ‘10; Zhang & Clark 2011; Weese et al. ’12)
  e.g. “conjectured and proved completeness” ⊢ S\NP

• Composition of regular and context-free languages -- mirrors the situation in syntactic MT (Auli & Lopez, ACL 2011)

• Transparent interface to semantics (Bos et al. 2004), e.g. proved ⊢ (S\NP)/NP : λx.λy.proved’ x y (a small illustration follows below)
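To make the last point concrete, here is a tiny illustration of the syntax-semantics interface; it is my own sketch, not from the talk, and the nested Python functions merely stand in for the lambda term λx.λy.proved’ x y.

```python
# Illustrative only: the category (S\NP)/NP for "proved" paired with the
# lambda term \x.\y.proved'(x, y); all names are made up for the example.

def proved(x):                    # (S\NP)/NP : consumes the object NP first ...
    def s_backslash_np(y):        # ... yielding S\NP : \y.proved'(x, y)
        return ("proved'", y, x)  # simple predicate-argument structure (pred, subj, obj)
    return s_backslash_np

completeness = "completeness'"    # NP
marcel = "Marcel'"                # NP

vp = proved(completeness)         # forward application (>):  (S\NP)/NP  NP  =>  S\NP
s = vp(marcel)                    # backward application (<): NP  S\NP  =>  S
print(s)                          # ("proved'", "Marcel'", "completeness'")
```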

Page 8

Marcel proved completeness

CCG Parsing is hard!

Over 22 tags per word! (Clark & Curran 2004)

NP   (S\NP)/NP   NP

S\NP   (>)

S   (<)

Page 9

Marcel proved completeness

Supertagging

Page 10

Marcel proved completeness

Supertagging

NP   (S\NP)/NP   NP

Page 11

Marcel proved completeness

Supertagging

NP   (S\NP)/NP   NP

S\NP   (>)

S   (<)

Page 12

Supertagging

time   flies   like   an   arrow
NP     S\NP    (S\NP)/NP   NP/NP   NP

Page 13

Supertagging

time   flies   like   an   arrow
NP     S\NP    (S\NP)/NP ✗   NP/NP   NP

Page 14

The Problem

• Supertagger has no sense of overall grammaticality.

• But parser restricted by its decisions.

• Supertagger probabilities not used in parser.


Page 19

The Problem

• Supertagger has no sense of overall grammaticality.

• But parser restricted by its decisions.

• Supertagger probabilities not used in parser.

[Pipeline diagram: supertagger → supertag sequences → parser]

Page 20

This talk

• Analysis of the state-of-the-art approach: trade-off between efficiency and accuracy (ACL 2011a)

• Integrated supertagging and parsing with Loopy Belief Propagation and Dual Decomposition (ACL 2011b)

• Training the integrated model with Softmax-Margin towards task-specific metrics (EMNLP 2011)

Methods achieve most accurate CCG parsing results.


Page 22

Adaptive Supertagging

time   flies   like   an   arrow
NP     S\NP    (S\NP)/NP   NP/NP   NP

Page 23

Adaptive Supertagging

time   flies   like   an   arrow
NP     S\NP    (S\NP)/NP   NP/NP   NP

[Additional candidate categories appear under each word as the beam is relaxed, e.g. ((S\NP)\(S\NP))/NP, NP/NP, NP, ...]

Page 24

Clark & Curran (2004)

Adaptive Supertagging


Page 26

• Algorithm (a sketch in code follows below):

• Run the supertagger.

• Return tags with posterior higher than some alpha.

• Parse by combining tags (CKY).

• If parsing succeeds, stop.

• If parsing fails, lower alpha and repeat.

• Q: are parses returned in early rounds suboptimal?

Clark & Curran (2004)

Adaptive Supertagging
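The following is a minimal sketch of the adaptive supertagging loop described above; `tag_posteriors` and `cky_parse` are assumed callables supplied by the user, and this is not the C&C implementation.

```python
# A minimal sketch of adaptive supertagging, assuming:
#   tag_posteriors(sentence) -> list of {category: posterior} per word
#   cky_parse(sentence, candidates) -> parse object or None

def adaptive_supertag_parse(sentence, tag_posteriors, cky_parse,
                            betas=(0.075, 0.03, 0.01, 0.005, 0.001)):
    posteriors = tag_posteriors(sentence)
    for beta in betas:                                # beam widths, tight to loose
        candidates = [{cat for cat, p in word.items() if p >= beta}
                      for word in posteriors]         # prune low-posterior categories
        parse = cky_parse(sentence, candidates)       # combine the surviving tags (CKY)
        if parse is not None:                         # success: stop at this beam
            return parse
    return None                                       # give up after the last beam
```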

Page 27

Answer...

[Chart: labelled F-score (92-100 scale) under a tight vs a loose supertagger beam, for oracle parsing (Huang 2008) and standard parsing (Clark and Curran 2007)]


Page 30

Answer...

[Chart: labelled F-score under a tight vs a loose supertagger beam; oracle parsing (Huang 2008) on a 92-100 scale and standard parsing (Clark and Curran 2007) on an 85-90 scale]

Page 31

Parsing

Note: only sentences parsable at all beam settings.

[Chart: parser model score and labelled F-score as the supertagger beam is relaxed from 0.075 (most aggressive) to 0.00001 (least aggressive)]


Page 33

Oracle Parsing

Note: only sentences parsable at all beam settings.

[Chart: oracle model score and labelled F-score as the supertagger beam is relaxed from 0.075 (most aggressive) to 0.00001 (least aggressive)]

Page 34

What’s happening here?

Page 35

What’s happening here?

• Supertagger keeps parser from making serious errors.

Page 36

What’s happening here?

• Supertagger keeps parser from making serious errors.

• But it also occasionally prunes away useful parses.

Page 37

What’s happening here?

• Supertagger keeps parser from making serious errors.

• But it also occasionally prunes away useful parses.

• Why not combine supertagger and parser into one?

Page 38

Overview

• Analysis of the state-of-the-art approach: trade-off between efficiency and accuracy (ACL 2011a)

• Integrated supertagging and parsing with Loopy Belief Propagation and Dual Decomposition (ACL 2011b)

• Training the integrated model with Softmax-Margin towards task-specific metrics (EMNLP 2011)


Page 40

Integrated Model

• Supertagger & parser are log-linear models.

• Idea: combine their features into one model.

• Problem: Exact computation of marginal or maximum quantities becomes very expensive because parsing and tagging submodels must agree on the tag sequence.

Page 41

Integrated Model

• Supertagger & parser are log-linear models.

• Idea: combine their features into one model.

• Problem: Exact computation of marginal or maximum quantities becomes very expensive because parsing and tagging submodels must agree on the tag sequence.

original parsing problem:   B C → A   O(Gn^3)

Page 42

Integrated Model

• Supertagger & parser are log-linear models.

• Idea: combine their features into one model.

• Problem: Exact computation of marginal or maximum quantities becomes very expensive because parsing and tagging submodels must agree on the tag sequence.

original parsing problem:   B C → A   O(Gn^3)

new parsing problem:   qBs sCr → qAr   O(G^3 n^3)

Page 43

Integrated Model

• Supertagger & parser are log-linear models.

• Idea: combine their features into one model.

• Problem: Exact computation of marginal or maximum quantities becomes very expensive because parsing and tagging submodels must agree on the tag sequence.

Intersection of a regular and a context-free language (Bar-Hillel et al. 1964)

original parsing problem:   B C → A   O(Gn^3)

new parsing problem:   qBs sCr → qAr   O(G^3 n^3)

Page 44

Approximate Algorithms

• Loopy belief propagation: approximate calculation of marginals. (Pearl 1988; Smith & Eisner 2008)

• Dual decomposition: exact (sometimes) calculation of maximum. (Dantzig & Wolfe 1960; Komodakis et al. 2007; Koo et al. 2010)

Page 45

Belief Propagation

Page 46

Belief Propagation

Forward-backward is belief propagation (Smyth et al. 1997)

Page 47

Belief Propagation

Marcel proved completeness

start stop

Forward-backward is belief propagation (Smyth et al. 1997)

Page 48

Belief Propagation

Marcel proved completeness

start  stop

emission messages: e_{i,j}   (e_{1,j}, e_{2,j}, e_{3,j})

forward messages (f_{1,j}, f_{2,j}, f_{3,j}):   f_{i,j} = Σ_{j'} f_{i-1,j'} · e_{i-1,j'} · t_{j',j}

backward messages (b_{1,j}, b_{2,j}, b_{3,j}):   b_{i,j} = Σ_{j'} b_{i+1,j'} · e_{i+1,j'} · t_{j,j'}

belief (probability) that tag j is at position i:   p_{i,j} = (1/Z) · f_{i,j} · e_{i,j} · b_{i,j}

Forward-backward is belief propagation (Smyth et al. 1997)
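A small runnable sketch of these forward-backward messages on a toy tag set; the emission and transition scores below are invented for illustration and are not the talk's supertagger.

```python
import numpy as np

# e[i, j]: emission score of tag j at position i; t[j, j2]: transition score.
def tag_beliefs(e, t):
    n, k = e.shape
    f = np.zeros((n, k))            # forward messages  f[i, j]
    b = np.zeros((n, k))            # backward messages b[i, j]
    f[0] = 1.0                      # uniform start message
    b[-1] = 1.0                     # uniform stop message
    for i in range(1, n):           # f[i, j] = sum_j' f[i-1, j'] * e[i-1, j'] * t[j', j]
        f[i] = (f[i - 1] * e[i - 1]) @ t
    for i in range(n - 2, -1, -1):  # b[i, j] = sum_j' b[i+1, j'] * e[i+1, j'] * t[j, j']
        b[i] = t @ (b[i + 1] * e[i + 1])
    beliefs = f * e * b             # unnormalised p[i, j]
    return beliefs / beliefs.sum(axis=1, keepdims=True)   # normalise (divide by Z)

# e.g. three words ("Marcel proved completeness") and two candidate tags:
p = tag_beliefs(np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]),
                np.array([[0.6, 0.4], [0.5, 0.5]]))
```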

Page 49

Belief Propagation

Marcel proved completeness

Notational convenience: one factor describes the whole distribution over the supertag sequence...

Page 50

span variables

Belief Propagation

Marcel proved completeness

We can also do the same for the distribution over parse trees (Case-factor diagrams: McAllester et al. 2008)

0S3   0NP3

0NP2   1S\NP3

0NP1   1(S\NP)/NP2   2NP3

Page 51

span variables

Belief Propagation

Marcel proved completeness

We can also do the same for the distribution over parse trees (Case-factor diagrams: McAllester et al. 2008)

Inside-outside is belief propagation (Sato 2007)

0S3   0NP3

0NP2   1S\NP3

0NP1   1(S\NP)/NP2   2NP3

Page 52

Belief Propagation

Marcel proved completeness

parsing factor

supertagging factor

Page 53

Belief Propagation

Marcel proved completeness

parsing factor

supertagging factor

Graph is not a tree!

Page 54

Loopy Belief Propagation

Marcel proved completeness

Graph is not a tree!

parsing factor

supertagging factor

Page 55

Loopy Belief Propagation

Marcel proved completeness

Graph is not a tree!

forward-backward

parsing factor

supertagging factor

Page 56

Loopy Belief Propagation

Marcel proved completeness

Graph is not a tree!

forward-backward

inside-outside

parsing factor

supertagging factor


Page 58

Loopy Belief Propagation

Marcel proved completeness

Graph is not a tree!

p_{i,j} = (1/Z) · f_{i,j} · e_{i,j} · b_{i,j} · o_{i,j}

forward-backward

inside-outside

parsing factor

supertagging factor


Page 60

Loopy Belief Propagation

• Computes approximate marginals, no guarantees.

• Complexity is additive: O(Gn^3 + Gn)

• Used to compute minimum-risk parse (Goodman 1996).
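As a rough sketch of how such a loopy schedule can be organised, the loop below alternates the two tree-structured propagators and multiplies their outgoing messages into the tag beliefs. `forward_backward` and `inside_outside` are assumed callables (each mapping per-position, per-tag incoming messages to outgoing messages); nothing here is the talk's actual implementation.

```python
import numpy as np

def loopy_bp(n_words, n_tags, forward_backward, inside_outside, iters=5):
    to_parser = np.ones((n_words, n_tags))   # messages: supertagging factor -> tag variables
    to_tagger = np.ones((n_words, n_tags))   # messages: parsing factor -> tag variables
    for _ in range(iters):                   # no convergence guarantee on a loopy graph
        to_parser = forward_backward(to_tagger)  # f * e * b, given the parser's messages
        to_tagger = inside_outside(to_parser)    # outside marginals o, given the tagger's messages
    beliefs = to_parser * to_tagger          # p_{i,j} proportional to f e b o
    return beliefs / beliefs.sum(axis=1, keepdims=True)
```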

Page 61

Dual Decomposition

Marcel proved completeness

parsing factor

supertagging factor

Page 62

Dual Decomposition

Marcel proved completeness

parsing factor

supertagging factor

f(y)

g(z)

Page 63

Dual Decomposition

Marcel proved completeness

parsing factor

supertagging factor

f(y)

g(z)

argmax_{y,z}  f(y) + g(z)    s.t.  y(i,t) = z(i,t)  for all i, t

Page 64

Dual Decomposition

argmax_{y,z}  f(y) + g(z)    s.t.  y(i,t) = z(i,t)  for all i, t

Page 65

Dual Decomposition

argmax_{y,z}  f(y) + g(z)    s.t.  y(i,t) = z(i,t)  for all i, t

L(u) = max_y [ f(y) + Σ_{i,t} u(i,t) · y(i,t) ]  +  max_z [ g(z) - Σ_{i,t} u(i,t) · z(i,t) ]

Page 66

Dual Decomposition

argmax_{y,z}  f(y) + g(z)    s.t.  y(i,t) = z(i,t)  for all i, t

relaxed original problem:

L(u) = max_y [ f(y) + Σ_{i,t} u(i,t) · y(i,t) ]  +  max_z [ g(z) - Σ_{i,t} u(i,t) · z(i,t) ]

Page 67

Dual Decomposition

argmax_{y,z}  f(y) + g(z)    s.t.  y(i,t) = z(i,t)  for all i, t

L(u) = max_y [ f(y) + Σ_{i,t} u(i,t) · y(i,t) ]  +  max_z [ g(z) - Σ_{i,t} u(i,t) · z(i,t) ]

(each max term is a modified subproblem)

Page 68

Dual Decomposition

argmax_{y,z}  f(y) + g(z)    s.t.  y(i,t) = z(i,t)  for all i, t

L(u) = max_y [ f(y) + Σ_{i,t} u(i,t) · y(i,t) ]  +  max_z [ g(z) - Σ_{i,t} u(i,t) · z(i,t) ]

Dual objective: find the assignment of u(i,t) that minimises L(u)

Page 69

Dual Decomposition

argmax_{y,z}  f(y) + g(z)    s.t.  y(i,t) = z(i,t)  for all i, t

L(u) = max_y [ f(y) + Σ_{i,t} u(i,t) · y(i,t) ]  +  max_z [ g(z) - Σ_{i,t} u(i,t) · z(i,t) ]

Dual objective: find the assignment of u(i,t) that minimises L(u)

Subgradient update:   u(i,t) ← u(i,t) - α · [y(i,t) - z(i,t)]    (Rush et al. 2010)

Solution provably solves original problem.
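A schematic version of the resulting subgradient loop is sketched below. `viterbi_parse` and `viterbi_tags` are assumed oracles for the two modified subproblems, and the step-size schedule and fall-back behaviour are illustrative choices, not the ACL 2011 settings.

```python
import numpy as np

# viterbi_parse(u) -> 0/1 array y[i, t]: the tag chosen at each position by the
#   parser when maximising f(y) + sum_{i,t} u(i,t) y(i,t).
# viterbi_tags(u)  -> 0/1 array z[i, t]: the supertagger's choice when
#   maximising g(z) - sum_{i,t} u(i,t) z(i,t).

def dual_decompose(viterbi_parse, viterbi_tags, n_words, n_tags,
                   iterations=25, step=1.0):
    u = np.zeros((n_words, n_tags))        # dual variables u(i, t)
    y = None
    for k in range(1, iterations + 1):
        y = viterbi_parse(u)
        z = viterbi_tags(u)
        if np.array_equal(y, z):           # subproblems agree: certificate of exactness
            return y, True
        u -= (step / k) * (y - z)          # subgradient step on L(u)
    return y, False                        # no agreement: fall back to the last parse
```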

Page 70

Dual Decomposition

Marcel proved completeness

parsing factor

supertagging factor

Page 71

Dual Decomposition

Marcel proved completeness

parsing factor

supertagging factor Viterbi tags

Viterbi parse

Page 72

Dual Decomposition

Marcel proved completeness

parsing factor

supertagging factor Viterbi tags

Viterbi parse

“Message passing” (Komodakis et al. 2007)

Page 73

Dual Decomposition

• Computes the exact maximum, if it converges.

• Otherwise: return best parse seen (approximation).

• Complexity is additive: O(Gn^3 + Gn)

• Used to compute Viterbi solutions.

Page 74

Experiments

• Standard parsing task:

• C&C parser and supertagger (Clark & Curran 2007).

• CCGBank standard train/dev/test splits.

• Piecewise optimisation (Sutton and McCallum 2005).

• Approximate algorithms used to decode the test set.

Page 75

Experiments: Accuracy over time

Page 76

Experiments: Accuracy over time

tight search (AST)

loose search (Rev)

Page 77

Experiments: Convergence

Page 78

Experiments: Convergence

Dual decomposition is exact in 99.7% of cases. What about belief propagation?

Page 79

Experiments: BP Exactness


Page 81

Experiments: BP Exactness

[Chart: percentage of solutions (90-100%) matching the final k=1000 DD solution, for DD and BP, over 1 to 1000 iterations]

Instantly, 91% match final DD solutions!

Takes DD 15 iterations to reach same level.

Page 82

Experiments: Accuracy

Test set results: labelled F-measure (87-89 scale) for Baseline, Belief Propagation and Dual Decomposition under a tight and a loose beam.


Page 85

Experiments: Accuracy

Test set results (labelled F-measure):

                       Tight beam   Loose beam
Baseline                  87.7         87.7
Belief Propagation        88.1         88.8
Dual Decomposition        88.3         88.9

Note: BP accuracy after 1 iteration; DD accuracy after 25 iterations.

+1.1 over the baseline; best published result.

Page 86

Oracle Results Again

[Two charts: model score and labelled F-score across supertagger beams from 0.075 (most aggressive) to 0.00001 (least aggressive); one for Belief Propagation (F-score roughly 89.4-90.0) and one for Dual Decomposition (F-score roughly 89.2-89.9)]

Page 87

Summary so far

• Supertagging efficiency comes at the cost of accuracy.

• Interaction between parser and supertagger can be exploited in an integrated model.

• Practical inference for the complex integrated model.

• First empirical comparison between dual decomposition and belief propagation on an NLP task.

• Loopy belief propagation is fast, accurate and exact.

Page 88

Overview

• Analysis of the state-of-the-art approach: trade-off between efficiency and accuracy (ACL 2011a)

• Integrated supertagging and parsing with Loopy Belief Propagation and Dual Decomposition (ACL 2011b)

• Training the integrated model with Softmax-Margin towards task-specific metrics (EMNLP 2011)


Page 90

Training the Integrated Model

• So far we optimised Conditional Log-Likelihood (CLL).

• Optimise towards a task-specific metric, e.g. F1, as in SMT (Och, 2003).

• Past work used approximations to Precision (Taskar et al. 2004).

• Contribution: do it exactly and verify the approximations.

Page 91

Marcel proved completeness

Parsing Metrics

NP   (S\NP)/NP   NP

S\NP

S

CCG: Labelled, directed dependency recovery (Clark & Hockenmaier, 2002)

<proved, (S\NP)/NP, completeness>
<proved, (S\NP)/NP, Marcel>

Page 92

Evaluate this:

Marcel proved completeness

Parsing Metrics

NP   (S\NP)/NP   NP

S\NP

S

CCG: Labelled, directed dependency recovery (Clark & Hockenmaier, 2002)

<proved, (S\NP)/NP, completeness>
<proved, (S\NP)/NP, Marcel>

Page 93

Not this!

Marcel proved completeness

Parsing Metrics

NP   (S\NP)/NP   NP

S\NP

S

CCG: Labelled, directed dependency recovery (Clark & Hockenmaier, 2002)

<proved, (S\NP)/NP, completeness>
<proved, (S\NP)/NP, Marcel>

Page 94

Parsing Metrics

y  = dependencies in ground truth
y' = dependencies in proposed output

|y ∩ y'| = n   (correct dependencies returned)
|y'| = d   (all dependencies returned)

Page 95

Parsing Metrics

y  = dependencies in ground truth
y' = dependencies in proposed output

|y ∩ y'| = n   (correct dependencies returned)
|y'| = d   (all dependencies returned)

Precision   P(y, y') = |y ∩ y'| / |y'| = n / d

Recall      R(y, y') = |y ∩ y'| / |y| = n / |y|

F-measure   F1(y, y') = 2PR / (P + R) = 2|y ∩ y'| / (|y| + |y'|) = 2n / (d + |y|)
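These metrics can be computed directly over sets of labelled dependencies; the snippet below does so for the toy triples from the earlier slides (my own example, not evaluation code from the paper).

```python
def prf1(gold, predicted):
    n = len(gold & predicted)           # correct dependencies returned
    d = len(predicted)                  # all dependencies returned
    p = n / d if d else 0.0             # precision  n / d
    r = n / len(gold) if gold else 0.0  # recall     n / |y|
    f1 = 2 * n / (d + len(gold)) if (d + len(gold)) else 0.0  # 2n / (d + |y|)
    return p, r, f1

gold = {("proved", "(S\\NP)/NP", "Marcel"), ("proved", "(S\\NP)/NP", "completeness")}
pred = {("proved", "(S\\NP)/NP", "Marcel")}
print(prf1(gold, pred))   # (1.0, 0.5, 0.666...)
```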

Page 96

Softmax-Margin Training

• Discriminative.

• Probabilistic.

• Convex objective.

• Minimises a bound on expected risk for a given loss function.

• Requires little change to an existing CLL implementation.

(Sha & Saul, 2006; Povey & Woodland, 2008; Gimpel & Smith, 2010)

Page 97

Softmax-Margin Training

CLL objective:

min_θ  Σ_{i=1..m} [ -θ^T f(x^(i), y^(i)) + log Σ_{y ∈ Y(x^(i))} exp{ θ^T f(x^(i), y) } ]    (2)

Softmax-margin objective:

min_θ  Σ_{i=1..m} [ -θ^T f(x^(i), y^(i)) + log Σ_{y ∈ Y(x^(i))} exp{ θ^T f(x^(i), y) + ℓ(y^(i), y) } ]    (3)

Gradient:

∂/∂θ_k = Σ_{i=1..m} [ -h_k(x^(i), y^(i)) + Σ_{y ∈ Y(x^(i))} ( exp{ θ^T f(x^(i), y) + ℓ(y^(i), y) } / Σ_{y' ∈ Y(x^(i))} exp{ θ^T f(x^(i), y') + ℓ(y^(i), y') } ) · h_k(x^(i), y) ]    (4)

Figure 1: Conditional log-likelihood (Eq. 2), softmax-margin objective (Eq. 3) and gradient (Eq. 4).
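To see what the extra loss term in Eq. 3 does, the toy computation below compares the CLL and softmax-margin losses on a hand-made candidate set; the scores and losses are invented for illustration and the loss term could be, for example, 1 - F1 against the gold dependencies.

```python
import math

def cll_loss(score_gold, scores):                     # Eq. 2 for one training example
    return -score_gold + math.log(sum(math.exp(s) for s in scores))

def softmax_margin_loss(score_gold, scores, losses):  # Eq. 3 for one training example
    return -score_gold + math.log(sum(math.exp(s + l)
                                      for s, l in zip(scores, losses)))

scores = [2.0, 1.5, 0.2]   # model scores theta^T f(x, y) for three candidate parses
losses = [0.0, 0.4, 0.9]   # task loss of each candidate against the gold parse
print(cll_loss(scores[0], scores))                    # candidate 0 is the gold parse
print(softmax_margin_loss(scores[0], scores, losses)) # strictly larger penalty
```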


A_{i,i+1, n+(a_i:A), d+(a_i:A)}  ⊕=  w(a_i : A)

A_{i,j, n+n'+n+(BC→A), d+d'+d+(BC→A)}  ⊕=  B_{i,k,n,d} ⊗ C_{k,j,n',d'} ⊗ w(BC → A)

GOAL  ⊕=  S_{0,N,n,d} ⊗ ( 1 - 2n / (d + |y|) )

Figure 2: State-split inside algorithm for computing softmax-margin with F-measure.

Precision decomposes over parsing actions via the two counts n+ and d+:

DecP = d+ - n+    (6)

Recall requires the number of gold-standard dependencies, y+, which should have been recovered in a particular state; we compute it as follows: a gold dependency is due to be recovered if its head lies within the span of one of its children and the dependent in the other. With this we can compute the decomposed recall:

DecR = y+ - n+    (7)

However, there is one issue with our formulation of y+ that arises from the way CCG deals with dependencies and makes our formulation slightly more approximate: CCG's unification mechanism allows dependencies to be realised later in the derivation, once both the head and the dependent are in the same span (see the coordination example below). This makes the proposed decomposed recall difficult to use, as the gold-dependency count y+ may under- or over-state the number of correct dependencies n+. Given that this loss function is an approximation, we deal with this inconsistency by setting y+ = n+ whenever y+ < n+, to account for gold dependencies which have not been correctly classified by our method.

likes        apples   and    pears
(S\NP)/NP    NP       CONJ   NP
                      NP\NP           (Φ)
             NP                       (<)
S\NP                                  (>)

Dependencies: and - pears; and - apples; likes - pears; likes - apples

Example of flexible dependency realisation in CCG: the parser (Clark and Curran, 2007) creates dependencies arising from coordination once all conjuncts are found, and treats "and" as the syntactic head of coordinations. The coordination rule (Φ) does not yet establish the dependency "and - pears" (dotted line); it is the backward application (<) in the larger span, "apples and pears", that establishes it, together with "and - apples". CCG also deals with unbounded dependencies, which can lead to more dependencies than words (Steedman, 2000); here the unification mechanism creates the dependencies "likes - apples" and "likes - pears" in the forward application (>).

Finally, the decomposed F-measure is simply the sum of the two decomposed losses:

DecF1 = DecP + DecR = (d+ - n+) + (y+ - n+)    (8)

Experiments

Parsing Strategies. The most successful approach to CCG parsing is based on a pipeline strategy: first, we tag (or multitag) each word of the sentence with a lexical category using a supertagger, a sequence model over these categories (Bangalore and Joshi, 1999; Clark, 2002). Second, we parse the sentence under the requirement that the lexical categories are fixed to those preferred by the supertagger. In our experiments we used two variants on this strategy.

Pruning the categories in advance this way has a specific failure mode: sometimes it is not possible to produce a sentence-spanning derivation from the tag sequences preferred by the supertagger, since it does not enforce grammaticality. A workaround for this problem is the adaptive supertagging (AST) approach of Clark and Curran (2004). It is based on a step function over supertagger beam widths, relaxing the pruning threshold for lexical categories only if the parser fails to find an analysis. The process either succeeds and returns a parse after some iteration or gives up after a predefined number of iterations. As Clark and Curran (2004) show, most sentences can be parsed with a very small number of supertags per word. However, the technique is inherently approximate: it will return a lower-probability parse under the parsing model if a higher-probability parse can only be constructed from a supertag sequence returned by a subsequent iteration. In this way it prioritizes speed over exactness, although the tradeoff can be modified by adjusting the beam step function. Regardless, the effect of the approximation is unbounded.

Reverse adaptive supertagging is a much less aggressive method that seeks only to make sentences parseable when they otherwise would not be, due to an impractically large search space. Reverse AST starts with a wide beam, narrowing it at each iteration only if a maximum chart size is exceeded. The beam settings used for both strategies during testing are given in Table 1.

Adaptive supertagging aims for speed via pruning, while the reverse strategy aims for accuracy by exposing the parser to a larger search space. Although Clark and Curran (2007) found no actual improvements from the latter strategy, we will show that with some models it can have a substantial effect.

Parser. We use the C&C parser (Clark and Curran, 2007) and its supertagger (Clark, 2002). Our baseline is the hybrid model of Clark and Curran (2007).

The key idea will be to treat F1 as a non-local feature of the parse, dependent on the values n and d.² To compute expectations we split each span in an otherwise usual CKY program by all pairs ⟨n, d⟩ incident at that span. Since we anticipate the number of these splits to be approximately linear in sentence length, the algorithm's complexity remains manageable.

Formally, our goal will be to compute expectations over the sentence a_1...a_N. In order to abstract away from the particulars of CCG and present the algorithm in relatively familiar terms as a variant of CKY, we will use the notation a_i : A for lexical entries and BC → A to indicate that categories B and C combine to form category A via forward or backward composition or application.³ Item A_{i,j} accumulates the inside score associated with category A spanning (i, j), computed with the usual inside algorithm, written here as a series of recursive equations:

A_{i,i+1}  ⊕=  w(a_i : A)
A_{i,j}    ⊕=  B_{i,k} ⊗ C_{k,j} ⊗ w(BC → A)
GOAL       ⊕=  S_{0,N}

Our algorithm computes expectations on state-split items A_{i,j,n,d}.⁴ Let the functions n+(·) and d+(·) respectively denote the number of correct and total dependencies introduced by a parsing action. We can now present the state-split variant of the inside algorithm (Figure 2 above). The final recursion simply incorporates the loss function for all derivations having a particular F-score; by running the full inside-outside algorithm on this state-split program, we obtain the desired expectations.⁵ A simple modification of the weight on the goal transition enables us to optimise precision, recall or a weighted F-measure.

² This is essentially the same trick used in the oracle F-measure algorithm of Huang (2008), and indeed our algorithm is a sum-product variant of that max-product algorithm.
³ These correspond to unary rules A → a_i and binary rules A → BC in a context-free grammar in Chomsky normal form.
⁴ Here we use state-splitting to refer to splitting an item A_{i,j} into many items A_{i,j,n,d}, one for each ⟨n, d⟩ pair.
⁵ The outside equations can be easily derived from the inside algorithm, or mechanically using the reverse values of Goodman (1999).
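The following is a compact, illustrative sketch of the state-split inside computation described above, with chart items keyed by (category, n, d). It is not the C&C implementation; the input format and the goal category "S" are assumptions made for the example.

```python
from collections import defaultdict

# lexical[i]        -> list of (A, weight, n_plus, d_plus) for word i
# rules[(B, C)]     -> list of (A, weight, n_plus, d_plus)
# gold_size         -> |y|, the number of gold dependencies for the sentence
def state_split_inside(n_words, lexical, rules, gold_size):
    chart = defaultdict(lambda: defaultdict(float))   # chart[(i, j)][(A, n, d)] = inside score
    for i in range(n_words):                          # A_{i,i+1,n+,d+} (+)= w(a_i : A)
        for A, w, n_plus, d_plus in lexical[i]:
            chart[(i, i + 1)][(A, n_plus, d_plus)] += w
    for span in range(2, n_words + 1):
        for i in range(n_words - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (B, n1, d1), wb in chart[(i, k)].items():
                    for (C, n2, d2), wc in chart[(k, j)].items():
                        for A, w, n_plus, d_plus in rules.get((B, C), []):
                            # A_{i,j, n1+n2+n+, d1+d2+d+} (+)= B (x) C (x) w(BC -> A)
                            chart[(i, j)][(A, n1 + n2 + n_plus, d1 + d2 + d_plus)] += wb * wc * w
    goal = 0.0
    for (A, n, d), w in chart[(0, n_words)].items():  # GOAL (+)= S_{0,N,n,d} (x) (1 - 2n/(d+|y|))
        if A == "S":
            goal += w * (1.0 - 2.0 * n / (d + gold_size))
    return goal
```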

Page 98

Softmax-Margin Training


[Slide annotation of Eq. 2: θ = weights, f = features, x^(i) = input, y^(i) = true output, y = proposed output, Y(x^(i)) = possible outputs, i = 1..m training examples]

Page 99

Softmax-Margin Training

CLL:

$$
\min_{\theta} \sum_{i=1}^{m} \left[ -\theta^{\top} f(x^{(i)}, y^{(i)}) + \log \sum_{y \in \mathcal{Y}(x^{(i)})} \exp\{\theta^{\top} f(x^{(i)}, y)\} \right] \qquad (2)
$$

SMM:

$$
\min_{\theta} \sum_{i=1}^{m} \left[ -\theta^{\top} f(x^{(i)}, y^{(i)}) + \log \sum_{y \in \mathcal{Y}(x^{(i)})} \exp\{\theta^{\top} f(x^{(i)}, y) + \ell(y^{(i)}, y)\} \right] \qquad (3)
$$

Gradient:

$$
\frac{\partial}{\partial \theta_k} = \sum_{i=1}^{m} \left[ -h_k(x^{(i)}, y^{(i)}) + \sum_{y \in \mathcal{Y}(x^{(i)})} \frac{\exp\{\theta^{\top} f(x^{(i)}, y) + \ell(y^{(i)}, y)\}}{\sum_{y' \in \mathcal{Y}(x^{(i)})} \exp\{\theta^{\top} f(x^{(i)}, y') + \ell(y^{(i)}, y')\}} \, h_k(x^{(i)}, y) \right] \qquad (4)
$$

Figure 1: Conditional log-likelihood (Eq. 2), Softmax-margin objective (Eq. 3) and gradient (Eq. 4).
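To make the difference between Eqs. 2 and 3 concrete, the following toy sketch evaluates both objectives for a single training example whose output space is small enough to enumerate explicitly; feats, loss and the candidate list are stand-ins for f, ℓ and Y(x) and are assumptions of the sketch, not part of the parser.

```python
import math

def cll_and_softmax_margin(theta, feats, candidates, gold, loss):
    """Evaluate the CLL term (Eq. 2) and the softmax-margin term (Eq. 3)
    for one training example by explicit enumeration of its candidates."""
    def score(y):
        # theta^T f(x, y), with feats(y) returning a sparse feature dict
        return sum(theta.get(k, 0.0) * v for k, v in feats(y).items())

    gold_score = score(gold)
    log_z = math.log(sum(math.exp(score(y)) for y in candidates))
    log_z_loss = math.log(sum(math.exp(score(y) + loss(gold, y))
                              for y in candidates))
    cll = -gold_score + log_z        # Eq. 2
    smm = -gold_score + log_z_loss   # Eq. 3: loss-augmented partition function
    return cll, smm
```

The only difference is that the softmax-margin partition function is computed over loss-augmented scores, which is exactly what the state-split inside algorithm makes tractable when the loss is F-measure.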


$$
\begin{aligned}
A_{i,i+1,\,n^{+}(a_i:A),\,d^{+}(a_i:A)} &\;\oplus=\; w(a_i : A) \\
A_{i,j,\,n+n'+n^{+}(BC \Rightarrow A),\,d+d'+d^{+}(BC \Rightarrow A)} &\;\oplus=\; B_{i,k,n,d} \otimes C_{k,j,n',d'} \otimes w(BC \Rightarrow A) \\
\mathrm{GOAL} &\;\oplus=\; S_{0,N,n,d} \otimes \left(1 - \frac{2n}{d + |y|}\right)
\end{aligned}
$$

Figure 3: State-split inside algorithm for computing softmax-margin with F-measure.
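A sketch of the state-split recursion in Figure 3, again with ordinary probabilities. Items are keyed by (category, n, d) and the goal transition multiplies in the loss 1 − F1. The helpers n_plus and d_plus (correct and total dependencies introduced by a lexical entry or a combination) and the grammar interface are assumptions mirroring the notation above; a real parser would also pass head and span information to them.

```python
from collections import defaultdict

def state_split_inside(words, lex, binary, n_plus, d_plus, num_gold, goal="S"):
    """State-split inside algorithm: chart[i, j][(A, n, d)] accumulates the
    inside score of category A over words[i:j] having produced n correct out
    of d proposed dependencies; the GOAL transition weighs in 1 - F1."""
    N = len(words)
    chart = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(words):
        for A, weight in lex(word):
            item = (A, n_plus(word, A), d_plus(word, A))
            chart[i, i + 1][item] += weight
    for width in range(2, N + 1):
        for i in range(N - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (B, n1, d1), b_score in chart[i, k].items():
                    for (C, n2, d2), c_score in chart[k, j].items():
                        for A, weight in binary(B, C):
                            n = n1 + n2 + n_plus((B, C), A)   # counts are additive
                            d = d1 + d2 + d_plus((B, C), A)
                            chart[i, j][(A, n, d)] += b_score * c_score * weight
    total = 0.0
    for (A, n, d), score in chart[0, N].items():
        if A != goal:
            continue
        denom = d + num_gold
        f1 = 2.0 * n / denom if denom else 0.0
        total += score * (1.0 - f1)    # GOAL += S_{0,N,n,d} * (1 - 2n / (d + |y|))
    return total
```

Running the analogous outside pass over the same state-split items yields the expectations needed for the gradient in Eq. 4.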

Decomposed precision penalises each incorrectly proposed dependency and can be computed directly from the per-action counts, n+ and d+:

$$
\mathit{DecP} = d^{+} - n^{+} \qquad (6)
$$

Recall requires the number of gold standard dependencies, y+, which should have been recovered in a particular state; we compute it as follows: a gold dependency is due to be recovered if its head lies within the span of one of its children and the dependent in the other. With this we can compute the decomposed recall:

$$
\mathit{DecR} = y^{+} - n^{+} \qquad (7)
$$

However, there is one issue with our formulation of y+ that stems from the way CCG realises dependencies and makes the formulation slightly more approximate: the unification mechanism of CCG allows dependencies to be realised later in the derivation, once both the head and the dependent lie in the same span (Figure 5). This makes the proposed decomposed recall difficult to use, since our gold-dependency count y+ may under- or over-state the number of correct dependencies n+. Given that this loss function is an approximation, we deal with the inconsistency by setting y+ = n+ whenever y+ < n+, to account for gold dependencies which have not been correctly counted by our method.

[Figure: CCG derivation of "likes apples and pears" — likes := (S\NP)/NP, apples := NP, and := conj, pears := NP; "and pears" forms NP\NP via the coordination rule (Φ), "apples and pears" forms NP via backward application (<), and the whole phrase forms S\NP via forward application (>). Dependencies: and - pears, and - apples, likes - pears, likes - apples.]

Figure 5: Example illustrating the handling of conjunctions in CCG.

Finally, decomposed F-measure is simply the sum of the two decomposed losses:

$$
\mathit{DecF_1} = \mathit{DecP} + \mathit{DecR} = (d^{+} - n^{+}) + (y^{+} - n^{+}) \qquad (8)
$$
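A small sketch of the per-action penalties in Eqs. 6–8, assuming hypothetical dependency sets: proposed (dependencies introduced by the action), gold (the gold-standard set for the sentence) and due (gold dependencies whose head lies in one child span and whose dependent in the other).

```python
def decomposed_losses(proposed, gold, due):
    """Per-action decomposed penalties (Eqs. 6-8).

    proposed: dependencies introduced by this parsing action
    gold:     gold-standard dependency set for the sentence
    due:      gold dependencies that should be realised by this action
              (head in one child span, dependent in the other)
    """
    d_plus = len(proposed)                 # total dependencies introduced
    n_plus = len(proposed & gold)          # correct dependencies introduced
    y_plus = max(len(due), n_plus)         # set y+ = n+ whenever y+ < n+
    dec_p = d_plus - n_plus                # Eq. 6: incorrectly proposed dependencies
    dec_r = y_plus - n_plus                # Eq. 7: due but unrecovered dependencies
    return dec_p, dec_r, dec_p + dec_r     # Eq. 8: DecF1 = DecP + DecR
```

Because these penalties are local to a single parsing action, they can be added to chart items like any other unweighted feature, without the ⟨n, d⟩ state-splitting needed for exact F-measure.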

5 Experiments

Parsing Strategies. The most successful approach to CCG parsing is based on a pipeline strategy: First, we tag (or multitag) each word of the sentence with a lexical category using a supertagger, a sequence model over these categories (Bangalore and Joshi, 1999; Clark, 2002). Second, we parse the sentence under the requirement that the lexical categories are fixed to those preferred by the supertagger. In our experiments we used two variants on this strategy.

Pruning the categories in advance this way has a specific failure mode: sometimes it is not possible to produce a sentence-spanning derivation from the tag sequences preferred by the supertagger, since it does not enforce grammaticality. A workaround for this problem is the adaptive supertagging (AST) approach of Clark and Curran (2004). It is based on a step function over supertagger beam widths, relaxing the pruning threshold for lexical categories only if the parser fails to find an analysis. The process either succeeds and returns a parse after some iteration or gives up after a predefined number of iterations. As Clark and Curran (2004) show, most sentences can be parsed with a very small number of supertags per word. However, the technique is inherently approximate: it will return a lower probability parse under the parsing model if a higher probability parse can only be constructed from a supertag sequence returned by a subsequent iteration. In this way it prioritizes speed over exactness, although the tradeoff can be modified by adjusting the beam step function. Regardless, the effect of the approximation is unbounded.

Reverse adaptive supertagging is a much less ag-gressive method that seeks only to make sentencesparseable when they otherwise would not be due toan impractically large search space. Reverse ASTstarts with a wide beam, narrowing it at each itera-tion only if a maximum chart size is exceeded. Theused beam settings for both strategies during testingare in Table 1.Parser. We use the C&C parser (Clark and Cur-ran, 2007) and its supertagger (Clark, 2002). Ourbaseline is the hybrid model of Clark and Curran

likes apples and pears

(S\NP)/NP NP CONJ NP<�>

NP\NP<

NP>

(S\NP)

Figure 3: Example of flexible dependency realisation inCCG: Our parser (Clark and Curran, 2007) creates depen-dencies arising from coordination once all conjuncts werefound. The first application of the coordination rule (�)only notes the dependency “and - pears” (dotted line); thesecond application in the larger span, “apples and pears”,realises it, together with “and - apples”.

same span, violating the assumption used to com-pute y+ (Figure 3). Exceptions like this can causemismatches between n+ and y+. We set y+ = n+

whenever y+ < n+ to account for these occasionaldiscrepancies.

Finally, we obtain a decomposed approximationto F-measure.

DecF1 = DecP +DecR (10)

4 Experiments

Parsing Strategy. The most successful approach toCCG parsing is based on a pipeline strategy: First,

we tag (or multitag) each word of the sentence witha lexical category using a supertagger, a sequencemodel over these categories (Bangalore and Joshi,1999; Clark, 2002). Second, we parse the sentenceunder the requirement that the lexical categories arefixed to those preferred by the supertagger. In ourexperiments we used two variants on this strategy.

Pruning the categories in advance has a specificfailure mode: sometimes it is not possible to pro-duce a sentence-spanning derivation from the tag se-quences preferred by the supertagger, since it doesnot enforce grammaticality. A workaround for thisproblem is the adaptive supertagging (AST) ap-proach of Clark and Curran (2004). It is based ona step function over supertagger beam widths, relax-ing the pruning threshold for lexical categories onlyif the parser fails to find an analysis. The process ei-ther succeeds and returns a parse after some iterationor gives up after a predefined number of iterations.As Clark and Curran (2004) show, most sentencescan be parsed with a very small number of supertagsper word.

Reverse adaptive supertagging is a much less ag-gressive method that seeks only to make sentencesparseable when they otherwise would not be due toan impractically large search space. Reverse ASTstarts with a wide beam, narrowing it at each itera-tion only if a maximum chart size is exceeded. Ourbeam settings for both strategies during testing arein Table 1.

Adaptive supertagging aims for speed via pruningwhile the reverse strategy aims for accuracy by ex-posing the parser to a larger search space. AlthoughClark and Curran (2007) found no actual improve-ments from the latter strategy, we will show that withsome models it can have a substantial effect.Parser. We use the C&C parser (Clark and Curran,2007) and its supertagger (Clark, 2002). Our base-

Figure 2: Example of flexible dependency realisation inCCG: Our parser (Clark and Curran, 2007) creates de-pendencies arising from coordination once all conjunctsare found and treats “and” as the syntactic head of coor-dinations. The coordination rule (�) does not yet estab-lish the dependency “and - pears” (dotted line); it is thebackward application (<) in the larger span, “apples andpears”, that establishes it, together with “and - pears”.CCG also deals with unbounded dependencies which po-tentially lead to more dependencies than words (Steed-man, 2000); in this example a unification mechanism cre-ates the dependencies “likes - apples” and “likes - pears”in the forward application (>).

The key idea will be to treat F1 as a non-local fea-ture of the parse, dependent on values n and d.2 Tocompute expectations we split each span in an other-wise usual CKY program by all pairs �n, d incidentat that span. Since we anticipate the number of thesesplits to be approximately linear in sentence length,the algorithm’s complexity remains manageable.

Formally, our goal will be to compute expecta-2This is essentially the same trick used in the oracle F-

measure algorithm of Huang (2008), and indeed our algorithmis a sum-product variant of that max-product algorithm.

tions over the sentence a1...aN . In order to abstractaway from the particulars of CCG and present thealgorithm in relatively familiar terms as a variantof CKY, we will use the notation ai : A for lexi-cal entries and BC ⇧ A to indicate that categoriesB and C combine to form category A via forwardor backward composition or application.3 Item Ai,j

accumulates the inside score associated with cate-gory A spanning i, j, computed with the usual in-side algorithm, written here as a series of recursiveequations:

Ai,i+1 ⇤= w(ai : A)

Ai,j ⇤= Bi,k ⌅ Ck,j ⌅ w(BC ⇧ A)

GOAL ⇤= S0,N

Our algorithm computes expectations on state-split items Ai,j,n,d.4 Let functions n+(·) and d+(·)respectively represent the number of correct and to-tal dependencies introduced by a parsing action. Wecan now present the state-split variant of the insidealgorithm in Fig. 3. The final recursion simply incor-porates the loss function for all derivations having aparticular F-score; by running the full inside-outsidealgorithm on this state-split program, we obtain thedesired expectations.5 A simple modification of theweight on the goal transition enables us to optimiseprecision, recall or a weighted F-measure.

3These correspond to unary rules A ! ai and binary rulesA ! BC in a context-free grammar in Chomsky normal form.

4Here we use state-splitting to refer to splitting an item Ai,j

into many items Ai,j,n,d, one for each hn, di pair.5The outside equations can be easily derived from the inside

algorithm, or mechanically using the reverse values of Good-man (1999).

SMM:

min�

m�

i=1

⇧��Tf(x(i), y(i)) + log�

y�Y(x(i))

exp{�Tf(x(i), y)}

⌃ (2)

min�

m�

i=1

⇧��Tf(x(i), y(i)) + log�

y�Y(x(i))

exp{�Tf(x(i), y) + �(y(i), y)}

⌃ (3)

⌥⇥k=

m�

i=1

��hk(x

(i), y(i))⇥k +exp{�Tf(x(i), y(i))}⌥

y�Y(x(i)) exp{�Tf(x(i), y) + �(y(i), y)}hk(x

(i), y(i))⇥k

⇥(4)

Figure 1: Conditional log-likelihood (Eq. 2), Softmax-margin objective (Eq. 3) and gradient (Eq. 4).

Draft, do not circulate without permission.

Ai,i+1,n+(ai:A),d+(ai:A) ⇥= w(ai : A)

Ai,j,n+n�+n+(BC�A),d+d�+d+(BC�A) ⇥= Bi,k,n,d ⇤ Ck,j,n�,d� ⇤ w(BC ⌅ A)

GOAL ⇥= S0,N,n,d ⇤�1� 2n

d+ |y|

Figure 2: State-split inside algorithm for computing softmax-margin with F-measure.

counts, n+ and d+:

DecP = d+ � n+ (6)

Recall requires the number of gold standard de-pendencies, y+, which should have been recoveredin a particular state; we compute it as follows: Agold dependency is due to be recovered if its headlies within the span of one of its children and the de-pendent in the other. With this we can compute thedecomposed recall:

DecR = y+ � n+ (7)

However, there is one issue with our formulationof y+ with CCG and its way of dealing withdependencies that makes our formulation slightlymore approximate: The unification mechanism ofCCG allows to realise dependencies later in thederivation when both the head and dependent arein the same span (Figure 5). This makes usingthe proposed decomposed recall difficult as ourgold-dependency count y+ may under or over-statethe number of correct dependencies n+. Given thatthis loss function is an approximation, we deal withthis inconsistency via setting y+ = n+ whenevery+ < n+ to account for gold-dependencies whichhave not been correctly classified by our method.

likes apples and pears

(S\NP)/NP NP CONJ NP<�>

NP\NP<�>

NP<

(S\NP)

Dependencies:and - pearsand - appleslikes - pears, likes - apples

Figure 5: Example illustrating handling of conjunctionsin CCG: .

Finally, decomposed F-measure is simply the sumof the two decomposed losses:

DecF1 = (d+ � n+) + (y+ � n+) (8)

5 Experiments

Parsing Strategies. The most successful approachto CCG parsing is based on a pipeline strategy: First,we tag (or multitag) each word of the sentence witha lexical category using a supertagger, a sequencemodel over these categories (Bangalore and Joshi,1999; Clark, 2002). Second, we parse the sentenceunder the requirement that the lexical categories arefixed to those preferred by the supertagger.

Pruning the categories in advance this way has aspecific failure mode: sometimes it is not possibleto produce a sentence-spanning derivation from thetag sequences preferred by the supertagger, since itdoes not enforce grammaticality. A workaround forthis problem is the adaptive supertagging (AST) ap-proach of Clark and Curran (2004). It is based on astep function over supertagger beam widths, relax-ing the pruning threshold for lexical categories onlyif the parser fails to find an analysis. The process ei-ther succeeds and returns a parse after some iterationor gives up after a predefined number of iterations.As Clark and Curran (2004) show, most sentencescan be parsed with a very small number of supertagsper word. However, the technique is inherently ap-proximate: it will return a lower probability parseunder the parsing model if a higher probability parsecan only be constructed from a supertag sequencereturned by a subsequent iteration. In this way it pri-oritizes speed over exactness, although the tradeoffcan be modified by adjusting the beam step func-tion. Regardless, the effect of the approximation isunbounded.

Reverse adaptive supertagging is a much less ag-gressive method that seeks only to make sentencesparseable when they otherwise would not be due toan impractically large search space. Reverse ASTstarts with a wide beam, narrowing it at each itera-tion only if a maximum chart size is exceeded. Theused beam settings for both strategies during testingare in Table 1.Parser. We use the C&C parser (Clark and Cur-ran, 2007) and its supertagger (Clark, 2002). Ourbaseline is the hybrid model of Clark and Curran

likes apples and pears

(S\NP)/NP NP CONJ NP<�>

NP\NP<

NP>

(S\NP)

Figure 3: Example of flexible dependency realisation inCCG: Our parser (Clark and Curran, 2007) creates depen-dencies arising from coordination once all conjuncts werefound. The first application of the coordination rule (�)only notes the dependency “and - pears” (dotted line); thesecond application in the larger span, “apples and pears”,realises it, together with “and - apples”.

same span, violating the assumption used to com-pute y+ (Figure 3). Exceptions like this can causemismatches between n+ and y+. We set y+ = n+

whenever y+ < n+ to account for these occasionaldiscrepancies.

Finally, we obtain a decomposed approximationto F-measure.

DecF1 = DecP +DecR (10)

4 Experiments

Parsing Strategy. The most successful approach toCCG parsing is based on a pipeline strategy: First,

we tag (or multitag) each word of the sentence witha lexical category using a supertagger, a sequencemodel over these categories (Bangalore and Joshi,1999; Clark, 2002). Second, we parse the sentenceunder the requirement that the lexical categories arefixed to those preferred by the supertagger. In ourexperiments we used two variants on this strategy.

Pruning the categories in advance has a specificfailure mode: sometimes it is not possible to pro-duce a sentence-spanning derivation from the tag se-quences preferred by the supertagger, since it doesnot enforce grammaticality. A workaround for thisproblem is the adaptive supertagging (AST) ap-proach of Clark and Curran (2004). It is based ona step function over supertagger beam widths, relax-ing the pruning threshold for lexical categories onlyif the parser fails to find an analysis. The process ei-ther succeeds and returns a parse after some iterationor gives up after a predefined number of iterations.As Clark and Curran (2004) show, most sentencescan be parsed with a very small number of supertagsper word.

Reverse adaptive supertagging is a much less ag-gressive method that seeks only to make sentencesparseable when they otherwise would not be due toan impractically large search space. Reverse ASTstarts with a wide beam, narrowing it at each itera-tion only if a maximum chart size is exceeded. Ourbeam settings for both strategies during testing arein Table 1.

Adaptive supertagging aims for speed via pruningwhile the reverse strategy aims for accuracy by ex-posing the parser to a larger search space. AlthoughClark and Curran (2007) found no actual improve-ments from the latter strategy, we will show that withsome models it can have a substantial effect.Parser. We use the C&C parser (Clark and Curran,2007) and its supertagger (Clark, 2002). Our base-

Figure 2: Example of flexible dependency realisation inCCG: Our parser (Clark and Curran, 2007) creates de-pendencies arising from coordination once all conjunctsare found and treats “and” as the syntactic head of coor-dinations. The coordination rule (�) does not yet estab-lish the dependency “and - pears” (dotted line); it is thebackward application (<) in the larger span, “apples andpears”, that establishes it, together with “and - pears”.CCG also deals with unbounded dependencies which po-tentially lead to more dependencies than words (Steed-man, 2000); in this example a unification mechanism cre-ates the dependencies “likes - apples” and “likes - pears”in the forward application (>).

The key idea will be to treat F1 as a non-local fea-ture of the parse, dependent on values n and d.2 Tocompute expectations we split each span in an other-wise usual CKY program by all pairs �n, d incidentat that span. Since we anticipate the number of thesesplits to be approximately linear in sentence length,the algorithm’s complexity remains manageable.

Formally, our goal will be to compute expecta-2This is essentially the same trick used in the oracle F-

measure algorithm of Huang (2008), and indeed our algorithmis a sum-product variant of that max-product algorithm.

tions over the sentence a1...aN . In order to abstractaway from the particulars of CCG and present thealgorithm in relatively familiar terms as a variantof CKY, we will use the notation ai : A for lexi-cal entries and BC ⇧ A to indicate that categoriesB and C combine to form category A via forwardor backward composition or application.3 Item Ai,j

accumulates the inside score associated with cate-gory A spanning i, j, computed with the usual in-side algorithm, written here as a series of recursiveequations:

Ai,i+1 ⇤= w(ai : A)

Ai,j ⇤= Bi,k ⌅ Ck,j ⌅ w(BC ⇧ A)

GOAL ⇤= S0,N

Our algorithm computes expectations on state-split items Ai,j,n,d.4 Let functions n+(·) and d+(·)respectively represent the number of correct and to-tal dependencies introduced by a parsing action. Wecan now present the state-split variant of the insidealgorithm in Fig. 3. The final recursion simply incor-porates the loss function for all derivations having aparticular F-score; by running the full inside-outsidealgorithm on this state-split program, we obtain thedesired expectations.5 A simple modification of theweight on the goal transition enables us to optimiseprecision, recall or a weighted F-measure.

3These correspond to unary rules A ! ai and binary rulesA ! BC in a context-free grammar in Chomsky normal form.

4Here we use state-splitting to refer to splitting an item Ai,j

into many items Ai,j,n,d, one for each hn, di pair.5The outside equations can be easily derived from the inside

algorithm, or mechanically using the reverse values of Good-man (1999).

weights true outputfeatures input proposed outputpossible outputs

training examples

Page 100: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:

Softmax-Margin Training

CLL: min�

m�

i=1

⇧��Tf(x(i), y(i)) + log�

y�Y(x(i))

exp{�Tf(x(i), y)}

⌃ (2)

min�

m�

i=1

⇧��Tf(x(i), y(i)) + log�

y�Y(x(i))

exp{�Tf(x(i), y) + �(y(i), y)}

⌃ (3)

⌥⇥k=

m�

i=1

��hk(x

(i), y(i))⇥k +exp{�Tf(x(i), y(i))}⌥

y�Y(x(i)) exp{�Tf(x(i), y) + �(y(i), y)}hk(x

(i), y(i))⇥k

⇥(4)

Figure 1: Conditional log-likelihood (Eq. 2), Softmax-margin objective (Eq. 3) and gradient (Eq. 4).

Draft, do not circulate without permission.

Ai,i+1,n+(ai:A),d+(ai:A) ⇥= w(ai : A)

Ai,j,n+n�+n+(BC�A),d+d�+d+(BC�A) ⇥= Bi,k,n,d ⇤ Ck,j,n�,d� ⇤ w(BC ⌅ A)

GOAL ⇥= S0,N,n,d ⇤�1� 2n

d+ |y|

Figure 2: State-split inside algorithm for computing softmax-margin with F-measure.

counts, n+ and d+:

DecP = d+ � n+ (6)

Recall requires the number of gold standard de-pendencies, y+, which should have been recoveredin a particular state; we compute it as follows: Agold dependency is due to be recovered if its headlies within the span of one of its children and the de-pendent in the other. With this we can compute thedecomposed recall:

DecR = y+ � n+ (7)

However, there is one issue with our formulationof y+ with CCG and its way of dealing withdependencies that makes our formulation slightlymore approximate: The unification mechanism ofCCG allows to realise dependencies later in thederivation when both the head and dependent arein the same span (Figure 5). This makes usingthe proposed decomposed recall difficult as ourgold-dependency count y+ may under or over-statethe number of correct dependencies n+. Given thatthis loss function is an approximation, we deal withthis inconsistency via setting y+ = n+ whenevery+ < n+ to account for gold-dependencies whichhave not been correctly classified by our method.

likes apples and pears

(S\NP)/NP NP CONJ NP<�>

NP\NP<�>

NP<

(S\NP)

Dependencies:and - pearsand - appleslikes - pears, likes - apples

Figure 5: Example illustrating handling of conjunctionsin CCG: .

Finally, decomposed F-measure is simply the sumof the two decomposed losses:

DecF1 = (d+ � n+) + (y+ � n+) (8)

5 Experiments

Parsing Strategies. The most successful approachto CCG parsing is based on a pipeline strategy: First,we tag (or multitag) each word of the sentence witha lexical category using a supertagger, a sequencemodel over these categories (Bangalore and Joshi,1999; Clark, 2002). Second, we parse the sentenceunder the requirement that the lexical categories arefixed to those preferred by the supertagger.

Pruning the categories in advance this way has aspecific failure mode: sometimes it is not possibleto produce a sentence-spanning derivation from thetag sequences preferred by the supertagger, since itdoes not enforce grammaticality. A workaround forthis problem is the adaptive supertagging (AST) ap-proach of Clark and Curran (2004). It is based on astep function over supertagger beam widths, relax-ing the pruning threshold for lexical categories onlyif the parser fails to find an analysis. The process ei-ther succeeds and returns a parse after some iterationor gives up after a predefined number of iterations.As Clark and Curran (2004) show, most sentencescan be parsed with a very small number of supertagsper word. However, the technique is inherently ap-proximate: it will return a lower probability parseunder the parsing model if a higher probability parsecan only be constructed from a supertag sequencereturned by a subsequent iteration. In this way it pri-oritizes speed over exactness, although the tradeoffcan be modified by adjusting the beam step func-tion. Regardless, the effect of the approximation isunbounded.

Reverse adaptive supertagging is a much less ag-gressive method that seeks only to make sentencesparseable when they otherwise would not be due toan impractically large search space. Reverse ASTstarts with a wide beam, narrowing it at each itera-tion only if a maximum chart size is exceeded. Theused beam settings for both strategies during testingare in Table 1.Parser. We use the C&C parser (Clark and Cur-ran, 2007) and its supertagger (Clark, 2002). Ourbaseline is the hybrid model of Clark and Curran

likes apples and pears

(S\NP)/NP NP CONJ NP<�>

NP\NP<

NP>

(S\NP)

Figure 3: Example of flexible dependency realisation inCCG: Our parser (Clark and Curran, 2007) creates depen-dencies arising from coordination once all conjuncts werefound. The first application of the coordination rule (�)only notes the dependency “and - pears” (dotted line); thesecond application in the larger span, “apples and pears”,realises it, together with “and - apples”.

same span, violating the assumption used to com-pute y+ (Figure 3). Exceptions like this can causemismatches between n+ and y+. We set y+ = n+

whenever y+ < n+ to account for these occasionaldiscrepancies.

Finally, we obtain a decomposed approximationto F-measure.

DecF1 = DecP +DecR (10)

4 Experiments

Parsing Strategy. The most successful approach toCCG parsing is based on a pipeline strategy: First,

we tag (or multitag) each word of the sentence witha lexical category using a supertagger, a sequencemodel over these categories (Bangalore and Joshi,1999; Clark, 2002). Second, we parse the sentenceunder the requirement that the lexical categories arefixed to those preferred by the supertagger. In ourexperiments we used two variants on this strategy.

Pruning the categories in advance has a specificfailure mode: sometimes it is not possible to pro-duce a sentence-spanning derivation from the tag se-quences preferred by the supertagger, since it doesnot enforce grammaticality. A workaround for thisproblem is the adaptive supertagging (AST) ap-proach of Clark and Curran (2004). It is based ona step function over supertagger beam widths, relax-ing the pruning threshold for lexical categories onlyif the parser fails to find an analysis. The process ei-ther succeeds and returns a parse after some iterationor gives up after a predefined number of iterations.As Clark and Curran (2004) show, most sentencescan be parsed with a very small number of supertagsper word.

Reverse adaptive supertagging is a much less ag-gressive method that seeks only to make sentencesparseable when they otherwise would not be due toan impractically large search space. Reverse ASTstarts with a wide beam, narrowing it at each itera-tion only if a maximum chart size is exceeded. Ourbeam settings for both strategies during testing arein Table 1.

Adaptive supertagging aims for speed via pruningwhile the reverse strategy aims for accuracy by ex-posing the parser to a larger search space. AlthoughClark and Curran (2007) found no actual improve-ments from the latter strategy, we will show that withsome models it can have a substantial effect.Parser. We use the C&C parser (Clark and Curran,2007) and its supertagger (Clark, 2002). Our base-

Figure 2: Example of flexible dependency realisation inCCG: Our parser (Clark and Curran, 2007) creates de-pendencies arising from coordination once all conjunctsare found and treats “and” as the syntactic head of coor-dinations. The coordination rule (�) does not yet estab-lish the dependency “and - pears” (dotted line); it is thebackward application (<) in the larger span, “apples andpears”, that establishes it, together with “and - pears”.CCG also deals with unbounded dependencies which po-tentially lead to more dependencies than words (Steed-man, 2000); in this example a unification mechanism cre-ates the dependencies “likes - apples” and “likes - pears”in the forward application (>).

The key idea will be to treat F1 as a non-local fea-ture of the parse, dependent on values n and d.2 Tocompute expectations we split each span in an other-wise usual CKY program by all pairs �n, d incidentat that span. Since we anticipate the number of thesesplits to be approximately linear in sentence length,the algorithm’s complexity remains manageable.

Formally, our goal will be to compute expecta-2This is essentially the same trick used in the oracle F-

measure algorithm of Huang (2008), and indeed our algorithmis a sum-product variant of that max-product algorithm.

tions over the sentence a1...aN . In order to abstractaway from the particulars of CCG and present thealgorithm in relatively familiar terms as a variantof CKY, we will use the notation ai : A for lexi-cal entries and BC ⇧ A to indicate that categoriesB and C combine to form category A via forwardor backward composition or application.3 Item Ai,j

accumulates the inside score associated with cate-gory A spanning i, j, computed with the usual in-side algorithm, written here as a series of recursiveequations:

Ai,i+1 ⇤= w(ai : A)

Ai,j ⇤= Bi,k ⌅ Ck,j ⌅ w(BC ⇧ A)

GOAL ⇤= S0,N

Our algorithm computes expectations on state-split items Ai,j,n,d.4 Let functions n+(·) and d+(·)respectively represent the number of correct and to-tal dependencies introduced by a parsing action. Wecan now present the state-split variant of the insidealgorithm in Fig. 3. The final recursion simply incor-porates the loss function for all derivations having aparticular F-score; by running the full inside-outsidealgorithm on this state-split program, we obtain thedesired expectations.5 A simple modification of theweight on the goal transition enables us to optimiseprecision, recall or a weighted F-measure.

3These correspond to unary rules A ! ai and binary rulesA ! BC in a context-free grammar in Chomsky normal form.

4Here we use state-splitting to refer to splitting an item Ai,j

into many items Ai,j,n,d, one for each hn, di pair.5The outside equations can be easily derived from the inside

algorithm, or mechanically using the reverse values of Good-man (1999).

SMM:

min�

m�

i=1

⇧��Tf(x(i), y(i)) + log�

y�Y(x(i))

exp{�Tf(x(i), y)}

⌃ (2)

min�

m�

i=1

⇧��Tf(x(i), y(i)) + log�

y�Y(x(i))

exp{�Tf(x(i), y) + �(y(i), y)}

⌃ (3)

⌥⇥k=

m�

i=1

��hk(x

(i), y(i))⇥k +exp{�Tf(x(i), y(i))}⌥

y�Y(x(i)) exp{�Tf(x(i), y) + �(y(i), y)}hk(x

(i), y(i))⇥k

⇥(4)

Figure 1: Conditional log-likelihood (Eq. 2), Softmax-margin objective (Eq. 3) and gradient (Eq. 4).

Draft, do not circulate without permission.

Ai,i+1,n+(ai:A),d+(ai:A) ⇥= w(ai : A)

Ai,j,n+n�+n+(BC�A),d+d�+d+(BC�A) ⇥= Bi,k,n,d ⇤ Ck,j,n�,d� ⇤ w(BC ⌅ A)

GOAL ⇥= S0,N,n,d ⇤�1� 2n

d+ |y|

Figure 2: State-split inside algorithm for computing softmax-margin with F-measure.

counts, n+ and d+:

DecP = d+ � n+ (6)

Recall requires the number of gold standard de-pendencies, y+, which should have been recoveredin a particular state; we compute it as follows: Agold dependency is due to be recovered if its headlies within the span of one of its children and the de-pendent in the other. With this we can compute thedecomposed recall:

DecR = y+ � n+ (7)

However, there is one issue with our formulationof y+ with CCG and its way of dealing withdependencies that makes our formulation slightlymore approximate: The unification mechanism ofCCG allows to realise dependencies later in thederivation when both the head and dependent arein the same span (Figure 5). This makes usingthe proposed decomposed recall difficult as ourgold-dependency count y+ may under or over-statethe number of correct dependencies n+. Given thatthis loss function is an approximation, we deal withthis inconsistency via setting y+ = n+ whenevery+ < n+ to account for gold-dependencies whichhave not been correctly classified by our method.

likes apples and pears

(S\NP)/NP NP CONJ NP<�>

NP\NP<�>

NP<

(S\NP)

Dependencies:and - pearsand - appleslikes - pears, likes - apples

Figure 5: Example illustrating handling of conjunctionsin CCG: .

Finally, decomposed F-measure is simply the sumof the two decomposed losses:

DecF1 = (d+ � n+) + (y+ � n+) (8)

5 Experiments

Parsing Strategies. The most successful approachto CCG parsing is based on a pipeline strategy: First,we tag (or multitag) each word of the sentence witha lexical category using a supertagger, a sequencemodel over these categories (Bangalore and Joshi,1999; Clark, 2002). Second, we parse the sentenceunder the requirement that the lexical categories arefixed to those preferred by the supertagger.

Pruning the categories in advance this way has aspecific failure mode: sometimes it is not possibleto produce a sentence-spanning derivation from thetag sequences preferred by the supertagger, since itdoes not enforce grammaticality. A workaround forthis problem is the adaptive supertagging (AST) ap-proach of Clark and Curran (2004). It is based on astep function over supertagger beam widths, relax-ing the pruning threshold for lexical categories onlyif the parser fails to find an analysis. The process ei-ther succeeds and returns a parse after some iterationor gives up after a predefined number of iterations.As Clark and Curran (2004) show, most sentencescan be parsed with a very small number of supertagsper word. However, the technique is inherently ap-proximate: it will return a lower probability parseunder the parsing model if a higher probability parsecan only be constructed from a supertag sequencereturned by a subsequent iteration. In this way it pri-oritizes speed over exactness, although the tradeoffcan be modified by adjusting the beam step func-tion. Regardless, the effect of the approximation isunbounded.

Reverse adaptive supertagging is a much less ag-gressive method that seeks only to make sentencesparseable when they otherwise would not be due toan impractically large search space. Reverse ASTstarts with a wide beam, narrowing it at each itera-tion only if a maximum chart size is exceeded. Theused beam settings for both strategies during testingare in Table 1.Parser. We use the C&C parser (Clark and Cur-ran, 2007) and its supertagger (Clark, 2002). Ourbaseline is the hybrid model of Clark and Curran

likes apples and pears

(S\NP)/NP NP CONJ NP<�>

NP\NP<

NP>

(S\NP)

Figure 3: Example of flexible dependency realisation inCCG: Our parser (Clark and Curran, 2007) creates depen-dencies arising from coordination once all conjuncts werefound. The first application of the coordination rule (�)only notes the dependency “and - pears” (dotted line); thesecond application in the larger span, “apples and pears”,realises it, together with “and - apples”.

same span, violating the assumption used to com-pute y+ (Figure 3). Exceptions like this can causemismatches between n+ and y+. We set y+ = n+

whenever y+ < n+ to account for these occasionaldiscrepancies.

Finally, we obtain a decomposed approximationto F-measure.

DecF1 = DecP +DecR (10)

4 Experiments

Parsing Strategy. The most successful approach toCCG parsing is based on a pipeline strategy: First,

we tag (or multitag) each word of the sentence witha lexical category using a supertagger, a sequencemodel over these categories (Bangalore and Joshi,1999; Clark, 2002). Second, we parse the sentenceunder the requirement that the lexical categories arefixed to those preferred by the supertagger. In ourexperiments we used two variants on this strategy.

Pruning the categories in advance has a specificfailure mode: sometimes it is not possible to pro-duce a sentence-spanning derivation from the tag se-quences preferred by the supertagger, since it doesnot enforce grammaticality. A workaround for thisproblem is the adaptive supertagging (AST) ap-proach of Clark and Curran (2004). It is based ona step function over supertagger beam widths, relax-ing the pruning threshold for lexical categories onlyif the parser fails to find an analysis. The process ei-ther succeeds and returns a parse after some iterationor gives up after a predefined number of iterations.As Clark and Curran (2004) show, most sentencescan be parsed with a very small number of supertagsper word.

Reverse adaptive supertagging is a much less ag-gressive method that seeks only to make sentencesparseable when they otherwise would not be due toan impractically large search space. Reverse ASTstarts with a wide beam, narrowing it at each itera-tion only if a maximum chart size is exceeded. Ourbeam settings for both strategies during testing arein Table 1.

Adaptive supertagging aims for speed via pruningwhile the reverse strategy aims for accuracy by ex-posing the parser to a larger search space. AlthoughClark and Curran (2007) found no actual improve-ments from the latter strategy, we will show that withsome models it can have a substantial effect.Parser. We use the C&C parser (Clark and Curran,2007) and its supertagger (Clark, 2002). Our base-

Figure 2: Example of flexible dependency realisation inCCG: Our parser (Clark and Curran, 2007) creates de-pendencies arising from coordination once all conjunctsare found and treats “and” as the syntactic head of coor-dinations. The coordination rule (�) does not yet estab-lish the dependency “and - pears” (dotted line); it is thebackward application (<) in the larger span, “apples andpears”, that establishes it, together with “and - pears”.CCG also deals with unbounded dependencies which po-tentially lead to more dependencies than words (Steed-man, 2000); in this example a unification mechanism cre-ates the dependencies “likes - apples” and “likes - pears”in the forward application (>).

The key idea will be to treat F1 as a non-local fea-ture of the parse, dependent on values n and d.2 Tocompute expectations we split each span in an other-wise usual CKY program by all pairs �n, d incidentat that span. Since we anticipate the number of thesesplits to be approximately linear in sentence length,the algorithm’s complexity remains manageable.

Formally, our goal will be to compute expecta-2This is essentially the same trick used in the oracle F-

measure algorithm of Huang (2008), and indeed our algorithmis a sum-product variant of that max-product algorithm.

tions over the sentence a1...aN . In order to abstractaway from the particulars of CCG and present thealgorithm in relatively familiar terms as a variantof CKY, we will use the notation ai : A for lexi-cal entries and BC ⇧ A to indicate that categoriesB and C combine to form category A via forwardor backward composition or application.3 Item Ai,j

accumulates the inside score associated with cate-gory A spanning i, j, computed with the usual in-side algorithm, written here as a series of recursiveequations:

Ai,i+1 ⇤= w(ai : A)

Ai,j ⇤= Bi,k ⌅ Ck,j ⌅ w(BC ⇧ A)

GOAL ⇤= S0,N

Our algorithm computes expectations on state-split items Ai,j,n,d.4 Let functions n+(·) and d+(·)respectively represent the number of correct and to-tal dependencies introduced by a parsing action. Wecan now present the state-split variant of the insidealgorithm in Fig. 3. The final recursion simply incor-porates the loss function for all derivations having aparticular F-score; by running the full inside-outsidealgorithm on this state-split program, we obtain thedesired expectations.5 A simple modification of theweight on the goal transition enables us to optimiseprecision, recall or a weighted F-measure.

3These correspond to unary rules A ! ai and binary rulesA ! BC in a context-free grammar in Chomsky normal form.

4Here we use state-splitting to refer to splitting an item Ai,j

into many items Ai,j,n,d, one for each hn, di pair.5The outside equations can be easily derived from the inside

algorithm, or mechanically using the reverse values of Good-man (1999).

weights true outputfeatures input proposed outputpossible outputs

training examples

Page 101: Integrated Supertagging and Parsingmichaelauli.github.io/talks/parsing-tagging-talk.pdfnew parsing problem: 3 n3) Integrated Model •Supertagger & parser are log-linear models.! •Idea:

Softmax-Margin Training

CLL: min�

m�

i=1

⇧��Tf(x(i), y(i)) + log�

y�Y(x(i))

exp{�Tf(x(i), y)}

⌃ (2)

min�

m�

i=1

⇧��Tf(x(i), y(i)) + log�

y�Y(x(i))

exp{�Tf(x(i), y) + �(y(i), y)}

⌃ (3)

⌥⇥k=

m�

i=1

��hk(x

(i), y(i))⇥k +exp{�Tf(x(i), y(i))}⌥

y�Y(x(i)) exp{�Tf(x(i), y) + �(y(i), y)}hk(x

(i), y(i))⇥k

⇥(4)

Figure 1: Conditional log-likelihood (Eq. 2), Softmax-margin objective (Eq. 3) and gradient (Eq. 4).

Draft, do not circulate without permission.

Ai,i+1,n+(ai:A),d+(ai:A) ⇥= w(ai : A)

Ai,j,n+n�+n+(BC�A),d+d�+d+(BC�A) ⇥= Bi,k,n,d ⇤ Ck,j,n�,d� ⇤ w(BC ⌅ A)

GOAL ⇥= S0,N,n,d ⇤�1� 2n

d+ |y|

Figure 2: State-split inside algorithm for computing softmax-margin with F-measure.

counts, n+ and d+:

DecP = d+ � n+ (6)

Recall requires the number of gold standard de-pendencies, y+, which should have been recoveredin a particular state; we compute it as follows: Agold dependency is due to be recovered if its headlies within the span of one of its children and the de-pendent in the other. With this we can compute thedecomposed recall:

DecR = y+ � n+ (7)

However, there is one issue with our formulationof y+ with CCG and its way of dealing withdependencies that makes our formulation slightlymore approximate: The unification mechanism ofCCG allows to realise dependencies later in thederivation when both the head and dependent arein the same span (Figure 5). This makes usingthe proposed decomposed recall difficult as ourgold-dependency count y+ may under or over-statethe number of correct dependencies n+. Given thatthis loss function is an approximation, we deal withthis inconsistency via setting y+ = n+ whenevery+ < n+ to account for gold-dependencies whichhave not been correctly classified by our method.

likes apples and pears

(S\NP)/NP NP CONJ NP<�>

NP\NP<�>

NP<

(S\NP)

Dependencies:and - pearsand - appleslikes - pears, likes - apples

Figure 5: Example illustrating handling of conjunctionsin CCG: .

Finally, decomposed F-measure is simply the sumof the two decomposed losses:

DecF1 = (d+ � n+) + (y+ � n+) (8)

5 Experiments

Parsing Strategies. The most successful approachto CCG parsing is based on a pipeline strategy: First,we tag (or multitag) each word of the sentence witha lexical category using a supertagger, a sequencemodel over these categories (Bangalore and Joshi,1999; Clark, 2002). Second, we parse the sentenceunder the requirement that the lexical categories arefixed to those preferred by the supertagger.

Pruning the categories in advance this way has aspecific failure mode: sometimes it is not possibleto produce a sentence-spanning derivation from thetag sequences preferred by the supertagger, since itdoes not enforce grammaticality. A workaround forthis problem is the adaptive supertagging (AST) ap-proach of Clark and Curran (2004). It is based on astep function over supertagger beam widths, relax-ing the pruning threshold for lexical categories onlyif the parser fails to find an analysis. The process ei-ther succeeds and returns a parse after some iterationor gives up after a predefined number of iterations.As Clark and Curran (2004) show, most sentencescan be parsed with a very small number of supertagsper word. However, the technique is inherently ap-proximate: it will return a lower probability parseunder the parsing model if a higher probability parsecan only be constructed from a supertag sequencereturned by a subsequent iteration. In this way it pri-oritizes speed over exactness, although the tradeoffcan be modified by adjusting the beam step func-tion. Regardless, the effect of the approximation isunbounded.

Reverse adaptive supertagging is a much less ag-gressive method that seeks only to make sentencesparseable when they otherwise would not be due toan impractically large search space. Reverse ASTstarts with a wide beam, narrowing it at each itera-tion only if a maximum chart size is exceeded. Theused beam settings for both strategies during testingare in Table 1.Parser. We use the C&C parser (Clark and Cur-ran, 2007) and its supertagger (Clark, 2002). Ourbaseline is the hybrid model of Clark and Curran

likes apples and pears

(S\NP)/NP NP CONJ NP<�>

NP\NP<

NP>

(S\NP)

Figure 3: Example of flexible dependency realisation inCCG: Our parser (Clark and Curran, 2007) creates depen-dencies arising from coordination once all conjuncts werefound. The first application of the coordination rule (�)only notes the dependency “and - pears” (dotted line); thesecond application in the larger span, “apples and pears”,realises it, together with “and - apples”.

same span, violating the assumption used to com-pute y+ (Figure 3). Exceptions like this can causemismatches between n+ and y+. We set y+ = n+

whenever y+ < n+ to account for these occasionaldiscrepancies.

Finally, we obtain a decomposed approximationto F-measure.

DecF1 = DecP +DecR (10)

4 Experiments

Parsing Strategy. The most successful approach toCCG parsing is based on a pipeline strategy: First,

we tag (or multitag) each word of the sentence witha lexical category using a supertagger, a sequencemodel over these categories (Bangalore and Joshi,1999; Clark, 2002). Second, we parse the sentenceunder the requirement that the lexical categories arefixed to those preferred by the supertagger. In ourexperiments we used two variants on this strategy.

Pruning the categories in advance has a specificfailure mode: sometimes it is not possible to pro-duce a sentence-spanning derivation from the tag se-quences preferred by the supertagger, since it doesnot enforce grammaticality. A workaround for thisproblem is the adaptive supertagging (AST) ap-proach of Clark and Curran (2004). It is based ona step function over supertagger beam widths, relax-ing the pruning threshold for lexical categories onlyif the parser fails to find an analysis. The process ei-ther succeeds and returns a parse after some iterationor gives up after a predefined number of iterations.As Clark and Curran (2004) show, most sentencescan be parsed with a very small number of supertagsper word.

Reverse adaptive supertagging is a much less ag-gressive method that seeks only to make sentencesparseable when they otherwise would not be due toan impractically large search space. Reverse ASTstarts with a wide beam, narrowing it at each itera-tion only if a maximum chart size is exceeded. Ourbeam settings for both strategies during testing arein Table 1.

Adaptive supertagging aims for speed via pruningwhile the reverse strategy aims for accuracy by ex-posing the parser to a larger search space. AlthoughClark and Curran (2007) found no actual improve-ments from the latter strategy, we will show that withsome models it can have a substantial effect.Parser. We use the C&C parser (Clark and Curran,2007) and its supertagger (Clark, 2002). Our base-

Figure 2: Example of flexible dependency realisation inCCG: Our parser (Clark and Curran, 2007) creates de-pendencies arising from coordination once all conjunctsare found and treats “and” as the syntactic head of coor-dinations. The coordination rule (�) does not yet estab-lish the dependency “and - pears” (dotted line); it is thebackward application (<) in the larger span, “apples andpears”, that establishes it, together with “and - pears”.CCG also deals with unbounded dependencies which po-tentially lead to more dependencies than words (Steed-man, 2000); in this example a unification mechanism cre-ates the dependencies “likes - apples” and “likes - pears”in the forward application (>).

The key idea will be to treat F1 as a non-local fea-ture of the parse, dependent on values n and d.2 Tocompute expectations we split each span in an other-wise usual CKY program by all pairs �n, d incidentat that span. Since we anticipate the number of thesesplits to be approximately linear in sentence length,the algorithm’s complexity remains manageable.

Formally, our goal will be to compute expecta-2This is essentially the same trick used in the oracle F-

measure algorithm of Huang (2008), and indeed our algorithmis a sum-product variant of that max-product algorithm.

tions over the sentence a1...aN . In order to abstractaway from the particulars of CCG and present thealgorithm in relatively familiar terms as a variantof CKY, we will use the notation ai : A for lexi-cal entries and BC ⇧ A to indicate that categoriesB and C combine to form category A via forwardor backward composition or application.3 Item Ai,j

accumulates the inside score associated with cate-gory A spanning i, j, computed with the usual in-side algorithm, written here as a series of recursiveequations:

Ai,i+1 ⇤= w(ai : A)

Ai,j ⇤= Bi,k ⌅ Ck,j ⌅ w(BC ⇧ A)

GOAL ⇤= S0,N

Our algorithm computes expectations on state-split items Ai,j,n,d.4 Let functions n+(·) and d+(·)respectively represent the number of correct and to-tal dependencies introduced by a parsing action. Wecan now present the state-split variant of the insidealgorithm in Fig. 3. The final recursion simply incor-porates the loss function for all derivations having aparticular F-score; by running the full inside-outsidealgorithm on this state-split program, we obtain thedesired expectations.5 A simple modification of theweight on the goal transition enables us to optimiseprecision, recall or a weighted F-measure.

3These correspond to unary rules A ! ai and binary rulesA ! BC in a context-free grammar in Chomsky normal form.

4Here we use state-splitting to refer to splitting an item Ai,j

into many items Ai,j,n,d, one for each hn, di pair.5The outside equations can be easily derived from the inside

algorithm, or mechanically using the reverse values of Good-man (1999).

SMM:

min�

m�

i=1

⇧��Tf(x(i), y(i)) + log�

y�Y(x(i))

exp{�Tf(x(i), y)}

⌃ (2)

min�

m�

i=1

⇧��Tf(x(i), y(i)) + log�

y�Y(x(i))

exp{�Tf(x(i), y) + �(y(i), y)}

⌃ (3)

⌥⇥k=

m�

i=1

��hk(x

(i), y(i))⇥k +exp{�Tf(x(i), y(i))}⌥

y�Y(x(i)) exp{�Tf(x(i), y) + �(y(i), y)}hk(x

(i), y(i))⇥k

⇥(4)

Figure 1: Conditional log-likelihood (Eq. 2), Softmax-margin objective (Eq. 3) and gradient (Eq. 4).

Draft, do not circulate without permission.

Ai,i+1,n+(ai:A),d+(ai:A) ⇥= w(ai : A)

Ai,j,n+n�+n+(BC�A),d+d�+d+(BC�A) ⇥= Bi,k,n,d ⇤ Ck,j,n�,d� ⇤ w(BC ⌅ A)

GOAL ⇥= S0,N,n,d ⇤�1� 2n

d+ |y|

Figure 2: State-split inside algorithm for computing softmax-margin with F-measure.

counts, n+ and d+:

DecP = d+ � n+ (6)

Recall requires the number of gold standard de-pendencies, y+, which should have been recoveredin a particular state; we compute it as follows: Agold dependency is due to be recovered if its headlies within the span of one of its children and the de-pendent in the other. With this we can compute thedecomposed recall:

DecR = y+ � n+ (7)

However, there is one issue with our formulationof y+ with CCG and its way of dealing withdependencies that makes our formulation slightlymore approximate: The unification mechanism ofCCG allows to realise dependencies later in thederivation when both the head and dependent arein the same span (Figure 5). This makes usingthe proposed decomposed recall difficult as ourgold-dependency count y+ may under or over-statethe number of correct dependencies n+. Given thatthis loss function is an approximation, we deal withthis inconsistency via setting y+ = n+ whenevery+ < n+ to account for gold-dependencies whichhave not been correctly classified by our method.

likes apples and pears

(S\NP)/NP NP CONJ NP<�>

NP\NP<�>

NP<

(S\NP)

Dependencies:and - pearsand - appleslikes - pears, likes - apples

Figure 5: Example illustrating handling of conjunctionsin CCG: .

Finally, decomposed F-measure is simply the sumof the two decomposed losses:

DecF1 = (d+ � n+) + (y+ � n+) (8)

5 Experiments

Parsing Strategies. The most successful approachto CCG parsing is based on a pipeline strategy: First,we tag (or multitag) each word of the sentence witha lexical category using a supertagger, a sequencemodel over these categories (Bangalore and Joshi,1999; Clark, 2002). Second, we parse the sentenceunder the requirement that the lexical categories arefixed to those preferred by the supertagger.

Pruning the categories in advance this way has aspecific failure mode: sometimes it is not possibleto produce a sentence-spanning derivation from thetag sequences preferred by the supertagger, since itdoes not enforce grammaticality. A workaround forthis problem is the adaptive supertagging (AST) ap-proach of Clark and Curran (2004). It is based on astep function over supertagger beam widths, relax-ing the pruning threshold for lexical categories onlyif the parser fails to find an analysis. The process ei-ther succeeds and returns a parse after some iterationor gives up after a predefined number of iterations.As Clark and Curran (2004) show, most sentencescan be parsed with a very small number of supertagsper word. However, the technique is inherently ap-proximate: it will return a lower probability parseunder the parsing model if a higher probability parsecan only be constructed from a supertag sequencereturned by a subsequent iteration. In this way it pri-oritizes speed over exactness, although the tradeoffcan be modified by adjusting the beam step func-tion. Regardless, the effect of the approximation isunbounded.

Reverse adaptive supertagging is a much less ag-gressive method that seeks only to make sentencesparseable when they otherwise would not be due toan impractically large search space. Reverse ASTstarts with a wide beam, narrowing it at each itera-tion only if a maximum chart size is exceeded. Theused beam settings for both strategies during testingare in Table 1.Parser. We use the C&C parser (Clark and Cur-ran, 2007) and its supertagger (Clark, 2002). Ourbaseline is the hybrid model of Clark and Curran

likes apples and pears

(S\NP)/NP NP CONJ NP<�>

NP\NP<

NP>

(S\NP)

Figure 3: Example of flexible dependency realisation inCCG: Our parser (Clark and Curran, 2007) creates depen-dencies arising from coordination once all conjuncts werefound. The first application of the coordination rule (�)only notes the dependency “and - pears” (dotted line); thesecond application in the larger span, “apples and pears”,realises it, together with “and - apples”.

same span, violating the assumption used to com-pute y+ (Figure 3). Exceptions like this can causemismatches between n+ and y+. We set y+ = n+

whenever y+ < n+ to account for these occasionaldiscrepancies.

Finally, we obtain a decomposed approximationto F-measure.

DecF1 = DecP +DecR (10)

4 Experiments

Parsing Strategy. The most successful approach toCCG parsing is based on a pipeline strategy: First,

we tag (or multitag) each word of the sentence witha lexical category using a supertagger, a sequencemodel over these categories (Bangalore and Joshi,1999; Clark, 2002). Second, we parse the sentenceunder the requirement that the lexical categories arefixed to those preferred by the supertagger. In ourexperiments we used two variants on this strategy.

Pruning the categories in advance has a specificfailure mode: sometimes it is not possible to pro-duce a sentence-spanning derivation from the tag se-quences preferred by the supertagger, since it doesnot enforce grammaticality. A workaround for thisproblem is the adaptive supertagging (AST) ap-proach of Clark and Curran (2004). It is based ona step function over supertagger beam widths, relax-ing the pruning threshold for lexical categories onlyif the parser fails to find an analysis. The process ei-ther succeeds and returns a parse after some iterationor gives up after a predefined number of iterations.As Clark and Curran (2004) show, most sentencescan be parsed with a very small number of supertagsper word.

Reverse adaptive supertagging is a much less ag-gressive method that seeks only to make sentencesparseable when they otherwise would not be due toan impractically large search space. Reverse ASTstarts with a wide beam, narrowing it at each itera-tion only if a maximum chart size is exceeded. Ourbeam settings for both strategies during testing arein Table 1.

Adaptive supertagging aims for speed via pruningwhile the reverse strategy aims for accuracy by ex-posing the parser to a larger search space. AlthoughClark and Curran (2007) found no actual improve-ments from the latter strategy, we will show that withsome models it can have a substantial effect.Parser. We use the C&C parser (Clark and Curran,2007) and its supertagger (Clark, 2002). Our base-

Figure 2: Example of flexible dependency realisation inCCG: Our parser (Clark and Curran, 2007) creates de-pendencies arising from coordination once all conjunctsare found and treats “and” as the syntactic head of coor-dinations. The coordination rule (�) does not yet estab-lish the dependency “and - pears” (dotted line); it is thebackward application (<) in the larger span, “apples andpears”, that establishes it, together with “and - pears”.CCG also deals with unbounded dependencies which po-tentially lead to more dependencies than words (Steed-man, 2000); in this example a unification mechanism cre-ates the dependencies “likes - apples” and “likes - pears”in the forward application (>).

The key idea will be to treat F1 as a non-local feature of the parse, dependent on values n and d.² To compute expectations we split each span in an otherwise usual CKY program by all pairs ⟨n, d⟩ incident at that span. Since we anticipate the number of these splits to be approximately linear in sentence length, the algorithm's complexity remains manageable.

Formally, our goal will be to compute expectations over the sentence a_1 ... a_N. In order to abstract away from the particulars of CCG and present the algorithm in relatively familiar terms as a variant of CKY, we will use the notation a_i : A for lexical entries and BC → A to indicate that categories B and C combine to form category A via forward or backward composition or application.³ Item A_{i,j} accumulates the inside score associated with category A spanning ⟨i, j⟩, computed with the usual inside algorithm, written here as a series of recursive equations:

A_{i,i+1} ⊕= w(a_i : A)

A_{i,j} ⊕= B_{i,k} ⊗ C_{k,j} ⊗ w(BC → A)

GOAL ⊕= S_{0,N}

Our algorithm computes expectations on state-split items A_{i,j,n,d}.⁴ Let functions n+(·) and d+(·) respectively represent the number of correct and total dependencies introduced by a parsing action. We can now present the state-split variant of the inside algorithm in Fig. 3. The final recursion simply incorporates the loss function for all derivations having a particular F-score; by running the full inside-outside algorithm on this state-split program, we obtain the desired expectations.⁵ A simple modification of the weight on the goal transition enables us to optimise precision, recall or a weighted F-measure.

²This is essentially the same trick used in the oracle F-measure algorithm of Huang (2008), and indeed our algorithm is a sum-product variant of that max-product algorithm.

³These correspond to unary rules A → a_i and binary rules A → BC in a context-free grammar in Chomsky normal form.

⁴Here we use state-splitting to refer to splitting an item A_{i,j} into many items A_{i,j,n,d}, one for each ⟨n, d⟩ pair.

⁵The outside equations can be easily derived from the inside algorithm, or mechanically using the reverse values of Goodman (1999).
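Read as code, the recursions above are the familiar sum-product CKY inside pass. Below is a compact sketch; the dictionaries, category names and toy weights are illustrative, not the C&C parser's actual data structures.

from collections import defaultdict
from itertools import product

def inside(words, lex, rules, goal="S"):
    """Sum-product inside pass over the recursions above.

    lex:   word -> {A: weight}            lexical entries  a_i : A  with weight w
    rules: (B, C) -> {A: weight}          combinations     BC -> A  with weight w
    Returns (chart, Z): chart[i, j][A] is the inside score of A over span (i, j),
    and Z is the inside score of the goal item spanning the whole sentence.
    """
    N = len(words)
    chart = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(words):                      # A_{i,i+1} += w(a_i : A)
        for A, weight in lex.get(w, {}).items():
            chart[i, i + 1][A] += weight
    for width in range(2, N + 1):                      # build larger spans bottom-up
        for i in range(N - width + 1):
            j = i + width
            for k in range(i + 1, j):                  # A_{i,j} += B_{i,k} * C_{k,j} * w(BC -> A)
                for (B, sB), (C, sC) in product(chart[i, k].items(), chart[k, j].items()):
                    for A, weight in rules.get((B, C), {}).items():
                        chart[i, j][A] += sB * sC * weight
    return chart, chart[0, N].get(goal, 0.0)           # GOAL += S_{0,N}

# Running example "Marcel proved completeness" with toy weights:
lex = {"Marcel": {"NP": 1.0}, "proved": {r"(S\NP)/NP": 1.0}, "completeness": {"NP": 1.0}}
rules = {(r"(S\NP)/NP", "NP"): {r"S\NP": 1.0}, ("NP", r"S\NP"): {"S": 1.0}}
chart, Z = inside("Marcel proved completeness".split(), lex, rules)

Here chart[1, 3] holds the inside score of S\NP_{1,3} and Z that of the goal item S_{0,3}.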

• Penalise high-loss outputs.

• Re-weight outcomes by loss function.

• Loss function becomes an unweighted feature -- if decomposable.

[Figure: the softmax-margin objective annotated with its parts: weights, features, training examples, input, true output, proposed output, possible outputs.]
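Spelled out, these bullets describe the softmax-margin objective (Gimpel and Smith, 2010): conditional log-likelihood with the loss ℓ added inside the exponential, so that high-loss outputs receive more mass in the normaliser and are thereby penalised. Written over the annotated quantities (weights θ, features f, training examples (x^(i), y^(i)), possible outputs Y(x)):

\min_{\theta} \sum_{i=1}^{m} \Big( -\theta^{\top} f(x^{(i)}, y^{(i)})
  + \log \sum_{y \in \mathcal{Y}(x^{(i)})} \exp\big( \theta^{\top} f(x^{(i)}, y) + \ell(y^{(i)}, y) \big) \Big)

With ℓ ≡ 0 this reduces to conditional log-likelihood (CLL); when ℓ decomposes over the same substructures as f, it behaves like one extra, unweighted feature, which is exactly the decomposability question the next slides take up.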

Pages 102-104: Decomposability

• CKY assumes weights factor over substructures (node + children = substructure).
• Example "Marcel proved completeness", with chart items NP_{0,1}, (S\NP)/NP_{1,2}, NP_{2,3}, S\NP_{1,3} and S_{0,3}.
• A decomposable loss function must factor identically.
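As a tiny illustration of this factorisation assumption, a derivation's score under the model is just an accumulation of local scores, one per (node + children) substructure. The item names and the constant score below are made up for illustration.

derivation = [
    ("NP[0,1]",          ()),                               # lexical: Marcel
    (r"(S\NP)/NP[1,2]",  ()),                               # lexical: proved
    ("NP[2,3]",          ()),                               # lexical: completeness
    (r"S\NP[1,3]",       (r"(S\NP)/NP[1,2]", "NP[2,3]")),   # forward application
    ("S[0,3]",           ("NP[0,1]", r"S\NP[1,3]")),        # backward application
]
local_score = {node: 0.5 for node, _ in derivation}          # hypothetical per-substructure weights
total_score = sum(local_score[node] for node, _children in derivation)

A loss that is to ride along inside this dynamic program must likewise be expressible as a sum of such local terms.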

Pages 105-106: Decomposability (correct dependency counts)

For a proposed parse y' and gold parse y, write |y ∩ y'| = n (correct dependencies returned) and |y'| = d (all dependencies returned). Correct dependency counts do decompose over the chart: n = n1 + n2, the counts contributed by the two sub-analyses combined at each step.

Pages 107-109: Decomposability (F-measure)

In contrast, per-span F-scores do not combine: labelling the two sub-analyses of "Marcel proved completeness" with F-scores f1 and f2 and the full analysis with f, decomposability would require f = f1 ⊗ f2, which fails for F-measure (see Decomposability Revisited below). Hence: Approximations!

Pages 110-112: Approximate Loss Functions

For each substructure of the derivation of "Marcel proved completeness" we record local counts: n+ correct dependencies, d+ all dependencies, c+ gold dependencies. In the example the lexical items contribute (0,0) and the two combination steps (S\NP_{1,3} and S_{0,3}) contribute (1,1) each.

A_{i,i+1, n+(a_i:A), d+(a_i:A)} ⊕= w(a_i : A)

A_{i,j, n+n'+n+(BC→A), d+d'+d+(BC→A)} ⊕= B_{i,k,n,d} ⊗ C_{k,j,n',d'} ⊗ w(BC → A)

GOAL ⊕= S_{0,N,n,d} ⊗ (1 - 2n / (d + |y|))

Figure 3: State-split inside algorithm for computing softmax-margin with F-measure.
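Below is a sketch of this state-split inside pass in the same style as the plain CKY sketch earlier. The count-annotated lexicon/rule dictionaries and the exp(loss) realisation of the goal factor are assumptions made for illustration, not the paper's implementation.

import math
from collections import defaultdict
from itertools import product

def state_split_inside(words, lex, rules, gold_size, goal="S"):
    """Inside pass over state-split items A_{i,j,n,d} (n correct, d total dependencies).

    lex:   word -> {A: (weight, n_plus, d_plus)}      local counts of the lexical action
    rules: (B, C) -> {A: (weight, n_plus, d_plus)}    local counts of the combination
    gold_size is |y|, the number of gold dependencies for the sentence.
    """
    N = len(words)
    chart = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(words):
        for A, (weight, n0, d0) in lex.get(w, {}).items():
            chart[i, i + 1][A, n0, d0] += weight
    for width in range(2, N + 1):
        for i in range(N - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for ((B, n1, d1), sB), ((C, n2, d2), sC) in product(
                        chart[i, k].items(), chart[k, j].items()):
                    for A, (weight, na, da) in rules.get((B, C), {}).items():
                        # counts of the new item: children's counts plus this action's
                        chart[i, j][A, n1 + n2 + na, d1 + d2 + da] += sB * sC * weight
    Z = 0.0
    for (A, n, d), score in chart[0, N].items():
        if A == goal:
            # goal factor from Figure 3: the loss 1 - 2n/(d + |y|); folded in here as
            # exp(loss), on the assumption that weights are exponentiated model scores
            Z += score * math.exp(1.0 - 2.0 * n / (d + gold_size))
    return chart, Z

Running inside-outside over this state-split chart yields the F-measure-augmented expectations; only the inside pass is shown here.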

Note that while this algorithm computes exact sentence-level expectations, it is approximate at the corpus level, since F-measure does not decompose over sentences. We give the extension to exact corpus-level expectations in Appendix A.

3.2 Approximate Loss Functions

We will also consider approximate but more efficient alternatives to our exact algorithms. The idea is to use cost functions which only utilise statistics available within the current local structure, similar to those used by Taskar et al. (2004) for tracking constituent errors in a context-free parser. We design three simple losses to approximate precision, recall and F-measure on CCG dependency structures.

Let T(y) be the set of parsing actions required to build parse y. Our decomposable approximation to precision simply counts the number of incorrect dependencies using the local dependency counts, n+(·) and d+(·).

DecP(y) = Σ_{t ∈ T(y)} [ d+(t) - n+(t) ]   (8)

To compute our approximation to recall we require the number of gold dependencies, c+(·), which should have been introduced by a particular parsing action. A gold dependency is due to be recovered by a parsing action if its head lies within one child span and its dependent within the other. This yields a decomposed approximation to recall that counts the number of missed dependencies.

DecR(y) = Σ_{t ∈ T(y)} [ c+(t) - n+(t) ]   (9)

Unfortunately, the flexible handling of dependencies in CCG complicates our formulation of c+, rendering it slightly more approximate: The unification mechanism of CCG sometimes causes dependencies to be realised later in the derivation, at a point when both the head and the dependent are in the same span, violating the assumption used to compute c+ (see again Figure 2). Exceptions like this can cause mismatches between n+ and c+. We set c+ = n+ whenever c+ < n+ to account for these occasional discrepancies.

Finally, we obtain a decomposable approximation to F-measure.

DecF1(y) = DecP(y) + DecR(y)   (10)
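Equations (8)-(10) translate directly into a few lines, assuming the per-action counts n+, d+, c+ have already been read off the chart; the example action list at the end is hypothetical.

def dec_losses(actions):
    """Decomposed losses of eqs. (8)-(10): actions is a list of (n_plus, d_plus, c_plus)
    triples, one per parsing action t in T(y)."""
    dec_p = sum(d - n for n, d, c in actions)           # eq. (8): incorrect dependencies
    dec_r = sum(max(c, n) - n for n, d, c in actions)   # eq. (9), with c+ := n+ when c+ < n+
    return dec_p, dec_r, dec_p + dec_r                  # eq. (10): DecF1 = DecP + DecR

# e.g. four fully correct combination steps plus one that adds a wrong dependency
print(dec_losses([(1, 1, 1)] * 4 + [(0, 1, 1)]))        # -> (1, 1, 2)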

4 Experiments

Parsing Strategy. CCG parsers use a pipeline strategy: we first multitag each word of the sentence with a small subset of its possible lexical categories using a supertagger, a sequence model over these categories (Bangalore and Joshi, 1999; Clark, 2002). Then we parse the sentence under the requirement that the lexical categories are fixed to those preferred by the supertagger. In our experiments we used two variants on this strategy.

First is the adaptive supertagging (AST) approach of Clark and Curran (2004). It is based on a step function over supertagger beam widths, relaxing the pruning threshold for lexical categories only if the parser fails to find an analysis. The process either succeeds and returns a parse after some iteration or gives up after a predefined number of iterations. As Clark and Curran (2004) show, most sentences can be parsed with very tight beams.

Reverse adaptive supertagging is a much less aggressive method that seeks only to make sentences parsable when they otherwise would not be due to an impractically large search space. Reverse AST starts with a wide beam, narrowing it at each iteration only if a maximum chart size is exceeded. Table 1 shows beam settings for both strategies.



Pages 113-119: Approximate Losses with CKY (target analysis)

Items A_{i,j} over "time1 flies2 like3 an4 arrow5", annotated with (correct dependencies, all dependencies):
Lexical items: NP_{0,1,0,0}, S\NP_{1,2,0,0}, ((S\NP)\(S\NP))/NP_{2,3,0,0}, NP/NP_{3,4,0,0}, NP_{4,5,0,0}.
Combination steps, each annotated DecF1(1,1): NP_{3,5,1,1}, (S\NP)\(S\NP)_{2,5,2,2}, S\NP_{1,5,3,3}, S_{0,5} → GOAL.

Pages 120-126: Approximate Losses with CKY (another analysis)

Lexical items: NP/NP_{0,1,0,0}, NP_{1,2,0,0}, (S\NP)/NP_{2,3,0,0}, NP/NP_{3,4,0,0}, NP_{4,5,0,0}.
Combination steps: NP_{3,5,1,1} annotated DecF1(1,1); S\NP_{2,5,1,2} annotated DecF1(0,1); NP_{0,2,0,1} annotated DecF1(0,1); S_{0,5} annotated DecF1(0,1) → GOAL.

Pages 127-128: Approximate Losses with CKY (both analyses)

Both analyses are packed into the same chart: the target derivation's combination steps are annotated DecF1(1,1), the competing derivation's steps that introduce wrong dependencies are annotated DecF1(0,1), and both derivations reach S_{0,5} → GOAL.

Pages 129-131: Decomposability Revisited

F-measure: F1(y, y') = 2n / (d + |y|), where |y ∩ y'| = n (correct dependencies returned) and |y'| = d (all dependencies returned), for the example "Marcel proved completeness" with chart items NP_{0,1}, (S\NP)/NP_{1,2}, NP_{2,3}, S\NP_{1,3}, S_{0,3}.

Splitting the chart into two parts with counts (n1, d1) and (n2, d2) shows why F1 does not decompose:

f = 2n1 / (d1 + |y|) ⊗ 2n2 / (d2 + |y|) = 2(n1 + n2) / (d1 + d2 + 2|y|)

which is not F1(y, y') = 2n / (d + |y|): the denominator counts |y| twice, so sentence-level F-measure cannot be recovered by combining per-span F-scores.
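A quick check with made-up counts shows the mismatch. Take n1 = n2 = 1, d1 = d2 = 1 and |y| = 2, so the full parse has n = 2 and d = 2:

F1(y, y') = 2n / (d + |y|) = 4 / 4 = 1      versus      2(n1 + n2) / (d1 + d2 + 2|y|) = 4 / 6 ≈ 0.67

The |y| term is shared by both halves of the chart, so any per-span combination counts it twice.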

Pages 132-134: Exact Loss Functions

• Treat sentence-level F1 as a non-local feature dependent on n, d.
• Result: a new dynamic program over items A_{i,j,n,d}.
• Example "Marcel proved completeness": the lexical items carry counts (0,0); the combinations S\NP_{1,3} and S_{0,3} carry (1,1).

Pages 135-138: Exact Losses with State-Split CKY (target analysis)

Items A_{i,j,n,d} over "time1 flies2 like3 an4 arrow5" (n correct, d all dependencies):
Lexical: NP_{0,1,0,0}, S\NP_{1,2,0,0}, ((S\NP)\(S\NP))/NP_{2,3,0,0}, NP/NP_{3,4,0,0}, NP_{4,5,0,0}.
Derived: NP_{3,5,1,1}, (S\NP)\(S\NP)_{2,5,2,2}, S\NP_{1,5,3,3}, S_{0,5,4,4} → GOAL, weighted by F1(4,4).

Pages 139-142: Exact Losses with State-Split CKY (both analyses)

The competing analysis adds NP/NP_{0,1,0,0}, NP_{1,2,0,0} and (S\NP)/NP_{2,3,0,0}, yielding NP_{0,2,0,1}, S\NP_{2,5,1,2} and S_{0,5,1,4}; the two derivations reach GOAL through distinct state-split items, weighted by F1(4,4) and F1(1,4) respectively.

Speed O(L^7), space O(L^4); in practice the dynamic program is 48x larger and parsing is 30x slower.
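The stated bounds follow from an informal count, assuming n and d are each at most O(L) for a sentence of length L: plain CKY touches O(L^2) spans with O(L) split points, and state-splitting multiplies each of the two child items by its ⟨n, d⟩ pair, i.e. O(L^2) extra states per child:

time: O(L^3) × O(L^2) × O(L^2) = O(L^7)        space: O(L^2) spans × O(L^2) ⟨n, d⟩ pairs per span = O(L^4)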

Page 143: Experiments

• Standard parsing task:
  • C&C parser and supertagger (Clark & Curran 2007).
  • CCGBank standard train/dev/test splits.
  • Piecewise optimisation (Sutton and McCallum 2005).

Pages 144-148: Exact versus Approximate

[Bar chart comparing the approximate and exact loss functions on labelled precision, recall and F-measure; y-axis ticks from 86.9 to 88.0.]

Approximate loss functions work, and much faster!

Pages 149-152: Softmax-Margin beats CLL (test set results)

[Bar chart of labelled F-measure under a tight and a loose supertagger beam: C&C '07 87.7 / 87.7, DecF1 88.1 / 88.6, i.e. +0.9 with the loose beam.]

Does task-specific optimisation degrade accuracy on other metrics?

Page 153: Softmax-Margin beats CLL (labelled exact match)

[Bar chart of labelled exact match under tight and loose beams for C&C '07 and DecF1; values 37.7, 38.0, 38.0 and 39.1, with DecF1 highest.]

Pages 154-160: Integrated Model + SMM

Example "Marcel proved completeness": the integrated model combines a supertagging factor and a parsing factor. Softmax-margin training augments the supertagging factor with Hamming-loss expectations computed by forward-backward, and the parsing factor with F-measure-loss expectations computed by inside-outside.
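For the supertagging factor, "Hamming-augmented expectations via forward-backward" amounts to adding the Hamming loss to every non-gold tag score before running the usual forward-backward pass. A minimal log-space sketch with numpy follows; the array shapes and names are illustrative and this is not the C&C supertagger's code.

import numpy as np

def _logsumexp(a, axis=None):
    m = np.max(a, axis=axis, keepdims=True)
    out = m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True))
    return np.squeeze(out, axis=axis) if axis is not None else float(out.ravel()[0])

def hamming_augmented_marginals(unary, trans, gold):
    """Forward-backward over a linear chain with Hamming-loss-augmented scores.

    unary: (T, K) log-potentials per position/tag; trans: (K, K) log transition scores;
    gold:  length-T array of gold tag indices.
    Returns the (T, K) loss-augmented posterior tag marginals used as softmax-margin
    expectations for the supertagging factor.
    """
    T, K = unary.shape
    aug = unary + 1.0                          # Hamming loss: +1 in the exponent for every tag ...
    aug[np.arange(T), gold] -= 1.0             # ... except the gold tag at each position
    fwd = np.zeros((T, K))
    bwd = np.zeros((T, K))
    fwd[0] = aug[0]
    for t in range(1, T):                      # forward recursion (log space)
        fwd[t] = aug[t] + _logsumexp(fwd[t - 1][:, None] + trans, axis=0)
    for t in range(T - 2, -1, -1):             # backward recursion
        bwd[t] = _logsumexp(trans + aug[t + 1] + bwd[t + 1], axis=1)
    logZ = _logsumexp(fwd[-1])
    return np.exp(fwd + bwd - logZ)            # per-position loss-augmented tag marginals

The analogous F-measure-augmented expectations for the parsing factor come from inside-outside over the state-split chart sketched after Figure 3.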

Pages 161-165: Results: Integrated Model

• F-measure loss for parsing sub-model (+DecF1).
• Hamming loss for supertagging sub-model (+Tagger).
• Belief propagation for inference.

[Bar chart of labelled F-measure: C&C '07 87.7, Integrated 88.9, +DecF1 89.2, +Tagger 89.3; +1.5 over the baseline.]

Pages 166-171: Results: Automatic POS

• F-measure loss for parsing sub-model (+DecF1).
• Hamming loss for supertagging sub-model (+Tagger).
• Belief propagation for inference.

[Bar chart of labelled F-measure with automatic POS tags: C&C '07 85.7, Petrov-I5 (Fowler & Penn, 2010) 86.0, Integrated 86.8, +DecF1 87.1, +Tagger 87.2; +1.5 over the baseline.]

Pages 172-177: Results: Efficiency vs. Accuracy

[Scatter plot of parsing speed (sentences/second, 0-110) against accuracy (labelled F-measure, 87-90); up is faster, right is better. Points shown for C&C, the Integrated Model, and Softmax-Margin Training.]

Page 178: Summary

• Softmax-Margin training is easy and improves our model.
• Approximate loss functions are fast, accurate and easy to use.
• Best ever CCG parsing results (87.7 → 89.3).

Page 179: Future Directions

• What can we do with the presented methods?
• BP for other complex problems, e.g. SMT.
• Semantics for SMT.
• Simultaneous parsing of multiple sentences.

Page 180: BP for other NLP pipelines

• Pipelines necessary for practical NLP systems.
• More accurate integrated models often too complex.
• This talk: approximate inference can make these models practical.
• Use it for other pipelines, e.g. POS, NER tagging & parsing.
• Hard: BP for syntactic MT, another weighted intersection problem between LM & TM.

Page 181: Semantics for SMT

• Compositional & distributional meaning representation to compute vectors of sentence meaning (Grefenstette & Sadrzadeh, 2011; Clark, to appear).
• Syntax (e.g. CCG) drives the compositional process.
• Directions: model optimisation, evaluation, LM.

[Figure: a translation and a reference compared in the meaning space.]

Page 182: Parsing beyond sentence-level

• Many NLP tasks (e.g. IE) rely on a uniform analysis of constituents.
• Skip-chain CRFs have been used successfully to predict consistent NER tags across sentences (Sutton & McCallum, 2004).
• Parse multiple sentences at once and enforce uniformity of parses.

Example:
1. "The securities and exchange commission issued ..." (analysed with categories NNP/N ... NP).
2. "... responded to the statement of the securities and exchange commission" (analysed with NP conj NP, NP\NP, NP) -- the shared constituent should receive a consistent analysis.

Page 183: Related Publications

• A Comparison of Loopy Belief Propagation and Dual Decomposition for Integrated CCG Supertagging and Parsing. With Adam Lopez. In Proc. of ACL, June 2011.
• Efficient CCG Parsing: A* versus Adaptive Supertagging. With Adam Lopez. In Proc. of ACL, June 2011.
• Training a Log-Linear Parser with Softmax-Margin. With Adam Lopez. In Proc. of EMNLP, July 2011.
• A Systematic Comparison of Translation Model Search Spaces. With Adam Lopez, Hieu Hoang, Philipp Koehn. In Proc. of WMT, March 2009.

Page 184:

Thank you