Upload
others
View
5
Download
0
Embed Size (px)
Citation preview
Integrated Supertagging and Parsing
Michael Auli !University of Edinburgh
Marcel proved completeness
Parsing
Marcel proved completeness
Parsing
NP NPVBD
VP
S
Marcel proved completeness
CCG Parsing
(S\NP)/NPNP NP
S\NP
S
Combinatory Categorial Grammar (CCG; Steedman 2000)
Marcel proved completeness
CCG Parsing
NP NP(S\NP)/NP
S\NP
S
<proved, (S\NP)/NP, completeness>
Marcel proved completeness
CCG Parsing
NP NP(S\NP)/NP
S\NP
S
<proved, (S\NP)/NP, completeness><proved, (S\NP)/NP, Marcel>
Why CCG Parsing?
• MT: Can analyse nearly any span in a sentence (Auli ’09; Mehay ‘10; Zhang & Clark 2011; Weese et. al. ’12)
e.g. “conjectured and proved completeness” ⊢S\NP!
• Composition of regular and context-free languages -- mirrors situation in syntactic MT (Auli & Lopez, ACL 2011)!
• Transparent interface to semantics (Bos et al. 2004) e.g. proved ⊢ (S\NP)/NP : λx.λy.proved’ xy
Marcel proved completeness
CCG Parsing is hard!
Over 22 tags per word! (Clark & Curran 2004)
NP NP(S\NP)/NP>
S\NP<
S
Marcel proved completeness
Supertagging
Marcel proved completeness
Supertagging
NP NP(S\NP)/NP
Marcel proved completeness
Supertagging
NP NP(S\NP)/NP>
S\NP<
S
Supertagging
time flies like an arrowNP NPS\NP NP/NP(S\NP)/NP
Supertagging
time flies like an arrowNP NPS\NP NP/NP
✗(S\NP)/NP
The Problem• Supertagger has no sense of overall grammaticality.!
• But parser restricted by its decisions.!
• Supertagger probabilities not used in parser.
The Problem• Supertagger has no sense of overall grammaticality.!
• But parser restricted by its decisions.!
• Supertagger probabilities not used in parser.
supertag sequences
The Problem• Supertagger has no sense of overall grammaticality.!
• But parser restricted by its decisions.!
• Supertagger probabilities not used in parser.
supertag sequences
supertagger
parser
The Problem• Supertagger has no sense of overall grammaticality.!
• But parser restricted by its decisions.!
• Supertagger probabilities not used in parser.
supertag sequences
supertagger
parser
The Problem• Supertagger has no sense of overall grammaticality.!
• But parser restricted by its decisions.!
• Supertagger probabilities not used in parser.
supertag sequences
supertagger
parser
The Problem• Supertagger has no sense of overall grammaticality.!
• But parser restricted by its decisions.!
• Supertagger probabilities not used in parser.
supertag sequences
parsersupertagger
This talk
• Analysis of state-of-the-art approach Trade-off between efficiency and accuracy (ACL 2011a)
• Integrated supertagging and parsing with Loopy Belief Propagation and Dual Decomposition (ACL 2011b)
• Training the integrated model with Softmax-Margin towards task-specific metrics (EMNLP 2011)
Methods achieve most accurate CCG parsing results.
This talk
• Analysis of state-of-the-art approach Trade-off between efficiency and accuracy (ACL 2011a)
• Integrated supertagging and parsingwith Loopy Belief Propagation and Dual Decomposition (ACL 2011b)
• Training the integrated model with Softmax-Margin towards task-specific metrics (EMNLP 2011)
Methods achieve
Adaptive Supertagging
time flies like an arrowNP NPS\NP (S\NP)/NP NP/NP
Adaptive Supertagging
time flies like an arrowNP NPS\NP (S\NP)/NP NP/NP
((S\NP)\(S\NP))/NP....
....
NPNP/NP
... ...
... ...
Clark & Curran (2004)
Adaptive Supertagging
• Algorithm:!
• Run supertagger.!
• Return tags with posterior higher than some alpha.!
• Parse by combining tags (CKY).!
• If parsing succeeds, stop.!
• If parsing fails, lower alpha and repeat.
Clark & Curran (2004)
Adaptive Supertagging
• Algorithm:!
• Run supertagger.!
• Return tags with posterior higher than some alpha.!
• Parse by combining tags (CKY).!
• If parsing succeeds, stop.!
• If parsing fails, lower alpha and repeat.
• Q: are parses returned in early rounds suboptimal?
Clark & Curran (2004)
Adaptive Supertagging
Answer... L
abel
led
F-sc
ore
92
95
97
100
Tight beam Loose beam
Oracle parsing (Huang 2008)
Standard parsing (Clark and Curran 2007)
Answer... L
abel
led
F-sc
ore
92
95
97
100
Tight beam Loose beam
Oracle parsing (Huang 2008)
Standard parsing (Clark and Curran 2007)
Answer... L
abel
led
F-sc
ore
92
95
97
100
Tight beam Loose beam
Oracle parsing (Huang 2008)
Standard parsing (Clark and Curran 2007)
Answer... L
abel
led
F-sc
ore
92
95
97
100
Tight beam Loose beam
Oracle parsing (Huang 2008)
Lab
elle
d F-
scor
e
85
87
88
90
Standard parsing (Clark and Curran 2007)
88.2$
88.4$
88.6$
88.8$
89.0$
89.2$
89.4$
89.6$
89.8$
85600$
85800$
86000$
86200$
86400$
86600$
86800$
87000$
87200$
87400$
0.075$
0.03$
0.01$
0.005$
0.001$
0.0005$
0.0001$
0.00005$
0.00001$
Labe
lleld'F)score'
Mod
el'sc
ore'
Supertagger'beam'
Model$score$ F6measure$
Parsing
Note: only sentences parsable at all beam settings.
least aggressive
most aggressive
Parsing
Note: only sentences parsable at all beam settings.
88.2$
88.4$
88.6$
88.8$
89.0$
89.2$
89.4$
89.6$
89.8$
85600$
85800$
86000$
86200$
86400$
86600$
86800$
87000$
87200$
87400$
0.075$
0.03$
0.01$
0.005$
0.001$
0.0005$
0.0001$
0.00005$
0.00001$
Labe
lleld'F)score'
Mod
el'sc
ore'
Supertagger'beam'
Model$score$ F6measure$
least aggressive
most aggressive
Oracle Parsing
Note: only sentences parsable at all beam settings.
93.5%
94.0%
94.5%
95.0%
95.5%
96.0%
96.5%
97.0%
97.5%
98.0%
98.5%
82500%
83000%
83500%
84000%
84500%
85000%
0.075%
0.03%
0.01%
0.005%
0.001%
0.0005%
0.0001%
0.00005%
0.00001%
Labe
lleld'F)score'
Mod
el'sc
ore'
Supertagger'beam'
Model%score% F6measure%
least aggressive
most aggressive
What’s happening here?
What’s happening here?
• Supertagger keeps parser from making serious errors.
What’s happening here?
• Supertagger keeps parser from making serious errors.
• But it also occasionally prunes away useful parses.
What’s happening here?
• Supertagger keeps parser from making serious errors.
• But it also occasionally prunes away useful parses.
• Why not combine supertagger and parser into one?
Overview
• Analysis of state-of-the-art approach Trade-off between efficiency and accuracy (ACL 2011a)
• Integrated supertagging and parsing with Loopy Belief Propagation and Dual Decomposition (ACL 2011b)
• Training the integrated model with Softmax-Margin towards task-specific metrics (EMNLP 2011)
Overview
• Analysis of state-of-the-art approachTrade-off between efficiency and accuracy (ACL 2011a)
• Integrated supertagging and parsing with Loopy Belief Propagation and Dual Decomposition (ACL 2011b)
• Training the integrated model with Softmax-Margin towards task-specific metrics (EMNLP 2011)
Integrated Model• Supertagger & parser are log-linear models.!
• Idea: combine their features into one model.!
• Problem: Exact computation of marginal or maximum quantities becomes very expensive because parsing and tagging submodels must agree on the tag sequence.
Integrated Model• Supertagger & parser are log-linear models.!
• Idea: combine their features into one model.!
• Problem: Exact computation of marginal or maximum quantities becomes very expensive because parsing and tagging submodels must agree on the tag sequence.
B C → A O(Gn3)original parsing problem:
Integrated Model• Supertagger & parser are log-linear models.!
• Idea: combine their features into one model.!
• Problem: Exact computation of marginal or maximum quantities becomes very expensive because parsing and tagging submodels must agree on the tag sequence.
B C → A O(Gn3)original parsing problem:
qBs sCr → qAr O(G3n3)new parsing problem:
Integrated Model• Supertagger & parser are log-linear models.!
• Idea: combine their features into one model.!
• Problem: Exact computation of marginal or maximum quantities becomes very expensive because parsing and tagging submodels must agree on the tag sequence.
Intersection of a regular and context-free language!(Bar-Hillel et al. 1964)
B C → A O(Gn3)original parsing problem:
qBs sCr → qAr O(G3n3)new parsing problem:
Approximate Algorithms
• Loopy belief propagation: approximate calculation of marginals. (Pearl 1988; Smith & Eisner 2008)!
• Dual decomposition: exact (sometimes) calculation of maximum. (Dantzig & Wolfe 1960; Komodakis et al. 2007; Koo et al. 2010)
Belief Propagation
Belief Propagation
Forward-backward is belief propagation (Smyth et al. 1997)
Belief Propagation
Marcel proved completeness
start stop
Forward-backward is belief propagation (Smyth et al. 1997)
Belief Propagation
Marcel proved completeness
start stop
e1,j e2,j e3,j
ei,jemission message:fi,j =
X
j0
fi�1,j0ei�1,j0tj0,jforward message:
f1,j f2,j f3,j
backward message: bi,j =X
j0
bi+1,j0ei+1,j0tj,j0
b1,j b2,j b3,j
pi,j =1Z
fi,jei,jbi,jbelief (probability) that tag j is at position i:
Forward-backward is belief propagation (Smyth et al. 1997)
Belief Propagation
Marcel proved completeness
Notational convenience: one factor describes whole!distribution over supertag sequence...
span variables
Belief Propagation
Marcel proved completeness
We can also do the same for the distribution over parse trees!(Case-factor diagrams: McAllester et al. 2008)
0S3 0NP3
0NP2 1S\NP3
0NP1 1S\NP/NP2 2NP3
span variables
Belief Propagation
Marcel proved completeness
We can also do the same for the distribution over parse trees!(Case-factor diagrams: McAllester et al. 2008)
Inside-outside is belief propagation (Sato 2007)
0S3 0NP3
0NP2 1S\NP3
0NP1 1S\NP/NP2 2NP3
Belief Propagation
Marcel proved completeness
parsing factor
supertagging factor
Belief Propagation
Marcel proved completeness
parsing factor
supertagging factor
Graph is not a tree!
Loopy Belief Propagation
Marcel proved completeness
Graph is not a tree!
parsing factor
supertagging factor
Loopy Belief Propagation
Marcel proved completeness
Graph is not a tree!
forward-backward
parsing factor
supertagging factor
Loopy Belief Propagation
Marcel proved completeness
Graph is not a tree!
forward-backward
inside-outside
parsing factor
supertagging factor
Loopy Belief Propagation
Marcel proved completeness
Graph is not a tree!
forward-backward
inside-outside
parsing factor
supertagging factor
Loopy Belief Propagation
Marcel proved completeness
Graph is not a tree!
pi,j =1Z
fi,jei,jbi,joi,j
forward-backward
inside-outside
parsing factor
supertagging factor
Loopy Belief Propagation
Marcel proved completeness
Graph is not a tree!
pi,j =1Z
fi,jei,jbi,joi,j
forward-backward
inside-outside
parsing factor
supertagging factor
!
• Computes approximate marginals, no guarantees.!
• Complexity is additive:!
• Used to compute minimum-risk parse (Goodman 1996).
Loopy Belief Propagation
O(Gn3 +Gn)
Dual Decomposition
Marcel proved completeness
parsing factor
supertagging factor
Dual Decomposition
Marcel proved completeness
parsing factor
supertagging factor
f(y)
g(z)
Dual Decomposition
Marcel proved completeness
parsing factor
supertagging factor
f(y)
g(z)
arg max
y,zf(y) + g(z) y(i, t) = z(i, t)s.t. for all i, t
Dual Decompositionarg max
y,zf(y) + g(z) y(i, t) = z(i, t)s.t. for all i, t
Dual Decompositionarg max
y,zf(y) + g(z) y(i, t) = z(i, t)s.t. for all i, t
L(u) = max
yf(y) +
X
i,t
u(i, t) · y(i, t)
+ max
zg(z)�
X
i,t
u(i, t) · z(i, t)
Dual Decompositionarg max
y,zf(y) + g(z) y(i, t) = z(i, t)s.t. for all i, t
relaxed!original!problem
L(u) = max
yf(y) +
X
i,t
u(i, t) · y(i, t)
+ max
zg(z)�
X
i,t
u(i, t) · z(i, t)
Dual Decompositionarg max
y,zf(y) + g(z) y(i, t) = z(i, t)s.t. for all i, t
modified!subproblem
L(u) = max
yf(y) +
X
i,t
u(i, t) · y(i, t)
+ max
zg(z)�
X
i,t
u(i, t) · z(i, t)
Dual Decompositionarg max
y,zf(y) + g(z) y(i, t) = z(i, t)s.t. for all i, t
L(u) = max
yf(y) +
X
i,t
u(i, t) · y(i, t)
+ max
zg(z)�
X
i,t
u(i, t) · z(i, t)
u(i, t)Dual objective: find assignment of that minimises L(u)
Dual Decomposition
u(i, t) = u(i, t) + ↵ · [y(i, t)� z(i, t)] (Rush et al. 2010)
arg max
y,zf(y) + g(z) y(i, t) = z(i, t)s.t. for all i, t
L(u) = max
yf(y) +
X
i,t
u(i, t) · y(i, t)
+ max
zg(z)�
X
i,t
u(i, t) · z(i, t)
u(i, t)Dual objective: find assignment of that minimises L(u)
Solution provably solves original problem.
Dual Decomposition
Marcel proved completeness
parsing factor
supertagging factor
Dual Decomposition
Marcel proved completeness
parsing factor
supertagging factor Viterbi tags
Viterbi parse
Dual Decomposition
Marcel proved completeness
parsing factor
supertagging factor Viterbi tags
Viterbi parse
“Message passing” (Komodakis et al. 2007)
• Computes exact maximum, if it converges.!
• Otherwise: return best parse seen (approximation).!
• Complexity is additive:!
• Use to compute Viterbi solutions.
Dual Decomposition
O(Gn3 +Gn)
Experiments
• Standard parsing task:!
• C&C Parser and supertagger (Clark & Curran 2007).!
• CCGBank standard train/dev/test splits.!
• Piecewise optimisation (Sutton and McCallum 2005)!
• Approximate algorithms used to decode test set.
Experiments: Accuracy over time
Experiments: Accuracy over time
tight search (AST)
loose search (Rev)
Experiments: Convergence
Experiments: Convergence
Dual decomposition exact in 99.7% of cases What about belief propagation?
Experiments: BP Exactness
Experiments: BP Exactness
90#
92#
94#
96#
98#
100#
1# 10# 100# 1000#
Match&(%
)&
Itera-ons&
match#DD#k=1000#
match#BP#k=1000#
Experiments: BP Exactness
90#
92#
94#
96#
98#
100#
1# 10# 100# 1000#
Match&(%
)&
Itera-ons&
match#DD#k=1000#
match#BP#k=1000#
Instantly, 91% match final DD solutions!
Takes DD 15 iterations to reach same level.
Experiments: Accuracy
87
87.5
88
88.5
89
Tight beam Loose beam
Baseline Belief Propagation Dual Decomposition
Test set results
Labe
lled
F-m
easu
re
Experiments: Accuracy
87
87.5
88
88.5
89
Tight beam Loose beam
Baseline Belief Propagation Dual Decomposition
88.188.3
87.7
Test set results
Labe
lled
F-m
easu
re
Experiments: Accuracy
87
87.5
88
88.5
89
Tight beam Loose beam
Baseline Belief Propagation Dual Decomposition
88.8
88.1
88.9
88.3
87.787.7
Note: BP accuracy after 1 iteration; DD accuracy after 25 iterations
Test set results
Labe
lled
F-m
easu
re
+1.1
Experiments: Accuracy
87
87.5
88
88.5
89
Tight beam Loose beam
Baseline Belief Propagation Dual Decomposition
88.8
88.1
88.9
88.3
87.787.7
Note: BP accuracy after 1 iteration; DD accuracy after 25 iterations
Test set results
Labe
lled
F-m
easu
re
+1.1
Best published result
Oracle Results Again
89.4%
89.5%
89.6%
89.7%
89.8%
89.9%
90.0%
60000%
80000%
100000%
120000%
140000%
160000%
180000%
200000%
0.075%
0.03%
0.01%
0.005%
0.001%
0.0005%
0.0001%
0.00005%
0.00001%
Labe
lleld'F)score'
Mod
el'sc
ore'
Supertagger'beam'
Model%score% F6measure%
Belief Propagation
89.2%
89.3%
89.4%
89.5%
89.6%
89.7%
89.8%
89.9%
85200%
85400%
85600%
85800%
86000%
86200%
86400%
0.075%
0.03%
0.01%
0.005%
0.001%
0.0005%
0.0001%
0.00005%
0.00001%
Labe
lleld'F)score'
Mod
el'sc
ore'
Supertagger'beam'
Model%score% F6measure%
Dual Decomposition
Summary so far
• Supertagging efficiency comes at the cost of accuracy.!
• Interaction between parser and supertagger can be exploited in an integrated model.!
• Practical inference for complex integrated model.!
• First empirical comparison between dual decomposition and belief propagation on NLP task.!
• Loopy belief propagation is fast, accurate and exact.
Overview
• Analysis of state-of-the-art approach Trade-off between efficiency and accuracy (ACL 2011a)
• Integrated supertagging and parsing with Loopy Belief Propagation and Dual Decomposition (ACL 2011b)
• Training the integrated model with Softmax-Margin towards task-specific metrics (EMNLP 2011)
Overview
• Analysis of state-of-the-art approachTrade-off between efficiency and accuracy (ACL 2011a)
• Integrated supertagging and parsingwith Loopy Belief Propagation and Dual Decomposition (ACL 2011b)
• Training the integrated model with Softmax-Margin towards task-specific metrics (EMNLP 2011)
Training the Integrated Model
• So far optimised Conditional Log-Likelihood (CLL).!
• Optimise towards task-specific metric e.g. F1 such as in SMT (Och, 2003).!
• Past work used approximations to Precision (Taskar et al. 2004).!
• Contribution: Do it exactly and verify approximations.
Marcel proved completeness
Parsing Metrics
NP NP(S\NP)/NP
S\NP
S
CCG: Labelled, directed dependency recovery (Clark & Hockenmaier, 2002)
<proved, (S\NP)/NP, completeness><proved, (S\NP)/NP, Marcel>
Evaluate this}
Marcel proved completeness
Parsing Metrics
NP NP(S\NP)/NP
S\NP
S
CCG: Labelled, directed dependency recovery (Clark & Hockenmaier, 2002)
<proved, (S\NP)/NP, completeness><proved, (S\NP)/NP, Marcel>
Not this!
Marcel proved completeness
Parsing Metrics
NP NP(S\NP)/NP
S\NP
S
CCG: Labelled, directed dependency recovery (Clark & Hockenmaier, 2002)
<proved, (S\NP)/NP, completeness><proved, (S\NP)/NP, Marcel>
y = dependencies in ground truthy’ = dependencies in proposed output
correct dependencies returned all dependencies returned
|y \ y0| = n
|y0| = d
Parsing Metrics
y = dependencies in ground truthy’ = dependencies in proposed output
Precision P (y, y0) =|y \ y0||y0| =
n
d
correct dependencies returned all dependencies returned
|y \ y0| = n
|y0| = d
Recall R(y, y0) =|y \ y0||y| =
n
|y|
F-measure F1(y, y0) =
2PR
P +R=
2|y \ y0||y|+ |y0| =
2n
d+ |y|
Parsing Metrics
Softmax-Margin Training
• Discriminative.!
• Probabilistic.!
• Convex objective.!
• Minimises bound on expected risk for a given loss function.!
• Requires little change to existing CLL implementation.
(Sha & Saul, 2006; Povey & Woodland,2008; Gimpel & Smith, 2010)
Softmax-Margin Training
CLL: min�
m�
i=1
⇤
⇧��Tf(x(i), y(i)) + log�
y�Y(x(i))
exp{�Tf(x(i), y)}
⌅
⌃ (2)
min�
m�
i=1
⇤
⇧��Tf(x(i), y(i)) + log�
y�Y(x(i))
exp{�Tf(x(i), y) + �(y(i), y)}
⌅
⌃ (3)
⌥
⌥⇥k=
m�
i=1
��hk(x
(i), y(i))⇥k +exp{�Tf(x(i), y(i))}⌥
y�Y(x(i)) exp{�Tf(x(i), y) + �(y(i), y)}hk(x
(i), y(i))⇥k
⇥(4)
Figure 1: Conditional log-likelihood (Eq. 2), Softmax-margin objective (Eq. 3) and gradient (Eq. 4).
Draft, do not circulate without permission.
Ai,i+1,n+(ai:A),d+(ai:A) ⇥= w(ai : A)
Ai,j,n+n�+n+(BC�A),d+d�+d+(BC�A) ⇥= Bi,k,n,d ⇤ Ck,j,n�,d� ⇤ w(BC ⌅ A)
GOAL ⇥= S0,N,n,d ⇤�1� 2n
d+ |y|
⇥
Figure 2: State-split inside algorithm for computing softmax-margin with F-measure.
counts, n+ and d+:
DecP = d+ � n+ (6)
Recall requires the number of gold standard de-pendencies, y+, which should have been recoveredin a particular state; we compute it as follows: Agold dependency is due to be recovered if its headlies within the span of one of its children and the de-pendent in the other. With this we can compute thedecomposed recall:
DecR = y+ � n+ (7)
However, there is one issue with our formulationof y+ with CCG and its way of dealing withdependencies that makes our formulation slightlymore approximate: The unification mechanism ofCCG allows to realise dependencies later in thederivation when both the head and dependent arein the same span (Figure 5). This makes usingthe proposed decomposed recall difficult as ourgold-dependency count y+ may under or over-statethe number of correct dependencies n+. Given thatthis loss function is an approximation, we deal withthis inconsistency via setting y+ = n+ whenevery+ < n+ to account for gold-dependencies whichhave not been correctly classified by our method.
likes apples and pears
(S\NP)/NP NP CONJ NP<�>
NP\NP<�>
NP<
(S\NP)
Dependencies:and - pearsand - appleslikes - pears, likes - apples
Figure 5: Example illustrating handling of conjunctionsin CCG: .
Finally, decomposed F-measure is simply the sumof the two decomposed losses:
DecF1 = (d+ � n+) + (y+ � n+) (8)
5 Experiments
Parsing Strategies. The most successful approachto CCG parsing is based on a pipeline strategy: First,we tag (or multitag) each word of the sentence witha lexical category using a supertagger, a sequencemodel over these categories (Bangalore and Joshi,1999; Clark, 2002). Second, we parse the sentenceunder the requirement that the lexical categories arefixed to those preferred by the supertagger.
Pruning the categories in advance this way has aspecific failure mode: sometimes it is not possibleto produce a sentence-spanning derivation from thetag sequences preferred by the supertagger, since itdoes not enforce grammaticality. A workaround forthis problem is the adaptive supertagging (AST) ap-proach of Clark and Curran (2004). It is based on astep function over supertagger beam widths, relax-ing the pruning threshold for lexical categories onlyif the parser fails to find an analysis. The process ei-ther succeeds and returns a parse after some iterationor gives up after a predefined number of iterations.As Clark and Curran (2004) show, most sentencescan be parsed with a very small number of supertagsper word. However, the technique is inherently ap-proximate: it will return a lower probability parseunder the parsing model if a higher probability parsecan only be constructed from a supertag sequencereturned by a subsequent iteration. In this way it pri-oritizes speed over exactness, although the tradeoffcan be modified by adjusting the beam step func-tion. Regardless, the effect of the approximation isunbounded.
Reverse adaptive supertagging is a much less ag-gressive method that seeks only to make sentencesparseable when they otherwise would not be due toan impractically large search space. Reverse ASTstarts with a wide beam, narrowing it at each itera-tion only if a maximum chart size is exceeded. Theused beam settings for both strategies during testingare in Table 1.Parser. We use the C&C parser (Clark and Cur-ran, 2007) and its supertagger (Clark, 2002). Ourbaseline is the hybrid model of Clark and Curran
likes apples and pears
(S\NP)/NP NP CONJ NP<�>
NP\NP<
NP>
(S\NP)
Figure 3: Example of flexible dependency realisation inCCG: Our parser (Clark and Curran, 2007) creates depen-dencies arising from coordination once all conjuncts werefound. The first application of the coordination rule (�)only notes the dependency “and - pears” (dotted line); thesecond application in the larger span, “apples and pears”,realises it, together with “and - apples”.
same span, violating the assumption used to com-pute y+ (Figure 3). Exceptions like this can causemismatches between n+ and y+. We set y+ = n+
whenever y+ < n+ to account for these occasionaldiscrepancies.
Finally, we obtain a decomposed approximationto F-measure.
DecF1 = DecP +DecR (10)
4 Experiments
Parsing Strategy. The most successful approach toCCG parsing is based on a pipeline strategy: First,
we tag (or multitag) each word of the sentence witha lexical category using a supertagger, a sequencemodel over these categories (Bangalore and Joshi,1999; Clark, 2002). Second, we parse the sentenceunder the requirement that the lexical categories arefixed to those preferred by the supertagger. In ourexperiments we used two variants on this strategy.
Pruning the categories in advance has a specificfailure mode: sometimes it is not possible to pro-duce a sentence-spanning derivation from the tag se-quences preferred by the supertagger, since it doesnot enforce grammaticality. A workaround for thisproblem is the adaptive supertagging (AST) ap-proach of Clark and Curran (2004). It is based ona step function over supertagger beam widths, relax-ing the pruning threshold for lexical categories onlyif the parser fails to find an analysis. The process ei-ther succeeds and returns a parse after some iterationor gives up after a predefined number of iterations.As Clark and Curran (2004) show, most sentencescan be parsed with a very small number of supertagsper word.
Reverse adaptive supertagging is a much less ag-gressive method that seeks only to make sentencesparseable when they otherwise would not be due toan impractically large search space. Reverse ASTstarts with a wide beam, narrowing it at each itera-tion only if a maximum chart size is exceeded. Ourbeam settings for both strategies during testing arein Table 1.
Adaptive supertagging aims for speed via pruningwhile the reverse strategy aims for accuracy by ex-posing the parser to a larger search space. AlthoughClark and Curran (2007) found no actual improve-ments from the latter strategy, we will show that withsome models it can have a substantial effect.Parser. We use the C&C parser (Clark and Curran,2007) and its supertagger (Clark, 2002). Our base-
Figure 2: Example of flexible dependency realisation inCCG: Our parser (Clark and Curran, 2007) creates de-pendencies arising from coordination once all conjunctsare found and treats “and” as the syntactic head of coor-dinations. The coordination rule (�) does not yet estab-lish the dependency “and - pears” (dotted line); it is thebackward application (<) in the larger span, “apples andpears”, that establishes it, together with “and - pears”.CCG also deals with unbounded dependencies which po-tentially lead to more dependencies than words (Steed-man, 2000); in this example a unification mechanism cre-ates the dependencies “likes - apples” and “likes - pears”in the forward application (>).
The key idea will be to treat F1 as a non-local fea-ture of the parse, dependent on values n and d.2 Tocompute expectations we split each span in an other-wise usual CKY program by all pairs �n, d incidentat that span. Since we anticipate the number of thesesplits to be approximately linear in sentence length,the algorithm’s complexity remains manageable.
Formally, our goal will be to compute expecta-2This is essentially the same trick used in the oracle F-
measure algorithm of Huang (2008), and indeed our algorithmis a sum-product variant of that max-product algorithm.
tions over the sentence a1...aN . In order to abstractaway from the particulars of CCG and present thealgorithm in relatively familiar terms as a variantof CKY, we will use the notation ai : A for lexi-cal entries and BC ⇧ A to indicate that categoriesB and C combine to form category A via forwardor backward composition or application.3 Item Ai,j
accumulates the inside score associated with cate-gory A spanning i, j, computed with the usual in-side algorithm, written here as a series of recursiveequations:
Ai,i+1 ⇤= w(ai : A)
Ai,j ⇤= Bi,k ⌅ Ck,j ⌅ w(BC ⇧ A)
GOAL ⇤= S0,N
Our algorithm computes expectations on state-split items Ai,j,n,d.4 Let functions n+(·) and d+(·)respectively represent the number of correct and to-tal dependencies introduced by a parsing action. Wecan now present the state-split variant of the insidealgorithm in Fig. 3. The final recursion simply incor-porates the loss function for all derivations having aparticular F-score; by running the full inside-outsidealgorithm on this state-split program, we obtain thedesired expectations.5 A simple modification of theweight on the goal transition enables us to optimiseprecision, recall or a weighted F-measure.
3These correspond to unary rules A ! ai and binary rulesA ! BC in a context-free grammar in Chomsky normal form.
4Here we use state-splitting to refer to splitting an item Ai,j
into many items Ai,j,n,d, one for each hn, di pair.5The outside equations can be easily derived from the inside
algorithm, or mechanically using the reverse values of Good-man (1999).
Softmax-Margin Training
CLL: min�
m�
i=1
⇤
⇧��Tf(x(i), y(i)) + log�
y�Y(x(i))
exp{�Tf(x(i), y)}
⌅
⌃ (2)
min�
m�
i=1
⇤
⇧��Tf(x(i), y(i)) + log�
y�Y(x(i))
exp{�Tf(x(i), y) + �(y(i), y)}
⌅
⌃ (3)
⌥
⌥⇥k=
m�
i=1
��hk(x
(i), y(i))⇥k +exp{�Tf(x(i), y(i))}⌥
y�Y(x(i)) exp{�Tf(x(i), y) + �(y(i), y)}hk(x
(i), y(i))⇥k
⇥(4)
Figure 1: Conditional log-likelihood (Eq. 2), Softmax-margin objective (Eq. 3) and gradient (Eq. 4).
Draft, do not circulate without permission.
Ai,i+1,n+(ai:A),d+(ai:A) ⇥= w(ai : A)
Ai,j,n+n�+n+(BC�A),d+d�+d+(BC�A) ⇥= Bi,k,n,d ⇤ Ck,j,n�,d� ⇤ w(BC ⌅ A)
GOAL ⇥= S0,N,n,d ⇤�1� 2n
d+ |y|
⇥
Figure 2: State-split inside algorithm for computing softmax-margin with F-measure.
counts, n+ and d+:
DecP = d+ � n+ (6)
Recall requires the number of gold standard de-pendencies, y+, which should have been recoveredin a particular state; we compute it as follows: Agold dependency is due to be recovered if its headlies within the span of one of its children and the de-pendent in the other. With this we can compute thedecomposed recall:
DecR = y+ � n+ (7)
However, there is one issue with our formulationof y+ with CCG and its way of dealing withdependencies that makes our formulation slightlymore approximate: The unification mechanism ofCCG allows to realise dependencies later in thederivation when both the head and dependent arein the same span (Figure 5). This makes usingthe proposed decomposed recall difficult as ourgold-dependency count y+ may under or over-statethe number of correct dependencies n+. Given thatthis loss function is an approximation, we deal withthis inconsistency via setting y+ = n+ whenevery+ < n+ to account for gold-dependencies whichhave not been correctly classified by our method.
likes apples and pears
(S\NP)/NP NP CONJ NP<�>
NP\NP<�>
NP<
(S\NP)
Dependencies:and - pearsand - appleslikes - pears, likes - apples
Figure 5: Example illustrating handling of conjunctionsin CCG: .
Finally, decomposed F-measure is simply the sumof the two decomposed losses:
DecF1 = (d+ � n+) + (y+ � n+) (8)
5 Experiments
Parsing Strategies. The most successful approachto CCG parsing is based on a pipeline strategy: First,we tag (or multitag) each word of the sentence witha lexical category using a supertagger, a sequencemodel over these categories (Bangalore and Joshi,1999; Clark, 2002). Second, we parse the sentenceunder the requirement that the lexical categories arefixed to those preferred by the supertagger.
Pruning the categories in advance this way has aspecific failure mode: sometimes it is not possibleto produce a sentence-spanning derivation from thetag sequences preferred by the supertagger, since itdoes not enforce grammaticality. A workaround forthis problem is the adaptive supertagging (AST) ap-proach of Clark and Curran (2004). It is based on astep function over supertagger beam widths, relax-ing the pruning threshold for lexical categories onlyif the parser fails to find an analysis. The process ei-ther succeeds and returns a parse after some iterationor gives up after a predefined number of iterations.As Clark and Curran (2004) show, most sentencescan be parsed with a very small number of supertagsper word. However, the technique is inherently ap-proximate: it will return a lower probability parseunder the parsing model if a higher probability parsecan only be constructed from a supertag sequencereturned by a subsequent iteration. In this way it pri-oritizes speed over exactness, although the tradeoffcan be modified by adjusting the beam step func-tion. Regardless, the effect of the approximation isunbounded.
Reverse adaptive supertagging is a much less ag-gressive method that seeks only to make sentencesparseable when they otherwise would not be due toan impractically large search space. Reverse ASTstarts with a wide beam, narrowing it at each itera-tion only if a maximum chart size is exceeded. Theused beam settings for both strategies during testingare in Table 1.Parser. We use the C&C parser (Clark and Cur-ran, 2007) and its supertagger (Clark, 2002). Ourbaseline is the hybrid model of Clark and Curran
likes apples and pears
(S\NP)/NP NP CONJ NP<�>
NP\NP<
NP>
(S\NP)
Figure 3: Example of flexible dependency realisation inCCG: Our parser (Clark and Curran, 2007) creates depen-dencies arising from coordination once all conjuncts werefound. The first application of the coordination rule (�)only notes the dependency “and - pears” (dotted line); thesecond application in the larger span, “apples and pears”,realises it, together with “and - apples”.
same span, violating the assumption used to com-pute y+ (Figure 3). Exceptions like this can causemismatches between n+ and y+. We set y+ = n+
whenever y+ < n+ to account for these occasionaldiscrepancies.
Finally, we obtain a decomposed approximationto F-measure.
DecF1 = DecP +DecR (10)
4 Experiments
Parsing Strategy. The most successful approach toCCG parsing is based on a pipeline strategy: First,
we tag (or multitag) each word of the sentence witha lexical category using a supertagger, a sequencemodel over these categories (Bangalore and Joshi,1999; Clark, 2002). Second, we parse the sentenceunder the requirement that the lexical categories arefixed to those preferred by the supertagger. In ourexperiments we used two variants on this strategy.
Pruning the categories in advance has a specificfailure mode: sometimes it is not possible to pro-duce a sentence-spanning derivation from the tag se-quences preferred by the supertagger, since it doesnot enforce grammaticality. A workaround for thisproblem is the adaptive supertagging (AST) ap-proach of Clark and Curran (2004). It is based ona step function over supertagger beam widths, relax-ing the pruning threshold for lexical categories onlyif the parser fails to find an analysis. The process ei-ther succeeds and returns a parse after some iterationor gives up after a predefined number of iterations.As Clark and Curran (2004) show, most sentencescan be parsed with a very small number of supertagsper word.
Reverse adaptive supertagging is a much less ag-gressive method that seeks only to make sentencesparseable when they otherwise would not be due toan impractically large search space. Reverse ASTstarts with a wide beam, narrowing it at each itera-tion only if a maximum chart size is exceeded. Ourbeam settings for both strategies during testing arein Table 1.
Adaptive supertagging aims for speed via pruningwhile the reverse strategy aims for accuracy by ex-posing the parser to a larger search space. AlthoughClark and Curran (2007) found no actual improve-ments from the latter strategy, we will show that withsome models it can have a substantial effect.Parser. We use the C&C parser (Clark and Curran,2007) and its supertagger (Clark, 2002). Our base-
Figure 2: Example of flexible dependency realisation inCCG: Our parser (Clark and Curran, 2007) creates de-pendencies arising from coordination once all conjunctsare found and treats “and” as the syntactic head of coor-dinations. The coordination rule (�) does not yet estab-lish the dependency “and - pears” (dotted line); it is thebackward application (<) in the larger span, “apples andpears”, that establishes it, together with “and - pears”.CCG also deals with unbounded dependencies which po-tentially lead to more dependencies than words (Steed-man, 2000); in this example a unification mechanism cre-ates the dependencies “likes - apples” and “likes - pears”in the forward application (>).
The key idea will be to treat F1 as a non-local fea-ture of the parse, dependent on values n and d.2 Tocompute expectations we split each span in an other-wise usual CKY program by all pairs �n, d incidentat that span. Since we anticipate the number of thesesplits to be approximately linear in sentence length,the algorithm’s complexity remains manageable.
Formally, our goal will be to compute expecta-2This is essentially the same trick used in the oracle F-
measure algorithm of Huang (2008), and indeed our algorithmis a sum-product variant of that max-product algorithm.
tions over the sentence a1...aN . In order to abstractaway from the particulars of CCG and present thealgorithm in relatively familiar terms as a variantof CKY, we will use the notation ai : A for lexi-cal entries and BC ⇧ A to indicate that categoriesB and C combine to form category A via forwardor backward composition or application.3 Item Ai,j
accumulates the inside score associated with cate-gory A spanning i, j, computed with the usual in-side algorithm, written here as a series of recursiveequations:
Ai,i+1 ⇤= w(ai : A)
Ai,j ⇤= Bi,k ⌅ Ck,j ⌅ w(BC ⇧ A)
GOAL ⇤= S0,N
Our algorithm computes expectations on state-split items Ai,j,n,d.4 Let functions n+(·) and d+(·)respectively represent the number of correct and to-tal dependencies introduced by a parsing action. Wecan now present the state-split variant of the insidealgorithm in Fig. 3. The final recursion simply incor-porates the loss function for all derivations having aparticular F-score; by running the full inside-outsidealgorithm on this state-split program, we obtain thedesired expectations.5 A simple modification of theweight on the goal transition enables us to optimiseprecision, recall or a weighted F-measure.
3These correspond to unary rules A ! ai and binary rulesA ! BC in a context-free grammar in Chomsky normal form.
4Here we use state-splitting to refer to splitting an item Ai,j
into many items Ai,j,n,d, one for each hn, di pair.5The outside equations can be easily derived from the inside
algorithm, or mechanically using the reverse values of Good-man (1999).
weights true outputfeatures input proposed outputpossible outputs
training examples
Softmax-Margin Training
CLL: min�
m�
i=1
⇤
⇧��Tf(x(i), y(i)) + log�
y�Y(x(i))
exp{�Tf(x(i), y)}
⌅
⌃ (2)
min�
m�
i=1
⇤
⇧��Tf(x(i), y(i)) + log�
y�Y(x(i))
exp{�Tf(x(i), y) + �(y(i), y)}
⌅
⌃ (3)
⌥
⌥⇥k=
m�
i=1
��hk(x
(i), y(i))⇥k +exp{�Tf(x(i), y(i))}⌥
y�Y(x(i)) exp{�Tf(x(i), y) + �(y(i), y)}hk(x
(i), y(i))⇥k
⇥(4)
Figure 1: Conditional log-likelihood (Eq. 2), Softmax-margin objective (Eq. 3) and gradient (Eq. 4).
Draft, do not circulate without permission.
Ai,i+1,n+(ai:A),d+(ai:A) ⇥= w(ai : A)
Ai,j,n+n�+n+(BC�A),d+d�+d+(BC�A) ⇥= Bi,k,n,d ⇤ Ck,j,n�,d� ⇤ w(BC ⌅ A)
GOAL ⇥= S0,N,n,d ⇤�1� 2n
d+ |y|
⇥
Figure 2: State-split inside algorithm for computing softmax-margin with F-measure.
counts, n+ and d+:
DecP = d+ � n+ (6)
Recall requires the number of gold standard de-pendencies, y+, which should have been recoveredin a particular state; we compute it as follows: Agold dependency is due to be recovered if its headlies within the span of one of its children and the de-pendent in the other. With this we can compute thedecomposed recall:
DecR = y+ � n+ (7)
However, there is one issue with our formulationof y+ with CCG and its way of dealing withdependencies that makes our formulation slightlymore approximate: The unification mechanism ofCCG allows to realise dependencies later in thederivation when both the head and dependent arein the same span (Figure 5). This makes usingthe proposed decomposed recall difficult as ourgold-dependency count y+ may under or over-statethe number of correct dependencies n+. Given thatthis loss function is an approximation, we deal withthis inconsistency via setting y+ = n+ whenevery+ < n+ to account for gold-dependencies whichhave not been correctly classified by our method.
likes apples and pears
(S\NP)/NP NP CONJ NP<�>
NP\NP<�>
NP<
(S\NP)
Dependencies:and - pearsand - appleslikes - pears, likes - apples
Figure 5: Example illustrating handling of conjunctionsin CCG: .
Finally, decomposed F-measure is simply the sumof the two decomposed losses:
DecF1 = (d+ � n+) + (y+ � n+) (8)
5 Experiments
Parsing Strategies. The most successful approachto CCG parsing is based on a pipeline strategy: First,we tag (or multitag) each word of the sentence witha lexical category using a supertagger, a sequencemodel over these categories (Bangalore and Joshi,1999; Clark, 2002). Second, we parse the sentenceunder the requirement that the lexical categories arefixed to those preferred by the supertagger.
Pruning the categories in advance this way has aspecific failure mode: sometimes it is not possibleto produce a sentence-spanning derivation from thetag sequences preferred by the supertagger, since itdoes not enforce grammaticality. A workaround forthis problem is the adaptive supertagging (AST) ap-proach of Clark and Curran (2004). It is based on astep function over supertagger beam widths, relax-ing the pruning threshold for lexical categories onlyif the parser fails to find an analysis. The process ei-ther succeeds and returns a parse after some iterationor gives up after a predefined number of iterations.As Clark and Curran (2004) show, most sentencescan be parsed with a very small number of supertagsper word. However, the technique is inherently ap-proximate: it will return a lower probability parseunder the parsing model if a higher probability parsecan only be constructed from a supertag sequencereturned by a subsequent iteration. In this way it pri-oritizes speed over exactness, although the tradeoffcan be modified by adjusting the beam step func-tion. Regardless, the effect of the approximation isunbounded.
Reverse adaptive supertagging is a much less ag-gressive method that seeks only to make sentencesparseable when they otherwise would not be due toan impractically large search space. Reverse ASTstarts with a wide beam, narrowing it at each itera-tion only if a maximum chart size is exceeded. Theused beam settings for both strategies during testingare in Table 1.Parser. We use the C&C parser (Clark and Cur-ran, 2007) and its supertagger (Clark, 2002). Ourbaseline is the hybrid model of Clark and Curran
likes apples and pears
(S\NP)/NP NP CONJ NP<�>
NP\NP<
NP>
(S\NP)
Figure 3: Example of flexible dependency realisation inCCG: Our parser (Clark and Curran, 2007) creates depen-dencies arising from coordination once all conjuncts werefound. The first application of the coordination rule (�)only notes the dependency “and - pears” (dotted line); thesecond application in the larger span, “apples and pears”,realises it, together with “and - apples”.
same span, violating the assumption used to com-pute y+ (Figure 3). Exceptions like this can causemismatches between n+ and y+. We set y+ = n+
whenever y+ < n+ to account for these occasionaldiscrepancies.
Finally, we obtain a decomposed approximationto F-measure.
DecF1 = DecP +DecR (10)
4 Experiments
Parsing Strategy. The most successful approach toCCG parsing is based on a pipeline strategy: First,
we tag (or multitag) each word of the sentence witha lexical category using a supertagger, a sequencemodel over these categories (Bangalore and Joshi,1999; Clark, 2002). Second, we parse the sentenceunder the requirement that the lexical categories arefixed to those preferred by the supertagger. In ourexperiments we used two variants on this strategy.
Pruning the categories in advance has a specificfailure mode: sometimes it is not possible to pro-duce a sentence-spanning derivation from the tag se-quences preferred by the supertagger, since it doesnot enforce grammaticality. A workaround for thisproblem is the adaptive supertagging (AST) ap-proach of Clark and Curran (2004). It is based ona step function over supertagger beam widths, relax-ing the pruning threshold for lexical categories onlyif the parser fails to find an analysis. The process ei-ther succeeds and returns a parse after some iterationor gives up after a predefined number of iterations.As Clark and Curran (2004) show, most sentencescan be parsed with a very small number of supertagsper word.
Reverse adaptive supertagging is a much less ag-gressive method that seeks only to make sentencesparseable when they otherwise would not be due toan impractically large search space. Reverse ASTstarts with a wide beam, narrowing it at each itera-tion only if a maximum chart size is exceeded. Ourbeam settings for both strategies during testing arein Table 1.
Adaptive supertagging aims for speed via pruningwhile the reverse strategy aims for accuracy by ex-posing the parser to a larger search space. AlthoughClark and Curran (2007) found no actual improve-ments from the latter strategy, we will show that withsome models it can have a substantial effect.Parser. We use the C&C parser (Clark and Curran,2007) and its supertagger (Clark, 2002). Our base-
Figure 2: Example of flexible dependency realisation inCCG: Our parser (Clark and Curran, 2007) creates de-pendencies arising from coordination once all conjunctsare found and treats “and” as the syntactic head of coor-dinations. The coordination rule (�) does not yet estab-lish the dependency “and - pears” (dotted line); it is thebackward application (<) in the larger span, “apples andpears”, that establishes it, together with “and - pears”.CCG also deals with unbounded dependencies which po-tentially lead to more dependencies than words (Steed-man, 2000); in this example a unification mechanism cre-ates the dependencies “likes - apples” and “likes - pears”in the forward application (>).
The key idea will be to treat F1 as a non-local fea-ture of the parse, dependent on values n and d.2 Tocompute expectations we split each span in an other-wise usual CKY program by all pairs �n, d incidentat that span. Since we anticipate the number of thesesplits to be approximately linear in sentence length,the algorithm’s complexity remains manageable.
Formally, our goal will be to compute expecta-2This is essentially the same trick used in the oracle F-
measure algorithm of Huang (2008), and indeed our algorithmis a sum-product variant of that max-product algorithm.
tions over the sentence a1...aN . In order to abstractaway from the particulars of CCG and present thealgorithm in relatively familiar terms as a variantof CKY, we will use the notation ai : A for lexi-cal entries and BC ⇧ A to indicate that categoriesB and C combine to form category A via forwardor backward composition or application.3 Item Ai,j
accumulates the inside score associated with cate-gory A spanning i, j, computed with the usual in-side algorithm, written here as a series of recursiveequations:
Ai,i+1 ⇤= w(ai : A)
Ai,j ⇤= Bi,k ⌅ Ck,j ⌅ w(BC ⇧ A)
GOAL ⇤= S0,N
Our algorithm computes expectations on state-split items Ai,j,n,d.4 Let functions n+(·) and d+(·)respectively represent the number of correct and to-tal dependencies introduced by a parsing action. Wecan now present the state-split variant of the insidealgorithm in Fig. 3. The final recursion simply incor-porates the loss function for all derivations having aparticular F-score; by running the full inside-outsidealgorithm on this state-split program, we obtain thedesired expectations.5 A simple modification of theweight on the goal transition enables us to optimiseprecision, recall or a weighted F-measure.
3These correspond to unary rules A ! ai and binary rulesA ! BC in a context-free grammar in Chomsky normal form.
4Here we use state-splitting to refer to splitting an item Ai,j
into many items Ai,j,n,d, one for each hn, di pair.5The outside equations can be easily derived from the inside
algorithm, or mechanically using the reverse values of Good-man (1999).
SMM:
min�
m�
i=1
⇤
⇧��Tf(x(i), y(i)) + log�
y�Y(x(i))
exp{�Tf(x(i), y)}
⌅
⌃ (2)
min�
m�
i=1
⇤
⇧��Tf(x(i), y(i)) + log�
y�Y(x(i))
exp{�Tf(x(i), y) + �(y(i), y)}
⌅
⌃ (3)
⌥
⌥⇥k=
m�
i=1
��hk(x
(i), y(i))⇥k +exp{�Tf(x(i), y(i))}⌥
y�Y(x(i)) exp{�Tf(x(i), y) + �(y(i), y)}hk(x
(i), y(i))⇥k
⇥(4)
Figure 1: Conditional log-likelihood (Eq. 2), Softmax-margin objective (Eq. 3) and gradient (Eq. 4).
Draft, do not circulate without permission.
Ai,i+1,n+(ai:A),d+(ai:A) ⇥= w(ai : A)
Ai,j,n+n�+n+(BC�A),d+d�+d+(BC�A) ⇥= Bi,k,n,d ⇤ Ck,j,n�,d� ⇤ w(BC ⌅ A)
GOAL ⇥= S0,N,n,d ⇤�1� 2n
d+ |y|
⇥
Figure 2: State-split inside algorithm for computing softmax-margin with F-measure.
counts, n+ and d+:
DecP = d+ � n+ (6)
Recall requires the number of gold standard de-pendencies, y+, which should have been recoveredin a particular state; we compute it as follows: Agold dependency is due to be recovered if its headlies within the span of one of its children and the de-pendent in the other. With this we can compute thedecomposed recall:
DecR = y+ � n+ (7)
However, there is one issue with our formulationof y+ with CCG and its way of dealing withdependencies that makes our formulation slightlymore approximate: The unification mechanism ofCCG allows to realise dependencies later in thederivation when both the head and dependent arein the same span (Figure 5). This makes usingthe proposed decomposed recall difficult as ourgold-dependency count y+ may under or over-statethe number of correct dependencies n+. Given thatthis loss function is an approximation, we deal withthis inconsistency via setting y+ = n+ whenevery+ < n+ to account for gold-dependencies whichhave not been correctly classified by our method.
likes apples and pears
(S\NP)/NP NP CONJ NP<�>
NP\NP<�>
NP<
(S\NP)
Dependencies:and - pearsand - appleslikes - pears, likes - apples
Figure 5: Example illustrating handling of conjunctionsin CCG: .
Finally, decomposed F-measure is simply the sumof the two decomposed losses:
DecF1 = (d+ � n+) + (y+ � n+) (8)
5 Experiments
Parsing Strategies. The most successful approachto CCG parsing is based on a pipeline strategy: First,we tag (or multitag) each word of the sentence witha lexical category using a supertagger, a sequencemodel over these categories (Bangalore and Joshi,1999; Clark, 2002). Second, we parse the sentenceunder the requirement that the lexical categories arefixed to those preferred by the supertagger.
Pruning the categories in advance this way has aspecific failure mode: sometimes it is not possibleto produce a sentence-spanning derivation from thetag sequences preferred by the supertagger, since itdoes not enforce grammaticality. A workaround forthis problem is the adaptive supertagging (AST) ap-proach of Clark and Curran (2004). It is based on astep function over supertagger beam widths, relax-ing the pruning threshold for lexical categories onlyif the parser fails to find an analysis. The process ei-ther succeeds and returns a parse after some iterationor gives up after a predefined number of iterations.As Clark and Curran (2004) show, most sentencescan be parsed with a very small number of supertagsper word. However, the technique is inherently ap-proximate: it will return a lower probability parseunder the parsing model if a higher probability parsecan only be constructed from a supertag sequencereturned by a subsequent iteration. In this way it pri-oritizes speed over exactness, although the tradeoffcan be modified by adjusting the beam step func-tion. Regardless, the effect of the approximation isunbounded.
Reverse adaptive supertagging is a much less ag-gressive method that seeks only to make sentencesparseable when they otherwise would not be due toan impractically large search space. Reverse ASTstarts with a wide beam, narrowing it at each itera-tion only if a maximum chart size is exceeded. Theused beam settings for both strategies during testingare in Table 1.Parser. We use the C&C parser (Clark and Cur-ran, 2007) and its supertagger (Clark, 2002). Ourbaseline is the hybrid model of Clark and Curran
likes apples and pears
(S\NP)/NP NP CONJ NP<�>
NP\NP<
NP>
(S\NP)
Figure 3: Example of flexible dependency realisation inCCG: Our parser (Clark and Curran, 2007) creates depen-dencies arising from coordination once all conjuncts werefound. The first application of the coordination rule (�)only notes the dependency “and - pears” (dotted line); thesecond application in the larger span, “apples and pears”,realises it, together with “and - apples”.
same span, violating the assumption used to com-pute y+ (Figure 3). Exceptions like this can causemismatches between n+ and y+. We set y+ = n+
whenever y+ < n+ to account for these occasionaldiscrepancies.
Finally, we obtain a decomposed approximationto F-measure.
DecF1 = DecP +DecR (10)
4 Experiments
Parsing Strategy. The most successful approach toCCG parsing is based on a pipeline strategy: First,
we tag (or multitag) each word of the sentence witha lexical category using a supertagger, a sequencemodel over these categories (Bangalore and Joshi,1999; Clark, 2002). Second, we parse the sentenceunder the requirement that the lexical categories arefixed to those preferred by the supertagger. In ourexperiments we used two variants on this strategy.
Pruning the categories in advance has a specificfailure mode: sometimes it is not possible to pro-duce a sentence-spanning derivation from the tag se-quences preferred by the supertagger, since it doesnot enforce grammaticality. A workaround for thisproblem is the adaptive supertagging (AST) ap-proach of Clark and Curran (2004). It is based ona step function over supertagger beam widths, relax-ing the pruning threshold for lexical categories onlyif the parser fails to find an analysis. The process ei-ther succeeds and returns a parse after some iterationor gives up after a predefined number of iterations.As Clark and Curran (2004) show, most sentencescan be parsed with a very small number of supertagsper word.
Reverse adaptive supertagging is a much less ag-gressive method that seeks only to make sentencesparseable when they otherwise would not be due toan impractically large search space. Reverse ASTstarts with a wide beam, narrowing it at each itera-tion only if a maximum chart size is exceeded. Ourbeam settings for both strategies during testing arein Table 1.
Adaptive supertagging aims for speed via pruningwhile the reverse strategy aims for accuracy by ex-posing the parser to a larger search space. AlthoughClark and Curran (2007) found no actual improve-ments from the latter strategy, we will show that withsome models it can have a substantial effect.Parser. We use the C&C parser (Clark and Curran,2007) and its supertagger (Clark, 2002). Our base-
Figure 2: Example of flexible dependency realisation inCCG: Our parser (Clark and Curran, 2007) creates de-pendencies arising from coordination once all conjunctsare found and treats “and” as the syntactic head of coor-dinations. The coordination rule (�) does not yet estab-lish the dependency “and - pears” (dotted line); it is thebackward application (<) in the larger span, “apples andpears”, that establishes it, together with “and - pears”.CCG also deals with unbounded dependencies which po-tentially lead to more dependencies than words (Steed-man, 2000); in this example a unification mechanism cre-ates the dependencies “likes - apples” and “likes - pears”in the forward application (>).
The key idea will be to treat F1 as a non-local fea-ture of the parse, dependent on values n and d.2 Tocompute expectations we split each span in an other-wise usual CKY program by all pairs �n, d incidentat that span. Since we anticipate the number of thesesplits to be approximately linear in sentence length,the algorithm’s complexity remains manageable.
Formally, our goal will be to compute expecta-2This is essentially the same trick used in the oracle F-
measure algorithm of Huang (2008), and indeed our algorithmis a sum-product variant of that max-product algorithm.
tions over the sentence a1...aN . In order to abstractaway from the particulars of CCG and present thealgorithm in relatively familiar terms as a variantof CKY, we will use the notation ai : A for lexi-cal entries and BC ⇧ A to indicate that categoriesB and C combine to form category A via forwardor backward composition or application.3 Item Ai,j
accumulates the inside score associated with cate-gory A spanning i, j, computed with the usual in-side algorithm, written here as a series of recursiveequations:
Ai,i+1 ⇤= w(ai : A)
Ai,j ⇤= Bi,k ⌅ Ck,j ⌅ w(BC ⇧ A)
GOAL ⇤= S0,N
Our algorithm computes expectations on state-split items Ai,j,n,d.4 Let functions n+(·) and d+(·)respectively represent the number of correct and to-tal dependencies introduced by a parsing action. Wecan now present the state-split variant of the insidealgorithm in Fig. 3. The final recursion simply incor-porates the loss function for all derivations having aparticular F-score; by running the full inside-outsidealgorithm on this state-split program, we obtain thedesired expectations.5 A simple modification of theweight on the goal transition enables us to optimiseprecision, recall or a weighted F-measure.
3These correspond to unary rules A ! ai and binary rulesA ! BC in a context-free grammar in Chomsky normal form.
4Here we use state-splitting to refer to splitting an item Ai,j
into many items Ai,j,n,d, one for each hn, di pair.5The outside equations can be easily derived from the inside
algorithm, or mechanically using the reverse values of Good-man (1999).
weights true outputfeatures input proposed outputpossible outputs
training examples
Softmax-Margin Training
CLL: min�
m�
i=1
⇤
⇧��Tf(x(i), y(i)) + log�
y�Y(x(i))
exp{�Tf(x(i), y)}
⌅
⌃ (2)
min�
m�
i=1
⇤
⇧��Tf(x(i), y(i)) + log�
y�Y(x(i))
exp{�Tf(x(i), y) + �(y(i), y)}
⌅
⌃ (3)
⌥
⌥⇥k=
m�
i=1
��hk(x
(i), y(i))⇥k +exp{�Tf(x(i), y(i))}⌥
y�Y(x(i)) exp{�Tf(x(i), y) + �(y(i), y)}hk(x
(i), y(i))⇥k
⇥(4)
Figure 1: Conditional log-likelihood (Eq. 2), Softmax-margin objective (Eq. 3) and gradient (Eq. 4).
Draft, do not circulate without permission.
Ai,i+1,n+(ai:A),d+(ai:A) ⇥= w(ai : A)
Ai,j,n+n�+n+(BC�A),d+d�+d+(BC�A) ⇥= Bi,k,n,d ⇤ Ck,j,n�,d� ⇤ w(BC ⌅ A)
GOAL ⇥= S0,N,n,d ⇤�1� 2n
d+ |y|
⇥
Figure 2: State-split inside algorithm for computing softmax-margin with F-measure.
counts, n+ and d+:
DecP = d+ � n+ (6)
Recall requires the number of gold standard de-pendencies, y+, which should have been recoveredin a particular state; we compute it as follows: Agold dependency is due to be recovered if its headlies within the span of one of its children and the de-pendent in the other. With this we can compute thedecomposed recall:
DecR = y+ � n+ (7)
However, there is one issue with our formulationof y+ with CCG and its way of dealing withdependencies that makes our formulation slightlymore approximate: The unification mechanism ofCCG allows to realise dependencies later in thederivation when both the head and dependent arein the same span (Figure 5). This makes usingthe proposed decomposed recall difficult as ourgold-dependency count y+ may under or over-statethe number of correct dependencies n+. Given thatthis loss function is an approximation, we deal withthis inconsistency via setting y+ = n+ whenevery+ < n+ to account for gold-dependencies whichhave not been correctly classified by our method.
likes apples and pears
(S\NP)/NP NP CONJ NP<�>
NP\NP<�>
NP<
(S\NP)
Dependencies:and - pearsand - appleslikes - pears, likes - apples
Figure 5: Example illustrating handling of conjunctionsin CCG: .
Finally, decomposed F-measure is simply the sumof the two decomposed losses:
DecF1 = (d+ � n+) + (y+ � n+) (8)
5 Experiments
Parsing Strategies. The most successful approachto CCG parsing is based on a pipeline strategy: First,we tag (or multitag) each word of the sentence witha lexical category using a supertagger, a sequencemodel over these categories (Bangalore and Joshi,1999; Clark, 2002). Second, we parse the sentenceunder the requirement that the lexical categories arefixed to those preferred by the supertagger.
Pruning the categories in advance this way has aspecific failure mode: sometimes it is not possibleto produce a sentence-spanning derivation from thetag sequences preferred by the supertagger, since itdoes not enforce grammaticality. A workaround forthis problem is the adaptive supertagging (AST) ap-proach of Clark and Curran (2004). It is based on astep function over supertagger beam widths, relax-ing the pruning threshold for lexical categories onlyif the parser fails to find an analysis. The process ei-ther succeeds and returns a parse after some iterationor gives up after a predefined number of iterations.As Clark and Curran (2004) show, most sentencescan be parsed with a very small number of supertagsper word. However, the technique is inherently ap-proximate: it will return a lower probability parseunder the parsing model if a higher probability parsecan only be constructed from a supertag sequencereturned by a subsequent iteration. In this way it pri-oritizes speed over exactness, although the tradeoffcan be modified by adjusting the beam step func-tion. Regardless, the effect of the approximation isunbounded.
Reverse adaptive supertagging is a much less ag-gressive method that seeks only to make sentencesparseable when they otherwise would not be due toan impractically large search space. Reverse ASTstarts with a wide beam, narrowing it at each itera-tion only if a maximum chart size is exceeded. Theused beam settings for both strategies during testingare in Table 1.Parser. We use the C&C parser (Clark and Cur-ran, 2007) and its supertagger (Clark, 2002). Ourbaseline is the hybrid model of Clark and Curran
likes apples and pears
(S\NP)/NP NP CONJ NP<�>
NP\NP<
NP>
(S\NP)
Figure 3: Example of flexible dependency realisation inCCG: Our parser (Clark and Curran, 2007) creates depen-dencies arising from coordination once all conjuncts werefound. The first application of the coordination rule (�)only notes the dependency “and - pears” (dotted line); thesecond application in the larger span, “apples and pears”,realises it, together with “and - apples”.
same span, violating the assumption used to com-pute y+ (Figure 3). Exceptions like this can causemismatches between n+ and y+. We set y+ = n+
whenever y+ < n+ to account for these occasionaldiscrepancies.
Finally, we obtain a decomposed approximationto F-measure.
DecF1 = DecP +DecR (10)
4 Experiments
Parsing Strategy. The most successful approach toCCG parsing is based on a pipeline strategy: First,
we tag (or multitag) each word of the sentence witha lexical category using a supertagger, a sequencemodel over these categories (Bangalore and Joshi,1999; Clark, 2002). Second, we parse the sentenceunder the requirement that the lexical categories arefixed to those preferred by the supertagger. In ourexperiments we used two variants on this strategy.
Pruning the categories in advance has a specificfailure mode: sometimes it is not possible to pro-duce a sentence-spanning derivation from the tag se-quences preferred by the supertagger, since it doesnot enforce grammaticality. A workaround for thisproblem is the adaptive supertagging (AST) ap-proach of Clark and Curran (2004). It is based ona step function over supertagger beam widths, relax-ing the pruning threshold for lexical categories onlyif the parser fails to find an analysis. The process ei-ther succeeds and returns a parse after some iterationor gives up after a predefined number of iterations.As Clark and Curran (2004) show, most sentencescan be parsed with a very small number of supertagsper word.
Reverse adaptive supertagging is a much less ag-gressive method that seeks only to make sentencesparseable when they otherwise would not be due toan impractically large search space. Reverse ASTstarts with a wide beam, narrowing it at each itera-tion only if a maximum chart size is exceeded. Ourbeam settings for both strategies during testing arein Table 1.
Adaptive supertagging aims for speed via pruningwhile the reverse strategy aims for accuracy by ex-posing the parser to a larger search space. AlthoughClark and Curran (2007) found no actual improve-ments from the latter strategy, we will show that withsome models it can have a substantial effect.Parser. We use the C&C parser (Clark and Curran,2007) and its supertagger (Clark, 2002). Our base-
Figure 2: Example of flexible dependency realisation inCCG: Our parser (Clark and Curran, 2007) creates de-pendencies arising from coordination once all conjunctsare found and treats “and” as the syntactic head of coor-dinations. The coordination rule (�) does not yet estab-lish the dependency “and - pears” (dotted line); it is thebackward application (<) in the larger span, “apples andpears”, that establishes it, together with “and - pears”.CCG also deals with unbounded dependencies which po-tentially lead to more dependencies than words (Steed-man, 2000); in this example a unification mechanism cre-ates the dependencies “likes - apples” and “likes - pears”in the forward application (>).
The key idea will be to treat F1 as a non-local fea-ture of the parse, dependent on values n and d.2 Tocompute expectations we split each span in an other-wise usual CKY program by all pairs �n, d incidentat that span. Since we anticipate the number of thesesplits to be approximately linear in sentence length,the algorithm’s complexity remains manageable.
Formally, our goal will be to compute expecta-2This is essentially the same trick used in the oracle F-
measure algorithm of Huang (2008), and indeed our algorithmis a sum-product variant of that max-product algorithm.
tions over the sentence a1...aN . In order to abstractaway from the particulars of CCG and present thealgorithm in relatively familiar terms as a variantof CKY, we will use the notation ai : A for lexi-cal entries and BC ⇧ A to indicate that categoriesB and C combine to form category A via forwardor backward composition or application.3 Item Ai,j
accumulates the inside score associated with cate-gory A spanning i, j, computed with the usual in-side algorithm, written here as a series of recursiveequations:
Ai,i+1 ⇤= w(ai : A)
Ai,j ⇤= Bi,k ⌅ Ck,j ⌅ w(BC ⇧ A)
GOAL ⇤= S0,N
Our algorithm computes expectations on state-split items Ai,j,n,d.4 Let functions n+(·) and d+(·)respectively represent the number of correct and to-tal dependencies introduced by a parsing action. Wecan now present the state-split variant of the insidealgorithm in Fig. 3. The final recursion simply incor-porates the loss function for all derivations having aparticular F-score; by running the full inside-outsidealgorithm on this state-split program, we obtain thedesired expectations.5 A simple modification of theweight on the goal transition enables us to optimiseprecision, recall or a weighted F-measure.
3These correspond to unary rules A ! ai and binary rulesA ! BC in a context-free grammar in Chomsky normal form.
4Here we use state-splitting to refer to splitting an item Ai,j
into many items Ai,j,n,d, one for each hn, di pair.5The outside equations can be easily derived from the inside
algorithm, or mechanically using the reverse values of Good-man (1999).
SMM:
min�
m�
i=1
⇤
⇧��Tf(x(i), y(i)) + log�
y�Y(x(i))
exp{�Tf(x(i), y)}
⌅
⌃ (2)
min�
m�
i=1
⇤
⇧��Tf(x(i), y(i)) + log�
y�Y(x(i))
exp{�Tf(x(i), y) + �(y(i), y)}
⌅
⌃ (3)
⌥
⌥⇥k=
m�
i=1
��hk(x
(i), y(i))⇥k +exp{�Tf(x(i), y(i))}⌥
y�Y(x(i)) exp{�Tf(x(i), y) + �(y(i), y)}hk(x
(i), y(i))⇥k
⇥(4)
Figure 1: Conditional log-likelihood (Eq. 2), Softmax-margin objective (Eq. 3) and gradient (Eq. 4).
Draft, do not circulate without permission.
Ai,i+1,n+(ai:A),d+(ai:A) ⇥= w(ai : A)
Ai,j,n+n�+n+(BC�A),d+d�+d+(BC�A) ⇥= Bi,k,n,d ⇤ Ck,j,n�,d� ⇤ w(BC ⌅ A)
GOAL ⇥= S0,N,n,d ⇤�1� 2n
d+ |y|
⇥
Figure 2: State-split inside algorithm for computing softmax-margin with F-measure.
counts, n+ and d+:
DecP = d+ � n+ (6)
Recall requires the number of gold standard de-pendencies, y+, which should have been recoveredin a particular state; we compute it as follows: Agold dependency is due to be recovered if its headlies within the span of one of its children and the de-pendent in the other. With this we can compute thedecomposed recall:
DecR = y+ � n+ (7)
However, there is one issue with our formulationof y+ with CCG and its way of dealing withdependencies that makes our formulation slightlymore approximate: The unification mechanism ofCCG allows to realise dependencies later in thederivation when both the head and dependent arein the same span (Figure 5). This makes usingthe proposed decomposed recall difficult as ourgold-dependency count y+ may under or over-statethe number of correct dependencies n+. Given thatthis loss function is an approximation, we deal withthis inconsistency via setting y+ = n+ whenevery+ < n+ to account for gold-dependencies whichhave not been correctly classified by our method.
likes apples and pears
(S\NP)/NP NP CONJ NP<�>
NP\NP<�>
NP<
(S\NP)
Dependencies:and - pearsand - appleslikes - pears, likes - apples
Figure 5: Example illustrating handling of conjunctionsin CCG: .
Finally, decomposed F-measure is simply the sumof the two decomposed losses:
DecF1 = (d+ � n+) + (y+ � n+) (8)
5 Experiments
Parsing Strategies. The most successful approachto CCG parsing is based on a pipeline strategy: First,we tag (or multitag) each word of the sentence witha lexical category using a supertagger, a sequencemodel over these categories (Bangalore and Joshi,1999; Clark, 2002). Second, we parse the sentenceunder the requirement that the lexical categories arefixed to those preferred by the supertagger.
Pruning the categories in advance this way has aspecific failure mode: sometimes it is not possibleto produce a sentence-spanning derivation from thetag sequences preferred by the supertagger, since itdoes not enforce grammaticality. A workaround forthis problem is the adaptive supertagging (AST) ap-proach of Clark and Curran (2004). It is based on astep function over supertagger beam widths, relax-ing the pruning threshold for lexical categories onlyif the parser fails to find an analysis. The process ei-ther succeeds and returns a parse after some iterationor gives up after a predefined number of iterations.As Clark and Curran (2004) show, most sentencescan be parsed with a very small number of supertagsper word. However, the technique is inherently ap-proximate: it will return a lower probability parseunder the parsing model if a higher probability parsecan only be constructed from a supertag sequencereturned by a subsequent iteration. In this way it pri-oritizes speed over exactness, although the tradeoffcan be modified by adjusting the beam step func-tion. Regardless, the effect of the approximation isunbounded.
Reverse adaptive supertagging is a much less ag-gressive method that seeks only to make sentencesparseable when they otherwise would not be due toan impractically large search space. Reverse ASTstarts with a wide beam, narrowing it at each itera-tion only if a maximum chart size is exceeded. Theused beam settings for both strategies during testingare in Table 1.Parser. We use the C&C parser (Clark and Cur-ran, 2007) and its supertagger (Clark, 2002). Ourbaseline is the hybrid model of Clark and Curran
likes apples and pears
(S\NP)/NP NP CONJ NP<�>
NP\NP<
NP>
(S\NP)
Figure 3: Example of flexible dependency realisation inCCG: Our parser (Clark and Curran, 2007) creates depen-dencies arising from coordination once all conjuncts werefound. The first application of the coordination rule (�)only notes the dependency “and - pears” (dotted line); thesecond application in the larger span, “apples and pears”,realises it, together with “and - apples”.
same span, violating the assumption used to com-pute y+ (Figure 3). Exceptions like this can causemismatches between n+ and y+. We set y+ = n+
whenever y+ < n+ to account for these occasionaldiscrepancies.
Finally, we obtain a decomposed approximationto F-measure.
DecF1 = DecP +DecR (10)
4 Experiments
Parsing Strategy. The most successful approach toCCG parsing is based on a pipeline strategy: First,
we tag (or multitag) each word of the sentence witha lexical category using a supertagger, a sequencemodel over these categories (Bangalore and Joshi,1999; Clark, 2002). Second, we parse the sentenceunder the requirement that the lexical categories arefixed to those preferred by the supertagger. In ourexperiments we used two variants on this strategy.
Pruning the categories in advance has a specificfailure mode: sometimes it is not possible to pro-duce a sentence-spanning derivation from the tag se-quences preferred by the supertagger, since it doesnot enforce grammaticality. A workaround for thisproblem is the adaptive supertagging (AST) ap-proach of Clark and Curran (2004). It is based ona step function over supertagger beam widths, relax-ing the pruning threshold for lexical categories onlyif the parser fails to find an analysis. The process ei-ther succeeds and returns a parse after some iterationor gives up after a predefined number of iterations.As Clark and Curran (2004) show, most sentencescan be parsed with a very small number of supertagsper word.
Reverse adaptive supertagging is a much less ag-gressive method that seeks only to make sentencesparseable when they otherwise would not be due toan impractically large search space. Reverse ASTstarts with a wide beam, narrowing it at each itera-tion only if a maximum chart size is exceeded. Ourbeam settings for both strategies during testing arein Table 1.
Adaptive supertagging aims for speed via pruningwhile the reverse strategy aims for accuracy by ex-posing the parser to a larger search space. AlthoughClark and Curran (2007) found no actual improve-ments from the latter strategy, we will show that withsome models it can have a substantial effect.Parser. We use the C&C parser (Clark and Curran,2007) and its supertagger (Clark, 2002). Our base-
Figure 2: Example of flexible dependency realisation inCCG: Our parser (Clark and Curran, 2007) creates de-pendencies arising from coordination once all conjunctsare found and treats “and” as the syntactic head of coor-dinations. The coordination rule (�) does not yet estab-lish the dependency “and - pears” (dotted line); it is thebackward application (<) in the larger span, “apples andpears”, that establishes it, together with “and - pears”.CCG also deals with unbounded dependencies which po-tentially lead to more dependencies than words (Steed-man, 2000); in this example a unification mechanism cre-ates the dependencies “likes - apples” and “likes - pears”in the forward application (>).
The key idea will be to treat F1 as a non-local fea-ture of the parse, dependent on values n and d.2 Tocompute expectations we split each span in an other-wise usual CKY program by all pairs �n, d incidentat that span. Since we anticipate the number of thesesplits to be approximately linear in sentence length,the algorithm’s complexity remains manageable.
Formally, our goal will be to compute expecta-2This is essentially the same trick used in the oracle F-
measure algorithm of Huang (2008), and indeed our algorithmis a sum-product variant of that max-product algorithm.
tions over the sentence a1...aN . In order to abstractaway from the particulars of CCG and present thealgorithm in relatively familiar terms as a variantof CKY, we will use the notation ai : A for lexi-cal entries and BC ⇧ A to indicate that categoriesB and C combine to form category A via forwardor backward composition or application.3 Item Ai,j
accumulates the inside score associated with cate-gory A spanning i, j, computed with the usual in-side algorithm, written here as a series of recursiveequations:
Ai,i+1 ⇤= w(ai : A)
Ai,j ⇤= Bi,k ⌅ Ck,j ⌅ w(BC ⇧ A)
GOAL ⇤= S0,N
Our algorithm computes expectations on state-split items Ai,j,n,d.4 Let functions n+(·) and d+(·)respectively represent the number of correct and to-tal dependencies introduced by a parsing action. Wecan now present the state-split variant of the insidealgorithm in Fig. 3. The final recursion simply incor-porates the loss function for all derivations having aparticular F-score; by running the full inside-outsidealgorithm on this state-split program, we obtain thedesired expectations.5 A simple modification of theweight on the goal transition enables us to optimiseprecision, recall or a weighted F-measure.
3These correspond to unary rules A ! ai and binary rulesA ! BC in a context-free grammar in Chomsky normal form.
4Here we use state-splitting to refer to splitting an item Ai,j
into many items Ai,j,n,d, one for each hn, di pair.5The outside equations can be easily derived from the inside
algorithm, or mechanically using the reverse values of Good-man (1999).
weights true outputfeatures input proposed outputpossible outputs
training examples
Softmax-Margin Training
CLL: min�
m�
i=1
⇤
⇧��Tf(x(i), y(i)) + log�
y�Y(x(i))
exp{�Tf(x(i), y)}
⌅
⌃ (2)
min�
m�
i=1
⇤
⇧��Tf(x(i), y(i)) + log�
y�Y(x(i))
exp{�Tf(x(i), y) + �(y(i), y)}
⌅
⌃ (3)
⌥
⌥⇥k=
m�
i=1
��hk(x
(i), y(i))⇥k +exp{�Tf(x(i), y(i))}⌥
y�Y(x(i)) exp{�Tf(x(i), y) + �(y(i), y)}hk(x
(i), y(i))⇥k
⇥(4)
Figure 1: Conditional log-likelihood (Eq. 2), Softmax-margin objective (Eq. 3) and gradient (Eq. 4).
Draft, do not circulate without permission.
Ai,i+1,n+(ai:A),d+(ai:A) ⇥= w(ai : A)
Ai,j,n+n�+n+(BC�A),d+d�+d+(BC�A) ⇥= Bi,k,n,d ⇤ Ck,j,n�,d� ⇤ w(BC ⌅ A)
GOAL ⇥= S0,N,n,d ⇤�1� 2n
d+ |y|
⇥
Figure 2: State-split inside algorithm for computing softmax-margin with F-measure.
counts, n+ and d+:
DecP = d+ � n+ (6)
Recall requires the number of gold standard de-pendencies, y+, which should have been recoveredin a particular state; we compute it as follows: Agold dependency is due to be recovered if its headlies within the span of one of its children and the de-pendent in the other. With this we can compute thedecomposed recall:
DecR = y+ � n+ (7)
However, there is one issue with our formulationof y+ with CCG and its way of dealing withdependencies that makes our formulation slightlymore approximate: The unification mechanism ofCCG allows to realise dependencies later in thederivation when both the head and dependent arein the same span (Figure 5). This makes usingthe proposed decomposed recall difficult as ourgold-dependency count y+ may under or over-statethe number of correct dependencies n+. Given thatthis loss function is an approximation, we deal withthis inconsistency via setting y+ = n+ whenevery+ < n+ to account for gold-dependencies whichhave not been correctly classified by our method.
likes apples and pears
(S\NP)/NP NP CONJ NP<�>
NP\NP<�>
NP<
(S\NP)
Dependencies:and - pearsand - appleslikes - pears, likes - apples
Figure 5: Example illustrating handling of conjunctionsin CCG: .
Finally, decomposed F-measure is simply the sumof the two decomposed losses:
DecF1 = (d+ � n+) + (y+ � n+) (8)
5 Experiments
Parsing Strategies. The most successful approachto CCG parsing is based on a pipeline strategy: First,we tag (or multitag) each word of the sentence witha lexical category using a supertagger, a sequencemodel over these categories (Bangalore and Joshi,1999; Clark, 2002). Second, we parse the sentenceunder the requirement that the lexical categories arefixed to those preferred by the supertagger.
Pruning the categories in advance this way has aspecific failure mode: sometimes it is not possibleto produce a sentence-spanning derivation from thetag sequences preferred by the supertagger, since itdoes not enforce grammaticality. A workaround forthis problem is the adaptive supertagging (AST) ap-proach of Clark and Curran (2004). It is based on astep function over supertagger beam widths, relax-ing the pruning threshold for lexical categories onlyif the parser fails to find an analysis. The process ei-ther succeeds and returns a parse after some iterationor gives up after a predefined number of iterations.As Clark and Curran (2004) show, most sentencescan be parsed with a very small number of supertagsper word. However, the technique is inherently ap-proximate: it will return a lower probability parseunder the parsing model if a higher probability parsecan only be constructed from a supertag sequencereturned by a subsequent iteration. In this way it pri-oritizes speed over exactness, although the tradeoffcan be modified by adjusting the beam step func-tion. Regardless, the effect of the approximation isunbounded.
Reverse adaptive supertagging is a much less ag-gressive method that seeks only to make sentencesparseable when they otherwise would not be due toan impractically large search space. Reverse ASTstarts with a wide beam, narrowing it at each itera-tion only if a maximum chart size is exceeded. Theused beam settings for both strategies during testingare in Table 1.Parser. We use the C&C parser (Clark and Cur-ran, 2007) and its supertagger (Clark, 2002). Ourbaseline is the hybrid model of Clark and Curran
likes apples and pears
(S\NP)/NP NP CONJ NP<�>
NP\NP<
NP>
(S\NP)
Figure 3: Example of flexible dependency realisation inCCG: Our parser (Clark and Curran, 2007) creates depen-dencies arising from coordination once all conjuncts werefound. The first application of the coordination rule (�)only notes the dependency “and - pears” (dotted line); thesecond application in the larger span, “apples and pears”,realises it, together with “and - apples”.
same span, violating the assumption used to com-pute y+ (Figure 3). Exceptions like this can causemismatches between n+ and y+. We set y+ = n+
whenever y+ < n+ to account for these occasionaldiscrepancies.
Finally, we obtain a decomposed approximationto F-measure.
DecF1 = DecP +DecR (10)
4 Experiments
Parsing Strategy. The most successful approach toCCG parsing is based on a pipeline strategy: First,
we tag (or multitag) each word of the sentence witha lexical category using a supertagger, a sequencemodel over these categories (Bangalore and Joshi,1999; Clark, 2002). Second, we parse the sentenceunder the requirement that the lexical categories arefixed to those preferred by the supertagger. In ourexperiments we used two variants on this strategy.
Pruning the categories in advance has a specificfailure mode: sometimes it is not possible to pro-duce a sentence-spanning derivation from the tag se-quences preferred by the supertagger, since it doesnot enforce grammaticality. A workaround for thisproblem is the adaptive supertagging (AST) ap-proach of Clark and Curran (2004). It is based ona step function over supertagger beam widths, relax-ing the pruning threshold for lexical categories onlyif the parser fails to find an analysis. The process ei-ther succeeds and returns a parse after some iterationor gives up after a predefined number of iterations.As Clark and Curran (2004) show, most sentencescan be parsed with a very small number of supertagsper word.
Reverse adaptive supertagging is a much less ag-gressive method that seeks only to make sentencesparseable when they otherwise would not be due toan impractically large search space. Reverse ASTstarts with a wide beam, narrowing it at each itera-tion only if a maximum chart size is exceeded. Ourbeam settings for both strategies during testing arein Table 1.
Adaptive supertagging aims for speed via pruningwhile the reverse strategy aims for accuracy by ex-posing the parser to a larger search space. AlthoughClark and Curran (2007) found no actual improve-ments from the latter strategy, we will show that withsome models it can have a substantial effect.Parser. We use the C&C parser (Clark and Curran,2007) and its supertagger (Clark, 2002). Our base-
Figure 2: Example of flexible dependency realisation inCCG: Our parser (Clark and Curran, 2007) creates de-pendencies arising from coordination once all conjunctsare found and treats “and” as the syntactic head of coor-dinations. The coordination rule (�) does not yet estab-lish the dependency “and - pears” (dotted line); it is thebackward application (<) in the larger span, “apples andpears”, that establishes it, together with “and - pears”.CCG also deals with unbounded dependencies which po-tentially lead to more dependencies than words (Steed-man, 2000); in this example a unification mechanism cre-ates the dependencies “likes - apples” and “likes - pears”in the forward application (>).
The key idea will be to treat F1 as a non-local fea-ture of the parse, dependent on values n and d.2 Tocompute expectations we split each span in an other-wise usual CKY program by all pairs �n, d incidentat that span. Since we anticipate the number of thesesplits to be approximately linear in sentence length,the algorithm’s complexity remains manageable.
Formally, our goal will be to compute expecta-2This is essentially the same trick used in the oracle F-
measure algorithm of Huang (2008), and indeed our algorithmis a sum-product variant of that max-product algorithm.
tions over the sentence a1...aN . In order to abstractaway from the particulars of CCG and present thealgorithm in relatively familiar terms as a variantof CKY, we will use the notation ai : A for lexi-cal entries and BC ⇧ A to indicate that categoriesB and C combine to form category A via forwardor backward composition or application.3 Item Ai,j
accumulates the inside score associated with cate-gory A spanning i, j, computed with the usual in-side algorithm, written here as a series of recursiveequations:
Ai,i+1 ⇤= w(ai : A)
Ai,j ⇤= Bi,k ⌅ Ck,j ⌅ w(BC ⇧ A)
GOAL ⇤= S0,N
Our algorithm computes expectations on state-split items Ai,j,n,d.4 Let functions n+(·) and d+(·)respectively represent the number of correct and to-tal dependencies introduced by a parsing action. Wecan now present the state-split variant of the insidealgorithm in Fig. 3. The final recursion simply incor-porates the loss function for all derivations having aparticular F-score; by running the full inside-outsidealgorithm on this state-split program, we obtain thedesired expectations.5 A simple modification of theweight on the goal transition enables us to optimiseprecision, recall or a weighted F-measure.
3These correspond to unary rules A ! ai and binary rulesA ! BC in a context-free grammar in Chomsky normal form.
4Here we use state-splitting to refer to splitting an item Ai,j
into many items Ai,j,n,d, one for each hn, di pair.5The outside equations can be easily derived from the inside
algorithm, or mechanically using the reverse values of Good-man (1999).
SMM:
min�
m�
i=1
⇤
⇧��Tf(x(i), y(i)) + log�
y�Y(x(i))
exp{�Tf(x(i), y)}
⌅
⌃ (2)
min�
m�
i=1
⇤
⇧��Tf(x(i), y(i)) + log�
y�Y(x(i))
exp{�Tf(x(i), y) + �(y(i), y)}
⌅
⌃ (3)
⌥
⌥⇥k=
m�
i=1
��hk(x
(i), y(i))⇥k +exp{�Tf(x(i), y(i))}⌥
y�Y(x(i)) exp{�Tf(x(i), y) + �(y(i), y)}hk(x
(i), y(i))⇥k
⇥(4)
Figure 1: Conditional log-likelihood (Eq. 2), Softmax-margin objective (Eq. 3) and gradient (Eq. 4).
Draft, do not circulate without permission.
Ai,i+1,n+(ai:A),d+(ai:A) ⇥= w(ai : A)
Ai,j,n+n�+n+(BC�A),d+d�+d+(BC�A) ⇥= Bi,k,n,d ⇤ Ck,j,n�,d� ⇤ w(BC ⌅ A)
GOAL ⇥= S0,N,n,d ⇤�1� 2n
d+ |y|
⇥
Figure 2: State-split inside algorithm for computing softmax-margin with F-measure.
counts, n+ and d+:
DecP = d+ � n+ (6)
Recall requires the number of gold standard de-pendencies, y+, which should have been recoveredin a particular state; we compute it as follows: Agold dependency is due to be recovered if its headlies within the span of one of its children and the de-pendent in the other. With this we can compute thedecomposed recall:
DecR = y+ � n+ (7)
However, there is one issue with our formulationof y+ with CCG and its way of dealing withdependencies that makes our formulation slightlymore approximate: The unification mechanism ofCCG allows to realise dependencies later in thederivation when both the head and dependent arein the same span (Figure 5). This makes usingthe proposed decomposed recall difficult as ourgold-dependency count y+ may under or over-statethe number of correct dependencies n+. Given thatthis loss function is an approximation, we deal withthis inconsistency via setting y+ = n+ whenevery+ < n+ to account for gold-dependencies whichhave not been correctly classified by our method.
likes apples and pears
(S\NP)/NP NP CONJ NP<�>
NP\NP<�>
NP<
(S\NP)
Dependencies:and - pearsand - appleslikes - pears, likes - apples
Figure 5: Example illustrating handling of conjunctionsin CCG: .
Finally, decomposed F-measure is simply the sumof the two decomposed losses:
DecF1 = (d+ � n+) + (y+ � n+) (8)
5 Experiments
Parsing Strategies. The most successful approachto CCG parsing is based on a pipeline strategy: First,we tag (or multitag) each word of the sentence witha lexical category using a supertagger, a sequencemodel over these categories (Bangalore and Joshi,1999; Clark, 2002). Second, we parse the sentenceunder the requirement that the lexical categories arefixed to those preferred by the supertagger.
Pruning the categories in advance this way has aspecific failure mode: sometimes it is not possibleto produce a sentence-spanning derivation from thetag sequences preferred by the supertagger, since itdoes not enforce grammaticality. A workaround forthis problem is the adaptive supertagging (AST) ap-proach of Clark and Curran (2004). It is based on astep function over supertagger beam widths, relax-ing the pruning threshold for lexical categories onlyif the parser fails to find an analysis. The process ei-ther succeeds and returns a parse after some iterationor gives up after a predefined number of iterations.As Clark and Curran (2004) show, most sentencescan be parsed with a very small number of supertagsper word. However, the technique is inherently ap-proximate: it will return a lower probability parseunder the parsing model if a higher probability parsecan only be constructed from a supertag sequencereturned by a subsequent iteration. In this way it pri-oritizes speed over exactness, although the tradeoffcan be modified by adjusting the beam step func-tion. Regardless, the effect of the approximation isunbounded.
Reverse adaptive supertagging is a much less ag-gressive method that seeks only to make sentencesparseable when they otherwise would not be due toan impractically large search space. Reverse ASTstarts with a wide beam, narrowing it at each itera-tion only if a maximum chart size is exceeded. Theused beam settings for both strategies during testingare in Table 1.Parser. We use the C&C parser (Clark and Cur-ran, 2007) and its supertagger (Clark, 2002). Ourbaseline is the hybrid model of Clark and Curran
likes apples and pears
(S\NP)/NP NP CONJ NP<�>
NP\NP<
NP>
(S\NP)
Figure 3: Example of flexible dependency realisation inCCG: Our parser (Clark and Curran, 2007) creates depen-dencies arising from coordination once all conjuncts werefound. The first application of the coordination rule (�)only notes the dependency “and - pears” (dotted line); thesecond application in the larger span, “apples and pears”,realises it, together with “and - apples”.
same span, violating the assumption used to com-pute y+ (Figure 3). Exceptions like this can causemismatches between n+ and y+. We set y+ = n+
whenever y+ < n+ to account for these occasionaldiscrepancies.
Finally, we obtain a decomposed approximationto F-measure.
DecF1 = DecP +DecR (10)
4 Experiments
Parsing Strategy. The most successful approach toCCG parsing is based on a pipeline strategy: First,
we tag (or multitag) each word of the sentence witha lexical category using a supertagger, a sequencemodel over these categories (Bangalore and Joshi,1999; Clark, 2002). Second, we parse the sentenceunder the requirement that the lexical categories arefixed to those preferred by the supertagger. In ourexperiments we used two variants on this strategy.
Pruning the categories in advance has a specificfailure mode: sometimes it is not possible to pro-duce a sentence-spanning derivation from the tag se-quences preferred by the supertagger, since it doesnot enforce grammaticality. A workaround for thisproblem is the adaptive supertagging (AST) ap-proach of Clark and Curran (2004). It is based ona step function over supertagger beam widths, relax-ing the pruning threshold for lexical categories onlyif the parser fails to find an analysis. The process ei-ther succeeds and returns a parse after some iterationor gives up after a predefined number of iterations.As Clark and Curran (2004) show, most sentencescan be parsed with a very small number of supertagsper word.
Reverse adaptive supertagging is a much less ag-gressive method that seeks only to make sentencesparseable when they otherwise would not be due toan impractically large search space. Reverse ASTstarts with a wide beam, narrowing it at each itera-tion only if a maximum chart size is exceeded. Ourbeam settings for both strategies during testing arein Table 1.
Adaptive supertagging aims for speed via pruningwhile the reverse strategy aims for accuracy by ex-posing the parser to a larger search space. AlthoughClark and Curran (2007) found no actual improve-ments from the latter strategy, we will show that withsome models it can have a substantial effect.Parser. We use the C&C parser (Clark and Curran,2007) and its supertagger (Clark, 2002). Our base-
Figure 2: Example of flexible dependency realisation inCCG: Our parser (Clark and Curran, 2007) creates de-pendencies arising from coordination once all conjunctsare found and treats “and” as the syntactic head of coor-dinations. The coordination rule (�) does not yet estab-lish the dependency “and - pears” (dotted line); it is thebackward application (<) in the larger span, “apples andpears”, that establishes it, together with “and - pears”.CCG also deals with unbounded dependencies which po-tentially lead to more dependencies than words (Steed-man, 2000); in this example a unification mechanism cre-ates the dependencies “likes - apples” and “likes - pears”in the forward application (>).
The key idea will be to treat F1 as a non-local fea-ture of the parse, dependent on values n and d.2 Tocompute expectations we split each span in an other-wise usual CKY program by all pairs �n, d incidentat that span. Since we anticipate the number of thesesplits to be approximately linear in sentence length,the algorithm’s complexity remains manageable.
Formally, our goal will be to compute expecta-2This is essentially the same trick used in the oracle F-
measure algorithm of Huang (2008), and indeed our algorithmis a sum-product variant of that max-product algorithm.
tions over the sentence a1...aN . In order to abstractaway from the particulars of CCG and present thealgorithm in relatively familiar terms as a variantof CKY, we will use the notation ai : A for lexi-cal entries and BC ⇧ A to indicate that categoriesB and C combine to form category A via forwardor backward composition or application.3 Item Ai,j
accumulates the inside score associated with cate-gory A spanning i, j, computed with the usual in-side algorithm, written here as a series of recursiveequations:
Ai,i+1 ⇤= w(ai : A)
Ai,j ⇤= Bi,k ⌅ Ck,j ⌅ w(BC ⇧ A)
GOAL ⇤= S0,N
Our algorithm computes expectations on state-split items Ai,j,n,d.4 Let functions n+(·) and d+(·)respectively represent the number of correct and to-tal dependencies introduced by a parsing action. Wecan now present the state-split variant of the insidealgorithm in Fig. 3. The final recursion simply incor-porates the loss function for all derivations having aparticular F-score; by running the full inside-outsidealgorithm on this state-split program, we obtain thedesired expectations.5 A simple modification of theweight on the goal transition enables us to optimiseprecision, recall or a weighted F-measure.
3These correspond to unary rules A ! ai and binary rulesA ! BC in a context-free grammar in Chomsky normal form.
4Here we use state-splitting to refer to splitting an item Ai,j
into many items Ai,j,n,d, one for each hn, di pair.5The outside equations can be easily derived from the inside
algorithm, or mechanically using the reverse values of Good-man (1999).
• Penalise high-loss outputs.!
• Re-weight outcomes by loss function.!
• Loss function an unweighted feature -- if decomposable.
weights true outputfeatures input proposed outputpossible outputs
training examples
Decomposability
• CKY assumes weights factor over substructures (node + children = substructure).
Marcel proved completeness
NP0,1,
NP2,3(S\NP)/NP1,2,
S0,3,
S\NP1,3,
• A decomposable loss function must factor identically.
Decomposability
• CKY assumes weights factor over substructures (node + children = substructure).
Marcel proved completeness
NP0,1,
NP2,3(S\NP)/NP1,2,
S0,3,
S\NP1,3,
• A decomposable loss function must factor identically.
Decomposability
• CKY assumes weights factor over substructures (node + children = substructure).
Marcel proved completeness
NP0,1,
NP2,3(S\NP)/NP1,2,
S0,3,
S\NP1,3,
• A decomposable loss function must factor identically.
Decomposability
Marcel proved completeness
NP0,1,
NP2,3(S\NP)/NP1,2,
S0,3,
S\NP1,3,
correct dependencies returned all dependencies returned
|y \ y0| = n
|y0| = dn = n1 + n2
Decomposability
Marcel proved completeness
NP0,1,
NP2,3(S\NP)/NP1,2,
S0,3,
S\NP1,3,
correct dependencies returned all dependencies returned
|y \ y0| = n
|y0| = d
: n1 : n2
: n
Correct dependency counts
n = n1 + n2
Marcel proved completeness
NP0,1,
NP2,3(S\NP)/NP1,2,
S0,3,
S\NP1,3,: f1 : f2
: f
F-measure
correct dependencies returned all dependencies returned
|y \ y0| = n
|y0| = d
Decomposability
Marcel proved completeness
NP0,1,
NP2,3(S\NP)/NP1,2,
S0,3,
S\NP1,3,: f1 : f2
: f
F-measure
correct dependencies returned all dependencies returned
|y \ y0| = n
|y0| = d
Decomposability
f = f1 f2⌦
Marcel proved completeness
NP0,1,
NP2,3(S\NP)/NP1,2,
S0,3,
S\NP1,3,: f1 : f2
: f
F-measure
correct dependencies returned all dependencies returned
|y \ y0| = n
|y0| = d
Decomposability
f = f1 f2⌦
Approximations!
Approximate Loss Functions
Marcel proved completeness
NP0,1,
NP2,3(S\NP)/NP1,2,
S0,3,
S\NP1,3,
Approximate Loss Functions
Marcel proved completeness
NP0,1,
NP2,3(S\NP)/NP1,2,
S0,3,
S\NP1,3,:0,0 :1,1
:1,1
:0,0:0,0
n+ correct dependencies d+ all dependencies
c+ gold dependencies
for each substructure:
Approximate Loss Functions
Marcel proved completeness
NP0,1,
NP2,3(S\NP)/NP1,2,
S0,3,
S\NP1,3,:0,0 :1,1
:1,1
:0,0:0,0
Ai,i+1,n+(ai:A),d+(ai:A) ⇤= w(ai : A)
Ai,j,n+n�+n+(BC�A),d+d�+d+(BC�A) ⇤= Bi,k,n,d ⌅ Ck,j,n�,d� ⌅ w(BC ⇧ A)
GOAL ⇤= S0,N,n,d ⌅�1� 2n
d+ |y|
⇥
Figure 3: State-split inside algorithm for computing softmax-margin with F-measure.
Note that while this algorithm computes exactsentence-level expectations, it is approximate at thecorpus level, since F-measure does not decomposeover sentences. We give the extension to exactcorpus-level expectations in Appendix A.
3.2 Approximate Loss FunctionsWe will also consider approximate but more effi-cient alternatives to our exact algorithms. The ideais to use cost functions which only utilise statisticsavailable within the current local structure, similar tothose used by Taskar et al. (2004) for tracking con-stituent errors in a context-free parser. We designthree simple losses to approximate precision, recalland F-measure on CCG dependency structures.
Let T (y) be the set of parsing actions requiredto build parse y. Our decomposable approximationto precision simply counts the number of incorrectdependencies using the local dependency counts,n+(·) and d+(·).
DecP (y) =⇤
t⇥T (y)
d+(t)� n+(t) (8)
To compute our approximation to recall we re-quire the number of gold dependencies, c+(·), whichshould have been introduced by a particular parsingaction. A gold dependency is due to be recoveredby a parsing action if its head lies within one childspan and its dependent within the other. This yields adecomposed approximation to recall that counts thenumber of missed dependencies.
DecR(y) =⇤
t⇥T (y)
c+(t)� n+(t) (9)
Unfortunately, the flexible handling of dependenciesin CCG complicates our formulation of c+, render-ing it slightly more approximate: The unificationmechanism of CCG sometimes causes dependencies
to be realised later in the derivation, at a point whenboth the head and the dependent are in the samespan, violating the assumption used to compute c+(see again Figure 2). Exceptions like this can causemismatches between n+ and c+. We set c+ = n+
whenever c+ < n+ to account for these occasionaldiscrepancies.
Finally, we obtain a decomposable approximationto F-measure.
DecF1(y) = DecP (y) +DecR(y) (10)
4 Experiments
Parsing Strategy. CCG parsers use a pipeline strat-egy: we first multitag each word of the sentence witha small subset of its possible lexical categories us-ing a supertagger, a sequence model over these cat-egories (Bangalore and Joshi, 1999; Clark, 2002).Then we parse the sentence under the requirementthat the lexical categories are fixed to those preferredby the supertagger. In our experiments we used twovariants on this strategy.
First is the adaptive supertagging (AST) approachof Clark and Curran (2004). It is based on a stepfunction over supertagger beam widths, relaxing thepruning threshold for lexical categories only if theparser fails to find an analysis. The process eithersucceeds and returns a parse after some iteration orgives up after a predefined number of iterations. AsClark and Curran (2004) show, most sentences canbe parsed with very tight beams.
Reverse adaptive supertagging is a much less ag-gressive method that seeks only to make sentencesparsable when they otherwise would not be due to animpractically large search space. Reverse AST startswith a wide beam, narrowing it at each iteration onlyif a maximum chart size is exceeded. Table 1 showsbeam settings for both strategies.
Ai,i+1,n+(ai:A),d+(ai:A) ⇤= w(ai : A)
Ai,j,n+n�+n+(BC�A),d+d�+d+(BC�A) ⇤= Bi,k,n,d ⌅ Ck,j,n�,d� ⌅ w(BC ⇧ A)
GOAL ⇤= S0,N,n,d ⌅�1� 2n
d+ |y|
⇥
Figure 3: State-split inside algorithm for computing softmax-margin with F-measure.
Note that while this algorithm computes exactsentence-level expectations, it is approximate at thecorpus level, since F-measure does not decomposeover sentences. We give the extension to exactcorpus-level expectations in Appendix A.
3.2 Approximate Loss FunctionsWe will also consider approximate but more effi-cient alternatives to our exact algorithms. The ideais to use cost functions which only utilise statisticsavailable within the current local structure, similar tothose used by Taskar et al. (2004) for tracking con-stituent errors in a context-free parser. We designthree simple losses to approximate precision, recalland F-measure on CCG dependency structures.
Let T (y) be the set of parsing actions requiredto build parse y. Our decomposable approximationto precision simply counts the number of incorrectdependencies using the local dependency counts,n+(·) and d+(·).
DecP (y) =⇤
t⇥T (y)
d+(t)� n+(t) (8)
To compute our approximation to recall we re-quire the number of gold dependencies, c+(·), whichshould have been introduced by a particular parsingaction. A gold dependency is due to be recoveredby a parsing action if its head lies within one childspan and its dependent within the other. This yields adecomposed approximation to recall that counts thenumber of missed dependencies.
DecR(y) =⇤
t⇥T (y)
c+(t)� n+(t) (9)
Unfortunately, the flexible handling of dependenciesin CCG complicates our formulation of c+, render-ing it slightly more approximate: The unificationmechanism of CCG sometimes causes dependencies
to be realised later in the derivation, at a point whenboth the head and the dependent are in the samespan, violating the assumption used to compute c+(see again Figure 2). Exceptions like this can causemismatches between n+ and c+. We set c+ = n+
whenever c+ < n+ to account for these occasionaldiscrepancies.
Finally, we obtain a decomposable approximationto F-measure.
DecF1(y) = DecP (y) +DecR(y) (10)
4 Experiments
Parsing Strategy. CCG parsers use a pipeline strat-egy: we first multitag each word of the sentence witha small subset of its possible lexical categories us-ing a supertagger, a sequence model over these cat-egories (Bangalore and Joshi, 1999; Clark, 2002).Then we parse the sentence under the requirementthat the lexical categories are fixed to those preferredby the supertagger. In our experiments we used twovariants on this strategy.
First is the adaptive supertagging (AST) approachof Clark and Curran (2004). It is based on a stepfunction over supertagger beam widths, relaxing thepruning threshold for lexical categories only if theparser fails to find an analysis. The process eithersucceeds and returns a parse after some iteration orgives up after a predefined number of iterations. AsClark and Curran (2004) show, most sentences canbe parsed with very tight beams.
Reverse adaptive supertagging is a much less ag-gressive method that seeks only to make sentencesparsable when they otherwise would not be due to animpractically large search space. Reverse AST startswith a wide beam, narrowing it at each iteration onlyif a maximum chart size is exceeded. Table 1 showsbeam settings for both strategies.
Ai,i+1,n+(ai:A),d+(ai:A) ⇤= w(ai : A)
Ai,j,n+n�+n+(BC�A),d+d�+d+(BC�A) ⇤= Bi,k,n,d ⌅ Ck,j,n�,d� ⌅ w(BC ⇧ A)
GOAL ⇤= S0,N,n,d ⌅�1� 2n
d+ |y|
⇥
Figure 3: State-split inside algorithm for computing softmax-margin with F-measure.
Note that while this algorithm computes exactsentence-level expectations, it is approximate at thecorpus level, since F-measure does not decomposeover sentences. We give the extension to exactcorpus-level expectations in Appendix A.
3.2 Approximate Loss FunctionsWe will also consider approximate but more effi-cient alternatives to our exact algorithms. The ideais to use cost functions which only utilise statisticsavailable within the current local structure, similar tothose used by Taskar et al. (2004) for tracking con-stituent errors in a context-free parser. We designthree simple losses to approximate precision, recalland F-measure on CCG dependency structures.
Let T (y) be the set of parsing actions requiredto build parse y. Our decomposable approximationto precision simply counts the number of incorrectdependencies using the local dependency counts,n+(·) and d+(·).
DecP (y) =⇤
t⇥T (y)
d+(t)� n+(t) (8)
To compute our approximation to recall we re-quire the number of gold dependencies, c+(·), whichshould have been introduced by a particular parsingaction. A gold dependency is due to be recoveredby a parsing action if its head lies within one childspan and its dependent within the other. This yields adecomposed approximation to recall that counts thenumber of missed dependencies.
DecR(y) =⇤
t⇥T (y)
c+(t)� n+(t) (9)
Unfortunately, the flexible handling of dependenciesin CCG complicates our formulation of c+, render-ing it slightly more approximate: The unificationmechanism of CCG sometimes causes dependencies
to be realised later in the derivation, at a point whenboth the head and the dependent are in the samespan, violating the assumption used to compute c+(see again Figure 2). Exceptions like this can causemismatches between n+ and c+. We set c+ = n+
whenever c+ < n+ to account for these occasionaldiscrepancies.
Finally, we obtain a decomposable approximationto F-measure.
DecF1(y) = DecP (y) +DecR(y) (10)
4 Experiments
Parsing Strategy. CCG parsers use a pipeline strat-egy: we first multitag each word of the sentence witha small subset of its possible lexical categories us-ing a supertagger, a sequence model over these cat-egories (Bangalore and Joshi, 1999; Clark, 2002).Then we parse the sentence under the requirementthat the lexical categories are fixed to those preferredby the supertagger. In our experiments we used twovariants on this strategy.
First is the adaptive supertagging (AST) approachof Clark and Curran (2004). It is based on a stepfunction over supertagger beam widths, relaxing thepruning threshold for lexical categories only if theparser fails to find an analysis. The process eithersucceeds and returns a parse after some iteration orgives up after a predefined number of iterations. AsClark and Curran (2004) show, most sentences canbe parsed with very tight beams.
Reverse adaptive supertagging is a much less ag-gressive method that seeks only to make sentencesparsable when they otherwise would not be due to animpractically large search space. Reverse AST startswith a wide beam, narrowing it at each iteration onlyif a maximum chart size is exceeded. Table 1 showsbeam settings for both strategies.
n+ correct dependencies d+ all dependencies
c+ gold dependencies
for each substructure:
Approximate Losses with CKY
time1 flies2 like3 an4 arrow5
items Ai,j target analysiscorrect dependencies all dependencies
Approximate Losses with CKY
time1 flies2 like3 an4 arrow5
NP0,1,0,0 S\NP1,2,0,0((S\NP)\(S\NP))/NP2,3,0,0
NP/NP3,4,0,0 NP4,5,0,0
items Ai,j target analysiscorrect dependencies all dependencies
Approximate Losses with CKY
time1 flies2 like3 an4 arrow5
NP0,1,0,0 S\NP1,2,0,0((S\NP)\(S\NP))/NP2,3,0,0
NP/NP3,4,0,0 NP4,5,0,0
items Ai,j
NP3,5,1,1
target analysis
DecF1(1,1)
correct dependencies all dependencies
Approximate Losses with CKY
time1 flies2 like3 an4 arrow5
NP0,1,0,0 S\NP1,2,0,0((S\NP)\(S\NP))/NP2,3,0,0
NP/NP3,4,0,0 NP4,5,0,0
items Ai,j
NP3,5,1,1
(S\NP)\(S\NP)2,5,2,2
target analysis
DecF1(1,1)
DecF1(1,1)
correct dependencies all dependencies
Approximate Losses with CKY
time1 flies2 like3 an4 arrow5
NP0,1,0,0 S\NP1,2,0,0((S\NP)\(S\NP))/NP2,3,0,0
NP/NP3,4,0,0 NP4,5,0,0
items Ai,j
NP3,5,1,1
(S\NP)\(S\NP)2,5,2,2
S\NP1,5,3,3
target analysis
DecF1(1,1)
DecF1(1,1)
DecF1(1,1)
correct dependencies all dependencies
Approximate Losses with CKY
time1 flies2 like3 an4 arrow5
NP0,1,0,0 S\NP1,2,0,0((S\NP)\(S\NP))/NP2,3,0,0
NP/NP3,4,0,0 NP4,5,0,0
items Ai,j
NP3,5,1,1
(S\NP)\(S\NP)2,5,2,2
S\NP1,5,3,3
S0,5
target analysis
DecF1(1,1)
DecF1(1,1)
DecF1(1,1)
DecF1(1,1)
correct dependencies all dependencies
Approximate Losses with CKY
time1 flies2 like3 an4 arrow5
NP0,1,0,0 S\NP1,2,0,0((S\NP)\(S\NP))/NP2,3,0,0
NP/NP3,4,0,0 NP4,5,0,0
items Ai,j
NP3,5,1,1
(S\NP)\(S\NP)2,5,2,2
S\NP1,5,3,3
S0,5
target analysisGOAL
DecF1(1,1)
DecF1(1,1)
DecF1(1,1)
DecF1(1,1)
correct dependencies all dependencies
time1 flies2 like3 an4 arrow5
another analysisitems Ai,j
correct dependencies all dependencies
Approximate Losses with CKY
time1 flies2 like3 an4 arrow5
NP/NP3,4,0,0 NP4,5,0,0
NP/NP0,1,0,0(S\NP)/NP2,3,0,0NP0,1,0,0
another analysisitems Ai,j
correct dependencies all dependencies
Approximate Losses with CKY
time1 flies2 like3 an4 arrow5
NP3,5,1,1
NP/NP3,4,0,0 NP4,5,0,0
NP/NP0,1,0,0(S\NP)/NP2,3,0,0NP0,1,0,0
another analysisitems Ai,j
DecF1(1,1)
correct dependencies all dependencies
Approximate Losses with CKY
time1 flies2 like3 an4 arrow5
NP3,5,1,1
S\NP2,5,1,2
NP/NP3,4,0,0 NP4,5,0,0
NP/NP0,1,0,0(S\NP)/NP2,3,0,0NP0,1,0,0
another analysisitems Ai,j
DecF1(1,1)
DecF1(0,1)
correct dependencies all dependencies
Approximate Losses with CKY
time1 flies2 like3 an4 arrow5
NP3,5,1,1
NP0,2,0,1
S\NP2,5,1,2
NP/NP3,4,0,0 NP4,5,0,0
NP/NP0,1,0,0(S\NP)/NP2,3,0,0NP0,1,0,0
another analysisitems Ai,j
DecF1(1,1)
DecF1(0,1)
DecF1(0,1)
correct dependencies all dependencies
Approximate Losses with CKY
time1 flies2 like3 an4 arrow5
NP3,5,1,1
NP0,2,0,1
S\NP2,5,1,2
S0,5
NP/NP3,4,0,0 NP4,5,0,0
NP/NP0,1,0,0(S\NP)/NP2,3,0,0NP0,1,0,0
another analysisitems Ai,j
DecF1(1,1)
DecF1(0,1)
DecF1(0,1)
DecF1(0,1)
correct dependencies all dependencies
Approximate Losses with CKY
time1 flies2 like3 an4 arrow5
NP3,5,1,1
NP0,2,0,1
S\NP2,5,1,2
S0,5
NP/NP3,4,0,0 NP4,5,0,0
NP/NP0,1,0,0(S\NP)/NP2,3,0,0NP0,1,0,0
another analysisitems Ai,jGOAL
DecF1(1,1)
DecF1(0,1)
DecF1(0,1)
DecF1(0,1)
correct dependencies all dependencies
Approximate Losses with CKY
time1 flies2 like3 an4 arrow5
NP0,1,0,0 S\NP1,2,0,0((S\NP)\(S\NP))/NP2,3,0,0
NP/NP3,4,0,0 NP4,5,0,0
NP3,5,1,1
(S\NP)\(S\NP)2,5,2,2
S\NP1,5,3,3
S0,5
NP0,2,0,1
S\NP2,5,1,2
NP/NP0,1,0,0(S\NP)/NP2,3,0,0NP0,1,0,0
both analysisitems Ai,jGOAL
DecF1(1,1)
DecF1(0,1)
DecF1(0,1)
DecF1(1,1)DecF1(0,1)
DecF1(1,1)
DecF1(1,1)
correct dependencies all dependencies
Approximate Losses with CKY
time1 flies2 like3 an4 arrow5
NP0,1,0,0 S\NP1,2,0,0((S\NP)\(S\NP))/NP2,3,0,0
NP/NP3,4,0,0 NP4,5,0,0
NP3,5,1,1
(S\NP)\(S\NP)2,5,2,2
S\NP1,5,3,3
S0,5
NP0,2,0,1
S\NP2,5,1,2
NP/NP0,1,0,0(S\NP)/NP2,3,0,0NP0,1,0,0
both analysisitems Ai,jGOAL
DecF1(1,1)
DecF1(0,1)
DecF1(0,1)
DecF1(1,1)DecF1(0,1)
DecF1(1,1)
DecF1(1,1)
correct dependencies all dependencies
Approximate Losses with CKY
Marcel proved completeness
NP0,1,
NP2,3(S\NP)/NP1,2,
S0,3,
S\NP1,3,
F1(y, y0) =
2n
d+ |y|
F-measure
correct dependencies returned all dependencies returned
|y \ y0| = n
|y0| = d
Decomposability Revisited
Marcel proved completeness
NP0,1,
NP2,3(S\NP)/NP1,2,
S0,3,
S\NP1,3,: n1, d1 : n2, d2
F1(y, y0) =
2n
d+ |y|
F-measure
correct dependencies returned all dependencies returned
|y \ y0| = n
|y0| = d
Decomposability Revisited
Marcel proved completeness
NP0,1,
NP2,3(S\NP)/NP1,2,
S0,3,
S\NP1,3,: n1, d1 : n2, d2
F1(y, y0) =
2n
d+ |y|
F-measure
correct dependencies returned all dependencies returned
|y \ y0| = n
|y0| = d
Decomposability Revisited
f =2n1
d1 + |y| ⌦2n2
d2 + |y|
=2(n1 + n2)
d1 + d2 + 2|y|
Exact Loss Functions
Marcel proved completeness
NP0,1,
NP2,3(S\NP)/NP1,2,
S0,3,
S\NP1,3,
Exact Loss Functions
• Treat sentence-level F1 as non-local feature dependent on n, d.
Marcel proved completeness
NP0,1,
NP2,3(S\NP)/NP1,2,
S0,3,
S\NP1,3,
Exact Loss Functions
• Treat sentence-level F1 as non-local feature dependent on n, d.
• Result: new dynamic program over items Ai,j,n,d
Marcel proved completeness
NP0,1,
NP2,3(S\NP)/NP1,2,
S0,3,
S\NP1,3,,0,0
,0,0,0,0
,1,1
,1,1
Exact Losses with State-Split CKYitems Ai,j,n,d
time1 flies2 like3 an4 arrow5
correct dependencies all dependencies
Exact Losses with State-Split CKYitems Ai,j,n,d
time1 flies2 like3 an4 arrow5
NP3,5,1,1
(S\NP)\(S\NP)2,5,2,2
S\NP1,5,3,3
S0,5,4,4
NP0,1,0,0 S\NP1,2,0,0((S\NP)\(S\NP))/NP2,3,0,0
NP/NP3,4,0,0 NP4,5,0,0
correct dependencies all dependencies
Exact Losses with State-Split CKYitems Ai,j,n,d
time1 flies2 like3 an4 arrow5
NP3,5,1,1
(S\NP)\(S\NP)2,5,2,2
S\NP1,5,3,3
S0,5,4,4
NP0,1,0,0 S\NP1,2,0,0((S\NP)\(S\NP))/NP2,3,0,0
NP/NP3,4,0,0 NP4,5,0,0
correct dependencies all dependencies
GOAL
Exact Losses with State-Split CKYitems Ai,j,n,d
time1 flies2 like3 an4 arrow5
NP3,5,1,1
(S\NP)\(S\NP)2,5,2,2
S\NP1,5,3,3
S0,5,4,4
NP0,1,0,0 S\NP1,2,0,0((S\NP)\(S\NP))/NP2,3,0,0
NP/NP3,4,0,0 NP4,5,0,0
correct dependencies all dependencies
GOALF1(4,4)
time1 flies2 like3 an4 arrow5
NP3,5,1,1
(S\NP)\(S\NP)2,5,2,2
S\NP1,5,3,3
S0,5,4,4
NP0,2,0,1
S\NP3,5,1,2
S0,5,1,4
NP0,1,0,0 S\NP1,2,0,0((S\NP)\(S\NP))/NP2,3,0,0
NP/NP3,4,0,0 NP4,5,0,0
NP/NP0,1,0,0(S\NP)/NP2,3,0,0NP0,1,0,0
correct dependencies all dependencies
items Ai,j,n,dGOAL F1(1,4)F1(4,4)
Exact Losses with State-Split CKY
time1 flies2 like3 an4 arrow5
NP3,5,1,1
(S\NP)\(S\NP)2,5,2,2
S\NP1,5,3,3
S0,5,4,4
NP0,2,0,1
S\NP3,5,1,2
S0,5,1,4
NP0,1,0,0 S\NP1,2,0,0((S\NP)\(S\NP))/NP2,3,0,0
NP/NP3,4,0,0 NP4,5,0,0
NP/NP0,1,0,0(S\NP)/NP2,3,0,0NP0,1,0,0
correct dependencies all dependencies
items Ai,j,n,dGOAL F1(1,4)F1(4,4)
Exact Losses with State-Split CKY
time1 flies2 like3 an4 arrow5
NP3,5,1,1
(S\NP)\(S\NP)2,5,2,2
S\NP1,5,3,3
S0,5,4,4
NP0,2,0,1
S\NP3,5,1,2
S0,5,1,4
NP0,1,0,0 S\NP1,2,0,0((S\NP)\(S\NP))/NP2,3,0,0
NP/NP3,4,0,0 NP4,5,0,0
NP/NP0,1,0,0(S\NP)/NP2,3,0,0NP0,1,0,0
Speed O(L7) Space O(L4)
items Ai,j,n,dGOAL F1(1,4)F1(4,4)
Exact Losses with State-Split CKY
time1 flies2 like3 an4 arrow5
NP3,5,1,1
(S\NP)\(S\NP)2,5,2,2
S\NP1,5,3,3
S0,5,4,4
NP0,2,0,1
S\NP3,5,1,2
S0,5,1,4
NP0,1,0,0 S\NP1,2,0,0((S\NP)\(S\NP))/NP2,3,0,0
NP/NP3,4,0,0 NP4,5,0,0
NP/NP0,1,0,0(S\NP)/NP2,3,0,0NP0,1,0,0
items Ai,j,n,dGOAL F1(1,4)F1(4,4)
Exact Losses with State-Split CKYin practice!
48 x larger DP!30 x slower
Experiments
• Standard parsing task:!
• C&C Parser and supertagger (Clark & Curran 2007).!
• CCGBank standard train/dev/test splits.!
• Piecewise optimisation (Sutton and McCallum 2005)
Exact versus Approximate
86.9
87.1
87.3
87.6
87.8
88.0
Precision Recall F-measure
Approximate Exact
Exact versus Approximate
86.9
87.1
87.3
87.6
87.8
88.0
Precision Recall F-measure
Approximate Exact
Exact versus Approximate
86.9
87.1
87.3
87.6
87.8
88.0
Precision Recall F-measure
Approximate Exact
Exact versus Approximate
86.9
87.1
87.3
87.6
87.8
88.0
Precision Recall F-measure
Approximate Exact
Exact versus Approximate
Approximate loss functions work, and much faster!
86.9
87.1
87.3
87.6
87.8
88.0
Precision Recall F-measure
Approximate Exact
Softmax-Margin beats CLLTest set results
87.5
87.9
88.2
88.6
Tight beam Loose beam
C&C ‘07 DecF1
Labe
lled
F-m
easu
re
Softmax-Margin beats CLLTest set results
87.5
87.9
88.2
88.6
Tight beam Loose beam
C&C ‘07 DecF1
88.1
87.7Labe
lled
F-m
easu
re
Softmax-Margin beats CLLTest set results
87.5
87.9
88.2
88.6
Tight beam Loose beam
C&C ‘07 DecF1
88.6
88.1
87.787.7
+0.9
Labe
lled
F-m
easu
re
Does task-specific optimisation degrade accuracy on other metrics?
Softmax-Margin beats CLL
Does task-specific optimisation degrade accuracy on other metrics?
37.0
38.0
39.0
40.0
Tight beam Loose beam
39.1
38.0 38.0
37.7
C&C ‘07 DecF1
Labe
lled
Exac
t Mat
ch
Softmax-Margin beats CLL
Integrated Model + SMM
Marcel proved completeness
Integrated Model + SMM
Marcel proved completeness
Integrated Model + SMM
Marcel proved completeness
Hamming!augmented!expectations
forward- backward
Integrated Model + SMM
Marcel proved completeness
Hamming!augmented!expectations
forward- backward
Integrated Model + SMM
Marcel proved completeness
parsing factor
supertagging factor
Hamming!augmented!expectations
forward- backward
Integrated Model + SMM
Marcel proved completeness
parsing factor
supertagging factor
Hamming!augmented!expectations
forward- backward
F-measure!augmented!expectations
inside-outside
Integrated Model + SMM
Marcel proved completeness
parsing factor
supertagging factor
Hamming!augmented!expectations
forward- backward
F-measure!augmented!expectations
inside-outside
Results: Integrated Model• F-measure loss for parsing sub-model (+DecF1).!
• Hamming loss for supertagging sub-model (+Tagger).!
• Belief propagation for inference.
87.0
87.8
88.5
89.3
90.0
C&C ’07 Integrated +DecF1 +Tagger
Labe
lled
F-m
easu
re
Results: Integrated Model• F-measure loss for parsing sub-model (+DecF1).!
• Hamming loss for supertagging sub-model (+Tagger).!
• Belief propagation for inference.
87.0
87.8
88.5
89.3
90.0
C&C ’07 Integrated +DecF1 +Tagger
87.7
Labe
lled
F-m
easu
re
Results: Integrated Model• F-measure loss for parsing sub-model (+DecF1).!
• Hamming loss for supertagging sub-model (+Tagger).!
• Belief propagation for inference.
87.0
87.8
88.5
89.3
90.0
C&C ’07 Integrated +DecF1 +Tagger
88.9
87.7
Labe
lled
F-m
easu
re
Results: Integrated Model• F-measure loss for parsing sub-model (+DecF1).!
• Hamming loss for supertagging sub-model (+Tagger).!
• Belief propagation for inference.
87.0
87.8
88.5
89.3
90.0
C&C ’07 Integrated +DecF1 +Tagger
89.288.9
87.7
Labe
lled
F-m
easu
re
Results: Integrated Model• F-measure loss for parsing sub-model (+DecF1).!
• Hamming loss for supertagging sub-model (+Tagger).!
• Belief propagation for inference.
87.0
87.8
88.5
89.3
90.0
C&C ’07 Integrated +DecF1 +Tagger
89.389.288.9
87.7
Labe
lled
F-m
easu
re
+1.5
Results: Automatic POS
85.0
85.8
86.5
87.3
88.0
C&C ’07 Petrov-I5 Integrated +DecF1 +Tagger
Labe
lled
F-m
easu
re
Fowler & Penn (2010)
• F-measure loss for parsing sub-model (+DecF1).!
• Hamming loss for supertagging sub-model (+Tagger).!
• Belief propagation for inference.
Results: Automatic POS
85.0
85.8
86.5
87.3
88.0
C&C ’07 Petrov-I5 Integrated +DecF1 +Tagger
85.7
Labe
lled
F-m
easu
re
Fowler & Penn (2010)
• F-measure loss for parsing sub-model (+DecF1).!
• Hamming loss for supertagging sub-model (+Tagger).!
• Belief propagation for inference.
Results: Automatic POS
85.0
85.8
86.5
87.3
88.0
C&C ’07 Petrov-I5 Integrated +DecF1 +Tagger
86.085.7
Labe
lled
F-m
easu
re
Fowler & Penn (2010)
• F-measure loss for parsing sub-model (+DecF1).!
• Hamming loss for supertagging sub-model (+Tagger).!
• Belief propagation for inference.
Results: Automatic POS
85.0
85.8
86.5
87.3
88.0
C&C ’07 Petrov-I5 Integrated +DecF1 +Tagger
86.8
86.085.7
Labe
lled
F-m
easu
re
Fowler & Penn (2010)
• F-measure loss for parsing sub-model (+DecF1).!
• Hamming loss for supertagging sub-model (+Tagger).!
• Belief propagation for inference.
Results: Automatic POS
85.0
85.8
86.5
87.3
88.0
C&C ’07 Petrov-I5 Integrated +DecF1 +Tagger
87.186.8
86.085.7
Labe
lled
F-m
easu
re
Fowler & Penn (2010)
• F-measure loss for parsing sub-model (+DecF1).!
• Hamming loss for supertagging sub-model (+Tagger).!
• Belief propagation for inference.
Results: Automatic POS
85.0
85.8
86.5
87.3
88.0
C&C ’07 Petrov-I5 Integrated +DecF1 +Tagger
87.287.186.8
86.085.7
Labe
lled
F-m
easu
re
Fowler & Penn (2010)
+1.5
• F-measure loss for parsing sub-model (+DecF1).!
• Hamming loss for supertagging sub-model (+Tagger).!
• Belief propagation for inference.
Results: Efficiency vs. AccuracySe
nten
ces/
seco
nd
AccuracyBetter
Faster
0
11
22
33
44
55
66
77
88
99
110
87 88 89 90
Results: Efficiency vs. AccuracySe
nten
ces/
seco
nd
AccuracyBetter
Faster
0
11
22
33
44
55
66
77
88
99
110
87 88 89 90
Results: Efficiency vs. AccuracySe
nten
ces/
seco
nd
AccuracyBetter
Faster
C&C
Results: Efficiency vs. AccuracySe
nten
ces/
seco
nd
AccuracyBetter
Faster
C&C
0
11
22
33
44
55
66
77
88
99
110
87 88 89 90
Results: Efficiency vs. AccuracySe
nten
ces/
seco
nd
AccuracyBetter
Faster
C&C Integrated Model
0
11
22
33
44
55
66
77
88
99
110
87 88 89 90
Results: Efficiency vs. AccuracySe
nten
ces/
seco
nd
AccuracyBetter
Faster
C&C Integrated Model
Softmax- Margin Training
Summary
• Softmax-Margin training is easy and improves our model.!
• Approximate loss functions are fast, accurate and easy to use.!
• Best ever CCG parsing results (87.7 → 89.3).
Future Directions
• What can we do with the presented methods?!
• BP for other complex problems e.g. SMT!
• Semantics for SMT.!
• Simultaneous parsing of multiple sentences.
BP for other NLP pipelines
• Pipelines necessary for practical NLP systems!
• More accurate integrated models often too complex!
• This talk: Approximate inference can make these models practical!
• Use it for other pipelines e.g. POS, NER tagging & Parsing!
• Hard: BP for syntactic MT, another weighted intersection problem between LM & TM
Semantics for SMT• Compositional & distributional meaning
representation to compute vectors of sentence-meaning (Greffenstette & Sadrzadeh, 2011; Clark, to appear) !
• Syntax (e.g. CCG) drives compositional process!
• Directions: Model optimisation, evaluation, LM
Translation
Reference
Parsing beyond sentence-level• Many NLP tasks (e.g. IE) rely on uniform analysis of constituents!
• Skip-Chain CRFs successful to predict consistent NER tags across sentences (Sutton & McCallum, 2004)!
• Parse multiple sentences at once and enforce uniformity of parses
The securities and exchange commission issued ...
NNP/N
NP
... responded to the statement of the securities and exchange commission
NPconjNP
NP\NP
NP
1.
2.
Related Publications• A Comparison of Loopy Belief Propagation and Dual
Decomposition for Integrated CCG Supertagging and Parsing. with Adam Lopez. In Proc. of ACL, June 2011. !
• Efficient CCG Parsing: A* versus Adaptive Supertagging. with Adam Lopez. In Proc. of ACL, June 2011.!
• Training a Log-Linear Parser with Softmax-Margin. with Adam Lopez. In Proc. of EMNLP, July 2011.!
• A Systematic Comparison of Translation Model Search Spaces. with Adam Lopez, Hieu Hoang, Philipp Koehn. In Proc. of WMT, March 2009.
Thank you