Computer Science Department
David Caley, Thomas Folz-Donahue, Rob Hall, Matt Marzilli
Accurate Parsing
('they worry that air the shows , drink too much , whistle johnny b. goode and watch the other ropes , whistle johnny b. goode and watch closely and suffer through the sale', 2.1730387621600077e-11)
Accurate Parsing: Our Goal
Given a grammar
• For a sentence S, return the parse tree with the maximum probability conditioned on S.
t* = argmax_{t in T} P(t | S), where T is the set of possible parse trees of sentence S
Talking Points
Using the Penn-Treebank
• Reading in n-ary trees
• Finding Head-tags within n-ary productions
• Converting to Binary Trees
• Inducing a CFG grammar
Probabilistic CYK
• Handling Unary rules
• Dealing with unknowns
• Dealing with run times
  • Beam search, limiting depth of unary rules, further optimizations
Example Parses and Trees
Lexicalization Attempts
Using the Penn-Treebank: Our Training Data
Contains tagged data and n-ary trees from a Wall Street Journal corpus.
Contains some information unneeded by the parser.
Questionable Tagging
• (JJ the) ??
Example…
Using the Penn-Treebank: Handling N-ary trees
( (S (NP-SBJ-1 (NNS Consumers) ) (VP (MD may) (VP (VB want) (S (NP-SBJ (-NONE- *-1) ) (VP (TO to) (VP (VB move) (NP (PRP$ their) (NNS telephones) ) (ADVP-DIR (NP (DT a) (RB little) ) (RBR closer) (PP (TO to) (NP (DT the) (NN TV) (NN set) ))))))))
Functional tags such as NP-SBJ-1 are ignored; we simply call this an NP.
The -NONE- tags, which mark traces, are also ignored.
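The cleanup described above can be sketched as follows. This is a minimal illustration, not the parser's actual code: the function names (`parse_bracketed`, `clean_label`, `clean`), the nested-list tree representation, and the choice to drop a constituent that contains nothing but a trace are all assumptions. The input is a shortened fragment of the example bracketing.

```python
import re

def parse_bracketed(text):
    """Parse one '(LABEL child ...)' bracketing into nested lists [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # consume "("
        node = [tokens[pos]]          # the node label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read())   # nested constituent
            else:
                node.append(tokens[pos])  # a word
                pos += 1
        pos += 1                      # consume ")"
        return node

    return read()

def clean_label(label):
    """NP-SBJ-1 -> NP. Labels that start with '-' (e.g. -NONE-) are left alone."""
    if label.startswith("-"):
        return label
    return re.split(r"[-=]", label)[0]

def clean(node):
    """Return a copy without traces and functional tags, or None if nothing remains."""
    label, children = node[0], node[1:]
    if label == "-NONE-":
        return None                   # trace: ignored
    if len(children) == 1 and isinstance(children[0], str):
        return [clean_label(label), children[0]]   # tag over a word
    kept = [c for c in (clean(ch) for ch in children) if c is not None]
    # Assumption: a constituent left with no children is dropped as well.
    return [clean_label(label)] + kept if kept else None

fragment = ("(S (NP-SBJ-1 (NNS Consumers)) (VP (MD may) (VP (VB want) "
            "(S (NP-SBJ (-NONE- *-1)) (VP (TO to) (VP (VB move)))))))")
cleaned = clean(parse_bracketed(fragment))
```

Here `cleaned` begins with `["S", ["NP", ["NNS", "Consumers"]], ...]`, and the inner sentence no longer has a subject NP because that NP held only a trace.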
Using the Penn-Treebank: Head-Tag Finding Algorithm
For each context-free rule X -> Y1 … Yn, we can use a function to determine the “head” of the rule.
In the rule above, the head could be any of Y1 … Yn.
The head is the most important child tag.
Head-Tags Algorithm as Outlined in Collins' Thesis
• Allows us to determine the head-tags that will be used later for binary tree conversion
Using the Penn-Treebank: Head-Tag Finding Algorithm
If a list traversal finds nothing, the head-tag becomes the leftmost or rightmost element, depending on the search direction.
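A minimal sketch of this kind of table-driven head search, with the fallback to the leftmost or rightmost child. The two entries in `HEAD_RULES` are illustrative placeholders in the style of Collins' head table, not the table used in this project; `find_head` and the default for parents missing from the table are likewise assumptions.

```python
# Illustrative entries only; the real table covers every nonterminal.
# Each entry: parent -> (search direction, tags in priority order).
HEAD_RULES = {
    "VP": ("left", ["TO", "VBD", "VBN", "MD", "VBZ", "VB", "VBG", "VBP", "VP"]),
    "PP": ("right", ["IN", "TO", "VBG", "VBN", "RP", "FW"]),
}

def find_head(parent, children):
    """Return the index of the head child of the rule parent -> children."""
    # Assumption: parents missing from the table search left to right.
    direction, priorities = HEAD_RULES.get(parent, ("left", []))
    if direction == "left":
        order = range(len(children))
    else:
        order = range(len(children) - 1, -1, -1)
    for tag in priorities:            # try each tag in priority order
        for i in order:               # scan children in the rule's direction
            if children[i] == tag:
                return i
    return order[0]                   # nothing found: leftmost or rightmost child

head_of_vp = find_head("VP", ["MD", "VP"])      # MD outranks VP in this list
head_of_pp = find_head("PP", ["NP", "ADVP"])    # no match, falls back to rightmost
```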
Using the Penn-Treebank: Head-Rule Finding Algorithm
Rules for NPs are a bit different
• If the last word is tagged POS, return (last-word)
• Else search from right to left for the first child which is in the set {NN, NNP, NNPS, NNS, NX, POS, JJR}
• Else search from left to right for the first child which is an NP
• Else search from right to left for the first child which is in the set {$, ADJP, PRN}
• Else do the same with the set {CD}
• Else do the same with the set {JJ, JJS, RB, QP}
• Else return the last word
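The NP rules listed above translate fairly directly into code. The sketch below assumes the children are given as a list of tag strings and returns the index of the head child; the function name `np_head` is made up for illustration.

```python
def np_head(children):
    """Return the index of the head child of an NP, following the cascade above."""
    last = len(children) - 1
    if children[last] == "POS":
        return last

    def from_right(tags):
        for i in range(last, -1, -1):
            if children[i] in tags:
                return i
        return None

    def from_left(tags):
        for i, tag in enumerate(children):
            if tag in tags:
                return i
        return None

    # Each step is tried only if every earlier step found nothing.
    cascade = [
        (from_right, {"NN", "NNP", "NNPS", "NNS", "NX", "POS", "JJR"}),
        (from_left, {"NP"}),
        (from_right, {"$", "ADJP", "PRN"}),
        (from_right, {"CD"}),
        (from_right, {"JJ", "JJS", "RB", "QP"}),
    ]
    for search, tags in cascade:
        index = search(tags)
        if index is not None:
            return index
    return last                       # default: the last word

coordination_head = np_head(["NP", ",", "NP", "CC", "NP"])   # first NP from the left
```

For the coordinated NP "jam , cocoa and other war-rationed goodies" this picks the first NP, which agrees with the NP(jam) label in the lexicalized example tree.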
Using the Penn-Treebank: Binary Tree Conversion
Now we put the Head-Tags to use
• Binary rules are necessary to use the CFG grammar with probabilistic CYK
A general n-ary rule: R -> L_i L_{i-1} … L_1 L_0 H R_0 R_1 … R_{i-1} R_i
On the right side of the head H, we recursively split off the last element to make a new binary rule (left-recursive): R -> (L_i … L_0 H R_0 … R_{i-1}) R_i
On the left side we do the same by removing the first element (right-recursive): (L_i L_{i-1} … L_0 H) -> L_i (L_{i-1} … L_0 H)
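A minimal sketch of head-driven binarization as described: right siblings of the head are peeled off from the right end first, then left siblings from the left end. The function name `binarize` and the intermediate label scheme `PARENT|<A-B-C>` are hypothetical and differ from the labels that appear in the example trees later in the deck.

```python
def binarize(parent, children, head_idx):
    """Split parent -> children into rules with at most two children each.

    Returns a list of (lhs, rhs_tuple) rules.
    """
    def label(span):
        # Hypothetical naming for intermediate nodes.
        return f"{parent}|<{'-'.join(span)}>"

    rules = []
    current = parent
    lo, hi = 0, len(children)         # the span children[lo:hi] still to split
    # Right of the head: split off the last element (left-recursive).
    while hi - 1 > head_idx and hi - lo > 2:
        rest = label(children[lo:hi - 1])
        rules.append((current, (rest, children[hi - 1])))
        current, hi = rest, hi - 1
    # Left of the head: split off the first element (right-recursive).
    while lo < head_idx and hi - lo > 2:
        rest = label(children[lo + 1:hi])
        rules.append((current, (children[lo], rest)))
        current, lo = rest, lo + 1
    rules.append((current, tuple(children[lo:hi])))
    return rules

np_rules = binarize("NP", ["DT", "JJ", "NN", "PP"], head_idx=2)
```

With the head at NN, `np_rules` first splits off PP on the right, then DT on the left, and ends with a binary rule over JJ and NN.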
Using the Penn-Treebank: Grammar Induction Procedure
After we have binary trees we can easily begin to identify rules and record their frequency
• Identify every production and save it into a Python dictionary
Frequencies are cached in a local file for later use and read in on subsequent executions
No immediate smoothing is done on the probabilities; the grammar is later trimmed to help with performance
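A minimal sketch of counting productions into a dictionary and turning the counts into unsmoothed relative-frequency probabilities. The tuple tree format `(label, child, ...)` and the function names are assumptions made for this example; caching to a file is left out.

```python
from collections import Counter

def count_productions(tree, counts):
    """Add every production in tree to counts, keyed by (lhs, rhs_tuple)."""
    label, children = tree[0], tree[1:]
    if len(children) == 1 and isinstance(children[0], str):
        counts[(label, (children[0],))] += 1          # tag -> word
        return
    counts[(label, tuple(child[0] for child in children))] += 1
    for child in children:
        count_productions(child, counts)

def rule_probabilities(counts):
    """P(lhs -> rhs) = count(lhs -> rhs) / count(lhs), with no smoothing."""
    totals = Counter()
    for (lhs, _), n in counts.items():
        totals[lhs] += n
    return {rule: n / totals[rule[0]] for rule, n in counts.items()}

counts = Counter()
for tree in [("NP", ("NN", "jam")),
             ("NP", ("DT", "the"), ("NN", "jam"))]:
    count_productions(tree, counts)
probs = rule_probabilities(counts)
```

In this toy treebank NP expands two different ways once each, so both NP rules get probability 0.5.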
Probabilistic CYK: The Parsing Step
We use a Probabilistic CYK implementation to parse sentences with our CFG grammar and to assign probabilities to the final parse trees.
• Useful to provide multiple parses and disambiguate sentences
New Concerns
• Unary Rules and their lengths
• Runtime (result of incredibly large grammar)
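For orientation, a minimal sketch of the core probabilistic CYK recurrence over binary rules only. Unary rules, unknown words, and pruning are deliberately left out, and the grammar format (`lexicon`, `binary`) and function names are assumptions for this example rather than the project's data structures.

```python
from collections import defaultdict

def pcyk(words, lexicon, binary):
    """Fill the CYK table with the best probability per label and span.

    lexicon: {word: [(tag, prob), ...]}
    binary:  {(B, C): [(A, prob), ...]} for rules A -> B C
    Returns best[(i, j)][label] = (prob, backpointer).
    """
    n = len(words)
    best = defaultdict(dict)
    for i, word in enumerate(words):
        for tag, p in lexicon.get(word, []):
            best[(i, i + 1)][tag] = (p, word)
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cell = best[(i, j)]
            for k in range(i + 1, j):                 # split point
                for B, (pb, _) in best[(i, k)].items():
                    for C, (pc, _) in best[(k, j)].items():
                        for A, pr in binary.get((B, C), []):
                            p = pr * pb * pc
                            if p > cell.get(A, (0.0, None))[0]:
                                cell[A] = (p, (k, B, C))
    return best

def build(best, i, j, label):
    """Follow backpointers to recover the best tree for label over words[i:j]."""
    _, back = best[(i, j)][label]
    if isinstance(back, str):
        return (label, back)
    k, B, C = back
    return (label, build(best, i, k, B), build(best, k, j, C))

lexicon = {"buy": [("VB", 1.0)], "jam": [("NN", 0.5), ("NP", 0.5)]}
binary = {("VB", "NP"): [("VP", 1.0)]}
table = pcyk(["buy", "jam"], lexicon, binary)
tree = build(table, 0, 2, "VP")
```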
Probabilistic CYK: Handling Unary Rules within Grammar
Unary Rules of the form X->Y or X->a are ubiquitous in our grammar
• The closure of a constituent is needed to determine all the unary productions that can lead to that constituent.
• Def: Closure(X) = U{ {Y} U Closure(Y) | Y->X }, i.e. all nonterminals reachable from X by following unary rules upward, from child to parent.
• We implement this iteratively, maintaining a closed list and limiting depth, to prevent possible infinite loops
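An iterative closure with a closed list and a depth limit might look like the sketch below. The `parents_of` mapping (child label to the parents that rewrite to it by a unary rule), the function name, and the default `max_depth` are all assumptions for illustration.

```python
def unary_closure(symbol, parents_of, max_depth=3):
    """Return all labels that reach `symbol` through at most max_depth unary rules.

    parents_of: {Y: [X, ...]} for every unary rule X -> Y.
    """
    closed = set()                    # labels already added, so cycles terminate
    frontier = [symbol]
    for _ in range(max_depth):        # one level of unary rules per pass
        next_frontier = []
        for child in frontier:
            for parent in parents_of.get(child, ()):
                if parent not in closed:
                    closed.add(parent)
                    next_frontier.append(parent)
        if not next_frontier:
            break
        frontier = next_frontier
    return closed

# Toy unary rules, including the cycle NN -> NP -> NN.
parents_of = {"NN": ["NP"], "NP": ["S", "NN"], "S": ["ROOT"]}
shallow = unary_closure("NN", parents_of, max_depth=1)
deep = unary_closure("NN", parents_of, max_depth=3)
```

The cycle between NN and NP does not cause a loop because a label is expanded only the first time it enters the closed list.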
Probabilistic CYK: Dealing with Run times
Beam Search
• Limit the number of nodes saved in each cell of the CYK dynamic programming table.
• Using beam width k, all candidates generated for a cell are kept sorted and the k best are saved for the next iteration, so list size <= k
• Experiences with 100, 200, 1000?
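Pruning one table cell to the k most probable entries can be sketched as below. The cell layout `{label: (prob, backpointer)}` and the name `prune_cell` are assumptions for this example; `heapq.nlargest` is used here instead of keeping the list sorted, which gives the same k entries.

```python
import heapq

def prune_cell(cell, k):
    """Keep only the k highest-probability entries of a CYK cell."""
    if len(cell) <= k:
        return cell
    return dict(heapq.nlargest(k, cell.items(), key=lambda item: item[1][0]))

cell = {"NP": (0.2, None), "VP": (0.5, None), "S": (0.1, None)}
pruned = prune_cell(cell, 2)
```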
Probabilistic CYK: Dealing with Run Times
Another optimization was to remove all production rules with frequency < fc
• Used fc = 1, 2…
Also limited depth when calculating the unary rules (closure) of a constituent present in our CYK table
• Extensive unary rules were found to greatly slow down our parser
• Also, long chains of unary productions have extremely low probabilities, so they are commonly pruned by beam search anyway
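The frequency cutoff amounts to a filter over the production counts. A minimal sketch, assuming counts are stored as `{(lhs, rhs_tuple): frequency}`; the name `trim_grammar` and the toy counts are made up for illustration.

```python
def trim_grammar(counts, fc):
    """Drop every production whose frequency is below the cutoff fc."""
    return {rule: n for rule, n in counts.items() if n >= fc}

counts = {
    ("NP", ("DT", "NN")): 120,
    ("NP", ("NN",)): 40,
    ("NP", ("JJ", "RB")): 1,
}
trimmed = trim_grammar(counts, fc=2)
```

Rule probabilities would then be computed from the trimmed counts. Whether the original parser renormalized after trimming is not stated in the slides.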
Probabilistic CYK: Random Sentences and Example Trees
Some random sentences from our grammar with associated probabilities.
('buy jam , cocoa and other war-rationed goodies',0.0046296296296296294)
('cartoonist garry trudeau refused to impose sanctions , including petroleum equipment , which go into semiannual payments , including watches , including three , which the federal government , the same company formed by mrs. yeargin school district would be confidential', 2.9911073159300768e-33)
('33 men selling individual copies selling securities at the central plaza hotel die', 7.4942533128815141e-08)
Probabilistic CYK: Random Sentences and Example Trees
('young people believe criticism is led by south korea', 1.3798001044090654e-11)
('the purchasing managers believe the art is the often amusing , often supercilious , even vicious chronicle of bank of the issue yen-support intervention', 7.1905882731776209e-1)
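Sentence and probability pairs like these can be produced by sampling top-down from a probabilistic grammar and multiplying the probabilities of the rules used. The sketch below is one way to do that, not the authors' generator; the grammar format `{lhs: [(rhs_tuple, prob), ...]}`, the toy grammar, and the function name are assumptions.

```python
import random

def sample(symbol, rules, rng):
    """Expand symbol at random; return (list of words, probability of the derivation).

    Symbols that have no entry in rules are treated as words.
    """
    if symbol not in rules:
        return [symbol], 1.0
    options = rules[symbol]
    rhs, prob = rng.choices(options, weights=[p for _, p in options])[0]
    words = []
    for child in rhs:
        child_words, child_prob = sample(child, rules, rng)
        words += child_words
        prob *= child_prob
    return words, prob

toy_rules = {
    "S": [(("VB", "NP"), 1.0)],
    "VB": [(("buy",), 1.0)],
    "NP": [(("jam",), 0.5), (("cocoa",), 0.5)],
}
words, prob = sample("S", toy_rules, random.Random(0))
sentence = " ".join(words)
```

Longer derivations multiply more rule probabilities together, which is why the long sampled sentences above have such small probabilities.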
S(buy)
+--VP(buy)
   +--VB(buy)
   |  +--buy
   +--NP(jam)
      +--NP(jam)-NP(goodies)
      |  +--NP(jam)-CC(and)
      |  |  +--NP(jam)-NP(cocoa)
      |  |  |  +--NP(jam)
      |  |  |  |  +--NN(jam)
      |  |  |  |     +--jam
      |  |  |  +--,(,)
      |  |  |     +--,
      |  |  +--NP(cocoa)
      |  |     +--NN(cocoa)
      |  |        +--cocoa
      |  +--CC(and)
      |     +--and
      +--NP(goodies)
         +--JJ(other)
         |  +--other
         +--NP(goodies):JJ(other)-
            +--JJ(war-rationed)
            |  +--war-rationed
            +--NNS(goodies)
               +--goodies

S:-(VP)
+--VP
   +--VP:-(VB)-NP
      +--VP:-(VB)
      |  +--VB
      |     +--buy
      +--NP
         +--NP:-(NP)-NP
            +--NP:-(NP)-CC
            |  +--NP:-(NP)-NP
            |  |  +--NP:-(NP)-,
            |  |  |  +--NP:-(NP)
            |  |  |  |  +--NP
            |  |  |  |     +--NP:-(NN)
            |  |  |  |        +--NN
            |  |  |  |           +--jam
            |  |  |  +--,
            |  |  |     +--,
            |  |  +--NP
            |  |     +--NP:-(NN)
            |  |        +--NN
            |  |           +--cocoa
            |  +--CC
            |     +--and
            +--NP
               +--NP:-(NNS)JJ-NNS
                  +--JJ
                  |  +--other
                  +--NP:-(NNS)JJ-NNS
                     +--JJ
                     |  +--war-rationed
                     +--NP:-(NNS)
                        +--NNS
                           +--goodies