Upload
ruth-riley
View
217
Download
0
Tags:
Embed Size (px)
Citation preview
LING/C SC 581: Advanced Computational Linguistics
Lecture NotesFeb 5th
Today's Topics
• A note on java• Tregex homework discussion• Treebanks and Statistical parsers• Homework– One more exercise on tregex– Install the Bikel-Collins Parser
Java and tregex on OSX
• If you're running on Mavericks (10.9), Java 7 is running by default.
• But the latest tregex (3.5.1) requires Java 8:– Unsupported major.minor version 52.0
• Solution:– Install Java 8 (JRE) from Oracle directly– It installs in /Library/Internet\ Plug-Ins/– Modify the path to java in run-tregex-gui.command as follows:
#!/bin/sh/Library/Internet\ Plug-Ins/JavaAppletPlugin.plugin/Contents/Home/bin/java -mx300m -cp `dirname $0`/stanford-tregex.jar edu.stanford.nlp.trees.tregex.gui.TregexGUI
Homework Discussion
• useful command line tool– diff <file1> <file2>
Homework Discussion• page 268 of PRSGUID1.PDF
http://www.clips.ua.ac.be/pages/mbsp-tags
Functional tag -CLF indicates a true cleft
Homework Discussion
• Wish: – every construction was marked with its own tag
• So – -CLF looks easy …
Homework Discussion
• Types:
that is sometimes WHNP
Homework Discussion
• There are no SQ-CLF and SINV-CLF …
• Gapping?
Homework Discussion
• Search conservatively…– 62: S-CLF < (NP-SBJ << /^[iI]t$/) < (VP < (@SBAR)– 41: S-CLF < (NP-SBJ << /^[iI]t$/) < (VP < (@SBAR < /WHNP/))wsj_0267.mrg-30 It was also in law school that Mr. O'Kicki and his first wife had the first of seven daughters *T*-1 .
Homework Discussion
• Search conservatively…– 62: S-CLF < (NP-SBJ << /^[iI]t$/) < (VP < (@SBAR)– 57: S-CLF < (NP-SBJ << /^[iI]t$/) < (VP < (@SBAR < /WH(ADV|NP)/))
wsj_0591.mrg-21 It is partly for this reason that the exchange last week began *-3 trading in its own stock `` basket '' product that *T*-2 allows big investors to buy or sell all 500 stocks in the Standard & Poor 's index in a single trade .
62: S-CLF < (NP-SBJ << /^[iI]t$/) < (VP < (@SBAR < /IN|(WH(ADV|NP))/))
Homework Discussion• wsj_1154.mrg-4 It isn't every day that we hear a Violetta who *T*-1 can sing
the first act's high-flying music with all the little notes perfectly pitched *-2 and neatly stitched *-2 together .
No promised temporal trace …
Homework Discussion• 41: S-CLF < (NP-SBJ << /^[iI]t$/) < (VP < (@SBAR < /WH(ADV|NP)-[0-9]+/))• 57: S-CLF < (NP-SBJ << /^[iI]t$/) < (VP < (@SBAR < /WH(ADV|NP).*-.*[0-
9]+/))wsj_0267.mrg-30 It was also in law school that Mr. O'Kicki and his first wife had the first of seven daughters *T*-1 .
Homework Discussion• 43: S-CLF < (NP-SBJ << /^[iI]t$/) < (VP < (@SBAR < /WH(ADV|NP).*-.*([0-
9]+)/#2%i << (/NP-SBJ/ < (/-NONE-/ < /\*T.*([0-9]+)/))))• 39: S-CLF < (NP-SBJ << /^[iI]t$/) < (VP < (@SBAR < /WH(ADV|NP).*-.*([0-
9]+)/#2%i << (/NP-SBJ/ < (/-NONE-/ < /\*T.*([0-9]+)/#1%i))))wsj_1655.mrg-4 Still , it was in Argentine editions that his countrymen first read his story of Pascal Duarte , a field worker who *T*-1 stabbed his mother to death and has no regrets as he awaits his end in a prison cell *T*-4
Homework Discussion
• Wh-clefts
Homework Discussion
• Wh-clefts:– 28: /SBAR-NOM/ $+ @VP << /.*-PRD/
wsj_0415.mrg-5 Who that winner will be *T*-1 is highly uncertain .
Relevance of Treebanks
• Statistical parsers typically construct syntactic phrase structure– they’re trained on Treebank corpora like the Penn
Treebank• Note: some use dependency graphs, not trees
Parsers trained on the Treebank
• Don’t recover fully-annotated trees– not trained using nodes with indices or empty (-NONE-) nodes– not trained using functional tags, e.g. –SBJ
• Therefore they don’t fully parse• Example: no SBAR node in … a movie to see
Stanford parser
Parsers trained on the Treebank
• SBAR can be forced by the presence of an overt relative pronoun, but note there is no subject gap:
Parsers trained on the Treebank
• Probabilities are estimated from frequency information of each node given surrounding context (e.g. parent node, or the word that heads the node)
• Still these systems have enormous problems with prepositional phrase (PP) attachment
• Example:(borrowed from Igor Malioutov)
– A boy with a telescope kissed Mary on the lips– Mary was kissed by a boy with a telescope on the lips
• PP with a telescope should adjoin to the noun phrase (NP) a boy• PP on the lips should adjoin to the verb phrase (VP) headed by
kiss
Active/passive sentences
• Examples using the Stanford Parser:
Both active and passivesentences are parsed incorrectly
Active/passive sentences
• Examples:
X on the lips modifies MaryX on the lips modifies telescope
Homework Exercise• Use tregex to find out how many passive sentences there are in the
Treebank WSJ section?• Report your search formula and frequency count• The passive construction (according to the Bracketing Guidelines)
– Note: by-phrase containing logical subject (LGS) is optional
Treebank Rules
• Just how many rules are there in the WSJ treebank?
• What’s the most common POS tag?• What’s the most common syntax rule?
Treebank
NN IN NNP DT -NONE- JJ NNS , . CD RB VBD VB CC TO VBZ VBN PRP VBG VBP MD POS PRP$0
20000
40000
60000
80000
100000
120000
140000
160000
180000
WSJ POS tag frequencies
POS
Freq
uenc
y
Total # of tags: 1,253,013
Treebank
NN IN NNP DT -NONE- JJ NNS , . CD RB VBD VB CC TO VBZ VBN PRP VBG VBP MD POS PRP$0.0%
2.0%
4.0%
6.0%
8.0%
10.0%
12.0%
14.0%
WSJ POS tags
POS
Perc
enta
ge
Category Frequency PercentageN 355039 28.3%V 154975 12.4%
Treebank
S->NP-SB
J+VP
PP->IN+N
P
NP-SBJ->
-NONE
NP->NP+P
P
NP->DT+
NN
S->NP-SB
J+VP+.
NP-SBJ->
PRP
VP->TO+V
P
PP-LOC->I
N+NP
NP->-NONE
NP->NN
NP->NNS
PP-CLR->I
N+NP
NP->DT+
JJ+NN
NP->NNP
VP->MD+V
P
SBAR->W
HNP+S
VP->VB+N
P
SBAR->-
NONE+S
PP-TMP->I
N+NP
ADVP->RB
NP->NNP+N
NP
NP->JJ+N
NS
NP-SBJ->
DT+NN
VP->VBD+S
BAR
NP-SBJ->
NP+PP
SBAR->I
N+S
S->-N
ONE
VP->VBZ+
VP
NP-SBJ->
NNP+NNP
0
10000
20000
30000
40000
50000
60000
Treebank Grammar Rules
Rule
Freq
uenc
y
Total # of rules: 978,873# of different rules: 31,338
Treebank
S->NP-SB
J+VP
PP->IN+N
P
NP-SBJ->
-NONE
NP->NP+P
P
NP->DT+
NN
S->NP-SB
J+VP+.
NP-SBJ->
PRP
VP->TO+V
P
PP-LOC->I
N+NP
NP->-NONE
NP->NN
NP->NNS
PP-CLR->I
N+NP
NP->DT+
JJ+NN
NP->NNP
VP->MD+V
P
SBAR->W
HNP+S
VP->VB+N
P
SBAR->-
NONE+S
PP-TMP->I
N+NP
ADVP->RB
NP->NNP+N
NP
NP->JJ+N
NS
NP-SBJ->
DT+NN
VP->VBD+S
BAR
NP-SBJ->
NP+PP
SBAR->I
N+S
S->-N
ONE
VP->VBZ+
VP
NP-SBJ->
NNP+NNP
0.0%
1.0%
2.0%
3.0%
4.0%
5.0%
6.0%
Treebank Grammar Rules
Rule
Perc
enta
ge
Treebank
PP->IN+N
P
S->NP+V
PNP->
NP->NP+P
P
NP->DT+
NN
S->NP+V
P+.
NP->PRP
ADVP->RB
NP->NN
NP->NNS
VP->TO+V
P
NP->NNP
NP->NNP+N
NP
NP->DT+
JJ+NN
SBAR->I
N+S
VP->VB+N
P
SBAR->W
HNP+S
PP->TO+N
P
VP->MD+V
P
NP->JJ+N
NS
SBAR->+
S
VP->VBD+S
BAR
NP->NP+S
BAR
VP->VBN+N
P+PP
NP->DT+
NNS
NP->JJ+N
N S->
VP->VBZ+
VP
VP->VBD+N
P
VP->VBZ+
NP0
10000
20000
30000
40000
50000
60000
70000
80000
90000
100000
Treebank Grammar Rules
Rules
Freq
uenc
y
Total # of rules: 978,873# of different rules: 17,554
Treebank
PP->IN+N
P
S->NP+V
PNP->
NP->NP+P
P
NP->DT+
NN
S->NP+V
P+.
NP->PRP
ADVP->RB
NP->NN
NP->NNS
VP->TO+V
P
NP->NNP
NP->NNP+N
NP
NP->DT+
JJ+NN
SBAR->I
N+S
VP->VB+N
P
SBAR->W
HNP+S
PP->TO+N
P
VP->MD+V
P
NP->JJ+N
NS
SBAR->+
S
VP->VBD+S
BAR
NP->NP+S
BAR
VP->VBN+N
P+PP
NP->DT+
NNS
NP->JJ+N
N S->
VP->VBZ+
VP
VP->VBD+N
P
VP->VBZ+
NP0.0%
2.0%
4.0%
6.0%
8.0%
10.0%
12.0%
Treebank Grammar Rules
Rules
Perc
enta
ge
Today’s TopicLet’s segue from Treebank search to stochastic parsers trained on the WSJ Penn Treebank
Examples:• Berkeley Parser
– http://tomato.banatao.berkeley.edu:8080/parser/parser.html
• Stanford Parser– http://nlp.stanford.edu:8080/parser/
are all trained on the Treebank.
We’ll play with Bikel’s implementation of Collins’s Parser …
Using the Treebank
• What is the grammar of the Treebank?– We can extract the phrase structure rules used, and– count the frequency of rules, and construct a
stochastic parser
Using the Treebank
• Breakthrough in parsing accuracy with lexicalized trees– think of expanding the nonterminal names to include head
information and the words that are at the leaves of the subtrees.
Bikel Collins Parser
• Java re-implementation of Collins’ parser• Paper– Daniel M. Bikel. 2004. Intricacies of Collins’ Parsing Model.
(PS) (PDF) in Computational Linguistics, 30(4), pp. 479-511.• Software– http://www.cis.upenn.edu/~dbikel/software.html#stat-pars
er (page no longer exists)
Bikel Collins
• Download and install Dan Bikel’s parser– dbp.zip (on course homepage)
• File: install.sh– Java code– but at this point I think Windows won’t work
because of the shell script (.sh)– maybe after files are extracted?
Bikel Collins
• Download and install the POS tagger MXPOST
parser doesn’t actually need a separate tagger…
Bikel Collins
• Training the parser with the WSJ PTB• See guide – userguide/guide.pdf
directory: TREEBANK_3/parsed/mrg/wsjchapters 02-21: create one single .mrg fileevents: wsj-02-21.obj.gz
Bikel Collins
• Settings:
Bikel Collins• Parsing
– Command
– Input file format (sentences)
Bikel Collins
• Verify the trainer and parser work on your machine
Bikel Collins
• File: bin/parse is a shell script that sets up program parameters and calls java
Bikel Collins
Bikel Collins
• File: bin/train is another shell script
Bikel Collins
• Relevant WSJ PTB files