38
Using Treebanks tgrep2 Lecture 2: 07/12/2011

Using Treebanks - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/class2-tgrep.pdf•Using the web/search engine directly is inadequate for any but the most basic

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Using Treebanks - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/class2-tgrep.pdf•Using the web/search engine directly is inadequate for any but the most basic

Using Treebanks

tgrep2

Lecture 2: 07/12/2011

Page 2: Using Treebanks - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/class2-tgrep.pdf•Using the web/search engine directly is inadequate for any but the most basic

Using Corpora

• For discovery

• For evaluation of theories

• For identifying tendencies

– distribution of a class of words

– distribution of structural configurations

– frequency of a certain distribution

Page 3: Using Treebanks - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/class2-tgrep.pdf•Using the web/search engine directly is inadequate for any but the most basic

Why Treebanks

• Raw corpora are not enough for most linguistic purposes.

• Let’s start with the rawest of them all: the web, which I’ll call `Google’

- Convenient

- Potentially inexhaustible

- Varied and free-form

Page 4: Using Treebanks - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/class2-tgrep.pdf•Using the web/search engine directly is inadequate for any but the most basic

Problems with `Google’

• Quality control – hard to identify the identity of the author, making it

difficult to keep track of variation the text could be computer generated

• What does Google count?– google counts are notoriously unreliable and change

from minute to minute– problem of repeated elements– no clear estimate of sample size, so difficult to go

beyond order of magnitude estimations of frequency

Page 5: Using Treebanks - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/class2-tgrep.pdf•Using the web/search engine directly is inadequate for any but the most basic

Sentences

• Sentences are an important unit of linguistic organization.

• They are not an important unit of organization for most search engines.

Consequently not straightforward to restrict searches to remain within a sentence, a task that is crucial for linguistic purposes.

Page 6: Using Treebanks - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/class2-tgrep.pdf•Using the web/search engine directly is inadequate for any but the most basic

Selecting Texts

• Using the web/search engine directly is inadequate for any but the most basic linguistic purposes.

• The next step forward is to judiciously assemble a set of texts (possibly from the web) and use an appropriate search language

– Regular Expressions based systems are fast, easily available, and easy to use.

Page 7: Using Treebanks - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/class2-tgrep.pdf•Using the web/search engine directly is inadequate for any but the most basic

The need for annotation

Even after the creation of a corpus (a set of texts), there are still many basic linguistic investigations that cannot be conducted.

• generalizing searches – e.g. when we want to examine all sequences of Det and N

• identifying a subset of cases when there is ambiguity in part-of-speech e.g. `to’

Page 8: Using Treebanks - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/class2-tgrep.pdf•Using the web/search engine directly is inadequate for any but the most basic

Step 1: POS tags

• Part of Speech Tagging is the most basic kind of annotation.

– POS-tagging makes corpora much more linguistically useful.

– POS-tagging can often be done automatically with high reliability, allowing us to use large texts for linguistic purposes.

Page 9: Using Treebanks - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/class2-tgrep.pdf•Using the web/search engine directly is inadequate for any but the most basic

Step 2: Beyond POS tags

(1) Ann likes Bill and Tim likes Nina.

(2) Ann likes Bill and Tim, who are her mentors.

Assume you want to search for coordinated noun phrases. You want to get (2) but not (1).

• But a search for the POS sequence `Noun DetNoun’ will catch both.

• We need structural information.

Page 10: Using Treebanks - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/class2-tgrep.pdf•Using the web/search engine directly is inadequate for any but the most basic

Step 3: Structural Information

• We need structural information for a corpus to be fully linguistically useful.

• We also need structural information if we want to train parsers off the corpus.

• The nature of this structural information is quite underdetermined.

Page 11: Using Treebanks - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/class2-tgrep.pdf•Using the web/search engine directly is inadequate for any but the most basic

Structural and Other Information

• One could include structural information, leading to a set of syntactic trees of the familiar sort: hence the term `treebank’

• But the information can be quite different and there is no commitment that the formal objects involved are `trees’

• Other alternatives: theta-roles, semantic argument information (PropBank) etc.

Page 12: Using Treebanks - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/class2-tgrep.pdf•Using the web/search engine directly is inadequate for any but the most basic

Searching a Treebank

• For linguistic purposes, we need a way to extract linguistically interesting patterns.

• What counts as a `linguistically interesting pattern’ will vary greatly depending upon your theoretical interests and the nature of the treebank.

• Here we will assume that the formal objects are trees and discuss a general and powerful way to search for trees with certain properties.

Page 13: Using Treebanks - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/class2-tgrep.pdf•Using the web/search engine directly is inadequate for any but the most basic

Verbs Accounts

• Class material is in:

/data/home/verbs/shared/LSA7800_076

• To reduce typing, set up a link:

ln –s /data/home/verbs/shared/LSA7800_076 lsa

Page 14: Using Treebanks - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/class2-tgrep.pdf•Using the web/search engine directly is inadequate for any but the most basic

The Unix you’ll need

• ls –llists files in current directory

• cd DIRECTORYNAMEgo to designated directory

• cat FILENAMEread file and display on screen, or

• > FILENAMEdirect output into designated file

• more FILENAMEscroll designated file

• wccount number of words/characters in input, provided by

• |pipe output of one program into anothercat FILENAME | wc

Page 15: Using Treebanks - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/class2-tgrep.pdf•Using the web/search engine directly is inadequate for any but the most basic

Regular Expressions and grep

• Regular Expressions are a powerful and fast way to express search patterns

• grep– program for using Regular Expression search

patterns

– Syntax:

grep RegExPattern FILENAME

Page 16: Using Treebanks - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/class2-tgrep.pdf•Using the web/search engine directly is inadequate for any but the most basic

Very Basic RegEx

• words are RegExs that match themselves as well as superstrings of themselves

• If A and B are RegEx, then

1. AB

2. A|B

3. A* (also A? and A+)

are also RegEx

Page 17: Using Treebanks - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/class2-tgrep.pdf•Using the web/search engine directly is inadequate for any but the most basic

Regular expression summary

. Matches any character (aka wildcard)^abc Matches some pattern abc at the start of a stringabc$ Matches some pattern abc at the end of a string[abc] Matches one of a range of characters[A-Z0-9] Matches one of a range of charactersed|ing|s Matches one of the specified strings (aka disjunction)* Zero or more of previous item (aka closure)+ One or more of previous item (aka closure)? Zero or one of the previous item (aka optionality){n} Exactly n repeats{n,} At least n repeats{,n} At most n repeats{m,n} At least m and at most n repeatsa(b|c)+ Parentheses indicate the scope of the operators

Page 18: Using Treebanks - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/class2-tgrep.pdf•Using the web/search engine directly is inadequate for any but the most basic

TGrep2

• A general way of writing Regular Expressions over trees

• tgrep2, grep for trees, written by Douglas Rohde, builds upon tgrep

• To use TGrep2, a special TGrep2 corpus file needs to be created – wsj.t2c in lsa directory, has 49,209 sentences.

Page 19: Using Treebanks - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/class2-tgrep.pdf•Using the web/search engine directly is inadequate for any but the most basic

TGrep2 Syntax 1

• Basic Pattern Syntax

– Regular expression syntax can be used to select words or node labels:

– Ex: /ˆNP/ matches any node label that begins with NP such as NP-SBJ

• Command syntax:

lsa/tgrep –c lsa/wsj.t2c PATTERN

Page 20: Using Treebanks - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/class2-tgrep.pdf•Using the web/search engine directly is inadequate for any but the most basic

At the beginning

• >lsa/tgrep -c lsa/wsj.t2c 'S << Vinken' | more

(S (NP-SBJ (NP (NNP Pierre) (NNP Vinken)) (, ,) (ADJP (NP (CD 61) (NNS years)) (JJ old)) (, ,)) (VP (MD will) (VP (VB join) (NP (DT the) (NN board)) (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director))) (NP-TMP (NNP Nov.) (CD 29)))) (. .))

(S (NP-SBJ (NNP Mr.) (NNP Vinken)) (VP (VBZ is) (NP-PRD (NP (NN chairman)) (PP (IN of) (NP (NP (NNP Elsevier) (NNP N.V.)) (, ,) (NP (DT the) (NNP Dutch) (VBG publishing) (NN group)))))) (. .))

Page 21: Using Treebanks - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/class2-tgrep.pdf•Using the web/search engine directly is inadequate for any but the most basic

Tree Relationships

1. Immediate domination (<)

2. Domination (<<)

3. Sisterhood ($)

4. Immediate Precedence (.)

5. Precedence (..)

6. First/nth/last child

7. First/Last descendant

Page 22: Using Treebanks - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/class2-tgrep.pdf•Using the web/search engine directly is inadequate for any but the most basic

Immediate Domination

A < B

A is the parent of B, retrieves subtree rooted at A.

A > B

A is the daughter of B, retrieves subtree rooted at A.

NP < PP

Matches any NP that immediately dominates a PP.

Page 23: Using Treebanks - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/class2-tgrep.pdf•Using the web/search engine directly is inadequate for any but the most basic

Domination

A << B

A dominates B, retrieves subtree rooted at A.

A >> B

A is dominated by B, retrieves subtree rooted at A.

NP << PP

Matches any NP that dominates a PP.

Page 24: Using Treebanks - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/class2-tgrep.pdf•Using the web/search engine directly is inadequate for any but the most basic

Combining Search Patterns 1

• Default interpretation is &

VP < NP < PP- a VP that immediately dominates an NP and a PP - Ex: (You can [see comets with a telescope])

VP < (NP < PP)- a VP that immediately dominates [an NP that

dominates a PP]

- Ex: (I [praised [the students [with good grades]]])

Page 25: Using Treebanks - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/class2-tgrep.pdf•Using the web/search engine directly is inadequate for any but the most basic

Combining Search Patterns 2

• For optionality, use |

NP < PP | << AP

- an NP that

immediately dominates a PP

OR

dominates an AP

Page 26: Using Treebanks - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/class2-tgrep.pdf•Using the web/search engine directly is inadequate for any but the most basic

Sisterhood and Precedence 1

• A $ B- A is a sister of B

• A . B- A immediately precedes B

• A .. B- A precedes B

• A , B- A immediately follows B

• A ,, B- A follows B

Page 27: Using Treebanks - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/class2-tgrep.pdf•Using the web/search engine directly is inadequate for any but the most basic

Sisterhood and Precedence 2

Combining the two:• A $. B

– A is a sister of B and immediately precedes B

• A $.. B-- A is a sister of B and precedes B

• A $, B-- A is a sister of B and immediately follows B

• A $,, B-- A is a sister of B and follows B

Page 28: Using Treebanks - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/class2-tgrep.pdf•Using the web/search engine directly is inadequate for any but the most basic

Which child?

We can pick out a particular child: from left to right (start at 1):

• A <N B- B is the nth child of A

• A >N B- A is the nth child of B

• A <1 B (also used: A <, B)- B is the first child of A

• A >1 B (also used: A >, B)- A is the first child of B

Page 29: Using Treebanks - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/class2-tgrep.pdf•Using the web/search engine directly is inadequate for any but the most basic

Which child?

We can pick out a particular child:from right to left (start at -1):• A <-N B

- B is the nth child of A from the right

• A >-N B- A is the nth child of B from the right

• A <-1 B (also used: A <‘ B)- B is the last child of A

• A >-1 B (also used: A >‘ B)- A is the last child of B

Page 30: Using Treebanks - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/class2-tgrep.pdf•Using the web/search engine directly is inadequate for any but the most basic

Which descendant?

• A <<, B- B is *a* left-most descendant of an A

• A >>, B- A is *a* left-most descendant of a B

• A <<‘ B- B is *a* right-most descendant of an A

• A >>’ B- A is *a* right-most descendant of a B

Page 31: Using Treebanks - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/class2-tgrep.pdf•Using the web/search engine directly is inadequate for any but the most basic

Uniqueness

• A <: B- B is the only child of A

• A >: B- A is the only child of B

• A <<: B- there is a single path of descent from A and B is on it

• A >>: B- there is a single path of descent from B and A is on it

Page 32: Using Treebanks - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/class2-tgrep.pdf•Using the web/search engine directly is inadequate for any but the most basic

Combining Links

NP << (PP . VP)

NP <‘ (PP <, (IN < on))

S < (A < B) < C

S < ((A < B) < C)

S < (A < B < C)

Page 33: Using Treebanks - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/class2-tgrep.pdf•Using the web/search engine directly is inadequate for any but the most basic

! Negating Links !

• ! before any link relationship negates it

A !.. B

- A does not precede B

A [< B | . C] [< D | . E]

A [< B | ![. C !, F]] | ![< D !.. E]

Page 34: Using Treebanks - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/class2-tgrep.pdf•Using the web/search engine directly is inadequate for any but the most basic

= Naming Node labels =

• Any node label can be given a name using =

S=foo << (VP .. (PP >> =foo))

NP < (PP=pp < (IN < on)) | < (NP < =pp)

Page 35: Using Treebanks - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/class2-tgrep.pdf•Using the web/search engine directly is inadequate for any but the most basic

: Segmented Patterns :

Sometimes it is useful to break patterns up into segments

S < (NP=n1 .. (VP=v < PP)) < (NP=n2 !.. VP)

S < NP=n1 < NP=n2 : =n1 .. VP=v : =v < PP : =n2 !.. VP

S << (VP=v < NP) : =v < /ˆPP/

S << (VP=v < NP < /ˆPP/)

Page 36: Using Treebanks - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/class2-tgrep.pdf•Using the web/search engine directly is inadequate for any but the most basic

@ Macros @

• Patterns that are likely to re-used can be constructed using macros

@ NP /ˆNP/;

@ NN /ˆNN/;

@ CNP @NP=cnp [!< @NP | < @NN] !$.. @NN;

#CNP – core NP

macros make subsequent modification easy!

Page 37: Using Treebanks - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/class2-tgrep.pdf•Using the web/search engine directly is inadequate for any but the most basic

Heavy NP Shift

>tgrep -c wsj.t2c 'VP <-1 (NP <: *)' | more

>tgrep -c wsj.t2c 'VP <-1 (NP <: *) < PP’

>tgrep -c wsj.t2c 'VP <2 (PP $. NP=foo) <-1 (=foo <: *)’

>tgrep -c wsj.t2c 'VP <-1 (NP <: *) < PP’

>tgrep -c wsj.t2c 'VP <-1 (NP <: PRP)’

>tgrep -c wsj.t2c 'VP <1 /^V*/ <2 PP <-1 (NP <: PRP)'

Page 38: Using Treebanks - University of Washingtonfaculty.washington.edu/fxia/lsa2011/slides/class2-tgrep.pdf•Using the web/search engine directly is inadequate for any but the most basic

Adjective Ordering

>tgrep -c wsj.t2c '/^NP/ <2 /^JJ/ <3 /^JJ/’

>tgrep -c wsj.t2c '/^NP/ <2 /^JJ/ <3 /^JJ/ <4 /^JJ/ <5 /^JJ/’

(NP (DT the) (JJ first) (JJ negative) (JJ compound) (JJ annual) (NN growth) (NN rate))