
Fall 2001 EE669: Natural Language Processing 1

Lecture 13: Probabilistic CFGs (Chapter 11 of Manning and Schutze)

Wen-Hsiang Lu (盧文祥 )

Department of Computer Science and Information Engineering,

National Cheng Kung University

2004/12/15

(Slides from Dr. Mary P. Harper,

http://min.ecn.purdue.edu/~ee669/)

Page 2: Fall 2001 EE669: Natural Language Processing 1 Lecture 13: Probabilistic CFGs (Chapter 11 of Manning and Schutze) Wen-Hsiang Lu ( 盧文祥 ) Department of Computer

Fall 2001 EE669: Natural Language Processing 2

Motivation

• N-gram models and HMM tagging only allow us to process sentences linearly; however, the structural analysis of even simple sentences requires a nonlinear model that reflects the hierarchical structure of sentences rather than the linear order of words.

• Context-free grammars can be generalized to include probabilistic information by adding rule-use statistics.

• Probabilistic Context-Free Grammars (PCFGs) are the simplest and most natural probabilistic model for tree structures, and the algorithms for them are closely related to those for HMMs.

• Note, however, that there are other ways of building probabilistic models of syntactic structure (see Chapter 12).


Fall 2001 EE669: Natural Language Processing 3

CFG Parse Example

• #1 S → NP VP

• #2 VP → V NP PP

• #3 VP → V NP

• #4 NP → N

• #5 NP → N PP

• #6 PP → PREP N

• #7 N → a_dog

• #8 N → a_cat

• #9 N → a_telescope

• #10 V → saw

• #11 PREP → with

Example sentence: a_dog saw a_cat with a_telescope

[Figure: two parse trees for the sentence, one attaching the PP "with a_telescope" inside the VP (rule #2), the other attaching it inside the object NP (rules #3 and #5).]


Fall 2001 EE669: Natural Language Processing 4

Dependency Parse Example

• Same example, in a dependency representation.

[Figure: two dependency trees for "a_dog saw a_cat with a_telescope". In both, a_dog is the subject (Sb) of saw and a_cat its object (Obj); "with a_telescope" attaches either to a_cat as an attribute (Attr) or to saw as an adverbial of tool (Adv_Tool).]


Fall 2001 EE669: Natural Language Processing 5

PCFG Notation

• G is a PCFG.
• L is the language generated by G.
• {N^1, …, N^n} is the set of nonterminals of G; there are n nonterminals.
• {w^1, …, w^V} is the set of terminals; there are V terminals.
• w_1 … w_m is a sequence of words constituting a sentence to be parsed, also denoted w_{1m}.
• N^j_{pq} denotes nonterminal N^j spanning w_p through w_q; in other words, N^j ⇒* w_p w_{p+1} … w_q.


Fall 2001 EE669: Natural Language Processing 6

Formal Definition of a PCFG

• A PCFG consists of:
– A set of terminals, {w^k}, k = 1, …, V
– A set of nonterminals, {N^i}, i = 1, …, n
– A designated start symbol N^1
– A set of rules, {N^i → ζ^j} (where ζ^j is a sequence of terminals and nonterminals)
– A corresponding set of rule probabilities such that: ∀i Σ_j P(N^i → ζ^j) = 1


Fall 2001 EE669: Natural Language Processing 7

Probability of a Derivation Tree and a String

• The probability of a derivation (i.e., parse) tree:

P(T) = Π_{i=1..k} p(r(i))

where r(1), …, r(k) are the rules of the CFG used to generate the sentence w_{1m} of which T is a parse.

• The probability of a sentence (according to grammar G) is given by:

P(w_{1m}) = Σ_t P(w_{1m}, t) = Σ_{t: yield(t)=w_{1m}} P(t)

where t is a parse tree of the sentence. Dynamic programming is needed to make this computation efficient!
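As a concrete illustration of the product-of-rules definition, here is a minimal Python sketch (not from the textbook) that computes P(T) for a small parse tree. The rule probabilities are those of the toy grammar used later in this lecture; the tree encoding and helper names are assumptions made purely for illustration.

from functools import reduce

# P(lhs -> rhs) for a few rules of the toy grammar used later in the lecture
rule_prob = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("N",)): 0.7,
    ("VP", ("V", "NP")): 0.6,
    ("N", ("a_dog",)): 0.3,
    ("N", ("a_cat",)): 0.5,
    ("V", ("saw",)): 1.0,
}

def tree_prob(tree):
    """tree = (label, [children]); a leaf is a plain string."""
    if isinstance(tree, str):          # terminal: contributes no rule
        return 1.0
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rule_prob[(label, rhs)]        # probability of the rule used at this node
    return reduce(lambda acc, c: acc * tree_prob(c), children, p)

t = ("S", [("NP", [("N", ["a_dog"])]),
           ("VP", [("V", ["saw"]), ("NP", [("N", ["a_cat"])])])])
print(tree_prob(t))   # 1.0 * 0.7 * 0.3 * 0.6 * 1.0 * 0.7 * 0.5 = 0.0441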


Fall 2001 EE669: Natural Language Processing 8

Example: Probability of a Derivation Tree


Fall 2001 EE669: Natural Language Processing 9

Example: Probability of a Derivation Tree


Fall 2001 EE669: Natural Language Processing 10

The PCFG

• Below is a probabilistic CFG (PCFG) with probabilities derived from analyzing a parsed version of Allen's corpus.

Rule                Count for LHS    Count for Rule    PROB
1. S → NP VP             300              300           1
2. VP → V                300              116          .386
3. VP → V NP             300              118          .393
4. VP → V NP PP          300               66          .22
5. NP → NP PP           1032              241          .23
6. NP → N N             1032               92          .09
7. NP → N               1032              141          .14
8. NP → ART N           1032              558          .54
9. PP → P NP             307              307           1


Fall 2001 EE669: Natural Language Processing 11

Assumptions of the PCFG Model

• Place Invariance: The probability of a subtree does not depend on where in the string the words it dominates are located (like time invariance in an HMM):

∀k P(N^j_{k(k+c)} → ζ) is the same

• Context Free: The probability of a subtree does not depend on words not dominated by that subtree:

P(N^j_{kl} → ζ | anything outside k through l) = P(N^j_{kl} → ζ)

• Ancestor Free: The probability of a subtree does not depend on nodes in the derivation outside the subtree:

P(N^j_{kl} → ζ | any ancestor nodes outside N^j_{kl}) = P(N^j_{kl} → ζ)


Fall 2001 EE669: Natural Language Processing 12

Probability of a Rule

• Rule r: N^j → ζ, where ζ ∈ (N ∪ W)^+.

• Let R_{N^j} be the set of all rules that have nonterminal N^j on the left-hand side.

• Then define a probability distribution on R_{N^j}:

Σ_{r ∈ R_{N^j}} p(r) = 1,   0 ≤ p(r) ≤ 1

• Another point of view:

p(ζ | N^j) = p(r), where r is N^j → ζ


Fall 2001 EE669: Natural Language Processing 13

Estimating Probability of a Rule

• MLE from a treebank produced using a CFG grammar.

• Let r: N^j → ζ; then
– p(r) = c(r) / c(N^j)
– c(r): how many times r appears in the treebank
– c(N^j) = Σ_γ c(N^j → γ)
– so p(r) = c(N^j → ζ) / Σ_γ c(N^j → γ)
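A minimal sketch of this MLE computation, assuming a toy one-tree "treebank" and a (label, [children]) tree encoding invented for illustration:

from collections import Counter

def rules_of(tree):
    """Yield (lhs, rhs) pairs for every internal node of a (label, [children]) tree."""
    label, children = tree
    yield label, tuple(c if isinstance(c, str) else c[0] for c in children)
    for c in children:
        if not isinstance(c, str):
            yield from rules_of(c)

# A one-tree "treebank", invented for illustration
treebank = [
    ("S", [("NP", [("N", ["a_dog"])]),
           ("VP", [("V", ["saw"]), ("NP", [("N", ["a_cat"])])])]),
]

rule_counts = Counter(r for t in treebank for r in rules_of(t))
lhs_counts = Counter()
for (lhs, _), n in rule_counts.items():
    lhs_counts[lhs] += n

# p(N^j -> zeta) = c(N^j -> zeta) / sum_gamma c(N^j -> gamma)
rule_prob = {(lhs, rhs): n / lhs_counts[lhs]
             for (lhs, rhs), n in rule_counts.items()}
print(rule_prob[("NP", ("N",))])     # 2/2 = 1.0 in this tiny treebank
print(rule_prob[("N", ("a_dog",))])  # 1/2 = 0.5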


Fall 2001 EE669: Natural Language Processing 14

Example PCFG

Rule                Count for LHS    Count for Rule    PROB
1. S → NP VP             300              300           1
2. VP → V                300              116          .386
3. VP → V NP             300              118          .393
4. VP → V NP PP          300               66          .22
5. NP → NP PP           1032              241          .23
6. NP → N N             1032               92          .09
7. NP → N               1032              141          .14
8. NP → ART N           1032              558          .54
9. PP → P NP             307              307           1


Fall 2001 EE669: Natural Language Processing 15

Some Features of PCFGs

• A PCFG gives some idea of the plausibility of different parses; however, the probabilities are based on structural factors and not lexical ones.
• PCFGs are good for grammar induction.
• PCFGs are robust.
• PCFGs give a probabilistic language model for English.
• One might expect the predictive power of a PCFG to be greater than that of an HMM, though in practice it is worse.
• PCFGs are not good models alone, but they can be combined with a trigram model.
• PCFGs have certain biases which may not be appropriate; smaller trees are preferred. (Why?)
• Estimating PCFG parameters from a corpus yields a proper probability distribution (Chi and Geman, 1998).


Fall 2001 EE669: Natural Language Processing 16

Recall the Questions for HMMs

• Given a model μ = (A, B, π), how do we efficiently compute how likely a certain observation sequence is, that is, P(O|μ)?

• Given the observation sequence O and a model μ, how do we choose a state sequence (X_1, …, X_{T+1}) that best explains the observations?

• Given an observation sequence O, and a space of possible models found by varying the model parameters μ = (A, B, π), how do we find the model that best explains the observed data?


Fall 2001 EE669: Natural Language Processing 17

Questions for PCFGs

• There are also three basic questions we wish to answer for PCFGs:
– What is the probability of a sentence w_{1m} according to a grammar G: P(w_{1m}|G)?
– What is the most likely parse for a sentence: argmax_t P(t|w_{1m},G)?
– How can we choose rule probabilities for the grammar G that maximize the probability of a sentence: argmax_G P(w_{1m}|G)?


Fall 2001 EE669: Natural Language Processing 18

Restriction

• For this presentation, we only consider the case of Chomsky Normal Form (CNF) grammars, which only have unary and binary rules of the form:
• N^i → N^j N^k // two nonterminals on the RHS
• N^i → w^j // one terminal on the RHS

• The parameters of a PCFG in Chomsky Normal Form are:
• P(N^j → N^r N^s | G) // an n^3 matrix of parameters
• P(N^j → w^k | G) // n×V parameters
(n is the number of nonterminals and V is the number of terminals)

• For j = 1, …, n:  Σ_{r,s} P(N^j → N^r N^s) + Σ_k P(N^j → w^k) = 1

• Any CFG can be represented by a weakly equivalent grammar in CNF.
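As a concrete illustration of this parameterization, here is a minimal sketch that stores a CNF PCFG as two tables and checks the normalization constraint above; the toy grammar and all of its numbers are invented for illustration (the lecture's example grammar is not in CNF).

# A CNF PCFG stored as two tables: binary[A][(B, C)] = P(A -> B C) and
# lexical[A][w] = P(A -> w). All rules and numbers are invented for illustration.
binary = {
    "S":  {("NP", "VP"): 1.0},
    "NP": {("DET", "N"): 0.6, ("N", "PP"): 0.4},
    "VP": {("V", "NP"): 1.0},
    "PP": {("P", "NP"): 1.0},
}
lexical = {
    "DET": {"a": 0.5, "the": 0.5},
    "N":   {"dog": 0.4, "cat": 0.4, "telescope": 0.2},
    "V":   {"saw": 1.0},
    "P":   {"with": 1.0},
}

def is_normalized(nt, tol=1e-9):
    """Check sum_{r,s} P(nt -> N^r N^s) + sum_k P(nt -> w^k) = 1."""
    total = sum(binary.get(nt, {}).values()) + sum(lexical.get(nt, {}).values())
    return abs(total - 1.0) < tol

for nt in sorted(set(binary) | set(lexical)):
    print(nt, is_normalized(nt))   # every nonterminal should print True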


Fall 2001 EE669: Natural Language Processing 19

HMMs and PCFGs

• An HMM is able to do its calculations efficiently using forward and backward probabilities:
– Forward: α_i(t) = P(w_{1(t-1)}, X_t = i)
– Backward: β_i(t) = P(w_{tT} | X_t = i)

• The forward probability corresponds in a parse tree to everything above and including a certain node (outside), while the backward probability corresponds to the probability of everything below a certain node (inside).

• For PCFGs, the outside (α_j) and inside (β_j) probabilities are defined as:

α_j(p,q) = P(w_{1(p-1)}, N^j_{pq}, w_{(q+1)m} | G)
β_j(p,q) = P(w_{pq} | N^j_{pq}, G)


Fall 2001 EE669: Natural Language Processing 20

Inside and Outside Probabilities of PCFGs

[Figure: a parse tree rooted in N^1 spanning w_1 … w_m, with a nonterminal N^j dominating the span w_p … w_q. The inside probability covers everything below N^j (the words w_p … w_q); the outside probability covers everything outside that span (w_1 … w_{p-1} and w_{q+1} … w_m).]


Fall 2001 EE669: Natural Language Processing 21

Inside and Outside Probabilities

• The inside probability β_j(p,q) is the total probability of generating w_{pq} starting off with nonterminal N^j.

• The outside probability α_j(p,q) is the total probability of beginning with nonterminal N^1 and generating N^j_{pq} and all words outside of w_{pq}.


Fall 2001 EE669: Natural Language Processing 22

From HMMs to Probabilistic Regular Grammars (PRG)

• A PRG has start state N^1 and rules of the form:
– N^i → w^j N^k
– N^i → w^j

• This is like an HMM, except that for an HMM:

Σ_{w_{1n}} P(w_{1n}) = 1 for every n

whereas, for a PCFG:

Σ_{w ∈ L} P(w) = 1

where L is the language generated by the grammar.

• PRGs are related to HMMs in that a PRG is an HMM to which a start state and a finish (or sink) state are added.


Fall 2001 EE669: Natural Language Processing 23

Inside Probability

• β_j(p,q) = P(N^j ⇒* w_{pq})

[Figure: nonterminal N^j dominating the words w_p … w_q.]


Fall 2001 EE669: Natural Language Processing 24

The Probability of a String: Using Inside Probabilities

• We can calculate the probability of a string using the inside algorithm, a dynamic programming algorithm based on the inside probabilities:

P(w_{1m} | G) = P(N^1 ⇒* w_{1m} | G) = P(w_{1m} | N^1_{1m}, G) = β_1(1,m)

• Base case (spans of a single word): for all j,

β_j(k,k) = P(w_k | N^j_{kk}, G) = P(N^j → w_k | G)


Fall 2001 EE669: Natural Language Processing 25

Induction: Divide the String w_{pq}

[Figure: N^j dominating w_p … w_q is split by a rule N^j → N^r N^s, with N^r dominating w_p … w_d and N^s dominating w_{d+1} … w_q.]


Fall 2001 EE669: Natural Language Processing 26

Induction step: ∀j, 1 ≤ p < q ≤ m,

β_j(p,q) = P(w_{pq} | N^j_{pq}, G)
= Σ_{r,s} Σ_{d=p}^{q-1} P(w_{pd}, N^r_{pd}, w_{(d+1)q}, N^s_{(d+1)q} | N^j_{pq}, G)
= Σ_{r,s} Σ_{d=p}^{q-1} P(N^r_{pd}, N^s_{(d+1)q} | N^j_{pq}, G) P(w_{pd} | N^r_{pd}, G) P(w_{(d+1)q} | N^s_{(d+1)q}, G)
= Σ_{r,s} Σ_{d=p}^{q-1} P(N^j → N^r N^s) β_r(p,d) β_s(d+1,q)

(the third line uses the independence assumptions of the PCFG model)

[Figure: as on the previous slide, N^j → N^r N^s with N^r spanning w_p … w_d and N^s spanning w_{d+1} … w_q.]


Fall 2001 EE669: Natural Language Processing 27

Example PCFG

• #1 S → NP VP 1.0

• #2 VP → V NP PP 0.4

• #3 VP → V NP 0.6

• #4 NP → N 0.7

• #5 NP → N PP 0.3

• #6 PP → PREP N 1.0

• #7 N → a_dog 0.3

• #8 N → a_cat 0.5

• #9 N → a_telescope 0.2

• #10 V → saw 1.0

• #11 PREP → with 1.0

[Figure: the two parse trees for "a_dog saw a_cat with a_telescope", annotated with the rule probabilities above; one attaches the PP to the VP (rule #2), the other attaches it to the object NP (rules #3 and #5).]

P(a_dog saw a_cat with a_telescope)
= 1 × .7 × .3 × .4 × 1 × .7 × .5 × 1 × 1 × .2   (PP attached to the VP)
+ 1 × .7 × .3 × .6 × 1 × .3 × .5 × 1 × 1 × .2   (PP attached to the object NP)
= .00588 + .00378 = .00966


Fall 2001 EE669: Natural Language Processing 28

Computing Inside Probability

• a_dog saw a_cat with a_telescope
      1     2     3     4       5

• Create an m × m table (m is the length of the string).

• Initialize the diagonal, using the lexical rules (e.g., N → a_dog).

• Recursively fill in the remaining cells, working from the diagonal towards the upper right corner.

from/to  1              2      3              4        5
1        N .3, NP .21          S .0441                 S .00966
2                       V 1    VP .21                  VP .046
3                              N .5, NP .35            NP .03
4                                             PREP 1   PP .2
5                                                      N .2
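The chart above can be reproduced with a short brute-force inside calculator. The sketch below is an illustration, not the textbook's algorithm: it enumerates all ways each rule can split a span and sums the resulting probabilities, so it handles the non-CNF rules of the example grammar directly. All names are chosen for this sketch only.

from functools import lru_cache
from math import prod

# (lhs, rhs, probability); symbols that never appear as an lhs are terminals
RULES = [
    ("S", ("NP", "VP"), 1.0),
    ("VP", ("V", "NP", "PP"), 0.4),
    ("VP", ("V", "NP"), 0.6),
    ("NP", ("N",), 0.7),
    ("NP", ("N", "PP"), 0.3),
    ("PP", ("PREP", "N"), 1.0),
    ("N", ("a_dog",), 0.3),
    ("N", ("a_cat",), 0.5),
    ("N", ("a_telescope",), 0.2),
    ("V", ("saw",), 1.0),
    ("PREP", ("with",), 1.0),
]
NONTERMINALS = {lhs for lhs, _, _ in RULES}
SENTENCE = ("a_dog", "saw", "a_cat", "with", "a_telescope")

def splits(p, q, k):
    """All ways to cut positions p..q (inclusive, 1-based) into k contiguous spans."""
    if k == 1:
        yield [(p, q)]
        return
    for d in range(p, q - k + 2):
        for rest in splits(d + 1, q, k - 1):
            yield [(p, d)] + rest

@lru_cache(maxsize=None)
def inside(sym, p, q):
    """beta_sym(p, q): total probability that sym derives words p..q."""
    if sym not in NONTERMINALS:                      # terminal symbol
        return 1.0 if p == q and SENTENCE[p - 1] == sym else 0.0
    total = 0.0
    for lhs, rhs, prob in RULES:
        if lhs != sym:
            continue
        for spans in splits(p, q, len(rhs)):
            total += prob * prod(inside(child, i, j)
                                 for child, (i, j) in zip(rhs, spans))
    return total

print(round(inside("NP", 1, 1), 5))                  # 0.21
print(round(inside("VP", 2, 5), 5))                  # 0.046
print(round(inside("S", 1, len(SENTENCE)), 5))       # 0.00966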


Fall 2001 EE669: Natural Language Processing 29

The Probability of a String: Using Outside Probabilities

• We can also calculate the probability of a string using the outside algorithm, based on the outside probabilities, for any k, 1 ≤ k ≤ m:

P(w_{1m} | G) = Σ_j P(w_{1(k-1)}, w_k, w_{(k+1)m}, N^j_{kk} | G)
= Σ_j P(w_{1(k-1)}, N^j_{kk}, w_{(k+1)m} | G) P(w_k | w_{1(k-1)}, N^j_{kk}, w_{(k+1)m}, G)
= Σ_j α_j(k,k) P(N^j → w_k)

• Outside probabilities are calculated top down and require reference to inside probabilities, which must be calculated first.


Fall 2001 EE669: Natural Language Processing 30

Outside Algorithm

• Base case: α_1(1,m) = 1 and α_j(1,m) = 0 for j ≠ 1 (the root spanning the whole string must be the start symbol N^1, and there is nothing outside it).

• Inductive case: two cases, depicted on the next two slides, depending on whether N^j_{pq} is the left or the right daughter of its parent. Sum over both possibilities, but restrict the first sum so that g ≠ j to avoid double counting for rules of the form N^f → N^j N^j.


Fall 2001 EE669: Natural Language Processing 31

Case 1: left previous derivation step

[Figure: the root N^1 spans w_1 … w_m; a parent node N^f_{pe} expands as N^j_{pq} N^g_{(q+1)e}, so N^j_{pq} is the left daughter and its sibling N^g spans w_{q+1} … w_e.]


Fall 2001 EE669: Natural Language Processing 32

Case 2: right previous derivation step

[Figure: the root N^1 spans w_1 … w_m; a parent node N^f_{eq} expands as N^g_{e(p-1)} N^j_{pq}, so N^j_{pq} is the right daughter and its sibling N^g spans w_e … w_{p-1}.]


Fall 2001 EE669: Natural Language Processing 33

Outside Algorithm Induction

α_j(p,q)
= [Σ_{f, g≠j} Σ_{e=q+1}^{m} P(w_{1(p-1)}, w_{(q+1)m}, N^f_{pe}, N^j_{pq}, N^g_{(q+1)e})]
  + [Σ_{f,g} Σ_{e=1}^{p-1} P(w_{1(p-1)}, w_{(q+1)m}, N^f_{eq}, N^g_{e(p-1)}, N^j_{pq})]
= [Σ_{f, g≠j} Σ_{e=q+1}^{m} P(w_{1(p-1)}, w_{(e+1)m}, N^f_{pe}) P(N^f → N^j N^g) P(w_{(q+1)e} | N^g_{(q+1)e})]
  + [Σ_{f,g} Σ_{e=1}^{p-1} P(w_{1(e-1)}, w_{(q+1)m}, N^f_{eq}) P(N^f → N^g N^j) P(w_{e(p-1)} | N^g_{e(p-1)})]
= [Σ_{f, g≠j} Σ_{e=q+1}^{m} α_f(p,e) P(N^f → N^j N^g) β_g(q+1,e)]
  + [Σ_{f,g} Σ_{e=1}^{p-1} α_f(e,q) P(N^f → N^g N^j) β_g(e,p-1)]

[Figure: the two cases of the previous slides, N^j_{pq} as the left daughter (summing over right siblings) and as the right daughter (summing over left siblings).]


Fall 2001 EE669: Natural Language Processing 34

Product of Inside and Outside Probabilities

• Similarly to an HMM, we can form a product of the inside and outside probabilities:

α_j(p,q) β_j(p,q) = P(w_{1(p-1)}, N^j_{pq}, w_{(q+1)m} | G) P(w_{pq} | N^j_{pq}, G)
                  = P(w_{1m}, N^j_{pq} | G)

• The PCFG requires us to postulate a nonterminal node. Hence, the probability of the sentence AND of there being a constituent spanning the words p through q is:

P(w_{1m}, N_{pq} | G) = Σ_j α_j(p,q) β_j(p,q)


Fall 2001 EE669: Natural Language Processing 35

CNF Assumption

• Since we know that there will always be some nonterminal spanning the whole tree (the start symbol), as well as each individual terminal (due to the CNF assumption), we obtain the following expressions for the total probability of a string:

P(w_{1m} | G) = α_1(1,m) β_1(1,m) = β_1(1,m)

and, for any position k,

P(w_{1m} | G) = Σ_j α_j(k,k) β_j(k,k) = Σ_j α_j(k,k) P(N^j → w_k)


Fall 2001 EE669: Natural Language Processing 36

Most Likely Parse for a Sentence

• The O(m³n³) algorithm works by finding the highest-probability partial parse tree spanning a certain substring that is rooted with a certain nonterminal.

δ_i(p,q) = the highest inside probability of a parse of a subtree N^i_{pq}

• Initialization: δ_i(p,p) = P(N^i → w_p)

• Induction: δ_i(p,q) = max_{1≤j,k≤n, p≤r<q} P(N^i → N^j N^k) δ_j(p,r) δ_k(r+1,q)

• Store backtrace: ψ_i(p,q) = argmax_{(j,k,r)} P(N^i → N^j N^k) δ_j(p,r) δ_k(r+1,q)

• Termination: the probability of the most likely tree t̂ is P(t̂) = δ_1(1,m). To reconstruct the tree, see the next slide.


Fall 2001 EE669: Natural Language Processing 37

Extracting the Viterbi Parse

• To reconstruct the maximum-probability tree t̂, we regard the tree as a set of nodes {X_x}.

– Because the grammar has a start symbol, the root node must be N^1_{1m}.

– Thus we only need to construct the left and right daughters of a nonterminal node recursively: if X_x = N^i_{pq} is in the Viterbi parse and ψ_i(p,q) = (j,k,r), then left(X_x) = N^j_{pr} and right(X_x) = N^k_{(r+1)q}.

• There can be more than one maximum-probability parse.
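A minimal CKY/Viterbi sketch of the δ/ψ recurrences and the tree reconstruction above. The CNF grammar, its probabilities, and the example sentence are invented for illustration (the lecture's example grammar is not in CNF), and all names are specific to this sketch.

# keys of binary: (A, B, C) meaning the rule A -> B C; value: its probability
binary = {
    ("S", "NP", "VP"): 1.0,
    ("NP", "DET", "N"): 0.7,
    ("NP", "NP", "PP"): 0.3,
    ("VP", "V", "NP"): 0.8,
    ("VP", "VP", "PP"): 0.2,
    ("PP", "P", "NP"): 1.0,
}
lexical = {  # keys: (A, w) meaning the rule A -> w
    ("DET", "the"): 0.5, ("DET", "a"): 0.5,
    ("N", "dog"): 0.4, ("N", "cat"): 0.4, ("N", "telescope"): 0.2,
    ("V", "saw"): 1.0, ("P", "with"): 1.0,
}

def viterbi_parse(words):
    """Fill delta (best inside probability) and psi (backpointers) bottom-up."""
    m = len(words)
    delta, psi = {}, {}
    for p, w in enumerate(words, start=1):               # initialization
        for (A, wd), prob in lexical.items():
            if wd == w:
                delta[(A, p, p)] = prob
    for span in range(2, m + 1):                         # induction
        for p in range(1, m - span + 2):
            q = p + span - 1
            for (A, B, C), prob in binary.items():
                for r in range(p, q):
                    score = (prob * delta.get((B, p, r), 0.0)
                                  * delta.get((C, r + 1, q), 0.0))
                    if score > delta.get((A, p, q), 0.0):
                        delta[(A, p, q)] = score
                        psi[(A, p, q)] = (B, C, r)
    return delta, psi

def build_tree(psi, A, p, q, words):
    """Reconstruct the Viterbi tree from the backpointers."""
    if (A, p, q) not in psi:                             # preterminal node
        return (A, words[p - 1])
    B, C, r = psi[(A, p, q)]
    return (A, build_tree(psi, B, p, r, words),
               build_tree(psi, C, r + 1, q, words))

words = "the dog saw a cat with a telescope".split()
delta, psi = viterbi_parse(words)
print(delta[("S", 1, len(words))])    # ~0.00033 with these invented numbers
print(build_tree(psi, "S", 1, len(words), words))

With these invented probabilities the Viterbi parse attaches the PP inside the object NP; changing the rule probabilities can flip the attachment decision.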


Fall 2001 EE669: Natural Language Processing 38

Training a PCFG

• Purpose of training: limited grammar learning (induction).

• Restrictions: We must provide the number of terminals and nonterminals, as well as the start symbol, in advance. Then it is possible to assume that all possible CNF rules for that grammar exist, or alternatively to provide more structure on the rules, including specifying the set of rules.

• Purpose of training: to attempt to find the optimal probabilities to assign to the different grammar rules once we have specified the grammar in some way.

• Mechanism: This training process uses an EM training algorithm called the Inside-Outside Algorithm, which allows us to train the parameters of a PCFG on unannotated sentences of the language.

• Basic assumption: a good grammar is one that makes the sentences in the training corpus likely to occur, i.e., we seek the grammar that maximizes the likelihood of the training data.


Fall 2001 EE669: Natural Language Processing 39

Inside-Outside Algorithm

• If a parsed corpus is available, it is best to train using that information. If none is available, then we have a hidden-data problem: we need to determine probability functions on rules while only seeing the sentences.

• Hence, we must use an iterative algorithm (like Baum-Welch for HMMs) to improve estimates until some threshold is reached. We begin with a grammar topology and some initial estimates of the rule probabilities (e.g., all rules with a given nonterminal on the LHS are equiprobable).

• We use the probability of each parse of each sentence in the training set to indicate our confidence in it, and then sum the probabilities of each rule's uses to obtain an expectation of how often that rule is used.


Fall 2001 EE669: Natural Language Processing 40

Inside-Outside Algorithm

• The expectations are then used to refine the probability estimates of the rules, with the goal of increasing the likelihood of the training set given the grammar.

• Unlike the HMM case, when dealing with multiple training instances we cannot simply concatenate the data together. We must assume that the sentences in the training set are independent; the likelihood of the corpus is then simply the product of the probabilities of its component sentences.

• This complicates the reestimation formulas, which appear on pages 399-401 of Chapter 11.


Fall 2001 EE669: Natural Language Processing 41

Problems with the Inside-Outside Algorithm

• Extremely slow: for each sentence, each iteration of training is O(m³n³), where m is the sentence length and n is the number of nonterminals.

• Local maxima are much more of a problem than in HMMs.

• Satisfactory learning requires many more nonterminals than are theoretically needed to describe the language.

• There is no guarantee that the learned nonterminals will be linguistically motivated.

• Hence, although grammar induction from unannotated corpora is possible, it is extremely difficult to do well.


Fall 2001 EE669: Natural Language Processing 42

Another View of Probabilistic Parsing

• By assuming that the probability of a constituent being derived by a rule is independent of how the constituent is used as a subconstituent, the inside probability can be used to develop a best-first PCFG parsing algorithm.

• This assumption suggests that the probabilities of NP rules are the same whether the NP is used as a subject, object, or object of a preposition. (However, subjects are more likely to be pronouns!)

• The inside probabilities for lexical categories can be their lexical generation probabilities. For example, P(flower|N)=.063 is the inside probability that the constituent N is realized as the word flower.


Fall 2001 EE669: Natural Language Processing 43

Lexical Probability Estimates

The table below gives the lexical probabilities needed for our example:

P(the|ART)   .54     P(a|ART)      .360
P(flies|N)   .025    P(a|N)        .001
P(flies|V)   .076    P(flower|N)   .063
P(like|V)    .1      P(flower|V)   .05
P(like|P)    .068    P(birds|N)    .076
P(like|N)    .012


Fall 2001 EE669: Natural Language Processing 44

Word/Tag Counts

           N     V    ART     P   TOTAL
flies     21    23     0      0     44
fruit     49     5     1      0     55
like      10    30     0     21     61
a          1     0   201      0    202
the        1     0   300      2    303
flower    53    15     0      0     68
flowers   42    16     0      0     58
birds     64     1     0      0     65
others   592   210    56    284   1142
TOTAL    833   300   558    307   1998


Fall 2001 EE669: Natural Language Processing 45

The PCFG

• Below is a probabilistic CFG (PCFG) with probabilities derived from analyzing a parsed version of Allen's corpus.

Rule                Count for LHS    Count for Rule    PROB
1. S → NP VP             300              300           1
2. VP → V                300              116          .386
3. VP → V NP             300              118          .393
4. VP → V NP PP          300               66          .22
5. NP → NP PP           1032              241          .23
6. NP → N N             1032               92          .09
7. NP → N               1032              141          .14
8. NP → ART N           1032              558          .54
9. PP → P NP             307              307           1


Fall 2001 EE669: Natural Language Processing 46

Parsing with a PCFG

• Using the lexical probabilities, we can derive the probability that the constituent NP generates a sequence like "a flower". Two rules could generate the string of words:

[Figure: two NP trees for "a flower": one using rule 8 (NP → ART N, with a as ART and flower as N), the other using rule 6 (NP → N N, with a as N and flower as N).]


Fall 2001 EE669: Natural Language Processing 47

Parsing with a PCFG

• The likelihood of each rule has been estimated from the corpus, so the probability that the NP generates "a flower" is the sum over the two ways it can be derived:

P(a flower|NP) = P(R8|NP) × P(a|ART) × P(flower|N) + P(R6|NP) × P(a|N) × P(flower|N)
               = .54 × .36 × .063 + .09 × .001 × .063 ≈ .012

• This probability can then be used to compute probabilities for larger constituents, like the probability of generating the words "A flower wilted" (枯萎) from an S constituent.
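A quick numeric check of this sum (rule and lexical probabilities copied from the tables on the earlier slides; the variable names are used only for this check):

p_rule8, p_rule6 = 0.54, 0.09          # NP -> ART N, NP -> N N
p_a_art, p_a_n, p_flower_n = 0.36, 0.001, 0.063

p_np_a_flower = (p_rule8 * p_a_art * p_flower_n
                 + p_rule6 * p_a_n * p_flower_n)
print(round(p_np_a_flower, 3))         # 0.012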


Fall 2001 EE669: Natural Language Processing 48

Three Possible Trees for an S

[Figure: three possible trees for "A flower wilted", all rooted by rule 1 (S → NP VP): one with NP → N (rule 7) over "a" and VP → V NP (rule 3) over "flower wilted"; one with NP → N N (rule 6) over "a flower" and VP → V (rule 2) over "wilted"; and one with NP → ART N (rule 8) over "a flower" and VP → V (rule 2) over "wilted".]


Fall 2001 EE669: Natural Language Processing 49

Parsing with a PCFG

• The probability of an S generating "A flower wilted":

P(a flower wilted|S) = P(R1|S) × P(a flower|NP) × P(wilted|VP)
                     + P(R1|S) × P(a|NP) × P(flower wilted|VP)

• Using this approach, the probability that a given sentence will be generated by the grammar can be computed efficiently. It only requires some way of recording the value of each constituent between each pair of positions; this requirement can be met by a packed chart structure.


Fall 2001 EE669: Natural Language Processing 50

Parsing with a PCFG

• The probabilities of specific parse trees can be found using a standard chart-parsing algorithm, where the probability of each constituent is computed from the probability of its subconstituents and the rule used. When entering an entry E of nonterminal N^j built by rule i with n subconstituents corresponding to entries E1, E2, ..., En:

P(E) = P(Rule i | N^j) × P(E1) × ... × P(En)

• A standard chart parser can be used with an extra step to compute the probability of each entry when it is added to the chart. Using a bottom-up algorithm plus probabilities from a corpus, the figure on slide 52 shows the complete chart for the input "a flower".
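A minimal sketch of this entry-probability step; the function name is an assumption made for this sketch, and the example numbers are rule 8 and the lexical probabilities from the earlier tables.

from math import prod

def entry_probability(rule_prob, subentry_probs):
    """P(E) = P(rule i | N^j) * P(E1) * ... * P(En)."""
    return rule_prob * prod(subentry_probs)

# e.g. an NP built by NP -> ART N (prob .54) over entries with
# probabilities P(a|ART) = .36 and P(flower|N) = .063:
print(entry_probability(0.54, [0.36, 0.063]))   # ~0.0122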


Fall 2001 EE669: Natural Language Processing 51

Lexical Generation Probabilities

• A better estimate of the lexical category probability for a word can be calculated by determining how likely it is that category N^i occurs at position t over all tag sequences, given the input w_{1t}. In other words, rather than ignoring context or simply searching for the one sequence that yields the maximum probability for the word, we want to compute the sum of the probabilities over all sequences using the forward algorithm.

• The forward probability can then be used in the chart as the probability for a lexical item. The forward probabilities strongly prefer the noun reading of flower, in contrast to the context-independent probabilities (as can be seen on the next slide).


Fall 2001 EE669: Natural Language Processing 52

PCFG Chart for "A flower"

[Figure: packed chart for the input "a flower" (positions 1-3). Lexical entries with forward probabilities: ART416 (a) .99, N417 (a) .001, V419 (flower) .00047, N422 (flower) .999. Constituent entries include NP418 built from N417 (.00018), VP420 built from V419, an NP built from ART416 and N422 (.54), an NP built from N417 and N422 (.00011), NP425 built from N422 alone (.14), and S421 built from NP418 and VP420 (3.2×10⁻⁸).]


Fall 2001 EE669: Natural Language Processing 53

Accuracy Levels

• A parser built using this technique will identify the correct parse about 50% of the time. The drastic independence assumptions may prevent it from doing better. For example, the context-free model assumes that the probability of a particular verb being used in a VP rule is independent of the rule in question.

• For example, any structure that attaches a PP to a verb will have a probability of .22 from the fragment below, compared with .093 (.39 × .24) for one that attaches the PP to an NP inside the VP.

[Figure: two VP fragments: VP → V NP PP (probability .22) versus VP → V NP (probability .39) with NP → NP PP (probability .24).]


Fall 2001 EE669: Natural Language Processing 54

Best-First Parsing

• Algorithms can be developed that explore high-probability constituents first using a best-first strategy. The hope is that the best parse can be found quickly without using much search space and that low-probability constituents will never be created.

• The chart parser can be modified by making the agenda a priority queue (where the most probable elements are first in the queue). The parser then operates by removing the highest-ranked constituent first and adding it to the chart. (A small sketch of such an agenda appears below.)

• The previous chart parser relied on the fact that it worked from left to right, finishing earlier constituents before later ones. With the best-first strategy, this assumption does not hold, and so the algorithm must be modified.
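A minimal sketch of the agenda-as-priority-queue idea, using Python's heapq with negated probabilities so the most probable constituent is popped first; the constituent records are invented for illustration.

import heapq

agenda = []

def add_to_agenda(prob, constituent):
    """Push a constituent with its probability; heapq is a min-heap, so negate."""
    heapq.heappush(agenda, (-prob, constituent))

def next_constituent():
    """Pop the highest-probability constituent from the agenda."""
    neg_prob, constituent = heapq.heappop(agenda)
    return -neg_prob, constituent

add_to_agenda(0.21, ("NP", 1, 1))
add_to_agenda(1.00, ("V", 2, 2))
add_to_agenda(0.03, ("NP", 3, 5))

print(next_constituent())   # (1.0, ('V', 2, 2)) -- highest probability first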


Fall 2001 EE669: Natural Language Processing 55

Best-First Chart Issues

• For example, if the last word in the sentence has the highest score, it will be entered into the chart first in a best-first parser.

• This possibility suggests that a constituent needed to extend an arc may already be in the chart, requiring that whenever an active arc is added to the chart, we need to check if it can be extended by anything already in the chart.

• The arc extension algorithm must therefore be modified.


Fall 2001 EE669: Natural Language Processing 56

Arc Extension Algorithm

• To add a constituent C from position p1 to p2:
1. Insert C into the chart from position p1 to p2.
2. For any active arc of the form X → X1 ... • C ... Xn from position p0 to p1, add a new active arc X → X1 ... C • ... Xn from position p0 to p2.

• To add an active arc X → X1 ... C • C' ... Xn to the chart from p0 to p2:
1. If C is the last constituent (i.e., the arc is completed), add a new constituent of type X to the agenda.
2. Otherwise, if there is a constituent Y of category C' in the chart from p2 to p3, then recursively add an active arc X → X1 ... C C' • ... Xn from p0 to p3 (which may of course add further arcs or create further constituents).


Fall 2001 EE669: Natural Language Processing 57

Best-First Strategy

• The best-first strategy improves the efficiency of the parser: because it terminates as soon as a sentence parse is found, it creates fewer constituents than a standard chart parser.

• Even though it does not consider every possible constituent, it is guaranteed to find the highest probability parse first.

• Although conceptually simple, there are some problems with the technique in practice. For example, if we combine scores by multiplying probabilities, the scores of constituents fall as they cover more words. Hence, with large grammars, the probabilities may drop off so quickly that the search resembles breadth-first search!


Fall 2001 EE669: Natural Language Processing 58

Context-Dependent Best-First Parser

• The best-first approach may improve efficiency, but cannot improve the accuracy of the probabilistic parser.

• Improvements are possible if we use a simple alternative for computing rule probabilities that uses more context-dependent lexical information.

• In particular, we exploit the observation that the first word in a constituent is often its head and, as such, exerts an influence on the probabilities of rules that account for its complements.


Fall 2001 EE669: Natural Language Processing 59

Simple Addition of Lexical Context

• Hence, we can use the following measure, which takes into account the first word in the constituent (see the sketch after this list):

P(R | N^i, w) = C(rule R generates an N^i that starts with w) / C(N^i starts with w)

• This makes the probabilities sensitive to more lexical information, which is valuable:
– It is unusual for singular nouns to be used by themselves.
– Plurals are rarely used to modify other nouns.
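A minimal sketch of this context-dependent estimate, using counts invented purely for illustration:

from collections import Counter

# (rule, category, first_word) -> count ; the numbers are invented
counts = Counter({
    ("NP -> ART N", "NP", "the"): 76,
    ("NP -> NP PP", "NP", "the"): 24,
    ("NP -> N",     "NP", "peaches"): 65,
    ("NP -> NP PP", "NP", "peaches"): 35,
})

def rule_prob(rule, category, first_word):
    """P(R | N^i, w) = C(R builds an N^i starting with w) / C(N^i starts with w)."""
    start_count = sum(c for (r, cat, w), c in counts.items()
                      if cat == category and w == first_word)
    return counts[(rule, category, first_word)] / start_count

print(rule_prob("NP -> ART N", "NP", "the"))   # 0.76 with these invented counts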


Fall 2001 EE669: Natural Language Processing 60

Example

• Notice the probabilities for the two rules NP → N and NP → N N, given the words house and peaches. Also, context-sensitive rules can encode verb subcategorization preferences.

Rule             the    house   peaches   flowers
NP → N            0       0       .65       .76
NP → N N          0      .82       0         0
NP → NP PP       .23     .18      .35       .24
NP → ART N       .76      0        0         0

Rule             ate    bloom    like      put
VP → V           .28     .84       0        .03
VP → V NP        .57     .1       .9        .03
VP → V NP PP     .14     .05      .1        .93


Fall 2001 EE669: Natural Language Processing 61

Accuracy and Size

Strategy                               Accuracy on 84 PP          Size of chart generated for
                                       attachment problems        "The man put the bird in the house"
Full parse (taking first S found)              33%                            158
Context-free probabilities                     49%                             65
Context-dependent probabilities                66%                             36


Fall 2001 EE669: Natural Language Processing 62

Source of Improvement

• To see why the context-dependent parser does better, consider the attachment decision that must be made for sentences like:
– The man put the bird in the house.

– The man likes the bird in the house.

• The basic PCFG will assign the same structure to both, i.e., it will attach the PP to the VP. However, the context-sensitive approach will utilize subcategorization preferences for the verbs.


Fall 2001 EE669: Natural Language Processing 63

Chart for "put the bird in the house"

• The probability of attaching the PP to the VP is .54 (.93 × .99 × .76 × .76), compared to .0038 for the alternative.


Fall 2001 EE669: Natural Language Processing 64

Chart for likes the bird in the house

• The probability of attaching the PP to the VP is .054, compared to .1 for the alternative.


Fall 2001 EE669: Natural Language Processing 65

Room for Improvement

• The question is, just how accurate can this approach become? 33% error is too high. Additional information could be added to improve performance.

• One can use probabilities relative to a larger fragment of input (e.g., bigrams or trigrams for the beginnings of rules). This would require more training data.

• Attachment decisions depend not only on the word that a PP is attaching to but also on the PP itself. Hence, we could use a more complex estimate based on the previous category, the head verb, and the preposition. This creates a complication in that one cannot compare across rules easily: VP → V NP PP is evaluated using a verb and a preposition, but VP → V NP has no preposition.


Fall 2001 EE669: Natural Language Processing 66

Room for Improvement

• In general, the more selective the lexical categories, the more predictive the estimates can be, assuming that there is enough data.

• Clearly function words like prepositions, articles, quantifiers, and conjunctions can receive individual treatment; however, open class words like nouns and verbs are too numerous for individual handling.

• Words may be grouped based on similarity, or useful classes can be learned by analyzing corpora.