
Fall 2001 EE669: Natural Language Processing 1

Lecture 13: Probabilistic CFGs (Chapter 11 of Manning and Schutze)

Wen-Hsiang Lu (盧文祥 )

Department of Computer Science and Information Engineering,

National Cheng Kung University

2004/12/15

(Slides from Dr. Mary P. Harper,

http://min.ecn.purdue.edu/~ee669/)

Page 2: Fall 2001 EE669: Natural Language Processing 1 Lecture 13: Probabilistic CFGs (Chapter 11 of Manning and Schutze) Wen-Hsiang Lu ( 盧文祥 ) Department of Computer

Fall 2001 EE669: Natural Language Processing 2

Motivation

• N-gram models and HMM tagging only allow us to process sentences linearly; however, the structural analysis of even simple sentences requires a nonlinear model that reflects the hierarchical structure of sentences rather than the linear order of words.

• Context-free grammars can be generalized to include probabilistic information by adding rule-use statistics.

• Probabilistic Context-Free Grammars (PCFGs) are the simplest and most natural probabilistic model for tree structures, and the algorithms for them are closely related to those for HMMs.

• Note, however, that there are other ways of building probabilistic models of syntactic structure (see Chapter 12).


Fall 2001 EE669: Natural Language Processing 3

CFG Parse Example

• #1 S → NP VP

• #2 VP → V NP PP

• #3 VP → V NP

• #4 NP → N

• #5 NP → N PP

• #6 PP → PREP N

• #7 N → a_dog

• #8 N → a_cat

• #9 N → a_telescope

• #10 V → saw

• #11 PREP → with

Example sentence: a_dog saw a_cat with a_telescope

[Figure: two parse trees for the sentence, one attaching the PP "with a_telescope" inside the VP (rule #2), the other attaching it inside the object NP (rules #3 and #5).]


Fall 2001 EE669: Natural Language Processing 4

Dependency Parse Example

• Same example, in a dependency representation.

[Figure: two dependency trees for "a_dog saw a_cat with a_telescope". In both, a_dog is the subject (Sb) of saw and a_cat its object (Obj); "with a_telescope" attaches either to a_cat as an attribute (Attr) or to saw as an adverbial of tool (Adv_Tool).]


Fall 2001 EE669: Natural Language Processing 5

PCFG Notation

• G is a PCFG.
• L is the language generated by G.
• {N^1, …, N^n} is the set of nonterminals of G; there are n nonterminals.
• {w^1, …, w^V} is the set of terminals; there are V terminals.
• w_1 … w_m is a sequence of words constituting a sentence to be parsed, also denoted w_{1m}.
• N^j_{pq} denotes nonterminal N^j spanning w_p through w_q; in other words, N^j ⇒* w_p w_{p+1} … w_q.


Fall 2001 EE669: Natural Language Processing 6

Formal Definition of a PCFG

• A PCFG consists of:
– A set of terminals, {w^k}, k = 1, …, V
– A set of nonterminals, {N^i}, i = 1, …, n
– A designated start symbol N^1
– A set of rules, {N^i → ζ^j} (where ζ^j is a sequence of terminals and nonterminals)
– A corresponding set of rule probabilities such that: ∀i Σ_j P(N^i → ζ^j) = 1


Fall 2001 EE669: Natural Language Processing 7

Probability of a Derivation Tree and a String

• The probability of a derivation (i.e., parse) tree:

P(T) = Π_{i=1..k} p(r(i))

where r(1), …, r(k) are the rules of the CFG used to generate the sentence w_{1m} of which T is a parse.

• The probability of a sentence (according to grammar G) is given by:

P(w_{1m}) = Σ_t P(w_{1m}, t) = Σ_{t: yield(t)=w_{1m}} P(t)

where t is a parse tree of the sentence. Dynamic programming is needed to make this computation efficient!
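As a concrete illustration of the product-of-rules definition, here is a minimal Python sketch (not from the textbook) that computes P(T) for a small parse tree. The rule probabilities are those of the toy grammar used later in this lecture; the tree encoding and helper names are assumptions made purely for illustration.

from functools import reduce

# P(lhs -> rhs) for a few rules of the toy grammar used later in the lecture
rule_prob = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("N",)): 0.7,
    ("VP", ("V", "NP")): 0.6,
    ("N", ("a_dog",)): 0.3,
    ("N", ("a_cat",)): 0.5,
    ("V", ("saw",)): 1.0,
}

def tree_prob(tree):
    """tree = (label, [children]); a leaf is a plain string."""
    if isinstance(tree, str):          # terminal: contributes no rule
        return 1.0
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rule_prob[(label, rhs)]        # probability of the rule used at this node
    return reduce(lambda acc, c: acc * tree_prob(c), children, p)

t = ("S", [("NP", [("N", ["a_dog"])]),
           ("VP", [("V", ["saw"]), ("NP", [("N", ["a_cat"])])])])
print(tree_prob(t))   # 1.0 * 0.7 * 0.3 * 0.6 * 1.0 * 0.7 * 0.5 = 0.0441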


Fall 2001 EE669: Natural Language Processing 8

Example: Probability of a Derivation Tree


Fall 2001 EE669: Natural Language Processing 9

Example: Probability of a Derivation Tree


Fall 2001 EE669: Natural Language Processing 10

The PCFG

• Below is a probabilistic CFG (PCFG) with probabilities derived from analyzing a parsed version of Allen's corpus.

Rule                Count for LHS    Count for Rule    PROB
1. S → NP VP             300              300           1
2. VP → V                300              116          .386
3. VP → V NP             300              118          .393
4. VP → V NP PP          300               66          .22
5. NP → NP PP           1032              241          .23
6. NP → N N             1032               92          .09
7. NP → N               1032              141          .14
8. NP → ART N           1032              558          .54
9. PP → P NP             307              307           1


Fall 2001 EE669: Natural Language Processing 11

Assumptions of the PCFG Model

• Place Invariance: The probability of a subtree does not depend on where in the string the words it dominates are located (like time invariance in an HMM):

∀k P(N^j_{k(k+c)} → ζ) is the same

• Context Free: The probability of a subtree does not depend on words not dominated by that subtree:

P(N^j_{kl} → ζ | anything outside k through l) = P(N^j_{kl} → ζ)

• Ancestor Free: The probability of a subtree does not depend on nodes in the derivation outside the subtree:

P(N^j_{kl} → ζ | any ancestor nodes outside N^j_{kl}) = P(N^j_{kl} → ζ)


Fall 2001 EE669: Natural Language Processing 12

Probability of a Rule

• Rule r: N^j → ζ, where ζ ∈ (N ∪ W)^+.

• Let R_{N^j} be the set of all rules that have nonterminal N^j on the left-hand side.

• Then define a probability distribution on R_{N^j}:

Σ_{r ∈ R_{N^j}} p(r) = 1,   0 ≤ p(r) ≤ 1

• Another point of view:

p(ζ | N^j) = p(r), where r is N^j → ζ


Fall 2001 EE669: Natural Language Processing 13

Estimating Probability of a Rule

• MLE from a treebank produced using a CFG grammar.

• Let r: N^j → ζ; then
– p(r) = c(r) / c(N^j)
– c(r): how many times r appears in the treebank
– c(N^j) = Σ_γ c(N^j → γ)
– so p(r) = c(N^j → ζ) / Σ_γ c(N^j → γ)
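A minimal sketch of this MLE computation, assuming a toy one-tree "treebank" and a (label, [children]) tree encoding invented for illustration:

from collections import Counter

def rules_of(tree):
    """Yield (lhs, rhs) pairs for every internal node of a (label, [children]) tree."""
    label, children = tree
    yield label, tuple(c if isinstance(c, str) else c[0] for c in children)
    for c in children:
        if not isinstance(c, str):
            yield from rules_of(c)

# A one-tree "treebank", invented for illustration
treebank = [
    ("S", [("NP", [("N", ["a_dog"])]),
           ("VP", [("V", ["saw"]), ("NP", [("N", ["a_cat"])])])]),
]

rule_counts = Counter(r for t in treebank for r in rules_of(t))
lhs_counts = Counter()
for (lhs, _), n in rule_counts.items():
    lhs_counts[lhs] += n

# p(N^j -> zeta) = c(N^j -> zeta) / sum_gamma c(N^j -> gamma)
rule_prob = {(lhs, rhs): n / lhs_counts[lhs]
             for (lhs, rhs), n in rule_counts.items()}
print(rule_prob[("NP", ("N",))])     # 2/2 = 1.0 in this tiny treebank
print(rule_prob[("N", ("a_dog",))])  # 1/2 = 0.5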


Fall 2001 EE669: Natural Language Processing 14

Example PCFG

Rule                Count for LHS    Count for Rule    PROB
1. S → NP VP             300              300           1
2. VP → V                300              116          .386
3. VP → V NP             300              118          .393
4. VP → V NP PP          300               66          .22
5. NP → NP PP           1032              241          .23
6. NP → N N             1032               92          .09
7. NP → N               1032              141          .14
8. NP → ART N           1032              558          .54
9. PP → P NP             307              307           1


Fall 2001 EE669: Natural Language Processing 15

Some Features of PCFGs

• A PCFG gives some idea of the plausibility of different parses; however, the probabilities are based on structural factors and not lexical ones.
• PCFGs are good for grammar induction.
• PCFGs are robust.
• PCFGs give a probabilistic language model for English.
• One might expect the predictive power of a PCFG to be greater than that of an HMM, though in practice it is worse.
• PCFGs are not good models alone, but they can be combined with a trigram model.
• PCFGs have certain biases which may not be appropriate; smaller trees are preferred. (Why?)
• Estimating PCFG parameters from a corpus yields a proper probability distribution (Chi and Geman, 1998).


Fall 2001 EE669: Natural Language Processing 16

Recall the Questions for HMMs

• Given a model μ = (A, B, π), how do we efficiently compute how likely a certain observation sequence is, that is, P(O|μ)?

• Given the observation sequence O and a model μ, how do we choose a state sequence (X_1, …, X_{T+1}) that best explains the observations?

• Given an observation sequence O, and a space of possible models found by varying the model parameters μ = (A, B, π), how do we find the model that best explains the observed data?


Fall 2001 EE669: Natural Language Processing 17

Questions for PCFGs

• There are also three basic questions we wish to answer for PCFGs:
– What is the probability of a sentence w_{1m} according to a grammar G: P(w_{1m}|G)?
– What is the most likely parse for a sentence: argmax_t P(t|w_{1m},G)?
– How can we choose rule probabilities for the grammar G that maximize the probability of a sentence: argmax_G P(w_{1m}|G)?


Fall 2001 EE669: Natural Language Processing 18

Restriction

• For this presentation, we only consider the case of Chomsky Normal Form (CNF) grammars, which only have unary and binary rules of the form:
• N^i → N^j N^k // two nonterminals on the RHS
• N^i → w^j // one terminal on the RHS

• The parameters of a PCFG in Chomsky Normal Form are:
• P(N^j → N^r N^s | G) // an n^3 matrix of parameters
• P(N^j → w^k | G) // n×V parameters
(n is the number of nonterminals and V is the number of terminals)

• For j = 1, …, n:  Σ_{r,s} P(N^j → N^r N^s) + Σ_k P(N^j → w^k) = 1

• Any CFG can be represented by a weakly equivalent grammar in CNF.
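As a concrete illustration of this parameterization, here is a minimal sketch that stores a CNF PCFG as two tables and checks the normalization constraint above; the toy grammar and all of its numbers are invented for illustration (the lecture's example grammar is not in CNF).

# A CNF PCFG stored as two tables: binary[A][(B, C)] = P(A -> B C) and
# lexical[A][w] = P(A -> w). All rules and numbers are invented for illustration.
binary = {
    "S":  {("NP", "VP"): 1.0},
    "NP": {("DET", "N"): 0.6, ("N", "PP"): 0.4},
    "VP": {("V", "NP"): 1.0},
    "PP": {("P", "NP"): 1.0},
}
lexical = {
    "DET": {"a": 0.5, "the": 0.5},
    "N":   {"dog": 0.4, "cat": 0.4, "telescope": 0.2},
    "V":   {"saw": 1.0},
    "P":   {"with": 1.0},
}

def is_normalized(nt, tol=1e-9):
    """Check sum_{r,s} P(nt -> N^r N^s) + sum_k P(nt -> w^k) = 1."""
    total = sum(binary.get(nt, {}).values()) + sum(lexical.get(nt, {}).values())
    return abs(total - 1.0) < tol

for nt in sorted(set(binary) | set(lexical)):
    print(nt, is_normalized(nt))   # every nonterminal should print True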


Fall 2001 EE669: Natural Language Processing 19

HMMs and PCFGs

• An HMM is able to do its calculations efficiently using forward and backward probabilities:
– Forward: α_i(t) = P(w_{1(t-1)}, X_t = i)
– Backward: β_i(t) = P(w_{tT} | X_t = i)

• The forward probability corresponds in a parse tree to everything above and including a certain node (outside), while the backward probability corresponds to the probability of everything below a certain node (inside).

• For PCFGs, the outside (α_j) and inside (β_j) probabilities are defined as:

α_j(p,q) = P(w_{1(p-1)}, N^j_{pq}, w_{(q+1)m} | G)
β_j(p,q) = P(w_{pq} | N^j_{pq}, G)


Fall 2001 EE669: Natural Language Processing 20

Inside and Outside Probabilities of PCFGs

[Figure: a parse tree rooted in N^1 spanning w_1 … w_m, with a nonterminal N^j dominating the span w_p … w_q. The inside probability covers everything below N^j (the words w_p … w_q); the outside probability covers everything outside that span (w_1 … w_{p-1} and w_{q+1} … w_m).]


Fall 2001 EE669: Natural Language Processing 21

Inside and Outside Probabilities

• The inside probability β_j(p,q) is the total probability of generating w_{pq} starting off with nonterminal N^j.

• The outside probability α_j(p,q) is the total probability of beginning with nonterminal N^1 and generating N^j_{pq} and all words outside of w_{pq}.


Fall 2001 EE669: Natural Language Processing 22

From HMMs to Probabilistic Regular Grammars (PRG)

• A PRG has start state N^1 and rules of the form:
– N^i → w^j N^k
– N^i → w^j

• This is like an HMM, except that for an HMM:

Σ_{w_{1n}} P(w_{1n}) = 1 for every n

whereas, for a PCFG:

Σ_{w ∈ L} P(w) = 1

where L is the language generated by the grammar.

• PRGs are related to HMMs in that a PRG is an HMM to which a start state and a finish (or sink) state are added.


Fall 2001 EE669: Natural Language Processing 23

Inside Probability

• β_j(p,q) = P(N^j ⇒* w_{pq})

[Figure: nonterminal N^j dominating the words w_p … w_q.]


Fall 2001 EE669: Natural Language Processing 24

The Probability of a String: Using Inside Probabilities

• We can calculate the probability of a string using the inside algorithm, a dynamic programming algorithm based on the inside probabilities:

P(w_{1m} | G) = P(N^1 ⇒* w_{1m} | G) = P(w_{1m} | N^1_{1m}, G) = β_1(1,m)

• Base case (spans of a single word): for all j,

β_j(k,k) = P(w_k | N^j_{kk}, G) = P(N^j → w_k | G)


Fall 2001 EE669: Natural Language Processing 25

Induction: Divide the String w_{pq}

[Figure: N^j dominating w_p … w_q is split by a rule N^j → N^r N^s, with N^r dominating w_p … w_d and N^s dominating w_{d+1} … w_q.]


Fall 2001 EE669: Natural Language Processing 26

Induction step: ∀j, 1 ≤ p < q ≤ m,

β_j(p,q) = P(w_{pq} | N^j_{pq}, G)
= Σ_{r,s} Σ_{d=p}^{q-1} P(w_{pd}, N^r_{pd}, w_{(d+1)q}, N^s_{(d+1)q} | N^j_{pq}, G)
= Σ_{r,s} Σ_{d=p}^{q-1} P(N^r_{pd}, N^s_{(d+1)q} | N^j_{pq}, G) P(w_{pd} | N^r_{pd}, G) P(w_{(d+1)q} | N^s_{(d+1)q}, G)
= Σ_{r,s} Σ_{d=p}^{q-1} P(N^j → N^r N^s) β_r(p,d) β_s(d+1,q)

(the third line uses the independence assumptions of the PCFG model)

[Figure: as on the previous slide, N^j → N^r N^s with N^r spanning w_p … w_d and N^s spanning w_{d+1} … w_q.]


Fall 2001 EE669: Natural Language Processing 27

Example PCFG

• #1 S → NP VP 1.0

• #2 VP → V NP PP 0.4

• #3 VP → V NP 0.6

• #4 NP → N 0.7

• #5 NP → N PP 0.3

• #6 PP → PREP N 1.0

• #7 N → a_dog 0.3

• #8 N → a_cat 0.5

• #9 N → a_telescope 0.2

• #10 V → saw 1.0

• #11 PREP → with 1.0

[Figure: the two parse trees for "a_dog saw a_cat with a_telescope", annotated with the rule probabilities above; one attaches the PP to the VP (rule #2), the other attaches it to the object NP (rules #3 and #5).]

P(a_dog saw a_cat with a_telescope)
= 1 × .7 × .3 × .4 × 1 × .7 × .5 × 1 × 1 × .2   (PP attached to the VP)
+ 1 × .7 × .3 × .6 × 1 × .3 × .5 × 1 × 1 × .2   (PP attached to the object NP)
= .00588 + .00378 = .00966


Fall 2001 EE669: Natural Language Processing 28

Computing Inside Probability

• a_dog saw a_cat with a_telescope
      1     2     3     4       5

• Create an m × m table (m is the length of the string).

• Initialize the diagonal, using the lexical rules (e.g., N → a_dog).

• Recursively fill in the remaining cells, working from the diagonal towards the upper right corner.

from/to  1              2      3              4        5
1        N .3, NP .21          S .0441                 S .00966
2                       V 1    VP .21                  VP .046
3                              N .5, NP .35            NP .03
4                                             PREP 1   PP .2
5                                                      N .2
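The chart above can be reproduced with a short brute-force inside calculator. The sketch below is an illustration, not the textbook's algorithm: it enumerates all ways each rule can split a span and sums the resulting probabilities, so it handles the non-CNF rules of the example grammar directly. All names are chosen for this sketch only.

from functools import lru_cache
from math import prod

# (lhs, rhs, probability); symbols that never appear as an lhs are terminals
RULES = [
    ("S", ("NP", "VP"), 1.0),
    ("VP", ("V", "NP", "PP"), 0.4),
    ("VP", ("V", "NP"), 0.6),
    ("NP", ("N",), 0.7),
    ("NP", ("N", "PP"), 0.3),
    ("PP", ("PREP", "N"), 1.0),
    ("N", ("a_dog",), 0.3),
    ("N", ("a_cat",), 0.5),
    ("N", ("a_telescope",), 0.2),
    ("V", ("saw",), 1.0),
    ("PREP", ("with",), 1.0),
]
NONTERMINALS = {lhs for lhs, _, _ in RULES}
SENTENCE = ("a_dog", "saw", "a_cat", "with", "a_telescope")

def splits(p, q, k):
    """All ways to cut positions p..q (inclusive, 1-based) into k contiguous spans."""
    if k == 1:
        yield [(p, q)]
        return
    for d in range(p, q - k + 2):
        for rest in splits(d + 1, q, k - 1):
            yield [(p, d)] + rest

@lru_cache(maxsize=None)
def inside(sym, p, q):
    """beta_sym(p, q): total probability that sym derives words p..q."""
    if sym not in NONTERMINALS:                      # terminal symbol
        return 1.0 if p == q and SENTENCE[p - 1] == sym else 0.0
    total = 0.0
    for lhs, rhs, prob in RULES:
        if lhs != sym:
            continue
        for spans in splits(p, q, len(rhs)):
            total += prob * prod(inside(child, i, j)
                                 for child, (i, j) in zip(rhs, spans))
    return total

print(round(inside("NP", 1, 1), 5))                  # 0.21
print(round(inside("VP", 2, 5), 5))                  # 0.046
print(round(inside("S", 1, len(SENTENCE)), 5))       # 0.00966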


Fall 2001 EE669: Natural Language Processing 29

The Probability of a String: Using Outside Probabilities

• We can also calculate the probability of a string using the outside algorithm, based on the outside probabilities, for any k, 1 ≤ k ≤ m:

P(w_{1m} | G) = Σ_j P(w_{1(k-1)}, w_k, w_{(k+1)m}, N^j_{kk} | G)
= Σ_j P(w_{1(k-1)}, N^j_{kk}, w_{(k+1)m} | G) P(w_k | w_{1(k-1)}, N^j_{kk}, w_{(k+1)m}, G)
= Σ_j α_j(k,k) P(N^j → w_k)

• Outside probabilities are calculated top down and require reference to inside probabilities, which must be calculated first.


Fall 2001 EE669: Natural Language Processing 30

Outside Algorithm

• Base case: α_1(1,m) = 1 and α_j(1,m) = 0 for j ≠ 1 (the root spanning the whole string must be the start symbol N^1, and there is nothing outside it).

• Inductive case: two cases, depicted on the next two slides, depending on whether N^j_{pq} is the left or the right daughter of its parent. Sum over both possibilities, but restrict the first sum so that g ≠ j to avoid double counting for rules of the form N^f → N^j N^j.


Fall 2001 EE669: Natural Language Processing 31

Case 1: left previous derivation step

[Figure: the root N^1 spans w_1 … w_m; a parent node N^f_{pe} expands as N^j_{pq} N^g_{(q+1)e}, so N^j_{pq} is the left daughter and its sibling N^g spans w_{q+1} … w_e.]


Fall 2001 EE669: Natural Language Processing 32

Case 2: right previous derivation step

[Figure: the root N^1 spans w_1 … w_m; a parent node N^f_{eq} expands as N^g_{e(p-1)} N^j_{pq}, so N^j_{pq} is the right daughter and its sibling N^g spans w_e … w_{p-1}.]


Fall 2001 EE669: Natural Language Processing 33

Outside Algorithm Induction

α_j(p,q)
= [Σ_{f, g≠j} Σ_{e=q+1}^{m} P(w_{1(p-1)}, w_{(q+1)m}, N^f_{pe}, N^j_{pq}, N^g_{(q+1)e})]
  + [Σ_{f,g} Σ_{e=1}^{p-1} P(w_{1(p-1)}, w_{(q+1)m}, N^f_{eq}, N^g_{e(p-1)}, N^j_{pq})]
= [Σ_{f, g≠j} Σ_{e=q+1}^{m} P(w_{1(p-1)}, w_{(e+1)m}, N^f_{pe}) P(N^f → N^j N^g) P(w_{(q+1)e} | N^g_{(q+1)e})]
  + [Σ_{f,g} Σ_{e=1}^{p-1} P(w_{1(e-1)}, w_{(q+1)m}, N^f_{eq}) P(N^f → N^g N^j) P(w_{e(p-1)} | N^g_{e(p-1)})]
= [Σ_{f, g≠j} Σ_{e=q+1}^{m} α_f(p,e) P(N^f → N^j N^g) β_g(q+1,e)]
  + [Σ_{f,g} Σ_{e=1}^{p-1} α_f(e,q) P(N^f → N^g N^j) β_g(e,p-1)]

[Figure: the two cases of the previous slides, N^j_{pq} as the left daughter (summing over right siblings) and as the right daughter (summing over left siblings).]


Fall 2001 EE669: Natural Language Processing 34

Product of Inside and Outside Probabilities

• Similarly to an HMM, we can form a product of the inside and outside probabilities:

α_j(p,q) β_j(p,q) = P(w_{1(p-1)}, N^j_{pq}, w_{(q+1)m} | G) P(w_{pq} | N^j_{pq}, G)
                  = P(w_{1m}, N^j_{pq} | G)

• The PCFG requires us to postulate a nonterminal node. Hence, the probability of the sentence AND of there being a constituent spanning the words p through q is:

P(w_{1m}, N_{pq} | G) = Σ_j α_j(p,q) β_j(p,q)


Fall 2001 EE669: Natural Language Processing 35

CNF Assumption

• Since we know that there will always be some nonterminal spanning the whole tree (the start symbol), as well as each individual terminal (due to the CNF assumption), we obtain the following expressions for the total probability of a string:

P(w_{1m} | G) = α_1(1,m) β_1(1,m) = β_1(1,m)

and, for any position k,

P(w_{1m} | G) = Σ_j α_j(k,k) β_j(k,k) = Σ_j α_j(k,k) P(N^j → w_k)


Fall 2001 EE669: Natural Language Processing 36

Most Likely Parse for a Sentence

• The O(m³n³) algorithm works by finding the highest-probability partial parse tree spanning a certain substring that is rooted with a certain nonterminal.

δ_i(p,q) = the highest inside probability of a parse of a subtree N^i_{pq}

• Initialization: δ_i(p,p) = P(N^i → w_p)

• Induction: δ_i(p,q) = max_{1≤j,k≤n, p≤r<q} P(N^i → N^j N^k) δ_j(p,r) δ_k(r+1,q)

• Store backtrace: ψ_i(p,q) = argmax_{(j,k,r)} P(N^i → N^j N^k) δ_j(p,r) δ_k(r+1,q)

• Termination: the probability of the most likely tree t̂ is P(t̂) = δ_1(1,m). To reconstruct the tree, see the next slide.


Fall 2001 EE669: Natural Language Processing 37

Extracting the Viterbi Parse

• To reconstruct the maximum-probability tree t̂, we regard the tree as a set of nodes {X_x}.

– Because the grammar has a start symbol, the root node must be N^1_{1m}.

– Thus we only need to construct the left and right daughters of a nonterminal node recursively: if X_x = N^i_{pq} is in the Viterbi parse and ψ_i(p,q) = (j,k,r), then left(X_x) = N^j_{pr} and right(X_x) = N^k_{(r+1)q}.

• There can be more than one maximum-probability parse.
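A minimal CKY/Viterbi sketch of the δ/ψ recurrences and the tree reconstruction above. The CNF grammar, its probabilities, and the example sentence are invented for illustration (the lecture's example grammar is not in CNF), and all names are specific to this sketch.

# keys of binary: (A, B, C) meaning the rule A -> B C; value: its probability
binary = {
    ("S", "NP", "VP"): 1.0,
    ("NP", "DET", "N"): 0.7,
    ("NP", "NP", "PP"): 0.3,
    ("VP", "V", "NP"): 0.8,
    ("VP", "VP", "PP"): 0.2,
    ("PP", "P", "NP"): 1.0,
}
lexical = {  # keys: (A, w) meaning the rule A -> w
    ("DET", "the"): 0.5, ("DET", "a"): 0.5,
    ("N", "dog"): 0.4, ("N", "cat"): 0.4, ("N", "telescope"): 0.2,
    ("V", "saw"): 1.0, ("P", "with"): 1.0,
}

def viterbi_parse(words):
    """Fill delta (best inside probability) and psi (backpointers) bottom-up."""
    m = len(words)
    delta, psi = {}, {}
    for p, w in enumerate(words, start=1):               # initialization
        for (A, wd), prob in lexical.items():
            if wd == w:
                delta[(A, p, p)] = prob
    for span in range(2, m + 1):                         # induction
        for p in range(1, m - span + 2):
            q = p + span - 1
            for (A, B, C), prob in binary.items():
                for r in range(p, q):
                    score = (prob * delta.get((B, p, r), 0.0)
                                  * delta.get((C, r + 1, q), 0.0))
                    if score > delta.get((A, p, q), 0.0):
                        delta[(A, p, q)] = score
                        psi[(A, p, q)] = (B, C, r)
    return delta, psi

def build_tree(psi, A, p, q, words):
    """Reconstruct the Viterbi tree from the backpointers."""
    if (A, p, q) not in psi:                             # preterminal node
        return (A, words[p - 1])
    B, C, r = psi[(A, p, q)]
    return (A, build_tree(psi, B, p, r, words),
               build_tree(psi, C, r + 1, q, words))

words = "the dog saw a cat with a telescope".split()
delta, psi = viterbi_parse(words)
print(delta[("S", 1, len(words))])    # ~0.00033 with these invented numbers
print(build_tree(psi, "S", 1, len(words), words))

With these invented probabilities the Viterbi parse attaches the PP inside the object NP; changing the rule probabilities can flip the attachment decision.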


Fall 2001 EE669: Natural Language Processing 38

Training a PCFG

• Purpose of training: limited grammar learning (induction).

• Restrictions: We must provide the number of terminals and nonterminals, as well as the start symbol, in advance. Then it is possible to assume that all possible CNF rules for that grammar exist, or alternatively to provide more structure on the rules, including specifying the set of rules.

• Purpose of training: to attempt to find the optimal probabilities to assign to the different grammar rules once we have specified the grammar in some way.

• Mechanism: This training process uses an EM training algorithm called the Inside-Outside Algorithm, which allows us to train the parameters of a PCFG on unannotated sentences of the language.

• Basic assumption: a good grammar is one that makes the sentences in the training corpus likely to occur, i.e., we seek the grammar that maximizes the likelihood of the training data.


Fall 2001 EE669: Natural Language Processing 39

Inside-Outside Algorithm

• If a parsed corpus is available, it is best to train using that information. If none is available, then we have a hidden-data problem: we need to determine probability functions on rules while only seeing the sentences.

• Hence, we must use an iterative algorithm (like Baum-Welch for HMMs) to improve estimates until some threshold is reached. We begin with a grammar topology and some initial estimates of the rule probabilities (e.g., all rules with a given nonterminal on the LHS are equiprobable).

• We use the probability of each parse of each sentence in the training set to indicate our confidence in it, and then sum the probabilities of each rule's uses to obtain an expectation of how often that rule is used.


Fall 2001 EE669: Natural Language Processing 40

Inside-Outside Algorithm

• The expectations are then used to refine the probability estimates of the rules, with the goal of increasing the likelihood of the training set given the grammar.

• Unlike the HMM case, when dealing with multiple training instances we cannot simply concatenate the data together. We must assume that the sentences in the training set are independent; the likelihood of the corpus is then simply the product of the probabilities of its component sentences.

• This complicates the reestimation formulas, which appear on pages 399-401 of Chapter 11.


Fall 2001 EE669: Natural Language Processing 41

Problems with the Inside-Outside Algorithm

• Extremely slow: for each sentence, each iteration of training is O(m³n³), where m is the sentence length and n is the number of nonterminals.

• Local maxima are much more of a problem than in HMMs.

• Satisfactory learning requires many more nonterminals than are theoretically needed to describe the language.

• There is no guarantee that the learned nonterminals will be linguistically motivated.

• Hence, although grammar induction from unannotated corpora is possible, it is extremely difficult to do well.


Fall 2001 EE669: Natural Language Processing 42

Another View of Probabilistic Parsing

• By assuming that the probability of a constituent being derived by a rule is independent of how the constituent is used as a subconstituent, the inside probability can be used to develop a best-first PCFG parsing algorithm.

• This assumption suggests that the probabilities of NP rules are the same whether the NP is used as a subject, object, or object of a preposition. (However, subjects are more likely to be pronouns!)

• The inside probabilities for lexical categories can be their lexical generation probabilities. For example, P(flower|N)=.063 is the inside probability that the constituent N is realized as the word flower.


Fall 2001 EE669: Natural Language Processing 43

Lexical Probability Estimates

The table below gives the lexical probabilities needed for our example:

P(the|ART)   .54     P(a|ART)      .360
P(flies|N)   .025    P(a|N)        .001
P(flies|V)   .076    P(flower|N)   .063
P(like|V)    .1      P(flower|V)   .05
P(like|P)    .068    P(birds|N)    .076
P(like|N)    .012


Fall 2001 EE669: Natural Language Processing 44

Word/Tag Counts

           N     V    ART     P   TOTAL
flies     21    23     0      0     44
fruit     49     5     1      0     55
like      10    30     0     21     61
a          1     0   201      0    202
the        1     0   300      2    303
flower    53    15     0      0     68
flowers   42    16     0      0     58
birds     64     1     0      0     65
others   592   210    56    284   1142
TOTAL    833   300   558    307   1998


Fall 2001 EE669: Natural Language Processing 45

The PCFG

• Below is a probabilistic CFG (PCFG) with probabilities derived from analyzing a parsed version of Allen's corpus.

Rule                Count for LHS    Count for Rule    PROB
1. S → NP VP             300              300           1
2. VP → V                300              116          .386
3. VP → V NP             300              118          .393
4. VP → V NP PP          300               66          .22
5. NP → NP PP           1032              241          .23
6. NP → N N             1032               92          .09
7. NP → N               1032              141          .14
8. NP → ART N           1032              558          .54
9. PP → P NP             307              307           1


Fall 2001 EE669: Natural Language Processing 46

Parsing with a PCFG

• Using the lexical probabilities, we can derive the probability that the constituent NP generates a sequence like "a flower". Two rules could generate the string of words:

[Figure: two NP trees for "a flower": one using rule 8 (NP → ART N, with a as ART and flower as N), the other using rule 6 (NP → N N, with a as N and flower as N).]


Fall 2001 EE669: Natural Language Processing 47

Parsing with a PCFG

• The likelihood of each rule has been estimated from the corpus, so the probability that the NP generates "a flower" is the sum over the two ways it can be derived:

P(a flower|NP) = P(R8|NP) × P(a|ART) × P(flower|N) + P(R6|NP) × P(a|N) × P(flower|N)
               = .54 × .36 × .063 + .09 × .001 × .063 ≈ .012

• This probability can then be used to compute probabilities for larger constituents, like the probability of generating the words "A flower wilted" (枯萎) from an S constituent.
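A quick numeric check of this sum (rule and lexical probabilities copied from the tables on the earlier slides; the variable names are used only for this check):

p_rule8, p_rule6 = 0.54, 0.09          # NP -> ART N, NP -> N N
p_a_art, p_a_n, p_flower_n = 0.36, 0.001, 0.063

p_np_a_flower = (p_rule8 * p_a_art * p_flower_n
                 + p_rule6 * p_a_n * p_flower_n)
print(round(p_np_a_flower, 3))         # 0.012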


Fall 2001 EE669: Natural Language Processing 48

Three Possible Trees for an S

[Figure: three possible trees for "A flower wilted", all rooted by rule 1 (S → NP VP): one with NP → N (rule 7) over "a" and VP → V NP (rule 3) over "flower wilted"; one with NP → N N (rule 6) over "a flower" and VP → V (rule 2) over "wilted"; and one with NP → ART N (rule 8) over "a flower" and VP → V (rule 2) over "wilted".]


Fall 2001 EE669: Natural Language Processing 49

Parsing with a PCFG

• The probability of an S generating "A flower wilted":

P(a flower wilted|S) = P(R1|S) × P(a flower|NP) × P(wilted|VP)
                     + P(R1|S) × P(a|NP) × P(flower wilted|VP)

• Using this approach, the probability that a given sentence will be generated by the grammar can be computed efficiently. It only requires some way of recording the value of each constituent between each pair of positions; this requirement can be met by a packed chart structure.


Fall 2001 EE669: Natural Language Processing 50

Parsing with a PCFG

• The probabilities of specific parse trees can be found using a standard chart-parsing algorithm, where the probability of each constituent is computed from the probability of its subconstituents and the rule used. When entering an entry E of nonterminal N^j built by rule i with n subconstituents corresponding to entries E1, E2, ..., En:

P(E) = P(Rule i | N^j) × P(E1) × ... × P(En)

• A standard chart parser can be used with an extra step to compute the probability of each entry when it is added to the chart. Using a bottom-up algorithm plus probabilities from a corpus, the figure on slide 52 shows the complete chart for the input "a flower".
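A minimal sketch of this entry-probability step; the function name is an assumption made for this sketch, and the example numbers are rule 8 and the lexical probabilities from the earlier tables.

from math import prod

def entry_probability(rule_prob, subentry_probs):
    """P(E) = P(rule i | N^j) * P(E1) * ... * P(En)."""
    return rule_prob * prod(subentry_probs)

# e.g. an NP built by NP -> ART N (prob .54) over entries with
# probabilities P(a|ART) = .36 and P(flower|N) = .063:
print(entry_probability(0.54, [0.36, 0.063]))   # ~0.0122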


Fall 2001 EE669: Natural Language Processing 51

Lexical Generation Probabilities

• A better estimate of the lexical category probability for a word can be calculated by determining how likely it is that category N^i occurs at position t over all tag sequences, given the input w_{1t}. In other words, rather than ignoring context or simply searching for the one sequence that yields the maximum probability for the word, we want to compute the sum of the probabilities over all sequences using the forward algorithm.

• The forward probability can then be used in the chart as the probability for a lexical item. The forward probabilities strongly prefer the noun reading of flower, in contrast to the context-independent probabilities (as can be seen on the next slide).


Fall 2001 EE669: Natural Language Processing 52

PCFG Chart for "A flower"

[Figure: packed chart for the input "a flower" (positions 1-3). Lexical entries with forward probabilities: ART416 (a) .99, N417 (a) .001, V419 (flower) .00047, N422 (flower) .999. Constituent entries include NP418 built from N417 (.00018), VP420 built from V419, an NP built from ART416 and N422 (.54), an NP built from N417 and N422 (.00011), NP425 built from N422 alone (.14), and S421 built from NP418 and VP420 (3.2×10⁻⁸).]


Fall 2001 EE669: Natural Language Processing 53

Accuracy Levels

• A parser built using this technique will identify the correct parse about 50% of the time. The drastic independence assumptions may prevent it from doing better. For example, the context-free model assumes that the probability of a particular verb being used in a VP rule is independent of the rule in question.

• For example, any structure that attaches a PP to a verb will have a probability of .22 from the fragment below, compared with .093 (.39 × .24) for one that attaches the PP to an NP inside the VP.

[Figure: two VP fragments: VP → V NP PP (probability .22) versus VP → V NP (probability .39) with NP → NP PP (probability .24).]


Fall 2001 EE669: Natural Language Processing 54

Best-First Parsing

• Algorithms can be developed that explore high-probability constituents first using a best-first strategy. The hope is that the best parse can be found quickly without using much search space and that low-probability constituents will never be created.

• The chart parser can be modified by making the agenda a priority queue (where the most probable elements are first in the queue). The parser then operates by removing the highest-ranked constituent first and adding it to the chart. (A small sketch of such an agenda appears below.)

• The previous chart parser relied on the fact that it worked from left to right, finishing earlier constituents before later ones. With the best-first strategy, this assumption does not hold, and so the algorithm must be modified.
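A minimal sketch of the agenda-as-priority-queue idea, using Python's heapq with negated probabilities so the most probable constituent is popped first; the constituent records are invented for illustration.

import heapq

agenda = []

def add_to_agenda(prob, constituent):
    """Push a constituent with its probability; heapq is a min-heap, so negate."""
    heapq.heappush(agenda, (-prob, constituent))

def next_constituent():
    """Pop the highest-probability constituent from the agenda."""
    neg_prob, constituent = heapq.heappop(agenda)
    return -neg_prob, constituent

add_to_agenda(0.21, ("NP", 1, 1))
add_to_agenda(1.00, ("V", 2, 2))
add_to_agenda(0.03, ("NP", 3, 5))

print(next_constituent())   # (1.0, ('V', 2, 2)) -- highest probability first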


Fall 2001 EE669: Natural Language Processing 55

Best-First Chart Issues

• For example, if the last word in the sentence has the highest score, it will be entered into the chart first in a best-first parser.

• This possibility suggests that a constituent needed to extend an arc may already be in the chart, requiring that whenever an active arc is added to the chart, we need to check if it can be extended by anything already in the chart.

• The arc extension algorithm must therefore be modified.


Fall 2001 EE669: Natural Language Processing 56

Arc Extension Algorithm

• To add a constituent C from position p1 to p2:
1. Insert C into the chart from position p1 to p2.
2. For any active arc of the form X → X1 ... • C ... Xn from position p0 to p1, add a new active arc X → X1 ... C • ... Xn from position p0 to p2.

• To add an active arc X → X1 ... C • C' ... Xn to the chart from p0 to p2:
1. If C is the last constituent (i.e., the arc is completed), add a new constituent of type X to the agenda.
2. Otherwise, if there is a constituent Y of category C' in the chart from p2 to p3, then recursively add an active arc X → X1 ... C C' • ... Xn from p0 to p3 (which may of course add further arcs or create further constituents).


Fall 2001 EE669: Natural Language Processing 57

Best-First Strategy

• The best-first strategy improves the efficiency of the parser: because it terminates as soon as a sentence parse is found, it creates fewer constituents than a standard chart parser.

• Even though it does not consider every possible constituent, it is guaranteed to find the highest probability parse first.

• Although conceptually simple, there are some problems with the technique in practice. For example, if we combine scores by multiplying probabilities, the scores of constituents fall as they cover more words. Hence, with large grammars, the probabilities may drop off so quickly that the search resembles breadth-first search!


Fall 2001 EE669: Natural Language Processing 58

Context-Dependent Best-First Parser

• The best-first approach may improve efficiency, but cannot improve the accuracy of the probabilistic parser.

• Improvements are possible if we use a simple alternative for computing rule probabilities that uses more context-dependent lexical information.

• In particular, we exploit the observation that the first word in a constituent is often its head and, as such, exerts an influence on the probabilities of rules that account for its complements.


Fall 2001 EE669: Natural Language Processing 59

Simple Addition of Lexical Context

• Hence, we can use the following measure, which takes into account the first word in the constituent (see the sketch after this list):

P(R | N^i, w) = C(rule R generates an N^i that starts with w) / C(N^i starts with w)

• This makes the probabilities sensitive to more lexical information, which is valuable:
– It is unusual for singular nouns to be used by themselves.
– Plurals are rarely used to modify other nouns.
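A minimal sketch of this context-dependent estimate, using counts invented purely for illustration:

from collections import Counter

# (rule, category, first_word) -> count ; the numbers are invented
counts = Counter({
    ("NP -> ART N", "NP", "the"): 76,
    ("NP -> NP PP", "NP", "the"): 24,
    ("NP -> N",     "NP", "peaches"): 65,
    ("NP -> NP PP", "NP", "peaches"): 35,
})

def rule_prob(rule, category, first_word):
    """P(R | N^i, w) = C(R builds an N^i starting with w) / C(N^i starts with w)."""
    start_count = sum(c for (r, cat, w), c in counts.items()
                      if cat == category and w == first_word)
    return counts[(rule, category, first_word)] / start_count

print(rule_prob("NP -> ART N", "NP", "the"))   # 0.76 with these invented counts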


Fall 2001 EE669: Natural Language Processing 60

Example

• Notice the probabilities for the two rules NP → N and NP → N N, given the words house and peaches. Also, context-sensitive rules can encode verb subcategorization preferences.

Rule             the    house   peaches   flowers
NP → N            0       0       .65       .76
NP → N N          0      .82       0         0
NP → NP PP       .23     .18      .35       .24
NP → ART N       .76      0        0         0

Rule             ate    bloom    like      put
VP → V           .28     .84       0        .03
VP → V NP        .57     .1       .9        .03
VP → V NP PP     .14     .05      .1        .93


Fall 2001 EE669: Natural Language Processing 61

Accuracy and Size

Strategy                               Accuracy on 84 PP          Size of chart generated for
                                       attachment problems        "The man put the bird in the house"
Full parse (taking first S found)              33%                            158
Context-free probabilities                     49%                             65
Context-dependent probabilities                66%                             36


Fall 2001 EE669: Natural Language Processing 62

Source of Improvement

• To see why the context-dependent parser does better, consider the attachment decision that must be made for sentences like:
– The man put the bird in the house.

– The man likes the bird in the house.

• The basic PCFG will assign the same structure to both, i.e., it will attach the PP to the VP. However, the context-sensitive approach will utilize subcategorization preferences for the verbs.


Fall 2001 EE669: Natural Language Processing 63

Chart for "put the bird in the house"

• The probability of attaching the PP to the VP is .54 (.93 × .99 × .76 × .76), compared to .0038 for the alternative.


Fall 2001 EE669: Natural Language Processing 64

Chart for likes the bird in the house

• The probability of attaching the PP to the VP is .054, compared to .1 for the alternative.


Fall 2001 EE669: Natural Language Processing 65

Room for Improvement

• The question is, just how accurate can this approach become? 33% error is too high. Additional information could be added to improve performance.

• One can use probabilities relative to a larger fragment of input (e.g., bigrams or trigrams for the beginnings of rules). This would require more training data.

• Attachment decisions depend not only on the word that a PP is attaching to but also on the PP itself. Hence, we could use a more complex estimate based on the previous category, the head verb, and the preposition. This creates a complication in that one cannot compare across rules easily: VP → V NP PP is evaluated using a verb and a preposition, but VP → V NP has no preposition.


Fall 2001 EE669: Natural Language Processing 66

Room for Improvement

• In general, the more selective the lexical categories, the more predictive the estimates can be, assuming that there is enough data.

• Clearly function words like prepositions, articles, quantifiers, and conjunctions can receive individual treatment; however, open class words like nouns and verbs are too numerous for individual handling.

• Words may be grouped based on similarity, or useful classes can be learned by analyzing corpora.