
Page 1

Introduction to Computational Natural Language Learning

Linguistics 79400 (Under: Topics in Natural Language Processing)
Computer Science 83000 (Under: Topics in Artificial Intelligence)

The Graduate School of the City University of New York
Fall 2001

William Gregory Sakas
Hunter College, Department of Computer Science

Graduate Center, PhD Programs in Computer Science and Linguistics
The City University of New York

Page 2

Someone shot the servant of the actress who was on the balcony.
Who was on the balcony and how did they get there?

Lecture 9:

Learning the "best" parse from corpora and tree-banks

READING: Charniak, E. (1997) Statistical techniques for natural language parsing, AI Magazine.

http://www.cs.brown.edu/people/ec/papers/aimag97.ps

This is a wonderfully easy-to-read introduction to how simple patterns in corpora can be used to resolve ambiguities in tagging and parsing. (This is a must-read.)

Costa et al. (2001) Wide coverage incremental parsing by learning attachment preferences

http://www.dsi.unifi.it/~costa/online/AIIA2001.pdf

A novel approach to learning parsing preferences that incorporates an artificial neural network. Read Charniak first, but try to get started on this before the meeting after Thanksgiving.

Page 3

Review: Context-free Grammars

1. S -> NP VP

2. VP -> V NP

3. VP -> V NP NP

4. NP -> det N

5. NP -> N

6. NP -> det N N

7. NP -> NP NP

The dog ate.

The diner ate seafood.

The boy ate the fish.

Order of rules in a top-down parse with one-word look-ahead:

(see blackboard)

Just as in our first language models, ambiguity plays an important role.

10 dollars a share.

Page 4

Salespeople sold the dog biscuits.

At least three parses. (See Charniak, see blackboard).

Wide-coverage parsers can generate hundreds of parses for every sentence. See Costa et al. for some numbers from the Penn tree-bank. Most are pretty senseless.

Traditionally, non-statistically-minded NLP engineering types thought of disambiguation as a post-parsing problem.

Statistically-minded NLP engineering folk think more in terms of a continuum: parsing and disambiguation go together; it's just that some parses are more reasonable than others.
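As a concrete illustration (not from the original slides), here is a minimal sketch using NLTK's chart parser with the toy grammar from the previous slide; the lexical rules for "salespeople", "sold", "the", "dog", and "biscuits" are my own additions to make it runnable:

```python
# Minimal sketch: the toy CFG from the previous slide, plus made-up lexical
# rules, run through NLTK's chart parser to show that "salespeople sold the
# dog biscuits" really does get several parses.
import nltk

grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    VP -> V NP | V NP NP
    NP -> Det N | N | Det N N | NP NP
    Det -> 'the'
    N  -> 'salespeople' | 'dog' | 'biscuits'
    V  -> 'sold'
""")

parser = nltk.ChartParser(grammar)
parses = list(parser.parse("salespeople sold the dog biscuits".split()))
print(len(parses), "parses")      # several, most of them senseless
for tree in parses:
    print(tree)
```

With this toy grammar the sentence gets exactly three parses: sold [the dog] [biscuits] (V NP NP), sold [the dog biscuits] as det N N, and sold [[the dog] [biscuits]] as NP NP.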

Page 5

POS tagging:

The can will rust.

The: det
can: aux, noun, verb
will: aux, noun, verb
rust: noun, verb

Learning algorithm I

Input for training: A pre-tagged training corpus.

1) record the frequencies, for each word, of its parts of speech. E.g.:
   The: det 1,230
   can: aux 534, noun 56, verb 6
   etc.

2) on an unseen corpus, apply, for each word, the most frequent POS observed from step (1). For words with a frequency of 0 (they didn't appear in the training corpus), guess proper-noun.

ACHIEVES 90% (in English)!
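A minimal sketch of Learning Algorithm I (not code from the lecture; the input format, a list of (word, tag) pairs, and the tiny training data are my own assumptions):

```python
# Sketch of Learning Algorithm I: tag each word with its most frequent
# training-corpus POS; guess "proper-noun" for unseen words.
from collections import Counter, defaultdict

def train(tagged_corpus):
    """Step 1: record, for each word, how often each POS tag occurs."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word.lower()][tag] += 1
    return counts

def tag(words, counts):
    """Step 2: assign each word its most frequent tag from training;
    words never seen in training get 'proper-noun'."""
    result = []
    for w in words:
        seen = counts.get(w.lower())
        result.append(seen.most_common(1)[0][0] if seen else "proper-noun")
    return result

# Toy usage with made-up training data:
corpus = [("The", "det"), ("can", "aux"), ("can", "aux"),
          ("can", "noun"), ("will", "aux")]
print(tag("The can will rust".split(), train(corpus)))
# ['det', 'aux', 'aux', 'proper-noun']
```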

Page 6

Learning algorithm I

Input for training: A pre-tagged training corpus.

1) record the frequencies, for each word, of its parts of speech. E.g.:
   The: det 1,230
   can: aux 534, noun 56, verb 6
   etc.

2) on an unseen corpus, apply, for each word, the most frequent POS observed from step (1). For words with a frequency of 0 (they didn't appear in the training corpus), guess proper-noun.

ACHIEVES 90%! (But remember that totally unambiguous words like "the" are relatively frequent in English, which pushes the number way up.)

Easy to turn the frequencies into approximations of the probability that a POS tag is correct given the word: p(t | w). For "can" these would be:

p(aux | can)  = 534 / (534 + 56 + 6) = .90
p(noun | can) = 56 / (534 + 56 + 6)  = .09
p(verb | can) = 6 / (534 + 56 + 6)   = .01
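A quick check of that arithmetic (sketch only; the counts are the toy figures from the slide):

```python
# Turning the toy counts for "can" into estimates of p(tag | word).
counts_for_can = {"aux": 534, "noun": 56, "verb": 6}
total = sum(counts_for_can.values())                      # 596
p_tag_given_can = {t: c / total for t, c in counts_for_can.items()}
print(p_tag_given_can)   # aux ~ .90, noun ~ .09, verb ~ .01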

Page 7

Notation for: the tag (t) out of all possible t's that maximizes the probability of t given the current word (w_i) under consideration:

argmax_t p(t | w_i)

argmax_t p(t | "can")

= the tag (t) that generates the maximum of
p(aux | can)  = 534 / (534 + 56 + 6) = .90
p(noun | can) = 56 / (534 + 56 + 6)  = .09
p(verb | can) = 6 / (534 + 56 + 6)   = .01

= aux

Extending the notation to a sequence of tags:

argmax_{t_1,n}  Π_{i=1..n}  p(t_i | w_i)

This means: give the sequence of tags (t_1,n) that maximizes the product of the probabilities that a tag is correct for each word.
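In code, the per-word argmax is just a maximum over a small dictionary (a sketch with the toy numbers above; the variable names are mine):

```python
# argmax_t p(t | w): for each word independently, pick the tag with the
# highest conditional probability.
p_tag_given_word = {"can": {"aux": 0.90, "noun": 0.09, "verb": 0.01}}

def best_tag(word):
    dist = p_tag_given_word[word]
    return max(dist, key=dist.get)    # the argmax over tags

print(best_tag("can"))                # 'aux'
```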

Page 8

Hidden Markov Models (HMM)

[HMM diagram: states det, adj, and noun connected by transition arcs (with probabilities such as .475, .0016, .218, .45); each state emits words with its own probabilities: det emits "a" (.245) and "the" (.586), adj emits "large" (.004) and "small" (.005), noun emits "house" (.001) and "stock" (.001).]

Transition probabilities: p(t_i | t_i-1), e.g. p(adj | det)

Emission probabilities: p(w_i | t_i), e.g. p(large | adj), p(small | adj)
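A bigram HMM is just these two tables plus the product rule. The sketch below uses invented numbers loosely based on the diagram; it is not code from the lecture:

```python
# Bigram HMM as two probability tables (toy, invented numbers).
transition = {("det", "adj"): 0.475, ("adj", "noun"): 0.45,
              ("det", "noun"): 0.218}                            # p(t_i | t_i-1)
emission = {("det", "the"): 0.586, ("det", "a"): 0.245,
            ("adj", "large"): 0.004, ("adj", "small"): 0.005,
            ("noun", "house"): 0.001, ("noun", "stock"): 0.001}  # p(w_i | t_i)

def score(words, tags):
    """Product over positions of p(t_i | t_i-1) * p(w_i | t_i);
    the first word contributes only its emission term in this sketch."""
    p = emission.get((tags[0], words[0]), 0.0)
    for i in range(1, len(words)):
        p *= transition.get((tags[i - 1], tags[i]), 0.0)
        p *= emission.get((tags[i], words[i]), 0.0)
    return p

print(score("the large house".split(), ["det", "adj", "noun"]))
```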

Page 9

• Secretariat is expected to race tomorrow
• People continue to inquire the reason for the race for outer space

Consider:
to/TO race/????
the/Det race/????

• The naive Learning Algorithm I would simply assign the most probable tag, ignoring the preceding word, which would obviously be wrong for one of the two sentences above.

Page 10

Consider the first sentence, with the "to race."

The (simple bigram) HMM model would choose the greater of these two probabilities:

p(Verb | TO) p(race | Verb)

p(Noun | TO) p(race | Noun)

Let's look at the first expression:

How likely are we to find a verb given the previous tag TO?

We can calculate from a training corpus (or corpora) that a verb following the tag TO is about 15 times more likely:

p(Verb | TO) = .34    p(Noun | TO) = .021

The second expression:

Given each tag (Verb and Noun), ask: if we were expecting the tag Verb, how likely would the lexical item be "race"? And if we were expecting the tag Noun, how likely would the lexical item be "race"? I.e., we want the likelihoods:

p(race | Verb) = .00003 and p(race | Noun) = .00041

Page 11

Putting them together:

The bigram HMM correctly predicts that race should be a Verb, despite the fact that race as a Noun is more common:

p(Verb | TO) p(race|Verb) = 0.00001

p(Noun | TO) p(race|Noun) = 0.000007

So a bigram HMM tagger chooses the tag sequence that maximizes (it's easy to increase the number of tags looked at):

p(word|tag) p(tag|previous tag)

a bit more formally:

t_i = argmax_j P(t_j | t_i-1) P(w_i | t_j)

and, for the whole tag sequence:

argmax_{t_1,n}  Π_{i=1..n}  p(t_i | t_i-1) p(w_i | t_i)
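The whole "to race" decision fits in a few lines (a sketch, using the numbers from the previous slides):

```python
# Bigram decision for "race" after "to/TO".
p_tag_given_TO = {"Verb": 0.34, "Noun": 0.021}            # p(t | TO)
p_race_given_tag = {"Verb": 0.00003, "Noun": 0.00041}     # p(race | t)

scores = {t: p_tag_given_TO[t] * p_race_given_tag[t] for t in ("Verb", "Noun")}
print(scores)                        # Verb ~ 1.0e-05, Noun ~ 8.6e-06
print(max(scores, key=scores.get))   # 'Verb', as on the slide
```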

Page 12

After some math (application of Bayes' theorem and the chain rule) and two important simplifying assumptions (the probability of a word depends only on its tag, and the previous tag alone is enough to approximate the current tag), we have, for a whole sequence of tags t_1,n, an HMM bigram model for the predicted tag sequence given words w_1 . . . w_n:

argmax_{t_1,n}  Π_{i=1..n}  p(t_i | t_i-1) p(w_i | t_i)

Page 13

the can will rust

[Figure: tag lattice for "the can will rust", with a column of candidate tags (DT, aux, noun, verb) for each word and arcs between the tags of adjacent words.]

We want the most likely path through this graph.

Page 14

This is done by the Viterbi algorithm.

Accuracy of this method is around 96%.

But what if there is no training data from which to calculate the likelihoods of tags and of words given tags?

They can be estimated using the forward-backward algorithm, though it doesn't work too well without at least a small training set to get it started.
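The following is a minimal Viterbi sketch for a bigram HMM tagger, not the implementation used in the lecture; the tag inventory and all probabilities for "the can will rust" are invented for illustration:

```python
# Minimal Viterbi sketch for a bigram HMM tagger (toy, invented numbers).
# transition[(prev_tag, tag)] = p(tag | prev_tag); emission[(tag, word)] = p(word | tag).

def viterbi(words, tags, transition, emission, start="<s>"):
    """Return the most probable tag sequence for `words` under the bigram HMM."""
    # best[t] = (probability of the best path ending in tag t, that path)
    best = {t: (transition.get((start, t), 0.0) *
                emission.get((t, words[0]), 0.0), [t]) for t in tags}
    for w in words[1:]:
        new_best = {}
        for t in tags:
            # extend the best path ending in each possible previous tag
            p, path = max((best[prev][0] * transition.get((prev, t), 0.0),
                           best[prev][1]) for prev in tags)
            new_best[t] = (p * emission.get((t, w), 0.0), path + [t])
        best = new_best
    return max(best.values())[1]

# Toy usage for "the can will rust" with invented probabilities:
tags = ["det", "noun", "aux", "verb"]
transition = {("<s>", "det"): 0.8, ("det", "noun"): 0.6, ("det", "aux"): 0.1,
              ("noun", "aux"): 0.3, ("noun", "verb"): 0.2, ("aux", "verb"): 0.4}
emission = {("det", "the"): 0.6, ("noun", "can"): 0.01, ("aux", "can"): 0.2,
            ("noun", "will"): 0.01, ("aux", "will"): 0.3,
            ("noun", "rust"): 0.01, ("verb", "rust"): 0.05}
print(viterbi("the can will rust".split(), tags, transition, emission))
# ['det', 'noun', 'aux', 'verb']
```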

Page 15

PCFGs

Given that there can be many, many parses of a sentence under a typical, large CFG, we should pick the most probable parse, as defined by:

Probability of a parse of a sentence = the product of the probabilities of all the rules that were applied to expand each constituent.

p(π, s) = Π_{c in π} p(r(c))

where p(π, s) is the probability of a parse π of sentence s, p(r(c)) is the probability of expanding constituent c by the context-free rule r(c), and the product runs over all constituents c in parse π.

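A sketch of that product (not from the slides): a parse is represented as nested (label, children) tuples with POS tags as leaves, and the toy rule probabilities are the ones on the next slide:

```python
# Probability of a parse = product of the probabilities of the rules used
# to expand each constituent (toy rule probabilities; tree encoding is mine).
rule_prob = {("S", ("NP", "VP")): 1.0,
             ("VP", ("V", "NP")): 0.8,
             ("NP", ("det", "N")): 0.5}

def parse_prob(tree):
    """tree is (label, [children]); children are sub-trees or bare POS-tag strings."""
    if isinstance(tree, str):          # a POS-tag leaf: no rule to apply
        return 1.0
    label, children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    p = rule_prob[(label, rhs)]
    for child in children:
        p *= parse_prob(child)
    return p

# "The boy ate the fish": S -> NP VP, NP -> det N (twice), VP -> V NP
tree = ("S", [("NP", ["det", "N"]),
              ("VP", ["V", ("NP", ["det", "N"])])])
print(parse_prob(tree))               # 1.0 * 0.5 * 0.8 * 0.5 = 0.2
```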

Page 16

1. S -> NP VP (1.0)

2. VP -> V NP (0.8)

3. VP -> V NP NP (0.2)

4. NP -> det N (0.5)

5. NP -> N (0.3)

6. NP -> det N N (0.15)

7. NP -> NP NP (0.05)

How do we come by these probabilities? Simply count the number of times they are applied in a tree-bank.

For example, if the rule NP -> det N is used 1,000 times, and overall NP -> X (i.e. any NP rule) is applied 2,000 times, then the probability of NP -> det N = 0.5.

Some example "toy" probabilites attached to CF rules.

Apply to "Salespeople sold the dog biscuits" (see Charniak and blackboard).
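A sketch of that counting step using NLTK trees (the two tiny hand-written trees below stand in for a real tree-bank such as the Penn tree-bank; they are not from the slides):

```python
# Estimate PCFG rule probabilities: count each rule's applications in a
# tree-bank and divide by the total count of its left-hand-side category.
import nltk
from collections import Counter

treebank = [
    nltk.Tree.fromstring("(S (NP (det the) (N dog)) (VP (V ate)))"),
    nltk.Tree.fromstring(
        "(S (NP (det the) (N boy)) (VP (V ate) (NP (det the) (N fish))))"),
]

rule_counts, lhs_counts = Counter(), Counter()
for tree in treebank:
    for prod in tree.productions():      # every CF rule used in the tree
        if prod.is_nonlexical():         # skip lexical rules like det -> 'the'
            rule_counts[prod] += 1
            lhs_counts[prod.lhs()] += 1

for rule, count in rule_counts.items():
    print(rule, count / lhs_counts[rule.lhs()])
# VP -> V and VP -> V NP each get 0.5 from this tiny tree-bank; with real
# counts (e.g. NP -> det N used 1,000 times out of 2,000 NP expansions)
# the estimate would be 1000/2000 = 0.5.
```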

Page 17

Basic tree-bank grammars parse surprisingly well (at around 75%), but often mispredict the correct parse (according to humans).

The most troublesome report may be the August merchandise trade deficit due out tomorrow.

Tree-bank grammar gets the (incorrect) reading:

His worst nightmare may be [the telephone bill due] over $200.

deficit/N   due/ADJ   out/PP   tomorrow/NP

Trees on the blackboard, or they can be constructed from Charniak.

Page 18

Preferences depend on many factors

• On type of verb:
The women kept the dogs on the beach
The women discussed the dogs on the beach

Kept:
(1) Kept the dogs which were on the beach
(2) Kept them (the dogs), while on the beach

Discussed:
(1) Discussed the dogs which were on the beach
(2) Discussed them (the dogs), while on the beach

Page 19

Verb-argument relations

• Subcategorization

• But the verb also selects for the type of the Prepositional Phrase (Hindle and Rooth)

• Or, even more deeply, seems to depend on the frequency of semantic associations:

The actress delivered flowers threw them in the trash
The postman delivered flowers threw them in the trash

Page 20

On ‘selectional’ restrictions

• "walking on air"; "skating on ice", vs. "eating on ice"
• Verb takes a certain kind of argument

• Subject sometimes must be certain type:

John admires honesty vs. ?? Honesty admires John