Tutorial - I, 2nd September 2005

Page 1: Tutorial - I

Tutorial - I

2nd September 2005

Page 2: Tutorial - I

Problem 1: N-grams

Let C be a natural language corpus consisting of N tokens and V types $w_1, w_2, \ldots, w_V$. Let $p_i$ be the unigram probability of $w_i$ estimated from C. It is also given that $\forall i, j:\ i < j \Rightarrow p_i \ge p_j$.

a. Give an estimate for pi in terms of N, V, and i.

b. An artificial corpus $C_1$ was generated stochastically on the basis of the unigram probabilities $p_i$. Estimate the bigram probabilities $p_{ij} = P(w_i w_j)$ for $C_1$ in terms of N, V, i, and j. [Hint: Use the expression for $p_i$ derived above.]


Page 3: Tutorial - I

Problem 1: N-grams (contd.)

c. Show that the bigram distribution of $C_1$ does not follow Zipf’s law perfectly. For this, use the estimated expression for $p_{ij}$ derived in (b).

d. It is known that natural languages exhibit a Zipfian distribution over n-grams for all n. Can you use this fact to show that the bigram characteristics of $C_1$ are different from those of C?

e. Prove the generalization of (d), i.e. “for any finite n, a stochastically generated corpus $C_n$ based on the n-gram estimates of C has different (n+1)-gram characteristics from C”. What can you infer from this about n-gram models for natural languages?


Page 4: Tutorial - I

Problem 2: Problematic AND!

Given below is a toy grammar G for English.

S → NP VP | S CNJ S

VP → V NP | V S | VP CNJ VP

NP → NP CNJ NP | N

CNJ → and

N → John | Mary

V → liked | said

Page 5: Tutorial - I

Problem 2: Problematic AND! (contd.)

a. Show that the sentence “John liked Mary and Mary liked John” is ambiguous for G. Point out the parse(s) that you think is/are semantically correct.

b. The sentence “John said John and Mary liked John” has the same structure as that of (a). Is the semantically valid parse for (a) also meaningful for (b)? Why or why not?


Page 6: Tutorial - I

Problem 2: Problematic AND! (contd.)

c. The ambiguity arises because “and” can connect noun phrases and verb phrases as well as clauses. Can you suggest a method to resolve this (at least partially) by

i. verb sub-categorization

ii. introducing new POS categories (not for verbs) and augmenting G accordingly? [Assume that POS tagging is a step before parsing and that the process is perfect.]


Page 7: Tutorial - I

Problem 3: Geo-Morph

Consider the following pairs of names of geographical locations and the corresponding terms for their dwellers. Let us call this system of morphology Geo-Morph.

Geo-root   Dweller     Geo-root   Dweller
Assam      Assamese    Georgia    Georgian
Burma      Burmese     Holland    Dutch
China      Chinese     India      Indian
Denmark    Danish      Japan      Japanese
Egypt      Egyptian    Korea      Korean
France     French      London     Londoner

Page 8: Tutorial - I

Problem 3: Geo-Morph (contd.)

a. Classify Geo-Morph as derivational/inflectional and linear/non-linear system of morphology.

b. Identify the set of affixes. Classify the examples as regular and irregular cases. Classify the regular cases further by the affixes.

c. Identify the different morphological paradigms. Can you classify the Geo-roots based on their graphemic/phonemic structure into these paradigms?

d. Design rewrite rules to capture orthographic changes for these paradigms.


Page 9: Tutorial - I

Problem 3: Geo-Morph (contd.)

e. Predict the dweller terms for the following Geo-roots based on the morphological system developed with the help of the paradigms and the rewrite rules (c-d). Which of them do you think are used in standard English?

o Sweden

o Oman

o Libya

o Vienna

o Europe


Page 10: Tutorial - I

SOLUTIONS

Page 11: Tutorial - I

Solution 1(a): N-grams

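a) $\forall i, j:\ i < j \Rightarrow p_i \ge p_j$ implies that the $w_i$ are sorted in descending order of unigram probability, i.e. of frequency. In other words, the rank (according to frequency) of $w_i$ is i.

According to Zipf’s law, frequency × rank = constant. The frequency of $w_i$ is $N p_i$ and the probabilities sum to 1, so

$$
\begin{aligned}
N p_i \cdot i &= k \quad \text{(some constant)} \\
\Rightarrow\ p_i &= \frac{k}{N i} \qquad \text{(I)} \\
\sum_{i=1}^{V} p_i &= \frac{k}{N} \sum_{i=1}^{V} \frac{1}{i} \approx \frac{k}{N} \ln V = 1 \\
\Rightarrow\ k &\approx \frac{N}{\ln V} \\
\Rightarrow\ p_i &\approx \frac{1}{i \ln V} \qquad \text{(from I)}
\end{aligned}
$$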
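The quality of the logarithmic approximation in this estimate can be checked numerically. The sketch below is only an illustration: the vocabulary size and the function names are assumptions, not part of the tutorial. It compares an exactly normalized Zipfian distribution (using the harmonic number) with the estimate 1/(i ln V).

```python
import math

def zipf_exact(V):
    """Zipfian probabilities 1/(i * H_V), normalized exactly by the harmonic number H_V."""
    h = sum(1.0 / i for i in range(1, V + 1))
    return [1.0 / (i * h) for i in range(1, V + 1)]

def zipf_log_estimate(V):
    """The estimate 1/(i ln V), which replaces H_V by ln V."""
    return [1.0 / (i * math.log(V)) for i in range(1, V + 1)]

V = 10_000  # hypothetical vocabulary size, chosen only for illustration
exact = zipf_exact(V)
approx = zipf_log_estimate(V)

total_exact = sum(exact)    # 1 up to rounding error
total_approx = sum(approx)  # equals H_V / ln V, slightly above 1

print(f"sum of exact probabilities:  {total_exact:.6f}")
print(f"sum of ln V based estimates: {total_approx:.6f}")
```

The second sum exceeds 1 because ln V underestimates the harmonic sum, so the estimate is a convenient closed form rather than an exact distribution.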

Page 12: Tutorial - I

Solution 1(b): N-grams

b) Since $C_1$ was generated stochastically based on the unigram probabilities only, any two consecutive tokens $t_s$ and $t_{s+1}$ in $C_1$ were generated independently of each other. In other words, the events $t_s = w_i$ and $t_{s+1} = w_j$ are independent.

Therefore,

$$
\begin{aligned}
p_{ij} &= P(t_s = w_i \wedge t_{s+1} = w_j) \\
&= P(t_s = w_i)\, P(t_{s+1} = w_j) \\
&= p_i\, p_j \\
&\approx \frac{1}{i j \ln^2 V}
\end{aligned}
$$
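The independence argument can be illustrated with a small simulation. This is a sketch under stated assumptions: the values of V and N, the fixed seed, and all function names are hypothetical, and types are represented by their ranks 1..V. It samples a corpus from Zipfian unigram probabilities and compares the observed bigram relative frequencies with the products of the unigram probabilities.

```python
import random
from collections import Counter

def zipf_probs(V):
    """Exactly normalized Zipfian unigram probabilities for ranks 1..V."""
    h = sum(1.0 / i for i in range(1, V + 1))
    return [1.0 / (i * h) for i in range(1, V + 1)]

def generate_unigram_corpus(probs, N, seed=0):
    """Draw N tokens independently; a token is represented by its rank."""
    rng = random.Random(seed)
    ranks = list(range(1, len(probs) + 1))
    return rng.choices(ranks, weights=probs, k=N)

def bigram_estimates(corpus):
    """Relative frequencies of adjacent token pairs."""
    counts = Counter(zip(corpus, corpus[1:]))
    total = len(corpus) - 1
    return {pair: c / total for pair, c in counts.items()}

V, N = 50, 100_000  # illustrative sizes only
p = zipf_probs(V)
c1 = generate_unigram_corpus(p, N)
p_hat = bigram_estimates(c1)

for i, j in [(1, 1), (1, 2), (2, 1), (3, 1)]:
    print(f"p_{i},{j}: observed {p_hat[(i, j)]:.4f}, product {p[i - 1] * p[j - 1]:.4f}")
```

The observed values fluctuate around the products because of sampling noise, which shrinks as N grows.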

Page 13: Tutorial - I

Solution 1(c): N-grams

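c) If the bigram distribution of $C_1$ is to follow Zipf’s law, then bigram-probability × bigram-rank = constant (say $k'$).

We know that $p_{ij} \approx 1/(ij \ln^2 V)$. Therefore, the first few bigram probabilities in order of rank are

$p_{1,1},\ p_{1,2},\ p_{2,1},\ p_{3,1},\ p_{1,3},\ p_{4,1},\ \ldots$

From rank 1, $k' = p_{1,1} \times 1 = 1/\ln^2 V$, so Zipf’s law predicts a probability of $k'/r$ at rank r. But then

$p_{2,1} = 1/(2\ln^2 V) \ne 1/(3\ln^2 V)$ (rank 3)

$p_{3,1} = 1/(3\ln^2 V) \ne 1/(4\ln^2 V)$ (rank 4)

$p_{1,3} = 1/(3\ln^2 V) \ne 1/(5\ln^2 V)$ (rank 5)

Thus, the bigram distribution of $C_1$ does not follow Zipf’s law (nor even Mandelbrot’s law).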
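The same comparison can be done mechanically for more ranks. The following sketch (the value of V and the names are assumptions made for illustration) sorts the estimated bigram probabilities 1/(ij ln²V) and prints probability × rank relative to the value at rank 1; under Zipf’s law every ratio would be 1.

```python
import math

def ranked_bigram_probs(V):
    """Estimated bigram probabilities 1/(i*j*ln^2 V), sorted in descending order."""
    ln_sq = math.log(V) ** 2
    probs = [1.0 / (i * j * ln_sq)
             for i in range(1, V + 1)
             for j in range(1, V + 1)]
    return sorted(probs, reverse=True)

V = 100  # illustrative vocabulary size
ranked = ranked_bigram_probs(V)
k_prime = ranked[0] * 1  # probability x rank at rank 1

# Ratio (probability x rank) / k' for the first six ranks.
ratios = [p * r / k_prime for r, p in enumerate(ranked[:6], start=1)]
for r, ratio in enumerate(ratios, start=1):
    print(f"rank {r}: probability x rank = {ratio:.3f} k'")
```

Ties such as $p_{1,2} = p_{2,1}$ occupy consecutive ranks, which is what makes the product drift away from a constant.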

Page 14: Tutorial - I

Solution 1(d): N-grams

d) It follows from (c) that the bigram distribution of $C_1$ does not follow Zipf’s law, whereas that of C does. Therefore, the bigram characteristics of the two distributions must be different.

We know that for $C_1$, $p_{ij} \approx 1/(ij \ln^2 V)$.

However, just as in (a), we can estimate the bigram distribution of C from the Zipfian assumption. There are $V^2$ probabilities.

Therefore, we can assume that

$b_r \approx \frac{1}{2 r \ln V}$

where $b_r$ is the probability of the r-th bigram.

But, this estimate may be quite erroneous. Why?
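For reference, the expression for $b_r$ appears to follow from repeating the normalization of (a) over bigram ranks. The sketch below assumes that all $V^2$ possible bigrams actually occur in C and that the harmonic sum can be replaced by a logarithm; $k''$ is a hypothetical name for the Zipf constant of the bigram distribution.

```latex
\begin{aligned}
b_r \cdot r &= k'' \quad\Rightarrow\quad b_r = \frac{k''}{r} \\
\sum_{r=1}^{V^2} b_r &= k'' \sum_{r=1}^{V^2} \frac{1}{r}
  \approx k'' \ln V^2 = 2 k'' \ln V = 1 \\
\Rightarrow\ k'' &\approx \frac{1}{2 \ln V}
  \quad\Rightarrow\quad b_r \approx \frac{1}{2 r \ln V}
\end{aligned}
```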

Page 15: Tutorial - I

Solution 1(e): N-grams

e) Hint: Assume Zipf’s law for n-grams. Estimate the (n+1)-gram probabilities from the n-gram probabilities (product of two n-gram probabilities). Now show that the (n+1)-grams do not follow Zipf’s law.

Try to prove the following (more general) results:

o Mandelbrot’s law, a generalization of Zipf’s law, says frequency × $(\text{rank} + \rho)^{\alpha}$ = constant. Prove (c), (d) and (e) when the distribution follows Mandelbrot’s law rather than Zipf’s law.

o For any finite-length corpus (i.e. when N is finite), we cannot have n-gram distributions that follow Mandelbrot’s law perfectly.

Page 16: Tutorial - I

Solution 2(a): Problematic AND!

John liked Mary and Mary liked John

N V N CNJ N V N

NP V NP CNJ NP V NP

NP VP CNJ NP VP

S CNJ S

S

PARSE 1

Page 17: Tutorial - I

Solution 2(a): Problematic AND!

John liked Mary and Mary liked John

N V N CNJ N V N

NP V NP CNJ NP V NP

NP V NP V NP

NP V NP VP

NP V S

NP VP

S

PARSE 2

Page 18: Tutorial - I

Solution 2(b): Problematic AND!

John said John and Mary liked John

N V N CNJ N V N

NP V NP CNJ NP V NP

NP VP CNJ NP VP

S CNJ S

S

PARSE 1

Page 19: Tutorial - I

Solution 2(b): Problematic AND!

John said John and Mary liked John

N V N CNJ N V N

NP V NP CNJ NP V NP

NP V NP V NP

NP V NP VP

NP V S

NP VP

S

PARSE 2
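The ambiguity of both sentences under G can be confirmed by counting parse trees. The sketch below is a minimal memoized span-based counter, not a general parser; the dictionary encoding of G and the names GRAMMAR, LEXICON and count_parses are assumptions introduced for illustration. It relies on G having no empty productions and no unary cycles.

```python
from functools import lru_cache

# Toy grammar G from Problem 2, encoded as symbol -> list of right-hand sides.
GRAMMAR = {
    "S": [("NP", "VP"), ("S", "CNJ", "S")],
    "VP": [("V", "NP"), ("V", "S"), ("VP", "CNJ", "VP")],
    "NP": [("NP", "CNJ", "NP"), ("N",)],
}
LEXICON = {
    "and": {"CNJ"},
    "John": {"N"}, "Mary": {"N"},
    "liked": {"V"}, "said": {"V"},
}

def count_parses(sentence, grammar=GRAMMAR, lexicon=LEXICON, start="S"):
    """Count the distinct parse trees of the sentence rooted at the start symbol."""
    words = tuple(sentence.split())

    @lru_cache(maxsize=None)
    def count(sym, i, j):
        # Number of trees rooted at sym that span words[i:j].
        total = 0
        if j - i == 1 and sym in lexicon.get(words[i], ()):
            total += 1
        for rhs in grammar.get(sym, ()):
            total += count_seq(rhs, i, j)
        return total

    @lru_cache(maxsize=None)
    def count_seq(rhs, i, j):
        # Number of ways the symbol sequence rhs can span words[i:j].
        if len(rhs) == 1:
            return count(rhs[0], i, j)
        # The first symbol takes words[i:k]; each remaining symbol needs >= 1 word.
        return sum(count(rhs[0], i, k) * count_seq(rhs[1:], k, j)
                   for k in range(i + 1, j - len(rhs) + 2))

    return count(start, 0, len(words))

print(count_parses("John liked Mary and Mary liked John"))  # 2
print(count_parses("John said John and Mary liked John"))   # 2
```

Both sentences receive exactly the two parses shown above, since G does not distinguish “liked” from “said” or phrasal from clausal “and”.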

Page 20: Tutorial - I

Solution 2(c): Problematic AND!

Verb Sub-categorization: The verbs liked and said belong to subcategories 1 and 2 respectively, where

VP → V NP [for V in 1]

VP → V S [for V in 2]

POS category Augmentation: Break CNJ into two categories CNJP and CNJC for phrasal and clausal conjunctions respectively. The grammar G is augmented as:

Page 21: Tutorial - I

Solution 2(c): Problematic AND!

The new G for English.

S → NP VP | S CNJC S

VP → V NP | V S | VP CNJP VP

NP → NP CNJP NP | N

CNJC → and

CNJP → and

N → John | Mary

V → liked | said

Page 22: Tutorial - I

Solution 2(c): Problematic AND!

John liked Mary and Mary liked John

N V N CNJC N V N

NP V NP CNJC NP V NP

NP VP CNJC NP VP

S CNJC S

S

Parsing using the new grammar

Page 23: Tutorial - I

Solution 2(c): Problematic AND!

John said John and Mary liked John

N V N CNJP N V N

NP V NP CNJP NP V NP

NP V NP V NP

NP V NP VP

NP V S

NP VP

S

Parsing using the new grammar

Page 24: Tutorial - I

Solution 2(c): Problematic AND!

John said John and Mary liked John

N V N CNJP N V N

NP V NP CNJP NP V NP

NP VP CNJP NP VP

S CNJP S

The derivation is stuck here: the new G has no rule S → S CNJP S, so the sentence cannot be parsed this way.
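Both remedies from 2(c) can be tried out by enumerating parse trees. The sketch below is an illustration only: the grammar encodings SUBCAT and NEW_G, the category names V1 and V2, and the helper names are assumptions. For the POS-augmentation case the input is a tag sequence, reflecting the assumption that a perfect tagger has already chosen CNJC or CNJP.

```python
def leaf(sym, tok, lexicon):
    """Return a leaf tree if token tok can be read as category sym, else None."""
    if tok == sym:                   # the token is already a POS tag
        return sym
    if sym in lexicon.get(tok, ()):  # the token is a word listed under sym
        return f"({sym} {tok})"
    return None

def trees(sym, toks, grammar, lexicon):
    """Yield every bracketed tree rooted at sym that covers the tuple toks."""
    if len(toks) == 1:
        t = leaf(sym, toks[0], lexicon)
        if t is not None:
            yield t
    for rhs in grammar.get(sym, ()):
        for kids in splits(rhs, toks, grammar, lexicon):
            yield f"({sym} {' '.join(kids)})"

def splits(rhs, toks, grammar, lexicon):
    """Yield one list of subtrees per way of dividing toks among the symbols of rhs."""
    if len(rhs) == 1:
        for t in trees(rhs[0], toks, grammar, lexicon):
            yield [t]
        return
    for k in range(1, len(toks) - len(rhs) + 2):
        for first in trees(rhs[0], toks[:k], grammar, lexicon):
            for rest in splits(rhs[1:], toks[k:], grammar, lexicon):
                yield [first] + rest

# (i) Verb sub-categorization: V1 takes an NP, V2 takes a clause.
SUBCAT = {
    "S": [("NP", "VP"), ("S", "CNJ", "S")],
    "VP": [("V1", "NP"), ("V2", "S"), ("VP", "CNJ", "VP")],
    "NP": [("NP", "CNJ", "NP"), ("N",)],
}
SUBCAT_LEX = {"John": {"N"}, "Mary": {"N"}, "and": {"CNJ"},
              "liked": {"V1"}, "said": {"V2"}}

# (ii) POS augmentation: the new G, parsed over tag sequences.
NEW_G = {
    "S": [("NP", "VP"), ("S", "CNJC", "S")],
    "VP": [("V", "NP"), ("V", "S"), ("VP", "CNJP", "VP")],
    "NP": [("NP", "CNJP", "NP"), ("N",)],
}

sent_a = tuple("John liked Mary and Mary liked John".split())
sent_b = tuple("John said John and Mary liked John".split())
subcat_a = list(trees("S", sent_a, SUBCAT, SUBCAT_LEX))
subcat_b = list(trees("S", sent_b, SUBCAT, SUBCAT_LEX))
pos_clausal = list(trees("S", tuple("N V N CNJC N V N".split()), NEW_G, {}))
pos_phrasal = list(trees("S", tuple("N V N CNJP N V N".split()), NEW_G, {}))

for result in (subcat_a, subcat_b, pos_clausal, pos_phrasal):
    print(len(result), result)
```

Under these encodings each input keeps a single parse: sub-categorization removes the reading in which “liked” takes a clause or “said” takes a bare NP, while the tag split removes the reading that conflicts with the tagger’s choice of conjunction type.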

Page 25: Tutorial - I

Solution (3ab): Geo-Morph

Geo-Morph is derivational and linear. Affixes: n, ese. Irregulars (shown in red on the original slide) are marked with * here, as inferred from the affix set.

Nation    Dweller     Nation    Dweller
Assam     Assamese    Georgia   Georgian
Burma     Burmese     Holland   Dutch*
China     Chinese     India     Indian
Denmark   Danish*     Japan     Japanese
Egypt     Egyptian    Korea     Korean
France    French*     London    Londoner*

Page 26: Tutorial - I

Solution (3cd): Geo-Morph

c. Based on the endings of the roots, we might try to classify them into 4 paradigms [C: consonants excluding y; V: vowels including y; V/a: any vowel other than a]:

o CVa and [V/a]CC* take n

o Ca and aC take ese

d. The rewrite rules:

o n → ian / C ^ _ $ (Egypt^n → Egyptian)

o a → Φ / C _ ^ ese (China^ese → Chinese, etc.)

Page 27: Tutorial - I

Solution (3e): Geo-Morph

Root     Paradigm   Suffix concatenation   After rewrite   Standard form
Sweden   [V/a]CC*   Sweden^n               Swedenian       Swedish
Oman     aC         Oman^ese               Omanese         Omani?
Libya    CVa        Libya^n                Libyan          Libyan
Vienna   Ca         Vienna^ese             Viennese        Viennese
Europe   ***        ***                    ***             European
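The paradigms and rewrite rules from (c) and (d) are simple enough to run as code. The sketch below is one possible reading of them, using regular expressions; the character classes, the PARADIGMS table and the function names are assumptions, with y treated as a vowel as stated in (c). Roots that match no paradigm return no prediction.

```python
import re

C = "[b-df-hj-np-tv-xz]"  # consonants, excluding y
V = "[aeiouy]"            # vowels, including y
V_NOT_A = "[eiouy]"       # vowels other than a, written [V/a] in the slides

# (paradigm name, pattern on the lowercased root ending, suffix)
PARADIGMS = [
    ("CVa", re.compile(f"{C}{V}a$"), "n"),
    ("[V/a]CC*", re.compile(f"{V_NOT_A}{C}+$"), "n"),
    ("Ca", re.compile(f"{C}a$"), "ese"),
    ("aC", re.compile(f"a{C}$"), "ese"),
]

def rewrite(concat):
    """Apply the two orthographic rules to a root^suffix string."""
    # n -> ian / C ^ _ $   (Egypt^n -> Egypt^ian)
    concat = re.sub(rf"({C})\^n$", r"\1^ian", concat)
    # a -> (deleted) / C _ ^ ese   (China^ese -> Chin^ese)
    concat = re.sub(rf"({C})a\^ese$", r"\1^ese", concat)
    return concat.replace("^", "")

def dweller(root):
    """Return (paradigm, predicted dweller term), or (None, None) if no paradigm fits."""
    for name, pattern, suffix in PARADIGMS:
        if pattern.search(root.lower()):
            return name, rewrite(f"{root}^{suffix}")
    return None, None

for root in ["Sweden", "Oman", "Libya", "Vienna", "Europe"]:
    print(root, dweller(root))
```

With this reading the regular training pairs are reproduced, the irregular roots Denmark, Holland and France fall outside every paradigm, and London is predicted as “Londonian” rather than the attested “Londoner”.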

Page 28: Tutorial - I

A Problem to Ponder

Try to design a complete set of morphological rules for English Geo-Morph.

o How many affixes, paradigms and exceptions do you expect?

o Is it possible to classify the Geo-roots based solely on their graphemic/phonemic forms?