Tutorial - I

2nd September 2005

Problem 1: N-grams

Let C be a natural language corpus consisting of N tokens and V types $w_1, w_2, \ldots, w_V$. Let $p_i$ be the unigram probability of $w_i$ estimated from C. It is also given that $\forall i, j$: $i < j \Rightarrow p_i \ge p_j$.

a. Give an estimate for $p_i$ in terms of N, V, and i.

b. An artificial corpus $C_1$ was generated stochastically on the basis of the unigram probabilities $p_i$. Estimate the bigram probabilities $p_{ij} = P(w_i w_j)$ for $C_1$ in terms of N, V, i and j. [Hint: Use the expression for $p_i$ derived above]

Problem 1: N-grams (contd.)

c. Show that the bigram distribution of $C_1$ does not follow Zipf’s law perfectly. For this, use the estimated expression for $p_{ij}$ derived in (b).

d. It is known that natural languages exhibit a Zipfian distribution over n-grams for all n. Can you use this fact to show that the bigram characteristics of $C_1$ are different from those of C?

e. Prove the generalization of (d), i.e. “for any finite n, a stochastically generated corpus $C_n$ based on the n-gram estimates of C has different (n+1)-gram characteristics from C”. What can you infer from this about n-gram models for natural languages?

Problem 2: Problematic AND!

Given below is a toy grammar G for English.

S → NP VP | S CNJ S

VP → V NP | V S | VP CNJ VP

NP → NP CNJ NP | N

CNJ → and

N → John | Mary

V → liked | said

Problem 2: Problematic AND! (contd.)

a. Show that the sentence “John liked Mary and Mary liked John” is ambiguous for G. Point out the parse(s) that you think is/are semantically correct.

b. The sentence “John said John and Mary liked John” has the same structure as that of (a). Is the semantically valid parse for (a) also meaningful for (b)? Why or why not?

Problem 2: Problematic AND! (contd.)

c. The ambiguity arises because “and” can connect noun phrases and verb phrases as well as clauses. Can you suggest a method to resolve this (at least partially) by

i. Verb sub-categorization

ii. By introducing new POS categories (not for verbs) and augmenting G accordingly. [Assume that POS tagging is a step before parsing and the process is perfect]

Problem 3: Geo-Morph

Consider the following pairs of names of geographical locations and the corresponding terms for their dwellers. Let us call this system of morphology Geo-Morph.

Geo-root   Dweller     Geo-root   Dweller
Assam      Assamese    Georgia    Georgian
Burma      Burmese     Holland    Dutch
China      Chinese     India      Indian
Denmark    Danish      Japan      Japanese
Egypt      Egyptian    Korea      Korean
France     French      London     Londoner

Problem 3: Geo-Morph (contd.)

a. Classify Geo-Morph as derivational/inflectional and linear/non-linear system of morphology.

b. Identify the set of affixes. Classify the examples as regular and irregular cases. Classify the regular cases further by the affixes.

c. Identify the different morphological paradigms. Can you classify the Geo-roots based on their graphemic/phonemic structure into these paradigms?

d. Design rewrite rules to capture orthographic changes for these paradigms.

Problem 3: Geo-Morph (contd.)

e. Predict the dweller terms for the following Geo-roots based on the morphological system developed with the help of the paradigms and the rewrite rules (c-d). Which of them do you think are used in standard English?

o Sweden

o Oman

o Libya

o Vienna

o Europe

SOLUTIONS

Solution 1(a): N-grams

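a) $\forall i, j$: $i < j \Rightarrow p_i \ge p_j$ implies that the $w_i$ are sorted in descending order of unigram probability, i.e. of frequency. In other words, the rank (according to frequency) of $w_i$ is i.

According to Zipf’s law, frequency × rank = constant:

$$N p_i \times i = k \quad \text{(some constant)}$$

$$p_i = \frac{k}{N i} \qquad \text{(I)}$$

Since the probabilities sum to 1 and $\sum_{i=1}^{V} 1/i \approx \ln V$:

$$\sum_{i=1}^{V} p_i = \frac{k}{N} \sum_{i=1}^{V} \frac{1}{i} = 1$$

$$k \approx \frac{N}{\ln V}$$

$$p_i \approx \frac{1}{i \ln V} \qquad \text{(from I)}$$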
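The quality of the approximation in (a) can be checked numerically. The sketch below is only an illustration: the vocabulary size is a hypothetical value, and the function names are not part of the tutorial. It compares the exactly normalized Zipfian probabilities (using the harmonic sum) with the estimate 1/(i ln V).

```python
import math

def zipf_unigram_probs(V):
    """Exactly normalized Zipfian probabilities p_i = 1 / (i * H_V), i = 1..V."""
    harmonic = sum(1.0 / i for i in range(1, V + 1))  # H_V = sum of 1/i
    return [1.0 / (i * harmonic) for i in range(1, V + 1)]

def approx_unigram_prob(i, V):
    """The estimate from (a): p_i ~ 1 / (i ln V)."""
    return 1.0 / (i * math.log(V))

V = 50_000  # hypothetical vocabulary size, chosen only for illustration
exact = zipf_unigram_probs(V)
approx_total = sum(approx_unigram_prob(i, V) for i in range(1, V + 1))

print("sum of exact probabilities:  ", sum(exact))
print("sum of approximate estimates:", approx_total)
print("p_1 exact vs. approximate:   ", exact[0], approx_unigram_prob(1, V))
```

Because the harmonic sum exceeds ln V by roughly a constant, the approximate estimates add up to slightly more than 1; the gap shrinks in relative terms as V grows.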

Solution 1(b): N-grams

b) Since $C_1$ was generated stochastically based on the unigram probabilities only, any two consecutive tokens $t_s$ and $t_{s+1}$ in $C_1$ were generated independently of each other. In other words, the events $t_s = w_i$ and $t_{s+1} = w_j$ are independent.

Therefore,

$$p_{ij} = P(t_s = w_i \wedge t_{s+1} = w_j) = P(t_s = w_i)\, P(t_{s+1} = w_j) = p_i\, p_j \approx \frac{1}{ij \ln^2 V}$$

Solution 1(c): N-grams

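c) If the bigram distribution of $C_1$ is to follow Zipf’s law, then bigram-probability × bigram-rank = constant (say $k'$).

We know that $p_{ij} \approx 1/(ij \ln^2 V)$. Therefore, the first few bigram probabilities in order of rank are

$$p_{1,1},\ p_{1,2},\ p_{2,1},\ p_{3,1},\ p_{1,3},\ p_{4,1},\ \ldots$$

From the rank-1 bigram, $k' = p_{1,1} \times 1 = 1/\ln^2 V$. But then the lower ranks do not match the Zipfian prediction $k'/\text{rank}$:

$$p_{2,1} = \frac{1}{2\ln^2 V} \ne \frac{1}{3\ln^2 V}$$

$$p_{3,1} = \frac{1}{3\ln^2 V} \ne \frac{1}{4\ln^2 V}$$

$$p_{1,3} = \frac{1}{3\ln^2 V} \ne \frac{1}{5\ln^2 V}$$

Thus, the distribution does not follow Zipf’s law (nor even Mandelbrot’s law).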
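The argument in (c) can be reproduced mechanically. The following sketch assumes the estimate from (b), p_ij ≈ 1/(ij ln²V), and V at least as large as the number of ranks inspected; the function names are illustrative, and ties between equal probabilities are broken arbitrarily (here by the index pair). It lists rank × p_ij × ln²V for the top-ranked bigrams, a quantity Zipf's law would require to be constant.

```python
from fractions import Fraction

def top_bigrams(V, m):
    """Index pairs (i, j) of the m most probable bigrams when p_ij is proportional to 1/(i*j)."""
    limit = min(V, m)
    # The m pairs (1, 1..m) all have i*j <= m, so the top m pairs satisfy i*j <= m.
    pairs = [(i, j) for i in range(1, limit + 1)
                    for j in range(1, limit + 1) if i * j <= m]
    pairs.sort(key=lambda ij: (ij[0] * ij[1], ij))  # smaller i*j means higher probability
    return pairs[:m]

def zipf_products(V, m):
    """rank * p_ij * ln^2(V) for the top m bigrams; constant if Zipf's law held."""
    return [Fraction(rank, i * j)
            for rank, (i, j) in enumerate(top_bigrams(V, m), start=1)]

V, m = 1000, 6  # hypothetical vocabulary size and number of ranks to inspect
for rank, (pair, product) in enumerate(zip(top_bigrams(V, m), zipf_products(V, m)), start=1):
    print(rank, pair, product)
```

The printed products vary from rank to rank, which is exactly the violation derived above.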

Solution 1(d): N-grams

d) It follows from (c) that the bigram distribution of $C_1$ does not follow Zipf’s law, whereas that of C does. Therefore, the bigram characteristics of the two distributions must be different.

We know that for $C_1$, $p_{ij} \approx 1/(ij \ln^2 V)$.

However, just as in (a), we can estimate the bigram distribution of C from the Zipfian assumption. There are $V^2$ bigram probabilities. Therefore, writing $b_r$ for the probability of the r-th ranked bigram, we can assume that

$$b_r \approx \frac{1}{r \ln V^2} = \frac{1}{2 r \ln V}$$

But this estimate may be quite erroneous. Why?

Solution 1(e): N-grams

e) Hint: Assume Zipf’s law for n-grams. Estimate the (n+1)-gram probabilities from the n-grams (as a product of two n-gram probabilities). Now show that the (n+1)-grams do not follow Zipf’s law.

Try to prove the following (more general) results:

o Mandelbrot’s law, a generalization of Zipf’s law, says $(\text{frequency} + \rho) \times \text{rank}^{\alpha} = \text{constant}$. Prove (c), (d) and (e) when the distribution follows Mandelbrot’s law rather than Zipf’s law.

o For any finite-length corpus (i.e. when N is finite), we cannot have n-gram distributions that follow Mandelbrot’s law perfectly.

Solution 2(a): Problematic AND!

John liked Mary and Mary liked John

N V N CNJ N V N

NP V NP CNJ NP V NP

NP VP CNJ NP VP

S CNJ S

S

PARSE 1

Solution 2(a): Problematic AND!

John liked Mary and Mary liked John

N V N CNJ N V N

NP V NP CNJ NP V NP

NP V NP V NP

NP V NP VP

NP V S

NP VP

S

PARSE 2
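The ambiguity can also be confirmed by counting parses mechanically. The sketch below is one possible way to do this, not the method used in the tutorial: it encodes G as a Python dictionary and counts, with memoization, the number of distinct derivations of each span. The names GRAMMAR and count_parses are illustrative.

```python
from functools import lru_cache

# Toy grammar G from Problem 2; each right-hand side is a tuple of symbols.
GRAMMAR = {
    "S":   [("NP", "VP"), ("S", "CNJ", "S")],
    "VP":  [("V", "NP"), ("V", "S"), ("VP", "CNJ", "VP")],
    "NP":  [("NP", "CNJ", "NP"), ("N",)],
    "CNJ": [("and",)],
    "N":   [("John",), ("Mary",)],
    "V":   [("liked",), ("said",)],
}

def count_parses(sentence, grammar=GRAMMAR, start="S"):
    """Count the distinct parse trees of `sentence` rooted in `start`."""
    words = tuple(sentence.split())

    @lru_cache(maxsize=None)
    def derive(symbol, i, j):
        # Number of ways `symbol` derives words[i:j].
        if symbol not in grammar:  # terminal: must match exactly one word
            return int(j == i + 1 and words[i] == symbol)
        return sum(expand(rhs, i, j) for rhs in grammar[symbol])

    @lru_cache(maxsize=None)
    def expand(rhs, i, j):
        # Number of ways the symbol sequence `rhs` derives words[i:j].
        if len(rhs) == 1:
            return derive(rhs[0], i, j)
        # The first symbol covers words[i:k]; each remaining symbol needs >= 1 word.
        return sum(derive(rhs[0], i, k) * expand(rhs[1:], k, j)
                   for k in range(i + 1, j - len(rhs) + 2))

    return derive(start, 0, len(words))

for sentence in ("John liked Mary and Mary liked John",
                 "John said John and Mary liked John"):
    print(sentence, "->", count_parses(sentence), "parses")
```

Both sentences receive two parses under G, corresponding to PARSE 1 (clausal coordination) and PARSE 2 (noun-phrase coordination inside an embedded clause).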

Solution 2(b): Problematic AND!

John said John and Mary liked John

N V N CNJ N V N

NP V NP CNJ NP V NP

NP VP CNJ NP VP

S CNJ S

S

PARSE 1

John said John and Mary liked John

N V N CNJ N V N

NP V NP CNJ NP V NP

NP V NP V NP

NP V NP VP

NP V S

NP VP

S

PARSE 2

Solution 2(c): problematic AND!

Verb sub-categorization: The verbs liked and said belong to subcategories 1 and 2 respectively, where

VP → V NP [for V in 1]

VP → V S [for V in 2]
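The effect of sub-categorization can be checked by counting parses with and without it. In the sketch below, which is only an illustration, the verb category is split into two hypothetical pre-terminals V1 (liked, taking an NP) and V2 (said, taking an S); the labels V1 and V2 and the helper names are assumptions, not part of the tutorial's grammar.

```python
from functools import lru_cache

def make_grammar(subcategorize):
    """Toy grammar G; if `subcategorize`, V is split into V1 (takes NP) and V2 (takes S)."""
    if subcategorize:
        vp_rules = [("V1", "NP"), ("V2", "S"), ("VP", "CNJ", "VP")]
        verbs = {"V1": [("liked",)], "V2": [("said",)]}
    else:
        vp_rules = [("V", "NP"), ("V", "S"), ("VP", "CNJ", "VP")]
        verbs = {"V": [("liked",), ("said",)]}
    return {
        "S":   [("NP", "VP"), ("S", "CNJ", "S")],
        "VP":  vp_rules,
        "NP":  [("NP", "CNJ", "NP"), ("N",)],
        "CNJ": [("and",)],
        "N":   [("John",), ("Mary",)],
        **verbs,
    }

def count_parses(sentence, grammar, start="S"):
    """Count the distinct parse trees of `sentence` rooted in `start`."""
    words = tuple(sentence.split())

    @lru_cache(maxsize=None)
    def derive(symbol, i, j):
        if symbol not in grammar:  # terminal: must match exactly one word
            return int(j == i + 1 and words[i] == symbol)
        return sum(expand(rhs, i, j) for rhs in grammar[symbol])

    @lru_cache(maxsize=None)
    def expand(rhs, i, j):
        if len(rhs) == 1:
            return derive(rhs[0], i, j)
        # The first symbol covers words[i:k]; each remaining symbol needs >= 1 word.
        return sum(derive(rhs[0], i, k) * expand(rhs[1:], k, j)
                   for k in range(i + 1, j - len(rhs) + 2))

    return derive(start, 0, len(words))

SENTENCES = ("John liked Mary and Mary liked John",
             "John said John and Mary liked John")
for sentence in SENTENCES:
    before = count_parses(sentence, make_grammar(subcategorize=False))
    after = count_parses(sentence, make_grammar(subcategorize=True))
    print(f"{sentence}: {before} parses without, {after} with sub-categorization")
```

With the split, each sentence keeps a single parse: the clausal coordination for the first sentence (liked cannot take a clause) and the embedded-clause reading for the second (said cannot take a bare NP).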

POS category Augmentation: Break CNJ into two categories CNJP and CNJC for phrasal and clausal conjunctions respectively. The grammar G is augmented as:

Solution 2(c): problematic AND!

The new G for English.

S → NP VP | S CNJC S

VP → V NP | V S | VP CNJP VP

NP → NP CNJP NP | N

CNJC → and

CNJP → and

N → John | Mary

V → liked | said

Solution 2(c): Problematic AND!


N V N CNJC N V N

NP V NP CNJC NP V NP

NP VP CNJC NP VP

S CNJC S

S

Parsing using the new grammar

Solution 2(c): Problematic AND!


N V N CNJP N V N

NP V NP CNJP NP V NP

NP V NP V NP

NP V NP VP

NP V S

NP VP

S

Parsing using the new grammar



N V N CNJP N V N

NP V NP CNJP NP V NP

NP VP CNJP NP VP

S CNJP S

Cannot parse otherwise: the new G has no rule S → S CNJP S, so the clausal reading is blocked when “and” is tagged CNJP.

Solution (3ab): Geo-Morph

Derivational and linear. Irregulars are marked with *; affixes: n, ese.

Nation     Dweller     Nation     Dweller
Assam      Assamese    Georgia    Georgian
Burma      Burmese     Holland    Dutch*
China      Chinese     India      Indian
Denmark    Danish*     Japan      Japanese
Egypt      Egyptian    Korea      Korean
France     French*     London     Londoner*

Solution (3cd): Geo-Morph

c. Based on the endings of the roots, we might try to classify them into 4 paradigms [C: consonant excluding y, V: vowel including y]:

o CVa and [V/a]CC* take n

o Ca and aC take ese

d. The rewrite rules:

n → ian / C ^ _ $ (Egypt^n → Egyptian)

a → Φ / C _ ^ese (China^ese → Chinese, etc.)
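The two rewrite rules can be tried out on concatenated forms. The sketch below is a minimal illustration using regular expressions: it assumes the input is a root joined to its suffix by the boundary marker ^, treats C as any consonant letter other than y, and uses a hypothetical function name, apply_rewrite_rules.

```python
import re

# C: consonant letters, excluding y (as in the paradigm definitions of 3c).
CONSONANT = "[b-df-hj-np-tv-xz]"

def apply_rewrite_rules(form):
    """Apply both orthographic rules to a root^suffix string and drop the boundary."""
    # n -> ian / C ^ _ $      e.g. Egypt^n becomes Egypt^ian
    form = re.sub(rf"(?<={CONSONANT})\^n$", "^ian", form)
    # a -> (deleted) / C _ ^ese   e.g. China^ese becomes Chin^ese
    form = re.sub(rf"(?<={CONSONANT})a(?=\^ese$)", "", form)
    return form.replace("^", "")

for concatenated in ("Egypt^n", "China^ese", "India^n", "Japan^ese"):
    print(concatenated, "->", apply_rewrite_rules(concatenated))
```

Forms to which neither rule applies, such as India^n or Japan^ese, pass through unchanged apart from the removal of the boundary marker.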

Solution (3e): Geo-Morph

Root      Paradigm    Suffix concatenation    After rewrite    Standard form
Sweden    [V/a]CC*    Sweden^n                Swedenian        Swedish
Oman      aC          Oman^ese                Omanese          Omani?
Libya     CVa         Libya^n                 Libyan           Libyan
Vienna    Ca          Vienna^ese              Viennese         Viennese
Europe    ***         ***                     ***              European

A Problem to Ponder

Try to design a complete set of morphological rules for English Geo-Morph. How many affixes, paradigms and exceptions do you expect? Is it possible to classify the Geo-roots based solely on their graphemic/phonemic forms?
