Chapter 5: Collocations ∗
presented by Dustin Boswell
May 3, 2004
∗from Foundations of Statistical Natural Language Processing
What is a Collocation?
- “An expression consisting of two or more words that correspond to some conventional way of saying things.” - Ch. 5 of FSNLP
- “Collocations of a given word are statements of the habitual or customary places of that word.” - Firth (1957)
- “A phrase that means more than the sum of its parts.” - Dustin
There is no exact definition.
Examples: “strong tea”, “New York”,
“weapons of mass destruction”, etc.
1
What is a Collocation?
Characteristics:
• non-compositionality
Eg: white wine (wine isn’t white...)
• non-substitutability
Eg: yellow wine (substituting yellow for white doesn’t work)
• non-modifiability
Eg: I have a (slimy?) frog in my throat
Non-Examples:
• of the
• doctor ... nurse
(related words are simply co-occurrences)
2
Why care about Collocations?
• Sentence Parsing:
Helps identify noun/verb phrases.
• Natural Language Generation & Translation:
Avoid awkward output like
“powerful tea” or “to take a decision”
• Dictionary Building:
Identify phrases that essentially act like
individual words
3
How do we find Collocations?
• Counting frequencies of adjacent words
• Mutual Information between words
• Hypothesis Testing
– t test
– t test of differences
– χ2 test
– likelihood ratios
4
Counting Frequencies of Adjacent Words
- Method: Simply choose the most frequent adjacent pairs.
- Difficulty: function words are so frequent that pairs like “of the” dominate the list.
- Results:
5
Counting Frequencies of Adjacent Words
- Fix: only look for phrases with special “part of speech patterns”
- Results:
6
Counting Frequencies of Adjacent Words
Summary
+ Easy to implement
+ Gets the simple cases right
- Too sensitive to frequent bigrams. (strong man)
- Ignores rare bigrams
7
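As an illustration of this approach (not part of the original slides), here is a minimal Python sketch of both variants: raw bigram counting, and counting restricted to adjective-noun / noun-noun part-of-speech patterns. It assumes NLTK and its default tagger are available; the tag patterns shown are a small subset of the Justeson and Katz filter described in FSNLP.

```python
# Illustrative sketch only; assumes NLTK (and its tagger data) is installed.
from collections import Counter
import nltk

def frequent_bigrams(tokens, top_n=20):
    """Rank adjacent word pairs by raw frequency (the naive method)."""
    return Counter(zip(tokens, tokens[1:])).most_common(top_n)

def pos_filtered_bigrams(tokens, top_n=20):
    """Keep only bigrams whose POS pattern looks phrase-like
    (adjective-noun or noun-noun), then rank by frequency."""
    tagged = nltk.pos_tag(tokens)  # [(word, tag), ...]
    good = {("JJ", "NN"), ("JJ", "NNS"), ("NN", "NN"),
            ("NN", "NNS"), ("NNP", "NNP")}
    counts = Counter((w1, w2)
                     for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
                     if (t1, t2) in good)
    return counts.most_common(top_n)
```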
Pointwise Mutual Information
PMI(w1, w2) = log2 [ P(w1, w2) / ( P(w1) P(w2) ) ]
w1 and w2 are values of random variables, like word tokens.
Don’t confuse this with the usual Mutual Information:
MI(W1; W2) = E[ PMI(w1, w2) ]
= Σ_{w1, w2} P(w1, w2) log2 [ P(w1, w2) / ( P(w1) P(w2) ) ]
W1 and W2 are random variables, like word locations.
8
Pointwise Mutual Information
- Method: Choose bigrams with the highest PMI (denoted I(w1, w2) in FSNLP)
- Results:
9
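A sketch of the PMI ranking in Python (an illustration, not from the slides), with probabilities estimated by simple counting over a token list:

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Return {(w1, w2): PMI} for every adjacent pair in the token list,
    using the estimates P(w) = C(w)/N and P(w1, w2) = C(w1, w2)/N."""
    N = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {(w1, w2): math.log2((c / N) / ((unigrams[w1] / N) * (unigrams[w2] / N)))
            for (w1, w2), c in bigrams.items()}

# Method from this slide: rank pairs by PMI, highest first.
# pmi = bigram_pmi(tokens); best = sorted(pmi, key=pmi.get, reverse=True)[:20]
```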
Pointwise Mutual Information
- Difficulty: PMI is too sensitive to rare bigrams
- The problem is that P(w1, w2) / ( P(w1) P(w2) ) easily becomes large for infrequent individual words.
10
Pointwise Mutual Information
- Possible Fixes:
• Ignore bigrams that occur less than (say) 20 times
• Redefine PMI′(w1, w2) = C(w1, w2) × PMI(w1, w2)
11
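The two fixes, applied to the bigram_pmi sketch above (illustrative; `tokens` is an assumed token list, and the cutoff of 20 is the one suggested on the slide):

```python
from collections import Counter

pmi = bigram_pmi(tokens)                 # from the earlier sketch
counts = Counter(zip(tokens, tokens[1:]))  # tokens: an assumed list of corpus words

# Fix 1: ignore bigrams that occur fewer than 20 times.
frequent = [b for b in pmi if counts[b] >= 20]
# Fix 2: rank by C(w1, w2) * PMI(w1, w2) instead of raw PMI.
ranked = sorted(frequent, key=lambda b: counts[b] * pmi[b], reverse=True)
```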
Hypothesis Testing
- We really just want to know if words collocate
more often than chance.
- Define a null hypothesis H0 that says two words
are independent:
P(w1 w2) = P(w1) P(w2)
- If (w1, w2) is a collocation, the hypothesis should
be rejected at some significance level.
12
Hypothesis Testing: The t test
• H0: we have data coming from a normal
distribution with mean µ.
• Data: we observe N points with sample mean x,
and sample variance s²
• Compute the t statistic: t = (x − µ) / √(s² / N)
• If t is greater than some threshold, reject H0.
• The threshold (looked up in a table) is 2.576 for large
N and a 99.5% confidence level.
13
The t test Applied to Collocations
- Our statistic is the frequency of the bigram.
• µ is the frequency assuming the words are
independent
• x is the observed frequency
• s² is the observed variance (of this “binomial”)
• The ’frequency’ is really just the probability as
calculated by simply counting:
P(w1 w2) = Count(w1 w2) / N
14
Hypothesis Testing: The t test: Example
• We have a corpus with
– N = 14 million words.
– C(new) = 15,828
– C(companies) = 4675
• H0: “new companies” occurs with probability
µ = P(new) P(companies)
= (15,828 / 14 million) × (4675 / 14 million)
≈ 3.6 × 10^−7
15
Hypothesis Testing: The t test: Example
• Data: we observe 8 occurrences of new companies, so
x = 8 / 14 million ≈ 5.6 × 10^−7.
• For a Bernoulli trial, s² = p(1 − p) ≈ p (for small p).
• Compute the t statistic:
t = (x − µ) / √(s² / N)
= (5.6 × 10^−7 − 3.6 × 10^−7) / √( 5.6 × 10^−7 / 14 million )
≈ 1.00
• t is not greater than 2.576, so we cannot reject H0:
new companies does not qualify as a collocation by this test.
16
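The same computation as a small Python function (illustrative; the counts are those given on the slide, and the small discrepancy from 1.00 comes from rounding):

```python
import math

def bigram_t(c1, c2, c12, N):
    """t statistic for the bigram w1 w2 under the independence hypothesis H0."""
    mu = (c1 / N) * (c2 / N)   # expected bigram probability if independent
    x = c12 / N                # observed bigram probability
    s2 = x * (1 - x)           # Bernoulli variance, roughly x for small x
    return (x - mu) / math.sqrt(s2 / N)

print(bigram_t(15828, 4675, 8, 14_000_000))   # roughly 1; below 2.576
```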
Hypothesis Testing: The t test
- Method: Choose bigrams with highest t-statistic
- Results:
17
Hypothesis Testing: The t test of differences
• Consider two words with similar meaning:
strong and powerful
• We want to find collocates that best distinguish
the usage of the two.
Ex: powerful computer vs. strong computer
• H0: we expect both pairs to occur just as
frequently.
• Compute a similar t statistic:
t = (x1 − x2) / √( (s1² + s2²) / N )
• Find words (like computer) with the highest t score.
18
Hypothesis Testing: The t test of differences
- Results:
19
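A sketch of the differences test (illustrative; it uses the Bernoulli approximation s² ≈ x from the earlier slides, under which the statistic reduces to a comparison of raw counts):

```python
import math

def t_difference(count1, count2, N):
    """t statistic comparing, e.g., C(powerful computer) with C(strong computer)."""
    x1, x2 = count1 / N, count2 / N            # the two observed frequencies
    s2_sum = x1 + x2                           # s1^2 + s2^2, with s^2 ~ x
    return (x1 - x2) / math.sqrt(s2_sum / N)   # equals (c1 - c2) / sqrt(c1 + c2)
```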
Hypothesis Testing: Pearson’s chi-square test
• t-test has been criticized because it assumes the data is
normally distributed.
• Pearson’s χ2 test also starts by assuming words are
independent.
• First, compute a table of observed values:
20
Hypothesis Testing: Pearson’s chi-square test
• The X² statistic is computed as
X² = Σ_{i,j} (O_ij − E_ij)² / E_ij
where i and j range over all rows and columns of the table.
• Oij is the observed value (in the table)
• Eij is the expected value (if words were truly
independent). For example, to compute E11:
E(new companies) = ( C(new) / N ) × ( C(companies) / N ) × N
= (3.6 × 10^−7) × 14 million
≈ 5.2
21
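A sketch of the X² computation for the 2-by-2 bigram table (illustrative Python; the four cells are reconstructed from the counts c1 = C(w1), c2 = C(w2), c12 = C(w1 w2), and corpus size N):

```python
def chi_square(c1, c2, c12, N):
    """Pearson's X^2 for the 2x2 table: rows = first word is / is not w1,
    columns = second word is / is not w2."""
    observed = [[c12,      c1 - c12],
                [c2 - c12, N - c1 - c2 + c12]]
    rows = [sum(r) for r in observed]
    cols = [sum(col) for col in zip(*observed)]
    x2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / N   # E11 is roughly C(w1) * C(w2) / N
            x2 += (observed[i][j] - expected) ** 2 / expected
    return x2
```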
Hypothesis Testing: Pearson’s chi-square test
• Again, if the X² statistic is above some threshold we
reject independence and accept the pair as a collocation.
• The top 20 collocations are the same for χ2 and t tests.
• But the χ2 test is considered more robust, and is more
frequently used.
22
Hypothesis Testing: Pearson’s chi-square test
• Consider another application: Learning word-to-word
translations from an aligned corpus.
• Here are observations for how often the French vache
was aligned with the English cow:
• X² = 456400 (very high), so (vache, cow) is a likely
translation pair.
23
Hypothesis Testing: Likelihood ratios
• Another approach to hypothesis testing
• We consider two hypotheses:
– Hypothesis 1. (Two words are independent.)
p = P(w2|w1) = P(w2|¬w1) = P(w2)
– Hypothesis 2. (w2 depends on w1.)
p1 = P(w2|w1)
p2 = P(w2|¬w1)
p1 ≠ p2
24
Hypothesis Testing: Likelihood ratios
Quick Notation
• c1 = C(w1)
• c2 = C(w2)
• c12 = C(w1w2)
• We will use the binomial model
b(k; n, p) = (n choose k) p^k (1 − p)^(n−k)
- A coin is biased to heads with probability p.
- Flip the coin n times.
- b(k;n, p) is the probability of k heads.
25
The World According to H1
• What we expect:
– P(w2|w1) = P(w2) = p = c2 / N
– P(w2|¬w1) = P(w2) = p = c2 / N
• In actuality, w1w2 occurred c12 times - how likely is this?
• We are assuming a binomial distribution:
– Each time w1 appears, w2 should follow with prob p.
– Each time ¬w1 appears, w2 should follow with prob p.
26
The Likelihood of our data, according to H1
• Of the c1 times w1 occurred, w2 followed c12 times.
• This should happen with probability b(c12; c1, p).
• Of the C(¬w1) = N − c1 times ¬w1 occurred,
w2 followed c2 − c12 times.
• This should happen with probability b(c2− c12;N − c1, p).
• The total probability (likelihood) of all the data is
simply the product:
L(H1) = P( all the times we saw w2 )
= b(c12; c1, p) × b(c2 − c12; N − c1, p)
27
The World According to H2
• What we expect:
– P(w2|w1) = p1 = c12 / c1
– P(w2|¬w1) = p2 = (c2 − c12) / (N − c1)
• In actuality, w1w2 occurred c12 times - how likely is this?
• Well, we are assuming a binomial distribution:
– Each time w1 appears, w2 should follow with prob p1.
– Each time ¬w1 appears, w2 should follow with prob p2.
28
The Likelihood of our data, according to H2
• Of the c1 times w1 occurred, w2 followed c12 times.
• This should happen with probability b(c12; c1, p1).
• Of the C(¬w1) = N − c1 times ¬w1 occurred,
w2 followed c2 − c12 times.
• This should happen with probability
b(c2 − c12;N − c1, p2).
• The total probability (likelihood) of all the data is
simply the product:
L(H2) = P( all the times we saw w2 )
= b(c12; c1, p1) × b(c2 − c12; N − c1, p2)
29
The Likelihood Ratio - general hypothesis testing
• We are given two hypotheses and some data.
• We have no reason to believe one over the other
(P (H1) = P (H2))
• We pick H2 if
P(H2 | data) / P(H1 | data) = [ P(data | H2) P(H2) ] / [ P(data | H1) P(H1) ]
= P(data | H2) / P(data | H1)
is large.
30
The Likelihood Ratio
• We want bigrams with a large ratio L(H2) / L(H1)
• To make the numbers nice, we equivalently find bigrams
with large −2 log( L(H1) / L(H2) )
• It turns out that the value −2 log( L(H1) / L(H2) ) is
“asymptotically χ² distributed” (more so than the X² statistic).
31
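Slides 24 through 31 combined into one illustrative Python sketch. It works in log space, and the binomial coefficients are dropped because they are identical in L(H1) and L(H2) and cancel in the ratio:

```python
import math

def log_b(k, n, p):
    """log b(k; n, p) without the (n choose k) term, which cancels in the ratio."""
    p = min(max(p, 1e-12), 1 - 1e-12)   # guard against log(0)
    return k * math.log(p) + (n - k) * math.log(1 - p)

def likelihood_ratio(c1, c2, c12, N):
    """-2 log( L(H1) / L(H2) ) for the bigram w1 w2."""
    p  = c2 / N                  # H1: P(w2|w1) = P(w2|not w1) = P(w2)
    p1 = c12 / c1                # H2: P(w2|w1)
    p2 = (c2 - c12) / (N - c1)   # H2: P(w2|not w1)
    log_L1 = log_b(c12, c1, p)  + log_b(c2 - c12, N - c1, p)
    log_L2 = log_b(c12, c1, p1) + log_b(c2 - c12, N - c1, p2)
    return -2 * (log_L1 - log_L2)
```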
Using The Likelihood Ratio
32
Conclusion
• Counting Frequencies of adjacent words
- too sensitive to frequent pairs
• Mutual Information between words
- too sensitive to rare words
• Hypothesis Testing
– t test - okay
– χ2 test - better
– likelihood ratios - even better
33
The end
Questions?
34