Chapter 5: Collocations ∗
presented by Dustin Boswell
May 3, 2004
∗from Foundations of Statistical Natural Language Processing
What is a Collocation?
- “An expression consisting of two or more words that correspond to some conventional way of saying things.” - Ch. 5 of FSNLP
- “Collocations of a given word are statements of the habitual or customary places of that word.” - Firth (1957)
- “A phrase that means more than the sum of its parts.” - Dustin
There is no exact definition.
Examples: “strong tea”, “New York”,
“weapons of mass destruction”, etc.
1
What is a Collocation?
Characteristics:
• non-compositionality
Eg: white wine (wine isn’t white...)
• non-substitutability
Eg: yellow wine (substituting yellow for white doesn’t work)
• non-modifiability
Eg: I have a (slimy?) frog in my throat
Non-Examples:
• of the
• doctor ... nurse
(related words are simply co-occurrences)
2
Why care about Collocations?
• Sentence Parsing:
Helps identify noun/verb phrases.
• Natural Language Generation & Translation:
Avoid awkward output like
“powerful tea” or “to take a decision”
• Dictionary Building:
Identify phrases that essentially act like
individual words
3
How do we find Collocations?
• Counting frequencies of adjacent words
• Mutual Information between words
• Hypothesis Testing
– t test
– t test of differences
– χ2 test
– likelihood ratios
4
Counting Frequencies of Adjacent Words
- Method: Simply choose the most frequent adjacent pairs.
- Difficulty: function words are so frequent that pairs like “of the” dominate the list.
- Results:
5
Counting Frequencies of Adjacent Words
- Fix: only look for phrases with special “part of speech patterns”
- Results:
6
Counting Frequencies of Adjacent Words
Summary
+ Easy to implement
+ Gets the simple cases right
- Too sensitive to frequent bigrams. (strong man)
- Ignores rare bigrams
7
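As an illustration of this approach (not part of the original slides), here is a minimal Python sketch of both variants: raw bigram counting, and counting restricted to adjective-noun / noun-noun part-of-speech patterns. It assumes NLTK and its default tagger are available; the tag patterns shown are a small subset of the Justeson and Katz filter described in FSNLP.

```python
# Illustrative sketch only; assumes NLTK (and its tagger data) is installed.
from collections import Counter
import nltk

def frequent_bigrams(tokens, top_n=20):
    """Rank adjacent word pairs by raw frequency (the naive method)."""
    return Counter(zip(tokens, tokens[1:])).most_common(top_n)

def pos_filtered_bigrams(tokens, top_n=20):
    """Keep only bigrams whose POS pattern looks phrase-like
    (adjective-noun or noun-noun), then rank by frequency."""
    tagged = nltk.pos_tag(tokens)  # [(word, tag), ...]
    good = {("JJ", "NN"), ("JJ", "NNS"), ("NN", "NN"),
            ("NN", "NNS"), ("NNP", "NNP")}
    counts = Counter((w1, w2)
                     for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
                     if (t1, t2) in good)
    return counts.most_common(top_n)
```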
Pointwise Mutual Information
PMI(w1, w2) = log2 [ P(w1, w2) / ( P(w1) P(w2) ) ]
w1 and w2 are values of random variables, like word tokens.
Don’t confuse this with the usual Mutual Information:
MI(W1; W2) = E[ PMI(w1, w2) ]
= Σ_{w1, w2} P(w1, w2) log2 [ P(w1, w2) / ( P(w1) P(w2) ) ]
W1 and W2 are random variables, like word locations.
8
Pointwise Mutual Information
- Method: Choose bigrams with the highest PMI (denoted I(w1, w2) in FSNLP)
- Results:
9
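A sketch of the PMI ranking in Python (an illustration, not from the slides), with probabilities estimated by simple counting over a token list:

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Return {(w1, w2): PMI} for every adjacent pair in the token list,
    using the estimates P(w) = C(w)/N and P(w1, w2) = C(w1, w2)/N."""
    N = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {(w1, w2): math.log2((c / N) / ((unigrams[w1] / N) * (unigrams[w2] / N)))
            for (w1, w2), c in bigrams.items()}

# Method from this slide: rank pairs by PMI, highest first.
# pmi = bigram_pmi(tokens); best = sorted(pmi, key=pmi.get, reverse=True)[:20]
```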
Pointwise Mutual Information
- Difficulty: PMI is too sensitive to rare bigrams
- The problem is that P(w1, w2) / ( P(w1) P(w2) ) easily becomes large for infrequent individual words.
10
Pointwise Mutual Information
- Possible Fixes:
• Ignore bigrams that occur less than (say) 20 times
• Redefine PMI′(w1, w2) = C(w1, w2) × PMI(w1, w2)
11
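The two fixes, applied to the bigram_pmi sketch above (illustrative; `tokens` is an assumed token list, and the cutoff of 20 is the one suggested on the slide):

```python
from collections import Counter

pmi = bigram_pmi(tokens)                 # from the earlier sketch
counts = Counter(zip(tokens, tokens[1:]))  # tokens: an assumed list of corpus words

# Fix 1: ignore bigrams that occur fewer than 20 times.
frequent = [b for b in pmi if counts[b] >= 20]
# Fix 2: rank by C(w1, w2) * PMI(w1, w2) instead of raw PMI.
ranked = sorted(frequent, key=lambda b: counts[b] * pmi[b], reverse=True)
```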
Hypothesis Testing
- We really just want to know if words collocate
more often than chance.
- Define a null hypothesis H0 that says two words
are independent:
P(w1 w2) = P(w1) P(w2)
- If (w1, w2) is a collocation, the hypothesis should
be rejected at some significance level.
12
Hypothesis Testing: The t test
• H0: we have data coming from a normal
distribution with mean µ.
• Data: we observe N points with sample mean x,
and sample variance s²
• Compute the t statistic: t = (x − µ) / √(s² / N)
• If t is greater than some threshold, reject H0.
• The threshold (looked up in a table) is 2.576 for large
N and a 99.5% confidence level.
13
The t test Applied to Collocations
- Our statistic is the frequency of the bigram.
• µ is the frequency assuming the words are
independent
• x is the observed frequency
• s² is the observed variance (of this “binomial”)
• The ’frequency’ is really just the probability as
calculated by simply counting:
P(w1 w2) = Count(w1 w2) / N
14
Hypothesis Testing: The t test: Example
• We have a corpus with
– N = 14 million words.
– C(new) = 15,828
– C(companies) = 4675
• H0: “new companies” occurs with probability
µ = P(new) P(companies)
= (15,828 / 14 million) × (4675 / 14 million)
≈ 3.6 × 10^−7
15
Hypothesis Testing: The t test: Example
• Data: we observe 8 occurrences of new companies, so
x = 8 / 14 million ≈ 5.6 × 10^−7.
• For a Bernoulli trial, s² = p(1 − p) ≈ p (for small p).
• Compute the t statistic:
t = (x − µ) / √(s² / N)
= (5.6 × 10^−7 − 3.6 × 10^−7) / √( 5.6 × 10^−7 / 14 million )
≈ 1.00
• t is not greater than 2.576, so we cannot reject H0:
new companies does not qualify as a collocation by this test.
16
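The same computation as a small Python function (illustrative; the counts are those given on the slide, and the small discrepancy from 1.00 comes from rounding):

```python
import math

def bigram_t(c1, c2, c12, N):
    """t statistic for the bigram w1 w2 under the independence hypothesis H0."""
    mu = (c1 / N) * (c2 / N)   # expected bigram probability if independent
    x = c12 / N                # observed bigram probability
    s2 = x * (1 - x)           # Bernoulli variance, roughly x for small x
    return (x - mu) / math.sqrt(s2 / N)

print(bigram_t(15828, 4675, 8, 14_000_000))   # roughly 1; below 2.576
```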
Hypothesis Testing: The t test
- Method: Choose bigrams with highest t-statistic
- Results:
17
Hypothesis Testing: The t test of differences
• Consider two words with similar meaning:
strong and powerful
• We want to find collocates that best distinguish
the usage of the two.
Ex: powerful computer vs. strong computer
• H0: we expect both pairs to occur just as
frequently.
• Compute a similar t statistic:
t = (x1 − x2) / √( (s1² + s2²) / N )
• Find words (like computer) with the highest t score.
18
Hypothesis Testing: The t test of differences
- Results:
19
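A sketch of the differences test (illustrative; it uses the Bernoulli approximation s² ≈ x from the earlier slides, under which the statistic reduces to a comparison of raw counts):

```python
import math

def t_difference(count1, count2, N):
    """t statistic comparing, e.g., C(powerful computer) with C(strong computer)."""
    x1, x2 = count1 / N, count2 / N            # the two observed frequencies
    s2_sum = x1 + x2                           # s1^2 + s2^2, with s^2 ~ x
    return (x1 - x2) / math.sqrt(s2_sum / N)   # equals (c1 - c2) / sqrt(c1 + c2)
```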
Hypothesis Testing: Pearson’s chi-square test
• t-test has been criticized because it assumes the data is
normally distributed.
• Pearson’s χ2 test also starts by assuming words are
independent.
• First, compute a table of observed values:
20
Hypothesis Testing: Pearson’s chi-square test
• The X² statistic is computed as
X² = Σ_{i,j} (O_ij − E_ij)² / E_ij
where i and j range over all rows and columns of the table.
• Oij is the observed value (in the table)
• Eij is the expected value (if words were truly
independent). For example, to compute E11:
E(new companies) = ( C(new) / N ) × ( C(companies) / N ) × N
= (3.6 × 10^−7) × 14 million
≈ 5.2
21
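A sketch of the X² computation for the 2-by-2 bigram table (illustrative Python; the four cells are reconstructed from the counts c1 = C(w1), c2 = C(w2), c12 = C(w1 w2), and corpus size N):

```python
def chi_square(c1, c2, c12, N):
    """Pearson's X^2 for the 2x2 table: rows = first word is / is not w1,
    columns = second word is / is not w2."""
    observed = [[c12,      c1 - c12],
                [c2 - c12, N - c1 - c2 + c12]]
    rows = [sum(r) for r in observed]
    cols = [sum(col) for col in zip(*observed)]
    x2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / N   # E11 is roughly C(w1) * C(w2) / N
            x2 += (observed[i][j] - expected) ** 2 / expected
    return x2
```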
Hypothesis Testing: Pearson’s chi-square test
• Again, if the X² statistic is above some threshold we
reject independence and accept the pair as a collocation.
• The top 20 collocations are the same for χ2 and t tests.
• But the χ2 test is considered more robust, and is more
frequently used.
22
Hypothesis Testing: Pearson’s chi-square test
• Consider another application: Learning word-to-word
translations from an aligned corpus.
• Here are observations for how often the French vache
was aligned with the English cow:
• X² = 456400 (very high), so (vache, cow) is a likely
translation pair.
23
Hypothesis Testing: Likelihood ratios
• Another approach to hypothesis testing
• We consider two hypotheses:
– Hypothesis 1. (Two words are independent.)
p = P(w2|w1) = P(w2|¬w1) = P(w2)
– Hypothesis 2. (w2 depends on w1.)
p1 = P(w2|w1)
p2 = P(w2|¬w1)
p1 ≠ p2
24
Hypothesis Testing: Likelihood ratios
Quick Notation
• c1 = C(w1)
• c2 = C(w2)
• c12 = C(w1w2)
• We will use the binomial model
b(k; n, p) = (n choose k) p^k (1 − p)^(n−k)
- A coin is biased to heads with probability p.
- Flip the coin n times.
- b(k;n, p) is the probability of k heads.
25
The World According to H1
• What we expect:
– P(w2|w1) = P(w2) = p = c2 / N
– P(w2|¬w1) = P(w2) = p = c2 / N
• In actuality, w1w2 occurred c12 times - how likely is this?
• We are assuming a binomial distribution:
– Each time w1 appears, w2 should follow with prob p.
– Each time ¬w1 appears, w2 should follow with prob p.
26
The Likelihood of our data, according to H1
• Of the c1 times w1 occurred, w2 followed c12 times.
• This should happen with probability b(c12; c1, p).
• Of the C(¬w1) = N − c1 times ¬w1 occurred,
w2 followed c2 − c12 times.
• This should happen with probability b(c2− c12;N − c1, p).
• The total probability (likelihood) of all the data is
simply the product:
L(H1) = P( all the times we saw w2 )
= b(c12; c1, p) × b(c2 − c12; N − c1, p)
27
The World According to H2
• What we expect:
– P(w2|w1) = p1 = c12 / c1
– P(w2|¬w1) = p2 = (c2 − c12) / (N − c1)
• In actuality, w1w2 occurred c12 times - how likely is this?
• Well, we are assuming a binomial distribution:
– Each time w1 appears, w2 should follow with prob p1.
– Each time ¬w1 appears, w2 should follow with prob p2.
28
The Likelihood of our data, according to H2
• Of the c1 times w1 occurred, w2 followed c12 times.
• This should happen with probability b(c12; c1, p1).
• Of the C(¬w1) = N − c1 times ¬w1 occurred,
w2 followed c2 − c12 times.
• This should happen with probability
b(c2 − c12;N − c1, p2).
• The total probability (likelihood) of all the data is
simply the product:
L(H2) = P( all the times we saw w2 )
= b(c12; c1, p1) × b(c2 − c12; N − c1, p2)
29
The Likelihood Ratio - general hypothesis testing
• We are given two hypotheses and some data.
• We have no reason to believe one over the other
(P (H1) = P (H2))
• We pick H2 if
P(H2 | data) / P(H1 | data) = [ P(data | H2) P(H2) ] / [ P(data | H1) P(H1) ]
= P(data | H2) / P(data | H1)
is large.
30
The Likelihood Ratio
• We want bigrams with a large ratio L(H2) / L(H1)
• To make the numbers nice, we equivalently find bigrams
with large −2 log( L(H1) / L(H2) )
• It turns out that the value −2 log( L(H1) / L(H2) ) is
“asymptotically χ² distributed” (more so than the X² statistic).
31
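Slides 24 through 31 combined into one illustrative Python sketch. It works in log space, and the binomial coefficients are dropped because they are identical in L(H1) and L(H2) and cancel in the ratio:

```python
import math

def log_b(k, n, p):
    """log b(k; n, p) without the (n choose k) term, which cancels in the ratio."""
    p = min(max(p, 1e-12), 1 - 1e-12)   # guard against log(0)
    return k * math.log(p) + (n - k) * math.log(1 - p)

def likelihood_ratio(c1, c2, c12, N):
    """-2 log( L(H1) / L(H2) ) for the bigram w1 w2."""
    p  = c2 / N                  # H1: P(w2|w1) = P(w2|not w1) = P(w2)
    p1 = c12 / c1                # H2: P(w2|w1)
    p2 = (c2 - c12) / (N - c1)   # H2: P(w2|not w1)
    log_L1 = log_b(c12, c1, p)  + log_b(c2 - c12, N - c1, p)
    log_L2 = log_b(c12, c1, p1) + log_b(c2 - c12, N - c1, p2)
    return -2 * (log_L1 - log_L2)
```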
Using The Likelihood Ratio
32
Conclusion
• Counting Frequencies of adjacent words
- too sensitive to frequent pairs
• Mutual Information between words
- too sensitive to rare words
• Hypothesis Testing
– t test - okay
– χ2 test - better
– likelihood ratios - even better
33
The end
Questions?
34