
Albert Gatt, Corpora and Statistical Methods




In this lecture
We have considered distributions of words and lexical variation in corpora. Today we consider collocations:
- definition and characteristics
- measures of collocational strength
- experiments on corpora
- hypothesis testing

Part 1: Collocations: definition and characteristics

A motivating example
Consider phrases such as:
- strong tea vs. ?powerful tea
- strong support vs. ?powerful support
- powerful drug vs. ?strong drug

Traditional semantic theories have difficulty accounting for these patterns:
- strong and powerful seem to be near-synonyms
- do we claim they have different senses?
- what is the crucial difference?

The empiricist view of meaning
Firth's view (1957): "You shall know a word by the company it keeps."

This is a contextual view of meaning, akin to that espoused by Wittgenstein (1953).

In the Firthian tradition, attention is paid to patterns that crop up with regularity in language.Contrast symbolic/rationalist approaches, emphasising polysemy, componential analysis, etc.

Statistical work on collocations tends to follow this tradition.

Defining collocations
Collocations are "statements of the habitual or customary places of [a] word" (Firth 1957).

Characteristics/expectations:
- regular/frequently attested
- occur within a narrow window (a span of a few words)
- not fully compositional
- non-substitutable
- non-modifiable
- display category restrictions

Frequency and regularity
We know that language is regular (non-random) and rule-based; this aspect is emphasised by rationalist approaches to grammar.

We also need to acknowledge that frequency of usage is an important factor in language development. Why do big and large collocate differently with different nouns?

Regularity/frequency
f(strong tea) > f(powerful tea)

f(credit card) > f(credit bankruptcy)

f(white wine) > f(yellow wine) (even though white wine is actually yellowish)

Narrow window (textual proximity)
Usually, we specify an n-gram window within which to analyse collocations:
- bigram: credit card, credit crunch
- trigram: credit card fraud, credit card expiry

The idea is to look at co-occurrence of words within a specific n-gram window

We can also count n-grams with intervening words, e.g. the pattern federal (.*) subsidy matches: federal subsidy, federal farm subsidy, federal manufacturing subsidy.

Textual proximity (continued)
Usually collocates of a word occur close to that word, though they may still occur across a span.

Examples:
- bigrams: white wine, powerful tea
- across a span: knock on the door; knock on X's door
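As a toy illustration, the intervening-word pattern mentioned above can be written as a regular expression; the pattern below (allowing at most one intervening word) and the example strings are invented:

```python
import re

# "federal (.*) subsidy" with at most one intervening word
pattern = re.compile(r"\bfederal\b(?:\s+\w+)?\s+subsidy\b")

texts = [
    "a federal subsidy was approved",
    "the federal farm subsidy expired",
    "a federal manufacturing subsidy",
    "a state subsidy",
]
matches = [bool(pattern.search(t)) for t in texts]
print(matches)   # the first three match; the last does not
```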

Non-compositionality
white wine: not really white; the meaning is not fully predictable from the component words plus syntax.

signal interpretation: a term used in Intelligent Signal Processing whose connotations go beyond the compositional meaning.

Similarly: regression coefficient, good practice guidelines.

Extreme cases: idioms such as kick the bucket, whose meaning is completely frozen.

Non-substitutability
If a phrase is a collocation, we can't substitute a word in the phrase for a near-synonym and still have the same overall meaning.

E.g.:
- white wine vs. yellow wine
- powerful tea vs. strong tea

Non-modifiability
Often there are restrictions on inserting additional lexical items into the collocation, especially in the case of idioms.

Example: kick the bucket vs. ?kick the large bucket

NB: this is a matter of degree! Non-idiomatic collocations are more flexible.

Category restrictions
Frequency alone doesn't indicate collocational strength: by the is a very frequent phrase in English, but not a collocation.

Collocations tend to be formed from content words:
- A+N: powerful tea
- N+N: regression coefficient, mass demonstration
- N+PREP+N: degrees of freedom

Collocations in a broad sense
In many statistical NLP applications, the term collocation is quite broadly understood: any phrase which is frequent/regular enough:
- proper names (New York)
- compound nouns (elevator operator)
- set phrases (part of speech)
- idioms (kick the bucket)

Why are collocations interesting?
Several applications need to know about collocations:

terminology extraction: technical or domain-specific phrases crop up frequently in text (oil prices)

document classification: specialist phrases are good indicators of the topic of a text

named entity recognition: names such as New York tend to occur together frequently; phrases like new toy don't

Example application: parsing
She spotted the man with a pair of binoculars
(1) [VP spotted [NP the man [PP with a pair of binoculars]]]
(2) [VP spotted [NP the man] [PP with a pair of binoculars]]

A parser might prefer (2) if spot/binoculars are frequent co-occurrences in a window of a certain width.

Example application: generation
NLG systems often need to map a semantic representation to a lexical/syntactic one. They shouldn't use the wrong adjective-noun combinations: clean face vs. ?immaculate face.

Lapata et al. (1999):
- an experiment asking people to rate different adjective-noun combinations
- frequency of the combination was a strong predictor of people's preferences
- they argue that NLG systems need to be able to make contextually-informed decisions in lexical choice

Finding collocations in corpora: basic methods

Frequency-based approach
Motivation: if two (or three, or more) words occur together a lot within some window, they're a collocation.

Problems: frequent "collocations" under this definition include with the, onto a, etc., which are not very interesting.

Improving the frequency-based approach
Justeson & Katz (1995): a part-of-speech filter; only look at word combinations of the right category:
- N + N: regression coefficient
- N + PRP + N: jack in (the) box
This dramatically improves the results: content-word combinations are more likely to be phrases.

Case study: strong vs. powerful
See Manning & Schütze '99, Sec 5.2. Motivation: try to distinguish the meanings of two quasi-synonyms, with data from the New York Times corpus.
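The Justeson & Katz filter can be sketched as follows, assuming the input has already been POS-tagged; the tag set and the toy token list are invented for illustration:

```python
from collections import Counter

# (word, tag) pairs for a toy tagged text
tagged = [("strong", "A"), ("tea", "N"), ("is", "V"), ("on", "P"),
          ("the", "DET"), ("table", "N"), ("near", "P"),
          ("strong", "A"), ("tea", "N")]

# bigram tag patterns we keep (J&K also allow e.g. A A N, N P N)
ALLOWED = {("A", "N"), ("N", "N")}

bigrams = Counter()
for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
    if (t1, t2) in ALLOWED:
        bigrams[(w1, w2)] += 1

print(bigrams.most_common())   # only content-word bigrams survive
```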

Basic strategy:
- find all bigrams where w1 = strong or powerful
- apply a POS filter to remove strong on [crime], powerful in [industry], etc.

Case study (cont'd)
Sample results from Manning & Schütze '99:
f(strong support) = 50
f(strong supporter) = 10
f(powerful force) = 13
f(powerful computers) = 10

Teaser: would you also expect powerful supporter? What's the difference between strong supporter and powerful supporter?

Limitations of frequency-based search
This only works for fixed phrases. But collocations can be looser, allowing interpolation of other words:
- knock on [the/X's/a] door
- pull [a] punch

Simple frequency won't do for these: different interpolated words dilute the frequency.

Using mean and variance
General idea: include bigrams even at a distance, i.e. the pattern w1 X w2, e.g. pull a punch.

Strategy:
- find co-occurrences of the two words in windows of varying length
- compute the mean offset between w1 and w2
- compute the variance of the offset between w1 and w2
- if the offsets are randomly distributed, variance is high and we conclude that ⟨w1, w2⟩ is not a collocation
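The mean/variance idea in the strategy above can be sketched with invented offset lists (offset = position of w1 minus position of w2):

```python
from statistics import mean, stdev

# Invented offsets for two word pairs across a corpus window.
tight = [-1, -1, -2, -1, -1]    # e.g. strong ... opposition: low spread
loose = [-3, 1, -2, 4, -1]      # e.g. strong ... for: offsets all over

print(mean(tight), round(stdev(tight), 2))   # low SD: likely collocation
print(mean(loose), round(stdev(loose), 2))   # high SD: probably not
```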

Example outcomes (M&S '99)
Position of strong with respect to opposition: mean = -1.15, standard deviation = 0.67, i.e. most occurrences are strong [...] opposition.

Position of strong with respect to for: mean = -1.12, standard deviation = 2.15, i.e. for occurs anywhere around strong (the SD is higher than the mean): we can get strong support for, for the strong support, etc.

More limitations of frequency
If we use simple frequency or mean & variance, we have a good way of ranking likely collocations.

But how do we know if a frequent pattern is frequent enough? Is it above what would be predicted by chance?

We need to think in terms of hypothesis testing. Given ⟨w1, w2⟩, we want to compare:
- the hypothesis that they are non-independent
- the hypothesis that they are independent

Preliminaries: hypothesis testing and the binomial distribution

Permutations
Suppose we have the 5 words {the, dog, ate, a, bone}. How many permutations (possible orderings) are there of these words?
- the dog ate a bone
- dog the ate a bone
- ...

E.g. there are 5! = 120 ways of permuting 5 words.

Binomial coefficient
A slight variation: how many different choices of three words are there out of these 5? This is known as an "n choose k" problem, in our case 5 choose 3:

C(5, 3) = 5! / (3!(5 - 3)!)

For our problem, this gives us 10 ways of choosing three items out of 5
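Both counts can be checked directly with Python's math module:

```python
import math

print(math.factorial(5))   # 120 orderings of 5 words
print(math.comb(5, 3))     # 10 ways of choosing 3 words out of 5
```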

Bernoulli trials
A Bernoulli (or binomial) trial is like a coin flip. Features:
- there are two possible outcomes (not necessarily equally likely), e.g. success/failure or 1/0
- if the situation is repeated, the likelihoods of the two outcomes are stable

Sampling with(out) replacement
Suppose we're interested in the probability of pulling out a function word from a corpus of 100 words, where we pull out words one by one without putting them back.

Is this a Bernoulli trial?
- we have a notion of success/failure: w is either a function word (success) or not (failure)
- but our chances aren't the same across trials: they diminish, since we sample without replacement
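A quick arithmetic check of how stable the success probability stays across draws on a large corpus (the 52-million-word figures are the slides' running example):

```python
# Drawing function words without replacement from a large corpus:
# the success probability barely moves between trials.
N, F = 52_000_000, 13_000_000     # corpus size, number of function words

p1 = F / N                        # first draw
p2 = (F - 1) / (N - 1)            # second draw, after one success
print(p1, p2)                     # 0.25 vs. 0.24999998...
```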

Cutting corners
If the sample (e.g. the corpus) is large enough, we can assume a Bernoulli situation even if we sample without replacement. Suppose our corpus has 52 million words, and success = pulling out a function word, of which there are 13 million:
- first trial: p(success) = .25
- second trial: p(success) = 12,999,999/51,999,999 ≈ .25
On very large samples, the chances remain relatively stable even without replacement.

Binomial probabilities - I
Let π represent the probability of success on a Bernoulli trial (e.g. our simple word game on a large corpus). Then p(failure) = 1 - π.
Problem: what are the chances of achieving success 3 times out of 5 trials?
Assumption: each trial is independent of every other. (Is this assumption reasonable?)

Binomial probabilities - II
How many ways are there of getting success three times out of 5? Several: SSSFF, SFSFS, SFSSF, ... To count the number of possible ways of getting k outcomes from n trials, we use the binomial coefficient C(n, k).

Binomial probabilities - III
5 choose 3 gives 10. Given independence, each of these sequences is equally likely.

What's the probability of a sequence? It's an AND problem (multiplication rule):
P(SSSFF) = π · π · π · (1-π) · (1-π) = π³(1-π)²
P(SFSFS) = π · (1-π) · π · (1-π) · π = π³(1-π)²
(they all come out the same)
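A quick numerical check of this counting argument (the value of π here is an arbitrary choice):

```python
import math

pi = 0.25                                # assumed success probability
per_sequence = pi**3 * (1 - pi)**2       # probability of any one sequence
p_3_of_5 = math.comb(5, 3) * per_sequence
print(p_3_of_5)                          # 10 equally likely sequences summed
```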

Binomial probabilities - IV
The binomial distribution states that, given n Bernoulli trials with probability π of success on each trial, the probability of getting exactly k successes is:

P(X = k) = C(n, k) · π^k · (1-π)^(n-k)

where C(n, k) is the number of different ways of getting k successes, and π^k(1-π)^(n-k) is the probability of each such sequence.

Expected value and variance
Expected value of X over n trials: E(X) = nπ
Variance of X over n trials: Var(X) = nπ(1-π)
where π is our probability of success.

Using the t-test for collocation discovery

The logic of hypothesis testing
The typical scenario in hypothesis testing compares two hypotheses:
- the research hypothesis
- a null hypothesis

The idea is to set up our experiment (study, etc.) in such a way that, if we show the null hypothesis to be false, then we can affirm our research hypothesis with a certain degree of confidence.

H0 for collocation studies
There is no real association between w1 and w2, i.e. the occurrence of ⟨w1, w2⟩ is no more likely than chance.

More formally:
H0: P(w1 & w2) = P(w1)P(w2), i.e. P(w1) and P(w2) are independent.

Some more on hypothesis testing
Our research hypothesis (H1): ⟨w1, w2⟩ are strong collocates, i.e. P(w1 & w2) > P(w1)P(w2)

The null hypothesis (H0): P(w1 & w2) = P(w1)P(w2)

How do we know whether our results are sufficient to affirm H1? I.e. how big is our risk of wrongly falsifying H0?

The notion of significance
We generally fix a level of confidence in advance.

In many disciplines, we're happy with being 95% confident that the result we obtain is correct, so we accept a 5% chance of error. Therefore we state our results at p = 0.05: the probability of wrongly rejecting H0 is 5% (0.05).

Tests for significance
Many of the tests we use involve:
- having a prior notion of what the mean/variance of a population is, according to H0
- computing the mean/variance on our sample of the population
- checking whether the sample mean/variance differs from that predicted by H0, at 95% confidence

The t-test: strategy
- obtain the mean (x̄) and variance (s²) for a sample
- H0: the sample is drawn from a population with mean μ and variance σ²
- estimate the t value: this compares the sample mean/variance to the expected (population) mean/variance under H0
- check if any difference found is significant enough to reject H0

Computing t
Calculate the difference between the sample mean and the expected population mean, and scale the difference by the variance:

t = (x̄ - μ) / √(s²/N)

Assumption: the population is normally distributed. If t is big enough, we reject H0. Whether t is big enough given our sample size N is simply looked up in a table; tables tell us the level of significance (the p-value, or the likelihood of making a Type I error, i.e. wrongly rejecting H0).

Example: new companies
We think of our corpus as a series of bigrams, and each sample we take is an indicator variable (a Bernoulli trial):
- value = 1 if a bigram is new companies
- value = 0 otherwise

Compute P(new) and P(companies) using standard MLE.

H0: P(new companies) = P(new)P(companies)

Example continued
We have computed the likelihood of our bigram of interest under H0. Since this is a Bernoulli trial, this is also our expected mean.
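The computation can be sketched as follows; all counts here are invented placeholders (not the New York Times figures), so only the shape of the calculation matters:

```python
import math

# One-sample t-test for the bigram "new companies" (invented counts).
N = 14_000_000                    # number of bigram samples (assumption)
c_new, c_companies, c_bigram = 15_000, 4_500, 8   # assumed counts

mu = (c_new / N) * (c_companies / N)   # expected bigram prob. under H0
x_bar = c_bigram / N                   # observed sample mean (Bernoulli)
s2 = x_bar * (1 - x_bar)               # Bernoulli variance, roughly x_bar

t = (x_bar - mu) / math.sqrt(s2 / N)
print(round(t, 3))   # compare against the tabled critical value for p = 0.05
```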

We then compute the actual sample probability of new companies, compute t, and check significance.

Uses of the t-test
The t-test is often used to rank candidate collocations, rather than to compute significance. Stop-word lists must be used, else nearly all bigrams will be significant: e.g. M&S report 824 out of 831 bigrams passing the significance test. The reason is that language is just not random: its regularities mean that, if the corpus is large enough, all bigrams will occur together regularly and often enough to be significant. Kilgarriff (2005): any null hypothesis will be rejected on a large enough corpus.

Extending the t-test to compare samples
A variation on the original problem: what co-occurrence relations best distinguish between two words, w1 and w1', that are near-synonyms? E.g. strong vs. powerful.

Strategy:
- find all bigrams ⟨w1, w2⟩ and ⟨w1', w2⟩, e.g. strong tea, strong support
- check, for each w2, whether it occurs significantly more often with w1 than with w1'

NB: this is a two-sample t-test.

Two-sample t-test: details
H0: for any w2, the probabilities of ⟨w1, w2⟩ and ⟨w1', w2⟩ are the same, i.e. μ (the expected difference) = 0.

Strategy:
- extract samples of ⟨w1, w2⟩ and ⟨w1', w2⟩ and assume they are independent
- compute the mean and SD for each sample
- compute t
- check for significance: is the magnitude of the difference large enough?

Formula:

t = (x̄1 - x̄2) / √(s1²/n1 + s2²/n2)

Simplifying under binomial assumptions
On large samples, the variance in the binomial distribution approaches the mean, i.e. s1² = x̄1(1 - x̄1) ≈ x̄1

(similarly, s2² ≈ x̄2 for the other sample)

Therefore:

t ≈ (x̄1 - x̄2) / √((x̄1 + x̄2)/n)

which, in terms of raw counts, reduces to:

t ≈ (C(⟨w1, w2⟩) - C(⟨w1', w2⟩)) / √(C(⟨w1, w2⟩) + C(⟨w1', w2⟩))
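Under this simplification the statistic needs only the two bigram counts. A sketch (the example counts are illustrative, not corpus figures):

```python
import math

def t_counts(c1, c2):
    # t approx. (c1 - c2) / sqrt(c1 + c2): the count-based approximation
    return (c1 - c2) / math.sqrt(c1 + c2)

# e.g. comparing c(<strong, w2>) = 50 against c(<powerful, w2>) = 10
print(round(t_counts(50, 10), 2))  # 5.16
```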

Concrete example: strong vs. powerful (M&S, p. 167); NY Times

(Table omitted: words occurring significantly more often with powerful than with strong, and vice versa.)

Criticisms of the t-test
The t-test assumes that the probabilities are normally distributed. This is probably not the case in linguistic data, where probabilities tend to be very large or very small.

An alternative: the chi-squared (χ²) test, which compares differences between expected and observed frequencies (e.g. of bigrams).

The chi-square test
Example: imagine we're interested in whether poor performance is a good collocation. H0: the frequency of poor performance is no different from the expected frequency if each word occurs independently.

Find the frequencies of bigrams containing poor, performance, and poor performance; compare actual to expected frequencies; check if the χ² value is high enough to reject H0.

Example continued

                       w1 = poor                 w1 ≠ poor
w2 = performance       15 (poor performance)     1,230 (e.g. bad performance)
w2 ≠ performance       3,580 (e.g. poor people)  12,000 (all other bigrams)

(The table above shows the observed frequencies.) Expected frequencies need to be computed for each cell, e.g. the expected value for cell (1,1), poor performance:

E(1,1) = (row 1 total × column 1 total) / N = ((15 + 1,230) × (15 + 3,580)) / N

where N is the total number of bigrams.

Computing the χ² value
The chi-squared value is the sum of the differences between observed and expected frequencies, scaled by the expected frequencies:

χ² = Σ_ij (O_ij - E_ij)² / E_ij

The value is once again looked up in a table to check whether the degree of confidence (p-value) is acceptable. If so, we conclude that the dependency between w1 and w2 is significant.
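The χ² computation over the 2×2 table for poor performance can be sketched as follows, using the slide's own cell counts:

```python
# Observed frequencies from the poor/performance table.
O = [[15, 1230],        # w2 = performance:  w1 = poor / w1 != poor
     [3580, 12000]]     # w2 != performance: w1 = poor / w1 != poor

N = sum(sum(row) for row in O)
row_tot = [sum(row) for row in O]
col_tot = [sum(col) for col in zip(*O)]

chi2 = 0.0
for i in range(2):
    for j in range(2):
        E = row_tot[i] * col_tot[j] / N        # expected count for cell (i, j)
        chi2 += (O[i][j] - E) ** 2 / E
print(round(chi2, 1))   # far above 3.84, the 1-df critical value at p = 0.05
```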

More applications of this statistic
Kilgarriff and Rose (1998) use chi-square as a measure of corpus similarity:
- draw up an n (row) × 2 (column) table, where columns correspond to corpora and rows to individual word types
- compare the difference in counts between corpora
- H0: the corpora are drawn from the same underlying linguistic population (e.g. register or variety)
- the corpora will be highly similar if the ratio of counts for each word is roughly constant
This uses lexical variation to compute corpus similarity.

Limitations of t-test and chi-square
They are not easily interpretable:
- a large chi-square or t value suggests a large difference
- but either makes more sense as a comparative measure than in absolute terms

t-test is problematic because of the normality assumption

Chi-square doesn't work very well for small frequencies (by convention, we don't calculate it if the expected value for any of the cells is less than 5), but n-grams will often be infrequent!

Likelihood ratios for collocation discovery

Rationale
A likelihood ratio is the ratio of two probabilities; it indicates how much more likely one hypothesis is compared to another.

Notation:
- c1 = C(w1)
- c2 = C(w2)
- c12 = C(⟨w1, w2⟩)

Hypotheses:
H0: P(w2|w1) = p = P(w2|¬w1)
H1: P(w2|w1) = p1, P(w2|¬w1) = p2, with p1 ≠ p2

Under H0, p = c2/N for both P(w2|w1) and P(w2|¬w1). Under H1, p1 = c12/c1 (the probability that c12 bigrams out of c1 are ⟨w1, w2⟩) and p2 = (c2 - c12)/(N - c1) (the probability that c2 - c12 out of the N - c1 remaining bigrams are ⟨¬w1, w2⟩).

Computing the likelihood ratio

The likelihood that a hypothesis H is correct is written L(H); under the binomial assumption, L(H0) = b(c12; c1, p) · b(c2 - c12; N - c1, p) and L(H1) = b(c12; c1, p1) · b(c2 - c12; N - c1, p2).

Computing the likelihood ratio
We usually compute the log of the ratio:

log λ = log( L(H0) / L(H1) )

This is usually expressed as -2 log λ, because for very large samples -2 log λ is roughly equivalent to a χ² value.
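A sketch of the computation; the binomial coefficients cancel in the ratio, so they are dropped from the log-likelihood, and the counts in the example call are invented:

```python
import math

def log_L(k, n, p):
    # log binomial likelihood, omitting the constant C(n, k) term
    return k * math.log(p) + (n - k) * math.log(1 - p)

def neg2_log_lambda(c1, c2, c12, N):
    # -2 log lambda, with p, p1, p2 defined as on the slides
    p, p1, p2 = c2 / N, c12 / c1, (c2 - c12) / (N - c1)
    return -2 * (log_L(c12, c1, p) + log_L(c2 - c12, N - c1, p)
                 - log_L(c12, c1, p1) - log_L(c2 - c12, N - c1, p2))

# Invented counts: w2 follows w1 far more often than independence predicts
print(neg2_log_lambda(c1=2000, c2=1500, c12=100, N=14_300_000) > 3.84)  # True
```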

Interpreting the ratio
Suppose the likelihood ratio for some bigram is x. This says: if we hypothesise that w2 is somehow dependent on w1, then we expect it to occur x times more often than its actual base rate of occurrence would predict. This ratio is also better for sparse data: we can use -2 log λ as an approximate chi-square value even when expected frequencies are small.

Concrete example: bigrams involving powerful (M&S, p. 174)

Source: NY Times corpus (N=14.3m)

Note: sparse data can still have a high log likelihood value!

Interpreting -2 log λ as chi-squared allows us to reject H0 even for small samples (e.g. powerful cudgels).

Relative frequency ratios
An extension of the same logic of a likelihood ratio, used to compare collocations across corpora.

Let ⟨w1, w2⟩ be our bigram of interest, and let C1 and C2 be two corpora:
- p1 = P(⟨w1, w2⟩) in C1
- p2 = P(⟨w1, w2⟩) in C2

r = p1/p2 gives an indication of the relative likelihood of ⟨w1, w2⟩ in C1 and C2.
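A minimal sketch (the counts and corpus sizes here are invented, not the M&S figures):

```python
# Relative frequency ratio r = p1/p2 for a bigram across two corpora.
def freq_ratio(count1, size1, count2, size2):
    return (count1 / size1) / (count2 / size2)

# A bigram seen 2 times in a 10M-word C1 but 44 times in a 12M-word C2:
r = freq_ratio(2, 10_000_000, 44, 12_000_000)
print(round(r, 3))   # a small r flags the bigram as characteristic of C2
```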

Example application
Manning and Schütze (p. 176) compare:
- C1: NY Times texts from 1990
- C2: NY Times texts from 1989

The bigram occurs 44 times in C2, but only 2 times in C1, so r = 0.03.

The big difference is due to 1989 papers dealing more with the fall of the Berlin Wall.

Summary
We've now considered two forms of hypothesis testing: the t-test and chi-square, as well as log-likelihood ratios as measures of relative probability under different hypotheses. Next, we begin to look at the problem of lexical acquisition.

References
M. Lapata, S. McDonald & F. Keller (1999). Determinants of Adjective-Noun Plausibility. Proceedings of the 9th Conference of the European Chapter of the Association for Computational Linguistics (EACL-99).

A. Kilgarriff (2005). Language is never, ever, ever random. Corpus Linguistics and Linguistic Theory 1(2): 263

Church, K. and Hanks, P. (1990). Word association norms, mutual information and lexicography. Computational Linguistics 16(1).