Simple Statistics for Corpus Linguistics

Sean Wallis
Survey of English Usage
University College London
s.wallis@ucl.ac.uk

Outline

• Numbers…
• A simple research question
  – do women speak or write more than men in ICE-GB?
  – p = proportion = probability
• Another research question
  – what happens to speakers’ use of modal shall vs. will over time?
  – the idea of inferential statistics
  – plotting confidence intervals
• Concluding remarks

Numbers...

• We are used to concepts like these being expressed as numbers:
  – length (distance, height)
  – area
  – volume
  – temperature
  – wealth (income, assets)
• We are going to discuss another concept:
  – probability
    • proportion, percentage
  – a simple idea, at the heart of statistics

Probability

• Based on another, even simpler, idea:
  – probability p = x / n
  – e.g. the probability that the speaker says will instead of shall
• where
  – frequency x (often, f) is
    • the number of times something actually happens
    • the number of hits in a search
    • e.g. cases of will
  – baseline n is
    • the number of times something could happen
    • the number of hits
      – in a more general search
      – in several alternative patterns (‘alternate forms’)
    • e.g. total: will + shall
• Probability can range from 0 to 1
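As a minimal sketch of p = x / n, the snippet below computes the probability of will out of will + shall. The counts are the ones this deck reports later for the LLC texts in DCPSE (78 cases of will, 110 of shall); the function name `proportion` is an illustrative choice and not part of any tool mentioned here.

```python
def proportion(x: int, n: int) -> float:
    """Return p = x / n, the observed probability of an event.

    x is the frequency (number of hits); n is the baseline
    (number of times the event could have happened).
    """
    if n <= 0:
        raise ValueError("baseline n must be positive")
    if not 0 <= x <= n:
        raise ValueError("frequency x must lie between 0 and n")
    return x / n


# Counts for the LLC texts in DCPSE, as reported later in this deck.
will_hits = 78
shall_hits = 110
baseline = will_hits + shall_hits  # total: will + shall

p_will = proportion(will_hits, baseline)
print(f"p(will | {{will, shall}}) = {p_will:.3f}")
```

Because x can never exceed n, the result always falls between 0 and 1, which is why the function rejects any other input.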

What can a corpus tell us?

• A corpus is a source of knowledge about language. Sources include:
  – corpus
  – introspection/observation/elicitation
  – controlled laboratory experiment
  – computer simulation
  How do these differ in what they might tell us?
• A corpus is a sample of language, varying by:
  – source (e.g. speech vs. writing, age...)
  – levels of annotation (e.g. parsing)
  – size (number of words)
  – sampling method (random sample?)
  How does this affect the types of knowledge we might obtain?

What can a parsed corpus tell us?

• Three kinds of evidence may be found in a parsed corpus:
  – Frequency evidence of a particular known rule, structure or linguistic event (How often?)
  – Factual evidence of new rules, etc. (How novel?)
  – Interaction evidence of relationships between rules, structures and events (Does X affect Y?)
• Lexical searches may also be made more precise using the grammatical analysis

A simple research question

• Let us consider the following question:

• Do women speak or write more words than men in the ICE-GB corpus?

• What do you think?

• How might we find out?

Let’s get some data

• Open ICE-GB with ICECUP
  – Text Fragment query for words:
    • “*+<{~PUNC,~PAUSE}>”
    • counts every word, excluding pauses and punctuation
  – Variable query:
    • TEXT CATEGORY = spoken, written
  – Variable query:
    • SPEAKER GENDER = f, m, <unknown>
• Combine these 3 queries:

               F          M  <unknown>      TOTAL
TOTAL    275,999    667,934     93,355  1,037,288
spoken   174,499    439,741      1,076    615,316
written  101,500    228,193     92,279    421,972

ICE-GB: gender / written-spoken

• Proportion of words in each category spoken/written by women and men
  – The authors of some texts are unspecified
  – Some written material may be jointly authored
  – female/male ratio varies slightly

[Figure: chart of p (axis 0 to 1) showing the female and male shares of the TOTAL, spoken and written categories]

p (female) = words spoken by women / total words (excluding <unknown>)
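The definition of p (female) can be sketched in a few lines of Python. The word counts are copied from the ICE-GB table in this deck; the dictionary layout and the helper name `p_female` are assumptions made for illustration only.

```python
# Word counts from the ICE-GB table in this deck (F and M columns only;
# words of <unknown> gender are excluded from the baseline).
counts = {
    "TOTAL": {"female": 275_999, "male": 667_934},
    "spoken": {"female": 174_499, "male": 439_741},
    "written": {"female": 101_500, "male": 228_193},
}


def p_female(row: dict) -> float:
    """Proportion of words of known gender that were produced by women."""
    return row["female"] / (row["female"] + row["male"])


proportions = {category: p_female(row) for category, row in counts.items()}

for category, p in proportions.items():
    print(f"{category:8s} p(female) = {p:.3f}")
```

Leaving <unknown> out of the baseline matters most for the written texts, where the table shows that a large share of words has no recorded gender.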

p = Probability = Proportion

• We asked ourselves the following question:
  – Do women speak or write more words than men in the ICE-GB corpus?
  – To answer this we looked at the proportion of words in ICE-GB that are produced by women (out of all words where the gender is known)
• The proportion of words produced by women can also be thought of as a probability:
  – What is the probability that, if we were to pick any random word in ICE-GB (and the gender was known), it would be uttered by a woman?

Another research question

• Let us consider the following question:

• What happens to modal shall vs. will over time in British English?
  – Does shall increase or decrease?

• What do you think?

• How might we find out?

Let’s get some data

• Open DCPSE with ICECUP
  – FTF query for first person declarative shall:
    • repeat for will
  – Corpus Map:
    • DATE
• Do the first set of queries and then drop into Corpus Map

Modal shall vs. will over time

• Plotting probability of speaker selecting modal shall out of shall/will over time (DCPSE)

[Figure: p(shall | {shall, will}) on the y-axis (0.0 = shall 0%, 1.0 = shall 100%) plotted against year on the x-axis (1955 to 1995). Source: Aarts et al. 2013]

• Is shall going up or down?

Is shall going up or down?

• Whenever we look at change, we must ask ourselves two things:
  – What is the change relative to?
    • Is our observation higher or lower than we might expect?
    • In this case we ask: does shall decrease relative to shall + will?
  – How confident are we in our results?
    • Is the change big enough to be reproducible?

The idea of a confidence interval

• All observations are imprecise
  – Randomness is a fact of life
  – Our abilities are finite:
    • to measure accurately or
    • reliably classify into types
• We need to express caution in citing numbers
• Example (from Levin 2013):
  – 77.27% of uses of think in 1920s data have a literal (‘cogitate’) meaning
    • Really? Not 77.28, or 77.26?
  – 77% of uses of think in 1920s data have a literal (‘cogitate’) meaning
    • Sounds defensible. But how confident can we be in this number?
  – 77% (66-86%*) of uses of think in 1920s data have a literal (‘cogitate’) meaning
    • Finally we have a credible range of values. It needs a footnote* to explain how it was calculated.

The ‘sample’ and the ‘population’

• We said that the corpus was a sample
• Previously, we asked about the proportions of male/female words in the corpus (ICE-GB)
  – We asked questions about the sample
  – The answers were statements of fact
• Now we are asking about “British English”
  – We want to draw an inference
    • from the sample (in this case, DCPSE)
    • to the population (similarly-sampled BrE utterances)
  – This inference is a best guess
  – This process is called inferential statistics

Basic inferential statistics

• Suppose we carry out an experiment
  – We toss a coin 10 times and get 5 heads
  – How confident are we in the results?
• Suppose we repeat the experiment
  – Will we get the same result again?
• Let’s try…
  – You should have one coin
  – Toss it 10 times
  – Write down how many heads you get
  – Do you all get the same results?
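For readers without a coin to hand, the classroom exercise can be approximated in Python. This is only a simulation sketch: the seed, the 26 repetitions and the function name `toss_coins` are arbitrary choices, and a pseudo-random generator stands in for a real coin.

```python
import random


def toss_coins(n_tosses: int, rng: random.Random) -> int:
    """Toss a fair coin n_tosses times and return the number of heads."""
    return sum(rng.random() < 0.5 for _ in range(n_tosses))


# The seed and the number of repetitions are arbitrary choices for this sketch.
rng = random.Random(2013)
results = [toss_coins(10, rng) for _ in range(26)]

print("heads per experiment:", results)
print("distinct outcomes:", sorted(set(results)))
```

Each entry in `results` plays the role of one participant's count of heads, so the list shows how much repeated runs of the same experiment can differ.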

The Binomial distribution

• Repeated sampling tends to form a Binomial distribution around the expected mean X
• We toss a coin 10 times, and get 5 heads
• Due to chance, some samples will have a higher or lower score

[Figure: histogram of frequency F against number of heads x (1 to 9), centred on the expected mean X and built up as the number of samples grows: N = 1, 4, 8, 12, 16, 20, 26]
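The shape that repeated sampling tends towards can be computed directly. The sketch below evaluates the standard Binomial probability formula for 10 tosses of a fair coin; the function name `binomial_pmf` and the text histogram are illustrative choices, not something taken from the deck.

```python
from math import comb


def binomial_pmf(x: int, n: int, p: float) -> float:
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


n_tosses, p_head = 10, 0.5
distribution = [binomial_pmf(x, n_tosses, p_head) for x in range(n_tosses + 1)]

for x, prob in enumerate(distribution):
    # A crude text histogram: one '#' per percentage point.
    print(f"{x:2d} heads  {prob:.4f}  {'#' * round(prob * 100)}")
```

Even for a fair coin, exactly 5 heads is the outcome in only about a quarter of experiments, which is why a single observation should not be read as the true value.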

The Binomial distribution

• It is helpful to express x as the probability of choosing a head, p, with expected mean P
• p = x / n
  – n = max. number of possible heads (10)
• Probabilities are in the range 0 to 1 = percentages (0 to 100%)

[Figure: the same distribution with frequency F plotted against p (0.1 to 0.9), centred on P]

The Binomial distribution

• Take-home point:
  – A single observation, say x hits (or p as a proportion of n possible hits) in the corpus, is not guaranteed to be correct ‘in the world’!
• Estimating the confidence you have in your results is essential
  – We want to make predictions about future runs of the same experiment

[Figure: distribution of frequency F against p (0.1 to 0.9), showing an observation p beside the expected mean P]

Binomial → Normal

• The Binomial (discrete) distribution is close to the Normal (continuous) distribution

[Figure: Binomial and Normal distributions compared, with frequency F on the y-axis]

The central limit theorem

• Any Normal distribution can be defined by only two variables and the Normal function z
  – population mean P
  – standard deviation S = √(P(1 – P) / n)
    • With more data in the experiment, S will be smaller
• 95% of the curve is within ~2 standard deviations of the expected mean
  – the correct figure is 1.95996!
  – = the critical value of z for an error level of 0.05.

[Figure: Normal curve of F against p, centred on the population mean P, with the central 95% lying within z . S on either side of P and 2.5% in each tail]
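The two quantities on this slide can be checked numerically. The sketch below assumes the standard deviation of a proportion is S = √(P(1 – P) / n) and obtains the critical value of z from Python's standard-library `statistics.NormalDist`; the helper names and the coin example (P = 0.5, n = 10) are illustrative choices.

```python
from math import sqrt
from statistics import NormalDist


def standard_deviation(P: float, n: int) -> float:
    """Standard deviation S = sqrt(P(1 - P) / n) of an observed proportion."""
    return sqrt(P * (1 - P) / n)


def critical_z(error_level: float = 0.05) -> float:
    """Two-tailed critical value of z, e.g. about 1.95996 for 0.05."""
    return NormalDist().inv_cdf(1 - error_level / 2)


P, n = 0.5, 10  # fair coin, 10 tosses
S = standard_deviation(P, n)
z = critical_z(0.05)
lower, upper = P - z * S, P + z * S

print(f"S = {S:.4f}, z = {z:.5f}")
print(f"95% of samples are expected to fall in ({lower:.3f}, {upper:.3f})")
print(f"S with n = 100: {standard_deviation(P, 100):.4f}")
```

The last line illustrates the point that S shrinks as the amount of data grows: multiplying n by 100 divides S by 10.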

The single-sample z test...

• Is an observation p more than z standard deviations from the expected (population) mean P?
• If yes, p is significantly different from P

[Figure: Normal curve of F against p, centred on P with 2.5% tails beyond z . S on either side, and an observation p marked on the axis]
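The test can be written out as a short sketch. The two observations (9 heads and 6 heads out of 10) are hypothetical values chosen for illustration, and with n as small as 10 the Normal approximation is rough, so the output should be read as a demonstration of the mechanics only.

```python
from math import sqrt

Z_CRITICAL = 1.95996  # critical value of z for an error level of 0.05


def single_sample_z(p: float, P: float, n: int) -> float:
    """Number of standard deviations S between observation p and mean P."""
    S = sqrt(P * (1 - P) / n)
    return (p - P) / S


def is_significant(p: float, P: float, n: int, z_crit: float = Z_CRITICAL) -> bool:
    """True if p lies more than z_crit standard deviations from P."""
    return abs(single_sample_z(p, P, n)) > z_crit


# Hypothetical observations: 9 heads and 6 heads out of 10 tosses of a fair coin.
for heads in (9, 6):
    score = single_sample_z(heads / 10, 0.5, 10)
    verdict = is_significant(heads / 10, 0.5, 10)
    print(f"{heads} heads: z = {score:+.2f}, significant = {verdict}")
```

Note that S is calculated from the expected mean P, not from the observation p, because the test asks how surprising p would be if P were true.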

...gives us a “confidence interval”

• P ± z . S is the confidence interval for P
  – We want to plot the interval about p
• The interval about p is called the Wilson score interval
• This interval reflects the Normal interval about P:
  – If P is at the upper limit of p, p is at the lower limit of P
(Wallis, 2013)

[Figure: Normal curve of F against p about P with 2.5% tails; the observation p is shown with its Wilson interval bounds w– and w+]
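A minimal sketch of the Wilson score interval follows, using the commonly cited closed form of Wilson's (1927) formula. The function name and the coin example (5 heads in 10 tosses) are illustrative; the final lines check the relationship described above by placing P at w+ and computing the lower limit of the Normal interval about it.

```python
from math import sqrt


def wilson_interval(p: float, n: int, z: float = 1.95996) -> tuple[float, float]:
    """Wilson score interval (w-, w+) about an observed proportion p."""
    centre = (p + z * z / (2 * n)) / (1 + z * z / n)
    half_width = (z / (1 + z * z / n)) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half_width, centre + half_width


# The coin example from this deck: 5 heads in 10 tosses.
p_observed, n_tosses = 5 / 10, 10
w_minus, w_plus = wilson_interval(p_observed, n_tosses)
print(f"p = {p_observed}, 95% Wilson interval = ({w_minus:.3f}, {w_plus:.3f})")

# Check the stated relationship: if P sits at w+, the lower limit of the
# Normal interval about P should fall back on the observation p.
P = w_plus
S = sqrt(P * (1 - P) / n_tosses)
lower_limit_of_P = P - 1.95996 * S
print(f"P = {P:.3f}, P - z.S = {lower_limit_of_P:.3f}")
```

Unlike p ± z . S calculated naively about p, this interval is generally asymmetric about p and cannot extend below 0 or above 1.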

Modal shall vs. will over time

• Simple test:
  – Compare p for
    • all LLC texts in DCPSE (1956-77) with
    • all ICE-GB texts (early 1990s)
  – We get the following data

        LLC  ICE-GB  total
shall   110      40    150
will     78      58    136
total   188      98    286

  – We may plot the probability of shall being selected, with Wilson intervals
  – The table may be input in a 2 x 2 chi-square test, or you can check Wilson intervals

[Figure: p(shall | {shall, will}) for LLC and ICE-GB, plotted with Wilson intervals on a 0.0 to 1.0 axis]
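Both checks mentioned here can be sketched in Python using the frequencies from the table. The chi-square function uses the usual shortcut formula for a 2 x 2 table without a continuity correction, which is an assumption about the variant intended; the function names are illustrative.

```python
from math import sqrt

Z = 1.95996  # critical value of z for an error level of 0.05

# Observed frequencies from the table in this deck.
shall = {"LLC": 110, "ICE-GB": 40}
will = {"LLC": 78, "ICE-GB": 58}


def wilson_interval(x, n, z=Z):
    """Wilson score interval about the observed proportion p = x / n."""
    p = x / n
    centre = (p + z * z / (2 * n)) / (1 + z * z / n)
    half_width = (z / (1 + z * z / n)) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half_width, centre + half_width


def chi_square_2x2(a, b, c, d):
    """Chi-square for the 2 x 2 table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


intervals = {
    corpus: wilson_interval(shall[corpus], shall[corpus] + will[corpus])
    for corpus in shall
}
for corpus, (low, high) in intervals.items():
    p = shall[corpus] / (shall[corpus] + will[corpus])
    print(f"{corpus:7s} p(shall) = {p:.3f}, Wilson interval = ({low:.3f}, {high:.3f})")

chi2 = chi_square_2x2(shall["LLC"], shall["ICE-GB"], will["LLC"], will["ICE-GB"])
print(f"chi-square = {chi2:.2f} (critical value {Z * Z:.2f} for 1 degree of freedom)")
```

Under these assumptions the two checks agree: the Wilson intervals for LLC and ICE-GB do not overlap, and chi-square exceeds the critical value, so the fall in p(shall) would be judged significant at the 0.05 error level.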

Modal shall vs. will over time

• Plotting modal shall/will over time (DCPSE)
• Small amounts of data / year
• Confidence intervals identify the degree of certainty in our results
• Highly skewed p in some cases
  – p = 0 or 1 (circled)
• We can now estimate an approximate downwards curve

[Figure: p(shall | {shall, will}) (0.0 to 1.0) against year (1955 to 1995), with confidence intervals for each year and an approximate downward curve. Source: Aarts et al. 2013]
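The skewed cases are where the choice of interval matters most. The sketch below contrasts a naive Normal interval calculated about p with the Wilson interval for a hypothetical year containing 5 cases and no instances of shall; the sample size and function names are assumptions for illustration, and the deck does not say which formula produced the plotted intervals for individual years.

```python
from math import sqrt

Z = 1.95996  # critical value of z for an error level of 0.05


def normal_interval(p, n, z=Z):
    """Normal ('Wald') interval p ± z.S, computed naively about p."""
    s = sqrt(p * (1 - p) / n)
    return p - z * s, p + z * s


def wilson_interval(p, n, z=Z):
    """Wilson score interval about p, clamped to the range 0 to 1."""
    centre = (p + z * z / (2 * n)) / (1 + z * z / n)
    half_width = (z / (1 + z * z / n)) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    # Clamping only guards against tiny floating-point overshoot.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical year with only 5 cases, none of them shall.
n_cases = 5
wald = normal_interval(0.0, n_cases)
wilson = wilson_interval(0.0, n_cases)

print(f"naive Normal interval: ({wald[0]:.3f}, {wald[1]:.3f})")
print(f"Wilson interval:       ({wilson[0]:.3f}, {wilson[1]:.3f})")
```

The naive interval collapses to zero width at p = 0, wrongly suggesting certainty, whereas the Wilson interval stays wide to reflect how little data there is.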

Recap

• Whenever we look at change, we must ask ourselves two things:
  – What is the change relative to?
    • Is our observation higher or lower than we might expect?
    • In this case we ask: does shall decrease relative to shall + will?
  – How confident are we in our results?
    • Is the change big enough to be reproducible?

Conclusions

• An observation is not the actual value
  – Repeating the experiment might get different results
• The basic idea of these methods is
  – Predict range of future results if experiment was repeated
    • ‘Significant’ = effect > 0 (e.g. 19 times out of 20)
• Based on the Binomial distribution
  – Approximated by Normal distribution, which has many uses:
    • Plotting confidence intervals
    • Use goodness of fit or single-sample z tests to compare an observation with an expected baseline
    • Use 2 x 2 chi-square tests or two independent sample z tests to compare two observed samples
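As an illustration of the last bullet, here is a sketch of a two independent sample z test applied to the shall counts reported earlier (110 of 188 in LLC, 40 of 98 in ICE-GB). It uses the textbook pooled-proportion form of the test, which is an assumption about the variant intended; the function name is hypothetical.

```python
from math import sqrt

Z_CRITICAL = 1.95996  # critical value of z for an error level of 0.05


def two_sample_z(x1, n1, x2, n2):
    """z score for the difference between two independent proportions.

    Uses the pooled proportion to estimate the standard deviation.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    s = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / s


# shall out of shall + will: LLC versus ICE-GB.
z_score = two_sample_z(110, 188, 40, 98)

print(f"z = {z_score:.3f}, z squared = {z_score ** 2:.3f}")
print("significant at 0.05:", abs(z_score) > Z_CRITICAL)
```

In this pooled form, z squared equals the 2 x 2 chi-square statistic for the same table, so the two tests named in the bullet lead to the same decision.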

References

• Aarts, B., J. Close, G. Leech and S.A. Wallis (eds). The Verb Phrase in English: Investigating recent language change with corpora. Cambridge: CUP.
  – Aarts, B., Close, J., and Wallis, S.A. 2013. Choices over time: methodological issues in investigating current change. Chapter 2.
  – Levin, M. 2013. The progressive in modern American English. Chapter 8.

• Wallis, S.A. 2013. Binomial confidence intervals and contingency tests. Journal of Quantitative Linguistics 20:3, 178-208.

• Wilson, E.B. 1927. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association 22: 209-212.

• NOTE: Statistics papers, more explanation, spreadsheets etc. are published on corp.ling.stats blog: http://corplingstats.wordpress.com