Simple Statistics for Simple Statistics for Corpus LinguisticsCorpus Linguistics
Sean WallisSurvey of English Usage
University College London
OutlineOutline
• Numbers…
• A simple research question– do women speak or write more than men
in ICE-GB?– p = proportion = probability
• Another research question– what happens to speakers’ use of modal shall
vs. will over time?– the idea of inferential statistics– plotting confidence intervals
• Concluding remarks
Numbers...Numbers...
• We are used to concepts like these being expressed as numbers:– length (distance, height)– area– volume– temperature – wealth (income, assets)
Numbers...Numbers...
• We are used to concepts like these being expressed as numbers:– length (distance, height)– area– volume– temperature – wealth (income, assets)
• We are going to discuss another concept:– probability
• proportion, percentage
– a simple idea, at the heart of statistics
ProbabilityProbability
• Based on another, even simpler, idea:– probability p = x / n
ProbabilityProbability
• Based on another, even simpler, idea:– probability p = x / n – e.g. the probability that the
speaker says will instead of shall
ProbabilityProbability
• Based on another, even simpler, idea:– probability p = x / n
• where– frequency x (often, f )
• the number of times something actually happens• the number of hits in a search
– e.g. the probability that the speaker says will instead of shall
ProbabilityProbability
• Based on another, even simpler, idea:– probability p = x / n
• where– frequency x (often, f )
• the number of times something actually happens• the number of hits in a search
– cases of will
– e.g. the probability that the speaker says will instead of shall
ProbabilityProbability
• Based on another, even simpler, idea:– probability p = x / n
• where– frequency x (often, f )
• the number of times something actually happens• the number of hits in a search
– baseline n is• the number of times something could happen• the number of hits
– in a more general search – in several alternative patterns (‘alternate forms’)
– cases of will
– e.g. the probability that the speaker says will instead of shall
ProbabilityProbability
• Based on another, even simpler, idea:– probability p = x / n
• where– frequency x (often, f )
• the number of times something actually happens• the number of hits in a search
– baseline n is• the number of times something could happen• the number of hits
– in a more general search – in several alternative patterns (‘alternate forms’)
– cases of will
– total: will + shall
– e.g. the probability that the speaker says will instead of shall
ProbabilityProbability
• Based on another, even simpler, idea:– probability p = x / n
• where– frequency x (often, f )
• the number of times something actually happens• the number of hits in a search
– baseline n is• the number of times something could happen• the number of hits
– in a more general search – in several alternative patterns (‘alternate forms’)
• Probability can range from 0 to 1
– e.g. the probability that the speaker says will instead of shall– cases of will
– total: will + shall
What can a corpus tell us?What can a corpus tell us?
• A corpus is a source of knowledge about language:– corpus– introspection/observation/
elicitation– controlled laboratory experiment– computer simulation
What can a corpus tell us?What can a corpus tell us?
• A corpus is a source of knowledge about language:– corpus– introspection/observation/
elicitation– controlled laboratory experiment– computer simulation
}How do these
differ in what they might tell
us?
What can a corpus tell us?What can a corpus tell us?
• A corpus is a source of knowledge about language:– corpus– introspection/observation/
elicitation– controlled laboratory experiment– computer simulation
• A corpus is a sample of language
}How do these
differ in what they might tell
us?
What can a corpus tell us?What can a corpus tell us?
• A corpus is a source of knowledge about language:– corpus– introspection/observation/elicitation– controlled laboratory experiment– computer simulation
• A corpus is a sample of language, varying by:– source (e.g. speech vs. writing, age...)– levels of annotation (e.g. parsing)– size (number of words)– sampling method (random sample?)
}How do these
differ in what they might tell
us?
What can a corpus tell us?What can a corpus tell us?
• A corpus is a source of knowledge about language:– corpus– introspection/observation/elicitation– controlled laboratory experiment– computer simulation
• A corpus is a sample of language, varying by:– source (e.g. speech vs. writing, age...)– levels of annotation (e.g. parsing)– size (number of words)– sampling method (random sample?)
}How do these
differ in what they might tell
us?
How does this affect the types
of knowledg
e we might
obtain?
}
What can a What can a parsedparsed corpus tell corpus tell us?us?• Three kinds of evidence may be found in
a parsed corpus:
What can a What can a parsedparsed corpus tell corpus tell us?us?• Three kinds of evidence may be found in
a parsed corpus:
Frequency evidence of a particularknown rule, structure or linguistic event - How often?
What can a What can a parsedparsed corpus tell corpus tell us?us?• Three kinds of evidence may be found in
a parsed corpus:
Frequency evidence of a particularknown rule, structure or linguistic event
Factual evidence of new rules, etc. - How novel?
- How often?
What can a What can a parsedparsed corpus tell corpus tell us?us?• Three kinds of evidence may be found in a
parsed corpus:
Frequency evidence of a particularknown rule, structure or linguistic event
Factual evidence of new rules, etc.
Interaction evidence of relationshipsbetween rules, structures and events - Does X affect
Y?
- How novel?
- How often?
What can a What can a parsedparsed corpus tell corpus tell us?us?• Three kinds of evidence may be found in a
parsed corpus:
Frequency evidence of a particularknown rule, structure or linguistic event
Factual evidence of new rules, etc.
Interaction evidence of relationshipsbetween rules, structures and events
• Lexical searches may also be made more precise using the grammatical analysis
- Does X affect Y?
- How novel?
- How often?
A simple research questionA simple research question
• Let us consider the following question:
• Do women speak or write more words than men in the ICE-GB corpus?
• What do you think?
• How might we find out?
Lets get some dataLets get some data
• Open ICE-GB with ICECUP– Text Fragment query for words:
• “*+<{~PUNC,~PAUSE}>”• counts every word, excluding pauses
and punctuation
Lets get some dataLets get some data
• Open ICE-GB with ICECUP– Text Fragment query for words:
• “*+<{~PUNC,~PAUSE}>”• counts every word, excluding pauses
and punctuation
– Variable query:• TEXT CATEGORY = spoken, written
Lets get some dataLets get some data
• Open ICE-GB with ICECUP– Text Fragment query for words:
• “*+<{~PUNC,~PAUSE}>”• counts every word, excluding pauses
and punctuation
– Variable query:• TEXT CATEGORY = spoken, written
– Variable query:• SPEAKER GENDER = f, m, <unknown>
combine these3 queries}
Lets get some dataLets get some data
• Open ICE-GB with ICECUP– Text Fragment query for words:
• “*+<{~PUNC,~PAUSE}>”• counts every word, excluding pauses
and punctuation
– Variable query:• TEXT CATEGORY = spoken, written
– Variable query:• SPEAKER GENDER = f, m, <unknown>
F M <unknown> TOTALTOTAL 275,999 667,934 93,355 1,037,288 spoken 174,499 439,741 1,076 615,316 written 101,500 228,193 92,279 421,972
combine these3 queries}
ICE-GB: gender / written-ICE-GB: gender / written-spokenspoken• Proportion of words in each category
spoken/written by women and men– The authors of some texts are unspecified– Some written material may be jointly
authored
– female/male ratio varies slightly
0 0.2 0.4 0.6 0.8 1
TOTAL
spoken
written femalefemale
malemale
p
ICE-GB: gender / written-ICE-GB: gender / written-spokenspoken• Proportion of words in each category
spoken/written by women and men– The authors of some texts are unspecified– Some written material may be jointly
authored
– female/male ratio varies slightly
0 0.2 0.4 0.6 0.8 1
TOTAL
spoken
written femalefemale
malemale
p
pp (female)(female) = words spoken by = words spoken by women /women /
total words (excluding total words (excluding <unknown>)<unknown>)
pp = Probability = Proportion = Probability = Proportion
• We asked ourselves the following question:– Do women speak or write more words
than men in the ICE-GB corpus?– To answer this we looked at the proportion
of words in ICE-GB that are produced by women (out of all words where the gender is known)
pp = Probability = Proportion = Probability = Proportion
• We asked ourselves the following question:– Do women speak or write more words than men in
the ICE-GB corpus?– To answer this we looked at the proportion of words in
ICE-GB that are produced by women (out of all words where the gender is known)
• The proportion of words produced by women can also be thought of as a probability:– What is the probability that, if we were to pick
any random word in ICE-GB (and the gender was known) it would be uttered by a woman?
Another research questionAnother research question
• Let us consider the following question:
• What happens to modal shall vs. will over time in British English?– Does shall increase or decrease?
• What do you think?
• How might we find out?
Lets get some dataLets get some data
• Open DCPSE with ICECUP– FTF query for first person declarative shall:
• repeat for will
Lets get some dataLets get some data
• Open DCPSE with ICECUP– FTF query for first person declarative shall:
• repeat for will– Corpus Map:
• DATE Do the first set of queries and then drop into Corpus
Map}
Modal Modal shallshall vs. vs. willwill over time over time
• Plotting probability of speaker selecting modal shall out of shall/will over time (DCPSE)
shallshall = 100% = 100%
shallshall = 0% = 0%0.0
0.2
0.4
0.6
0.8
1.0
1955 1960 1965 1970 1975 1980 1985 1990 1995
p(shall | {shall, will})
(Aarts et al. 2013)
Modal Modal shallshall vs. vs. willwill over time over time
• Plotting probability of speaker selecting modal shall out of shall/will over time (DCPSE)
0.0
0.2
0.4
0.6
0.8
1.0
1955 1960 1965 1970 1975 1980 1985 1990 1995
p(shall | {shall, will})
(Aarts et al. 2013)
shallshall = 100% = 100%
shallshall = 0% = 0%
Modal Modal shallshall vs. vs. willwill over time over time
• Plotting probability of speaker selecting modal shall out of shall/will over time (DCPSE)
0.0
0.2
0.4
0.6
0.8
1.0
1955 1960 1965 1970 1975 1980 1985 1990 1995
p(shall | {shall, will})
Is shall going up or down?
(Aarts et al. 2013)
shallshall = 100% = 100%
shallshall = 0% = 0%
Is Is shall shall going up or down? going up or down?
• Whenever we look at change, we must ask ourselves two things:
Is Is shall shall going up or down? going up or down? • Whenever we look at change, we must ask ourselves two things:
What is the change relative to?– Is our observation higher or lower than we might expect?
• In this case we ask • Does shall decrease relative to shall +will ?
Is Is shall shall going up or down? going up or down? • Whenever we look at change, we must ask ourselves two things:
What is the change relative to?– Is our observation higher or lower than we might expect?
• In this case we ask • Does shall decrease relative to shall +will ?
How confident are we in our results?– Is the change big enough to be reproducible?
The idea of a confidence The idea of a confidence intervalinterval• All observations are imprecise
– Randomness is a fact of life– Our abilities are finite:
• to measure accurately or • reliably classify into types
• We need to express caution in citing numbers
• Example (from Levin 2013):– 77.27% of uses of think in 1920s data
have a literal (‘cogitate’) meaning
The idea of a confidence The idea of a confidence intervalinterval• All observations are imprecise
– Randomness is a fact of life– Our abilities are finite:
• to measure accurately or • reliably classify into types
• We need to express caution in citing numbers
• Example (from Levin 2013):– 77.27% of uses of think in 1920s data
have a literal (‘cogitate’) meaning
Really? Not 77.28, or 77.26?
The idea of a confidence The idea of a confidence intervalinterval• All observations are imprecise
– Randomness is a fact of life– Our abilities are finite:
• to measure accurately or • reliably classify into types
• We need to express caution in citing numbers
• Example (from Levin 2013):– 77% of uses of think in 1920s data
have a literal (‘cogitate’) meaning
The idea of a confidence The idea of a confidence intervalinterval• All observations are imprecise
– Randomness is a fact of life– Our abilities are finite:
• to measure accurately or • reliably classify into types
• We need to express caution in citing numbers
• Example (from Levin 2013):– 77% of uses of think in 1920s data
have a literal (‘cogitate’) meaning
Sounds defensible. But how confident can we be in this number?
The idea of a confidence The idea of a confidence intervalinterval• All observations are imprecise
– Randomness is a fact of life– Our abilities are finite:
• to measure accurately or • reliably classify into types
• We need to express caution in citing numbers
• Example (from Levin 2013):– 77% (66-86%*) of uses of think in 1920s
data have a literal (‘cogitate’) meaning
The idea of a confidence The idea of a confidence intervalinterval• All observations are imprecise
– Randomness is a fact of life– Our abilities are finite:
• to measure accurately or • reliably classify into types
• We need to express caution in citing numbers
• Example (from Levin 2013):– 77% (66-86%*) of uses of think in 1920s
data have a literal (‘cogitate’) meaning
Finally we have a credible range of values - needs a footnote* to explain how it was calculated.
The ‘sample’ and the The ‘sample’ and the ‘population’‘population’• We said that the corpus was a sample
The ‘sample’ and the The ‘sample’ and the ‘population’‘population’• We said that the corpus was a sample
• Previously, we asked about the proportions of male/female words in the corpus (ICE-GB)– We asked questions about the sample– The answers were statements of fact
The ‘sample’ and the The ‘sample’ and the ‘population’‘population’• We said that the corpus was a sample
• Previously, we asked about the proportions of male/female words in the corpus (ICE-GB)– We asked questions about the sample– The answers were statements of fact
• Now we are asking about “British English”
?
The ‘sample’ and the The ‘sample’ and the ‘population’‘population’• We said that the corpus was a sample
• Previously, we asked about the proportions of male/female words in the corpus (ICE-GB)– We asked questions about the sample– The answers were statements of fact
• Now we are asking about “British English”– We want to draw an inference
• from the sample (in this case, DCPSE)• to the population (similarly-sampled BrE utterances)
– This inference is a best guess– This process is called inferential statistics
Basic inferential Basic inferential statisticsstatistics
• Suppose we carry out an experiment– We toss a coin 10 times and get 5 heads– How confident are we in the results?
• Suppose we repeat the experiment• Will we get the same result again?
Basic inferential Basic inferential statisticsstatistics
• Suppose we carry out an experiment– We toss a coin 10 times and get 5 heads– How confident are we in the results?
• Suppose we repeat the experiment• Will we get the same result again?
• Let’s try…– You should have one coin– Toss it 10 times– Write down how many heads you get– Do you all get the same results?
The Binomial distributionThe Binomial distribution
• Repeated sampling tends to form a Binomial distribution around the expected mean X
F
N = 1
x
531 7 9
• We toss a coin 10 times, and get 5 heads
X
The Binomial distributionThe Binomial distribution
• Repeated sampling tends to form a Binomial distribution around the expected mean X
F
N = 4
x
531 7 9
• Due to chance, some samples will have a higher or lower score
X
The Binomial distributionThe Binomial distribution
• Repeated sampling tends to form a Binomial distribution around the expected mean X
F
N = 8
x
531 7 9
• Due to chance, some samples will have a higher or lower score
X
The Binomial distributionThe Binomial distribution
• Repeated sampling tends to form a Binomial distribution around the expected mean X
F
N = 12
x
531 7 9
• Due to chance, some samples will have a higher or lower score
X
The Binomial distributionThe Binomial distribution
• Repeated sampling tends to form a Binomial distribution around the expected mean X
F
N = 16
x
531 7 9
• Due to chance, some samples will have a higher or lower score
X
The Binomial distributionThe Binomial distribution
• Repeated sampling tends to form a Binomial distribution around the expected mean X
F
N = 20
x
531 7 9
• Due to chance, some samples will have a higher or lower score
X
The Binomial distributionThe Binomial distribution
• Repeated sampling tends to form a Binomial distribution around the expected mean X
F
N = 26
x
531 7 9
• Due to chance, some samples will have a higher or lower score
X
The Binomial distributionThe Binomial distribution• It is helpful to express x as the probability of choosing a head, p, with expected mean P
• p = x / n– n = max. number of
possible heads (10)
• Probabilities are inthe range 0 to 1=percentages
(0 to 100%)
F
p
0.50.30.1 0.7 0.9
P
The Binomial distributionThe Binomial distribution
• Take-home point:– A single observation, say x hits (or p as a
proportion of n possible hits) in the corpus, is not guaranteed to be correct ‘in the world’!
• Estimating the confidence you have in your results is essential
F
p
P
0.50.30.1 0.7 0.9
p
The Binomial distributionThe Binomial distribution
• Take-home point:– A single observation, say x hits (or p as a
proportion of n possible hits) in the corpus, is not guaranteed to be correct ‘in the world’!
• Estimating the confidence you have in your results is essential
– We want to makepredictions about future runs of the same experiment
F
p
P
p
0.50.30.1 0.7 0.9
Binomial Binomial Normal Normal
• The Binomial (discrete) distribution is close to the Normal (continuous) distribution
x
F
0.50.30.1 0.7 0.9
The central limit theoremThe central limit theorem
• Any Normal distribution can be defined by only two variables and the Normal function z
z . S z . S
F
– With more data in the experiment, S will be smaller
p0.50.30.1 0.7
population
mean P
standard deviationS = P(1 – P) / n
The central limit theoremThe central limit theorem
• Any Normal distribution can be defined by only two variables and the Normal function z
z . S z . S
F
2.5% 2.5%
population
mean P
– 95% of the curve is within ~2 standard deviations of the expected mean
standard deviationS = P(1 – P) / n
p0.50.30.1 0.7
95%
– the correct figure is 1.95996!
= the critical value of z for an error level of 0.05.
The single-sample The single-sample zz test...test...
• Is an observation p > z standard deviations from the expected (population) mean P?
z . S z . S
F
P2.5% 2.5%
p0.50.30.1 0.7
observation p• If yes, p is
significantly different from P
...gives us a “confidence ...gives us a “confidence interval”interval”• P ± z . S is the confidence interval for P
– We want to plot the interval about p
z . S z . S
F
P
p0.50.30.1 0.7
2.5% 2.5%
...gives us a “confidence ...gives us a “confidence interval”interval”• P ± z . S is the confidence interval for P
– We want to plot the interval about p
w+
F
P2.5% 2.5%
p0.50.30.1 0.7
observation p
w–
95%
...gives us a “confidence ...gives us a “confidence interval”interval”• The interval about p is called the
Wilson score interval
• This interval reflects the Normal interval about P:
• If P is at the upper limit of p,p is at the lower limit of P
(Wallis, 2013)
F
P2.5% 2.5%
p
w+
observation p
w–
0.50.30.1 0.7
Modal Modal shallshall vs. vs. willwill over time over time
• Simple test: – Compare p for
• all LLC texts in DCPSE (1956-77) with• all ICE-GB texts (early 1990s)
– We get the following data
– We may plot the probabilityof shall being selected,with Wilson intervals
LLC ICE-GB totalshall 110 40 150will 78 58 136total 188 98 286
0.0
0.2
0.4
0.6
0.8
1.0
LLC ICE-GB
p(shall | {shall, will})
Modal Modal shallshall vs. vs. willwill over time over time
• Simple test: – Compare p for
• all LLC texts in DCPSE (1956-77) with• all ICE-GB texts (early 1990s)
– We get the following data
– We may plot the probabilityof shall being selected,with Wilson intervals
0.0
0.2
0.4
0.6
0.8
1.0
LLC ICE-GB
p(shall | {shall, will})LLC ICE-GB total
shall 110 40 150will 78 58 136total 188 98 286
May be input in a
2 x 2 chi-square test
- or you can check Wilson intervals
0.0
0.2
0.4
0.6
0.8
1.0
1955 1960 1965 1970 1975 1980 1985 1990 1995
p(shall | {shall, will})
Modal Modal shallshall vs. vs. willwill over time over time
• Plotting modal shall/will over time (DCPSE)
• Small amounts of data / year
Modal Modal shallshall vs. vs. willwill over time over time
• Plotting modal shall/will over time (DCPSE)
0.0
0.2
0.4
0.6
0.8
1.0
1955 1960 1965 1970 1975 1980 1985 1990 1995
p(shall | {shall, will})• Small amounts
of data / year
• Confidence intervals identify the degree of certainty in our results
Modal Modal shallshall vs. vs. willwill over time over time
• Plotting modal shall/will over time (DCPSE)
0.0
0.2
0.4
0.6
0.8
1.0
1955 1960 1965 1970 1975 1980 1985 1990 1995
p(shall | {shall, will})
• Small amounts of data / year
• Confidence intervals identify the degree of certainty in our results
• Highly skewed p in some cases
– p = 0 or 1 (circled)
Modal Modal shallshall vs. vs. willwill over time over time
• Plotting modal shall/will over time (DCPSE)
0.0
0.2
0.4
0.6
0.8
1.0
1955 1960 1965 1970 1975 1980 1985 1990 1995
p(shall | {shall, will})
• Small amounts of data / year
• Confidence intervals identify the degree of certainty in our results
• We can now estimate an approximate downwards curve
(Aarts et al. 2013)
Recap Recap • Whenever we look at change, we must ask ourselves two things:
What is the change relative to?– Is our observation higher or lower than we might expect?
• In this case we ask • Does shall decrease relative to shall +will ?
How confident are we in our results?– Is the change big enough to be reproducible?
ConclusionsConclusions
• An observation is not the actual value – Repeating the experiment might get different results
• The basic idea of these methods is – Predict range of future results if experiment was
repeated• ‘Significant’ = effect > 0 (e.g. 19 times out of 20)
• Based on the Binomial distribution– Approximated by Normal distribution – many uses
• Plotting confidence intervals• Use goodness of fit or single-sample z tests to compare
an observation with an expected baseline• Use 22 tests or two independent sample z tests to
compare two observed samples
ReferencesReferences
• Aarts, B., J. Close, G. Leech and S.A. Wallis (eds). The Verb Phrase in English: Investigating recent language change with corpora. Cambridge: CUP.– Aarts, B., Close, J., and Wallis, S.A. 2013. Choices over time:
methodological issues in investigating current change. Chapter 2.– Levin, M. 2013. The progressive in modern American English.
Chapter 8.
• Wallis, S.A. 2013. Binomial confidence intervals and contingency tests. Journal of Quantitative Linguistics 20:3, 178-208.
• Wilson, E.B. 1927. Probable inference, the law of succession, and statistical inference. Journal of the American Statistical Association 22: 209-212.
• NOTE: Statistics papers, more explanation, spreadsheets etc. are published on corp.ling.stats blog: http://corplingstats.wordpress.com