rafe-summers
IA901 2012 Session Four
• Lab Session: Corpora
• What is a corpus?
• What do corpora tell us about the English language?
• Corpus-driven language description
• Practical application of corpora in the classroom
A link to last week…
HONIED or HONEYED?
ENJOY → ENJOYED
PLAY → PLAYED
WORRY → WORRIED
HURRY → HURRIED
MONEY → MONIED / MONEYED?
“Honeyed” is almost 40 times as common (online) as “honied”
Option 1              Option 2              Preferred plural form?
cowboys               cowsboy
cowgirls              cowsgirl
breakfasts            breaksfast
christmases           christsmas
businesses            busiesness
girl from Ipanemas    girls from Ipanema
mother-in-laws        mothers-in-law
gin and tonics        gins and tonic
tablespoonfuls        tablespoonsful
work of arts          works of art
hole in ones          holes in one
passerbys             passersby
governor-generals     governors-general
POWs                  POW
Also in relation to last week’s session, I found that:
“mother-in-laws” is almost 50% more common than “mothers-in-law”
“tablespoonfuls” is 12 times more common than “tablespoonsful”
“passersby” is almost 17 times more common than “passerbys”
“gin and tonics” is over 60 times more common than “gins and tonic”
“works of art” is over 250 times more common than “work of arts”
What is a corpus?
What can it tell us?
Where do you think this word list comes from?
And this?
created using wordle.net
So…
is my IA902 corpus a “principled collection of texts available for qualitative and quantitative analysis”? (Biber, Conrad, Reppen, 1998)
A history of corpora
1700s: Dr Johnson wrote the first comprehensive dictionary of English, compiled by manually collating samples of language from 1560-1660.
1960s: Brown Corpus of Standard American English: first of the modern, computer-readable, general corpora
1980s: John Sinclair & colleagues: Collins Birmingham University International Language Database (COBUILD)
1987: Collins COBUILD English Dictionary
1990: Willis: The Lexical Syllabus
2007: Cambridge International Corpus => 1 billion words
ANC: American National Corpus
BASE: British Academic Spoken English
BNC: British National Corpus
BoE: Bank of English
BROWN: Brown University
CIC: Cambridge International Corpus
CANCODE: Cambridge & Nottingham Corpus of Discourse in English
COBUILD: Collins Birmingham University International Language Database
MICASE: Michigan Corpus of Academic Spoken English
Corpora not limited to general or native-speaker data:
Business and Academic corpora
The International Corpus of Learner English
VOICE (The Vienna-Oxford International Corpus of English) is a collection of English as a Lingua Franca
Corpus development with the idea of SUEs (Successful Users of English) as a model
How big does a corpus need to be?
What do corpora tell us?
• Frequency of individual words
• Frequency of “chunks”
Frequency of individual words
Rank  Word        Freq   %
1     I           13     9.77
2     YESTERDAY   9      6.77
3     TO          8      6.02
4     MM          6      4.51
5     NOW         5      3.76
6     OH          4      3.01
7     SHE         4      3.01
8     A           3      2.26
9     AWAY        3      2.26
10    BELIEVE     3      2.26
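A frequency list like this one can be produced with a few lines of code. The sketch below is only an illustration, not the tool used for the slide: the tokeniser is a crude assumption (runs of letters and apostrophes), and the function name `frequency_list` and the sample sentence are invented. Note that the percentages in the table are consistent with a text of 133 running words (13 / 133 ≈ 9.77%).

```python
from collections import Counter
import re


def frequency_list(text, top=10):
    """Return (rank, word, count, percent) rows for the most frequent words."""
    # Assumption: a "word" is any run of letters or apostrophes, upper-cased.
    tokens = re.findall(r"[A-Za-z']+", text.upper())
    total = len(tokens)
    rows = []
    for rank, (word, count) in enumerate(Counter(tokens).most_common(top), start=1):
        # Percent = share of all running words (tokens) in the text.
        rows.append((rank, word, count, round(100 * count / total, 2)))
    return rows


# Invented sample text, purely for illustration.
sample = "the cat sat on the mat and the dog sat too"
for row in frequency_list(sample, top=3):
    print(row)
```

Real corpus tools make more careful decisions about tokenisation (contractions, hyphens, numbers), so counts from a sketch like this will not match published lists exactly.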
Within a bigger corpus (say, 5 million words), which words would you expect to occur most frequently? Write down 10 words that you’d expect to be in the top 50.
What differences would you expect to find between lists of the most frequent words in corpora of WRITTEN and SPOKEN English?
From O’Keeffe et al (2007)
O’Keeffe et al (2007) divide the 2000 most frequently occurring words in the CIC and CANCODE corpora into 4 sub-lists: A = 1-500, B = 501-1000, C = 1001-1500, D = 1501-2000.
Can you identify the most frequently-occurring word in each set below?
A B C D
possibly, must, seem
just, clearly, honestly, pretty
house, TV, cheese, kids
sad, brilliant, lovely, terrible
eventually, always, usually, generally
explain, accept, help, listen
A B C D
must seem possibly
just pretty clearly honestly
house kids TV cheese
lovely terrible brilliant sad
always usually eventually generally
help listen explain accept
“The broad categories of a basic vocabulary” (O’Keeffe et al, 2007)
A B C D
MODAL ITEMS: must seem possibly
STANCE WORDS: just pretty clearly honestly
BASIC NOUNS: house kids TV cheese
BASIC ADJECTIVES: lovely terrible brilliant sad
BASIC ADVERBS: always usually eventually generally
BASIC VERBS FOR ACTIONS AND EVENTS: help listen explain accept
DELEXICAL VERBS: do
DISCOURSE MARKERS: so
GENERAL DEICTICS: here
Three Relevant Word lists?
The General Service List (Michael West, 1953)
The Academic Word List (Averil Coxhead, 2000)
The Academic Keyword List (Magali Paquot, 2010)
Good news for the beginner?
Bad news for the advanced-level student?
From O’Keeffe et al (2007)
Frequency of “chunks”
• Collocation
• Strings of words
• Colligation
Definitions
Biber et al (2002):
Collocation : “a combination of lexical words which frequently co-occur in texts”
Lexical Bundle : “a sequence of words which is used repeatedly in texts”
Alternatives:
Collocation:
- “just the way we say it”?
- “the occurrence of two or more words within a short space of each other in a text” (Sinclair, 1991)
- “the relationship a lexical item has with items that appear with greater than random probability in its (textual) context” (Hoey, 1991)
- “a psychological association between words (rather than lemmas) up to four words apart … evidenced by their occurrence together in corpora more often than is explicable in terms of random distribution” (Hoey, 2005)
- “the lexical company that words keep” (Hoey, 2011)
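The window-based definitions above (words “within a short space of each other”, “up to four words apart”, occurring together “more often than is explicable in terms of random distribution”) can be sketched in code. This is a rough illustration under stated assumptions, not the method of any of the cited authors: the function names are invented, the span of four words either side follows Hoey’s wording, and the mutual information score is just one commonly used way of comparing observed with expected co-occurrence.

```python
from collections import Counter
import math


def collocates(tokens, node, span=4):
    """Count words occurring within `span` words either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts


def mutual_information(tokens, node, collocate, span=4):
    """log2(observed / expected); a positive score suggests more than chance."""
    n = len(tokens)
    freq = Counter(tokens)
    observed = collocates(tokens, node, span)[collocate]
    if observed == 0:
        return float("-inf")
    # Assumption: expected co-occurrence if words were scattered at random,
    # approximated as f(node) * f(collocate) * window size / corpus size.
    expected = freq[node] * freq[collocate] * (2 * span) / n
    return math.log2(observed / expected)


# Invented toy "corpus", purely for illustration.
toy = "strong tea and strong coffee but weak tea".split()
print(collocates(toy, "tea", span=1))
```

On a text this small the scores mean very little; measures of this kind only become informative on corpora of many thousands of words.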
Collocations Dictionaries
What words collocate with both STUDY and RESEARCH?
“Chunks” : how long? How significant?
Put the following items in order of the frequency with which they are used in spoken English:
a) a bit of
b) and things like that
c) regularly
d) since
e) this that and the other
f) twice
From O’Keeffe et al (2007)
a couple of, possible, at the moment, alone, all the time, fun, in terms of, something like that, expensive, you know what I mean, stairs, at the same time, nowhere
From O’Keeffe et al (2007)
Commonly-occurring six-word chunks:
1. Do you know _______ _______ _______?
2. At the end _______ _______ _______
3. And all the rest _______ _______
4. And all that sort _______ _______
5. I don’t know _______ _______ _______

6. Do you know what I mean?
7. At the end of the day
8. And all the rest of it
9. And all that sort of thing
10. I don’t know what it is
From O’Keeffe et al (2007)
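Counting chunks of a fixed length is mechanically simple, which is partly why corpus software can rank them so easily. The sketch below is a minimal illustration, assuming the text has already been split into words; `chunk_counts` is an invented name and the sample sentence is made up.

```python
from collections import Counter


def chunk_counts(tokens, n):
    """Count every contiguous n-word sequence (n-gram) in a token list."""
    # zip over n staggered copies of the list yields each run of n words.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Invented sample, purely for illustration.
tokens = "at the end of the day it was a bit of a mess at the end of the day".split()
print(chunk_counts(tokens, 6).most_common(1))
print(chunk_counts(tokens, 2)["a bit"])
```

A count like this treats every recurring string as a chunk; deciding which of them are meaningful units for teaching is a separate, human judgement.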
“a bit” is the 24th most common two-word chunk in CANCODE
But what does “a bit” mean? Does it have any meaning by itself?
How meaningful is “a bit” as a quantifier?
What about its “hedging” function?
It also belongs to several “frames”:
e.g. it was a bit of a mess / problem / performance / hassle / nuisance / bargain
COLLIGATION: Where lexis meets grammar?
Data on language usage tells us that:
• “a bit” is more likely than “the bit”
• “a bit” is likely to be followed by “of” + NP
• “a bit” is more likely to be used in an object position than a subject position
1. Which preposition is most likely to follow DIFFERENT – TO or FROM?
               DIFFERENT TO    DIFFERENT FROM
Brown          0               35
BNC Written    4               22
BNC Spoken     21              12
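Raw counts from corpora of different sizes are easier to compare as proportions. The short sketch below works through the arithmetic for the counts given above; the helper name `share_of_to` is invented for illustration, and the calculation assumes these two patterns are the only ones being compared.

```python
# Counts copied from the table above: (different to, different from).
counts = {
    "Brown": (0, 35),
    "BNC Written": (4, 22),
    "BNC Spoken": (21, 12),
}


def share_of_to(to_count, from_count):
    """Percentage of the two patterns that use 'to', rounded to one decimal."""
    return round(100 * to_count / (to_count + from_count), 1)


for corpus, (to_count, from_count) in counts.items():
    print(f"{corpus}: {share_of_to(to_count, from_count)}% 'different to'")
```

On these figures “different to” accounts for none of the Brown examples, roughly 15% of the BNC Written examples and roughly 64% of the BNC Spoken examples, although the totals are small.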
From the Compleat Lexical Tutor:
Entitle – Active or Passive?
ILLUSTRATE and DRAW
Among the many differences you may have found between these two words, did you discover anything about COLLIGATION?
DRAW is a more frequent item than ILLUSTRATE.
Both verbs are frequently preceded by “to”.
Relatively speaking, ILLUSTRATE occurs significantly more frequently with “to” than DRAW does.
ILLUSTRATE is frequently used in INFINITIVE CLAUSES.
To illustrate this, we can compare concordance lists of each word using any of the websites linked to on the IA902 blog.
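For anyone curious about what a concordancer does behind the scenes, here is a minimal sketch. It is not how any of the linked websites are implemented: the function names `concordance` and `preceded_by` are invented, the sample sentence is made up, and the text is assumed to be already split into words.

```python
def concordance(tokens, node, width=4):
    """Return key-word-in-context lines: `width` words either side of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so the node words line up.
            lines.append(f"{left:>30} [{tok}] {right}")
    return lines


def preceded_by(tokens, node, word):
    """Proportion of occurrences of `node` immediately preceded by `word`."""
    hits = [i for i, tok in enumerate(tokens) if tok.lower() == node.lower()]
    if not hits:
        return 0.0
    matches = sum(1 for i in hits if i > 0 and tokens[i - 1].lower() == word.lower())
    return matches / len(hits)


# Invented sample, purely for illustration.
tokens = "To illustrate the point we draw a map to illustrate it".split()
for line in concordance(tokens, "illustrate"):
    print(line)
print(preceded_by(tokens, "illustrate", "to"), preceded_by(tokens, "draw", "to"))
```

Comparing the `preceded_by` proportion for two verbs is one simple way of checking a colligational claim such as the one about “to” above.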
Widening context / Narrowing meaning
• Written and spoken contexts
• Semantic association
• Semantic prosody
Differences in spoken and written English:
- data on spoken English reflects an orientation to the “speaker-listener world in conversation” (I, you)
- spoken discourse markers (well, right)
- high frequency items that are arguably not words at all (yeah, oh, er)
What functions do ABSOLUTELY and DEFINITELY have in spoken English?
What would you expect to be the most common uses of the words LIKE and MEAN?
Collocates for LIKE & MEAN (BNC Written & Spoken + Brown)
LIKE: would=35 look=27 was=25 I=20 looked=18 looks=17 and=15 more=15 just=14 not=14 something=13 is=12 much=11 the=11 you=11 feel=10
MEAN: I=611 you=86 not=38 the=29 would=27 to=13 we=11 Didn’t=10 may=10 will=9 a=8 could=8 that=8 Don’t=7 it=7 can=6 necessarily=6 (mm=2)
Semantic association
Semantic prosody
Collocations: inner ear, glue ear; a clip round the ear; she whispered in his ear; ear, nose, and throat doctor; hear a voice in your ear
Semantic association: parts of the body
Semantic prosody???
What’s the difference between SKINNY and SLIM?
Slim: elegant, graceful
Skinny: sick, shy?
Differences between HANDSOME and PRETTY
How would you explain the difference between CAUSE and PROVIDE?
Materials
Corpus-informed publications for students
For teachers: Corpus-informed or “impulse-based”?
Activities
From Cobb (1997)
Discussion
Disadvantages?
- overly-reliant on technology?
- does navigation of corpora also require an element of “instinct”?
- the dangers of becoming “corpus-bound”
From O’Keeffe et al (2007)
For further exploration
- what do corpora tell us about existing theories of language? (see Hoey, 2005)
- how can YOU use corpora in your teaching?
- what use can you make of corpora in your research?