Overview of Corpus Linguistics Ling 240


Page 1: Overview of Corpus Linguistics

Overview of Corpus Linguistics

Ling 240

Page 2: Overview of Corpus Linguistics

Outline

• Definition
• History
• Current status

Page 3: Overview of Corpus Linguistics

What is corpus linguistics?

Linguistics: the scientific study of language using…

Corpus: a large and principled collection of natural texts

Page 4: Overview of Corpus Linguistics

History of corpus linguistics

• As early as 1897, Wilhelm Kaeding had 5,000 people compile a corpus of 11 million German words (and calculate their frequency and the distribution of letters).
• In the early 1900s, Otto Jespersen, a Danish professor, filled shoeboxes with thousands of paper slips containing interesting English sentences.
• In 1959, Randolph Quirk started the Survey of English Usage, covering spoken and written language, which he used to create a comprehensive English grammar.

Page 5: Overview of Corpus Linguistics

History of corpus linguistics

• 1961: Brown Corpus
  – 1M words
  – 500 samples of 2,000 words
  – Various genres; printed, edited, American English
• 1961: Lancaster-Oslo/Bergen (LOB) Corpus
  – British version of Brown Corpus
• 1991: Frown and FLOB Corpora
• 1988: International Corpus of English (ICE)
  – World English varieties
  – 20 completed so far

Page 6: Overview of Corpus Linguistics

History of corpus linguistics

• 1991: British National Corpus (BNC)
  – 100M words
  – Wide range of written (90%) and spoken (10%) texts
• 2008: BYU Corpora
  – Corpus of Contemporary American English (COCA)
  – TIME corpus
  – Corpus of Historical American English (COHA)
  – GloWbE Corpus
• International Corpus of Learner English (ICLE)
• MICASE & MICUSP

Page 7: Overview of Corpus Linguistics

Status of corpus linguistics

Is corpus linguistics a branch of linguistics or a method for doing linguistics?
• Evidence for branch:
  – Journals such as Corpora and the International Journal of Corpus Linguistics
  – Some researchers claim corpus linguistics as their area of emphasis
• Evidence for method:
  – Most linguistic phenomena can be measured using CL
  – CL has the potential to inform virtually any theory

Page 8: Overview of Corpus Linguistics

Characteristics of corpus-based analyses

• It is empirical, analyzing the actual patterns of use in natural texts;
• It utilizes a large and principled collection of natural texts, known as a “corpus”, as the basis for analysis;
• It makes extensive use of computers for analysis, using both automatic and interactive techniques;
• It depends on both quantitative and qualitative analytical techniques.

Page 9: Overview of Corpus Linguistics

Uses of corpora

• Changes over time
• Changes in register
• Changes in situation
• Changes in individual

Page 10: Overview of Corpus Linguistics

Time

Page 11: Overview of Corpus Linguistics

Different Genitives

• Of-genitive: the leg of the table
• ’s-genitive: the table's leg
• NN genitive: the table leg

Page 12: Overview of Corpus Linguistics

[Figure: ’s-genitive vs. of-genitive vs. NN sequence. x-axis: Year (1700 to 2000); y-axis: Frequency (0 to 60); series: NN, Of-Gen, S-Gen]

Page 13: Overview of Corpus Linguistics

NN sequence across time in three registers

Page 14: Overview of Corpus Linguistics

Situation

Page 15: Overview of Corpus Linguistics

Phrasal Compression

• Uncompressed
  – The dog that was hungry was looking for something to eat.
  – Drugs that require a prescription should be monitored.
• Compressed
  – The hungry dog was looking for something to eat.
  – Prescription drugs should be monitored.

Page 16: Overview of Corpus Linguistics

Phrasal compression across levels in an EAP reading series

Page 17: Overview of Corpus Linguistics

Phrasal compression across levels in another EAP reading series

Page 18: Overview of Corpus Linguistics

Individual

Page 19: Overview of Corpus Linguistics

[Figure: ‘Abstract Exposition versus Concrete Action’. y-axis: Dimension 2: 'Abstract Narrative versus Concrete Action' (-5 to 5); x-axis (authors): Alcott, Dickens, Eliot, Hawthorne, James, Kipling, Melville, Stevenson, Twain, Wells]

Page 20: Overview of Corpus Linguistics

Corpus Design and Representativeness

Ling 240

Page 21: Overview of Corpus Linguistics

Designing Representative Corpora

• Many people believe that the design of a corpus doesn’t matter as long as it is large enough.
• Researchers typically focus on target domain representativeness and ignore linguistic representativeness
  – Target domain (medical texts, newspapers, academic, general English, spoken)
• Very few corpora are actually evaluated in terms of their representativeness (target domain or linguistic)

Page 22: Overview of Corpus Linguistics

Steps—representing the target domain

1. Describe the target domain
2. Design the corpus to represent target domain
3. Complete the sampling
  • Simple random: randomly choose sections of the data for the corpus
  • Stratified: determine what genres are included and randomly sample from those data
  • Cluster: divide data into naturally occurring groups and sample from them
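The three sampling options can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sampling frame, genre labels, and newspaper "issues" are all invented, and cluster sampling is interpreted here as choosing whole groups at random and keeping every text in them.

```python
import random

# Hypothetical sampling frame: (text_id, genre) pairs standing in for a target domain.
frame = (
    [(f"news_{i}", "news") for i in range(60)]
    + [(f"acad_{i}", "academic") for i in range(30)]
    + [(f"fict_{i}", "fiction") for i in range(10)]
)

# Hypothetical naturally occurring groups: 12 newspaper issues of 5 articles each.
issues = {f"issue_{k}": [f"issue_{k}_article_{j}" for j in range(5)] for k in range(12)}


def simple_random(frame, n, rng):
    """Draw n texts from the whole frame, ignoring genre."""
    return rng.sample(frame, n)


def stratified(frame, per_genre, rng):
    """Draw the same number of texts at random from each genre (stratum)."""
    strata = {}
    for text_id, genre in frame:
        strata.setdefault(genre, []).append((text_id, genre))
    sample = []
    for texts in strata.values():
        sample.extend(rng.sample(texts, per_genre))
    return sample


def cluster(clusters, n_clusters, rng):
    """Pick whole groups at random and keep every text in the chosen groups."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [text for name in chosen for text in clusters[name]]


rng = random.Random(240)  # fixed seed so the draw is repeatable
print(len(simple_random(frame, 15, rng)))
print(len(stratified(frame, 5, rng)))
print(len(cluster(issues, 3, rng)))
```

With this frame, a simple random sample will usually contain few fiction texts because fiction is only 10% of the frame, whereas the stratified sample guarantees five texts per genre.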

Page 23: Overview of Corpus Linguistics

Norming practice

            Text A   Text B
# Nouns     50       100
# Words     200      500

Normed rate = (raw count / total words) * 1000
Text A: (50 nouns / 200 words) * 1000 = 250 nouns per thousand words
Text B: (100 nouns / 500 words) * 1000 = 200 nouns per thousand words
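The norming formula is simple enough to wrap in a small helper. The sketch below is illustrative; the function name `normed_rate` and its `basis` parameter are my own, and the Text B call uses the word count given in the worked calculation.

```python
def normed_rate(raw_count, total_words, basis=1000):
    """Return occurrences per `basis` words: (raw count / total words) * basis."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return raw_count / total_words * basis


print(normed_rate(50, 200))   # Text A: nouns per thousand words
print(normed_rate(100, 500))  # Text B: nouns per thousand words
```

Norming to a common basis is what makes texts of different lengths comparable: Text B has twice as many nouns as Text A in raw terms, but a lower rate.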

Page 24: Overview of Corpus Linguistics

Norming practice

• BNC has 100 million words
• COCA has 450 million words

            BNC              COCA
            #      Per M     #      Per M
snuck       11               767
sneaked     132              830
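The per-million columns can be filled in with the same norming logic. This sketch assumes that the two figures in each row are raw counts for the BNC and COCA respectively, and it uses the rounded corpus sizes stated on the slide.

```python
# Raw counts as read from the slide (assumed: first figure BNC, second COCA).
counts = {
    "snuck": {"BNC": 11, "COCA": 767},
    "sneaked": {"BNC": 132, "COCA": 830},
}
# Corpus sizes in words, as stated on the slide.
sizes = {"BNC": 100_000_000, "COCA": 450_000_000}


def per_million(raw, corpus_size):
    """Norm a raw count to occurrences per million words."""
    return raw / corpus_size * 1_000_000


for word, by_corpus in counts.items():
    for corpus, raw in by_corpus.items():
        rate = per_million(raw, sizes[corpus])
        print(f"{word:8s} {corpus:5s} raw={raw:4d} per_million={rate:.2f}")
```

The raw counts alone would be misleading here because COCA is four and a half times the size of the BNC; only the normed rates can be compared across the two corpora.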

Page 25: Overview of Corpus Linguistics

Corpus Annotation

Ling 240

Page 26: Overview of Corpus Linguistics

Annotation

• Corpora can be annotated for a wide range of external and internal variables.
• External variables
  – Speaker
  – L1 background
  – Gender
  – Extralinguistic information (e.g., laughter, nodding)

Page 27: Overview of Corpus Linguistics

External annotation: example

<Exam ID: 3B>
<Arrangement ID: 54945>
<Center ID: 14>
<Candidate ID: 42285>
<Test Date: 12/6/2013>
<Age: 19>
<Gender: F>
<L1: Arabic>
<Reason for test: B>
<Original MELAB: 2>
<Original Transformed: 3>
<Second MELAB: >
<Second Transformed: >
<End header>

E: Alright, welcome to the MELAB speaking exam, my name is <deleted>. And uh what is your name?
T: Uh my name is uh <deleted>.
E: Now I'll just uh read the MELAB ID number that we have for you. Uh you don't need to know it or anything. The number is <deleted>. Alright now that's out of the way. Uh why don't you uh start by telling me a little bit about why you're taking the MELAB today.
T: Uh actually I came to USA to complete my education here, so uh if I want to go to university I need to get score and to to be good in speak English and I take uh a lab exam so I can enter to the university.
E: Okay uh so uh what are you interested in studying at the university?
T: Actually I think about uh medical science.

Page 28: Overview of Corpus Linguistics

Part of speech tagging

• Rule-based
• Probability-based
• 95%+ accuracy rate
• Some features very easy (e.g., the)
• Some features more difficult (e.g., that)
  – Pronoun (He doesn’t like that.)
  – Determiner (He doesn’t like that dog.)
  – Complementizer (They thought that they could do it.)
  – Relativizer (The thought that I entertained.)

Page 29: Overview of Corpus Linguistics

POS tagging accuracy

• Accuracy
  – Precision – What percent of the cases labeled as X are actually X?
  – Recall – What percent of all of the true cases of X were labeled as X?
• Example
  – He saw that dog that I saw.
  – If both ‘that’s are tagged as determiners:
    • Calculate the precision and recall for determiners
    • Calculate the precision and recall for relativizers
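The exercise above can be checked with a short script. This is a minimal sketch: the tag labels `DET` and `REL` are placeholders of my own, not tags from any real tagset, and only the two occurrences of "that" are scored.

```python
def precision_recall(gold, predicted, tag):
    """Return (precision, recall) for one tag; None where the ratio is undefined."""
    true_pos = sum(1 for g, p in zip(gold, predicted) if g == tag and p == tag)
    labeled = sum(1 for p in predicted if p == tag)  # cases the tagger labeled as `tag`
    actual = sum(1 for g in gold if g == tag)        # true cases of `tag`
    precision = true_pos / labeled if labeled else None
    recall = true_pos / actual if actual else None
    return precision, recall


# The two occurrences of "that" in "He saw that dog that I saw."
gold = ["DET", "REL"]       # correct analysis: determiner, then relativizer
predicted = ["DET", "DET"]  # tagger output: both tagged as determiners

print("determiner:", precision_recall(gold, predicted, "DET"))
print("relativizer:", precision_recall(gold, predicted, "REL"))
```

For determiners, precision is 0.5 (one of the two items labeled as determiner really is one) and recall is 1.0 (the only true determiner was found). For relativizers, recall is 0.0 and precision is undefined because nothing was labeled as a relativizer, which the sketch reports as `None`.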

Page 30: Overview of Corpus Linguistics

POS tagging: two examples

CLAWS Tagger

Everything_PN1 I_PPIS1 've_VH0 read_VVN says_VV0 they_PPHS2 were_VBDR warned_VVN to_TO leave_VVI immediately_RR

Biber Tagger

Everything ^pn++++=Everything
I ^pp1a+pp1+++=I've
've ^vb+hv+aux++0=EXTRAWORD
read ^vprf+++xvbnx+=read
says ^vb+vpub+++=say's
they ^pp3a+pp3+++=they
were ^vbd+bed+aux++=were
warned ^vpsv++agls+xvbnx+=warned
to ^to+vcmp+++=to
leave ^vbi++++=leave
immediately ^rb+tm+++=immediately
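Tagged output in the `word_TAG` layout of the CLAWS example is easy to process once it is split into pairs. The sketch below assumes that every token carries exactly one tag after its last underscore; the helper name `split_tagged` is my own.

```python
from collections import Counter

# The CLAWS-tagged sentence from the slide.
tagged = (
    "Everything_PN1 I_PPIS1 've_VH0 read_VVN says_VV0 they_PPHS2 "
    "were_VBDR warned_VVN to_TO leave_VVI immediately_RR"
)


def split_tagged(text):
    """Split 'word_TAG' tokens into (word, tag) pairs at the last underscore."""
    pairs = []
    for token in text.split():
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs


pairs = split_tagged(tagged)
tag_counts = Counter(tag for _, tag in pairs)

for word, tag in pairs:
    print(f"{word:12s} {tag}")
print(tag_counts.most_common(3))
```

Splitting at the last underscore rather than the first keeps words that themselves contain an underscore intact.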

Page 31: Overview of Corpus Linguistics

Lemmatization

• Lemma
  – The citation or dictionary entry
  – Run is the lemma
  – It includes the words run, running, runs, ran
• We often want the frequency of the lemma, not of a particular word like running
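Counting by lemma rather than by word form can be sketched with a lookup table. The table `LEMMA_OF` below is a tiny hand-built mapping for illustration only, and the example sentence is invented; a real project would use a lemmatizer, and forms such as "running" can be ambiguous between noun and verb.

```python
from collections import Counter

# Hypothetical hand-built lookup covering only the forms from the slide.
LEMMA_OF = {"run": "run", "running": "run", "runs": "run", "ran": "run"}


def lemma_counts(tokens):
    """Count tokens by lemma; unknown forms fall back to their lowercase form."""
    return Counter(LEMMA_OF.get(t.lower(), t.lower()) for t in tokens)


tokens = "She ran while he runs and they were running to run".split()
word_counts = Counter(t.lower() for t in tokens)

print("word form 'running':", word_counts["running"])
print("lemma 'run':", lemma_counts(tokens)["run"])
```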

Page 32: Overview of Corpus Linguistics

Answer these questions about COCA

• What external annotation does it contain?
• What internal annotation does it contain?

Page 33: Overview of Corpus Linguistics

Answer these questions about COCA

• What external annotation does it contain?
  – Text source
  – Date of publication
• What internal annotation does it contain?
  – Lemmatization
  – Part of speech
  – Genre

Page 34: Overview of Corpus Linguistics

Example: s-genitive versus of-genitive

• ‘the bird’s owner’ vs. ‘the owner of the bird’
• Finding 1: “by 1991, the s-genitive had overtaken the of-genitive in frequency” (Leech et al., 2009)
• Finding 2: the of-genitive is almost 10 times more frequent than the s-genitive in present-day English (Longman Grammar)
• Q: Are these findings contradictory?


Page 35: Overview of Corpus Linguistics

Corpus study design—variationist

• Two approaches to corpus linguistics: Variationist and Text-Linguistic (Biber, 2012)
• Variationist: “has the goal of comparing linguistic variants: whether one or the other variant is preferred,” and identifying factors that predict which variant is used (Biber, 2012).
  – Statistics: Binomial/logistic regression; Linear discriminant analysis
  – Interpretation: When a choice can be made, variant X is preferred over variant Y, and factors A, B, and C play a role.


Page 36: Overview of Corpus Linguistics

Variationist Analysis (Type A)

• Unit of analysis is linguistic feature
• Most studies do not take register into account (e.g., collocational studies)
• Comparison of the proportion of use in a particular register
• E.g., Szmrecsanyi & Hinrichs (2008): preference of s-genitive over of-genitive in speech; BUT: s-genitives overall more frequent in writing.
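The difference between a proportional preference and an overall frequency is easier to see with numbers. The counts below are invented purely for illustration and are not taken from Szmrecsanyi & Hinrichs or any other study; the function names are my own.

```python
# Hypothetical counts, invented for illustration only (not from any study).
data = {
    "speech": {"s_gen": 300, "of_gen": 200, "words": 1_000_000},
    "writing": {"s_gen": 800, "of_gen": 1200, "words": 1_000_000},
}


def s_gen_proportion(d):
    """Variationist view: share of s-genitives among all genitive choices."""
    return d["s_gen"] / (d["s_gen"] + d["of_gen"])


def s_gen_rate(d, basis=1000):
    """Frequency view: s-genitives per `basis` words of running text."""
    return d["s_gen"] / d["words"] * basis


for register, d in data.items():
    print(register, round(s_gen_proportion(d), 2), round(s_gen_rate(d), 2))
```

In these made-up figures the s-genitive is the preferred variant in speech (a higher proportion of genitive choices), yet it occurs more often per thousand words in writing, simply because writing contains more genitives of both kinds.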

Page 37: Overview of Corpus Linguistics

Corpus study design—text-linguistic

• Text-Linguistic: “has the goal of providing a linguistic description of texts, by describing the density of grammatical features in texts” (Biber, 2012)

  – Statistics: T-test, ANOVA, Multiple regression, Factor analysis
  – Interpretation: Feature X is more frequent in context A than context B; or Feature X is more frequent than feature Y


Page 38: Overview of Corpus Linguistics

Text-linguistic (Type B)

• Comparison of actual frequency of use in a particular register
• Unit of analysis is text
• Normed rates of occurrence by text
• Much more common for register studies

Page 39: Overview of Corpus Linguistics

Text-linguistic (Type C)

• Also compares frequencies of use in a particular register
• Unit of analysis is subcorpus
• Normed rates of occurrence for features across subcorpora
• Cannot use inferential statistics (need to look at individual text to get a mean score for the register)
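The contrast between taking the text (Type B) and the subcorpus (Type C) as the unit of analysis can be sketched as follows. The per-text counts are invented for illustration, and the variable names are my own.

```python
from statistics import mean, stdev

# Hypothetical register: (feature count, word count) for each text.
texts = [(12, 1000), (30, 2000), (5, 500), (40, 4000)]

# Type B: norm each text separately, then summarize across texts.
per_text_rates = [count / words * 1000 for count, words in texts]
type_b_mean = mean(per_text_rates)
type_b_sd = stdev(per_text_rates)  # dispersion is available because texts are the unit

# Type C: pool all texts into one subcorpus and norm once.
total_count = sum(count for count, _ in texts)
total_words = sum(words for _, words in texts)
type_c_rate = total_count / total_words * 1000

print("Type B mean rate per 1,000 words:", round(type_b_mean, 2))
print("Type B standard deviation:", round(type_b_sd, 2))
print("Type C pooled rate per 1,000 words:", round(type_c_rate, 2))
```

The pooled Type C figure is a single number with no variance around it, which is why inferential statistics such as a t-test need the per-text rates of Type B. The two summaries can also differ, since pooling weights long texts more heavily.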

Page 40: Overview of Corpus Linguistics

Quantitative analysis

• Coding/tagging features
• Counts in text vs. subcorpus
• Norming (raw count/total words * 1000)
• Use appropriate statistical tests if applicable

Page 41: Overview of Corpus Linguistics

Kinds of Corpora

• Spoken language
• General corpora (mainly written)
• Bitext (two languages side-by-side)
• Specialized
  – Children’s speech
  – L2 learner speech
• Historical

Page 42: Overview of Corpus Linguistics

General Corpora

• Mainly written
• British National Corpus (BNC)
  – 100 million words
  – 10% spoken
  – 25% fiction
  – 75% non-fiction

Page 43: Overview of Corpus Linguistics

General Corpora

• Corpus of Contemporary American English (COCA)
  – 450 million words (more added every year)
  – Divided into registers
    • Spoken
    • Fiction
    • Academic

Page 44: Overview of Corpus Linguistics

General Corpora

• International Corpus of English
  – 1 million words from each country
  – 60% spoken, 40% written

Page 45: Overview of Corpus Linguistics

Historical Corpora

• Helsinki Corpus
  – English texts from 770-1700
• Corpus of Historical American English (COHA)
  – 1860-present

Page 46: Overview of Corpus Linguistics

Introduction to COHA

Corpus of Historical American English
• End up verbing
• Try and verb versus try to verb
• Adjectives and nouns used in 2000s not before
• Collocates of Muslim, liberal, Mormon
• ?

Page 47: Overview of Corpus Linguistics

Raw Corpora

• Not easily searchable
• Not tagged
• Project Gutenberg
  – Pre 1928 books (copyright expired)
• Online newspapers
• Time Magazine
• The internet
• General Conference

Page 48: Overview of Corpus Linguistics

Where can you get corpora?

• Online
  – BNC, COCA, COHA
• Distributors (membership required)
  – ELRA (based in Europe)
  – Linguistic Data Consortium
    • US based
    • BYU has a membership
    • Catalog
    • Top 10 corpora