CL timetable Mon Tues Weds Thurs Fri 25 26 27 28 29 1 2 3 4 5 8 9 10 11 12 7th April (Sun): Friday timetable


CL timetable

Mon   Tues   Weds   Thurs   Fri
25    26     27     28      29
1     2      3      4       5
8     9      10     11      12

7th April (Sun): Friday timetable


CL timetable

• 27/03 (Wed) 18:30-21:30 E6-224

• 28/03 (Thu) 13:15-16:40 E6-219

• 29/03 (Fri) 14:05-17:30 E6-219

• 03/04 (Wed) 18:30-21:30 E6-224

• 07/04 (Sun, Friday timetable) 14:05-17:30 E6-219

• 10/04 (Wed) 18:30-21:30 E6-224

• 11/04 (Thu) 13:15-16:40 E6-219

• 12/04 (Fri) 14:05-17:30 E6-219


Introducing Corpus Linguistics

Corpus Linguistics

Richard Xiao

[email protected]


Module description

• Since the 1990s, the corpus methodology has revolutionized nearly all branches of linguistics

– Corpus analysis can be illuminating in “virtually all branches of linguistics or language learning.” (Leech 1997)

• One of the strengths of corpus data lies in its empirical and attested nature

– … pools together the intuitions of a great number of speakers

– … makes linguistic analysis more objective

• This module

– … introduces the theoretical and practical issues of using corpora in linguistic studies

– … explores how the corpus-based approach and other methodologies can be combined in linguistic studies


Aims of the module

• The module aims to

– provide an introduction to corpus linguistics;

– familiarise students with major corpus resources and tools;

– pass on essential knowledge and skills for building DIY corpora;

– keep students up to date with the latest developments in corpus research;

– develop students’ ability in corpus-based language studies.


Contents

1) Introducing corpus linguistics

2) Corpus design and types of corpora

3) Data capture and markup

4) Corpus annotation

5) Making statistic claims

6) Corpus analysis (1): concordance and wordlist

7) Corpus analysis (2): keyword analysis

8) Corpora in lexicographic and lexical studies

9) Corpora in grammatical studies

10) Corpora in diachronic studies

11) Corpora in language variation research

12) Corpora in sociolinguistic studies

13) Corpora in language education

14) Corpora in literary and stylistic studies

15) Corpora in critical discourse analysis

16) Corpora in contrastive and translation studies


Learning outcomes

• On successful completion of the module, students will be able to

– understand the major theoretical frameworks in corpus linguistics and formulate research questions that are amenable to corpus research;

– think critically about the strengths and weaknesses of the corpus methodology and decide when and how to interface it with other methodologies;

– become familiar with major corpus resources and tools and develop DIY corpora when necessary;

– apply the corpus-based approach in their own research.


Teaching/learning strategies

• With a dual focus on ‘why’ and ‘how to’ in corpus-based language studies, this practical module will be delivered through a series of lectures and hands-on lab sessions

• The module also engages students in extensive reading and interaction with corpus data outside of class


Assessment

• Option A

– A 1,000-word essay that critically reviews a corpus exploration tool or a corpus-based study (40%)

– A 2,500-word project report (60%)

• Option B

– One 3,500-word essay based on a research project of your own choice (100%)

• Deadline: Friday 31 May 2013

• Submission

– A Word copy as email attachment


Reading list

• Set text

– McEnery, A., Xiao, R. and Tono, Y. (2006) Corpus-Based Language Studies: An Advanced Resource Book. London & New York: Routledge.

– Wynne, M. (2005) Developing Linguistic Corpora. Oxford: Oxbow Books. Available online at http://www.ahds.ac.uk/creating/guides/linguistic-corpora

• Recommended reading

– See the module syllabus at the course website

– www.lancs.ac.uk/fass/projects/corpus/ZJU/CL_syllabus.htm (pass for unzipping ebooks: lancs)


Outline of this session

• Lecture: introducing key concepts and debates in corpus linguistics

– What is and is not a corpus?

– Why use corpora?

– Corpora vs. intuitions

– The corpus methodology

– A brief history of Corpus Linguistics

– Nature and applications of corpus-based studies

• Lab: testing your intuitions + exploring online resources


What is a corpus?

• The word corpus comes from Latin (“body”) and the plural is corpora

• A corpus is a body of naturally occurring language

– … but rarely a random collection of text

– Corpora “are generally assembled with particular purposes in mind, and are often assembled to be (informally speaking) representative of some language or text type.” (Leech 1992)

• “A corpus is a collection of (1) machine-readable (2) authentic texts (including transcripts of spoken data) which is (3) sampled to be (4) representative of a particular language or language variety.” (MXT 2006: 5)


What is not a corpus?

• A list of words is not a corpus

– Building blocks of language

• A text archive is not a corpus

– A random collection of texts

• A collection of citations is not a corpus

– A short quotation which contains a word or phrase that is the reason for its selection

• A collection of quotations is not a corpus

– A short selection from a text chosen on internal criteria by human beings

• A text is not a corpus

– Intended to be read in different ways

• The Web is not a corpus

– Its dimensions unknown, constantly changing, not designed from a linguistic perspective

Sinclair (2005)


What is a corpus for?

• A corpus is made for the study of language in a broad sense

– To test existing linguistic theory and hypotheses

– To generate and verify new linguistic hypotheses

– Beyond linguistics, to provide textual evidence in text-based humanities and social sciences subjects

• The purpose is reflected in a well-designed corpus


Why use corpora?

• Even expert speakers have only a partial knowledge of a language

– A corpus can be more comprehensive and balanced

• Even expert speakers tend to notice the unusual and think of what is possible

– A corpus can show us what is common and typical

• Even expert speakers cannot quantify their knowledge of language

– A corpus can readily give us accurate statistics


Why use corpora?

• Even expert speakers cannot remember everything they know

– A corpus can store and recall all the information that has been stored in it

• Even expert speakers cannot make up natural examples

– A corpus can provide us with a vast number of examples in real communication context

• Even expert speakers have prejudices and preferences and every language has cultural connotations and underlying ideology

– A corpus can give you more objective evidence


Why use corpora?

• Even expert speakers are not always available to be consulted

– A corpus can be made permanently accessible to all

• Even expert speakers cannot keep up with language change

– A constantly updated corpus can reflect even recent changes in the language

• Even expert speakers lack authority: they can be challenged by other expert speakers

– A corpus can encompass the actual language use of many expert speakers


Intuitions as an alternative

• Intuitions are always useful in linguistics

– To invent (grammatical, ungrammatical, or questionable) example sentences for linguistic analysis

– To make judgments about the acceptability / grammaticality or meaning of an expression

– To help with categorization


Intuitions as an alternative

• Intuitions should be applied with caution

– Possibly biased as they are likely to be influenced by one’s dialect or sociolect

– Introspective data is artificial and may not represent typical language use as one is consciously monitoring one’s language production

– Introspective data is decontextualized because it exists in the analyst’s mind rather than in any real linguistic context

– Intuitions are not observable and verifiable by everyone as corpora are

– Excessive reliance on intuitions blinds the analyst to the realities of language usage because we tend to notice the unusual but overlook the commonplace

– There are areas in linguistics where intuitions cannot be used reliably e.g. language variation, historical linguistics, register and style, first and second language acquisition

– Human beings have only the vaguest notion of the frequency of a construct or a word


Benefits of corpus data

• Corpus data is more reliable

– A corpus pools together linguistic intuitions of a range of language speakers, which offsets the potential biases in intuitions of individual speakers

• Corpus data is more natural

– It is used in real communications instead of being invented specifically for linguistic analysis

• Corpus data is contextualized

– Attested language use which has already occurred in real linguistic context

• Corpus data is quantitative

– Corpora can provide frequencies and statistics readily

• Corpus data can find differences that intuitions alone cannot perceive

– E.g. synonyms totally, absolutely, utterly, completely, entirely


Corpora vs. intuitions

• Not necessarily antagonistic, but rather corroborate each other and can be gainfully viewed as being complementary

– Armchair linguists and corpus linguists “need each other. Or better, […] the two kinds of linguists, wherever possible, should exist in the same body.” (Fillmore 1992)

– “Neither the corpus linguist of the 1950s, who rejected intuitions, nor the general linguist of the 1960s, who rejected corpus data, was able to achieve the interaction of data coverage and the insight that characterize the many successful corpus analyses of recent years.” (Leech 1991)

• The key to using corpus data is to find the balance between the use of corpus data and the use of one’s intuitions


The corpus methodology

• It is debatable whether CL is a methodology or a branch of linguistics

– CL goes well beyond this methodological role and has become an independent discipline

• In spite of the name, CL is indeed a methodology rather than an independent branch of linguistics in the same sense as phonetics, syntax, semantics or pragmatics

– These latter areas of linguistics describe, or explain, a certain aspect of language use

– Corpus linguistics, in contrast, is not restricted to a particular aspect of language - it can be employed to explore almost any area of linguistic research


A brief history of CL

• The term corpus linguistics first appeared only in the early 1980s, but corpus-based language study has a substantial history

• The history of CL can be split into two periods: before and after Chomsky


A brief history of CL

• Before Chomsky

– Field linguists and linguists of the structuralist tradition used “shoebox corpora” – shoeboxes filled with paper slips

• Their methodology was essentially “corpus-based” in the sense that it was empirical and based on observed data

– The work of early corpus linguistics was underpinned by two fundamental, yet flawed assumptions

• The sentences of a natural language are finite.

• The sentences of a natural language can be collected and enumerated.

– Most linguists saw the “corpus” as the only source of linguistic evidence in the formation of linguistic theories


A brief history of CL

• Chomsky revolution: Between 1957 and 1965 Chomsky changed the direction of linguistics from empiricism towards rationalism

– “Any natural corpus will be skewed. Some sentences won’t occur because they are obvious, others because they are false, still others because they are impolite. The corpus, if natural, will be so wildly skewed that the description would be no more than a mere list.” (Chomsky 1962)

– Our internal knowledge of language in the human brain (competence, I-language) replaces observed data (performance, E-language)

– Intuitions started to be relied on as evidence

• [Xiao, R. (2008) “Theory-driven corpus research: using corpora to inform aspect theory”. In A. Lüdeling & M. Kytö (eds.) Corpus Linguistics: An International Handbook. Berlin: Mouton de Gruyter]


A brief history of CL

• Revival of CL

– Corpus research was continued in a few centres (Brown, Lancaster) in the 60s-70s

• The Brown University Standard Corpus of Present-day American English (Brown corpus)

• Lancaster-Oslo-Bergen Corpus of BrE (LOB)

– The hardware still imposed some restrictions until the real development started in the 1980s

• The marriage of corpora with computer technology rekindled interest in the corpus methodology

• Since then, the number and size of corpora and corpus-based studies have increased dramatically

– Nowadays, the corpus methodology enjoys widespread popularity, and has opened up or foregrounded many new areas of research


Areas that have used corpora

• Lexicography

• Lexical studies

• Grammatical studies

• Register/genre analysis

• Language variation

• Contrastive analysis

• Translation studies

• Language change

• Language teaching

• Semantics

• Pragmatics

• Stylistics

• Literary study

• Sociolinguistics

• Discourse analysis

• Forensic linguistics

• Computational linguistics

• …


Nature of corpus-based approach

• It is empirical, analysing the actual patterns of use from natural texts

• It utilises a large and principled collection of natural texts as the basis for analysis

• It makes extensive use of computers for analysis, using both automatic and interactive techniques

• It integrates both quantitative and qualitative analytical techniques

(Biber et al 1998: 4-5)


Why use computers?

• Development of computer technology has revived CL

• Machine-readability is a de facto attribute of modern corpora

• Electronic corpora have advantages unavailable to their “shoebox” ancestors

– It is the use of computerized corpora, together with computer programs which facilitate linguistic analysis, that distinguishes modern electronic corpora from early ‘drawer-cum-slip’ corpora


Why use computers?

• Computerized corpora can be processed and manipulated rapidly at minimal cost

– E.g. searching, selecting, sorting and formatting

• Computers can process machine-readable data accurately and consistently

• Computers can avoid human bias in an analysis, thus making the result more reliable

• Machine-readability allows further automatic processing to be performed on the corpus so that corpus texts can be enriched with various metadata and linguistic analyses

– Corpus markup and corpus annotation
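The kind of searching and formatting mentioned on this slide can be illustrated with a small keyword-in-context (KWIC) display. This is only a minimal sketch: the function name `kwic`, the `width` parameter and the sample sentences are invented for illustration and are not taken from any of the tools used in this module.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of `keyword`."""
    lines = []
    # \b restricts matches to whole words; the search ignores case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keyword lines up in one column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right:<{width}}")
    return lines

# Invented sample text (an assumption, not corpus data).
sample = ("The storm caused damage to the roof. "
          "Nobody knows what caused the delay. "
          "The cause of the fire is unknown.")

hits = kwic(sample, "caused", width=20)
for line in hits:
    print(line)
```

Real concordancers add sorting by left or right context and work on tokenised, annotated text rather than raw character offsets.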


A question for Deep Thought

“Alright,” said the computer Deep Thought. “The Answer to the Great Question...”

“Yes...!”

“Of Life, the Universe and Everything ...” said Deep Thought.

“Yes...!”

“Is...”

“Yes...!!!...?”

“Forty-two,” said Deep Thought, with infinite majesty and calm.

It was a long time before anyone spoke.

“Forty-two!” yelled someone in the audience. “Is that all you’ve got to show for seven and a half million years’ work?”

“I checked it very thoroughly,” said the computer, “and that quite definitely is the answer. I think the problem, to be quite honest with you, is that you’ve never actually known what the question is.”

The Hitchhiker’s Guide to the Galaxy by Douglas Adams

What can we learn from this story?


What corpora cannot do

• Corpora do not provide negative evidence

– Cannot tell us what is possible or not possible

– Can show what is central and typical in language

• Corpora can yield findings but rarely provide explanations for what is observed

– Interfacing other methodologies

• The use of corpora as a methodology also defines the boundaries of any given study

– Importance of amenable research questions

• The findings based on a particular corpus only tell us what is true in that corpus

– Generalisation vs. representativeness

• See Unit B2 for pros and cons of corpora


Ask corpora the right questions

• Corpus linguistics as a methodology is only one of the (many) ways of doing things – “doing linguistics”

• The usefulness of corpora depends upon the research question being investigated

– “They are invaluable for doing what they do, and what they do not do must be done in another way.” (Hunston 2002: 20)

• The development of the corpus-based approach as a tool in language studies has been compared to the invention of telescopes in astronomy

– If it is ridiculous to criticize a telescope for not being a microscope, it is equally pointless to criticize the corpus-based approach for not doing what it is not intended to do

• It is up to you to formulate research questions amenable to corpus-based investigation and to decide how to combine corpora with other resources


Testing your intuitions with BYU-BNC

http://corpus.byu.edu/bnc/


Most common noun in English

Search for [n*]

Top 10: time, people, way, years, year, work, government, day, man, world
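A search such as [n*] amounts to counting word forms whose part-of-speech tag falls in the noun class. The sketch below imitates that on a handful of hand-tagged tokens. The token list is invented, the tags follow the CLAWS5 style used for the BNC, and matching on a tag prefix is only a rough stand-in for the query syntax of the web interface.

```python
from collections import Counter

# Invented (word, tag) pairs; in CLAWS5-style tagsets noun tags begin with "N".
tagged = [("the", "AT0"), ("time", "NN1"), ("people", "NN0"), ("spend", "VVB"),
          ("time", "NN1"), ("in", "PRP"), ("years", "NN2"), ("time", "NN1"),
          ("people", "NN0")]

def top_by_tag_prefix(tokens, prefix, n=10):
    """Count lower-cased word forms whose tag starts with `prefix`."""
    counts = Counter(word.lower() for word, tag in tokens if tag.startswith(prefix))
    return counts.most_common(n)

top_nouns = top_by_tag_prefix(tagged, "N", n=3)
print(top_nouns)  # [('time', 3), ('people', 2), ('years', 1)]
```

Passing a narrower prefix such as "NN" would restrict the count to common nouns, in the spirit of the [nn*] search on the adverts slide.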


Most common noun in adverts

Search for [nn*] in w-advert

Top 20: hotel, centre, time, world, holiday, day, service, year, house, facilities, range, club, bar, years, information, rooms, people, city, life, castle


Top 10 adj. in nonfiction vs. fiction

Top 10 in Nonfiction: aggregate, regulatory, offline, Keynesian, non-executive, macroeconomic, no-arbitrage, nationalised, short-run, pioneering

Top 10 in Fiction: Sabine, narrowed, unsmiling, flushed, clammy, navy-blue, sidelong, muttered, strangled, froggy
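One simple way to surface words that are distinctive of one register is to rank them by the ratio of their relative frequencies in two subcorpora. The slides do not say which measure produced the lists above, so the following is an assumption-laden sketch: the function `relative_freq_ratio`, the smoothing constant and all counts are invented for illustration.

```python
def relative_freq_ratio(counts_a, size_a, counts_b, size_b, smoothing=0.5):
    """Ratio of relative frequencies (A over B) for every word seen in either subcorpus.

    A small smoothing constant is added to each count so that words absent
    from one subcorpus do not cause a division by zero.
    """
    ratios = {}
    for word in set(counts_a) | set(counts_b):
        rate_a = (counts_a.get(word, 0) + smoothing) / size_a
        rate_b = (counts_b.get(word, 0) + smoothing) / size_b
        ratios[word] = rate_a / rate_b
    return ratios

# Hypothetical adjective counts in two equally sized subcorpora.
nonfiction = {"regulatory": 40, "good": 100}
fiction = {"clammy": 30, "good": 100}

ratios = relative_freq_ratio(nonfiction, 10_000, fiction, 10_000)
ranked = sorted(ratios, key=ratios.get, reverse=True)
print(ranked)  # most nonfiction-like first, most fiction-like last
```

Words shared equally by both registers, such as "good" here, get a ratio near 1 and fall in the middle of the ranking. Corpus tools often use significance-based keyness measures instead of a raw ratio.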


Distribution of phrasal verbs


“Talk” as a noun/verb in different registers


Synonyms: utter vs. sheer


Semantic prosody of caused

Noun collocates of “CAUSE”: problems, damage, death, trouble, harm, concern, injury, problem, difficulties, loss, confusion, pain
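Collocate lists like the one above come from counting the words that occur within a fixed window around a node word. The sketch below shows the idea on an invented sentence. The function `collocates`, the window size and the set of CAUSE forms are illustrative assumptions, and unlike the slide the sketch does not filter the collocates by part of speech.

```python
from collections import Counter

def collocates(tokens, node_forms, window=4):
    """Count every token occurring within `window` positions of a node word."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok in node_forms:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # skip the node word itself
                    counts[tokens[j]] += 1
    return counts

# Invented example sentence, split on whitespace.
text = "the delay caused problems and the leak caused damage while stress causes problems"
tokens = text.split()

# Treat the inflected forms as one lemma, CAUSE.
cause_forms = {"cause", "causes", "caused", "causing"}
counts = collocates(tokens, cause_forms, window=1)
print(counts.most_common(3))
```

With a large corpus, the predominance of negative collocates such as "problems" and "damage" is what the slide calls the semantic prosody of the word.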


data singular or plural?

Per million words: Singular: 776; academic: 21; misc: 9.2; spoken: 1.9; newspaper: 1.6; fiction: 0.3

Per million words: Plural: 1,035; academic: 42.5; misc: 8.8; spoken: 0.2; fiction/news: 0.1
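Figures "per million words" are normalised frequencies: the raw count is divided by the size of the (sub)corpus and scaled to one million, so that registers of different sizes can be compared. A minimal sketch follows; the counts and corpus sizes are hypothetical and are not the BNC figures behind the slide.

```python
def per_million(raw_count, corpus_size):
    """Normalised frequency: occurrences per one million words."""
    return raw_count * 1_000_000 / corpus_size

# Hypothetical example: the same raw count in two subcorpora of different size.
small = per_million(150, 2_000_000)    # 75.0 per million
large = per_million(150, 10_000_000)   # 15.0 per million
print(small, large)
```

The same raw count of 150 is five times more frequent, in relative terms, in the smaller subcorpus, which is why raw counts alone can mislead when registers differ in size.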


How are women and men described?


reason for vs. reason to


Extra Practice with BNC VIEW

• 1) What are the top 5 modal verbs in English?

• 2) Is there any difference between the verbs destroy, ruin, and demolish? If so, what is it?

• 3) Do you think the adjectives in “utterly + adjective” have anything in common? If so, what is it?

• 4) Can we use the plural form of research as in “his researches”?


Where to find what…

• BNCweb

– http://bncweb.lancs.ac.uk/bncwebSignup/user/login.php

• BYU corpus hub: http://corpus.byu.edu/corpora.asp

• CQP corpus hub at BFSU (ID/pass=test): http://124.193.83.252/cqp/

• Bank of English (56M sample)

– http://www.collins.co.uk/Corpus/CorpusSearch.aspx

• Sketch Engine: http://www.sketchengine.co.uk/

• David Lee’s CL bookmarks

– http://www.uow.edu.au/~dlee/CBLLinks.htm

• Corpus linguistics, translation, and language learning

– http://sites.google.com/site/federicozanettinnet/cl-htm

• Corpus4u Community: www.corpus4u.org