Corpus Linguistics

An introduction to computational teaching

Prof. Rogério Pereira Azeredo

Semana de Letras – Faculdade Pitágoras – Vitória

October 2008

What is a Corpus?

The word "corpus", derived from the Latin word meaning "body", may be used to refer to any text in written or spoken form. However, in modern Linguistics this term is used to refer to large collections of texts which represent a sample of a particular variety or use of language(s) and which are presented in machine-readable form. Other definitions, broader or stricter, exist.

Computer-readable corpora can consist of raw text only, i.e. plain text with no additional information. Many corpora have been provided with some kind of linguistic information, called mark-up or annotation.

Types of corpora

There are many different kinds of corpora. They can contain written or spoken (transcribed) language, modern or old texts, texts from one language or several languages. The texts can be whole books, newspapers, journals, speeches, etc., or consist of extracts of varying length. The kinds of texts included and the way different texts are combined vary between corpora and corpus types.


What is Corpus Linguistics?

Corpus Linguistics is now seen as the study of linguistic phenomena through large collections of machine-readable texts: corpora. These are used within a number of research areas going from the Descriptive Study of the Syntax of a Language to Prosody or Language Learning, to mention but a few.

The use of real examples of texts in the study of language is not a new issue in the history of linguistics. However, Corpus Linguistics has developed considerably in the last decades due to the great possibilities offered by the processing of natural language with computers. The availability of computers and machine-readable text has made it possible to get data quickly and easily and also to have this data presented in a format suitable for analysis.

Corpus linguistics is, however, not the same as merely obtaining language data through the use of computers. Corpus linguistics is the study and analysis of data obtained from a corpus. The main task of the corpus linguist is not to find the data but to analyze it. Computers are useful, and sometimes indispensable, tools used in this process.


SOME HISTORICAL BACKGROUND

A landmark in modern corpus linguistics was the publication by Henry Kucera and Nelson Francis of Computational Analysis of Present-Day American English in 1967, a work based on the analysis of the Brown Corpus, a carefully compiled selection of current American English, totaling about a million words drawn from a wide variety of sources. Kucera and Francis subjected it to a variety of computational analyses, from which they compiled a rich and variegated opus, combining elements of linguistics, language teaching, psychology, statistics, and sociology. A further key publication was Randolph Quirk's 'Towards a description of English Usage' (1960), in which he introduced the Survey of English Usage.

Shortly thereafter, Boston publisher Houghton-Mifflin approached Kucera to supply a million-word, three-line citation base for its new American Heritage Dictionary (AHD), the first dictionary to be compiled using corpus linguistics. The AHD took the innovative step of combining prescriptive elements (how language should be used) with descriptive information (how it actually is used).


Other publishers followed suit. The British publisher Collins' COBUILD monolingual learner's dictionary, designed for users learning English as a foreign language, was compiled using the Bank of English. The Survey of English Usage Corpus was used in the development of one of the most important corpus-based grammars, A Comprehensive Grammar of the English Language (Quirk et al. 1985).

The Brown Corpus has also spawned a number of similarly structured corpora: the LOB Corpus (1960s British English), Kolhapur (Indian English), Wellington (New Zealand English), Australian Corpus of English (Australian English), the Frown Corpus (early 1990s American English), and the FLOB Corpus (1990s British English). Other corpora represent many languages, varieties and modes, and include the International Corpus of English and the British National Corpus, a 100-million-word collection of a range of spoken and written texts, created in the 1990s by a consortium of publishers, universities (Oxford and Lancaster) and the British Library. For contemporary American English, work has stalled on the American National Corpus, but the 360-million-word Corpus of Contemporary American English (COCA) (1990-present) is now available.

The first computerized corpus of transcribed spoken language was constructed in 1971 by the Montreal French Project, containing one million words, which inspired Shana Poplack's much larger corpus of spoken French in the Ottawa-Hull area.


Knowing your corpus

Something about corpus compilation

Combining texts into a corpus is called compiling a corpus. There are various ways of doing this, depending on what kind of corpus you want to create and on what resources (time, money, knowledge) you have at your disposal.

Even if you are not compiling your own corpus, it is important to know something about corpus compilation when you use a corpus. Using a corpus is using a selection of texts to represent the language. How the corpus has been compiled is of utmost importance for the results you get when using it. What texts are included, how these are marked up, the proportions of different text types, the size of the various texts, how the texts have been selected, etc. are all important issues.


Illustration: the language as a newspaper

Let us imagine that you have a newspaper - a collection of texts of different kinds (editorials, reportage on different topics, reviews, cartoons, letters to the editor, sports commentaries, lists of shares, etc) written by different people. You then cut the paper into small pieces with one word on each. You put all the pieces/words into a bowl and pick a sample of ten at random. Obviously there would be several words that you know exist in the newspaper that are not found in your sample. If you were to pick another ten pieces of paper you would not expect the two sets of ten words to be exactly the same. If you picked two sets of 100 words each, you would probably find that some words, especially frequent words like function words, can be found in both samples, if not in exactly the same numbers. You would also find that many words are found in only one of the samples. If you took two very large samples you would find that the frequent words would occur to a similar extent. Words that occur only once in the newspaper would be found in only one of the samples (at most). Words that occur infrequently would not necessarily be evenly distributed across the two samples.
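The bowl experiment above can be tried out on a computer. What follows is only a minimal sketch in Python: the three-sentence "newspaper" is invented for the illustration, and the function name draw_sample and the seed value are assumptions, not part of any corpus tool.

```python
import random

# An invented, very small "newspaper", already cut into one-word slips.
newspaper = (
    "the match ended in a draw and the fans left the stadium early "
    "the editor argued that the budget was a mistake and the council agreed "
    "shares in the bank fell as the market closed and the review praised the film"
).split()

def draw_sample(words, size, rng):
    """Pick `size` slips from the bowl at random, without putting any back."""
    return rng.sample(words, size)

rng = random.Random(2008)  # fixed seed so the same draw can be repeated
first = draw_sample(newspaper, 10, rng)
second = draw_sample(newspaper, 10, rng)

shared = set(first) & set(second)    # words found in both samples
only_one = set(first) ^ set(second)  # words found in only one sample

print("first sample: ", first)
print("second sample:", second)
print("in both:      ", sorted(shared))
print("in only one:  ", sorted(only_one))
```

Running it with different sample sizes (10, 20, 40) should show the pattern described above: frequent function words such as "the" tend to turn up in both samples, while rare words appear in at most one.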


Now imagine that you divide the newspaper into sections (or classify its content into categories/text types) before cutting it up, and then put the cuttings in different bowls. By picking your paper slips from the different bowls you can influence the composition of your sample. You can choose to take slips from only one bowl or from several, in equal or different proportions. If there is a difference in the language in the bowls, there will be a difference in the language on the slips and that will affect your sample correspondingly. You can easily see that if you were to take 100 slips of paper from the 'sports' bowl and 100 slips from the 'editorial' bowl, you would probably find a larger number of the word football in the sample taken from the 'sports' bowl than from the 'editorial'.
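Sampling from separate bowls can be sketched in the same way. The two bowls below are invented toy texts, and the name stratified_sample and the chosen proportions are assumptions made for this illustration only.

```python
import random
from collections import Counter

# Invented bowls: each section of the newspaper, cut into one-word slips.
bowls = {
    "sports": ("football fans watched the football match as the team won "
               "the football cup and the coach praised the players").split(),
    "editorial": ("the council should review the budget because the plan "
                  "ignores the schools and the football ground").split(),
}

def stratified_sample(bowls, sizes, rng):
    """Draw sizes[name] slips, without replacement, from each named bowl."""
    return {name: rng.sample(bowls[name], sizes[name]) for name in sizes}

rng = random.Random(0)  # fixed seed so the same draw can be repeated
sample = stratified_sample(bowls, {"sports": 10, "editorial": 10}, rng)

for name, slips in sample.items():
    print(name, "-> 'football' drawn", Counter(slips)["football"], "time(s)")
```

Changing the numbers passed in sizes (for example 15 slips from 'sports' and 5 from 'editorial') changes the composition of the sample, which is exactly the decision a corpus compiler makes when choosing the proportions of text types.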


A practical example

The Dictionary Research Centre is an umbrella for lexicographical activities and interests within the Department of English. The Department has been involved in dictionary projects of different kinds for nearly twenty-five years. Perhaps best-known are the COBUILD project (1980-2000, in association with HarperCollins), and the Johnson Dictionary project (1988 onwards, partly in association with Cambridge University Press). There have also been several smaller research initiatives, for example within the Centre for Corpus Linguistics. The Dictionary Research Centre was created in autumn 2001, when the Dictionary Research Centre in the School of English, University of Exeter, was transferred to the University of Birmingham. Exeter itself had a long tradition in relation to dictionaries and lexicography within its Dictionary Research Centre, created by Professor Reinhard Hartmann (now an honorary professor at Birmingham).

University of Birmingham : http://www.english.bham.ac.uk/drc/


LINKS

Virtual Language Centre (Hong Kong): http://vlc.polyu.edu.hk

The Corpus of Contemporary American English (385+ million words, 1990-2008): http://www.americancorpus.org

The British National Corpus: http://www.natcorp.ox.ac.uk/

Web Concordancer VLC: http://www.edict.com.hk/concordance/WWWConcappE.htm

The Collins WordbanksOnline English Corpus: http://www.collins.co.uk/Corpus/CorpusSearch.aspx

Business Letter Corpus Online KWIC Concordancer: http://ysomeya.hp.infoseek.co.jp/

ONLINE CORPORA: http://corpus.byu.edu/

WEBCORP: http://www.webcorp.org.uk/cgi-bin/webcorp2.nm

SOME QUERIES


have a bath x take a bath
have a nap x take a nap
make a mistake x commit a mistake
salt and pepper x pepper and salt
dead or alive x alive or dead
lost and found x found and lost
on and off x off and on
fish and chips x chips and fish
sick and tired x tired and sick
black and white x black on white
cats and dogs x dogs and cats
bacon and eggs x eggs and bacon
out of the (blue/green/white/black/red/grey)
(green/red/black/white/yellow) with anger
(green/red/black/white/yellow) with envy
make/prepare/fix/cook dinner
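Queries of this kind are normally run in one of the online concordancers listed under LINKS. The underlying idea, counting how often each competing phrase occurs, can also be sketched in a few lines of Python. The toy text below is invented and far too small to prove anything, and the function names count_phrase and compare are assumptions made for this example.

```python
import re

def count_phrase(text, phrase):
    """Count whole-word, case-insensitive occurrences of `phrase` in `text`."""
    words = [re.escape(w) for w in phrase.split()]
    pattern = r"\b" + r"\s+".join(words) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

def compare(text, variant_a, variant_b):
    """Return the frequency of two competing variants side by side."""
    return {variant_a: count_phrase(text, variant_a),
            variant_b: count_phrase(text, variant_b)}

# Invented toy text standing in for a real corpus.
toy_corpus = (
    "Pass the salt and pepper, please. She wanted him back, dead or alive. "
    "We had fish and chips with salt and pepper. I am sick and tired of rain. "
    "He asked for pepper and salt."
)

print(compare(toy_corpus, "salt and pepper", "pepper and salt"))
# {'salt and pepper': 2, 'pepper and salt': 1}
```

With a real corpus in place of the toy text, the same comparison shows which word order or which verb (have/take, make/commit) speakers actually prefer.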

REFERENCES

http://www.corpus-linguistics.de/

http://www.english.bham.ac.uk/drc/

http://www.essex.ac.uk/linguistics/clmt/w3c/corpus_ling/content/introduction.html

http://en.wikipedia.org/wiki/Corpus_linguistics

