Corpus and BNC


BNC and its Online Use

Corpus and Famous Corpora

Presenter Memoona Butt

Roll No. 02

Corpus

•A corpus can be defined as a systematic collection of naturally occurring text in electronic form.

Corpus linguistics

• Corpus linguistics is the study of language and linguistic phenomena through the analysis of data obtained from a corpus.
• Corpus linguistics is also the analysis of text with the help of a computer, i.e. with specialized software.

• A corpus is always designed for a particular purpose, so the usefulness of a ready-made corpus must be judged with regard to the purpose to which a user intends to put it.
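As a small illustration of analysing text with the help of a computer, the sketch below counts word frequencies in a short invented sample. The sample sentence and the function name `word_frequencies` are assumptions made for this example; a real corpus would be far larger and would need more careful tokenisation.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count lowercase word tokens in a piece of text."""
    # Very simple tokenisation: runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Invented sample text, not taken from any real corpus.
sample = "The cat sat on the couch. The couch was old."
freqs = word_frequencies(sample)

# Show the two most frequent word forms with their counts.
print(freqs.most_common(2))
```

The same counting idea scales up to a full corpus; only the amount of text and the care taken over tokenisation change.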

Famous corpora

• The Brown Corpus
• The Lancaster-Oslo/Bergen Corpus
• The London Lund Corpus
• The British National Corpus

The Brown Corpus

• The Brown Corpus of Standard American English was the first of the modern, computer readable, general corpora. The corpus consists of one million words of American English texts printed in 1961.

The Lancaster-Oslo/Bergen

• The Lancaster-Oslo/Bergen Corpus is a one-million-word collection of British English texts, compiled in the 1970s in collaboration between the University of Lancaster, the University of Oslo, and the Norwegian Computing Center for the Humanities, Bergen.

The London Lund Corpus

• The London Lund Corpus of English derives from two projects: the Survey of English Usage at University College London and the Survey of Spoken English, which was started at Lund University in 1975. The corpus consists of 500,000 words of spoken British English.

The British National Corpus

• The British National Corpus is a 100-million-word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of British English from the later part of the 20th century.

Creation of BNC:

• The project was developed by an academic consortium called the BNC Consortium.
• It is an industrial/academic consortium led by Oxford University Press, whose members include other dictionary publishers.

• The Consortium was formed in 1990 and started work in 1991 on the three-year task of producing a hundred-million-word corpus of modern British English for use in commercial and academic research. All major decisions regarding the BNC are still made by the Consortium.

• The BNC comprises approximately 100 million words of:
• Written texts (90%)
• Transcripts of speech (10%)

Why we use BNC

• The BNC can be used to learn aspects of a word that we did not know and to check our intuitions about its meaning. The corpus shows how a word is actually used, not just what we think it means. We can also use the BNC to check whether a word occurs in the corpus at all.

Properties of British National Corpus

Presented by:- Hadia Tabassum

The BNC is a sample of 100 million words of spoken and written British English. It is a balanced and finite corpus that contains approximately 90% written data and 10% spoken data.

Features of British National Corpus

Spoken components/data in BNC:

• The conversational part
• The task-oriented part

The conversational part:

• This part is largely based on recordings of everyday conversational interaction engaged in by some 127 adults aged 15 and over. Some additional recordings of speakers under fifteen were included from COLT. The volunteers were selected according to the demographic parameters of age, social group, and sex, with the aim of obtaining approximately equal numbers in each group. The conversational part makes up just over 40% of the spoken corpus.

Respondents in the "conversational part" were selected according to the following properties:

Age             Social group    Sex             Percentage
Under fifteen   Upper class     Male            41.14
15-24           Middle class    Female          58.47
25-34           Lower class     Unclassified    0.38

The task-oriented part:

This material was intended to represent those types of task-oriented spoken activity that were unlikely to be recorded by the conversational volunteers during a typical day in their lives, e.g. lectures, consultations, sermons, and TV/radio broadcasts. This part makes up about 60% of the spoken corpus.

The written components:

• Imaginative: mostly fiction
• Informative: non-fiction

Imaginative texts account for about 20% and informative texts for about 80% of the written component. The imaginative texts are divided into further categories such as prose and poetry. The informative data, on the other hand, is subdivided into eight categories:

1. Arts
2. Natural sciences
3. Commerce
4. Applied sciences
5. Leisure
6. Social sciences
7. Belief and thought
8. World affairs
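The percentages quoted in these slides can be turned into approximate word counts. The sketch below is only arithmetic on the rounded figures given here (100 million words, 90/10 written/spoken, 40/60 conversational/task-oriented, 20/80 imaginative/informative); the helper name `split` is an assumption for this example, and the real corpus totals differ slightly from these round numbers.

```python
TOTAL_WORDS = 100_000_000  # approximate size quoted in the slides

def split(total, percentages):
    """Divide a word total according to whole-number percentages."""
    return {name: total * pct // 100 for name, pct in percentages.items()}

# Top-level split into written and spoken material.
top = split(TOTAL_WORDS, {"written": 90, "spoken": 10})

# Spoken component: conversational vs task-oriented (rounded shares).
spoken = split(top["spoken"], {"conversational": 40, "task-oriented": 60})

# Written component: imaginative vs informative (rounded shares).
written = split(top["written"], {"imaginative": 20, "informative": 80})

for part in (top, spoken, written):
    for name, words in part.items():
        print(f"{name:>15}: {words:,} words")
```

On these rounded shares, the conversational part is roughly 4 million words and the imaginative written part roughly 18 million words.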

Abbreviations and acronyms:

The BNC records the same abbreviation in different written forms, such as P.C. and PC, and the same form may stand for different expansions (police constable, postcard, personal computer).

Monolingual:

Although the BNC includes many different styles, varieties and genres, it deals only with modern British English and not with other languages used in Britain.

Synchronic:

The BNC covers British English of the late twentieth century, rather than the historical development which produced it. The corpus itself is not extended with newer texts; later editions only revise and correct it.

Editions of BNC

Presenter Kinza Asghar

First edition

• The first edition of the BNC was completed in 1994.
• The first general release of the corpus for European researchers was announced in February 1995.

BNC World

• BNC World, a slightly revised version, was made available in 2001. The name indicates that the corpus is now available under license worldwide.

The BNC is available in two flavors:

1. Under the single-user license (cost: 50 pounds), you can install the whole corpus and the SARA software on a single machine for personal use.
2. Alternatively, for the same price, you can install just the corpus itself and use whatever software you want.

BNC XML

• BNC XML is the latest version of the British National Corpus.
• XML stands for Extensible Markup Language.
• XML is a set of rules for encoding documents in machine-readable form.

• The main differences between this version and BNC World are:

1. Errors and inconsistencies have been removed.
2. Lemma information has been added.
3. Simplified part-of-speech information has been added.
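To make the idea of machine-readable markup with lemma and part-of-speech information concrete, the sketch below parses a small hand-written XML fragment with Python's standard library. The fragment is invented for this example, and the element and attribute names (`w`, `c`, `c5`, `hw`, `pos`) are assumed to follow the style of the BNC XML markup; the corpus documentation should be checked for the exact scheme.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment in an assumed BNC-XML-like style:
#   hw  = headword (lemma), pos = simplified part of speech,
#   c5  = detailed part-of-speech tag.
fragment = """
<s n="1">
  <w c5="AT0" hw="the" pos="ART">The </w>
  <w c5="NN2" hw="couch" pos="SUBST">couches </w>
  <w c5="VBB" hw="be" pos="VERB">are </w>
  <w c5="AJ0" hw="old" pos="ADJ">old</w>
  <c c5="PUN">.</c>
</s>
"""

sentence = ET.fromstring(fragment.strip())

# Collect (word form, lemma, simplified POS) for every word element.
tokens = [
    (w.text.strip(), w.get("hw"), w.get("pos"))
    for w in sentence.iter("w")
]

for form, lemma, pos in tokens:
    print(f"{form:<10} lemma={lemma:<8} pos={pos}")
```

With lemma information available, a search for the lemma "couch" can match both "couch" and "couches" without listing every inflected form by hand.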

• BNC XML can be accessed in three ways:

1. Online use.
2. Download the corpus and XAIRA.
3. Download just the corpus and use it with any software you want.

• Two subsets of the BNC have been produced separately:

• BNC Baby
• BNC Sampler

BNC Baby

• BNC Baby is a subset of the BNC. It consists of four one-million-word samples, each compiled as an example of a particular genre: fiction, newspapers, academic writing and spoken conversation.

BNC Sampler

• The BNC Sampler is a subset of the full BNC. It comprises two samples, one of written and one of spoken material, of one million words each, compiled to mirror the composition of the full BNC as far as possible.

• The sampler was first created at Lancaster University during the creation of the BNC.

Online use of BNC

• Go to the home page.
• Put the word into the search bar and then click on the search button.
• It will show the contexts in which the word is used.
• For instance, if we look for the word "couch", the corpus will show us its collocations, its frequency and its KWIC (key word in context) lines.
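The kind of KWIC (key word in context) display described in the steps above can be sketched in a few lines of Python. This is not the BNC search interface: the function `kwic`, its `width` parameter and the sample sentences are all assumptions made for this illustration.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each hit."""
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword.lower():
            # Take up to `width` tokens on each side of the hit.
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Invented sample text, not taken from the BNC.
sample = "She sat on the couch all day. The old couch creaked loudly."
hits = kwic(sample, "couch")

for left, key, right in hits:
    print(f"{left:>20} [{key}] {right}")
print("frequency:", len(hits))
```

The number of hits gives the word's frequency in the sample, and the words that recur in the left and right contexts are the raw material for finding collocations.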
