
What is a corpus?* A corpus is defined in terms of form purpose The word corpus is used to describe a collection of examples of language collected


What is a corpus?*

A corpus is defined in terms of both form and purpose.

The word corpus is used to describe a collection of examples of language collected for linguistic study.

It can also describe collections of texts stored and accessed electronically (Hunston 2002).

Corpus planning and design serve a specific linguistic purpose. It is on this basis that texts are selected and stored, so that they can be studied quantitatively and qualitatively.

*Ref. text: Hunston, S. (2002). Corpora in Applied Linguistics.


What are corpora used for?

Corpora are often used for language teaching and learning. They give information about how a language works.

They also help calculate the relative frequency of different features.

Exploring corpora can help students to observe nuances of usage and to make comparisons between languages.

Corpora are also used to investigate cultural attitudes expressed through language.

NB: a corpus will not tell you whether something is possible or not, only whether it is frequent or not!


Using corpora in translation

Corpora are also used in translation.

Comparable corpora make it possible to compare the use of apparent equivalents.

Parallel corpora show how words and phrases have been translated in the past.

General corpora can be used to establish norms of frequency and usage.


What can a corpus do?

Corpus access software is used to rearrange the information which has been stored so that observations of various kinds can be made.

It is not the corpus which gives new information about language. It is the software which gives new perspectives on what is already familiar.

Software packages process the data to show frequency, phraseology and collocation.


Frequency

Corpus processing allows words to be compared by means of frequency lists.

Unsurprisingly, grammar words are more frequent than lexical words. That explains why they are found at the top of the list.

Frequency lists can be useful for identifying differences between corpora. Raw counts, however, can be compared directly only if the corpora are comparable, i.e. if their length is approximately the same.
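As a rough illustration of how a frequency list is built, here is a minimal Python sketch. The tokenising rule, the function names and the sample sentence are all assumptions made for this example. They are not taken from Hunston or from any particular corpus package. The `per_thousand` helper shows one common way of expressing a relative frequency, so that counts from texts of different lengths can be set side by side.

```python
from collections import Counter
import re

def tokenize(text):
    """Lower-case the text and split it into word tokens (a deliberately crude rule)."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()

def per_thousand(count, corpus_size):
    """Express a raw count as a rate per 1,000 tokens (a relative frequency)."""
    return 1000 * count / corpus_size

# Hypothetical two-sentence "corpus", used only for illustration.
sample = "The cat sat on the mat. The dog sat on the cat."
tokens = tokenize(sample)
freq = frequency_list(tokens)

print(freq[:3])  # the grammar word "the" heads the list
print(round(per_thousand(dict(freq)["the"], len(tokens)), 1))
```

Even in this tiny sample the grammar word "the" comes first, which is the pattern described above for real corpora.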


Concordance

The most frequent way to access a corpus is through a concordancing program.

Concordance lines bring together instances of use of words or phrases, so that regularities in use can be observed.

Concordances also help us to understand how nouns or adjectives are used.
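To make the idea of concordance lines concrete, the following Python sketch prints a simple key-word-in-context (KWIC) display. It is only an illustration: the `kwic` function, its `width` parameter and the sample sentences are invented for this example, and real concordancers offer much more (sorting, wider context, regular-expression search).

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, node word, right context) for every hit.

    Context is measured in words and, in this simple version,
    may run across sentence boundaries.
    """
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Hypothetical sample text, for illustration only.
text = ("She shed tears of joy. The snake shed its skin. "
        "New data shed light on the case.")

for left, node, right in kwic(text, "shed"):
    # Right-align the left context so the node word lines up in a column.
    print(f"{left:>20}  {node}  {right}")
```

Aligning the node word in a central column is what lets the eye pick out regularities to its left and right.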


Collocation

Collocation is the tendency of words to co-occur.

The collocates of a given word are those words which often occur in conjunction with it.

Collocation can indicate pairs of lexical items, or the association between a lexical word and its frequent grammatical environment. In the latter case, the term used is colligation.
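A first approximation to finding collocates is simply to count the words that appear within a few positions of the node word. The Python sketch below does this. The `collocates` function, the `span` parameter and the sample text are assumptions made for illustration. Note that it reports raw co-occurrence counts only. Corpus software usually applies a statistical measure on top of such counts, which this sketch does not attempt.

```python
from collections import Counter
import re

def collocates(text, node, span=2):
    """Count the words found within `span` positions of each occurrence of `node`."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            # Words to the left and right of the node, excluding the node itself.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

# Hypothetical sample text, for illustration only.
sample = "They shed tears. He shed tears too. The tree shed leaves."
print(collocates(sample, "shed", span=1).most_common(3))
```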


Types of corpora

A corpus is designed for a particular purpose. Consequently, the type of corpus depends on its purpose:
- Specialized corpus
- General corpus
- Comparable corpora
- Parallel corpora
- Learner corpus
- Historical or diachronic corpus
- Monitor corpus


Specialized corpus: a corpus of texts of a particular type (editorials, academic articles, lectures, essays, etc.). Specialized corpora reflect the type of language a researcher wants to explore. You may also restrict the corpus to a time frame, to a social setting, to a given topic.

General corpus: a corpus of texts of many types, of written or spoken language, or of both. A general corpus is usually much larger than a specialized corpus. Since it can be used to produce reference materials, it is sometimes called a reference corpus.


Comparable corpora: two or more corpora in different languages, or in different varieties of a language. They are designed to contain the same proportions of text types (e.g. newspaper texts, essays, novels, conversations, etc.). They can be used by translators and learners to identify differences and equivalences in each language.

Parallel corpora: two or more corpora in different languages, containing translated texts, or texts produced simultaneously in two or more languages (e.g. EU texts). They can be used by translators and learners to find potential equivalents in each language, and to investigate differences between languages.


Learner corpus: a collection of texts produced by learners of a language. It is used to identify differences among learners, frequency and type of mistakes, etc.

Historical or diachronic corpus: a corpus of texts from different periods of time. It helps to trace the development of a language over time.

Monitor corpus: a corpus used to track current changes in a language. It rapidly increases in size, since it is added to annually, monthly, or even daily. The proportion of text types has to remain constant, so that each year is comparable with every other.


The use of corpora is not limited to identifying, quantifying and analyzing keywords. The concordance lines offer many instances of use of words or phrases, so that the user can observe regularities in use by means of several examples of the same word or phrase in its natural context.

Calculating collocation means finding the statistical tendency of words to co-occur, and collocations also highlight some metaphorical uses. A good example is the set of collocates of the word shed: light, tears, blood, pounds, confidence, hair, skin, labour. In all these contexts shed is a verb, but its Italian equivalent varies with the collocate.


English            Italian
Shed light         fare/gettare luce
Shed tears         spargere lacrime
Shed blood         spargere/versare sangue
Shed pounds        perdere chili/peso
Shed skin          perdere/mutare la pelle (fare la muda)
Shed confidence    ispirare fiducia
Shed hair          perdere il pelo
Shed labour        disfarsi della manodopera (licenziare)


Key terms
- Type
- Token
- Hapax
- Lemma
- Word-form
- Tagging
- Parsing
- Annotate


Tokens: the term is used to indicate the words which are counted in a corpus or in a given text.

But many of these words occur more than once. So, if we count each repeated item once only, the total number changes. In a given text, for instance, we have 250 tokens, but 194 types (articles, repeated nouns etc. are counted once only). Hapax legomena or hapaxes are those words which occur only once.
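The distinction between tokens, types and hapaxes can be checked on a short text with a few lines of Python. This is a minimal sketch: the `token_type_counts` function, the tokenising rule and the sample sentence are assumptions made for illustration, and the figures it produces are for that sentence only, not for the 250-token text mentioned above.

```python
from collections import Counter
import re

def token_type_counts(text):
    """Return (number of tokens, number of types, sorted list of hapaxes)."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)  # one entry per type
    hapaxes = sorted(word for word, n in counts.items() if n == 1)
    return len(tokens), len(counts), hapaxes

# Hypothetical sample sentence, for illustration only.
n_tokens, n_types, hapaxes = token_type_counts(
    "the cat saw the dog and the dog saw the bird"
)
print(n_tokens, n_types, hapaxes)
```

Here every running word counts as a token, each distinct word counts once as a type, and the hapaxes are the types with a count of exactly one.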

We may also have words which occur in two (or more) different forms: friend and friends, for instance. These are two word-forms which belong to the same lemma. The same applies to go, goes, going, went, gone: five word-forms which belong to the same lemma, go. This implies that when using the lemma as a keyword, all its different word-forms have to be looked for.
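Searching by lemma can be pictured as a lookup from each word-form to its lemma. The Python sketch below uses a tiny hand-built table for this purpose. The `LEMMAS` dictionary and the `find_lemma` function are hypothetical and cover only the examples given here, whereas real corpus tools rely on a full lemmatiser.

```python
# Hand-built word-form -> lemma table (hypothetical, covers these examples only).
LEMMAS = {
    "go": "go", "goes": "go", "going": "go", "went": "go", "gone": "go",
    "friend": "friend", "friends": "friend",
}

def find_lemma(tokens, lemma):
    """Return (position, word-form) for every token that belongs to `lemma`."""
    return [(i, tok) for i, tok in enumerate(tokens)
            if LEMMAS.get(tok.lower()) == lemma]

# Hypothetical sample sentence, for illustration only.
tokens = "she went home and her friends are going too".split()
print(find_lemma(tokens, "go"))
print(find_lemma(tokens, "friend"))
```

A search for the lemma go therefore retrieves went and going, which a search for the single word-form go would miss.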


Usually word-forms are considered to belong to the same lemma when they belong to the same word-class (verb, noun, adjective, etc.).

Tagging usually refers to the addition of a code to each word in a corpus, to indicate the part of speech. Automatic tagging is possible, but not fully accurate. Tagging is useful when you want to look at different word categories. For instance, the noun work can be considered separately from the verb.
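The benefit of tagging can be shown with a small hand-tagged example in Python. The tag labels (`VERB`, `NOUN` and so on), the `select` function and the sample sentence are all assumptions made for illustration. They do not follow any particular tagset, and the tags here were assigned by hand, not by an automatic tagger.

```python
# Hand-tagged sample: (word, part-of-speech code) pairs.
# The tag labels are illustrative and do not follow a specific tagset.
tagged = [
    ("they", "PRON"), ("work", "VERB"), ("hard", "ADV"),
    ("at", "PREP"), ("their", "DET"), ("work", "NOUN"),
]

def select(tagged_tokens, word, tag):
    """Return the positions where `word` occurs with the given tag."""
    return [i for i, (w, t) in enumerate(tagged_tokens)
            if w == word and t == tag]

print(select(tagged, "work", "NOUN"))  # the noun only
print(select(tagged, "work", "VERB"))  # the verb only
```

Without the tags, a search for work would return both occurrences together.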


Corpus parsing is the analysis of a text into its constituents, for instance clauses and groups. This allows you to analyse the different structures in a corpus.

Just like tagging, parsing can be done automatically, though the output is not very accurate. Manual editing is often necessary.
