Corpus Linguistic Processing Problems Mike Scott School of English University of Liverpool Charles University, Prague 22.5.06 This presentation is at

Corpus Linguistic Processing Problems

Mike Scott

School of English, University of Liverpool

Charles University, Prague, 22.5.06

This presentation is at www.lexically.net/downloads/corpus_linguistics

Abstract

This lecture considers the problems of handling and analysing sizeable corpora using standard PC technology. Issues to be addressed include the problem of dealing with text in a variety of formats, what constitutes a text boundary, memory versus disk storage, and retrieval from a hard disk of relevant texts which can be said to be “about” a given topic.

H.G. Wells World Brain (1938)

This World Encyclopaedia would be the mental background of every intelligent man in the world. It would be alive and growing and changing continually under revision, extension and replacement from the original thinkers in the world everywhere. Every university and research institute would be feeding it … its contents would be the standard source of material… (in Witten et al 1999:435)

Issues and Questions

Retrieval – Queries

Text formats

Text boundaries

Storage

Finding relevant text data on a hard disk

Part 1: Retrieval

What we do with a Corpus: search it

Text Focus

1. Find texts meeting certain criteria

2. Discover characteristics of text-types

Language Focus

1. Find words / phrases / structures meeting certain criteria

2. Discover characteristics of words / phrases / structures

Find text-types with these characteristics

“Query” Operations

List all instances of X

Addition Operations

merge documents

insert into list

Removal Operations

split documents

delete from list

View Operations

re-order

see wider context

Text Attributes

Date

Authorship

Readership / audience

Location

Participants

Length

Format (encoding)

Language

Style

Mode

Domain

Availability

Meaning (aboutness)

etc.

Simple Query Types

identical to topic/wording X

similar to topic X

touches on topic X

quotes text X

quoted by text Y

refers/alludes to text X

referred to in text Y

Complex Queries

More than 1 simple query type, and/or more than 1 text attribute …

…in Boolean combinations (and, or, not)
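A complex query of this kind can be sketched as small predicate functions combined with and, or and not. This is only an illustration under stated assumptions: the TextRecord class, its attribute names, the helper functions and the four sample records are all invented here, not taken from any existing corpus tool.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class TextRecord:
    """Hypothetical record carrying a few of the text attributes listed above."""
    name: str
    year: int
    language: str
    domain: str


# A query is any function that says yes or no to one text.
Query = Callable[[TextRecord], bool]


def attr_equals(attribute: str, value: Any) -> Query:
    """Simple query: one attribute must equal one value."""
    return lambda text: getattr(text, attribute) == value


def all_of(*queries: Query) -> Query:
    """Boolean AND over any number of queries."""
    return lambda text: all(q(text) for q in queries)


def any_of(*queries: Query) -> Query:
    """Boolean OR over any number of queries."""
    return lambda text: any(q(text) for q in queries)


def negate(query: Query) -> Query:
    """Boolean NOT."""
    return lambda text: not query(text)


def search(corpus: List[TextRecord], query: Query) -> List[str]:
    """Return the names of the texts that satisfy the query."""
    return [text.name for text in corpus if query(text)]


# Invented sample data, for illustration only.
CORPUS = [
    TextRecord("a.txt", 2001, "English", "news"),
    TextRecord("b.txt", 2002, "Czech", "news"),
    TextRecord("c.txt", 2003, "English", "fiction"),
    TextRecord("d.txt", 2003, "English", "news"),
]

# English AND (2001 OR 2003) AND NOT fiction
complex_query = all_of(
    attr_equals("language", "English"),
    any_of(attr_equals("year", 2001), attr_equals("year", 2003)),
    negate(attr_equals("domain", "fiction")),
)

hits = search(CORPUS, complex_query)
print(hits)
```

Because each query is just a function, a new attribute or a new simple query type can be added without changing the combinators.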

Part 2: Text Formats

The chaos of text formats

Character formats

Text formats

Characters

“Legacy” formats from the 1980s (e.g. DOS and its fore-runners)

Unicode (now at version 5 beta):

“Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. No single encoding could contain enough characters: for example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English no single encoding was adequate for all the letters, punctuation, and technical symbols in common use.

These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption.” http://www.unicode.org/standard/WhatIsUnicode.html
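The conflict described in the quotation can be seen with a single byte. The sketch below uses two encodings chosen here purely for illustration: cp437 (the original IBM PC / DOS code page) and Latin-1. The variable names are arbitrary.

```python
# One byte, read under two legacy encodings, gives two different characters.
raw = b"\xe9"
as_latin1 = raw.decode("latin-1")   # e with acute accent
as_cp437 = raw.decode("cp437")      # Greek capital theta

# One character, written under different encodings, gives different bytes.
e_acute_cp437 = "\u00e9".encode("cp437")
e_acute_latin1 = "\u00e9".encode("latin-1")
e_acute_utf8 = "\u00e9".encode("utf-8")

print(repr(as_latin1), repr(as_cp437))
print(e_acute_cp437, e_acute_latin1, e_acute_utf8)
```

A corpus tool that guesses the wrong encoding for a legacy file will therefore silently turn accented letters into unrelated symbols, which is one reason the encoding is worth recording as a text attribute.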

Text Processing

Unix, Windows, Mac – can each handle some aspects of texts differently, e.g. how they process ends of lines
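As a small illustration of the end-of-line problem: Unix marks a line end with LF, Windows with CR+LF, and classic Mac OS used CR alone. The function below is a minimal sketch of one common remedy, normalising everything to LF before processing; its name and the sample byte strings are invented for this example.

```python
def normalise_line_endings(data: bytes) -> bytes:
    """Convert Windows (CR+LF) and classic Mac (CR) line ends to Unix (LF)."""
    # Replace CR+LF first, so that its CR is not converted a second time.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


unix = b"one\ntwo\n"
windows = b"one\r\ntwo\r\n"
old_mac = b"one\rtwo\r"

for sample in (unix, windows, old_mac):
    print(len(sample), normalise_line_endings(sample))
```

The Windows version of the same two-line text is two bytes longer than the Unix one, so byte offsets and file sizes are not comparable across platforms until line ends have been normalised.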

Word .doc v. RTF v. HTML v. XML: extra information built into the text

Prague.doc = 26,064 bytes

Prague.xml = 8,534 bytes

Prague.rtf = 7,893 bytes

Prague.htm = 7,272 bytes

Prague.txt = 8 bytes
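The size differences come from the extra information wrapped around the words. A minimal sketch of recovering the plain text from HTML, using Python's standard html.parser module, is given below. The sample markup is invented for illustration and is not the actual Prague.htm; the TextExtractor class and the visible_text function are likewise hypothetical names.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect character data, skipping elements that are never displayed as body text."""

    SKIPPED = ("script", "style", "title")

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside the skipped elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(markup: str) -> str:
    """Return the text content of an HTML string, joined with single spaces."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)


# Invented sample: one word of content inside typical HTML scaffolding.
SAMPLE_HTML = (
    "<html><head><title>Demo</title><style>p {margin: 0}</style></head>"
    "<body><p>Prague</p></body></html>"
)

text = visible_text(SAMPLE_HTML)
print(len(SAMPLE_HTML), len(text), repr(text))
```

Binary formats such as Word .doc need a dedicated converter; the point here is only that a format-specific step is required before any two files can be compared as text.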

Part 3: Text Boundaries

The Colony

“… a colony is a discourse whose component parts do not derive their meaning from the sequence in which they are placed.” (Hoey 1986: 4)

Examples of Colony Texts

“shopping lists, letter pages, dictionaries, hymn books, exam papers, concordances, small ads, class lists, bibliographies (to papers), abstracts (in volume form), constitutions, address books, newspapers, encyclopaedias, cookery books, seminar programmes, journals, certain kinds of reference books (e.g. Films on TV), footnotes to literary works, telephone directories, the Book of Proverbs, the Radio Times (and other TV magazines), gardening columns (sometimes), horoscopes (in newspapers), conference proceedings, menus…” (Hoey 1986:5)

Features of the Colony

1. Meaning not derived from sequence;

2. Adjacent units do not form continuous prose;

3. There is a framing context;

4. No single author and/or anon;

5. One component may be used without referring to the others;

6. Components can be reprinted or reused in subsequent works;

7. Components may be added, removed or altered;

8. Many of the components serve the same function;

9. Alphabetic, numeric or temporal sequencing.

(Hoey 1986:20)

“Mainstream” Texts

share some of these features: they

refer to

quote

allude to

share meaning with

other texts.

Part 4: Storage

Corpus Storage

Usually organised in folders and sub-folders, with some text attribute, often date, as the general key

Sometimes (BNC) the opportunity to make the filename informative has been wasted

But a tree is not the best way to access corpus contents…

[Figure: a folder tree with Corpus at the root and sub-folders 2001, 2002, 2003]

because

of what we saw in Part 1: there are a number of different text attributes

any of which at different times may guide a given research query

given the unpredictability of research goals

so

a better strategy would be to let the component texts remain wherever they happen to be:

in emails

in .doc, .html, .xml files

in previous corpora (.txt usually)

and access them by an index structure

Part 5: Finding

Accessing relevant corpus texts

via the index, with a mechanism for determining and then labelling each text’s

format

start and end

aboutness

language, authorship etc.

A database solution.
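One way such an index might look is sketched below with Python's built-in sqlite3 module. This is an assumption-laden illustration, not a description of any existing tool: the table layout, column names, file paths, authors and keywords are all invented. Each row points at a text where it already lives and records its format, start and end, language and authorship; a second table holds "aboutness" keywords.

```python
import sqlite3


def build_index(records):
    """Create an in-memory index: one row per text, plus its aboutness keywords."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE texts (
               id INTEGER PRIMARY KEY,
               path TEXT NOT NULL,       -- where the text already lives
               format TEXT,              -- email, doc, html, xml, txt ...
               start_offset INTEGER,     -- text boundary: first byte
               end_offset INTEGER,       -- text boundary: last byte
               language TEXT,
               author TEXT,
               year INTEGER)"""
    )
    conn.execute(
        "CREATE TABLE aboutness (text_id INTEGER REFERENCES texts(id), keyword TEXT)"
    )
    for rec in records:
        cur = conn.execute(
            "INSERT INTO texts (path, format, start_offset, end_offset,"
            " language, author, year) VALUES (?, ?, ?, ?, ?, ?, ?)",
            rec[:7],
        )
        conn.executemany(
            "INSERT INTO aboutness VALUES (?, ?)",
            [(cur.lastrowid, keyword) for keyword in rec[7]],
        )
    return conn


def find_paths(conn, keyword, language):
    """Return the paths of texts in a given language that are 'about' a keyword."""
    rows = conn.execute(
        "SELECT t.path FROM texts t JOIN aboutness a ON a.text_id = t.id"
        " WHERE a.keyword = ? AND t.language = ? ORDER BY t.path",
        (keyword, language),
    )
    return [row[0] for row in rows]


# Invented sample records: the texts stay where they are, only the index is new.
RECORDS = [
    ("mail/inbox.mbox", "email", 1024, 2310, "English", "A. Writer", 2005, ["corpus", "index"]),
    ("docs/report.doc", "doc", 0, 26064, "English", "B. Author", 2006, ["storage"]),
    ("old_corpus/news01.txt", "txt", 0, 5120, "Czech", "anon", 2003, ["corpus", "newspaper"]),
]

conn = build_index(RECORDS)
paths = find_paths(conn, "corpus", "English")
print(paths)
```

Because the attributes are columns rather than folder names, any of them can guide a query, and no single attribute such as date has to be chosen in advance as the organising key.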

Conclusions

1. Only a sub-set of retrieval methods are catered for at present

2. Text formats represent a significant problem for corpus builders

3. Text boundaries are often (always?) quite fuzzy if one is interested in meaning

4. Storage has traditionally been organised in discrete corpora

5. But it would be better to organise a discrete index instead.

…which is not very different from…

References

Aston, Guy & Lou Burnard (1998) The BNC Handbook. Edinburgh: Edinburgh University Press.

Hoey, M. (1986) “The Discourse Colony: a preliminary study of a neglected discourse type”, in M. Coulthard (ed.) Talking About Text. Birmingham: English Language Research, Discourse Analysis Monographs no. 13, pp. 1-26.

Scott, Mike & Chris Tribble (2006) Textual Patterns: key words and corpus analysis in language education. Amsterdam: Benjamins.

Wells, H.G. (1938) World Brain. New York: Doubleday.

Witten, I.H., A. Moffat & T.C. Bell (1999) Managing Gigabytes. 2nd edition. San Francisco: Morgan Kaufmann.
