Corpus Linguistic Processing Problems Mike Scott School of English University of Liverpool Charles University, Prague 22.5.06 This presentation is at www.lexically.net/downloads/corpus_linguistics

Corpus Linguistic Processing Problems

Mike Scott
School of English
University of Liverpool

Charles University, Prague 22.5.06

This presentation is at www.lexically.net/downloads/corpus_linguistics

Abstract

This lecture considers the problems of handling and analysing sizeable corpora using standard PC technology. Issues to be addressed include the problem of dealing with text in a variety of formats, what constitutes a text boundary, memory versus disk storage, and retrieval from a hard disk of relevant texts which can be said to be “about” a given topic.

H.G. Wells World Brain (1938)

This World Encyclopaedia would be the mental background of every intelligent man in the world. It would be alive and growing and changing continually under revision, extension and replacement from the original thinkers in the world everywhere. Every university and research institute would be feeding it … its contents would be the standard source of material… (in Witten et al 1999:435)

Issues and Questions

Retrieval – Queries

Text formats

Text boundaries

Storage

Finding relevant text data on a hard disk

Part 1: Retrieval

What we do with a Corpus: search it

Text Focus

1. Find texts meeting certain criteria

2. Discover characteristics of text-types

Language Focus

1. Find words / phrases / structures meeting certain criteria

2. Discover characteristics of words / phrases / structures

Find text-types with these characteristics

“Query” Operations

List all instances of X

Addition Operations

merge documents

insert into list

Removal Operations

split documents

delete from list

View Operations

re-order

see wider context
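
The query and view operations above ("list all instances of X", "re-order", "see wider context") can be sketched as a simple keyword-in-context listing. This is only an illustration under stated assumptions: the function name `kwic`, the `width` parameter and the sample sentence are invented for the example and are not taken from any particular concordancer.

```python
import re

def kwic(text, word, width=30):
    """List every instance of `word` with `width` characters of context each side."""
    lines = []
    pattern = r"\b" + re.escape(word) + r"\b"
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-justify the left context so the hits line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

# Hypothetical sample text, for illustration only.
sample = "The corpus is large. A corpus needs an index. Corpus tools vary."

# Query operation: list all instances of X.
hits = kwic(sample, "corpus", width=12)
for line in hits:
    print(line)

# View operation: re-order the list by what follows the search word.
by_right_context = sorted(hits, key=lambda l: l.split("] ", 1)[1].lower())

# View operation: see wider context by re-running with a larger window.
wider = kwic(sample, "corpus", width=25)
```

Removal ("delete from list") and addition ("insert into list") are then ordinary list operations on `hits`.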

Text Attributes

Date

Authorship

Readership / audience

Location

Participants

Length

Format (encoding)

Language

Style

Mode

Domain

Availability

Meaning (aboutness)

etc.

Simple Query Types

identical to topic/wording X

similar to topic X

touches on topic X

quotes text X

quoted by text Y

refers/alludes to text X

referred to in text Y

Complex Queries

More than 1 simple query type, and/or more than 1 text attribute …

…in Boolean combinations (and, or, not)
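
A complex query of this kind can be sketched by treating each simple query or text attribute as a yes/no test on a text and combining the tests with and, or, not. Everything named here is an assumption made for the example: the `Text` record, its fields, the helper names and the four sample texts.

```python
from dataclasses import dataclass

@dataclass
class Text:
    # Hypothetical record: a few of the text attributes plus the text itself.
    name: str
    year: int
    language: str
    body: str

def mentions(word):
    """Simple query: the text touches on `word` (a crude stand-in for topic)."""
    return lambda t: word.lower() in t.body.lower()

def attribute(field, value):
    """Attribute test: the named attribute has exactly this value."""
    return lambda t: getattr(t, field) == value

def all_of(*queries):   # Boolean AND
    return lambda t: all(q(t) for q in queries)

def any_of(*queries):   # Boolean OR
    return lambda t: any(q(t) for q in queries)

def none_of(query):     # Boolean NOT
    return lambda t: not query(t)

texts = [
    Text("a.txt", 2005, "en", "An index speeds up retrieval."),
    Text("b.txt", 2006, "cs", "Index je užitečný."),
    Text("c.txt", 2001, "en", "The index was rebuilt."),
    Text("d.txt", 2006, "en", "Folders and sub-folders."),
]

# "mentions index AND (year 2005 OR year 2006) AND NOT in Czech"
query = all_of(
    mentions("index"),
    any_of(attribute("year", 2005), attribute("year", 2006)),
    none_of(attribute("language", "cs")),
)
matches = [t.name for t in texts if query(t)]
print(matches)
```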

Part 2: Text Formats

The chaos of text formats

Character formats

Text formats

Characters

“Legacy” formats from the 1980s (e.g. DOS and its fore-runners)

Unicode (now at version 5 beta):

“Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. No single encoding could contain enough characters: for example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English no single encoding was adequate for all the letters, punctuation, and technical symbols in common use.

These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption.” http://www.unicode.org/standard/WhatIsUnicode.html
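
The conflict described in the quotation can be seen in a few lines. The sketch below uses only Python's built-in codecs; the choice of byte, code pages and sample word is an assumption made for illustration.

```python
# One number, three different characters, depending on the legacy encoding.
raw = b"\xe8"
as_cp1250 = raw.decode("cp1250")    # Windows Central European code page
as_latin1 = raw.decode("latin-1")   # Western European (ISO 8859-1)
as_cp437 = raw.decode("cp437")      # code page of the original IBM PC / DOS
print(as_cp1250, as_latin1, as_cp437)

# One word, different numbers, depending on the encoding used to store it.
word = "Čeština"
utf8_bytes = word.encode("utf-8")
cp1250_bytes = word.encode("cp1250")
print(utf8_bytes, cp1250_bytes)

# Reading the legacy bytes with the wrong encoding silently corrupts the text.
misread = cp1250_bytes.decode("latin-1")
print(misread == word)
```

A corpus tool therefore needs to know, or guess, each file's encoding before it can count a single word reliably.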

Text Processing

Unix, Windows and Mac each handle some aspects of text differently, e.g. how they mark the ends of lines

Word .doc v. RTF v. HTML v. XML: extra information built into the text
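
Both points can be sketched briefly: bringing the three line-ending conventions into one form, and setting aside the extra information in a marked-up file so that only the text remains. This is a minimal sketch using Python's standard `html.parser`; the helper names and sample strings are invented for the example, and real .doc or RTF files would need their own converters.

```python
from html.parser import HTMLParser

def normalise_newlines(text):
    """Map Windows (CR LF) and old Mac (CR) line ends onto the Unix form (LF)."""
    return text.replace("\r\n", "\n").replace("\r", "\n")

class TextOnly(HTMLParser):
    """Collect character data and ignore the tags around it."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(markup):
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)

windows = "line one\r\nline two\r\n"
old_mac = "line one\rline two\r"
unix = "line one\nline two\n"
same = normalise_newlines(windows) == normalise_newlines(old_mac) == unix
print(same)

# Hypothetical one-word HTML file.
html = "<html><body><p>Prague</p></body></html>"
plain = strip_html(html)
print(plain)
```

Note that this naive version also keeps the contents of any script or style elements, which a real converter would have to drop.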

Prague.doc = 26,064 bytes

Prague.xml = 8,534 bytes

Prague.rtf = 7,893 bytes

Prague.htm = 7,272 bytes

Prague.txt = 8 bytes
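
The arithmetic behind these figures can be worked through as follows. The sketch assumes, as the slide implies, that all five files hold the same short text, so that the 8 bytes of Prague.txt are the text itself and everything beyond that in the other files is format information.

```python
# File sizes in bytes, as given on the slide.
sizes = {
    "Prague.doc": 26064,
    "Prague.xml": 8534,
    "Prague.rtf": 7893,
    "Prague.htm": 7272,
    "Prague.txt": 8,
}

text_bytes = sizes["Prague.txt"]  # assumed to be the text with no markup at all

# How many times larger each file is than the plain text.
ratios = {name: size / text_bytes for name, size in sizes.items()}

for name, size in sizes.items():
    non_text = (size - text_bytes) / size
    print(f"{name:<11} {size:>6} bytes  {ratios[name]:>8.1f}x  {non_text:.2%} not text")
```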

Part 3: Text Boundaries

The Colony

“… a colony is a discourse whose component parts do not derive their meaning from the sequence in which they are placed.” (Hoey 1986: 4)

Examples of Colony Texts

“shopping lists, letter pages, dictionaries, hymn books, exam papers, concordances, small ads, class lists, bibliographies (to papers), abstracts (in volume form), constitutions, address books, newspapers, encyclopaedias, cookery books, seminar programmes, journals, certain kinds of reference books (e.g. Films on TV), footnotes to literary works, telephone directories, the Book of Proverbs, the Radio Times (and other TV magazines), gardening columns (sometimes), horoscopes (in newspapers), conference proceedings, menus…” (Hoey 1986:5)

Features of the Colony

1. Meaning not derived from sequence;
2. Adjacent units do not form continuous prose;
3. There is a framing context;
4. No single author and/or anon;
5. One component may be used without referring to the others;
6. Components can be reprinted or reused in subsequent works;
7. Components may be added, removed or altered;
8. Many of the components serve the same function;
9. Alphabetic, numeric or temporal sequencing.

(Hoey 1986:20)

“Mainstream” Texts

share some of these features: they

refer to

quote

allude to

share meaning with

other texts.

Part 4: Storage

Corpus Storage

Usually done with folders and sub-folders, keyed on some text attribute, often date

Sometimes (BNC) the opportunity to make the filename informative has been wasted

But a tree is not the best way to access corpus contents…

[Diagram: a folder tree with “Corpus” at the root and sub-folders 2001, 2002, 2003]

because

of what we saw in Part 1: there are a number of different text attributes

any of which at different times may guide a given research query

given the unpredictability of research goals

so

a better strategy would be to let the component texts remain wherever they happen to be:

in emails

in .doc, .html, .xml files

in previous corpora (.txt usually)

and access them by an index structure
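
One way to picture such an index structure: the texts stay where they are, and the index records, for each attribute value, which locations carry it. The sketch below is an illustration only; the class name `CorpusIndex`, the attribute names and all file locations are hypothetical.

```python
from collections import defaultdict

class CorpusIndex:
    """Map (attribute, value) pairs to the locations of the texts that have them."""

    def __init__(self):
        self._by_attr = defaultdict(set)

    def add(self, location, **attributes):
        # The text itself is not copied or moved; only its location is recorded.
        for key, value in attributes.items():
            self._by_attr[(key, value)].add(location)

    def find(self, **attributes):
        # Texts matching every requested attribute: intersect the location sets.
        sets = [self._by_attr.get(item, set()) for item in attributes.items()]
        return set.intersection(*sets) if sets else set()

index = CorpusIndex()
# Hypothetical locations, for illustration only.
index.add("mail/inbox/msg0412.eml", format="email", year=2005, language="en")
index.add("docs/report.doc", format="doc", year=2005, language="en")
index.add("web/praha.html", format="html", year=2006, language="cs")
index.add("old_corpus/2001/a01.txt", format="txt", year=2001, language="en")

print(sorted(index.find(year=2005)))
print(sorted(index.find(language="en", format="txt")))
```

Unlike a folder tree, which privileges one attribute such as date, every attribute here is an equally direct route to the texts.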

Part 5: Finding

Accessing relevant corpus texts

via the index

with a mechanism for determining & then labelling each text’s

format

start and end

aboutness

language, authorship etc.

A database solution.
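
A database solution along these lines might look like the sketch below, which uses Python's built-in `sqlite3` module. The table layout, column names, keywords and sample rows are all assumptions made for the example, not a description of any existing tool; in particular, storing start and end as offsets allows one file to hold several texts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per text: where it lives, its format, and where it starts and ends.
conn.execute("""
    CREATE TABLE texts (
        text_id      INTEGER PRIMARY KEY,
        location     TEXT,
        format       TEXT,
        start_offset INTEGER,
        end_offset   INTEGER,
        language     TEXT,
        author       TEXT
    )""")
# Aboutness as a set of labels per text (hypothetical keywords).
conn.execute("CREATE TABLE aboutness (text_id INTEGER, keyword TEXT)")

conn.executemany("INSERT INTO texts VALUES (?, ?, ?, ?, ?, ?, ?)", [
    (1, "corpora/news2003.txt", "txt", 0, 4210, "en", "anon"),
    (2, "corpora/news2003.txt", "txt", 4211, 9800, "en", "anon"),
    (3, "docs/trip.doc", "doc", 0, 26064, "en", "anon"),
    (4, "web/praha.htm", "html", 0, 7272, "cs", "anon"),
])
conn.executemany("INSERT INTO aboutness VALUES (?, ?)", [
    (1, "prague"), (2, "football"), (3, "prague"), (4, "prague"),
])

# Texts about Prague, in English, and not in .doc format.
rows = conn.execute("""
    SELECT t.location, t.start_offset, t.end_offset
    FROM texts t
    JOIN aboutness a ON a.text_id = t.text_id
    WHERE a.keyword = ? AND t.language = ? AND NOT t.format = ?
    ORDER BY t.text_id
    """, ("prague", "en", "doc")).fetchall()
print(rows)
```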

Conclusions

1. Only a sub-set of retrieval methods are catered for at present

2. Text formats represent a significant problem for corpus builders

3. Text boundaries are often (always?) quite fuzzy if one is interested in meaning

4. Storage has traditionally been organised in discrete corpora

5. But it would be better to organise a discrete index instead.

…which is not very different from…

References:

Aston, Guy & Lou Burnard (1998) The BNC Handbook. Edinburgh: Edinburgh University Press.

Hoey, M. (1986) “The Discourse Colony: a preliminary study of a neglected discourse type”, in M. Coulthard (ed.) Talking About Text. Birmingham: English Language Research Discourse Analysis Monographs no. 13, pp. 1-26.

Scott, Mike & Chris Tribble (2006) Textual Patterns: key words and corpus analysis in language education. Amsterdam: Benjamins.

Wells, H.G. (1938) World Brain. New York: Doubleday.

Witten, I.H., A. Moffat & T.C. Bell (1999) Managing Gigabytes. 2nd edition. San Francisco: Morgan Kaufmann.