

Corpus Linguistics and Lexicography *

WOLFGANG TEUBERT

International Journal of Corpus Linguistics, Vol. 6 (special issue), 2001, 125–153. John Benjamins Publishing Co.

Corpus Linguistics—More Than a Slogan?

During the last decade, it has been common practice among linguists in Europe, both on the continent and in the British Isles, to use corpus linguistics to verify the results of classical linguistics. In North America, however, the situation is different. There, the Philadelphia-based Linguistic Data Consortium, responsible for the dissemination of language resources, addresses the commercially oriented market of language engineering rather than academic research, the latter often being more interested in universal grammar or semantic universals than in the idiosyncrasies of natural languages. American corpus linguists such as Doug Biber or Nancy Ide, and general linguists who are corpus users by conviction such as Charles Fillmore, are almost better known in Europe than in the United States. This is all the more astonishing given that the first real corpus in the modern sense, the Brown Corpus, was compiled in Providence, R.I., during the sixties.

Meanwhile, European corpus linguistics is gradually becoming a sub-discipline in its own right. Unfortunately, during the last few years, this has led to a slight bias towards ‘self-centred’ issues such as the problems of corpus compilation, encoding, annotation and validation, the procedures needed for transforming raw corpus data into artificial intelligence applications and automatic language processing software, not to mention the problem of standardisation with regard to form and content (cf. the long-term project EAGLES [Expert Advisory Group on Language Engineering Standards] and the transatlantic TEI [Text Encoding Initiative]). Today, these issues often tend to prevail over the original gain that the analysis of corpora may contribute to our knowledge of language. But it was exactly this corpus-specific knowledge that the first generation of European corpus linguists such as Sture Allén, Vladislav Andrushenko, Stig Johansson, Ferenc Kiefer, Bernard Quemada, Helmut Schnelle, or John Sinclair had in mind. In West Germany, the Institut für Deutsche Sprache was among the first institutions to consider the collection of corpus data one of their permanent tasks; its corpora date back as far as the late sixties, although at that time most corpus data was still used only for the verification of research results gained by traditional methods. But has today’s corpus linguistics really advanced from there?

The recent textbooks claiming to provide an introduction to corpus linguistics still do not add up to more than a dozen, all of them in English. Unfortunately, except for the commendable books of Stubbs 1996 and Biber, Conrad and Reppen 1998, they do deplorably little to establish corpus linguistics as a linguistic discipline in its own right. Instead, they focus on the use of corpora and corpus analysis in traditional linguistics (syntax, lexicology, stylistics, diachrony, variety research) and applied linguistics (language teaching, translation, language technology). Corpus Linguistics by Tony McEnery and Andrew Wilson (McEnery and Wilson 1996) may serve as an example of this kind. Forty pages describe the aspects of encoding; 20 pages deal with quantitative analysis; 25 pages describe the usefulness of corpus data for computational linguistics; another 30 pages cover the use of corpora in speech, lexicology, grammar, semantics, pragmatics, discourse analysis, sociolinguistics, stylistics, language teaching, diachrony, dialectology, language variation studies, psycholinguistics, cultural anthropology and social psychology; and the final 20 pages contain a case study on sublanguages and closure. McEnery and Wilson’s book reflects the current state of corpus linguistics. In fact, it more or less corresponds to the topics covered at the annual meetings held by the venerable ICAME, an association dealing with English language corpora (cf. Renouf 1998). Semantics is mostly left aside.

Surprisingly, when judged by their commercial value, it is not the written language corpora that are most successful; rather, speech corpora command the highest prices. Speech corpora are special collections of carefully selected text samples (words, phrases, sentences) spoken by numerous different speakers under various acoustic conditions. They brought about the final breakthrough in automatic speech recognition that computer models based on cognitive linguistics had failed to achieve for many years. The recognition of speech patterns was only made possible by a combination of categorial and probabilistic approaches towards a connectionist model trained on large speech corpora. Speech analysis can thus be seen as an early impetus for the establishment of corpus linguistics as an independent discipline with its own theoretical background.

Lexicography is the second major field where corpus linguistics not only introduced new methods but also extended the entire scope of research, albeit without putting much emphasis on the theoretical aspects of corpus-based lexicography. Here again, it was John Sinclair who led the way as initiator of the first strictly corpus-based dictionary of general language (COBUILD 1987). Britain was also the site of the first corpus-based collocation dictionaries (such as Kjellmer 1994). Bilingual lexicography may also benefit from a corpus-oriented approach, as is evident when comparing the traditional Le Robert & Collins English-French Dictionary, edited by B.T.S. Atkins, with the Oxford-Hachette Dictionary by Valerie Grundy and Marie-Hélène Corréard, which covers the same language pair. In the latter, the use of (monolingual) corpora led to a remarkably greater number of multi-word translation units (collocations, set phrases) and to context profiles written with the target language in mind. Wörter und Wortgebrauch in Ost und West [Words and Word Usage in East and West Germany] (1992) by Manfred W. Hellmann may serve as the only German example of that era; it uses the corpus for lemma selection rather than semantic description. Only recently, in 1997, did a true corpus-based dictionary appear: Schlüsselwörter der Wendezeit [Keywords during German Unification] by Dieter Herberg, Doris Steffens and Elke Tellenbach.

Thus, at least in the field of written language, corpus linguistics is still in its infancy as a discipline with its own theoretical background, a statement which holds true not only for Germany but also for most other European countries. In this orientation phase, where corpus linguistics is still in the process of defining its position, most publications are in English, the language that has become the interlingua of the modern world. But this does not mean that corpus linguistics is dominated by English and American scholars, as can be clearly seen when browsing through any issue of the International Journal of Corpus Linguistics. Still, German linguistics appears somewhat underrepresented in this discussion. One exception is Hans Jürgen Heringer. His innovative study on ‘distributive semantics’ shows a growing reception of the programme for corpus linguistics which is outlined below. In his book Das höchste der Gefühle [The most sublime of feelings] (Heringer 1999), he describes the validation of semantic cohesion between adjacent words on the basis of larger corpora. Above all, it is this area between lexis and syntax where corpus linguistics offers new insights.

Corpus Linguistics—A Programme

Corpus linguistics believes in structuralism as defined by John R. Firth; therefore, it insists on the notion that language as a research object can only be observed in the form of written or spoken texts. Neither language-independent cognition nor propositional logic can provide information on the nature of natural languages. For these are, as stated in an apophthegm by Mario Wandruszka, characterised by a mixture of analogy and anomaly. The quest for a universal structure of grammar and lexicon, which is typical of the followers of Chomsky or Lakoff, cannot meet the demands of these two aspects.¹ Instead, corpus linguistics is closer to the semantic concept inherent in the continental European structuralism of Ferdinand de Saussure, which regards meaning as inseparable from form, that is, from the word, the phrase, the text. In this theory, meaning does not exist per se. Corpus linguistics rejects the ubiquitous concept of meaning as ‘pure information,’ encoded into language by the sender and decoded by the receiver. Instead, it holds that content cannot be separated from form; rather, they constitute the two aspects under which texts can be analysed. The word, the phrase, the text is both form and meaning.

The above statement clearly outlines the programme of corpus linguistics. It is mainly interested in those phenomena on the fringe between syntax and lexicon, the two subjects of classical linguistics. It deals with the patterns and structures of semantic cohesion between text elements that are interpreted as compounds, multi-word units, collocations and set phrases. In these phenomena, the importance of the context for the meaning becomes evident.

Corpus linguistics extends our knowledge of language by combining three different approaches: the (procedural) identification of language data by categorial analysis, the correlation of language data by statistical methods and finally the (intellectual) interpretation of the results. Whilst the first two steps should be done automatically as much as possible, the last step requires human intentionality, as any interpretation is an act involving consciousness and, therefore, not transmutable into an algorithmic procedure. This is the main difference between corpus linguistics and computational linguistics, which reduces language to a set of procedures.

Corpus linguistics assumes that language is a social phenomenon, to be observed and described above all in accessible empirical data, that is, in communication acts. Corpora are cross-sections through a universe of discourse which incorporates virtually all communication acts of any selected language community, be it monolingual (e.g., German or English), bilingual (e.g., South Tyrolean, Welsh) or multilingual (e.g., Western European). However, the majority of texts that are preserved and made accessible through corpora in principle have only a limited life-span: most printed texts such as newspaper texts are out of public reach within a very short time.

If we consider language as a social phenomenon, we do not know (and do not want to know) what is going on in the minds of the people, how the speaker or the hearer understands the words, sentences and texts that they speak or hear. Language as a social phenomenon manifests itself only in texts that can be observed, recorded, described and analysed. Most texts happen to be communication acts, that is, interactions between members of a language community. An ideal universe of discourse would be the sum of all communication acts ever uttered by members of a language community. Therefore, it has an inherent diachronic dimension. Of course, this ideal universe of discourse would be far too large for linguistics to explore in its entirety. It would have to be broken down into cross-sections with regard to the phenomena that we want to describe. There is no such thing as a ‘one-size-fits-all’ corpus. It is the responsibility of the linguist to limit the scope of the universe of discourse in such a way that it may be reduced to a manageable corpus, by means of parameters such as language (sociolect, terminology, jargon), time, region, situation, external and internal textual characteristics, to mention just a few.
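The reduction of a universe of discourse to a manageable corpus by means of such parameters can be sketched as a simple metadata filter. The following Python sketch is purely illustrative: the record fields (`language`, `year`, `region`, `text_type`), the function name `select_corpus` and the five sample records are assumptions made for this example and do not describe any existing corpus or tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRecord:
    """One text of a (hypothetical) universe of discourse, with metadata."""
    language: str
    year: int
    region: str
    text_type: str
    text: str


def select_corpus(records, *, language=None, years=None, region=None, text_type=None):
    """Return the records that match every parameter that was supplied.

    `years` is an inclusive (first, last) pair; parameters left at None
    do not restrict the selection.
    """
    selected = []
    for rec in records:
        if language is not None and rec.language != language:
            continue
        if years is not None and not (years[0] <= rec.year <= years[1]):
            continue
        if region is not None and rec.region != region:
            continue
        if text_type is not None and rec.text_type != text_type:
            continue
        selected.append(rec)
    return selected


# Invented toy records; the texts are placeholders, not corpus data.
UNIVERSE = [
    TextRecord("de", 1968, "West", "newspaper", "text A"),
    TextRecord("de", 1987, "West", "newspaper", "text B"),
    TextRecord("de", 1989, "East", "newspaper", "text C"),
    TextRecord("en", 1987, "UK", "newspaper", "text D"),
    TextRecord("de", 1987, "West", "fiction", "text E"),
]

# A cross-section: German newspaper texts from the West, 1985 to 1990.
corpus = select_corpus(
    UNIVERSE, language="de", years=(1985, 1990), region="West", text_type="newspaper"
)
```

Each additional parameter narrows the cross-section, so the same universe yields different corpora depending on the phenomenon under investigation.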

When looking at language as a social phenomenon, we assume that meaning is expressed in texts. What a text element or text segment means is the result of negotiation among the members of a language community, and these negotiations are also part of the discourse. Thus, the language community sets the conventions on the formal correctness of sentences and on their meaning. Those conventions are both implicit and dynamic; they are not engraved in stone like commandments. Any communication act may utilise syntactic structures in a new way, create new collocations, introduce new words or redefine existing ones. If those modifications are used in a sufficient number of other communication acts or texts, they may well result in the modification or amendment of an existing convention. One basic difference between natural and formal languages is the fact that natural language not only permits but actually integrates metalinguistic statements without explicitly marking the metalinguistic level. There is no separation between object language and metalanguage. Any convention may be discussed, questioned or even rejected in a text. Above all, discourses deal with meaning, and it is corpus linguistics that is best suited to deal with this dynamic aspect of meaning.

We, as linguists, have no access to the cognitive encoding of the conventions of a language community. We only know what is expressed in texts. Dictionaries, grammars, and language textbooks are also texts; therefore, they are part of the universe of discourse. As long as they represent socially accepted standards, we have to consider their special status. Still, their contents are neither comprehensive nor always based on factual evidence. Corpus linguistics, on the other hand, aims to reveal the conventions of a certain language community on the basis of a relevant corpus. In a corpus, words are embedded in their context. Corpus linguistics is, therefore, especially suited to describe the gradual changes in meaning: it is the context which determines the concrete meaning in most areas of the vocabulary.
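How a corpus presents words embedded in their context can be illustrated by a keyword-in-context (KWIC) listing. The sketch below is a minimal illustration under stated assumptions: the function name `kwic`, the simple regular-expression tokenisation and the three invented sample sentences are chosen for this example only and are not taken from any corpus tool mentioned in this article.

```python
import re


def kwic(text, node, width=3):
    """Return (left context, node, right context) for each occurrence of `node`.

    Tokenisation is deliberately naive: the text is lower-cased and split
    into runs of word characters. `width` is the number of context tokens
    kept on each side.
    """
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, token in enumerate(tokens):
        if token == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Invented sample text, not corpus data.
SAMPLE = "Hard work pays off. The hard core stayed. He took a hard blow."

for left, node, right in kwic(SAMPLE, "hard", width=2):
    print(f"{left:>10} | {node} | {right}")
```

The listing only makes the contexts visible; deciding what they reveal about the meaning of the node word remains an act of interpretation.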

Cognitive Linguistics, Logical Semantics and Corpus Linguistics

People normally (if they are not linguists, that is) listen to or read texts because of their meaning. They are interested in the syntactic features of phrases, sentences or texts only insofar as is necessary for understanding them. Meaning is the core feature of natural language, and this is the reason why semantics is the central linguistic discipline. Still, regardless of the enormous progress that phonology, syntax and many other disciplines have made, when it comes to explaining and describing the meaning of phrases, sentences, and texts, we are far from a consensus.


As said above, corpus linguistics regards language as a social phenomenon. This implies a strict division between meaning and understanding. Is it really the task of linguistics to investigate how the speaker and the listener understand the words, sentences or texts that they utter or perceive? Understanding is a psychological, a mental, or (in modern words) a cognitive phenomenon. This is why no bond exists between cognitive linguistics and corpus linguistics. Language as a social phenomenon is laid down in texts and only there. If we, as corpus linguists, wish to find out how a text is understood, we have to ask the listeners for paraphrases; these paraphrases, being texts themselves, again become part of the discourse and can become the object of linguistic analysis.

The difference between cognitive linguistics and corpus linguistics lies in how each deals with the unique property of language to signify. Any text element is inevitably both form (expression) and meaning. If you delete the form, the meaning is deleted as well. There is no meaning without form, without an expression. Text elements and segments are symbols, and being symbols, linguistic signs, they can be analysed in principle under two aspects: the form aspect or the meaning aspect. The consequence of this stance is that the only way to express the meaning of a text element or a text segment is to interpret it, that is, to paraphrase it. This is the stance of hermeneutic philosophy, as opposed to analytic philosophy (cf. Keller 1995, Jäger [2000]).

In cognitive linguistics, which is embedded in analytic philosophy, meaning and understanding are seen as one. Here, text elements and text segments correspond to conceptual representations on the mental level. Within this system, however, it is not clear what the term ‘representation’ means. Does it refer to content linked with a form (what we could call presentations) or does it refer to pure content disconnected from form (what we could call ideations)? This ambiguity is of vast consequence (Janik and Toulmin 1973: 133), as presentations themselves are signs, that is, symbols, and thus need to be understood, that is, interpreted. Cognitive linguistics, however, does not tell us how this is to happen. Rather, it describes the manipulation of mental representations as a process (whereas an interpretation is an act, presupposing intentionality). Processes themselves are meaningless. It is only the act of interpretation that assigns meaning to them. Both Daniel Dennett and John Searle point out this aporia of the cognitive approach. In their opinion, the mental processes would again require a central meaner (Dennett 1998: 287f.) or homunculus (Searle 1992: 212f.) on a level higher than cognition, that is, for understanding mental representations, and the same would then apply for that level, too, and so on, ad infinitum.

On the other hand, if we translate ‘representation’ with ‘ideation,’ we dismiss the assumption of the symbolic character of language. The meaning of a word, a sentence or a text would then correspond to something immaterial, something without form, formulated in a so-called ‘mental language,’ whose elements would consist of either complex or atomistic concepts, depending on whether one refers to Anna Wierzbicka and the early Jerry Fodor (Wierzbicka 1996, Fodor 1975) or to the later Jerry Fodor (Fodor 1998). On a large scale, these concepts of cognitive linguistics seem to correspond to words, but the difference lies in the fact that they are not material symbols which call for interpretation; instead, they are pure astral ideation, not contaminated by any form (cf. Teubert 1999).

In practice, particularly in artificial intelligence and automatic translation, this cognitive approach has failed. Alan Melby gave a plausible explanation of why it was bound to fail no matter which formal language had been defined for encoding the conceptual representations: “The real problem could be that the language-independent universal sememes we were looking for do not exist... [O]ur approach to word senses was dead wrong.” (Melby 1995: 48.)

It seems that the idea behind cognitive linguistics is the transduction or translation of phrases, sentences and texts in natural language, that is, of symbolic units, into an obviously language-independent ‘language of thought’ or ‘mentalese,’ which is non-symbolic and is exclusively defined by syntax. This transduction or translation is seen as a process and does not involve intentionality. Cognitive linguistics is committed to the computational model of mind. According to this theory, mental representations are seen as structures consisting of what are called uninterpreted symbols, while mental processes are caused by the manipulation of these representations according to rule-based, that is, exclusively syntactic, algorithms. But does it really make sense to use the term ‘symbols’ for these mental representation units, just as we call words ‘linguistic signs’? On a cognitive (or computational) level, those entities are only symbols inasmuch as a content can be assigned to them from outside the mental (or computational) calculus. This content or meaning, however, does not affect the permissibility of manipulations with regard to their representation. The content of a text consisting of linguistic signs, on the other hand, is something inherent to the text itself (and not assigned from the outside), a feature we can and must investigate if we want to make sense of a text. As Rudi Keller has pointed out, the symbols of natural language are suitable for and in need of interpretation (Keller 1995).

What appeals to many researchers of semantics is the fact that in cognitive semantics the meaning of a text is expressed through a calculation whose expressions are based exclusively on syntactic rules, or in other words, that semantics is transformed into syntax. They take it for granted that this is possible, as they claim that both natural and formal language work with symbols. But in natural language, these symbols need to be interpreted, whereas symbols in formal languages work without being assigned a certain (external) definition. Whether a formal language, a calculus, permits a certain permutation of symbols or not has nothing to do with the meaning or the definition of these symbols; it is just a question of syntax. As early as 1847, George Boole stated: “Those who are acquainted with the present state of the theory of Symbolic Algebra, are aware that the validity of the processes of analysis does not depend upon the interpretation of the symbols which are employed, but solely upon the laws of their combination.” Richard Montague also believes in the possibility of describing natural language semantics the same way as formal language semantics: “There is in my opinion no important theoretical difference between natural languages and the artificial languages of logicians; indeed, I consider it possible to comprehend the syntax and semantics of both kinds of languages within a single natural and mathematically precise theory. On this point I differ from a number of philosophers, but agree, I believe, with Chomsky and his associates.” (Both quotes from Devlin, 1997: 73 and 117.)

From the point of view of corpus linguistics, the meaning of natural language symbols, of text elements or text segments is negotiated by the discourse participants and can be found in the paraphrases they offer, and it is contained in language usage, that is, in context patterns. Natural language symbols refer not so much to language-external facts, but rather they create semantic links to other language signs. The meaning of a text segment is the history of the use of its constituents.

Linguistic signs always require interpretation. Whoever understands a text is able to interpret it. This interpretation can be communicated as a text in itself, a paraphrase of the original text. The act of interpretation requires intentionality and, therefore, cannot be reduced to a rule-based, algorithmic, ‘mathematically precise’ procedure. If we see language as a social phenomenon, natural language semantics can leave aside the mental or cognitive level. Everything that can be said about the meaning of words, phrases or sentences will be found in the discourse. Anything that cannot be paraphrased in natural language has nothing to do with meaning. In a nutshell, this is the core programme that distinguishes corpus linguistics from cognitive linguistics.

Collocation and Meaning

In traditional linguistics, it is rather difficult to pinpoint the difference between a collocation such as harte Auseinandersetzung (hefty discussion) and a free combination such as harte Matratze (hard mattress). In corpus linguistics, on the other hand, the awareness among the members of a language community of a distinct semantic cohesion between the lexical elements of a collocation can be traced by statistical means, that is, by detecting a significant co-occurrence of these elements within a sufficiently large corpus. Before it was possible to process large amounts of language data procedurally and systematically, syntactic rules were the only way to describe the complex co-occurrence behaviour of textual elements (i.e., words). Such rules describe the relation between different classes of elements, for instance, between nouns and modifying adjectives. Still, syntactic descriptions such as ‘Adjective + Noun’ are not specific enough to detect collocations as distinct types of semantic relationships. Traditional lexicology fails to come up with a feasible definition for collocations that would allow their automatic identification in a corpus. To classify certain co-occurring textual elements as semantic units, that is, as collocations, it is necessary to recognise these text segments as recurrent phenomena, which is only possible within a sufficiently large corpus. Therefore, we must complement the intratextual perspective with its intertextual counterpart. By applying probabilistic methods, it is possible to measure recurrence within a virtual universe of discourse, or more precisely, within a real corpus. Collocation dictionaries in the strict sense are always corpus-based. Even so, the speaker’s competence is still needed to check statistically determined collocation candidates for their relevant semantic cohesion. The following case study aims to illustrate the potential of the corpus linguistic approach:


Case study 1: hart as collocator

The collocation dictionary Kollokationswörterbuch Adjektive mit ihren Begleitsubstantiven (Teubert, Kervio-Berthou and Windisch [in preparation]), which is currently being compiled at the Institut für Deutsche Sprache, is based on the IDS corpora of about 320 million words. The 400 adjectives were mainly selected from basic vocabulary lists. Candidates for collocations were combinations of adjectives and nouns showing a significantly higher frequency than the frequency expected from the occurrence of the individual words. The occurrences are ranked according to significance; their overall frequency thus has no primary influence on the ranking. The concept for the statistical procedures applied here was designed by Cyril Belica. It is up to the competent speaker to decide whether a sufficient lexical cohesion can be seen in the collocation candidates detected by the computer. Manually selected citations are provided in order to facilitate this interpretation. If a collocation candidate is translated into a foreign language as a whole instead of word by word, this can be seen as evidence of a distinct semantic cohesion; therefore, we have added the English translation equivalents to our German examples. The example below covers ranks 1–10 [for an explanation of the abbreviations see http://www.ids-mannheim.de/kt/cosmas.html]:

Kern (Rank: 1, Frequency: 63)
WKB In der Treuhand selbst hat sich ein harter Kern aus früheren SED-Betonköpfen eingegraben.
WKB Dennoch enthalte der Bericht einen “harten Kern an Wahrheit.”
H68 Die “Kommandoebene,” der harte Kern der RAF, umfaßt 25 bis 30 Mitglieder.
H87 Der “harte Kern” umfaßt 187 Personen.
H87 [...] ein sicherer Hinweis, daß sich die Betreffenden dem harten Kern der RAF angeschlossen haben.
H87 140 eingeschriebene Soulmänner kamen regelmäßig, ein harter Kern von 50 Jugendlichen fast täglich.
(Engl.: diehards / hard core)

Arbeit (Rank: 2, Frequency: 94)
WKD In harter Arbeit haben wir unseren Staat aufgebaut. (Überschrift)
WKD Aber wir haben eben in dieser harten Arbeit alle noch ein bißchen zu lernen.
H85 [...] Risikobereitschaft und harte Arbeit sollen sich in Malaysia wieder lohnen.
H86 Mangelnde persönliche Ausstrahlung machte er durch harte Arbeit, eiserne Disziplin und Willensstärke wett.
WKD Ein Sommer härtester Arbeit steht bevor.
H85 Die Technik macht es möglich, den Menschen von harter und übermäßiger Arbeit auch zeitlich zu entlasten.
(Engl.: hard work)

Währung (Rank: 3, Frequency: 40)
WKB Die Deutschen wurden nicht nur durch eine harte Währung vereint.
WKB Harte Währung soll mangelnden Geist wettmachen.
WKD Doch wundersam ist die Umwandlung der Ostmark in harte Währung allemal.
H87 Dann wäre es endgültig vorbei mit dem Glanz der einst härtesten Währung der Welt.
(Engl.: hard currency)

Schlag (Rank: 4, Frequency: 24)
BZK Das war für ihn ein harter Schlag.
MK1 Ich habe eine junge Mannschaft, die einen harten Schlag verkraften kann, ohne zu zerbrechen.
MK2 Es war ein harter, gezielter Schlag, der mich prompt von den Beinen holte.
(Engl.: heavy blow)

Drogen (Rank: 5, Frequency: 20)
H88 Außerdem sei ein immer stärker werdender Trend zu harten Drogen zu beobachten.
H87 Kontakt zu harten Drogen hatte der Jugendliche bald bekommen [...]
(Engl.: hard drugs)

Kritik (Rank: 6, Frequency: 34)
H86 Aber sie erfuhren schon damals von vielen Seiten harte Kritik.
MK2 Harte Kritik am Biedenkopf-Plan. (Überschrift)
H88 Zugleich übte er harte Kritik an der Landesregierung [...]
(Engl.: harsh criticism)

Bandagen (Rank: 7, Frequency: 12)
H86 Beide Seiten schlagen derweil mit harten Bandagen zu [...]
WKB Der Kampf um Berlin als Hauptstadt wird mit harten Bandagen geführt. (Überschrift)
(Engl.: taking one’s gloves off)

Kampf (Rank: 8, Frequency: 30)
MK1 Amerika müsse notfalls auf einen langen harten Kampf vorbereitet sein.
H86 Die meisten sehen zu, daß sie im harten Kampf um die Zehntel für sich das Beste rausholen.
BZK Verkaufsförderung gewinnt immer mehr Bedeutung im harten Kampf um die Gunst der Verbraucher.
WKD Für sie geht es jetzt nicht einfach um einen harten Kampf um Arbeitsplätze.
(Engl.: close fight)

D-Mark (Rank: 9, Frequency: 22)
WKB Dann bekämen die DDR-Bürger harte D-Mark in die Hand und würden drübenbleiben.
WKB Nichts hat Vormarsch und Endsieg der harten D-Mark aufhalten können.
WKD Die harte D-Mark dient als Schmiedehammer.
(Engl.: strong Deutschmark)

Worte (Rank: 10, Frequency: 25)
H85 Harte Worte - Berliner Verhältnisse?
H86 Selbst Außenminister Shultz benutzte harte Worte.
H85 [... der] erste Vorsitzende der Gesellschaft, findet nicht minder harte Worte, um den Bruch zu begründen [...]
(Engl.: bitter words)

Discourse and Meaning

One of corpus linguistics’ most essential tenets is the assumption that the meaning of text elements and segments can be found solely in discourse. This assumption makes sense if we call to mind that in principle, every word or combination of words was once a neologism. Neologisms are introduced to the discourse by explicitly assigning certain meanings to new expressions, that is, by paraphrasing what a new word is supposed to mean. As stated above, we can determine meaning in two ways: by paraphrase and by usage. Neologisms, however, still lack the usage. They only become used once other participants of the discourse start using them either by accepting the proposed meaning or by negotiating the meaning by offering a new paraphrase. This also applies to those cases where a new meaning is assigned to an already existing word.

It is obvious that we cannot go ‘back to the roots’ for all our established vocabulary; also, this is not how children learn the meaning of words. But even so, it is not simply the usage of words that leads to their meaning. In most cases, an act of explanation, very often by the parents, but sometimes also through picture-books, sets the starting point for language acquisition. Obviously, deictic references to reality (or images thereof) are of highest importance, but they are not understood without narrative explanations of words that describe what we have to watch out for in reality (or in images of reality). The meaning of school, for instance, cannot be explained by pictures of the building, classroom, teachers or pupils. In fact, only very few words relate to images unambiguously. Picture-book texts play a more important role with regard to the acquisition of word meanings than dictionaries.

Since the times of the German lexicographers Adelung and Campe, the basic principle of German lexicography has been the assumption that the meaning of words can be found in text samples, a basic principle also for corpus linguistics. Nevertheless, corpus linguistics differs from traditional lexicography in various details. Firstly, corpus linguistics does not use corpora merely for examples: it explores them systematically. Secondly, corpus linguistics does not try to decontextualise the objects it describes. In other words, it does not abstract the meaning from the context. Thirdly, corpus linguistics tries to capture different usages in their correlation to different contexts, unlike traditional lexicography, which tries to position word meanings upon a blueprint of a language-independent ontological concept (for instance, by genus proximum and differentia specifica). Fourthly, corpus linguistics is less interested in the single text element or word than in the semantic interaction between text elements and context.

The following case study of Globalisierung [globalisation] aims to demonstrate that it is indeed the discourse (or in other words: our corpus) where information about the meaning of words can be found. The reason why we all seem to know the meaning of Globalisierung as it is used currently is the fact that we all have read those texts that explain Globalisierung. We cannot depict Globalisierung, any more than we can point at it. In its current use, Globalisierung is certainly a neologism. It is characteristic of the introductory phase of new words that the first citations show a large number of paraphrases, a fact that demonstrates the role of the discourse participants in negotiating meaning.

Case study 2: Globalisierung

Globalisierung (Engl.: globalisation) as a non-lexicalised derivation has been part of our vocabulary for a long time. Its semantic vagueness is indicative of its non-lexicalised status. As nomen actionis or nomen resultativum, it has long been nothing more than the nominalisation of globalisieren. The presence of descriptive attributes is indicative of its lack of semantic specification; metalingual indicators (like paraphrases), on the other hand, are almost totally absent. The following examples were found in the German daily Tageszeitung:

Die Vorstellung [. . . ] der Globalisierung der Kleistschen Verzückung [. . . ] scheint mir denn doch eher märchenhaft. [14.10.89]

Aber die Globalisierung von Politik, Ökonomie und Technologie dulde keinen partikularen Bezugspunkte mehr [. . . ] [05.06.92]

Mit der Globalisierung der Lebensweise der modernen Zivilisation geht die Selbstaufhebung der [. . . ] Ideale und Grundüberzeugungen einher. [25.02.95]

As a neologism, Globalisierung manages to almost completely displace the original, non-lexicalised derivation only as late as 1996. Suddenly, there is a distinct rise in frequency: whereas we have only about 160 citations from 1988 to the end of 1995, there are about 320 citations for 1996 alone. Also, most citations come without descriptive attributes: apparently, it is no longer necessary to explain what is being globalised. Finally, many citations show metalingual indicators (printed in italics in the original) that demonstrate how the discourse participants take part in assigning a meaning to the word, as in the following examples:

Die “Globalisierung” – ein etwas unscharfer Begriff, mit dem zugleich die Ausweitung des Handels, die Liberalisierung der Finanzmärkte, der Sieg der Freiheitsideologie, die unkontrollierte Macht der multinationalen Unternehmen, die Internationalisierung des Arbeitsmarktes und die Umstrukturierung der Volkswirtschaften gemeint ist – hat die Gewerkschaften weiter geschwächt. [12.01.96]

Verbissener Konkurrenzkampf im Inneren und nach außen hin eine maximale Öffnung für Kapital, Güter und Dienstleistungen. So lautet eine der möglichen Definitionen der Globalisierung. [12.01.96]

[. . . ] die Globalisierung, das heißt die vollständige Liberalisierung aller Märkte auf der Welt [. . . ] [10.05.96]

Lisa Maza [. . . ] sieht die Globalisierung völlig anders: Sie sei eine Fortsetzung der Kolonialisierung mit anderen Mitteln – zum Nachteil des Südens, der Armen und der Frauen. [08.06.96]

Stichwort Globalisierung: In einer globalen Wirtschaft wird es auf Dauer kein geschütztes Umfeld für die Wirtschaft irgendeines Landes mehr geben. [27.07.96]

Globalisierung bedeutet auch die Europäisierung des Globus, Kolonialismus, ökonomischer und ökologischer Imperialismus. [04.05.96]

Denn in der Tat bedeutet Globalisierung Amerikanisierung, und zwar nicht nur der Weltwirtschaft, sondern auch eine normative Amerikanisierung. [11.10.96]

Das Stichwort “Maastricht” und das Modewort “Globalisierung” sind zu Synonymen für sozialen Rückschritt geworden. [18.10.96]

Typischerweise schweigen die Intellektuellen in Deutschland beharrlich zu Europa, Globalisierung und Zukunft der Arbeit [. . . ] [13.12.96]

This is a brief list of comparable English citations taken from the Bank of English and shortened:

What does globalisation mean? The term can happily accommodate all manner of things: expanding international trade, the growth of multinational business, the rise in international joint ventures and increasing interdependence through capital flows.

Globalisation: Low wages in other countries contribute to low wages in the United States.

Words like globalisation and outsourcing are now in common use.

Watkins sees globalisation as a euphemism for a race to maximise profit by lowering workers’ pay and condition.

As Mr. Keegan says, globalisation means that tax cuts for business are crucial.

Globalisation represents an attempt to exploit South Korea’s enormous potential.

But doesn’t globalisation mean world-wide sameness?

Globalisation is still more a philosophy than a business reality.

Globalisation comes in many flavours.

More so than other words, neologisms show that the meaning of words is to be found in the texts rather than in some discourse-external reality. The citations, whether in their virtual entirety within the universe of discourse or in some cross-section in a real corpus, are the meaning, and we may understand this meaning by interpreting the citations.

The formulation of a dictionary entry for globalisation, however, is the responsibility of lexicography, not of corpus linguistics. The main task of corpus linguistics, apart from finding the references, would instead be the correlation (by systematic context analysis) of the various sets of paraphrases and usage patterns to different parameters such as text type (newspaper), genre (politics/society), ideological stance and so on. Particularly in the area of ideologically controversial keywords, it seems as if a useful selection of citations can be more helpful to the user than traditional definitions.

Linguistic Knowledge and Encyclopaedic Knowledge

Corpus linguistics aims to analyse the meaning of words within texts, or rather, within their individual context. First and foremost, words are text elements, not lexicon or dictionary entries. Corpus linguistics is interested in text segments whose elements exhibit an inherent semantic cohesion which can be made visible through quantitative analyses of discourse or corpus (Biber, Conrad and Reppen 1998).

If the research focus is shifted from single words to text segments, the distinction between linguistic and encyclopaedic knowledge gradually becomes fuzzy. The word Machtergreifung (seizure of power), outside its context, may be described as an incident where a certain group, previously excluded from political influence, seizes the power by its own force and without democratic legitimation. However, we will interpret text segments such as braune Machtergreifung or die Machtergreifung im Jahre 1933 as referring to the ‘seizure of power by the Nazis’ without hesitation. Is this because these texts refer to an extralingual reality, to a language-independent knowledge? Although the majority of linguists would agree with this assumption, there may well be another, simpler, explanation: we have learned from a large number of citations that whenever braune Machtergreifung or Machtergreifung im Jahre 1933 is mentioned, this refers to the seizure of power by the Nazis and to nothing else. There is a co-occurrence between both expressions that may result, for instance, in an anaphoric situation: the expressions are paraphrases of each other.

In the tradition of German lexicography, linguistic knowledge is separated from encyclopaedic knowledge by the process of decontextualisation, by the endeavour to describe the meaning of words unadulterated by the context in which they occur. If we detach all references from their relevant context, the isolated meaning remains. The different events of Machtergreifung that are dealt with in texts are viewed as references to a discourse-external reality. Corpus linguistics, on the other hand, is interested above all in the meaning of textual segments displaying a distinct semantic cohesion. Machtergreifung im Jahre 1933 is such a segment, and by projecting it upon our discourse (i.e., linguistic) knowledge, we are able to interpret it as ‘Nazi seizure of power’ without problem. If we are no longer limited to single words detached from their contexts and if we do away with decontextualisation, we can give up the distinction between linguistic and encyclopaedic knowledge. For what we normally call encyclopaedic knowledge is in fact nothing but discourse knowledge. Everything we know and are able to know about the Nazi seizure of power is based on texts. Although some may even have witnessed one relevant incident or the other, their ability to interpret the whole course of events as Machtergreifung is also based on texts from other persons. If we reduce encyclopaedic knowledge to discourse knowledge, the distinction disappears.

Let us take a look at the example klassische Rollenverteilung (traditional role allocation) (Spiegel 13, 1999: 128):

Ein Zuhause wie ein Bilderbuchideal. Hier [. . . ] ist die klassische Rollenverteilung die Regel: Ein Elternteil kümmert sich um Haushalt und Kindererziehung, der andere verdient das Geld. Auch dieser traditionellen Familienvorstellung entspricht das Leben im Reihenhaus.

[A home like a picture-book cliché. Here [. . . ] the traditional role allocation is still the rule: one parent takes care of the household and of bringing up the children, the other parent earns the family income. Also living in a terraced house contributes to this traditional image of family.]


Within the context of family/home, the meaning of the collocation klassische Rollenverteilung in the above example corresponds exactly to the sentence that may serve as definition: Ein Elternteil. . . [One parent. . . ]. Note the subtly subversive touch that is present here, characteristic of so many Spiegel articles: what seems to be a generally acceptable definition actually shows an essential deviation from the traditional meaning of klassische Rollenverteilung, in that it does not distinguish between male and female.

The above example aptly illustrates challenges and achievements of corpus linguistics. Firstly, it is not interested in the meaning of isolated words outside their relevant contexts, but instead in the meaning of semantically connected text segments, extracted from discourse or, in practice, from the corpus. In the context of home and family, klassische Rollenverteilung can be interpreted in different ways with regard to period and genre. If the above Spiegel definition becomes the accepted thing, we may apply the term klassische Rollenverteilung even to gay or lesbian partnerships. For corpus linguistics, this implies a dynamic view of meaning. Every new reference may add to the meaning of a certain text segment; older meanings may fall into oblivion if they are not sanctioned by new evidence. The above example also shows that the ways in which meaning can be negotiated within the language community can be controversial indeed. It is not so long ago that lesbian partnership and family were two different meanings that could not be imagined, let alone used, synonymously. Corpus linguistics may thus serve as a useful instrument to detect changes of meaning that are essential to neology.

Secondly, corpus linguistics is developing devices for the identification and extraction of potentially metalinguistic elements of citations, that is, of text elements co-occurring with a paraphrase, thus enabling the automatic extraction, processing and presentation of semantically relevant material from corpora. Phrases such as something is the rule; x means y; this is to say; we understand it as; it can be said etc. point to metalingual content. If the meaning of a semantically controversial textual segment is negotiated, we often find indicators such as: some time ago; in fact; strictly speaking; without doubt; wrongly etc. These indicators can give us important clues. Above all, it must be realised that just as the meaning of a text segment is a paraphrase found in earlier citations, people’s interpretations are also paraphrases and therefore part of the discourse. In principle, the meaning of a text element or a text segment is everything that has been said about it, in terms of a paraphrase or as a matter of usage; it is the result of the negotiation of the meaning within the discourse community. Indeed this is the difference between natural language words and technical terms. Technical terms are defined by experts, and their meaning is restricted to that definition (and thus, is discourse-external). For instance, if a tree meets the criteria for elm-trees listed in the expert’s definition, it is rightly called an elm-tree no matter what the citations say. Any terminological definition is, at least in principle, an algorithmic instruction for the usage of the relevant term. This explains why it is possible to automatically translate technical texts, provided their terms are monosemous and they only use specialist vocabulary. Lexicographic definitions, on the other hand, are interpretations of citations, that is, results of intentional acts. They cannot automatically be processed from corpus citations, because every citation can be interpreted in various different ways. Therefore, an automatic translation of general language texts is not feasible.
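The search for metalingual indicators described in this paragraph can be illustrated with a minimal sketch. It assumes a plain list of English citations and a small, hand-written set of regular expressions modelled on the indicator phrases named above; the pattern lists and the function names are hypothetical and are not the devices actually used in the project. The sketch only flags candidate citations. Deciding whether a flagged citation really paraphrases the keyword remains an act of interpretation.

```python
import re

# Hypothetical indicator patterns, loosely modelled on the phrases named in the text.
PARAPHRASE_PATTERNS = [
    r"\bmeans?\b",
    r"\bis the rule\b",
    r"\bthis is to say\b",
    r"\bwe understand it as\b",
    r"\bit can be said\b",
]
NEGOTIATION_PATTERNS = [
    r"\bsome time ago\b",
    r"\bin fact\b",
    r"\bstrictly speaking\b",
    r"\bwithout doubt\b",
    r"\bwrongly\b",
]


def find_indicators(citation, patterns):
    """Return the indicator phrases (lower-cased) that occur in a citation."""
    found = []
    for pattern in patterns:
        for match in re.finditer(pattern, citation, re.IGNORECASE):
            found.append(match.group(0).lower())
    return found


def candidate_paraphrases(citations, keyword):
    """Keep citations that mention the keyword and contain a paraphrase indicator."""
    hits = []
    for citation in citations:
        if keyword.lower() not in citation.lower():
            continue
        indicators = find_indicators(citation, PARAPHRASE_PATTERNS)
        if indicators:
            hits.append((citation, indicators))
    return hits


# Three of the Bank of English citations quoted earlier in the article.
citations = [
    "But doesn't globalisation mean world-wide sameness?",
    "Globalisation comes in many flavours.",
    "As Mr. Keegan says, globalisation means that tax cuts for business are crucial.",
]
hits = candidate_paraphrases(citations, "globalisation")
for citation, indicators in hits:
    print(indicators, "->", citation)
```

The second pattern list can be passed to find_indicators in the same way to flag citations in which a meaning is being negotiated.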

Thirdly, corpus linguistics uses the context to distinguish between usages. For example, the collocation klassische Rollenverteilung is not only found in the family context but also at work or in society in general. Its meaning differs according to the context.

Fourthly, corpus linguistics is interested in larger units of meaning, namely, in text segments. The traditional lexicographic practice of decontextualisation and isolation of single words impedes us from knowing the meaning of larger units such as klassische Rollenverteilung. As a rule, the meaning of text segments such as multi-word units, collocations or set phrases is far more specific than that of single words. The reason why traditional linguistics focuses on the single word, isolated from its context, can only be explained by space constraints in the past, as it is impossible to list all collocations and set phrases even in a dictionary consisting of several volumes. But is klassische Rollenverteilung really a true collocation? Is corpus linguistics really able to provide a credible validation of semantic cohesion? Is the co-occurrence klassische Rollenverteilung more than a mere addition of klassisch and Rollenverteilung? In a sufficiently large corpus, if the frequency of klassische Rollenverteilung differs significantly from the statistically expected frequency of this combination, this can be seen as one sign for possible collocation. Another sign would be the occurrence of a special meaning that cannot be derived from the sum of the individual meanings of the text elements. For instance, if we find six tokens of klassische Rollenverteilung within the corpus although we would only expect three, given the frequency of the constituents, and if they all suggest that one parent is the wage-earner whereas the other is bringing up the children, then we may regard this co-occurrence as a collocation.
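The comparison of observed and expected frequency can be made concrete with a small worked sketch. The corpus size and the two constituent frequencies below are invented so that the expected value comes out at three, matching the example in the text; only the observed count of six is taken from it. The text does not name a significance test, so the Poisson tail probability used here is an assumption, chosen as one common way of judging such small counts.

```python
import math


def expected_cooccurrences(freq_a, freq_b, corpus_size):
    """Expected number of adjacent co-occurrences if the two words were independent."""
    return freq_a * freq_b / corpus_size


def poisson_tail(observed, expected):
    """P(X >= observed) for a Poisson-distributed count with the given mean."""
    below = sum(
        math.exp(-expected) * expected ** k / math.factorial(k)
        for k in range(observed)
    )
    return 1.0 - below


# Hypothetical figures: chosen only so that the expected count is 3.
corpus_size = 1_000_000          # tokens in the corpus (assumed)
freq_klassisch = 1_500           # tokens of klassisch in any form (assumed)
freq_rollenverteilung = 2_000    # tokens of Rollenverteilung (assumed)
observed = 6                     # tokens of klassische Rollenverteilung (from the text)

expected = expected_cooccurrences(freq_klassisch, freq_rollenverteilung, corpus_size)
ratio = observed / expected      # how many times more frequent than expected
pmi = math.log2(ratio)           # pointwise mutual information in bits
p_value = poisson_tail(observed, expected)

print(f"expected={expected:.1f} ratio={ratio:.1f} PMI={pmi:.1f} p={p_value:.3f}")
```

The statistical measure is only the first of the two signs named in the text; whether all six tokens share the special meaning still has to be checked by reading the citations.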

Finally, corpus linguistics considers meaning as a feature of language, of text elements, segments, and texts, and not as an external feature existing only in the human mind or in reality. The meaning of klassische Rollenverteilung in the context of family is represented in texts, and only there; it is not the reflection of a non-textual external reality that we could point our fingers at. There is no meaning outside language, outside the discourse. We know what globalisation means today, because we have read the texts that explain it, but we cannot see globalisation.

Multilingual Corpus Linguistics

When translating a text into another language, we paraphrase the source text. The translation represents the meaning of the original text just like a paraphrase within the source language. Translation requires understanding and thus intentionality. Only if we understand a text can we interpret or even paraphrase it. This implies that different translations will yield different versions of the same text, which again shows that translation or paraphrasing cannot be reduced to algorithmic procedures.

The universe of discourse, containing all texts ever translated along with their translations, is the empirical base for multilingual corpus linguistics. It is a virtual universe, and it can be realised by multilingual parallel corpora (or a collection of bilingual parallel corpora). Parallel corpora consist of source texts along with their translations into other languages, whereas reciprocal parallel corpora contain the source texts in two languages along with their translations into the target languages.

Just as in monolingual corpus linguistics, meaning is also seen as a strictly linguistic (or better, textual) term here. Meaning is paraphrase. The entire meaning of a text segment within a multilingual universe of discourse is enclosed in the history of all translation equivalents of the segment.

The translation unit, that is, the text segment completely represented by the translation equivalent, is the base unit of multilingual corpus semantics. Translation units, consisting of a single word or of several words, are the minimal units of translation. If they consist of several words, they are translated as a whole and not word by word. Therefore, translation equivalents correspond to the text segments of monolingual corpus linguistics.

Within the framework of multilingual corpus linguistics, we assume that the meaning of translation units is contained in their translation equivalents in other languages. This corresponds to the base assumption of corpus linguistics, which does not regard semantic cohesion as something fixed but as belonging to a large spectrum reaching from inalterable units to text segments whose elements can be varied, expanded or omitted. Identifying these translation units (or text segments) again involves interpretation. The translation shows us whether a given co-occurrence of words is a single translation equivalent or a combination of them, that is, merely a chain of text elements. This leads to two consequences. What can be seen as an integral translation equivalent in one target language may be a simple word-by-word translation in another. This may even be the case within a single target language, depending on the stylistic preferences of different translators. In fact, it is the community of translators (along with the translation critics) who in their daily practice decide what is the translation equivalent, just as the monolingual language community decides what is a text segment.

The definition of a translation unit therefore depends both on the target language and the common practice of translation. A virtual text segment is a translation unit only in respect to those languages into which it is translated as a whole. Translation units and their equivalents are not metaphysical entities; they are the contingent results of translation acts. According to the analysis of parallel corpora, more than half of the translation units are larger than the single word, which is another example of how corpus linguistics may help to investigate the nature of text segments.

The meaning of a translation unit is its paraphrase, that is, the translation equivalent in the target language. For ambiguous translation units, this implies that there are as many meanings to the unit as there are non-synonymous translation equivalents. If the phenomenon of meaning is thus operationalised, the meaning of a translation unit depends on the selected target language. A given translation unit in language A may have two non-synonymous equivalents in language B, but three non-synonymous equivalents in language C.

Let us look at an example. The English word sorrow (a translation unit consisting only of a single word) will usually be translated into French by one of the three equivalents chagrin, peine or tristesse; the first two, chagrin and peine, are obviously synonymous in a variety of contexts. They both point at a cause for this emotion and, therefore, are sometimes interchangeable with deuil (‘loss,’ the term for the cause). Tristesse, on the other hand, is the variety of sorrow which is not caused by a special incident. In German, there are also three standard equivalents for sorrow, namely, Trauer (caused by loss), Kummer (caused by an adverse incident, intense and usually limited in duration) and finally Gram (caused by unhappiness resulting from an incident, not very intense, more a disposition than a feeling, but often of long duration). Those three German equivalents are neither synonymous with nor corresponding to the three French equivalents. By the way, the different senses of sorrow usually found in English monolingual dictionaries and thesauri correspond to neither the French nor the German distinctions.
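The operational definition of meaning used here (as many meanings as there are non-synonymous translation equivalents) can be sketched as a simple grouping step. The sketch assumes that the synonymy judgements are supplied by a human reader; treating chagrin and peine as synonymous throughout is a simplification of the statement above that they are synonymous in a variety of contexts. The function name and the data layout are hypothetical.

```python
def count_meanings(equivalents, synonymous_pairs):
    """Group equivalents judged synonymous and count the resulting groups."""
    groups = {word: {word} for word in equivalents}
    for a, b in synonymous_pairs:
        merged = groups[a] | groups[b]
        for word in merged:
            groups[word] = merged
    distinct = {frozenset(group) for group in groups.values()}
    return len(distinct), sorted(sorted(group) for group in distinct)


# Equivalents of English "sorrow" as given in the text; the synonymy
# judgement for French is a human decision, simplified for this sketch.
french = count_meanings(["chagrin", "peine", "tristesse"], [("chagrin", "peine")])
german = count_meanings(["Trauer", "Kummer", "Gram"], [])

print("French:", french)
print("German:", german)
```

Under these judgements sorrow has two meanings relative to French and three relative to German, which is the situation described for languages B and C above.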

The above example of sorrow shows that the concept of synonymy cannot be expressed in an algorithm. To call two expressions synonymous requires a prior understanding of their meaning, that is, an act of interpretation. For instance, if we look at how the Greek verb proseuchomai in the first sentence of Plato’s Republic is translated into English, we will find six different equivalents in eight different translations of this book: to make my prayers, to say a prayer, to offer up my prayers, to worship, to pay my devoirs and to pay my devotions. We, as human beings, must decide whether we consider the Greek verb ambiguous or just fuzzy and whether the relevant equivalents can be seen as synonyms. This is something computers cannot do. The example also shows that the concept of synonymy can only be applied locally, referring to translation equivalents or text segments within a defined context. Although we may assume that Plato’s contemporary audience considered the verb proseuchomai as unambiguous within the above context, this is not the case with native speakers of English, where there is no synonymy between to make my prayers and to pay my devotions. It can be clearly seen that meaning has a dynamic quality and also that the act of translation requires intention and thus cannot be reduced to a mere procedure. We will never find the correct German equivalent for sorrow or the correct English equivalent for proseuchomai just by defining formal instructions for a machine. Before we can translate texts and their elements, we must understand them.


Multilingual Corpus Linguistics in Practice

Neither a lexicon derived from a bilingual dictionary nor the supposedly language-neutral conceptual ontologies applied within Artificial Intelligence will solve the problem of machine translation of general language texts. By now, this fact is acknowledged by the experts. Therefore, they focus on the machine translation of texts written in a controlled documentation language, which is a more or less formal language in which all technical terms are defined unambiguously along with a syntax that rejects all ambiguous expressions as non-grammatical.

General language texts written in natural languages cannot be translated without interpretation. Here, multilingual corpus linguistics steers clear of this obstacle in an elegant way. Unlike disciplines such as Artificial Intelligence and Machine Translation, which are based on cognitive linguistics, it does not try to model and emulate mental processes, but instead tries to support the translator by processing parallel corpora, which record the practice of previous human translation. In these corpora, those translation equivalents that are proven to be reliable and accepted will outweigh equivalents that have been dismissed as inadequate in the long run. If, for instance, proseuchomai is translated as to make my prayers three times out of eight, it may well be assumed that it is an accepted, albeit not the ideal, equivalent within the given context.
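Ranking translation equivalents by how often previous translators chose them can be sketched in a few lines. The list below assumes that each of the eight translations contributes one rendering of proseuchomai; only the figure of three out of eight for to make my prayers comes from the text, and the assignment of the other renderings to one translation each is an assumption made for illustration.

```python
from collections import Counter

# One rendering per translation (distribution partly assumed, see above).
renderings = [
    "to make my prayers",
    "to make my prayers",
    "to make my prayers",
    "to say a prayer",
    "to offer up my prayers",
    "to worship",
    "to pay my devoirs",
    "to pay my devotions",
]

counts = Counter(renderings)
ranked = [
    (equivalent, n, n / len(renderings))
    for equivalent, n in counts.most_common()
]

for equivalent, n, share in ranked:
    print(f"{equivalent:<24} {n}  {share:.3f}")
```

The ranking only records what translators have done. Whether the most frequent equivalent is also the best one in a given context remains a matter of interpretation.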

Parallel corpora are translation repositories. They link translation units with their equivalents. As first studies have shown (Steyer and Teubert 1998), we may assume that 90 percent of all translation units along with their relevant equivalents may be found in a carefully compiled corpus of about 20 million words per language, provided that the text to be translated is sufficiently close to the corpus with regard to text type and genre.

Multilingual corpus linguistics does not pretend to solve the problem of machine translation of general language. But it may help the human translator in finding a suitable equivalent for the unit to be translated more efficiently than traditional bilingual dictionaries, because it includes the context even in those cases where the translation equivalent is not a syntagmatically defined collocation but a certain textual element within a sequence. The goal is to select from among all given elements the one whose contextual profile is closest to that of the textual segment to be translated.


Case study 3: The translation into German of sorrow and grief

For the two words sorrow and grief, we find three common non-synonymous German translation equivalents: Trauer, Kummer and Gram. An analysis of the contexts of all references of these German words as found in the IDS corpora, based on a method designed by Cyril Belica (see http://www.ids-mannheim.de/cgi-bin/idsforms/cosmas-www-client), gives us the context profiles listed below. In our example, the number of neighbouring words (i.e. span) has been restricted to 5 words on each side. The context profiles given below have been slightly edited for the sake of clarity.

Context profile for Trauer: Wut, Angst, Betroffenheit, Schmerz, Tod, Bestürzung, Freude, Hoffnung, Verzweiflung, Scham; tragen, empfinden; tief, groß-

Context profile for Kummer: Sorgen, Schmerz, Leid, Seele, Freude, Stress, Ärger, Not; bereiten, machen, gewohnt/gewöhnt sein; viel, groß-

Context profile for Gram: Leid, Hass, Bitterkeit, Scham; sterben; gebeugt, lauter, voll-
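How a context profile with a span of 5 words on each side can be collected is shown in the following minimal sketch. It simply counts the words occurring within the span around each instance of the node word. This is a plain frequency count and not Belica's method, which is not described in detail here; the function name, the tokenisation and the toy text are assumptions made for illustration.

```python
import re
from collections import Counter


def context_profile(text, node, span=5, top=10):
    """Count the words found within `span` tokens on either side of `node`."""
    tokens = re.findall(r"\w+", text.lower())
    node = node.lower()
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        left = tokens[max(0, i - span):i]
        right = tokens[i + 1:i + 1 + span]
        # The node word itself is not counted as its own context.
        counts.update(word for word in left + right if word != node)
    return counts.most_common(top)


# Invented toy text, not taken from the IDS corpora.
toy_text = (
    "Trauer und Wut nach dem Unglück. "
    "Die Stadt trägt Trauer. "
    "Wut und Trauer prägen den Tag."
)
profile = context_profile(toy_text, "Trauer", span=5, top=20)
for word, n in profile:
    print(word, n)
```

In a real analysis the raw counts would still have to be filtered, for instance by removing function words such as und and by comparing each count with the frequency expected by chance, before they resemble the edited profiles listed above.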

In an English-German parallel corpus we would distinguish between three translations for sorrow and grief: the first group would contain those cases where sorrow or grief is translated by Trauer; the second group where it is translated by Kummer, and finally, the third group where it is translated by Gram. For each of the above cases, we could compute a context profile similar to the ones quoted above for the German words from the IDS corpus. We may assume that the context profile for sorrow and grief, as taken from the parallel corpus, in the case of the translation equivalent Kummer, will not differ much from the context profile for Kummer extracted from the German reference corpus, apart from it being in English instead of German.

Unfortunately, a sufficiently large English-German parallel corpus that would allow the extraction of English context profiles for German translation equivalents on the basis of recurrence is not yet available. As an alternative, I have searched the Bank of English for those instances of sorrow and grief whose contexts are similar to our context profiles for Trauer, Kummer and Gram. So far these results are not thoroughly convincing: one reason is the different composition of the IDS corpora compared to the Bank of English, which results in a clear imbalance of the German and English instances with regard to text type and genre; also, the search criteria for the English contexts have been too narrow, and last but not least, sorrow and grief along with their German counterparts Trauer, Kummer, and Gram belong to an area of vocabulary which is highly culture-specific and is almost impossible to reduce to a common denominator.

Still, the following instances taken from the Bank of English show that, in practice, the approach for the detection of equivalents outlined above will function to some extent. The words in square brackets are the German equivalents of the context words contained within the context profiles.

(1) Trauer
So on the night of the crucifixion I place Simon in the home in Bethany of Mary called Magdalene and her sister Maria. I envision a scene in which trauma, grief, anger [Wut], and despair [Bestürzung] were all present, to say nothing of fear [Angst].

(2) Kummer
She enjoys her job though it is full of stress [Stress], sorrow and never-ending challenges.

(3) Gram
The terrible affliction [Leid] that has fallen so suddenly upon our unhappy country fills and monopolises my thoughts. My soul is full of grief and bitterness [Bitterkeit] and hate [Hass] and vengeance.

Although matching the context of the element to be translated against the context profiles of all possible equivalents may suggest a method for the automatic selection of suitable equivalents, this only works in those cases where we have clear selection-relevant contextual information at our disposal. As stated above, this is not always the case, especially if the text element to be translated is referring to earlier instances within the same text. In these cases, we may assume that, provided the intratextual continuity is sufficiently high, the text element (sorrow or grief in our example) can always be translated by the same equivalent in the target language, be it Trauer, Kummer or Gram. In most cases, whenever a word with a fuzzy, strongly context-dependent meaning appears in a text for the first time, the information needed for the specification of its meaning will be found within the context. Later instances of the word within the text often tend to omit this information as redundant. Within a text, we must find one or two references where a suitable translation equivalent is indicated by the context profile and apply the result to the other instances. This shows that it is imperative to include only complete texts in the corpus.
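
The two steps just described, selecting an equivalent where the context is informative and carrying that choice over to the remaining instances in the same text, might be sketched as follows. Everything here is an assumption made for illustration: the toy profiles contain only the bracketed context words from citations (1) to (3), not the actual IDS context profiles; scoring is a plain overlap count; and the function names are hypothetical.

```python
from collections import Counter

# Toy profiles: English context words read off citations (1) to (3).
# They are a stand-in for real, corpus-derived context profiles.
PROFILES = {
    "Trauer": {"anger", "despair", "fear"},
    "Kummer": {"stress"},
    "Gram": {"affliction", "bitterness", "hate"},
}


def choose_equivalent(context_tokens, profiles=PROFILES):
    """Return the best-matching equivalent, or None if the context is unclear."""
    tokens = set(context_tokens)
    scores = {eq: len(tokens & words) for eq, words in profiles.items()}
    ranked = sorted(scores.values(), reverse=True)
    # No overlap at all, or a tie for first place: decline to decide.
    if ranked[0] == 0 or (len(ranked) > 1 and ranked[0] == ranked[1]):
        return None
    return max(scores, key=scores.get)


def translate_instances(instance_contexts, profiles=PROFILES):
    """Decide each instance, then fill undecided ones with the text-wide choice."""
    decisions = [choose_equivalent(ctx, profiles) for ctx in instance_contexts]
    informative = [d for d in decisions if d is not None]
    if not informative:
        return decisions  # nothing in this text indicates an equivalent
    text_choice = Counter(informative).most_common(1)[0][0]
    return [d if d is not None else text_choice for d in decisions]


# Three instances of 'grief' in one (invented) text; only the first
# context carries selection-relevant words.
contexts = [
    ["grief", "anger", "despair", "fear"],
    ["her", "grief", "returned"],
    ["grief", "remained"],
]
print(translate_instances(contexts))  # ['Trauer', 'Trauer', 'Trauer']
```

The propagation step only makes sense if all instances come from the same complete text, which is the reason given above for including only complete texts in the corpus.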

Future Prospects

Corpus linguistics sees itself not in opposition to but as a complement of traditional linguistics. Corpus linguistics helps to make us aware not only of the interaction between text element and context but also of text segments, that is, larger, flexible units whose elements are semantically linked in a certain way: multi-word units, collocations, set phrases. It explains the repeated co-occurrence of text elements as a discourse phenomenon that can be explored by statistical means, and it makes those co-occurrence patterns visible by a combination of quantitative and categorial devices.

The investigation of the context enables us to better cope with words displaying fuzzy meanings, words of the ‘Thespian vocabulary,’ as John Sinclair called them (Sinclair 1996), by generating context profiles as presented above on the basis of sufficiently large corpora. Combining these context profiles with citations that contain a paraphrase of the meaning or aspects thereof (cf. our case study of globalisation) may lead to descriptions of meaning that enable the user to participate in the discourse.

Corpus linguistics distinguishes between text segments on the one hand and text elements embedded in context on the other, depending on how they can be described. Context profiles are only statistically defined. Within a context profile, there is no such thing as an obligatory element that is indispensable within the context of a citation. The lexical constituents of text segments, however, can be defined either as indispensable or as optional. But there is still another difference between the text element with its context profile and the text segment: the latter is defined not only on a lexical but also on a syntactic level. The collocation Kummer gewöhnt ceases to be a collocation as soon as the verb gewöhnt sein is replaced by gewöhnen: Er hatte sich an seinen Kummer gewöhnt is not a collocation. The same applies to collocations such as geheimer Kummer, Kummer bereiten, Kummer und Sorgen. If we change the syntagma or even just the word order (for example, into Sorgen und Kummer), the words lose their collocation character.
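
The difference between an unordered context profile and a syntactically fixed text segment can be made concrete with a small sketch. It rests on assumptions: the token sequence is invented, and counting ordered word sequences is only one simple way of making word order visible, not the procedure used in this study.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count ordered sequences of n adjacent tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Invented toy text, tokenised by whitespace.
tokens = (
    "er hatte Kummer und Sorgen , viel Kummer und Sorgen , "
    "aber auch Sorgen und Kummer"
).split()

trigrams = ngram_counts(tokens, 3)
fixed_order = trigrams[("Kummer", "und", "Sorgen")]     # 2
reversed_order = trigrams[("Sorgen", "und", "Kummer")]  # 1

# Seen as an unordered bag of words, both orders are indistinguishable.
same_bag = Counter(["Kummer", "und", "Sorgen"]) == Counter(["Sorgen", "und", "Kummer"])

print(fixed_order, reversed_order, same_bag)  # 2 1 True
```

An order-sensitive count keeps Kummer und Sorgen apart from Sorgen und Kummer, whereas a purely statistical profile of co-occurring words treats the two as the same evidence.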

During the last decades, we have witnessed a growing interest in semantic cohesion, in the special semantic relations between words within sentences and phrases, even in traditional linguistics. Among the relatively new concepts are lexical solidarities, collocations, set phrases, valency, case roles, semantic frames and scripts. They all try to demonstrate that language is more than just the assembling of context-free words using semantics-free rules. The co-occurrence patterns developed by corpus linguistics may help to clarify heuristically the concept of text segments defined by semantic cohesion.

When it comes to the identification of text segments, multilingual corpus linguistics holds a privileged position. Within monolingual corpora, this identification is an arduous task that can only be turned into an automatic procedure by a painstaking combination of various procedures based on frequencies, lists or rules. The use of parallel corpora makes it easier to identify text segments (as translation units or equivalents), as they are the true practical results of interpretation and paraphrase. They show what usually takes place within the minds of the speakers without leaving any traces in texts. Parallel corpora, therefore, provide direct access to the translation practice of human translators. If we assume that we may find the meaning of a textual element through its paraphrase, which is also a text, then we may describe parallel corpora as repositories for such paraphrases. Obviously, dictionaries also attempt to list those paraphrases. However, since their size is limited, they need to decontextualise and isolate the lexical units, whereas the paraphrases of translators display the text elements embedded within their contexts, along with whole text segments. Parallel corpus evidence helps us to trace the phenomenon of semantic cohesion.

Meanwhile, with the availability of large corpora and improved software for their exploration, corpus linguistics has become part of general lexicography. Linguistics is gradually becoming more interested in larger units of meaning and the use of context for their definition. Also, it is generally accepted that the next generation of dictionaries, both monolingual and bilingual, needs to be corpus-validated, if not entirely corpus-based. But there is more to the corpus linguistic approach. By interactive procedures, the ambitious user should be able to have direct access to corpus evidence instead of being confronted with the subjective findings provided by lexicographers. Such a corpus platform would allow the members of the language community to participate in the social activity of negotiating meanings in a committed and informed way.

Notes

* This contribution is a revised version of my article ‘Korpuslinguistik und Lexikographie’ in Deutsche Sprache 4/99, pp. 292–313, translated into English by Norbert Volz.

1. The rules that the followers of a universal grammar hope to find in their quest for the language organ are not based on deductions by analogy. Whereas rules based on innateness had been the central factor in Chomskyan language theory until recently (cf. Steven Pinker in The Language Instinct [Pinker 1994]), Pinker now sees the language faculty as an interaction, not yet fully explored, between ‘distinct mental mechanisms’, namely, ‘symbolic computation’ [i.e., the algorithmic processing of uninterpreted symbols] as opposed to ‘memory’ [i.e., recollection], the latter being responsible for the assignment of form and meaning of symbols (Pinker 1999). The memory is seen as partly associative; an appropriate term for its description could be ‘connectionist network’. However, Pinker still sees ‘symbolic computation’ as a strictly rule-based process. We may assume that this tentative change in attitude towards the language faculty and the extent of its genetic embedding might be partly due to Terrence W. Deacon’s convincing explanation of first language acquisition, which does without any language organ (Deacon 1997).

References

Biber, Douglas; Conrad, Susan; Reppen, Randi. 1998. Corpus Linguistics. Investigating Language Structure and Use. Cambridge: Cambridge University Press.

Collins COBUILD. 1987. English Language Dictionary. Editor in Chief: John Sinclair.

Deacon, Terrence W. 1997. The Symbolic Species. New York: Norton.

Dennett, Daniel C. 1998. “Reflections on Language and Mind.” In: Peter Carruthers/Jill Boucher (Eds.): Language and Thought. Interdisciplinary Themes. Cambridge: Cambridge University Press, 284–294.

Devlin, Keith. 1997. Goodbye, Descartes. New York: Wiley.

Fodor, Jerry A. 1975. The Language of Thought. New York: Crowell.

Fodor, Jerry A. 1998. Concepts. Where Cognitive Science Went Wrong. Oxford: Clarendon Press.

Hellmann, Manfred W. 1992. Wörter und Wortgebrauch in Ost und West. Vol. 1–3. Tübingen: Narr.

Herberg, Dieter; Steffens, Doris; Tellenbach, Elke. 1997. Schlüsselwörter der Wendezeit. Wörter-Buch zum öffentlichen Sprachgebrauch 1989/90. Berlin: Walter de Gruyter.

Heringer, Hans Jürgen. 1999. Das höchste der Gefühle. Empirische Studien zur distributiven Semantik. Tübingen: Stauffenburg Verlag.

Jäger, Ludwig. 2000. “Die Sprachvergessenheit der Medientheorie. Ein Plädoyer für das Medium Sprache.” In: Werner Kallmeyer (Ed.): Sprache und neue Medien. Jahrbuch 1999 des Instituts für Deutsche Sprache. Berlin/New York: de Gruyter, 9–30.

Janik, Allan; Toulmin, Stephen. 1973. Wittgenstein’s Vienna. New York: Simon & Schuster.

Keller, Rudi. 1995. Zeichentheorie. Tübingen: Francke.

Kjellmer, Göran. 1994. A Dictionary of English Collocations. Based on the Brown Corpus. Oxford: Clarendon Press.

Lenz, Susanne. 2000. Studienbibliographie Korpuslinguistik. Heidelberg: Groos.

McEnery, Tony; Wilson, Andrew. 1996. Corpus Linguistics. Edinburgh: Edinburgh University Press.

Melby, Alan K. 1995. The Possibility of Language. A Discussion of the Nature of Language with Implications for Human and Machine Translation. Amsterdam: John Benjamins.

The Oxford-Hachette French Dictionary. 1994. French-English/English-French. Marie-Hélène Corréard, Valerie Grundy (Eds.). Oxford: Oxford University Press.

Pinker, Steven. 1994. The Language Instinct. New York: William Morrow.

Pinker, Steven. 1999. “Regular habits. How we learn language by mixing memory and rules.” In: Times Literary Supplement, October 29, 1999, 11–13.

Renouf, Antoinette (Ed.). 1998. Working with Corpora. Selected Papers from the 18th ICAME Conference. Amsterdam: Rodopi.

Le Robert & Collins. 1993. Dictionnaire Français–Anglais/Anglais–Français. 4th Edition. Editor in Chief: Beryl S. Atkins.

Searle, John R. 1992. The Rediscovery of the Mind. Cambridge, Mass.: The MIT Press.

Sinclair, John M. 1996. “The Empty Lexicon.” In: International Journal of Corpus Linguistics 1(1): 99–120.

Steyer, Kathrin; Teubert, Wolfgang. 1998. “Deutsch-Französische Übersetzungsplattform. Ansätze, Methoden, empirische Möglichkeiten.” In: Deutsche Sprache 4(97): 343–359.

Stubbs, Michael. 1996. Text and Corpus Analysis. Oxford: Blackwell.

Teubert, Wolfgang. 1999. In: Modelle der Übersetzung – Grundlagen der Methodik. Frankfurt/M.: Lang, 118–135.

Teubert, Wolfgang; Kervio-Berthou, Valerie; Windisch, Eric. To be published. Kollokationswörterbuch Adjektive und ihre Begleitsubstantive.

Wierzbicka, Anna. 1996. Semantics. Primes and Universals. Oxford: Oxford University Press.