15
Representativeness in Corpus Design DOUGLAS BIBER Department of English, Northern Arizona University Abstract The present paper addresses a number of issues related to achieving 'representativeness' in linguistic corpus design, including: discussion of what it means to 'represent' a lan- guage, definition of the target population, stratified versus proportional sampling of a language, sampling within texts, and issues relating to the required sample size (number of texts) of a corpus. The paper distinguishes among various ways that linguistic features can be distributed within and across texts; it analyses the distributions of several particular features, and it discusses the implications of these distribu- tions for corpus design. The paper argues that theoretical research should be prior in corpus design, to identify the situational parameters that distinguish among texts in a speech community, and to identify the types of linguistic features that will be analysed in the corpus. These theoretical considerations should be complemented by empirical investigations of linguistic variation in a pilot corpus of texts, as a basis for specific sampling decisions. The actual construction of a corpus would then proceed in cycles: the original design based on theoretical and pilot-study analyses, followed by collection of texts, followed by further empirical investigations of lin- guistic variation and revision of the design. 1. General Considerations Some of the first considerations in constructing a corpus concern the overall design: for example, the kinds of texts included, the number of texts, the selection of particular texts, the selection of text samples from with- in texts, and the length of text samples. Each of these involves a sampling decision, either conscious or not. The use of computer-based corpora provides a solid empirical foundation for general purpose language tools and descriptions, and enables analyses of a scope not otherwise possible. However, a corpus must be 'representative' in order to be appropriately used as the basis for generalizations concerning a language as a whole; for example, corpus-based dictionaries, gram- mars, and general part-of-speech taggers are applica- tions requiring a representative basis (cf. Biber, 1993b). Typically researchers focus on sample size as the most importantconsiderationinachievingrepresentativeness: how many texts must be included in the corpus, and how many words per text sample. Books on sampling theory, however, emphasize that sample size is not the most important consideration in selecting a representa- tive sample; rather, a thorough definition of the target population and decisions concerning the method of sampling are prior considerations. Representativeness refers to the extent to which a sample includes the full range of variability in a population. In corpus design, Correspondence: Douglas Biber, Department of English, North- ern Arizona University, PO Box 6032, Flagstaff, AZ 86011-6032, USA. E-mail:[email protected] Literary and Linguistic Computing, Vol. 8, No. 4, 1993 variability can be considered from situational and from linguistic perspectives, and both of these are important in determining representativeness. Thus a corpus design can be evaluated for the extent to which it includes: (1) the range of text types in a language, and (2) the range of linguistic distributions in a language. Any selection of texts is a sample. Whether or not a sample is 'representative', however, depends first of all on the extent to which it is selected from the range of text types in the target population; an assessment of this representativeness thus depends on a prior full definition of the 'population' that the sample is in- tended to represent, and the techniques used to select the sample from that population. Definition of the target population has at least two aspects: (1) the boundaries of the population—what texts are included and excluded from the population; (2) hierarchical organization within the population—what text cate- gories are included in the population, and what are their definitions. In designing text corpora, these concerns are often not given sufficient attention, and samples are collected without a prior definition of the target popula- tion. As a result, there is no possible way to evaluate the adequacy or representativeness of such a corpus (because there is no well-defined conception of what the sample is intended to represent). In addition, the representativeness of a corpus de- pends on the extent to which it includes the range of linguistic distributions in the population; i.e. different linguistic features are differently distributed (within texts, across texts, across text types), and a representa- tive corpus must enable analysis of these various distri- butions. This condition of linguistic representativeness depends on the first condition; i.e. if a corpus does not represent the range of text types in a population, it will not represent the range of linguistic distributions. In addition, linguistic representativeness depends on issues such as the number of words per text sample, the number of samples per 'text', and the number of texts per text type. These issues will be addressed in Sections 3 and 4. However, the issue of population definition is the first concern in corpus design. To illustrate, consider the population definitions underlying the Brown corpus (Francis and Kucera 1964/79) and the LOB corpus (Johansson et al., 1978). These target populations were defined both with respect to their boundaries (all published English texts printed in 1961, in the United States and United Kingdom respectively), and their hierarchical organizations (fifteen major text categories and numerous subgenre distinctions within these cate- gories). In constructing these corpora, the compilers also had good 'sampling frames', enabling probabilistic, random sampling of the population. A sampling frame © Oxford University Press 1993

Representativeness in Corpus Designotipl.philol.msu.ru/media/biber930.pdf · involves a sampling decision, either conscious or not. The use of computer-based corpora provides a solid

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Representativeness in Corpus Designotipl.philol.msu.ru/media/biber930.pdf · involves a sampling decision, either conscious or not. The use of computer-based corpora provides a solid

Representativeness in Corpus DesignDOUGLAS BIBERDepartment of English, Northern Arizona University

Abstract

The present paper addresses a number of issues related toachieving 'representativeness' in linguistic corpus design,including: discussion of what it means to 'represent' a lan-guage, definition of the target population, stratified versusproportional sampling of a language, sampling within texts,and issues relating to the required sample size (number oftexts) of a corpus. The paper distinguishes among variousways that linguistic features can be distributed within andacross texts; it analyses the distributions of several particularfeatures, and it discusses the implications of these distribu-tions for corpus design.

The paper argues that theoretical research should be priorin corpus design, to identify the situational parameters thatdistinguish among texts in a speech community, and toidentify the types of linguistic features that will be analysedin the corpus. These theoretical considerations should becomplemented by empirical investigations of linguisticvariation in a pilot corpus of texts, as a basis for specificsampling decisions. The actual construction of a corpuswould then proceed in cycles: the original design based ontheoretical and pilot-study analyses, followed by collectionof texts, followed by further empirical investigations of lin-guistic variation and revision of the design.

1. General Considerations

Some of the first considerations in constructing a corpusconcern the overall design: for example, the kinds oftexts included, the number of texts, the selection ofparticular texts, the selection of text samples from with-in texts, and the length of text samples. Each of theseinvolves a sampling decision, either conscious or not.

The use of computer-based corpora provides a solidempirical foundation for general purpose languagetools and descriptions, and enables analyses of a scopenot otherwise possible. However, a corpus must be'representative' in order to be appropriately used as thebasis for generalizations concerning a language as awhole; for example, corpus-based dictionaries, gram-mars, and general part-of-speech taggers are applica-tions requiring a representative basis (cf. Biber,1993b).

Typically researchers focus on sample size as the mostimportantconsiderationinachievingrepresentativeness:how many texts must be included in the corpus, andhow many words per text sample. Books on samplingtheory, however, emphasize that sample size is not themost important consideration in selecting a representa-tive sample; rather, a thorough definition of the targetpopulation and decisions concerning the method ofsampling are prior considerations. Representativenessrefers to the extent to which a sample includes the fullrange of variability in a population. In corpus design,

Correspondence: Douglas Biber, Department of English, North-ern Arizona University, PO Box 6032, Flagstaff, AZ 86011-6032,USA. E-mail:[email protected]

Literary and Linguistic Computing, Vol. 8, No. 4, 1993

variability can be considered from situational and fromlinguistic perspectives, and both of these are importantin determining representativeness. Thus a corpusdesign can be evaluated for the extent to which itincludes: (1) the range of text types in a language, and(2) the range of linguistic distributions in a language.

Any selection of texts is a sample. Whether or not asample is 'representative', however, depends first of allon the extent to which it is selected from the range oftext types in the target population; an assessment ofthis representativeness thus depends on a prior fulldefinition of the 'population' that the sample is in-tended to represent, and the techniques used to selectthe sample from that population. Definition of thetarget population has at least two aspects: (1) theboundaries of the population—what texts are includedand excluded from the population; (2) hierarchicalorganization within the population—what text cate-gories are included in the population, and what are theirdefinitions. In designing text corpora, these concerns areoften not given sufficient attention, and samples arecollected without a prior definition of the target popula-tion. As a result, there is no possible way to evaluatethe adequacy or representativeness of such a corpus(because there is no well-defined conception of whatthe sample is intended to represent).

In addition, the representativeness of a corpus de-pends on the extent to which it includes the range oflinguistic distributions in the population; i.e. differentlinguistic features are differently distributed (withintexts, across texts, across text types), and a representa-tive corpus must enable analysis of these various distri-butions. This condition of linguistic representativenessdepends on the first condition; i.e. if a corpus does notrepresent the range of text types in a population, it willnot represent the range of linguistic distributions. Inaddition, linguistic representativeness depends onissues such as the number of words per text sample, thenumber of samples per 'text', and the number of textsper text type. These issues will be addressed in Sections3 and 4.

However, the issue of population definition is thefirst concern in corpus design. To illustrate, considerthe population definitions underlying the Brown corpus(Francis and Kucera 1964/79) and the LOB corpus(Johansson et al., 1978). These target populationswere defined both with respect to their boundaries (allpublished English texts printed in 1961, in the UnitedStates and United Kingdom respectively), and theirhierarchical organizations (fifteen major text categoriesand numerous subgenre distinctions within these cate-gories). In constructing these corpora, the compilersalso had good 'sampling frames', enabling probabilistic,random sampling of the population. A sampling frame

© Oxford University Press 1993

Page 2: Representativeness in Corpus Designotipl.philol.msu.ru/media/biber930.pdf · involves a sampling decision, either conscious or not. The use of computer-based corpora provides a solid

is an operational definition of the population, an item-ized listing of population members from which a repre-sentative sample can be chosen. The LOB corpusmanual (Johansson etal., 1978) is fairly explicit aboutthe sampling frame used: for books, the target popula-tion was operationalized as all 1961 publications listedin The British National Bibliography Cumulated SubjectIndex, 1960-1964 (which is based on the subjectdivisions of the Dewey Decimal Classification system),and for periodicals and newspapers, the target popula-tion was operationalized as all 1961 publicationslisted in Willing's Press Guide (1961). In the case of theBrown corpus, the sampling frame was the collection ofbooks and periodicals in the Brown University Libraryand the Providence Athenaeum; this sampling frame isless representative of the total texts in print in 1961than the frames used for construction of the Lancaster-Oslo/Bergen (LOB) corpus, but it provided well-definedboundaries and an itemized listing of members. Inchoosing and evaluating a sampling frame, considera-tions of efficiency and cost effectiveness must bebalanced against higher degrees of representativeness.

Given an adequate sampling frame, it is possible toselect a probabilistic sample. There are several kinds ofprobabilistic samples, but they all rely on random selec-tion. In a simple random sampling, all texts in thepopulation have an equal chance of being selected. Forexample, if all entries in the British National Biblio-graphy were numbered sequentially, then a table ofrandom numbers could be used to select a randomsample of books. Another method of probabilistic sam-pling, which was apparently used in the construction ofthe Brown and LOB corpora, is 'stratified sampling'. Inthis method, subgroups are identified within the targetpopulation (in this case, the genres), and then each ofthose 'strata' are sampled using random techniques.This approach has the advantage of guaranteeing thatall strata are adequately represented while at the sametime selecting a non-biased sample within each stratum(i.e. in the case of the Brown and LOB corpora, therewas 100% representation at the level of genre cate-gories and an unbiased selection of texts within eachgenre).

Note that, for two reasons, a careful definition andanalysis of the non-linguistic characteristics of thetarget population is a crucial prerequisite to samplingdecisions. First, it is not possible to identify an adequatesampling frame or to evaluate the extent to which aparticular sample represents a population until thepopulation itself has been carefully defined. A goodillustration is a corpus intended to represent the spokentexts in a language. As there are no catalogues orbibliographies of spoken texts, and since we are allconstantly expanding the universe of spoken texts inour everyday conversations, identifying an adequatesampling frame in this case is difficult; but without aprior definition of the boundaries and parameters ofspeech within a language, evaluation of a given sampleis not possible.

The second motivation for a prior definition of thepopulation is that stratified samples are almost alwaysmore representative than non-stratified samples (andthey are never less representative). This is because

244

identified strata can be fully represented (100% sampl-ing) in the proportions desired, rather than dependingon random selection techniques. In statistical terms,the between-group variance is typically larger thanwithin-group variance, and thus a sample that forcesrepresentation across identifiable groups will be morerepresentative overall.1 Returning to the Brownand LOB corpora, a prior identification of the genrecategories (e.g. press reportage, academic prose, andmystery fiction) and subgenre categories (e.g. medicine,mathematics, and humanities within the genre of aca-demic prose) guaranteed 100% representation at thosetwo levels; i.e. the corpus builders attempted to com-pile an exhaustive listing of the major text categories ofpublished English prose, and all of these categorieswere included in the corpus design. Therefore, randomsampling techniques were required only to obtain arepresentative selection of texts from within each sub-genre. The alternative, a random selection from theuniverse of all published texts, would depend on a largesample and the probabilities associated with randomselection to assure representation of the range of varia-tion at all levels (across genres, subgenres, and textswithin subgenres), a more difficult task.

In the present paper, I will first consider issues relat-ing to population definitions for language corpora andattempt to develop a framework for stratified analysisof the corpus population (Section 2). In Section 3, then,I will return to particular sampling issues, includingproportional versus non-proportional sampling, sampl-ing within texts (how many words per text and stratifiedsampling within texts), and issues relating to samplesize. In Section 4, I will describe differences in thedistributions of linguistic features, presenting the distri-butions of several particular features, and I will discussthe implications of these distributions for corpus de-sign. Finally, in Section 5, I offer a brief overview ofcorpus design in practice.

2. Strata in a Text Corpus: an Operational Propo-sal Concerning the Salient Parameters of Registerand Dialect Variation

As noted in the last section, definition of the corpuspopulation requires specification of the boundaries andspecification of the strata. If we adopt the ambitiousgoal of representing a complete language, the popula-tion boundaries can be specified as all of the texts in thelanguage. Specifying the relevant strata and identifyingsampling frames are obviously more difficult tasks,requiring a theoretically motivated and complete speci-fication of the kinds of texts. In the present section Ioffer a preliminary proposal for identifying the stratafor such a corpus and operationalizing them as sampl-ing frames. The proposal is restricted to westernsocieties (with examples from the United States), and isintended primarily as an illustration rather than a finalsolution, showing how a corpus of this kind could bedesigned.

I use the terms genre or register to refer to situation-ally defined text categories (such as fiction, sportsbroadcasts, psychology articles), and text type to refer

Literary and Linguistic Computing, Vol. 8, No. 4, 1993

Page 3: Representativeness in Corpus Designotipl.philol.msu.ru/media/biber930.pdf · involves a sampling decision, either conscious or not. The use of computer-based corpora provides a solid

to linguistically defined text categories. Both of thesetext classification systems are valid, but they havedifferent bases. Although registers/genres are notdefined on linguistic grounds, there are statistically im-portant linguistic differences among these categories(Biber, 1986, 1988), and linguistic feature counts arerelatively stable across texts within a register (Biber,1990). In contrast, text types are identifed on the basisof shared linguistic co-occurrence patterns, so that thetexts within each type are maximally similar in theirlinguistic characteristics, while the different types aremaximally distinct from one another (Biber, 1989).

In defining the population for a corpus, register/genre distinctions take precedence over text type dis-tinctions. This is because registers are based on criteriaexternal to the corpus, while text types are based oninternal criteria; i.e. registers are based on the differentsituations, purposes, and functions of text in a speechcommunity, and these can be identifed prior to theconstruction of a corpus. In contrast, identification ofthe salient text type distinctions in a language requires arepresentative corpus of texts for analysis; there is no apriori way to identify linguistically defined types. As Ishow in Section 4, though, the results of previous stu-dies, as well as on-going research during the construc-tion of a corpus, can be used to assure that the selectionof texts is linguistically as well as situationally represen-tative.

For the most part, corpus linguistics has concentratedon register differences. In planning the design of acorpus, however, decisions must be made whether toinclude a representative range of dialects or to restrictthe corpus to a single dialect (e.g. a 'standard' variety).Dialect parameters specify the demographic character-istics of the speakers and writers, including geographicregion, age, sex, social class, ethnic group, education,and occupation.3

Different overall corpus designs represent differentpopulations and meet different research purposes.Three of the possible overall designs are organizedaround text production, text reception, and texts asproducts. The first two of these are demographicallyorganized at the top level; i.e. individuals are selectedfrom a larger demographic population, and then theseindividuals are tracked to record their language use. Aproduction design would include the texts (spoken andwritten) actually produced by the individuals in thesample; a reception design would include the textslistened to or read. These two approaches wouldaddress the question of what people actually do withlanguage on a regular basis. The demographic selectioncould be stratified along the lines of occupation, sex,age, etc.

A demographically oriented corpus would not repre-sent the range of text types in a language, since manykinds of language are rarely used, even though they areimportant on other grounds. For example, few indi-viduals will ever write a law or treaty, an insurancecontract, or a book of any kind, and some of thesekinds of texts are also rarely read. It would thus bedifficult to stratify a demographic corpus in such a waythat it would insure representativeness of the rangeof text categories. Many of these categories are very

Literary and Linguistic Computing, Vol. 8, No. 4, 1993

important, however, in defining a culture. A corpusorganized around texts as products would be designedto represent the range of registers and text types ratherthan the typical patterns of use of various demographicgroups.

Work on the parameters of register variation hasbeen carried out by anthropological linguists such asHymes and Duranti, and by functional linguists such asHalliday (see Hymes 1974; Brown and Fraser, 1979;Duranti, 1985; Halliday and Hasan, 1989). In Biber(1993a), I attempt to develop a relatively completeframework, arguing that 'register' should be specifiedas a continuous (rather than discrete) notion, and dis-tinguishing among the range of situational differencesthat have been considered in register studies. Thisframework is overspecified for corpus design work—values on some parameters are entailed by values onother parameters, and some parameters are specific torestricted kinds of texts. Attempting to sample at thislevel of specificity would thus be extremely difficult.For this reason I propose in Table 1 a reduced set ofsampling strata, balancing operational feasibility withthe desire to define the target population as completelyas possible.

Table 1 Situational parameters listed as hierarchical samplingstrata

1. Primary channel. Written/spoken/scripted speech2. Format. Published/not published (+ various formats within

'published')3. Setting. Institutional/other public/private-personal4. Addressee.

(a) Plurality. Unenumerated/plural/individual/self(b) Presence (place and time). Present/absent(c) Interactiveness. None/little/extensive(d) Shared knowledge. General/specialized/personal

5. Addressor.(a) Demographic variation. Sex, age, occupation, etc.(b) Acknowledgement. Acknowledged individual/institution

6. Factuatity. Factual-informational/intermediate orindeterminate/imaginative

7. Purposes. Persuade, entertain, edify, inform, instruct,explain, narrate, describe, keep records, reveal self, expressattitudes, opinions, or emotions, enhance interpersonalrelationship,. ..

8. Topics....

The first of the above parameters divides the corpusinto three major components: writing, speech, andscripted speech. Each of these three requires differentsampling considerations, and thus not all subsequentsituational parameters are relevant for each compo-nent.

Within writing, the first important distinction ispublication.4 This is because the population of pub-lished texts can be operationally bounded, and variouscatalogues and indexes provide itemized listings ofmembers. For example, the following criteria might beused for the operational definition of 'published' texts:(1) they are printed in multiple copies for distribution;(2) they are copyright registered or recorded by a majorindexing service. In the United States, a record of allcopyright registered books and periodicals is availableat the Library of Congress. Other 'published' texts that

245

Page 4: Representativeness in Corpus Designotipl.philol.msu.ru/media/biber930.pdf · involves a sampling decision, either conscious or not. The use of computer-based corpora provides a solid

are not copyright registered include government reportsand documents, legal reports and documents, certainmagazines and newspapers, and some dissertations; inthe United States, these are indexed in sources such asthe Monthly Catalog of US Government Publications,Index to US Government Periodicals, a whole system oflegal reports (e.g. the Pacific Reporter, the SupremeCourt Reports), periodical indexes (e.g. Readers'Guide to Periodical Literature, Newsbank), and disser-tation abstracts (indexed by University MicrofilmsInternational).

A third stratum for written published texts could thusbe these 'formats' represented by the various cata-loguing and indexing systems. Together these indexesprovide an itemized listing of published writing, andthey could therefore be used as an adequate samplingframe. With a large enough sample (see following sec-tion), such a sampling frame would help achieve 'repre-sentativeness' of the various kinds of published writing.However, we know on theoretical grounds that thereare several important substrata within published writ-ing (e.g. purposes and different subject areas), and it isthus better to additionally specify these in the corpusdesign. This approach is more conservative, in that itinsures representativeness in the desired proportionsfor each of these text categories, and at the same time itenables smaller sample sizes (since random techniquesrequire larger samples than stratified techniques).

Setting and format are parallel second-level strata:format is important for the sampling of published writ-ing; setting can be used in a similar way to providesampling frames for unpublished writing, speech, andscripted speech. Three types of setting are distin-guished here: institutional, other public, and private-personal. These settings are less adequate as samplingframes than publication catalogues—they do not pro-vide well defined boundaries for unpublished writing orspeech, and they do not provide an exhaustive listing oftexts within these categories. The problem is that nodirect sampling frame exists for unpublished writing orspeech. Setting, however, can be used indirectly tosample these target populations, by using three separatesubcategories: institutions (offices, factories, businesses,schools, churches, hospitals, etc.), private settings(homes), and other public settings (shopping areas,recreation centres, etc.). (For scripted speech, thecategory of other public settings would include speechon various public media, such as news broadcasts,scripted speeches, and scripted dialogue on televisionsitcoms and dramas.) Operational sampling frames foreach of these settings might be defined from variousgovernment and official records (e.g. census records,tax returns, or other registrations). The goal of suchsampling frames would be to provide an itemized listingof the members within each setting type, so that arandom sample of institutions, homes, and other publicplaces could be selected. (These three settings could befurther stratified with respect to the various types ofinstitution, types of home, etc.)

To this point, then, I have proposed the samplingframes shown in Table 2.

Before proceeding, it is necessary to distinguish be-tween two types of sampling strata. The first, as above,

246

actually defines a sampling frame, specifying theboundaries of the operationalized population and pro-viding an itemized listing of members. The second, asin the remaining parameters of Table 1, identifies cate-gories that must be represented in a corpus but do notprovide well-defined sampling frames. For example,Addressee plurality (no.4a: /unenumerated/plural/individual/self) provides no listing of the texts withthese four types of addressee; rather, it simply specifiesthat texts should be collected until these four categoriesare adequately represented.

Further, the remaining parameters in Table 1 are notequally relevant for all the major sampling frames listedin Table 2. Consider, for example, the parameterslisted under 'Addressee'. Published writing always hasunenumerated addressees5, is always written for non-present addressees, and is almost always non-interactive(except for published exchanges of opinion). It canrequire either general or specialized background knowl-edge (e.g. popular magazines versus academic journals)but rarely requires personal background knowledge(although this is needed for a full understanding ofmemoirs, published letters, diaries, and even somenovels and short stories). Unpublished writing, on theother hand, can fall into all of these addressee cate-gories. The addressees can be unenumerated (e.g.advertisements, merchandising catalogues, governmentforms or announcements), plural (office circular ormemo, business or technical report), individual (memoto an individual, professional or personal letter, e-mailmessage), or self (diary, notes, shopping list). Theaddressee of unpublished texts is usually absent, exceptin writing to oneself. Unpublished writing can be inter-active (e.g. letters) or not. Finally, unpublished writingcan require only general background knowledge (e.g.some advertisements), specialized knowledge (e.g.technical reports), or personal knowledge (e.g. lettersand diaries).

Speech is typically directed to a plural or individualaddressee, who is present. Speech addressed to self isoften considered strange. Speech can be directed tounenumerated, absent addressees through mass media(e.g. a televised interview). Individual and small-groupaddressees can also be absent, as in the case of telephoneconversations and 'conference calls'. (Individualaddressees can even be non-interactive in the case oftalking to an answering machine.) Private settingsfavour interactive addressees (either individual or smallgroup conversations) while both interactive and non-interactive addressees can be found in institutionalsettings (e.g. consider the various kinds of lectures,sermons, and business presentations). General knowl-edge can be required in all kinds of conversation; spe-cialized background knowledge is mostly required ofaddressees in institutional settings; personal knowledgeis most needed in private settings.

Table 2 Outline of sampling frames

Writing (published). Books/periodicals/etc.(based on available indexes)

Writing (unpublished). Institutional/public/privateSpeech. Institutional/public/privateScripted speech. Institutional/public media/other

Literary and Linguistic Computing, Vol. 8, No. 4, 1993

Page 5: Representativeness in Corpus Designotipl.philol.msu.ru/media/biber930.pdf · involves a sampling decision, either conscious or not. The use of computer-based corpora provides a solid

Scripted speech is typically directed towards pluraladdressees (small groups in institutional settings andunenumerated audiences for mass media). Dialogue inplays and televised dramatic shows are examples ofscripted speech that is directed to an individual butheard by an unenumerated audience. Addressees aretypically present for scripted speech in institutional set-tings but are not present (physically or temporally) forscripted speech projected over mass media. Except forthe lecturer who allows questions during a writtenspeech, scripted speech is generally not interactive.Finally, scripted speech can require either general orspecialized background knowledge on the part of theaddressee, but it rarely requires personal backgroundknowledge.

Addressors can vary along a number of demographicparameters (the dialect characteristics mentioned above),and decisions must be made concerning the representa-tion of these parameters in the corpus. (Collection oftexts from some addressor categories will be difficult forsome sampling frames; e.g. there are relatively few pub-lished written texts by working class writers.) The secondparameter here, whether the addressor is acknowledgedor not, is relevant only for written texts: some writtentexts do not have an acknowledged personal author (e.g.advertisements, catalogues, laws and treaties, govern-ment forms, business contracts), while the more typicalkinds of writing have a specific author(s).

Factuality is similar to assessments of backgroundknowledge in that it is sometimes difficult to measurereliably, but this is an important parameter distin-guishing among texts within both writing and speech.At one pole are scientific reports and lectures, whichpurport to be factual, and at the other are the variouskinds of imaginative stories. In between these poles area continuum of texts with different bases in fact, rangingover speculation, opinion, historical fiction, gossip, etc.

The parameter of purpose requires further research,both theoretical (as the basis for corpus design) andempirical (using the resources of large corpora). I in-clude in Table 1 several of the purposes that should berepresented in a corpus, but this is not intended as anexhaustive listing.

Similarly the parameter of topic requires furthertheoretical and empirical research. Library classifica-tion systems are well developed and provide adequatetopic strata for published written texts. These sameclassifications might also serve as strata for unpublishedwriting, but they would need to be tested empirically.For spoken texts, especially in private settings, furtherresearch on the range of typical topics is required.

The spirit of the proposal outlined in this section isto show how basic situational parameters can be usedas sampling strata to provide an important first steptowards achieving representativeness. The particularparameter values used, however, must be refined, andthe framework proposed here is clearly not the finalword on corpus sampling strata.

3. Other Sampling Issues

3.1 Proportional SamplingIn most stratified sample designs, the selection of

Literary and Linguistic Computing, Vol. 8, No. 4, 1993

observations across strata must be proportional inorder to be considered representative (Williams, 1978;Henry, 1990); i.e. the number of observations in eachstratum should be proportional to their numbers in thelarger population. For example, a survey of citizens inNorth Carolina (reported in Henry, 1990, pp.61-66)used two strata, each based on a government listingof adults: households that filed 1975 income tax re-turns, and households that were eligible for Medicaidassistance. These two lists together accounted for anestimated 96% of the population. In the selection ofobservations, though, these lists were sampledproportionately—89% from the tax list and 11% fromthe Medicaid list—to maintain the relative proportionsof these two strata in the larger population. The result-ing sample can thus be claimed to represent the adultpopulation of North Carolina. Representativeness inthis case means providing the basis for accurate descrip-tive statistics of the entire population (e.g. averageincome, education, etc.).

Demographic samples are representative to theextent that they reflect the relative proportions of stratain a population. This notion of representativeness hasbeen developed within sociological research, where re-searchers aim to determine descriptive statistics thatcharacterize the overall population (such as the popula-tion mean and population standard deviation). Anysingle statistic that characterizes an entire populationcrucially depends on a proportional sampling of stratawithin the population—if a strata which makes up asmall proportion of the population is sampled heavily,then it will contribute an unrepresentative weight tosummary descriptive statistics.

Language corpora require a different notion of repre-sentativeness, making proportional sampling inappro-priate in this case. A proportional language corpus wouldhave to be demographically organized (as discussed atthe beginning of Section 3.2), because we have noa priori way to determine the relative proportions ofdifferent registers in a language. In fact, a simpledemogaphically based sample of language use would beproportional by definition—the resulting corpus wouldcontain the registers that people typically use in theactual proportions in which they are used. A corpuswith this design might contain roughly 90% conversa-tion and 3% letters and notes, with the remaining 7%divided among registers such as press reportage, popu-lar magazines, academic prose, fiction, lectures, newsbroadcasts, and unpublished writing. (Very few peopleever produce published written texts, or unpublishedwritten and spoken texts for a large audience.) Such acorpus would permit summary descriptive statistics forthe entire language represented by the corpus. Thesekinds of generalizations, however, are typically not ofinterest for linguistic research. Rather, researchers re-quire language samples that are representative in thesense that they include the full range of linguistic varia-tion existing in a language.

In summary, there are two main problems with pro-portional language corpora. First, proportional samplesare representative only in that they accurately reflectthe relative numerical frequencies of registers in alanguage—they provide no representation of relative

247

Page 6: Representativeness in Corpus Designotipl.philol.msu.ru/media/biber930.pdf · involves a sampling decision, either conscious or not. The use of computer-based corpora provides a solid

importance that is not numerical. Registers such asbooks, newspapers, and news broadcasts are muchmore influential than their relative frequencies indi-cate. Secondly, proportional corpora do not provide anadequate basis for linguistic analyses, in which therange of linguistic features found in different text typesis of primary interest. For example, it is not necessaryto have a corpus to find out that 90% of the texts in alanguage are linguistically similar (because they are allconversations); rather, we want to analyse the linguisticcharacteristics of the other 10% of the texts, since theyrepresent the large majority of the kinds of registersand linguistic distributions in a language.6

3.2 Sample SizeThere are many equations for determining sample size,based on the properties of the normal distribution andthe sampling distribution of the mean (or the samplingdistribution of the standard deviation). One of the mostimportant equations states that the standard error ofthe mean for some variable (s^) is equal to the standarddeviation of that variable (5) divided by the square rootof the sample size (n'/2), i.e.

sj, = slnh

The standard error of the mean indicates how far asample mean can be from the true population mean. Ifthe sample size is greater than 30, then the distributionof sample means has a roughly normal distribution, sothat 95% of the samples taken from a population willhave means that fall in the interval of plus or minus 1.96times the standard error. The smaller this interval is,the more confidence a researcher can have that she isaccurately representing the population mean. Asshown by the equation for the standard error, this con-fidence interval depends on the natural variation of thepopulation (estimated by the sample standard devia-tion) and the sample size («). The influence of samplesize in this equation is constant, regardless of the size ofthe standard deviation (i.e. the standard error is a func-tion of one divided by the square-root of n). To reducethe standard error (and thus narrow the confidenceinterval) by half, it is necessary to increase the samplesize by four times.

For example, if the sample standard deviation for thenumber of nouns in a text was 30, the sample meanscore was 100, and the sample size was nine texts, thenthe standard error would be equal to 10:

Standard error = 30/V(9) = 30/3 = 10

This value indicates that there is a 95% probability thatthe true population mean for the number of nouns pertext falls within the range of 80.4 to 119.6 (i.e. thesample mean of 100 ± 1.96 times the standard error of10). To reduce this confidence interval by cutting thestandard error in half, the sample size must be in-creased four times to 36 texts; i.e.

Standard error = 30/V(36) = 30/6 = 5

Similarly, if the original sample was 25 texts, then wewould need to increase the sample to 100 texts in orderto cut the standard error in half, i.e.

248

Standard error = 30/V(25) = 30/5 = 6

Standard error = 30/V(100) = 30/10 = 3

Unfortunately there are certain difficulties in usingthe equation for the standard error to determine therequired sample size of a corpus. In particular, it isnecessary to address three problems:

(1) The sample size (n) depends on a prior deter-mination of the tolerable confidence interval re-quired for the corpus; i.e. there needs to be anapriori estimate of the amount of uncertaintythat can be tolerated in typical analyses based onthe corpus.

(2) The equation depends on the sample standarddeviation, but this is the standard deviation forsome particular variable. Different variables canhave different standard deviations, resulting indifferent estimates of the required sample size.

(3) The equation must be used in a circular fashion;i.e. it is necessary to have selected a sample andcomputed the sample standard deviation beforethe equation can be used (and this is based on theassumption that the pilot sample is at least some-what representative)—but the purpose of theequation is to determine the required samplesize.

In Section 4, I consider the distribution of severallinguistic features and address these three problems,making preliminary proposals regarding sample size.

3.3 A Note on Sampling Within 'Texts'To this point I have not yet addressed the issue of howlong text samples need to be. I will consider this ques-tion in more detail in Section 4, discussing the distribu-tion of various linguistic features within texts. Here,though, I want to point out that the preference forstratified sampling applies to sampling within texts aswell as across texts. Corpus compilers have typicallytried to achieve better representation of texts by simplytaking more words from the texts. However, thesewords are certainly not selected randomly (i.e. they aresequential), and the adequacy of representation thusdepends on the sample length relative to the total textlength. Instead it is possible to use a stratified approachfor the selection of text samples from texts; i.e. espe-cially in the case of written texts and planned spokentexts, the selection of text samples can use the typicalsubcomponents of texts in that register as samplingstrata (e.g. chapters, sections, possibly main points in alecture or sermon). This approach will result in betterrepresentation of the overall text, regardless of thetotal number of words selected from each text.

4. Distributions of Linguistic Features: PreliminaryRecommendations Concerning Sample Size

4.1 Distributions Within Texts: Length of Text SamplesIn this section I consider first the distribution of linguis-tic features within texts, as a basis for addressing theissue of optimal text length. Traditional samplingtheory is less useful here than for the other aspectsof corpus design, because individual words cannot be

Literary and Linguistic Computing, Vol. 8, No. 4, 1993

Page 7: Representativeness in Corpus Designotipl.philol.msu.ru/media/biber930.pdf · involves a sampling decision, either conscious or not. The use of computer-based corpora provides a solid

treated as separate observations in linguistic analyses;i.e. since linguistic features commonly extend overmore than one word, any random selection of wordsfrom a text would fail to represent many features andwould destroy the overall structure of the text. Themain issue here is thus the number of contiguous wordsrequired in text samples. The present section illustrateshow this issue can be addressed through empirical in-vestigations of the distribution of linguistic featureswithin texts.

In Biber (1990) I approach this problem by compar-ing pairs of 1,000-word samples taken from single textsin the LOB and London-Lund corpora. (Text samplesare 2,000 words in the LOB corpus and 5,000 words inthe London-Lund corpus.) If large differences arefound between the two 1,000-word samples, then we canconclude that this sample length does not adequatelyrepresent the overall linguistic characteristics of a text,and that perhaps much larger samples are required. If,on the other hand, the two 1,000-word text samples aresimilar linguistically, then we can conclude that rela-tively small samples from texts adequately representtheir linguistic characteristics.

In the case of written texts (from the LOB corpus), Idivided each original text in half and compared the twoparts. In the case of spoken texts (from the London-Lund corpus), four 1,000-word samples were extractedfrom each original text, and these were then comparedpairwise.

To provide a relatively broad database, ten linguisticfeatures commonly used in variation studies wereanalysed. These features were chosen from differentfunctional and grammatical classes, since each classpotentially represents a different statistical distributionacross text categories (see Biber, 1988). The featuresare: first person pronouns, third person pronouns, con-tractions, past tense verbs, present tense verbs, pre-positions, passive constructions (combining by-passivesand agentless passives), WH relative clauses, and con-ditional subordinate clauses. Pronouns and contractionsare relatively interactive and colloquial in communica-tive function; nouns and prepositions are used for in-tegrating information into texts; relative clauses andconditional subordination represent types of structuralelaboration; and passives are characteristic of scientificor technical styles. These features were also chosen torepresent a wide range of frequency distributions intexts, as shown in Table 3, which presents their fre-quencies (per 1,000 words) in a corpus of 481 spokenand written texts (taken from Biber, 1988, pp.77-78).The ten features differ considerably in both their over-all average frequency of occurrence and in their rangeof variation. Nouns and prepositions are extremelycommon; present tense markers are quite common;past tense, first person pronouns, and third personpronouns are all relatively common; contractions andpassives are relatively rare; and WH relative clausesand conditional subordinators are quite rare. (In addi-tion, these features are differentially distributed acrossdifferent kinds of texts; see Biber, 1988, pp.246-269.)Comparison of these ten features across the 1,000-wordtext pairs thus represents several of the kinds of distri-butional patterns found in English.

Literary and Linguistic Computing, Vol. 8, No. 4, 1993

Table 3 Descriptive statistics for frequency scores (per 1,000words) of ten linguistic features in a corpus of 481 texts taken from23 spoken and written text genres

Linguistic feature

NounsPrepositionsPresent tensePast tenseThird person pronounsFirst person pronounsContractionsPassivesWH relative clausesConditional subordination

Mean

1811117840302714103.52.5

Min.

8450120000000

Max.

29820918211912412289442613

Range

21415917011912412289442613

The distributions of these linguistic features wereanalysed in 110 1,000-word text samples (i.e. fifty-fivepairs of samples), taken from seven text categories:conversations, broadcasts, speeches, official docu-ments, academic prose, general fiction, and romancefiction. These categories represent a range of communi-cative situations in English, differing in purpose, topic,informational focus, mode, interactiveness, formality,and production circumstances; again, the goal was torepresent a broad range of frequency distributions.

Reliability coefficients were computed to assess thestability of frequency counts across the 1,000-wordsamples. In the case of the London-Lund corpus (thespoken texts), four 1,000-word samples were analysedfrom each text, and for the LOB corpus (the writtentexts), two 1,000-word subsamples were analysed fromeach text.

The reliability coefficient for each feature representsthe average correlation among the frequency counts ofthat feature (i.e. a count for each of the subsamples).For the spoken samples, all coefficients were high. Thelowest reliabilities were for passives (0.74) and con-ditional subordination (0.79), while all other featureshad reliability coefficients over 0.88. The coefficientswere somewhat smaller for the written samples, in partbecause they are based on two instead of four sub-samples. Conditional subordination in the written textshad a low reliability coefficient (0.31), while relativeclauses and present tense in the written texts hadmoderately low reliability coefficients (0.58 and 0.61respectively); all other features had reliability coef-ficients over 0.80. Overall, this analysis indicates thatfrequency counts for common linguistic features arerelatively stable across 1,000 word samples, while fre-quency counts for rare features (such as conditionalsubordination and WH relative clauses—see Table 3)are less stable and require longer text samples to bereliably represented.7

These earlier analyses can be complemented bytracking the distribution of various linguistic featuresacross 200-word segments of texts. For example, Fig. 1shows the distribution of prepositional phrasesthroughout the length of five texts from HumanitiesAcademic Prose—the figure plots the cumulative num-ber of prepositional phrases as measured at each 200-word interval in these texts. As can be seen from thisfigure, prepositional phrases are distributed linearly inthese texts. That is, there are approximately the same

249

Page 8: Representativeness in Corpus Designotipl.philol.msu.ru/media/biber930.pdf · involves a sampling decision, either conscious or not. The use of computer-based corpora provides a solid

350

300

250

9 200

150

100

0 200 400 600 800 1000 1200 1400 1600 1800 2000Number of words

Fig. 1 Distribution of prepositions in five humanities texts

number of prepositional phrases occurring in each 200-word segment (roughly thirty per segment in three ofthe texts, and twenty-five per segment in the other twotexts). (The linear nature of these distributions can beconfirmed by lining up a ruler next to the plot of eachtext.) This figure indicates that a common feature suchas prepositional phrases is extremely stable in its dis-tribution within texts (at least Humanities AcademicProse texts)—that even across 200-word segments, allsegments will contain roughly the same number of pre-positional phrases.

Figure 2 illustrates a curvilinear distribution, in thiscase the cumulative word types (i.e. the number ofdifferent words) in five Humanities texts. In general,frequency counts of a linguistic feature will be distri-buted linearly (although that distribution will be moreor less stable within a text—see below), while frequen-cies of different types of linguistic features (lexical orgrammatical) will be distributed curvilinearly; i.e. be-cause many types are repeated across text segments,each subsequent segment contributes fewer new typesthan the preceding segment. In Fig. 2, the straight linemarked by triangles shows the 50% boundary of wordtypes (the score when 50% of the words in a text aredifferent word types). In all five of these texts, at least50% of the words are different types in the first 200word segment (i.e. at least half of the words are notrepeated), and two of the texts have more than 50%different types in the first three segments (up to 600

words). All of the texts show a gradual decay in thenumber of word types, however. The most diverse textdrops to roughly 780 word types per 2,000 words(39%), while the least diverse text drops to roughly 480word types per 2,000 words (only 24%). These trendswould continue in longer texts, with each subsequentsegment contributing fewer new types.

These two types of distributions must be treateddifferently. In Figs 3-9,1 plot the distributions of sevenlinguistic features within texts representing three regis-ters. Three of the features are cumulative frequencycounts: Fig. 3 plots the frequencies of prepositionalphrases, a common grammatical feature; Fig. 4 plotsthe frequencies of relative clauses, a relatively raregrammatical feature; and Fig. 5 plots the frequencies ofnoun-preposition sequences, a relatively commongrammatical sequence. The other four figures plot thedistributions of types in texts. Figures 6 and 7 plot thedistribution of lexical types: word types (the number ofdifferent words) in Fig. 6 and hapax legomena (once-occurring words) in Fig. 7. Figures 8 and 9 plot thedistribution of grammatical types: different grammati-cal categories or 'tags' in Fig. 8, and different gramma-tical tag sequences in Fig. 9. The figures thus illustratelexical and grammatical features, with rare and com-mon overall frequencies, having linear and curvilineardistributions.

200 400 600 800 1000 1200 1400 1600 1800 2000

Number of words

' Averoge Humanities

- 1 0—text Humanities

- Average Technicol

- 10-text Technical

• Average Fiction

• 10-text Fiction

Fig. 3 Distribution of prepositions in texts from three registers

200 400 600 800 1000 1200 1400 1600 1800 2000Number of words

50% new words

Fig. 2 Distribution of word types in five humanities texts

250

200 400 600 800 1000 1200 1400 1600 1800 2000Number of words

• Average Humanities

- 10-text Humanities

• Average Technical

- 10-text Technical

Average Fiction

• 10-text Fiction

Fig. 4 Distribution of relative clauses in texts from three registers

Literary and Linguistic Computing, Vol. 8, No. 4, 1993

Page 9: Representativeness in Corpus Designotipl.philol.msu.ru/media/biber930.pdf · involves a sampling decision, either conscious or not. The use of computer-based corpora provides a solid

180

0 200 400 600 800 1000 1200 1400 1600 1800 2000Number of words

- Average Humanities

- 10-text Humanities

- Average Technical

• 10-text Technical

- Average Fiction

• 10-text Fiction

Fig. 5 Distribution of N-PP sequences in texts from three registers

800

0 200 400 600 800 1000 1200 1400 1600 1800 2000Number of words

- » - Average Humanities

- e - 10-text Humanities

- Average Technical

- 10-text Technical

• Average Fiction

- 10-text Fiction

Fig. 6 Distribution of word types in texts from three registers

600

"0 200 400 600 800 1000 1200 1400 1600 1800 2000Number of words

- Average Humanities

- 10-text Humanities

• Average Technical

• 10-texl Technical

- Average Fiction

- 10-text Fiction

Fig. 7 Distribution of Hapax legomena in texts from three registers

The figures can be used to address several questions.First they present the overall distributions of thesefeatures. The stable linear distribution of prepositionalphrases is further confirmed by Fig. 3. In contrast, therelatively unstable distribution of relative clauses, indi-cated above by a relatively low reliability coefficient, isfurther supported by the frequent departures fromlinearity in Fig. 4. That is, since relative clauses arerelatively rare overall, even two or three extra relativesin a 200-word segment results in an aberration. Figure5 shows that the distribution of noun-prepositionsequences is similar to that of prepositional phrases in

Literary and Linguistic Computing, Vol. 8, No. 4, 1993

200 400 600 800 1000 1200 1400 1600 1800 2000Number of words

- • - Average Humanities

- e - 10-text Humanities

• Average Technical

• 10-text Technical

• Average Fiction

• 10-text Fiction

Fig. 8 Distribution of grammatical tag types in texts from threeregisters

400

200 400 600 800 1000 1200 1400 1600 1800 2000Number of words

- Average Humanities

- 10-text Humanities

• Average Technical

• 10-text Technical

• Average Fiction

• 10-text Fiction

Fig. 9 Distribution of grammatical tag sequences in texts from fiveregisters

being linear and quite stable (although less frequentoverall).8

Figures 6-9 show different degrees of curvilinearity,with the grammatical and syntactic types showing sharperdrop-offs than the lexical types. Grammatical tag typesshow the sharpest decay: most different grammaticalcategories occur in the first 200 words, with relativelyfew additional grammatical categories being addedafter 600 words.

Figures 3-9 also illustrate distributional differencesacross registers, although only three registers are con-sidered here. For example, Figures 3 and 5 show fairlylarge differences between academic prose and fiction,with the former having much higher frequencies ofprepositional phrases and noun-prepositional phrasesequences. The differences among registers are lessclear-cut in Fig. 4, but humanities academic prose textsconsistently have more frequent relative clauses thaneither technical academic prose or fiction.

Each register is plotted twice in these figures: the'average' scores and the '10-text' scores. Average scorespresent the average value for ten texts from that regis-ter for the segment in question. (For example, Fig. 3shows that humanities texts have on average 130 pre-positions in the first 1,000 words of text.) In contrast,the '10-text' scores are composite scores, with each 200-word segment coming from a different text. Thus, thescore for 400 words represents the cumulative totals for

251

Page 10: Representativeness in Corpus Designotipl.philol.msu.ru/media/biber930.pdf · involves a sampling decision, either conscious or not. The use of computer-based corpora provides a solid

the first 200 words from two texts, the score for 600words sums the first 200-word totals from three texts,etc.

In the case of stable, linear distributions, there isvery little difference between the average and 10-textscores. In fact, Figs 3 and 5 show a remarkable coinci-dence of average and 10-text values; a single distributionis found, within a register, regardless of whether subse-quent 200-word segments are taken from the same textor from different texts. Figure 4 shows greater differ-ences for relative clauses (a relatively rare and lessstable feature). Here averaging over ten texts smoothsout most aberrations from linearity, while the 10-textvalues show considerable departures from linearity.

In contrast, there are striking differences betweenthe average and 10-text distributions for the curvilinearfeatures (Figs 6-9). In these cases, the 10-text scoresare consistently higher than the corresponding averagescore from the same register. In the case of word types,the 10-text scores for all three registers are higher thanthe average scores of the registers. The difference isparticularly striking with respect to technical academicprose—after 2,000 words of text, 10-text technical prosehas the second highest word type score (approximately740, or 37%), while on average technical prose textshave the lowest word type score (approximately 570, or28%). This shows that there is a high degree of lexicalrepetition within technical prose texts, but there is ahigh degree of lexical diversity across technical texts.The distribution of hapax legomena, shown in Fig. 7,parallels that of word types—again, all three 10-textscores are higher than the average scores; 10-texthumanities prose shows the highest score; and the aver-age technical prose score is by far the lowest. Thesedistributions reflect the considerable lexical diversityfound across humanities texts, and the relatively littlelexical diversity within individual technical texts.

There is more similarity between the 10-text andaverage scores with respect to the distribution of gram-matical types (Figs 8 and 9), although for each registerthe 10-text score is higher than the correspondingaverage score. Interestingly, these figures show thattechnical prose has the least grammatical diversity aswell as the least lexical diversity.

In summary, the analyses presented in this sectionindicate the following:

(1) Common linear linguistic features are distributedin a quite stable fashion within texts and can thusbe reliably represented by relatively short textsegments.

(2) Rare linguistic features show much more distri-butional variation within texts and thus requirelonger text samples for reliable representation.

(3) Features distributed in a curvilinear fashion, i.e.different feature types, are relatively stableacross subsequent text segments, but occur-rences of new types decrease throughout thecourse of a text. The frequency of new types isconsistently higher in cross-text samples than insingle-text samples. These patterns were shownto hold for relatively short text segments (2,000words total) and for cross-text samples taken

252

from a single register; the patterns should beeven stronger for longer text segments and forcross-register samples. These findings supportthe preference for stratified sampling—morediversity among the texts included in a corpuswill translate into a broader representation oflinguistic feature types.

With regard to the issue of text length, the thrust ofthe present section is simply that text samples should belong enough to reliably represent the distributions oflinguistic features. For linearly distributed features, therequired length depends on the overall stability of thefeature. For curvilinear features, an arbitrary cut-offmust be specified marking an 'adequate' representa-tion, e.g. when subsequent text segments contributeless than 10% additional new types. Given a finiteeffort invested in developing a corpus, broader linguis-tic representation can be achieved by focusing on diver-sity across texts and text types rather than by focusingon longer samples from within texts.

Specific proposals on text length require furtherinvestigations of the kind presented here, focusingespecially on the distributions of less stable features, todetermine the text length required for stability, and onthe distributions of other kinds of features (e.g. dis-course and information-packaging features).

4.2 Distributions Across Texts: number of TextsA second major statistical issue in building a textcorpus concerns the sampling of texts: how linguisticfeatures are distributed across texts and across regis-ters, and how many texts must be collected for the totalcorpus and for each register to represent those distribu-tions?

4.2.1 Previous Research on Linguistic Variation Withinand Across Registers. Although registers are definedby reference to situational characteristics, they can beanalysed linguistically, and there are important linguis-tic differences among them; at the same time, someregisters also have relatively large ranges of linguisticvariation internally (see Biber, 1988, Chapters 7 and 8).For this reason, the linguistic characterization of a reg-ister should include both its central tendency and itsrange of variation. In fact, some registers are similar intheir central tendencies but differ markedly in theirranges of variation (e.g. science fiction versus generalfiction, and official documents versus academic prose,where the first register of each pair has a more re-stricted range of variation). In Biber (1988, pp. 170-98), Idescribe the linguistic variation within registers, includ-ing the linguistic relations among various subregisters.

The number of texts required in a corpus to repre-sent particular registers relates directly to the extentof internal variation. In Biber (1990), I analyse thestability of feature counts across texts from a registerby comparing the mean frequency counts for 10-textsubsamples taken from particular registers. Five regis-ters were analysed: conversations, public speeches,press reportage, academic prose, and general fiction.Three 10-text samples were extracted from each ofthese registers, and the mean frequency counts of six

Literary and Linguistic Computing, Vol. 8, No. 4, 1993

Page 11: Representativeness in Corpus Designotipl.philol.msu.ru/media/biber930.pdf · involves a sampling decision, either conscious or not. The use of computer-based corpora provides a solid

linguistic features were compared across the samples(first person pronouns, third person pronouns, pasttense, nouns, prepositions, passives). The reliabilityanalysis of these mean frequencies across the three 10-text samples showed an extremely high degree of stabil-ity for all six linguistic features (all coefficients greaterthan 0.95). These coefficients show that the mean scoresof the 10-text samples are very highly correlated; that is,the central linguistic tendencies of these registers withrespect to these linguistic features are quite stable, evenas measured by 10-text samples. However, there aretwo important issues not considered by this analysis.First, the six linguistic features considered were all rela-tively common; rare features such as WH relativeclauses or conditional subordination might show muchlower reliabilities. Secondly, this analysis addressedhow many texts were needed to reliably represent meanscores, but did not address the representation of lin-guistic diversity in registers.9

4.2.2 Total Linguistic Variation in a Corpus: TotalSample Size for a Corpus. In Section 3, I discussedhow the required sample size is related to the standarderror ( s j by the equation:

Si = slriA (4.1)

The actual computation of sample size depends on aspecification of the tolerable error (te):

te = t* Si (4.2)

Equation 4.2 states that the tolerable error is equal tothe standard error times the /-value. Given a samplesize greater than thirty (which permits the assumptionof a normal distribution), a researcher can know with95% confidence that the mean score of a sample willfall in the interval of the true population mean plus-or-minus the tolerable error.

Equation 4.2 can be manipulated to provide a secondequation for computing the standard error, i.e. s^ = te/t. If the ratio telt is substituted for s± in Equation 4.1,and the equation is then solved for n, we get a directcomputation of the required sample size for a corpus:

n = .^/(/e//)2 (4.3)

where n is the computed sample size, s is the estimatedstandard deviation for the population, te is the tolerableerror (equal to 1/2 of the desired confidence interval),and / is the /-value for the desired probability level.

I note in Section 3 that there are problems in theapplication of Equation 4.3. In one sense, the equationsimply shifts the burden of responsibility, from estimat-ing the unknown quantity for required sample size toestimating the unknown quantities for the tolerableerror and population standard deviation; i.e. in orderto use the equation, there needs to be a prior estimateof the tolerable error or confidence interval permittedin the corpus and a prior estimate of the standarddeviation of variables in the population as a whole.

The tolerable error depends on the precision re-quired of population estimates based on the corpussample. For example, say that we want to know howmany nouns on average occur in conversation texts.The confidence interval is the window within which we

Literary and Linguistic Computing, Vol. 8, No. 4, 1993

can be 95% certain that the true population mean falls.For example, if the sample mean for nouns in conversa-tions was 120, and we needed to estimate the truepopulation mean of nouns with a precision of ± 2, thenthe confidence interval would be 4, extending from 118to 122. The tolerable error is simply one side (or one-half) of the confidence interval. The problem here isthat it is difficult to provide an a priori estimate of therequired precision of the analyses that will be based ona corpus.

Similar problems arise with the estimation of stan-dard deviations. In this case, it is not possible to esti-mate the standard deviation of a variable in a corpuswithout already having a representative sample oftexts. Here, as in many aspects of corpus design, workmust proceed in a circular fashion, with empirical inves-tigations based on pilot corpora informing the designprocess. The problem for initial corpus design, however,is to provide an initial estimate of standard deviation.

A final problem is that standard deviations must beestimated for particular variables, but in the case ofcorpus linguistics, there are numerous linguistic vari-ables of interest. Choosing different variables, withdifferent standard deviations, will result in differentestimates of required sample size.

In the present section, I use the analyses in Biber(1988, pp.77-78, 246-269) to address the first two ofthese problems. That study is based on a relativelylarge and wide-ranging corpus of English texts: 481texts taken from twenty-three spoken and written regis-ters. Statistical analyses of this corpus can thus be usedto provide initial estimates for both the tolerable errorand the population standard deviation.

In the design of a text corpus, tolerable error cannotbe stated in absolute terms because the magnitude offrequency counts varies considerably across features (aswas shown in Section 3). For example, a tolerable errorof ± 5 might work well for common features such asnouns, which have an overall mean of 180.5 per 1,000words in the pilot corpus, but it would be unacceptablefor rare features such as conditional subordinateclauses, which have an overall mean of only 2.5 in thecorpus (so that a tolerable error of 5 would translateinto a confidence interval of —2.5 to 7.5, and a textcould have three times the average number of condi-tional clauses and still be within the confidence interval).Instead I propose here computing a separate estimateof the tolerable error for each linguistic feature, basedon the magnitude of the mean score for the feature; forillustration, I will specify the tolerable error as ± 5% ofthe mean score (for a total confidence interval of 10%of the mean score). Table 4 presents the mean scoreand standard deviation of seven linguistic features inthe pilot corpus, together with the computed tolerableerror for each feature. It can be seen that the tolerableerror ranges from 9.03 for nouns (which have a mean of180.5) to 0.13 for conditional clauses (which have amean of only 2.5).

Given the tolerable errors and estimated standarddeviations listed in Table 4, required sample size (i.e.the total number of texts to be included in the corpus)can be computed directly using Equation 4.3. Table 4shows very large differences in required sample size

253

Page 12: Representativeness in Corpus Designotipl.philol.msu.ru/media/biber930.pdf · involves a sampling decision, either conscious or not. The use of computer-based corpora provides a solid

Table 4 Estimates of required sample sizes (number of texts) for the total corpus.

NounsPrepositionsPresent tensePast tensePassivesWH relative clausesConditional clauses

Mean score inpilot corpus

180.5110.577.740.1

9.63.52.5

Standard deviation inpilot corpus

35.625.434.330.4

6.61.92.2

Tolerableerror

9.035.533.892.010.480.180.13

RequiredN

59.881.2

299.4883.1726.3452.8

1,190.0

across these linguistic features. These differences are afunction of the size of the standard deviation relative tothe mean for a particular feature. If the standard devia-tion is many times smaller than the mean, as in the caseof common features such as nouns and prepositions,the required sample size is quite small. If, on the otherhand, the standard deviation approaches the mean inmagnitude, as in the case of rare features such as WHrelative clauses and conditional clauses, the requiredsample size becomes quite large. Past tense markers areinteresting in that they are relatively common (mean of40.1) yet have a relatively large standard deviation(30.4) and thus require a relatively large sample of textsfor representation (883). Overall the most conservativeapproach in designing a corpus would be to use themost widely varying feature (proportional to itsmean—in this case conditional clauses) to set the totalsample size.

4.2.3 Linguistic Variation within Registers: Number ofTexts Needed to Represent Registers. The remainingissue concerns the required sample size for each regis-ter. Although most books on sample design simplyrecommend proportional sampling for stratified designs(see Section 3), a few books discuss the need for non-proportional stratified sampling in certain instances;these books differ, however, on the method for deter-mining the recommended sample sizes for subgroups.For example, Sudman (1976, pp.110-111) states thatnon-proportional stratified sampling should be usedwhen the subgroups themselves are of primary interest(as in the case of a text corpus), and that the samplesizes of the subgroups should be equal in that case (tominimize the standard error of the difference). Thisprocedure is appropriate when the variances of the sub-groups are roughly the same. In contrast, Kalton (1983,pp.24-25) recommends using the subgroup standarddeviations to determine their relative sample sizes. Thisprocedure is more appropriate for corpus design, sincethe standard deviations of linguistic features vary con-siderably from one register to the next.

Although I do not make specific recommendationsfor register sample size here, I illustrate this approachin Table 5, considering the relative variances of sevenlinguistic features (the same as in Table 4) across threeregisters: conversations, general fiction, and academicprose. As above, the data are taken from Biber, (1988,pp. 246-69).

Table 5 presents the mean score, standard deviation,

254

and the ratio of standard deviation to mean score, forthese seven linguistic features in the three registers.The ratio represents the normalized variance of each ofthese features within each register—the extent of inter-nal variation relative to the magnitude of the meanscore. The raw standard deviation is not appropriatehere (similar to Table 4) because the mean scores ofthese features vary to such a large extent.

Table 5 shows that the normalized standard deviationvaries considerably across features within a register.For example, within conversations the counts fornouns, prepositions, and present tense all show rela-tively small normalized variances, while passives, WHrelative clauses, and conditional clauses all show nor-malized variances at or above 50%. As shown earlier,features with lower overall frequencies tend to haveconsiderably higher normalized variances.

There are also large differences across the registers.For example, past tense has a normalized variance of46% in conversations and only 18% in general fiction,but it shows a normalized variance of 96% in academicprose. Conditional subordination also shows large dif-ferences across these three registers: it has a normal-ized variance of 54% in conversations, 73% in generalfiction, and 100% in academic prose.

In order to determine the sample size for each regis-ter, it is necessary to compute a single measure of thevariance within each register. This measure is then usedto allot a proportionally larger sample to registers withgreater variances. (This should not be confused with aproportional representation of the registers.) A certainminimum number of texts should be allotted for eachregister (e.g. at least twenty texts per register), andthen the remaining texts in the corpus can be dividedproportionally depending on the relative variance with-in registers.

To illustrate, consider Table 5 again. This table listsan average normalized deviation for each register,which represents an overall deviation score computedby averaging the normalized standard deviations of theseven linguistic features. Conversations and generalfiction both have relatively similar overall deviations(37% and 39% correspondingly) whle academic prosehas a somewhat higher overall deviation (49%). Tofollow through with this example, assume that therewere to be a total of 200 texts in a corpus, taken fromthese three registers. Each register would be allotteda minimum of twenty texts, leaving 140 texts to bedivided proportionally among the three registers. To

Literary and Linguistic Computing, Vol. 8, No. 4, 1993

Page 13: Representativeness in Corpus Designotipl.philol.msu.ru/media/biber930.pdf · involves a sampling decision, either conscious or not. The use of computer-based corpora provides a solid

Table 5 Relative variation within selected registers

Conversations. Average normalized deviation = 0.37

NounsPrepositionsPresent tensePast tensePassivesWH Relative clausesConditional clauses

Mean score inpilot corpus

137.485.0

128.437.4

4.21.43.9

Standard deviation inpilot corpus

15.612.422.217.32.10.92.1

General fiction. Average normalized deviation = 0.39

NounsPrepositionsPresent tensePast tensePassivesWH Relative clausesConditional clauses

Academic prose. Average

NounsPrepositionsPresent tensePast tensePassivesWH Relative clausesConditional clauses

Mean score inpilot corpus

160.792.853.485.65.71.92.6

Standard deviation inpilot corpus

25.715.818.815.73.21.11.9

normalized deviation = 0.49

Mean score inpilot corpus

188.1139.563.721.917.04.62.1

Standard deviation inpilot corpus

24.016.723.121.17.41.92.1

Ratio of standard deviation/mean(normalized deviation)

0.110.150.170.460.500.640.54

Ratio of standard deviation/mean(normalized deviation)

0.160.170.350.180.560.580.73

Ratio of standard deviation/mean(normalized deviation)

0.130.120.360.960.440.411.00

determine the relative sample size of the registers, onewould solve the following equation based upon theirrelative overall deviations:

0.37x + 0.39x + 0.49x = 1401.25* = 140x = 112

and thus the sample sizes would be:

Conversation 0.37 * 112 = 41General fiction 0.39 * 112 = 44Academic prose 0.49 * 112 = 55Total allocated texts = (41 + 20 for conversation) +

(44 + 20 for general fiction)+ (55 + 20 for academicprose) = 200 texts

To compute the actual values for register samplesizes, it is necessary to analyse the full range of linguis-tic features in all registers, computing a single averagedeviation score for each register. This could be done byaveraging across the normalized variances of all linguis-tic features, as illustrated here. An alternativeapproach would be to use the normalized variances ofthe linguistic dimensions identified in Biber (1988).This latter approach would have a more solid theoreti-cal foundation, in that the dimensions represent basicparameters of variation among registers, each based onan important co-occurrence pattern among linguistic

Literary and Linguistic Computing, Vol. 8, No. 4, 1993

features. In contrast, the approach illustrated in thissection depends on the pooled influence of linguisticfeatures in isolation, and thus relatively aberrant distri-butions can have a relatively strong influence on thefinal outcome. In addition, use of the dimensions en-ables consideration of the distributions with respect toparticular functional parameters, so that some dimen-sions can be given more weight than others. In contrast,there is no motivated way for distinguishing among therange of individual features on functional grounds.

It is beyond the scope of this paper to illustrate theuse of dimension scores for the linguistic characteriza-tion of registers (since I would first need to explain thetheoretical and methodological bases of the dimen-sions). The same basic approach as illustrated in thissection would be used, however. The major differenceinvolves the analysis of deviation along basic dimen-sions of linguistic variation rather than with respect tonumerous linguistic features in isolation.

5. Conclusion: Beginning

I have tried to develop here a set of principles forachieving 'representativeness' in corpus design. I haveoffered specific recommendations regarding someaspects of corpus design, and illustrations elsewhere(regarding issues for which final recommendationscould not be developed in a paper of this scope). The

255

Page 14: Representativeness in Corpus Designotipl.philol.msu.ru/media/biber930.pdf · involves a sampling decision, either conscious or not. The use of computer-based corpora provides a solid

bottom-line in corpus design, however, is that the para-meters of a fully representative corpus cannot be deter-mined at the outset. Rather, corpus work proceeds in acyclical fashion that can be schematically representedas follows:

Pilot empiricalnvestigation

and theoreticalanalysis

» Corpusdesign

I1» Compile

portionof corpus

> Empiricalinvestigation

Theoretical research should always precede the initialcorpus design and actual compilation of texts. Certainkinds of research can be well advanced prior to anyempirical investigations: identifying the situationalparameters that distinguish among texts in a speechcommunity, and identifying the range of importantlinguistic features that will be analysed in the corpus.Other design issues, though, depend on a pilot corpusof texts for preliminary investigations. Present-day re-searchers on English language corpora are extremelyfortunate in that they have corpora such as the Brown,LOB, and London-Lund corpus for pilot investiga-tions, providing a solid empirical foundation for initialcorpus design. The compilers of those corpora had nosuch pilot corpus to guide their designs. Similar situa-tions exist for current projects designing corpora torepresent non-western languages. For example, arecently completed corpus of Somali required extensivefieldwork to guide the initial design (see Biber andHared 1992). Thus the initial design of a corpus will bemore or less advanced depending on the availability ofprevious research and corpora.

Regardless of the initial design, the compilation of arepresentative corpus should proceed in a cyclicalfashion: a pilot corpus should be compiled first, repre-senting a relatively broad range of variation but alsorepresenting depth in some registers and texts. Gram-matical tagging should be carried out on these texts,as a basis for empirical investigations. Then empiricalresearch should be carried out on this pilot corpus toconfirm or modify the various design parameters. Partsof this cycle could be carried out in an almost con-tinuous fashion, with new texts being analysed as theybecome available, but there should also be discretestages of extensive empirical investigation and revisionof the corpus design.

Finally, it should be noted that various multivariatetechniques could be profitably used for these empiricalinvestigations. In this paper, I have restricted myself tounivariate techniques, and to simple descriptive statis-tics. Other research, though, suggests the usefulnessof two multivariate techniques for the analysis of lin-guistic variation in computerized corpora: factor analy-sis and cluster analysis. Factor analysis can be used ineither an exploratory fashion (e.g. Biber, 1988) or fortheory-based 'confirmatory' analyses (e.g. Biber, 1992).Both of these would be appropriate for corpus designwork, especially for the analysis of the range and typesof variation within a corpus and within registers. Suchanalyses would indicate whether the different para-meters of variation were equally well represented andwould provide a basis for decisions on sample size.

256

Cluster analysis has been used to identify 'text types' inEnglish—text categories defined in strictly linguisticterms (Biber, 1989). Text types cannot be identified ona priori grounds; rather they represent the groupings oftexts in a corpus that are similar in their linguisticcharacterizations, regardless of their register categor-ies. Ideally a corpus would represent both the range ofregisters and the range of text types in a language, andthus research on variation within and across both kindsof text categories is needed.10

In sum, the design of a representative corpus is nottruly finalized until the corpus is completed, and analysesof the parameters of variation are required throughoutthe process of corpus development in order to fine-tunethe representativeness of the resulting collection oftexts.

Acknowledgements

I would like to thank Edward Finegan for his manyhelpful comments on an earlier draft of this paper. Amodified version of this paper was distributed for thePisa Workshop on Textual Corpora, held at the Uni-versity of Pisa (January 1992), and discussions withseveral of the workshop participants were also helpfulin revising the paper.

Notes

1. Further, in the case of language corpora, proportionalrepresentation of texts is usually not desirable (see Sec-tion 3); rather, representation of the range of text types isrequired as a basis for linguistic analyses, making astratified sample even more essential.

2. Actually, very little work has been carried out on dialectvariation from a text-based perspective. Rather, dialectstudies have tended to concentrate on phonological varia-tion, downplaying the importance of grammatical anddiscourse features.

3. Other demographic factors characterize individual speak-ers and writers rather than groups of users; these includerelatively stable characteristics, such as personality, in-terests, and beliefs, and temporary characteristics, suchas mood and emotional state. These factors are probablynot important for corpus design, unless an intended useof the corpus is investigation of personal differences.

4. This parameter would not be important for many non-western societies, or for certain kinds of corpora repre-senting different historical periods; quite different sam-pling strategies would be required in these cases.

5. Published collections of letters and published diaries arespecial cases—these originally have individual addressees,but they are usually written with the hope of eventual pub-lication and thus with an unenumerated audience in mind.

6. A proportional corpus would be useful for assessmentsthat a word or syntactic construction is 'common' or 'rare'(as in lexicographic applications). Unfortunately, mostrare words would not appear at all in a proportional (i.e.primarily conversational) corpus, making the databaseill-suited for lexicographic research.

7. In Biber (1990) I also assess the representativeness of1,000-word text samples by computing difference scoresfor pairs of samples from each text. This analysis confirmsthe general picture given by the reliability coefficientswhile providing further details of the distribution of parti-cular features in particular registers.

Literary and Linguistic Computing, Vol. 8, No. 4, 1993

Page 15: Representativeness in Corpus Designotipl.philol.msu.ru/media/biber930.pdf · involves a sampling decision, either conscious or not. The use of computer-based corpora provides a solid

8. These are primarily prepositional phrases functioning asnoun modifiers, as opposed to prepositional phrases.withadverbial functions.

9. Actually this latter question was addressed by computingdifference scores, for the mean, standard deviation, andrange, across the 10-text samples.

10. For example, one of the most marked text types identi-fied in Biber (1989) consists of texts in which the addressor isproducing an on-line reportage of events in progress.Linguistically, this text type is marked in being extremelysituated in reference (many time and place adverbials anda present time orientation). Unfortunately, there areonly seven such texts in the combined London-Lund andLOB corpora, indicating that this text type is under-reprepresented and needs to be targeted in future corpusdevelopment.

References

Biber, D. (1986). Spoken and Written Textual Dimensions inEnglish: Resolving the Contradictory Findings. Language,62: 384-414.

(1988). Variation across Speech and Writing. CambridgeUniversity Press: Cambridge.

(1989). A Typology of English Texts, Linguistics, 27:3-43.

— (1990). Methodological Issues Regarding Corpus-basedAnalyses of Linguistic Variation. Literary and LinguisticComputing, 5: 257-69.

— (1992). On the Complexity of Discourse Complexity: AMultidimensional Analysis. Discourse Processes. 15: 133—163.

- (1993a). An Analytical Framework for Register Studies.

ives on Register. Oxford University Press, New York. Inpress.

(1993b). Register Variation and Corpus Design, Com-putational Linguistics. In press.

and Hared, M. (1992). Dimensions of Register Varia-

In D. Biber and E. Finegan (eds), Sociolinguistic Perspect-

tion in Somali, Language Variation and Change, 4: 41-75.Brown, P. and Fraser, C. (1979). Speech as a Marker of

Situation. In K. R. Scherer and H. Giles (eds), SocialMarkers in Speech. Cambridge University Press, Cam-bridge, pp.33-62.

Duranti, A. (1985). Sociocultural Dimensions of Discourse.In T. van Dijk (ed.), Handbook of Discourse Analysis,Vol. 1. Academic Press: New York: pp.193-230.

Francis, W. N. and Kucera, H. (1964/1979). Manual of Infor-mation to Accompany a Standard Corpus of Present-DayEdited American English, for use with Digital Computers.Department of Linguistics, Brown University.

Halliday, M. A. K. and Hasan, R. (1989). Language, Con-text, and Text: Aspects of Language in a Social-semioticPerspective. Oxford University Press, Oxford.

Henry, G. T. (1990). Practical Sampling. Sage, NewburyPark, CA.

Hymes, D. H. (1974). Foundations in Sociolinguistics.University of Pennsylvania Press, Philadelphia.

Johansson, S., Leech, G. N. and Goodluck, H. (1978).Manual of information to accompany the Lancaster-Oslo/Bergen Corpus of British English, for use with digital com-puters. Department of English, University of Oslo.

Kalton, G. (1983). Introduction to survey sampling. Sage,Newbury Park, CA.

Sudman, S. (1976). Applied Sampling. Academic Press, NewYork.

Svartvik, J. and Quirk, R. (eds) (1980). A Corpus of EnglishConversation. C. W. K. Gleerup, Lund.

Williams, B. (1978). A Sampler on Sampling. John Wiley andSons, New York.

Literary and Linguistic Computing, Vol. 8, No. 4, 1993 257