8
Computational Linguistics and Statistics in the Analysis of the Montreal French Corpus Author(s): David Sankoff, Rejean Lessard, Nguyen Ba Truong Source: Computers and the Humanities, Vol. 11, No. 4 (Jul. - Aug., 1977), pp. 185-191 Published by: Springer Stable URL: http://www.jstor.org/stable/30199894 . Accessed: 16/03/2011 17:15 Your use of the JSTOR archive indicates your acceptance of JSTOR's Terms and Conditions of Use, available at . http://www.jstor.org/page/info/about/policies/terms.jsp. JSTOR's Terms and Conditions of Use provides, in part, that unless you have obtained prior permission, you may not download an entire issue of a journal or multiple copies of articles, and you may use content in the JSTOR archive only for your personal, non-commercial use. Please contact the publisher regarding any further use of this work. Publisher contact information may be obtained at . http://www.jstor.org/action/showPublisher?publisherCode=springer. . Each copy of any part of a JSTOR transmission must contain the same copyright notice that appears on the screen or printed page of such transmission. JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range of content in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new forms of scholarship. For more information about JSTOR, please contact [email protected]. Springer is collaborating with JSTOR to digitize, preserve and extend access to Computers and the Humanities. http://www.jstor.org

Computational Linguistics and Statistics in the Analysis

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Computational Linguistics and Statistics in the Analysis

Computational Linguistics and Statistics in the Analysis of the Montreal French CorpusAuthor(s): David Sankoff, Rejean Lessard, Nguyen Ba TruongSource: Computers and the Humanities, Vol. 11, No. 4 (Jul. - Aug., 1977), pp. 185-191Published by: SpringerStable URL: http://www.jstor.org/stable/30199894 .Accessed: 16/03/2011 17:15

Your use of the JSTOR archive indicates your acceptance of JSTOR's Terms and Conditions of Use, available at .http://www.jstor.org/page/info/about/policies/terms.jsp. JSTOR's Terms and Conditions of Use provides, in part, that unlessyou have obtained prior permission, you may not download an entire issue of a journal or multiple copies of articles, and youmay use content in the JSTOR archive only for your personal, non-commercial use.

Please contact the publisher regarding any further use of this work. Publisher contact information may be obtained at .http://www.jstor.org/action/showPublisher?publisherCode=springer. .

Each copy of any part of a JSTOR transmission must contain the same copyright notice that appears on the screen or printedpage of such transmission.

JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range ofcontent in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new formsof scholarship. For more information about JSTOR, please contact [email protected].

Springer is collaborating with JSTOR to digitize, preserve and extend access to Computers and theHumanities.

http://www.jstor.org

Page 2: Computational Linguistics and Statistics in the Analysis

Computers and the Humanities, Vol. 11, pp. 185-191. PERGAMON PRESS, 1978. Printed in the U.S.A. 0010-4817/78/0706-0185$2.00/0 Copyright A 1978 Pergamon Press

Computational Linguistics and Statistics in the Analysis of the

Montreal French Corpus DAVID SANKOFF, REJEAN LESSARD and NGUYEN BA TRUONG

The unusual aspect of the Montreal French project is not the innovative character of the computational and statistical methods, nor the volume of data involved, but the proliferation of linguistic studies which have been facilitated by these techniques and materials. Indeed, the development of computational aids has at each step been in direct response to specific needs of linguistic analysis. We shall illustrate by tracing these developments in chronological order. Following this we shall discuss some of the statistical methodology which has been implemented for this work.

Sampling and the data base. The goal of the project, directed by G. Sankoff and H. J. Cedergren [25], was to study the structure of linguistic variation within the Montreal French-speaking community, roughly analogous to the sociolinguistic work of Labov [10], Wolfram [33] and others. An undertaking of this nature requires a corpus of the spoken language of somewhere between a few hundred thousand and a few million words; budgetary consideration permitted us about a million. From the beginning, sociological- statistical considerations had to be traded off against linguistic-statistical ones. To obtain even a limited idea of a single speaker's systematic linguistic ten- dencies, especially on the syntactic and lexical levels, requires many thousands of words. But a sociological- ly representative sample of a large speech community should also have at least several hundred, and preferably thousand, speakers. Our solution to this dilemma was to select a limited sample of 120 speakers, carefully stratified for a balanced distribu- tion according to sex, age, economic level of residen- tial area, and geographical location. The idea was not

so much to be able to reconstruct "typical" or average speech behaviour, but by tapping as many dimensions of variability as possible, to detect and assess sociodemographic factors which influence linguistic variation.

The sampling method required the generation of a large number of random addresses in a street direc- tory, until the sampling quotas for each residential and geographic area were filled. For age group and sex, quotas were controlled by the interviewers in the field. The interviews, largely informal conversations on a number of everyday topics concerning life in Montreal, past and present, lasted about an hour each and yielded about 8,000 words per person inter- viewed. A great deal of sociodemographic informa- tion, gathered about the informant in the course of the interview, has proved useful in the sociolinguistic analyses. A fuller description of the methods and procedures followed, as well as a sociological analysis of the sample obtained, has been published elsewhere [19, 20].

The interviews were, of course, recorded, and the important decision to transcribe this material directly on punch cards necessitated a certain trade-off. The possibility of automated phonetic or phonological analysis directly from the transcription was sacrificed to considerations of economy and to the desirability of lexically-based semantic and syntactic analyses of a corpus in standard orthography. Thus the tran- scriber-keypunchers were instructed to spell each word according to normal conventions, but not to normalize the order of the words nor other syntactic behavior. The ommission of accents, in the interests of speed, accelerated the transcription, and has not since led to any difficulties, except in the compilation

David Sankoff, Rejean Lessard and Nguyen Ba Truong are on the staff of the Centre de Recherches Mathbmatiques of the Universith de Montreal.

185

Page 3: Computational Linguistics and Statistics in the Analysis

186 DAVID SANKOFF, REJEAN LESSARD and NGUYEN BA TRUONG

of the frequency dictionary, to be discussed below. The final transcription, which went through two cycles of corrections, comprises about 100,000 punch cards.

As mentioned above, our transcription was not intended to serve as input to any automated phono- logical analysis. Detailed phonemic and phonetic transcriptions of about 15% of each speaker's inter- view is being carried out independently, under the supervision of Professor Laurent Santerre of the

D6partement de linguistique et philologie at the Uni-

versit6 de Montreal. A system for computer storage, access and analysis of this material, set up by Jean Millo, is fully operational.

Transcript edition and corpus storage modes. The first computer treatment of the corpus, the simplest but at the same time one of the most useful, converted the material on cards into a well-organized and easily readable working transcription. The only format conventions adopted during the keypunching had been to insert at least one space between words and to indicate the beginning of a speaker's turn (either interviewer or interviewee) with an identifying numeral. No record length having been defined, words were broken arbitrarily at the ends of cards and carried on without hyphen to the next. A pair of asterisks around faulty material permitted correction of typing errors without retyping-entire cards. The editing program simply discarded everything between two such asterisks, reduced all series of two or more blanks to one, set up print records of 120 characters in length and decomposed the conversation into paragraphs according to speaker identification numeral. These were printed out with double-spacing, on numbered pages.

The three-thousand-page transcript thus produced, in several copies, has been an invaluable working document. Most of the several dozen linguistic studies undertaken on the corpus have begun with a heuristic assessment of one or more syntactic, phonological or lexical variables, through a perusal of the transcript, usually accompanied by a selective relistening to the original tapes. A second step, after an analysis of the apparent conditioning of the occurrence of the variants, is systematically to extract and check all occurrences of the phenomena under study, in the entire corpus or appropriate subsample, and to register on a coding sheet the variant and relevant aspects of the context. The resulting data set serves as input to a subsequent quantitative (and automated) analysis. Thus the heuristic steps and the linguistic analysis form the "human" or manual research phase

between computational linguistics and automated statistical analysis.

To make the heuristic part of the research more efficient, we undertook to compile an index of the corpus. Since we had relatively little interest in the interviewer's speech behaviour, we first constructed a data file containing only the interviewees' speech and excluding a reading exercise at the end of each interview. In the form of 80-character records, each including an interviewee number (from 1 to 120) and a line number (from 1 up to one or two thousand per interviewee), the reduced corpus is accessible in two forms, tape and hard copy.

In summary, the corpus exists in five forms: recorded speech on audio-tape, transcription on cards, an edited and printed transcription, a some- what condensed and line-numbered form on tape, and the same form printed.

The index. As mentioned above, the version of the transcription containing line numbers was prepared specifically with a view to constructing an index. Fortunately having access to JEUDEMO, the text- handling system of Bratley, Lusignan and Ouellette [1], we adapted its text-inverting capabilities for our purposes. Because of time and storage limitations, we decided to invert 24 groups of five interviews each, separately, and to use a series of merges to construct the master index.

The final index is available in two forms, tape and the printed version, about 5,000 pages long. Each (unlemmatized) form, including common words such as le, un, 'a, de, is listed in alphabetical order, together with the total number and the addresses of all its occurrences in the corpus. The addresses refer in numerical order to the interview and line number in the condensed version of the corpus.

Used in conjunction with the transcription from which it is derived, the index immediately proved to be an extremely useful tool for linguistic research. Especially when the problem involves one or more relatively rare words or constructions including rare words (less than, say, a few dozen occurrences in the corpus), it is an easy matter to look up all the addresses listed and to examine all the usages in their context in the corpus. In addition, the index can be studied directly to detect words which are unevenly distributed throughout the speech community. For example, in examining the addresses for the conjunc- tion alors, "then," we find that some individuals use it frequently while others never do. A study of the sociodemographic data on these individuals quickly reveals that the use of alors is a characteristic of

Page 4: Computational Linguistics and Statistics in the Analysis

COMPUTATIONAL LINGUISTICS AND STATISTICS IN THE ANALYSIS OF THE MONTREAL FRENCH CORPUS 187

middle-class, highly-educated speakers, while the majority of the population use other conjunctions for the same syntactic function [4].

Despite its convenience in cutting down manual search time from days to minutes, the index does have certain limitations. To study forms which occur several times per interview, it becomes rather tedious to look up all the addresses and, even worse, to copy out all the relevant portions of the context of each occurrence, for later linguistic analysis. After con- siderable urging by users of the corpus, therefore, we set about constructing a concordance.

The concordance. At the outset, two major decisions had to be made about the concordance: the size of context, and the inclusion of the very frequent words. In both cases, the simplest course was chosen. Each occurrence was to be printed at the center of a printer line, and the line filled with the immediate context and its address. Although this has generally proved sufficient, occasional recourse to the corpus is necessary. Preferable, perhaps, would have been syntactically defined context, for example, every- thing between two periods (or question marks, etc.), but given the rough nature of the transcription and especially the nature of spoken language, with its hesitations, pauses, starting over of sentences, and other phenomena which render any analysis in terms of whole sentences extremely difficult, it would not have yielded very meaningful results. As for omitting very frequent words from the concordance, calcula- tions showed that this step would result in about only a 30-40% reduction in volume, but would make impossible many studies which would require at least a sample of such words in context.

During the construction of the concordance, the corpus was stored on a disc file to permit rapid access to occurrences of forms in the sequence listed in the index. For each occurrence an appropriate print record, based on the context plus the address, was stored on tape. Since constructing about a million such records was still time-consuming, the program execution was subdivided into five steps correspond- ing to a five-way partition of the alphabet, one tape being filled up with each step. At 60 lines per page, the printed concordance consisted of more than 15,000 pages.

The finished concordance, even more useful than the index, has made the latter obsolete for most purposes. It is easy and convenient, when studying a lexical form or a closely related set of forms such as the conjugations of most verbs, to proceed directly to the appropriate section of the concordance according

to alphabetical order, and to find dozens, hundreds or even thousands of pertinent examples in context grouped together in a compact and readable form.

Even the printed concordance is not the ultimate in convenience. Handling a 15,000-page document, especially when the entries to be looked at are not all together in alphabetical order, can be tedious. To reduce the concordance to more manageable physical dimensions, a microfiche version was prepared di- rectly from the five original concordance tapes, at a very reasonable cost and in short order, by Compu- trex Centres Limited, of Montreal. At 208 computer pages per fiche, requiring a 42x reduction, the entire concordance fits on some 80 4x6 fiches. For most purposes, working with a microfiche reader has proved to be far more convenient than working with hard copy, and has largely replaced the latter in routine use.

Further computational work. Although not especially motivated by our own sociolinguistic work, we have also carried out some word-frequency computations, largely in response to psychologists, educators, speech therapists and linguists who require data on the frequency of words, syllables, or phonemes in the spoken language. For certain purposes, the rough, unlemmatized frequencies which appear in the index are sufficient, listed in alphabetical order as well as in order of decreasing frequency. From the index tape, frequency distributions can be calculated for each of the 120 speakers separately, as we did, for example, in a multiple regression analysis of the socio- demographic conditioning of vocabulary rich- ness[17].

For linguistic and lexicological work, however, lemmatized word lists are generally preferable. In addition, the decision we took for rapid transcription resulted in a certain heterogeneity among secretaries with respect to spelling, especially of non-standard and dialectal forms which occur frequently in the spoken language. Accents, cedillas, capital letters, etc. are not indicated. Hyphens, apostrophes, and other problems of word-division are variable. Words are not labelled according to grammatical category. In short, not having originally been planned for a frequency dictionary, our corpus and other computer-prepared documents do not lend themselves easily to such a project. Nevertheless, the demand for this dictionary, and its comparative value as a supplement to Juilland's frequency dictionary of the written language [7], and Gougenheim's work on the spoken French of France [5], led us to undertake its preparation.

Page 5: Computational Linguistics and Statistics in the Analysis

188 DAVID SANKOFF, REJEAN LESSARD and NGUYEN BA TRUONG

Based on word-final morphology, some of the lemmatization and grammatical labeling was auto- matic, but the separation of homographs, correction and standardization of spelling, etc. was mostly manual. Some accuracy was sacrificed in favour of efficiency when we decided not to work directly on the corpus, but on the raw form-frequency lists derived from the index. Homographs of very frequent forms (e.g., a "at" versus a "had" versus A "letter A") were separated on a sample of at most a few hundred forms in context as sampled from the concordance. A first draft of the dictionary is presently being corrected prior to publication.

The linguistic variable. Sociolinguistic analysis differs from most linguistic work in that it does not aim specifically at a complete elaboration of the gram- matical rules of a language, but concentrates rather on those rules and constructions which vary in their usage from individual to individual within the speech community. What is qualitatively "free variation" phonologically can often be decomposed into quanti- tative tendencies conditioned not only by the phonological environment, but by syntactic factors and even by extralinguistic considerations such as the age, sex and sociological or occupational character- istics of the speaker. On the lexical level, the choice among synonyms which are apparently inter- changeable semantically may well be determined in part by the same sorts of extralinguistic factors. In transformational grammars, the choice of whether or not to apply a given optional meaning-preserving syntactic transformation in the derivation of a sentence can be influenced by the nature of the specific elements in the sentence, by the form of the preceding sentence, and by all the extralinguistic factors mentioned above. All these, and many other types of conditioned linguistic variability, may be modeled in terms of independent binomial trials, where the parameter p is a function of the various linguistic and extralinguistic conditions obtaining at the instant the choice is made.

In terms of the prevalent, introspective, linguistic methodology, it would be difficult to proceed further than, say, a recognition that the parameter p is influenced by one or more environmental factors. With the computer-produced materials at our dis- posal, however, it becomes a straightforward, though sometimes tedious task to collect a large enough number of "trials"-e.g., choice of one allophone or another, or of one synonym or another, or the application or non-application of a rule, etc.-to carry out a statistical analysis of the proposed model.

This approach has a number of implications at the level of linguistic theory. The usage statistics we collect, insofar as they indicate systematic tendencies, suggest some underlying probabilistic mechanism for generating these data. The simplest way we could account for these would be to incorporate as part of a speaker's "competence" along with grammatical rules and categories [2], the parameter p as well as its dependence on linguistic and partially extralinguistic context. The probabilized rules central to this formu- lation are known as "variables rules."

Which model? The recognition that the parameter p varies as a function of a number of contextual factors leaves undefined the exact form of that function. In the study of linguistic variation over the past ten years, theoretical and empirical developments have led to the use of a number of models, of the form

n

T(p) = + 2:aixi i= 1

where T is some given monotone transformation, and pi and the ai are parameters of the model to be estimated from the data. The xi may be continuous variables such as age of speaker or more usually, zero-one variables indicating the absence or presence of a given contextual factor, either linguistic or extralinguistic.

In the first study of this type [11], T was simply the identity transformation, i.e., the parameter p was postulated to be an additive function of the environ- mental parameters. With this model, however, dif- ficulties were encountered which were associated in part with the fact that the parameter p as a binomial trial probability is restricted to the interval [0,1], while the usual estimation methods for the additive model do not guarantee any such restriction. For this and other reasons, models based on T(p) = log p and T(p) = log (1 - p) were proposed [2], and applied to various data sets. Most recently, the best features of all of these models have effectively been combined, with the adoption of the transformation T(p)= log(p/l - p) [15].

The estimation problem. For an additive model, the usual estimation methodology involves the analysis of variance and/or multiple regression. Linguistic varia- tion data is such that the number of binomial trials per given context is extremely unequal from context

Page 6: Computational Linguistics and Statistics in the Analysis

COMPUTATIONAL LINGUISTICS AND STATISTICS IN THE ANALYSIS OF THE MONTREAL FRENCH CORPUS 189

to context, to the extent that the available adjust- ments for unequal data [3] are simply inadequate. In this situation, as well, analysis of variance for estimating the parameters of transformed models (where T is not the identity), followed by an inverse transformation to estimate p, is not a statistically justifiable procedure.

We are thus forced to fall back on the more fundamental principle of maximum likelihood, which for our problem does not lead to any elegant and rapid computing formulae such as exist for analysis of variance. There is needed instead an iterative approxi- mation to the maximum likelihood value of the parameters.

While this type of problem, requiring a maximum likelihood solution, arises occasionally in other scien- tific fields such as psychology, agriculture, fishery research and pharmacology [6, 13], it is not so widespread that appropriate computer programs are incorporated in any of the well-known and widely- distributed program packages for data analysis. The programs which do exist, while often quite efficient, are applicable to rather restricted types of data sets, corresponding to the specific types of problems for which they were written. That linguistic variation data, however, usually involve a very large number of contextual factors and a large number of different contexts leads to programming problems of various sorts.

Computational methods. To estimate the parameters ai , we apply the methods of non-linear programming. The (log-) likelihood expression to be maximized simultaneously as a function of all of these ai is the sum, over all contexts, of terms of the form X log p + (N-X) log(l-p) where p = T-1 [.t+ ai xi ], as before, and X/N is the number of binomial trial "successes" over the total number of trials. This maximization is constrained by certain relations holding amongst ai representing contextual factors which are mutually exclusive with respect to their presence in the environment of a variable, relations which are neces- sary to ensure uniqueness of the estimates. The task of finding the maximum is reduced to solving a number of simultaneous nonlinear equations, some- what fewer than the number of parameters to be estimated.

The first method to be used involved "inner" and "outer" iterations. Starting with some initial con- figuration of the parameters and using up to ten iterations of Newton's method, the likelihood was maximized with respect to one of the parameters. This procedure, repeated with each parameter in turn,

completed one outer iteration. This method proved satisfactory for most practical problems. Although many outer iterations were sometimes required for convergence to a maximum, the time for each iteration was so very small that efficiency was reasonable.

More recently, Rousseau [15] has introduced a number of improvements in the algorithm and its implementation. It is often informative to analyze each individual speaker as representing a different contextual "factor." This requires estimating 120 different ai's plus a few more factors of a linguistic nature, greatly increasing computing time, and calling for a more efficient algorithm. The Newton-Raphson method, where there are no inner iterations, but instead the likelihood is maximized with respect to all parameters simultaneously, converges much more rapidly in terms of the total number of iterations, though each iteration requires the inversion of a matrix of size equal to the number of independent parameters to be estimated.

Tests of significance. Under the null hypothesis, the addition of r extra parameters to the q already present in the "true" model should increase the log likelihood by one half of a X2 variable, approximate- ly. This property allows all sorts of tests of signifi- cance, such as whether two parameters are significant- ly different, or whether a given parameter is different from zero.

An overview of the sociolinguistic study. Among the research efforts which have been made possible by the construction of the corpus and its accessibility in various forms, one of our first studies was an exploration into the presence or absence of IVI in pronouns and articles such as il, elle, la, les [24, 26], which were shown to be strongly influenced by social factors. Another socially-influenced variable is the aspiration of lit to

thi in such words as jamais and

voyage [32]. Here there was a clear differentiation between the sexes as well as a geographical effect, aspiration being characteristic of men raised in certain areas of the city. For a small but diverse subsample of speakers, we have undertaken a detailed study for coarticulatory effects on vowels [34, 35, 36]. This required statistical techniques not mentioned above and spectrographic analysis, for which we employed a digitalizing data recorder to collect rapidly and store data on formant frequency. Another investigation analyzed the factors affecting the simplification of various consonant clusters [8], including social factors, syntactic and phonological conditioning,

Page 7: Computational Linguistics and Statistics in the Analysis

190 DAVID SANKOFF, REJEAN LESSARD and NGUYEN BA TRUONG

intonation, etc. Extensive phonological research on the corpus has also been carried out by Santerre and colleagues, especially with regard to diphthongization [31] and vowel fusion. For all this phonological research, the original tape-recordings are essential, though the transcription can serve as a data-collection form, as well as to search for occurrences of specific contexts.

Much more work has been done on the syntactic level. The absence of que from the complement of such verbs as penser and savoir has been analyzed and reanalyzed [28, 2, 16, 18, 23], as well as the presence of que after adverbial conjunctions like quand, comment, pourquoi [22]; the rare use of ne in negative constructions [30]; and the alternation of &tre and avoir as auxiliary verbs in compound tenses in a very extensive and detailed study [29], of ce que or qu'est-ce-que in indirect questions and headless relatives [27], and the alternation between the conjunctions alors, donc, and fait que [4]. Post- relativization transformations have been analyzed by Lefebvre with close reference to the corpus [12]. Coordination and coordinators within subordinate structures were the topic of another study, along with coordination and conjunction-reduction in general and the alternation of future conjugations versus aller + infinitive. Tense relations between clauses in si (if) constructions have been analyzed [14]. As in the phonological studies, this syntactic work has all focused on the interaction of social and linguistic phenomena.

The same preoccupation can be found in our lexical and semantic studies. Laberge's massive study of variability in the pronominal system [9] unravels the stylistic and thematic influences, from syntactic constraints, from historical trends and from socio- logical correlates of linguistic usage, on the choice of pronoun for a given referent, all from a quantitative viewpoint. Three studies of the semantic fields of a number of closely related words have shown vari- ability according to social characteristics of speakers [21]. Current work involves the investigation of the lexical realization of semantic "intensification," which reveals strong differences according to social class and sex of speaker.

Other, unpublished studies based on the corpus have been carried out in all these areas as well as in discourse analysis and related topics.

REFERENCES

[1] BRATLEY, P., S. LUSIGNAN and F. OUELLETTE. 1974. JEUDMO: a text-handling system. In Computers

in the Humanities, ed. by J. L. Mitchell, 234-249. Edinburgh University Press.

[2] CEDERGREN, H. J., and D. SANKOFF. 1974. Variable rules: performance as a statistical reflection of competence. Language 50, 333-355.

[3] COX, D. R. 1970. The Analysis of Binary Data. London: Methuen.

[4] DESSUREAULT-DOBER, D. 1973. Etude Sociolin- guistique de ca fait que; Coordonant Logique et Marquer d'Interaction. M.A. Thesis, Universite de Quebec a Montreal.

[5] GOUGENHEIM, G., R. MICHEA, P. RIVENC and A. SAUVAGEOT. 1964. L 'Elaboration du Francais Fondamental. Paris: Didier.

[6] JONES, R. H. 1975. Probability estimation using a multinomial logistic function. Journal of Statistical Computation and Simulation 3, 315-329.

[7] JUILLAND, A., D. BRODIN and C. DAVIDOVITCH. 1970. Frequency Dictionary of French Words. The Hague: Mouton.

[8] KEMP, W., P. PUPIER and M. YAEGER. 1976. Socially-based variability in consonant cluster reduc- tion rules. Paper presented at the Fifth Annual Col- loquium on New Ways of Analyzing Variation, Georgetown University, Washington, D.C.

[9] LABERGE, S. 1977. Etude de la variation des pronoms, sujets defins et indefins dans le franqais parle i' Montreal. Ph.D. dissertation, Universite de Montreal.

[10] LABOV, W. 1966. The Social Stratification of English in New York City. Washington, D.C.: Center for Applied Linguistics.

[11] LABOV, W. 1969. Contraction, deletion, and inherent variability of the English copula. Language 45, 715-762.

[12] LEFEBVRE, C. and R. FOURNIER. 1976. Les rela- tives en francais montrealais. Paper presented at the Colloque de Montr6al de Syntaxe et S6mantique.

[13] LINDSEY, J. K. 1975. Likelihood analyses and tests for binary data. Applied Statistics 24, 1-16.

[14] PIQUETTE, E. 1976. L'emploi du conditionnel dans les subordonndes hypothdtiques en "si" par les fran- cophones d'origine montrialaise-6tude sociolinguisti- que de la variation. Unpublished manuscript.

[15] ROUSSEAU, P. 1976. Advances in variable rule methodology. Paper presented at the Fifth Annual Colloquium on New Ways of Analyzing Variation, Georgetown University, Washington, D.C.

[16] SANKOFF, D., and H. J. CEDERGREN. 1976. The dimensionality of grammatical variation. Language 52, 163-178.

[17] SANKOFF, D. and R. LESSARD. 1975. Vocabulary richness: a sociolinguistic analysis. Science 190, 689-690.

[18] SANKOFF, D., and P. ROUSSEAU. 1974. A method for assessing variable rule and implicational scale analyses of linguistic variation. In Computers in the Humanities, ed. by J. L. Mitchell, 3-15, Edinburgh University Press.

[19] SANKOFF, D., and G. SANKOFF. 1973. Sample survey methods and computer assisted analysis in the study of grammatical variation. In Canadian Languages in Their Social Context, ed. by R. Darnell, 7-64. Edmonton: Linguistic Research Inc.

Page 8: Computational Linguistics and Statistics in the Analysis

COMPUTATIONAL LINGUISTICS AND STATISTICS IN THE ANALYSIS OF THE MONTREAL FRENCH CORPUS 191

[20] SANKOFF, D., G. SANKOFF, S. LABERGE and M. TOPHAM. 1976. Mithodes d'&chantillonage et utilisa- tion de l'ordinateur dans l'etude de la variation grammaticale. Cahiers de Linguistique de l' Universite de Quebec 6, 85-125.

[21] SANKOFF, D., P. THIBAULT and H. BERUBE. 1975. Semantic field variability. Technical report 581, Centre de recherches math6matiques, Montre aL

[22] SANKOFF, G. 1973. Above and beyond phonology in variable rules. In New Ways of Analyzing Variation in English, ed. by R. Shuy and C. J. Bailey, 44-61. Washington, D.C. Georgetown University Press.

[23] SANKOFF, G. 1974. A quantitative paradigm for the study of communicative competence. In Exploration in the Ethnography of Speaking, ed. by R. Bauman and J. Sherzer, 18-49. Cambridge University Press.

[24] SANKOFF, G., and H. J. CEDERGREN. 1971. Some results of a sociolinguistic study of Montreal French. In Linguistic Diversity in Canadian Society, ed. by R. Darnell, 61-87. Edmonton: Linguistic Research Inc.

[25] SANKOFF, G., and H. J. CEDERGREN. 1972. Socio- linguistic research on French in Montreal. Language in Society 1, 173-174.

[26] SANKOFF, G., and H. J. CEDERGREN. 1976. Les constraintes linguistiques et sociales de l'6lision du L chez les montrialais. In Actes du XIIIe Congreds International de Linguistique et de Philologie Romanes, ed. by M. Boudreault and F. W. Moehren. Quebec: Presses de l'Universite Laval.

[27] SANKOFF, G., W. KEMP, and H. J. CEDERGREN. 1975. The syntax of ce que/qu'est-ce-que variation and its social correlates. Paper presented at the Fourth Annual Colloquium on New Ways of Analyzing Variation, Georgetown University, Washington, D.C.

[28] SANKOFF, G., R. SARRASIN, and H. J. CEDER-

GREN. 1971. Quelques considdrations sur la distribu- tion de la variable que dans le franpais de Montreal. Paper presented at the 39e congres annuel de l'Associ- ation canadienne-franqaise pour l'avancement des sciences. Sherbrooke, Qudbec.

[29] SANKOFF, G. and P. THIBAULT. 1977. L'alternance entre les auxiliaires avoir et etre dans le francais parl6 a Montreal. Langue Frangaise, 34, 81-108.

[30] SANKOFF, G., and D. VINCENT. 1977. L'emploi productif dune dans le franqais parl6 a Montr6al. Le Frangais Moderne, 45, 243-256.

[31] SANTERRE, L., and D. VILLA. 1976. Importance sociolinguistique de la diphtongaison en franqais mon- tr6alais. Paper presented at the 44e congres annuel de l'Association canadienne-francaise pour l'avancement des sciences. Sherbrooke, Quebec.

[32] VINCENT, D., and D. SANKOFF. 1975. The geo- graphical dimension of phonological variation in an urban speech community. Technical report 574, Centre de recherches mathematiques, Montr6al.

[33] WOLFRAM, W. 1969. A Sociolinguistic Description of Detroit Negro Speech. Washington, D.C.: Center for Applied Linguistics.

[34] YAEGER, M. 1976. Vowel harmony in Montreal French. Paper presented to Acoustic Society of America, Washington, D.C.

[35] YAEGER, M., D. SANKOFF, and H. J. CEDERGREN. 1977. Harmonie et conditionnement consonnantique dans le systeme vocalique du franpais parle a Montr6al. In Phonologie et Societe, ed. by H. Walter. Paris: Didier.

[36] YAEGER, M., D. SANKOFF, and H. J. CEDERGREN. 1975. Statistical analysis of vowel harmony. Technical report 580, Centre de recherches mathematiques, Montre'al.