Corpora in language variation studies Corpus Linguistics Richard Xiao

Slide 1

Corpora in language variation studies
Corpus Linguistics
Richard Xiao

Slide 2
Aims of this session
  • Lecture
      • Biber's (1988) MF/MD approach
      • Xiao's (2009) enhanced MDA model
      • Case study of world Englishes
  • Lab session
      • Using Xaira to explore the distribution of passives across genres in FLOB

Slide 3
Corpora vs. register and genre analysis
  • Register and genre are two terms that are often used interchangeably
  • The corpus-based approach is well suited to the study of register variation and genre analysis
      • A corpus is created using external criteria, which define different registers and genres
      • Corpora, especially balanced sample corpora, typically cover a wide range of registers or genres
  • Biber's (1988) MF/MD analytical framework is the most powerful tool for approaching register variation and genre analysis

Slide 4
Biber's MF/MD approach
  • Established in Biber (1988): Variation across Speech and Writing (CUP)
  • Factor analysis of 67 functionally related linguistic features
  • 481 text samples, amounting to 960,000 running words
      • LOB
      • London-Lund corpus
      • Brown corpus
      • A collection of professional and personal letters

Slide 5
Factor analysis
  • The key to the multidimensional analysis approach
  • A common data reduction method available in many standard statistics packages
      • e.g. SPSS: Analyze > Data reduction > Factor analysis
  • Reduces a large number of variables to a manageable set of underlying factors (dimensions)
      • e.g. questions + 1st/2nd person pronouns vs. passives + nominalization
  • Extensively used in the social sciences to identify clusters of inter-related variables

Slide 6
Methodological overview
  1. Collect texts with register information
  2. Collect a set of potential (functionally related) linguistic features to analyze (usually based on a literature review)
  3. Automatically tag texts with linguistic features, post-editing where necessary
  4. Compute the frequency of co-occurrence patterns of linguistic features using factor analysis
      • Functional interpretation of co-occurrence patterns (i.e. dimensions of variation) through analysis of co-occurring features
  5. Sum the factor scores of features on each dimension
      • Mean dimension scores for each register are used to analyze similarities and differences
  • Two ways of doing MDA in genre analysis
      • Following Biber's model and factor scores
      • Establishing your own MDA model

Slide 7
How does factor analysis work?
  • Build a correlation matrix of all variables (i.e. linguistic features)
  • From this, determine the loading (or weight) of each linguistic feature
      • Loading tells us to what degree we can generalize from this factor to the linguistic feature
      • Positive loading = positive correlation (likewise for negative)
      • A higher absolute loading = the feature is more representative of a factor/dimension or register/genre
  • Biber discarded features whose absolute loading was below the cut-off point of 0.35
  • Each feature is kept only on the factor on which it has the highest loading (even if it loads above 0.35 on two or more factors): one feature, one factor/dimension

Slide 8
Biber's MF/MD approach
  • Biber's seven factors / dimensions
      1) Informational vs. involved production
      2) Narrative vs. non-narrative concerns
      3) Explicit vs. situation-dependent reference
      4) Overt expression of persuasion
      5) Abstract vs. non-abstract information
      6) Online informational elaboration
      7) Academic hedging

Slide 9
Biber's MF/MD approach
  • Factors 1, 3 and 5 are associated with oral and literate differences in English
  • The spoken vs. written distinction is too broad
      • Spoken and written registers can be similar in some dimensions but differ in others
  • "Each dimension is associated with a different set of underlying communicative functions, and each defines a different set of similarities and differences among genres. Consideration of all dimensions is required for an adequate description of the relations among spoken and written texts." (Biber 1988: 169)

Slide 10
Biber's MF/MD approach
  • The MDA approach is primarily motivated by two assumptions (Biber 1995)
      • Generalizations about register variation in a language must be based on analysis of the full range of spoken and written registers
      • No single linguistic parameter is adequate in itself to capture the range of similarities and differences among spoken and written registers

Slide 11
Biber's MF/MD approach
  • Biber's MF/MD approach has been well received as it establishes a link between form and function
  • Influential and widely used
      • Synchronic analysis of specific registers / genres and author styles
      • Diachronic studies describing the evolution of registers
      • Register studies of non-Western languages and contrastive analyses
      • Research on university English and materials development
      • Move analysis and study of discourse structure
  • Biber's initial MDA model is largely confined to lexical and grammatical categories

Slide 12
The enhanced MDA model
  • Xiao (2009) seeks to enhance Biber's MDA by incorporating semantic components alongside grammatical categories
      • Wmatrix = CLAWS + USAS
      • A total of 141 linguistic features investigated
      • 109 features retained in the final model
  • Five million words in 2,500 text samples, with one million words in 500 samples for each of the 5 varieties of English
      • ICE: GB, HK, India, Singapore, the Philippines
      • 300 spoken + 200 written samples
      • 12 registers ranging from private conversation to academic writing
  [Xiao, R. (2009) Multidimensional analysis and the study of world Englishes. World Englishes 28(4): 421-450.]
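
The loading cut-off, the "one feature, one factor" rule and the summing of scores per dimension (Slides 6 and 7) can be illustrated with a minimal Python sketch. The loadings, feature names and frequencies below are invented for illustration only and are not taken from Biber (1988) or Xiao (2009). The sketch also assumes that a dimension score is the sum of standardised (z-score) frequencies of positively loading features minus those of negatively loading features, which is one common reading of the procedure.

```python
from statistics import mean, pstdev

CUTOFF = 0.35  # cut-off for absolute loadings mentioned on Slide 7

# Hypothetical loadings: feature -> [loading on factor 0, loading on factor 1]
loadings = {
    "first_person_pronouns": [0.74, 0.10],
    "passives": [-0.60, 0.38],  # above the cut-off on both factors
    "past_tense": [0.20, 0.81],
    "adverbs": [0.30, 0.12],    # below the cut-off everywhere
}

# Hypothetical normalised frequencies (per 1,000 words) for three texts
freqs = {
    "conversation": {"first_person_pronouns": 60, "passives": 2, "past_tense": 30, "adverbs": 50},
    "academic": {"first_person_pronouns": 5, "passives": 20, "past_tense": 15, "adverbs": 30},
    "fiction": {"first_person_pronouns": 30, "passives": 6, "past_tense": 80, "adverbs": 45},
}

def assign_features(loadings, cutoff=CUTOFF):
    """Keep each feature only on the factor with its largest absolute loading."""
    assigned = {}
    for feature, row in loadings.items():
        best = max(range(len(row)), key=lambda j: abs(row[j]))
        if abs(row[best]) >= cutoff:  # discard features below the cut-off
            assigned[feature] = (best, 1 if row[best] > 0 else -1)
    return assigned

def standardise(freqs):
    """Convert each feature's frequencies to z-scores across all texts."""
    features = list(next(iter(freqs.values())))
    z = {text: {} for text in freqs}
    for f in features:
        values = [freqs[t][f] for t in freqs]
        mu, sigma = mean(values), pstdev(values)
        for t in freqs:
            z[t][f] = (freqs[t][f] - mu) / sigma if sigma else 0.0
    return z

def dimension_scores(freqs, assigned, n_factors):
    """Sum signed z-scores of the retained features on each dimension."""
    z = standardise(freqs)
    scores = {}
    for text, row in z.items():
        totals = [0.0] * n_factors
        for feature, (factor, sign) in assigned.items():
            totals[factor] += sign * row[feature]
        scores[text] = totals
    return scores

assigned = assign_features(loadings)
scores = dimension_scores(freqs, assigned, n_factors=2)
for text, dims in scores.items():
    print(text, [round(d, 2) for d in dims])
```

In this toy data "passives" loads above 0.35 on both factors but is kept only on factor 0, where its absolute loading is larger, and "adverbs" is dropped altogether. Averaging such scores over all texts in a register would give the mean dimension scores used to compare registers.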

Slide 13
ICE registers and proportions
  S1A  (20%)  Spoken   Private
  S1B  (16%)  Spoken   Public
  S2A  (14%)  Spoken   Monologue    Unscripted
  S2B  (10%)  Spoken   Monologue    Scripted
  W1A   (4%)  Written  Non-printed  Non-professional writing
  W1B   (6%)  Written  Non-printed  Correspondence
  W2A   (8%)  Written  Printed      Academic writing
  W2B   (8%)  Written  Printed      Non-academic writing
  W2C   (4%)  Written  Printed      Reportage
  W2D   (4%)  Written  Printed      Instructional writing
  W2E   (2%)  Written  Printed      Persuasive writing
  W2F   (4%)  Written  Printed      Creative writing

Slide 14
141 linguistic features covered
  A) Nouns: 21 categories, e.g. nominalisation, other nouns; 19 semantic classes of nouns (e.g. evaluations, speech acts)
  B) Verbs: 28 categories, e.g. do as pro-verb, be as main verb, tense and aspect markers, modals, passives, 16 semantic categories of verbs
  C) Pronouns: 10 categories, e.g. person, case, demonstrative
  D) Adjectives: 11 categories, e.g. attributive vs. predicative use, 9 semantic categories

Slide 15
141 linguistic features covered
  E) Adverbs (7 categories)
  F) Prepositions (2 categories)
  G) Subordination (3 categories)
  H) Coordination (2 categories)
  I) WH-questions / clauses (2 categories)
  J) Nominal post-modifying clauses (5 categories)
  K) THAT-complement clauses (3 categories)
  L) Infinitive clauses (3 categories)
  M) Participle clauses (2 categories)
  N) Reduced forms and dispreferred structures (4 categories)
  O) Lexical and structural complexity (3 categories)

Slide 16
141 linguistic features covered
  P) Quantifiers (4 categories)
  Q) Time expressions (11 categories)
  R) Degree expressions (8 categories)
  S) Negation (2 categories)
  T) Power relationship (4 categories)
  U) Definiteness (2 categories)
  V) Helping/hindrance (2 categories)
  X) Linear order (1 category)
  Y) Seem / Appear (1 category)
  Z) Discourse bin (1 category)

Slide 17
Procedure of data analysis
  1) Data clean-up
  2) Grammatical and semantic tagging with Wmatrix
  3) Extracting the frequencies of 141 linguistic features from 2,500 corpus files
  4) Building a profile of normalised frequencies (per 1,000 words) for each linguistic feature
  5) Factor analysis
      • Factor extraction (Principal Factor Analysis)
      • Factor rotation (Promax)
      • Optimum structure: 9 factors
  6) Interpreting extracted factors in functional terms
  7) Computing factor scores of various dimensions/factors
  8) Using the enhanced MDA model in exploration of variation across registers and language varieties

Slide 18
The enhanced MDA model
  • Nine factors established in the new model
      1) Interactive casual discourse vs. informative elaborate discourse
      2) Elaborative online evaluation
      3) Narrative concern
      4) Human vs. object description
      5) Future projection
      6) Subjective impression and judgement
      7) Lack of temporal / locative focus
      8) Concern with degree and quantity
      9) Concern with reported speech
  • Robustness of the model in register analysis

Slide 19
1) Interactive casual discourse vs. informative elaborate discourse
  • Private conversation is most interactive and casual
  • Academic writing is most informative and elaborate
  • Spoken registers are generally more interactive and less elaborate than written registers
  • ANOVA: F=775.86 p
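
Two computations mentioned in the procedure, normalising raw counts to frequencies per 1,000 words (step 4) and comparing dimension scores across registers with a one-way ANOVA (Slide 19), can be sketched in plain Python as follows. The counts and scores are invented toy values, not data from the ICE corpora, and the helper names are hypothetical. In practice a statistics package would normally be used for the ANOVA and its p-value.

```python
from statistics import mean

def per_thousand(count, n_words):
    """Normalise a raw feature count to a frequency per 1,000 words."""
    return count * 1000 / n_words

def one_way_anova_f(groups):
    """Return the F statistic for a one-way ANOVA over lists of scores."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical example: 24 passives in a 2,000-word text sample
passive_rate = per_thousand(24, 2000)

# Hypothetical dimension scores for three registers (one list per register)
register_scores = {
    "academic writing": [1.0, 2.0, 3.0],
    "reportage": [2.0, 3.0, 4.0],
    "private conversation": [5.0, 6.0, 7.0],
}
f_stat = one_way_anova_f(list(register_scores.values()))

print(passive_rate)
print(f_stat)
```

A large F value indicates that the register means differ by much more than the scores vary within each register, which is the kind of evidence Slide 19 reports for the first dimension.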