Corpora in language variation studies Corpus Linguistics Richard Xiao [email protected]


Page 1: Corpora in language variation studies

Corpora in language variation studies

Corpus Linguistics

Richard Xiao

[email protected]

Page 2: Corpora in language variation studies

Aims of this session

• Lecture
  – Biber’s (1988) MF/MD approach
  – Xiao’s (2008) enhanced MDA model
  – Case study of world Englishes
• Lab session
  – Using Xaira to explore distribution of passives across genres in FLOB

Page 3: Corpora in language variation studies

Corpora vs. register and genre analysis

• “Register” and “genre” are two terms that are often used interchangeably
• The corpus-based approach is well suited for the study of register variation and genre analysis
  – A corpus is created using external criteria, which define different registers and genres
  – Corpora, especially balanced sample corpora, typically cover a wide range of registers or genres
• Biber’s (1988) MF/MD analytical framework is the most powerful tool for approaching register variation and genre analysis

Page 4: Corpora in language variation studies

21/04/23 CRG, Lancaster University 4

Biber’s MF/MD approach

• Established in Biber (1988): Variation across Speech and Writing (CUP)
  – Factor analysis of 67 functionally related linguistic features
  – 481 text samples, amounting to 960,000 running words
    • LOB
    • London-Lund
    • Brown corpus
    • A collection of professional and personal letters

Page 5: Corpora in language variation studies

Factor analysis

• The key to the multidimensional analysis approach
• A common data reduction method available in many standard statistics packages such as SPSS
• Reducing a large number of variables to a manageable set of underlying factors (“dimensions”)
• Extensively used in social sciences to identify clusters of inter-related variables

Page 6: Corpora in language variation studies

Methodological overview

• 1. Collect texts with register information
• 2. Collect a set of potential linguistic features to analyze (based on literature review)
• 3. Automatically tag texts with features, post-editing where necessary
• 4. Identify co-occurrence patterns among the linguistic features, based on their frequencies, using factor analysis
  – Functional interpretation of co-occurrence patterns (dimensions of variation)
• 5. Sum the standardized frequencies of the features on each dimension to obtain a dimension score for each text
  – Mean dimension scores for each register used to analyze similarities and differences
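Step 5 can be illustrated with a minimal sketch. It assumes the frequencies are standardized to z-scores, that features with negative loadings are subtracted, and that the population standard deviation is used; the feature names, frequencies and register labels are hypothetical toy data, not taken from any of the corpora above.

```python
from statistics import mean, pstdev

def standardize(values):
    """Convert normalised frequencies to z-scores (mean 0, s.d. 1)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def dimension_scores(freqs, positive, negative):
    """Per text: sum z-scores of positive-loading features, subtract negative-loading ones."""
    z = {feat: standardize(vals) for feat, vals in freqs.items()}
    n_texts = len(next(iter(freqs.values())))
    return [
        sum(z[f][i] for f in positive) - sum(z[f][i] for f in negative)
        for i in range(n_texts)
    ]

def mean_by_register(scores, registers):
    """Average the per-text dimension scores within each register."""
    grouped = {}
    for score, reg in zip(scores, registers):
        grouped.setdefault(reg, []).append(score)
    return {reg: mean(vals) for reg, vals in grouped.items()}

# Hypothetical frequencies per 1,000 words for four texts
freqs = {
    "private_verbs": [20.0, 18.0, 5.0, 4.0],
    "contractions": [30.0, 25.0, 2.0, 1.0],
    "nouns": [150.0, 160.0, 300.0, 310.0],
}
registers = ["conversation", "conversation", "academic", "academic"]

scores = dimension_scores(
    freqs, positive=["private_verbs", "contractions"], negative=["nouns"]
)
register_means = mean_by_register(scores, registers)
print(register_means)
```

In this toy data the conversation texts end up with a positive mean score and the academic texts with a negative one, which is the kind of contrast the mean dimension scores are meant to expose.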

Page 7: Corpora in language variation studies

How does factor analysis work?

• Build a correlation matrix of all features
• From this, determine the loading, or weight, of each linguistic feature
  – Loading tells us to what degree we can generalize from this factor to the linguistic feature
  – Positive loading = positive correlation (likewise for negative)
  – Higher absolute value = the more representative the feature is of a factor/dimension
• Biber removed features with absolute loadings under the cut-off point 0.35
  – Features are only kept on the factor they had the highest loading for (even if they occur on 2+ factors with loadings above 0.35)
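The two ends of this procedure can be sketched in a few lines: building the correlation matrix, and applying the 0.35 cut-off to a table of loadings. The factor extraction itself is left to a statistics package, so the loadings below are hypothetical values supplied by hand, and the feature names and frequencies are toy data.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length frequency lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_matrix(freqs):
    """Correlate every feature with every other feature."""
    names = list(freqs)
    return {a: {b: pearson(freqs[a], freqs[b]) for b in names} for a in names}

def assign_salient(loadings, cutoff=0.35):
    """Keep each feature only on the factor with its largest |loading|, if >= cutoff."""
    assigned = {}
    for feat, row in loadings.items():
        best = max(range(len(row)), key=lambda k: abs(row[k]))
        if abs(row[best]) >= cutoff:
            assigned[feat] = (best + 1, row[best])  # (factor number, loading)
    return assigned

# Hypothetical frequencies per 1,000 words for four texts
freqs = {
    "contractions": [30.0, 25.0, 2.0, 1.0],
    "nouns": [150.0, 160.0, 300.0, 310.0],
}
corr = correlation_matrix(freqs)

# Hypothetical loadings on two factors (not real results)
loadings = {
    "contractions": [0.90, 0.10],
    "nouns": [-0.80, 0.40],
    "past_tense": [0.20, 0.85],
    "adverbs": [0.30, -0.25],
}
salient = assign_salient(loadings)
print(salient)
```

Here "nouns" loads above the cut-off on both factors but is kept only on factor 1, where its absolute loading is highest, and "adverbs" is dropped because no loading reaches 0.35.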

Page 8: Corpora in language variation studies

Biber’s MF/MD approach

• Biber’s seven factors / dimensions
  – 1) Informational vs. involved production
  – 2) Narrative vs. non-narrative concerns
  – 3) Explicit vs. situation-dependent reference
  – 4) Overt expression of persuasion
  – 5) Abstract vs. non-abstract information
  – 6) Online informational elaboration
  – 7) Academic hedging

Page 9: Corpora in language variation studies

Biber’s MF/MD approach

• Factors 1, 3 and 5 are associated with “oral” and “literate” differences in English

• Spoken and written registers can be similar in some dimensions but differ in others

• “Each dimension is associated with a different set of underlying communicative functions, and each defines a different set of similarities and differences among genres. Consideration of all dimensions is required for an adequate description of the relations among spoken and written texts.” (Biber 1988: 169)

Page 10: Corpora in language variation studies

Motivations of the MF/MD approach

• The primary motivations for the multi-dimensional approach are two assumptions (Biber 1995)
  – Generalizations concerning register variation in a language must be based on analysis of the full range of spoken and written registers
  – No single linguistic parameter is adequate in itself to capture the range of similarities and differences among spoken and written registers

Page 11: Corpora in language variation studies

Biber’s MF/MD approach

• Biber’s MF/MD approach has been well received as it establishes a link between form and function
• Influential and widely used
  – Synchronic analysis of specific registers / genres and author styles
  – Diachronic studies describing the evolution of registers
  – Register studies of non-Western languages and contrastive analyses
  – Research of University English and materials development
  – Move analysis and study of discourse structure
• …largely confined to grammatical categories

Page 12: Corpora in language variation studies

The enhanced MDA model

• Xiao (2008) seeks to enhance Biber’s MDA by incorporating semantic components with grammatical categories
  – Wmatrix = CLAWS + USAS
  – A total of 141 linguistic features investigated
    • 109 features retained in the final model
  – Five million words in 2,500 text samples, with one million words in 500 samples for each of the 5 varieties of English
    • ICE – GB, HK, India, Singapore, the Philippines
    • 300 spoken + 200 written samples
    • 12 registers ranging from private conversation to academic writing

Page 13: Corpora in language variation studies

ICE registers and proportions

S1A (20%)  Spoken – Private
S1B (16%)  Spoken – Public
S2A (14%)  Spoken – Monologue – Unscripted
S2B (10%)  Spoken – Monologue – Scripted
W1A (4%)   Written – Non-printed – Non-professional writing
W1B (6%)   Written – Non-printed – Correspondence
W2A (8%)   Written – Printed – Academic writing
W2B (8%)   Written – Printed – Non-academic writing
W2C (4%)   Written – Printed – Reportage
W2D (4%)   Written – Printed – Instructional writing
W2E (2%)   Written – Printed – Persuasive writing
W2F (4%)   Written – Printed – Creative writing
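As a quick arithmetic check, the proportions above can be converted into sample counts, assuming (as stated for the enhanced MDA model) 500 samples per variety. The variable names are illustrative only.

```python
# Register proportions (%) as listed on the slide
ICE_PROPORTIONS = {
    "S1A": 20, "S1B": 16, "S2A": 14, "S2B": 10,
    "W1A": 4, "W1B": 6, "W2A": 8, "W2B": 8,
    "W2C": 4, "W2D": 4, "W2E": 2, "W2F": 4,
}
SAMPLES_PER_VARIETY = 500  # assumption: 500 text samples per ICE component

# Number of samples per register in one variety
samples = {
    code: pct * SAMPLES_PER_VARIETY // 100
    for code, pct in ICE_PROPORTIONS.items()
}
spoken = sum(n for code, n in samples.items() if code.startswith("S"))
written = sum(n for code, n in samples.items() if code.startswith("W"))
print(samples, spoken, written)
```

The totals come out at 300 spoken and 200 written samples, matching the split given for each variety.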

Page 14: Corpora in language variation studies

141 linguistic features covered

• A) Nouns: 21 categories, e.g.
  – nominalisation, other nouns; 19 semantic classes of nouns (e.g. evaluations, speech acts)
• B) Verbs: 28 categories, e.g.
  – Do as pro-verb, be as main verb, tense and aspect markers, modals, passives, 16 semantic categories of verbs
• C) Pronouns: 10 categories, e.g.
  – Person, case, demonstrative
• D) Adjectives: 11 categories, e.g.
  – Attributive vs. predicative use, 9 semantic categories

Page 15: Corpora in language variation studies

141 linguistic features covered

• E) Adverbs (7 categories)
• F) Prepositions (2 categories)
• G) Subordination (3 categories)
• H) Coordination (2 categories)
• I) WH-questions / clauses (2 categories)
• J) Nominal post-modifying clauses (5 categories)
• K) THAT-complement clauses (3 categories)
• L) Infinitive clauses (3 categories)
• M) Participle clauses (2 categories)
• N) Reduced forms and dispreferred structures (4 categories)
• O) Lexical and structural complexity (3 categories)

Page 16: Corpora in language variation studies

141 linguistic features covered

• P) Quantifiers (4 categories)
• Q) Time expressions (11 categories)
• R) Degree expressions (8 categories)
• S) Negation (2 categories)
• T) Power relationship (4 categories)
• U) Definiteness (2 categories)
• V) Helping/hindrance (2 categories)
• X) Linear order (1 category)
• Y) Seem / Appear (1 category)
• Z) Discourse bin (1 category)

Page 17: Corpora in language variation studies

Procedure of data analysis

• 1) Data clean-up
• 2) Grammatical and semantic tagging with Wmatrix
• 3) Extracting the frequencies of 141 linguistic features from 2,500 corpus files
• 4) Building a profile of normalised frequencies (per 1,000 words) for each linguistic feature
• 5) Factor analysis
  – Factor extraction (Principal Factor Analysis)
  – Factor rotation (Promax)
  – Optimum structure: 9 factors
• 6) Interpreting extracted factors
• 7) Computing factor scores
• 8) Using the enhanced MDA model in exploration of variation across registers and language varieties
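Step 4 (normalised frequencies per 1,000 words) is simple to sketch. The function names, feature names and counts below are hypothetical; only the per-1,000-words basis comes from the procedure above.

```python
def per_thousand(count, total_words):
    """Normalise a raw feature count to a frequency per 1,000 words."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return count * 1000 / total_words

def build_profile(raw_counts, total_words):
    """Normalised frequency profile for one corpus file."""
    return {feat: per_thousand(n, total_words) for feat, n in raw_counts.items()}

# Hypothetical raw counts from a 2,000-word text sample
raw_counts = {"passives": 30, "modals": 24, "nominalisations": 50}
profile = build_profile(raw_counts, total_words=2000)
print(profile)  # {'passives': 15.0, 'modals': 12.0, 'nominalisations': 25.0}
```

Normalising makes samples of different lengths comparable before the frequencies are passed to the factor analysis.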

Page 18: Corpora in language variation studies

The enhanced MDA model

• Nine factors established in the new model
  – 1) Interactive casual discourse vs. informative elaborate discourse
  – 2) Elaborative online evaluation
  – 3) Narrative concern
  – 4) Human vs. object description
  – 5) Future projection
  – 6) Subjective impression and judgement
  – 7) Lack of temporal / locative focus
  – 8) Concern with degree and quantity
  – 9) Concern with reported speech
• Robustness of the model in register analysis

Page 19: Corpora in language variation studies

1) Interactive casual discourse vs. informative elaborate discourse

• Private conversation is most interactive and casual
• Academic writing is most informative and elaborate
• Spoken registers are generally more interactive and less elaborate than written registers

[Figure: chart of mean Factor 1 scores by register (axis from -60 to 60). Registers in the order plotted: S-Private, S-Public, W-Printed-Creative writing, S-Mono-Unscripted, W-Nonprinted-Correspondence, W-Printed-Non-academic writing, W-Nonprinted-Non-prof writing, S-Mono-Scripted, W-Printed-Persuasive writing, W-Printed-Instructional writing, W-Printed-Reportage, W-Printed-Academic writing]

F = 775.86, p < 0.0001, R² = 77.4%
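The F and R² values reported for each factor are the kind of output a one-way ANOVA across registers produces. The sketch below is an assumption about how such figures can be computed, not a record of the original analysis: the group scores are hypothetical, and the degrees of freedom used for the consistency check (11 and 2,488) are inferred from 12 registers and 2,500 text samples.

```python
def one_way_anova(groups):
    """Return (F, R2) for a dict mapping register -> list of factor scores."""
    all_scores = [s for g in groups.values() for s in g]
    grand_mean = sum(all_scores) / len(all_scores)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    ss_total = sum((s - grand_mean) ** 2 for s in all_scores)
    ss_within = ss_total - ss_between
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    r_squared = ss_between / ss_total  # share of variance explained by register
    return f_value, r_squared

def r2_from_f(f_value, df_between, df_within):
    """Recover R2 from F and the degrees of freedom."""
    return f_value * df_between / (f_value * df_between + df_within)

# Hypothetical factor scores for three registers
groups = {
    "S1A": [40.0, 38.0, 42.0],
    "W2A": [-30.0, -28.0, -32.0],
    "W2F": [5.0, 7.0, 3.0],
}
f_value, r_squared = one_way_anova(groups)

# Consistency check on the reported Factor 1 figures (assumed d.f.: 11 and 2488)
reported_r2 = r2_from_f(775.86, 11, 2488)
print(round(f_value, 2), round(r_squared, 3), round(reported_r2, 3))
```

Under these assumed degrees of freedom, F = 775.86 corresponds to an R² of about 0.774, which agrees with the 77.4% reported for Factor 1.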

Page 20: Corpora in language variation studies

2) Elaborative online evaluation

• Public dialogue (e.g. broadcast discussion / interview, parliamentary debate) has the most prominent focus on elaborative online evaluation
• Unscripted monologue also involves a high level of elaborative online evaluation
• Persuasive writing may relate to elaborative evaluation but is not restricted by real-time production
• Private conversation is least elaborative even if the evaluation is made online
• Evaluation is not a concern in creative writing

[Figure: chart of mean Factor 2 scores by register (axis from -6 to 6). Registers in the order plotted: S-Public, S-Mono-Unscripted, W-Printed-Persuasive writing, W-Nonprinted-Non-prof writing, S-Mono-Scripted, W-Printed-Academic writing, W-Printed-Non-academic writing, W-Printed-Reportage, W-Printed-Instructional writing, W-Nonprinted-Correspondence, S-Private, W-Printed-Creative writing]

F = 102.20, p < 0.0001, R² = 31.1%

Page 21: Corpora in language variation studies

3) Narrative concern

• Unscripted monologue (e.g. demonstrations, presentations, commentaries) has a narrative concern
• Unsurprisingly, creative writing is also narrative
• Not a concern in academic writing, non-professional writing (student essays and exam scripts), and instructional writing

[Figure: chart of mean Factor 3 scores by register (axis from -8 to 6). Registers in the order plotted: S-Mono-Unscripted, W-Printed-Creative writing, S-Private, S-Public, S-Mono-Scripted, W-Nonprinted-Correspondence, W-Printed-Reportage, W-Printed-Persuasive writing, W-Printed-Non-academic writing, W-Printed-Instructional writing, W-Nonprinted-Non-prof writing, W-Printed-Academic writing]

F = 134.50, p < 0.0001, R² = 37.3%

Page 22: Corpora in language variation studies

4) Human vs. object description

• Private conversation is most likely to have a focus on people
• Correspondence (social letters and business letters) also involves human description
• Instructional writing tends to give concrete descriptions of objects
• Academic and non-academic writings can also be concrete when an object or substance is described

[Figure: chart of mean Factor 4 scores by register (axis from -4 to 3). Registers in the order plotted: S-Private, W-Nonprinted-Correspondence, S-Mono-Scripted, S-Public, W-Printed-Persuasive writing, W-Printed-Reportage, W-Nonprinted-Non-prof writing, S-Mono-Unscripted, W-Printed-Creative writing, W-Printed-Non-academic writing, W-Printed-Academic writing, W-Printed-Instructional writing]

F = 44.03, p < 0.0001, R² = 16.3%

Page 23: Corpora in language variation studies

5) Future projection

• Persuasive writing (e.g. press editorials, trying to influence people’s future attitudes and actions) has the most prominent focus on future projection
• Correspondence and public dialogue also involve future projection to varying extents
• Academic writing (timeless truth?) is least concerned with future projection

[Figure: chart of mean Factor 5 scores by register (axis from -6 to 6). Registers in the order plotted: W-Printed-Persuasive writing, W-Nonprinted-Correspondence, S-Public, S-Mono-Scripted, W-Printed-Instructional writing, S-Mono-Unscripted, S-Private, W-Printed-Reportage, W-Printed-Creative writing, W-Printed-Non-academic writing, W-Nonprinted-Non-prof writing, W-Printed-Academic writing]

F = 28.10, p < 0.0001, R² = 11.1%

Page 24: Corpora in language variation studies

6) Subjective impression / judgement

• Factor score of creative writing is by far greater than that of any other register
  – Frequent use of possessive and reflexive pronouns, as well as adjectives of judgement / appearance
• Instructional writing, private conversation, and student essays display low scores
  – They do not have a focus on personal impression and judgement
• Scripted and unscripted monologue, public dialogue and news reportage also tend to avoid expressions of subjective impression and judgement

[Figure: chart of mean Factor 6 scores by register (axis from -4 to 10). Registers in the order plotted: W-Printed-Creative writing, W-Printed-Non-academic writing, W-Printed-Persuasive writing, W-Nonprinted-Correspondence, W-Nonprinted-Non-prof writing, S-Private, W-Printed-Instructional writing, W-Printed-Academic writing, S-Mono-Unscripted, W-Printed-Reportage, S-Public, S-Mono-Scripted]

F = 126.22, p < 0.0001, R² = 35.8%

Page 25: Corpora in language variation studies

7) Lack of temporal / locative focus

• Student essays and persuasive writing do not have a temporal / locative focus (not concerned with concepts such as when, how long, and where)
• Such specific information is of vital importance in correspondence (social and business letters)

[Figure: chart of mean Factor 7 scores by register (axis from -8 to 4). Registers in the order plotted: W-Nonprinted-Non-prof writing, W-Printed-Persuasive writing, W-Printed-Academic writing, W-Printed-Creative writing, S-Public, S-Private, W-Printed-Non-academic writing, S-Mono-Unscripted, S-Mono-Scripted, W-Printed-Reportage, W-Printed-Instructional writing, W-Nonprinted-Correspondence]

F = 89.55, p < 0.0001, R² = 28.4%

Page 26: Corpora in language variation studies

8) Concern with degree / quantity

• Non-academic popular writing has the greatest concern with degree and quantity
• Persuasive writing also displays a high propensity for expressions of degree and quantity
• Such expressions tend to be avoided in instructional writing (e.g. administrative documents) and correspondence

[Figure: chart of mean Factor 8 scores by register (axis from -2 to 3). Registers in the order plotted: W-Printed-Non-academic writing, W-Printed-Persuasive writing, S-Mono-Scripted, S-Mono-Unscripted, W-Printed-Academic writing, W-Nonprinted-Non-prof writing, S-Public, W-Printed-Reportage, S-Private, W-Printed-Creative writing, W-Nonprinted-Correspondence, W-Printed-Instructional writing]

F = 19.33, p < 0.0001, R² = 7.9%

Page 27: Corpora in language variation studies

9) Concern with reported speech

• News reportage has the greatest concern with reported speech (both direct and indirect speech)
• Reported speech is also very common in creative writing (fictional dialogue)
• Instructional writing and academic prose do not appear to have a concern with reported speech

[Figure: chart of mean Factor 9 scores by register (axis from -4 to 5). Registers in the order plotted: W-Printed-Reportage, W-Printed-Creative writing, S-Mono-Scripted, S-Public, S-Private, W-Nonprinted-Correspondence, S-Mono-Unscripted, W-Printed-Non-academic writing, W-Printed-Persuasive writing, W-Nonprinted-Non-prof writing, W-Printed-Academic writing, W-Printed-Instructional writing]

F = 80.02, p < 0.0001, R² = 26.1%

Page 28: Corpora in language variation studies

12 registers along 9 factors

• Factor 1 is the dimension along which the 12 registers demonstrate the sharpest contrasts
  – Interactive casual discourse vs. informative elaborate discourse: a fundamental aspect of variation across registers
• Robustness of the model

[Figure: factor scores (y-axis, -50 to 50) by register (x-axis: S1A, S1B, S2A, S2B, W1A, W1B, W2A, W2B, W2C, W2D, W2E, W2F), one series per factor (Factor 1 to Factor 9)]

Page 29: Corpora in language variation studies

5 English varieties across 9 factors

• Both differences and similarities
• This general picture may blur many register-based subtleties
  – Language can vary across registers even more substantially than across language varieties (cf. Biber 1995)

[Figure: factor scores (y-axis, -20 to 5) by factor (x-axis: Factor 1 to Factor 9), one series per variety (GB, HK, IN, PH, SG)]

Page 30: Corpora in language variation studies

1) Interactive casual discourse vs. informative elaborate discourse

• Indian English displays the lowest score in nearly all registers: it is less interactive but more elaborate
  – Sanyal (2007): “clumsy Victorian English [that] hangs like a dead Albatross around each educated Indian’s neck”
• Modern BrE appears to be most interactive and least elaborate (e.g. S1A, S1B, W2D)
• 3 varieties of English used in East and Southeast Asia are very similar

F = 9.04, 4 d.f., p < 0.001

[Figure: Factor 1 scores (y-axis, -50 to 60) by register (x-axis: S1A to W2F), one series per variety (GB, HK, IN, PH, SG)]

Page 31: Corpora in language variation studies

2) Elaborative online evaluation

• BrE generally shows a higher score than non-native varieties of English (e.g. W2A, W1B, S2B)
• Non-native English varieties tend to be very close in most registers

F = 14.13, 4 d.f., p < 0.001

[Figure: Factor 2 scores (y-axis, -6 to 8) by register (x-axis: S1A to W2F), one series per variety (GB, HK, IN, PH, SG)]

Page 32: Corpora in language variation studies

3) Narrative concern

• BrE demonstrates a greater propensity for narrative concern
  – Most noticeably in news reportage (W2C) and instructional writing (W2D)
• Indian English is least concerned with narrative
  – Esp. in registers like correspondence (W1B), instructional writing (W2D), and unscripted monologue (S2A)

F = 7.97, 4 d.f., p < 0.001

[Figure: Factor 3 scores (y-axis, -8 to 8) by register (x-axis: S1A to W2F), one series per variety (GB, HK, IN, PH, SG)]

Page 33: Corpora in language variation studies

4) Human vs. object description

• Very close in a number of registers (e.g. S2B, W1B, W2E)
• Indian English and BrE show similarity in a greater range of registers
• HK and Singapore Englishes display great similarity (except W1A)
• Creative writing (W2F) is very similar in non-native varieties of English

F = 5.92, 4 d.f., p < 0.001

[Figure: Factor 4 scores (y-axis, -6 to 3) by register (x-axis: S1A to W2F), one series per variety (GB, HK, IN, PH, SG)]

Page 34: Corpora in language variation studies

5) Future projection

• BrE has the highest score in all printed written registers (W2A–W2F)
• Indian English shows the lowest score in nearly all registers

F = 47.63, 4 d.f., p < 0.001

[Figure: Factor 5 scores (y-axis, -8 to 10) by register (x-axis: S1A to W2F), one series per variety (GB, HK, IN, PH, SG)]

Page 35: Corpora in language variation studies

6) Subjective impression / judgement

• Very similar in many registers…with most noticeable differences in non-printed written registers (W1A, W1B), non-academic writing (W2B), and news reportage (W2C)
• HK English displays a distribution pattern similar to Singapore English in spoken registers (S1A–S2B) and unpublished written registers (W1A, W1B), but it is very close to Philippine English in printed writing (W2A–W2F)

F = 12.25, 4 d.f., p < 0.001

[Figure: Factor 6 scores (y-axis, -4 to 10) by register (x-axis: S1A to W2F), one series per variety (GB, HK, IN, PH, SG)]

Page 36: Corpora in language variation studies

7) Lack of temporal / locative focus

• Overall difference is not significant statistically
  – …but there are noticeable differences in some registers (e.g. W1B, W2D)
• Interestingly, Indian English demonstrates a consistently higher score in spoken registers (S1A–S2B)
  – …but a lower score in unpublished writing (e.g. W1B)

F = 2.28, 4 d.f., p = 0.058

[Figure: Factor 7 scores (y-axis, -12 to 4) by register (x-axis: S1A to W2F), one series per variety (GB, HK, IN, PH, SG)]

Page 37: Corpora in language variation studies

8) Concern with degree / quantity

• BrE generally displays a higher score in nearly all registers
• HK English does not appear to be concerned with degree and quantity (e.g. W2D)
• Similarly, Indian English also lacks a focus on degree and quantity (e.g. W1B)

F = 24.32, 4 d.f., p < 0.001

[Figure: Factor 8 scores (y-axis, -6 to 5) by register (x-axis: S1A to W2F), one series per variety (GB, HK, IN, PH, SG)]

Page 38: Corpora in language variation studies

9) Concern with reported speech

• Overall difference is not significant
• …in spite of noticeable difference in news reportage (W2C)
  – East and Southeast Asian English varieties show a greater propensity for concern with reported speech than BrE and Indian English

F = 1.51, 4 d.f., p = 0.196

[Figure: Factor 9 scores (y-axis, -6 to 10) by register (x-axis: S1A to W2F), one series per variety (GB, HK, IN, PH, SG)]

Page 39: Corpora in language variation studies

Case study summary

• Summary
  – Seeking to enhance Biber’s MDA model with semantic components
  – Introducing the new model in research of World Englishes
• Lab session: Exploring distribution of passives in the FLOB corpus

Page 40: Corpora in language variation studies

Open FLOB in Xaira

Page 41: Corpora in language variation studies

Define subcorpora

Page 42: Corpora in language variation studies

Define subcorpora

Page 43: Corpora in language variation studies

Define subcorpora

Page 44: Corpora in language variation studies

Define subcorpora

Page 45: Corpora in language variation studies

Define subcorpora

Page 46: Corpora in language variation studies

Open subcorpora

Page 47: Corpora in language variation studies

Open subcorpora

Page 48: Corpora in language variation studies

Query builder

Page 49: Corpora in language variation studies

Define scope node

Page 50: Corpora in language variation studies

Define 1st search node

Select all tags starting with VB

Page 51: Corpora in language variation studies

Define 2nd search node

Select all tags starting with VVN

Page 52: Corpora in language variation studies

Define link type
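Outside Xaira, the same two-node query (a tag starting with VB linked to a following tag starting with VVN) can be approximated over a list of (word, tag) pairs. This is only a rough sketch: the link type is not spelled out on the slides, so a window of up to two intervening tokens is assumed here, and the tagged sentence is a made-up example using CLAWS-style tags.

```python
def find_passives(tagged, max_gap=2):
    """Find (be-form index, participle index) pairs: VB* followed by VVN* within max_gap tokens."""
    hits = []
    for i, (_, tag) in enumerate(tagged):
        if not tag.startswith("VB"):
            continue
        # Look ahead at the next token plus up to max_gap intervening tokens
        for j in range(i + 1, min(i + 2 + max_gap, len(tagged))):
            if tagged[j][1].startswith("VVN"):
                hits.append((i, j))
                break
    return hits

# Hypothetical tagged sentence (tags are illustrative CLAWS-style labels)
sentence = [
    ("The", "AT"), ("report", "NN1"), ("was", "VBDZ"), ("carefully", "RR"),
    ("written", "VVN"), ("by", "II"), ("students", "NN2"), (".", "."),
]
hits = find_passives(sentence)
for i, j in hits:
    print(" ".join(word for word, _ in sentence[i:j + 1]))  # was carefully written
```

A tag-sequence search like this can over-match, so the hits are best treated as candidates to be checked in the concordance.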

Page 53: Corpora in language variation studies

Random sampling

Page 54: Corpora in language variation studies

KWIC versus page mode

Page 55: Corpora in language variation studies
Page 56: Corpora in language variation studies

Sorted by %