1 Text-based typology Corpora, corpora of elicited texts and parallel corpora (based on STUF 2007) МД


Text-based typology

Corpora, corpora of elicited texts and parallel corpora

(based on STUF 2007)

МД


Pros as compared to questionnaires

- Contextualization of examples
- Naturalistic discourse
- Intralinguistic variation
- Potentially, makes up for grammar gaps


Frog stories

(Mercer Mayer)


Pear stories

- W. Chafe et al.
- A six-minute film shot at UC Berkeley in 1975
- Widely used in cross-linguistic research
  - referential density project


Referential density (Bickel 2003)

Relative frequency of overt NPs

(via Nichols 2014)
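As a rough illustration of the measure, here is a minimal Python sketch. It assumes that "relative frequency" means the share of argument positions that are filled by an overt NP. The slide does not state the denominator, so that reading, the function name `referential_density` and the toy counts are all assumptions made for illustration.

```python
def referential_density(overt_nps: int, argument_slots: int) -> float:
    """Share of argument positions that are filled by an overt NP."""
    if argument_slots <= 0:
        raise ValueError("argument_slots must be positive")
    if not 0 <= overt_nps <= argument_slots:
        raise ValueError("overt_nps must lie between 0 and argument_slots")
    return overt_nps / argument_slots


# Hypothetical annotation of four clauses: (argument slots, overt NPs).
# These counts are invented; they do not come from any pear-story corpus.
clauses = [(2, 1), (1, 1), (3, 1), (2, 0)]

total_slots = sum(slots for slots, _ in clauses)
total_overt = sum(overt for _, overt in clauses)
rd = referential_density(total_overt, total_slots)
print(f"referential density = {rd:.3f}")
```

Pooling the counts over a whole text, rather than averaging clause by clause, keeps short clauses from dominating the figure; whether the original study pooled in this way is not stated on the slide.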


Cons of elicited corpora

- Not directly comparable
  - events are focused on or omitted
  - mostly quantitative results
- Require massive linguistic effort
  - limited data for each language
- Any alternative?
  - Parallel corpora


Massively parallel texts

- Harry Potter
  - including subtitles (76, 21)
- Biblical translations
  - Pater Noster in 1,300 languages, 400 full texts, 1,000 gospels
- Marxist texts
  - State and Revolution: 71 translations in 36 languages
- Legal databases:
  - Proceedings of the European Parliament
  - Universal Declaration of Human Rights (329)
- Unesco online database of literary translations (1.5 million items)
  - Andersen, Le Petit Prince, …

Cysouw and Wälchli 2007


Comparability (easy counts)

- Parallel corpora: roughly comparable number of sentences
  - from 1,528 to 1,663 for Le Petit Prince
- Elicited texts:
  - pear stories in the same language vary from 29 to 119 sentences (Bickel 2003 via Nichols 2014)
- ‘Free’ corpora: not applicable…
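The two ranges quoted on this slide are easier to compare as ratios of the longest to the shortest text. A small sketch of that arithmetic follows; the helper name `spread` is introduced here for illustration, and the four sentence counts are the ones given on the slide.

```python
def spread(fewest: int, most: int) -> float:
    """Ratio of the longest text to the shortest one, counted in sentences."""
    if fewest <= 0 or most < fewest:
        raise ValueError("expected 0 < fewest <= most")
    return most / fewest


petit_prince = spread(1528, 1663)   # translations of Le Petit Prince
pear_stories = spread(29, 119)      # pear stories told in one language

print(f"Le Petit Prince translations: x{petit_prince:.2f}")
print(f"Pear stories, one language:   x{pear_stories:.2f}")
```

The translations differ by a factor of about 1.09, while the elicited retellings differ by a factor of about 4.10, which is the sense in which parallel texts are "roughly comparable" by size.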


Comparability (methodology)

- Comparison by intension
  - definition of a phenomenon
  - browsing grammars
- Comparison by extension
  - linguistic structures used for expressing a contextualized situation
  - truly functional

Wälchli 2007


Extensional typology in parallel corpora

- the data we work with may be linguistically different but semantically identical
  - cf. much looser identity in elicited texts
- rather, they are “defined as a selection of places in the parallel texts”
- they may reflect linguistic variation
  - at points where one language uses the same construction, another language uses several


Parallel corpora support conventional typology

- Newmeyer against Stassen
  - Classical Greek, Latin and Tibetan have the ‘exceed’ type comparative, contra Stassen 1985
- Wälchli supports Stassen
  - A study of parallel corpora does not show the ‘exceed’ but the ‘separative’ construction
- Parallel corpora reflect dominant patterns, exactly where typology’s primary interests lie
- But they also numerically reflect variation or competition between dominant patterns, rather than provide a yes-or-no typology


Case studies, among others:

- Wälchli 2005: co-compounds
- Auwera et al. 2004: epistemic possibility in Slavic
- Wälchli 2006: ‘again’
- Wälchli 2001: motion events
- Wälchli & Zúñiga 2006: motion events ‘again’
- Stolz 2004: total reduplication
- Stolz et al. 2005: comitatives and instrumentals
- Stolz et al.: absolute possessives


Stolz 2003, 2004

Le Petit Prince: quantitative

- ‘avec’-cline
- Total-reduplication-cline

Does this require parallel corpora?


Stolz 2003, 2004

Le Petit Prince: qualitative?

French: Puis il s'épongea le front avec un mouchoir à carreaux rouges.
English: Then he mopped his forehead with a handkerchief decorated with red squares.
Croatian: Zatim obrise čelo rupčičem s crvenim kvadratima.

Wells with a rusty pulley: ornative or a separate category?


Pitfalls: data analysis

- Easier than raw texts
  - we know what was intended and where to look
  - still, as any grammatical analysis by a non-expert, subject to mistakes
- Alignment issues
- Anyway, same or easier than with elicited texts

Wälchli 2007


Pitfalls: sample bias

- Europe overrepresented, convenience sampling:
  - Europe > IE > other families
- In his study of comitatives, Stolz ended up with an areal study rather than a sample-based one


Pitfalls: style/variant choice

- Standard language bias
  - Better include texts reporting speech
- ‘Hagiolect’ effects
  - ‘The sinners will-Evid not enter the heaven’
- Style incomparability
  - Bible translations are stylistically diverse
- Purism

Wälchli 2007


Pitfalls: translation bias

- “Incommensurability” of linguistic structures: some languages think differently…
  - Australian languages prefer the absolute over the relative frame of reference
  - In Australian Gospels, occurrences of the absolute frame of reference (AFR) are found, but they are significantly less frequent than in natural discourse from this area

Wälchli 2007

“Inert” construction: a construction that tends to be imported from the source language


Case study: MVC in ‘bring’ and ‘run’ events

Bible-based, Bernhard Wälchli

Multi-verb construction (MVC): a clause that contains more than one lexical verb

BRING and RUN events may be encoded by an MVC or by a “solitarizing” verb


BRING and RUN events (Wälchli)

Examples:

Minnin ti-bouay la ban mouin. (Haitian Creole)
lead little-boy def give I

Ač-i-ne Man pat-ăm-a il-se kil-ĕr. (Chuvash)
child-ps3-dat/acc I.gen to-poss1sg-dat take-conv come-imp2pl

‘… bring him unto me.’ (solitarizing)

Data usually unavailable from grammars…


BRING and RUN events (Wälchli)

Bible-based, Bernhard Wälchli

Multi-verb construction (MVC): a clause that contains more than one lexical verb

BRING and RUN events may be encoded by an MVC or by a “solitarizing” verb

Is there any correlation between the choice of either construction for encoding the two events?


BRING and RUN events (Wälchli)

               BRING: Solit                BRING: MVC
RUN: Solit     Dinka, Navajo, Russian      Ainu, Ewe, Khasi
RUN: MVC       English, Guarani, Maltese   Choctaw, Chuvash, Khoekhoe


BRING and RUN events (Wälchli)

- 165 languages (Eurasia over-represented)
- 18 BRING events, six RUN events
- Correlation between MVC in BRING and RUN is highly significant (Fisher’s test)

               BRING: Solit   BRING: MVC
RUN: Solit     65             12
RUN: MVC       46             42
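The significance claim can be checked against the four cell counts. Below is a minimal sketch of a two-sided Fisher exact test written with the Python standard library only. It assumes that the rows of the flattened table are RUN and the columns are BRING, and it uses one common two-sided convention (summing all tables no more probable than the observed one); neither the orientation nor the convention is confirmed by the slide, and the function name is introduced here.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell
        # when all row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum over all tables that are no more probable than the observed one.
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )


# Cell counts as read off the slide (assumed layout: rows RUN, columns BRING).
solit_solit, solit_mvc = 65, 12   # RUN Solit: BRING Solit, BRING MVC
mvc_solit, mvc_mvc = 46, 42       # RUN MVC:   BRING Solit, BRING MVC

n_languages = solit_solit + solit_mvc + mvc_solit + mvc_mvc
odds_ratio = (solit_solit * mvc_mvc) / (solit_mvc * mvc_solit)
p_value = fisher_exact_two_sided(solit_solit, solit_mvc, mvc_solit, mvc_mvc)
mixed = solit_mvc + mvc_solit     # languages with different types for the two events

print(n_languages, round(odds_ratio, 2), p_value < 0.001, mixed)
```

The cells add up to the 165 languages of the sample, and the off-diagonal cells show how many languages use one type for BRING and the other for RUN.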


BRING and RUN events (Wälchli)

- Is a language consistently MVC or solitarizing?
- Surely not. Then is this a typological parameter at all?


BRING and RUN events (Wälchli)

But: the distribution is bimodal


BRING and RUN events (Wälchli)

If we only consider LOW and HIGH, fewer (14) languages are inconsistent
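One way such a LOW/HIGH comparison could be set up is sketched below in Python. The slides do not say how LOW and HIGH are delimited, so the cut-offs of one third and two thirds, the function name `band`, and all language labels and counts are hypothetical; only the numbers of contexts (18 BRING, six RUN) are taken from the slides.

```python
def band(mvc_count: int, n_contexts: int) -> str:
    """Bin a language by its share of MVC encodings (cut-offs are hypothetical)."""
    share = mvc_count / n_contexts
    if share <= 1 / 3:
        return "LOW"
    if share >= 2 / 3:
        return "HIGH"
    return "MID"


N_BRING, N_RUN = 18, 6  # numbers of BRING and RUN contexts given on the slides

# Invented counts of MVC encodings per language: (BRING, RUN).
toy_counts = {
    "lang_A": (1, 0),
    "lang_B": (17, 6),
    "lang_C": (2, 5),
    "lang_D": (9, 3),
}

# A language counts as inconsistent only if one event is LOW and the other HIGH.
inconsistent = []
for language, (bring_mvc, run_mvc) in toy_counts.items():
    bands = {band(bring_mvc, N_BRING), band(run_mvc, N_RUN)}
    if bands == {"LOW", "HIGH"}:
        inconsistent.append(language)

print(inconsistent)
```

Languages that fall in the middle band for either event are simply left out of the comparison, which is why the number of inconsistent languages drops when only the two extremes are considered.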


Case study: demonstratives

Potter-based, Federica da Milano 2007

- Distance-oriented systems
  - this near vs. that far
- Person-oriented systems
  - this with us vs. that far from us

Is this a real distinction, or are these two subtypes of something more general?


Demonstratives (da Milano)

- 48 stimuli (da Milano 2005)
  - Also include reciprocal orientation of the interlocutors: face to face, face to back, side by side
- 83 occurrences of deictic demonstratives in “… and the Chamber of Secrets”
  - this with us vs. that far from us


Demonstratives (da Milano)

‘Tie that round the bars,’ said Fred, throwing the end of a rope to Harry.

‘Przywiąż to do kraty’, powiedział Fred, rzucając Harry’emu koniec liny.


Demonstratives (da Milano)

One term systems:

- French: cela, ça (ceci not used)
- German: der/die/das (dieser, jener not used)


Demonstratives (da Milano)

Two term systems:

- Unmarked vs. proximal: Scandinavian, English, Northern Italian
- Unmarked vs. distal: Polish, Russian, Czech, Hungarian, Modern Greek
- Dyad-oriented: Catalan


Demonstratives (da Milano)

Three term systems: proximal, medial, distal

- Dual-anchored: medial is close to addressee or at medium distance
  - Spanish (este~ese~aquel)
  - Basque (hau~hori~hura)
- Addressee-anchored: medial is close to addressee only (not verified on Harry Potter)
  - Portuguese (este~esse~aquele)
  - Also Sardinian and Tuscan
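The pronoun classification on the last three slides can be collected in one lookup table. A minimal Python sketch follows; the dictionary keys and the helper `system_of` are labels invented here for illustration, while the language lists are copied from the slides.

```python
from typing import Optional

# System type -> languages, as listed on the slides.
# The key strings are ad hoc labels, not terms taken from da Milano.
DEMONSTRATIVE_SYSTEMS = {
    "one-term": ["French", "German"],
    "two-term, unmarked vs. proximal": ["Scandinavian", "English", "Northern Italian"],
    "two-term, unmarked vs. distal": [
        "Polish", "Russian", "Czech", "Hungarian", "Modern Greek",
    ],
    "two-term, dyad-oriented": ["Catalan"],
    "three-term, dual-anchored": ["Spanish", "Basque"],
    "three-term, addressee-anchored": ["Portuguese", "Sardinian", "Tuscan"],
}


def system_of(language: str) -> Optional[str]:
    """Return the system type listed for a language, or None if it is not listed."""
    for system, languages in DEMONSTRATIVE_SYSTEMS.items():
        if language in languages:
            return system
    return None


for name in ("Catalan", "Basque", "Polish"):
    print(f"{name}: {system_of(name)}")
```

Keeping the table keyed by system type mirrors the layout of the slides; inverting it into a language-to-type mapping would be the natural next step if many lookups were needed.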


Demonstratives (da Milano)

da Milano then proceeds to build a similar typology for adverbs; her conclusions are as follows:

- The map of adverbs is by and large isomorphic to the map of pronouns
- The generalization of Levinson 2004 is confirmed: “perhaps one can hazard the generalizations that speaker-centered degrees of distance are usually (more) fully represented in the adverbs than the pronominals”
- “It has turned out to be fruitful to use parallel texts as a control test of data obtained through the questionnaire. The results from the parallel texts mainly confirmed the prior typological generalizations.”


‘Free’ corpora!

- No translations: no risk of inert categories, closer to naturalistic discourse
- Massive amounts of text
  - usually literary
- Vast playground for quantitative analysis


‘Free’ corpora!

Examples:

- Combinatorial statistics for property words
  - Lexical typology by LexTyp
- Comparative occurrences
  - May be useful, cf. the temperature domain


Comparison: texts in typology

Free corpora:
- No ‘meaning’ identity
- Massive collections: almost all kinds of phenomena
- But a shift towards intensional typology
- Natural discourse

Elicited texts:
- Weak ‘meaning’ identity
- Massive effort for transcription, poor collections
- Only frequent phenomena
- Natural discourse (with provisos)

Parallel corpora:
- Strong ‘meaning’ identity
- Natural written discourse (with provisos)


Summary (obvious):

Corpora have their limitations and cannot replace conventional methods, but they can go hand in hand with them