1
Text-based typology
Corpora, corpora of elicited texts and parallel corpora
(based on STUF 2007)
МД
2
Pros as compared to questionnaires
Contextualization of examples
Naturalistic discourse
Intralinguistic variation
Potentially, makes up for grammar gaps
3
Frog stories
(Mercer Mayer)
4
Pear stories
W. Chafe et al.
A six-minute film shot at UC Berkeley in 1975
Widely used in various kinds of cross-linguistic research
  referential density project
5
Referential density (Bickel 2003)
Relative frequency of overt NPs:
Via Nichols 2014
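The "relative frequency of overt NPs" can be sketched as a simple ratio. This is a minimal illustration that assumes referential density is the number of overt argument NPs divided by the number of available argument positions; the function name and the counts are invented for the example and are not data from Bickel 2003.

```python
def referential_density(overt_nps: int, argument_positions: int) -> float:
    """Share of available argument positions that are filled by an overt NP."""
    if argument_positions <= 0:
        raise ValueError("argument_positions must be positive")
    if not 0 <= overt_nps <= argument_positions:
        raise ValueError("overt_nps must lie between 0 and argument_positions")
    return overt_nps / argument_positions


# Invented counts for two hypothetical narrations of the same stimulus.
narration_a = referential_density(overt_nps=45, argument_positions=100)
narration_b = referential_density(overt_nps=70, argument_positions=100)
print(narration_a, narration_b)
```

Because the measure is a proportion, narrations of different length can be compared directly.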
6
Cons of elicited corpora
Not directly comparable
  events focused and omitted
  mostly quantitative results
Require massive linguistic effort
  limited data for each language
Any alternative? Parallel corpora
7
Massively parallel texts
Harry Potter
  Including subtitles (76, 21)
Biblical translations
  Pater Noster in 1300 languages, 400 full texts, 1,000 gospels
Marxist texts
  State and Revolution: 71 translations in 36 languages
Legal databases:
  Proceedings of the European Parliament
  Universal Declaration of Human Rights (329)
Unesco online database of literary translations (1.5 million items)
  Andersen, Le Petit Prince, …
Cysouw and Wälchli 2007
8
Comparability (easy counts)
Parallel corpora: roughly comparable number of sentences (from 1,663 to 1,528 for Petit Prince)
Elicited texts: pear stories in the same language vary from 29 to 119 sentences (Bickel 2003 via Nichols 2014)
‘Free’ corpora: not applicable…
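The contrast in the sentence counts above can be made explicit with one line of arithmetic. The sketch below uses only the figures quoted on this slide; the helper name `relative_spread` is an illustrative choice, not a measure taken from the cited studies.

```python
def relative_spread(min_count: int, max_count: int) -> float:
    """Ratio of the longest to the shortest text; 1.0 means equal length."""
    if min_count <= 0 or max_count < min_count:
        raise ValueError("expected 0 < min_count <= max_count")
    return max_count / min_count


parallel = relative_spread(1528, 1663)  # Le Petit Prince translations
elicited = relative_spread(29, 119)     # pear stories within one language
print(f"parallel: {parallel:.2f}x, elicited: {elicited:.2f}x")
# prints "parallel: 1.09x, elicited: 4.10x"
```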
9
Comparability (methodology)
Comparison by intension
  definition of a phenomenon
  browsing grammars
Comparison by extension
  linguistic structures used for expressing a contextualized situation
  truly functional
Wälchli 2007
10
Extensional typology in parallel corpora
data we work with may be linguistically different but semantically identical
  cf. much looser identity in elicited texts
rather, they are “defined as a selection of places in the parallel texts”
they may reflect linguistic variation at points where one language uses the same construction and another language uses several
11
Parallel corpora support conventional typology
Newmeyer against Stassen
  Classical Greek, Latin and Tibetan have the ‘exceed’ type comparative - contra Stassen 1985
Wälchli supports Stassen
  A study of parallel corpora does not show ‘exceed’ but ‘separative’ construction
Parallel corpora reflect dominant patterns – exactly where the typology’s primary interests lie
But they also numerically reflect variation or competition between dominant patterns, rather than provide yes or no typology
12
Case studies, among others:
Wälchli 2005: co-compounds
Auwera et al. 2004: epistemic possibility in Slavic
Wälchli 2006: ‘again’
Wälchli 2001: motion events
Wälchli & Zúñiga 2006: motion events ‘again’
Stolz 2004: total reduplication
Stolz et al. 2005: comitatives and instrumentals
Stolz et al.: absolute possessives
13
Stolz 2003, 2004 Le Petit Prince - quantitative
‘avec’-cline
Total-reduplication-cline
Does this require parallel corpora?
14
Stolz 2003, 2004 Le Petit Prince – qualitative?
Puis il s'épongea le front avec un mouchoir à carreaux rouges.
Then he mopped his forehead with a handkerchief decorated with red squares.
Zatim obrise čelo rupčičem s crvenim kvadratima.
Wells with a rusty pulley – ornative or a separate category?
15
Pitfalls: data analysis
Easier than raw texts
  we know what was intended and where to look
  still, as any grammatical analysis by a non-expert, subject to mistakes
Alignment issues
Anyway, same or easier than with elicited texts
Wälchli 2007
16
Pitfalls: sample bias
Europe overrepresented, convenience sampling: Europe > IE > other families
In his study of comitatives, Stolz ended up with an areal rather than sampling study
17
Pitfalls: style/variant choice
Standard language bias
  Better include texts reporting speech
‘Hagiolect’ effects
  ‘The sinners will-Evid not enter the heaven’
Style incomparability
  Bible translations are stylistically diverse
Purism
Wälchli 2007
18
Pitfalls: translation bias
“Incommensurability” of linguistic structures: some languages think differently…
  Australian languages prefer absolute over relative frame of reference
  In Australian Gospels, occurrences of the absolute frame of reference (AFR) are found but are significantly less frequent than in natural discourse from this area
Wälchli 2007
“Inert” construction – a construction that tends to be imported from the source language
19
Case study: MVC in ‘bring’ and ‘run’ events
Bible-based, Bernhard Wälchli
Multi-verb construction: clauses that contain more than one lexical verb
BRING and RUN events may be described as MVC or “solitarizing” verbs
20
BRING and RUN events (Wälchli)
Examples:
Minnin ti-bouay la ban mouin. (Haitian Creole)
lead little-boy def give I
Ač-i-ne Man pat-ăm-a il-se kil-ĕr. (Chuvash)
child-ps3-dat/acc I.gen to-poss1sg-dat take-conv come-imp2pl
‘… bring him unto me.’ (solitarizing)
Data usually unavailable from grammars…
21
BRING and RUN events (Wälchli)
Bible-based, Bernhard Wälchli
Multi-verb construction: clauses that contain more than one lexical verb
BRING and RUN events may be described as MVC or “solitarizing” verbs
Is there any correlation between the choice of either construction for encoding the two events?
22
BRING and RUN events (Wälchli)
             BRING Solit                 BRING MVC
RUN Solit    Dinka, Navajo, Russian      Ainu, Ewe, Khasi
RUN MVC      English, Guarani, Maltese   Choctaw, Chuvash, Khoekhoe
23
BRING and RUN events (Wälchli)
165 languages (Eurasia over-represented)
18 BRING events, six RUN events
Correlation between MVC in BRING and RUN is highly significant (Fisher’s test)
             BRING Solit   BRING MVC
RUN Solit    65            12
RUN MVC      46            42
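The significance claim can be checked against the four counts on this slide. The sketch below is a standard two-sided Fisher exact test written with the Python standard library only; it is not necessarily the exact procedure used in the original study, and the function names are illustrative.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum over all tables that are no more probable than the observed one.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Sample odds ratio (a*d)/(b*c) of the 2x2 table."""
    return (a * d) / (b * c)


# Rows: RUN Solit, RUN MVC; columns: BRING Solit, BRING MVC (counts from the slide).
table = (65, 12, 46, 42)
print(f"odds ratio = {odds_ratio(*table):.2f}")
print(f"two-sided p = {fisher_exact_two_sided(*table):.2e}")
```

An odds ratio well above 1 together with a small p-value is what "highly significant" refers to here: languages that solitarize RUN tend to solitarize BRING as well.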
24
BRING and RUN events (Wälchli)
Is a language consistently MVC vs. solitarizing? Surely not – then, is this a typological parameter at all?
25
BRING and RUN events (Wälchli)
But: the distribution is bimodal
26
BRING and RUN events (Wälchli)
If we only consider LOW and HIGH, fewer (14) languages are inconsistent
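One way to read the LOW/HIGH comparison is sketched below. Everything here is an assumption made for illustration: the thresholds of one third and two thirds, the MID class, the function names, and the toy counts are hypothetical and are not taken from Wälchli's study. Only the numbers of events (18 BRING, six RUN) come from the slides.

```python
def mvc_class(n_mvc: int, n_events: int, low: float = 1 / 3, high: float = 2 / 3) -> str:
    """Bin a language by the share of events it encodes with an MVC.

    The cut-off values are hypothetical, chosen only for this sketch.
    """
    share = n_mvc / n_events
    if share <= low:
        return "LOW"
    if share >= high:
        return "HIGH"
    return "MID"


def is_inconsistent(bring_class: str, run_class: str) -> bool:
    """A language counts as inconsistent only if it is LOW on one event type and HIGH on the other."""
    return {bring_class, run_class} == {"LOW", "HIGH"}


# Toy counts (invented): MVC uses out of 18 BRING events and out of 6 RUN events.
toy_languages = {"lang_a": (17, 5), "lang_b": (2, 6), "lang_c": (9, 1)}
for name, (bring_mvc, run_mvc) in toy_languages.items():
    bring, run = mvc_class(bring_mvc, 18), mvc_class(run_mvc, 6)
    print(name, bring, run, "inconsistent" if is_inconsistent(bring, run) else "ok")
```

Languages that fall into the MID class on either event type are simply left out of the consistency count, which is why fewer languages end up inconsistent under this reading.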
27
Case study: demonstratives
Potter-based, Federica da Milano 2007
Distance-oriented systems: this near – that far
Person-oriented systems: this with us – that far from us
Is this a real distinction, or are these two subtypes of something more general?
28
Demonstratives (da Milano)
48 stimuli (da Milano 2005)
Also include reciprocal orientation of the locutors: face to face, face to back, side by side
83 occurrences of deictic demonstratives in “… and the Chamber of Secrets”
29
Demonstratives (da Milano)
‘Tie that round the bars,’ said Fred, throwing the end of a rope to Harry.
‘Przywiąż to do kraty’, powiedział Fred, rzucając Harry’emu koniec liny.
30
Demonstratives (da Milano)
One term systems:
French – cela, ça (ceci not used)
German – der/die/das (dieser, jener not used)
31
Demonstratives (da Milano)
Two term systems:
Unmarked vs. proximal – Scandinavian, English, Northern Italian
Unmarked vs. distal – Polish, Russian, Czech, Hungarian, Modern Greek
Dyad-oriented – Catalan
32
Demonstratives (da Milano)
Three term systems: proximal, medial, distal
Dual-anchored – medial (close to addressee or medium distance)
  Spanish (este~ese~aquel)
  Basque (hau~hori~hura)
Addressee-anchored – medial is close to addressee only – not verified on HP
  Portuguese (este~esse~aquele)
  Also Sardinian and Tuscan
33
Demonstratives (da Milano)
da Milano then proceeds to build a similar typology for adverbs; her conclusions are as follows:
The map of adverbs is by and large isomorphic to the map of pronouns
Confirmed: Levinson 2004, “perhaps one can hazard the generalizations that speaker-centered degrees of distance are usually (more) fully represented in the adverbs than the pronominals”
“It has turned out to be fruitful to use parallel texts as a control test of data obtained through the questionnaire. The results from the parallel texts mainly confirmed the prior typological generalizations.”
34
‘Free’ corpora!
No translations – no risk of inert categories, closer to naturalistic
Massive amounts of texts
Usually – literary
Vast playground for quantitative analysis
35
‘Free’ corpora!
Examples:
Combinatorial statistics for property words
  Lexical typology by LexTyp
Comparative occurrences
  May be useful – cf. temperature domain
38
Comparison: texts in typology
Free corpora:
  No ‘meaning identity’, shift towards intensional typology
  Massive collections: almost all kinds of phenomena
  Natural discourse
Elicited texts:
  Weak ‘meaning’ identity
  Massive effort for transcription, poor collections
  Only frequent phenomena
  Natural discourse (with provisos)
Parallel corpora:
  Strong ‘meaning’ identity
  Natural written discourse (with provisos)
39
Summary (obvious):
Corpora have their limitations and cannot replace conventional methods, but they can go hand in hand with them