Wido van Peursen, VU University Amsterdam, Faculty of Theology
1. The corpus: Hebrew Bible
2. The WIVU Database
3. CLARIN-project: SHEBANQ
4. NWO-project: Syntactic Diversity in BH
5. Case study: Judges 4 and 5
• Ca. 400,000 words
• Probably composed over a period of ca. 1000 years (1200-200 BC)
• Complex transmission history
• Oldest complete MS: Codex Leningradensis, 1008/9 AD
• Various linguistic layers (e.g. vowel signs)
• No native speakers
WIVU database of the Hebrew Bible [WIVU = Werkgroep Informatica Vrije Universiteit]
• Created since the 1970s
• Linguistic levels: morphology (encoding rather than tagging!), words, phrases, clauses, sentences, text hierarchy
3. CLARIN-project: SHEBANQ
System for HEBrew text: ANnotations for Queries and markup
Challenges:
1. No dedicated space on the web where an authorized version of this resource is guaranteed to exist.
2. No possibility to annotate it, link to it or build (open source) tools around it.
3. Results of existing queries cannot be shown on the web.
4. EMDROS is maintained by a one-person private company.
5. Mainly used by specialists in Bible & Computer.
Mission:
• To build a bridge between the linguistically annotated Hebrew text corpus and biblical scholars.
Three steps:
(1) make text & annotations available to scholars;
(2) demonstrate how queries can function to address research questions: repository of saved queries;
(3) give textual scholarship a more empirical basis, by creating the opportunity of unique identifiers referring to saved queries.
Example: “in-his-feet” can mean (a) “on foot” or (b) “in his footsteps”. Disambiguation: (1) intuitive/contextual, or (2) on the basis of pattern recognition (participants/agreement).
[she-sang <Pr>] [Deborah and Barak <Su>]
4. NWO-project: Syntactic Diversity in BH
Does Syntactic Variation reflect Language Change? Tracing Syntactic Diversity in Biblical Hebrew Texts
Explanations for linguistic diversity:
• Genre
• Chronology
• Language contact (Aramaic)
• Dialects
• Textual transmission
• Oral versus written layers
Limitations in current research:
• Focus on separate Bible books
• Methodological presuppositions
• Focus on lexical items or set phrases
• Failure to make use of methods for researching linguistic variation and change
• Failure to incorporate insights into syntactic differences between independent / dependent clauses and between narration / direct speech
Our approach:
• Focus on syntax in three project components: phrase level, clause level, text level
• Synthesis: integration of congruous and contradicting tendencies
• Extra-biblical texts used as points of comparison
5. Case study: Judges 4 and 5
These chapters deal with the battle
• of Deborah, Barak and Israelite tribes
• against the Canaanite king Jabin and his army captain Sisera.
Differences, e.g.:
• 4 is prose, 5 is poetry.
• Main figures (Jabin absent in 5).
• Tribes involved (only two in 4).
• 4 depends on 5: Wellhausen 1878; Halpern 1983; Houston 1997; Neef 2002 and many others.
• 5 depends on 4: Bechmann 1989; Waltisberg 1999.
• Common source/tradition: Richter 1963; Younger 1991.
• Synchronous/sequential: Guest 1998; Reis 2005.
1. Identification of ‘similar’ text segments on the basis of ‘distance’ (synopsis impossible).
2. Identification of text features that cause high similarity scores.
3. Analysis of the distribution of these features in the larger context of Judges and the Old Testament.
Is intuition that 4 and 5 belong together supported by textual features?
If so, where in the text can they be found?
Similarity matrices: measuring the ‘distance’ between each verse from ch. 4 and each verse from ch. 5.
Rows: verses of ch. 4; columns: verses of ch. 5.
4\5  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
  1  1  2  2  1  2  1  1  1  2  0  2  1  1  0  0  0  0  1  0  0  0  0  1  0  0  0  0  0  0  0  1
  2  0  1  2  1  1  0  0  0  1  1  1  0  1  0  1  1  1  0  2  1  0  0  2  0  0  2  0  1  0  1  1
  3  1  2  2  1  2  1  1  1  2  0  2  1  1  0  0  0  0  0  0  0  0  0  1  0  0  0  0  1  0  0  2
  4  1  1  1  0  1  0  2  1  1  0  1  1  0  0  1  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0
  5  2  1  1  0  2  1  2  1  1  1  2  2  0  1  1  2  1  0  0  0  0  0  1  0  0  0  1  0  0  0  0
  6  4  2  3  1  4  2  1  3  2  1  2  3  1  2  2  0  0  2  1  0  0  0  2  0  0  1  0  0  0  0  1
  7  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  0  0  0  0  1  2  0  0  0  1  2  0  2  0  1  0
  8  2  0  0  0  0  1  0  0  0  1  0  1  0  0  1  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0
  9  3  1  1  1  1  1  2  0  1  2  1  3  1  0  2  0  0  0  0  1  0  0  2  1  0  2  0  1  0  1  1
 10  2  0  0  0  0  0  1  1  0  0  0  2  0  1  3  0  0  2  0  0  0  0  0  0  0  0  1  0  0  0  0
 11  1  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  3  0  0  0  0  0  0  0
 12  3  0  0  0  1  1  0  0  0  0  0  3  0  0  1  0  0  0  0  1  0  0  0  0  0  1  0  1  0  1  0
 13  0  1  0  0  0  0  0  0  1  0  1  0  1  1  0  0  0  1  0  1  2  0  0  0  0  1  0  2  0  1  1
 14  4  1  1  2  3  1  2  1  1  0  2  3  2  2  2  0  0  0  0  1  0  0  2  0  1  2  0  1  0  1  2
 15  1  1  1  1  2  0  0  0  1  0  2  1  2  1  2  0  0  0  0  1  0  0  1  0  0  1  1  3  0  1  2
 16  1  0  0  0  0  0  0  0  0  0  0  1  0  1  1  0  0  0  0  1  0  0  0  0  0  1  1  2  0  1  1
 17  0  0  1  0  0  1  0  0  0  0  1  0  0  0  1  1  0  0  1  1  0  0  0  5  0  1  2  1  0  1  0
 18  1  0  0  1  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  1  2  0  1  0  1  0  1  1
 19  1  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  1  0  2  0  0  0  0  0  0
 20  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  2  1  1  0  0  1  0  0  0
 21  0  0  0  1  0  1  0  0  0  0  0  0  0  0  0  0  0  1  2  0  0  0  1  4  0  3  0  1  0  0  1
 22  2  0  0  1  0  2  0  1  0  1  0  1  0  0  1  0  0  1  1  1  0  0  2  1  0  3  1  2  0  1  1
 23  2  1  3  0  3  2  1  2  1  0  1  1  0  0  0  0  0  0  2  0  0  0  0  0  0  0  0  0  0  0  0
 24  1  1  2  0  1  2  1  1  1  1  1  1  0  0  0  0  0  0  2  0  0  0  0  0  0  1  0  0  0  0  0
Shared lexemes: the more shared lexemes, the smaller the distance.
• ‘Noise’, e.g. ‘and’ > stoplist: exclude frequent particles etc.
• Selection of content words on the basis of part of speech: only words with inflection (nouns, verbs, adjectives).
• Basic unit for text comparison: verse, but ‘verse’ is based on traditional unit delimitation.
• Differences in verse size may affect results.
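The shared-lexeme count described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the part-of-speech labels, the function names and the toy verses (English glosses) are hypothetical and do not reflect the WIVU feature names or the project's actual code.

```python
# Hypothetical part-of-speech labels; the WIVU database has its own feature names.
CONTENT_POS = {"noun", "verb", "adjective"}


def content_lexemes(verse):
    """Return the set of content-word lexemes (types) in one verse.

    `verse` is a list of (lexeme, part_of_speech) pairs. Words whose part of
    speech is not in CONTENT_POS (particles such as 'and') are dropped.
    """
    return {lexeme for lexeme, pos in verse if pos in CONTENT_POS}


def shared_lexeme_matrix(chapter_a, chapter_b):
    """Count shared content lexemes for every verse pair.

    Rows correspond to the verses of chapter_a, columns to those of chapter_b.
    """
    sets_a = [content_lexemes(verse) for verse in chapter_a]
    sets_b = [content_lexemes(verse) for verse in chapter_b]
    return [[len(a & b) for b in sets_b] for a in sets_a]


# Toy verses with English glosses, invented for illustration only.
verse_a = [("and", "conjunction"), ("say", "verb"), ("Barak", "noun"),
           ("son", "noun"), ("Abinoam", "noun")]
verse_b = [("and", "conjunction"), ("sing", "verb"), ("Deborah", "noun"),
           ("Barak", "noun"), ("son", "noun"), ("Abinoam", "noun"),
           ("say", "verb")]

print(shared_lexeme_matrix([verse_a], [verse_b]))  # [[4]]
```

Because lexemes are compared as sets, a word that occurs twice in a verse counts once, and ‘and’ is ignored even though both toy verses contain it.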
Jaccard Index: the number of shared lexemes (the size of the intersection) divided by the total number of distinct lexemes (the size of the union).
Example 1:
I went home
I went home yesterday
Intersection: shared lexemes (types): 3 (I, went, home)
Union: total number of lexemes: 4 (I, went, home, yesterday)
Jaccard Index = 3/4 = 0.75
Example 2:
I went home
After the meeting I went home yesterday
Intersection: 3 (I, went, home)
Union: 7 (I, went, home, after, the, meeting, yesterday)
Jaccard Index = 3/7 = 0.43
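The two worked examples can be reproduced with a minimal Python sketch. The function names are hypothetical, and splitting on whitespace with lowercasing is an assumed stand-in for real lexeme identification, which in the database comes from the morphological encoding.

```python
def lexemes(sentence):
    """Very rough stand-in for lexeme extraction: lowercase and split on spaces."""
    return set(sentence.lower().split())


def jaccard_index(a, b):
    """Size of the intersection divided by the size of the union of two sets."""
    union = a | b
    if not union:
        # Assumed convention: two empty units are given an index of 0.0.
        return 0.0
    return len(a & b) / len(union)


short = lexemes("I went home")
print(jaccard_index(short, lexemes("I went home yesterday")))  # 0.75
print(round(jaccard_index(
    short, lexemes("After the meeting I went home yesterday")), 2))  # 0.43
```

The second example shows the verse-size effect: the same three shared lexemes give a lower index once the longer unit adds more distinct words to the union.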
Shared lexemes: a ‘feature-based’ method. There are also ‘blind’ methods, based on mathematical characteristics of the digital representation of the text, e.g. Normalized Compression Distance (NCD).
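As a rough illustration of such a ‘blind’ method, NCD is commonly defined as (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. The sketch below assumes zlib as the compressor and UTF-8 as the text encoding; the slides do not say which compressor or representation was used, and the sample strings are invented.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (maximum compression level)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance between two texts.

    Assumes zlib as compressor and UTF-8 encoding; both are illustrative choices.
    """
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Invented sample texts: one repetitive text and one unrelated digit string.
text_a = "Deborah and Barak sang on that day " * 30
text_b = " ".join(str(i * i) for i in range(300))

print(ncd(text_a, text_a) < ncd(text_a, text_b))  # True
```

For units as short as a single verse, compressor overhead dominates the sizes, so scores of this kind are noisy; the sample texts above are deliberately long.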
Example: verse pairs with the highest number of shared lexemes (4 or more)
• 4:6 and 5:1: Abinoam, Barak, say, son
• 4:6 and 5:5: God, Israel, the LORD, mountain
• 4:14 and 5:1: Barak, day, Deborah, say
• 4:17 and 5:24: Heber, Jael, Kenite, tent, wife
• 4:21 and 5:24: Heber, Jael, tent, wife
Proper nouns: ‘Barak’, ‘Israel’.
Common nouns that are part of proper noun phrases:
‘wife’ in ‘Jael the wife of Heber’; ‘son’ in ‘Barak the son of Abinoam’.
Other verbs and common nouns: ‘say’, ‘tent’, ‘day’.
High similarity scores in places that show high concentration of proper nouns.
Even within category of proper nouns considerable differences.
Shared common nouns and verbs: frequent words such as ‘day’, ‘say’. No significant concentration.
In case of literary dependency we would expect at least some concentration of shared lexemes.
Significant number of shared lexemes only in case of proper nouns.
But proper nouns suggest shared traditions, rather than literary dependency.