The Construction of Anglo-Norman Text Corpus. Joint Project of the University of Wales, Swansea and the University of Wales, Aberystwyth. AHRC-funded. Anglo-Norman Online Dictionary. Anglo-Norman Text Corpus. http://www.anglo-norman.net

Page 1: The Construction of Anglo-Norman Text Corpus

The Construction of Anglo-Norman Text Corpus

• Joint Project of the University of Wales, Swansea and the University of Wales, Aberystwyth.

• AHRC-funded.

• Anglo-Norman Online Dictionary

• Anglo-Norman Text Corpus

• http://www.anglo-norman.net

Page 2: The Construction of Anglo-Norman Text Corpus

Goal of the Anglo-Norman Hub Text Digitisation Project

• To provide a set of digitised texts and articles to mediaeval linguists and historians which is searchable and fully cross-referenced within itself and to and from the Anglo-Norman Online Dictionary

Page 3: The Construction of Anglo-Norman Text Corpus

Main Challenges facing the Anglo-Norman Hub Project

• Image to text migration for maximum throughput at minimum cost

• Application of markup suitable for rendering and full cross-referencing

• Handling of non-standard character sets (mediaeval abbreviations)

Page 4: The Construction of Anglo-Norman Text Corpus

Image to Text Migration Strategies

• Optical Character Recognition

• Re-keying

• Both require subsequent proofreading

• Both allow insertion of appearance metadata as provisional markup

Page 5: The Construction of Anglo-Norman Text Corpus

Advantages of Alternative Image to Text Migration Strategies

• OCR
  – Rapid processing
  – Can be performed by students on-site and can be supervised

• Re-keying
  – Less error-prone
  – Cheap if outsourced
  – Non-standard characters can be represented by combinations
  – Image quality less critical
  – More consistent output quality

Page 6: The Construction of Anglo-Norman Text Corpus

Economic Image to Text Migration: Conclusions

• Re-keying is more economic for the bulk of the mediaeval-language material

• OCR is competitive for modern languages (critical material)

• OCR can also be used for mediaeval-language material when required by workflows, provided that
  – good image quality can be easily achieved
  – the material consists of standard characters

Page 7: The Construction of Anglo-Norman Text Corpus

Markup requirements: must

• Conform to widely-accepted standards

• Be capable of encapsulating diverse document structures

• Allow for automation

• Enable internal and external referencing

• Preserve as much appearance metadata as possible

• Not be tied to any one approach to rendering

Page 8: The Construction of Anglo-Norman Text Corpus

Document types requiring a variety of XML Structures

• Texts
  – Verse
  – Prose
  – Lists & Tables

• Critical material
  – Introductions (conform to prose structures)
  – Notes (do not conform to any of the above structures)

Page 9: The Construction of Anglo-Norman Text Corpus

Cross-referencing of Critical Matter

• Need to navigate from pointer to note

• Need to navigate cross-references from critical material to specific points in the text or elsewhere in critical material

• Achieved by use of target-id pairs
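A target-id scheme is only useful if every target actually resolves. The following is a minimal sketch of such a check, not the project's own tooling: it assumes the `id`, `target` and `targetEnd` attribute names seen in the deck's extracts from "La Passiun de St. Edmund", and the wrapping `<text>` element and the function name `unresolved_targets` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample built from the deck's extracts; the <text> wrapper is an assumption.
SAMPLE = """<text>
<lg n="316"><l id="L1261">A Deu del cel ad graciéd</l>
<l id="L1262">E al martir suvent a voéd</l>
<l id="L1263">Que si bel l'at delivréd</l>
<l id="L1264" n="1264">De ço qu'esteit ainz encumbrét.</l></lg>
<note id="N1261-4" target="L1261" targetEnd="L1264">See
<ref target="L1261">1261</ref> and <ref target="L826 L943">ll. 826, 943</ref>.</note>
</text>"""


def unresolved_targets(xml_text):
    """Return the sorted target values that match no id in the document."""
    root = ET.fromstring(xml_text)
    ids = {el.get("id") for el in root.iter() if el.get("id")}
    wanted = set()
    for el in root.iter():
        for attr in ("target", "targetEnd"):
            # A target attribute may hold several space-separated ids.
            wanted.update((el.get(attr) or "").split())
    return sorted(wanted - ids)


missing = unresolved_targets(SAMPLE)
print(missing)
```

In this sample the pointers to lines 826 and 943 are reported because those lines are not part of the fragment; run over a whole document, the list would ideally be empty.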

Page 10: The Construction of Anglo-Norman Text Corpus

Markup Density and Automation

• Verse: medium density; can be automated

• Prose: variable density; can be automated if footnote pointers present

• Lists & tables: medium density; can be automated

• Critical material: high-density; many cross-references; limited scope for automation
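To illustrate why verse lends itself to automation, here is a rough sketch that wraps plain verse lines in `<lg>` and `<l>` elements with running ids. It assumes that the re-keyed text separates stanzas with blank lines and that the starting line and stanza numbers are known; the function name `mark_up_verse` is hypothetical.

```python
from xml.sax.saxutils import escape


def mark_up_verse(plain_text, first_line=1, first_stanza=1):
    """Wrap blank-line-separated stanzas in <lg> and number each <l>."""
    out = []
    line_no = first_line
    stanza_no = first_stanza
    for stanza in plain_text.strip().split("\n\n"):
        lines = []
        for raw in stanza.splitlines():
            if raw.strip():
                # escape() protects &, < and > in the verse text itself.
                lines.append(f'<l id="L{line_no}">{escape(raw.strip())}</l>')
                line_no += 1
        out.append(f'<lg n="{stanza_no}">' + "\n".join(lines) + "</lg>")
        stanza_no += 1
    return "\n".join(out)


verse = """A Deu del cel ad graciéd
E al martir suvent a voéd
Que si bel l'at delivréd
De ço qu'esteit ainz encumbrét."""

xml = mark_up_verse(verse, first_line=1261, first_stanza=316)
print(xml)
```

Page breaks, footnote pointers and similar features would still need their own rules; the sketch covers only the regular line and stanza structure.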

Page 11: The Construction of Anglo-Norman Text Corpus

Extract from XML version of “La Passiun de St. Edmund”

<lg n="316"><l id="L1261">A Deu del cel ad graciéd</l>
<l id="L1262">E al martir suvent a voéd</l>
<l id="L1263">Que si bel l'at delivréd</l>
<pb ed="folio" n="123a"/><l id="L1264" n="1264">De ço qu'esteit ainz encumbrét.</l></lg>

Page 12: The Construction of Anglo-Norman Text Corpus

Extract from XML version of “La Passiun de St. Edmund”

<note id="N1261-4" target="L1261" targetEnd="L1264">These lines present several problems: (a) <q lang="AN" rend="b">A Deu. . .ad graciéd</q> <ref target="L1261">1261</ref>. The verb <term lang="AN" rend="i">gracier</term>, occurring here with an indirect object, normally takes a direct object and does so in its other occurrences in the text: <ref target="L826 L943 L1132">ll. 826, 943, 1132</ref>.

Page 13: The Construction of Anglo-Norman Text Corpus

Additional Markup for Critical Material

– <term>: Terms discussed may need to be linked to the Anglo-Norman Dictionary

– <q>: Citations may need to be linked to their sources within the text base

– <bibl>, <title> etc.: Bibliographical information needs to be encoded to link citations with their sources

• Much of the above can be extrapolated from the appearance metadata embedded in the provisional markup

• <hi>: to encode embedded appearance metadata whose significance is not apparent
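As a rough illustration of the fallback described above, the sketch below rewrites provisional appearance tags as `<hi>` elements. The provisional scheme itself is an assumption: the deck does not specify it, so the sketch supposes simple `<i>` and `<b>` tags (by analogy with the `<sup>` tags in the keyed examples) and reuses the `rend="i"` / `rend="b"` values from the note extract. The function name is hypothetical.

```python
import re


def provisional_to_hi(text):
    """Rewrite provisional <i>/<b> appearance tags as <hi rend="...">."""
    # Opening tags carry the appearance information into the rend attribute.
    text = re.sub(r"<(i|b)>", r'<hi rend="\1">', text)
    # Closing tags all map to the same closing element.
    return re.sub(r"</(?:i|b)>", "</hi>", text)


keyed = "The verb <i>gracier</i>, occurring here with an indirect object"
converted = provisional_to_hi(keyed)
print(converted)
```

An editor could later promote such `<hi>` elements to `<term>`, `<q>` or `<title>` once their significance has been decided.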

Page 14: The Construction of Anglo-Norman Text Corpus

“La Passiun de St. Edmund” Rendered for a Web Browser

• These lines present several problems: (a) A Deu. . .ad graciéd 1261. The verb gracier, occurring here with an indirect object, normally takes a direct object and does so in its other occurrences in the text: ll. 826, 943, 1132. T.-L. 4,502 cites one instance of gracier with indirect object, but in the construction gracier. qc. a qn. If this construction were applied here, ll. 1263-4 would have to be taken as the direct object of gracier and also, presumably, of voer 1262. The use here of gracier with indirect object may have been influenced by the construction rendre graces a qn. employed at ll. 995, 1046, 1512.

Page 15: The Construction of Anglo-Norman Text Corpus

Markup Density and Automation

• Verse: medium density; can be automated

• Prose: variable density; can be automated if footnote pointers present

• Lists & tables: medium density; can be automated

• Critical material: high-density; many cross-references; limited scope for automation

Page 16: The Construction of Anglo-Norman Text Corpus

Markup Requirements: Application

• 1,000 to 100,000 XML tags per document

• Automation essential for high throughput

• Digitisers can embed appearance metadata in provisional markup

• Well-designed provisional markup schemes facilitate automation

Page 17: The Construction of Anglo-Norman Text Corpus

Facsimile of part of the Statute Roll

Page 18: The Construction of Anglo-Norman Text Corpus

The same passage in the 1800 printed edition

Page 19: The Construction of Anglo-Norman Text Corpus

Extract from the explanation published with the Statutes, exemplifying the two forms resembling 9s.

Page 20: The Construction of Anglo-Norman Text Corpus

"rum"-abbreviation and flourishes

Page 21: The Construction of Anglo-Norman Text Corpus

Handling of Non-Unicode Characters: 1) Transcription

• Transcription is the one-to-one encapsulation of character appearance metadata

• Transliteration is the expansion of abbreviated characters into an intelligible sequence of letters

• Transliteration requires transcription as a starting point

• Transcription codes must resemble originals to facilitate re-keying

Page 22: The Construction of Anglo-Norman Text Corpus

P-contractions

Page 23: The Construction of Anglo-Norman Text Corpus

Examples of the "per", "pro" and "pre" contractions as represented by the agency

Signifies   Keyed as   Expanded example   Rekeyed example
per         p!!        ceperit            cep!!it
pro         $p$        propria            $p$p<sup>i</sup>a
pro         $p$        probum             $p$bū
per         p!!        persone            p!!sone
per         p!!        apertement         ap!!tement
pro         $p$        profit             $p$fit
per         p!!        permisit           p!!misit
pro         $p$        promisit           $p$misit
pro         $p$        prochein           $p$chein
per         p!!        persona            p!!<sup>a</sup>
par         p!!        paratus            p!!atus
par         p!!        parceles           p!!celes
por         p!!        tempore            temp!!e
por         p!!        corporum           corp!!um
pre         p?~        presentem          p?~sentem
pre         p?~        prelatz            p?~laz!!
pre         p?~        predictum          p?~d!!c~m
pre         p?~        prendront          p?~ndront
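In the examples above the single code `p!!` stands for "per", "par" and "por", so a keyed word cannot always be expanded mechanically. The following sketch, which is illustrative only, lists every candidate reading of a keyed word. The `CODES` mapping is taken from the examples; the function name `candidate_expansions` is hypothetical, and codes other than the three p-contractions are left untouched.

```python
import re
from itertools import product

# Keyed code -> possible readings, as attested in the examples above.
CODES = {"p!!": ["per", "par", "por"], "$p$": ["pro"], "p?~": ["pre"]}
TOKEN = re.compile("|".join(re.escape(code) for code in CODES))


def candidate_expansions(keyed_word):
    """Return every reading obtained by expanding the p-contraction codes."""
    pieces = []
    pos = 0
    for match in TOKEN.finditer(keyed_word):
        pieces.append([keyed_word[pos:match.start()]])  # literal text
        pieces.append(CODES[match.group(0)])            # alternative readings
        pos = match.end()
    pieces.append([keyed_word[pos:]])
    return ["".join(combo) for combo in product(*pieces)]


print(candidate_expansions("temp!!e"))
print(candidate_expansions("$p$fit"))
```

Only one of the three readings of `temp!!e` is the attested word "tempore"; picking it requires a word list or a human editor, which is the reason for the concordance-based approach.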

Page 24: The Construction of Anglo-Norman Text Corpus

Transcription: 1810 Edition and Rekeyed Version

Page 25: The Construction of Anglo-Norman Text Corpus

Transcription to Transliteration: Rekeyed Version & XML File

<p><expan abbr="R-">Rex</expan> Collectorib<expan abbr="z$">us</expan> custume sue lana<expan abbr="z£">rum</expan> in Civitate Londo<expan abbr="n-">nii</expan>, sa<expan abbr="l-t">lute</expan>m. Cum nu<expan abbr="p-">per</expan> <expan abbr="p-">per</expan> nos &amp; consili<expan abbr="u-">um</expan> n<expan abbr="r~">ostru</expan>m ordinatum fuisset, q<expan abbr="d-">uod</expan> lane, coria, pelles lanute, plumbum &amp; stagmen n<expan abbr="o-">on</expan> dimit<expan abbr="t?">ter</expan>ent<expan abbr="rsup">ur</expan> seu quomodolibet venderent<expan abbr="rsup">ur</expan>, nisi <expan abbr="p$">pro</expan> bonis sterlingis seu aliis <expan abbr="m?">mer</expan>candisis legalib<expan abbr="z$">us</expan>, <expan abbr="p$">pro</expan>ut in statuto inde edito plenius continet<expan abbr="rsup">ur</expan>

Page 26: The Construction of Anglo-Norman Text Corpus

Handling of Non-Unicode Characters: 2) Transliteration

• Manual transliteration would take too long

• Blanket replacement is not possible because of ambiguous abbreviations

• Semi-automated transliteration can be achieved using a list of words for block-replacement, derived from a concordance

• The appearance metadata from the transcription should remain embedded
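The block-replacement step might look roughly like the sketch below. It is a simplification under stated assumptions: whole contracted words are looked up in a small expansion list (entries taken from the deck's tables of expansions), the keyed form is kept in an `abbr` attribute so the appearance metadata stays embedded, and anything with a keying code but no listed expansion is reported for manual review. The deck's own XML wraps only the abbreviated part of a word in `<expan>`; wrapping the whole word here is a shortcut, and the sample sentence is a made-up sequence of words from the tables.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Unambiguous contracted words and their expansions (from the tables).
EXPANSIONS = {
    "p!!lement": "parlement",
    "t?~re": "terre",
    "ap?~s": "apres",
    "p!!tie": "partie",
}
WORD = re.compile(r"[^\s,.;:]+")      # a word, without trailing punctuation
CODE_CHARS = re.compile(r"[!?~$]")    # characters used by the keying codes


def transliterate(text, expansions=EXPANSIONS):
    """Expand listed words; return the new text and the unresolved words."""
    unresolved = []

    def replace(match):
        word = match.group(0)
        if word in expansions:
            # Keep the keyed form so the transcription is never lost.
            return f"<expan abbr={quoteattr(word)}>{escape(expansions[word])}</expan>"
        if CODE_CHARS.search(word):
            unresolved.append(word)   # contraction with no listed expansion
        return escape(word)

    return WORD.sub(replace, text), unresolved


sample = "le p!!lement de la t?~re, ap?~s ten!!"
xml, unresolved = transliterate(sample)
print(xml)
print(unresolved)
```

Leaving unlisted contractions untouched, rather than guessing, keeps ambiguous forms visible until an editor has decided on them.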

Page 27: The Construction of Anglo-Norman Text Corpus

Extract from Concordance

Page 28: The Construction of Anglo-Norman Text Corpus

Table of expansions, example 1

Contracted word      Occurrences   Expansion
&                    6264          &
q~                   2989          q'
p<sup>r</sup>        803           p'r
seign<sup>r</sup>    325           seignour
aut?~s               289           autres
man?~e               250           manere
s<sup>r</sup>        224           sur
p!!lement            215           parlement
t?~re                199           terre
t?~res               196           terres
denglet?~re          191           dengleterre
g<sup>a</sup>nt      181           grant
lo<sup>r</sup>       167           lour
p!!tie               152           partie
ap?~s                142           apres
s?~ront              139           serront
h<bar>o</bar>me      137           homme
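A frequency list of this kind can be derived from the re-keyed text before any expansions are assigned. The sketch below is an assumption about how that might be done, not a description of the project's concordance software: it counts whitespace-separated words that contain one of the keying codes seen in the deck, and the sample string is invented from words in the table.

```python
import re
from collections import Counter

# Markers used by the keying codes in the deck's examples; assuming this set
# is sufficient is a simplification.
CODE_PATTERN = re.compile(r"[!?~$]|<sup>|<bar>")


def contracted_word_counts(text):
    """Count whitespace-separated words that contain a keying code."""
    return Counter(w for w in text.split() if CODE_PATTERN.search(w))


sample = "t?~re et t?~res en p!!lement de t?~re lo<sup>r</sup> seignour"
counts = contracted_word_counts(sample)
for word, n in counts.most_common():
    print(word, n)
```

Sorting by frequency lets editors supply expansions for the most common contractions first, which covers most of the text with a short list.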

Page 29: The Construction of Anglo-Norman Text Corpus

Table of expansions, example 2

Contracted word      Occurrences   Expansion
memorand!!           8             memorandum
mest?~               8             mestre
p!!dre               8             perdre
p?~dc~m              8
p?~mer               8             premer
p?~scheins           8             proscheins
p?~sentz             8             presentz
pasch!!              8             Pasche
t?~minez             8             terminez
t?~ra                8             terra
ten!!                8
ten~tz               8             tenementz
v?~ge                8             verge
v?~roie              8             verroie
v?~tue               8             vertue
$p$pres              7             propres
$q$                  7

Page 30: The Construction of Anglo-Norman Text Corpus

Transliteration: XML File & Rendered Output

<p><expan abbr="R-">Rex</expan> Collectorib<expan abbr="z$">us</expan> custume sue lana<expan abbr="z£">rum</expan> in Civitate Londo<expan abbr="n-">nii</expan>, sa<expan abbr="l-t">lute</expan>m. Cum nu<expan abbr="p-">per</expan> <expan abbr="p-">per</expan> nos &amp; consili<expan abbr="u-">um</expan> n<expan abbr="r~">ostru</expan>m ordinatum fuisset, q<expan abbr="d-">uod</expan> lane, coria, pelles lanute, plumbum &amp; stagmen n<expan abbr="o-">on</expan> dimit<expan abbr="t?">ter</expan>ent<expan abbr="rsup">ur</expan> seu quomodolibet venderent<expan abbr="rsup">ur</expan>, nisi <expan abbr="p$">pro</expan> bonis sterlingis seu aliis <expan abbr="m?">mer</expan>candisis legalib<expan abbr="z$">us</expan>, <expan abbr="p$">pro</expan>ut in statuto inde edito plenius continet<expan abbr="rsup">ur</expan>
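Because the abbreviation stays in the `abbr` attribute, the same XML can be rendered either as expanded reading text or with the keyed abbreviations visible. The sketch below shows one way this could work, using a shortened fragment of the passage above that has been closed with `</p>` so that it parses; the function name `render` and the bracket notation for abbreviations are illustrative choices, not the project's rendering.

```python
import xml.etree.ElementTree as ET

# Shortened, closed fragment of the passage (the closing </p> is added here).
FRAGMENT = ('<p><expan abbr="R-">Rex</expan> Collectorib<expan abbr="z$">us</expan> '
            'custume sue lana<expan abbr="z£">rum</expan></p>')


def render(xml_text, expanded=True):
    """Flatten a <p> to plain text, using expansions or keyed abbreviations."""
    root = ET.fromstring(xml_text)
    parts = [root.text or ""]
    for child in root:
        if child.tag == "expan":
            if expanded:
                parts.append(child.text or "")
            else:
                parts.append("[" + child.get("abbr", "") + "]")
        else:
            parts.append("".join(child.itertext()))
        # Text following the element belongs to the surrounding paragraph.
        parts.append(child.tail or "")
    return "".join(parts)


print(render(FRAGMENT))
print(render(FRAGMENT, expanded=False))
```

Keeping both readings in one file is what allows the markup to stay independent of any single approach to rendering.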

Page 31: The Construction of Anglo-Norman Text Corpus

Main Challenges facing the Anglo-Norman Hub Project

• Image to text migration for maximum throughput at minimum cost

• Application of markup suitable for rendering and full cross-referencing

• Handling of non-standard character sets (mediaeval abbreviations)