
Natural Language Processing for Historical Texts

GSCL 2013 Tutorial

Dr.-Ing. Michael Piotrowski ([email protected])

@true mxp

Leibniz Institute of European History

September 24, 2013


Acquiring historical texts

• Text acquisition is rarely discussed for modern texts: they are easy to obtain in electronic form.

• The bulk of historical text has not yet been digitized.

⇒ For “regular” NLP research, text acquisition is usually not an issue.

Typical workflow for historical texts:

1. Acquire text in its original form (e.g., as a book).

2. Scan it.

3. Convert images to digital text.


Decision criteria

• If the texts have already been digitized (e.g., in a previous project or for Project Gutenberg), error checking and correction may be required to achieve the desired quality.

• If the texts are available in printed form with good print quality and in roman type, optical character recognition (OCR) or double-keying may be options; for both approaches the documents have to be scanned first.

• If the texts are printed and the print quality is low, or the texts are printed in blackletter types, it depends on the language and the desired quality whether OCR (with post-processing) is still an option; otherwise qualified double-keying is likely to be required.

• If the texts are handwritten, double-keying by qualified personnel is currently the only option (even though there is research on automatic handwriting recognition for historical documents).
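The criteria above can be codified as a simple rule of thumb. A minimal sketch in Python, with purely illustrative parameter names (not part of the original slides, and no substitute for judging the actual material):

```python
def acquisition_method(already_digital=False, handwritten=False,
                       good_print=True, blackletter=False):
    """Rule-of-thumb mapping from the state of the source texts to an acquisition method."""
    if already_digital:
        return "error checking and correction"
    if handwritten:
        return "qualified double-keying"
    if good_print and not blackletter:
        return "OCR or double-keying (scan first)"
    return "OCR with post-processing if language and quality permit, else qualified double-keying"

print(acquisition_method(handwritten=True))   # -> "qualified double-keying"
```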


Scanning

• Scanning makes sense even if the text will be keyed in.
• Focus on printed books.
• Can the books be cut open? ⇒ Sheet-fed scanners.
• Otherwise: flat-bed or book scanners with automatic or manual page turning.
• In-house or contractor?


Notes on scanners

• When scanning in-house, select the right scanner for the job.
• If you have 50,000 or 100,000 pages to scan, speed and reliability become important.

Example: If 65,000 pages are to be scanned with a scanner that scans 21 pages per minute, the total scanning time is

65,000 pages / (21 pages/min) ≈ 3095.2 min ≈ 51.6 h

This corresponds to about seven eight-hour days, but additional time is needed for loading the ADF, separating pages that are sticking together, etc.
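The same back-of-the-envelope calculation as a short Python sketch (the page count and scanner speed are just the example figures above):

```python
def scanning_time(pages, pages_per_minute, hours_per_day=8):
    """Estimate pure scanning time; loading the ADF, separating stuck pages,
    and rescans come on top of this."""
    minutes = pages / pages_per_minute
    hours = minutes / 60
    return minutes, hours, hours / hours_per_day

minutes, hours, days = scanning_time(65_000, 21)
print(f"{minutes:.1f} min = {hours:.1f} h = {days:.1f} eight-hour days")
# -> 3095.2 min = 51.6 h = 6.4 eight-hour days
```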

• When comparing scanners, note which scan parameters (resolution and depth) were used for speed measurements.
• The daily or monthly duty cycle gives an indication of mechanical quality and suitability for the job size.


Scan parameters

• Optimal scan parameters depend on the intended use.
• In general: higher scan resolution is better, as it captures more details.
• Lower-resolution images can easily be created by scaling down; the inverse is not possible (see the sketch below).
• The resolution must actually be provided by the scanner (optical resolution), not interpolated.
• Higher scan resolution means slower scanning.
• In practice: books are usually scanned at 300 or 600 dpi.
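A minimal downscaling sketch using Pillow; the file names and the factor of 2 are placeholders, not part of the original slides:

```python
from PIL import Image

# Derive a lower-resolution working copy from a 600 dpi master scan
# (converted to 8-bit grayscale so the downscaling is anti-aliased).
master = Image.open("page_0001_600dpi.tif").convert("L")
width, height = master.size
preview = master.resize((width // 2, height // 2), Image.LANCZOS)  # 600 dpi -> 300 dpi
preview.save("page_0001_300dpi.png")
```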


Bilevel or grayscale?

• Should text intended for OCR be scanned in black and white or grayscale?
• Actual OCR is always done on bilevel images.
• OCR software vendors typically recommend grayscale scanning: they want to use their own binarization algorithms (see the sketch below).
• But: it is not certain whether this actually improves OCR quality. Holley (2009): no significant improvement; → Powell (2009).
• Plus: processing of grayscale files increases the cost of OCR: larger files are slower to process for OCR software and need more storage space.
  • 100,000 bilevel images, 600 dpi, Group 4: 8.3 GB
  • 100,000 grayscale images, 600 dpi, Zip: 361.3 GB
• Not a storage problem today, but backup, throughput, and general file management are still issues.
⇒ Run experiments with test pages from the actual documents. If grayscale does not yield significant improvements, use higher-resolution bilevel scans.
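For such experiments, a simple global threshold is often enough to produce bilevel test images from grayscale scans. A minimal sketch with Pillow; the threshold value and file names are illustrative, and commercial engines use more sophisticated, adaptive binarization:

```python
from PIL import Image

def binarize(path, threshold=160):
    """Convert a grayscale scan to a bilevel (1-bit) image using a fixed global threshold."""
    gray = Image.open(path).convert("L")   # ensure 8-bit grayscale
    return gray.point(lambda p: 255 if p > threshold else 0).convert("1")

binarize("page_0001_gray.png").save("page_0001_bw.tif", compression="group4")
```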


File sizes

Example: Scans of typical book pages.

Resolution (dpi)   Depth (bpp)   Compression   File size/page (MB)
 300               1             Group 4        0.04
 600               1             Group 4        0.09
 600               8             none          24.05
 600               8             Zip            3.70
1200               1             Group 4        0.29
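As a sanity check, the uncompressed sizes follow directly from pixel count and bit depth. A small sketch (the page dimensions are taken from the 600 dpi example on the next slide):

```python
def uncompressed_page_mb(width_px, height_px, depth_bpp):
    """Raw image size in MB (1 MB = 2**20 bytes) before any compression."""
    return width_px * height_px * depth_bpp / 8 / 2**20

print(uncompressed_page_mb(4048, 6072, 8))   # ~23.4 MB, close to the 24.05 MB "none" row
print(uncompressed_page_mb(4048, 6072, 1))   # ~2.9 MB before Group 4 brings it down to ~0.09 MB
```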


File formats

• Preferred format for bilevel data: TIFF with Group 4 compression (see the sketch below).
• Even better compression: JBIG2, but encoding is not yet widely available.
  50 scanned pages at 600 dpi (4048 × 6072 pixels):
  • Group 4: 5.6 MB
  • JBIG2: 744 KB
• Group 4 and JBIG2 data can be embedded in PDF files: no special viewer required.
• Commercial solution: → CVISION PDFCompressor; open source: → Sojka (2010).
• For grayscale: TIFF with Zip compression or PNG.
• JPEG is lossy and designed for photographs.
⇒ Never use JPEG for text!
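A minimal sketch of writing these formats with Pillow; the file names are placeholders, and JBIG2 encoding is not covered because Pillow does not write it:

```python
from PIL import Image

# Bilevel master: TIFF with Group 4 compression.
bw = Image.open("page_0001_bw.png").convert("1")
bw.save("page_0001_bw.tif", compression="group4")

# Grayscale master: TIFF with Zip (deflate) compression; PNG is an alternative.
gray = Image.open("page_0001_gray.png").convert("L")
gray.save("page_0001_gray.tif", compression="tiff_deflate")
```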


OCR

• High quality of OCR output on modern documents critically depends on NLP resources, in particular dictionaries.
• If the pattern recognition engine cannot decide between multiple possible readings (e.g., cunent vs. current), the dictionary is consulted (see the sketch below).
• Readings found in the dictionary are preferred.
• Additional language-specific rules help resolve issues such as capitalization and punctuation (e.g., commas vs. periods, and lowercase ell vs. capital I vs. the number 1).
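A toy sketch of the dictionary lookup step (not the API of any real OCR engine; the word list is obviously illustrative):

```python
DICTIONARY = {"current", "the", "of", "and"}   # real engines ship large lexica

def pick_reading(candidates):
    """Prefer the first candidate reading found in the dictionary;
    fall back to the recognizer's top-ranked candidate otherwise."""
    for word in candidates:
        if word.lower() in DICTIONARY:
            return word
    return candidates[0]

print(pick_reading(["cunent", "current"]))   # -> "current"
```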


OCR for historical documents

• For historical languages the corresponding resources do not exist, and they are not easy to create (but see the work of the IMPACT project).
• Spelling variation makes the construction of lexical resources difficult.
• Typefaces used in historical prints are often hard to read.
• ABBYY offers support for 19th-century documents in blackletter typefaces (previously in FineReader XIX, now in Recognition Server).
• No general solution for older texts, though.


OCR quality for historical documents

• How well does OCR work for historical documents?
• It depends...
• Character error rate vs. word error rate (see the sketch below)
• Experiment for the Collection of Swiss Law Sources (high image quality, roman typeface): 86.65% word accuracy.
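A minimal sketch of how word accuracy can be computed, as 1 minus the word error rate obtained from a word-level edit distance. This is not the evaluation script behind the reported figure; the hypothesis string is taken from the OCR output shown below, and the reference is an assumed correction of its first words:

```python
def word_accuracy(reference, hypothesis):
    """1 - WER, with WER = word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    dist = list(range(len(hyp) + 1))                       # dynamic-programming edit distance
    for i, r in enumerate(ref, 1):
        prev, dist[0] = dist[0], i
        for j, h in enumerate(hyp, 1):
            prev, dist[j] = dist[j], min(dist[j] + 1,      # deletion
                                         dist[j - 1] + 1,  # insertion
                                         prev + (r != h))  # substitution
    return 1 - dist[-1] / len(ref)

print(word_accuracy("Rath und Gemeinde von Zürich", "liath und Gemeinde von Zürich"))  # 0.8
```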

[Image: raw OCR output for charter no. 1097 (agreement between Zürich and Winterthur, 27 May 1254), illustrating typical recognition errors such as «liath» for «Rath» and «Zeuqen» for «Zeugen».]


Example page from Collection of Swiss Law Sources

[Image: facsimile page (p. 752, nos. 467–468) from the Collection of Swiss Law Sources: letters of the Ammann and Rat of Davos to Chur, dated 24 June 1469 and 26 November 1470, set in roman type with critical apparatus and marginal line numbers.]


Improving OCR quality

• Difficult to get inside (commercial) OCR systems.
• How to improve OCR results for historical texts?

NLP-related ideas (→ Holley 2009):

• Use more than one OCR system and voting to pick the best results (→ Handley/Hickey 1991); see the sketch below.
• Use special dictionaries in the OCR process.
• Manual correction of OCR-produced text.
• Language modeling during or after OCR processing.
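A toy sketch of the voting idea (not the method of Handley/Hickey 1991): it assumes the engines' outputs are already aligned token by token and simply takes the majority reading per position; real systems must first align the outputs.

```python
from collections import Counter

def vote(*ocr_outputs):
    """Majority vote over token-aligned OCR outputs from several engines."""
    merged = []
    for readings in zip(*(text.split() for text in ocr_outputs)):
        merged.append(Counter(readings).most_common(1)[0][0])
    return " ".join(merged)

print(vote("Rath und Gemeinde", "liath und Gemeinde", "Rath unci Gemeinde"))
# -> "Rath und Gemeinde"
```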


Manual text entry

• If the texts have already been digitized (e.g., in a previous project or for Project Gutenberg), error checking and correction may be required to achieve the desired quality.

• If the texts are available in printed form with good print quality and in roman type, optical character recognition (OCR) or double-keying may be options; for both approaches the documents have to be scanned first.

• If the texts are printed and the print quality is low, or the texts are printed in blackletter types, it depends on the language and the desired quality whether OCR (with post-processing) is still an option; otherwise qualified double-keying is likely to be required.

• If the texts are handwritten, double-keying by qualified personnel is currently the only option.


Double-keying

• Double-keying:
  1. Independent keyboarding by (at least) two typists
  2. Automatic comparison of the resulting texts (see the sketch below)
  3. Error correction
• Often associated with offshore text entry (outsourcing of keyboarding to service providers in countries with lower wages, in particular the Philippines, India, or China)
• In fact: a general method for quality control
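A minimal sketch of the comparison step, using Python's difflib to flag every span where two independently keyed versions disagree (the example strings are illustrative):

```python
import difflib

def keying_differences(version_a, version_b):
    """Return the non-matching spans between two independently keyed versions,
    so that a corrector only has to inspect the flagged positions."""
    matcher = difflib.SequenceMatcher(None, version_a, version_b)
    return [(op, version_a[i1:i2], version_b[j1:j2])
            for op, i1, i2, j1, j2 in matcher.get_opcodes() if op != "equal"]

a = "Rath und Gemeinde von Zürich und Winterthur"
b = "Rath und Gemeinde von Zurich und Winterthur"
print(keying_differences(a, b))   # -> [('replace', 'ü', 'u')]
```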


Double-keying

• (Offshore) double-keying can be a cost-effective and highly accurate solution.
• If OCR is also feasible, weigh the researcher time required for post-OCR correction against the cost of double-keying.
• Double-keying can achieve > 99.9% accuracy (99.995% accuracy means less than one error per 20,000 characters).
• Cost depends on complexity, amount of text, and desired quality.
• Some numbers (see the cost sketch below):
  • Text entry: 0.40 to 0.70 EUR per 1000 characters
  • Quality control: 0.15 EUR per 1000 characters
• Postprocessing: data format conversion, semantic markup, etc. (as for OCR output)

Book page ≈ 3440 characters
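A rough cost sketch based on the figures above; whether the entry rate already covers both keyings depends on the service provider, so treat the result as an order of magnitude only:

```python
def keying_cost_eur(pages, chars_per_page=3440,
                    entry_rate_per_1000=0.55, qc_rate_per_1000=0.15):
    """Estimate double-keying cost from the per-1000-character rates quoted above."""
    chars = pages * chars_per_page
    return chars / 1000 * (entry_rate_per_1000 + qc_rate_per_1000)

print(round(keying_cost_eur(300)))   # a 300-page book: roughly 720 EUR
```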


Expert transcription

[Image: page 216 from Le Bourgeois & Emptoz on the DEBORA project (Fig. 32, “Diversity of fonts and obsolete characters”), describing computer-assisted transcription of Renaissance books: thanks to character-pattern redundancy, only about 2% of all characters have to be entered manually, so a 200-page book with 2,000 characters per page can be transcribed in about 6 hours.]

• Material unsuitable for OCR (e.g., handwritten documents, blackletter older than the 19th century, anything that requires human judgment) ⇒ “expert” transcription.
• Tools may help transcribers: computer-assisted transcription.


Transcription tools

Several projects have produced tools to support transcribers (“CAT,” computer-aided transcription), for example:

• AGORA for the Bibliothèques Virtuelles Humanistes project (→ http://www.bvh.univ-tours.fr/)
• DEBORA CAT tool for the DEBORA EU project
• GIDOC for iDoc (Interactive Analysis, Transcription and Translation of Old Text Documents), Instituto Tecnológico de Informática, Spain
• OTTO for the Reference Corpus Middle High German (1050–1350) (Ruhr University Bochum, Germany)
• T-PEN (Saint Louis University, USA)


DEBORA

[Image: page 217 from the DEBORA article (“DEBORA: Digital Access to BOoks of the RenAissance”), showing Fig. 33, “Software interface for the assisted transcription,” and Fig. 34, “Partial representation of the book’s structure in DEBORA format.”]


iDoc: GIDOC tool

• Prototype system for computer-assisted transcription of handwritten historical documents
• Integrates interactive-predictive page layout analysis, text line detection, and transcription
• Based on GIMP
• Open source (→ http://gidoc.sf.net/)

[Screenshots of the GIDOC tool]


OTTO

• Transcription tool designed for diplomatic transcription of historical texts
• Focus on easy entry and instant rendering
• Supports management of transcription projects involving distributed, collaborative work


OTTO

[Image: page from the OTTO paper, with Figure 1, “Screenshot of the text editor.” Transcribers type substitute characters for non-ASCII glyphs (e.g., $ for long s ⟨ſ⟩, u\o as in Cu\onrat); user-defined “search-and-replace” transcription rules, defined locally or globally, map the input to its diplomatic Unicode form, which is rendered instantly in a parallel frame using the Junicode font. OTTO runs on any server with PHP > 5.2 and is used through a standard web browser; a live demo is available at http://underberg.linguistics.rub.de/ottolive.]
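A minimal sketch of such search-and-replace transcription rules; this only mimics the idea, it is not OTTO's implementation, and the rule table (including the Unicode target for u\o) is illustrative:

```python
# Map substitute characters typed by the transcriber to their diplomatic Unicode form.
TRANSCRIPTION_RULES = {
    "$": "\u017f",       # long s (ſ)
    "u\\o": "u\u0366",   # u with a small o written above it (assumed target), as in Cu\onrat
}

def to_diplomatic(line, rules=TRANSCRIPTION_RULES):
    """Apply the rules in order of decreasing pattern length so that
    multi-character substitutes are replaced before single characters."""
    for pattern in sorted(rules, key=len, reverse=True):
        line = line.replace(pattern, rules[pattern])
    return line

print(to_diplomatic("De$ Cu\\onrat"))   # -> "Deſ Cuͦnrat"
```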


T-PEN

• Web-based tool
• Performs automatic line and column segmentation, then provides the transcriber with a text entry box synchronized with the lines in the manuscript image
• Lines can be annotated
• Integrates project management facilities for collaboratively working on transcriptions
• Transcriptions can be exported in a variety of formats, including TEI (see the sketch below)
• Can be used on the project’s server at Saint Louis University (→ http://t-pen.org/)
• Planned to be released as open-source software
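For orientation, a minimal sketch of what a TEI-style export of transcribed lines could look like; this is a generic, hand-rolled example, not T-PEN's actual export format, and the title and line texts are placeholders:

```python
def to_tei(title, lines):
    """Wrap transcribed lines in a bare-bones TEI P5 document, marking line starts with <lb/>."""
    body = "\n        ".join(f'<lb n="{i}"/>{line}' for i, line in enumerate(lines, 1))
    return f"""<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>{title}</title></titleStmt>
      <publicationStmt><p>Unpublished transcription</p></publicationStmt>
      <sourceDesc><p>Transcribed from a manuscript image</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p>
        {body}
      </p>
    </body>
  </text>
</TEI>"""

print(to_tei("Example letter", ["first transcribed line", "second transcribed line"]))
```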

[Screenshot of T-PEN]