IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands. JISC Workshop 2009, Bath, UK Targeted Language Resources for the Digitization of Historical Collections Annette Gotscharek, Ulrich Reffle, Christoph Ringlstetter, Klaus U. Schulz CIS, University of Munich

Targeted Language Resources for the Digitisation of Historical Collections


Page 1: Targeted Language Resources for the Digitisation of Historical Collections

IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.

JISC Workshop 2009, Bath, UK

Targeted Language Resources for the Digitization of Historical Collections

Annette Gotscharek, Ulrich Reffle, Christoph Ringlstetter,

Klaus U. Schulz

CIS, University of Munich

Page 2: Targeted Language Resources for the Digitisation of Historical Collections


Questions and Methods

For historical documents of a specific period: what kind of linguistic resources are needed?

What kind of improvements can be expected? What are the consequences for engineering and processes?

----------------------------------------------------------------------
(1) Corpus analysis
(2) Quantitative experiments on OCR and IR

Page 3: Targeted Language Resources for the Digitisation of Historical Collections


Survey

1. Special Challenges to Digitize Historical Materials
2. Composing and Analyzing a Historical Corpus
3. Types of Linguistic Resources
4. Evaluation of Benefits: OCR, IR
5. Consequences for Engineering

Page 4: Targeted Language Resources for the Digitisation of Historical Collections


0. CIS within IMPACT

• (1) Text Recognition: Adaptation of Optical Character Recognition to historical documents

• (2) Resource Building: Enrichment of texts to improve Information Retrieval (IR) on historical documents

• (3) Research beyond IMPACT: steps towards a next-generation interface for accessing collections of historical documents

Page 5: Targeted Language Resources for the Digitisation of Historical Collections


Digitization Projects for Historical Materials

SELECT Create a Historical Collection

SCAN Create Images: Greyscale, Color, QA

PROCESS Improve Images

OCR/Type Create a Symbolical Representation, QA

INDEX Process a Term-Document Representation

PRESENT Provide a User Interface for Access

Page 6: Targeted Language Resources for the Digitisation of Historical Collections


1. Special Challenges for the Digitization of Historical Materials

Timeline axis: 1500, 1600, 1700, 1800, 1900

• Imaging: Damages on Originals
• Optical Character Recognition: Rate of Recognition Errors
• Information Retrieval: Historical Variants
• Human Reading: Unknown Words

Page 7: Targeted Language Resources for the Digitisation of Historical Collections


Optical Character Recognition: Gothic, good quality

Städte den römischen mumcizmg gleich zu stellen. Allem wenn sich je in einem Rechtstheile die altrechtlichen teutschen Gewohnheiten, und Gesetze erhalten haben, so ist es gewiß in dieser Lehre, man mag entweder auf die Befugniß, die Stadtgerechtigkeit zu ertheilen , oder auf die innere Regimentsverfftssung so-

Page 8: Targeted Language Resources for the Digitisation of Historical Collections


Optical Character Recognition: medium quality

Fürsten zu Gstternwerden/wer wollte vermainen / daßwtIhroKhurftrstl Durchl gnädiglsterHcttVatterinderpictcrrndFrombkcltallmFürstenvorzusetzen!scyn/vnd das halst> in^cclcQ^ vci pluz^uäzn 5accr6o5 daß tl iN KilchkN GottW wehr als ein Priester.

Page 9: Targeted Language Resources for the Digitisation of Historical Collections


Examples of Noise Introduced by Imperfect OCR

(1) Processed word images may lead to false friends: Fischerei - Tischlerei: F -> T, h -> hl (Eng. fishery - carpentry)

(2) Processed word images may relate to no word at all

(3) Severe word segmentation errors

OCR on Gothic materials: good (WER < 10%); medium (10-30%); bad (> 30%)

Sample OCR output:

vndExcmpelFürstl-vnd HeroischerTuzenF

^.uglltt. schreibet/
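The slides do not say how the word error rate (WER) was measured. As a minimal sketch, assuming the common definition (word-level edit distance divided by the number of reference words) and reading the "bad" band as everything above 30%, the quality bands could be computed as follows. The function names and the sample sentence are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words read so far and hypothesis[:j]
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            # deletion, insertion, substitution (or match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(reference)


def quality_band(wer):
    """Map a WER value to the bands good / medium / bad (thresholds 10% and 30%)."""
    if wer < 0.10:
        return "good"
    if wer <= 0.30:
        return "medium"
    return "bad"


# Hypothetical sample: one false friend in a four-word line.
ground_truth = "die Fischerei am See".split()
ocr_output = "die Tischlerei am See".split()
wer = word_error_rate(ground_truth, ocr_output)
print(wer, quality_band(wer))
```

Note that a false friend such as Tischlerei counts as a single word error here, although it is more harmful for retrieval than an obvious non-word.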

Page 10: Targeted Language Resources for the Digitisation of Historical Collections


Why is it so bad?

(1) The image quality is challenging; further processing is needed

(2) The classifiers of the OCR disregard certain typefaces used in historical print

(3) The language resources of the OCR are inappropriate for historical language

Page 11: Targeted Language Resources for the Digitisation of Historical Collections


IR: Search Problem even for keyed collections

– kräuter (= Eng. herbs) as kraͤuter, kreuter, kreüter, creuther

Variants shown in the diagram: Kräuter, krauter, Kreüter, kreuter, creuther

0 Results for Kreuter

Page 12: Targeted Language Resources for the Digitisation of Historical Collections


Special challenge for IR and OCR on historical texts: Spelling variation

Missing normalization of orthography leads to plenty of spelling variants in historical documents, e.g. in German texts (1500-1850):

– teil (= Eng. part) as theil, teyl, theyl

– kräuter (= Eng. herbs) as kraͤuter, kreuter, kreüter, creuther

– fragte (= Eng. asked) as frug, fruk

The user is not aware of the variants and misses many documents; sometimes there are even false friends.

Solution: Mapping from variants to modern lemma
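One way to use such a mapping at query time is to expand a modern search term with its known historical variants. The following is only a sketch under that assumption. The lexicon holds just the examples from this slide, and `build_expansion_index` and `expand_query` are hypothetical names, not part of any IMPACT tool.

```python
from collections import defaultdict

# Toy normalization lexicon: historical variant -> modern form (examples from the slide).
VARIANT_TO_MODERN = {
    "theil": "teil", "teyl": "teil", "theyl": "teil",
    "kreuter": "kräuter", "kreüter": "kräuter", "creuther": "kräuter",
    "frug": "fragte", "fruk": "fragte",
}


def build_expansion_index(variant_to_modern):
    """Invert the lexicon: modern form -> set of historical variants."""
    index = defaultdict(set)
    for variant, modern in variant_to_modern.items():
        index[modern].add(variant)
    return index


def expand_query(term, index):
    """Return the query term together with all variants recorded for it."""
    term = term.lower()
    return {term} | index.get(term, set())


index = build_expansion_index(VARIANT_TO_MODERN)
print(sorted(expand_query("Teil", index)))
```

The alternative design is to normalize the documents at indexing time instead of expanding the query; both rely on the same variant-to-modern mapping.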

Page 13: Targeted Language Resources for the Digitisation of Historical Collections


Language Resources to tackle challenges encountered in OCR and IR

• Lexica for OCR
• Language Models for OCR
• Statistical Information about transformation patterns
• Historical Stopwords for IR
• Normalization Lexica with a mapping between modern and historical word forms for IR
• Syntactic Information for paradigmatic expansion and disambiguation at POS level

Page 14: Targeted Language Resources for the Digitisation of Historical Collections


Mantra: All Language Resources are Corpus Based

Possible sources:

• Keyed Materials on the Web

• Non Public Electronic Corpora

• Keying/corrected OCR of Image Corpora

• Noisy OCR Corpora

Page 15: Targeted Language Resources for the Digitisation of Historical Collections


Status of German historical Corpora

1. Main development corpus
• Proofread texts from 1400 to 1900
• Medium size: 2.7 million tokens
• For lexicon construction
• For diachronic analysis/classification of vocabulary of distinct periods

2. OCR corpus for lexicon testing
• OCRed images aligned with ground truth
• Texts from 16th, 18th, 19th century (5034, 2659, 18052 tokens)

3. IR test corpus for lexicon testing
• Special linguistically annotated ground truth
• Texts from 16th, 17th, 18th, 19th century: 31080 tokens

Page 16: Targeted Language Resources for the Digitisation of Historical Collections


Modern lexicon (CISLEX): coverage on Main Corpus on 10 periods

Language in a historical corpus for German
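The chart itself is not reproduced here. As a rough sketch of how lexicon coverage per period could be measured, assuming coverage means the share of corpus tokens found in the lexicon and assuming ten periods of 50 years each over 1400 to 1900 (neither is stated on the slide), with hypothetical names and toy data:

```python
from collections import defaultdict


def coverage_by_period(tokens_with_year, lexicon, period_length=50):
    """Return {period start year: share of tokens found in the lexicon}."""
    found = defaultdict(int)
    total = defaultdict(int)
    for token, year in tokens_with_year:
        start = year - year % period_length  # e.g. 1520 -> 1500
        total[start] += 1
        if token.lower() in lexicon:
            found[start] += 1
    return {start: found[start] / total[start] for start in sorted(total)}


# Toy data: (token, year of the source text); the lexicon stands in for a modern word list.
corpus = [("theil", 1520), ("und", 1530), ("teil", 1850), ("und", 1860)]
modern_lexicon = {"teil", "und"}
print(coverage_by_period(corpus, modern_lexicon))
```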

Page 17: Targeted Language Resources for the Digitisation of Historical Collections


Compounds (modern components); coverage on Main Corpus on 10 periods

Language in a historical corpus for German

Page 18: Targeted Language Resources for the Digitisation of Historical Collections


Two Variants of Lexica for IR and OCR

Hypothetical Lexicon: tries to map input strings to modern lexicon entries dynamically, via a special approximate matching procedure that uses historical transformation patterns.

Witnessed Lexicon: corpus-checked lexicon entries:

Historical spelling variant + modern lemma for IR

Historical word list for OCR

Page 19: Targeted Language Resources for the Digitisation of Historical Collections


Hypothetical Lexicon: approximate matching procedure

Many of the spelling variations can be traced back to a modern word by applying characteristic patterns / rewrite rules
– e.g. the historical string theyle can be traced to its modern equivalent teile by applying th → t and ey → ei.

Required resources:
– Contemporary lexicon with inflected word forms
– Set of typical language-specific spelling variation patterns
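The idea can be illustrated with a brute-force sketch: apply each rewrite rule optionally at every position where it fits, then look the resulting strings up in the modern lexicon. This is not the "special approximate matching procedure" of the project, whose implementation is not described on the slides. The function names are hypothetical, and the lexicon contains only the example entries shown in this talk.

```python
# Rewrite rules (historical -> modern) named on the slide.
PATTERNS = [("th", "t"), ("ey", "ei")]

# Toy modern lexicon: inflected form -> lemmas.
MODERN_LEXICON = {
    "teile": ["teil", "teilen"],
    "taille": ["taille"],
    "fragte": ["fragen"],
}


def rewrite_variants(word, patterns):
    """All strings obtained by optionally applying a rule at each position of word."""
    if not word:
        return {""}
    # Option 1: keep the first character unchanged.
    variants = {word[0] + rest for rest in rewrite_variants(word[1:], patterns)}
    # Option 2: apply a rule whose left-hand side starts at this position.
    for old, new in patterns:
        if word.startswith(old):
            tails = rewrite_variants(word[len(old):], patterns)
            variants |= {new + rest for rest in tails}
    return variants


def match_modern(word, patterns=PATTERNS, lexicon=MODERN_LEXICON):
    """Return {modern inflected form: lemmas} for every rewrite found in the lexicon."""
    candidates = rewrite_variants(word.lower(), patterns)
    return {form: lexicon[form] for form in candidates if form in lexicon}


print(match_modern("theyle"))
print(match_modern("frug"))
```

The enumeration grows exponentially with the number of rule positions, so it is only suitable for illustrating the principle on single words.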

Page 22: Targeted Language Resources for the Digitisation of Historical Collections


Approximate matching procedure

Modern lexicon
Inflected forms: teile, ..., taille, fragte
Lemmatizing information: teil (= part), teilen (= to share), taille (= waist), fragen (= to ask)

~ 140 Patterns: th → t, ei → ai, ey → ei, l → ll, …

Spelling variation: theile

Page 27: Targeted Language Resources for the Digitisation of Historical Collections


Approximate matching procedure

Modern lexicon
Inflected forms: teile, ..., taille, fragte
Lemmatizing information: teil (= part), teilen (= to share), taille (= waist), fragen (= to ask)

~ 140 Patterns: th → t, ei → ai, ey → ei, l → ll, …

Spelling variation: frug → ?

Page 28: Targeted Language Resources for the Digitisation of Historical Collections


Approximate matching procedure

Advantages:
– No manual work needed
– Dynamic approach

Limitations:
– Mismatches may link a historical spelling variation to a wrong modern word.
– A part of the historical vocabulary cannot be reduced to a modern word by simple matching.
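Both limitations show up with the four example patterns and the example lexicon from the preceding slides: theile reaches teile, but also the false friend taille, while frug reaches nothing. The sketch below counts rule applications and lists matches with fewer rewrites first. That ranking is an assumption made for illustration, not a method described in the talk, and the function names are hypothetical.

```python
PATTERNS = [("th", "t"), ("ei", "ai"), ("ey", "ei"), ("l", "ll")]
LEXICON = {"teile", "taille", "fragte"}


def rewrites_with_cost(word, patterns):
    """Map each reachable string to the smallest number of rule applications producing it."""
    if not word:
        return {"": 0}
    best = {}

    def offer(string, cost):
        if string not in best or cost < best[string]:
            best[string] = cost

    # Keep the first character unchanged.
    for rest, cost in rewrites_with_cost(word[1:], patterns).items():
        offer(word[0] + rest, cost)
    # Or apply a rule that starts at this position (one more rule application).
    for old, new in patterns:
        if word.startswith(old):
            for rest, cost in rewrites_with_cost(word[len(old):], patterns).items():
                offer(new + rest, cost + 1)
    return best


def ranked_matches(word, patterns, lexicon):
    """Lexicon entries reachable from word, sorted by number of rule applications."""
    costs = rewrites_with_cost(word.lower(), patterns)
    return sorted((cost, form) for form, cost in costs.items() if form in lexicon)


print(ranked_matches("theile", PATTERNS, LEXICON))  # teile (1 rule) before taille (3 rules)
print(ranked_matches("frug", PATTERNS, LEXICON))    # no pattern leads to fragte
```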

Page 29: Targeted Language Resources for the Digitisation of Historical Collections


Hypothetical lexicon; coverage on Main Corpus on 10 periods

Language in a historical corpus for German

Page 31: Targeted Language Resources for the Digitisation of Historical Collections


Manually collected special lexica

Modern lexicon
Inflected forms: teile, ..., taille, fragte
Lemmatizing information: teil (= part), teilen (= to share), taille (= waist), fragen (= to ask)

Spelling variation: theile, frug

Manual mapping

Page 32: Targeted Language Resources for the Digitisation of Historical Collections


Manually collected special lexica

Advantages:
– Associations between historical variant and modern lemma are safe
– Associations that are not covered by the matching approach can be stored explicitly

Limitations:
– Time-consuming and labor-intensive; situations occur where specialists (historical linguists) are needed.
– Hardly ever complete because of the immense number of spelling variants

Page 33: Targeted Language Resources for the Digitisation of Historical Collections


Evaluation of the approximate matcher: is the lexicon redundant?

Few empirical studies exist on crucial decisions for IR and OCR on historical texts:
– Is a matching approach enough?
– Do we need a lexicon, and if so, in which scenarios?

Page 34: Targeted Language Resources for the Digitisation of Historical Collections


Lexica built

Both assign modern lemmas to historical full forms:

Validated lexicon constructed from Main Corpus: currently ca. 15,000 entries (“kernel lexicon for hist. German”); poor coverage

Hypothetical lexicon for IR: matching procedure mapping a historical full form to its modern counterpart, plus a lemmatizer for the modern language (historical full form <---> modern lemma); based on 140 patterns, theoretically 100 million entries. High coverage, but assignments can be erroneous, and only regular (pattern-based) correspondences can be captured

Witnessed lexicon for OCR: from Main Corpus, 200,000 tokens without modern correspondence; coverage still limited by corpus size

Page 35: Targeted Language Resources for the Digitisation of Historical Collections


Evaluation of OCR Results

Test corpus for 16th, 18th, and 19th century

Development version of a professional OCR engine with an external dictionary interface

Experiments with different lexicon settings:
– No additional lexicon, character model only
– Modern German lexicon
– Corpus-based witnessed lexicon
– Hypothetical lexicon

Page 36: Targeted Language Resources for the Digitisation of Historical Collections


Dictionary | 16th century: no. of word errors | Reduction of error rate | 18th century: no. of word errors | Reduction of error rate | 19th century: no. of word errors | Reduction of error rate
No Lexicon | 1306 | - | 827 | - | 2074 | -
Optimal Lexicon | 756 | 42% | 395 | 52% | 612 | 70%
Modern Lexicon | 1096 | 16% | 501 | 39% | 888 | 57%
W. Historical Lexicon | 938 | 28% | 481 | 42% | 856 | 59%
Modern + Virtual H.L. | 1011 | 25% | 480 | 42% | 849 | 59%

Annotations on the slide: WER > 50%; WER ~ 10%
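The reduction figures follow from the word error counts as (errors without lexicon - errors with lexicon) / errors without lexicon. A short sketch that recomputes them from the counts in the table is given below; the variable and function names are hypothetical. Recomputation reproduces every reported percentage except one: for "Modern + Virtual H.L." in the 16th century column the counts give about 23%, whereas the slide reports 25%, so one of the two figures may have been transcribed incorrectly.

```python
# Word error counts per dictionary setting for the 16th, 18th and 19th century test sets.
WORD_ERRORS = {
    "No Lexicon": (1306, 827, 2074),
    "Optimal Lexicon": (756, 395, 612),
    "Modern Lexicon": (1096, 501, 888),
    "W. Historical Lexicon": (938, 481, 856),
    "Modern + Virtual H.L.": (1011, 480, 849),
}


def error_reduction(baseline_errors, errors):
    """Relative reduction of the number of word errors against the baseline."""
    return (baseline_errors - errors) / baseline_errors


def reduction_table(word_errors, baseline_key="No Lexicon"):
    """Return {setting: reduction in whole percent per century}."""
    baseline = word_errors[baseline_key]
    return {
        name: tuple(round(100 * error_reduction(b, e)) for b, e in zip(baseline, counts))
        for name, counts in word_errors.items()
        if name != baseline_key
    }


for name, reductions in reduction_table(WORD_ERRORS).items():
    print(name, reductions)
```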

Page 37: Targeted Language Resources for the Digitisation of Historical Collections


Evaluation of the approximate matcher in an IR scenario

Proofread documents from the 16th, 17th, 18th and 19th century and tagged each token manually.

Collected a list of (historical) stopwords.

Defined “precision” and “recall” for our scenario.
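The slides do not give the definitions that were used. Purely as an illustration, one plausible reading is: precision is the share of proposed (token, lemma) assignments that agree with the manual tagging, and recall is the share of tagged tokens whose correct lemma is among the proposals. The sketch below implements that assumed reading; all names and the toy data are hypothetical.

```python
def precision_recall(proposed, gold):
    """proposed: {token: set of proposed lemmas}; gold: {token: manually tagged lemma}."""
    n_proposed = sum(len(lemmas) for lemmas in proposed.values())
    n_correct = sum(
        1 for token, lemma in gold.items() if lemma in proposed.get(token, set())
    )
    precision = n_correct / n_proposed if n_proposed else 0.0
    recall = n_correct / len(gold) if gold else 0.0
    return precision, recall


# Toy data: the matcher proposes a false friend for theile and nothing for frug.
proposed = {"theile": {"teil", "taille"}, "frug": set()}
gold = {"theile": "teil", "frug": "fragen"}
print(precision_recall(proposed, gold))
```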

Page 38: Targeted Language Resources for the Digitisation of Historical Collections


Main insights IR experiment

18th and 19th century:
– Pure matching approach leads to good precision values.
– Recall values are acceptable.

Page 39: Targeted Language Resources for the Digitisation of Historical Collections


Main insights IR experiment

16th and 17th century:
– Precision of the matching approach is poor; a lexicon will help to avoid wrong matches.
– Recall values show that a large number of words can only be explained by a special lexicon.

Page 40: Targeted Language Resources for the Digitisation of Historical Collections


Answer to our refined Question

Is the lexicon redundant for OCR/IR on historical texts?

Depends on the material, especially on the date of origin of the collection:
– Matching approach leads to acceptable results for 19th and 18th century collections.
– Serious limitations for 16th and 17th century collections. Special lexica will lead to important improvements.

Combination of matching approach and manually collected lexica may lead to optimal results.

For postprocessing, validated lexica are needed.
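A combination of the two resources could be organised as a simple fallback: consult the manually checked lexicon first and use pattern matching only for words it does not contain. The sketch below assumes that ordering, uses toy data taken from the examples in this talk, and simplifies pattern application to whole-word string replacement. The names are hypothetical.

```python
from itertools import combinations

WITNESSED = {"frug": "fragen", "theile": "teil"}  # checked variant -> modern lemma
MODERN_LEMMAS = {"teile": "teil", "fragte": "fragen", "taille": "taille"}
PATTERNS = [("th", "t"), ("ey", "ei")]


def pattern_candidates(word, patterns):
    """Strings obtained by applying any subset of patterns to all of their occurrences."""
    candidates = set()
    for size in range(len(patterns) + 1):
        for subset in combinations(patterns, size):
            rewritten = word
            for old, new in subset:
                rewritten = rewritten.replace(old, new)
            candidates.add(rewritten)
    return candidates


def normalize(word, witnessed=WITNESSED, modern=MODERN_LEMMAS, patterns=PATTERNS):
    """Return (lemma, source); source is 'witnessed', 'hypothetical' or 'unknown'."""
    word = word.lower()
    if word in witnessed:
        return witnessed[word], "witnessed"
    hits = sorted(modern[c] for c in pattern_candidates(word, patterns) if c in modern)
    if hits:
        return hits[0], "hypothetical"  # arbitrary choice if several lemmas match
    return word, "unknown"


print(normalize("frug"))
print(normalize("theyle"))
```

Returning the source alongside the lemma lets a later processing step treat hypothetical assignments with more caution than witnessed ones.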

Page 41: Targeted Language Resources for the Digitisation of Historical Collections


Engineering Consequences

A Focus Collection of the Bavarian State Library

VD16: Collection of Early High German Books

Collaborative project with BSB on OCR/IR for this collection: Clemens Neudecker/Fedor Bochow

Special lexicon building needed

No 16th century electronic corpora available for lexicon development

For a real-world test we defined a topic area as main interest: Theology

Page 42: Targeted Language Resources for the Digitisation of Historical Collections


Engineering Consequences

Iterative Process with Bavarian State Library to Create Resources for VD16 Collection

(1) A random selection of 200 pages from 100 sources

(2) OCR and corpus experiments

(3) Selection of usable sources

(4) Specification of keying by BSB/CIS for 70 complete books usable for both presentation and linguistic resources building

(5) Contract with service providers

Page 43: Targeted Language Resources for the Digitisation of Historical Collections


Engineering Consequences

A Focus Collection of the Bavarian State Library

VD16: Collection of Early High German Books with 30 million pages

Integrate OCR Supplier: Special Typeface Models; Character Models

Improve OCR with a specialized historical lexicon

Improve IR access with a normalization lexicon

Linguistic Database for VD16

Page 44: Targeted Language Resources for the Digitisation of Historical Collections


Wrap Up

• For challenging historical materials, specialized lexica are needed

• Special lexica integrated directly into the basic OCR engine lift OCR quality significantly. For bigger projects, seek direct collaboration with OCR partners

• For IR: use approximate matching or normalization lexica to process user queries

• Integrate research institutions and collection holders