IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.
Katrien Depuydt (Institute for Dutch Lexicology, Leiden)
A gentle introduction to lexicon building and lexicon application
Outline
- What is a lexicon
- Lexica in IMPACT
- Lexicon building and lexicon application tools
- Results so far with focus on Dutch
IMPACT workshop, Bratislava, May 7, 2010
What is a lexicon?
Lexicon vs. electronic dictionary (1)
An electronic dictionary is
- Of course: digitized full text (no images)
- Primarily: for human use
- Ideally: searchable, with explicitly (XML) tagged information: lemma, part of speech, meaning, quotations etc.
- Example: online Oxford English Dictionary
Dictionary XML (example)
Lexicon vs. electronic dictionary (2)
A computational lexicon is
- Of course: in a structured digital format (XML, relational database)
- Primarily: for use in computer applications
- Has explicitly coded information (e.g. lemma, part of speech, morphology, semantics, syntax…)
Used for (for instance):
- Linguistic annotation
- Enhanced retrieval (basic: inflected forms; advanced: synonyms etc.)
- Syntactic parsing, machine translation
Lexica in IMPACT
The OCR lexicon
An OCR lexicon is
- A verified list of words in a language
- Based on a corpus, dated to enable relevant selection
- Preferably with frequency information
- Preferably from the same period/text type as the documents you want OCR’d (selection!)
OCR lexicon example
Each column lists word forms with their frequency.

From WNT attestation lexicon    From DBNL historical corpus
absoluut 8                      wechgerukt 5
absoluyt 2                      wechgeschickt 6
absoluyter 1                    wechgeven 6
absolveren 3                    wech-gevoerde 11
absolverende 1                  wechgevoerde 14
absorbeeren 1                   wech-gevoert 59
absorbeert 1                    wechgevoert 98
absorberen 1                    wechgeworpen 21
absorptie 3                     wechghenomen 12
absoute 2                       wechghevoert 7
abstineeren 1                   wechginck 5
abstinencie 1                   wechloopen 6
abstinentie 2                   wechneemt 11
abstineren 1                    wechneme 6
abstrackheyt 1                  wech-nemen 20
abstract 7                      wechnemen 74
abstracta 1                     wechneminge 12
abstracte 7                     wech-neminge 6
abstracten 4                    wechrapen 6
abstractheid 1                  wechrucken 6
abstractie 1                    wechruiming 7
abstractiën 1                   wecht 7
The IR lexicon
Main information categories:
- wordforms (list of words), plus
  - frequency information
  - quotations (dated sources) from corpora or electronic dictionaries
- MODERN LEMMA (comparable to a dictionary entry) assigned to spelling variants and morphological variants of the same word
The modern lemma forms are the main search keys for retrieval. This is standard practice in corpus linguistics and modern historical lexicography.
<?xml version='1.0'?>
<!DOCTYPE lexicon SYSTEM 'NL_Structure.dtd'>
<lexicon>
  <lexical_entry>
    <lemma_id>219490</lemma_id>
    <modern_lemma>aantuilen</modern_lemma>
    <gloss></gloss>
    <POS>VRB</POS>
    <ne_label></ne_label>
    <language_id></language_id>
    <portmanteau_lemma_id></portmanteau_lemma_id>
    <wordform>
      <form_representation>
        <wordform_id>850026</wordform_id>
        <written_form>tuyld</written_form>
        <attestation>
          <id>92141</id>
          <token_id></token_id>
          <quote>Verhael ick (<I>t.w. een als vrouw verkleede man</I>) haer mijn min in Vrouwelijcker schynen:Sy acht het boertery, en tuyld daer weer op an, Vermits een Vrou niet op een Vrou verlieven kan,</quote>
          <derivation_id>0</derivation_id>
          <document_id>204</document_id>
          <start_pos>119</start_pos>
          <end_pos>124</end_pos>
        </attestation>
      </form_representation>
    </wordform>
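A minimal sketch of how such an entry could be read in Python, using only the standard library. The sample is a shortened copy of the excerpt above; the closing tags, the omission of the DOCTYPE line and the function name `wordforms_by_lemma` are assumptions made for illustration, not part of the IMPACT tools.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed copy of the slide excerpt (closing tags added as an assumption).
SAMPLE = """<lexicon>
  <lexical_entry>
    <lemma_id>219490</lemma_id>
    <modern_lemma>aantuilen</modern_lemma>
    <POS>VRB</POS>
    <wordform>
      <form_representation>
        <wordform_id>850026</wordform_id>
        <written_form>tuyld</written_form>
      </form_representation>
    </wordform>
  </lexical_entry>
</lexicon>"""


def wordforms_by_lemma(xml_text):
    """Map each modern lemma to the list of written forms recorded under it."""
    root = ET.fromstring(xml_text)
    index = {}
    for entry in root.iter("lexical_entry"):
        lemma = entry.findtext("modern_lemma")
        forms = [form.text for form in entry.iter("written_form")]
        index.setdefault(lemma, []).extend(forms)
    return index


lemma_index = wordforms_by_lemma(SAMPLE)
print(lemma_index)
```

An index of this shape is what makes the modern lemma usable as a search key: a query for the lemma can be expanded to all written forms listed under it.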
How to build and apply these lexica?
Lexicon building
Build a lexicon with the aims of
- Benefiting OCR and OCR postcorrection
- Improving retrieval by building a lexicon of variants with the modern lemma as the main entry key
What is delivered
- Tools for lexicon building
- Tools for using the lexicon (lexicon deployment) for enrichment
- Lexicon cookbook
- Best practice and tools to use lexica in OCR
!!! No lexicon will ever contain all variants found in historical text
Types of variation (orthographical and other)
I (most of these can be dealt with by means of patterns)
uytterlijcste uyterlijkste d'uyterlijke uiterlyke uyterlijcke uiterlijke uyterlijck uiterlyken uiterlijkste uiterlicke wterlicke wterlijcke ulterlijk uiterlyk uiterlijk uyterlick wterlicken d'uyterlijcke uiterlijken uiterlijks wterlijck uytterlicke uitterlijke ujterlijke uytterlijk uyterlycke uyterlicken uijterlicke d'uiterlijcke wtterlijcke wterlyke wtterlijk uuterlick uuterlic uyterlijke uyterlijcken uyterlicke d'uiterlyke wterlijke vuyterlijcke uuterlycke uuterlicke wterlijken uyterlijcksten uuyterlicke uuyterlick uuyterlycke uytterlijcke uytterlycke uytterlick vuytterlicke uiterlijker uyterlyck uterliek wterlijcken uiterlijkst uitterlijk uytterlijcken uyterlyk wterlick uutterlijck uuyterlicken uyttelijck uijterlijk uytterlijck uuterlijck uiterlick uitterlyk uuyterlic uuyterlyck uuyterlijck uiterlijck uytterlyck uterlyc wterlijk

II (some of these can be dealt with by patterns and/or fuzzy matching, others can only be handled by explicit listing)
werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled
The “hypothetical” vs. the witnessed lexicon (1)
Mechanisms
- to extend the lexicon
- to assess the plausibility of “hypothetical” words without previous attestations, i.e. words we have not seen before.
The “hypothetical” vs. the witnessed lexicon (2)
Unknown inflected forms of registered lemmata: automatic expansion from the lemma to the full paradigm of word forms: paradigmatic expansion or reverse lemmatization
New spellings of known words can be dealt with by developing a good model of the historical spelling. (The database structure provides for the storage of orthographic variant patterns.)
Previously unseen compounds can be dealt with by means of a good model of word formation. (work scheduled for 2010)
[Diagram labels: Transformation Patterns; Witnessed Modern Word; Hypothetical Modern Word; Historical Variant 1; Historical Variant 2; Virtual lexicon of generated word forms]
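The idea of a virtual lexicon of generated word forms can be sketched as follows: weighted transformation patterns are applied to a modern word to produce hypothetical historical variants, each with a plausibility score. This is only an illustration; the pattern set, the weights, the cut-off `min_weight` and the function name `generate_variants` are invented for the example and are not the IMPACT spelling variation module.

```python
def generate_variants(modern, patterns, min_weight=0.05):
    """Generate hypothetical historical spellings of a modern word.

    patterns maps (modern_substring, historical_substring) to a weight in (0, 1).
    Rewrites are applied left to right; a variant's score is the product of the
    weights of the patterns used to produce it.
    """
    variants = {modern: 1.0}
    # Each frontier item: (word so far, its score, position where rewriting may continue).
    frontier = [(modern, 1.0, 0)]
    while frontier:
        word, weight, start = frontier.pop()
        for (src, dst), pattern_weight in patterns.items():
            pos = word.find(src, start)
            while pos != -1:
                new_word = word[:pos] + dst + word[pos + len(src):]
                new_weight = weight * pattern_weight
                if new_weight >= min_weight and new_weight > variants.get(new_word, 0.0):
                    variants[new_word] = new_weight
                    frontier.append((new_word, new_weight, pos + len(dst)))
                pos = word.find(src, pos + 1)
    return variants


# Hypothetical patterns and weights, chosen only for the example.
example_patterns = {("ui", "uy"): 0.5, ("k", "ck"): 0.4, ("ij", "y"): 0.3}
virtual_lexicon = generate_variants("uiterlijk", example_patterns)
for form, score in sorted(virtual_lexicon.items(), key=lambda item: -item[1]):
    print(form, round(score, 3))
```

The score can be used to decide which generated forms are plausible enough to keep, which is one way to read "assessing the plausibility of hypothetical words".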
What is needed for lexicon building
- Build models of linguistic variation (inflection, orthography)
- Collect variants
Approach
- Cycle: model helps to construct lexicon, and vice versa (induction of rules/patterns)
- Combination of manual work and computational linguistics
- Lexicon building toolkit to support development, containing both computational linguistic tools and tools supporting manual work
Cf. Computational Tools and Lexica to Improve Access to Text, Jesse de Does, Katrien Depuydt
Spelling variation tools (pattern-based)
Language-independent approach:
- Supervised rule (pattern) induction from pairs (“modern” word, historical word), yielding patterns like aa/ae, s/z, …
- Pattern weights are computed from example material
Additional approaches possible:
- Use of aligned data (parallel historical text and modern version)
- Unsupervised pattern weighting (=~ text profiling from TR5)
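A rough sketch of supervised pattern induction from (modern, historical) word pairs. It swaps in Python's `difflib.SequenceMatcher` as the character aligner and uses relative frequency as the weight; both choices are assumptions for illustration and not necessarily what the IMPACT tools do. Note that this aligner yields minimal patterns (a/e rather than aa/ae), and the training pairs are invented examples.

```python
from collections import Counter
from difflib import SequenceMatcher


def induce_patterns(pairs):
    """Count character-level rewrite patterns (modern, historical) from word pairs."""
    counts = Counter()
    for modern, historical in pairs:
        matcher = SequenceMatcher(None, modern, historical, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                # "0" marks an empty side, as in the e/0 notation used on a later slide.
                counts[(modern[i1:i2] or "0", historical[j1:j2] or "0")] += 1
    return counts


def pattern_weights(counts):
    """Turn raw pattern counts into relative-frequency weights."""
    total = sum(counts.values())
    return {pattern: count / total for pattern, count in counts.items()}


# Invented example material: (modern word, historical word).
training_pairs = [("jaar", "jaer"), ("maar", "maer"), ("zee", "see"), ("zon", "son")]
pattern_counts = induce_patterns(training_pairs)
weights = pattern_weights(pattern_counts)
print(weights)
```

With more example material the weights start to reflect how common each substitution is in a given period or text type.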
Lemmatization
Reduction of historical word forms to modern lemma:
Historical word → (1) standard (“modern”) spelling → (2) lemma form
Step (1) is pattern matching, step (2) is the lemmatizer.
Example: Dystels → (1) distels → (2) distel

When we have a perfect or near-perfect modern full form lexicon, the second step is simply lexicon lookup.
But:
1) We will not have full form information for many lemmata (especially the historical ones)
2) Even lemmata present in modern language may have historical inflected forms different from the present-day paradigm
Lemmatization and reverse lemmatization
We also need a lemmatization process for these situations
- A typical lemmatizer assigns some standard form (infinitive, nominative, stem) to inflected forms, usually based on patterns relating the inflected form to the standard form.
But:
- Matching these patterns can be hard to combine with matching both spelling variation patterns and OCR errors (bok/bokken/bokkeu)
- We adopt the solution of actually constructing a “hypothetical modern full form lexicon” containing the most plausible paradigmatic expansions of lemmata
- This construction is carried out by means of a statistical reverse lemmatizer
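The two steps (spelling normalisation, then lookup in an expanded full form lexicon) can be sketched like this. The sketch replaces the statistical reverse lemmatizer with a fixed toy suffix list and uses a single invented spelling pattern (y to i); `SUFFIXES`, `SPELLING_PATTERNS` and all function names are hypothetical.

```python
# Toy paradigm: candidate suffixes per part of speech (an assumption, nouns only).
SUFFIXES = {"N": ["", "s", "en"]}

# Historical-to-modern spelling rewrites (a single invented pattern).
SPELLING_PATTERNS = [("y", "i")]


def build_full_form_lexicon(lemmata):
    """Expand each (lemma, pos) into hypothetical word forms; map form -> set of lemmata."""
    index = {}
    for lemma, pos in lemmata:
        for suffix in SUFFIXES[pos]:
            index.setdefault(lemma + suffix, set()).add(lemma)
    return index


def modernise(word):
    """Step 1: reduce a historical spelling to a standard ("modern") spelling."""
    word = word.lower()
    for historical, modern in SPELLING_PATTERNS:
        word = word.replace(historical, modern)
    return word


def lemmatize(word, index):
    """Step 2: look the normalised form up in the expanded lexicon."""
    return sorted(index.get(modernise(word), set()))


full_forms = build_full_form_lexicon([("distel", "N"), ("bok", "N")])
print(lemmatize("Dystels", full_forms))
```

Because the expansion is generated rather than witnessed, it also contains implausible forms; a statistical model would rank or filter these, which this sketch does not attempt.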
Attestation
- From hypothetical (non-witnessed) lexicon content to attested word forms in “real” text
- Automatic selection of candidate attestations
- Manual work: verification and correction
Two approaches
- Dictionary based (INL): Woordenboek der Nederlandsche Taal
- Corpus based (LMU, INL): Dutch DBNL corpus
IMPACT Dictionary Attestation Tool
Lexicon building at work: Verifying attestations in historical dictionaries
Headword: work
Quotations:
• We are working on what works.
• Depart from me, ye that worke iniquity.
• She worcketh knittinge of stockings.
Variants: working, works, worke, worcketh
Task: Find the variants of a headword as they occur in the quotations
IMPACT Dictionary Attestation Tool
Task: Find the variants of a headword as they occur in the quotations
[Diagram labels: Electronic historical dictionary; Database with lemmata and quotations]
Automatically (preprocessing)
• match literally, e.g. work → work, Work
• match using existing lexica and lists, e.g. work → works, worked, wrought
• approximate matching, e.g. work → worke
By hand (using the tool)
• correct automatic mismatches, e.g. works → words, worms
• find missed matches, e.g. work → worketh, wrowght
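The automatic preprocessing can be pictured as a cascade of three matchers. The following sketch is an assumption-laden illustration: it uses `difflib.SequenceMatcher` similarity with an arbitrary threshold of 0.7 for the approximate step, and the function name `find_candidate_variants` is hypothetical.

```python
import re
from difflib import SequenceMatcher


def find_candidate_variants(headword, quotation, known_forms=(), threshold=0.7):
    """Return (token, match_type) pairs for tokens that may be forms of the headword.

    Cascade: literal match, then lookup in a list of known forms,
    then approximate string matching.
    """
    head = headword.lower()
    known = {form.lower() for form in known_forms}
    candidates = []
    for token in re.findall(r"\w+", quotation):
        low = token.lower()
        if low == head:
            candidates.append((token, "literal"))
        elif low in known:
            candidates.append((token, "lexicon"))
        elif SequenceMatcher(None, head, low).ratio() >= threshold:
            candidates.append((token, "approximate"))
    return candidates


known_forms = ["works", "worked", "wrought"]
first = find_candidate_variants("work", "We are working on what works.", known_forms)
second = find_candidate_variants("work", "Depart from me, ye that worke iniquity.", known_forms)
print(first)
print(second)
```

Whatever threshold is chosen, some true variants will fall below it and some false ones above it, which is why the candidates are verified and corrected by hand in the tool.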
IMPACT Attestation Tool
[Screenshot of the tool, with callouts: lemma headword; quotations, sorted by uncertainty; up-to-date overview of what is done and needs to be done; done by this user so far]
IMPACT Lexicon Tool
Task: Find and verify attestations in a historical corpus
Automatically (preprocessing = apply lemmatizer)
• match literally, e.g. work → work, Work
• match using existing lexica and lists, e.g. work → works, worked, wrought
• matching using spelling variation module, e.g. uiterlijk → uyterlick
By hand (using the tool)
• assign correct lemma, e.g. was (N) → zijn (V)
• group tokens belonging together, e.g. konings zoon → koningszoon
• select attestations
Corpus-based lexicon building: IMPACT Lexicon Tool
General vocabulary vs. Named entities
- Tools for lexicon building described so far: applicable to general lexicon
- Tools for NE recognition, classification and variant matching
  - library requirement
  - distinguish general vocabulary from NEs
  - avoid unpleasant mixups like Abimelech → apemelk! (b/p; i/e; e/0; k/ch)
Improvement of state of the art / innovation
- We use existing computational linguistic approaches, but figure out how to apply them to historical language
- We develop a workflow to deal with the problems posed by historical language, figuring out how all pieces fit together
  - Data selection and acquisition
  - Manual work
  - Computational linguistics tools
Some results so far with focus on Dutch
Measuring results for Dutch
We use the ground truth data developed in the project
- Evaluation of EE tools
- Evaluation of lexicon coverage
- Evaluation of lexicon usage in IR (2010)
- Evaluation of OCR and lexicon usage in OCR (2010)
- Evaluation of benefit of lexicon building for OCR (for which type of material / quality of OCR does this make sense) (2010-11)
Dutch ground truth data
Type and genre                     # words
Gold Standard Book                 300k
Random Set Book                    340k
Random Set Staten Generaal         2.5M
Gold Standard Staten Generaal      500k
Gold Standard Newspapers 1         3.4M
Gold Standard Newspapers 2         170k
Random Set Newspapers              3.2M
total                              13.1M
Efficiency of lexicon building
Dictionary-based lexicon building using historical dictionary: Woordenboek der Nederlandsche Taal
- Lemmata: 220211, quotations: 1524366
- Tempo: 1725 quotations/hour; 231 lemmata/hour
Reverse lemmatization
- Reminder: build hypothetical (non-attested) word forms in a “quick and dirty” way to use in lemmatization and corpus-based lexicon building
- Using simple statistical algorithms and a simple approach to inflection
Results:

Lexicon                               Accuracy
Small Dutch lexicon (JVKlex)          96.6%
French lexicon (Morphalou)            99.4%
Polish lexicon, verbs (Morfologik)    98.7%
Lexicon coverage (1: ground truth books)
Lexicon                                                         Type coverage   Token coverage
1. Modern lexicon (e-Lex)                                       46%             76%
2. EE3.3                                                        56%             84%
1 + 2                                                           63%             89%
Type frequency list historical corpus, top 200K (freq >= 19)    70%             93%
Type frequency list historical corpus, top 500K (freq >= 5)     78%             95%
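For readers unfamiliar with the two measures in the coverage tables: type coverage counts each distinct word form once, token coverage counts every running word. A minimal sketch, assuming exact string matching against the lexicon (no case folding or spelling normalisation); the sample text and lexicon are invented.

```python
from collections import Counter


def coverage(tokens, lexicon):
    """Return (type_coverage, token_coverage) of a token list against a lexicon."""
    counts = Counter(tokens)
    covered_types = [word for word in counts if word in lexicon]
    type_coverage = len(covered_types) / len(counts)
    token_coverage = sum(counts[word] for word in covered_types) / sum(counts.values())
    return type_coverage, token_coverage


# Invented example: 8 tokens, 6 types, 4 of the types are in the lexicon.
sample_tokens = "de werelt is de wereld en de weereld".split()
sample_lexicon = {"de", "is", "en", "wereld"}
type_cov, token_cov = coverage(sample_tokens, sample_lexicon)
print(f"type coverage {type_cov:.0%}, token coverage {token_cov:.0%}")
```

Token coverage is usually the higher of the two because frequent words are more likely to be in the lexicon, which matches the pattern in the tables.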
Lexicon coverage (2: gt newspapers 18th-19th c.)
Lexicon                                                Type coverage   Token coverage
1. Modern lexicon (e-Lex)                              40%             83%
2. EE3.3                                               41%             84%
1 + 2                                                  51%             89%
Type frequency list historical corpus, top 200K        52%             93%
Type frequency list historical corpus, top 500K        62%             95%
Lexicon coverage (3: gt Parl. Papers 19th c.)
Lexicon                                                Type coverage   Token coverage
1. Modern lexicon (e-Lex)                              51%             89%
2. EE3.3                                               47%             88%
1 + 2                                                  58%             93%
Type frequency historical corpus, top 200K             59%             96%
Type frequency historical corpus, top 500K             68%             97%
Lexicon coverage (4: gt Parl. Papers 20th c.)
Lexicon                                                Type coverage   Token coverage
1. Modern lexicon (e-Lex)                              70%             93%
2. EE3.3                                               66%             93%
1 + 2                                                  76%             96%
Type frequency historical corpus, top 200K             74%             97%
Type frequency historical corpus, top 500K             81%             98%
Lexicon coverage (5: Genesis, 1637 bible)
Lexicon                                                Type coverage   Token coverage
1. Modern lexicon (e-Lex)                              31%             61%
2. EE3.3                                               62%             83%
1 + 2                                                  65%             89%
Type frequency historical corpus, top 200K             76%             97%
Type frequency historical corpus, top 500K             87%             98.6%
Lexicon coverage (6: Hooft, historiën)
Lexicon                                                Type coverage   Token coverage
1. Modern lexicon (e-Lex)                              26%             67%
2. EE3.3                                               47%             88%
1 + 2                                                  50%             90%
Type frequency historical corpus, top 200K             44%             93%
Type frequency historical corpus, top 500K             58%             96%
Conclusion from this evaluation
Evident next step for Dutch lexicon building is corpus-based work
First target: cover the top 200000 from the historical corpus.
– Contains 97885 types not in the witnessed historical EE3.3 lexicon
– Roughly 24% of these are covered by the modern lexicon
– Roughly 25% are names
– This leaves about 45000 common words to look into.
Measuring effect of lexicon use in IR
- Example: improved recall for retrieval in a historical corpus of about 150 million tokens. Using only the modern lexicon, a search for wereld yields 23396 hits; using the current EE3.3 lexicon we get 34339 hits.
- Simple IR will be part of the demonstrators
- Hard to measure IR results proper without special datasets
- We have measured up to now either lemmatization or modern to historical word form matching accuracy
Lemmatization
- Combination of lookup, matching of spelling variation, reverse lemmatization
- As yet no good evaluation set for IMPACT (current work)
- Evaluation on “type” level
- We will use other material here (1637 Genesis, 97144 tokens)
Approach
- Restrict to “ordinary words” (no names, numbers, clitic combinations)
- Ambiguous lemmatization (context is not used) (avg. 5 suggestions per word)
- Ranking based on frequency and pattern weights
Result
6265 distinct types. 5991 (95.7%) had at least one correct suggestion
Average rank of correct suggestions: 1.23
– 5222 types found in current EE3.3 (83%)
– 65 additional types in modern lexicon
– 49 types without any match
– 969 types (15%) identified with “approximate” matching using ~500 weighted patterns and returning at most 2 suggestions
Real and hypothetical lexicon coverage (Hooft, historiën)
Result (again restricting to ‘ordinary’ words)
36332 distinct types. Avg rank of correct suggestions: 1.23
– 20087 types found in current EE3.3 (55%)
– 1061 additional types in modern lexicon
– 2411 types without any match (7%)
– 12773 types (35%) identified with “approximate” matching using ~500 weighted patterns and returning at most 2 suggestions (Probably about 75% of the highest-ranking approximate matches are correct)
Evaluation of TR results
Using Finereader SDK (version 9)
- External dictionary interface for experimentation
- Not completely straightforward how to apply this
  - Translation of corpus frequencies to weights on a scale 0-100
  - Other details: hyphenated words, case-sensitivity, …
  - Workaround to circumvent the long s problem
Lexicon data used
- Corpus-based type-frequency list
- EE3.2 deliverable lexicon
- Finereader internal lexicon
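The slides do not say how corpus frequencies were translated to weights on a 0-100 scale. One plausible option, shown here purely as an assumption, is logarithmic scaling relative to the most frequent word, so that rare words are not all squeezed to zero. The function name `frequency_to_weight` is hypothetical and unrelated to the SDK.

```python
import math


def frequency_to_weight(freq, max_freq):
    """Map a corpus frequency to an integer weight in 0..100 using log scaling.

    Assumption for illustration only: the actual mapping used with the
    external dictionary interface is not documented in the slides.
    """
    if freq <= 0:
        return 0
    if freq > max_freq:
        raise ValueError("freq must not exceed max_freq")
    return round(100 * math.log(freq + 1) / math.log(max_freq + 1))


# Frequencies taken from the OCR lexicon example earlier in the deck.
frequencies = {"wechgevoert": 98, "wechnemen": 74, "wechginck": 5}
top = max(frequencies.values())
dictionary_weights = {word: frequency_to_weight(f, top) for word, f in frequencies.items()}
print(dictionary_weights)
```

A linear mapping would also satisfy the 0-100 requirement; the log variant is chosen here only because word frequencies are heavily skewed.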
OCR evaluation
1. Character accuracy
2. Word accuracy
3. In case of block alignment problems, a simple alternative is bag-of-words accuracy

1. and 2. presuppose a good alignment of OCR with ground truth.

We will use word accuracy, or the simpler alternative 3. when there are alignment problems
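The difference between measures 2 and 3 can be illustrated with a small sketch. It assumes the simplest possible definitions: word accuracy compares already aligned word lists position by position, and the bag-of-words measure ignores order and counts how many ground truth words are found in the OCR output. The project's exact definitions may differ; the function names are hypothetical.

```python
from collections import Counter


def word_accuracy(ocr_words, truth_words):
    """Fraction of positions where OCR and ground truth agree (inputs must be aligned)."""
    if len(ocr_words) != len(truth_words):
        raise ValueError("word accuracy needs aligned word lists of equal length")
    correct = sum(ocr == truth for ocr, truth in zip(ocr_words, truth_words))
    return correct / len(truth_words)


def bag_of_words_recall(ocr_words, truth_words):
    """Fraction of ground truth words found in the OCR output, ignoring word order."""
    matched = sum((Counter(ocr_words) & Counter(truth_words)).values())
    return matched / len(truth_words)


truth = "de tweede de stilste en veiligste".split()
ocr = "de tweede de ftillie en veiligde".split()
accuracy = word_accuracy(ocr, truth)
recall = bag_of_words_recall(list(reversed(ocr)), truth)  # word order does not matter here
print(accuracy, recall)
```

The bag-of-words variant still works when text blocks come out in a different order than in the ground truth, which is exactly the alignment problem mentioned above.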
OCR results
Lexicon configurations:
(A) With ABBYY internal Dutch lexicon
(B) With combination of corpus-based historical lexicon and EE3.2 deliverable (case insensitive, taking hyphenation into account)
(C) With combination of corpus-based historical lexicon and EE3.2 deliverable, improved deployment

Dataset                                                          (A)      (B)      (C)
DPO35 (word accuracy)                                            88.8%    90.9%    94.4%
Parliamentary papers, 1826-27 selection (bag of words recall)    90.9%    94.9%    94.9%
‘The Book’
“Kort begrip der waereld-historie voor de jeugd” by J.F. Martinet, minister (predikant) in Zutphen, 1789.

Why this book?
- Representative font and amount of spelling variation etc. for late 18th century Dutch
- It has the “long s problem”: …. = stilste not ftilfte
The long s problem: An example ….
OCR at start of project:
A. De eerde was de gevaarlykflti om de verlei¬
ding aan 't Hof; de tweede de ftillie en veiligde;
de derde de zwaarde, daar hy byna drie millioenen
harde en onbefchaafde Menfchen beftieren moest.

Results April 2010:
A. De eerste was de gevaarlykste om de verlei-
ding aan 't Hof; de tweede de stilste en veiligste;
de derde de zwaarste, daar hy byna drie millioenen
harde en onbeschaafde Menschen bestieren moest.

Workaround: “integrated postcorrection”: tell the engine that “eerfte” is OK and postcorrect it afterwards with the lexicon.
In this way we keep it from turning into “eerde”.
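The postcorrection half of this workaround can be sketched as follows: for a word that is not in the lexicon, try every reading in which an "f" is reinterpreted as a long s, and accept the reading only if exactly one candidate is a known word. This is an illustration under that assumption, not the deployed procedure; the lexicon contents and function names are invented.

```python
from itertools import product


def long_s_candidates(word):
    """All readings of a word in which each 'f' is either kept or read as 's'."""
    options = [("f", "s") if char == "f" else (char,) for char in word]
    return ["".join(chars) for chars in product(*options)]


def postcorrect(word, lexicon):
    """Replace a word by its unique lexicon reading, if there is exactly one."""
    if word in lexicon:
        return word
    matches = [candidate for candidate in long_s_candidates(word) if candidate in lexicon]
    return matches[0] if len(matches) == 1 else word


# Invented mini-lexicon containing the forms from the example above.
sample_lexicon = {"eerste", "stilste", "onbeschaafde", "hof"}
corrected = [postcorrect(w, sample_lexicon) for w in ["eerfte", "ftilfte", "onbefchaafde", "hof"]]
print(corrected)
```

The number of candidates doubles with every "f" in the word, which is acceptable for single words but would need pruning for longer strings.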
Future work
- Compound analysis
- Irregular historical white space use (“impacttok++”) (cf. attestations)
- Corpus-based lexicon extension
- Testing and optimization with ground truth data
- Improve the TR lexicon by extending the IR lexicon and removing false friends from the DBNL-corpus based TR lexicon
- Continue work on the best way to deploy lexica in OCR, with help from ABBYY