Constructing Language Resources for Historic Document Retrieval

MSc thesis, Artificial Intelligence

Marijn Koolen
[email protected]

June 2, 2005

Supervisors: Prof. Dr. Maarten de Rijke, Dr. Jaap Kamps

Informatics Institute
University of Amsterdam


Abstract

The aim of this research is to investigate the possibility of constructing language resources for historic Dutch documents automatically. The specific problems for historic Dutch, when compared to modern Dutch, are the inconsistency in spelling and the aged vocabulary. Finding relevant historic documents using modern keywords can be aided by specific resources that add historic variants of modern words to the query, or by resources that translate historic documents into modern language. Several techniques from Computational Linguistics, Natural Language Processing and Information Retrieval are used to build language resources for Historic Document Retrieval on Dutch historic documents. Most of these methods are language independent. The resulting resources consist of a number of language independent algorithms and two thesauri for 17th century Dutch: a synonym dictionary, and a spelling dictionary based on phonetic similarity.


Acknowledgements

I’d like to express my gratitude towards Jaap Kamps and Maarten de Rijke for their guidance and supervision during this research. They’ve read numerous versions of this thesis without losing patience or hope, and were always quick with advice when needed. I’d like to thank Frans Adriaans for the brainstorming sessions getting both our projects started, and for the discussions on science that somehow always shifted to discussions on music.


Contents

Abstract

Acknowledgements

1 Introduction
1.1 Document retrieval
1.2 Historic documents and IR
1.3 Constructing language resources
1.4 Outline

2 Historic Documents
2.1 Language variants or different languages?
2.2 The gap between two variants
2.3 Bridging the gap
2.4 Resources for historic Dutch
2.5 Corpora
2.6 Corpus problems
2.7 Measuring the gap
2.8 Spelling check

3 Rewrite rules
3.1 Inconsistent spelling & rewrite rules
3.1.1 Spelling bottleneck
3.1.2 Resources
3.1.3 Linguistic considerations
3.2 Rewrite rule generation
3.2.1 Phonetic Sequence Similarity
3.2.2 The PSS algorithm
3.2.3 Relative Sequence Frequency
3.2.4 The RSF algorithm
3.2.5 Relative N-gram Frequency
3.2.6 The RNF algorithm
3.3 Rewrite rule selection
3.3.1 Selection criteria
3.4 Evaluation of rewrite rules
3.4.1 Test and selection set
3.5 Results
3.5.1 PSS results
3.5.2 RSF results
3.5.3 RNF results
3.6 Conclusions
3.6.1 Problems
3.6.2 The y-problem

4 Further evaluation
4.1 Iteration and combining of approaches
4.1.1 Iterating generation methods
4.1.2 Combining methods
4.1.3 Reducing double vowels
4.2 Word-form retrieval
4.3 Historic Document Retrieval
4.3.1 Topics, queries and documents
4.3.2 Rewriting as translation
4.4 Document collections from specific periods
4.5 Conclusions

5 Thesauri and dictionaries
5.1 Small parallel corpora
5.2 Non-parallel corpora: using context
5.2.1 Word co-occurrence
5.2.2 Mutual information
5.3 Crawling footnotes
5.3.1 HDR evaluation
5.4 Phonetic transcriptions
5.4.1 HDR and phonetic transcriptions
5.5 Edit distance
5.5.1 The phonetic edit distance algorithm
5.6 Conclusion

6 Concluding
6.1 Language resources for historic Dutch
6.2 Future research
6.2.1 The spelling gap
6.2.2 The vocabulary gap

Appendix A - Resource descriptions
Appendix B - Scripts
Appendix C - Selection and Test set


List of Tables

2.1 Categories of historic words

3.1 Comparative recall for English historic word-forms
3.2 Comparative recall for Dutch historic word-forms
3.3 Corpus statistics for modern and historic corpora
3.4 Edit distance example 1
3.5 Edit distance example 2
3.6 Edit distance example 3
3.7 Edit distance example 4
3.8 Edit distance baseline
3.9 Manually constructed rules on test set
3.10 Results of PSS on test set
3.11 Results of RSF on test set
3.12 Results of RNF on test set
3.13 Different modern spellings for historic y

4.1 Results of iterating RSF and RNF
4.2 Results of combined rule generation methods
4.3 Lexicon size after rewriting
4.4 Results of RDV on test set
4.5 Results of historic word-form retrieval
4.6 HDR results using rewrite rules
4.7 HDR results for expert topics
4.8 Results on test sets from different periods

5.1 Classification of frequent English words
5.2 Classification of frequent historic Dutch words
5.3 Classification of frequent modern Dutch words
5.4 DBNL dictionaries
5.5 Simple evaluation of DBNL thesaurus
5.6 DBNL thesaurus coverage of corpora
5.7 HDR results for known-item topics using DBNL thesaurus
5.8 Analysis of query expansion
5.9 HDR results for expert topics using DBNL thesaurus
5.10 Evaluation of phonetic transcriptions
5.11 HDR results for known-item topics using phonetic transcriptions
5.12 HDR results for expert topics using phonetic transcriptions
5.13 Phonetically similar characters
5.14 Results of historic word-form retrieval using PED

1 DBNL dictionaries


Chapter 1

Introduction

1.1 Document retrieval

An Information Retrieval (IR) system allows a user to pose a query, and retrieves documents from a document collection that are considered relevant given the words in the query. A basic IR system retrieves those documents from the collection that contain the most query words. The drawback of retrieving only documents that contain query words is that often, not all relevant documents will be retrieved. Some relevant documents may not contain any of the query words at all. Many techniques can be used to improve upon the basic system, by expanding the query with related words, or by using approximate word matching methods. The aim (and the main challenge) of these techniques is to increase the number of retrieved relevant documents, without increasing the number of retrieved irrelevant documents. This is a difficult task, but significant improvements have been made by using several language dependent and language independent resources.

1.2 Historic documents and IR

IR systems often use external resources to improve retrieval performance. Stemmers, for example, are used to map words into a standard form, so that morphologically different forms can be matched [7]. Resources are also used in Cross-Language Information Retrieval (CLIR), where bilingual dictionaries are used to translate query terms [10]. Different amounts of resources are available for different languages.

Through the increased performance of OCR (optical character recognition) techniques, and dropping costs, more and more historic documents are becoming digitally available. These documents are written in a historic variant of a modern language. Often, the spelling and vocabulary of a language have changed over time. For these historic language variants, very few resources are digitally available. Although the performance of OCR has increased, the linguistic resources used for automatic correction are based on modern languages.


These correction methods might not work for older texts. Thus, for many historic documents, digitization requires manual correction of OCR errors. But the problems don’t end there. Once a document has been digitized correctly, the historic spelling and vocabulary still form a problem for linguistic resources based on modern languages. Therefore, this thesis focuses on automatically constructing linguistic tools for historic variants of a language. These tools can then be used for Historic Document Retrieval, which aims at retrieving documents from historic corpora. The tools will be used to construct resources for 17th century Dutch. Given that only generic techniques are used, they should also provide a framework for other periods and other languages.

1.3 Constructing language resources

The aim of this research is to construct language resources to be used for historic document retrieval (HDR) in Dutch. Previous research by Braun [2], and Robertson and Willett [18], has shown that specific resources for historic texts can improve IR performance. Robertson and Willett treated historic English as a corrupted version of modern English, using spelling-correction methods to find historic spelling variants of modern words. Braun focused on heuristics, exploiting regularities in historic spelling to develop rewrite rules which transform historic word forms into modern word forms. These rewrite rules were developed manually, since the inconsistency of spelling was deemed too problematic for automatic generation of rules. The rewrite rules can significantly improve retrieval performance. Therefore, the problem of automatic rule generation will be investigated, considering approaches from phonetics, computational linguistics and natural language processing (NLP). In both [2] and [18], the focus is on historic spelling. However, Braun identified a second bottleneck, namely the vocabulary gap. Some historic words no longer exist. Some modern words did not exist yet (like telephone or bicycle), and other words still exist but have a different meaning. To tackle this problem, a thesaurus might be used. However, no such thesaurus is digitally available, so to solve the vocabulary bottleneck one has to be constructed, either manually or automatically. This research will thus be focused on the following research questions:

• Can historic language resources be constructed automatically?

• Is the framework for constructing resources a language independent (generic) approach?

• Can HDR benefit from these resources?

The first question can be split into two more specific questions, based on Braun’s observations about the spelling problem and the vocabulary problem:

• Can (automatic) methods be used to solve the spelling problem?


• What are the options for automatically constructing a thesaurus for historic languages?

For the spelling problem, Braun and Robertson and Willett have already mentioned two methods: rewrite rules and spelling correction techniques. But there may be other options to align two temporal variants of a language. Therefore, this question can be made more specific:

• Can rewrite rules be generated automatically using corpus statistics and overlap between historic and modern variants of a language?

• Are the generated rewrite rules a good way of solving the spelling bottleneck?

• Can historic Dutch be treated as a corrupted version of modern Dutch, and thus be corrected using spelling correction techniques?

The available methods will be tested independently and as a combined approach. Parallel to this research, Adriaans [1] evaluates the retrieval side of HDR in much more detail. The methods and thesauri developed in this project will be used in his retrieval experiments as an external evaluation. If these techniques are found to be useful, this will result in a number of language resources, some of which are language (and period) dependent, and others language independent.

The main drawbacks of manual construction of resources are the need for expert knowledge in the form of historic language experts, and the huge amount of time it takes to construct the resources. Automatic construction exploits statistical correlations and regularities in a language. Therefore, expert knowledge is no longer essential, and the time it takes to build resources is greatly reduced. Another advantage is that, if automatic construction is effective, the same techniques might be used for several different languages. As the aforementioned articles have shown, IR performance for both 17th century English documents and 17th century Dutch documents can be increased by attacking the spelling variation. The techniques in this research should be language independent, making them useful for both Dutch and English, and perhaps other languages for which historic documents pose the same problems.

1.4 Outline

The next chapter will elaborate on the distinction between historic and modern Dutch documents, and some available historic Dutch document collections will be described. It will show that information retrieval on historic documents is different from retrieval on modern document collections. Chapter 3 will discuss in detail the automatic construction of rewrite rules using phonetics and statistics, and their effectiveness on historic documents. Several different methods are described and compared with each other and with the rules from [2]. Further extensions and combinations of these methods, and further evaluations, follow in chapter 4, including a document retrieval experiment.


Chapter 5 will investigate the possibility of building a thesaurus to find synonyms among historic words, using various techniques, and other ways of solving the spelling problem are put to the test. In the final chapter, conclusions are drawn from the conducted experiments, and some guidelines for future work will be given.


Chapter 2

Historic Documents

Historic documents are documents written in the past. Of course, this holds for all documents. But since spoken and written language changes continuously, a century-old Dutch document is written in a form of Dutch that is different from a document written two weeks ago. The changes are not spectacular, but they are there all the same. Using a search engine on the internet to find documents on typical Dutch food with the keywords Hollandse gerechten (English: Dutch dishes) may retrieve a text written in 1910 containing both words. The keywords are normal in modern Dutch, but also in early 20th century Dutch. What the search engine probably won’t find is a website containing hundreds of typically Dutch recipes from the 16th century, although this website does exist (see section 2.5, the Kookhistorie corpus). The historic texts contain historic spelling variants of the modern words Hollandse gerechten. This problem is caused by the fact that the change from 16th century Dutch to modern Dutch is spectacular. Although the number of digitized 16th century documents is small, through the increasing interest from historians and funding from national governments for digitizing historic documents,1 this number is growing rapidly. The aforementioned problem, the gap between a modern keyword and its relevant historic variants, becomes increasingly important.

Going back further in time, the differences between modern Dutch and middle Dutch as used in the late middle ages (1200 – 1500) are even bigger. In fact, between 1200 and 1500, Dutch was not a single language, but a collection of dialects. Each dialect had its own pronunciation, and spelling was often based on this pronunciation [23]. Between geographical regions the spelling differed. Due to the union of smaller independent countries and increasing commerce, a more uniform Dutch language emerged after 1500.2 As contacts between regions increased, spelling was less and less based on pronunciation, becoming more consistent.

1 See, for example, the CATCH (Continuous Access To Cultural Heritage) project. This is funded by the Dutch government to make historic material from Dutch cultural heritage publicly accessible in digital form, thereby preserving the fragile material.

2 For a more detailed description of the changes in the language between 1200 and 1500 (in Dutch), see http://www.literatuurgeschiedenis.nl


In the 17th century, the Dutch translation of the Bible, the Statenbijbel, together with books by famous Dutch writers like Vondel and Hooft, was considered well-written Dutch, bringing about a more consistent and systematic spelling. Since there was no official spelling (which wasn’t introduced in the Netherlands until 1804), there were still many acceptable ways of spelling a word [23].

2.1 Language variants or different languages?

The Dutch language is related to the German language, yet we consider them to be different languages. A native German speaker will recognize certain words in a Dutch document, but might have problems understanding what the text is about. A bilingual person, speaking both German and Dutch, could translate the document into German, making it easy for the former reader to understand it. The same will probably hold for a native Dutch speaker reading a document written in middle Dutch. A middle Dutch expert could translate the document into modern Dutch, making it more readable. But for documents written after 1600, the historic language expert is no longer needed (or at least to a much lesser degree). Knowledge of modern Dutch gives enough handles on 17th century Dutch documents for native speakers to understand most of the text. It seems there is a shift from two different languages to a language together with a certain “dialect”. This makes 17th century Dutch more or less the same language as modern Dutch, from an information retrieval (IR) perspective. If 17th century Dutch can be seen as a “strange” dialect of modern Dutch, its overlap with modern Dutch might be used to bridge the gap that exists between the two temporal variants.

2.2 The gap between two variants

But where do the two variants overlap, and where do they differ? Braun, in [2], identified two main bottlenecks for IR from historic documents. The first bottleneck is the difference in spelling. Not only is 17th century spelling different from modern spelling, it is also less consistent. A word has only one officially correct spelling in modern Dutch (although many documents do contain some variation, caused by spelling errors, changes in the official spelling or stubbornness), whereas older Dutch has many acceptable spelling variations for the same word. The other bottleneck is the vocabulary gap. The modern word koelkast (English: refrigerator) did not exist in the 17th century. In the same way, the historic word opsnappers (English: people celebrating) cannot be found in any modern Dutch dictionary. Some words are no longer used, new words are created daily, and yet other words have changed in meaning. The fact that 17th century documents are still readable shows that the grammar has changed very little, so this is probably not an issue (most IR systems ignore word order anyway).

Here is an example of the difference between historic and modern Dutch.


The following historic text is a paragraph taken from the “Antwerpse Compilatae”, a collection of law texts written in 1609, describing laws and regulations for the region of Antwerpen. The full text describes how a captain should load a trader’s goods, and what his responsibilities towards these goods are at sea:

9. Item, oft den schipper versuijmelijck waere de goeden ende coopman-schappen int laden oft ontladen vast genoech te maecken, ende dat die daerdore vuijtten taeckel oft bevanghtouw schoten, ende int water oft ter aerden vielen, ende alsoo bedorven worden oft te niette gingen, alsulcke schade oft verlies moet den schipper oock alleen draegen ende den coopman goet doen, als voore.

10. Item, als den schipper de goeden soo qualijck stouwt oft laeijt dat d’eene door d’andere bedorven worden, ghelijck soude mogen gebeuren als hij onder geladen heeft rosijnen, alluijn, rijs, sout, graen ende andere dierghelijcke goeden, ende dat hij daer boven op laeijt wijnen, olien oft olijven, die vuijtloopen ende d’andere bederven, die schaede moet den schipper oock alleen draegen ende den coopman goet doen, als boven.

The modern Dutch version (the author’s own interpretation) would look like this:3

9. Idem, als de schipper verzuimd de goederen en koopmanswaren in het laden of uitladen vast genoeg te maken, en dat die daardoor uit een takel of vangtouw schieten, en in het water of ter aarde vallen, en zo bederven of te niet gaan, zulke schade of verlies moet de schipper ook alleen dragen en de koopman vergoeden, als hiervoor.

10. Idem, als de schipper de goederen zo kwalijk stouwt of laadt dat het ene door het andere bedorven wordt, gelijk zou kunnen gebeuren als hij onder geladen heeft rozijnen, ui, rijst, zout, graan en andere, dergelijke goederen, en dat hij daar boven op laadt wijnen, oliën of olijven, die uitlopen en de andere bederven, die schade moet de schipper ook alleen dragen, en de koopman vergoeden, als boven.

This is a translation into English, again the author’s own:

9. Equally, if the shipper neglects to properly secure the goods during loading or unloading, causing them to fall in the water or on the ground and thereby spoiling them, he should repay the damage to the trader.

10. Equally, if the shipper stacks or loads the goods in such a manner that one of the goods spoils another, as could happen if he would stack wine, oil or olives on top of raisins, onions, rice, salt, grain or some such goods, where the former spoils the latter, he must repay the damage to the trader.

3 The word order is retained to make it easier to compare both texts. Although this word order is readable for native Dutch speakers, it is somewhat strange. Apparently, grammar has changed somewhat as well.


Analyzing the historic and modern Dutch sentences, it may be clear that the biggest difference is in spelling. Some words are still the same (schipper, bederven, alleen, water), but most words have changed in spelling. The changes in vocabulary are visible in the change from goet doen to vergoeden (English: repay). There are also some morphological/syntactical changes, like versuijmelijck (negligent) to verzuimd (neglects).

It is probably easier to attack the spelling bottleneck first. To solve the second, a thesaurus is needed to translate historic words into modern words or the other way around. If a method can be found and used to find pairs of modern and historic words that have the same meaning, such a thesaurus can be constructed. But if spelling is not uniform, one spelling variant of a historic word might be matched with a modern one, while another spelling variant is missed. By solving the spelling bottleneck first, thereby standardizing spelling for historic documents, finding word translation pairs for a thesaurus may even be easier.

2.3 Bridging the gap

In Robertson & Willett [18], n-gram matching is used successfully to find historic spelling variants of modern words. Thus, the lack of specific resources might not be a problem. However, many IR systems for modern languages use a stemming algorithm (see [8]) to conflate morphological variants to the same stem. The Porter stemmer4 is one of the most popular stemmers available for many different modern languages, with a specific version for each language (a Dutch version is described in [12]). Because modern languages are consistent in spelling, stemmers can be very effective. The Porter stemmer for Dutch would conflate the words gevoelig (sensitive), gevoeligheid (sensitivity) and gevoelens (feelings) to the same stem gevoel (feeling). Using gevoelig as a query word, documents containing the word gevoelens will also be considered relevant.
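As an illustration, the following is a minimal sketch of such conflation, using the Dutch Snowball stemmer from the NLTK library (an assumption: the thesis itself refers to the Porter stemmer for Dutch described in [12], and the exact stems produced may differ from the gevoel example above):

    # A minimal sketch: conflating morphological variants with a Dutch stemmer.
    # Assumes the nltk package is installed; the stems printed are whatever the
    # Snowball implementation produces, not necessarily "gevoel".
    from nltk.stem.snowball import SnowballStemmer

    stemmer = SnowballStemmer("dutch")
    for word in ["gevoelig", "gevoeligheid", "gevoelens"]:
        print(word, "->", stemmer.stem(word))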

When spelling is inconsistent, only some word forms will be stemmed correctly. Using the Porter stemmer for modern Dutch would only affect modernly spelled historic words. The historic words gevoel, ghevoelens and gevuelig (English: feeling, feelings and sensitive) would be stemmed to the stems gevoel, ghevoel and gevuel respectively. By standardizing spelling (i.e. making it consistent), these three word-forms will be stemmed to the same stem. Another fairly standard technique in modern IR is query expansion. This means adding related words to the keywords in the query. In historic documents, some of the words related to a modern keyword might be historic words that no longer exist. Although some of these historic words could be very useful for expanding queries, the lack of knowledge about them makes it impossible to use them effectively. A thesaurus relating these words to modern words would solve this problem.

4 http://www.tartarus.org/~martin/PorterStemmer/


From this perspective, the historic language can be seen as a different language from the modern language, and the retrieval task becomes a so-called Cross-Language Information Retrieval (CLIR) task (see [11] and [10] for an analysis of the main problems and approaches in CLIR).

But can spelling be standardized with nothing but a collection of historic documents? And is it possible to make a thesaurus using the same limited document collection? Of course, it is possible to do everything by hand (see sections 3.1.1 and 5.3). But this is very time consuming, and different language experts might have different opinions on what the best modern translation would be. Automatic generation, if at all possible, might be more error prone. But as modern IR systems have shown [7], sub-optimal resources can still be very useful for finding relevant documents.

Although there was no standard way of spelling a word in 17th century Dutch, the possibilities of spelling a word based on pronunciation are not infinite. In fact, there are only a few different spellings for a certain vowel. Corpus statistics can be used to find different spelling variants by looking at the overlap of context. Also, techniques have been developed to group semantically related words based purely on corpus statistics. If this can be done for modern languages, it might work with historic languages as well.

2.4 Resources for historic Dutch

Which kinds of resources are needed to standardize spelling, and which are needed to bridge the vocabulary gap? In [2], rewrite rules are used to map spelling variants to the same form. By focusing on rewriting affixes to modern Dutch standards, more morphological variants could be conflated by a stemmer. Some rules were constructed for rewriting the stems as well, to conflate stems that were spelled in various ways (like gevoel and gevuel). These rules were constructed manually, because the spelling was considered to be too inconsistent to do it automatically. But this inconsistency can actually be exploited to construct rules automatically. The pronunciation of two spelling variations of a word is the same. In historic documents, the word gelijk (English: equal) is often spelled as gelijck or ghelijck. In the same way, gevaarlijk (dangerous) is often spelled as ghevaerlijck or gevaerlijck. By matching words based on their pronunciation, spelling variations can be matched as well. In both cases, lijck is apparently pronounced the same as lijk, and ghe is pronounced the same as ge. If there are more historic words showing the same variations, it seems reasonable to rewrite lijck to lijk. But if historic word-forms can be matched with their modern variants through pronunciation, why would we need rewrite rules? Well, not all historic words will be matched with a modern word. For instance, the word versuijmelijck (see the short historic text on loading cargo on a ship) is not pronounced like any modern Dutch word. This is because the morphology of the word has changed. The modern variant would probably be verzuimend. Here, rewriting makes sense, because, after changing the suffix lijck to lijk, a stemmer for Dutch will reduce it to versuijm.


Other rewrite rules may change this to the modern stem verzuim, conflating it with all other morphological variants.
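A minimal sketch of how such rewriting might be applied is shown below. The rule list is purely illustrative (the actual rule sets are generated and selected in chapter 3), and applying rules by plain string replacement is an assumption about the mechanism:

    # Illustrative rewrite rules (historic sequence -> modern sequence);
    # rules are applied in order to every occurrence in the word.
    REWRITE_RULES = [
        ("lijck", "lijk"),
        ("ghe", "ge"),
        ("ck", "k"),
        ("ae", "aa"),
    ]

    def rewrite(word):
        for historic, modern in REWRITE_RULES:
            word = word.replace(historic, modern)
        return word

    # e.g. ghelijck -> gelijk, ghevaerlijck -> gevaarlijk, aenspraeck -> aanspraak
    for w in ["ghelijck", "ghevaerlijck", "aenspraeck"]:
        print(w, "->", rewrite(w))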

Finding historic synonyms for modern words is a problem heretofore only tackled by manual approaches. For modern languages, techniques have been developed to find synonyms automatically (see, for instance, [3, 4, 5, 14]), using plain text or syntactically annotated corpora. Part-Of-Speech (POS) taggers exist for many languages, but not for 17th century Dutch, and annotated 17th century Dutch documents are not available either. Therefore, only those techniques that use nothing but plain text are an option.

The next chapters describe the automatic generation of rewrite rules based on phonetic information, and the automatic construction of thesauri using plain text. The various approaches are listed here:

• Rewrite rule generation methods: Three different methods, based on phonetic transcriptions, syllable structure similarity and corpus statistics, will be described.

• Rewrite rule selection methods: Some of the generated rules could be bad for performance. Some language dependent and independent selection criteria will be tested.

• Rewrite rule evaluation methods: The main evaluation will test the generated rule sets on a test set of historic and modern word pairs, and measure the similarity of the words before and after rewriting. Further evaluation is done through retrieval experiments, one word-based, the other document-based.

• Thesauri and dictionaries for the vocabulary gap: A historic-to-modern dictionary will be constructed from existing translation pairs (see the next section), and a historic synonym thesaurus will be constructed based on bigram information. These methods address the vocabulary gap.

• Dictionaries for the spelling gap: A dictionary based on pronunciation will be made that contains mappings from historic words to modern words with the same pronunciation. Finally, as a way of finding historic spelling variants for modern words, the word retrieval experiment will be extended with a technique to measure the similarity of words based on phonetics. This results in a dictionary of spelling variants. Both methods try to bridge the spelling gap.

But to do this, a collection of historic documents is needed. Huge document collections are electronically available for modern Dutch (especially since the birth of IR conferences like TREC5 and CLEF6), but for 17th century Dutch, documents are only sparsely available.

5 TREC URL: http://trec.nist.gov
6 CLEF URL: http://clef.isti.cnr.it/


2.5 Corpora

Although the Nationale Koninklijke Bibliotheek van Nederland has a large collection of historic documents, at this moment very few of them are in digital form. The resources that will be constructed use corpora of 17th century texts acquired from the internet. The following corpora were found:

• Braun corpus: This was acquired from the University of Maastricht, fromthe research done by Braun [2].

• Dbnl corpus: The Digitale bibliotheek voor de Nederlandse letteren7 stores a huge amount of Dutch historic texts. The texts used in this research are all from the Dutch ‘Golden Age’, 1550–1700. This is by far the largest corpus. Some texts were written in Latin, others are modern Dutch descriptions of the historic texts. Most of these non-historic Dutch texts were removed from the corpus. Many texts contain notes with word translation pairs. These historic/modern translations can be used to create a thesaurus for historic Dutch.

• Historie van Broer Cornelis: This is a medium-sized corpus from the beginning (1569) of the Dutch literary ‘Golden Age’, transcribed by the foundation ‘Secrete Penitentie’ as a contribution to the history of Dutch satire.

• Hooglied: A very small corpus. It is based on an excerpt from the ‘Statenvertaling’, the first official Dutch bible translation. The so-called ‘Hooglied’ was put to rhyme by Henrick Bruno in 1658.8

• Kookhistorie: A website containing three historic cook books.9 There is a huge time span between the appearance of the first cook book (1514) and the last one (1669). The language of the first book is very different from that of the second (1593) and the third. However, since the first cook book contains some modern translations of historic terms that also occur in the other two cook books, the translations can still be used for the thesaurus.

• Het Voorlopig Proteusliedboek: A small text transcribed by the ‘Leidse vereniging van Renaissancisten Proteus.’10

2.6 Corpus problems

The DBNL corpus contains heterogeneous texts: historic Dutch from various periods, modern Dutch, Latin, French, English. If the overlap in phonetics is to be used, the texts from all these different languages might cause problems.

7 URL: www.dbnl.nl
8 URL: http://www.xs4all.nl/ pboot/bruno/brunofs.htm
9 URL: www.kookhistorie.com
10 URL: home.planet.nl/ jhelwig/proteus/proteus.htm


The French word guy (English: guy, fellow) contains the vowel uy, but in French it is pronounced differently from historic Dutch words like uyt (English: out). Foreign words ‘contaminate’ the historic Dutch lexicon. The historic corpus will be used to find typical historic Dutch sequences of characters, so modern Dutch texts are also considered ‘foreign.’ As a preprocessing step, as many of the non-17th century Dutch texts as possible were removed from the corpus. Because the entire DBNL corpus contains over 8,600 documents, some simple heuristics were used to find the foreign texts, so the corpus may still contain some texts other than 17th century Dutch.

Another problem with the texts from the DBNL corpus is the period in which the texts were written. The oldest texts date from 1550, the most recent were written in 1690. In 150 years, the Dutch language changed somewhat, including pronunciation and the use of some letter combinations (like uy). For instance, in the oldest texts, the uy was used to indicate that the u should be pronounced long (the modern word vuur was spelled as vuyr around 1550). In more recent texts, after 1600, the uy was often used like the modern ui, as in the example given above (uyt is the historic variant of uit).

If texts from a wide-ranging period are used, generating rewrite rules will suffer from ambiguity. To minimize these problems, it is probably better to use texts from a fairly small period (20 – 50 years, for instance).

2.7 Measuring the gap

Before considering the construction of resources, it might be helpful to have at least some idea of the differences between the historic language of the corpus and the modern language. Some words were spelled differently from the current spelling, but how many words are we talking about? And how many of these words in the historic document collection are spelled as modern words? To get an indication of the differences, a sample of 500 randomly picked words from the historic collection was picked and assessed (names were excluded since they do not contribute to the difference between two languages). Each word was assigned to one of three categories: modern (MOD), spelling variant (VAR), or historic (HIS). A word belongs to MOD if it is spelled in a modern way (according to modern Dutch spelling rules). It belongs to VAR if it is recognized as a modern word spelled in a non-modern way. If a word has some non-modern morphology, or can’t be recognized as a modern word at all, it belongs to HIS. The word ik (English: I) is found often in historic texts, but it hasn’t changed over time. It is still used, and thus belongs to MOD. The word heyligh is recognized as a historic spelling of the modern word heilig (English: holy), and is categorized as VAR. But the word beestlijck is not recognized as a modern word. Even after adjusting its historic spelling, becoming beestelijk, it is not a correct modern Dutch word. Taking a look at the context (V beestlijck leven) makes it possible to identify this word as a historic translation of the modern word beestachtig (English: bestial or beastly). From context, it’s not hard for a native Dutch speaker to find out what it means, but it is clear that over time, its morphology has changed (the root of the word beest is still the same, which is the very reason that its meaning is recognizable from context). This might not be problematic for native Dutch speakers, but it does pose a problem for finding relevant historic documents for the query term “beestachtig”. This word, and other even less recognizable words, belong to HIS. Categorizing all 500 randomly picked words does not result in any hard facts about the gap between the two language variants, but it does give some idea about the size of the problem. The results are listed in Table 2.1. It turns out that most of the words (239 words, about 48%) are historic spelling variants of modern words. The overlap between historic and modern Dutch is also significant (177 words, 35%), leaving a vocabulary gap of 84 out of 500 words (17%). This shows that solving the problem of spelling variants bridges the gap between historic (at least 17th century) Dutch and modern Dutch for a large part.

Category   Distribution
Modern     177 (35%)
Variant    239 (48%)
Historic    84 (17%)

Table 2.1: Distribution over categories for 500 historic words
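For illustration, a minimal sketch of drawing such a sample is given below; the corpus file name and the tokenization are assumptions, and the MOD/VAR/HIS labels themselves were assigned by hand, not by a script:

    # A minimal sketch: draw 500 random words from the historic lexicon.
    import random
    import re

    def unique_words(text):
        tokens = re.findall(r"\w+", text)
        # keep fully lowercase alphabetic tokens; capitalized tokens are dropped
        # as a crude way of excluding names
        return sorted({t for t in tokens if t.isalpha() and t.islower()})

    with open("hist1600.txt", encoding="utf-8") as f:   # hypothetical corpus file
        lexicon = unique_words(f.read())

    sample = random.sample(lexicon, 500)   # then categorized by hand as MOD/VAR/HIS
    print(sample[:10])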

2.8 Spelling check

Robertson and Willett suggested using spelling correction methods. An approach for handling inconsistent spelling is to treat 17th century Dutch as a corrupted version of modern Dutch. A spelling checker might be able to map historic word forms to modern forms. That would take away the need to build specific resources for historic document retrieval. For instance, the historic word menschen might be identified by the spelling checker as a misspelled form of the modern word mensen. One way of testing this is to do a spelling check on several documents, and manually evaluate the spelling checker’s performance. Some spelling checkers use the context of a word to find the most probable correct word.

The unix spell checker A-spell was tested on the small text snippet from section 2.2, using a modern Dutch dictionary:11

9. Item, of de schipper versuijmelijck ware de goeden ende coopman-schappen int laden ende of ontladen vast genoeg te maken, ende dat die daardoor vuijtten takel of vangtouw schoten, ende int water of ter aerden vielen, ende alsoo bedorven worden of te niette gingen, alsulcke schade of verlies moet de schipper ook alleen dragen ende de koopman goet doen, als voor.

11 Information on Aspell and the Dutch dictionary used can be found on http://aspell.sourceforge.net/


10. Item, als de schipper de goeden soo qualijck stouwt of laeijt dat d’eene door d’andere bedorven worden, gelijk soude mogen gebeuren als hij onder geladen heeft rozijnen, aluin, rijs, sout, graan ende andere dergelijke goeden, ende dat hij daar boven op laeijt wijnen, olin of olijven, die uitlopen ende d’andere bederven, die schade moet de schipper ook alleen dragen ende de koopman goet doen, als boven.

The words oft, den, genoech, maecken, taeckel, daerdore and others were recognized as misspelled words, and a list of suggestions was given for each word, including the correct modern words, which were not always the most probable alternatives according to A-spell. For the word versuijmelijck no alternative was suggested, probably because it is too dissimilar to any modern Dutch word. The word goeden is a historic word for which A-spell suggested ‘goed’ (good), but not ‘goederen’ (goods). The correct suggestion for coopman-schappen (which is koopmanschappen, lit. ‘trade goods’) was not given, probably because the modernized version of the word (koopmanschappen) is not a modern word (the word koopmanschap was suggested, but this means something else, namely the business of trading). The same goes for qualijck (modern form: kwalijk) and laeijt (modern word: laadt). Also, some words are in fact historic but are not recognized as misspellings. The word niette should become niet (English: not), but is instead recognized as the past singular form of the verb nieten (English: to staple, as in stapling sheets of paper together).

Another spell checker available for Dutch is the one that comes with the MS Word text processor.12 It suggests orthographically similar words for any unknown word in the text, and is also capable of checking grammar. This is the output after applying the correct suggestions by MS Word:

Item, oft den schipper versuijmelijck ware de goeden ende koopmanschappen int laden oft ontladen vast genoeg te maecken, en dat die daardoor vuijtten taeckel oft bevanghtouw schoten, ende int water oft ter aarden vielen, ende zo bedorven worden oft te niette gingen, alsulcke schade oft verlies moet den schipper ook alleen dragen ende den koopman goed doen, als voor.

Item, als den schipper de goeden zo kwalijk stouwt oft laeijt dat dene door dandere bedorven worden, gelijk zoude mogen gebeuren als hij onder geladen heeft rozijnen, aluin, rijs, zout, granen ende andere dergelijke goeden, ende dat hij daer boven op laeijt wijnen, olien oft olijven, die uitlopen ende dandere bederven, die schade moet den schipper ook alleen dragen ende den koopman goed doen, als boven.

MS Word marks the word versuijmelijck as a misspelled word, but no alternatives are suggested, which happens for bevanghtouw and alsulcke as well. For some words, the correct word is suggested, as is the case for oft, ende and maecken and quite a few others. For many other words, the correct modern word is in the list of alternatives.

12 For those unfamiliar with MS Word, see http://office.microsoft.com


For the historic word alsoo it correctly suggests alzo, and afterwards suggests replacing alzo with the more grammatically appropriate word zo.

Spell checkers can be used to find correct modern words for historic words that are orthographically similar. However, for many historic words, spell checkers cannot find the correct alternative, and for some they cannot find any modern alternative at all. Moreover, each word has to be checked separately and the correct suggestion has to be selected from the list manually (the correct alternative is not always the first one in the list of suggestions). It would still take an enormous amount of time and effort to modernize historic documents for HDR in this way. A spelling check is not a good solution. It seems we do need specific resources to aid HDR.


Chapter 3

Rewrite rules

In this chapter, the spelling bottleneck and approaches for solving this problem are described. The following points will be discussed:

• Inconsistent Spelling & rewrite rules: The problems with inconsis-tent spelling. How rewrite rules can solve these problems, and what isneeded.

• Rewrite rule generation: Methods for generating rewrite rules.

• Rewrite rule selection: Which rules are to be selected and applied?

• Evaluation of rewrite rules: How are the sets of rewrite rules evaluated? And how well do they perform?

• Rewrite problems: Multiple modern spellings for historic character sequences.

• Conclusions: Is automatic generation of rewrite rules an effective solution to the spelling problem?

3.1 Inconsistent spelling & rewrite rules

3.1.1 Spelling bottleneck

As mentioned in chapter 2, one of the main problems with searching in historic texts is that the word or words you are looking for can be spelled in many different ways. For example, if you are searching for texts that contain the word rechtvaardig (English: righteous), you might find it in one or two texts. But there are probably many more texts that contain the same word spelled in different ways (e.g. rechtvaerdig, reghtvaardig, rechtvaardigh and combinations of these spelling variations).


One way of solving this problem would be to expand your query with spelling variations typical of that period. But few people possess the necessary knowledge to do this. Apart from that, it is fairly time consuming to think of all these variations, and you inevitably omit some. It would be far more efficient to do query expansion automatically, or to rewrite all historic documents to a standard form that matches modern Dutch closely.

Robertson and Willett [18] have shown that character-based n-gram matching is an effective way of finding spelling variants of words in 17th century English texts. Historic word forms for modern words were retrieved based on the number of n-grams they shared. All the historic words were transformed into an index of n-grams, and the 20 words with the highest score were retrieved. The score was computed using the Dice score, with N(Wi,Wj) being the number of n-grams that Wi and Wj have in common, and L(Wi) the length of word Wi:

\[
\mathrm{Score}(W_{mod}, W_{hist}) = \frac{2 \times N(W_{mod}, W_{hist})}{L(W_{mod}) \times L(W_{hist})} \tag{3.1}
\]
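The following minimal sketch illustrates this kind of n-gram scoring. It assumes unpadded character bigrams and set-based counting of shared n-grams, which may differ in detail from the setup in [18]; the denominator follows equation 3.1 as printed above:

    # A minimal sketch of n-gram matching for historic word-form retrieval.
    def ngrams(word, n=2):
        # set of character n-grams of the word (no padding)
        return {word[i:i + n] for i in range(len(word) - n + 1)}

    def score(modern, historic, n=2):
        shared = len(ngrams(modern, n) & ngrams(historic, n))   # N(Wmod, Whist)
        return 2.0 * shared / (len(modern) * len(historic))     # as in eq. 3.1

    historic_lexicon = ["rechtvaerdig", "reghtvaardig", "rechtvaardigh", "ghelijck"]
    query = "rechtvaardig"
    # rank historic word forms by similarity to the modern query word and
    # keep the 20 most similar, as in [18]
    ranked = sorted(historic_lexicon, key=lambda w: score(query, w), reverse=True)
    print(ranked[:20])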

In a historic word list containing 12191 unique words, 2620 historic words were paired with 2195 unique modern forms. Thus, each modern form had at least one corresponding historic word form. The results in Table 3.1 show the recall at the 20 most similar matches (no precision scores are given in [18]).

Method             Recall
2-gram matching    94.5
3-gram matching    88.8

Table 3.1: Comparative recall for the 20 most similar matches for historic English

Braun [2] has conducted the same experiment for 17th century Dutch. It turns out that n-gram matching performance is increased by standardizing spelling and stemming (Table 3.2). The inconsistency of spelling makes it hard to apply a stemming algorithm directly to historic documents. Therefore, spelling is standardized by applying rewrite rules to the historic words. In [2], these rewrite rules for 17th century Dutch were constructed with the help of experts. They transform the most common spelling variations to a standard spelling. Most of the variations of rechtvaardig just mentioned would be changed to the modern spelling by these rules. By rewriting different spelling variants to the same word form, and removing affixes through stemming, a fair number of word forms are conflated to the same stem.

Still, constructing rules manually with the help of experts takes a lot of effort, and experts on 17th century Dutch are not freely and widely available. More efficient, but also more error prone, are automatic, statistical methods to produce rewrite rules. In this chapter, several automatic approaches are compared with each other as well as with the rewrite rules constructed by Braun.


Retrieval method                 Comp. Recall   Precision
3-gram                           70.4           57.9
3-gram + stemming                74.0           62.5
3-gram + rewriting               74.8           53.7
3-gram + stemming + rewriting    82.1           57.8

Table 3.2: Comparative recall for the 20 most similar matches for historic Dutch

3.1.2 Resources

To construct rewrite rules, a collection of historic documents is needed, as well as a collection of modern documents. Without the modern documents, it would be much harder to standardize historic spelling. There are several equally acceptable ways of spelling a word in 17th century Dutch. There is no single spelling that would be better than the others. To ensure uniform rewriting, the rules have to be constructed with great care. Identifying spelling variants is only the first step. The second step is rewriting them all to the same form. For another group of spelling variants, the same standard form should be chosen. But this is far from trivial. Consider the spelling variants klaeghen, klaegen, klaechen and claeghen. Three out of four words start with kl, so it seems sensible to choose kl as the standard form. Also, two out of four words use gh, so g and ch should become gh as well to transform all four variants into a uniform spelling. Another group of spelling variants might be vliegen, vlieghen, vlieggen, vlyegen and fliegen. In this case, rewriting fl to vl seems to make more sense than rewriting vl to fl. The same goes for ye and ie. But the next transformation should be selected more carefully. Of the three different options g, gh and gg, g occurs most often. But rewriting gh to g would be in conflict with the earlier decision to rewrite g to gh. A far easier solution, with the goal of making resources for information retrieval in mind, is to rewrite the historic word forms to modern word forms. In that case, a standard spelling already exists, and rewriting historic spelling variants to a uniform word is done by rewriting them to the appropriate modern word. Of course, we need to find the appropriate modern form, which might not be easy at all. But we’re faced with the same problems when finding the different historic spelling variants themselves. The other advantage of rewriting to modern words becomes clear when combining it with an IR system. Modern users pose queries in modern language. Rewriting all possible historic variants to one historic word will not make it any easier for the IR system to match them with their modern variant. Rewriting historic words to modern words means rewriting to the language of the user.

The document collections

For the historic document collection, a corpus of several large books is used. These books all date from the same period (1600 – 1620).


The reason for keeping the period small is that spelling changed over time. If a larger time span is chosen, greater ambiguity in spelling might result in incorrect rewrite rules. The pronunciation of some character combinations in 1550 might have changed by 1600. Also, the specific period between 1600 and 1620 makes it possible to compare the generated rewrite rules with the rules constructed by Braun, since these rules were based on two law books dating from 1609 and 1620. The corpus used in this research, named hist1600, contains these same two law books, in addition to a book by Karel van Mander (Het schilder-boeck), printed in 1604. Two of the techniques used here compare the words of the historic corpus to words in a modern corpus. The modern corpus (15 editions of the Dutch newspaper ”Algemeen Dagblad”) is equal in size to the historic corpus (see Table 3.3). The included editions of the newspaper were selected on date, ranging over two whole years, to make sure that not all editions cover the same topics (two successive editions often contain articles on the same topics, probably repeating otherwise low-frequency content words).

Name               Total size (words)   Unique words
AC-1609            221739               11648
GLS-1620           131183               6977
mand001schi01      453474               32314
Total (hist1600)   791217               47816
Alg. Dagblad       797530               58664

Table 3.3: Corpus statistics for modern and historic corpora

3.1.3 Linguistic considerations

Is it possible to have some idea in advance of how well a certain method will work? Surely it would be nice to know in advance that matching variants of a word based on phonetic similarity works well. But we don’t have this knowledge. However, some observations made beforehand can point in the right direction (or away from the wrong direction).

Syllable structure

One such observation is that apparently, most historic words that are recognizable spelling variants of modern words have the same syllable structure as their modern form. Each syllable in Dutch contains a vowel, and can have a consonant as onset and/or as coda. If we take the modern Dutch word aanspraak and a historic form aenspraeck, the similarity in syllable structure is obvious. For both forms the first syllable has a coda (n), the second syllable has an onset (spr) and shows a difference in the codas (k vs. ck). The vowels of the two syllables differ also (aa vs. ae).

Page 31: Constructing language resources for historic document retrieval

3.1. INCONSISTENT SPELLING & REWRITE RULES 21

attack the spelling problem? A solution can be to split the words into syllablesand than make rewrite rules from mapping the historic syllable to the modernsyllable. This would give the following rules:

aen → aan
spraeck → spraak

The advantage of this approach is that it will not only rewrite the word aenspraeck but also any other historic word that contains the syllable aen. What it won't do is rewrite the word staen to staan (English: to stand), since it won't rewrite syllables containing aen that have an onset. After reading a few sentences of a historic document it becomes clear that the vowel ae is very common in these documents. In modern documents it is not nearly as common. One problem that is immediately visible is that, to rewrite all words that contain the vowel combination ae, an enormous number of rules is needed to cover all the different syllables in which this combination can appear. And since the corpus is limited, not all possible syllables can be found. The rules need to be generalized. For instance, a rule could be: rewrite all instances of ae to aa in syllables that have a coda.

But this introduces a few problems. For native Dutch speakers, it is probably fairly easy to recognize the syllable structure of many historic words. But an automatic way of splitting a word into syllables would be based on the modern Dutch spelling rules. Since historic words are not in accordance with these rules, splitting them into syllables might do more harm than good. According to modern spelling rules, the word claeghen would be split into claeg and hen, which is not what it should be (namely, clae and ghen). Redundant letters in historic words can shift the syllable boundaries, adding a coda or onset where there shouldn't be one.

To get around this problem, it is possible to split the word into sequences of vowels and sequences of consonants. The word claeghen would be split into the sequences cl, ae, gh, e and n. Syllable boundaries can be contained in one sequence (ia in hiaten), but this need not be a problem. Historic vowel sequences may only be rewritten to modern vowel sequences, and historic consonant sequences may only be rewritten to modern consonant sequences. Putting this restriction on what a historic sequence can be rewritten to will retain the syllable structure. Again, the considered context can be specific, changing ae to a in the context of cl and gh, or general, changing ae to a if the sequence is preceded and followed by any consonant sequence.
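Splitting a word into vowel and consonant sequences is easy to implement. The Python fragment below is a minimal sketch of such a splitter; the function name and the choice to count y as a vowel are assumptions for illustration.

```python
from itertools import groupby

VOWELS = set("aeiouy")  # assumption: y is treated as a vowel, as in many historic spellings

def split_sequences(word):
    """Split a word into maximal runs of vowels and maximal runs of consonants."""
    return ["".join(run) for _, run in groupby(word.lower(), key=lambda ch: ch in VOWELS)]

print(split_sequences("claeghen"))   # ['cl', 'ae', 'gh', 'e', 'n']
print(split_sequences("aanspraak"))  # ['aa', 'nspr', 'aa', 'k']
```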

Spelling errors versus phonetic spelling

Treating historic spelling as a form of spelling errors leads to the method of spell checking. A possible algorithm for finding the correct word given a misspelled word is the Edit Distance algorithm [24]. This algorithm finds the smallest number of transformations needed to get from one word to another word. At each step in the algorithm, the minimal cost of inserting, deleting or substituting a character is calculated. Inserting or deleting a character takes 1 step, a substitution takes 2 steps (the same as 1 delete + 1 insert). Changing bed into bad takes one substitution ('e' to 'a'), thus the edit distance between bed and bad is 2. The edit distance between bard and bad is 1 (deleting the 'r'). This can be used to find the word in a lexicon that is closest to the misspelled word [6].

However, historic spelling is different from misspellings in modern texts. The variance in spelling is not based on accidentally hitting a wrong key on the keyboard, but on phonetic information. Without any official spelling, writing caas or kaas makes no difference; they are both pronounced the same. Thus, historic Dutch can be treated as modern Dutch with spelling errors based on a lack of knowledge of modern spelling rules (which people in the 17th century were, of course, ignorant of). Thus, writing caas instead of kaas (English: 'cheese') is more probable than writing cist instead of kist (English: 'box'), since a c is pronounced as a k when followed by an a, but is pronounced as an s when followed by an i. From a phonetic perspective, the distance between cist and kist is bigger than between caas and kaas.
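A direct implementation of this cost scheme (1 for insertion or deletion, 2 for substitution) is sketched below in Python; it reproduces the distances used as examples in this chapter, but it is not the thesis' own code.

```python
def edit_distance(a, b):
    """Edit distance with cost 1 for insert/delete and cost 2 for substitution."""
    dist = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dist[i][0] = i            # delete all characters of a
    for j in range(len(b) + 1):
        dist[0][j] = j            # insert all characters of b
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 2
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # match or substitution
    return dist[len(a)][len(b)]

print(edit_distance("bed", "bad"))     # 2
print(edit_distance("bard", "bad"))    # 1
print(edit_distance("volcx", "volks")) # 4
```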

3.2 Rewrite rule generation

One can think of many different ways of generating rewrite rules. The use of phonetic transcriptions is one; another way would be to see the spelling variance as a noisy channel (i.e. treating historic spelling as a misspelling of modern Dutch), making rewrite rules out of typical misspellings. N-gram matching can be used to find letter combinations that occur frequently in a historic lexicon, but much less frequently in a modern lexicon. In all approaches, a few issues have to be considered. First of all, while some historic words are spelling variations of modern words, many other historic words are just plain different words. They cannot be mapped to modern words, although they can be modernized in spelling. Thus, purely historic words cannot be used to generate rules, but the generated rules will affect these words.

Three different rule generation methods have been developed:

1. Phonetic Sequence Similarity

2. Relative Sequence Frequency

3. Relative N-gram Frequency

3.2.1 Phonetic Sequence Similarity

The first method of mapping historic words to their modern variants is by using phonetic transcriptions of both historic and modern words. Phonetic matching techniques are used to find the correct spelling of a name when a name is given verbally, i.e. only its pronunciation is known (see [26]). For modern Dutch, a few automatic conversion tools are available to transform the orthographic word into a phonetic transcription. A phonetic transcription is a list of phoneme characters which have a specific pronunciation. A simple conversion tool for Dutch can be found on the Mbrola website.1 It makes acceptable phonetic transcriptions of words. But, because of its simplicity, it cannot cope with the less frequent letter combinations in the Dutch language. For instance, the combination ae is transcribed to two separate vowels AE. A much more complex grapheme to phoneme conversion tool is embedded in the text-to-speech software package Nextens (see http://nextens.uvt.nl). This converter is more sensitive to the context of a grapheme (letter). The grapheme n preceded by a vowel and followed by a b is not pronounced as an n but as an m. Also, it can cope with rarer letter combinations like ae (transcribed to the phoneme e). Which phonetic alphabet is used by these tools is not important, as long as the same tool is used for all transcriptions.2

While the conversion tools have been developed for modern Dutch, they can also be used for historic variants of Dutch. It is not clear how well this works, but if 17th century Dutch is close enough to modern Dutch, this could be a very simple way to standardize and modernize historic spelling. Once phonetic transcriptions are made for all the words in the historic lexicon and all the words in the modern lexicon, it is easy to find historic words and modern words with the same pronunciation. These word pairs can be combined in a thesaurus for lookup (see chapter 5), but they can also be used for constructing rewrite rules. The next step is then to construct a rewrite rule based on the difference between these historic and modern word pairs. One way to do this is to make a mapping between the differing syllables. But splitting historic words into syllables is problematic. However, splitting words in vowel sequences and consonant sequences is an option. If the equal sounding words have the same vowel/consonant sequence structure, then, by aligning the consonant/vowel sequences, the aligned sequences are paired on the basis of pronunciation. To clarify the idea, consider the following example:

historic word: klaghen
modern word: klagen
historic sequences: kl, a, gh, e, n
modern sequences: kl, a, g, e, n

All these sequence pairs are pronounced the same, including the pair [gh, g]. From this list of pairs, only the ones that contain two distinct sequences are interesting. Rewriting kl to kl has no effect.

After applying rewrite rules based on phonetic transcriptions, the lexicon has changed. But iterating this process has no further effect. Since the rewrite rules are based on mapping historic words to modern words that are pronounced the same, after rewriting, the pronunciation stays the same.

1 See http://tcts.fpms.ac.be/synthesis/mbrola.html or http://www.coli.uni-sb.de/~eric/stuff/soft/, which is the website of the author of the conversion tool.

2 This became clear when using the Kunlex phonetic transcriptions list that is supplied with the Nextens package. This list contains 340.000 modern words with phonetic transcriptions. However, converting the words to phonetic transcriptions using Nextens results in different transcriptions from the ones in the Kunlex list.


3.2.2 The PSS algorithm

The PSS (Phonetic Sequence Similarity) algorithm aligns two distinct character sequences that are similar based on phonetics. If the phonetic transcription PT of a historic word W_hist also occurs in the modern phonetic transcriptions list, then the modern word W_mod that has the same transcription PT is considered the modern spelling variant of W_hist. Both words are split into sequences of vowels and sequences of consonants. If the number of sequences of W_hist is different from the number of sequences of W_mod, no rewrite rule is generated, because there is an unmatched sequence. Consider the modern word authentiek and the similar sounding historic word authentique3. The modern word contains 6 sequences (au, th, e, nt, ie, k), while the historic word contains 7 (au, th, e, nt, i, q, ue). This last sequence ue is not pronounced (at least, not according to Nextens). All the other sequences can be aligned to the sequences of the modern word. This problem is sidestepped by ignoring these cases. When the numbers of sequences are equal, an extra check is needed to make sure that for both words the aligned sequences are of the same type, that is, both sequences should be vowels, or both should be consonants. In this research, no word pairs were found that couldn't be aligned properly, except for the word pair mentioned above, but as was mentioned, it is part of a French text. The next step is comparing all the aligned sequences. If the spelling of the historic sequence Seq^i_hist is different from the spelling of the modern sequence Seq^i_mod, a possible rewrite rule is found. Since both words are pronounced the same, apparently, both sequences are pronounced the same as well. By replacing Seq^i_hist in a historic word with Seq^i_mod, pronunciation should be preserved. Thus the rewrite rule becomes:

    Seq^i_hist → Seq^i_mod    (3.2)

The resulting rules are ranked by their frequency of occurrence. Thus, if Seq^i_hist and Seq^i_mod are aligned N times in all the equal sounding word pairs, the resulting rule has score N. If Seq^i_hist and Seq^i_mod are aligned often, it is highly probable that the rule is correct, and that it will have a large effect on the historic corpus.
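The heart of PSS can be sketched in a few lines of Python. The fragment below assumes a transcribe() function wrapping the grapheme-to-phoneme converter and the split_sequences() helper from the earlier sketch; both names are illustrative assumptions, and the vowel/consonant type check is omitted for brevity.

```python
from collections import Counter

def pss_rules(historic_lexicon, modern_lexicon, transcribe, split_sequences):
    """Count candidate rewrite rules from word pairs with identical transcriptions."""
    by_phon = {transcribe(w): w for w in modern_lexicon}  # index modern words by pronunciation
    rules = Counter()
    for hist in historic_lexicon:
        mod = by_phon.get(transcribe(hist))
        if mod is None:
            continue  # no modern word with the same pronunciation
        h_seqs, m_seqs = split_sequences(hist), split_sequences(mod)
        if len(h_seqs) != len(m_seqs):
            continue  # unmatched sequence: skip this pair
        for h, m in zip(h_seqs, m_seqs):
            if h != m:
                rules[(h, m)] += 1  # e.g. ('gh', 'g') from klaghen / klagen
    return rules  # rule score = number of times the aligned pair was seen
```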

3.2.3 Relative Sequence Frequency

The second approach tries to find modern spellings for sequences of vowels and sequences of consonants based on 'wildcard' matching. Each word, in both historic and modern corpora, is split into sequences of vowels and sequences of consonants (in the same way as for the PSS algorithm). Sequences that are frequent in historic texts but rare in modern texts are likely candidates for rewriting. To find the appropriate modern sequence to replace it, the historic sequence could be removed from the historic word and replaced by a wildcard. This should be a vowel wildcard if the removed historic sequence is a vowel, and a consonant wildcard for historic consonant sequences. If a modern word can be matched with the historic word containing a wildcard, the modern sequence that is aligned with the wildcard is a candidate for replacing the historic sequence. Historic and modern sequences that are aligned often have a high probability of being correct.

3 Although this word is in the DBNL corpus, it is probably taken from a document containing a small portion of French.

3.2.4 The RSF algorithm

The Relative Sequence Frequency (RSF) algorithm generates rules based on typical historic character sequences. The whole historic corpus is split into vowel and consonant sequences Seq, which are ranked by their frequency F_hist(Seq). After that, their frequency scores are divided by the total number of sequences of the whole corpus N_hist(Seq), resulting in a list of relative frequencies:

    RF_hist(Seq) = F_hist(Seq) / N_hist(Seq)    (3.3)

The same is done for the modern corpus. The final relative sequence frequency RSF(Seq) is given by:

    RSF(Seq) = RF_hist(Seq) / RF_mod(Seq)    (3.4)

A sequence Seq_i with a high RSF(Seq_i) is a typical historic character combination. A score of 1 means that the sequence is used just as frequently in a modern corpus as in a historic corpus. A threshold is used to determine whether a sequence is considered typically historic or not. This threshold is set to 10, meaning that, for a historic and a modern text of equal size N, the character sequence Seq_i should occur at least 10 times more often in the historic text to be typically historic. The reasoning behind this is that if a sequence occurs much more often in a historic text (i.e. is much more common in a historic text), there is a good chance that its spelling has changed in the past few centuries. If Seq_i occurs in the historic corpus but not in the modern corpus (i.e. RF_mod(Seq_i) = 0), RSF(Seq_i) is set to 10. No matter what its historic frequency is, Seq_i is infinitely more frequent in the historic corpus than in the modern corpus, and is considered a typical historic character combination.
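In code, formulas (3.3) and (3.4) amount to two frequency counts over the split corpora. The following sketch reuses the split_sequences() helper from the earlier fragment; the names are assumptions and this is not the thesis' implementation.

```python
from collections import Counter

def rsf_scores(historic_words, modern_words, split_sequences):
    """Relative sequence frequency RSF(Seq) = RF_hist(Seq) / RF_mod(Seq)."""
    f_hist = Counter(s for w in historic_words for s in split_sequences(w))
    f_mod = Counter(s for w in modern_words for s in split_sequences(w))
    n_hist, n_mod = sum(f_hist.values()), sum(f_mod.values())
    scores = {}
    for seq, freq in f_hist.items():
        rf_hist = freq / n_hist
        rf_mod = f_mod[seq] / n_mod
        # a sequence that never occurs in the modern corpus gets the threshold value 10
        scores[seq] = 10.0 if rf_mod == 0 else rf_hist / rf_mod
    return scores

# typically historic sequences are those with a score of at least 10:
# typical = [s for s, sc in rsf_scores(hist, mod, split_sequences).items() if sc >= 10]
```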

Starting with the highest ranking character sequence Seq, all historic words that contain this sequence are transformed into so-called 'wildcard words'. If Seq is a vowel sequence, a historic word W_hist contains Seq if Seq is preceded and followed by consonants, or by the start or end of the word. For example, the word quaellijk is not listed as a word containing the sequence ae, since ae is not the full vowel sequence (which is uae). In all the historic words, Seq is replaced by a 'vowel wildcard', resulting in a wildcard word WW_hist. The word aenspraek is transformed to VnsprVk, where V is a vowel wildcard. WW_hist is then compared to all modern words. A modern word W_mod matches WW_hist if it can match all vowel wildcards with vowel sequences, or consonant wildcards with consonant sequences. Thus, VnsprVk is matched with the modern word aanspraak, but also with inspraak, inspreek, and aanspreek. Given these 4 matches, ae is matched with i twice, with ee twice, and 4 times with aa, resulting in 3 different rewrite rules:

ae → aa
ae → ee
ae → i

Again, a threshold is used to remove unreliable matches. If Seq_hist has N(WW_hist) wildcard words, then Seq_mod is considered reliable if it is matched to Seq_hist by at least N(WW_hist)/10 wildcards, with a minimum threshold of 2. Only one wildcard match is considered a 'coincidence'. This threshold is called the pruning threshold. After each historic sequence is processed, and wildcard matches are found, the list of possible modern sequences is pruned by removing all rules with a score below the pruning threshold. For instance, the sequence ae has more than 5000 wildcard words. A modern sequence is a reliable match if it matches at least 500 wildcard words with modern words. Of course, it is possible, for words that contain Seq_hist multiple times (ae occurs twice in aenspraek), to restrict wildcard matching to words that match all the multiple wildcards with the same vowel sequence Seq_mod. In that case, only aanspraak would be a match. All the other words match ae with two different modern sequences.
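The wildcard matching step can be sketched with regular expressions: the historic sequence is replaced by a capturing wildcard of the same type (vowel or consonant) and the result is matched against the modern lexicon. Names and the regex encoding are illustrative assumptions; the check that the sequence is a full vowel/consonant sequence of the word is omitted here.

```python
import re

VOWELS = "aeiouy"
CONSONANTS = "bcdfghjklmnpqrstvwxz"

def wildcard_matches(hist_word, hist_seq, modern_lexicon):
    """Return the modern sequences that fill the wildcards for one historic word."""
    wild = f"([{VOWELS}]+)" if hist_seq[0] in VOWELS else f"([{CONSONANTS}]+)"
    pattern = re.compile("^" + hist_word.replace(hist_seq, wild) + "$")
    matches = []
    for mod in modern_lexicon:
        m = pattern.match(mod)
        if m:
            matches.extend(m.groups())  # sequences aligned with the wildcard(s)
    return matches

lexicon = ["aanspraak", "inspraak", "inspreek", "aanspreek"]
print(wildcard_matches("aenspraek", "ae", lexicon))
# -> ['aa', 'aa', 'i', 'aa', 'i', 'ee', 'aa', 'ee']
```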

3.2.5 Relative N-gram Frequency

A standard, language independent method for matching terms is n-gram matching. For each word, all substrings of n characters are determined. One way of determining similarity between two words is by counting the number of n-grams that are shared by these words. For instance, the words aenspraeck and aanspraak are split into the following substrings of length 3:

aenspraeck: #ae, aen, ens, nsp, spr, pra, rae, aec, eck, ck#
aanspraak: #aa, aan, ans, nsp, spr, pra, raa, aak, ak#

The number sign (#) shows the word boundary. Only the substrings nsp, spr and pra are shared by both words. Character n-gramming is a popular technique in information retrieval, where it can have a huge influence on accuracy (for a detailed analysis of n-gram techniques, see [17]). For this research, n-gramming is used to find typical historic n-grams. Like the previous RSF algorithm, relative frequencies are used to find letter combinations that are frequent in historic documents, but rare in modern documents.
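Extracting these boundary-marked n-grams takes only a few lines; the sketch below (hypothetical names) reproduces the example above.

```python
def ngrams(word, n=3, boundary="#"):
    """All character n-grams of a word, padded with a boundary marker on both sides."""
    padded = boundary + word + boundary
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

a, b = ngrams("aenspraeck"), ngrams("aanspraak")
print(a)                        # ['#ae', 'aen', 'ens', 'nsp', 'spr', 'pra', 'rae', 'aec', 'eck', 'ck#']
print(sorted(set(a) & set(b)))  # shared n-grams: ['nsp', 'pra', 'spr']
```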

3.2.6 The RNF algorithm

The third algorithm is only slightly different from the RSF algorithm, generating rules based on n-grams instead of vowel/consonant sequences. Hence, it is called the Relative N-gram Frequency (RNF) algorithm. It is basically the same algorithm, but with one major difference. Where the RSF algorithm considers only full sequences (a (full) vowel sequence is only matched with another (full) vowel sequence), the RNF algorithm matches an n-gram with any character sequence between n − 2 and n + 1 characters long.

This restriction is based on the fact that modern spelling is more compact than historic spelling. To indicate that vowels should have a long pronunciation, historic words are often spelled with double vowels (like aa, ee). In modern spelling, this is no longer needed (only in a few cases) because of the official spelling rules. Also, exotic combinations like ckxs were normal in historic writing, but in modern spelling, only x or ks is allowed. Thus, it is to be expected that a modern spelling variant of a historic sequence is often shorter than the historic sequence itself.

Also, without this restriction, the number of possible rules would explode. Replacing the n-gram aek in zaek with an unrestricted wildcard gives the wildcard word zW (where W is an unrestricted wildcard), which would match zaek with all existing modern words that start with the letter z. Processing hundreds of wildcard words will require enormous amounts of memory and disk space. By restricting the length of the wildcard, only words of length 2 to 5 are matched (this will still match many words, but memory requirements are now within acceptable limits). There is no restriction on vowels or consonants: an n-gram containing only vowels can be matched by a wildcard containing only consonants. RNF is tested with different n-gram sizes, ranging from 2 to 5.

When constructing rules from wildcard matches, the same pruning threshold is used as for the RSF algorithm described above. Without this threshold, the number of generated rules would still be enormous for large n (n ≥ 4). Especially for n = 5, literally hundreds of thousands of rules are generated. To reduce memory and disk space requirements, the pruning threshold for n = 5 is set to 5.
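The RNF matching step differs from the RSF sketch only in the wildcard: any character sequence between n − 2 and n + 1 characters long may fill the gap. A small regex-based sketch with hypothetical names:

```python
import re

def rnf_matches(hist_word, ngram, modern_lexicon, n):
    """Match a historic word with one n-gram replaced by a length-restricted wildcard."""
    low, high = max(0, n - 2), n + 1          # wildcard length between n-2 and n+1
    wild = "(.{%d,%d})" % (low, high)
    pattern = re.compile("^" + hist_word.replace(ngram, wild, 1) + "$")
    return [m.group(1) for w in modern_lexicon if (m := pattern.match(w))]

print(rnf_matches("zaek", "aek", ["zaak", "zeek", "ziek", "zak", "zoeken"], n=3))
# -> ['aak', 'eek', 'iek', 'ak']  ("zoeken" is too long for the restricted wildcard)
```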

3.3 Rewrite rule selection

3.3.1 Selection criteria

Once the methods for generating rewrite rules are working, it is easy to generate literally thousands of rewrite rules. Of course, not all these rules work equally well. Some rules are based on matching one particular historic word to one particular modern word, and some rules are based on matching a historic word to the wrong modern word. The number of matches between historic and modern words on which a rule is based can be used as a ranking criterion. The more historic words that can be transformed to a modern word with the same rule, the more probable it is that the rule is correct. A rule that maps only one historic word to a modern word might be correct, but even if it is, its influence on an entire corpus will be minimal. A rule that maps over a hundred historic words to modern words is probably correct. It is highly improbable that an inappropriate rule rewrites this many historic words to modern words. The rule lcx → ndst rewrites volcx to vondst, but very few other matches will be found. But how many matches are needed to make a reliable judgment on whether a rule is appropriate or not?

There are many different criteria that can be used. For instance, given a typical historic character sequence Seq_hist, the number of modern sequences N(Seq_mod) that lead to rewriting a historic word W_hist to a modern word W_mod should be as low as possible. N(Seq_mod) is the number of alternatives from which a modern sequence should be picked. If the same historic sequence occurs in many different rules (i.e. there are a lot of different modern consequents to rewrite to), the chance of only one of them being correct is small. If there is only one rule (i.e. there is only one modern consequent found for a historic sequence), then that is inevitably the best option. Another aspect to look at is the effect of the rule on the modernly spelled words in the historic corpus. Comparing the words of the historic corpus with a modern word list (Nextens comes with a fairly large list containing approximately 340.000 modern Dutch words with phonetic transcriptions, the so-called Kunlex word list) shows which words in the historic corpus have not changed in spelling. These words shouldn't be affected by rewrite rules. The criterion then becomes selecting only rules that have little to no effect on modernly spelled words. Of course, it is also possible to retain rules that have a large effect on these words, but restrict the application of such a rule to non-modern words (i.e. words that are not in the modern lexicon).

Another important decision to be made is whether a historic sequence can be rewritten to different modern spellings. As the y-problem described in section 3.6.2 indicates, not all sequences ay should be rewritten to the same modern form. The RNF algorithm has no difficulty with these problems, since larger n-grams take the context of ay into account. By first applying large n-grams, different words containing ay might be affected by different RNF rules. The other two algorithms, PSS and RSF, cannot take context into account since they use only vowels or only consonants. Thus, whatever selection criterion is used, only one modern spelling will be selected for each historic sequence. The following selection criteria will be discussed (a small selection sketch follows the list):

• Match-Maximal: Rank rules according to how many wildcard words are matched (MM).

• Non-Modern: Remove all rules that affect modern words in the historic lexicon. A word from the historic lexicon is modern if it is also in the Kunlex lexicon (NM).

• Salience: For the set of competing rules with the same antecedent part, select the consequent part with the highest score only if the difference between the highest score and the second highest score is above a certain threshold (S).
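A compact sketch of these three filters is given below, assuming each candidate rule is a tuple (hist_seq, mod_seq, score) and that both lexicons are available as Python sets; all names are hypothetical, and the details (e.g. treating salience as a ratio of scores) are one possible reading of the criteria.

```python
def select_rules(candidates, historic_lexicon, modern_lexicon,
                 mm_threshold=20, use_nm=False, salience=None):
    """Filter candidate rules (hist_seq, mod_seq, score) with the MM, NM and S criteria."""
    # MM: keep only rules whose wildcard-match score reaches the threshold
    rules = [r for r in candidates if r[2] >= mm_threshold]

    # NM: drop rules whose antecedent occurs in historic words that are already modern
    if use_nm:
        modern_in_hist = historic_lexicon & modern_lexicon
        rules = [r for r in rules if not any(r[0] in w for w in modern_in_hist)]

    # S: per historic sequence, keep the best consequent only if it is clearly better
    by_hist = {}
    for hist, mod, score in sorted(rules, key=lambda r: -r[2]):
        by_hist.setdefault(hist, []).append((score, mod))
    selected = []
    for hist, scored in by_hist.items():
        if salience is None or len(scored) == 1 or scored[0][0] >= salience * scored[1][0]:
            selected.append((hist, scored[0][1]))
    return selected
```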


3.4 Evaluation of rewrite rules

3.4.1 Test and selection set

A dozen more purely statistical selection criteria can be used, but another alternative is to create a selection and test set by hand. To test the effectiveness of the rewrite rules, a test set that contains historic words and the correct modern variant can be used. The historic words in the test set are picked from a random sample of words from a small list of 17th century books, published in the same period as the documents used for the generation of rewrite rules (1600–1620). Words from these books were randomly selected and added to the test set if a correct modern spelling was given. These modern forms were only entered when the historic word was recognized as a variant of a modern word, or as a morphological variant of a modern word. The historic word beestlijck would be spelled as beestelijk in modern Dutch. However, the word beestelijk is not an existing modern word, but a morphological variant of the word beestachtig (beastly). The test set contains some of these words. Some words were not recognizable at all. These were not added to the test set, since no modern spelling could be entered.

This way of constructing a test set is fairly simple and doesn't take a lot of time. In just a few hours, a total set of 2000 words was made. The whole set was then split into a selection set and a test set. The selection set was used as a rule selection method, as a way of sanity checking. Some of the constructed rules clearly make no sense. For example, the rule cxs → mbt might result in rewriting some historic words to existing modern words, but since it also changes pronunciation (and word meaning) radically, it is clear that this rule makes no sense. To make sure that all selected rules make at least some sense, a way of sanity checking is to select only rules that have a positive effect on the selection set.

Using edit distance, measuring the distance D(W_hist, W_mod) between the historic word and its modern variant and the distance D(W_rewr, W_mod) between the rewritten word and the modern variant shows the effect of a rewrite rule. Here is an example to explain edit distance, using the historic word volcx and its modern version volks:

       v   o   l   c   x
   0   1   2   3   4   5
v  1   0   1   2   3   4
o  2   1   0   1   2   3
l  3   2   1   0   1   2
k  4   3   2   1   2   3
s  5   4   3   2   3   4

Table 3.4: Edit distance between volcx and volks

The final edit distance between volcx and volks is 4. The first three characters of both words are the same, resulting in an edit distance of 0. But the next two characters differ: going from c to k takes one substitution, and another substitution is needed going from x to s.

       v   o   l   c   c
   0   1   2   3   4   5
v  1   0   1   2   3   4
o  2   1   0   1   2   3
l  3   2   1   0   1   2
k  4   3   2   1   2   3
s  5   4   3   2   3   4

Table 3.5: Edit distance between volcc and volks

If the difference between D(W_hist, W_mod) and D(W_rewr, W_mod) is zero, then either the rule is not applicable to the historic word, or it has no effect on the distance, in which case it is probably an inappropriate rule. Changing volcx into volcc has no effect on the edit distance (the distance between volcx and volks is equal to the distance between volcc and volks, see Tables 3.4 and 3.5), but the word has changed into something that is pronounced differently, while the historic word is pronounced the same as its modern variant volks. Most native Dutch speakers will have little trouble recognizing volcx as a spelling variant of the adverb volks, while they would probably recognize volcc as a variant of the noun volk.

The problem with using edit distance as a measure is that a bigger reduction in distance does not necessarily mean that a rule is better. Take two competing rewrite rules lcx → lcs and lcx → lk. The first rule reduces the edit distance from 4 to 2 (see Table 3.6), while the second rule reduces it to 1 (Table 3.7). The result of the first rule is a word that looks and sounds much like the correct modern word. The result of the second rule is a different modern word.

       v   o   l   c   s
   0   1   2   3   4   5
v  1   0   1   2   3   4
o  2   1   0   1   2   3
l  3   2   1   0   1   2
k  4   3   2   1   2   3
s  5   4   3   2   3   2

Table 3.6: Edit distance between volcs and volks

A rewrite rule has a positive effect on the selection set if the average distance between historic and modern words is reduced. The average change in distance between the original test set and the test set after rewriting is given by:


       v   o   l   k
   0   1   2   3   4
v  1   0   1   2   3
o  2   1   0   1   2
l  3   2   1   0   1
k  4   3   2   1   0
s  5   4   3   2   1

Table 3.7: Edit distance between volk and volks

    C = (1/n) * Σ_{i=0}^{n} [ D(W^i_hist, W^i_mod) − D(W^i_rewr, W^i_mod) ]    (3.5)

where D(W_hist, W_mod) is the edit distance between a historic word and its modern variant, and D(W_rewr, W_mod) is the edit distance between the rewritten historic word and the same modern variant. A simple measure would be dividing the average change in edit distance C by the distance D(Seq_hist, Seq_mod) between the historic antecedent Seq_hist of the rule and its modern consequent Seq_mod (rules that change multiple characters should reduce the average distance more than rules that change only one character):

    Score(rule_i) = C_i / D_i    (3.6)

If the resulting score is close to 1, the total amount of change by the rewrite rule is mostly in the right direction. Looking again at the example of the rule cx → k, the edit distance between the original historic word volcx and the modern word volks is reduced by 3, and the edit distance between cx and k is also 3 (cost 2 for substitution of c with k, and cost 1 for deleting x). Thus, this rule scores 1. In other words, every change by the rule is a change in the right direction. But this is not good enough. Rewriting cx to k reduces the edit distance between volcx and volks, but the rule cx → ks not only reduces the edit distance, it also rewrites the historic word to the correct modern variant. According to (3.6), both rules would get the same score. But if a rule changes some historic words to their correct modern forms, it must be a good rule. A better measure should account for this. (3.7) adds the number of words changed to their correct modern form M divided by the total number of rewritten words R:

    Score(rule_i) = C_i / D_i + M_i / R_i    (3.7)

Now, the rule cx → ks receives a higher score because it rewrites at least some of the words containing cx to their correct modern form. To make sure that rules with an accidental positive effect are not selected, a threshold for the final score of 0.5 is set. In words, this means that for each step done by the rule (insertion and deletion take one step, substitution takes two steps), the distance should reduce by at least 0.5.
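Formulas (3.5)–(3.7) translate directly into a scoring routine over the hand-made selection set. The sketch below reuses the edit_distance() function from the earlier fragment; the names are hypothetical, and averaging over the affected word pairs (rather than the whole set) is one possible reading of (3.5) that matches the worked example above.

```python
def rule_score(hist_seq, mod_seq, selection_set, edit_distance):
    """Score one rule as in (3.7): distance reduction per rule step plus
    the fraction of rewritten words that end up exactly correct."""
    affected = [(h, m) for h, m in selection_set if hist_seq in h]
    if not affected:
        return 0.0
    change, perfect = 0.0, 0
    for hist, mod in affected:
        rewritten = hist.replace(hist_seq, mod_seq)
        change += edit_distance(hist, mod) - edit_distance(rewritten, mod)
        if rewritten == mod:
            perfect += 1
    c = change / len(affected)                 # average change in distance, cf. (3.5)
    d = edit_distance(hist_seq, mod_seq)       # cost of the rule itself
    return c / d + perfect / len(affected)     # (3.6) plus the perfect-rewrite term of (3.7)

# a rule is kept only if its score is at least 0.5, as described above
```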

The big disadvantage of selecting only rules that have a positive effect on the selection set is that not all the typical historic word forms and letter combinations are in the selection set. Although the rules are based on statistics over the whole corpus, some constructed rewrite rules that are appropriate might not be selected because they have no effect on the selection set. On the other hand, from a statistical viewpoint, if a specific character combination is not in a set of 1600 randomly selected word pairs, then it is probably not a common or typical historic combination. Another drawback is that words that couldn't be recognized as variants of modern words are not in the test set, but are affected by the selected rewrite rules. Although the performance of a rule on the test set gives an indication of its "appropriateness" on the recognizable words, there is no such indication for its effect on the unrecognized words.

The test set is used as a final evaluation of the selected rewrite rules. The rewrite rules are applied to the historic words and then compared with the correct modern forms. As mentioned above, comparison is based on the edit distance between the words. The final score for a rule is the average distance between the rewritten words and the correct words. To get some measure of the effect of rewriting, the average distance between the original historic words and the correct words is also calculated as a baseline. The difference between these two averages should give an indication of the effect of rewriting. The baseline score is shown in Table 3.8.

           total word pairs   average distance
baseline   400                2.38

Table 3.8: The baseline average edit distance

3.5 Results

The three algorithms PSS, RSF and RNF are evaluated using the test set. To get an idea of how well certain rule sets perform, all automatically generated rule sets are compared with the manually constructed rule set in [2]. The results for this set of rules on the test set are given in Table 3.9. In column 2, the total number of rules in the rule set is given (num. rules); in column 3 the total number of historic words that are affected by the rules is given (total rewr.). The 4th column shows the number of historic words for which the rewriting is optimal (perf. rewr., indicating a perfect rewrite). The last column shows the new average distance (new dist.) between the rewritten historic words and the modern words. The difference between the new average distance and the baseline average distance is shown in parentheses.


rule set   num. rules   total rewr.   perf. rewr.   new dist.
Braun      49           248           137           1.41 (-0.97)

Table 3.9: Manually constructed rules on test set

3.5.1 PSS results

The PSS algorithm generated 510 rules, some of which contain the same historic sequence as antecedent part. From these, only the highest scoring rules with a unique historic sequence (i.e. only one rule per historic sequence) are selected (MM, see the Maximal Match criterion). The initial score of a rule is the number of times that the modern consequent Seq_mod of the rule is found as a wildcard for the historic antecedent Seq_hist (see the algorithm descriptions in section 3.2.1). Different threshold values for the rule selection algorithm were tested on the test set, ranging from 0 to 50 (the MM-threshold). The change in average distance shows whether the rules have a positive or negative effect. Also, the total number of words in the test set that are affected is given, together with the number of perfect rewrites. The number of perfect rewrites is the number of words which are rewritten to their correct modern form. The results are shown in Table 3.10. It is clear that the rewrite rules generated by the PSS algorithm have a bad influence on the average edit distance between historic words and their modern variants. However, by increasing the threshold, only the high scoring rules are applied, rewriting 56 out of 400 words to their correct modern form. The average distance still increases, though. The rules selected at threshold 5 perform better than the rules selected at threshold 10. Apparently, the rules with a score between 5 and 10 (or at least some of them) are better than some of the higher scoring rules.

One reason for this is the phonetic change of the sequence ae from a as in naem (name) to e as in collegae (colleagues).4 The ae sequence is very frequent in the historic corpus, but the Nextens converter transcribes it to an e, so that naem will be matched with neem (to take) instead of with naam (name). Accidentally, there are a lot of wildcard matches with ee, so the rule ae → ee gets a high score. Another high scoring rule is the rule oo → o, because many historic words contain a double vowel where their modern forms contain single vowels. But this rule generalizes too much. There are still many modern Dutch words containing a double vowel, and their historic counterparts should not be changed, like boot, boom, school, etc. There are many lower scoring rules that make more sense from a phonetic perspective, but low corpus frequencies keep them at a low score.

There are some phonetic transcriptions that are just plain wrong. For instance, the sequence igh in veiligh (safe) is transformed by Nextens to a so-called 'schwa' character. A 'schwa' is how non-stressed vowels are pronounced, like the 'e' in the character. The problem with igh is that in many words the 'i' is pronounced as a schwa, but the 'gh' is certainly pronounced. After conversion, the word veiligh is matched to the modern Dutch word veilen, because the final 'n' in infinitives is often not pronounced. A chain is as strong as its weakest link: if the phonetic transcriptions are not 100% correct, the generated rules can't be either.

4 Letters in boldface indicate phonemes.

As an extra, second selection criterion, only rules were selected that had no effect on those words of the historic lexicon that also occur in the modern Kunlex lexicon. Thus, only non-modern (NM, see the Non-Modern selection criterion in section 3.3.1) historic sequences are considered. The results for the salience criterion are given for a salience threshold of 2 (S 2 in the table). This means that the highest scoring rule R1 for a historic sequence Seq_hist is selected if R1 matches at least twice as many wildcards as the second best rule R2. Several different threshold values were tested; the threshold value 2 consistently shows the best results.

Sel. crit.   MM Thresh.   num. rules   total rewr.   perf. rewr.   new dist.
MM only      0            404          394           9             4.6  (+2.22)
MM only      5            109          373           25            3.39 (+1.01)
MM only      10           64           365           18            3.76 (+1.38)
MM only      20           34           320           34            3.14 (+0.76)
MM only      30           25           272           61            2.44 (+0.06)
MM only      40           22           269           59            2.46 (+0.08)
MM only      50           18           248           56            2.48 (+0.10)
MM + NM      0            251          232           39            2.87 (+0.49)
MM + NM      5            43           192           28            2.86 (+0.48)
MM + NM      10           20           185           24            2.88 (+0.50)
MM + NM      20           10           179           18            2.9  (+0.52)
MM + NM      30           6            147           12            2.71 (+0.33)
MM + NM      40           6            147           12            2.71 (+0.33)
MM + NM      50           5            112           12            2.71 (+0.33)
MM + S 2     0            383          376           15            3.88 (+1.81)
MM + S 2     5            99           331           28            2.75 (+0.65)
MM + S 2     10           56           322           21            3.12 (+1.01)
MM + S 2     20           29           247           40            2.58 (+0.51)
MM + S 2     30           22           195           56            2.15 (-0.23)
MM + S 2     40           19           190           54            2.17 (-0.21)
MM + S 2     50           15           151           51            2.2  (-0.18)
sel. set     N.A.         104          253           101           1.66 (-0.72)

Table 3.10: Results of PSS on test set

What is interesting is that once the NM selection criterion is applied, the number of rules that are applied has little effect on the average edit distance between rewritten words and the correct modern words, but is still in balance with the total number of affected words (more rules rewrite more words). The highest scoring rules affect the most words (5 rules rewrite 112 words). For lower thresholds, NM does have a positive effect, reducing the average distance by almost 38%. But this is probably because it just reduces the number of rules. Since most lowly ranked rules increase the average distance, reducing the number of lowly ranked rules will reduce the negative influence. However, the number of perfect rewrites is heavily affected by NM. Before applying NM, a higher threshold results in many more perfect rewrites, and the average distance drops to nearly the original distance (which is 2.38). After applying NM, an MM-threshold of 50 results in an increase in distance, with far fewer perfect rewrites (when compared to an MM-threshold of 50 before applying NM). In other words, the rules that were thrown out by NM were better than the rules that NM keeps in the set. Dropping the threshold to 20 introduces some more bad rules (only 5 rules are added, and the average distance goes up again). Decreasing the threshold even more shows that some of the rules with a score below 20 are better than some of the rules with a score above 20.

The results for the Salience (S) selection criterion look much more like the Maximal Match results. At each threshold level the number of rules is only slightly smaller than without selecting on salience. For the average distance, salience works much better: rules with a score above 30 decrease the average distance. Some of the rules that are removed by the salience criterion actually produce perfect rewrites. For thresholds 30, 40 and 50, the number of rules decreases by 3, and the total number of perfect rewrites decreases by 5. Thus, the 3 rules scoring between 30 and 40 removed by salience have a bad effect on the average distance but do have a positive effect on some words. This shows that for the historic antecedents in these 3 rules, multiple modern consequents are required, or the context of the historic sequence (the characters preceding and following the sequence) should be taken into account.

The best results by far are produced by using the selection set. As described in section 3.4.1, the selection set contains 1600 word pairs, and is used to filter out rules that have a negative effect on the test set. The MM-scores of the rules are ignored in this selection criterion, and are replaced by a score based on how well the rules perform on the selection set. As the selection set is constructed in the same way as the test set (in fact, only one set of 2000 words was constructed, which was split afterwards into the selection set and the test set), it should come as no surprise that this produces better results. About 63% of all the words in the test set are rewritten, and about 25% of them to their correct modern forms.

The PSS algorithm clearly suffers from wrong phonetic transcriptions. The change of pronunciation for some character sequences (most notably the sequence ae, which occurs very often in the historic corpus) over time is ignored by the Nextens conversion tool. These problems occur throughout the rule set, from highly frequent to rare sequences. Therefore, raising the MM-threshold will only reduce the total number of rules, effectively reducing the number of rules which have a negative effect on the test set, but also reducing the number of rules that have a positive effect. The use of the selection set seems the only way to sort the good rules from the bad ones.

3.5.2 RSF results

The RSF algorithm generates many more rules than the PSS algorithm, but the number of historic sequences for which it finds rewrite rules is smaller. This is because it finds many different rewrite rules for the same historic sequence. After selecting the highest scoring rule for each unique historic sequence, only 209 rules are left, compared to 293 rules for the PSS algorithm. This is probably because the RSF algorithm only considers typical historic character combinations, whereas the PSS algorithm considers all sequences in the historic index that can be matched with a modern variant. The PSS algorithm generates the rule cl → kl because the historic word clacht is pronounced the same as the modern word klacht. But the RSF algorithm never considers cl as a typical historic sequence, and thus won't generate a rewrite rule for it. The results on the test set are shown in Table 3.11. The rules generated by RSF perform much better than the ones generated by PSS. The average distance between the historic words and their correct modern variants decreases. Also, the number of perfect rewrites is much higher. Most rules have a very low score. Setting the threshold to 5 removes 70% of all the rules, while staying close to the performance of the full set of 209 rules. Apparently, the positive effect of the RSF rule set comes mainly from the rules with a score above 5. Further increase of the threshold shows a further decrease in performance, but this time the differences are becoming significant. Between 10 and 20, almost half of the rules are removed, and the number of perfect rewrites decreases further. But it should be clear that the most effective rules are the ones with the highest scores. Only 10 rules have a score above 50, but they account for the bulk of the perfect rewrites and the decrease in distance. For all thresholds, the ratio between total rewrites and perfect rewrites is roughly the same (of every 2 affected words, one rewrite is perfect).

Clearly, the NM selection criterion has a negative effect on the RSF generated rules. It throws out some rules (about 20-25%) which have a positive effect on the test set. By throwing them out, the number of perfect rewrites drops, and the average distance increases. The effect of NM for PSS was questionable; for RSF, it is just plain bad. A simple explanation for the performance of NM is that the RSF algorithm already selects historic sequences based on their relative frequency. Even if a historic sequence occurs in the modern corpus, the fact that it was selected by RSF means that it is at least 10 times more frequent in the historic corpus. The use of relative frequencies makes NM redundant.

Sel. crit.   MM Thresh.   num. rules   total rewr.   perf. rewr.   new dist.
MM only      0            209          261           133           1.41 (-0.97)
MM only      5            62           249           130           1.42 (-0.96)
MM only      10           39           243           127           1.44 (-0.94)
MM only      20           21           231           117           1.48 (-0.90)
MM only      30           13           212           109           1.54 (-0.84)
MM only      40           12           212           109           1.54 (-0.84)
MM only      50           10           206           103           1.56 (-0.82)
MM + NM      0            190          207           100           1.58 (-0.8)
MM + NM      5            51           195           97            1.59 (-0.79)
MM + NM      10           30           188           94            1.61 (-0.77)
MM + NM      20           16           178           85            1.64 (-0.74)
MM + NM      30           10           162           79            1.71 (-0.67)
MM + NM      40           9            162           79            1.71 (-0.67)
MM + NM      50           7            156           73            1.73 (-0.65)
MM + S 2     0            48           83            39            2.17 (-0.21)
MM + S 2     5            21           78            37            2.19 (-0.19)
MM + S 2     10           16           78            37            2.19 (-0.19)
MM + S 2     20           9            74            35            2.2  (-0.18)
MM + S 2     30           6            61            30            2.26 (-0.12)
MM + S 2     40           6            61            30            2.26 (-0.12)
MM + S 2     50           4            54            25            2.27 (-0.11)
sel. set     N.A.         76           252           140           1.33 (-1.05)

Table 3.11: Results of RSF on test set

The salience criterion also removes many good rules. At an MM-threshold of 50, 6 out of 10 rules are removed (60%), reducing the number of perfect rewrites by 78 (76%). In other words, probably the best rules in the entire set are removed by selecting on salience. By dropping the salience threshold, the performance will go up again. Another short test revealed that by reducing the salience threshold by 0.1 at a time, the performance slowly changes back towards the original performance. But only by setting the threshold to 1 (no salience) is the performance equal to that of the original MM rule set. Thus, for RSF, salient rules are no better, it seems, than other high ranking but non-salient rules.

Again, the selection set is the best selection criterion. Its performance is better than that of the MM-threshold. The number of perfect rewrites is higher, while the total number of rewrites is lower, and the average distance is reduced by more than 1 (for edit distance, this amounts to one step, insert or delete, closer to the modern word).

3.5.3 RNF results

Like the RSF algorithm, RNF also generates many rules. But since the sequences are not restricted to either consonants or vowels, many more historic sequences and possible modern sequences are considered. The n-gram size becomes important. For n-grams of size 2 only 27 * 27 (26 alphabetic characters plus the word boundary character) = 729 historic sequences are possible. For 4-grams, already 492.804 historic sequences are possible. Of course, most of these sequences will not be in the historic corpus (take 'qqqq' or 'xjhs' for example). So, before generating the rules, we can predict that there will be far more rules for n-grams of size 4 than for n-grams of size 2. See Table 3.12 for the results of all n-gram lengths. The results for NM are not listed, since they show the same bad effect as for the RSF rules, and would only make Table 3.12 less readable. As for salience, it shows mixed results. For 2-grams, the best salience threshold is 1.5, performing far worse than the original rule set. For 3-grams and 4-grams, the best value is around 1.25, showing some improvement in average distance for lower MM-threshold values (up to 20) but a drop in the number of perfect rewrites.

The results for n-grams of length 2 show that only the 8 highest MM-scoring rules, those with a score above 50, have a big influence on the test set. These rules are very good, rewriting 67% of all the words, and of these, 56% are perfect rewrites. This is due, for the largest part, to the rule ae → aa. Many of the historic words in the test set contain the sequence ae, and most of their corresponding modern variants have aa as the modern spelling variant. As the results at lower MM-thresholds show, the low scoring rules have almost no influence on the test set.

Another noticeable result is that at an MM-threshold of 20, the rules show the greatest reduction in average distance for all the other n-gram lengths. Also, for n ≥ 3, increasing the MM-threshold results in fewer perfect rewrites. As for the total rewrite / perfect rewrite ratio, the best n-gram lengths are 2 and 3.

Like the PSS and the RSF algorithm, RNF benefits greatly from the use of the selection set. All n-gram lengths show an improvement over the MM-threshold selection. The number of selected rules is smaller than for low MM-thresholds (which show the highest number of perfect rewrites of the different MM-thresholds), as is the total number of rewrites. But the number of perfect rewrites increases (this is most noticeable for n ≥ 4). Now, the 4-gram rules show the best results: 62% of all rewrites are perfect, and the average distance is reduced by almost 50%.

3.6 Conclusions

The most significant conclusion is that phonetic transcriptions are not nearly as useful as expected. As mentioned earlier, there are two reasons for this. First, the transcriptions are not always correct. Some letter combinations that no longer occur in modern Dutch words are treated as English or French character sequences. From the surrounding characters it should be clear that the word under consideration is certainly not English or French. The grapheme to phoneme converter of Nextens is very accurate compared to other conversion tools, but for this particular task, it is simply not good enough. In defense of Nextens, it should be mentioned that it wasn't designed for this task. It was designed with pronunciation of modern Dutch in mind, and that it does very well.

N-gram size   MM Threshold   num. rules   total rewr.   perf. rewr.   new dist.
2             0              15           271           150           1.29 (-1.09)
2             5              14           271           150           1.29 (-1.09)
2             10             11           269           150           1.30 (-1.08)
2             20             10           268           150           1.30 (-1.08)
2             30             9            267           150           1.30 (-1.08)
2             40             8            267           150           1.30 (-1.08)
2             50             8            267           150           1.30 (-1.08)
2             sel. set       12           271           152           1.29 (-1.09)
3             0              163          277           148           1.38 (-1)
3             5              163          277           148           1.38 (-1)
3             10             124          270           148           1.33 (-1.05)
3             20             81           260           143           1.33 (-1.05)
3             30             50           239           131           1.41 (-0.97)
3             40             39           229           123           1.46 (-0.92)
3             50             27           196           95            1.78 (-0.60)
3             sel. set       127          274           162           1.19 (-1.19)
4             0              458          284           115           1.89 (-0.49)
4             5              458          284           115           1.89 (-0.49)
4             10             321          268           110           1.87 (-0.51)
4             20             118          211           92            1.87 (-0.51)
4             30             57           163           64            1.93 (-0.45)
4             40             39           138           50            2.13 (-0.25)
4             50             29           114           37            2.18 (-0.2)
4             sel. set       276          269           166           1.20 (-1.18)
5             0              726          205           57            2.69 (+0.31)
5             5              726          205           57            2.69 (+0.31)
5             10             318          157           52            2.42 (+0.04)
5             20             78           80            28            2.25 (-0.13)
5             30             20           34            9             2.33 (-0.05)
5             40             10           23            6             2.36 (-0.02)
5             50             7            20            6             2.36 (-0.02)
5             sel. set       276          153           97            1.79 (-0.59)

Table 3.12: Results of RNF on test set

The other main reason is that, although the overlap between 17th century Dutch and contemporary Dutch is mostly in pronunciation, the pronunciation of some high frequency vowel and consonant sequences (highly frequent in historic Dutch, that is) certainly has changed. The correct transformation of these sequences is absolutely essential if good performance is to be achieved. Of course, this problem could be solved by adjusting the conversion tool, tweaking the rules through which certain character combinations are mapped to phonemes, but that would require expert knowledge. The main aim of this research was to find out if the spelling bottleneck can be solved without any expert knowledge. Clearly, we need more than phonetic transcriptions alone.

The other two methods work much better. Both methods only take typically historic sequences into account, and do not suffer from changes in pronunciation. Corpus statistics are enough to generate and select well-performing rewrite rules.

On the down side, the RSF and RNF algorithms only consider typically historic sequences. Many words with historic spelling contain sequences that are quite frequent in modern Dutch as well, like cl in clacht (modern Dutch: klacht). The PSS algorithm will generate rules for these sequences, but this is probably where the usefulness of rewrite rules turns into senseless spelling reformation. After transforming typically historic letter combinations into modern ones, rule generation should probably be replaced by word matching, either n-gram based (see section 4.2) or phonetic (see section 5.4).

It seems that the selection criteria NM and S only have some positive effect on the PSS and RNF rule sets for some MM-thresholds. There is no single value for the salience threshold that works properly for all methods. The only criterion that works well for all three methods is the selection set. It consistently shows the best results of all the different selection methods.

3.6.1 Problems

There are of course some specific problems with using rewrite rules based on statistics. Since spelling was based on pronunciation, and people pronounced certain characters in different ways, some historic words are ambiguous. Just like certain modern words can have different meanings determined by context, the spelling of some historic words can be rewritten to different modern words, depending on context. The character combination ue in the historic word muer can be rewritten to modern spelling as oe as in moer (English: nut) or as uu as in muur (wall).

3.6.2 The y-problem

Scanning the list of unique words of all corpora quickly showed a major problem. Many of the historic terms contain the letter y, in many different combinations with vowels. It occurs before or after any other vowel, or just by itself. And in all these cases its modern spelling is different. Table 3.13 shows the possible combinations and their modern spelling:

vowel   modern spelling   old / modern
ay      aa                withayrigh / witharig
ay      a                 gepayste / gepaste
ay      aai               sway / zwaai
ay      ai                zwaay / zwaai
ay      ei                treckplayster / trekpleister
ay      ij                vriendelayck / vriendelijk
ey      ei                kley / klei
ey      ij                vrey / vrij
ey      ee                algemeyn / algemeen
oy      oy                employeren / employeren
oy      ooi               flickfloyen / flikflooien
oey     oe                armoey / armoe
oey     oei               bloeyde / bloeide
uy      e                 huydendaachse / hedendaagse
uy      ui                suycker / suiker
uy      uu                huyrders / huurders
uy      u                 gheduyrende / gedurende
ya      ia                coryandere / koriander
ya      ija               vyandt / vijand
ya      iea               olyachtich / olieachtig
ye      ie                poezye / poezie
ye      ij                toverye / toverij
ye      ije               vrye / vrije
yu      io                ghetryumpheert / getriomfeerd
yu      ijv               yurich / ijv'rig

Table 3.13: Different modern spellings for y

The next chapter will describe other ways of evaluating the rewrite rules. The influence of the rewrite rules on document collections from other periods will be measured, as well as the effect of rewriting on retrieving historic word forms for modern words from a historic corpus. As an extra evaluation, a document retrieval experiment is described, where the rewrite rules are integrated into the IR system. Furthermore, a few simple extensions, such as combinations and iterations, to the three methods PSS, RSF and RNF are tested. This should provide a better indication of the performance of the rule-generation methods.


Chapter 4

Further evaluation

As we saw in the previous chapter, the RSF and RNF algorithms outperform the phonetically based PSS algorithm. Here, extensions to these methods are considered, as well as some other evaluation methods and test sets generated from documents from different periods. This chapter is divided into the following sections:

• Iteration and combination: The three methods described in the previous chapter are combined and used iteratively.

• Reducing double vowels: The problem of vowel doubling is investigated.

• Word-form retrieval: A method to retrieve historic word forms for modern words.

• Historic Document Retrieval: An external evaluation method to evaluate the rewrite rule sets.

• Documents from different periods: Evaluation of the rewrite rules on older and newer documents.

4.1 Iteration and combining of approaches

As stated in section 3.2.1, iteration over phonetic transcriptions has no effect. For the RNF and the RSF methods, iteration can have an effect. After the first iteration, the historic words that are changed by the rewrite rules have become more similar to their modern variants. A next iteration might result in more modern words that match a wildcard word. Consider again the example of the words aanspraak and aenspraeck. The consonant sequence ck is typical for historic documents; it rarely occurs in modern documents. But the wildcard word aenspraeC (C is a consonant wildcard) is not matched by any modern word. No modern spelling is found for ck. But through other words, the modern variant aa might have been found for the historic sequence ae. After applying this rule to the historic corpus, aenspraeck becomes aanspraack. In the next iteration the wildcard word aanspraaC will be matched with the modern word aanspraak, resulting in the rewrite rule ck → k.

Method   iteration   new rules   total rewr.   perf. rewr.   old dist.   new dist.
RSF      1           209         257           110           2.38        1.42
RSF      2           4           3             0             1.42        1.42
RSF      3           0           0             0             1.42        1.42
2        1           12          271           152           2.38        1.29
2        2           1           1             0             1.29        1.28
2        3           0           0             0             1.28        1.28
3        1           127         274           162           2.38        1.19
3        2           12          6             3             1.19        1.17
3        3           0           0             0             1.17        1.17
4        1           276         269           166           2.38        1.20
4        2           52          26            19            1.20        1.10
4        3           0           0             0             1.10        1.10
5        1           276         153           97            2.38        1.79
5        2           60          38            19            1.79        1.62
5        3           14          4             3             1.62        1.60

Table 4.1: Results of iterating RSF and RNF

However, by combining the different methods, iterating the phonetic tran-scription method suddenly does have effect. After applying the ae rightarrow aarule found by other methods, a phonetic transcription is made for aanspraackinstead of for aenspraeck. And the pronunciation of aanspraack is equal tothe pronunciation of the modern word aanspraak, while the pronunciation ofaenspraeck isn’t (at least, according to Nextens).

4.1.1 Iterating generation methods

After applying rewrite rules, the historic words are closer to their modern counterparts. There might be some historic sequence Seqhist for which no wildcard matches could be found because it only occurs in words with several typically historic sequences. After the first iteration, some of these other historic sequences may have been changed to a modern spelling. In a second iteration, Seqhist can be matched with a modern sequence.

In Table 4.1, the results of iterating the rule generation methods are shown. The average distances before and after applying the rules generated at each iteration are shown in columns 6 and 7. After the second iteration, only the rule set of RNF with n = 5 improves by further iteration. A simple explanation is that most rules generated for historic antecedents of length 5 affect only a few
words (the first 276 rules affect only 153 words in the test set). There are many more typically historic sequences of length 5 than there are of length 4. The problem with evaluating the rules for n-grams of length 5 is that these sequences are so specific that many of them do not occur in the test set at all. In each iteration, there are many more rules than there are affected words. All these rules have an antecedent part that does occur in the selection set, hence the selection of the rule. But the selection set is much bigger than the test set, and thus contains many more sequences. Looking purely at the scores, it is easy to conclude that 4-grams work better for RNF than 5-grams, but a look at the rules themselves gives another impression. Consider the historic sequence verci. The RNF algorithm finds the rule verci → versi for it, which changes a historic word like vercieren to versieren (adorn, decorate). The 4-gram verc should become verk in words like vercopen and overcomen, but it should become vers in vercieren. Because 5-grams are more specific, 5-gram rules probably make fewer mistakes. On the other hand, longer n-grams are more and more like whole words. Instead of generating rewrite rules, the RNF algorithm would be generating historic-to-modern word matches. It would consider every word of approximately the same length as a possible modernization, leaving all the work to the selection process.

4.1.2 Combining methods

The combining of methods is done by generating, selecting and applying the rules of one method before generating, selecting and applying the rules of another method. The rules of PSS contain not only typically historic antecedents, but also some non-typical antecedents. Thus, the PSS rules will affect different words, and words differently, than the other rule sets. Also, by first applying the rules generated by RNF or RSF, the historic antecedents that have a wrong phonetic transcription (like the frequent sequence ae) may be rewritten before the PSS rules are generated. This will reduce a number of wrong phonetic transcriptions. Therefore, the generation methods are applied one after another, in all different permutations, to see the effect of ordering the generation methods. To be able to make a fair comparison of the different orderings, the rules of each rule set were selected with the selection set, because it performs very well for all three generation methods. The rule set for RNF is the combined rule sets for the n-gram lengths 2, 3, 4 and 5. The rules were applied in order of length, with the longest antecedents first, because they are more specific than shorter antecedents: rewriting ae to aa before rewriting aeke to ake cancels the effect of the latter rule.
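As an illustration of this order of application, the following is a minimal sketch of applying a rule set to historic word forms, longest antecedents first. The rule set shown is a small hypothetical sample, and the simple left-to-right string replacement is only one possible way to implement the rules.

# Applying rewrite rules, longest antecedents first, so that a specific rule
# like 'aeke' -> 'ake' is tried before the more general 'ae' -> 'aa'.
# The rules below are a small hypothetical sample, not the generated rule sets.
rules = {
    "aeke": "ake",
    "ck": "k",
    "ae": "aa",
    "uy": "ui",
}

def apply_rules(word, rules):
    """Rewrite a historic word by applying every rule, longest antecedent first."""
    for antecedent in sorted(rules, key=len, reverse=True):
        word = word.replace(antecedent, rules[antecedent])
    return word

for historic in ["aenspraeck", "suycker"]:
    print(historic, "->", apply_rules(historic, rules))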

By combining all the rules of n-gram lengths 2, 3, 4 and 5, the improvement is huge. Even before combining RNF with the other algorithms, over 50% of all the words in the test set are modernized correctly.

Combining PSS and RSF increases performance significantly, although the order is not important. But when combining RSF with RNF, the ordering does matter. Applying RSF first, the results are worse than applying RNF alone; combining them is no improvement. Compared to the RNF rules alone, the only combinations that improve on them are the combinations with PSS. Apparently, the PSS rules are somewhat complementary to the RNF and RSF rules, as was expected. The RNF and RSF algorithms work in a similar way (relative frequency of a sequence); the PSS algorithm is fundamentally different: its rules are based on phonetics.

Method / Order      num. rules   total rewr.   perf. rewr.   new dist.
PSS                 104          253           101           1.66 (-0.72)
PSS + RNF           769          347           211           0.91 (-1.47)
PSS + RSF           136          326           166           1.13 (-1.25)
PSS + RNF + RSF     774          349           211           0.90 (-1.48)
PSS + RSF + RNF     389          348           206           0.91 (-1.47)
RNF-2               12           271           152           1.29 (-1.09)
RNF-3               127          274           162           1.19 (-1.19)
RNF-4               276          269           166           1.20 (-1.18)
RNF-5               276          153           97            1.79 (-0.59)
RNF-all             691          315           207           0.97 (-1.41)
RNF + PSS           746          335           224           0.87 (-1.51)
RNF + RSF           702          319           208           0.95 (-1.43)
RNF + PSS + RSF     752          337           224           0.87 (-1.51)
RNF + RSF + PSS     753          337           224           0.86 (-1.52)
RSF                 62           252           140           1.33 (-1.05)
RSF + RNF           328          295           183           1.05 (-1.33)
RSF + PSS           134          324           167           1.12 (-1.16)
RSF + RNF + PSS     381          342           193           0.96 (-1.42)
RSF + PSS + RNF     397          346           211           0.88 (-1.50)

Table 4.2: Results of combined methods on test set

Applied Rule set   Lexicon size
None               44,041
RSF                41,956
PSS                41,557
RNF                39,368
RNF+RSF+PSS        38,525

Table 4.3: Lexicon size after applying sets of rewrite rules

It is interesting to see that the total number of unique words in the corpus is greatly reduced by rewriting words to modern form. The original hist1600 corpus contains 47,816 unique words (see Table 3.3), if the lexicon is case sensitive (upper case letters are distinct from lower case letters). If case is ignored, there are 44,041 unique words left. After applying the rules of the combined methods PSS, RNF and RSF, the total number of unique words is reduced to 38,525, a 12.5% decrease. By rewriting, many spelling variants are conflated to the same (standard) form. As Table 4.3 shows, the RNF rules have the most significant effect on conflation. Looking at the number of rules that each method generates, this is hardly surprising.

4.1.3 Reducing double vowels

A common spelling phenomenon in 17th century Dutch is the use of double vowels to indicate vowel lengthening. In English, the difference is clear when comparing good with poodle: in the latter word, the oo is pronounced somewhat longer than in the former. The same effect occurs in Dutch. In bomen (English: trees), the o is long; in bom (bomb) the o is short. But boom, the singular form of bomen, is also pronounced with a long o. The double vowel oo is needed in this case to disambiguate it from bom. In modern Dutch spelling, this doubling of vowels is only used for syllables with a coda.1 Using the modern spelling rules for vowel doubling, redundant double vowels in historic words can be removed. A simple algorithm to do this, the Reduce Double Vowels algorithm (RDV), reduces a double vowel to a single vowel if it is followed by a single consonant and one or more vowels, or if the double vowel is at the end of the word (no coda). Thus, eedelsteenen becomes edelstenen, and gaa is reduced to ga, but beukeboom is not changed to beukebom. This algorithm does

1 There are some exceptions to this rule. The Dutch word for sea is zee. Without the double vowel, the e would be pronounced short, becoming ze (English: she).


make mistakes. The modern word zeegevecht (sea battle) is changed to the non-Dutch word zegevecht. The error is not in pronunciation, which is the same for both words, but in spelling. The double vowel ii (almost) never occurs in Dutch, so all occurrences of ii in historic words can safely be reduced. For word-final vowels, the e vowel is an exception. If a word ends in a single vowel e, this is pronounced as a schwa (like the e in 'vowel'). For words ending in a long e vowel, the double vowel ee is required (thee, zee, twee, vee). Thus, the algorithm should ignore word-final ee vowels.
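As an illustration, a minimal regex-based sketch of the RDV algorithm as just described (assuming lower-cased input; this is illustrative and not necessarily the exact implementation used here):

import re

VOWELS = "aeiou"
CONSONANTS = "bcdfghjklmnpqrstvwxz"

def reduce_double_vowels(word):
    # a double vowel followed by a single consonant and a vowel is reduced
    word = re.sub(rf"([{VOWELS}])\1(?=[{CONSONANTS}][{VOWELS}])", r"\1", word)
    # a word-final double vowel is reduced, except word-final 'ee'
    # (thee, zee, twee, vee keep their double vowel)
    word = re.sub(r"([aiou])\1$", r"\1", word)
    # 'ii' (almost) never occurs in modern Dutch, so it is always reduced
    return word.replace("ii", "i")

for w in ["eedelsteenen", "gaa", "beukeboom", "zeegevecht", "geelachtig", "zee"]:
    print(w, "->", reduce_double_vowels(w))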

To test its effectiveness, it was applied to the full historic word list, containing 47,816 unique words. Of these, 1498 words contain redundant vowels according to the RDV algorithm. The total number of words containing redundant vowels might be larger, since the algorithm is so simple it is bound to miss some of these words. But what of the words it did affect? The list of reduced words was checked manually. It turns out that of all 1498 words, 134 reductions were incorrect (almost 9%). A closer analysis of the incorrect reductions shows that, by far, most mistakes are made with the double ee vowel in non-final, open (no coda) syllables in words like veedieven (English: cattle thieves), tweedeeligh (English: consisting of two parts) and zeemonster (sea monster). In each of these three examples, the first syllable has its vowel reduced. But in all three examples, the first syllable is a Dutch word by itself. In fact, these words were the very reason why the algorithm ignores word-final ee vowels. It seems that the frequent use of compound words in Dutch has a significant effect on the (too) simple RDV algorithm. A modification might be compound splitting when encountering a word containing ee: if the first part of the word, up to and including ee, is an existing word by itself (i.e., it is in the lexicon), don't reduce the vowel. Other frequent mistakes have to do with the adding of a suffix. A typical Dutch suffix is -achtig, as in twijfelachtig (doubtful, twijfel means doubt). But the word geelachtig (yellowish, geel means yellow) is incorrectly reduced to gelachtig (gel-like). These errors can be reduced by suffix stripping (stemming).

Furthermore, it was tested on the test set (see Table 4.4). It affects only 29 of the 400 historic words in the test set, but of these, 20 are rewritten to the correct form. Using RDV after applying rewrite rules, it still has a significant effect on the test set. The best order of combining the three rule generation methods (RNF, RSF and PSS) affects 337 words, rewriting 224 of them to the correct modern form. After the RDV algorithm is applied, 5 more words are rewritten, with 235 perfect rewrites (almost 59% of all the words in the test set!). Many of the double vowels are removed by the 4-gram and 5-gram RNF rules (like eelen → elen), but it is mainly due to the fact that ae is rewritten to aa, resulting in more double vowels, that double vowel reduction still has a significant effect.

Applying the RNF+RSF+PSS rule set and the RDV algorithm on the example text from the 'Antwerpse Compilatae' (see chapter 2) gives the following result:

9. item, oft den schipper verzijmelijk ware de goeden ende koopmanschappen int laden oft ontladen vast genoeh te maken, ende dat die daardore vijtten takel oft bevangtouw schoten, ende int water oft ter aarden vielen, ende also bedorven worden oft te niette gingen, alsulke schade oft verlies moet den schipper ook alleen dragen ende den koopman goet doen, als vore.

10. item, als den schipper de goeden so kwalijk stouwd oft laijd dat d'ene door d'andere bedorven worden, gelijk soude mogen gebeuren als hij onder geladen heeft rozijnen, allijn, rijs, sout, gran ende andere diergelijke goeden, ende dat hij daar boven op laijd wijnen, olien oft olijven, die vijtlopen ende d'andere bederven, die schade moet den schipper ook alleen dragen ende den koopman goet doen, als boven.

Method / Order          num. rules   total rewr.   perf. rewr.   new dist.
RDV only                N.A.         29            20            2.31 (-0.09)
RNF + RSF + PSS         753          337           224           0.86 (-1.52)
RNF + RSF + PSS + RDV   753          342           235           0.83 (-1.55)

Table 4.4: Results of RDV on test set

The words verzijmelijk, vijten, genoeh and stouwd are incorrect rewrites of the words versuijmelijck, uit een, genoeg and stouwt. But takel, maken, schade, koopman, kwalijk, rozijnen and dragen are correctly transformed from taeckel, maecken, schade, coopman, qualijck and rosijnen. Although it is far from perfect, many words are modernized. Even the word verzijmelijk is orthographically much closer to its correct modern form verzuimelijk, although its pronunciation is no longer the same.

4.2 Word-form retrieval

Since the edit distance measure on a manually constructed test set is not the only (and probably not the best) way of evaluating the performance of rewrite rules, another evaluation method is used here. In [18] historic spelling variants of modern words are retrieved using character-based n-gram matching (see Table 3.1). To evaluate the rewrite rules generated by RNF (the best method by far), a similar experiment was done on the full test set of 2000 word pairs. Each modern word in this set is used as a query word. The full list of historic words from the hist1600 corpus, plus the historic word forms of the test set,2 was used as the total word collection. Since the full number of spelling variants for each modern word is not known, the word pairs from the test set were used to perform a known-item retrieval experiment. We are looking for a specific spelling variant, namely, the one from the test set. Table 4.5 shows recall at several different levels. The experiment was repeated after rewriting the historic word list using the 4-gram RNF rules after 2 iterations (see 4.1.1), and using the best combination rule set (RNF, RSF, PSS).

2 This was done to make sure that the appropriate historic word form is in the word list from which the words are retrieved.

N-gram size   Rule set   recall @20   @10     @5      @1
2             none       85.30        78.30   69.05   32.20
2             4-gram     90.55        86.00   79.25   46.65
2             comb.      92.50        88.90   83.20   48.80
3             none       79.80        70.60   59.95   27.65
3             4-gram     88.35        83.90   76.85   45.65
3             comb.      90.80        86.70   81.50   49.25
4             none       65.75        56.15   45.75   20.40
4             4-gram     83.30        78.50   73.20   43.90
4             comb.      86.50        81.40   76.65   47.25
5             none       52.00        45.50   37.30   16.50
5             4-gram     77.15        73.15   68.30   41.55
5             comb.      80.30        76.65   72.55   45.50

Table 4.5: Results of historic word-form retrieval

Especially at the low cut-off levels (recall @1 and @5), the difference between the original historic words and the rewritten words is huge.

The 4-gram rules generated by RNF perform much better than the 5-gram rules. The 5-gram rules are a huge improvement on the original words, but the 4-grams are much better still. The performance of 2-gram and 3-gram matching @20 is comparable to the experiments by Robertson and Willett (see Table 3.1).

When matching n-grams for historic word forms, small n-grams (2 and 3) perform better than large n-grams (4 and 5). However, it is interesting to see that the difference between rewriting and no rewriting becomes very big for large n-grams at the lower cut-off levels (recall @1 and recall @5). More specifically, without rewriting, the performance of 4-gram and 5-gram matching is very poor at recall @1 and @5.
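For illustration, a minimal sketch of this kind of character n-gram matching, using a Dice-style overlap score over padded character n-grams; the exact matching function used in [18] and in this experiment may differ from this sketch.

def char_ngrams(word, n):
    """Character n-grams of a word, padded with a boundary marker."""
    padded = "#" + word + "#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def ngram_similarity(a, b, n=3):
    """Dice coefficient over the character n-gram sets of two words."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return 2 * len(ga & gb) / (len(ga) + len(gb))

def retrieve_word_forms(modern_word, historic_lexicon, n=3, top=20):
    """Rank historic word forms by their n-gram similarity to a modern query word."""
    return sorted(historic_lexicon,
                  key=lambda w: ngram_similarity(modern_word, w, n),
                  reverse=True)[:top]

print(retrieve_word_forms("aanspraak", ["aenspraeck", "suycker", "maecken"], top=2))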

4.3 Historic Document Retrieval

Parallel to this research, Adriaans [1] has worked on historic document retrieval (HDR). He has investigated whether HDR should be treated as cross-lingual information retrieval (CLIR) or monolingual information retrieval (MLIR). In his CLIR approach, the rewrite rule generation methods described here have been applied to a collection of historic Dutch documents, and a set of modern Dutch queries.


4.3.1 Topics, queries and documents

Two sets of topics were used. One is a set of 21 expert topics from [2], for which a number of the documents in the collection were assessed and marked as relevant or non-relevant. The formulation of the queries and the assessment of the documents have been done by experts of 17th century Dutch. The other set contains 25 topics, for which only one relevant document is known. This approach is called known-item retrieval. A document from the collection is selected randomly, and a query is formulated that describes the content of that document as precisely as possible. This approach is used when there is a lack of assessors and/or time to assess the relevance of all the documents for each query. Each query in the experiment is a combination of a title and a description. The description is a natural language sentence, posed in modern Dutch, describing the topic of the query. The title contains key words from the description. By combining the description with the title, the key words occur twice in the query, giving them extra weight in ranking the documents. The three columns in the result tables show the average precision for topics using only descriptions (D only), descriptions and titles (D+T), and titles only (T only) as queries. The average precision is the non-interpolated average precision score. For each query, the top-10 ranking documents are retrieved. If, for query Q, the 3rd and 5th documents are relevant, the non-interpolated average precision is the average of the precision at rank 3 and the precision at rank 5. If n relevant documents are retrieved, the average precision is:

Avg.Prec. = \frac{1}{n} \sum_{i=1}^{n} \frac{i}{\mathrm{rank}(i)}    (4.1)
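As a small concrete illustration of (4.1) (a hypothetical helper, not part of the original experiments), the score for the example above, with relevant documents at ranks 3 and 5, is the average of 1/3 and 2/5:

def average_precision(relevant_ranks):
    """Non-interpolated average precision as in (4.1); relevant_ranks are the
    1-based ranks of the relevant documents among the retrieved documents."""
    ranks = sorted(relevant_ranks)
    if not ranks:
        return 0.0
    return sum(i / rank for i, rank in enumerate(ranks, start=1)) / len(ranks)

print(average_precision([3, 5]))   # (1/3 + 2/5) / 2 = 0.3666...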

The document collection is also the same as the one used in [2], because these documents were assessed for the expert topics.

4.3.2 Rewriting as translation

The rule set used is a combination of RNF, PSS and RSF (see section 4.1.2). However, the RNF+PSS+RSF rules were generated by non-final versions of the generation algorithms, since the HDR experiments were done before the final versions were ready. The RNF algorithm performed sub-optimally because of memory problems. Due to time constraints, the best rewrite rules at that time were used (which is the ordered generation of RNF, PSS and then RSF). The final version of the RNF algorithm generates more rules, and with the final combined rule sets this would probably give different HDR results.

Table 4.6 shows the results of the HDR experiment on the known-item topics.3 The baseline is the standard retrieval method, looking up the exact query words in the inverted document index. An inverted document index is a matrix in which the columns represent the documents in the collection, and the rows represent all the unique words in the entire document collection, with each cell containing the frequency of the represented word in the represented document. Thus, a column shows the frequencies of all words that occur in the document, and each row shows the frequencies of a word in the documents that it occurs in.

3 The results in this thesis differ from the results in [1] because of last-minute changes in [1]. The results shown here are only based on topics for which there is at least one relevant document. The results in [1] also take into account topics that don't have any relevant documents in the collection.

Method               Avg. prec.   Avg. prec.   Avg. prec.
                     D only       D+T          T only
Baseline             0.2192       0.1955       0.1568
Stemming             0.2125       0.2352       0.1749
4-grams              0.2366       0.2538       0.2457
Decompounding        0.2356       0.2195       0.1795
Rules Doc            0.3097       0.3118       0.3016
Rules Query          0.2266       0.2537       0.2487
Rules Doc + stem     0.3702       0.3884       0.3006
Rules Query + stem   0.1873       0.2370       0.2234

Table 4.6: HDR results using rewrite rules

The table shows the results of some standard IR techniques as well. Stemming, as explained in section 2.3, conflates words through suffix stripping. The 4-gram experiment uses 4-grams of words in combination with the words themselves as rows in the inverted index. Decompounding is used to split compound words into their compound parts. The results for the rewrite rules are split into document translation and query translation. In the query translation experiment, a list of translation pairs was made for the words in the historic document collection, containing the original historic term and its rewritten form. Each query was expanded with an original historic word if its rewritten form was a query word. The document translation experiment was done by replacing each word in all documents by its rewritten form from the same list of translation pairs. The first 4 experiments can be seen as monolingual IR (documents and queries are treated as one language). Translating queries or documents is a cross-language (CLIR) approach: either the documents are translated into the language of the queries, or the queries are translated into the language of the documents. The last two rows show the effect of stemming after translation.
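A minimal sketch of how such a list of translation pairs can be used for the two translation directions described above (the pair list shown is a tiny hypothetical sample):

# historic word form -> its rewritten (modernized) form
translation_pairs = {"aenspraeck": "aanspraak", "suycker": "suiker"}

def expand_query(query_words, translation_pairs):
    """Query translation: add the original historic form whenever its
    rewritten form is one of the query words."""
    expanded = list(query_words)
    for historic, modern in translation_pairs.items():
        if modern in query_words:
            expanded.append(historic)
    return expanded

def translate_document(doc_words, translation_pairs):
    """Document translation: replace every historic word by its rewritten form."""
    return [translation_pairs.get(word, word) for word in doc_words]

print(expand_query(["suiker", "handel"], translation_pairs))
print(translate_document(["den", "suycker", "handel"], translation_pairs))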

Of the 4 monolingual approaches, the use of 4-grams works best, although stemming and decompounding perform better than the baseline as well. Translation of the queries is comparable in performance to the 4-gram approach, but stemming the translated queries has a negative effect, especially when using descriptions only. Query translation means adding historic word forms to the query. These historic word forms contain historic suffixes that might not be stripped by the stemmer, just like the historic word forms in the documents. Without rewriting, many historic spelling variants cannot be conflated to the same stem. Thus, if the historic query terms are not affected by the stemmer, they will only be matched by the exact same word forms in the document collection, not with any morphological variant. Document translation is clearly superior to query translation. Even without stemming it consistently outperforms all the other approaches. But here, stemming is useful. By rewriting, many historic spelling variants are conflated to a more modern standard, including their suffixes. After stemming, morphological variants are conflated to the same stem, which significantly improves retrieval performance. For the D+T and T only queries, the improvement over the baseline is almost 100%.

Method               D only   D+T      T only
Baseline             0.3396   0.4289   0.4967
Stemming             0.3187   0.3778   0.4206
4-grams              0.3037   0.3465   0.3821
Decompounding        0.3307   0.4228   0.4900
Rules Doc            0.2825   0.3835   0.4538
Rules Query          0.3067   0.4224   0.4844
Rules Doc + stem     0.2690   0.3268   0.3799
Rules Query + stem   0.2920   0.3628   0.4214

Table 4.7: HDR results for expert topics

The results for the expert topics are listed in Table 4.7. These results are in no way comparable to the known-item results. No matter what approach is used, nothing performs better than the baseline system. The decompounding approach and the query translation approach (without stemming) come close to the performance of the standard system, but they show no improvement.

A closer analysis of the topics shows that the experts who formulated the queries used specific 17th century terms, and added historic spelling variants to some of the descriptions and the titles. Topic 13 has the following description and title:

• Description: Welke invloed heeft 'oorvrede' nog in de periode van de Antwerpse Compilatae (normaal: oorvede, oirvede)?

• Title: oorvrede Antwerpse Compilatae oorvede oirvede

The document collection used for retrieval contains documents from the 'Gelders Land- en Stadsrecht' corpus and the 'Antwerpse Compilatae' corpus. Both are collections of texts concerning 17th century Dutch law. All documents from the 'Antwerpse Compilatae' contain the words Antwerpse Compilatae. So, by putting these words in the query, all documents from this corpus are considered as possibly relevant documents. Next, the word 'oorvrede' is combined with two spelling variants in both the description and the title. By rewriting the documents, the spelling variant oirvede might have changed, so all documents originally containing oirvede no longer match the query word oirvede. In general, if spelling variants are added to the query, the documents should not be rewritten, since rewriting is used to conflate spelling variants.

4.4 Document collections from specific periods

Since spelling changed over time, becoming more and more standardized because of the increase in literacy and the printing press, a rewrite rule should only be used on documents dating from more or less the same period as the documents it was generated from. That is, if documents from the beginning of the 17th century were used to construct rewrite rules, applying them to texts dating from 1560 or the late 17th century might have a negative effect, because the pronunciation of certain character combinations might have changed in between these periods. To see if this is the case, two small test sets, containing 100 word pairs each, one from a text dating from 1569 and one from a text written in 1658, were manually constructed (in the same way as the large test set constructed from texts dating from 1600–1620, see 3.4.1). The rules and the RDV algorithm were applied to these sets with the following results (the second test set, period 1600, is the original, large test set):

Period   baseline distance   Braun   4-gram   PSS    RSF
1569     2.62                1.53    1.57     1.99   1.82
1600     2.38                1.54    1.20     1.65   1.43
1658     1.98                1.24    0.95     1.54   0.95

Table 4.8: Results of rules on test sets from different periods

As the second column shows, the average distance between the historic words and their modern counterparts decreases together with the age of the source documents (the actual differences might be even bigger, since the test sets do not contain words that haven't changed over time; the texts from 1658 probably contain more of these words than older texts). What is interesting to see is that the rules manually constructed by Braun perform better on the oldest test set than any automatic method. Even the best rules generated by RNF perform somewhat worse, although they do perform much better on the test sets from more recent documents. Again, the PSS rules perform worst of all rule sets. The RSF rules show a great improvement in performance as the age of the source documents decreases. On the 1658 test set, they show the same performance as the best RNF rules. Of course, given the small size of the test sets, these results only give an indication of the effects on test sets from different periods. To be able to draw reliable conclusions, much larger test sets, and maybe even a word-form retrieval experiment (see section 4.2), should be used.


4.5 Conclusions

All the different evaluations show that 4-gram RNF generates the best rules. Although the Braun rules appear to be more period-independent, for documents written after 1600 the automatic methods perform much better. Word retrieval benefits greatly from rewriting, getting performance on par with the results from [18] for historic English word forms. For HDR, the effect of rewriting is spectacular. What is interesting is the improvement of the stemming algorithm. Before rewriting, stemming the documents and queries has a mixed effect: for the titles it is useful, but for the descriptions, stemming has very little effect. But once the rewrite rules have been applied, many more historic words have a modern suffix that can be removed, conflating spelling and morphological variants all to the same stem. In other words, rewriting has brought the historic Dutch documents closer to the modern Dutch language.

As the iteration of RNF ceases to have effect after 3 iterations, it might be more effective to switch to phonetic matching, or possibly word retrieval, after that. It would be interesting to see the results of a combined run, first rewriting documents and then using n-gramming to find spelling variants that are not yet fully modernized. Also, the results of the HDR experiment are based on old rules. The current best rule set performs much better on the test set evaluation and on the word retrieval test, and thus might also perform even better in the HDR experiment.


Chapter 5

Thesauri and dictionaries

A thesaurus is a dictionary of words containing for each word a list of related words. In IR, it is often used to find synonyms of a word, and other closely related words, for query expansion. A document D that doesn't contain any of the query terms will not be retrieved, but can still be relevant. The topic of the query might be discussed in a document using different, but related, words. By expanding the query with related words from the thesaurus, the query might now contain words that are in D, so D will be retrieved.

In [2], one of the main bottlenecks for HDR is the vocabulary gap. This gap not only represents concepts that no longer exist, or that didn't exist in the 17th century; it also represents the concepts that are described by modern and historic synonyms. As 17th century documents contain many synonyms (see [2, p. 31]), a thesaurus can be useful for query expansion.

A thesaurus can be created automatically by extracting and combining word pairs from different sources:

1. Small parallel corpora

2. Non-parallel corpora (using context)

3. Crawling footnotes

4. Phonetic transcriptions

5. Edit distance

The first three methods can be used to construct a thesaurus for the vocabulary gap. The last two methods might be used to tackle the spelling problem, as an alternative to rewrite rules.

5.1 Small parallel corpora

A very useful technique for finding word translation pairs between two different languages is word-to-word alignment in parallel corpora (see [21] and [22]). In
the European Union, for instance, all political documents written for the European Parliament have to be translated into many different languages. As these documents contain important information, it is essential that each translation conveys exactly the same message. The third paragraph in a Polish translation contains the same information as the third paragraph in an Italian translation. This can be exploited to construct a translation dictionary automatically by aligning sentences and words within these sentences. A collection of such documents in several languages is often called a parallel corpus. A parallel corpus can thus be used to find synonyms in one language for words in another language. If such a collection of documents were available for 17th century Dutch and modern Dutch, it could be used to construct word translation pairs between 17th century and modern Dutch. This could be a partial solution to the vocabulary gap identified in [2]. Partial, because historic words for concepts that no longer make any sense in modern times cannot be aligned with modern translations, simply because no such translations exist.

One of the largest parallel corpora is probably the Bible. It is translated into many different languages, and also into many different historical variants of many modern languages. The advantage of using different Bible translations is that a line in one translation corresponds directly to the same line in the other translation. The Statenbijbel and the NBV (Nieuwe Bijbel Vertaling) can be used to construct a limited translation dictionary. The Statenbijbel is the first Dutch translation of the original Bible, written in 1637. The NBV is the most recent Bible translation, available in book form and on the internet. However, the Statenbijbel, unlike the NBV, is not electronically available. The oldest digitized version that can be found is a modernized version of the Statenbijbel, dating from 1888. By that time, official spelling rules had been introduced, and late 19th century Dutch is very similar to modern Dutch, making this version useless for 17th century Dutch.

5.2 Non-parallel corpora: using context

If the time span between the two language variants is large, the variants can be considered to be different languages; historic documents can be considered documents written in another language. If the time span between variants becomes smaller, the languages are more alike and there is an increasing lexical overlap. Many words have the same meaning in both languages. There are, however, some words that appear in one variant but not in the other. Over time, some words are no longer used, and new words have been made up. Some of the purely historic Dutch words, that are no longer used in modern Dutch, have a purely historical meaning. It is hard to find corresponding modern Dutch words for them, because they have lost their meaning in modern society. But some purely historical words do have a modern counterpart. The word opsnappers is a 17th century Dutch word that is no longer used in modern Dutch. Its modern variant is feestvierders (English: people having a party). Someone looking for documents on throwing a party in ancient days might use feestvierders as a
query word. Query expansion can benefit from a modern-to-historic translation dictionary containing opsnappers as a historic synonym for feestvierders.

There are several techniques that can be used to find semantically related words automatically. Two of them will be discussed here. The first uses the frequency of co-occurrence of two specific words; the second uses syntactic structure to find words that are at least syntactically, and possibly semantically, related.

5.2.1 Word co-occurrence

One way of constructing a thesaurus automatically is using word co-occurrence statistics to pair related words. This technique exploits the frequent co-occurrence of related words in the same document or paragraph. If two words co-occur frequently in documents, there is a fair chance that these words are related. In political documents, the words minister and politician will often co-occur. Of course, high-frequency words like the and in will also co-occur often. Content words have a lower frequency, and will co-occur less often than function words. But a thesaurus containing pairs of high-frequency function words will not help in retrieving relevant documents, since high-frequency words are often removed from the query. And most of the documents in the collection contain these function words, so almost all documents would be considered relevant. Content words not only carry content, they are also good at discriminating between documents. The word minister is much better at discriminating between political and non-political documents than the word the. Thus, a simple word co-occurrence thesaurus should pair related content words. There are two ways of filtering out highly frequent term co-occurrences. Removing the N most frequent terms from the lexicon is a rather radical approach. The other possibility is to penalize high term frequency by dividing the number of co-occurrences of two words by the product of their individual term frequencies. The main advantage of the latter approach is that high-frequency content words are not removed from the lexicon: no information is lost. On the other hand, if a content word occurs often, its discriminative power is minimal.
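A minimal sketch of the latter option, dividing the number of co-occurring documents of a word pair by the product of the individual term frequencies (helper names and data are hypothetical):

from collections import Counter
from itertools import combinations

def cooccurrence_scores(documents):
    """documents is a list of token lists; returns, for each word pair, the
    number of documents in which the pair co-occurs, divided by the product
    of the two term frequencies."""
    term_freq = Counter()
    pair_freq = Counter()
    for tokens in documents:
        term_freq.update(tokens)
        for w1, w2 in combinations(sorted(set(tokens)), 2):
            pair_freq[(w1, w2)] += 1
    return {(w1, w2): count / (term_freq[w1] * term_freq[w2])
            for (w1, w2), count in pair_freq.items()}

docs = [["de", "minister", "sprak"], ["de", "minister", "viel"], ["de", "zon", "scheen"]]
for pair, score in sorted(cooccurrence_scores(docs).items(), key=lambda x: -x[1]):
    print(pair, round(score, 2))

Note that in this toy example the pair with the highest score is the one made up of two words that each occur only once, which is exactly the low-frequency problem discussed next.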

However, low-frequency terms suffer from accidental co-occurrences. For two totally unrelated, low-frequency words, accidental co-occurrence is enough to make their co-occurrence significant. An extra problem is the spelling variation. A low-frequency content word W1 might be spelled differently in each document it occurs in. Even if it co-occurs with the same related word W2 in each of those documents, each spelling variant will have a co-occurrence frequency with W2 of 1. Of course, this problem is partly solved by applying the rewrite rules from chapter 3 to the documents, but there are still some spelling variants that are not conflated after rewriting. Since the document collection is limited in size, almost all content words have a low frequency, making it nearly impossible to construct a useful co-occurrence thesaurus in this way.


5.2.2 Mutual information

Another related approach comes from information theory. In [16], an automatic word-classification system is described using mutual information of word-based bigrams. Bigrams are often used in natural language processing techniques to estimate the next word given the current word. Given a corpus, the probability of word Wi is given by the probability of the words W1, W2, ..., Wi−1 occurring before it. This can be approximated by considering only the n − 1 words before Wi (Wi−n+1, ..., Wi−1). In the case of bigrams, only the previous word Wi−1 is considered. The probability of Wi given Wi−1 is the frequency of the bigram Wi−1, Wi divided by the frequency of Wi−1. The N most frequent words of an English corpus are classified in a binary tree by maximizing the mutual information between the words in class Ci and the words in class Cj. The final tree shows groups of semantically or syntactically related words. The mutual information between a word Wi from class Ci and a word Wj from class Cj is given in (5.1) (P(Wi, Wj) is the probability of the bigram Wi, Wj):

I(W_i, W_j) = \log \frac{P(W_i, W_j)}{P(W_i) P(W_j)}    (5.1)

The total mutual information M(i, j) between two classes Ci and Cj is then:

M(i, j) = \sum_{C_i, C_j} P(C_i, C_j) \times \log \frac{P(C_i, C_j)}{P(C_i) P(C_j)}    (5.2)

Maximizing the mutual information is done sub-optimally, by finding a locally optimal classification. First, the N words are randomly classified and Mt(Ci, Cj) is computed. Second, for each word W in both classes, the mutual information Mt+1(Ci, Cj) is calculated for the situation where W is swapped from one class to the other. If this increases the mutual information, the swap is permanent; otherwise, W is swapped back to its original class. In [16], the final classification stops after these N swaps. Because computing power has increased so much, it doesn't take much more time to iterate this process until, in the next N swaps, no swap is permanent. Working top-down, at each level, all the words of class Ci are classified further into subclasses. This ensures that the classification at the previous level stays intact.

To reduce the computational complexity of the algorithm, the mutual information at t+1 can be computed by updating the mutual information at t with the change made by W. If W is in class Ci at t, computing Mt+1(Ci, Cj) is done by first computing the mutual information M(W, Cj). This is the contribution of W at t. Next, the mutual information M(W, Ci) is computed. If M(W, Ci) is higher than M(W, Cj), swapping W to class Cj increases the mutual information. The new mutual information is then:

M_{t+1}(C_i, C_j) = M_t(C_i, C_j) - M(W, C_j) + \max(M(W, C_j), M(W, C_i))    (5.3)

In this way, the full mutual information M(Ci, Cj) has to be calculated only once, and is updated by each swap.
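A simplified sketch of this procedure for a single binary split is given below. It recomputes the class mutual information of (5.2) from scratch for every candidate swap instead of using the incremental update of (5.3), so it is illustrative rather than efficient.

import math
import random
from collections import Counter

def class_mutual_information(bigram_counts, assign):
    """Mutual information between the two classes, following (5.2).
    bigram_counts maps (w1, w2) to a frequency; assign maps a word to 0 or 1."""
    total = sum(bigram_counts.values())
    pair, first, second = Counter(), Counter(), Counter()
    for (w1, w2), freq in bigram_counts.items():
        pair[(assign[w1], assign[w2])] += freq
        first[assign[w1]] += freq
        second[assign[w2]] += freq
    mi = 0.0
    for (c1, c2), freq in pair.items():
        p = freq / total
        mi += p * math.log(p / ((first[c1] / total) * (second[c2] / total)))
    return mi

def greedy_binary_split(words, bigram_counts, seed=0):
    """Randomly assign the words to two classes, then keep swapping single words
    between the classes as long as a swap increases the mutual information."""
    rng = random.Random(seed)
    assign = {w: rng.randint(0, 1) for w in words}
    # only keep bigrams between the words that are being classified
    bigram_counts = {bg: f for bg, f in bigram_counts.items()
                     if bg[0] in assign and bg[1] in assign}
    improved = True
    while improved:
        improved = False
        for w in words:
            before = class_mutual_information(bigram_counts, assign)
            assign[w] = 1 - assign[w]            # tentative swap
            if class_mutual_information(bigram_counts, assign) > before:
                improved = True                   # keep the swap
            else:
                assign[w] = 1 - assign[w]         # swap back
    return assign

bigrams = Counter([("de", "minister"), ("de", "zon"), ("minister", "sprak"), ("zon", "scheen")])
print(greedy_binary_split(["de", "minister", "zon", "sprak", "scheen"], bigrams))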


The idea behind this approach is that closely related words are classified close to each other, and two unrelated words should be classified in different classes early in the tree (near the root). If two words convey the same meaning, it makes no sense to place them next to each other in a sentence, because it would make one of them redundant. The meanings of two adjacent words should be complementary. If two words co-occur often (i.e., their bigram frequency is high), they should not be in the same class. Low co-occurrence (low bigram frequency) of high-frequency words (high unigram frequency) makes it probable that the meanings of these words overlap, so they will be classified close to each other. The example classification given in [16] shows some classes that might be useful for query expansion. In one class, all days of the week are clustered together, and in another class, many time-related nouns are clustered. If one of the words in such a class is used as a query word, adding other words from the same class to the query might help in finding documents on the same topic.

Once the N most frequent words have been classified, adding other, less frequent words requires no more than putting each word in the class that results in the highest mutual information. This second step becomes trivial when adding words with very low frequencies. A word W with frequency 1 (this holds for the largest part of the content words) only shows up in 2 bigrams, once with the previous word in the sentence, and once with the next word. Thus, it will only add mutual information when classified in the opposite class of one of these words. If neither the previous nor the next word is in the same class at a classification level s, putting W at level s + 1 in class Ci or Cj makes no difference to the mutual information, because, using (5.3), it adds 0 to either class.

This introduces a new problem for historic documents. Because of the inconsistency in spelling, resulting in spelling variants, each variant has a lower corpus frequency and occurs in fewer bigrams than it would have given a consistent spelling (by conflating spelling variants, the new word frequency is the sum of the frequencies of the conflated variants). Apart from that, classification based on bigrams requires a huge amount of text. More text means better classification, simply because there is more evidence to base a classification on. But the amount of electronically available historic text is limited, resulting in data sparseness.1

To make sure that the algorithm was implemented correctly, a test classification was made using a 60 million word English newspaper corpus.2 The 1000 most frequent words were classified in a 6-level binary classification tree. Table 5.1 shows 4 randomly selected classes, paired with their neighbouring class, at classification level 6 (the leaves of the tree). Out of the 1 million possible bigrams (each of the 1000 unique words can co-occur with all 1000 words, including itself), for the 1000 most frequent words, the corpus contains 412,516 unique bigrams.

1 Data sparseness in this case means the lack of evidence for unseen bigrams. A bigram Wi−1, Wi might not occur in the corpus, making the probability P(Wi−1, Wi) 0. Smoothing techniques can be used to overcome this problem, but for the classification algorithm it still always adds the same amount of mutual information to each class, making classification trivial. In a larger corpus, there is a bigger chance that a certain bigram occurs, resulting in a more reliable probability estimate.

2 The newspaper corpus is the LA-times corpus used at CLEF 2002.

Class number   Class content
9              city Administration movie very given growing
10             nation department only like proposed approved
27             their housing
28             her my financial economic private drug Simi Newport World Laguna
               National Pacific Long Orange Santa Ventura middle as hot five six
               eight few will can 've may said would could does cannot did should
               'll is 'd must
35             allow take begin
36             bring give provide keep hold get pay sell win find build break
               create use meet leave become call tell ask say see think feel know
               want run stop play Japan husband hours days though
49             Clinton Anaheim Los Northridge Thousand judge wife couple key
               action summer minute top order largest usually anything non New own
50             Department American San Inc star hearing project election list
               book force war quarter morning week bad different free got

Table 5.1: Classification of 1000 most frequent words in LA-times

In class 28, a number of auxiliary verbs are clustered, and in class 36, some semantically related verbs are clustered. The neighbouring class, 35, contains some related verbs as well, which indicates that there is a relation between clusters that are classified close to each other. In 49 and 50, some time-related nouns are clustered (summer, minute, quarter, morning, week). But many clusters contain seemingly semantically unrelated words, like 'Administration', 'movie', 'very' and 'growing'. The ability to cluster on semantics is limited, although more data (a larger corpus) should lead to better (or at least more reliable) classification. The corpus is used as a language model for English. Thus, more text leads to a more reliable model.

Better classifications have been made with syntactically annotated corpora. One of the main problems with plain text is not the semantic but the syntactic ambiguity of words. The word 'sail' can be used as a noun ('The sail of a ship.') or as a verb ('I like to sail.'). But the orthographic form 'sail' can only be classified in one class. In syntactically annotated corpora, a word can be classified together with its part-of-speech tag. For modern English, such corpora exist, but for 17th century Dutch, all that is available is plain text.

The total historic Dutch corpus is much smaller than the English one, but still contains about 7 million words. The 1000 most frequent words share 226,318 bigrams (that means that 77% of all possible bigrams are not in the corpus). The same experiment was repeated with the historic Dutch corpus, and again 4 classes were randomly selected, shown in Table 5.2, together with their neighbouring classes.

Class number   Class content
11             (empty)
12             In uit Na Aen om Op tot van vanden of ofte en ende Laet Doen Wilt
               Uw Haer Zy Wy Zijn Mijn Hy Ons Ik Sijn Gy selve Noch vp Dus Der o
               Een Geen Daer Daar Dese Dies Dat Des Dees Alle so soo zo Zoo
               Indien Wanneer Nu al
25             aller also als verheven inder vander wien binnen te alwaer ter
26             welcken nam toch dewyl eerste dat dattet mit achter onder Roomsche
33             wie hemels verre inne vooren
34             och heer staat ras Maria connen konnen ware zijnde mede datse dijen
59             Heer Prins hand borst lijf beeld beelden brief boeck kennis steen
               gelt brant dood verdriet troost rustplaats slagh oyt niet voorts
               eerst sprack
60             bloed kroon troon staet stadt plaets wegh Boeck editie uitgave
               naem stof vrucht glans kort quaet begin neder noyt wel voort
               wederom weder zien sien gaven leeren

Table 5.2: Classification of 1000 most frequent words in historic Dutch corpus

In some of these classes, there are some clusters of syntactically related words. Class 12 contains many prepositions and pronouns, and classes 59 and 60 contain mostly nouns. Semantically, classes 59 and 60 are also interesting, because there are some themes: 'Heer' and 'Prins' (lord and prince), 'hand', 'borst' and 'lijf' (hand, chest, body), 'brief', 'boeck' and 'kennis' (letter, book, knowledge) in 59; 'kroon' and 'troon' (crown, throne), 'staet', 'stadt', 'plaets', 'weg' (state, city, place, road), 'Boeck', 'editie', 'uitgave' (Book, edition, edition) in 60. The main problem of small corpora is that, if the mutual information within one class is zero (none of the words in that class share a bigram), further classification is useless. This is clear in classes 11 and 12. Apparently, moving one word from class 12 to 11 does not increase the mutual information. A further subclassification of class 12 will result in one empty subclass, and the other subclass containing all words of class 12.

For a better comparison, the 1000 most frequent words in a 30 million word modern Dutch corpus3 were also classified in a 6-level binary tree. Four randomly selected classes and their direct neighbouring classes are listed in Table 5.3. In this corpus, the 1000 most frequent words share 295,404 unique bigrams.

3 This corpus is also from CLEF 2002.

Class number   Class content
21             dacht zet vraagt grote hoge oude
22             maakte hield sterke enorme vijftig hoe
29             Bosnische Europees Zuid dezelfde hetzelfde welke ieder vele veel
               enkele beide dertig honderd vijf wat vorig economische mogen hard
               ondanks Van
30             Nederlandse nationale Navo Rotterdamse elke deze zoveel bepaalde
               negen mijn dit laten
31             drie tien zeven derde halve volgend vorige zware speciale
               belangrijke oud
32             zwarte rode politieke vrije ex rekening tv milieu gebruik kun
49             me we wij belangrijkste echte dergelijke voormalige meeste klein
               dollar groei overheid regering gemeente rechtbank Spanje ogen
               televisie stuk leeftijd weekeinde week seizoen keer ander
50             handen hart familie bevolking Raad politiek kabinet onderwijs
               school tafel Feyenoord ploeg elftal finale zomer toekomst maand
               periode buitenland koers produktie verkoop verzoek ton rechter
               kant totaal groter mogelijkheid

Table 5.3: Classification of 1000 most frequent words in modern Dutch corpus

Table 5.3 shows 4 directly neighbouring classes (29, 30, 31, 32). At level 4 in the tree they would be merged into one class. This would make sense, as classes 29, 30 and 31 contain number words (dertig, honderd, vijf, negen, drie, tien, zeven) and related adjectives (ieder, vele, veel, enkele, beide, elke, zoveel, bepaalde, derde, halve), and all 4 classes contain some other adjectives. Classes 49 and 50 also contain some semantically related words: 'Overheid', 'regering', 'gemeente', 'Raad', 'politiek', 'kabinet' (government, government, community, council, politics, cabinet), and 'leeftijd', 'weekeinde', 'week', 'seizoen', 'zomer', 'maand', 'periode', 'toekomst' (age, weekend, week, season, summer, month, period, future).

For all three corpora, the classification trees show some useful clustering, but it is far from being usable for query expansion, because it is based on high-frequency words, which add very little content to a query and mark a lot of documents as relevant. As mentioned before, classification of low-frequency words is completely unreliable, because there is very little evidence to base a
classification on. But the low-frequency words are the very words that are useful for document retrieval. Low-frequency words, by definition, occur in only a few documents, and are often related to the topic of a document. It seems that the only way to get a more reliable classification is to use a bigger corpus.

There is, however, a big difference in automatic clustering between English and Dutch. In the 60 million word corpus used for English, there are 'only' 306,606 unique words, whereas the 30 million word corpus for modern Dutch contains 495,605 unique words. The historic Dutch corpus, containing 7 million words in total, has 373,596 unique words. In general, a larger corpus contains more unique words, so a 60 million word corpus of historic Dutch would probably contain many more unique words than the English corpus. The main reason for this is probably the compounding of words. In English, compounds are rare (like 'schoolday'), as most nouns are separated by whitespace ('shoe lace'), but in Dutch, compounding is much more common, resulting in words like 'bedrijfswagenfabriek' (lit.: company car factory) and 'nieuwjaarsgeschenken' (New Year gifts). To get enough evidence for a reliable classification, a larger lexicon requires a larger corpus.

Another difference between these two languages is the word order, which is more strict in English than in Dutch. Both languages share the Subject-Verb-Object order in basic sentences. But when adding a modifier to the beginning of the sentence, the order is retained in English, but changes in Dutch (the verb is always the second constituent of the sentence, so the subject comes after the verb). This has consequences for the number of unique bigrams in the corpus. For Dutch, a larger number of bigrams is needed to get the same reliability for the 'language model.'

The quality of the classification seems to depend on quite a number of factors:

1. Lexicon size: Each unique word needs plenty of evidence for proper classification, thus a larger lexicon needs more evidence, i.e., more text.

2. Ambiguity in a language: Words that can have different syntactic functions can supply contradictory evidence (the verb 'sail' can co-occur with words that cannot co-occur with the noun 'sail'). Languages that have many of these words are harder to model correctly.

3. Strictness of word order: Some languages allow various word orderings for a sentence. In many so-called 'free word order' languages like Polish and Russian, a rich morphology makes it possible to distinguish syntactic categories. However, for a language like Dutch, the word order may be changed, but this introduces changes in pronouns and prepositions. A Dutch translation of the English sentence 'I'm not aware of that.' could be 'Ik ben me daar niet bewust van,' or 'Ik ben me daar niet van bewust.' But other word orders are allowed, like 'Daar ben ik me niet bewust van', or even 'Daar ben ik me niet van bewust.' Nothing changes morphologically, but there are four different sentences, with exactly the same words and exactly the same meaning.4 More possible orderings need more evidence to be modelled correctly.

4 Thanks to Samson de Jager for pointing out this peculiarity in Dutch.

4. Document style: Each document is written in a certain style. Sentences in newspaper articles are often different from sentences in personal letters. This has an effect on which bigrams occur in the corpus.

5. Depth of classification: At depth 1 (the classes directly under the root), the classification is often quite reliable. At deeper levels, the number of bigrams that the words in a class share becomes increasingly small, making each further classification less reliable. In classes with only nouns (especially in Dutch, where compounding leads to a large number of low-frequency nouns), a further classification of semantic structure is not possible because of a lack of syntactic distinction (nouns rarely appear next to each other).

To aid word clustering for historic Dutch, the historic document collection could be mixed with an equal amount of modern Dutch text to reduce data sparseness. The spelling of many words has changed over time, but the most frequent words have changed very little. There is still a reasonably large overlap between the most frequent words in both corpora, so if no more historic text is available, modern text might help. For modern Dutch, syntactically annotated corpora are available, and these can be mixed with historic Dutch to estimate POS tags for historic words. If all the modern words in a class are nouns, it seems probable that the historic words in that class are nouns as well. To bridge the vocabulary gap, clustering historic and modern words with related meanings might be very useful. At least for query expansion, adding historic words to modern query words can increase recall.

5.3 Crawling footnotes

There are some digital resources available on the web. For instance, the Digitale bibliotheek voor de Nederlandse Letteren5 (DBNL) has a large collection of historic Dutch literature. Many of these texts contain footnotes of the form "1. opsnappers: feestvierders". These are direct translations of historic words to modern variants. By using a large number of these texts, a historic-to-modern dictionary can be constructed. The texts on DBNL are categorized based on the century in which they were published. There is a huge list of 17th century Dutch literature available, containing over a hundred books, and more works are added regularly. Not all books contain footnotes, and not all footnotes are direct translations. Many footnotes contain background information or references to other works. But some texts contain thousands of translations.

5 URL: www.dbnl.nl
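A minimal sketch of harvesting such direct translations, assuming the footnotes have already been reduced to plain text lines of the form shown above (the regular expression is illustrative, not the actual crawler used):

import re

# matches plain-text footnotes of the form "1. opsnappers: feestvierders"
FOOTNOTE = re.compile(r"^\s*\d+\.?\s*([^:]+):\s*(.+?)\s*$")

def extract_pairs(footnote_lines):
    """Collect (historic, modern) translation pairs from footnote lines."""
    pairs = []
    for line in footnote_lines:
        match = FOOTNOTE.match(line)
        if match:
            pairs.append((match.group(1).strip(), match.group(2).strip()))
    return pairs

print(extract_pairs(["1. opsnappers: feestvierders"]))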

5url: www.dbnl.nl

Because the books are annotated by different people, the notes don't have a consistent format. In some texts, the historic word is set in italics or boldface, in others, a special html-tag is used to mark it. Consider the next two examples, the first of which is very clear, containing a special tag to signify a word translation.

<div ID="N098"><small class="note"><a href="#T098" name="N098"><span class="notenr">&nbsp;4. </span></a>&nbsp;<span class="term">beschaemt:</span> teleurgesteld. Vgl.<span class="bible">Rom. X, 11</span>.</small></div>

<div ID="N1944"><small class="note"><a href="#T1944" name="N1944"><span class="notenr">&nbsp;9 </span></a>&nbsp;<i>bloot:</i> onbeschermd.</small></div>

<div ID="N1608"><small class="note"><a href="#T1608" name="N1608"><span class="notenr">&nbsp;1353 </span></a>&nbsp;<i>hoofdscheel:</i> hoofdschedel; <i>van:</i> door;<i>bedropen:</i> overgoten;Van ’t begin en van ’t einde van Melchisedech’s leven is onsverder niets bekend; Vondel beschouwt hem als door God zelftot priester gewijd.</small></div>

The first note has marked the historic word ('beschaemt') by tagging it with a span class called 'term'. In all of these cases, the modern translation ('teleurgesteld') directly follows the historic word, and ends with a dot or a semi-colon. The second note is less specific. The historic word ('bloot') is marked in italics, and the modern translation ('onbeschermd') again follows it and ends in a dot (or a semi-colon). The first note is easy to extract. The second note is more problematic, because the italics do not always signify a translation:

<div ID="N1728"><small class="note"><a href="#T1728" name="N1728"><span class="notenr">&nbsp;10 </span></a>&nbsp;<i>Orpheus:</i> Orpheus, de bekende zanger van de Griekse sage,die de wilde dieren bedwong door z’n snarenspel(<i>konde paren:</i> kon verenigen).</small></div>

Here, the first word in italics, 'Orpheus', is not followed by a modern translation, but by an explanation of who Orpheus was. A simple way of distinguishing between this note and the previous one is that the translation pair contains only one word after the historic, italicized word. But this doesn't work for translations containing several words:

<div ID="N1726"><small class="note"><a href="#T1726" name="N1726">

Page 78: Constructing language resources for historic document retrieval

68 CHAPTER 5. THESAURI AND DICTIONARIES

<span class="notenr">&nbsp;7 </span></a>&nbsp;<i>sloer:</i> sleur, gang, manier.</small></div>

<div ID="N1437"><small class="note"><a href="#T1437" name="N1437"><span class="notenr">&nbsp;12 </span></a>&nbsp;<i>onses Moeders:</i> van onze moeder, de aarde.</small></div>

For the historic word 'sloer', multiple modern translations are given, separated by a comma. The historic phrase 'onses Moeders' has two modern phrases as translation. How can these be distinguished from the note about Orpheus? It gets even worse. Consider the next consecutive notes:

<div class="notes-container" id="noot-1739"> <div class="note"><a href="#1739T" name="1739"><span class="notenr">4</span></a><i>gedraeghe mij tot:</i> houd mij aan.</div></div><div class="notes-container" id="noot-1740"> <div class="note"><a href="#1740T" name="1740"><span class="notenr">5</span></a><i>deze: de ene</i>. Bedoeld wordt Pieter Reael, vgl.<i>502</i>.</div></div>

The first one contains the historic phrase inside italics and the modern phrase following it directly. The second one contains both the historic word and its modern translation inside italics, and an explanation directly after it. And a few notes further down, the single word after the italics is not a modern translation, but a reference:

<div class="notes-container" id="noot-1744"><div class="note"><a href="#1744T" name="1744"><span class="notenr">14</span></a><i>de schrijver:</i> Vondel.</div></div>

All this makes it very hard to extract only the translation pairs from a note. Manual correction is not an option, since the 17th century DBNL corpus contains over 170,000 footnotes. The final list consists of approximately 110,000 translation pairs, many of which are not actual translation pairs but references, explanations or descriptions. Still, for query expansion it could be useful. If each modern translation occurs only a few times, only a few historic words or phrases are added to the query. Not all of them will be useful, but adding noise to the query might be compensated by the fact that some relevant words are added as well. By making separate dictionaries for word to word, word to phrase and phrase to phrase translations, and evaluating each of them separately, we get an indication of whether a dictionary can be useful, or contains too much noise.
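To give an impression of the kind of pattern matching involved, the following Perl sketch extracts word to word pairs from notes in the two formats shown above. It is only a rough approximation: the input file name, the regular expressions and the single-word restriction are illustrative assumptions rather than the actual extraction code, and, as discussed, such simple patterns miss multi-word translations and still let some references and explanations through.

use strict;
use warnings;

# Collect historic word -> modern word pairs from DBNL-style notes.
my %word_to_word;

open(my $fh, '<', 'dbnl_text.html') or die "cannot open input: $!";
while (my $line = <$fh>) {
    # Case 1: the historic word is marked with <span class="term">.
    if ($line =~ /<span class="term">([^<:]+):<\/span>\s*([^<.;]+)[.;]/) {
        $word_to_word{$1} = $2;
    }
    # Case 2: the historic word is in italics; accept only a single-word
    # translation directly after it, to skip explanations and references.
    elsif ($line =~ /<i>([^<:]+):<\/i>\s*(\w+)[.;]/) {
        $word_to_word{$1} = $2;
    }
}
close($fh);

# Write the result in the usual resource format: historic word, tab, modern word.
foreach my $hist (sort keys %word_to_word) {
    print "$hist\t$word_to_word{$hist}\n";
}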

The dictionaries in table 5.4 are translations from historic to modern, as extracted from the DBNL corpus. The word to phrase dictionary contains historic words as entries, and modern phrases as translations. Vice versa, the phrase to word dictionary contains historic phrases with modern single word translations.


Dictionary         number of      unique    number of
                   translations   entries   synonyms
word to word       36505          20281     1.8
word to phrase     26445          16649     1.6
phrase to word     5589           4931      1.1
phrase to phrase   42680          35127     1.2
total              111219         68384     1.6

Table 5.4: DBNL dictionary sizes

To get an indication of the usefulness of the DBNL thesaurus, a random sample of 100 entries was drawn twice, and each entry evaluated. For each of the four different parts of the total thesaurus, the 100 entries were marked as useful or useless. Repeating this process once, thus drawing 100 random entries twice, the results give us some idea about the usefulness of the thesaurus parts.

thesaurus part     useful entries   useless entries
word to word       91/88            9/12
word to phrase     72/62            28/38
phrase to word     59/55            41/44
phrase to phrase   70/68            30/32

Table 5.5: Simple evaluation of DBNL thesaurus parts: usefulness of 100 random samples

Some good examples from the word to word and word to phrase dictionaries are:

ghewracht -> bewerkt
badt -> verzocht
booswicht -> zondaar
belent -> zeer nabij
heerschapper -> heer en meester

Here are some bad examples:

galgenbergh -> Golgotha
God -> Godheid
Katten-vel -> kat
d'altaergodin -> Vesta
stuck -> op 't schaakbord
Hippomeen -> zie

The last example is a typical parsing mistake. The right hand side 'zie' (English: see) is part of a reference to something.

Furthermore, the word to word and word to phrase dictionaries were used to get an idea of the overlap between the historic words in a historic corpus and the historic words in the dictionaries. How many of the words in the hist1600 corpus (the corpus used for the RSF and RNF algorithms, see section 3.1.2), for example, have an entry in the DBNL thesaurus? And what about the corpus that was used for creating the test set? Table 5.6 gives an indication of the coverage of the thesaurus. The Braun corpus contains the 'Antwerpse Compilatae' and the 'Gelders Land- en Stadsrecht', the Mander corpus contains 'Het Schilderboeck' by Karel van Mander. Together, they make up the hist1600 corpus. This split was made because the DBNL thesaurus contains some entries extracted from the Mander corpus. The same holds for the documents from the test set corpus. The modern words in the corpora, at least the words that are found in the modern Kunlex lexicon, were first removed from the total historic lexicon (column 3). Synonyms for these words can be found in a modern Dutch thesaurus.

The coverage results can be explained by the footnote extraction. The Braun corpus does not contain any footnotes, and has the smallest coverage from the DBNL thesaurus. The Mander corpus has a larger coverage, probably because a number of entries from the DBNL thesaurus come from 'Het schilderboeck'. That the DBNL thesaurus covers an even larger part of the test set corpus is probably due to the fact that 'De werken van Vondel, Eerste deel (1605 – 1620)' is part of the corpus and contains several thousand notes with translation pairs.

Corpus            Unique words   Not in Kunlex   DBNL coverage
hist1600          47816          41156 (86%)     4315 (10.5%)
Braun             17891          14168 (79%)     1429 (10.1%)
Mander            33805          27074 (80%)     3547 (13.1%)
Test set corpus   69453          44827 (65%)     8119 (18.1%)
Selection         1600           1569 (98%)      603 (38.4%)
Test set          400            397 (99%)       152 (38.3%)

Table 5.6: Coverage statistics of corpora for DBNL thesaurus

The DBNL thesaurus covers a far larger part of the historic words in the selection and test set (see section 3.4.1). Apparently, in the process of giving modern spelling variants of historic words, there was a bias towards giving modern forms for historic words with a salient historic spelling. This bias can very well have been the same for the editors of the DBNL who made the footnotes. Also, the selection and test set do contain some modern words. Out of the 2000 words in both sets, 34 words are in the Kunlex, showing that the decision whether a word is historic or modern is not trivial.

5.3.1 HDR evaluation

As an external evaluation, the performance of the DBNL thesaurus was tested in a historic document retrieval experiment. For a description of the experiment, see section 4.3. Instead of using rewrite rules, the documents and queries were translated using the DBNL thesaurus. As the results in table 5.5 show, apart from the word to word thesaurus, the thesauri contain many nonsense entries. Therefore, only the word to word thesaurus was used. The original words in the historic documents were replaced by one of the related words from the DBNL thesaurus. For query translation, all the entries containing a query word as a translation were added to the query. The original query words were kept in the query as well. The effect of translation was compared to other standard IR techniques. Table 5.7 contains the results of translation both with and without stemming on the known-item topics.
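The query translation step itself is straightforward. The sketch below shows one way it could be done in Perl; the dictionary file name and the example query are illustrative assumptions, not the exact code used in the experiment.

use strict;
use warnings;

# Read the word to word dictionary (historic word, tab, modern word) and
# invert it, so that each modern word maps to its historic entries.
my %modern_to_historic;
open(my $fh, '<', 'dbnl_word_to_word.txt') or die "cannot open dictionary: $!";
while (<$fh>) {
    chomp;
    my ($hist, $mod) = split /\t/;
    push @{ $modern_to_historic{$mod} }, $hist;
}
close($fh);

# Expand a modern query: keep the original words and add every historic
# entry whose modern translation equals a query word.
sub expand_query {
    my @query = @_;
    my @expanded;
    foreach my $word (@query) {
        push @expanded, $word;
        push @expanded, @{ $modern_to_historic{$word} }
            if exists $modern_to_historic{$word};
    }
    return @expanded;
}

print join(' ', expand_query(qw(eigenaar verkopen huurder))), "\n";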

Method               Avg. prec.   Avg. prec.   Avg. prec.
                     D only       D+T          T only
Baseline             0.2192       0.1955       0.1568
Stemming             0.2125       0.2352       0.1749
4grams               0.2366       0.2538       0.2457
Decompounding        0.2356       0.2195       0.1795
DBNL Doc             0.1098       0.1262       0.1546
DBNL Query           0.0860       0.1324       0.1597
DBNL Doc + stem      0.0902       0.1250       0.1321
DBNL Query + stem    0.1389       0.1730       0.1847

Table 5.7: HDR results for known-item topics using DBNL thesaurus

Translation of the descriptions is disastrous for retrieval performance, although stemming compensates a little. For the titles, translation works slightly better. Whereas the baseline shows a decline in performance when adding titles and removing descriptions, query translation shows the opposite behaviour. One of the reasons might be that the number of words in the query is fairly small when using only titles, compared to the descriptions.

Table 5.8 displays the average number of words in the descriptions and titles for both topic sets. Method None represents the original queries, before translation. The DBNL thesaurus more than doubles the number of words in the descriptions, but as the description also contains high frequency words, and the thesaurus also contains translations for high frequency words, adding so many translations apparently does more harm than good. Only modern stop words (the high frequency words that add little content to the query) are removed from the query, but historic translations are added before this happens. The titles don't contain any stop words, thus through translation none are added. The titles contain mostly low frequency content words. Adding historic synonyms of these words, and stemming all the query words afterwards, improves performance.

Look at topic 7 for a good example:

• Description: Kan een eigenaar van onroerend goed zijn verhuurde pand zomaar verkopen, of heeft hij nog verplichtingen ten opzichte van de huurder?

• Title: eigenaar onroerend goed verhuurde pand verkopen verplichtingen huurder

This is the same query after adding translations:

• Description: kan ken koon mach magh een een eigenaar van onroerend goed aertigh welzijn binnen cleven sinnen verstrekken verhuurde pand paan panckt zomaar verkopen veylen of heeft hij deselve sulcke versoecker nog nach verplichtingen ten opzichte van de huurder huurling

• Title: eigenaar onroerend goed aertigh wel verhuurde pand paan panckt verkopen veylen verplichtingen huurder huurling

Not all translations added to the title are good, but most of them are related to the topic. As for the description, many totally unrelated historic words are added that will not be recognized as stop words (mach, magh, koon, sulcke).

topics        method   words in title   words in descr.
expert        None     3.52             11.05
expert        dbnl     5.52             18.43
expert        phon     4.29             16.38
expert        rules    4.14             14.05
known item    None     4.36             11.68
known item    dbnl     8.60             24.48
known item    phon     7.44             19.32
known item    rules    6.68             16.20

Table 5.8: Average number of words in the query using query translation methods

Because the combination of query translation and stemming works better than query translation only, it would be interesting to combine query translation with the other monolingual techniques. Another approach could be to combine the scores of retrieval runs. By giving the ranked list of each retrieval approach a specific weight, the final relevance score of a document is the weighted sum of the relevance scores of each approach. Documents that are considered relevant by two different approaches, say, retrieval using 4-grams and retrieval using the DBNL thesaurus, are, on average, ranked higher on the combined list. The reasoning behind this is that if several approaches retrieve the same document, there is a fair chance that that document is actually relevant.
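As a sketch of such a combination (the weights are free parameters here; no particular values are implied), the combined score of a document d over n retrieval runs can be written as:

    score_combined(d) = \sum_{i=1}^{n} w_i \cdot score_i(d)

where score_i(d) is the relevance score that run i assigns to d, and w_i is the weight given to that run.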


Method               D only   D+T      T only
Baseline             0.3396   0.4289   0.4967
Stemming             0.3187   0.3778   0.4206
4grams               0.3037   0.3465   0.3821
Decompounding        0.3307   0.4228   0.4900
DBNL Doc             0.2246   0.2326   0.3442
DBNL Query           0.2749   0.3696   0.4632
DBNL Doc + stem      0.2095   0.2574   0.2917
DBNL Query + stem    0.2705   0.3316   0.3848

Table 5.9: HDR results for expert topics using DBNL thesaurus


As was mentioned in section 4.3, the HDR results of the advanced techniques for the expert topics show no improvement over the baseline. The same holds for document and query translation using the DBNL thesaurus. The expert queries contain specific 17th century words from the documents, making query expansion redundant for a large part. It is still interesting to see that, consistent with the known-item results, query translation works better than document translation, and stemming afterwards has a negative effect. Although the monolingual methods perform better on the descriptions, translation of the titles seems to work better than stemming or 4-gram matching. And again, adding translations to the descriptions decreases performance significantly.

5.4 Phonetic transcriptions

Although historic words are often spelled differently from their modern counterparts, in many cases, their pronunciation is the same. This fact can be effectively used to construct a dictionary of equal sounding word pairs. For Dutch, a few algorithms are available to convert strings of letters to strings of phonemes (see section 3.2.1). The algorithm for building a dictionary using phonetic transcriptions is very simple. First, convert the historic lexicon lex_hist into a historic pronunciation dictionary Pdict_hist, and the modern lexicon lex_mod into Pdict_mod. Next, for all historic words w_hist in Pdict_hist, look up their phonetic transcription PT(w_hist) in the modern dictionary Pdict_mod. Pair w_hist with all words w_mod for which the phonetic transcription PT(w_mod) is equal to PT(w_hist), and add each pair to the final dictionary.

This approach can also be combined with the rewrite rules to improve upon the final thesaurus. After applying rewrite rules to the historic lexicon, the rewritten words will (probably) be more similar in spelling to the corresponding modern word. Through rewriting, the pronunciation of a word may change. Since letter-to-phoneme algorithms are based on modern pronunciation rules, the phonetic transcription of the historic word klaeghen will be different from the transcription of its corresponding modern word klagen, since the modern pronunciation of ae is different from the modern pronunciation of a (they may have been the same in 17th century Dutch). Thus, if after rewriting, klaeghen has become klaghen, the phonetic transcription will be equal to that of klagen. Converting the lexicon to a pronunciation dictionary again, and repeating the construction procedure, will result in new pairings.
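A minimal Perl sketch of this construction procedure is given below. It assumes that the two pronunciation dictionaries have already been produced by a grapheme to phoneme tool and are passed in as hashes from word to transcription; the subroutine name and data structures are illustrative.

use strict;
use warnings;

# Build a dictionary of equal sounding word pairs from two pronunciation
# dictionaries (hash refs: word => phonetic transcription).
sub build_phonetic_dictionary {
    my ($pdict_hist, $pdict_mod) = @_;

    # Index the modern lexicon by transcription.
    my %by_transcription;
    while (my ($w_mod, $pt) = each %$pdict_mod) {
        push @{ $by_transcription{$pt} }, $w_mod;
    }

    # Pair every historic word with all modern words that sound the same.
    my %dictionary;
    while (my ($w_hist, $pt) = each %$pdict_hist) {
        next unless exists $by_transcription{$pt};
        $dictionary{$w_hist} = [ @{ $by_transcription{$pt} } ];
    }
    return \%dictionary;
}

Rerunning the same subroutine on the transcriptions of the rewritten word forms, as described above, yields the additional pairings.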

Of course, words that are pronounced the same are not necessarily the same words (consider eight and ate). This is where the edit distance clearly helps in distinguishing between spelling variants and homophones (if the homophones are orthographically dissimilar enough).6

Because the phonetic transcriptions contain some errors, and because the pronunciation of some vowel sequences has changed over time, the phonetic transcriptions before and after rewriting were evaluated by randomly selecting and checking 100 entries for correctness. The whole experiment was done twice to get a more reliable indication. If the numbers of correct and incorrect transcriptions show a big difference between the first and the second time, a bigger sample, or more iterations, are needed to get a better indication. If the numbers vary only slightly, their average gives a fair indication of the total number of correct and incorrect transcriptions. The results in Table 5.10 show a significant improvement in the quality of the transcriptions. Before rewriting, the phonetic dictionary contains 4592 entries, and 16% of all transcriptions are different from their real pronunciation. Only 2% of all the 11,592 entries after rewriting have incorrect phonetic transcriptions. The phonetic dictionary after rewriting (using the combined rule set RNF+RSF+PSS) contains the original historic words as entries, but the modern words were matched with the phonetic transcriptions of the rewritten forms of the historic words. The word aengaende was first rewritten to aangaande. Then, aengaende is matched with a modern word that has the same phonetic transcription as aangaande.

Not only does rewriting affect the number of historic words that are phonetically similar to their modern forms, it also decreases the number of wrong phonetic matches. The historic ae sequence is no longer matched with the modern ee sequence, but with aa. The same goes for the historic sequences ey and uy, which were matched with the sequence ie in modern words before rewriting, and are respectively matched with ei and ui afterwards.

5.4.1 HDR and phonetic transcriptions

As a way of evaluating the effectiveness of mapping words using phonetic transcriptions, the same HDR experiment as described in section 4.3 and the previous section was conducted.

6If two homophones are orthographically similar, a spelling variant of one of them could just as easily be a spelling variant of the other.


Phonetic dictionary   total entries   incorrect entries of 100   perc. incorrect
normal                4592            15 / 17                    16
rewritten             11592           0 / 4                      2

Table 5.10: Incorrect transcriptions in 2 samples of 100 randomly selected entries, before and after rewriting

Method                      Avg. prec.   Avg. prec.   Avg. prec.
                            D only       D+T          T only
Baseline                    0.2192       0.1955       0.1568
Stemming                    0.2125       0.2352       0.1749
4grams                      0.2366       0.2538       0.2457
Decompounding               0.2356       0.2195       0.1795
Phonetic Doc                0.2642       0.2901       0.2609
Phonetic Query              0.2458       0.2511       0.2474
Phonetic Doc + stemming     0.2645       0.3054       0.2502
Phonetic Query + stemming   0.1911       0.2153       0.1983

Table 5.11: HDR results for known-item topics using phonetic transcriptions

Instead of using rewrite rules to translate queries or documents, the phonetic transcription dictionary was used. The results are shown in Table 5.11. For this experiment, the stop word list was extended with phonetic variants of stop words taken from the phonetic dictionary.

The results of translating the queries are comparable to the use of 4-grams in the monolingual approach, and, as with rewriting (see Table 4.6 in the previous chapter), stemming the translated queries has a negative effect, for the same reason. The historic words often have historic suffixes that are unaffected by the stemmer, thus conflation of morphological variants is minimal. Document translation shows the best results for all different queries (D only, D+T and T only). But now, the effect of stemming is minimal.

The number of phonetically equal words added to the descriptions and titles is smaller than the number of related words added by the DBNL thesaurus. Although the phonetic dictionary adds spelling variants of modern stop words to the query, the list of modern stop words was extended with their historic phonetic counterparts. Therefore, the performance of query translation for the descriptions is comparable to query translation for the titles. Combining them does not affect average precision much.


Method                 D only   D+T      T only
Baseline               0.3396   0.4289   0.4967
Stemming               0.3187   0.3778   0.4206
4grams                 0.3037   0.3465   0.3821
Decompounding          0.3307   0.4228   0.4900
Phonetic Doc           0.2719   0.3213   0.4178
Phonetic Query         0.3037   0.4137   0.4920
Phon. Doc + stem       0.2581   0.3063   0.3373
Phon. Query + stem     0.2913   0.3638   0.4176

Table 5.12: HDR results for expert topics using phonetic transcriptions

Another interesting observation is that combining descriptions and titles leads to a significant increase in precision for document translation. The original queries of topic 3:

• Description: Hoe wordt de hypotheekrente afgehandeld bij de verkoop van een pand?

• Title: hypotheekrente afgehandeld verkoop pand

Adding phonetic transcriptions results in:

• Description: hoe wordt wort de hypotheekrente afgehandeld bij bei bey by de verkoop vercoop vercoope vercope van vaen een ehen pand pandt

• Title: hypotheekrente afgehandeld verkoop vercoop vercoope vercope pand pandt

In both the title and the description, 3 spelling variants for verkoop (sale) and 1 for pand (house) are added. The spelling variants of the stop words bij, van and een were removed because of the extended stop word list.

5.5 Edit distance

Similarity between words can also be measured using the edit distance algorithm. At each place in a word, a character can be deleted, inserted or substituted (the same as delete + insert). The edit distance of two words is equal to the cost of changing one word into the other. Deleting and inserting cost 1 step, substitution costs 2 (equal to 1 delete and 1 insert), unless the character to be substituted is equal to the substitute, in which case the cost is 0. The more similar two words are, the lower the cost. This technique is often used for spell checking. The algorithm can be adjusted to account for the distance between two keys on a keyboard. Accidentally hitting a key next to the intended one occurs more often than hitting one on the other end of the keyboard. A bigger distance between two characters on a keyboard results in a higher substitution cost.


Similar Characters
b, p
d, t
f, v
s, z
y, i
y, ie
y, ij
g, ch
c, k
c, s

Table 5.13: Phonetically similar characters


But the similarity between historic words and their modern variants is not based on the distance between keys on a keyboard, but on their similarity in pronunciation. Thus, the algorithm can instead be adjusted to take into account the similarity of pronunciation. A c can be pronounced as a k or as an s. Thus, substituting a c for an s should be lower in cost than substituting a c for a t. Adjusting the cost of substitutions has to be done carefully. The cost of substituting c for t should not be increased unless the cost of deleting and inserting characters is increased as well. Otherwise, the algorithm will prefer deleting + inserting over substituting, resulting in the same cost as before the adjustment. Instead, by lowering the cost for substituting phonetically similar characters, the algorithm will prefer substituting c for s over deleting c and then inserting s.

5.5.1 The phonetic edit distance algorithm

The Phonetic Edit Distance (PED) algorithm is an adjusted version of the basic edit distance algorithm. In the standard version, every substitution adds 2 to the total edit distance unless the two characters under consideration are equal. The PED version differentiates between substituting two phonetically similar characters and two phonetically dissimilar characters. If an 's' is substituted for a 'z', the edit distance is increased by 0.5, but if an 's' is substituted for a 'j' the edit distance is increased by 2. The characters listed in Table 5.13 are considered phonetically similar.

The edit distance is increased by 2 when the first character of the combinations 'ie', 'ij' or 'ch' is substituted for the phonetic equivalent. The PED algorithm then decreases the edit distance by 1.5 if the second character of the character combinations 'ie', 'ij' or 'ch' is substituted for the phonetic equivalent of the character combination. In total, after two substitutions, 0.5 is added to the edit distance.
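The following Perl sketch implements this idea as a weighted edit distance. Only the single-character pairs from Table 5.13 are handled; the special treatment of the combinations 'ie', 'ij' and 'ch' described above is left out, so the costs are a simplification of the actual PED.pl package.

use strict;
use warnings;
use List::Util qw(min);

# Pairs of phonetically similar characters (both directions).
my %similar = map { $_ => 1 }
    qw(b:p p:b d:t t:d f:v v:f s:z z:s y:i i:y c:k k:c c:s s:c);

sub subst_cost {
    my ($c1, $c2) = @_;
    return 0   if $c1 eq $c2;            # identical characters
    return 0.5 if $similar{"$c1:$c2"};   # phonetically similar
    return 2;                            # dissimilar: same as delete + insert
}

# Dynamic-programming edit distance with the adjusted substitution cost.
sub ped {
    my ($s, $t) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my @d;
    $d[$_][0] = $_ for 0 .. @s;
    $d[0][$_] = $_ for 0 .. @t;
    for my $i (1 .. @s) {
        for my $j (1 .. @t) {
            $d[$i][$j] = min(
                $d[$i - 1][$j] + 1,                                      # deletion
                $d[$i][$j - 1] + 1,                                      # insertion
                $d[$i - 1][$j - 1] + subst_cost($s[$i - 1], $t[$j - 1]),
            );
        }
    }
    return $d[@s][@t];
}

print ped('clacht', 'klacht'), "\n";    # 0.5: c and k are similar
print ped('claghen', 'klagen'), "\n";   # 1.5: one similar substitution, one deletion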

The main problem with the edit distance algorithm is that it is a costly operation. Since the historic and modern corpora easily consist of several thousand, or even tens of thousands of words, finding the closest historic match of a modern word requires a huge amount of computation [25]. A solution to this problem is to use a coarse grained selection method like n-gram matching first, which can reduce the number of candidates under consideration. The word-retrieval experiment (section 4.2) showed that candidate selection using n-grams works well, especially after applying rewrite rules. With an n-gram size of 2, over 90% of the historic variants from the test set were found in the top 20 candidates. However, most of the words in the list of candidates are not historical variants of the modern word. As the results in Table 4.5 show, only half of the variants are found at rank 1, and for most modern words, there are only 2 or 3 spelling variants found in the entire historic corpus. This is where the fine grained selection of edit distance can be put to good use. Using the phonetic version of the edit distance algorithm, the 20 candidates can be reranked according to their phonetic similarity. The historical variants should be ranked higher than all the other words in the list of candidates. Of course, if there are multiple historic variants of the modern word, the historical variant from the test set need not be at rank 1. There might be another spelling variant that is phonetically closer to the modern word. Thus, the precision @1 will not be much higher, but the precision @5 should increase (there are very few modern words that have more than 5 historical spelling variants in the corpus).

The list of phonetically similar characters could probably be extended, but this is already an improvement over the standard edit distance algorithm, as can be seen in Table 5.14. ED stands for the standard edit distance algorithm, PED is the phonetic version, and RR means the rewritten forms of the original historic words were used for n-gram retrieval, using the combined RNF+RSF+PSS rule set. Recall scores @10 are roughly the same for ED and PED, but at recall @5 the differences become more significant.

The recall @5 is much closer to the recall @20 after reranking, in Table 5.14. Actually, the increase in precision makes it interesting to retrieve more than 20 word-forms. The problem with PED is that it is computationally expensive to find the closest match in an entire lexicon. But for 20 or even 100 words this is no problem. The increase in recall when retrieving 100 words can be transformed into an increase in recall @5 through reranking using PED. In the n-gram retrieval part, once the ranked list of candidates is calculated, selecting the first 100 words takes negligibly more time than selecting the first 20 words.
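Put together, the two stages could look like the sketch below. The bigram overlap score (a Dice-style measure), the cut-off of 20 candidates and the lexicon handling are illustrative assumptions, and the ped subroutine from the previous sketch is assumed to be in scope.

use strict;
use warnings;

# Character bigrams of a word.
sub bigrams {
    my ($word) = @_;
    return map { substr($word, $_, 2) } 0 .. length($word) - 2;
}

# Dice-style bigram overlap between two words.
sub bigram_overlap {
    my ($w1, $w2) = @_;
    return 0 if length($w1) < 2 or length($w2) < 2;
    my %in_w1 = map { $_ => 1 } bigrams($w1);
    my $shared = grep { $in_w1{$_} } bigrams($w2);
    return 2 * $shared / (length($w1) + length($w2) - 2);
}

# Stage 1: coarse selection of the 20 best candidates by bigram overlap.
# Stage 2: fine-grained reranking of those candidates with ped().
sub find_variants {
    my ($modern, @historic_lexicon) = @_;
    my @ranked = sort { bigram_overlap($modern, $b) <=> bigram_overlap($modern, $a) }
                 @historic_lexicon;
    my @candidates = grep { defined } @ranked[0 .. 19];
    return sort { ped($modern, $a) <=> ped($modern, $b) } @candidates;
}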

5.6 Conclusion

The DBNL thesaurus can be used effectively in query expansion, if the stop word list is extended with historic variants as was done for the phonetic dictionary, and with a better note extraction algorithm, the word to phrase and phrase to word translations might become useful as well.


N-gram size   Preprocess   recall @20   @10     @5      @1
2             -            0.853        0.783   0.691   0.322
2             ED           0.853        0.830   0.720   0.322
2             PED          0.853        0.843   0.802   0.436
2             RR           0.925        0.889   0.832   0.488
2             RR+ED        0.925        0.908   0.850   0.519
2             RR+PED       0.925        0.920   0.892   0.563
3             -            0.798        0.706   0.600   0.277
3             ED           0.798        0.782   0.707   0.326
3             PED          0.798        0.791   0.768   0.430
3             RR           0.908        0.867   0.815   0.493
3             RR+ED        0.908        0.894   0.851   0.523
3             RR+PED       0.908        0.905   0.881   0.565

Table 5.14: Results of historic word-form retrieval using PED reranking

The downside is that the construction of this thesaurus depends on manually added word translation pairs. Automatically extracting them correctly is difficult, and the only historic words for which a translation is given are the ones that are deemed important by the DBNL editors. The modern translations of the words that they find important enough to translate might not be the words that are posed as query words by the user.

By combining historic Dutch documents with modern Dutch documents, and more importantly, by increasing the corpus size, the use of word clustering algorithms can become an important method for bridging the vocabulary gap. As it stands, the vocabulary gap remains the most difficult bottleneck of the two, as the spelling gap is partly bridged by the rewrite rules from the previous chapter and the phonetic dictionary and PED reranking procedure in this chapter.

The phonetic variants dictionary is effective, but only after rewriting. The phonetic transcriptions of the original historic words are not always correct, thus by replacing these transcriptions with the transcriptions of the rewritten words, many historic words are no longer paired with the wrong modern word. The advantage of matching historic and modern words with phonetic transcriptions over using rewrite rules is that non-typical historic character sequences (like 'cl' in clacht) are not rewritten incorrectly (clausule should not be rewritten to klausule). The phonetic dictionary only replaces whole words, not parts of words. Thus, clacht will be replaced with klacht, but the historic word clausule is matched with the modern word clausule, and is thus retained in the lexicon.

The performance of word-retrieval can be greatly improved by reranking the candidate list using the Phonetic Edit Distance algorithm. The number of candidates can then be reduced to 3 or 5 words, and the remaining list can be used for query expansion. It has yet to be tested in a HDR experiment though.


Chapter 6

Concluding

We've seen, in the previous chapters, that language resources can be constructed from nothing but plain text. They can be used effectively for HDR, and might also be used as stand alone resources to improve readability. This chapter concludes this research, and will try to answer the questions from the first chapter. Some future directions are given as well.

6.1 Language resources for historic Dutch

The first chapter posed some research questions. They are repeated here and an attempt at answering them is made.

• Can resources be constructed automatically? The methods described in chapters 3, 4 and 5 have shown that language resources for Dutch historic document retrieval can be constructed automatically using nothing but plain historic Dutch text. The HDR experiments and the word-form retrieval experiments have shown that these language resources can be used effectively to find historic Dutch word-forms of modern Dutch words, and also significantly improve HDR performance.

• Can (automatic) methods be used to solve the spelling problem?

– Can rewrite rules be generated automatically using corpus statistics and overlap between historic and modern variants of a language? The generation, selection and application of rewrite rules can be done automatically, and with good results. The RNF and RSF algorithms work well in finding modern spelling variants of typical historic character sequences. Both methods use plain text corpora and exploit the overlap between historic and modern Dutch.

– Are rewrite rules a good way of solving the spelling bottleneck? As the results of combining and iterating the methods have shown, after rewriting the most important historic character sequences, they no longer produce any useful rules. However, the typically historic sequences caused most of the problems for the phonetic transcriptions. Once they have been modernized, the grapheme to phoneme converter produces much better results. So, in answer to the question, they are a good first step in solving the spelling bottleneck for 17th century Dutch.

– Can historic Dutch be treated as a corrupted version of modern Dutch, and thus be corrected using spelling correction techniques? Using a spell checker shows acceptable results, but the main problem is that a modern word for each historic word must be selected manually from a list of suggestions. If the correct word is not in the list of alternatives, further manual correction is needed. A language independent and automatic solution is the use of n-gram matching to retrieve similar word-forms. This produces a list of historic spelling variants for modern words. It has yet to be tested if the inverse procedure, finding similar modern word-forms for historic words, works as well. Using n-gram matching as a coarse grained search, and edit distance, or its phonetic variant, as a fine grained search, the list of candidates can be reduced further.

• What are the options for automatically constructing a thesaurus for historic languages? The vocabulary gap is still a big problem. Most of the resources and methods described are solutions to the spelling problem. Only the DBNL thesaurus and the word co-occurrence classifications are aimed at the vocabulary gap, and neither is a good solution at this moment. The DBNL thesaurus contains many nonsense entries, and only contains manually constructed translation pairs. Extending it to cover new words depends on the knowledge and effort of experts. As for the co-occurrence thesaurus, its application in HDR seems a long way off. To get better classifications, much more text is needed, and even then, the semantic distinctions are probably too coarse grained to make them useful for query expansion.

• Is the framework for constructing resources a language independent (general) approach? The same word-form retrieval experiment described in [18] works for historic Dutch documents. This supports the claim by Robertson and Willett that their methods are general, language and period independent. The word-form retrieval method uses only n-gram information, which is language independent. The rewrite rule generation methods RNF and RSF can be added to the list of language independent techniques. Even without using a manually constructed selection set, using the MM selection criterium, which is a language independent method as well, the rules selected can help modernize historic spelling (at least for historic Dutch). Further improvements can be made by reranking the candidates using the PED algorithm, which can increase the precision at a certain level, or alternatively, increase recall at lower levels. The PED algorithm is language dependent; the characters are phonetically similar in Dutch, but not necessarily in all languages. Although the edit distance algorithm is less effective, it is language independent.



The other resources, the PSS algorithm, the phonetic thesaurus and the DBNL thesaurus, are specific to Dutch. The PSS algorithm and the phonetic thesaurus use a grapheme to phoneme conversion tool that must be designed specifically for each language. The DBNL thesaurus consists of manually constructed word translation pairs.

• Can HDR benefit from these resources? The experiments described in [1] show that HDR can gain from several techniques, some treating HDR as a monolingual approach, others, including the techniques and language resources for historic Dutch, treating HDR as a CLIR approach. The retrieval results show that rewriting the historic Dutch documents to a more modern form of Dutch is a very effective way to improve performance. After rewriting, the gap between 17th century Dutch and modern Dutch has become smaller. The monolingual approach of stemming document and query words is much more effective after the documents are translated.

6.2 Future research

Resources for 17th century Dutch can help HDR, but there is still much that can be improved upon. Since the problems for 17th century Dutch have been split into two main issues throughout this research, directions for future work will follow these two paths.

6.2.1 The spelling gap

It seems that the spelling bottleneck is not the main problem anymore, although there are still some techniques that could be improved, like the PED algorithm and the phonetic transcription tool.

The phonetic transcription tool from Nextens is designed for modern Dutch, with its many loan words from English, French, German and other languages. It may be clear that, although the overlap between historic and modern Dutch is in pronunciation, there are some differences in pronunciation as well. Taking this into account, the rules for transcribing a sequence of characters into a sequence of phonemes can be adjusted for 17th century Dutch. The pronunciation of ae is one of the main problems, but phenomena like double vowels and double consonants also form a major hurdle in matching historic and modern words. Making specific rules for their transcription will probably solve most of these problems.

The PED algorithm can be adjusted in two main ways. First off, the list of phonetically similar characters can be extended, and maybe improved. For instance, the characters 'b' and 'p' are pronounced similarly in certain contexts, like the end of a word (both are pronounced as a 'p' in Dutch). But at the beginning of a word, they sound different. The algorithm could be changed to use context information when judging the similarity of pronunciation. Second, the current cost function might not be optimal. Right now, the substitution of similar characters costs less than deleting or inserting a character. Different cost functions can be tested. It would be interesting to see how well this algorithm works on historic variants of other languages. Maybe the cost function should depend on the specific language for which it is used. Another approach would be to use the normal edit distance algorithm on the phonetic transcriptions of words.

Also, making a spelling variations dictionary is not trivial, and it has not been tested either. If the number of spelling variations is unknown, determining which candidates are actual spelling variants, and which are not, might be a difficult problem.

As far as the rewrite rules are concerned, the effect of a rule set on document collections from different periods can be investigated further. The results in Table 4.8 show that the generated rules still work for documents written slightly earlier or slightly later than the documents that were used to generate the rules. However, if the difference in age gets larger (between the documents from which the rules are generated and the documents on which the rules are applied), the performance of the rules will probably decrease. For documents in Middelnederlands1 the differences with modern Dutch are far bigger, not only in spelling but also in pronunciation. The gap might even be too large to be bridged by rewrite rules. For more recent documents, the gap is so small that rewrite rules based on typical historic character sequences are not effective any more, because there are almost no character sequences that are typical of the historic documents. After the introduction of official spelling rules, the differences between historic Dutch and contemporary Dutch are very small, making the RSF and RNF algorithms redundant. It would be interesting to see the difference in performance on a document collection from a specific period between rules generated from documents dating from that same period and rules generated from documents from another period. For other languages the results can be completely different, but spelling often changes gradually,2 so these effects should be very similar in other languages.

6.2.2 The vocabulary gap

As mentioned earlier, the hardest problem of the two is the vocabulary gap. The resources constructed to bridge this gap are far from being usable. The DBNL thesaurus contains too much noise, and the co-occurrence thesaurus would too, if low frequency words were classified as well.

The quality of the DBNL thesaurus can be improved by using a better extraction algorithm. Apart from that, the list of 17th century books at the DBNL website is expanded regularly. These new entries also contain notes and translation pairs, so the thesaurus could be updated with new entries as well.

1The Dutch language between 1200 and 1500.
2Except, possibly, for the introduction of official spelling rules, which can have a significant effect on spelling.



The construction of a historic synonym thesaurus using mutual information seems infeasible at this moment. An enormous amount of text would be required, and even then, the clusters will still show more syntactic structure than semantic structure. Large clusters of nouns are almost impossible to split into semantically related subclusters if no more than plain text corpora are available. Once syntactically annotated 17th century Dutch corpora are available, classification based on bigram frequencies might become useful to cluster synonyms. For HDR purposes, it would be interesting to see the effect of mixing historic and modern Dutch corpora. If the historic and modern words in a cluster are semantically related, the historic words could be added to modern query words from the same cluster.

Finally, if the co-occurrence based thesaurus improves in quality, it could be combined with the DBNL thesaurus. The DBNL thesaurus could be used to test the quality of the co-occurrence thesaurus (if it is based on a mix of historic and modern Dutch). The historic word and its modern translation should be in the same cluster, or at least, close to each other in the classification tree.

As it stands, the attempts at bridging the vocabulary gap have led to little more than plans for building a real bridge. The bridge over the spelling gap, although still a bit shaky, seems to have reached the other side. Language resources are now available for historic Dutch, most of them automatically generated, and possibly useful for other languages as well.


Bibliography

[1] Adriaans, F. (2005). Historic Document Retrieval: Exploring strategies for 17th century Dutch

[2] Braun, L. (2002). Information Retrieval from Dutch Historical Corpora

[3] Brown, P.F.; Della Pietra, V.J.; deSouza, P.V.; Lai, J.C.; Mercer, R.L. (1992). Class-based n-gram models of natural language in Computational Linguistics, volume 18, number 4, pp. 467-479

[4] Crouch, C.J., Yang, B. (1992). Experiments in automatic statistical thesaurus construction

[5] Dagan, I.; Lee, L.; Pereira, F.C.N. (1998). Similarity-based models of word cooccurrence probabilities in Machine Learning, volume 34, number 1-3

[6] Hall, P.A.V., Dowling, G.R. (1980). Approximate string matching in Computing Surveys, Vol 12, No. 4, December 1980

[7] Hollink, V.; Kamps, J.; Monz, C.; de Rijke, M. (2004). Monolingual document retrieval for European languages in Information Retrieval 7, pp. 33-52

[8] Hull, D.A. (1998). Stemming algorithms: A case study for detailed evaluation in Journal of the American Society for Information Science, volume 47, issue 1

[9] Jing, Y.; Croft, W.B. (1994). An association thesaurus for information retrieval in Proceedings of RIAO, pp. 146-160

[10] Kamps, J.; Fissaha Adafre, S.; de Rijke, M. (2005). Effective Translation, tokenization and combination for Cross-lingual Retrieval

[11] Kamps, J.; Monz, C.; de Rijke, M.; Sigurbjornsson, B. (2004). Language-dependent and language-independent approaches to Cross-Lingual Text Retrieval in Comparative Evaluation of Multilingual Information Access Systems, CLEF 2003, volume 3237 of Lecture Notes in Computer Science, pages 152-165. Springer, 2004.


[12] Kraaij, W. & Pohlmann, R. (1994). Porter's stemming algorithm for Dutch

[13] Lam, W., Huang, R., Cheung, P.-S. (2004). Learning phonetic similarity for matching named entity translations and mining new translations in Proceedings of the 27th annual international conference on Research and development in information retrieval, pp. 289-296

[14] Li, H. (2001). Word clustering and disambiguation based on co-occurrence data in Natural Language Engineering 8(1), pp. 25-42

[15] Lin, D. (1998). Automatic retrieval and clustering of similar words in Proceedings of COLING/ACL-98, pp. 768-774

[16] McMahon, J.G.; Smith, F.J. (1996). Improving statistical language model performance with automatically generated word hierarchies in Computational Linguistics, volume 22, number 2

[17] McNamee, P.; Mayfield, J. (2004). Character n-gram tokenization for European language text retrieval in Information Retrieval, 7, 2004, pp. 73-97

[18] Robertson, A.M.; Willett, P. (1992). Searching for Historical Word-Forms in a Database of 17th-century English Text Using Spelling-Correction Methods

[19] Salton, G.; Yang, C.S.; Yu, C.T. (1975). A theory of term importance in automatic text analysis in Journal of the American Society for Information Science

[20] Salton, G. (1986). Another look at automatic text-retrieval systems in Communications of the ACM, volume 29, number 7

[21] Tiedemann, J. (1999). Word alignment step by step in Proceedings of the 12th Nordic Conference on Computational Linguistics, pp. 216-227

[22] Tiedemann, J. (2004). Word to word alignment strategies in Proceedings of the 20th International Conference on Computational Linguistics

[23] van der Horst, J., Marschall, F. (1989). Korte geschiedenis van de Nederlandse taal. Sdu Uitgevers, Den Haag

[24] Wagner, R.A.; Fischer, M.J. (1974). The string-to-string correction problem in Journal of the ACM, Vol. 21, number 1, pp. 168-173

[25] Zobel, J.; Dart, P. (1995). Finding approximate matches in large lexicons in Software-Practice and Experience, Vol 25(3), March 1995, pp. 331-345

[26] Zobel, J.; Dart, P. (1996). Phonetic string matching: lessons from information retrieval in Proceedings of the 19th International Conference on Research and Development in Information Retrieval


Appendix A - Resource descriptions

Each of the resources and the methods to construct them are described in more detail here. Each section covers a resource and its associated algorithms.

Appendix A1 - Phonetic dictionary

The phonetic dictionary (section 5.4) is a plain text file, each line containing a unique historic word and a phonetically equivalent modern word. The modern words are not unique, as a number of historic spelling variants are translated to the same modern form. For the historic words, the phonetic transcriptions of their rewritten forms are used to match them with the phonetic transcriptions of modern words. This is done to solve the biggest problems with the change in pronunciation (the sequence ae in particular). In total, there are 11,592 entries.

This example shows the format (historic word, tab, modern word) of the dictionary:

aengeclaeght	aangeklaagd
aengecomen	aangekomen
aengedaen	aangedaan
aengedraegen	aangedragen

Appendix A2 - DBNL dictionary

The DBNL dictionary (section 5.3) is also a plain text file, each line containing a dictionary entry and its translation. The entries and their translations can be single words or phrases. As Table 5.5 showed, the word to word entries are by far the most useful. Some statistics are repeated here:

Some entries have multiple translations, therefore the last column shows the average number of synonyms for each entry. The format of the DBNL dictionary is equal to the phonetic dictionary format (historic word, tab, modern word):

begosten	aanvingen
begote	overgoten
begoten	bespoeld
begraeut	afgesnauwd


Dictionary         number of      unique    number of
                   translations   entries   synonyms
word to word       36,505         20,281    1.8
word to phrase     26,445         16,649    1.6
word to either     62,950         36,930    1.7
phrase to word     5,589          4,931     1.1
phrase to phrase   42,680         35,127    1.2
phrase to either   48,269         40,058    1.2
total              111,219        68,384    1.6

Table 1: DBNL dictionary sizes


Appendix A3 - Rewrite rule sets

The rewrite rules are generated from corpus statistics and phonetic information (section 3.2). The three rewrite rule generation algorithms PSS, RSF and RNF are explained in sections 3.2.1, 3.2.3 and 3.2.5. The rules can be applied to the lexicon of the document collection to obtain a dictionary of spelling modernization. The rewritten forms are not necessarily existing modern words, because the word can still be a historic word with a modern spelling (see the examples below).

gerichtschrijversampt	gerichtschrijverzambt
gerichtschrijverseydt	gerichtschrijverzeid
gerichtscosten	gerichtskosten
gerichtsdach	gerichtsdag
gerichtsdaege	gerichtsdage

The rules generated by PSS and RSF are different from the rules generated by RNF, because of the vowel/consonant restrictions. The historic antecedent of these rules consists of a historic sequence and a context restriction. For instance, a historic vowel sequence should match a historic word if the vowel sequence is surrounded by consonants. The word vaek matches the vowel sequence ae, while the word zwaeyen doesn't, because its full vowel sequence is aeye. To make sure that the rule ae doesn't match zwaeyen, the antecedent is extended with context wildcards:

[bcdfghjklmnpqrstvwxz]ae# → a    [aeiouY]lcx[aeiouY]

The antecedent part of the rule is actually a regular expression. The bracketed consonant wildcard [bcdfghjklmnpqrstvwxz] indicates that the character sequence ae must be preceded by one of these consonants. The word boundary character # indicates that the ae sequence must be at the end of the word.3

3In Perl, this word boundary character can be replaced by a dollar sign ($), which matches the preceding regular expression only at the end of the string.


The uppercase Y in the second rule above is used as a replacement for the Dutch diphthong ij, because the j would otherwise be recognized as a consonant. Therefore, all occurrences of ij in words and sequences are replaced by Y in the RSF and PSS algorithms.

The RNF algorithm doesn't have this consonant/vowel restriction; it will match anything with the historic antecedent. Context information is more detailed for longer n-grams:

ae → aa
bae → baa
bael → baal
baele → bale
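In Perl, applying such rules amounts to a sequence of regular-expression substitutions, as in the sketch below. The combination and ordering of rules from the different rule sets is illustrative (the actual rule application is handled by applyRules.pl), and klaeghen is only used as an example input.

use strict;
use warnings;

# Apply a few example rewrite rules to a single word. More specific rules
# (longer n-grams, context restrictions) are applied before shorter ones.
sub rewrite {
    my ($word) = @_;
    $word =~ s/([bcdfghjklmnpqrstvwxz])ae$/${1}a/;  # consonant + ae at word end -> a
    $word =~ s/bae/baa/g;                           # RNF rule: bae -> baa
    $word =~ s/ae/aa/g;                             # RNF rule: ae -> aa
    return $word;
}

print rewrite('klaeghen'), "\n";   # prints "klaaghen"
print rewrite('vaek'), "\n";       # prints "vaak"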


Appendix B - Scripts

The PSS algorithm requires two lists of mappings from words to phonetic transcriptions: one list with historic words and phonetic transcriptions, and one list with modern words and phonetic transcriptions. The phonetic alphabet that is used is not important, as long as both lists use the same phonetic alphabet. The output is a plain text file, where each line contains a rewrite rule and its PSS score. The PSS score is the MM-score (the maximal match score, described in section 3.3.1).

The RSF and RNF algorithms both require two word frequency indices, one from a historic corpus and one from a modern corpus. These indices are plain text files with each line containing a unique word from the corpus, and its corpus frequency.

These three algorithms are implemented in Perl, simply named 'PSS.pl', 'RSF.pl' and 'RNF.pl'; they use only standard packages and require some scripts that are included in the resources package.

Other important algorithms are:

• mapPhoneticTranscriptions.pl: This expects two lists containing words and their phonetic transcriptions, and gives as output a dictionary of words with the same pronunciation.

• PED.pl: This is a package of subroutines, with ped as the main subroutine, which expects two strings as input and returns the phonetic edit distance as output.

• RDV.pl: This contains the subroutine rdv, an implementation of the Reduce Double Vowels algorithm. It needs a string as input and returns as output the string after reducing redundant double vowels.

• selectRules.pl and selectMethods.pl: The selectRules.pl script is an executable script that allows you to select rules from a rule set using a specific selection method (section 3.3). When executing, it needs three arguments: the number of the selection method, the name of the file containing the rule set, and finally the name of the output file where the selected rules are stored. The selection criteria are implemented in the selectMethods.pl script.


• applyRules.pl: This is a package of subroutines for applying a set of rules to a string or a list of strings.

• createTestSet.pl: This script requires three arguments: a text file from which words are randomly selected, a filename for the test set, and a filename for a list of words to skip. The nice thing about this approach is that the test set can be constructed once, and then extended later. The skip file contains a list of words that were already presented to the user in an earlier iteration, and were discarded. The user is presented with randomly selected words from the text file, and if an alternative spelling is given by the user, the original word and its alternative spelling are added to the test set. If no alternative spelling is given, the word is added to the list of discarded words. In a second run of the script (with the same filenames for the test set and the skip file), the words from the test set and the skip file are not presented to the user.

• buildIndex.pl: This expects three arguments as input. The first argument is a flag indicating whether the second argument is a text file, or a file containing a list of text files if multiple text files are to be indexed. The third argument is the filename for the resulting index, containing the unique words from the text file(s), and their collection frequencies.

For more information on or access to the scripts, send an e-mail to the author.


Appendix C - Selection and test set

The selection and test set (section 3.4.1) are in the same format as all the other word lists and dictionaries. Each line in the test set file consists of a historic word, a tab, and the modern spelling of the historic word (again, not necessarily an existing modern word). To clarify this, a few entries are given here:

sijnen	zijn
sijns	zijns
silvere	zilvere
simmetrie	symmetrie
sin	zin
singen	zingen
singht	zingt
sinlijckheyt	zinnelijkheid
sinnen	zinnen
sinplaets	zinplaats

The historic word sinplaets is spelled as zinplaats in modern Dutch, although zinplaats is not an existing modern word.4

4At least, it is not listed in the ‘Van Dale - Groot woordenboek der Nederlandse taal.’
