OCR challenges in historic documents and the contribution of IMPACT

Clemens Neudecker, KB National Library of the Netherlands

18/08/2010 - IFLA satellite meeting, Uppsala

IMPACT is supported by the European Community under the FP7 ICT Work Programme. The project is coordinated by the National Library of the Netherlands.

Presented at the IFLA 2010 Satellite Meeting "New Techniques for Old Documents", 16-18 August 2010, Uppsala, Sweden.

Background

Text that is not digital is virtually invisible

OCR (optical character recognition) technology does not produce satisfactory results for historic documents

A lack of institutional knowledge and expertise leads to “re-inventing the wheel”

Innovate OCR software and language technology

Share best practice and build capacity across Europe (Guidelines, Training, Workshops)

IMPACT – Improving access to text

Funded by the EC as part of the 7th Framework Programme

Coordinated by KB – National Library of the Netherlands

EU funding: € 12 100 000

26 partners: Libraries, Research Institutes, Industry Partners

Start date: 1 January 2008

Duration: 48 months

2011: Center of Competence

Historic material: different problems

I. OCR errors: damaged material, bad quality scans, difficult layout, historic fonts, …

II. Historical language: spelling variants, orthographical variants, inflected forms, …

Bad OCR results…

la 112 B ik e my lat arrived the>Pylades,-. lliot; aod. Abe- 3ineva, CNeee 4orn Neath,' titch ,cuim; ,'t;ohn_ IoMelwl fri ytiil SUn-.die8; ,FrietndiLp, St&ar, froniidon, 'Ui wine andgrocerieu ;: ;aletn, Bker, from Liverpool,. witfi eoal.;'4Stalled the AluidonG.: ceror' Lkndon, with sundries;: ;Two Rrothwsj'@ Whe~atn-;- Pylade', Eiot; Har'tinny,;;: Fisbley; ::Iiiveiy Peggy:-(flth add tie JAne, Redman,for eathly Newpot;agd llford; -Tw Br.otherAs, lawces,fos Lysixowjvithbinehol V pirI-ihzure;vi etsey, Per-wIliti; iIudstry, ModA - ~tbi ,Al~t,,'enniugs, for.:IP1~iOntI, StIth Ltu .c*ar An'l? Hawkinss foirouck , + iii ballasto I _______~ ~ ~ ~~~Ai

Bleed through & shine through

General description: If the printing ink was not dry, the letters of one page also appear on the facing page. Also, if the paper is relatively thin, the ink on the other side of the page may shine through.

Effects on OCRing: Effects are high, since it is the same ink (though lighter) and the shapes of the characters are directly disturbed.

IMPACT: Binarisation

Annotations in the text

General description: All notes, lines and drawings created by users, but also stamps, tapes etc. used within libraries.

Effects on OCRing: Effects are high, since both segmentation and the recognition process itself are disturbed.

IMPACT: Improved binarisation

Figure panels: Original, State of the Art, IMPACT
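
Binarisation turns a greyscale scan into a black-and-white image before recognition. As a point of reference only, the sketch below implements global Otsu thresholding, a standard baseline technique. It is not the IMPACT method, which the slides do not describe. The function names `otsu_threshold` and `binarise` and the sample pixel values are assumptions made for illustration.

```python
def otsu_threshold(pixels):
    """Return the grey level t (0-255) that maximises between-class variance.

    `pixels` is a flat list of integer grey values, 0 = black, 255 = white.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    n_low = 0     # pixels with value <= candidate threshold
    sum_low = 0   # sum of their grey values
    for t in range(256):
        n_low += hist[t]
        sum_low += t * hist[t]
        if n_low == 0:
            continue            # no dark class yet
        n_high = total - n_low
        if n_high == 0:
            break               # no light class left
        mean_low = sum_low / n_low
        mean_high = (sum_all - sum_low) / n_high
        between_var = n_low * n_high * (mean_low - mean_high) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarise(pixels, threshold):
    """Map grey values to 1 (ink) if <= threshold, else 0 (background)."""
    return [1 if p <= threshold else 0 for p in pixels]


# Illustrative sample: three dark ink pixels, four light paper pixels.
sample = [10, 20, 30, 200, 210, 220, 230]
t = otsu_threshold(sample)
binary = binarise(sample, t)
```

A single global threshold of this kind is a plausible stand-in for a baseline: faint bleed-through or stains sit between ink and paper in grey value and can end up on the wrong side of the threshold.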

Warping of paper

General description: Due to humidity, a page of an old book is very rarely flat; it is usually warped. Even pressing the paper against a glass plate will not remove the warping.

Effects on OCRing: Effects can be relatively high, especially in combination with bad printing (e.g. characters not aligned on the baseline of a line).

IMPACT: Border removal

IMPACT: Geometric correction I

IMPACT: Geometric correction II

Gothic typeface

General description: Historic fonts, obsolete characters such as the long s.

Effects on OCRing: Effects are high since such fonts and characters are often not recognised correctly.

IMPACT: Improved recognition

Complex layout

General description: Due to difficult layouts, pages can be segmented incorrectly.

Effects on OCRing: Effects are high since text is not ordered in the right way.

IMPACT: Segmentation

Segmentation levels: Blocks/Regions, Words, Glyphs
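
One simple way to picture segmentation at the word level is a vertical projection profile: count the ink pixels in each column of a binarised text line and start a new word wherever a wide enough run of empty columns occurs. The sketch below illustrates that generic idea under stated assumptions; it is not the IMPACT segmentation method. `split_words` and the `min_gap` parameter are hypothetical names.

```python
def split_words(line, min_gap=2):
    """Split a binarised text line into word column ranges.

    `line` is a list of pixel rows, each a list of 0/1 values (1 = ink).
    Returns a list of (start, end) column pairs, with `end` exclusive.
    A new word starts after at least `min_gap` consecutive empty columns.
    """
    width = len(line[0])
    # Vertical projection profile: ink pixels per column.
    profile = [sum(row[x] for row in line) for x in range(width)]

    words = []
    start = None      # first column of the word being built
    last_ink = None   # most recent column that contained ink
    for x, ink in enumerate(profile):
        if not ink:
            continue
        if start is None:
            start = x
        else:
            gap = x - last_ink - 1   # empty columns since the last ink
            if gap >= min_gap:
                words.append((start, last_ink + 1))
                start = x
        last_ink = x
    if start is not None:
        words.append((start, last_ink + 1))
    return words


# Illustrative two-row line: a 1-column gap stays inside a word,
# a 3-column gap separates two words.
line = [
    [1, 1, 0, 1, 0, 0, 0, 1, 1, 0],
    [1, 0, 0, 1, 0, 0, 0, 0, 1, 0],
]
word_ranges = split_words(line)
```

The same profile idea can be applied per row to find text lines, or with a smaller gap to approximate glyph boundaries, though touching or broken characters in historic print make fixed gap sizes unreliable.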

IMPACT: Functional extension parser

Recognition of the structure of book pages:
- Print space
- Standard font of the main text
- Page numbers

Enrichment of OCR results with structural information

Bad printing: blurred, broken, faded characters

General description: Depending on the printing technology used, letters may be blurred, broken or dotted.

Effects on OCRing: Effects are high since characters are broken or joined together.

IMPACT: Cooperative correction

Integrated web-based system for cooperative correction of OCR results

Character/Word/Page mode

Collaboratively correct OCR errors and use results for improving OCR

IMPACT: Word spotting

Alternative technique for indexing historical documents

After word segmentation, relevant words are detected and highlighted

Keywords can be, for example, person and location names

Historical language

Historical variants of the Dutch word ‘wereld’ (world):

werelt weerelt wereld weerelds wereldt werelden weereld werrelts waerelds weerlyt wereldts vveerelts waereld weerelden waerelden weerlt werlt werelds sweerels zwerlys swarels swerelts werelts swerrels weirelts tsweerelds werret vverelt werlts werrelt worreld werlden wareld weirelt weireld waerelt werreld werld vvereld weerelts werlde tswerels werreldts weereldt wereldje waereldje weurlt wald weëled

IMPACT: Historical dictionaries

Lexica for German, Dutch, English, French, Spanish, Polish, Bulgarian and Czech

Generic tools for building historical lexica

OCR:

FineReader with built-in standard Dutch dictionary: werreid

FineReader with IMPACT dictionary of historical Dutch: werreld

RETRIEVAL:

Key in ‘wereld’ and find ‘werreld’
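
The retrieval idea can be sketched as query expansion: a modern lemma is mapped to its attested historical variants, and the search matches any of them. The lexicon entry below is a tiny hand-picked sample from the 'wereld' variants listed earlier. The data structure, the function names and the sample sentence are assumptions for illustration, not the format or content of the IMPACT lexica.

```python
import re

# Hypothetical mini-lexicon: modern lemma -> attested historical variants.
HISTORICAL_LEXICON = {
    "wereld": {"werelt", "weerelt", "werreld", "waereld", "weereld"},
}


def expand_query(term, lexicon=HISTORICAL_LEXICON):
    """Return the query term together with its known historical variants."""
    return {term} | lexicon.get(term, set())


def search(term, text, lexicon=HISTORICAL_LEXICON):
    """Return the tokens in `text` that match the term or one of its variants."""
    wanted = expand_query(term.lower(), lexicon)
    tokens = re.findall(r"\w+", text.lower())
    return [tok for tok in tokens if tok in wanted]


# Invented sample text containing two historical spellings.
hits = search("wereld", "De gantsche werreld en de waereld")
```

Keeping the mapping keyed by the modern lemma means users never need to know the historical spellings; the cost is that recall depends entirely on how complete the lexicon is.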

IMPACT: Linguistic post-correction

The colors indicate different types of analysis results, like a word being found in the historical or hypothetical dictionary, or a supposed OCR error, etc.
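
The kind of token-level analysis described here can be sketched as a small classifier that labels each OCR token as a modern word, a known historical variant, or a suspected OCR error with a suggested correction. This is a minimal sketch using Python's standard `difflib` module for approximate matching. The word lists, the category names and the 0.8 similarity cutoff are illustrative assumptions, and the hypothetical dictionary mentioned above is not modelled.

```python
import difflib

# Hypothetical sample word lists, not the IMPACT lexica.
MODERN = {"wereld", "de", "van"}
HISTORICAL = {"werreld", "werelt", "waereld"}


def analyse(token, cutoff=0.8):
    """Classify one OCR token and return (category, suggested word)."""
    word = token.lower()
    if word in MODERN:
        return ("modern", word)
    if word in HISTORICAL:
        return ("historical", word)
    # Not in either list: look for the closest known word.
    known = sorted(MODERN | HISTORICAL)
    candidates = difflib.get_close_matches(word, known, n=1, cutoff=cutoff)
    if candidates:
        return ("suspected_ocr_error", candidates[0])
    return ("unknown", word)


# 'werreid' is the misrecognition shown on the historical dictionaries slide.
result = analyse("werreid")
```

In a display like the one described, each category would map to one highlight colour.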

IMPACT: Interoperability framework

Interaction, Modularisation, Evaluation
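
For the evaluation aspect, a common way to quantify OCR quality is the character error rate: the edit distance between the OCR output and a ground-truth transcription, divided by the length of the ground truth. The sketch below is a generic illustration; the slides do not state which metrics the framework uses, and the function names are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def character_error_rate(ocr_text, ground_truth):
    """Edit distance normalised by the ground-truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ocr_text, ground_truth) / len(ground_truth)


# One wrong character out of seven.
cer = character_error_rate("werreid", "werreld")
```

Because the measure only needs two strings, the same function can compare the output of different processing chains against one ground-truth set.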

Thank you!

http://www.impact-project.eu/

[email protected]

@impactocr

http://impactocr.wordpress.com/