Applied IMPACT: Does the new FineReader Engine and Dutch Lexicon increase OCR accuracy and production efficiency? A case study by KB and CCS.
Content Conversion Specialists GmbH
Applied IMPACT
Does the new FineReader Engine and Dutch lexicon increase OCR accuracy and production efficiency?
A case study by KB and CCS
Claus Gravenhorst, Director Strategic Initiatives
Final IMPACT Conference, London, 2011-10-24
Agenda
Scope
Improving Text Accuracy
Test Material
Test System
Provided IMPACT Tools
Integration with Test System
Test Scenario
Evaluation Method
Evaluation Results
Conclusion
Scope
Testing results and tools from the IMPACT project in a real mass production environment
IMPACT provides improved OCR technology as well as historical dictionaries for 9 European languages
Motivation of CCS
- Benefit from technology improvements
- Increase the level of automation for small, mid and large scale digitisation workflows
- Avoid reinventing the wheel in specific areas such as OCR and language technology
Improving Text Accuracy
Various pre- and post-processing steps as well as extension of the OCR have an impact on the text accuracy
[Pipeline diagram: Image → Pre-Processing → Segmentation / Layout Analysis → OCR → Text Correction]
• Pre-Processing: Deskew, Cropping, Dewarping, etc.
• OCR: Dictionary, Pattern Training
• Segmentation / Layout Analysis: Zoning, Classification, Ordering, Grouping
• Text Correction: Dictionary, Linguistic methods, Crowd sourcing
Information Retrieval benefits: cleaner index, more relevant hits
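The stages listed on this slide can be sketched as a simple function composition. This is a minimal illustration only; the stage names follow the slide, but the function bodies are placeholders, not docWorks code:

```python
# Each stage transforms the page; chaining them yields the conversion pipeline.
def preprocess(image):
    """Pre-Processing: deskew, cropping, dewarping (placeholder)."""
    return image

def segment(image):
    """Segmentation / Layout Analysis: zoning, classification, ordering, grouping."""
    return [image]  # one zone per page in this toy sketch

def ocr(zones):
    """OCR: pattern training, dictionary lookup (placeholder)."""
    return " ".join("<text>" for _ in zones)

def correct(text):
    """Text Correction: dictionaries, linguistic methods, crowd sourcing."""
    return text

def convert(image):
    return correct(ocr(segment(preprocess(image))))
```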
Test Material
17th century Dutch newspaper “Courante uyt Italien, Duytslandt, &c.” printed with Fraktur/Gothic fonts
Databank of Digital Daily Newspapers (DDD)
- 1619 - 1635, 73 issues, 144 pages
- TIFF, 24 bit color, 300 dpi, captured with CANON DSLR camera, saved with Adobe Photoshop CS4
IMPACT Ground Truth Material (GTM)
- 1620 - 1632, 33 issues, 72 pages, overlap with DDD pages
- TIFF, 24 bit color, 300 dpi, captured with CANON DSLR camera, saved with ImageMagick 6.5.7
- page.xml with segment/zone coordinates and keyed text
Test System
docWorks – Large Scale Digitisation Workflow
Provides layout analysis for page segmentation and zone classification
Developed during the EU funded FP5 research project METAe (2000 – 2003)
Used for small, mid and large scale digitisation projects by cultural heritage institutions and service providers around the world (e.g. BL books/newspapers, KB DDD newspapers, Proquest/EEB, etc.)
Provides structural analysis for recognition of logical entities
Test System
[Workflow diagram: Scanning (Robot-, Book-, Document- and Microfilm-Scanner) → Imaging → Conversion (Layout Analysis, OCR, ISR) → Automated QA → Manual QA + Correction (in-house, near-shore, off-shore; multiple locations; re-scan on reject condition) → Delivery QA (random) → Final Output. Item tracking via barcode and document UID; metadata exchange via Z39.50; check-in/check-out against database and repository.]
Test System
Conversion
Image Pre-processing
Layout Analysis
OCR (ABBYY)
Structural Analysis (ISR)
Provided IMPACT Tools
ABBYY FineReader Engine 10 with Gothic/Fraktur extension and standard interface for integration of external dictionaries
Corpus based dictionary from the DBNL – Digitale Bibliotheek voor de Nederlandse Letteren (www.dbnl.org), 16th – 19th century
Dictionary based dictionary from the WNT – Woordenboek der Nederlandsche Taal, 16th – 19th century
DLL incl. documentation for:
- access to the dictionaries and integration into the OCR process via the FRE10 external dictionary interface
- access to routines for fixing misrecognition of “long S” characters
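The long s (ſ) of Fraktur type is commonly misrecognised as "f". The IMPACT DLL routines themselves are not shown here; the following is only a minimal sketch of the dictionary-backed idea, with `LEXICON` and `fix_long_s` as illustrative names standing in for the real interface:

```python
from itertools import product

# Tiny stand-in lexicon; the real historical dictionaries are far larger.
LEXICON = {"duytslant", "slecht", "straf"}

def fix_long_s(token, lexicon=LEXICON):
    """If a token is out-of-lexicon but becomes a valid word when some
    "f" characters are read as "s", prefer the corrected form."""
    if token.lower() in lexicon:
        return token
    positions = [i for i, c in enumerate(token) if c.lower() == "f"]
    # try every combination of keeping / replacing each "f"
    for flips in product((False, True), repeat=len(positions)):
        chars = list(token)
        for pos, flip in zip(positions, flips):
            if flip:
                chars[pos] = "S" if token[pos].isupper() else "s"
        candidate = "".join(chars)
        if candidate.lower() in lexicon:
            return candidate
    return token
```

Note that a word like "straf" is left untouched because it is already in the lexicon; only out-of-lexicon tokens are candidates for correction.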
Integration with Test System
docWorks test system built
DLL integrated with docWorks code
External dictionaries callable via DLL and ABBYY external dictionary interface
Minor FRE10 bug identified during integration phase. ABBYY support immediately provided a workaround.
Overall, the integration went smoothly
Test Scenario
GTM sample processing:
- based on segmentation obtained from the page.xml files
- without any image pre-processing
DDD processing:
- comparison with GTM sample processing showed slightly better results
- image pre-processing and segmentation by docWorks
- text correction for a complete run to create DDD Ground Truth Text
4 runs with DDD images:
- FR engine (FRE) 9 and standard Dutch dictionary
- FRE 10 and standard Dutch dictionary
- FRE 10 and corpus based dictionary incl. “long S” fix
- FRE 10 and dictionary based dictionary incl. “long S” fix
Evaluation Method
The goal was to generate statistical data on character and word accuracy for all 4 test runs through automated comparison of the text output with the DDD Ground Truth Text
Accuracy rates are computed with the Levenshtein algorithm. The Levenshtein distance is the number of edit operations (insert, delete, substitute) needed to transform one text into another
The Levenshtein percentage is the Levenshtein distance multiplied by 100 and divided by the number of characters or words of the correct DDD Ground Truth Text
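The metric described above can be sketched in a few lines (a standard dynamic-programming implementation of the distance, not the project's evaluation code):

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to transform string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def levenshtein_percentage(ocr_text, ground_truth):
    """Distance * 100 / length of the correct Ground Truth text (lower is better)."""
    return levenshtein(ocr_text, ground_truth) * 100 / len(ground_truth)
```

For example, a long-s misread such as "Duytflant" vs. the correct "Duytslant" is a single substitution, so the distance is 1.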
Evaluation Results (1)
docWorks text correction mode
Long S recognition, e.g. in “Duytslant”
Evaluation Results (2)
Levenshtein percentage per run (smaller value represents a higher text accuracy):

Run          Characters (%)   Words (%)
FRE9+SD           24.33         66.38
FRE10+SD          23.18         64.87
FRE10+CD          18.85         56.54
FRE10+DD          17.64         52.70

SD = Standard Dutch Dictionary
CD = Corpus based Dictionary
DD = Dictionary based Dictionary
Improvement in character accuracy is 27.5 % (FRE10+DD vs. FRE9+SD)
Improvement in word accuracy is 20.6 % (FRE10+DD vs. FRE9+SD)
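These improvement figures are the relative reduction of the Levenshtein percentage (i.e. of the error rate), which can be checked directly from the chart values:

```python
def relative_improvement(old, new):
    """Relative reduction of the Levenshtein percentage (error rate), in %."""
    return (old - new) / old * 100

# FRE9+SD vs. FRE10+DD, values from the Evaluation Results chart
chars = relative_improvement(24.33, 17.64)  # characters: ~27.5 %
words = relative_improvement(66.38, 52.70)  # words: ~20.6 %
```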
Conclusion
Improved ABBYY OCR and historical dictionaries enable higher text accuracy and lower the effort for text correction
Tools easy to integrate via DLL and ABBYY interface for external dictionaries
Future digitisation projects will benefit from historical dictionaries
Biggest potential for further improvement is in language technology
Thank you
CCS Content Conversion Specialists GmbH
information:accessible
Weidestr. 134, D-22083 Hamburg, Germany
Phone: +49 (0) 402 2713016 · Fax: +49 (0) 402 2713011 · Mobile: +49 (0) 176 12713016
Internet: www.content-conversion.com
Claus Gravenhorst
Director Strategic Initiatives