Applied IMPACT: Does the new FineReader Engine and Dutch Lexicon increase OCR accuracy and production efficiency? A case study by KB and CCS.
Content Conversion Specialists GmbH
Applied IMPACT
Does the new FineReader Engine and Dutch lexicon increase OCR accuracy and production efficiency?
A case study by KB and CCS
Claus Gravenhorst, Director Strategic Initiatives
Final IMPACT Conference, London, 2011-10-24
Agenda
Scope
Improving Text Accuracy
Test Material
Test System
Provided IMPACT Tools
Integration with Test System
Test Scenario
Evaluation Method
Evaluation Results
Conclusion
Scope
Testing results and tools from the IMPACT project in a real mass production environment
IMPACT provides improved OCR technology as well as historical dictionaries for 9 European languages
Motivation of CCS
- Benefit from technology improvements
- Increase the level of automation for small, mid and large scale digitisation workflows
- Avoid reinventing the wheel in specific areas such as OCR and language technology
Improving Text Accuracy
Various pre- and post-processing steps as well as extension of the OCR have an impact on the text accuracy
[Pipeline diagram: Image → Pre-Processing → Segmentation / Layout Analysis → OCR → Text Correction]
• Pre-Processing: Deskew, Cropping, Dewarping, etc.
• OCR: Dictionary, Pattern Training
• Segmentation / Layout Analysis: Zoning, Classification, Ordering, Grouping
• Text Correction: Dictionary, Linguistic methods, Crowd sourcing
Information Retrieval benefits: cleaner index, more relevant hits
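The stages listed on this slide can be sketched as a simple function composition. This is a minimal illustration only; the stage names follow the slide, but the function bodies are placeholders, not docWorks code:

```python
# Each stage transforms the page; chaining them yields the conversion pipeline.
def preprocess(image):
    """Pre-Processing: deskew, cropping, dewarping (placeholder)."""
    return image

def segment(image):
    """Segmentation / Layout Analysis: zoning, classification, ordering, grouping."""
    return [image]  # one zone per page in this toy sketch

def ocr(zones):
    """OCR: pattern training, dictionary lookup (placeholder)."""
    return " ".join("<text>" for _ in zones)

def correct(text):
    """Text Correction: dictionaries, linguistic methods, crowd sourcing."""
    return text

def convert(image):
    return correct(ocr(segment(preprocess(image))))
```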
Test Material
17th century Dutch newspaper “Courante uyt Italien, Duytslandt, &c.” printed with Fraktur/Gothic fonts
Databank of Digital Daily Newspapers (DDD)
- 1619 - 1635, 73 issues, 144 pages
- TIFF, 24 bit color, 300 dpi, captured with CANON DSLR camera, saved with Adobe Photoshop CS4
IMPACT Ground Truth Material (GTM)
- 1620 - 1632, 33 issues, 72 pages, overlap with DDD pages
- TIFF, 24 bit color, 300 dpi, captured with CANON DSLR camera, saved with ImageMagick 6.5.7
- page.xml with segment/zone coordinates and keyed text
Test System
docWorks – Large Scale Digitisation Workflow
Provides layout analysis for page segmentation and zone classification
Developed during the EU funded FP5 research project METAe (2000 – 2003)
Used for small, mid and large scale digitisation projects by cultural heritage institutions and service providers around the world (e.g. BL books/newspapers, KB DDD newspapers, Proquest/EEB, etc.)
Provides structural analysis for recognition of logical entities
Test System
[Workflow diagram: Scanning (Robot-, Book-, Document- and Microfilm-Scanner) → Imaging → Conversion (Layout Analysis, OCR, ISR) → Automated QA → Manual QA + Correction (in-house, near-shore, off-shore; multiple locations; re-scan on reject condition) → Delivery QA (random) → Final Output. Item tracking via barcode and document UID; metadata exchange via Z39.50; check-in/check-out against database and repository.]
Test System
Conversion
Image Pre-processing
Layout Analysis
OCR (ABBYY)
Structural Analysis (ISR)
Provided IMPACT Tools
ABBYY FineReader Engine 10 with Gothic/Fraktur extension and standard interface for integration of external dictionaries
Corpus based dictionary from the DBNL – Digitale Bibliotheek voor de Nederlandse Letteren (www.dbnl.org), 16th – 19th century
Dictionary based dictionary from the WNT – Woordenboek der Nederlandsche Taal, 16th – 19th century
DLL incl. documentation for:
- access to the dictionaries and integration into the OCR process via the FRE10 external dictionary interface
- access to routines for fixing misrecognition of “long S” characters
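The long s (ſ) of Fraktur type is commonly misrecognised as "f". The IMPACT DLL routines themselves are not shown here; the following is only a minimal sketch of the dictionary-backed idea, with `LEXICON` and `fix_long_s` as illustrative names standing in for the real interface:

```python
from itertools import product

# Tiny stand-in lexicon; the real historical dictionaries are far larger.
LEXICON = {"duytslant", "slecht", "straf"}

def fix_long_s(token, lexicon=LEXICON):
    """If a token is out-of-lexicon but becomes a valid word when some
    "f" characters are read as "s", prefer the corrected form."""
    if token.lower() in lexicon:
        return token
    positions = [i for i, c in enumerate(token) if c.lower() == "f"]
    # try every combination of keeping / replacing each "f"
    for flips in product((False, True), repeat=len(positions)):
        chars = list(token)
        for pos, flip in zip(positions, flips):
            if flip:
                chars[pos] = "S" if token[pos].isupper() else "s"
        candidate = "".join(chars)
        if candidate.lower() in lexicon:
            return candidate
    return token
```

Note that a word like "straf" is left untouched because it is already in the lexicon; only out-of-lexicon tokens are candidates for correction.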
Integration with Test System
docWorks test system built
DLL integrated with docWorks code
External dictionaries callable via DLL and ABBYY external dictionary interface
Minor FRE10 bug identified during integration phase. ABBYY support immediately provided a workaround.
Overall, the integration went smoothly
Test Scenario
GTM sample processing:
- based on segmentation obtained from the page.xml files
- without any image pre-processing
DDD processing:
- comparison with GTM sample processing showed slightly better results
- image pre-processing and segmentation by docWorks
- text correction for a complete run to create DDD Ground Truth Text
4 runs with DDD images:
- FR engine (FRE) 9 and standard Dutch dictionary
- FRE 10 and standard Dutch dictionary
- FRE 10 and corpus based dictionary incl. “long S” fix
- FRE 10 and dictionary based dictionary incl. “long S” fix
Evaluation Method
The goal was to generate statistical data on character and word accuracy for all 4 test runs through automated comparison of the text output with the DDD Ground Truth Text
Accuracy rates are computed with the Levenshtein algorithm. The Levenshtein distance is the number of edit operations (insert, delete, substitute) needed to transform one text into another
The Levenshtein percentage is the Levenshtein distance multiplied by 100 and divided by the number of characters or words of the correct DDD Ground Truth Text
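The metric described above can be sketched in a few lines (a standard dynamic-programming implementation of the distance, not the project's evaluation code):

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to transform string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def levenshtein_percentage(ocr_text, ground_truth):
    """Distance * 100 / length of the correct Ground Truth text (lower is better)."""
    return levenshtein(ocr_text, ground_truth) * 100 / len(ground_truth)
```

For example, a long-s misread such as "Duytflant" vs. the correct "Duytslant" is a single substitution, so the distance is 1.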
Evaluation Results (1)
docWorks text correction mode
Long S recognition, e.g. in “Duytslant”
Evaluation Results (2)
Levenshtein percentage per run (smaller value represents a higher text accuracy):

Run          Characters (%)   Words (%)
FRE9+SD           24.33         66.38
FRE10+SD          23.18         64.87
FRE10+CD          18.85         56.54
FRE10+DD          17.64         52.70

SD = Standard Dutch Dictionary
CD = Corpus based Dictionary
DD = Dictionary based Dictionary
Improvement in character accuracy is 27.5 % (FRE10+DD vs. FRE9+SD)
Improvement in word accuracy is 20.6 % (FRE10+DD vs. FRE9+SD)
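These improvement figures are the relative reduction of the Levenshtein percentage (i.e. of the error rate), which can be checked directly from the chart values:

```python
def relative_improvement(old, new):
    """Relative reduction of the Levenshtein percentage (error rate), in %."""
    return (old - new) / old * 100

# FRE9+SD vs. FRE10+DD, values from the Evaluation Results chart
chars = relative_improvement(24.33, 17.64)  # characters: ~27.5 %
words = relative_improvement(66.38, 52.70)  # words: ~20.6 %
```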
Conclusion
Improved ABBYY OCR and historical dictionaries enable higher text accuracy and lower the effort for text correction
Tools easy to integrate via DLL and ABBYY interface for external dictionaries
Future digitisation projects will benefit from historical dictionaries
Biggest potential for further improvement is in language technology
Thank you
CCS Content Conversion Specialists GmbH
information:accessible
Weidestr. 134, D-22083 Hamburg, Germany
Phone: +49 (0) 402 2713016 · Fax: +49 (0) 402 2713011 · Mobile: +49 (0) 176 12713016
Internet: www.content-conversion.com
Claus Gravenhorst
Director Strategic Initiatives