Optical Character Recognition with a Neural Network Model for Coptic

Kirill Bulert, So Miyagawa, Marco Buechler

December 8, 2017

DH2017 Montreal, Canada

Virtual Short Paper

Coptic

About Coptic

• The final stage of the Ancient Egyptian language (third century)

• Multiple dialects (most important: Bohairic, Sahidic)

• Many manuscripts in Sahidic Coptic

Alphabet

• Simple: ca. 30 unique characters

• No upper/lower case in historic texts

• Diacritics add some complexity, but not much

• Based on Greek alphabet

Coptic projects

• SFB 1136 (Goettingen): https://www.uni-goettingen.de/de/sfb-1136/521113.html

• Digital Edition of the Coptic Old Testament (Goettingen): http://coptot.manuscriptroom.com/

• Coptic SCRIPTORIUM (Georgetown/Pacific): http://copticscriptorium.org/

OCR Systems

Tesseract vs Ocropy, free OCR frameworks

Tesseract

• Font based learning

• Single characters are decomposed

• Decomposed parts are matched against given text

• Fonts

Problem

• Few Coptic fonts available, not all historic variations covered

Ocropy/Ocropus

• Neural nets with online learning

• Model accuracy proportional to ground truth size

• But, size of ground truth is limited

• Ground truth and corresponding images

Problem

• Limited data for learning
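
Because Ocropy learns from line images paired with transcriptions, it helps to check how much usable ground truth actually exists before training. The sketch below assumes the usual Ocropy file layout, where a line image `010001.bin.png` sits next to its transcription `010001.gt.txt`; the function name `pair_ground_truth` is hypothetical and not part of Ocropy.

```python
from pathlib import Path


def pair_ground_truth(root):
    """Return (pairs, missing) for all line images below `root`.

    pairs   -- list of (image_path, transcription) tuples
    missing -- line images that have no ground-truth file yet
    Assumes the Ocropy naming scheme: X.bin.png next to X.gt.txt.
    """
    pairs, missing = [], []
    for image in sorted(Path(root).rglob("*.bin.png")):
        # "010001.bin.png" -> "010001.gt.txt" in the same directory
        stem = image.name[: -len(".bin.png")]
        gt = image.with_name(stem + ".gt.txt")
        if gt.is_file():
            pairs.append((image, gt.read_text(encoding="utf-8").strip()))
        else:
            missing.append(image)
    return pairs, missing
```

The length of `pairs` is the amount of training data available; `missing` lists the lines that still need to be transcribed by hand.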

Pre-Processing

GIGO - principle

• Training: Data → training → model

• Garbage in, garbage out

• Cleaner images improve results tremendously

• Dust and stains can be cleaned algorithmically, but …

• What if the text itself is the noise?
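
As an illustration of algorithmic cleaning, the sketch below removes small isolated ink blobs (dust, specks) from a binarised page. It is not the procedure used for the slides, which relied on ScanTailor; the function name `remove_specks` and the `min_size` threshold are assumptions chosen for the example.

```python
from collections import deque


def remove_specks(image, min_size=3):
    """Return a copy of a binary image (1 = ink, 0 = paper) in which
    every 4-connected ink blob smaller than `min_size` pixels is erased."""
    if not image:
        return []
    height, width = len(image), len(image[0])
    cleaned = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first search collects the blob containing (y, x).
            component, queue = [], deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                               (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and image[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(component) < min_size:
                for cy, cx in component:
                    cleaned[cy][cx] = 0
    return cleaned
```

A size threshold cannot tell a speck of dust from a small diacritic or a dot, and it does nothing about unwanted text such as annotations, which is where the difficult cases begin.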

Original Page

• Flawless, almost …

Original Page

• Easily removable with ScanTailor

Foreign language

• No multilingual OCR models for Coptic

Annotations

• Special characters might not be part of any model (⸤ ⸥)

• Not all annotations wanted

Language specific variations

• Might also not be included in a model

Previous work

• Coptic models created by Moheb for Tesseract (2013)

• Trained with several Coptic fonts, no support for non-Coptic letters

• No support for diacritics

• Non-Coptic letters get replaced with similar Coptic letters

The good, the bad, and the problematic

• Even clean scans still contain noise

Results

Without line numbers

Accuracy Input

• Without time consuming pre-processing

• Ocropy model trained on 10 pages more accurate

• Non-multilingual Tesseract model less accurate
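
The slides do not state how accuracy was computed. A common choice for OCR is character accuracy based on edit distance, sketched below under that assumption; the function names are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions turning one string into the other."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            current.append(min(previous[j] + 1,      # deletion
                               current[j - 1] + 1,   # insertion
                               substitution))
        previous = current
    return previous[-1]


def character_accuracy(reference, hypothesis):
    """1 - (edit distance / length of the ground truth)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 1.0 - edit_distance(reference, hypothesis) / len(reference)
```

With this definition, an OCR line that drops one character from a four-character ground truth scores 0.75. The value can become negative when the output is much longer than the ground truth.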

Accuracy

• Without non-Coptic letters

• Difference results mostly from diacritics
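
One way to compare outputs without non-Coptic letters is to filter both the ground truth and the OCR result before scoring. The slides do not describe how this was done, so the following is only a sketch: it keeps characters from the Unicode Coptic block (U+2C80 to U+2CFF), the Coptic letters in the Greek and Coptic block (U+03E2 to U+03EF), whitespace, and combining marks attached to a kept character. The function names are hypothetical.

```python
import unicodedata


def is_coptic(char):
    """True for code points in the Coptic block or for the Coptic
    letters located in the Greek and Coptic block."""
    code = ord(char)
    return 0x2C80 <= code <= 0x2CFF or 0x03E2 <= code <= 0x03EF


def keep_coptic(text):
    """Drop non-Coptic letters; keep whitespace and any combining
    marks that follow a character which was kept."""
    kept, base_kept = [], False
    for char in text:
        if unicodedata.combining(char):
            # A diacritic survives only together with its base character.
            if base_kept:
                kept.append(char)
        else:
            base_kept = is_coptic(char) or char.isspace()
            if base_kept:
                kept.append(char)
    return "".join(kept)
```

Diacritics on Coptic letters are kept here, so any remaining difference between two models on the filtered text can still come from diacritics.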

Accuracy

• Diacritics removed

• Pure Coptic Tesseract model outperforms Ocropy's mixed model
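
Removing diacritics before a comparison can be done with Unicode normalisation: decompose each character, drop the combining marks, and recompose. This is a minimal sketch under the assumption that the diacritics are encoded as Unicode combining marks; it is not taken from the authors' evaluation code, and `strip_diacritics` is a hypothetical name.

```python
import unicodedata


def strip_diacritics(text):
    """Remove all combining marks (Unicode categories Mn, Mc, Me)."""
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(
        char for char in decomposed
        if not unicodedata.category(char).startswith("M")
    )
    return unicodedata.normalize("NFC", without_marks)
```

Applying the same function to both the ground truth and the OCR output keeps the comparison fair, since neither side is then penalised for marks the other lacks.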

Workload comparison

• Utilising OCR decreases human workload

Roundup

• Utilisation of OCR beneficial for most clean documents

• Tesseract best for monolingual documents with limited fonts and font variations

• Ocropy best for large documents with multiple languages

Road ahead

• Approached by Google for collaboration on OCR for Coptic

• Create data set for Coptic OCR testing

• Transition from typeset to handwritten Coptic texts

• Combination of different models

Thank you. Questions?

Get our Ocropy models at https://github.com/somiyagawa/CopticOCR-1/tree/master/Besa

Presentation: Kirill Bulert, So Miyagawa

Team (in alphabetical order): Kirill Bulert, Marco Büchler, So Miyagawa

Visit us: http://www.etrap.eu and http://www.uni-goettingen.de/de/517150.html

Contact: contact@etrap.eu, smiyaga@gwdg.de

OCR References

1. Uwe Springmann's OCR Workshop: http://www.cis.uni-muenchen.de/ocrworkshop/program.html

2. ScanTailor for pre-processing: http://scantailor.org/

3. Ocropy/Ocropus: https://github.com/tmbdev/ocropy

4. Kraken, an Ocropy fork: http://kraken.re/

5. Tesseract OCR: https://github.com/tesseract-ocr/

6. Moheb’s Coptic Pages: http://www.moheb.de/ocr.html

Coptic References

1. SFB 1136 (Goettingen): https://www.uni-goettingen.de/de/sfb-1136/521113.html

2. Digital Edition of the Coptic Old Testament (Goettingen): http://coptot.manuscriptroom.com/

3. Coptic SCRIPTORIUM (Georgetown/Pacific): http://copticscriptorium.org/

Licence

The LaTeX theme this presentation is based on is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Changes to the theme are the work of eTRAP.
