Optical Character Recognition with a Neural Network Model for Coptic Kirill Bulert So Miyagawa Marco Buechler December 8, 2017 DH2017 Montreal, Canada Virtual Short Paper


Optical Character Recognition with a Neural Network Model for Coptic

Kirill Bulert, So Miyagawa, Marco Buechler

December 8, 2017, DH2017, Montreal, Canada

Virtual Short Paper


Coptic


About Coptic

• The final stage of the Ancient Egyptian language (third century)

• Multiple dialects (most important: Bohairic, Sahidic)

• Many manuscripts in Sahidic Coptic


Alphabet

• Simple: ca. 30 unique characters

• No upper/lower case in historic texts

• Diacritics add some complexity, but not much

• Based on the Greek alphabet


Coptic projects

• SFB 1136 (Goettingen): https://www.uni-goettingen.de/de/sfb-1136/521113.html

• Digital Edition of the Coptic Old Testament (Goettingen): http://coptot.manuscriptroom.com/

• Coptic SCRIPTORIUM (Georgetown/Pacific): http://copticscriptorium.org/


OCR Systems


Tesseract vs Ocropy, free OCR frameworks

Tesseract

• Font-based learning

• Single characters are decomposed

• Decomposed parts are matched against given text

Input

• Fonts

Problem

• Few Coptic fonts available, not all historic variations covered

Ocropy/Ocropus

• Neural nets with online learning

• Model accuracy proportional to ground truth size

• But the size of the ground truth is limited

Input

• Ground truth and corresponding images

Problem

• Limited data for learning
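
Ocropy trains on text-line images paired with their transcriptions. The following is a minimal sketch, assuming the usual Ocropy layout in which each line image `NNNNNN.bin.png` sits next to a ground-truth file `NNNNNN.gt.txt`. The function name `pair_ground_truth` is hypothetical and not part of Ocropy. It shows one way to check how much usable ground truth exists before training:

```python
from pathlib import Path


def pair_ground_truth(line_dir):
    """Return (image, transcription) pairs and the images lacking a transcription.

    Assumes the naming convention "010001.bin.png" <-> "010001.gt.txt".
    """
    pairs, missing = [], []
    for image in sorted(Path(line_dir).glob("*.bin.png")):
        # Swap the ".bin.png" suffix for ".gt.txt" to find the transcription.
        gt_path = image.with_name(image.name[: -len(".bin.png")] + ".gt.txt")
        if gt_path.exists():
            pairs.append((image, gt_path.read_text(encoding="utf-8").strip()))
        else:
            missing.append(image)
    return pairs, missing
```

Line images reported as missing still need a manual transcription before they can contribute to training.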


Pre-Processing


GIGO principle

• Training: Data → training → model

• Garbage in, garbage out

• Cleaner images improve results tremendously

• Dust and stains can be cleaned algorithmically, but …

• What if the text itself is the noise?
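
One common way to clean dust algorithmically is to delete very small connected groups of ink pixels from a binarized page. The following is a minimal sketch, assuming the page is given as a list of rows with 1 for ink and 0 for background. The function name `remove_specks` and the `min_size` threshold are illustrative assumptions, not what ScanTailor or the authors use:

```python
from collections import deque


def remove_specks(image, min_size=3):
    """Clear 4-connected groups of ink pixels (1) that are smaller than min_size."""
    height, width = len(image), len(image[0])
    cleaned = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Collect the connected component that starts at (y, x).
            component, queue = [], deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and image[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Components below the threshold are treated as dust and erased.
            if len(component) < min_size:
                for cy, cx in component:
                    cleaned[cy][cx] = 0
    return cleaned
```

A size threshold cannot tell dust from small meaningful marks, so diacritics may be erased along with the specks. Printed text that is unwanted (the case raised in the last bullet) is not affected by such a filter at all.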


Original Page

• Flawless, almost …


Original Page

• Easily removable with ScanTailor


Foreign language

• No multilingual OCR models for Coptic


Annotations

• Special characters might not be part of any model (⸤ ⸥)

• Not all annotations are wanted


Language specific variations

• Might also not be included in a model


Previous work

• Coptic models created by Moheb for Tesseract (2013)

• Trained with several Coptic fonts, no support for non-Coptic letters

• No support for diacritics

• Non-Coptic letters get replaced with similar Coptic letters


The good, the bad, and the problematic

• Even clean scans still contain noise


Results


Without line numbers

[Figure panels: Accuracy, Input]

• Without time-consuming pre-processing

• The Ocropy model trained on 10 pages is more accurate

• The non-multilingual Tesseract model is less accurate
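
The slides do not state how accuracy was computed. A common choice for OCR is character accuracy based on the edit (Levenshtein) distance between the ground truth and the OCR output. The following is a minimal sketch under that assumption; the function names are illustrative:

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(ground_truth, ocr_output):
    """1 - distance / len(ground_truth), clamped to the range [0, 1]."""
    if not ground_truth:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(ground_truth, ocr_output)
    return max(0.0, 1.0 - errors / len(ground_truth))
```

Under this definition, an OCR line that drops one character of a four-character ground truth scores 0.75.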


Without line numbers

[Figure panels: Accuracy, Input]

• Without non-Coptic letters

• The difference results mostly from diacritics
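
How the non-Coptic letters were removed for this comparison is not described. One possible approach, sketched here as an assumption, is to keep only characters from the Unicode Coptic block (U+2C80 to U+2CFF) and the Coptic letters in the Greek and Coptic block (U+03E2 to U+03EF), together with whitespace and any combining marks attached to a kept letter. The function names are hypothetical:

```python
import unicodedata


def is_coptic(ch):
    """True for code points in the two Unicode ranges that hold Coptic letters."""
    code = ord(ch)
    return 0x2C80 <= code <= 0x2CFF or 0x03E2 <= code <= 0x03EF


def keep_coptic(text):
    """Drop non-Coptic characters; keep whitespace and marks on kept letters."""
    kept = []
    base_kept = False  # whether the most recent base character was kept
    for ch in text:
        if unicodedata.combining(ch):
            # A diacritic survives only if the letter it sits on survived.
            if base_kept:
                kept.append(ch)
        else:
            base_kept = is_coptic(ch)
            if base_kept or ch.isspace():
                kept.append(ch)
    return "".join(kept)
```

Applying the same filter to both the ground truth and the OCR output keeps the comparison fair, since the two texts lose the same classes of characters.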


Without line numbers

[Figure panels: Accuracy, Input]

• Diacritics removed

• The pure Coptic Tesseract model outperforms Ocropy's mixed model
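
The slides do not say how the diacritics were removed. A standard way, shown here as a sketch rather than the authors' method, is to decompose the text with Unicode normalization (NFD) and drop every combining mark. The function name `strip_diacritics` is illustrative:

```python
import unicodedata


def strip_diacritics(text):
    """Decompose to NFD and remove all combining marks (e.g. supralinear strokes)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Running both the ground truth and the OCR output through this function before scoring removes diacritics as a source of error, which matches the setting of this comparison.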


Workload comparison

• Utilising OCR decreases human workload


Roundup

• Utilisation of OCR beneficial for most clean documents

• Tesseract best for monolingual documents with limited fonts and font variations

• Ocropy best for large documents with multiple languages


Road ahead

• Approached by Google for collaboration on OCR for Coptic

• Create data set for Coptic OCR testing

• Transition from typeset to handwritten Coptic texts

• Combination of different models


Thank you. Questions?

Get our Ocropy models at https://github.com/somiyagawa/CopticOCR-1/tree/master/Besa


Presentation: Kirill Bulert, So Miyagawa

Team (in alphabetical order): Kirill Bulert, Marco Büchler, So Miyagawa

Visit us: http://www.etrap.eu and http://www.uni-goettingen.de/de/517150.html


OCR References

1. Uwe Springmann's OCR Workshop: http://www.cis.uni-muenchen.de/ocrworkshop/program.html

2. ScanTailor for pre-processing: http://scantailor.org/

3. Ocropy/Ocropus: https://github.com/tmbdev/ocropy

4. Kraken, an Ocropy fork: http://kraken.re/

5. Tesseract OCR: https://github.com/tesseract-ocr/

6. Moheb's Coptic Pages: http://www.moheb.de/ocr.html


Coptic References

1. SFB 1136 (Goettingen): https://www.uni-goettingen.de/de/sfb-1136/521113.html

2. Digital Edition of the Coptic Old Testament (Goettingen): http://coptot.manuscriptroom.com/

3. Coptic SCRIPTORIUM (Georgetown/Pacific): http://copticscriptorium.org/


Licence

The LaTeX theme this presentation is based on is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Changes to the theme are the work of eTRAP.
