Development of Arabic OCR: Opportunities and Challenges
Team members in UofG: Qiying He (Tina), SangYu Lee, Leonardo Nunes Parente
Team members in IUG: Ghadeer Abu-Oda, Shadia Baroud
What is OCR?
OCR = Optical Character Recognition
Content
1. Situation & Problems: background in Gaza, the Arabic language, and existing problems of Arabic OCR
2. Solutions: Hidden Markov Model, open-source software, and their advantages and disadvantages
3. Evaluation & Future Work: best solution, limitations, and future trends
Part 1: Situation & Problems
Background
- Blind people
- Free software
- Cannot afford new apps
- ATC in IUG [1]
ATC = Assistive Technology Centre
Complexity of Arabic
- Cursiveness: 28 characters; 22 are cursive, 6 are non-cursive.
- Shapes: a character can have up to 4 shapes depending on its position (Table 1) [2].
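The position-dependent shapes can be checked against the Unicode character database. The sketch below is only an illustration and is not taken from the slides or from [2]. It assumes that "cursive" corresponds to letters with all four positional forms (isolated, final, initial, medial) in the Arabic Presentation Forms-B block and "non-cursive" to letters with only two. The names `BASIC_LETTERS` and `positional_forms` are hypothetical.

```python
import unicodedata

# The 28 basic Arabic letters as Unicode code points (alef .. yeh),
# skipping teh marbuta (U+0629) and other non-alphabet symbols.
BASIC_LETTERS = (
    [0x0627, 0x0628]                 # alef, beh
    + list(range(0x062A, 0x063B))    # teh .. ghain
    + list(range(0x0641, 0x0649))    # feh .. waw
    + [0x064A]                       # yeh
)


def positional_forms(base):
    """Map a position tag to the presentation-form character of one letter."""
    forms = {}
    for cp in range(0xFE70, 0xFF00):  # Arabic Presentation Forms-B
        parts = unicodedata.decomposition(chr(cp)).split()
        # A single-letter form decomposes as e.g. "<initial> 0628".
        if len(parts) == 2 and int(parts[1], 16) == base:
            forms[parts[0].strip("<>")] = chr(cp)
    return forms


if __name__ == "__main__":
    for base in (0x0628, 0x0627):  # beh (joins both sides), alef (joins one side)
        name = unicodedata.name(chr(base))
        print(name, sorted(positional_forms(base)))
    counts = [len(positional_forms(cp)) for cp in BASIC_LETTERS]
    print("letters with 4 shapes:", counts.count(4))
    print("letters with 2 shapes:", counts.count(2))
```

An OCR system has to map every one of these shapes back to the same underlying letter, which is part of why Arabic recognition is harder than recognition of unconnected Latin print.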
Problems with OCR
A. Most apps focus on English or other Latin-based languages
B. Arabic OCR is still at an early stage (inaccurate)
C. Not many techniques exist for handwritten Arabic recognition
Part 2: Solutions
Solution 1: Statistical Methods
Algorithm             Accuracy rate
Logistic Regression   89.4% [4]
Linear SVM            85.4% [4]
kNN (3)               89.5% [4]
HMM                   92.1% [5]
- Hidden Markov Model (HMM)
A. What is an HMM?
A tool for representing probability distributions over sequences of observations [3]
B. Why?
- Based on a “process-focused approach”, which is suitable for recognising handwriting
- High accuracy rate [6]
C. How does it work?
A pattern is assigned to the model with the highest posterior probability (i.e. the model that best explains the pattern) [6]
D. Evaluation
- One of the most suitable algorithms for handwriting recognition
- Can be further developed by adapting appropriate software
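The decision rule in item C (assign the pattern to the model with the highest posterior probability) can be sketched with a small discrete HMM. This is a minimal illustration under stated assumptions, not the system described in [5] or [6]: each class has one HMM, the likelihood is computed with the standard forward algorithm, and the two toy models, their parameters, and the observation symbols (0 and 1, standing in for quantised stroke features) are all invented for the example.

```python
def forward_likelihood(obs, start, trans, emit):
    """Return P(obs | model) for a discrete HMM using the forward algorithm."""
    n = len(start)
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][symbol]
            for s in range(n)
        ]
    return sum(alpha)


def classify(obs, models, priors):
    """Pick the class whose HMM has the highest posterior for obs."""
    joint = {
        name: priors[name] * forward_likelihood(obs, *params)
        for name, params in models.items()
    }
    total = sum(joint.values())
    posteriors = {name: value / total for name, value in joint.items()}
    best = max(posteriors, key=posteriors.get)
    return best, posteriors


# Hypothetical two-state, left-to-right models: (start, transitions, emissions).
TRANS = [[0.7, 0.3], [0.0, 1.0]]
MODELS = {
    "A": ([1.0, 0.0], TRANS, [[0.9, 0.1], [0.8, 0.2]]),  # mostly emits 0
    "B": ([1.0, 0.0], TRANS, [[0.1, 0.9], [0.2, 0.8]]),  # mostly emits 1
}
PRIORS = {"A": 0.5, "B": 0.5}

if __name__ == "__main__":
    label, post = classify([0, 0, 1, 0], MODELS, PRIORS)
    print(label, post)
```

In a real recogniser the observations would be feature vectors extracted from the handwriting image and the model parameters would be trained from data; long sequences would also call for log probabilities to avoid numerical underflow.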
Solution 2: OCR Software
- Tesseract
A. What is Tesseract?
An OCR engine for various operating systems, developed by HP in 1995 [7]
B. Why?
- Price: free
Software          Price
Sakhr             £650.00
OmniPage (Pro)    £292.00
ABBYY             £100.00
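As a rough illustration of how accessible Tesseract is, the sketch below calls its command-line tool from Python. It assumes Tesseract and its Arabic language data (language code `ara`) are installed. The helper names `build_tesseract_command` and `run_ocr` and the file name `page.png` are hypothetical.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, lang="ara"):
    """Build a Tesseract CLI call that prints the recognised text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def run_ocr(image_path, lang="ara"):
    """Run Tesseract on one image and return the recognised text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise if Tesseract reports an error
    )
    return result.stdout


if __name__ == "__main__":
    # Only shows the command; call run_ocr("page.png") on a real scan.
    print(" ".join(build_tesseract_command("page.png")))
```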
B. Why? (continued)
- Open Source Software (OSS): more opportunities to adapt the software to users' input
- Change of error rate:
  Character   Word
  -7.31%      -5.339%
C. Evaluation
- Easy accessibility: no cost and open to input
- Needs more participation and better accuracy for Arabic
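The slides do not say how the character and word error rates were measured. A common convention, assumed here, is the edit distance between the OCR output and a reference transcription, divided by the reference length. The sketch below shows that convention. The function names and the sample strings are hypothetical placeholders.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalised by the length of the reference."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def character_error_rate(ref, hyp):
    return error_rate(list(ref), list(hyp))


def word_error_rate(ref, hyp):
    return error_rate(ref.split(), hyp.split())


if __name__ == "__main__":
    reference = "the cat sat"
    ocr_output = "the cat sit"
    print(character_error_rate(reference, ocr_output))
    print(word_error_rate(reference, ocr_output))
```

The same functions work on Arabic Unicode strings, although diacritics stored as separate code points would then count as characters of their own.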
Part 3: Evaluation & Future Work
Free collaborative Arabic OCR software
- Base: Tesseract + HMM
- Developers: university students
- Online community
Linux (Android, Ubuntu, Debian, Fedora):
- Since 2005: 12,000 developers from more than 1,200 companies [8]
Wikipedia:
- 35 million articles in 288 different languages [9]
NEMLAR (Network for Euro-Mediterranean Language Resources) project [10] (2003-2005)
- Partners: Egypt, Jordan, Lebanon, Morocco, Tunisia, West Bank & Gaza Strip, Denmark, France, Greece and the Netherlands
- BLARK (Basic Language Resource Kit) for Arabic
MEDAR project [10] (2008-2010)
- Machine translation
- Multilingual information retrieval for Arabic
- Supported by the European Commission's ICT programme
Limitations
- Lack of interest among other Arab countries in developing free software
- Programmers uninterested in participating in the project
Future approaches
- Crowdsourcing and database
- Text-to-speech
References
[1] Elaydi, H. and Shehada, H. (2007) A Source of Inspiration: ATC for Visually Impaired Students at the Islamic University of Gaza. ICTA, 7: 12-14.
[2] Asebriy, Z., Bencharef, O., Raghay, S. et al. (2014) Comparative systems of handwriting Arabic character recognition. In: Complex Systems (WCCS), 2014 Second World Conference on. IEEE, pp. 90-93.
[3] Sargur, N. S. Hidden Markov Models. [PowerPoint slides]. Presented at a CSE 574 lecture at Buffalo University.
[4] George, M. [no date]. Optical Character Recognition: Classification of Handwritten Digits and Computer Fonts.
[5] Huaigu, C. et al. (2014) Progress in the Raytheon BBN Arabic Offline Handwriting Recognition. International Conference on Frontiers in Handwriting Recognition.
[6] RWTH-OCR. (2007) Arabic Handwriting Recognition. [online] Available from https://www-i6.informatik.rwth-aachen.de/~dreuw/arabic.php.
[7] Ray, S. [no date]. The Tesseract open source ocr system. [online] Available from http://static.googleusercontent.com/…/pubs/archive/33418.pdf
[8] Corbet, J., Kroah-Hartman, G. and McPherson, A. (2015) The Linux Foundation Releases Linux Development Report. Available at: http://www.linuxfoundation.org/ (Accessed: 25 August 2015).
[9] Safer, M. (2015) Wikipedia cofounder Jimmy Wales on 60 Minutes. Available at: http://www.cbsnews.com/…/wikipedia-jimmy-wales-morley-safe…/ (Accessed: 29 August 2015).
[10] MEDAR, Speech and Language Technologies for Arabic (no date) Available at: http://www.medar.info/index.php (Accessed: 29 August 2015).
Thank you!
Q & A