An Analysis of Off-line and On-line Approaches in
Urdu Character Recognition
Naila Habib Khan1, Awais Adnan2 and Sadia Basar3
Institute of Management Sciences, Peshawar, Pakistan
Abstract: This article presents a detailed analysis of offline and online character recognition systems
for Urdu script published from 2002 to 2012. The analysis is based on the Methodology, Text Type, Font, Recognition Level, Sample and
Accuracy Level achieved by each Urdu script recognition system. The paper covers various aspects of offline
and online character recognition systems to give broad exposure to this research topic, with special emphasis on Urdu script.
Generally, character recognition is the capability of a computer system to comprehend printed or handwritten text from
sources such as documents, books, reports and photographs, or directly from digital touch screens. In an offline character recognition system,
an image of the text is captured by a scanner. When the text is captured in real time with a digital device, for example a touch screen or a
digital pen, the process is referred to as online character recognition.
Keywords: Online, Offline, OCR, Urdu.
1 Introduction
Urdu is spoken by 490 million people around the world.
It is the 4th largest language spoken and understood in the
world. It is the official language of Pakistan and five
Indian states. It is also widely spoken and understood in
countries like Afghanistan, United Arab Emirates, Saudi
Arabia, Bangladesh, United Kingdom, United States,
South Africa, Botswana, Bahrain, Canada, Germany, Fiji,
Guyana, Malawi, India, Nepal, Qatar, Mauritius, Oman,
Zambia, Norway and Thailand.
Urdu is often confused with Hindi. Urdu and Hindi
are closely related and share the same
background. The primary difference between them
is the written script. In Pakistan the language is written in Arabic
script and is called “Urdu”, whereas in India it is
written in Devanagari script and is called “Hindi”. In
India, Urdu is widely spoken and understood in different
cities namely Delhi, Muzaffarnagar, Najibabad, Rampur,
Roorkee, Bareilly, Meerut, Lucknow, Azamgarh, Bijnor,
Deoband, Saharanpur, Moradabad, Aligarh, Allahabad,
Gorakhpur, Agra, Bidar, Ajmer, Kanpur, Badaun,
Bhopal, Hyderabad, Aurangabad, Bengaluru, Kolkata,
Mysore, Patna, Gulbarga, Nanded, and Ahmedabad. India
also publishes 405 daily Urdu newspapers. In Bangladesh
Urdu is used as a language for communication but it’s
referred to as “Behari”.
Urdu developed under the strong influence of
Arabic, Persian and Turkish almost 900 years
ago. Urdu shares the same script as Arabic,
Persian, Turkish, Pashto and Kashmiri. Learning Urdu is
highly beneficial because it helps you read the Persian and
Arabic alphabets, since Urdu script is 90% similar to these
scripts [1]. Because of the significance of Urdu script, a
number of researchers have focused on Optical Character
Recognition systems that can convert ancient Urdu
literature to digital format. In this paper we
analyze various offline and online character
recognition systems for Urdu. The studies are
analyzed based on Methodology, Text Type, Font,
Recognition Level, Sample and Accuracy Level.
Methodology covers the major machine learning
techniques and algorithms used to develop a
recognition system. Handwritten and typewritten text are
the two major text types used with any character
recognition system. The font can be of any style, Nastaliq
being the most widely used font for Urdu script. Recognition
level indicates whether a segmentation-based or
segmentation-free approach has been used. Finally,
sample and accuracy level describe the dataset and the
overall success rate of the character recognition system,
respectively.
2 Types of Character Recognition Systems
Figure 1 below shows that character recognition can
basically be divided into two types i.e. Online and Offline
Character Recognition.
Figure 1 Character Recognition and Online Character Recognition
Computational Science and Systems Engineering
ISBN: 978-1-61804-362-7 280
2.1 Online Character Recognition System
Online character recognition refers to real-time
recognition of characters. In online systems, characters
are recognized as they are written. These systems rely on the
concept of digital ink: sensors track pen-tip
movements, for example pen-up and pen-down events. Online character
recognition relieves us of the task of locating the
position of each character. Such systems are available in
PDAs, handheld PCs and some of the latest
touch-screen mobile phones.
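As a rough illustration of digital ink, the sketch below groups a stream of pen samples into strokes using pen-down and pen-up events. The PenSample type, its field names and the sample data are assumptions made for this example; they do not come from any of the surveyed systems.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PenSample:
    """One reading from a hypothetical digitizer."""
    x: int
    y: int
    pen_down: bool  # True while the pen tip touches the surface


def split_into_strokes(samples: List[PenSample]) -> List[List[Tuple[int, int]]]:
    """Group consecutive pen-down samples into strokes.

    A stroke starts at a pen-down sample and ends at the next pen-up sample.
    """
    strokes: List[List[Tuple[int, int]]] = []
    current: List[Tuple[int, int]] = []
    for s in samples:
        if s.pen_down:
            current.append((s.x, s.y))
        elif current:
            strokes.append(current)
            current = []
    if current:  # pen was still down when the stream ended
        strokes.append(current)
    return strokes


samples = [
    PenSample(0, 0, True), PenSample(1, 1, True), PenSample(2, 1, True),
    PenSample(2, 1, False),  # pen lifted: the first stroke ends
    PenSample(1, 3, True),   # a single dot written as a second stroke
    PenSample(1, 3, False),
]
strokes = split_into_strokes(samples)
print(len(strokes))  # number of strokes found
```

Because the strokes arrive already separated and in writing order, an online recognizer can work on them directly instead of first searching an image for the text.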
2.2 Offline Character Recognition
Offline character recognition involves the automatic
conversion of handwritten or typewritten text from
scanned paper to letter codes that can be used inside
the computer. Offline character recognition is a more complex
process than online character recognition. In
offline character recognition, characters must first be
located and then extracted for recognition.
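The locate-then-extract step can be illustrated with connected-component labeling on a binarized image. This is a minimal sketch, assuming a tiny hand-made binary grid and 4-connectivity; it is not the segmentation method of any system surveyed here, and cursive Urdu text generally needs more than this.

```python
from collections import deque
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), inclusive


def find_components(image: List[List[int]]) -> List[Box]:
    """Return bounding boxes of 4-connected groups of ink pixels (value 1)."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes: List[Box] = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search over one connected component.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and image[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Hypothetical 4x6 binarized scan: 1 = ink, 0 = background.
image = [
    [0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 1],
    [0, 1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0],
]
boxes = find_components(image)
print(boxes)
```

Each bounding box marks a region that could then be cropped and passed to a classifier.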
3 Types of Text for Recognition
3.1 Printed Text Recognition
Printed text recognition refers to the recognition of
text that is computer generated. Printed text
can come in different fonts and sizes. The text can be in any
computer-generated font, for example Times New Roman,
Arial, Calibri or Courier. A printed character
recognition system is simpler than a handwritten
character recognition system.
3.2 Handwritten Text Recognition
Handwritten text recognition refers to the recognition of
any text that has been written by hand. Recognition of
handwritten text is more difficult than recognition of typewritten
text. Handwritten characters vary from person to person
and also with the mood of the writer.
Hence developing a character recognition system for
handwritten text is considered a difficult task.
4 Comparison of Off-line Character
Recognition System
For offline character recognition systems, a detailed study
has been conducted covering the years 2002 to 2012. A total of 14
research papers have been taken into consideration. For
each research paper the attributes considered are year of
publication, title, type of text (handwritten/printed), font
used, methodology applied, level of recognition
(character/word), sample data taken and the overall
accuracy of the developed system. All the data is
organized in the form of a table where the columns represent
the attributes mentioned above. Each row of the table
represents an individual research paper; the rows are
sorted in ascending order by year (see
Table 1).
4.1 Analysis of Recognition Levels Used for
Offline Character Recognition Systems
Because Urdu is a cursive script, character-level
recognition is a challenging task. Character-level recognition
systems require an intensive segmentation procedure.
Segmentation produces numerous errors and may disfigure the
shape of a character. Husain [2] used
ligature-based recognition with a segmentation-free
approach. Pal and Sarkar [3] implemented a segmentation-based
approach because their recognition system aimed at
recognizing individual characters. This
approach led to several segmentation
errors, which contributed 0.7% to the total error rate
of the proposed recognition system. The systems in [4-7]
recognize only isolated Urdu characters. Ahmad, et al. [8] used
segmentation-based recognition; each word was first
stretched horizontally so that segmentation errors
could be minimized. Javed, et al. [9] proposed a
segmentation-free approach for recognizing Urdu Nastaliq script,
which is highly cursive and overlapping.
Expressed as percentages of the offline character
recognition systems surveyed, 57% of researchers opted to work with
recognition at the character level and 43% opted for word-level
recognition (see Figure 2).
Figure 2 Analysis of Recognition Levels of Data Used for Offline Character Recognition Systems
Figure 2 data: Character 57%, Word 43%.
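The character/word split can be reproduced from the Recognition Level column of Table 1. The short sketch below tallies the 14 rows and rounds to whole percentages; the helper name percent_shares is introduced only for this illustration.

```python
from collections import Counter

# Recognition level of each offline system, in the row order of Table 1
# ("Ch" = character, "W" = word).
levels = ["W", "W", "Ch", "W", "Ch", "Ch", "Ch",
          "Ch", "Ch", "W", "W", "W", "Ch", "Ch"]


def percent_shares(labels):
    """Map each label to its share of the list, rounded to a whole percent."""
    counts = Counter(labels)
    total = len(labels)
    return {label: round(100 * n / total) for label, n in counts.items()}


shares = percent_shares(levels)
print(shares)
```

With 8 character-level and 6 word-level systems out of 14, the shares come to 57% and 43%.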
Table 1 Analysis of Offline Character Recognition Systems
Year | Title | P/H* | Font | Methodology Applied | Recognition Level** | Test Data | Accuracy
2002 | A Multi-Tier Holistic Approach For Urdu Nastaliq Recognition [2] | P | NNQ | Holistic Approach; Feed Forward Back Propagation Neural Network | W | 200 | 100%
2002 | Ligature Based Optical Character Recognition Of Urdu-Nastaleeq Font [10] | P | NQ | Template Matching | W | Type Written Nastaliq Script | Reasonable
2003 | Recognition Of Printed Urdu Script [3] | P | NK & NQ | Water Reservoir Principle | Ch | 3050 | 97.8%
2005 | English, Devnagari And Urdu Text Identification [11] | P | AF | Water Reservoir Principle; Binary Tree Classifier | W | 3210 | 98.09%
2007 | Urdu Nastaleeq Optical Character Recognition [8] | P | NQ | Neural Network | Ch | Old And New Written Scripts | 93.4%
2007 | OCR For Printed Urdu Script Using Feed Forward Neural Network [4] | P | AR | Feed Forward Neural Network | Ch | 72pt Font Size | 98.03%
2009 | Urdu Compound Character Recognition Using Feed Forward Neural Networks [12] | P | AR | Segmentation Phase Based On Pixels Strength; Feed Forward Neural Network | Ch | 56 Classes Of Characters, Each Having 100 Samples | 70%
2009 | Optical Character Recognition System For Urdu (Naskh Font) Using Pattern Matching Technique [5] | P | NK | Pattern Matching Technique | Ch | Urdu Text Having Different Font Sizes | 89%
2009 | A Finite State Model For Urdu Nastalique Optical Character Recognition [13] | P | NQ | Finite State OCR Modeled Using Finite Automata | Ch | Nastaliq Having Same Font Size | Encouraging
2010 | Segmentation Free Nastalique Urdu OCR [9] | P | NNQ | Global Transformational Features; Hidden Markov Model | W | 3655 | 92%
2010 | Font Size Independent OCR For Noori Nastaleeq [14] | P | NNQ | Font Size Normalization; X-Height Calculation; Outline Capture; Chain Code Algorithm | W | Wide Variety of Font Sizes | 94% - 97% (Font Sizes 24, 28, 32); 93% - 97% (Font Sizes 40, 44, 48, 52)
2012 | An Efficient Method For Urdu Language Text Search In Image Based Urdu Text [15] | H | -- | Template Matching Technique; Correlation Algorithm | W | 2, 3, 4 and 5 Character Ligatures | 100%, 87% and 78% for 5-4, 3 and 2 Char Ligature
2012 | Recognition Of Segmented Arabic/Urdu Characters Using Pixel Values As Their Features [16] | P | -- | Pixel Value Feature Vectors; Neural Network | Ch | 30 Mixed Arabic/Urdu Alphabets Used For Making 53 Classes | 95%
2012 | Recognition Of Offline Handwritten Isolated Urdu Character [7] | H | -- | Moment Invariant Technique; Primary And Secondary Component Separation; SVM For Classification | Ch | 36800 | 93.59%
* : P: Printed; H: Handwritten
AF: Any font; AR: Arial; NK: Naskh; NQ: Nastaliq; NNQ: Noori Nastaliq; -- : Not Specified
** : Ch: Character; W: Word
4.2 Analysis of Handwritten and Printed Text
Utilization for Offline Character
Recognition Systems
Pathan, et al. [7] state that an inadequate amount of research
work has been directed towards Urdu handwritten
character recognition. Handwritten text can be used with
both online and offline character recognition systems.
When handwritten text is
used with an offline recognition system, the system is usually referred
to as an offline handwriting recognition system. Such a
system converts an image of
handwritten text into codes that are understood by the
computer and by text-processing applications.
Offline handwriting recognition involves scanning a
handwritten document or form that an individual wrote in
the past.
The main issue with handwritten text is that it differs
from one individual to another. Handwritten text is
affected not only by mood but also by the material on or with
which it is written. Different types of pen may have
different writing tips. For example, a pen with a smaller tip
produces thin handwritten characters, while a pen with a
larger tip produces thicker
characters. Fountain pens, ballpoint pens and marker pens
may also affect an individual's handwriting because
their inks and tips differ.
Printed text, on the other hand, is much simpler to
handle. We only have to deal with different fonts and sizes
of text. This simplicity is the primary reason that most
researchers have opted to work with printed text for
Urdu character recognition systems.
Figure 3 Analysis of Handwritten and Printed Text Utilization for
Offline Character Recognition System
The surveyed papers show that 86% of the research has been
performed on printed text and 7% on handwritten text, while
the remaining 7% covers both (see Figure 3). Khan, et al. [15]
and Pathan, et al. [7] used handwritten text for
recognition. Because of complexity and
segmentation issues, handwritten Urdu character
recognition is lagging. Urdu needs a robust character
recognition system that is capable of converting both
handwritten and printed text into computer-recognizable
form.
4.3 Analysis of Font Types Used For Printed
Text in Offline Character Recognition
Systems
There are several calligraphic styles for writing Arabic
script, among them Naskh, Nastaliq, Kufi, Deevani, Sulus and Riqah.
Naskh and Nastaliq are the most
widely used writing styles for Urdu script. Both
are written from right to left. The Nastaliq
writing style for Urdu is a highly cursive, diagonal, context-sensitive
and non-monotonic writing system [17].
Nastaliq is basically a fusion of the Naskh and Taliq writing
styles; it is a beautiful and artistic style of writing
Urdu script. Because of the complexities associated with
Nastaliq writing, developing an efficient character
recognition system for Urdu is a highly challenging task.
Figure 4 Analysis of Font Types Used For Printed Text in Offline
Character Recognition Systems
Summing up, Nastaliq (22%), Noori Nastaliq (21%) and
systems using both Nastaliq and Naskh (7%) together account for 50%
of the surveyed systems. This result indicates that Nastaliq and its
variations, such as Noori Nastaliq, are the most popular choice
in printed Urdu offline character recognition systems (see
Figure 4). A further 29% of the systems either did not declare the
font used or are completely font independent.
Figure 3 data: Handwritten 7%, Printed 86%, Both (Handwritten and Printed) 7%.
Figure 4 data: Nastaliq 22%, Noori Nastaliq 21%, Naskh 7%, Both (Nastaliq and Naskh) 7%, Arial 14%, Other 29%.
Table 2 Comparison of Online Character Recognition Systems
Year | Title | Methodology Applied | Recognition Level** | Test Data | Accuracy
2005 | Urdu Online Handwriting Recognition [18] | Analytical Approach; Slant Analysis; Tree Based Dictionary Searching Method | Ch | 39 Urdu Characters; 10 Numerals; 200 Two Character Words | 93% (Isolated Characters); 93% (Numerals); 78% (Two Character Words)
2007 | Online Urdu Character Recognition System [19] | Feature Vector Extraction; Back Propagation Neural Network | W | 240 Ligatures with Combination of 6 Diacritics | Base Ligatures 93%; Secondary Strokes 98%
2009 | Urdu Qaeda: Recognition System For Isolated Urdu Characters [6] | Feature Extraction; Linear Classifier | Ch | Four Samples of Each Character Were Taken From Two Participants | 92.8% for Fluent Urdu Users; 31% for Non-Native User
2010 | HMM and fuzzy logic: A hybrid approach for online Urdu script-based languages character recognition [20] | Hybrid Classifier; HMM and Fuzzy Logic | W | 1800 Ligatures | 87.6% for Nastaliq; 74.1% for Naskh
2012 | Fuzzy Based Preprocessing Using Fusion Of Online And Offline Trait For Online Urdu Script Based Languages Character Recognition [21] | Fuzzy Logic Based Preprocessing; Primary Baseline Extraction; Local Baseline Extraction | W | 1800 Ligatures for Nasta'liq Script; 1000 for Naskh Style | 74.3% for Nasta'liq; 60.7% for Naskh
** : Ch: Character; W: Word
5 Comparison of Online Character
Recognition Systems
For online character recognition systems, a total of 5 research
papers from 2005 to 2012 have been taken into
account. The data has been organized in the form of a table
for better analysis and understanding (see Table 2). For each
research paper the following attributes are considered: year of
publication, title, methodology, level of recognition
(character/word), sample data and the accuracy of the
developed system. Table 2 has fewer columns
than Table 1. Online character recognition systems
deal only with handwritten text, so the Text Type
(handwritten/printed) column has been omitted. The
type of font is also rarely of concern in online
character recognition, since Urdu is mostly written
in the Nastaliq calligraphic style.
5.1 Analysis of Recognition Levels Used for
Online Character Recognition Systems
Word-level recognition accounts for a larger share of the
online Urdu recognition systems than character-level
recognition. Malik and Khan [18] and Shahzad, et al. [6]
found it easier to work with recognition at the character level,
while [19] and [20, 21] opted to work with combinations
of ligatures and diacritics instead of individual characters.
The analysis of online Urdu recognition systems shows
that 25% of the research has been directed towards
character-level recognition and 75% towards
word/ligature-level recognition (see Figure 5).
Figure 5 Analysis of Recognition Levels Used For Online Character Recognition Systems
Figure 5 data: Character 25%, Word 75%.
6 Results and Discussion
In this article, a comparative analysis has been
carried out to determine how much research work has been done on
online and offline character recognition systems. The
results clearly show that more research work has
been performed on offline character recognition systems.
The primary reason is the real-time complexity
associated with online character recognition systems.
In addition, Urdu is a multi-stroke language, which creates
complexities and issues in online recognition systems.
Of all the research papers listed in this article,
26% address
online character recognition systems and 74%
address offline character recognition
systems (see Figure 6).
Figure 6 Comparative Analysis of Online and Offline Character
Recognition Systems
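The online/offline split follows from the number of rows in the two tables. The sketch below works through the arithmetic, assuming each row of Table 1 and Table 2 counts as one paper; the variable names are chosen for this illustration only.

```python
offline_papers = 14  # rows in Table 1
online_papers = 5    # rows in Table 2
total = offline_papers + online_papers

# Shares rounded to whole percentages.
offline_share = round(100 * offline_papers / total)
online_share = round(100 * online_papers / total)
print(offline_share, online_share)
```

Out of 19 papers in total, 14 offline papers round to 74% and 5 online papers round to 26%.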
7 Conclusion
Only 26% of the research has been directed towards online
Urdu character recognition systems, while a greater 74%
has focused on offline character recognition
systems (see Figure 6).
For Urdu offline character recognition systems,
57% of researchers opted to work with recognition at the
character level, while 43% opted for ligature
or word-level recognition (see Figure 2). Word-level and
character-level recognition have both been widely and
almost equally explored.
For handwritten and printed text the results are
7% and 86% respectively (see Figure 3). A reasonable
inference is that the complexity of
handwritten text has held researchers back.
Handwritten text is complex because it depends heavily
on the mood of the writer, the type of pen and
the writing surface.
Nastaliq and its variations are the favorite font among
researchers, though a few authors have opted to
work with Arial and Naskh. 29% of researchers
developed font-independent systems or did not specify
the font used (see Figure 4).
For Urdu online character recognition systems,
75% of the work has been aimed at word-level recognition
and 25% at character-level recognition (see Figure 5).
8 References
[1] A. Bharath and S. Madhvanath, "Online handwriting
recognition for Indic scripts," in Guide to OCR for
Indic Scripts, ed: Springer, 2010, pp. 209-234.
[2] S. A. Husain, "A Multi-tier Holistic approach for Urdu
Nastaliq Recognition," Multi Topic
Conference, Abstracts 2002, p. 84, 2002.
[3] U. Pal and A. Sarkar, "Recognition of Printed Urdu
Script," presented at the Proceedings of the Seventh
International Conference on Document Analysis and
Recognition - Volume 2, 2003.
[4] I. Shamsher, Z. Ahmad, J. K. Orakzai, and A. Adnan,
"OCR For Printed Urdu Script Using Feed Forward
Neural Network," World Academy of Science,
Engineering and Technology, 2007.
[5] T. Nawaz, S. A. H. S. Naqvi, H. ur Rehman, and A.
Faiz, "Optical character recognition system for urdu
(naskh font) using pattern matching technique,"
International Journal of Image Processing (IJIP), vol.
3, p. 92, 2009.
[6] N. Shahzad, B. Paulson, and T. Hammond, "Urdu
Qaeda: Recognition System for Isolated Urdu
Characters," in IUI 2009 Workshop on Sketch
Recognition, Sanibel Island, Florida, 2009.
[7] I. K. Pathan, A. A. Ali, and R. R.J., "Recognition of
Offline Handwritten Isolated Urdu Character,"
Advances In Computational Research, vol. 4, pp. 117-
121, 2012.
[8] Z. Ahmad, J. K. Orakzai, I. Shamsher, and A. Adnan,
"Urdu Nastaleeq Optical Character Recognition,"
World Academy Of Science, Engineering And
Technology, pp. 249-252, 2007.
[9] S. T. Javed, S. Hussain, A. Maqbool, S. Asloob, S.
Jamil, and H. Moin, "Segmentation free nastalique
urdu ocr," World Academy of Science, Engineering
and Technology, vol. 46, pp. 456-461, 2010.
[10] Z. A. Shah, "Ligature Based Optical Character
Recognition of Urdu- Nastaleeq Font," INMIC, 2002.
[11] S. Chanda and U. Pal, "English, Devnagari and Urdu
Text Identification," in Proceedings of the
International Conference on Cognition and
Recognition, 2005, pp. 538-546.
[12] Z. Ahmad, J. K. Orakzai, and I. Shamsher, "Urdu
Compound Character Recognition Using Feed
Forward Neural Networks," in ICCSIT, 2009, pp. 457-462.
[13] S. A. S. S.-u. Haque and M. K. Pathan, "A finite state
model for urdu nastalique optical character
recognition," IJCSNS, vol. 9, p. 116, 2009.
[14] Q. u. A. Akram, S. Hussain, and Z. Habib, "Font Size
Independent OCR for Noori Nastaleeq," in
Proceedings of Graduate Colloquium on Computer
Sciences (GCCS), NUCES Lahore, 2010.
[15] K. Khan, M. Siddique, M. Aamir, and R. Khan, "An
Efficient Method for Urdu Language Text Search in
Image Based Urdu Text," IJCSI International Journal
of Computer Science Issues, vol. 9, March 2012.
[16] S. Zaman, W. Slany, and F. Sahito, "Recognition of
Segmented Arabic/Urdu Characters Using Pixel
Values as their Features," ICCIT, 2012.
[17] S. A. Sattar, S. Haque, M. K. Pathan, and Q. Gee,
"Implementation Challenges for Nastaliq Character
Recognition," Communications in Computer and
Information Science, vol. 20, pp. 279-285, 2009.
[18] S. Malik and S. A. Khan, "Urdu Online Handwriting
Recognition," Emerging Technologies, Proceedings
of the IEEE Symposium, vol. 17, 2005.
[19] S. A. Husain, A. Sajjad, and F. Anwar, "Online Urdu
Character Recognition System," in MVA, 2007, pp.
98-101.
[20] M. I. Razzak, F. Anwar, S. A. Husain, A. Belaid, and
M. Sher, "HMM and fuzzy logic: A hybrid approach
for online Urdu script-based languages' character
recognition," Know.-Based Syst., vol. 23, pp. 914-923,
2010.
[21] M. I. Razzak, S. A. Husain, A. A. Mirza, and A.
Belaid, "Fuzzy Based Preprocessing Using fusion Of
Online And Offline Trait For Online Urdu Script
Based Languages Character Recognition,"
International Journal Of Innovative
Computing, Information And Control, vol. 8, pp.
3149–3161, 2012.