Chapter 4. Offline handwritten Urdu Dataset preparation
Like other research domains, the basic requirement for the development of any
character recognition system is the availability of a standard dataset. For a
character recognition system, the standard dataset consists of images of the basic
characters, punctuation and diacritic marks, and numerals of the language script. A
standard dataset helps researchers train and test the systems they develop. It
also lets them compare the results of proposed research and algorithms with
methodologies already implemented by other researchers and demonstrate the success
achieved in a particular domain. A standard dataset serves as a common testing
environment in which all researchers can test their results and compare them with
those of others. When no standard dataset is available, researchers must put extra
effort into dataset preparation. If a standard dataset is already available,
researchers can instead concentrate fully on developing new algorithms and
methodologies. A character recognition system can recognize a scanned character
accurately if it has been well trained on a robust character dataset. If the dataset
used for training contains all possible character variations, then the system can be trained
thoroughly. A handwritten character recognition system in particular requires a
complete dataset that covers the possible handwriting variations.
Urdu is one of the important languages of the Indian subcontinent and is spoken
by many people around the world. Over the past few years, researchers have shown
increasing interest in Urdu document image analysis and character recognition. To
the best of our knowledge and according to the reviewed literature, no standard
database of handwritten Urdu characters is publicly available. In particular, no
standard dataset of handwritten Urdu characters covering all four possible shapes of
each character has been prepared. Numerous standard handwriting datasets, such as
CEDAR, Al-ISRA, AHDB, CENPARMI and IFN/ENIT [1, 2], are available. Most of these
databases contain only Arabic words and digits. The CEDAR database, released in 1994,
includes images of approximately 5,000 city names, 5,000 state names, 10,000 ZIP codes
and 50,000 alphanumeric characters. The Al-ISRA database of the University of British
Columbia in Canada contains 37,000 Arabic words, 10,000 digits, 2,500 signatures and
500 free-form Arabic sentences. Alma’adeed et al. introduced AHDB, a database that
contains Arabic words, including the words used for numbers in banking. The 2003
database of the Centre for Pattern Recognition and Machine Intelligence (CENPARMI)
contains images of 29,498 samples of subwords and 15,175 samples of Indian digits,
together with legal and courtesy amount databases of 2,499 samples each. The IFN/ENIT
dataset, created by the Institute of Communications Technology (IFN), consists of
26,459 images of the names of 937 cities and towns in Tunisia.
Although the Urdu script is derived from the Arabic script, character datasets of
the Arabic language are only partially suitable for Urdu. These datasets cannot
fulfil the complete requirements of training and testing for robust Urdu character
recognition. A few constrained collections of Urdu handwriting can be found in the
literature [3]. These datasets, however, comprise a limited vocabulary and do not
capture the semantic and syntactic variations of the script, so conclusive
experiments cannot be performed [4]. In 2009, Malik W. Sagheer et al. [3] presented a
new large Urdu handwriting database designed at CENPARMI (Centre for Pattern
Recognition and Machine Intelligence). It includes isolated digits, numeral strings
with and without decimal points, five special symbols, 44 isolated characters, 57
Urdu words and Urdu dates in different patterns. The authors also claimed that it was
the first database for Urdu off-line handwriting recognition.
Mukhtar, Setlur and Govindaraju [5] stated in 2009 that "To the best of our
knowledge, no handwritten Urdu data set exists". During the last couple of years a
few small datasets have been introduced, which consist only of printed and
handwritten sentences and words and of printed and handwritten isolated characters,
and most of them are not publicly available. Moreover, no dataset provides all the
possible shapes of the characters, as mentioned above. That is why our first
requirement was to prepare a dataset of handwritten Urdu characters that covers all
four possible shapes of every character. With this objective, a dataset of 107,325
handwritten Urdu characters has been developed, which consists of isolated Urdu
characters along with all the possible shapes of these characters. A dataset of
15,900 handwritten Urdu numerals has also been prepared.
4.1 Database Designing
A total of 159 writers from different age groups and with different qualifications
were selected for data collection. A seven-page tabular data entry form was designed
for the collection of handwritten Urdu characters and digits. It includes not only
isolated Urdu characters but also all the character shapes that vary according to
their position in a word, i.e. initial, middle and end, as shown in Figure 4.1. In
all, 135 character shapes (shown in Figure 4.2) were selected for data collection.
Each page of the data collection form contains 22 characters; for each character,
five blank boxes were provided in which the writer could write that character. In
this way, each writer was asked to give 5 handwritten samples of each of the 135
characters. The writers were asked to write without touching the boundary of the
given box, but some of them exceeded the box and overlapped its boundary line.
The writers were allowed to write with either a blue or a black ink or ballpoint pen.
Figure 4.1 Four position based shapes of Urdu Character
The first page contains fields for form number, name of writer, age,
qualification, gender, occupation and comment; a sample is shown in Figure 4.3.
This information can be useful in future for filtering the data according to a
specific group of people for research purposes. We did not record writer
characteristics such as handedness (right-handed or left-handed). For data collection
and handwriting sampling, school students, teachers, various office employees,
journalists and katibs were selected. An attempt was made to cover all the possible
handwriting variations from different groups. In this way, 135 x 5 = 675 written
character samples were taken from each writer. In total, a dataset of 107,325
handwritten Urdu characters was prepared with the help of 159 writers.
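The filtering by writer group mentioned above can be illustrated with a minimal sketch. The field names follow the first page of the form, but the class `WriterForm`, the function `filter_forms` and the three example records are hypothetical and are not part of the original dataset or its tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriterForm:
    """Fields recorded on the first page of the collection form."""
    form_number: int
    name: str
    age: int
    qualification: str
    gender: str
    occupation: str
    comment: str = ""


def filter_forms(forms, occupation=None, min_age=None, max_age=None):
    """Return the forms that satisfy every criterion that is not None."""
    selected = []
    for form in forms:
        if occupation is not None and form.occupation != occupation:
            continue
        if min_age is not None and form.age < min_age:
            continue
        if max_age is not None and form.age > max_age:
            continue
        selected.append(form)
    return selected


# Made-up records for illustration only; they are not taken from the dataset.
example_forms = [
    WriterForm(1, "Writer A", 14, "Secondary school", "F", "student"),
    WriterForm(2, "Writer B", 35, "Graduate", "M", "teacher"),
    WriterForm(3, "Writer C", 52, "Graduate", "M", "katib"),
]

students = filter_forms(example_forms, occupation="student")
adults = filter_forms(example_forms, min_age=18)
```

A subset selected in this way could then be used to pick out the character samples of one group of writers, for example to study how handwriting varies with age or occupation.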
One page of the data collection form was designed for the collection of
handwritten Urdu digits, as shown in Figure 4.4. For the handwritten Urdu digit
database, each of the 159 writers was asked to give 10 samples of each digit from
zero to nine, which yielded a database of 15,900 handwritten Urdu digits. A dataset
of 159 handwritten paragraphs and document pages was also prepared for testing
various paragraph, line, word and character segmentation methods. A sample
handwritten Urdu paragraph is shown in Figure 4.5. A blank page along with a printed
paragraph was given to each of the 159 writers, who were requested to copy the
paragraph onto the blank page. These written documents were scanned at 300 dpi.
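The sample counts quoted in this section follow directly from the collection design. The short check below reproduces them; the constant names are chosen here for illustration only.

```python
# Collection design as described in Section 4.1.
WRITERS = 159
CHARACTER_SHAPES = 135
SAMPLES_PER_SHAPE = 5
DIGITS = 10             # zero to nine
SAMPLES_PER_DIGIT = 10

# Character samples contributed by one writer, then by all writers.
characters_per_writer = CHARACTER_SHAPES * SAMPLES_PER_SHAPE
total_characters = characters_per_writer * WRITERS

# Digit samples contributed by one writer, then by all writers.
digits_per_writer = DIGITS * SAMPLES_PER_DIGIT
total_digits = digits_per_writer * WRITERS

print(characters_per_writer, total_characters, total_digits)
```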
After samples of all the character shapes had been collected, the data collection
pages were optically scanned at a resolution of 300 dpi. The input file for the
system is a color JPG file, which was later converted into a grayscale file. This
format was preferred because grayscale image files preserve more pixel detail of the
Figure 4.2 Total 135 character shapes of Urdu Script
scanned image than binary files, and provide more information about the
scanned character. In a binary image file, noise may be introduced and features such
as loops may be distorted. As discussed earlier, in Urdu the loss of even a small dot
may change the meaning of a character. The scanned pages were very clean, and no
specific noise was observed in the scanned images. Even so, to produce a quality
dataset, basic noise removal methods (discussed in the preprocessing section) were
applied at page level before each character was cropped and stored in the character
database. After analyzing the dataset, some broken and overwritten characters were
found. To normalize these characters, various morphological filtering methods were
applied to the dataset images, and a noise-free dataset was developed.
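The chapter does not state which grayscale conversion or which morphological filters were used, so the following is only a minimal sketch under stated assumptions: the common luma weights (0.299, 0.587, 0.114) for the color-to-grayscale step, and a 3 x 3 morphological closing (dilation followed by erosion) as one plausible way to reconnect a broken stroke. The binarization step and its threshold of 128 are used here only so that the closing can be shown on a small example; all function names are hypothetical.

```python
def to_grayscale(rgb_rows):
    """Convert rows of (R, G, B) tuples to rows of 0-255 gray levels."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in rgb_rows
    ]


def binarize(gray_rows, threshold=128):
    """Mark ink pixels (darker than the threshold) as 1, background as 0."""
    return [[1 if value < threshold else 0 for value in row] for row in gray_rows]


def _window(image, y, x):
    """Values in the 3 x 3 window centred on (y, x).

    Pixels outside the image are treated as background (0).
    """
    height, width = len(image), len(image[0])
    return [
        image[j][i] if 0 <= j < height and 0 <= i < width else 0
        for j in (y - 1, y, y + 1)
        for i in (x - 1, x, x + 1)
    ]


def _apply(image, reducer):
    """Replace every pixel by reducer(window) over its 3 x 3 neighbourhood."""
    return [
        [reducer(_window(image, y, x)) for x in range(len(image[0]))]
        for y in range(len(image))
    ]


def dilate(image):
    """A pixel becomes ink if any pixel in its 3 x 3 window is ink."""
    return _apply(image, max)


def erode(image):
    """A pixel stays ink only if its whole 3 x 3 window is ink."""
    return _apply(image, min)


def close_gaps(image):
    """Morphological closing: dilation followed by erosion."""
    return erode(dilate(image))


# A horizontal stroke with a one-pixel break in the middle (illustrative data).
broken_stroke = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
repaired = close_gaps(broken_stroke)
```

Because pixels outside the image are treated as background, this simple version would erode strokes that touch the image border. A closing with a structuring element this small bridges one-pixel breaks, but it could also merge a dot with a nearby stroke, so in practice the size of the structuring element would need to be chosen with the diacritic marks of the script in mind.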
4.2 Handwritten Urdu Dataset collection formats
Figure 4.3 Scanned Form: Handwritten Urdu character Dataset collection
Figure 4.4 Scanned Form: Handwritten Urdu Digit Dataset collection
Figure 4.5 Sample Handwritten Urdu paragraph
References:
1. IFN/ENIT - Database of Arabic Handwritten Words, Institute of Communications
Technology, Technical University Braunschweig, Germany. Available at
http://ifnenit.com/
2. Saeed Mozaffari, Karim Faez, Farhad Faradji, Majid Ziaratban, S. Mohamad
Golzan, "A comprehensive isolated Farsi/Arabic character database for
handwritten OCR research", Tenth International Workshop on Frontiers in
Handwriting Recognition (2006), Amirkabir University of Technology, Tehran.
Available at http://hal.inria.fr/docs/00/11/26/76/PDF/cr1096180506660.pdf
3. Malik W. Sagheer, C. L. He, N. Nobile and C. Y. Suen, "A New Large Urdu
Database for Off-Line Handwriting Recognition", Proceedings of ICIAP
(International Conference on Image Analysis and Processing), pp. 538-546,
Salerno, Italy, Sept. 2009.
4. Ahsen Raza, Imran Siddiqi, Ali Abidi, Fahim Arif, "An Unconstrained Benchmark
Urdu Handwritten Sentence Database with Automatic Line Segmentation", Proc.
of 2012 International Conference on Frontiers in Handwriting Recognition,
pp. 489-494.
5. Omar Mukhtar, Srirangaraj Setlur and Venu Govindaraju, "Experiments on Urdu
Text Recognition", in V. Govindaraju, S. Setlur (eds.), Guide to OCR for Indic
Scripts, Advances in Pattern Recognition, Springer-Verlag London Limited, 2009,
pp. 163-171. DOI 10.1007/978-1-84800-330-9_8.