
Chapter 4. Offline handwritten Urdu Dataset preparation

Like other research domains, the basic requirement for developing any character recognition system is the availability of a standard dataset. For a character recognition system, the standard dataset consists of images of the basic characters, punctuation and diacritic marks, and numerals of the script. A standard dataset helps researchers train and test the systems they develop. It also lets them compare the results of proposed algorithms with methodologies already implemented by other researchers and demonstrate the progress achieved in a particular domain. A standard dataset thus serves as a common testing environment in which all researchers can test and compare their results. When no standard dataset is available, researchers must put extra effort into dataset preparation; when one is available, they can concentrate fully on developing new algorithms and methodologies. A character recognition system is far more likely to recognize scanned characters accurately if it has been trained on a robust character dataset. The more completely the training dataset covers the possible character variations, the better the system can be trained. Handwritten character recognition in particular requires a dataset that covers the possible handwriting variations.

Urdu is one of the important languages of the Indian subcontinent and is spoken by many people around the world. Over the past few years, researchers have shown increasing interest in Urdu document image analysis and character recognition. To the best of our knowledge and according to the reviewed literature, no standard database of handwritten Urdu characters is publicly available. In particular, no standard dataset of handwritten Urdu characters covering all four possible shapes of each character has been prepared. Numerous standard handwriting datasets such as CEDAR, Al-ISRA, AHDB, CENPARMI and IFN/ENIT [1, 2] are available, most of which contain only Arabic words and digits. The CEDAR database, released in 1994, includes images of approximately 5,000 city names, 5,000 state names, 10,000 ZIP codes and 50,000 alphanumeric characters. The Al-ISRA database of the University of British Columbia in Canada contains 37,000 Arabic words, 10,000 digits, 2,500 signatures and 500 free-form Arabic sentences. Alma’adeed et al. offered the AHDB, a database that contains the words used to write numbers in banking along with other Arabic words. In 2003, the Centre for Pattern Recognition and Machine Intelligence (CENPARMI) released a database containing 29,498 samples of subwords, 15,175 samples of Indian digits, and 2,499 samples each of legal amounts and courtesy amounts. The IFN/ENIT dataset, created by the Institute of Communications Technology (IFN), consists of 26,459 images of the names of 937 cities and towns in Tunisia.

Although Urdu script is derived from Arabic script, Arabic character datasets are only partially suitable for Urdu and cannot fulfil the complete training and testing requirements of a robust Urdu character recognition system. A few constrained collections of handwritten Urdu can be found in the literature [3]. These datasets, however, comprise a limited vocabulary and do not capture the semantic and syntactic variations of the script, so conclusive experiments cannot be performed [4]. In 2009, Sagheer et al. [3] presented a new large Urdu handwriting database, designed at CENPARMI, which includes isolated digits, numeral strings with and without decimal points, five special symbols, 44 isolated characters, 57 Urdu words and Urdu dates in different patterns. The authors also claimed that it is the first database for Urdu off-line handwriting recognition.



Mukhtar et al. [5] stated in 2009 that ‘To the best of our knowledge, no handwritten Urdu data set exists’. During the last couple of years, a few small datasets have been introduced that consist only of printed and handwritten sentences and words or printed and handwritten isolated characters, and most of them are not publicly available. Moreover, no dataset provides all possible shapes of the characters, as mentioned above. Our first requirement was therefore to prepare a dataset of handwritten Urdu characters that covers all four possible shapes of every character. With this objective, a dataset of 107,325 handwritten Urdu characters has been developed, consisting of isolated Urdu characters along with all possible shapes of these characters. A dataset of 15,900 handwritten Urdu numerals has also been prepared.

4.1 Database Design

A total of 159 writers from different age groups and with different qualifications were selected for data collection. A seven-page tabular data entry form was designed for collecting handwritten Urdu characters and digits. It includes not only isolated Urdu characters but also all the character shapes that vary according to position in a word, i.e. initial, middle and end, as shown in Figure 4.1. Overall, 135 character shapes (shown in Figure 4.2) were selected for data collection. Each page of the form lists 22 characters, and for each character five blank boxes were provided in which the writer could write that character. In this way, each writer was asked to give 5 handwritten samples of each of the 135 characters. The writers were asked to write without touching the boundary of the box, but some of them exceeded the box and overlapped its boundary line. The writers were allowed to use either blue or black ink or ballpoint pens.

Figure 4.1 Four position based shapes of Urdu Character
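The four position-based shapes can be illustrated with the Unicode Arabic Presentation Forms-B block, which encodes each contextual shape of a letter as a separate code point. The sketch below is only an illustration and is not part of the dataset tooling: it uses the letter BEH (shared by Arabic and Urdu) as an assumed example, and the helper name `describe_forms` is hypothetical. Unicode calls the shapes isolated, initial, medial and final, which correspond to the isolated, initial, middle and end shapes described above.

```python
import unicodedata

# Presentation-form code points for the letter BEH (base letter U+0628).
# BEH is chosen here only as an illustrative example.
BEH_FORMS = {
    "isolated": "\uFE8F",
    "final": "\uFE90",
    "initial": "\uFE91",
    "medial": "\uFE92",
}


def describe_forms(forms):
    """Return {position: (Unicode name, base code point)} for each shape."""
    described = {}
    for position, char in forms.items():
        # The decomposition string looks like "<isolated> 0628":
        # a shape tag followed by the base letter's code point.
        _tag, base = unicodedata.decomposition(char).split()
        described[position] = (unicodedata.name(char), base)
    return described


if __name__ == "__main__":
    for position, (name, base) in describe_forms(BEH_FORMS).items():
        print(f"{position:9s} {name} (base U+{base})")
```

Not every letter joins on both sides, so some letters have fewer than four distinct shapes, which is consistent with the total of 135 shapes not being a multiple of four.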



The first page contained fields such as form number, name of writer, age, qualification, gender, occupation and comment; a sample is shown in Figure 4.3. This information can be used in future to filter the data by specific groups of people for research purposes. We did not record writer characteristics such as right- or left-handedness. School students, teachers, office employees, journalists and katibs (calligraphers) were selected for data collection and handwriting sampling, in an attempt to cover handwriting variations across different groups. In this way, 135 × 5 = 675 written character samples were taken from each writer. In total, a dataset of 107,325 handwritten Urdu characters was prepared with the help of 159 writers.
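The kind of group-wise filtering that the first-page fields make possible can be sketched as follows. This is a minimal illustration, not the storage format actually used for the dataset: the `WriterRecord` class, the `filter_writers` helper and the sample values are all hypothetical, and only the field names are taken from the form described above.

```python
from dataclasses import dataclass


@dataclass
class WriterRecord:
    """One writer's first-page metadata (field names follow the form)."""
    form_number: int
    name: str
    age: int
    qualification: str
    gender: str
    occupation: str
    comment: str = ""


def filter_writers(records, occupation=None, min_age=None, max_age=None):
    """Return the records that match every criterion that is not None."""
    selected = []
    for record in records:
        if occupation is not None and record.occupation != occupation:
            continue
        if min_age is not None and record.age < min_age:
            continue
        if max_age is not None and record.age > max_age:
            continue
        selected.append(record)
    return selected


if __name__ == "__main__":
    # Made-up records, for illustration only.
    writers = [
        WriterRecord(1, "Writer A", 14, "School", "F", "student"),
        WriterRecord(2, "Writer B", 41, "Graduate", "M", "teacher"),
        WriterRecord(3, "Writer C", 55, "Graduate", "M", "katib"),
    ]
    for record in filter_writers(writers, min_age=40):
        print(record.form_number, record.occupation)
```

Keying every cropped character image to its form number would let such a filter select, for example, only the samples written by katibs.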

One page of the data collection form was designed for collecting handwritten Urdu digits, as shown in Figure 4.4. For the handwritten Urdu digit database, each of the 159 writers was asked to give 10 samples of each digit from zero to nine, yielding a database of 15,900 handwritten Urdu digits. A dataset of 159 handwritten paragraphs and document pages was also prepared for testing various paragraph, line, word and character segmentation methods. A sample handwritten Urdu paragraph is shown in Figure 4.5. Each of the 159 writers was given a blank page along with a printed paragraph and asked to copy the paragraph onto the blank page. These written documents were scanned at 300 dpi.
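The dataset sizes reported in this chapter follow directly from the collection protocol. The short sketch below simply reproduces that arithmetic from the figures given in the text; the function and constant names are illustrative.

```python
WRITERS = 159            # writers who filled in the forms
CHARACTER_SHAPES = 135   # isolated and position-based shapes
SAMPLES_PER_SHAPE = 5    # blank boxes per character on the form
DIGITS = 10              # zero to nine
SAMPLES_PER_DIGIT = 10   # samples of each digit per writer


def dataset_totals(writers=WRITERS):
    """Compute per-writer and overall sample counts."""
    chars_per_writer = CHARACTER_SHAPES * SAMPLES_PER_SHAPE
    digits_per_writer = DIGITS * SAMPLES_PER_DIGIT
    return {
        "characters_per_writer": chars_per_writer,
        "characters": chars_per_writer * writers,
        "digits_per_writer": digits_per_writer,
        "digits": digits_per_writer * writers,
        "paragraphs": writers,  # one copied paragraph per writer
    }


if __name__ == "__main__":
    for key, value in dataset_totals().items():
        print(f"{key}: {value}")
```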

Figure 4.2 Total 135 character shapes of Urdu Script

After samples of all the character shapes had been collected, the data collection pages were optically scanned at a resolution of 300 dpi. The input file for the system is a color JPG file, which was later converted into a grayscale file. Grayscale was preferred because a grayscale image retains more information about the scanned character than a binary image. In a binary image, noise may be introduced and some features, such as loops, may be distorted. As discussed earlier, in Urdu the loss of even a small dot may change the meaning of a character. The scanned pages were very clean, and no specific noise was observed in them. Even so, to produce a quality dataset, basic noise removal methods (discussed in the preprocessing section) were applied to each page before the characters were cropped and stored in the character database. Analysis of the dataset revealed some broken and overwritten characters. To normalize these characters, various morphological filtering methods were applied to the dataset images, producing a noise-free dataset.
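The preprocessing steps described above can be sketched in a few lines. The chapter does not name the specific conversion formula or morphological filters that were used, so everything here is an assumption chosen for illustration: the luminance weights 0.299, 0.587 and 0.114 are a common convention, a 3 x 3 morphological closing stands in for the unspecified filters, and all function names are hypothetical. Images are plain nested lists so that the sketch needs no external library.

```python
def to_grayscale(rgb_image):
    """Convert rows of (R, G, B) tuples to rows of 0-255 gray values."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in rgb_image
    ]


def binarize(gray_image, threshold=128):
    """Mark dark (ink) pixels as 1 and light (paper) pixels as 0."""
    return [[1 if v < threshold else 0 for v in row] for row in gray_image]


def _morph(binary, combine):
    """Apply `combine` (any or all) over each pixel's 3 x 3 neighbourhood."""
    height, width = len(binary), len(binary[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # The window is clipped at the image border.
            window = [
                binary[yy][xx]
                for yy in range(max(0, y - 1), min(height, y + 2))
                for xx in range(max(0, x - 1), min(width, x + 2))
            ]
            out[y][x] = int(combine(window))
    return out


def dilate(binary):
    """Grow ink regions by one pixel in every direction."""
    return _morph(binary, any)


def erode(binary):
    """Shrink ink regions by one pixel in every direction."""
    return _morph(binary, all)


def close_gaps(binary):
    """Morphological closing: dilation followed by erosion."""
    return erode(dilate(binary))


if __name__ == "__main__":
    broken_stroke = [
        [0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 0, 1, 1, 1],  # one-pixel break in the stroke
        [0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0],
    ]
    for row in close_gaps(broken_stroke):
        print("".join(str(v) for v in row))
```

Closing bridges one-pixel breaks in a stroke while leaving an isolated dot unchanged, whereas an opening (erosion followed by dilation) with the same window would erase such a dot. That difference matters for Urdu, where a dot can change the identity of a character.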

4.2 Handwritten Urdu Dataset collection formats

Figure 4.3 Scanned Form: Handwritten Urdu character Dataset collection



Figure 4.4 Scanned form: Handwritten Urdu Digit dataset collection



Figure 4.5 Sample Handwritten Urdu paragraph



References:

1. IFN/ENIT - Database of Arabic Handwritten Words, Institute of Communications Technology, Technical University Braunschweig, Germany. Available at http://ifnenit.com/

2. Saeed Mozaffari, Karim Faez, Farhad Faradji, Majid Ziaratban and S. Mohamad Golzan, "A Comprehensive Isolated Farsi/Arabic Character Database for Handwritten OCR Research," Tenth International Workshop on Frontiers in Handwriting Recognition, 2006, Amirkabir University of Technology, Tehran. Available at http://hal.inria.fr/docs/00/11/26/76/PDF/cr1096180506660.pdf

3. Malik W. Sagheer, C. L. He, N. Nobile and C. Y. Suen, "A New Large Urdu Database for Off-Line Handwriting Recognition," Proceedings of ICIAP (International Conference on Image Analysis and Processing), pp. 538-546, Salerno, Italy, Sept. 2009.

4. Ahsen Raza, Imran Siddiqi, Ali Abidi and Fahim Arif, "An Unconstrained Benchmark Urdu Handwritten Sentence Database with Automatic Line Segmentation," Proceedings of the 2012 International Conference on Frontiers in Handwriting Recognition, pp. 489-494.

5. Omar Mukhtar, Srirangaraj Setlur and Venu Govindaraju, "Experiments on Urdu Text Recognition," in V. Govindaraju and S. Setlur (eds.), Guide to OCR for Indic Scripts, Advances in Pattern Recognition, Springer-Verlag London Limited, 2009, pp. 163-171. DOI 10.1007/978-1-84800-330-9_8
