Urdu Word Typology and Word Segmentation Methods – Review

Imran Khan Pathan
Milliya College, Beed, India, E-mail: [email protected].

Shaikh Abdul Hannan
Vivekanand College, Aurangabad, India, E-mail: [email protected].

R.J. Ramteke
Department of Computer Science and IT, NMU, Reader, Jalgaon, E-mail: [email protected].

ABSTRACT: In recent years, the recognition of Farsi and Arabic handwriting has drawn increasing attention. Preprocessing is the most important stage in an Arabic OCR system; it has a direct effect on the reliability and efficiency of the segmentation and feature extraction stages. This review paper deals with the study of word typology, the writing methodology of Urdu script, compound formation of Urdu words, and the orthographic properties of Urdu characters. It also covers the evaluation of Urdu OCR systems, problems with Urdu OCR, work done in Urdu OCR, various techniques for segmentation, skew detection and correction, and the water reservoir principle used in Urdu character recognition.

Keywords: Urdu script, water reservoir principle, skew deletion, character segmentation, compound formation, orthography.

International Journal of Advances in Software Engineering, Volume 1, Number 1, January-June 2011, pp. 75-89

1. INTRODUCTION

Urdu, the national language of Pakistan, is spoken by more than 60 million speakers in over 20 countries [2]. It is written in a cursive script from right to left, like Arabic and Farsi, but with some additional letters; OCR systems built for Arabic or Farsi therefore do not meet the needs of Urdu script. Urdu optical character recognition is complex compared with that of other scripts because of the complex writing style, the orthography, and the space insertion and space deletion problems [9] [12] [13]. Urdu characters differ from one another by small changes in shape and by the hat feature, which carries very important information; this makes Urdu character recognition more difficult. The water reservoir principle is used to identify the shape of words. Skew detection and correction can be achieved by estimating the skew angle and rotating the image by that angle in the opposite direction. For line and character segmentation, the horizontal projection method is used: it automatically detects individual text lines and then segments the characters in each line [1][3][6][8][10].

2. PROPERTIES OF URDU SCRIPT

In India there are twelve scripts, and Urdu is one of the popular Indian scripts. Here we describe some properties of the Urdu script that are useful for building an OCR system. The modern Urdu alphabet consists of 39 basic characters, shown in Fig. 1(a). Urdu has 10 numerals, shown in Fig. 1(b). As in other Indian scripts, two or more Urdu characters may combine to create a complex shape called a compound character. Examples of some compound characters are shown in Fig. 2. The basic shape of a character may also change depending on its position (first, middle or last) in a word. For example, Fig. 3 shows an Urdu basic character in its isolated form and its shapes in the first, middle and last positions of a word [1]. As a result, the total number of characters to be recognized is very large. Thus, OCR development for Urdu is more difficult than for any European language script having a smaller number of characters [7][11][14][17][20].

Figure 1: Examples of Urdu Alphabet and Numerals (a) Basic Characters of Urdu Alphabet (b) Urdu Numerals.

Urdu script has some characteristics that differ from those of other Indian scripts. Urdu is written from right to left, whereas other Indian scripts are written from left to right. An Urdu basic character may have four components (see character numbers 6, 8, 17, 19, etc. of Fig. 1(a)), while in other Indian scripts this property is rare. There is a structural similarity between Urdu and Arabic script. Urdu script is written in different styles such as Naskh, Nastaliq, Aswad, Batool, Jaben, etc. [15] [18] [19] [21].

Figure 2: Some Examples of Urdu Compound Characters.

Figure 3: An Isolated Basic Character and Its Shapes in First, Middle, and Last Positions in a Word.

3. URDU ORTHOGRAPHY

Urdu is written in the Arabic script and, like Arabic, runs from right to left (RTL). Urdu characters change their shapes depending on the neighboring context, but generally they take one of four shapes: isolated, initial, medial and final. Urdu characters can be divided into two groups, separators and non-separators, also known as non-joiners and joiners respectively. The separators or non-joiners can take only the isolated and final shapes. In contrast, the non-separators or joiners can take all four shapes. The isolated form of each of these is shown in the figures below [22][25].

Figure 4: Separators/Non-Joiners in Urdu.

Figure 5: Non-Separators/ Joiners in Urdu.

The characters acquire their shapes according to the following rules.

3.1. A Joining Character Takes

• Initial form when a joiner follows it.

• Initial form when a non-joiner follows it.

• Final form when it comes after a joiner.

• Isolated form when it comes after a non-joiner.

• Medial form when it is already in final form and is followed by a joiner.

• Medial form when it is already in final form and is followed by a non-joiner.

3.2. A Non-Joining Character Takes

• Isolated form when a joiner or non-joiner follows it.

• Isolated form when it comes after a non-joiner.

• Final form when it comes after a joiner.
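The rules of Sections 3.1 and 3.2 can be condensed into two questions per character: does it connect to the preceding character, and does it connect to the following one? The sketch below is a minimal illustration of that reading, not a complete shaping engine. The function name `contextual_forms` and the non-joiner set are assumptions made for illustration (the set is based on commonly listed Urdu non-joining letters, since Figure 4 is not reproduced here), and diacritics and ligatures are ignored.

```python
# Assumed set of Urdu non-joiners (separators), written as code points:
# alef, alef madda, dal, ddal, thal, reh, rreh, zain, jeh, waw, bari yeh.
NON_JOINERS = set(
    "\u0627\u0622\u062f\u0688\u0630\u0631\u0691\u0632\u0698\u0648\u06d2"
)


def contextual_forms(word, non_joiners=NON_JOINERS):
    """Return the shape name of each character of a single word.

    Characters are taken in logical (reading) order. A character connects
    to the previous one only if that previous character is a joiner, and
    connects to the next one only if it is itself a joiner and is not the
    last character of the word.
    """
    forms = []
    for i, ch in enumerate(word):
        joins_prev = i > 0 and word[i - 1] not in non_joiners
        joins_next = i < len(word) - 1 and ch not in non_joiners
        if joins_prev and joins_next:
            forms.append("medial")
        elif joins_prev:
            forms.append("final")
        elif joins_next:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms


# Example: the word "kitab" (kaf, teh, alef, beh).
print(contextual_forms("\u06a9\u062a\u0627\u0628"))
```

With this formulation a non-joiner can only ever be reported as isolated or final, which agrees with the statement in Section 3 that separators take just those two shapes.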

4. COMPOUND WORDS IN URDU

In Urdu, compounding is a very rich phenomenon. Urdu is an offshoot of many other languages such as Arabic, Farsi, Turkish, Hindi and Sanskrit. Compounding occurs frequently in Farsi and is inherited by Urdu as well. Urdu, like English, is a head-final language. Compounds can be formed from two independent words such as nouns and adjectives, from independent words and verb stems, and from verb stems themselves. Compounds usually occur in the following formats: XY, X-o-Y and X-e-Y [26][28].

4.1. XY Formation

The XY formation simply involves combining two free morphemes. No more than two morphemes can combine in this manner in Urdu. An example of such a formation is (Mom Batti), which means candle. Another example is (Jaraim Paisha), which means criminal. Compounds in Urdu can be classified into four types:

4.1.1. Dvanda

These have two conditions:

• Both morphemes that form the compound have different meanings. These further have two conditions:

- Both morphemes are nouns. Examples: (Maan Baap; Parents), (Naak Naqsha; Features).

- Both morphemes are verbs. Example: (Parha Likha; Educated).

• Both morphemes that form the compound have identical or similar meanings. These also have two further conditions:

- Both morphemes are nouns. Examples: (Khat Patar; Letter), (Kaam Kaaj; Work).

- Both morphemes are verbs. Example: (Daikh Bhal; Care taking).

4.1.2. Tatpurusa

This is another type of XY compound, in which XY means a type of Y that is related to X in a way corresponding to one of the grammatical cases of X. Examples are:

(Ghur Dour; Horse Race)

(Chadar Chapol)

(Dais nikala; Exiled)

4.1.3. Karmadharaya

In this type of XY compounding the relation of the first element to the second is attributive, appositional or adverbial. These are often classified as a sub-type of Tatpurusa.

Examples are: (Khar Kanna; The one with big ears).

(Barh Bola; The one who exaggerates).

4.1.4. Divigu

It is a type of XY formation in which X is a numeral. Examples are:

(Adh Muwa; Half Dead) (Dupatta; Scarf) [23].

4.2. X-o-Y Formation

The X-o-Y construction contains the linking morpheme -o-. It usually means ‘and’ and is commonly used. Examples are (Mulk-o-Milat; Country and Nation) and (Aziz-o-aqarib; near and dear ones).

The X-o-Y formation is an instance of coordinating compounds. The morpheme -o- is mostly involved in nominal constructions. There are cases in which both morphemes in a compound have identical or similar meanings. For example, in the compound (Tabah-o-barbad), both (Tabah) and (Barbad) mean destroy. The compound itself means destroy, and it can be replaced by either of its constituents in a sentence to give exactly the same meaning. A similar example is (Amn-o-aman), which means peace. Although it originated in Farsi, -o- is nowadays also used to combine English words. One such example is (Petrol-wa-Diesel). Such examples are commonly found in Urdu corpora. The -o- is also used to form compounds having verb stems. These are discussed below.

4.3. X-e-Y Formation

The third and final formation, X-e-Y, contains a linking morpheme or an enclitic short vowel known as zer-e-izafat or hamza-e-izafat. Izafat means increase or addition. It is pronounced in Urdu as a short /e/ and is used in noun-e-noun and noun-e-adjective compounds.

The noun-e-noun compounds signify a possessor relationship in which X belongs to Y. An alternative construction for such compounds is Y (ke/ki) X, where ke and ki are case markers used to mark possession. An example is (Ehliyan-e-Hind), which means ‘People of India’ and can alternatively be expressed as (Hindustan ke log; ‘People of India’).

The noun-e-adjective formation shows that noun X is modified by adjective Y. For example, (Vazeer-e-azam) means prime minister. Another example is (Deewan-e-khas), which means private hall of audience. These compounds, however, are lexical entries for native Urdu speakers. Zer-e-izafat is left unwritten in modern texts, but a native speaker would pronounce it as if it were there. When written, it is written as follows:

• As subscript zer.

• As hamza over bari yeh (when it follows a word ending in the long vowels alef or vao).

• As hamza over choti heh (when it follows a final heh).

• As zero (when it follows a word ending with bari yeh).

5. MAJOR PROBLEMS WITH URDU OCR

Urdu is a complex language. The features that make Urdu more difficult than English are the following. Characters differ from one another by small changes in shape and by the hat feature, which carries very important information: depending on the type and number of hat features above or below the basic structure, a character can turn into a new character with a different sound. Although the average character length is short, this in turn increases the overall complexity and provides less context to alleviate the problem [24][27].

Urdu is among the Asian languages that suffer from the word segmentation dilemma. However, unlike in other Asian languages, word segmentation in Urdu is not just a space insertion problem. Space is a frequently used character in printed Urdu text, but its presence does not necessarily indicate a word boundary. In other cases space is optional, so the writer is free to use it or not. Put another way, a sentence may contain a single word with a space inside it. Alternatively, multiple words may be written continuously without any space, as in CJK languages. So word segmentation in Urdu is both a space insertion and a space deletion problem. To complicate the situation further, some words that are written with a space can also be written without it, and in a few cases they are spelled differently when written without the space. The Urdu word segmentation problem is triggered by the orthographic rules of the script and by confusion about the definition of a word: there is no consensus on what exactly a word is in Urdu [29].
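To make the space insertion side of the problem concrete, the following sketch segments a space-free string by choosing the split that uses the fewest dictionary words. It is only an illustration in the spirit of a minimum-word heuristic: the function name `min_word_segment`, the toy lexicon and the romanised input are all hypothetical, and a practical system would also need to handle unknown words and the space deletion cases described above (for instance by stripping existing spaces before segmenting).

```python
def min_word_segment(text, lexicon):
    """Split `text` into the fewest lexicon words, or return None.

    Dynamic programming over prefixes: best[i] is the smallest number of
    words that exactly cover text[:i], and back[i] is where the last of
    those words starts.
    """
    n = len(text)
    inf = float("inf")
    best = [inf] * (n + 1)
    back = [-1] * (n + 1)
    best[0] = 0
    max_len = max((len(w) for w in lexicon), default=0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            if best[start] + 1 < best[end] and text[start:end] in lexicon:
                best[end] = best[start] + 1
                back[end] = start
    if best[n] == inf:
        return None  # no full cover with the given lexicon
    words = []
    i = n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]


# Toy romanised lexicon, for illustration only.
lexicon = {"mom", "batti", "mombatti", "jal", "ao", "jalao"}
print(min_word_segment("mombattijalao", lexicon))
```

The minimum-word criterion prefers the compound reading over its parts whenever the compound is in the lexicon, which is one reason the treatment of compounds in Section 4 matters for segmentation.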

6. EVALUATION OF URDU OCR

6.1. Methodologies Used

Urdu character recognition is part of a broad domain called pattern recognition. Various methods are used for Urdu character recognition. Character recognition problems can be solved by the following steps.

I. Preprocessing.

II. Segmentation.

III. Feature Extraction.

IV. Classification.

V. Recognition/Post Processing.

These are the basic steps for the recognition of Urdu characters. In the Urdu OCR domain, various methods are used for skew detection, line segmentation and character recognition [31][34][40].

6.2. Method Used for Skew Detection and Correction

The digitised text images are first converted into two-tone images using a histogram-based thresholding approach. Here object pixels are represented by 1 and background pixels by 0. The two-tone image generally shows protrusions and dents in the characters as well as isolated object pixels over the background, which are cleaned by a logical smoothing approach [5]. Casual use of the scanner may lead to skew in the document image. The skew angle is the angle that the text lines of the document image make with the horizontal direction. Skew correction can be achieved by (i) estimating the skew angle, and (ii) rotating the image by the skew angle in the opposite direction.
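The text does not say which histogram-based threshold is used. As one possible illustration, the sketch below applies Otsu's criterion, which picks the grey level that maximises the between-class variance of the histogram. This choice, the function names `otsu_threshold` and `binarize`, and the assumption of 8-bit images with dark text on a light background are ours, not the source's.

```python
def otsu_threshold(gray):
    """Return the grey level (0-255) maximising between-class variance.

    `gray` is a list of rows of integer grey levels in the range 0..255.
    """
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    total_sum = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # sum of their grey levels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        mu0 = sum0 / w0
        mu1 = (total_sum - sum0) / w1
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def binarize(gray):
    """Two-tone image: 1 for object (dark) pixels, 0 for background."""
    t = otsu_threshold(gray)
    return [[1 if value <= t else 0 for value in row] for row in gray]


page = [
    [220, 220, 220],
    [220, 30, 220],
    [220, 40, 220],
]
print(binarize(page))
```

The output follows the convention stated above, with object pixels represented by 1 and background pixels by 0; smoothing of protrusions and isolated pixels is not covered here.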

Figure 6: (a) Example of an Urdu Skewed Text (b) Candidate Points for Hough Transform are Shown.

Here we use a Hough transform based technique for skew angle estimation. To reduce the amount of data to be processed by the Hough transform, we compute candidate points from selected components of the image. For component selection, the mean width bm of the bounding boxes of the connected components is computed, and components whose bounding box width is greater than 0.5 × bm are selected. Thus, small and irrelevant components such as dots, punctuation marks and small modifiers are mostly filtered out of the skew estimation process. Let I be the image containing only the selected components; B the set of lowermost points of the top reservoirs obtained from the selected components of I; I′ the image I rotated anti-clockwise by 90°; and B′ the set of lowermost points of the top reservoirs of those components of I′ for which no top reservoir was obtained before rotation. Let L = B ∪ B′. Then L is the set of candidate points for the Hough transform. The candidate points are chosen so that they lie, more or less, on parallel straight lines and hence are good representatives for the Hough transform. A skewed Urdu text is shown in Fig. 6(a), and the candidate points for the Hough transform of this text are shown in Fig. 6(b). From Fig. 6(b) it can be seen that most of the candidate points of a text line lie on a straight line. For skew angle detection, the usual Hough transform is applied to these candidate points. After skew angle detection the image is rotated according to the detected skew angle. It has been noted that font style and size variations do not affect this skew estimation method. The method can also handle documents with skew angles between +45° and -45°. From our experiments it is observed that in about 97.4% of the cases our method computes the skew angle within a tolerance of ±0.5 degree [29] [4] [5] [32].
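The two steps described above, filtering components by bounding-box width and voting over candidate points, can be sketched as follows. This is a simplified illustration under stated assumptions: the candidate points are passed in directly rather than derived from reservoirs, the names `select_components` and `estimate_skew` are hypothetical, and scoring each angle by the sum of squared bin counts is our choice of peak measure rather than something the source specifies. Coordinates are (x, y) pairs, and the sign of the returned angle depends on whether y grows upward or downward.

```python
import math
from collections import Counter


def select_components(boxes):
    """Keep bounding boxes (x0, y0, x1, y1) wider than 0.5 x mean width."""
    widths = [x1 - x0 + 1 for x0, _, x1, _ in boxes]
    mean_width = sum(widths) / len(widths)
    return [box for box, w in zip(boxes, widths) if w > 0.5 * mean_width]


def estimate_skew(points, max_angle=45.0, step=0.5, rho_bin=1.0):
    """Return the angle (degrees) at which the points align best.

    For each trial angle, every point votes for the offset `rho` of the
    line through it with that slope. Points lying on parallel lines of
    the right slope fall into a few heavily filled bins.
    """
    best_angle, best_score = 0.0, -1
    steps = int(round(2 * max_angle / step))
    for k in range(steps + 1):
        angle = -max_angle + k * step
        theta = math.radians(angle)
        c, s = math.cos(theta), math.sin(theta)
        votes = Counter(round((y * c - x * s) / rho_bin) for x, y in points)
        score = sum(v * v for v in votes.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


# Two synthetic text lines skewed by 5 degrees.
slope = math.tan(math.radians(5.0))
points = [(x, x * slope + c) for c in (20, 60) for x in range(0, 200, 10)]
print(estimate_skew(points))
```

The default search range of ±45° and step of 0.5° mirror the range and tolerance quoted in the text; both are parameters of the sketch and can be changed.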

6.3. Methods Used for Line and Character Segmentation

The proposed OCR system automatically detects individual text lines and then segments the characters in each line. We do not segment words from a line for recognition purposes. The lines of a text block are segmented by finding the valleys of the projection profile, computed by counting the number of black pixels in each row. The trough between two consecutive peaks in this profile denotes the boundary between two text lines. A text line is found between two consecutive boundary lines. An Urdu text with its projection profile is shown in Fig. 7, where the line segmentations are shown by dotted lines.

Figure 7: Horizontal Projection Profile of an Urdu Text and its Line Segmentations are Shown.
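A minimal sketch of line segmentation by horizontal projection is given below. It assumes a deskewed two-tone image stored as a list of rows of 0/1 values and, as a simplification of the valley search described above, treats only completely empty rows as valleys; real scans with touching lines would need a threshold or a search for local minima. The function names are illustrative.

```python
def horizontal_profile(image):
    """Number of object pixels (value 1) in each row."""
    return [sum(row) for row in image]


def segment_lines(image):
    """Return (top, bottom) row ranges, bottom exclusive, one per line.

    A text line is a maximal run of rows whose profile is non-zero;
    rows with a zero profile are the valleys separating the lines.
    """
    profile = horizontal_profile(image)
    lines, start = [], None
    for y, count in enumerate(profile):
        if count > 0 and start is None:
            start = y                   # first row of a new text line
        elif count == 0 and start is not None:
            lines.append((start, y))    # valley reached, close the line
            start = None
    if start is not None:
        lines.append((start, len(profile)))
    return lines


image = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
    [1, 0, 1],
    [0, 0, 0],
]
print(segment_lines(image))
```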

Character segmentation is done by a combination of component labelling and vertical projection profile methods. A text line is scanned vertically. If two or fewer object pixels are encountered in a vertical scan, the scan is denoted by 0; otherwise it is denoted by the number of object pixels in that column. In this way a vertical scanning histogram is constructed. If the histogram contains a run of at least K1 consecutive 0s, the midpoint of that run is taken as a character boundary. The value of K1 is determined experimentally. Sometimes, because of the kerned behaviour of Urdu script (kerned characters are characters that overlap with neighbouring characters), some characters of a line may not be segmented properly by the projection profile method. To handle such cases we apply the component labelling approach along with the projection profile method. If the distance between two consecutive character boundaries is large, we suspect a mis-segmentation at this position and use component labelling for further segmentation. During component labelling we check the vertical overlap of the components. If two or more components fully overlap vertically, we assume that they are different parts of one character. If the horizontal overlap between the bounding boxes of two consecutive components is less than 35%, we treat the two components as parts of two different characters [33][36][38].
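The vertical scanning histogram and the K1 rule described above can be sketched as follows. Only the projection-profile part is shown; the component labelling fallback for kerned characters is omitted. The function names are illustrative, and since the source gives no value for K1 it is left as a required parameter.

```python
def vertical_histogram(line_image, noise=2):
    """Column-wise object-pixel counts, with counts <= `noise` set to 0.

    `line_image` is a list of rows of 0/1 values for one text line.
    """
    height = len(line_image)
    width = len(line_image[0])
    hist = []
    for x in range(width):
        count = sum(line_image[y][x] for y in range(height))
        hist.append(0 if count <= noise else count)
    return hist


def character_boundaries(hist, k1):
    """Midpoints of every run of at least `k1` consecutive zeros."""
    boundaries, run_start = [], None
    for x, value in enumerate(hist + [1]):  # sentinel closes a final run
        if value == 0:
            if run_start is None:
                run_start = x
        elif run_start is not None:
            run_len = x - run_start
            if run_len >= k1:
                boundaries.append(run_start + run_len // 2)
            run_start = None
    return boundaries


hist = [5, 5, 0, 0, 0, 4, 0, 6, 0, 0, 0, 0]
print(character_boundaries(hist, k1=3))
```

In this example the single zero at index 6 is too short to count as a gap, so the columns on either side of it stay in the same segment.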

6.4. Water Reservoir Principle Used for Character Recognition

If water is poured from one side of a component, the cavity regions of the component where water is stored are considered reservoirs [6]. By top (bottom) reservoirs we mean the reservoirs obtained when water is poured from the top (bottom) of the component. (A bottom reservoir of a component is visualized as a top reservoir when water is poured from the top after rotating the component by 180°.) Similarly, if water is poured from the left (right) side of the component, the cavity regions where water is stored are considered left (right) reservoirs. For an illustration see Fig. 8, which shows the top, bottom, left and right reservoirs of some Urdu characters. The water flow direction from a full reservoir is also shown in this figure [35][37][39][41][42].

Not all reservoirs obtained from a given direction of a component are considered for further processing. Only the reservoirs whose heights are greater than a threshold T1 are considered. The value of T1 is chosen as 2/5 times the corresponding component height. This threshold value was obtained experimentally.

Figure 8: Different Reservoirs and Their Water Flow Directions are Shown in Four Characters. Water Flow Directions are Shown by Dotted Arrow.
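As a rough illustration of the reservoir idea, the sketch below finds top reservoirs of a single component from its top profile: each column is reduced to the height of its topmost object pixel, and water is assumed to settle up to the lower of the highest walls on either side. This profile-based shortcut is our simplification and ignores overhangs and holes that open downward. The function name `top_reservoirs` is hypothetical; the 2/5 height ratio is the threshold T1 quoted above.

```python
def top_reservoirs(component, min_height_ratio=2 / 5):
    """Find top reservoirs of a 0/1 component given as a list of rows.

    Returns a list of dicts with the column span and the water height of
    each reservoir deeper than `min_height_ratio` x component height.
    """
    rows, cols = len(component), len(component[0])
    # Terrain height of each column as seen from the top.
    terrain = []
    for x in range(cols):
        first = next((y for y in range(rows) if component[y][x]), rows)
        terrain.append(rows - first)
    # Highest wall to the left and to the right of every column.
    left_max, running = [], 0
    for h in terrain:
        running = max(running, h)
        left_max.append(running)
    right_max, running = [0] * cols, 0
    for x in range(cols - 1, -1, -1):
        running = max(running, terrain[x])
        right_max[x] = running
    depth = [min(l, r) - h for l, r, h in zip(left_max, right_max, terrain)]
    # Group neighbouring wet columns into reservoirs and apply T1.
    reservoirs, start = [], None
    for x, d in enumerate(depth + [0]):  # sentinel closes a final run
        if d > 0 and start is None:
            start = x
        elif d == 0 and start is not None:
            height = max(depth[start:x])
            if height > min_height_ratio * rows:
                reservoirs.append({"columns": (start, x - 1), "height": height})
            start = None
    return reservoirs


cup = [
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
]
print(top_reservoirs(cup))
```

Following the remark above about rotation by 180°, bottom reservoirs can be obtained from the same function by passing `[row[::-1] for row in component[::-1]]`.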

7. CONCLUSION

Urdu OCR has large scope, but very little work has been done in this area. Various methods are used for segmentation and recognition, for example the water reservoir principle, skew detection and correction, the minimum word-minimum error heuristic, unigram-based and bigram-based methods, and neural networks. Some work has been done on online and offline Urdu OCR, but only up to the character or word level; Urdu OCR development needs to continue in this direction toward off-line handwritten word recognition.

REFERENCES

[1] Aburas, A.A. & Rehiel, M.A. 2007, “Off-Line Omni-Style Handwriting Arabic Character Recognition System Based on Wavelet Compression”, Arab Research Institute in Sciences & Engineering, ISSN 1994-3253, 3(4): pp. 123-135.

[2] Al-Badr, B. & Mahmoud, S. 1995, “Survey and Bibliography of Arabic Optical Text Recognition”, Signal Processing, 41(1): pp. 49-77.

[3] AlKhateeb, J.H., Ren, J., Ipson, S. & Jiang, J. 2008, “Knowledge-Based Baseline Detection and Optimal Thresholding for Words Segmentation in Efficient Preprocessing of Handwritten Arabic Text”, Fifth International Conference on Information Technology: New Generations, IEEE Computer Society, pp. 1158-1159.

[4] Almuallim, H. & Yamaguchi, S. 1987, “A Method of Recognition of Arabic Cursive Handwriting”, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 9(5): pp. 715-722.

[5] Al-Rashaideh, H. 2006, “Preprocessing Phase for Arabic Word Handwritten Recognition”, Russian Academy of Sciences, 6(1): pp. 11-19, Russian Federation.

[6] Alshebeili, S.A., Nabawi, A.A. & Mahmoud, S.A. 1997, “Arabic Character Recognition Using 1-D Slices of the Character Spectrum”, Signal Processing, 56(1): pp. 59-75.

[7] Al-Yousefi, H. & Udpa, S.S. 1992, “Recognition of Arabic Characters”, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 14(8): pp. 853-857.

[8] Amin, A. 1998, “Off-Line Arabic Character Recognition: The State of the Art”, Pattern Recognition, 31(5): pp. 517-530.

[9] Amin, A. 1997, “Arabic Character Recognition”, In Bunke H. & Wang P.S.P. (ed.), Handbook of Character Recognition and Document Image Analysis, pp. 397-420. World Scientific, Singapore.

[10] Amin, A. & Mari, J.F. 1989, “Machine Recognition and Correction of Printed Arabic Text”, IEEE Transactions on Systems, Man and Cybernetics (SMC), 19(5): pp. 1300-1306.

[11] Argner, V. & El Abed, H. 2008, “Databases and Competitions: Strategies to Improve Arabic Recognition Systems”, pp. 82-103.

[12] Arica, N. & Yarman-Vural, F. 2002, “Optical Character Recognition for Cursive Handwriting”, IEEE PAMI, 24(6): pp. 801-813.

[13] Broumandnia, A., Shanbehzadeh, J. & Nourani, M. 2007, “Handwritten Farsi/Arabic Word Recognition”, IEEE, pp. 767-771.

[14] Burrow, P. 2004, “Arabic Handwriting Recognition”, M.Sc. Thesis, University of Edinburgh, England.

[15] El-Hajj, R., Likforman-Sulem, L. & Mokbe, C. 2005, “Arabic Handwriting Recognition Using Baseline Dependant Features and Hidden Markov Modeling”, (ICDAR’05) Proceedings of the 2005 Eighth International Conference on Document Analysis and Recognition, IEEE, 20(5), pp. 1520-5263.

[16] Fahmy, M.M.M. & El-Messiry, H. 2001, “Automatic Recognition of Typewritten Arabic Characters using Zernike Moments as a Feature Extractor”, Journal of Studies in Informatics and Control, 10(3): pp. 48-51.

[17] Farooq, F., Govindaraju, V. & Perrone, M. 2005, “Preprocessing Methods for Handwritten Arabic Documents”, (ICDAR’05) Proceedings of the 2005 Eighth International Conference on Document Analysis and Recognition, IEEE, 1, pp. 267-271.

[18] Khorsheed, M.S. 2002, “Off-Line Arabic Character Recognition - A Review”, Pattern Analysis & Applications, 5(1): pp. 31-45.

[19] Akiyama, T. and N. Hagita, 1990, “Automated Entry System for Printed Documents”, Patt. Recog., 23: pp. 1141-1158.

[20] Al-Badr, B. and S. Mahmoud, 1995, “Survey and Bibliography of Arabic Optical Text Recognition”, Signal Process., 41: pp. 49-77.

[21] AL-Shatnawi, A. and K. Omar, 2008, “Methods of Arabic Baseline Detection-the State of Art”, Int. J. Comput. Sci. Network Secur., 8: pp. 137-142.

[22] Argner, V. and H. El Abed, 2008, “Databases and Competitions: Strategies to Improve Arabic Recognition Systems”, pp. 82-103.

[23] Hashizume, A., P.S. Yeh and A. Rosenfeld, 1986, “A Method of Detecting the Orientation of Aligned Components”, Patt. Recog. Lett., 4: pp. 125-132.

[24] Khorsheed, M.S., 2002, “Off-Line Arabic Character Recognition-A Review”, Patt. Anal. Appl., 5: pp. 31-45.

[25] Le, D.S., G.R. Thoma and H. Wechsler, 1994, “Automatic Page Orientation and Skew Angle Detection for Binary Document Images”, Patt. Recog., 27: pp. 1325-1344.

[26] Liana, M. and G. Venu, 2006, “Offline Arabic Handwriting Recognition: A Survey”, IEEE Trans. Patt. Anal. Mach. Intell., 28: pp. 712-724.

[27] Omar, K., A. Ramli, R. Mahmod and M. Sulaiman, 2002, “Skew Detection and Correction of Jawi Images using Gradient Direction”, J. Technol., 37: pp. 117-126.

[28] O’Gorman, L., 1993, “The Document Spectrum for Page Layout Analysis”, IEEE Trans. Patt. Anal. Mach. Intell., 11: pp. 1162-1173.

[29] U. Pal and B.B. Chaudhuri, 1996, “An Improved Document Skew Angle Estimation Technique”, Patt. Recog. Lett., 17: pp. 899-904.

[30] Sarhan, A.M. and O.I. Al Helalat, 2007, “Arabic Character Recognition using Artificial Neural Networks and Statistical Analysis”, Proc. World Acad. Sci. Eng. Technol., 21: pp. 32-36.

[31] Safabakhsh, R. and P. Adibi, 2005, “Nastaaligh Handwritten Word Recognition using Continuous Density Variable-Duration HMM”, Arabian J. Sci. Eng., 30: pp. 95-118.

[32] Srihari, S.N. and V. Govindaraju, 1989, “Analysis of Textual Images using the Hough Transform”, Mach. Vis. Appl., 2: pp. 141-153.

[33] Nawaz, S.N., M. Sarfraz, A. Zidouri and W.G. Al-Khatib, 2003, “An Approach to Offline Arabic Character Recognition using Neural Networks”, Proceeding of the 10th IEEE International Conference on Electronics, Circuits and Systems, Dec. 14-17, pp. 1328-1331.

[34] Yan, H., 1993, “Skew Correction of Document Images using Interline Cross Correlation”, Comput. Vis. Graph. Image Process., 55: pp. 538-543.

[35] Yu, B. and A.K. Jain, 1996, “A Robust and Fast Skew Detection Algorithm for Generic Documents”, Patt. Recog., 29: pp. 1599-1629.

[36] Zeki, A.M., 2005, “The Segmentation Problem on Arabic Character Recognition-the State of the Art”, Proceeding of the 1st International Conference on Information and Communication Technology, Aug. 27-28, IEEE Xplore Press, USA, pp. 11-26.

[37] Muhammad Afzal and Sarmad Hussain, “Urdu Computing Standards: Development of Urdu Zabta Takhti - WG2 N2413-2-SC2 N3589-2 (UZT) 1.01”, Proceedings of INMIC2001, Organised by IEEE & Lahore University of Management Sciences, Lahore, December 28-30, 2001, pp. 216-222.

[38] Faisal Shafait, Adnan-ul-Hasan, Daniel Keysers and Thomas M. Breuel, “Layout Analysis of Urdu Document Images”, Image Understanding and Pattern Recognition (IUPR) Research Group, German Research Center for Artificial Intelligence (DFKI GmbH), D-67663 Kaiserslautern, Germany.

[39] Inam Shamsher, Zaheer Ahmad, Jehanzeb Khan Orakzai and Awais Adnan, “OCR For Printed Urdu Script Using Feed Forward Neural Network”, Proceedings of World Academy of Science, Engineering and Technology, 23, Aug. 2007, ISSN 1307-6884.

[40] Zahour, A., Taconet, B., Mercy, P. & Ramdane, S. 2001, “Arabic Hand-Written Text-Line Extraction”, 6th International Conference on Document Analysis and Recognition (ICDAR’01), pp. 281-285, Washington, USA, 10-13 September.

[41] Zeki, A.M. 2005, “The Segmentation Problem on Arabic Character Recognition – the State of the Art”, 1st International Conference on Information and Communication Technology (ICICT), pp. 11-26, Karachi, Pakistan.

[42] U. Pal and Anirban Sarkar, “Recognition of Printed Urdu Script”, IEEE – ICDAR 2003.