11
Engineering Applications of Artificial Intelligence 21 (2008) 658–668 Multilingual OCR system for South Indian scripts and English documents: An approach based on Fourier transform and principal component analysis V.N. Manjunath Aradhya , G. Hemantha Kumar, S. Noushath Department of Studies in Computer Science, University of Mysore, Mysore 570006, Karnataka, India Received 5 May 2006; received in revised form 20 May 2007; accepted 28 May 2007 Available online 12 July 2007 Abstract Character recognition lies at the core of the discipline of pattern recognition where the aim is to represent a sequence of characters taken from an alphabet [Kasturi, R., Gorman, L.O., Govindaraju, V., 2002. Document image analysis: a primer. Sadhana 27 (Part 1), 3–22]. Though many kinds of features have been developed and their test performances on standard database have been reported, there is still room to improve the recognition rate by developing improved features. In this paper, we present a multilingual character recognition system for printed South Indian scripts (Kannada, Telugu, Tamil and Malayalam) and English documents. South Indian languages are most popular languages in India and around the world. The proposed multilingual character recognition is based on Fourier transform and principal component analysis (PCA), which are two commonly used techniques of image processing and recognition. PCA and Fourier transforms are classical feature extraction and data representation techniques widely used in the area of pattern recognition and computer vision. Our experimental results show the good performance over the data sets considered. r 2007 Elsevier Ltd. All rights reserved. Keywords: Document analysis; Multi-lingual character recognition; South Indian languages; Fourier transform; Principal component analysis (PCA) 1. Introduction Optical character recognition (OCR) is one of the virtually eminent applications of automatic pattern recog- nition. Research in OCR is very popular for various application potentials in banks, post offices, defense organizations, reading aid for the blind, library automa- tion, language processing and multi-media design. India is a multi-lingual multi-script country, where a single docu- ment page (e.g., passport application form, an examination question paper, a money order form, bank account opening application form, train reservation form, etc.) may contains text in two or more language scripts. OCR is of special significance for a multilingual country like India having 16 major state languages and over 100 regional languages. The current challenges in the filed of OCR technology is to take care of poor quality documents, recognizing characters with various noise types, recognition of multi- lingual characters and development of OCR which can handle different fonts and sizes. For Indian and many other oriental languages, OCR systems are not yet able to successfully recognize printed document images of varying scripts, quality, size, style, and font. In contrast to European languages, Indian languages pose many additional challenges. Such as: (i) large number of vowels, consonants, and conjuncts, (ii) most scripts spread over several zones, (iii) inflectional in nature and having complex character grapheme, (iv) lack of standard test databases for the Indian languages. However, most of the available systems work on European scripts which are based on Roman alphabet (Bozinovic and Srihari, 1989; Casey and Nagy, 1968; Wang et al., 1982; Mori et al., 1992; Kahan et al., 1987). Research reports on oriental language scripts are few, except for Korean, Chinese and Japanese scripts (Govindan and ARTICLE IN PRESS www.elsevier.com/locate/engappai 0952-1976/$ - see front matter r 2007 Elsevier Ltd. All rights reserved. doi:10.1016/j.engappai.2007.05.009 Corresponding author. E-mail addresses: [email protected] (V.N. Manjunath Aradhya), [email protected] (S. Noushath).

Multilingual OCR system for South Indian scripts and English … ·  · 2015-08-14English documents: An approach based ... European scripts which are based on Roman alphabet (Bozinovic

Embed Size (px)

Citation preview

ARTICLE IN PRESS

0952-1976/$ - se

doi:10.1016/j.en

�CorrespondE-mail addr

Aradhya), naw

Engineering Applications of Artificial Intelligence 21 (2008) 658–668

www.elsevier.com/locate/engappai

Multilingual OCR system for South Indian scripts andEnglish documents: An approach based on Fourier transform

and principal component analysis

V.N. Manjunath Aradhya�, G. Hemantha Kumar, S. Noushath

Department of Studies in Computer Science, University of Mysore, Mysore 570006, Karnataka, India

Received 5 May 2006; received in revised form 20 May 2007; accepted 28 May 2007

Available online 12 July 2007

Abstract

Character recognition lies at the core of the discipline of pattern recognition where the aim is to represent a sequence of characters

taken from an alphabet [Kasturi, R., Gorman, L.O., Govindaraju, V., 2002. Document image analysis: a primer. Sadhana 27 (Part 1),

3–22]. Though many kinds of features have been developed and their test performances on standard database have been reported, there is

still room to improve the recognition rate by developing improved features. In this paper, we present a multilingual character recognition

system for printed South Indian scripts (Kannada, Telugu, Tamil and Malayalam) and English documents. South Indian languages are

most popular languages in India and around the world. The proposed multilingual character recognition is based on Fourier transform

and principal component analysis (PCA), which are two commonly used techniques of image processing and recognition. PCA and

Fourier transforms are classical feature extraction and data representation techniques widely used in the area of pattern recognition and

computer vision. Our experimental results show the good performance over the data sets considered.

r 2007 Elsevier Ltd. All rights reserved.

Keywords: Document analysis; Multi-lingual character recognition; South Indian languages; Fourier transform; Principal component analysis (PCA)

1. Introduction

Optical character recognition (OCR) is one of thevirtually eminent applications of automatic pattern recog-nition. Research in OCR is very popular for variousapplication potentials in banks, post offices, defenseorganizations, reading aid for the blind, library automa-tion, language processing and multi-media design. India isa multi-lingual multi-script country, where a single docu-ment page (e.g., passport application form, an examinationquestion paper, a money order form, bank accountopening application form, train reservation form, etc.)may contains text in two or more language scripts. OCR isof special significance for a multilingual country like Indiahaving 16 major state languages and over 100 regionallanguages.

e front matter r 2007 Elsevier Ltd. All rights reserved.

gappai.2007.05.009

ing author.

esses: [email protected] (V.N. Manjunath

[email protected] (S. Noushath).

The current challenges in the filed of OCR technology isto take care of poor quality documents, recognizingcharacters with various noise types, recognition of multi-lingual characters and development of OCR which canhandle different fonts and sizes.For Indian and many other oriental languages, OCR

systems are not yet able to successfully recognize printeddocument images of varying scripts, quality, size, style, andfont. In contrast to European languages, Indian languagespose many additional challenges. Such as: (i) large numberof vowels, consonants, and conjuncts, (ii) most scriptsspread over several zones, (iii) inflectional in nature andhaving complex character grapheme, (iv) lack of standardtest databases for the Indian languages.However, most of the available systems work on

European scripts which are based on Roman alphabet(Bozinovic and Srihari, 1989; Casey and Nagy, 1968; Wanget al., 1982; Mori et al., 1992; Kahan et al., 1987). Researchreports on oriental language scripts are few, except forKorean, Chinese and Japanese scripts (Govindan and

ARTICLE IN PRESSV.N. Manjunath Aradhya et al. / Engineering Applications of Artificial Intelligence 21 (2008) 658–668 659

Shivaprasad, 1990). Focused documentation of the activ-ities in this area may also be seen in Indian languagedocument analysis and understanding (2002).1 Thesetechnical articles reveal the academic contributions to thedevelopment of OCR system for Indian scripts and discussspecific issues to be addressed in Indian scripts. One of themajor contributions in this area is the Devnagari OCRsystem developed by Chaudhuri B.B. Chaudhuri and Pal(1998), which is commercially available. This is the firstOCR system among all script forms used in the Indian sub-continent. Performance of the system on documents with avaried degree of noise is being studied and that on size andstyle variations is also studied. Another important OCRsystem is also proposed in the literature that can read twoIndian language scripts: Bangla and Devnagari (Hindi), themost popular ones in Indian sub-continent by Chaudhuriand Pal (1997). Stroke features are used for characterrecognition. Feature based tree classifiers is initially usedand efficient template matching approach is employed torecognize individual character. The performance of thesystem is quite satisfactory on single font clear documents.An OCR system to read South Indian language of Teluguis presented in Negi et al. (2001). The simple and mostobvious approach for recognizing Telugu characters is tobreak each character into glyphs and recognize them. Theyuse fringe distance for the comparison of Telugu characterbinary images. The overall performance of the proposedmethod is 92%. A multi-font OCR system for printedTelugu text is also described in Vasantha Lakshmi andPatvardhan (2002).

The character recognition process from printed docu-ments containing Hindi and Telugu text is presented inJawahar et al. (2003). The bilingual recognizer is based onprincipal component analysis (PCA) followed by supportvector classification. Basically principal components areused for dimensionality reduction and can give superiorperformance for font independent OCR. Support vectormachines (SVM) are pair-wise discriminating classifierswith the ability to identify the decision boundary withmaximal margin. The overall accuracy of the method is96.7%. In Hanmandlu et al. (2003) an approach based onFuzzy for unconstrained handwritten character recognitionis presented. Binary image of a character is partitioned intofixed number of sub-images called boxes. The featuresconsist of normalized vector distances and angles fromeach box. They have devised a new fuzzification functioninvolving parameters, which take into account of thevariations in the fuzzy sets. A recognition rate of almost99% was achieved with the new fuzzification function.

An OCR system for printed Urdu script is described inPal and Sarkar (2003). The method uses topologicalfeatures, contour features as well as features obtainedfrom the concept of water reservoir. The topologicalfeatures include existence of holes, ratio of hole height to

1For further information please visit the link: www.iiit.net/itrc/

index.html.

character height, etc. Contour features include character-istics of different profiles obtained from a portion ofcontour. Water reservoir based features include number ofreservoirs, position of reservoirs, height of reservoir, etc.These features are used for further recognition purpose.The method was tested in a variety of printed Urdudocuments. The system identifies individual text line withan accuracy of 98.3% and the system recognizes basiccharacters and numerals with an accuracy of about 97.8%.OCR for printed Tamil text using Unicode is presented inSeethalakshmi et al. (2005). The method uses variousfeatures that are considered for classification are thecharacter height, width, number of horizontal and verticallines, etc. Back propagation and SVM based classifiers areused for subsequent classification purpose. Automaticrecognition of printed Oriya script is presented inChaudhuri et al. (2002). The digitized document image isfirst passed through preprocessing modules like skewcorrection, line segmentation, character segmentation,etc. Next, individual characters are recognized usingcombination of stroke and run-number based featuresalong with features obtained from the concept of waterreservoir. In the recognition stage, modifiers are recognizedin the first part and remaining characters are recognized inthe second part. The system has been tested on a variety ofprinted Oriya documents. Recognition of individual textlines is about 97.5% whereas character segmentationaccuracy is about 97.2%. On average the system recognizescharacters with an accuracy of about 96.3%. A font andsize independent OCR system for printed Kannadadocuments is presented in Ashwin and Sastry (2002). Thesystem first extracts words from the document image andthen segments these into sub-character level pieces. A set ofzone features is extracted after normalization of thecharacters for further recognition purpose. A survey onIndian script character recognition is presented in Pal andChaudhuri (2004). The paper discusses a review of OCRwork done on Indian language scripts and differentmethodologies applied in OCR development in interna-tional scenario.To the best of our knowledge, this is the first report of its

kind on multiple Indian scripts character recognition. Theproposed multilingual character recognition system worksbased on Fourier transform and PCA (Fourier-PCA).Fourier transform is a widely used image processingtechnique, which is often applied to the enhancement ofimage description information and visual effect. In thispaper, we combine it with popular PCA to enhance theclassification information and improve the recognitioneffect. We first obtain filtered images (pre-processed) by theselection of appropriate Fourier frequency bands ofcharacter images. We then propose to carry out characterclassification/recognition by using the PCA method.The organization of this paper is as follows. In Section 2,

brief overview of South Indian scripts are presented.In Section 3, preprocessing procedures such as skewestimation and segmentation are presented. In Section 4,

ARTICLE IN PRESSV.N. Manjunath Aradhya et al. / Engineering Applications of Artificial Intelligence 21 (2008) 658–668660

we describe our proposed technique. Experimental resultsand comparative study are reported in Section 5. Finally,conclusions are drawn in Section 6.

2. Brief overview of South Indian scripts

In this section, we give the brief explanation over theproperties of South Indian scripts. Figures containing ofvowels, consonants, and conjunct consonant characterpertaining to South Indian scripts are separately shown inAppendix A.

2.1. Kannada script

Kannada is one of the major Dravidian languages ofsouthern India and one of the earliest languages evidencedepigraphically in India and spoken by about 50 millionpeople in the Indian states of Karnataka. The script has 49characters in its alphasyllabary and is phonemic. TheKannada character set is almost identical to that of otherIndian languages. The characters are classified into threecategories: swaras (vowels), vyanjanas (consonants) andyogavaahas (part vowel, part consonants).

2.2. Tamil script

Tamil is a Dravidian language spoken predominantly byTamils in India and Sri Lanka, of speakers in many othercountries. It is the official language of the Indian state ofTamil Nadu, and also has official status in Sri Lanka andSingapore and having more than 7 million speakers. Tamilis one of the major languages of the world. The Tamilscript has 12 vowels, 18 consonants and five grantha letters.The script, however, is syllabic and not alphabetic. Thecomplete script, therefore, consists of the 31 letters in theirindependent form, and an additional 216 combinant lettersrepresenting every possible combination of a vowel and aconsonant.

2.3. Telugu script

Telugu, another Dravidian language spoken by about 5million people in the southern Indian state of AndhraPradesh and neighboring states, and also in Bahrain, Fiji,Malaysia, Mauritius, Singapore and the UAE. Telugu is asyllabic language. Similar to most languages of India, eachsymbol in Telugu script represents a complete syllable.Officially, there are 18 vowels, 36 consonants, and threedual symbols. Of these, 13 vowels, 35 consonants are incommon usage.

2.4. Malayalam script

Malayalam is the language spoken predominantly inthe state of Kerala, in southern India. It is one of the 23official languages of India, spoken by around 37 millionpeople. The language belongs to the family of Dravidian

languages. Both the language and its writing system areclosely related to Tamil; however, Malayalam has a scriptof its own. Malayalam language script consists of 51 lettersincluding 16 vowels and 37 consonants. The earlier style ofwriting is now substituted with a new style and this newscript reduces the different letters for typeset from 900 toless than 90.

3. Preprocessing

Preprocessing plays an important role in any OCRsystem. In this section we explain two major preprocessingsteps that are crucial for successful development of OCRsystem: (1) skew estimation and (2) segmentation.The digitized images are in gray tone and we have used

histogram based thresholding approach to convert theminto two-tone images. For a clear document the histogramshows two-prominent peaks corresponding to white andblack regions. The threshold value is chosen as themidpoint of the two-histogram peaks. The two-tone imageis converted into 0-1 labels where the label 1 represents theobject and 0 represents the background.

3.1. Skew estimation

Skew angle detection is an important component of anyOCR and document analysis system. When a document isfed to a scanner either mechanically or by a humanoperator for digitization, it suffers some degrees of skew ortilt. Hence, in this work we have applied skew detectionmethod described in Nagabhushan et al. (2006), which isspecially designed to handle South Indian scripts docu-ments. The technique is based on boundary growing (BG),nearest neighboring clustering (NNC) and moments todetermine the inclination of the scanned document. First,the method excludes those components whose height isvery small. In doing so, dots of the character like ‘i’ and ‘j’,punctuation marks like full stop, comma, hyphen, etc. areremoved. If height of a character is greater than theaverage height, then the character height is reduced toaverage height. In this way, the method obtains heightnormalized character document. Using BG, the methodextracts top-left, bottom-right and centriod coordinatepoints present in each text line. However, the BG alonedoes not give accuracy for South Indian languagedocuments because of the modifiers that often gets pluggedinto the characters. Hence the method uses NNC and BGto extract the coordinates of the compound characters bydrawing a single boundary. The extracted coordinatepoints are then passed to moments to estimate skew angle.Fig. 1(a) shows the sample text lines of four South

Indian scripts. Fig. 1(b) shows the results of applying BGalone, and finally the results of BG+NNC is shown inFig. 1(c). Skew angle obtained for the document shown inFig. 2(a) is 4.891, whereas actual skew angle is 51. Afterdetecting the skew, it is necessary to deskew the document.To deskew the document, nearest neighbor interpolation

ARTICLE IN PRESS

Fig. 1. Results of BG+NNC for sample images of Kannada, Malayalam, Tamil and Telugu.

Fig. 2. Input skewed document and its corresponding deskewed docu-

ment.

Fig. 3. A sample Kannada word is divided into three lines.

Fig. 4. Showing the working procedure of the proposed segmentation

method.

V.N. Manjunath Aradhya et al. / Engineering Applications of Artificial Intelligence 21 (2008) 658–668 661

technique is applied for the document. Result of deskeweddocument is shown in Fig. 2(b).

3.2. Segmentation

Segmentation of a document into lines, and subsequentlyextracting individual characters constitute an important taskin the optical reading of texts. Presently most recognitionerrors are due to character segmentation errors (Bansal andSinha, 2002). To segment the word into individual characters,proposed system also incorporates a segmentation techniqueto segment the individual characters of South Indian andEnglish alphabets. As South Indian script is a non-cursivescript, the individual characters in a word are isolated.Spacing between the characters can be used for segmentation.But, there will not be any zero-valued valleys, due to thepresence of conjunct consonants. Hence, in such cases theusual method of vertical projection profile to separatecharacters fails. To segment the characters of these types,we propose a new technique based on their structure. Theproposed segmentation technique has two stages as follows:

Stage 1: consider the sample word as shown in Fig. 3.The height of the word is divided into three lines: upper,

lower and middle line. Upper line is drawn such that itpasses through the upper most pixels present in theword. In the similar way, the lower line is drawn whichpasses through the lower most pixel present in the word.Finally, the middle line is drawn corresponding to upperand lower line.Stage 2: beginning from the middle line, label aconnected component by searching in the upwarddirection. This search is confined within the image areaencompassed by the upper and middle lines. (ReferFig. 4). If any component is encountered along the middleline of the image, label the corresponding component.Now, within the image area of the previously labeledcomponent continue searching process as mentionedearlier (i.e., in upward direction). If any component isencountered within that area, the current componentalong with the previously labeled component is consideredas one single component. This labeling process iscontinued till we reach the end of the middle line. If anycomponent is present within the middle and lower linethen that component is labeled as modifier. Fig. 5 showsthe final segmentation of the individual characters of thesample word considered, where the numbers indicate theorder by which the components are extracted.

ARTICLE IN PRESS

Fig. 5. Showing the final segmentation of characters.

Fig. 6. Illustration of Fourier frequency spectrums with different

frequency bands.

V.N. Manjunath Aradhya et al. / Engineering Applications of Artificial Intelligence 21 (2008) 658–668662

This algorithm is a general algorithm which is indepen-dent of language and size. The proposed segmentationalgorithm is tested on verity of South Indian scripts andperformed well in all conditions. After successfullysegmenting words into individual characters, each of thesecomponents are scaled to a standard size before featureextraction. In the final implementation, it turned out to 220classes for Kannada, 200 classes for Tamil, 225 classes forTelugu, 85 classes for Malayalam and 62 in case of English.

4. Overview of PCA

Feature extraction is the identification of appropriatemeasures to characterize the component images distinctly.There are many popular methods to extract features.Amongst which PCA is one of the state-of-the art methodsfor feature extraction and data representation techniquewidely used in the area of computer vision and patternrecognition. Using these techniques, an image can beefficiently represented as a feature vector of low dimen-sionality. The features in such subspace provide moresalient and richer information for recognition than the rawimage. In this section, we have briefly explained PCAmethod for the sake of continuity and completeness.

4.1. The PCA method

The PCA technique (Turk and Pentland, 1991) regardseach image as a feature vector in a high dimensional spaceby concatenating the rows of the image and using theintensity of each pixel as a single feature vector. Moreformally, let us consider a set of N sample training images{A1, A2,y,AN} taking values in n-dimensional imagespace, and assume that each belongs to one of c classes{B1, B2,y, Bc}. Now the average matrix A of all trainingsamples has to be calculated then subtracted from theoriginal characters Ai and the result is stored in Fi:

A ¼1

N

XN

i¼1

Ai, (1)

Fi ¼ Ai � A. (2)

In the next step, the covariance matrix C is calculated

according to C ¼ 1=NPNi¼1

FiFTi . Now the eigenvectors

Ui(i ¼ 1,y,N) and the corresponding eigenvalues

li(i ¼ 1,y,N) are calculated. From the above N eigenvec-tors, only k should be chosen corresponding to largesteigenvalues. The highest the eigenvalues, the more char-acteristic features of a character does the particulareigenvector describe. Using the k eigenvectors Uk, featureextraction done by PCA is as follows:

Fi ¼ UTk ðAi � AÞ i ¼ 1; . . . ;N, (3)

4.2. Fourier approach description

Suppose that the original image sample set is X, eachimage matrix is sized m-by-n and expressed by f(x,y), where1pxpm,1pypn and mXn. Assuming there are c knownpattern classes (w1,w2,y,wc) in X. Perform a two-dimen-sional discrete Fourier transform on each image by

F ðu; vÞ ¼1

mn

Xm

x¼1

Xn

y¼1

f ðx; yÞ exp �j2pux

vy

n

� �h i, (4)

where j ¼ffiffiffiffiffiffiffi�1p

, exp() denotes the exponential function,and F(u,v) is sized mxn. Let F(u0,v0) indicates zero-frequency band. Shift F(u0,v0) to the center of imagematrix, that is, the point (m/2,n/2). Since the frequencydomain is represented by the matrix form, we use a squarebox Box (l) to represent the lth frequency band, where0plpm/2. The four vertices of the Box (l) are (u0�l,v0�l),(u0+l,v0�l), (u0�l,v0+l) and (u0+l,v0+l), respectively. So,the lth frequency band denotes:

F ðu; vÞ 2 BoxðlÞ.

The Fourier spectrum with different frequency bandsare shown in Fig. 6. If we select the lth frequency band,retain the values of F(u,v), otherwise set the values of F(u,v)to zero. But, which principle should we follow to select theappropriate bands? Here, we propose a simple two-dimen-sional discriminatory criterion to evaluate the discriminatory

ARTICLE IN PRESSV.N. Manjunath Aradhya et al. / Engineering Applications of Artificial Intelligence 21 (2008) 658–668 663

power of the frequency band and select the appropriatebands:

(i)

Fig.

imag

Use the lth frequency band:

F ðu; vÞ

¼retain original values if F ðu; vÞ 2 BoxðlÞ

0 if F ðu; vÞeBoxðlÞ

(,

ð5Þ

and perform an inverse Fourier transform on thecurrent F(u,v) values as follows:

f ðx; yÞ

¼1

mn

Xm

u¼1

Xn

v¼1

F ðu; vÞ exp j2pux

vy

n

� �h i. ð6Þ

Hence, for all the images in X, we obtain thecorresponding filtered images, which construct a newsamples set Yl. Note that the Yl have the same valuesof the number of samples and number of classes as thatof X. Let W denotes the total mean value of Yl. Nowwith regard to Yl, the scatter matrix is defined asfollows:

S ¼ E½ðY l �W ÞT ðY l �W Þ�. (7)

(ii)

We evaluate the discriminatory power of Yl, J(Yl),using the following judgment:

JðY lÞ ¼ trðSÞ, (8)

where tr() represents the trace of the matrix. Nowthose frequency bands, which maximize the abovecriteria, will have the better discriminatory power.Hence, we chose those frequency bands that maximizeEq. (8), perform the inverse Fourier transform andobtain the band-pass filtered images.

After deriving these band-pass filtered (i.e., prepro-cessed) images we apply PCA (as explained in Section 4.1)to build the corresponding eigenspaces for subsequentrecognition process. NNC is used for subsequent classifica-tion purpose. Here we call, the method applying PCA on Yl

as Fourier-PCA (F-PCA). Fig. 7 displays four originalsamples of images of the character and the correspondingfiltered images.

7. Typical character images (Top) and their corresponding filtered

es (Bottom).

5. Experimental results

In this section, we experimentally evaluate our proposedmethod with a data set containing various characterspertaining to South Indian languages and English. Eachexperiment is repeated 25 times by varying number ofprojection vectors t (where t ¼ 1,y,20, 25, 30, 35, 40, and45). Since t, has a considerable impact on recognitionaccuracy, we chose the value that corresponds to bestclassification result on the image set. All of our experimentsare carried out on a PC machine with P4 3GHz CPU and512MB RAM memory under Matlab 7.0 platform.

5.1. Experimentation on printed South Indian and English

text documents

We have evaluated the performance of the proposedsystem of the recognition process on different realdocuments. The characters are collected in a systematicmanner from the printed pages scanned on a HP 2400Scanjet Scanner. Documents were skew correctedand components were extracted. Subsets from this dataset were used for training and testing. The size of theinput character is normalized to 50� 50. Totally wehave conducted four different types of experimentation:(1) clear and degraded characters (2) scale independence(3) font independence (4) noisy characters. In the firstexperimentation, we considered the font-specific perfor-mance of South Indian languages and English. Weconsidered two fonts each for these languages and weconsidered 50,000 samples for this experiment. Resultsare reported in Table 1. From Table 1 it is noticed that forclear and degraded characters the recognition rate achievedis 99.01%. Some of the misclassifications were present dueto the distortions created by skew correction.Performance of the system is usually influenced by scale

and resolution of the characters. We considered samples ofprinted characters at various sizes and scanned at variousscales for the next experiment. Totally we collected about20,000 samples of different size characters and scannedwith different resolution. From the experimentation therecognition rate achieved for scale independence charactersis about 95%.A font is a set of printable or displayable text characters

in a specific style and size. Recognizing different font whenthe character class is huge is really interesting andchallenging. In this experimentation, we have considereddifferent font independent characters of South Indianscripts and English. Totally, we considered around 37,500samples for font independence and we achieved around92.4%.Noise plays a very important role in the field of

document image processing. To check the robustnessof the proposed method, we conducted series of experi-mentation on noisy characters by varying noise densityfrom 0.01 to 0.5 in step of 0.01. For this, we randomlyselect one character image from each class and generate

ARTICLE IN PRESS

Table 1

Recognition rate of the proposed system and comparative study with PCA method

Experimental details Language Font F-PCA (%) PCA (%)

Clear and degraded Kannada Kailasam 99.01 96.9

Kannada Kasturi

Tamil BRH Tamil

Tamil BRH Tamil RN

Telugu BRH Telugu

Telugu BRH Telugu RN

Malayalam BRH Malayalam

Malayalam BRH Malayalam RN

English Times New Roman

English Arial

Scale independence Kannada Kailasam 95 93.45

Kannada Kasturi

Tamil BRH Tamil

Tamil BRH Tamil RN

Telugu BRH Telugu

Telugu BRH Telugu RN

Malayalam BRH Malayalam

Malayalam BRH Malayalam RN

English Times New Roman

English Arial

Font independence South Indian N.A 92.4 90.8

English N.A

Noisy data Kannada Kailasam 94 92.56

Kannada Kasturi

Tamil BRH Tamil

Tamil BRH Tamil RN

Telugu BRH Telugu

Telugu BRH Telugu RN

Malayalam BRH Malayalam

Malayalam BRH Malayalam RN

English Times New Roman

English Arial

Fig. 8. Some of the sample images of English handwritten characters.

V.N. Manjunath Aradhya et al. / Engineering Applications of Artificial Intelligence 21 (2008) 658–668664

50 corresponding noisy images (here ‘‘salt and pepper’’noise is considered). We tested our proposed system withapproximately 30,000 character images of South Indianscripts and English. We noticed that the proposed systemachieves 94% recognition rate and remained robust tonoise by withstanding upto 0.3 noise density.

We also compared our proposed method with Eigen-character (PCA) based technique (Jawahar et al., 2003).Table 1 shows the performance accuracy of the PCA basedtechnique with clear and degraded characters, scaleindependence, font independence, and noisy characters.From Table 1 it is clear that the proposed Fourier-PCAbased technique performs well for all the conditionsconsidered.

5.2. Experimentation on handwritten characters

Handwritten recognition is an active topic in OCRapplication and pattern classification/learning research. InOCR applications, English alphabets recognition is dealtwith postal mail sorting, bank cheque processing, formdata entry, etc. For these applications, the performance ofhandwritten English alphabets recognition is crucial to the

overall performance. In this work, we have also extendedour experiment to handwritten characters of Englishalphabets. For experimentation, we considered samplesfrom 200 individual writers and total of 12,400 characterset is considered. Some of the sample images of hand-written characters are shown in Fig. 8. We trained thesystem by varying the training sample number by 50, 75,125, 175 and remaining samples of each character class areused during testing. Table 2 shows the best recognitionaccuracy obtained from the proposed system and the PCA

ARTICLE IN PRESSV.N. Manjunath Aradhya et al. / Engineering Applications of Artificial Intelligence 21 (2008) 658–668 665

based technique for varying number of samples. FromTable 2 it is clear that the highest recognition rate achievedis 93.8% when 175 samples are used for training andremaining 25 samples used for testing.

6. Conclusion

In this paper, the potentiality of PCA method is furtherexplored by combining it with Fourier transform. A simplecriterion for selecting appropriate Fourier frequency bandsis proposed. After choosing the appropriate frequencybands, we have performed inverse Fourier transform toobtain alternate band-pass filtered (preprocessed) trainingset. We have then applied conventional PCA scheme for

Table 2

Best recognition accuracy for handwritten characters

No. of training samples/

character class

Best recognition accuracy

F-PCA (%) PCA (%)

50 59 57.5

75 79.6 76.7

125 88.9 86

175 93.8 90.8

Fig. A.2. Kannad

Fig. A.1. Kannada vow

Fig. A.3. A few examples o

subsequent recognition purpose. This preprocessing stepmakes the training image set more salient by fading outthe unimportant features and enhancing the essentialvisual information, which will be useful for recognitiontask. The proposed system recognizes all basic characters,vowel–consonants combinations, and modifiers of SouthIndian script texts and upper case, lower case, andnumerals in case of English language with an averagerecognition accuracy of 95.1%. The system is also testedon English handwritten characters and achieved recogni-tion accuracy of about 93.8%. The system is tested onvariety of images containing noise, sizes, mulit fonts anddegraded characters. Hence, our proposed work hasgreater advantages and meets the current challenges inthe field of OCR. We have also compared our systemwith conventional PCA method and ascertained thesuperiority of the proposed method over PCA method inall aspects. Further, we plan to extend this work for otherIndian scripts.

Appendix A. South Indian scripts

Kannada script: Figs. A.1–A.3; Tamil script: Figs. A.4and A.5; Telugu script: Figs. A.6–A.8; Malayalam script:Figs. A.9–A.11.

a consonants.

els and its diacritics.

f conjunct consonants.

ARTICLE IN PRESS

Fig. A.4. Tamil vowels and diacritics.

Fig. A.5. Tamil consonants.

Fig. A.6. Telugu vowels and its diacritics.

Fig. A.7. Telugu consonants.

V.N. Manjunath Aradhya et al. / Engineering Applications of Artificial Intelligence 21 (2008) 658–668666

ARTICLE IN PRESS

Fig. A.9. Malayalam vowels and its diacrtitics.

Fig. A.8. Example of some of the conjunct consonants.

Fig. A.10. Malayalam consonants.

Fig. A.11. Example of some of the conjunct consonants and its diacritics.

V.N. Manjunath Aradhya et al. / Engineering Applications of Artificial Intelligence 21 (2008) 658–668 667

ARTICLE IN PRESSV.N. Manjunath Aradhya et al. / Engineering Applications of Artificial Intelligence 21 (2008) 658–668668

References

Ashwin, T.V., Sastry, P.S., 2002. A font and size-independent OCR system

for printed Kannada documents using support vector machines.

Sadhana 27 (Part 1), 35–58.

Bansal, V., Sinha, R.M.K., 2002. Segmentation of touching and fused

Devnagari characters. Pattern Recognition 35, 593–875.

Bozinovic, R.M., Srihari, S.N., 1989. Offline cursive script word

recognition. IEEE Transactions on Pattern Recognition and Machine

Intelligence 11, 68–83.

Casey, R., Nagy, G., 1968. Automatic reading machine. IEEE Transac-

tions on Computers. 17, 492–503.

Chaudhuri, B.B., Pal, U., 1997. An OCR system to read two Indian

language scripts: Bangla and Devnagari (Hindi). Proceedings of

ICDAR, 1011–1015.

Chaudhuri, B.B., Pal, U., 1998. A complete Bangla OCR system. Pattern

Recognition 31 (5), 531–549.

Chaudhuri, B.B., Pal, U., Mitra, M., 2002. Automatic recognition of

printed Oriya script. Sadhana 27 (Part 1), 23–34.

Govindan, V.K., Shivaprasad, A.P., 1990. Character recognition—

a survey. Pattern Recognition 23, 671–683.

Hanmandlu, M., Mohan, A., Goyal, C., Roy, D., 2003. Unconstrained

handwritten character recognition based on fuzzy logic. Pattern

Recognition 36 (3), 603–623.

Indian language document analysis and understanding, 2002. Special issue

of Sadhana, February.

Jawahar, C.V., Pavan Kumar, M.N.S.S.K., Ravi Kiran, S.S., 2003.

A bilingual OCR for Hindi–Telugu documents and its applications.

In: Proceedings of ICDAR, 3–6 August, Edinburgh, pp. 656–660.

Kahan, S., Pavildis, T., Baird, H.S., 1987. On the recognition of printed

character of any font and size. IEEE Transactions on Pattern Analysis

and machine Intelligence 9, 274–288.

Kasturi, R., Gorman, L.O., Govindaraju, V., 2002. Document image

analysis: a primer. Sadhana 27 (Part 1), 3–22.

Mori, S., Suen, C.Y., Yamamoto, K., 1992. Historical review of

OCR research and development. Proceedings of the IEEE 80,

1029–1058.

Nagabhushan, P., Hemantha Kumar, G., Shivakumara, P., Manjunath

Aradhya, V.N., 2006. Skew estimation by improved boundary growing

for text documents in South Indian languages. Journal of

Vivek Special Issue on Document Analysis of Indian Scripts 16 (2),

16–21.

Negi, A., Bhagvati, C., Krishna, B., 2001. An OCR system for Telugu. In:

Proceedings of ICDAR, 10–13 September, Seattle, pp. 1110–1113.

Pal, U., Chaudhuri, B.B., 2004. Indian script character recognition: a

survey. Pattern Recognition 37, 1887–1899.

Pal, U., Sarkar, A., 2003. Recognition of printed Urdu Script. In:

Proceedings of ICDAR, 3–6 August, Edinburgh 2003, pp. 598–602.

Seethalakshmi, R., Sreeranjani, T.R., Balachandar, T., 2005. Optical

Character recognition for printed Tamil text using unicode. Journal of

Zhejiang University Science 6A (11), 1297–1305.

Turk, M., Pentland, A., 1991. Eigenfaces for recognition. Journal of

Cognitive Neuroscience 3 (1), 71–86.

Vasantha Lakshmi, C., Patvardhan, C., 2002. A multi-font OCR system

for printed Telugu text. Proceedings of the Language Engineering

Conference (LEC), 1–17.

Wang, K.Y., Casey, R.G., Wahl, F.M., 1982. Document analysis system.

IBM Journal of Research and Devlopment 26, 647–656.