H. Bersini and J. Carneiro (Eds.): ICARIS 2006, LNCS 4163, pp. 256 – 266, 2006.

© Springer-Verlag Berlin Heidelberg 2006

Recognition of Handwritten Indic Script Using Clonal Selection Algorithm

Utpal Garain¹, Mangal P. Chakraborty¹, and Dipankar Dasgupta²

1 Indian Statistical Institute, 203, B.T. Road, Kolkata 700108, India

2 The University of Memphis, Memphis, TN 38152

[email protected], [email protected]

Abstract. This work explores the potential of a clonal selection algorithm in pattern recognition (PR). In particular, a retraining scheme for the clonal selection algorithm is formulated for better recognition of handwritten numerals (a 10-class classification problem). An empirical study with two datasets (each containing about 12,000 handwritten samples of the 10 numerals) shows that the proposed approach exhibits very good generalization ability. Experimental results show an average recognition accuracy of about 96%. The effect of control parameters on the performance of the algorithm is analyzed, and the scope for further improvement in recognition accuracy is discussed.

Keywords: Clonal selection algorithm, character recognition, Indic scripts,

handwritten digits.

1 Introduction

Several immunological metaphors are now being used, in a piecemeal fashion, for designing Artificial Immune Systems (AIS) [1]. These approaches can be broadly classified into three groups, namely immune network models [2], negative selection algorithms [3], and clonal selection algorithms [4]. This paper investigates a new training approach for the clonal selection algorithm (CSA) and its application to character recognition. Earlier, CSA was used for a 2-class problem to discriminate pairs of similar character patterns [5]; the present study extends it to an m-class classification problem.

Training in CSA has so far been modeled as a one-pass method in which each antigen undergoes a single training phase. Once training on all antigens is over, an immune memory is produced and used for solving the classification problem (as in [5] and [6]). Our work presents a new training algorithm in which a refinement phase fine-tunes the initial immune memory built by the single-pass training. In the refinement stage, training of an antigen depends on its recognition score: incorrect recognition of an antigen triggers further training. This process continues until the immune system suffers from negative learning or becomes over-learned.

Recognition of handwritten Indic numerals has been considered to study the performance of the modified CSA. Because of its numerous applications in postal automation, bank check reading, etc., document image analysis researchers have been studying the problem for the last several years, and a number of methods have been proposed. While some of these are biologically inspired approaches such as neural networks [7] and genetic algorithms [8], AIS approaches have remained unexplored for this application, even though AIS techniques have been applied to several pattern recognition problems [9-14].

The rest of the paper is organized as follows. Section 2 describes the CSA with the proposed retraining scheme. Section 3 provides the experimental details and reports results highlighting the performance of the CSA in classifying handwritten numerals. This section also compares the performance of the new retraining scheme with that of the previously used single-pass approach. In addition, Section 3 discusses the effect of the CSA control parameters on its performance, and Section 4 provides some concluding remarks.

2 Classification Using Clonal Selection Algorithm

Let AG represent a set of training data (antigens) and let agi represent an individual member of this set: AG = {ag1, ag2, …, agk}. Each agi has two attributes: a class, agi.c ∈ C = {c1, c2, …, cn} (n = 10 for digit classification), and a feature vector, agi.f. Let the immune memory be IM = {m1, m2, …, mm}, where mi is a memory cell having two attributes similar to those of an individual antigen: for any mi, mi.c ∈ C = {c1, c2, …, cn} is the class information and mi.f is the feature vector.

Binary images of handwritten numerals are first size-normalized into a 48x48 matrix, each element of which is binary. This matrix is used as the feature map for the experiments. The similarity between two such feature matrices, S(F1, F2), is a measure of the auto-correlation coefficient between F1 and F2, as defined below:

$$S(F_1, F_2) = \frac{1}{2} - \frac{s_{10}\,s_{01} - s_{00}\,s_{11}}{2\sqrt{(s_{11}+s_{10})(s_{01}+s_{00})(s_{11}+s_{01})(s_{10}+s_{00})}} \qquad (1)$$

where s00, s11, s01, and s10 denote the number of zero matches, one matches, zero mis-

matches, and one mismatches, respectively. It is to be noted that S gives values in the

range [0, 1], where 1 indicates the highest and 0 signifies the lowest similarity be-

tween two samples. We used this metric to measure similarity/affinity during anti-

body-antibody or antigen-antibody interactions.
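As an illustration, the similarity measure of equation (1) can be sketched in a few lines of Python. This is a minimal sketch and not the authors' implementation: the function name `similarity` is hypothetical, the formula is written in the equivalent form 1/2 + (s11·s00 − s10·s01) / (2·sqrt(…)), and the value returned when the denominator vanishes (one of the matrices is all zeros or all ones) is an assumption, since the text does not cover that case.

```python
import math

def similarity(f1, f2):
    """Correlation-based similarity between two binary matrices (lists of rows)."""
    a = [bool(v) for row in f1 for v in row]
    b = [bool(v) for row in f2 for v in row]
    if len(a) != len(b):
        raise ValueError("feature matrices must have the same number of elements")
    # Count one matches, zero matches and the two kinds of mismatches.
    s11 = sum(1 for x, y in zip(a, b) if x and y)
    s00 = sum(1 for x, y in zip(a, b) if not x and not y)
    s10 = sum(1 for x, y in zip(a, b) if x and not y)
    s01 = sum(1 for x, y in zip(a, b) if not x and y)
    denom = (s11 + s10) * (s01 + s00) * (s11 + s01) * (s10 + s00)
    if denom == 0:
        # Assumption: the measure is undefined when a matrix is constant;
        # return the neutral mid-point of the [0, 1] range.
        return 0.5
    return 0.5 + (s11 * s00 - s10 * s01) / (2.0 * math.sqrt(denom))
```

Under this sketch, two identical (non-constant) matrices give a similarity of 1 and a matrix compared with its complement gives 0, in agreement with the range stated above.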

Training has two phases: Phase-I is the same as was used in [6], while Phase-II in-

corporates a refinement process. Phase-I involves three stages namely, initialization

of immune memory, clone generation, and selection of clones to update the immune

memory. These stages are briefly discussed below.

Initialization: This stage chooses some antigens as initial memory cells to initialize the immune memory. In the present study, only one antigen from each class is randomly chosen to initialize the immune memory (IM). Note that the number of initial cells has a certain effect on the system’s performance, as illustrated in [6].

Clone generation: For a given antigen agi, its closest match (say, mi) is first chosen from the existing IM as follows:

stim(agi, mi) ≥ stim(agi, mj), for all j ≠ i and mj.c=agi.c (2)

The function stim() measures the response of a b-cell to an antigen or to another b-cell and is directly proportional to the similarity between the feature matrices as defined in equation (1). After a memory cell mi (renamed mmatch) is selected for a training antigen agi, mmatch goes through a proliferation process (Proliferation-I), known as somatic hyper-mutation, that generates a number of clones of mmatch. The exact number of clones is determined by three parameters, namely (i) the hyper-mutation rate, (ii) the clonal rate and (iii) stim(agi, mmatch). Note that the first two parameters are user-defined.

Each clone is produced through mutation (controlled by MUTATION_RATE, a

user defined parameter) at selected sites of mmatch’s feature matrix. No clone is an

exact copy of mmatch. The algorithms for Proliferation-I and the generation of mutated

clones are outlined in Algorithm-I and II, respectively. These algorithms are similar to

the ones described in [6]. On completion of hyper-mutations, a stimulation value is

computed for each element bj ∈ B as stim(bj, agi). Here bj denotes an individual b-cell

clone and B represents the entire cloned population.

In order to minimize the computational cost of generating clones, a modified version of the resource limitation policy [15] is incorporated. The modified version considers only the clones recently generated for the current antigen undergoing the (maturation) training process. It does not consider clones generated for previous antigens, i.e. the present implementation allocates the entire resource pool to the current antigen’s class only.

The stopping criterion defined in equation (3) is used to terminate the training on an antigen agi. If this criterion is not met, further proliferation of the existing b-cells (i.e. those that survived resource limitation) is invoked. In this stage (Proliferation-II), each surviving b-cell bj is proliferated to produce a number of clones determined by the resources allocated to it. The Proliferation-II process is similar to the one for Proliferation-I outlined in Algorithm-I, except for the calculation of the number of clones to be generated from each surviving b-cell bj. This number is determined only by the CLONAL_RATE and stim(agi, bj).

$$\frac{1}{|B|}\sum_{j=1}^{|B|} b_j.\mathit{stim} \;>\; \text{STIMULATION\_THRESHOLD} \qquad (3)$$

Algorithm I. Hyper-mutation/Proliferation-I

Let B be the set of b-cell clones to be created by somatic hyper-mutation starting from mmatch.
Initially B = {mmatch}.
Let Nc denote the number of clones, calculated as
Nc ← HYPER_MUTATION_RATE * CLONAL_RATE * stim(agi, mmatch)
While (|B| ≤ Nc)
Do
    mut ← false // mut is a Boolean variable
    Call mutate(mmatch, mut)
    Let bj denote the mutated clone of mmatch
    If (mut) Then B ← B ∪ {bj}
Done


Algorithm II. Production of Mutated Clones

mutate(x, flag) {
    For each binary feature element (i, j) in x.f // note that x.f is a matrix
    Do
        Generate a random number r in [0, 1]
        If (r < MUTATION_RATE) Then
            x.fi,j ← toggle(x.fi,j)
            flag ← true
        Endif
    Done
}
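Algorithms I and II can be sketched together in Python as follows. This is only an illustrative sketch under stated assumptions: the names `mutate` and `proliferate` and their signatures are hypothetical, each clone is produced as a mutated copy so that mmatch itself stays unchanged, Nc is truncated to an integer (the text does not say how it is rounded), and the default parameter values are the ones reported in Section 3.

```python
import random

HYPER_MUTATION_RATE = 2   # value reported in Section 3
CLONAL_RATE = 10          # value reported in Section 3
MUTATION_RATE = 0.008     # value reported in Section 3

def mutate(features, rate, rng):
    """Return (clone, changed): a copy of the binary matrix with each
    element toggled with probability `rate`, and a flag telling whether
    at least one element was toggled."""
    clone = [row[:] for row in features]
    changed = False
    for row in clone:
        for j in range(len(row)):
            if rng.random() < rate:
                row[j] = 1 - row[j]   # toggle 0 <-> 1
                changed = True
    return clone, changed

def proliferate(m_match, stim_value, rng, mutation_rate=MUTATION_RATE):
    """Proliferation-I: grow the clone set B from m_match until |B| > Nc."""
    if mutation_rate <= 0:
        raise ValueError("mutation_rate must be positive, otherwise no clone can differ")
    # Assumption: Nc is truncated to an integer.
    n_clones = int(HYPER_MUTATION_RATE * CLONAL_RATE * stim_value)
    clones = [m_match]                       # initially B = {m_match}
    while len(clones) <= n_clones:
        clone, changed = mutate(m_match, mutation_rate, rng)
        if changed:                          # exact copies are discarded
            clones.append(clone)
    return clones
```

For example, with stim(agi, mmatch) = 0.5 the sketch computes Nc = 2 × 10 × 0.5 = 10, so the loop stops once B holds mmatch plus 10 mutated clones.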

Clone selection and update of immune memory: Once the training criterion in equation (3) is met for an antigen, the most stimulated (w.r.t. the current antigen undergoing training) b-cell among the surviving ones is selected as a candidate (denoted bcandidate) to be inserted into the immune memory. This process is outlined in Algorithm III, which is similar to the one in [6]. The algorithm makes use of two parameters: AS (average stimulation) and α (a scalar value). The parameter α is user-defined, whereas AS is measured from the input training antigen set as the average stimulation between all pairs of the mean values of the antigen classes.

Algorithm III. Update of immune memory

CandStim ← stim(agi, bcandidate)
MatchStim ← stim(agi, mmatch)
CellAff ← stim(mmatch, bcandidate)
If (CandStim > MatchStim)
    IM ← IM ∪ {bcandidate} // insertion into the immune memory
    If (CellAff > α × AS)
        IM ← IM – {mmatch} // memory replacement
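The update rule of Algorithm III can be written as a short Python function. This is a sketch only: `update_memory` and its arguments are hypothetical names, the memory is assumed to be a plain list of cells that support equality comparison, `stim` is any similarity function passed in by the caller, and the replacement test is assumed to be nested inside the insertion test, as in the AIRS procedure of [6].

```python
def update_memory(memory, ag, m_match, b_candidate, stim, alpha, avg_stim):
    """Insert b_candidate into memory if it matches the antigen better than
    m_match does; additionally drop m_match when the two cells are very
    similar to each other (memory replacement)."""
    cand_stim = stim(ag, b_candidate)
    match_stim = stim(ag, m_match)
    cell_aff = stim(m_match, b_candidate)
    if cand_stim > match_stim:
        memory.append(b_candidate)          # insertion into the immune memory
        # Assumption: replacement is only considered after an insertion.
        if cell_aff > alpha * avg_stim:
            memory.remove(m_match)          # memory replacement
    return memory
```

With a larger α the replacement condition is harder to satisfy, so fewer memory cells are removed and the immune memory tends to grow; this is one way to read the role of the affinity threshold scalar studied in Section 3.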

Phase-II of the training algorithm: Note that the training in Phase-I is a one-pass

method i.e. the system is trained only once on a training antigen. At the end of the

training phase, the immune memory i.e. IM0={m1, m2, …, mm} is produced. In the

present implementation, training involves a second phase namely Phase-II that

employs a refinement process. In this method recognition and training go hand in

hand to obtain a better immune memory from its initial version i.e. IM0.

In this phase, recognition of all the training antigens is done first using the immune memory (IMi, i = 0, 1, …) obtained in the previous stage (say, the i-th stage). The classification strategy outlined next is used for recognition of antigens, and the recognition accuracy is noted. Next, the antigens that are incorrectly classified act as bootstrap samples and undergo further training involving clone generation, selection and updating of the immune memory, as outlined above in Phase-I of the training. This results in an updated immune memory (IMi+1), which is then used for classification of all the training antigens. This newer version is retained if it yields better recognition accuracy than IMi. Otherwise, IMi is reloaded and Phase-II terminates.
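The control flow of Phase-II can be summarized by the following Python sketch. It is not the authors' code: `refine`, `train_on` and `classify` are hypothetical names, the Phase-I training of a single antigen and the classifier are passed in as callables, and stopping as soon as no antigen is misclassified is an added assumption.

```python
import copy

def refine(memory, antigens, train_on, classify):
    """Phase-II refinement.

    memory   : initial immune memory IM0 (any deep-copyable object)
    antigens : list of (features, label) training pairs
    train_on : callable(memory, antigen) that updates memory in place
    classify : callable(memory, features) returning a predicted label
    Returns the retained memory and its accuracy on the training set.
    """
    def accuracy(mem):
        hits = sum(1 for f, c in antigens if classify(mem, f) == c)
        return hits / len(antigens)

    best_acc = accuracy(memory)
    while True:
        # Misclassified antigens act as bootstrap samples.
        wrong = [(f, c) for f, c in antigens if classify(memory, f) != c]
        if not wrong:
            break                      # assumption: nothing left to retrain
        candidate = copy.deepcopy(memory)
        for ag in wrong:
            train_on(candidate, ag)
        acc = accuracy(candidate)
        if acc <= best_acc:
            break                      # IM_i is kept and Phase-II terminates
        memory, best_acc = candidate, acc
    return memory, best_acc
```

Working on a deep copy is one simple way to make "reloading IMi" trivial: the previous memory is never modified, so it can be returned as is when the newer version does not improve the accuracy.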

It is observed that for a few iterations of Phase-II, newer versions of the immune memory continue to produce better recognition accuracy, after which the accuracy degrades, signaling negative learning (or over-learning) in the system. In fact, instead of the training antigen set, a separate validation set could be used in this refinement phase. This modification will be considered in a future extension of the present study.

Classification strategy: Classification is implemented by a k-nearest neighbor (k-NN) approach. For a target antigen ag, the k (an odd number) memory cells closest to ag are selected from the immune memory IM. Closeness is measured by the stim function, i.e. stim(ag, mi) for all mi ∈ IM. Next, the k selected mi’s are grouped by their class labels. The class of the largest group identifies ag (a majority-voting strategy).
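The voting step can be illustrated with a short Python sketch. The function name `knn_classify` and the representation of the memory as a list of (features, label) pairs are hypothetical. Returning `None` when two classes tie for the largest group is one possible reading of the rejection rule described in Section 3 and should be treated as an assumption.

```python
from collections import Counter

def knn_classify(ag_features, memory, stim, k=5):
    """Classify an antigen by majority vote among its k most stimulated
    memory cells. `memory` is a list of (features, label) pairs and
    `stim` is a similarity function (higher means closer)."""
    if not memory:
        raise ValueError("immune memory is empty")
    ranked = sorted(memory, key=lambda cell: stim(ag_features, cell[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k]).most_common()
    # Assumption: a tie for the largest group is reported as a rejection.
    if len(votes) > 1 and votes[0][1] == votes[1][1]:
        return None
    return votes[0][0]
```

In the experiments, `stim` would be the similarity of equation (1) applied to 48x48 binary matrices; any function returning larger values for closer cells can be substituted when trying out the sketch.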

3 Experimental Details

Two different datasets (DS1 and DS2) [16] have been used to test the proposed classification approach based on the clonal selection algorithm (CSA). DS1 and DS2 contain samples of handwritten numerals in two major Indic scripts, namely Devanagari (Hindi) and Bengali, respectively. Unlike for English, Chinese, Japanese, etc., studies in Indic script handwriting recognition are rare, and this provides additional motivation for the present work to deal with datasets of handwriting in Indian languages. Moreover, datasets consisting of a large number of samples of handwritten digits in Indic scripts have recently become available in the public domain [16], which facilitates training and testing of an approach and comparing it with other competing methods.

Both datasets contain real samples collected from different kinds of handwritten documents such as postal mail, job application forms, railway ticket reservation forms, passport application forms, etc. For our experiment, each dataset consists of 12,000 samples (an equal number of samples for each class). DS1 samples are randomly selected from a collection of 22,556 Devanagari numerals written by 1049 persons, and DS2 samples are taken from a set of 12,938 Bengali numerals written by 556 persons. Some samples for each digit class are shown in Fig. 1. The datasets are divided into six equal-sized partitions. Training is conducted on samples from five partitions and classification is tested on the sixth partition. This realizes a six-fold experiment that results in six test runs. The results reported next are averaged over these six runs.
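The six-fold protocol can be sketched as below. The helper name `six_fold_splits` is hypothetical, and shuffling the samples before partitioning is an assumption; the text does not state how the partitions were formed, and the sketch does not enforce an equal number of samples per class in each partition.

```python
import random

def six_fold_splits(samples, n_folds=6, seed=0):
    """Yield (train, test) pairs: each of the n_folds equal-sized partitions
    serves once as the test set while the others form the training set."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)          # assumption: random partitioning
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        test = [samples[i] for i in folds[held_out]]
        train = [samples[i]
                 for f, fold in enumerate(folds) if f != held_out
                 for i in fold]
        yield train, test
```

With 12,000 samples per dataset, each run trains on 10,000 samples and tests on the remaining 2,000.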

Experiments are carried out under two different training policies: L1, in which training is single pass, and L2, the proposed method that employs the refinement process. Recognition accuracies under these two policies are reported in Table 1; L2 outperforms L1 by about three percentage points on both datasets. However, L2 generates a larger immune memory than L1 (roughly 33 to 41% more memory cells). A significant difference is observed in the time required for training. On a Pentium-IV (733 MHz, 128 RAM) PC, L1 takes considerably less CPU time than L2, which involves the additional refinement phase. However, there is hardly any difference in the time needed for classification by the two approaches. The system can classify about 50 characters per second. Absolute times taken during training and testing are outlined in Table 2 below.


Fig. 1. Hundred random samples from the dataset of Bengali handwritten numerals

The performance of the proposed refinement stage is studied to check how rapidly the system attains the maximum classification rate on the training set. In fact, training terminates at the first local maximum; at present, the system does not attempt to find the global one. The response of the additional training module is shown in Fig. 2 for the dataset DS1. Similar behaviour is obtained for the other dataset.

In Fig. 2, note that the recognition accuracy gradually increases until the 8th iteration, after which the accuracy degrades and training terminates. The number of antigens undergoing training in each pass is also plotted as a line curve in Fig. 2. Note that iteration 0 represents the initial Phase-I training, where all 10,000 antigens were trained.

Table 1. Recognition accuracies and size of immune memory with two different training algorithms

Dataset    Recognition accuracy      Size of immune memory
           L1         L2             L1        L2
DS1        93.31%     96.23%         912       1283
DS2        92.57%     95.68%         1103      1472

Table 2. CPU time for training and classification using two different training algorithms

Dataset    Time to train                Classification speed (#characters per second)
           L1            L2             L1        L2
DS1        5 H 14 Min    7 H 05 Min     52        49
DS2        5 H 19 Min    7 H 22 Min     51        47


Fig. 2. Performance analysis of the bootstrap module

Next, the effects of the parameters are studied for two different measures: (i) recognition accuracy and (ii) size of the immune memory. Results are reported here for the new training algorithm. Very similar effects have been observed on both datasets, and results on DS1 are shown in Fig. 3. Finally, the effect of k in k-nearest neighbour classification is examined, and it is observed that k = 5 gives the best performance. Recognition accuracies for different values of k are shown in Fig. 4. The overall results reported in Table 1 are obtained with k = 5, stimulation threshold = 0.89, number of resources = 400, mutation rate = 0.008, affinity threshold scalar α = 0.4, hyper-mutation rate = 2 and clonal rate = 10 (the last two parameters are used in Algorithm-I of Section 2).

Classification results are further grouped into three classes, correct: a sample is

properly classified; incorrect: a sample is wrongly classified, and reject: the system

cannot classify a sample. A rejection is reported when no single class gets majority

among the k choices returned by the classifier. Table 3 presents the average classifica-

tion results taking these three aspects into consideration.


Table 3. Classification results

Dataset    % correct    % incorrect    % reject
DS1        96.23        2.14           1.63
DS2        95.68        2.44           1.88

Fig. 3. Effect of different parameters on recognition accuracy and size of immune memory: (a)

stimulation threshold (refer equation (3)), (b) number of resources used for resource limitation,

(c) Mutation rate (refer Algorithm-II), and (d) Affinity threshold scalar, α as used in Algo-

rithm-III

Fig. 5 presents the class-wise classification rates. The digit ‘0’ attains the highest recognition score in both scripts. On the other hand, samples of the digit ‘2’ in Hindi and the digit ‘9’ in Bengali result in the lowest classification rates, 89.32% and 90.52%, respectively. Study of the confusion matrix identifies several similar-shaped character pairs. For example, many samples of the digits ‘1’ and ‘2’ in the Hindi dataset and of the digits ‘1’ and ‘9’ in the Bengali dataset were confused during classification. Some post-processing can be employed to discriminate such confusion pairs. In this context, a previous study [5] reported the promising ability of an AIS-based approach to discriminate similar-shaped character pairs. The same approach can also be employed here to further improve the classification accuracy. Such a multi-level recognition scheme is considered a future extension of the present study.


Comparison with other existing studies: As mentioned earlier, there are many studies on recognition of handwritten digits in English and Oriental scripts. However, there are only a few reports on Indic scripts. A recent study [17] makes use of a fuzzy-model-based recognition scheme and reports a recognition accuracy of about 95% on a dataset containing about 3500 handwritten samples of Devanagari digits. The study in [18] used a neural network as the classifier and achieved an accuracy of 93.26% on the same dataset used here for recognition of handwritten Bengali digits.

Fig. 4. Recognition accuracies using k nearest neighbor approach with different k values

Fig. 5. Class-wise recognition accuracies


Compared to these approaches and results, the proposed AIS-based method can be viewed as a potential alternative. However, note that none of these studies employs the same feature set. The authors of [17] use grid-based features and [18] considers wavelet coefficients as features, whereas a size-normalized binary image array has been used as the feature in the present study. The distance measure also differs from one study to another. Therefore, a direct comparison requires replication of these experiments using a uniform feature set and the same distance measure. Our future study will consider this aspect to bring out a judicious comparison between an AIS-based framework and other approaches using different learning paradigms.

4 Conclusions

This paper presents an application of a clonal selection algorithm to the recognition of handwritten Indic numerals. In particular, a two-phase clonal selection algorithm implementing a retraining scheme is proposed, and experiments using different datasets are performed. The reported results show that this new method outperforms the previously used single-pass method. Overall classification performance shows that this method compares well with existing approaches: the proposed scheme achieves a recognition accuracy of about 96%, which is comparable to that of previous approaches.

This study uses a feature vector and a simple distance measure to explore the feasi-

bility of an AIS-based approach as an alternative classification tool. Since encourag-

ing results have been obtained in this experiment, future extension of this study would

include examination of different feature sets and distance measures to further improve

the recognition accuracy.

References

1. D. Dasgupta, Z. Ji, and F. Gonzalez, “Artificial immune system (AIS) research in the last five years,” in Congress on Evolutionary Computation (CEC’03), 2003, Volume: 1, pp. 123-130.

2. Zheng Tang, Koichi Tashima, and Qi P. Cao, “Pattern recognition system using a clonal

selection-based immune network,” Systems and Computers in Japan, Volume 34, Issue 12,

pp. 56 - 63, 2003.

3. Z. Ji and D. Dasgupta, “Real-valued negative selection algorithm with variable-sized

detectors,” in LNCS 3102, Proceedings of GECCO, pages 287–298, 2004.

4. L. N. d. Castro and F. J. V. Zuben, “Learning and Optimization Using the Clonal Selection

Principle,” IEEE Transactions on Evolutionary Computation, Special Issue on Artificial

Immune Systems, vol. 6, pp. 239-251, 2002.

5. U. Garain, M. P. Chakraborty, D. Dutta Majumder, “Improvement of OCR Accuracy by

Similar Character Pair Discrimination: an Approach based on Artificial Immune System,”

to be presented in the 18th Int. Conf. on Pattern Recognition (ICPR), August 2006,

Hongkong.

6. A.B. Watkins, “AIRS: a resource limited artificial immune classifier,” Master’s

dissertation, Dept. of Computer Science, Mississippi State University, 2001.


7. Keith Price Bibliography on use of Neural Networks for recognition of Numbers and

Digits at http://iris.usc.edu/Vision-Notes/bibliography/char1019.html

8. C. de Stefano, A. Della Cioppa, and A. Marcelli, “Handwritten Numeral Recognition

by Means of Evolutionary Algorithms,” in Proc. of the 5th Int. Conf. on Document Analy-

sis and Recognition (ICDAR), Bangalore, India, page: 804-808, 1999.

9. J. H. Carter, “The Immune System as a model for Pattern Recognition and classification,”

Journal of the American Medical Informatics Association. Vol. 7, no. 3, pp.28-41, 2000

10. L. N. de Castro and J Timmis, “Artificial Immune Systems: A Novel Approach to Pattern

Recognition,” in Artificial Neural Networks in Pattern Recognition (Eds. L Alonso J

Corchado and C Fyfe), pp. 67-84. University of Paisley, January 2002.

11. S. Forrest, B. Javornik, R. E. Smith and A. S. Perelson, “Using genetic algorithms to ex-

plore pattern recognition in the immune system,” in Evolutionary Computation 1:3,

pp. 191-211, 1993.

12. Jennifer A. White and Simon M. Garrett, “Improved Pattern Recognition with Artificial

Clonal Selection,” in the Proc. of 2nd Int. Conf. on Artificial Immune Systems (ICARIS),

September 1-3, 2003, Napier University, Edinburgh, UK.

13. Y. Cao and D. Dasgupta, “An Immunogenetic Approach in Chemical Spectrum Recogni-

tion,” Advances in Evolutionary Computing (Eds. Ghosh & Tsutsui), Chapter 36,

Springer-Verlag, January 2003.

14. Tarakanov and V. Skormin, “Pattern Recognition by Immunocomputing,” in the proceed-

ings of the special sessions on artificial immune systems in Congress on Evolutionary

Computation, 2002 IEEE World Congress on Computational Intelligence, Honolulu,

Hawaii, May 2002.

15. J. Timmis, “Artificial Immune Systems: a novel data analysis techniques inspired by the

immune network theory,” PhD Thesis, University of Wales, Aberystwyth, 2001.

16. U. Bhattacharya and B. B. Chaudhuri, “Databases for research on recognition of handwrit-

ten characters of Indian scripts,” in Proc. of the 8th Int. Conf. on Document Analysis and

Recognition (ICDAR), Seoul, Korea, vol. II, page: 789-793, 2005.

17. M. Hanmandlu and O.V. Ramana Murthy, “Fuzzy Model Based Recognition of

Handwritten Hindi Numerals,” Proc. Int. Conf. on Cognition and Recognition, Dec. 2005,

pp. 490-496. http://www.studentprogress.com/appln/colleges/cogrec/

18. U. Bhattacharya, T. K. Das, A. Dutta, S. K. Parui, and B. B. Chaudhuri, “A Hybrid scheme

for handwritten numeral recognition based on Self Organizing Network and MLP,” in Int.

J. on Pattern Recognition and Artificial Intelligence (IJPRAI), Volume 16, pp. 845-864,

2002.
