
Soliman MAMA, Abo-Bakr RM. Linearly and quadratically separable classifiers using adaptive approach. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 26(5): 908–918 Sept. 2011. DOI 10.1007/s11390-011-0188-x

Linearly and Quadratically Separable Classifiers Using Adaptive Approach

Mohamed Abdel-Kawy Mohamed Ali Soliman^1 and Rasha M. Abo-Bakr^2

^1 Department of Computer and Systems Engineering, Faculty of Engineering, Zagazig University, Zagazig, Egypt
^2 Department of Mathematics, Faculty of Science, Zagazig University, Zagazig, Egypt

E-mail: [email protected]; rasha [email protected]

Received October 3, 2009; revised May 14, 2011.

Abstract   This paper presents a fast adaptive iterative algorithm to solve linearly separable classification problems in R^n. In each iteration, a subset of the sampling data (n-points, where n is the number of features) is adaptively chosen and a hyperplane is constructed such that it separates the chosen n-points at a margin ε and best classifies the remaining points. The classification problem is formulated and the details of the algorithm are presented. Further, the algorithm is extended to solving quadratically separable classification problems. The basic idea is based on mapping the physical space to another larger one where the problem becomes linearly separable. Numerical illustrations show that few iteration steps are sufficient for convergence when classes are linearly separable. For nonlinearly separable data, given a specified maximum number of iteration steps, the algorithm returns the best hyperplane that minimizes the number of misclassified points occurring through these steps. Comparisons with other machine learning algorithms on practical and benchmark datasets are also presented, showing the performance of the proposed algorithm.

Keywords linear classification, quadratic classification, iterative approach, adaptive technique

1 Introduction

Pattern recognition[1-2] is the scientific discipline whose goal is the classification of objects into a number of categories or classes. Depending on applications, these objects can be images or signal waveforms or any type of measurements that need to be classified.

Linear separability is an important topic in the domains of artificial intelligence and machine learning. There are many real-life problems in which there is a linear separation. A linear model is very robust against noise, whereas a nonlinear model may fit the noisy samples in the training data, performing more calculations to fit them and yet possibly performing worse on the test data. Multilayer nonlinear (NL) neural networks, such as those trained with the back-propagation algorithm, work well for nonlinear classification problems. However, using back-propagation for a linearly separable problem is overkill: thousands of iterations may be needed to reach a solution that linear separation methods can provide quickly. Linear separability methods are also used for training Support Vector Machines (SVMs)[3-4] used for pattern recognition. Support Vector Machines are linear learning machines on linearly or nonlinearly separable data. They are trained by finding a hyperplane that linearly separates the data. In the case of nonlinearly separable data, the data are mapped into some other Euclidean space. Thus, SVM still performs a linear separation, but in a different space.

In this paper, a novel and efficient method of finding a hyperplane which separates two linearly separable (LS) sets in R^n is proposed. It is an adaptive iterative linear classifier (AILC) approach. The main idea in our approach is to detect the boundary region between the two classes where the points of different classes are close to each other. Then, from this region, n-points belonging to the two different classes are chosen and a hyperplane is constructed such that each of the n-points lies at a prescribed distance ε from it (but points belonging to each class lie at opposite sides). There exist precisely two such hyperplanes, from which we choose the one that correctly classifies more points. If the chosen hyperplane successfully classifies all the points, calculations are terminated. Otherwise, another n-points are chosen to start the next iteration. These n-points are chosen adaptively from the misclassified ones as those that were furthest from the hyperplane constructed in the current iteration, because such points most probably lie in the critical region between the two classes. Compared with other iterative linear classifiers, this approach is



adaptive, and numerical results show that very few iteration steps are sufficient for convergence even for large sampling data.

The concept of a hyperplane is extended to performing quadratic classifications, not just linear ones. Analogous to the separating hyperplane that is represented by a linear (first-degree) equation, in quadratic classification a second-degree hypersurface is constructed to separate the two classes.

This paper is divided into seven sections. In Section 2, a brief survey of methods that classify LS classes is presented, showing the theoretical basis for those most related to the proposed classifier. In Section 3, the main idea, geometric interpretation, and mathematical formulation of the proposed AILC are presented. Illustrative examples are given in Section 4. The quadratically separable classifier is discussed and demonstrated by some examples in Section 5. Comparisons with other known algorithms are performed on linearly and nonlinearly separable benchmark datasets and the results are presented in Section 6. Finally, in Section 7, conclusions and future work are discussed.

2 Comparison with Existing Algorithms

Numerous techniques exist in the literature for solving the linear separability classification problem. These techniques include methods based on solving linear constraints (the Fourier-Kuhn elimination algorithm[5] or linear programming[6]), methods based on the perceptron algorithm[7], and methods based on computational geometry (convex hull) techniques[8]. In addition, statistical approaches are characterized by an explicit underlying probability model, which provides a probability that an instance belongs to a specific class, rather than simply a classification. The most related algorithms to the one proposed in this work are the perceptron and SVM algorithms.

The perceptron algorithm was proposed by Rosenblatt[5] for computing a hyperplane that linearly separates two finite and disjoint sets of points. In the perceptron algorithm, starting with an arbitrary hyperplane, the dataset is tested sequentially, point after point, to check whether each point is correctly classified. If a point is misclassified, the current hyperplane is updated so as to correctly classify this point. This process is repeated until a hyperplane is found that succeeds in classifying the full dataset. If the two classes are linearly separable, the perceptron algorithm will provide, in a finite number of steps, a hyperplane that linearly separates the two classes. However, it is not known ahead of time how many iteration steps are needed for the algorithm to converge.
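For reference, a minimal sketch of the perceptron update just described follows (a standard textbook formulation written by us, not code from the paper; the function name and learning-rate parameter are our own choices).

import numpy as np

def perceptron(X, d, max_epochs=1000, lr=1.0):
    """Rosenblatt perceptron: cycle through the points and update the
    hyperplane (w, t) whenever a point is misclassified."""
    N, n = X.shape
    w, t = np.zeros(n), 0.0
    for _ in range(max_epochs):
        errors = 0
        for x_i, d_i in zip(X, d):
            if d_i * (x_i @ w + t) <= 0:     # misclassified (or exactly on the hyperplane)
                w += lr * d_i * x_i          # move the hyperplane toward the point
                t += lr * d_i
                errors += 1
        if errors == 0:                      # all points correctly classified
            return w, t
    return w, t                              # may not separate the data if not LS

The number of update steps is finite for linearly separable data but, as noted above, cannot be bounded ahead of time.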

SVM[3], as a linear learning method, is trained by finding an optimum hyperplane that separates the dataset (with the largest possible margin) by solving a constrained convex quadratic programming optimization problem, which is time-consuming.

In the proposed AILC, starting with an arbitrary hyperplane, the full dataset is tested and the information about the relative locations of the misclassified points with respect to the hyperplane is utilized to predict the critical region between the two classes where a better hyperplane can exist. This adaptive nature of the iteration speeds up the convergence to a hyperplane that successfully separates the two classes. In Section 3, the classification problem is reformulated to produce the required information at low cost. In addition, the theoretical basis and implementation of AILC are provided.

3 Adaptive Iterative Linear Classifier (AILC)

In this section we present the adaptive iterative linear classifier (AILC). The main idea in our approach is to simulate how one can predict a line in R^2 that separates points belonging to two linearly separable classes. First, it detects the boundary region between the two classes where points of different classes are close to each other. From this region of interest, it can choose two points (one of each class) that seem to be most difficult (nearest) and predict a line that not only separates the two points but also, as much as possible, correctly separates the two classes; that is, it tries to construct a line having one of the points, together with the remaining points of its class, on one side of the line, and the second point, with the rest of its class, on the other side. If such a line exists, the task is done. Otherwise, another two points are chosen to start the next iteration. These new points are chosen adaptively as those expected, by the line constructed in the current iteration, to lie in the border region between the two classes.

Construction of a separating line in our approach is characterized by the requirement that the 2-points lie at a prescribed distance ε (but at opposite sides) from it. In fact, there exist precisely two such lines, from which we choose the one that correctly separates more points.

The generalization to R^n is straightforward. Starting with n-points in R^n belonging to two different classes, we construct a hyperplane such that each of the n-points lies at a prescribed distance ε (but points belonging to each class lie at opposite sides) from it. Again, there exist precisely two such hyperplanes, from which we choose the one that correctly classifies more points. If the chosen hyperplane successfully classifies all the points, we terminate the calculations. Otherwise, a new iteration is started by choosing another n-points from the misclassified ones (see Subsection 3.1 for more details).


This approach is more efficient than other related methods proposed in the literature. For example, the CLS[9-11] examines each possible hyperplane passing through every set of n-points to check whether it can successfully classify the remaining points. When such a hyperplane is reached, the required hyperplane is constructed such that it, further, properly separates the n-points according to their classes.

3.1 Geometric Interpretation and Theoretical Basis for AILC

The classification problem considered in this work consists of finding a hyperplane P that linearly separates N points in R^n. Each of these points belongs to either of the two disjoint classes A or B that lie in the positive or negative half space of P, respectively. If the training data are linearly separable, then a hyperplane

P(w; t): x_i^T w + t = 0    (1)

exists such that

x_i^T w + t > 0, for all x_i ∈ A,
x_i^T w + t < 0, for all x_i ∈ B,    (2)

where x_i ∈ R^n is the feature vector (or the coordinates) of point i, while w ∈ R^n is termed the weight vector and t ∈ R the bias (or −t is termed the threshold) of the hyperplane.

Defining the class identifier

d_i = 1 if x_i ∈ class A,  d_i = −1 if x_i ∈ class B,    (3)

(2) reduces to the single form

d_i(x_i^T w + t) > 0,  i = 1, 2, . . . , N.    (4)

Dividing (4) by |t| yields

d_i(x_i^T W + c) = e_i,  e_i > 0,  i = 1, 2, . . . , N,    (5)

where W = w/|t| is a scaled weight vector having the same direction as w (normal to the hyperplane P(W; c): x_i^T W + c = 0) and pointing to its positive half space, and c = 1 or −1 according to whether the sign of t is positive or negative, respectively.

In (5), we have introduced the variables e_i, i = 1, 2, . . . , N for the first time. These variables will be the source of information in our approach. According to (5), a hyperplane P(W; c) correctly separates the two classes if e_i > 0, i = 1, 2, . . . , N. However, for a trial hyperplane P(W; c), if substituting W and c in (5) produces a negative value for e_i, then point i is misclassified by P. Another important property of these variables is that each e_i is a measure of the distance between point x_i and P. This can easily be proven as follows. Recall that the distance between any point x_i ∈ R^n, x_i ∉ P(W; c), and the hyperplane P(W; c) is given by

δ(x_i, P) = |x_i^T W + c| / ‖W‖ = |e_i| / ‖W‖ > 0,    (6)

where ‖W‖ is the L2 norm (length) of the vector W; then

|e_i| = ‖W‖ δ(x_i, P).    (7)
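To make the role of e_i concrete, the following minimal NumPy sketch evaluates (5)-(7) for a candidate hyperplane; the function name, array layout and the toy data are our own illustrative assumptions, not part of the paper. The sign of each e_i flags misclassification and |e_i|/‖W‖ gives the distance to the hyperplane.

import numpy as np

def margins(X, d, W, c):
    """e_i = d_i (x_i^T W + c) for all points (Eq. (5))."""
    e = d * (X @ W + c)
    misclassified = np.flatnonzero(e < 0)          # negative e_i: point on the wrong side
    distances = np.abs(e) / np.linalg.norm(W)      # |e_i| / ||W||  (Eq. (6))
    return e, misclassified, distances

# toy 2-D example (hypothetical data)
X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0], [-2.0, 0.0]])
d = np.array([1, 1, -1, -1])        # class identifiers (Eq. (3))
W = np.array([1.0, 1.0]); c = -1.0  # trial hyperplane P(W; c)
e, bad, dist = margins(X, d, W, c)
print(e, bad, dist)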

In our approach, since W = (W_1, W_2, . . . , W_n)^T consists of n unknown components, we choose n-points and assume that they all lie at a constant distance from a trial hyperplane P such that each point lies in the proper half space according to its class. Substituting x_i^T, d_i and e_i = ε > 0, i = 1, 2, . . . , n in (5), and noting that c = 1 or −1, produces two linear systems of equations in the n unknowns W_1, W_2, . . . , W_n. Solution of these systems (assuming linear independence of the equations) produces two hyperplanes: P1 = P(W1; 1) and P2 = P(W2; −1). The first adaptive feature of the proposed algorithm is to select from P1 and P2 the one that is more efficient in classifying the remaining N − n points.

Fig.1. Choice of the better hyperplane. The arrow of each hyperplane refers to its positive half-space.

In Fig.1, an illustration in R^2 is presented with N = 16 (8 points of each class), where a black circle denotes the class with identifier d = 1 and a triangle the other class with d = −1. The starting 2-points are enclosed in squares. Both P1 and P2 successfully separate the chosen points into the two classes. However, it is not guaranteed that both P1 and P2 correctly classify the full N-set of points. P2 succeeded in classifying 12 points (5 circles in its positive half space and 7 triangles on the other side) but failed on the remaining 4 points, whereas P1 succeeded in classifying 6 points (4 circles in its positive half space and 2 triangles on the other side) but failed on the remaining 10 points. Thus the algorithm chooses P2.


3.2 Mathematical Formulation

Let x_i^T = [x_i1, x_i2, . . . , x_in] be the row representation of the components of an input data point x_i ∈ R^n that has n features, and let N be the number of data points belonging to two disjoint classes (A and B). Then, applying (5) for all N-points yields the system

D(X^T W + C) = E    (8)

where

D = diag(d_1, d_2, . . . , d_N),
X^T = [x_ij]_{N×n}  (the i-th row of X^T is x_i^T = [x_i1 x_i2 · · · x_in]),
W = [W_1 W_2 · · · W_n]^T (n×1),   C = [c c · · · c]^T = cJ_N (N×1),
J_N = [1 1 · · · 1]^T (N×1),   E = [e_1 e_2 · · · e_N]^T.    (9)

And thus, the classification problem is formulated by

D(X^T W + cJ_N) = E.    (10)

One has to notice that the matrices X^T and D represent the input data such that, for each point i, X^T contains in its row i the feature vector x_i^T, and D is a diagonal matrix whose diagonal elements are the elements of the vector d = [d_1 d_2 . . . d_N]^T. Thus, interchanging the rows of both X^T and D corresponds to a reordering of the N-points. In (10), J_N is an N-vector whose entries are all unity and c = ±1. Also, referring to (5), for a separating hyperplane all the entries of the vector E must be positive. Hence the classification problem reads: find a hyperplane such that the entries of E are all positive, or equivalently

find W and c, such that: E > 0. (11)

The proposed solution consists of partitioning the N-system (see (10)) into two subsystems: the first consists of the first n equations, while the second consists of the remaining (N − n) equations. Let X^T be partitioned as

X^T = [ a ; b ],

where the semicolon denotes vertical stacking of blocks; then (10) is rewritten as

[ D1  0 ; 0  D2 ] ( [ a ; b ] W + c [ J1 ; J2 ] ) = [ E1 ; E2 ],    (12)

where a is a nonsingular square matrix of dimension n, b is in general a rectangular matrix of dimension (N − n) × n, J1 and J2 are vectors of n and N − n unit components, respectively, and D1 and D2 are diagonal square matrices of dimensions n and N − n, respectively. (12) can then be written as

D1(aW + cJ1) = E1, (13)

D2(bW + cJ2) = E2. (14)

And the classification problem becomes: find W and c such that E1 > 0, E2 > 0.

3.3 Adaptive Iterative Linear Classifier (AILC)

To simplify the solution of (13) and (14), choose a small positive number ε and assume

e_1 = e_2 = · · · = e_n = ε > 0;    (15)

then E1 = εJ1 > 0 and hence, upon substitution in (13), using D1^{−1} = D1 and solving for W as a function of c, (13) reduces to

W = a^{−1}Q.    (16)

Here Q = εD1J1 − cJ1 is a vector of length n and is computed easily because its i-th entry is given by εd_i − c, 1 ≤ i ≤ n.

Substituting (16) in (14) gives

E2 = D2 b W + c D2 J2.    (17)

To compute E2, note that the i-th entry of E2 is

e_i = d_i(b_i^T W + c),  n + 1 ≤ i ≤ N.    (18)

Clearly, since the vector Q depends on the value of c, so do both W and E2.

3.3.1 Adaptive Procedure

In the proposed AILC, we try to speed up the convergence rate by making full use of all available information within and after each iteration. Two adaptive choices are performed as follows.

First, within iteration r, the algorithm chooses the value of c as +1 or −1 such that the constructed hyperplane correctly classifies more points, as described in Subsection 3.1. The implementation of this adaptive choice is presented in Algorithm 1.

Algorithm 1. Iteration r (a^{−1}, b, D1, D2; c, W, E2, m)

1. Set c = 1.
2. Compute the vectors W(c) and E2(c) using (16), (17).
3. Compute m(c) as the number of negative entries of E2(c).
4. if m(c) = 0, then E2(c) > 0, go to step 8.
5. else if c = 1, set c = −1 and repeat steps 2∼4.
6. if m(1) < m(−1), then c = 1 produces the accepted hyperplane Pr. Set c = 1, go to step 9.
7. else c = −1 produces the accepted hyperplane Pr. Set c = −1, go to step 9.
8. The separating hyperplane P is defined by c, W(c). end iteration r.
9. The best hyperplane Pr is defined by c, W = W(c); return also E2(c), m(c). end iteration r.
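A minimal NumPy sketch of this iteration is given below. It follows (16)-(18) and the adaptive choice of c in Algorithm 1; the function name, the representation of D1 and D2 by their diagonals d1 and d2 (vectors of ±1), and the array layout are our own assumptions, not the paper's code.

import numpy as np

def iteration_r(a, b, d1, d2, eps):
    """One AILC iteration: build W(c) and E2(c) for c = +1 and c = -1,
    then keep the candidate hyperplane that misclassifies fewer points."""
    a_inv = np.linalg.inv(a)                 # a must be nonsingular
    best = None
    for c in (1.0, -1.0):
        Q = eps * d1 - c                     # i-th entry: eps*d_i - c   (definition of Q)
        W = a_inv @ Q                        # W = a^{-1} Q              (Eq. (16))
        E2 = d2 * (b @ W + c)                # e_i = d_i (b_i^T W + c)   (Eq. (18))
        m = int(np.sum(E2 < 0))              # number of misclassified points
        if m == 0:
            return c, W, E2, 0               # separating hyperplane found
        if best is None or m < best[3]:
            best = (c, W, E2, m)
    return best                              # hyperplane with fewer misclassifications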

Second, after an iteration r, the vector E_r = [E1 ; E2]_r is computed. E_r is constructed as the augmentation of E1, all of whose n entries equal ε, and E2, whose entries are computed by (18). In fact, E_r contains important information about the fitness of the constructed hyperplane Pr as a separator. First, recall that a negative sign of an entry e_i of E_r means that point i is misclassified by the hyperplane. Second, by (7), the absolute value of e_i provides a measure of the distance of point i from the hyperplane. Thus, if the entries of E_r are all positive, then Pr is an acceptable classifier; otherwise, the entries having the lowest values in E_r correspond to the misclassified points furthest from Pr, and hence such points most probably lie in the critical region between the two classes where an objective classifier P has to be constructed. Accordingly, we choose n of these points (which, in addition, must be linearly independent and belong to both of the different classes) to determine the hyperplane in the next iteration. So, matrix a in (12) is chosen by adaptively reordering the input matrix X^T after each iteration such that the first n rows of X^T and D correspond to the data of the chosen n-points. An illustration in R^2 is shown in Fig.2, where black circles and triangles refer to the classes that must lie in the positive and negative half space, respectively. The misclassified points lie in the shaded regions and the chosen 2-points for the next iteration are shown in rectangles.

Fig.2. Illustration of the adaptive choice of the next iteration in the classifier (AILC) in R^2.

3.3.2 Implementation of AILC

Algorithm 1 describes a typical iteration r that returns either a separating hyperplane P or a hyperplane Pr which, although it does not successfully classify all the points, minimizes the number m of misclassified points through the adaptive choice of c.

The Adaptive Reordering algorithm (Algorithm 2) rearranges X^T and d such that the first n-points in X^T (forming a in the next iteration) satisfy the conditions: 1) they correspond to rows that have the lowest values in E, 2) a is nonsingular, and 3) they belong to both classes. The details are presented in Algorithm 2.

The complete AILC algorithm is presented in Algorithm 3.

Algorithm 2. Adaptive Reordering (n, N, ε, X^T, d, E2)

1. Form vector E as the augmentation of E1 (all its n entries equal ε) and E2.
2. Form vector F such that its entries are the row numbers of E when it is sorted in ascending order.
3. Set a(n, n) = zero matrix, da(n) = zero vector, flag(N) = zero vector.
4. Set i = 1, j = 1.
5. while i < n
6.   while j < N
     I. k = F(j), a_i^T = x_k^T
     II. if rank(first i rows of a) = i, then set: da(i) = d(k); flag(k) = i; break, end.
     III. j = j + 1; go to step 6.
7. i = i + 1; go to step 5.
8. i = n.
9. while j ≤ N
     I. k = F(j), a_i^T = x_k^T
     II. if (d(k) = −da(n − 1) and rank(first i rows of a) = i), then set: da(i) = d(k); flag(k) = i; break, end.
     III. j = j + 1; go to step 9.
10. for each 1 ≤ k ≤ N, if (flag(k) = i ≠ 0), x_k^T = x_i^T, d(k) = d(i).
11. for i = 1 to n, x_i^T = a_i^T, d(i) = da(i).
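The sketch below mirrors this reordering logic in NumPy (a hypothetical re-implementation of ours, using sorting, a rank test and an index permutation rather than the in-place swaps of Algorithm 2). It assumes n ≥ 2, that both classes are present and that enough linearly independent rows exist; it picks the n rows with the lowest entries of E, forcing the last chosen point to come from the class opposite to the previously chosen one, and moves them to the front of X^T and d.

import numpy as np

def adaptive_reorder(X, d, E, n):
    """Return X, d reordered so the first n rows are the worst (lowest-E),
    linearly independent points, with both classes represented.
    E is the N-vector of e_i values (its first n entries equal eps)."""
    order = np.argsort(E)                    # candidate rows, most negative e_i first
    chosen = []
    # pick the first n-1 rows that keep the selection linearly independent
    for k in order:
        if np.linalg.matrix_rank(X[chosen + [k]]) == len(chosen) + 1:
            chosen.append(k)
        if len(chosen) == n - 1:
            break
    # the n-th row must come from the class opposite to the last chosen one
    for k in order:
        if k in chosen:
            continue
        if d[k] == -d[chosen[-1]] and np.linalg.matrix_rank(X[chosen + [k]]) == n:
            chosen.append(k)
            break
    rest = [k for k in range(len(d)) if k not in chosen]
    perm = chosen + rest                     # chosen points first, remaining points after
    return X[perm], d[perm], perm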

4 Numerical Illustration

In this section, the use of algorithm AILC is demonstrated by three linearly separable (LS) examples.

Algorithm 3. AILC (N, n, X^T, d, ε, rmax; c, W, r, m)

Input: data N, n, the N × n array X^T, the class identifier N × 1 array d, the maximum number of iterations rmax, and the parameter ε.
Output: a hyperplane (c, W), iteration r, and the number of misclassified points m.

1. Arrange X^T, d such that the first n rows of X^T form a nonsingular n × n matrix a.
2. Set m0 = N, r = 1.
3. while r ≤ rmax
   a) Form the partitioned matrices a, b, D1, D2, then compute a^{−1} (see (12)).
   b) Call Iteration r (a^{−1}, b, D1, D2; c, W, E2, m).
   c) if m = 0 (successful separation), return c, W, r, m; break; end.
   d) else if m < m0, set m0 = m, copt = c, Wopt = W, ropt = r.
   e) Call Adaptive Reordering (n, N, ε, X^T, d, E2).
   f) r = r + 1.
   g) go to step 3.
4. return the data of the hyperplane with the minimum number of misclassified points: c = copt, W = Wopt; also return m = m0, r = ropt. end.
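A compact driver tying the two earlier sketches together might look as follows. This is again a hypothetical sketch of ours: it assumes the iteration_r and adaptive_reorder helpers defined above are available and that the first n rows of X are initially linearly independent.

import numpy as np

def ailc(X, d, eps, r_max):
    """Run AILC; returns (c, W, iteration, misclassified_count)."""
    N, n = X.shape
    best = None
    for r in range(1, r_max + 1):
        a, b = X[:n], X[n:]                          # partition of X^T as in (12)
        d1, d2 = d[:n], d[n:]
        c, W, E2, m = iteration_r(a, b, d1, d2, eps) # Algorithm 1
        if m == 0:
            return c, W, r, 0                        # separating hyperplane found
        if best is None or m < best[3]:
            best = (c, W, r, m)
        E = np.concatenate([np.full(n, eps), E2])    # augmented vector E_r
        X, d, _ = adaptive_reorder(X, d, E, n)       # Algorithm 2
    return best                                      # best hyperplane seen so far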

The first is a 2D-classification problem where successive iterations are visualized to illustrate the adaptive feature and convergence behavior of the algorithm. The influence of the value of ε and of the reordering of the input data on the convergence is numerically discussed. The second example is a 3D-classification problem in R^3, while the third one is a 4D-classification problem in R^4 where the standard benchmark classification dataset IRIS[12] is arranged as two LS classes.

Example 1. A 2D-classification problem consists of two classes A (black circles) and B (triangles) given by:
A = {(4, 3), (0, 4), (2, 1.6), (7, 3), (3, 4), (4, 3), (3, 2)}
B = {(4, −4), (−3, 0), (−6, −1), (1, 0), (1, 0.5), (0, −7), (6, 2)}.

Points of the two classes A, B are represented in Fig.3(a), showing great difficulty in classifying these data. A circle around each of the starting two points (−4, 3), (4, −4) is also shown. Figs. 3(b)∼3(d) show the application of our algorithm to this problem with ε = 0.45. The weight vector and threshold computed after each iteration are shown in Table 1.

Fig.3. 2D plot of the two-class classification problem (class A (black circles), class B (triangles)). Squares indicate the worst points after iterations 1 and 2. (a) Original dataset. (b) After iteration 1. (c) After iteration 2. (d) After iteration 3.

Table 1. Weight Vectors and Threshold Values Obtained by Executing the Algorithm

i (iteration)   W (weight vector)       c (threshold)
1               (1.6375, 2)              1
2               (−0.0481, 0.4192)       −1
3               (−0.3, 1.175)           −1

To discuss the dependency of the proposed algorithm on the starting n-points and the parameter ε, we repeat solving the previous example starting with another two points (−3, 0), (4, −4) and select ε = 0.4. The number of iterations changes; two iterations were required to classify these difficult data although the starting points belong to the same class (B). The results are presented in Table 2 and Fig.4.

Table 2. Weight Vectors and Threshold Values Obtained by Executing the Algorithm

i (iteration)   W (weight vector)       c (threshold)
1               (0.4667, 0.8167)         1
2               (−0.1355, 0.7059)       −1

It should be mentioned that no more than 4 iterations were needed to solve this classification problem irrespective of the starting points, for 0.0001 < ε < 0.5.

Example 2. The algorithm presented in Algorithm 3 was tested by applying it to an LS 3D-classification problem that consists of two classes A (•) and B (△).
A = {(1, 4.5, 1), (2, 4, 3), (6, 5, 4), (4, 6, 5), (4, 5, 6), (1, 3, 1)}
B = {(0, 4, 0), (2, 4, −3), (−4, 4, −2), (−3, 4, −4), (−2, 3, −3), (−4, 4, 1)}.


Fig.4. Classification of the same classes (black circles = class A, triangles = class B) represented in Fig.3(a) when we start with (−3, 0), (4, −4) and ε = 0.4. The worst points after the first iteration are enclosed in squares.

Starting with the points (0, 4, 0), (1, 4.5, 1), (2, 4, 3) and choosing ε = 0.3, Algorithm 3 was applied to classify these data. Two iterations were sufficient to solve this classification problem, as shown in Fig.5. The situations after the first and second iterations are shown in Figs. 5(a) and 5(b), respectively. In each case, the graph was rotated such that the view was perpendicular to the separating plane. After the first iteration the points (2, 4, −3), (1, 3, 1), (0, 4, 0) were found to be the worst. The results of the different iterations are summarized in Table 3.

Table 3. Weight Vectors and Threshold Values Obtained by Executing the Algorithm

i (iteration)   W (weight vector)         c (threshold)
1               (0.9375, 0.175, 0.425)    −1
2               (0.465, 0.175, 0.31)      −1

Fig.5. Original dataset and the constructed hyperplanes for the 3D-problem of Example 2. (a) After the first iteration. (b) After the second iteration.

Example 3. The IRIS dataset[12] classifies a plant as being an Iris Setosa, Iris Versicolour or Iris Virginica. The dataset describes every iris plant using four input parameters (Sepal length, Sepal width, Petal length, and Petal width). The dataset contains a total of 150 samples with 50 samples for each of the three classes. Some of the publications that used only the samples belonging to the Iris Versicolour and the Iris Virginica classes include: Fisher[13] (1936), Dasarathy (1980), Elizondo (1997), and Gates (1972).

Although the IRIS dataset is nonlinearly separable, it is known that all the samples of the Iris Setosa class are linearly separable from the rest of the samples (Iris Versicolour and Iris Virginica). Therefore, in this example, a linearly separable dataset was constructed from the IRIS dataset such that the samples belonging to the Iris Versicolour and Iris Virginica classes were grouped in one class and the Iris Setosa was considered to be the other class. Thus, a linearly separable 4D-classification problem was considered in this example with 100 points in class A and 50 points in class B. Using the proposed algorithm with ε = 0.5, the data were completely classified after two iterations and the results were collected in Table 4.

Table 4. Weight Vectors and Threshold Values for the Iris Classification Problem

i (iteration)   W (weight vector)                       c (threshold)
1               (0, 0, 0, −2.5)                          1
2               (0.3763, 0.1096, −0.3907, −0.2335)      −1
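The two-class arrangement used in Example 3 can be reproduced with a few lines; the following sketch uses scikit-learn's copy of the IRIS data and our own variable names (an illustration, not the authors' preprocessing code).

import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data                                    # 150 samples, 4 features
# class A: Versicolour + Virginica (targets 1 and 2); class B: Setosa (target 0)
d = np.where(iris.target == 0, -1, 1)            # class identifiers as in Eq. (3)
print(X.shape, (d == 1).sum(), (d == -1).sum())  # (150, 4) 100 50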

5 Classification of Quadratically Separable Sets

Two classes A, B are said to be quadratically separable if there exists a quadratic polynomial P2(y), y ∈ R^m, such that P2(y) > 0 if y ∈ A and P2(y) < 0 if y ∈ B. In R^2, a general quadratic polynomial can be put in the form

w_1 y_1^2 + w_2 y_2^2 + w_3 y_1 y_2 + w_4 y_1 + w_5 y_2 + c = 0.    (19)

(19) represents a conic section (parabola, ellipse, or hyperbola, depending on the values of the coefficients w_i). Now, consider a mapping φ: R^2 → R^5 such that a point y = (y_1, y_2) ∈ R^2 is mapped into a point x ∈ R^5 with components

x_1 = y_1^2,  x_2 = y_2^2,  x_3 = y_1 y_2,  x_4 = y_1,  x_5 = y_2.    (20)

Using this mapping, P2(y) = 0 is transformed into a hyperplane x^T w + c = 0 in R^5. The transformed linear classification problem can be solved by algorithm AILC to get w and c, and hence the quadratic polynomial P2(y) = 0 is determined.

Generally, a quadratic polynomial in R^m can be transformed into a hyperplane in R^n with n = m + m(m + 1)/2. Quadratic polynomials in R^3 represent surfaces such as ellipsoids, paraboloids, hyperboloids, and cones.
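A sketch of this feature map for general m follows (a hypothetical helper of ours, not from the paper); it produces the m(m+1)/2 quadratic monomials plus the m linear terms, after which AILC can be run on the mapped data exactly as in the linear case.

import numpy as np

def quadratic_map(Y):
    """Map points y in R^m to x in R^n, n = m + m(m+1)/2:
    all products y_i*y_j (i <= j) followed by the linear terms y_i."""
    Y = np.asarray(Y, dtype=float)
    m = Y.shape[1]
    quad = [Y[:, i] * Y[:, j] for i in range(m) for j in range(i, m)]  # m(m+1)/2 columns
    return np.column_stack(quad + [Y[:, i] for i in range(m)])         # + m linear columns

# for m = 2 this gives the five coordinates of (20), up to the ordering of the columns
X = quadratic_map([[1.0, 2.0], [0.5, -1.0]])
print(X.shape)   # (2, 5)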

Although the algorithm is applicable to higher dimensions, we present an example in R^2 for convenience in visualization. A set of points belonging to two classes (black and red +) is presented in Fig.6. The mapping φ defined by (20) is used to generate coordinates in R^5 corresponding to the input data points. Algorithm AILC is used to solve the transformed linearly separable problem with two different values of ε, 0.4 and 0.5. For each of these values, the resulting quadratic equation is plotted in blue. Although the algorithm successfully classified the points in both cases, it shows sensitivity to the value of ε. For ε = 0.4, five iterations were required to converge to a parabola (see Fig.6(a)), while it takes eleven iterations when ε = 0.5 to converge to the hyperbola shown in Fig.6(b). Moreover, the algorithm may diverge for other ranges of values, in contrast to the linearly separable classification problems, where very few iterations (1∼3) were sufficient for convergence for 0 < ε < 0.5.

Fig.6. Classification by a conic section using different values of ε. (a) ε = 0.4. (b) ε = 0.5.

For the difficult dataset presented in Fig.7, the application of the algorithm produces the separating ellipse shown there.

6 Numerical Results

In this section we discuss the performance of algorithm AILC compared with other learning algorithms in the case of linearly and nonlinearly separable practical and benchmark datasets.

6.1 Classification of Linearly Separable Datasets

Fig.7. Application of the algorithm produces an ellipse for the quadratically separable data.

For the evaluation of the AILC algorithm, the following linearly separable datasets were chosen, including the benchmark dataset IRIS[12] and some randomly generated datasets.

1) IRIS: a full description of the IRIS dataset is given in Section 4 (Example 3). Here, we consider two classes: Iris Setosa (50 samples) versus the non-Setosa (the remaining 100 samples belonging to the Iris Versicolour and the Iris Virginica).
2) G 589 2.
3) G 1972 2.
4) G 19001 2.
5) G 1367 10.
6) G 1353 15.

The following procedure describes the automatic generation of the data. Generate a random array consisting of N rows and n columns as the input matrix X^T. To define the class identifier d, we first generate a random vector of length n + 1 for the weights w and c; then, for 1 ≤ i ≤ N, compute b_i = x_i^T w + c and define d_i as +1 or −1 according to b_i > δ or b_i < −δ, where δ is a small positive number that preserves a margin between the two generated sets. The generated data consist of X^T and d in the form of an N × (n + 1) array. Table 5 gives a summary of the datasets used.

Table 5. Description of the Benchmark and Randomly Generated Linearly Separable Datasets

Dataset      Samples N   Features n
IRIS         150          4
G 589 2      589          2
G 1972 2     1 972        2
G 19001 2    19 001       2
G 1367 10    1 367        10
G 1353 15    1 353        15
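A minimal sketch of the generation procedure described above is given here; the function name, the value of δ and the sampling distributions are illustrative assumptions of ours, not the authors' exact settings.

import numpy as np

def generate_ls_dataset(N, n, delta=0.1, seed=0):
    """Generate a random linearly separable dataset as described in Subsection 6.1."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(n)               # random weights ...
    c = rng.standard_normal()                # ... and bias defining a hidden hyperplane
    X, d = [], []
    while len(X) < N:
        x = rng.uniform(-1.0, 1.0, size=n)
        b = x @ w + c
        if b > delta:                        # keep only points outside the +/- delta margin
            X.append(x); d.append(1)
        elif b < -delta:
            X.append(x); d.append(-1)
    return np.array(X), np.array(d)          # N x n features and N class identifiers

X, d = generate_ls_dataset(589, 2)           # e.g., a dataset of the same size as G 589 2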

In the next experiment these linearly separable datasets are used to evaluate the performance of our proposed algorithm AILC and of other machine learning algorithms, including a decision tree, a support vector machine and a radial basis function network. A summary of these algorithms is given in Table 6. We compared our results with the implementations in WEKA[14-15].

Table 6. Summary of Machine Learning Algorithms Used to Produce the Results of Tables 7 and 9

J48        Decision tree learner
RBF        Radial basis function network
MLP (L)    Multilayer perceptron (back-propagation neural network) with L hidden layers
SMO (d)    Sequential minimal optimization algorithm for support vector classification with a polynomial kernel of degree d
AILC (d)   Proposed adaptive iterative linear (d = 1) and quadratic (d = 2) classifier

Table 7. Results of the Empirical Comparison Showing the Number of Misclassified Instances

Dataset      J48   SMO (1)   RBF   AILC (1)
IRIS          0      0         0     0 (2)
G 589 2       1      4         2     0 (3)
G 1972 2      5      2        66     0 (4)
G 19001 2     1      0       162     0 (2)
G 1367 10    11      0         5     0 (42)
G 1353 15    21      0        21     0 (255)

For each dataset, the full data were used in training the different algorithms to find the best separating hyperplane. The number of misclassified samples, if any, is reported in Table 7. In addition, for AILC, the number of iterations required to obtain the separating hyperplane is given in parentheses. One can easily conclude (from Table 7 and many other experiments not reported here) that although the number of required iterations increases significantly with the number of features n, it is nearly independent of the number of samples N. Being independent of N shows the strength of the adaptive technique, while being significantly dependent on n is a weakness of the proposed technique that results from the assumption that the chosen n-points have to lie at an equal, prescribed distance from the hyperplane. However, the proposed algorithm succeeded in separating all these datasets while the other algorithms did not.

6.2 Behavior of Algorithm AILC in Nonlinearly Separable Datasets

In this subsection, we discuss the behavior of the proposed adaptive iterative linear classifier algorithm when the dataset is nonlinearly separable, and present a comparison among this algorithm, a decision tree, a back-propagation neural network, and support vector machines.

6.2.1 Datasets Used for Empirical Evaluation

For an empirical evaluation of the algorithm AILC on nonlinearly separable datasets we have chosen five datasets from the UCI machine learning repository[12] for binary classification tasks.

1) Breast-Cancer (BC). We used the original Wisconsin breast cancer dataset, which consists of 699 samples of breast-cancer medical data belonging to two classes. Sixteen examples containing missing values have been removed. 65.5% of the samples come from the majority class.

2) Pima Indian Diabetes (DI). This dataset contains 768 samples with eight attributes (features) each plus a binary class label.

3) Ionosphere (IO). This database contains 351 samples of radar return signals from the ionosphere. Each sample consists of 34 real-valued attributes plus binary class information.

4) IRIS. A full description of the IRIS dataset is given in Section 4 (Example 3). Here, only the 100 samples belonging to the Iris Versicolour and the Iris Virginica classes are considered.

5) Sonar (SN). The sonar database is a high-dimensional dataset describing sonar signals in 60 real-valued attributes. The dataset contains 208 samples.

Table 8 gives an overview of the datasets used. The number of examples in brackets shows the original size of the dataset before the examples containing missing values were removed.

Table 8. Numerical Description of the Benchmark Datasets Used for Empirical Evaluation

Dataset     Samples (Instances)   Majority Class (%)   Features (Attributes)
BC (699)    683                   65.50                 9
DI          768                   65.10                 8
IO          351                   64.10                34
IRIS        100                   50.00                 4
SN          208                   53.40                60

There exist many different techniques to evaluate the performance of learning methods on data with a limited number of samples. The stratified ten-fold cross-validation technique is gaining ascendancy and is probably the evaluation method of choice in most practical limited-data situations. In this technique, the data are divided randomly into ten parts in which each class is represented in approximately the same proportions as in the full dataset. Each part is held out in turn and the learning scheme is trained on the remaining nine-tenths; its error rate is then calculated on the holdout set. Thus the learning procedure is executed a total of ten times on different training sets (each of which has a lot in common with the others). Finally, the ten error estimates are averaged to yield an overall error estimate.

Table 9. Results of the Empirical Comparison Showing the Number of Misclassified Instances and the Accuracy on the Test Set Using 10-Fold Cross Validation

            BC            DI             IO            IRIS       SN
J48         32 (95.31%)   196 (74.48%)   34 (90.31%)   6 (94%)    60 (71.15%)
MLP (3)     36 (94.73%)   181 (76.43%)   31 (91.17%)   7 (93%)    41 (80.28%)
SMO (1)     21 (96.93%)   179 (76.69%)   44 (87.46%)   6 (94%)    50 (75.96%)
SMO (2)     24 (96.49%)   171 (77.73%)   33 (90.60%)   7 (93%)    37 (82.21%)
AILC (1)    37 (94.58%)   199 (74.09%)   69 (80.34%)   6 (94%)    71 (65.87%)
AILC (2)                                               4 (96%)

In this study, the technique of cross validation was applied to the benchmark datasets (see Table 8) to evaluate the performance of our proposed algorithm AILC and of other machine learning algorithms, including a decision tree, a back-propagation neural network and support vector machines (see Table 6). We compared our results with the implementations in WEKA[14-15].

The results of the comparison are summarized in Table 9, where the number of misclassified instances is given with the classification accuracy in parentheses.

Although AILC is a linear classifier, it produces reasonable results even in the case of nonlinearly separable datasets. Again, as in the linearly separable case (Subsection 6.1), one can easily conclude that the performance of AILC is independent of the sample size N but degrades as the feature dimension n increases. Note that for the IRIS dataset, where n = 4, AILC is as accurate as SVM with a polynomial kernel of degree 1, and its performance outperforms that of SVM with a polynomial kernel of degree 2. For the datasets BC (n = 9) and DI (n = 8), comparable results are obtained even though N is large (see Table 8). On the other hand, less acceptable results are obtained in the case of IO (n = 34) and SN (n = 60).

7 Conclusions

A fast adaptive iterative algorithm, AILC, for classifying linearly separable data is presented. In a binary classification problem containing N samples with n features, the main idea of the algorithm is that it adaptively chooses a subset of n samples and constructs a hyperplane that separates these n samples at a margin ε and best classifies the remaining points. This process is repeated until a separating hyperplane is obtained. If such a hyperplane is not obtained after the prescribed number of iterations, the algorithm returns the hyperplane that misclassifies the fewest samples. Further, a quadratically separable classification problem can be mapped from its physical space to another, larger space where the problem becomes linearly separable. From various numerical illustrations and comparisons with other classification algorithms on benchmark datasets, one can conclude:

1) the algorithm is fast due to its adaptive feature;
2) the complexity of the algorithm is C1·N + C2·n², where C1 and C2 are independent of N, which ensures excellent performance especially when n is small;
3) the assumption that the n samples must lie at a prescribed margin from the hyperplane is restrictive and makes the convergence rate dependent on n; on the other hand, the user must provide the prescribed parameter ε, which is problem dependent;
4) convergence rates of AILC are measured either by the number of iterations required to obtain the separating hyperplane or by the number of misclassified samples after the prescribed number of iterations. Theoretical and numerical results show that convergence rates are nearly independent of N but degrade with increasing n, and usually few iterations are sufficient for convergence for small n.

Although reasonable results were obtained, convergence was greatly dependent on the value of n, which in turn depends on the prescribed parameter ε. Other algorithms are in development to predict the value of ε that ensures a maximum margin for the n-points. Moreover, the classification problem as formulated in Section 3 may be developed into a linear programming algorithm that determines ε as an n-valued vector, rather than a scalar value, and produces the hyperplane with maximum margin.

References

[1] Duda R O, Hart P E, Stork D G. Pattern Classification. New York: Wiley-Interscience, 2000.
[2] Theodoridis S, Koutroumbas K. Pattern Recognition. Academic Press, an imprint of Elsevier, 2006.
[3] Cristianini N, Shawe-Taylor J. An Introduction to Support Vector Machines. Vol. I, Cambridge University Press, 2003.
[4] Atiya A. Learning with kernels: Support vector machines, regularization, optimization, and beyond. IEEE Transactions on Neural Networks, 2005, 16(3): 781.
[5] Rosenblatt F. Principles of Neurodynamics. Spartan Books, 1962.
[6] Taha H A. Operations Research: An Introduction. Macmillan Publishing Co., Inc., 1982.
[7] Zurada J M. Introduction to Artificial Neural Systems. Boston: PWS Publishing Co., USA, 1999.
[8] Barber C B, Dobkin D P, Huhdanpaa H. The quickhull algorithm for convex hulls. ACM Transactions on Mathematical Software, 1996, 22(4): 469-483.
[9] Tajine M, Elizondo D. New methods for testing linear separability. Neurocomputing, 2002, 47(1-4): 295-322.
[10] Elizondo D. Searching for linearly separable subsets using the class of linear separability method. In Proc. IEEE-IJCNN, Budapest, Hungary, Jul. 25-29, 2004, pp.955-960.
[11] Elizondo D. The linear separability problem: Some testing methods. IEEE Transactions on Neural Networks, 2006, 17(2): 330-344.
[12] www.archive.ics.uci.edu/ml/datasets.html, Mar. 31, 2009.
[13] Fisher R A. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 1936, 7: 179-188.
[14] http://www.cs.waikato.ac.nz/~ml/weka/, May 1, 2009.
[15] Witten I H, Frank E. Data Mining: Practical Machine Learning Tools and Techniques. Elsevier, 2005.

Mohamed Abdel-Kawy Mohamed Ali Soliman received the B.S. degree in electrical and electronic engineering from M.T.C. (Military Technical College), Cairo, Egypt, with grade (Excellent) in 1974, the M.S. degree in electronic and communications engineering from the Faculty of Engineering, Cairo University, Egypt, in 1985, with research on “observers in modern control systems theory”, and the Ph.D. degree in aeronautical engineering in 2000, with the thesis “Intelligent Management for Aircraft and Spacecraft Sensors Systems”. He is currently head of the Computer and Systems Engineering Department, Faculty of Engineering, Zagazig University. His research interests lie in the intersection of the general fields of computer science and engineering, brain science, and cognitive science.

Rasha M. Abo-Bakr was born in 1976 in Egypt and received her Bachelor's degree from the Mathematics (Computer Science) Department, Faculty of Science, Zagazig University, Egypt. She was awarded her Master's degree in computer science in 2003, with a thesis titled “Computer Algorithms for System Identification”. Since 2003 she has been an assistant lecturer at the Mathematics (Computer Science) Department, Faculty of Science, Zagazig University. She received her Ph.D. degree in mathematics & computer science from Zagazig University in 2011, with a dissertation titled “Symbolic Modeling of Dynamical Systems Using Soft Computing Techniques”. Her research interests are artificial intelligence, soft computing technologies, and astronomy.