
Support Vector Machine

Debasis Samanta

IIT Kharagpur

[email protected]

Autumn 2018


Topics to be covered...

Introduction to SVM

Concept of maximum margin hyperplane

Linear SVM

1 Calculation of MMH

2 Learning a linear SVM

3 Classifying a test sample using linear SVM

4 Classifying multi-class data

Non-linear SVM

Concept of non-linear data

Soft-margin SVM

Kernel Trick


Introduction to SVM


Introduction

A classification technique that has received considerable attention is the support vector machine, popularly abbreviated as SVM.

This technique has its roots in statistical learning theory (Vladimir Vapnik, 1992).

As a classification technique, it searches for an optimal hyperplane (i.e., a decision boundary; see Fig. 1 in the next slide) separating the tuples of one class from another.

SVM works well with high-dimensional data and thus avoids the dimensionality problem.

Although SVM-based classification (i.e., training) is extremely slow, the resulting classifier is highly accurate. Further, testing an unknown sample is very fast.

SVM is less prone to overfitting than other methods. It also yields a compact model for classification.


Introduction

Figure 1: Decision boundary in SVM.


Introduction

In this lecture, we shall discuss the following.

1 Maximum margin hyperplane: a key concept in SVM.

2 Linear SVM: a classification technique when the training data are linearly separable.

3 Non-linear SVM: a classification technique when the training data are not linearly separable.


Maximum Margin Hyperplane


Maximum Margin Hyperplane

In our subsequent discussion, we shall assume a simplistic situation: we are given training data D = {t1, t2, ..., tn}, a set of n tuples, which belong to two classes (+ or -), and each tuple is described by two attributes, say A1 and A2.


Maximum Margin Hyperplane

Figure 2: 2-D data linearly separable by hyperplanes.


Maximum Margin Hyperplane contd...

Figure 2 shows a plot of the data in 2-D. Another simplistic assumption here is that the data are linearly separable, that is, we can find a hyperplane (in this case, a straight line) such that all +'s reside on one side whereas all -'s reside on the other side of the hyperplane.

From Fig. 2, it can be seen that an infinite number of separating lines can be drawn. Therefore, the following two questions arise:

1 Are all hyperplanes equivalent as far as the classification of data is concerned?

2 If not, which hyperplane is the best?


Maximum Margin Hyperplane contd...

We may note that, as far as the classification error on the training data is concerned, all of them have zero error.

However, there is no guarantee that all hyperplanes perform equally well on unseen (i.e., test) data.


Maximum Margin Hyperplane contd...

Thus, a good classifier must choose one of the infinitely many hyperplanes such that it performs well not only on the training data but also on the test data.

To illustrate how different choices of hyperplane influence the classification error, consider two arbitrary hyperplanes H1 and H2 as shown in Fig. 3.


Maximum Margin Hyperplane

Figure 3: Hyperplanes with decision boundaries and their margins.


Maximum Margin Hyperplane contd...

In Fig. 3, the two hyperplanes H1 and H2 have their own boundaries, called decision boundaries (denoted b11 and b12 for H1, and b21 and b22 for H2).

A decision boundary is a boundary that is parallel to the hyperplane and touches the closest tuples of a class on one side of the hyperplane.

The distance between the two decision boundaries of a hyperplane is called the margin. So, if the data are classified using hyperplane H1, the margin is larger than with hyperplane H2.

The margin of the hyperplane reflects the error of the classifier: the larger the margin, the lower the classification error.


Maximum Margin Hyperplane contd...

Intuitively, a classifier whose hyperplane has a small margin is more susceptible to overfitting and tends to classify unseen data with weak confidence.

Thus, during the training (or learning) phase, the approach is to search for the hyperplane with the maximum margin.

Such a hyperplane is called the maximum margin hyperplane, abbreviated as MMH.

We may note that the shortest distance from a hyperplane to one of its decision boundaries is equal to the shortest distance from the hyperplane to the decision boundary on its other side.

In other words, the hyperplane lies midway between its decision boundaries.


Linear SVM


Linear SVM

An SVM that is used to classify linearly separable data is called a linear SVM.

In other words, a linear SVM searches for a hyperplane with the maximum margin.

This is why a linear SVM is often termed a maximal margin classifier (MMC).


Finding MMH for a Linear SVM

In the following, we shall discuss the mathematics of finding the MMH given a training set.

In our discussion, we shall consider a binary classification problem consisting of n training data.

Each tuple is denoted by (Xi, Yi), where Xi = (xi1, xi2, ..., xim) corresponds to the attribute set of the i-th tuple (data in m-dimensional space) and Yi ∈ {+, -} denotes its class label.

Note that the choice of which class is labeled + or - is arbitrary.


Finding MMH for a Linear SVM

Thus, given {(Xi, Yi)}, i = 1, ..., n, we are to obtain a hyperplane which separates all the Xi into two sides of it (of course, with maximum gap).

Before going to the general equation of a hyperplane in n dimensions, let us first consider a hyperplane in the 2-D plane.


Equation of a hyperplane in 2-D

Let us consider a 2-D training tuple with attributes A1 and A2 as X = (x1, x2), where x1 and x2 are the values of attributes A1 and A2, respectively, for X.

The equation of a hyperplane in 2-D space can be written as

w0 + w1x1 + w2x2 = 0 [e.g., ax + by + c = 0]

where w0, w1, and w2 are constants defining the slope and intercept of the line.

Any point lying above such a hyperplane satisfies

w0 + w1x1 + w2x2 > 0 (1)


Equation of a hyperplane in 2-D

Similarly, any point lying below the hyperplane satisfies

w0 + w1x1 + w2x2 < 0 (2)

An SVM hyperplane is an n-dimensional generalization of a straight line in 2-D.

It can be visualized as a plane surface in 3-D, but it is not easy to visualize when the dimensionality is greater than 3!


Equation of a hyperplane

In fact, the Euclidean equation of a hyperplane in Rm is

w1x1 + w2x2 + ... + wmxm = b (3)

where the wi's are real numbers and b is a real constant (called the intercept, which can be positive or negative).


Finding a hyperplane

In matrix form, a hyperplane thus can be represented as

W .X + b = 0 (4)

where W = [w1, w2, ..., wm], X = [x1, x2, ..., xm], and b is a real constant.

Here, W and b are parameters of the classifier model, to be estimated from a given training set D.


Finding a hyperplane

Let us consider a two-dimensional training set consisting of two classes + and -, as shown in Fig. 4.

Suppose b1 and b2 are the two decision boundaries above and below a hyperplane, respectively.

Consider any two points X+ and X− as shown in Fig. 4.

For X+ located above the decision boundary, the equation can be written as

W .X+ + b = K where K > 0 (5)


Finding a hyperplane

Figure 4: Computation of the MMH


Finding a hyperplane

Similarly, for any point X− located below the decision boundary, the equation is

W.X− + b = K′ where K′ < 0 (6)

Thus, if we label all +'s with class label + and all -'s with class label -, then we can predict the class label Y for any test data X as

Y = + if W.X + b > 0
Y = − if W.X + b < 0
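As a minimal illustration (not part of the original slides), this prediction rule can be written directly in Python, assuming W and b have already been obtained from training:

```python
import numpy as np

def predict(W, b, X):
    """Predict the class label of test tuple X from the sign of W.X + b."""
    score = np.dot(W, X) + b
    return '+' if score > 0 else '-'

# Hypothetical learned parameters, for illustration only
W, b = np.array([1.0, -2.0]), 0.5
print(predict(W, b, np.array([3.0, 1.0])))   # 3 - 2 + 0.5 = 1.5 > 0, so '+'
```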


Hyperplane and Classification

Note that W.X + b = 0, the equation representing the hyperplane, can be interpreted as follows.

Here, W represents the orientation and b the intercept of the hyperplane from the origin.

If both W and b are scaled (up or down) by a non-zero constant, we get the same hyperplane.

This means there can be an infinite number of solutions obtained with various scaling factors, all of them geometrically representing the same hyperplane.


Hyperplane and Classification

To avoid such confusion, we can make W and b unique by adding the constraint that W′.X + b′ = ±1 for data points on the boundary of each class.

It may be noted that W′.X + b′ = ±1 represents two hyperplanes parallel to each other.

For clarity of notation, we write this as W.X + b = ±1.

Having this understanding, we are now in a position to calculate the margin of a hyperplane.


Calculating Margin of a Hyperplane

Suppose x1 and x2 (refer to Fig. 4) are two points on the decision boundaries b1 and b2, respectively. Thus,

W .x1 + b = 1 (7)

W .x2 + b = −1 (8)

or, subtracting (8) from (7),

W.(x1 − x2) = 2 (9)

This is a dot (.) product of the two vectors W and (x1 − x2). Taking magnitudes of these vectors, the margin d between the two decision boundaries is obtained as

d = 2 / ||W||   (10)

where ||W|| = √(w1² + w2² + ... + wm²) in an m-dimensional space.
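For instance, given a weight vector W, the margin of Eq. (10) is a one-liner; the following sketch (with a made-up W) is only illustrative:

```python
import numpy as np

def margin(W):
    """Margin d = 2 / ||W|| between the boundaries W.x + b = +1 and W.x + b = -1."""
    return 2.0 / np.linalg.norm(W)

print(margin(np.array([3.0, 4.0])))   # ||W|| = 5, so d = 0.4
```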


Calculating Margin of a Hyperplane

We now calculate the margin more formally, as follows.

Consider two parallel hyperplanes H1 and H2 as shown in Fig. 5. Let the equations of the hyperplanes be

H1 : w1x1 + w2x2 − b1 = 0 (11)

H2 : w1x1 + w2x2 − b2 = 0 (12)

To obtain the perpendicular distance d between H1 and H2, we draw a right-angled triangle ABC as shown in Fig. 5.


Calculating Margin of a Hyperplane

Figure 5: Detail of margin calculation. (The figure plots H1: w1x1 + w2x2 − b1 = 0 and H2: w1x1 + w2x2 − b2 = 0 in the (x1, x2) plane; the right-angled triangle ABC has its hypotenuse AB along the x1-axis between the intercepts (b1/w1, 0) and (b2/w1, 0), and AC is the perpendicular distance d between the two hyperplanes.)


Calculating Margin of a Hyperplane

Being parallel, the slope of H1 (and H2) is tan θ = −w1/w2.

In triangle ABC, AB is the hypotenuse and AC is the perpendicular distance between H1 and H2.

Thus, sin(180° − θ) = AC/AB, or AC = AB · sin θ.

AB = b2/w1 − b1/w1 = |b2 − b1| / w1, and sin θ = w1 / √(w1² + w2²) (since tan θ = −w1/w2).

Hence, AC = |b2 − b1| / √(w1² + w2²).


Calculating Margin of a Hyperplane

This can be generalized to find the distance between two such parallel hyperplanes (the decision boundaries) in n-dimensional space as

d = |b2 − b1| / √(w1² + w2² + · · · + wn²) = |b2 − b1| / ||W||

where ||W|| = √(w1² + w2² + · · · + wn²).

In the SVM literature, this margin is famously written as µ(W, b).


Calculating Margin of a Hyperplane

The training phase of SVM involves estimating the parameters W and b of the hyperplane from the given training data. The parameters must be chosen in such a way that the following two inequalities are satisfied:

W.xi + b ≥ 1 if yi = 1 (13)

W.xi + b ≤ −1 if yi = −1 (14)

These conditions impose the requirement that all training tuples from class Y = + must be located on or above the hyperplane W.x + b = 1, while those from class Y = − must be located on or below the hyperplane W.x + b = −1 (also see Fig. 4).


Learning for a Linear SVM

Both the inequalities can be summarized as

yi(W.xi + b) ≥ 1, for all i = 1, 2, ..., n (15)

Note that the tuples that lie on the hyperplanes H1 and H2 (the decision boundaries) are called support vectors.

Essentially, the support vectors are the most difficult tuples to classify and give the most information regarding classification.

In the following, we discuss the approach for finding the MMH and the support vectors.

The above problem turns out to be an optimization problem, that is, to maximize µ(W, b) = 2 / ||W||.


Searching for MMH

Maximizing the margin is, however, equivalent to minimizing the following objective function:

µ′(W, b) = ||W||² / 2   (16)

In a nutshell, the learning task in SVM can be formulated as the following constrained optimization problem:

minimize µ′(W, b)

subject to yi(W.xi + b) ≥ 1, i = 1, 2, ..., n (17)


Searching for MMH

The above-stated constrained optimization problem is popularly known as a convex optimization problem, where the objective function is quadratic and the constraints are linear in the parameters W and b.

The well-known technique to solve a convex optimization problem is the standard Lagrange multiplier method.

First, we shall learn the Lagrange multiplier method, and then come back to solving our own SVM problem.


Lagrange Multiplier Method

The Lagrange multiplier method proceeds in two different ways depending on the type of constraints.

1 Equality-constrained optimization problem: In this case, the problem is of the form:

minimize f(x1, x2, ..., xd)

subject to gi(x) = 0, i = 1, 2, ..., p

2 Inequality-constrained optimization problem: In this case, the problem is of the form:

minimize f(x1, x2, ..., xd)

subject to hi(x) ≤ 0, i = 1, 2, ..., p


Lagrange Multiplier Method

Equality constraint optimization problem solving

The following steps are involved in this case:

1 Define the Lagrangian as follows:

L(X, λ) = f(X) + ∑_{i=1}^{p} λi·gi(X)   (18)

where the λi's are dummy variables called Lagrange multipliers.

2 Set the first-order derivatives of the Lagrangian with respect to x and the Lagrange multipliers λi to zero. That is,

∂L/∂xi = 0, i = 1, 2, ..., d

∂L/∂λi = 0, i = 1, 2, ..., p

3 Solve the (d + p) equations to find the optimal values of X = [x1, x2, ..., xd] and the λi's.


Lagrange Multiplier Method

Example: Equality constraint optimization problem

Suppose we minimize f(x, y) = x + 2y subject to x² + y² − 4 = 0.

1 Lagrangian: L(x, y, λ) = x + 2y + λ(x² + y² − 4)

2 ∂L/∂x = 1 + 2λx = 0
∂L/∂y = 2 + 2λy = 0
∂L/∂λ = x² + y² − 4 = 0

3 Solving the above three equations for x, y and λ, we get x = ∓2/√5, y = ∓4/√5 and λ = ±√5/4.


Lagrange Multiplier Method

Example: Equality-constrained optimization problem

When λ = √5/4, x = −2/√5 and y = −4/√5, giving f(x, y) = −10/√5.

Similarly, when λ = −√5/4, x = 2/√5 and y = 4/√5, giving f(x, y) = 10/√5.

Thus, the function f(x, y) attains its minimum value at x = −2/√5, y = −4/√5.
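These stationary points can be cross-checked symbolically; the following small sketch uses SymPy (an assumption, not part of the lecture):

```python
import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)
L = x + 2*y + lam*(x**2 + y**2 - 4)          # Lagrangian for this example (Eq. 18)

# Set dL/dx = dL/dy = dL/dlam = 0 and solve the three equations
sols = sp.solve([sp.diff(L, x), sp.diff(L, y), sp.diff(L, lam)], [x, y, lam], dict=True)
for s in sols:
    print(s, 'f =', (x + 2*y).subs(s))
# One solution gives f = -10/sqrt(5) (the minimum), the other f = +10/sqrt(5).
```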


Lagrange Multiplier Method

Inequality constraint optimization problem solving

The method for solving this problem is quite similar to the Lagrange multiplier method described above.

It starts with the Lagrangian

L = f(x) + ∑_{i=1}^{p} λi·hi(x)   (19)

In addition, it introduces extra constraints, called the Karush-Kuhn-Tucker (KKT) constraints, which are stated in the next slide.


Lagrange Multiplier Method

Inequality constraint optimization problem solving

∂L/∂xi = 0, i = 1, 2, ..., d

λi ≥ 0, i = 1, 2, ..., p

hi(x) ≤ 0, i = 1, 2, ..., p

λi·hi(x) = 0, i = 1, 2, ..., p

Solving the above equations, we can find the optimal value of f(x).


Lagrange Multiplier Method

Example: Inequality constraint optimization problem

Consider the following problem.

Minimize f(x, y) = (x − 1)² + (y − 3)²

subject to x + y ≤ 2, y ≥ x.

The Lagrangian for this problem is L = (x − 1)² + (y − 3)² + λ1(x + y − 2) + λ2(x − y),

subject to the KKT constraints, which are as follows:


Lagrange Multiplier Method

Example: Inequality constraint optimization problem

∂L/∂x = 2(x − 1) + λ1 + λ2 = 0

∂L/∂y = 2(y − 3) + λ1 − λ2 = 0

λ1(x + y − 2) = 0

λ2(x − y) = 0

λ1 ≥ 0, λ2 ≥ 0

x + y ≤ 2, y ≥ x


Lagrange Multiplier Method

Example: Inequality constraint optimization problem

To solve the KKT conditions, we have to check the following cases:

Case 1: λ1 = 0, λ2 = 0

2(x − 1) = 0, 2(y − 3) = 0 ⇒ x = 1, y = 3

Since x + y = 4 violates x + y ≤ 2, this is not a feasible solution.

Case 2: λ1 = 0, λ2 ≠ 0: x − y = 0, 2(x − 1) + λ2 = 0, 2(y − 3) − λ2 = 0 ⇒ x = 2, y = 2 and λ2 = −2.
Since λ2 = −2 violates λ2 ≥ 0, this is not a feasible solution.


Lagrange Multiplier Method

Example: Inequality constraint optimization problem

Case 3: λ1 ≠ 0, λ2 = 0: x + y = 2, 2(x − 1) + λ1 = 0, 2(y − 3) + λ1 = 0

⇒ x = 0, y = 2 and λ1 = 2; this is a feasible solution.

Case 4: λ1 ≠ 0, λ2 ≠ 0: x + y = 2, x − y = 0, 2(x − 1) + λ1 + λ2 = 0, 2(y − 3) + λ1 − λ2 = 0

⇒ x = 1, y = 1, λ1 = 2 and λ2 = −2. Since λ2 < 0, this is not a feasible solution.
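The feasible solution of Case 3 can also be verified numerically; a brief sketch using SciPy's SLSQP solver (an assumed tool, not part of the lecture):

```python
import numpy as np
from scipy.optimize import minimize

# minimize (x-1)^2 + (y-3)^2  subject to  x + y <= 2  and  y >= x
objective = lambda p: (p[0] - 1)**2 + (p[1] - 3)**2
constraints = [{'type': 'ineq', 'fun': lambda p: 2 - p[0] - p[1]},   # x + y <= 2
               {'type': 'ineq', 'fun': lambda p: p[1] - p[0]}]       # y >= x
res = minimize(objective, x0=np.zeros(2), constraints=constraints, method='SLSQP')
print(res.x)   # approximately (0, 2), matching Case 3
```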


LMM to Solve Linear SVM

The optimization problem for the linear SVM is an inequality-constrained optimization problem.

The Lagrangian for this optimization problem can be written as

L = ||W||²/2 − ∑_{i=1}^{n} λi (yi(W.xi + b) − 1)   (20)

where the λi's are the Lagrange multipliers, and W = [w1, w2, ..., wm] and b are the model parameters.


LMM to Solve Linear SVM

The KKT constraints are:

∂L/∂W = 0 ⇒ W = ∑_{i=1}^{n} λi·yi·xi

∂L/∂b = 0 ⇒ ∑_{i=1}^{n} λi·yi = 0

λi ≥ 0, i = 1, 2, ..., n

λi[yi(W.xi + b) − 1] = 0, i = 1, 2, ..., n

yi(W.xi + b) ≥ 1, i = 1, 2, ..., n

Solving the KKT conditions is computationally expensive; it is typically done with a linear/quadratic programming technique (or some other numerical technique).
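As a rough, self-contained sketch (not the lecture's own method), the primal problem of Eqs. (16)-(17) can also be handed to a generic solver such as SciPy's SLSQP on a tiny, made-up, linearly separable data set; the support vectors are then the tuples whose constraint is active, i.e., yi(W.xi + b) ≈ 1 (equivalently, λi > 0):

```python
import numpy as np
from scipy.optimize import minimize

# Toy, linearly separable data (assumed for illustration)
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [4.0, 3.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

def objective(p):                      # p = [w1, ..., wm, b]
    w = p[:-1]
    return 0.5 * np.dot(w, w)          # ||W||^2 / 2, Eq. (16)

constraints = [{'type': 'ineq',
                'fun': lambda p, i=i: y[i] * (np.dot(p[:-1], X[i]) + p[-1]) - 1.0}
               for i in range(len(y))]  # y_i (W.x_i + b) - 1 >= 0, Eq. (17)

res = minimize(objective, x0=np.zeros(X.shape[1] + 1),
               constraints=constraints, method='SLSQP')
W, b = res.x[:-1], res.x[-1]
active = np.isclose(y * (X @ W + b), 1.0, atol=1e-3)   # constraints met with equality
print(W, b, X[active])                                  # X[active] are the support vectors
```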


LMM to Solve Linear SVM

We first solve the above set of equations to find all the feasible solutions.

Then, we can determine the optimum value of µ(W, b).

Note:
1 The Lagrange multiplier λi must be zero unless the training instance xi satisfies the equation yi(W.xi + b) = 1. Thus, the training tuples with λi > 0 lie on the hyperplane margins and hence are support vectors.

2 The training instances that do not lie on the hyperplane margins have λi = 0.


Classifying a test sample using Linear SVM

For a given training data set, using the SVM principle, we obtain the MMH in the form of W, b and the λi's. This is the machine (i.e., the SVM).

Now, let us see how this MMH can be used to classify a test tuple, say X. This can be done as follows.

δ(X) = W.X + b = ∑_{i=1}^{n} λi·yi·(xi.X) + b   (21)

Note that

W = ∑_{i=1}^{n} λi·yi·xi


Classifying a test sample using Linear SVM

This is famously called the "Representer Theorem", which states that the solution W can always be represented as a linear combination of the training data.

Thus, δ(X) = W.X + b = ∑_{i=1}^{n} λi·yi·(xi.X) + b


Classifying a test sample using Linear SVM

The above involves a dot product xi.X, where xi is a support vector (this is so because λi = 0 for all training tuples except the support vectors); we then check the sign of δ(X).

If it is positive, then X falls on or above the MMH and the SVM predicts that X belongs to class label +. On the other hand, if the sign is negative, then X falls on or below the MMH and the class prediction is -.

Note:

1 Once the SVM is trained with training data, the complexity of the classifier is characterized by the number of support vectors.

2 The dimensionality of the data is not an issue in SVM, unlike in other classifiers.


Illustration : Linear SVM

Consider the case of a binary classification starting with training data of 8 tuples, as shown in Table 1. Using quadratic programming, we can solve the KKT conditions to obtain the Lagrange multiplier λi for each training tuple, which is also shown in Table 1. Note that only the first two tuples are support vectors in this case. Let W = (w1, w2) and b denote the parameters to be determined. We can solve for w1 and w2 as follows:

w1 = ∑i λi·yi·xi1 = 65.52 × 1 × 0.38 + 65.52 × (−1) × 0.49 = −6.64   (22)

w2 = ∑i λi·yi·xi2 = 65.52 × 1 × 0.47 + 65.52 × (−1) × 0.61 = −9.32   (23)


Illustration : Linear SVM

Table 1: Training Data

A1     A2     y    λi
0.38   0.47   +    65.52
0.49   0.61   -    65.52
0.92   0.41   -    0
0.74   0.89   -    0
0.18   0.58   +    0
0.41   0.35   +    0
0.93   0.81   -    0
0.21   0.10   +    0


Illustration : Linear SVM

Figure 6: Linear SVM example.


Illustration : Linear SVM

The parameter b can be calculated from each support vector as follows:
b1 = 1 − W.x1   // for support vector x1 (y1 = +1)
   = 1 − (−6.64) × 0.38 − (−9.32) × 0.47   // using the dot product
   = 7.93
b2 = −1 − W.x2   // for support vector x2 (y2 = −1)
   = −1 − (−6.64) × 0.49 − (−9.32) × 0.61   // using the dot product
   = 7.93

Averaging these values of b1 and b2, we get b = 7.93.


Illustration : Linear SVM

Thus, the MMH is −6.64x1 − 9.32x2 + 7.93 = 0 (also see Fig. 6). Suppose the test data is X = (0.5, 0.5). Therefore,
δ(X) = W.X + b = −6.64 × 0.5 − 9.32 × 0.5 + 7.93 = −0.05, which is negative.

This implies that the test data falls on or below the MMH, and the SVM classifies X as belonging to class label -.
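As a quick cross-check (not part of the original slides), the following sketch recomputes W, b and δ(X) from the two support vectors in Table 1 via W = ∑ λi·yi·xi and Eq. (21). Because the table's coordinates are rounded to two decimals, the numbers differ slightly from the slide's −6.64, −9.32 and 7.93 (which come from unrounded data), but the sign of δ(X), and hence the predicted class, is the same:

```python
import numpy as np

# The two support vectors from Table 1 (rounded) and their multipliers
X_sv = np.array([[0.38, 0.47],
                 [0.49, 0.61]])
y_sv = np.array([+1.0, -1.0])
lam  = np.array([65.52, 65.52])

W = (lam * y_sv) @ X_sv               # W = sum_i lambda_i * y_i * x_i
b = np.mean(y_sv - X_sv @ W)          # average of b_i = y_i - W.x_i over the support vectors

X_test = np.array([0.5, 0.5])
delta = W @ X_test + b                # delta(X) = W.X + b, Eq. (21)
print(W, b, delta, '+' if delta > 0 else '-')   # delta is negative -> class '-'
```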


Classification of Multiple-class Data

In the discussion of linear SVM so far, we have been limited to binary classification (i.e., classification with two classes only).

Note that the linear SVM discussed can handle any number of dimensions n ≥ 2. Now, we are to discuss a more general linear SVM to classify n-dimensional data belonging to two or more classes.

There are two possibilities: all classes are pairwise linearly separable, or the classes overlap, that is, are not linearly separable.

If the classes are pairwise linearly separable, then we can extend the principle of linear SVM to each pair. There are two strategies:

1 One versus one (OVO) strategy
2 One versus all (OVA) strategy


Multi-Class Classification: OVO Strategy

In the OVO strategy, we are to find an MMH for each pair of classes.

Thus, if there are n classes, there are nC2 pairs and hence that many possible classifiers (of course, some of which may be redundant).

Also, see Fig. 7 for 3 classes (namely +, - and ×). Here, Hxy denotes the MMH between class labels x and y.

Similarly, you can think of 4 classes (namely +, -, ×, ÷) and more.


Multi-Class Classification: OVO Strategy

Figure 7: 3-pairwise linearly separable classes.


Multi-Class Classification: OVO Strategy

With the OVO strategy, we test each classifier in turn and obtain δij(X), that is, the outcome of the MMH between the i-th and j-th classes for the test data X.

If there is a class i for which δij(X) has the same sign for all j (j ≠ i), then we can unambiguously say that X is in class i.
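A minimal organizational sketch of OVO prediction following the unanimity rule above (not the lecture's code; it assumes a helper train_binary_svm(X, y) that returns (W, b) for a two-class training set):

```python
from itertools import combinations
import numpy as np

def ovo_predict(X_test, class_data, train_binary_svm):
    """class_data: dict mapping class label -> array of its training tuples."""
    labels = sorted(class_data)
    consistent = {c: True for c in labels}
    for ci, cj in combinations(labels, 2):            # one MMH per pair (nC2 classifiers)
        X = np.vstack([class_data[ci], class_data[cj]])
        y = np.r_[np.ones(len(class_data[ci])), -np.ones(len(class_data[cj]))]
        W, b = train_binary_svm(X, y)                 # MMH between classes ci and cj
        delta = X_test @ W + b                        # delta_ij(X)
        consistent[ci] &= delta > 0                   # '+' side must always point to ci
        consistent[cj] &= delta < 0                   # '-' side must always point to cj
    winners = [c for c in labels if consistent[c]]
    return winners[0] if winners else None            # None: ambiguous region
```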


Multi-Class Classification: OVA Strategy

The OVO strategy is not practical for data with a large number of classes, as the number of classifiers (and hence the computational cost) grows rapidly, namely as nC2, with the number of classes.

As an alternative to the OVO strategy, the OVA (one versus all) strategy has been proposed.

In this approach, we choose any class, say Ci, and consider all tuples of the other classes as belonging to a single class.


Multi-Class Classification: OVA Strategy

The problem is thereby transformed into a binary classification problem, and using the linear SVM discussed above, we can find the hyperplane. Let the hyperplane between Ci and the remaining classes be MMHi.

The process is repeated for each Ci ∈ {C1, C2, ..., Ck}, obtaining MMHi each time.

Thus, with the OVA strategy we get k classifiers.


Multi-Class Classification: OVA Strategy

The unseen data X is then tested with each classifier so obtained.

Let δj(X) be the test result with MMHj that has the maximum magnitude among the test values.

That is,

δj(X) = max∀i {δi(X)}   (24)

Thus, X is classified into class Cj .
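A corresponding OVA sketch (again assuming a train_binary_svm(X, y) helper returning (W, b)); the test tuple is assigned to the class whose classifier yields the largest decision value, as in Eq. (24):

```python
import numpy as np

def ova_predict(X_test, class_data, train_binary_svm):
    """class_data: dict mapping class label -> array of its training tuples."""
    labels = sorted(class_data)
    X = np.vstack([class_data[c] for c in labels])
    scores = {}
    for c in labels:
        y = np.concatenate([np.full(len(class_data[d]), 1.0 if d == c else -1.0)
                            for d in labels])          # class c versus all the rest
        W, b = train_binary_svm(X, y)                  # MMH_c
        scores[c] = X_test @ W + b                     # delta_c(X)
    return max(scores, key=scores.get)                 # argmax_c delta_c(X), Eq. (24)
```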


Multi-Class Classification: OVA Strategy

Note: The linear SVM used to classify multi-class data fails if the classes are not all linearly separable.

Only if one class is linearly separable from the remaining classes, and the test data belong to that particular class, does it classify accurately.


Multi-Class Classification: OVA Strategy

Further, it is possible to have some tuples which cannot be classified by any of the linear SVMs (also see Fig. 8).

There are some tuples which cannot be classified unambiguously by any of the hyperplanes.

Such tuples may be due to noise, errors, or data that are not linearly separable.


Multi-Class Classification: OVA Strategy

Figure 8: Unclassifiable region in OVA strategy.


Non-Linear SVM


Non-Linear SVM

SVM classification for non separable data

Figure 9 shows 2-D views of data that are linearly separable and not linearly separable.

In general, if the data are linearly separable, then there is a separating hyperplane; otherwise, there is none.


Linear and Non-Linear Separable Data

Figure 9: Two types of training data.


SVM classification for non separable data

Such linearly non-separable data can be classified using two approaches:

1 Linear SVM with soft margin

2 Non-linear SVM

In the following, we discuss the extension of the linear SVM to classify linearly non-separable data.

We discuss non-linear SVM in detail later.


Linear SVM for Linearly Not Separable Data

If the number of training instances violating linear separability is small, then we can still use a linear SVM classifier to classify them.

The rationale behind this approach can be better understood from Fig. 10.


Linear SVM for Linearly Not Separable Data

Figure 10: Problem with linear SVM for linearly not separable data.


Linear SVM for Linearly Not Separable Data

Suppose, X1 and X2 are two instances.

We see that the hyperplane H1 wrongly classifies both X1 and X2.

Also, we may note that, because of X1 and X2, we could draw another hyperplane, namely H2, which classifies all the training data correctly.

However, H1 is preferable to H2, as H1 has a larger margin than H2 and is thus less susceptible to overfitting.


Linear SVM for Linearly Not Separable Data

In other words, a linear SVM can be refitted to learn a hyperplane that tolerates a small number of non-separable training instances.

This refitting approach is called the soft margin approach (hence, the SVM is called a Soft Margin SVM); it introduces slack variables for the inseparable cases.

More specifically, the soft margin SVM keeps a linear SVM hyperplane (i.e., linear decision boundaries) even in situations where the classes are not linearly separable.

The concept of the Soft Margin SVM is presented in the following slides.


Soft Margin SVM

Recall that for the linear SVM, we are to determine a maximum margin hyperplane W.X + b = 0 with the following optimization:

minimize ||W||² / 2   (25)

subject to yi(W.xi + b) ≥ 1, i = 1, 2, ..., n

In soft margin SVM, we consider a similar optimization, except that the inequalities are relaxed so that it also covers the case of linearly non-separable data.


Soft Margin SVM

To do this, the soft margin SVM introduces a slack variable ξ, a positive value, into the constraints of the optimization problem.

Thus, for the soft margin we rewrite the optimization problem as follows:

minimize ||W||² / 2

subject to W.xi + b ≥ 1 − ξi, if yi = +1

W.xi + b ≤ −1 + ξi, if yi = −1   (26)

where ξi ≥ 0 for all i.

Thus, in soft margin SVM, we are to calculate W, b and the ξi's as the solution when learning the SVM.


Soft Margin SVM : Interpretation of ξ

Let us find an interpretation of ξ, the slack variable in soft margin SVM. For this, consider the data distribution shown in Fig. 11.

Figure 11: Interpretation of slack variable ξ.


Soft Margin SVM : Interpretation of ξ

The data point X is one of the instances that violate the constraints to be satisfied for a linear SVM.

Thus, W.X + b = −1 + ξ represents a hyperplane that is parallel to the decision boundary for class - and passes through X.

It can be shown that the distance between these hyperplanes is d = ξ / ||W||.

In other words, ξ provides an estimate of the error of the decision boundary on the training example X.


Soft Margin SVM : Interpretation of ξ

In principle, Eqns. 25 and 26 can be chosen as the optimization problem to train a soft margin SVM.

However, the soft margin SVM should impose a constraint on the number of such linearly non-separable data it takes into account.

This is because an SVM might otherwise be trained with a decision boundary of very wide margin at the cost of misclassifying many of the training data.


Soft Margin SVM : Interpretation of ξ

This is explained in Fig. 12. Here, if we increase the margin further, then P and Q will be misclassified. Thus, there is a trade-off between the width of the margin and the training error.

Figure 12: MMH with wide margin and large training error.


Soft Margin SVM : Interpretation of ξ

To avoid this problem, it is therefore necessary to modify the objective function so that it penalizes margins with a large gap, that is, large values of the slack variables.

The modified objective function can be written as

f(W) = ||W||²/2 + c·∑_{i=1}^{n} (ξi)^φ   (27)

where c and φ are user-specified parameters representing the penalty for misclassifying the training data.

Usually, φ = 1. A larger value of c implies a higher penalty.
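A small numpy sketch (not from the lecture) that evaluates this objective for a given W and b; it uses the fact that, at the optimum, the slack ξi equals max(0, 1 − yi(W.xi + b)), and takes φ = 1:

```python
import numpy as np

def soft_margin_objective(W, b, X, y, c):
    """f(W) = ||W||^2 / 2 + c * sum_i xi_i with phi = 1, where the slack
    needed by tuple i is xi_i = max(0, 1 - y_i (W.x_i + b))."""
    slack = np.maximum(0.0, 1.0 - y * (X @ W + b))
    return 0.5 * np.dot(W, W) + c * slack.sum()
```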


Solving for Soft Margin SVM

We can follow the Lagrange multiplier method to solve this inequality-constrained optimization problem, which can be reworked as follows:

L = ||W||²/2 + c·∑_{i=1}^{n} ξi − ∑_{i=1}^{n} λi(yi(W.xi + b) − 1 + ξi) − ∑_{i=1}^{n} µi·ξi   (28)

Here, the λi's and µi's are Lagrange multipliers. The inequality constraints are:

ξi ≥ 0, λi ≥ 0, µi ≥ 0

λi{yi(W.xi + b) − 1 + ξi} = 0

µi·ξi = 0


Solving for Soft Margin SVM

The KKT conditions (in terms of the first-order derivatives of L with respect to the different parameters) are:

∂L/∂wj = wj − ∑_{i=1}^{n} λi·yi·xij = 0

⇒ wj = ∑_{i=1}^{n} λi·yi·xij, for all j = 1, 2, ..., m

∂L/∂b = −∑_{i=1}^{n} λi·yi = 0

⇒ ∑_{i=1}^{n} λi·yi = 0


Solving for Soft Margin SVM

∂L/∂ξi = c − λi − µi = 0

⇒ λi + µi = c, and µi·ξi = 0, for all i = 1, 2, ..., n.   (29)

The above set of equations can be solved for the values of W = [w1, w2, ..., wm], b, and the λi's, µi's and ξi's.

Note:
1 λi ≠ 0 for support vectors or for tuples with ξi > 0, and

2 µi = 0 for those training data which are misclassified, that is, which have ξi > 0.


Non-Linear SVM

A linear SVM is undoubtedly good at classifying data if it is trained with linearly separable data.

A linear SVM can also be used for non-linearly separable data, provided that the number of such instances is small.

However, in real-life applications, the amount of overlap may be so high that a soft margin SVM cannot cope and yield an accurate classifier.

As an alternative, there is a need to compute a decision boundary which is not linear (i.e., not a hyperplane but a hypersurface).


Non-Linear SVM

For understanding this, see Figure 13.

Note that a hyperplane is expressed as a linear equation in terms of the n-dimensional components, whereas a non-linear hypersurface is a non-linear expression.

Figure 13: 2-D view of a few class separabilities.


Non-Linear SVM

A hyperplane is expressed as

linear : w1x1 + w2x2 + w3x3 + c = 0 (30)

whereas a non-linear hypersurface is expressed as

Non-linear: w1x1² + w2x2² + w3x1x2 + w4x3² + w5x1x3 + c = 0   (31)

The task therefore turns to finding a non-linear decision boundary, that is, a non-linear hypersurface in the input space containing the linearly non-separable data.

This task is indeed neither hard nor very complex, and fortunately it can be accomplished by extending the formulation of the linear SVM we have already learned.


Non-Linear SVM

This can be achieved in two major steps.

1 Transform the original (non-linear) input data into a higher-dimensional space (where the data have a linear representation). Note that this is feasible because SVM's performance is decided by the number of support vectors (i.e., roughly the training data), not by the dimension of the data.

2 Search for linear decision boundaries to separate the transformed higher-dimensional data. This can be done along the same lines as for the linear SVM.

In a nutshell, to obtain a non-linear SVM, the trick is to transform the non-linear data into higher-dimensional, linearly separable data. This transformation is popularly called non-linear mapping or attribute transformation. The rest is the same as for the linear SVM.


Concept of Non-Linear Mapping

In order to understand the concept of a non-linear transformation of the original input data into a higher-dimensional space, let us consider a non-linear second-order polynomial in a 3-D input space.

X(x1, x2, x3) = w1x1 + w2x2 + w3x3 + w4x1² + w5x1x2 + w6x1x3 + c

The 3-D input vector X(x1, x2, x3) can be mapped into a 6-D space Z(z1, z2, z3, z4, z5, z6) using the following mappings:

z1 = φ1(x) = x1
z2 = φ2(x) = x2
z3 = φ3(x) = x3
z4 = φ4(x) = x1²
z5 = φ5(x) = x1·x2
z6 = φ6(x) = x1·x3
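A small sketch (not from the lecture) of this 3-D to 6-D mapping as a function; a linear decision boundary learned on the transformed vectors z then corresponds to a second-order surface in the original space:

```python
import numpy as np

def phi(x):
    """Map a 3-D input x = (x1, x2, x3) to the 6-D feature vector
    z = (x1, x2, x3, x1^2, x1*x2, x1*x3) defined by phi_1, ..., phi_6."""
    x1, x2, x3 = x
    return np.array([x1, x2, x3, x1**2, x1*x2, x1*x3])

z = phi(np.array([1.0, 2.0, 3.0]))   # example transformation of one tuple
print(z)
```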


Concept of Non-Linear Mapping

The transformed (now linear) form in the 6-D space looks like

Z : w1z1 + w2z2 + w3z3 + w4z4 + w5z5 + w6z6 + c

Thus, if the Z space is populated using the input data's attributes x1, x2, x3 (and hence the z values), then we can classify the data using linear decision boundaries.


Concept of Non-Linear Mapping

Example: Non-linear mapping to linear SVM

The figure below shows an example of a 2-D data set consisting of class label +1 (shown as +) and class label -1 (shown as -).

Figure 14: Non-linear mapping to Linear SVM.


Concept of Non-Linear Mapping

Example: Non-linear mapping to linear SVM

We see that all instances of class -1 can be separated from instances of class +1 by a circle. The following equation for the decision boundary can be thought of:

X(x1, x2) = +1 if √((x1 − 0.5)² + (x2 − 0.5)²) > 0.2

X(x1, x2) = −1 otherwise   (32)


Concept of Non-Linear Mapping

Example: Non-linear mapping to linear SVM

The decision boundary can be written as:

X : √((x1 − 0.5)² + (x2 − 0.5)²) = 0.2

or x1² − x1 + x2² − x2 = −0.46

A non-linear transformation in 2-D space is proposed as follows:

Z(z1, z2) : z1 = φ1(x) = x1² − x1, z2 = φ2(x) = x2² − x2    (33)


Concept of Non-Linear Mapping

Example: Non-linear mapping to linear SVM

When plotted, the Z space takes the view shown in Fig. 15, where the data are separable by a linear boundary, namely Z : z1 + z2 = −0.46.

Figure 15: Non-linear mapping to Linear SVM.
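The following short sketch reproduces this example numerically (illustrative only; the synthetic points in the unit square are our own assumption). Points are labelled with Eqn. (32), mapped with Eqn. (33), and the mapped points are then checked against the linear boundary z1 + z2 = −0.46.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(300, 2))                               # synthetic 2-D points
y = np.where(np.hypot(X[:, 0] - 0.5, X[:, 1] - 0.5) > 0.2, 1, -1)  # labels from Eqn. (32)

# non-linear mapping of Eqn. (33): z1 = x1^2 - x1, z2 = x2^2 - x2
Z = np.column_stack([X[:, 0]**2 - X[:, 0], X[:, 1]**2 - X[:, 1]])

# in Z space the circular boundary becomes the line z1 + z2 = -0.46
pred = np.where(Z[:, 0] + Z[:, 1] > -0.46, 1, -1)
print(np.mean(pred == y))   # 1.0: the linear boundary reproduces the circular one exactly
```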


Non-Linear to Linear Transformation: Issues

The non-linear mapping, and hence the linear decision boundary concept, looks pretty simple. But there are several potential problems in doing so.

1 Mapping: How to choose the non-linear mapping to a higher dimensional space?

In fact, the φ-transformation works fine for small examples.

But it fails for realistically sized problems.

2 Cost of mapping: For N-dimensional input instances there exist NH = (N + d − 1)! / (d! (N − 1)!) different monomials comprising a feature space of dimensionality NH. Here, d is the maximum degree of a monomial (see the illustrative calculation below).
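As an illustration (our own numbers, not from the slides): with N = 100 input attributes and monomials up to degree d = 4,

NH = 103! / (4! · 99!) = (103 · 102 · 101 · 100) / 24 ≈ 4.4 × 10⁶

so the explicit feature space already has several million dimensions.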


Non-Linear to Linear Transformation: Issues

3 Dimensionality problem: It may suffer from the curse-of-dimensionality problem often associated with high dimensional data.

More specifically, in the calculation of W.X or Xi.X (in δ(X), see Eqn. 18), we need n multiplications and n additions (in their dot products) for each of the n-dimensional input instances and support vectors, respectively.

As the number of input instances as well as support vectors can be enormously large, it is, therefore, computationally expensive.

4 Computational cost: Solving the constrained quadratic optimization problem in the high dimensional feature space is itself a computationally expensive task.


Non-Linear to Linear Transformation: Issues

Fortunately, mathematicians have cleverly proposed an elegant solution to the above problems.

Their solution consists of the following:

1 Dual formulation of the optimization problem

2 Kernel trick

In the next few slides, we shall learn about the above-mentioned two topics.


Dual Formulation of Optimization Problem

We have already learned the Lagrangian formulation to find the maximum margin hyperplane as a linear SVM classifier.

Such a formulation is called the primal form of the constrained optimization problem.

The primal form of the Lagrangian optimization problem is reproduced below:

Minimize ||W||² / 2

subject to yi (W · xi + b) ≥ 1, i = 1, 2, 3, ..., n.    (34)


Dual Formulation of Optimization Problem

The primal form of the above-mentioned inequality-constrained optimization problem (according to the Lagrange multiplier method) is given by

Lp = ||W||² / 2 − ∑_{i=1}^{n} λi ( yi (W · xi + b) − 1 )    (35)

where the λi's are called Lagrange multipliers.

This Lp is called the primal form of the Lagrangian optimization problem.


Dual Formulation of Optimization Problem

The dual form of the same problem can be derived as follows.

To minimize the Lagrangian, we must take the derivatives of Lp with respect to W and b and set them to zero:

∂Lp/∂W = 0 ⇒ W = ∑_{i=1}^{n} λi yi xi    (36)

∂Lp/∂b = 0 ⇒ ∑_{i=1}^{n} λi yi = 0    (37)


Dual Formulation of Optimization Problem

From the above two equations, we get the Lagrangian L as

L = ∑_{i=1}^{n} λi + (1/2) ∑_{i,j} λi yi λj yj (xi · xj) − ∑_{i=1}^{n} λi yi ∑_{j=1}^{n} (λj yj xj) · xi

  = ∑_{i=1}^{n} λi − (1/2) ∑_{i,j} λi λj yi yj (xi · xj)

This form is called the dual form of the Lagrangian and is distinctively written as:

LD = ∑_{i=1}^{n} λi − (1/2) ∑_{i,j} λi λj yi yj (xi · xj)    (38)


Dual Formulation of Optimization Problem

There are key differences between the primal (Lp) and dual (LD) forms of the Lagrangian optimization problem, as follows.

1 Lp involves a large number of parameters, namely W, b and the λi's. On the other hand, LD involves only the λi's, that is, the Lagrange multipliers.

2 Lp is a minimization problem, as its quadratic term is positive. However, the quadratic term in LD carries a negative sign; hence it turns out to be a maximization problem.

3 Lp involves the calculation of W · x, whereas LD involves the calculation of xi · xj. This is, in fact, advantageous, and we will realize it when we learn the kernel-based calculation.


Dual Formulation of Optimization Problem

4 The SVM classifier in primal form is δp(x) = W · x + b with W = ∑_{i=1}^{n} λi yi xi, whereas the dual version of the classifier is

δD(x) = ∑_{i=1}^{m} λi yi (xi · x) + b

where xi is the i-th support vector and there are m support vectors. Thus, both Lp and LD are equivalent (see the sketch below).
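A minimal numpy sketch of the dual classifier follows (illustrative only; the multipliers, labels, support vectors and bias are hypothetical toy values, as if already obtained from training).

```python
import numpy as np

def dual_decision(x, support_vectors, lambdas, labels, b):
    """Dual-form decision value: sum_i lambda_i * y_i * (x_i . x) + b."""
    return np.sum(lambdas * labels * (support_vectors @ x)) + b

sv = np.array([[1.0, 2.0], [2.0, 0.5]])   # support vectors x_i (toy values)
lam = np.array([0.4, 0.4])                # Lagrange multipliers lambda_i
y = np.array([+1.0, -1.0])                # class labels y_i
b = 0.1
print(np.sign(dual_decision(np.array([1.5, 1.5]), sv, lam, y, b)))
```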


Kernel Trick

We have already covered the idea that training data which are not linearly separable can be transformed into a higher dimensional feature space, such that in the transformed space a hyperplane can be found that separates the transformed data, and hence the original data (also see Fig. 16).

Clearly, the data on the left of the figure are not linearly separable.

Yet, if we map them to a 3-D space using φ, it is possible to have a decision boundary, and hence a hyperplane, in the 3-D space.


Kernel Trick

Figure 16: SVM classifier in transformed feature space


Kernel Trick

Example: SVM classifier for non-linear data

Suppose there is a set of data in R² (i.e., in 2-D space), and φ is the mapping from X ∈ R² to Z(z1, z2, z3) ∈ R³, in 3-D space.

R² ⇒ X(x1, x2)

R³ ⇒ Z(z1, z2, z3)


Kernel Trick

Example: SVM classifier for non-linear data

φ(X) ⇒ Z : z1 = x1², z2 = √2 x1 x2, z3 = x2²

The hyperplane, expressed in R², is of the form

w1 x1² + w2 √2 x1 x2 + w3 x2² = 0

which is the equation of an ellipse in 2-D.


Kernel Trick

Example: SVM classifier for non-linear data

After the transformation φ mentioned above, we have a decision boundary of the form

w1 z1 + w2 z2 + w3 z3 = 0

This is clearly a linear form in 3-D space. In other words, W · x + b = 0 in R² has a mapped equivalent W · z + b′ = 0 in R³.

This means that data which are not linearly separable in 2-D are separable in 3-D; that is, non-linear data can be classified by a linear SVM classifier.
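A small illustrative sketch of this idea follows (the synthetic data, the ellipse-shaped labelling rule and the use of scikit-learn are our own assumptions): points labelled by an elliptical rule in 2-D become linearly separable after the mapping (x1, x2) → (x1², √2 x1 x2, x2²), so a linear SVM fits them almost perfectly.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))                        # 2-D points
y = np.where(X[:, 0]**2 + 0.5 * X[:, 1]**2 > 0.4, 1, -1)     # ellipse-shaped boundary (illustrative)

# explicit mapping phi: (x1, x2) -> (x1^2, sqrt(2) x1 x2, x2^2)
Z = np.column_stack([X[:, 0]**2, np.sqrt(2) * X[:, 0] * X[:, 1], X[:, 1]**2])

clf = SVC(kernel='linear').fit(Z, y)    # linear SVM in the mapped 3-D space
print(clf.score(Z, y))                  # close to 1.0: the data are linearly separable after mapping
```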


Kernel Trick

The above can be generalized as follows.

Classifier:

δ(x) = ∑_{i=1}^{n} λi yi (xi · x) + b

δ(z) = ∑_{i=1}^{n} λi yi φ(xi) · φ(x) + b

Learning:

Maximize ∑_{i=1}^{n} λi − (1/2) ∑_{i,j} λi λj yi yj (xi · xj)

Maximize ∑_{i=1}^{n} λi − (1/2) ∑_{i,j} λi λj yi yj φ(xi) · φ(xj)

Subject to: λi ≥ 0, ∑_i λi yi = 0


Kernel Trick

Now, the question is how to choose φ, the mapping function X ⇒ Z, so that the linear SVM can be applied directly.

A breakthrough solution to this problem comes in the form of a method known as the kernel trick.

We discuss the kernel trick in the following.

We know that the dot product (·) is often regarded as a measure of similarity between two input vectors.

For example, if X and Y are two vectors, then

X · Y = |X| |Y| cos θ

Here, the similarity between X and Y is measured as the cosine similarity.

If θ = 0 (i.e., cos θ = 1), they are most similar; if they are orthogonal (cos θ = 0), they are dissimilar.


Kernel Trick

Analogously, if Xi and Xj are two tuples, then Xi · Xj is regarded as a measure of similarity between Xi and Xj.

Again, φ(Xi) and φ(Xj) are the transformed features of Xi and Xj, respectively, in the transformed space; thus, φ(Xi) · φ(Xj) should also be regarded as the similarity measure between φ(Xi) and φ(Xj) in the transformed space.

This is indeed an important revelation and is the basic idea behind the kernel trick.

Now, the natural question arises: if both measure similarity, then what is the relation between them (i.e., between Xi · Xj and φ(Xi) · φ(Xj))?

Let us try to find the answer to this question through an example.


Kernel Trick

Example: Correlation between Xi .Xj and φ(Xi).φ(Xj)

Without any loss of generality, let us consider the situation stated below.

φ : R² ⇒ R³ : x1² ⇒ z1, x2² ⇒ z2, √2 x1 x2 ⇒ z3    (40)

Suppose Xi = [xi1, xi2] and Xj = [xj1, xj2] are any two vectors in R².

Similarly, φ(Xi) = [xi1², √2 xi1 xi2, xi2²] and φ(Xj) = [xj1², √2 xj1 xj2, xj2²] are the transformed versions of Xi and Xj, but in R³.


Kernel Trick

Example: Correlation between Xi .Xj and φ(Xi).φ(Xj)

Now,

φ(Xi) · φ(Xj) = [xi1², √2 xi1 xi2, xi2²] · [xj1², √2 xj1 xj2, xj2²]

= xi1² xj1² + 2 xi1 xi2 xj1 xj2 + xi2² xj2²

= (xi1 xj1 + xi2 xj2)²

= (Xi · Xj)²
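This identity can also be checked numerically; below is a tiny sketch (the two vectors are arbitrary illustrative values, not from the slides).

```python
import numpy as np

def phi(x):
    # explicit transformation R^2 -> R^3 used in this example
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

xi, xj = np.array([1.0, 2.0]), np.array([3.0, -1.0])   # arbitrary illustrative vectors
print(phi(xi) @ phi(xj))    # dot product in the transformed space
print((xi @ xj) ** 2)       # (Xi . Xj)^2 computed in the original space -> same value
```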


Kernel Trick : Correlation between Xi · Xj and φ(Xi) · φ(Xj)

With reference to the above example, we can conclude that φ(Xi) · φ(Xj) is correlated to Xi · Xj.

In fact, the same can be proved in general, for any feature vectors and their transformed feature vectors.

A formal proof of this is beyond the scope of this course.

More specifically, there is a correlation between the dot products of the original data and of the transformed data.

Based on the above discussion, we can write the following implications:

Xi · Xj ⇒ φ(Xi) · φ(Xj) ⇒ K(Xi, Xj)    (41)

Here, K(Xi, Xj) denotes a function popularly called the kernel function.


Kernel Trick : Significance

This kernel function K(Xi, Xj) physically expresses the similarity in the transformed space (i.e., a non-linear similarity measure) using the original attributes Xi, Xj. In other words, the similarity function K computes the similarity whether the data lie in the original attribute space or in the transformed attribute space.

Implicit transformation: The first and foremost significance is that we do not require any φ transformation of the original input data at all. This is evident from the following re-writing of our SVM classification problem.

Classifier : δ(X) = ∑_{i=1}^{n} λi Yi K(Xi, X) + b    (42)

Learning : maximize ∑_{i=1}^{n} λi − (1/2) ∑_{i,j} λi λj Yi Yj K(Xi, Xj)    (43)

subject to λi ≥ 0 and ∑_{i=1}^{n} λi Yi = 0    (44)
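A minimal sketch of the kernelized classifier of Eqn. (42) follows (illustrative only; the RBF kernel choice and all numeric values are our own assumptions). The kernel K replaces the explicit dot product φ(xi) · φ(x), so φ is never computed.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # Gaussian (RBF) kernel K(a, b) = exp(-gamma * ||a - b||^2); gamma is a user-chosen parameter
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_decision(x, support_vectors, lambdas, labels, b, kernel=rbf_kernel):
    """Eqn. (42): delta(x) = sum_i lambda_i * y_i * K(x_i, x) + b, with no explicit phi."""
    k = np.array([kernel(sv, x) for sv in support_vectors])
    return np.sum(lambdas * labels * k) + b

# hypothetical values for illustration only
sv = np.array([[0.2, 0.3], [0.7, 0.8]])
lam, y, b = np.array([0.5, 0.5]), np.array([+1.0, -1.0]), 0.0
print(np.sign(kernel_decision(np.array([0.25, 0.35]), sv, lam, y, b)))
```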


Kernel Trick : Significance

Computational efficiency: Another important significance is easy and efficient computability.

We know that in the SVM classifier discussed above, we need several repeated rounds of dot-product computations, both in the learning phase and in the classification phase.

On the other hand, using the kernel trick, we can do it once, and with fewer dot products.

This is explained in the following.


Kernel Trick : Significance

We define a matrix called the design matrix (X), which contains all the data, and the Gram matrix (K), which contains all the dot products, as follows:

Design matrix : X = [X1, X2, ..., Xn]ᵀ, an n × N matrix    (45)

where n denotes the number of training data in the N-dimensional data space.

Note that X contains all the input data in the original attribute space.


Kernel Trick : Significance

Gram matrix : K =

[ X1ᵀX1  X1ᵀX2  ...  X1ᵀXn ]
[ X2ᵀX1  X2ᵀX2  ...  X2ᵀXn ]
[  ...    ...   ...   ...  ]
[ XnᵀX1  XnᵀX2  ...  XnᵀXn ],  an n × n matrix    (46)

Note that K contains all dot products among all training data, and since XiᵀXj = XjᵀXi, we, in fact, need to compute only half of the matrix.

More elegantly, all dot products amount to a mere matrix operation, and that too only one operation.
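As a quick sketch (the numbers are synthetic and assumed), the whole Gram matrix of Eqn. (46) is obtained from the design matrix with a single matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))        # design matrix: n = 5 instances, N = 3 attributes

K_linear = X @ X.T                 # Gram matrix of all pairwise dot products in one matrix operation
print(K_linear.shape)              # (5, 5)
print(np.allclose(K_linear, K_linear.T))   # symmetric, so only half really needs computing
```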


Kernel Trick : Significance

In a nutshell, we have the following.

Instead of mapping our data via φ and computing the dot products, we can accomplish everything in one operation.

The classifier can be learnt and applied without explicitly computing φ(X).

The complexity of learning depends on n (typically it is O(n³)), not on N, the dimensionality of the data space.

All that is required is the kernel K(Xi, Xj).

Next, we discuss the aspect of deciding kernel functions.


Kernel Functions

Before learning the popular kernel functions adopted in SVM classification, we give a precise and formal definition of the kernel K(Xi, Xj).

Definition 10.1: A kernel function K(Xi, Xj) is a real-valued function such that there exists another function φ : X → Z with K(Xi, Xj) = φ(Xi) · φ(Xj). Symbolically, we write φ : Rᵐ → Rⁿ with K(Xi, Xj) = φ(Xi) · φ(Xj), where n > m.


Kernel Functions : Example

Table 2: Some standard kernel functions

Linear kernel: K(X, Y) = XᵀY. The simplest kernel; used in the linear SVM.

Polynomial kernel of degree p: K(X, Y) = (XᵀY + 1)ᵖ, p > 0. It produces large dot products; the power p is specified a priori by the user.

Gaussian (RBF) kernel: K(X, Y) = e^(−||X − Y||² / 2σ²). A non-linear kernel, the Gaussian radial basis function kernel.

Laplacian kernel: K(X, Y) = e^(−λ||X − Y||). Follows a Laplacian mapping.

Sigmoid kernel: K(X, Y) = tanh(β₀ XᵀY + β₁). Followed when statistical test data is known.

Mahalanobis kernel: K(X, Y) = e^(−(X − Y)ᵀ A (X − Y)). Followed when statistical test data is known.
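Several of these kernels are available ready-made; the short sketch below (illustrative; scikit-learn is assumed) evaluates a few of them on a small synthetic design matrix. Note that scikit-learn parameterizes the RBF kernel with gamma rather than σ (gamma = 1 / 2σ²).

```python
import numpy as np
from sklearn.metrics.pairwise import (linear_kernel, polynomial_kernel,
                                      rbf_kernel, laplacian_kernel, sigmoid_kernel)

X = np.random.default_rng(0).normal(size=(4, 3))   # 4 instances, 3 attributes

print(linear_kernel(X))                         # X Y^T
print(polynomial_kernel(X, degree=3, coef0=1))  # (gamma * X Y^T + 1)^3
print(rbf_kernel(X, gamma=0.5))                 # exp(-gamma * ||X - Y||^2)
print(laplacian_kernel(X, gamma=0.5))           # exp(-gamma * ||X - Y||_1)
print(sigmoid_kernel(X, gamma=0.1, coef0=1.0))  # tanh(gamma * X Y^T + coef0)
```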


Kernel Functions : Example

Different kernel functions involve different parameters. Those parameters are often called magic parameters and are to be decided a priori.

Further, which kernel to follow also depends on the pattern of the data as well as the prudence of the user.

In general, polynomial kernels result in large dot products, and the Gaussian RBF kernel produces more support vectors than other kernels.

An intuitive idea of the kernels to be used in non-linear data handling is shown in Fig. 17.


Kernel Functions : Example

Figure 17: Visual interpretation of a few kernel functions: (a) polynomial kernel, (b) sigmoid kernel, (c) Laplacian kernel, (d) Gaussian RBF kernel.


Mercers Theorem: Properties of Kernel Functions

Other than the standard kernel functions, we can define kernels of our own, as well as combine two or more kernels into another kernel.

It is interesting to note the following properties. If K1 and K2 are two kernels, then

1 K1 + c
2 a·K1
3 a·K1 + b·K2
4 K1 · K2

are also kernels. Here, a, b and c ∈ R⁺. A small numerical check of these closure properties is sketched below.
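The following sketch (illustrative; scikit-learn is assumed, and the values of a, b, c are chosen arbitrarily) builds combined Gram matrices from two base kernels and checks that each is symmetric and positive semi-definite, the two properties a valid kernel matrix must satisfy.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel

X = np.random.default_rng(0).normal(size=(6, 2))

K1 = rbf_kernel(X, gamma=0.5)
K2 = polynomial_kernel(X, degree=2, coef0=1)

# the listed combinations (a, b, c > 0) yield valid kernel (Gram) matrices again
K_sum   = 2.0 * K1 + 3.0 * K2   # a*K1 + b*K2
K_shift = K1 + 1.5              # K1 + c
K_prod  = K1 * K2               # element-wise product corresponds to the product kernel K1.K2

# a valid Gram matrix is symmetric and positive semi-definite
for K in (K_sum, K_shift, K_prod):
    print(np.allclose(K, K.T), np.min(np.linalg.eigvalsh(K)) > -1e-9)
```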


Mercers Theorem: Properties of Kernel Functions

Another requirement for a kernel function used in a non-linear SVM is that there must exist a corresponding transformation such that the kernel function computed for a pair of vectors is equivalent to the dot product between the vectors in the transformed space.

This requirement can be formally stated in the form of Mercer's theorem.


Mercers Theorem: Properties of Kernel Functions

Theorem 10.1: Mercer's Theorem

A kernel function K can be expressed as K(X, Y) = φ(X) · φ(Y) if and only if, for any function g(x) such that ∫ g(x)² dx is finite,

∫∫ K(x, y) g(x) g(y) dx dy ≥ 0

The kernels which satisfy Mercer's theorem are called Mercer kernels.

Additional properties of Mercer kernels are:


Mercers Theorem: Properties of Kernel Functions

Symmetric: K(X, Y) = K(Y, X)

Positive semi-definite: αᵀ K α ≥ 0 for all α ∈ Rⁿ, where K is the n × n Gram matrix.

It can be proved that all the kernels listed in Table 2 satisfy (i) the kernel properties and (ii) Mercer's theorem, and hence (iii) are Mercer kernels.

Practice: Prove that k(x, z) = x − xᵀz is not a valid kernel.


Characteristics of SVM

1 The SVM learning problem can be formulated as a convex optimization problem, for which efficient algorithms are available to find the global minimum of the objective function. Other methods, namely rule-based classifiers, ANN classifiers, etc., find only locally optimal solutions.

2 SVM is well suited to classifying both linear as well as non-linear training data efficiently.

3 SVM can be applied to categorical data by introducing suitable similarity measures.

4 The computational complexity is influenced by the number of training data, not by the dimension of the data. In fact, learning is somewhat computationally heavy and hence slow, but classification of test data is extremely fast and accurate.
