
Support Vector Machines: Statistical Inference with Reproducing Kernel Hilbert Space

Kenji Fukumizu

Institute of Statistical Mathematics, ROIS
Department of Statistical Science, Graduate University for Advanced Studies

May 23, 2008 / Statistical Learning Theory II


Outline

1. A quick course on convex optimization
   - Convexity and convex optimization
   - Dual problem for optimization

2. Optimization in learning of SVM
   - Dual problem and support vectors
   - Sequential Minimal Optimization (SMO)
   - Other approaches



Convexity I

For the details on convex optimization, see [BV04].

Convex set: A set C in a vector space is convex if for every x, y ∈ C and t ∈ [0, 1],

tx + (1 − t)y ∈ C.

Convex function: Let C be a convex set. f : C → R is called a convex function if for every x, y ∈ C and t ∈ [0, 1],

f(tx + (1 − t)y) ≤ t f(x) + (1 − t) f(y).

Concave function: Let C be a convex set. f : C → R is called a concave function if for every x, y ∈ C and t ∈ [0, 1],

f(tx + (1 − t)y) ≥ t f(x) + (1 − t) f(y).

Convexity II

(Figures omitted.)


Convexity III

Fact: If f : C → R is a convex function, the set

{x ∈ C | f(x) ≤ α}

is a convex set for every α ∈ R.

If f_t : C → R (t ∈ T) are convex functions, then

f(x) = sup_{t ∈ T} f_t(x)

is also convex.


Convex optimization I

A general form of convex optimization:

f(x), h_i(x) (1 ≤ i ≤ ℓ): convex functions on D ⊂ R^n; a_j ∈ R^n, b_j ∈ R (1 ≤ j ≤ m).

min_{x ∈ D} f(x)   subject to   h_i(x) ≤ 0 (1 ≤ i ≤ ℓ),   a_j^T x + b_j = 0 (1 ≤ j ≤ m).

h_i: inequality constraints;  r_j(x) = a_j^T x + b_j: linear equality constraints.

Feasible set:

F = {x ∈ D | h_i(x) ≤ 0 (1 ≤ i ≤ ℓ), r_j(x) = 0 (1 ≤ j ≤ m)}.

The optimization problem is called feasible if F ≠ ∅.

In convex optimization every local minimum is a global minimum, so a minimizer can be found numerically.


Convex optimization II

Fact 1. The feasible set F is a convex set.

Fact 2. The set of minimizers

X_opt = {x ∈ F | f(x) = inf{f(y) | y ∈ F}}

is convex.

Proof. Fact 1: the intersection of convex sets is convex. For Fact 2, let

p* = inf_{x ∈ F} f(x).

Then

X_opt = {x ∈ D | f(x) ≤ p*} ∩ F.

Both sets on the right-hand side are convex, which proves Fact 2.


Examples

Linear program (LP)

min c^T x   subject to   Ax = b,   Gx ⪯ h.¹

The objective function, the equality constraints, and the inequality constraints are all linear.

Quadratic program (QP)

min (1/2) x^T P x + q^T x + r   subject to   Ax = b,   Gx ⪯ h,

where P is a positive semidefinite matrix. The objective function is quadratic, while the equality and inequality constraints are linear.

¹ Gx ⪯ h denotes g_j^T x ≤ h_j for all j, where G = (g_1, . . . , g_m)^T.
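As a concrete (hypothetical) illustration of the LP form above, the sketch below feeds a tiny instance to SciPy's generic LP solver; the data c, A, b, G, h are arbitrary example values, not from the lecture.

```python
# A minimal sketch: solve the LP  min c^T x  s.t.  Ax = b,  Gx <= h  with SciPy.
# The data below are arbitrary illustrative values.
import numpy as np
from scipy.optimize import linprog

c = np.array([1.0, 2.0])              # objective coefficients
A = np.array([[1.0, 1.0]])            # equality constraints  A x = b
b = np.array([1.0])
G = np.array([[-1.0, 0.0],            # inequality constraints  G x <= h  (here: x >= 0)
              [0.0, -1.0]])
h = np.array([0.0, 0.0])

res = linprog(c, A_ub=G, b_ub=h, A_eq=A, b_eq=b, bounds=[(None, None)] * 2)
print(res.x, res.fun)                 # minimizer and optimal value
```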


Lagrange duality I

Consider an optimization problem (which may not be convex):

(primal)   min_{x ∈ D} f(x)   subject to   h_i(x) ≤ 0 (1 ≤ i ≤ ℓ),   r_j(x) = 0 (1 ≤ j ≤ m).

Lagrange dual function g : R^ℓ × R^m → [−∞, ∞):

g(λ, ν) = inf_{x ∈ D} L(x, λ, ν),

where

L(x, λ, ν) = f(x) + ∑_{i=1}^ℓ λ_i h_i(x) + ∑_{j=1}^m ν_j r_j(x).

λ_i and ν_j are called Lagrange multipliers. g is a concave function (it is a pointwise infimum of affine functions of (λ, ν)).


Lagrange duality II

Dual problem:

(dual)   max g(λ, ν)   subject to   λ ⪰ 0.

The dual and primal problems are closely connected.

Theorem (weak duality)

Let

p* = inf{f(x) | h_i(x) ≤ 0 (1 ≤ i ≤ ℓ), r_j(x) = 0 (1 ≤ j ≤ m)}

and

d* = sup{g(λ, ν) | λ ⪰ 0, ν ∈ R^m}.

Then

d* ≤ p*.

Weak duality does not require convexity of the primal optimization problem.


Lagrange duality III

Proof. Fix any λ ⪰ 0 and ν ∈ R^m. For any feasible point x,

∑_{i=1}^ℓ λ_i h_i(x) + ∑_{j=1}^m ν_j r_j(x) ≤ 0

(the first sum is non-positive and the second is zero). This means that for any feasible point x,

L(x, λ, ν) ≤ f(x).

Taking the infimum over feasible points,

inf_{x: feasible} L(x, λ, ν) ≤ p*.

Thus

g(λ, ν) = inf_{x ∈ D} L(x, λ, ν) ≤ inf_{x: feasible} L(x, λ, ν) ≤ p*

for any λ ⪰ 0 and ν ∈ R^m.


Strong duality

To obtain strong duality d* = p*, some conditions are needed.

Convexity of the problem: f and h_i are convex, r_j are linear (affine).

Slater's condition: there is x ∈ D such that

h_i(x) < 0 (1 ≤ i ≤ ℓ),   r_j(x) = 0 (1 ≤ j ≤ m).

Theorem (strong duality)

Suppose the primal problem is convex and Slater's condition holds. Then there exist λ* ⪰ 0 and ν* ∈ R^m such that

g(λ*, ν*) = d* = p*.

The proof is omitted (see [BV04], Sec. 5.3.2). There are also other conditions that guarantee strong duality.
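A small worked example (not from the slides): minimize f(x) = x^2 over x ∈ R subject to h(x) = 1 − x ≤ 0. The Lagrangian is L(x, λ) = x^2 + λ(1 − x), so g(λ) = inf_x L(x, λ) = λ − λ^2/4, and d* = sup_{λ ≥ 0} g(λ) = 1, attained at λ* = 2. The primal optimum is p* = 1 at x* = 1, so d* = p*, as the theorem predicts; Slater's condition clearly holds (e.g., h(2) < 0).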


Geometric interpretation of duality

G = {(h(x), r(x), f(x)) ∈ R^ℓ × R^m × R | x ∈ D}.

p* = inf{t | (u, 0, t) ∈ G, u ⪯ 0}.

For λ ⪰ 0 (neglecting ν),

g(λ) = inf{(λ, 1)^T (u, t) | (u, t) ∈ G}.

The hyperplane (λ, 1)^T (u, t) = g(λ) is non-vertical and intersects the t-axis at t = g(λ). Weak duality d* = sup_λ g(λ) ≤ p* is easy to see from this picture, and strong duality d* = p* holds under some conditions.

(Figure omitted.)


Complementary slackness I

Consider the (not necessarily convex) optimization problem:

min f(x)   subject to   h_i(x) ≤ 0 (1 ≤ i ≤ ℓ),   r_j(x) = 0 (1 ≤ j ≤ m).

Assumption: the optima of the primal and dual problems are attained at x* and (λ*, ν*), and strong duality holds:

g(λ*, ν*) = f(x*).

Observation:

f(x*) = g(λ*, ν*) = inf_{x ∈ D} L(x, λ*, ν*)   [definition]
      ≤ L(x*, λ*, ν*)
      = f(x*) + ∑_{i=1}^ℓ λ*_i h_i(x*) + ∑_{j=1}^m ν*_j r_j(x*)
      ≤ f(x*)   [the second term is ≤ 0 and the third term is = 0]

Hence the two inequalities are in fact equalities.


Complementary slackness II

Consequence 1: x* minimizes L(x, λ*, ν*).

Consequence 2: λ*_i h_i(x*) = 0 for all i.

This is called complementary slackness. Equivalently,

λ*_i > 0 ⇒ h_i(x*) = 0,   or   h_i(x*) < 0 ⇒ λ*_i = 0.


Solving primal problem via dual

If the dual problem is easier to solve, the primal can sometimes be solved through the dual.

Assumption: strong duality holds (e.g., Slater's condition is satisfied), and the dual solution (λ*, ν*) is available.

The primal solution x*, if it exists, must minimize L(x, λ*, ν*) over D (previous slide). Hence, if a solution of

min_{x ∈ D}  f(x) + ∑_{i=1}^ℓ λ*_i h_i(x) + ∑_{j=1}^m ν*_j r_j(x)

is obtained and it is primal feasible, then it must be the primal solution.


KKT condition I

The KKT conditions give useful relations between the primal and dual solutions.

Consider the convex optimization problem, and assume that D is open and that f(x), h_i(x) are differentiable:

min f(x)   subject to   h_i(x) ≤ 0 (1 ≤ i ≤ ℓ),   r_j(x) = 0 (1 ≤ j ≤ m).

Let x* and (λ*, ν*) be any optimal points of the primal and dual problems, and assume that strong duality holds. Since x* minimizes L(x, λ*, ν*),

∇f(x*) + ∑_{i=1}^ℓ λ*_i ∇h_i(x*) + ∑_{j=1}^m ν*_j ∇r_j(x*) = 0.


KKT condition II

The following conditions are necessary.

Karush-Kuhn-Tucker (KKT) conditions:

h_i(x*) ≤ 0 (i = 1, . . . , ℓ)   [primal constraints]
r_j(x*) = 0 (j = 1, . . . , m)   [primal constraints]
λ*_i ≥ 0 (i = 1, . . . , ℓ)   [dual constraints]
λ*_i h_i(x*) = 0 (i = 1, . . . , ℓ)   [complementary slackness]
∇f(x*) + ∑_{i=1}^ℓ λ*_i ∇h_i(x*) + ∑_{j=1}^m ν*_j ∇r_j(x*) = 0.

Theorem (KKT conditions)

For a convex optimization problem with differentiable functions, x* and (λ*, ν*) are primal and dual solutions with strong duality if and only if they satisfy the KKT conditions.


KKT condition III

Proof. x* is primal feasible by the first two conditions. Since λ*_i ≥ 0, L(x, λ*, ν*) is convex (and differentiable) in x, and the last condition ∇_x L(x*, λ*, ν*) = 0 implies that x* is a minimizer. It follows that

g(λ*, ν*) = inf_{x ∈ D} L(x, λ*, ν*)   [by definition]
          = L(x*, λ*, ν*)   [x*: minimizer]
          = f(x*) + ∑_{i=1}^ℓ λ*_i h_i(x*) + ∑_{j=1}^m ν*_j r_j(x*)
          = f(x*)   [complementary slackness and r_j(x*) = 0].

Hence strong duality holds, and x* and (λ*, ν*) must be the optimizers.


Example

Quadratic minimization under equality constraints:

min (1/2) x^T P x + q^T x + r   subject to   Ax = b,

where P is (strictly) positive definite.

KKT conditions:

Ax* = b   [primal constraint]
∇_x L(x*, ν*) = 0  ⟹  Px* + q + A^T ν* = 0.

The solution is therefore given by the linear system

( P   A^T ) ( x* )   ( −q )
( A    O  ) ( ν* ) = (  b ).


Lagrange dual of SVM I

The QP for the SVM can be solved in the primal form, but the dual form is easier.

SVM: primal problem

min_{w, b, ξ}  (1/2) ∑_{i,j=1}^N w_i w_j k(X_i, X_j) + C ∑_{i=1}^N ξ_i

subject to   Y_i (∑_{j=1}^N k(X_i, X_j) w_j + b) ≥ 1 − ξ_i,   ξ_i ≥ 0.

Lagrange function of the SVM:

L(w, b, ξ, α, β) = (1/2) ∑_{i,j=1}^N w_i w_j k(X_i, X_j) + C ∑_{i=1}^N ξ_i
                   + ∑_{i=1}^N α_i (1 − Y_i f(X_i) − ξ_i) + ∑_{i=1}^N β_i (−ξ_i),

where f(X_i) = ∑_{j=1}^N w_j k(X_i, X_j) + b.


Lagrange dual of SVM II

SVM dual problem:

max_α  ∑_{i=1}^N α_i − (1/2) ∑_{i,j=1}^N α_i α_j Y_i Y_j K_ij

subject to   0 ≤ α_i ≤ C,   ∑_{i=1}^N α_i Y_i = 0.

Notation: K_ij = k(X_i, X_j).

Derivation. The Lagrange dual function is

g(α, β) = min_{w, b, ξ} L(w, b, ξ, α, β).

The minimizer (w*, b*, ξ*) satisfies

∇_w :  ∑_{j=1}^N K_ij w*_j = ∑_{j=1}^N α_j Y_j K_ij   (∀i),
∇_b :  ∑_{j=1}^N α_j Y_j = 0,
∇_ξ :  C − α_i − β_i = 0   (∀i).


Lagrange dual of SVM III

From these relations,

(1/2) ∑_{i,j=1}^N w*_i w*_j K_ij = (1/2) ∑_{i,j=1}^N α_i α_j Y_i Y_j K_ij,

∑_{i=1}^N (C − α_i − β_i) ξ*_i = 0,

∑_{i=1}^N α_i (1 − Y_i (∑_j K_ij w*_j + b)) = ∑_{i=1}^N α_i − ∑_{i,j=1}^N α_i α_j Y_i Y_j K_ij.

Thus,

g(α, β) = L(w*, b*, ξ*, α, β) = ∑_{i=1}^N α_i − (1/2) ∑_{i,j=1}^N α_i α_j Y_i Y_j K_ij.

From α_i ≥ 0, β_i ≥ 0, and C − α_i − β_i = 0, the constraint on α_i is

0 ≤ α_i ≤ C   (∀i).
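As a sanity check, the dual above can be handed to a generic constrained optimizer. The sketch below (assuming a precomputed kernel matrix K, labels Y in {−1, +1}, and a chosen C; all names are illustrative, not from the slides) uses SciPy's SLSQP. This is only practical for small N; dedicated methods such as SMO are used in practice.

```python
# A minimal sketch: solve the SVM dual
#   max_a  sum_i a_i - 1/2 sum_{ij} a_i a_j Y_i Y_j K_ij
#   s.t.   0 <= a_i <= C,  sum_i a_i Y_i = 0
# with a generic solver (SLSQP). Suitable only for small sample sizes.
import numpy as np
from scipy.optimize import minimize

def solve_svm_dual(K, Y, C):
    N = len(Y)
    Q = (Y[:, None] * Y[None, :]) * K           # Q_ij = Y_i Y_j K_ij

    def neg_dual(a):                             # minimize the negated dual objective
        return 0.5 * a @ Q @ a - a.sum()

    def neg_dual_grad(a):
        return Q @ a - np.ones(N)

    constraints = [{"type": "eq", "fun": lambda a: a @ Y}]   # sum_i a_i Y_i = 0
    bounds = [(0.0, C)] * N                                  # box constraint 0 <= a_i <= C
    res = minimize(neg_dual, np.zeros(N), jac=neg_dual_grad,
                   bounds=bounds, constraints=constraints, method="SLSQP")
    return res.x                                             # optimal alpha
```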


KKT conditions of SVM

KKT conditions:

(1) 1 − Y_i f*(X_i) − ξ*_i ≤ 0   (∀i),
(2) −ξ*_i ≤ 0   (∀i),
(3) α*_i ≥ 0   (∀i),
(4) β*_i ≥ 0   (∀i),
(5) α*_i (1 − Y_i f*(X_i) − ξ*_i) = 0   (∀i),
(6) β*_i ξ*_i = 0   (∀i),
(7) ∇_w : ∑_{j=1}^N K_ij w*_j = ∑_{j=1}^N α*_j Y_j K_ij   (∀i),
    ∇_b : ∑_{j=1}^N α*_j Y_j = 0,
    ∇_ξ : C − α*_i − β*_i = 0   (∀i).


Support vectors I

Complementary slackness:

α*_i (1 − Y_i f*(X_i) − ξ*_i) = 0   (∀i),
(C − α*_i) ξ*_i = 0   (∀i).

If α*_i = 0, then ξ*_i = 0 and

Y_i f*(X_i) ≥ 1.   [well separated]

Support vectors: if 0 < α*_i < C, then ξ*_i = 0 and

Y_i f*(X_i) = 1.

If α*_i = C, then

Y_i f*(X_i) ≤ 1.


Support vectors II

Sparse representation: the optimal classifier is expressed using only the support vectors:

f(x) = ∑_{i: support vector} α*_i Y_i k(x, X_i) + b*.
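A small sketch of evaluating this sparse expansion (assuming alpha, b_star, Y, X from a trained model and a kernel function k; all names are illustrative):

```python
# Evaluate f(x) using only the support vectors (alpha_i > 0).
import numpy as np

def decision_function(x, X, Y, alpha, b_star, k, tol=1e-8):
    sv = np.where(alpha > tol)[0]                 # indices of support vectors
    return sum(alpha[i] * Y[i] * k(x, X[i]) for i in sv) + b_star
```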


How to solve b

The optimal value of b is given by complementary slackness. For any i with 0 < α*_i < C,

Y_i (∑_j k(X_i, X_j) Y_j α*_j + b) = 1.

Use this relation for any such i, or take the average over all such i.
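A sketch of this averaging (assuming the kernel matrix K, labels Y, and the dual solution alpha; illustrative names, not from the slides):

```python
# Compute b by averaging over the margin support vectors (0 < alpha_i < C).
import numpy as np

def intercept(K, Y, alpha, C, tol=1e-8):
    on_margin = (alpha > tol) & (alpha < C - tol)
    # For each such i:  Y_i (sum_j K_ij Y_j alpha_j + b) = 1
    #               =>  b = Y_i - sum_j K_ij Y_j alpha_j   (since Y_i = +-1)
    residuals = Y[on_margin] - K[on_margin] @ (Y * alpha)
    return residuals.mean()
```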


Computational problem in solving SVM

The dual QP problem of the SVM has N variables, where N is the sample size. If N is very large, say N = 100,000, the optimization is very hard. Several approaches have been proposed that optimize subsets of the variables sequentially:

- Chunking [Vap82]
- Osuna's method [OFG]
- Sequential minimal optimization (SMO) [Pla99]
- SVMlight (http://svmlight.joachims.org/)


Sequential minimal optimization (SMO) I

Solve small QP problems sequentially, each over a pair of variables (α_i, α_j).

How to choose the pair? Intuition from the KKT conditions is used. After eliminating w, ξ, and β, the KKT conditions of the SVM are equivalent to

(∗)   0 ≤ α*_i ≤ C,   ∑_{i=1}^N Y_i α*_i = 0,

      α*_i = 0       ⇒  Y_i f*(X_i) ≥ 1,
      0 < α*_i < C   ⇒  Y_i f*(X_i) = 1,
      α*_i = C       ⇒  Y_i f*(X_i) ≤ 1

(shown later). These conditions can be checked for each data point. Choose (i, j) such that at least one of the two violates the KKT conditions; a small per-point check is sketched below.
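A minimal sketch of this per-point check (assuming the current alpha, decision values f_i = f*(X_i), labels Y, C, and a tolerance; illustrative names, not from the slides):

```python
# Return indices of points that violate the reduced KKT conditions (*).
import numpy as np

def kkt_violators(alpha, f, Y, C, tol=1e-3):
    margins = Y * f                                           # Y_i f(X_i)
    viol_lower = (alpha < tol) & (margins < 1 - tol)          # alpha_i = 0  but  Y_i f < 1
    viol_box = (alpha > tol) & (alpha < C - tol) & (np.abs(margins - 1) > tol)
    viol_upper = (alpha > C - tol) & (margins > 1 + tol)      # alpha_i = C  but  Y_i f > 1
    return np.where(viol_lower | viol_box | viol_upper)[0]
```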


Sequential minimal optimization (SMO) II

The QP problem for the pair (α_i, α_j) can be solved analytically!

For simplicity, assume (i, j) = (1, 2). Constraints on α_1 and α_2:

α_1 + s_12 α_2 = γ,   0 ≤ α_1, α_2 ≤ C,

where s_12 = Y_1 Y_2 and γ = ±∑_{ℓ≥3} Y_ℓ α_ℓ is constant.

Objective function:

α_1 + α_2 − (1/2) α_1^2 K_11 − (1/2) α_2^2 K_22 − s_12 α_1 α_2 K_12
    − Y_1 α_1 ∑_{j≥3} Y_j α_j K_1j − Y_2 α_2 ∑_{j≥3} Y_j α_j K_2j + const.

Substituting the equality constraint, this is a quadratic optimization in one variable over an interval, which can be solved directly.
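A sketch of the resulting closed-form pairwise update, in the spirit of Platt's SMO [Pla99] (assuming the kernel matrix K, labels Y in {−1, +1}, current alpha, current decision values f, and C; illustrative names, and without the pair-selection heuristics and f-updates of the full algorithm):

```python
# One analytic SMO step on the pair (i, j): optimize the dual in alpha_j along
# the line  Y_i alpha_i + Y_j alpha_j = const, clipped to the box [0, C].
import numpy as np

def smo_step(i, j, alpha, K, Y, f, C):
    E_i, E_j = f[i] - Y[i], f[j] - Y[j]          # prediction errors
    eta = K[i, i] + K[j, j] - 2 * K[i, j]        # curvature along the line
    if eta <= 0:
        return alpha                              # skip degenerate pairs in this sketch
    # Feasible interval [L, H] for the new alpha_j
    if Y[i] != Y[j]:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    a_j_new = np.clip(alpha[j] + Y[j] * (E_i - E_j) / eta, L, H)
    a_i_new = alpha[i] + Y[i] * Y[j] * (alpha[j] - a_j_new)
    alpha = alpha.copy()
    alpha[i], alpha[j] = a_i_new, a_j_new
    return alpha
```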


KKT conditions revisited I

β and w can be eliminated using

∇_ξ :  β*_i = C − α*_i   (∀i),
∇_w :  ∑_{j=1}^N K_ij w*_j = ∑_{j=1}^N α*_j Y_j K_ij   (∀i).

From (4) and (6),

α*_i ≤ C,   ξ*_i (C − α*_i) = 0   (∀i).

The KKT conditions are then equivalent to

(a) 1 − Y_i f*(X_i) − ξ*_i ≤ 0   (∀i),
(b) ξ*_i ≥ 0   (∀i),
(c) 0 ≤ α*_i ≤ C   (∀i),
(d) α*_i (1 − Y_i f*(X_i) − ξ*_i) = 0   (∀i),
(e) ξ*_i (C − α*_i) = 0   (∀i),
(f) ∑_{i=1}^N Y_i α*_i = 0.


KKT conditions revisited II

We can further eliminate ξ.

Case α*_i = 0: from (e), ξ*_i = 0. Then, from (a), Y_i f*(X_i) ≥ 1.
Case 0 < α*_i < C: from (e), ξ*_i = 0. From (d), Y_i f*(X_i) = 1.
Case α*_i = C: from (d), ξ*_i = 1 − Y_i f*(X_i), so (b) gives Y_i f*(X_i) ≤ 1.

Note that in all cases (a) and (b) are satisfied.

The KKT conditions are therefore equivalent to

0 ≤ α*_i ≤ C   (∀i),
∑_{i=1}^N Y_i α*_i = 0,
α*_i = 0       ⇒  Y_i f*(X_i) ≥ 1   (ξ*_i = 0),
0 < α*_i < C   ⇒  Y_i f*(X_i) = 1   (ξ*_i = 0),
α*_i = C       ⇒  Y_i f*(X_i) ≤ 1   (ξ*_i = 1 − Y_i f*(X_i)).


Recent studies (not a complete list).

Solution in the primal:
- O. Chapelle [Cha07]
- T. Joachims, SVMperf [Joa06]
- S. Shalev-Shwartz et al. [SSSS07]

Online SVM:
- Tax and Laskov [TL03]
- LaSVM [BEWB05] (http://leon.bottou.org/projects/lasvm/)

Parallel computation:
- Cascade SVM [GCB+05]
- Zanni et al. [ZSZ06]

Others:
- Column generation techniques for large-scale problems [DBS02]


References I

[BEWB05] Antoine Bordes, Seyda Ertekin, Jason Weston, and Léon Bottou. Fast kernel classifiers with online and active learning. Journal of Machine Learning Research, 6:1579–1619, 2005.

[BV04] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004. http://www.stanford.edu/~boyd/cvxbook/.

[Cha07] Olivier Chapelle. Training a support vector machine in the primal. Neural Computation, 19:1155–1178, 2007.

[DBS02] Ayhan Demiriz, Kristin P. Bennett, and John Shawe-Taylor. Linear programming boosting via column generation. Machine Learning, 46(1-3):225–254, 2002.


References II

[GCB+05] Hans Peter Graf, Eric Cosatto, Léon Bottou, Igor Dourdanovic, and Vladimir Vapnik. Parallel support vector machines: The Cascade SVM. In Lawrence Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems, volume 17. MIT Press, 2005.

[Joa06] Thorsten Joachims. Training linear SVMs in linear time. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), 2006.

[OFG] Edgar Osuna, Robert Freund, and Federico Girosi. An improved training algorithm for support vector machines. In Proceedings of the 1997 IEEE Workshop on Neural Networks for Signal Processing (IEEE NNSP 1997), pages 276–285.


References III

[Pla99] John Platt. Fast training of support vector machines using sequential minimal optimization. In Bernhard Schölkopf, Christopher Burges, and Alexander Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 185–208. MIT Press, 1999.

[SSSS07] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of the International Conference on Machine Learning, 2007.

[TL03] D.M.J. Tax and P. Laskov. Online SVM learning: from classification to data description and back. In Proceedings of the IEEE 13th Workshop on Neural Networks for Signal Processing (NNSP 2003), pages 499–508, 2003.


References IV

[Vap82] Vladimir N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, 1982.

[ZSZ06] Luca Zanni, Thomas Serafini, and Gaetano Zanghirati. Parallel software for training large scale support vector machines on multiprocessor systems. Journal of Machine Learning Research, 7:1467–1492, 2006.
