3.6 Support Vector Machines
K. M. Koo
Goal of SVM: Find Maximum Margin
goal: find a separating hyperplane with maximum margin
margin: minimum distance between a separating hyperplane and the sets of classes ω1 and ω2
Goal of SVM: Find Maximum Margin
assume that ω1 and ω2 are linearly separable
find the separating hyperplane with maximum margin
Calculate margin
the separating hyperplane is g(x) = w^T x + w_0 = 0
w and w_0 are not uniquely determined; under the constraint min_x |w^T x + w_0| = 1, w and w_0 are uniquely determined
Calculate margin
the distance between a point x and the hyperplane g(x) = 0 is given by |g(x)| / ||w|| = |w^T x + w_0| / ||w||
thus, the margin is given by min_x |w^T x + w_0| / ||w|| = 1 / ||w||
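The distance formula above can be checked numerically; here is a minimal pure-Python sketch (the function name and the sample hyperplane are illustrative, not from the text):

```python
import math

def distance_to_hyperplane(x, w, w0):
    """Distance from point x to the hyperplane w^T x + w0 = 0: |w^T x + w0| / ||w||."""
    g = sum(wi * xi for wi, xi in zip(w, x)) + w0
    return abs(g) / math.sqrt(sum(wi * wi for wi in w))

# hypothetical hyperplane 2*x1 = 0, i.e. w = [2, 0], w0 = 0
w, w0 = [2.0, 0.0], 0.0
print(distance_to_hyperplane([1.0, 5.0], w, w0))  # |2*1 + 0*5| / 2 = 1.0

# with the normalization min |w^T x + w0| = 1, the margin on each side is 1/||w||
print(1.0 / math.sqrt(sum(wi * wi for wi in w)))  # 0.5
```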
Optimization of margin
maximization of the margin 1/||w|| + 1/||w|| = 2/||w||, requiring that
w^T x + w_0 ≥ 1, for x ∈ ω1
w^T x + w_0 ≤ −1, for x ∈ ω2
Optimization of margin
therefore, we want to
minimize J(w) = (1/2)||w||^2
subject to y_i(w^T x_i + w_0) ≥ 1, i = 1, 2, ..., N
separating hyperplane with maximal margin = separating hyperplane with minimum ||w||
this is an optimization problem with inequality constraints
optimization with constraints
[figure: cost function J(θ) and its minimum under an equality constraint vs. under an inequality constraint]
Lagrange Multiplier
an optimization problem under constraints can be solved by the method of Lagrange multipliers
let f, g : U → R be real-valued functions on U ⊆ R^n, let x_0 ∈ U with g(x_0) = c, and let S = g^{-1}(c), the level set of g with value c. assume ∇g(x_0) ≠ 0. if f|_S has a local minimum or maximum on S at x_0, which is called a critical point of f|_S, then there is a real number λ, called a Lagrange multiplier, such that
∇f(x_0) = λ ∇g(x_0)
The Method of Lagrange Multiplier
minimize J(θ) subject to a^T θ = b
[figure: level curves J(θ) = c_1, c_2, c_3 and the constraint line a^T θ = b; at the solution θ*, ∇J(θ*) = λa]
Lagrange Multiplier
the Lagrangian is obtained as follows:
for equality constraints
L(θ, λ) = J(θ) + Σ_{i=1}^{m} λ_i (a_i^T θ − b_i)
for inequality constraints
L(θ, λ) = J(θ) + Σ_{i=1}^{m} λ_i (a_i^T θ − b_i), λ_i ≥ 0, i = 1, ..., m
in our case (inequality constraints)
L(w, w_0, λ) = (1/2)||w||^2 − Σ_{i=1}^{N} λ_i [y_i(w^T x_i + w_0) − 1], λ_i ≥ 0, i = 1, 2, ..., N
Convex
a subset C ⊆ X is convex iff for any x, y ∈ C, the line segment joining x and y is also a subset of C, i.e. for any λ ∈ [0, 1], λx + (1 − λ)y ∈ C
a real-valued function f on C is convex iff for any two points x, y ∈ C and for any λ ∈ [0, 1],
f(λx + (1 − λ)y) ≤ λ f(x) + (1 − λ) f(y)
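The defining inequality for a convex function can be checked numerically along one segment; a small sketch (the function name and test points are illustrative):

```python
def is_convex_on_segment(f, x, y, steps=100):
    """Check f(t*x + (1-t)*y) <= t*f(x) + (1-t)*f(y) on a grid of t in [0, 1]."""
    for k in range(steps + 1):
        t = k / steps
        lhs = f(t * x + (1 - t) * y)
        rhs = t * f(x) + (1 - t) * f(y)
        if lhs > rhs + 1e-12:  # tolerance for floating-point noise
            return False
    return True

print(is_convex_on_segment(lambda v: v * v, -3.0, 5.0))     # x^2 is convex: True
print(is_convex_on_segment(lambda v: -(v * v), -3.0, 5.0))  # -x^2 is concave: False
```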
Convex
[figure: examples of a convex set and a non-convex set; graphs of a convex function, a concave function, and a function that is neither convex nor concave]
Convex Optimization
an optimization problem is said to be convex iff the cost function as well as the constraints are convex; the optimization problem for SVM is convex
the solution to a convex problem, if it exists, is unique; that is, there is no local optimum
for a convex optimization problem, the KKT (Karush-Kuhn-Tucker) conditions are necessary and sufficient for the solution
KKT (Karush-Kuhn-Tucker) condition
1. the gradient of the Lagrangian with respect to the original variables is 0:
   ∂L(w, w_0, λ)/∂w = 0, ∂L(w, w_0, λ)/∂w_0 = 0
2. the original constraints are satisfied
3. the multipliers for the inequality constraints satisfy λ_i ≥ 0, i = 1, 2, ..., N
4. (complementary slackness) the product of each multiplier and its constraint equals 0:
   λ_i [y_i(w^T x_i + w_0) − 1] = 0, i = 1, 2, ..., N
for convex optimization problems, conditions 1-4 are necessary and sufficient for the solution
KKT condition for the optimization of margin
recall
minimize J(w) = (1/2)||w||^2
subject to y_i(w^T x_i + w_0) ≥ 1, i = 1, 2, ..., N
the Lagrangian is
L(w, w_0, λ) = (1/2) w^T w − Σ_{i=1}^{N} λ_i [y_i(w^T x_i + w_0) − 1]   (3.66)
KKT condition
∂L(w, w_0, λ)/∂w = 0   (3.62)
∂L(w, w_0, λ)/∂w_0 = 0   (3.63)
λ_i ≥ 0, i = 1, 2, ..., N   (3.64)
λ_i [y_i(w^T x_i + w_0) − 1] = 0, i = 1, 2, ..., N   (3.65)
KKT condition for the optimization of margin
combining (3.66) with (3.62) and (3.63):
w = Σ_{i=1}^{N} λ_i y_i x_i   (3.67)
Σ_{i=1}^{N} λ_i y_i = 0   (3.68)
Remarks - support vector
w of the optimal solution is a linear combination of N_s ≤ N feature vectors x_i, namely those associated with λ_i ≠ 0:
w = Σ_{i=1}^{N_s} λ_i y_i x_i
these x_i are called support vectors; by the complementary condition
λ_i [y_i(w^T x_i + w_0) − 1] = 0, i = 1, 2, ..., N
support vectors are associated with y_i(w^T x_i + w_0) = 1
Remarks - support vector
support vector: λ_i ≠ 0, w^T x + w_0 = ±1 (the vector lies on one of the two margin hyperplanes)
non-support vector: λ_i = 0, |w^T x + w_0| > 1
the resulting hyperplane classifier is insensitive to the number and position of non-support vectors
Remark - computation of w_0
w_0 can be implicitly obtained by any one of the conditions satisfying strict complementarity (i.e. λ_i ≠ 0):
λ_i [y_i(w^T x_i + w_0) − 1] = 0
in practice, w_0 is computed as an average of the values obtained from all conditions of this type
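Since y_i ∈ {−1, +1}, the condition y_i(w^T x_i + w_0) = 1 gives w_0 = y_i − w^T x_i for every support vector, and the averaging described above can be sketched as follows (the function name and the toy data are illustrative):

```python
def intercept_from_support_vectors(w, svs):
    """Average w0 over support vectors (x_i, y_i), using w0 = y_i - w^T x_i,
    which follows from y_i(w^T x_i + w0) = 1 with y_i in {-1, +1}."""
    vals = []
    for x, y in svs:
        wx = sum(wi * xi for wi, xi in zip(w, x))
        vals.append(y - wx)
    return sum(vals) / len(vals)

# hypothetical hyperplane x1 = 0 with support vectors [1, 1] (y = +1) and [-1, 1] (y = -1)
w = [1.0, 0.0]
print(intercept_from_support_vectors(w, [([1.0, 1.0], 1), ([-1.0, 1.0], -1)]))  # 0.0
```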
Remark - the optimal hyperplane is unique
the optimal hyperplane classifier of a support vector machine is unique under two conditions:
the cost function is convex
the inequality constraints consist of linear functions (the constraints are convex)
an optimization problem is said to be convex iff the target (or cost) function as well as the constraints are convex (the optimization problem for SVM is convex)
the solution to a convex problem, if it exists, is unique; that is, there is no local optimum
Computation of the optimal Lagrange multipliers
the optimization problem belongs to the convex programming family of problems
it can be solved by considering the so-called Lagrangian duality, and can be stated equivalently by its Wolfe dual representation form
Lagrangian duality:
min_θ J(θ) = min_θ max_{λ≥0} L(θ, λ) = max_{λ≥0} min_θ L(θ, λ)
Wolfe dual representation:
max_{λ≥0} L(θ, λ) subject to ∇_θ L(θ, λ) = 0
Wolfe dual representation form
maximize L(w, w_0, λ) = (1/2) w^T w − Σ_{i=1}^{N} λ_i [y_i(w^T x_i + w_0) − 1]
subject to w = Σ_{i=1}^{N} λ_i y_i x_i
           Σ_{i=1}^{N} λ_i y_i = 0
           λ ≥ 0
Computation of the optimal Lagrange multipliers
substituting the equality constraints into the Lagrangian gives
max_λ ( Σ_{i=1}^{N} λ_i − (1/2) Σ_{i,j} λ_i λ_j y_i y_j x_i^T x_j )   (3.75)
subject to Σ_{i=1}^{N} λ_i y_i = 0, λ ≥ 0   (3.76)
once the optimal Lagrange multipliers have been computed, the optimal hyperplane is obtained
Remarks
the cost function does not depend explicitly on the dimensionality of the input space; this allows for efficient generalizations in the case of nonlinearly separable classes
although the resulting optimal hyperplane is unique, there is no guarantee that the Lagrange multipliers are unique
Simple example
consider the two-class classification task that consists of the following points:
ω1: x_1 = [1, 1]^T, x_2 = [1, −1]^T
ω2: x_3 = [−1, 1]^T, x_4 = [−1, −1]^T
its Lagrangian function is
L(w, w_0, λ) = (1/2)(w_1^2 + w_2^2) − Σ_{i=1}^{4} λ_i [y_i(w^T x_i + w_0) − 1]
             = (1/2)(w_1^2 + w_2^2) − λ_1(w_1 + w_2 + w_0 − 1) − λ_2(w_1 − w_2 + w_0 − 1) − λ_3(w_1 − w_2 − w_0 − 1) − λ_4(w_1 + w_2 − w_0 − 1)
the KKT conditions give
∂L/∂w_1 = 0 ⇒ w_1 = λ_1 + λ_2 + λ_3 + λ_4
∂L/∂w_2 = 0 ⇒ w_2 = λ_1 − λ_2 − λ_3 + λ_4
∂L/∂w_0 = 0 ⇒ λ_1 + λ_2 − λ_3 − λ_4 = 0
Simple example - Lagrangian duality
substituting the KKT conditions into the Lagrangian and optimizing with the equality constraint:
max_λ [ λ_1 + λ_2 + λ_3 + λ_4 − (λ_1 + λ_4)^2 − (λ_2 + λ_3)^2 ]
subject to λ_1 + λ_2 − λ_3 − λ_4 = 0, λ ≥ 0
result: more than one solution for λ, while
w = [1, 0]^T, w_0 = 0
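The non-uniqueness of λ alongside the uniqueness of the hyperplane can be verified numerically with (3.67) and (3.75); a small sketch for this toy data set (the function names are illustrative):

```python
def recover_w(lams, X, y):
    """w = sum_i lambda_i * y_i * x_i  (eq. 3.67)."""
    d = len(X[0])
    w = [0.0] * d
    for lam, xi, yi in zip(lams, X, y):
        for j in range(d):
            w[j] += lam * yi * xi[j]
    return w

def dual_objective(lams, X, y):
    """sum_i lambda_i - (1/2) sum_ij lambda_i lambda_j y_i y_j x_i^T x_j  (eq. 3.75)."""
    quad = 0.0
    for i in range(len(X)):
        for j in range(len(X)):
            dot = sum(a * b for a, b in zip(X[i], X[j]))
            quad += lams[i] * lams[j] * y[i] * y[j] * dot
    return sum(lams) - 0.5 * quad

X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [1, 1, -1, -1]
# two different multiplier vectors, both satisfying the KKT conditions
lam_a = [0.25, 0.25, 0.25, 0.25]
lam_b = [0.5, 0.0, 0.5, 0.0]
print(recover_w(lam_a, X, y))       # [1.0, 0.0]
print(recover_w(lam_b, X, y))       # [1.0, 0.0] -- same hyperplane
print(dual_objective(lam_a, X, y))  # 0.5
```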
SVM for Non-separable Classes
in the non-separable case, the training feature vectors belong to one of the following three categories:
vectors outside the band w^T x + w_0 = ±1 that are correctly classified: y_i(w^T x + w_0) > 1
vectors inside the band that are correctly classified: 0 ≤ y_i(w^T x + w_0) < 1
vectors that are misclassified: y_i(w^T x + w_0) < 0
SVM for Non-separable Classes
all three cases can be treated under a single type of constraint:
y_i [w^T x_i + w_0] ≥ 1 − ξ_i
first category: ξ_i = 0
second category: 0 < ξ_i ≤ 1
third category: ξ_i > 1
SVM for Non-separable Classes
the goal is to make the margin as large as possible while keeping the number of points with ξ_i > 0 as small as possible:
minimize J(w, w_0, ξ) = (1/2)||w||^2 + C Σ_{i=1}^{N} I(ξ_i)   (3.79)
where I(ξ_i) = 1 if ξ_i > 0, and 0 if ξ_i = 0
(3.79) is intractable because the function I(·) is discontinuous
SVM for Non-separable Classes
as is common in such cases, we choose to optimize a closely related cost function:
minimize J(w, w_0, ξ) = (1/2)||w||^2 + C Σ_{i=1}^{N} ξ_i
subject to y_i [w^T x_i + w_0] ≥ 1 − ξ_i, i = 1, 2, ..., N
           ξ_i ≥ 0, i = 1, 2, ..., N
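The difference between the intractable cost (3.79) and this relaxed surrogate can be illustrated with a small sketch (the function names and sample numbers are illustrative):

```python
def cost_hard(w, slacks, C):
    """(1/2)||w||^2 + C * (number of slacks > 0): eq. (3.79), discontinuous in xi."""
    return 0.5 * sum(wi * wi for wi in w) + C * sum(1 for s in slacks if s > 0)

def cost_relaxed(w, slacks, C):
    """(1/2)||w||^2 + C * sum of slacks: the tractable surrogate."""
    return 0.5 * sum(wi * wi for wi in w) + C * sum(slacks)

w, C = [1.0, 0.0], 10.0
slacks = [0.0, 0.001, 2.0]
print(cost_hard(w, slacks, C))     # 0.5 + 10*2 = 20.5 (a tiny slack counts fully)
print(cost_relaxed(w, slacks, C))  # 0.5 + 10*2.001, approx. 20.51
```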
SVM for Non-separable Classes
the corresponding Lagrangian is
L(w, w_0, ξ, λ, μ) = (1/2)||w||^2 + C Σ_{i=1}^{N} ξ_i − Σ_{i=1}^{N} μ_i ξ_i − Σ_{i=1}^{N} λ_i [y_i(w^T x_i + w_0) − 1 + ξ_i]
SVM for Non-separable Classes
the corresponding KKT conditions are
∂L/∂w = 0 ⇒ w = Σ_{i=1}^{N} λ_i y_i x_i   (3.85)
∂L/∂w_0 = 0 ⇒ Σ_{i=1}^{N} λ_i y_i = 0   (3.86)
∂L/∂ξ_i = 0 ⇒ C − μ_i − λ_i = 0, i = 1, 2, ..., N   (3.87)
λ_i [y_i(w^T x_i + w_0) − 1 + ξ_i] = 0, i = 1, 2, ..., N   (3.88)
μ_i ξ_i = 0, i = 1, 2, ..., N   (3.89)
μ_i ≥ 0, λ_i ≥ 0, i = 1, 2, ..., N   (3.90)
SVM for Non-separable Classes
the associated Wolfe dual representation now becomes
maximize L(w, w_0, λ, ξ, μ)
subject to w = Σ_{i=1}^{N} λ_i y_i x_i
           Σ_{i=1}^{N} λ_i y_i = 0
           C − μ_i − λ_i = 0, i = 1, 2, ..., N
           λ_i ≥ 0, μ_i ≥ 0, i = 1, 2, ..., N
SVM for Non-separable Classes
equivalent to
max_λ ( Σ_{i=1}^{N} λ_i − (1/2) Σ_{i,j} λ_i λ_j y_i y_j x_i^T x_j )
subject to 0 ≤ λ_i ≤ C, i = 1, 2, ..., N
           Σ_{i=1}^{N} λ_i y_i = 0
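The only change from the separable dual is the box constraint 0 ≤ λ_i ≤ C; in an iterative solver this shows up as a projection step, sketched here (the function name is illustrative):

```python
def project_box(lams, C):
    """Project multipliers onto the box 0 <= lambda_i <= C that replaces
    the one-sided constraint lambda_i >= 0 of the separable case."""
    return [min(max(lam, 0.0), C) for lam in lams]

print(project_box([-0.3, 0.4, 2.7], 1.0))  # [0.0, 0.4, 1.0]
```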
Remarks - differences from the linearly separable case
the Lagrange multipliers λ_i need to be bounded by C
the slack variables ξ_i and their associated Lagrange multipliers μ_i do not enter into the problem explicitly; they are reflected indirectly through C
Remarks - M-class problem
SVM for the M-class problem: design M separating hyperplanes so that g_i(x) separates class ω_i from all the others, i.e. g_i(x) > 0 for x ∈ ω_i and g_i(x) < 0 otherwise
assign x to ω_i if i = arg max_k g_k(x)
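The one-vs-all decision rule i = arg max_k g_k(x) can be sketched as follows (the function name and the three hyperplanes are illustrative):

```python
def classify_one_vs_all(x, hyperplanes):
    """hyperplanes: list of (w, w0), one per class; return the index of
    the class with the largest discriminant g_k(x) = w_k^T x + w0_k."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) + w0 for (w, w0) in hyperplanes]
    return max(range(len(scores)), key=scores.__getitem__)

# hypothetical 2-D task with three classes
hps = [([1.0, 0.0], 0.0), ([-1.0, 0.0], 0.0), ([0.0, 1.0], -2.0)]
print(classify_one_vs_all([2.0, 0.5], hps))  # 0 (scores: 2.0, -2.0, -1.5)
```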