ETC3250: Support Vector Machines
Semester 1, 2020
Professor Di Cook
Econometrics and Business Statistics Monash University
Week 7 (b)
Separating hyperplanes
In a $p$-dimensional space, a hyperplane is a flat affine subspace of dimension $p - 1$.

(ISLR: Fig 9.1)
Separating hyperplanes
The equation of a $p$-dimensional hyperplane is given by

$$\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p = 0$$

If $x_i \in \mathbb{R}^p$ and $y_i \in \{-1, 1\}$ for $i = 1, \dots, n$, then

$$\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} > 0 \text{ if } y_i = 1,$$
$$\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} < 0 \text{ if } y_i = -1$$

Equivalently,

$$y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) > 0$$
Separating hyperplanes
A new observation $x^*$ is assigned a class depending on which side of the hyperplane it is located. Classify the test observation based on the sign of

$$s(x^*) = \beta_0 + \beta_1 x^*_1 + \cdots + \beta_p x^*_p$$

If $s(x^*) > 0$, class $1$, and if $s(x^*) < 0$, class $-1$, i.e.

$$h(x^*) = \operatorname{sign}(s(x^*)).$$
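The sign-based decision rule can be sketched in a few lines of code. The course materials use R, but here is a minimal Python illustration; the hyperplane coefficients are made up for the example:

```python
import numpy as np

def classify(x_new, beta0, beta):
    """Classify by the sign of s(x*) = beta0 + beta . x*."""
    s = beta0 + np.dot(beta, x_new)
    return 1 if s > 0 else -1

# Hypothetical hyperplane in R^2: 1 + 2*x1 - 3*x2 = 0
beta0, beta = 1.0, np.array([2.0, -3.0])
print(classify(np.array([2.0, 1.0]), beta0, beta))  # s = 1 + 4 - 3 = 2 > 0  -> class 1
print(classify(np.array([0.0, 2.0]), beta0, beta))  # s = 1 + 0 - 6 = -5 < 0 -> class -1
```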
Separating hyperplanes
What about the magnitude of $s(x^*)$?

- $s(x^*)$ far from zero $\rightarrow$ $x^*$ lies far from the hyperplane, more confident about our classification
- $s(x^*)$ close to zero $\rightarrow$ $x^*$ lies near the hyperplane, less confident about our classification
Separating hyperplane classifiers

We will explore three different types of hyperplane classifiers, with each method generalising the one before it:

- Maximal margin classifier, for when the data is perfectly separated by a hyperplane,
- Support vector classifier (soft margin classifier), for when the data is NOT perfectly separated by a hyperplane but still has a linear decision boundary, and
- Support vector machines, used when the data has non-linear decision boundaries.

In practice, "SVM" is often used to refer to all three methods; however, we will distinguish between the three notions in this lecture.
Maximal margin classifier

All are separating hyperplanes. Which is best?
Maximal margin classifier

Source: Machine Learning Memes for Convolutional Teens
From LDA to SVM
- Linear discriminant analysis uses the difference between means to set the separating hyperplane.
- Support vector machines use gaps between points on the outer edge of clusters to set the separating hyperplane.
SVM vs Logistic Regression
- If our data can be perfectly separated using a hyperplane, then there will in fact exist an infinite number of such hyperplanes.
- We compute the (perpendicular) distance from each training observation to a given separating hyperplane. The smallest such distance is known as the margin.
- The optimal separating hyperplane (or maximal margin hyperplane) is the separating hyperplane for which the margin is largest.
- We can then classify a test observation based on which side of the maximal margin hyperplane it lies. This is known as the maximal margin classifier.
Support vectors
Support vectors
The support vectors are equidistant from the maximal margin hyperplane and lie along the dashed lines indicating the width of the margin. They "support" the maximal margin hyperplane, in the sense that if these points were moved slightly then the maximal margin hyperplane would move as well.

The maximal margin hyperplane depends directly on the support vectors, but not on the other observations.
Support vectors
Example: Support vectors (and slack vectors)
Maximal margin classifier

If $x_i \in \mathbb{R}^p$ and $y_i \in \{-1, 1\}$ for $i = 1, \dots, n$, the separating hyperplane is defined as

$$\{x : \beta_0 + x^T \beta = 0\}$$

where $\beta = \sum_{i=1}^{s} (\alpha_i y_i) x_i$ and $s$ is the number of support vectors. Then the maximal margin hyperplane is found by

maximising $M$, subject to $\sum_{j=1}^{p} \beta_j^2 = 1$, and

$$y_i(x_i^T \beta + \beta_0) \geq M, \quad i = 1, \dots, n.$$
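The optimisation above can be approximated numerically. The unit's materials use R, but assuming scikit-learn is available, the following Python sketch fits a linear SVM on a tiny, hypothetical, perfectly separable data set; a very large cost penalty makes the soft margin behave like the hard (maximal) margin:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny, linearly separable toy data (hypothetical)
X = np.array([[1.0, 1.0], [2.0, 2.5], [1.5, 2.0],
              [4.0, 4.5], [5.0, 5.0], [4.5, 4.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large penalty approximates the maximal margin classifier
fit = SVC(kernel="linear", C=1e6).fit(X, y)

print(fit.support_vectors_)       # the points that define the hyperplane
print(fit.coef_, fit.intercept_)  # beta and beta_0
print(fit.predict([[1.0, 2.0], [5.0, 4.0]]))
```

Only the support vectors appear in `fit.support_vectors_`; moving any other point (without crossing the margin) leaves the fitted hyperplane unchanged.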
Non-separable case
The maximal margin classifier only works when we have perfect separability in our data.

What do we do if data is not perfectly separable by a hyperplane?

The support vector classifier allows points to either lie on the wrong side of the margin, or on the wrong side of the hyperplane altogether.

Right: ISLR Fig 9.6
Support vector classifier - optimisation

Maximise $M$, subject to $\sum_{j=1}^{p} \beta_j^2 = 1$, and

$$y_i(x_i^T \beta + \beta_0) \geq M(1 - \epsilon_i), \quad i = 1, \dots, n, \quad \epsilon_i \geq 0, \quad \text{AND} \quad \sum_{i=1}^{n} \epsilon_i \leq C.$$

$\epsilon_i$ tells us where the $i$th observation is located, and $C$ is a nonnegative tuning parameter.

- $\epsilon_i = 0$: correct side of the margin,
- $\epsilon_i > 0$: wrong side of the margin (violation of the margin),
- $\epsilon_i > 1$: wrong side of the hyperplane.
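A soft margin fit can be sketched on hypothetical overlapping data. One caution if you try this in scikit-learn: its `C` is a *penalty* on margin violations, roughly the inverse of the budget $C$ in the formulation above, so a small scikit-learn `C` tolerates many violations (a wide, soft margin):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian classes (hypothetical data): not perfectly separable
X = np.vstack([rng.normal(0, 1.2, (40, 2)), rng.normal(2, 1.2, (40, 2))])
y = np.array([-1] * 40 + [1] * 40)

# Small penalty -> wide soft margin, many support vectors;
# large penalty -> narrow margin, fewer support vectors.
for penalty in [0.01, 1, 100]:
    fit = SVC(kernel="linear", C=penalty).fit(X, y)
    print(penalty, len(fit.support_))
```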
Non-separable case
Tuning parameter: decreasing the value of C
Non-linear boundaries
The support vector classifier doesn't work well for non-linear boundaries. What solution do we have?
Enlarging the feature space
Consider the following 2D non-linear classification problem. We can transform this to a linear problem, separated by a maximal margin hyperplane, by introducing an additional third dimension.

Source: Grace Zhang @zxr.nju
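The idea can be checked numerically with a hypothetical radially separated data set: two concentric circles are not linearly separable in 2D, but adding a third feature $z = x_1^2 + x_2^2$ makes a flat plane separate them:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, 50)

# Inner circle (class -1) and outer ring (class 1): not linearly separable in 2D
inner = np.c_[0.5 * np.cos(theta), 0.5 * np.sin(theta)]
outer = np.c_[2.0 * np.cos(theta), 2.0 * np.sin(theta)]

# Add a third dimension z = x1^2 + x2^2; the plane z = 1 now separates the classes
z_inner = (inner ** 2).sum(axis=1)   # all 0.25
z_outer = (outer ** 2).sum(axis=1)   # all 4.0
print(z_inner.max(), z_outer.min())  # 0.25 < 1 < 4.0
```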
The inner product
Consider two $p$-vectors

$$x = (x_1, x_2, \dots, x_p) \in \mathbb{R}^p \quad \text{and} \quad y = (y_1, y_2, \dots, y_p) \in \mathbb{R}^p.$$

The inner product is defined as

$$\langle x, y \rangle = x_1 y_1 + x_2 y_2 + \cdots + x_p y_p = \sum_{j=1}^{p} x_j y_j$$

A linear measure of similarity, which allows geometric constructions such as the maximal margin hyperplane.
Kernel functions
A kernel function is an inner product of vectors mapped to a (higher dimensional) feature space $\mathcal{H} = \mathbb{R}^d$, $d > p$:

$$K(x, y) = \langle \psi(x), \psi(y) \rangle, \quad \psi : \mathbb{R}^p \to \mathcal{H}$$

A non-linear measure of similarity, which allows geometric constructions in high dimensional space.
Examples of kernels
Standard kernels include:

- Linear: $K(x, y) = \langle x, y \rangle$
- Polynomial: $K(x, y) = (\langle x, y \rangle + 1)^d$
- Radial: $K(x, y) = \exp(-\gamma \|x - y\|^2)$
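These three kernels are each a few lines to compute directly. A minimal Python sketch (the vectors are arbitrary examples):

```python
import numpy as np

def linear_kernel(x, y):
    return x @ y

def poly_kernel(x, y, d=2):
    return (x @ y + 1) ** d

def radial_kernel(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((x - y) ** 2))

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])
print(linear_kernel(x, y))  # 1*3 + 2*0.5 = 4.0
print(poly_kernel(x, y))    # (4 + 1)^2 = 25.0
print(radial_kernel(x, x))  # identical vectors -> exp(0) = 1.0
```

Note that the radial kernel of a vector with itself is always 1, its maximum: it is a similarity measure that decays with distance.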
Support Vector Machines
The kernel trick
The linear support vector classifier can be represented as follows:

$$f(x) = \beta_0 + \sum_{i \in S} \alpha_i \langle x, x_i \rangle.$$

We can generalise this by replacing the inner product with the kernel function as follows:

$$f(x) = \beta_0 + \sum_{i \in S} \alpha_i K(x, x_i).$$
Your turn
Let $x$ and $y$ be vectors in $\mathbb{R}^2$. By expanding

$$K(x, y) = (1 + \langle x, y \rangle)^2$$

show that this is equivalent to an inner product in $\mathcal{H} = \mathbb{R}^6$.

Remember: $\langle x, y \rangle = \sum_{j=1}^{p} x_j y_j$.
Solution
$$\begin{aligned}
K(x, y) &= (1 + \langle x, y \rangle)^2 \\
&= \Big(1 + \sum_{j=1}^{2} x_j y_j\Big)^2 \\
&= (1 + x_1 y_1 + x_2 y_2)^2 \\
&= 1 + x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 y_1 + 2 x_2 y_2 + 2 x_1 x_2 y_1 y_2 \\
&= \langle \psi(x), \psi(y) \rangle
\end{aligned}$$

where $\psi(x) = (1, x_1^2, x_2^2, \sqrt{2} x_1, \sqrt{2} x_2, \sqrt{2} x_1 x_2)$.
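The identity in the solution is easy to verify numerically: computing the kernel in $\mathbb{R}^2$ gives exactly the same number as the inner product of the mapped vectors in $\mathbb{R}^6$. A quick Python check with arbitrary example vectors:

```python
import numpy as np

def psi(v):
    # Feature map from the solution: R^2 -> R^6
    x1, x2 = v
    return np.array([1, x1**2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2, np.sqrt(2) * x1 * x2])

x = np.array([0.5, -1.0])
y = np.array([2.0, 3.0])

K = (1 + x @ y) ** 2      # kernel computed cheaply in R^2
inner = psi(x) @ psi(y)   # same quantity computed in R^6
print(K, inner)           # both equal 1.0 for these vectors
```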
The kernel trick - why is it a trick?
We do not need to know what the high dimensional enlarged feature space $\mathcal{H}$ really looks like. We just need to know which kernel function is most appropriate as a measure of similarity.

The Support Vector Machine (SVM) is a maximal margin hyperplane in $\mathcal{H}$, built by using a kernel function in the low dimensional feature space $\mathbb{R}^p$.
Non-linear boundaries
Polynomial and radial kernel SVMs
Non-linear boundaries
Italian olive oils: Regions 2, 3 (North and Sardinia)
SVM in high dimensions
Examining misclassifications, and which points are selected to be support vectors
SVM in high dimensions
Examining boundaries
Comparing decision boundaries
Increasing the value of cost in svm
Made by a human with a computer. Slides at https://iml.numbat.space.
Code and data at https://github.com/numbats/iml.
Created using R Markdown with flair by xaringan, and kunoichi (female ninja) style.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.