
Support vector machines (SVMs) Lecture 2

David Sontag

New York University

Slides adapted from Luke Zettlemoyer, Vibhav Gogate, and Carlos Guestrin

Geometry of linear separators (see blackboard)    [Barber, Sections 29.1.1–4]

A plane can be specified as the set of all points

    x = x0 + α u1 + β u2

where x0 is a vector from the origin to a point in the plane and u1, u2 are two non-parallel directions in the plane.

Alternatively, it can be specified by its normal vector (we will call this w): we only need to specify the dot product of w with a point in the plane, a scalar (we will call the corresponding offset b, so the plane is the set of points with w · x + b = 0).

Linear Separators

•  If the training data is linearly separable, the perceptron is guaranteed to find some linear separator

•  Which of these is optimal?

•  SVMs (Vapnik, 1990's) choose the linear separator with the largest margin

•  Good according to intuition, theory, practice

•  SVMs became famous when, using images as input, they gave accuracy comparable to neural networks with hand-designed features on a handwriting recognition task

Support Vector Machine (SVM)

V. Vapnik

Support vector machines: 3 key ideas

1.  Use optimization to find a solution (i.e. a hyperplane) with few errors (robust to outliers!)

2.  Seek a large-margin separator to improve generalization

3.  Use the kernel trick to make large feature spaces computationally efficient

[Figure: separating hyperplane w·x + b = 0 with the lines w·x + b = +1 and w·x + b = −1 on either side]

Finding a perfect classifier (when one exists) using linear programming

For every data point (xt, yt), enforce the constraint

    w · xt + b ≥ 1   for yt = +1,    and    w · xt + b ≤ −1   for yt = −1.

Equivalently, we want to satisfy all of the linear constraints

    yt (w · xt + b) ≥ 1.

This linear program can be efficiently solved using algorithms such as simplex, interior point, or ellipsoid.
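A minimal sketch (not from the lecture) of this feasibility LP using scipy.optimize.linprog; the four toy data points are made up for illustration:

```python
# Checking linear separability as an LP feasibility problem.
import numpy as np
from scipy.optimize import linprog

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 0.5]])  # points x_t
y = np.array([1, 1, -1, -1])                                        # labels y_t

n, d = X.shape
# Constraints y_t (w . x_t + b) >= 1 become -y_t * [x_t, 1] . [w, b] <= -1,
# which is the A_ub @ z <= b_ub form that linprog expects.
A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
b_ub = -np.ones(n)
c = np.zeros(d + 1)  # pure feasibility: a zero objective is enough

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (d + 1))
print("separable:", res.success, " (w, b) =", res.x)
```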

Finding a perfect classifier (when one exists) using linear programming

Example of a 2-dimensional linear programming (feasibility) problem:

[Figure: weight space, with the feasible region carved out by the linear constraints]

For SVMs, each data point gives one inequality:

    yt (w · xt + b) ≥ 1

What happens if the data set is not linearly separable?

Minimizing the number of errors (0-1 loss)

•  Try to find weights that violate as few constraints as possible?

•  Formalize this using the 0-1 loss, which counts #(mistakes):

    min_{w,b}  Σ_j ℓ_{0,1}(y_j, w · x_j + b),   where   ℓ_{0,1}(y, ŷ) = 1[y ≠ sign(ŷ)]

•  Unfortunately, minimizing the 0-1 loss is NP-hard in the worst case –  a non-starter. We need another approach.

Key idea #1: Allow for slack

[Figure: separator w·x + b = 0 with margin lines w·x + b = ±1; points that violate the margin have slack variables ξ1, ξ2, ξ3, ξ4]

For each data point:

• If functional margin ≥ 1, don't care

• If functional margin < 1, pay a linear penalty

    min_{w,b,ξ}  Σ_j ξ_j    subject to   y_j (w · x_j + b) ≥ 1 − ξ_j,    ξ_j ≥ 0     ("slack variables")

We now have a linear program again, and can efficiently find its optimum.
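A minimal sketch (again not from the lecture) of how the slack variables enter that LP, minimizing Σ_j ξ_j with scipy; the toy data are made up:

```python
# Slack LP: min sum_j xi_j  s.t.  y_j (w . x_j + b) >= 1 - xi_j,  xi_j >= 0.
import numpy as np
from scipy.optimize import linprog

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -1.0], [0.5, 0.5]])  # last point sits among the +1 class
y = np.array([1, 1, -1, -1])

n, d = X.shape
# Variable vector z = (w, b, xi_1..xi_n); the objective is the sum of the slacks.
c = np.concatenate([np.zeros(d + 1), np.ones(n)])
# y_j (w . x_j + b) + xi_j >= 1  ->  -y_j [x_j, 1] . (w, b) - xi_j <= -1
A_ub = np.hstack([-y[:, None] * np.hstack([X, np.ones((n, 1))]), -np.eye(n)])
b_ub = -np.ones(n)
bounds = [(None, None)] * (d + 1) + [(0, None)] * n  # w, b free; xi_j >= 0

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
w, b, xi = res.x[:d], res.x[d], res.x[d + 1:]
print("total slack:", xi.sum(), " w =", w, " b =", b)
```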

Key idea #1: Allow for slack

[Figure: the same picture, with slack variables ξ1, ξ2, ξ3, ξ4 marking the margin violations]

    min_{w,b,ξ}  Σ_j ξ_j    subject to   y_j (w · x_j + b) ≥ 1 − ξ_j,    ξ_j ≥ 0     ("slack variables")

What is the optimal value ξj* as a function of w* and b*?

If y_j (w* · x_j + b*) ≥ 1, then ξ_j* = 0.

If y_j (w* · x_j + b*) < 1, then ξ_j* = 1 − y_j (w* · x_j + b*).

Sometimes written as   ξ_j* = max( 0, 1 − y_j (w* · x_j + b*) ) = [ 1 − y_j (w* · x_j + b*) ]₊

Equivalent hinge loss formulation

Start from the slack LP:   min_{w,b,ξ}  Σ_j ξ_j   subject to   y_j (w · x_j + b) ≥ 1 − ξ_j,   ξ_j ≥ 0.

Substituting the optimal slacks into the objective, we get:

    min_{w,b}  Σ_j max( 0, 1 − (w · x_j + b) y_j )

This is empirical risk minimization, using the hinge loss:

    min_{w,b}  Σ_j ℓ_hinge(y_j, w · x_j + b)

The hinge loss is defined as  ℓ_hinge(y, ŷ) = max(0, 1 − y ŷ).

Hinge loss vs. 0-1 loss

[Figure: the hinge loss and the 0-1 loss plotted as functions of the margin y·ŷ]

Hinge loss:  ℓ_hinge(y, ŷ) = max(0, 1 − y ŷ)

0-1 loss:    ℓ_{0,1}(y, ŷ) = 1[y ≠ sign(ŷ)]

The hinge loss upper bounds the 0-1 loss! It is the tightest convex upper bound on the 0-1 loss.
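A quick numerical check of the upper-bound claim (a sketch, with made-up margin values):

```python
# Verify pointwise that hinge(y*y_hat) >= 0-1 loss(y*y_hat).
import numpy as np

margins = np.linspace(-3, 3, 13)          # values of y * y_hat
hinge = np.maximum(0.0, 1.0 - margins)    # max(0, 1 - y*y_hat)
zero_one = (margins <= 0).astype(float)   # 1 iff the sign is wrong (treating 0 as a mistake)

assert np.all(hinge >= zero_one)          # hinge is an upper bound pointwise
for m, h, z in zip(margins, hinge, zero_one):
    print(f"margin {m:+.1f}: 0-1 loss {z:.0f}, hinge loss {h:.1f}")
```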

Key idea #2: seek large margin

•  Suppose again that the data is linearly separable and we are solving a feasibility problem, with constraints y_t (w · x_t + b) ≥ 1

•  If the length of the weight vector ||w|| is too small, the optimization problem is infeasible! Why?

Key idea #2: seek large margin

[Figure: the same data drawn for successively smaller ||w|| (and |b|), showing the lines w·x + b = +1, w·x + b = 0, and w·x + b = −1. As ||w|| (and |b|) get smaller, the +1 and −1 lines move farther away from the separator w·x + b = 0.]

[Figure: a point x1 on the line w·x + b = +1, its projection x2 onto w·x + b = 0, and the geometric margin γ between them]

What is γ (the geometric margin) as a function of w?

Let γ_i = distance of the i'th data point to the separator, and γ = min_i γ_i.

We also know that x1 lies on the line w·x + b = +1 and x2 lies on w·x + b = 0. So (as worked out below):

    γ = 1 / ||w||

(assuming there is a data point on the w·x + b = +1 or −1 line)

Final result: we can maximize γ by minimizing ||w||²!
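The algebra behind γ = 1/||w||, written out (a sketch using the x1 and x2 from the figure):

```latex
% x_2 is the projection of x_1 onto the separator, a distance \gamma away along w:
x_1 = x_2 + \gamma \frac{w}{\|w\|}
\quad\Rightarrow\quad
w \cdot x_1 + b = \underbrace{w \cdot x_2 + b}_{=\,0} + \gamma \frac{w \cdot w}{\|w\|} = \gamma \|w\|.
% Since x_1 lies on the line w \cdot x + b = +1, the left-hand side equals 1:
1 = \gamma \|w\| \quad\Rightarrow\quad \gamma = \frac{1}{\|w\|}.
```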

(Hard margin) support vector machines

    min_{w,b}  ||w||₂²    subject to   y_j (w · x_j + b) ≥ 1   for all j

•  Example of a convex optimization problem

–  A quadratic program

–  Polynomial-time algorithms to solve!

•  Hyperplane defined by support vectors

–  Could use them as a lower-dimensional basis to write down the line, although we haven't seen how yet

•  More on these later

[Figure: separator w·x + b = 0, canonical lines w·x + b = ±1, margin 2γ]

Support vectors:
•  data points on the canonical lines

Non-support vectors:
•  everything else
•  moving them will not change w
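A minimal sketch (not from the lecture) of this quadratic program, solved here with cvxpy as one possible QP solver; the toy data and the tolerance used to detect support vectors are made up:

```python
# Hard-margin SVM as a QP: minimize ||w||^2 s.t. y_j (w . x_j + b) >= 1.
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [2.0, 0.0], [-1.0, -1.0], [0.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
constraints = [cp.multiply(y, X @ w + b) >= 1]     # y_j (w . x_j + b) >= 1
objective = cp.Minimize(cp.sum_squares(w))         # minimize ||w||^2
cp.Problem(objective, constraints).solve()

print("w =", w.value, " b =", b.value)
# Support vectors are the points whose constraint is (numerically) tight:
margins = y * (X @ w.value + b.value)
print("support vectors:\n", X[np.isclose(margins, 1.0, atol=1e-4)])
```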

Allowing for slack: "Soft margin SVM"

[Figure: separator w·x + b = 0 with canonical lines w·x + b = ±1; points that violate the margin have slack variables ξ1, ξ2, ξ3, ξ4]

For each data point:

• If margin ≥ 1, don't care

• If margin < 1, pay a linear penalty

    min_{w,b,ξ}  ||w||₂² + C Σ_j ξ_j    subject to   y_j (w · x_j + b) ≥ 1 − ξ_j,    ξ_j ≥ 0     ("slack variables")

Slack penalty C > 0:
•  C = ∞: have to separate the data!
•  C = 0: ignores the data entirely!
•  Select C using cross-validation (sketch below)
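A small sketch of selecting C by cross-validation; it assumes scikit-learn (LinearSVC and GridSearchCV) as the tooling, which the lecture does not prescribe, and the data are synthetic:

```python
# Pick the slack penalty C by 5-fold cross-validation on a linear soft-margin SVM.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2, 1, size=(50, 2)), rng.normal(-2, 1, size=(50, 2))])
y = np.array([1] * 50 + [-1] * 50)

# LinearSVC fits a linear soft-margin SVM: a squared-norm penalty plus C times the hinge losses.
search = GridSearchCV(LinearSVC(loss="hinge", max_iter=20000),
                      param_grid={"C": [0.01, 0.1, 1.0, 10.0, 100.0]},
                      cv=5)
search.fit(X, y)
print("best C:", search.best_params_["C"], " cv accuracy:", search.best_score_)
```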

Equivalent formulation using hinge loss

Start from the soft-margin SVM:   min_{w,b,ξ}  ||w||₂² + C Σ_j ξ_j   subject to   y_j (w · x_j + b) ≥ 1 − ξ_j,   ξ_j ≥ 0.

Substituting the optimal slacks into the objective, we get:

    min_{w,b}  ||w||₂² + C Σ_j ℓ_hinge(y_j, w · x_j + b)

where the hinge loss is defined as  ℓ_hinge(y, ŷ) = max(0, 1 − y ŷ).

The second term is empirical risk minimization, using the hinge loss. The ||w||₂² term is called regularization; it is used to prevent overfitting!
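A minimal subgradient-descent sketch of this objective; this is not the solver the lecture has in mind, just one simple way to minimize the regularized hinge loss (toy data and step size are made up):

```python
# Subgradient descent on  ||w||^2 + C * sum_j max(0, 1 - y_j (w . x_j + b)).
import numpy as np

def soft_margin_sgd(X, y, C=1.0, lr=1e-3, epochs=200):
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                      # points with nonzero hinge loss
        # Subgradient of the objective with respect to (w, b):
        grad_w = 2 * w - C * (y[active][:, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+1.5, 1, size=(40, 2)), rng.normal(-1.5, 1, size=(40, 2))])
y = np.array([1.0] * 40 + [-1.0] * 40)
w, b = soft_margin_sgd(X, y, C=1.0)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```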

What if the data is not linearly separable?

Use features of features of features of features....

    φ(x) = ( x(1), ..., x(n),  x(1)x(2),  x(1)x(3), ...,  e^{x(1)}, ... )ᵀ

Feature space can get really large really quickly!

[Tommi Jaakkola]
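A small sketch illustrating the blow-up: an explicit degree-2 feature map whose length grows roughly as n²/2 (the helper name phi_quadratic is made up):

```python
# Explicit quadratic feature map: original coordinates plus all pairwise products.
import numpy as np
from itertools import combinations_with_replacement

def phi_quadratic(x):
    """All original coordinates plus all products x_i * x_j with i <= j."""
    pairs = [x[i] * x[j] for i, j in combinations_with_replacement(range(len(x)), 2)]
    return np.concatenate([x, pairs])

for n in [10, 100, 1000]:
    x = np.ones(n)
    print(f"n = {n:5d}  ->  len(phi(x)) = {len(phi_quadratic(x))}")   # about n^2 / 2
```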

Example

What’s Next!

•  Learn one of the most interesting and exciting recent advancements in machine learning – Key idea #3: the “kernel trick”

– High dimensional feature spaces at no extra cost

•  But first, a detour – Constrained optimization!

Constrained optimization

Example:   min_x x²

    no constraint:  x* = 0        subject to x ≥ −1:  x* = 0        subject to x ≥ 1:  x* = 1

How do we solve with constraints? Lagrange multipliers!!!

Lagrange multipliers – Dual variables

Start with the constrained problem:   min_x x²   subject to   x ≥ b.

Rewrite the constraint as x − b ≥ 0, add a Lagrange multiplier α, and add the new constraint α ≥ 0.

Introduce the Lagrangian (objective):   L(x, α) = x² − α (x − b)

We will solve:   min_x max_{α≥0} L(x, α)

Why is this equivalent? min is fighting max!

•  x < b:  (x − b) < 0, so max_{α≥0} −α(x − b) = ∞; min won't let this happen!

•  x > b:  (x − b) > 0, so max_{α≥0} −α(x − b) = 0 with α* = 0; min is cool with 0, and L(x, α) = x² (the original objective).

•  x = b:  α can be anything, and L(x, α) = x² (the original objective).

The min on the outside forces max to behave, so constraints will be satisfied.
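A worked version of this example (not on the slide), solving min_x max_{α≥0} L(x, α) explicitly:

```latex
% Minimize over x first: set the derivative of L(x,\alpha) = x^2 - \alpha (x - b) to zero.
\frac{\partial L}{\partial x} = 2x - \alpha = 0 \quad\Rightarrow\quad x = \frac{\alpha}{2},
\qquad
g(\alpha) = L\!\left(\tfrac{\alpha}{2}, \alpha\right) = -\frac{\alpha^2}{4} + \alpha b.
% Maximizing g over \alpha \ge 0:
\alpha^* = \max(0, 2b), \qquad x^* = \frac{\alpha^*}{2} = \max(0, b).
% This matches the previous slide: for x \ge -1 (b = -1), x^* = 0; for x \ge 1 (b = 1), x^* = 1.
```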

Dual SVM derivation (1) – the linearly separable case (hard margin SVM)

Original optimization problem:   min_{w,b}  ½ ||w||²   subject to   y_j (w · x_j + b) ≥ 1  for all j   (the factor ½ only rescales the objective)

Rewrite the constraints as y_j (w · x_j + b) − 1 ≥ 0, with one Lagrange multiplier α_j ≥ 0 per example.

Lagrangian:   L(w, b, α) = ½ ||w||² − Σ_j α_j [ y_j (w · x_j + b) − 1 ]

Our goal now is to solve:   min_{w,b} max_{α≥0} L(w, b, α)

Dual SVM derivation (2) – the linearly separable case (hard margin SVM)

(Primal)   min_{w,b} max_{α≥0} L(w, b, α)

Swap min and max:

(Dual)   max_{α≥0} min_{w,b} L(w, b, α)

Slater's condition from convex optimization guarantees that these two optimization problems are equivalent!

Dual SVM derivation (3) – the linearly separable case (hard margin SVM)

(Dual)   max_{α≥0} min_{w,b} L(w, b, α)

Can solve for the optimal w, b as a function of α:

    ∂L/∂w = w − Σ_j α_j y_j x_j = 0    ⟹    w = Σ_j α_j y_j x_j

    ∂L/∂b = −Σ_j α_j y_j = 0           ⟹    Σ_j α_j y_j = 0

Substituting these values back in (and simplifying), we obtain:

(Dual)   max_α  Σ_j α_j − ½ Σ_j Σ_k α_j α_k y_j y_k (x_j · x_k)    subject to   α_j ≥ 0,   Σ_j α_j y_j = 0

(The sums run over all training examples; the dot products x_j · x_k are scalars.)

Dual formulation only depends on dot products of the features!
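The "(and simplifying)" step, written out as a sketch of the algebra:

```latex
L(w,b,\alpha) = \tfrac{1}{2}\|w\|^2 - \sum_j \alpha_j \big[\, y_j (w \cdot x_j + b) - 1 \,\big].
% Plug in w = \sum_j \alpha_j y_j x_j and use \sum_j \alpha_j y_j = 0 (so the b term vanishes):
\tfrac{1}{2}\|w\|^2 = \tfrac{1}{2}\sum_{j,k} \alpha_j \alpha_k y_j y_k \,(x_j \cdot x_k),
\qquad
\sum_j \alpha_j y_j (w \cdot x_j) = \sum_{j,k} \alpha_j \alpha_k y_j y_k \,(x_j \cdot x_k),
% hence
L = \sum_j \alpha_j - \tfrac{1}{2}\sum_{j,k} \alpha_j \alpha_k y_j y_k \,(x_j \cdot x_k).
```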

SVM with kernels

First, we introduce a feature mapping:   x ↦ φ(x)

Next, replace the dot product with an equivalent kernel function:   K(x, z) = φ(x) · φ(z)

•  Never compute the features explicitly!!! –  Compute the dot products in closed form

•  O(n²) time in the size of the dataset to compute the objective –  much work on speeding this up

Predict with:   ŷ = sign( Σ_j α_j y_j K(x_j, x) + b )
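A minimal sketch (not from the lecture) of kernelized prediction with a Gaussian kernel; the values of α, b, and the data are made up just to show the call shape:

```python
# Kernelized prediction: y_hat = sign( sum_j alpha_j y_j K(x_j, x) + b ).
import numpy as np

def rbf_kernel(u, v, sigma=1.0):
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def kernel_predict(x, X_train, y_train, alpha, b, kernel=rbf_kernel):
    # Only points with alpha_j > 0 (the support vectors) contribute.
    s = sum(a * yj * kernel(xj, x) for a, yj, xj in zip(alpha, y_train, X_train) if a > 0)
    return np.sign(s + b)

X_train = np.array([[1.0, 1.0], [-1.0, -1.0]])
y_train = np.array([1.0, -1.0])
alpha = np.array([0.5, 0.5])   # hypothetical dual variables
b = 0.0                        # hypothetical offset
print(kernel_predict(np.array([0.8, 0.9]), X_train, y_train, alpha, b))
```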

Common kernels •  Polynomials of degree exactly d

•  Polynomials of degree up to d

•  Gaussian kernels

•  Sigmoid

•  And many others: very active area of research!

[Tommi Jaakkola]
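Standard forms of these kernels (a sketch; the exact parameterizations vary across references and the lecture does not fix them), as plain numpy functions:

```python
import numpy as np

def poly_exact(u, v, d):            # polynomial of degree exactly d
    return (u @ v) ** d

def poly_up_to(u, v, d):            # polynomial of degree up to d
    return (u @ v + 1) ** d

def gaussian(u, v, sigma=1.0):      # Gaussian / RBF kernel
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def sigmoid(u, v, eta=1.0, nu=0.0): # sigmoid kernel (not positive semidefinite for all eta, nu)
    return np.tanh(eta * (u @ v) + nu)
```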

Quadratic kernel

[Cynthia Rudin]

Feature mapping given by the degree-2 monomials of x (a standard version is worked out below).
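A standard worked example (an assumption about which variant the slide shows): the quadratic kernel K(x, z) = (x · z)² on R² and its explicit feature map:

```latex
K(x,z) = (x_1 z_1 + x_2 z_2)^2
       = x_1^2 z_1^2 + 2\, x_1 x_2\, z_1 z_2 + x_2^2 z_2^2
       = \phi(x) \cdot \phi(z),
\qquad
\phi(x) = \big(\, x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2 \,\big).
% Computing K(x,z) directly costs O(n) time, with no need to build \phi explicitly.
```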
