Kernels Usman Roshan CS 675 Machine Learning. Feature space representation Consider two classes...

Kernels

Usman Roshan

CS 675 Machine Learning

Feature space representation

• Consider the two classes shown below

• The data cannot be separated by a hyperplane

Feature space representation

• Suppose we square each coordinate

• In other words, $(x_1, x_2) \mapsto (x_1^2, x_2^2)$

• Now the data are well separated
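The effect of squaring each coordinate can be checked on a small synthetic example. The sketch below assumes a dataset where one class lies near the origin and the other surrounds it in a ring; the slide's actual data are not available, so the radii, the sample sizes, and the threshold 2.25 are illustrative choices only.

```python
import math
import random

def square_map(point):
    """Map (x1, x2) to (x1^2, x2^2)."""
    x1, x2 = point
    return (x1 * x1, x2 * x2)

def sample_ring(rng, r_min, r_max, n):
    """Sample n points whose distance from the origin lies in [r_min, r_max]."""
    points = []
    for _ in range(n):
        r = rng.uniform(r_min, r_max)
        theta = rng.uniform(0.0, 2.0 * math.pi)
        points.append((r * math.cos(theta), r * math.sin(theta)))
    return points

rng = random.Random(0)
inner = sample_ring(rng, 0.0, 1.0, 50)  # class -1: close to the origin (assumed data)
outer = sample_ring(rng, 2.0, 3.0, 50)  # class +1: surrounds the inner class (assumed data)

def predict(point, threshold=2.25):
    """Linear rule in the squared space: sign(z1 + z2 - threshold)."""
    z1, z2 = square_map(point)
    return 1 if z1 + z2 - threshold > 0 else -1

# In the squared space z1 + z2 equals the squared radius, so a line separates the classes.
inner_ok = all(predict(p) == -1 for p in inner)
outer_ok = all(predict(p) == 1 for p in outer)
print(inner_ok, outer_ok)
```

No line in the original coordinates separates a ring from the disc inside it, but in the squared coordinates the rule z1 + z2 = 2.25 is a hyperplane.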

Feature spaces/Kernel trick

• Using a linear classifier (nearest means or SVM) we solve a non-linear problem simply by working in a different feature space.

• With kernels

– we don’t have to make the new feature space explicit.

– we can implicitly work in a different space and efficiently compute dot products there.
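As a small illustration of computing dot products implicitly, the sketch below uses the quadratic kernel $(x^T z)^2$ on 2-D points, for which an explicit feature map is known. The kernel choice and the names `phi` and `k_quadratic` are assumptions made for this example, not something taken from the slides.

```python
import math

def dot(a, b):
    """Ordinary dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def phi(x):
    """Explicit feature map for the quadratic kernel on 2-D points."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def k_quadratic(x, z):
    """The same quantity computed in the original space: (x . z)^2."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(z))  # dot product in the 3-D feature space
implicit = k_quadratic(x, z)    # never builds the 3-D vectors
print(explicit, implicit)
```

Both values agree up to floating-point rounding, while the kernel evaluation only touches the original 2-D coordinates.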

Support vector machine

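• Consider the hard margin SVM optimization

$$\min_{w, w_0} \frac{1}{2}\|w\|^2 \quad \text{subject to } y_i (w^T x_i + w_0) \ge 1 \text{ for all } i$$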

• Solve by applying KKT. Think of KKT as a tool for constrained convex optimization.

• Form the Lagrangian

$$L_p = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left( y_i (w^T x_i + w_0) - 1 \right)$$

where $\alpha_i$ are Lagrange multipliers

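• KKT says the optimal $w$ and $w_0$ are given by the saddle point solution

$$\min_{w, w_0} \max_{\alpha} L_p = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left( y_i (w^T x_i + w_0) - 1 \right) \quad \text{subject to } \alpha_i \ge 0$$

• And KKT conditions imply that $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$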

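• Substituting the optimal $w$ back into the Lagrangian eliminates $w$ and $w_0$ and gives the dual, which is maximized over the multipliers:

$$\max_{\alpha} L_d = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j \quad \text{subject to } \alpha_i \ge 0, \; \sum_i \alpha_i y_i = 0$$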

SVM and kernels

• We can rewrite the dual in a compact form:
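As a sketch, assume the compact form $\max_{\alpha} \mathbf{1}^T \alpha - \frac{1}{2} \alpha^T (y y^T \circ K) \alpha$ with kernel matrix $K_{ij} = x_i^T x_j$. This is one common way to write the dual and not necessarily the exact notation of the slide. Under that assumption the objective can be evaluated as below; the helper names and the two-point dataset are illustrative.

```python
def linear_kernel(a, b):
    """Dot product; swapping in another kernel changes only the matrix K."""
    return sum(ai * bi for ai, bi in zip(a, b))

def gram_matrix(points, kernel):
    """Kernel matrix K with K[i][j] = kernel(x_i, x_j)."""
    return [[kernel(a, b) for b in points] for a in points]

def dual_objective(alpha, y, K):
    """sum_i alpha_i - 1/2 * sum_ij alpha_i alpha_j y_i y_j K_ij."""
    n = len(alpha)
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * K[i][j]
               for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad

def is_feasible(alpha, y, tol=1e-9):
    """Check alpha_i >= 0 and sum_i alpha_i y_i = 0."""
    balance = sum(a * yi for a, yi in zip(alpha, y))
    return all(a >= -tol for a in alpha) and abs(balance) <= tol

# Two points on the x-axis, one per class (assumed toy data).
X = [(1.0, 0.0), (-1.0, 0.0)]
y = [1, -1]
K = gram_matrix(X, linear_kernel)

alpha = [0.5, 0.5]  # optimal multipliers for this toy problem
value = dual_objective(alpha, y, K)

# Recover w = sum_i alpha_i y_i x_i and compare with the primal objective.
w = [sum(alpha[i] * y[i] * X[i][d] for i in range(len(X))) for d in range(2)]
primal = 0.5 * sum(wd * wd for wd in w)
print(value, w, primal)
```

For this toy problem the dual and primal objectives coincide, as expected at the optimum of a convex problem. Only `K` depends on the data, which is why replacing the dot product with a kernel requires no other change.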

Optimization

• The SVM is thus a quadratic program that can be solved by any quadratic program solver.

• Platt’s Sequential Minimal Optimization (SMO) algorithm offers a simple specific solution to the SVM dual

• Idea is to perform coordinate ascent by selecting two variables at a time to optimize

• Let’s look at some kernels.
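The two-variables-at-a-time idea behind SMO can be sketched as follows. This is a simplified variant that picks the second variable at random, not Platt's full algorithm with its selection heuristics. It solves the soft-margin dual with a box bound `C`; a large `C` is assumed here to approximate the hard-margin case. All names and default values are illustrative.

```python
import random

def linear_kernel(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def simplified_smo(X, y, kernel=linear_kernel, C=1e6, tol=1e-5,
                   max_passes=10, max_sweeps=1000, seed=0):
    """Return (alpha, b) from a simplified SMO on the SVM dual (needs >= 2 points)."""
    rng = random.Random(seed)
    n = len(X)
    K = [[kernel(X[i], X[j]) for j in range(n)] for i in range(n)]
    alpha = [0.0] * n
    b = 0.0

    def error(i):
        # f(x_i) - y_i with f(x) = sum_k alpha_k y_k K(x_k, x) + b
        return sum(alpha[k] * y[k] * K[k][i] for k in range(n)) + b - y[i]

    passes = 0
    for _ in range(max_sweeps):
        if passes >= max_passes:
            break
        changed = 0
        for i in range(n):
            E_i = error(i)
            # Skip i if it satisfies the KKT conditions within tol.
            violates = ((y[i] * E_i < -tol and alpha[i] < C) or
                        (y[i] * E_i > tol and alpha[i] > 0))
            if not violates:
                continue
            j = rng.choice([k for k in range(n) if k != i])
            E_j = error(j)
            a_i_old, a_j_old = alpha[i], alpha[j]
            # Bounds that keep 0 <= alpha <= C and sum_i alpha_i y_i unchanged.
            if y[i] != y[j]:
                lo, hi = max(0.0, a_j_old - a_i_old), min(C, C + a_j_old - a_i_old)
            else:
                lo, hi = max(0.0, a_i_old + a_j_old - C), min(C, a_i_old + a_j_old)
            eta = 2.0 * K[i][j] - K[i][i] - K[j][j]
            if lo == hi or eta >= 0:
                continue
            # Optimize alpha_j along the constraint line, then clip to [lo, hi].
            a_j = min(hi, max(lo, a_j_old - y[j] * (E_i - E_j) / eta))
            if abs(a_j - a_j_old) < 1e-12:
                continue
            a_i = a_i_old + y[i] * y[j] * (a_j_old - a_j)
            alpha[i], alpha[j] = a_i, a_j
            # Update the bias so the changed pair satisfies the KKT conditions.
            b1 = b - E_i - y[i] * (a_i - a_i_old) * K[i][i] - y[j] * (a_j - a_j_old) * K[i][j]
            b2 = b - E_j - y[i] * (a_i - a_i_old) * K[i][j] - y[j] * (a_j - a_j_old) * K[j][j]
            if 0 < a_i < C:
                b = b1
            elif 0 < a_j < C:
                b = b2
            else:
                b = (b1 + b2) / 2.0
            changed += 1
        # Count consecutive sweeps without any update.
        passes = passes + 1 if changed == 0 else 0
    return alpha, b

# Two points on the x-axis, one per class (assumed toy data).
alpha, b = simplified_smo([(1.0, 0.0), (-1.0, 0.0)], [1, -1])
print(alpha, b)
```

Two variables are updated together because the constraint $\sum_i \alpha_i y_i = 0$ makes it impossible to change a single multiplier on its own.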

Example kernels

• Polynomial kernels of degree d give a feature space with higher order non-linear terms

$$K(x_i, x_j) = (x_i^T x_j + 1)^d$$

• Radial basis kernel gives infinite dimensional space (Taylor series)

$$K(x_i, x_j) = e^{-\|x_i - x_j\|^2 / s}$$
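Both kernels are short to write down in code. The sketch below assumes the forms $(x_i^T x_j + 1)^d$ and $e^{-\|x_i - x_j\|^2 / s}$, with `s` treated as a positive width parameter; the function names and default values are illustrative.

```python
import math

def polynomial_kernel(x, z, d=2):
    """(x . z + 1)^d for equal-length sequences x and z."""
    return (sum(a * b for a, b in zip(x, z)) + 1.0) ** d

def rbf_kernel(x, z, s=1.0):
    """exp(-||x - z||^2 / s), assuming s > 0."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / s)

def gram_matrix(points, kernel):
    """Kernel matrix K with K[i][j] = kernel(x_i, x_j)."""
    return [[kernel(a, b) for b in points] for a in points]

points = [(1.0, 2.0), (3.0, 4.0), (0.0, 0.0)]
K_poly = gram_matrix(points, polynomial_kernel)
K_rbf = gram_matrix(points, rbf_kernel)
print(K_poly[0][1], K_rbf[0][0])
```

Either Gram matrix can be passed to a dual SVM solver in place of the matrix of plain dot products.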

Example kernels

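• Empirical kernel map

– Define a set of reference vectors $m_j$ for $j = 1, \dots, M$

– Define a score $s(x_i, m_j)$ between $x_i$ and $m_j$

– Then $\Phi(x_i) = [s(x_i, m_1), \dots, s(x_i, m_M)]$

– And $K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$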
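A minimal sketch of an empirical kernel map, assuming each point is represented by its vector of scores against the reference vectors and that the kernel is the dot product of two such score vectors. The score function used here (a plain dot product) and all names are hypothetical choices for illustration.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def empirical_feature_map(x, references, score):
    """Vector of scores between x and every reference vector m_j."""
    return [score(x, m) for m in references]

def empirical_kernel(x, z, references, score):
    """Dot product of the two score vectors."""
    phi_x = empirical_feature_map(x, references, score)
    phi_z = empirical_feature_map(z, references, score)
    return dot(phi_x, phi_z)

references = [(1.0, 0.0), (0.0, 1.0)]  # hypothetical reference vectors
x, z = (2.0, 3.0), (1.0, 1.0)
value = empirical_kernel(x, z, references, dot)
print(empirical_feature_map(x, references, dot), value)
```

Because the kernel is defined as an explicit dot product of feature vectors, it is a valid kernel for any choice of score function.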

Example kernels

• Bag of words

– Given two documents D1 and D2, we define the kernel K(D1, D2) as the number of words in common.

– To prove this is a kernel, first create a large set of words Wi. Define the mapping Φ(D1) as a high dimensional vector where Φ(D1)[i] is 1 if the word Wi is present in the document and 0 otherwise. Then K(D1, D2) = Φ(D1)^T Φ(D2), a dot product in that feature space.
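The argument can be checked directly: counting shared words gives the same number as the dot product of the binary word-presence vectors. The sketch below assumes whitespace tokenization with lowercasing and counts each distinct word once; the example documents are made up.

```python
def tokenize(doc):
    """Set of distinct lowercase words in a document (whitespace split)."""
    return set(doc.lower().split())

def bow_kernel(d1, d2):
    """Number of distinct words the two documents have in common."""
    return len(tokenize(d1) & tokenize(d2))

def bow_features(doc, vocabulary):
    """Binary vector: entry i is 1 if vocabulary[i] occurs in the document."""
    words = tokenize(doc)
    return [1 if w in words else 0 for w in vocabulary]

d1 = "the cat sat on the mat"
d2 = "the dog sat on the log"

vocabulary = sorted(tokenize(d1) | tokenize(d2))
phi1 = bow_features(d1, vocabulary)
phi2 = bow_features(d2, vocabulary)
dot_value = sum(a * b for a, b in zip(phi1, phi2))

print(bow_kernel(d1, d2), dot_value)
```

The shared words here are "the", "sat", and "on", so both computations give 3.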

SVM and kernels

• What if we make the kernel matrix K a variable and optimize the dual over it as well?

• But now there is no way to tie the kernel matrix to the training data points.

SVM and kernels

• To tie the kernel matrix to training data we assume that the kernel to be determined is a linear combination of some existing base kernels.

• Now we have a problem that is not a quadratic program anymore.

• Instead we have a semi-definite program (Lanckriet et al. 2002)
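The combination step itself is simple once the weights are known. The sketch below forms $K = \sum_k \mu_k K_k$ from base kernel matrices with fixed weights; it does not learn the weights, which is the part that requires the semi-definite program. The names `combine_kernels` and `mean_kernel` and the toy matrices are illustrative.

```python
def combine_kernels(kernels, weights):
    """Weighted sum of same-sized kernel matrices: K = sum_k weights[k] * kernels[k]."""
    n = len(kernels[0])
    return [[sum(w * K[i][j] for w, K in zip(weights, kernels))
             for j in range(n)]
            for i in range(n)]

def mean_kernel(kernels):
    """Unweighted average of the base kernels."""
    m = len(kernels)
    return combine_kernels(kernels, [1.0 / m] * m)

# Two toy base kernel matrices on the same two points (assumed values).
K1 = [[1.0, 0.0], [0.0, 1.0]]
K2 = [[2.0, 1.0], [1.0, 2.0]]

K_mean = mean_kernel([K1, K2])
K_weighted = combine_kernels([K1, K2], [0.25, 0.75])
print(K_mean, K_weighted)
```

With non-negative weights the combination of valid kernel matrices is again a valid kernel matrix, so the result can be used in the SVM dual unchanged.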

Theoretical foundation

• Recall the margin error theorem (Theorem 7.3 from Learning with Kernels)

Theoretical foundation

• The kernel analogue of Theorem 7.3 from Lanckriet et al. 2002:

How does MKL work in practice?

• Gönen and Alpaydın, JMLR, 2011

• Datasets:

– Digit recognition

– Internet advertisements

– Protein folding

• Form kernels with different sets of features

• Apply SVM with various kernel learning algorithms.

From Gönen and Alpaydın, JMLR, 2011

• MKL better than single kernel

• Mean kernel hard to beat

• Non-linear MKL looks promising
