Upload
trandung
View
225
Download
1
Embed Size (px)
Citation preview
Lecture 3Theory of Kernel Functions
Pavel Laskov1 Blaine Nelson1
1Cognitive Systems GroupWilhelm Schickard Institute for Computer ScienceUniversitat Tubingen, Germany
Advanced Topics in Machine Learning, 2012
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 1 / 47
Part I
Introduction: Kernel Functions
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 2 / 47
Overview
In this lecture, we will formally define kernel functions
Recall: advantages of kernel-based learning:1 Kernels allow for learning in high-dimensional feature spaces without explicit
mapping into feature space2 Kernels make learning in high-dimensional feature spaces computationally
feasible3 Kernel methods learn non-linear function with the machinery of algorithms
for learning linear functions4 Kernels provide an abstraction that separates data representation & learning
Questions to be addressed:1 What properties do kernels have & what properties does a function need to
be a kernel?2 How can we verify that a kernel function is valid?3 How does one construct a kernel function?
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 3 / 47
Recall: Kernel MagicExample 2: 2-dimensional Polynomials of Degree 3
Consider a (slightly modified) feature space for 2-dimensionalpolynomials of degree 3:
Φ(x) = [x31 , x32 ,√3x21 x2,
√3x1x
22 ,√3x21 ,√3x22 ,√6x1x2,
√3x1,√3x2, 1]
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 4 / 47
Recall: Kernel MagicExample 2: 2-dimensional Polynomials of Degree 3
Consider a (slightly modified) feature space for 2-dimensionalpolynomials of degree 3:
Φ(x) = [x31 , x32 ,√3x21 x2,
√3x1x
22 ,√3x21 ,√3x22 ,√6x1x2,
√3x1,√3x2, 1]
Let us compute the inner product between two points in the featurespace:
Φ(x)⊤Φ(y) = x31 y31 + x32 y
32 + 3x21 x2y
21 y2 + 3x1x
22 y1y
22 + 3x21 y
21 + 3x22 y
22
+ 6x1x2y1y2 + 3x1y1 + 3x2y2 + 1
= (x1y1 + x2y2 + 1)3
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 4 / 47
Recall: Kernel MagicExample 2: 2-dimensional Polynomials of Degree 3
Consider a (slightly modified) feature space for 2-dimensionalpolynomials of degree 3:
Φ(x) = [x31 , x32 ,√3x21 x2,
√3x1x
22 ,√3x21 ,√3x22 ,√6x1x2,
√3x1,√3x2, 1]
Let us compute the inner product between two points in the featurespace:
Φ(x)⊤Φ(y) = x31 y31 + x32 y
32 + 3x21 x2y
21 y2 + 3x1x
22 y1y
22 + 3x21 y
21 + 3x22 y
22
+ 6x1x2y1y2 + 3x1y1 + 3x2y2 + 1
= (x1y1 + x2y2 + 1)3
Complexity: 3 multiplications instead of 10.
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 4 / 47
Kernel Questions
Which of the following functions are kernels?
κ1 (x, z) =∑D
i=1 (xi + zi ) κ2 (x, z) =∏D
i=1 h(xi−ca
)h( zi−ca
)
κ3 (x, z) = − 〈x,z〉‖x‖2‖z‖2
κ4 (x, z) =√
‖x− z‖22 + 1
where h(x) = cos(1.75x) exp(−x2/2)
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 5 / 47
Part II
Linear Algebra Review
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 6 / 47
Vector Space
Definition 1
A set X is a vector space (over the reals) if it is closed under an additionoperator ‘+’ (i.e., ∀ x, z ∈ X x+ z ∈ X ) & a scalar multiplicationoperator ‘·’ (i.e., ∀ x ∈ X , a ∈ ℜ a · x ∈ X ) and these operators satisfy;
1 (Additive Associativity) u+ (v + w) = (u+ v) + w
2 (Additive Commutativity) u+ v = v + u
3 (Additive Identity) ∃ 0 ∈ X s.t. ∀ u ∈ X u+ 0 = u
4 (Additive Inverse) ∀ u ∈ X ∃ −u ∈ X s.t. u+ (−u) = 0
5 (Distibutivity) a · (u+ v) = a · u+ a · v & (a + b) · u = a · u+ b · u6 (Multiplicative Associativity) a · (b · u) = (a · b) · u7 (Multiplicative Identity) 1 · u = u
Example: For any D ∈ ℵ, ℜD is a vector space.
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 7 / 47
Vectors I
A D-dimensional vector x is a list of D-reals in vector space ℜD
Vectors {xi}Ni=1 are linearly dependent if there exists c1, c2, . . . , cD (atleast one not 0) such that
∑Ni=1 ci · xi = 0 ;
otherwise, they are linearly independent.
The inner product of x and z is defined as: x⊤z =∑D
i=1 xi · ziNon-trivial vectors {xi}Ni=1 are orthogonal if for i 6= j , x⊤i xj = 0. They arenormal vectors if for all i , x⊤i xi = 1. They are orthonormal if both hold; i.e.,∀ i , j 〈xi , xj〉 = δi ,j = I [i == j ]A set of orthogonal vectors are linearly independentEuclidean norm of a vector: ‖x‖2 =
√
〈x, x〉Angle between vectors: θ (u, v) = arccos 〈u,v〉
‖u‖‖v‖
Projection onto vector z: projz(x) =〈z,x〉〈z,z〉z
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 8 / 47
Vectors II
Vectors {xi}Ni=1 spans X if for every x ∈ X , x =∑N
i=1 αixi
{xi}Ni=1 is a basis for X if it spans X & is linearly independent
The dimension of X is the number of elements in any basis of X :Every vector in X can be represented as its projection onto an orthonormalbasis {xi}Ni=1 of X :
z =∑D
i=1 projxi (z) =∑D
i=1 〈xi , z〉 · xi
This is a Fourier decomposition of z in that basis.
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 9 / 47
Matrices
Matrices are a N × D grid of reals.
A =
A1,1 A1,2 . . . A1,D
A2,1 A2,2 . . . A2,D
......
. . ....
AN,1 AN,2 . . . AN,D
=
−A⊤
1,•−
−A⊤
2,•−...
−A⊤
N,•−
=
| | |A•,1 A•,2 . . . A•,D
| | |
Matrix-Vector Multiplication (x is a D-vector & z is a N-vector):
Ax =
−A⊤
1,•−−A⊤
2,•−...
−A⊤
N,•−
|x|
=
A⊤
1,•xA⊤
2,•x...
A⊤
N,•x
z⊤A =[
z⊤A•,1 z⊤A•,2 . . . z⊤A•,1
]
Matrix-Matrix Multiplication:
AB =
−A⊤
1,•−−A⊤
2,•−...
−A⊤
N,•−
| | |B•,1 B•,2 . . . B•,D
| | |
=
A⊤
1,•B•,1 A⊤
1,•B•,2 . . . A⊤
1,•B•,D
A⊤
2,•B•,1 A⊤
2,•B•,2 . . . A⊤
2,•B•,D
......
. . ....
A⊤
N,•B•,1 A⊤
N,•B•,2 . . . A⊤
N,•B•,D
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 10 / 47
Matrices
Matrix Multiplication as summations:
[Ax]i =∑
ℓ Ai ,ℓxℓ
[
z⊤A]
j=
∑
k zkAk,j
z⊤Ax =∑
k,ℓ zkAk,ℓxℓ [AB]i ,k =∑
j Ai ,jBj ,k
Special forms of matrices:
Lower Triangular
• 0 0 . . . 0• • 0 . . . 0• • • . . . 0...
......
. . ....
• • • . . . •
Diagonal Matrix
• 0 0 . . . 00 • 0 . . . 00 0 • . . . 0...
......
. . ....
0 0 0 . . . •
Upper Triangular
• • • . . . •0 • • . . . •0 0 • . . . •...
......
. . ....
0 0 0 . . . •
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 11 / 47
Basic Linear Algebra
Suppose matrix A is an N × D matrix
A square matrix has same # of rows & columns; i.e., N = D
The identity matrix IN is a N × N diagonal matrix of 1’s
For any A, AID = A and INA = A
The transpose of A is denoted by A⊤ (it is D × N
A symmetric matrix is its own transpose: A = A⊤
The inverse of A is denoted by A−1: AA−1 = I & A−1A = I
An orthonormal or unitary matrix is its own inverse:AA⊤ = A⊤A = I. . . both the columns & rows of A form a basis
The rank of matrix A is the maximum number of columns of A that arelinearly independent (i.e., the dimension of its column space). A isfull-rank if rank (A) = min(M,N)
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 12 / 47
Singularity
Definition 2
A matrix A is singular if there exists some x 6= 0 such that Ax = 0;otherwise, A is nonsingular.
Theorem 3
The following are equivalent:
Matrix A is invertible
Matrix A is nonsingular
Matrix A is full-rank
The spectrum of A does not contain 0; i.e., 0 6∈ eig (A)
If A is square, the (linear) function f (x) = Ax is one-to-one & onto,f (x) = b has at least 1 solution, and f (x) = 0 only has solution x = 0
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 13 / 47
Eigenvalues & Eigenvectors
Given an N × N matrix A, an eigenvector of A is a non-trivial vector vthat satisfies
Av = λv ;
the corresponding value λ is an eigenvalue
The Rayleigh quotient is defined by
λ =v⊤Av
v⊤v
In fact, the maximum eigen-value/vector pair of A is a solution to
max‖x‖=1
x⊤Ax
x⊤x
with x restricted to have norm 1
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 14 / 47
Eigen-Decomposition & Deflation
Deflation: for any eigen-value/vector pair (λ,v) of A, the transform
A← A− λvv⊤
deflates the matrix; i.e., v is an eigenvector of A but has eigenvalue 0
A symmetric matrix has N orthonormal eigenvectors {vi} correspondingto N eigenvalues—its spectrum; eig (A)
λ1 (A) ≥ λ2 (A) ≥ . . . ≥ λN (A)
Eigen-vectors/values form orthonormal matrix V & diagonal matrix Λ
V =
| | |v1 v2 . . . vN| | |
Λ =
λ1 (A) 0 . . . 00 λ2 (A) . . . 0...
.... . .
...0 0 λN (A)
which form the eigen-decomposition of A: A = VΛV⊤
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 15 / 47
The Spectral Theorem
Theorem 4
If A is a symmetric N × N real-valued matrix, it can be written as
A =
N∑
i=1
λN (A) viv⊤i
where (λi , vi ) are eigen-value/vector pairs of A. This is called the spectraldecomposition of A
For a matrix with rank K < N, the spectral decomposition of A onlyhas K summands
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 16 / 47
Matrix Functions
Properties of diagonal matrix D with entries Di ,i :
For k = 0, 1, 2, . . .: Dk is diagonal with entries [Dk ]i ,i = (Di ,i)k
If ∄ i s.t. Di ,i = 0 then D−1 exists, is diagonal, & [D−1]i ,i = (Di ,i)−1
√D is diagonal with entries [
√D]i ,i =
√
Di ,i
Functions of A are defined by its eigen-decomposition A = VΛV⊤ &the fact that V⊤V = I
For k = 0, 1, 2, . . . Ak = VΛkV⊤
If A is non-singular, then A−1 = VΛ−1V⊤√A = V
√ΛV⊤ (Note: this satisfies
√A√A = A)
exp (A) = V exp (Λ)V⊤
log (A) = V log (Λ)V⊤
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 17 / 47
Part III
Positive (Semi-)Definiteness
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 18 / 47
Positive (Semi-)Definite Matrices
Definition 5 (Positive Semi-Definite Matrix)
Matrix A is positive semi-definite (PSD) if all its eigenvalues arenon-negative (∀ i λi (A) ≥ 0); i.e., for all x ∈ X :
x⊤Ax ≥ 0
from the Rayleigh quotient. We use A � 0 to denote that A is PSD
Definition 6 (Positive Definite Matrix)
Matrix A is positive definite if all its eigenvalues are positive(∀ i λi (A) > 0); i.e., A is PSD &
x⊤Ax = 0 ⇔ x = 0
We denote this as A ≻ 0.
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 19 / 47
PSD Matrices
Proposition 7
Matrix A is PSD iff there exists a real matrix B such that A = B⊤B
ProofCase ⇐: Suppose A = B⊤B, then for any x
x⊤Ax = x⊤B⊤Bx = ‖Bx‖2 ≥ 0
Case ⇒: If A � 0 then its eigen-decomposition (A = VΛV⊤) has onlynon-negative eigenvalues and thus,
√Λ is a real-valued matrix. Thus, let
B =√ΛV⊤ and we have
B⊤B = V√Λ√ΛV⊤ = VΛV⊤ = A
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 20 / 47
Part IV
Reproducing Kernel Hilbert Spaces
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 21 / 47
Inner Product Space
Definition 8
An inner product space X is a vector space with an associated innerproduct 〈·, ·〉 : X × X → ℜ that satisfies:
1 (Symmetry) 〈x, z〉 = 〈z, x〉2 (Linearity) 〈a · x, z〉 = a · 〈x, z〉 & 〈w + x, z〉 = 〈w, z〉+ 〈x, z〉3 (Positive Semi-Definiteness) 〈x, x〉 ≥ 0
The inner product space is strict if 〈x, x〉 = 0 ⇔ x = 0
A strict inner product space X has a natural norm given by‖x‖2 =
√
〈x, x〉. The associated metric is d (x, z) = ‖x− z‖2The space ℜD has the inner product 〈x, z〉 = x⊤z which yields theEuclidean norm:
‖x‖22 =D∑
i=1
x2i
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 22 / 47
Hilbert Space
Definition 9
A strict inner product space F is a Hilbert space if it is
1 Complete: Every (Cauchy) sequence {hi ∈ F}∞i=1 such that
limn→∞
supm>n‖hn − hm‖ = 0
converges to an element h ∈ F ; i.e., hi → h
2 Separable: There is a countable subset F = {hi ∈ F}∞i=1 such that forall h ∈ F and ǫ > 0, there exists hi ∈ F such that
‖hi − h‖ < ǫ
Hilbert Space Examples: the interval [0, 1], the reals ℜ, the complexnumbers C, & Euclidean spaces ℜD for D ∈ ℵ.P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 23 / 47
Hilbert Space
Definition 9
A strict inner product space F is a Hilbert space if it is
1 Complete: Every (Cauchy) sequence {hi ∈ F}∞i=1 such that
limn→∞
supm>n‖hn − hm‖ = 0
converges to an element h ∈ F ; i.e., hi → h
Technical Condition required for potentially infinite-dimensional sets
2 Separable: There is a countable subset F = {hi ∈ F}∞i=1 such that forall h ∈ F and ǫ > 0, there exists hi ∈ F such that
‖hi − h‖ < ǫ
Hilbert Space Examples: the interval [0, 1], the reals ℜ, the complexnumbers C, & Euclidean spaces ℜD for D ∈ ℵ.P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 23 / 47
Hilbert Space
Definition 9
A strict inner product space F is a Hilbert space if it is
1 Complete: Every (Cauchy) sequence {hi ∈ F}∞i=1 such that
limn→∞
supm>n‖hn − hm‖ = 0
converges to an element h ∈ F ; i.e., hi → h
Technical Condition required for potentially infinite-dimensional sets
2 Separable: There is a countable subset F = {hi ∈ F}∞i=1 such that forall h ∈ F and ǫ > 0, there exists hi ∈ F such that
‖hi − h‖ < ǫCondition required to make Hilbert space isomorphisms
Hilbert Space Examples: the interval [0, 1], the reals ℜ, the complexnumbers C, & Euclidean spaces ℜD for D ∈ ℵ.P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 23 / 47
Function Spaces
What is a vector? An ordered list of D elements from ℜ indexed by theindex set ID = {1, 2, . . . ,D}. The set of all such lists is ℜD
We can extend this notion to countable sequences x = (x1, x2, . . .) byusing the index set I = ℵ
Inner-product generalizes naturally as
〈x, y〉 =∑
i∈ℵ
xi · yi
However, we need additional restrictions to make such spaces well-behavedThe subspace ℓ2 for which ∀ x 〈x, x〉 <∞ is a Hilbert space
Further, a function f : X → ℜ maps each x ∈ X to exactly one y ∈ ℜ;i.e., it is also a vector with an uncountable index set (e.g., I = ℜD)
Inner-product again generalizes naturally as
〈f , g〉 =∫
X
f (x) g (x)dx
Subspace L2 (X ) defined on X , a compact subspace of ℜD , for which∀ f ∈ L2 (X ) , 〈f , f 〉 <∞ is a Hilbert space
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 24 / 47
Properties of (Separable) Hilbert Spaces
Hilbert space F is isomorphic to H if there is a one-to-one linearmapping T : F → H such that for all x, z ∈ F
〈T (x),T (z)〉H = 〈x, z〉F
Every separable Hilbert space (A) of dimension D is isomorphic to ℜD
and (B) of infinite dimension is isomorphic to ℓ2
Since Hilbert space F is isomorphic to ℜD or ℓ2, F has an orthonormalbasis {φi} & element in x ∈ F have a Fourier decomposition:
x =∑
i 〈φi , x〉F · φi
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 25 / 47
Part V
Characterizing Kernel Functions
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 26 / 47
Kernel Terminology
Definition 10
A kernel is a two-argument real-valued function over X ×X(κ : X × X → ℜ) such that for any x, z ∈ X
κ (x, z) = 〈φ (x) , φ (z)〉F (1)
for some inner-product space F such that ∀ x ∈ X φ (x) ∈ F
Kernel functions must be symmetric since inner products are symmetric
To show that κ is a valid kernel, it is sufficient to show that a mapping φexists that yields Eq. 1. However, this is generally difficult to construct.
In this rest of this lecture, we will demonstrate additional ways toconstruct & validate kernels
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 27 / 47
Kernel Matrices
Definition 11
A kernel matrix (or Gram matrix) K is the matrix that results fromapplying κ to all pairs of datapoints in set {xi}Ni=1
K =
κ (x1, x1) κ (x1, x2) . . . κ (x1, xN)κ (x2, x1) κ (x2, x2) . . . κ (x2, xN)
......
. . ....
κ (xN , x1) κ (xN , x2) . . . κ (xN , xN)
that is, Ki ,j = κ (xi , xj )
Kernel matrices are square & symmetric
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 28 / 47
Kernel Matrices are PSD
Proposition 12
Kernel matrices, which are constructed from a kernel corresponding to astrict inner product space F , are PSD.
ProofBy definition of a kernel matrix, for all i , j ∈ 1, . . . ,N
Ki ,j = κ (xi , xj ) = 〈φ (xi ) , φ (xj )〉FThus, for any v ∈ ℜN :
v⊤Kv =∑
i ,j viKi ,jvj =∑
i ,j vi 〈φ (xi ) , φ (xj )〉F vj
=⟨
∑
i viφ (xi ),∑
j vjφ (xj)⟩
F
= ‖∑
i viφ (xi )‖2F ≥ 0
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 29 / 47
Reproducing Kernel Function
Definition 13 (Reproducing Kernel Function (Aronszajn, 1950) [1])
Suppose F is a Hilbert space of functions over X ; the functionκ : X ×X → ℜ is a reproducing kernel of F if
1 For every x ∈ X , the function fx (·) = κ (·, x) is in F .2 Reproducing Property: for every z ∈ X and every f ∈ F
f (z) = 〈f , κ (·, z)〉F
Further, the space is called a Reproducing Kernel Hilbert Space (RKHS)
By 1st property & closure of F , for any αi ∈ ℜ and xi ∈ X , we have∑N
i=1 αi · κ (·, xi ) ∈ XApplying fx from 1st property to 2nd property, for any x, z ∈ X , we have
κ (x, z) = 〈κ (·, x) , κ (·, z)〉FP. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 30 / 47
Kernel Functions
Definition 14 (Finitely Positive Semi-definite)
A function κ : X × X → ℜ is finitely positive semi-definite (FPSD) if
It is symmetric; i.e., ∀ x, z ∈ X κ (x, z) = κ (z, x)
The matrix K formed by applying κ to any finite subset of X is positivesemi-definite: K � 0
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 31 / 47
Kernel Functions I
Theorem 15
κ : X ×X → ℜ (either continuous or with a countable domain) is FPSDiff ∃ Hilbert space F with feature map φ : X → F s.t.
κ (x, z) = 〈φ (x) , φ (z)〉
ProofCase ⇐: Follows from Proposition 12.Case ⇒: Suppose κ if FPSD & we construct Hilbert Space Fκ with κ asits reproducing kernel; i.e., Fκ is the closure of functions: fx (·) = κ (·, x).Thus, for any αi , xi , g (·) = ∑
i αiκ (·, xi ) is in Fκ &, by the reproducingproperty,
〈g , g〉 = ∑
i ,j αiαjκ (xi , xj ) = α⊤Kα
where K is the kernel matrix {xi}, & thus α⊤Kα ≥ 0 since K � 0.
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 32 / 47
Kernel Functions II
(Completeness) Follows from the Cauchy-Schwarz inequality, but beyondthe scope of this course.
(Separability) Separability follows from κ being continuous or having acountable domain, but is not shown here.
Finally, the mapping φ is specified by κ and φ (x) = κ (·, x) ∈ Fκ .
Note, the inner product defined above is strict since if ‖f ‖ = 0, then forall x, |f (x)| ≤ ‖f ‖ ‖φ (x)‖ = 0
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 33 / 47
Part VI
Kernel Constructions
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 34 / 47
Simple Kernels
Clearly, the linear kernel defined by
κlin (x, z) = 〈x, z〉 = x⊤z
is a valid kernel function since it is an inner product in XFor any N × N matrix B � 0,
κB (x, z) = 〈x |B| z〉 = x⊤Bz
is a valid kernel function
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 35 / 47
Closure Properties of Kernels I
Proposition 16
Suppose κ1 & κ2 are kernels on X , a > 0, f : X → ℜ, φ : X → ℜM , & κ3is a kernel on ℜM . Then these are all kernel functions on X :
16.1 κ (x, z) = κ1 (x, z) + κ2 (x, z)
16.2 κ (x, z) = a · κ1 (x, z)16.3 κ (x, z) = κ1 (x, z) · κ2 (x, z)16.4 κ (x, z) = f (x) f (z)
16.5 κ (x, z) = κ3 (φ (x) , φ (z))
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 36 / 47
Closure Properties of Kernels II
ProofLet K1 & K2 be the kernel matrices of κ1 & κ2 applied to any set{xi}Ni=1—both these matrices are PSD. Also let α be any N-vector:
(Part 1): K = K1 +K2 ⇒ α⊤Kα = α⊤K1α+α⊤K2α ≥ 0(Part 2): K = aK1 ⇒ α⊤Kα = a · α⊤K1α ≥ 0(Part 3): Take the spectral decomposition of K1 =
∑Ni=1 λiviv
⊤i and
K2 =∑N
i=1 γiwiw⊤i . The spectral decomposition of their element-wise
product, K = K1 ⊙K2, is then K =∑N
i ,j=1
√
λiγj(vi ⊙ wj)(vi ⊙ wj)⊤;
i.e., a summation of rank-1 matrices with positive coefficients ⇒ PSD.(Part 4): κ (x, z) = 〈ψ (x) , ψ (z)〉 where ψ : x 7→ f (x); thus, κ is PSD.(Part 5): Since κ3 is a kernel, applying it to any set of vectors {φ (xi )}Ni=1
yields a PSD matrix.
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 37 / 47
Closure Properties of Kernels III
The feature spaces for these kernels are as follows:
For kernel κ1 (x, z) + κ2 (x, z), the new feature map is equivalent tostacking the feature maps of κ1 & κ2:
φ (x) =
[
φ1 (x)φ2 (x)
]
For kernel a · κ1 (x, z), its feature space is scaled by√a
For kernel κ1 (x, z) · κ2 (x, z), if φ1 has dimension N1 and φ2 hasdimension N2, φ has N1N2 features given by
[φ (x)]ij = [φ1 (x)]i · [φ2 (x)]j
It follows that the features of κ1 (x, z)d are all monomials of the form
[φ1 (x)]d11 [φ1 (x)]
d22 . . . [φ1 (x)]
dNN
∑
i di = d
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 38 / 47
Additional Kernel Functions
Proposition 17
Suppose κ1 is a kernel on X & p : ℜ → ℜ is a polynomial withnon-negative coefficients. Then, the following are kernels:
1 κ (x, z) = p (κ1 (x, z))
2 κ (x, z) = exp (κ1 (x, z))
3 Gaussian or RBF kernel: κ (x, z) = exp(
−‖x−z‖222σ2
)
Proof(Part 1) Constructing a polynomial kernel from base kernel κ1 proceedsdirectly from Proposition 16.1, 16.2, & 16.3(Part 2) Consider that exp (x) = 1+ x + 1
2x2 + . . . 1
i !xi + . . .. Thus, it is a
limit of polynomials & the PSD property is closed under pointwise limits.(Part 3) Left as an exercise.
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 39 / 47
Common Kernel Functions
Linear Kernel: κlin (x, z) = x⊤z
Polynomial Kernel: κpoly (x, z) = (x⊤z+ R)d
RBF Kernel: κrbf (x, z) = exp(
−‖x−z‖222σ2
)
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 40 / 47
Kernel Questions
Which of the following functions are kernels?
κ1 (x, z) =∑D
i=1 (xi + zi ) κ2 (x, z) =∏D
i=1 h(xi−ca
)h( zi−ca
)
κ3 (x, z) = − 〈x,z〉‖x‖2‖z‖2
κ4 (x, z) =√
‖x− z‖22 + 1
where h(x) = cos(1.75x) exp(−x2/2)κ1 is not a kernel. Consider x1 =
[
1 0]⊤
& x2 =[
0 2]⊤
. Their kernelmatrix has eigenvalues −1 and 5.κ2 is a kernel because it can be written as the product f (x)f (z) wheref (x) =
∏Di=1 h(
xi−ca
)κ3 is not a kernel because it is the negation of a valid non-trivial kernel& thus will have negative eigenvalues
κ4 is not a kernel. Consider x1 =[
1 0]⊤
& x2 =[
0 1]⊤
. Again, theirkernel matrix has a negative eigenvalue
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 41 / 47
Part VII
Transforming Kernel Matrices
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 42 / 47
Operations on Kernel MatricesSimple Transformations
Adding a non-negative constant to the Kernel Matrix: corresponds toadding a new constant feature to each training example; i.e., given thematrix Φ of features such that K = ΦΦ⊤,
[
Φ c1]
∗[
Φ c1]⊤
= K+ c211⊤
Adding a non-negative constant to its diagonal: corresponds to addingan indicator feature for every data point
φ (x1) c 0 . . . 0φ (x2) 0 c . . . 0
......
.... . .
...φ (xN) 0 0 . . . c
φ (x1) c 0 . . . 0φ (x2) 0 c . . . 0
......
.... . .
...φ (xN) 0 0 . . . c
⊤
= K+ c2I
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 43 / 47
Operations on Kernel MatricesCentering Data
Suppose we want to translate the origin to the data’s center of mass. . .
x1
x2
x3
x4
x5x6
x7
x8
x9
x10
x11
x12
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 44 / 47
Operations on Kernel MatricesCentering Data
Suppose we want to translate the origin to the data’s center of mass. . .
x1
x2
x3
x4
x5x6
x7
x8
x9
x10
x11
x12x
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 44 / 47
Operations on Kernel MatricesCentering Data
Suppose we want to translate the origin to the data’s center of mass. . .
x1
x2
x3
x4
x5x6
x7
x8
x9
x10
x11
x12x
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 44 / 47
Operations on Kernel MatricesCentering Data
Suppose we want to translate the origin to the data’s center of mass. . .
x1
x2
x3
x4
x5x6
x7
x8
x9
x10
x11
x12x
As we will see next lecture, this transformation can be expressed askernel transform
K← K− 1N11⊤K− 1
NK11⊤ + 1⊤K1
N2 11⊤
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 44 / 47
Operations on Kernel MatricesNormalizing Data
Suppose we want to project all data to be norm 1; i.e., ‖x‖ = 1. . .
x1
x2
x3
x4
x5x6
x7
x8
x9
x10
x11
x12
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 45 / 47
Operations on Kernel MatricesNormalizing Data
Suppose we want to project all data to be norm 1; i.e., ‖x‖ = 1. . .
x1
x2
x3
x4
x5x6
x7
x8
x9
x10
x11
x12
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 45 / 47
Operations on Kernel MatricesNormalizing Data
Suppose we want to project all data to be norm 1; i.e., ‖x‖ = 1. . .
x1
x2
x3
x4
x5
x6
x7
x8
x9
x10
x11
x12
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 45 / 47
Operations on Kernel MatricesNormalizing Data
Suppose we want to project all data to be norm 1; i.e., ‖x‖ = 1. . .
x1
x2
x3
x4
x5
x6
x7
x8
x9
x10
x11
x12
This transformation can be achieved using only the information from thekernel matrix:
κ (x, z) =κ (x, z)
√
κ (x, x) κ (z, z)
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 45 / 47
Summary
We explored a formal framework for kernels
We saw a formal defintion for kernel functions & matrices
We saw the properties that kernels must exhibit and how thoseproperties can be used to validate kernel functions & construct newkernels from existing kernels
We explored some operations that allow us to manipulate data infeature space
Next Lecture: we will see basic kernel-based learning algorithms
We will explore how to take the mean of data in feature space & use that toconstruct a novelty detection algorithmWe will explore how to project data in feature space & use that for a basicsubspace algorithm
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 46 / 47
Bibliography I
The Majority of the work from this talk can be found in the lecuture’saccompanying book, “Kernel Methods for Pattern Analysis.”
[1] N. Aronszajn. Theory of reproducing kernels. Transactions of theAmerican Mathematical Society, 68(3):337–404, 1950.
P. Laskov and B. Nelson (Tubingen) Lecture 3: Kernel Functions April 17, 2012 47 / 47