© Stanley Chan 2020. All Rights Reserved.
ECE595 / STAT598: Machine Learning I
Lecture 03: Regression with Kernels
Spring 2020
Stanley Chan
School of Electrical and Computer Engineering, Purdue University






Outline

Mathematical Background

Lecture 1: Linear regression: A basic data analytic tool

Lecture 2: Regularization: Constraining the solution

Lecture 3: Kernel Method: Enabling nonlinearity

Lecture 3: Kernel Method

Kernel Method

Dual Form
Kernel Trick
Algorithm

Examples

Radial Basis Function (RBF)
Regression using RBF
Kernel Methods in Classification


Why Another Method?

Linear regression: Pick a global model, best fit globally.
Kernel method: Pick a local model, best fit locally.
In the kernel method, instead of picking a line or a quadratic equation, we pick a kernel.
A kernel is a measure of distance between training samples.
The kernel method buys us the ability to handle nonlinearity.
Ordinary regression is based on the columns (features) of A.
The kernel method is based on the rows (samples) of A.


Pictorial Illustration


Overview of the Method

Model Parameter:

We want the model parameter θ to look like: (How? Question 1)

$$\theta = \sum_{n=1}^{N} \alpha_n x_n.$$

This model expresses θ as a combination of the samples.
The trainable parameters are α_n, for n = 1, ..., N.
If we can make α_n local, i.e., non-zero for only a few of them, then we can achieve our goal: localized, sample-dependent.

Predicted Value:

The predicted value of a new sample x is

$$\widehat{y} = \theta^T x = \sum_{n=1}^{N} \alpha_n \langle x, x_n \rangle.$$

We want this model to encapsulate nonlinearity. (How? Question 2)


Dual Form of Linear Regression

Goal: Address Question 1: Express θ as

$$\theta = \sum_{n=1}^{N} \alpha_n x_n.$$

We start by stating a technical lemma:

Lemma. For any A ∈ R^{N×d}, y ∈ R^N, and λ > 0,

$$(A^T A + \lambda I)^{-1} A^T y = A^T (A A^T + \lambda I)^{-1} y. \tag{1}$$

Proof: See Appendix.

Remark:
The identity matrix I is d × d on the left-hand side and N × N on the right-hand side.
If λ = 0, the identity holds only when A is invertible.


Dual Form of Linear Regression

Using the Lemma, we can show that

$$\begin{aligned}
\theta &= (A^T A + \lambda I)^{-1} A^T y && \text{(Primal Form)} \\
&= A^T \underbrace{(A A^T + \lambda I)^{-1} y}_{\overset{\text{def}}{=}\ \alpha} && \text{(Dual Form)} \\
&= \begin{bmatrix} \text{---}\ x_1^T\ \text{---} \\ \text{---}\ x_2^T\ \text{---} \\ \vdots \\ \text{---}\ x_N^T\ \text{---} \end{bmatrix}^T
\begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_N \end{bmatrix}
= \sum_{n=1}^{N} \alpha_n x_n,
\qquad \alpha_n \overset{\text{def}}{=} [(A A^T + \lambda I)^{-1} y]_n.
\end{aligned}$$
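The primal/dual equivalence can be verified numerically. Below is a minimal sketch using NumPy; the sizes N, d and the value of λ are arbitrary choices, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, lam = 20, 5, 0.1
A = rng.standard_normal((N, d))   # rows of A are the samples x_n^T
y = rng.standard_normal(N)

# Primal form: solve the d x d system (A^T A + lam I) theta = A^T y
theta_primal = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ y)

# Dual form: solve the N x N system (A A^T + lam I) alpha = y, then theta = A^T alpha
alpha = np.linalg.solve(A @ A.T + lam * np.eye(N), y)
theta_dual = A.T @ alpha

print(np.allclose(theta_primal, theta_dual))  # True
```

Note that the primal solve works with a d × d matrix while the dual solve works with an N × N matrix; which one is cheaper depends on whether N or d is larger.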


The Kernel Trick

Goal: Address Question 2: Introduce nonlinearity to

$$\widehat{y} = \theta^T x = \sum_{n=1}^{N} \alpha_n \langle x, x_n \rangle.$$

The Idea: Replace the inner product ⟨x, x_n⟩ by k(x, x_n):

$$\widehat{y} = \sum_{n=1}^{N} \alpha_n k(x, x_n).$$

k(·, ·) is called a kernel.
A kernel is a measure of the distance between two samples x_i and x_j.
⟨x_i, x_j⟩ measures distance in the ambient space; k(x_i, x_j) measures distance in a transformed space.
In particular, a valid kernel takes the form k(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩ for some nonlinear transform φ.


Kernels Illustrated

A kernel typically lifts the ambient dimension to a higher one. For example, mapping from R^2 to R^3:

$$x_n = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \quad\text{and}\quad \phi(x_n) = \begin{bmatrix} x_1^2 \\ x_1 x_2 \\ x_2^2 \end{bmatrix}.$$


Relationship between Kernel and Transform

Consider the following kernel k(u, v) = (u^T v)^2. What is the transform?

Suppose u and v are in R^2. Then

$$(u^T v)^2 = \left( \sum_{i=1}^{2} u_i v_i \right) \left( \sum_{j=1}^{2} u_j v_j \right)
= \sum_{i=1}^{2} \sum_{j=1}^{2} (u_i u_j)(v_i v_j)
= \begin{bmatrix} u_1^2 & u_1 u_2 & u_2 u_1 & u_2^2 \end{bmatrix}
\begin{bmatrix} v_1^2 \\ v_1 v_2 \\ v_2 v_1 \\ v_2^2 \end{bmatrix}.$$

So if we define φ as

$$u = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix} \mapsto \phi(u) = \begin{bmatrix} u_1^2 \\ u_1 u_2 \\ u_2 u_1 \\ u_2^2 \end{bmatrix},$$

then (u^T v)^2 = ⟨φ(u), φ(v)⟩.
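The calculation above can be checked in a few lines of code. This is a sketch only; the helper name phi and the test vectors are arbitrary choices.

```python
import numpy as np

def phi(u):
    # Explicit quadratic feature map for u in R^2
    return np.array([u[0]**2, u[0]*u[1], u[1]*u[0], u[1]**2])

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])

k_direct = (u @ v)**2        # kernel evaluated in the ambient space
k_feature = phi(u) @ phi(v)  # inner product in the lifted space

print(k_direct, k_feature)   # both equal 1.0
```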


Radial Basis Function

A useful kernel is the radial basis function (RBF) kernel:

$$k(u, v) = \exp\left\{ -\frac{\|u - v\|^2}{2\sigma^2} \right\}.$$

The corresponding nonlinear transform of the RBF kernel is infinite-dimensional. See Appendix.
‖u − v‖^2 measures the distance between two data points u and v.
σ is the standard deviation, defining "far" and "close".
The RBF kernel enforces local structure; only a few samples are effectively used.
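A vectorized construction of the RBF kernel matrix might look like the following sketch; rbf_kernel is a hypothetical helper, and σ is a free parameter the user must pick.

```python
import numpy as np

def rbf_kernel(X, Z, sigma):
    """[K]_ij = exp(-||X_i - Z_j||^2 / (2 sigma^2)) for rows of X and Z."""
    # Squared distances via the expansion ||u - v||^2 = ||u||^2 - 2 u^T v + ||v||^2
    sq = (X**2).sum(1)[:, None] - 2 * X @ Z.T + (Z**2).sum(1)[None, :]
    return np.exp(-sq / (2 * sigma**2))

X = np.linspace(0, 10, 5).reshape(-1, 1)
K = rbf_kernel(X, X, sigma=2.0)
print(K.shape)                        # (5, 5)
print(np.allclose(np.diag(K), 1.0))  # True, since k(x, x) = 1
```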


Kernel Method

Given the choice of the kernel function, we can write down the algorithm as follows.

1. Pick a kernel function k(·, ·).
2. Construct a kernel matrix K ∈ R^{N×N}, where [K]_{ij} = k(x_i, x_j), for i = 1, ..., N and j = 1, ..., N.
3. Compute the coefficients α ∈ R^N, with α_n = [(K + λI)^{-1} y]_n.
4. Estimate the predicted value for a new sample x:

$$g_\theta(x) = \sum_{n=1}^{N} \alpha_n k(x, x_n).$$

Therefore, the choice of the regression function is shifted to the choice of the kernel.
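The four steps above can be sketched end-to-end with the RBF kernel. This is illustrative only: the 1-D inputs, the toy sine data, the helper names, and the choices of σ and λ are assumptions, not part of the slides.

```python
import numpy as np

def fit_kernel_regression(t, y, sigma, lam):
    """Steps 1-3: RBF kernel, build K, solve (K + lam I) alpha = y."""
    K = np.exp(-(t[:, None] - t[None, :])**2 / (2 * sigma**2))
    return np.linalg.solve(K + lam * np.eye(len(t)), y)

def predict(t_new, t, alpha, sigma):
    """Step 4: g(x) = sum_n alpha_n k(x, x_n)."""
    k = np.exp(-(t_new[:, None] - t[None, :])**2 / (2 * sigma**2))
    return k @ alpha

t = np.linspace(0, 100, 50)
y = np.sin(t / 20.0)                    # toy data standing in for the heights
alpha = fit_kernel_regression(t, y, sigma=5.0, lam=1e-3)
y_hat = predict(t, t, alpha, sigma=5.0)
print(y_hat.shape)                      # (50,)
```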



Example

Goal: Use the kernel method to fit the data points shown as follows.

What is the input feature vector xn? xn = tn: The time stamps.

What is the output yn? yn is the height.

Which kernel to choose? Let us consider the RBF.

(Figure: scatter plot of the noisy data samples y_n against the time stamps t_n ∈ [0, 100].)


Example (using RBF)

Define the fitted function as gθ(t). [Here, θ refers to α.]

The RBF kernel is defined as $k(t_i, t_j) = \exp\{-(t_i - t_j)^2 / (2\sigma^2)\}$, for some σ.

The kernel matrix K looks like the following:

$$[K]_{ij} = \exp\{-(t_i - t_j)^2 / (2\sigma^2)\}.$$

(Figure: image of the 100 × 100 kernel matrix K; entries decay away from the diagonal.)

K is a banded diagonal matrix. (Why?)

The coefficient vector is αn = [(K + λI )−1y ]n.


Example (using RBF)

Using the RBF kernel, the predicted value is given by

$$g_\theta(t_{\mathrm{new}}) = \sum_{n=1}^{N} \alpha_n k(t_{\mathrm{new}}, t_n) = \sum_{n=1}^{N} \alpha_n \exp\left\{ -\frac{(t_{\mathrm{new}} - t_n)^2}{2\sigma^2} \right\}.$$

Pictorially, the predicted function g_θ can be viewed as a linear combination of the Gaussian kernels.


Effect of σ

Large σ: flat kernel, over-smoothing.
Small σ: narrow kernel, under-smoothing.
Below is an example of the fitting and the kernel matrix K.

(Figure: three fits and their kernel matrices K, for σ too large, about right, and too small; each plot shows the noisy samples, the kernel regression fit, and the ground truth.)


Any Improvement?

We can improve the above kernel by considering x_n = [y_n, t_n]^T.

Define the kernel as

$$k(x_i, x_j) = \exp\left\{ -\left( \frac{(y_i - y_j)^2}{2\sigma_r^2} + \frac{(t_i - t_j)^2}{2\sigma_s^2} \right) \right\}.$$

This new kernel is adaptive (edge-aware).
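A sketch of this edge-aware kernel follows; the helper name bilateral_kernel, the step-edge test signal, and the values of σ_r and σ_s are illustrative assumptions.

```python
import numpy as np

def bilateral_kernel(y, t, sigma_r, sigma_s):
    """[K]_ij = exp(-((y_i - y_j)^2/(2 sigma_r^2) + (t_i - t_j)^2/(2 sigma_s^2)))."""
    dy2 = (y[:, None] - y[None, :])**2
    dt2 = (t[:, None] - t[None, :])**2
    return np.exp(-(dy2 / (2 * sigma_r**2) + dt2 / (2 * sigma_s**2)))

t = np.linspace(0, 100, 100)
y = (t > 50).astype(float)      # a step edge
K = bilateral_kernel(y, t, sigma_r=0.5, sigma_s=10.0)

# Samples on opposite sides of the edge are decoupled even when close in time:
print(K[49, 50] < K[49, 48])    # True
```

The range term suppresses averaging across the edge, which is exactly what makes the fit edge-preserving.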


Any Improvement?

Here is a comparison.

(Figure: comparison of the fits without and with the improved kernel; each plot shows the noisy samples, the kernel regression fit, and the ground truth.)

This idea is known as the bilateral filter in the computer vision literature.

It can be further extended to 2D images, where x_n = [y_n, s_n] for some spatial coordinate s_n.

Many applications. See Reading List.


Kernel Methods in Classification

The concept of lifting the data to a higher dimension is also useful for classification.

(Image source: https://towardsdatascience.com/understanding-the-kernel-trick-e0bc6112ef78)


Kernels in Support Vector Machines

Example: RBF for SVM. (We will discuss SVM later in the semester.)

The radial basis function kernel is often used in support vector machines.
A poor choice of parameter can lead to low training loss, but with the risk of over-fitting.
An under-fitted model can sometimes give better generalization.


Reading List

Kernel Method:

Learning from Data (Chapter 3.4): https://work.caltech.edu/telecourse
CMU 10701 Lecture 4: https://www.cs.cmu.edu/~tom/10701_sp11/slides/Kernels_SVM_04_7_2011-ann.pdf
Berkeley CS 194 Lecture 7: https://people.eecs.berkeley.edu/~russell/classes/cs194/
Oxford C19 Lecture 3: http://www.robots.ox.ac.uk/~az/lectures/ml/lect3.pdf

Kernel Regression in Computer Vision:

Bilateral Filter: https://people.csail.mit.edu/sparis/bf_course/course_notes.pdf
Takeda and Milanfar, "Kernel regression for image processing and reconstruction", IEEE Trans. Image Process. (2007): https://ieeexplore.ieee.org/document/4060955


Appendix


Proof of Lemma

Lemma. For any matrix A ∈ R^{N×d}, y ∈ R^N, and λ > 0,

$$(A^T A + \lambda I)^{-1} A^T y = A^T (A A^T + \lambda I)^{-1} y. \tag{2}$$

Proof:
The left-hand side is the solution to the normal equation, which means A^T A θ + λθ = A^T y.
Rearranging terms gives θ = A^T [(1/λ)(y − Aθ)].
Define α = (1/λ)(y − Aθ); then θ = A^T α.
Substituting θ = A^T α into α = (1/λ)(y − Aθ), we have α = (1/λ)(y − A A^T α).
Rearranging terms gives (A A^T + λI)α = y, which yields α = (A A^T + λI)^{-1} y.
Substituting into θ = A^T α gives θ = A^T (A A^T + λI)^{-1} y. ∎


Non-Linear Transform for RBF

Let us consider a scalar u ∈ R (and take 2σ^2 = 1):

$$\begin{aligned}
k(u, v) &= \exp\{-(u - v)^2\} = \exp\{-u^2\} \exp\{2uv\} \exp\{-v^2\} \\
&= \exp\{-u^2\} \left( \sum_{k=0}^{\infty} \frac{2^k u^k v^k}{k!} \right) \exp\{-v^2\} \\
&= \exp\{-u^2\} \left( 1, \sqrt{\tfrac{2^1}{1!}}\, u, \sqrt{\tfrac{2^2}{2!}}\, u^2, \sqrt{\tfrac{2^3}{3!}}\, u^3, \ldots \right)^T
\left( 1, \sqrt{\tfrac{2^1}{1!}}\, v, \sqrt{\tfrac{2^2}{2!}}\, v^2, \sqrt{\tfrac{2^3}{3!}}\, v^3, \ldots \right) \exp\{-v^2\}.
\end{aligned}$$

So φ is

$$\phi(x) = \exp\{-x^2\} \left( 1, \sqrt{\tfrac{2^1}{1!}}\, x, \sqrt{\tfrac{2^2}{2!}}\, x^2, \sqrt{\tfrac{2^3}{3!}}\, x^3, \ldots \right).$$
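Truncating the series gives a finite-dimensional approximation of this φ. The sketch below keeps the first 30 terms (an arbitrary choice) and checks the inner product against the exact kernel with 2σ^2 = 1:

```python
import numpy as np
from math import factorial

def phi(x, K=30):
    # phi_k(x) = exp(-x^2) * sqrt(2^k / k!) * x^k, for k = 0, ..., K-1
    ks = np.arange(K)
    coef = np.sqrt(2.0**ks / np.array([factorial(k) for k in ks], dtype=float))
    return np.exp(-x**2) * coef * x**ks

u, v = 0.3, -0.7
k_exact = np.exp(-(u - v)**2)
k_trunc = phi(u) @ phi(v)
print(abs(k_exact - k_trunc) < 1e-8)  # True: the series converges quickly for small |u|, |v|
```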


Kernels are Positive Semi-Definite

Given {x_j}_{j=1}^N, construct an N × N matrix K such that

$$[K]_{ij} = k(x_i, x_j) = \Phi(x_i)^T \Phi(x_j).$$

Claim: K is positive semi-definite.

Let z be an arbitrary vector. Then

$$z^T K z = \sum_{i=1}^{N} \sum_{j=1}^{N} z_i K_{ij} z_j
= \sum_{i=1}^{N} \sum_{j=1}^{N} z_i \Phi(x_i)^T \Phi(x_j) z_j
= \sum_{i=1}^{N} \sum_{j=1}^{N} z_i \left( \sum_{k} [\Phi(x_i)]_k [\Phi(x_j)]_k \right) z_j
\overset{(a)}{=} \sum_{k} \left( \sum_{i=1}^{N} [\Phi(x_i)]_k z_i \right)^2 \ge 0,$$

where [Φ(x_i)]_k denotes the k-th element of the vector Φ(x_i), the sum over k runs over the dimension of Φ, and (a) follows from exchanging the order of summation.
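The claim is easy to illustrate numerically; the random feature matrix below is an arbitrary stand-in for the stacked vectors Φ(x_1), ..., Φ(x_N).

```python
import numpy as np

rng = np.random.default_rng(1)
Phi = rng.standard_normal((8, 3))   # row i plays the role of Phi(x_i)
K = Phi @ Phi.T                     # [K]_ij = Phi(x_i)^T Phi(x_j)

eigvals = np.linalg.eigvalsh(K)     # eigvalsh is valid here: K is symmetric by construction
print(np.all(eigvals >= -1e-10))    # True: all eigenvalues are (numerically) nonnegative
```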


Existence of Nonlinear Transform

We just showed that: if k(x_i, x_j) = Φ(x_i)^T Φ(x_j) for any x_1, ..., x_N, then K is symmetric positive semi-definite.

The converse also holds: if K is symmetric positive semi-definite for any x_1, ..., x_N, then there exists a Φ such that k(x_i, x_j) = Φ(x_i)^T Φ(x_j).

This converse is difficult to prove. It is called the Mercer condition.
Kernels satisfying Mercer's condition have a valid Φ.
You can use the condition to rule out invalid kernels.
But proving that a given kernel is valid is still hard.
