STAT/CSE 416: Machine Learning
Emily Fox, University of Washington
May 3, 2018
©2018 Emily Fox

Going nonparametric: Nearest neighbor methods for regression and classification


Locality sensitive hashing for approximate NN search… cont'd


Complexity of brute-force search

Given a query point, scan through each point:
- O(N) distance computations per 1-NN query!
- O(N log k) per k-NN query!


What if N is huge??? (and many queries)


Moving away from exact NN search

• Approximate neighbor finding… Don't find the exact neighbor, but that's okay for many applications

• Focus on methods that provide good probabilistic guarantees on approximation


Out of millions of articles, do we need the closest article or just one that’s pretty similar?

Do we even fully trust our measure of similarity???


Using score for NN search

[Figure: 2D plot of #awesome vs. #awful split by the line 1.0 #awesome - 1.5 #awful = 0. Only search one side for queries with Score(x) > 0 and the other side for queries with Score(x) < 0; the candidate neighbors of a query point x with Score(x) < 0 lie on the Score(x) < 0 side.]

2D Data        Sign(Score)   Bin index
x1 = [0, 5]    -1            0
x2 = [1, 3]    -1            0
x3 = [3, 0]     1            1
…               …            …


Using score for NN search

HASH TABLE
Bin:            0             1
Data indices:   {1,2,4,7,…}   {3,5,6,8,…}

For a query with Score(x) < 0, the candidate neighbors are in bin 0; search for the NN amongst this set.

2D Data        Sign(Score)   Bin index
x1 = [0, 5]    -1            0
x2 = [1, 3]    -1            0
x3 = [3, 0]     1            1
…               …            …
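As a rough illustration of this single-line binning (a minimal NumPy-style sketch with illustrative names, not code from the lecture):

```python
import numpy as np

def build_hash_table(X, v):
    """X: (N, d) data matrix; v: (d,) coefficients of the splitting line."""
    scores = X @ v                        # Score(x) for every data point
    bins = (scores >= 0).astype(int)      # sign of the score -> bin index 0 or 1
    table = {0: [], 1: []}
    for i, b in enumerate(bins):
        table[int(b)].append(i)           # store datapoint indices per bin
    return table

X = np.array([[0, 5], [1, 3], [3, 0]], dtype=float)   # [#awesome, #awful]
v = np.array([1.0, -1.5])                              # 1.0 #awesome - 1.5 #awful
table = build_hash_table(X, v)

# Query: only search the bin the query falls into
xq = np.array([2.0, 2.0])
candidates = table[int(xq @ v >= 0)]
```

Exact distances are then computed only against the candidates in the query's bin.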


Three potential issues with simple approach

1. Challenging to find a good line
2. Poor quality solution: points close together get split into separate bins
3. Large computational cost: bins might contain many points, so we are still searching over a large set for each NN query

How to define the line?

Crazy idea: Define the line randomly!

[Figure: the same 2D plot of #awesome vs. #awful with a candidate splitting line.]


How bad can a random line be?

[Figure: 2D plot of #awesome vs. #awful with two points x and y at angle θxy apart and a random splitting line; regions are marked "bins are the same" or "bins are different".]

If θxy is small (x and y are close), they are unlikely to be placed into separate bins.

Goal: If x and y are close (according to cosine similarity), we want their binned values to be the same.


Three potential issues with simple approach

Bin:            0             1
Data indices:   {1,2,4,7,…}   {3,5,6,8,…}

1. Challenging to find a good line
2. Poor quality solution: points close together get split into separate bins
3. Large computational cost: bins might contain many points, so we are still searching over a large set for each NN query


Improving efficiency: Reducing # points examined per query


Reducing search cost through more bins

[Figure: 2D plot of #awesome vs. #awful partitioned by three random lines (Line 1, Line 2, Line 3); the resulting regions get 3-bit bin indices such as [0 0 0], [0 1 0], [1 1 0], and [1 1 1].]


Using score for NN search

HASH TABLE
Bin:            [0 0 0]=0   [0 0 1]=1   [0 1 0]=2   [0 1 1]=3   [1 0 0]=4   [1 0 1]=5   [1 1 0]=6   [1 1 1]=7
Data indices:   {1,2}       --          {4,8,11}    --          --          --          {7,9,10}    {3,5,6}

Search for the NN only amongst the set in the query's bin.

2D Data        Sign(Score1)  Bin 1 index  Sign(Score2)  Bin 2 index  Sign(Score3)  Bin 3 index
x1 = [0, 5]    -1            0            -1            0            -1            0
x2 = [1, 3]    -1            0            -1            0            -1            0
x3 = [3, 0]     1            1             1            1             1            1
…               …            …             …            …             …            …
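A hedged sketch of the multi-line binning (assumed NumPy implementation, illustrative names): the h sign bits per point are read as a binary number to get an integer bin index.

```python
import numpy as np

def lsh_bins(X, V):
    """X: (N, d) data; V: (h, d), one random line per row. Returns (N,) bin ids."""
    bits = (X @ V.T >= 0).astype(int)           # (N, h) matrix of 0/1 signs
    powers = 2 ** np.arange(V.shape[0])[::-1]   # read the h bits as a binary number
    return bits @ powers

def build_table(bins):
    table = {}                                  # bin id -> list of datapoint indices
    for i, b in enumerate(bins):
        table.setdefault(int(b), []).append(i)
    return table

rng = np.random.default_rng(0)
X = rng.uniform(0, 4, size=(12, 2))             # toy 2D data: [#awesome, #awful]
V = rng.normal(size=(3, 2))                     # h = 3 random lines through the origin
table = build_table(lsh_bins(X, V))
```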


Improving search quality by searching neighboring bins

(Same hash table and data indices as above.)

The query point falls in one bin, but is the NN in that bin? Not necessarily.

[Figure: the partitioned 2D plot with the query point marked.]

Even worse than before… each line can split points. We are sacrificing accuracy for speed.


Improving search quality by searching neighboring bins

(Same hash table and data indices as above.)

Next closest bins: flip 1 bit of the query's bin index.

[Figure: the partitioned 2D plot with the bins reachable by flipping one bit highlighted.]


Improving search quality by searching neighboring bins

(Same hash table and data indices as above.)

Further bins: flip 2 bits of the query's bin index.

[Figure: the partitioned 2D plot with the bins reachable by flipping two bits highlighted.]


Improving search quality by searching neighboring bins

(Same hash table and partitioned 2D plot as above, searching progressively further bins.)

Quality of retrieved NN can only improve with searching more bins

Algorithm: Continue searching neighboring bins until the computational budget is reached or the quality of the NN is good enough.
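A hedged sketch of that search order (illustrative helper, not lecture code): visit the query's own bin, then bins whose index differs by 1 bit, then 2 bits, and so on, until a candidate budget is exhausted.

```python
from itertools import combinations

def multi_probe_candidates(query_bits, table, h, budget):
    """Gather candidate indices from the query's bin and progressively further bins."""
    candidates = []
    for num_flips in range(h + 1):                       # 0 flips = the query's own bin
        for flip_positions in combinations(range(h), num_flips):
            bits = list(query_bits)
            for p in flip_positions:
                bits[p] ^= 1                             # flip the chosen bits
            bin_id = int("".join(map(str, bits)), 2)     # e.g. [0,1,0] -> 2
            candidates.extend(table.get(bin_id, []))
            if len(candidates) >= budget:
                return candidates
    return candidates

# Hash table from the slide: bin 0 -> {1,2}, bin 2 -> {4,8,11}, bin 6 -> {7,9,10}, bin 7 -> {3,5,6}
table = {0: [1, 2], 2: [4, 8, 11], 6: [7, 9, 10], 7: [3, 5, 6]}
print(multi_probe_candidates([0, 0, 0], table, h=3, budget=5))  # own bin first, then 1-bit flips
```

Exact distances are then computed against the gathered candidates; as the slide notes, the retrieved NN can only improve as more bins are searched.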


LSH recap

• Draw h random lines

• Compute “score” for each point under each line and translate to binary index

• Use h-bit binary vector per data point as bin index

• Create hash table

(LSH is a competitor data structure to kd-trees.)

• For each query point x, search bin(x), then neighboring bins until time limit


Moving to higher dimensions d


Draw random planes

Score(x) = v1 #awesome + v2 #awful + v3 #great

[Figure: 3D plot with axes x[1] = #awesome, x[2] = #awful, x[3] = #great, split by a random plane.]


Cost of binning points in d-dim

Score(x) = v1 #awesome + v2 #awful + v3 #great

Per data point, we need d multiplies to determine the bin index for each plane.

This one-time binning cost is offset if there are many queries on a fixed dataset.


Summary for retrieval using nearest neighbors and locality sensitive hashing


What you can do now…
• Implement nearest neighbor search for retrieval tasks
• Contrast document representations (e.g., raw word counts, tf-idf, …)
  - Emphasize important words using tf-idf
• Contrast methods for measuring similarity between two documents
  - Euclidean vs. weighted Euclidean
  - Cosine similarity vs. similarity via unnormalized inner product
• Describe complexity of brute-force search
• Implement LSH for approximate nearest neighbor search


Motivating NN methods for regression: Fit globally vs. fit locally


Parametric models of f(x)

[Figures, repeated over several slides: scatter of price ($) (y) vs. sq.ft. (x) with different parametric fits of f(x), e.g., polynomials of increasing order.]


f(x) is not really a polynomial

[Figure: price ($) vs. sq.ft. data for which no global polynomial fit of f(x) looks right.]


What alternative do we have?

If we:
- want to allow flexibility in f(x) having local structure
- don't want to infer "structural breaks"

what's a simple option we have, assuming we have plenty of data?


Simplest approach: Nearest neighbor regression


Fit locally to each data point

Predicted value = “closest” yi

[Figure: price ($) vs. sq.ft. scatter with the 1-NN fit; at several query locations, an arrow marks "Here, this is the closest datapoint".]

1-nearest neighbor (1-NN) regression


What people do naturally…

Real estate agent assesses value by finding sale of most similar house

[Figure: query house ($ = ???) next to the most similar sold house ($ = 850k).]


1-NN regression more formally

Dataset of (house, $) pairs: (x1, y1), (x2, y2), …, (xN, yN). Query point: xq

1. Find the "closest" xi in the dataset
2. Predict ŷq = yNN, the value of that closest datapoint

[Figure: the same price ($) vs. sq.ft. scatter, with the closest datapoint highlighted for the query.]


Visualizing 1-NN in multiple dimensions

Voronoi tessellation (or diagram):
- Divide the space into N regions, each containing 1 datapoint
- Defined such that any x in a region is "closest" to that region's datapoint

Don’t explicitly form!


Different distance metrics lead to different predictive surfaces

[Figure: 1-NN predictive surfaces under Euclidean distance vs. Manhattan distance.]


Can 1-NN be used for classification?

Yes!!

Just predict the class of the nearest neighbor


1-NN algorithm


Performing 1-NN search

• Query house:

• Dataset:

• Specify: distance metric
• Output: most similar house


1-NN algorithm

Initialize Dist2NN = ∞, closest house = Ø
For i = 1, 2, …, N:
    Compute δ = distance(house i, query house q)
    If δ < Dist2NN:
        set closest house = house i
        set Dist2NN = δ
Return the most similar house (the closest house to the query house)
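A minimal sketch of this 1-NN algorithm (assumed NumPy implementation with toy, illustrative data; not code from the lecture):

```python
import numpy as np

def one_nn(X, y, xq, distance=lambda a, b: np.linalg.norm(a - b)):
    """Brute-force 1-NN: O(N) distance computations per query."""
    dist2nn, closest = np.inf, None
    for i in range(len(X)):
        delta = distance(X[i], xq)            # distance(house i, query house)
        if delta < dist2nn:
            closest, dist2nn = i, delta       # remember the new closest house
    return y[closest], closest                # predict the closest house's value

X = np.array([[1.0, 2200.0], [2.0, 2600.0], [3.0, 1800.0]])   # toy house features
y = np.array([600_000, 850_000, 510_000])                      # sale prices
print(one_nn(X, y, np.array([2.0, 2641.0])))                   # closest house is index 1
```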


1-NN in practice

[Figure: "Nearest Neighbors Kernel (K = 1)" fit on simulated 1D data.]

Fit looks good for data dense in x and low noise


Sensitive to regions with little data

[Figure: "Nearest Neighbors Kernel (K = 1)" fit on data with a large gap in x.]

Not great at interpolating over large regions…


Also sensitive to noise in data

[Figure: "Nearest Neighbors Kernel (K = 1)" fit on noisy data, f(x0) vs. x0.]

Fits can look quite wild…Overfitting?


k-Nearest neighbors


Get more “comps”

You get a more reliable estimate if you base it on a larger set of comparable homes.

[Figure: query house ($ = ???) with four comparable sales: $850k, $749k, $833k, $901k.]


k-NN regression more formally

Dataset of (house, $) pairs: (x1, y1), (x2, y2), …, (xN, yN). Query point: xq

1. Find the k closest xi in the dataset
2. Predict ŷq = (1/k) Σj=1..k yNNj, the average value of the k nearest neighbors


Performing k-NN search

• Query house:

• Dataset:

• Specify: distance metric
• Output: most similar houses


k-NN algorithm

Initialize Dist2kNN = sort(δ1, …, δk), the sorted distances of the first k houses to the query house, and the corresponding list of sorted houses (the closest houses so far)

For i = k+1, …, N:
    Compute δ = distance(house i, query house q)
    If δ < Dist2kNN[k]:
        find j such that δ > Dist2kNN[j-1] but δ < Dist2kNN[j]
        remove the furthest house and shift the queue:
            closest houses[j+1:k] = closest houses[j:k-1]
            Dist2kNN[j+1:k] = Dist2kNN[j:k-1]
        set Dist2kNN[j] = δ and closest houses[j] = house i

Return the k most similar houses (the closest houses to the query house)
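A hedged illustration of k-NN search plus the k-NN regression prediction (argsort stands in for the explicit sorted-queue bookkeeping above; data and names are illustrative, not from the lecture):

```python
import numpy as np

def knn_regression(X, y, xq, k, distance=lambda a, b: np.linalg.norm(a - b)):
    dists = np.array([distance(xi, xq) for xi in X])   # O(N) distance computations
    nn_idx = np.argsort(dists)[:k]                      # indices of the k closest houses
    return y[nn_idx].mean(), nn_idx                     # average the k neighbors' values

rng = np.random.default_rng(0)
X = rng.uniform(1000, 3000, size=(50, 1))               # toy sq.ft. values
y = 300 * X[:, 0] + rng.normal(0, 20000, size=50)       # toy prices with noise
pred, neighbors = knn_regression(X, y, np.array([2641.0]), k=5)
```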


k-NN in practice

[Figure: "Nearest Neighbors Kernel (K = 30)" fit on noisy data, f(x0) vs. x0.]

Much more reasonable fit in the presence of noise

Boundary & sparse region issues


k-NN in practice

[Figure: the same "Nearest Neighbors Kernel (K = 30)" fit, showing jumps in the prediction.]

Discontinuities! A neighbor is either in or out.


Issues with discontinuities

Overall predictive accuracy might be okay, but…

For example, in the housing application:
- If you are a buyer or seller, this matters
- There can be a jump in the estimated value of a house going just from 2640 sq.ft. to 2641 sq.ft.
- We don't really believe this type of fit


Weighted k-nearest neighbors


Weighted k-NN

Weight more similar houses more heavily than less similar houses in the list of k-NN.

Predict:

ŷq = (cqNN1 yNN1 + cqNN2 yNN2 + cqNN3 yNN3 + … + cqNNk yNNk) / Σj=1..k cqNNj

where the cqNNj are the weights on the NN.


How to define weights?

Want the weight cqNNj to be small when distance(xNNj, xq) is large, and cqNNj to be large when distance(xNNj, xq) is small.


Kernel weights for d=1

Define: cqNNj = Kernelλ(|xNNj - xq|)

[Figure: kernel weight as a function of distance, centered at 0 with ticks at -λ and λ.]

Gaussian kernel (simple isotropic case): Kernelλ(|xi - xq|) = exp(-(xi - xq)² / λ)

Note: the Gaussian kernel is never exactly 0!


Kernel weights for d≥1

Define: cqNNj = Kernelλ(distance(xNNj, xq))

[Figure: the same kernel shape, now applied to a general distance.]
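A hedged sketch of weighted k-NN using the Gaussian kernel weights defined above (illustrative data and names, not lecture code):

```python
import numpy as np

def gaussian_kernel(dist, lam):
    return np.exp(-dist**2 / lam)             # Kernel_lambda(|xi - xq|) from the slide

def weighted_knn(X, y, xq, k, lam):
    dists = np.linalg.norm(X - xq, axis=1)    # distance(xi, xq) for every point
    nn = np.argsort(dists)[:k]                # the k nearest neighbors
    c = gaussian_kernel(dists[nn], lam)       # weights c_qNNj on the NN
    return np.sum(c * y[nn]) / np.sum(c)      # weighted average prediction ŷq

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(100, 1))
y = np.sin(4 * X[:, 0]) + rng.normal(0, 0.1, size=100)
print(weighted_knn(X, y, np.array([0.5]), k=30, lam=0.05))
```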


Kernel regression


Weighted k-NN

(Recap) Weight more similar houses more heavily than less similar houses in the list of k-NN.

Predict: ŷq = (cqNN1 yNN1 + cqNN2 yNN2 + … + cqNNk yNNk) / Σj=1..k cqNNj, where the cqNNj are the weights on the NN.


Kernel regression

Instead of just weighting NN, weight all points

Predict:

ŷq = Σi=1..N cqi yi / Σi=1..N cqi = Σi=1..N Kernelλ(distance(xi, xq)) yi / Σi=1..N Kernelλ(distance(xi, xq))

where cqi is the weight on each datapoint. This is the Nadaraya-Watson kernel weighted average.
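A hedged sketch of this Nadaraya-Watson average (illustrative data, not lecture code), using the Epanechnikov kernel that appears in the plots below:

```python
import numpy as np

def epanechnikov(dist, lam):
    u = dist / lam
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u**2), 0.0)   # bounded support

def kernel_regression(X, y, xq, lam):
    c = epanechnikov(np.abs(X - xq), lam)     # weight c_qi on every datapoint
    return np.sum(c * y) / np.sum(c)          # Nadaraya-Watson weighted average

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, 200)
y = np.sin(4 * X) + rng.normal(0, 0.2, 200)
fit = [kernel_regression(X, y, x0, lam=0.2) for x0 in np.linspace(0, 1, 50)]
```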


Kernel regression in practice

[Figure: "Epanechnikov Kernel (lambda = 0.2)" fit, f(x0) vs. x0.]

Kernel has bounded support…Only subset of data needed to compute local fit


Choice of bandwidth λ

Often, choice of kernel matters much less than choice of λ

[Figures: Epanechnikov kernel fits with λ = 0.04, λ = 0.2, and λ = 0.4, and a boxcar kernel fit with λ = 0.2, each showing f(x0) vs. x0.]


Choosing λ (or k in k-NN)

How to choose? Same story as always…


Cross Validation
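A hedged sketch of choosing k (the same idea works for λ) by leave-one-out cross validation; the data and helper names are illustrative, not from the lecture:

```python
import numpy as np

def loo_cv_error(X, y, k):
    """Leave-one-out CV error of k-NN regression on 1D data."""
    errors = []
    for i in range(len(X)):
        mask = np.arange(len(X)) != i                    # hold out point i
        dists = np.abs(X[mask] - X[i])
        nn = np.argsort(dists)[:k]
        pred = y[mask][nn].mean()                        # k-NN prediction for x_i
        errors.append((pred - y[i]) ** 2)
    return np.mean(errors)

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, 100)
y = np.sin(4 * X) + rng.normal(0, 0.2, 100)
best_k = min(range(1, 31), key=lambda k: loo_cv_error(X, y, k))
```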


Formalizing the idea of local fits


Contrasting with global average

A globally constant fit weights all points equally

ŷq = Σi=1..N c yi / Σi=1..N c = (1/N) Σi=1..N yi

(equal weight c on each datapoint)

[Figure: "Boxcar Kernel (lambda = 1)" fit, which reduces to the global average.]


Contrasting with global average

Kernel regression leads to a locally constant fit: as the query moves, some points are slowly added in while others gradually die off.

ŷq = Σi=1..N Kernelλ(distance(xi, xq)) yi / Σi=1..N Kernelλ(distance(xi, xq))

[Figures: "Boxcar Kernel (lambda = 0.2)" and "Epanechnikov Kernel (lambda = 0.2)" fits, f(x0) vs. x0.]


Local linear regression

So far, we have been fitting a constant function locally at each point → "locally weighted averages"

Can instead fit a line or polynomial locally at each point → "locally weighted linear regression"


Local regression rules of thumb

- Local linear fit reduces bias at boundaries with minimum increase in variance

- Local quadratic fit doesn’t help at boundaries and increases variance, but does help capture curvature in the interior

- With sufficient data, local polynomials of odd degree dominate those of even degree


Recommended default choice: local linear regression
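A hedged sketch of local linear regression (locally weighted least squares) under an assumed Gaussian kernel; the data and names are illustrative, not from the lecture:

```python
import numpy as np

def local_linear(X, y, xq, lam):
    """Fit a weighted line around xq and evaluate it at xq."""
    w = np.exp(-(X - xq) ** 2 / lam)                  # kernel weights (Gaussian here)
    A = np.column_stack([np.ones_like(X), X])         # design matrix [1, x]
    W = np.diag(w)
    beta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)  # weighted least squares
    return beta[0] + beta[1] * xq                     # value of the local line at xq

rng = np.random.default_rng(4)
X = rng.uniform(0, 1, 200)
y = np.sin(4 * X) + rng.normal(0, 0.2, 200)
fit = [local_linear(X, y, x0, lam=0.05) for x0 in np.linspace(0, 1, 50)]
```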


Discussion on k-NN and kernel regression


Nonparametric approaches

k-NN and kernel regression are examples of nonparametric regression.

General goals of nonparametrics:
- Flexibility
- Make few assumptions about f(x)
- Complexity can grow with the number of observations N

Lots of other choices:
- Splines, trees, locally weighted structured regression models, …


Limiting behavior of NN: Noiseless setting (εi=0)

In the limit of getting an infinite amount of noiseless data, the MSE of the 1-NN fit goes to 0.


Not true for parametric models!

[Figure: 1-NN fit vs. quadratic fit on noiseless data.]


Error vs. amount of data

[Figure: error vs. # data points in the training set.]


Limiting behavior of NN: Noisy data setting

In the limit of getting an infinite amount of data, the MSE of NN fit goes to 0 if k grows, too

[Figure: 1-NN fit, 200-NN fit, and quadratic fit on noisy data.]


NN and kernel methods for large d or small N

NN and kernel methods work well when the data cover the space, but…
- the more dimensions d you have, the more points N you need to cover the space
- you need N = O(exp(d)) data points for good performance

This is where parametric models become useful…


Complexity of NN search

Naïve approach: brute force search
- Given a query point xq
- Scan through each point x1, x2, …, xN
- O(N) distance computations per 1-NN query!
- O(N log k) per k-NN query!

What if N is huge??? (and many queries)

KD-trees! Locality-sensitive hashing, etc.


k-NN for classification


Spam filtering example

Input: x = text of email, sender, IP, …
Output: y = spam or not spam


Using k-NN for classification

Space of labeled emails (not spam vs. spam), organized by similarity of text

Not spam vs. spam: decide via majority vote of the k-NN of the query email.
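A hedged sketch of that majority vote (toy feature vectors and illustrative names, not lecture code):

```python
import numpy as np
from collections import Counter

def knn_classify(X, labels, xq, k):
    dists = np.linalg.norm(X - xq, axis=1)           # distance to every labeled email
    nn = np.argsort(dists)[:k]                       # the k nearest neighbors
    votes = Counter(labels[i] for i in nn)           # majority vote of the k-NN
    return votes.most_common(1)[0][0]

X = np.array([[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]])  # toy text features
labels = ["spam", "spam", "not spam", "not spam"]
print(knn_classify(X, labels, np.array([0.15, 0.85]), k=3))      # -> "spam"
```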



Summary for nearest neighbor and kernel regression


What you can do now…
• Motivate the use of nearest neighbor (NN) regression
• Perform k-NN regression
• Analyze computational costs of these algorithms
• Discuss sensitivity of NN to lack of data, dimensionality, and noise
• Perform weighted k-NN and define weights using a kernel
• Define and implement kernel regression
• Describe the effect of varying the kernel bandwidth λ or the # of nearest neighbors k
• Select λ or k using cross validation
• Compare and contrast kernel regression with a global average fit
• Define what makes an approach nonparametric and why NN and kernel regression are considered nonparametric methods
• Analyze the limiting behavior of NN regression
• Use NN for classification