STAT/CSE 416: Intro to Machine Learning
Emily Fox, University of Washington
May 3, 2018
©2018 Emily Fox

Going nonparametric: Nearest neighbor methods for regression and classification

Locality sensitive hashing for approximate NN search… cont'd
Complexity of brute-force search

Given a query point, scan through each point:
- O(N) distance computations per 1-NN query!
- O(N log k) per k-NN query!

What if N is huge??? (and many queries)
Moving away from exact NN search

• Approximate neighbor finding…
  - Don't find the exact neighbor, but that's okay for many applications
• Focus on methods that provide good probabilistic guarantees on the approximation

Out of millions of articles, do we need the closest article or just one that's pretty similar?
Do we even fully trust our measure of similarity???
Using score for NN search

[Figure: 2D data on #awesome vs. #awful axes, split by the line 1.0 #awesome – 1.5 #awful = 0. For a query point x with Score(x) < 0, only search that side of the line; the points there are the candidate neighbors. Likewise, only search the Score(x) > 0 side for queries with Score(x) > 0.]

2D Data        Sign(Score)   Bin index
x1 = [0, 5]    -1            0
x2 = [1, 3]    -1            0
x3 = [3, 0]     1            1
…               …            …
Using score for NN search

HASH TABLE
Bin:                                  0             1
List containing indices of points:   {1,2,4,7,…}   {3,5,6,8,…}

For a query with Score(x) < 0, search for the NN amongst only the candidate set in bin 0.

(Same data table as above.)
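To make the binning concrete, here is a minimal Python sketch (NumPy assumed; the data points and line coefficients are the hypothetical ones from the slide):

```python
import numpy as np

# Hypothetical 2D data: columns are (#awesome, #awful) counts.
data = np.array([[0., 5.], [1., 3.], [3., 0.]])
v = np.array([1.0, -1.5])  # coefficients of the slide's line

def bin_index(x, v):
    """Bin 1 if Score(x) = v . x > 0, else bin 0."""
    return int(np.dot(x, v) > 0)

# Build the hash table: bin index -> list of data point indices.
table = {0: [], 1: []}
for i, x in enumerate(data):
    table[bin_index(x, v)].append(i + 1)  # 1-based, matching the slide

print(table)  # {0: [1, 2], 1: [3]}
```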
Three potential issues with the simple approach

1. Challenging to find a good line
2. Poor quality solution:
   - Points close together get split into separate bins
3. Large computational cost:
   - Bins might contain many points, so we are still searching over a large set for each NN query
How to define the line?

[Figure: the same #awesome vs. #awful plot with the line 1.0 #awesome – 1.5 #awful = 0.]

Crazy idea: Define the line randomly!
How bad can a random line be?

[Figure: two points x and y on the #awesome vs. #awful plot, separated by angle θxy. For most random lines the bins are the same; only lines falling within the angle θxy make the bins different.]

If θxy is small (x, y close), x and y are unlikely to be placed into separate bins.

Goal: If x and y are close (according to cosine similarity), we want their binned values to be the same.
Three potential issues with the simple approach

Bin:                                  0             1
List containing indices of points:   {1,2,4,7,…}   {3,5,6,8,…}

1. Challenging to find a good line
2. Poor quality solution:
   - Points close together get split into separate bins
3. Large computational cost:
   - Bins might contain many points, so we are still searching over a large set for each NN query
Improving efficiency: Reducing the number of points examined per query
Reducing search cost through more bins

[Figure: the #awesome vs. #awful plot cut by three random lines (Line 1, Line 2, Line 3), creating regions with 3-bit bin indices: [0 0 0], [0 1 0], [1 1 0], [1 1 1], …]
Using score for NN search

Bin:           [0 0 0] = 0   [0 0 1] = 1   [0 1 0] = 2   [0 1 1] = 3   [1 0 0] = 4   [1 0 1] = 5   [1 1 0] = 6   [1 1 1] = 7
Data indices:  {1,2}         --            {4,8,11}      --            --            --            {7,9,10}      {3,5,6}

For a query in bin [0 0 0], search for the NN amongst only that bin's set.

2D Data        Sign(Score1)   Bin 1 index   Sign(Score2)   Bin 2 index   Sign(Score3)   Bin 3 index
x1 = [0, 5]    -1             0             -1             0             -1             0
x2 = [1, 3]    -1             0             -1             0             -1             0
x3 = [3, 0]     1             1              1             1              1             1
…               …             …              …             …              …             …
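Extending the earlier sketch to h random lines, each point's h sign bits can be packed into a single integer bin index (a sketch; the random lines and data here are drawn by me, not given on the slide):

```python
import numpy as np

rng = np.random.default_rng(0)

def lsh_table(data, h):
    """Draw h random lines through the origin and hash points into
    2^h bins keyed by the integer value of their h-bit sign pattern."""
    planes = rng.normal(size=(h, data.shape[1]))  # one normal vector per line
    bits = (data @ planes.T > 0).astype(int)      # N x h matrix of 0/1 signs
    # Pack each row into an integer bin index, e.g. [1, 1, 0] -> 6.
    bins = bits @ (1 << np.arange(h)[::-1])
    table = {}
    for i, b in enumerate(bins):
        table.setdefault(int(b), []).append(i)
    return planes, table

planes, table = lsh_table(rng.normal(size=(12, 2)), h=3)
print(table)  # bin index -> list of data point indices
```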
Improving search quality by searching neighboring bins

(Same hash table as above.)

[Figure: the query point falls in bin [0 0 0], but is its NN there? Not necessarily.]

Even worse than before: each line can split close points. We are sacrificing accuracy for speed.
Improving search quality by searching neighboring bins

(Same hash table as above.)

[Figure: starting from the query's bin [0 0 0], the next closest bins are those that flip 1 bit of its index: [0 0 1], [0 1 0], [1 0 0].]
Improving search quality by searching neighboring bins

(Same hash table as above.)

[Figure: further bins flip 2 bits, e.g. [0 1 1], [1 0 1], [1 1 0].]
Improving search quality by searching neighboring bins

(Same hash table as above.)

The quality of the retrieved NN can only improve with searching more bins.

Algorithm: Continue searching until the computational budget is reached or the quality of the NN is good enough.
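A sketch of that bin-visiting order, enumerating bins by how many bits of the query's bin index are flipped (a maximum candidate count of my own choosing stands in for the slide's unspecified "computational budget"):

```python
from itertools import combinations

def bins_at_distance(bin_index, h, num_flips):
    """Yield every h-bit bin index that differs from bin_index in
    exactly num_flips bit positions."""
    for positions in combinations(range(h), num_flips):
        mask = 0
        for p in positions:
            mask |= 1 << p
        yield bin_index ^ mask

def candidates(query_bin, table, h, max_points):
    """Visit the query's own bin, then bins at 1 flip, 2 flips, ...,
    stopping once enough candidate points have been gathered."""
    found = []
    for flips in range(h + 1):
        for b in bins_at_distance(query_bin, h, flips):
            found.extend(table.get(b, []))
            if len(found) >= max_points:
                return found
    return found
```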
LSH recap (a kd-tree competitor data structure)

• Draw h random lines
• Compute a "score" for each point under each line and translate it to a binary index
• Use the h-bit binary vector per data point as its bin index
• Create a hash table
• For each query point x, search bin(x), then neighboring bins, until the time limit
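Gluing the two sketches together, an approximate 1-NN query might look like this (reusing lsh_table and candidates from the sketches above; the brute-force scan now runs over only the gathered candidates):

```python
import numpy as np

def approx_1nn(x, data, planes, table, max_points=50):
    """Hash the query, gather candidates from its bin and neighboring
    bins, then brute-force search only those candidates."""
    h = planes.shape[0]
    bits = (planes @ x > 0).astype(int)
    qbin = int(bits @ (1 << np.arange(h)[::-1]))
    cand = candidates(qbin, table, h, max_points)
    dists = np.linalg.norm(data[cand] - x, axis=1)
    return cand[int(np.argmin(dists))]
```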
Moving to higher dimensions d
Draw random planes

[Figure: 3D data on axes x[1] = #awesome, x[2] = #awful, x[3] = #great, split by a random plane.]

Score(x) = v1 #awesome + v2 #awful + v3 #great
Cost of binning points in d dimensions

Score(x) = v1 #awesome + v2 #awful + v3 #great

Per data point, we need d multiplies to determine the bin index per plane.

This is a one-time cost, offset if there are many queries on a fixed dataset.
Summary for retrieval using nearest neighbors and locality sensitive hashing
What you can do now…
• Implement nearest neighbor search for retrieval tasks
• Contrast document representations (e.g., raw word counts, tf-idf, …)
  - Emphasize important words using tf-idf
• Contrast methods for measuring similarity between two documents
  - Euclidean vs. weighted Euclidean
  - Cosine similarity vs. similarity via unnormalized inner product
• Describe the complexity of brute-force search
• Implement LSH for approximate nearest neighbor search
Motivating NN methods for regression: Fit globally vs. fit locally
Parametric models of f(x)

[Four figures: price ($) on the y-axis vs. sq.ft. on the x-axis, each showing a different parametric fit f(x) to the housing data.]
f(x) is not really a polynomial

[Figure: price ($) vs. sq.ft. data that no single polynomial fits well.]
What alternative do we have?

If we:
- Want to allow flexibility in f(x) having local structure
- Don't want to infer "structural breaks"

What's a simple option we have?
- Assuming we have plenty of data…
Simplest approach: Nearest neighbor regression
Fit locally to each data point

Predicted value = "closest" yi

[Figure: 1-nearest-neighbor (1-NN) regression on price ($) vs. sq.ft.; in each region the annotation reads "Here, this is the closest datapoint," and the prediction is that point's y value.]
What people do naturally…

A real estate agent assesses value by finding the sale of the most similar house.

[Figure: query house ($ = ???) next to the most similar sold house ($ = 850k).]
1-NN regression more formally

Dataset of (x, $) pairs: (x1,y1), (x2,y2), …, (xN,yN). Query point: xq

1. Find the "closest" xi in the dataset: xNN = argmin over xi of distance(xi, xq)
2. Predict: ŷq = yNN

[Figure: the same 1-NN fit as above.]
Visualizing 1-NN in multiple dimensions

Voronoi tessellation (or diagram):
- Divide the space into N regions, each containing 1 datapoint
- Defined such that any x in a region is "closest" to that region's datapoint

(We don't explicitly form the tessellation!)
Different distance metrics lead to different predictive surfaces

[Figure: 1-NN predictive surfaces under Euclidean distance vs. Manhattan distance.]
Can 1-NN be used for classification?

Yes!! Just predict the class of the neighbor.
1-NN algorithm
Performing 1-NN search

• Query house: xq
• Dataset: houses x1, …, xN
• Specify: distance metric
• Output: most similar house
1-NN algorithm

Initialize Dist2NN = ∞, closest house = Ø
For i = 1, 2, …, N:
  Compute: δ = distance(house i, query house)
  If δ < Dist2NN:
    set closest house = house i
    set Dist2NN = δ
Return the most similar house: the closest house to the query house.
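As a minimal Python sketch of this scan (Euclidean distance assumed; the two-feature house data are made up for illustration):

```python
import numpy as np

def nn_search(query, data):
    """Linear 1-NN scan: O(N) distance computations per query."""
    best_i, best_dist = None, np.inf            # Dist2NN = infinity
    for i, x in enumerate(data):
        delta = np.linalg.norm(x - query)       # distance(house i, query)
        if delta < best_dist:                   # found a closer house
            best_i, best_dist = i, delta
    return best_i, best_dist

# Hypothetical houses described by (sq.ft., #bathrooms):
data = np.array([[2640., 2.5], [1500., 1.0], [3000., 3.0]])
print(nn_search(np.array([2641., 2.5]), data))  # -> (0, 1.0)
```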
1-NN in practice

[Figure: "Nearest Neighbors Kernel (K = 1)" fit on synthetic 1D data.]

The fit looks good for data that are dense in x and have low noise.
Sensitive to regions with little data

[Figure: "Nearest Neighbors Kernel (K = 1)" fit on data with a sparse region.]

Not great at interpolating over large regions…
Also sensitive to noise in data

[Figure: "Nearest Neighbors Kernel (K = 1)" fit f(x0) on noisy data.]

Fits can look quite wild… Overfitting?
k-Nearest neighbors
Get more "comps"

The estimate is more reliable if you base it off of a larger set of comparable homes.

[Figure: query house ($ = ???) next to several comps: $ = 850k, $ = 749k, $ = 833k, $ = 901k.]
k-NN regression more formally

Dataset of (x, $) pairs: (x1,y1), (x2,y2), …, (xN,yN). Query point: xq

1. Find the k closest xi in the dataset: (xNN1, xNN2, …, xNNk)
2. Predict: ŷq = (1/k)(yNN1 + yNN2 + … + yNNk)
Performing k-NN search

• Query house: xq
• Dataset: houses x1, …, xN
• Specify: distance metric
• Output: most similar houses
k-NN algorithm

Initialize: sort the first k houses by distance to the query house,
  Dist2kNN = sort(δ1, …, δk)                 (list of sorted distances)
  houses[1:k] = sort(house 1, …, house k)    (list of sorted houses)
For i = k+1, …, N:
  Compute: δ = distance(house i, query house)
  If δ < Dist2kNN[k]:
    find j such that δ > Dist2kNN[j-1] but δ < Dist2kNN[j]
    remove the furthest house and shift the queue:
      houses[j+1:k] = houses[j:k-1]
      Dist2kNN[j+1:k] = Dist2kNN[j:k-1]
    set Dist2kNN[j] = δ and houses[j] = house i
Return the k most similar houses
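The same queue-shifting logic in a short Python sketch (Euclidean distance assumed; np.searchsorted plays the role of finding the insertion position j):

```python
import numpy as np

def knn_search(query, data, k):
    """Linear k-NN scan keeping a sorted list of the k closest so far."""
    first = [float(np.linalg.norm(x - query)) for x in data[:k]]
    order = np.argsort(first)
    knn = [int(i) for i in order]              # house indices, sorted
    dist2knn = [first[i] for i in order]       # sorted first-k distances
    for i in range(k, len(data)):
        delta = float(np.linalg.norm(data[i] - query))
        if delta < dist2knn[-1]:               # closer than current k-th NN
            j = int(np.searchsorted(dist2knn, delta))
            dist2knn.insert(j, delta); dist2knn.pop()  # shift, drop furthest
            knn.insert(j, i); knn.pop()
    return knn, dist2knn
```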
k-NN in practice

[Figure: "Nearest Neighbors Kernel (K = 30)" fit f(x0).]

Much more reasonable fit in the presence of noise, but boundary and sparse-region issues remain.
k-NN in practice

[Figure: the same K = 30 fit, highlighting its jumps.]

Discontinuities! A neighbor is either in or out.
Issues with discontinuities

Overall predictive accuracy might be okay, but…

For example, in the housing application:
- If you are a buyer or seller, this matters
- There can be a jump in the estimated value of a house going just from 2640 sq.ft. to 2641 sq.ft.
- We don't really believe this type of fit
Weighted k-nearest neighbors
Weighted k-NN

Weight more similar houses more heavily than less similar houses in the list of k-NN.

Predict:

  ŷq = (cqNN1 yNN1 + cqNN2 yNN2 + cqNN3 yNN3 + … + cqNNk yNNk) / (Σ_{j=1}^{k} cqNNj)

where the cqNNj are the weights on the NNs.
How to define weights?

Want the weight cqNNj to be small when distance(xNNj, xq) is large, and cqNNj to be large when distance(xNNj, xq) is small.
Kernel weights for d = 1

Define: cqNNj = Kernelλ(|xNNj − xq|)

[Figure: a kernel with support on [−λ, λ].]

Gaussian kernel (simple isotropic case): Kernelλ(|xi − xq|) = exp(−(xi − xq)² / λ)
Note: never exactly 0!
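Putting the weight definition and the weighted prediction together, a minimal sketch (multivariate inputs with Euclidean distance assumed):

```python
import numpy as np

def gaussian_kernel(dist, lam):
    """Kernel_lambda(dist) = exp(-dist^2 / lambda); never exactly 0."""
    return np.exp(-dist**2 / lam)

def weighted_knn_predict(query, data, y, k, lam):
    """Weighted k-NN: kernel-weighted average over the k nearest points."""
    dists = np.linalg.norm(data - query, axis=1)
    nn = np.argsort(dists)[:k]             # indices of the k-NN
    c = gaussian_kernel(dists[nn], lam)    # weights c_qNNj
    return np.sum(c * y[nn]) / np.sum(c)   # weighted average of y_NNj
```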
Kernel weights for d ≥ 1

Define: cqNNj = Kernelλ(distance(xNNj, xq))

[Figure: the same kernel shape over [−λ, λ].]
Kernel regression
Recall weighted k-NN:

  ŷq = (cqNN1 yNN1 + cqNN2 yNN2 + … + cqNNk yNNk) / (Σ_{j=1}^{k} cqNNj)
Kernel regression

Instead of just weighting the NN, weight all points.

Predict (Nadaraya-Watson kernel weighted average):

  ŷq = (Σ_{i=1}^{N} cqi yi) / (Σ_{i=1}^{N} cqi)
     = (Σ_{i=1}^{N} Kernelλ(distance(xi, xq)) · yi) / (Σ_{i=1}^{N} Kernelλ(distance(xi, xq)))

where cqi is the weight on each datapoint.
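A sketch of the Nadaraya-Watson estimator with the Epanechnikov kernel used in the plots (the 0.75 normalizing constant cancels in the ratio, but is kept for the standard form):

```python
import numpy as np

def epanechnikov(dist, lam):
    """Epanechnikov kernel: bounded support, exactly 0 beyond lambda."""
    u = dist / lam
    return np.maximum(0.0, 0.75 * (1.0 - u**2))

def kernel_regression(query, data, y, lam, kernel=epanechnikov):
    """Nadaraya-Watson: weight ALL N points, not just the k nearest.
    Returns NaN if no training point falls within the kernel's support."""
    c = kernel(np.linalg.norm(data - query, axis=1), lam)  # c_qi for all i
    return np.sum(c * y) / np.sum(c)
```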
Kernel regression in practice

[Figure: "Epanechnikov Kernel (lambda = 0.2)" fit f(x0).]

The kernel has bounded support… Only a subset of the data is needed to compute the local fit.
Choice of bandwidth λ

Often, the choice of kernel matters much less than the choice of λ.

[Figures: Epanechnikov kernel fits f(x0) with λ = 0.04, λ = 0.2, and λ = 0.4, plus a Boxcar kernel fit with λ = 0.2.]
Choosing λ (or k in k-NN)

How to choose? Same story as always… cross validation.
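For instance, a sketch of picking λ by K-fold cross validation (reusing kernel_regression from the sketch above; with a very small λ some validation points may have no training point in support, giving NaN errors):

```python
import numpy as np

def cv_choose_lambda(data, y, lams, num_folds=5, seed=0):
    """Pick the bandwidth with the lowest cross-validated squared error."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), num_folds)
    errors = []
    for lam in lams:
        sse = 0.0
        for val in folds:
            train = np.setdiff1d(np.arange(len(y)), val)
            preds = np.array([kernel_regression(data[i], data[train],
                                                y[train], lam) for i in val])
            sse += np.sum((y[val] - preds)**2)
        errors.append(sse)
    return lams[int(np.argmin(errors))]
```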
Formalizing the idea of local fits
Contrasting with global average

A globally constant fit weights all points equally (equal weight c on each datapoint):

  ŷq = (Σ_{i=1}^{N} c yi) / (Σ_{i=1}^{N} c) = (1/N) Σ_{i=1}^{N} yi

[Figure: "Boxcar Kernel (lambda = 1)" fit, which is essentially the global average.]
Contrasting with global average

Kernel regression leads to a locally constant fit: slowly add in some points and let others gradually die off.

  ŷq = (Σ_{i=1}^{N} Kernelλ(distance(xi, xq)) · yi) / (Σ_{i=1}^{N} Kernelλ(distance(xi, xq)))

[Figures: "Boxcar Kernel (lambda = 0.2)" and "Epanechnikov Kernel (lambda = 0.2)" fits.]
Local linear regression

So far, we've fit a constant function locally at each point → "locally weighted averages".

Can instead fit a line or polynomial locally at each point → "locally weighted linear regression".
Local regression rules of thumb

- A local linear fit reduces bias at the boundaries with a minimum increase in variance
- A local quadratic fit doesn't help at the boundaries and increases variance, but does help capture curvature in the interior
- With sufficient data, local polynomials of odd degree dominate those of even degree

Recommended default choice: local linear regression
Discussion on k-NN and kernel regression
Nonparametric approaches

k-NN and kernel regression are examples of nonparametric regression.

General goals of nonparametrics:
- Flexibility
- Make few assumptions about f(x)
- Complexity can grow with the number of observations N

Lots of other choices: splines, trees, locally weighted structured regression models, …
Limiting behavior of NN: Noiseless setting (εi = 0)

In the limit of getting an infinite amount of noiseless data, the MSE of the 1-NN fit goes to 0.
Limiting behavior of NN: Noiseless setting (εi = 0), cont'd

[Figure: 1-NN fit vs. quadratic fit on noiseless data.]

In the limit of infinite noiseless data, the MSE of the 1-NN fit goes to 0. This is not true for parametric models!
Error vs. amount of data

[Figure: error vs. number of data points in the training set.]
Limiting behavior of NN: Noisy data setting

In the limit of getting an infinite amount of data, the MSE of the NN fit goes to 0 if k grows, too.

[Figure: 1-NN fit vs. 200-NN fit vs. quadratic fit on noisy data.]
NN and kernel methods for large d or small N

NN and kernel methods work well when the data cover the space, but…
- the more dimensions d you have, the more points N you need to cover the space
- need N = O(exp(d)) data points for good performance

This is where parametric models become useful…
Complexity of NN search

Naïve approach: brute-force search
- Given a query point xq
- Scan through each point x1, x2, …, xN
- O(N) distance computations per 1-NN query!
- O(N log k) per k-NN query!

What if N is huge??? (and many queries) → KD-trees! Locality-sensitive hashing, etc.
k-NN for classification
Spam filtering example

Input x: text of email, sender, IP, …
Output y: spam vs. not spam
Using k-NN for classification

Space of labeled emails (not spam vs. spam), organized by similarity of text.

[Figure: a query email among its nearest labeled emails.]

Decide not spam vs. spam via majority vote of the k-NN.
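A minimal sketch of that vote, assuming emails are already embedded as feature vectors (e.g., tf-idf) with Euclidean distance:

```python
import numpy as np
from collections import Counter

def knn_classify(query, data, labels, k):
    """k-NN classification: majority vote among the k nearest points."""
    dists = np.linalg.norm(data - query, axis=1)
    nn = np.argsort(dists)[:k]                  # indices of the k-NN
    votes = Counter(labels[i] for i in nn)      # tally class labels
    return votes.most_common(1)[0][0]           # majority class
```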
Summary for nearest neighbor and kernel regression
What you can do now…
• Motivate the use of nearest neighbor (NN) regression
• Perform k-NN regression
• Analyze the computational costs of these algorithms
• Discuss the sensitivity of NN to lack of data, dimensionality, and noise
• Perform weighted k-NN and define weights using a kernel
• Define and implement kernel regression
• Describe the effect of varying the kernel bandwidth λ or the number of nearest neighbors k
• Select λ or k using cross validation
• Compare and contrast kernel regression with a global average fit
• Define what makes an approach nonparametric and why NN and kernel regression are considered nonparametric methods
• Analyze the limiting behavior of NN regression
• Use NN for classification

©2018 Emily Fox