CSCI567 Machine Learning (Fall 2014)
Drs. Sha & Liu
{feisha,yanliu.cs}@usc.edu
September 9, 2014
Outline
1 Administration
2 First learning algorithm: Nearest neighbor classifier
3 More deep understanding about NNC
4 Some practical sides of NNC
5 What we have learned
Entrance exam
All were graded
About 87% have passed: 96 students in Prof. Sha's section and 48 in Prof. Liu's section.
Those who passed the threshold were granted D-clearance; please enroll as soon as possible, by Friday noon.
In several cases, advisors have sent out inquiries; please respond by Thursday noon per their instructions.
If you have not been granted D-clearance or contacted by advisors, you can assume that you are not permitted to enroll. You may ask the TAs to check your grade only if you strongly believe you did well.
Outline
1 Administration
2 First learning algorithm: Nearest neighbor classifier
    Intuitive example
    General setup for classification
    Algorithm
3 More deep understanding about NNC
4 Some practical sides of NNC
5 What we have learned
Recognizing flowers
Types of Iris: setosa, versicolor, and virginica
Measuring the properties of the flowers
Features and attributes: the widths and lengths of sepal and petal
Pairwise scatter plots of 131 flower specimens
Visualization of data helps to identify the right learning model to use.
Each colored point is a flower specimen: setosa, versicolor, or virginica.
[Figure: scatter-plot matrix over the four features: sepal length, sepal width, petal length, and petal width]
Different types seem well-clustered and separable
Using two features: petal width and sepal length
[Figure: the same scatter-plot matrix; the petal width vs. sepal length panel shows the three types as well-separated clusters]
Labeling an unknown flower type
It is closer to the red cluster, so we label it as setosa.
[Figure: scatter plot with the unknown flower marked '?' near the red (setosa) cluster]
Multi-class classification
Classify data into one of the multiple categories
Input (feature vector): x \in \mathbb{R}^D
Output (label): y \in [C] = \{1, 2, \dots, C\}
Learning goal: y = f(x)
Special case: binary classification
Number of classes: C = 2
Labels: \{0, 1\} or \{-1, +1\}
More terminology
Training data (set)
N samples/instances: D^{train} = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}
They are used for learning f(\cdot).
Test (evaluation) data
M samples/instances: D^{test} = \{(x_1, y_1), (x_2, y_2), \dots, (x_M, y_M)\}
They are used for assessing how well f(\cdot) will do in predicting an unseen x \notin D^{train}.
Training data and test data should not overlap: D^{train} \cap D^{test} = \emptyset.
Often, data is conveniently organized as a table
Ex: Iris data (see http://en.wikipedia.org/wiki/Iris_flower_data_set for all data)
4 features, 3 classes
Nearest neighbor classification (NNC)
Nearest neighbor: x(1) = x_{nn(x)},
where nn(x) \in [N] = \{1, 2, \dots, N\}, i.e., the index of one of the training instances:
nn(x) = \arg\min_{n \in [N]} \|x - x_n\|_2^2 = \arg\min_{n \in [N]} \sum_{d=1}^{D} (x_d - x_{nd})^2
Classification rule: y = f(x) = y_{nn(x)}
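The rule above is short enough to sketch directly. Below is a minimal Python/NumPy sketch (not from the slides); the toy training rows reuse the petal-width/sepal-length values from the worked example later in this lecture.

import numpy as np

def nn_index(x, X_train):
    # Index nn(x) of the training instance nearest to x, under squared Euclidean distance
    dists = np.sum((X_train - x) ** 2, axis=1)
    return int(np.argmin(dists))

def nnc_predict(x, X_train, y_train):
    # NNC rule: y = f(x) = y_{nn(x)}
    return y_train[nn_index(x, X_train)]

# Toy usage: three Iris-style training flowers (petal width, sepal length)
X_train = np.array([[0.2, 5.1], [1.4, 7.0], [2.5, 6.7]])
y_train = np.array(["setosa", "versicolor", "virginica"])
print(nnc_predict(np.array([1.8, 6.4]), X_train, y_train))   # versicolor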
Visual example
In this 2-dimensional example, the nearest point to x is a red training instance; thus, x will be labeled as red.
[Figure (a): 2D training points in two colors, with the test point x assigned the label of its nearest neighbor]
Example: classify Iris with two features
Training data
ID (n)   petal width (x_1)   sepal length (x_2)   category (y)
1        0.2                 5.1                  setosa
2        1.4                 7.0                  versicolor
3        2.5                 6.7                  virginica
...      ...                 ...                  ...
Flower with unknown category: petal width = 1.8 and sepal length = 6.4.
Calculating the distance: distance = \sqrt{(x_1 - x_{n1})^2 + (x_2 - x_{n2})^2}
ID   distance
1    1.75
2    0.72
3    0.76
Thus, the predicted category is versicolor (the true category is virginica).
Decision boundary
For every point in the space, we can determine its label using the NNC rule. This gives rise to a decision boundary that partitions the space into different regions.
[Figure (b): the NNC decision boundary partitioning the 2D feature space into differently colored regions]
How to measure nearness with other distances?
Previously, we used the Euclidean distance:
nn(x) = \arg\min_{n \in [N]} \|x - x_n\|_2^2
We can also use alternative distances, e.g., the L1 distance (i.e., city-block distance, or Manhattan distance):
nn(x) = \arg\min_{n \in [N]} \|x - x_n\|_1 = \arg\min_{n \in [N]} \sum_{d=1}^{D} |x_d - x_{nd}|
[Figure: the green line is the Euclidean distance; the red, blue, and yellow lines are L1 distances.]
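As a quick illustration (not from the slides), the choice of metric can change which training point is nearest. This sketch assumes NumPy and uses made-up 2D points chosen so that the Euclidean and Manhattan winners differ.

import numpy as np

def nn_index_with_norm(x, X_train, ord):
    # Nearest training index under the L_ord norm (ord=2: Euclidean, ord=1: Manhattan)
    dists = np.linalg.norm(X_train - x, ord=ord, axis=1)
    return int(np.argmin(dists))

X_train = np.array([[1.4, 1.4],    # diagonal point: L2 distance ~1.98, L1 distance 2.8
                    [2.0, 0.0]])   # axis-aligned point: L2 distance 2.0, L1 distance 2.0
x = np.array([0.0, 0.0])
print(nn_index_with_norm(x, X_train, ord=2))   # 0: the diagonal point wins under L2
print(nn_index_with_norm(x, X_train, ord=1))   # 1: the axis-aligned point wins under L1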
K-nearest neighbor (KNN) classification
Increase the number of nearest neighbors to use?
1st-nearest neighbor: nn_1(x) = \arg\min_{n \in [N]} \|x - x_n\|_2^2
2nd-nearest neighbor: nn_2(x) = \arg\min_{n \in [N] \setminus \{nn_1(x)\}} \|x - x_n\|_2^2
3rd-nearest neighbor: nn_3(x) = \arg\min_{n \in [N] \setminus \{nn_1(x), nn_2(x)\}} \|x - x_n\|_2^2
The set of K nearest neighbors:
knn(x) = \{nn_1(x), nn_2(x), \dots, nn_K(x)\}
Let x(k) = x_{nn_k(x)}; then
\|x - x(1)\|_2^2 \le \|x - x(2)\|_2^2 \le \dots \le \|x - x(K)\|_2^2
How to classify with K neighbors?
Classification rule
Every neighbor votes: suppose y_n (the true label) of x_n is c; then the vote for class c is 1, and the vote for any other class c' \ne c is 0.
We use the indicator function I(y_n == c) to represent the vote.
Aggregate everyone's votes:
v_c = \sum_{n \in knn(x)} I(y_n == c), \quad \forall c \in [C]
Label with the majority:
y = f(x) = \arg\max_{c \in [C]} v_c
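A compact sketch of this voting rule (illustration only, assuming NumPy; ties are broken arbitrarily by Counter):

import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, K=3):
    # K-nearest-neighbor rule: the K nearest training points each vote for their own label
    dists = np.sum((X_train - x) ** 2, axis=1)    # squared Euclidean distances
    knn_idx = np.argsort(dists)[:K]               # indices of the K nearest neighbors
    votes = Counter(y_train[i] for i in knn_idx)  # v_c = number of votes for class c
    return votes.most_common(1)[0][0]             # y = arg max_c v_c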
Example
K = 1, Label: red
K = 3, Label: red
K = 5, Label: blue
[Figure: three panels showing the same test point classified with K = 1, 3, and 5 neighbors]
How to choose an optimal K?
[Figure: KNN decision regions on two features (x_6, x_7) for K = 1, K = 3, and K = 31]
As K increases, the decision boundary becomes smoother.
Mini-summary
Advantages of NNC
Computationally, simple and easy to implement: just compute distances.
Theoretically, has strong guarantees of doing the right thing.
Disadvantages of NNC
Computationally intensive for large-scale problems: O(ND) for labeling a single data point.
We need to carry the training data around; without it, we cannot do classification. This type of method is called nonparametric.
Choosing the right distance measure and K can be involved.
Outline
1 Administration
2 First learning algorithm: Nearest neighbor classifier
3 More deep understanding about NNC
    Measuring performance
    The ideal classifier
    Comparing NNC to the ideal classifier
4 Some practical sides of NNC
5 What we have learned
Is NNC too simple to do the right thing?
To answer this question, we proceed in 3 steps
1 We define a performance metric for a classifier/algorithm.
2 We then propose an ideal classifier.
3 We then compare our simple NNC classifier to the ideal one and show that it performs nearly as well.
How to measure performance of a classifier?
Intuition
We should compute the accuracy (the percentage of data points correctly classified) or the error rate (the percentage of data points incorrectly classified).
Two versions: which one to use?
Defined on the training data set:
A^{train} = \frac{1}{N} \sum_n I[f(x_n) == y_n], \quad \varepsilon^{train} = \frac{1}{N} \sum_n I[f(x_n) \ne y_n]
Defined on the test (evaluation) data set:
A^{test} = \frac{1}{M} \sum_m I[f(x_m) == y_m], \quad \varepsilon^{test} = \frac{1}{M} \sum_m I[f(x_m) \ne y_m]
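Both metrics are straightforward to compute. A small sketch (not from the slides), assuming NumPy and a classifier f that maps one feature vector to a predicted label:

import numpy as np

def accuracy_and_error(f, X, y):
    # Empirical accuracy A = (1/N) sum_n I[f(x_n) == y_n] and error rate eps = 1 - A
    y_pred = np.array([f(x) for x in X])
    accuracy = float(np.mean(y_pred == y))
    return accuracy, 1.0 - accuracy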
Example
Training data: what are A^{train} and \varepsilon^{train}?
[Figure: a small labeled training set]
A^{train} = 100%, \varepsilon^{train} = 0%
Test data: what are A^{test} and \varepsilon^{test}?
[Figure: a test set on the same feature space]
A^{test} = 0%, \varepsilon^{test} = 100%
Leave-one-out (LOO)
Idea
For each training instance x_n, take it out of the training set and then label it.
For NNC, x_n's nearest neighbor will not be itself, so the error rate would not necessarily become 0.
Training data: what are the LOO versions of A^{train} and \varepsilon^{train}?
[Figure: the same small training set]
A^{train} = 66.67% (i.e., 4/6), \varepsilon^{train} = 33.33% (i.e., 2/6)
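A sketch of the LOO error rate for 1-NN (illustration only, assuming NumPy): each point is labeled by its nearest neighbor among the other training points.

import numpy as np

def loo_error_1nn(X, y):
    # Leave-one-out error rate of the 1-nearest-neighbor classifier
    mistakes = 0
    for i in range(len(X)):
        dists = np.sum((X - X[i]) ** 2, axis=1).astype(float)
        dists[i] = np.inf                  # exclude x_i itself from the neighbor search
        nearest = int(np.argmin(dists))
        mistakes += int(y[nearest] != y[i])
    return mistakes / len(X)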
Drawback of the metrics
They are dataset-specific: given a different training (or test) dataset, A^{train} (or A^{test}) will change.
Thus, if we get a dataset randomly, these quantities are random:
A^{test}_{D_1}, A^{test}_{D_2}, \dots, A^{test}_{D_q}, \dots
These are called empirical accuracies (or errors).
Can we understand the algorithm itself in a more certain way, by removing the uncertainty caused by the datasets?
Expected mistakes
Setup
Assume our data (x, y) is drawn from the joint but unknown distribution p(x, y).
Classification mistake on a single data point x with ground-truth label y:
L(f(x), y) = 0 if f(x) = y, and 1 if f(x) \ne y
Expected classification mistake on a single data point x:
R(f, x) = E_{y \sim p(y|x)} L(f(x), y)
The average classification mistake made by the classifier itself:
R(f) = E_{x \sim p(x)} R(f, x) = E_{(x,y) \sim p(x,y)} L(f(x), y)
Jargons
L(f(x), y) is called the 0/1 loss function; many other forms of loss functions exist for different learning problems.
Expected conditional risk: R(f, x) = E_{y \sim p(y|x)} L(f(x), y)
Expected risk: R(f) = E_{(x,y) \sim p(x,y)} L(f(x), y)
Empirical risk: R_D(f) = \frac{1}{N} \sum_n L(f(x_n), y_n)
Obviously, this is our empirical error (rate).
Ex: binary classification
Expected conditional risk of a single data point x:
R(f, x) = E_{y \sim p(y|x)} L(f(x), y) = P(y = 1|x) I[f(x) = 0] + P(y = 0|x) I[f(x) = 1]
Let \eta(x) = P(y = 1|x); we have
R(f, x) = \eta(x) I[f(x) = 0] + (1 - \eta(x)) I[f(x) = 1]
        = 1 - \{\eta(x) I[f(x) = 1] + (1 - \eta(x)) I[f(x) = 0]\}
where the term in braces is the expected conditional accuracy.
Exercise: please verify the last equality.
Bayes optimal classifier
Consider the following classifier, using the posterior probability \eta(x) = p(y = 1|x):
f^*(x) = 1 if \eta(x) \ge 1/2, and 0 if \eta(x) < 1/2;
equivalently, f^*(x) = 1 if p(y = 1|x) \ge p(y = 0|x), and 0 if p(y = 1|x) < p(y = 0|x).
Theorem
For any labeling function f(\cdot), R(f^*, x) \le R(f, x). Similarly, R(f^*) \le R(f). Namely, f^*(\cdot) is optimal.
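A tiny numerical check of the theorem (illustration only): for any posterior value \eta(x), the Bayes decision attains the conditional risk \min\{\eta(x), 1 - \eta(x)\}, which never exceeds the risk of the opposite decision.

def conditional_risk(fx, eta_x):
    # R(f, x) = eta(x) * I[f(x) = 0] + (1 - eta(x)) * I[f(x) = 1] for binary labels
    return eta_x * (fx == 0) + (1.0 - eta_x) * (fx == 1)

def bayes_predict(eta_x):
    # Bayes optimal rule: predict 1 iff eta(x) = p(y = 1 | x) >= 1/2
    return 1 if eta_x >= 0.5 else 0

for eta_x in (0.1, 0.4, 0.7, 0.95):
    f_star = bayes_predict(eta_x)
    assert conditional_risk(f_star, eta_x) <= conditional_risk(1 - f_star, eta_x)
    print(eta_x, conditional_risk(f_star, eta_x))   # equals min(eta_x, 1 - eta_x)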
Proof (not required)
From the definition,
R(f, x) = 1 - \{\eta(x) I[f(x) = 1] + (1 - \eta(x)) I[f(x) = 0]\}
R(f^*, x) = 1 - \{\eta(x) I[f^*(x) = 1] + (1 - \eta(x)) I[f^*(x) = 0]\}
Thus,
R(f, x) - R(f^*, x) = \eta(x) \{I[f^*(x) = 1] - I[f(x) = 1]\} + (1 - \eta(x)) \{I[f^*(x) = 0] - I[f(x) = 0]\}
= \eta(x) \{I[f^*(x) = 1] - I[f(x) = 1]\} + (1 - \eta(x)) \{1 - I[f^*(x) = 1] - 1 + I[f(x) = 1]\}
= (2\eta(x) - 1) \{I[f^*(x) = 1] - I[f(x) = 1]\}
\ge 0
Bayes optimal classifier in general form
For the multi-class classification problem:
f^*(x) = \arg\max_{c \in [C]} p(y = c|x)
When C = 2, this reduces to detecting whether or not \eta(x) = p(y = 1|x) is greater than 1/2. We refer to p(y = c|x) as the posterior probability of x.
Remarks
The Bayes optimal classifier is generally not computable, as it assumes knowledge of p(x, y) or p(y|x).
However, it is useful as a conceptual tool to formalize how well a classifier can do without knowing the joint distribution.
Comparing NNC to Bayes optimal classifier
How well does our NNC do?
Theorem
For the NNC rule f_{nnc} for binary classification, we have
R(f^*) \le R(f_{nnc}) \le 2 R(f^*)(1 - R(f^*)) \le 2 R(f^*)
Namely, the expected risk of NNC is at worst twice that of the Bayes optimal classifier.
In short, NNC seems to be doing a reasonable thing.
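A Monte Carlo sanity check of this bound (illustration only, not from the slides). It assumes a made-up 1D problem with x ~ Uniform(0, 1) and posterior \eta(x) = x, for which the Bayes error is 0.25 and the asymptotic 1-NN error is 1/3; with enough samples, the measured 1-NN test error should land between the Bayes error and roughly 2 R*(1 - R*).

import numpy as np

rng = np.random.default_rng(0)
eta = lambda x: x                      # assumed posterior p(y = 1 | x) for this toy problem

def sample(n):
    x = rng.uniform(0.0, 1.0, size=n)
    y = (rng.uniform(size=n) < eta(x)).astype(int)
    return x, y

x_train, y_train = sample(1000)
x_test, y_test = sample(5000)

# Bayes optimal classifier: predict 1 iff eta(x) >= 1/2
bayes_err = np.mean((eta(x_test) >= 0.5).astype(int) != y_test)

# 1-NN classifier: copy the label of the nearest training point
nn_idx = np.abs(x_test[:, None] - x_train[None, :]).argmin(axis=1)
nnc_err = np.mean(y_train[nn_idx] != y_test)

print(f"Bayes error ~ {bayes_err:.3f}, 1-NN error ~ {nnc_err:.3f}, "
      f"2 R*(1 - R*) ~ {2 * bayes_err * (1 - bayes_err):.3f}")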
Proof sketches (not required)
Step 1: Show that when N \to +\infty, x and its nearest neighbor x(1) are similar; in particular, p(y = 1|x) and p(y = 1|x(1)) are the same.
Step 2: Show that the mistake made by our classifier on x is attributed to the probability that x and x(1) have different labels.
Step 3: Show that the probability of having different labels is
R(f_{nnc}, x) = 2 p(y = 1|x)[1 - p(y = 1|x)] = 2\eta(x)(1 - \eta(x))
Step 4: Show that the expected risk of the Bayes optimal classifier on x is
R(f^*, x) = \min\{\eta(x), 1 - \eta(x)\}
Step 5: Tie everything together:
R(f_{nnc}, x) = 2 R(f^*, x)(1 - R(f^*, x))
We are almost there, but one more step is needed (and omitted here).
Mini-summary
Advantages of NNC
Computationally, simple and easy to implement: just compute distances.
Theoretically, has strong guarantees of doing the right thing.
Disadvantages of NNC
Computationally intensive for large-scale problems: O(ND) for labeling a single data point.
We need to carry the training data around; without it, we cannot do classification. This type of method is called nonparametric.
Choosing the right distance measure and K can be involved.
Outline
1 Administration
2 First learning algorithm: Nearest neighbor classifier
3 More deep understanding about NNC
4 Some practical sides of NNC
    How to tune to get the best out of it?
    Preprocessing data
5 What we have learned
Hyperparameters in NNC
Two practical issues about NNC
Choosing K, i.e., the number of nearest neighbors (default is 1).
Choosing the right distance measure (default is the Euclidean distance), for example, from the following generalized distance measure:
\|x - x_n\|_p = \left( \sum_d |x_d - x_{nd}|^p \right)^{1/p}  for p \ge 1
These are not specified by the algorithm itself; resolving them requires empirical studies and is task/dataset-specific.
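For reference, the generalized distance is one line of NumPy (a sketch with made-up inputs):

import numpy as np

def lp_distance(x, z, p=2.0):
    # ||x - z||_p = (sum_d |x_d - z_d|^p)^(1/p), for p >= 1
    return float(np.sum(np.abs(x - z) ** p) ** (1.0 / p))

x, z = np.array([1.8, 6.4]), np.array([1.4, 7.0])
print(lp_distance(x, z, p=1), lp_distance(x, z, p=2))   # Manhattan and Euclidean distances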
Tuning by using a validation dataset
Training data (set)
N samples/instances: D^{train} = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}
They are used for learning f(\cdot).
Test (evaluation) data
M samples/instances: D^{test} = \{(x_1, y_1), (x_2, y_2), \dots, (x_M, y_M)\}
They are used for assessing how well f(\cdot) will do in predicting an unseen x \notin D^{train}.
Development (or validation) data
L samples/instances: D^{dev} = \{(x_1, y_1), (x_2, y_2), \dots, (x_L, y_L)\}
They are used to optimize hyperparameter(s).
Training data, validation data, and test data should not overlap!
Recipe
For each possible value of the hyperparameter (say K = 1, 3, \dots, 100):
    Train a model using D^{train}
    Evaluate the performance of the model on D^{dev}
Choose the model with the best performance on D^{dev}
Evaluate that model on D^{test}
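A sketch of this recipe for choosing K (illustration only; it reuses the hypothetical knn_predict sketch from the KNN slide and assumes NumPy arrays):

import numpy as np

def tune_k_on_dev(X_train, y_train, X_dev, y_dev, ks=(1, 3, 5, 7, 9)):
    # Return the K with the best accuracy on the development set
    def dev_accuracy(K):
        preds = np.array([knn_predict(x, X_train, y_train, K) for x in X_dev])
        return np.mean(preds == y_dev)
    return max(ks, key=dev_accuracy)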
Cross-validation
What if we do not have validation data?
We split the training data into S equal parts.
We use each part in turn as a validation dataset and use the others as a training dataset.
We choose the hyperparameter such that, on average, the model performs the best.
S = 5: 5-fold cross-validation.
Special case: when S = N, this is leave-one-out.
Recipe
Split the training data into S equal parts; denote each part as D^{train}_s.
For each possible value of the hyperparameter (say K = 1, 3, \dots, 100):
    For every s \in [1, S]:
        Train a model using D^{train}_{\setminus s} = D^{train} - D^{train}_s
        Evaluate the performance of the model on D^{train}_s
    Average the S performance metrics
Choose the hyperparameter corresponding to the best averaged performance
Use the best hyperparameter to train a model using all of D^{train}
Evaluate the model on D^{test}
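A sketch of S-fold cross-validation for the same purpose (illustration only; it again reuses the hypothetical knn_predict sketch and assumes NumPy arrays):

import numpy as np

def cross_validate_k(X, y, ks=(1, 3, 5, 7, 9), S=5, seed=0):
    # Choose K by averaging validation accuracy over S folds of the training data
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), S)   # S roughly equal index blocks
    def avg_accuracy(K):
        accs = []
        for s in range(S):
            val = folds[s]
            trn = np.concatenate([folds[t] for t in range(S) if t != s])
            preds = np.array([knn_predict(x, X[trn], y[trn], K) for x in X[val]])
            accs.append(np.mean(preds == y[val]))
        return np.mean(accs)
    return max(ks, key=avg_accuracy)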
Yet another practical issue with NNC
Distances depend on the units of the features! (Show how proximity can change due to a change in feature scale; drawn on screen or blackboard.)
Preprocess data
Normalize the data so that it looks as if drawn from a normal distribution.
Compute the mean and standard deviation of each feature:
\bar{x}_d = \frac{1}{N} \sum_n x_{nd}, \quad s_d^2 = \frac{1}{N-1} \sum_n (x_{nd} - \bar{x}_d)^2
Scale the feature accordingly:
x_{nd} \leftarrow \frac{x_{nd} - \bar{x}_d}{s_d}
There are many other ways of normalizing data; you would need/want to try different ones and pick among them using (cross-)validation.
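A sketch of this standardization (illustration only, assuming NumPy). One common convention, not stated on the slide, is to compute the statistics on the training set only and reuse them for the test set:

import numpy as np

def standardize(X_train, X_test):
    # Scale each feature to zero mean and unit variance using training-set statistics
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0, ddof=1)    # ddof=1 matches the 1/(N-1) variance above
    return (X_train - mean) / std, (X_test - mean) / std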
Outline
1 Administration
2 First learning algorithm: Nearest neighbor classifier
3 More deep understanding about NNC
4 Some practical sides of NNC
5 What we have learned
Summary so far
Described a simple learning algorithm.
Used intensively in practical applications; you will get a taste of it in your homework.
Discussed a few practical aspects, such as tuning hyperparameters with (cross-)validation.
Briefly studied its theoretical properties.
Concepts: loss function, risks, Bayes optimal classifier.
Theoretical guarantees: explaining why NNC would work.