Review (Rong Jin)
Comparison of Different Classification Models
The goal of all classifiers: predict the class label y for an input x, i.e., estimate p(y|x).
K Nearest Neighbor (kNN) Approach
[Figure: example neighborhoods for k = 1 and k = 4.]
Probability interpretation: estimate p(y|x) as
$$p(y|x) = \frac{\left|\{x_i \mid x_i \in N(x),\ y_i = y\}\right|}{|N(x)|},$$
where N(x) is the neighborhood around x.
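A minimal sketch of this estimate (illustrative, not from the slides); the array names X_train, y_train and the default k = 4 are assumptions:

```python
import numpy as np

def knn_posterior(x, X_train, y_train, k=4):
    """Estimate p(y|x) as the fraction of the k nearest neighbors carrying label y."""
    dists = np.linalg.norm(X_train - x, axis=1)      # distance from x to every training point
    neighbors = np.argsort(dists)[:k]                # indices of the k nearest points, i.e. N(x)
    labels, counts = np.unique(y_train[neighbors], return_counts=True)
    return dict(zip(labels, counts / k))             # p(y|x) for each label observed in N(x)
```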
K Nearest Neighbor Approach (KNN)
What is the appropriate size for the neighborhood N(x)?
Leave-one-out approach
Weighted K nearest neighbor: the neighborhood is defined through a weight function
$$w(x_i) = \exp\left(-\frac{\|x - x_i\|^2}{2\sigma^2}\right)$$
Estimate p(y|x) as
$$p(y|x) = \frac{\sum_i w(x_i)\,\delta(y_i, y)}{\sum_i w(x_i)}$$
How to estimate the appropriate value for σ²?
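A minimal sketch of the Gaussian-weighted estimate (illustrative; X_train, y_train, and the default sigma2 are assumptions):

```python
import numpy as np

def weighted_knn_posterior(x, X_train, y_train, sigma2=1.0):
    """p(y|x) = sum_i w(x_i) delta(y_i, y) / sum_i w(x_i), with Gaussian weights."""
    w = np.exp(-np.sum((X_train - x) ** 2, axis=1) / (2.0 * sigma2))
    return {label: w[y_train == label].sum() / w.sum() for label in np.unique(y_train)}
```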
Weighted K Nearest Neighbor
Leave one out + maximum likelihood.
Estimate the leave-one-out probability (with the weight centered at x_i, so that w(x_i) = 1):
$$p(y_i|x_i) = \frac{\sum_{j \ne i} w(x_j)\,\delta(y_j, y_i)}{\sum_{j \ne i} w(x_j)} = \frac{\sum_{j} w(x_j)\,\delta(y_j, y_i) - 1}{\sum_{j} w(x_j) - 1}$$
Leave-one-out likelihood of the training data:
$$l_{\mathrm{LOO}} = \sum_{i=1}^{n} \log p(y_i|x_i) = \sum_{i=1}^{n} \log \frac{\sum_{j} w(x_j)\,\delta(y_j, y_i) - 1}{\sum_{j} w(x_j) - 1}$$
Search for the optimal σ² by maximizing the leave-one-out likelihood.
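A minimal sketch of the leave-one-out search for σ² (illustrative; the candidate grid and the small epsilon guard are assumptions):

```python
import numpy as np

def loo_log_likelihood(X, y, sigma2):
    """Leave-one-out log-likelihood l_LOO(sigma^2) of the training data."""
    ll = 0.0
    for i in range(len(y)):
        w = np.exp(-np.sum((X - X[i]) ** 2, axis=1) / (2.0 * sigma2))
        w[i] = 0.0                                   # leave example i out
        ll += np.log((w[y == y[i]].sum() + 1e-12) / (w.sum() + 1e-12))
    return ll

def best_sigma2(X, y, grid=(0.01, 0.1, 1.0, 10.0)):
    """Pick sigma^2 from a candidate grid by maximizing the leave-one-out likelihood."""
    return max(grid, key=lambda s2: loo_log_likelihood(X, y, s2))
```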
Gaussian Generative Model
p(y|x) ∝ p(x|y) p(y): posterior = likelihood × prior.
Estimate p(x|y) and p(y).
Allocate a separate set of parameters for each class: θ = {θ_1, θ_2, …, θ_c}, so that p(x|y; θ) = p(x; θ_y).
$$p(x|y) = \frac{1}{\sqrt{2\pi}\,\sigma_y}\exp\left(-\frac{(x-\mu_y)^2}{2\sigma_y^2}\right)$$
Maximum likelihood estimation:
$$l = \sum_{i=1}^{N} \log p(x_i|y_i)\,p_{y_i} = \sum_{i=1}^{N}\left[-\frac{(x_i-\mu_{y_i})^2}{2\sigma_{y_i}^2} - \frac{1}{2}\log 2\pi - \log \sigma_{y_i} + \log p_{y_i}\right]$$
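A minimal sketch of the per-class maximum likelihood estimates for a scalar feature (illustrative; function and field names are assumptions):

```python
import numpy as np

def fit_gaussian_generative(x, y):
    """MLE of (mu_y, sigma_y^2) and the prior p_y for each class, from a scalar feature x."""
    return {c: {"mu": x[y == c].mean(),
                "var": x[y == c].var(),
                "prior": np.mean(y == c)} for c in np.unique(y)}

def posterior(x_new, params):
    """p(y|x) obtained by normalizing p(x|y) p(y) over the classes."""
    score = {c: p["prior"] * np.exp(-(x_new - p["mu"]) ** 2 / (2 * p["var"]))
                / np.sqrt(2 * np.pi * p["var"]) for c, p in params.items()}
    z = sum(score.values())
    return {c: s / z for c, s in score.items()}
```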
Gaussian Generative Model
Difficult to estimate p(x|y) if x is of high dimensionality.
Naïve Bayes:
$$p(x|y;\theta) = p(x^1|y;\theta)\,p(x^2|y;\theta)\cdots p(x^d|y;\theta)$$
Essentially a linear model.
How to make a Gaussian generative model discriminative? The (μ, σ) of each class are estimated only from the data belonging to that class, so the model lacks discriminative power.
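A minimal sketch of the Naïve Bayes log-posterior under the per-feature Gaussian assumption (illustrative; the array names mus, vars_, priors are assumptions):

```python
import numpy as np

def naive_bayes_log_posterior(x, mus, vars_, priors):
    """log p(y|x) up to a constant, with p(x|y) factored over the d features.
    mus and vars_ are (C, d) per-class, per-feature Gaussian parameters; priors has length C."""
    log_lik = -0.5 * np.sum(np.log(2 * np.pi * vars_) + (x - mus) ** 2 / vars_, axis=1)
    return np.log(priors) + log_lik      # argmax over the classes gives the prediction
```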
Gaussian Generative Model
Maximum likelihood estimation of the conditional model:
$$p(y|x) = \frac{p(x|y)\,p(y)}{\sum_{y'=1}^{c} p(x|y')\,p(y')} = \frac{\frac{p_y}{\sqrt{2\pi}\,\sigma_y}\exp\left(-\frac{(x-\mu_y)^2}{2\sigma_y^2}\right)}{\sum_{y'=1}^{c}\frac{p_{y'}}{\sqrt{2\pi}\,\sigma_{y'}}\exp\left(-\frac{(x-\mu_{y'})^2}{2\sigma_{y'}^2}\right)}$$
$$l = \sum_{i=1}^{N}\log p(y_i|x_i) = \sum_{i=1}^{N}\left[-\frac{1}{2}\log 2\pi - \log\sigma_{y_i} + \log p_{y_i} - \frac{(x_i-\mu_{y_i})^2}{2\sigma_{y_i}^2} - \log\sum_{y'=1}^{c}\frac{p_{y'}}{\sqrt{2\pi}\,\sigma_{y'}}\exp\left(-\frac{(x_i-\mu_{y'})^2}{2\sigma_{y'}^2}\right)\right]$$
How to optimize this objective function?
Gaussian Generative Model
Bound optimization algorithm.
θ = {p_1, …, p_c, μ_1, …, μ_c, σ_1, …, σ_c}: parameters of the current iteration.
θ' = {p'_1, …, p'_c, μ'_1, …, μ'_c, σ'_1, …, σ'_c}: parameters of the last iteration.
$$l(\theta) - l(\theta') = \sum_{i=1}^{N}\left[\log\frac{p_{y_i}\,\sigma'_{y_i}}{p'_{y_i}\,\sigma_{y_i}} + \frac{(x_i-\mu'_{y_i})^2}{2{\sigma'_{y_i}}^{2}} - \frac{(x_i-\mu_{y_i})^2}{2\sigma_{y_i}^2} - \log\frac{\sum_{y'=1}^{c}\frac{p_{y'}}{\sigma_{y'}}\exp\left(-\frac{(x_i-\mu_{y'})^2}{2\sigma_{y'}^2}\right)}{\sum_{y'=1}^{c}\frac{p'_{y'}}{\sigma'_{y'}}\exp\left(-\frac{(x_i-\mu'_{y'})^2}{2{\sigma'_{y'}}^{2}}\right)}\right]$$
Gaussian Generative Model
Using log x ≤ x − 1 on the last term,
$$l(\theta) - l(\theta') \ge \sum_{i=1}^{N}\left[\log\frac{p_{y_i}\,\sigma'_{y_i}}{p'_{y_i}\,\sigma_{y_i}} + \frac{(x_i-\mu'_{y_i})^2}{2{\sigma'_{y_i}}^{2}} - \frac{(x_i-\mu_{y_i})^2}{2\sigma_{y_i}^2} + 1 - \frac{\sum_{y'=1}^{c}\frac{p_{y'}}{\sigma_{y'}}\exp\left(-\frac{(x_i-\mu_{y'})^2}{2\sigma_{y'}^2}\right)}{\sum_{y'=1}^{c}\frac{p'_{y'}}{\sigma'_{y'}}\exp\left(-\frac{(x_i-\mu'_{y'})^2}{2{\sigma'_{y'}}^{2}}\right)}\right]$$
In this bound the interaction of parameters between different classes has been decomposed: each class's parameters appear in a separate additive term, since the denominator depends only on the last iteration's parameters.
Question: how to handle x with multiple features?
Logistic Regression Model
A linear decision boundary: w·x + b
$$w\cdot x + b \ge 0:\ \text{positive}, \qquad w\cdot x + b < 0:\ \text{negative}$$
A probabilistic model p(y|x), for y ∈ {+1, −1}:
$$p(y|x) = \frac{1}{1 + \exp\left(-y\,(w\cdot x + b)\right)}$$
Maximum likelihood approach for estimating the weights w and the threshold b:
$$l(D_{\mathrm{train}}) = \sum_{i=1}^{N_+}\log p(+|x_i) + \sum_{i=1}^{N_-}\log p(-|x_i) = \sum_{i=1}^{N_+}\log\frac{1}{1+\exp\left(-(w\cdot x_i + b)\right)} + \sum_{i=1}^{N_-}\log\frac{1}{1+\exp\left(w\cdot x_i + b\right)}$$
Logistic Regression Model
Overfitting issue. Example: text classification, where a word that appears in only one document would be assigned an infinitely large weight.
Solution: regularization.
$$l_{\mathrm{reg}}(D_{\mathrm{train}}) = \sum_{i=1}^{N_+}\log\frac{1}{1+\exp\left(-(w\cdot x_i + b)\right)} + \sum_{i=1}^{N_-}\log\frac{1}{1+\exp\left(w\cdot x_i + b\right)} - s\sum_{j=1}^{m} w_j^2$$
The last term is the regularization term.
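A minimal sketch of fitting this regularized objective by gradient ascent (illustrative; the learning rate, iteration count, and default s are assumptions, and labels are taken in {−1, +1}):

```python
import numpy as np

def train_logistic_regression(X, y, s=0.1, lr=0.01, iters=1000):
    """Maximize sum_i log sigma(y_i (w.x_i + b)) - s * ||w||^2 by gradient ascent."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)
        g = y * (1.0 - 1.0 / (1.0 + np.exp(-margins)))   # d/d(margin) of the log-likelihood
        w += lr * (X.T @ g - 2.0 * s * w)                # likelihood gradient minus regularizer gradient
        b += lr * g.sum()
    return w, b
```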
Kernelized logistic regression model:
$$x \to \phi(x), \qquad w \to \phi(w), \qquad w\cdot x \to K(w, x)$$
$$w = \sum_{i=1}^{N}\alpha_i x_i \;\Rightarrow\; K(w, x) = \sum_{i=1}^{N}\alpha_i K(x_i, x)$$
$$p(y|x) = \frac{1}{1+\exp\left(-y\,K(w,x)\right)} = \frac{1}{1+\exp\left(-y\sum_{i=1}^{N}\alpha_i K(x_i, x)\right)}$$
$$l_{\mathrm{reg}} = \sum_{j=1}^{N}\log\frac{1}{1+\exp\left(-y_j\sum_{i=1}^{N}\alpha_i K(x_i, x_j)\right)} - c\sum_{i,j=1}^{N}\alpha_i\alpha_j K(x_i, x_j)$$
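A minimal sketch of prediction with the kernelized model (illustrative; the RBF kernel choice and gamma value are assumptions, and alpha is assumed to have been learned from the regularized objective above):

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """K(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_lr_predict(x, X_train, alpha, gamma=1.0):
    """p(y=+1|x) = 1 / (1 + exp(-sum_i alpha_i K(x_i, x)))."""
    k = np.array([rbf_kernel(xi, x, gamma) for xi in X_train])
    return 1.0 / (1.0 + np.exp(-(alpha @ k)))
```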
Non-linear Logistic Regression Model
Hierarchical Mixture Expert Model: group linear classifiers into a tree structure.
[Figure: a two-layer tree with root gate r(x), group-layer gates g1(x) and g2(x), and expert-layer models m1,1(x), m1,2(x), m2,1(x), m2,2(x).]
$$p(y|x) = r(+1|x)\left[g_1(+1|x)\,m_{1,1}(y|x) + g_1(-1|x)\,m_{1,2}(y|x)\right] + r(-1|x)\left[g_2(+1|x)\,m_{2,1}(y|x) + g_2(-1|x)\,m_{2,2}(y|x)\right]$$
The products of gating and expert probabilities generate nonlinearity in the prediction function.
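A minimal sketch of this two-level mixture prediction (illustrative; all gating and expert models are assumed to be sigmoids with given (w, b) pairs, and the dictionary layout is an assumption):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hme_predict(x, r, g, m):
    """p(y=+1|x) for the two-level hierarchical mixture of experts.
    r and g[1], g[2] are (w, b) pairs for the gating sigmoids;
    m[(i, j)] is the (w, b) pair of expert m_{i,j}."""
    p = lambda wb: sigmoid(np.dot(wb[0], x) + wb[1])
    group1 = p(g[1]) * p(m[(1, 1)]) + (1 - p(g[1])) * p(m[(1, 2)])
    group2 = p(g[2]) * p(m[(2, 1)]) + (1 - p(g[2])) * p(m[(2, 2)])
    return p(r) * group1 + (1 - p(r)) * group2
```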
Assuming that all data points can be fitted by a single linear model may be too rough, but it is usually appropriate to assume a local linear model.
KNN can be viewed as a localized model without any parameters. Can we extend the KNN approach by introducing a localized linear model?
Localized Logistic Regression Model
Similar to the weighted KNN: weigh each training example by
$$w(x_i) = \exp\left(-\frac{\|x - x_i\|^2}{2\sigma^2}\right)$$
Build a logistic regression model using the weighted examples, turning the regularized objective
$$l_{\mathrm{reg}} = \sum_{i=1}^{N}\log\frac{1}{1+\exp\left(-y_i(w\cdot x_i + b)\right)} - c\|w\|^2$$
into the localized objective
$$l_{\mathrm{reg}}(x) = \sum_{i=1}^{N} w(x_i)\log\frac{1}{1+\exp\left(-y_i(w\cdot x_i + b)\right)} - c\|w\|^2$$
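A minimal sketch of fitting the localized objective around a query point (illustrative; sigma2, c, the learning rate, and the iteration count are assumptions, and labels are in {−1, +1}):

```python
import numpy as np

def localized_logistic_regression(x_query, X, y, sigma2=1.0, c=0.1, lr=0.01, iters=500):
    """Gradient ascent on sum_i w(x_i) log sigma(y_i (w.x_i + b)) - c ||w||^2,
    with Gaussian example weights centered at the query point."""
    wts = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2.0 * sigma2))
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        g = wts * y * (1.0 - 1.0 / (1.0 + np.exp(-y * (X @ w + b))))
        w += lr * (X.T @ g - 2.0 * c * w)
        b += lr * g.sum()
    return w, b
```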
Conditional Exponential Model
An extension of the logistic regression model to the multi-class case: a different set of weights w_y and threshold b_y for each class y.
$$p(y|x;\theta) = \frac{1}{Z(x)}\exp\left(b_y + x\cdot w_y\right), \qquad Z(x) = \sum_{y}\exp\left(b_y + x\cdot w_y\right)$$
Translation invariance: fix w_1 = 0, b_1 = 0.
Iterative scaling methods for optimization
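A minimal sketch of evaluating this model (illustrative; W and b are assumed parameter arrays, and the max-subtraction is only for numerical stability):

```python
import numpy as np

def conditional_exponential_posterior(x, W, b):
    """p(y|x) = exp(b_y + x.w_y) / Z(x); W is (C, d) and b has length C.
    Translation invariance allows fixing W[0] = 0 and b[0] = 0."""
    scores = b + W @ x
    scores -= scores.max()          # subtract a constant; p(y|x) is unchanged
    e = np.exp(scores)
    return e / e.sum()
```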
Maximum Entropy Model
Finding the simplest model that matches the data:
$$\max_{p(y|x)} \frac{1}{N}\sum_{i=1}^{N} H(y|x_i) = \max_{p}\ -\frac{1}{N}\sum_{i=1}^{N}\sum_{y} p(y|x_i)\log p(y|x_i)$$
subject to
$$\frac{1}{N}\sum_{i=1}^{N} p(y|x_i)\,x_i = \frac{1}{N}\sum_{i=1}^{N} x_i\,\delta(y_i, y)\ \ \text{for every } y, \qquad \sum_{y} p(y|x_i) = 1.$$
Maximizing the entropy prefers a uniform distribution; the constraints enforce the model to be consistent with the observed data.
Classification Margin
Support Vector Machine
Classification margin. Maximum margin principle: separate the data far away from the decision boundary w·x + b = 0.
Two objectives: minimize the classification error over the training data, and maximize the classification margin.
Support vectors: only the support vectors have an impact on the location of the decision boundary.
[Figure: points labeled +1 and −1 separated by the boundary w·x + b = 0, with the support vectors marked on the margin.]
Support Vector Machine
Separable case:
$$\{w^*, b^*\} = \arg\min_{w,b}\ \|w\|^2 \quad \text{subject to} \quad y_1(w\cdot x_1 + b) \ge 1,\ \ y_2(w\cdot x_2 + b) \ge 1,\ \ldots,\ y_N(w\cdot x_N + b) \ge 1$$
Noisy case:
$$\{w^*, b^*\} = \arg\min_{w,b}\ \|w\|^2 + c\sum_{i=1}^{N}\xi_i \quad \text{subject to} \quad y_i(w\cdot x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ \ i = 1,\ldots,N$$
Quadratic programming!
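In practice the QP is handed to a solver. As a minimal sketch of the same soft-margin objective (not the slides' QP method), the slacks can be eliminated to give the hinge-loss form ||w||² + c Σ_i max(0, 1 − y_i(w·x_i + b)), which can be minimized by subgradient descent; the learning rate and iteration count are illustrative, with labels in {−1, +1}:

```python
import numpy as np

def svm_subgradient(X, y, c=1.0, lr=0.001, iters=2000):
    """Minimize ||w||^2 + c * sum_i max(0, 1 - y_i (w.x_i + b)) by subgradient descent."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        viol = y * (X @ w + b) < 1                 # examples that violate the margin
        w -= lr * (2.0 * w - c * (X[viol].T @ y[viol]))
        b -= lr * (-c * y[viol].sum())
    return w, b
```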
Logistic Regression Model vs. Support Vector Machine
Logistic regression model:
$$\{w, b\}^* = \arg\min_{w,b}\ \sum_{i=1}^{N}\log\left[1 + \exp\left(-y_i(w\cdot x_i + b)\right)\right] + s\sum_{j=1}^{m} w_j^2$$
Support vector machine:
$$\{w^*, b^*\} = \arg\min_{w,b}\ c\|w\|^2 + \sum_{i=1}^{N}\xi_i \quad \text{subject to} \quad y_i(w\cdot x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0$$
The two models have identical regularization terms and differ only in the loss function used to punish mistakes.
[Figure: loss plotted against the margin w·x + b, comparing the loss function for logistic regression with the loss function for the SVM.]
Logistic Regression Model vs. Support Vector Machine
Logistic regression differs from the support vector machine only in the loss function: the SVM (hinge) loss is exactly zero once y(w·x + b) ≥ 1, while the logistic loss decays smoothly and never reaches zero.
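A minimal sketch comparing the two losses as functions of the margin (illustrative; the grid of margin values is arbitrary):

```python
import numpy as np

def logistic_loss(margin):
    """Logistic regression loss log(1 + exp(-y (w.x + b)))."""
    return np.log(1.0 + np.exp(-margin))

def hinge_loss(margin):
    """SVM loss max(0, 1 - y (w.x + b)); zero once the margin reaches 1."""
    return np.maximum(0.0, 1.0 - margin)

margins = np.linspace(-3.0, 1.0, 9)
print(np.c_[margins, logistic_loss(margins), hinge_loss(margins)])
```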
Kernel Tricks
Introducing nonlinearity into the discriminative models.
Diffusion kernel: a graph Laplacian L encodes local similarity,
$$L_{i,j} = \begin{cases} s(x_i, x_j) & i \ne j \\ -\sum_{k \ne i} s(x_i, x_k) & i = j \end{cases}$$
and the diffusion kernel propagates the local similarity information into a global one:
$$K = e^{\beta L}, \qquad \text{or equivalently} \qquad \frac{dK}{d\beta} = LK.$$
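A minimal sketch of building this kernel from a similarity matrix (illustrative; the diffusion parameter beta and the zero-diagonal assumption on S are mine):

```python
import numpy as np
from scipy.linalg import expm

def diffusion_kernel(S, beta=0.5):
    """K = exp(beta * L), with L_ij = s(x_i, x_j) for i != j and
    L_ii = -sum_{k != i} s(x_i, x_k). Assumes S is symmetric with a zero diagonal."""
    L = S - np.diag(S.sum(axis=1))     # off-diagonal similarities; negative row sums on the diagonal
    return expm(beta * L)              # matrix exponential propagates similarity globally
```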
Fisher Kernel
Derive a kernel function from a generative model. Key idea: map a point x in the original input space into the model space; the similarity of two data points is measured in the model space.
[Figure: original input space vs. model space, with x_1 → θ(x_1) and x_2 → θ(x_2); the similarity is measured in the model space.]
Kernel Methods in Generative Models
Usually, kernels can be introduced into a generative model through a Gaussian process: define a "kernelized" covariance matrix that is positive semi-definite, similar to Mercer's condition.
Multi-class SVM
SVMs can only handle two-class outputs.
One-against-all: learn N SVMs, where SVM 1 learns "Output==1" vs "Output != 1", SVM 2 learns "Output==2" vs "Output != 2", ..., and SVM N learns "Output==N" vs "Output != N".
Error Correcting Output Code (ECOC): encode each class into a bit vector.
     S1  S2  S3  S4
A     1   0   0   1
B     0   1   0   1
C     0   1   1   0
x     1   1   1   0
The input x is assigned to the class whose codeword is closest, in Hamming distance, to the vector of classifier outputs.
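A minimal sketch of ECOC decoding with the codewords above (illustrative; the function and variable names are assumptions):

```python
import numpy as np

# Codewords from the table: one row per class, one column per binary classifier S1..S4.
codes = {"A": np.array([1, 0, 0, 1]),
         "B": np.array([0, 1, 0, 1]),
         "C": np.array([0, 1, 1, 0])}

def ecoc_decode(outputs, codes):
    """Pick the class whose codeword has the smallest Hamming distance to the outputs."""
    return min(codes, key=lambda c: int(np.sum(codes[c] != outputs)))

print(ecoc_decode(np.array([1, 1, 1, 0]), codes))   # classifier outputs for x from the table
```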
Ordinal Regression
A special class of multi-class classification problems: there is a natural ordinal relationship between the classes.
Maximum margin principle: the computation of the margin involves multiple classes.
[Figure: data projected onto the direction w, partitioned into 'good', 'OK', and 'bad' regions by two thresholds.]
Ordinal Regression
$$\{w^*, b_1^*, b_2^*\} = \arg\max_{w, b_1, b_2}\ \operatorname{margin}(w, b_1, b_2) = \arg\max_{w, b_1, b_2}\ \min\left(\operatorname{margin}_1(w, b_1),\ \operatorname{margin}_2(w, b_2)\right) = \arg\max_{w, b_1, b_2}\ \min\left(\min_{x_i \in D_g \cup D_o}\frac{|x_i\cdot w - b_1|}{\|w\|},\ \min_{x_i \in D_o \cup D_b}\frac{|x_i\cdot w - b_2|}{\|w\|}\right)$$
subject to
$$x_i \in D_g:\ x_i\cdot w - b_1 \ge 0; \qquad x_i \in D_o:\ x_i\cdot w - b_1 \le 0,\ x_i\cdot w - b_2 \ge 0; \qquad x_i \in D_b:\ x_i\cdot w - b_2 \le 0.$$
Fixing the functional margin to 1, this becomes
$$\{w^*, b_1^*, b_2^*\} = \arg\min_{w, b_1, b_2}\ \|w\|^2$$
subject to
$$x_i \in D_g:\ x_i\cdot w - b_1 \ge 1; \qquad x_i \in D_o:\ x_i\cdot w - b_1 \le -1,\ x_i\cdot w - b_2 \ge 1; \qquad x_i \in D_b:\ x_i\cdot w - b_2 \le -1.$$
Decision Tree
From slides of Andrew Moore
Decision Tree
A greedy approach for generating a decision tree:
1. Choose the most informative feature, using the mutual information measure.
2. Split the data set according to the values of the selected feature.
3. Recurse until each data item is classified correctly.
Attributes with real values: quantize the real value into a discrete one.
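A minimal sketch of step 1, scoring discrete features by mutual information (illustrative; assumes X is an array of already-discretized attribute values):

```python
import numpy as np

def entropy(y):
    """Empirical entropy of the labels in y."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(X, y, feature):
    """Mutual information between a discrete feature and the label:
    H(y) - sum_v p(v) H(y | feature = v)."""
    values, counts = np.unique(X[:, feature], return_counts=True)
    cond = sum(c / len(y) * entropy(y[X[:, feature] == v]) for v, c in zip(values, counts))
    return entropy(y) - cond

def most_informative_feature(X, y):
    """Greedy step 1: pick the feature with the largest information gain."""
    return max(range(X.shape[1]), key=lambda f: information_gain(X, y, f))
```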
Decision Tree
The overfitting problem. Tree pruning: reduced-error pruning, rule post-pruning.
Generalized Decision Tree
[Figure: a decision tree with simple data partitions (axis-aligned splits on Attribute 1 and Attribute 2) compared with a decision tree that uses classifiers for data partition, where each node is a linear classifier.]