Linear Models for Classification
Berkay Topçu
Linear Models for Classification
Goal: take an input vector $\mathbf{x}$ and assign it to one of K classes $C_k$, where $k = 1, \ldots, K$.
Linear separation of classes.
Generalized Linear Models
We wish to predict discrete class labels, or more generally class posterior probabilities $p(C_k \mid \mathbf{x})$, which lie in the range (0, 1).
Classification model as a linear function of the parameters: $y(\mathbf{x}) = f(\mathbf{w}^T\mathbf{x} + w_0)$
Classification directly in the original input space $\mathbf{x}$, or in a fixed nonlinear transformation of the input variables using a vector of basis functions $\boldsymbol{\phi}(\mathbf{x})$.
Discriminant Functions
Linear discriminants: $y(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0$
If $y(\mathbf{x}) \ge 0$, assign $\mathbf{x}$ to class $C_1$, and to class $C_2$ otherwise.
The decision boundary is given by $y(\mathbf{x}) = 0$; $\mathbf{w}$ determines the orientation of the decision surface and $w_0$ determines its location.
Compact notation: $y(\mathbf{x}) = \tilde{\mathbf{w}}^T\tilde{\mathbf{x}}$, with $\tilde{\mathbf{w}} = (w_0, \mathbf{w})$ and $\tilde{\mathbf{x}} = (1, \mathbf{x})$.
Multiple Classes
K-class discriminant (K > 2) by combining a number of two-class discriminant functions:
One-versus-the-rest: separating points in one particular class $C_k$ from points not in that class.
One-versus-one: K(K-1)/2 binary discriminant functions.
Multiple Classes
A single K-class discriminant comprising K linear functions: $y_k(\mathbf{x}) = \mathbf{w}_k^T\mathbf{x} + w_{k0}$
Assign $\mathbf{x}$ to class $C_k$ if $y_k(\mathbf{x}) > y_j(\mathbf{x})$ for all $j \ne k$.
How do we learn the parameters of linear discriminant functions?
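Before turning to learning, a minimal NumPy sketch (not from the slides) of how a single K-class linear discriminant assigns classes by taking the largest $y_k(\mathbf{x})$; the weights here are arbitrary placeholders until a learning method fills them in.

```python
import numpy as np

def predict_classes(X, W, w0):
    """Assign each row of X to the class with the largest linear score.

    X  : (N, D) inputs
    W  : (K, D) weight vectors, one row per class
    w0 : (K,)   bias terms
    Returns an (N,) array of class indices in {0, ..., K-1}.
    """
    scores = X @ W.T + w0          # y_k(x) = w_k^T x + w_k0 for every class k
    return np.argmax(scores, axis=1)

# Toy usage with arbitrary (untrained) parameters: K = 3 classes, D = 2 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
W = rng.normal(size=(3, 2))
w0 = np.zeros(3)
print(predict_classes(X, W, w0))
```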
Least Squares for Classification
Each class $C_k$ is described by its own linear model $y_k(\mathbf{x}) = \mathbf{w}_k^T\mathbf{x} + w_{k0}$, or in vector form $\mathbf{y}(\mathbf{x}) = \tilde{\mathbf{W}}^T\tilde{\mathbf{x}}$.
Training data set $\{\mathbf{x}_n, \mathbf{t}_n\}$ for $n = 1, \ldots, N$, where $\mathbf{t}_n = (0, 0, \ldots, 1, \ldots, 0)^T$ is a 1-of-K target vector.
Matrix $\mathbf{T}$ whose $n$th row is the vector $\mathbf{t}_n^T$, and matrix $\tilde{\mathbf{X}}$ whose $n$th row is $\tilde{\mathbf{x}}_n^T$.
Least Squares for Classification
Minimizing the sum-of-squares error function: $E_D(\tilde{\mathbf{W}}) = \dfrac{1}{2}\operatorname{Tr}\{(\tilde{\mathbf{X}}\tilde{\mathbf{W}} - \mathbf{T})^T(\tilde{\mathbf{X}}\tilde{\mathbf{W}} - \mathbf{T})\}$
Solution: $\tilde{\mathbf{W}} = (\tilde{\mathbf{X}}^T\tilde{\mathbf{X}})^{-1}\tilde{\mathbf{X}}^T\mathbf{T} = \tilde{\mathbf{X}}^\dagger\mathbf{T}$
Discriminant function: $\mathbf{y}(\mathbf{x}) = \tilde{\mathbf{W}}^T\tilde{\mathbf{x}} = \mathbf{T}^T(\tilde{\mathbf{X}}^\dagger)^T\tilde{\mathbf{x}}$
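A minimal NumPy sketch of this solution using the pseudo-inverse of the augmented design matrix; it assumes one-hot target rows in T, and the function names are illustrative.

```python
import numpy as np

def least_squares_classifier(X, T):
    """Fit W_tilde = pinv(X_tilde) @ T for 1-of-K targets.

    X : (N, D) inputs, T : (N, K) one-hot target matrix.
    Returns W_tilde of shape (D+1, K); row 0 holds the biases.
    """
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend x0 = 1
    W_tilde = np.linalg.pinv(X_tilde) @ T                 # (X~^T X~)^-1 X~^T T
    return W_tilde

def predict(X, W_tilde):
    """Assign each input to the class with the largest y_k(x)."""
    X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(X_tilde @ W_tilde, axis=1)
```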
Fisher’s Linear Discriminant
Dimensionality reduction: take the D-dimensional input vector $\mathbf{x}$ and project it to one dimension.
We seek the projection that maximizes class separation.
Two-class problem: $N_1$ points of $C_1$ and $N_2$ points of $C_2$.
Fisher's idea: large separation between the projected class means, and small variance within each class, minimizing class overlap.
Projection: $y = \mathbf{w}^T\mathbf{x}$
Class means: $\mathbf{m}_1 = \dfrac{1}{N_1}\sum_{n \in C_1}\mathbf{x}_n, \quad \mathbf{m}_2 = \dfrac{1}{N_2}\sum_{n \in C_2}\mathbf{x}_n$
Separation of the projected class means: $m_2 - m_1 = \mathbf{w}^T(\mathbf{m}_2 - \mathbf{m}_1)$, where $m_k = \mathbf{w}^T\mathbf{m}_k$
Fisher’s Linear Discriminant
The Fisher criterion:
$J(\mathbf{w}) = \dfrac{\mathbf{w}^T\mathbf{S}_B\mathbf{w}}{\mathbf{w}^T\mathbf{S}_W\mathbf{w}}$
Between-class covariance: $\mathbf{S}_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^T$
Within-class covariance: $\mathbf{S}_W = \sum_{n \in C_1}(\mathbf{x}_n - \mathbf{m}_1)(\mathbf{x}_n - \mathbf{m}_1)^T + \sum_{n \in C_2}(\mathbf{x}_n - \mathbf{m}_2)(\mathbf{x}_n - \mathbf{m}_2)^T$
Maximizing $J(\mathbf{w})$ gives $\mathbf{w} \propto \mathbf{S}_W^{-1}(\mathbf{m}_2 - \mathbf{m}_1)$
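A minimal NumPy sketch of this two-class Fisher solution; the returned direction is normalized, since only its orientation matters.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Two-class Fisher direction w ~ S_W^{-1} (m2 - m1).

    X1 : (N1, D) samples of class C1, X2 : (N2, D) samples of class C2.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum of the two classes' scatter matrices.
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)      # proportional to S_W^{-1} (m2 - m1)
    return w / np.linalg.norm(w)
```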
Fisher’s Linear Discriminant
For the two-class problem, the Fisher criterion is a special case of least squares (reference: Penalized Discriminant Analysis – Hastie, Buja and Tibshirani).
For multiple classes: $\mathbf{y} = \mathbf{W}^T\mathbf{x}$
Within-class covariance: $\mathbf{S}_W = \sum_{k=1}^{K}\mathbf{S}_k, \quad \mathbf{S}_k = \sum_{n \in C_k}(\mathbf{x}_n - \mathbf{m}_k)(\mathbf{x}_n - \mathbf{m}_k)^T$
Between-class covariance: $\mathbf{S}_B = \sum_{k=1}^{K} N_k(\mathbf{m}_k - \mathbf{m})(\mathbf{m}_k - \mathbf{m})^T$, where $\mathbf{m} = \frac{1}{N}\sum_n \mathbf{x}_n$ is the overall mean.
One criterion to maximize: $J(\mathbf{W}) = \operatorname{Tr}\{(\mathbf{W}\mathbf{S}_W\mathbf{W}^T)^{-1}(\mathbf{W}\mathbf{S}_B\mathbf{W}^T)\}$
The weight values are determined by the eigenvectors of $\mathbf{S}_W^{-1}\mathbf{S}_B$ that correspond to the K highest eigenvalues.
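A sketch of the multi-class case under the same definitions: the projection matrix is built from the eigenvectors of $\mathbf{S}_W^{-1}\mathbf{S}_B$ with the largest eigenvalues. The function name and the use of NumPy's general eigensolver are illustrative choices.

```python
import numpy as np

def fisher_multiclass(X, labels, n_components):
    """Columns of the returned W are eigenvectors of S_W^{-1} S_B
    with the largest eigenvalues (multi-class Fisher discriminant).

    X : (N, D) inputs, labels : (N,) integer class labels.
    """
    m = X.mean(axis=0)
    D = X.shape[1]
    S_W = np.zeros((D, D))
    S_B = np.zeros((D, D))
    for k in np.unique(labels):
        Xk = X[labels == k]
        mk = Xk.mean(axis=0)
        S_W += (Xk - mk).T @ (Xk - mk)                # within-class scatter S_k
        diff = (mk - m)[:, None]
        S_B += Xk.shape[0] * (diff @ diff.T)          # N_k (m_k - m)(m_k - m)^T
    # S_W^{-1} S_B is not symmetric, so the eigensolver may return tiny
    # imaginary parts; keep the real components.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(-eigvals.real)                 # largest eigenvalues first
    return eigvecs[:, order[:n_components]].real
```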
The Perceptron Algorithm
Input vector $\mathbf{x}$ is transformed using a nonlinear transformation $\boldsymbol{\phi}(\mathbf{x})$.
Model: $y(\mathbf{x}) = f(\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}))$ with $f(a) = \begin{cases} +1, & a \ge 0 \\ -1, & a < 0 \end{cases}$ and targets $t \in \{+1, -1\}$.
Perceptron criterion: for all training samples we require $\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)t_n > 0$.
We need to minimize $E_P(\mathbf{w}) = -\sum_{n \in \mathcal{M}}\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)t_n$, where $\mathcal{M}$ is the set of misclassified samples.
The Perceptron Algorithm – Stochastic Gradient Descent
Cycle through the training patterns in turn.
If the pattern is correctly classified, the weight vector remains unchanged; else:
$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta\nabla E_P(\mathbf{w}) = \mathbf{w}^{(\tau)} + \eta\,\boldsymbol{\phi}(\mathbf{x}_n)t_n$
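A minimal sketch of this stochastic update rule, assuming the basis-transformed inputs are supplied as rows of Phi and targets are coded in {-1, +1}; the stopping rule and learning rate are illustrative.

```python
import numpy as np

def perceptron_train(Phi, t, eta=1.0, max_epochs=100):
    """Stochastic gradient descent on the perceptron criterion.

    Phi : (N, M) transformed inputs phi(x_n), t : (N,) targets in {-1, +1}.
    On each misclassified pattern: w <- w + eta * phi(x_n) * t_n.
    """
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for phi_n, t_n in zip(Phi, t):
            if np.sign(w @ phi_n) * t_n <= 0:        # misclassified (or on boundary)
                w += eta * phi_n * t_n               # perceptron update
                errors += 1
        if errors == 0:                              # converged: all patterns correct
            break
    return w
```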
Probabilistic Generative Models
Depend on simple assumptions about the distribution of the data.
Posterior for two classes:
$p(C_1 \mid \mathbf{x}) = \dfrac{p(\mathbf{x} \mid C_1)p(C_1)}{p(\mathbf{x} \mid C_1)p(C_1) + p(\mathbf{x} \mid C_2)p(C_2)} = \sigma(a), \quad a = \ln\dfrac{p(\mathbf{x} \mid C_1)p(C_1)}{p(\mathbf{x} \mid C_2)p(C_2)}$
Logistic sigmoid function $\sigma(a) = \dfrac{1}{1 + \exp(-a)}$: maps the whole real axis to the finite interval (0, 1).
Continuous Inputs - Gaussian
Assuming the class-conditional densities are Gaussian with a shared covariance matrix:
$p(\mathbf{x} \mid C_k) = \dfrac{1}{(2\pi)^{D/2}|\boldsymbol{\Sigma}|^{1/2}}\exp\left\{-\dfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_k)^T\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_k)\right\}$
Case of two classes:
$p(C_1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + w_0)$
$\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2), \quad w_0 = -\dfrac{1}{2}\boldsymbol{\mu}_1^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 + \dfrac{1}{2}\boldsymbol{\mu}_2^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2 + \ln\dfrac{p(C_1)}{p(C_2)}$
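A minimal NumPy sketch of this posterior computation, assuming the class means, shared covariance, and priors are given (their maximum likelihood estimation follows on the next slide); function and variable names are illustrative.

```python
import numpy as np

def gaussian_posterior_c1(X, mu1, mu2, Sigma, prior1, prior2):
    """p(C1 | x) = sigma(w^T x + w0) for shared-covariance Gaussian classes.

    X : (N, D) inputs; mu1, mu2 : (D,) class means; Sigma : (D, D) shared covariance.
    """
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
          + 0.5 * mu2 @ Sigma_inv @ mu2
          + np.log(prior1 / prior2))
    a = X @ w + w0
    return 1.0 / (1.0 + np.exp(-a))       # logistic sigmoid of the linear activation
```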
Maximum Likelihood Solution
Likelihood function (with $t_n = 1$ denoting class $C_1$, $t_n = 0$ denoting class $C_2$, and prior $p(C_1) = \pi$):
$p(\mathbf{t} \mid \pi, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \boldsymbol{\Sigma}) = \prod_{n=1}^{N}[\pi\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma})]^{t_n}[(1 - \pi)\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma})]^{1 - t_n}$
Maximizing the log-likelihood gives:
$\pi = \dfrac{1}{N}\sum_{n=1}^{N} t_n = \dfrac{N_1}{N_1 + N_2}$
$\boldsymbol{\mu}_1 = \dfrac{1}{N_1}\sum_{n=1}^{N} t_n\mathbf{x}_n, \quad \boldsymbol{\mu}_2 = \dfrac{1}{N_2}\sum_{n=1}^{N}(1 - t_n)\mathbf{x}_n$
$\boldsymbol{\Sigma} = \dfrac{N_1}{N}\mathbf{S}_1 + \dfrac{N_2}{N}\mathbf{S}_2, \quad \mathbf{S}_k = \dfrac{1}{N_k}\sum_{n \in C_k}(\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^T$
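A minimal sketch of these maximum likelihood estimates, assuming binary targets with $t_n = 1$ for $C_1$; np.cov with bias=True is used to obtain the $1/N_k$-normalized scatter matrices $\mathbf{S}_k$.

```python
import numpy as np

def ml_estimates(X, t):
    """Maximum likelihood estimates for the two-class shared-covariance model.

    X : (N, D) inputs, t : (N,) targets with t_n = 1 for C1 and t_n = 0 for C2.
    Returns (pi, mu1, mu2, Sigma).
    """
    N = len(t)
    N1 = t.sum()
    N2 = N - N1
    pi = N1 / N                                        # prior p(C1)
    mu1 = X[t == 1].mean(axis=0)
    mu2 = X[t == 0].mean(axis=0)
    S1 = np.cov(X[t == 1].T, bias=True)                # (1/N1) sum (x - mu1)(x - mu1)^T
    S2 = np.cov(X[t == 0].T, bias=True)
    Sigma = (N1 / N) * S1 + (N2 / N) * S2              # weighted average of class scatters
    return pi, mu1, mu2, Sigma
```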
Probabilistic Discriminative Models
Probabilistic generative model: the number of parameters grows quadratically with M (# dim.).
The logistic regression model $p(C_1 \mid \boldsymbol{\phi}) = y(\boldsymbol{\phi}) = \sigma(\mathbf{w}^T\boldsymbol{\phi})$, however, has only M adjustable parameters.
Maximum likelihood solution for logistic regression:
$p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n}(1 - y_n)^{1 - t_n}, \quad y_n = \sigma(\mathbf{w}^T\boldsymbol{\phi}_n)$
Energy function (negative log-likelihood):
$E(\mathbf{w}) = -\ln p(\mathbf{t} \mid \mathbf{w}) = -\sum_{n=1}^{N}\{t_n\ln y_n + (1 - t_n)\ln(1 - y_n)\}$
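A minimal sketch of this energy function and its gradient (the gradient $\boldsymbol{\Phi}^T(\mathbf{y} - \mathbf{t})$ is the one derived on the following slides); the small clip that keeps the logarithms finite is an implementation detail, not part of the model.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def neg_log_likelihood(w, Phi, t, eps=1e-12):
    """Cross-entropy error E(w) = -sum{ t ln y + (1 - t) ln(1 - y) }."""
    y = sigmoid(Phi @ w)
    y = np.clip(y, eps, 1 - eps)                  # avoid log(0)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def gradient(w, Phi, t):
    """Gradient of E(w): Phi^T (y - t)."""
    return Phi.T @ (sigmoid(Phi @ w) - t)
```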
Iterative Reweighted Least Squares
Newton-Raphson iterative optimization: $\mathbf{w}^{(\text{new})} = \mathbf{w}^{(\text{old})} - \mathbf{H}^{-1}\nabla E(\mathbf{w})$
Applied to linear regression (sum-of-squares error):
$\nabla E(\mathbf{w}) = \sum_{n=1}^{N}(\mathbf{w}^T\boldsymbol{\phi}_n - t_n)\boldsymbol{\phi}_n = \boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{w} - \boldsymbol{\Phi}^T\mathbf{t}$
$\mathbf{H} = \nabla\nabla E(\mathbf{w}) = \sum_{n=1}^{N}\boldsymbol{\phi}_n\boldsymbol{\phi}_n^T = \boldsymbol{\Phi}^T\boldsymbol{\Phi}$
$\mathbf{w}^{(\text{new})} = \mathbf{w}^{(\text{old})} - (\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\{\boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{w}^{(\text{old})} - \boldsymbol{\Phi}^T\mathbf{t}\} = (\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\mathbf{t}$
Same as the standard least-squares solution.
Iterative Reweighted Least Squares
Newton-Raphson update for the negative log-likelihood:
$\nabla E(\mathbf{w}) = \sum_{n=1}^{N}(y_n - t_n)\boldsymbol{\phi}_n = \boldsymbol{\Phi}^T(\mathbf{y} - \mathbf{t})$
$\mathbf{H} = \nabla\nabla E(\mathbf{w}) = \sum_{n=1}^{N} y_n(1 - y_n)\boldsymbol{\phi}_n\boldsymbol{\phi}_n^T = \boldsymbol{\Phi}^T\mathbf{R}\boldsymbol{\Phi}, \quad R_{nn} = y_n(1 - y_n)$
$\mathbf{w}^{(\text{new})} = \mathbf{w}^{(\text{old})} - (\boldsymbol{\Phi}^T\mathbf{R}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T(\mathbf{y} - \mathbf{t}) = (\boldsymbol{\Phi}^T\mathbf{R}\boldsymbol{\Phi})^{-1}\{\boldsymbol{\Phi}^T\mathbf{R}\boldsymbol{\Phi}\mathbf{w}^{(\text{old})} - \boldsymbol{\Phi}^T(\mathbf{y} - \mathbf{t})\} = (\boldsymbol{\Phi}^T\mathbf{R}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\mathbf{R}\mathbf{z}$
Weighted least-squares problem with working targets $\mathbf{z} = \boldsymbol{\Phi}\mathbf{w}^{(\text{old})} - \mathbf{R}^{-1}(\mathbf{y} - \mathbf{t})$; since $\mathbf{R}$ depends on $\mathbf{w}$, the update is applied iteratively.
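A minimal sketch of the resulting IRLS loop, assuming a design matrix Phi with rows $\boldsymbol{\phi}_n^T$ and binary targets t; the floor on $R_{nn}$ and the fixed iteration count are illustrative safeguards.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_logistic(Phi, t, n_iter=10):
    """Iterative reweighted least squares for logistic regression.

    Repeats w <- (Phi^T R Phi)^{-1} Phi^T R z with R_nn = y_n (1 - y_n)
    and working targets z = Phi w - R^{-1} (y - t).
    """
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        r = np.clip(y * (1 - y), 1e-10, None)        # diagonal of R, kept away from 0
        z = Phi @ w - (y - t) / r                     # working targets
        # Weighted least squares: solve (Phi^T R Phi) w = Phi^T R z.
        A = Phi.T @ (r[:, None] * Phi)
        b = Phi.T @ (r * z)
        w = np.linalg.solve(A, b)
    return w
```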
Maximum Margin Classifiers
Support Vector Machines for the two-class problem: $y(\mathbf{x}) = \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}) + b$, with training data $\mathbf{x}_1, \ldots, \mathbf{x}_N$ and targets $t_n \in \{-1, +1\}$.
Assuming a linearly separable data set, there exists at least one choice of parameters $(\mathbf{w}, b)$ that satisfies $t_n y(\mathbf{x}_n) > 0$ for all $n$; we want the one that gives the smallest generalization error.
Margin: the smallest distance between the decision boundary and any of the samples.
Support Vector Machines
Optimization of parameters by maximizing the margin. The distance of a point from the decision boundary is $\dfrac{t_n y(\mathbf{x}_n)}{\|\mathbf{w}\|} = \dfrac{t_n(\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n) + b)}{\|\mathbf{w}\|}$
Maximizing the margin is equivalent to minimizing $\|\mathbf{w}\|^2$:
$\arg\min_{\mathbf{w},\, b}\; \dfrac{1}{2}\|\mathbf{w}\|^2$
subject to the constraints $t_n(\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n) + b) \ge 1$
Introduction of Lagrange multipliers $a_n \ge 0$:
$L(\mathbf{w}, b, \mathbf{a}) = \dfrac{1}{2}\|\mathbf{w}\|^2 - \sum_{n=1}^{N} a_n\{t_n(\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n) + b) - 1\}$
Support Vector Machines - Lagrange Multipliers
Minimizing with respect to $\mathbf{w}$ and $b$ and maximizing with respect to $\mathbf{a}$:
$\mathbf{w} = \sum_{n=1}^{N} a_n t_n\boldsymbol{\phi}(\mathbf{x}_n), \quad 0 = \sum_{n=1}^{N} a_n t_n$
The dual form:
$\tilde{L}(\mathbf{a}) = \sum_{n=1}^{N} a_n - \dfrac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} a_n a_m t_n t_m k(\mathbf{x}_n, \mathbf{x}_m)$
subject to $a_n \ge 0$ and $\sum_{n=1}^{N} a_n t_n = 0$
Quadratic programming problem: maximize a quadratic function of $\mathbf{a}$ subject to linear constraints.
Prediction: $y(\mathbf{x}) = \sum_{n=1}^{N} a_n t_n k(\mathbf{x}, \mathbf{x}_n) + b$
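A minimal sketch of the resulting kernel decision function, assuming the multipliers $a_n$ and the bias $b$ have already been obtained from a quadratic programming solver (not shown); the RBF kernel and the threshold used to select support vectors are illustrative choices.

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def svm_decision(x, X_train, t, a, b, kernel=rbf_kernel):
    """Dual-form SVM decision value y(x) = sum_n a_n t_n k(x, x_n) + b.

    X_train : (N, D) training inputs, t : (N,) targets in {-1, +1},
    a : (N,) Lagrange multipliers from the QP (zero for non-support vectors).
    """
    support = a > 1e-8                               # only support vectors contribute
    k_vals = np.array([kernel(x, xn) for xn in X_train[support]])
    return np.sum(a[support] * t[support] * k_vals) + b
```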
Support Vector Machines
Overlapping class distributions (linearly non-separable data).
Slack variable $\xi_n$: distance from the margin boundary, with $\xi_n = 0$ for points on or inside the correct margin boundary and $\xi_n = |t_n - y(\mathbf{x}_n)|$ otherwise, so that $t_n y(\mathbf{x}_n) \ge 1 - \xi_n$.
To maximize the margin while penalizing points that lie on the wrong side of the margin boundary:
$\min_{\mathbf{w},\, b}\; C\sum_{n=1}^{N}\xi_n + \dfrac{1}{2}\|\mathbf{w}\|^2$
SVM - Overlapping Class Distributions
Lagrangian with multipliers $a_n \ge 0$ and $\mu_n \ge 0$:
$L(\mathbf{w}, b, \mathbf{a}) = \dfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_{n=1}^{N}\xi_n - \sum_{n=1}^{N} a_n\{t_n y(\mathbf{x}_n) - 1 + \xi_n\} - \sum_{n=1}^{N}\mu_n\xi_n$
The dual form is identical to the separable case:
$\tilde{L}(\mathbf{a}) = \sum_{n=1}^{N} a_n - \dfrac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} a_n a_m t_n t_m k(\mathbf{x}_n, \mathbf{x}_m)$
but now subject to $0 \le a_n \le C$ and $\sum_{n=1}^{N} a_n t_n = 0$
Again represents a quadratic programming problem.
Support Vector Machines
Relation to logistic regression: the hinge loss used in the SVM and the error function of logistic regression both approximate the ideal misclassification error (MCE), plotted as functions of $z = t_n y(\mathbf{x}_n)$.
(Figure: black: MCE, blue: hinge loss, red: logistic regression, green: squared error.)
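A small sketch (not from the slides) of these four error functions expressed in terms of $z = t_n y(\mathbf{x}_n)$; rescaling the logistic loss by $1/\ln 2$ is a common convention, assumed here so that it passes through the point (0, 1) like the hinge loss.

```python
import numpy as np

def misclassification_error(z):
    """Ideal 0/1 loss as a function of z = t * y(x)."""
    return (z <= 0).astype(float)

def hinge_loss(z):
    """SVM hinge loss max(0, 1 - z)."""
    return np.maximum(0.0, 1.0 - z)

def logistic_loss(z):
    """Logistic regression error ln(1 + exp(-z)), rescaled by 1/ln(2)."""
    return np.log1p(np.exp(-z)) / np.log(2.0)

def squared_error(z):
    """Squared error (y - t)^2 = (z - 1)^2 when t is coded in {-1, +1}."""
    return (z - 1.0) ** 2

# Evaluate each loss on a few values of z to compare their shapes.
z = np.linspace(-2, 2, 5)
for name, fn in [("MCE", misclassification_error), ("hinge", hinge_loss),
                 ("logistic", logistic_loss), ("squared", squared_error)]:
    print(name, np.round(fn(z), 3))
```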