
Lec3: Linear Discriminant Functions 1

Linear Discriminant Functions

Prof. Daniel Yeung
School of Computer Science and Engineering
South China University of Technology

Pattern Recognition, Lecture 3

Lec3: Linear Discriminant Functions 2

Outline

• Linear Discriminant Function (5.2)
• Generalized Linear Discriminant Function (5.3)
• Gradient Descent (5.4)
• Criterion Function (5.5)
• Sum-of-squared-error function (5.8)
• Least-Mean-Squared (5.8.4)
• Relationship: MSE vs Bayes (5.8.3)

Lec3: Linear Discriminant Functions 3

Review

• Chapter 1 – Introduces pattern recognition systems and the main concepts involved (data collection, features, models, costs, decision making, classifiers, training and learning, performance evaluation).

• Chapter 2 – Bayes decision theory: the ideal case, where the probability structure underlying the classification categories is known perfectly. One can then design the optimal (Bayes) classifier and even predict its error.

• Chapter 3 – The probability structure is not known, but the general form of the distributions is known, so we only need to estimate their parameters to achieve the best categorization, using the maximum-likelihood technique.

Lec3: Linear Discriminant Functions 4

Review

• Chapter 4 – Moves a further step away from the Bayes model and makes no assumption about the underlying probability structure; we rely on the information provided by the training samples alone. Examples: the nearest-neighbor algorithm and potential functions.

• Chapter 5 – We assume a “nearly” linear form for the discriminant functions and use samples to estimate the values of the parameters of the classifier.

• Chapter 6 – Extends some of the linear discriminant ideas to train multilayer neural networks.


Lec3: Linear Discriminant Functions 5

Parametric VS non-Parametric

• Parametric Methods
  - Assume the form of the sample distribution (pdf) is known
  - Training samples are used to estimate the distribution parameters
    - E.g., µ and σ in a Gaussian function
  - Accurate if the distribution assumption is correct; otherwise, the result may be very poor
  - E.g., Bayes rule

Lec3: Linear Discriminant Functions 6

Parametric VS non-Parametric

• Non-parametric Methods
  - Do not make any assumption about the form of the sample distribution (pdf)
  - Instead, a proper form for the discriminant function is assumed (e.g., linear, neural network, SVM)
  - Training samples are used to estimate the values of the parameters of the classifier
  - Sub-optimal, but simple to use

Lec3: Linear Discriminant Functions 7

Linear Discriminant Function (LDF)

• Definition
  - A linear combination of the components of x (the vector representing an object to be classified):

      g(x) = w^t x + w_0

    where w is the weight vector and w_0 is the bias (threshold weight)
• In general, there are c functions g_1(x), g_2(x), …, g_c(x), where c is the number of classes

Lec3: Linear Discriminant Functions 8

Linear Discriminant Function
2-Class Problem

• Decision rule
  - If g(x) > 0, decide ω1
  - If g(x) < 0, decide ω2
• Equivalently
  - If w^t x > −w_0, decide ω1
  - Otherwise, decide ω2
• If g(x) = 0, the situation is ambiguous and x may be assigned to either class
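
A minimal sketch of this two-class rule in Python (the weight values below are made up for illustration):

import numpy as np

def ldf_decide(x, w, w0):
    """Two-class linear discriminant: g(x) = w^t x + w0; decide omega_1 if g(x) > 0."""
    g = np.dot(w, x) + w0
    return "omega_1" if g > 0 else "omega_2"   # g(x) = 0 is ambiguous; here it falls to omega_2

# Illustrative weights and sample
print(ldf_decide(np.array([1.0, 2.0]), w=np.array([0.5, -0.3]), w0=0.2))   # g = 0.1 > 0 -> omega_1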


Lec3: Linear Discriminant Functions 9

Linear Discriminant Function
2-Class Problem

• Each unit is shown as having inputs and outputs
• The input units simply pass the input values through unchanged (except the bias unit, which outputs a constant 1)
• The output unit emits “1” if the sum of its weighted inputs is greater than zero, and “−1” otherwise

(Figure: a simple linear classifier network with a bias unit, feature-vector input units, weighted connections, and a summation/output unit.)

Lec3: Linear Discriminant Functions 10

Linear Discriminant Function
2-Class Problem

• The decision surface is defined by g(x) = 0
  - When g(x) is linear, this decision surface is a hyperplane

(Figure: the hyperplane g(x) = 0 separating the two classes.)

Lec3: Linear Discriminant Functions 11

Linear Discriminant Function

Multi-Class Problem: 1-against-All

(Figure: four one-against-all hyperplanes H1–H4 partition the feature space.)
• For the highlighted region: H1: not ω1, H2: ω2, H3: not ω3, H4: not ω4
• 1 “Yes”, 3 “No” → the region is assigned to ω2

Lec3: Linear Discriminant Functions 12

Linear Discriminant Function

Multi-Class Problem: 1-against-All

(Figure: the same one-against-all hyperplanes; now consider the central region.)
• For this region: H1: not ω1, H2: not ω2, H3: not ω3, H4: not ω4
• 0 “Yes”, 4 “No” → no classifier claims the region, so it is ambiguous


Lec3: Linear Discriminant Functions 13

Linear Discriminant Function

Multi-Class Problem: 1-against-All

(Figure: the same one-against-all hyperplanes; for the highlighted region H1: ω1, H2: not ω2, H3: ω3, H4: not ω4.)
• Two classifiers both think this region belongs to them
• This problem can be solved by comparing the outputs of the classifiers, e.g.:
  - if H1 > H3, decide class 1
  - otherwise, decide class 3
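
A small sketch of one-against-all with this tie-breaking by largest output (the weights below are made up; taking the argmax over the c discriminant values resolves regions claimed by several classifiers):

import numpy as np

def one_vs_all_predict(x, W, w0):
    """One-against-all: evaluate c linear discriminants and pick the largest output,
    which resolves regions that are claimed by more than one classifier."""
    scores = W @ x + w0          # scores[i] = g_i(x) = w_i^t x + w_i0
    return int(np.argmax(scores))

# Illustrative weights for a 3-class, 2-feature problem
W = np.array([[1.0, -0.5], [-0.2, 0.8], [0.3, 0.3]])
w0 = np.array([0.0, 0.1, -0.2])
print(one_vs_all_predict(np.array([0.5, 1.0]), W, w0))   # -> 1 for this input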

Lec3: Linear Discriminant Functions 14

Linear Discriminant Function

Multi-Class Problem: 1-against-1

2. One against One
• Consider every possible pair of classes
• c(c−1)/2 classifiers are needed
• For example, a 3-class problem:

    1 vs 2    2 vs 3    1 vs 3
    Y         Y / N     Y        →  class 1
    N         Y         Y / N    →  class 2
    Y / N     N         Y        →  class 3

• Any problem?
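
A minimal sketch of one-against-one with majority voting (the pairwise decision functions here are placeholders; in practice each would be a trained linear discriminant):

import numpy as np
from itertools import combinations

def one_vs_one_predict(x, pairwise, n_classes):
    """One-against-one: c(c-1)/2 pairwise classifiers vote; the class with the most votes wins.
    pairwise[(i, j)] returns the winning class (i or j) for sample x."""
    votes = np.zeros(n_classes, dtype=int)
    for i, j in combinations(range(n_classes), 2):
        votes[pairwise[(i, j)](x)] += 1
    return int(np.argmax(votes))      # ties are still possible (ambiguous regions)

# Placeholder pairwise classifiers for 3 classes, each thresholding a made-up linear score
pairwise = {
    (0, 1): lambda x: 0 if x[0] - x[1] > 0 else 1,
    (0, 2): lambda x: 0 if x[0] + x[1] - 1 > 0 else 2,
    (1, 2): lambda x: 1 if x[1] - 0.5 > 0 else 2,
}
print(one_vs_one_predict(np.array([0.2, 0.9]), pairwise, n_classes=3))   # -> 1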

Lec3: Linear Discriminant Functions 15

Linear Discriminant Function
Multi-Class Problem: 1-against-1

(Figure: pairwise hyperplanes H12, H13, H14, H23, H24, H34 partition the feature space.)
• Outputs of the pairwise classifiers for the highlighted region:
  H12: ω2   H13: ω3   H14: ω4
  H21: ω2   H23: ω2   H24: ω2
  H31: ω4   H32: ω2   H34: ω4
  H41: ω4   H42: ω2   H43: ω4
• ω2 receives the most votes, so the region is labeled ω2

Lec3: Linear Discriminant Functions 16

Linear Discriminant Function
Multi-Class Problem: 1-against-1

(Figure: the same pairwise hyperplanes; another region.)
• Outputs of the pairwise classifiers for this region:
  H12: ω1   H13: ω3   H14: ω1
  H21: ω1   H23: ω3   H24: ω4
  H31: ω3   H32: ω3   H34: ω3
  H41: ω1   H42: ω4   H43: ω4
• ω3 receives the most votes, so the region is labeled ω3


Lec3: Linear Discriminant Functions 17

Linear Discriminant Function
Multi-Class Problem: 1-against-1

(Figure: the same pairwise hyperplanes; a region assigned to ω1.)
• Outputs of the pairwise classifiers for this region:
  H12: ω1   H13: ω1   H14: ω1
  H21: ω1   H23: ω2   H24: ω4
  H31: ω1   H32: ω2   H34: ω3
  H41: ω1   H42: ω4   H43: ω3
• Among ω2, ω3, and ω4 the pairwise decisions form a cycle: P(ω2) > P(ω3), P(ω3) > P(ω4), P(ω4) > P(ω2)
• Although there is a conflict between some classifiers, we can still obtain a result: ω1 wins all of its pairwise comparisons, so the region is labeled ω1

Lec3: Linear Discriminant Functions 18

Linear Discriminant Function
Multi-Class Problem: 1-against-1

(Figure: the same pairwise hyperplanes; the central region is marked “?”.)
• Outputs of the pairwise classifiers for this region:
  H12: ω2   H13: ω3   H14: ω1
  H21: ω2   H23: ω2   H24: ω4
  H31: ω3   H32: ω2   H34: ω3
  H41: ω4   H42: ω4   H43: ω3
• Here no class wins a clear majority of the pairwise votes, so the region remains ambiguous (“?”)

Lec3: Linear Discriminant Functions 19

Linear Discriminant Function

Multi-Class Problem

• Which approach is better?
  - 1-against-All needs a smaller number of classifiers (c vs c(c−1)/2)
  - 1-against-1 has a smaller ambiguous region
• Any other choice?

Lec3: Linear Discriminant Functions 20

Linear Discriminant Function
Linear Machine for Multi-Class Problem

• How can we avoid the ambiguous regions caused by LDFs (1-against-1 or 1-against-All)?
• One possible solution: use a Linear Machine
  - For each class, define

      g_i(x) = w_i^t x + w_{i0},   i = 1, …, c

  - x belongs to ω_i (region R_i) if g_i(x) > g_j(x) for all j ≠ i
  - Undefined if g_i(x) = g_j(x)
• If R_i and R_j are contiguous, the boundary between them is a portion of the hyperplane H_ij defined by

      g_i(x) = g_j(x),   or   (w_i − w_j)^t x + (w_{i0} − w_{j0}) = 0
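
A small sketch of a linear machine in Python (the weights are illustrative; the point is only that the class with the largest g_i(x) wins, so the space is partitioned with no ambiguous regions except the boundaries g_i = g_j):

import numpy as np

def linear_machine_predict(x, W, w0):
    """Linear machine: c discriminants g_i(x) = w_i^t x + w_i0; assign x to the largest one."""
    g = W @ x + w0
    return int(np.argmax(g))          # boundaries g_i(x) = g_j(x) are the only undefined points

# Illustrative 3-class machine in 2 dimensions
W = np.array([[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]])
w0 = np.array([0.0, -0.5, 0.5])
print(linear_machine_predict(np.array([1.0, 0.2]), W, w0))   # -> 0 for this point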


Lec3: Linear Discriminant Functions 21

Linear Discriminant Function

Linear Machine for Multi-Class Problem

• Advantages:
  - Avoids ambiguous regions
  - Every decision region is singly connected
  - Low complexity: only c discriminants are needed, fewer than the c(c−1)/2 classifiers of 1-against-1

(Figure: linear-machine decision regions for a 3-class problem and a 5-class problem.)

Lec3: Linear Discriminant Functions 22

Linear Discriminant Function (LDF)

• In practice, a problem is seldom linearly separable
• Can we still handle a non-linearly separable problem with an LDF?
  - Yes: use a mapping that converts the non-linearly separable problem into a linearly separable one, i.e., map the points from a lower-dimensional space (where they are not linearly separable) to a higher-dimensional space (where they are linearly separable)

(Figure: points that are non-linearly separable in the low-dimensional space become linearly separable in the high-dimensional space.)

Lec3: Linear Discriminant Functions 23

Generalized Linear Discriminant Function

• Linear Discriminant Function

    g(x) = w_0 + \sum_{i=1}^{d} w_i x_i

• Quadratic Discriminant Function

    g(x) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d} \sum_{j=1}^{d} w_{ij} x_i x_j

• Continue to add higher-order terms:

    \sum_{i=1}^{d} \sum_{j=1}^{d} \sum_{k=1}^{d} w_{ijk} x_i x_j x_k ,   \sum_{i=1}^{d} \sum_{j=1}^{d} \sum_{k=1}^{d} \sum_{l=1}^{d} w_{ijkl} x_i x_j x_k x_l ,   …

Lec3: Linear Discriminant Functions 24

Generalized Linear Discriminant Function

• Generalized Linear Discriminant Function

    g(x) = \sum_{i=1}^{\hat{d}} a_i y_i(x)    or    g(x) = a^t y(x)

  g is not linear in x, but it is linear in y, where
  - a is a \hat{d}-dimensional weight vector
  - the y_i(x) are called φ-functions
    - arbitrary functions of x
    - they map points in the low-dimensional x-space into points in the higher-dimensional y-space
• The homogeneous discriminant function a^t y separates points by a hyperplane that passes through the origin of the transformed space


Lec3: Linear Discriminant Functions 25

Generalized Linear Discriminant Function
Example: Quadratic as Generalized LDF

• The quadratic discriminant function

    g(x) = a_1 + a_2 x + a_3 x^2

• As a generalized LDF

    g(x) = \sum_{i=1}^{3} a_i y_i(x)

  where y_1(x) = 1, y_2(x) = x, y_3(x) = x^2
  - i.e., the 3-dimensional vector

    y(x) = (1, x, x^2)^t
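
A tiny sketch of this φ-mapping in Python (the data set and the weight vector a are made up; the point is that a set which cannot be split by a single threshold on the line becomes separable by one hyperplane in y-space):

import numpy as np

def phi(x):
    """Quadratic phi-mapping from 1-D x to the 3-D vector y = (1, x, x^2)^t."""
    return np.array([1.0, x, x * x])

# Illustrative 1-D data: class 1 for |x| < 1, class 2 otherwise (not separable by one threshold)
xs = np.array([-2.0, -0.5, 0.3, 1.8])
Y = np.stack([phi(x) for x in xs])        # the points now live in 3-D y-space

# In y-space the (made-up) weight vector a = (1, 0, -1) gives a^t y = 1 - x^2,
# which is positive exactly for |x| < 1, so a single hyperplane separates the two classes.
a = np.array([1.0, 0.0, -1.0])
print(Y @ a > 0)                           # [False  True  True False]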

Lec3: Linear Discriminant Functions 26

Generalized Linear Discriminant Function
Example: Quadratic as Generalized LDF

• A plane splits the resulting y-space into regions corresponding to the two categories, which gives a non-simply-connected decision region in the 1-dimensional x-space
• The mapping y = (1, x, x^2)^t takes a line and transforms it into a parabola in 3 dimensions
• Points below the separating plane belong to region ℜ2 and those above it belong to ℜ1

(Figure: not linearly separable in x-space, linearly separable in y-space.)

Lec3: Linear Discriminant Functions 27

Generalized Linear Discriminant Function
Another Example

    g(x) = −x_1 + x_2 + 2 x_1 x_2

• A linear discriminant in the transformed space is a hyperplane Ĥ which cuts the surface
• Points on the positive side of Ĥ (above it) correspond to ω1, and those beneath it correspond to ω2

Lec3: Linear Discriminant Functions 28

Two-Class Linearly Separable Case

• Given
  - A set of n samples y_1, y_2, …, y_n
  - Some belong to ω1 and some to ω2
• Objective
  - Use the samples to determine a weight vector a such that
    - a^t y_i > 0 implies ω1
    - a^t y_i < 0 implies ω2
• If such a weight vector a exists, the samples are said to be linearly separable


Lec3: Linear Discriminant Functions 29

Two-Class Linearly Separable Case

• Normalization
  - Simplifies the setting
  - Replace all samples labeled ω2 by their “negatives”
    - i.e., y_i ← −y_i for every y_i from ω2
  - If the problem is linearly separable, a solution then satisfies a^t y_i > 0 for all samples
  - The class labels can then be ignored
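
A minimal sketch of this normalization step (the feature values and labels are illustrative; the samples are also augmented with a leading 1 so that the bias is absorbed into a):

import numpy as np

def normalize_samples(X, labels):
    """Augment each sample with a leading 1 and negate the omega_2 samples,
    so that a correct weight vector a satisfies a^t y_i > 0 for every sample."""
    Y = np.hstack([np.ones((X.shape[0], 1)), X])     # y = (1, x1, ..., xd)^t
    signs = np.where(labels == 1, 1.0, -1.0)          # +1 for omega_1, -1 for omega_2
    return Y * signs[:, None]

# Illustrative data: two omega_1 samples and two omega_2 samples
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [2.0, 3.0]])
labels = np.array([1, 1, 2, 2])
print(normalize_samples(X, labels))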

Lec3: Linear Discriminant Functions 30

Two-Class Linearly Separable Case

(Figure: raw data vs. normalized data.)
• Raw data: a solution vector separates red from black (a^t y > 0 for black, a^t y < 0 for red); a is at a right angle to the red dotted line
• Normalized data: after negating the ω2 samples, a solution vector places all samples on the same side of the plane
• The solution region is the intersection of n half-spaces; there are infinitely many solutions

Lec3: Linear Discriminant Functions 31

Additional Requirements

• The solution vector (the solution to the decision plane) is not unique. Which one is the best?
• Additional requirements are imposed to constrain the solution vector
• Two possible approaches:
  1. Seek a hyperplane having a larger margin
     - Discussed in the next slides
  2. Seek a smoother and simpler decision hyperplane
     - Regularization, discussed later

Lec3: Linear Discriminant Functions 32

Margin

• Seek the minimum-length weight vector a satisfying

    a^t y_i > b > 0

  where b is called the margin
• The new solution region is insulated from the old boundaries by a distance b / ||y_i||

(Figure: with no margin (b = 0) the solution region touches the old boundaries; with a margin b > 0 it shrinks by b / ||y_i||, like a buffer zone, giving a new boundary inside the old one.)


Lec3: Linear Discriminant Functions 33

Gradient Descent (GD)

• Recall that after normalization (slide 30), sample y_i is classified correctly if a^t y_i > 0
• To find a solution to the set of inequalities a^t y_i > 0:
  - Define a criterion function J(a) that is minimized if a is a solution vector
  - The function J will be defined later

Lec3: Linear Discriminant Functions 34

Gradient Descent (GD)

• Algorithm
  - Start with an arbitrarily chosen weight vector a(1)
  - Compute the gradient vector ∇J(a(1))
  - The next value a(2) is determined by moving some distance from a(1) in the direction of steepest descent, i.e., along the negative of the gradient
• In general,

    a(k+1) = a(k) − η(k) ∇J(a(k)),   where   ∇J = ∂J/∂a

  and η is the learning rate, which controls the size of each step
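
A generic gradient-descent loop, sketched in Python (the criterion J below is a made-up quadratic, used only to show the update a(k+1) = a(k) − η∇J(a(k)) and the stopping test):

import numpy as np

def gradient_descent(grad_J, a0, eta=0.1, theta=1e-6, max_iter=1000):
    """Basic gradient descent: a(k+1) = a(k) - eta * grad J(a(k)).
    Stops when the magnitude of the step eta * grad J falls below the threshold theta."""
    a = np.asarray(a0, dtype=float)
    for _ in range(max_iter):
        step = eta * grad_J(a)
        if np.linalg.norm(step) < theta:     # |eta(k) grad J| < theta (cf. the note on the next slide)
            break
        a = a - step
    return a

# Made-up criterion J(a) = ||a - c||^2 with gradient 2(a - c); the minimizer is a* = c
c = np.array([1.0, -2.0])
print(gradient_descent(lambda a: 2.0 * (a - c), a0=np.zeros(2)))   # converges towards c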

Lec3: Linear Discriminant Functions 35

Gradient Descent (GD)

• Algorithm (illustrated)
  - Note: in step 4, the stopping test should use the absolute value, i.e., |η(k)∇J| < θ

(Figure: the sequence a(0), a(1), a(2), a(3), … descends towards the minimizer a*; the step at iteration k has length η(k)||∇J||.)

Lec3: Linear Discriminant Functions 36

Gradient Descent (GD)

• Related issues:
  - Size of the learning rate (η)
    - Too small: convergence is needlessly slow
    - Too large: the correction process will overshoot and can even diverge
  - Sub-optimal solutions
    - The search can become trapped in a local minimum


Lec3: Linear Discriminant Functions 37

Criterion Function (J)

• Recall that all correctly classified samples should satisfy a^t y_i > 0
• We would like to find the a that yields the smallest error
• Examples of criterion functions J (here Y denotes the set of misclassified samples):
  - Perceptron criterion function:

      J_p(a) = \sum_{y \in Y} (−a^t y)

  - Squared error function:

      J_q(a) = \sum_{y \in Y} (a^t y)^2

• If all samples are classified correctly (i.e., the set Y is empty), J equals 0
• If samples are misclassified (i.e., a^t y_i < 0 for those samples), J is positive

Lec3: Linear Discriminant Functions 38

Criterion Function (J)

• Gradients and update rules for the two criterion functions (Y is the set of misclassified samples):

  Perceptron criterion function:
      J_p(a) = \sum_{y \in Y} (−a^t y)
  - Gradient:      ∇J_p(a) = \sum_{y \in Y} (−y)
  - Update rule:   a(k+1) = a(k) + η(k) \sum_{y \in Y} y

  Squared error function:
      J_q(a) = \sum_{y \in Y} (a^t y)^2
  - Gradient:      ∇J_q(a) = \sum_{y \in Y} 2 (a^t y) y
  - Update rule:   a(k+1) = a(k) − η(k) \sum_{y \in Y} 2 (a^t y) y
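
A minimal batch-perceptron sketch of the first update rule (the toy samples are made up and already normalized as on the earlier slides):

import numpy as np

def batch_perceptron(Y, eta=1.0, max_iter=1000):
    """Batch perceptron update: a(k+1) = a(k) + eta * sum of the misclassified samples.
    Y holds normalized samples as rows, so y is misclassified when a^t y <= 0."""
    a = np.zeros(Y.shape[1])
    for _ in range(max_iter):
        mis = Y[Y @ a <= 0]              # the set Y of misclassified samples in J_p
        if len(mis) == 0:                 # J_p(a) = 0: every sample satisfies a^t y > 0
            break
        a = a + eta * mis.sum(axis=0)
    return a

# Made-up, clearly separable normalized samples (augmented with a leading 1 and sign-flipped)
Y = np.array([[1.0, 2.0, 0.0], [1.0, 3.0, 1.0], [-1.0, 2.0, 0.0], [-1.0, 3.0, 1.0]])
a = batch_perceptron(Y)
print(a, Y @ a > 0)                       # a separating vector; all inner products positive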

Lec3: Linear Discriminant Functions 39

Criterion Function (J)
Sum-of-squared-error Function

• The perceptron criterion function and the squared error function
  - focus only on the misclassified samples
  - merely hope that all of the inner products a^t y_i become positive
• The sum-of-squared-error function instead handles the margin situation a^t y_i = b_i (slide 32):

    J_s(a) = \sum_{i=1}^{n} (a^t y_i − b_i)^2

  where the b_i are some arbitrarily specified positive constants
• The problem is the more stringent, but better understood, one of finding a solution to a set of linear equations (rather than to a set of linear inequalities)

Lec3: Linear Discriminant Functions 40

Criterion Function (J)
Sum-of-squared-error Function

• For all samples y_1, y_2, …, y_n we want to find a weight vector a such that

    a^t y_i = b_i

  where the b_i are some arbitrarily specified positive constants
• Matrix notation:

    \begin{pmatrix} y_{10} & y_{11} & \cdots & y_{1\hat{d}} \\ y_{20} & y_{21} & \cdots & y_{2\hat{d}} \\ \vdots & \vdots & & \vdots \\ y_{n0} & y_{n1} & \cdots & y_{n\hat{d}} \end{pmatrix} \begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_{\hat{d}} \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix}

  or simply  Ya = b


Lec3: Linear Discriminant Functions 41

Criterion Function (J)
Sum-of-squared-error Function

• Since Ya = b, a could be obtained directly by inversion if Y were square and nonsingular:

    a = Y^{-1} b

• Usually, however, Y is rectangular, with more rows than columns
  - more samples than features, i.e., more equations than unknowns
  - the system for a is over-determined
  - in general, no exact solution exists

Lec3: Linear Discriminant Functions 42

Criterion Function (J)
Sum-of-squared-error Function: Pseudoinverse Method

• Error vector:

    e = Ya − b

• Sum-of-squared-error function:

    J_s(a) = ||Ya − b||^2 = \sum_{i=1}^{n} (a^t y_i − b_i)^2

• Gradient:

    ∇J_s(a) = \sum_{i=1}^{n} 2 (a^t y_i − b_i) y_i = 2 Y^t (Ya − b)

• Necessary condition:

    Y^t Y a = Y^t b

• a can then be solved for uniquely:

    a = (Y^t Y)^{-1} Y^t b = Y' b,   where   Y' = (Y^t Y)^{-1} Y^t
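
A small numerical sketch of the pseudoinverse solution a = Y'b using NumPy; Y and b are taken from the worked example on the next slide, and np.linalg.pinv computes the Moore-Penrose pseudoinverse, which coincides with (Y^t Y)^{-1} Y^t when Y^t Y is nonsingular:

import numpy as np

# Normalized, augmented samples and margins from the example on the next slide
Y = np.array([[ 1.0,  1.0,  2.0],
              [ 1.0,  2.0,  0.0],
              [-1.0, -3.0, -1.0],
              [-1.0, -2.0, -3.0]])
b = np.ones(4)                       # b_i = 1

a = np.linalg.pinv(Y) @ b            # a = Y' b
print(a)                             # approximately [ 3.667 -1.333 -0.667 ], i.e., (11/3, -4/3, -2/3)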

Lec3: Linear Discriminant Functions 43

Criterion Function (J)
Sum-of-squared-error Function

• Remarks
  - Y^t Y is not always nonsingular
  - Y' should therefore be defined more generally by

      Y' ≡ lim_{ε→0} (Y^t Y + εI)^{-1} Y^t

  - The solution depends on the choice of b

Lec3: Linear Discriminant Functions 44

CF: Sum-of-squared-error function
(example: find a by pseudoinverse)

• Given the samples

    x1   x2   ω
    1    2    1
    2    0    1
    3    1    2
    2    3    2

• y and b are defined as y = (1, x1, x2)^t and b = (1, 1, 1, 1)^t
• Normalizing the class-2 samples (refer to slide 28):

    Y = \begin{pmatrix} 1 & 1 & 2 \\ 1 & 2 & 0 \\ -1 & -3 & -1 \\ -1 & -2 & -3 \end{pmatrix}

• Hence

    Y' = (Y^t Y)^{-1} Y^t = \begin{pmatrix} 5/4 & 13/12 & 3/4 & 7/12 \\ -1/2 & -1/6 & -1/2 & -1/6 \\ 0 & -1/3 & 0 & -1/3 \end{pmatrix}

    a = Y' b = (11/3, −4/3, −2/3)^t

• The decision boundary a^t (1, x1, x2)^t = 0 then gives 4x1 + 2x2 = 11

(Figure: the four samples and the separating line 4x1 + 2x2 = 11.)
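
A quick check of the boundary found above (same numbers as on this slide):

import numpy as np

# The boundary a^t y = 0 with a = (11/3, -4/3, -2/3) is 4x1 + 2x2 = 11.
a = np.array([11/3, -4/3, -2/3])
samples = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [2.0, 3.0]])   # the four points above
g = a[0] + samples @ a[1:]
print(g)    # [ 1.  1. -1. -1.]: positive for the two omega_1 samples, negative for the omega_2 samples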


Lec3: Linear Discriminant Functions 45

CF: Sum-of-squared-error function
Least-Mean-Squared (LMS)

• Besides the pseudoinverse method, gradient descent can also be used to minimize J_s(a)
• The resulting algorithm is called Least-Mean-Squared (LMS)
• Advantages over the pseudoinverse:
  - Avoids the problem that arises when Y^t Y is singular
  - Avoids the need to work with large matrices
  - The computation involved is a feedback scheme that copes with round-off or truncation errors
• Disadvantage
  - Longer training time

Lec3: Linear Discriminant Functions 46

CF: Sum-of-squared-error function
Least-Mean-Squared (LMS)

• Recall

    J_s(a) = \sum_{i=1}^{n} (a^t y_i − b_i)^2

• Gradient:

    ∇J_s(a) = \sum_{i=1}^{n} 2 (a^t y_i − b_i) y_i

• Update rule (one sample at a time):

    a(k+1) = a(k) + η(k) (b_i − a^t y_i) y_i
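
A minimal LMS sketch (same normalized samples and margins as the pseudoinverse example; for simplicity a fixed small learning rate is used here, whereas a decreasing η(k) is also common):

import numpy as np

def lms(Y, b, eta=0.1, n_epochs=1000):
    """Sample-by-sample LMS: a <- a + eta * (b_i - a^t y_i) * y_i, cycling through the samples."""
    a = np.zeros(Y.shape[1])
    for _ in range(n_epochs):
        for y_i, b_i in zip(Y, b):
            a = a + eta * (b_i - a @ y_i) * y_i
    return a

Y = np.array([[ 1.0,  1.0,  2.0],
              [ 1.0,  2.0,  0.0],
              [-1.0, -3.0, -1.0],
              [-1.0, -2.0, -3.0]])
b = np.ones(4)
print(lms(Y, b))   # approaches the MSE solution (11/3, -4/3, -2/3) found by the pseudoinverse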

Lec3: Linear Discriminant Functions 47

CF: Sum-of-squared-error function
Least-Mean-Squared (LMS)

• LMS need not converge to a separating hyperplane (one that separates the samples perfectly), even if such a hyperplane exists
• It minimizes the sum of the squares of the distances of the training points to the hyperplane
• In the example shown, the LMS plane is different from the separating hyperplane

(Figure: LMS solution vs. a separating hyperplane.)

Lec3: Linear Discriminant Functions 48

Relationship: MSE vs Bayes

(Figure: class-conditional densities, posteriors, the Bayes discriminant function (grey line), and the MSE solution (dotted line); the MSE solution is the best approximation in the region of the data points, i.e., where g(x) ≈ g_0(x).)

• If b = 1 (i.e., all margins b_i = 1), the MSE solution approaches a minimum mean-squared-error approximation to the Bayes discriminant function g_0(x)