
Lec3: Linear Discriminant Functions 1

Linear Discriminant Functions

Prof. Daniel Yeung
School of Computer Science and Engineering
South China University of Technology

Pattern Recognition, Lecture 3

Lec3: Linear Discriminant Functions 2

Outline

• Linear Discriminant Function (5.2)
• Generalized Linear Discriminant Function (5.3)
• Gradient Descent (5.4)
• Criterion Function (5.5)
• Sum-of-squared-error function (5.8)
• Least-Mean-Squared (5.8.4)
• Relationship: MSE vs Bayes (5.8.3)

Lec3: Linear Discriminant Functions 3

Review

• Chapter 1 – Introduces pattern recognition systems and the main concepts involved (data collection, features, models, costs, decision making, classifiers, training and learning, performance evaluation).

• Chapter 2 – Bayes decision theory: the ideal case, where the probability structure underlying the classification categories is known perfectly. One can then design the optimal (Bayes) classifier and even predict its error.

• Chapter 3 – The probability structure is not known, but the general form of the distributions is known, so we only need to estimate their parameters to achieve the best categorization, using the maximum-likelihood technique.

Lec3: Linear Discriminant Functions 4

Review

• Chapter 4 – Moves a further step away from the Bayes model and makes no assumption about the underlying probability structure; we rely on the information provided by the training samples alone. Examples: the nearest-neighbor algorithm and potential functions.

• Chapter 5 – We assume a “nearly” linear form for the discriminant functions and use samples to estimate the values of the parameters of the classifier.

• Chapter 6 – Extends some of the linear discriminant ideas to train multilayer neural networks.


Lec3: Linear Discriminant Functions 5

Parametric VS non-Parametric

• Parametric Methods
  - Assume the form of the sample distribution (pdf) is known
  - Training samples are used to estimate the distribution parameters
    - E.g., µ and σ in a Gaussian function
  - Accurate if the distribution assumption is correct; otherwise, the result may be very poor
  - E.g., Bayes rule

Lec3: Linear Discriminant Functions 6

Parametric VS non-Parametric

• Non-parametric Methods
  - Do not make any assumption about the form of the sample distribution (pdf)
  - Instead, a proper form for the discriminant function is assumed (e.g., linear, neural network, SVM)
  - Training samples are used to estimate the values of the parameters of the classifier
  - Sub-optimal, but simple to use

Lec3: Linear Discriminant Functions 7

Linear Discriminant Function (LDF)

• Definition
  - A linear combination of the components of x (the vector representing an object to be classified):

      g(x) = w^t x + w_0

    where w is the weight vector and w_0 is the bias (threshold weight)
• In general, there are c functions g_1(x), g_2(x), …, g_c(x), where c is the number of classes

Lec3: Linear Discriminant Functions 8

Linear Discriminant Function
2-Class Problem

• Decision rule
  - If g(x) > 0, decide ω1
  - If g(x) < 0, decide ω2
• Equivalently
  - If w^t x > −w_0, decide ω1
  - Otherwise, decide ω2
• If g(x) = 0, the situation is ambiguous and x may be assigned to either class
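
A minimal sketch of this two-class rule in Python (the weight values below are made up for illustration):

import numpy as np

def ldf_decide(x, w, w0):
    """Two-class linear discriminant: g(x) = w^t x + w0; decide omega_1 if g(x) > 0."""
    g = np.dot(w, x) + w0
    return "omega_1" if g > 0 else "omega_2"   # g(x) = 0 is ambiguous; here it falls to omega_2

# Illustrative weights and sample
print(ldf_decide(np.array([1.0, 2.0]), w=np.array([0.5, -0.3]), w0=0.2))   # g = 0.1 > 0 -> omega_1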


Lec3: Linear Discriminant Functions 9

Linear Discriminant Function
2-Class Problem

• Each unit is shown as having inputs and outputs
• The input units simply pass the input values through unchanged (except the bias unit, which outputs a constant 1)
• The output unit emits “1” if the sum of its weighted inputs is greater than zero, and “−1” otherwise

(Figure: a simple linear classifier network with a bias unit, feature-vector input units, weighted connections, and a summation/output unit.)

Lec3: Linear Discriminant Functions 10

Linear Discriminant Function
2-Class Problem

• The decision surface is defined by g(x) = 0
  - When g(x) is linear, this decision surface is a hyperplane

(Figure: the hyperplane g(x) = 0 separating the two classes.)

Lec3: Linear Discriminant Functions 11

Linear Discriminant Function

Multi-Class Problem: 1-against-All

(Figure: four one-against-all hyperplanes H1–H4 partition the feature space.)
• For the highlighted region: H1: not ω1, H2: ω2, H3: not ω3, H4: not ω4
• 1 “Yes”, 3 “No” → the region is assigned to ω2

Lec3: Linear Discriminant Functions 12

Linear Discriminant Function

Multi-Class Problem: 1-against-All

(Figure: the same one-against-all hyperplanes; now consider the central region.)
• For this region: H1: not ω1, H2: not ω2, H3: not ω3, H4: not ω4
• 0 “Yes”, 4 “No” → no classifier claims the region, so it is ambiguous


Lec3: Linear Discriminant Functions 13

Linear Discriminant Function

Multi-Class Problem: 1-against-All

(Figure: the same one-against-all hyperplanes; for the highlighted region H1: ω1, H2: not ω2, H3: ω3, H4: not ω4.)
• Two classifiers both think this region belongs to them
• This problem can be solved by comparing the outputs of the classifiers, e.g.:
  - if H1 > H3, decide class 1
  - otherwise, decide class 3
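
A small sketch of one-against-all with this tie-breaking by largest output (the weights below are made up; taking the argmax over the c discriminant values resolves regions claimed by several classifiers):

import numpy as np

def one_vs_all_predict(x, W, w0):
    """One-against-all: evaluate c linear discriminants and pick the largest output,
    which resolves regions that are claimed by more than one classifier."""
    scores = W @ x + w0          # scores[i] = g_i(x) = w_i^t x + w_i0
    return int(np.argmax(scores))

# Illustrative weights for a 3-class, 2-feature problem
W = np.array([[1.0, -0.5], [-0.2, 0.8], [0.3, 0.3]])
w0 = np.array([0.0, 0.1, -0.2])
print(one_vs_all_predict(np.array([0.5, 1.0]), W, w0))   # -> 1 for this input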

Lec3: Linear Discriminant Functions 14

Linear Discriminant Function

Multi-Class Problem: 1-against-1

2. One against One
• Consider every possible pair of classes
• c(c−1)/2 classifiers are needed
• For example, a 3-class problem:

    1 vs 2    2 vs 3    1 vs 3
    Y         Y / N     Y        →  class 1
    N         Y         Y / N    →  class 2
    Y / N     N         Y        →  class 3

• Any problem?
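
A minimal sketch of one-against-one with majority voting (the pairwise decision functions here are placeholders; in practice each would be a trained linear discriminant):

import numpy as np
from itertools import combinations

def one_vs_one_predict(x, pairwise, n_classes):
    """One-against-one: c(c-1)/2 pairwise classifiers vote; the class with the most votes wins.
    pairwise[(i, j)] returns the winning class (i or j) for sample x."""
    votes = np.zeros(n_classes, dtype=int)
    for i, j in combinations(range(n_classes), 2):
        votes[pairwise[(i, j)](x)] += 1
    return int(np.argmax(votes))      # ties are still possible (ambiguous regions)

# Placeholder pairwise classifiers for 3 classes, each thresholding a made-up linear score
pairwise = {
    (0, 1): lambda x: 0 if x[0] - x[1] > 0 else 1,
    (0, 2): lambda x: 0 if x[0] + x[1] - 1 > 0 else 2,
    (1, 2): lambda x: 1 if x[1] - 0.5 > 0 else 2,
}
print(one_vs_one_predict(np.array([0.2, 0.9]), pairwise, n_classes=3))   # -> 1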

Lec3: Linear Discriminant Functions 15

Linear Discriminant Function
Multi-Class Problem: 1-against-1

(Figure: pairwise hyperplanes H12, H13, H14, H23, H24, H34 partition the feature space.)
• Outputs of the pairwise classifiers for the highlighted region:
  H12: ω2   H13: ω3   H14: ω4
  H21: ω2   H23: ω2   H24: ω2
  H31: ω4   H32: ω2   H34: ω4
  H41: ω4   H42: ω2   H43: ω4
• ω2 receives the most votes, so the region is labeled ω2

Lec3: Linear Discriminant Functions 16

Linear Discriminant Function
Multi-Class Problem: 1-against-1

(Figure: the same pairwise hyperplanes; another region.)
• Outputs of the pairwise classifiers for this region:
  H12: ω1   H13: ω3   H14: ω1
  H21: ω1   H23: ω3   H24: ω4
  H31: ω3   H32: ω3   H34: ω3
  H41: ω1   H42: ω4   H43: ω4
• ω3 receives the most votes, so the region is labeled ω3


Lec3: Linear Discriminant Functions 17

Linear Discriminant Function
Multi-Class Problem: 1-against-1

(Figure: the same pairwise hyperplanes; a region assigned to ω1.)
• Outputs of the pairwise classifiers for this region:
  H12: ω1   H13: ω1   H14: ω1
  H21: ω1   H23: ω2   H24: ω4
  H31: ω1   H32: ω2   H34: ω3
  H41: ω1   H42: ω4   H43: ω3
• Among ω2, ω3, and ω4 the pairwise decisions form a cycle: P(ω2) > P(ω3), P(ω3) > P(ω4), P(ω4) > P(ω2)
• Although there is a conflict between some classifiers, we can still obtain a result: ω1 wins all of its pairwise comparisons, so the region is labeled ω1

Lec3: Linear Discriminant Functions 18

Linear Discriminant Function
Multi-Class Problem: 1-against-1

(Figure: the same pairwise hyperplanes; the central region is marked “?”.)
• Outputs of the pairwise classifiers for this region:
  H12: ω2   H13: ω3   H14: ω1
  H21: ω2   H23: ω2   H24: ω4
  H31: ω3   H32: ω2   H34: ω3
  H41: ω4   H42: ω4   H43: ω3
• Here no class wins a clear majority of the pairwise votes, so the region remains ambiguous (“?”)

Lec3: Linear Discriminant Functions 19

Linear Discriminant Function

Multi-Class Problem

• Which approach is better?
  - 1-against-All needs a smaller number of classifiers (c vs c(c−1)/2)
  - 1-against-1 has a smaller ambiguous region
• Any other choice?

Lec3: Linear Discriminant Functions 20

Linear Discriminant Function
Linear Machine for Multi-Class Problem

• How can we avoid the ambiguous regions caused by LDFs (1-against-1 or 1-against-All)?
• One possible solution: use a Linear Machine
  - For each class, define

      g_i(x) = w_i^t x + w_{i0},   i = 1, …, c

  - x belongs to ω_i (region R_i) if g_i(x) > g_j(x) for all j ≠ i
  - Undefined if g_i(x) = g_j(x)
• If R_i and R_j are contiguous, the boundary between them is a portion of the hyperplane H_ij defined by

      g_i(x) = g_j(x),   or   (w_i − w_j)^t x + (w_{i0} − w_{j0}) = 0
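
A small sketch of a linear machine in Python (the weights are illustrative; the point is only that the class with the largest g_i(x) wins, so the space is partitioned with no ambiguous regions except the boundaries g_i = g_j):

import numpy as np

def linear_machine_predict(x, W, w0):
    """Linear machine: c discriminants g_i(x) = w_i^t x + w_i0; assign x to the largest one."""
    g = W @ x + w0
    return int(np.argmax(g))          # boundaries g_i(x) = g_j(x) are the only undefined points

# Illustrative 3-class machine in 2 dimensions
W = np.array([[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]])
w0 = np.array([0.0, -0.5, 0.5])
print(linear_machine_predict(np.array([1.0, 0.2]), W, w0))   # -> 0 for this point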


Lec3: Linear Discriminant Functions 21

Linear Discriminant Function

Linear Machine for Multi-Class Problem

• Advantages:
  - Avoids ambiguous regions
  - Every decision region is singly connected
  - Low complexity: only c discriminants are needed, fewer than the c(c−1)/2 classifiers of 1-against-1

(Figure: linear-machine decision regions for a 3-class problem and a 5-class problem.)

Lec3: Linear Discriminant Functions 22

Linear Discriminant Function (LDF)

• In practice, a problem is seldom linearly separable
• Can we still handle a non-linearly separable problem with an LDF?
  - Yes: use a mapping that converts the non-linearly separable problem into a linearly separable one, i.e., map the points from a lower-dimensional space (where they are not linearly separable) to a higher-dimensional space (where they are linearly separable)

(Figure: points that are non-linearly separable in the low-dimensional space become linearly separable in the high-dimensional space.)

Lec3: Linear Discriminant Functions 23

Generalized Linear Discriminant Function

• Linear Discriminant Function

    g(x) = w_0 + \sum_{i=1}^{d} w_i x_i

• Quadratic Discriminant Function

    g(x) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d} \sum_{j=1}^{d} w_{ij} x_i x_j

• Continue to add higher-order terms:

    \sum_{i=1}^{d} \sum_{j=1}^{d} \sum_{k=1}^{d} w_{ijk} x_i x_j x_k ,   \sum_{i=1}^{d} \sum_{j=1}^{d} \sum_{k=1}^{d} \sum_{l=1}^{d} w_{ijkl} x_i x_j x_k x_l ,   …

Lec3: Linear Discriminant Functions 24

Generalized Linear Discriminant Function

• Generalized Linear Discriminant Function

    g(x) = \sum_{i=1}^{\hat{d}} a_i y_i(x)    or    g(x) = a^t y(x)

  g is not linear in x, but it is linear in y, where
  - a is a \hat{d}-dimensional weight vector
  - the y_i(x) are called φ-functions
    - arbitrary functions of x
    - they map points in the low-dimensional x-space into points in the higher-dimensional y-space
• The homogeneous discriminant function a^t y separates points by a hyperplane that passes through the origin of the transformed space


Lec3: Linear Discriminant Functions 25

Generalized Linear Discriminant Function
Example: Quadratic as Generalized LDF

• The quadratic discriminant function

    g(x) = a_1 + a_2 x + a_3 x^2

• As a generalized LDF

    g(x) = \sum_{i=1}^{3} a_i y_i(x)

  where y_1(x) = 1, y_2(x) = x, y_3(x) = x^2
  - i.e., the 3-dimensional vector

    y(x) = (1, x, x^2)^t
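
A tiny sketch of this φ-mapping in Python (the data set and the weight vector a are made up; the point is that a set which cannot be split by a single threshold on the line becomes separable by one hyperplane in y-space):

import numpy as np

def phi(x):
    """Quadratic phi-mapping from 1-D x to the 3-D vector y = (1, x, x^2)^t."""
    return np.array([1.0, x, x * x])

# Illustrative 1-D data: class 1 for |x| < 1, class 2 otherwise (not separable by one threshold)
xs = np.array([-2.0, -0.5, 0.3, 1.8])
Y = np.stack([phi(x) for x in xs])        # the points now live in 3-D y-space

# In y-space the (made-up) weight vector a = (1, 0, -1) gives a^t y = 1 - x^2,
# which is positive exactly for |x| < 1, so a single hyperplane separates the two classes.
a = np.array([1.0, 0.0, -1.0])
print(Y @ a > 0)                           # [False  True  True False]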

Lec3: Linear Discriminant Functions 26

Generalized Linear Discriminant Function
Example: Quadratic as Generalized LDF

• A plane splits the resulting y-space into regions corresponding to the two categories, which gives a non-simply-connected decision region in the 1-dimensional x-space
• The mapping y = (1, x, x^2)^t takes a line and transforms it into a parabola in 3 dimensions
• Points below the separating plane belong to region ℜ2 and those above it belong to ℜ1

(Figure: not linearly separable in x-space, linearly separable in y-space.)

Lec3: Linear Discriminant Functions 27

Generalized Linear Discriminant Function
Another Example

    g(x) = −x_1 + x_2 + 2 x_1 x_2

• A linear discriminant in the transformed space is a hyperplane Ĥ which cuts the surface
• Points on the positive side of Ĥ (above it) correspond to ω1, and those beneath it correspond to ω2

Lec3: Linear Discriminant Functions 28

Two-Class Linearly Separable Case

• Given
  - A set of n samples y_1, y_2, …, y_n
  - Some belong to ω1 and some to ω2
• Objective
  - Use the samples to determine a weight vector a such that
    - a^t y_i > 0 implies ω1
    - a^t y_i < 0 implies ω2
• If such a weight vector a exists, the samples are said to be linearly separable


Lec3: Linear Discriminant Functions 29

Two-Class Linearly Separable Case

• Normalization
  - Simplifies the setting
  - Replace all samples labeled ω2 by their “negatives”
    - i.e., y_i ← −y_i for every y_i from ω2
  - If the problem is linearly separable, a solution then satisfies a^t y_i > 0 for all samples
  - The class labels can then be ignored
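
A minimal sketch of this normalization step (the feature values and labels are illustrative; the samples are also augmented with a leading 1 so that the bias is absorbed into a):

import numpy as np

def normalize_samples(X, labels):
    """Augment each sample with a leading 1 and negate the omega_2 samples,
    so that a correct weight vector a satisfies a^t y_i > 0 for every sample."""
    Y = np.hstack([np.ones((X.shape[0], 1)), X])     # y = (1, x1, ..., xd)^t
    signs = np.where(labels == 1, 1.0, -1.0)          # +1 for omega_1, -1 for omega_2
    return Y * signs[:, None]

# Illustrative data: two omega_1 samples and two omega_2 samples
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [2.0, 3.0]])
labels = np.array([1, 1, 2, 2])
print(normalize_samples(X, labels))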

Lec3: Linear Discriminant Functions 30

Two-Class Linearly Separable Case

(Figure: raw data vs. normalized data.)
• Raw data: a solution vector separates red from black (a^t y > 0 for black, a^t y < 0 for red); a is at a right angle to the red dotted line
• Normalized data: after negating the ω2 samples, a solution vector places all samples on the same side of the plane
• The solution region is the intersection of n half-spaces; there are infinitely many solutions

Lec3: Linear Discriminant Functions 31

Additional Requirements

• The solution vector (the solution to the decision plane) is not unique. Which one is the best?
• Additional requirements are imposed to constrain the solution vector
• Two possible approaches:
  1. Seek a hyperplane having a larger margin
     - Discussed in the next slides
  2. Seek a smoother and simpler decision hyperplane
     - Regularization, discussed later

Lec3: Linear Discriminant Functions 32

Margin

• Seek the minimum-length weight vector a satisfying

    a^t y_i > b > 0

  where b is called the margin
• The new solution region is insulated from the old boundaries by a distance b / ||y_i||

(Figure: with no margin (b = 0) the solution region touches the old boundaries; with a margin b > 0 it shrinks by b / ||y_i||, like a buffer zone, giving a new boundary inside the old one.)


Lec3: Linear Discriminant Functions 33

Gradient Descent (GD)

• Recall that after normalization (slide 30), sample y_i is classified correctly if a^t y_i > 0
• To find a solution to the set of inequalities a^t y_i > 0:
  - Define a criterion function J(a) that is minimized if a is a solution vector
  - The function J will be defined later

Lec3: Linear Discriminant Functions 34

Gradient Descent (GD)

• Algorithm
  - Start with an arbitrarily chosen weight vector a(1)
  - Compute the gradient vector ∇J(a(1))
  - The next value a(2) is determined by moving some distance from a(1) in the direction of steepest descent, i.e., along the negative of the gradient
• In general,

    a(k+1) = a(k) − η(k) ∇J(a(k)),   where   ∇J = ∂J/∂a

  and η is the learning rate, which controls the size of each step
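
A generic gradient-descent loop, sketched in Python (the criterion J below is a made-up quadratic, used only to show the update a(k+1) = a(k) − η∇J(a(k)) and the stopping test):

import numpy as np

def gradient_descent(grad_J, a0, eta=0.1, theta=1e-6, max_iter=1000):
    """Basic gradient descent: a(k+1) = a(k) - eta * grad J(a(k)).
    Stops when the magnitude of the step eta * grad J falls below the threshold theta."""
    a = np.asarray(a0, dtype=float)
    for _ in range(max_iter):
        step = eta * grad_J(a)
        if np.linalg.norm(step) < theta:     # |eta(k) grad J| < theta (cf. the note on the next slide)
            break
        a = a - step
    return a

# Made-up criterion J(a) = ||a - c||^2 with gradient 2(a - c); the minimizer is a* = c
c = np.array([1.0, -2.0])
print(gradient_descent(lambda a: 2.0 * (a - c), a0=np.zeros(2)))   # converges towards c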

Lec3: Linear Discriminant Functions 35

Gradient Descent (GD)

• Algorithm (illustrated)
  - Note: in step 4, the stopping test should use the absolute value, i.e., |η(k)∇J| < θ

(Figure: the sequence a(0), a(1), a(2), a(3), … descends towards the minimizer a*; the step at iteration k has length η(k)||∇J||.)

Lec3: Linear Discriminant Functions 36

Gradient Descent (GD)

• Related issues:
  - Size of the learning rate (η)
    - Too small: convergence is needlessly slow
    - Too large: the correction process will overshoot and can even diverge
  - Sub-optimal solutions
    - The search can become trapped in a local minimum


Lec3: Linear Discriminant Functions 37

Criterion Function (J)

• Recall that all correctly classified samples should satisfy a^t y_i > 0
• We would like to find the a that yields the smallest error
• Examples of criterion functions J (here Y denotes the set of misclassified samples):
  - Perceptron criterion function:

      J_p(a) = \sum_{y \in Y} (−a^t y)

  - Squared error function:

      J_q(a) = \sum_{y \in Y} (a^t y)^2

• If all samples are classified correctly (i.e., the set Y is empty), J equals 0
• If samples are misclassified (i.e., a^t y_i < 0 for those samples), J is positive

Lec3: Linear Discriminant Functions 38

Criterion Function (J)

• Gradients and update rules for the two criterion functions (Y is the set of misclassified samples):

  Perceptron criterion function:
      J_p(a) = \sum_{y \in Y} (−a^t y)
  - Gradient:      ∇J_p(a) = \sum_{y \in Y} (−y)
  - Update rule:   a(k+1) = a(k) + η(k) \sum_{y \in Y} y

  Squared error function:
      J_q(a) = \sum_{y \in Y} (a^t y)^2
  - Gradient:      ∇J_q(a) = \sum_{y \in Y} 2 (a^t y) y
  - Update rule:   a(k+1) = a(k) − η(k) \sum_{y \in Y} 2 (a^t y) y
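
A minimal batch-perceptron sketch of the first update rule (the toy samples are made up and already normalized as on the earlier slides):

import numpy as np

def batch_perceptron(Y, eta=1.0, max_iter=1000):
    """Batch perceptron update: a(k+1) = a(k) + eta * sum of the misclassified samples.
    Y holds normalized samples as rows, so y is misclassified when a^t y <= 0."""
    a = np.zeros(Y.shape[1])
    for _ in range(max_iter):
        mis = Y[Y @ a <= 0]              # the set Y of misclassified samples in J_p
        if len(mis) == 0:                 # J_p(a) = 0: every sample satisfies a^t y > 0
            break
        a = a + eta * mis.sum(axis=0)
    return a

# Made-up, clearly separable normalized samples (augmented with a leading 1 and sign-flipped)
Y = np.array([[1.0, 2.0, 0.0], [1.0, 3.0, 1.0], [-1.0, 2.0, 0.0], [-1.0, 3.0, 1.0]])
a = batch_perceptron(Y)
print(a, Y @ a > 0)                       # a separating vector; all inner products positive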

Lec3: Linear Discriminant Functions 39

Criterion Function (J)
Sum-of-squared-error Function

• The perceptron criterion function and the squared error function
  - focus only on the misclassified samples
  - merely hope that all of the inner products a^t y_i become positive
• The sum-of-squared-error function instead handles the margin situation a^t y_i = b_i (slide 32):

    J_s(a) = \sum_{i=1}^{n} (a^t y_i − b_i)^2

  where the b_i are some arbitrarily specified positive constants
• The problem is the more stringent, but better understood, one of finding a solution to a set of linear equations (rather than to a set of linear inequalities)

Lec3: Linear Discriminant Functions 40

Criterion Function (J)
Sum-of-squared-error Function

• For all samples y_1, y_2, …, y_n we want to find a weight vector a such that

    a^t y_i = b_i

  where the b_i are some arbitrarily specified positive constants
• Matrix notation:

    \begin{pmatrix} y_{10} & y_{11} & \cdots & y_{1\hat{d}} \\ y_{20} & y_{21} & \cdots & y_{2\hat{d}} \\ \vdots & \vdots & & \vdots \\ y_{n0} & y_{n1} & \cdots & y_{n\hat{d}} \end{pmatrix} \begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_{\hat{d}} \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix}

  or simply  Ya = b


Lec3: Linear Discriminant Functions 41

Criterion Function (J)
Sum-of-squared-error Function

• Since Ya = b, a could be obtained directly by inversion if Y were square and nonsingular:

    a = Y^{-1} b

• Usually, however, Y is rectangular, with more rows than columns
  - more samples than features, i.e., more equations than unknowns
  - the system for a is over-determined
  - in general, no exact solution exists

Lec3: Linear Discriminant Functions 42

Criterion Function (J)
Sum-of-squared-error Function: Pseudoinverse Method

• Error vector:

    e = Ya − b

• Sum-of-squared-error function:

    J_s(a) = ||Ya − b||^2 = \sum_{i=1}^{n} (a^t y_i − b_i)^2

• Gradient:

    ∇J_s(a) = \sum_{i=1}^{n} 2 (a^t y_i − b_i) y_i = 2 Y^t (Ya − b)

• Necessary condition:

    Y^t Y a = Y^t b

• a can then be solved for uniquely:

    a = (Y^t Y)^{-1} Y^t b = Y' b,   where   Y' = (Y^t Y)^{-1} Y^t
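
A small numerical sketch of the pseudoinverse solution a = Y'b using NumPy; Y and b are taken from the worked example on the next slide, and np.linalg.pinv computes the Moore-Penrose pseudoinverse, which coincides with (Y^t Y)^{-1} Y^t when Y^t Y is nonsingular:

import numpy as np

# Normalized, augmented samples and margins from the example on the next slide
Y = np.array([[ 1.0,  1.0,  2.0],
              [ 1.0,  2.0,  0.0],
              [-1.0, -3.0, -1.0],
              [-1.0, -2.0, -3.0]])
b = np.ones(4)                       # b_i = 1

a = np.linalg.pinv(Y) @ b            # a = Y' b
print(a)                             # approximately [ 3.667 -1.333 -0.667 ], i.e., (11/3, -4/3, -2/3)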

Lec3: Linear Discriminant Functions 43

Criterion Function (J)
Sum-of-squared-error Function

• Remarks
  - Y^t Y is not always nonsingular
  - Y' should therefore be defined more generally by

      Y' ≡ lim_{ε→0} (Y^t Y + εI)^{-1} Y^t

  - The solution depends on the choice of b

Lec3: Linear Discriminant Functions 44

CF: Sum-of-squared-error function
(example: find a by pseudoinverse)

• Given the samples

    x1   x2   ω
    1    2    1
    2    0    1
    3    1    2
    2    3    2

• y and b are defined as y = (1, x1, x2)^t and b = (1, 1, 1, 1)^t
• Normalizing the class-2 samples (refer to slide 28):

    Y = \begin{pmatrix} 1 & 1 & 2 \\ 1 & 2 & 0 \\ -1 & -3 & -1 \\ -1 & -2 & -3 \end{pmatrix}

• Hence

    Y' = (Y^t Y)^{-1} Y^t = \begin{pmatrix} 5/4 & 13/12 & 3/4 & 7/12 \\ -1/2 & -1/6 & -1/2 & -1/6 \\ 0 & -1/3 & 0 & -1/3 \end{pmatrix}

    a = Y' b = (11/3, −4/3, −2/3)^t

• The decision boundary a^t (1, x1, x2)^t = 0 then gives 4x1 + 2x2 = 11

(Figure: the four samples and the separating line 4x1 + 2x2 = 11.)
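
A quick check of the boundary found above (same numbers as on this slide):

import numpy as np

# The boundary a^t y = 0 with a = (11/3, -4/3, -2/3) is 4x1 + 2x2 = 11.
a = np.array([11/3, -4/3, -2/3])
samples = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [2.0, 3.0]])   # the four points above
g = a[0] + samples @ a[1:]
print(g)    # [ 1.  1. -1. -1.]: positive for the two omega_1 samples, negative for the omega_2 samples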


Lec3: Linear Discriminant Functions 45

CF: Sum-of-squared-error function
Least-Mean-Squared (LMS)

• Besides the pseudoinverse method, gradient descent can also be used to minimize J_s(a)
• The resulting algorithm is called Least-Mean-Squared (LMS)
• Advantages over the pseudoinverse:
  - Avoids the problem that arises when Y^t Y is singular
  - Avoids the need to work with large matrices
  - The computation involved is a feedback scheme that copes with round-off or truncation errors
• Disadvantage
  - Longer training time

Lec3: Linear Discriminant Functions 46

CF: Sum-of-squared-error function
Least-Mean-Squared (LMS)

• Recall

    J_s(a) = \sum_{i=1}^{n} (a^t y_i − b_i)^2

• Gradient:

    ∇J_s(a) = \sum_{i=1}^{n} 2 (a^t y_i − b_i) y_i

• Update rule (one sample at a time):

    a(k+1) = a(k) + η(k) (b_i − a^t y_i) y_i
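
A minimal LMS sketch (same normalized samples and margins as the pseudoinverse example; for simplicity a fixed small learning rate is used here, whereas a decreasing η(k) is also common):

import numpy as np

def lms(Y, b, eta=0.1, n_epochs=1000):
    """Sample-by-sample LMS: a <- a + eta * (b_i - a^t y_i) * y_i, cycling through the samples."""
    a = np.zeros(Y.shape[1])
    for _ in range(n_epochs):
        for y_i, b_i in zip(Y, b):
            a = a + eta * (b_i - a @ y_i) * y_i
    return a

Y = np.array([[ 1.0,  1.0,  2.0],
              [ 1.0,  2.0,  0.0],
              [-1.0, -3.0, -1.0],
              [-1.0, -2.0, -3.0]])
b = np.ones(4)
print(lms(Y, b))   # approaches the MSE solution (11/3, -4/3, -2/3) found by the pseudoinverse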

Lec3: Linear Discriminant Functions 47

CF: Sum-of-squared-error function
Least-Mean-Squared (LMS)

• LMS need not converge to a separating hyperplane (one that separates the samples perfectly), even if such a hyperplane exists
• It minimizes the sum of the squares of the distances of the training points to the hyperplane
• In the example shown, the LMS plane is different from the separating hyperplane

(Figure: LMS solution vs. a separating hyperplane.)

Lec3: Linear Discriminant Functions 48

Relationship: MSE vs Bayes

(Figure: class-conditional densities, posteriors, the Bayes discriminant function (grey line), and the MSE solution (dotted line); the MSE solution is the best approximation in the region of the data points, i.e., where g(x) ≈ g_0(x).)

• If b = 1 (i.e., all margins b_i = 1), the MSE solution approaches a minimum mean-squared-error approximation to the Bayes discriminant function g_0(x)