
3rd NOSE Short Course
Alpbach, 21st – 26th Mar 2004

Statistical classifiers: Bayesian decision theory and density estimation

Ricardo Gutierrez-Osuna

Department of Computer Science, Texas A&M University

[email protected]  http://research.cs.tamu.edu/prism


Outline

o Chapter 1: Review of pattern classification
o Chapter 2: Review of probability theory
o Chapter 3: Bayesian decision theory
o Chapter 4: Quadratic classifiers
o Chapter 5: Kernel density estimation
o Chapter 6: Nearest neighbors
o Chapter 7: Perceptron and least-squares classifiers


CHAPTER 1: Review of pattern classification

o Features and patterns


Features and patterns (1)

o Feature
  • A feature is any distinctive aspect, quality or characteristic
  • Features may be symbolic (e.g., color) or numeric (e.g., height)
  • Feature vector: the combination of d features is represented as a d-dimensional column vector
  • Feature space: the d-dimensional space defined by the feature vector
  • Scatter plot: representation of an object collection in feature space

[Figure: the feature vector $x = [x_1\ x_2\ \cdots\ x_d]^T$, the feature space spanned by (x1, x2, x3), and a scatter plot of Classes 1–3 against Feature 1 and Feature 2]


Features and patterns (2)

o Pattern
  • A pattern is a composite of traits or features characteristic of an individual
  • In classification tasks, a pattern is a pair of variables {x, ω} where
    • x is a collection of observations or features (feature vector)
    • ω is the concept behind the observation (label)


Features and patterns (3)

o What makes a “good” feature vector?
  • The quality of a feature vector is related to its ability to discriminate examples from different classes
    • Examples from the same class should have similar feature values
    • Examples from different classes should have different feature values
  • More feature properties

[Figure: examples of “good” features (linear separability) and “bad” features (non-linear separability, highly correlated features, multi-modal classes)]


Classifiers

o The task of a classifier is to partition feature space into class-labeled decision regions
  • Borders between decision regions are called decision boundaries
  • The classification of a feature vector x consists of determining which decision region it belongs to, and assigning x to that class
o In this lecture we will overview two methodologies for designing classifiers
  • Based on the underlying probability density functions of the data
  • Based on geometric pattern-separability criteria

[Figure: examples of feature spaces partitioned into decision regions R1–R4]


CHAPTER 2: Review of probability theory

o What is a probability
o Probability density functions
o Conditional probability
o Bayes theorem
o Probabilistic reasoning: a case example


Basic probability concepts

• Probabilities are numbers assigned to events that indicate “how likely” it is that the event will occur when a random experiment is performed

• A probability law for a random experiment is a rule that assigns probabilities to the events in the experiment

• The sample space S of a random experiment is the set of all possible outcomes

[Figure: a sample space S containing events A1–A4, and the probability law assigning a probability to each event]


Conditional probability (1)

o If A and B are two events, the probability of event A when we already know that event B has occurred is defined by the relation

$$P[A|B] = \frac{P[A \cap B]}{P[B]}, \qquad P[B] > 0$$

o This conditional probability P[A|B] is read
  • the “conditional probability of A conditioned on B”, or simply
  • the “probability of A given B”


Conditional probability (2)

o Interpretation
  • The new evidence “B has occurred” has the following effects
    • The original sample space S (the whole square) becomes B (the rightmost circle)
    • The event A becomes A∩B
    • P[B] simply re-normalizes the probability of events that occur jointly with B

[Figure: Venn diagrams before and after the evidence “B has occurred”: the sample space shrinks from S to B, and the event A becomes A∩B]


Theorem of total probability

o Let B1, B2, …, BN be a partition of S, a set of mutually exclusive events such that

$$S = B_1 \cup B_2 \cup \dots \cup B_N$$

  • Any event A can then be represented as

$$A = A \cap S = A \cap (B_1 \cup B_2 \cup \dots \cup B_N) = (A \cap B_1) \cup (A \cap B_2) \cup \dots \cup (A \cap B_N)$$

  • Since B1, B2, …, BN are mutually exclusive then, by Axiom III,

$$P[A] = P[A \cap B_1] + P[A \cap B_2] + \dots + P[A \cap B_N]$$

  • and, therefore

$$P[A] = P[A|B_1]\,P[B_1] + \dots + P[A|B_N]\,P[B_N] = \sum_{k=1}^{N} P[A|B_k]\,P[B_k]$$

[Figure: the sample space S partitioned into B1, B2, …, BN, with the event A overlapping several of the Bk]


Bayes theorem

o Given {B1, B2, …, BN}, a partition of the sample space S. Suppose that event A occurs; what is the probability of event Bj?

• Using the definition of conditional probability and the theorem of total probability we obtain

• This is known as Bayes theorem or Bayes rule, and is (one of) the most useful relations in probability and statistics

$$P[B_j|A] = \frac{P[A \cap B_j]}{P[A]} = \frac{P[A|B_j]\,P[B_j]}{\sum_{k=1}^{N} P[A|B_k]\,P[B_k]}$$


Applying Bayes theorem (1)

o Consider a clinical problem where we need to decide if a patient has a particular medical condition on the basis of an imperfect test
  • Someone with the condition may go undetected (false negative)
  • Someone free of the condition may yield a positive result (false positive)
o Nomenclature
  • SPECIFICITY: the true-negative rate P(NEG|¬COND) of a test
  • SENSITIVITY: the true-positive rate P(POS|COND) of a test


Applying Bayes theorem (2)

o PROBLEM
  • Assume a population of 10,000 where 1 out of every 100 people has the medical condition
  • Assume that we design a test with 98% specificity P(NEG|¬COND) and 90% sensitivity P(POS|COND)
o Assume you take the test, and it yields a POSITIVE result
o What is the probability that you have the medical condition?


Applying Bayes theorem (3)

o SOLUTION A: Fill in the joint frequency table below
  • The answer is the ratio of individuals with the condition to total individuals (considering only individuals that tested positive), or 90/288 = 0.3125

                      TEST IS POSITIVE                 TEST IS NEGATIVE                ROW TOTAL
  HAS CONDITION       True positive P(POS|COND)        False negative P(NEG|COND)      100
                      100×0.90 = 90                    100×(1−0.90) = 10
  FREE OF CONDITION   False positive P(POS|¬COND)      True negative P(NEG|¬COND)      9,900
                      9,900×(1−0.98) = 198             9,900×0.98 = 9,702
  COLUMN TOTAL        288                              9,712                           10,000


Applying Bayes theorem (4)

o SOLUTION B: Apply Bayes theorem

$$P[COND|POS] = \frac{P[POS|COND]\,P[COND]}{P[POS]}
= \frac{P[POS|COND]\,P[COND]}{P[POS|COND]\,P[COND] + P[POS|\neg COND]\,P[\neg COND]}
= \frac{0.90 \cdot 0.01}{0.90 \cdot 0.01 + (1-0.98) \cdot 0.99} = 0.3125$$
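The same computation can be scripted. The following is a minimal sketch (not part of the original slides) that reproduces Solution B with the numbers stated above:

```python
# Hypothetical check of the medical-test example with Bayes' rule
p_cond = 0.01          # prior: 1 in 100 has the condition
sensitivity = 0.90     # P(POS | COND)
specificity = 0.98     # P(NEG | not COND)

# theorem of total probability for P(POS), then Bayes' rule
p_pos = sensitivity * p_cond + (1 - specificity) * (1 - p_cond)
p_cond_given_pos = sensitivity * p_cond / p_pos

print(p_cond_given_pos)   # 0.3125
```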


Bayes theorem and pattern classification

o For the purpose of pattern classification, Bayes theorem is normally expressed as

$$P[\omega_j|x] = \frac{P[x|\omega_j]\,P[\omega_j]}{\sum_{k=1}^{N} P[x|\omega_k]\,P[\omega_k]} = \frac{P[x|\omega_j]\,P[\omega_j]}{P[x]}$$

  • where ωj is the j-th class and x is the feature vector

o Bayes theorem is relevant because, as we will see in a minute, a sensible classification rule is to choose the class ωi with the highest P[ωi|x]
  • This represents the intuitive rationale of choosing the class that is more “likely” given the observed feature vector x


Bayes theorem and pattern classification

o Each term in Bayes theorem has a special name, which you should become familiar with
  • P[ωj]: prior probability (of class ωj)
  • P[ωj|x]: posterior probability (of class ωj given the observation x)
  • P[x|ωj]: likelihood (conditional probability of observation x given class ωj)
  • P[x]: a normalization constant (does not affect the decision)


CHAPTER 3: Bayesian Decision Theory

o The Likelihood Ratio Test
o The Probability of Error
o The Bayes Risk
o Bayes, MAP and ML Criteria
o Multi-class problems
o Discriminant Functions


The Likelihood Ratio Test (1)

o Assume we are to classify an object based on the evidence provided by a measurement (or feature vector) x

o Would you agree that a reasonable decision rule would be the following?

• "Choose the class that is most ‘probable’ given the observed feature vector x”

• More formally: Evaluate the posterior probability of each class P(ωi|x) and choose the class with largest P(ωi|x)


The Likelihood Ratio Test (2)

o Let us examine this decision rule for a 2-class problem
  • In this case the decision rule becomes

    if P(ω1|x) > P(ω2|x) choose ω1, else choose ω2

  • Or, in a more compact form

$$P(\omega_1|x) \underset{\omega_2}{\overset{\omega_1}{\gtrless}} P(\omega_2|x)$$

  • Applying Bayes theorem

$$\frac{P(x|\omega_1)\,P(\omega_1)}{P(x)} \underset{\omega_2}{\overset{\omega_1}{\gtrless}} \frac{P(x|\omega_2)\,P(\omega_2)}{P(x)}$$


The Likelihood Ratio Test (3)

  • P(x) does not affect the decision rule, so it can be eliminated*. Rearranging the previous expression

$$\Lambda(x) = \frac{P(x|\omega_1)}{P(x|\omega_2)} \underset{\omega_2}{\overset{\omega_1}{\gtrless}} \frac{P(\omega_2)}{P(\omega_1)}$$

  • The term Λ(x) is called the likelihood ratio, and the decision rule is known as the likelihood ratio test

*P(x) can be disregarded in the decision rule since it is constant regardless of class ωi. However, P(x) will be needed if we want to estimate the posterior P(ωi|x) which, unlike P(x|ωi)P(ωi), is a true probability value and, therefore, gives us an estimate of the “goodness” of our decision.


Likelihood Ratio Test: an example (1)

o Given a classification problem with the following class conditional densities:

o Derive a classification rule based on the Likelihood Ratio Test (assume equal priors)

$$P(x|\omega_1) = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}(x-4)^2}
\qquad
P(x|\omega_2) = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}(x-10)^2}$$

[Figure: the two likelihoods P(x|ω1) and P(x|ω2), unit-variance Gaussians centered at x = 4 and x = 10]


Likelihood Ratio Test: an example (2)

o Solution
  • Substituting the given likelihoods and priors into the LRT expression:

$$\Lambda(x) = \frac{\frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}(x-4)^2}}{\frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}(x-10)^2}} \underset{\omega_2}{\overset{\omega_1}{\gtrless}} 1$$

  • Simplifying, changing signs and taking logs:

$$(x-4)^2 - (x-10)^2 \underset{\omega_1}{\overset{\omega_2}{\gtrless}} 0$$

  • Which yields:

$$x \underset{\omega_1}{\overset{\omega_2}{\gtrless}} 7$$

  • This LRT result makes intuitive sense since the likelihoods are identical and differ only in their mean value

[Figure: the two likelihoods centered at 4 and 10, with decision regions R1 (say ω1) for x < 7 and R2 (say ω2) for x > 7]
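As a quick sanity check (not part of the original slides), the x = 7 threshold can be verified numerically. The sketch below assumes SciPy is available for the Gaussian pdf:

```python
# Hypothetical numeric check of the LRT example: two unit-variance Gaussians
# centered at 4 and 10, equal priors; the decision switches at x = 7.
from scipy.stats import norm

def lrt_decision(x):
    ratio = norm.pdf(x, loc=4, scale=1) / norm.pdf(x, loc=10, scale=1)
    return "w1" if ratio > 1 else "w2"

for x in (6.9, 7.1):
    print(x, lrt_decision(x))   # 6.9 -> w1, 7.1 -> w2
```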


The probability of error

o Prob. of error is “the probability of assigning x to the wrong class”
  • For a two-class problem, P[error|x] is simply

$$P(error|x) = \begin{cases} P(\omega_1|x) & \text{if we decide } \omega_2\\ P(\omega_2|x) & \text{if we decide } \omega_1 \end{cases}$$

  • It makes sense that the classification rule be designed to minimize the average prob. of error P[error] across all possible values of x

$$P(error) = \int_{-\infty}^{+\infty} P(error, x)\,dx = \int_{-\infty}^{+\infty} P(error|x)\,P(x)\,dx$$

  • To minimize P(error) we minimize the integrand P(error|x) at each x: choose the class with maximum posterior P(ωi|x)
  • This is called the MAXIMUM A POSTERIORI (MAP) RULE


Minimizing probability of error

o We “prove” the optimality of the MAP rule graphically
  • The right plot shows the posterior for each of the two classes
  • The bottom plots show the P(error) for the MAP rule and an alternative decision rule
  • Which one has lower P(error) (color-filled area)?

[Figure: posteriors P(ωi|x) for two classes; below, the decision regions (choose RED / choose BLUE / choose RED) for the MAP rule and for the “other” rule, with the resulting P(error) shown as the shaded area]


The Bayes Risk (1)

o So far we have assumed that the penalty of misclassifying a class ω1 example as class ω2 is the same as that of the reverse error
o In general, this is not the case:
  • For example, misclassifying a cancer sufferer as a healthy patient is a much more serious problem than the other way around
  • Misclassifying salmon as sea bass has lower cost (unhappy customers) than the opposite error
o This concept can be formalized in terms of a cost function Cij
  • Cij represents the cost of choosing class ωi when class ωj is the true class
o We define the Bayes Risk as the expected value of the cost

$$\Re = E[C] = \sum_{i=1}^{2}\sum_{j=1}^{2} C_{ij}\,P[\text{choose } \omega_i \text{ and } x\in\omega_j]
= \sum_{i=1}^{2}\sum_{j=1}^{2} C_{ij}\,P[x\in R_i|\omega_j]\,P[\omega_j]$$


The Bayes Risk (2)

o What is the decision rule that minimizes the Bayes Risk?
  • It can be shown* that the minimum risk can be achieved by using the following decision rule:

$$\frac{P(x|\omega_1)}{P(x|\omega_2)} \underset{\omega_2}{\overset{\omega_1}{\gtrless}} \frac{(C_{12}-C_{22})}{(C_{21}-C_{11})}\,\frac{P[\omega_2]}{P[\omega_1]}$$

  • *For an intuitive proof visit my lecture notes at TAMU
o Notice any similarities with the LRT?


The Bayes Risk: an example (1)

o Consider a classification problem with two classes defined by the following likelihood functions

$$P(x|\omega_1) = \frac{1}{\sqrt{2\pi\cdot 3}}\,e^{-\frac{x^2}{2\cdot 3}}
\qquad
P(x|\omega_2) = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}(x-2)^2}$$

o What is the decision rule that minimizes the Bayes Risk?
  • Assume P[ω1] = P[ω2] = 0.5, C11 = C22 = 0, C12 = 1 and C21 = √3

[Figure: the two likelihoods, a zero-mean Gaussian with variance 3 and a unit-variance Gaussian centered at x = 2]


The Bayes Risk: an example (2)

o Solution: substituting into the Bayes-risk version of the LRT

$$\Lambda(x) = \frac{\frac{1}{\sqrt{2\pi\cdot 3}}\,e^{-\frac{x^2}{6}}}{\frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}(x-2)^2}}
\underset{\omega_2}{\overset{\omega_1}{\gtrless}} \frac{(C_{12}-C_{22})\,P[\omega_2]}{(C_{21}-C_{11})\,P[\omega_1]} = \frac{1}{\sqrt{3}}
\;\Rightarrow\;
e^{-\frac{x^2}{6}+\frac{1}{2}(x-2)^2} \underset{\omega_2}{\overset{\omega_1}{\gtrless}} 1
\;\Rightarrow\;
2x^2 - 12x + 12 \underset{\omega_2}{\overset{\omega_1}{\gtrless}} 0
\;\Rightarrow\;
x = 3 \pm \sqrt{3} = 4.73,\ 1.27$$

  • The rule therefore chooses ω1 for x < 1.27 or x > 4.73, and ω2 in between

[Figure: the two likelihoods and the resulting decision regions R1, R2, R1 with boundaries at x ≈ 1.27 and x ≈ 4.73]
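A numeric check of this result (not in the original slides): evaluate the likelihood ratio against the cost threshold 1/√3 on both sides of the boundaries at x ≈ 1.27 and x ≈ 4.73. SciPy is assumed for the Gaussian pdf:

```python
# Hypothetical check of the Bayes-risk example: N(0, 3) vs N(2, 1), equal priors,
# C11 = C22 = 0, C12 = 1, C21 = sqrt(3); the risk-minimizing rule compares the
# likelihood ratio against (C12 - C22) P[w2] / ((C21 - C11) P[w1]) = 1/sqrt(3).
import numpy as np
from scipy.stats import norm

threshold = 1 / np.sqrt(3)

def decide(x):
    lam = norm.pdf(x, 0, np.sqrt(3)) / norm.pdf(x, 2, 1)
    return "w1" if lam > threshold else "w2"

for x in (1.0, 1.5, 4.5, 5.0):
    print(x, decide(x))   # w1, w2, w2, w1 -> boundaries near 1.27 and 4.73
```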


Variations of the LRT

o The LRT that minimizes the Bayes Risk is called the Bayes Criterion

$$\Lambda(x) = \frac{P(x|\omega_1)}{P(x|\omega_2)} \underset{\omega_2}{\overset{\omega_1}{\gtrless}} \frac{(C_{12}-C_{22})\,P[\omega_2]}{(C_{21}-C_{11})\,P[\omega_1]}$$

o Many times we will simply be interested in minimizing P[error], which is a special case of the Bayes Criterion if we use a zero-one cost function
  • This version of the LRT is referred to as the Maximum A Posteriori (MAP) Criterion, since it seeks to maximize the posterior P(ωi|x)

$$C_{ij}=\begin{cases}0 & i=j\\ 1 & i\ne j\end{cases}
\;\Rightarrow\;
\Lambda(x) = \frac{P(x|\omega_1)}{P(x|\omega_2)} \underset{\omega_2}{\overset{\omega_1}{\gtrless}} \frac{P(\omega_2)}{P(\omega_1)}
\;\Longleftrightarrow\;
P(\omega_1|x) \underset{\omega_2}{\overset{\omega_1}{\gtrless}} P(\omega_2|x)$$

o Finally, for the case of equal priors P[ωi] = 1/2 and a zero-one cost function, the LRT is called the Maximum Likelihood (ML) Criterion, since it will maximize the likelihood P(x|ωi)

$$C_{ij}=\begin{cases}0 & i=j\\ 1 & i\ne j\end{cases},\quad P(\omega_i)=\tfrac{1}{C}\ \forall i
\;\Rightarrow\;
\Lambda(x) = \frac{P(x|\omega_1)}{P(x|\omega_2)} \underset{\omega_2}{\overset{\omega_1}{\gtrless}} 1$$


Multi-class problems

o The previous decisions were derived for two-class problems, but generalize gracefully to multiple classes:
  • To minimize P[error], choose the class ωi with highest P[ωi|x]

$$\omega_i = \arg\max_{1\le i\le C} P(\omega_i|x)$$

  • To minimize the Bayes risk, choose the class ωi with lowest ℜ(ωi|x)

$$\omega_i = \arg\min_{1\le i\le C} \Re(\omega_i|x) = \arg\min_{1\le i\le C} \sum_{j=1}^{C} C_{ij}\,P(\omega_j|x)$$


Discriminant functions (1)

o Note that all the decision rules have the same structure
  • At each point x in feature space, choose the class ωi which maximizes (or minimizes) some measure gi(x)
  • This structure can be formalized with a set of discriminant functions gi(x), i = 1..C, and the following decision rule

    “assign x to class ωi if gi(x) > gj(x) ∀ j ≠ i”

  • We can then express the three basic decision rules (Bayes, MAP and ML) in terms of discriminant functions:

    Criterion   Discriminant function
    Bayes       gi(x) = −ℜ(ωi|x)
    MAP         gi(x) = P(ωi|x)
    ML          gi(x) = P(x|ωi)


Discriminant functions (2)

o Therefore, we can visualize the decision rule as a network that computes C discriminant functions and selects the category corresponding to the largest discriminant

[Figure: network view of the decision rule: the features x1, x2, …, xd feed the discriminant functions g1(x), g2(x), …, gC(x) (which may incorporate costs); a “select max” stage produces the class assignment]


Recapping…

o The LRT is a theoretical result that can only be applied if we have complete knowledge of the likelihoods P[x|ωi]
  • P[x|ωi] is generally unknown, but can be estimated from data
  • If the form of the likelihood is known (e.g., Gaussian), the problem is simplified because we only need to estimate the parameters of the model (e.g., mean and covariance)
    • This leads to a classifier known as QUADRATIC, which we cover next
  • If the form of the likelihood is unknown, the problem becomes much harder, and requires a technique known as non-parametric density estimation
    • This technique is covered in the final chapters of this lecture


CHAPTER 4: Quadratic classifiers

o Bayes classifiers for normally distributed classes
o The Euclidean-distance classifier
o The Mahalanobis-distance classifier
o Numerical example


The Normal or Gaussian distribution

o Remember that the univariate Normal distribution N(µ,σ) is

$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right]$$

o Similarly, the multivariate Normal distribution N(µ,Σ) is defined as

$$f_X(x) = \frac{1}{(2\pi)^{n/2}\left|\Sigma\right|^{1/2}}\exp\!\left[-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right]$$

o Gaussian pdfs are very popular since
  • The parameters (µ,Σ) are sufficient to uniquely characterize the pdf
  • If the xi's are mutually uncorrelated (cik = 0), then they are also independent
    • The covariance matrix becomes a diagonal matrix, with the individual variances in the main diagonal

[Figure: two univariate Gaussians, p(x) for µ=2, σ=3 and for µ=6, σ=1, and scatter plots of bivariate Gaussian data in the (x1, x2) plane]


Covariance matrix

o The covariance matrix indicates the tendency of each pair of features (dimensions in a random vector) to vary together, i.e., to co-vary*
o The covariance has several important properties
  • If xi and xk tend to increase together, then cik > 0
  • If xi tends to decrease when xk increases, then cik < 0
  • If xi and xk are uncorrelated, then cik = 0
  • |cik| ≤ σiσk, where σi is the standard deviation of xi
  • cii = σi² = VAR(xi)
o The covariance terms can be expressed as

$$c_{ii} = \sigma_i^{2} \qquad\text{and}\qquad c_{ik} = \rho_{ik}\,\sigma_i\sigma_k$$

  • where ρik is called the correlation coefficient

[Figure: scatter plots of (xi, xk) for cik = −σiσk (ρik = −1), cik = −½σiσk (ρik = −½), cik = 0 (ρik = 0), cik = +½σiσk (ρik = +½), and cik = σiσk (ρik = +1)]

*from http://www.engr.sjsu.edu/~knapp/HCIRODPR/PR_home.htm


Bayes classifier for Gaussian classes (1)

o For Normally distributed classes, the DFs can be reduced to very simple expressions
  • The (multivariate) Gaussian density can be defined as

$$p(x) = \frac{1}{(2\pi)^{n/2}\left|\Sigma\right|^{1/2}}\exp\!\left[-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right]$$

  • Using Bayes rule, the MAP DF can be written as

$$g_i(x) = P(\omega_i|x) = \frac{P(x|\omega_i)\,P(\omega_i)}{P(x)}
= \frac{1}{(2\pi)^{n/2}\left|\Sigma_i\right|^{1/2}}\exp\!\left[-\frac{1}{2}(x-\mu_i)^{T}\Sigma_i^{-1}(x-\mu_i)\right]\frac{P(\omega_i)}{P(x)}$$


Bayes classifier for Gaussian classes (2)

  • Eliminating constant terms

$$g_i(x) = \left|\Sigma_i\right|^{-1/2}\exp\!\left[-\frac{1}{2}(x-\mu_i)^{T}\Sigma_i^{-1}(x-\mu_i)\right]P(\omega_i)$$

  • Taking logs

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^{T}\Sigma_i^{-1}(x-\mu_i) - \frac{1}{2}\log\left|\Sigma_i\right| + \log P(\omega_i)$$

  • This is known as a QUADRATIC discriminant function (because it is a function of the square of x)
  • In the next few slides we will analyze what happens to this expression under different assumptions for the covariance
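For concreteness, a minimal sketch (not from the slides) of this quadratic discriminant in NumPy; the function names are illustrative only:

```python
# Hypothetical implementation of the quadratic discriminant for Gaussian classes,
# using the log form derived above.
import numpy as np

def quadratic_discriminant(x, mu, sigma, prior):
    """g_i(x) = -1/2 (x-mu)^T Sigma_i^-1 (x-mu) - 1/2 log|Sigma_i| + log P(w_i)."""
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(sigma) @ diff
            - 0.5 * np.log(np.linalg.det(sigma))
            + np.log(prior))

def classify(x, mus, sigmas, priors):
    scores = [quadratic_discriminant(x, m, s, p) for m, s, p in zip(mus, sigmas, priors)]
    return int(np.argmax(scores))   # index of the class with the largest discriminant
```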


Case 1: Σi=σ2I (1)

o This situation occurs when the features are statistically independent and have the same variance for all classes
  • In this case, the quadratic discriminant function becomes

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^{T}(\sigma^{2}I)^{-1}(x-\mu_i) - \frac{1}{2}\log\left|\sigma^{2}I\right| + \log P(\omega_i)
= -\frac{1}{2\sigma^{2}}(x-\mu_i)^{T}(x-\mu_i) + \log P(\omega_i)$$

  • Assuming equal priors and dropping constant terms

$$g_i(x) = -(x-\mu_i)^{T}(x-\mu_i) = -\sum_{j=1}^{DIM}\left(x_j-\mu_{i,j}\right)^{2}$$

  • This is called a Euclidean-distance or nearest-mean classifier

From [Schalkoff, 1992]


Case 1: Σi=σ²I (2)

o This is probably the simplest statistical classifier that you can build:
  • “Assign an unknown example to the class whose center is the closest, using the Euclidean distance”
  • How valid is the assumption Σi=σ²I in chemical sensor arrays?

[Figure: block diagram of the Euclidean-distance classifier: the Euclidean distances from x to µ1, …, µC feed a minimum selector that outputs the class]


Case 1: Σi=σ2I, example

$$\mu_1 = [3\ 2]^{T},\quad \mu_2 = [7\ 4]^{T},\quad \mu_3 = [2\ 5]^{T},\qquad
\Sigma_1 = \Sigma_2 = \Sigma_3 = \begin{bmatrix}2&0\\0&2\end{bmatrix}$$
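A minimal sketch (not from the slides) of the resulting nearest-mean classifier, assuming the means as reconstructed above:

```python
# Hypothetical Euclidean-distance (nearest-mean) classifier for the Case 1 example.
import numpy as np

means = np.array([[3, 2], [7, 4], [2, 5]], dtype=float)   # mu_1, mu_2, mu_3 (assumed)

def nearest_mean(x):
    d2 = np.sum((means - x) ** 2, axis=1)    # squared Euclidean distance to each mean
    return int(np.argmin(d2)) + 1            # class index 1..3

print(nearest_mean(np.array([6.0, 4.0])))    # closest to mu_2 -> 2
```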


Case 2: Σi=Σ (Σ non-diagonal)

o All the classes have the same covariance matrix, but the matrix is not diagonal
  • In this case, the quadratic discriminant becomes

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^{T}\Sigma^{-1}(x-\mu_i) - \frac{1}{2}\log\left|\Sigma\right| + \log P(\omega_i)$$

  • Assuming equal priors and eliminating constant terms

$$g_i(x) = -(x-\mu_i)^{T}\Sigma^{-1}(x-\mu_i)$$

  • This is known as a Mahalanobis-distance classifier

[Figure: block diagram of the Mahalanobis-distance classifier: the Mahalanobis distances from x to µ1, …, µC (all using the common Σ) feed a minimum selector that outputs the class]


The Mahalanobis distance

o The quadratic term is called the Mahalanobis distance, a very important metric in Statistical Pattern Recognition (right up there with Bayes theorem)
  • The Mahalanobis distance is a vector distance that uses a Σ⁻¹ norm
  • Σ⁻¹ can be thought of as a stretching factor on the space
  • Note that for an identity covariance matrix (Σ = I), the Mahalanobis distance becomes the familiar Euclidean distance

$$\|x-\mu\|^{2} = K \quad\text{(Euclidean contour)}
\qquad
\|x-\mu\|^{2}_{\Sigma^{-1}} = (x-\mu)^{T}\Sigma^{-1}(x-\mu) = K \quad\text{(Mahalanobis contour)}$$

[Figure: in the (x1, x2) plane, circular Euclidean contours vs. elliptical Mahalanobis contours around µ]
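A small illustration (not from the slides) of the Mahalanobis distance and its reduction to the Euclidean distance for Σ = I; the covariance matrix used here is an arbitrary example:

```python
# Hypothetical illustration of the Mahalanobis distance (x - mu)^T Sigma^-1 (x - mu).
import numpy as np

def mahalanobis2(x, mu, sigma):
    diff = x - mu
    return float(diff @ np.linalg.inv(sigma) @ diff)

mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.7], [0.7, 2.0]])        # assumed, for illustration only

print(mahalanobis2(np.array([1.0, 1.0]), mu, sigma))       # "stretched" distance
print(mahalanobis2(np.array([1.0, 1.0]), mu, np.eye(2)))   # 2.0 = squared Euclidean
```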


Case 2: Σi=Σ (Σ non-diagonal), example

$$\mu_1 = [3\ 2]^{T},\quad \mu_2 = [5\ 4]^{T},\quad \mu_3 = [2\ 5]^{T},\qquad
\Sigma_1 = \Sigma_2 = \Sigma_3 = \begin{bmatrix}1&0.7\\0.7&2\end{bmatrix}$$


Case 3: Σi≠Σj, general case, example

$$\mu_1 = [3\ 2]^{T},\quad \mu_2 = [5\ 4]^{T},\quad \mu_3 = [2\ 5]^{T},\qquad
\Sigma_1 = \begin{bmatrix}1&-1\\-1&2\end{bmatrix},\quad
\Sigma_2 = \begin{bmatrix}1&-1\\-1&7\end{bmatrix},\quad
\Sigma_3 = \begin{bmatrix}0.5&0.5\\0.5&3\end{bmatrix}$$

[Figure: the resulting quadratic decision boundaries (zoom-out view)]


Numerical example (1)

o Derive a linear discriminant function for the two-class 3D classification problem defined by

$$\mu_1 = [0\ 0\ 0]^{T},\quad \mu_2 = [1\ 1\ 1]^{T},\quad
\Sigma_1 = \Sigma_2 = \begin{bmatrix}1/4&0&0\\0&1/4&0\\0&0&1/4\end{bmatrix},\quad
p(\omega_2) = 2\,p(\omega_1)$$

o Would anybody dare to sketch the likelihood densities and decision boundary for this problem?


Numerical example (2)

o Solution

$$g_i(x) = -\frac{1}{2}(x-\mu_i)^{T}\Sigma_i^{-1}(x-\mu_i) - \frac{1}{2}\log\left|\Sigma_i\right| + \log P(\omega_i)
\;\propto\;
-\frac{1}{2}\begin{bmatrix}x-\mu_x\\ y-\mu_y\\ z-\mu_z\end{bmatrix}^{T}
\begin{bmatrix}4&0&0\\0&4&0\\0&0&4\end{bmatrix}
\begin{bmatrix}x-\mu_x\\ y-\mu_y\\ z-\mu_z\end{bmatrix} + \log P(\omega_i)$$

$$g_1(x) = -\frac{1}{2}\begin{bmatrix}x\\ y\\ z\end{bmatrix}^{T}
\begin{bmatrix}4&0&0\\0&4&0\\0&0&4\end{bmatrix}
\begin{bmatrix}x\\ y\\ z\end{bmatrix} + \log\frac{1}{3}
\;;\qquad
g_2(x) = -\frac{1}{2}\begin{bmatrix}x-1\\ y-1\\ z-1\end{bmatrix}^{T}
\begin{bmatrix}4&0&0\\0&4&0\\0&0&4\end{bmatrix}
\begin{bmatrix}x-1\\ y-1\\ z-1\end{bmatrix} + \log\frac{2}{3}$$


Numerical example (3)

o Solution (continued)

• Classify the test example xu=[0.1 0.7 0.8]T

$$g_1(x) \underset{\omega_2}{\overset{\omega_1}{\gtrless}} g_2(x)
\;\Rightarrow\;
-2\left(x^{2}+y^{2}+z^{2}\right) + \log\frac{1}{3}
\underset{\omega_2}{\overset{\omega_1}{\gtrless}}
-2\left[(x-1)^{2}+(y-1)^{2}+(z-1)^{2}\right] + \log\frac{2}{3}$$

$$x+y+z \underset{\omega_1}{\overset{\omega_2}{\gtrless}} \frac{6-\log 2}{4} = 1.32$$

$$x_u:\;\; 0.1+0.7+0.8 = 1.6 \underset{\omega_1}{\overset{\omega_2}{\gtrless}} 1.32 \;\Rightarrow\; x_u \in \omega_2$$


Conclusions

o The Euclidean-distance classifier is Bayes-optimal* for
  • Gaussian classes, equal covariance matrices proportional to the identity matrix, and equal priors
o The Mahalanobis-distance classifier is Bayes-optimal for
  • Gaussian classes, equal covariance matrices, and equal priors

*Bayes-optimal means that the classifier yields the minimum P[error], which is the best ANY classifier can achieve


CHAPTER 5: Kernel Density Estimation

o Histograms
o Parzen Windows
o Smooth Kernels
o The Naïve Bayes Classifier


Non-parametric density estimation (NPDE)

o In the previous two chapters we have assumed that either
  • The likelihoods p(x|ωi) were known (Likelihood Ratio Test), or
  • At least, the parametric form of the likelihoods was known (Parameter Estimation)
o The methods that will be presented in the next two chapters do not afford such luxuries
  • Instead, they attempt to estimate the density directly from the data without making assumptions about the underlying distribution
o Sounds challenging? You bet!


The histogram

o The simplest form of NPDE is the familiar histogram
  • Divide the sample space into a number of bins and approximate the density at the center of each bin by the fraction of points in the training data that fall into the corresponding bin

$$P_H(x) = \frac{\left[\text{number of } x^{(k} \text{ in same bin as } x\right]}{N \times \left[\text{width of bin containing } x\right]}$$

[Figure: a histogram density estimate of one-dimensional data on 0 ≤ x ≤ 16]
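A minimal sketch (not from the slides) of the histogram estimate P_H(x); the bin count and the synthetic data are arbitrary choices for illustration:

```python
# Hypothetical 1D histogram density estimate: count points per bin, divide by N * bin width.
import numpy as np

def histogram_density(data, x, n_bins=10):
    edges = np.linspace(data.min(), data.max(), n_bins + 1)
    width = edges[1] - edges[0]
    counts, _ = np.histogram(data, bins=edges)
    k = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_bins - 1)
    return counts[k] / (len(data) * width)

rng = np.random.default_rng(0)
data = rng.normal(8, 2, size=500)        # synthetic data, assumed for the example
print(histogram_density(data, 8.0))
```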


Shortcomings of the histogram

o The shape of the NPDE depends on the starting position of the bins
  • For multivariate data, the final shape of the NPDE also depends on the orientation of the bins
o The discontinuities are not due to the underlying density; they are only an artifact of the chosen bin locations
  • These discontinuities make it very difficult, without experience, to grasp the structure of the data
o A much more serious problem is the curse of dimensionality: the number of bins grows exponentially with the number of dimensions
  • In high dimensions we would require a very large number of examples, or else most of the bins would be empty
o All these drawbacks make the histogram unsuitable for most practical applications, except for rapid visualization of results in one or two dimensions


NPDE, general formulation (1)

o Let us return to the basic definition of probability to get a solid idea of what we are trying to accomplish
  • The probability that a vector x, drawn from a distribution p(x), will fall in a given region ℜ of the sample space is

$$P = \int_{\Re} p(x')\,dx'$$

[Figure: a density p(x) with the probability mass over a region ℜ shaded]

From [Bishop, 1995]


NPDE, general formulation (2)

  • Suppose now that N vectors {x^(1, x^(2, …, x^(N} are drawn from the distribution; the probability that k of these N vectors fall in ℜ is now given by the binomial distribution

$$Prob(k) = \binom{N}{k}\,P^{k}\,(1-P)^{N-k}$$

  • It can be shown (from the properties of the binomial) that the mean and variance of the ratio k/N are

$$E\!\left[\frac{k}{N}\right] = P
\qquad\text{and}\qquad
Var\!\left[\frac{k}{N}\right] = E\!\left[\left(\frac{k}{N}-P\right)^{2}\right] = \frac{P(1-P)}{N}$$

  • Note that the variance gets smaller as N→∞, so we can expect that a good estimate of P is the mean fraction of points that fall within ℜ

$$P \cong \frac{k}{N}$$

From [Bishop, 1995]


NPDE, general formulation (3)

  • Assume now that ℜ is so small that p(x) does not vary appreciably within it; then the integral can be approximated by

$$\int_{\Re} p(x')\,dx' \cong p(x)\,V$$

  • where V is the volume enclosed by region ℜ

From [Bishop, 1995]

[Figure: the region ℜ shrinking around x, over which p(x) is approximately constant]


NPDE, general formulation (4)

  • Merging the two expressions we obtain

$$P = \int_{\Re} p(x')\,dx' \cong p(x)\,V
\quad\text{and}\quad
P \cong \frac{k}{N}
\;\Rightarrow\;
p(x) \cong \frac{k}{NV}$$

  • This estimate becomes more accurate as we increase the number of sample points N and shrink the volume V
  • In practice the value of N (the total number of examples) is fixed
  • To improve the estimate p(x) we could let V approach zero, but then region ℜ would become so small that it would enclose no examples
  • This means that, in practice, we will have to find a compromise value for the volume V
    • Large enough to include enough examples within ℜ
    • Small enough to support the assumption that p(x) is constant within ℜ

From [Bishop, 1995]


NPDE, general formulation (5)

o In conclusion, the general expression for NPDE is

$$p(x) \cong \frac{k}{NV}
\quad\text{where}\quad
\begin{cases}
V \text{ is the volume surrounding } x\\
N \text{ is the total number of examples}\\
k \text{ is the number of examples inside } V
\end{cases}$$

  • When applying this result to practical density estimation problems, two basic approaches can be adopted
    • Kernel Density Estimation (KDE): choose a fixed value of the volume V and determine k from the data
    • k Nearest Neighbor (kNN): choose a fixed value of k and determine the corresponding volume V from the data
  • It can be shown that both KDE and kNN converge to the true probability density as N→∞, provided that V shrinks with N, and k grows with N, appropriately

From [Bishop, 1995]


Parzen windows (1)

o Suppose that the region ℜ that encloses the k examples is a hypercube of side h
  • Then its volume is given by V = h^D, where D is the number of dimensions
o To find the number of examples that fall within this region we define a kernel function K(u)

$$K(u) = \begin{cases} 1 & |u_j| < 1/2,\ \forall j = 1,\dots,D\\ 0 & \text{otherwise}\end{cases}$$

  • This kernel, which corresponds to a unit hypercube centered at the origin, is known as a Parzen window or the naïve estimator

[Figure: a hypercube of side h centered at x, and the unit-box kernel K(u)]

From [Bishop, 1995]


Parzen windows (2)

o The total number of points inside the hypercube is then

$$k = \sum_{n=1}^{N} K\!\left(\frac{x-x^{(n}}{h}\right)$$

o Substituting back into the density estimate expression

$$p_{KDE}(x) = \frac{1}{Nh^{D}}\sum_{n=1}^{N} K\!\left(\frac{x-x^{(n}}{h}\right)$$

  • Note that the Parzen window DE resembles the histogram, with the exception that the bin locations are determined by the data points

From [Bishop, 1995]

[Figure: four 1D data points x(1…x(4; a box of volume V centered at the estimation point x gives K((x−x(n)/h) = 1 for the three points inside the box and 0 for the point outside, so k = 3]


Numerical exercise (1)

o Given the dataset X below, use Parzen windows to estimate the density p(x) at y = 3, 10, 15
  • X = {x^(1, x^(2, …, x^(N} = {4, 5, 5, 6, 12, 14, 15, 15, 16, 17}
  • Use a bandwidth of h = 4

[Figure: the ten data points on the x axis and the three estimation points y = 3, 10, 15]


Numerical exercise (2)

o Solution: Let’s first estimate p(y=3):

$$p_{KDE}(y=3) = \frac{1}{Nh}\sum_{n=1}^{N} K\!\left(\frac{y-x^{(n}}{h}\right)
= \frac{1}{10\cdot 4}\left[K\!\left(\tfrac{-1}{4}\right)+K\!\left(\tfrac{-1}{2}\right)+K\!\left(\tfrac{-1}{2}\right)+K\!\left(\tfrac{-3}{4}\right)+\dots+K\!\left(\tfrac{-14}{4}\right)\right]
= \frac{1}{40}\left[1+0+0+0+0+0+0+0+0+0\right] = 0.025$$

  • Similarly

$$p_{KDE}(y=10) = \frac{1}{40}\left[0+0+0+0+0+0+0+0+0+0\right] = 0$$

$$p_{KDE}(y=15) = \frac{1}{40}\left[0+0+0+0+0+0+1+1+1+1\right] = 0.1$$
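The exercise can be reproduced with a few lines of NumPy (a sketch, not from the slides), using the box kernel and h = 4 as above:

```python
# Hypothetical 1D Parzen-window estimate for the exercise: unit-box kernel, h = 4.
import numpy as np

X = np.array([4, 5, 5, 6, 12, 14, 15, 15, 16, 17], dtype=float)
h = 4.0

def parzen(y, data=X, h=h):
    u = (y - data) / h
    k = np.sum(np.abs(u) < 0.5)          # box (Parzen) kernel: 1 inside, 0 outside
    return k / (len(data) * h)

for y in (3, 10, 15):
    print(y, parzen(y))                  # 0.025, 0.0, 0.1
```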


Smooth kernels (1)

o The Parzen window has several drawbacks
  • It yields density estimates that have discontinuities
  • It weights equally all the points xi, regardless of their distance to the estimation point x
o Some of these difficulties can be overcome by replacing the Parzen window with a smooth kernel K(u) such that

$$\int_{R^{D}} K(x)\,dx = 1$$

[Figure: the unit-area Parzen (box) window and a smooth unit-area kernel K(u)]


Smooth kernels (2)

  • Usually, but not always, K(u) will be a radially symmetric and unimodal probability density function, such as the multivariate Gaussian density function

$$K(x) = \frac{1}{(2\pi)^{D/2}}\exp\!\left(-\frac{1}{2}x^{T}x\right)$$

  • where the expression of the density estimate remains the same as with Parzen windows

$$p_{KDE}(x) = \frac{1}{Nh^{D}}\sum_{n=1}^{N} K\!\left(\frac{x-x^{(n}}{h}\right)$$


Smooth kernels (3)

o Just as the Parzen window DE can be considered a sum of boxes centered at the observations, the smooth kernel estimate is a sum of “bumps” placed at the data points
  • The kernel function determines the shape of the bumps
  • The parameter h, also called the smoothing parameter or bandwidth, determines their width

[Figure: a kernel density estimate with h = 3, shown as the sum of Gaussian kernel functions placed at the data points]


Bandwidth selection, univariate case (1)

[Figure: kernel density estimates of the same data for bandwidths h = 10.0, 5.0, 2.5 and 1.0, ranging from over-smoothed to under-smoothed]


Bandwidth selection, univariate case (2)

o Subjective choice
  • Plot out several curves and choose the estimate that is most in accordance with one’s prior (subjective) ideas
  • However, this method is not practical in pattern recognition since we typically have high-dimensional data
o Reference to a standard distribution
  • Assume a standard density function and find the value of the bandwidth that minimizes the mean integrated square error (MISE)

$$h_{opt} = \arg\min_{h}\left\{MISE\!\left(p_{KDE}(x)\right)\right\} = \arg\min_{h} E\!\left[\int \left(p_{KDE}(x) - p(x)\right)^{2} dx\right]$$

  • If we assume that the true distribution is Gaussian and we use a Gaussian kernel, it can be shown that the optimal bandwidth is

$$h_{opt} = 1.06\,\sigma\,N^{-1/5}$$

  • where σ is the sample standard deviation and N is the number of training examples

From [Silverman, 1986]
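A minimal sketch (not from the slides) of this rule-of-thumb bandwidth; using the unbiased sample standard deviation is an assumption the slides do not specify:

```python
# Hypothetical rule-of-thumb bandwidth h_opt = 1.06 * sigma * N^(-1/5).
import numpy as np

def rule_of_thumb_bandwidth(data):
    return 1.06 * np.std(data, ddof=1) * len(data) ** (-1 / 5)

rng = np.random.default_rng(0)
data = rng.normal(0, 2, size=1000)
print(rule_of_thumb_bandwidth(data))   # roughly 1.06 * 2 * 1000^(-0.2) ~ 0.53
```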


Bandwidth selection, univariate case (3)

o Likelihood cross-validation
  • The ML estimate of h is degenerate since it yields hML = 0, a density estimate with Dirac delta functions at each training data point
  • A practical alternative is to maximize the “pseudo-likelihood” computed using leave-one-out cross-validation

$$h_{MLCV} = \arg\max_{h}\left\{\frac{1}{N}\sum_{n=1}^{N}\log p_{-n}\!\left(x^{(n}\right)\right\}
\quad\text{where}\quad
p_{-n}\!\left(x^{(n}\right) = \frac{1}{(N-1)\,h}\sum_{m=1,\,m\ne n}^{N} K\!\left(\frac{x^{(n}-x^{(m}}{h}\right)$$

From [Silverman, 1986]

[Figure: the leave-one-out density estimates p₋₁(x), …, p₋₄(x), each evaluated at its held-out point x^(n]
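A sketch (not from the slides) of likelihood cross-validation with a Gaussian kernel, maximizing the leave-one-out pseudo-likelihood over a grid of candidate bandwidths; the grid and data are arbitrary:

```python
# Hypothetical leave-one-out likelihood cross-validation for the bandwidth h.
import numpy as np

def loo_log_likelihood(data, h):
    n = len(data)
    total = 0.0
    for i in range(n):
        others = np.delete(data, i)
        u = (data[i] - others) / h
        dens = np.sum(np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)) / ((n - 1) * h)
        total += np.log(dens + 1e-300)   # guard against log(0)
    return total / n

rng = np.random.default_rng(0)
data = rng.normal(0, 1, size=100)
grid = np.linspace(0.1, 2.0, 20)
best_h = grid[np.argmax([loo_log_likelihood(data, h) for h in grid])]
print(best_h)
```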


Multivariate density estimation

o Bandwidth needs to be selected individually for each axis
  • Alternatively, one may pre-scale the axes or whiten the data, so that the same bandwidth can be used for all dimensions
o Density can be estimated with a multivariate kernel or by means of so-called product kernels (see TAMU notes)

[Figure: PRODUCT KERNELS — a multi-class scatter plot in the (x1, x2) plane and the corresponding class-conditional density P(x1, x2|ωi) estimated with product kernels]


Naïve Bayes classifier (1)

o How do we apply KDE to classifier design?
• First, we estimate the likelihood of each class P(x|ωi)
• Then we apply Bayes rule to derive the MAP rule

    $g_i(x) = P(\omega_i|x) \propto P(x|\omega_i)\,P(\omega_i)$

o However, P(x|ωi) is multivariate: non-parametric density estimation becomes hard!
• To avoid this problem, one practical simplification is sometimes made: assume that the features are class-conditionally independent

    $P(x|\omega_i) = \prod_{d=1}^{D} P\left(x(d)\,|\,\omega_i\right)$


Naïve Bayes classifier (2)

o Class-conditional independence vs. independence
• Features may be class-conditionally independent and yet not be independent:

    $P(x|\omega_i) = \prod_{d=1}^{D} P\left(x(d)|\omega_i\right)$  but  $P(x) \neq \prod_{d=1}^{D} P\left(x(d)\right)$

• Conversely, features may be (approximately) independent and yet not be class-conditionally independent:

    $P(x) \cong \prod_{d=1}^{D} P\left(x(d)\right)$  but  $P(x|\omega_i) \neq \prod_{d=1}^{D} P\left(x(d)|\omega_i\right)$

[Figure: two scatter plots in the (x1, x2) plane illustrating each of these cases]


Naïve Bayes classifier (3)

o Merging this expression into the discriminant function yields the decision rule for the Naïve Bayes classifier

    $g_{i,NB}(x) = P(\omega_i)\prod_{d=1}^{D} P\left(x(d)\,|\,\omega_i\right)$    (Naïve Bayes classifier)

o The main advantage of the Naïve Bayes classifier is that we only need to compute the univariate densities P(x(d)|ωi), which is a much easier problem than estimating the multivariate density P(x|ωi)
• Despite its simplicity, Naïve Bayes has been shown to have performance comparable to artificial neural networks and decision tree learning in some domains (a minimal sketch is given below)
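
A minimal sketch of such a classifier is given below: each class-conditional feature density P(x(d)|ωi) is estimated with a univariate Gaussian KDE and combined with the class priors. The class name, interface and the fixed bandwidth h are our own assumptions, not part of the slides.

```python
import numpy as np

class NaiveBayesKDE:
    """Naive Bayes classifier with one univariate Gaussian KDE per feature and class (sketch)."""

    def fit(self, X, y, h=1.0):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.h = h
        self.classes_ = np.unique(y)
        self.priors_ = {c: np.mean(y == c) for c in self.classes_}   # P(omega_i)
        self.samples_ = {c: X[y == c] for c in self.classes_}        # training data per class
        return self

    def _kde_1d(self, grid, samples):
        """Univariate Gaussian KDE of one feature, evaluated at 'grid'."""
        u = (grid[:, None] - samples[None, :]) / self.h
        return np.exp(-0.5 * u**2).mean(axis=1) / (self.h * np.sqrt(2 * np.pi))

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        scores = []
        for c in self.classes_:
            # log g_i(x) = log P(omega_i) + sum_d log P(x(d)|omega_i)
            log_lik = sum(np.log(self._kde_1d(X[:, d], self.samples_[c][:, d]) + 1e-300)
                          for d in range(X.shape[1]))
            scores.append(np.log(self.priors_[c]) + log_lik)
        return self.classes_[np.argmax(np.vstack(scores), axis=0)]
```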


CHAPTER 6: Nearest Neighbors

o Nearest Neighbors density estimation
o The k Nearest Neighbors classification rule
o kNN as a lazy learner
o Characteristics of the kNN classifier
o Optimizing the kNN classifier


kNN Density Estimation (1)

o In the kNN method we grow the volume surrounding the estimation point x until it encloses a total of k data points
o The density estimate then becomes

    $P(x) \cong \frac{k}{NV} = \frac{k}{N \cdot c_D \cdot R_k^D(x)}$

• Rk(x) is the distance between the estimation point x and its k-th closest neighbor
• cD is the volume of the unit sphere in D dimensions:

    $c_D = \frac{\pi^{D/2}}{(D/2)!} = \frac{\pi^{D/2}}{\Gamma(D/2+1)}$

• Thus c1=2, c2=π, c3=4π/3 and so on (a minimal sketch of the estimator is given below)

[Figure: in two dimensions the volume is a circle of radius R around x, Vol = πR², so P(x) = k/(NπR²)]
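
The estimator above can be sketched in a few lines of Python; the function name and interface are our own, and the training data are assumed to be arranged with one example per row.

```python
import numpy as np
from math import gamma, pi

def knn_density(x_query, samples, k):
    """kNN density estimate P(x) ~= k / (N * c_D * R_k(x)^D); minimal sketch."""
    X = np.asarray(samples, dtype=float)
    X = X.reshape(len(X), -1)                                        # (N, D) training data
    Q = np.asarray(x_query, dtype=float).reshape(-1, X.shape[1])     # (M, D) query points
    N, D = X.shape
    c_D = pi ** (D / 2) / gamma(D / 2 + 1)                           # volume of the unit sphere in D dims
    dists = np.linalg.norm(Q[:, None, :] - X[None, :, :], axis=2)    # (M, N) pairwise distances
    R_k = np.sort(dists, axis=1)[:, k - 1]                           # R_k(x): distance to the k-th neighbor
    return k / (N * c_D * R_k ** D)
```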


kNN Density Estimation (2)

o In general, the estimates that can be obtained with the kNN method are not very satisfactory

• The estimates are prone to local noise
• The method produces estimates with very heavy tails
• Since the function Rk(x) is not differentiable, the density estimate will have discontinuities

o These properties are illustrated in the next few slides


kNN Density Estimation, example 1

o To illustrate kNN density estimation, we generated several density estimates for a univariate mixture of two Gaussians, P(x)=½N(0,1)+½N(10,4), using several values of N and k


kNN Density Estimation, example 2 (a)

o The performance of the kNN density estimation technique in two dimensions is illustrated in these figures

• The top figure shows the true density, a mixture of two bivariate Gaussians

• The bottom figure shows the density estimate for k=10 neighbors and N=200 examples

• In the next slide we show the contours of the two distributions overlapped with the training data used to generate the estimate

    $p(x) = \frac{1}{2}N\left(\mu_1,\Sigma_1\right) + \frac{1}{2}N\left(\mu_2,\Sigma_2\right) \quad\text{with}\quad \mu_1 = [5\;\;0]^T,\; \Sigma_1 = \begin{bmatrix}1 & 1\\1 & 2\end{bmatrix},\quad \mu_2 = [0\;\;5]^T,\; \Sigma_2 = \begin{bmatrix}1 & 1\\1 & 4\end{bmatrix}$


kNN Density Estimation, example 2 (b)

[Figure: true density contours (left) and kNN density estimate contours (right), overlapped with the training data]


kNN as a Bayesian classifier (1)

o The main advantage of the kNN method is that it leads to a very simple approximation of the Bayes classifier

o Assume that we have a dataset with N examples, Ni from class ωi, and that we are interested in classifying an unknown sample xu

• We draw a hyper-sphere of volume V around xu. Assume this volume contains a total of k examples, ki from class ωi.

• The unconditional density is estimated by

    $P(x) = \frac{k}{NV}$

From [Bishop, 1995]


kNN as a Bayesian classifier (2)

• Similarly, we can then approximate the likelihood functions by counting the number of examples of each class inside volume V:

    $P(x|\omega_i) = \frac{k_i}{N_i V}$

• And the priors are approximated by

    $P(\omega_i) = \frac{N_i}{N}$

• Putting everything together, the Bayes classifier becomes

    $P(\omega_i|x) = \frac{P(x|\omega_i)\,P(\omega_i)}{P(x)} = \frac{\dfrac{k_i}{N_i V}\cdot\dfrac{N_i}{N}}{\dfrac{k}{NV}} = \frac{k_i}{k}$

From [Bishop, 1995]


The kNN classification rule (1)

o The K Nearest Neighbor Rule (kNN) is a very intuitive method that classifies unlabeled examples based on their similarity to examples in the training set

o The kNN rule only requires
• An integer k
• A set of labeled examples (training data)
• A metric to measure "closeness"

For a given unlabeled example xu ∈ ℜD, find the k "closest" labeled examples in the training data set and assign xu to the class that appears most frequently within the k-subset (a minimal sketch is given below)
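
A minimal sketch of this rule, assuming the Euclidean distance as the metric (the helper name and interface are our own):

```python
import numpy as np
from collections import Counter

def knn_classify(x_u, X_train, y_train, k=5):
    """Classify one unlabeled example x_u by a majority vote among its k nearest neighbors."""
    dists = np.linalg.norm(np.asarray(X_train, dtype=float) - np.asarray(x_u, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]                  # indices of the k closest training examples
    votes = Counter(np.asarray(y_train)[nearest])    # count class labels within the k-subset
    return votes.most_common(1)[0][0]                # the predominant class
```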


The kNN classification rule (2)

o Example
• In the example below we have three classes; the goal is to find a class label for the unknown example xu

• In this case we use the Euclidean distance and a value of k=5 neighbors

• Of the 5 closest neighbors, 4 belong to ω1 and 1 belongs to ω3, so xu is assigned to ω1, the predominant class

[Figure: the unknown example xu surrounded by labeled examples from classes ω1, ω2 and ω3]


kNN in action: example 1

o We have generated data for a 2-dimensional 3-class problem, where the class-conditional densities are multi-modal, and non-linearly separable, as illustrated in the figure

o We used the kNN rule with
• k = 5
• The Euclidean distance as a metric

o The resulting decision boundaries and decision regions are shown below


kNN in action: example 2

o We have generated data for a 2-dimensional 3-class problem, where the class-conditional densities are unimodal, and are distributed in rings around a common mean. These classes are also non-linearly separable, as illustrated in the figure

o We used the kNN rule with
• k = 5
• The Euclidean distance as a metric

o The resulting decision boundaries and decision regions are shown below


Characteristics of the kNN classifier (1)

o Advantages
• Simple implementation
• Nearly optimal in the large-sample limit (N→∞)
  • P[error]Bayes < P[error]1NN < 2·P[error]Bayes
• Uses local information, which can yield highly adaptive behavior
• Lends itself very easily to parallel implementations

o Disadvantages
• Large storage requirements
• Computationally intensive recall
• Highly susceptible to the curse of dimensionality


Characteristics of the kNN classifier (2)

o 1NN versus kNN
• The use of large values of k has two main advantages
  • Yields smoother decision regions
  • Provides probabilistic information: the ratio of examples for each class gives information about the ambiguity of the decision
• However, too large a value of k is detrimental
  • It destroys the locality of the estimation, since farther examples are taken into consideration
  • In addition, it increases the computational burden


kNN versus 1NN

[Figure: decision regions obtained with 1-NN, 5-NN and 20-NN]


kNN and the problem of feature weighting


Feature weighting

o The previous example illustrated the Achilles’ heel of the kNN classifier: its sensitivity to noisy axes

• A possible solution would be to normalize each feature to N(0,1)
• However, normalization does not resolve the curse of dimensionality. A close look at the Euclidean distance shows that this metric can become very noisy for high-dimensional problems if only a few of the features carry the classification information:

    $d(x_u, x) = \sum_{k=1}^{D}\left(x(k) - x_u(k)\right)^2$

o The solution to this problem is to modify the Euclidean metric by a set of weights that represent the information content or "goodness" of each feature (a weighted-distance sketch is given below)
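
As an illustration of this modification, a weighted (squared) Euclidean distance could look as follows; the weighted form and the function name are our own, since the slide only shows the unweighted metric.

```python
import numpy as np

def weighted_distance(x_u, x, w):
    """Weighted squared Euclidean distance d_w(x_u, x) = sum_k w(k) * (x(k) - x_u(k))^2.

    The weights w are assumed to encode the 'goodness' of each feature."""
    x_u, x, w = (np.asarray(a, dtype=float) for a in (x_u, x, w))
    return float(np.sum(w * (x - x_u) ** 2))
```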


CHAPTER 7: Linear Discriminant Functions

o Perceptron learning
o Minimum squared error (MSE) solution
o Least-mean-squares (LMS) rule


Linear Discriminant Functions (1)

o The objective of this chapter is to present methods for learning linear discriminant functions of the form

    $g(x) = w^T x + w_0 \quad\text{with}\quad \begin{cases} g(x) > 0 & \Rightarrow x \in \omega_1 \\ g(x) < 0 & \Rightarrow x \in \omega_2 \end{cases}$

• where w is the weight vector and w0 is the threshold weight or bias
• Similar discriminant functions were derived in chapter 3 as a special case of the quadratic classifier
• In this chapter, the discriminant functions will be derived in a non-parametric fashion, that is, no assumptions will be made about the underlying densities

[Figure: a linear decision boundary wTx+w0 = 0 in the (x1, x2) plane, with wTx+w0 > 0 on one side and wTx+w0 < 0 on the other; the weight vector w is normal to the boundary]


Linear Discriminant Functions (2)

o For convenience, we will focus on binary classification
• Extension to the multicategory case can be easily achieved by
  • Using ωi / not-ωi dichotomies
  • Using ωi / ωj dichotomies


Gradient descent (1)

o Gradient descent is a general method for function minimization

• From basic calculus, we know that the minimum of a function J(x) is defined by the zeros of the gradient

    $x^* = \arg\min_x J(x) \;\Rightarrow\; \nabla_x J(x)\big|_{x=x^*} = 0$

• Only in very special cases does this minimization problem have a closed-form solution
• In some other cases, a closed-form solution may exist, but it is numerically ill-posed or impractical (e.g., memory requirements)


Gradient descent (2)

o Gradient descent finds the minimum in an iterative fashion by moving in the direction of steepest descent

  1. Start with an arbitrary solution x(0)
  2. Compute the gradient ∇xJ(x(k))
  3. Move in the direction of steepest descent: $x(k+1) = x(k) - \eta\,\nabla_x J\left(x(k)\right)$
  4. Go to 2 (until convergence)

• where η is a learning rate (a minimal sketch of this loop is given below)

[Figure: left, a gradient descent trajectory on a two-dimensional surface over (x1, x2), from an initial guess towards either the global minimum or a local minimum; right, a one-dimensional objective J(w), where ∇J<0 implies ∆w>0 and ∇J>0 implies ∆w<0]
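
A minimal sketch of this iterative loop in Python (the function name, stopping criterion and the example objective are our own choices):

```python
import numpy as np

def gradient_descent(grad, x0, eta=0.1, n_iter=1000, tol=1e-6):
    """Generic gradient descent: x(k+1) = x(k) - eta * grad(x(k))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x_new = x - eta * grad(x)                # step in the direction of steepest descent
        if np.linalg.norm(x_new - x) < tol:      # stop when the update becomes negligible
            break
        x = x_new
    return x

# Example: minimize J(x) = (x1 - 1)^2 + 2*(x2 + 3)^2 (a quadratic chosen for illustration)
grad_J = lambda x: np.array([2 * (x[0] - 1), 4 * (x[1] + 3)])
x_star = gradient_descent(grad_J, x0=[0.0, 0.0])   # converges near [1, -3]
```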


Perceptron learning (1)

o Let's now consider the problem of learning a linear discriminant function for a binary classification problem
• As usual, assume we have a dataset X={x(1, x(2, …, x(N} containing examples from the two classes
• For convenience, we will absorb the intercept w0 by augmenting the feature vector x with an additional constant dimension:

    $w^T x + w_0 = \begin{bmatrix} w_0 & w^T \end{bmatrix}\begin{bmatrix} 1 \\ x \end{bmatrix} = a^T y$

From [Duda, Hart and Stork, 2001]


Perceptron learning (2)

• Keep in mind that our objective is to find a vector a such that

    $g(x) = a^T y \;\begin{cases} > 0 & x \in \omega_1 \\ < 0 & x \in \omega_2 \end{cases}$

• To simplify the derivation, we will "normalize" the training set by replacing all examples from class ω2 by their negative

    $y \leftarrow -y \quad \forall\, y \in \omega_2$

• This allows us to ignore class labels and look for a weight vector such that

    $a^T y > 0 \quad \forall\, y$

From [Duda, Hart and Stork, 2001]


Perceptron learning (3)

o To find this solution we must first define an objective function J(a)

• A good choice is what is known as the Perceptron criterion

    $J_P(a) = \sum_{y \in Y_M} \left(-a^T y\right)$

• where YM is the set of examples misclassified by a
• Note that JP(a) is non-negative, since aTy < 0 for misclassified samples


Perceptron learning (4)

o To find the minimum of JP(a), we use gradient descent
• The gradient is defined by

    $\nabla_a J_P(a) = \sum_{y \in Y_M} (-y)$

• And the gradient descent update rule becomes

    $a(k+1) = a(k) + \eta \sum_{y \in Y_M(k)} y$

• This is known as the perceptron batch update rule
• The weight vector may also be updated in an "on-line" fashion, that is, after the presentation of each individual example

    $a(k+1) = a(k) + \eta\, y^{(i}$    (Perceptron rule)

• where y(i is an example that has been misclassified by a(k) (a minimal sketch of the on-line rule is given below)
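
A minimal sketch of the on-line perceptron rule on a "normalized" augmented dataset (all ω2 examples already negated); the function name and stopping criterion are our own:

```python
import numpy as np

def perceptron_train(Y, eta=1.0, n_epochs=100):
    """On-line perceptron rule: we seek a vector a with a^T y > 0 for every row y of Y."""
    Y = np.asarray(Y, dtype=float)
    a = np.zeros(Y.shape[1])                    # start with an arbitrary (zero) weight vector
    for _ in range(n_epochs):
        errors = 0
        for y in Y:
            if a @ y <= 0:                      # y is misclassified by the current a
                a = a + eta * y                 # perceptron update: a(k+1) = a(k) + eta*y
                errors += 1
        if errors == 0:                         # converged: every example satisfies a^T y > 0
            break
    return a
```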


Perceptron learning (5)

o If classes are linearly separable, the perceptron rule is guaranteed to converge to a valid solution

o However, if the two classes are not linearly separable, the perceptron rule will not converge

• Since no weight vector a can correctly classify every sample in a non-separable dataset, the corrections in the perceptron rule will never cease

• One ad-hoc solution to this problem is to enforce convergence by using variable learning rates η(k) that approach zero as k approaches infinity


Minimum Squared Error solution (1)

o The classical Minimum Squared Error (MSE) criterion provides an alternative to the perceptron rule

• The perceptron rule seeks a weight vector a that satisfies the inequality aTy(i>0

• The perceptron rule only considers misclassified samples, since these are the only ones that violate the above inequality

• Instead, the MSE criterion looks for a solution to the equality aTy(i=b(i, where b(i are some pre-specified target values (e.g., class labels)

• As a result, the MSE solution uses ALL samples in the training set

From [Duda, Hart and Stork, 2001]


Minimum Squared Error solution (2)

o The system of equations solved by MSE is

    $\begin{bmatrix} y_0^{(1} & y_1^{(1} & \cdots & y_D^{(1} \\ y_0^{(2} & y_1^{(2} & \cdots & y_D^{(2} \\ \vdots & \vdots & & \vdots \\ y_0^{(N} & y_1^{(N} & \cdots & y_D^{(N} \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_D \end{bmatrix} = \begin{bmatrix} b^{(1} \\ b^{(2} \\ \vdots \\ b^{(N} \end{bmatrix} \;\Leftrightarrow\; Ya = b$

• where a is the weight vector, each row of Y is a training example, and each element of b is the corresponding target value (e.g., the class label)
• For consistency, we will continue assuming that examples from class ω2 have been replaced by their negative vector, although this is not a requirement for the MSE solution

From [Duda, Hart and Stork, 2001]


Minimum Squared Error solution (3)

o An exact solution to Ya=b can sometimes be found
• If the number of (independent) equations (N) is equal to the number of unknowns (D+1), the exact solution is defined by

    $a = Y^{-1} b$

o In practice, however, Y will be singular, so its inverse Y⁻¹ does not exist
• Y will commonly have more rows (examples) than columns (unknowns), which yields an over-determined system for which an exact solution cannot be found


Minimum Squared Error solution (4)

o The solution in this case is to find a weight vector that minimizes some function of the error between the model (Ya) and the desired output (b)
• In particular, MSE seeks to Minimize the sum of the Squares of these Errors:

    $J_{MSE}(a) = \sum_{i=1}^{N}\left(a^T y^{(i} - b^{(i}\right)^2 = \left\|Ya - b\right\|^2$

• which, as usual, can be minimized by setting its gradient to zero


The pseudo-inverse solution

o The gradient of the objective function is

    $\nabla_a J_{MSE}(a) = 2\sum_{i=1}^{N}\left(a^T y^{(i} - b^{(i}\right) y^{(i} = 2\,Y^T\left(Ya - b\right)$

• with zeros defined by

    $Y^T Y a = Y^T b$

• Notice that YTY is now a square matrix!
o If YTY is nonsingular, the MSE solution becomes

    $a = \left(Y^T Y\right)^{-1} Y^T b = Y^{\dagger} b$    (Pseudo-inverse solution)

• where the matrix Y† = (YTY)⁻¹YT is known as the pseudo-inverse of Y (Y†Y = I)
• Note that, in general, YY† ≠ I (a minimal sketch of this solution is given below)
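
A minimal sketch of the MSE/pseudo-inverse solution; rather than forming (YTY)⁻¹ explicitly, the example uses a standard least-squares solver, which yields the same solution and is numerically safer. The toy data in the comments are hypothetical.

```python
import numpy as np

def mse_weights(Y, b):
    """Least-squares (pseudo-inverse) solution of Ya = b, i.e. a = (Y^T Y)^{-1} Y^T b."""
    Y = np.asarray(Y, dtype=float)
    b = np.asarray(b, dtype=float)
    a, *_ = np.linalg.lstsq(Y, b, rcond=None)
    return a

# Example (assumed toy data): augmented examples y = [1, x1, x2] with targets b = +/-1
# Y = np.array([[1, 0.2, 1.1], [1, 0.4, 0.9], [1, 2.0, -1.0], [1, 2.2, -1.3]])
# b = np.array([1.0, 1.0, -1.0, -1.0])
# a = mse_weights(Y, b)
```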


Ridge-regression solution (1)

o If the training data is extremely correlated (collinearity problem), the matrix YTY becomes near singular

• The smaller eigenvalues (the noise) dominate the computation of the inverse (YTY)-1, which leads to numerical problems

o The collinearity problem can be solved through regularization
• This is equivalent to adding a small multiple of the identity matrix to the term YTY, which results in

    $a = \left[(1-\varepsilon)\,Y^T Y + \varepsilon\,\frac{tr\left(Y^T Y\right)}{D}\,I\right]^{-1} Y^T b$    (Ridge regression)

• where ε (0<ε<1) is a regularization parameter that controls the amount of shrinkage towards the identity matrix. This is known as the ridge-regression solution (a minimal sketch is given below)
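
A minimal sketch of this regularized solution (the function name and default ε are our own choices):

```python
import numpy as np

def ridge_weights(Y, b, eps=0.1):
    """Ridge-regression solution a = [(1-eps) Y^T Y + eps*tr(Y^T Y)/D * I]^{-1} Y^T b, eps in (0,1)."""
    Y = np.asarray(Y, dtype=float)
    b = np.asarray(b, dtype=float)
    D = Y.shape[1]
    YtY = Y.T @ Y
    A = (1.0 - eps) * YtY + eps * (np.trace(YtY) / D) * np.eye(D)   # shrink towards a scaled identity
    return np.linalg.solve(A, Y.T @ b)
```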


Ridge-regression solution (2)

o Selection of the regularization parameter
• For ε=0, the ridge-regression solution is equivalent to the pseudo-inverse solution
• For ε=1, the ridge-regression solution is a constant function that predicts the average classification rate across the entire dataset
• An appropriate value for ε is typically found through cross-validation

From [Gutierrez-Osuna, 2002]


Least-mean-squares solution (1)

o The minimum of the objective function JMSE(a)=||Ya-b||² can also be found using a gradient descent procedure
• This avoids the problems that arise when YTY is singular
• In addition, it also avoids working with large matrices
o Looking at the expression of the gradient, the obvious update rule is

    $a(k+1) = a(k) + \eta(k)\,Y^T\left(b - Y a(k)\right)$

• It can be shown that if η(k)=η(1)/k, where η(1) is any positive constant, this rule generates a sequence of vectors that converges to a solution of YT(Ya-b)=0

From [Duda, Hart and Stork, 2001]


Least-mean-squares solution (2)

o The storage requirements of this algorithm can be reduced by considering each sample sequentially
• This is known as the Widrow-Hoff, least-mean-squares (LMS) or delta rule [Mitchell, 1997] (a minimal sketch is given below)

    $a(k+1) = a(k) + \eta(k)\left(b^{(i} - a(k)^T y^{(i}\right) y^{(i}$    (LMS rule)

From [Duda, Hart and Stork, 2001]
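
A minimal sketch of the sequential LMS rule with a decaying learning rate η(k) = η(1)/k; the interface and epoch loop are our own choices:

```python
import numpy as np

def lms_train(Y, b, eta0=0.1, n_epochs=50):
    """Widrow-Hoff / LMS rule: a <- a + eta(k) * (b_i - a^T y_i) * y_i, one example at a time."""
    Y = np.asarray(Y, dtype=float)
    b = np.asarray(b, dtype=float)
    a = np.zeros(Y.shape[1])
    k = 1
    for _ in range(n_epochs):
        for y_i, b_i in zip(Y, b):
            a = a + (eta0 / k) * (b_i - a @ y_i) * y_i   # sequential (on-line) update
            k += 1
    return a
```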


Summary: Perceptron vs. MSE procedures

o Perceptron rule
• The perceptron rule always finds a solution if the classes are linearly separable, but does not converge if the classes are non-separable

o MSE criterion
• The MSE solution has guaranteed convergence, but it may not find a separating hyperplane even if the classes are linearly separable
• Notice that MSE tries to minimize the sum of squares of the distances of the training data to the separating hyperplane, as opposed to finding this hyperplane

[Figure: a linearly separable two-class dataset in the (x1, x2) plane, where the Perceptron solution separates the classes while the LMS solution does not]


Summary of classifier decision boundaries

[Figure: decision regions produced by the QUADRATIC, KNN, RBF and MLP classifiers on the same dataset, plotted as Feature 1 vs. Feature 2]


References

o This material is an abridged version of my lecture notes in Pattern Recognition at Texas A&M University, which you may download from

http://research.cs.tamu.edu/prism
o Additional references are:

• C. Bishop (1995), Neural Networks for Pattern Recognition, Oxford University Press

• B. W. Silverman (1986), Density Estimation for Statistics and Data Analysis, Chapman and Hall

• R. O. Duda, P. E. Hart and D. G. Stork (2001), Pattern Classification, Wiley

• R. Schalkoff (1992), Pattern Recognition: Statistical, Structural and Neural Approaches, Wiley

• R. Gutierrez-Osuna (2002), “Pattern Analysis for Machine Olfaction: A Review,” IEEE Sensors Journal, 2(3), 189-202


3rd NOSE Short Course, Alpbach, 21st – 26th Mar 2004

Questions


3rd NOSE Short Course, Alpbach, 21st – 26th Mar 2004

Thank you