UV Pattern Recognition I
Helmut A. Mayer
Department of Computer Sciences, University of Salzburg
SS 17
UV Pattern Recognition I
Outline
1 Introduction
2 Statistical Classifiers: Bayesian Decision Theory, Discriminant Functions and Decision Surfaces, Maximum–Likelihood Estimation, Component Analysis
3 Nonparametric Techniques: Density Estimation, Parzen Windows
UV Pattern Recognition I
Introduction
Human vs. Machine
Human Perception
Senses to neural patterns
Machine Perception
Sensors to value patterns
Patterns are everywhere...
Images, Time Series, Medical Diagnosis, Customer Analysis (only a few examples)
Features build Model
Fish Example
UV Pattern Recognition I
Introduction
Salmon or Sea Bass
FIGURE 1.1. The objects to be classified are first sensed by a transducer (camera), whose signals are preprocessed. Next the features are extracted and finally the classification is emitted, here either "salmon" or "sea bass." Although the information flow is often chosen to be from the source to the classifier, some systems employ information flow in which earlier levels of processing can be altered based on the tentative or preliminary response in later levels (gray arrows). Yet others combine two or more stages into a unified step, such as simultaneous segmentation and feature extraction. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Introduction
Fish Length Histogram
FIGURE 1.2. Histograms for the length feature for the two categories. No single threshold value of the length will serve to unambiguously discriminate between the two categories; using length alone, we will have some errors. The value marked l* will lead to the smallest number of errors, on average. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Introduction
Fish Lightness Histogram
FIGURE 1.3. Histograms for the lightness feature for the two categories. No single threshold value x* (decision boundary) will serve to unambiguously discriminate between the two categories; using lightness alone, we will have some errors. The value x* marked will lead to the smallest number of errors, on average. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Introduction
Decision Theory
Cost of an Error?
Salmon tastes better... ;-)
Minimization of cost (risk)
Decision Rule/Boundary
Improving Recognition
Feature Vector ~x = (lightness, width)^t
2D Decision Boundary
UV Pattern Recognition I
Introduction
2D Feature Space
FIGURE 1.4. The two features of lightness and width for sea bass and salmon. The dark line could serve as a decision boundary of our classifier. Overall classification error on the data shown is lower than if we use only one feature as in Fig. 1.3, but there will still be some errors. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Introduction
Overfitting
FIGURE 1.5. Overly complex models for the fish will lead to decision boundaries that are complicated. While such a decision may lead to perfect classification of our training samples, it would lead to poor performance on future patterns. The novel test point marked ? is evidently most likely a salmon, whereas the complex decision boundary shown leads it to be classified as a sea bass. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Introduction
Generalization
FIGURE 1.6. The decision boundary shown might represent the optimal tradeoff between performance on the training set and simplicity of classifier, thereby giving the highest accuracy on new patterns. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Introduction
Related Fields
Statistical Hypothesis Testing
Image Processing
Regression (age ↔ weight)
Interpolation
Density Estimation
UV Pattern Recognition I
Introduction
Pattern Recognition Systems
[Figure: processing pipeline input → sensing → segmentation → feature extraction → classification → post-processing → decision, with feedback paths for adjustments for missing features, adjustments for context, and costs.]
FIGURE 1.7. Many pattern recognition systems can be partitioned into components such as the ones shown here. A sensor converts images or sounds or other physical inputs into signal data. The segmentor isolates sensed objects from the background or from other objects. A feature extractor measures object properties that are useful for classification. The classifier uses these features to assign the sensed object to a category. Finally, a post processor can take account of other considerations, such as the effects of context and the costs of errors, to decide on the appropriate action. Although this description stresses a one-way or "bottom-up" flow of data, some systems employ feedback from higher levels back down to lower levels (gray arrows). From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Introduction
Feature Extraction
Features ↔ Classification
Invariant Features (translation, rotation, scale)
Deformation (e.g. Cropping)
Feature Selection (Filter, Wrapper)
UV Pattern Recognition I
Introduction
Post Processing
Error Rate, Risk (weighted error)
Context (IC* *IN)
Multiple Classifiers (subspaces, fusion)
UV Pattern Recognition I
Introduction
Design Cycle
[Figure: design cycle start → collect data → choose features → choose model → train classifier → evaluate classifier → end, guided by prior knowledge (e.g., invariances).]
FIGURE 1.8. The design of a pattern recognition system involves a design cycle similar to the one shown here. Data must be collected, both to train and to test the system. The characteristics of the data impact both the choice of appropriate discriminating features and the choice of models for the different categories. The training process uses some or all of the data to determine the system parameters. The results of evaluation may call for repetition of various steps in this process in order to obtain satisfactory results. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Introduction
Learning and Adaptation
Learning is Parameter Tuning
Supervised Learning (teacher)
Reinforcement Learning (critic)
Unsupervised Learning (clustering)
UV Pattern Recognition I
Statistical Classifiers
Bayesian Decision Theory
Probabilities
State of Nature ω = ω1 (class)
A Priori Probability P(ω1) (prior)
Decision Rule P(ω1) > P(ω2)→ ω1
Class–Conditional Probability Density Function p(x |ω)
UV Pattern Recognition I
Statistical Classifiers
Bayesian Decision Theory
Class–Conditional Probability Density
FIGURE 2.1. Hypothetical class-conditional probability density functions show the probability density of measuring a particular feature value x given the pattern is in category ωi. If x represents the lightness of a fish, the two curves might describe the difference in lightness of populations of two types of fish. Density functions are normalized, and thus the area under each curve is 1.0. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Statistical Classifiers
Bayesian Decision Theory
Bayes Decision Rule
Joint Probability Density p(ωj, x) = P(ωj|x) p(x) = p(x|ωj) P(ωj)
Bayes Formula P(ωj|x) = p(x|ωj) P(ωj) / p(x)
Decision Rule P(ω1|x) > P(ω2|x)→ ω1
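A minimal numeric sketch of the rule above (not from the slides): two assumed Gaussian class-conditional densities and the priors P(ω1) = 2/3, P(ω2) = 1/3 used in the posterior figure below; the posteriors are computed with Bayes' formula and the larger one wins.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    # p(x|omega) = 1/(sqrt(2*pi)*sigma) * exp(-0.5*((x - mu)/sigma)^2)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

priors = np.array([2 / 3, 1 / 3])          # P(omega_1), P(omega_2)
params = [(10.0, 1.0), (12.5, 1.5)]        # assumed (mu, sigma) of p(x|omega_i)

def posteriors(x):
    joint = np.array([normal_pdf(x, m, s) for m, s in params]) * priors  # p(x|w_j) P(w_j)
    return joint / joint.sum()                                           # divide by evidence p(x)

post = posteriors(11.0)
print(post, "-> decide omega_%d" % (int(np.argmax(post)) + 1))           # larger posterior wins
```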
UV Pattern Recognition I
Statistical Classifiers
Bayesian Decision Theory
Posterior Probabilities
FIGURE 2.2. Posterior probabilities for the particular priors P(ω1) = 2/3 and P(ω2) = 1/3 for the class-conditional probability densities shown in Fig. 2.1. Thus in this case, given that a pattern is measured to have feature value x = 14, the probability it is in category ω2 is roughly 0.08, and that it is in ω1 is 0.92. At every x, the posteriors sum to 1.0. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Statistical Classifiers
Bayesian Decision Theory
Error Probabilities
Error P(error|x) = P(ω1|x) if we decide ω2, P(ω2|x) if we decide ω1
Average Error Probability P(error) = ∫ p(error, x) dx = ∫ P(error|x) p(x) dx (integrated over all x)
Bayes Rule minimizes P(error)
UV Pattern Recognition I
Statistical Classifiers
Bayesian Decision Theory
Generalized Bayes Rule
Feature vector ~x ∈ Rd
Classes ω1 . . . ωc
Bayes Formula P(ωj|~x) = p(~x|ωj) P(ωj) / p(~x)
UV Pattern Recognition I
Statistical Classifiers
Bayesian Decision Theory
Conditional Risk
Actions/Decisions α1 . . . αa
Loss Function λ(αi |ωj)
Decision Rule α(~x)
Expected Loss/Conditional Risk R(αi|~x) = Σ_{j=1}^{c} λ(αi|ωj) P(ωj|~x)
UV Pattern Recognition I
Statistical Classifiers
Bayesian Decision Theory
Bayes Risk
Overall Risk R = ∫ R(α(~x)|~x) p(~x) d~x
Bayes Rule: R(αi|~x) → min (i = i*) → αi*
Bayes Risk R∗ is best performance
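A small sketch of minimum-risk classification, assuming a hypothetical 2 × 2 loss matrix λ(αi|ωj) and a given posterior vector; the conditional risks are the loss-weighted posteriors, and the action with the smallest risk is selected.

```python
import numpy as np

# lam[i, j] = loss lambda(alpha_i | omega_j); values are assumptions for illustration
lam = np.array([[0.0, 2.0],    # alpha_1: decide omega_1
                [1.0, 0.0]])   # alpha_2: decide omega_2

def min_risk_action(post):
    """R(alpha_i|x) = sum_j lam[i, j] * P(omega_j|x); pick the action with minimal risk."""
    risks = lam @ post
    return int(np.argmin(risks)), risks

post = np.array([0.7, 0.3])    # posteriors P(omega_j|x) for some pattern x
action, risks = min_risk_action(post)
print(risks, "-> take alpha_%d" % (action + 1))
```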
UV Pattern Recognition I
Statistical Classifiers
Bayesian Decision Theory
Two–Class Example
Actions α1 → ω1, α2 → ω2
Loss Function λ(αi |ωj) = λij
Decision Rule/Likelihood Ratio: p(~x|ω1)/p(~x|ω2) > [(λ12 − λ22) P(ω2)] / [(λ21 − λ11) P(ω1)] (λ21 > λ11) → ω1
Zero-One Loss Function λij = 0 if i = j, 1 if i ≠ j
Conditional Risk is Average Error Probability (?)
UV Pattern Recognition I
Statistical Classifiers
Bayesian Decision Theory
Likelihood Ratio
FIGURE 2.3. The likelihood ratio p(x|ω1)/p(x|ω2) for the distributions shown in Fig. 2.1. If we employ a zero-one or classification loss, our decision boundaries are determined by the threshold θa. If our loss function penalizes miscategorizing ω2 as ω1 patterns more than the converse, we get the larger threshold θb, and hence R1 becomes smaller. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
Discriminant Functions
Discriminant Functions gi (~x), i = 1, . . . , c
gi(~x) > gj(~x) ∀ j ≠ i
Bayes Classifier gi(~x) = −R(αi|~x)
Decision Invariance gi(~x) → f(gi(~x)) if f(.) is (strictly) monotonically increasing
Bayes Minimum Error: gi(~x) = P(ωi|~x), or gi(~x) = p(~x|ωi) P(ωi), or gi(~x) = ln p(~x|ωi) + ln P(ωi)
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
Discriminant Network
FIGURE 2.5. The functional structure of a general statistical pattern classifier which includes d inputs and c discriminant functions gi(x). A subsequent step determines which of the discriminant values is the maximum, and categorizes the input pattern accordingly. The arrows show the direction of the flow of information, though frequently the arrows are omitted when the direction of flow is self-evident. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
Two Categories
Discriminant Function g(~x) ≡ g1(~x) − g2(~x), g(~x) > 0 → ω1
Bayes Minimum Error: g(~x) = P(ω1|~x) − P(ω2|~x)
g(~x) = ln [p(~x|ω1)/p(~x|ω2)] + ln [P(ω1)/P(ω2)]
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
Dichotomizer
FIGURE 2.6. In this two-dimensional two-category classifier, the probability densities are Gaussian, the decision boundary consists of two hyperbolas, and thus the decision region R2 is not simply connected. The ellipses mark where the density is 1/e times that at the peak of the distribution. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
The Normal Density
Randomized Prototype Vectors with Mean ~µ → Normal Distribution
Expected Value E[f(x)] = ∫ f(x) p(x) dx (continuous), E[f(x)] = Σ_{x∈D} f(x) P(x) (discrete)
Univariate Normal Density p(x) = (1/(√(2π) σ)) e^{−(1/2)((x−µ)/σ)^2}
E[x] = µ, E[(x − µ)^2] = σ^2
Entropy H(p(x)) = −∫ p(x) ln p(x) dx (nats/bits)
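A quick numeric check, assuming example values µ = 0 and σ = 2: the moments and the entropy of the univariate normal are approximated by a Riemann sum and compared to the closed-form Gaussian entropy (1/2) ln(2πeσ^2).

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

mu, sigma = 0.0, 2.0                                # assumed parameters
x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 200001)
dx = x[1] - x[0]
p = normal_pdf(x, mu, sigma)

mean = np.sum(x * p) * dx                           # E[x] -> mu
var = np.sum((x - mean) ** 2 * p) * dx              # E[(x - mu)^2] -> sigma^2
entropy = -np.sum(p * np.log(p)) * dx               # H(p) in nats
print(mean, var, entropy, 0.5 * np.log(2 * np.pi * np.e * sigma ** 2))
```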
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
Normal Distribution
FIGURE 2.7. A univariate normal distribution has roughly 95% of its area in the range |x − µ| ≤ 2σ, as shown. The peak of the distribution has value p(µ) = 1/(√(2π)σ). From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
Multivariate Density
Multivariate Normal Density p(~x) = (1/((2π)^{d/2} |Σ|^{1/2})) e^{−(1/2)(~x−~µ)^t Σ^{-1} (~x−~µ)}
Covariance Matrix Σ (d × d): E[~x] = ~µ, E[(~x − ~µ)(~x − ~µ)^t] = Σ, E[xi] = µi, E[(xi − µi)(xj − µj)] = σij
Linear Transformation: p(~x) ∼ N(~µ, Σ), A (d × k), ~y = A^t~x → p(~y) ∼ N(A^t~µ, A^tΣA)
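A sketch with assumed ~µ, Σ, and projection matrix A: the density is evaluated through its quadratic form, and the linear-transformation property ~y = A^t~x ∼ N(A^t~µ, A^tΣA) is checked by sampling.

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate normal density via the Mahalanobis quadratic form."""
    d = len(mu)
    diff = x - mu
    maha2 = diff @ np.linalg.inv(Sigma) @ diff                    # (x-mu)^t Sigma^{-1} (x-mu)
    return np.exp(-0.5 * maha2) / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))

mu = np.array([1.0, 2.0])                                         # assumed mean
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])                        # assumed covariance
print(mvn_pdf(np.array([1.5, 1.5]), mu, Sigma))

A = np.array([[1.0], [0.5]])                                      # assumed d x k projection
x_samples = np.random.default_rng(0).multivariate_normal(mu, Sigma, 100000)
y = x_samples @ A                                                 # y = A^t x for every sample
print(y.mean(axis=0), A.T @ mu)                                   # empirical mean vs. A^t mu
print(y.var(ddof=1), (A.T @ Sigma @ A)[0, 0])                     # empirical variance vs. A^t Sigma A
```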
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
Linear Transformations
FIGURE 2.8. The action of a linear transformation on the feature space will convert an arbitrary normal distribution into another normal distribution. One transformation, A, takes the source distribution into distribution N(A^tµ, A^tΣA). Another linear transformation, a projection P onto a line defined by vector a, leads to N(µ, σ^2) measured along that line. While the transforms yield distributions in a different space, we show them superimposed on the original x1x2-space. A whitening transform, Aw, leads to a circularly symmetric Gaussian, here shown displaced. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
Mahalanobis Distance
Points of Constant Density r^2 = (~x − ~µ)^t Σ^{-1} (~x − ~µ) (r is the Mahalanobis distance)
Hyperellipsoid: axes are eigenvectors of Σ, axis lengths are determined by the eigenvalues of Σ
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
2D Gaussian
FIGURE 2.9. Samples drawn from a two-dimensional Gaussian lie in a cloud centered on the mean µ. The ellipses show lines of equal probability density of the Gaussian. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
Normal Density Discriminant Functions
Minimum Error Rate gi(~x) = ln p(~x|ωi) + ln P(ωi)
Multivariate Density gi(~x) = −(1/2)(~x − ~µi)^t Σi^{-1} (~x − ~µi) − (d/2) ln 2π − (1/2) ln |Σi| + ln P(ωi)
Case Σi = σ^2 I: gi(~x) = −||~x − ~µi||^2/(2σ^2) + ln P(ωi), with ||~x − ~µi||^2 = (~x − ~µi)^t(~x − ~µi)
If P(ωi) = const → Minimum-Distance-Classifier/Nearest Neighbor
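For Σi = σ^2 I and equal priors, the rule above reduces to assigning ~x to the nearest class mean; a minimal sketch with two assumed means:

```python
import numpy as np

means = np.array([[2.0, 3.0],      # assumed mu_1
                  [6.0, 1.0]])     # assumed mu_2

def nearest_mean(x):
    """With equal priors and a common sigma, maximizing g_i(x) = -||x - mu_i||^2/(2 sigma^2)
    is the same as picking the closest mean."""
    d2 = np.sum((means - x) ** 2, axis=1)       # squared Euclidean distances
    return int(np.argmin(d2)) + 1               # 1-based class index

print(nearest_mean(np.array([3.0, 2.0])))       # -> 1
```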
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
Linear Discriminant Functions
Quadratic Form gi(~x) = −(1/(2σ^2))[~x^t~x − 2~µi^t~x + ~µi^t~µi] + ln P(ωi)
Linear Machine gi(~x) = ~wi^t~x + wi0 (wi0 threshold/bias)
Decision Boundary gi(~x) = gj(~x) (Hyperplanes)
Normal Form ~w^t(~x − ~x0) = 0 → ~w = ~µi − ~µj, ~x0 = (1/2)(~µi + ~µj) − (σ^2/||~µi − ~µj||^2) ln[P(ωi)/P(ωj)] (~µi − ~µj)
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
Equal Variances/Equal Priors
FIGURE 2.10. If the covariance matrices for two distributions are equal and proportional to the identity matrix, then the distributions are spherical in d dimensions, and the boundary is a generalized hyperplane of d − 1 dimensions, perpendicular to the line separating the means. In these one-, two-, and three-dimensional examples, we indicate p(x|ωi) and the boundaries for the case P(ω1) = P(ω2). In the three-dimensional case, the grid plane separates R1 from R2. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
Equal Variances/Unequal Priors
FIGURE 2.11. As the priors are changed, the decision boundary shifts; for sufficiently disparate priors the boundary will not lie between the means of these one-, two- and three-dimensional spherical Gaussian distributions. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
Identical Covariance Matrices
Case Σi = Σ: gi(~x) = −(1/2)(~x − ~µi)^t Σ^{-1} (~x − ~µi) + ln P(ωi)
P(ωi) = P0 → Minimal Distance: gi(~x) = −d_m^2(~x, ~µi) (d_m Mahalanobis distance)
Linear Decision Boundary (~x^t Σ^{-1} ~x is independent of i)
Normal Form ~w^t(~x − ~x0) = 0 → ~w = Σ^{-1}(~µi − ~µj), ~x0 = (1/2)(~µi + ~µj) − [1/d_m^2(~µi, ~µj)] ln[P(ωi)/P(ωj)] (~µi − ~µj)
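The hyperplane parameters follow directly from the formulas above; the sketch assumes example means, a shared covariance Σ, and priors 0.7/0.3.

```python
import numpy as np

mu1, mu2 = np.array([2.0, 3.0]), np.array([6.0, 1.0])   # assumed class means
Sigma = np.array([[1.5, 0.4], [0.4, 1.0]])              # assumed shared covariance
P1, P2 = 0.7, 0.3                                       # assumed priors

Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu2)                             # w = Sigma^{-1}(mu_1 - mu_2)
d2m = (mu1 - mu2) @ Sigma_inv @ (mu1 - mu2)             # squared Mahalanobis distance of the means
x0 = 0.5 * (mu1 + mu2) - (np.log(P1 / P2) / d2m) * (mu1 - mu2)

def decide(x):
    return 1 if w @ (x - x0) > 0 else 2                 # w^t (x - x0) > 0 -> omega_1

print(w, x0, decide(np.array([4.0, 2.0])))
```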
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
Equal Covariances/Various Priors
FIGURE 2.12. Probability densities (indicated by the surfaces in two dimensions and ellipsoidal surfaces in three dimensions) and decision regions for equal but asymmetric Gaussian distributions. The decision hyperplanes need not be perpendicular to the line connecting the means. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
Arbitrary Covariance Matrices
Case Σi arbitrary: gi(~x) = ~x^t Wi ~x + ~wi^t ~x + wi0
Wi = −(1/2) Σi^{-1}, ~wi = Σi^{-1} ~µi
Decision Surfaces are Hyperquadrics: hyper–(planes, spheres, ellipsoids, paraboloids, hyperboloids)
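A sketch for the arbitrary-covariance case, using the equivalent form gi(~x) = ln p(~x|ωi) + ln P(ωi) from the minimum-error-rate slide (it expands into the quadratic form above); the class parameters are assumptions.

```python
import numpy as np

classes = [  # (mu_i, Sigma_i, P(omega_i)) -- assumed values
    (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 0.3]]), 0.5),
    (np.array([2.0, 2.0]), np.array([[0.5, 0.2], [0.2, 2.0]]), 0.5),
]

def g(x, mu, Sigma, prior):
    """Quadratic discriminant: ln p(x|omega_i) + ln P(omega_i) for a Gaussian class."""
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(Sigma) @ diff
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

x = np.array([1.0, 0.5])
scores = [g(x, *c) for c in classes]
print(scores, "-> omega_%d" % (int(np.argmax(scores)) + 1))
```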
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
1D Decision Regions
FIGURE 2.13. Non-simply connected decision regions can arise in one dimension for Gaussians having unequal variance. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
Arbitrary Covariances/Various Priors
FIGURE 2.14. Arbitrary Gaussian distributions lead to Bayes decision boundaries that are general hyperquadrics. Conversely, given any hyperquadric, one can find two Gaussian distributions whose Bayes decision boundary is that hyperquadric. These variances are indicated by the contours of constant probability density. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
Four Categories
FIGURE 2.16. The decision regions for four normal distributions. Even with such a low number of categories, the shapes of the boundary regions can be rather complex. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Statistical Classifiers
Maximum–Likelihood Estimation
Parameter Estimation
Estimation of Prior and Class–Conditional Density: Training Data, Parameterized Model (Density)
Adapt Density Parameters to Training Data
Parameter Vector ~θj, p(~x|ωj) = p(~x|ωj, ~θj)
Training Samples Di for ωi, Di contains n samples ~x1 . . . ~xn; independent draws from p(~x|~θ) (generating Di)
Likelihood p(D|~θ) = Π_{k=1}^{n} p(~xk|~θ)
UV Pattern Recognition I
Statistical Classifiers
Maximum–Likelihood Estimation
Gaussian Likelihood
FIGURE 3.1. The top graph shows several training points in one dimension, known or assumed to be drawn from a Gaussian of a particular variance, but unknown mean. Four of the infinite number of candidate source distributions are shown in dashed lines. The middle figure shows the likelihood p(D|θ) as a function of the mean. If we had a very large number of training points, this likelihood would be very narrow. The value that maximizes the likelihood is marked θ̂; it also maximizes the logarithm of the likelihood, that is, the log-likelihood l(θ), shown at the bottom. Note that even though they look similar, the likelihood p(D|θ) is shown as a function of θ whereas the conditional density p(x|θ) is shown as a function of x. Furthermore, as a function of θ, the likelihood p(D|θ) is not a probability density function and its area has no significance. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Statistical Classifiers
Maximum–Likelihood Estimation
Maximum Likelihood
Maximize Likelihood p(D|~θ)→ Estimate ~θ
Log–Likelihood Function l(~θ) ≡ ln p(D|~θ), θ̂ = arg max l(~θ)
Maximize l(~θ) = Σ_{k=1}^{n} ln p(~xk|~θ), Gradient ∇_~θ l = ~0
UV Pattern Recognition I
Statistical Classifiers
Maximum–Likelihood Estimation
The Gaussian Case
Normal Density with unknown ~µ: ln p(~xk|~µ) = −(1/2) ln[(2π)^d |Σ|] − (1/2)(~xk − ~µ)^t Σ^{-1} (~xk − ~µ)
Gradient ∇_~µ ln p(~xk|~µ) = Σ^{-1}(~xk − ~µ)
Maximize ∇_~µ l = ~0 → Σ_{k=1}^{n} Σ^{-1}(~xk − ~µ) = ~0
Estimate ~µ = (1/n) Σ_{k=1}^{n} ~xk
Unknown ~µ and Σ: ~µ = (1/n) Σ_{k=1}^{n} ~xk, Σ = (1/n) Σ_{k=1}^{n} (~xk − ~µ)(~xk − ~µ)^t
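The closed-form ML estimates are one line each in code; the sketch draws an assumed Gaussian sample and computes the estimates of ~µ and Σ (the biased 1/n version from the slide).

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.5], [0.5, 1.0]], size=500)  # n x d sample (assumed source)

mu_hat = X.mean(axis=0)                      # (1/n) sum_k x_k
diff = X - mu_hat
Sigma_hat = diff.T @ diff / len(X)           # (1/n) sum_k (x_k - mu_hat)(x_k - mu_hat)^t
print(mu_hat, Sigma_hat)
```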
UV Pattern Recognition I
Statistical Classifiers
Maximum–Likelihood Estimation
Biased Estimates
Univariate Case σ̂^2 = (1/n) Σ_{k=1}^{n} (xk − µ̂)^2
E(σ̂^2) = ((n−1)/n) σ^2 ≠ σ^2, σ̂^2 is asymptotically unbiased
s^2 = (1/(n−1)) Σ_{k=1}^{n} (xk − µ̂)^2, s^2 is absolutely unbiased
UV Pattern Recognition I
Statistical Classifiers
Maximum–Likelihood Estimation
Classification Errors
Bayes Error: overlapping densities, inherent problem property
Model Error: incorrect model
Estimation Error: finite sample of training data
UV Pattern Recognition I
Statistical Classifiers
Maximum–Likelihood Estimation
Bayes Error and Dimensionality
FIGURE 3.3. Two three-dimensional distributions have nonoverlapping densities, and thus in three dimensions the Bayes error vanishes. When projected to a subspace (here, the two-dimensional x1-x2 subspace or a one-dimensional x1 subspace), there can be greater overlap of the projected distributions, and hence greater Bayes error. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Statistical Classifiers
Component Analysis
Principal Component Analysis
PCA is Dimensionality Reduction by Projection to the Best Data Representation
Representing Vector ~x0 of ~x1 . . . ~xn? Minimize J0(~x0) = Σ_{k=1}^{n} ||~x0 − ~xk||^2 → ~x0 = ~m = (1/n) Σ_{k=1}^{n} ~xk (sample mean, zero–dimensional representation)
One–dimensional Representation ~x = ~m + a~e (~e is a unit vector); Minimize J1(a1, . . . , an, ~e) = Σ_{k=1}^{n} ||(~m + ak~e) − ~xk||^2 → ak = ~e^t(~xk − ~m) (projection of ~xk onto ~e)
UV Pattern Recognition I
Statistical Classifiers
Component Analysis
Principal Component
Best Direction of ~e?
Minimize J1(~e) = −~e^t S ~e + Σ_{k=1}^{n} ||~xk − ~m||^2, with S = Σ_{k=1}^{n} (~xk − ~m)(~xk − ~m)^t (scatter matrix)
Maximize ~e^t S ~e with constraint ||~e|| = 1; Lagrange Multipliers u = ~e^t S ~e − λ(~e^t~e − 1)
Gradient ∂u/∂~e = 2S~e − 2λ~e → S~e = λ~e → ~e^t S ~e = λ ~e^t~e = λ
Principal Component is the Eigenvector with the Largest Eigenvalue of the Scatter Matrix
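A sketch of the result on assumed 2-D data: build the scatter matrix S, take the eigenvector with the largest eigenvalue as the principal component, and project the centered samples onto it.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal([0.0, 0.0], [[3.0, 1.2], [1.2, 1.0]], size=300)  # assumed n x d data

m = X.mean(axis=0)                          # sample mean (zero-dimensional representation)
D = X - m
S = D.T @ D                                 # scatter matrix sum_k (x_k - m)(x_k - m)^t
eigvals, eigvecs = np.linalg.eigh(S)        # eigendecomposition of the symmetric S
e = eigvecs[:, np.argmax(eigvals)]          # principal component: largest eigenvalue
a = D @ e                                   # coefficients a_k = e^t (x_k - m)
print(e, a[:5])
```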
UV Pattern Recognition I
Statistical Classifiers
Component Analysis
Fisher Linear Discriminant
PCA components represent the data well, but which components discriminate between classes?
Two classes, project samples onto ~w
Sample mean ~mi = (1/ni) Σ_{~x∈Di} ~x
~w^t ~mi = (1/ni) Σ_{~x∈Di} ~w^t~x = (1/ni) Σ_{y∈Yi} y = mi
Distance of means |m1 − m2| = |~w^t(~m1 − ~m2)|
Maximization is trivial (by scaling ~w)
UV Pattern Recognition I
Statistical Classifiers
Component Analysis
Linear Discriminant Projection
FIGURE 3.5. Projection of the same set of samples onto two different lines in the directions marked w. The figure on the right shows greater separation between the red and black projected points. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Statistical Classifiers
Component Analysis
Fisher Criterion
Sample scatter for each class si^2 = Σ_{y∈Yi} (y − mi)^2
Total within–class scatter s1^2 + s2^2
Fisher criterion J(~w) = |m1 − m2|^2 / (s1^2 + s2^2) (independent of |~w|, maximize)
Scatter matrices Si = Σ_{~x∈Di} (~x − ~mi)(~x − ~mi)^t
Within–class scatter SW = S1 + S2
Between–class scatter SB = (~m1 − ~m2)(~m1 − ~m2)^t
Fisher criterion J(~w) = (~w^t SB ~w) / (~w^t SW ~w) (generalized Rayleigh quotient, maximize)
UV Pattern Recognition I
Statistical Classifiers
Component Analysis
Linear Discriminant
J(~w) → max if SB ~w = λ SW ~w, i.e., SW^{-1} SB ~w = λ~w; since SB ~w is parallel to (~m1 − ~m2) → ~w = SW^{-1}(~m1 − ~m2) (Fisher's Linear Discriminant)
Remaining problem: find threshold
Assume classes with normal densities and equal covariance matrix Σ; recall the optimal decision boundary ~w^t~x + w0 = 0 with ~w = Σ^{-1}(~µ1 − ~µ2) (sample means and covariances yield the direction of Fisher's Linear Discriminant)
General method: fit the projected data to a univariate Gaussian; the threshold w0 is at equal posteriors
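A sketch of Fisher's linear discriminant on two assumed sample sets: compute the class means and scatter matrices, form SW, and solve SW ~w = ~m1 − ~m2 for the projection direction.

```python
import numpy as np

rng = np.random.default_rng(3)
X1 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.4], [0.4, 1.0]], size=100)  # assumed class 1 samples
X2 = rng.multivariate_normal([2.5, 1.5], [[1.0, 0.4], [0.4, 1.0]], size=100)  # assumed class 2 samples

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - m1).T @ (X1 - m1)                 # class scatter matrices
S2 = (X2 - m2).T @ (X2 - m2)
SW = S1 + S2                                 # within-class scatter
w = np.linalg.solve(SW, m1 - m2)             # w = SW^{-1}(m1 - m2)

y1, y2 = X1 @ w, X2 @ w                      # projected (1-D) samples
print(w, y1.mean(), y2.mean())
```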
UV Pattern Recognition I
Nonparametric Techniques
Density Estimation
Unknown Densities
Real problems: multi–modal; parametric densities: uni–modal → estimation of densities directly from data
P that pattern ~x falls in region R: P = ∫_R p(~x) d~x
n patterns, probability that k patterns are in R: Pk = (n choose k) P^k (1 − P)^{n−k}, E[k] = nP
Assuming a small region R → p(~x) ≈ const → ∫_R p(~x) d~x ≈ p(~x) V → p(~x) ≈ k/(nV)
UV Pattern Recognition I
Nonparametric Techniques
Density Estimation
Relative Probability
FIGURE 4.1. The relative probability an estimate given by Eq. 4 will yield a particular value for the probability density, here where the true probability was chosen to be 0.7. Each curve is labeled by the total number of patterns n sampled, and is scaled to give the same maximum (at the true probability). The form of each curve is binomial, as given by Eq. 2. For large n, such binomials peak strongly at the true probability. In the limit n → ∞, the curve approaches a delta function, and we are guaranteed that our estimate will give the true probability. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Nonparametric Techniques
Density Estimation
Sample Size
Estimate p(~x) ≈ k/(nV) is dependent on the size of V
If V → 0, p(~x) would be exact, but no more samples fall in V
Assuming an infinite pattern set with decreasing Vn
n–th estimate pn(~x) = kn/(n Vn)
For convergence of pn(~x) → p(~x): lim_{n→∞} Vn = 0, lim_{n→∞} kn = ∞, lim_{n→∞} kn/n = 0
Decreasing Vn, e.g., Vn = 1/√n → Parzen Windows
Increasing kn, e.g., kn = √n → kn–Nearest–Neighbors
UV Pattern Recognition I
Nonparametric Techniques
Density Estimation
Point Density Estimation
FIGURE 4.2. There are two leading methods for estimating the density at a point, here at the center of each square. The one shown in the top row is to start with a large volume centered on the test point and shrink it according to a function such as Vn = 1/√n. The other method, shown in the bottom row, is to decrease the volume in a data-dependent way, for instance letting the volume enclose some number kn = √n of sample points. The sequences in both cases represent random variables that generally converge and allow the true density at the test point to be calculated. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Nonparametric Techniques
Parzen Windows
Window Function
Region Rn is a hypercube with Vn = hn^d
Window function ϕ(~u) = 1 if |uj| ≤ 1/2 (j = 1, . . . , d), 0 otherwise (unit hypercube)
Number of samples ~xi in the hypercube with side length hn: kn = Σ_{i=1}^{n} ϕ((~x − ~xi)/hn)
Density estimate pn(~x) = (1/n) Σ_{i=1}^{n} (1/Vn) ϕ((~x − ~xi)/hn)
Generalization: ϕ(.) is an arbitrary density, pn(~x) is an interpolated density
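A sketch of the estimate with the unit-hypercube window defined above, on assumed 1-D samples; looping over h shows how the window width trades smoothness against resolution.

```python
import numpy as np

def phi(u):
    """Unit hypercube window: 1 if every |u_j| <= 1/2, else 0."""
    return np.all(np.abs(u) <= 0.5, axis=-1).astype(float)

def parzen_estimate(x, samples, h):
    """p_n(x) = (1/n) sum_i (1/V_n) phi((x - x_i)/h) with V_n = h^d."""
    n, d = samples.shape
    return phi((x - samples) / h).sum() / (n * h ** d)

rng = np.random.default_rng(4)
data = rng.normal(0.0, 1.0, size=(1000, 1))                 # assumed samples from N(0, 1)
for h in (1.0, 0.5, 0.2):
    print(h, parzen_estimate(np.array([0.0]), data, h))     # estimate at x = 0; true density there is ~0.399
```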
UV Pattern Recognition I
Nonparametric Techniques
Parzen Windows
Window Width
Effect of window width on pn(~x)
With δn(~x) = (1/Vn) ϕ(~x/hn): pn(~x) = (1/n) Σ_{i=1}^{n} δn(~x − ~xi)
hn large → δn(~x) has small amplitude, slowly changing, smooth
hn small → δn(~x) is a sharp pulse (Dirac) at each pattern
UV Pattern Recognition I
Nonparametric Techniques
Parzen Windows
Parzen Windows with Varying Width
[Panels: h = 1, 0.5, 0.2.]
FIGURE 4.3. Examples of two-dimensional circularly symmetric normal Parzen windows for three different values of h. Note that because the δ(x) are normalized, different vertical scales must be used to show their structure. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Nonparametric Techniques
Parzen Windows
Parzen Density Estimates
[Panels: h = 1, 0.5, 0.2.]
FIGURE 4.4. Three Parzen-window density estimates based on the same set of five samples, using the window functions in Fig. 4.3. As before, the vertical axes have been scaled to show the structure of each distribution. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Nonparametric Techniques
Parzen Windows
Convergence of Parzen Densities
Parzen density pn(~x) is a random variable dependent on the sample set of size n
Estimate pn(~x) converges to the true p(~x) if lim_{n→∞} p̄n(~x) = p(~x) and lim_{n→∞} σn^2(~x) = 0
Convergence of the mean: p̄n(~x) = E[pn(~x)] = (1/n) Σ_{i=1}^{n} E[δn(~x − ~xi)] = ∫ δn(~x − ~v) p(~v) d~v (convolution)
p̄n(~x) is a blurred p(~x); if lim_{n→∞} Vn = 0 → δn(.) becomes a Dirac pulse and the convolution yields the true p(~x)
UV Pattern Recognition I
Nonparametric Techniques
Parzen Windows
Convergence of Variance
Mean p̄n(~x) = p(~x) also for finite sample size if the window is a Dirac pulse, but the estimate is spiky with high variance
Variance σn^2(~x) = Σ_{i=1}^{n} E[((1/n) δn(~x − ~xi) − (1/n) p̄n(~x))^2] = (1/n)(E[δn^2(~x − ~xi)] − p̄n(~x)^2) = (1/n)(∫ δn^2(~x − ~v) p(~v) d~v − p̄n(~x)^2)
σn^2(~x) ≤ sup δn(.) p̄n(~x)/n = sup ϕ(.) p̄n(~x)/(n Vn)
If lim_{n→∞} n Vn = ∞ → lim_{n→∞} σn^2(~x) = 0
Vn may still approach 0
UV Pattern Recognition I
Nonparametric Techniques
Parzen Windows
Finite Samples
Finite samples: choosing ϕ(.) and Vn?
N(0, 1) window function ϕ(u) = (1/√(2π)) e^{−u^2/2}
Average of normal densities pn(x) = (1/n) Σ_{i=1}^{n} (1/hn) ϕ((x − xi)/hn)
UV Pattern Recognition I
Nonparametric Techniques
Parzen Windows
Parzen Estimates of 1D Normal Density
[Panels: window widths h1 = 1, 0.5, 0.1; sample sizes n = 1, 10, 100, ∞.]
FIGURE 4.5. Parzen-window estimates of a univariate normal density using different window widths and numbers of samples. The vertical axes have been scaled to best show the structure in each graph. Note particularly that the n = ∞ estimates are the same (and match the true density function), regardless of window width. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Nonparametric Techniques
Parzen Windows
Parzen Estimates of 2D Normal Density
[Panels: window widths h1 = 2, 1, 0.5; sample sizes n = 1, 100, 1000, ∞.]
FIGURE 4.6. Parzen-window estimates of a bivariate normal density using different window widths and numbers of samples. The vertical axes have been scaled to best show the structure in each graph. Note particularly that the n = ∞ estimates are the same (and match the true distribution), regardless of window width. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Nonparametric Techniques
Parzen Windows
Parzen Estimates of 1D Bimodal Density
[Panels: window widths h1 = 1, 0.5, 0.2; sample sizes n = 1, 16, 256, ∞.]
FIGURE 4.7. Parzen-window estimates of a bimodal distribution using different window widths and numbers of samples. Note particularly that the n = ∞ estimates are the same (and match the true distribution), regardless of window width. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Nonparametric Techniques
Parzen Windows
Decision Boundaries of 2D Parzen Dichotomizer
FIGURE 4.8. The decision boundaries in a two-dimensional Parzen-window dichotomizer depend on the window width h. At the left a small h leads to boundaries that are more complicated than for large h on the same data set, shown at the right. Apparently, for these data a small h would be appropriate for the upper region, while a large h would be appropriate for the lower region; no single window width is ideal overall. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.