
Page 1: UV Pattern Recognition I


UV Pattern Recognition I

Helmut A. Mayer

Department of Computer Sciences, University of Salzburg

SS 17

Page 2: UV Pattern Recognition I


Outline

1 Introduction

2 Statistical Classifiers
   Bayesian Decision Theory
   Discriminant Functions and Decision Surfaces
   Maximum–Likelihood Estimation
   Component Analysis

3 Nonparametric Techniques
   Density Estimation
   Parzen Windows

Page 3: UV Pattern Recognition I


Introduction

Human vs. Machine

Human Perception

Senses to neural patterns

Machine Perception

Sensors to value patterns

Patterns are everywhere...

Images, Time Series, Medical Diagnosis, Customer Analysis (only a few examples)

Features build Model

Fish Example

Page 4: UV Pattern Recognition I


Introduction

Salmon or Sea Bass

FIGURE 1.1. The objects to be classified are first sensed by a transducer (camera), whose signals are preprocessed. Next the features are extracted and finally the classification is emitted, here either “salmon” or “sea bass.” Although the information flow is often chosen to be from the source to the classifier, some systems employ information flow in which earlier levels of processing can be altered based on the tentative or preliminary response in later levels (gray arrows). Yet others combine two or more stages into a unified step, such as simultaneous segmentation and feature extraction. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Page 5: UV Pattern Recognition I


Introduction

Fish Length Histogram

(Histogram: count vs. length for salmon and sea bass; threshold marked l*.)

FIGURE 1.2. Histograms for the length feature for the two categories. No single threshold value of the length will serve to unambiguously discriminate between the two categories; using length alone, we will have some errors. The value marked l* will lead to the smallest number of errors, on average. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Page 6: UV Pattern Recognition I


Introduction

Fish Lightness Histogram

(Histogram: count vs. lightness for salmon and sea bass; threshold marked x*.)

FIGURE 1.3. Histograms for the lightness feature for the two categories. No single threshold value x* (decision boundary) will serve to unambiguously discriminate between the two categories; using lightness alone, we will have some errors. The value x* marked will lead to the smallest number of errors, on average. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Page 7: UV Pattern Recognition I


Introduction

Decision Theory

Cost of an Error?

Salmon tastes better... ;-)
Minimization of cost (risk)
Decision Rule/Boundary

Improving Recognition

Feature Vector $\vec{x} = \begin{pmatrix} \text{lightness} \\ \text{width} \end{pmatrix}$

2D Decision Boundary

Page 8: UV Pattern Recognition I


Introduction

2D Feature Space

(Scatter plot: width vs. lightness for salmon and sea bass, with a linear decision boundary.)

FIGURE 1.4. The two features of lightness and width for sea bass and salmon. The dark line could serve as a decision boundary of our classifier. Overall classification error on the data shown is lower than if we use only one feature as in Fig. 1.3, but there will still be some errors. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Page 9: UV Pattern Recognition I


Introduction

Overfitting

(Scatter plot: width vs. lightness for salmon and sea bass, with an overly complex decision boundary; a novel test point is marked ?.)

FIGURE 1.5. Overly complex models for the fish will lead to decision boundaries that are complicated. While such a decision may lead to perfect classification of our training samples, it would lead to poor performance on future patterns. The novel test point marked ? is evidently most likely a salmon, whereas the complex decision boundary shown leads it to be classified as a sea bass. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Page 10: UV Pattern Recognition I


Introduction

Generalization

(Scatter plot: width vs. lightness for salmon and sea bass, with a smooth decision boundary of intermediate complexity.)

FIGURE 1.6. The decision boundary shown might represent the optimal tradeoff between performance on the training set and simplicity of classifier, thereby giving the highest accuracy on new patterns. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Page 11: UV Pattern Recognition I


Introduction

Related Fields

Statistical Hypothesis Testing

Image Processing

Regression (age ↔ weight)

Interpolation

Density Estimation

Page 12: UV Pattern Recognition I


Introduction

Pattern Recognition Systems

(Block diagram: input → sensing → segmentation → feature extraction → classification → post-processing → decision, with adjustments for missing features, adjustments for context, and costs feeding into the later stages.)

FIGURE 1.7. Many pattern recognition systems can be partitioned into components such as the ones shown here. A sensor converts images or sounds or other physical inputs into signal data. The segmentor isolates sensed objects from the background or from other objects. A feature extractor measures object properties that are useful for classification. The classifier uses these features to assign the sensed object to a category. Finally, a post processor can take account of other considerations, such as the effects of context and the costs of errors, to decide on the appropriate action. Although this description stresses a one-way or “bottom-up” flow of data, some systems employ feedback from higher levels back down to lower levels (gray arrows). From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Page 13: UV Pattern Recognition I


Introduction

Feature Extraction

Features ↔ Classification

Invariant Features (translation, rotation, scale)

Deformation (e.g. Cropping)

Feature Selection (Filter, Wrapper)

Page 14: UV Pattern Recognition I


Introduction

Post Processing

Error Rate, Risk (weighted error)

Context (IC* *IN)

Multiple Classifiers (subspaces, fusion)

Page 15: UV Pattern Recognition I


Introduction

Design Cycle

(Design cycle: start → collect data → choose features → choose model → train classifier → evaluate classifier → end; prior knowledge (e.g., invariances) informs the choices.)

FIGURE 1.8. The design of a pattern recognition system involves a design cycle similar to the one shown here. Data must be collected, both to train and to test the system. The characteristics of the data impact both the choice of appropriate discriminating features and the choice of models for the different categories. The training process uses some or all of the data to determine the system parameters. The results of evaluation may call for repetition of various steps in this process in order to obtain satisfactory results. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Page 16: UV Pattern Recognition I


Introduction

Learning and Adaptation

Learning is Parameter Tuning

Supervised Learning (teacher)

Reinforcement Learning (critic)

Unsupervised Learning (clustering)

Page 17: UV Pattern Recognition I


Statistical Classifiers

Bayesian Decision Theory

Probabilities

State of Nature ω = ω1 (class)

A Priori Probability P(ω1) (prior)

Decision Rule P(ω1) > P(ω2)→ ω1

Class–Conditional Probability Density Function p(x |ω)

Page 18: UV Pattern Recognition I


Statistical Classifiers

Bayesian Decision Theory

Class–Conditional Probability Density

(Plot: two hypothetical class-conditional densities p(x|ω1) and p(x|ω2) over the feature value x.)

FIGURE 2.1. Hypothetical class-conditional probability density functions show the probability density of measuring a particular feature value x given the pattern is in category ωi. If x represents the lightness of a fish, the two curves might describe the difference in lightness of populations of two types of fish. Density functions are normalized, and thus the area under each curve is 1.0. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Page 19: UV Pattern Recognition I


Statistical Classifiers

Bayesian Decision Theory

Bayes Decision Rule

Joint Probability Density $p(\omega_j, x) = P(\omega_j|x)\,p(x) = p(x|\omega_j)\,P(\omega_j)$

Bayes Formula $P(\omega_j|x) = \dfrac{p(x|\omega_j)\,P(\omega_j)}{p(x)}$

Decision Rule $P(\omega_1|x) > P(\omega_2|x) \rightarrow \omega_1$
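A minimal numerical sketch of Bayes' formula and this decision rule for two classes; the priors and the Gaussian class-conditional densities below are hypothetical stand-ins, not values taken from the figures:

```python
# Bayes' formula and decision rule for two classes (priors and densities are hypothetical).
import numpy as np

def normal_pdf(x, mu, sigma):
    """Univariate normal density p(x) = exp(-0.5*((x-mu)/sigma)**2) / (sqrt(2*pi)*sigma)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

priors = {1: 2 / 3, 2: 1 / 3}                  # P(w1), P(w2)
params = {1: (11.0, 1.0), 2: (13.0, 1.5)}      # assumed (mu, sigma) for p(x|wj)

def posteriors(x):
    """P(wj|x) = p(x|wj) P(wj) / p(x), with p(x) as the normalizing evidence."""
    joint = {j: normal_pdf(x, *params[j]) * priors[j] for j in (1, 2)}
    evidence = sum(joint.values())
    return {j: joint[j] / evidence for j in (1, 2)}

def decide(x):
    post = posteriors(x)
    return 1 if post[1] > post[2] else 2       # larger posterior wins

print(posteriors(12.0), decide(12.0))
```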

Page 20: UV Pattern Recognition I


Statistical Classifiers

Bayesian Decision Theory

Posterior Probabilities

(Plot: posterior probabilities P(ω1|x) and P(ω2|x) over the feature value x.)

FIGURE 2.2. Posterior probabilities for the particular priors P(ω1) = 2/3 and P(ω2) = 1/3 for the class-conditional probability densities shown in Fig. 2.1. Thus in this case, given that a pattern is measured to have feature value x = 14, the probability it is in category ω2 is roughly 0.08, and that it is in ω1 is 0.92. At every x, the posteriors sum to 1.0. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Page 21: UV Pattern Recognition I


Statistical Classifiers

Bayesian Decision Theory

Error Probabilities

Error $P(\text{error}|x) = \begin{cases} P(\omega_1|x) & \text{if we decide } \omega_2 \\ P(\omega_2|x) & \text{if we decide } \omega_1 \end{cases}$

Average Error Probability $P(\text{error}) = \int_{-\infty}^{\infty} p(\text{error}, x)\,dx = \int_{-\infty}^{\infty} P(\text{error}|x)\,p(x)\,dx$

Bayes Rule minimizes P(error)

Page 22: UV Pattern Recognition I


Statistical Classifiers

Bayesian Decision Theory

Generalized Bayes Rule

Feature vector ~x ∈ Rd

Classes ω1 . . . ωc

Bayes Formula $P(\omega_j|\vec{x}) = \dfrac{p(\vec{x}|\omega_j)\,P(\omega_j)}{p(\vec{x})}$

Page 23: UV Pattern Recognition I


Statistical Classifiers

Bayesian Decision Theory

Conditional Risk

Actions/Decisions α1 . . . αa

Loss Function λ(αi |ωj)

Decision Rule α(~x)

Expected Loss/Conditional Risk $R(\alpha_i|\vec{x}) = \sum_{j=1}^{c} \lambda(\alpha_i|\omega_j)\,P(\omega_j|\vec{x})$

Page 24: UV Pattern Recognition I


Statistical Classifiers

Bayesian Decision Theory

Bayes Risk

Overall Risk $R = \int R(\alpha(\vec{x})|\vec{x})\,p(\vec{x})\,d\vec{x}$

Bayes Rule $R(\alpha_i|\vec{x}) \rightarrow \min\ (i = i^*) \rightarrow \alpha_{i^*}$

Bayes Risk R∗ is best performance

Page 25: UV Pattern Recognition I


Statistical Classifiers

Bayesian Decision Theory

Two–Class Example

Actions α1 → ω1, α2 → ω2

Loss Function λ(αi |ωj) = λij

Decision Rule/Likelihood Ratio $\dfrac{p(\vec{x}|\omega_1)}{p(\vec{x}|\omega_2)} > \dfrac{(\lambda_{12}-\lambda_{22})\,P(\omega_2)}{(\lambda_{21}-\lambda_{11})\,P(\omega_1)}\ (\lambda_{21} > \lambda_{11}) \rightarrow \omega_1$

Zero-One Loss Function $\lambda_{ij} = \begin{cases} 0 & i = j \\ 1 & i \neq j \end{cases}$

Conditional Risk is Average Error Probability (?)
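A small sketch of the minimum-conditional-risk decision for the two-class case; the posteriors and loss matrices below are hypothetical. With the zero-one loss the rule reduces to picking the larger posterior, while a sufficiently asymmetric loss can flip the decision:

```python
# Minimum conditional-risk decision for two classes (hypothetical loss matrices).
import numpy as np

def min_risk_decision(posteriors, loss):
    """posteriors: [P(w1|x), P(w2|x)]; loss[i, j] = lambda(alpha_i | w_j).
    Returns the 0-based index of the action with minimal conditional risk."""
    risks = loss @ posteriors                 # R(alpha_i|x) = sum_j lambda_ij P(w_j|x)
    return int(np.argmin(risks)), risks

post = np.array([0.92, 0.08])                 # example posteriors at some x
zero_one = np.array([[0.0, 1.0],              # zero-one loss
                     [1.0, 0.0]])
asym = np.array([[0.0, 20.0],                 # misclassifying w2 as w1 is 20x as costly
                 [1.0,  0.0]])

print(min_risk_decision(post, zero_one))      # action 0 -> decide w1
print(min_risk_decision(post, asym))          # action 1 -> the asymmetric loss flips the decision
```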

Page 26: UV Pattern Recognition I


Statistical Classifiers

Bayesian Decision Theory

Likelihood Ratio

(Plot: likelihood ratio p(x|ω1)/p(x|ω2) over x, with thresholds θa and θb defining the regions R1 and R2.)

FIGURE 2.3. The likelihood ratio p(x|ω1)/p(x|ω2) for the distributions shown in Fig. 2.1. If we employ a zero-one or classification loss, our decision boundaries are determined by the threshold θa. If our loss function penalizes miscategorizing ω2 as ω1 patterns more than the converse, we get the larger threshold θb, and hence R1 becomes smaller. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Page 27: UV Pattern Recognition I


Statistical Classifiers

Discriminant Functions and Decision Surfaces

Discriminant Functions

Discriminant Functions $g_i(\vec{x}),\ i = 1, \ldots, c$

$g_i(\vec{x}) > g_j(\vec{x})\ \forall j \neq i$

Bayes Classifier $g_i(\vec{x}) = -R(\alpha_i|\vec{x})$

Decision Invariance $g_i(\vec{x}) \rightarrow f(g_i(\vec{x}))$ if $f(\cdot)$ is (strictly) monotonically increasing

Bayes Minimum Error
$g_i(\vec{x}) = P(\omega_i|\vec{x})$
$g_i(\vec{x}) = p(\vec{x}|\omega_i)\,P(\omega_i)$
$g_i(\vec{x}) = \ln p(\vec{x}|\omega_i) + \ln P(\omega_i)$

Page 28: UV Pattern Recognition I


Statistical Classifiers

Discriminant Functions and Decision Surfaces

Discriminant Network

(Diagram: inputs x1, x2, x3, ..., xd feed c discriminant functions g1(x), g2(x), ..., gc(x); their maximum, together with costs, determines the action, e.g., classification.)

FIGURE 2.5. The functional structure of a general statistical pattern classifier which includes d inputs and c discriminant functions gi(x). A subsequent step determines which of the discriminant values is the maximum, and categorizes the input pattern accordingly. The arrows show the direction of the flow of information, though frequently the arrows are omitted when the direction of flow is self-evident. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Page 29: UV Pattern Recognition I


Statistical Classifiers

Discriminant Functions and Decision Surfaces

Two Categories

Discriminant Function $g(\vec{x}) \equiv g_1(\vec{x}) - g_2(\vec{x}), \quad g(\vec{x}) > 0 \rightarrow \omega_1$

Bayes Minimum Error
$g(\vec{x}) = P(\omega_1|\vec{x}) - P(\omega_2|\vec{x})$
$g(\vec{x}) = \ln \dfrac{p(\vec{x}|\omega_1)}{p(\vec{x}|\omega_2)} + \ln \dfrac{P(\omega_1)}{P(\omega_2)}$

Page 30: UV Pattern Recognition I


Statistical Classifiers

Discriminant Functions and Decision Surfaces

Dichotomizer

(Surface plot: p(x|ω1)P(ω1) and p(x|ω2)P(ω2) over a two-dimensional feature space, with the decision boundary separating the regions R1 and R2.)

FIGURE 2.6. In this two-dimensional two-category classifier, the probability densities are Gaussian, the decision boundary consists of two hyperbolas, and thus the decision region R2 is not simply connected. The ellipses mark where the density is 1/e times that at the peak of the distribution. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Page 31: UV Pattern Recognition I


Statistical Classifiers

Discriminant Functions and Decision Surfaces

The Normal Density

Randomized Prototype Vectors with Mean $\vec{\mu}$ → Normal Distribution

Expected Value
$E[f(x)] = \int_{-\infty}^{\infty} f(x)\,p(x)\,dx$ (continuous)
$E[f(x)] = \sum_{x \in D} f(x)\,P(x)$ (discrete)

Univariate Normal Density
$p(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$
$E[x] = \mu, \quad E[(x-\mu)^2] = \sigma^2$

Entropy $H(p(x)) = -\int_{-\infty}^{\infty} p(x)\,\ln p(x)\,dx$ (nats/bits)

Page 32: UV Pattern Recognition I


Statistical Classifiers

Discriminant Functions and Decision Surfaces

Normal Distribution

(Plot: univariate normal density p(x) with mean µ and standard deviation σ; about 2.5% of the area lies below µ − 2σ and 2.5% above µ + 2σ.)

FIGURE 2.7. A univariate normal distribution has roughly 95% of its area in the range |x − µ| ≤ 2σ, as shown. The peak of the distribution has value p(µ) = 1/(√(2π)σ). From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Page 33: UV Pattern Recognition I


Statistical Classifiers

Discriminant Functions and Decision Surfaces

Multivariate Density

Multivariate Normal Density
$p(\vec{x}) = \dfrac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}\,e^{-\frac{1}{2}(\vec{x}-\vec{\mu})^t\,\Sigma^{-1}\,(\vec{x}-\vec{\mu})}$

Covariance Matrix $\Sigma$ ($d \times d$)
$E[\vec{x}] = \vec{\mu}, \quad E[(\vec{x}-\vec{\mu})(\vec{x}-\vec{\mu})^t] = \Sigma$
$E[x_i] = \mu_i, \quad E[(x_i-\mu_i)(x_j-\mu_j)] = \sigma_{ij}$

Linear Transformation
$p(\vec{x}) \sim N(\vec{\mu}, \Sigma)$, $A$ ($d \times k$), $\vec{y} = A^t\vec{x} \rightarrow p(\vec{y}) \sim N(A^t\vec{\mu}, A^t\Sigma A)$
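A short sketch that evaluates the multivariate normal density directly from this formula; the mean and covariance below are hypothetical examples:

```python
# Evaluate the d-dimensional normal density from its formula.
import numpy as np

def mvn_pdf(x, mu, sigma):
    """p(x) = (2*pi)^(-d/2) |Sigma|^(-1/2) exp(-0.5 (x-mu)^t Sigma^-1 (x-mu))."""
    d = len(mu)
    diff = x - mu
    maha2 = diff @ np.linalg.solve(sigma, diff)              # (x-mu)^t Sigma^-1 (x-mu)
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * maha2) / norm_const

mu = np.array([1.0, 2.0])                                    # hypothetical mean
sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                               # hypothetical covariance
print(mvn_pdf(np.array([1.5, 1.5]), mu, sigma))
```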

Page 34: UV Pattern Recognition I


Statistical Classifiers

Discriminant Functions and Decision Surfaces

Linear Transformations

(Diagram in the x1–x2 plane: a source distribution N(µ, Σ), its image N(Atµ, AtΣA) under a linear transformation A, a projection P onto a line through a giving N(µ, σ²) along that line, and a whitening transform Aw giving a circularly symmetric Gaussian.)

FIGURE 2.8. The action of a linear transformation on the feature space will convert an arbitrary normal distribution into another normal distribution. One transformation, A, takes the source distribution into distribution N(Atµ, AtΣA). Another linear transformation, a projection P onto a line defined by vector a, leads to N(µ, σ²) measured along that line. While the transforms yield distributions in a different space, we show them superimposed on the original x1x2-space. A whitening transform, Aw, leads to a circularly symmetric Gaussian, here shown displaced. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Page 35: UV Pattern Recognition I


Statistical Classifiers

Discriminant Functions and Decision Surfaces

Mahalanobis Distance

Points of Constant Density $r^2 = (\vec{x}-\vec{\mu})^t\,\Sigma^{-1}\,(\vec{x}-\vec{\mu})$ ($r$ is the Mahalanobis distance)

Hyperellipsoid
axes are eigenvectors of $\Sigma$
axis lengths are given by the eigenvalues of $\Sigma$
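A small sketch computing the Mahalanobis distance r for a hypothetical µ and Σ; points with equal r lie on the same hyperellipsoid, so equal Euclidean distance need not mean equal Mahalanobis distance:

```python
# Mahalanobis distance between a point and a mean, given a covariance matrix.
import numpy as np

def mahalanobis(x, mu, sigma):
    diff = x - mu
    return float(np.sqrt(diff @ np.linalg.solve(sigma, diff)))

mu = np.array([0.0, 0.0])
sigma = np.array([[4.0, 0.0],
                  [0.0, 1.0]])                               # hypothetical, axis-aligned
# Two points at the same Euclidean distance, but different Mahalanobis distances:
print(mahalanobis(np.array([2.0, 0.0]), mu, sigma))          # 1.0 (along the long axis)
print(mahalanobis(np.array([0.0, 2.0]), mu, sigma))          # 2.0 (along the short axis)
```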

Page 36: UV Pattern Recognition I


Statistical Classifiers

Discriminant Functions and Decision Surfaces

2D Gaussian

(Scatter plot in the x1–x2 plane: samples from a two-dimensional Gaussian centered on the mean µ, with ellipses of equal probability density.)

FIGURE 2.9. Samples drawn from a two-dimensional Gaussian lie in a cloud centered on the mean µ. The ellipses show lines of equal probability density of the Gaussian. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Page 37: UV Pattern Recognition I


Statistical Classifiers

Discriminant Functions and Decision Surfaces

Normal Density Discriminant Functions

Minimum Error Rate $g_i(\vec{x}) = \ln p(\vec{x}|\omega_i) + \ln P(\omega_i)$

Multivariate Density
$g_i(\vec{x}) = -\frac{1}{2}(\vec{x}-\vec{\mu}_i)^t\,\Sigma_i^{-1}\,(\vec{x}-\vec{\mu}_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$

Case $\Sigma_i = \sigma^2 I$
$g_i(\vec{x}) = -\dfrac{\|\vec{x}-\vec{\mu}_i\|^2}{2\sigma^2} + \ln P(\omega_i), \quad \|\vec{x}-\vec{\mu}_i\|^2 = (\vec{x}-\vec{\mu}_i)^t(\vec{x}-\vec{\mu}_i)$

If $P(\omega_i) = \text{const}$ → Minimum-Distance-Classifier/Nearest Neighbor
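A minimal sketch of this discriminant for the case Σi = σ²I, with hypothetical class means and priors; with equal priors it reduces to assigning x to the nearest mean:

```python
# Discriminant g_i(x) = -||x - mu_i||^2 / (2 sigma^2) + ln P(w_i) for Sigma_i = sigma^2 I.
import numpy as np

def classify(x, means, priors, sigma2):
    """Return the index of the class with the largest discriminant value."""
    g = [-np.sum((x - mu) ** 2) / (2 * sigma2) + np.log(p)
         for mu, p in zip(means, priors)]
    return int(np.argmax(g))

means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]         # hypothetical class means
priors = [0.5, 0.5]                                          # equal priors -> nearest mean
print(classify(np.array([1.0, 1.0]), means, priors, sigma2=1.0))   # -> 0
```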

Page 38: UV Pattern Recognition I


Statistical Classifiers

Discriminant Functions and Decision Surfaces

Linear Discriminant Functions

Quadratic Form $g_i(\vec{x}) = -\dfrac{1}{2\sigma^2}\left[\vec{x}^t\vec{x} - 2\vec{\mu}_i^t\vec{x} + \vec{\mu}_i^t\vec{\mu}_i\right] + \ln P(\omega_i)$

Linear Machine $g_i(\vec{x}) = \vec{w}_i^t\vec{x} + w_{i0}$ ($w_{i0}$ threshold/bias)

Decision Boundary $g_i(\vec{x}) = g_j(\vec{x})$ (Hyperplanes)
Normal Form $\vec{w}^t(\vec{x}-\vec{x}_0) = 0 \rightarrow \vec{w} = \vec{\mu}_i - \vec{\mu}_j$
$\vec{x}_0 = \frac{1}{2}(\vec{\mu}_i + \vec{\mu}_j) - \dfrac{\sigma^2}{\|\vec{\mu}_i - \vec{\mu}_j\|^2}\,\ln\dfrac{P(\omega_i)}{P(\omega_j)}\,(\vec{\mu}_i - \vec{\mu}_j)$

Page 39: UV Pattern Recognition I


Statistical Classifiers

Discriminant Functions and Decision Surfaces

Equal Variances/Equal Priors

(Panels: one-, two-, and three-dimensional examples of the class-conditional densities p(x|ωi) and the boundary between the regions R1 and R2 for equal priors P(ω1) = P(ω2) = 0.5.)

FIGURE 2.10. If the covariance matrices for two distributions are equal and proportional to the identity matrix, then the distributions are spherical in d dimensions, and the boundary is a generalized hyperplane of d − 1 dimensions, perpendicular to the line separating the means. In these one-, two-, and three-dimensional examples, we indicate p(x|ωi) and the boundaries for the case P(ω1) = P(ω2). In the three-dimensional case, the grid plane separates R1 from R2. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Page 40: UV Pattern Recognition I


Statistical Classifiers

Discriminant Functions and Decision Surfaces

Equal Variances/Unequal Priors

(Panels: one-, two-, and three-dimensional spherical Gaussian examples with increasingly unequal priors, from P(ω1) = 0.7 vs. P(ω2) = 0.3 up to P(ω1) = 0.99 vs. P(ω2) = 0.01, showing the boundary between R1 and R2 shifting away from the midpoint of the means.)

FIGURE 2.11. As the priors are changed, the decision boundary shifts; for sufficiently disparate priors the boundary will not lie between the means of these one-, two- and three-dimensional spherical Gaussian distributions. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Page 41: UV Pattern Recognition I


Statistical Classifiers

Discriminant Functions and Decision Surfaces

Identical Covariance Matrices

Case $\Sigma_i = \Sigma$
$g_i(\vec{x}) = -\frac{1}{2}(\vec{x}-\vec{\mu}_i)^t\,\Sigma^{-1}\,(\vec{x}-\vec{\mu}_i) + \ln P(\omega_i)$

$P(\omega_i) = P_0$ → Minimal Distance $g_i(\vec{x}) = -d_m^2(\vec{x}, \vec{\mu}_i)$ ($d_m$ Mahalanobis distance)

Linear Decision Boundary ($\vec{x}^t\Sigma^{-1}\vec{x}$ independent of $i$)
Normal Form $\vec{w}^t(\vec{x}-\vec{x}_0) = 0 \rightarrow \vec{w} = \Sigma^{-1}(\vec{\mu}_i - \vec{\mu}_j)$
$\vec{x}_0 = \frac{1}{2}(\vec{\mu}_i + \vec{\mu}_j) - \dfrac{1}{d_m^2(\vec{\mu}_i, \vec{\mu}_j)}\,\ln\dfrac{P(\omega_i)}{P(\omega_j)}\,(\vec{\mu}_i - \vec{\mu}_j)$
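A small sketch computing the hyperplane parameters w and x0 from the normal form above, for hypothetical means, a shared covariance matrix, and unequal priors:

```python
# Linear decision boundary for two Gaussian classes sharing a covariance matrix Sigma.
import numpy as np

def boundary(mu_i, mu_j, sigma, p_i, p_j):
    """Return (w, x0) of the boundary w^t (x - x0) = 0."""
    diff = mu_i - mu_j
    w = np.linalg.solve(sigma, diff)                          # Sigma^-1 (mu_i - mu_j)
    dm2 = diff @ np.linalg.solve(sigma, diff)                 # squared Mahalanobis distance of the means
    x0 = 0.5 * (mu_i + mu_j) - (np.log(p_i / p_j) / dm2) * diff
    return w, x0

mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 1.0])         # hypothetical means
sigma = np.array([[1.0, 0.3], [0.3, 2.0]])                    # hypothetical shared covariance
w, x0 = boundary(mu1, mu2, sigma, p_i=0.7, p_j=0.3)
print(w, x0)   # unequal priors shift x0 away from the midpoint of the means
```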

Page 42: UV Pattern Recognition I


Statistical Classifiers

Discriminant Functions and Decision Surfaces

Equal Covariances/Various Priors

(Panels: two- and three-dimensional Gaussians with equal but non-spherical covariance matrices, for priors P(ω1) = 0.5, P(ω2) = 0.5 and P(ω1) = 0.1, P(ω2) = 0.9, with the resulting linear boundaries between R1 and R2.)

FIGURE 2.12. Probability densities (indicated by the surfaces in two dimensions and ellipsoidal surfaces in three dimensions) and decision regions for equal but asymmetric Gaussian distributions. The decision hyperplanes need not be perpendicular to the line connecting the means. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Page 43: UV Pattern Recognition I


Statistical Classifiers

Discriminant Functions and Decision Surfaces

Arbitrary Covariance Matrices

Case $\Sigma_i$ arbitrary
$g_i(\vec{x}) = \vec{x}^t W_i \vec{x} + \vec{w}_i^t\vec{x} + w_{i0}$
$W_i = -\frac{1}{2}\Sigma_i^{-1}, \quad \vec{w}_i = \Sigma_i^{-1}\vec{\mu}_i$

Decision Surfaces are Hyperquadrics
hyper–(planes, spheres, ellipsoids, paraboloids, hyperboloids)

Page 44: UV Pattern Recognition I


Statistical Classifiers

Discriminant Functions and Decision Surfaces

1D Decision Regions

(Plot: two univariate Gaussian class-conditional densities p(x|ω1) and p(x|ω2) with unequal variances; the decision region R1 consists of two disjoint intervals surrounding R2.)

FIGURE 2.13. Non-simply connected decision regions can arise in one dimension for Gaussians having unequal variance. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Page 45: UV Pattern Recognition I


Statistical Classifiers

Discriminant Functions and Decision Surfaces

Arbitrary Covariances/Various Priors

FIGURE 2.14. Arbitrary Gaussian distributions lead to Bayes decision boundaries that are general hyperquadrics. Conversely, given any hyperquadric, one can find two Gaussian distributions whose Bayes decision boundary is that hyperquadric. These variances are indicated by the contours of constant probability density. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Page 46: UV Pattern Recognition I


Statistical Classifiers

Discriminant Functions and Decision Surfaces

Four Categories

(Plot: decision regions R1–R4 for four normal distributions; R4 appears in two disjoint parts.)

FIGURE 2.16. The decision regions for four normal distributions. Even with such a low number of categories, the shapes of the boundary regions can be rather complex. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Page 47: UV Pattern Recognition I


Statistical Classifiers

Maximum–Likelihood Estimation

Parameter Estimation

Estimation of Prior and Class–Conditional Density
Training Data, Parameterized Model (Density)

Adapt Density Parameters to Training Data

Parameter Vector $\vec{\theta}_j$, $p(\vec{x}|\omega_j) = p(\vec{x}|\omega_j, \vec{\theta}_j)$

Training Samples $D_i$ for $\omega_i$; $D_i$ contains $n$ samples $\vec{x}_1 \ldots \vec{x}_n$
Independent draws from $p(\vec{x}|\vec{\theta})$ (generating $D_i$)

Likelihood $p(D|\vec{\theta}) = \prod_{k=1}^{n} p(\vec{x}_k|\vec{\theta})$

Page 48: UV Pattern Recognition I


Statistical Classifiers

Maximum–Likelihood Estimation

Gaussian Likelihood

(Three stacked plots: training points in one dimension with several candidate Gaussian source densities p(x|θ), the likelihood p(D|θ) as a function of the mean θ, and the log-likelihood l(θ); the maximizing value is marked θ̂.)

FIGURE 3.1. The top graph shows several training points in one dimension, known or assumed to be drawn from a Gaussian of a particular variance, but unknown mean. Four of the infinite number of candidate source distributions are shown in dashed lines. The middle figure shows the likelihood p(D|θ) as a function of the mean. If we had a very large number of training points, this likelihood would be very narrow. The value that maximizes the likelihood is marked θ̂; it also maximizes the logarithm of the likelihood, that is, the log-likelihood l(θ), shown at the bottom. Note that even though they look similar, the likelihood p(D|θ) is shown as a function of θ whereas the conditional density p(x|θ) is shown as a function of x. Furthermore, as a function of θ, the likelihood p(D|θ) is not a probability density function and its area has no significance. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Page 49: UV Pattern Recognition I


Statistical Classifiers

Maximum–Likelihood Estimation

Maximum Likelihood

Maximize Likelihood $p(D|\vec{\theta})$ → Estimate $\hat{\vec{\theta}}$

Log–Likelihood Function $l(\vec{\theta}) \equiv \ln p(D|\vec{\theta})$, $\hat{\vec{\theta}} = \arg\max l(\vec{\theta})$

Maximize $l(\vec{\theta}) = \sum_{k=1}^{n} \ln p(\vec{x}_k|\vec{\theta})$, Gradient $\vec{\nabla}_{\vec{\theta}}\,l = \vec{0}$

Page 50: UV Pattern Recognition I


Statistical Classifiers

Maximum–Likelihood Estimation

The Gaussian Case

Normal Density with unknown $\vec{\mu}$
$\ln p(\vec{x}_k|\vec{\mu}) = -\frac{1}{2}\ln\left[(2\pi)^d|\Sigma|\right] - \frac{1}{2}(\vec{x}_k-\vec{\mu})^t\,\Sigma^{-1}\,(\vec{x}_k-\vec{\mu})$

Gradient $\vec{\nabla}_{\vec{\mu}}\,\ln p(\vec{x}_k|\vec{\mu}) = \Sigma^{-1}(\vec{x}_k-\vec{\mu})$

Maximize $\vec{\nabla}_{\vec{\mu}}\,l = \vec{0} \rightarrow \sum_{k=1}^{n} \Sigma^{-1}(\vec{x}_k - \hat{\vec{\mu}}) = \vec{0}$

Estimate $\hat{\vec{\mu}} = \frac{1}{n}\sum_{k=1}^{n} \vec{x}_k$

Unknown $\vec{\mu}$ and $\Sigma$
$\hat{\vec{\mu}} = \frac{1}{n}\sum_{k=1}^{n}\vec{x}_k, \quad \hat{\Sigma} = \frac{1}{n}\sum_{k=1}^{n}(\vec{x}_k-\hat{\vec{\mu}})(\vec{x}_k-\hat{\vec{\mu}})^t$
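A minimal sketch of these maximum-likelihood estimates on synthetic data; the generating mean and covariance are arbitrary choices for illustration:

```python
# Maximum-likelihood estimates of mu and Sigma from samples (rows of X).
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, -2.0],
                            cov=[[2.0, 0.5], [0.5, 1.0]], size=500)   # synthetic training data

mu_hat = X.mean(axis=0)                              # (1/n) sum_k x_k
centered = X - mu_hat
sigma_hat = (centered.T @ centered) / len(X)         # (1/n) sum_k (x_k - mu_hat)(x_k - mu_hat)^t

print(mu_hat)
print(sigma_hat)
```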

Page 51: UV Pattern Recognition I


Statistical Classifiers

Maximum–Likelihood Estimation

Biased Estimates

Univariate Case $\hat{\sigma}^2 = \frac{1}{n}\sum_{k=1}^{n}(x_k - \hat{\mu})^2$

$E[\hat{\sigma}^2] = \frac{n-1}{n}\sigma^2 \neq \sigma^2$; $\hat{\sigma}^2$ is asymptotically unbiased

$s^2 = \frac{1}{n-1}\sum_{k=1}^{n}(x_k - \hat{\mu})^2$; $s^2$ is absolutely unbiased
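A short sketch contrasting the biased ML variance estimate with the unbiased sample variance on one small synthetic sample:

```python
# Biased (1/n) vs. unbiased (1/(n-1)) variance estimates on the same sample.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=2.0, size=10)          # synthetic data, true sigma^2 = 4

sigma2_ml = np.mean((x - x.mean()) ** 2)             # ML estimate, biased by factor (n-1)/n
s2 = np.sum((x - x.mean()) ** 2) / (len(x) - 1)      # sample variance, unbiased

print(sigma2_ml, s2)                                 # equivalently np.var(x) vs. np.var(x, ddof=1)
```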

Page 52: UV Pattern Recognition I


Statistical Classifiers

Maximum–Likelihood Estimation

Classification Errors

Bayes Error: overlapping densities, inherent problem property

Model Error: incorrect model

Estimation Error: finite sample of training data

Page 53: UV Pattern Recognition I


Statistical Classifiers

Maximum–Likelihood Estimation

Bayes Error and Dimensionality

(Sketch: two distributions in the three-dimensional x1–x2–x3 space and their projections onto lower-dimensional subspaces.)

FIGURE 3.3. Two three-dimensional distributions have nonoverlapping densities, and thus in three dimensions the Bayes error vanishes. When projected to a subspace (here, the two-dimensional x1–x2 subspace or a one-dimensional x1 subspace), there can be greater overlap of the projected distributions, and hence greater Bayes error. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Page 54: UV Pattern Recognition I


Statistical Classifiers

Component Analysis

Principal Component Analysis

PCA is Dimensionality Reduction by Projection to the Best Data Representation

Representing Vector $\vec{x}_0$ of $\vec{x}_1 \ldots \vec{x}_n$?
Minimize $J_0(\vec{x}_0) = \sum_{k=1}^{n}\|\vec{x}_0 - \vec{x}_k\|^2 \rightarrow$
$\vec{x}_0 = \vec{m} = \frac{1}{n}\sum_{k=1}^{n}\vec{x}_k$ (sample mean, zero–dimensional representation)

One–dimensional Representation $\vec{x} = \vec{m} + a\vec{e}$ ($\vec{e}$ is a unit vector)
Minimize $J_1(a_1, \ldots, a_n, \vec{e}) = \sum_{k=1}^{n}\|(\vec{m} + a_k\vec{e}) - \vec{x}_k\|^2 \rightarrow$
$a_k = \vec{e}^t(\vec{x}_k - \vec{m})$ (projection of $\vec{x}_k$ onto $\vec{e}$)

Page 55: UV Pattern Recognition I


Statistical Classifiers

Component Analysis

Principal Component

Best Direction of $\vec{e}$?

Minimize $J_1(\vec{e}) = -\vec{e}^t S\vec{e} + \sum_{k=1}^{n}\|\vec{x}_k - \vec{m}\|^2$
$S = \sum_{k=1}^{n}(\vec{x}_k - \vec{m})(\vec{x}_k - \vec{m})^t$ (scatter matrix)

Maximize $\vec{e}^t S\vec{e}$ with constraint $\|\vec{e}\| = 1$
Lagrange Multipliers $u = \vec{e}^t S\vec{e} - \lambda(\vec{e}^t\vec{e} - 1)$

Gradient $\frac{\partial u}{\partial \vec{e}} = 2S\vec{e} - 2\lambda\vec{e} \rightarrow S\vec{e} = \lambda\vec{e} \rightarrow \vec{e}^t S\vec{e} = \lambda\vec{e}^t\vec{e} = \lambda$

Principal Component is the Eigenvector with the Largest Eigenvalue of the Scatter Matrix
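A minimal sketch of this result on synthetic data: form the scatter matrix, take the eigenvector with the largest eigenvalue, and build the one-dimensional representation:

```python
# Principal component: eigenvector of the scatter matrix with the largest eigenvalue.
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal([0.0, 0.0], [[3.0, 1.2], [1.2, 1.0]], size=300)   # synthetic data

m = X.mean(axis=0)                                   # sample mean (zero-dimensional representation)
centered = X - m
S = centered.T @ centered                            # scatter matrix
eigvals, eigvecs = np.linalg.eigh(S)                 # eigh: S is symmetric
e = eigvecs[:, np.argmax(eigvals)]                   # principal component (unit vector)

a = centered @ e                                     # a_k = e^t (x_k - m), projections onto e
X_recon = m + np.outer(a, e)                         # best one-dimensional reconstruction
print(e, np.mean(np.sum((X - X_recon) ** 2, axis=1)))
```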

Page 56: UV Pattern Recognition I


Statistical Classifiers

Component Analysis

Fisher Linear Discriminant

PCA components represent data, but which discriminate?

Two classes, project samples onto $\vec{w}$
Sample mean $\vec{m}_i = \frac{1}{n_i}\sum_{\vec{x} \in D_i}\vec{x}$
$\vec{w}^t\vec{m}_i = \frac{1}{n_i}\sum_{\vec{x} \in D_i}\vec{w}^t\vec{x} = \frac{1}{n_i}\sum_{y \in Y_i} y = m_i$

Distance of means $|m_1 - m_2| = |\vec{w}^t(\vec{m}_1 - \vec{m}_2)|$

Maximization is trivial (by scaling $\vec{w}$)

Page 57: UV Pattern Recognition I


Statistical Classifiers

Component Analysis

Linear Discriminant Projection

(Two scatter plots in the x1–x2 plane: the same samples projected onto two different directions w; the right-hand projection separates the classes better.)

FIGURE 3.5. Projection of the same set of samples onto two different lines in the directions marked w. The figure on the right shows greater separation between the red and black projected points. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Page 58: UV Pattern Recognition I


Statistical Classifiers

Component Analysis

Fisher Criterion

Sample scatter for each class $s_i^2 = \sum_{y \in Y_i}(y - m_i)^2$

Total within–class scatter $s_1^2 + s_2^2$

Fisher criterion $J(\vec{w}) = \dfrac{|m_1 - m_2|^2}{s_1^2 + s_2^2}$ (independent of $|\vec{w}|$, maximize)

Scatter matrices $S_i = \sum_{\vec{x} \in D_i}(\vec{x} - \vec{m}_i)(\vec{x} - \vec{m}_i)^t$

Within–class scatter $S_W = S_1 + S_2$

Between–class scatter $S_B = (\vec{m}_1 - \vec{m}_2)(\vec{m}_1 - \vec{m}_2)^t$

Fisher criterion $J(\vec{w}) = \dfrac{\vec{w}^t S_B\vec{w}}{\vec{w}^t S_W\vec{w}}$ (generalized Rayleigh quotient, maximize)

Page 59: UV Pattern Recognition I


Statistical Classifiers

Component Analysis

Linear Discriminant

$J(\vec{w}) \rightarrow \max$ if $S_B\vec{w} = \lambda S_W\vec{w}$
$S_W^{-1} S_B\vec{w} = \lambda\vec{w}$, $S_B\vec{w} \parallel (\vec{m}_1 - \vec{m}_2) \rightarrow$
$\vec{w} = S_W^{-1}(\vec{m}_1 - \vec{m}_2)$ (Fisher's Linear Discriminant)

Remaining problem: find the threshold

Assume classes with normal densities and equal covariance matrix $\Sigma$
Recall the optimal decision boundary $\vec{w}^t\vec{x} + w_0 = 0$ with $\vec{w} = \Sigma^{-1}(\vec{\mu}_1 - \vec{\mu}_2)$ (sample means and covariances yield the direction of Fisher's Linear Discriminant)

General method: fit the projected data of each class to a univariate Gaussian; the threshold $w_0$ is where the posteriors are equal
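A small sketch of Fisher's linear discriminant on two synthetic sample sets; the midpoint-of-projected-means threshold used below is a simplification assuming equal priors, not the general equal-posterior rule from the slide:

```python
# Fisher's linear discriminant w = S_W^-1 (m1 - m2) for two sample sets.
import numpy as np

rng = np.random.default_rng(3)
X1 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.4], [0.4, 1.0]], size=100)   # class 1 samples
X2 = rng.multivariate_normal([2.5, 1.0], [[1.0, 0.4], [0.4, 1.0]], size=100)   # class 2 samples

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - m1).T @ (X1 - m1)
S2 = (X2 - m2).T @ (X2 - m2)
Sw = S1 + S2                                         # within-class scatter
w = np.linalg.solve(Sw, m1 - m2)                     # Fisher's linear discriminant direction

y1, y2 = X1 @ w, X2 @ w                              # projected samples
w0 = -0.5 * (y1.mean() + y2.mean())                  # simplified threshold: midpoint of projected means
pred_class1 = X2 @ w + w0 > 0                        # True means "decide class 1"
print(w, np.mean(~pred_class1))                      # fraction of class-2 samples on the class-2 side
```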

Page 60: UV Pattern Recognition I


Nonparametric Techniques

Density Estimation

Unknown Densities

Real problems: multi–modal; parametric densities: uni–modal → estimation of densities directly from data

$P$ that pattern $\vec{x}$ falls in region $R$: $P = \int_R p(\vec{x})\,d\vec{x}$

$n$ patterns, probability that $k$ patterns are in $R$:
$P_k = \binom{n}{k} P^k (1-P)^{n-k}, \quad E[k] = nP$

Assuming a small region $R$ → $p(\vec{x}) \simeq \text{const}$ →
$\int_R p(\vec{x})\,d\vec{x} \simeq p(\vec{x})\,V \rightarrow p(\vec{x}) \simeq \dfrac{k}{nV}$

Page 61: UV Pattern Recognition I


Nonparametric Techniques

Density Estimation

Relative Probability

(Plot: relative probability of obtaining a given value of the estimate k/n, for n = 20, 50, and 100 samples, with the true probability P = 0.7.)

FIGURE 4.1. The relative probability an estimate given by Eq. 4 will yield a particular value for the probability density, here where the true probability was chosen to be 0.7. Each curve is labeled by the total number of patterns n sampled, and is scaled to give the same maximum (at the true probability). The form of each curve is binomial, as given by Eq. 2. For large n, such binomials peak strongly at the true probability. In the limit n → ∞, the curve approaches a delta function, and we are guaranteed that our estimate will give the true probability. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Page 62: UV Pattern Recognition I


Nonparametric Techniques

Density Estimation

Sample Size

Estimate $p(\vec{x}) \simeq \dfrac{k}{nV}$ is dependent on the size of $V$

If $V \rightarrow 0$, $p(\vec{x})$ would be exact, but no more samples fall in $V$

Assuming an infinite pattern set with decreasing $V_n$

$n$–th estimate $p_n(\vec{x}) = \dfrac{k_n}{nV_n}$

For convergence of $p_n(\vec{x}) \rightarrow p(\vec{x})$:
$\lim_{n\to\infty} V_n = 0, \quad \lim_{n\to\infty} k_n = \infty, \quad \lim_{n\to\infty} \frac{k_n}{n} = 0$

Decreasing $V_n$, e.g., $V_n = \frac{1}{\sqrt{n}}$ → Parzen Windows

Increasing $k_n$, e.g., $k_n = \sqrt{n}$ → $k_n$–Nearest–Neighbors

Page 63: UV Pattern Recognition I


Nonparametric Techniques

Density Estimation

Point Density Estimation

(Two rows of squares for n = 1, 4, 9, 16, ..., 100: shrinking the volume as Vn = 1/√n (top) versus growing the enclosed sample count as kn = √n (bottom).)

FIGURE 4.2. There are two leading methods for estimating the density at a point, here at the center of each square. The one shown in the top row is to start with a large volume centered on the test point and shrink it according to a function such as Vn = 1/√n. The other method, shown in the bottom row, is to decrease the volume in a data-dependent way, for instance letting the volume enclose some number kn = √n of sample points. The sequences in both cases represent random variables that generally converge and allow the true density at the test point to be calculated. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Page 64: UV Pattern Recognition I


Nonparametric Techniques

Parzen Windows

Window Function

Region $R_n$ is a hypercube with $V_n = h_n^d$

Window function $\varphi(\vec{u}) = \begin{cases} 1 & |u_j| \leq \frac{1}{2} \\ 0 & \text{otherwise} \end{cases}$ (unit hypercube)

Number of samples $\vec{x}_i$ in the hypercube with side length $h_n$:
$k_n = \sum_{i=1}^{n} \varphi\!\left(\dfrac{\vec{x}-\vec{x}_i}{h_n}\right)$

Density estimate $p_n(\vec{x}) = \dfrac{1}{n}\sum_{i=1}^{n} \dfrac{1}{V_n}\,\varphi\!\left(\dfrac{\vec{x}-\vec{x}_i}{h_n}\right)$

Generalization: $\varphi(\cdot)$ is an arbitrary density, $p_n(\vec{x})$ is an interpolated density
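A minimal sketch of the Parzen-window estimate with the unit-hypercube window; the sample set and the window widths h below are hypothetical:

```python
# Parzen-window density estimate with a unit-hypercube window.
import numpy as np

def hypercube_window(u):
    """phi(u) = 1 if |u_j| <= 1/2 for every coordinate j, else 0."""
    return np.all(np.abs(u) <= 0.5, axis=-1).astype(float)

def parzen_estimate(x, samples, h):
    """p_n(x) = (1/n) sum_i (1/V_n) phi((x - x_i)/h_n), with V_n = h^d."""
    n, d = samples.shape
    Vn = h ** d
    return np.sum(hypercube_window((x - samples) / h)) / (n * Vn)

rng = np.random.default_rng(4)
samples = rng.normal(loc=0.0, scale=1.0, size=(200, 1))     # 1-D training patterns
for h in (1.0, 0.5, 0.2):                                   # wider h -> smoother estimate
    print(h, parzen_estimate(np.array([0.0]), samples, h))
```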

Page 65: UV Pattern Recognition I


Nonparametric Techniques

Parzen Windows

Window Width

Effect of the window width on $p_n(\vec{x})$

With $\delta_n(\vec{x}) = \dfrac{1}{V_n}\,\varphi\!\left(\dfrac{\vec{x}}{h_n}\right)$:
$p_n(\vec{x}) = \dfrac{1}{n}\sum_{i=1}^{n}\delta_n(\vec{x}-\vec{x}_i)$

$h_n \gg$ → $\delta_n(\vec{x})$ has small amplitude, slowly changing, smooth
$h_n \ll$ → $\delta_n(\vec{x})$ is a sharp pulse (Dirac) at each pattern

Page 66: UV Pattern Recognition I


Nonparametric Techniques

Parzen Windows

Parzen Windows with Varying Width

(Surface plots of δ(x) for h = 1, 0.5, and 0.2: the window becomes narrower and taller as h decreases.)

FIGURE 4.3. Examples of two-dimensional circularly symmetric normal Parzen windows for three different values of h. Note that because the δ(x) are normalized, different vertical scales must be used to show their structure. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Page 67: UV Pattern Recognition I


Nonparametric Techniques

Parzen Windows

Parzen Density Estimates

(Surface plots of p(x) for h = 1, 0.5, and 0.2, each built from the same five samples.)

FIGURE 4.4. Three Parzen-window density estimates based on the same set of five samples, using the window functions in Fig. 4.3. As before, the vertical axes have been scaled to show the structure of each distribution. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Page 68: UV Pattern Recognition I


Nonparametric Techniques

Parzen Windows

Convergence of Parzen Densities

Parzen density $p_n(\vec{x})$ is a random variable dependent on the sample set of size $n$

Estimate $p_n(\vec{x})$ converges to the true $p(\vec{x})$ if
$\lim_{n\to\infty} \bar{p}_n(\vec{x}) = p(\vec{x})$ and $\lim_{n\to\infty} \sigma_n^2(\vec{x}) = 0$

Convergence of the mean $\bar{p}_n(\vec{x}) = E[p_n(\vec{x})]$:
$\bar{p}_n(\vec{x}) = \dfrac{1}{n}\sum_{i=1}^{n} E[\delta_n(\vec{x}-\vec{x}_i)] = \int \delta_n(\vec{x}-\vec{v})\,p(\vec{v})\,d\vec{v}$ (convolution)

$\bar{p}_n(\vec{x})$ is a blurred $p(\vec{x})$; if $\lim_{n\to\infty} V_n = 0$ → $\delta_n(\cdot)$ is a Dirac delta and the convolution yields the true $p(\vec{x})$

Page 69: UV Pattern Recognition I


Nonparametric Techniques

Parzen Windows

Convergence of Variance

Estimate $\bar{p}_n(\vec{x}) = p(\vec{x})$ also for a finite sample size if the window is a Dirac delta, but then the estimate is spiky (high variance)

Variance $\sigma_n^2(\vec{x}) = \sum_{i=1}^{n} E\!\left[\left(\tfrac{1}{n}\delta_n(\vec{x}-\vec{x}_i) - \tfrac{1}{n}\bar{p}_n(\vec{x})\right)^{\!2}\right] = \tfrac{1}{n}\left(E[\delta_n^2(\vec{x}-\vec{x}_i)] - \bar{p}_n(\vec{x})^2\right) = \tfrac{1}{n}\left(\int \delta_n^2(\vec{x}-\vec{v})\,p(\vec{v})\,d\vec{v} - \bar{p}_n(\vec{x})^2\right)$

$\sigma_n^2(\vec{x}) \leq \dfrac{\sup \delta_n(\cdot)\,\bar{p}_n(\vec{x})}{n} = \dfrac{\sup \varphi(\cdot)\,\bar{p}_n(\vec{x})}{nV_n}$

If $\lim_{n\to\infty} nV_n = \infty$ → $\lim_{n\to\infty}\sigma_n^2(\vec{x}) = 0$

$V_n$ may still approach 0

Page 70: UV Pattern Recognition I


Nonparametric Techniques

Parzen Windows

Finite Samples

Finite samples: how to choose $\varphi(\cdot)$ and $V_n$?

$N(0,1)$ window function $\varphi(u) = \dfrac{1}{\sqrt{2\pi}}\,e^{-\frac{u^2}{2}}$

Average of normal densities $p_n(x) = \dfrac{1}{n}\sum_{i=1}^{n}\dfrac{1}{h_n}\,\varphi\!\left(\dfrac{x-x_i}{h_n}\right)$

Page 71: UV Pattern Recognition I


Nonparametric Techniques

Parzen Windows

Parzen Estimates of 1D Normal Density

(Grid of plots: estimates for h1 = 1, 0.5, 0.1 (columns) and n = 1, 10, 100, ∞ (rows).)

FIGURE 4.5. Parzen-window estimates of a univariate normal density using different window widths and numbers of samples. The vertical axes have been scaled to best show the structure in each graph. Note particularly that the n = ∞ estimates are the same (and match the true density function), regardless of window width. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Page 72: UV Pattern Recognition I


Nonparametric Techniques

Parzen Windows

Parzen Estimates of 2D Normal Density

(Grid of surface plots: estimates for h1 = 2, 1, 0.5 (columns) and n = 1, 100, 1000, ∞ (rows).)

FIGURE 4.6. Parzen-window estimates of a bivariate normal density using different window widths and numbers of samples. The vertical axes have been scaled to best show the structure in each graph. Note particularly that the n = ∞ estimates are the same (and match the true distribution), regardless of window width. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Page 73: UV Pattern Recognition I


Nonparametric Techniques

Parzen Windows

Parzen Estimates of 1D Bimodal Density

(Grid of plots: estimates for h1 = 1, 0.5, 0.2 (columns) and n = 1, 16, 256, ∞ (rows).)

FIGURE 4.7. Parzen-window estimates of a bimodal distribution using different window widths and numbers of samples. Note particularly that the n = ∞ estimates are the same (and match the true distribution), regardless of window width. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.

Page 74: UV Pattern Recognition I


Nonparametric Techniques

Parzen Windows

Decision Boundaries of 2D Parzen Dichotomizer

(Two scatter plots in the x1–x2 plane: decision boundaries of the same data for a small and a large window width h.)

FIGURE 4.8. The decision boundaries in a two-dimensional Parzen-window dichotomizer depend on the window width h. At the left a small h leads to boundaries that are more complicated than for large h on the same data set, shown at the right. Apparently, for these data a small h would be appropriate for the upper region, while a large h would be appropriate for the lower region; no single window width is ideal overall. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright © 2001 by John Wiley & Sons, Inc.