UV Pattern Recognition I
Helmut A. Mayer
Department of Computer Sciences, University of Salzburg
SS 17
UV Pattern Recognition I
Outline
1 Introduction
2 Statistical Classifiers: Bayesian Decision Theory, Discriminant Functions and Decision Surfaces, Maximum–Likelihood Estimation, Component Analysis
3 Nonparametric Techniques: Density Estimation, Parzen Windows
UV Pattern Recognition I
Introduction
Human vs. Machine
Human Perception
Senses to neural patterns
Machine Perception
Sensors to value patterns
Patterns are everywhere...
Images, Time Series, Medical Diagnosis, Customer Analysis (only a few examples)
Features build Model
Fish Example
UV Pattern Recognition I
Introduction
Salmon or Sea Bass
FIGURE 1.1. The objects to be classified are first sensed by a transducer (camera), whose signals are preprocessed. Next the features are extracted and finally the classification is emitted, here either "salmon" or "sea bass." Although the information flow is often chosen to be from the source to the classifier, some systems employ information flow in which earlier levels of processing can be altered based on the tentative or preliminary response in later levels (gray arrows). Yet others combine two or more stages into a unified step, such as simultaneous segmentation and feature extraction. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Introduction
Fish Length Histogram
FIGURE 1.2. Histograms for the length feature for the two categories. No single threshold value of the length will serve to unambiguously discriminate between the two categories; using length alone, we will have some errors. The value marked l* will lead to the smallest number of errors, on average. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Introduction
Fish Lightness Histogram
FIGURE 1.3. Histograms for the lightness feature for the two categories. No single threshold value x* (decision boundary) will serve to unambiguously discriminate between the two categories; using lightness alone, we will have some errors. The value x* marked will lead to the smallest number of errors, on average. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Introduction
Decision Theory
Cost of an Error?
Salmon tastes better... ;-)
Minimization of cost (risk)
Decision Rule/Boundary
Improving Recognition
Feature Vector ~x = (lightness, width)^t
2D Decision Boundary
UV Pattern Recognition I
Introduction
2D Feature Space
FIGURE 1.4. The two features of lightness and width for sea bass and salmon. The dark line could serve as a decision boundary of our classifier. Overall classification error on the data shown is lower than if we use only one feature as in Fig. 1.3, but there will still be some errors. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Introduction
Overfitting
FIGURE 1.5. Overly complex models for the fish will lead to decision boundaries that are complicated. While such a decision may lead to perfect classification of our training samples, it would lead to poor performance on future patterns. The novel test point marked ? is evidently most likely a salmon, whereas the complex decision boundary shown leads it to be classified as a sea bass. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Introduction
Generalization
FIGURE 1.6. The decision boundary shown might represent the optimal tradeoff between performance on the training set and simplicity of classifier, thereby giving the highest accuracy on new patterns. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Introduction
Related Fields
Statistical Hypothesis Testing
Image Processing
Regression (age ↔ weight)
Interpolation
Density Estimation
UV Pattern Recognition I
Introduction
Pattern Recognition Systems
[Figure: processing pipeline input → sensing → segmentation → feature extraction → classification → post-processing → decision, with feedback paths for adjustments for missing features, adjustments for context, and costs.]
FIGURE 1.7. Many pattern recognition systems can be partitioned into components such as the ones shown here. A sensor converts images or sounds or other physical inputs into signal data. The segmentor isolates sensed objects from the background or from other objects. A feature extractor measures object properties that are useful for classification. The classifier uses these features to assign the sensed object to a category. Finally, a post processor can take account of other considerations, such as the effects of context and the costs of errors, to decide on the appropriate action. Although this description stresses a one-way or "bottom-up" flow of data, some systems employ feedback from higher levels back down to lower levels (gray arrows). From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Introduction
Feature Extraction
Features ↔ Classification
Invariant Features (translation, rotation, scale)
Deformation (e.g. Cropping)
Feature Selection (Filter, Wrapper)
UV Pattern Recognition I
Introduction
Post Processing
Error Rate, Risk (weighted error)
Context (IC* *IN)
Multiple Classifiers (subspaces, fusion)
UV Pattern Recognition I
Introduction
Design Cycle
[Figure: design cycle start → collect data → choose features → choose model → train classifier → evaluate classifier → end, guided by prior knowledge (e.g., invariances).]
FIGURE 1.8. The design of a pattern recognition system involves a design cycle similar to the one shown here. Data must be collected, both to train and to test the system. The characteristics of the data impact both the choice of appropriate discriminating features and the choice of models for the different categories. The training process uses some or all of the data to determine the system parameters. The results of evaluation may call for repetition of various steps in this process in order to obtain satisfactory results. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Introduction
Learning and Adaptation
Learning is Parameter Tuning
Supervised Learning (teacher)
Reinforcement Learning (critic)
Unsupervised Learning (clustering)
UV Pattern Recognition I
Statistical Classifiers
Bayesian Decision Theory
Probabilities
State of Nature ω = ω1 (class)
A Priori Probability P(ω1) (prior)
Decision Rule P(ω1) > P(ω2)→ ω1
Class–Conditional Probability Density Function p(x |ω)
UV Pattern Recognition I
Statistical Classifiers
Bayesian Decision Theory
Class–Conditional Probability Density
FIGURE 2.1. Hypothetical class-conditional probability density functions show the probability density of measuring a particular feature value x given the pattern is in category ωi. If x represents the lightness of a fish, the two curves might describe the difference in lightness of populations of two types of fish. Density functions are normalized, and thus the area under each curve is 1.0. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Statistical Classifiers
Bayesian Decision Theory
Bayes Decision Rule
Joint Probability Density p(ωj, x) = P(ωj|x) p(x) = p(x|ωj) P(ωj)
Bayes Formula P(ωj|x) = p(x|ωj) P(ωj) / p(x)
Decision Rule P(ω1|x) > P(ω2|x)→ ω1
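A minimal numeric sketch of the rule above (not from the slides): two assumed Gaussian class-conditional densities and the priors P(ω1) = 2/3, P(ω2) = 1/3 used in the posterior figure below; the posteriors are computed with Bayes' formula and the larger one wins.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    # p(x|omega) = 1/(sqrt(2*pi)*sigma) * exp(-0.5*((x - mu)/sigma)^2)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

priors = np.array([2 / 3, 1 / 3])          # P(omega_1), P(omega_2)
params = [(10.0, 1.0), (12.5, 1.5)]        # assumed (mu, sigma) of p(x|omega_i)

def posteriors(x):
    joint = np.array([normal_pdf(x, m, s) for m, s in params]) * priors  # p(x|w_j) P(w_j)
    return joint / joint.sum()                                           # divide by evidence p(x)

post = posteriors(11.0)
print(post, "-> decide omega_%d" % (int(np.argmax(post)) + 1))           # larger posterior wins
```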
UV Pattern Recognition I
Statistical Classifiers
Bayesian Decision Theory
Posterior Probabilities
FIGURE 2.2. Posterior probabilities for the particular priors P(ω1) = 2/3 and P(ω2) = 1/3 for the class-conditional probability densities shown in Fig. 2.1. Thus in this case, given that a pattern is measured to have feature value x = 14, the probability it is in category ω2 is roughly 0.08, and that it is in ω1 is 0.92. At every x, the posteriors sum to 1.0. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Statistical Classifiers
Bayesian Decision Theory
Error Probabilities
Error P(error|x) = P(ω1|x) if we decide ω2, P(ω2|x) if we decide ω1
Average Error Probability P(error) = ∫ p(error, x) dx = ∫ P(error|x) p(x) dx (integrated over all x)
Bayes Rule minimizes P(error)
UV Pattern Recognition I
Statistical Classifiers
Bayesian Decision Theory
Generalized Bayes Rule
Feature vector ~x ∈ Rd
Classes ω1 . . . ωc
Bayes Formula P(ωj|~x) = p(~x|ωj) P(ωj) / p(~x)
UV Pattern Recognition I
Statistical Classifiers
Bayesian Decision Theory
Conditional Risk
Actions/Decisions α1 . . . αa
Loss Function λ(αi |ωj)
Decision Rule α(~x)
Expected Loss/Conditional Risk R(αi|~x) = Σ_{j=1}^{c} λ(αi|ωj) P(ωj|~x)
UV Pattern Recognition I
Statistical Classifiers
Bayesian Decision Theory
Bayes Risk
Overall Risk R = ∫ R(α(~x)|~x) p(~x) d~x
Bayes Rule: R(αi|~x) → min (i = i*) → αi*
Bayes Risk R∗ is best performance
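A small sketch of minimum-risk classification, assuming a hypothetical 2 × 2 loss matrix λ(αi|ωj) and a given posterior vector; the conditional risks are the loss-weighted posteriors, and the action with the smallest risk is selected.

```python
import numpy as np

# lam[i, j] = loss lambda(alpha_i | omega_j); values are assumptions for illustration
lam = np.array([[0.0, 2.0],    # alpha_1: decide omega_1
                [1.0, 0.0]])   # alpha_2: decide omega_2

def min_risk_action(post):
    """R(alpha_i|x) = sum_j lam[i, j] * P(omega_j|x); pick the action with minimal risk."""
    risks = lam @ post
    return int(np.argmin(risks)), risks

post = np.array([0.7, 0.3])    # posteriors P(omega_j|x) for some pattern x
action, risks = min_risk_action(post)
print(risks, "-> take alpha_%d" % (action + 1))
```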
UV Pattern Recognition I
Statistical Classifiers
Bayesian Decision Theory
Two–Class Example
Actions α1 → ω1, α2 → ω2
Loss Function λ(αi |ωj) = λij
Decision Rule/Likelihood Ratio: p(~x|ω1)/p(~x|ω2) > [(λ12 − λ22) P(ω2)] / [(λ21 − λ11) P(ω1)] (λ21 > λ11) → ω1
Zero-One Loss Function λij = 0 if i = j, 1 if i ≠ j
Conditional Risk is Average Error Probability (?)
UV Pattern Recognition I
Statistical Classifiers
Bayesian Decision Theory
Likelihood Ratio
FIGURE 2.3. The likelihood ratio p(x|ω1)/p(x|ω2) for the distributions shown in Fig. 2.1. If we employ a zero-one or classification loss, our decision boundaries are determined by the threshold θa. If our loss function penalizes miscategorizing ω2 as ω1 patterns more than the converse, we get the larger threshold θb, and hence R1 becomes smaller. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
Discriminant Functions
Discriminant Functions gi (~x), i = 1, . . . , c
gi(~x) > gj(~x) ∀ j ≠ i
Bayes Classifier gi(~x) = −R(αi|~x)
Decision Invariance gi(~x) → f(gi(~x)) if f(.) is (strictly) monotonically increasing
Bayes Minimum Error: gi(~x) = P(ωi|~x), or gi(~x) = p(~x|ωi) P(ωi), or gi(~x) = ln p(~x|ωi) + ln P(ωi)
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
Discriminant Network
FIGURE 2.5. The functional structure of a general statistical pattern classifier which includes d inputs and c discriminant functions gi(x). A subsequent step determines which of the discriminant values is the maximum, and categorizes the input pattern accordingly. The arrows show the direction of the flow of information, though frequently the arrows are omitted when the direction of flow is self-evident. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
Two Categories
Discriminant Function g(~x) ≡ g1(~x) − g2(~x), g(~x) > 0 → ω1
Bayes Minimum Error: g(~x) = P(ω1|~x) − P(ω2|~x)
g(~x) = ln [p(~x|ω1)/p(~x|ω2)] + ln [P(ω1)/P(ω2)]
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
Dichotomizer
FIGURE 2.6. In this two-dimensional two-category classifier, the probability densities are Gaussian, the decision boundary consists of two hyperbolas, and thus the decision region R2 is not simply connected. The ellipses mark where the density is 1/e times that at the peak of the distribution. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
The Normal Density
Randomized Prototype Vectors with Mean ~µ → Normal Distribution
Expected Value E[f(x)] = ∫ f(x) p(x) dx (continuous), E[f(x)] = Σ_{x∈D} f(x) P(x) (discrete)
Univariate Normal Density p(x) = (1/(√(2π) σ)) e^{−(1/2)((x−µ)/σ)^2}
E[x] = µ, E[(x − µ)^2] = σ^2
Entropy H(p(x)) = −∫ p(x) ln p(x) dx (nats/bits)
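A quick numeric check, assuming example values µ = 0 and σ = 2: the moments and the entropy of the univariate normal are approximated by a Riemann sum and compared to the closed-form Gaussian entropy (1/2) ln(2πeσ^2).

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

mu, sigma = 0.0, 2.0                                # assumed parameters
x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 200001)
dx = x[1] - x[0]
p = normal_pdf(x, mu, sigma)

mean = np.sum(x * p) * dx                           # E[x] -> mu
var = np.sum((x - mean) ** 2 * p) * dx              # E[(x - mu)^2] -> sigma^2
entropy = -np.sum(p * np.log(p)) * dx               # H(p) in nats
print(mean, var, entropy, 0.5 * np.log(2 * np.pi * np.e * sigma ** 2))
```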
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
Normal Distribution
FIGURE 2.7. A univariate normal distribution has roughly 95% of its area in the range |x − µ| ≤ 2σ, as shown. The peak of the distribution has value p(µ) = 1/(√(2π)σ). From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
Multivariate Density
Multivariate Normal Density p(~x) = (1/((2π)^{d/2} |Σ|^{1/2})) e^{−(1/2)(~x−~µ)^t Σ^{-1} (~x−~µ)}
Covariance Matrix Σ (d × d): E[~x] = ~µ, E[(~x − ~µ)(~x − ~µ)^t] = Σ, E[xi] = µi, E[(xi − µi)(xj − µj)] = σij
Linear Transformation: p(~x) ∼ N(~µ, Σ), A (d × k), ~y = A^t~x → p(~y) ∼ N(A^t~µ, A^tΣA)
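A sketch with assumed ~µ, Σ, and projection matrix A: the density is evaluated through its quadratic form, and the linear-transformation property ~y = A^t~x ∼ N(A^t~µ, A^tΣA) is checked by sampling.

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate normal density via the Mahalanobis quadratic form."""
    d = len(mu)
    diff = x - mu
    maha2 = diff @ np.linalg.inv(Sigma) @ diff                    # (x-mu)^t Sigma^{-1} (x-mu)
    return np.exp(-0.5 * maha2) / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))

mu = np.array([1.0, 2.0])                                         # assumed mean
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])                        # assumed covariance
print(mvn_pdf(np.array([1.5, 1.5]), mu, Sigma))

A = np.array([[1.0], [0.5]])                                      # assumed d x k projection
x_samples = np.random.default_rng(0).multivariate_normal(mu, Sigma, 100000)
y = x_samples @ A                                                 # y = A^t x for every sample
print(y.mean(axis=0), A.T @ mu)                                   # empirical mean vs. A^t mu
print(y.var(ddof=1), (A.T @ Sigma @ A)[0, 0])                     # empirical variance vs. A^t Sigma A
```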
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
Linear Transformations
FIGURE 2.8. The action of a linear transformation on the feature space will convert an arbitrary normal distribution into another normal distribution. One transformation, A, takes the source distribution into distribution N(A^tµ, A^tΣA). Another linear transformation, a projection P onto a line defined by vector a, leads to N(µ, σ^2) measured along that line. While the transforms yield distributions in a different space, we show them superimposed on the original x1x2-space. A whitening transform, Aw, leads to a circularly symmetric Gaussian, here shown displaced. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
Mahalanobis Distance
Points of Constant Density r^2 = (~x − ~µ)^t Σ^{-1} (~x − ~µ) (r is the Mahalanobis distance)
Hyperellipsoid: axes are eigenvectors of Σ, axis lengths are determined by the eigenvalues of Σ
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
2D Gaussian
FIGURE 2.9. Samples drawn from a two-dimensional Gaussian lie in a cloud centered on the mean µ. The ellipses show lines of equal probability density of the Gaussian. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
Normal Density Discriminant Functions
Minimum Error Rate gi(~x) = ln p(~x|ωi) + ln P(ωi)
Multivariate Density gi(~x) = −(1/2)(~x − ~µi)^t Σi^{-1} (~x − ~µi) − (d/2) ln 2π − (1/2) ln |Σi| + ln P(ωi)
Case Σi = σ^2 I: gi(~x) = −||~x − ~µi||^2/(2σ^2) + ln P(ωi), with ||~x − ~µi||^2 = (~x − ~µi)^t(~x − ~µi)
If P(ωi) = const → Minimum-Distance-Classifier/Nearest Neighbor
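For Σi = σ^2 I and equal priors, the rule above reduces to assigning ~x to the nearest class mean; a minimal sketch with two assumed means:

```python
import numpy as np

means = np.array([[2.0, 3.0],      # assumed mu_1
                  [6.0, 1.0]])     # assumed mu_2

def nearest_mean(x):
    """With equal priors and a common sigma, maximizing g_i(x) = -||x - mu_i||^2/(2 sigma^2)
    is the same as picking the closest mean."""
    d2 = np.sum((means - x) ** 2, axis=1)       # squared Euclidean distances
    return int(np.argmin(d2)) + 1               # 1-based class index

print(nearest_mean(np.array([3.0, 2.0])))       # -> 1
```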
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
Linear Discriminant Functions
Quadratic Form gi(~x) = −(1/(2σ^2))[~x^t~x − 2~µi^t~x + ~µi^t~µi] + ln P(ωi)
Linear Machine gi(~x) = ~wi^t~x + wi0 (wi0 threshold/bias)
Decision Boundary gi(~x) = gj(~x) (Hyperplanes)
Normal Form ~w^t(~x − ~x0) = 0 → ~w = ~µi − ~µj, ~x0 = (1/2)(~µi + ~µj) − (σ^2/||~µi − ~µj||^2) ln[P(ωi)/P(ωj)] (~µi − ~µj)
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
Equal Variances/Equal Priors
FIGURE 2.10. If the covariance matrices for two distributions are equal and proportional to the identity matrix, then the distributions are spherical in d dimensions, and the boundary is a generalized hyperplane of d − 1 dimensions, perpendicular to the line separating the means. In these one-, two-, and three-dimensional examples, we indicate p(x|ωi) and the boundaries for the case P(ω1) = P(ω2). In the three-dimensional case, the grid plane separates R1 from R2. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
Equal Variances/Unequal Priors
FIGURE 2.11. As the priors are changed, the decision boundary shifts; for sufficiently disparate priors the boundary will not lie between the means of these one-, two- and three-dimensional spherical Gaussian distributions. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
Identical Covariance Matrices
Case Σi = Σ: gi(~x) = −(1/2)(~x − ~µi)^t Σ^{-1} (~x − ~µi) + ln P(ωi)
P(ωi) = P0 → Minimal Distance: gi(~x) = −d_m^2(~x, ~µi) (d_m Mahalanobis distance)
Linear Decision Boundary (~x^t Σ^{-1} ~x is independent of i)
Normal Form ~w^t(~x − ~x0) = 0 → ~w = Σ^{-1}(~µi − ~µj), ~x0 = (1/2)(~µi + ~µj) − [1/d_m^2(~µi, ~µj)] ln[P(ωi)/P(ωj)] (~µi − ~µj)
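The hyperplane parameters follow directly from the formulas above; the sketch assumes example means, a shared covariance Σ, and priors 0.7/0.3.

```python
import numpy as np

mu1, mu2 = np.array([2.0, 3.0]), np.array([6.0, 1.0])   # assumed class means
Sigma = np.array([[1.5, 0.4], [0.4, 1.0]])              # assumed shared covariance
P1, P2 = 0.7, 0.3                                       # assumed priors

Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu2)                             # w = Sigma^{-1}(mu_1 - mu_2)
d2m = (mu1 - mu2) @ Sigma_inv @ (mu1 - mu2)             # squared Mahalanobis distance of the means
x0 = 0.5 * (mu1 + mu2) - (np.log(P1 / P2) / d2m) * (mu1 - mu2)

def decide(x):
    return 1 if w @ (x - x0) > 0 else 2                 # w^t (x - x0) > 0 -> omega_1

print(w, x0, decide(np.array([4.0, 2.0])))
```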
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
Equal Covariances/Various Priors
FIGURE 2.12. Probability densities (indicated by the surfaces in two dimensions and ellipsoidal surfaces in three dimensions) and decision regions for equal but asymmetric Gaussian distributions. The decision hyperplanes need not be perpendicular to the line connecting the means. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
Arbitrary Covariance Matrices
Case Σi arbitrary: gi(~x) = ~x^t Wi ~x + ~wi^t ~x + wi0
Wi = −(1/2) Σi^{-1}, ~wi = Σi^{-1} ~µi
Decision Surfaces are Hyperquadrics: hyper–(planes, spheres, ellipsoids, paraboloids, hyperboloids)
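A sketch for the arbitrary-covariance case, using the equivalent form gi(~x) = ln p(~x|ωi) + ln P(ωi) from the minimum-error-rate slide (it expands into the quadratic form above); the class parameters are assumptions.

```python
import numpy as np

classes = [  # (mu_i, Sigma_i, P(omega_i)) -- assumed values
    (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 0.3]]), 0.5),
    (np.array([2.0, 2.0]), np.array([[0.5, 0.2], [0.2, 2.0]]), 0.5),
]

def g(x, mu, Sigma, prior):
    """Quadratic discriminant: ln p(x|omega_i) + ln P(omega_i) for a Gaussian class."""
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(Sigma) @ diff
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

x = np.array([1.0, 0.5])
scores = [g(x, *c) for c in classes]
print(scores, "-> omega_%d" % (int(np.argmax(scores)) + 1))
```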
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
1D Decision Regions
FIGURE 2.13. Non-simply connected decision regions can arise in one dimension for Gaussians having unequal variance. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
Arbitrary Covariances/Various Priors
FIGURE 2.14. Arbitrary Gaussian distributions lead to Bayes decision boundaries that are general hyperquadrics. Conversely, given any hyperquadric, one can find two Gaussian distributions whose Bayes decision boundary is that hyperquadric. These variances are indicated by the contours of constant probability density. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Statistical Classifiers
Discriminant Functions and Decision Surfaces
Four Categories
FIGURE 2.16. The decision regions for four normal distributions. Even with such a low number of categories, the shapes of the boundary regions can be rather complex. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Statistical Classifiers
Maximum–Likelihood Estimation
Parameter Estimation
Estimation of Prior and Class–Conditional Density: Training Data, Parameterized Model (Density)
Adapt Density Parameters to Training Data
Parameter Vector ~θj, p(~x|ωj) = p(~x|ωj, ~θj)
Training Samples Di for ωi, Di contains n samples ~x1 . . . ~xn; independent draws from p(~x|~θ) (generating Di)
Likelihood p(D|~θ) = Π_{k=1}^{n} p(~xk|~θ)
UV Pattern Recognition I
Statistical Classifiers
Maximum–Likelihood Estimation
Gaussian Likelihood
FIGURE 3.1. The top graph shows several training points in one dimension, known or assumed to be drawn from a Gaussian of a particular variance, but unknown mean. Four of the infinite number of candidate source distributions are shown in dashed lines. The middle figure shows the likelihood p(D|θ) as a function of the mean. If we had a very large number of training points, this likelihood would be very narrow. The value that maximizes the likelihood is marked θ̂; it also maximizes the logarithm of the likelihood, that is, the log-likelihood l(θ), shown at the bottom. Note that even though they look similar, the likelihood p(D|θ) is shown as a function of θ whereas the conditional density p(x|θ) is shown as a function of x. Furthermore, as a function of θ, the likelihood p(D|θ) is not a probability density function and its area has no significance. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Statistical Classifiers
Maximum–Likelihood Estimation
Maximum Likelihood
Maximize Likelihood p(D|~θ)→ Estimate ~θ
Log–Likelihood Function l(~θ) ≡ ln p(D|~θ), θ̂ = arg max l(~θ)
Maximize l(~θ) = Σ_{k=1}^{n} ln p(~xk|~θ), Gradient ∇_~θ l = ~0
UV Pattern Recognition I
Statistical Classifiers
Maximum–Likelihood Estimation
The Gaussian Case
Normal Density with unknown ~µ: ln p(~xk|~µ) = −(1/2) ln[(2π)^d |Σ|] − (1/2)(~xk − ~µ)^t Σ^{-1} (~xk − ~µ)
Gradient ∇_~µ ln p(~xk|~µ) = Σ^{-1}(~xk − ~µ)
Maximize ∇_~µ l = ~0 → Σ_{k=1}^{n} Σ^{-1}(~xk − ~µ) = ~0
Estimate ~µ = (1/n) Σ_{k=1}^{n} ~xk
Unknown ~µ and Σ: ~µ = (1/n) Σ_{k=1}^{n} ~xk, Σ = (1/n) Σ_{k=1}^{n} (~xk − ~µ)(~xk − ~µ)^t
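The closed-form ML estimates are one line each in code; the sketch draws an assumed Gaussian sample and computes the estimates of ~µ and Σ (the biased 1/n version from the slide).

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.5], [0.5, 1.0]], size=500)  # n x d sample (assumed source)

mu_hat = X.mean(axis=0)                      # (1/n) sum_k x_k
diff = X - mu_hat
Sigma_hat = diff.T @ diff / len(X)           # (1/n) sum_k (x_k - mu_hat)(x_k - mu_hat)^t
print(mu_hat, Sigma_hat)
```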
UV Pattern Recognition I
Statistical Classifiers
Maximum–Likelihood Estimation
Biased Estimates
Univariate Case σ̂^2 = (1/n) Σ_{k=1}^{n} (xk − µ̂)^2
E(σ̂^2) = ((n−1)/n) σ^2 ≠ σ^2, σ̂^2 is asymptotically unbiased
s^2 = (1/(n−1)) Σ_{k=1}^{n} (xk − µ̂)^2, s^2 is absolutely unbiased
UV Pattern Recognition I
Statistical Classifiers
Maximum–Likelihood Estimation
Classification Errors
Bayes Error: overlapping densities, inherent problem property
Model Error: incorrect model
Estimation Error: finite sample of training data
UV Pattern Recognition I
Statistical Classifiers
Maximum–Likelihood Estimation
Bayes Error and Dimensionality
FIGURE 3.3. Two three-dimensional distributions have nonoverlapping densities, and thus in three dimensions the Bayes error vanishes. When projected to a subspace (here, the two-dimensional x1-x2 subspace or a one-dimensional x1 subspace), there can be greater overlap of the projected distributions, and hence greater Bayes error. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Statistical Classifiers
Component Analysis
Principal Component Analysis
PCA is Dimensionality Reduction by Projection to the Best Data Representation
Representing Vector ~x0 of ~x1 . . . ~xn? Minimize J0(~x0) = Σ_{k=1}^{n} ||~x0 − ~xk||^2 → ~x0 = ~m = (1/n) Σ_{k=1}^{n} ~xk (sample mean, zero–dimensional representation)
One–dimensional Representation ~x = ~m + a~e (~e is a unit vector); Minimize J1(a1, . . . , an, ~e) = Σ_{k=1}^{n} ||(~m + ak~e) − ~xk||^2 → ak = ~e^t(~xk − ~m) (projection of ~xk onto ~e)
UV Pattern Recognition I
Statistical Classifiers
Component Analysis
Principal Component
Best Direction of ~e?
Minimize J1(~e) = −~e^t S ~e + Σ_{k=1}^{n} ||~xk − ~m||^2, with S = Σ_{k=1}^{n} (~xk − ~m)(~xk − ~m)^t (scatter matrix)
Maximize ~e^t S ~e with constraint ||~e|| = 1; Lagrange Multipliers u = ~e^t S ~e − λ(~e^t~e − 1)
Gradient ∂u/∂~e = 2S~e − 2λ~e → S~e = λ~e → ~e^t S ~e = λ ~e^t~e = λ
Principal Component is the Eigenvector with the Largest Eigenvalue of the Scatter Matrix
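A sketch of the result on assumed 2-D data: build the scatter matrix S, take the eigenvector with the largest eigenvalue as the principal component, and project the centered samples onto it.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal([0.0, 0.0], [[3.0, 1.2], [1.2, 1.0]], size=300)  # assumed n x d data

m = X.mean(axis=0)                          # sample mean (zero-dimensional representation)
D = X - m
S = D.T @ D                                 # scatter matrix sum_k (x_k - m)(x_k - m)^t
eigvals, eigvecs = np.linalg.eigh(S)        # eigendecomposition of the symmetric S
e = eigvecs[:, np.argmax(eigvals)]          # principal component: largest eigenvalue
a = D @ e                                   # coefficients a_k = e^t (x_k - m)
print(e, a[:5])
```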
UV Pattern Recognition I
Statistical Classifiers
Component Analysis
Fisher Linear Discriminant
PCA components represent the data well, but which components discriminate between classes?
Two classes, project samples onto ~w
Sample mean ~mi = (1/ni) Σ_{~x∈Di} ~x
~w^t ~mi = (1/ni) Σ_{~x∈Di} ~w^t~x = (1/ni) Σ_{y∈Yi} y = mi
Distance of means |m1 − m2| = |~w^t(~m1 − ~m2)|
Maximization is trivial (by scaling ~w)
UV Pattern Recognition I
Statistical Classifiers
Component Analysis
Linear Discriminant Projection
FIGURE 3.5. Projection of the same set of samples onto two different lines in the directions marked w. The figure on the right shows greater separation between the red and black projected points. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Statistical Classifiers
Component Analysis
Fisher Criterion
Sample scatter for each class si^2 = Σ_{y∈Yi} (y − mi)^2
Total within–class scatter s1^2 + s2^2
Fisher criterion J(~w) = |m1 − m2|^2 / (s1^2 + s2^2) (independent of |~w|, maximize)
Scatter matrices Si = Σ_{~x∈Di} (~x − ~mi)(~x − ~mi)^t
Within–class scatter SW = S1 + S2
Between–class scatter SB = (~m1 − ~m2)(~m1 − ~m2)^t
Fisher criterion J(~w) = (~w^t SB ~w) / (~w^t SW ~w) (generalized Rayleigh quotient, maximize)
UV Pattern Recognition I
Statistical Classifiers
Component Analysis
Linear Discriminant
J(~w) → max if SB ~w = λ SW ~w, i.e., SW^{-1} SB ~w = λ~w; since SB ~w is parallel to (~m1 − ~m2) → ~w = SW^{-1}(~m1 − ~m2) (Fisher's Linear Discriminant)
Remaining problem: find threshold
Assume classes with normal densities and equal covariance matrix Σ; recall the optimal decision boundary ~w^t~x + w0 = 0 with ~w = Σ^{-1}(~µ1 − ~µ2) (sample means and covariances yield the direction of Fisher's Linear Discriminant)
General method: fit the projected data to a univariate Gaussian; the threshold w0 is at equal posteriors
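A sketch of Fisher's linear discriminant on two assumed sample sets: compute the class means and scatter matrices, form SW, and solve SW ~w = ~m1 − ~m2 for the projection direction.

```python
import numpy as np

rng = np.random.default_rng(3)
X1 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.4], [0.4, 1.0]], size=100)  # assumed class 1 samples
X2 = rng.multivariate_normal([2.5, 1.5], [[1.0, 0.4], [0.4, 1.0]], size=100)  # assumed class 2 samples

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - m1).T @ (X1 - m1)                 # class scatter matrices
S2 = (X2 - m2).T @ (X2 - m2)
SW = S1 + S2                                 # within-class scatter
w = np.linalg.solve(SW, m1 - m2)             # w = SW^{-1}(m1 - m2)

y1, y2 = X1 @ w, X2 @ w                      # projected (1-D) samples
print(w, y1.mean(), y2.mean())
```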
UV Pattern Recognition I
Nonparametric Techniques
Density Estimation
Unknown Densities
Real problems: multi–modal; parametric densities: uni–modal → estimation of densities directly from data
P that pattern ~x falls in region R: P = ∫_R p(~x) d~x
n patterns, probability that k patterns are in R: Pk = (n choose k) P^k (1 − P)^{n−k}, E[k] = nP
Assuming a small region R → p(~x) ≈ const → ∫_R p(~x) d~x ≈ p(~x) V → p(~x) ≈ k/(nV)
UV Pattern Recognition I
Nonparametric Techniques
Density Estimation
Relative Probability
FIGURE 4.1. The relative probability an estimate given by Eq. 4 will yield a particular value for the probability density, here where the true probability was chosen to be 0.7. Each curve is labeled by the total number of patterns n sampled, and is scaled to give the same maximum (at the true probability). The form of each curve is binomial, as given by Eq. 2. For large n, such binomials peak strongly at the true probability. In the limit n → ∞, the curve approaches a delta function, and we are guaranteed that our estimate will give the true probability. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Nonparametric Techniques
Density Estimation
Sample Size
Estimate p(~x) ≈ k/(nV) is dependent on the size of V
If V → 0, p(~x) would be exact, but no more samples fall in V
Assuming an infinite pattern set with decreasing Vn
n–th estimate pn(~x) = kn/(n Vn)
For convergence of pn(~x) → p(~x): lim_{n→∞} Vn = 0, lim_{n→∞} kn = ∞, lim_{n→∞} kn/n = 0
Decreasing Vn, e.g., Vn = 1/√n → Parzen Windows
Increasing kn, e.g., kn = √n → kn–Nearest–Neighbors
UV Pattern Recognition I
Nonparametric Techniques
Density Estimation
Point Density Estimation
FIGURE 4.2. There are two leading methods for estimating the density at a point, here at the center of each square. The one shown in the top row is to start with a large volume centered on the test point and shrink it according to a function such as Vn = 1/√n. The other method, shown in the bottom row, is to decrease the volume in a data-dependent way, for instance letting the volume enclose some number kn = √n of sample points. The sequences in both cases represent random variables that generally converge and allow the true density at the test point to be calculated. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Nonparametric Techniques
Parzen Windows
Window Function
Region Rn is a hypercube with Vn = hn^d
Window function ϕ(~u) = 1 if |uj| ≤ 1/2 (j = 1, . . . , d), 0 otherwise (unit hypercube)
Number of samples ~xi in the hypercube with side length hn: kn = Σ_{i=1}^{n} ϕ((~x − ~xi)/hn)
Density estimate pn(~x) = (1/n) Σ_{i=1}^{n} (1/Vn) ϕ((~x − ~xi)/hn)
Generalization: ϕ(.) is an arbitrary density, pn(~x) is an interpolated density
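A sketch of the estimate with the unit-hypercube window defined above, on assumed 1-D samples; looping over h shows how the window width trades smoothness against resolution.

```python
import numpy as np

def phi(u):
    """Unit hypercube window: 1 if every |u_j| <= 1/2, else 0."""
    return np.all(np.abs(u) <= 0.5, axis=-1).astype(float)

def parzen_estimate(x, samples, h):
    """p_n(x) = (1/n) sum_i (1/V_n) phi((x - x_i)/h) with V_n = h^d."""
    n, d = samples.shape
    return phi((x - samples) / h).sum() / (n * h ** d)

rng = np.random.default_rng(4)
data = rng.normal(0.0, 1.0, size=(1000, 1))                 # assumed samples from N(0, 1)
for h in (1.0, 0.5, 0.2):
    print(h, parzen_estimate(np.array([0.0]), data, h))     # estimate at x = 0; true density there is ~0.399
```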
UV Pattern Recognition I
Nonparametric Techniques
Parzen Windows
Window Width
Effect of window width on pn(~x)
With δn(~x) = (1/Vn) ϕ(~x/hn): pn(~x) = (1/n) Σ_{i=1}^{n} δn(~x − ~xi)
hn large → δn(~x) has small amplitude, slowly changing, smooth
hn small → δn(~x) is a sharp pulse (Dirac) at each pattern
UV Pattern Recognition I
Nonparametric Techniques
Parzen Windows
Parzen Windows with Varying Width
[Panels: h = 1, 0.5, 0.2.]
FIGURE 4.3. Examples of two-dimensional circularly symmetric normal Parzen windows for three different values of h. Note that because the δ(x) are normalized, different vertical scales must be used to show their structure. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Nonparametric Techniques
Parzen Windows
Parzen Density Estimates
[Panels: h = 1, 0.5, 0.2.]
FIGURE 4.4. Three Parzen-window density estimates based on the same set of five samples, using the window functions in Fig. 4.3. As before, the vertical axes have been scaled to show the structure of each distribution. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Nonparametric Techniques
Parzen Windows
Convergence of Parzen Densities
Parzen density pn(~x) is a random variable dependent on the sample set of size n
Estimate pn(~x) converges to the true p(~x) if lim_{n→∞} p̄n(~x) = p(~x) and lim_{n→∞} σn^2(~x) = 0
Convergence of the mean: p̄n(~x) = E[pn(~x)] = (1/n) Σ_{i=1}^{n} E[δn(~x − ~xi)] = ∫ δn(~x − ~v) p(~v) d~v (convolution)
p̄n(~x) is a blurred p(~x); if lim_{n→∞} Vn = 0 → δn(.) becomes a Dirac pulse and the convolution yields the true p(~x)
UV Pattern Recognition I
Nonparametric Techniques
Parzen Windows
Convergence of Variance
Mean p̄n(~x) = p(~x) also for finite sample size if the window is a Dirac pulse, but the estimate is spiky with high variance
Variance σn^2(~x) = Σ_{i=1}^{n} E[((1/n) δn(~x − ~xi) − (1/n) p̄n(~x))^2] = (1/n)(E[δn^2(~x − ~xi)] − p̄n(~x)^2) = (1/n)(∫ δn^2(~x − ~v) p(~v) d~v − p̄n(~x)^2)
σn^2(~x) ≤ sup δn(.) p̄n(~x)/n = sup ϕ(.) p̄n(~x)/(n Vn)
If lim_{n→∞} n Vn = ∞ → lim_{n→∞} σn^2(~x) = 0
Vn may still approach 0
UV Pattern Recognition I
Nonparametric Techniques
Parzen Windows
Finite Samples
Finite samples: choosing ϕ(.) and Vn?
N(0, 1) window function ϕ(u) = (1/√(2π)) e^{−u^2/2}
Average of normal densities pn(x) = (1/n) Σ_{i=1}^{n} (1/hn) ϕ((x − xi)/hn)
UV Pattern Recognition I
Nonparametric Techniques
Parzen Windows
Parzen Estimates of 1D Normal Density
[Panels: window widths h1 = 1, 0.5, 0.1; sample sizes n = 1, 10, 100, ∞.]
FIGURE 4.5. Parzen-window estimates of a univariate normal density using different window widths and numbers of samples. The vertical axes have been scaled to best show the structure in each graph. Note particularly that the n = ∞ estimates are the same (and match the true density function), regardless of window width. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Nonparametric Techniques
Parzen Windows
Parzen Estimates of 2D Normal Density
[Panels: window widths h1 = 2, 1, 0.5; sample sizes n = 1, 100, 1000, ∞.]
FIGURE 4.6. Parzen-window estimates of a bivariate normal density using different window widths and numbers of samples. The vertical axes have been scaled to best show the structure in each graph. Note particularly that the n = ∞ estimates are the same (and match the true distribution), regardless of window width. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Nonparametric Techniques
Parzen Windows
Parzen Estimates of 1D Bimodal Density
[Panels: window widths h1 = 1, 0.5, 0.2; sample sizes n = 1, 16, 256, ∞.]
FIGURE 4.7. Parzen-window estimates of a bimodal distribution using different window widths and numbers of samples. Note particularly that the n = ∞ estimates are the same (and match the true distribution), regardless of window width. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.
UV Pattern Recognition I
Nonparametric Techniques
Parzen Windows
Decision Boundaries of 2D Parzen Dichotomizer
FIGURE 4.8. The decision boundaries in a two-dimensional Parzen-window dichotomizer depend on the window width h. At the left a small h leads to boundaries that are more complicated than for large h on the same data set, shown at the right. Apparently, for these data a small h would be appropriate for the upper region, while a large h would be appropriate for the lower region; no single window width is ideal overall. From: Duda, Hart, and Stork, Pattern Classification, © 2001 John Wiley & Sons, Inc.